# Feature Engineering and Classification
## Summary
This notebook uses the [10000-books-and-their-genres-standardized](https://www.kaggle.com/michaelrussell4/10000-books-and-their-genres-standardized) dataset. This data includes around 10,000 full-length books and their associated genres obtained from GoodReads. 

In this notebook, I clean and preprocess the text of each book in the dataset, filter out books with insufficient length or genre tags, and convert the text with several different feature-engineering techniques. I then fit several machine-learning algorithms to each of the resulting datasets and compare the results, with the intention of evaluating how each feature-engineering technique affects a machine-learning algorithm's performance in predicting multi-label targets.

The overall objective of this notebook is to compare two common methods for feature engineering text against several novel feature-engineering methods. The task is unique in this objective as well as in the endeavor to create multi-label classifiers on such a dataset, i.e., predict one or more genres for a given text where, in this case, the text is an entire book! Following are descriptions of the feature engineering methods referred to:
1. [spaCy's](https://spacy.io/) document vectorization.
- spaCy creates a vector for a word or document based on a pretrained language model. In my case, I'll be using the `en_core_web_lg` model, whose name breaks down to an English, web-data-trained, large model. The vector attempts to encapsulate the meaning of the word or document, so words whose meanings are related, like 'dog' and 'canine', have similar vectors. spaCy is an incredible tool in this regard and makes such vectorization easy to implement.
2. [Sklearn's TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- Scikit-learn, or Sklearn, provides a class for transforming preprocessed text into a term frequency-inverse document frequency (TF-IDF) vector. For more information on this method, see https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
3. Novel techniques
- Part of speech relative frequencies.
- Reading complexity scores obtained via 3 well-established algorithms implemented through [textstat](https://pypi.org/project/textstat/).
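
To make the TF-IDF idea in item 2 concrete, here is a minimal sketch of the textbook formula (term frequency times the log of inverse document frequency) in plain Python. The function name `tf_idf` and the toy corpus are made up for illustration. Scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each vector, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # Share of the document taken by the term, times log inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical three-"book" corpus, already tokenized
toy_corpus = [
    ["dragon", "sword", "quest", "dragon"],
    ["love", "letter", "quest"],
    ["love", "sword", "war"],
]
toy_weights = tf_idf(toy_corpus)
```

In the first toy document, 'dragon' gets a higher weight than 'quest' because it occurs twice there and in no other document, while 'quest' also appears in the second document.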


*I'd like to thank the following for their contributions to the libraries used in this notebook.*
- [Scikit-learn](https://scikit-learn.org/)
- [spaCy](https://spacy.io/)

<hr>

# 1. Import libraries and dataset
I'll be using several libraries quite frequently so I'll import them below. I also will import the 10000-books-and-their-genres-standardized dataset mentioned in the notebook summary.

In [None]:
pip install textstat

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import spacy
nlp = spacy.load('en_core_web_lg')
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
nlp_sm = spacy.load('en_core_web_sm')
import textstat as ts
vocab = list(nlp_sm.vocab.strings)
import seaborn as sns
import gc
import re

In [None]:
%%time
df = pd.read_csv('../input/10000-books-and-their-genres-standardized/books_and_genres.csv', index_col=0)

Here is a sample of the dataset:

In [None]:
df.head()

Because the dataset is over 3 GB, I only use a portion while testing and developing. Comment out the line below when running on the whole dataset.

In [None]:
# Comment out this line to run on the whole dataset
df = df.iloc[:1750, :]

<hr>

# 2. Preprocessing Text and Feature Engineering

## Reformat book text

Some rows have text that, for whatever reason, is not a `str` type, so cast these to `str`.
spaCy's `nlp()` can only handle documents up to a certain size before the RAM requirements get too hefty. To avoid errors being thrown, limit the book size (represented in the `text` column) to 1,000,000 characters, which is also spaCy's default `nlp.max_length`. This is an ample amount since the average book has fewer than 500,000 characters in it (https://www.quora.com/How-many-characters-of-text-letters-are-in-an-average-book).

Also, discard books that have fewer than 1,000 characters, as they are too short to be useful.

In [None]:
df['text'] = df['text'].astype(str)
df['text_len'] = df.text.apply(len)
df.drop(df[df.text_len < 1000].index, inplace=True)
df.drop('text_len', axis=1, inplace=True)
max_char = 1000000 # spaCy can only allocate enough RAM for this many chars in a doc
df.text = df.text.apply(lambda t:t[:max_char])

In [None]:
df = df.reset_index()
df.head()

## Remove non-English books
Non-English books won't be processed correctly by our preprocessing algorithms.

In [None]:
pip install langdetect

In [None]:
from langdetect import detect
from langdetect import detect_langs

In [None]:
before_filter = len(df)

In [None]:
df['lang'] = df.text.apply(detect)
df.drop(df[df.lang != 'en'].index, inplace=True)

In [None]:
print(f'{before_filter - len(df)} books removed from the dataset for non-English language')

## Restructure genre data
The genres for each book were stored as a set before the data was exported to a `csv` file, and the export converted each set to its string representation. This code translates each string back into a list of genres for each row.

In [None]:
def reformatGenres(genres):
    l_genres = list(filter(None, re.split(r"[\{\}\,\s\']", genres)))
    # Combine history and historical tags since they refer to the same genre
    r_genres = list({e if e!='history' else 'historical' for e in l_genres})
    return r_genres

In [None]:
df.genres = df.loc[:, 'genres'].apply(reformatGenres)
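
As a quick illustration of the parsing step, here is a self-contained sketch of the same regex approach applied to one made-up row value. The name `parse_genres` and the sample string are hypothetical, and the result is sorted only so that the output order is stable. The sketch assumes that tags are single hyphenated tokens, since a tag containing a space or an apostrophe would be split apart by this pattern.

```python
import re

def parse_genres(raw):
    """Split a stringified set such as "{'a', 'b'}" into a sorted list of tags."""
    # Split on braces, commas, whitespace, and quotes, then drop the empty pieces
    tokens = filter(None, re.split(r"[{},\s']", raw))
    # Merge 'history' into 'historical', deduplicate, and sort for a stable order
    return sorted({"historical" if t == "history" else t for t in tokens})

sample = "{'history', 'fiction', 'historical', 'non-fiction'}"  # hypothetical row value
parsed = parse_genres(sample)
print(parsed)  # ['fiction', 'historical', 'non-fiction']
```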

## Reduce the number of genre tags
Multi-label classification problems are already very difficult. To help make this problem more feasible, I'll reduce the number of unique genre tags to a more moderate number.

Get all of the genre tags from every book in the dataset into a list.

In [None]:
genres = [g for r in df.genres for g in r]

Get the counts of how many times each genre appears in the dataset and plot the results.

In [None]:
genres_counts = pd.Series(genres).value_counts()
print(genres_counts)
genres_counts.plot(figsize=(15,5), title='Genre total count in dataset')
plt.show()

Based on the graph above, I keep the 10 most popular genre tags, almost all of which have over 2,000 occurrences in the dataset. This cleans up the genre tags and makes classification more feasible later on in the notebook.

__Note__: _Originally, I used code to obtain the 10 genre tags I would keep in the dataset, but after obtaining these via code, I hard-coded them back in with some manual selection modifications._

In [None]:
# This line was originally used but for future use and simplicity its result is hard coded
# [k for k, v in genres_counts.items() if v >= 2000]
popular_genres = ['fiction', 'classics', 'historical', '20th-century', 'non-fiction', 'literature', 'historical-fiction', 'romance', 'fantasy', 'adventure']
print(f'{len(popular_genres)} genres have over 2000 counts in dataset.\n\nThey are:\n\n{popular_genres}')

Now, for each example, remove genre tags not in the list shown above.

In [None]:
df.genres = df.genres.apply(lambda genres:[g for g in genres if g in popular_genres])
df['genre_count'] = df.genres.apply(len)
# Keep rows with at least one remaining tag; chaining avoids pandas' SettingWithCopyWarning
df = df[df.genre_count > 0].drop(columns='genre_count')

### Visuals of genre tag distribution
Here is a bar graph showing the new distribution of genres in the dataset. 

In [None]:
genres = [g for r in df.genres for g in r]
genres_counts = pd.Series(genres).value_counts()
print(genres_counts)
ax = genres_counts.plot(figsize=(15,5), kind='bar')
ax.set_title('Genre total count in dataset', fontdict={'fontsize':22}, pad=20)
plt.show()

Here is an alternative visualization of the distribution:

In [None]:
ax = genres_counts.plot(figsize=(20, 10), kind='pie', ylabel='')
ax.set_title('Genre count distribution pie chart', fontdict={'fontsize':22}, pad=20)
plt.show()

All genres outside the 10 selected above have now been removed, along with any entries left without a genre. Only 10 genres are now in use.

In [None]:
df.head()

## Preprocess and feature engineer the text
The `preprocess_text` function below takes the raw text of a book and performs the following preprocessing and feature-engineering tasks:

__Preprocessing__
_Most of the preprocessing steps are performed via [spaCy's](https://spacy.io/) `en_core_web_lg` pipeline (English, web-trained, large)._
- Lemmatize text
- Lowercase the text
- Strip excess spaces and carriage returns
- Remove punctuation
- Remove non-alpha characters
- Remove stop words
- Tokenize the text

__Feature Engineering__
- Vectorize the text via [spaCy's](https://spacy.io/) `doc.vector` attribute
- Vectorize the text via [Sklearn's TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- Compute reading complexity scores via [textstat](https://pypi.org/project/textstat/)
- Compute the relative part-of-speech frequencies via [spaCy's linguistic features](https://spacy.io/usage/linguistic-features)
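
For reference, the three reading complexity scores follow published formulas built from simple counts. The sketch below writes them out in plain Python. The function signatures and the counts are hypothetical, and `textstat` has its own rules for counting sentences and syllables and for rounding, so its output for a real text may not match a hand calculation exactly.

```python
import math

def flesch_reading_ease(words, sentences, syllables):
    # Higher scores mean easier text
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate U.S. school grade level
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_index(polysyllables, sentences):
    # Polysyllables are words with three or more syllables
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

# Hypothetical counts for a short passage
words, sentences, syllables, polysyllables = 100, 5, 150, 12
fre = flesch_reading_ease(words, sentences, syllables)
fkg = flesch_kincaid_grade(words, sentences, syllables)
smog = smog_index(polysyllables, sentences)
```

All three depend on sentence boundaries and syllable counts, which is why the scores have to be computed on the raw text rather than on the lemmatized, punctuation-free version.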


__Note__: _The preprocessing and feature-engineering steps are performed together to save a huge amount of memory. Doing it this way allows us to delete the full text for each row and retain only the preprocessed and feature-engineered representations of it. Otherwise, the `spaCy` `doc` objects alone would take more than 200 GB, all of which would have to be held in RAM. After attempting to solve this dilemma through various other methods, I resorted to this one. Additionally, the reading complexity scores from `textstat` take into account punctuation, capitalization, etc., so the input text for these algorithms needs to be raw book text, not preprocessed text. Combining the preprocessing and feature-engineering steps allows this to be done easily in one function._

@article{scikit-learn,
 title={Scikit-learn: Machine Learning in {P}ython},
 author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
 journal={Journal of Machine Learning Research},
 volume={12},
 pages={2825--2830},
 year={2011}
}

In [None]:
tfidf_vectorizer = TfidfVectorizer(vocabulary=vocab)
unique_pos = ['ADJ','ADP','ADV','AUX','CCONJ','DET','INTJ','NOUN','NUM','PART','PRON','PROPN','PUNCT','SCONJ','SPACE','SYM','VERB','X']
def preprocess_text(text):
    # reading complexity scores
    flesch_re = ts.flesch_reading_ease(text)
    flesch_k = ts.flesch_kincaid_grade(text)
    smog = ts.smog_index(text)
    
    # spacy text cleaning, tokenizing, etc. 
    doc = nlp(text)
    preprocessed_text = ' '.join([token.lemma_.strip().lower() for token in doc
                    if token.is_alpha and not(token.is_stop)])
    
    # spaCy vectorization
    doc_vector = nlp(preprocessed_text).vector
    
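    # Term frequency-inverse document frequency vectorization
    # Note: the vectorizer is fit on this single book, so every term that occurs gets the
    # same IDF weight and the result is an L2-normalized term-count vector over `vocab`
    tfidf_vector = tfidf_vectorizer.fit_transform([preprocessed_text])
    tfidf_vector = tfidf_vector.toarray()[0]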
    
    # Relative frequency of each part-of-speech tag in the original (unfiltered) doc
    pos_counts = pd.Series([token.pos_ for token in doc]).value_counts(normalize=True)
    # One value per tag in unique_pos; tags absent from this book get 0
    pos_cols = [pos_counts.get(k, 0) for k in unique_pos]
    
    # replace the text with nan to save space
    filler_text = np.nan
    
    ret_data = filler_text, flesch_re, flesch_k, smog, *pos_cols, doc_vector, tfidf_vector

    return ret_data
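
One property of the TF-IDF step above is worth spelling out: `fit_transform` is called on one book at a time, so the inverse-document-frequency part has only a single document to work with. The sketch below uses the smoothed IDF formula given in scikit-learn's documentation for its default `smooth_idf=True` setting. The helper name `smoothed_idf` and the document counts are made up for illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF as documented for scikit-learn's TfidfTransformer (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# One document: every term that occurs has doc_freq == n_docs == 1
single_doc_idf = smoothed_idf(n_docs=1, doc_freq=1)

# Three documents: a term found in one of them outweighs a term found in all three
rare_idf = smoothed_idf(n_docs=3, doc_freq=1)
common_idf = smoothed_idf(n_docs=3, doc_freq=3)

print(single_doc_idf)  # 1.0
```

With a single document the IDF factor is 1 for every term present, so the resulting vector reflects term counts only. Fitting the vectorizer once over all preprocessed texts would be one way to let rare terms carry more weight, at the cost of holding those texts in memory together.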

In [None]:
df.memory_usage(deep=True)

Call the `preprocess_text` function on the text data from our dataset.

In [None]:
processed_data = df.text.progress_apply(preprocess_text)

Create or replace columns in our dataframe with the results.

In [None]:
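# Rows were dropped earlier, so df's index has gaps; reuse it so the new columns align row by row
df[['text', 'flesch_reading_ease', 'flesch_kincaid_grade', 'smog_index', *unique_pos, 'doc_vectr', 'tfidf_vectr']] = pd.DataFrame([*processed_data], index=df.index)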

In [None]:
df.memory_usage(deep=True)

In [None]:
df.head()

Save the results in a pickled file for future use or reference.

In [None]:
import pickle

with open('output_df.p', 'wb') as f:
    pickle.dump(df, f)