# Classifying Reddit Autos Selfposts<div class="tocSkip">
    
&copy; Jens Albrecht, 2021
    
This notebook can be freely copied and modified.  
Attribution, however, is highly appreciated.

<hr/>

See also: 

Albrecht, Ramachandran, Winkler: **Blueprints for Text Analytics in Python** (O'Reilly 2020)  
Chapter 6: [Text Classification Algorithms](https://learning.oreilly.com/library/view/blueprints-for-text/9781492074076/ch06.html#ch-classification) + [Link to Github](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/README.md)

## Setup<div class='tocSkip'/>

Set directory locations. If working on Google Colab: copy files and install required libraries.

In [None]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/jsalbr/tdwi-2021-text-mining/raw/main'
    os.system(f'wget {GIT_ROOT}/notebooks/setup.py')

%run -i setup.py

## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [None]:
%run "$BASE_DIR/notebooks/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# to print output of all statements and not just the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# otherwise text between $ signs will be interpreted as formula and printed in italic
pd.set_option('display.html.use_mathjax', False)
pd.options.plotting.backend = "matplotlib"

# path to import blueprints packages
sys.path.append(f'{BASE_DIR}/packages')

## Preparing Data for Machine Learning

### Load Data

In [None]:
df = pd.read_csv(f"{BASE_DIR}/data/reddit-autos-selfposts-prepared.csv", sep=";", decimal=".")

len(df)

In [None]:
# set display column width unlimited to show full text
# (None replaces the former -1, which recent pandas versions reject)
pd.set_option('display.max_colwidth', None)

df.sample(5)

# reset display column width to its default
pd.reset_option('display.max_colwidth')

### Define Label

Store the label in a variable to make modifications easier.

In [None]:
label = 'subreddit'

In [None]:
df[label].value_counts().to_frame()

In [None]:
df[label].value_counts().plot(kind='barh').invert_yaxis()

### Vectorization

Here we use scikit-learn's `CountVectorizer` and `TfidfTransformer` for bag-of-words vectorization, i.e. creating the TF or TF-IDF matrix.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn-feature-extraction-text-countvectorizer
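
To make the two matrices concrete, here is a minimal sketch on three made-up, already lemmatized documents. The documents and all `toy_` names are assumptions for illustration and not part of the Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# hypothetical, already lemmatized toy documents (not from the Reddit data)
toy_docs = [
    "car engine make noise",
    "engine oil change",
    "car battery charge slow",
]

# whitespace tokenization, as used for a pre-tokenized lemma column
toy_vect = CountVectorizer(tokenizer=str.split)
toy_tf = toy_vect.fit_transform(toy_docs)              # sparse matrix of raw counts

# reweight the counts by inverse document frequency, rows are L2-normalized
toy_tfidf = TfidfTransformer().fit_transform(toy_tf)

print(sorted(toy_vect.vocabulary_))
print(toy_tf.toarray())
print(toy_tfidf.toarray().round(2))
```

Terms that occur in several documents, like `car` or `engine`, receive a lower TF-IDF weight than terms that are unique to one document.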

In [None]:
# choose text column for vectorization
text_col = 'lemmas'

#### Term-Frequency Matrix (Counts)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# learn vocabulary for all data
count_vect = CountVectorizer(ngram_range=(1, 1), 
                             min_df=1, 
                             max_df=1.0, 
                             lowercase=True,
                             stop_words=None,
                             tokenizer=str.split)

# vectorize the chosen text column
# (alternatively, set text_col to a column with only nouns or nouns+adjs+verbs)
X_tf = count_vect.fit_transform(df[text_col])

type(X_tf)
X_tf.shape

Optional: Play with hyperparameters, e.g.

  * `ngram_range=(1, 2)` to include bigrams
  * `max_df=0.5` to exclude words occurring in more than 50% of the documents
  * `min_df=2` to include only words occurring in at least 2 documents
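
The effect of these three parameters on the vocabulary can be sketched on a tiny corpus. The documents and the helper `toy_vocab` are hypothetical and only serve as an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical lemmatized toy documents
toy_docs = [
    "car engine noise",
    "car engine oil",
    "car battery charge",
    "car tire pressure",
]

def toy_vocab(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    vect = CountVectorizer(tokenizer=str.split, **params)
    vect.fit(toy_docs)
    return sorted(vect.vocabulary_)

print(len(toy_vocab()))                     # unigrams only
print(len(toy_vocab(ngram_range=(1, 2))))   # unigrams plus bigrams
print(toy_vocab(max_df=0.5))                # drops "car", which is in every document
print(toy_vocab(min_df=2))                  # keeps only terms found in at least 2 documents
```

Bigrams enlarge the vocabulary considerably, while `min_df` and `max_df` shrink it from the rare and the frequent end, respectively.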

#### TF-IDF-Matrix

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_vect = TfidfTransformer()

X_tfidf = tfidf_vect.fit_transform(X_tf)

X_tfidf.shape

### Train-Test-Split

Choose data matrix `X` and label vector `y` for training:

In [None]:
# alternatively: X = X_tf
X = X_tfidf

# define label vector
y = df[label]

Now split with `train_test_split()`.

Recommendation: use `stratify=y`
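
Why stratification is recommended can be seen in a small sketch with imbalanced toy labels. The data and the `toy_` names are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 80 x "a", 20 x "b"
toy_y = ["a"] * 80 + ["b"] * 20
toy_X = list(range(len(toy_y)))

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=43)

# both parts keep the 80:20 class ratio of the full data
print(Counter(toy_y_train))
print(Counter(toy_y_test))
```

Without `stratify`, a rare class can end up over- or under-represented in the test set just by chance.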

In [None]:
from sklearn.model_selection import train_test_split

# define holdout
test_size = 0.2

if test_size > 0.0:
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=test_size,
                                                        stratify=y,
                                                        random_state=43
                                                       )
else:
    # no holdout: train on all data
    X_train, X_test, y_train, y_test = X, None, y, None

print("Training matrix:", X_train.shape)
if X_test is not None:
    print("Test matrix:    ", X_test.shape)

Store information about train/test records in data frame.

In [None]:
# mark each record as 'Test' or 'Train'; np.where works positionally,
# so the result does not depend on the index of df
df['train_test'] = np.where(df.index.isin(y_test.index), 'Test', 'Train')

In [None]:
df['train_test'].value_counts()

Stratification enforces the 80:20 split within each class as well:

In [None]:
df[[label, 'train_test']].pivot_table(index=label, columns='train_test', aggfunc=len, fill_value=0)

## Training and Evaluation


### Support Vector Machine

We train a Support Vector Machine, which works very well on TF-IDF vectors. The fastest implementation is `LinearSVC`, but it supports only linear kernels. Alternatively, use `SGDClassifier`.
To use a different classifier, uncomment the respective line or assign another estimator such as `LogisticRegression` to `clf`.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
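
Because all these estimators share the same `fit`/`predict` interface, swapping them is a one-line change. The following sketch compares three of them on synthetic data, which is only a stand-in for the TF-IDF matrix; the `toy_` names and the data set are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# synthetic stand-in for the TF-IDF matrix (assumption, not the Reddit data)
toy_X, toy_y = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=42)

toy_candidates = {
    "LinearSVC": LinearSVC(C=1.0),
    "SGD (hinge loss)": SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

toy_scores = {}
for name, model in toy_candidates.items():
    model.fit(toy_X, toy_y)                        # identical call for every estimator
    toy_scores[name] = model.score(toy_X, toy_y)   # accuracy on the training data
    print(f"{name:20s} train accuracy: {toy_scores[name]:.3f}")
```

The scores on this synthetic data say nothing about the Reddit posts; the point is only that the surrounding code stays unchanged.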

In [None]:
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier, LogisticRegression

print(f'Training on column {label}')

clf = LinearSVC(C=1.0)
# clf = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)

clf.fit(X_train, y_train);

print("Done.")

Extremely fast, right!?

### Evaluation

Apply classifier to test data with `predict()`.

In [None]:
from sklearn.metrics import accuracy_score

y_test_pred = clf.predict(X_test)
y_train_pred = clf.predict(X_train)

print(f"Classifier: {clf.__class__}\n")

print('Accuracy Summary')
print('================')

print(f'Test:    {accuracy_score(y_test, y_test_pred)*100:6.2f}%')
print(f'Train:   {accuracy_score(y_train, y_train_pred)*100:6.2f}%')

Not bad for a 12-class classifier!

$$\text{Accuracy} = \frac{\text{number of correctly classified data points}}{\text{all data points}}$$

In [None]:
sum(y_test == y_test_pred)/len(y_test)

Looking at the per-class metrics with `classification_report`:

In [None]:
from sklearn.metrics import classification_report

print("Classification Report")
print("=====================")
print(classification_report(y_true=y_test, y_pred=y_test_pred))
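
How the per-class columns of the report come about can be checked on a handful of made-up predictions. The labels below are hypothetical; `precision_recall_fscore_support` returns the same numbers that `classification_report` prints.

```python
from sklearn.metrics import precision_recall_fscore_support

# hypothetical true and predicted labels
toy_true = ["BMW", "BMW", "BMW", "Audi", "Audi", "Toyota"]
toy_pred = ["BMW", "BMW", "Audi", "Audi", "BMW", "Toyota"]
toy_labels = ["Audi", "BMW", "Toyota"]

precision, recall, f1, support = precision_recall_fscore_support(
    toy_true, toy_pred, labels=toy_labels, zero_division=0)

# precision: share of correct predictions among all predictions of a class
# recall:    share of correctly found records among all records of a class
for name, p, r, f, s in zip(toy_labels, precision, recall, f1, support):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

For "Audi", one of two predictions is correct (precision 0.5) and one of two actual records is found (recall 0.5).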

And on the training data:

In [None]:
print("Classification Report")
print("=====================")
print(classification_report(y_true=y_train, y_pred=y_train_pred))

### Confusion Matrix

Visualizing the `confusion_matrix` with `sns.heatmap`.

Not surprisingly, the generic category "AskMechanics" is frequently confused.
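
As a reading aid, here is a minimal sketch of a confusion matrix for six made-up predictions. The label values are hypothetical; rows are actual classes, columns are predicted classes, both in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels
toy_true = ["AskMechanics", "AskMechanics", "AskMechanics", "Toyota", "Toyota", "teslamotors"]
toy_pred = ["AskMechanics", "Toyota", "teslamotors", "Toyota", "AskMechanics", "teslamotors"]

# sorted() puts uppercase names first: AskMechanics, Toyota, teslamotors
toy_labels = sorted(set(toy_true))

toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(toy_cm)

# misclassified records per actual class = row sum minus diagonal
toy_misclassified = toy_cm.sum(axis=1) - toy_cm.diagonal()
print(dict(zip(toy_labels, toy_misclassified.tolist())))
```

The diagonal holds the correct predictions; everything off the diagonal is a confusion between two classes.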

In [None]:
from sklearn.metrics import confusion_matrix

# label names - specifies order in confusion matrix
label_names = sorted(y_test.unique())

# scale figure size depending on number of categories
fsize = len(label_names)/2
sns.set(font_scale=1)

conf_mat = confusion_matrix(y_test, y_test_pred, labels=label_names)

_ = fig, ax = plt.subplots(figsize=(fsize, fsize))
_ = sns.heatmap(conf_mat, annot=True, fmt="d", cmap="Blues", cbar=False, 
                xticklabels=label_names, yticklabels=label_names)
_ = plt.ylabel("Actual")
_ = plt.xlabel("Predicted")
_ = ax.set_title(f"Confusion Matrix for {label}", fontsize=14)

sns.set(font_scale=1)

### Checking Misclassified Data

Looking at samples of misclassified and correctly classified data.

Add the predictions to the dataframe to simplify the analysis:

In [None]:
# transform prediction vectors to pandas series with correct indexes
y_test_pred = pd.Series(y_test_pred, index=y_test.index)
y_train_pred = pd.Series(y_train_pred, index=y_train.index)

df['pred'] = pd.concat([y_test_pred, y_train_pred])

Check sample of misclassified data:

In [None]:
# adjust size of visible columns
pd.set_option('max_colwidth', 3000)

df.query(f'train_test=="Test" and {label}!=pred')[[label, 'pred', 'text', text_col]].sample(5)

Check sample of correctly classified data:

In [None]:
df.query(f'train_test=="Test" and {label}==pred')[[label, 'pred', 'text', text_col]].sample(5)

### Save DataFrame with Predictions

In [None]:
df.to_csv("reddit-autos-selfposts-predicted.csv", sep=";", decimal=".", index=False)

## Explaining the Classifier

### Measuring Feature Importance

The coefficients of the SVM can be used to display the feature (=word) importance per class, positively and negatively.

In [None]:
def plot_coefficients(classifier, vect, top_features=20):

    # get the feature names from the vectorizer
    # (scikit-learn >= 1.0; older versions use get_feature_names())
    feature_names = np.array(vect.get_feature_names_out())

    # iterate over the classes of the classifier passed in, not the global clf
    for i, category in enumerate(classifier.classes_):

        # get class coefficients
        coef = classifier.coef_[i]

        # get the top and worst features
        top_pos_coefs = np.argsort(coef)[-top_features:]
        top_neg_coefs = np.argsort(coef)[:top_features]
        top_coefs = np.hstack([top_neg_coefs, top_pos_coefs])[::-1]

        # create plot
        plt.figure(figsize=(10, 5))
        plt.title(f'Coefficients for category "{category}"')
        colors = ['xkcd:green' if c > 0 else 'xkcd:red' for c in coef[top_coefs]]
        plt.bar(np.arange(2*top_features), coef[top_coefs], color=colors)

        # label each bar with its feature (=word)
        plt.xticks(np.arange(0, 2 * top_features), feature_names[top_coefs], rotation=90, ha='center')
        plt.grid(linestyle='dashed')
        plt.tight_layout()
        plt.show()

In [None]:
plot_coefficients(clf, count_vect)

### Classifier Explanation with LIME

We need a classifier with prediction probabilities. `LinearSVC` does not yield these. We use Logistic Regression instead.
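
The difference is easy to verify. The sketch below checks which estimators expose `predict_proba` and shows its output on synthetic data; the data set and the `toy_` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# LinearSVC only offers decision_function(), logistic regression also predict_proba()
print(hasattr(LinearSVC(), "predict_proba"))
print(hasattr(LogisticRegression(), "predict_proba"))

# synthetic 3-class data as a stand-in for the TF-IDF matrix
toy_X, toy_y = make_classification(n_samples=200, n_features=10, n_informative=4,
                                   n_classes=3, random_state=42)

toy_logreg = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# one row per sample, one column per class; each row sums to 1
toy_proba = toy_logreg.predict_proba(toy_X[:2])
print(toy_proba.round(3))
print(toy_proba.sum(axis=1))
```

LIME perturbs the input text and needs exactly such a probability matrix for every perturbed sample.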

In [None]:
from sklearn.linear_model import SGDClassifier

print(f'Training on column {label}')

# log loss gives logistic regression
# ('log_loss' in scikit-learn >= 1.1; older versions use loss='log')
clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)

clf.fit(X_train, y_train);

print("Done.")

In [None]:
from sklearn.metrics import accuracy_score

y_test_pred = clf.predict(X_test)
y_train_pred = clf.predict(X_train)

print(f"Classifier: {clf.__class__}\n")

print('Accuracy Summary')
print('================')

print(f'Test:    {accuracy_score(y_test, y_test_pred)*100:6.2f}%')
print(f'Train:   {accuracy_score(y_train, y_train_pred)*100:6.2f}%')

In [None]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(count_vect, tfidf_vect, clf)

In [None]:
# use lemmas only here, because model is trained on lemmas
samples = [
    "BMW great", 
    "Electric charge take long"
]

pred = pipeline.predict_proba(samples)
pred

In [None]:
pd.options.display.float_format = '{:.4f}'.format

In [None]:
columns = [f"Sample {i+1}" for i in range(pred.shape[0])]

pred_df = pd.DataFrame(pred.T, index=clf.classes_, columns=columns)
pred_df

In [None]:
# adjust size of visible columns
pd.set_option('max_colwidth', 3000)

#df.query(f'train_test=="Test" and {label}!=pred')[[label, 'pred', 'text', text_col]].sample(5)

Example: Predicted is "teslamotors", but correct is "Toyota"

In [None]:
df.iloc[7468].to_frame()

In [None]:
text = df.iloc[7468]['text']

In [None]:
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=clf.classes_)

exp = explainer.explain_instance(text, pipeline.predict_proba, num_features=6, top_labels=3)

print([exp.class_names[i] for i in exp.available_labels()])

In [None]:
exp.show_in_notebook(text=False)

## End <div class="tocSkip">