In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset, logging
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

import matplotlib.pyplot as plt
# enabling inline plots in Jupyter
%matplotlib inline
# disabling verbose messages from dataset library
logging.set_verbosity_error()

  from .autonotebook import tqdm as notebook_tqdm


# Exercise: Classification II

In this exercise session, you will be using cross-validation to check the out-of-sample performance of different models for classifying movie review sentiment (using TF-IDF features as in the Classification I problem set). You will compare a logistic regression model to SVM and Naive Bayes. You will also use cross-validation for hyperparameter grid search.

Pro tip: As you will be fitting a lot of models in this exercise, why not take a look at how sklearn handles [parallelism](https://scikit-learn.org/stable/computing/parallelism.html#parallelism). A lot of method in the sklearn library take a parameter [n_jobs](https://scikit-learn.org/stable/glossary.html#term-n_jobs). By setting it to -1, you can use all of your CPUs (cores) at once. Depending on your hardware you may see 8x faster code, which means less waiting and more learning.

# 1. Cross-validation

Cross-validate the logistic regression classifier on the `rotten_tomatoes` dataset with TF-IDF vectorization that we used in the previous exercise. Perform 5-fold stratified cross-validation with the built-in method `cross_val_score` method in the sklearn `model_selection` module. Throughout this exercise (up to step 6), set the `scoring` parameter of `cross_val_score` to "accuracy". This means we will be using accuracy as our performance metric.

Compare performance (averaged across the five folds) to the model's in-sample performance on the training set. Does the model seem to be overfitting?


Reference: sklearn `cross_val_score` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)

In [2]:
# load the 2-class sentiment classification model from rotten tomatoes
train = load_dataset('rotten_tomatoes', split='train')
val = load_dataset('rotten_tomatoes',  split='validation')
test = load_dataset('rotten_tomatoes', split='test')

In [3]:
# vectorizing the data with TF-IDF corpus
vectorizer = TfidfVectorizer() # the default ngram range is (1,1)

train_corpus = [x["text"] for x in train]
train_labels = [x["label"] for x in train]
train_features = vectorizer.fit_transform(train_corpus)

val_corpus = [x["text"] for x in val]
val_labels = [x["label"] for x in val]
val_features = vectorizer.transform(val_corpus)

test_corpus = [x["text"] for x in test]
test_labels = [x["label"] for x in test]
test_features = vectorizer.transform(test_corpus)

In [4]:

#Perform 5-fold stratified cross-validation with the built-in method `cross_val_score` method
lr_score = cross_val_score(LogisticRegression(), train_features, train_labels, cv=StratifiedKFold(n_splits=5), scoring="accuracy", n_jobs=-1)

#Compare performance (averaged across the five folds) to the model's in-sample performance on the training set
lr = LogisticRegression().fit(train_features, train_labels)
preds = lr.predict(train_features)

print("Logistic Regression in-sample accuracy: ", accuracy_score(train_labels, preds))
print("Logistic Regression cross-validation accuracy: ", np.mean(lr_score))


Logistic Regression in-sample accuracy:  0.8960140679953107
Logistic Regression cross-validation accuracy:  0.7514654161781946


# 2. Regularization

Look up the documentation for [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). What parameters related to regularization are there?

1. Add the regularization term to the logistic regression classifier with L2 regularization and retrain it. Set the value of the regularization parameter to any non-default value within its range.
2. Compare the cross-validation performance to the unregularized classifier. Did anything change? Why do you think that is the case?

In [5]:
lr_score_reg = cross_val_score(LogisticRegression( C=0.5, penalty="l2"), train_features, train_labels, cv=StratifiedKFold(n_splits=5), scoring="accuracy", n_jobs=-1)

lr = LogisticRegression(C=0.5, penalty="l2").fit(train_features, train_labels)
preds = lr.predict(train_features)

print("Regularized Logistic Regression in-sample accuracy: ", accuracy_score(train_labels, preds))
print("Regularized Logistic Regression cross-validation accuracy: ", np.mean(lr_score_reg))
print("Logistic Regression cross-validation accuracy: ", np.mean(lr_score))

Regularized Logistic Regression in-sample accuracy:  0.8575615474794842
Regularized Logistic Regression cross-validation accuracy:  0.7403282532239155
Logistic Regression cross-validation accuracy:  0.7514654161781946


# 3. Hyperparameter search

1. Is the default value for the regularization parameter the best possible one? Use grid search with cross-validation to try several options.
2. What is your best model? Compare its cross-validation performance to that of the original, non-regularized model.

Reference: [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) documentation

In [6]:
# GridSearchCV is a module that enables running
# cross-validated grid-search over a parameter grid
from sklearn.model_selection import GridSearchCV

# the parameters to explore are passed as param_grid parameter
param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 0.8, 1, 10]}
lr_grid = GridSearchCV(LogisticRegression(penalty="l2"), param_grid, cv=5, n_jobs=-1)

lr_grid.fit(train_features, train_labels)

print("Best cross-validation score: {:.2f}".format(lr_grid.best_score_))
print("Best parameters: ", lr_grid.best_params_)
print("Best estimator: ", lr_grid.best_estimator_)

#Compare its cross-validation performance to that of the original, non-regularized model.
print("Logistic Regression cross-validation accuracy: ", np.mean(lr_score))

Best cross-validation score: 0.76
Best parameters:  {'C': 10}
Best estimator:  LogisticRegression(C=10)
Logistic Regression cross-validation accuracy:  0.7514654161781946


# 4. SVM classifier

Perform the same experiment with the LinearSVC classifier (this is an SVM with a linear kernel) on the *rotten_tomatoes* dataset.

1. Start with the default parameter settings.
2. Try to find the best option for the *c* hyperparameter with grid search. What is your best model performance?
3. Optional: try the SVM with a non-linear RBF kernel, and do the hyperparameter search on both *gamma* and *c*.

Documentation for the LinearSVC classifier: [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

More about SVMs: [link](https://scikit-learn.org/stable/modules/svm.html)

In [7]:
# default LinearSVC
linSVC = LinearSVC()

linSVC_score = cross_val_score(LinearSVC(), train_features, train_labels, cv=StratifiedKFold(n_splits=5), scoring="accuracy", n_jobs=-1)
print("LinearSVC cross-validation accuracy: ", np.mean(linSVC_score))


LinearSVC cross-validation accuracy:  0.7566236811254397


In [8]:
# hyperparameter search on LinearSVC
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
linSVC_grid = GridSearchCV(LinearSVC(), param_grid, cv=5, n_jobs=-1)

linSVC_grid.fit(train_features, train_labels)

print("Best cross-validation score: {:.2f}".format(linSVC_grid.best_score_))
print("Best parameters: ", linSVC_grid.best_params_)
print("Best estimator: ", linSVC_grid.best_estimator_)
print("LinearSVC cross-validation accuracy: ", np.mean(linSVC_score))

Best cross-validation score: 0.76
Best parameters:  {'C': 1}
Best estimator:  LinearSVC(C=1)
LinearSVC cross-validation accuracy:  0.7566236811254397


In [9]:
# hyperparameter search on SVC with rbf kernel
param_grid = {'C': [0.001, 0.1, 10], 'gamma': [0.001, 0.1, 10]}
svc_grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs = -1, verbose = 2)

svc_grid.fit(train_features, train_labels)

print("Best cross-validation score: {:.2f}".format(svc_grid.best_score_))
print("Best parameters: ", svc_grid.best_params_)
print("Best estimator: ", svc_grid.best_estimator_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best cross-validation score: 0.76
Best parameters:  {'C': 10, 'gamma': 0.1}
Best estimator:  SVC(C=10, gamma=0.1)


# 5. Naive Bayes classifier

Perform the same experiment with the Naive Bayes classifier. You can use a Multinomial Naive Bayes model (`MultinomialNB`) here with default parameter settings, as this is the variant that we covered in class (predicting categories from word occurence counts).

1. Multinomial Naive Bayes models don't take TF-IDF features, but rather word occurrence counts (so we need to leave out the IDF step). For that reason, re-vectorize the training data and then the test data using the `sklearn` `CountVectorizer` instead.
2. Run the model on the count-vectorized training data. You don't need to do a hyperparameter grid search. What is your model performance?

Documentation for the MultinomialNB classifier: [link](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

More about Naive Bayes models in sklearn: [link](https://scikit-learn.org/stable/modules/naive_bayes.html)

`Note` it seems the behavior of the classifier can be unstable when using n_jobs=-1. It should be fast enough without it. 


In [27]:
counterizer = CountVectorizer()
train_counts = counterizer.fit_transform(train_corpus)
test_counts = counterizer.transform(test_corpus)

nb = MultinomialNB()
nb_score = cross_val_score(nb, train_counts.toarray(), train_labels, cv=StratifiedKFold(n_splits=5), scoring="accuracy", verbose=2)

print("Naive Bayes cross-validation accuracy: ", np.mean(nb_score))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END .................................................... total time=   1.7s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


[CV] END .................................................... total time=   1.6s
[CV] END .................................................... total time=   1.7s
[CV] END .................................................... total time=   1.5s
[CV] END .................................................... total time=   1.5s
Naive Bayes cross-validation accuracy:  0.7648300117233294


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    8.0s finished


# 6. Comparative analysis of classifier performance

1. Compare the performance of logistic regression, Linear SVC and Naive Bayes classifers (with the best hyperparameters you could find for the first two models, and using the count-vectorized test data for the Naive Bayes classifier). Use both accuracy and F1 metrics. Are the two metrics consistent? Which is the best-performing model?
2. Bonus: evaluate your three classifiers on your small test dataset that you annotated yourself in Classification I class. Are all the classifiers behaving the same way?

Note: to get the best performing model, you can take the result of `GridSearchCV` and use its attribute `.best_estimator_`. Then, to use that model to make predictions on a new data set, you can apply the `.predict()` method to the model, giving it the new data set's features as an argument.

## F1 score and accuracy

In [35]:
# getting LR and LinearSVC predictions

lr_test_preds = lr_grid.best_estimator_.predict(test_features)
linSVC_test_preds = linSVC_grid.best_estimator_.predict(test_features)

# getting NB predictions
nb.fit(train_counts.toarray(), train_labels)
nb_test_preds = nb.predict(test_features.toarray())

In [12]:
# collecting the accuracy data data
results = {"Accuracy":dict(),"F1 score":dict()}
results["Accuracy"]["LR"] = accuracy_score(test_labels, lr_test_preds)
results["Accuracy"]["SVC"] = accuracy_score(test_labels, linSVC_test_preds)
results["Accuracy"]["NB"] = accuracy_score(test_labels, nb_test_preds)

In [13]:
# adding F1 data
results["F1 score"]["LR"] = f1_score(test_labels, lr_test_preds, average="macro")
results["F1 score"]["SVC"] = f1_score(test_labels, linSVC_test_preds, average="macro")
results["F1 score"]["NB"] = f1_score(test_labels, nb_test_preds, average="macro")
results

{'Accuracy': {'LR': 0.776735459662289,
  'SVC': 0.7729831144465291,
  'NB': 0.797373358348968},
 'F1 score': {'LR': 0.7766969440923813,
  'SVC': 0.7729439515561187,
  'NB': 0.7973277000264062}}

In [14]:
results_df = pd.DataFrame(results)
results_df.round(3)

Unnamed: 0,Accuracy,F1 score
LR,0.777,0.777
SVC,0.773,0.773
NB,0.797,0.797


We see that the logistic regression and linear SVC perform equally well. Both are outperformed by the Naive Bayes model.

## Evaluation on out-of-distribution data

We created a short dataset of reviews of the Mario Bros. movie, also taken from Rotten Tomatoes.

In [2]:
# reading in and vectorizing the data
mydata = pd.read_csv("../dataset/classification1_annotation.csv")
mytest_corpus= list(mydata["text"])
mytest_labels = list(mydata["label"])
mytest_features = vectorizer.transform(mytest_corpus)
mytest_counts = counterizer.transform(mytest_corpus)

NameError: name 'pd' is not defined

In [38]:
results["OOD accuracy"] = {}
results["OOD accuracy"]["LR"] = accuracy_score(mytest_labels, lr_grid.best_estimator_.predict(mytest_features))
results["OOD accuracy"]["SVC"] = accuracy_score(mytest_labels, linSVC_grid.best_estimator_.predict(mytest_features))
results["OOD accuracy"]["NB"] = accuracy_score(mytest_labels, nb.predict(mytest_counts.toarray()))
results_df = pd.DataFrame(results)
results_df.round(3)


Unnamed: 0,Accuracy,F1 score,OOD accuracy
LR,0.777,0.777,0.667
SVC,0.773,0.773,0.667
NB,0.797,0.797,0.583


All models have worse performance out-of-sample, but in our hand-annotated data, the Naive Bayes classifier goes down the most.