In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset, logging
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

import matplotlib.pyplot as plt
# enabling inline plots in Jupyter
%matplotlib inline
# disabling verbose messages from dataset library
logging.set_verbosity_error()



# Exercise: Classification II

In this exercise session, you will be using cross-validation to check the out-of-sample performance of different models for classifying movie review sentiment (using TF-IDF features as in the Classification I problem set). You will compare a logistic regression model to SVM and Naive Bayes. You will also use cross-validation for hyperparameter grid search.

Pro tip: You will be fitting many models in this exercise, so take a look at how sklearn handles [parallelism](https://scikit-learn.org/stable/computing/parallelism.html#parallelism). Many estimators and functions in the sklearn library take an [n_jobs](https://scikit-learn.org/stable/glossary.html#term-n_jobs) parameter. Setting it to -1 uses all of your CPUs (cores) at once. Depending on your hardware, your code may run several times faster (up to roughly 8x on an 8-core machine), which means less waiting and more learning.
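As a minimal sketch of the tip above, the snippet below passes `n_jobs=-1` to `cross_val_score`. The synthetic data from `make_classification` is an assumption used only to keep the example self-contained; it is not the exercise data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the exercise uses TF-IDF features).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# n_jobs=-1 asks sklearn to evaluate the folds on all available cores.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy", n_jobs=-1)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

On small problems like this one, the overhead of starting worker processes can outweigh the gain, so the speedup tends to show only on larger fits.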

# 1. Cross-validation

Cross-validate the logistic regression classifier on the `rotten_tomatoes` dataset with TF-IDF vectorization that we used in the previous exercise. Perform 5-fold stratified cross-validation with the built-in `cross_val_score` function from the sklearn `model_selection` module. Throughout this exercise (up to step 6), set the `scoring` parameter of `cross_val_score` to "accuracy". This means we will be using accuracy as our performance metric.

Compare performance (averaged across the five folds) to the model's in-sample performance on the training set. Does the model seem to be overfitting?


Reference: sklearn `cross_val_score` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
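One way the cross-validation step could look is sketched below. The tiny hand-made corpus is a hypothetical stand-in for the `rotten_tomatoes` training split, and wrapping the vectorizer and classifier in a pipeline (so that TF-IDF is refit inside each fold) is a design choice of this sketch, not a requirement of the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

pos = ["great", "wonderful", "moving", "funny", "brilliant"]
neg = ["boring", "awful", "dull", "clumsy", "tedious"]

def make_reviews(words):
    # Pair each word with two others so every word appears in several reviews.
    n = len(words)
    return [f"a {words[i]} and {words[(i + k) % n]} film"
            for k in (1, 2) for i in range(n)]

# Hypothetical toy data: 10 positive (1) and 10 negative (0) reviews.
texts = make_reviews(pos) + make_reviews(neg)
labels = [1] * 10 + [0] * 10

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-sample accuracy, one score per fold.
cv_scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

# In-sample accuracy: fit and score on the same data.
model.fit(texts, labels)
train_acc = model.score(texts, labels)

print(f"Mean CV accuracy:  {cv_scores.mean():.3f}")
print(f"Training accuracy: {train_acc:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting.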

# 2. Regularization

Look up the documentation for [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). What parameters related to regularization are there?

1. Add the regularization term to the logistic regression classifier with L2 regularization and retrain it. Set the value of the regularization parameter to any non-default value within its range.
2. Compare the cross-validation performance to the unregularized classifier. Did anything change? Why do you think that is the case?
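To get a feel for what the regularization parameter does, the sketch below fits logistic regression with three values of `C` and prints the size of the learned coefficients. It leaves `penalty` at its default and uses synthetic data from `make_classification`, which is an assumption made only to keep the snippet self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the exercise data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    # In sklearn, C is the inverse of the regularization strength.
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} coefficient norm = {norms[C]:.3f}")
```

Smaller values of `C` penalize large weights more heavily, so the coefficient norm shrinks as `C` decreases.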

# 3. Hyperparameter search

1. Is the default value for the regularization parameter the best possible one? Use grid search with cross-validation to try several options.
2. What is your best model? Compare its cross-validation performance to that of the original, non-regularized model.

Reference: [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) documentation
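A grid search over `C` might be set up as in the sketch below. The toy corpus, the step names `tfidf` and `clf`, and the candidate values in `param_grid` are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

pos = ["great", "wonderful", "moving", "funny", "brilliant"]
neg = ["boring", "awful", "dull", "clumsy", "tedious"]

def make_reviews(words):
    # Pair each word with two others so every word appears in several reviews.
    n = len(words)
    return [f"a {words[i]} and {words[(i + k) % n]} film"
            for k in (1, 2) for i in range(n)]

# Hypothetical toy data standing in for the rotten_tomatoes training split.
texts = make_reviews(pos) + make_reviews(neg)
labels = [1] * 10 + [0] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.cv_results_` holds the mean score for every candidate, which is useful for seeing how sensitive the model is to `C`.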

# 4. SVM classifier

Perform the same experiment with the LinearSVC classifier (this is an SVM with a linear kernel) on the *rotten_tomatoes* dataset.

1. Start with the default parameter settings.
2. Try to find the best value for the `C` hyperparameter with grid search. What is your best model performance?
3. Optional: try the SVM with a non-linear RBF kernel, and do the hyperparameter search on both `gamma` and `C`.

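Documentation for the LinearSVC classifier: [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) (for the optional RBF kernel, see [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html))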

More about SVMs: [link](https://scikit-learn.org/stable/modules/svm.html)
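The SVM experiments can be sketched along the same lines. The toy corpus, the helper `make_reviews`, the step names, and the grids below are illustrative assumptions; the second search covers the optional RBF variant with both `C` and `gamma`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC

pos = ["great", "wonderful", "moving", "funny", "brilliant"]
neg = ["boring", "awful", "dull", "clumsy", "tedious"]

def make_reviews(words):
    # Pair each word with two others so every word appears in several reviews.
    n = len(words)
    return [f"a {words[i]} and {words[(i + k) % n]} film"
            for k in (1, 2) for i in range(n)]

# Hypothetical toy data standing in for the rotten_tomatoes training split.
texts = make_reviews(pos) + make_reviews(neg)
labels = [1] * 10 + [0] * 10
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Linear SVM: search over C only.
linear_search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())]),
    {"clf__C": [0.01, 0.1, 1, 10]},
    cv=cv,
    scoring="accuracy",
)
linear_search.fit(texts, labels)

# RBF-kernel SVM (optional part): search over C and gamma jointly.
rbf_search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("clf", SVC(kernel="rbf"))]),
    {"clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1, 1]},
    cv=cv,
    scoring="accuracy",
)
rbf_search.fit(texts, labels)

print("LinearSVC:", linear_search.best_params_, f"{linear_search.best_score_:.3f}")
print("RBF SVC:  ", rbf_search.best_params_, f"{rbf_search.best_score_:.3f}")
```

The RBF grid fits 9 candidates across 5 folds, so on the full dataset it will take noticeably longer than the linear search.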

# 5. Naive Bayes classifier

Perform the same experiment with the Naive Bayes classifier. You can use a Multinomial Naive Bayes model (`MultinomialNB`) here with default parameter settings, as this is the variant that we covered in class (predicting categories from word occurrence counts).

1. Multinomial Naive Bayes models don't take TF-IDF features, but rather word occurrence counts (so we need to leave out the IDF step). For that reason, re-vectorize the training data and then the test data using the `sklearn` `CountVectorizer` instead.
2. Run the model on the count-vectorized training data. You don't need to do a hyperparameter grid search. What is your model performance?

Documentation for the MultinomialNB classifier: [link](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

More about Naive Bayes models in sklearn: [link](https://scikit-learn.org/stable/modules/naive_bayes.html)

Note: the behavior of this classifier appears to be unstable when using `n_jobs=-1`. It should be fast enough without it.
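The two steps above could look roughly like the sketch below: fit a `CountVectorizer` on the training texts, reuse it for the test texts, and cross-validate `MultinomialNB` on the counts. The toy training and test texts are hypothetical stand-ins for the real dataset splits, and `n_jobs` is left at its default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

pos = ["great", "wonderful", "moving", "funny", "brilliant"]
neg = ["boring", "awful", "dull", "clumsy", "tedious"]

def make_reviews(words):
    # Pair each word with two others so every word appears in several reviews.
    n = len(words)
    return [f"a {words[i]} and {words[(i + k) % n]} film"
            for k in (1, 2) for i in range(n)]

# Hypothetical toy data standing in for the train and test splits.
train_texts = make_reviews(pos) + make_reviews(neg)
train_labels = [1] * 10 + [0] * 10
test_texts = ["a great and funny film", "a dull and awful film"]

# Learn the vocabulary on the training data only, then reuse it for the test data.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

nb = MultinomialNB()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(nb, X_train, train_labels, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")

# Fit on all training data so the model is ready for the test set in step 6.
nb.fit(X_train, train_labels)
```

Calling `transform` rather than `fit_transform` on the test texts keeps the feature columns aligned with the ones the model was trained on.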


# 6. Comparative analysis of classifier performance

1. Use the code below as a starting point to compare the performance of the logistic regression, LinearSVC and Naive Bayes classifiers (with the best hyperparameters you could find for the first two models, and using the count-vectorized test data for the Naive Bayes classifier). Use both accuracy and F1 metrics. Are the two metrics consistent? Which is the best-performing model?
2. Bonus: evaluate your three classifiers on the small test dataset that you annotated yourself in the Classification I class. Are all the classifiers behaving the same way?

Note: to get the best performing model, you can take the result of `GridSearchCV` and use its attribute `.best_estimator_`. Then, to use that model to make predictions on a new data set, you can apply the `.predict()` method to the model, giving it the new data set's features.
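The note above can be illustrated with the following sketch, which takes `.best_estimator_` from two grid searches, adds a Naive Bayes model, and scores all three on held-out data with accuracy and F1. The toy train and test texts, the grid values, and the `results` dictionary are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

pos = ["great", "wonderful", "moving", "funny", "brilliant"]
neg = ["boring", "awful", "dull", "clumsy", "tedious"]

def make_reviews(words):
    # Pair each word with two others so every word appears in several reviews.
    n = len(words)
    return [f"a {words[i]} and {words[(i + k) % n]} film"
            for k in (1, 2) for i in range(n)]

# Hypothetical toy data standing in for the train and test splits.
train_texts = make_reviews(pos) + make_reviews(neg)
train_labels = [1] * 10 + [0] * 10
test_texts = ["a great and funny film", "a wonderful and moving film",
              "a dull and awful film", "a tedious and clumsy film"]
test_labels = [1, 1, 0, 0]

# TF-IDF features for logistic regression and the SVM, raw counts for Naive Bayes.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_texts)
X_test_tfidf = tfidf.transform(test_texts)
counts = CountVectorizer()
X_train_counts = counts.fit_transform(train_texts)
X_test_counts = counts.transform(test_texts)

grid = {"C": [0.1, 1, 10]}
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                         cv=5, scoring="accuracy").fit(X_train_tfidf, train_labels)
svm_search = GridSearchCV(LinearSVC(), grid,
                          cv=5, scoring="accuracy").fit(X_train_tfidf, train_labels)
nb = MultinomialNB().fit(X_train_counts, train_labels)

# Each model is paired with the test features that match its training features.
models = {
    "Logistic regression": (lr_search.best_estimator_, X_test_tfidf),
    "LinearSVC": (svm_search.best_estimator_, X_test_tfidf),
    "Naive Bayes": (nb, X_test_counts),
}

results = {}
for name, (model, X_test) in models.items():
    pred = model.predict(X_test)
    results[name] = (accuracy_score(test_labels, pred), f1_score(test_labels, pred))
    print(f"{name:<20} accuracy={results[name][0]:.3f}  F1={results[name][1]:.3f}")
```

With its default `refit=True`, `GridSearchCV` retrains the best candidate on the full training data, which is why `.best_estimator_` can be used for prediction directly.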