# Movie Review Sentiment Classifier

In this notebook, you will implement a simple linear classifier to infer the sentiment of a movie review from its text. 

You will also implement a hyper-parameter tuning method presented in the lectures to find a good value for the regularisation parameter of your logistic regression classifier. 

The [scikit-learn](https://scikit-learn.org/stable/index.html) machine learning package will be used throughout this notebook.

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Load the movie review data

In [2]:
df = pd.read_csv(os.path.join("data", "movie_reviews_labelled.csv"))

Shuffle the rows and sample a fraction of the dataset for this lab so that you don't have to wait so long for the model to train

In [3]:
df = df.sample(frac=0.3, random_state=1).reset_index(drop=True)

Split the data into training, validation and test sets.

In [4]:
# convert pandas series to lists
Xr = df["text"].tolist()
Yr = df["label"].tolist()

# compute the train, val, test splits
train_frac, val_frac, test_frac = 0.7, 0.1, 0.2
train_end = int(train_frac*len(Xr))
val_end = int((train_frac + val_frac)*len(Xr))

# store the train val test splits
X_train = Xr[0:train_end]
Y_train = Yr[0:train_end]
X_val = Xr[train_end:val_end]
Y_val = Yr[train_end:val_end]
X_test = Xr[val_end:]
Y_test = Yr[val_end:]

Fit a linear classification model

In [12]:
def fit_model(Xtr, Ytr, C):
    """Tokenizes the sentences, calculates TF vectors, and trains a logistic regression model.
    
    Args:
    - Xtr: A list of training documents provided as text
    - Ytr: A list of training class labels
    - C: The regularization parameter
    """

    # TODO: write model fitting code using CountVectorizer and LogisticRegression
    #       CountVectorizer is used to convert the text into sparse TF vectors
    #       See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    #       LogisticRegression will train the classifier using these vectors
    #       See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    count_vectoriser = CountVectorizer()
    X = count_vectoriser.fit_transform(Xtr)
    model = LogisticRegression(C=C, max_iter=10000).fit(X, Ytr)
    
    # return the model and CountVectorizer
    # Note: we need to return the CountVectorizer because 
    # it stores a mapping from words -> ids which we need for testing
    return model, count_vectoriser

Test a fitted linear classifier

In [16]:
def test_model(Xtst, Ytst, model, count_vectoriser):
    """Evaluate a trained classifier on the test set.
    
    Args:
    - Xtst: A list of test or validation documents
    - Ytst: A list of test or validation class labels
    - count_vectoriser: A fitted CountVectorizer
    """
    
    # TODO: write code to test a fitted linear model and return accuracy
    #       you will need to use count_vec to convert the text into TF vectors
    # Hint: the function accuracy_score from sklearn may be helpful
    #       See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html 
    X = count_vectoriser.transform(Xtst)
    Y = model.predict(X)
    score = accuracy_score(Ytst, Y)
    return score

Hyper-parameter tuning: search a good value for the hyper-parameter `C`

In [23]:
# TODO: search for the best C parameter by 
#       training on the training set and testing on the validation set
#       you should use fit_model and test_model
C = 0.001
score_opt = float('-inf')
while C < 100:
    model, count_vectoriser = fit_model(X_train, Y_train, C)
    score = test_model(X_val, Y_val, model, count_vectoriser)
    if score > score_opt:
        C_opt = C
    C *= 10

Train your classifier using both the training and validation data, and the best value of `C`

In [29]:
# TODO: fit the model to the concatenated training and validation set
#       test on the test set and print the result
model, count_vectoriser = fit_model((X_train + X_val), (Y_train + Y_val), C_opt)
score = test_model(X_test, Y_test, model, count_vectoriser)
score

0.8640453182272576

Inspect the co-efficients of your logistic regression classifier

In [None]:
# TODO: find the words corresponding to the 5 largest (most positive) and 
#       5 smallest (most negative) co-efficients of the linear model
# Hint: a fitted LogisticRegression model in sklearn has a coef_ attribute which stores the co-efficients
#       CountVectorizer has a vocabulary_ attribute that stores a mapping of terms to feature indices