This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn* (imported as *sklearn*), which we will use for the machine learning.

In [1]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


Next we will install the Hugging Face *datasets* library, which we use to download the data. We will use the IMDB sentiment analysis dataset available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [2]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


Now let's load the IMDB training set. We will print out the last instance.

In [3]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']
print(imdb_dataset[-1])



{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.', 'label': 1}


Let's convert the training data into the form we will feed to scikit-learn: a list of input documents (the raw review texts) and a parallel list of output labels.

In [4]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1
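
The same two parallel lists can also be built with list comprehensions. The sketch below uses three made-up records shaped like the dictionaries the dataset yields; `toy_records` and its contents are illustrative assumptions, not IMDB data.

```python
# Hypothetical stand-in for the dataset: each record has 'text' and 'label' keys
toy_records = [
    {"text": "A wonderful film.", "label": 1},
    {"text": "Dull and far too long.", "label": 0},
    {"text": "Great songs.", "label": 1},
]

# One pass per field keeps documents and labels aligned by position
toy_texts = [record["text"] for record in toy_records]
toy_labels = [record["label"] for record in toy_records]

print(toy_texts[-1])
print(toy_labels[-1])
```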


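We'll use the TfidfVectorizer class to turn the words in each review into the features the algorithm will learn from. Each document is represented as a 16,000-dimensional vector of TF-IDF weights over the most frequent words and word bigrams. Because `binary=True`, term counts greater than 1 are clipped to 1 before the IDF weighting is applied. A hand-picked list of stop words is removed.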

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data, fetched with nltk.download

def custom_preprocessor(text):
    # Remove everything except ASCII letters and whitespace (digits and punctuation go)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Join tokens back into a single space-separated string
    processed_text = ' '.join(tokens)
    return processed_text


vectorizer = TfidfVectorizer(
    analyzer='word',
    max_features=16000,   # keep the 16,000 most frequent n-grams
    lowercase=True,       # has no effect here: a custom preprocessor replaces the lowercasing step
    binary=True,          # clip term counts to 1 before IDF weighting
    ngram_range=(1, 2),   # unigrams and bigrams
    stop_words=['or', 'if', 'from', 'so', 'film', 'movie', 'movies', 'films','plot','was','make','sense','time','same','CGI','ever','minute','hard','hired','managed','tears','fell','must','you','matter','forget','experience'],
    preprocessor=custom_preprocessor
)

# Dense float64 array of shape (25000, 16000), roughly 3.2 GB in memory
features = vectorizer.fit_transform(train_data).toarray()
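
To see what the preprocessing step does to a single review, here is a minimal sketch of the same regular expression applied to a made-up sentence. As an assumption made to keep the example dependency-free, `str.split()` stands in for nltk's `word_tokenize`, and `sketch_preprocessor` is a hypothetical name.

```python
import re

def sketch_preprocessor(text):
    # Drop every character that is not an ASCII letter or whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Assumption: str.split() stands in for nltk's word_tokenize here
    return ' '.join(text.split())

cleaned = sketch_preprocessor("The CGI wasn't great: 10/10!")
print(cleaned)
```

Note that the case of each word is preserved and that removing the apostrophe turns "wasn't" into "wasnt". Nothing in this function lowercases the text, so upper-case tokens reach the vocabulary unchanged.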

As a sanity check, let's confirm that we have a 2-d array where each row is one of the 25,000 instances and each column is one of the 16,000 words or word bigrams. We also print the n-grams that will be used for classification.

In [6]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 16000)
['ABC' 'ALL' 'AND' ... 'zero' 'zombie' 'zombies']
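
The columns are unigrams and bigrams. The pure-Python sketch below shows how such n-grams can be generated from a token list and what "binary" presence means; `word_ngrams` is a hypothetical helper written for illustration, not the function scikit-learn uses internally.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams of the tokens for min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

tokens = "much fun and much fun".split()
grams = word_ngrams(tokens)
# Binary presence: each distinct n-gram counts once, however often it repeats
present = sorted(set(grams))
print(grams)
print(present)
```

With `binary=True` the vectorizer applies this kind of clipping to the term-frequency part only, so the resulting TF-IDF values are still real-valued weights rather than 0/1 flags.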


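Before training, we pass the features through a chi-squared feature selector. With `k=16000`, the same as the number of columns, every feature is kept, so lower `k` if you want to prune the feature set. In the following cell we split the data into a training set and a validation (dev) set, using 80% of the data for training and 20% for testing the models.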

In [7]:
from sklearn.feature_selection import SelectKBest, chi2

# 'features' is the feature matrix and 'train_data_labels' holds the target labels
# Score each feature with the chi-squared (chi2) test and keep the top k
k_best = SelectKBest(score_func=chi2, k=16000)  # k equals the number of features, so all are kept; lower it to prune
selected_features = k_best.fit_transform(features, train_data_labels)
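
To make the scoring idea concrete, here is a small pure-Python sketch of a chi-squared score for non-negative features: for each feature it compares the total observed in each class with the total expected if the feature were independent of the label. It is meant to illustrate the idea on a toy matrix rather than to reproduce scikit-learn's implementation; `chi2_scores`, `toy_rows` and `toy_y` are hypothetical names.

```python
def chi2_scores(rows, labels):
    """Chi-squared score per feature column for non-negative feature values."""
    classes = sorted(set(labels))
    n_samples = len(rows)
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Observed: total feature value among samples of class c
            observed = sum(row[j] for row, y in zip(rows, labels) if y == c)
            # Expected under independence: class share of the feature total
            class_share = labels.count(c) / n_samples
            expected = class_share * feature_total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Feature 0 appears only in class 1; feature 1 appears equally in both classes
toy_rows = [[1, 1], [1, 0], [0, 1], [0, 0]]
toy_y = [1, 1, 0, 0]
scores = chi2_scores(toy_rows, toy_y)
print(scores)
```

The feature that separates the classes gets the higher score, which is why keeping only the top-scoring columns tends to retain the most label-informative n-grams.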

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training (80%) and validation (20%) sets
X_train, X_test, y_train, y_test = train_test_split(selected_features, train_data_labels, test_size=0.2, random_state=42)

# Initialize and train the three classifiers
model_nb = MultinomialNB()
model_lr = LogisticRegression()
model_rf = RandomForestClassifier()
model_nb = model_nb.fit(X=X_train,y=y_train)
model_lr = model_lr.fit(X=X_train,y=y_train)
model_rf = model_rf.fit(X=X_train,y=y_train)

# Make predictions on the test set
prediction_nb = model_nb.predict(X_test)
prediction_lr = model_lr.predict(X_test)
prediction_rf = model_rf.predict(X_test)

# Calculate accuracy
accuracy_nb = accuracy_score(y_test, prediction_nb)
accuracy_lr = accuracy_score(y_test, prediction_lr)
accuracy_rf = accuracy_score(y_test, prediction_rf)
print("Naive-Bayes Accuracy:", accuracy_nb)
print(confusion_matrix(y_test,prediction_nb))
print()
print("Logistic Regression Accuracy:", accuracy_lr)
print(confusion_matrix(y_test,prediction_lr))
print()
print("Random Forest Accuracy:", accuracy_rf)
print(confusion_matrix(y_test,prediction_rf))

Naive-Bayes Accuracy: 0.8752
[[2188  327]
 [ 297 2188]]

Logistic Regression Accuracy: 0.8918
[[2240  275]
 [ 266 2219]]

Random Forest Accuracy: 0.835
[[2089  426]
 [ 399 2086]]
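
The reported accuracies can be checked directly from the confusion matrices, whose rows are the true labels and whose columns are the predicted labels. The sketch below copies the three matrices from the output above and also derives precision and recall for the positive class; the `summarize` helper is a hypothetical name introduced for illustration.

```python
# Confusion matrices copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
matrices = {
    "Naive-Bayes": [[2188, 327], [297, 2188]],
    "Logistic Regression": [[2240, 275], [266, 2219]],
    "Random Forest": [[2089, 426], [399, 2086]],
}

def summarize(matrix):
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "total": total,
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

summaries = {name: summarize(m) for name, m in matrices.items()}
for name, s in summaries.items():
    print(f"{name}: n={s['total']} accuracy={s['accuracy']:.4f} "
          f"precision={s['precision']:.4f} recall={s['recall']:.4f}")
```

Each matrix sums to 5,000 reviews, which is the 20% of the 25,000 instances held out by `test_size=0.2`.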


Finally, save the three trained models and the fitted vectorizer to disk with pickle so they can be reused later.

In [9]:
import pickle
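models = {
    'model_nb': model_nb,
    'model_lr': model_lr,
    'model_rf': model_rf,
}
# Write each trained classifier to its own pickle file
for model_name, model in models.items():
    with open(f'ca337-{model_name}-model.pkl', 'wb') as f:
        pickle.dump(model, f)

# Save the fitted vectorizer so new reviews can be transformed the same way.
# custom_preprocessor is pickled by reference, so it must be defined wherever this file is loaded.
with open('ca337-nb1000-features.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)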
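
To reuse the saved artifacts, each file is read back with `pickle.load`. The sketch below shows the round trip on a stand-in dictionary rather than a fitted model, and writes to a temporary directory so it leaves no files behind; `stand_in_model` and the file name are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object behaves the same way
stand_in_model = {"name": "model_nb", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-model.pkl")
    # Serialize the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)
    # Read it back into a new, equal object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files you trust, because loading a pickle can execute arbitrary code.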