This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn* (imported as `sklearn`), which we will use for the machine learning.

In [41]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


Next we will install the *datasets* library, which we will use to download the data. We will use the IMDB sentiment analysis dataset available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [42]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


Now let's load the IMDB training set. We will print out the last instance.

In [43]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']
print(imdb_dataset[-1])

{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.', 'label': 1}


Let's convert the training data into two parallel lists: the input documents (raw review text, which we will turn into vectors in the next step) and their associated output labels.

In [44]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1
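
The same conversion can also be written as two list comprehensions. This is a minimal sketch that uses a small hypothetical list of dicts (`records`) as a stand-in for `imdb_dataset`, so it runs without downloading anything; the reviews in it are made up for illustration.

```python
# Hypothetical stand-in for imdb_dataset: each item is a dict with 'text' and 'label'.
records = [
    {"text": "A wonderful film.", "label": 1},
    {"text": "A dull, boring mess.", "label": 0},
    {"text": "Great acting and a great script.", "label": 1},
]

# Same result as the explicit loop: one list of documents, one list of labels.
texts = [item["text"] for item in records]
labels = [item["label"] for item in records]

print(texts[-1])
print(labels[-1])
```

The two lists stay aligned by position, which is what scikit-learn relies on when it pairs each document with its label.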


We'll use the CountVectorizer class to extract the words in each review as the features the algorithm will learn from. Each document is represented as a 5000-dimensional vector of word counts. Only the 5000 most frequent words are used in this version (`max_features=5000`).

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=5000,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

As a sanity check, let's confirm that we have a 2-d array where each row is one of the 25,000 instances and each column is one of the 5000 words. We also print out the words that will be used for classification.

In [46]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 5000)
['00' '000' '10' ... 'zombie' 'zombies' 'zone']


Split the data into a training set and a validation (dev) set. We'll use the validation set to evaluate our model. We'll use 75% of the data for training and 25% for validation.

In [47]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)
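
As a quick check of what `train_size=0.75` does, the sketch below splits 100 toy instances. The `toy_` variables are hypothetical stand-ins for `features` and `train_data_labels`, and `stratify=toy_y` is an optional extra that the notebook itself does not use; it is included here only to show how the class balance could be preserved in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: 100 one-dimensional instances with alternating 0/1 labels.
toy_X = [[i] for i in range(100)]
toy_y = [i % 2 for i in range(100)]

# 75% of the instances go to training, the remaining 25% to validation.
# stratify=toy_y (an assumption, not in the notebook) keeps the label ratio similar in both parts.
toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=123, stratify=toy_y
)

print(len(toy_X_train), len(toy_X_val))
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate on the same validation instances.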

We will use a Random Forest classifier to do the classification. Create the model.

In [54]:
from sklearn.ensemble import RandomForestClassifier
feature_names = [f"feature {i}" for i in range(features.shape[1])]
model = RandomForestClassifier(random_state=42)

Train the model.

In [55]:
model = model.fit(X=X_train,y=y_train)

In [60]:
# Rank the features by importance and look up the 50 most important words.
feature_importances = model.feature_importances_
top_50_indices = feature_importances.argsort()[-50:][::-1]
top_50_feature_names = vectorizer.get_feature_names_out()[top_50_indices]

Test the model on the validation set.

In [56]:
y_pred = model.predict(X_val)

Now let's calculate the accuracy of the model's predictions on the validation set.

In [57]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.832
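
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on a few hypothetical labels and predictions (`toy_true` and `toy_pred` are made up for illustration).

```python
# Hypothetical true labels and predictions for 8 reviews.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the positions where the prediction equals the true label.
correct = sum(t == p for t, p in zip(toy_true, toy_pred))
toy_accuracy = correct / len(toy_true)

print(correct, toy_accuracy)
```

By the same arithmetic, the 0.832 reported above on the 6,250 validation reviews (25% of 25,000) corresponds to 5,200 correct predictions.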


In [71]:
import pandas as pd
feature_names = vectorizer.get_feature_names_out()
importances = model.feature_importances_
forest_importances = pd.Series(importances, index=feature_names)
forest_importances = forest_importances.sort_values(ascending=False)
print(forest_importances[0:50])

bad           0.020492
worst         0.018124
awful         0.010721
great         0.009569
waste         0.008169
boring        0.006811
terrible      0.006638
no            0.006342
and           0.005842
best          0.005364
excellent     0.005201
nothing       0.004935
wonderful     0.004700
poor          0.004633
worse         0.004611
stupid        0.004327
love          0.003902
the           0.003848
is            0.003715
of            0.003507
movie         0.003229
plot          0.003183
horrible      0.003169
in            0.003135
poorly        0.003066
perfect       0.003048
to            0.003018
minutes       0.003017
also          0.002970
well          0.002954
even          0.002953
this          0.002947
supposed      0.002897
just          0.002838
dull          0.002774
it            0.002707
amazing       0.002686
beautiful     0.002598
as            0.002594
money         0.002561
crap          0.002502
was           0.002493
script        0.002476
favorite   

The code above reads the model's `feature_importances_` attribute to get a score for how important each word is for classification, puts the scores in a pandas Series indexed by word, sorts them in descending order of importance, and prints the top 50 results.
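
To see how these scores behave, here is a small sketch on synthetic data. It assumes two made-up word-count columns, `awful` (which fully determines the label) and `the` (random noise); none of this comes from the IMDB run. In scikit-learn, a random forest's `feature_importances_` are impurity-based and are normalised to sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only): 200 "reviews" with two count features.
rng = np.random.default_rng(0)
awful = rng.integers(0, 2, size=200)   # 1 if the word "awful" appears
the = rng.integers(0, 5, size=200)     # count of "the", unrelated to the label
toy_X = np.column_stack([awful, the])
toy_y = 1 - awful                      # negative (0) whenever "awful" appears
toy_names = np.array(["awful", "the"])

toy_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(toy_X, toy_y)
toy_importances = toy_model.feature_importances_

# Sort feature indices from most to least important.
order = toy_importances.argsort()[::-1]
for name, score in zip(toy_names[order], toy_importances[order]):
    print(f"{name}: {score:.3f}")
```

Because the scores sum to 1 across all features, each individual value in a 5000-word vocabulary is small, so the ranking is more informative than the absolute numbers.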