This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn* (imported in code as `sklearn`), which we will use to do the machine learning. The package should be installed under its full name, since the `sklearn` name on PyPI is only a deprecated placeholder.

In [1]:
pip install scikit-learn


Next we will install the Hugging Face *datasets* library, which gives us access to the data. We will use the IMDB sentiment analysis dataset available from the [Hugging Face datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [2]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected package

Now let's load the IMDB training set. We will print out the last instance.

In [12]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']
print(imdb_dataset[-1])



{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.', 'label': 1}


Let's convert the training data into the format expected by scikit-learn: a list of input documents (the raw review texts) and a parallel list of output labels, where 1 means positive and 0 means negative.

In [13]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1


We'll use the CountVectorizer class to extract the words in each review as the features the algorithm will learn from. Each document is represented as a 200-dimensional vector of word counts, because only the 200 most frequent words in the training data are kept in this version (`max_features=200`).

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word', max_features=200, lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

As a sanity check, let's confirm we have a 2-D array where each row is one of the 25,000 instances and each column is one of the 200 words. We also print out the words that will be used for classification.

In [21]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 200)
['10' 'about' 'acting' 'action' 'actors' 'actually' 'after' 'again' 'all'
 'also' 'an' 'and' 'another' 'any' 'are' 'around' 'as' 'at' 'back' 'bad'
 'be' 'because' 'been' 'before' 'being' 'best' 'better' 'between' 'big'
 'both' 'br' 'but' 'by' 'can' 'cast' 'character' 'characters' 'could'
 'did' 'didn' 'director' 'do' 'does' 'doesn' 'don' 'down' 'end' 'enough'
 'even' 'ever' 'every' 'fact' 'few' 'film' 'films' 'find' 'first' 'for'
 'from' 'funny' 'get' 'give' 'go' 'going' 'good' 'got' 'great' 'had' 'has'
 'have' 'he' 'her' 'here' 'him' 'his' 'horror' 'how' 'however' 'if' 'in'
 'into' 'is' 'it' 'its' 'just' 'know' 'life' 'like' 'little' 'long' 'look'
 'lot' 'love' 'made' 'make' 'makes' 'man' 'many' 'may' 'me' 'more' 'most'
 'movie' 'movies' 'much' 'my' 'never' 'new' 'no' 'not' 'nothing' 'now'
 'of' 'off' 'old' 'on' 'one' 'only' 'or' 'original' 'other' 'out' 'over'
 'own' 'part' 'people' 'plot' 'pretty' 'quite' 're' 'real' 'really'
 'right' 'same' 'say' 'scene' 'scenes' 'see'

Split the data into a training set and a validation (dev) set. We'll use the validation set to evaluate our model, with 75% of the data for training and the remaining 25% for validation.

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    features, train_data_labels, train_size=0.75, random_state=123
)
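
The effect of `train_size=0.75` can be checked on a small made-up array (the `toy_*` and `t*` names are assumptions for illustration). With 20 instances the split is 15 and 5; by the same arithmetic, the 25,000 reviews should split into 18,750 training and 6,250 validation instances.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 hypothetical instances with 2 features each, and alternating labels.
toy_X = np.arange(40).reshape(20, 2)
toy_y = [0, 1] * 10

# Same arguments as the real split: 75% train, fixed seed for repeatability.
tX_train, tX_val, ty_train, ty_val = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=123
)
print(tX_train.shape, tX_val.shape)
```

Fixing `random_state` means the same rows land in the same split on every run, which keeps the reported accuracy reproducible.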

We will use Multinomial Naive Bayes to do the classification. Create the model.

In [23]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

Train the model.

In [24]:
model = model.fit(X=X_train,y=y_train)
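
For intuition about what `fit` learns, the following sketch recomputes the smoothed per-class word probabilities by hand on a tiny made-up count matrix and compares them with the fitted model. The data and the `toy_*` and `manual_*` names are assumptions for illustration; the sketch relies on the default smoothing value `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary words.
toy_X = np.array([[2, 0, 1],
                  [3, 0, 0],
                  [0, 2, 1],
                  [0, 3, 1]])
toy_y = np.array([1, 1, 0, 0])

toy_model = MultinomialNB()  # default alpha=1.0 (add-one smoothing)
toy_model.fit(toy_X, toy_y)

# Recompute log P(word | class) by hand: add alpha to each word count,
# then normalise within the class.
alpha = 1.0
manual_log_prob = []
for c in toy_model.classes_:
    word_counts = toy_X[toy_y == c].sum(axis=0) + alpha
    manual_log_prob.append(np.log(word_counts / word_counts.sum()))
manual_log_prob = np.array(manual_log_prob)

# Score a new document: log prior + sum over words of count * log P(word | class).
new_doc = np.array([[1, 0, 0]])
scores = new_doc @ manual_log_prob.T + toy_model.class_log_prior_
manual_pred = toy_model.classes_[scores.argmax(axis=1)]
print(manual_pred, toy_model.predict(new_doc))
```

The class with the highest score is the prediction, so a document containing only the first word is assigned to class 1, where that word is most common.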

Test the model on the validation set.

In [25]:
y_pred = model.predict(X_val)
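
The validation rows were already vectorized before the split, so `predict` can be called on them directly. For genuinely new text, the review must first pass through the same fitted vectorizer using `transform` (not `fit_transform`), so that its columns line up with the ones the model was trained on. A minimal sketch with made-up reviews (all `toy_*` and `new_*` names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training reviews: 1 = positive, 0 = negative.
toy_reviews = ["great film, great cast", "bad plot and bad acting",
               "a great story", "bad film"]
toy_labels = [1, 0, 1, 0]

# Fit the vocabulary and the classifier on the training reviews.
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_reviews), toy_labels)

# Reuse the fitted vocabulary for unseen reviews.
new_reviews = ["a great film", "bad acting"]
new_features = toy_vectorizer.transform(new_reviews)
new_pred = toy_model.predict(new_features)
print(new_pred)
```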

Now let's calculate the accuracy of the model's predictions on the validation set.

In [26]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.70368
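
Accuracy is simply the fraction of predictions that match the true labels. The sketch below, on made-up label arrays (the `toy_*` names are assumptions for illustration), computes it by hand, checks it against `accuracy_score`, and compares it with a baseline that always predicts the most common label.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for 8 instances.
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(toy_true == toy_pred)
library_accuracy = accuracy_score(toy_true, toy_pred)

# Baseline: always predict the most common label in the data.
majority_label = np.bincount(toy_true).argmax()
baseline_accuracy = np.mean(toy_true == majority_label)
print(manual_accuracy, library_accuracy, baseline_accuracy)
```

Because the IMDB training set is balanced between positive and negative reviews, a majority-class baseline on the real validation split should score close to 0.5, which is a useful reference point when reading the accuracy above.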
