This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn* (imported as *sklearn*), which we will use for the machine learning.

In [2]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


Next we will install the *datasets* library, which we use to download the data. We will use the IMDB sentiment analysis dataset available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [3]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


Now let's load the IMDB training set. We will print out the last instance.

In [4]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']
print(imdb_dataset[-1])



{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.', 'label': 1}


Let's convert the training data into the format we will pass to scikit-learn: a list of input documents (the review texts) and a list of associated output labels.

In [5]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1


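We'll use the CountVectorizer class to extract the words and two-word sequences (unigrams and bigrams) in each review as the features the algorithm will learn from. Each document is represented as a 16,000-dimensional binary vector: because `binary=True` is set, each entry records whether a feature occurs in the review rather than how many times. English stop words are removed, and only the 16,000 most frequent features are used in this version.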

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=16000,lowercase=True,binary=True,ngram_range=(1,2),stop_words='english')
features = vectorizer.fit_transform(train_data).toarray()

As a sanity check, let's confirm we have a 2-d array where each row is one of the 25,000 instances and each column is one of the 16,000 features. We also print out the features (words and word pairs) that will be used for classification.

In [7]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 16000)
['00' '000' '000 000' ... 'zone' 'zoo' 'zoom']


Split the data into a training set and a validation (dev) set. We'll train on 75% of the data and use the remaining 25% (6,250 reviews) as the validation set to test our model.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)
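As a small illustration of how the split behaves, the sketch below applies the same call to eight made-up rows. The `toy_*` variables are hypothetical and exist only for this example.

```python
from sklearn.model_selection import train_test_split

# Eight hypothetical rows; each feature row stores its own index so we can
# check that features and labels stay paired after shuffling.
toy_X = [[i] for i in range(8)]
toy_y = [0, 1, 0, 1, 0, 1, 0, 1]

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=123
)

print(len(toy_X_train), len(toy_X_val))   # sizes of the two subsets
print(list(zip(toy_X_val, toy_y_val)))    # each row keeps its original label
```

With `train_size=0.75`, six of the eight rows go to training and two to validation; on the 25,000 IMDB reviews the same ratio gives 18,750 and 6,250. Passing `stratify=` with the label list is an option if the class balance should be preserved in both subsets, which the notebook does not do.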

We will use Multinomial Naive Bayes to do the classification. Create the model.

In [9]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

Train the model.

In [10]:
model = model.fit(X=X_train,y=y_train)

Test the model on the validation set.

In [11]:
y_pred = model.predict(X_val)

Now let's calculate the accuracy of the model's predictions on the validation set.

In [12]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
accuracy_NB = accuracy_score(y_val,y_pred)
print(f"Naive Bayes accuracy: {accuracy_NB}")
print(confusion_matrix(y_val, y_pred))


Naive Bayes accuracy: 0.85744
[[2621  464]
 [ 427 2738]]


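For comparison, train a decision tree classifier on the same features and evaluate it on the same validation set.

In [13]: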
from sklearn.tree import DecisionTreeClassifier
model_DT = DecisionTreeClassifier(random_state=42,max_depth=14, min_samples_leaf=8)
model_DT = model_DT.fit(X=X_train,y=y_train)
y_pred_DT = model_DT.predict(X_val)
accuracy_DT = accuracy_score(y_val,y_pred_DT)
print(f"Decision Trees accuracy: {accuracy_DT}")
print(confusion_matrix(y_val, y_pred_DT))

Decision Trees accuracy: 0.72576
[[1927 1158]
 [ 556 2609]]
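The two confusion matrices above can be unpacked into per-class figures. The sketch below is plain Python with a hypothetical helper named `summarize`. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in the order 0 (negative), 1 (positive).

```python
def summarize(matrix):
    """Derive accuracy, precision and recall from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix   # rows: true 0 / true 1, columns: predicted 0 / 1
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_pos": tp / (tp + fp),   # predicted positives that are right
        "recall_pos": tp / (tp + fn),      # true positives that are found
        "recall_neg": tn / (tn + fp),      # true negatives that are found
    }

# Values copied from the outputs printed above.
nb_summary = summarize([[2621, 464], [427, 2738]])
dt_summary = summarize([[1927, 1158], [556, 2609]])

for name, summary in (("Naive Bayes", nb_summary), ("Decision tree", dt_summary)):
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

Read this way, most of the gap between the models comes from the negative class: the decision tree labels 1,158 of the 3,085 negative validation reviews as positive, against 464 for Naive Bayes.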


Now evaluate both models on ten additional reviews that were not part of the IMDB data.

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix

test_data = [
    "A true masterpiece if you enjoy three-hour-long naps.",
    "Fantastic CGI, almost as if they hired a toddler with crayons.",
    "Brilliant acting, especially if you're into wooden mannequins.",
    "The plot was so unpredictable; I fell asleep from the suspense.",
    "A must-watch for those who love movies that make absolutely no sense.",
    "Worst movie ever! I loved every excruciating minute of it.",
    "The director clearly aimed for 'so bad it's good' but landed on 'just plain terrible.'",
    "Incredible how every actor managed to forget their lines at the same time.",
    "I laughed so hard at the dialogues; I cried tears of regret.",
    "A cinematic experience you won't forget, no matter how desperately you try."
]

test_data_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = positive, 0 = negative; one label per review in test_data

# Vectorize the new reviews once with the already-fitted vectorizer, then reuse the result
test_features = vectorizer.transform(test_data).toarray()
test_pred_NB = model.predict(test_features)
test_pred_DT = model_DT.predict(test_features)

new_accuracy_NB = accuracy_score(test_data_labels, test_pred_NB)
new_accuracy_DT = accuracy_score(test_data_labels, test_pred_DT)

print(f"Naive Bayes accuracy: {new_accuracy_NB}")
print(confusion_matrix(test_data_labels, test_pred_NB))
print(f"Decision Trees accuracy: {new_accuracy_DT}")
print(confusion_matrix(test_data_labels, test_pred_DT))



Naive Bayes accuracy: 0.6
[[3 2]
 [2 3]]
Decision Trees accuracy: 0.7
[[2 3]
 [0 5]]


Group the ten test reviews by which of the two models classified them correctly.

In [16]:
def categorize_instances(y_true, y_pred_NB, y_pred_DT):
    correct_both = []
    incorrect_both = []
    NB_correct_DT_incorrect = []
    DT_correct_NB_incorrect = []

    for i in range(len(y_true)):
        if y_true[i] == y_pred_NB[i] and y_true[i] == y_pred_DT[i]:
            correct_both.append(i)
        elif y_true[i] != y_pred_NB[i] and y_true[i] != y_pred_DT[i]:
            incorrect_both.append(i)
        elif y_true[i] == y_pred_NB[i] and y_true[i] != y_pred_DT[i]:
            NB_correct_DT_incorrect.append(i)
        elif y_true[i] != y_pred_NB[i] and y_true[i] == y_pred_DT[i]:
            DT_correct_NB_incorrect.append(i)

    return correct_both, incorrect_both, NB_correct_DT_incorrect, DT_correct_NB_incorrect

correct_both, incorrect_both, NB_correct_DT_incorrect, DT_correct_NB_incorrect = categorize_instances(test_data_labels, test_pred_NB, test_pred_DT)

print("Instances correctly classified by both models:")
print(correct_both)

print("Instances incorrectly classified by both models:")
print(incorrect_both)

print("Instances correctly classified by Naive-Bayes Model and incorrectly by Decision Trees Model:")
print(NB_correct_DT_incorrect)

print("Instances correctly classified by Decision Trees Model and incorrectly by Naive-Bayes Model:")
print(DT_correct_NB_incorrect)

Instances correctly classified by both models:
[0, 1, 2, 5, 6]
Instances incorrectly classified by both models:
[7, 8]
Instances correctly classified by Naive-Bayes Model and incorrectly by Decision Trees Model:
[9]
Instances correctly classified by Decision Trees Model and incorrectly by Naive-Bayes Model:
[3, 4]
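The same grouping can be written more compactly with set operations. This is only a sketch of an alternative: the prediction lists below were reconstructed from the index lists printed above rather than re-run from the models, and the names `pred_NB`, `pred_DT` and `groups` are illustrative.

```python
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Reconstructed from the printed results, not produced by running the models here.
pred_NB = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
pred_DT = [1, 1, 1, 1, 1, 0, 0, 1, 1, 1]

# Indices each model classified correctly.
nb_right = {i for i, (t, p) in enumerate(zip(labels, pred_NB)) if t == p}
dt_right = {i for i, (t, p) in enumerate(zip(labels, pred_DT)) if t == p}
everything = set(range(len(labels)))

groups = {
    "both correct": sorted(nb_right & dt_right),
    "both incorrect": sorted(everything - nb_right - dt_right),
    "only NB correct": sorted(nb_right - dt_right),
    "only DT correct": sorted(dt_right - nb_right),
}

for name, indices in groups.items():
    print(f"{name}: {indices}")
```

The four groups are disjoint and together cover every index, which makes it easy to check that no review was dropped.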


    "Incredible how every actor managed to forget their lines at the same time.",
    "I laughed so hard at the dialogues; I cried tears of regret.",

    Words like "incredible" and "laughed" give the impression that these reviews are positive when in fact they are negative, which probably explains why both models misclassified them.

   "A cinematic experience you won't forget, no matter how desperately you try."
    
    Even to a human reader this review looks quite ambiguous, so it is understandable that the decision tree model got it wrong.

    "The plot was so unpredictable; I fell asleep from the suspense.",
    "A must-watch for those who love movies that make absolutely no sense.",

    Phrases like "fell asleep" and "no" can give a negative impression, which is probably why the Naive Bayes model misclassified these reviews.

    
    "A true masterpiece if you enjoy three-hour-long naps.",
    "Fantastic CGI, almost as if they hired a toddler with crayons.",
    "Brilliant acting, especially if you're into wooden mannequins.",
    "Worst movie ever! I loved every excruciating minute of it.",
    "The director clearly aimed for 'so bad it's good' but landed on 'just plain terrible.'",

    These are the reviews that both models classified correctly.
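The word-level explanations above can be illustrated with a small sketch of how a Naive Bayes model weighs individual words. Everything here is an assumption made for illustration: the toy corpus is invented, the helper names are hypothetical, and the scores are not taken from the trained model. The sketch uses additive (Laplace) smoothing with `alpha=1.0`, the same kind of smoothing Multinomial Naive Bayes applies.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews with labels (1 = positive, 0 = negative).
toy_corpus = [
    (["incredible", "acting"], 1),
    (["incredible", "story"], 1),
    (["laughed", "loved"], 1),
    (["boring", "story"], 0),
    (["terrible", "acting"], 0),
    (["forget", "boring"], 0),
]

vocab = sorted({word for tokens, _ in toy_corpus for word in tokens})

# Per-class feature totals, counting each word at most once per review.
counts = {0: Counter(), 1: Counter()}
for tokens, label in toy_corpus:
    counts[label].update(set(tokens))

def log_likelihood(word, label, alpha=1.0):
    """Smoothed log P(word | label)."""
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))

def log_odds(word):
    """Positive values push a review towards the positive class."""
    return log_likelihood(word, 1) - log_likelihood(word, 0)

# A sarcastic negative review reduced to the words the toy model knows.
sarcastic_review = ["incredible", "forget"]
score = sum(log_odds(word) for word in sarcastic_review if word in vocab)

for word in sarcastic_review:
    print(word, round(log_odds(word), 3))
print("total:", round(score, 3))
```

In this toy setting "incredible" appears only in positive reviews, so its weight outvotes the mildly negative "forget" and the total score is positive. With equal class priors, that would lead to a positive prediction for a review that is actually negative, which is the kind of mistake described above.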
    
    
    