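The usual preliminaries: install the libraries, load the IMDB training data, vectorize the reviews, and split off a validation set.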


In [2]:
pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [4]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']



In [5]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=1000,lowercase=True,stop_words='english',ngram_range=(1,2))
features = vectorizer.fit_transform(train_data).toarray()

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

We will use three models for the classification: Multinomial Naive Bayes, a Decision Tree, and a Random Forest. First, create the models.

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
model_nb = MultinomialNB()
model_dt = DecisionTreeClassifier()
model_rf = RandomForestClassifier()

Train the models.

In [9]:
model_nb = model_nb.fit(X=X_train,y=y_train)
model_dt = model_dt.fit(X=X_train,y=y_train)
model_rf = model_rf.fit(X=X_train,y=y_train)

Test the models on the validation set.

In [10]:
y_pred_nb = model_nb.predict(X_val)
y_pred_dt = model_dt.predict(X_val)
y_pred_rf = model_rf.predict(X_val)

Now let's calculate the accuracy of the models' predictions on the validation set.

In [11]:
from sklearn.metrics import accuracy_score, confusion_matrix
print("Naive Bayes", accuracy_score(y_val,y_pred_nb))
print(confusion_matrix(y_val,y_pred_nb))
print()
print("Decision Tree", accuracy_score(y_val,y_pred_dt))
print(confusion_matrix(y_val,y_pred_dt))
print()
print("Random Forest", accuracy_score(y_val,y_pred_rf))
print(confusion_matrix(y_val,y_pred_rf))

Naive Bayes 0.81728
[[2463  622]
 [ 520 2645]]

Decision Tree 0.70048
[[2192  893]
 [ 979 2186]]

Random Forest 0.81344
[[2536  549]
 [ 617 2548]]
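
As a sanity check on the numbers above, each accuracy can be recomputed from its confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. The helper name accuracy_from_confusion is made up for this illustration.

```python
def accuracy_from_confusion(matrix):
    """Return correct / total for a square confusion matrix given as nested lists."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Confusion matrices copied from the output above
# (rows = true label, columns = predicted label).
matrices = {
    "Naive Bayes": [[2463, 622], [520, 2645]],
    "Decision Tree": [[2192, 893], [979, 2186]],
    "Random Forest": [[2536, 549], [617, 2548]],
}

for name, matrix in matrices.items():
    print(name, accuracy_from_confusion(matrix))
```

The recomputed values should agree with the accuracy_score output, since both count the same correct predictions out of the 6,250 validation reviews.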


Now create the voting ensemble.
To implement it, the sklearn.ensemble module is helpful: it provides a VotingClassifier.

In [12]:
from sklearn.ensemble import VotingClassifier

voting_ensemble = VotingClassifier(
    estimators=[('nb', model_nb), ('dt', model_dt), ('rf', model_rf)],
    voting='hard'  # 'hard' for majority voting, 'soft' for averaging probabilities
)
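
The comment in the cell above mentions both 'hard' and 'soft' voting. The following is a simplified sketch of the difference for a single review, not scikit-learn's implementation. The function names and the three probabilities are invented for illustration; in scikit-learn, soft voting averages the predict_proba outputs of the estimators and picks the class with the highest average.

```python
def hard_vote(probabilities, threshold=0.5):
    """Each model casts one vote for class 1 if its probability reaches the threshold."""
    votes = [1 if p >= threshold else 0 for p in probabilities]
    return 1 if sum(votes) > len(votes) / 2 else 0

def soft_vote(probabilities, threshold=0.5):
    """Average the class-1 probabilities first, then apply the threshold once."""
    mean = sum(probabilities) / len(probabilities)
    return 1 if mean >= threshold else 0

# Hypothetical P(positive) from three models for one review:
# one model is very confident, the other two lean slightly negative.
p_positive = [0.9, 0.4, 0.4]

print(hard_vote(p_positive))  # 0: two of the three models vote negative
print(soft_vote(p_positive))  # 1: the average probability is about 0.57
```

Hard voting ignores how confident each model is, which is why the two schemes can disagree on the same inputs.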

Next, we need to fit the new ensemble on the training data.

In [13]:
voting_ensemble.fit(X_train, y_train)

With the ensemble fitted, we can get its predictions on the validation set.

In [14]:
y_pred_ensemble = voting_ensemble.predict(X_val)

Next up is printing the accuracy score and confusion matrix based on the ensemble's predictions on the validation set.

In [15]:
ensemble_accuracy = accuracy_score(y_val, y_pred_ensemble)
ensemble_conf_matrix = confusion_matrix(y_val, y_pred_ensemble)

print("Ensemble Accuracy:", ensemble_accuracy)
print("Ensemble Confusion Matrix:\n", ensemble_conf_matrix)

Ensemble Accuracy: 0.8224
Ensemble Confusion Matrix:
 [[2525  560]
 [ 550 2615]]


Previously, our best-performing model was Naive Bayes, with an accuracy of 0.81728. The ensemble improves on this with an accuracy of 0.8224, which amounts to 32 more correct predictions out of the 6,250 validation reviews. This modest gain is consistent with our approach, where each prediction is the label that at least 2 out of the 3 models agree on for a review in the validation set.
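
To make the 2-out-of-3 rule concrete, here is a minimal sketch of majority voting over per-review predictions. The labels and the three prediction lists are invented for illustration, and the helper names are hypothetical. The example is built so that each model errs on a different review, which is the situation where voting helps most.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine equal-length prediction lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and predictions, invented for illustration only.
y_true  = [1, 0, 1, 0, 1, 0]
model_a = [0, 0, 1, 0, 1, 0]  # wrong on sample 0
model_b = [1, 1, 1, 0, 1, 0]  # wrong on sample 1
model_c = [1, 0, 0, 0, 1, 0]  # wrong on sample 2

ensemble = majority_vote(model_a, model_b, model_c)

for name, pred in [("A", model_a), ("B", model_b), ("C", model_c), ("vote", ensemble)]:
    print(name, accuracy(y_true, pred))
```

Each individual model scores 5/6 here, while the vote recovers every label because no two models are wrong on the same sample.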

Following on from this, we want to test the voting ensemble on the difficult test set used in A3.

In [16]:
from sklearn.metrics import accuracy_score, confusion_matrix

test_data = [
    "A true masterpiece if you enjoy three-hour-long naps.",
    "Fantastic CGI, almost as if they hired a toddler with crayons.",
    "Brilliant acting, especially if you're into wooden mannequins.",
    "The plot was so unpredictable; I fell asleep from the suspense.",
    "A must-watch for those who love movies that make absolutely no sense.",
    "Worst movie ever! I loved every excruciating minute of it.",
    "The director clearly aimed for 'so bad it's good' but landed on 'just plain terrible.'",
    "Incredible how every actor managed to forget their lines at the same time.",
    "I laughed so hard at the dialogues; I cried tears of regret.",
    "A cinematic experience you won't forget, no matter how desperately you try."
]

test_data_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
test_vectorizer = vectorizer.transform(test_data).toarray()
ensemble_test_pred = voting_ensemble.predict(test_vectorizer)
new_accuracy = accuracy_score(test_data_labels, ensemble_test_pred)
print(f"Ensemble Accuracy Score: {new_accuracy}")
print(confusion_matrix(test_data_labels, ensemble_test_pred))

Ensemble Accuracy Score: 0.6
[[2 3]
 [1 4]]


This is by no means an excellent score, but we should compare it with how the individual models performed on their own.

In [17]:
nb_test_pred = model_nb.predict(test_vectorizer)
dt_test_pred = model_dt.predict(test_vectorizer)
rf_test_pred = model_rf.predict(test_vectorizer)
print("Naive-Bayes Accuracy:")
print(accuracy_score(test_data_labels, nb_test_pred))
print("Decision Trees Accuracy:")
print(accuracy_score(test_data_labels, dt_test_pred))
print("Random Forests Accuracy:")
print(accuracy_score(test_data_labels, rf_test_pred))

Naive-Bayes Accuracy:
0.5
Decision Trees Accuracy:
0.6
Random Forests Accuracy:
0.6


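We see here, with a very small sample size, that the ensemble does help relative to Naive Bayes (0.6 against 0.5), although it only matches the Decision Tree and Random Forest models. This is consistent with the correct predictions from the DT and RF models outvoting the incorrect predictions from the NB model.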
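
To put "very small sample size" in numbers: with ten reviews, each one moves the accuracy by 0.1. The sketch below assumes, purely for illustration, a classifier that guesses every label with a fair coin flip, and asks how often it would reach 6 or more correct out of 10. The helper name prob_at_least is made up for this example.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that pure guessing gets at least 6 of 10 reviews right.
print(prob_at_least(6, 10))  # 0.376953125
```

Under that assumption, a score of 0.6 would occur by chance more than a third of the time, so differences of one review between models on this test set should be read with caution.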

Next, we are going to compare the Decision Tree model with our ensemble on our difficult test set.

In [18]:
print("Ensemble Conf Matrix:")
print(confusion_matrix(test_data_labels, ensemble_test_pred))
print("Decision Trees Conf Matrix:")
print(confusion_matrix(test_data_labels, dt_test_pred))

Ensemble Conf Matrix:
[[2 3]
 [1 4]]
Decision Trees Conf Matrix:
[[2 3]
 [1 4]]


As we can see, the confusion matrices are exactly the same, which suggests that the ensemble and the Decision Tree model may be going wrong on the same cases. Identical matrices only guarantee the same number of errors per class, though, not that the errors fall on the same reviews. Let's also compare the ensemble with the Random Forest model, which had the same accuracy as the Decision Tree model, to see whether it gives us any other insights.
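
A note of caution when reading matching confusion matrices: they count errors per class but do not record which samples were wrong. The toy sketch below uses invented labels and predictions, and a hypothetical helper confusion_counts that follows the same rows-are-true-labels layout as scikit-learn, to show two models with identical matrices that disagree on every sample.

```python
def confusion_counts(y_true, y_pred):
    """2x2 confusion matrix as nested lists: rows = true label, columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Invented data for illustration only.
y_true  = [0, 0, 1, 1]
model_a = [1, 0, 1, 0]  # wrong on samples 0 and 3
model_b = [0, 1, 0, 1]  # wrong on samples 1 and 2

print(confusion_counts(y_true, model_a))  # [[1, 1], [1, 1]]
print(confusion_counts(y_true, model_b))  # [[1, 1], [1, 1]]
print([a == b for a, b in zip(model_a, model_b)])  # all False
```

This is why a per-review comparison is needed before concluding that two models fail on the same cases.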

In [19]:
print("Ensemble Conf Matrix:")
print(confusion_matrix(test_data_labels, ensemble_test_pred))
print("Random Forest Conf Matrix:")
print(confusion_matrix(test_data_labels, rf_test_pred))

Ensemble Conf Matrix:
[[2 3]
 [1 4]]
Random Forest Conf Matrix:
[[2 3]
 [1 4]]


This shows us that the confusion matrices of our ensemble and our Random Forest model are also identical. Looking at the individual cases might shed some light on the issue.

In [21]:
def categorize_instances(y_true, y_pred_rf, y_pred_ensemble):
    """Group sample indices by which of the two models predicted them correctly."""
    both_correct = []
    both_incorrect = []
    model1_correct_only = []  # Random Forest right, ensemble wrong
    model2_correct_only = []  # ensemble right, Random Forest wrong
    for i in range(len(y_true)):
        rf_correct = y_true[i] == y_pred_rf[i]
        ensemble_correct = y_true[i] == y_pred_ensemble[i]
        if rf_correct and ensemble_correct:
            both_correct.append(i)
        elif not rf_correct and not ensemble_correct:
            both_incorrect.append(i)
        elif rf_correct:
            model1_correct_only.append(i)
        else:
            model2_correct_only.append(i)
    return both_correct, both_incorrect, model1_correct_only, model2_correct_only

both_correct, both_incorrect, model1_correct_only, model2_correct_only = categorize_instances(test_data_labels, rf_test_pred, ensemble_test_pred)
print("Reviews where both models were correct:\n")
for index in both_correct:
    print(test_data[index])
print("\nReviews where the Random Forest model predicted correctly and the Ensemble predicted incorrectly:\n")
for index in model1_correct_only:
    print(test_data[index])
print("\nReviews where the Ensemble predicted correctly and the Random Forest model predicted incorrectly:\n")
for index in model2_correct_only:
    print(test_data[index])
print("\nReviews where both models predicted incorrectly:\n")
for index in both_incorrect:
    print(test_data[index])

Reviews where both models were correct:

A true masterpiece if you enjoy three-hour-long naps.
Fantastic CGI, almost as if they hired a toddler with crayons.
Brilliant acting, especially if you're into wooden mannequins.
A must-watch for those who love movies that make absolutely no sense.
Worst movie ever! I loved every excruciating minute of it.
The director clearly aimed for 'so bad it's good' but landed on 'just plain terrible.'

Reviews where the Random Forest model predicted correctly and the Ensemble predicted incorrectly:


Reviews where the Ensemble predicted correctly and the Random Forest model predicted incorrectly:


Reviews where both models predicted incorrectly:

The plot was so unpredictable; I fell asleep from the suspense.
Incredible how every actor managed to forget their lines at the same time.
I laughed so hard at the dialogues; I cried tears of regret.
A cinematic experience you won't forget, no matter how desperately you try.


As we can see, the ensemble and the Random Forest model make the same prediction on every one of these reviews. This is intriguing, as it suggests that our voting method adds nothing over the Random Forest on this test set. One possible explanation is that the individual models largely agree on these reviews, getting the same ones right and the same ones wrong, so the majority vote rarely differs from a single model's prediction and does little to improve the overall accuracy.
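
One way to test that explanation is to look at the individual votes inside the ensemble. Note that VotingClassifier fits its own clones of the estimators passed to it (stored in estimators_), so the sub-models that vote are not the same fitted objects as the standalone models trained earlier. The sketch below uses small synthetic count data in place of the review features, so its numbers say nothing about the IMDB results; it only shows that, with voting='hard', transform returns one column of predicted labels per sub-model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic count features standing in for bag-of-words vectors (illustration only).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X[:150], y[:150])

# With voting="hard", transform() has shape (n_samples, n_classifiers):
# one column of predicted labels per fitted sub-model, in the order given above.
votes = ensemble.transform(X[150:])
final = ensemble.predict(X[150:])

# A sample is unanimous when all three sub-models cast the same vote.
unanimous = (votes == votes[:, [0]]).all(axis=1)
print("per-model votes for the first five samples:\n", votes[:5])
print("share of unanimous samples:", unanimous.mean())
```

Applied to the fitted voting_ensemble and the vectorized test reviews, the same transform call would show, review by review, whether the three sub-models actually disagree or whether the vote is unanimous most of the time.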