# Goals and Overview

The Film Junky Union, a new edgy community for classic movie enthusiasts, is developing a system for filtering and categorizing movie reviews. The goal is to train a model to automatically detect negative reviews. You'll be using a dataset of IMBD movie reviews with polarity labelling to build a model for classifying positive and negative reviews. It will need to have an F1 score of at least 0.85.

After loading the necessary data and libraries, I will first inspect the dataset for any duplicates or missing values and address any issues that arise. An exploratory data analysis (EDA) will follow, including the presentation and analysis of relevant graphs. For the modeling phase, the data will be normalized before being divided into features and targets for both training and testing sets. A range of models will be trained and evaluated after vectorizing the data, and the final models will be applied to my own reviews.

The BERT model will not be used, due to the kernel crashing with every attempt.

# Project

## Initialization

In [None]:
import math

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import nltk
from nltk.corpus import stopwords

import spacy
from tqdm.auto import tqdm

import re

from sklearn.feature_extraction.text import CountVectorizer
import sklearn.metrics as metrics
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from lightgbm import LGBMClassifier

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'

%config InlineBackend.figure_format = 'retina'

plt.style.use('seaborn')

In [None]:
tqdm.pandas()

## Reading Data

In [None]:
df_reviews = pd.read_csv('/datasets/imdb_reviews.tsv', sep='\t', dtype={'votes': 'Int64'})

In [None]:
df_reviews.info()

In [None]:
df_reviews.sample(5)

In [None]:
df_reviews.isna().sum()

In [None]:
df_reviews.duplicated().sum()

The initial data appears to be viable, with no duplicate entries and only two missing votes and average ratings across 47,331 observations. The missing values will be retained as they do not impede the data analysis. Additionally, the column names and data types are accurate.

## Data Analysis

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(16, 8))

ax = axs[0]

dft1 = df_reviews[['tconst', 'start_year']].drop_duplicates() \
    ['start_year'].value_counts().sort_index()
dft1 = dft1.reindex(index=np.arange(dft1.index.min(), max(dft1.index.max(), 2021))).fillna(0)
dft1.plot(kind='bar', ax=ax)
ax.set_title('Number of Movies Over Years')

ax = axs[1]

dft2 = df_reviews.groupby(['start_year', 'pos'])['pos'].count().unstack()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)

dft2.plot(kind='bar', stacked=True, label='#reviews (neg, pos)', ax=ax)

dft2 = df_reviews['start_year'].value_counts().sort_index()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)
dft3 = (dft2/dft1).fillna(0)
axt = ax.twinx()
dft3.reset_index(drop=True).rolling(5).mean().plot(color='orange', label='reviews per movie (avg over 5 years)', ax=axt)

lines, labels = axt.get_legend_handles_labels()
ax.legend(lines, labels, loc='upper left')

ax.set_title('Number of Reviews Over Years')

fig.tight_layout()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(21, 9))

ax = axs[0]
dft = df_reviews.groupby('tconst')['review'].count() \
    .value_counts() \
    .sort_index()
dft.plot.bar(ax=ax)
ax.set_title('Bar Plot of #Reviews Per Movie')

ax = axs[1]
dft = df_reviews.groupby('tconst')['review'].count()
sns.kdeplot(dft, ax=ax)
ax.set_title('KDE Plot of #Reviews Per Movie')

fig.tight_layout()

Most movies receive between 0 and 3 reviews, with the number of films decreasing sharply as the number of reviews increases. Only a small subset of movies attract a significantly higher number of reviews, around 30. Both plots indicate that the distribution of reviews per movie is highly skewed. Most movies receive very few reviews, while a small number of movies receive a large number of reviews, as evidenced by the secondary peak around 30 reviews in the KDE plot. This increase in reviews may be due to the popularity of a movie.

In [None]:
df_reviews['pos'].value_counts()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

ax = axs[0]
dft = df_reviews.query('ds_part == "train"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The train set: distribution of ratings')

ax = axs[1]
dft = df_reviews.query('ds_part == "test"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The test set: distribution of ratings')

fig.tight_layout()

Both the training and test sets exhibit a bimodal distribution, with peaks at ratings 1 and 10, suggesting a polarized rating behavior where users tend to give extreme ratings. The consistency in rating distributions between the training and test sets indicates that the test set is representative of the training set. Additionally, there is a noticeable dip at rating 5 in both sets, indicating that users are less likely to give a neutral or middle-ground rating.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(24, 16), gridspec_kw=dict(width_ratios=(2, 1), height_ratios=(1, 1)))

ax = axs[0][0]

dft = df_reviews.query('ds_part == "train"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The train set: number of reviews of different polarities per year')

ax = axs[0][1]

dft = df_reviews.query('ds_part == "train"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', kernel='epa', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', kernel='epa', ax=ax)
ax.legend()
ax.set_title('The train set: distribution of different polarities per movie')

ax = axs[1][0]

dft = df_reviews.query('ds_part == "test"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The test set: number of reviews of different polarities per year')

ax = axs[1][1]

dft = df_reviews.query('ds_part == "test"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', kernel='epa', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', kernel='epa', ax=ax)
ax.legend()
ax.set_title('The test set: distribution of different polarities per movie')

fig.tight_layout()

Both the training and test sets exhibit similar distributions of positive and negative reviews. Negative reviews peak between 2 and 4, then sharply decline, lacking the secondary peak observed in positive reviews. This absence may be due to popular movies generating more interaction and discussion among fans. Positive reviews, after their initial peak, maintain higher and more consistent numbers than negative reviews, likely reflecting ongoing enthusiasm for more popular films.

## Testing Statistical Hypothesis

Composing an evaluation routine which can be used for all models in this project

In [None]:
def evaluate_model(model, train_features, train_target, test_features, test_target):
    
    eval_stats = {}
    
    fig, axs = plt.subplots(1, 3, figsize=(20, 6)) 
    
    for mode, features, target in (('train', train_features, train_target), ('test', test_features, test_target)):
        
        eval_stats[mode] = {}
    
        pred_target = model.predict(features)
        pred_proba = model.predict_proba(features)[:, 1]
        
        # F1
        f1_thresholds = np.arange(0, 1.01, 0.05)
        f1_scores = [metrics.f1_score(target, pred_proba>=threshold) for threshold in f1_thresholds]
        
        # ROC
        fpr, tpr, roc_thresholds = metrics.roc_curve(target, pred_proba)
        roc_auc = metrics.roc_auc_score(target, pred_proba)    
        eval_stats[mode]['ROC AUC'] = roc_auc

        # PRC
        precision, recall, pr_thresholds = metrics.precision_recall_curve(target, pred_proba)
        aps = metrics.average_precision_score(target, pred_proba)
        eval_stats[mode]['APS'] = aps
        
        if mode == 'train':
            color = 'blue'
        else:
            color = 'green'

        # F1 Score
        ax = axs[0]
        max_f1_score_idx = np.argmax(f1_scores)
        ax.plot(f1_thresholds, f1_scores, color=color, label=f'{mode}, max={f1_scores[max_f1_score_idx]:.2f} @ {f1_thresholds[max_f1_score_idx]:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(f1_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(f1_thresholds[closest_value_idx], f1_scores[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('threshold')
        ax.set_ylabel('F1')
        ax.legend(loc='lower center')
        ax.set_title(f'F1 Score') 

        # ROC
        ax = axs[1]    
        ax.plot(fpr, tpr, color=color, label=f'{mode}, ROC AUC={roc_auc:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(roc_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'            
            ax.plot(fpr[closest_value_idx], tpr[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.legend(loc='lower center')        
        ax.set_title(f'ROC Curve')
        
        # PRC
        ax = axs[2]
        ax.plot(recall, precision, color=color, label=f'{mode}, AP={aps:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(pr_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(recall[closest_value_idx], precision[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('recall')
        ax.set_ylabel('precision')
        ax.legend(loc='lower center')
        ax.set_title(f'PRC')        

        eval_stats[mode]['Accuracy'] = metrics.accuracy_score(target, pred_target)
        eval_stats[mode]['F1'] = metrics.f1_score(target, pred_target)
    
    df_eval_stats = pd.DataFrame(eval_stats)
    df_eval_stats = df_eval_stats.round(2)
    df_eval_stats = df_eval_stats.reindex(index=('Accuracy', 'F1', 'APS', 'ROC AUC'))
    
    print(df_eval_stats)
    
    return

We assume all models below accepts texts in lowercase and without any digits, punctuations marks etc.

In [None]:
def normalize_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return text

df_reviews['review_norm'] = df_reviews['review'].apply(normalize_text)

df_reviews['review_norm']

In [None]:
df_reviews_train = df_reviews.query('ds_part == "train"').copy()
df_reviews_test = df_reviews.query('ds_part == "test"').copy()

train_target = df_reviews_train['pos']
test_target = df_reviews_test['pos']

print(df_reviews_train.shape)
print(df_reviews_test.shape)

In [None]:
train_texts = df_reviews_train['review_norm']
test_texts = df_reviews_test['review_norm']

In [None]:
dummy_clf = DummyClassifier()
dummy_clf.fit(train_texts, train_target)

test_predictions = dummy_clf.predict(test_texts)

print(evaluate_model(dummy_clf, train_texts, train_target, test_texts, test_target))

The performance metrics for both the training and test sets, using a DummyClassifier, show that the classifier is essentially guessing. The accuracy, F1 score, APS (average precision score), and ROC AUC are all at 0.5 or 0.0, which indicates no better performance than random chance.

In [None]:
stop_words = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=stop_words)

train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

print('The TF-IDF matrix size:', train_vectors.shape)

In [None]:
model_1 = LogisticRegression()
model_1.fit(train_vectors, train_target)

In [None]:
evaluate_model(model_1, train_vectors, train_target, test_vectors, test_target)

Overall, the Logistic Regression model exhibits strong performance across all metrics, both on the training and test sets. The slight drop in performance on the test set compared to the training set suggests a small degree of overfitting, but the model still generalizes well to unseen data.

In [None]:
model_2 = DecisionTreeClassifier()
model_2.fit(train_vectors, train_target)

In [None]:
evaluate_model(model_2, train_vectors, train_target, test_vectors, test_target)

The Decision Tree Classifier shows signs of severe overfitting. While it performs perfectly on the training set, its performance drops significantly on the test set across all metrics.

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [None]:
def text_preprocessing_3(text):
    
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc]
    
    return ' '.join(tokens)

In [None]:
df_reviews_train['review_proc'] = df_reviews_train['review_norm'].apply(text_preprocessing_3)
df_reviews_test['review_proc'] = df_reviews_test['review_norm'].apply(text_preprocessing_3)

In [None]:
stop_words = set(stopwords.words('english'))
vectorizer_2 = TfidfVectorizer(stop_words=stop_words)

spacy_train_texts = df_reviews_train['review_proc']
spacy_test_texts = df_reviews_test['review_proc']

spacy_train_vectors = vectorizer_2.fit_transform(spacy_train_texts)
spacy_test_vectors = vectorizer_2.transform(spacy_test_texts)

In [None]:
model_3 = LogisticRegression()
model_3.fit(spacy_train_vectors, train_target)

In [None]:
evaluate_model(model_3, spacy_train_vectors, train_target, spacy_test_vectors, test_target)

The combination of spaCy, TF-IDF, and Logistic Regression exhibits strong performance across all metrics on both the training and test sets, identical to the results of NLTK, TF-IDF and LR.

In [None]:
model_4 = LGBMClassifier()
model_4.fit(spacy_train_vectors, train_target)

In [None]:
evaluate_model(model_4, spacy_train_vectors, train_target, spacy_test_vectors, test_target)

the combination of spaCy, TF-IDF, and LGBMClassifier performs well across all metrics on both the training and test sets, but performs slightly worse than the Logistic Regression Model.

In [None]:
model_5 = RandomForestClassifier()
model_5.fit(spacy_train_vectors, train_target)

In [None]:
evaluate_model(model_5, spacy_train_vectors, train_target, spacy_test_vectors, test_target)

The spaCy, TF-IDF, and Random Forest Classifier (RFC) model exhibits perfect performance on the training set but a considerable drop on the test set, indicating overfitting. The scores for the test data are close to the LGBMClassifier, but are slightly worse and do not meet the target metric.

In [None]:
# feel free to completely remove these reviews and try your models on your own reviews, those below are just examples

my_reviews = pd.DataFrame([
    'I did not simply like it, not my kind of movie.',
    'Well, I was bored and felt asleep in the middle of the movie.',
    'I was really fascinated with the movie',    
    'Even the actors looked really old and disinterested, and they got paid to be in the movie. What a soulless cash grab.',
    'I didn\'t expect the reboot to be so good! Writers really cared about the source material',
    'The movie had its upsides and downsides, but I feel like overall it\'s a decent flick. I could see myself going to see it again.',
    'What a rotten attempt at a comedy. Not a single joke lands, everyone acts annoying and loud, even kids won\'t like this!',
    'Launching on Netflix was a brave move & I really appreciate being able to binge on episode after episode, of this exciting intelligent new drama.'
    'Absolutely fantastic! One of the best movies I\'ve seen in years. Highly recommend it to everyone.',
    'It was a total waste of time. The plot was predictable, and the acting was subpar.',
    'An emotional rollercoaster with a gripping storyline. It kept me hooked from start to finish!',
    'The special effects were amazing, but the story was completely forgettable. Not worth watching.',
    'A brilliant film that combines action and drama perfectly. The characters are well-developed and engaging.',
    'I didn\'t connect with the movie at all. It felt like a long, tedious journey with no payoff.',
    'The soundtrack was incredible and really enhanced the viewing experience. The movie itself was enjoyable too.',
    'I was disappointed. The movie had so much potential but failed to deliver on its promises.',
    'A visually stunning masterpiece. The cinematography alone makes it worth watching.',
    'The movie was too slow and dragged on. It could have been much better with a tighter script.',
    'An outstanding performance by the lead actor. The movie is a must-watch just for his acting alone.',
    'Mediocre at best. Some interesting moments, but overall not worth the hype.',
    'A delightful surprise! The movie was heartwarming and full of charm. I\'ll definitely watch it again.',
    'A flat-out disaster. Poor direction, lackluster acting, and a nonsensical plot.',
    'A feel-good film with a strong message. It\'s a great watch for the whole family.'
], columns=['review'])

my_reviews['review_norm'] = my_reviews['review'].apply(normalize_text)

my_reviews

In [None]:
texts = my_reviews['review_norm']
new_vectors = vectorizer.transform(texts)

pred = model_1.predict(new_vectors)

my_reviews_pred_prob = model_1.predict_proba(new_vectors)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

The combination of NLTK, TF-IDF and LR produces decent results, predicting only review 5 incorrectly.

In [None]:
my_reviews_pred_prob = model_2.predict_proba(new_vectors)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

The results of NLTK, TF-IDF and DTC, are largly incorrect. It scores most reviews as positive, when that is not always the case.

In [None]:
my_reviews['review_proc'] = my_reviews['review_norm'].apply(text_preprocessing_3)
texts = my_reviews['review_proc']
new_vectors_2 = vectorizer_2.transform(texts)

In [None]:
my_reviews_pred_prob = model_3.predict_proba(new_vectors_2)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

The results of spaCy, TF-IDF and LR are not as good as the results that used NLTK and LR instead. It still performs decently.

In [None]:
my_reviews_pred_prob = model_4.predict_proba(new_vectors_2)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

The results of spaCy, TF-IDF and LGBMClassifier are generally good, only missing the first review.

In [None]:
my_reviews_pred_prob = model_5.predict_proba(new_vectors_2)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

The results of spaCy, TF-IDF and RFC are poor, as it labels most reviews as positive when that is not always the case.

## Conclusion

Model 1 and Model 3 are recommended for their consistent performance and good generalization ability. Model 4 is also a solid choice but slightly less accurate. Models 2 and 5, while having perfect training metrics, are less reliable for new data due to overfitting.

Model 1 (NLTK, TF-IDF, LR):

Train Accuracy: 94% Test Accuracy: 88% F1 Score: 94% (train) / 88% (test) APS: 98% (train) / 95% (test) ROC AUC: 98% (train) / 95% (test) This model shows very strong performance on both training and test data, with high accuracy, F1 score, APS, and ROC AUC values. It suggests that the model is well-calibrated and generalizes well to unseen data. When applied to my reviews this model does very well, being inaccurate for only 4 of 22 reviews.

Model 2 (NLTK, TF-IDF, DTC):

Train Accuracy: 100% Test Accuracy: 71% F1 Score: 100% (train) / 71% (test) APS: 100% (train) / 65% (test) ROC AUC: 100% (train) / 71% (test) This model performs exceptionally well on the training data but shows a significant drop in test accuracy and other metrics. It appears to overfit the training data, leading to poor generalization on new data. Despite the low test scores, when applied to my reviews this model performs well, being inaccurate for only 4 of 22 reviews.

Model 3 (spaCy, TF-IDF, LR):

Train Accuracy: 93% Test Accuracy: 88% F1 Score: 93% (train) / 88% (test) APS: 98% (train) / 95% (test) ROC AUC: 98% (train) / 95% (test) This model performs similarly to Model 1, with strong accuracy and F1 scores on both training and test sets. Its performance is consistent, indicating good generalization. When applied to my reviews this model performs well, being inaccurate for only 4 of 22 reviews.

Model 4 (spaCy, TF-IDF, LGBMClassifier):

Train Accuracy: 91% Test Accuracy: 86% F1 Score: 91% (train) / 86% (test) APS: 97% (train) / 93% (test) ROC AUC: 97% (train) / 93% (test) This model shows slightly lower performance compared to Models 1 and 3 but still maintains good accuracy and F1 scores. It demonstrates solid performance and generalization capability. Despite the low test scores, when applied to my reviews this model performs well, being inaccurate for only 4 of 22 reviews.

Model 5 (spaCy, TF-IDF, RFC):

Train Accuracy: 100% Test Accuracy: 84% F1 Score: 100% (train) / 84% (test) APS: 100% (train) / 91% (test) ROC AUC: 100% (train) / 92% (test) This model, like Model 2, shows perfect training performance but a notable drop in test metrics, indicating potential overfitting. However, its test performance is better than Model 2's. When applied to my reviews this model performs the worst of all models, being inaccurate for 5 of 22 reviews.

Ultimately, while all models perform well on my reviews, the NLTK, TF-IDF and LR, and spaCy, TF-IDF and LR models do best on the test da are identical with the training set, when applied to the additional reviews, the combination of NLTK, TF-IDF and LR performs better when assessing reviews as positive or negative.