# Movie Review NLP Sentiment Analysis

The Film Junky Union, a new edgy community for classic movie enthusiasts, is developing a system for filtering and categorizing movie reviews. The goal is to train a model to automatically detect negative reviews. You'll be using a dataset of IMDB movie reviews with polarity labelling to build a model for classifying positive and negative reviews. It will need to have an F1 score of at least 0.85.

## Initialization

In [22]:
import math

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

from tqdm.auto import tqdm

In [23]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'
# the next line provides graphs of better quality on HiDPI screens
%config InlineBackend.figure_format = 'retina'

plt.style.use('seaborn-v0_8')

In [24]:
# this is to use progress_apply, read more at https://pypi.org/project/tqdm/#pandas-integration
tqdm.pandas()

## Load Data

In [25]:
df_reviews = pd.read_csv('../data/imdb_reviews.tsv', sep='\t', dtype={'votes': 'Int64'})

In [26]:
df_reviews.info()  # info() prints its summary directly and returns None
display(df_reviews.head())
print()
missing_df = (
    df_reviews.isnull()
    .mean()
    .mul(100)
    .round(4)
    .reset_index()
    .rename(columns={'index': 'column', 0: 'percent_missing'})
    .sort_values(by='percent_missing', ascending=False)
)


display(missing_df)
print()
df_reviews = df_reviews.dropna().reset_index()
print(f'After dropping, there are now {df_reviews.isna().sum().sum()} missing values.')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47331 entries, 0 to 47330
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tconst           47331 non-null  object 
 1   title_type       47331 non-null  object 
 2   primary_title    47331 non-null  object 
 3   original_title   47331 non-null  object 
 4   start_year       47331 non-null  int64  
 5   end_year         47331 non-null  object 
 6   runtime_minutes  47331 non-null  object 
 7   is_adult         47331 non-null  int64  
 8   genres           47331 non-null  object 
 9   average_rating   47329 non-null  float64
 10  votes            47329 non-null  Int64  
 11  review           47331 non-null  object 
 12  rating           47331 non-null  int64  
 13  sp               47331 non-null  object 
 14  pos              47331 non-null  int64  
 15  ds_part          47331 non-null  object 
 16  idx              47331 non-null  int64  
dtypes: Int64(1),

Unnamed: 0,tconst,title_type,primary_title,original_title,start_year,end_year,runtime_minutes,is_adult,genres,average_rating,votes,review,rating,sp,pos,ds_part,idx
0,tt0068152,movie,$,$,1971,\N,121,0,"Comedy,Crime,Drama",6.3,2218,The pakage implies that Warren Beatty and Gold...,1,neg,0,train,8335
1,tt0068152,movie,$,$,1971,\N,121,0,"Comedy,Crime,Drama",6.3,2218,How the hell did they get this made?! Presenti...,1,neg,0,train,8336
2,tt0313150,short,'15','15',2002,\N,25,0,"Comedy,Drama,Short",6.3,184,There is no real story the film seems more lik...,3,neg,0,test,2489
3,tt0313150,short,'15','15',2002,\N,25,0,"Comedy,Drama,Short",6.3,184,Um .... a serious film about troubled teens in...,7,pos,1,test,9280
4,tt0313150,short,'15','15',2002,\N,25,0,"Comedy,Drama,Short",6.3,184,I'm totally agree with GarryJohal from Singapo...,9,pos,1,test,9281





Unnamed: 0,column,percent_missing
10,votes,0.0042
9,average_rating,0.0042
0,tconst,0.0
1,title_type,0.0
2,primary_title,0.0
5,end_year,0.0
6,runtime_minutes,0.0
3,original_title,0.0
4,start_year,0.0
8,genres,0.0



After dropping, there are now 0 missing values.


__COMMENTARY:__
- All column names are formatted correctly.
- No apparent issues in the data preview from `.head()`.
- Missing values are confined to the `average_rating` and `votes` columns, each missing in about 0.004% of rows (less than one-hundredth of a percent).
    - All rows with missing values were dropped and the result was verified.
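
The missing-value share quoted above can be reproduced from the row counts in the `info()` output (47,331 rows in total, 47,329 non-null in each affected column). A minimal arithmetic sketch:

```python
# Row counts taken from the df_reviews.info() output above.
total_rows = 47331
non_null_rows = 47329  # same count for average_rating and votes

missing_rows = total_rows - non_null_rows
percent_missing = round(missing_rows / total_rows * 100, 4)

print(missing_rows, percent_missing)
```

Only two rows are affected per column, so dropping them has a negligible effect on the size of the dataset.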

In [27]:
duplicates_df = (
    df_reviews.duplicated()
    .mean() * 100
)

print(f'Percent of duplicated rows in dataset: {duplicates_df:.2f}%')


Percent of duplicated rows in dataset: 0.00%


__COMMENTARY:__
- Duplicate rows were checked; none were found in the dataset.
- Data is ready for EDA.

## EDA

Let's check the number of movies and reviews over the years.

In [28]:
fig, axs = plt.subplots(2, 1, figsize=(16, 8))

ax = axs[0]

dft1 = df_reviews[['tconst', 'start_year']].drop_duplicates() \
    ['start_year'].value_counts().sort_index()
dft1 = dft1.reindex(index=np.arange(dft1.index.min(), max(dft1.index.max(), 2021))).fillna(0)
dft1.plot(kind='bar', ax=ax)
ax.set_title('Number of Movies Over Years')

ax = axs[1]

dft2 = df_reviews.groupby(['start_year', 'pos'])['pos'].count().unstack()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)

dft2.plot(kind='bar', stacked=True, label='#reviews (neg, pos)', ax=ax)

dft2 = df_reviews['start_year'].value_counts().sort_index()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)
dft3 = (dft2/dft1).fillna(0)
axt = ax.twinx()
dft3.reset_index(drop=True).rolling(5).mean().plot(color='orange', label='reviews per movie (avg over 5 years)', ax=axt)

lines, labels = axt.get_legend_handles_labels()
ax.legend(lines, labels, loc='upper left')

ax.set_title('Number of Reviews Over Years')

fig.tight_layout()

__COMMENTARY:__
- Overall, the distribution of new movies released over the years makes sense: technological constraints and the novelty of motion pictures limited production volume until the late 1980s/early 1990s, when special effects and CGI slowly started to become the norm.
- One interesting feature of the top chart is the gap in the number of movies produced during the 1960s. Some data may be missing there, as it does not make sense for new movies to drop off sharply at the end of the 1950s and then promptly spike back up in the 1970s.
- The bottom chart, which shows the number of reviews per year and the rolling average of reviews per movie, suggests that engagement is fairly consistent over time.
- There also appears to be a relatively even split between positive and negative reviews.

Let's check the distribution of the number of reviews per movie, using both exact counts and a KDE (just to see how the smoothed estimate differs from the exact counts).

In [29]:
fig, axs = plt.subplots(1, 2, figsize=(16, 5))

ax = axs[0]
dft = df_reviews.groupby('tconst')['review'].count() \
    .value_counts() \
    .sort_index()
dft.plot.bar(ax=ax)
ax.set_title('Bar Plot of #Reviews Per Movie')

ax = axs[1]
dft = df_reviews.groupby('tconst')['review'].count()
sns.kdeplot(dft, ax=ax)
ax.set_title('KDE Plot of #Reviews Per Movie')

fig.tight_layout()

__COMMENTARY:__
- The left bar chart shows that the vast majority of the movies have few reviews (under 5), although there is a long tail going up to 30 reviews per movie.
- The tail can be explained by the occurrence of generational blockbuster movies that are critically acclaimed and widely commented on.
- The right chart showing the KDE aligns with the above statements pertaining to the bar chart, but is expressing the data in a smoother curve.

In [30]:
df_reviews['pos'].value_counts()

pos
0    23715
1    23614
Name: count, dtype: int64
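
To put the class counts above on a relative scale, the share of each class can be computed directly. This is a small sketch using only the two counts printed above:

```python
# Counts copied from the value_counts() output above.
neg, pos = 23715, 23614

total = neg + pos
neg_share = neg / total
pos_share = pos / total

print(f"negative: {neg_share:.4f}, positive: {pos_share:.4f}")
```

The two shares differ by well under one percentage point, so no resampling or class weighting appears necessary.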

In [31]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

ax = axs[0]
dft = df_reviews.query('ds_part == "train"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The train set: distribution of ratings')

ax = axs[1]
dft = df_reviews.query('ds_part == "test"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The test set: distribution of ratings')

fig.tight_layout()

__COMMENTARY:__
- In matters of preference, this distribution makes sense: opinions expressed in a review tend to be polarized in the first place, since the viewer either liked the movie or did not.
- Significantly fewer ratings fall in the middle, so there is little danger of blurred sentiment during model training.
- It is reassuring that the training and test sets show the same general spread. This suggests that a model trained on the training set should carry over well to the test set, once a model that captures the patterns is selected.

Let's check the distribution of negative and positive reviews over the years for the two parts of the dataset.

In [32]:
fig, axs = plt.subplots(2, 2, figsize=(16, 8), gridspec_kw=dict(width_ratios=(2, 1), height_ratios=(1, 1)))

ax = axs[0][0]

dft = df_reviews.query('ds_part == "train"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The train set: number of reviews of different polarities per year')

ax = axs[0][1]

dft = df_reviews.query('ds_part == "train"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', ax=ax)
ax.legend()
ax.set_title('The train set: distribution of different polarities per movie')

ax = axs[1][0]

dft = df_reviews.query('ds_part == "test"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The test set: number of reviews of different polarities per year')

ax = axs[1][1]

dft = df_reviews.query('ds_part == "test"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', ax=ax)
ax.legend()
ax.set_title('The test set: distribution of different polarities per movie')

fig.tight_layout()

__COMMENTARY:__
- The distribution of reviews by year is consistent between the training and test sets, with reviews increasing sharply after 1990.
- The training set includes slightly more older reviews (pre-1980), while the test set shows a stronger peak in the 2000s.
- When looking at reviews per movie by polarity, both sets confirm that most movies receive only a handful of reviews, with a slight dominance of negative reviews in this low-review region.
- The training set contains a few more movies with large numbers of positive reviews (20+), whereas the test set shows a sharper peak in the 1–2 negative review range.

## Evaluation Procedure

The following evaluation routine is used for all models in this project.

In [33]:
import sklearn.metrics as metrics

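def evaluate_model(model, train_features, train_target, test_features, test_target):
    """
    Print Accuracy, F1, APS and ROC AUC for the train and test sets, and plot
    the F1-vs-threshold, ROC and precision-recall curves for both.
    The model must implement predict() and predict_proba().
    """

    eval_stats = {}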
    
    fig, axs = plt.subplots(1, 3, figsize=(20, 6)) 
    
    for type, features, target in (('train', train_features, train_target), ('test', test_features, test_target)):
        
        eval_stats[type] = {}
    
        pred_target = model.predict(features)
        pred_proba = model.predict_proba(features)[:, 1]
        
        # F1
        f1_thresholds = np.arange(0, 1.01, 0.05)
        f1_scores = [metrics.f1_score(target, pred_proba>=threshold) for threshold in f1_thresholds]
        
        # ROC
        fpr, tpr, roc_thresholds = metrics.roc_curve(target, pred_proba)
        roc_auc = metrics.roc_auc_score(target, pred_proba)    
        eval_stats[type]['ROC AUC'] = roc_auc

        # PRC
        precision, recall, pr_thresholds = metrics.precision_recall_curve(target, pred_proba)
        aps = metrics.average_precision_score(target, pred_proba)
        eval_stats[type]['APS'] = aps
        
        if type == 'train':
            color = 'blue'
        else:
            color = 'green'

        # F1 Score
        ax = axs[0]
        max_f1_score_idx = np.argmax(f1_scores)
        ax.plot(f1_thresholds, f1_scores, color=color, label=f'{type}, max={f1_scores[max_f1_score_idx]:.2f} @ {f1_thresholds[max_f1_score_idx]:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(f1_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(f1_thresholds[closest_value_idx], f1_scores[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('threshold')
        ax.set_ylabel('F1')
        ax.legend(loc='lower center')
        ax.set_title(f'F1 Score') 

        # ROC
        ax = axs[1]    
        ax.plot(fpr, tpr, color=color, label=f'{type}, ROC AUC={roc_auc:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(roc_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'            
            ax.plot(fpr[closest_value_idx], tpr[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.legend(loc='lower center')        
        ax.set_title(f'ROC Curve')
        
        # PRC
        ax = axs[2]
        ax.plot(recall, precision, color=color, label=f'{type}, AP={aps:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(pr_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(recall[closest_value_idx], precision[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('recall')
        ax.set_ylabel('precision')
        ax.legend(loc='lower center')
        ax.set_title(f'PRC')        

        eval_stats[type]['Accuracy'] = metrics.accuracy_score(target, pred_target)
        eval_stats[type]['F1'] = metrics.f1_score(target, pred_target)
    
    df_eval_stats = pd.DataFrame(eval_stats)
    df_eval_stats = df_eval_stats.round(2)
    df_eval_stats = df_eval_stats.reindex(index=('Accuracy', 'F1', 'APS', 'ROC AUC'))
    
    print(df_eval_stats)
    
    return
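
The first panel produced by `evaluate_model` sweeps the decision threshold and recomputes F1 at each step. The same idea can be sketched in pure Python on a handful of made-up labels and probabilities. The helper `f1_at_threshold` and the toy data are illustrative assumptions, not part of the project code:

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive class when predicting 1 if prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives, so F1 is reported as 0
    return 2 * tp / (2 * tp + fp + fn)

# Toy data (illustrative only, not taken from the review dataset)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.10, 0.35, 0.40, 0.80, 0.55, 0.90, 0.65, 0.20]

thresholds = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
scores = [f1_at_threshold(y_true, y_prob, t) for t in thresholds]

# Index of the first threshold that reaches the maximum F1
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(f"best F1 = {scores[best_idx]:.2f} at threshold {thresholds[best_idx]:.2f}")
```

On this toy data the best threshold is not 0.5, which is why the plot marks several thresholds instead of reporting F1 at the default cut-off only.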

## Normalization

We assume all models below accept text in lowercase, without digits, punctuation marks, etc.

In [34]:
# Apply to whole column at once (much faster than .apply)
df_reviews["review_norm"] = (
    df_reviews["review"]
      .str.lower()
      .str.replace(r"\d+", " ", regex=True)          # drop digits
      .str.replace(r"[^\w\s]", " ", regex=True)      # drop punctuation
      .str.replace(r"\s+", " ", regex=True)          # collapse whitespace
      .str.strip()
)
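
To see what these four substitutions do to a single review, the same steps can be applied with the standard `re` module. The function name `normalize` and the sample sentence are assumptions made for illustration:

```python
import re

def normalize(text):
    """Apply the same normalization steps as the pandas chain, on one string."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # drop digits
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation
    text = re.sub(r"\s+", " ", text)       # collapse whitespace
    return text.strip()

sample = "I didn't expect 2 sequels... but it's GREAT!"
print(normalize(sample))
# i didn t expect sequels but it s great
```

Note that apostrophes are replaced by spaces, so contractions such as "didn't" are split into "didn" and "t". Underscores survive, because `\w` matches them.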

## Train / Test Split

Luckily, the whole dataset is already divided into train and test parts. The corresponding flag is `ds_part`.

In [35]:
df_reviews_train = df_reviews.query('ds_part == "train"').copy()
df_reviews_test = df_reviews.query('ds_part == "test"').copy()

# Features
train_features = df_reviews_train['review_norm']
test_features = df_reviews_test['review_norm']

# Target - Sentiment Labels
train_target = df_reviews_train['pos']
test_target = df_reviews_test['pos']

print(df_reviews_train.shape)
print(df_reviews_test.shape)

(23796, 19)
(23533, 19)


## Working with models

### Model 1 - Constant

In [36]:
from sklearn.dummy import DummyClassifier

# Baseline: always predict the most frequent class in the training target
# (random_state has no effect with strategy='most_frequent')
dummy = DummyClassifier(strategy='most_frequent', random_state=777)

In [37]:
# Fit on split data
dummy.fit(train_features, train_target)

0,1,2
,strategy,'most_frequent'
,random_state,777
,constant,


In [38]:
# Evaluate
evaluate_model(dummy, train_features, train_target, test_features, test_target)

          train  test
Accuracy    0.5   0.5
F1          0.0   0.0
APS         0.5   0.5
ROC AUC     0.5   0.5


__COMMENTARY:__
- The baseline model performs as expected: the `DummyClassifier` simply predicts the majority class every time.
- The prior EDA showed the dataset to be nearly balanced, so always guessing the majority class yields about 50% **accuracy**.
- The `F1` score is 0, as expected: it is computed for the positive class (label 1), which the model never predicts because the majority class is 0.
- The `DummyClassifier` serves its purpose as a sanity-check baseline and nothing more.
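
The zero F1 follows directly from the definition. In terms of true positives (TP), false positives (FP) and false negatives (FN) for the positive class:

$$
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
$$

A model that always predicts 0 has TP = 0 and FP = 0, so the numerator is 0 and F1 = 0 whenever at least one positive example exists. Accuracy, by contrast, equals the share of negative reviews, which is about 0.5 here.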


### Model 2 - NLTK, TF-IDF and LR

In [39]:
import nltk
nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sethc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [40]:

# Fit the vectorizer on the training set only, then apply it to the test set
tfidf_lr = TfidfVectorizer(
    stop_words=stopwords.words("english"),
    ngram_range=(1, 2),
    max_features=20000,
    token_pattern=r"(?u)\b\w\w+\b"
)
train_features_lr = tfidf_lr.fit_transform(df_reviews_train['review_norm'])
test_features_lr = tfidf_lr.transform(df_reviews_test['review_norm'])

# Logistic Regression
model_lr = LogisticRegression(max_iter=200, solver='liblinear')
model_lr.fit(train_features_lr, train_target)


0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'liblinear'
,max_iter,200


In [41]:
evaluate_model(model_lr, train_features_lr, train_target, test_features_lr, test_target)

          train  test
Accuracy   0.94  0.89
F1         0.94  0.89
APS        0.98  0.95
ROC AUC    0.98  0.96


__COMMENTARY:__
- The first non-baseline model evaluated was `LogisticRegression`, supported by `nltk` stop words and TF-IDF vectorization.
- The model performed well in many respects: it trained quickly, and both the training and test scores were favorable across the board.
- Most critically, the `F1` score was 0.89 on the test set, comfortably meeting the 0.85 threshold required by the **Film Junky Union**.
- This is a top candidate for model selection.
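
One way to keep the vectorizer and the classifier together, so that the TF-IDF vocabulary is always fitted on training text only, is a scikit-learn pipeline. The sketch below is not the code used above. It runs on a tiny made-up corpus and uses scikit-learn's built-in `"english"` stop-word list in place of the NLTK list, purely to stay self-contained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only): 1 = positive, 0 = negative
train_texts = [
    "a wonderful moving film with great acting",
    "great story and wonderful cast",
    "boring plot and terrible acting",
    "a terrible boring waste of time",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits TF-IDF on the training texts, then fits the classifier
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=200, solver="liblinear"),
)
pipeline.fit(train_texts, train_labels)

# New texts are transformed with the already-fitted vocabulary
new_texts = ["wonderful film", "boring and terrible"]
probs = pipeline.predict_proba(new_texts)[:, 1]
print(probs)
```

With a pipeline, `predict_proba` can be called on raw text directly, which removes the risk of transforming new reviews with the wrong vectorizer.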

### Model 3 - spaCy, TF-IDF and LR

In [42]:
import re
import spacy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# keep tagger + lemmatizer; disable heavy parts
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])
nlp.max_length = 2_000_000

def spacy_lemmas_stream(texts, remove_stop=True, batch_size=500):
    """
    Generator: lowercase -> strip digits/punct -> POS tag -> lemmatize.
    Yields one normalized string per input doc. No big lists or new columns.
    """
    cleaned = (re.sub(r"[^\w\s]|\d+", " ", str(t).lower()).strip() for t in texts)
    for doc in nlp.pipe(cleaned, batch_size=batch_size, n_process=1):
        if remove_stop:
            yield " ".join(tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space)
        else:
            yield " ".join(tok.lemma_ for tok in doc if not tok.is_space)


In [43]:
# Train/test - streaming only
train_stream = spacy_lemmas_stream(df_reviews_train["review"], batch_size=500)
test_stream  = spacy_lemmas_stream(df_reviews_test["review"],  batch_size=500)
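
Because `spacy_lemmas_stream` is a generator function, each stream can be consumed only once. If the vectorization cell is re-run without recreating the streams, it receives no documents. The sketch below shows this behaviour with a hypothetical stand-in generator, `lowercase_stream`, so that it runs without spaCy:

```python
def lowercase_stream(texts):
    """Stand-in for spacy_lemmas_stream: yields one processed string per input."""
    for text in texts:
        yield text.lower()

stream = lowercase_stream(["Great Movie", "Dull Plot"])

first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing left to yield

print(first_pass)   # ['great movie', 'dull plot']
print(second_pass)  # []
```

Recreating the stream right before each use, or materializing it into a list when memory allows, avoids this pitfall.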

In [44]:
# TF-IDF on streamed text; use float32 to halve RAM
tfidf_spacy = TfidfVectorizer(
    stop_words=None,            # already removed above
    ngram_range=(1, 2),
    max_features=20000,
    token_pattern=r"(?u)\b\w\w+\b",
    dtype=np.float32
)

# fit on TRAIN stream, transform TEST stream (still streaming, no new columns)
Xtr_sp = tfidf_spacy.fit_transform(train_stream)
Xte_sp = tfidf_spacy.transform(test_stream)

# model
lr_spacy = LogisticRegression(max_iter=200, solver="liblinear")
lr_spacy.fit(Xtr_sp, train_target)

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'liblinear'
,max_iter,200


In [45]:
evaluate_model(lr_spacy, Xtr_sp, train_target, Xte_sp, test_target)

          train  test
Accuracy   0.93  0.88
F1         0.93  0.88
APS        0.98  0.95
ROC AUC    0.98  0.95


__COMMENTARY:__
- The next model assessed was `LogisticRegression` on TF-IDF features built from `spaCy` lemmas.
- This model required significantly more processing time, most likely because every review has to pass through the `spaCy` pipeline before vectorization.
- All of the test scores were favorable, suggesting strong **precision** and **recall**, but the critical test `F1` of 0.88 still fell just short of the 0.89 achieved by the `nltk`-based `LogisticRegression`.
- The model is a strong candidate, but its slower preprocessing makes it a runner-up for this application.

### Model 4 - TF-IDF and LGBMClassifier

In [46]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

In [47]:
# Vectorize (reuse normalized text)
tfidf_lgbm = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1, 2),
    max_features=20000,              
    token_pattern=r"(?u)\b\w\w+\b",
    dtype=np.float32                  # halves memory vs float64
)

X_train_all = tfidf_lgbm.fit_transform(df_reviews_train["review_norm"])
y_train_all = train_target.values
X_test = tfidf_lgbm.transform(df_reviews_test["review_norm"])

# Make a small validation split from TRAIN for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_all, y_train_all, test_size=0.15, random_state=777, stratify=y_train_all
)

In [48]:
from lightgbm.callback import early_stopping, log_evaluation

# LightGBM – tuned for sparse TF-IDF + speed
lgbm = LGBMClassifier(
    objective="binary",
    learning_rate=0.1,
    n_estimators=2000,           # let early stopping pick how many
    num_leaves=63,               # good default for text
    min_child_samples=20,
    subsample=0.9,               # row sampling (LightGBM applies it only when subsample_freq > 0)
    colsample_bytree=0.9,        # feature sampling
    reg_alpha=0.0,
    reg_lambda=0.0,
    random_state=42,
    n_jobs=-1
)

# Fit with early stopping on the validation split (both AUC and binary log loss are evaluated, as the log below shows)
lgbm.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[
        early_stopping(stopping_rounds=50),
        log_evaluation(50)
    ]
)

[LightGBM] [Info] Number of positive: 10101, number of negative: 10125
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.378238 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 505532
[LightGBM] [Info] Number of data points in the train set: 20226, number of used features: 14926
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499407 -> initscore=-0.002373
[LightGBM] [Info] Start training from score -0.002373
Training until validation scores don't improve for 50 rounds
[50]	valid_0's auc: 0.921453	valid_0's binary_logloss: 0.367143
[100]	valid_0's auc: 0.938832	valid_0's binary_logloss: 0.318698
[150]	valid_0's auc: 0.943098	valid_0's binary_logloss: 0.305336
[200]	valid_0's auc: 0.944813	valid_0's binary_logloss: 0.30134
[250]	valid_0's auc: 0.945864	valid_0's binary_logloss: 0.300826
Early stopping, best iteration is:
[227]	valid_0's auc: 0.945766	valid_0's binary_logloss: 0.299715


0,1,2
,boosting_type,'gbdt'
,num_leaves,63
,max_depth,-1
,learning_rate,0.1
,n_estimators,2000
,subsample_for_bin,200000
,objective,'binary'
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


In [49]:
evaluate_model(lgbm, X_train_all, y_train_all, X_test, test_target)

          train  test
Accuracy   0.98  0.87
F1         0.98  0.87
APS        1.00  0.94
ROC AUC    1.00  0.95


__COMMENTARY:__
- The `LGBMClassifier` trained more quickly than **Model 3** above, and all of its key performance metrics were favorable.
- Its test `F1` score of 0.87 meets the 0.85 criterion, but it is outperformed by **Model 2** (0.89). The gap between the train (0.98) and test (0.87) scores also points to more overfitting than in the linear models.
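
The early-stopping rule used during training can be illustrated with a simplified, single-metric sketch: keep the best validation score seen so far and stop once it has not improved for `patience` rounds. The function name and the log-loss values below are hypothetical. LightGBM's own callback tracks every evaluation metric and is more involved than this:

```python
def best_iteration_with_patience(scores, patience, higher_is_better=True):
    """Return (best_index, stop_index) for a patience-based early-stopping rule."""
    best_idx = 0
    for i in range(1, len(scores)):
        if higher_is_better:
            improved = scores[i] > scores[best_idx]
        else:
            improved = scores[i] < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i  # stop: no improvement for `patience` rounds
    return best_idx, len(scores) - 1  # ran out of rounds without stopping early

# Hypothetical validation log-loss values, one per boosting round
val_logloss = [0.60, 0.45, 0.38, 0.33, 0.31, 0.32, 0.315, 0.33, 0.34]
best, stopped = best_iteration_with_patience(val_logloss, patience=3, higher_is_better=False)
print(best, stopped)  # 4 7
```

This mirrors the pattern in the training log, where the reported best iteration (227) precedes the point at which training actually stopped.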


In [50]:
"""
import torch
import transformers
"""


In [51]:
"""
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
config = transformers.BertConfig.from_pretrained('bert-base-uncased')
model = transformers.BertModel.from_pretrained('bert-base-uncased')
"""

In [52]:
"""
def BERT_text_to_embeddings(texts, max_length=512, batch_size=100, force_device=None, disable_progress_bar=False):
    
    ids_list = []
    attention_mask_list = []

    # text to padded ids of tokens along with their attention masks
    # (one possible implementation of the step left open in the template)
    for text in texts:
        ids = tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=max_length)
        n_pad = max_length - len(ids)
        ids_list.append(ids + [tokenizer.pad_token_id] * n_pad)
        attention_mask_list.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    
    if force_device is not None:
        device = torch.device(force_device)
    else:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        
    model.to(device)
    if not disable_progress_bar:
        print(f'Using the {device} device.')
    
    # getting embeddings in batches

    embeddings = []

    for i in tqdm(range(math.ceil(len(ids_list)/batch_size)), disable=disable_progress_bar):
            
        ids_batch = torch.LongTensor(ids_list[batch_size*i:batch_size*(i+1)]).to(device)
        attention_mask_batch = torch.LongTensor(attention_mask_list[batch_size*i:batch_size*(i+1)]).to(device)
            
        with torch.no_grad():            
            model.eval()
            batch_embeddings = model(input_ids=ids_batch, attention_mask=attention_mask_batch)   
        # keep the embedding of the first ([CLS]) token of each text
        embeddings.append(batch_embeddings[0][:,0,:].detach().cpu().numpy())
        
    return np.concatenate(embeddings)
"""

In [53]:
"""
# Attention! Running BERT on thousands of texts may take a long time on a CPU, at least several hours
train_features_9 = BERT_text_to_embeddings(df_reviews_train['review_norm'], force_device='cuda')
"""

In [54]:
"""
print(df_reviews_train['review_norm'].shape)
print(train_features_9.shape)
print(train_target.shape)
"""

In [55]:
# once the embeddings have been computed, it's advisable to save them so they can be reloaded later
# np.savez_compressed('features_9.npz', train_features_9=train_features_9, test_features_9=test_features_9)

# and load...
# with np.load('features_9.npz') as data:
#     train_features_9 = data['train_features_9']
#     test_features_9 = data['test_features_9']

## My Reviews

In [56]:
my_reviews = pd.DataFrame([
    'I did not simply like it, not my kind of movie.',
    'Well, I was bored and felt asleep in the middle of the movie.',
    'I was really fascinated with the movie',    
    'Even the actors looked really old and disinterested, and they got paid to be in the movie. What a soulless cash grab.',
    'I didn\'t expect the reboot to be so good! Writers really cared about the source material',
    'The movie had its upsides and downsides, but I feel like overall it\'s a decent flick. I could see myself going to see it again.',
    'What a rotten attempt at a comedy. Not a single joke lands, everyone acts annoying and loud, even kids won\'t like this!',
    'Launching on Netflix was a brave move & I really appreciate being able to binge on episode after episode, of this exciting intelligent new drama.'
], columns=['review'])

my_reviews["review_norm"] = (
    my_reviews["review"].str.lower()
        .str.replace(r"\d+", " ", regex=True)
        .str.replace(r"[^\w\s]", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
)

my_reviews

Unnamed: 0,review,review_norm
0,"I did not simply like it, not my kind of movie.",i did not simply like it not my kind of movie
1,"Well, I was bored and felt asleep in the middl...",well i was bored and felt asleep in the middle...
2,I was really fascinated with the movie,i was really fascinated with the movie
3,Even the actors looked really old and disinter...,even the actors looked really old and disinter...
4,I didn't expect the reboot to be so good! Writ...,i didn t expect the reboot to be so good write...
5,"The movie had its upsides and downsides, but I...",the movie had its upsides and downsides but i ...
6,What a rotten attempt at a comedy. Not a singl...,what a rotten attempt at a comedy not a single...
7,Launching on Netflix was a brave move & I real...,launching on netflix was a brave move i really...


### Model 2 - NLTK, TF-IDF and LR

In [57]:
texts = my_reviews["review_norm"]
X_m2 = tfidf_lr.transform(texts)              # uses existing tfidf_lr
probs_m2 = model_lr.predict_proba(X_m2)[:, 1] # existing model_lr
preds_m2 = (probs_m2 >= 0.5).astype(int)

### Model 3 - spaCy, TF-IDF and LR

In [58]:
def spacy_lemmas_list(texts, remove_stop=True, batch_size=200):
    """
    Convert a list/Series of texts into lemmatized strings using spaCy.
    Mirrors earlier preprocessing stream.
    """
    cleaned = (re.sub(r"[^\w\s]|\d+", " ", str(t).lower()).strip() for t in texts)
    out = []
    for doc in nlp.pipe(cleaned, batch_size=batch_size, n_process=1):
        toks = [tok.lemma_ for tok in doc if not tok.is_space and (not tok.is_stop if remove_stop else True)]
        out.append(" ".join(toks))
    return out

texts_spacy = spacy_lemmas_list(my_reviews["review"])    # preprocess like  training stream
X_m3 = tfidf_spacy.transform(texts_spacy)                 # existing tfidf_spacy
probs_m3 = lr_spacy.predict_proba(X_m3)[:, 1]            # existing lr_spacy
preds_m3 = (probs_m3 >= 0.5).astype(int)

### Model 4 - TF-IDF and LGBMClassifier

In [59]:
X_m4 = tfidf_lgbm.transform(my_reviews["review_norm"])   # existing tfidf_lgbm
probs_m4 = lgbm.predict_proba(X_m4)[:, 1]                # existing lgbm
preds_m4 = (probs_m4 >= 0.5).astype(int)

### Comparison Table

In [60]:
out = my_reviews[["review"]].copy()
out["M2_LR_prob_pos"]   = probs_m2
out["M2_LR_pred"]       = preds_m2
out["M3_spaCy_prob_pos"]= probs_m3
out["M3_spaCy_pred"]    = preds_m3
out["M4_LGBM_prob_pos"] = probs_m4
out["M4_LGBM_pred"]     = preds_m4

# Pretty display
pd.set_option("display.max_colwidth", 140)
display(out.style.format({
    "M2_LR_prob_pos": "{:.2f}",
    "M3_spaCy_prob_pos": "{:.2f}",
    "M4_LGBM_prob_pos": "{:.2f}"
}))

Unnamed: 0,review,M2_LR_prob_pos,M2_LR_pred,M3_spaCy_prob_pos,M3_spaCy_pred,M4_LGBM_prob_pos,M4_LGBM_pred
0,"I did not simply like it, not my kind of movie.",0.26,0,0.24,0,0.5,0
1,"Well, I was bored and felt asleep in the middle of the movie.",0.23,0,0.19,0,0.15,0
2,I was really fascinated with the movie,0.53,1,0.61,1,0.68,1
3,"Even the actors looked really old and disinterested, and they got paid to be in the movie. What a soulless cash grab.",0.11,0,0.21,0,0.1,0
4,I didn't expect the reboot to be so good! Writers really cared about the source material,0.29,0,0.23,0,0.32,0
5,"The movie had its upsides and downsides, but I feel like overall it's a decent flick. I could see myself going to see it again.",0.44,0,0.34,0,0.77,1
6,"What a rotten attempt at a comedy. Not a single joke lands, everyone acts annoying and loud, even kids won't like this!",0.04,0,0.06,0,0.02,0
7,"Launching on Netflix was a brave move & I really appreciate being able to binge on episode after episode, of this exciting intelligent new drama.",0.85,1,0.92,1,0.93,1
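
In the table, review 0 shows an M4 probability of 0.5 next to a prediction of 0, even though the cutoff is `>= 0.5`. One plausible explanation is display rounding: the probability is formatted to two decimals, while the comparison uses the unrounded value. The sketch below illustrates this with a hypothetical probability; the actual unrounded value is not shown in the notebook.

```python
THRESHOLD = 0.5

# Hypothetical unrounded probability, chosen only to illustrate the effect.
prob = 0.4996

shown = f"{prob:.2f}"          # what a "{:.2f}" display format renders
pred = int(prob >= THRESHOLD)  # what the thresholding step computes

print(shown, pred)
```

A value just under the cutoff is displayed as `0.50` but still maps to class 0, so the table row is consistent with the code.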


In [61]:
"""
texts = my_reviews['review_norm']

my_reviews_features_9 = BERT_text_to_embeddings(texts, disable_progress_bar=True)

my_reviews_pred_prob = model_9.predict_proba(my_reviews_features_9)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')
"""

## Conclusions

__Project Objectives:__
- Train a model to automatically detect negative reviews for **The Film Junky Union**.
- The model must reach a minimum `F1 Score` of **0.85**.

__Results Summary:__
- **Model 1:**
    - The baseline performs as expected: the `DummyClassifier` simply predicts the majority class every time.
    - The earlier EDA showed the dataset to be nearly balanced, so always guessing the majority class yields roughly 50% **accuracy**.
    - The `F1` was 0, which is expected because the model never predicts the minority class, by design.
    - The `DummyClassifier` serves its purpose as a sanity-check baseline and nothing more.
- **Model 2:**
    - The first non-baseline model evaluated was `LogisticRegression`, supported by `nltk` preprocessing and TF-IDF vectorization.
    - The model performed well in many respects: it trained quickly, and both the training and test scores were favorable across the board.
    - Most critically, the `F1` score was 0.89 on the test set, comfortably meeting the 0.85 threshold required by the **Film Junky Union**.
- **Model 3:**
    - The next model assessed was `LogisticRegression` supported by `spaCy` preprocessing and TF-IDF vectorization.
    - This model required significantly more processing time during training.
    - All of the test scores were favorable, suggesting strong **precision** and **recall**, but its `F1` score was still lower than that of the `nltk`-based `LogisticRegression` (**Model 2**).
    - The model is a strong candidate, but its much slower processing makes it a runner-up for this application.
- **Model 4:**
    - The `LGBMClassifier` trained more quickly than **Model 3** above, and all of its key performance metrics were favorable.
    - Once again, the `F1` score of 0.87 met the criterion; however, it was outperformed by **Model 2**.
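
The Model 1 result (roughly 50% accuracy with an `F1` of 0) follows directly from how a majority-class baseline behaves. The sketch below reproduces the effect in plain Python on hypothetical, nearly balanced labels. It does not use the notebook's data or scikit-learn, and `f1_positive` is an illustrative helper that returns 0 when there are no true positives.

```python
from collections import Counter

def f1_positive(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 51 negatives (0) and 49 positives (1).
y_true = [0] * 51 + [1] * 49

# A majority-class baseline predicts the most frequent label for every sample.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_positive(y_true, y_pred)
print(majority, accuracy, f1)
```

Because the baseline never predicts the positive class, it has no true positives, so `F1` is 0 regardless of how close the accuracy is to 50%.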

__Observations on Custom Reviews:__
- Both LR models (M2, M3) assign the same label to all eight custom reviews and tend to be more conservative in assigning positive sentiment.
- LightGBM (M4) often assigns a higher positive probability than the LR models and sometimes flips to positive where they stay negative (e.g., Review #5, where it gives 0.77 against 0.44 and 0.34).
- This suggests the models weigh the same kind of TF-IDF features differently: LR combines term weights linearly, while LightGBM's trees can capture non-linear interactions between terms, which can shift borderline cases.
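
The agreement pattern can be checked directly from the comparison table. The sketch below copies the predicted labels from that table into plain Python lists; the dictionary keys and variable names are illustrative.

```python
# Predicted labels per review (index 0..7), copied from the comparison table.
preds = {
    "M2_LR":    [0, 0, 1, 0, 0, 0, 0, 1],
    "M3_spaCy": [0, 0, 1, 0, 0, 0, 0, 1],
    "M4_LGBM":  [0, 0, 1, 0, 0, 1, 0, 1],
}

# Number of reviews on which the two LR models give the same label.
lr_agree = sum(a == b for a, b in zip(preds["M2_LR"], preds["M3_spaCy"]))

# Review indices where at least one model disagrees with the others.
disagreements = [
    i for i, votes in enumerate(zip(*preds.values())) if len(set(votes)) > 1
]

print(lr_agree, disagreements)
```

On these eight reviews, the only index with a split vote is Review #5, where LightGBM predicts positive and both LR models predict negative.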

__Next Steps & Recommendations:__
- Future work should focus on optimizing preprocessing pipelines for efficiency and consistency across models.
- It’s also recommended to benchmark against newer transformer-based models to assess potential performance gains.
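
One way to keep preprocessing consistent between training and inference is to bundle the text normalization, vectorizer, and classifier into a single scikit-learn `Pipeline`. This is a minimal sketch under stated assumptions, not the notebook's implementation: the `normalize` helper, the toy training texts, and their labels are hypothetical and serve only to make the example runnable.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def normalize(text):
    """Hypothetical normalizer: lowercase, replace punctuation and digits with spaces."""
    return re.sub(r"[^\w\s]|\d+", " ", str(text).lower()).strip()

# Toy data for illustration only (1 = positive, 0 = negative).
train_texts = [
    "A wonderful, exciting film!",
    "Boring and soulless cash grab.",
    "I really enjoyed this drama.",
    "A rotten, annoying comedy.",
]
train_labels = [1, 0, 1, 0]

# The vectorizer applies `normalize` to raw text on both fit and predict,
# so training and inference cannot drift apart.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalize)),
    ("clf", LogisticRegression()),
])
pipe.fit(train_texts, train_labels)

# Raw, unnormalized text goes straight in.
probs = pipe.predict_proba(["What a rotten attempt at a comedy."])[:, 1]
print(probs.shape)
```

With this layout, each model receives raw reviews and applies its own preprocessing internally, which avoids having to remember which column (`review` or `review_norm`) a given vectorizer was fitted on.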