In [186]:
import nltk
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
from urllib import request
from nltk import FreqDist
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, precision_score, recall_score,
)
from gensim.models import Word2Vec
from IPython.display import display_html


## 1. Introduction

The IMDB Dataset is a widely used benchmark in natural language processing for document classification and sentiment analysis. It comprises 50,000 movie reviews from the Internet Movie Database (IMDB), each labeled as either positive or negative, with the two classes equally represented.

## 2. Data Preparation

Each movie review was cleaned to remove noise and ensure consistency. The text was converted to lowercase, HTML tags and all non-letter characters (punctuation and digits) were replaced with spaces, and runs of whitespace were collapsed.
The cleaned text was stored in a new column called **clean_review**. Sentiment labels were also converted from **“positive”** and **“negative”** to binary values (**1** and **0**, respectively) to prepare the data for machine learning classification.


In [187]:
movies_df = pd.read_csv("IMDB Dataset.csv")


In [188]:

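def clean_text(text):
    text = text.lower()                       # normalize case
    text = re.sub(r"<.*?>", " ", text)        # drop HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)     # replace digits and punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text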

movies_df["clean_review"] = movies_df["review"].apply(clean_text)
movies_df["label"] = movies_df["sentiment"].map({"positive": 1, "negative": 0})
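As a quick illustration of what the cleaning rules do, the sketch below applies the same function to a made-up string (the sample text is an assumption, not taken from the dataset). Note that apostrophes are replaced with spaces, so contractions such as "you'll" are split into two tokens, which matches the "you ll" seen in the cleaned reviews.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical raw review fragment with HTML, punctuation and digits.
sample = "You'll be <br /><br />HOOKED!!  100%"
cleaned = clean_text(sample)
print(cleaned)  # → you ll be hooked
```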


In [189]:
movies_df[["clean_review","label"]].head().style.hide(axis="index")

clean_review,label
one of the other reviewers has mentioned that after watching just oz episode you ll be hooked they are right as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to many aryans muslims gangstas latinos christians italians irish and more so scuffles death stares dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wouldn t dare forget pretty pictures painted for mainstream audiences forget charm forget romance oz doesn t mess around the first episode i ever saw struck me as so nasty it was surreal i couldn t say i was ready for it but as i watched more i developed a taste for oz and got accustomed to the high levels of graphic violence not just violence but injustice crooked guards who ll be sold out for a nickel inmates who ll kill on order and get away with it well mannered middle class inmates being turned into prison bitches due to their lack of street skills or prison experience watching oz you may become comfortable with what is uncomfortable viewing thats if you can get in touch with your darker side,1
a wonderful little production the filming technique is very unassuming very old time bbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great master s of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwell s murals decorating every surface are terribly well done,1
i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a light hearted comedy the plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer while some may be disappointed when they realize this is not match point risk addiction i thought it was proof that woody allen is still fully in control of the style many of us have grown to love this was the most i d laughed at one of woody s comedies in years dare i say a decade while i ve never been impressed with scarlet johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young woman this may not be the crown jewel of his career but it was wittier than devil wears prada and more interesting than superman a great comedy to go see with friends,1
basically there s a family where a little boy jake thinks there s a zombie in his closet his parents are fighting all the time this movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombie ok first of all when you re going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots out of just for the well playing parents descent dialogs as for the shots with jake just ignore them,0
petter mattei s love in the time of money is a visually stunning film to watch mr mattei offers us a vivid portrait about human relations this is a movie that seems to be telling us what money power and success do to people in the different situations we encounter this being a variation on the arthur schnitzler s play about the same theme the director transfers the action to the present time new york where all these different characters meet and connect each one is connected in one way or another to the next person but no one seems to know the previous point of contact stylishly the film has a sophisticated luxurious look we are taken to see how these people live and the world they live in their own habitat the only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits a big city is not exactly the best place in which human relations find sincere fulfillment as one discerns is the case with most of the people we encounter the acting is good under mr mattei s direction steve buscemi rosario dawson carol kane michael imperioli adrian grenier and the rest of the talented cast make these characters come alive we wish mr mattei good luck and await anxiously for his next work,1


The dataset was divided into training and testing subsets using an 80/20 split. Setting **random_state=456** makes the split reproducible, and **stratify=movies_df["label"]** maintains the same proportion of positive and negative reviews in both sets, preserving class balance.


In [190]:
x_train,x_test, y_train,y_test = train_test_split(
    movies_df["clean_review"],
    movies_df["label"],
    train_size=0.8,
    test_size=0.2, 
    random_state=456,
    stratify=movies_df["label"]
    )
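The effect of stratification can be checked on a small synthetic example. The sketch below uses toy placeholder documents and labels (assumptions for illustration only) with the same `train_size`, `random_state` and `stratify` arguments, and compares the share of positive labels in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the review corpus: 100 documents, half positive, half negative.
texts = np.array([f"review {i}" for i in range(100)])
labels = np.array([0, 1] * 50)

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, train_size=0.8, random_state=456, stratify=labels
)

print(len(x_tr), len(x_te))      # → 80 20
print(y_tr.mean(), y_te.mean())  # → 0.5 0.5
```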

To transform the textual movie reviews into a numerical format suitable for machine learning, **Word2Vec** was used as the feature extraction method. This approach converts each word into a dense vector that captures both syntactic and semantic relationships, allowing the model to understand how words relate to one another in context.

The cleaned reviews were tokenized by splitting on whitespace, and only the training split was used to train the Word2Vec model, so no test-set text informs the embeddings. The **vector_size** was set to **500**, meaning each word is represented as a 500-dimensional vector, providing enough capacity to encode complex linguistic features. The **window** parameter was set to **6**, defining the maximum number of neighboring words on each side that the model considers when learning word associations. The **min_count** was set to **2**, filtering out words that appear only once and thus reducing noise from rare or misspelled terms.

Training was parallelized with **workers=4** for faster computation, and **sg=1** was used to apply the **skip-gram** architecture, which is particularly effective at learning high-quality embeddings for less frequent words by predicting context words from a target word.

Once trained, each review was transformed into a single vector using the `document_vector()` function, which averages the vectors of all in-vocabulary words in the review. Words missing from the Word2Vec vocabulary are skipped, and a review with no known words maps to a zero vector. This process produced fixed-length feature representations (**v_train_set** and **v_test_set**) that summarize each review and serve as input features for the classification models.




In [163]:
# TF-IDF alternative (disabled). The test set is only transformed, never re-fitted.
# vectorizer = TfidfVectorizer(max_features=5000,stop_words="english",ngram_range=(1,4))
# v_train_set = vectorizer.fit_transform(x_train)
# v_test_set =  vectorizer.transform(x_test)
x_train_tokens = [text.split() for text in x_train]
x_test_tokens  = [text.split() for text in x_test]

w2v_model = Word2Vec(
    sentences=x_train_tokens,
    vector_size=500,
    window=6,
    min_count=2,
    workers=4,
    sg=1
)

def document_vector(words):
    words = [w for w in words if w in w2v_model.wv]
    if len(words) == 0:
        return np.zeros(w2v_model.vector_size)
    return np.mean(w2v_model.wv[words], axis=0)

v_train_set = np.vstack([document_vector(words) for words in x_train_tokens])
v_test_set  = np.vstack([document_vector(words) for words in x_test_tokens])
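The averaging step can be illustrated without a trained model. The sketch below is a toy version under stated assumptions: `toy_vectors` is a hypothetical 3-dimensional lookup table standing in for `w2v_model.wv`, and `toy_document_vector` mirrors the logic of `document_vector()`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a tiny vocabulary.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

def toy_document_vector(words, vectors, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_a = toy_document_vector(["great", "movie", "zzz"], toy_vectors)
doc_b = toy_document_vector(["zzz"], toy_vectors)
print(doc_a)  # "zzz" is skipped → [2. 1. 1.]
print(doc_b)  # no known words → [0. 0. 0.]
```

Because the mean ignores word order and weights every word equally, two reviews with the same words in a different order receive the same vector.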

In [None]:
w2v_model.save("w2v_model_imdb.model")


## 3. Model Development

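Four models were trained to classify IMDB movie reviews: **Support Vector Machine (SVM)**, **Logistic Regression**, **Random Forest**, and **XGBoost**. The model labeled XGBoost here is scikit-learn's `GradientBoostingClassifier`, a gradient boosting implementation, rather than the separate `xgboost` library. Each model predicts whether a review expresses positive or negative sentiment from the averaged Word2Vec vector of its cleaned text.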

The models were evaluated using **Accuracy**, **Precision**, **Recall**, **Sensitivity**, **Specificity**, and **F1-score**. Sensitivity is the true positive rate and is therefore the same quantity as Recall for the positive class, while Specificity is the true negative rate. Together these metrics give a view of overall and class-specific performance and help identify which model misclassifies least.
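The relationship between these metrics and the confusion matrix can be sketched on a handful of made-up labels (the `y_true` and `y_pred` lists are illustrative assumptions). For binary labels 0 and 1, scikit-learn arranges the matrix as `[[TN, FP], [FN, TP]]`, so `ravel()` yields the counts in that order.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 3 negative and 4 positive examples.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are true labels, columns are predicted labels, both sorted as [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # true positive rate, identical to recall
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)                           # → 2 1 1 3
print(sensitivity, recall_score(y_true, y_pred))  # → 0.75 0.75
```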



In [165]:
cv_param = 5

In [166]:
model_metrics = [
        "Set",
        "Accuracy",
        "Precision",
        "Recall",
        "Sensitivity",
        "Specificity",
        "F1"
        ]

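def evaluate_model(y_true, y_pred):
    """Return accuracy, precision, recall, sensitivity, specificity and F1 for binary labels."""
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    # For labels [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]]
    # (rows are true labels, columns are predicted labels).
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    TN, FP, FN, TP = cm.ravel()
    sensitivity = TP / (TP + FN) if (TP + FN) else 0  # true positive rate (same as recall)
    specificity = TN / (TN + FP) if (TN + FP) else 0  # true negative rate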

    return {
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "F1": f1
    }
    
def generate_report(model_instance,trainX,trainY,testX,testY):
    y_train_pred = model_instance.predict(trainX)
    y_test_pred = model_instance.predict(testX)
    train_set_metrics = evaluate_model(trainY,y_train_pred)
    test_set_metrics = evaluate_model(testY,y_test_pred)
    train_set_metrics["Set"] = "Training"
    test_set_metrics["Set"] = "Test"
    model_metrics_df = pd.DataFrame(columns=model_metrics,data= [train_set_metrics,test_set_metrics])
    styled_report = model_metrics_df.style.hide(axis="index")
    return model_metrics_df,styled_report

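def display_side_by_side(dfs, titles=None):
    """Render several DataFrames or Stylers next to each other in one HTML row."""
    if titles is None:
        titles = [""] * len(dfs)  # zip() below needs one title per table
    html_str = "<div style='display:flex;flex-flow:row nowrap;column-gap:20px'>"
    for df, title in zip(dfs, titles):
        html_str += f"""
        <div style="margin:10px">
            <h4 style="text-align:center">{title}</h4>
            {df.to_html()}
        </div>"""
    html_str += "</div>"

    display_html(html_str, raw=True)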

    

### 3.1 SVM

The Support Vector Machine model was tuned using a Grid Search focused on the regularization parameter **C**, tested from **0.001 to 1** in increments of **0.009**. This parameter controls the trade-off between fitting the training data well and maintaining good generalization on unseen data.


In [167]:
svm_param_grid = {'C': np.arange(0.001, 1, 0.009)}
svm_model = LinearSVC(random_state=500)

grid = GridSearchCV(svm_model, svm_param_grid, cv=cv_param, scoring='accuracy', n_jobs=-1, verbose=0)
grid.fit(v_train_set, y_train)
# print("Best Parameters:", grid.best_params_)
# print("Best CV Accuracy:", round(grid.best_score_, 3))
svm_model = grid.best_estimator_

svm_df, svm_df_styled = generate_report(svm_model,v_train_set,y_train,v_test_set,y_test)
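One caveat with `np.arange` and a non-integer step is that floating-point rounding can make the number of generated values differ by one from what the step suggests, which is why NumPy's documentation recommends `np.linspace` in such cases. The sketch below is an alternative way to build an evenly spaced grid with spacing 0.009, under the assumption that including the upper bound 1.0 is acceptable. It is an illustration, not the grid used in the search above.

```python
import numpy as np

# 112 evenly spaced values from 0.001 to 1.0 inclusive (spacing 0.999 / 111 = 0.009).
c_grid = np.linspace(0.001, 1.0, num=112)
step = c_grid[1] - c_grid[0]

print(len(c_grid), c_grid[0], c_grid[-1])
print(round(step, 6))

# Hypothetical parameter grid in the shape expected by GridSearchCV.
svm_param_grid_alt = {"C": c_grid}
```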


### 3.2 Logistic Regression

For Logistic Regression, a Grid Search explored **C** values from **0.001 to 1** in increments of **0.005**, along with two solver options: **liblinear** and **lbfgs**. These settings helped identify the best regularization strength and optimization approach for handling text-based sentiment data efficiently.


In [168]:
log_reg = LogisticRegression(max_iter=1000)

log_param_grid = {'C': np.arange(0.001, 1, 0.005), 'solver': ['liblinear', 'lbfgs']}

grid = GridSearchCV(estimator=log_reg, param_grid=log_param_grid,cv=cv_param, scoring='accuracy', n_jobs=-1, verbose=0)

grid.fit(v_train_set, y_train)

# print("Best Parameters:", grid.best_params_)
# print("Best CV Accuracy:", round(grid.best_score_, 3))

logistic_model = grid.best_estimator_

lgreg_df, lgreg_df_styled = generate_report(logistic_model,v_train_set,y_train,v_test_set,y_test)
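The cost of this search follows directly from the grid size. As a rough sketch, the arithmetic below counts the candidate settings and the resulting cross-validation fits. The variable names are illustrative, and the fold count assumes the `cv_param = 5` defined earlier.

```python
from itertools import product

import numpy as np

c_values = np.arange(0.001, 1, 0.005)
solvers = ["liblinear", "lbfgs"]
cv_folds = 5  # assumed to match cv_param

# Every (C, solver) pair is one candidate; each candidate is fitted once per fold.
candidates = list(product(c_values, solvers))
total_fits = len(candidates) * cv_folds

print(len(c_values), len(candidates), total_fits)  # → 200 400 2000
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.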


### 3.3 Gradient Boosting

The gradient boosting model (scikit-learn's `GradientBoostingClassifier`, labeled XGBoost in the code and result tables) was tuned over a small parameter grid. The **n_estimators** parameter was tested with **100** and **500** trees to evaluate the impact of ensemble size on accuracy and overfitting. The **learning_rate** values (**0.01**, **0.1**, and **0.2**) control how much each new tree contributes to the ensemble, trading convergence speed against stability.

Although the **max_depth** parameter was considered during experimentation, it was later omitted from the grid to simplify the tuning process, so every candidate uses the default depth of **3**. Shallow trees of this kind let the model capture nonlinear relationships while limiting the risk of overfitting.


In [169]:

xgboost = GradientBoostingClassifier(random_state=500)
xgboost_param_grid = {'n_estimators':[100,500],'learning_rate':[0.01,0.1,0.2]
                    #   ,'max_depth':[2,3,4]
                      }
grid = GridSearchCV(xgboost,param_grid=xgboost_param_grid,cv=cv_param,scoring='accuracy',n_jobs=-1,verbose=0)
grid.fit(v_train_set,y_train)
# print("Best Params:",grid.best_params_)
# print("Best CV Accuracy:",round(grid.best_score_,3))
xgboost = grid.best_estimator_

naxgboost_df, xgboost_df_styled = generate_report(xgboost,v_train_set,y_train,v_test_set,y_test)
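The interaction between the number of trees and held-out accuracy can also be inspected on a single fitted model, since `staged_predict` yields the ensemble's predictions after each boosting stage. The sketch below uses synthetic data from `make_classification` as a stand-in for the document vectors. The data, sizes and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 500-dimensional review vectors.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=500)
gb.fit(X_tr, y_tr)

# Validation accuracy after 1, 2, ..., 50 trees.
stage_acc = [accuracy_score(y_val, pred) for pred in gb.staged_predict(X_val)]
best_stage = max(range(len(stage_acc)), key=stage_acc.__getitem__) + 1
print(f"best validation accuracy {max(stage_acc):.3f} after {best_stage} trees")
```

A curve that flattens or declines well before the last stage suggests that fewer trees, or a smaller learning rate, would serve equally well.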

### 3.4 Random Forest

The Random Forest model was trained using a parameter grid that aimed to balance accuracy and generalization. The **n_estimators** values (**200** and **400**) controlled the number of trees, while **max_depth** (**None** and **20**) either left trees unrestricted or capped their depth to manage complexity. The **max_features** parameter was set to **"sqrt"** so that each split considers only a random subset of features, which reduces correlation between trees and improves ensemble diversity.

The grid also tuned **min_samples_split** (**2**, **5**) and **min_samples_leaf** (**1**, **2**), the minimum number of samples required to split an internal node and to remain in a leaf, respectively. These settings helped identify the best configuration for stable and efficient model performance.


In [170]:
random_forest = RandomForestClassifier(random_state=500)

rf_param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 20],
    "max_features": ["sqrt"],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2]
}

grid = GridSearchCV(estimator=random_forest, param_grid=rf_param_grid,cv=cv_param, scoring='accuracy', n_jobs=-1, verbose=0)

grid.fit(v_train_set, y_train)

# print("Best Parameters:", grid.best_params_)
# print("Best CV Accuracy:", round(grid.best_score_, 3))

random_forest = grid.best_estimator_

random_forest_df, random_forest_df_styled = generate_report(random_forest,v_train_set,y_train,v_test_set,y_test)
random_forest_df_styled

Set,Accuracy,Precision,Recall,Sensitivity,Specificity,F1
Training,0.999025,0.9998,0.99825,0.99825,0.9998,0.999024
Test,0.8414,0.831456,0.8564,0.8564,0.8264,0.843744


## 4. Model Evaluation and Recommendation

In [175]:
svm_df["Model"] = "SVM"
random_forest_df["Model"] = "Random Forest"
lgreg_df["Model"] = "Logistic Regression"
naxgboost_df["Model"] = "XGBoost"

combined_df = pd.concat([svm_df, random_forest_df, lgreg_df, naxgboost_df], ignore_index=True)
cols = ["Model"] + [col for col in combined_df.columns if col != "Model"]
combined_df = combined_df[cols]
training_summary = combined_df[combined_df["Set"] == "Training"].style.hide(axis="index")
test_summary = combined_df[combined_df["Set"] == "Test"].style.hide(axis="index")


display_side_by_side([training_summary,test_summary],["Training Evaluation","Test Set Evaluation"])

Model,Set,Accuracy,Precision,Recall,Sensitivity,Specificity,F1
SVM,Training,0.890675,0.888171,0.8939,0.8939,0.88745,0.891026
Random Forest,Training,0.999025,0.9998,0.99825,0.99825,0.9998,0.999024
Logistic Regression,Training,0.882925,0.880811,0.8857,0.8857,0.88015,0.883249
XGBoost,Training,0.92775,0.925326,0.9306,0.9306,0.9249,0.927955

Model,Set,Accuracy,Precision,Recall,Sensitivity,Specificity,F1
SVM,Test,0.888,0.882794,0.8948,0.8948,0.8812,0.888756
Random Forest,Test,0.8414,0.831456,0.8564,0.8564,0.8264,0.843744
Logistic Regression,Test,0.8799,0.875321,0.886,0.886,0.8738,0.880628
XGBoost,Test,0.8788,0.87505,0.8838,0.8838,0.8738,0.879403



The training results indicate that all models performed strongly on the averaged Word2Vec features. The **Random Forest** model achieved the highest training accuracy (**0.999**), along with near-perfect precision (**0.9998**), recall (**0.9983**), and F1 (**0.9990**). However, this performance did not carry over to the test set, where accuracy dropped to **0.841** and the F1 score to **0.8437**. This gap of roughly 0.16 in accuracy suggests that the Random Forest model memorized patterns from the training data instead of learning generalizable relationships, a clear indication of overfitting.

The **SVM** and **Logistic Regression** models behaved most consistently between training and test results. The **SVM** achieved a training accuracy of **0.8907** and a test accuracy of **0.888**, and **Logistic Regression** reached **0.8829** in training and **0.8799** on the test set, a drop of about 0.003 in each case. **XGBoost** (the gradient boosting model) achieved **0.9278** and **0.8788**, respectively, a larger drop of about 0.049 that points to mild overfitting but remains far smaller than the Random Forest gap. The closeness of the SVM and Logistic Regression metrics across both sets reflects strong generalization and the ability to classify sentiment effectively without overfitting.
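The size of each train/test gap can be computed directly from the reported accuracies. The short sketch below copies the values from the summary tables. The dictionary layout and names are illustrative.

```python
# (training accuracy, test accuracy) as reported in the summary tables.
reported_accuracy = {
    "SVM": (0.890675, 0.888),
    "Random Forest": (0.999025, 0.8414),
    "Logistic Regression": (0.882925, 0.8799),
    "XGBoost": (0.92775, 0.8788),
}

# Generalization gap: how much accuracy is lost when moving to unseen data.
gaps = {model: train - test for model, (train, test) in reported_accuracy.items()}

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:<20} gap = {gap:.4f}")

largest_gap_model = max(gaps, key=gaps.get)
```

Sorting by gap puts SVM and Logistic Regression first and Random Forest last, matching the overfitting assessment above.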

In terms of balance between sensitivity and specificity, the **SVM**, **Logistic Regression**, and **XGBoost** models all kept the two rates within about 0.015 of each other on the test set, so none of them favors one class strongly. The **SVM** achieved the highest test accuracy (**0.888**) and the highest F1 score on the test data (**0.8888**), indicating that it identifies both positive and negative sentiment reliably. Logistic Regression followed closely with a test F1 of **0.8806**, also maintaining a strong trade-off between recall and precision.

Overall, the results suggest that the **Word2Vec embeddings were the main driver of performance** by providing dense and meaningful text features. The small variations in test accuracy and F1 across the SVM, Logistic Regression, and XGBoost models indicate that the embedding features, rather than the choice of classifier, played the key role in enabling robust sentiment classification. While the **SVM** model slightly outperformed the others in generalization, **Logistic Regression** and **XGBoost** produced similarly strong results, making any of the three a reasonable choice for deployment.



## 5. Conclusion

This project demonstrated the effectiveness of combining **Word2Vec embeddings** with traditional machine learning algorithms for sentiment classification on the IMDB movie review dataset. Through systematic data cleaning, feature extraction, and model evaluation, the results indicate that the quality of text representation played the most significant role in achieving strong model performance. The Word2Vec model captured semantic relationships between words, giving the classifiers features that reflect sentiment and context.

Among the four models tested (**SVM**, **Logistic Regression**, **Random Forest**, and **XGBoost**, the last being scikit-learn's gradient boosting classifier), the **SVM** achieved the best balance between accuracy and generalization, while Logistic Regression and XGBoost produced similar test results. The Random Forest model achieved near-perfect accuracy during training but showed clear signs of overfitting when evaluated on unseen data.

Overall, the findings suggest that **feature quality matters more than model complexity** in this text classification task: a linear SVM on averaged embeddings outperformed both tree ensembles on the test set. The Word2Vec embeddings provided a solid foundation for sentiment understanding, enabling even simple models to perform well. Future work could involve tuning the embedding parameters, using pre-trained Word2Vec or Transformer-based models, and extending this framework to more diverse or multi-class datasets to further improve generalization and interpretability.
