# Sentiment Analysis on the IMDB Dataset with XGBoost

This notebook demonstrates binary sentiment classification on movie reviews. Each review is embedded with Word2Vec, and an XGBoost classifier is tuned on those embeddings with randomized search and stratified cross-validation.

**Dataset Information:**  
The dataset used in this notebook is the ["IMDB Dataset of 50K Movie Reviews"](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) by lakshmi25npathi, which contains 50,000 labeled movie reviews for binary sentiment classification (positive/negative).

**Preprocessing:**  
The preprocessed dataset used here can be obtained as the output of the `MRA_Preprocessing` notebook. The original raw data can be found at the Kaggle link above.

**Techniques and Libraries Used:**  
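- **gensim (Word2Vec):** For learning word embeddings, which are averaged into document vectors.
- **XGBoost:** For classification.
- **Stratified K-Fold Cross Validation:** For robust evaluation during hyperparameter search.
- **scikit-learn:** For model selection, evaluation, and preprocessing.
- **pandas & numpy:** For data manipulation and analysis.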

---

In [None]:
!pip install gensim

In [30]:
import warnings

import joblib
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from tqdm.auto import tqdm
from xgboost import XGBClassifier

# Silence library warnings to keep the cell outputs readable
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv("movie_review_cleaned.csv")

In [3]:
df = data.copy()

In [4]:
df

Unnamed: 0,review,sentiment,tokens,lemmas
0,one reviewers mentioned watching episode 'll h...,positive,"['one', 'reviewers', 'mentioned', 'watching', ...","['one', 'reviewer', 'mention', 'watch', 'episo..."
1,wonderful little production filming technique ...,positive,"['wonderful', 'little', 'production', 'filming...","['wonderful', 'little', 'production', 'filming..."
2,thought wonderful way spend time hot summer we...,positive,"['thought', 'wonderful', 'way', 'spend', 'time...","['think', 'wonderful', 'way', 'spend', 'time',..."
3,basically family little boy jake thinks zombie...,negative,"['basically', 'family', 'little', 'boy', 'jake...","['basically', 'family', 'little', 'boy', 'jake..."
4,petter matters love time money visually stunni...,positive,"['petter', 'matters', 'love', 'time', 'money',...","['petter', 'matter', 'love', 'time', 'money', ..."
...,...,...,...,...
49995,thought movie right good job want creative ori...,positive,"['thought', 'movie', 'right', 'good', 'job', '...","['thought', 'movie', 'right', 'good', 'job', '..."
49996,bad plot bad dialogue bad acting idiotic direc...,negative,"['bad', 'plot', 'bad', 'dialogue', 'bad', 'act...","['bad', 'plot', 'bad', 'dialogue', 'bad', 'act..."
49997,catholic taught parochial elementary schools n...,negative,"['catholic', 'taught', 'parochial', 'elementar...","['catholic', 'teach', 'parochial', 'elementary..."
49998,going disagree previous comment side martin on...,negative,"['going', 'disagree', 'previous', 'comment', '...","['go', 'disagree', 'previous', 'comment', 'sid..."


In [5]:
import ast

print("1. Data Preparation...")
with tqdm(total=3, desc="Preprocessing") as pbar:
    # Fill missing values first so that parsing never sees NaN
    X = df['lemmas'].fillna('[]')
    y = df['sentiment'].map({'positive': 1, 'negative': 0})
    pbar.update(1)

    # Convert stringified lists to actual lists (literal_eval only parses literals)
    X = X.apply(ast.literal_eval)
    pbar.update(1)

    # Verify data shapes
    print(f"\nData shape: {X.shape}, Target shape: {y.shape}")
    pbar.update(1)

1. Data Preparation...


Preprocessing:   0%|          | 0/3 [00:00<?, ?it/s]


Data shape: (50000,), Target shape: (50000,)
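
The `lemmas` column stores each token list as a string, so it has to be parsed back into a list before training. The sketch below shows one way to do that with `ast.literal_eval` from the standard library, which accepts only Python literals and rejects arbitrary expressions. The sample string is illustrative, written in the same form as the rows shown above.

```python
import ast

# A stringified list, in the same form as the rows of the `lemmas` column.
raw = "['one', 'reviewer', 'mention', 'watch']"

# literal_eval parses Python literals (lists, strings, numbers, ...) only.
tokens = ast.literal_eval(raw)
print(tokens)

# Anything that is not a plain literal, such as a function call, is refused.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

This matters mostly when the CSV comes from an untrusted source, since `eval` would execute whatever expression a cell contains.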


In [7]:
from gensim.models.callbacks import CallbackAny2Vec


In [8]:
class TqdmProgressCallback(CallbackAny2Vec):
    def __init__(self, epochs):
        self.epoch_pbar = tqdm(total=epochs, desc="Training Epochs", unit="epoch")
        self.epoch_losses = []

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if len(self.epoch_losses) > 0:
            delta_loss = loss - self.epoch_losses[-1]
        else:
            delta_loss = loss
        self.epoch_losses.append(loss)

        self.epoch_pbar.update(1)
        self.epoch_pbar.set_postfix({
            "delta_loss": f"{delta_loss:.2f}",
            "total_loss": f"{loss:.2f}"
        })

    def on_train_end(self, model):
        self.epoch_pbar.close()

In [9]:
print("\n2. Word2Vec Training...")
epochs = 20
progress_callback = TqdmProgressCallback(epochs)

w2v = Word2Vec(
    sentences=X,
    vector_size=300,
    window=7,
    min_count=3,
    workers=4,
    epochs=epochs,
    sg=1,
    hs=0,
    negative=10,
    compute_loss=True,
    callbacks=[progress_callback]  # Attach the callback
)

print(f"\nVocab size: {len(w2v.wv)} | Final loss: {w2v.get_latest_training_loss():.2f}")


2. Word2Vec Training...


Training Epochs:   0%|          | 0/20 [00:00<?, ?epoch/s]


Vocab size: 32067 | Final loss: 86382608.00
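
The callback above treats the value returned by `get_latest_training_loss()` as a running total and subtracts the previous reading to get the loss for a single epoch. The same bookkeeping can be sketched in isolation as follows. The readings are hypothetical values chosen for illustration, not output from this notebook, and `per_epoch_losses` is a helper name introduced here.

```python
# Hypothetical cumulative loss readings, one per epoch (illustrative values only).
cumulative = [5_000_000.0, 9_200_000.0, 13_100_000.0]

def per_epoch_losses(cumulative_losses):
    """Turn running totals into the loss accumulated within each epoch."""
    previous = 0.0
    deltas = []
    for total in cumulative_losses:
        deltas.append(total - previous)
        previous = total
    return deltas

epoch_losses = per_epoch_losses(cumulative)
print(epoch_losses)
```

Read this way, the "Final loss" printed above is the total accumulated over all 20 epochs, not the loss of the last epoch alone.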


In [10]:
# 3. Document Vectorization
print("\n3. Vectorization...")
def document_vector(tokens):
    vectors = []
    for word in tokens:
        if word in w2v.wv:
            weight = np.log(1 + w2v.wv.get_vecattr(word, 'count'))
            vectors.append(w2v.wv[word] * weight)
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

# Vectorize every document sequentially, with a progress bar
X_vec = np.array([document_vector(doc) for doc in tqdm(X, desc="Vectorizing Docs")])


3. Vectorization...


Vectorizing Docs:   0%|          | 0/50000 [00:00<?, ?it/s]
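
To make the weighting in `document_vector` concrete, here is a small sketch of the same formula on toy data. The 3-dimensional vectors, the counts, and the name `toy_document_vector` are assumptions made up for illustration; the real notebook reads both vectors and counts from the trained `w2v.wv`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings and corpus counts (illustrative values only).
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}
toy_counts = {"good": 9, "movie": 99}

def toy_document_vector(tokens, vectors, counts, dim=3):
    """Average log-count-weighted vectors, skipping out-of-vocabulary tokens."""
    weighted = [vectors[t] * np.log(1 + counts[t]) for t in tokens if t in vectors]
    return np.mean(weighted, axis=0) if weighted else np.zeros(dim)

doc_vec = toy_document_vector(["good", "movie", "unseen"], toy_vectors, toy_counts)
empty_vec = toy_document_vector(["unseen"], toy_vectors, toy_counts)
print(doc_vec)
print(empty_vec)
```

Two properties are worth noting: a document with no in-vocabulary tokens maps to the zero vector, and the `log(1 + count)` weight gives frequent words a larger contribution than rare ones.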

In [11]:
print("\n4. Building Pipeline...")
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('xgb', XGBClassifier(
        n_jobs=4,
        tree_method='hist',
        enable_categorical=True,
        eval_metric='logloss'
    ))
])



4. Building Pipeline...


In [24]:
def update_progress(iteration, pbar, random_search):
    pbar.update(1)
    if iteration % 5 == 0:  # Update stats every 5 iterations
        current_best = random_search.cv_results_['mean_test_score'].max()
        pbar.set_postfix({'best_f1': f"{current_best:.4f}"})

In [28]:
def run_randomized_search(pipeline, X, y):
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

    param_dist = {
        'xgb__max_depth': np.arange(3, 8),
        'xgb__learning_rate': np.logspace(-3, -1, 100),
        'xgb__subsample': np.linspace(0.5, 1.0, 6),
        'xgb__colsample_bytree': np.linspace(0.5, 1.0, 6)
    }

    n_iter = 50
    pbar = tqdm(total=n_iter, desc="Randomized Search")

    # scikit-learn exposes no per-candidate hook here, so the bar is advanced
    # once the search has finished rather than after each iteration.
    with joblib.parallel_backend('loky'):
        random_search = RandomizedSearchCV(
            estimator=pipeline,
            param_distributions=param_dist,
            n_iter=n_iter,
            cv=skf,
            scoring='f1_weighted',
            verbose=0,
            n_jobs=2,
            random_state=42
        )
        random_search.fit(X, y)
        pbar.update(n_iter)

    pbar.close()
    return random_search

In [31]:
X_train, X_test, y_train, y_test = train_test_split(
    X_vec, y, test_size=0.2, random_state=42, stratify=y
)

In [32]:
random_search = run_randomized_search(pipeline, X_train, y_train)

Randomized Search:   0%|          | 0/50 [00:00<?, ?it/s]

In [33]:
print("\nBest Parameters:", random_search.best_params_)
print("Best F1 Score: {:.4f}".format(random_search.best_score_))


Best Parameters: {'xgb__subsample': 0.5, 'xgb__max_depth': 7, 'xgb__learning_rate': 0.09545484566618342, 'xgb__colsample_bytree': 0.8}
Best F1 Score: 0.8647
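
One quick sanity check on a randomized search is whether the winning values sit at the edge of the ranges that were searched, which can suggest that a range is worth widening. The sketch below repeats the search space from `run_randomized_search` and the best parameters printed above; `at_boundary` is a helper name introduced here for illustration.

```python
import numpy as np

# Search space copied from run_randomized_search above.
param_dist = {
    "xgb__max_depth": np.arange(3, 8),
    "xgb__learning_rate": np.logspace(-3, -1, 100),
    "xgb__subsample": np.linspace(0.5, 1.0, 6),
    "xgb__colsample_bytree": np.linspace(0.5, 1.0, 6),
}

# Best parameters as printed by the search.
best_params = {
    "xgb__subsample": 0.5,
    "xgb__max_depth": 7,
    "xgb__learning_rate": 0.09545484566618342,
    "xgb__colsample_bytree": 0.8,
}

def at_boundary(name):
    """Return True if the best value equals the smallest or largest candidate."""
    values = param_dist[name]
    best = best_params[name]
    return bool(np.isclose(best, values.min()) or np.isclose(best, values.max()))

boundary_flags = {name: at_boundary(name) for name in param_dist}
n_combinations = int(np.prod([len(v) for v in param_dist.values()]))

for name, flag in boundary_flags.items():
    print(f"{name}: {'at boundary' if flag else 'interior'}")
print(f"Candidate combinations: {n_combinations}")
```

The count of candidate combinations also puts `n_iter = 50` in perspective: the search samples only a small fraction of the grid.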


In [35]:
best_model = random_search.best_estimator_

In [34]:
from sklearn.metrics import classification_report, accuracy_score


In [38]:
y_pred = best_model.predict(X_test)


print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print(f"\nTest Set Accuracy: {accuracy_score(y_test, y_pred):.4f}")


Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.86      0.86      5000
           1       0.86      0.87      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000


Test Set Accuracy: 0.8655
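
The search was scored with `f1_weighted`, which averages the per-class F1 scores in proportion to each class's support. As a rough check, that average can be recomputed from the per-class rows of the report above. The inputs are already rounded to two decimals, so the result is approximate.

```python
# Per-class F1 and support as printed in the classification report above.
report = {
    0: {"f1": 0.86, "support": 5000},
    1: {"f1": 0.87, "support": 5000},
}

def weighted_f1(per_class):
    """Support-weighted mean of the per-class F1 scores."""
    total = sum(c["support"] for c in per_class.values())
    return sum(c["f1"] * c["support"] for c in per_class.values()) / total

def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores."""
    return sum(c["f1"] for c in per_class.values()) / len(per_class)

w_f1 = weighted_f1(report)
m_f1 = macro_f1(report)
print(f"weighted F1: {w_f1:.3f} | macro F1: {m_f1:.3f}")
```

Because the test set is balanced (5000 reviews per class), the weighted and macro averages coincide here; they would differ on an imbalanced split.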


In [39]:
y_test

Unnamed: 0,sentiment
18870,0
39791,0
30381,1
42294,0
33480,0
...,...
3634,0
47910,0
16086,0
48294,1
