
# IMDB Sentiment Analysis — Dual Pipeline (TF‑IDF + BiLSTM)

This notebook builds an end‑to‑end **IMDB review classifier** with two complementary approaches:

1) **Classical ML**: TF‑IDF (word + char) → Logistic Regression (5‑fold CV)
2) **Deep Learning**: Tokenizer + **BiLSTM** (5‑fold CV with EarlyStopping)

**Preprocessing uses NLTK + spaCy** (tokenization, cleaning, lemmatization) with graceful fallbacks.
Finally, we run a brief **error analysis** (negation/sarcasm) and define a small **inference** function.

> Indicative metrics (dataset and runtime dependent): in the run recorded below, TF‑IDF + LR reaches accuracy ≈ 0.90 and F1 ≈ 0.90; the BiLSTM reaches accuracy ≈ 0.89 and F1 ≈ 0.89.



## How to Run (Kaggle or Local)

**Dataset options** (pick one):
- Kaggle: Add **“IMDB Dataset of 50K Movie Reviews”** to the notebook (usually at
  `/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv`).
- Local: Put `IMDB Dataset.csv` next to this notebook.

> If the CSV is missing, the notebook stops with a clear message.

**Compute**:
- For the BiLSTM, enable **GPU** (e.g., Kaggle → *Settings* → *Accelerator* → GPU). CPU still works, just slower.

**Libraries**:
- We rely on common Kaggle defaults: `numpy`, `pandas`, `scikit-learn`, `tensorflow/keras`, `nltk`, `spacy`.
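- If `en_core_web_sm` isn’t available, we fall back to `spacy.blank("en")`, which tokenizes but does not lemmatize. The **NLTK WordNetLemmatizer** path runs only when spaCy cannot be imported or returns no tokens.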


In [1]:

# === Imports & Setup ===
import os, re, gc, sys, math, json, random, string, html
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Try spaCy gracefully
try:
    import spacy
    try:
        nlp = spacy.load("en_core_web_sm", disable=["ner", "parser", "textcat"])
        SPACY_MODE = "en_core_web_sm"
    except Exception:
        nlp = spacy.blank("en")
        SPACY_MODE = "blank_en"
except Exception as e:
    spacy = None
    nlp = None
    SPACY_MODE = "not_available"

# TensorFlow / Keras for the LSTM model
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, GlobalMaxPool1D
from tensorflow.keras.callbacks import EarlyStopping

SEED = 42
random.seed(SEED); np.random.seed(SEED); tf.random.set_seed(SEED)

print("Python:", sys.version)
print("TF:", tf.__version__)
print("spaCy:", spacy.__version__ if spacy else "N/A", "| mode:", SPACY_MODE)

# Ensure NLTK data
try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    nltk.download("stopwords")
try:
    nltk.data.find("corpora/wordnet")
except LookupError:
    nltk.download("wordnet")
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

EN_STOP = set(stopwords.words("english"))
WN_LEMM = WordNetLemmatizer()




Python: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
TF: 2.18.0
spaCy: 3.8.7 | mode: en_core_web_sm


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:

# === Load IMDB Dataset (CSV) ===
# Expected columns: 'review', 'sentiment' where sentiment ∈ {'positive', 'negative'}

POSSIBLE_PATHS = [
    "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv",
    "./IMDB Dataset.csv",
    "../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv",
]

data_path = None
for p in POSSIBLE_PATHS:
    if os.path.exists(p):
        data_path = p
        break

if not data_path:
    raise FileNotFoundError(
        "IMDB Dataset CSV not found. Please add the Kaggle dataset "
        "('IMDB Dataset of 50K Movie Reviews') or place 'IMDB Dataset.csv' beside the notebook."
    )

df = pd.read_csv(data_path)
df = df.rename(columns={c: c.strip().lower() for c in df.columns})
assert "review" in df.columns and "sentiment" in df.columns, "CSV must have 'review' and 'sentiment' columns."

# Map labels to {0,1}
label_map = {"negative": 0, "positive": 1}
df["label"] = df["sentiment"].map(label_map).astype(int)

print(df.shape, df["label"].value_counts())
df.head()


(50000, 3) label
1    25000
0    25000
Name: count, dtype: int64


|   | review | sentiment | label |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | `A wonderful little production. <br /><br />The...` | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |



## Preprocessing (NLTK + spaCy)

Steps:
- Lowercasing, HTML tag & URL removal, punctuation normalization
- Tokenization with **spaCy** if available; otherwise NLTK/regex
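- Lemmatization: spaCy when `en_core_web_sm` is loaded; the **NLTK WordNet** lemmatizer runs only if spaCy is unavailable or returns no tokens
- Stopword removal, intended to keep negations like *not, no, never*. As written, only the NLTK fallback keeps *not* and *no*; the spaCy path filters them out as stopwords (see the smoke-test output below)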


In [3]:

# === Text cleaning utilities ===

CONTRACTIONS = {
    "can't": "can not", "won't": "will not", "n't": " not",
    "i'm": "i am", "it's": "it is", "that's": "that is",
    "there's": "there is", "what's": "what is", "you're": "you are",
    "they're": "they are", "we're": "we are", "i've": "i have",
    "don't": "do not", "didn't": "did not", "doesn't": "does not",
    "isn't": "is not", "aren't": "are not", "wasn't": "was not", "weren't": "were not",
    "shouldn't": "should not", "wouldn't": "would not", "couldn't": "could not",
    "mustn't": "must not", "haven't": "have not", "hasn't": "has not", "hadn't": "had not"
}

NEGATION_TOKENS = {"not", "no", "never", "n't"}

PUNCT_TABLE = str.maketrans({c: f" {c} " for c in string.punctuation})

def expand_contractions(text: str) -> str:
    text_low = text.lower()
    for k, v in CONTRACTIONS.items():
        text_low = text_low.replace(k, v)
    return text_low

def basic_clean(text: str) -> str:
    text = html.unescape(str(text))
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = expand_contractions(text)
    text = text.translate(PUNCT_TABLE)  # space-pad punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

def spacy_tokenize_lemma(text: str):
    doc = nlp(text) if nlp else None
    if doc is None:
        return None
    toks = []
    for t in doc:
        tok = (t.lemma_ if t.lemma_ else t.text).lower().strip()
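        # NOTE: "not" and "no" are in NLTK's English stopword list, so the
        # EN_STOP test below drops them before the negation check is reached.
        # To keep them, test `tok in NEGATION_TOKENS` first; the recorded outputs
        # in this notebook were produced with the filter exactly as written.
        if tok and tok not in EN_STOP and not tok.isdigit():
            # intended to keep negations; only non-stopword ones survive here
            if tok in NEGATION_TOKENS or tok.isalpha():
                toks.append(tok)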
    return toks

def nltk_tokenize_lemma(text: str):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    out = []
    for w in words:
        if w in EN_STOP and w not in NEGATION_TOKENS:
            continue
        lemma = WN_LEMM.lemmatize(w)
        if lemma:
            out.append(lemma)
    return out

def preprocess_text(text: str):
    text = basic_clean(text)
    # Try spaCy first; blank("en") still tokenizes (without lemmas), so the
    # NLTK path runs only when spaCy is missing or yields no tokens
    toks = spacy_tokenize_lemma(text) if nlp is not None else None
    if not toks:
        toks = nltk_tokenize_lemma(text)
    return " ".join(toks)

# Quick smoke test
print(preprocess_text("I didn't like this movie at all — it's not good! But performances weren't bad."))


like movie good performance bad
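
The smoke-test output above contains no negations, because `not` is in NLTK's English stopword list and the spaCy branch tests `EN_STOP` first. A minimal sketch of a filter that checks negations before stopwords follows. `filter_tokens` and the tiny `STOP` set are illustrative stand-ins, not part of the notebook, and rerunning the pipeline with such a filter would change the metrics recorded below.

```python
# Hypothetical stand-ins for EN_STOP and NEGATION_TOKENS (tiny subsets for illustration).
STOP = {"i", "did", "this", "at", "all", "it", "is", "but", "were", "not", "no"}
NEGATIONS = {"not", "no", "never", "n't"}

def filter_tokens(tokens, stop=STOP, negations=NEGATIONS):
    """Drop stopwords and non-alphabetic tokens, but always keep negations."""
    kept = []
    for tok in tokens:
        tok = tok.lower().strip()
        if tok in negations:          # negation check comes first
            kept.append(tok)
        elif tok and tok not in stop and tok.isalpha():
            kept.append(tok)
    return kept

tokens = "i did not like this movie at all".split()
print(filter_tokens(tokens))  # ['not', 'like', 'movie']
```

Checking the negation set first means a token such as `not` survives even though it is also a stopword.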


In [4]:

# This can take a couple of minutes; feel free to subsample while experimenting.
df["text"] = df["review"].astype(str).apply(preprocess_text)
df = df[["text", "label"]].dropna().reset_index(drop=True)
print(df.shape)
df.head()


(50000, 2)


|   | text | label |
|---|---|---|
| 0 | one reviewer mention watch oz episode hook rig... | 1 |
| 1 | wonderful little production filming technique ... | 1 |
| 2 | think wonderful way spend time hot summer week... | 1 |
| 3 | basically family little boy jake think zombie ... | 0 |
| 4 | petter mattei love time money visually stunnin... | 1 |



## Model A — TF‑IDF + Logistic Regression (5‑fold CV)

We use both **word n‑grams (1–2)** and **character n‑grams (3–5)** for robustness to misspellings.
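
To see why character n‑grams help with misspellings, compare the feature overlap between a word and a misspelled variant. This is a plain-Python illustration, not scikit-learn's analyzer; `char_ngrams` and `jaccard` are hypothetical helpers introduced only for this sketch.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """Return the set of character n-grams of length n_min..n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

correct, typo = "brilliant", "briliant"
word_overlap = jaccard({correct}, {typo})   # whole-word features share nothing
char_overlap = jaccard(char_ngrams(correct), char_ngrams(typo))
print(f"word overlap: {word_overlap:.2f}, char n-gram overlap: {char_overlap:.2f}")
```

A word-level vectorizer treats the typo as an unrelated token, while the two spellings still share several character n-grams such as `bri` and `ant`.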


In [5]:

tfidf_word = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=200_000, sublinear_tf=True)
tfidf_char = TfidfVectorizer(analyzer="char", ngram_range=(3,5), min_df=2, max_features=100_000, sublinear_tf=True)

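from scipy.sparse import hstack

def build_features(texts):
    # Fits both vectorizers on `texts` and returns the stacked sparse matrix.
    Xw = tfidf_word.fit_transform(texts)
    Xc = tfidf_char.fit_transform(texts)
    return hstack([Xw, Xc]).tocsr()

# NOTE: the vectorizers are fit on all reviews before the CV split, so each
# validation fold contributes to the vocabulary and IDF weights (no labels leak).
X = build_features(df["text"])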
y = df["label"].values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof_pred = np.zeros(len(df))
fold_metrics = []

for fold, (tr, va) in enumerate(skf.split(X, y), 1):
    clf = LogisticRegression(
        max_iter=2000,
        n_jobs=-1,
        solver="saga",
        class_weight="balanced",
        C=2.0
    )
    clf.fit(X[tr], y[tr])
    p = clf.predict(X[va])
    acc = accuracy_score(y[va], p)
    f1 = f1_score(y[va], p)
    fold_metrics.append((acc, f1))
    oof_pred[va] = p
    print(f"[Fold {fold}] ACC={acc:.4f}  F1={f1:.4f}")

acc_mean = np.mean([m[0] for m in fold_metrics])
f1_mean  = np.mean([m[1] for m in fold_metrics])
print(f"\nTF-IDF + LR 5-fold → ACC={acc_mean:.4f}  F1={f1_mean:.4f}")

print("\nClassification Report (OOF):")
print(classification_report(y, oof_pred))

cm = confusion_matrix(y, oof_pred)
cm


[Fold 1] ACC=0.9031  F1=0.9038
[Fold 2] ACC=0.9038  F1=0.9044
[Fold 3] ACC=0.9023  F1=0.9031
[Fold 4] ACC=0.9005  F1=0.9017
[Fold 5] ACC=0.8999  F1=0.9008

TF-IDF + LR 5-fold → ACC=0.9019  F1=0.9028

Classification Report (OOF):
              precision    recall  f1-score   support

           0       0.91      0.89      0.90     25000
           1       0.90      0.91      0.90     25000

    accuracy                           0.90     50000
   macro avg       0.90      0.90      0.90     50000
weighted avg       0.90      0.90      0.90     50000



array([[22335,  2665],
       [ 2239, 22761]])
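
The headline numbers can be recovered from the confusion matrix above, which is a useful sanity check. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
tn, fp = 22335, 2665   # true negatives, false positives
fn, tp = 2239, 22761   # false negatives, true positives

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall    = tp / (tp + fn)   # share of truly positive reviews that were found
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The pooled F1 computed this way can differ slightly from the mean of the per-fold F1 scores, because averaging ratios is not the same as taking the ratio of pooled counts.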


## Model B — BiLSTM (5‑fold CV with EarlyStopping)

We tokenize the **cleaned text**, pad or truncate each sequence to 200 tokens, and train a small **BiLSTM** per fold.
To keep runtime manageable, each fold trains for at most 3 epochs with a batch size of 1024 (tune as you like).
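
As a sketch of what the padding step does, the helper below mimics post-padding and post-truncation on integer sequences in plain Python. `pad_post` is a hypothetical stand-in for illustration; the notebook itself uses Keras `pad_sequences`.

```python
def pad_post(seqs, maxlen, value=0):
    """Keep the first `maxlen` ids of each sequence and right-pad with `value`."""
    out = []
    for seq in seqs:
        trimmed = list(seq[:maxlen])                  # truncating="post" drops the tail
        trimmed += [value] * (maxlen - len(trimmed))  # padding="post" appends zeros
        out.append(trimmed)
    return out

print(pad_post([[5, 8, 2, 9, 4], [7, 3]], maxlen=4))  # [[5, 8, 2, 9], [7, 3, 0, 0]]
```

With post-truncation, reviews longer than the maximum length lose their tail, so a verdict stated at the very end of a long review is not seen by the model.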


In [6]:

# Tokenizer
VOCAB_SIZE = 30000
MAX_LEN    = 200

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(df["text"].tolist())

def build_bilstm():
    model = Sequential([
        Embedding(VOCAB_SIZE, 128),  # `input_length` is deprecated in Keras 3 and not needed here
        Bidirectional(LSTM(64, return_sequences=True)),
        GlobalMaxPool1D(),
        Dropout(0.2),
        Dense(64, activation="relu"),
        Dropout(0.2),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def seqs_from_text(texts):
    seqs = tokenizer.texts_to_sequences(texts)
    return pad_sequences(seqs, maxlen=MAX_LEN, padding="post", truncating="post")

X_seq = seqs_from_text(df["text"])
y = df["label"].values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof_prob_lstm = np.zeros(len(df))
fold_metrics_lstm = []

EPOCHS = 3
BATCH  = 1024

for fold, (tr, va) in enumerate(skf.split(X_seq, y), 1):
    model = build_bilstm()
    # patience=2 with EPOCHS=3 never shortens training; it mainly restores the best epoch's weights
    es = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True, verbose=1)
    hist = model.fit(
        X_seq[tr], y[tr],
        validation_data=(X_seq[va], y[va]),
        epochs=EPOCHS,
        batch_size=BATCH,
        verbose=1,
        callbacks=[es]
    )
    prob = model.predict(X_seq[va], batch_size=2048).ravel()
    oof_prob_lstm[va] = prob
    pred = (prob >= 0.5).astype(int)
    acc = accuracy_score(y[va], pred)
    f1 = f1_score(y[va], pred)
    fold_metrics_lstm.append((acc, f1))
    print(f"[BiLSTM Fold {fold}] ACC={acc:.4f}  F1={f1:.4f}")

acc_mean = np.mean([m[0] for m in fold_metrics_lstm])
f1_mean  = np.mean([m[1] for m in fold_metrics_lstm])
print(f"\nBiLSTM 5-fold → ACC={acc_mean:.4f}  F1={f1_mean:.4f}")


I0000 00:00:1760721950.972838      37 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15513 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0


Epoch 1/3
I0000 00:00:1760721957.094879     134 cuda_dnn.cc:529] Loaded cuDNN version 90300
40/40 ━━━━━━━━━━━━━━━━━━━━ 12s 113ms/step - accuracy: 0.6470 - loss: 0.6500 - val_accuracy: 0.8432 - val_loss: 0.3769
Epoch 2/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.8607 - loss: 0.3360 - val_accuracy: 0.8829 - val_loss: 0.2828
Epoch 3/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.9207 - loss: 0.2123 - val_accuracy: 0.8847 - val_loss: 0.2831
Restoring model weights from the end of the best epoch: 2.
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 116ms/step
[BiLSTM Fold 1] ACC=0.8829  F1=0.8808
Epoch 1/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 9s 129ms/step - accuracy: 0.6068 - loss: 0.6659 - val_accuracy: 0.8184 - val_loss: 0.4448
Epoch 2/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.8378 - loss: 0.3909 - val_accuracy: 0.8858 - val_loss: 0.2830
Epoch 3/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.9073 - loss: 0.2443 - val_accuracy: 0.8942 - val_loss: 0.2797
Restoring model weights from the end of the best epoch: 3.
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 119ms/step
[BiLSTM Fold 2] ACC=0.8942  F1=0.8950
Epoch 1/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 8s 111ms/step - accuracy: 0.6277 - loss: 0.6617 - val_accuracy: 0.8208 - val_loss: 0.4267
Epoch 2/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.8532 - loss: 0.3627 - val_accuracy: 0.8872 - val_loss: 0.2843
Epoch 3/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.9139 - loss: 0.2270 - val_accuracy: 0.8907 - val_loss: 0.2870
Restoring model weights from the end of the best epoch: 2.
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 116ms/step
[BiLSTM Fold 3] ACC=0.8872  F1=0.8885
Epoch 1/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 9s 112ms/step - accuracy: 0.6308 - loss: 0.6440 - val_accuracy: 0.8576 - val_loss: 0.3369
Epoch 2/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.8819 - loss: 0.2908 - val_accuracy: 0.8804 - val_loss: 0.3065
Epoch 3/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 100ms/step - accuracy: 0.9283 - loss: 0.1941 - val_accuracy: 0.8838 - val_loss: 0.3195
Restoring model weights from the end of the best epoch: 2.
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 118ms/step
[BiLSTM Fold 4] ACC=0.8804  F1=0.8808
Epoch 1/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 9s 112ms/step - accuracy: 0.6077 - loss: 0.6616 - val_accuracy: 0.8215 - val_loss: 0.4259
Epoch 2/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 101ms/step - accuracy: 0.8501 - loss: 0.3657 - val_accuracy: 0.8836 - val_loss: 0.2834
Epoch 3/3
40/40 ━━━━━━━━━━━━━━━━━━━━ 4s 100ms/step - accuracy: 0.9170 - loss: 0.2220 - val_accuracy: 0.8908 - val_loss: 0.2852
Restoring model weights from the end of the best epoch: 2.
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 117ms/step
[BiLSTM Fold 5] ACC=0.8836  F1=0.8832

BiLSTM 5-fold → ACC=0.8857  F1=0.8856
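
For a sense of model size, the parameter count of `build_bilstm()` can be worked out by hand. The sketch uses the standard formulas for embedding, LSTM (four gates, with bias), and dense layers; it is a back-of-the-envelope check rather than output from `model.summary()`, and the constant names other than `VOCAB_SIZE` are illustrative.

```python
VOCAB_SIZE, EMB_DIM, LSTM_UNITS, DENSE_UNITS = 30_000, 128, 64, 64

embedding = VOCAB_SIZE * EMB_DIM
# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_dir = 4 * (LSTM_UNITS * (EMB_DIM + LSTM_UNITS) + LSTM_UNITS)
bilstm = 2 * lstm_one_dir
# GlobalMaxPool1D has no weights; it outputs 2 * LSTM_UNITS features.
dense_hidden = (2 * LSTM_UNITS) * DENSE_UNITS + DENSE_UNITS
dense_out = DENSE_UNITS * 1 + 1

total = embedding + bilstm + dense_hidden + dense_out
print(f"embedding={embedding:,} bilstm={bilstm:,} dense={dense_hidden + dense_out:,} total={total:,}")
```

The embedding table dominates the total, so the vocabulary size and embedding width are the main levers on model size.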



## Simple Ensemble (Optional)

The two models can be ensembled by averaging their predicted probabilities. The LR out-of-fold array above stores hard
labels rather than probabilities, so this notebook reports the models separately; to blend them, retrain LR and keep `predict_proba` per fold.


In [7]:

# Placeholder: no ensemble is computed in this notebook.
# A proper blend needs per-fold predict_proba from TF-IDF + LR alongside oof_prob_lstm.
pass
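
A minimal sketch of the blend, assuming out-of-fold positive-class probabilities are available for both models. The notebook only stores hard labels for LR, so `prob_lr` below is hypothetical toy data, as are `prob_lstm`, `y_true`, and the `blend` and `accuracy` helpers.

```python
def blend(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two lists of positive-class probabilities."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]

def accuracy(y_true, probs, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    preds = [int(p >= threshold) for p in probs]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)

# Toy values for illustration only (not taken from the notebook run).
y_true    = [1, 0, 1, 0]
prob_lr   = [0.90, 0.30, 0.45, 0.20]   # LR is wrong on the third example
prob_lstm = [0.70, 0.60, 0.80, 0.10]   # BiLSTM is wrong on the second example

prob_blend = blend(prob_lr, prob_lstm)
print(accuracy(y_true, prob_lr), accuracy(y_true, prob_lstm), accuracy(y_true, prob_blend))
```

The toy values are chosen so that the two models err on different examples, which is the situation where averaging helps; on real data the gain depends on how correlated the errors are.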



## Brief Error Analysis — Negation & Sarcasm

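We examine a sample of misclassifications (TF‑IDF + LR OOF predictions vs. ground truth) and flag **negations** and a few
simple **sarcasm cues** (quotes, “yeah right”, etc.). Note that the checks run on the cleaned text, where punctuation is already stripped, so cues that rely on quotes, commas, or `/s` cannot match.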


In [8]:

# Collect a sample of misclassifications from TF-IDF model
mis_idx = np.where(oof_pred != y)[0]
sample_idx = np.random.RandomState(SEED).choice(mis_idx, size=min(12, len(mis_idx)), replace=False)

def has_negation(text):
    return any(tok in NEGATION_TOKENS or "n't" in tok for tok in text.split())

SARC_PATTERNS = [r'\".*\"', r"yeah right", r"sure,", r"as if", r"/s"]

def has_sarcasm(text):
    tl = text.lower()
    return any(re.search(pat, tl) for pat in SARC_PATTERNS)

err_rows = []
for i in sample_idx:
    raw = df.loc[i, "text"]  # cleaned text; the original `review` column was dropped earlier
    true = int(y[i])
    pred = int(oof_pred[i])
    err_rows.append({
        "i": int(i),
        "true": true,
        "pred": pred,
        "negation": bool(has_negation(raw)),
        "sarcasm_hint": bool(has_sarcasm(raw)),
        "snippet": raw[:240] + ("..." if len(raw) > 240 else ""),
    })

pd.DataFrame(err_rows)


|   | i | true | pred | negation | sarcasm_hint | snippet |
|---|---|---|---|---|---|---|
| 0 | 47794 | 1 | 0 | False | False | main criticism film namely macy suddenly look ... |
| 1 | 36289 | 0 | 1 | False | False | voor een verloren soldaat lost soldier sad exa... |
| 2 | 9540 | 0 | 1 | False | False | film capture short moment mother son rural rus... |
| 3 | 29122 | 0 | 1 | False | False | acclaim director mervyn leroy put drama film c... |
| 4 | 31901 | 1 | 0 | False | False | talk blast opening trampa infernal cool openin... |
| 5 | 49013 | 0 | 1 | False | False | jason priestly star breakfast psychotic jewelr... |
| 6 | 23169 | 0 | 1 | True | False | somerset maugham write novel coal miner decide... |
| 7 | 20045 | 0 | 1 | False | False | lot matter helen none good shelley winter debb... |
| 8 | 38887 | 1 | 0 | True | False | problem version movie simple indiana jones clo... |
| 9 | 27148 | 1 | 0 | False | False | think crewe evil part well win award anything ... |
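
The `sarcasm_hint` column is `False` for every row shown, which is largely expected: the patterns are matched against cleaned text, where quotes, commas, and slashes no longer exist. The sketch below contrasts raw and cleaned input. `strip_punct` is a simplified stand-in for the notebook's cleaning, and the example sentence is made up.

```python
import re
import string

SARC_PATTERNS = [r'\".*\"', r"yeah right", r"sure,", r"as if", r"/s"]

def has_sarcasm(text):
    """True if any simple sarcasm cue matches the lowercased text."""
    lowered = text.lower()
    return any(re.search(pat, lowered) for pat in SARC_PATTERNS)

def strip_punct(text):
    """Simplified cleaning: lowercase and remove ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

raw = 'Sure, what a "masterpiece" that was.'
cleaned = strip_punct(raw)
print(has_sarcasm(raw), has_sarcasm(cleaned))  # True False
```

Running the cue detectors on the raw reviews would require keeping the original `review` column alongside the cleaned `text`.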


In [14]:
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Fit on FULL cleaned text for demo inference
tfidf_word_demo = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=200_000, sublinear_tf=True)
tfidf_char_demo = TfidfVectorizer(analyzer="char", ngram_range=(3,5), min_df=2, max_features=100_000, sublinear_tf=True)

Xw_demo = tfidf_word_demo.fit_transform(df["text"])
Xc_demo = tfidf_char_demo.fit_transform(df["text"])
X_demo = hstack([Xw_demo, Xc_demo]).tocsr()

clf_demo = LogisticRegression(max_iter=1500, n_jobs=-1, solver="saga", C=2.0)
clf_demo.fit(X_demo, df["label"].values)

print("Demo TF-IDF + LR are fitted (ready for .transform()).")


Demo TF-IDF + LR are fitted (ready for .transform()).



## Inference Helper

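`predict_texts(texts)` returns predictions from both pipelines (after fitting). For the BiLSTM, it reuses the
tokenizer and a trained model bound to `lstm_demo` (for a quick demo, the last trained fold’s model). `lstm_demo` is not defined in the cells shown here; if it is missing, the function returns `None` for the BiLSTM labels.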


In [15]:
from scipy.sparse import hstack
import pandas as pd

def _transform_features_demo(cleaned_texts):
    Xw = tfidf_word_demo.transform(cleaned_texts)
    Xc = tfidf_char_demo.transform(cleaned_texts)
    return hstack([Xw, Xc]).tocsr()

def predict_texts(texts):
    cleaned = [preprocess_text(t) for t in texts]
    Xnew = _transform_features_demo(cleaned)
    y_tfidf = clf_demo.predict(Xnew)

    # BiLSTM quick path: y_lstm stays None if `lstm_demo` or `seqs_from_text`
    # is not defined in the session (or prediction fails for any reason)
    try:
        seq = seqs_from_text(cleaned)
        y_lstm = (lstm_demo.predict(seq) >= 0.5).astype(int).ravel()
    except Exception:
        y_lstm = None

    return cleaned, y_tfidf, y_lstm

# Quick test
samples = [
    "I absolutely loved this movie, the performances were brilliant.",
    "Not good. I didn't enjoy it at all — boring and predictable."
]
cleaned, y_tf, y_lstm = predict_texts(samples)
pd.DataFrame({"text": samples, "cleaned": cleaned, "tfidf_label": y_tf, "lstm_label": y_lstm})


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step


|   | text | cleaned | tfidf_label | lstm_label |
|---|---|---|---|---|
| 0 | I absolutely loved this movie, the performance... | absolutely love movie performance brilliant | 1 | 1 |
| 1 | Not good. I didn't enjoy it at all — boring an... | good enjoy boring predictable | 0 | 1 |



## (Optional) Save/Load Artifacts

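The cell below persists the TF‑IDF vectorizers and LR model with `joblib`, the Keras Tokenizer with `pickle`, and the full BiLSTM model in HDF5 format. Only saving is shown; loading mirrors it with `joblib.load`, `pickle.load`, and `tf.keras.models.load_model`.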


In [16]:

import joblib

os.makedirs("artifacts", exist_ok=True)

# Save TF-IDF + LR
joblib.dump(tfidf_word_demo, "artifacts/tfidf_word.joblib")
joblib.dump(tfidf_char_demo, "artifacts/tfidf_char.joblib")
joblib.dump(clf_demo, "artifacts/logreg.joblib")

# Save Keras tokenizer & weights
import pickle
with open("artifacts/tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
lstm_demo.save("artifacts/bilstm.h5")

!ls -lh artifacts


total 108M
-rw-r--r-- 1 root root  46M Oct 17 17:37 bilstm.h5
-rw-r--r-- 1 root root 2.3M Oct 17 17:37 logreg.joblib
-rw-r--r-- 1 root root 7.3M Oct 17 17:37 tfidf_char.joblib
-rw-r--r-- 1 root root  50M Oct 17 17:37 tfidf_word.joblib
-rw-r--r-- 1 root root 3.4M Oct 17 17:37 tokenizer.pkl



### Notes & Tips

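- Tune `max_features`, `C`, and `ngram_range` for TF‑IDF/LogReg, and `MAX_LEN`, `EPOCHS`, the embedding size (128), and the LSTM units (64) for the BiLSTM
  to trade off speed vs. quality. The last two are hard-coded in `build_bilstm()`.
- Keep **negation tokens** in preprocessing; they matter a lot in sentiment. In the inference demo above, "Not good. I didn't enjoy it at all" is cleaned to `good enjoy boring predictable`, and the BiLSTM labels it positive.
- For better deep models, consider pretrained embeddings (GloVe) or modern transformer baselines (e.g., DistilBERT).
- Seeds improve reproducibility, but results still vary slightly by runtime and library versions.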
