<a href="https://colab.research.google.com/github/ChrisLouis9913/ICTExit/blob/main/FinalTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import os
import json
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import joblib
import gensim
from gensim.models import Word2Vec
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
def clean_text(text):
    if pd.isnull(text):
        return ""
    text = text.lower()
    # remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # keep only letters and numbers
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def simple_tokenize(text):
    return [t for t in nltk.word_tokenize(text) if t not in STOPWORDS]

In [6]:
df = pd.read_csv("/content/Reviews.csv")  # path to the Kaggle dataset file
print(df.shape)
df = df[['Text', 'Score']].dropna().copy()
# Map Score to binary label: 4-5 -> Positive (1), 1-2 -> Negative (0), drop 3
df = df[df['Score'] != 3]
df['label'] = (df['Score'] >= 4).astype(int)
df['clean_text'] = df['Text'].apply(clean_text)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
print("Examples:", df.shape)
X = df['clean_text'].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

(6263, 10)
Examples: (5784, 4)


### 1) Baseline Model (TF-IDF + Logistic Regression)
Workflow: Implement a classical NLP pipeline. This involves text cleaning (e.g., lowercasing, removing stop words), vectorizing the text using TF-IDF, and training a Logistic Regression model on the vectorized data. Analytical Question: In a markdown cell, explain why TF-IDF is often a better choice for text classification than a simple Bag of Words (Count Vectorizer).

In [7]:
tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

lr = LogisticRegression(max_iter=1000, class_weight='balanced', solver='saga')
lr.fit(X_train_tfidf, y_train)
y_pred_lr = lr.predict(X_test_tfidf)
y_prob_lr = lr.predict_proba(X_test_tfidf)[:,1]

print("TF-IDF + Logistic Regression")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("F1:", f1_score(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_lr))

TF-IDF + Logistic Regression
Accuracy: 0.8928262748487468
F1: 0.9345300950369588
ROC-AUC: 0.9349349349349348


TF-IDF reduces the influence of very common words across the corpus by down-weighting terms that occur frequently in many documents, while up-weighting terms that are more specific to particular documents. This helps classifiers focus on discriminative words rather than those that are common but not informative. A simple Bag of Words (count vectorizer) treats all words equally, so frequent but uninformative tokens may dominate, and the model can overfit to high-frequency tokens that don't carry class information. TF-IDF also normalizes by document length implicitly (depending on implementation), which helps when reviews vary in length.

### 2) Word Embedding Model (Word2Vec + Random Forest)
Workflow: Train a Word2Vec model on your corpus of review text to learn custom word embeddings. Create a feature vector for each review by averaging the Word2Vec vectors of the words within it. Train a Random Forest classifier on these averaged feature vectors. Analytical Question: In a markdown cell, what is one key advantage of using Word2Vec embeddings over TF-IDF for capturing the meaning of a text?

In [8]:
# Prepare tokenized lists for Word2Vec
sentences_train = [simple_tokenize(text) for text in X_train]
# Train Word2Vec on training sentences
w2v_model = Word2Vec(
    sentences=sentences_train,
    vector_size=100,
    window=5,
    min_count=5,
    workers=4,
    epochs=10,
    seed=42
)
w2v_model.save("w2v_model.kv")  # save

# Function to average word vectors for a document
def doc_vector_avg(tokens, model, vector_size=100):
    vecs = []
    for t in tokens:
        if t in model.wv:
            vecs.append(model.wv[t])
    if len(vecs) == 0:
        return np.zeros(vector_size)
    return np.mean(vecs, axis=0)

# Build averaged vectors
X_train_w2v = np.vstack([doc_vector_avg(simple_tokenize(t), w2v_model) for t in X_train])
X_test_w2v = np.vstack([doc_vector_avg(simple_tokenize(t), w2v_model) for t in X_test])

rf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced', n_jobs=-1)
rf.fit(X_train_w2v, y_train)
y_pred_rf = rf.predict(X_test_w2v)
y_prob_rf = rf.predict_proba(X_test_w2v)[:,1]

print("Word2Vec (avg) + RandomForest")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("F1:", f1_score(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_rf))

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


Word2Vec captures semantic relationships between words: words with similar contexts have similar vectors. This allows the model to generalize across synonyms and related expressions, capturing semantic similarity that TF-IDF (a sparse, frequency-based representation) does not. TF-IDF treats each token as independent and cannot capture similarity between different tokens.

### 3) Deep Learning Model (RNN/LSTM)
Workflow: Implement a simple sequential model using a Recurrent Neural Network (RNN) or an LSTM layer for the same sentiment classification task. This will require tokenizing the text and padding sequences. Analytical Question: In a markdown cell, why is an LSTM often preferred over a simple RNN for text classification tasks? (Hint: Mention the vanishing gradient problem).

In [9]:
# Tokenize and pad sequences
MAX_WORDS = 20000
MAX_LEN = 200
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=MAX_LEN, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=MAX_LEN, padding='post', truncating='post')

# Build a simple LSTM
tf.random.set_seed(42)
model = Sequential([
    Embedding(input_dim=MAX_WORDS, output_dim=128, input_length=MAX_LEN),
    Bidirectional(LSTM(128, return_sequences=False)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
mc = ModelCheckpoint('lstm_model.h5', save_best_only=True, monitor='val_loss')

history = model.fit(
    X_train_pad, y_train,
    validation_split=0.1,
    epochs=6,
    batch_size=256,
    callbacks=[es, mc]
)

y_prob_lstm = model.predict(X_test_pad, batch_size=512).ravel()
y_pred_lstm = (y_prob_lstm >= 0.5).astype(int)
print("LSTM")
print("Accuracy:", accuracy_score(y_test, y_pred_lstm))
print("F1:", f1_score(y_test, y_pred_lstm))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_lstm))



Epoch 1/6
[1m16/17[0m [32m━━━━━━━━━━━━━━━━━━[0m[37m━━[0m [1m0s[0m 54ms/step - accuracy: 0.7443 - loss: 0.5880



[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 89ms/step - accuracy: 0.7521 - loss: 0.5797 - val_accuracy: 0.8510 - val_loss: 0.4261
Epoch 2/6
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step - accuracy: 0.8258 - loss: 0.4676



[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 59ms/step - accuracy: 0.8265 - loss: 0.4658 - val_accuracy: 0.8510 - val_loss: 0.3694
Epoch 3/6
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step - accuracy: 0.8319 - loss: 0.3955



[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 59ms/step - accuracy: 0.8332 - loss: 0.3925 - val_accuracy: 0.8683 - val_loss: 0.3064
Epoch 4/6
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 49ms/step - accuracy: 0.9085 - loss: 0.2369 - val_accuracy: 0.8683 - val_loss: 0.3343
Epoch 5/6
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 49ms/step - accuracy: 0.9546 - loss: 0.1345 - val_accuracy: 0.8942 - val_loss: 0.4180
Epoch 6/6
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 49ms/step - accuracy: 0.9738 - loss: 0.0859 - val_accuracy: 0.8898 - val_loss: 0.4270
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 105ms/step
LSTM
Accuracy: 0.8366464995678479
F1: 0.9052631578947369
ROC-AUC: 0.834128573017462


Simple RNNs suffer from the vanishing (and exploding) gradient problems when learning long-range dependencies across time steps. Gradients propagated through many time steps can shrink exponentially, preventing the network from learning dependencies that span long distances. LSTM (Long Short-Term Memory) networks use gated mechanisms (input, forget, and output gates) that control the flow of information and gradients, allowing them to preserve and propagate relevant information over longer sequences and mitigate the vanishing gradient issue. Thus, LSTMs are better at capturing longer context in text.

### 4) Comparative Analysis and Recommendation
Workflow: Thoroughly evaluate all three of your models (TF-IDF, Word2Vec, RNN/LSTM) on the test set using Accuracy, F1-Score, and ROC-AUC. Create a comparison table in your notebook summarizing the performance of the three models. Analytical Question: Based on your results, which model would you recommend for deployment? Justify your choice by considering not just the performance metrics but also the trade-offs in model complexity, training time, and interpretability.

In [10]:
results = {
    "tfidf_lr": {
        "accuracy": accuracy_score(y_test, y_pred_lr),
        "f1": f1_score(y_test, y_pred_lr),
        "roc_auc": roc_auc_score(y_test, y_prob_lr)
    },
    "w2v_rf": {
        "accuracy": accuracy_score(y_test, y_pred_rf),
        "f1": f1_score(y_test, y_pred_rf),
        "roc_auc": roc_auc_score(y_test, y_prob_rf)
    },
    "lstm": {
        "accuracy": accuracy_score(y_test, y_pred_lstm),
        "f1": f1_score(y_test, y_pred_lstm),
        "roc_auc": roc_auc_score(y_test, y_prob_lstm)
    }
}
pd.DataFrame(results).T

NameError: name 'y_pred_rf' is not defined

(Example answer template)
- If LSTM gives best F1/ROC-AUC by a clear margin: Recommend deploying LSTM if the production environment can support GPU or sufficient CPU inference latency is acceptable. LSTM offers highest performance for sequences and can capture context, but training time and inference latency are higher and interpretability is low.
- If TF-IDF + Logistic has competitive performance: Recommend Logistic Regression for deployment because it is fast to train and infer, easy to scale, and highly interpretable (coefficients map to words). This is valuable for business insights.
- If Word2Vec + RandomForest performs best: It can strike a balance between capturing semantics and being more interpretable than deep models (though not as interpretable as Logistic due to averaged embeddings). RandomForest inference and model size may be heavier but still manageable.

Considerations:
- Complexity and cost: LSTM requires more compute for training and often for inference.
- Interpretability: Logistic Regression wins.
- Training time: TF-IDF + Logistic < Word2Vec+RF < LSTM.
- If performance gains of the complex model are marginal, prefer the simpler model (Logistic) for production.

### 5) Deployment with Streamlit
Workflow: Save your recommended model and any necessary vectorizers or tokenizers. Build a Streamlit application where a user can type or paste a new product review into a text area. The app must use your saved model to predict the sentiment ("Positive" or "Negative") and display the result clearly to the user.

In [11]:
# Save artifacts
joblib.dump(tfidf, "tfidf_vectorizer.joblib")
joblib.dump(lr, "logistic_model.joblib")
joblib.dump(rf, "rf_w2v.joblib")
w2v_model.save("w2v_model.kv")
joblib.dump(tokenizer, "tokenizer.joblib")
model.save("lstm_model.h5")

# Save results/metadata (choose best by F1)
best_name = max(results.keys(), key=lambda k: results[k]['f1'])
meta = {
    "best_model": best_name,
    "results": results,
    "files": {
        "tfidf_vectorizer": "tfidf_vectorizer.joblib",
        "logistic": "logistic_model.joblib",
        "rf_w2v": "rf_w2v.joblib",
        "w2v_model": "w2v_model.kv",
        "tokenizer": "tokenizer.joblib",
        "lstm": "lstm_model.h5"
    }
}
with open("best_model_info.json", "w") as f:
    json.dump(meta, f, indent=2)
print("Best model:", best_name)

NameError: name 'rf' is not defined

### 6) BONUS TASK : Model Interpretation
Workflow: For your best-performing classical model (either Logistic Regression or Random Forest), implement a feature to show why it made a certain prediction. For Logistic Regression, you can extract the top 10 words with the highest TF-IDF scores that contributed most to the "Positive" and "Negative" predictions. Analytical Question: In a markdown cell, explain how providing this "word importance" feature can help the marketing team gain actionable insights beyond a simple sentiment score.

In [12]:
def explain_logistic_prediction(text, tfidf, lr, top_k=10):
    ct = clean_text(text)
    x = tfidf.transform([ct])
    feature_names = np.array(tfidf.get_feature_names_out())
    coefs = lr.coef_[0]  # coef for positive class
    # contributions = tfidf_value * coef
    contributions = x.toarray()[0] * coefs
    top_pos_idx = np.argsort(contributions)[-top_k:][::-1]
    top_neg_idx = np.argsort(contributions)[:top_k]
    top_pos = list(zip(feature_names[top_pos_idx], contributions[top_pos_idx]))
    top_neg = list(zip(feature_names[top_neg_idx], contributions[top_neg_idx]))
    return top_pos, top_neg

example_text = "This product was excellent and arrived quickly, works as expected."
top_pos, top_neg = explain_logistic_prediction(example_text, tfidf, lr)
print("Top positive contributors:", top_pos)
print("Top negative contributors:", top_neg)

Top positive contributors: [('excellent', np.float64(0.32047867439298794)), ('works', np.float64(0.20987430794116185)), ('quickly', np.float64(0.1368939746467566)), ('and', np.float64(0.11926780396562568)), ('arrived', np.float64(0.08937940784907325)), ('as', np.float64(0.07314760659899573)), ('excellent and', np.float64(0.06261109022993493)), ('arrived quickly', np.float64(0.057285909541440284)), ('was excellent', np.float64(0.03336595285591319)), ('and arrived', np.float64(0.025016617479117226))]
Top negative contributors: [('was', np.float64(-0.2559954773522665)), ('product was', np.float64(-0.10616172464473478)), ('product', np.float64(-0.06130739003241431)), ('as expected', np.float64(-0.05141370802906511)), ('expected', np.float64(-0.04077330378155722)), ('this', np.float64(-0.024639273277574328)), ('works as', np.float64(-0.005436921195390629)), ('this product', np.float64(-0.003915895317916437)), ('reddish', np.float64(-0.0)), ('reduces', np.float64(0.0))]


Providing word-level contributions (word importance) turns a single numeric sentiment score into actionable, interpretable reasons. Marketing can see which specific terms drive positive or negative sentiment (e.g., "arrived quickly", "easy to use", "broken", "poor quality"). This helps:
- Identify recurring product issues (words associated with negative predictions).
- Surface product features or service aspects that customers praise.
- Inform targeted copywriting or FAQ updates by emphasizing terms customers value.
- Prioritize fixes by frequency + impact: words that occur often and contribute strongly to negative sentiment are high priority.
- Measure changes over time: track whether certain negative keywords become less important after UX or product changes.
Interpretability fosters trust and enables data-driven actions rather than opaque model outputs.