
# 📝 Assignment: Intro NLP Sentiment Analysis (IMDB 50K)

**Total points: 100**  
**Estimated time:** 90–120 minutes  
**Prereqs:** You have completed the intro notebook (cleaning → BoW/TF‑IDF → simple classifier).

This assignment uses the **IMDB Dataset of 50K Movie Reviews** (binary sentiment).  
On Kaggle, add the dataset **“IMDB Dataset of 50K Movie Reviews”** and use the mounted path:
```
/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv
```
If you're running locally, download the CSV (also provided in Classroom) and set `CSV_PATH` accordingly.

> You will implement a complete classical NLP pipeline: **load → clean → vectorize → model → evaluate → analyze**.



## 🎯 Learning outcomes
By the end, you can:
- Load a real‑world text dataset and standardize labels.
- Implement minimal text cleaning (lowercase, punctuation, whitespace).
- Build **Bag‑of‑Words** and **TF‑IDF** features with scikit‑learn.
- Train/evaluate **Multinomial Naive Bayes** and **Logistic Regression**.
- Compare models and **inspect influential features**.
- Perform brief **error analysis** to inform next steps.



## ✅ Rules & tips
- Keep runtime modest: you may **subsample to N_SAMPLES = 8000** for faster iteration.
- Use **matplotlib** only (no seaborn), **one plot per cell**, and avoid specifying custom colors.
- Set `RANDOM_SEED = 42` for reproducibility.
- Write short Markdown answers where requested.
- **Do not** install heavy external libraries (no NLTK download required).


In [None]:

import os, re, string, math, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
)

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

pd.set_option("display.max_colwidth", 200)



---
## Part 0 — Load the IMDB dataset (10 pts)

**Task:** Implement `load_imdb_csv(path)` so it returns a DataFrame with columns:
- `text` (review text, `str`)
- `label` (`int`: 1 for positive, 0 for negative)

**Notes**
- The Kaggle CSV has columns **`review`** and **`sentiment`** (`"positive"`/`"negative"`).
- Map to integers (`positive→1`, `negative→0`), drop rows with missing values, reset index.
- If `n_samples` is not `None` (the cell below passes `N_SAMPLES`), randomly sample that many rows **stratified by label**.
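
As an illustration of the label mapping and stratified sampling described in the notes, here is a minimal sketch on a made-up ten-row DataFrame. It assumes the Kaggle column names listed above; `toy` and `stratified_subsample` are hypothetical names, and using `train_test_split` for the sampling is only one option (a grouped `DataFrame.sample` would also work).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle CSV (same column names, invented rows).
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# Rename columns and map the string labels to integers.
toy = toy.rename(columns={"review": "text", "sentiment": "label"})
toy["label"] = toy["label"].map({"positive": 1, "negative": 0})
toy = toy.dropna().reset_index(drop=True)


def stratified_subsample(df, n, random_state=42):
    """Hypothetical helper: keep n rows while preserving the label ratio."""
    kept, _ = train_test_split(
        df, train_size=n, stratify=df["label"], random_state=random_state
    )
    return kept.reset_index(drop=True)


sample = stratified_subsample(toy, n=4)
print(sample["label"].value_counts().to_dict())
```

Note that this approach needs `n` to be smaller than the number of rows, because `train_test_split` requires a non-empty remainder.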


In [None]:

CSV_PATH = "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
# Or use the following if you downloaded it:
# CSV_PATH = "IMDB Dataset.csv"

N_SAMPLES = 8000  # set to None for full 50K

def load_imdb_csv(path: str, n_samples=None, random_state=42) -> pd.DataFrame:
    """
    Return DataFrame with columns:
      - text: str
      - label: int (1 for positive, 0 for negative)
    Apply optional stratified sampling if n_samples is provided.
    """
    # TODO (10 pts):
    # 1) Read CSV
    # 2) Rename columns => text, label
    # 3) Map label strings to integers
    # 4) Drop NaNs and reset index
    # 5) If n_samples is not None, do a stratified sample by label
    raise NotImplementedError("Implement load_imdb_csv()")

# df = load_imdb_csv(CSV_PATH, n_samples=N_SAMPLES, random_state=RANDOM_SEED)
# df.head()
# df['label'].value_counts(normalize=True)



---
## Part 1 — Minimal text cleaning (10 pts)

Implement `clean_text(s)` to:
1. Lowercase the text
2. Remove punctuation (the characters in `string.punctuation`)
3. Collapse runs of whitespace into a single space, then strip leading/trailing whitespace

Then create:
```python
df['clean'] = df['text'].apply(clean_text)
```
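
To see what the three steps do to a concrete input, the sketch below applies them one at a time to a single invented string. It follows the hints in this part's code cell; the intermediate variable names are only for illustration.

```python
import re
import string

# One made-up review fragment with mixed case, punctuation, and extra whitespace.
raw = "  It's GREAT...   Loved it!\n"

lowered = raw.lower()
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
cleaned = re.sub(r"\s+", " ", no_punct).strip()

print(repr(cleaned))  # 'its great loved it'
```

Removing the apostrophe merges `it's` into `its`, which is an expected side effect of this minimal cleaning.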


In [None]:

def clean_text(s: str) -> str:
    # TODO (10 pts):
    # text = s.lower()
    # table = str.maketrans("", "", string.punctuation)
    # text = text.translate(table)
    # text = re.sub(r"\s+", " ", text).strip()
    # return text
    raise NotImplementedError("Implement clean_text()")

# df['clean'] = df['text'].apply(clean_text)
# df[['text', 'clean']].head(3)



---
## Part 2 — Train/test split (5 pts)

Split the data into train/test:
```python
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['clean'], df['label'], test_size=0.2, random_state=RANDOM_SEED, stratify=df['label']
)
```
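
The effect of `stratify` is easiest to see on imbalanced labels. The following sketch uses invented labels (80 negatives, 20 positives) rather than the IMDB data, which is roughly balanced, so treat it as an illustration of the parameter, not of the expected class ratio in this assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y))  # stand-in for the review texts

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify, both splits keep the 20% positive rate of the full data.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```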


In [None]:

# TODO (5 pts): create X_train_text, X_test_text, y_train, y_test as described
raise NotImplementedError("Create train/test split")



---
## Part 3 — Bag‑of‑Words with `CountVectorizer` (10 pts)

Vectorize the texts using:
```python
cv = CountVectorizer(stop_words='english', min_df=2)
Xtr_bow = cv.fit_transform(X_train_text)
Xte_bow = cv.transform(X_test_text)
```
**Deliverables:** variables `cv`, `Xtr_bow`, `Xte_bow`.


In [None]:

# TODO (10 pts): create BoW features
raise NotImplementedError("Create CountVectorizer features")



---
## Part 4 — Train **Multinomial Naive Bayes** & evaluate (15 pts)

Train and evaluate:
```python
nb = MultinomialNB()
nb.fit(Xtr_bow, y_train)
y_pred_nb = nb.predict(Xte_bow)
```
**Report:** accuracy, precision, recall, and F1 (`average="binary"`, with class 1 as the positive class). Then plot a **confusion matrix**.
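
To connect the reported metrics to the confusion matrix, the sketch below recomputes precision, recall, and F1 by hand from the four cell counts and compares them with scikit-learn's values. The label arrays are invented for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# The same quantities as computed by scikit-learn for the positive class.
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(precision, recall, f1)
print(p, r, f)
```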


In [None]:

def report_metrics(y_true, y_pred, title=None):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", zero_division=0)
    print(f"Accuracy={acc:.3f}  Precision={p:.3f}  Recall={r:.3f}  F1={f1:.3f}")
    print()
    print(classification_report(y_true, y_pred, digits=3))
    cm = confusion_matrix(y_true, y_pred, labels=[0,1])
    disp = ConfusionMatrixDisplay(cm, display_labels=[0,1])
    disp.plot()
    if title:
        plt.title(title)
    plt.show()

# TODO (15 pts): train NB on BoW, predict, and call report_metrics
raise NotImplementedError("Train/evaluate MultinomialNB on BoW")



---
## Part 5 — TF‑IDF features + Naive Bayes (10 pts)

Build TF‑IDF features and repeat NB:
```python
tfidf = TfidfVectorizer(stop_words='english', min_df=2)
Xtr_tfidf = tfidf.fit_transform(X_train_text)
Xte_tfidf = tfidf.transform(X_test_text)
```
Compare metrics vs. BoW (**1–3 sentences** in a Markdown cell).


In [None]:

# TODO (10 pts): TF-IDF features + NB training/eval; store predictions in y_pred_nb_tfidf
raise NotImplementedError("TF-IDF + NB")



👉 **Answer (Markdown, 5 pts):** In **1–3 sentences**, compare BoW vs TF‑IDF performance and briefly speculate why.



---
## Part 6 — Add bigrams & try Logistic Regression (20 pts)

Create unigram + bigram TF‑IDF features (`ngram_range=(1,2)`) and train `LogisticRegression(max_iter=1000)`:
```python
tfidf_bg = TfidfVectorizer(ngram_range=(1,2), stop_words='english', min_df=2)
Xtr_bg = tfidf_bg.fit_transform(X_train_text)
Xte_bg = tfidf_bg.transform(X_test_text)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(Xtr_bg, y_train)
y_pred_lr = logreg.predict(Xte_bg)
```
Report metrics and **compare** them to both NB runs (BoW and TF‑IDF).


In [None]:

# TODO (20 pts): unigram + bigram TF-IDF + Logistic Regression training/eval; store predictions in y_pred_lr
raise NotImplementedError("Bigrams + Logistic Regression")



---
## Part 7 — Inspect top indicative features (10 pts)

Using the **Logistic Regression** coefficients on the unigram + bigram TF‑IDF features:
- Get feature names: `tfidf_bg.get_feature_names_out()`
- For class 1 (positive), list the top 20 features with the largest (most positive) coefficients.
- For class 0 (negative), list the top 20 features with the smallest (most negative) coefficients.

Print them as simple lists or a small DataFrame.
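
The ranking step can be sketched with `np.argsort` on invented arrays standing in for the feature names and for `logreg.coef_[0]` (for a binary model, `coef_` has shape `(1, n_features)` and positive values push toward class 1). The feature names and coefficient values below are made up.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for tfidf_bg.get_feature_names_out() and logreg.coef_[0].
names = np.array(["awful", "great", "boring", "loved", "plot", "waste"])
coefs = np.array([-2.1, 1.8, -1.2, 2.4, 0.1, -3.0])

k = 2
order = np.argsort(coefs)       # ascending: most negative first
top_negative = order[:k]
top_positive = order[::-1][:k]  # descending: largest first

top = pd.DataFrame({
    "positive": names[top_positive],
    "pos_coef": coefs[top_positive],
    "negative": names[top_negative],
    "neg_coef": coefs[top_negative],
})
print(top)
```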


In [None]:

# TODO (10 pts): print the top 20 positive (largest) and top 20 negative (most negative) features by coefficient value
raise NotImplementedError("Top features from Logistic Regression coefficients")



---
## Part 8 — Error analysis (10 pts)

Print **5 false positives** and **5 false negatives** based on your **best** model.  
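For each, print the **true label**, **predicted label**, and the **original review text** (truncated to ~200 chars). Because `X_test_text` holds the cleaned text, look up the original in `df['text']` via the test-set index.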

> Optional: If you used Logistic Regression, you may also inspect `decision_function` to sort by confidence.
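
One way to locate the misclassified rows is with boolean masks over the test labels and predictions. The sketch below uses an invented four-row frame; `toy_df`, `y_hat`, and the index values are hypothetical and simply mimic a shuffled test split whose index still points into the original DataFrame.

```python
import numpy as np
import pandas as pd

# Invented stand-in for df; the index mimics rows picked by a shuffled split.
toy_df = pd.DataFrame(
    {
        "text": ["Loved it!", "Dull.", "A triumph.", "Not for me."],
        "label": [1, 0, 1, 0],
    },
    index=[10, 11, 12, 13],
)
y_true = toy_df["label"]
y_hat = pd.Series([1, 1, 0, 0], index=y_true.index)  # hypothetical predictions

fp_mask = (y_hat == 1) & (y_true == 0)  # predicted positive, actually negative
fn_mask = (y_hat == 0) & (y_true == 1)  # predicted negative, actually positive

for name, mask in [("FP", fp_mask), ("FN", fn_mask)]:
    for idx in y_true.index[mask][:5]:
        original = toy_df.loc[idx, "text"][:200]  # original, uncleaned text
        print(name, "true:", y_true.loc[idx], "pred:", y_hat.loc[idx], "text:", original)
```

In the notebook, model predictions come back as a plain array in test-set order, so wrapping them in a Series with `y_test.index` (as done for `y_hat` here) keeps the lookup aligned.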


In [None]:

# TODO (10 pts): print examples of FP and FN
raise NotImplementedError("Error analysis examples")



---
## ✍️ Short reflection (5 pts)

In **5–8 sentences**, summarize:
- Which combination (features + model) worked best?
- One thing you learned from top‑feature inspection.
- One idea you’d try next (e.g., character n‑grams, regularization `C`, class weights, normalization).

*(Write your answer in the cell below.)*



**Your reflection here…**



---
## (Optional) Save predictions

If you want to keep a record of your predictions for the best model:


In [None]:

# Example stub (uncomment and adapt):
# out = pd.DataFrame({
#     "text": X_test_text.reset_index(drop=True),
#     "label_true": y_test.reset_index(drop=True),
#     "label_pred": y_pred_lr  # or your best model's predictions
# })
# out.to_csv("imdb_predictions.csv", index=False)
# print("Saved to imdb_predictions.csv")



---
## ✅ Checklist before you submit
- [ ] All TODOs implemented (no `NotImplementedError` left).
- [ ] At least **one confusion matrix** plotted.
- [ ] Part 5 Markdown comparison written (1–3 sentences).
- [ ] Reflection written (5–8 sentences).
- [ ] Notebook runs top‑to‑bottom on a fresh kernel.
