# Sentiment Analysis of Movie Reviews

**Goal:** Build a model that reads a movie review (text) and predicts whether the sentiment is **positive** or **negative**.

This is a classic **text classification** problem in NLP (Natural Language Processing). We will:
1. Load and explore the data
2. Preprocess the text so the model can use it
3. Split data into train and test sets
4. Train a classifier (TF-IDF + Logistic Regression)
5. Evaluate accuracy and predict on new reviews

Run each cell in order. If you get `ModuleNotFoundError: No module named 'nltk'`, run the cell below first to install dependencies.

In [None]:
# Run this cell once if you get "ModuleNotFoundError: No module named 'nltk'"
# Then restart the kernel (Kernel → Restart) and run the notebook from the top again.
%pip install nltk

---
## Setup: Imports and project path

We add the project root to Python’s import path (`sys.path`) so we can import from the `src` package (our own code). We then import:
- **pandas** – for tables (DataFrames)
- **train_test_split** (sklearn) – to split data into train and test
- **config** – project settings (paths, test size, random seed)
- **data_loader, text_preprocessing, model, evaluation** – our pipeline steps

In [None]:
import sys
import os
sys.path.insert(0, os.path.abspath(".."))

import pandas as pd
from sklearn.model_selection import train_test_split

from src.config import DATA_PATH, TEST_SIZE, RANDOM_STATE
from src.data_loader import load_data
from src.text_preprocessing import preprocess_text
from src.model import build_model
from src.evaluation import evaluate

---
## Step 1: Load the data

We load the dataset from a CSV file. Each row has:
- **review** – the raw text of the review
- **sentiment** – the label: `positive` or `negative`

`load_data` uses pandas to read the CSV. We use a path relative to the notebook (`../data/raw/...`) so it works when you run from the `notebooks/` folder.

In [None]:
# Load data (the path is relative to the notebooks/ folder, so ".." points at the project root)
df = load_data(os.path.join("..", "data", "raw", "imdb_sample.csv"))
df.head()

### Step 1 (continued): Explore the data

Before modeling, we always **explore the data**:
- **Shape** – How many rows (reviews) and columns we have. We need enough data to train and test.
- **Class distribution** – How many positive vs negative reviews. If one class dominates, a model can score high accuracy just by predicting the majority class; in that case we might need balancing or metrics that look at each class separately (e.g. F1).
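
A quick way to read the class balance is as proportions rather than raw counts. The sketch below uses a small made-up label column (not this notebook's dataset) and also derives the majority-class baseline, i.e. the accuracy you would get by always predicting the most common label. Any trained model should beat that number.

```python
import pandas as pd

# Hypothetical labels for illustration only; not the notebook's data.
labels = pd.Series(
    ["positive", "negative", "positive", "positive"], name="sentiment"
)

# Share of each class instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)

# Accuracy of a "model" that always predicts the most common class.
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```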

In [None]:
df.shape

In [None]:
df["sentiment"].value_counts()

---
## Step 2: Text preprocessing

**Why preprocess?** Models don’t read sentences; they need **numbers** or **fixed representations**. We also want to reduce noise and focus on meaningful words.

Our `preprocess_text` function does three things:
1. **Lowercasing** – "Great" and "great" become the same. Reduces vocabulary size and helps the model generalize.
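2. **Remove punctuation** – "amazing!" and "amazing" carry the same sentiment; for a word-count model, punctuation rarely adds useful signal.
3. **Remove stopwords** – Words like "the", "is", "at" appear in almost every sentence and usually don’t indicate sentiment. Removing them shortens the text and leaves a higher share of sentiment-bearing words. One caveat: general-purpose stopword lists, including NLTK's, also contain negations such as "not", which can matter for sentiment.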

Below we compare one **raw** review with its **preprocessed** version so you can see the effect.  
*(If the `nltk` package is installed, we use NLTK's full English stopword list; otherwise the code uses a small built-in list so the notebook still runs.)*
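
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `preprocess_sketch` and `SMALL_STOPWORDS` are hypothetical names, the stopword list is a tiny hand-picked one, and the real `preprocess_text` in `src` may differ in its details.

```python
import string

# Small illustrative stopword list; an assumption, not NLTK's full list.
SMALL_STOPWORDS = {"the", "is", "at", "a", "an", "and", "was", "it", "of", "to"}


def preprocess_sketch(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in SMALL_STOPWORDS]
    return " ".join(tokens)


print(preprocess_sketch("The movie was AMAZING, and the acting is great!"))
# movie amazing acting great
```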

In [None]:
# Example: raw vs preprocessed
sample = df["review"].iloc[0]
print("Raw:", sample)
print("Preprocessed:", preprocess_text(sample))

---
## Step 3: Train/test split

**Why split?** We need to know if the model **generalizes** to new reviews it has never seen. If we evaluated on the same data we trained on, we'd only measure memorization, not real performance.

We use `train_test_split` to put a fraction of the data (e.g. 25%) into **X_test, y_test** and the rest into **X_train, y_train**. The model is trained only on the train set; the test set is used once at the end to report accuracy and other metrics. We fix **random_state** so the split is reproducible.
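
To see what the split does on its own, here is a small sketch on made-up data. It shows that the same `random_state` reproduces the same split. It also passes `stratify`, which this notebook's own cell does not use; that option is an addition for illustration, and it can help on small datasets by keeping the class ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 reviews, 4 per class.
reviews = [f"review {i}" for i in range(8)]
labels = ["positive", "negative"] * 4


def split(seed):
    # stratify=labels keeps the class ratio the same in train and test.
    return train_test_split(
        reviews, labels, test_size=0.25, random_state=seed, stratify=labels
    )


X_tr, X_te, y_tr, y_te = split(seed=42)
print(len(X_tr), len(X_te))  # 6 2
print(sorted(y_te))          # ['negative', 'positive']

# Same seed, same split.
print(split(seed=42)[1] == X_te)  # True
```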

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df["review"],
    df["sentiment"],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

---
## Step 4: Build and train the model

We use a **pipeline** with two steps:

1. **TF-IDF vectorizer**  
   - **TF** (Term Frequency): how often a word appears in a document.  
   - **IDF** (Inverse Document Frequency): downweights words that appear in many documents (e.g. "movie") and upweights rarer, more discriminative words.  
   - Together, TF-IDF turns each review into a **vector of numbers** (one number per word in the vocabulary). Our `preprocess_text` is passed as the **preprocessor** so every review is cleaned before counting.

2. **Logistic Regression**  
   - A linear classifier: it learns weights for each TF-IDF feature and combines them to predict positive vs negative.  
   - Fast, interpretable, and often works very well on TF-IDF text features.
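
To make the TF-IDF weighting from step 1 concrete, the sketch below computes it by hand on a made-up, already-preprocessed corpus. It uses the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, which is scikit-learn's default; the final length normalization that `TfidfVectorizer` also applies is left out to keep the arithmetic readable. The function names are hypothetical.

```python
import math

# Hypothetical, already-preprocessed toy corpus.
docs = [
    "movie boring long",
    "movie amazing",
    "movie great acting",
]


def smoothed_idf(term: str) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(term in doc.split() for doc in docs)  # documents containing term
    return math.log((1 + n) / (1 + df)) + 1


def tfidf(term: str, doc: str) -> float:
    """Raw term count in the document times the smoothed IDF."""
    return doc.split().count(term) * smoothed_idf(term)


print(round(smoothed_idf("movie"), 3))   # 1.0   (appears in every document)
print(round(smoothed_idf("boring"), 3))  # 1.693 (appears in one document)
print(round(tfidf("boring", docs[0]), 3))  # 1.693
```

"movie" occurs in every review, so it gets the lowest possible weight, while "boring" occurs in only one and is weighted higher.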

**Training:** `model.fit(X_train, y_train)` makes the pipeline (1) preprocess and vectorize the training reviews, then (2) fit the logistic regression on those vectors and labels.
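
For orientation, a pipeline of this shape could look like the sketch below. `build_model_sketch` is a hypothetical stand-in; the actual `build_model` in `src/model.py` may use different step names or parameters. Note that when a custom `preprocessor` is passed to `TfidfVectorizer`, it replaces the vectorizer's built-in lowercasing and accent stripping, so the custom function has to do that work itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_model_sketch(preprocessor=None):
    """Hypothetical stand-in for build_model: TF-IDF, then logistic regression."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=preprocessor)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy training data for illustration only.
texts = ["great film", "awful film", "great acting", "awful plot"]
labels = ["positive", "negative", "positive", "negative"]

# str.lower stands in for a real cleaning function such as preprocess_text.
sketch = build_model_sketch(preprocessor=str.lower)
sketch.fit(texts, labels)
print(sketch.predict(["Great", "Awful"]))
```

Because the vectorizer and the classifier live in one `Pipeline`, a single `fit` call learns the vocabulary and the weights together, and `predict` applies the same transformation to new text.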

In [None]:
model = build_model(preprocess_text)
model.fit(X_train, y_train)

---
## Step 5: Evaluate on the test set

We measure performance on the **held-out test set** (data the model never saw during training):

- **Accuracy** – Fraction of test reviews classified correctly. Simple and intuitive.
- **Classification report** – For each class we get three numbers. Taking the positive class as the example (the negative class is treated the same way):
  - **Precision** – Of all reviews predicted as positive, how many were actually positive?
  - **Recall** – Of all actually positive reviews, how many did we predict as positive?
  - **F1-score** – Harmonic mean of precision and recall; useful when classes are imbalanced.

This tells us whether the model is good enough and whether it favors one class over the other.  
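*Note: This notebook uses a very small sample dataset (only a few rows), so the metrics can swing widely: with a handful of test reviews, a single misclassification changes accuracy by many percentage points. With more data, you'd expect more stable and meaningful numbers.*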
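
The definitions above reduce to counting true positives, false positives, and false negatives. The sketch below works them out by hand on made-up labels and predictions; `binary_metrics` is a hypothetical helper for illustration, not part of `src.evaluation`.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels and predictions for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

Here 2 of the 3 predicted positives are correct (precision 2/3), and 2 of the 3 actual positives are found (recall 2/3).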

In [None]:
evaluate(model, X_test, y_test)

---
## Step 6: Predict on new reviews

Finally, we use the trained model to **predict sentiment for new text**. The pipeline automatically preprocesses and vectorizes the new reviews with the same steps used in training, then the classifier outputs a label (positive/negative) for each one. This is how you would use the model in an app or script.

In [None]:
new_reviews = [
    "The movie was boring and long",
    "Amazing film, highly recommend!",
]
model.predict(new_reviews)