### Setup Project

In this step, we import the necessary libraries and load the Amazon Reviews dataset.  
The dataset comes in two compressed files: `train.ft.txt.bz2` and `test.ft.txt.bz2`.  
Each line contains a label (`__label__1` for negative, `__label__2` for positive) followed by the review text.  

In [None]:
import pandas as pd
import bz2

import re
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from collections import Counter

import joblib

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Loading
We will:
- Load the data into pandas DataFrames.
- Convert labels into numerical format (0 = negative, 1 = positive).
- Inspect a few samples to understand the structure.
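
As a rough illustration of the file format, the sketch below writes two made-up review lines to a temporary `.bz2` file and reads them back into `(label, text)` pairs. The sample lines, the file name `sample.ft.txt.bz2`, and the helper `load_reviews` are assumptions for illustration only, and UTF-8 encoding is assumed.

```python
import bz2
import os
import tempfile

# Hypothetical sample lines in the "__label__N <text>" format described above
sample_lines = [
    "__label__2 Great sound: I love this soundtrack.",
    "__label__1 Disappointing: Stopped working after a week.",
]

# Write the samples to a temporary bz2 file (illustrative path, not the real dataset)
path = os.path.join(tempfile.mkdtemp(), "sample.ft.txt.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")

def load_reviews(path):
    """Read a bz2 text file and return (label, text) pairs with 0/1 labels."""
    rows = []
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # Split only on the first space: the label comes first, the rest is the review
            label, text = line.strip().split(" ", 1)
            rows.append((1 if label == "__label__2" else 0, text))
    return rows

rows = load_reviews(path)
print(rows)
```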


In [None]:
# 1) Function to read bz2 files
def read_bz2_file(path):
    data = []
    # Open in text mode; UTF-8 is assumed for the review text
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            data.append(line.strip())
    return data

In [None]:
# 2) Load train and test data
train_data = read_bz2_file("train.ft.txt.bz2")
test_data = read_bz2_file("test.ft.txt.bz2")

In [None]:
# 3) Convert to DataFrame
def parse_data(data):
    labels = []
    texts = []
    for row in data:
        label, text = row.split(" ", 1)
        labels.append(1 if label == "__label__2" else 0)
        texts.append(text)
    return pd.DataFrame({"label": labels, "text": texts})

train_df = parse_data(train_data)
test_df = parse_data(test_data)

In [None]:
train_df.head()

Unnamed: 0,label,text
0,1,Stuning even for the non-gamer: This sound tra...
1,1,The best soundtrack ever to anything.: I'm rea...
2,1,Amazing!: This soundtrack is my favorite music...
3,1,Excellent Soundtrack: I truly like this soundt...
4,1,"Remember, Pull Your Jaw Off The Floor After He..."


In [None]:
print(train_df['label'].value_counts())

label
1    1800000
0    1800000
Name: count, dtype: int64


In [None]:
train_df.shape, test_df.shape

((3600000, 2), (400000, 2))

### Text Preprocessing

Raw text needs cleaning before feeding into ML models.  
We will:
- Lowercase all text.
- Remove punctuation and special characters.
- Tokenize words.
- Optionally remove stopwords.

This ensures consistency and reduces noise in the dataset.
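
To make the steps above concrete, here is a minimal sketch that applies them to one sentence. It uses a tiny hand-picked stopword set (`DEMO_STOPWORDS`, an assumption that keeps the example self-contained) rather than the NLTK English list, and the helper name `demo_preprocess` is hypothetical.

```python
import re

# Hypothetical mini stopword set; a real run would use NLTK's English stopwords
DEMO_STOPWORDS = {"this", "is", "the", "a", "for", "even"}

def demo_preprocess(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()                      # 1) lowercase
    text = re.sub(r"[^a-z\s]", "", text)     # 2) keep only letters and whitespace
    tokens = text.split()                    # 3) whitespace tokenization
    tokens = [t for t in tokens if t not in stop_words]  # 4) drop stopwords
    return " ".join(tokens)

print(demo_preprocess("This is the BEST soundtrack, even for a non-gamer!"))
# prints: best soundtrack nongamer
```

Note that removing the hyphen joins "non-gamer" into a single token, "nongamer".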

In [None]:
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # keep only letters and whitespace (drops punctuation and digits)
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

In [None]:
train_df["clean_text"] = train_df["text"].apply(preprocess_text)
test_df["clean_text"] = test_df["text"].apply(preprocess_text)

In [None]:
train_df.head()

Unnamed: 0,label,text,clean_text
0,1,Stuning even for the non-gamer: This sound tra...,stuning even nongamer sound track beautiful pa...
1,1,The best soundtrack ever to anything.: I'm rea...,best soundtrack ever anything im reading lot r...
2,1,Amazing!: This soundtrack is my favorite music...,amazing soundtrack favorite music time hands i...
3,1,Excellent Soundtrack: I truly like this soundt...,excellent soundtrack truly like soundtrack enj...
4,1,"Remember, Pull Your Jaw Off The Floor After He...",remember pull jaw floor hearing youve played g...


In [None]:
train_df.isnull().sum()

Unnamed: 0,0
label,0
text,0
clean_text,0


### Feature Extraction

Machine learning models require numerical input.  
We will use **TF-IDF Vectorization** to convert text into numerical features.  
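TF-IDF weights each word by how often it appears in a review, discounted by how many reviews contain it, so words that occur everywhere contribute less.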
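
The weighting can be sketched in plain Python on a toy corpus. This is an illustration only, assuming the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The corpus and helper names are made up.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["great sound great price", "bad sound", "great value"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    # Raw term counts times IDF, then scaled to unit L2 length
    tf = Counter(tokens)
    weights = {t: c * idf(t) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "great" gets the largest weight because it occurs twice, while "price" outweighs "sound" because it appears in fewer documents.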

In [None]:
vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train_df["clean_text"])
X_test = vectorizer.transform(test_df["clean_text"])

In [None]:
y_train = train_df["label"]
y_test = test_df["label"]

In [None]:
len(y_train), len(y_test)

(3600000, 400000)

### Baseline ML Model

We start with a simple **Logistic Regression** classifier.  
This gives us a baseline accuracy to compare against deep learning models later.

In [None]:
# Train Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, digits=4))

Logistic Regression Accuracy: 0.90107

Classification Report:

              precision    recall  f1-score   support

           0     0.9036    0.8979    0.9008    200000
           1     0.8986    0.9042    0.9014    200000

    accuracy                         0.9011    400000
   macro avg     0.9011    0.9011    0.9011    400000
weighted avg     0.9011    0.9011    0.9011    400000



### Deep Learning with PyTorch
We now prepare the data for PyTorch:
- Each review is converted into a fixed-length sequence of word indices.
- Unknown words are mapped to `<unk>` and padding is applied with `<pad>`.
- Labels are converted into tensors (0 = negative, 1 = positive).
- We use a custom Dataset class and DataLoader for batching.
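
The encoding step can be sketched with plain lists before bringing in tensors. The toy corpus, the two-word vocabulary limit, and the names `toy_vocab` and `encode_ids` are assumptions for illustration.

```python
from collections import Counter

# Toy corpus (made up); counts are great=3, sound=2, price=1, bad=1
corpus = ["great sound great price", "great sound", "bad"]
counter = Counter(tok for text in corpus for tok in text.split())

# Reserve index 0 for unknown words and 1 for padding, then add the top 2 words
toy_vocab = {"<unk>": 0, "<pad>": 1}
for idx, (word, _) in enumerate(counter.most_common(2), start=2):
    toy_vocab[word] = idx

def encode_ids(text, vocab, max_len):
    # Map tokens to indices, falling back to <unk> for out-of-vocabulary words
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]
    # Truncate long sequences, then right-pad short ones with <pad>
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

print(toy_vocab)                                   # {'<unk>': 0, '<pad>': 1, 'great': 2, 'sound': 3}
print(encode_ids("bad sound great", toy_vocab, 5))  # [0, 3, 2, 1, 1]
```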

In [None]:
# 1) Simple tokenizer
def tokenize(text):
    return text.split()

# 2) Build vocabulary from training data
counter = Counter()
for text in train_df["clean_text"]:
    counter.update(tokenize(text))

# 3) Limit vocab size
vocab_size = 20000
most_common = counter.most_common(vocab_size - 2)
word2idx = {"<unk>": 0, "<pad>": 1}
for idx, (word, _) in enumerate(most_common, start=2):
    word2idx[word] = idx

# 4) Encode function
def encode(text, word2idx, max_len=100):
    tokens = tokenize(text)
    ids = [word2idx.get(token, 0) for token in tokens]
    if len(ids) < max_len:
        ids += [1] * (max_len - len(ids))
    else:
        ids = ids[:max_len]
    return torch.tensor(ids, dtype=torch.long)

In [None]:
print("Vocab size:", len(word2idx))
print("Sample encoding:", encode(train_df["clean_text"].iloc[0], word2idx)[:20])

Vocab size: 20000
Sample encoding: tensor([    0,    16,     0,    84,   379,   257,  5786,     0,   350,    14,
            7,  1635,    16,    52,   555, 17610,    63,    41,   407,    63])


In [None]:
MAX_LEN = 100

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, word2idx, max_len=MAX_LEN):
        self.texts = texts
        self.labels = labels
        self.word2idx = word2idx
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        ids = encode(text, self.word2idx, self.max_len)
        return ids, torch.tensor(label, dtype=torch.float)

In [None]:
train_dataset = SentimentDataset(train_df["clean_text"].tolist(), train_df["label"].tolist(), word2idx)
test_dataset = SentimentDataset(test_df["clean_text"].tolist(), test_df["label"].tolist(), word2idx)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

### LSTM Model

We define a simple LSTM-based sentiment classifier:
- Embedding layer converts word indices into dense vectors.
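- A bidirectional LSTM processes the sequence in both directions to capture context.
- Dropout and batch normalization are applied to the concatenated final hidden states.
- A fully connected layer followed by a sigmoid outputs a single value (probability of positive sentiment).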
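
To see how the tensor shapes change through these layers, the sketch below runs a random batch through an embedding, a bidirectional LSTM, and a linear layer. The sizes (batch of 4, sequence length 10, and so on) are arbitrary assumptions, and dropout and batch normalization are left out to keep the shape flow easy to follow.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab, embed_dim, hidden_dim = 4, 10, 50, 8, 16  # arbitrary demo sizes

# Random word indices standing in for encoded reviews
ids = torch.randint(0, vocab, (batch, seq_len))

embedding = nn.Embedding(vocab, embed_dim, padding_idx=1)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
fc = nn.Linear(hidden_dim * 2, 1)

embedded = embedding(ids)                    # (batch, seq_len, embed_dim)
_, (hidden, _) = lstm(embedded)              # hidden: (2 directions, batch, hidden_dim)
# Concatenate the final forward and backward hidden states
features = torch.cat((hidden[-2], hidden[-1]), dim=1)   # (batch, 2 * hidden_dim)
probs = torch.sigmoid(fc(features)).squeeze(1)          # (batch,)

print(embedded.shape, hidden.shape, features.shape, probs.shape)
```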

In [None]:
class SentimentModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, dropout=0.5):
        super(SentimentModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=1)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.bn = nn.BatchNorm1d(hidden_dim*2)   # batch normalization over the concatenated hidden states
        self.fc = nn.Linear(hidden_dim*2, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.embedding(x)
        _, (hidden, _) = self.lstm(x)
        hidden_cat = torch.cat((hidden[-2], hidden[-1]), dim=1)
        out = self.dropout(hidden_cat)
        out = self.bn(out)
        out = self.fc(out)
        # squeeze only the feature dimension so a batch of size 1 keeps shape (1,)
        return self.sigmoid(out).squeeze(1)

### Training Loop with Validation, Early Stopping, and Scheduler

We enhance the training loop:
- Track both training and validation accuracy each epoch.
- Use early stopping to prevent overfitting.
- Apply a learning rate scheduler to reduce LR when validation accuracy plateaus.
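
The early-stopping rule on its own can be sketched without any model: feed a list of per-epoch validation accuracies to a patience counter and see where training would stop. The accuracy values and the helper name `early_stop_epoch` are made up for illustration, and the learning rate scheduler is not simulated here.

```python
def early_stop_epoch(val_accs, patience=3):
    """Return (last epoch run, best validation accuracy) under a patience rule."""
    best_val_acc = 0.0
    patience_counter = 0
    for epoch, val_acc in enumerate(val_accs, start=1):
        if val_acc > best_val_acc:
            # Strict improvement: remember it and reset the counter
            best_val_acc = val_acc
            patience_counter = 0
        else:
            # No improvement (ties count as no improvement)
            patience_counter += 1
            if patience_counter >= patience:
                return epoch, best_val_acc
    return len(val_accs), best_val_acc

# Hypothetical validation accuracies: improvement stalls after epoch 2
print(early_stop_epoch([0.80, 0.85, 0.84, 0.85, 0.83, 0.90], patience=3))
# prints: (5, 0.85)
```

With `patience=3`, training stops at epoch 5 and never reaches the better epoch 6, which is the trade-off a patience value controls.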

In [None]:
# Subset the data (200k train, 20k test)
train_df_small = train_df.sample(n=200000, random_state=42)
test_df_small = test_df.sample(n=20000, random_state=42)

train_dataset = SentimentDataset(train_df_small["clean_text"].tolist(),
                                 train_df_small["label"].tolist(),
                                 word2idx,
                                 max_len=100)

test_dataset = SentimentDataset(test_df_small["clean_text"].tolist(),
                                test_df_small["label"].tolist(),
                                word2idx,
                                max_len=100)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [None]:
model = SentimentModel(vocab_size=len(word2idx), embed_dim=64, hidden_dim=64, dropout=0.5)

In [None]:
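# Split training data into train + validation.
# Caveat: train_loader (built above) still covers all of train_df_small, so these
# validation rows are also seen during training and Val Acc is optimistic.
# For a clean hold-out, rebuild train_loader from train_texts / train_labels.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df_small["clean_text"], train_df_small["label"], test_size=0.1, random_state=42
)

# Note: max_len=50 here differs from the max_len=100 used for the train and test sets
val_dataset = SentimentDataset(val_texts.tolist(), val_labels.tolist(), word2idx, max_len=50)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)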

# Training setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=1)

def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=10, patience=3):
    best_val_acc = 0
    patience_counter = 0

    for epoch in range(epochs):
        # Training
        model.train()
        total_loss, correct, total = 0, 0, 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            preds = (outputs >= 0.5).float()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
        train_acc = correct/total

        # Validation
        model.eval()
        val_correct, val_total = 0, 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                preds = (outputs >= 0.5).float()
                val_correct += (preds == labels).sum().item()
                val_total += labels.size(0)
        val_acc = val_correct/val_total

        # Scheduler step
        scheduler.step(val_acc)

        print(f"Epoch {epoch+1}, Train Loss: {total_loss/len(train_loader):.4f}, "
              f"Train Acc: {train_acc:.4f}, Val Acc: {val_acc:.4f}")

        # Early stopping
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered")
                break

# Train with validation + early stopping
train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=10, patience=3)

Epoch 1, Train Loss: 0.3712, Train Acc: 0.8304, Val Acc: 0.8830
Epoch 2, Train Loss: 0.2765, Train Acc: 0.8861, Val Acc: 0.8976
Epoch 3, Train Loss: 0.2605, Train Acc: 0.8934, Val Acc: 0.9050
Epoch 4, Train Loss: 0.2495, Train Acc: 0.8988, Val Acc: 0.9079
Epoch 5, Train Loss: 0.2391, Train Acc: 0.9033, Val Acc: 0.9176
Epoch 6, Train Loss: 0.2307, Train Acc: 0.9080, Val Acc: 0.9226
Epoch 7, Train Loss: 0.2214, Train Acc: 0.9120, Val Acc: 0.9270
Epoch 8, Train Loss: 0.2127, Train Acc: 0.9152, Val Acc: 0.9322
Epoch 9, Train Loss: 0.2017, Train Acc: 0.9205, Val Acc: 0.9369
Epoch 10, Train Loss: 0.1924, Train Acc: 0.9247, Val Acc: 0.9422


### Evaluation

We evaluate the trained model on the 20,000-review test subset:
- Compute accuracy.
- Compare with baseline Logistic Regression.
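
The numbers in a classification report all derive from four confusion counts. The sketch below computes accuracy, precision, recall, and F1 for the positive class by hand on made-up labels. The data and the helper name `binary_metrics` are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```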

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
from sklearn.metrics import classification_report, accuracy_score

def evaluate_model(model, test_loader):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            preds = (outputs >= 0.5).float()
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    print("Classification Report:\n")
    print(classification_report(all_labels, all_preds, digits=4))

    acc = accuracy_score(all_labels, all_preds)
    print(f"Test Accuracy: {acc:.4f}")

evaluate_model(model, test_loader)

Classification Report:

              precision    recall  f1-score   support

         0.0     0.9010    0.9017    0.9013      9966
         1.0     0.9023    0.9016    0.9019     10034

    accuracy                         0.9016     20000
   macro avg     0.9016    0.9017    0.9016     20000
weighted avg     0.9017    0.9016    0.9017     20000

Test Accuracy: 0.9016


#### Saving the trained model

- Saving: we save the trained Logistic Regression model and the TF-IDF vectorizer so the API can reuse them later without retraining.
- Loading the saved model: we load both files back to test model loading before building the API.
- Prediction function: a function that predicts the sentiment of new text.


In [None]:
# 1) Save model and vectorizer
joblib.dump(lr, "sentiment_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

['vectorizer.pkl']

In [None]:
# 2) Load them back
lr_loaded = joblib.load("sentiment_model.pkl")
vectorizer_loaded = joblib.load("vectorizer.pkl")

In [None]:
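# 3) Prediction function
def predict_sentiment(text):
    # Apply the same cleaning used on the training data before vectorizing
    clean = preprocess_text(text)
    text_vec = vectorizer_loaded.transform([clean])
    pred = lr_loaded.predict(text_vec)[0]
    return "Positive" if pred == 1 else "Negative"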

print(predict_sentiment("This product is amazing"))
print(predict_sentiment("Worst purchase ever"))

Positive
Negative
