## Device initialization

This section initializes the computational environment for the project. It performs two tasks:

1. Detects available hardware (CPU vs GPU)
2. Configures PyTorch to use the optimal compute device

In [None]:
import pandas as pd
import torch
import types
import re
from collections import Counter
from torch.utils.data import DataLoader, Dataset
from torch import nn
import copy

import gensim.downloader as api
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression

# Check whether the MPS (Apple Metal) backend is available for acceleration
print('Is mps available?', torch.backends.mps.is_available())

# Set the device to MPS if available, otherwise use CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print('Using device:', device)

Is mps available? True
Using device: mps


## Dataset Loading

This section loads the SST dataset from local files and processes it for binary sentiment classification.

Four files are loaded from the `SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/` folder:

### 1. `datasetSentences.txt`

Contains all sentences extracted from movie reviews. This is the primary data source: the text we want to classify.

- Column 1: `sentence_index` - Unique identifier for each sentence
- Column 2: `sentence` - The actual text of the sentence

### 2. `datasetSplit.txt`

Specifies which dataset split each sentence belongs to, so that the standard train/dev/test splits are used.

- Column 1: `sentence_index` - Links to sentences in `datasetSentences.txt`
- Column 2: `splitset_label` - Split assignment
  - **1** = Training set
  - **2** = Test set
  - **3** = Development/Validation set

### 3. `dictionary.txt`

Maps every phrase (including sub-phrases) to a unique ID. It acts as the lookup table that bridges `datasetSentences.txt` and `sentiment_labels.txt`, connecting each sentence to its sentiment label.

- Column 1: `phrase` - Text of the phrase (can be a word, sub-phrase, or complete sentence)
- Column 2: `phrase_id` - Unique integer identifier

### 4. `sentiment_labels.txt`

Contains sentiment scores for all phrases. These provide the ground-truth labels for training our sentiment classifier.

- Column 1: `phrase ids` - Links to `phrase_id` in `dictionary.txt`
- Column 2: `sentiment values` - Continuous sentiment score from 0 (most negative) to 1 (most positive)
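
The lookup chain described above (sentence text → `phrase_id` → sentiment score) can be sketched with plain dictionaries. The toy rows and the helper `sentence_score` are invented for illustration and are not taken from the dataset; at full scale the same join can be expressed with `DataFrame.merge`.

```python
# Toy stand-ins for the three files (all values are made up for illustration).
sentences = {1: "A charming film .", 2: "Dull and lifeless ."}   # sentence_index -> sentence
phrase_ids = {
    "A charming film .": 10,
    "charming": 11,
    "Dull and lifeless .": 12,
}                                                                # phrase -> phrase_id
scores = {10: 0.85, 11: 0.80, 12: 0.15}                          # phrase_id -> sentiment value


def sentence_score(sentence_index):
    """Follow sentence -> phrase_id -> score; return None if any link is missing."""
    text = sentences.get(sentence_index)
    phrase_id = phrase_ids.get(text)
    return scores.get(phrase_id)


print(sentence_score(1))
print(sentence_score(2))
print(sentence_score(3))  # unknown sentence index, so no score is found
```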

The original dataset distinguishes 5 sentiment classes (very negative, negative, neutral, positive, very positive), but this project only needs binary classification (negative or positive). Sentences with sentiment values strictly between 0.4 and 0.6 (neutral) are therefore removed, and the remaining scores are converted to integer labels: 0 for values ≤ 0.4 and 1 for values ≥ 0.6.

```
0.0 ←−−−−−−− 0.2 ←−−−− 0.4 ←− 0.5 −→ 0.6 −−−−→ 0.8 −−−−−−→ 1.0
Very Negative | Negative | Neutral | Positive | Very Positive
```
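
As a minimal sketch of the thresholding rule, assuming the same inclusive cut-offs the loading code uses (`<= 0.4` and `>= 0.6`); `binarize` is an illustrative helper, not part of the notebook:

```python
def binarize(score, low=0.4, high=0.6):
    """Map a continuous sentiment score in [0, 1] to 0, 1, or None (neutral, dropped)."""
    if score <= low:
        return 0      # very negative + negative
    if score >= high:
        return 1      # positive + very positive
    return None       # strictly between the thresholds: neutral


for s in (0.1, 0.4, 0.5, 0.6, 0.93):
    print(s, binarize(s))
```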
---

### Final Format

After processing, each DataFrame has this structure:

| Column | Type | Description |
|--------|------|-------------|
| `sentence` | string | Movie review text |
| `label` | int | Binary sentiment |

In [2]:
# Load all sentences with their unique indices
sentences_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/datasetSentences.txt', sep='\t', header=0)

# Load dataset split assignments (train/dev/test)
split_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/datasetSplit.txt', sep=',', header=0)

# Load phrase dictionary (maps phrases to unique IDs)
dictionary_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/dictionary.txt', sep='|', header=None, names=['phrase', 'phrase_id'])

# Load sentiment labels for all phrases
labels_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/sentiment_labels.txt', sep='|', header=0)

print(f"Sentences: {len(sentences_df)}")
print(f"Dictionary phrases: {len(dictionary_df)}")
print(f"Labels: {len(labels_df)}")

# Combine sentences with their train/dev/test assignments
data = sentences_df.merge(split_df, on='sentence_index')

# Initialize empty dictionary to store sentence-to-label mappings
sentence_labels = {}

# Iterate through each sentence in the merged dataset
for idx, row in data.iterrows():
    sentence = row['sentence']  # Extract the sentence text
    
    # Look up phrase_id in dictionary
    phrase_match = dictionary_df[dictionary_df['phrase'] == sentence]
    
    # Check if we found a match
    if not phrase_match.empty:
        # Extract the phrase_id for this sentence
        phrase_id = phrase_match.iloc[0]['phrase_id']
        
        # Look up sentiment score using phrase_id
        label_match = labels_df[labels_df['phrase ids'] == phrase_id]
        
        # Check if we found a sentiment score
        if not label_match.empty:
            # Extract the continuous sentiment value (0.0 to 1.0)
            sentiment_value = label_match.iloc[0]['sentiment values']
            
            # Convert continuous sentiment to binary label
            if sentiment_value <= 0.4:
                # Negative class: very negative + negative samples
                sentence_labels[sentence] = 0
                
            elif sentiment_value >= 0.6:
                # Positive class: positive + very positive samples
                sentence_labels[sentence] = 1

# Map the sentence_labels dictionary onto a new 'label' column.
# Sentences missing from sentence_labels (neutral or unmatched) get NaN.
data['label'] = data['sentence'].map(sentence_labels)

# Remove all rows where label is NaN
data = data.dropna(subset=['label'])

# Convert label column from float to integer type.
data['label'] = data['label'].astype(int)

# Display statistics after filtering
print(f"\nAfter filtering neutral samples: {len(data)}")
print(f"Label distribution:\n{data['label'].value_counts()}")

# Training set (splitset_label == 1), keep only 'sentence' and 'label' columns, drop unnecessary columns
train_df = data[data['splitset_label'] == 1][['sentence', 'label']].reset_index(drop=True)

# Test set (splitset_label == 2)
test_df = data[data['splitset_label'] == 2][['sentence', 'label']].reset_index(drop=True)

# Validation/Development set (splitset_label == 3)
val_df = data[data['splitset_label'] == 3][['sentence', 'label']].reset_index(drop=True)

# Display final dataset sizes
print(f"\nTrain: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

Sentences: 11855
Dictionary phrases: 239232
Labels: 239232

After filtering neutral samples: 9142
Label distribution:
label
1    4739
0    4403
Name: count, dtype: int64

Train: 6568, Val: 825, Test: 1749


## Dataset processing

This section transforms raw text data into numerical representations that neural networks can process. Text strings cannot be directly fed into neural networks - they must first be converted into sequences of integers.

### Tokenization

Split sentences into individual words and punctuation marks, because neural networks operate on discrete units. Punctuation marks are kept as separate tokens because they carry sentiment information.
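
A short illustration of how the tokenizer pattern behaves, using the same regular expression as the notebook. The example sentences are made up. Note that an apostrophe is emitted as its own token, so a fragment such as `n't` is split into three pieces.

```python
import re

_tok = re.compile(r"\w+|[^\w\s]")


def tokenize(s):
    """Lowercase word and punctuation tokens, as in the notebook's tokenizer."""
    return [t.lower() for t in _tok.findall(s)]


print(tokenize("Great fun, really!"))
# ['great', 'fun', ',', 'really', '!']
print(tokenize("It does n't work ."))
# ['it', 'does', 'n', "'", 't', 'work', '.']
```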

### Vocabulary Construction

Build a mapping between words and unique integer IDs (word → ID), so that any word can be converted into a number consistently. This involves:

- Filter rare words (frequency < 2): Reduces vocabulary size and filters potential typos/noise
- Add special tokens: 
  - `<pad>`: For making sequences equal length (required for batch processing)
  - `<unk>`: For handling words not seen during training
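
The vocabulary steps can be sketched on a toy corpus. The three sentences and the `encode` helper are invented for illustration; the frequency cut-off of 2 and the special-token order follow the description above.

```python
import re
from collections import Counter

_tok = re.compile(r"\w+|[^\w\s]")
PAD, UNK = "<pad>", "<unk>"

# Toy training corpus (made up for illustration).
train_sentences = ["A good movie .", "A bad movie .", "Good fun !"]

counter = Counter()
for s in train_sentences:
    counter.update(t.lower() for t in _tok.findall(s))

# Keep tokens seen at least twice; special tokens take indices 0 and 1.
itos = [PAD, UNK] + [w for w, c in counter.most_common() if c >= 2]
stoi = {w: i for i, w in enumerate(itos)}


def encode(sentence):
    """Map tokens to IDs, falling back to <unk> for anything outside the vocabulary."""
    return [stoi.get(t.lower(), stoi[UNK]) for t in _tok.findall(sentence)]


print(itos)
print(encode("A good film ."))  # "film" was never seen, so it maps to the <unk> index
```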

In [3]:
# Breaking down text into individual words and punctuation marks
_tok = re.compile(r"\w+|[^\w\s]")
def tokenize(s: str):
    """
    Convert a sentence string into a list of lowercase tokens.
    """
    return [t.lower() for t in _tok.findall(s)]

# Define special tokens
UNK = "<unk>"  # used for words not in vocabulary (out-of-vocabulary words)
PAD = "<pad>"  # used to make all sequences the same length

# Initialize a Counter to count word frequencies
counter = Counter()

# Count all words in the training set, only use training data to build vocabulary to prevent data leakage
for sentence in train_df['sentence']:
    # Tokenize each sentence and update word counts
    counter.update(tokenize(sentence))
# Build the vocabulary list
itos = [PAD, UNK] + [w for w, c in counter.most_common() if c >= 2]
# Build reverse mapping: string-to-index dictionary
stoi = {w: i for i, w in enumerate(itos)}
# Get the padding token's index
padding_idx = stoi[PAD]
# Calculate vocabulary size
vocab_size = len(itos)

print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 6864


## Loading Pre-trained Word Embeddings

This code loads pre-trained GloVe embeddings and integrates them with the vocabulary. Instead of learning word representations from scratch, we leverage embeddings trained on massive text corpora. Pre-trained embeddings generally improve model performance, especially when training data is limited.

We now have a `TEXT` object containing our vocabulary mappings (`itos`, `stoi`) and an embedding matrix in which known words start with meaningful representations, giving the model a head start before training begins.
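
The initialization logic can be sketched without any downloads. Here `pretrained` is a made-up stand-in for the GloVe lookup and the vectors are three-dimensional for readability. Tokens with a pre-trained vector copy it, the rest get a small random vector, and the padding row is zeroed.

```python
import random

random.seed(0)
embedding_dim = 3
itos = ["<pad>", "<unk>", "good", "bad", "zzyzx"]

# Stand-in for a pre-trained lookup such as GloVe (vectors are made up).
pretrained = {"good": [0.5, 0.1, -0.2], "bad": [-0.4, 0.0, 0.3]}

matrix = []
for tok in itos:
    if tok in pretrained:
        matrix.append(list(pretrained[tok]))          # copy the known vector
    else:
        # small random init for tokens without a pre-trained vector
        matrix.append([random.gauss(0.0, 0.01) for _ in range(embedding_dim)])

matrix[0] = [0.0] * embedding_dim                     # padding row stays at zero

covered = sum(tok in pretrained for tok in itos)
print(f"coverage: {covered}/{len(itos)} tokens have a pre-trained vector")
```

Printing the coverage ratio is a cheap sanity check: a low value would suggest a mismatch between the tokenizer and the embedding vocabulary.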

In [4]:
# Load pre-trained GloVe (Global Vectors for Word Representation) embeddings
glv = api.load("glove-wiki-gigaword-100")

# Set the dimensionality of word embeddings to match GloVe's dimension
embedding_dim = 100

# Initialize a tensor to store embedding vectors for all vocabulary words
pretrained_vectors = torch.randn(vocab_size, embedding_dim) * 0.01

# Iterate through each token in the vocabulary
for i, tok in enumerate(itos):
    # Check if the current token exists in the pre-trained GloVe vocabulary
    if tok in glv:
        # If found, replace the random vector with the pre-trained GloVe vector
        pretrained_vectors[i] = torch.tensor(glv[tok])

# Set the embedding vector for the padding token to zeros
pretrained_vectors[padding_idx] = 0.0

# Create a namespace object to mimic the torchtext Field structure
TEXT = types.SimpleNamespace(
    vocab=types.SimpleNamespace(
        itos=itos,                      # int-to-string: list mapping indices to tokens
        stoi=stoi,                      # string-to-int: dict mapping tokens to indices
        vectors=pretrained_vectors      # The embedding matrix with pre-trained vectors
    )
)

## Feature Engineering

This section implements a feature engineering pipeline to establish a baseline understanding of sentiment classification. We first extract hand-crafted linguistic features and evaluate their predictive power. This approach helps us understand what signals are important for sentiment analysis.

There are 7 hand-crafted features grouped into 4 categories, each capturing different aspects of sentiment expression:

1. Length Features

Text length can reflect sentiment strength: longer reviews might indicate stronger opinions (either very positive or very negative). These features are normalized by dividing by typical values (50 tokens, 200 characters) to keep them in a reasonable range.

2. Punctuation Features

People expressing strong sentiment often use more emphatic punctuation.

- Exclamation marks signal strong emotion (both positive and negative)
- Question marks may indicate confusion or rhetorical emphasis
- Commas suggest detailed, structured arguments

3. Negation Features

Negations are crucial for sentiment analysis because they reverse polarity. We count common negation words such as "not", "no", and "never", plus contraction forms such as "n't", "dont", and "don't". The count is normalized by the number of tokens.

4. Intensifier Features

Intensifiers amplify sentiment strength without changing its direction. These words (such as "very", "really", and "extremely") indicate that the author felt compelled to emphasize their sentiment, suggesting stronger opinions.

### Feature Normalization Strategy

All features are normalized to prevent scale imbalance. Without normalization, raw character counts (in the hundreds) would dominate small counts like intensifiers (0-2). The normalization approach:

- Length features: Divided by typical text size (50 tokens, 200 chars)
- Punctuation/word counts: Divided by total token count

### Chi-Square Feature Importance Test

The Chi-Square test measures the statistical dependency between each feature and the sentiment label. A high χ² score indicates a strong association between the feature and sentiment, suggesting the feature is informative for prediction. The p-value confirms whether this relationship is statistically significant.

By examining χ² scores, we can identify which features are most predictive of sentiment and focus on the most informative signals.
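
To make the statistic concrete, the sketch below computes a per-feature χ² score by hand. It assumes the formulation used for non-negative features: the observed per-class feature totals are compared with the totals expected if feature and class were independent. The helper `chi2_scores` and the toy matrix are illustrative, and the function assumes every feature has a non-zero total.

```python
def chi2_scores(X, y):
    """Chi-square statistic per feature for non-negative features and class labels."""
    classes = sorted(set(y))
    n_samples, n_features = len(X), len(X[0])
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        stat = 0.0
        for c in classes:
            class_prob = sum(1 for label in y if label == c) / n_samples
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_total     # total expected under independence
            stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores


# Features 0 and 1 each fire for one class only; feature 2 is constant.
X = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
print(chi2_scores(X, y))
```

The two class-specific features receive the same positive score, while the constant feature scores zero because its observed totals match the expected ones exactly.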

### Logistic Regression Baseline

Use Logistic Regression as a simple baseline model. As a linear model, it learns explicit weights for each feature, allowing us to understand exactly how much each linguistic pattern contributes to sentiment prediction.

In [5]:
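# Define common negation words that reverse sentiment.
# Note: _tok emits apostrophes as separate tokens, so "n't" and "don't" never
# match a single token; only the apostrophe-free entries can fire.
NEGATIONS = {"not","no","never","n't","dont","don't"}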

# Define intensifier words that amplify sentiment strength
INTENSIFIERS = {"very","really","so","too","extremely"}

# List of feature names we will extract
FEATURE_NAMES = [
    "len_tokens",           # Number of tokens in text
    "len_chars",            # Total character count
    "count_exclaim",        # Number of exclamation marks
    "count_question",       # Number of question marks
    "count_comma",          # Number of commas
    "count_negations",      # Count of negation words
    "count_intensifiers"    # Count of intensifier words
]


def extract_features(text):
    """
    Extract simple features from text.
    """
    # Tokenize text using regex pattern
    tokens = _tok.findall(text)
    # Convert all tokens to lowercase
    tokens_lower = [t.lower() for t in tokens]
    
    # Count basic statistics
    num_tokens = len(tokens)
    num_chars = len(text)
    
    # Count punctuation marks
    num_exclaim = text.count("!")  # Count exclamation marks
    num_question = text.count("?")  # Count question marks
    num_comma = text.count(",")  # Count commas
    
    # Count negation words in text
    num_negations = sum(1 for t in tokens_lower if t in NEGATIONS)
    
    # Count intensifier words in text
    num_intensifiers = sum(1 for t in tokens_lower if t in INTENSIFIERS)
    
    # Normalize features by text length
    if num_tokens > 0:
        features = {
            "len_tokens": num_tokens / 50.0,  # Normalize by typical length
            "len_chars": num_chars / 200.0,  # Normalize by typical char count
            "count_exclaim": num_exclaim / num_tokens,  # Ratio of exclamation marks
            "count_question": num_question / num_tokens,  # Ratio of question marks
            "count_comma": num_comma / num_tokens,  # Ratio of commas
            "count_negations": num_negations / num_tokens,  # Ratio of negations
            "count_intensifiers": num_intensifiers / num_tokens  # Ratio of intensifiers
        }
    else:
        # Handle empty text case
        features = {name: 0.0 for name in FEATURE_NAMES}
    
    return features


print("Building feature validation dataset...")
train_features = []

# Extract features from training data
for idx, row in train_df.iterrows():
    # Extract features from sentence
    feats = extract_features(row['sentence']) 
    # Add label to feature dict
    feats["label"] = row['label']
    # Add to list
    train_features.append(feats)

# Convert list of dicts to DataFrame
df_features = pd.DataFrame(train_features)

# Print dataset information
print(f"Number of features: {len(FEATURE_NAMES)}")
print(f"Training samples: {len(df_features)}")

# Prepare data for chi-square test
X_chi = df_features[FEATURE_NAMES].values
y_chi = df_features["label"].values

# Calculate chi-square scores for each feature
chi_scores, p_values = chi2(X_chi, y_chi)

# Create ranking table with chi-square results
chi_df = pd.DataFrame({
    "feature": FEATURE_NAMES,
    "chi2_score": chi_scores,
    "p_value": p_values
}).sort_values("chi2_score", ascending=False)

# Display feature importance ranking
print("\nFeature Importance Ranking (Chi-square):")
print(chi_df.to_string(index=False))

# Extract features from validation data
val_features = []
for idx, row in val_df.iterrows():
    feats = extract_features(row['sentence'])
    feats["label"] = row['label']
    val_features.append(feats)

# Convert to DataFrame
df_val = pd.DataFrame(val_features)

# Prepare train and validation data
X_train_lr = df_features[FEATURE_NAMES].values  # Training features
y_train_lr = df_features["label"].values  # Training labels
X_val_lr = df_val[FEATURE_NAMES].values  # Validation features
y_val_lr = df_val["label"].values  # Validation labels

# Train logistic regression model
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(X_train_lr, y_train_lr)

# Evaluate model on validation set
lr_pred = lr_model.predict(X_val_lr)
lr_acc = accuracy_score(y_val_lr, lr_pred)

# Print baseline results
print(f"\nLR Baseline validation accuracy: {lr_acc:.4f}")
print(f"Conclusion: Hand-crafted features alone achieve {lr_acc*100:.1f}% classification ability")

# Select features with chi-square score above threshold
THRESHOLD = 1.0  # Set threshold for feature selection
selected_features = chi_df[chi_df["chi2_score"] > THRESHOLD]["feature"].tolist()

# Print selected features
print(f"Based on chi-square > {THRESHOLD}, selected {len(selected_features)} features:")
for feat in selected_features:  # Loop through selected features
    # Get chi-square value for this feature
    chi_val = chi_df[chi_df["feature"] == feat]["chi2_score"].values[0]
    # Print feature name and chi-square score
    print(f"   - {feat:<30} (chi2={chi_val:.2f})")

Building feature validation dataset...
Number of features: 7
Training samples: 6568

Feature Importance Ranking (Chi-square):
           feature  chi2_score  p_value
count_intensifiers    5.094269 0.024005
    count_question    3.619379 0.057110
   count_negations    3.596464 0.057903
       count_comma    1.773648 0.182931
     count_exclaim    1.730499 0.188347
         len_chars    1.180790 0.277195
        len_tokens    0.008274 0.927523

LR Baseline validation accuracy: 0.5770
Conclusion: Hand-crafted features alone achieve 57.7% classification ability
Based on chi-square > 1.0, selected 6 features:
   - count_intensifiers             (chi2=5.09)
   - count_question                 (chi2=3.62)
   - count_negations                (chi2=3.60)
   - count_comma                    (chi2=1.77)
   - count_exclaim                  (chi2=1.73)
   - len_chars                      (chi2=1.18)


## DataLoader Setup

Create efficient data pipelines that feed batches of samples to the model during training.

### Dataset Class

Wrap preprocessed data in PyTorch's `Dataset` interface. PyTorch's `DataLoader` requires this structure for efficient batching and data loading.

This pipeline ensures our text data is properly prepared for the LSTM model while maintaining reproducibility and preventing data leakage.


### Compute Numeric Feature Function

The function processes batches of token IDs on the fly, converting the IDs back to their string representations to recompute the hand-crafted features (length, punctuation counts, negations, intensifiers) that aren't captured by word embeddings alone. These features are normalized to comparable scales, projected through a linear layer, and concatenated with the BiLSTM's learned representation, creating a richer input that combines learned semantic knowledge with explicit linguistic rules. The function is called in the neural network's forward pass to compute features for each batch.

### Collate Function 

Sentences have variable lengths, but neural networks require fixed-size tensors. To address this, shorter sequences are padded with the special `<pad>` token to match the longest sequence in each batch.
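
A minimal sketch of the padding step on plain Python lists; `pad_batch` and the ID values are illustrative, and the padding ID is assumed to be 0 as in the vocabulary built earlier:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length ID lists to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [[5, 8, 2], [7], [4, 9]]
print(pad_batch(batch))
# [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```

Padding per batch, rather than to a global maximum, keeps tensors as small as the longest sentence in each batch allows.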

### Training DataLoader
- `shuffle=True`: Randomize sample order each epoch so the model does not overfit to the order of the data

- `batch_size=128`: Balance between speed and memory; 128 is a common choice for medium-sized datasets

### Validation/Test DataLoaders
- `shuffle=False`: Keep a fixed order for reproducible evaluation metrics

- `pin_memory`: Left at its default (`False`) in the loaders below. Pinned memory speeds up CPU→GPU transfer on CUDA devices, which this MPS/CPU setup does not use

In [6]:
# Simple dataset class
class SST2Dataset(Dataset):
    def __init__(self, dataframe, stoi):
        self.data = []
        for _, row in dataframe.iterrows():
            sentence = row['sentence']
            label = int(row['label'])
            # Convert words to IDs
            ids = [stoi.get(w, stoi[UNK]) for w in tokenize(sentence)]
            self.data.append({'input_ids': ids, 'label': label})
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]


# Create datasets
train_ds = SST2Dataset(train_df, stoi)
val_ds = SST2Dataset(val_df, stoi)
test_ds = SST2Dataset(test_df, stoi)

print(f"Dataset sizes - Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")


# Extract features from text
def compute_numeric_features(x_ids, pad_idx, selected_feat_names):
    device = x_ids.device
    B, T = x_ids.size()
    feats_list = []
    
    for i in range(B):
        # Get tokens
        ids = x_ids[i].tolist()
        ids = [t for t in ids if t != pad_idx]
        toks = [TEXT.vocab.itos[t] for t in ids]
        text = " ".join(toks)
        
        # Count features
        num_tokens = float(len(toks))
        num_chars = float(len(text))
        num_exclaim = float(text.count("!"))
        num_question = float(text.count("?"))
        num_comma = float(text.count(","))
        
        # Count negations and intensifiers
        toks_lower = [t.lower() for t in toks]
        num_negations = float(sum(1 for t in toks_lower if t in NEGATIONS))
        num_intensifiers = float(sum(1 for t in toks_lower if t in INTENSIFIERS))
        
        # Normalize features
        if num_tokens > 0:
            all_features = {
                "len_tokens": num_tokens / 50.0,
                "len_chars": num_chars / 200.0,
                "count_exclaim": num_exclaim / num_tokens,
                "count_question": num_question / num_tokens,
                "count_comma": num_comma / num_tokens,
                "count_negations": num_negations / num_tokens,
                "count_intensifiers": num_intensifiers / num_tokens
            }
        else:
            # Handle empty case
            all_features = {fname: 0.0 for fname in selected_feat_names}
        
        # Get selected features
        selected_vals = [all_features[fname] for fname in selected_feat_names]
        feats_list.append(selected_vals)
    
    return torch.tensor(feats_list, dtype=torch.float32, device=device)


# Collate function for batching
def collate_fn(batch):
    ids_list = []
    labels = []
    
    for item in batch:
        ids_list.append(item['input_ids'])
        labels.append(item['label'])
    
    # Pad sequences to same length
    max_len = max(len(x) for x in ids_list)
    x = torch.full((len(ids_list), max_len), padding_idx, dtype=torch.long)
    
    for i, ids in enumerate(ids_list):
        x[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    
    y = torch.tensor(labels, dtype=torch.long)
    
    return types.SimpleNamespace(text=x, label=y)


# Create data loaders
train_iter = DataLoader(train_ds, batch_size=128, shuffle=True, collate_fn=collate_fn)
val_iter = DataLoader(val_ds, batch_size=128, shuffle=False, collate_fn=collate_fn)
test_iter = DataLoader(test_ds, batch_size=128, shuffle=False, collate_fn=collate_fn)

# Check first batch
first = next(iter(train_iter))
print(f"\nBatch info:")
print(f"Type: {type(first)}")
print(f"Text shape: {first.text.shape}")
print(f"Label shape: {first.label.shape}")
print(f"Sample text (first 10 tokens): {first.text[0][:10]}")
print(f"Sample label: {first.label[0]}")

Dataset sizes - Train: 6568, Val: 825, Test: 1749

Batch info:
Type: <class 'types.SimpleNamespace'>
Text shape: torch.Size([128, 55])
Label shape: torch.Size([128])
Sample text (first 10 tokens): tensor([5264,  861,  105,    5,  543,  783,    7,  225,   15,    3])
Sample label: 1


## BiLSTM Classification Model


I first tried a plain LSTM, but the results were very poor. I therefore use a Bidirectional Long Short-Term Memory (BiLSTM) network as the primary model. This choice is justified by the following:

1. Sequential Nature of Language: Text is not just a collection of random words; the order matters tremendously. LSTMs are designed specifically to process sequences where order matters. As the LSTM reads through a sentence word by word, it maintains a "memory" of what it has seen before.

2. Long-Range Dependency Modeling: Sometimes the key to understanding sentiment is remembering something from much earlier in the text. Vanilla RNNs try to do this but struggle because gradients vanish over long sequences, so information from early tokens is lost. LSTMs mitigate this problem using special mechanisms called "gates" that act like memory controllers.

3. Bidirectional Context Capture: A regular LSTM only reads left-to-right (forward). A Bidirectional LSTM reads the sentence in both directions: one LSTM reads left-to-right (forward), and another reads right-to-left (backward). By combining information from both directions, the model sees the context on both sides of each word, which helps it capture sentiment more accurately.

4. Hierarchical Representation Learning: We use 2 LSTM layers stacked on top of each other, which allows the model to learn patterns at different levels of abstraction. Intuitively, the first LSTM layer learns basic patterns like grammar and simple word combinations, and the second layer builds on these to learn more complex, abstract patterns.

### Metrics

#### 1. Classification Accuracy

We mainly use classification accuracy as our primary performance measure. 

Classification accuracy is the most appropriate metric for our sentiment analysis task. First, our dataset has a balanced class distribution (4,739 positive vs. 4,403 negative sentences after filtering), so accuracy won't be misleadingly inflated by predicting the majority class.

Second, for binary sentiment classification, both classes (positive/negative) are equally important. We don't have an asymmetric cost structure where false positives are more costly than false negatives, so accuracy's equal weighting of all errors is appropriate.

#### 2. Cross-Entropy Loss

We also monitor cross-entropy loss during training and validation.

Cross-entropy loss measures the divergence between predicted probability distributions and true labels, providing a more nuanced signal than accuracy during training. Cross-entropy rewards confident correct predictions and penalizes confident wrong predictions more heavily. 

Cross-entropy is the natural choice for neural network classification because it's the negative log-likelihood of the correct class, making it theoretically grounded in maximum likelihood estimation. It's also convex with respect to the final layer's logits, which gives well-behaved gradients at the output layer, although the network as a whole remains non-convex.
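
A small worked example of the loss for a single two-class prediction, written in plain Python. The logits are made up, and `cross_entropy` here is an illustrative helper that computes the negative log-softmax of the target class.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)                                   # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


confident_right = cross_entropy([0.0, 4.0], target=1)
unsure = cross_entropy([0.0, 0.0], target=1)
confident_wrong = cross_entropy([4.0, 0.0], target=1)
print(round(confident_right, 4), round(unsure, 4), round(confident_wrong, 4))
```

The three values increase in that order: a confident correct prediction costs almost nothing, an undecided one costs log 2, and a confident wrong one is penalized most heavily.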

### Overfitting and Underfitting Prevention

#### 1. Overfitting Prevention

1. Frozen Pretrained Embeddings: GloVe embeddings are pretrained on billions of words from Wikipedia and news corpora. These vectors already encode rich semantic relationships. By freezing these weights, we prevent the model from overfitting embeddings to our small training set (6,568 samples). 


2. Dropout Regularization: Dropout randomly sets activations to zero during training with probability `p`, forcing the network to learn redundant representations. The model applies `p = 0.3` to the word embeddings, so if it relied on a single embedding dimension to detect "not" and that dimension were dropped 30% of the time, it would have to learn alternative pathways to detect negation.

3. Early Stopping with Model Checkpointing: Training for too many epochs inevitably leads to overfitting—the model continues improving on training data while validation performance degrades. Early stopping prevents this by monitoring validation loss and saving the model checkpoint when it achieves the lowest validation loss. After 50 epochs, even if the final model is overfit, we retain the best-performing model from earlier in training (typically around epochs 15-25).

4. Train-Validation Monitoring: Explicit monitoring of the train-validation gap provides a human-interpretable diagnostic for model health. A gap of 2-5% is normal and acceptable—it indicates the model has learned patterns specific to training data but still generalizes well. A gap exceeding 10% signals severe overfitting, prompting us to stop training, increase regularization, or reduce model capacity.
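
The checkpoint-selection idea can be sketched on a list of validation losses. The losses are hypothetical and `best_checkpoint` is an illustrative helper. The `patience` cut-off is an assumption added for illustration, since the text above describes keeping the best checkpoint over a fixed 50 epochs.

```python
def best_checkpoint(val_losses, patience=3):
    """Return (best_epoch, best_loss, stop_epoch) for a sequence of validation losses."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # new best: this is where the model weights would be saved
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, best_loss, epoch   # stop early
    return best_epoch, best_loss, len(val_losses)


# Hypothetical validation losses: improve, then drift upward.
losses = [0.69, 0.61, 0.55, 0.52, 0.53, 0.54, 0.56, 0.58]
print(best_checkpoint(losses))
```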

#### 2. Underfitting Prevention

1. Sufficient Model Capacity: A 2-layer BiLSTM with 256 hidden units provides `256 × 2 directions × 2 layers = 1,024` hidden units across the network, sufficient to learn nuanced linguistic patterns.

2. Adequate Training Duration: Training for 50 epochs with early stopping ensures we explore enough of the loss landscape. If validation loss is still decreasing at epoch 50, we'd extend training, but typically loss plateaus around epoch 20-30.

3. High Initial Learning Rate: Starting with LR=0.1 ensures rapid initial learning with the plain gradient-descent update used here. Too small a learning rate (e.g., 0.001) would cause extremely slow convergence, potentially getting stuck in poor local minima and resulting in underfitting.

4. Monitoring Training Loss: If training loss remains high (>0.5) throughout training, this signals underfitting: the model lacks the capacity or training time to fit the data. We'd respond by adding layers or training longer.
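
To make the capacity claim concrete, the parameter count of the recurrent stack can be worked out by hand. The sketch below assumes the usual four-gate LSTM layout with two bias vectors per gate block (as in PyTorch's `nn.LSTM`) and uses `embedding_dim = 100` and `hidden_dim = 256` from the notebook's configuration; `lstm_param_count` is an illustrative helper.

```python
def lstm_param_count(input_dim, hidden_dim, num_layers=2, bidirectional=True):
    """Count weights and biases of a stacked LSTM with four gates per cell."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        # four gates, each with input weights, recurrent weights, and two bias vectors
        per_direction = 4 * (layer_input * hidden_dim
                             + hidden_dim * hidden_dim
                             + 2 * hidden_dim)
        total += directions * per_direction
        layer_input = hidden_dim * directions   # the next layer sees both directions
    return total


print(lstm_param_count(input_dim=100, hidden_dim=256))
```

For these settings the count comes to roughly 2.3 million trainable LSTM parameters, which is worth keeping in mind next to the 6,568 training sentences.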

In [7]:
hidden_dim = 256  # hidden dimension for LSTM
label_size = 2  # binary classification
num_features = len(selected_features)  # number of hand-crafted features

# Print model configuration
print("Model Configuration:")
print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embedding_dim}")
print(f"Hidden dimension: {hidden_dim}")
print(f"Number of features: {num_features}")


# Define BiLSTM classifier
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_size, padding_idx, num_features, feat_names):
        super(BiLSTMClassifier, self).__init__()
        # Create embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        # Load pretrained word vectors
        self.embedding.weight.data.copy_(TEXT.vocab.vectors)
        # Freeze embedding weights
        self.embedding.weight.requires_grad = False
        
        # Create bidirectional LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, batch_first=True, bidirectional=True)
        
        # Feature projection layer
        self.feat_layer = nn.Linear(num_features, 32)
        # Final classification layer
        self.fc = nn.Linear(hidden_dim * 2 + 32, output_size)
        # Dropout for regularization
        self.dropout = nn.Dropout(0.3)
        
        # Store padding index
        self.pad_idx = padding_idx
        # Store feature names
        self.feat_names = feat_names

    def forward(self, x):
        # Get word embeddings
        embedded = self.embedding(x)
        # Apply dropout
        embedded = self.dropout(embedded)
        
        # Pass through LSTM
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate forward and backward hidden states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        
        # Extract hand-crafted features
        features = compute_numeric_features(x, self.pad_idx, self.feat_names)
        # Project features
        features = self.feat_layer(features)
        
        # Combine LSTM output and features
        combined = torch.cat([hidden, features], dim=1)
        # Get final prediction
        out = self.fc(combined)
        
        return out


# Initialize model
model = BiLSTMClassifier(vocab_size, embedding_dim, hidden_dim, label_size, padding_idx, num_features, selected_features)
# Move model to device (MPS or CPU)
model.to(device)

# Define loss function
criterion = nn.CrossEntropyLoss()
# Set initial learning rate
learning_rate = 0.1

# Training function
def train_model(model, data_loader, criterion, lr):
    model.train()  # set model to training mode
    total_loss = 0  # initialize total loss
    
    # Loop through batches
    for batch in data_loader:
        # Get input and labels
        x = batch.text.to(device)
        y = batch.label.to(device)
        
        # Forward pass
        output = model(x)
        # Calculate loss
        loss = criterion(output, y)
        
        # Backward pass
        loss.backward()
        
        # Manual parameter update (plain SGD: w <- w - lr * grad)
        with torch.no_grad():
            for param in model.parameters():
                # Frozen parameters (e.g. the embeddings) have no gradient
                if param.grad is not None:
                    param -= lr * param.grad  # in-place weight update
                    param.grad.zero_()  # reset gradient for the next batch
        
        # Accumulate loss
        total_loss += loss.item()
    
    # Return average loss
    return total_loss / len(data_loader)


# Evaluation function
def eval_model(model, data_loader, criterion):
    model.eval()  # set model to evaluation mode
    total_loss = 0  # initialize total loss

    with torch.no_grad():
        for batch in data_loader:
            # Get input and labels
            x = batch.text.to(device)
            y = batch.label.to(device)
            # Forward pass
            output = model(x)
            # Calculate loss
            loss = criterion(output, y)
            # Accumulate loss
            total_loss += loss.item()
    
    # Return average loss
    return total_loss / len(data_loader)


# Function to calculate accuracy
def get_accuracy(model, data_loader):
    model.eval()  # set model to evaluation mode
    correct = 0  # count of correct predictions
    total = 0  # total number of samples
    
    with torch.no_grad():
        for batch in data_loader:
            # Get input and labels
            x = batch.text.to(device)
            y = batch.label.to(device)
            # Forward pass
            output = model(x)
            # Get predicted class
            _, pred = torch.max(output, 1)
            # Update total count
            total += y.size(0)
            # Update correct count
            correct += (pred == y).sum().item()
    
    # Return accuracy
    return correct / total


# Set number of epochs
epochs = 50
# Initialize best validation loss
best_loss = float('inf')

# Lists to store metrics
train_losses = []
val_losses = []
train_accs = []
val_accs = []

# Start training
print("\nStarting training...")

# Training loop
for epoch in range(epochs):
    # Train for one epoch
    train_loss = train_model(model, train_iter, criterion, learning_rate)
    # Evaluate on validation set
    val_loss = eval_model(model, val_iter, criterion)
    
    # Calculate accuracies
    train_acc = get_accuracy(model, train_iter)
    val_acc = get_accuracy(model, val_iter)
    
    # Store metrics
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accs.append(train_acc)
    val_accs.append(val_acc)
    
    # Save best model
    if val_loss < best_loss:
        best_loss = val_loss
        best_model = copy.deepcopy(model)
    
    # Reduce learning rate every 10 epochs
    if (epoch + 1) % 10 == 0:
        learning_rate = learning_rate * 0.8 
    
    # Print progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}: Train Loss={train_loss:.3f}, Val Loss={val_loss:.3f}')
        print(f'  Train Acc={train_acc:.3f}, Val Acc={val_acc:.3f}')

Model Configuration:
Vocabulary size: 6864
Embedding dimension: 100
Hidden dimension: 256
Number of features: 6

Starting training...
Epoch 10: Train Loss=0.671, Val Loss=0.661
  Train Acc=0.608, Val Acc=0.630
Epoch 20: Train Loss=0.603, Val Loss=0.562
  Train Acc=0.687, Val Acc=0.691
Epoch 30: Train Loss=0.537, Val Loss=0.517
  Train Acc=0.767, Val Acc=0.765
Epoch 40: Train Loss=0.511, Val Loss=0.552
  Train Acc=0.756, Val Acc=0.743
Epoch 50: Train Loss=0.502, Val Loss=0.471
  Train Acc=0.771, Val Acc=0.783
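
The training loop multiplies the learning rate by 0.8 after every 10th epoch, which amounts to a step-decay schedule. The sketch below reproduces that schedule in isolation; the helper name `step_decay` and its parameters are illustrative and do not appear in the notebook.

```python
def step_decay(initial_lr, epoch, step=10, factor=0.8):
    """Learning rate in effect during a 0-indexed epoch.

    Mirrors the loop above: the rate is multiplied by `factor`
    once every `step` completed epochs.
    """
    return initial_lr * factor ** (epoch // step)


# Learning rate used in each of the 50 training epochs
schedule = [step_decay(0.1, e) for e in range(50)]

# Show the rate at the first epoch, at each decay boundary, and at the last epoch
for e in (0, 9, 10, 20, 30, 40, 49):
    print(f"epoch {e + 1:2d}: lr = {schedule[e]:.5f}")
```

Under this schedule the last ten epochs run at a rate of 0.1 × 0.8⁴ = 0.04096, so the step size shrinks by less than a factor of three over the whole run.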


## Results

After training for 50 epochs, we compared the BiLSTM model against a logistic regression baseline to assess how much deep learning combined with feature fusion helps sentiment classification. Accuracy for each model:

- Logistic Regression Baseline: 57.70%
- BiLSTM + Feature Fusion (Validation, best epoch): 78.67%
- BiLSTM + Feature Fusion (Test): 78.39%

### Conclusion

1. The BiLSTM model with feature fusion achieved 78.39% accuracy on the test set, 20.69 percentage points above the logistic regression baseline (57.70%). This gap suggests that a recurrent model reading pretrained word embeddings in sequence captures sentiment cues that a linear model over hand-crafted features alone does not.

2. The logistic regression baseline, using only hand-crafted features, achieved 57.70% accuracy, 7.70 percentage points above random guessing (50%). This indicates that the hand-crafted linguistic features carry some sentiment signal, although no significance test was run on the difference.

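3. The gap between validation accuracy (78.67%) and test accuracy (78.39%) is only 0.28 percentage points, which indicates that the model generalizes well to unseen data and suggests that our regularization strategies limited overfitting. Note that the validation figure is the highest accuracy over all epochs, whereas the test figure comes from the checkpoint with the lowest validation loss, so the two may not come from the same epoch.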
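
The comparisons above are differences between accuracies, so they are best read in percentage points. As a quick check of the arithmetic, the sketch below computes the absolute and relative gains from the three reported accuracies; the variable names are illustrative.

```python
# Accuracies reported in the Results section
baseline_acc = 0.5770  # logistic regression baseline
val_acc = 0.7867       # BiLSTM + feature fusion, validation
test_acc = 0.7839      # BiLSTM + feature fusion, test

# Absolute gain over the baseline, in percentage points
abs_gain_pp = (test_acc - baseline_acc) * 100
# Relative gain, as a percentage of the baseline accuracy
rel_gain_pct = (test_acc - baseline_acc) / baseline_acc * 100
# Validation/test gap, in percentage points
val_test_gap_pp = (val_acc - test_acc) * 100

print(f"Absolute gain:       {abs_gain_pp:.2f} percentage points")
print(f"Relative gain:       {rel_gain_pct:.2f}%")
print(f"Validation/test gap: {val_test_gap_pp:.2f} percentage points")
```

The absolute gain is 20.69 percentage points, while the relative gain over the baseline is roughly 36%, which is why the two should not be reported interchangeably.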

In [8]:
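# Evaluate the checkpoint with the lowest validation loss on the test set
test_acc = get_accuracy(best_model, test_iter)

print("\nPerformance Comparison:")
print(f"  Logistic Regression Baseline:  {lr_acc:.4f}")
# Note: this is the highest validation accuracy over all epochs, which may differ from best_model's
print(f"  BiLSTM + Feature Fusion (val): {max(val_accs):.4f}")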
print(f"  BiLSTM + Feature Fusion (test):{test_acc:.4f}")

print("Summary:")
print(f"  1. Feature Validation: LR baseline {lr_acc:.4f}")
print(f"  2. Deep Model: BiLSTM with feature fusion achieved {test_acc:.4f}")


Performance Comparison:
  Logistic Regression Baseline:  0.5770
  BiLSTM + Feature Fusion (val): 0.7867
  BiLSTM + Feature Fusion (test):0.7839
Summary:
  1. Feature Validation: LR baseline 0.5770
  2. Deep Model: BiLSTM with feature fusion achieved 0.7839
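
The summary reports `max(val_accs)`, while `best_model` is the checkpoint with the lowest validation loss, so the two numbers can come from different epochs. One possible way to report the validation accuracy of the selected checkpoint is sketched below. The helper `metrics_at_best_loss` is hypothetical, and the sample lists hold only the five values printed in the training log (epochs 10 to 50), not the full 50-epoch history.

```python
def metrics_at_best_loss(val_losses, val_accs):
    """Return (index, loss, accuracy) at the position with the lowest validation loss."""
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_idx, val_losses[best_idx], val_accs[best_idx]


# Values copied from the training log above (every 10th epoch only)
logged_losses = [0.661, 0.562, 0.517, 0.552, 0.471]
logged_accs = [0.630, 0.691, 0.765, 0.743, 0.783]

idx, loss, acc = metrics_at_best_loss(logged_losses, logged_accs)
print(f"Lowest logged val loss {loss:.3f} at epoch {(idx + 1) * 10}, val acc {acc:.3f}")
```

Among the logged epochs, the lowest loss and the highest accuracy both fall at epoch 50 (0.783). The 0.7867 in the summary is higher, so it must come from an epoch that was not printed. Applying the same helper to the full `val_losses` and `val_accs` lists would show whether that epoch is also the one `best_model` was saved from.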
