## Device initialization

This section initializes the computational environment for the project. After importing the required libraries, it performs two tasks:

1. Detects available hardware (CPU vs. Apple MPS GPU)
2. Configures PyTorch to use the optimal compute device

In [1]:
import pandas as pd # Data manipulation and analysis.
import numpy as np # Numerical operations and array handling.
import matplotlib.pyplot as plt # More control, lower-level, basic plotting.
import seaborn as sns # Higher-level, more aesthetically pleasing plots.
from scipy import stats # Statistical functions and tests.

pd.set_option('display.max_columns', None) # Display all columns in DataFrame output.
pd.set_option('display.max_rows', None) # Display all rows in DataFrame output.

import torch
import types
import re
from collections import Counter
from torch.utils.data import DataLoader, Dataset
from torch import nn, optim
import copy

# Check if MPS (Apple Silicon GPU backend) is available for acceleration
mps_available = torch.backends.mps.is_available()
print('is mps available?', mps_available)

# Set the device to MPS if available, otherwise use CPU
device = torch.device('mps' if mps_available else 'cpu')
print('Using device:', device)

is mps available? True
Using device: mps


## Dataset Loading

This section loads the SST dataset from local files and processes it for binary sentiment classification.

Four files are loaded from the `SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/` folder:

### 1. `datasetSentences.txt`

Contains all sentences extracted from movie reviews. This is the primary data source: the text we want to classify.

- Column 1: `sentence_index` - Unique identifier for each sentence
- Column 2: `sentence` - The actual text of the sentence

### 2. `datasetSplit.txt`

Specifies which dataset split each sentence belongs to, so that we use the standard train/dev/test splits.

- Column 1: `sentence_index` - Links to sentences in `datasetSentences.txt`
- Column 2: `splitset_label` - Split assignment
  - **1** = Training set
  - **2** = Test set
  - **3** = Development/Validation set

### 3. `dictionary.txt`

Maps all phrases (including sub-phrases) to unique IDs. It acts as a lookup table that bridges `datasetSentences.txt` and `sentiment_labels.txt`, connecting each sentence to its sentiment label.

- Column 1: `phrase` - Text of the phrase (can be a word, sub-phrase, or complete sentence)
- Column 2: `phrase_id` - Unique integer identifier

### 4. `sentiment_labels.txt`

Contains sentiment scores for all phrases. These provide the ground-truth labels for training our sentiment classifier.

- Column 1: `phrase ids` - Links to `phrase_id` in `dictionary.txt`
- Column 2: `sentiment values` - Continuous sentiment score from 0 (most negative) to 1 (most positive)

The original dataset has five sentiment classes (very negative, negative, neutral, positive, very positive), but this project only needs binary classification (negative or positive). Sentences with sentiment values strictly between 0.4 and 0.6 (neutral) are therefore removed, and the remaining labels are converted to integers: 0 (negative, score ≤ 0.4) or 1 (positive, score ≥ 0.6).

```
0.0 ←−−−−−−− 0.2 ←−−−− 0.4 ←− 0.5 −→ 0.6 −−−−→ 0.8 −−−−−−→ 1.0
Very Negative | Negative | Neutral | Positive | Very Positive
```
---
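
The thresholding rule described above can be sketched as a small helper. This is a minimal illustration rather than the notebook's own code: the function name `binarize_sentiment` and the `None` return value for neutral scores are assumptions made for the example.

```python
from typing import Optional

def binarize_sentiment(score: float) -> Optional[int]:
    """Map a continuous SST score in [0, 1] to a binary label.

    Returns 0 for negative (score <= 0.4), 1 for positive (score >= 0.6),
    and None for neutral scores, which are dropped from the dataset.
    """
    if score <= 0.4:
        return 0
    if score >= 0.6:
        return 1
    return None  # neutral: strictly between 0.4 and 0.6

scores = [0.1, 0.4, 0.5, 0.6, 0.93]
labels = [binarize_sentiment(s) for s in scores]
print(labels)  # [0, 0, None, 1, 1]
```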

### Final Format

After processing, each DataFrame has this structure:

| Column | Type | Description |
|--------|------|-------------|
| `sentence` | string | Movie review text |
| `label` | int | Binary sentiment (0 = negative, 1 = positive) |

In [2]:
# Load all sentences with their unique indices
sentences_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/datasetSentences.txt', sep='\t', header=0)

# Load dataset split assignments (train/dev/test)
split_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/datasetSplit.txt', sep=',', header=0)

# Load phrase dictionary (maps phrases to unique IDs)
dictionary_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/dictionary.txt', sep='|', header=None, names=['phrase', 'phrase_id'])

# Load sentiment labels for all phrases
labels_df = pd.read_csv('SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank/sentiment_labels.txt', sep='|', header=0)

print(f"Sentences: {len(sentences_df)}")
print(f"Dictionary phrases: {len(dictionary_df)}")
print(f"Labels: {len(labels_df)}")

# Combine sentences with their train/dev/test assignments
data = sentences_df.merge(split_df, on='sentence_index')

# Initialize empty dictionary to store sentence-to-label mappings
sentence_labels = {}

# Iterate through each sentence in the merged dataset
for idx, row in data.iterrows():
    sentence = row['sentence']  # Extract the sentence text
    
    # Look up phrase_id in dictionary
    phrase_match = dictionary_df[dictionary_df['phrase'] == sentence]
    
    # Check if we found a match
    if not phrase_match.empty:
        # Extract the phrase_id for this sentence
        phrase_id = phrase_match.iloc[0]['phrase_id']
        
        # Look up sentiment score using phrase_id
        label_match = labels_df[labels_df['phrase ids'] == phrase_id]
        
        # Check if we found a sentiment score
        if not label_match.empty:
            # Extract the continuous sentiment value (0.0 to 1.0)
            sentiment_value = label_match.iloc[0]['sentiment values']
            
            # Convert continuous sentiment to binary label
            if sentiment_value <= 0.4:
                # Negative class: very negative + negative samples
                sentence_labels[sentence] = 0
                
            elif sentiment_value >= 0.6:
                # Positive class: positive + very positive samples
                sentence_labels[sentence] = 1

# Apply labels and filter neutral samples: map the sentence_labels dictionary onto a new 'label' column.
# Sentences missing from sentence_labels (neutral samples, or sentences without an exact dictionary match) get NaN.
data['label'] = data['sentence'].map(sentence_labels)

# Remove all rows where label is NaN
data = data.dropna(subset=['label'])

# Convert label column from float to integer type.
data['label'] = data['label'].astype(int)

# Display statistics after filtering
print(f"\nAfter filtering neutral samples: {len(data)}")
print(f"Label distribution:\n{data['label'].value_counts()}")

# Training set (splitset_label == 1), keep only 'sentence' and 'label' columns, drop unnecessary columns
train_df = data[data['splitset_label'] == 1][['sentence', 'label']].reset_index(drop=True)

# Test set (splitset_label == 2)
test_df = data[data['splitset_label'] == 2][['sentence', 'label']].reset_index(drop=True)

# Validation/Development set (splitset_label == 3)
val_df = data[data['splitset_label'] == 3][['sentence', 'label']].reset_index(drop=True)

# Display final dataset sizes
print(f"\nTrain: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

Sentences: 11855
Dictionary phrases: 239232
Labels: 239232

After filtering neutral samples: 9142
Label distribution:
label
1    4739
0    4403
Name: count, dtype: int64

Train: 6568, Val: 825, Test: 1749
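
The row-by-row lookup in the cell above filters `dictionary_df` and `labels_df` once per sentence, which scans the whole phrase table each time. One possible alternative, sketched here with toy data in plain Python, is to build two dictionaries once and look up each sentence in constant time. The sample phrases, IDs, and scores are invented for illustration and are not taken from the dataset.

```python
# Toy stand-ins for dictionary.txt (phrase -> phrase_id) and
# sentiment_labels.txt (phrase_id -> score). Values are made up.
phrase_to_id = {"a great film": 10, "dull and slow": 11, "it is a movie": 12}
id_to_score = {10: 0.9, 11: 0.15, 12: 0.5}

sentences = ["a great film", "dull and slow", "it is a movie", "not in dictionary"]

sentence_labels = {}
for sentence in sentences:
    phrase_id = phrase_to_id.get(sentence)   # None if no exact match
    score = id_to_score.get(phrase_id)       # None if no score is recorded
    if score is None:
        continue
    if score <= 0.4:
        sentence_labels[sentence] = 0        # negative
    elif score >= 0.6:
        sentence_labels[sentence] = 1        # positive
    # scores strictly between 0.4 and 0.6 are neutral and skipped

print(sentence_labels)  # {'a great film': 1, 'dull and slow': 0}
```

With the real data, the two dictionaries could be built from the DataFrames, for example with `dict(zip(dictionary_df['phrase'], dictionary_df['phrase_id']))`.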


## Dataset processing

This section transforms raw text data into numerical representations that neural networks can process. Text strings cannot be directly fed into neural networks - they must first be converted into sequences of integers.

### Tokenization

Split sentences into individual words and punctuation marks, because neural networks operate on discrete units. Punctuation marks are kept as separate tokens because they can carry sentiment information.
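
As a quick illustration of this splitting behavior, the sketch below applies the same regular expression that the notebook's tokenizer uses to a made-up sentence. Note how the apostrophe becomes its own token.

```python
import re

# Same pattern as the notebook's tokenizer: runs of word characters,
# or any single character that is neither a word character nor whitespace.
token_pattern = re.compile(r"\w+|[^\w\s]")

def tokenize(s: str) -> list:
    """Split a sentence into lowercase word and punctuation tokens."""
    return [t.lower() for t in token_pattern.findall(s)]

print(tokenize("It's not good!"))
# ['it', "'", 's', 'not', 'good', '!']
```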

### Vocabulary Construction

Build a mapping between words and unique integer IDs (word → ID) so that any word can be converted into a number consistently. This includes:

- Filter rare words (frequency < 2): Reduces vocabulary size and filters potential typos/noise
- Add special tokens: 
  - `<pad>`: For making sequences equal length (required for batch processing)
  - `<unk>`: For handling words not seen during training
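
The steps above can be sketched on a toy corpus. The three sentences and the `encode` helper are assumptions for illustration; the frequency threshold and special tokens follow the description above.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"
MIN_FREQ = 2  # keep words seen at least twice

corpus = ["good movie", "good plot", "bad movie"]  # toy training sentences

counter = Counter()
for sentence in corpus:
    counter.update(sentence.split())

# Special tokens first, then frequent words in descending frequency order
itos = [PAD, UNK] + [w for w, c in counter.most_common() if c >= MIN_FREQ]
stoi = {w: i for i, w in enumerate(itos)}

def encode(sentence: str) -> list:
    """Convert a sentence to IDs, mapping unseen or rare words to <unk>."""
    return [stoi.get(w, stoi[UNK]) for w in sentence.split()]

print(itos)                      # ['<pad>', '<unk>', 'good', 'movie']
print(encode("good bad movie"))  # [2, 1, 3]
```

Here "plot" and "bad" each appear only once, so both fall below the threshold and are encoded as `<unk>`.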

In [3]:
# Tokenization: Breaking down text into individual words and punctuation marks
_tok = re.compile(r"\w+|[^\w\s]")
def tokenize(s: str):
    """
    Convert a sentence string into a list of lowercase tokens.
    """
    return [t.lower() for t in _tok.findall(s)]

# Define special tokens
UNK = "<unk>"  # Unknown token: used for words not in vocabulary (out-of-vocabulary words)
PAD = "<pad>"  # Padding token: used to make all sequences the same length

# Initialize a Counter to count word frequencies
counter = Counter()

# Count all words in the training set. Only training data is used to build the vocabulary, to prevent data leakage.
for sentence in train_df['sentence']:
    # Tokenize each sentence and update word counts
    counter.update(tokenize(sentence))
# Build the vocabulary list
itos = [PAD, UNK] + [w for w, c in counter.most_common() if c >= 2]
# Build reverse mapping: string-to-index dictionary
stoi = {w: i for i, w in enumerate(itos)}
# Get the padding token's index
padding_idx = stoi[PAD]
# Calculate vocabulary size
vocab_size = len(itos)

print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 6864


## Loading Pre-trained Word Embeddings

This section loads pre-trained GloVe embeddings and integrates them with our vocabulary. Instead of learning word representations from scratch, we leverage embeddings trained on massive text corpora. Pre-trained embeddings often improve model performance, especially when training data is limited.

The result is a `TEXT` object containing our vocabulary mappings (`itos`, `stoi`) and an embedding matrix in which known words start with meaningful representations, giving our model an advantage before training even begins.
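
How much of that advantage materializes depends on how many vocabulary words actually have a pre-trained vector. The sketch below shows the lookup-with-fallback idea and a simple coverage check, using plain Python lists and a hypothetical two-word `pretrained` dictionary in place of the real GloVe vectors and tensors.

```python
import random

random.seed(0)
EMBEDDING_DIM = 4  # toy size; the notebook uses 100

itos = ["<pad>", "<unk>", "good", "movie", "zzyzx"]
# Hypothetical pre-trained vectors for two of the vocabulary words
pretrained = {"good": [0.1, 0.2, 0.3, 0.4], "movie": [0.5, 0.6, 0.7, 0.8]}

matrix, found = [], 0
for tok in itos:
    if tok in pretrained:
        matrix.append(list(pretrained[tok]))
        found += 1
    else:
        # Small random initialization for tokens without a pre-trained vector
        matrix.append([random.gauss(0.0, 0.01) for _ in range(EMBEDDING_DIM)])

matrix[itos.index("<pad>")] = [0.0] * EMBEDDING_DIM  # padding stays at zero

coverage = found / len(itos)
print(f"coverage: {coverage:.0%}")  # coverage: 40%
```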

In [4]:
# Import the gensim downloader API for accessing pre-trained word embeddings
import gensim.downloader as api

# Load pre-trained GloVe (Global Vectors for Word Representation) embeddings
glv = api.load("glove-wiki-gigaword-100")

# Set the dimensionality of word embeddings to match GloVe's dimension
embedding_dim = 100

# Initialize the embedding matrix with small random values (kept for tokens not found in GloVe)
pretrained_vectors = torch.randn(vocab_size, embedding_dim) * 0.01

# Iterate through each token in the vocabulary
for i, tok in enumerate(itos):
    # Check if the current token exists in the pre-trained GloVe vocabulary
    if tok in glv:
        # If found, replace the random vector with the pre-trained GloVe vector
        pretrained_vectors[i] = torch.tensor(glv[tok])

# Set the embedding vector for the padding token to zeros
pretrained_vectors[padding_idx] = 0.0

# Create a namespace object to mimic the torchtext Field structure
TEXT = types.SimpleNamespace(
    vocab=types.SimpleNamespace(
        itos=itos,                      # int-to-string: list mapping indices to tokens
        stoi=stoi,                      # string-to-int: dict mapping tokens to indices
        vectors=pretrained_vectors      # The embedding matrix with pre-trained vectors
    )
)

## Feature Engineering

This section implements a feature engineering pipeline to establish a baseline understanding of sentiment classification. We first extract hand-crafted linguistic features and evaluate their predictive power. This approach helps us understand which signals are important for sentiment analysis.

Fifteen hand-crafted features are extracted, grouped into seven categories, each capturing a different aspect of sentiment expression:

1. Length Features (Text Complexity)

```python
- len_tokens: Number of word tokens
- len_chars: Total character count  
- avg_tok_len: Average token length
```

- Positive reviews might be longer (detailed praise) or negative reviews might be longer (detailed complaints)
- Longer words may indicate more formal or complex language
- Text length reflects how much the author cared to express

2. Punctuation Features (Emotional Expression)

People expressing strong sentiment use more emphatic punctuation.

```python
- count_exclaim: '!' count (excitement/emphasis)
- count_question: '?' count (uncertainty/questioning)
- count_period: '.' count (sentence structure)
- count_comma: ',' count (clause complexity)
- count_punct_total: Overall punctuation density
```

- Exclamation marks signal strong emotion (both positive and negative)
- Question marks often indicate confusion, doubt, or rhetorical emphasis
- Commas suggest detailed, structured arguments (common in thoughtful reviews)
- Punctuation density reflects writing style and emotional intensity

3. Uppercase Features (Emphasis/Shouting)

```python
- count_upper_tokens: Number of ALL CAPS words
- ratio_upper_tokens: Proportion of text in caps
```

- ALL CAPS = SHOUTING or EMPHASIS
- Strong indicator of emotional intensity

4. Elongation Features (Informal Emphasis)

Character elongation mimics prosody (speech intonation) in written text.

```python
- count_elongated: Words with repeated characters
```

- Informal emphasis technique common in social media and reviews
- Strong sentiment indicator: Positive: "This is soooo good!" Negative: "Soooo disappointed"
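
A minimal sketch of how elongation can be detected with a backreference, mirroring the pattern used in the code cell below. The helper name `is_elongated` is an assumption for this example.

```python
import re

# A character followed by at least two more copies of itself (3+ in a row)
ELONGATED = re.compile(r"(.)\1{2,}")

def is_elongated(token: str) -> bool:
    # Require at least one letter so that "..." or "!!!" is not counted
    return any(ch.isalpha() for ch in token) and bool(ELONGATED.search(token))

for tok in ["soooo", "good", "!!!", "yesss"]:
    print(tok, is_elongated(tok))
```

Only "soooo" and "yesss" are flagged: "good" repeats a letter just twice, and "!!!" contains no letters.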

---

5. Negation Features (Sentiment Reversal)

Simple bag-of-words models fail on negations - this feature explicitly captures them.

```python
- count_negations: Negation words (not, no, never, don't, can't, etc.)
```

- Negations reverse polarity: "good" → positive, "not good" → negative
- Negations interact with surrounding words to flip meaning
- High negation count may indicate negative sentiment ("not good", "don't like", "never again")
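
One practical detail is that a regex tokenizer of the kind used in this notebook splits contractions such as "don't" into several tokens, so a simple set lookup misses them. The sketch below counts standalone negation words on the token list and contractions on the raw text. The `NEGATION_WORDS` subset and the sample sentences are chosen for illustration only.

```python
import re

token_pattern = re.compile(r"\w+|[^\w\s]")
NEGATION_WORDS = {"not", "no", "never", "cannot"}  # subset, for illustration

def count_negations(text: str) -> int:
    lowered = text.lower()
    tokens = token_pattern.findall(lowered)
    # The tokenizer splits "don't" into "don", "'", "t", so contractions are
    # counted on the raw text instead of the token list.
    word_hits = sum(1 for t in tokens if t in NEGATION_WORDS)
    contraction_hits = lowered.count("n't")
    return word_hits + contraction_hits

print(count_negations("I don't like it"))         # 1
print(count_negations("not good , never again"))  # 2
```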

6. Intensifier Features (Sentiment Amplification)

Intensifiers modify adjectives to increase subjective intensity.

```python
- count_intensifiers: Intensifying adverbs (very, really, extremely, incredibly)
```

- Amplify sentiment strength without changing polarity
- Indicate strong opinions and emotional engagement
- Presence suggests the author felt compelled to emphasize their sentiment

7. Emoticon Features (Direct Sentiment Signals)

Emoticons serve as digital paralanguage, conveying emotion that would be expressed through facial expressions in person.

```python
- count_pos_emotes: Positive emoticons (:), :D, ^^)
- count_neg_emotes: Negative emoticons (:(, :/, :'()
```

- Unambiguous sentiment indicators - direct emotional expression
- Common in informal text (social media, casual reviews)
- Often used to clarify tone: "This is great :)" vs. "This is great :/"

### Feature Normalization Strategy

Most features are normalized to prevent scale imbalance:

- Raw character count (100s) would overwhelm emoticon count (0-2)
- A text with 100 tokens and 5 exclamations is different from one with 10 tokens and 5 exclamations

The following methods are used:

- Length features → Scaled by typical text size
- Punctuation → Divided by token or character count
- Ratios → Already normalized
- Emoticons → Raw counts (typically 0-2, so left unscaled)
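
The exclamation-mark example above works out as follows. The helper `per_token_rate` is a hypothetical name used only for this sketch.

```python
def per_token_rate(count: int, num_tokens: int) -> float:
    # Floor the denominator at 1 to avoid division by zero on empty text
    return count / max(num_tokens, 1)

print(per_token_rate(5, 100))  # 0.05
print(per_token_rate(5, 10))   # 0.5
print(per_token_rate(0, 0))    # 0.0
```

The same raw count of 5 yields a rate ten times higher in the shorter text, which is the distinction that raw counts would hide.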

### Chi-Square Feature Importance Test

The Chi-Square test measures the statistical dependency between each feature and the sentiment label. A high χ² score indicates a strong association between the feature and sentiment, suggesting the feature is informative for prediction. A low p-value indicates that the association is unlikely to be due to random chance alone. Note that scikit-learn's `chi2` expects non-negative feature values, which our normalized counts and ratios satisfy.

We can keep features with high χ² scores (strong predictors) and discard those with low scores (weak predictors), so that our model focuses on the most informative signals while reducing noise and dimensionality.
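
To make the score more concrete, here is a plain-Python sketch of the statistic for a single feature. It follows our understanding of how scikit-learn's `chi2` works (per-class feature sums compared against the sums expected under independence) and should be read as an illustration under that assumption, not as the library's implementation. The function name and toy data are invented.

```python
import math

def chi2_single_feature(x, y):
    """Chi-square statistic for one non-negative feature and class labels."""
    n = len(x)
    total = sum(x)                      # feature mass over all samples
    if total == 0:
        return math.nan                 # expected counts are all zero
    stat = 0.0
    for cls in sorted(set(y)):
        observed = sum(xi for xi, yi in zip(x, y) if yi == cls)
        class_prob = sum(1 for yi in y if yi == cls) / n
        expected = class_prob * total
        stat += (observed - expected) ** 2 / expected
    return stat

labels = [1, 1, 0, 0]
print(chi2_single_feature([1, 1, 0, 0], labels))  # 2.0
print(chi2_single_feature([1, 0, 1, 0], labels))  # 0.0
print(chi2_single_feature([0, 0, 0, 0], labels))  # nan
```

A feature that is zero for every sample has zero expected counts, which is one plausible reason for the `NaN` scores reported for the two emoticon features in the output below.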

### Logistic Regression Baseline

Logistic Regression serves as a linear baseline model. It learns an explicit weight for each feature, allowing us to understand exactly how much each linguistic pattern contributes to sentiment prediction. This transparency is valuable for validating our feature engineering decisions and building intuition about what matters for sentiment classification.

An accuracy of 70-75% indicates that our hand-crafted features successfully capture significant sentiment signals, validating our feature engineering approach while leaving room for more sophisticated models to improve. If accuracy falls between 50-60%, it suggests our features are too weak and that we need representation learning techniques (like word embeddings and neural networks) to extract deeper semantic patterns. Conversely, if accuracy exceeds 80%, it may indicate that the classification problem is relatively simple or that the dataset is small or homogeneous.
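
When reading these accuracy bands, it helps to compare against the accuracy of always predicting the most frequent class. The sketch below computes that reference point from the label counts printed by the loading cell. Those counts cover all splits combined, so the validation split's own majority baseline may differ slightly.

```python
from collections import Counter

def majority_baseline(label_counts: Counter) -> float:
    """Accuracy of always predicting the most frequent class."""
    return max(label_counts.values()) / sum(label_counts.values())

# Label counts printed by the loading cell (all splits combined)
counts = Counter({1: 4739, 0: 4403})
print(f"{majority_baseline(counts):.3f}")  # 0.518
```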

In [5]:
from sklearn.metrics import accuracy_score          # Metric to evaluate classification accuracy
from sklearn.preprocessing import StandardScaler  # Feature normalization (z-score scaling)
from sklearn.feature_selection import chi2       # Chi-square test for feature importance
from sklearn.linear_model import LogisticRegression  # Simple linear classifier as baseline

# Words that negate or reverse sentiment (e.g., "not good" becomes negative)
NEGATIONS = {"not","no","never","n't","dont","don't","didn't","won't","cannot","can't"}

# Words that amplify sentiment intensity (e.g., "very good" is stronger than "good")
INTENSIFIERS = {"very","really","so","too","extremely","super","highly","utterly","absolutely","incredibly"}

# Emoticons indicating positive sentiment
POS_EMOTES = [":)",":-)",":d","=)",":]","^^"]

# Emoticons indicating negative sentiment
NEG_EMOTES = [":(", ":-(", "):", ":'(", ":/", ":-/"]


# These features capture linguistic patterns that correlate with sentiment
NUMERIC_FEATURE_NAMES = [
    "len_tokens",           # Number of word tokens (vocabulary richness)
    "len_chars",            # Total character count (text verbosity)
    "avg_tok_len",          # Average token length (word complexity)
    
    "count_exclaim",        # '!' count (excitement/emphasis indicator)
    "count_question",       # '?' count (questioning/uncertainty)
    "count_period",         # '.' count (sentence structure)
    "count_comma",          # ',' count (clause complexity)
    "count_punct_total",    # Total punctuation (stylistic density)
    
    "count_upper_tokens",   # ALL CAPS words (shouting/emphasis)
    "ratio_upper_tokens",   # Proportion of uppercase tokens
    
    "count_elongated",      # Words with repeated letters (e.g., "soooo good")
    "count_negations",      # Negation words (sentiment reversal)
    "count_intensifiers",   # Intensifier words (sentiment amplification)
    
    "count_pos_emotes",     # Positive emoticon count
    "count_neg_emotes"      # Negative emoticon count
]


def is_all_caps_token(tok: str) -> bool:
    """
    Check if a token is written in ALL CAPS.
    """
    # Extract only alphabetic characters
    letters = [ch for ch in tok if ch.isalpha()]
    # Require at least 2 letters and all must be uppercase
    return (len(letters) >= 2) and all(ch.isupper() for ch in letters)


def compute_features_for_text(text: str):
    """
    Extract hand-crafted linguistic features from text for sentiment analysis.
    """
    # Tokenize text using pre-defined tokenizer pattern
    toks = _tok.findall(text)
    toks_lower = [t.lower() for t in toks]
    
    # Filter to word tokens (containing at least one letter)
    word_tokens = [t for t in toks if any(ch.isalpha() for ch in t)]
    T = len(word_tokens)  # Total number of word tokens
    L = len(text)         # Total character length
    
    # Punctuation features: Capture emotional expression through punctuation
    count_exclaim = text.count("!")        # Excitement/emphasis
    count_question = text.count("?")       # Questions/uncertainty
    count_period = text.count(".")         # Sentence boundaries
    count_comma = text.count(",")          # Clause complexity
    
    # Total non-alphanumeric, non-space characters
    count_punct_total = sum(1 for ch in text if (not ch.isalnum()) and (not ch.isspace()))
    
    # Stylistic features: Capture writing style indicators
    count_upper_tokens = sum(1 for t in word_tokens if is_all_caps_token(t))
    ratio_upper_tokens = (count_upper_tokens / T) if T > 0 else 0.0
    
    # Elongated words indicate emphasis, any character repeated 3+ times
    count_elongated = sum(1 for t in toks_lower 
                         if any(ch.isalpha() for ch in t) and re.search(r"(.)\1{2,}", t))
    
    # Sentiment modifier features
    count_negations = sum(1 for t in toks_lower if t in NEGATIONS) + text.lower().count("n't")
    
    # Intensifiers amplify sentiment strength
    count_intensifiers = sum(1 for t in toks_lower if t in INTENSIFIERS)
    
    # Emoticon features: Direct sentiment indicators
    tl = text.lower()
    count_pos_emotes = sum(tl.count(e.strip()) for e in POS_EMOTES)
    count_neg_emotes = sum(tl.count(e.strip()) for e in NEG_EMOTES)
    
    # Length features: Basic text statistics
    len_tokens = float(T)
    len_chars = float(L)
    avg_tok_len = (len_chars / len_tokens) if T > 0 else 0.0
    
    # Normalization denominator: word-token count, floored at 1.0 to avoid division by zero
    T_norm = max(T, 1.0)
    
    # Normalize each feature to prevent any single feature from dominating
    feats = {
        
        # Length features: scale relative to text size
        "len_tokens_norm": len_tokens / max(T_norm, 20.0),
        "len_chars_norm": len_chars / (4.0 * T_norm),  # Assume ~4 chars per token
        "avg_tok_len_norm": avg_tok_len / 10.0,
        
        # Punctuation: normalize by token count
        "count_exclaim_norm": count_exclaim / T_norm,
        "count_question_norm": count_question / T_norm,
        "count_period_norm": count_period / T_norm,
        "count_comma_norm": count_comma / T_norm,
        "count_punct_total_norm": count_punct_total / max(L, 1.0),
        
        # Uppercase: scale by half token count (typically rare)
        "count_upper_tokens_norm": count_upper_tokens / max(T_norm/2.0, 1.0),
        "ratio_upper_tokens_norm": ratio_upper_tokens,  # Already a ratio
        
        # Style modifiers: normalize by token count
        "count_elongated_norm": count_elongated / T_norm,
        "count_negations_norm": count_negations / T_norm,
        "count_intensifiers_norm": count_intensifiers / T_norm,
        
        # Emoticons: raw counts (typically 0-2)
        "count_pos_emotes_norm": float(count_pos_emotes),
        "count_neg_emotes_norm": float(count_neg_emotes),
    }
    return feats


print("Building feature validation dataset...")
feat_rows = []

# Extract features for each training example
for idx, row in train_df.iterrows():
    feats = compute_features_for_text(row['sentence'])
    feats["label"] = row['label']  # Attach ground truth label
    feat_rows.append(feats)

# Convert to DataFrame for analysis
df_features = pd.DataFrame(feat_rows)

# Get list of normalized feature columns (exclude label)
feature_cols = [c for c in df_features.columns if c.endswith("_norm")]

print(f"Number of features: {len(feature_cols)}")
print(f"Training samples: {len(df_features)}")


# Prepare feature matrix and labels
X_chi = df_features[feature_cols].values
y_chi = df_features["label"].values

# Chi-square test measures dependency between each feature and the label
# Higher chi2 score = stronger association with sentiment
chi_vals, p_vals = chi2(X_chi, y_chi)

# Create ranking table
chi_df = pd.DataFrame({
    "feature": feature_cols,
    "chi2_score": chi_vals,      # Higher = more informative
    "p_value": p_vals            # Lower = more statistically significant
}).sort_values("chi2_score", ascending=False)

print("\nFeature Importance Ranking (Chi-square):")
print(chi_df.to_string(index=False))

# Extract features from validation set
val_feat_rows = []
for idx, row in val_df.iterrows():
    feats = compute_features_for_text(row['sentence'])
    feats["label"] = row['label']
    val_feat_rows.append(feats)

df_val = pd.DataFrame(val_feat_rows)

# Prepare train/val splits
X_train_lr = df_features[feature_cols].values
y_train_lr = df_features["label"].values
X_val_lr = df_val[feature_cols].values
y_val_lr = df_val["label"].values

# Standardize features (zero mean, unit variance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_lr)
X_val_scaled = scaler.transform(X_val_lr)

# Train logistic regression classifier
lr_model = LogisticRegression(max_iter=1000, solver="liblinear")
lr_model.fit(X_train_scaled, y_train_lr)

# Evaluate on validation set
lr_pred = lr_model.predict(X_val_scaled)
lr_acc = accuracy_score(y_val_lr, lr_pred)

print(f"\nLR Baseline validation accuracy: {lr_acc:.4f}")
print(f"Conclusion: Hand-crafted features alone achieve {lr_acc*100:.1f}% classification ability")

# Select features with chi-square score above threshold
IMPORTANCE_THRESHOLD = 1.0
selected_features = chi_df[chi_df["chi2_score"] > IMPORTANCE_THRESHOLD]["feature"].tolist()

print(f"Based on chi-square > {IMPORTANCE_THRESHOLD}, selected {len(selected_features)} features:")
for feat in selected_features:
    chi_val = chi_df[chi_df["feature"] == feat]["chi2_score"].values[0]
    print(f"   - {feat:<30} (chi2={chi_val:.2f})")

Building feature validation dataset...
Number of features: 15
Training samples: 6568

Feature Importance Ranking (Chi-square):
                feature  chi2_score  p_value
   count_negations_norm   11.013588 0.000904
count_intensifiers_norm    5.073290 0.024297
    count_question_norm    4.994573 0.025427
     count_exclaim_norm    3.474562 0.062319
         len_chars_norm    2.561395 0.109502
       count_comma_norm    2.267597 0.132105
      count_period_norm    1.148853 0.283789
       avg_tok_len_norm    1.024558 0.311440
count_upper_tokens_norm    1.018557 0.312862
ratio_upper_tokens_norm    0.509278 0.475451
 count_punct_total_norm    0.328244 0.566695
   count_elongated_norm    0.266407 0.605752
        len_tokens_norm    0.000239 0.987673
  count_pos_emotes_norm         NaN      NaN
  count_neg_emotes_norm         NaN      NaN

LR Baseline validation accuracy: 0.5952
Conclusion: Hand-crafted features alone achieve 59.5% classification ability
Based on chi-square > 1.0, selected

## DataLoader Setup

Create efficient data pipelines that feed batches of samples to the model during training.

### Dataset Class

Wrap preprocessed data in PyTorch's `Dataset` interface. PyTorch's `DataLoader` works with this interface for efficient batching and data loading.

This pipeline ensures our text data is properly prepared for the LSTM model while maintaining reproducibility and preventing data leakage.


### Compute Numeric Feature Function

The function processes batches of tokenized text in real-time, converting token IDs back to their string representations to analyze character-level patterns (punctuation, capitalization, emoticons) that aren't captured by word embeddings alone. These features are then normalized to comparable scales and concatenated with the neural network's learned representations, creating a richer input that combines learned semantic knowledge with explicit linguistic rules.

In the neural network forward pass, this function is called to compute features for each batch.

### Collate Function 

Sentences have variable lengths (e.g., 5 words vs 20 words), but all sequences in a batch must have the same length to be stacked into one tensor. To address this, shorter sequences are padded with the special `<pad>` token to match the longest sequence in each batch.
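
The padding step can be sketched with plain Python lists. The notebook's own collate function is not shown in this section, so `pad_batch` is a hypothetical helper, and `pad_idx=0` assumes `<pad>` sits at index 0 of the vocabulary, as it does in the vocabulary built earlier.

```python
def pad_batch(sequences, pad_idx=0):
    """Right-pad each list of token IDs to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 6, 7], [8], [9, 10]]
print(pad_batch(batch))  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```

In PyTorch, `torch.nn.utils.rnn.pad_sequence` with `batch_first=True` and a `padding_value` is the usual tensor counterpart of this helper.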

### Training DataLoader
- `shuffle=True`: Randomize sample order each epoch to prevent overfitting to data order, which is critical for good generalization

- `batch_size=128`: Balance between speed and memory, 128 is a common choice for medium datasets

### Validation/Test DataLoaders
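- `shuffle=False`: Keep a fixed order for reproducible evaluation metrics

- `pin_memory=True`: Use page-locked host memory to speed up CPU→GPU transfer. This mainly benefits CUDA devices; it has no effect on CPU-only runs, and PyTorch may ignore it (with a warning) on MPS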

This setup ensures our data flows efficiently from disk → CPU → GPU during training while maintaining reproducibility for evaluation.

In [6]:
# Define Custom PyTorch Dataset Class for accessing individual samples during training
class SST2Dataset(Dataset):
    """
    Custom PyTorch Dataset for SST-2 sentiment classification.
    Converts text sentences into numerical representations (token IDs) and pairs them with their corresponding labels.
    """
    
    def __init__(self, dataframe, stoi):
        """
        Initialize the dataset by preprocessing all samples.
        """
        # Pre-process all data and store in memory
        self.data = []
        # Iterate through each row in the dataframe
        for _, row in dataframe.iterrows():
            sentence = row['sentence']
            label = int(row['label'])
            # Convert sentence to list of integer IDs
            ids = [stoi.get(w, stoi[UNK]) for w in tokenize(sentence)]
            # Store as dictionary for easy access
            self.data.append({'input_ids': ids, 'label': label})
    
    def __len__(self):
        """
        Return the total number of samples in the dataset.
        """
        return len(self.data)
    
    def __getitem__(self, idx):
        """
        Return a single sample at the given index.
        This is called by DataLoader to fetch individual samples.
        """
        return self.data[idx]

# Instantiate dataset objects for each split
# These objects will be used by DataLoader during training
train_ds = SST2Dataset(train_df, stoi)
val_ds = SST2Dataset(val_df, stoi)
test_ds = SST2Dataset(test_df, stoi)
# Display dataset sizes for verification
print(f"Dataset sizes - Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")


# Extract hand-crafted linguistic features from tokenized text batches.
def compute_numeric_features(x_ids, pad_idx, selected_feat_names):
    """
    This function converts token IDs back to text and computes numerical features
    that capture sentiment-related linguistic patterns. These features complement
    neural network representations by explicitly encoding domain knowledge about
    sentiment expression.
    """
    # Preserve the device (CPU/GPU) of input tensor to ensure output stays on same device
    device = x_ids.device
    
    # Get batch size (B) and sequence length (T)
    B, T = x_ids.size()
    
    # List to accumulate feature vectors for each sample in the batch
    feats_list = []
    
    # Process each text sample in the batch independently
    for i in range(B):
        # Extract token IDs for current sample and convert to Python list
        ids = x_ids[i].tolist()
        
        # Remove padding tokens - only analyze actual content
        ids = [t for t in ids if t != pad_idx]
        
        # Map token IDs back to string tokens using vocabulary
        toks = [TEXT.vocab.itos[t] for t in ids]
        
        # Reconstruct the original text by joining tokens with spaces
        text = " ".join(toks)
        
        # Compute basic text statistics
        len_tokens = float(len(toks))          # Number of tokens (words and punctuation, padding removed)
        len_chars = float(len(text))           # Total character count
        avg_tok_len = (len_chars / len_tokens) if len_tokens > 0 else 0.0  # Average word length
        
        # Count punctuation marks (emotional expression indicators)
        count_exclaim = float(text.count("!"))
        count_question = float(text.count("?"))
        count_period = float(text.count("."))
        count_comma = float(text.count(","))
        
        # Count all non-alphanumeric, non-space characters (overall punctuation density)
        count_punct_total = float(sum(1 for ch in text if (not ch.isalnum()) and (not ch.isspace())))
        
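        # Analyze capitalization patterns (emphasis/shouting).
        # Note: vocabulary tokens were lowercased by tokenize(), so this count
        # is always 0 for text reconstructed from token IDs.
        count_upper_tokens = float(sum(1 for tok in toks if is_all_caps_token(tok)))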
        
        # Ratio of uppercase tokens to total tokens
        ratio_upper_tokens = (count_upper_tokens / len_tokens) if len_tokens > 0 else 0.0
        
        # Detect elongated words (informal emphasis)
        count_elongated = float(sum(1 for tok in toks if re.search(r"(.)\1{2,}", tok or "")))
        
        # Count sentiment modifiers
        toks_lower = [t.lower() for t in toks]
        
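        # Count negation words (sentiment polarity reversers).
        # Note: unlike compute_features_for_text, "n't" contractions are not counted
        # here, because the tokenizer splits them into separate tokens.
        count_negations = float(sum(1 for t in toks_lower if t in NEGATIONS))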
        
        # Count intensifier words (sentiment amplifiers)
        count_intensifiers = float(sum(1 for t in toks_lower if t in INTENSIFIERS))
        
        # Count emoticons (direct sentiment signals)
        tl = text.lower()  # Lowercase for case-insensitive emoticon matching
        
        # Sum occurrences of positive emoticons
        count_pos_emotes = float(sum(tl.count(e.strip()) for e in POS_EMOTES))
        
        # Sum occurrences of negative emoticons
        count_neg_emotes = float(sum(tl.count(e.strip()) for e in NEG_EMOTES))
        
        # Store raw feature values
        all_feats = {
            "len_tokens": len_tokens,
            "len_chars": len_chars,
            "avg_tok_len": avg_tok_len,
            "count_exclaim": count_exclaim,
            "count_question": count_question,
            "count_period": count_period,
            "count_comma": count_comma,
            "count_punct_total": count_punct_total,
            "count_upper_tokens": count_upper_tokens,
            "ratio_upper_tokens": ratio_upper_tokens,
            "count_elongated": count_elongated,
            "count_negations": count_negations,
            "count_intensifiers": count_intensifiers,
            "count_pos_emotes": count_pos_emotes,
            "count_neg_emotes": count_neg_emotes,
        }
        
        # Use actual token count for normalization, with minimum of 1.0 to avoid division by zero
        T_norm = max(T, 1.0)
        
        all_feats_norm = {
            # Length features: normalize by typical text size
            "len_tokens_norm": all_feats["len_tokens"] / max(T_norm, 20.0),
            "len_chars_norm": all_feats["len_chars"] / (4.0 * T_norm),
            "avg_tok_len_norm": all_feats["avg_tok_len"] / 10.0,
            
            # Punctuation features: normalize by token count
            "count_exclaim_norm": all_feats["count_exclaim"] / T_norm,
            "count_question_norm": all_feats["count_question"] / T_norm,
            "count_period_norm": all_feats["count_period"] / T_norm,
            "count_comma_norm": all_feats["count_comma"] / T_norm,
            "count_punct_total_norm": all_feats["count_punct_total"] / max(len(text), 1.0),
            
            # Uppercase features: scale by half token count
            "count_upper_tokens_norm": all_feats["count_upper_tokens"] / max(T_norm/2.0, 1.0),
            "ratio_upper_tokens_norm": all_feats["ratio_upper_tokens"],
            
            # Style modifier features: normalize by token count
            "count_elongated_norm": all_feats["count_elongated"] / T_norm,
            "count_negations_norm": all_feats["count_negations"] / T_norm,
            "count_intensifiers_norm": all_feats["count_intensifiers"] / T_norm,
            
            # Emoticon features: use raw counts
            "count_pos_emotes_norm": all_feats["count_pos_emotes"],
            "count_neg_emotes_norm": all_feats["count_neg_emotes"],
        }
        
        # Extract only the selected features
        selected_vals = [all_feats_norm[fname] for fname in selected_feat_names]
        
        # Add this sample's feature vector to the batch list
        feats_list.append(selected_vals)
    
    # Convert to tensor and return
    return torch.tensor(feats_list, dtype=torch.float32, device=device)


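# Sequences in a batch have different lengths, so they are padded to a common length
def collate_fn(batch):
    """
    Pad every sequence in the batch to the longest one and stack the labels.

    Returns a SimpleNamespace with `text` of shape (batch, max_len)
    and `label` of shape (batch,).
    """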
    
    # Initialize lists to collect input sequences and labels
    ids_list = []
    labels = []
    
    # Extract data from each sample in the batch
    for item in batch:
        ids = item['input_ids']
        label = item['label']
        ids_list.append(ids)
        labels.append(label)
    
    # Find the maximum sequence length in this batch
    max_len = max(len(x) for x in ids_list)
    
    # Create a tensor filled with padding_idx (0)
    # All positions are initialized to padding_idx and will be overwritten with actual tokens
    x = torch.full((len(ids_list), max_len), padding_idx, dtype=torch.long)
    
    # Fill in the actual token IDs for each sequence
    for i, ids in enumerate(ids_list):
        # Copy token IDs to the beginning of each row
        x[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    
    # Convert labels list to tensor
    y = torch.tensor(labels, dtype=torch.long)
    
    # Return as SimpleNamespace for convenient attribute access
    return types.SimpleNamespace(text=x, label=y)

# Create DataLoader Objects
# PyTorch utility that handles batching, shuffling, and parallel data loading

# Training DataLoader
train_iter = DataLoader(
    train_ds,              # Dataset object to load from
    batch_size=128,        # Number of samples per batch
    shuffle=True,          # Randomly shuffle data each epoch
    collate_fn=collate_fn, # Our custom function to combine samples into batches
    pin_memory=True        # Pinned host memory mainly speeds up CPU-to-CUDA copies; likely no benefit on MPS or CPU
)

# Validation DataLoader
val_iter = DataLoader(
    val_ds,                # Validation dataset
    batch_size=128,        # Same batch size as training for consistency
    shuffle=False,         # DON'T shuffle validation data
    collate_fn=collate_fn, # Same collate function
    pin_memory=True
)

# Test DataLoader
test_iter = DataLoader(
    test_ds,               # Test dataset
    batch_size=128,
    shuffle=False,         # DON'T shuffle test data
    collate_fn=collate_fn,
    pin_memory=True
)


# Get the first batch from training DataLoader
first = next(iter(train_iter))

# Display batch information for verification
print("\nBatch info:")
print(f"Type: {type(first)}")  
print(f"Text shape: {first.text.shape}")  
print(f"Label shape: {first.label.shape}")  
print(f"Sample text (first 10 tokens): {first.text[0][:10]}")  
print(f"Sample label: {first.label[0]}")

Dataset sizes - Train: 6568, Val: 825, Test: 1749

Batch info:
Type: <class 'types.SimpleNamespace'>
Text shape: torch.Size([128, 51])
Label shape: torch.Size([128])
Sample text (first 10 tokens): tensor([  56,  166, 1611,    7,   81,  583,    6,  925,  605,    2])
Sample label: 0
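
The padding performed by `collate_fn` can be cross-checked against PyTorch's built-in `pad_sequence`. The sketch below is a minimal illustration, assuming a padding index of 0 as stated in the collate comment; the token IDs are made up and are not taken from the dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-ID sequences of different lengths (0 is reserved for padding).
seqs = [
    torch.tensor([5, 9, 2]),
    torch.tensor([7]),
    torch.tensor([4, 4, 8, 1, 3]),
]

# Right-pad every sequence with 0 up to the longest one in the batch.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded.shape)  # torch.Size([3, 5])

# Recover the true lengths by counting non-padding positions.
lengths = (padded != 0).sum(dim=1)
print(lengths.tolist())  # [3, 1, 5]
```

Per-sample lengths like these are what `torch.nn.utils.rnn.pack_padded_sequence` would need if the LSTM were to skip padded positions; the model in this notebook does not pack its inputs.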




## BiLSTM Sentiment Classification Model

For sentiment classification, we employ a Bidirectional Long Short-Term Memory (BiLSTM) network as our primary algorithm. This choice is justified by several key advantages:

1. Sequential Nature of Language: Text is not just a collection of random words—the order matters tremendously. LSTMs are designed specifically to process sequences where order matters. As the LSTM reads through a sentence word by word, it maintains a "memory" of what it has seen before. 

2. Long-Range Dependency Modeling: Sometimes the key to understanding sentiment is remembering something from much earlier in the text. Plain RNNs struggle with this because gradients vanish over long sequences, so information from early tokens is effectively forgotten. LSTMs mitigate the problem with "gates" that control what is written to, kept in, and read from the cell memory.

3. Bidirectional Context Capture: A regular LSTM only reads left-to-right (forward). A Bidirectional LSTM reads the sentence in both directions: one LSTM reads left-to-right (forward), and another reads right-to-left (backward). By combining information from both directions, the model sees each word in its full context, which generally helps sentiment prediction.

4. Hierarchical Representation Learning: We use 2 LSTM layers stacked on top of each other, which allows the model to learn patterns at different levels of abstraction. The first LSTM layer can learn lower-level patterns such as local word combinations, and the second layer builds on its outputs to learn more complex, abstract patterns.
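
The layout of the final hidden state shows how the two directions and two layers are combined into one sentence vector. The sketch below is a minimal illustration, assuming PyTorch's `nn.LSTM` with `batch_first=True`; the batch size, sequence length, and random inputs are placeholders rather than the notebook's data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes; embedding_dim and hidden_dim mirror the printed model configuration.
batch_size, seq_len, embedding_dim, hidden_dim = 4, 7, 100, 256

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
               batch_first=True, bidirectional=True)

# Random stand-in for a batch of embedded tokens.
embedded = torch.randn(batch_size, seq_len, embedding_dim)
output, (hidden, cell) = lstm(embedded)

# output holds the top layer's state at every token, both directions concatenated.
print(output.shape)  # torch.Size([4, 7, 512])

# hidden has shape (num_layers * num_directions, batch, hidden_dim),
# ordered as: layer 0 forward, layer 0 backward, layer 1 forward, layer 1 backward.
print(hidden.shape)  # torch.Size([4, 4, 256])

# The last two slices are the top layer's final forward and backward states.
sentence_vec = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(sentence_vec.shape)  # torch.Size([4, 512])
```

The forward state summarizes the sentence after reading its last position, while the backward state summarizes it after reading back to the first position, so their concatenation covers both reading orders.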

### Metrics

#### 1. Classification Accuracy

We use classification accuracy as our primary performance measure.

Classification accuracy is the most appropriate metric for our sentiment analysis task. First, our dataset exhibits balanced class distribution—we have roughly equal numbers of positive and negative reviews, meaning accuracy won't be misleadingly inflated by predicting the majority class.

Second, for binary sentiment classification, both classes (positive/negative) are equally important. We don't have an asymmetric cost structure where false positives are more costly than false negatives, so accuracy's equal weighting of all errors is appropriate.
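
To make the balance argument concrete, the sketch below computes accuracy for a hypothetical model and for a constant majority-class predictor. The labels and predictions are made up for illustration and do not come from the dataset.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical balanced labels (0 = negative, 1 = positive).
labels = [0, 1, 0, 1, 1, 0, 1, 0]
model_preds = [0, 1, 0, 0, 1, 0, 1, 1]
constant_preds = [1] * len(labels)  # always predict "positive"

print(accuracy(model_preds, labels))     # 0.75
print(accuracy(constant_preds, labels))  # 0.5
```

On balanced labels a constant predictor scores only 50%, so any accuracy well above that reflects learned signal rather than class imbalance.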

#### 2. Cross-Entropy Loss

We also monitor cross-entropy loss during training and validation.

Cross-entropy loss measures the divergence between predicted probability distributions and true labels, providing a more nuanced signal than accuracy during training. Cross-entropy rewards confident correct predictions and penalizes confident wrong predictions more heavily. 

Cross-entropy is the natural choice for neural network classification because it is the negative log-likelihood of the correct class, which grounds it in maximum likelihood estimation. It is also convex with respect to the final layer's logits, so gradients at the output layer are well behaved, although the loss as a whole is not convex in the network's parameters.
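
The effect of confidence on the loss can be seen in a small worked example. The sketch below implements softmax cross-entropy in plain Python; the logits are hypothetical values chosen for illustration, with index 1 standing for the positive class.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


# Hypothetical logits for [negative, positive]; the true class is positive (index 1).
confident_right = cross_entropy([-2.0, 2.0], target=1)
uncertain = cross_entropy([0.0, 0.0], target=1)
confident_wrong = cross_entropy([2.0, -2.0], target=1)

print(round(confident_right, 4))  # 0.0181
print(round(uncertain, 4))        # 0.6931
print(round(confident_wrong, 4))  # 4.0181
```

All three cases would count as simply "right" or "wrong" under accuracy, whereas the loss separates them by a factor of more than 200 between the confident-right and confident-wrong predictions.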

### Overfitting and Underfitting Prevention

#### 1. Overfitting Prevention

1. Frozen Pretrained Embeddings: GloVe embeddings are pretrained on billions of words from Wikipedia and news corpora. These vectors already encode rich semantic relationships. By freezing these weights, we prevent the model from overfitting embeddings to our small training set (6,568 samples). 


2. Dropout Regularization: Dropout randomly sets activations to zero during training with probability `p`, forcing the network to learn redundant representations. If the model relied on a single unit to detect "not" and that unit were dropped 30% of the time (`p = 0.3` in our model, applied to the word embeddings), it would have to learn alternative pathways to detect negation.

3. Early Stopping with Model Checkpointing: Training for too many epochs tends to cause overfitting: the model keeps improving on training data while validation performance degrades. We guard against this by monitoring validation loss after every epoch and saving a copy of the model whenever it reaches a new minimum. Training still runs for all 50 epochs, but even if the final model is overfit, we retain the best-performing checkpoint from earlier in training (in the run logged below, the lowest printed validation loss is at epoch 30).

4. Train-Validation Monitoring: Tracking the gap between training and validation accuracy gives a human-interpretable diagnostic of model health. As a rule of thumb, a gap of 2-5 percentage points is acceptable: the model has learned some patterns specific to the training data but still generalizes well. A gap above 10 points signals severe overfitting, prompting us to stop training, increase regularization, or reduce model capacity.
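
The checkpointing idea in item 3 can be extended with a patience rule that halts training once validation loss stops improving. The sketch below is one possible formulation, not the notebook's training loop: the function name `best_epoch_with_patience`, the `patience` parameter, and the loss values are all hypothetical.

```python
def best_epoch_with_patience(val_losses, patience):
    """Return (best_epoch, stop_epoch) as 1-based epoch numbers.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Made-up validation losses for illustration only.
losses = [0.69, 0.66, 0.62, 0.60, 0.61, 0.59, 0.60, 0.63, 0.64, 0.66]
print(best_epoch_with_patience(losses, patience=3))  # (6, 9)
```

With these values the best checkpoint comes from epoch 6 and training would stop at epoch 9, saving the remaining epochs that only overfit further.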

#### 2. Underfitting Prevention

1. Sufficient Model Capacity: A 2-layer BiLSTM with 256 hidden units per direction provides `256 × 2 directions × 2 layers = 1,024` hidden units across the network, sufficient to learn nuanced linguistic patterns.

2. Adequate Training Duration: Training for 50 epochs with early stopping ensures we explore enough of the loss landscape. If validation loss is still decreasing at epoch 50, we'd extend training, but typically loss plateaus around epoch 20-30.

3. High Initial Learning Rate: Starting with LR=0.1 (decayed by a factor of 0.8 every 10 epochs) gives rapid initial learning under the plain gradient-descent update used here. A much smaller learning rate (e.g., 0.001) would converge very slowly within 50 epochs and could stall in a poor region of the loss surface, resulting in underfitting.

4. Monitoring Training Loss: If training loss remains high (>0.5) throughout training, this signals underfitting: the model lacks the capacity or training time to fit the data. We'd respond by adding layers or training longer.
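
Capacity can also be measured in trainable parameters rather than hidden units. The sketch below counts the LSTM's weights and biases, assuming the parameter layout PyTorch's `nn.LSTM` uses (four gates, each with input weights, recurrent weights, and two bias vectors) and the configuration printed by the training cell (embedding dimension 100, hidden dimension 256). The helper `lstm_param_count` is hypothetical.

```python
def lstm_param_count(input_dim, hidden_dim, num_layers, bidirectional):
    """Count LSTM weights and biases under the nn.LSTM parameter layout."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights, and two biases.
        per_direction = 4 * (
            layer_input * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim
        )
        total += directions * per_direction
        # Deeper layers consume the concatenated outputs of both directions.
        layer_input = hidden_dim * directions
    return total


print(lstm_param_count(100, 256, num_layers=2, bidirectional=True))  # 2310144
```

Roughly 2.3 million recurrent parameters are trained on 6,568 sentences, which is why the regularization measures listed earlier matter alongside capacity.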

In [7]:
hidden_dim = 256  # hidden dimension for LSTM
label_size = 2  # binary classification
num_features = len(selected_features)  # number of hand-crafted features

# Print model configuration
print("Model Configuration:")
print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embedding_dim}")
print(f"Hidden dimension: {hidden_dim}")
print(f"Number of features: {num_features}")


# Define BiLSTM classifier
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_size, padding_idx, num_features, feat_names):
        super(BiLSTMClassifier, self).__init__()  # initialize parent class
        
        # Create embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        # Load pretrained word vectors
        self.embedding.weight.data.copy_(TEXT.vocab.vectors)
        # Freeze embedding weights
        self.embedding.weight.requires_grad = False
        
        # Create bidirectional LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, batch_first=True, bidirectional=True)
        
        # Feature projection layer
        self.feat_layer = nn.Linear(num_features, 32)
        # Final classification layer
        self.fc = nn.Linear(hidden_dim * 2 + 32, output_size)
        # Dropout for regularization
        self.dropout = nn.Dropout(0.3)
        
        # Store padding index
        self.pad_idx = padding_idx
        # Store feature names
        self.feat_names = feat_names

    def forward(self, x):
        # Get word embeddings
        embedded = self.embedding(x)
        # Apply dropout
        embedded = self.dropout(embedded)
        
        # Pass through LSTM
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the top layer's final forward (hidden[-2]) and backward (hidden[-1]) states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        
        # Extract hand-crafted features
        features = compute_numeric_features(x, self.pad_idx, self.feat_names)
        # Project features
        features = self.feat_layer(features)
        
        # Combine LSTM output and features
        combined = torch.cat([hidden, features], dim=1)
        # Get final prediction
        out = self.fc(combined)
        
        return out


# Initialize model
model = BiLSTMClassifier(vocab_size, embedding_dim, hidden_dim, label_size, padding_idx, num_features, selected_features)
# Move model to device (MPS or CPU)
model.to(device)

# Define loss function
criterion = nn.CrossEntropyLoss()
# Set initial learning rate
learning_rate = 0.1

# Training function
def train_model(model, data_loader, criterion, lr):
    model.train()  # set model to training mode
    total_loss = 0  # initialize total loss
    
    # Loop through batches
    for batch in data_loader:
        # Get input and labels
        x = batch.text.to(device)
        y = batch.label.to(device)
        
        # Forward pass
        output = model(x)
        # Calculate loss
        loss = criterion(output, y)
        
        # Backward pass
        loss.backward()
        
        # Manual parameter update
        with torch.no_grad():
            for param in model.parameters():  # loop through all parameters
                if param.grad is not None:  # check if gradient exists
                    param -= lr * param.grad  # in-place gradient-descent step (safe under no_grad)
                    param.grad.zero_()  # reset gradients
        
        # Accumulate loss
        total_loss += loss.item()
    
    # Return average loss
    return total_loss / len(data_loader)


# Evaluation function
def eval_model(model, data_loader, criterion):
    model.eval()  # set model to evaluation mode
    total_loss = 0  # initialize total loss
    
    # No gradient computation needed
    with torch.no_grad():
        for batch in data_loader:  # loop through batches
            # Get input and labels
            x = batch.text.to(device)
            y = batch.label.to(device)
            # Forward pass
            output = model(x)
            # Calculate loss
            loss = criterion(output, y)
            # Accumulate loss
            total_loss += loss.item()
    
    # Return average loss
    return total_loss / len(data_loader)


# Function to calculate accuracy
def get_accuracy(model, data_loader):
    model.eval()  # set model to evaluation mode
    correct = 0  # count of correct predictions
    total = 0  # total number of samples
    
    # No gradient computation needed
    with torch.no_grad():
        for batch in data_loader:  # loop through batches
            # Get input and labels
            x = batch.text.to(device)
            y = batch.label.to(device)
            # Forward pass
            output = model(x)
            # Get predicted class
            _, pred = torch.max(output, 1)
            # Update total count
            total += y.size(0)
            # Update correct count
            correct += (pred == y).sum().item()
    
    # Return accuracy
    return correct / total


# Set number of epochs
epochs = 50
# Initialize best validation loss so the first epoch always improves on it
best_loss = float('inf')

# Lists to store metrics
train_losses = []
val_losses = []
train_accs = []
val_accs = []

# Start training
print("\nStarting training...")

# Training loop
for epoch in range(epochs):
    # Train for one epoch
    train_loss = train_model(model, train_iter, criterion, learning_rate)
    # Evaluate on validation set
    val_loss = eval_model(model, val_iter, criterion)
    
    # Calculate accuracies
    train_acc = get_accuracy(model, train_iter)
    val_acc = get_accuracy(model, val_iter)
    
    # Store metrics
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accs.append(train_acc)
    val_accs.append(val_acc)
    
    # Save best model
    if val_loss < best_loss:
        best_loss = val_loss  # update best loss
        best_model = copy.deepcopy(model)  # save model copy
    
    # Reduce learning rate every 10 epochs
    if (epoch + 1) % 10 == 0:
        learning_rate = learning_rate * 0.8  # multiply by 0.8
    
    # Print progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}: Train Loss={train_loss:.3f}, Val Loss={val_loss:.3f}')
        print(f'  Train Acc={train_acc:.3f}, Val Acc={val_acc:.3f}')

Model Configuration:
Vocabulary size: 6864
Embedding dimension: 100
Hidden dimension: 256
Number of features: 9

Starting training...
Epoch 10: Train Loss=0.657, Val Loss=0.641
  Train Acc=0.627, Val Acc=0.650
Epoch 20: Train Loss=0.613, Val Loss=0.574
  Train Acc=0.712, Val Acc=0.724
Epoch 30: Train Loss=0.579, Val Loss=0.544
  Train Acc=0.695, Val Acc=0.710
Epoch 40: Train Loss=0.568, Val Loss=0.561
  Train Acc=0.737, Val Acc=0.745
Epoch 50: Train Loss=0.541, Val Loss=0.642
  Train Acc=0.684, Val Acc=0.661
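
The learning rate in effect at each logged epoch follows from the decay rule in the training loop (multiply by 0.8 after every 10th epoch, starting from 0.1). The sketch below reproduces that schedule in closed form; the helper `step_decay` is hypothetical and uses 1-based epoch numbers to match the log.

```python
def step_decay(initial_lr, epoch, step=10, factor=0.8):
    """Learning rate used while training 1-based `epoch`."""
    # The rate is reduced only after each full block of `step` epochs.
    completed_steps = (epoch - 1) // step
    return initial_lr * factor ** completed_steps


for epoch in (1, 10, 11, 21, 31, 41, 50):
    print(epoch, round(step_decay(0.1, epoch), 5))
# 1 0.1
# 10 0.1
# 11 0.08
# 21 0.064
# 31 0.0512
# 41 0.04096
# 50 0.04096
```

Under this schedule the final ten epochs still train at about 41% of the initial rate, so the step size stays relatively large throughout the run.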


In [8]:
test_acc = get_accuracy(best_model, test_iter)

print("\nPerformance Comparison:")
print(f"  Logistic Regression Baseline:  {lr_acc:.4f}")
print(f"  BiLSTM + Feature Fusion (val): {max(val_accs):.4f}")
print(f"  BiLSTM + Feature Fusion (test):{test_acc:.4f}")

print("Summary:")
print(f"  1. Feature Validation: LR baseline {lr_acc:.4f}")
print(f"  2. Deep Model: BiLSTM with feature fusion achieved {test_acc:.4f}")


Performance Comparison:
  Logistic Regression Baseline:  0.5952
  BiLSTM + Feature Fusion (val): 0.7697
  BiLSTM + Feature Fusion (test):0.7679
Summary:
  1. Feature Validation: LR baseline 0.5952
  2. Deep Model: BiLSTM with feature fusion achieved 0.7679
