### <font color='blue'> Due 11:59pm, Monday Feb 12th 2026</font>

**Purpose / learning goals:**
- Practice training neural models in PyTorch with emphasis on optimizers, regularization, and learning-rate scheduling to meet a performance threshold.
- Use sentiment classification as a downstream task to compare classical neural baselines with fine-tuned pretrained LLMs (BERT/GPT).

**Runtime / setup notes:**
- This assignment does not require a GPU to train the models. Using a GPU (or Apple MPS) will usually speed up training for the transformer models.
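
The note above can be turned into a small device-selection helper. This is a minimal sketch, assuming a PyTorch build that may or may not expose `torch.backends.mps`; the name `pick_device` is illustrative and is not part of the provided scripts.

```python
import torch


def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, and fall back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent on older PyTorch builds, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
x = torch.zeros(2, 3, device=device)  # models and batches are moved the same way
print(device, tuple(x.shape))
```

Everything in the assignment runs on `"cpu"`; the other two branches only shorten training time.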

In this assignment, you will:
- Implement MLP and LSTM classifiers (your code)
- Run provided scripts for RNN, GRU, BERT, and GPT (for comparison)

**Implementation format:** Task 1 and Task 2 must be implemented as Python scripts (not notebooks). The open-ended questions are answered in a notebook.

To motivate the transformer architecture, scripts are provided for pretrained state-of-the-art models such as **GPT** (decoder-only) and **BERT** (encoder-only). You should run these scripts yourself to obtain results for comparison and reflection.

*Please read the `README.md` file before proceeding.*


##  Sentiment Classification: Classical Nets vs. LLMs

Sentiment classification is a common **downstream task** for evaluating how well pretrained LLMs adapt to a domain via fine-tuning, compared against classical neural baselines.

In this assignment, you'll explore how different neural architectures perform on sentiment classification:

- **Classical approaches:** MLP, RNN, LSTM, GRU (using static FastText embeddings)
- **Pretrained LLMs:** BERT and GPT (fine-tuned using Hugging Face Transformers)

You will implement MLP and LSTM yourself; scripts are provided for the remaining models.

Detailed requirements for your implementations are listed in **Your Tasks** below.


##  Dataset: Financial PhraseBank

This assignment uses the **Financial PhraseBank** dataset, developed by  
Mika V. Mäntylä, Graziella Linders, Tanja Suominen, and Miikka Kuutila.

- 📂 Dataset homepage: [Hugging Face – Financial_PhraseBank](https://huggingface.co/datasets/takala/financial_phrasebank)  
- 📄 Original paper:  
  P. Malo, A. Sinha, et al. (2014). [*“Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts”*](https://arxiv.org/pdf/1307.5336)

You can load and preview the dataset using the following code:

In [25]:
!git clone https://github.com/Anushka-De/stat359.git

Cloning into 'stat359'...
remote: Enumerating objects: 196, done.
remote: Counting objects: 100% (104/104), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 196 (delta 78), reused 51 (delta 40), pack-reused 92 (from 1)
Receiving objects: 100% (196/196), 1.75 MiB | 4.72 MiB/s, done.
Resolving deltas: 100% (114/114), done.


In [26]:
%cd stat359/student/Assignment_3
!ls

/content/stat359/student/Assignment_3
handout.html	      train_sentiment_bert_classifier.py
handout.ipynb	      train_sentiment_gpt_classifier.py
open_questions.ipynb  train_sentiment_gru_classifier.py
README.md	      train_sentiment_rnn_classifier.py


In [3]:
pip -q install "datasets<4.0.0"


In [4]:
!pip install -q numpy pandas gensim torch scikit-learn matplotlib ipywidgets nltk tqdm


In [6]:

print("\n========== Loading Dataset ==========")
from datasets import load_dataset

dataset = load_dataset('financial_phrasebank', 'sentences_50agree', trust_remote_code=True)
print("Dataset loaded. Example:", dataset['train'][:5])


Dataset loaded. Example: {'sentence': ['According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .', 'The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported .', 'With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .', "According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of 

###  Dataset Description

The dataset consists of roughly **4,840 English sentences** extracted from financial news articles (the `sentences_50agree` subset used here has 4,846 rows, as the class counts below show).  
Each sentence is labeled as **positive**, **neutral**, or **negative**, with annotations provided by 5 to 8 human annotators to ensure labeling consistency.  

This assignment uses the `'sentences_50agree'` subset, where at least 50% of annotators agreed on the sentiment.

###  Class Imbalance

The dataset has an **imbalanced class distribution**:

| Sentiment | Count |
|-----------|-------|
| Negative  | 604   |
| Neutral   | 2879  |
| Positive  | 1363  |

When working with an imbalanced dataset:

- **Accuracy** can be misleading in this setting.
- You must pass class weights to your loss function (e.g., `nn.CrossEntropyLoss(weight=...)`) to mitigate the imbalance.
- The primary evaluation metric will be the **macro-averaged F1 score**, which treats all classes equally regardless of frequency.
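
To make the three points above concrete, here is a minimal pure-Python sketch that uses only the class counts from the table. It computes inverse-frequency class weights (one common choice, not the only valid one) and compares accuracy with macro F1 for a baseline that always predicts the majority class. The assumption that label ids 0/1/2 correspond to negative/neutral/positive is consistent with the class proportions printed by the split cell, but check it against the dataset features before relying on it.

```python
# Class counts taken from the table above (negative, neutral, positive).
counts = {"negative": 604, "neutral": 2879, "positive": 1363}
total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weights, rescaled so that they sum to the number of classes.
inv = {c: 1.0 / n for c, n in counts.items()}
scale = num_classes / sum(inv.values())
weights = {c: w * scale for c, w in inv.items()}
# In PyTorch these would be passed in label-id order, e.g.
# nn.CrossEntropyLoss(weight=torch.tensor([w_neg, w_neu, w_pos])).


def f1(tp, fp, fn):
    """F1 for one class; taken as 0 when there are no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# Baseline that always predicts the majority class.
majority = max(counts, key=counts.get)
accuracy = counts[majority] / total
per_class_f1 = []
for c, n in counts.items():
    if c == majority:
        per_class_f1.append(f1(tp=n, fp=total - n, fn=0))
    else:
        per_class_f1.append(f1(tp=0, fp=0, fn=n))
macro_f1 = sum(per_class_f1) / num_classes

print({c: round(w, 3) for c, w in weights.items()})
print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}")
```

The baseline reaches about 0.59 accuracy but only about 0.25 macro F1, which is why accuracy alone is not a reliable target here.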

### Train/Validation/Test Splits

The dataset does **not** come with predefined splits.

You must split it yourself using **stratified sampling** to preserve class proportions in each subset.

For a fair comparison and to stay consistent with the other model scripts, use the following split procedure:

- First, create a **test set (15%)** and a **train+validation set (85%)** using stratified sampling on the original labels.
- Then, split the **train+validation set** into **training (85%)** and **validation (15%)** using stratified sampling on the train+validation labels.
- Use a fixed random seed (e.g., 42) so results are reproducible.

This ensures consistent and representative evaluation, especially in the presence of class imbalance.
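
As a quick sanity check on the two-stage procedure, the expected subset sizes can be worked out by hand. The sketch below assumes the held-out size is rounded up at each stage (the usual behaviour of scikit-learn-style splitters); under that assumption the result agrees with the sizes printed by the split cell in this handout.

```python
import math

n_total = 604 + 2879 + 1363  # class counts from the table above

# Assumption: the held-out portion is rounded up at each stage.
n_test = math.ceil(0.15 * n_total)
n_trainval = n_total - n_test
n_val = math.ceil(0.15 * n_trainval)
n_train = n_trainval - n_val

print(n_train, n_val, n_test)
for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n / n_total:.2%} of all sentences")
```

Because the second split is taken from the remaining 85%, the effective proportions are roughly 72% / 13% / 15%, not 70% / 15% / 15%.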

In [15]:
# Train/Validation/Test Splits (stratified)
SEED = 42

# If the dataset is a DatasetDict with only "train", split it:
full = dataset["train"]  # contains columns like "sentence" and "label"

# 1) Test = 15%, Train+Val = 85% (stratified)
tmp = full.train_test_split(
    test_size=0.15,
    seed=SEED,
    stratify_by_column="label"
)
trainval_ds = tmp["train"]
test_ds = tmp["test"]

# 2) From Train+Val, Validation = 15%, Training = 85% (stratified)
tmp2 = trainval_ds.train_test_split(
    test_size=0.15,
    seed=SEED,
    stratify_by_column="label"
)
train_ds = tmp2["train"]
val_ds = tmp2["test"]

print(len(train_ds), len(val_ds), len(test_ds))
import collections

def show_stats(ds, name):
    labels = ds["label"]
    counter = collections.Counter(labels)
    total = len(labels)

    print(f"\n{name} - total: {total}")
    for k in sorted(counter):
        print(f"  class {k}: {counter[k]} ({counter[k]/total:.3%})")

show_stats(train_ds, "TRAIN")
show_stats(val_ds, "VALIDATION")
show_stats(test_ds, "TEST")


3501 618 727

TRAIN - total: 3501
  class 0: 436 (12.454%)
  class 1: 2080 (59.412%)
  class 2: 985 (28.135%)

VALIDATION - total: 618
  class 0: 77 (12.460%)
  class 1: 367 (59.385%)
  class 2: 174 (28.155%)

TEST - total: 727
  class 0: 91 (12.517%)
  class 1: 432 (59.422%)
  class 2: 204 (28.061%)


## Your Tasks

Before you begin, please follow these best practices in your implementation:

- Set **random seeds** to ensure reproducibility  
- Use `torch.save()` to save your **best-performing model**  
- Modularize your code into **reusable functions or classes**

You are encouraged to experiment with different **neural network architectures**, **hyperparameters**, **optimizers**, **regularization** (e.g., dropout, weight decay), and **learning-rate scheduling**, as long as your final model meets the required **macro F1 score threshold** for each task.
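
Learning-rate scheduling is one of the options listed above. The following is a minimal sketch of one possible setup, `ReduceLROnPlateau` driven by validation macro F1; the stand-in model, the hyperparameter values, and the validation scores are all made up for illustration and are not a recommended configuration.

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 3)  # stand-in for the MLP or LSTM classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# mode="max" because the monitored quantity (validation macro F1) should increase.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

# Made-up validation scores, used only to exercise the scheduler.
fake_val_f1 = [0.50, 0.40, 0.40, 0.40, 0.40]
lrs = []
for val_f1 in fake_val_f1:
    # ... one epoch of training and validation would run here ...
    scheduler.step(val_f1)  # halves the LR once the score stalls for more than `patience` epochs
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

Logging the learning rate next to the loss and F1 curves makes it easier to see whether a drop in the rate lines up with a change in the validation trend.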

**Implementation format:** Task 1 and Task 2 must be implemented as Python scripts (not notebooks). Name them as specified below.

### Task 1: MLP with Mean-Pooled FastText Sentence Embedding **(25 points)**

Create a script named `train_sentiment_mlp_classifier.py` and complete the following:

- Load **pretrained FastText embeddings** using Gensim.
- Tokenize each sentence and compute the **mean of its word vectors** to obtain a fixed-size (300-dimensional) sentence embedding.
- Use a **Multi-Layer Perceptron (MLP)** to classify the sentence embedding.
- Handle **class imbalance** using `nn.CrossEntropyLoss(weight=...)`.
- Track and report the following metrics:
  - **Loss**
  - **Accuracy**
  - **Macro F1 Score**

#### Performance Requirement:
Your model must achieve a **Test Macro F1 Score >= 0.65**

### Task 2: LSTM with Padded FastText Word Vectors **(25 points)**

Create a script named `train_sentiment_lstm_classifier.py` and complete the following:

- Tokenize each sentence into word tokens and retrieve the corresponding **FastText word vectors**.
- **Pad or truncate** each sentence to exactly **32 tokens**.
- Construct a tensor of shape **(32, 300)** for each sentence (300 = embedding dimension).
- **Do not use** `nn.Embedding`; instead, **precompute and batch** the word vectors directly.
- Pass the sequences into an **LSTM model** and classify using the **final hidden state**.
- Use `nn.CrossEntropyLoss(weight=...)` and evaluate using **macro-averaged F1 score**.

#### Performance Requirement:
Your model must achieve a **Test Macro F1 Score >= 0.70**
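
The data flow for Task 2 can be sketched as follows. This is an illustrative outline under stated assumptions, not a reference solution: `pad_or_truncate` and `LSTMClassifier` are hypothetical names, the lookup table holds random vectors in place of real FastText embeddings, unknown words are simply skipped, and the sequences are packed so that the final hidden state comes from the last real token and not from the padding.

```python
import torch
import torch.nn as nn

MAX_LEN = 32   # fixed sequence length required by Task 2
EMB_DIM = 300  # FastText embedding dimension


def pad_or_truncate(tokens, lookup, max_len=MAX_LEN, emb_dim=EMB_DIM):
    """Return a (max_len, emb_dim) tensor plus the number of real (unpadded) rows."""
    vecs = [torch.as_tensor(lookup[t], dtype=torch.float32)
            for t in tokens if t in lookup][:max_len]  # skip unknown words, then truncate
    out = torch.zeros(max_len, emb_dim)  # zero rows act as padding
    if vecs:
        out[: len(vecs)] = torch.stack(vecs)
    return out, len(vecs)


class LSTMClassifier(nn.Module):
    def __init__(self, emb_dim=EMB_DIM, hidden_dim=128, num_classes=3, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, lengths):
        # Packing needs CPU lengths of at least 1, even for all-padding inputs.
        lengths = lengths.clamp(min=1).cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths, batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        return self.fc(self.dropout(h_n[-1]))  # final hidden state -> class logits


# Toy lookup standing in for the FastText vectors (random values, not real embeddings).
torch.manual_seed(0)
lookup = {w: torch.randn(EMB_DIM) for w in ["profit", "rose", "sharply", "fell"]}
sentences = [["profit", "rose", "sharply"], ["profit", "fell"] * 20]  # 3 and 40 tokens

pairs = [pad_or_truncate(s, lookup) for s in sentences]
x = torch.stack([p[0] for p in pairs])         # (batch, 32, 300)
lengths = torch.tensor([p[1] for p in pairs])  # real lengths before padding

model = LSTMClassifier().eval()
with torch.no_grad():
    logits = model(x, lengths)
print(tuple(x.shape), lengths.tolist(), tuple(logits.shape))
```

A simpler variant feeds the padded tensor to the LSTM directly and reads `h_n[-1]`; in that case the hidden state has also processed the zero padding, which is a design choice worth discussing in the reflection questions.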


In [20]:
%%writefile train_sentiment_mlp_classifier.py
#!/usr/bin/env python3
# train_sentiment_mlp_classifier.py

import os
import re
import random
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from datasets import load_dataset
import gensim.downloader as api

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt


# -----------------------------
# Config
# -----------------------------
SEED = 42
BATCH_SIZE = 64

MIN_EPOCHS = 30          # must train at least 30 epochs
MAX_EPOCHS = 60          # you may train longer
PATIENCE = 10            # early stop patience (only active after epoch >= MIN_EPOCHS)

LR = 1e-3
WEIGHT_DECAY = 1e-4
HIDDEN_DIM = 256
DROPOUT = 0.3
EMB_DIM = 300
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

ART_DIR = "artifacts/task1"
CKPT_DIR = "checkpoints"
BEST_PATH = os.path.join(CKPT_DIR, "best_mlp_fasttext.pt")


# -----------------------------
# Utilities
# -----------------------------
def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


_token_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[0-9]+")  # simple tokenizer


def tokenize(text: str):
    return _token_re.findall(text.lower())


def mean_pool_fasttext(tokens, ft_model):
    """Mean of word vectors (300-d). If no known tokens, return zeros."""
    vecs = []
    for w in tokens:
        if w in ft_model:
            vecs.append(ft_model[w])
    if len(vecs) == 0:
        return np.zeros((EMB_DIM,), dtype=np.float32)
    return np.mean(np.stack(vecs, axis=0), axis=0).astype(np.float32)


def save_curve_plot(train_vals, val_vals, ylabel, title, outpath):
    epochs = np.arange(1, len(train_vals) + 1)
    plt.figure()
    plt.plot(epochs, train_vals, marker="o", label="train")
    plt.plot(epochs, val_vals, marker="o", label="val")
    plt.xlabel("Epoch")
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(outpath, dpi=200)
    plt.close()


def save_confusion_matrix(y_true, y_pred, class_names, outpath):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(class_names))))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)

    fig, ax = plt.subplots()
    disp.plot(ax=ax, values_format="d", colorbar=True)
    ax.set_title("Confusion Matrix (Test)")
    plt.tight_layout()
    plt.savefig(outpath, dpi=200)
    plt.close()
    return cm


@torch.no_grad()
def predict_all(model, loader, device):
    model.eval()
    all_preds, all_y = [], []
    for xb, yb in loader:
        xb = xb.to(device)
        logits = model(xb)
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        all_preds.append(preds)
        all_y.append(yb.numpy())
    return np.concatenate(all_y), np.concatenate(all_preds)


# -----------------------------
# Dataset + Model
# -----------------------------
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, ft_model):
        self.texts = texts
        self.labels = labels
        self.ft_model = ft_model

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        sent = self.texts[idx]
        y = int(self.labels[idx])
        tokens = tokenize(sent)
        x = mean_pool_fasttext(tokens, self.ft_model)  # (300,)
        return torch.from_numpy(x), torch.tensor(y, dtype=torch.long)


class MLPClassifier(nn.Module):
    def __init__(self, in_dim=300, hidden_dim=256, num_classes=3, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)


@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    losses = []
    all_preds, all_y = [], []
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)
        loss = criterion(logits, yb)
        losses.append(loss.item())

        preds = torch.argmax(logits, dim=1)
        all_preds.append(preds.cpu().numpy())
        all_y.append(yb.cpu().numpy())

    y_true = np.concatenate(all_y)
    y_pred = np.concatenate(all_preds)

    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    return float(np.mean(losses)), acc, macro_f1


def compute_class_weights(labels, num_classes=3):
    counts = np.bincount(labels, minlength=num_classes).astype(np.float32)
    weights = 1.0 / np.maximum(counts, 1.0)
    weights = weights * (num_classes / weights.sum())  # normalize (optional)
    return torch.tensor(weights, dtype=torch.float32)


# -----------------------------
# Main
# -----------------------------
def main():
    set_seed(SEED)
    os.makedirs(ART_DIR, exist_ok=True)
    os.makedirs(CKPT_DIR, exist_ok=True)

    # 1) Load dataset
    dataset = load_dataset("financial_phrasebank", "sentences_50agree", trust_remote_code=True)
    full = dataset["train"]

    # 2) Stratified Train/Val/Test split
    tmp = full.train_test_split(test_size=0.15, seed=SEED, stratify_by_column="label")
    trainval = tmp["train"]
    test_ds_hf = tmp["test"]

    tmp2 = trainval.train_test_split(test_size=0.15, seed=SEED, stratify_by_column="label")
    train_ds_hf = tmp2["train"]
    val_ds_hf = tmp2["test"]

    # 3) Load pretrained FastText via Gensim (downloads on first run)
    ft = api.load("fasttext-wiki-news-subwords-300")

    # 4) Wrap into torch datasets/loaders
    train_ds = SentimentDataset(train_ds_hf["sentence"], train_ds_hf["label"], ft)
    val_ds = SentimentDataset(val_ds_hf["sentence"], val_ds_hf["label"], ft)
    test_ds = SentimentDataset(test_ds_hf["sentence"], test_ds_hf["label"], ft)

    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False)
    test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False)

    # 5) Model + class-weighted loss
    class_w = compute_class_weights(np.array(train_ds_hf["label"]), num_classes=3).to(DEVICE)
    criterion = nn.CrossEntropyLoss(weight=class_w)

    model = MLPClassifier(in_dim=EMB_DIM, hidden_dim=HIDDEN_DIM, num_classes=3, dropout=DROPOUT).to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

    # 6) Train loop + metric tracking + checkpoint + early stopping after epoch 30
    history = {
        "train_loss": [], "train_acc": [], "train_f1": [],
        "val_loss": [], "val_acc": [], "val_f1": []
    }

    best_val_f1 = -1.0
    best_epoch = -1
    epochs_since_improve = 0

    for epoch in range(1, MAX_EPOCHS + 1):
        model.train()
        train_losses = []
        all_preds, all_y = [], []

        for xb, yb in train_loader:
            xb, yb = xb.to(DEVICE), yb.to(DEVICE)

            optimizer.zero_grad(set_to_none=True)
            logits = model(xb)
            loss = criterion(logits, yb)
            loss.backward()
            optimizer.step()

            train_losses.append(loss.item())
            preds = torch.argmax(logits, dim=1)
            all_preds.append(preds.detach().cpu().numpy())
            all_y.append(yb.detach().cpu().numpy())

        y_true = np.concatenate(all_y)
        y_pred = np.concatenate(all_preds)

        train_loss = float(np.mean(train_losses))
        train_acc = accuracy_score(y_true, y_pred)
        train_f1 = f1_score(y_true, y_pred, average="macro")

        val_loss, val_acc, val_f1 = evaluate(model, val_loader, criterion, DEVICE)

        history["train_loss"].append(train_loss)
        history["train_acc"].append(train_acc)
        history["train_f1"].append(train_f1)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
        history["val_f1"].append(val_f1)

        print(
            f"Epoch {epoch:02d}/{MAX_EPOCHS} | "
            f"train loss {train_loss:.4f} acc {train_acc:.4f} f1 {train_f1:.4f} | "
            f"val loss {val_loss:.4f} acc {val_acc:.4f} f1 {val_f1:.4f}"
        )

        # Save best model by validation macro-F1
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            best_epoch = epoch
            epochs_since_improve = 0
            torch.save(
                {
                    "model_state_dict": model.state_dict(),
                    "best_val_f1": best_val_f1,
                    "best_epoch": best_epoch,
                    "class_weights": class_w.detach().cpu(),
                    "history": history,
                },
                BEST_PATH,
            )
        else:
            epochs_since_improve += 1

        # Early stopping ONLY allowed after epoch >= MIN_EPOCHS
        if epoch >= MIN_EPOCHS and epochs_since_improve >= PATIENCE:
            print(f"Early stopping at epoch {epoch} (best epoch {best_epoch}).")
            break

    # 7) Save plots to disk
    save_curve_plot(history["train_loss"], history["val_loss"],
                    ylabel="Loss", title="Loss vs Epochs",
                    outpath=os.path.join(ART_DIR, "loss_vs_epoch.png"))

    save_curve_plot(history["train_acc"], history["val_acc"],
                    ylabel="Accuracy", title="Accuracy vs Epochs",
                    outpath=os.path.join(ART_DIR, "acc_vs_epoch.png"))

    save_curve_plot(history["train_f1"], history["val_f1"],
                    ylabel="Macro F1", title="Macro F1 vs Epochs",
                    outpath=os.path.join(ART_DIR, "macro_f1_vs_epoch.png"))

    # 8) Final test eval with best checkpoint + confusion matrix
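    # The checkpoint stores metric history alongside the weights, so load the full
    # pickle explicitly; this is safe here because the file was written by this script.
    ckpt = torch.load(BEST_PATH, map_location=DEVICE, weights_only=False)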
    model.load_state_dict(ckpt["model_state_dict"])

    test_loss, test_acc, test_f1 = evaluate(model, test_loader, criterion, DEVICE)
    print("\nBEST VAL MACRO-F1:", ckpt["best_val_f1"], "at epoch", ckpt["best_epoch"])
    print(f"TEST | loss {test_loss:.4f} acc {test_acc:.4f} macro-f1 {test_f1:.4f}")
    print("Saved best model to:", BEST_PATH)

    y_true_test, y_pred_test = predict_all(model, test_loader, DEVICE)

    # If your assignment defines a label order, match it here.
    class_names = ["negative", "neutral", "positive"]
    cm = save_confusion_matrix(
        y_true_test, y_pred_test,
        class_names=class_names,
        outpath=os.path.join(ART_DIR, "confusion_matrix_test.png")
    )
    np.savetxt(os.path.join(ART_DIR, "confusion_matrix_test.txt"), cm, fmt="%d")


if __name__ == "__main__":
    main()


Overwriting train_sentiment_mlp_classifier.py


In [21]:
!python train_sentiment_mlp_classifier.py


Epoch 01/60 | train loss 1.0588 acc 0.4616 f1 0.3999 | val loss 1.0470 acc 0.5210 f1 0.3647
Epoch 02/60 | train loss 0.9860 acc 0.5573 f1 0.4415 | val loss 0.9518 acc 0.5372 f1 0.4142
Epoch 03/60 | train loss 0.9098 acc 0.5807 f1 0.4862 | val loss 0.9117 acc 0.5890 f1 0.5029
Epoch 04/60 | train loss 0.8589 acc 0.6073 f1 0.5442 | val loss 0.8612 acc 0.5324 f1 0.5065
Epoch 05/60 | train loss 0.8131 acc 0.6281 f1 0.5655 | val loss 0.8115 acc 0.6392 f1 0.5519
Epoch 06/60 | train loss 0.7705 acc 0.6524 f1 0.5995 | val loss 0.7748 acc 0.6634 f1 0.5924
Epoch 07/60 | train loss 0.7378 acc 0.6672 f1 0.6209 | val loss 0.7555 acc 0.6893 f1 0.6297
Epoch 08/60 | train loss 0.7131 acc 0.6695 f1 0.6265 | val loss 0.7326 acc 0.6909 f1 0.6268
Epoch 09/60 | train loss 0.6724 acc 0.6858 f1 0.6458 | val loss 0.7319 acc 0.6327 f1 0.5855
Epoch 10/60 | train loss 0.6705 acc 0.6935 f1 0.6549 | val loss 0.7168 acc 0.6343 f1 0.6060
Epoch 11/60 | train loss 0.6444 acc 0.7067 f1 0.6704 | val loss 0.6902 acc 0.663

In [29]:
from google.colab import files
files.download("artifacts/task1/loss_vs_epoch.png")
files.download("artifacts/task1/acc_vs_epoch.png")
files.download("artifacts/task1/macro_f1_vs_epoch.png")
files.download("artifacts/task1/confusion_matrix_test.png")
files.download("checkpoints/best_mlp_fasttext.pt")




###  Evaluation Requirements **(10 points)**

For **both models (MLP and LSTM)**, you must:

- Train for **at least 30 epochs**. You may train longer and select the best checkpoint based on validation performance (early stopping is allowed **after** epoch 30).
- Track and plot the following metrics for **both training and validation** sets:
  - **Loss vs. Epochs**
  - **Accuracy vs. Epochs**
  - **Macro F1 Score vs. Epochs**

Plotting both training and validation curves helps you identify potential issues like **underfitting** or **overfitting**.

- After training, evaluate your model on the **test set** and report the **confusion matrix**.
- Save plots (training/validation curves and confusion matrix) to disk from your **.py scripts** so they can be embedded in `open_questions.ipynb`.
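
The rule "early stopping is allowed only after epoch 30" can be isolated in a small helper, which also keeps the training loop modular. This is a minimal sketch; the class name `EarlyStopper` and the validation curve are made up for illustration, and the default patience of 10 mirrors the value used in the MLP script shown earlier.

```python
class EarlyStopper:
    """Track the best validation score and decide when training may stop."""

    def __init__(self, min_epochs=30, patience=10):
        self.min_epochs = min_epochs
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Return (improved, should_stop) for a 1-based epoch index."""
        improved = score > self.best_score
        if improved:
            self.best_score, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        # Stopping is only allowed once the minimum number of epochs has been trained.
        should_stop = epoch >= self.min_epochs and self.bad_epochs >= self.patience
        return improved, should_stop


# Made-up validation macro-F1 curve: improves until epoch 5, then plateaus.
stopper = EarlyStopper(min_epochs=30, patience=10)
stopped_at = None
for epoch in range(1, 61):
    val_f1 = min(epoch, 5) * 0.1
    improved, stop = stopper.update(epoch, val_f1)
    # if improved: save a checkpoint here (e.g. with torch.save)
    if stop:
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)
```

In this toy run the best epoch is 5, yet training continues until epoch 30 because the minimum-epoch requirement takes precedence over patience.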



## Provided Models (Required) **(12 points)**

The following scripts are provided to support comparison between classical baselines and fine-tuned LLMs:

- **`train_sentiment_rnn_classifier.py`** - Sentiment classifier using a basic RNN architecture  
- **`train_sentiment_gru_classifier.py`** - Sentiment classifier using a GRU architecture  
- **`train_sentiment_bert_classifier.py`** - Sentiment classifier using a BERT-based model  
- **`train_sentiment_gpt_classifier.py`** - Sentiment classifier using a GPT-based model  

You must run these models and include their results in your analysis (metrics, plots, and a brief comparison). BERT and GPT are pretrained LLMs that you will **fine-tune** for classification using these scripts. These scripts are **not** submissions and may use different training settings (e.g., fewer epochs).


## Open-Ended Reflection Questions **(23 points)**

After completing your implementations and running all provided scripts, use the notebook named `open_questions.ipynb` to address the following questions. You may **include plots** from your training scripts in the notebook output to justify your answers.

### 1. Training Dynamics
*Focus on your MLP and LSTM implementations*

- Did your models show signs of **overfitting** or **underfitting**? What architectural or training changes could address this?
- How did using **class weights** affect training stability and final performance?

### 2. Model Performance and Error Analysis
*Focus on your MLP and LSTM implementations*

- Which of your two models **generalized better** to the test set? Provide evidence from your metrics.
- Which **sentiment class** was most frequently misclassified? Propose reasons for this pattern.
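
One way to answer the misclassification question is to derive per-class recall and precision from the test confusion matrix. The sketch below uses a small made-up matrix purely for illustration; it does not reflect any real model's results. For your own analysis, the matrix written by your script (for example `confusion_matrix_test.txt` from the MLP script) could be loaded in its place.

```python
class_names = ["negative", "neutral", "positive"]

# Made-up confusion matrix (rows = true class, columns = predicted class).
cm = [
    [8, 1, 1],
    [2, 15, 3],
    [1, 4, 5],
]

recall = {}
for i, name in enumerate(class_names):
    row_total = sum(cm[i])  # all samples whose true class is `name`
    recall[name] = cm[i][i] / row_total if row_total else 0.0

precision = {}
for j, name in enumerate(class_names):
    col_total = sum(row[j] for row in cm)  # all samples predicted as `name`
    precision[name] = cm[j][j] / col_total if col_total else 0.0

worst = min(recall, key=recall.get)  # class whose true samples are missed most often
for name in class_names:
    print(f"{name:>8}: recall={recall[name]:.3f} precision={precision[name]:.3f}")
print("lowest recall:", worst)
```

Low recall means true members of a class are often assigned elsewhere, while low precision means other classes are often mistaken for it; reporting both makes the error pattern easier to explain.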

### 3. Cross-Model Comparison
*Compare all six models: MLP, RNN, LSTM, GRU, BERT, GPT*

- How did **mean-pooled FastText embeddings** limit the MLP compared to sequence-based models?
- What advantage did the LSTM's **sequential processing** provide over the MLP?
- Did **fine-tuned LLMs** (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.
- **Rank all six models** by test performance. What architectural or representational factors explain the ranking?


## AI Use Disclosure **(5 points)**

Complete the **AI Use Disclosure** section in `open_questions.ipynb`. This item is graded separately.


## Deliverables

You must submit the following files:

1. `train_sentiment_mlp_classifier.py`  
   Implementation of **Task 1** using an MLP with **mean-pooled FastText sentence embeddings**.

2. `train_sentiment_lstm_classifier.py`  
   Implementation of **Task 2** using an LSTM with **padded/truncated FastText word vectors** (32 tokens per sentence).

3. `outputs/` containing PNGs for loss/accuracy/F1 curves and confusion matrices for **all models you ran** (MLP, LSTM, RNN, GRU, BERT, GPT).

4. `open_questions.ipynb` and `open_questions.html`  
   Your written responses to the **open-ended questions** related to modeling choices, performance comparisons, and reflections. The HTML must include the **plots embedded in the notebook output**, plus your **AI Use Disclosure**.

### Submission Instructions

- Submit `open_questions.html` to **Canvas**.
- Push **all `.py`, `.ipynb`, `.html`, and `outputs/` files** to your **GitHub repository**.
- Make sure the `.html` file contains **both code and output** so it can be viewed without rerunning the notebook.
