### <font color='blue'> Due 11:59pm, Monday Feb 12th 2026</font>

**Purpose / learning goals:**
- Practice training neural models in PyTorch with emphasis on optimizers, regularization, and learning-rate scheduling to meet a performance threshold.
- Use sentiment classification as a downstream task to compare classical neural baselines with fine-tuned pretrained LLMs (BERT/GPT).

**Runtime / setup notes:**
- This assignment does not require a GPU to train the models. Using a GPU (or Apple MPS) will usually speed up training for the transformer models.
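
As a minimal sketch of the device note above, a script could pick the fastest available backend and fall back to the CPU. The helper name `pick_device` is hypothetical and not taken from the provided scripts.

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, and fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # getattr guards against older PyTorch builds without the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
print(f"Using device: {device}")
```

Models and batches would then be moved with `.to(device)` before training.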

In this assignment, you will:
- Implement MLP and LSTM classifiers (your code)
- Run provided scripts for RNN, GRU, BERT, and GPT (for comparison)

**Implementation format:** Task 1 and Task 2 must be implemented as Python scripts (not notebooks). The open-ended questions are answered in a notebook.

To motivate the transformer architecture, scripts are provided for pretrained state-of-the-art models such as **GPT** (decoder-only) and **BERT** (encoder-only). You should run these scripts yourself to obtain results for comparison and reflection.

*Please read the `README.md` file before proceeding.*


##  Sentiment Classification: Classical Nets vs. LLMs

Sentiment classification is a common **downstream task** for evaluating how well pretrained LLMs adapt to a domain via fine-tuning, compared against classical neural baselines.

In this assignment, you'll explore how different neural architectures perform on sentiment classification:

- **Classical approaches:** MLP, RNN, LSTM, GRU (using static FastText embeddings)
- **Pretrained LLMs:** BERT and GPT (fine-tuned using Hugging Face Transformers)

You will implement MLP and LSTM yourself; scripts are provided for the remaining models.

Detailed requirements for your implementations are listed in **Your Tasks** below.


##  Dataset: Financial PhraseBank

This assignment uses the **Financial PhraseBank** dataset, developed by  
Mika V. Mäntylä, Graziella Linders, Tanja Suominen, and Miikka Kuutila.

- 📂 Dataset homepage: [Hugging Face – Financial_PhraseBank](https://huggingface.co/datasets/takala/financial_phrasebank)  
- 📄 Original paper:  
  P. Malo, A. Sinha, et al. (2014). [*“Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts”*](https://arxiv.org/pdf/1307.5336)

You can load and preview the dataset using the following code:

In [1]:
!git clone https://github.com/Anushka-De/stat359.git

Cloning into 'stat359'...
remote: Enumerating objects: 235, done.
remote: Counting objects: 100% (143/143), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 235 (delta 103), reused 50 (delta 40), pack-reused 92 (from 1)
Receiving objects: 100% (235/235), 2.81 MiB | 12.97 MiB/s, done.
Resolving deltas: 100% (139/139), done.


In [2]:
%cd stat359/student/Assignment_3
!ls

/content/stat359/student/Assignment_3
acc_vs_epoch.png	   README.md
best_mlp_fasttext.pt	   train_sentiment_bert_classifier.py
confusion_matrix_test.png  train_sentiment_gpt_classifier.py
handout.html		   train_sentiment_gru_classifier.py
handout.ipynb		   train_sentiment_lstm_classifier.py
loss_vs_epoch.png	   train_sentiment_mlp_classifier.py
macro_f1_vs_epoch.png	   train_sentiment_rnn_classifier.py
open_questions.ipynb


In [3]:
!pip install -q "datasets<4.0.0"


In [4]:
!pip install -q numpy pandas gensim torch scikit-learn matplotlib ipywidgets nltk tqdm

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m27.9/27.9 MB[0m [31m38.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m1.6/1.6 MB[0m [31m36.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [5]:

print("\n========== Loading Dataset ==========")
from datasets import load_dataset

dataset = load_dataset('financial_phrasebank', 'sentences_50agree', trust_remote_code=True)
print("Dataset loaded. Example:", dataset['train'][:5])





Dataset loaded. Example: {'sentence': ['According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .', 'The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported .', 'With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .', "According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of n

###  Dataset Description

The dataset consists of **4,846 English sentences** extracted from financial news articles.  
Each sentence is labeled as **positive**, **neutral**, or **negative**, with annotations provided by 5 to 8 human annotators to ensure labeling consistency.  

This assignment uses the `'sentences_50agree'` subset, where at least 50% of annotators agreed on the sentiment.

###  Class Imbalance

The dataset has an **imbalanced class distribution**:

| Sentiment | Count |
|-----------|-------|
| Negative  | 604   |
| Neutral   | 2879  |
| Positive  | 1363  |

When working with an imbalanced dataset:

- **Accuracy** can be misleading in this setting, because a model can score well by favoring the majority (neutral) class.
- You must pass class weights to your loss function (e.g., `nn.CrossEntropyLoss(weight=...)`) to mitigate the imbalance.
- The primary evaluation metric will be the **macro-averaged F1 score**, which treats all classes equally regardless of frequency.
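
To make the points above concrete, here is a small worked example in plain Python. It derives inverse-frequency class weights from the counts in the table and compares accuracy with macro F1 for a baseline that always predicts "neutral". The weighting formula `total / (num_classes * count)` is one common choice and is an assumption here, not a requirement of the assignment.

```python
# Class counts from the table above.
counts = {"negative": 604, "neutral": 2879, "positive": 1363}
total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency ("balanced") weights: total / (num_classes * count).
weights = {name: total / (num_classes * n) for name, n in counts.items()}

# Baseline that always predicts the majority class ("neutral").
accuracy = counts["neutral"] / total
precision = counts["neutral"] / total   # every prediction is "neutral"
recall = 1.0                            # every true "neutral" is recovered
f1_neutral = 2 * precision * recall / (precision + recall)
macro_f1 = (f1_neutral + 0.0 + 0.0) / num_classes  # the other two classes score 0

for name, w in weights.items():
    print(f"{name:>8}: weight = {w:.3f}")
print(f"majority baseline: accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```

The baseline reaches about 0.59 accuracy but only about 0.25 macro F1, which is why macro F1 is the primary metric. In a training script the weights would be converted to a float tensor, ordered by label id, and passed as `weight=` to `nn.CrossEntropyLoss`. Check the dataset's label mapping before fixing that order.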

### Train/Validation/Test Splits

The dataset does **not** come with predefined splits.

You must split it yourself using **stratified sampling** to preserve class proportions in each subset.

For a fair comparison and to stay consistent with the other model scripts, use the following split procedure:

- First, create a **test set (15%)** and a **train+validation set (85%)** using stratified sampling on the original labels.
- Then, split the **train+validation set** into **training (85%)** and **validation (15%)** using stratified sampling on the train+validation labels.
- Use a fixed random seed (e.g., 42) so results are reproducible.

This ensures consistent and representative evaluation, especially in the presence of class imbalance.
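
The two-step procedure above can be sketched with scikit-learn's `train_test_split`. The label array below is a stand-in built from the class counts in the table (the 0/1/2 encoding is an assumption for this sketch). A real script would split the actual sentences and labels instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42

# Stand-in labels with the class counts from the table above
# (0 = negative, 1 = neutral, 2 = positive is an assumption for this sketch).
labels = np.array([0] * 604 + [1] * 2879 + [2] * 1363)
indices = np.arange(len(labels))

# Step 1: hold out 15% as the test set, stratified on the original labels.
trainval_idx, test_idx = train_test_split(
    indices, test_size=0.15, stratify=labels, random_state=SEED
)

# Step 2: split the remaining 85% into train (85%) and validation (15%),
# stratified on the train+validation labels.
train_idx, val_idx = train_test_split(
    trainval_idx, test_size=0.15, stratify=labels[trainval_idx], random_state=SEED
)

print(f"Train: {len(train_idx)}, Val: {len(val_idx)}, Test: {len(test_idx)}")
```

Splitting index arrays rather than the data itself makes it easy to reuse the same partition for sentences, embeddings, and labels.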

In [6]:
import random
random.seed(42)

## Your Tasks

Before you begin, please follow these best practices in your implementation:

- Set **random seeds** to ensure reproducibility  
- Use `torch.save()` to save your **best-performing model**  
- Modularize your code into **reusable functions or classes**
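
The first two practices might look like the sketch below. The helper names `set_seed` and `save_if_best`, the placeholder `nn.Linear` model, and the validation scores are all hypothetical and only illustrate the pattern.

```python
import os
import random
import tempfile

import numpy as np
import torch
from torch import nn


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


def save_if_best(model: nn.Module, val_f1: float, best_f1: float, path: str) -> float:
    """Save the model weights when val_f1 improves; return the running best."""
    if val_f1 > best_f1:
        torch.save(model.state_dict(), path)
        return val_f1
    return best_f1


set_seed(42)
model = nn.Linear(4, 3)  # placeholder model for the sketch
ckpt_path = os.path.join(tempfile.mkdtemp(), "best_model.pt")

best_f1 = float("-inf")
for val_f1 in [0.36, 0.41, 0.40]:  # made-up validation scores
    best_f1 = save_if_best(model, val_f1, best_f1, ckpt_path)

# Reload the best checkpoint before evaluating on the test set.
model.load_state_dict(torch.load(ckpt_path))
print(f"best validation macro F1: {best_f1}")
```

Saving the `state_dict` rather than the whole module keeps the checkpoint independent of how the class is defined at load time.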

You are encouraged to experiment with different **neural network architectures**, **hyperparameters**, **optimizers**, **regularization** (e.g., dropout, weight decay), and **learning-rate scheduling**, as long as your final model meets the required **macro F1 score threshold** for each task.
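
As one possible combination of these knobs, the sketch below pairs dropout, AdamW weight decay, and a `ReduceLROnPlateau` scheduler that watches validation macro F1. The layer sizes, hyperparameter values, random batch, and plateaued F1 scores are placeholders chosen for illustration, not recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Toy stand-in for a sentence-embedding classifier (300-d input, 3 classes).
model = nn.Sequential(
    nn.Linear(300, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # dropout regularization
    nn.Linear(64, 3),
)

# AdamW applies decoupled weight decay as a second form of regularization.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Halve the learning rate when validation macro F1 stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

criterion = nn.CrossEntropyLoss()
x = torch.randn(16, 300)         # random placeholder batch
y = torch.randint(0, 3, (16,))   # random placeholder labels

fake_val_f1 = [0.50, 0.50, 0.50, 0.50]  # made-up plateau to trigger the scheduler
for epoch, val_f1 in enumerate(fake_val_f1, start=1):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(val_f1)  # the scheduler is driven by the validation metric
    lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {epoch}: loss={loss.item():.4f}, lr={lr:.6f}")
```

`mode="max"` matters here because a higher F1 is better. The default `mode="min"` is meant for metrics such as validation loss.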

**Implementation format:** Task 1 and Task 2 must be implemented as Python scripts (not notebooks). Name them as specified below.

### Task 1: MLP with Mean-Pooled FastText Sentence Embedding **(25 points)**

Create a script named `train_sentiment_mlp_classifier.py` and complete the following:

- Load **pretrained FastText embeddings** using Gensim.
- Tokenize each sentence and compute the **mean of its word vectors** to obtain a fixed-size (300-dimensional) sentence embedding.
- Use a **Multi-Layer Perceptron (MLP)** to classify the sentence embedding.
- Handle **class imbalance** using `nn.CrossEntropyLoss(weight=...)`.
- Track and report the following metrics:
  - **Loss**
  - **Accuracy**
  - **Macro F1 Score**
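
The mean-pooling step can be illustrated with NumPy alone. The toy lookup table below stands in for the pretrained FastText vectors, and both the whitespace tokenizer and the zero-vector fallback for sentences with no known words are assumptions of this sketch.

```python
import numpy as np

EMBED_DIM = 300
rng = np.random.default_rng(42)

# Toy lookup table standing in for the pretrained FastText vectors
# (a real script would query the Gensim vectors object instead).
toy_vectors = {
    w: rng.normal(size=EMBED_DIM).astype(np.float32)
    for w in ["the", "company", "profit", "rose"]
}


def sentence_embedding(sentence: str, vectors, dim: int = EMBED_DIM) -> np.ndarray:
    """Mean of the word vectors for in-vocabulary tokens; zeros if none match."""
    tokens = sentence.lower().split()  # whitespace tokenizer, an assumption
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)


emb = sentence_embedding("The company profit rose sharply", toy_vectors)
print(emb.shape)
```

Because the mean ignores word order, two sentences with the same words in a different order map to the same embedding. This is worth keeping in mind when comparing the MLP with the sequence models.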

#### Performance Requirement:
Your model must achieve a **Test Macro F1 Score >= 0.65**

### Task 2: LSTM with Padded FastText Word Vectors **(25 points)**

Create a script named `train_sentiment_lstm_classifier.py` and complete the following:

- Tokenize each sentence into word tokens and retrieve the corresponding **FastText word vectors**.
- **Pad or truncate** each sentence to exactly **32 tokens**.
- Construct a tensor of shape **(32, 300)** for each sentence (300 = embedding dimension).
- **Do not use** `nn.Embedding`; instead, **precompute and batch** the word vectors directly.
- Pass the sequences into an **LSTM model** and classify using the **final hidden state**.
- Use `nn.CrossEntropyLoss(weight=...)` and evaluate using **macro-averaged F1 score**.
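
A minimal sketch of the padding step and of classifying from the final hidden state is shown below. Random arrays stand in for the FastText word vectors, and the hidden size, zero padding at the end of the sequence, and class names are assumptions of this sketch rather than required design choices.

```python
import numpy as np
import torch
from torch import nn

MAX_LEN, EMBED_DIM, NUM_CLASSES = 32, 300, 3


def pad_or_truncate(word_vecs: np.ndarray, max_len: int = MAX_LEN) -> np.ndarray:
    """Return a (max_len, dim) array: cut long inputs, zero-pad short ones."""
    out = np.zeros((max_len, word_vecs.shape[1]), dtype=np.float32)
    n = min(len(word_vecs), max_len)
    out[:n] = word_vecs[:n]
    return out


class LSTMClassifier(nn.Module):
    def __init__(self, input_dim=EMBED_DIM, hidden_dim=64, num_classes=NUM_CLASSES):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):           # x: (batch, 32, 300)
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])     # logits from the last layer's final state


rng = np.random.default_rng(42)
short_sent = rng.normal(size=(5, EMBED_DIM)).astype(np.float32)   # 5 tokens
long_sent = rng.normal(size=(50, EMBED_DIM)).astype(np.float32)   # 50 tokens

batch = torch.from_numpy(
    np.stack([pad_or_truncate(short_sent), pad_or_truncate(long_sent)])
)
logits = LSTMClassifier()(batch)
print(batch.shape, logits.shape)
```

With padding at the end, the final hidden state of a short sentence is computed after several all-zero steps. Padding at the front or using `nn.utils.rnn.pack_padded_sequence` are alternatives worth comparing.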

#### Performance Requirement:
Your model must achieve a **Test Macro F1 Score >= 0.70**


In [7]:
!python train_sentiment_mlp_classifier.py



Using device: cuda
Dataset loaded. Example: {'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}

Train: 3501, Val: 618, Test: 727

FastText loaded.


Training setup complete.


--- Epoch 1/60 ---
Train Loss: 1.0597, Train F1: 0.3999, Train Acc: 0.4616
Val Loss: 1.0476, Val F1: 0.3647, Val Acc: 0.5210
>>> Saved new best model (Val F1: 0.3647) at epoch 1

--- Epoch 2/60 ---
Train Loss: 0.9865, Train F1: 0.4415, Train Acc: 0.5573
Val Loss: 0.9540, Val F1: 0.4142, Val Acc: 0.5372
>>> Saved new best model (Val F1: 0.4142) at epoch 2

--- Epoch 3/60 ---
Train Loss: 0.9099, Train F1: 0.4862, Train Acc: 0.5807
Val Loss: 0.9127, Val F1: 0.5029, Val Acc: 0.5890
>>> Saved new best model (Val F1: 0.5029) at epoch 3

--- Epoch 4/60 ---
Train Loss: 0.8592, Train F1: 0.5442, Train Acc: 0.6073
Val Loss: 0.8639, Val F1: 0.5065, Val Acc: 0.5324
>>> Saved new best model (Val F1: 0.5065) at epoch 4

--

In [8]:
!python train_sentiment_lstm_classifier.py


Using device: cuda
Dataset loaded. Example: {'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}

Train: 3501, Val: 618, Test: 727

FastText loaded.


Training setup complete.


--- Epoch 1/80 ---
Train Loss: 1.0584, Train F1: 0.4335, Train Acc: 0.5510
Val Loss: 1.0298, Val F1: 0.3968, Val Acc: 0.5032
>>> Saved new best model (Val F1: 0.3968) at epoch 1

--- Epoch 2/80 ---
Train Loss: 0.9981, Train F1: 0.4503, Train Acc: 0.5664
Val Loss: 1.0289, Val F1: 0.4233, Val Acc: 0.6117
>>> Saved new best model (Val F1: 0.4233) at epoch 2

--- Epoch 3/80 ---
Train Loss: 0.9144, Train F1: 0.4829, Train Acc: 0.5898
Val Loss: 0.9134, Val F1: 0.4632, Val Acc: 0.6100
>>> Saved new best model (Val F1: 0.4632) at epoch 3

--- Epoch 4/80 ---
Train Loss: 0.8654, Train F1: 0.5271, Train Acc: 0.6193
Val Loss: 0.8819, Val F1: 0.4713, Val Acc: 0.5405
>>> Saved new best model (Val F1: 0.4713) at epoch 4

--

In [9]:
!python train_sentiment_rnn_classifier.py


Dataset loaded. Example: {'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}

DataFrame shape: (4846, 3)

Sentence length statistics:
count    4846.000000
mean       23.101114
std         9.958474
min         2.000000
25%        16.000000
50%        21.000000
75%        29.000000
max        81.000000
Name: sentence, dtype: float64
Figure(1000x600)

modules.json: 100% 349/349 [00:00<00:00, 2.18MB/s]
config_sentence_transformers.json: 100% 116/116 [00:00<00:00, 737kB/s]
README.md: 10.5kB [00:00, 32.7MB/s]
sentence_bert_config.json: 100% 53.0/53.0 [00:00<00:00, 273kB/s]
config.json: 100% 612/612 [00:00<00:00, 3.41MB/s]
model.safetensors: 100% 90.9M/90.9M [00:00<00:00, 92.9MB/s]
Loading weights: 100% 103/103 [00:00<00:00, 1224.18it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |

In [10]:
!python train_sentiment_gru_classifier.py


Dataset loaded. Example: {'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}

DataFrame shape: (4846, 3)

Loading weights: 100% 103/103 [00:00<00:00, 3757.88it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Model loaded.

Tokenizing & encoding: 100% 4846/4846 [00:25<00:00, 191.03it/s]
X_seq shape: (4846, 32, 384), y shape: (4846,)

X_train shape: (3501, 32, 384), y_train shape: (3501,)
X_val shape: (618, 32, 384), y_val shape: (618,)
X_test shape: (727, 32, 384), y_test shape: (727,)
DataLoaders created.

Model initialized wi

In [11]:
!python train_sentiment_bert_classifier.py


Dataset loaded. Example: {'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}

DataFrame shape: (4846, 3)

tokenizer_config.json: 100% 48.0/48.0 [00:00<00:00, 278kB/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 2.22MB/s]
tokenizer.json: 100% 466k/466k [00:00<00:00, 2.50MB/s]
config.json: 100% 570/570 [00:00<00:00, 3.13MB/s]
model.safetensors: 100% 440M/440M [00:03<00:00, 147MB/s]
Loading weights: 100% 199/199 [00:00<00:00, 1482.93it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predi

In [12]:
!python train_sentiment_gpt_classifier.py


Dataset loaded. Example: {'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}

DataFrame shape: (4846, 3)

tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 151kB/s]
vocab.json: 100% 1.04M/1.04M [00:00<00:00, 3.54MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 1.96MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 4.72MB/s]
config.json: 100% 665/665 [00:00<00:00, 3.44MB/s]
model.safetensors: 100% 548M/548M [00:02<00:00, 185MB/s]
Loading weights: 100% 148/148 [00:00<00:00, 1102.26it/s, Materializing param=transformer.wte.weight]
GPT2ForSequenceClassification LOAD REPORT from: gpt2
Key                  | Status     | 
---------------------+------------+-
h.{0...11}.attn.bias | UNEXPECTED | 
score.weight         | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expec

In [13]:
!zip -r outputs.zip outputs
from google.colab import files
files.download("outputs.zip")
!zip -r outputs_task2.zip outputs_task2
files.download("outputs_task2.zip")



  adding: outputs/ (stored 0%)
  adding: outputs/bert_f1_learning_curves.png (deflated 12%)
  adding: outputs/bert_accuracy_learning_curve.png (deflated 9%)
  adding: outputs/gru_f1_learning_curves.png (deflated 10%)
  adding: outputs/rnn_accuracy_learning_curve.png (deflated 7%)
  adding: outputs/mlp_learning_curves.png (deflated 12%)
  adding: outputs/gpt_accuracy_learning_curve.png (deflated 8%)
  adding: outputs/bert_confusion_matrix.png (deflated 15%)
  adding: outputs/best_gru_model.pth (deflated 8%)
  adding: outputs/best_mlp_fasttext.pt (deflated 8%)
  adding: outputs/gpt_confusion_matrix.png (deflated 14%)
  adding: outputs/mlp_confusion_matrix.png (deflated 19%)
  adding: outputs/gpt_f1_learning_curves.png (deflated 11%)
  adding: outputs/mlp_f1_curve.png (deflated 10%)
  adding: outputs/rnn_confusion_matrix.png (deflated 15%)
  adding: outputs/mlp_accuracy_curve.png (deflated 9%)
  adding: outputs/best_bert_model.pth (deflated 7%)
  adding: outputs/best_rnn_model.pth (deflat


  adding: outputs_task2/ (stored 0%)
  adding: outputs_task2/lstm_learning_curves.png (deflated 14%)
  adding: outputs_task2/best_lstm_fasttext.pt (deflated 7%)
  adding: outputs_task2/confusion_matrix.txt (stored 0%)
  adding: outputs_task2/confusion_matrix.png (deflated 19%)
  adding: outputs_task2/acc_vs_epochs.png (deflated 11%)
  adding: outputs_task2/macro_f1_vs_epochs.png (deflated 11%)
  adding: outputs_task2/loss_vs_epochs.png (deflated 12%)



###  Evaluation Requirements **(10 points)**

For **both models (MLP and LSTM)**, you must:

- Train for **at least 30 epochs**. You may train longer and select the best checkpoint based on validation performance (early stopping is allowed **after** epoch 30).
- Track and plot the following metrics for **both training and validation** sets:
  - **Loss vs. Epochs**
  - **Accuracy vs. Epochs**
  - **Macro F1 Score vs. Epochs**

Plotting both training and validation curves helps you identify potential issues like **underfitting** or **overfitting**.
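
The 30-epoch minimum can be combined with patience-based early stopping, as in the plain-Python sketch below. The patience value and the validation curve are made up for illustration, and `should_stop` is a hypothetical helper name.

```python
MIN_EPOCHS = 30   # required minimum from the rule above
PATIENCE = 5      # hypothetical: epochs without improvement that are tolerated
MAX_EPOCHS = 60


def should_stop(epoch: int, best_epoch: int,
                min_epochs: int = MIN_EPOCHS, patience: int = PATIENCE) -> bool:
    """Stop only once min_epochs are done AND patience has run out."""
    return epoch >= min_epochs and (epoch - best_epoch) >= patience


# Made-up validation macro F1 curve: improves until epoch 20, then plateaus.
fake_val_f1 = [min(40 + e, 60) / 100 for e in range(1, MAX_EPOCHS + 1)]

best_f1, best_epoch, last_epoch = float("-inf"), 0, 0
for epoch, val_f1 in enumerate(fake_val_f1, start=1):
    last_epoch = epoch
    if val_f1 > best_f1:
        best_f1, best_epoch = val_f1, epoch  # a real script would checkpoint here
    if should_stop(epoch, best_epoch):
        break

print(f"stopped after epoch {last_epoch}; best epoch {best_epoch}")
```

In this toy run the patience budget is exhausted at epoch 25, but training continues until epoch 30 because of the minimum-epoch rule. The best checkpoint still comes from epoch 20.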

- After training, evaluate your model on the **test set** and report the **confusion matrix**.
- Save plots (training/validation curves and confusion matrix) to disk from your **.py scripts** so they can be embedded in `open_questions.ipynb`.
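
Saving figures from a script might look like the sketch below, which uses Matplotlib's headless `Agg` backend and scikit-learn's confusion-matrix helpers. The F1 history reuses the first four epochs of the MLP log shown earlier. The test labels and predictions are toy values, and the temporary output folder is a placeholder for a directory such as `outputs/`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, f1_score

out_dir = tempfile.mkdtemp()  # a real script would use a folder such as outputs/

# First four epochs of the MLP log shown earlier; a training loop would
# append one value per epoch instead.
history = {
    "train_f1": [0.3999, 0.4415, 0.4862, 0.5442],
    "val_f1": [0.3647, 0.4142, 0.5029, 0.5065],
}

epochs = range(1, len(history["train_f1"]) + 1)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(epochs, history["train_f1"], label="train")
ax.plot(epochs, history["val_f1"], label="validation")
ax.set_xlabel("Epoch")
ax.set_ylabel("Macro F1")
ax.set_title("Macro F1 vs. Epochs")
ax.legend()
curve_path = os.path.join(out_dir, "macro_f1_vs_epoch.png")
fig.savefig(curve_path, dpi=150, bbox_inches="tight")
plt.close(fig)

# Toy test labels and predictions (0 = negative, 1 = neutral, 2 = positive).
y_true = np.array([0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 1, 2, 2, 1])
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
macro_f1 = f1_score(y_true, y_pred, average="macro")

disp = ConfusionMatrixDisplay(cm, display_labels=["negative", "neutral", "positive"])
disp.plot(cmap="Blues")
cm_path = os.path.join(out_dir, "confusion_matrix_test.png")
disp.figure_.savefig(cm_path, dpi=150, bbox_inches="tight")
plt.close(disp.figure_)

print(cm)
print(f"macro F1 = {macro_f1:.3f}")
```

Rows of the matrix are true classes and columns are predicted classes, so off-diagonal entries in a row show where that class is being misclassified.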



## Provided Models (Required) **(12 points)**

The following scripts are provided to support comparison between classical baselines and fine-tuned LLMs:

- **`train_sentiment_rnn_classifier.py`** - Sentiment classifier using a basic RNN architecture  
- **`train_sentiment_gru_classifier.py`** - Sentiment classifier using a GRU architecture  
- **`train_sentiment_bert_classifier.py`** - Sentiment classifier using a BERT-based model  
- **`train_sentiment_gpt_classifier.py`** - Sentiment classifier using a GPT-based model  

You must run these models and include their results in your analysis (metrics, plots, and a brief comparison). BERT and GPT are pretrained LLMs that you will **fine-tune** for classification using these scripts. These scripts are **not** submissions and may use different training settings (e.g., fewer epochs).


## Open-Ended Reflection Questions **(23 points)**

After completing your implementations and running all provided scripts, use the notebook named `open_questions.ipynb` to address the following questions. You may **include plots** from your training scripts in the notebook output to justify your answers.

### 1. Training Dynamics
*Focus on your MLP and LSTM implementations*

- Did your models show signs of **overfitting** or **underfitting**? What architectural or training changes could address this?
- How did using **class weights** affect training stability and final performance?

### 2. Model Performance and Error Analysis
*Focus on your MLP and LSTM implementations*

- Which of your two models **generalized better** to the test set? Provide evidence from your metrics.
- Which **sentiment class** was most frequently misclassified? Propose reasons for this pattern.

### 3. Cross-Model Comparison
*Compare all six models: MLP, RNN, LSTM, GRU, BERT, GPT*

- How did **mean-pooled FastText embeddings** limit the MLP compared to sequence-based models?
- What advantage did the LSTM's **sequential processing** provide over the MLP?
- Did **fine-tuned LLMs** (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.
- **Rank all six models** by test performance. What architectural or representational factors explain the ranking?


## AI Use Disclosure **(5 points)**

Complete the **AI Use Disclosure** section in `open_questions.ipynb`. This item is graded separately.


## Deliverables

You must submit the following files:

1. `train_sentiment_mlp_classifier.py`  
   Implementation of **Task 1** using an MLP with **mean-pooled FastText sentence embeddings**.

2. `train_sentiment_lstm_classifier.py`  
   Implementation of **Task 2** using an LSTM with **padded/truncated FastText word vectors** (32 tokens per sentence).

3. `outputs/` containing PNGs for loss/accuracy/F1 curves and confusion matrices for **all models you ran** (MLP, LSTM, RNN, GRU, BERT, GPT).

4. `open_questions.ipynb` and `open_questions.html`  
   Your written responses to the **open-ended questions** related to modeling choices, performance comparisons, and reflections. The HTML must include the **plots embedded in the notebook output**, plus your **AI Use Disclosure**.

### Submission Instructions

- Submit `open_questions.html` to **Canvas**.
- Push **all `.py`, `.ipynb`, `.html`, and `outputs/` files** to your **GitHub repository**.
- Make sure the `.html` file contains **both code and output** so it can be viewed without rerunning the notebook.
