## 1. Training Dynamics

The MLP shows stable training dynamics with little evidence of overfitting. Train and validation loss both decline smoothly and stay close, and the validation F1 and accuracy track the training curves without a persistent gap. Performance plateaus in later epochs, suggesting the model is nearing its capacity rather than overfitting.

The LSTM curves are noisier and lower overall. Training and validation losses both trend downward, but validation accuracy and F1 fluctuate and level off at roughly 0.55–0.60, indicating underfitting or optimization difficulty rather than classic overfitting. A brief late spike in validation loss does not persist, so it looks like variance rather than sustained divergence. The LSTM’s test results reflect this weaker fit: its test accuracy is 0.6094, compared with 0.7153 for the MLP.

Class weights likely improved balance across classes by increasing the penalty for minority errors, which can stabilize minority recall but also make training noisier. In these runs, the MLP appears to benefit more cleanly from class weighting (smoother curves), while the LSTM still struggles to fit the data.
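The report does not state how the class weights were computed, so the following is only a minimal sketch of one common choice, the inverse-frequency ("balanced") heuristic, where each class gets the weight `n_samples / (n_classes * class_count)`. The function name `balanced_class_weights` and the label counts are hypothetical and are not taken from the actual dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution for illustration only (not the real data).
toy_labels = ["Neutral"] * 600 + ["Negative"] * 200 + ["Positive"] * 200

weights = balanced_class_weights(toy_labels)
for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Under this scheme the rarer classes receive weights above 1 and the majority class a weight below 1, so each minority-class error contributes more to the loss. That larger per-example contribution is also one plausible source of the noisier updates described above.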

**MLP Learning Curves and Test Confusion Matrix**

![](outputs/mlp_f1_learning_curves.png)


![](outputs/mlp_accuracy_learning_curve.png)


![](outputs/mlp_confusion_matrix.png)

**LSTM Learning Curves and Test Confusion Matrix**


![](outputs/lstm_f1_learning_curves.png)


![](outputs/lstm_accuracy_learning_curve.png)


![](outputs/lstm_confusion_matrix.png)

## 2. Model Performance and Error Analysis

The MLP generalized better to the test set than the LSTM. On the test set, the MLP reached Accuracy = 0.7153 and Macro F1 = 0.6757, while the LSTM reached Accuracy = 0.6094 and Macro F1 = 0.6012. This gap is consistent with the training dynamics: the MLP’s validation curves were stable and higher, whereas the LSTM plateaued at a lower level with more noise.

Across both models, the Positive class was the most frequently misclassified. For the MLP, Positive has the lowest F1 (0.5990) compared to Negative (0.6271) and Neutral (0.8010). For the LSTM, Positive again has the lowest F1 (0.4390) compared to Negative (0.6762) and Neutral (0.6885). This suggests positive sentiment is harder to separate in this dataset. The confusion matrices visually confirm that Positive is the class most often confused with Neutral.
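As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores, so the reported macro values can be recomputed from the per-class numbers quoted above. The sketch below uses only those quoted values; the variable and function names are illustrative.

```python
# Per-class test F1 scores as quoted in the text above.
per_class_f1 = {
    "MLP": {"Negative": 0.6271, "Neutral": 0.8010, "Positive": 0.5990},
    "LSTM": {"Negative": 0.6762, "Neutral": 0.6885, "Positive": 0.4390},
}


def macro_f1(scores):
    """Macro F1 is the unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


macro = {model: round(macro_f1(scores), 4) for model, scores in per_class_f1.items()}
weakest = {model: min(scores, key=scores.get) for model, scores in per_class_f1.items()}

for model in per_class_f1:
    print(f"{model}: macro F1 = {macro[model]:.4f}, weakest class = {weakest[model]}")
```

The recomputed means agree with the reported Macro F1 values (0.6757 for the MLP and 0.6012 for the LSTM). Because macro averaging gives each class equal weight, the low Positive F1 pulls the macro score down regardless of how many Positive examples the test set contains.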

## 3. Cross-Model Comparison

The mean-pooled FastText embeddings limit the model's capacity to capture local word order and the context around each word, so sequence-based models could in principle outperform it.
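To illustrate why mean pooling discards word order, the sketch below averages toy word vectors for two sentences that contain the same words in a different order. The 3-dimensional vectors and the helper `mean_pool` are hypothetical stand-ins; real FastText vectors are much higher-dimensional and are not reproduced here.

```python
import math

# Hypothetical toy word vectors for illustration only.
toy_vectors = {
    "not": [0.9, -0.2, 0.1],
    "good": [0.1, 0.8, 0.3],
    "bad": [-0.4, -0.7, 0.2],
    "but": [0.0, 0.1, -0.5],
}


def mean_pool(tokens, vectors):
    """Average the word vectors of a token list, dimension by dimension."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


pooled_a = mean_pool("not good but bad".split(), toy_vectors)
pooled_b = mean_pool("not bad but good".split(), toy_vectors)

# Compare with a tolerance, since float summation order can differ slightly.
same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(pooled_a, pooled_b))
print("Identical pooled representation:", same)
```

Both sentences map to the same pooled vector even though they express opposite sentiment, so a classifier on top of the pooled input cannot tell them apart. A sequence model receives the tokens in order and at least has access to that distinction.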

In theory, an LSTM can learn context from the order of tokens in a sequence. In practice, however, the LSTM did not improve classification performance: it scored below the MLP on every test metric.

The fine-tuned LLMs (BERT and GPT) outperform all other models. They are pretrained on much larger corpora, so they encode more knowledge about the language and can transfer it to this task.

On all three test metrics, the models rank as: BERT > GPT > GRU > RNN > MLP > LSTM.

Below is a summary of the final test results for all six models:

| Model | Test Accuracy | Test Macro F1 | Test Weighted F1 |
|---|---:|---:|---:|
| MLP | 0.7153 | 0.6757 | 0.7226 |
| LSTM | 0.6094 | 0.6012 | 0.6170 |
| RNN | 0.7276 | 0.7001 | 0.7300 |
| GRU | 0.7607 | 0.7192 | 0.7572 |
| BERT | 0.8294 | 0.8087 | 0.8304 |
| GPT | 0.8239 | 0.8012 | 0.8229 |
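The ranking can be checked directly against the table. The sketch below copies the table values into a dictionary and sorts the models by each metric; the structure and names are illustrative and are not part of the original classifier code.

```python
# (accuracy, macro F1, weighted F1) copied from the results table above.
results = {
    "MLP": (0.7153, 0.6757, 0.7226),
    "LSTM": (0.6094, 0.6012, 0.6170),
    "RNN": (0.7276, 0.7001, 0.7300),
    "GRU": (0.7607, 0.7192, 0.7572),
    "BERT": (0.8294, 0.8087, 0.8304),
    "GPT": (0.8239, 0.8012, 0.8229),
}
metrics = ("accuracy", "macro_f1", "weighted_f1")

# Sort model names from best to worst on each metric.
rankings = {
    name: sorted(results, key=lambda model: results[model][index], reverse=True)
    for index, name in enumerate(metrics)
}

for name, order in rankings.items():
    print(f"{name}: {' > '.join(order)}")
```

All three metrics produce the same order, so the ranking does not depend on which metric is used.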

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red"> For the classifier Python files, I used Codex to help write the functions that generate the plots and to align the output format with the other existing classifier Python files. I did not use AI to write the model classes, the training loop, or the results in the Python notebook.</font>
