## Task 1: MLP with Mean-Pooled FastText Sentence Embedding (25 points)

> Test F1 Macro: **0.7087** $\ge$ 0.65

## Task 2: LSTM with Padded FastText Word Vectors (25 points)

> Test F1 Macro: **0.7404** $\ge$ 0.70

## Evaluation Requirements (10 points)

### MLP
<img src="outputs/mlp_f1_learning_curves.png" width="500">
<img src="outputs/mlp_confusion_matrix.png" width="500">

- The training and validation curves improve together, suggesting the model trained stably without clear signs of overfitting.
- The confusion matrix indicates frequent misclassifications between the positive and neutral classes.

### LSTM
<img src="outputs/lstm_f1_learning_curves.png" width="500">
<img src="outputs/lstm_confusion_matrix.png" width="500">

- Training performance continues to improve, while validation performance plateaus around epoch 20, suggesting mild overfitting.
- Although confusion between positive and neutral remains, the LSTM achieves a higher F1 score than the MLP overall.

## Provided Models (Required) (12 points)

### RNN
<img src="outputs/rnn_f1_learning_curves.png" width="500">
<img src="outputs/rnn_confusion_matrix.png" width="500">

- Training performance keeps improving, but validation performance plateaus while validation loss rises sharply, indicating clear overfitting (Test Macro F1 ≈ 0.684).
- The confusion matrix shows frequent positive → neutral errors, suggesting the model tends to absorb positive examples into the neutral class.

### GRU
<img src="outputs/gru_f1_learning_curves.png" width="500">
<img src="outputs/gru_confusion_matrix.png" width="500">

- Training metrics reach near-perfect scores while validation loss increases later in training, suggesting overfitting. Even so, the model generalizes better than the basic RNN (Test Macro F1 ≈ 0.749).
- The main error pattern remains positive–neutral confusion, especially predicting positive samples as neutral.

### BERT
<img src="outputs/bert_f1_learning_curves.png" width="500">
<img src="outputs/bert_confusion_matrix.png" width="500">

- During 5-epoch fine-tuning, validation F1 improves up to around epoch 3 and then fluctuates, implying that an intermediate checkpoint rather than the final epoch may be the best one (Test Macro F1 ≈ 0.806).
- The confusion matrix indicates more stable separation for neutral and negative, with remaining errors mainly between positive and neutral, but overall less confusion than classical baselines.

### GPT
<img src="outputs/gpt_f1_learning_curves.png" width="500">
<img src="outputs/gpt_confusion_matrix.png" width="500">

- Validation F1 increases quickly and stabilizes within a few epochs, showing that a short fine-tuning schedule is sufficient (Test Macro F1 ≈ 0.787).
- The confusion matrix still shows positive–neutral confusion as the dominant failure mode, while the neutral class tends to be predicted most reliably.

## Open-Ended Reflection Questions (23 points)

### 1. Training Dynamics
Focus on your MLP and LSTM implementations

> Did your models show signs of overfitting or underfitting? What architectural or training changes could address this?

- For the MLP, the training and validation losses decrease together and the validation F1 score rises overall, so the model appears to have trained stably without clear overfitting. For the LSTM, training performance keeps improving while validation performance levels off after about 20 epochs, which points to mild overfitting. Possible remedies are tuning dropout or weight decay, searching for a better learning rate, and using early stopping to select the best checkpoint.
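
As a rough illustration of the early-stopping idea above, the following sketch tracks the best validation F1 and stops after a fixed number of epochs without improvement. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation scores are all hypothetical; none of them are taken from the notebook.

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum gain that counts as an improvement
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_score):
        """Record one epoch; return True when training should stop."""
        if val_score > self.best_score + self.min_delta:
            self.best_score = val_score
            self.best_epoch = epoch
            self.bad_epochs = 0       # a real loop would save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation F1 values, for illustration only.
val_f1_history = [0.60, 0.66, 0.70, 0.72, 0.71, 0.72, 0.715, 0.70]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_f1 in enumerate(val_f1_history):
    if stopper.step(epoch, val_f1):
        stopped_at = epoch
        break

print("best epoch:", stopper.best_epoch, "stopped at:", stopped_at)
```

In an actual training loop, the model weights would be saved whenever `step` records a new best score, and that checkpoint would be reloaded for testing.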

> How did using class weights affect training stability and final performance?

- The dataset appears to be class-imbalanced, so without class weights the model’s predictions can become biased toward the neutral class, which lowers the macro F1 score. Passing class weights through `nn.CrossEntropyLoss(weight=...)` makes errors on the minority classes cost more in the loss, which encourages more balanced training and can improve macro F1.
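
To make the effect of class weights concrete, here is a minimal pure-Python sketch. It assumes inverse-frequency weighting, which is one common choice and not necessarily the scheme used in the notebook. The helper names, the label counts, and the predicted probabilities are all made up for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return total / (num_classes * count_c) per class; assumes every class occurs."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]


def weighted_cross_entropy(probs, targets, weights):
    """Weighted average of -log p(target), divided by the sum of the weights used."""
    numerator = sum(-weights[t] * math.log(p[t]) for p, t in zip(probs, targets))
    denominator = sum(weights[t] for t in targets)
    return numerator / denominator


# Made-up labels for illustration: 0 = negative, 1 = neutral, 2 = positive.
labels = [0] * 2 + [1] * 6 + [2] * 2
weights = inverse_frequency_weights(labels, num_classes=3)

# A hypothetical model that always leans toward the majority (neutral) class.
probs = [[0.1, 0.8, 0.1]] * len(labels)

unweighted = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, weights)

print("class weights:", [round(w, 3) for w in weights])
print("unweighted loss:", round(unweighted, 3))
print("weighted loss:  ", round(weighted, 3))
```

With weights applied, always predicting neutral is penalized more heavily, because mistakes on the rarer classes contribute more to the loss. In PyTorch, weights like these would be converted to a float tensor and passed as the `weight` argument of `nn.CrossEntropyLoss`.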

### 2. Model Performance and Error Analysis
Focus on your MLP and LSTM implementations

> Which of your two models generalized better to the test set? Provide evidence from your metrics.

- From [the Test Macro F1 results](#evaluation-requirements-10-points), the MLP scored 0.7087 and the LSTM scored 0.7404, so the LSTM generalized better to the test set by a margin of about 0.03. [The confusion matrices](#evaluation-requirements-10-points) above also show slightly fewer misclassifications overall for the LSTM.
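
For reference, macro F1 can be read directly off a confusion matrix. The sketch below assumes rows are true labels and columns are predictions. The `macro_f1` helper is hypothetical, and the 3×3 matrix is a made-up example, not the matrix from either model.

```python
def macro_f1(confusion):
    """Macro F1 from a square confusion matrix (rows = true, columns = predicted)."""
    n = len(confusion)
    f1_scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp     # predicted c, true other
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n, f1_scores


# Made-up matrix for illustration: classes are negative, neutral, positive.
toy_confusion = [
    [8, 2, 0],
    [1, 18, 1],
    [0, 5, 5],
]

macro, per_class = macro_f1(toy_confusion)
accuracy = sum(toy_confusion[i][i] for i in range(3)) / sum(map(sum, toy_confusion))

print("per-class F1:", [round(f, 3) for f in per_class])
print("macro F1:", round(macro, 3), "accuracy:", round(accuracy, 3))
```

Because macro F1 averages the per-class scores with equal weight, a weak minority class (positive, in this toy matrix) pulls the metric below accuracy.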

> Which sentiment class was most frequently misclassified? Propose reasons for this pattern.

- For both models, the most frequent misclassification was between the positive and neutral classes. One reason is that neutral samples make up a large share of the dataset, which can bias predictions toward neutral. In addition, both models rely on FastText word vectors, so they may miss nuance and important contextual cues. Because this is financial text, some expressions probably sit on an unclear boundary between the two classes as well.

### 3. Cross-Model Comparison
Compare all six models: MLP, RNN, LSTM, GRU, BERT, GPT

> How did mean-pooled FastText embeddings limit the MLP compared to sequence-based models?

- Mean-pooling compresses all word vectors in a sentence into a single average, so information such as word order, negation, and emphasis is largely lost. Because the MLP only sees this reduced representation, it cannot exploit sentence structure or draw class boundaries as finely as the sequence-based models.
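
The loss of word order can be shown with a tiny example. The 2-dimensional vectors below are invented stand-ins for FastText embeddings, and `mean_pool` is a hypothetical helper. Two sentences with the same words in a different order pool to the same vector.

```python
import math


def mean_pool(vectors):
    """Average equal-length word vectors into a single sentence vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Made-up 2-dimensional "embeddings"; real FastText vectors are much larger.
toy_embeddings = {
    "good": [1.0, 0.5],
    "not": [-0.5, 0.0],
    "bad": [-1.0, -0.5],
}

sentence_a = ["good", "not", "bad"]   # e.g. "good, not bad"
sentence_b = ["bad", "not", "good"]   # e.g. "bad, not good"

pooled_a = mean_pool([toy_embeddings[w] for w in sentence_a])
pooled_b = mean_pool([toy_embeddings[w] for w in sentence_b])

same_vector = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(pooled_a, pooled_b))
print(pooled_a, pooled_b, same_vector)
```

Any classifier that receives only the pooled vector sees identical inputs for both sentences, so it cannot assign them different labels.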

> What advantage did the LSTM’s sequential processing provide over the MLP?

- The LSTM processes the (32, 300) sequence in order, so it can reflect the relationships between neighboring words and the flow of the sentence. This lets it better preserve sentiment cues such as negation, emphasis, and transitions within a sentence, and its test macro F1 is accordingly somewhat higher than the MLP's (0.7404 vs. 0.7087).
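
The sketch below illustrates why sequential processing is order-sensitive. It uses a toy scalar recurrence with a `tanh` update. This is a deliberate simplification and not an LSTM (there are no gates or cell state), and the weights and inputs are made up.

```python
import math


def run_recurrence(inputs, w_x=1.0, w_h=0.5):
    """Apply h_t = tanh(w_x * x_t + w_h * h_{t-1}) and return the final state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h


# Made-up scalar "word features"; the second list is the first one reversed.
forward = [1.0, -1.0, 0.5]
backward = list(reversed(forward))

h_forward = run_recurrence(forward)
h_backward = run_recurrence(backward)

mean_forward = sum(forward) / len(forward)
mean_backward = sum(backward) / len(backward)

print(h_forward, h_backward)        # final states differ
print(mean_forward, mean_backward)  # averages are identical
```

The recurrent state depends on the order in which inputs arrive, while the average does not. An LSTM adds gating on top of this basic idea.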

> Did fine-tuned LLMs (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.

- Yes. BERT and GPT achieved higher test macro F1 than the classical models. I think this is because pretrained weights let the models learn more efficiently from the available data, and because contextual representations capture how a word's meaning changes with its surrounding context. In contrast, the FastText-based models use a fixed vector for each word regardless of context, so they can easily miss contextual nuance.

> Rank all six models by test performance. What architectural or representational factors explain the ranking?

- By Test Macro F1 Score:
    - 1st: BERT (0.8055)
    - 2nd: GPT (0.7872)
    - 3rd: GRU (0.7494)
    - 4th: LSTM (0.7404)
    - 5th: MLP (0.7087)
    - 6th: RNN (0.6844)
- BERT and GPT, at the top, could better separate unclear boundaries such as positive versus neutral thanks to pretraining and contextual representations.
- GRU and LSTM, in the middle, can model word order, but the limited training data and overfitting likely kept them from learning as effectively.
- Among the lower-ranked models, the MLP was held back by mean-pooling as described above, and the RNN has no gating mechanism, so these limitations likely affected it more strongly.
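
The ranking above can be reproduced from the reported scores with a short script. The scores are the ones listed in this report; the variable names are illustrative.

```python
# Test macro F1 scores as reported in this write-up.
test_macro_f1 = {
    "MLP": 0.7087,
    "RNN": 0.6844,
    "LSTM": 0.7404,
    "GRU": 0.7494,
    "BERT": 0.8055,
    "GPT": 0.7872,
}

# Sort from highest to lowest score.
ranking = sorted(test_macro_f1.items(), key=lambda item: item[1], reverse=True)
best_score = ranking[0][1]

for rank, (name, score) in enumerate(ranking, start=1):
    gap = best_score - score
    print(f"{rank}. {name:<5} {score:.4f}  (gap to best: {gap:.4f})")
```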

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">Write your disclosure here.</font>
- **Tool(s) used:** ChatGPT 5.2 Thinking <br>
- **How you used them:** Used for concept explanations, requesting example code, debugging, and correcting my responses. <br>
- **What you verified yourself:** Read the documentation, refactored the code, and interpreted the execution results. <br>
- **What you did *not* use AI for (if applicable):** Code refactoring and logic comprehension. (I manually adapted and refactored the provided example code to fit my own coding style while making sure I understood the implementation.)