# Mini Project 3

By the deadline, please submit the provided Jupyter notebook with the required tasks completed and clearly solved. Make sure your code is neat and well-commented, and that all outputs are visible (run all cells before saving). Notebooks with missing tasks or unexecuted cells may receive fewer points. After you submit, you won't be able to make changes, so double-check your work and be sure to start from the provided template.

## Submission rules
As already discussed in class, we will stick to the following rules.
- Use the templates and name your files `NAME_SURNAME.ipynb` (if you have more than one name, just concatenate them). We will compare what you present with that file.
- Code either not written in Python or not using PyTorch receives a grade of 0. Of course, you can use auxiliary packages when needed (`matplotlib`, `numpy`, ...), but for the learning part, you must use PyTorch.
- If plagiarism is suspected, the TAs and I will thoroughly investigate the situation and summon the student for a face-to-face clarification of the answers they provided. In case of plagiarism, a score reduction will be applied to everyone involved, depending on their level of involvement.
- If extensive use of AI tools is detected, we will summon the student for a face-to-face clarification of the answers they provided. If the student cannot adequately support the submitted answers in person, we will apply a penalty to the evaluation, ranging from 10% to 100%.

## Sentiment Analysis with LSTM

The IMDb dataset is a large collection of movie reviews compiled by the Internet Movie Database (IMDb), one of the most comprehensive online databases for films, TV shows, actors, and production crew information. IMDb is widely used for accessing details such as cast lists, user ratings, reviews, and plot summaries [(source)](https://www.geeksforgeeks.org/data-science/imdb-datasets-types-usages-and-application/).

In this mini-project, you will perform **sentiment analysis** on IMDb movie reviews using **LSTM-based models**. The goal is to classify each review as positive or negative.

You are required to build and train:

* A **simple LSTM model**, aiming for at least **75% test accuracy**
* A **more advanced LSTM model**, trying to push the accuracy as high as possible
* A **function** that can evaluate the sentiment of any new review

In [1]:
# Packages here 
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import numpy as np

# Don't change this 
torch.manual_seed(123)
torch.cuda.manual_seed(123)
np.random.seed(123)
torch.backends.cudnn.enabled=False
torch.backends.cudnn.deterministic=True

### Task 1 (0 pts)

To load the data, we will use [Hugging Face](https://huggingface.co/), an open-source platform that provides datasets, pre-trained models, and tools for modern machine learning.
**Note:** In principle, you could also use [TorchText](https://docs.pytorch.org/text/main/datasets.html#imdb), but it does not work reliably on Kaggle. Using the TorchText dataset instead of the Hugging Face one is **not** considered an error.

Inspect the dataset you load and make sure you understand its structure and format.

In [2]:
ds = load_dataset("stanfordnlp/imdb")
dataset_train = ds['train']
dataset_test = ds['test']

Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 299263.10 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 348751.11 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 345686.48 examples/s]


### Task 2 (5 pts)

Split the original **test set** into two parts: a **validation set** and a **final test set**. Use a 50-50 split.

**Note:** We are aware that, in a typical machine learning workflow, you would split the **training set** (not the test set) to create a validation set. However, in this exercise, we intentionally use the test set for this purpose to provide you with **more training data** for model learning.
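
As a rough illustration of the idea (not the required solution), the split can be expressed as two disjoint lists of shuffled indices. The helper name `split_indices` and its arguments are hypothetical. Shuffling before cutting matters if the split is stored sorted by label (worth checking when you inspect the data in Task 1), because a plain first-half/second-half cut could then produce single-class sets.

```python
import random


def split_indices(n_items, val_fraction=0.5, seed=123):
    """Return two disjoint, shuffled index lists that together cover range(n_items)."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_val = int(n_items * val_fraction)
    return indices[:n_val], indices[n_val:]


val_idx, test_idx = split_indices(10)
print(len(val_idx), len(test_idx))  # 5 5
```

With a Hugging Face dataset, index lists like these can be passed to `Dataset.select`; the library also provides `Dataset.train_test_split`, which performs a shuffled split in one call.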

In [None]:
# TODO

### Task 3 (10 pts)

Create a `tokenize` function that takes a line from your dataset (a review) and converts it into tokens.
You may want to consider the following:

* Are `film` and `film.` considered different?
* Does it matter if a word is uppercase or lowercase?
* Should tokens like `<br />` be included or removed?
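
To make the three questions above concrete, here is a minimal sketch of one possible answer to each: lowercase everything, drop `<br />` tags, and strip punctuation attached to words. The name `tokenize_sketch` and both regular expressions are illustrative assumptions, and other choices (for example, keeping punctuation as separate tokens) may work just as well.

```python
import re

# Matches HTML line breaks such as <br />, <br/> or <br>.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
# A word is a run of letters/digits, optionally followed by an apostrophe suffix ("didn't").
WORD = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize_sketch(line):
    """Lowercase the review, remove <br /> tags, and return the list of word tokens."""
    text = BR_TAG.sub(" ", line.lower())
    return WORD.findall(text)


print(tokenize_sketch("A great film.<br />Loved it, didn't you?"))
# ['a', 'great', 'film', 'loved', 'it', "didn't", 'you']
```

With these choices `film` and `film.` map to the same token, and `Film` and `film` are no longer distinguished.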

In [None]:
def tokenize(line):
    pass

### Task 4 (5 pts)
Create two dictionaries, `word_to_idx` and `idx_to_word`, to map words to their embedding indices and vice versa. At this stage, you should decide whether to remove outlier words and replace them with the `<UNK>` token. This decision will, of course, depend on the performance you observe afterward.
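
One common way to decide which words become `<UNK>` is a minimum-frequency threshold. The sketch below assumes the reviews are already tokenized into lists of strings; `build_vocab`, `encode`, and the `min_freq` value are hypothetical names and settings chosen for illustration, not part of the template.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=2):
    """Build both mappings, keeping only words seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    word_to_idx = {"<PAD>": 0, "<UNK>": 1}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word_to_idx[word] = len(word_to_idx)
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return word_to_idx, idx_to_word


def encode(tokens, word_to_idx):
    """Map tokens to indices; anything outside the vocabulary becomes <UNK>."""
    unk = word_to_idx["<UNK>"]
    return [word_to_idx.get(tok, unk) for tok in tokens]


docs = [["good", "film"], ["good", "plot"], ["bad", "film"]]
w2i, i2w = build_vocab(docs, min_freq=2)
print(encode(["good", "plot"], w2i))  # [2, 1]
```

The vocabulary should be built from the training reviews only, so that rare or unseen words in the validation and test sets are handled the same way as words in a brand-new review.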

In [None]:
word_to_idx = {'<PAD>': 0, '<UNK>': 1}
idx_to_word = {0: '<PAD>', 1: '<UNK>'}

# TODO

### Task 5 (10 pts)
Create a `Dataset` class. The `__getitem__` function should return `(X, y)`, where `X` is a tensor containing the vocabulary indices of the words in the review, and `y` is a tensor representing the sentiment expressed in the review.
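
The sketch below shows the general shape of such a class on toy data. It assumes each row is a dict with `"text"` and `"label"` fields (verify the actual field names when inspecting the dataset in Task 1) and uses `str.split` as a stand-in tokenizer; the class name `ToyReviewDataset` is hypothetical.

```python
import torch


class ToyReviewDataset(torch.utils.data.Dataset):
    def __init__(self, rows, word_to_idx):
        # rows: sequence of dicts with "text" and "label" keys (assumed field names)
        self.rows = rows
        self.word_to_idx = word_to_idx

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        unk = self.word_to_idx["<UNK>"]
        tokens = row["text"].lower().split()  # placeholder for a real tokenizer
        ids = [self.word_to_idx.get(tok, unk) for tok in tokens]
        X = torch.tensor(ids, dtype=torch.long)              # indices for nn.Embedding
        y = torch.tensor(row["label"], dtype=torch.float32)  # float target for a BCE-style loss
        return X, y


vocab = {"<PAD>": 0, "<UNK>": 1, "good": 2, "film": 3}
toy = ToyReviewDataset([{"text": "Good film", "label": 1},
                        {"text": "bad film", "label": 0}], vocab)
X, y = toy[1]
print(X.tolist(), y.item())  # [1, 3] 0.0
```

The label dtype depends on the loss you pick in Task 9: a float target fits a single-logit binary loss, while a two-class cross-entropy loss expects integer class labels.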


In [None]:
class DatasetIMDB(torch.utils.data.Dataset):
    def __init__(self, dataset, word_to_idx):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass

### Task 6 (10 pts)

Define a `collate_fn` function for your DataLoader that ensures all sequences in a batch have the same length. The function should pad shorter sequences with the `<PAD>` index, so that every sequence in the batch matches the length of the longest sequence.
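
As a sketch of the padding step, `pad_sequence` (already imported in the first cell) right-pads a list of 1-D tensors to the longest one. The function name `collate_sketch` and the constant `PAD_IDX` are illustrative, and the batch is assumed to be a list of `(X, y)` pairs as produced in Task 5.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # index of '<PAD>' in word_to_idx


def collate_sketch(batch):
    """Turn a list of (X, y) pairs into a padded index matrix and a label vector."""
    sequences = [X for X, _ in batch]
    labels = [y for _, y in batch]
    # Result has shape (batch_size, longest_sequence_in_batch).
    padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
    return padded, torch.stack(labels)


batch = [(torch.tensor([2, 3, 4]), torch.tensor(1.0)),
         (torch.tensor([5]), torch.tensor(0.0))]
X, y = collate_sketch(batch)
print(X.tolist())  # [[2, 3, 4], [5, 0, 0]]
print(y.tolist())  # [1.0, 0.0]
```

Because padding happens per batch, different batches can have different sequence lengths, which is fine for an LSTM.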

In [None]:
def collate_fn(batch):
    pass

### Task 7 (5 pts)

Create one `DataLoader` for each dataset: Training, Validation, and Test. Make sure each `DataLoader` uses your `Dataset` class and the `collate_fn` function you defined in Task 6.

In [None]:
# TODO

### Task 8 (15 pts)

Define an `LSTM` class that can be customized as needed. Follow the provided template, but feel free to add additional attributes or methods if necessary.
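
When filling in the `if bidirectional` branches, the tensor shapes returned by `torch.nn.LSTM` are the main thing to keep track of. The following sketch runs a stacked, bidirectional `nn.LSTM` on random, already-embedded input (no padding) and shows how the final forward and backward states of the last layer can be concatenated; all sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, emb_dim, hidden_dim, num_layers = 4, 7, 10, 16, 2

x = torch.randn(batch_size, seq_len, emb_dim)  # stands in for the embedding output
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
               batch_first=True, bidirectional=True)
output, (h_n, c_n) = lstm(x)

# output: (batch, seq_len, 2 * hidden_dim)   -> (4, 7, 32)
# h_n:    (num_layers * 2, batch, hidden_dim) -> (4, 4, 16)
# For the last layer, h_n[-2] is the forward direction and h_n[-1] the backward one.
final = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, 2 * hidden_dim)

print(output.shape, h_n.shape, final.shape)
```

A classifier head placed on top of `final` therefore needs `2 * hidden_dim` input features in the bidirectional case and `hidden_dim` otherwise. Note also that `nn.LSTM` applies its `dropout` argument only between stacked layers, so it has no effect when `num_layers=1`.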

In [None]:
class LSTM(torch.nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_layers=1, dropout=0, bidirectional=False):
        super().__init__()
        # ...
        # if bidirectional....
        pass

    def forward(self, x):
        # ...
        # if bidirectional....
        pass

### Task 9 (30 pts)

Train your model. Aim for a **test** accuracy of at least 75%. Be prepared to answer, among others, the following questions:

* Did you use a stacked LSTM? Why or why not?
* Did you use a bidirectional LSTM? Why or why not?
* Did you need to adjust tensor dimensions for your model?
* Which loss function did you choose, and why?

Plot the training and validation losses, making sure there are no signs of overfitting, and print the final training/validation/test accuracy.
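
For orientation, here is a minimal sketch of a single training step and an accuracy computation on random toy data. It assumes one particular set of choices (a single output logit, `BCEWithLogitsLoss`, Adam, gradient clipping); `TinyClassifier`, `train_step`, and `accuracy` are hypothetical names, and none of the hyperparameters are tuned or recommended.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=20, emb_dim=8, hidden_dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.emb(x))
        return self.fc(h_n[-1]).squeeze(1)  # raw logits, shape (batch,)


model = TinyClassifier()
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

X = torch.randint(1, 20, (6, 5))  # toy batch: 6 sequences of 5 token indices
y = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])


def train_step(model, X, y):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    # Rescale gradients whose global norm exceeds max_norm (guards against exploding gradients).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()


@torch.no_grad()
def accuracy(model, X, y):
    model.eval()
    preds = (model(X) > 0).float()  # logit > 0 is equivalent to sigmoid > 0.5
    return (preds == y).float().mean().item()


losses = [train_step(model, X, y) for _ in range(50)]
acc = accuracy(model, X, y)
```

In a real run, the same loss would be accumulated per epoch over the training and validation loaders and then plotted. With padded batches, `h_n[-1]` is computed after the padding steps, which is one of the dimension and padding issues you may need to reason about.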

In [None]:
# TODO

### Task 10 (5 pts)

Improve your model architecture and training procedure by applying one or more of the following strategies:

* Truncated Backpropagation Through Time
* Better tokenizer
* Pre-trained embeddings
* ....

This exercise is considered successful if **any "sensible" improvement in test accuracy** is achieved, even by applying just one change.

In [None]:
# TODO

### Task 11 (5 pts)

Use one of your trained models to perform sentiment analysis on the following reviews. Be prepared to explain any issues you encountered and how you addressed them.
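
A new review typically contains tokens that never appeared in the training vocabulary (emoji, names, rare words) and may be much longer than the training reviews. The sketch below shows one way to encode such text and to measure how much of it falls back to `<UNK>`; `encode_review`, `tokenize_fn`, and `max_len` are hypothetical names introduced for illustration.

```python
def encode_review(text, word_to_idx, tokenize_fn, max_len=None):
    """Return the index list for a raw review and the fraction of tokens mapped to <UNK>."""
    unk = word_to_idx["<UNK>"]
    ids = [word_to_idx.get(tok, unk) for tok in tokenize_fn(text)]
    if max_len is not None:
        ids = ids[:max_len]  # optional truncation of very long reviews
    unk_rate = ids.count(unk) / max(len(ids), 1)
    return ids, unk_rate


vocab = {"<PAD>": 0, "<UNK>": 1, "iron": 2, "man": 3}
ids, unk_rate = encode_review("Iron Man rocks xyz", vocab, lambda s: s.lower().split())
print(ids, unk_rate)  # [2, 3, 1, 1] 0.5
```

Before the indices are passed to a model, they need a batch dimension (for example `torch.tensor([ids])`), and the model should be in `eval()` mode inside a `torch.no_grad()` block.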

In [None]:
"""
Iron Man isn't just a superhero movie. It's the spark that ignited the entire Marvel Cinematic Universe. With Robert Downey Jr.'s career-defining performance, razor-sharp writing, and a perfect blend of heart, humor, and high-tech spectacle, this film redefined what a comic book movie could be.

Even after more than a decade, Iron Man remains one of the most re-watchable, charming, and influential superhero origin stories ever made.

🎬 Overview

Iron Man introduces Tony Stark (Robert Downey Jr.), a brilliant but arrogant billionaire weapons manufacturer. When he's captured by terrorists in Afghanistan and forced to build a missile, Tony instead constructs a powered suit of armor to escape.

Haunted by the destruction caused by his weapons, Stark returns home determined to reinvent himself. Not as a war profiteer, but as Iron Man, a hero powered by his mind, conscience, and an arc reactor in his chest.

Along the way, he faces betrayal from within his company, moral dilemmas, and a growing awareness of what true responsibility means.

✅ What Worked

1. Robert Downey Jr.: Perfect casting. His charisma, wit, and emotional depth made Tony Stark unforgettable.

2. Grounded realism: The technology feels just believable enough to make Iron Man's world plausible.

3. Sharp writing and humor: Smart, fast-paced dialogue that balances action with personality.

4. Emotional arc: Tony's transformation from egocentric arms dealer to self-aware hero feels authentic.

5. Cinematography & effects: The first suit build and flight sequences are still jaw-dropping.

6. Gwyneth Paltrow & Jeff Bridges: Excellent supporting cast. Pepper Potts' loyalty and Obadiah Stane's menace both shine.

7. The ending line: That bold, improvised moment "I am Iron Man." changed superhero cinema forever.

❌ What Didn't Work

1. Final battle pacing: The showdown between Iron Man and Iron Monger feels slightly rushed.

2. Limited female representation: Pepper is great, but she's one of very few women in a male-dominated cast.

3. Predictable villain motivation: Stane's greed is a bit by-the-numbers.

But honestly. These are small cracks in an otherwise near-perfect armor.

💬 Favorite Quotes / Moments

"I am Iron Man." - Tony Stark "Sometimes you've gotta run before you can walk." - Tony Stark "My turn." - Tony, before launching a missile at a tank "Is it better to be feared or respected? I say, is it too much to ask for both?" - Tony Stark Tony's first cave suit escape: gritty, powerful, and unforgettable.

The Mark II flight test: pure cinematic joy as Tony takes to the skies.

Pepper replacing Tony's arc reactor: both funny and intimate.

The press conference ending: Tony discarding the superhero secrecy trope in one iconic line.

The post-credit scene: Nick Fury's appearance teasing the Avengers Initiative (and the MCU as we know it).

💡 Fun Facts

1. Robert Downey Jr. Was not Marvel's first choice but his audition blew everyone away.

2. Much of the film's dialogue was improvised, including "I am Iron Man."

3. The movie was Marvel Studios' first independent production, made before Disney bought them.

4. Jon Favreau insisted on using practical effects for the suits wherever possible.

5. The success of Iron Man directly led to the creation of the MCU, which has since become the highest-grossing franchise in film history.

6. Tony Stark's mansion was CGI. It doesn't actually exist!

🎥 If You Liked This, You Might Also Enjoy

1. The Dark Knight (2008): Another intelligent and grounded superhero reboot.

2. Iron Man 2 (2010): The next chapter in Tony Stark's evolution.

3. Doctor Strange (2016): A spiritual successor exploring genius and redemption.

4. Captain America: The First Avenger (2011): Marvel's other essential origin story.

5. The Social Network (2010): Not a superhero movie, but a study of brilliance and ego that echoes Stark's early character.

Final Thoughts

Iron Man remains the gold standard for superhero origin stories. A perfect fusion of innovation, attitude, and emotion. It's not just about a man in a suit; it's about a man who learns to use his mind and heart for something greater.

Final Verdict: (9.5/10). "The birth of Iron Man was also the birth of a cinematic universe and it still flies higher than ever."
"""

## Questions

During the presentation, we may ask questions to ensure you have understood the core concepts of the course. Examples include:

1.	What is the hidden state in a recurrent neural network (RNN), and what role does it play during sequence processing?
2.	Why do we need padding when working with batches of variable-length sequences? How is padding typically handled in practice?
3.	What is a stacked RNN, and why might stacking multiple recurrent layers improve performance?
4.	What is a bidirectional RNN, and in which scenarios does it provide an advantage?
5.	What is the exploding gradient problem in recurrent networks?
6.	What is the vanishing gradient problem, and why is it particularly severe in RNNs?
7.	What is gradient clipping, and why is it commonly applied when training RNNs?
8.	How do different activation functions influence vanishing or exploding gradients in deep or recurrent networks?
9.	What is truncated backpropagation through time, and why is it used when training RNNs?
10.	What is one-hot encoding for representing words in a vocabulary? What are its limitations?
11.	What does `nn.Embedding` do, and why is it preferred over one-hot encoding?
12.	What are gating mechanisms in recurrent architectures (e.g., LSTM/GRU), and why are they important?
13.	How do LSTM gates help mitigate the vanishing gradient problem?
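
As a small aid for questions 10 and 11, the following sketch checks numerically that an `nn.Embedding` lookup gives the same result as multiplying one-hot vectors by the embedding weight matrix; the sizes are arbitrary.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 6, 3
emb = nn.Embedding(vocab_size, emb_dim)
ids = torch.tensor([4, 0, 2])

lookup = emb(ids)                                        # row lookup, shape (3, emb_dim)
one_hot = F.one_hot(ids, num_classes=vocab_size).float() # shape (3, vocab_size)
via_matmul = one_hot @ emb.weight                        # same rows via matrix product

print(torch.allclose(lookup, via_matmul))  # True
```

The lookup avoids materializing a `vocab_size`-wide vector per token, which is where the practical advantage over explicit one-hot encoding comes from when the vocabulary has tens of thousands of entries.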