<a href="https://colab.research.google.com/github/CST389-487-NLP/intro-to-github-you-need-too-sign-in-JeremiahKicks/blob/main/HW2_starter_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="magenta">Programming assignment 1</font>
<font color="blue"> 7pts - assignment is not optional, but counts as bonus</font>

The assignment is structured as follows: the opening cells contain starter code that you must read and understand. Then you'll see the red bold title "Assignment". The markdown cell that follows the title describes the partial starter code in the next cell. That code cell has fill-in-the-blank parts that are your responsibility. The rest of the assignment is described in the markdown cell after it.


## Training IMDB sentiment classification model
Install necessary libraries

In [None]:
!pip install datasets
!pip install torchinfo
!pip install torch==2.3.0 torchtext==0.18.0

Import necessary modules

In [None]:
import torch
from torch import nn
from torch.nn import functional
from torch import optim
from torch.utils import data
import matplotlib.pyplot as plt
torch.manual_seed(42)

The next cell imports the `datasets` library from Hugging Face. To be able to run it you need to
  - Create a Hugging Face account and log into it,
  - Navigate to Settings: click on your profile picture in the top-right corner, then select Settings from the dropdown menu,
  - Go to Access Tokens: in the settings menu on the left side, click on the Access Tokens tab,
  - Generate a new token: click the New Token button,
  - Configure the token:
    1. Name your token
    2. Choose the access level. In this case, a "read" token (read-only access to public and granted-access models/datasets) is sufficient and recommended for security (we do not need to push results to Hugging Face)
  - Click the Generate a token (or Create token) button. The full token value will be displayed immediately. Copy it, as it will not be shown in full again for security reasons.

Next you need to store the token in your Colab Secrets:
  - Open Secrets: in your Google Colab notebook, click the key icon (🔑) in the left-hand sidebar,
  - Click the + Add new secret button,
  - In the "Name" field, type `HF_TOKEN` (this is the standard name used by Hugging Face libraries, but you can use your own if you want to),
  - Paste your Hugging Face API token into the "Value" field,
  - Toggle the Notebook access switch to the right (it will turn blue) to allow your current notebook to use this secret.

For this notebook nothing else is needed: when you run the code in the next cell, you will be asked to grant the notebook access to the Hugging Face secret. Say 'yes' and it will run. In the future you may need to pass the token to the Hugging Face API explicitly. In that case, use this code:

```python
from google.colab import userdata
import os

# Retrieve the secret
hf_token = userdata.get('HF_TOKEN')
```
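
As a possible next step, the retrieved token can be exported as the `HF_TOKEN` environment variable, which Hugging Face libraries read automatically. The sketch below is only an illustration: the helper name `export_hf_token` is made up, and the import of `google.colab` is guarded on the assumption that the code may also be run outside Colab.

```python
import os

def export_hf_token(token, env=os.environ):
    """Store the token under HF_TOKEN; return True if a token was stored."""
    if not token:
        return False
    env["HF_TOKEN"] = token
    return True

try:
    # Only available inside Google Colab
    from google.colab import userdata
    hf_token = userdata.get("HF_TOKEN")
except ImportError:
    # Outside Colab: fall back to whatever is already in the environment
    hf_token = os.environ.get("HF_TOKEN")

export_hf_token(hf_token)
```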

In [None]:
from datasets import load_dataset

# 1. Load the dataset
dataset = load_dataset("imdb")

# 2. Access splits
train_data = dataset['train']
test_data = dataset['test']

# 3. View an example: {'text': 'I love sci-fi...', 'label': 1}
print(train_data[0])

##  Vocabulary Building
The next cell imports two crucial functions from the torchtext library for natural language processing:
  - `get_tokenizer`: This function is used to create a tokenizer. A tokenizer is responsible for breaking down raw text into smaller units called 'tokens' (usually words or subwords). For example, it can turn a sentence like 'Hello world!' into `['hello', 'world', '!']`.
  - `build_vocab_from_iterator`: This function constructs a vocabulary from an iterator that yields sequences of tokens. A vocabulary maps each unique token to a unique numerical index, which is essential for converting text into a numerical format that a neural network can process.
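
To make the two ideas concrete without depending on torchtext, here is a minimal pure-Python sketch. The regular expression is only a rough stand-in for the `basic_english` tokenizer (an assumption, not torchtext's actual normalization rules), and `toy_tokenize` and `toy_vocab` are made-up names.

```python
import re

def toy_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

tokens = toy_tokenize("Hello world!")
print(tokens)  # ['hello', 'world', '!']

# A vocabulary is just a token -> index mapping (first occurrence gets the lowest index)
toy_vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
print([toy_vocab[tok] for tok in tokens])  # [0, 1, 2]
```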



In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In the next cell we prepare the text data for a neural network by converting raw text into a numerical representation. It performs two main steps:
- Define the tokenizer: the line `tokenizer = get_tokenizer("basic_english", language="en")` creates a tokenizer instance using torchtext's `get_tokenizer` function. This tokenizer splits raw sentences into individual words or tokens.
- Build the vocabulary: this is the core part, where a mapping from tokens to numerical indices is created.
  - The `yield_tokens` function is a helper generator. It iterates through `train_data` (which contains dictionaries with a `'text'` key) and applies the tokenizer to each text, yielding one list of tokens per review. This stream is what `build_vocab_from_iterator` expects.
  - `vocab = build_vocab_from_iterator(...)` constructs the vocabulary. It takes the token stream from `yield_tokens`, includes the special tokens `<unk>` (for unknown words not in the vocabulary) and `<pad>` (for padding sequences to the same length), and only includes words that appear at least `min_freq=2` times in the training data to reduce vocabulary size and noise.
  - `vocab.set_default_index(vocab["<unk>"])` ensures that any word that is not in the vocabulary (a new word seen during inference, or a training word dropped by `min_freq`) is mapped to the index of the `<unk>` token.
  - Finally, the print statements display information about the built vocabulary, such as its size and the indices of some specific words and special tokens, to verify its creation.
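
The effect of `specials`, `min_freq`, and the default index can be sketched in plain Python. This is an illustration under stated assumptions, not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, and the tie-breaking rule (alphabetical among equally frequent tokens) is chosen here only to make the result deterministic.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>"), min_freq=2):
    """Return (stoi, itos): specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def lookup(stoi, token, default="<unk>"):
    """Mimic a default index: unknown tokens map to the index of <unk>."""
    return stoi.get(token, stoi[default])

corpus = [["a", "great", "movie"], ["a", "bad", "movie"], ["a", "great", "cast"]]
stoi, itos = build_toy_vocab(corpus)
print(itos)                  # ['<unk>', '<pad>', 'a', 'great', 'movie']
print(lookup(stoi, "movie"))  # 4
print(lookup(stoi, "bad"))    # 0, because 'bad' appears only once and was dropped
```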

In [None]:
# 1. Define the tokenizer
# A basic English tokenizer is sufficient for the IMDB dataset
tokenizer = get_tokenizer("basic_english", language="en")
print("Tokenizer initialized.")

# 2. Build the vocabulary
# Define a generator function to yield tokens from the training data
def yield_tokens(data_iter):
    # data_iter is a list of dictionaries like {'text': '...', 'label': 0}
    for item in data_iter:
        yield tokenizer(item['text'])

vocab = build_vocab_from_iterator(
    yield_tokens(train_data), # Use the loaded train_data
    specials=["<unk>", "<pad>"], # Special tokens for unknown words and padding
    min_freq=2 # Minimum frequency for a word to be included in the vocab
)
vocab.set_default_index(vocab["<unk>"]) # Set default index for unknown words
print(f"Vocabulary built with {len(vocab)} unique tokens.")

# Display some vocabulary information for verification
print(f"<unk> token index: {vocab['<unk>']}")
print(f"<pad> token index: {vocab['<pad>']}")
print(f"Index of 'the': {vocab['the']}")
print(f"Index of 'movie': {vocab['movie']}")
print(f"First 10 vocabulary entries (specials first, then most frequent tokens): {vocab.get_itos()[:10]}")

This cell defines a custom `IMDBDataset` class, which handles the data in PyTorch. We do the following:
- Import `Dataset` from `torch.utils.data`, which is the base class for custom datasets in PyTorch.
- Define `text_pipeline`. It is a lambda function that takes a raw text string, tokenizes it using the tokenizer (defined in a previous cell), and then converts those tokens into numerical indices using the vocab (also built in a previous cell). This transforms text into a sequence of numbers that a neural network can process.
- Define `label_pipeline`. This lambda function converts string labels (like 'pos' or 'neg') into numerical labels (1 or 0). In the IMDB dataset loaded here the labels are already integers, so this pipeline is not used, but it is a standard preprocessing step, so I kept it.

The `IMDBDataset` class inherits from `torch.utils.data.Dataset` and has three main methods:
  - `__init__` initializes the dataset with the raw data list (e.g., `train_data`) and the `text_pipeline` and `label_pipeline` functions,
  - `__len__` returns the total number of items in the dataset, which PyTorch's `DataLoader` needs,
  - `__getitem__` is the core method. When you access an item by index (e.g., `dataset[0]`), it retrieves the text and label for that index, applies `text_pipeline` to convert the text to token IDs, converts the label to a `torch.long` tensor (suitable for loss functions like `CrossEntropyLoss`), and then returns the processed label and text tensor.

The code after the class creates two instances of `IMDBDataset`:
  - `train_dataset_custom` using `train_data`,
  - `test_dataset_custom` using `test_data`.

The rest are print statements that show the size of the created datasets and display a sample item (label and first few token IDs) from the training dataset, to check that the data is being processed correctly.
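
The map-style dataset protocol described above needs only `__len__` and `__getitem__`. The following sketch shows that protocol in plain Python, with lists standing in for tensors so it runs without PyTorch. `ToyReviewDataset`, `toy_vocab`, and the whitespace tokenizer are assumptions made for illustration only.

```python
class ToyReviewDataset:
    """Map-style dataset: only __len__ and __getitem__ are required."""

    def __init__(self, data_list, text_pipeline):
        self.data_list = data_list
        self.text_pipeline = text_pipeline

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        item = self.data_list[idx]
        # Same (label, token_ids) order as IMDBDataset, but with plain lists
        return item["label"], self.text_pipeline(item["text"])

toy_vocab = {"<unk>": 0, "good": 1, "bad": 2, "movie": 3}
# Unknown words fall back to index 0, the <unk> entry
toy_pipeline = lambda text: [toy_vocab.get(tok, 0) for tok in text.lower().split()]

toy_data = [
    {"text": "good movie", "label": 1},
    {"text": "bad boring movie", "label": 0},
]
toy_dataset = ToyReviewDataset(toy_data, toy_pipeline)
print(len(toy_dataset))  # 2
print(toy_dataset[1])    # (0, [2, 0, 3])
```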

In [None]:
from torch.utils.data import Dataset

# 3. Define the data processing pipeline (text to numerical tensor) and custom Dataset

# Define the text pipeline using the previously built vocab and tokenizer
text_pipeline = lambda x: vocab(tokenizer(x))
# Define the label pipeline to convert 'pos'/'neg' to 1/0
# This pipeline is not strictly necessary here as labels from datasets.load_dataset are already integers.
label_pipeline = lambda x: 1 if x == 'pos' else 0

class IMDBDataset(Dataset):
    def __init__(self, data_list, text_pipeline, label_pipeline):
        self.data_list = data_list
        self.text_pipeline = text_pipeline
        self.label_pipeline = label_pipeline

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        # self.data_list[idx] returns a dictionary like {'text': 'review content', 'label': 0},
        # so the fields are accessed by key.
        item = self.data_list[idx]
        text = item['text']
        label = item['label'] # label is an integer (0 or 1)

        processed_text = torch.tensor(self.text_pipeline(text), dtype=torch.int64)
        # For CrossEntropyLoss, labels should be of type torch.long
        processed_label = torch.tensor(label, dtype=torch.long)
        return processed_label, processed_text

# Create instances of the custom dataset
print("Creating IMDB training dataset...")
train_dataset_custom = IMDBDataset(train_data, text_pipeline, label_pipeline)
print(f"Training dataset size: {len(train_dataset_custom)}")

print("Creating IMDB test dataset...")
test_dataset_custom = IMDBDataset(test_data, text_pipeline, label_pipeline)
print(f"Test dataset size: {len(test_dataset_custom)}")

# Verify a sample from the custom dataset
print("\nSample from custom training dataset (label, token_ids):")
sample_label, sample_text_tensor = train_dataset_custom[0]
print(f"Label: {sample_label}")
print(f"Text tensor (first 10 tokens): {sample_text_tensor[:10]}")
print(f"Original text (first 50 chars): {train_data[0]['text'][:50]}...")

The next cell creates efficient data loaders for the PyTorch model, handling variable-length text sequences.
- The import part brings in `DataLoader` from `torch.utils.data`, which is PyTorch's utility for iterating over datasets in batches.
- The `collate_batch` function prepares batches of data for the `torch.nn.EmbeddingBag` layer that I use in the model. When `DataLoader` assembles a batch, it passes a list of samples `(label, text_tensor)` to this function. Here's what `collate_batch` does:
  - It iterates through each sample in the batch.
  - It collects all labels into `label_list`.
  - It collects all processed text tensors (which are already numerical token IDs from the `IMDBDataset`) into `text_list`.
  - It builds an `offsets` list. `EmbeddingBag` requires a single concatenated tensor of all text tokens in the batch, along with offsets that indicate where each individual text sequence starts in that concatenated tensor. So the offsets are cumulative sums of the lengths of the preceding text sequences, starting at 0.
  - `label_list` is stacked into a single tensor.
  - The collected lengths (without the last one) are converted into a `torch.tensor`, and `cumsum` (cumulative sum) is applied to generate the starting index of each text within the concatenated tensor.
  - All `text_list` tensors are concatenated into one large 1D tensor.
  - It returns the batch of labels, the concatenated text tensor, and the offsets.
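
The offset arithmetic is easy to check by hand on a tiny batch. Below is a minimal pure-Python sketch of the same bookkeeping, with lists instead of tensors; `toy_collate` is a hypothetical stand-in, not the `collate_batch` function used in this notebook.

```python
from itertools import accumulate

def toy_collate(batch):
    """batch: list of (label, token_id_list) pairs."""
    labels = [label for label, _ in batch]
    lengths = [len(tokens) for _, tokens in batch]
    # Start position of each sequence = sum of the lengths before it
    offsets = [0] + list(accumulate(lengths[:-1]))
    flat_tokens = [tok for _, tokens in batch for tok in tokens]
    return labels, flat_tokens, offsets

batch = [(1, [5, 6, 7]), (0, [8, 9]), (1, [10, 11, 12, 13])]
labels, flat_tokens, offsets = toy_collate(batch)
print(offsets)  # [0, 3, 5]
# Recover the second review from the flat list using its offsets
print(flat_tokens[offsets[1]:offsets[2]])  # [8, 9]
```

Note that the last length is never needed: the final sequence simply runs to the end of the flat list.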

In the `Create DataLoaders` section I set up the actual data loaders for training and testing:
- `BATCH_SIZE` is defined as 64, meaning 64 samples are processed together.
- `device` is set to 'cuda' if a GPU is available, otherwise 'cpu'.
- `train_loader_custom` is created from `train_dataset_custom` with `shuffle=True` (needed for mini-batch SGD) and, importantly, `collate_fn=collate_batch`.
- `test_loader_custom` is created similarly for the test set, but with `shuffle=False`, since evaluation only runs forward passes.

Finally, we iterate through one batch from `train_loader_custom` to print the shapes and a few samples of the labels, text tensor, and offsets.



In [None]:
from torch.utils.data import DataLoader

# 4. Define the collate_fn for DataLoader
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(_label)
        processed_text = _text # _text is already a tensor from IMDBDataset
        text_list.append(processed_text)
        offsets.append(processed_text.size(0)) # Length of each text sequence

    label_list = torch.stack(label_list)  # stack the 0-D label tensors into one 1-D tensor
    # `torch.nn.EmbeddingBag` expects the start position of each text in the concatenated tensor:
    # `[0, len(text1), len(text1)+len(text2), ...]`.
    # `offsets` currently holds `[0, len(text1), ..., len(textN)]`; dropping the last length
    # and taking the cumulative sum yields exactly these start positions.
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    # Concatenate all text tensors into a single tensor
    text_list = torch.cat(text_list)
    return label_list, text_list, offsets

# 5. Create DataLoaders
BATCH_SIZE = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Creating DataLoaders...")
train_loader_custom = DataLoader(
    train_dataset_custom,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_batch
)

test_loader_custom = DataLoader(
    test_dataset_custom,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_batch
)
print("DataLoaders created.")

# Verify by iterating through one batch
print("\nVerifying a batch from train_loader_custom:")
for labels, text_tensor, offsets in train_loader_custom:
    print(f"Batch labels shape: {labels.shape}")
    print(f"Batch text tensor shape: {text_tensor.shape}")
    print(f"Batch offsets shape: {offsets.shape}")
    print(f"First few labels: {labels[:5]}")
    print(f"First few text token IDs: {text_tensor[:10]}")
    print(f"First few offsets: {offsets[:5]}")
    break

print("\nVerification of data loading pipeline complete.")

### <font color="red"><b>Assignment</b></font>
<font color="blue"> Assignment starts here and continues to the end of the notebook</font>

The next cell must define the `IMDBMLP` (IMDB Multi-Layer Perceptron) model, a neural network designed for sentiment classification. Below I summarize the required structure.
- `class IMDBMLP(nn.Module)`: the `IMDBMLP` class inherits from `nn.Module`, the base class for all neural network modules in PyTorch.
- The `__init__` method is the constructor where all the layers of the network are defined:
  - `self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim, sparse=False)` creates an embedding layer. `EmbeddingBag` is particularly efficient for text classification with variable-length sequences: in its default `mode="mean"` it averages the embeddings of all tokens of a text into one vector. It takes the `vocab_size` (total number of unique words) and `embedding_dim` (the size of the vector representation for each word); `sparse=False` ensures dense gradients, which optimizers like Adam require (`torch.optim.Adam` does not support sparse gradients).
  - `self.fc1 = ...` defines the first fully connected (linear) layer. It takes the output of the embedding layer (an `embedding_dim`-sized vector representing the entire text) and transforms it into `hidden_dim1` features.
  - `self.gelu1 = ...` is the hidden GELU (Gaussian Error Linear Unit) activation layer, which introduces non-linearity after the first linear layer.
  - `self.fc2 = ...` is the second fully connected layer, further transforming the features.
  - `self.gelu2 = nn.GELU()` is another GELU activation layer.
  - `self.fc3 = ...` is the final linear layer, which creates the logits, i.e. outputs `num_class` values (2 in this case, for positive/negative sentiment).
  - `self.softmax = nn.Softmax(dim=1)` applies the Softmax function to the output of `fc3` to convert the raw scores into probabilities for each class, ensuring they sum to 1.

After the layer definitions come the methods:
- The `forward` method defines how data flows through the network during a forward pass:
  - `embedded = self.embedding(text, offsets)`: the input text (token IDs) and offsets (the start position of each text in the batch) are passed to the `EmbeddingBag` layer to get the text embeddings.
  - The embedded output then passes sequentially through `fc1`, `gelu1`, `fc2`, `gelu2`, and `fc3`. Finally, `self.softmax(x)` converts the raw logits into a probability distribution over the classes.
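
To see what `EmbeddingBag` does with `text` and `offsets`, the following small experiment may help. It is a sketch with toy sizes chosen only for illustration (10 embeddings of dimension 4); it relies on the layer's default `mode="mean"` and is not part of the required solution.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Toy sizes for illustration only; default mode is "mean"
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, sparse=False)

# Two texts packed into one flat tensor: [1, 2, 3] and [4, 5]
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])

pooled = bag(text, offsets)
print(pooled.shape)  # torch.Size([2, 4]): one vector per text

# The first row equals the average of the embedding vectors of tokens 1, 2, 3
manual_mean = bag.weight[[1, 2, 3]].mean(dim=0)
print(torch.allclose(pooled[0], manual_mean, atol=1e-6))  # True
```

Whatever the length of each text, the layer returns one fixed-size vector per text, which is why its output can be fed directly into a linear layer.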

You must define the following model hyperparameters:
- `VOCAB_SIZE` is determined by `len(vocab)` from previous cells.
- `EMBEDDING_DIM` is the dimensionality of the word embeddings (you can choose any value from 50 to 2024, but you must justify your choice in your comments).
- `HIDDEN_DIM1, HIDDEN_DIM2` are the sizes (number of neurons) of the hidden layers in the MLP (again it is your choice, but you should explain it).
- `NUM_CLASSES` is 2 for binary sentiment classification.

Next we instantiate the model: `model_imdb_mlp = IMDBMLP(...)` creates an actual instance of your neural network. We then print a model summary using `torchinfo.summary()`, which provides a detailed overview of the model, including its layers, output shapes, number of parameters for each layer, and total parameters. The `input_size` and `dtypes` arguments are crucial here: they simulate the input that the model expects, namely the text (a 1D tensor of token IDs) and the offsets (a 1D tensor of start indices) for a sample batch.
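
The per-layer parameter counts reported in such a summary can be checked by hand. The helpers below are a hypothetical sketch of the arithmetic for an embedding table and a linear layer with bias; the sizes are toy values for illustration and are not recommended hyperparameters.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per vocabulary entry
    return vocab_size * embedding_dim

def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit
    return in_features * out_features + out_features

# Toy sizes, chosen only to keep the numbers small
print(embedding_params(1000, 8))  # 8000
print(linear_params(8, 6))        # 54
```

Activation layers such as GELU and Softmax have no trainable parameters, so they contribute zero to the total.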

<font color="red">You must fill in the dots in the next cell</font>



In [None]:
import torchinfo

class IMDBMLP(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim1, hidden_dim2, num_class):
        super(IMDBMLP, self).__init__()
        # Set sparse=False to ensure dense gradients for compatibility with Adam optimizer
        self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim, sparse=False)
        self.fc1 = nn.Linear(...(?))
        self.gelu1 = nn.GELU()
        self.fc2 = nn.Linear(...(?))
        self.gelu2 = nn.GELU()
        self.fc3 = nn.Linear(...,(?))
        self.softmax = nn.Softmax(dim=1)

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
          ...(?)
        return self.softmax(x)

# Define model parameters
VOCAB_SIZE = len(vocab) # vocab is from previous cells
EMBEDDING_DIM = ...(?)
HIDDEN_DIM1 = ...(?)
HIDDEN_DIM2 = ...(?)
NUM_CLASSES = ...(?)

# Instantiate the model
model_imdb_mlp = IMDBMLP(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM1, HIDDEN_DIM2, NUM_CLASSES)

# Print model summary
print("IMDBMLP Model Summary:")
# Correct input_size: text should be 1D, offsets should be 1D
# Example: one review with 10 tokens (total 10 tokens for batch size 1)
torchinfo.summary(model_imdb_mlp, input_size=[(10,), (1,)], dtypes=[torch.int64, torch.int64])

### The rest of the assignment
In the following cells:
* the model should be reinstantiated with appropriate values (in case you just did a demo in the previous cell),
* correct loss function should be chosen,
* optimizer should be added,
* number of training epochs should be defined,
* training should be executed.
* You must compute the running loss and accuracy and output them during training.

Once training is finished you must
* evaluate your model on the test set,
* output the test loss and test accuracy.

<font color="blue"><b>The code should have an extensive description in markdown cells accompanying the code. I will severely penalize the absence of comment cells.</b></font>
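
One generic way to keep running loss and accuracy across batches is sketched below in plain Python. It is only a bookkeeping helper under stated assumptions: `RunningMetrics` is a made-up name, and it assumes the loss function returns the mean loss over each batch, so the batch loss is weighted by the batch size before averaging.

```python
class RunningMetrics:
    """Accumulates sample-weighted loss and accuracy across batches."""

    def __init__(self):
        self.loss_sum = 0.0
        self.correct = 0
        self.count = 0

    def update(self, batch_mean_loss, num_correct, batch_size):
        # Weight the batch-mean loss by batch size so a smaller last batch counts correctly
        self.loss_sum += batch_mean_loss * batch_size
        self.correct += num_correct
        self.count += batch_size

    @property
    def loss(self):
        return self.loss_sum / self.count

    @property
    def accuracy(self):
        return self.correct / self.count

metrics = RunningMetrics()
metrics.update(batch_mean_loss=0.70, num_correct=40, batch_size=64)
metrics.update(batch_mean_loss=0.50, num_correct=20, batch_size=32)
print(f"loss={metrics.loss:.4f} acc={metrics.accuracy:.4f}")  # loss=0.6333 acc=0.6250
```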

For the remaining part of the assignment you can and should use or modify the code from my Jupyter notebook that demonstrates neural networks in class. It is available in the week 3 repository in GitHub Classroom (and/or as a file in the week 3 weekly lectures on BB).

