## **Building, Tuning and Evaluating a Standard BiLSTM Model for NLP**

### **Libraries**

In [None]:
import os
import sys
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Sampler, BatchSampler, Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

from datasets import load_dataset

### **Utility Functions**

In [None]:
def regularized_f1(train_f1, dev_f1, threshold=0.0015):
    """
    Returns development F1 if overfitting is below threshold, otherwise 0.
    """
    return dev_f1 if (train_f1 - dev_f1) < threshold else 0


def save_metrics(*args, path, fname):
    if not os.path.exists(path):
        os.makedirs(path)
    if not os.path.isfile(path + fname):
        with open(path + fname, "w", newline="\n") as f:
            f.write(
                ",".join(
                    [
                        "config",
                        "epoch",
                        "train_loss",
                        "train_acc",
                        "train_f1",
                        "val_loss",
                        "val_acc",
                        "val_f1",
                    ]
                )
            )
            f.write("\n")
    if args:
        with open(path + fname, "a", newline="\n") as f:
            f.write(",".join([str(arg) for arg in args]))
            f.write("\n")


def seed_everything(seed: int):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seed every GPU, not only the current one
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuning may select non-deterministic kernels


seed_everything(1234)
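
To make the behaviour of `regularized_f1` concrete, the sketch below re-declares the function as defined above and calls it on made-up scores. The numbers are illustrative only and are not results from this notebook.

```python
def regularized_f1(train_f1, dev_f1, threshold=0.0015):
    """Return dev F1 if the train/dev gap is below threshold, otherwise 0."""
    return dev_f1 if (train_f1 - dev_f1) < threshold else 0


# Illustrative scores only (not taken from an actual run).
small_gap = regularized_f1(train_f1=0.801, dev_f1=0.800)  # gap of about 0.001: accepted
large_gap = regularized_f1(train_f1=0.900, dev_f1=0.800)  # gap of 0.1: rejected
dev_ahead = regularized_f1(train_f1=0.780, dev_f1=0.800)  # negative gap: accepted

print(small_gap, large_gap, dev_ahead)
```

With the default threshold, a gap of only 0.0015 between training and development F1 already zeroes the score, so the criterion is strict: most epochs in which training F1 runs ahead are rejected.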

### **Constants**

In [None]:
VOCAB_SIZE = 20_000
BATCH_SIZE = 32
NUM_EPOCHS = 15
MAX_LEN = 256
LEARNING_RATE = 1e-4

### **Step 1 - Download and prepare the dataset**

I loaded the IMDB dataset using the datasets module from Hugging Face, which conveniently includes preprocessed train and test splits.
Next, I customized the train/validation split to better control the distribution of samples during training and evaluation. Specifically:

- I selected the central portion of the original training set (from 10% to 85%) to serve as the actual training data.

- The remaining outer segments (the first 10% and the last 15%) were combined and used as the validation set.

This slicing strategy allowed me to simulate a more realistic validation scenario while maintaining a balanced and representative training set. I used the Hugging Face slicing API directly within the load_dataset function to perform this operation efficiently.

The test set provided by the original IMDB dataset remained unchanged and was reserved for final evaluation.

In [None]:
# load dataset in splits
train_data = load_dataset("imdb", split="train[10%:85%]")
dev_data = load_dataset("imdb", split="train[:10%]+train[85%:]")
test_data = load_dataset("imdb", split="test")
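
As a quick check of the split sizes, the following sketch reproduces the slice boundaries with plain integer arithmetic. It assumes the 25,000-example IMDB training split and does not model how `datasets` rounds percentages in general; for 10% and 85% of 25,000 the boundaries are whole numbers, so no rounding is involved.

```python
TRAIN_SIZE = 25_000  # size of the IMDB training split


def pct(n, percent):
    """Index that a whole-number percentage maps to (exact for the values used here)."""
    return n * percent // 100


lo, hi = pct(TRAIN_SIZE, 10), pct(TRAIN_SIZE, 85)

train_indices = range(lo, hi)                                 # train[10%:85%]
dev_indices = list(range(0, lo)) + list(range(hi, TRAIN_SIZE))  # train[:10%]+train[85%:]

print(len(train_indices), len(dev_indices))  # 18750 6250
```

The two index sets are disjoint and together cover the whole original training split.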

I defined a tokenizer using `get_tokenizer` with the `'spacy'` backend (`en_core_web_sm`), then built the vocabulary from the dataset using `build_vocab_from_iterator`. I included the special tokens `'<UNK>'` at index `0` and `'<PAD>'` at index `1`, limited the vocabulary size to `VOCAB_SIZE`, and set `'<UNK>'` as the default index for unknown tokens. I also added a test to verify that unknown tokens correctly return index `0`.


In [None]:
# define tokenizer
tokenizer = get_tokenizer("spacy")  # using spaCy tokenizer


# function to generate and yield the tokens from the dataset
def yield_tokens(data):
    for text in data:
        yield tokenizer(text["text"])


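# define vocabulary
vocab = build_vocab_from_iterator(
    yield_tokens(train_data),
    specials=["<UNK>", "<PAD>"],  # <UNK> -> index 0, <PAD> -> index 1
    max_tokens=VOCAB_SIZE,
)

# map out-of-vocabulary tokens to <UNK>
vocab.set_default_index(vocab["<UNK>"])

# sanity check: special tokens sit at the expected indices, unknown tokens map to 0
assert vocab["<UNK>"] == 0 and vocab["<PAD>"] == 1
assert vocab["this-token-is-not-in-the-vocabulary"] == 0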
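
The vocabulary logic can be illustrated without `torchtext` by a small pure-Python stand-in. `build_toy_vocab` and `encode` are hypothetical helpers written for this sketch, not part of any library, and the alphabetical tie-breaking is an assumption made to keep the output deterministic.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials, max_tokens):
    """Hypothetical helper: specials first, then the most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Ties are broken alphabetically here; this ordering is an assumption.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + ranked[: max_tokens - len(specials)]
    return {tok: idx for idx, tok in enumerate(itos)}


docs = [["a", "great", "movie"], ["a", "dull", "movie"], ["a", "great", "cast"]]
stoi = build_toy_vocab(docs, specials=["<UNK>", "<PAD>"], max_tokens=5)


def encode(tokens):
    # dict.get with a default plays the role of a default index for unknown tokens
    return [stoi.get(tok, stoi["<UNK>"]) for tok in tokens]


print(stoi)
print(encode(["a", "dull", "movie"]))  # [2, 0, 4]
```

Because the size cap includes the two special tokens, only the three most frequent words survive, and the rarer word `dull` falls back to index `0`.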

I used the tokenizer and vocabulary to convert the three data splits into sequences of token indices, applying a maximum sequence length of `MAX_LEN`. While this preprocessing was done ahead of time here, it's typically performed on the fly within the `DataLoader`'s `collate_batch` function to handle memory constraints efficiently.

In [None]:
train_idx = [
    torch.tensor(vocab(tokenizer(text["text"]))[:MAX_LEN]) for text in train_data
]
dev_idx = [torch.tensor(vocab(tokenizer(text["text"]))[:MAX_LEN]) for text in dev_data]
test_idx = [
    torch.tensor(vocab(tokenizer(text["text"]))[:MAX_LEN]) for text in test_data
]

I defined a custom PyTorch dataset by subclassing Dataset, implementing the logic to return tokenized indices and corresponding labels for each sample. I then instantiated this dataset to prepare it for use in the data loading pipeline.

This wrapper gives the data loader a simple, index-based interface to the sequences and their labels.

In [None]:
class ImdbDataset(Dataset):
    def __init__(self, seq, lbl):
        self.sequences = seq
        self.labels = lbl

    def __getitem__(self, idx):
        return {"input_ids": self.sequences[idx], "label": self.labels[idx]}

    def __len__(self):
        return len(self.sequences)

In [None]:
train_set = ImdbDataset(train_idx, train_data["label"])
dev_set = ImdbDataset(dev_idx, dev_data["label"])
test_set = ImdbDataset(test_idx, test_data["label"])

- I implemented a custom `GroupedSampler` to improve training efficiency: batching samples of similar length reduces the amount of padding within each batch.

- In the `__init__` method, I paired each dataset index with the length of its tokenized sequence.

- In the `__iter__` method, I shuffled this list so that batch composition changes from epoch to epoch, split it into temporary groups of size `BATCH_SIZE * 100`, and sorted each group by sequence length. I then extracted only the sorted indices and returned them as an iterator.

- Finally, I completed the `__len__` method to return the total number of samples.

In [None]:
class GroupedSampler(Sampler):
    def __init__(self, seqs, batch_size):
        """
        Args:
            seqs (List[List[int]]): List of tokenized sequences from ImdbDataset.
            batch_size (int): Batch size for sampling.
        """
        self.seqs = seqs
        self.batch_size = batch_size

        # pair each sequence index with its tokenized sequence length
        self.index_length_list = [(index, len(seq)) for index, seq in enumerate(seqs)]

    def __iter__(self):
        # shuffle the index-length list for randomness each epoch
        random.shuffle(self.index_length_list)

        # chunk size as per specifications
        chunk_size = self.batch_size * 100
        grouped_indices = []

        # process in chunks, sort each chunk by sequence length
        for i in range(0, len(self.index_length_list), chunk_size):
            chunk = self.index_length_list[i : i + chunk_size]

            # sort within the chunk by sequence length
            chunk.sort(key=lambda x: x[1])  # sort by sequence length (ascending)

            # extend grouped_indices with only the indices from each sorted chunk
            grouped_indices.extend([index for index, _ in chunk])

        return iter(grouped_indices)

    def __len__(self):
        return len(self.seqs)
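
The effect of the shuffle-then-sort-per-chunk scheme on padding can be estimated without PyTorch. The sketch below uses synthetic sequence lengths; `grouped_order` and `padding_needed` are hypothetical helpers that mirror the sampler's logic, and the chunk factor is lowered to 10 so that the small synthetic dataset still spans several chunks.

```python
import random


def grouped_order(lengths, batch_size, chunk_factor=100, seed=0):
    """Shuffle indices, then sort each chunk of batch_size * chunk_factor by length."""
    rng = random.Random(seed)
    pairs = list(enumerate(lengths))
    rng.shuffle(pairs)
    chunk_size = batch_size * chunk_factor
    order = []
    for start in range(0, len(pairs), chunk_size):
        chunk = sorted(pairs[start:start + chunk_size], key=lambda p: p[1])
        order.extend(idx for idx, _ in chunk)
    return order


def padding_needed(lengths, order, batch_size):
    """Total pad tokens when consecutive indices in `order` form the batches."""
    total = 0
    for start in range(0, len(order), batch_size):
        batch = [lengths[i] for i in order[start:start + batch_size]]
        total += sum(max(batch) - n for n in batch)
    return total


rng = random.Random(1)
lengths = [rng.randint(5, 256) for _ in range(2000)]  # synthetic lengths
plain = list(range(len(lengths)))
grouped = grouped_order(lengths, batch_size=32, chunk_factor=10)

print("pad tokens, original order:", padding_needed(lengths, plain, 32))
print("pad tokens, grouped order: ", padding_needed(lengths, grouped, 32))
```

Every index still appears exactly once per epoch; only the order changes, so that neighbouring samples have similar lengths.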

* Now create the `GroupedSampler`, use it as input to create a `BatchSampler` (imported in the beginning)

In [None]:
train_grouped_sampler = GroupedSampler(seqs=train_idx, batch_size=BATCH_SIZE)
train_sampler = BatchSampler(
    sampler=train_grouped_sampler, batch_size=BATCH_SIZE, drop_last=False
)

I defined a custom collate_fn to process batches of tokenized sequences and labels. The function pads the sequences to the same length using padding_value=1 (reserving 0 for '<UNK>'), converts the labels into tensors, and computes the original (pre-padding) lengths of each sequence. It returns three tensors per batch: the padded sequences, the corresponding labels, and the original sequence lengths.

In other words, the collate function transforms each sampled batch of individual examples into the tensors that the model consumes.

In [None]:
def collate_batch(batch):
    # Extract sequences and labels from batch items using dictionary keys
    # input_ids are already tensors, so as_tensor avoids the copy (and warning) of torch.tensor
    sequences = [torch.as_tensor(item["input_ids"], dtype=torch.long) for item in batch]
    labels = torch.tensor([item["label"] for item in batch], dtype=torch.float)
    lengths = torch.tensor(
        [len(seq) for seq in sequences], dtype=torch.long
    )  # original lengths before padding

    # pad sequences to the same length
    padded_sequences = pad_sequence(
        sequences, batch_first=True, padding_value=1
    )  # padding value is 1

    return padded_sequences, labels, lengths
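
A minimal sketch of what the padding step produces, using three toy sequences of token indices (the values are arbitrary and chosen only for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sequences of different lengths (arbitrary token indices).
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8]), torch.tensor([9, 10])]

# Record the true lengths before padding, then pad with the <PAD> index 1.
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True, padding_value=1)

print(padded.tolist())   # [[5, 6, 7], [8, 1, 1], [9, 10, 1]]
print(lengths.tolist())  # [3, 1, 2]
```

The `lengths` tensor is what later allows `pack_padded_sequence` to skip the padded positions.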

I created the final DataLoader for the training set using the custom dataset, GroupedSampler, and collate_fn, setting num_workers=2 to control parallelism. For the validation and test sets, I instantiated their respective ImdbDataset and DataLoader without shuffling or custom samplers, but maintained the same batch size and collate_fn to ensure consistent batching and padding behavior across all splits.

In [None]:
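# create dataloaders
# batch_sampler is mutually exclusive with batch_size and shuffle, so those are omitted here
train_loader = DataLoader(
    dataset=train_set,
    batch_sampler=train_sampler,  # length-grouped batches of BATCH_SIZE indices
    collate_fn=collate_batch,
    num_workers=2,
)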

# Create DataLoader for validation set
val_loader = DataLoader(
    dataset=dev_set,
    batch_size=BATCH_SIZE,
    shuffle=False,  # no shuffle for validation
    collate_fn=collate_batch,
    num_workers=2,
)

# Create DataLoader for test set
test_loader = DataLoader(
    dataset=test_set,
    batch_size=BATCH_SIZE,
    shuffle=False,  # no shuffle for test set
    collate_fn=collate_batch,
    num_workers=2,
)

### BiLSTM-Based Sequence Classifier

The implemented model is a BiLSTM-based sequence classifier for binary text classification. Below is a breakdown of its components and functionalities:

---

#### Embedding Layer
- **Inputs**: `vocab_size`, `embedding_dim`, `padding_idx=1`  
- **Purpose**: Converts token indices into dense vector representations.

---

#### Dropout Layer
- **Position**: Applied immediately after the embeddings  
- **Purpose**: Prevents overfitting.

---

#### Bidirectional LSTM Layer
- **Inputs**: `embedding_dim` (input size), `rnn_size` (hidden size)  
- **Configuration**: Bidirectional — captures both past and future context  
- **Sequence Handling**:  
  - Inputs are wrapped using `pack_padded_sequence`  
  - Outputs are unpacked with `pad_packed_sequence`  
  - Ensures efficient handling of variable-length sequences.

---

#### Pooling Operation
- **Type**: `torch.mean()`  
- **Purpose**: Reduces the LSTM outputs to a single vector per sequence by averaging over the time dimension.

---

#### Fully Connected Layers
- **First Linear Layer**:  
  - Projects the pooled LSTM output to a `hidden_size` dimensional space  
  - **Activation**: `ReLU`  
  - **Dropout**: Applied after the transformation for regularization.

- **Second Linear Layer**:  
  - Maps the hidden representation to a single output logit  
  - Suitable for binary classification with a sigmoid-based loss such as `BCEWithLogitsLoss`.

---

This architecture ensures robust sequence modeling with contextual encoding from the BiLSTM, efficient padding handling, and proper regularization through dropout.


In [None]:
class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, rnn_size, hidden_size, dropout):
        super(BiLSTM, self).__init__()

        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embedding_dim,
            padding_idx=1,  # <PAD> token
        )

        self.dropout = nn.Dropout(p=dropout)

        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=rnn_size,
            num_layers=1,
            bidirectional=True,
            batch_first=True,
        )

        # First Linear Layer
        self.fc1 = nn.Linear(in_features=rnn_size * 2, out_features=hidden_size)

        # Output layer
        self.fc2 = nn.Linear(
            in_features=hidden_size,
            out_features=1,  # Single output for binary classification
        )

        # Activation
        self.relu = nn.ReLU()

    def forward(self, seq, lengths):
        # Embedding layer with dropout
        embedded = self.dropout(self.embedding(seq))

        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )

        # LSTM layer
        packed_output, (hidden, cell) = self.lstm(packed_embedded)

        output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)

        # Mean pooling over sequence length
        pooled = torch.mean(output, dim=1)

        # First Linear Layer with ReLU and dropout
        fc1_out = self.dropout(self.relu(self.fc1(pooled)))

        # Output Layer
        output = self.fc2(fc1_out).squeeze(
            1
        )  # Squeeze for binary classification output
        return output
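
One detail worth keeping in mind: `pad_packed_sequence` fills padded time steps with zeros, so `torch.mean(output, dim=1)` divides by the longest length in the batch and shrinks the pooled vector of shorter sequences. The sketch below shows a length-aware alternative on a toy tensor. `masked_mean` is a hypothetical helper and is not what the model above uses.

```python
import torch


def masked_mean(output, lengths):
    """Average over valid time steps only.

    output:  (batch, time, features), zero at padded positions
    lengths: (batch,) number of valid steps per sequence
    """
    lengths = lengths.to(output.device)
    steps = torch.arange(output.size(1), device=output.device)
    mask = steps.unsqueeze(0) < lengths.unsqueeze(1)  # (batch, time), True where valid
    summed = (output * mask.unsqueeze(-1)).sum(dim=1)
    return summed / lengths.unsqueeze(1).to(output.dtype)


# Toy batch with one feature; the second sequence has length 1 and two padded steps.
output = torch.tensor([[[2.0], [4.0], [6.0]],
                       [[3.0], [0.0], [0.0]]])
lengths = torch.tensor([3, 1])

plain = torch.mean(output, dim=1).squeeze(1)
aware = masked_mean(output, lengths).squeeze(1)

print(plain.tolist())  # [4.0, 1.0]
print(aware.tolist())  # [4.0, 3.0]
```

Whether the length-aware variant improves results on this task would have to be tested; the sketch only shows how the two poolings differ.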


## Task 3 - Inner train loop
* Create a global `device` variable which checks whether a GPU is available or not, and sets the device to either GPU or CPU.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Training and Evaluation Loop (`process` Function)

The `process` function encapsulates the logic for performing a single pass through the dataset, either for training or evaluation. It handles device transfers, batch iteration, loss computation, and metric aggregation.

---

#### Function Overview
- **Inputs**:
  - `model`: the sequence classifier to train or evaluate
  - `loader`: a `DataLoader` that yields batches of (sequences, labels, lengths)
  - `criterion`: loss function
  - `optim`: (optional) optimizer for backpropagation; `None` selects evaluation

- **Loop Behavior**:
  - Iterates once over the dataset (i.e., one epoch)
  - Uses `tqdm()` to display progress with `unit='batches'`, `file=sys.stdout`, and appropriate `desc` (e.g., `'Training'` or `'Evaluating'`)
  - Moves `sequences` and `labels` to `device`
  - Keeps `lengths` on **CPU** (required for correct sequence handling)

---

#### Training Mode
- If `optim` is provided:
  - Sets model to `train()` mode
  - For each batch:
    - Performs forward pass
    - Computes loss
    - Executes backpropagation (`optim.zero_grad()`, `loss.backward()` and `optim.step()`)
    - Accumulates predictions, loss, and correct predictions

---

#### Evaluation Mode
- If `optim` is `None`:
  - Sets model to `eval()` mode and disables gradient tracking via `torch.set_grad_enabled(False)`
  - Performs the same forward logic without updating weights

---

#### Metrics Computed per Epoch
- **Loss**: averaged over all samples
- **Accuracy**: total correct predictions divided by number of samples
- **F1 Score**: computed using all predictions and labels from the epoch

These metrics are returned at the end of the epoch for logging or early stopping.


In [None]:
def process(model, loader, criterion, optim=None):
    is_train = optim is not None
    model.train() if is_train else model.eval()

    total_loss = 0
    all_preds = []
    all_labels = []
    total_samples = 0
    correct_predictions = 0

    for padded_sequences, labels, lengths in tqdm(
        loader,
        desc="Training" if is_train else "Evaluating",
        file=sys.stdout,
        unit="batches",
    ):
        # Move data to device
        sequences = padded_sequences.to(device)
        labels = labels.to(device)
        lengths = lengths.cpu()  # Keep lengths on CPU for pack_padded_sequence

        with torch.set_grad_enabled(is_train):
            outputs = model(sequences, lengths)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * labels.size(0)

            # Convert model outputs to binary predictions
            probs = torch.sigmoid(outputs).cpu().detach().numpy()
            preds = (probs >= 0.5).astype(int)
            all_preds.extend(preds.flatten())
            all_labels.extend(labels.cpu().numpy().flatten())

            correct_predictions += (
                preds.flatten() == labels.cpu().numpy().flatten()
            ).sum()

            if is_train:  # update weights only in training mode: reset gradients, backpropagate, step
                optim.zero_grad()
                loss.backward()
                optim.step()

        total_samples += labels.size(0)

    avg_loss = total_loss / total_samples
    avg_accuracy = correct_predictions / total_samples
    avg_f1 = f1_score(all_labels, all_preds, zero_division=1)

    print(f"Unique values in predictions: {np.unique(all_preds, return_counts=True)}")
    print(f"Unique values in labels: {np.unique(all_labels, return_counts=True)}")
    return avg_loss, avg_accuracy, avg_f1
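
The function multiplies each batch loss by the batch size before summing because the criterion returns a per-batch mean and the last batch is usually smaller. A small sketch with made-up batch losses shows the difference from a plain average of batch means:

```python
# (mean loss, batch size) pairs; the last batch is smaller, as with drop_last=False.
# The values are made up for illustration.
batches = [(0.50, 32), (0.70, 32), (1.20, 8)]

total_loss = sum(mean_loss * n for mean_loss, n in batches)
total_samples = sum(n for _, n in batches)

per_sample_avg = total_loss / total_samples                           # what process() reports
naive_avg = sum(mean_loss for mean_loss, _ in batches) / len(batches)  # over-weights the small batch

print(round(per_sample_avg, 4), round(naive_avg, 4))
```

The weighted version equals the mean loss over all 72 samples, whereas the naive version lets the 8-sample batch count as much as a full one.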


# Task 4 - Training and Hyperparameter Optimization
In the following, we provide 3 configurations for the above created BiLSTM. Try to understand how they differ from each other.

In [None]:
configs = {
    "config1": {
        "vocab_size": VOCAB_SIZE,
        "embedding_dim": 10,
        "hidden_size": 10,
        "rnn_size": 10,
        "dropout": 0.5,
    },
    "config2": {
        "vocab_size": VOCAB_SIZE,
        "embedding_dim": 64,
        "hidden_size": 32,
        "rnn_size": 256,
        "dropout": 0.5,
    },
    "config3": {
        "vocab_size": VOCAB_SIZE,
        "embedding_dim": 300,
        "hidden_size": 256,
        "rnn_size": 256,
        "dropout": 0.5,
    },
}

* Choose the correct criterion to train and evaluate your created model

In [None]:
criterion = nn.BCEWithLogitsLoss()
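
`BCEWithLogitsLoss` fits here because the model emits a single raw logit per review. The following pure-Python sketch computes the per-sample loss in a numerically stable form and checks it against the textbook formula. It illustrates the mathematics only and is not PyTorch's implementation; the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z, y):
    """Stable form of binary cross-entropy on a logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_naive(z, y):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


for z, y in [(-2.0, 0), (-2.0, 1), (0.0, 1), (1.5, 0), (3.0, 1)]:
    print(z, y, round(bce_with_logits(z, y), 4), round(bce_naive(z, y), 4))
```

A logit of 0 corresponds to a probability of 0.5 and a loss of ln 2 for either label, and thresholding the sigmoid output at 0.5 is equivalent to thresholding the logit at 0.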

### Hyperparameter Search and Training Loop

This section implements a full training and validation loop with hyperparameter search across multiple configurations. The goal is to find the model configuration that best generalizes by maximizing the validation F1 score while avoiding overfitting.

---

#### Loop Structure

For each configuration:
1. **Model Initialization**
   - Instantiate a new model with the current hyperparameters.
   - Move the model to the `device`.

2. **Optimizer Setup**
   - Use the Adam optimizer with the globally defined `LEARNING_RATE`.
   - Re-initialize the optimizer with the current model's parameters.

3. **Training Across Epochs**
   - For each epoch (up to `NUM_EPOCHS`):
     - Switch model to training mode and call `process()` on the training set with backpropagation.
     - Switch model to evaluation mode and call `process()` on the validation set without gradient computation.
     - Save both training and validation metrics using `save_metrics(config_id, epoch, ...)`.
     - Optionally print the metrics for live tracking.

4. **Generalization Checkpointing**
   - Compute the **regularized F1 score**:
     ```python
     regularized_f1_score = regularized_f1(f1_train, f1_valid)
     ```
   - If this score is greater than the best **overall** regularized F1 so far:
     - Save the current model’s state dict to `best_model.pt`.
     - Track the corresponding configuration ID for later testing.

5. **Early Stopping (per configuration)**
   - Track the best validation F1 seen so far in the current configuration, counting only epochs whose regularized F1 is nonzero.
   - If three consecutive epochs do not surpass this F1 score, stop training early for this configuration.

---

#### Final Testing Step
Once all configurations have been evaluated:
- Load the best saved model from `best_model.pt`.
- Evaluate on the **test set**.
- Print or log the final test metrics.

---

#### Utility Functions
- `process(model, loader, criterion, optim=None)`  
  Handles forward pass and training/evaluation logic per epoch.

- `regularized_f1(train_f1, dev_f1)`  
  Returns the validation F1, or 0 when the training F1 exceeds it by the threshold or more.

- `save_metrics(config, epoch, train_loss, train_acc, train_f1, val_loss, val_acc, val_f1)`  
  Logs epoch results to `.csv`. All numeric fields must be passed as numbers (not strings).

---

This setup ensures that the selected model is the one that not only fits the training data but also generalizes well, preventing overfitting through regularized performance tracking and early stopping.


In [None]:
path = "./"
logging_file = "results.csv"


def train_and_evaluate(
    configurations,
    model_class,
    criterion,
    device,
    train_loader,
    val_loader,
    num_epochs,
    learning_rate,
):
    best_overall_f1 = 0
    best_model_params = None
    best_config = None

    # Iterate through each configuration
    for config_id, config in enumerate(configurations, start=1):
        print(f"\nRunning configuration {config_id}...")

        # Initialize model and optimizer for each configuration
        model = model_class(**config)  # Initialize model with config parameters
        model.to(device)  # Move model to device (CPU or GPU)
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

        highest_f1_in_config = 0
        epochs_below_f1 = 0  # Counter for early stopping

        for epoch in range(1, num_epochs + 1):
            # Training phase
            print(f"\nTraining epoch {epoch} for config {config_id}...")

            try:
                train_loss, train_acc, train_f1 = process(
                    model, train_loader, criterion, optim=optimizer
                )  # Use 'optim' for training
            except Exception as e:
                print("Error during training:", e)
                return

            with torch.no_grad():
                try:
                    val_loss, val_acc, val_f1 = process(
                        model, val_loader, criterion, optim=None
                    )  # Set 'optim=None' for evaluation
                except Exception as e:
                    print("Error during evaluation:", e)
                    return

            # Calculate regularized F1 score
            reg_f1 = regularized_f1(train_f1, val_f1)

            # Save metrics for this epoch
            save_metrics(
                config_id,
                epoch,
                train_loss,
                train_acc,
                train_f1,
                val_loss,
                val_acc,
                val_f1,
                path=path,
                fname=logging_file,
            )

            # Display progress with more details
            print(
                f"Epoch {epoch} - Config {config_id}: Train Loss = {train_loss:.4f}, Train Acc = {train_acc:.4f}, "
                f"Train F1 = {train_f1:.4f}, Val Loss = {val_loss:.4f}, Val Acc = {val_acc:.4f}, Val F1 = {val_f1:.4f}, Reg F1 = {reg_f1:.4f}"
            )

            # Accept the epoch only if the train/validation F1 gap is below the threshold (reg_f1 > 0)
            # and the validation F1 improves on the best value seen in this configuration
            if reg_f1 > 0 and val_f1 > highest_f1_in_config:
                highest_f1_in_config = val_f1
                epochs_below_f1 = 0  # Reset early stopping counter

                # Check if it's the best model overall
                if val_f1 > best_overall_f1:
                    best_overall_f1 = val_f1
                    best_model_params = {
                        k: v.cpu() for k, v in model.state_dict().items()
                    }
                    best_config = config_id
                    torch.save(best_model_params, "best_model.pt")
            else:
                epochs_below_f1 += 1  # Increment if performance hasn't improved

            # Early stopping if validation F1 doesn't improve for 3 consecutive epochs
            if epochs_below_f1 >= 3:
                print(
                    f"Early stopping in configuration {config_id} at epoch {epoch} due to no improvement."
                )
                break  # Exit the epoch loop early for this configuration

    print(f"\nBest overall F1: {best_overall_f1:.4f} from configuration {best_config}")
    print("Training complete. Best model parameters saved to 'best_model.pt'")


# Train and Evaluation
train_and_evaluate(
    configurations=configs.values(),
    model_class=BiLSTM,
    criterion=nn.BCEWithLogitsLoss(),
    device=device,
    train_loader=train_loader,
    val_loader=val_loader,
    num_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
)
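
The stopping rule used in the loop can be isolated in a few lines. `epochs_run` is a hypothetical helper written for this sketch, and the F1 values are made up. It mirrors the loop's logic: an epoch counts as an improvement only when the regularized F1 is nonzero and the validation F1 beats the best value so far.

```python
def epochs_run(val_f1_per_epoch, reg_ok_per_epoch, patience=3):
    """Return (last epoch run, best accepted validation F1) under the stopping rule."""
    best_f1, stale = 0.0, 0
    history = zip(val_f1_per_epoch, reg_ok_per_epoch)
    for epoch, (val_f1, reg_ok) in enumerate(history, start=1):
        if reg_ok and val_f1 > best_f1:
            best_f1, stale = val_f1, 0  # improvement: reset the counter
        else:
            stale += 1                  # no accepted improvement this epoch
        if stale >= patience:
            return epoch, best_f1
    return len(val_f1_per_epoch), best_f1


val_f1 = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.75]  # made-up scores

# Gap always below the threshold: stops after three epochs without a new best.
print(epochs_run(val_f1, [True] * 7))   # (6, 0.72)

# Gap never below the threshold: no epoch is accepted, so training stops at epoch 3.
print(epochs_run(val_f1, [False] * 7))  # (3, 0.0)
```

The second call shows a consequence of coupling both conditions: a configuration whose training F1 is always ahead by more than the threshold is abandoned after three epochs, even if its validation F1 keeps rising.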

### Results Visualization and Analysis

After completing the training for all configurations, we visualize the training dynamics by plotting the metrics stored in `results.csv`.

In [None]:
df = pd.read_csv("results.csv")

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(9, 11))


colors = {
    "train_loss": "blue",
    "val_loss": "blue",
    "train_acc": "orange",
    "val_acc": "orange",
    "train_f1": "green",
    "val_f1": "green",
}

line_styles = {"train": "-", "val": "--"}

# loop through each configuration
for i, config_id in enumerate([1, 2, 3]):
    config_df = df[df["config"] == config_id]
    # plot Loss Progression (left plot)
    ax[i, 0].plot(
        config_df["epoch"],
        config_df["train_loss"],
        color=colors["train_loss"],
        linestyle=line_styles["train"],
        label="Train Loss",
    )
    ax[i, 0].plot(
        config_df["epoch"],
        config_df["val_loss"],
        color=colors["val_loss"],
        linestyle=line_styles["val"],
        label="Val Loss",
    )
    ax[i, 0].set_title(f"Configuration {config_id} - Loss")
    ax[i, 0].set_xlabel("Epoch")
    ax[i, 0].set_ylabel("Loss")
    ax[i, 0].legend()

    # plot Accuracy and F1 Progression (right plot)
    ax[i, 1].plot(
        config_df["epoch"],
        config_df["train_acc"],
        color=colors["train_acc"],
        linestyle=line_styles["train"],
        label="Train Accuracy",
    )
    ax[i, 1].plot(
        config_df["epoch"],
        config_df["val_acc"],
        color=colors["val_acc"],
        linestyle=line_styles["val"],
        label="Val Accuracy",
    )
    ax[i, 1].plot(
        config_df["epoch"],
        config_df["train_f1"],
        color=colors["train_f1"],
        linestyle=line_styles["train"],
        label="Train F1",
    )
    ax[i, 1].plot(
        config_df["epoch"],
        config_df["val_f1"],
        color=colors["val_f1"],
        linestyle=line_styles["val"],
        label="Val F1",
    )
    ax[i, 1].set_title(f"Configuration {config_id} - Accuracy and F1")
    ax[i, 1].set_xlabel("Epoch")
    ax[i, 1].set_ylabel("Score")
    ax[i, 1].legend()

plt.tight_layout()
plt.show()

### Final Evaluation on the Test Set

After identifying the best-performing configuration during validation, we now instantiate the corresponding model, load its saved parameters, and evaluate its performance on the test set.

In [None]:
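# evaluate on test set
# Configuration 3 was the best one in this run; adjust the key if another configuration wins
best_config = configs["config3"]

# Reinitialize the best model with Configuration 3 parameters
best_model = BiLSTM(**best_config)
best_model.load_state_dict(
    torch.load("best_model.pt", map_location=device)
)  # Load saved parameters onto the current device
best_model.to(device)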

# Set model to evaluation mode and disable gradient calculation
best_model.eval()
with torch.no_grad():
    test_loss, test_acc, test_f1 = process(
        best_model, test_loader, nn.BCEWithLogitsLoss()
    )

# Print test results
print(
    f"\nTest Results: Loss = {test_loss:.4f}, Accuracy = {test_acc:.4f}, F1 Score = {test_f1:.4f}"
)

We evaluated the model on the test set using the best hyperparameter configuration, which was found to be the third:


`"config3": {"vocab_size": VOCAB_SIZE, "embedding_dim": 300, "hidden_size": 256, "rnn_size": 256, "dropout": 0.5}`


On the test set, the model achieved a loss of 0.8169, an accuracy of 0.8137, and an F1 score of 0.8069.

These scores indicate a reasonable ability to generalize to unseen data. The accuracy is good, and the F1 score suggests a fair balance between precision and recall. The loss, however, is high relative to the accuracy (a classifier that always predicts a probability of 0.5 would score ln 2 ≈ 0.693), which suggests that some of the wrong predictions are made with high confidence.

Overall, the performance leaves room for improvement.

The training and validation curves of the third (best) configuration support some further observations.
In the left plot (loss per epoch), from epoch 5 onward the training loss keeps decreasing while the validation loss stays roughly flat and then rises slightly, which points to overfitting. This suggests adopting stronger regularization.

The same pattern appears in the right plot, which shows the training and validation accuracy and F1 score.
From the fifth epoch, the training accuracy continues to increase while the validation accuracy slowly decreases.

This suggests stopping training around these epochs to avoid overfitting, for example by lowering `NUM_EPOCHS` or tightening the early-stopping criterion.

Possible improvements/considerations:

- Hyperparameter tuning: one could explore new hyperparameter configurations with methods such as grid search, random search, or Bayesian optimization. The computational cost of such searches must be taken into account, since for a model of this size it can become very high.


- Regularization: adjusting dropout or applying L2 regularization could help the model generalize better. In particular, we could try increasing the dropout rate to introduce more regularization.


Looking at the setting of the hyperparameters in the third cell:

VOCAB_SIZE = 20_000,
BATCH_SIZE = 32,
NUM_EPOCHS = 15,
MAX_LEN = 256,
LEARNING_RATE = 1e-4

To reduce overfitting, we could try a smaller batch size: smaller batches introduce more noise into the gradient estimates, which may improve generalization. The downside is that training may take longer.

To address the same problem, we could also reduce `MAX_LEN` to 128 or 100. Because sequences are truncated, this keeps only the first tokens of each review. Shorter inputs give the model less to memorize, but any sentiment cues later in the review are lost, so the effect on validation scores would need to be checked empirically.

To conclude, the model achieved good results but leaves room for improvement, particularly in addressing the slight overfitting observed. Based on the considerations discussed, applying further regularization techniques and tuning specific hyperparameters, such as batch size, dropout, and learning rate, could enhance the model's generalization. These adjustments may help the model maintain or even improve its performance on unseen data while avoiding overfitting.







