# 50.040 Natural Language Processing (Fall 2024) Final Project (100 Points)

**DUE DATE: 13 December 2024**

Final project will be graded by Chen Huang


# Group Information (Fill before you start)

**Group Name: ChatGPT Course**

**Name(STUDNET ID) (2-3 person):**

**Beckham Wee Yu Zheng(1006010)**

**Deshpande Sunny Nitin(1006336)**

**Patrick Mo(1006084)**

**Please also rename the final submitted pdf as ``finalproject_[GROUP_NAME].pdf``**

**-1 points if info not filled or file name not adjusted before submission, -100 points if you copy other's answer. We encourage discussion, but please do not copy without thinking.**

# Design Challenge (30 points)
Now it's time to come up with your own model! You are free to decide what model you want to choose or what architecture you think is better for this task. You are also free to set all the hyperparameter for training (do consider the overfitting issue and the computing cost). From here we will see how important choosing/designing a better model architecture could be when building an NLP application in practice (however, there is no requirement for you to include such comparisons in your report).

You are allowed to use external packages for this challenge, but we require that you fully understand the methods/techniques that you used, and you need to clearly explain such details in the submitted report. We will evaluate your system's performance on the held-out test set, which will only be released 48 hours before the deadline. Use your new system to generate the outputs. The system that achieves the highest F1 score will be announced as the winner for the challenge. We may perform further analysis/evaluations in case there is no clear winner based on the released test set.

Let's summarize this challenge:

(i) **[Model]** You are required to develop your own model for sentiment analysis, with no restrictions on the model architecture. You may choose to follow RNN/CNN structures or experiment with Transformer-based models. In your submitted report, you must provide a detailed explanation of your model along with the accompanying code. Additionally, you are required to submit your model so that we can reproduce your results.

*_(10 points)_*

(ii) **[Evaluation]** For a fair comparison, you must train your new model on the same dataset provided to you. After training, evaluate your model on the test set. You are required to report the **precision**, **recall**, **F1 scores**, and **accuracy** of your new model. Save the predicted outcome for the test set and include it in the submission.

_Hint:_ You will be competing with other groups on the same test set. Groups with higher performance will receive more points. For instance, if your group ranks 1st among all groups, you will receive 15 points for this section.

*_(15 points)_*

(iii) **[Report]** You are required to submit a report for your model. The report must include, at a minimum, the following sections: Model Description, Training Settings (e.g., dataset, hyperparameters), Performance, Code to run the model, and a breakdown of how the work was divided among team members. You are encouraged to include any additional details you deem important. Instructions on how to run the code can either be included in a separate README file or integrated into the report, as long as they are clearly presented.

Please provide a thorough explanation of the model or method you used for designing the new system, along with clear documentation on how to execute the code. We will review your code, and may request an interview if we have any questions about it.

_Note:_ Reports, code, and README files that are of low quality or poorly written will be penalized, so please ensure they are well-organized and clearly formatted. If we are unable to reproduce your model or run your code, you will not receive any points for this challenge.

*_(5 points)_*

# Start your model in a different .py file with a README explaination.

In [2]:
!python -m ipykernel install --user --name=nlp --display-name "Python (nlp)"

Installed kernelspec nlp in /Users/beckhamwee/Library/Jupyter/kernels/nlp


In [56]:
print('sa oot sopo')
import os
import torch
from torch import nn
from d2l import torch as d2l # You can skip this if you have trouble with this package, all d2l-related codes can be replaced by torch functions.
import transformers
from transformers import DistilBertTokenizer, DistilBertModel
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt


sa oot sopo


In [5]:
import pandas as pd
import zipfile

In [6]:
data_dir = 'data/aclImdb'

# Data Preprocessing
For transformers

### `read_imdb` From Before

In [11]:
#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    ### YOUR CODE HERE
    data, labels = [], []

    for label in ('pos', 'neg'): # label is either 'pos' or 'neg'
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test', label) # folder_name is either 'train/pos' or 'train/neg' or 'test/pos' or 'test/neg'

        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)

    ### END OF YOUR CODE
    return data, labels

### Hyperparameters for Data

In [None]:
""" HYPERPARAMS """
num_steps = 500  # sequence length
batch_size = 16 

### Modified Load Data Function
Modified `load_data_imdb` function to preprocess data for DistilBERT.
- uses `read_imdb` to read training and test data
- Loads pretrained Tokenzier for Bert which converts raw text into tokenized sequences. Then encodes with input IDs and attention masks.
- Truncates and pads them
- Also converts that + labels into PyTorch Tensors
- Returns input_ids, attention mask and testing labels
- Creates PyTorch DataLoaders too (refer to docstrings i wrote for all outputs)

In [18]:
def load_data_imdb_hybrid(batch_size, num_steps=500, device=None):
    """
    Prepares IMDb data for a hybrid Transformer + BiLSTM model.
    Tokenizes, truncates/pads sequences, and creates DataLoaders.

    Args:
        batch_size (int): Batch size for DataLoader.
        num_steps (int): Maximum length for truncation/padding.
        device (torch.device): Device to move tensors to ('cpu', 'cuda', or 'mps').

    Returns:
        train_loader (tuple[tensors]): (input_ids, attension_mask, labels)
        test_loader (tuple[tensors]): (input_ids, attension_mask, labels)
        tokenizer (DistilBertTokenzier): Tokenizer 
    """
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load IMDb dataset
    train_data = read_imdb(data_dir, is_train=True)
    test_data = read_imdb(data_dir, is_train=False)n

    # Load pre-trained tokenizer
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    
    # Tokenize and encode sequences
    def preprocess(data):
        return tokenizer(
            data,
            truncation=True,
            padding="max_length",
            max_length=num_steps,
            return_tensors="pt"
        )
    
    train_tokens = preprocess(train_data[0])
    test_tokens = preprocess(test_data[0])

    # Move tokenized data to device
    train_tokens = {key: val.to(device) for key, val in train_tokens.items()}
    test_tokens = {key: val.to(device) for key, val in test_tokens.items()}
    
    # Convert labels to tensors and move to device
    train_labels = torch.tensor(train_data[1]).to(device)
    test_labels = torch.tensor(test_data[1]).to(device)

    # Create TensorDatasets
    train_dataset = TensorDataset(
        train_tokens["input_ids"], train_tokens["attention_mask"], train_labels
    )
    test_dataset = TensorDataset(
        test_tokens["input_ids"], test_tokens["attention_mask"], test_labels
    )
    
    # Create DataLoaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    return train_loader, test_loader, tokenizer


In [21]:
# load data
# TODO: Optimise batch size (?)
train_loader, test_loader, tokenizer = load_data_imdb_hybrid(16, num_steps)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

# Building the Transformer-BiLSTM Model for Sentiment Analysis

## Choice of Model: DistilBERT + BiLSTM
- DistilBERT to capture long-range dependencies of works using the self-attention mechanism (can see context from all other tokens in its input sequence)
- BiLSTMS to capture temporal sequenctial local features

## Building the Model
DistilBERT to capture contextualised word emebeddings from input text. Handles nunances in meaning, context and r/s btw words
- semantic and syntactic r/s between words using pretrained knowledge
- takes input_ids and attentions_masks, outputs last hidden state of shape `(batch_size, seq_length, hidden_size)`

BiLSTM enhances the embeddings above by modelling sequential relationshpis (order of words, temporal patterns etc)
- adds sequential modelling
- takes last state from transformer and outputs contextualised embeddings of sequential relationships

Fully connected layer (last) converts the high d output of LSTM into class prbabilities to make classification predictions
- passed through dropout layer to regularize and prevent verfitting
- applies linear transformation to map features to number of classes
- outputs logits

In [None]:
batch_size = 32
train_iter, test_iter, vocab = load_data_imdb_hybrid(batch_size)

In [40]:
class TransformerBiLSTM(nn.Module):
    def __init__(self, transformer_model="distilbert-base-uncased", hidden_size=128, num_classes=2, dropout=0.5):
        super(TransformerBiLSTM, self).__init__()
        self.transformer = DistilBertModel.from_pretrained(transformer_model) # Pre-trained DistilBERT
        print("Model loaded successfully!")
        
        # BiLSTM layer
        self.bilstm = nn.LSTM(
            input_size=self.transformer.config.hidden_size,  # Size of Transformer embeddings
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=True,
            batch_first=True
        )
        
        # Fully connected output layer
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # BiLSTM is bidirectional
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, attention_mask):
        # Pass through Transformer
        transformer_output = self.transformer(input_ids, attention_mask=attention_mask)
        x = transformer_output.last_hidden_state  # Shape: (batch_size, seq_len, hidden_size)
        
        # Pass through BiLSTM
        lstm_out, _ = self.bilstm(x)  # Shape: (batch_size, seq_len, hidden_size*2)
        lstm_out = lstm_out[:, -1, :]  # Get the last hidden state for classification
        
        # Fully connected layer
        x = self.dropout(lstm_out)
        logits = self.fc(x)
        return logits

Let’s construct a bidirectional RNN with two hidden layers to represent single text for sentiment analysis.

In [59]:
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss
from transformers import get_scheduler

def train_model(
    model, train_loader, test_loader, num_epochs=3, lr=2e-5, device=None, save_model=True, save_dir="models"
):
    """
    Train the model, plot metrics, and optionally save the model.

    Args:
        model: The model to train.
        train_loader: DataLoader for training data.
        test_loader: DataLoader for testing data.
        num_epochs: Number of epochs.
        lr: Learning rate.
        device: Device for training ('cuda', 'mps', or 'cpu').
        save_model: Whether to save the model after each epoch.
        save_dir: Directory to save the model checkpoints.

    Returns:
        A dictionary containing metrics: train_loss, train_accuracy, test_accuracy.
    """
    # Move model to device
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # 'cuda' for Nvidia GPUs, 'mps' for Apple Silicon
    model = model.to(device)

    # Optimizer and loss function
    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = CrossEntropyLoss()

    # Scheduler for learning rate decay
    num_training_steps = len(train_loader) * num_epochs
    scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

    # Metrics tracking
    train_losses = []
    train_accuracies = []
    test_accuracies = []

    # Create directory to save models
    if save_model:
        Path(save_dir).mkdir(parents=True, exist_ok=True)

    # Best test accuracy for saving the best model
    best_test_accuracy = 0.0

    for epoch in range(num_epochs):
        print(f"Starting epoch {epoch + 1}/{num_epochs}")
        model.train()
        total_loss, total_correct = 0, 0
        batch_count = 1

        for batch in train_loader:
            if batch_count % 100 == 0 or batch_count == 1:
                print(f"Processing batch {batch_count}/{len(train_loader)}")

            input_ids, attention_mask, labels = batch
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            # Track metrics
            total_loss += loss.item()
            total_correct += (outputs.argmax(dim=1) == labels).sum().item()

            # Update batch count
            batch_count += 1

        # Calculate average loss and accuracy for training
        avg_loss = total_loss / len(train_loader)
        train_accuracy = total_correct / len(train_loader.dataset)
        train_losses.append(avg_loss)
        train_accuracies.append(train_accuracy)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_loss:.4f}, Train Acc: {train_accuracy:.4f}")

        # Evaluate on the test set and track test accuracy
        test_accuracy = evaluate_model_with_metrics(model, test_loader, device)
        test_accuracies.append(test_accuracy)

        # Save model after each epoch or when test accuracy improves
        if save_model:
            model_path = os.path.join(save_dir, f"model_epoch_{epoch+1}.pt")
            torch.save(model.state_dict(), model_path)
            print(f"Model saved to {model_path}")

            # Save the best model based on test accuracy
            if test_accuracy > best_test_accuracy:
                best_test_accuracy = test_accuracy
                best_model_path = os.path.join(save_dir, "best_model.pt")
                torch.save(model.state_dict(), best_model_path)
                print(f"Best model updated and saved to {best_model_path}")

    # Plot the results
    plot_training_metrics(train_losses, train_accuracies, test_accuracies, num_epochs)

    return {"train_losses": train_losses, "train_accuracies": train_accuracies, "test_accuracies": test_accuracies}


def evaluate_model_with_metrics(model, data_loader, device):
    """
    Evaluate the model and calculate accuracy on the test set.

    Args:
        model: The trained model.
        data_loader: DataLoader for the test set.
        device: Device for evaluation.

    Returns:
        Test accuracy.
    """
    model.eval()
    total_correct = 0
    with torch.no_grad():
        for batch in data_loader:
            input_ids, attention_mask, labels = batch
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
            outputs = model(input_ids, attention_mask)
            total_correct += (outputs.argmax(dim=1) == labels).sum().item()

    accuracy = total_correct / len(data_loader.dataset)
    print(f"Test Accuracy: {accuracy:.4f}")
    return accuracy


def plot_training_metrics(train_losses, train_accuracies, test_accuracies, num_epochs):
    """
    Plot training loss, training accuracy, and test accuracy per epoch.

    Args:
        train_losses: List of training losses per epoch.
        train_accuracies: List of training accuracies per epoch.
        test_accuracies: List of test accuracies per epoch.
        num_epochs: Number of epochs.
    """
    epochs = range(1, num_epochs + 1)

    plt.figure(figsize=(12, 6))

    # Plot training loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs, train_losses, label="Train Loss", marker="o")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training Loss")
    plt.legend()

    # Plot training and test accuracy
    plt.subplot(1, 2, 2)
    plt.plot(epochs, train_accuracies, label="Train Accuracy", marker="o")
    plt.plot(epochs, test_accuracies, label="Test Accuracy", marker="o", linestyle="--")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.title("Train vs. Test Accuracy")
    plt.legend()

    plt.tight_layout()
    plt.show()

In [44]:
# Initialize the model
model = TransformerBiLSTM()

# Train and evaluate
train_model(model, train_loader, test_loader, num_epochs=3)


Model loaded successfully!
Starting epoch 1/3
Processing batch 1/1563
Processing batch 100/1563
Processing batch 200/1563
Processing batch 300/1563
Processing batch 400/1563
Processing batch 500/1563
Processing batch 600/1563
Processing batch 700/1563
Processing batch 800/1563
Processing batch 900/1563
Processing batch 1000/1563
Processing batch 1100/1563
Processing batch 1200/1563
Processing batch 1300/1563
Processing batch 1400/1563
Processing batch 1500/1563
Epoch 1/3, Loss: 0.2775, Accuracy: 0.8822
Test Accuracy: 0.9199
Starting epoch 2/3
Processing batch 1/1563
Processing batch 100/1563
Processing batch 200/1563
Processing batch 300/1563
Processing batch 400/1563
Processing batch 500/1563
Processing batch 600/1563
Processing batch 700/1563
Processing batch 800/1563
Processing batch 900/1563
Processing batch 1000/1563
Processing batch 1100/1563
Processing batch 1200/1563
Processing batch 1300/1563
Processing batch 1400/1563
Processing batch 1500/1563
Epoch 2/3, Loss: 0.1443, Accura

## Hyperparameter Tuning
Learning Rate `lr`, Sequence Length `num_steps`, Dropout. Due to a lack of time, will not be tuning number of epochs.

Will start by seeing which `lr` has the best results for 1 epoch.

In [47]:
lr_values = [1e-5, 3e-5, 5e-5]

for lr in lr_values:
    print(f'Model results for learning rate {lr}') 
    train_model(model, train_loader, test_loader, num_epochs=1, lr=lr)

Model results for learning rate 1e-05
Starting epoch 1/1
Processing batch 1/1563
Processing batch 100/1563
Processing batch 200/1563
Processing batch 300/1563
Processing batch 400/1563
Processing batch 500/1563
Processing batch 600/1563
Processing batch 700/1563
Processing batch 800/1563
Processing batch 900/1563
Processing batch 1000/1563
Processing batch 1100/1563
Processing batch 1200/1563
Processing batch 1300/1563
Processing batch 1400/1563
Processing batch 1500/1563
Epoch 1/1, Loss: 0.0609, Accuracy: 0.9824
Test Accuracy: 0.9288
Model results for learning rate 3e-05
Starting epoch 1/1
Processing batch 1/1563
Processing batch 100/1563
Processing batch 200/1563
Processing batch 300/1563
Processing batch 400/1563
Processing batch 500/1563
Processing batch 600/1563
Processing batch 700/1563
Processing batch 800/1563
Processing batch 900/1563
Processing batch 1000/1563
Processing batch 1100/1563
Processing batch 1200/1563
Processing batch 1300/1563
Processing batch 1400/1563
Processin

Judging from this, `lr = 5e-05` seems to provide the best results on our test set (cross-validation set). Moving on, we fix the learning rate and now try different dropout values. To do so, we create a second training function that includes dropout modifications, as well as graphs.

In [54]:
import os
from pathlib import Path

def train_with_dropout(model_class, train_loader, test_loader, num_epochs, lr, dropout, device, save_dir="models"):
    """
    Train the model with a specific dropout value and track metrics, saving the model during training.

    Args:
        model_class: The model class to instantiate (e.g., TransformerBiLSTM).
        train_loader: DataLoader for training data.
        test_loader: DataLoader for testing data.
        num_epochs: Number of epochs to train.
        lr: Learning rate.
        dropout: Dropout rate to apply in the model.
        device: Device for training (e.g., 'cuda', 'mps', or 'cpu').
        save_dir: Directory to save the model checkpoints.

    Returns:
        metrics: Dictionary containing train losses, train accuracies, and test accuracies.
    """

    
    # Instantiate the model with the specified dropout
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # cuda for windows, mps for apple silicon
    model = model_class(dropout=dropout).to(device)

    # Optimizer and loss function
    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = CrossEntropyLoss()
    scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=len(train_loader) * num_epochs)

    # Prepare directory to save models
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    best_test_accuracy = 0  # Track the best test accuracy for saving the best model

    train_losses, train_accuracies, test_accuracies = [], [], []

    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}, Dropout: {dropout}")
        model.train()
        total_loss, total_correct = 0, 0

        # track bactch
        batch_count = 1
        
        for batch in train_loader:
            if batch_count % 100 == 0 or batch_count == 1:
                print(f"Processing batch {batch_count}/{len(train_loader)}")
            input_ids, attention_mask, labels = batch
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            # Track metrics
            total_loss += loss.item()
            total_correct += (outputs.argmax(dim=1) == labels).sum().item()
            batch_count += 1

        # Calculate average loss and accuracy for training
        avg_loss = total_loss / len(train_loader)
        train_accuracy = total_correct / len(train_loader.dataset)
        train_losses.append(avg_loss)
        train_accuracies.append(train_accuracy)

        # Evaluate on the test set
        model.eval()
        total_correct = 0
        with torch.no_grad():
            for batch in test_loader:
                input_ids, attention_mask, labels = batch
                input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
                outputs = model(input_ids, attention_mask)
                total_correct += (outputs.argmax(dim=1) == labels).sum().item()
        test_accuracy = total_correct / len(test_loader.dataset)
        test_accuracies.append(test_accuracy)

        print(f"Epoch {epoch + 1}/{num_epochs} -> Train Loss: {avg_loss:.4f}, Train Acc: {train_accuracy:.4f}, Test Acc: {test_accuracy:.4f}")

        # Save model checkpoint if test accuracy improves
        if test_accuracy > best_test_accuracy:
            best_test_accuracy = test_accuracy
            model_path = os.path.join(save_dir, f"model_dropout_{dropout}_epoch_{epoch + 1}.pt")
            torch.save(model.state_dict(), model_path)
            print(f"Saved model to {model_path}")

    return {
        "train_losses": train_losses,
        "train_accuracies": train_accuracies,
        "test_accuracies": test_accuracies,
    }

def plot_loss_across_dropouts(metrics_dict):
    """
    Plot training loss across different dropout values.

    Args:
        metrics_dict: Dictionary where keys are dropout values, and values are metrics from training.
    """
    fig, ax1 = plt.subplots(figsize=(10, 6))

    # Extract training loss and test accuracy for each dropout
    dropouts = []
    losses = []
    accuracies = []
    for dropout, metrics in metrics_dict.items():
        dropouts.append(dropout)
        # Since we're running for 1 epoch only, take the first (and only) training loss and accuracy
        losses.append(metrics["train_losses"][0])
        accuracies.append(metrics["test_accuracies"][0])

    # Plot training loss
    ax1.plot(dropouts, losses, marker='o', color='blue', label="Training Loss")
    ax1.set_xlabel("Dropout Rate")
    ax1.set_ylabel("Loss", color='blue')
    ax1.tick_params(axis='y', labelcolor='blue')
    ax1.set_title("Training Loss and Test Accuracy vs. Dropout Rate")
    
    # Plot test accuracy on a secondary y-axis
    ax2 = ax1.twinx()
    ax2.plot(dropouts, accuracies, marker='o', color='green', label="Test Accuracy")
    ax2.set_ylabel("Accuracy (%)", color='green')
    ax2.tick_params(axis='y', labelcolor='green')

    # Add legends
    ax1.legend(loc="upper left")
    ax2.legend(loc="upper right")

    plt.grid()
    plt.tight_layout()
    plt.show()

In [55]:
lr = 5e-5
num_epochs = 1  # Fixed at 1 epoch for this comparison
dropout_values = [0.3, 0.5, 0.7]
metrics_dict = {}

for dropout in dropout_values:
    print(f"Training with lr={lr}, dropout={dropout}")
    metrics = train_with_dropout(
        model_class=TransformerBiLSTM,
        train_loader=train_loader,
        test_loader=test_loader,
        num_epochs=num_epochs,
        lr=lr,
        dropout=dropout,
        device=None,
        save_dir="models"
    )
    metrics_dict[dropout] = metrics

# plot loss and accuracy
plot_loss_across_dropouts(metrics_dict)


Training with lr=5e-05, dropout=0.3
Model loaded successfully!
Epoch 1/1, Dropout: 0.3
Processing batch 1/1563
Processing batch 100/1563
Processing batch 200/1563
Processing batch 300/1563
Processing batch 400/1563
Processing batch 500/1563
Processing batch 600/1563
Processing batch 700/1563
Processing batch 800/1563
Processing batch 900/1563
Processing batch 1000/1563
Processing batch 1100/1563
Processing batch 1200/1563
Processing batch 1300/1563
Processing batch 1400/1563
Processing batch 1500/1563
Epoch 1/1 -> Train Loss: 0.2520, Train Acc: 0.8968, Test Acc: 0.9302
Saved model to models/model_dropout_0.3_epoch_1.pt
Training with lr=5e-05, dropout=0.5
Model loaded successfully!
Epoch 1/1, Dropout: 0.5
Processing batch 1/1563
Processing batch 100/1563
Processing batch 200/1563
Processing batch 300/1563
Processing batch 400/1563
Processing batch 500/1563
Processing batch 600/1563
Processing batch 700/1563
Processing batch 800/1563
Processing batch 900/1563
Processing batch 1000/1563
P

NameError: name 'plt' is not defined

## Training on New Hyperparameters
Overall, it seems like dropout fails to improve our model. Hence, we train the full model on 3 epochs with `lr = 5e-05` as it generalises better.

In [None]:
lr = 5e-05
metrics = train_model(model, train_loader, test_loader, num_epochs=1, lr=lr)

Starting epoch 1/1
Processing batch 1/1563
Processing batch 100/1563


## Predicting Sentiment

In [36]:
def predict_sentiment(model, tokenizer, text, device=None, max_length=500):
    """
    Predict the sentiment of a given text using the trained model.

    Args:
        model: Trained TransformerBiLSTM model.
        tokenizer: Pretrained tokenizer (e.g., DistilBERT tokenizer).
        text: The input text string to classify.
        device: Device for computation ('cuda', 'mps', or 'cpu').
        max_length: Maximum length for tokenization (default: 500).

    Returns:
        Predicted sentiment label (e.g., 'positive' or 'negative').
    """
    # Move model to the correct device
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # 'cuda' for Nvidia GPUs, 'mps' for Apple Silicon
    model = model.to(device)
    model.eval()  # Set model to evaluation mode

    # Tokenize and encode the input text
    encoded_input = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )

    # Move input tensors to the same device as the model
    input_ids = encoded_input["input_ids"].to(device)
    attention_mask = encoded_input["attention_mask"].to(device)

    # Make prediction
    with torch.no_grad():
        outputs = model(input_ids, attention_mask)
        predicted_label = outputs.argmax(dim=1).item()  # Get the index of the highest score

    # Map predicted label to sentiment class (e.g., 0 -> 'negative', 1 -> 'positive')
    sentiment = "positive" if predicted_label == 1 else "negative"
    return sentiment


In [None]:
# Load the trained model
model = TransformerBiLSTM()
model.load_state_dict(torch.load("my_model_checkpoints/best_model.pt"))

# Define the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Input text
text = "this movie is so great"

# Predict sentiment
device = torch.device("cpu")  # Use 'cuda', 'mps', or 'cpu'
sentiment = predict_sentiment(model, tokenizer, text, device=device)
print(f"Predicted Sentiment: {sentiment}")


In [None]:
# Input text
text = "this movie is so bad"

# Predict sentiment
device = torch.device("cpu")  # Use 'cuda', 'mps', or 'cpu'
sentiment = predict_sentiment(model, tokenizer, text, device=device)
print(f"Predicted Sentiment: {sentiment}")


## Metrics and Saving Results

In [54]:
def save_results(all_sequences, all_labels, all_preds, tokenizer, output_dir="results", zip_filename="submission.zip"):
    """
    Save predictions and true labels to a CSV and zip the results.

    Args:
        all_sequences: List of input sequences (token IDs).
        all_labels: List of true labels.
        all_preds: List of predicted labels.
        tokenizer: Tokenizer used to decode token IDs into text.
        output_dir: Directory to save the results.
        zip_filename: Name of the zip file to create.

    Returns:
        csv_path: Path to the saved CSV file.
        zip_path: Path to the saved ZIP file.
    """
    # Decode input sequences back into text using the tokenizer
    decoded_sequences = [
        tokenizer.decode(seq, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for seq in all_sequences
    ]

    # Create results DataFrame
    results_df = pd.DataFrame({
        'sequence': decoded_sequences,
        'true_label': all_labels,
        'predicted_label': all_preds
    })

    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Save to CSV
    csv_path = os.path.join(output_dir, "test_predictions.csv")
    results_df.to_csv(csv_path, index=False)
    print(f"Results saved to {csv_path}")

    # Create a zip file
    zip_path = os.path.join(output_dir, zip_filename)
    with zipfile.ZipFile(zip_path, 'w') as zipf:
        zipf.write(csv_path, arcname="test_predictions.csv")
    print(f"Results zipped to {zip_path}")

    return csv_path, zip_path



def cal_metrics(model, test_loader, tokenizer, output_dir="results", zip_filename="submission.zip"):
    """
    Calculate metrics, save predictions to a CSV, and zip the results.

    Args:
        model: Trained TransformerBiLSTM model.
        test_loader: DataLoader for the test set.
        tokenizer: Tokenizer used for decoding sequences.
        output_dir: Directory to save the results.
        zip_filename: Name of the zip file to create.

    Returns:
        f1: F1-score of the model.
        precision: Precision of the model.
        recall: Recall of the model.
        accuracy: Accuracy of the model.
        zip_path: Path to the saved ZIP file.
    """
    device = next(model.parameters()).device  # Get device (e.g., 'cuda', 'mps', 'cpu')
    model.eval()  # Set model to evaluation mode

    # Initialize lists for predictions, labels, and sequences
    all_preds = []
    all_labels = []
    all_sequences = []

    # Collect predictions and true labels
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = batch
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

            outputs = model(input_ids, attention_mask)
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            all_sequences.extend(input_ids.cpu().numpy())  # Save raw input sequences

    # Calculate metrics
    tp = sum(1 for pred, label in zip(all_preds, all_labels) if pred == label == 1)
    fp = sum(1 for pred, label in zip(all_preds, all_labels) if pred == 1 and label == 0)
    fn = sum(1 for pred, label in zip(all_preds, all_labels) if pred == 0 and label == 1)
    tn = sum(1 for pred, label in zip(all_preds, all_labels) if pred == label == 0)

    accuracy = sum(1 for pred, label in zip(all_preds, all_labels) if pred == label) / len(all_labels)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    print(f"Test Set Metrics:\nAccuracy: {accuracy:.4f}\nPrecision: {precision:.4f}\nRecall: {recall:.4f}\nF1-Score: {f1:.4f}")

    # Save results using the helper function
    csv_path, zip_path = save_results(all_sequences, all_labels, all_preds, tokenizer, output_dir, zip_filename)

    return f1, precision, recall, accuracy, zip_path


In [60]:
# Evaluate the trained model and save results
f1, precision, recall, accuracy, zip_path = cal_metrics(
    model=model, 
    test_loader=test_loader, 
    tokenizer=tokenizer, 
    output_dir="results", 
    zip_filename="submission.zip"
)

print(f"Saved results to {zip_path}")


Test Set Metrics:
Accuracy: 0.8585
Precision: 0.8783
Recall: 0.8323
F1-Score: 0.8547
Results saved to results/test_predictions.csv
Results zipped to results/submission.zip


(0.8546783865932802,
 0.8782711463785244,
 0.83232,
 0.85848,
 'results/submission.zip')