## Part 2. Model Training & Evaluation - RNN

Now with the pretrained word embeddings acquired from Part 1 and the dataset acquired from Part
0, you need to train a deep learning model for topic classification using the training set, conforming
to these requirements:
- Use the pretrained word embeddings from Part 1 as inputs, together with your strategy for
mitigating the influence of OOV words; make the embeddings learnable parameters so that they
are updated during training.
- Design a simple recurrent neural network (RNN), taking the input word embeddings, and
predicting a topic label for each sentence. To do that, you need to consider how to aggregate
the word representations to represent a sentence.
- Use the validation set to gauge the performance of the model for each epoch during training.
You are required to use accuracy as the performance metric during validation and evaluation.
- Use the mini-batch strategy during training. You may choose any preferred optimizer (e.g.,
SGD, Adagrad, Adam, RMSprop). Be careful when you choose your initial learning rate and
mini-batch size. (You should use the validation set to determine the optimal configuration.)
Train the model until the accuracy score on the validation set has not improved for a few
epochs (early stopping).
- Try different regularization techniques to mitigate overfitting.
- Evaluate your trained model on the test dataset, observing the accuracy score.
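
The aggregation requirement above can be illustrated without any deep learning framework. The following is a minimal sketch, assuming the RNN has already produced one hidden vector per time step and that padded positions sit at the end of each sequence. The function name `aggregate` and the toy values are illustrative only; the three modes mirror the `last`, `average` and `max` options used in the search space later in this notebook.

```python
from typing import List


def aggregate(hidden_states: List[List[float]], length: int, mode: str) -> List[float]:
    """Collapse per-token hidden states into a single sentence vector.

    hidden_states: one vector per time step, padded steps included.
    length: number of real (non-padding) tokens; assumed to be >= 1.
    """
    valid = hidden_states[:length]  # ignore padded positions
    if mode == "last":
        return list(valid[-1])  # hidden state of the last real token
    if mode == "average":
        return [sum(col) / length for col in zip(*valid)]  # mean over time
    if mode == "max":
        return [max(col) for col in zip(*valid)]  # element-wise max over time
    raise ValueError(f"unknown mode: {mode}")


# Toy input: three time steps, hidden size 2; the third step is padding.
states = [[1.0, 4.0], [3.0, 2.0], [0.0, 0.0]]
print(aggregate(states, 2, "last"))     # [3.0, 2.0]
print(aggregate(states, 2, "average"))  # [2.0, 3.0]
print(aggregate(states, 2, "max"))      # [3.0, 4.0]
```

Slicing by the true length matters for all three modes: without it, `last` would return a padding state, and `average` and `max` would be skewed by the padded zeros.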

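The stopping rule in the requirements (stop once validation accuracy has not improved for a few epochs) can be sketched as a small patience counter. This is an assumption about how such a rule might look, not the logic inside `utils.train`; the function name `epochs_until_stop` and the accuracy values are made up for illustration.

```python
from typing import Iterable


def epochs_until_stop(val_accuracies: Iterable[float], patience: int = 3) -> int:
    """Return how many epochs run before early stopping triggers."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a new best accuracy
    epochs_run = 0
    for acc in val_accuracies:
        epochs_run += 1
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run


# Made-up validation accuracies, one per epoch (not measured results).
history = [0.19, 0.35, 0.52, 0.51, 0.52, 0.50, 0.49]
print(epochs_until_stop(history, patience=3))  # 6
```

In this sketch a tie with the best score does not reset the counter, so the run stops at epoch 6, three epochs after the best value was first reached at epoch 3.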
In [1]:
import json
import numpy as np
import random
import itertools
from pathlib import Path
from torchtext import data, datasets
from torch.utils.data import TensorDataset, DataLoader
from utils.config import Config
from utils.train import train_rnn_model_with_parameters
from utils.helper import SentenceDataset, collate_fn


In [2]:
TEXT = data.Field(tokenize = 'spacy', tokenizer_language='en_core_web_sm', include_lengths=True)
LABEL = data.LabelField()

train_data, test_data = datasets.TREC.splits(TEXT, LABEL, fine_grained=False)

In [3]:
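# random.seed() returns None, so seed the global RNG first and pass its state explicitly.
random.seed(Config.SEED)
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())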

In [4]:
TEXT.build_vocab(train_data, vectors="glove.6B.300d")
LABEL.build_vocab(train_data)

### Import the embedding matrix and vocab index mapping (train data)

In [5]:
embedding_path = Path("models/embedding_matrix.npy")
index_from_word_path = Path("models/index_from_word.json")

embedding_matrix = np.load(embedding_path)
with index_from_word_path.open() as f:
    index_from_word = json.load(f)

In [6]:
train_dataset = SentenceDataset(train_data.examples, index_from_word, LABEL.vocab)
valid_dataset = SentenceDataset(valid_data.examples, index_from_word, LABEL.vocab)
test_dataset = SentenceDataset(test_data.examples, index_from_word, LABEL.vocab)        
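
`SentenceDataset` comes from `utils.helper` and its code is not shown here. As a rough sketch of the kind of lookup such a helper might perform, the snippet below maps tokens to indices and falls back to an unknown-word index for OOV tokens. The `<unk>` key, the `tokens_to_indices` name and the toy vocabulary are assumptions; the real `index_from_word.json` may use a different OOV strategy.

```python
from typing import Dict, List


def tokens_to_indices(
    tokens: List[str],
    index_from_word: Dict[str, int],
    unk_token: str = "<unk>",  # assumed name of the OOV entry
) -> List[int]:
    """Map tokens to embedding-matrix rows, sending OOV tokens to the <unk> row."""
    unk_index = index_from_word[unk_token]
    return [index_from_word.get(tok, unk_index) for tok in tokens]


# Toy vocabulary for illustration only.
toy_index = {"<pad>": 0, "<unk>": 1, "what": 2, "is": 3}
print(tokens_to_indices(["what", "is", "zyzzyva"], toy_index))  # [2, 3, 1]
```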

### Hyperparameter search

In [7]:
SEARCH_SPACE = {
    "batch_size": [32, 64, 128, 256, 512, 1024, 2048],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "optimizer_name": ["SGD", "Adagrad", "RMSprop", "Adam"],
    "hidden_dim": [256, 128, 64, 32],
    "num_layers": [1, 2, 4],
    "sentence_representation_type": ["last", "average", "max"],
}
all_combinations = list(itertools.product(
    SEARCH_SPACE["batch_size"],
    SEARCH_SPACE["learning_rate"],
    SEARCH_SPACE["optimizer_name"],
    SEARCH_SPACE["hidden_dim"],
    SEARCH_SPACE["num_layers"],
    SEARCH_SPACE["sentence_representation_type"]
))
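
It helps to know how large this grid is before launching it. The sketch below repeats the search space so it runs on its own, counts the combinations, and shows random sampling of a subset as one cheaper alternative. Random search is not what this notebook does; the sample size of 20 and the seed are arbitrary.

```python
import itertools
import math
import random

SEARCH_SPACE = {
    "batch_size": [32, 64, 128, 256, 512, 1024, 2048],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "optimizer_name": ["SGD", "Adagrad", "RMSprop", "Adam"],
    "hidden_dim": [256, 128, 64, 32],
    "num_layers": [1, 2, 4],
    "sentence_representation_type": ["last", "average", "max"],
}

# The grid size is the product of the number of options per hyperparameter.
n_configs = math.prod(len(values) for values in SEARCH_SPACE.values())
print(n_configs)  # 5040

# Alternative (not used in this notebook): train only a random subset of the grid.
grid = list(itertools.product(*SEARCH_SPACE.values()))
rng = random.Random(42)  # arbitrary seed for a reproducible sample
sampled = rng.sample(grid, k=20)
print(len(sampled))  # 20
```

At 5040 runs, even a short training job per configuration adds up quickly, which is why skipping finished runs or sampling the grid is worth considering.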

In [None]:
for batch_size, lr, optimizer_name, hidden_dim, num_layers, sr_type in all_combinations:
    print(f"Training with configuration: batch_size={batch_size}, lr={lr}, optimizer={optimizer_name}, "
          f"hidden_dim={hidden_dim}, num_layers={num_layers}, sentence_repr={sr_type}")

    # train_rnn_model_with_parameters expects dataset objects (not DataLoader instances),
    # so pass the SentenceDataset instances and let the function create DataLoaders internally.
    # Set num_workers=0 to avoid multiprocessing DataLoader worker issues (resize storage error).
    train_rnn_model_with_parameters(
        embedding_matrix=embedding_matrix,
        train_dataset=train_dataset,
        val_dataset=valid_dataset,
        batch_size=batch_size,
        learning_rate=lr,
        optimizer_name=optimizer_name,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        sentence_representation_type=sr_type,
        freeze_embedding=False,
        show_progress=True,
    )

Seed set to 42
/home/linnsheng/Desktop/NTU/S3/Y1/NLP/SC4002/.venv/lib/python3.13/site-packages/lightning/pytorch/utilities/parsing.py:210: Attribute 'rnn_model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['rnn_model'])`.
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=1, sentence_repr=last
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_1-sr_type_last-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=1, sentence_repr=average
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_1-sr_type_average-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=1, sentence_repr=max


[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_1-sr_type_max-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=2, sentence_repr=last
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_2-sr_type_last-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=2, sentence_repr=average
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_2-sr_type_average-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=2, sentence_repr=max


[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_2-sr_type_max-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=4, sentence_repr=last
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_4-sr_type_last-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=4, sentence_repr=average
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_4-sr_type_average-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=256, num_layers=4, sentence_repr=max


[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_256-num_layers_4-sr_type_max-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=1, sentence_repr=last
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_1-sr_type_last-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=1, sentence_repr=average
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_1-sr_type_average-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=1, sentence_repr=max


[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_1-sr_type_max-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=2, sentence_repr=last
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_2-sr_type_last-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=2, sentence_repr=average
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_2-sr_type_average-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=2, sentence_repr=max


GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/linnsheng/Desktop/NTU/S3/Y1/NLP/SC4002/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.

  | Name   | Type               | Params | Mode 
------------------------------------------------------
0 | model  | RNN                | 2.6 M  | train
1 | metric | MulticlassAccuracy | 0      | train
------------------------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.363    Total estimated model params size (MB)
7         Modules in train mode
0         Modules in eval mode


[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_2-sr_type_max-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=4, sentence_repr=last
[Skipping] rnn/test/batch_size_32-lr_1e-05-optimizer_SGD-hidden_dim_128-num_layers_4-sr_type_last-freeze_False-rnn_type_RNN-bidirectional_False
Training with configuration: batch_size=32, lr=1e-05, optimizer=SGD, hidden_dim=128, num_layers=4, sentence_repr=average
Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/home/linnsheng/Desktop/NTU/S3/Y1/NLP/SC4002/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:428: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.


                                                                           

/home/linnsheng/Desktop/NTU/S3/Y1/NLP/SC4002/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:428: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


Epoch 0: 100%|██████████| 137/137 [00:07<00:00, 19.46it/s, v_num=0, train_loss=1.810, train_acc=0.200]

[W1103 02:29:28.484923155 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Epoch 4:   0%|          | 0/137 [00:00<?, ?it/s, v_num=0, train_loss=1.800, train_acc=0.333, val_loss=1.790, val_acc=0.187]          
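
The `[Skipping]` lines in the output suggest that each configuration is given a deterministic run name and is skipped when that run already exists. The actual check lives in `utils.train` and is not shown, so the following is only a guess at the mechanism. The helper names `run_name` and `should_skip` are hypothetical, and the name format is copied from the log lines above.

```python
from pathlib import Path


def run_name(batch_size, lr, optimizer_name, hidden_dim, num_layers, sr_type,
             freeze=False, rnn_type="RNN", bidirectional=False) -> str:
    """Build a run name in the format seen in the log output."""
    return (
        f"batch_size_{batch_size}-lr_{lr}-optimizer_{optimizer_name}-"
        f"hidden_dim_{hidden_dim}-num_layers_{num_layers}-sr_type_{sr_type}-"
        f"freeze_{freeze}-rnn_type_{rnn_type}-bidirectional_{bidirectional}"
    )


def should_skip(log_root: Path, name: str) -> bool:
    """Assumed rule: skip a configuration whose log directory already exists."""
    return (log_root / name).exists()


name = run_name(32, 1e-5, "SGD", 256, 1, "last")
print(name)
print(should_skip(Path("rnn/test"), name))
```

Naming runs by their hyperparameters makes an interrupted grid search resumable: rerunning the loop only trains configurations that have no directory yet.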

In [None]:
# Same grid as above, iterated with hidden_dim outermost and the sentence representation innermost.
grid = itertools.product(
    SEARCH_SPACE["hidden_dim"],
    SEARCH_SPACE["num_layers"],
    SEARCH_SPACE["optimizer_name"],
    SEARCH_SPACE["batch_size"],
    SEARCH_SPACE["learning_rate"],
    SEARCH_SPACE["sentence_representation_type"],
)
for hidden_dim, num_layers, optimizer_name, batch_size, learning_rate, sentence_representation_type in grid:
    log_message = (
        f"---------- batch_size_{batch_size}; lr_{learning_rate}; optimizer_{optimizer_name}; "
        f"hidden_dim_{hidden_dim}; num_layers_{num_layers}; "
        f"sentence_representation_{sentence_representation_type} ----------"
    )
    print(log_message)
    # Pass the SentenceDataset instances, not the raw torchtext datasets.
    train_rnn_model_with_parameters(
        embedding_matrix=embedding_matrix,
        train_dataset=train_dataset,
        val_dataset=valid_dataset,
        batch_size=batch_size,
        learning_rate=learning_rate,
        optimizer_name=optimizer_name,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        sentence_representation_type=sentence_representation_type,
        show_progress=True,
        freeze_embedding=False,
    )