In this script we build, train, and evaluate a neural network model using the [WikiText-2 Dataset](https://paperswithcode.com/dataset/wikitext-2).

The script begins by importing the libraries needed for data handling, neural network construction, and training. It then reads a configuration file, `testRunConfig.yaml`, which sets the parameters for data processing, model architecture, and training. These include the file paths of the training, validation, and test datasets, the minimum word frequency for vocabulary inclusion, the context size, and hyperparameters such as the learning rate, the weight decay, and the architecture of the neural network. The parameter values originate from Experiment 3, where we implement a random hyperparameter search.

A logger is initialized to record the training process. It writes to a unique log file named with the current date and time, which helps with tracking and debugging training runs.


The `WikiText2VocabBuilder` class is responsible for constructing the vocabulary from the text corpus. It preprocesses the text, tokenizes it, and builds a vocabulary, considering only words that meet a specified frequency threshold. This class also handles special tokens, ensuring consistent representation in the processed data.

The `WikiText2Dataset` class, derived from PyTorch's `Dataset`, prepares the data for the language modeling task. It processes the text data into input-output pairs suitable for training a language model, where the input is a sequence of words, and the output is the word following this sequence.

The `LanguageModel` class defines the neural network architecture for the language modeling task. It includes an embedding layer, multiple hidden layers with ReLU activation and layer normalization, and an output layer. The network is designed with residual connections and uses Kaiming initialization for its weights, which is beneficial for training deep networks.

The `train_language_model` function encapsulates the training process. It uses negative log-likelihood loss and the AdamW optimizer, adjusting the learning rate over epochs using an exponential decay scheduler. The function evaluates the model's performance on the validation set using perplexity, a standard metric in language modeling. It implements early stopping based on validation performance and includes checks for potential numerical issues like overflow.

After training, the model's weights are saved to a file for later use or further analysis. In summary, the script is a complete pipeline for training a neural network-based language model, covering data preprocessing, model architecture, training optimization, and results logging.







In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import ExponentialLR
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict, Counter
import string
import math
import sys
import logging
from datetime import datetime
import yaml
import json
from torch.utils.tensorboard import SummaryWriter


Loading the configuration.

In [None]:
with open("testRunConfig.yaml", 'r') as ymlfile:
    config = yaml.safe_load(ymlfile)

training_data_corpus_path = config['data']['training_data_corpus_path']
validation_data_corpus_path = config['data']['validation_data_corpus_path']
test_data_corpus_path = config['data']['test_data_corpus_path']
min_freq = config['data']['min_freq']
context_size = config['data']['context_size']

shuffle = config['runtime']['shuffle']
num_workers = config['runtime']['num_workers']
batch_size = config['runtime']['batch_size']
config_device = config['runtime']['device']


total_epochs = config['experiment']['total_epochs']
patience = config['experiment']['patience']
lr_start = config['hyperparameters']['lr_start']
weight_decay = config['hyperparameters']['weight_decay']
lr_end = config['hyperparameters']['lr_end']
numberOfLayers = config['hyperparameters']['numberOfLayers'] # Number of hidden layers
embed_size= config['hyperparameters']['embed_size']  # The embedding size of the words
hidden_size = config['hyperparameters']['hidden_size'] # The hidden size of the neural network
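
The lookups above imply a nested layout for `testRunConfig.yaml`. As a rough sketch, the same structure is written out below as a Python dictionary. Only the section and key names come from the loading code; every value, including the file paths, is a made-up placeholder, and `missing_keys` is a hypothetical helper for catching typos in the file before training starts.

```python
# Placeholder values throughout; only the key names mirror the loading code.
example_config = {
    "data": {
        "training_data_corpus_path": "wiki.train.tokens",    # hypothetical path
        "validation_data_corpus_path": "wiki.valid.tokens",  # hypothetical path
        "test_data_corpus_path": "wiki.test.tokens",         # hypothetical path
        "min_freq": 3,
        "context_size": 5,
    },
    "runtime": {"shuffle": True, "num_workers": 0, "batch_size": 256, "device": "cuda"},
    "experiment": {"total_epochs": 10, "patience": 2},
    "hyperparameters": {
        "lr_start": 1e-3, "lr_end": 1e-5, "weight_decay": 1e-2,
        "numberOfLayers": 2, "embed_size": 64, "hidden_size": 128,
    },
}

REQUIRED_KEYS = {
    "data": ["training_data_corpus_path", "validation_data_corpus_path",
             "test_data_corpus_path", "min_freq", "context_size"],
    "runtime": ["shuffle", "num_workers", "batch_size", "device"],
    "experiment": ["total_epochs", "patience"],
    "hyperparameters": ["lr_start", "lr_end", "weight_decay",
                        "numberOfLayers", "embed_size", "hidden_size"],
}

def missing_keys(config):
    """Return 'section.key' strings for every required entry that is absent."""
    return [
        f"{section}.{key}"
        for section, keys in REQUIRED_KEYS.items()
        for key in keys
        if key not in config.get(section, {})
    ]

print(missing_keys(example_config))
```

Running such a check right after `yaml.safe_load` would turn a misspelled key into a readable list instead of a `KeyError` halfway through the cell.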


Initializing Logging

In [None]:
# Function to initialize logging with dynamic filename
def init_logger():
    # Generate a unique log file name based on the current date and time
    current_time = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    log_file_name = f'testRun_{current_time}.log'

    # Configure the logging module to write the logs to a file
    logging.basicConfig(filename=log_file_name,
                        level=logging.INFO,
                        format='%(asctime)s [%(levelname)s]: %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    return log_file_name

# Initialize logger and get the log file name
log_file_name = init_logger()
print(f"Logging to file: {log_file_name}")


Logging to file: testRun_2023-09-17_19-39-19.log


Preparation of Data

In [None]:
class WikiText2VocabBuilder:
    def _new_index(self):
        return len(self.word2index)

    def __init__(self, corpus_path, min_freq):
        self.corpus_path = corpus_path
        self.min_freq = min_freq
        self.word2index = defaultdict(self._new_index)
        self.index2word = {}
        self.word_freqs = Counter()
        self.cleaned_sentences = []

        self.START_TOKEN = "<s>"
        self.END_TOKEN = "</s>"
        self.UNK_TOKEN = "<unk>"
        self.hyphentoken = "hyphentoken"
        self.numericalcommatoken = "numericalcommatoken"

        self._initialize_special_tokens()
        self._load_and_preprocess()

    def _initialize_special_tokens(self):
        self.word2index[self.START_TOKEN] = 0
        self.word2index[self.END_TOKEN] = 1
        self.word2index[self.UNK_TOKEN] = 2
        self.word2index[self.hyphentoken] = 3
        self.word2index[self.numericalcommatoken] = 4

        self.index2word[0] = self.START_TOKEN
        self.index2word[1] = self.END_TOKEN
        self.index2word[2] = self.UNK_TOKEN
        self.index2word[3] = self.hyphentoken
        self.index2word[4] = self.numericalcommatoken


    def clean_and_tokenize(self, corpus):
        corpus = corpus.lower()  # Convert to lowercase
        sentences = sent_tokenize(corpus)
        cleaned_sentences = []
        for sentence in sentences:
            sentence = sentence.strip()  # Remove unnecessary whitespaces
            if not (sentence.startswith('=') or sentence.endswith('=')):  # Exclude headers
                sentence = sentence.replace('<unk>', 'unknowntoken')
                sentence = sentence.replace('@-@', 'hyphentoken')
                sentence = sentence.replace('@,@', 'numericalcommatoken')
                cleaned_sentences.append(sentence)
        return cleaned_sentences

    def _load_and_preprocess(self):
        with open(self.corpus_path, 'r') as f:
            corpus = f.read()

        # Split the text into sentences using NLTK
        self.cleaned_sentences = self.clean_and_tokenize(corpus)
        # Count word frequencies
        for sentence in self.cleaned_sentences:
            words = word_tokenize(sentence)
            for word in words:
                word = word.lower()
                self.word_freqs[word] += 1

        # Build the vocabulary using only words that meet the frequency threshold
        for word, freq in self.word_freqs.items():
            if freq >= self.min_freq:
                index = self.word2index[word]
                self.index2word[index] = word

    def vocab_size(self):
        return len(self.word2index)
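
The frequency-threshold step of the vocabulary builder can be illustrated in isolation. The sketch below is a simplified stand-in, not the class above: `build_vocab` is a hypothetical function, it splits on whitespace instead of calling NLTK's `word_tokenize`, and it registers only three special tokens.

```python
from collections import Counter

def build_vocab(sentences, min_freq, special_tokens=("<s>", "</s>", "<unk>")):
    """Map every word that occurs at least min_freq times to an integer index."""
    word_freqs = Counter(word for sentence in sentences for word in sentence.split())
    # Special tokens take the first indices so they stay fixed for any corpus
    word2index = {token: i for i, token in enumerate(special_tokens)}
    for word, freq in word_freqs.items():
        if freq >= min_freq and word not in word2index:
            word2index[word] = len(word2index)
    return word2index

toy_sentences = ["the cat sat", "the cat ran", "a dog ran"]
toy_vocab = build_vocab(toy_sentences, min_freq=2)
print(toy_vocab)
```

Words below the threshold ("sat", "a", "dog" in this toy corpus) receive no index of their own and would be mapped to `<unk>` when the dataset is built.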



In [None]:
class WikiText2Dataset(Dataset):
    def __init__(self, preprocessor, context_size):
        super(WikiText2Dataset, self).__init__()

        self.context_size = context_size

        # We already have cleaned sentences in the preprocessor
        self.sentences = preprocessor.cleaned_sentences
        self.word2index = preprocessor.word2index
        self.index2word = preprocessor.index2word
        self.word_freqs = preprocessor.word_freqs
        self.START_TOKEN = preprocessor.START_TOKEN
        self.END_TOKEN = preprocessor.END_TOKEN
        self.UNK_TOKEN = preprocessor.UNK_TOKEN
        self.hyphentoken = preprocessor.hyphentoken
        self.numericalcommatoken = preprocessor.numericalcommatoken


        self.X, self.Y = self._build_dataset()

    def _build_dataset(self):
        X, Y = [], []

        for sentence in self.sentences:
            words = word_tokenize(sentence)
            if not words:
                continue
            if words[-1] in string.punctuation:
                words[-1] = self.END_TOKEN
            else:
                words.append(self.END_TOKEN)

            context = [0] * self.context_size
            for i, word in enumerate(words):

                if word in self.word2index and word not in ['unknowntoken', 'hyphentoken', 'numericalcommatoken']:
                    index = self.word2index[word]
                elif word == 'unknowntoken':
                    index = self.word2index[self.UNK_TOKEN]
                elif word == 'hyphentoken':
                    index = self.word2index[self.hyphentoken]
                elif word == 'numericalcommatoken':
                    index = self.word2index[self.numericalcommatoken]
                else:
                    index = self.word2index[self.UNK_TOKEN]
                X.append(context)
                Y.append(index)
                context = context[1:] + [index]

        return torch.tensor(X), torch.tensor(Y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def sample(self, num_samples):
        for _ in range(num_samples):
            idx = torch.randint(0, len(self), (1,)).item()
            context, target = self.X[idx], self.Y[idx]
            context_words = [self.index2word[i.item()] for i in context]
            target_word = self.index2word[target.item()]
            print(" ".join(context_words), "------>", target_word)

    def get_context_size(self):
        return self.context_size
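
The sliding context window in `_build_dataset` is easier to see on a short sequence of indices. The following sketch assumes the sentence has already been converted to indices; `context_target_pairs` is a hypothetical helper, and the padding index 0 plays the role of the start token.

```python
def context_target_pairs(token_ids, context_size, pad_id=0):
    """Pair each token with the context_size tokens that precede it."""
    context = [pad_id] * context_size  # the sentence starts from an all-padding context
    pairs = []
    for token_id in token_ids:
        pairs.append((context, token_id))
        # Drop the oldest index and append the current one (builds a new list)
        context = context[1:] + [token_id]
    return pairs

pairs = context_target_pairs([5, 6, 7, 1], context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Each sentence of n tokens therefore contributes n pairs, and the first `context_size` of them contain at least one padding index.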


In [None]:
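train_vocabBuilder = WikiText2VocabBuilder(corpus_path=training_data_corpus_path, min_freq=min_freq)

# The validation and test splits read their own corpus files but reuse the training vocabulary
valid_vocabBuilder = WikiText2VocabBuilder(corpus_path=validation_data_corpus_path, min_freq=min_freq)
test_vocabBuilder = WikiText2VocabBuilder(corpus_path=test_data_corpus_path, min_freq=min_freq)
for builder in (valid_vocabBuilder, test_vocabBuilder):
    builder.word2index, builder.index2word = train_vocabBuilder.word2index, train_vocabBuilder.index2word

train_dataset = WikiText2Dataset(train_vocabBuilder, context_size=context_size)
valid_dataset = WikiText2Dataset(valid_vocabBuilder, context_size=context_size)
test_dataset = WikiText2Dataset(test_vocabBuilder, context_size=context_size)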

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

test_dataset_size_msg = f"Number of context-target pairs in the test corpus: {len(test_dataset)}"
print(test_dataset_size_msg)
logging.info(test_dataset_size_msg)

Number of context-target pairs in the test corpus: 1857512


The model's architecture begins with an embedding layer that transforms input word indices into dense vector representations. The size of this embedding layer is determined by the vocabulary size and the specified embedding dimension. This layer plays a crucial role in capturing semantic information about words.

Alongside the hidden layers, the model features a linear transformation, `embeddingToHiddenLayer`, which maps the flattened embedding output (context size times embedding dimension) to the hidden layer size. The first hidden layer changes the dimensionality of its input, so its residual cannot be the raw input. The output of `embeddingToHiddenLayer` is used as that residual instead, serving as a bridge between the embedding and the hidden layers.

The core of the model consists of a series of hidden layers, each comprising a linear layer followed by layer normalization and a ReLU activation function. These hidden layers are encapsulated within a ModuleList, allowing for a dynamic number of layers based on the numberOfLayers parameter. The use of layer normalization stabilizes the learning process, and ReLU activation introduces non-linearity, enabling the model to learn complex patterns in the data.

A key feature of the model is the incorporation of residual connections. Each hidden layer's output is added to its input or, for the first layer, to the `embeddingToHiddenLayer` projection of its input. This improves gradient flow through the network and mitigates the vanishing gradient problem in deeper architectures.

The final component of the model is the output stage, which maps the output of the last hidden layer to the vocabulary size. It sums two parallel linear transformations, `hiddenToOutputLayer` and `output_layer`, and the result provides the logits for each word in the vocabulary.

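The model employs Kaiming (He) initialization for the embedding layer, the hidden linear layers, `embeddingToHiddenLayer`, and `hiddenToOutputLayer`. This scheme scales the initial weights for ReLU activation functions and is known to improve convergence in deep networks. The weights of `output_layer` are instead set to a small constant (0.01) to make the initial predictions less confident.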

In the forward pass, the model first computes word embeddings, then reshapes the tensor before applying the initial linear transformation. The reshaping flattens the context window and embedding dimensions into a single dimension, preparing the data for linear transformation. The data then sequentially passes through the hidden layers, with residual connections applied at each step. Finally, the output layer generates log probabilities using log softmax, making the model suitable for tasks like next-word prediction.
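
Read off the `forward` method, the computation can be written compactly as follows. The symbols are notation introduced here, not names from the code: $c$ is the context size, $E$ the embedding matrix, $L$ the number of hidden layers (assumed to be at least 1), $\mathrm{LN}$ layer normalization, $(W_p, b_p)$ the parameters of `embeddingToHiddenLayer`, and $(W_o, b_o)$, $(W_q, b_q)$ those of `output_layer` and `hiddenToOutputLayer`.

$$
\begin{aligned}
e &= [E_{x_1}; \dots; E_{x_c}] \\
h_1 &= \mathrm{ReLU}(\mathrm{LN}(W_1 e + b_1)) + (W_p e + b_p) \\
h_i &= \mathrm{ReLU}(\mathrm{LN}(W_i h_{i-1} + b_i)) + h_{i-1}, \quad i = 2, \dots, L \\
y &= (W_o h_L + b_o) + (W_q h_L + b_q) \\
\log p(w \mid x_1, \dots, x_c) &= y_w - \log \sum_{v} \exp(y_v)
\end{aligned}
$$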

The num_parameters method computes the total number of trainable parameters in the model, providing insight into the model's complexity and capacity.
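
The count returned by `num_parameters` can also be derived by hand from the layer shapes. The sketch below does that arithmetic in plain Python; `count_parameters` is a hypothetical helper, and the sizes in the example call are made up rather than taken from the configuration file.

```python
def count_parameters(vocab_size, embed_size, hidden_size, context_size, number_of_layers):
    """Add up the trainable parameters implied by the layer shapes of LanguageModel."""
    flat = context_size * embed_size                    # size of the flattened context
    embedding = vocab_size * embed_size
    projection = flat * hidden_size + hidden_size       # embeddingToHiddenLayer
    hidden = 0
    for i in range(number_of_layers):
        in_features = flat if i == 0 else hidden_size
        hidden += in_features * hidden_size + hidden_size    # linear weight + bias
        hidden += 2 * hidden_size                            # LayerNorm scale + shift
    # hiddenToOutputLayer and output_layer have identical shapes
    output = 2 * (hidden_size * vocab_size + vocab_size)
    return embedding + projection + hidden + output

total = count_parameters(vocab_size=100, embed_size=8, hidden_size=16,
                         context_size=3, number_of_layers=2)
print(total)  # 800 + 400 + 736 + 3400 = 5336
```

For realistic vocabulary sizes the two output transformations and the embedding dominate the total, since both scale with the vocabulary size.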

In [None]:
class LanguageModel(nn.Module):

    def __init__(self, vocab_size, embed_size, hidden_size, context_size, numberOfLayers):
        super(LanguageModel, self).__init__()
        self.embed_size = embed_size
        self.context_size = context_size
        self.hidden_size = hidden_size
        self.numberOfLayers = numberOfLayers
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.hidden_layers = nn.ModuleList()
        self.ReLU = nn.ReLU()  # Defining a single instance of ReLU to reuse


        self.embeddingToHiddenLayer = nn.Linear((self.context_size) * self.embed_size, self.hidden_size)
        self.hiddenToOutputLayer = nn.Linear(self.hidden_size, self.vocab_size)

        # Hidden layers weights, bias, and layer normalization
        for i in range(self.numberOfLayers):
            linear_layer = nn.Linear((self.context_size) * self.embed_size if i == 0 else self.hidden_size, self.hidden_size)
            layer_norm = nn.LayerNorm(self.hidden_size)


            self.hidden_layers.append(nn.Sequential(
                linear_layer,
                layer_norm,
                self.ReLU
            ))

        self.output_layer = nn.Linear(self.hidden_size, self.vocab_size)

        with torch.no_grad():
            # Kaiming (He) initialization for the embedding and the projection layers
            nn.init.kaiming_normal_(self.embedding.weight)
            nn.init.kaiming_normal_(self.embeddingToHiddenLayer.weight)
            nn.init.constant_(self.embeddingToHiddenLayer.bias, 0)

            nn.init.kaiming_normal_(self.hiddenToOutputLayer.weight)
            nn.init.constant_(self.hiddenToOutputLayer.bias, 0)


            # Kaiming initialization for linear layers
            for hidden_layer in self.hidden_layers:
                nn.init.kaiming_normal_(hidden_layer[0].weight)

            # Initialize layer normalization layers (scale 1, shift 0)
            for hidden_layer in self.hidden_layers:
                nn.init.constant_(hidden_layer[1].weight, 1)
                nn.init.constant_(hidden_layer[1].bias, 0)

            # Make the output layer less confident
            nn.init.constant_(self.output_layer.weight, 0.01)
            nn.init.constant_(self.output_layer.bias, 0)


    def forward(self, x):

        x = self.embedding(x)
        x = x.view(x.size(0), -1)
        residual = self.embeddingToHiddenLayer(x)

        for hidden_layer in self.hidden_layers:
            x = hidden_layer(x) + residual
            residual = x

        residual = self.hiddenToOutputLayer(x)
        y = self.output_layer(x) + residual
        log_probs = F.log_softmax(y, dim=1)
        return log_probs

    def num_parameters(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

The `train_language_model` function takes the model, data loaders for training, validation, and testing, the device for computation, training epochs, learning rate parameters, weight decay, and a patience parameter for early stopping.

The training process is iterative, looping over the specified number of epochs. Each epoch involves a forward and backward pass over the training data. The data is loaded in batches from the `train_dataloader`. For each batch, the model performs a forward pass to compute log probabilities of the target words given the contexts. The loss is calculated using negative log-likelihood loss (`nn.NLLLoss`), which is appropriate for the log probabilities output by the model.
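
To make the relationship between log probabilities and the loss concrete, here is a minimal plain-Python sketch of what the log softmax and `nn.NLLLoss` (with its default mean reduction) compute for a batch. The function names and the toy logits are assumptions for illustration; the training code itself relies on the PyTorch implementations.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum_exp for v in logits]

def nll_loss(batch_log_probs, targets):
    """Mean negative log-probability assigned to the target indices."""
    picked = [log_probs[t] for log_probs, t in zip(batch_log_probs, targets)]
    return -sum(picked) / len(picked)

uniform = log_softmax([0.0, 0.0, 0.0, 0.0])     # no preference among 4 words
confident = log_softmax([10.0, 0.0, 0.0, 0.0])  # strong preference for word 0

loss_uniform = nll_loss([uniform], [2])
loss_confident = nll_loss([confident], [0])
print(loss_uniform, loss_confident)
```

A model that spreads its probability evenly over V words has a loss of log V, while a confident and correct prediction drives the loss toward zero.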

After computing the loss, a backward pass is performed to compute gradients, which are then used to update the model's parameters via the AdamW optimizer. AdamW is an optimizer with weight decay regularization, suitable for this kind of task. The learning rate is adjusted at each epoch using an exponential decay schedule, calculated based on the initial and final learning rates and the total number of epochs.
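
The decay factor passed to `ExponentialLR` follows from requiring that `total_epochs` multiplications by `gamma` take the learning rate from the initial to the final value. The sketch below reproduces that arithmetic without PyTorch; `learning_rates_per_epoch` is a hypothetical helper and the rates are example values.

```python
def learning_rates_per_epoch(start_lr, end_lr, total_epochs):
    """Return gamma and the learning rate in effect during each training epoch."""
    gamma = (end_lr / start_lr) ** (1 / total_epochs)
    # Epoch k (counting from 0) trains with start_lr * gamma**k
    return gamma, [start_lr * gamma ** epoch for epoch in range(total_epochs)]

gamma, rates = learning_rates_per_epoch(start_lr=1e-3, end_lr=1e-5, total_epochs=4)
rate_after_last_step = rates[-1] * gamma

for epoch, rate in enumerate(rates, start=1):
    print(f"epoch {epoch}: {rate:.6f}")
```

Because the training loop steps the scheduler at the end of each epoch, the last epoch still trains with `start_lr * gamma ** (total_epochs - 1)`; the final learning rate itself is reached only after the last scheduler step.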

The function includes a nested `evaluate` function, used to compute the model's performance on the validation and test datasets. It accumulates the total loss over the dataset and returns the perplexity, a common metric in language modeling: the exponential of the average negative log-likelihood per target word, so lower values mean the model predicts the data better.
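
As a small worked example of the perplexity computation, the sketch below combines per-batch mean losses into a token-weighted average and exponentiates it. The `perplexity` function is a hypothetical stand-in for the accumulation done inside the training code, and returning infinity on overflow is a variation chosen here for illustration.

```python
import math
import sys

def perplexity(batch_mean_losses, batch_sizes):
    """Perplexity from per-batch mean NLL values, weighted by batch size."""
    total_loss = sum(loss * size for loss, size in zip(batch_mean_losses, batch_sizes))
    total_tokens = sum(batch_sizes)
    mean_nll = total_loss / total_tokens
    # math.exp raises OverflowError beyond this point, so report infinity instead
    if mean_nll > math.log(sys.float_info.max):
        return math.inf
    return math.exp(mean_nll)

# A model that guesses uniformly among 50 words has a mean NLL of log(50)
uniform_guess = perplexity([math.log(50)] * 3, [32, 32, 7])
diverged = perplexity([1000.0], [32])
print(uniform_guess, diverged)
```

Weighting by batch size matters because the last batch of an epoch is usually smaller than the others.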

Perplexity is calculated for both the training and validation datasets at the end of each epoch. The training process includes a mechanism for early stopping, which halts training if the validation perplexity does not improve for a number of epochs specified by the `patience` parameter. This approach helps prevent overfitting.
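
The early stopping rule can be traced on a list of validation perplexities. The following sketch mirrors the counter logic of the training loop in isolation; `early_stopping_epoch` is a hypothetical helper and the perplexity values are invented.

```python
def early_stopping_epoch(valid_perplexities, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    no_improvement = 0
    for epoch, value in enumerate(valid_perplexities, start=1):
        if value < best:
            best = value
            no_improvement = 0   # any new best resets the counter
        else:
            no_improvement += 1
        if no_improvement >= patience:
            return epoch
    return None

history = [10.0, 8.0, 9.0, 8.5, 8.2]
stop_with_patience_2 = early_stopping_epoch(history, patience=2)
stop_with_patience_3 = early_stopping_epoch(history, patience=3)
print(stop_with_patience_2, stop_with_patience_3)
```

Note that the comparison is against the best value seen so far, so epochs 4 and 5 in this example count as non-improving even though each is better than the epoch before it.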

After completing the training epochs (or stopping early), the function evaluates the model on the test dataset to compute the test perplexity. This metric provides an assessment of the model's performance on unseen data.

The function returns a collection of metrics for each epoch: learning rates, training losses, training perplexities, validation perplexities, and the final test perplexity. These outputs are useful for analyzing the model's learning behavior and overall performance.

In [None]:
def train_language_model(model, train_dataloader, valid_dataloader, test_dataloader, device, total_epochs, start_lr, end_lr, weight_decay_value, patience):

    def evaluate(model, dataloader, loss_function, device):
        model.eval()
        total_loss = 0.0
        total_tokens = 0

        with torch.no_grad():
            for contexts, targets in dataloader:
                contexts, targets = contexts.to(device), targets.to(device)
                log_probs = model(contexts)
                loss = loss_function(log_probs, targets)
                total_loss += loss.item() * targets.size(0)
                total_tokens += targets.size(0)

        model.train()
        return math.exp(total_loss / total_tokens)

    # Loss, optimizer, and exponential learning-rate decay from start_lr to end_lr

    loss_function = nn.NLLLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=start_lr, weight_decay=weight_decay_value)
    scheduler = ExponentialLR(optimizer, gamma=(end_lr/start_lr)**(1/total_epochs))

    # Lists to store learning rate, loss, and perplexity values
    learning_rates = []
    losses = []
    train_perplexities = []
    valid_perplexities = []
    test_perplexities = []

    best_valid_perplexity = float('inf')
    no_improvement_counter = 0

    for epoch in range(total_epochs):
        total_loss = 0
        total_tokens = 0

        for i, (contexts, targets) in enumerate(train_dataloader):
            # Move the batch to the target device
            contexts, targets = contexts.to(device), targets.to(device)

            # Zero the gradients
            optimizer.zero_grad()

            # Forward pass
            log_probs = model(contexts)

            # Compute loss
            loss = loss_function(log_probs, targets)

            # Accumulate total loss and total tokens
            total_loss += loss.item() * targets.size(0)
            total_tokens += targets.size(0)

            # Backward pass
            loss.backward()

            # Update the parameters
            optimizer.step()

        # Check for potential overflow in exponentiation
        exponent = total_loss / total_tokens
        if exponent > math.log(sys.float_info.max):
            overflow_msg = f"Potential overflow detected at epoch {epoch + 1}. Stopping training for this hyperparameter set."
            logging.info(overflow_msg)
            return None, None, [np.inf], [np.inf], np.inf  # Return inf for all metrics

        train_perplexity = math.exp(exponent)

        # Store values
        losses.append(total_loss / total_tokens)
        train_perplexities.append(train_perplexity)

        learning_rate = optimizer.param_groups[0]['lr']
        learning_rates.append(learning_rate)

        # Evaluate on the validation set
        valid_perplexity = evaluate(model, valid_dataloader, loss_function, device)
        # Record the value before the early-stopping check so every epoch is logged
        valid_perplexities.append(valid_perplexity)

        # Check if the validation perplexity has improved
        if valid_perplexity < best_valid_perplexity:
            best_valid_perplexity = valid_perplexity
            no_improvement_counter = 0
        else:
            no_improvement_counter += 1
        if no_improvement_counter >= patience:
            early_stopping_msg = f"Early stopping triggered after {epoch + 1} epochs."
            logging.info(early_stopping_msg)
            print(early_stopping_msg)
            break


        # Update the learning rate
        scheduler.step()
        stat_message = f"Epoch {epoch+1}/{total_epochs} - Training Loss: {total_loss / total_tokens:.4f} - Training Perplexity: {train_perplexities[-1]:.4f} - Validation Perplexity: {valid_perplexities[-1]:.4f} - Learning rate: {learning_rate:.10f}"
        logging.info(stat_message)
        print(stat_message)


        # Reset total loss and total tokens for the next epoch
        total_loss = 0
        total_tokens = 0

    test_perplexity = evaluate(model, test_dataloader, loss_function, device)
    test_perpl_message = f"Test Perplexity: {test_perplexity:.4f}"
    logging.info(test_perpl_message)
    print(test_perpl_message)

    return learning_rates, losses, train_perplexities, valid_perplexities, test_perplexity


Now we train the language model on the WikiText-2 dataset. We initialize the model with the specified parameters, run the training, and evaluate performance on the training, validation, and test sets. Key metrics such as perplexities and losses are logged and reported after training.

In [None]:
vocab_size = train_vocabBuilder.vocab_size() # The size of the vocabulary

# Initialize device based on the configuration and availability
device = torch.device(config_device if torch.cuda.is_available() else "cpu")
device_msg = f"Used Device: {device}"
logging.info(device_msg)
print(device_msg)
valid_perplexities_dict = {}
test_perplexity_dict = {}

# Seed NumPy and PyTorch for reproducibility
np.random.seed(423455335)
torch.manual_seed(423455335)
# Make cuDNN deterministic when running on the GPU
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Initialize the dictionary that will hold all the data
results_dict = {}

model = LanguageModel(vocab_size, embed_size, hidden_size, context_size, numberOfLayers)
model = model.to(device)

model_param_msg = f"Number of Model Parameters: {model.num_parameters()}"
print(model_param_msg)
logging.info(model_param_msg)
learning_rates, losses, train_perplexities, valid_perplexities, test_perplexity = train_language_model(model, train_dataloader, valid_dataloader, test_dataloader, device, total_epochs, lr_start, lr_end, weight_decay, patience)

iter_dict = {
    'learning_rates': learning_rates,
    'losses': losses,
    'train_perplexities': train_perplexities,
    'valid_perplexities': valid_perplexities,
    'test_perplexity': test_perplexity
}

result_msg = f"Validation Perplexity: {valid_perplexities[-1]:.4f}, Training Loss: {losses[-1]:.4f}, Training Perplexity: {train_perplexities[-1]:.4f}, Test Perplexity: {test_perplexity:.4f}"
print(result_msg)
logging.info(result_msg)





Used Device: cuda


Number of Model Parameters: 61699756
Epoch 1/15 - Training Loss: 5.7492 - Training Perplexity: 313.9442 - Validation Perplexity: 82.4057 - Learning rate: 0.0007125891
Epoch 2/15 - Training Loss: 4.2933 - Training Perplexity: 73.2047 - Validation Perplexity: 21.4043 - Learning rate: 0.0004506159
Epoch 3/15 - Training Loss: 3.0995 - Training Perplexity: 22.1863 - Validation Perplexity: 9.9646 - Learning rate: 0.0002849534
Epoch 4/15 - Training Loss: 2.3958 - Training Perplexity: 10.9765 - Validation Perplexity: 6.7038 - Learning rate: 0.0001801944
Epoch 5/15 - Training Loss: 1.9902 - Training Perplexity: 7.3170 - Validation Perplexity: 5.2946 - Learning rate: 0.0001139485
Epoch 6/15 - Training Loss: 1.7383 - Training Perplexity: 5.6878 - Validation Perplexity: 4.5922 - Learning rate: 0.0000720570
Epoch 7/15 - Training Loss: 1.5779 - Training Perplexity: 4.8448 - Validation Perplexity: 4.2031 - Learning rate: 0.0000455663
Epoch 8/15 - Training Loss: 1.4753 - Training Perplexity: 4.3725 - 

We save the weights so that we can analyze the embedding space of the trained model in Experiment 1.

In [None]:
model_save_path = "./testModel_weights.pth"
torch.save(model.state_dict(), model_save_path)

print(f"Model saved at {model_save_path}")
logging.info(f"Model saved at {model_save_path}")


json_filename = "finalRun.json"
with open(json_filename, 'w') as f:
    json.dump(iter_dict, f, indent=4)

Model saved at ./testModel_weights.pth
