In summary, this script exemplifies a structured and comprehensive approach to hyperparameter tuning for a language model. By systematically exploring a range of configurations and rigorously evaluating model performance, it aims to identify the optimal set of hyperparameters that lead to the best predictive performance on the given language modeling task.

Key Steps in Hyperparameter Tuning:

1. **Configuration and Setup**: The script begins by loading configuration settings from a YAML file. These settings include paths to data, model architecture specifications, and ranges for various hyperparameters like learning rate, weight decay, and network structure (number of layers, embedding size, hidden layer size). The configuration also dictates the runtime environment and training parameters like batch size and number of epochs.

2. **Data Preparation**: Using a `WikiText2VocabBuilder` class, a vocabulary is built from a text corpus. This vocabulary is then used to create training and validation datasets, which are crucial for both training the model and evaluating its performance during the tuning process.

3. **Model Architecture**: The `LanguageModel` class defines the neural network's structure, including an embedding layer, multiple hidden layers (with ReLU activation and layer normalization), and an output layer. The model is designed to use residual connections and Kaiming initialization to aid in training deeper networks efficiently.

4. **Hyperparameter Sampling**: The script employs two key functions - `sample_logHyperparameters` and `sample_uniformHyperparameters` - to sample hyperparameters from specified ranges. The former samples values like learning rate and weight decay from log-uniform distributions, ensuring a uniform distribution across orders of magnitude. The latter samples architectural parameters like the number of layers and sizes of the embedding and hidden layers.

5. **Training and Evaluation Loop**: The tuning process is driven by the `hyper_param_tuning` function, a thin wrapper that passes the sampled hyperparameters to `train_language_model`. Training runs for a set number of epochs, evaluates the model on the validation set after each one, and applies early stopping and learning rate scheduling. It records the loss and perplexity per epoch, which are the metrics used to compare configurations.

6. **Iterative Process**: The script executes multiple iterations of the tuning process, each time sampling a new set of hyperparameters and training a new model instance. This iterative approach is key to exploring the hyperparameter space thoroughly.

7. **Logging and Analysis**: Throughout the process, each hyperparameter set and the corresponding model's performance are logged. This data is saved both to a log file and to a JSON file, enabling detailed post-hoc analysis to determine the most effective model configuration.

8. **Final Selection**: After all iterations are complete, the script identifies and reports the best-performing model configuration based on validation perplexity, a standard measure of model performance in language modeling tasks.
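
The final selection step can be sketched as follows. This is a minimal illustration, assuming results shaped like the JSON file the script writes (one `iter_N` entry per run, each holding a `valid_perplexities` list). The two entries and their numbers are made up for the example, and `pick_best` is a hypothetical helper, not part of the script.

```python
import json

def pick_best(results):
    """Return (key, entry) of the run with the lowest final validation perplexity."""
    return min(results.items(), key=lambda kv: kv[1]["valid_perplexities"][-1])

# Hypothetical results in the same shape as the script's JSON output
results_json = """
{
  "iter_1": {"start_lr": 0.001, "numberOfLayers": 3, "valid_perplexities": [210.0, 150.0, 120.0]},
  "iter_2": {"start_lr": 0.0003, "numberOfLayers": 5, "valid_perplexities": [190.0, 110.0, 95.0]}
}
"""
results = json.loads(results_json)
best_key, best_entry = pick_best(results)
print(best_key, best_entry["valid_perplexities"][-1])
```

Ranking by the last recorded validation perplexity mirrors how the tuning loop updates `best_valid_perplexity`; ranking by the minimum over all epochs would be an equally reasonable choice.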



The code cell below imports the Python libraries used for building and training the neural network, tokenizing text, logging, and plotting.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import ExponentialLR
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict, Counter
import string
import math
import time
import sys
import logging
from datetime import datetime
import yaml
import json


The code reads a configuration file to set parameters for a machine learning model's data processing, runtime environment, and hyperparameter tuning. It specifies data paths, minimum word frequency, context size for training, shuffle settings, worker threads, batch size, and device type. For the experiment, it defines the number of training epochs, early stopping patience, and the number of hyperparameter tuning iterations. Hyperparameters such as learning rate, weight decay, reduction factor, number of layers, embedding, and hidden layer sizes are set as ranges to explore the best model configuration.

In [None]:
with open("tuningHyperParamConfig.yaml", 'r') as ymlfile:
    config = yaml.safe_load(ymlfile)

training_data_corpus_path = config['data']['training_data_corpus_path']
validation_data_corpus_path = config['data']['validation_data_corpus_path']
min_freq = config['data']['min_freq']
context_size = config['data']['context_size']

shuffle = config['runtime']['shuffle']
num_workers = config['runtime']['num_workers']
batch_size = config['runtime']['batch_size']
config_device = config['runtime']['device']


total_epochs = config['experiment']['total_epochs']
patience = config['experiment']['patience']
lr_start_range = config['hyperparameters']['lr_start_range']
weight_decay_range = config['hyperparameters']['weight_decay_range']
reduction_factor_range = config['hyperparameters']['reduction_factor_range']
num_iterations = config['experiment']['num_iterations_for_hyperparameters'] # Number of combinations of hyperparameters to try
numberOfLayers_range = config['hyperparameters']['numberOfLayers_range'] # Number of hidden layers
embed_size_range = config['hyperparameters']['embed_size_range']  # The embedding size of the words
hidden_size_range = config['hyperparameters']['hidden_size_range'] # The hidden size of the neural network
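
For orientation, the structure the cell above expects can be sketched as a Python dictionary with the same keys that `yaml.safe_load` would return. The key names come from the code; every value is a placeholder chosen for illustration (only the context size of 10 and the 15 epochs appear in the recorded output), and `check_ranges` is a hypothetical helper.

```python
# Hypothetical configuration mirroring the keys the script reads.
# All values are placeholders, not the author's actual settings.
example_config = {
    "data": {
        "training_data_corpus_path": "path/to/train.txt",
        "validation_data_corpus_path": "path/to/valid.txt",
        "min_freq": 3,
        "context_size": 10,
    },
    "runtime": {"shuffle": True, "num_workers": 2, "batch_size": 256, "device": "cuda"},
    "experiment": {
        "total_epochs": 15,
        "patience": 3,
        "num_iterations_for_hyperparameters": 20,
    },
    "hyperparameters": {
        "lr_start_range": [1e-4, 1e-2],
        "weight_decay_range": [1e-6, 1e-2],
        "reduction_factor_range": [1e-3, 1e-1],
        "numberOfLayers_range": [2, 8],
        "embed_size_range": [64, 256],
        "hidden_size_range": [128, 512],
    },
}

def check_ranges(hyperparameters):
    """Verify that every *_range entry is a [low, high] pair with low <= high."""
    for name, bounds in hyperparameters.items():
        if not name.endswith("_range"):
            continue
        if len(bounds) != 2 or bounds[0] > bounds[1]:
            raise ValueError(f"{name} must be [low, high] with low <= high, got {bounds}")
    return True

print(check_ranges(example_config["hyperparameters"]))
```

Validating the ranges up front is optional, but it catches a swapped pair before any sampling takes place.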


The code cell below defines a function `init_logger()` that sets up logging for a hyperparameter tuning process. It generates a unique filename incorporating the current date and time, configures the logging level, message format, and date format, and begins logging to this file. When called, the function initializes the logger and prints the log file's name, indicating where the tuning process's details will be recorded.

In [None]:
# Function to initialize logging with dynamic filename
def init_logger():
    # Generate a unique log file name based on the current date and time
    current_time = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    log_file_name = f'hyperparameter_tuning_{current_time}.log'

    # Configure the logging module to write the logs to a file
    logging.basicConfig(filename=log_file_name,
                        level=logging.INFO,
                        format='%(asctime)s [%(levelname)s]: %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    return log_file_name

# Initialize logger and get the log file name
log_file_name = init_logger()
print(f"Logging to file: {log_file_name}")


Logging to file: hyperparameter_tuning_2023-09-16_13-47-27.log


Preparation of Data

The code cell below defines a class `WikiText2VocabBuilder` which constructs a vocabulary from a text corpus. The class is initialized with the path to the corpus and a minimum frequency threshold for inclusion. It maintains a bidirectional mapping between words and indices and tracks word frequencies.

Special tokens for sentence start, sentence end, unknown words, hyphens, and numerical commas are predefined. The text is lowercased and split into sentences, header lines are dropped, and the corpus markers `<unk>`, `@-@`, and `@,@` are replaced with placeholder words. Only words meeting the minimum frequency criterion are added to the vocabulary. The `vocab_size` method returns the number of entries in the vocabulary, including the special tokens.

In [None]:
class WikiText2VocabBuilder:
    def _new_index(self):
        return len(self.word2index)

    def __init__(self, corpus_path, min_freq):
        self.corpus_path = corpus_path
        self.min_freq = min_freq
        self.word2index = defaultdict(self._new_index)
        self.index2word = {}
        self.word_freqs = Counter()
        self.cleaned_sentences = []

        self.START_TOKEN = "<s>"
        self.END_TOKEN = "</s>"
        self.UNK_TOKEN = "<unk>"
        self.hyphentoken = "hyphentoken"
        self.numericalcommatoken = "numericalcommatoken"

        self._initialize_special_tokens()
        self._load_and_preprocess()

    def _initialize_special_tokens(self):
        self.word2index[self.START_TOKEN] = 0
        self.word2index[self.END_TOKEN] = 1
        self.word2index[self.UNK_TOKEN] = 2
        self.word2index[self.hyphentoken] = 3
        self.word2index[self.numericalcommatoken] = 4

        self.index2word[0] = self.START_TOKEN
        self.index2word[1] = self.END_TOKEN
        self.index2word[2] = self.UNK_TOKEN
        self.index2word[3] = self.hyphentoken
        self.index2word[4] = self.numericalcommatoken


    def clean_and_tokenize(self, corpus):
        corpus = corpus.lower()  # Convert to lowercase
        sentences = sent_tokenize(corpus)
        cleaned_sentences = []
        for sentence in sentences:
            sentence = sentence.strip()  # Remove unnecessary whitespaces
            if not (sentence.startswith('=') or sentence.endswith('=')):  # Exclude headers
                sentence = sentence.replace('<unk>', 'unknowntoken')
                sentence = sentence.replace('@-@', 'hyphentoken')
                sentence = sentence.replace('@,@', 'numericalcommatoken')
                cleaned_sentences.append(sentence)
        return cleaned_sentences

    def _load_and_preprocess(self):
        with open(self.corpus_path, 'r') as f:
            corpus = f.read()

        # Split the text into sentences using NLTK
        self.cleaned_sentences = self.clean_and_tokenize(corpus)
        # Count word frequencies
        for sentence in self.cleaned_sentences:
            words = word_tokenize(sentence)
            for word in words:
                word = word.lower()
                self.word_freqs[word] += 1

        # Build the vocabulary using only words that meet the frequency threshold
        for word, freq in self.word_freqs.items():
            if freq >= self.min_freq:
                index = self.word2index[word]
                self.index2word[index] = word

    def vocab_size(self):
        return len(self.word2index)
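
The frequency threshold can be illustrated on a toy sentence. This is a simplified sketch, not the class above: `build_vocab` is a hypothetical function, it splits on whitespace instead of using NLTK, and it reserves only three special tokens.

```python
from collections import Counter

def build_vocab(tokens, min_freq, special_tokens=("<s>", "</s>", "<unk>")):
    """Map tokens seen at least min_freq times to indices, after the special tokens."""
    counts = Counter(tokens)
    word2index = {tok: i for i, tok in enumerate(special_tokens)}
    for word, freq in counts.items():
        if freq >= min_freq and word not in word2index:
            word2index[word] = len(word2index)
    index2word = {i: w for w, i in word2index.items()}
    return word2index, index2word

tokens = "the cat sat on the mat the cat slept".split()
word2index, index2word = build_vocab(tokens, min_freq=2)
print(word2index)
```

Only `the` (3 occurrences) and `cat` (2 occurrences) pass the threshold here; the remaining words would map to `<unk>` when the dataset is built. The sketch uses a plain `dict`, so looking up an unseen word raises `KeyError` instead of adding an entry.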



The code cell below introduces `WikiText2Dataset`, a subclass of PyTorch's `Dataset`. It's initialized with a preprocessor containing the text corpus and a specified context size for training. The class processes the text data into input-output pairs for language modeling, where the input is a sequence of words and the output is the word following the sequence. Special tokens and unknown words are managed according to predefined rules. The `__len__` and `__getitem__` methods allow the class to interface with PyTorch's DataLoader for batched data retrieval. Additionally, a `sample` method is included for displaying examples from the dataset.

In [None]:
class WikiText2Dataset(Dataset):
    def __init__(self, preprocessor, context_size):
        super(WikiText2Dataset, self).__init__()

        self.context_size = context_size

        # We already have cleaned sentences in the preprocessor
        self.sentences = preprocessor.cleaned_sentences
        self.word2index = preprocessor.word2index
        self.index2word = preprocessor.index2word
        self.word_freqs = preprocessor.word_freqs
        self.START_TOKEN = preprocessor.START_TOKEN
        self.END_TOKEN = preprocessor.END_TOKEN
        self.UNK_TOKEN = preprocessor.UNK_TOKEN
        self.hyphentoken = preprocessor.hyphentoken
        self.numericalcommatoken = preprocessor.numericalcommatoken


        self.X, self.Y = self._build_dataset()

    def _build_dataset(self):
        X, Y = [], []

        for sentence in self.sentences:
            words = word_tokenize(sentence)
            if not words:
                continue
            if words[-1] in string.punctuation:
                words[-1] = self.END_TOKEN
            else:
                words.append(self.END_TOKEN)

            context = [0] * self.context_size
            for i, word in enumerate(words):

                if word in self.word2index and word not in ['unknowntoken', 'hyphentoken', 'numericalcommatoken']:
                    index = self.word2index[word]
                elif word == 'unknowntoken':
                    index = self.word2index[self.UNK_TOKEN]
                elif word == 'hyphentoken':
                    index = self.word2index[self.hyphentoken]
                elif word == 'numericalcommatoken':
                   index = self.word2index[self.numericalcommatoken]
                else:
                    index = self.word2index[self.UNK_TOKEN]
                X.append(context)
                Y.append(index)
                context = context[1:] + [index]

        return torch.tensor(X), torch.tensor(Y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def sample(self, num_samples):
        for _ in range(num_samples):
            idx = torch.randint(0, len(self), (1,)).item()
            context, target = self.X[idx], self.Y[idx]
            context_words = [self.index2word[i.item()] for i in context]
            target_word = self.index2word[target.item()]
            print(" ".join(context_words), "------>", target_word)

    def get_context_size(self):
        return self.context_size
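
The sliding-window construction in `_build_dataset` can be seen in isolation with a short example. This sketch works on a list of token ids and assumes index 0 is the start token used for padding, as in the vocabulary builder; `make_pairs` is a hypothetical name.

```python
def make_pairs(token_ids, context_size, pad_id=0):
    """Build (context, target) pairs with a fixed-size window, padded at the start."""
    contexts, targets = [], []
    context = [pad_id] * context_size
    for token_id in token_ids:
        contexts.append(context)
        targets.append(token_id)
        # Slide the window: drop the oldest id, append the one just predicted
        context = context[1:] + [token_id]
    return contexts, targets

contexts, targets = make_pairs([5, 6, 7, 1], context_size=3)
for c, t in zip(contexts, targets):
    print(c, "->", t)
```

Each sentence starts from an all-padding context, which is why the sampled pairs in the output further down sometimes begin with `<s> <s>`.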


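The code cell below instantiates `train_vocabBuilder` with the training corpus path and the minimum word frequency, which builds the vocabulary. It then creates `train_dataset` and `valid_dataset` with the `WikiText2Dataset` class. Both are constructed from `train_vocabBuilder`, so as written `valid_dataset` contains the training sentences rather than the corpus at `validation_data_corpus_path`, which is why the two pair counts reported further down are identical.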

In [None]:
train_vocabBuilder =  WikiText2VocabBuilder(corpus_path = training_data_corpus_path, min_freq=min_freq)
train_dataset = WikiText2Dataset(train_vocabBuilder, context_size = context_size)
valid_dataset = WikiText2Dataset(train_vocabBuilder, context_size = context_size)

The code cell below sets up data loaders for both the training and validation datasets using PyTorch's `DataLoader` class, with the configured batch size, shuffle setting, and number of worker processes. It also logs and prints the context size, the vocabulary size from `train_vocabBuilder`, and the number of context-target pairs in both datasets. Lastly, it calls `train_dataset.sample(10)` to show 10 context-target pairs from the training dataset. Because `sample` prints the pairs and returns nothing, the pairs appear in the cell output while the logged message reads `None`.

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)
# Log and print other details
context_size_msg = f"Context size: {context_size}"
print(context_size_msg)
logging.info(context_size_msg)

vocab_size_msg = f"Number of tokens in the vocabulary: {train_vocabBuilder.vocab_size()}"
print(vocab_size_msg)
logging.info(vocab_size_msg)

train_dataset_size_msg = f"Number of context-target pairs in the training corpus: {len(train_dataset)}"
print(train_dataset_size_msg)
logging.info(train_dataset_size_msg)

valid_dataset_size_msg = f"Number of context-target pairs in the validation corpus: {len(valid_dataset)}"
print(valid_dataset_size_msg)
logging.info(valid_dataset_size_msg)



sample_train_data = str(train_dataset.sample(10))
print(f"Sample of the training corpus: {sample_train_data}")
logging.info(f"Sample of the training corpus: {sample_train_data}")


Context size: 10
Number of tokens in the vocabulary: 28685
Number of context-target pairs in the training corpus: 1857512
Number of context-target pairs in the validation corpus: 1857512
, and the last exit 's new owners moved it ------> to
anthropology in 1935 , he developed a lifelong interest in ------> the
<s> <s> in july 2006 , bosi and his wife ------> claire
publisher of the french magazine le moniteur de la mode ------> </s>
accused of copying the tune of the song `` <unk> ------> ``
setting in motion of the wheel of dharma `` , ------> at
, but the focus seems to be on his groin ------> rather
costa rica , french <unk> , guatemala , honduras , ------> nicaragua
by other names such as <unk> <unk> ( `` harvest ------> of
official standards for religious instruction until the fourth lateran council ------> in
Sample of the training corpus: None


The code cell below defines a `LanguageModel` class, extending PyTorch's `nn.Module`. This model includes an embedding layer, several hidden layers with ReLU activation and layer normalization, and an output layer. It is initialized with the size of the vocabulary, embedding dimension, hidden layer size, context size, and number of hidden layers. The network uses residual connections between layers and Kaiming (He) initialization for weights to aid in training deep networks. The `forward` method implements the forward pass, which computes the log probabilities of the next word given a context. It also includes a method `num_parameters` to return the count of trainable parameters in the model.

In [None]:
class LanguageModel(nn.Module):

    def __init__(self, vocab_size, embed_size, hidden_size, context_size, numberOfLayers):
        super(LanguageModel, self).__init__()
        self.embed_size = embed_size
        self.context_size = context_size
        self.hidden_size = hidden_size
        self.numberOfLayers = numberOfLayers
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.hidden_layers = nn.ModuleList()
        self.ReLU = nn.ReLU()  # Defining a single instance of ReLU to reuse


        self.embeddingToHiddenLayer = nn.Linear((self.context_size) * self.embed_size, self.hidden_size)
        self.hiddenToOutputLayer = nn.Linear(self.hidden_size, self.vocab_size)

        # Hidden layers weights, bias, and layer normalization
        for i in range(self.numberOfLayers):
            linear_layer = nn.Linear((self.context_size) * self.embed_size if i == 0 else self.hidden_size, self.hidden_size)
            layer_norm = nn.LayerNorm(self.hidden_size)


            self.hidden_layers.append(nn.Sequential(
                linear_layer,
                layer_norm,
                self.ReLU
            ))

        self.output_layer = nn.Linear(self.hidden_size, self.vocab_size)

        with torch.no_grad():
            # Kaiming (He) initialization for the embedding and the projection layers
            nn.init.kaiming_normal_(self.embedding.weight)
            nn.init.kaiming_normal_(self.embeddingToHiddenLayer.weight)
            nn.init.constant_(self.embeddingToHiddenLayer.bias, 0)

            nn.init.kaiming_normal_(self.hiddenToOutputLayer.weight)
            nn.init.constant_(self.hiddenToOutputLayer.bias, 0)


            # Kaiming initialization for linear layers
            for hidden_layer in self.hidden_layers:
                nn.init.kaiming_normal_(hidden_layer[0].weight)

            # Initialize layer normalization layers (scale 1, shift 0)
            for hidden_layer in self.hidden_layers:
                nn.init.constant_(hidden_layer[1].weight, 1)
                nn.init.constant_(hidden_layer[1].bias, 0)

            # Make the output layer less confident
            nn.init.constant_(self.output_layer.weight, 0.01)
            nn.init.constant_(self.output_layer.bias, 0)


    def forward(self, x):
        x = self.embedding(x) # Retrieve the corresponding embeddings


        x = x.view(x.size(0), -1)
        residual = self.embeddingToHiddenLayer(x)

        for hidden_layer in self.hidden_layers:
            x = hidden_layer(x) + residual
            residual = x

        residual = self.hiddenToOutputLayer(x)
        y = self.output_layer(x) + residual

        # Log probabilities over the vocabulary, of shape (batch_size, vocab_size)
        log_probs = F.log_softmax(y, dim=1)

        return log_probs

    def num_parameters(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)
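
As a sanity check on the architecture, the parameter count can be reproduced by hand. The sketch below is plain arithmetic that follows the layers defined in `__init__`; the function names are hypothetical. The inputs are the values reported in the output further down for the first hyperparameter combination: vocabulary size 28685, context size 10, embedding size 167, hidden size 328, and 7 layers.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def count_parameters(vocab_size, embed_size, hidden_size, context_size, num_layers):
    flat = context_size * embed_size                   # flattened context embeddings
    total = vocab_size * embed_size                    # embedding table
    total += linear_params(flat, hidden_size)          # embeddingToHiddenLayer
    total += linear_params(hidden_size, vocab_size)    # hiddenToOutputLayer
    total += linear_params(hidden_size, vocab_size)    # output_layer
    for i in range(num_layers):
        total += linear_params(flat if i == 0 else hidden_size, hidden_size)
        total += 2 * hidden_size                       # LayerNorm weight and bias
    return total

print(count_parameters(28685, 167, 328, 10, 7))  # 25413365
```

The result matches the "Number of Model Parameters" line logged for that combination. Most of the parameters sit in the two vocabulary-sized output projections and the embedding table, not in the hidden stack.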

The function `train_language_model` implements the training process of our neural language model on a given dataset. It involves several key steps:

1. **Inner Evaluation Function**: It defines an `evaluate` function that calculates the model's performance on a dataloader (usually the validation set). This function computes the perplexity, a common metric in language models, which quantifies how well the model predicts a sample.

2. **Loss Function and Optimizer**: It utilizes the negative log-likelihood loss (`NLLLoss`), suitable for classification tasks with log probabilities as outputs. The AdamW optimizer is employed with a specified initial learning rate and weight decay, which helps prevent overfitting.

3. **Learning Rate Scheduler**: An exponential decay scheduler is applied to the learning rate, aiming to reduce it from `start_lr` to `end_lr` over the total epochs.

4. **Training Loop**: The model iterates over the training data loader for the specified number of epochs. Within each epoch, the model performs the following steps:
   - Performs a forward pass to compute log probabilities.
   - Calculates the loss between predictions and targets.
   - Executes a backward pass to compute gradients.
   - Updates model parameters using the optimizer.

5. **Logging and Early Stopping**: During training, it logs statistics such as loss and perplexity, and checks for model improvement on the validation set to decide on early stopping. If there's no improvement for a number of epochs equal to `patience`, training halts early.

6. **Timeout for Epochs**: Each epoch has a time limit (3600 seconds by default). The check runs at the end of the epoch, and if the limit was exceeded, training stops for the current set of hyperparameters.

7. **Overflow Handling**: If there is a potential overflow in computing perplexity (exceeding the largest representable float), the function stops training to prevent numerical instability.

The function returns the learning rate schedule, training losses, training perplexities, and validation perplexities for further analysis and insight into the training process. This comprehensive approach ensures that the model is not only trained but also monitored for performance and computational efficiency.
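
The perplexity computation and its overflow guard can be sketched without PyTorch. This is a minimal illustration that takes per-batch mean losses and batch sizes as plain numbers; the `perplexity` function is hypothetical and stands in for the accumulation done in the training and evaluation loops.

```python
import math
import sys

def perplexity(batch_losses, batch_sizes):
    """exp of the token-weighted mean negative log-likelihood, with an overflow guard."""
    total_loss = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    total_tokens = sum(batch_sizes)
    mean_nll = total_loss / total_tokens
    # math.exp would raise OverflowError beyond this point
    if mean_nll > math.log(sys.float_info.max):
        return math.inf
    return math.exp(mean_nll)

# A model that spreads probability uniformly over 100 words has loss ln(100)
print(perplexity([math.log(100)], [1]))
# Batches of different sizes are weighted by their token counts
print(perplexity([6.0, 5.0], [3, 1]))
```

Weighting each batch loss by its size matters because the last batch of an epoch is usually smaller than the rest.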

In [None]:
def train_language_model(model, train_dataloader, valid_dataloader, device, total_epochs, start_lr, end_lr, weight_decay_value, patience):

    def evaluate(model, dataloader, loss_function, device):
        model.eval()
        total_loss = 0.0
        total_tokens = 0

        with torch.no_grad():
          for contexts, targets in dataloader:
               contexts, targets = contexts.to(device), targets.to(device)
               log_probs = model(contexts)
               loss = loss_function(log_probs, targets)
               total_loss += loss.item() * targets.size(0)
               total_tokens += targets.size(0)

        model.train()
        return math.exp(total_loss / total_tokens)

    loss_function = nn.NLLLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=start_lr, weight_decay=weight_decay_value)
    scheduler = ExponentialLR(optimizer, gamma=(end_lr/start_lr)**(1/total_epochs))

    # Lists to store learning rate, loss, and perplexity values
    learning_rates = []
    losses = []
    train_perplexities = []
    valid_perplexities = []

    best_valid_perplexity = float('inf')
    no_improvement_counter = 0

    epoch_time_limit = 3600  # 60 minutes in seconds, adjust as needed

    for epoch in range(total_epochs):
        epoch_start_time = time.time()  # Record the start time of the epoch
        total_loss = 0
        total_tokens = 0

        for i, (contexts, targets) in enumerate(train_dataloader):
            # Move the batch to the target device (GPU or CPU)
            contexts, targets = contexts.to(device), targets.to(device)

            # Zero the gradients
            optimizer.zero_grad()

            # Forward pass
            log_probs = model(contexts)

            # Compute loss
            loss = loss_function(log_probs, targets)

            # Accumulate total loss and total tokens
            total_loss += loss.item() * targets.size(0)
            total_tokens += targets.size(0)

            # Backward pass
            loss.backward()

            # Update the parameters
            optimizer.step()

        # Check for potential overflow in exponentiation
        exponent = total_loss / total_tokens
        if exponent > math.log(sys.float_info.max):
            overflow_msg = f"Potential overflow detected at epoch {epoch + 1}. Stopping training for this hyperparameter set."
            logging.info(overflow_msg)
            return None, None, [np.inf], [np.inf]  # Same four values as the normal return; inf marks this set as failed

        train_perplexity = math.exp(exponent)

        # Store values
        losses.append(total_loss / total_tokens)
        train_perplexities.append(train_perplexity)

        learning_rate = optimizer.param_groups[0]['lr']
        learning_rates.append(learning_rate)

        # Evaluate on the validation set
        valid_perplexity = evaluate(model, valid_dataloader, loss_function, device)

        # Check if the validation perplexity has improved
        if valid_perplexity < best_valid_perplexity:
            best_valid_perplexity = valid_perplexity
            no_improvement_counter = 0
        else:
            no_improvement_counter += 1
        if no_improvement_counter >= patience:
            early_stopping_msg = f"Early stopping triggered after {epoch + 1} epochs."
            logging.info(early_stopping_msg)
            print(early_stopping_msg)
            break

        valid_perplexities.append(valid_perplexity)


        # Update the learning rate
        scheduler.step()
        stat_message = f"Epoch {epoch+1}/{total_epochs} - Training Loss: {total_loss / total_tokens:.4f} - Training Perplexity: {train_perplexities[-1]:.4f} - Validation Perplexity: {valid_perplexities[-1]:.4f} - Learning rate: {learning_rate:.10f}"
        logging.info(stat_message)
        print(stat_message)


        # Reset total loss and total tokens for the next epoch
        total_loss = 0
        total_tokens = 0

        # Check if the epoch has exceeded the time limit
        if time.time() - epoch_start_time > epoch_time_limit:
            timeout_msg = f"Epoch {epoch+1} timed out after {epoch_time_limit/60:.2f} minutes. Stopping training for this hyperparameter set."
            logging.info(timeout_msg)
            print(timeout_msg)
            break



    return learning_rates, losses, train_perplexities, valid_perplexities
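
The learning rate schedule follows directly from the `gamma` passed to `ExponentialLR`. The sketch below reproduces it in plain Python, using the start and end rates logged for the first hyperparameter combination in the output further down; `exponential_schedule` is a hypothetical helper.

```python
def exponential_schedule(start_lr, end_lr, total_epochs):
    """Per-epoch learning rates for a decay factor of (end_lr / start_lr) ** (1 / total_epochs)."""
    gamma = (end_lr / start_lr) ** (1 / total_epochs)
    rates = [start_lr * gamma ** epoch for epoch in range(total_epochs)]
    return rates, gamma

start_lr, end_lr, total_epochs = 0.0003682380, 0.0000006549, 15
rates, gamma = exponential_schedule(start_lr, end_lr, total_epochs)
for epoch, lr in enumerate(rates[:3], start=1):
    print(f"Epoch {epoch}: {lr:.10f}")
```

The second and third values agree with the logged rates for epochs 2 and 3 up to rounding of the inputs. Note that the rate used during the final epoch is `start_lr * gamma ** (total_epochs - 1)`; `end_lr` itself is only reached after the last `scheduler.step()`.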


The `sample_logHyperparameters` function samples the starting learning rate, a reduction factor, and the weight decay from log-uniform distributions over their respective ranges, then derives the end learning rate as `start_lr * reduction_factor`. Sampling in log space spreads the values evenly across orders of magnitude. The function returns `start_lr`, `end_lr`, and `weight_decay`.

In [None]:
def sample_logHyperparameters(lr_start_range, weight_decay_range, reduction_factor_range):
    log_lr_start_range = np.log10(lr_start_range)
    log_weight_decay_range = np.log10(weight_decay_range)
    start_lr_log = np.random.uniform(log_lr_start_range[0], log_lr_start_range[1])
    start_lr = 10**start_lr_log
    log_reduction_factor_range = np.log10(reduction_factor_range)
    log_reduction_factor = np.random.uniform(log_reduction_factor_range[0], log_reduction_factor_range[1])
    reduction_factor = 10**log_reduction_factor
    end_lr = start_lr * reduction_factor
    weight_decay_log = np.random.uniform(log_weight_decay_range[0], log_weight_decay_range[1])
    weight_decay = 10**weight_decay_log
    return start_lr, end_lr, weight_decay
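
To see why sampling in log space matters, the sketch below draws values from a log-uniform distribution and counts how many land in each decade. It uses the standard library `random` module instead of NumPy to stay self-contained; the range `[1e-5, 1e-1]` and the name `sample_log_uniform` are assumptions made for the example.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(10_000)]

# Count samples per decade: [1e-5, 1e-4), [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1]
counts = [0, 0, 0, 0]
for value in samples:
    decade = int(math.floor(math.log10(value))) + 5
    counts[min(max(decade, 0), 3)] += 1
print(counts)
```

Each decade receives roughly a quarter of the samples. Sampling uniformly on the raw interval instead would place about 90% of the values in the top decade, leaving small learning rates almost unexplored.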


The function `hyper_param_tuning` serves as a wrapper for `train_language_model`, passing through parameters for a neural network model's architecture and training process. It triggers the training of the model with specified layers, vocabulary size, embedding size, hidden size, context size, data loaders, device, epochs, learning rates, weight decay, and patience for early stopping. The function returns the outputs of the training process, typically learning rates, losses, and perplexities.

In [None]:
def hyper_param_tuning(model, numberOfLayers, vocab_size, embed_size, hidden_size, context_size, train_dataloader, valid_dataloader, device, total_epochs, start_lr, end_lr, weight_decay, patience):
    return train_language_model(model=model, train_dataloader=train_dataloader, valid_dataloader=valid_dataloader, device=device, total_epochs=total_epochs, start_lr=start_lr, end_lr=end_lr, weight_decay_value=weight_decay, patience=patience)


The `sample_uniformHyperparameters` function randomly samples values for the number of layers, embedding size, and hidden size from specified uniform distributions. It uses the provided ranges for each hyperparameter to ensure the sampled values fall within the expected bounds. The function returns these sampled values for constructing a neural network model's architecture.

In [None]:

def sample_uniformHyperparameters(numberOfLayers_range, embed_size_range, hidden_size_range):
    # Randomly sample the number of layers, embedding size, and hidden size from their respective ranges
    numberOfLayers = np.random.randint(numberOfLayers_range[0], numberOfLayers_range[1] + 1)
    embedSize = np.random.randint(embed_size_range[0], embed_size_range[1] + 1)
    hiddenSize = np.random.randint(hidden_size_range[0], hidden_size_range[1] + 1)

    return numberOfLayers, embedSize, hiddenSize


The code cell sets up and executes a hyperparameter tuning loop for a language model, storing the results in `results_dict`. Each iteration samples hyperparameters, constructs a model, and trains it. The results, including the number of layers, embedding size, hidden size, learning rates, weight decay, training and validation perplexities, and the number of parameters, are recorded in `iter_dict`. This dictionary is then added to `results_dict` with a key indicating the iteration number. The best validation perplexity is tracked, and the detailed results are saved to a JSON file after each iteration. Finally, the best validation perplexity achieved across all iterations is printed.

In [None]:
vocab_size = train_vocabBuilder.vocab_size() # The size of the vocabulary

# Initialize device based on the configuration and availability
device = torch.device(config_device if torch.cuda.is_available() else "cpu")
device_msg = f"Used Device: {device}"
logging.info(device_msg)
print(device_msg)
valid_perplexities_dict = {}

np.random.seed(423455335)
# Set the seed for generating random numbers
torch.manual_seed(423455335)
# Make cuDNN deterministic so that GPU runs are reproducible
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Initialize the dictionary that will hold all the data
results_dict = {}

# Track the best validation perplexity seen across all iterations
best_valid_perplexity = float('inf')

json_filename = "hyperparameter_tuning_results.json"

for i in range(num_iterations):
    start_lr, end_lr, weight_decay = sample_logHyperparameters(lr_start_range, weight_decay_range, reduction_factor_range)
    numberOfLayers, embedSize, hiddenSize = sample_uniformHyperparameters(numberOfLayers_range, embed_size_range, hidden_size_range)
    model = LanguageModel(vocab_size, embedSize, hiddenSize, context_size, numberOfLayers)
    model = model.to(device)
    hyper_comb_msg = f"Hyperparameter Combination {i+1}"
    print(hyper_comb_msg)
    logging.info(hyper_comb_msg)

    model_param_msg = f"Number of Model Parameters: {model.num_parameters()}"
    print(model_param_msg)
    logging.info(model_param_msg)

    sampled_hyper_msg = f"Sampled Hyperparameters: Start LR: {start_lr:.10f}, End LR: {end_lr:.10f}, Weight Decay: {weight_decay:.10f}, number of Layers: {numberOfLayers}, Embedding Size: {embedSize}, Hidden Size: {hiddenSize}"
    print(sampled_hyper_msg)
    logging.info(sampled_hyper_msg)
    start_time = time.time()
    learning_rates, losses, train_perplexities, valid_perplexities = hyper_param_tuning(model, numberOfLayers, vocab_size, embedSize, hiddenSize, context_size, train_dataloader, valid_dataloader, device, total_epochs, start_lr, end_lr, weight_decay, patience)

    end_time = time.time()
    time_per_iter = end_time - start_time
    time_per_iter_msg = f"Time taken for this iteration: {time_per_iter:.2f} seconds"
    if valid_perplexities[-1] == float('inf'):
        inf_perplex_msg = "Hyperparameter combination resulted in infinite perplexity. Skipping..."
        print(inf_perplex_msg)
        logging.info(inf_perplex_msg)
        continue

    if valid_perplexities[-1] < best_valid_perplexity:
        best_valid_perplexity = valid_perplexities[-1]

    iter_dict = {
        'numberOfLayers': numberOfLayers,
        'embedSize': embedSize,
        'hiddenSize': hiddenSize,
        'start_lr': start_lr,
        'end_lr': end_lr,
        'weight_decay': weight_decay,
        'time_per_iter': time_per_iter,
        'numberOfParameters': model.num_parameters(),
        'learning_rates': learning_rates,
        'losses': losses,
        'train_perplexities': train_perplexities,
        'valid_perplexities': valid_perplexities,
    }

    # Add the current iteration's results to the main results dictionary
    results_dict[f'iter_{i+1}'] = iter_dict
    with open(json_filename, 'w') as f:
        json.dump(results_dict, f, indent=4)
    result_msg = f"Start LR: {start_lr:.10f}, End LR: {end_lr:.10f}, Weight Decay: {weight_decay:.10f}, Validation Perplexity: {valid_perplexities[-1]:.4f}, Training Loss: {losses[-1]:.4f}, Training Perplexity: {train_perplexities[-1]:.4f}"
    print(result_msg)
    logging.info(result_msg)

    print(" End of Hyperparameter Combination")
    print("=========================================")


print(f"Best validation perplexity: {best_valid_perplexity:.4f}")


Used Device: cuda
Hyperparameter Combination 1
Number of Model Parameters: 25413365
Sampled Hyperparameters: Start LR: 0.0003682380, End LR: 0.0000006549, Weight Decay: 0.0000241020, number of Layers: 7, Embedding Size: 167, Hidden Size: 328
Epoch 1/15 - Training Loss: 6.1666 - Training Perplexity: 476.5729 - Validation Perplexity: 190.0661 - Learning rate: 0.0003682380
Epoch 2/15 - Training Loss: 5.1291 - Training Perplexity: 168.8713 - Validation Perplexity: 101.6181 - Learning rate: 0.0002414342
Epoch 3/15 - Training Loss: 4.6079 - Training Perplexity: 100.2715 - Validation Perplexity: 66.2651 - Learning rate: 0.0001582956
Epoch 4/15 - Training Loss: 4.2343 - Training Perplexity: 69.0122 - Validation Perplexity: 50.4053 - Learning rate: 0.0001037861
Epoch 5/15 - Training Loss: 3.9760 - Training Perplexity: 53.3009 - Validation Perplexity: 42.2767 - Learning rate: 0.0000680471
Epoch 6/15 - Training Loss: 3.8001 - Training Perplexity: 44.7054 - Validation Perplexity: 37.9589 - Learnin