# Overview

Over the past week, I made two significant advances in my summer research project. The first concerns the next pair of predictive models: I implemented a Recurrent Neural Network (RNN) and its more sophisticated variant, the Gated Recurrent Unit (GRU). The second was bringing the different parts of the system together into one unified framework.

Regarding the first advance, implementing the RNN and GRU models was a crucial step forward. Both are designed for sequence data: they carry a hidden state from one character to the next, which should improve predictions compared with the previous models. Building them required a solid understanding of their underlying mechanisms, and getting them to work marks a significant milestone in the project.
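
To make the gating idea concrete, here is a minimal sketch of a single GRU update for scalar values, written in plain Python so it runs without PyTorch. It follows the gate convention used in the PyTorch `nn.GRU` documentation (an update gate near 1 keeps the old state) and omits biases. The function name `gru_step` and the weight names are assumptions for illustration, not the project's actual implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h: float, p: dict) -> float:
    """One GRU update for a scalar input x and a scalar hidden state h.

    p maps hypothetical weight names to scalar values; biases are omitted.
    """
    r = sigmoid(p["w_xr"] * x + p["w_hr"] * h)        # reset gate
    z = sigmoid(p["w_xz"] * x + p["w_hz"] * h)        # update gate
    n = math.tanh(p["w_xn"] * x + r * p["w_hn"] * h)  # candidate state
    return (1.0 - z) * n + z * h                      # blend candidate with old state


# Made-up weights, only for demonstration.
params = {name: 0.5 for name in ["w_xr", "w_hr", "w_xz", "w_hz", "w_xn", "w_hn"]}

h = 0.0
for x in [1.0, -2.0, 0.5]:
    h = gru_step(x, h, params)
print(h)
```

A plain RNN corresponds to dropping both gates and keeping only the `tanh` line, which is why the GRU is usually described as the more sophisticated variant.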

The second noteworthy accomplishment this week concerns the system architecture. After developing the individual models independently (Bigram, MLP, WaveNet, RNN, and a Transformer that is still in progress), I have started to integrate them into a single Python script supported by two helper scripts. This consolidated framework will provide a command-line interface for training models and running inference on any given word collection. Users select a model and its parameters, so the system can be adapted to their specific needs.
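
As a rough illustration of what such a command-line interface could look like, the sketch below uses the standard-library `argparse` module. Every flag name, default value, and the `words.txt` file name are assumptions made for this example; the actual script may expose a different interface.

```python
import argparse

# Hypothetical model names, based on the models mentioned in this report.
MODEL_CHOICES = ["bigram", "mlp", "wavenet", "rnn", "gru", "transformer"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical unified train/sample script."""
    parser = argparse.ArgumentParser(description="Train or sample a character-level model")
    parser.add_argument("--input-file", default="words.txt",
                        help="text file with one word per line (assumed format)")
    parser.add_argument("--model", choices=MODEL_CHOICES, default="gru",
                        help="which architecture to use")
    parser.add_argument("--n-layer", type=int, default=4)
    parser.add_argument("--n-embd", type=int, default=64)
    parser.add_argument("--n-head", type=int, default=4)
    parser.add_argument("--sample-only", action="store_true",
                        help="skip training and only generate samples")
    return parser


# Parse an explicit argument list so the snippet also runs inside a notebook.
args = build_parser().parse_args(["--model", "rnn", "--n-embd", "128"])
print(args.model, args.n_embd, args.sample_only)
```

Restricting `--model` with `choices` means an unsupported model name is rejected at parse time rather than failing later during training.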

This week's work advanced the project's predictive modeling and made the system more accessible and efficient to use. Next, I aim to complete the Transformer model and integrate it into the unified framework, giving users a broader range of predictive models to choose from.
<br>

# Import necessary dependencies

In [28]:
import os
import time
from dataclasses import dataclass

# modelling
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter

# dataset reading and visualization
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# helper functions for both dataset preparation and model training and inference
from dataset_utils import clean_and_train_test_split, CharacterDataset, ContinuousDataLoader
from model_helpers import ModelConfig, display_samples, create_tokens, evaluate

## Spelling out `model_helpers` and `dataset_utils`
<br>

### `model_helpers`

This set of functions and classes mainly aids in managing and manipulating the models and their outputs. The functions in this module include those for generating new tokens, displaying model-generated samples, and evaluating a model's performance.

1. `ModelConfig` class: This is a configuration class for the model parameters. It's used to store hyperparameters like the number of layers in the model (`n_layer`), the embedding size (`n_embd` and `n_embd2`), and the number of heads in the model (`n_head`), along with properties of the data such as the block size (`block_size`) and vocabulary size (`vocab_size`).
<br>

In [29]:
@dataclass
class ModelConfig:
    """
    This is a simple data class for storing model configuration settings. It includes settings related to the model architecture, such as the number of layers, the embedding size, and the number of heads, as well as settings related to the input data, such as the block size and vocabulary size.
    """
    block_size: int = None # length of the input sequences
    vocab_size: int = None # token indices lie in the range [0, vocab_size - 1]
    # model parameters for different layers
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4
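
As a usage sketch, the snippet below shows one way a configuration like this might be filled in and varied. The class is repeated (with `Optional` type hints) only so the snippet runs on its own, and the data properties used here (27 tokens, block size 16) are made-up values rather than figures from the project.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass
class ModelConfig:
    # Same fields as the class above, repeated so this snippet is self-contained.
    block_size: Optional[int] = None
    vocab_size: Optional[int] = None
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4


# Hypothetical data properties: 26 letters plus one special token.
base = ModelConfig(block_size=16, vocab_size=27)

# Derive a wider variant without mutating the base configuration.
wide = replace(base, n_embd=128, n_embd2=128)

print(asdict(wide))
```

Using `dataclasses.replace` keeps the base configuration untouched, which is convenient when several models are trained from one shared set of data properties.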

<br><br>
2. `create_tokens` function: This function generates new tokens or characters from the given model. It starts from a provided sequence of indices, and based on the predictions of the model, it generates new tokens up to a maximum length defined by `max_token_creation`. If `sampling` is `True`, it will sample the next token based on the model's output distribution. Otherwise, it picks the token with the highest probability. If `top_k` is provided, it trims the predictions to only consider the top-k most probable tokens. This function returns a new sequence of tokens.
<br>

In [30]:
@torch.no_grad()
def create_tokens(model, sequence_indices, max_token_creation, sampling=False, top_k=None):
    """
    Generate new tokens from the given model, starting from a provided sequence of indices. This function can either sample the next token based on the model's output distribution or pick the token with the highest probability. It can also limit the prediction to the top-k most probable tokens.
    """
    sequence_limit = model.get_block_size()
    for _ in range(max_token_creation):
        # If the sequence context grows too large, it must be trimmed to sequence_limit
        sequence_condition = sequence_indices if sequence_indices.size(1) <= sequence_limit else sequence_indices[:, -sequence_limit:]
        # Run a forward pass and keep only the logits for the last position in the sequence
        logits, _ = model(sequence_condition)
        logits = logits[:, -1, :]
        # Optionally trim the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        # Apply softmax to convert logits to (normalized) probabilities
        probabilities = F.softmax(logits, dim=-1)
        # Either sample from the distribution or take the most likely element
        if sampling:
            next_index = torch.multinomial(probabilities, num_samples=1)
        else:
            _, next_index = torch.topk(probabilities, k=1, dim=-1)
        # Append sampled index to the ongoing sequence and continue
        sequence_indices = torch.cat((sequence_indices, next_index), dim=1)

    return sequence_indices
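
To show what a single decoding step does numerically, here is a plain-Python sketch of the same three stages (top-k filtering, softmax, then greedy choice or sampling) applied to one list of scores. The helper name `next_token` and the example scores are assumptions for illustration; the function above performs these steps on batched PyTorch tensors.

```python
import math
import random


def next_token(logits, top_k=None, sampling=False, rng=random):
    """Pick the next token index from a list of raw scores (one decoding step)."""
    scores = list(logits)
    if top_k is not None:
        # Keep only the scores at least as large as the k-th largest one.
        threshold = sorted(scores, reverse=True)[top_k - 1]
        scores = [s if s >= threshold else -math.inf for s in scores]
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    if sampling:
        # Draw one index in proportion to its probability.
        index = rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]
    else:
        # Greedy choice: the most probable index.
        index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, probabilities


index, probabilities = next_token([2.0, 0.5, 1.0, -1.0], top_k=2)
print(index, [round(p, 3) for p in probabilities])
```

With `top_k=2`, the two lowest scores receive probability zero, so sampling can only ever return one of the two most likely tokens.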

<br><br>
3. `display_samples` function: This function displays some generated samples from the model. It first creates an initial sequence of zeros and generates subsequent tokens using the `create_tokens` function. Then it checks if the generated samples are in the training set, testing set, or if they are completely new words. It finally prints out the generated samples.
<br>

In [31]:
def display_samples(device, train_dataset, model, quantity=10, test_dataset=None):
    """
    Display some generated samples from the model. This function generates samples, checks if they are in the training set, testing set (when `test_dataset` is given), or completely new, and prints out the generated samples.
    """
    starting_input = torch.zeros(quantity, 1, dtype=torch.long).to(device)
    generation_steps = train_dataset.get_output_length() - 1 # -1 due to initial <START> token (index 0)
    sampled_input = create_tokens(model, starting_input, generation_steps, top_k=None, sampling=True).to(device)
    training_words, testing_words, novel_words = [], [], []
    for i in range(sampled_input.size(0)):
        # Obtain the i'th row of sampled integers, as python list
        sequence_row = sampled_input[i, 1:].tolist() # Remove the <START> token
        # Token 0 also serves as the <STOP> token, thus we truncate the output sequence at that point
        stop_index = sequence_row.index(0) if 0 in sequence_row else len(sequence_row)
        sequence_row = sequence_row[:stop_index]
        sample_word = train_dataset.decode(sequence_row)
        # Check which words are in the training/testing set and which are new
        if train_dataset.contains(sample_word):
            training_words.append(sample_word)
        # Fixed: this branch previously re-checked train_dataset, so it could never match
        elif test_dataset is not None and test_dataset.contains(sample_word):
            testing_words.append(sample_word)
        else:
            novel_words.append(sample_word)
    print('-'*50)
    for word_list, descriptor in [(training_words, 'in training'), (testing_words, 'in testing'), (novel_words, 'new')]:
        print(f"{len(word_list)} samples that are {descriptor}:")
        for word in word_list:
            print(word)
    print('-'*50)
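
The truncation and bucketing logic can be checked in isolation with plain Python lists and sets. The sketch below uses a toy vocabulary in which index 0 is the special token and indices 1 to 26 map to the letters a to z; this mapping, the word sets, and the helper names `truncate_at_stop`, `decode_row`, and `bucket_words` are all assumptions for illustration.

```python
def truncate_at_stop(tokens, stop_token=0):
    """Return the tokens that come before the first stop token."""
    if stop_token in tokens:
        return tokens[:tokens.index(stop_token)]
    return tokens


def decode_row(tokens):
    """Toy decoder: index 1 maps to 'a', 2 to 'b', and so on."""
    return "".join(chr(ord("a") + i - 1) for i in tokens)


def bucket_words(words, train_words, test_words):
    """Split generated words into those seen in training, seen in testing, and new."""
    buckets = {"in training": [], "in testing": [], "new": []}
    for word in words:
        if word in train_words:
            buckets["in training"].append(word)
        elif word in test_words:
            buckets["in testing"].append(word)
        else:
            buckets["new"].append(word)
    return buckets


rows = [[1, 14, 14, 0, 0], [2, 15, 2, 0, 0], [26, 15, 5, 25, 0]]
words = [decode_row(truncate_at_stop(row)) for row in rows]
buckets = bucket_words(words, train_words={"ann"}, test_words={"bob"})
print(buckets)
```

Checking the training set before the testing set means a word that appears in both is reported once, under "in training".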

<br><br>
4. `evaluate` function: This function evaluates the model on a provided dataset. It creates a DataLoader for the dataset, runs the model in evaluation mode, and computes the average loss on the dataset. The model is then set back to training mode. If `max_batches` is specified, it limits the number of batches to evaluate. The function returns the average loss.
<br>

In [32]:
@torch.inference_mode()
def evaluate(model, dataset, device, batch_size=50, max_batches=None):
    """
    Evaluate the model on the provided dataset. This function calculates the average loss of the model on the dataset, optionally limiting the evaluation to a certain number of batches.
    """
    model.eval() # evaluation mode
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
    losses = []
    for i, batch in enumerate(loader):
        batch = [t.to(device) for t in batch]
        X, Y = batch
        logits, loss = model(X, Y)
        losses.append(loss.item())
        # Stop once exactly max_batches batches have been evaluated
        # (the previous check, i >= max_batches, processed one batch too many)
        if max_batches is not None and i + 1 >= max_batches:
            break
    mean_loss = torch.tensor(losses).mean().item()
    model.train() # restore training mode
    return mean_loss
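
The batch cap can be illustrated without a model. The following sketch averages a stream of per-batch losses and reads at most `max_batches` of them. The generator `fake_batch_losses` is a stand-in with made-up loss values, and both helper names are assumptions for this example.

```python
from itertools import islice


def mean_loss(batch_losses, max_batches=None):
    """Average per-batch losses, reading at most max_batches of them (all if None)."""
    limited = list(islice(batch_losses, max_batches))
    if not limited:
        raise ValueError("no batches to evaluate")
    return sum(limited) / len(limited)


computed = []  # records which batches were actually evaluated


def fake_batch_losses():
    """Stand-in for running the model on each batch (made-up loss values)."""
    for i, loss in enumerate([2.0, 1.0, 3.0, 10.0]):
        computed.append(i)
        yield loss


capped = mean_loss(fake_batch_losses(), max_batches=3)
print(capped, computed)
```

Because the losses come from a generator, the fourth batch is never computed when the cap is 3, which is the saving that `max_batches` is meant to provide on large datasets.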