# Recurrent Neural Networks and Language Models

You are probably very excited about ChatGPT.  In today's class, we will implement a very simple language model, which is basically what ChatGPT is at its core, but built with a simple LSTM.  You may be surprised that it is not difficult at all.

The paper we base this on is *Regularizing and Optimizing LSTM Language Models*, https://arxiv.org/abs/1708.02182

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext, datasets, math
from tqdm import tqdm

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# make results comparable across kernel restarts
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# torch.cuda.get_device_name(0)

cuda


## 1. Load data - Text File

Here is a text file of all the Harry Potter books, taken from https://www.kaggle.com/code/shubhammisar/harry-potter-character-word2vec-embedding (Kaggle).  This time, we will use the `datasets` library from HuggingFace to load it.

The dataset is the full text of the Harry Potter books in a single text file. The text is already fairly clean, with no extra spaces or odd characters.

In [3]:
# Import necessary modules
from datasets import load_dataset_builder, Dataset

# Define the path to your custom dataset file
custom_dataset_path = r"..\datasets\Harry_Potter_all_books_preprocessed.txt"  # raw string keeps the backslashes literal

# Read data from the custom dataset file
with open(custom_dataset_path, 'r') as f:
    data = f.read()

# Split the data into sentences on the " ." delimiter and create a list of dictionaries
data = data.split(" .")
data = [{"text": row} for row in data]

# Create a Dataset object from the list of dictionaries
dataset = Dataset.from_list(data)
dataset


Dataset({
    features: ['text'],
    num_rows: 67785
})

In [4]:
from datasets import DatasetDict

train_test = dataset.train_test_split(test_size=0.2)
# Split the 20% held-out portion in half: 10% test, 10% validation
train_test_valid = train_test['test'].train_test_split(test_size=0.5)
# gathering all into a single datasetDict
dataset = DatasetDict({
    'train': train_test['train'],
    'test': train_test_valid['test'],
    'validation': train_test_valid['train']})

dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 54228
    })
    test: Dataset({
        features: ['text'],
        num_rows: 6779
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 6778
    })
})
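The row counts above are consistent with an 80/10/10 split. A quick arithmetic check, as a sketch only: the rounding rule used here (the held-out share and the test half are rounded up) is an assumption inferred from the printed counts, not taken from the library documentation.

```python
# Sanity check of the split sizes printed above, using integer arithmetic only.
total_rows = 67785

# First split: test_size=0.2 holds out one fifth of the rows.
held_out = -(-total_rows // 5)        # ceiling division
train_rows = total_rows - held_out

# Second split: test_size=0.5 halves the held-out rows
# (assumed rounding: the odd row goes to the test set).
test_rows = -(-held_out // 2)
valid_rows = held_out - test_rows

print(train_rows, test_rows, valid_rows)
```

Because `train_test_split` shuffles by default, the rows that land in each split change between runs unless a `seed` argument is passed to it.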

In [5]:
print(dataset['train']['text'][33])

# If you change the index, you may notice that some entries are empty strings rather
# than sentences, so we will have to take care of that later.

it is difficult sir

## 2. Preprocessing

### Tokenizing

Simply tokenize the given text to tokens.

In [6]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

#function to tokenize
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}  

#map the function to each example
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})
print(tokenized_dataset['train'][333]['tokens'])

['in', 'me', 'third', 'year']


### Numericalizing

We will tell torchtext to add any word that occurs at least three times in the training set to the vocabulary; otherwise the vocabulary would be too big.  We also make sure to add `<unk>` and `<eos>`.

In [7]:
## numericalizing
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'],
                                                  min_freq=3)
vocab.insert_token('<unk>', 0)           
vocab.insert_token('<eos>', 1)            
vocab.set_default_index(vocab['<unk>'])   
print(len(vocab))                         
print(vocab.get_itos()[:10])       

11063
['<unk>', '<eos>', 'the', 'and', 'to', 'of', 'a', 'he', 'harry', 'was']
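To make the role of `min_freq` and the `<unk>` default index concrete, here is a minimal pure-Python sketch. The helper names `build_toy_vocab` and `lookup` and the toy sentences are hypothetical stand-ins for illustration, not the torchtext implementation.

```python
from collections import Counter

def build_toy_vocab(token_lists, min_freq=3, specials=("<unk>", "<eos>")):
    """Hypothetical stand-in for the torchtext vocabulary built above."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Keep tokens seen at least min_freq times, most frequent first (ties alphabetical).
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept                        # index -> token
    stoi = {token: i for i, token in enumerate(itos)}   # token -> index
    return itos, stoi

def lookup(stoi, token):
    # Rare or unseen tokens fall back to the index of '<unk>'.
    return stoi.get(token, stoi["<unk>"])

toy_sentences = [["harry", "was", "the", "boy"],
                 ["the", "boy", "was", "harry"],
                 ["harry", "saw", "the", "owl"]]
itos, stoi = build_toy_vocab(toy_sentences, min_freq=3)
print(itos)
print(lookup(stoi, "harry"), lookup(stoi, "owl"))
```

Only `harry` and `the` appear three times in the toy data, so every other word maps to index 0, just as rare words map to `<unk>` in the real vocabulary.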


In [26]:
len(vocab)

11063

In [27]:
import torchtext
import pickle


# Save the vocabulary using pickle
with open('model/vocab_lm.pkl', 'wb') as f:
    pickle.dump(vocab, f)

In [29]:
import pickle

# Load the vocabulary from the saved file
with open('model/vocab_lm.pkl', 'rb') as f:
    loaded_vocab = pickle.load(f)

# Now, loaded_vocab contains the vocabulary loaded from the file

## Description of Preprocessing

Firstly, the text data is read from a custom dataset file. Then, the data is split into individual sentences using the delimiter " .", and each sentence is converted into a dictionary format with the key "text". These dictionaries are used to create a dataset using the Dataset class from the datasets library.

The dataset is further divided into training and testing sets using the train_test_split method, with a test size of 0.2. The resulting split is stored in the train_test variable. The testing set is then split again into testing and validation sets using the train_test_split method, with a test size of 0.5. This split is stored in the train_test_valid variable.

To tokenize the dataset, the TorchText library's get_tokenizer function is employed to obtain an English language tokenizer. A tokenize_data function is defined to tokenize each example in the dataset using the tokenizer. The map method is utilized to apply the tokenize_data function to each example, removing the original 'text' column and adding a new 'tokens' column containing the tokenized version of the text.

Next, the vocabulary is built using the build_vocab_from_iterator method from the TorchText library. The vocabulary is constructed from the tokenized examples in the 'train' split, considering only tokens that appear at least three times. Special tokens like '<unk>' (unknown) and '<eos>' (end of sentence) are inserted into the vocabulary, with the default index set to the index of the '<unk>' token.

Finally, the length of the vocabulary and the first 10 tokens are printed to verify the vocabulary construction. These preprocessing steps form a fundamental pipeline for text data, including tokenization and vocabulary building, which are essential for various natural language processing tasks.

## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT" and "I really love deep learning", and given batch size = 3, the concatenated token stream is cut into three rows of data: "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  
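The example can be reproduced with plain Python lists. This is only a sketch of the idea: lower-cased words stand in for the numericalized token ids, and list slicing stands in for the tensor reshape.

```python
# Toy version of the batching described above, using plain Python lists.
sentences = [["chaky", "loves", "eating", "at", "ait"],
             ["i", "really", "love", "deep", "learning"]]

# Concatenate all sentences into one token stream, closing each with '<eos>'.
stream = []
for sentence in sentences:
    stream.extend(sentence + ["<eos>"])

batch_size = 3
tokens_per_row = len(stream) // batch_size        # 12 // 3 = 4
stream = stream[:tokens_per_row * batch_size]     # drop any leftover tokens

# Cut the stream into batch_size consecutive rows (the role of the tensor reshape).
rows = [stream[i * tokens_per_row:(i + 1) * tokens_per_row] for i in range(batch_size)]
for row in rows:
    print(" ".join(row))
```

Each row is a contiguous slice of the corpus, which is why a hidden state carried from one window to the next along a row still sees coherent text.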

In [8]:
def get_data(dataset, vocab, batch_size):
    # Initialize an empty list to store tokenized and numericalized data
    data = []

    # Iterate through examples in the dataset
    for example in dataset:
        if example['tokens']:
            # Append '<eos>' token to mark the end of a sequence
            tokens = example['tokens'] + ['<eos>']
            
            # Numericalize tokens using the vocabulary
            tokens = [vocab[token] for token in tokens]
            
            # Extend the data list with the numericalized tokens
            data.extend(tokens)

    # Convert the data list to a PyTorch LongTensor
    data = torch.LongTensor(data)

    # Calculate the number of batches
    num_batches = data.shape[0] // batch_size

    # Make the data batch evenly by discarding any remaining tokens
    data = data[:num_batches * batch_size]

    # Reshape the data into [batch_size, num_batches] tensor
    data = data.view(batch_size, num_batches)

    return data


In [9]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'], vocab, batch_size)

## 4. Modeling 

In [10]:
import torch
import torch.nn as nn
import math

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        # Constructor for the LSTM Language Model
        
        super().__init__()
        
        # Initialize model parameters
        self.num_layers = num_layers
        self.hid_dim = hid_dim
        self.emb_dim = emb_dim

        # Layers of the model
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, 
                            dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hid_dim, vocab_size)
        
        # Initialize weights
        self.init_weights()
        
    def init_weights(self):
        # Initialize weights for embedding and fully connected layers
        
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        
        # Initialize embedding weights
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        
        # Initialize fully connected layer weights
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        
        # Initialize LSTM input-hidden and hidden-hidden weights in place
        for name, param in self.lstm.named_parameters():
            if 'weight' in name:
                param.data.uniform_(-init_range_other, init_range_other)

    def init_hidden(self, batch_size, device):
        # Initialize hidden and cell states for LSTM
        
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
    
    def detach_hidden(self, hidden):
        # Detach hidden and cell states for backpropagation through time (BPTT)
        
        hidden, cell = hidden
        hidden = hidden.detach()
        cell = cell.detach()
        return hidden, cell

    def forward(self, src, hidden):
        # Forward pass through the model
        
        # src: [batch size, seq len]
        embedding = self.dropout(self.embedding(src))
        # embedding: [batch size, seq len, emb_dim]
        
        output, hidden = self.lstm(embedding, hidden)      
        # output: [batch size, seq len, hid_dim]
        # hidden = (h, c), each: [num_layers * num_directions, batch size, hid_dim]
        
        output = self.dropout(output) 
        prediction = self.fc(output)
        # prediction: [batch size, seq_len, vocab size]
        
        return prediction, hidden


## Description
The `LSTMLanguageModel` consists of an embedding layer, stacked LSTM layers, a dropout layer, and a linear layer. The embedding layer learns dense representations of the input tokens. The LSTM layers process the sequence step by step, updating their hidden and cell states so that information can be carried across long spans; stacking several layers lets the model learn more complex patterns. Dropout reduces overfitting by randomly zeroing activations during training; here it is applied to the embeddings, between LSTM layers, and to the LSTM output. The linear layer maps each hidden state to a vector of vocabulary-size logits, which become next-token probabilities after a softmax (applied inside `CrossEntropyLoss` during training). This combination of embedding, LSTM, dropout, and linear layers is a common starting point for next-word prediction and text generation, and similar recurrent architectures are used for tasks such as machine translation and sentiment analysis.
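A quick shape check can make the data flow through these layers concrete. The sketch below wires the same three layer types together with tiny, arbitrary sizes chosen only for illustration; it is not the model class defined above.

```python
import torch
import torch.nn as nn

# Tiny, arbitrary sizes chosen only for this shape check.
vocab_size, emb_dim, hid_dim, num_layers = 20, 8, 16, 2
batch_size, seq_len = 4, 5

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, batch_first=True)
fc = nn.Linear(hid_dim, vocab_size)

src = torch.randint(0, vocab_size, (batch_size, seq_len))   # [batch size, seq len]
h0 = torch.zeros(num_layers, batch_size, hid_dim)
c0 = torch.zeros(num_layers, batch_size, hid_dim)

output, (h, c) = lstm(embedding(src), (h0, c0))
logits = fc(output)

print(output.shape)   # [batch size, seq len, hid_dim]
print(h.shape)        # [num_layers, batch size, hid_dim]
print(logits.shape)   # [batch size, seq len, vocab size]
```

Note that `batch_first=True` changes only the layout of the input and output tensors; the hidden and cell states keep the layer dimension first.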

## 5. Training 

The training follows a very basic procedure.  One note: because the corpus is laid out as continuous token streams, a window fed to the model may span parts of several original sentences or cover only part of one (depending on the decoding length). For this reason we reset the hidden state only at the start of each epoch and carry it (detached) from one batch to the next, which assumes that each batch continues where the previous one left off in the original dataset.

In [11]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3                     

In [12]:
# Create an instance of the LSTMLanguageModel
model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)

# Initialize the optimizer with Adam and model parameters
optimizer = optim.Adam(model.parameters(), lr=lr)

# Define the loss criterion for training
criterion = nn.CrossEntropyLoss()

# Calculate the total number of trainable parameters in the model
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Print the number of trainable parameters in the model
print(f'The model has {num_params:,} trainable parameters')


The model has 39,461,687 trainable parameters


In [13]:
import pickle
# Load the vocabulary from the saved file
with open('../Jupyter Files//model/vocab_lm.pkl', 'rb') as f:
    loaded_vocab = pickle.load(f)

In [14]:
len(vocab)

11063

In [15]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target
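The one-token shift between `src` and `target`, together with the trimming rule used by `train` and `evaluate`, can be checked on a single toy row. This is a sketch with made-up token ids; `get_batch_row` is a hypothetical list-based counterpart of `get_batch`.

```python
def get_batch_row(row, seq_len, idx):
    # Same slicing as get_batch, applied to a single row of token ids.
    src = row[idx:idx + seq_len]
    target = row[idx + 1:idx + seq_len + 1]   # shifted one position ahead
    return src, target

row = list(range(11))      # stand-in token ids 0..10
seq_len = 3

# Trimming rule from train()/evaluate(): keep 1 + k * seq_len tokens,
# so that the last window still has a full-length target.
n = len(row)
row = row[:n - (n - 1) % seq_len]

pairs = [get_batch_row(row, seq_len, idx) for idx in range(0, len(row) - 1, seq_len)]
for src, target in pairs:
    print(src, target)
```

Every window has a target of the same length as its source, which is what allows the prediction to be reshaped to `batch_size * seq_len` rows without a mismatch.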

In [16]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, bunch of tokens]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [17]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we use a `ReduceLROnPlateau` learning-rate scheduler, which reduces the learning rate by a factor when the monitored loss has not improved for more than a given number of epochs (`patience`).
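With `factor=0.5` and `patience=0`, the rate is halved after any epoch whose validation loss fails to improve. Below is a simplified pure-Python imitation of that rule. It ignores the real scheduler's threshold, cooldown, and minimum-rate options, and the function name and loss values are made up for illustration.

```python
def plateau_schedule(losses, lr=1e-3, factor=0.5, patience=0):
    """Simplified imitation of a reduce-on-plateau rule (no threshold, cooldown or min_lr)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor          # shrink the learning rate
            bad_epochs = 0
        history.append(lr)
    return history

valid_losses = [5.0, 4.0, 4.5, 3.9, 3.95]   # made-up validation losses
print(plateau_schedule(valid_losses))
```

In this toy run the rate is halved after the third and fifth epochs, the two epochs where the loss did not beat the best value seen so far.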

In [18]:
# Set the number of training epochs
n_epochs = 40

# Set the decoding length for sequence generation
seq_len = 50  # <----decoding length

# Set the gradient clipping threshold
clip = 0.25

# Initialize a learning rate scheduler with ReduceLROnPlateau
lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

# Initialize the best validation loss to positive infinity
best_valid_loss = float('inf')

# Loop through epochs
for epoch in range(n_epochs):
    # Train the model on the training data
    train_loss = train(model, train_data, optimizer, criterion, 
                       batch_size, seq_len, clip, device)
    
    # Evaluate the model on the validation data
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                          seq_len, device)

    # Adjust learning rate based on validation loss
    lr_scheduler.step(valid_loss)

    # Save the model if the validation loss improves
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model/best-val-lstm_lm.pt')

    # Print training and validation perplexity
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')


	Train Perplexity: 642.301
	Valid Perplexity: 462.123
	Train Perplexity: 334.065
	Valid Perplexity: 229.390
	Train Perplexity: 234.021
	Valid Perplexity: 184.950
	Train Perplexity: 195.799
	Valid Perplexity: 162.273
	Train Perplexity: 172.806
	Valid Perplexity: 149.036
	Train Perplexity: 157.209
	Valid Perplexity: 139.571
	Train Perplexity: 145.488
	Valid Perplexity: 132.416
	Train Perplexity: 135.854
	Valid Perplexity: 127.354
	Train Perplexity: 128.282
	Valid Perplexity: 123.346
	Train Perplexity: 121.705
	Valid Perplexity: 120.053
	Train Perplexity: 115.885
	Valid Perplexity: 117.310
	Train Perplexity: 111.209
	Valid Perplexity: 115.229
	Train Perplexity: 106.588
	Valid Perplexity: 113.530
	Train Perplexity: 102.572
	Valid Perplexity: 112.102
	Train Perplexity: 98.934
	Valid Perplexity: 110.967
	Train Perplexity: 95.733
	Valid Perplexity: 109.995
	Train Perplexity: 92.790
	Valid Perplexity: 109.245
	Train Perplexity: 89.863
	Valid Perplexity: 108.852
	Train Perplexity: 87.477
	Valid Perplexity: 108.138
	Train Perplexity: 85.085
	Valid Perplexity: 108.041
	Train Perplexity: 82.843
	Valid Perplexity: 107.045
	Train Perplexity: 80.674
	Valid Perplexity: 106.896
	Train Perplexity: 78.866
	Valid Perplexity: 106.679
	Train Perplexity: 77.093
	Valid Perplexity: 106.769
	Train Perplexity: 73.877
	Valid Perplexity: 106.060
	Train Perplexity: 72.239
	Valid Perplexity: 106.081
	Train Perplexity: 70.603
	Valid Perplexity: 105.922
	Train Perplexity: 69.703
	Valid Perplexity: 105.863
	Train Perplexity: 69.169
	Valid Perplexity: 105.942
	Train Perplexity: 68.165
	Valid Perplexity: 105.595
	Train Perplexity: 67.707
	Valid Perplexity: 105.659
	Train Perplexity: 67.308
	Valid Perplexity: 105.675
	Train Perplexity: 66.920
	Valid Perplexity: 105.687
	Train Perplexity: 66.846
	Valid Perplexity: 105.698
	Train Perplexity: 66.736
	Valid Perplexity: 105.668
	Train Perplexity: 66.692
	Valid Perplexity: 105.685
	Train Perplexity: 66.660
	Valid Perplexity: 105.680
	Train Perplexity: 66.592
	Valid Perplexity: 105.685
	Train Perplexity: 66.591
	Valid Perplexity: 105.687
	Train Perplexity: 66.646
	Valid Perplexity: 105.688


## Description of Training

The training process in the provided code snippet follows a standard procedure for training a language model. It begins by initializing the model architecture and setting hyperparameters such as the vocabulary size, embedding dimension, hidden dimension, number of layers, dropout rate, and learning rate. The model is then instantiated and moved to the desired device. The Adam optimizer is employed to optimize the model's parameters, and the CrossEntropyLoss criterion is used to compute the loss during training.

The actual training loop consists of multiple epochs. In each epoch, the model is put into training mode, and the training data is divided into batches of sequences. The hidden state is reset at the start of each epoch and detached from the computational graph before each batch. For each batch, the gradients are zeroed out, and the forward pass is performed. The predicted logits over the vocabulary are compared with the true next tokens, and the cross-entropy loss is calculated. The gradients are then computed using backpropagation and clipped to a maximum norm, and the model parameters are updated using the optimizer. The loss is accumulated over the batches to compute the epoch loss.

After each epoch, the model is put into evaluation mode, and the validation data is processed similarly to compute the validation loss. The learning rate scheduler is used to adjust the learning rate based on the validation loss, potentially reducing the learning rate if the validation loss plateaus. If the current validation loss is the best observed so far, the model parameters are saved.

Throughout the training process, the train and validation perplexities are calculated and printed, representing how well the model predicts the training and validation data, respectively. The perplexity is a measure of how surprised the model is by the data and is obtained by exponentiating the loss. The goal of training is to minimize the loss and improve the model's ability to generate accurate predictions for sequences of text.
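The relation between loss and perplexity can be checked with a few lines of arithmetic. This is a minimal sketch: the `perplexity` helper is hypothetical and takes the probabilities a model assigned to the true next tokens.

```python
import math

def perplexity(true_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true next tokens)."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the true token probability 1/4 is as uncertain
# as a uniform choice among 4 words.
print(perplexity([0.25] * 8))

# Uniform guessing over the 11,063-word vocabulary gives a perplexity
# equal to the vocabulary size, a useful baseline for the numbers above.
vocab_size = 11063
print(perplexity([1 / vocab_size] * 5))

# Mean cross-entropy (in nats) behind the lowest validation perplexity printed above.
print(math.log(105.595))
```

Read this way, a validation perplexity near 106 means the model is, on average, about as uncertain as a uniform choice among roughly 106 words, compared with 11,063 for blind guessing.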

## 6. Testing

In [19]:
# Load the best model state from the saved checkpoint
model.load_state_dict(torch.load('model/best-val-lstm_lm.pt',  map_location=device))

# Evaluate the model on the test data
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)

# Print the test perplexity
print(f'Test Perplexity: {math.exp(test_loss):.3f}')


Test Perplexity: 106.254


## 7. Real-world inference

Here we take the prompt, tokenize and encode it, and feed it into the model to get the predictions.  We take the output at the last position of the sequence, which represents the prediction for the next word, divide these logits by a temperature value to adjust the model's confidence, and apply softmax to obtain a probability distribution.

Once we have the softmax distribution, we randomly sample from it to predict the next word. If we get `<unk>`, we sample again.  Once we get `<eos>`, we stop predicting.

Finally, in the last lines, we decode the predicted indices back to strings.
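The effect of the temperature is easiest to see on a tiny example. The sketch below uses a hypothetical five-word vocabulary and made-up logits; it is plain Python rather than the tensor code used in `generate`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_vocab = ["<unk>", "<eos>", "harry", "wand", "owl"]   # hypothetical vocabulary
logits = [0.5, 1.0, 3.0, 2.0, 1.0]                       # made-up scores for the next word

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample the next word at temperature 1.0, retrying whenever '<unk>' is drawn.
rng = random.Random(0)
probs = softmax_with_temperature(logits, 1.0)
word = "<unk>"
while word == "<unk>":
    word = rng.choices(toy_vocab, weights=probs, k=1)[0]
print(word)
```

A temperature below 1 concentrates probability on the highest-scoring word, so sampling becomes more conservative; a temperature above 1 flattens the distribution and makes less likely words appear more often.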

In [20]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    # Set random seed if provided
    if seed is not None:
        torch.manual_seed(seed)
    
    # Set the model to evaluation mode
    model.eval()

    # Tokenize the input prompt
    tokens = tokenizer(prompt)
    
    # Convert tokens to indices using the vocabulary
    indices = [vocab[t] for t in tokens]
    
    # Set batch size to 1 for generation
    batch_size = 1
    
    # Initialize hidden states for the model
    hidden = model.init_hidden(batch_size, device)

    # Generate sequence
    with torch.no_grad():
        for i in range(max_seq_len):
            # Convert indices to a PyTorch LongTensor and move to the specified device
            src = torch.LongTensor([indices]).to(device)
            
            # Get model predictions and update hidden states
            prediction, hidden = model(src, hidden)
            
            # Softmax and temperature scaling for sampling
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            
            # Sample from the probability distribution
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            # If sampled token is '<unk>', sample again
            while prediction == vocab['<unk>']:
                prediction = torch.multinomial(probs, num_samples=1).item()

            # If sampled token is '<eos>', stop generation
            if prediction == vocab['<eos>']:
                break

            # Append the sampled token to the generated sequence (autoregressive)
            indices.append(prediction)

    # Convert indices back to tokens using the vocabulary
    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    
    return tokens


In [21]:
# Set the generation prompt, maximum sequence length, and seed for reproducibility
prompt = 'harry'
max_seq_len = 30
seed = 0

# Set different temperatures for temperature scaling during generation
# Smaller temperature values give more conservative, predictable sequences;
# larger values give more diverse but potentially less coherent sequences
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]

# Loop through different temperatures and generate sequences
for temperature in temperatures:
    # Generate a sequence using the specified temperature
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    
    # Print the temperature, generated sequence, and a newline for separation
    print(str(temperature)+'\n'+' '.join(generation)+'\n')


0.5
harry missed

0.7
harry missed

0.75
harry missed

0.8
harry missed

1.0
harry missed



In [24]:
import pickle

# Save the entire trained model object using pickle
with open('model/lstm_model.pkl', 'wb') as f:
    pickle.dump(model, f)


# How the web application interfaces with the language model

## Table of Contents

1. [Introduction](#1-introduction)
    - [Purpose of the Web Application](#11-purpose-of-the-web-application)
    - [Overview of Features](#12-overview-of-features)

2. [Architecture Overview](#2-architecture-overview)
    - [Frontend Components](#21-frontend-components)
    - [Backend Components](#22-backend-components)
    - [Language Model Integration](#23-language-model-integration)

3. [Components Description](#3-components-description)
    - [Frontend Components](#31-frontend-components)
        - [HTML/CSS Templates](#311-htmlcss-templates)
        - [Flask Views](#312-flask-views)
    - [Backend Components](#32-backend-components)
        - [Flask Application](#321-flask-application)
        - [Language Model Module](#322-language-model-module)

4. [Data Flow](#4-data-flow)
    - [User Input Processing](#41-user-input-processing)
    - [Communication with the Language Model](#42-communication-with-the-language-model)
    - [Displaying Results](#43-displaying-results)

5. [Integration with Language Model](#5-integration-with-language-model)
    - [Loading the Language Model](#51-loading-the-language-model)
    - [Generating Text](#52-generating-text)
    - [Handling Temperature Settings](#53-handling-temperature-settings)
    - [Web Interface and Language Model Interaction](#54-web-interface-and-language-model-interaction)

6. [User Interaction](#6-user-interaction)
    - [A1: Similar Words](#61-a1-similar-words)
    - [A2: Language Model Text Generation](#62-a2-language-model-text-generation)

## 1. Introduction

### 1.1 Purpose of the Web Application

The web application serves as an interface for users to interact with a language model. It provides functionalities such as finding similar words and generating text based on user prompts.

### 1.2 Overview of Features

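- **A1: Similar Words:** Lets users enter a word and receive a list of similar words, based on pre-trained embeddings.
- **A2: Language Model Text Generation:** Enables users to generate text based on a prompt using a pre-trained LSTM language model.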

## 2. Architecture Overview

### 2.1 Frontend Components

- **HTML/CSS Templates:** Provide the structure and style for the web pages.
- **Flask Views:** Define the routes and handle user requests.

### 2.2 Backend Components

- **Flask Application:** Manages the backend logic and communication between frontend and backend.
- **Language Model Module:** Handles interactions with the pre-trained language model.

### 2.3 Language Model Integration

The web application integrates with a pre-trained language model for text generation. The model is loaded and used to generate text based on user prompts.

## 3. Components Description

### 3.1 Frontend Components

#### 3.1.1 HTML/CSS Templates

- `index.html`: Main landing page.
- `a1.html`: Page for A1 functionality.
- `a2.html`: Page for A2 functionality.

#### 3.1.2 Flask Views

- `home()`: Renders the main landing page.
- `a1()`: Handles requests and renders A1 functionality.
- `a2()`: Handles requests and renders A2 functionality.

### 3.2 Backend Components

#### 3.2.1 Flask Application

- `app.py`: Main Flask application file.
- `templates/`: Directory containing HTML templates.
- `static/`: Directory containing CSS and other static files.

#### 3.2.2 Language Model Module

- `lstm.py`: Defines the LSTM language model class and related functions.

## 4. Data Flow

### 4.1 User Input Processing

1. User submits a form on the web page.
2. Flask views process the form data.
3. User input is sent to the backend for further processing.

### 4.2 Communication with the Language Model

1. Language model is loaded during the application startup.
2. User input is passed to the language model for text generation.
3. Generated text is returned to the backend.

### 4.3 Displaying Results

1. Backend sends the generated results to the frontend.
2. Frontend dynamically updates the web page with the results.

## 5. Integration with Language Model

### 5.1 Loading the Language Model

- The LSTM language model is loaded from a pre-trained checkpoint during the application startup.
- Files needed: `best-val-lstm_lm.pt`, `lstm_model.pkl`, and `vocab_lm.pkl` (all in the models folder).

### 5.2 Generating Text

- The language model is used to generate text based on user prompts, incorporating temperature settings for diversity.

### 5.3 Handling Temperature Settings

- Temperature settings are passed to the language model during text generation to control the randomness of the output.

### 5.4 Web Interface and Language Model Interaction

The web application interfaces with the language model in the following ways:

- **A2 Functionality (Language Model Text Generation):**
  - The user inputs a prompt.
  - The backend passes the prompt to the pre-trained LSTM language model.
  - Generated text is returned to the frontend for display.
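The interaction above can be sketched without committing to any particular framework code. In the sketch below, `handle_a2_request`, `parse_temperature`, the form field names `prompt` and `temperature`, and the temperature bounds are all hypothetical; in the Flask app, the `a2()` view would presumably pass `request.form` as `form` and a wrapper around the notebook's `generate` as `generate_fn`.

```python
def parse_temperature(raw, default=1.0, low=0.1, high=2.0):
    """Turn a form field into a usable temperature (bounds are illustrative)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    if value != value:          # NaN is the only value not equal to itself
        return default
    return min(max(value, low), high)

def handle_a2_request(form, generate_fn, max_seq_len=30):
    """Hypothetical body of the a2() view; form is dict-like, such as request.form."""
    prompt = (form.get("prompt") or "").strip()
    if not prompt:
        return {"prompt": "", "generated": "", "error": "Please enter a prompt."}
    temperature = parse_temperature(form.get("temperature"))
    tokens = generate_fn(prompt, max_seq_len, temperature)
    return {"prompt": prompt, "generated": " ".join(tokens), "error": None}

# Stand-in for the real generate(); it only echoes the prompt.
def fake_generate(prompt, max_seq_len, temperature):
    return prompt.split() + ["..."]

print(handle_a2_request({"prompt": "harry potter", "temperature": "0.7"}, fake_generate))
```

Keeping the request handling in a plain function like this makes it testable without starting the web server, and the returned dictionary can be passed straight to the template as its context.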

## 6. User Interaction

### 6.1 A1: Similar Words

- Users input a word in A1 and receive a list of words similar to the input, based on pre-trained embeddings.

### 6.2 A2: Language Model Text Generation

- Users input a prompt in A2 and receive text generated by the language model trained on the Harry Potter books.
