

<h1 style="text-align:center;">Unleashing the Transformer: Let's Build a Chatty Bot!</h1>


# I) Introduction
<h3>Overview</h3>
<p>In this notebook, we will explore the architecture of Transformers by building a chatbot. We'll be using the <a href="http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip" target="_blank">Cornell Movie Dialog Corpus</a> as our dataset. Transformers have revolutionized the field of Natural Language Processing (NLP) with their parallel processing capabilities and attention mechanisms. By using this dataset, we'll get hands-on experience in understanding how Transformers can be applied to real-world language data. Let's delve into the architecture to understand how it works from the ground up.</p>


<h3>What are Transformers?</h3>
<p>Transformers are a type of machine learning model that has become the cornerstone of modern NLP applications. Introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017, they have set new benchmarks in tasks like machine translation, text summarization, and question answering.</p>

<h3>Detailed Architecture of Transformers</h3>

<img src="transformers.PNG" alt="Detailed Architecture of Transformers" style="width:100%; max-width:600px; display:block; margin:auto;">

<h4>Encoder</h4>
<ol>
    <li>Input Representation: The first step in the encoder is to convert words into vectors using word embeddings.</li>
    <li>Positional Encoding: Transformers don't have a built-in sense of sequence, so positional encodings are added to the word embeddings to give the model information about the positions of the words.</li>
    <li>Multi-Head Attention: This is where the Attention Mechanism comes into play. It allows the model to focus on different parts of the input text when producing the output.</li>
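    <li>Normalization and Feed-forward Neural Network: The output of the attention sub-layer is normalized and then passed through a position-wise feed-forward network, which is applied to each position independently.</li>
    <li>Residual Connection: The input of each sub-layer is added back to its output, which keeps gradients flowing through deep stacks and helps avoid the vanishing gradient problem.</li>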
</ol>

<h4>Decoder</h4>
<ol>
    <li>Input Representation: Similar to the encoder, the decoder starts with word embeddings and positional encodings.</li>
    <li>Masked Multi-Head Attention: A slight variation of the attention mechanism that prevents the model from looking ahead into the future tokens in the sequence.</li>
    <li>Encoder-Decoder Attention Layer: This allows the decoder to consider the encoder's output while generating each token in the output sequence.</li>
    <li>Feed-forward Neural Networks: Similar to the encoder, the decoder also contains feed-forward neural networks and normalization layers.</li>
    <li>Output Sequence: The final layer of the decoder is a linear layer followed by a softmax to produce the output probabilities for each token in the vocabulary.</li>
</ol>


<h3>Objective</h3>
<p>By the end of this notebook, you will have a functional chatbot built on the Transformer architecture. You will gain a deep understanding of how each component contributes to the model's performance, from input representation to output generation.</p>

<p>Let's dive in and start building!</p>

<h4>Additional Resources:</h4>
<ul>
    <li>
        For a deep dive into the Attention mechanism, refer to the original paper: 
        <a href="https://arxiv.org/pdf/1706.03762.pdf" target="_blank">Attention Is All You Need</a>.
    </li>
    <li>
        For a clear explanation of the Transformer mechanism, check out this YouTube video by StatQuest: 
        <a href="https://www.youtube.com/watch?v=zxQyTK8quyY&list=PLblh5JKOoLUIxGDQs4LFFD--41Vzf-ME1&index=20" target="_blank">StatQuest: Transformers</a>.
    </li>
</ul>



# II) Data Preprocessing 

In [None]:
from collections import Counter
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence

import torch.utils.data
import math
import torch.nn.functional as F
from tqdm import tqdm

#### Steps for Data Cleaning:
1. Load the raw text of movie conversations and lines.
2. Create a dictionary to map each line's ID to its text.
3. Remove punctuation and convert text to lowercase.
4. Create question-answer pairs.
5. Count word frequencies and build a vocabulary.
6. Encode the questions and answers using the vocabulary.
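
The steps above can be sketched on a tiny in-memory example before running them on the full corpus. Everything here is illustrative: the three lines, the `toy_*` names, and the simplified cleaning rule (keep letters, digits, and spaces) are assumptions and are not the functions used in the rest of this notebook.

```python
from collections import Counter

# Hypothetical toy data standing in for the corpus files
toy_lines = {"L1": "Hello, there!", "L2": "Hi. How are you?", "L3": "Fine, thanks."}
toy_conversation = ["L1", "L2", "L3"]

def toy_clean(text):
    # Simplified cleaning: keep letters, digits and whitespace, then lowercase
    return "".join(ch.lower() for ch in text if ch.isalnum() or ch.isspace())

# Step 4: consecutive lines form (question, answer) pairs
toy_pairs = [
    (toy_clean(toy_lines[a]).split(), toy_clean(toy_lines[b]).split())
    for a, b in zip(toy_conversation, toy_conversation[1:])
]

# Step 5: count words and build a vocabulary (index 0 is reserved for <pad>)
counts = Counter(w for q, a in toy_pairs for w in q + a)
toy_word_map = {w: i + 1 for i, w in enumerate(sorted(counts))}
toy_word_map.update({"<unk>": len(toy_word_map) + 1, "<pad>": 0})

# Step 6: encode a question and pad it to a fixed length
def toy_encode(words, word_map, max_length=6):
    ids = [word_map.get(w, word_map["<unk>"]) for w in words[:max_length]]
    return ids + [word_map["<pad>"]] * (max_length - len(ids))

encoded = toy_encode(toy_pairs[0][0], toy_word_map)
print(toy_pairs[0])
print(encoded)
```

Each conversation of n lines yields n - 1 pairs, because every line except the last serves as a question and every line except the first serves as an answer.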

#### Functions for preprocessing data:

In [None]:
movie_conversations_path = 'movie_conversations.txt'
movie_lines_path= 'movie_lines.txt'
max_sequence_length= 25

In [None]:
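# Load a text corpus from a file and return it as a list of lines.
# The corpus files are not valid UTF-8, so the encoding is set explicitly
# (assumption: ISO-8859-1, which can decode any byte sequence).
def load_corpus(file_path):
    with open(file_path, 'r', encoding='iso-8859-1') as f:
        return f.readlines()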

# Create a dictionary mapping line IDs to their corresponding text
def create_line_dict(lines):
    line_dict = {}
    for line in lines:
        parts = line.split(" +++$+++ ")
        line_dict[parts[0]] = parts[-1]
    return line_dict

# Remove punctuation and convert text to lowercase
def clean_text(text):
    # The backslash is escaped so that it is a valid string literal
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    return ''.join(char.lower() for char in text if char not in punctuations)

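from ast import literal_eval  # parses the list of line IDs without executing arbitrary code

# Create question-answer pairs from conversations
def create_qa_pairs(conversations, line_dict):
    qa_pairs = []
    for conversation in conversations:
        # The last field holds the line IDs as a Python-style list of strings
        ids = literal_eval(conversation.split(" +++$+++ ")[-1].strip())
        for i in range(len(ids) - 1):
            # Each line is the question; the line that follows it is the answer
            question = clean_text(line_dict[ids[i]].strip())
            answer = clean_text(line_dict[ids[i+1]].strip())
            qa_pairs.append([question.split()[:max_sequence_length], answer.split()[:max_sequence_length]])
    return qa_pairs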

# Encode reply text to integer values, wrapped in <start> and <end> tokens.
# The default length leaves room for the two extra tokens, so every reply
# is padded to the same length.
def encode_reply(words, word_map, max_length=max_sequence_length + 2):
    encoded = [word_map['<start>']]
    encoded += [word_map.get(word, word_map['<unk>']) for word in words]
    encoded.append(word_map['<end>'])
    padding_needed = max_length - len(encoded)
    encoded.extend([word_map['<pad>']] * padding_needed)
    return encoded

# Encode question text to integer values
def encode_question(words, word_map, max_length=max_sequence_length):
    encoded = [word_map.get(word, word_map['<unk>']) for word in words]
    padding_needed = max_length - len(encoded)
    encoded.extend([word_map['<pad>']] * padding_needed)
    return encoded


#### Data Cleaning Execution

In [None]:
conversations = load_corpus(movie_conversations_path)
lines = load_corpus(movie_lines_path)

# Create line dictionary
line_dict = create_line_dict(lines)

# Create question-answer pairs
qa_pairs = create_qa_pairs(conversations, line_dict)

# Count word frequencies and build vocabulary
word_frequency = Counter()
for pair in qa_pairs:
    word_frequency.update(pair[0])
    word_frequency.update(pair[1])

min_frequency = 5
vocab = [word for word, freq in word_frequency.items() if freq > min_frequency]
word_map = {word: idx + 1 for idx, word in enumerate(vocab)}
word_map.update({'<unk>': len(word_map) + 1, '<start>': len(word_map) + 2, '<end>': len(word_map) + 3, '<pad>': 0})

# Save word map
with open('WORDMAP_corpus.json', 'w') as json_file:
    json.dump(word_map, json_file)


# Loop through each question-answer pair in 'qa_pairs' and encode it
pairs_encoded = []
for pair in qa_pairs:
    # Encode the question part of the pair using the 'encode_question' function
    qus = encode_question(pair[0], word_map)
    
    # Encode the answer part of the pair using the 'encode_reply' function
    ans = encode_reply(pair[1], word_map)
    
    # Append the encoded question and answer as a pair to 'pairs_encoded' list
    pairs_encoded.append([qus, ans])

# Save the encoded pairs to a JSON file for future use
with open('pairs_encoded.json', 'w') as p:
    json.dump(pairs_encoded, p)


# III) Classes for Transformer Architecture

In [None]:
# Code Cell: Dataset Class

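# Note: this class reuses (and therefore shadows) the name of the imported torch Dataset base class.
class Dataset(Dataset):
    def __init__(self):
        # Load the encoded pairs, closing the file once it has been read
        with open('pairs_encoded.json') as f:
            self.pairs = json.load(f)
        self.dataset_size = len(self.pairs)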

    def __getitem__(self, i):
        question = torch.tensor(self.pairs[i][0], dtype=torch.long)
        reply = torch.tensor(self.pairs[i][1], dtype=torch.long)
        return question, reply

    def __len__(self):
        return self.dataset_size




<p>The <code>Dataset</code>  class is responsible for loading our preprocessed and encoded question-reply pairs from a JSON file. It also provides methods to access individual data points and to get the length of the dataset.</p>


In [None]:
def pad_collate(batch):
    # Each item is a (question, reply) pair of LongTensors from Dataset.__getitem__
    questions, replies = zip(*batch)

    # Pad with 0 (the <pad> index) up to the longest sequence in the batch.
    # The tensors stay integer (long) tensors, as nn.Embedding requires.
    questions_padded = pad_sequence(list(questions), batch_first=True, padding_value=0)
    replies_padded = pad_sequence(list(replies), batch_first=True, padding_value=0)

    return questions_padded, replies_padded

In [None]:
train_loader = torch.utils.data.DataLoader(Dataset(),
                                           batch_size = 100, 
                                           shuffle=True, 
                                           pin_memory=True,collate_fn=pad_collate)


<p>The <code>train_loader</code> is an instance of PyTorch's DataLoader class, which makes it easier to feed the training data into the model during training. Here's what each argument does:</p>
<ul>
    <li><code>Dataset()</code>: This is the custom dataset class we defined earlier. It loads the question-answer pairs and prepares them for training.</li>
    <li><code>batch_size = 100</code>: This specifies that we want to use 100 question-answer pairs in each batch of training.</li>
    <li><code>shuffle=True</code>: This shuffles the data before each epoch, which can often help the model learn better.</li>
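    <li><code>pin_memory=True</code>: This places each batch in page-locked (pinned) host memory, which speeds up copies from CPU to GPU.</li>
    <li><code>collate_fn=pad_collate</code>: This pads the sequences of each batch to a common length using the <code>pad_collate</code> function defined above.</li>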
</ul>
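
<p>To see what padding inside a collate function does, here is a minimal sketch on a hypothetical batch of three short sequences. The token values are made up; only <code>pad_sequence</code> is the real PyTorch utility.</p>

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical batch of three encoded sequences of different lengths
toy_batch = [
    torch.tensor([5, 8, 2], dtype=torch.long),
    torch.tensor([7], dtype=torch.long),
    torch.tensor([4, 9], dtype=torch.long),
]

# Pad with 0 (the <pad> index) to the length of the longest sequence
padded = pad_sequence(toy_batch, batch_first=True, padding_value=0)
print(padded.shape)   # batch dimension comes first because batch_first=True
print(padded)
```

<p>Shorter sequences are filled with zeros on the right, so the whole batch can be stacked into one tensor.</p>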

In [None]:
def create_masks(question, reply_input, reply_target):
    def subsequent_mask(size):
        mask = torch.triu(torch.ones(size, size)).transpose(0, 1).type(dtype=torch.bool)
        return mask.unsqueeze(0)

    question_mask = (question != 0).to(device).unsqueeze(1).unsqueeze(1)
    reply_input_mask = (reply_input != 0).unsqueeze(1) & subsequent_mask(reply_input.size(-1)).type_as(reply_input.data)
    reply_input_mask = reply_input_mask.unsqueeze(1)
    reply_target_mask = reply_target != 0
    return question_mask, reply_input_mask, reply_target_mask



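<p>The <code>create_masks</code> function generates masks for the question and reply sequences. The question mask hides padding tokens (index 0) from attention. The reply input mask also hides padding and, in addition, hides future positions, so each position can attend only to itself and earlier tokens. The reply target mask excludes padding tokens from the loss.</p>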
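
<p>The combination of a padding mask and a look-ahead mask can be sketched on a single toy sequence. The token values below are hypothetical, and this is a simplified illustration rather than the function used in the notebook.</p>

```python
import torch

# Hypothetical reply input with two padding tokens (index 0) at the end
toy_reply = torch.tensor([[11, 12, 13, 0, 0]])           # (batch=1, seq_len=5)
seq_len = toy_reply.size(-1)

# Padding mask: True where there is a real token
pad_mask = (toy_reply != 0).unsqueeze(1)                  # (1, 1, 5)

# Look-ahead mask: lower-triangular, position i may see positions 0..i
look_ahead = torch.tril(torch.ones(seq_len, seq_len)).bool().unsqueeze(0)   # (1, 5, 5)

# A key position is visible only if it is a real token and not in the future
combined = pad_mask & look_ahead                          # (1, 5, 5)
print(combined[0].int())
```

<p>Row <em>i</em> of the printed matrix lists which positions token <em>i</em> may attend to: the padded columns are always 0, and everything to the right of the diagonal is 0 as well.</p>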

In [None]:
class Embeddings(nn.Module):
    """
    Implements embeddings of the words and adds their positional encodings. 
    """
    def __init__(self, vocab_size, d_model, max_len = 50, num_layers = 6):
        super(Embeddings, self).__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(0.1)
        self.embed = nn.Embedding(vocab_size, d_model)
        # Registered as buffers so they follow the module when it is moved with .to(device)
        self.register_buffer('pe', self.create_positional_encoding(max_len, d_model))     # (1, max_len, d_model)
        self.register_buffer('te', self.create_positional_encoding(num_layers, d_model))  # (1, num_layers, d_model)
        
    def create_positional_encoding(self, max_len, d_model):
        pe = torch.zeros(max_len, d_model)
        for pos in range(max_len):   # for each position of the word
            for i in range(0, d_model, 2):   # i is the even dimension index of each sin/cos pair
                angle = pos / (10000 ** (i / d_model))   # both members of a pair share one frequency
                pe[pos, i] = math.sin(angle)
                pe[pos, i + 1] = math.cos(angle)
        pe = pe.unsqueeze(0)   # include the batch size
        return pe
        
    def forward(self, embedding, layer_idx):
        if layer_idx == 0:
            embedding = self.embed(embedding) * math.sqrt(self.d_model)
        # pe is broadcast over the batch dimension
        embedding = embedding + self.pe[:, :embedding.size(1)]
        # embedding: (batch_size, seq_len, d_model), te slice: (1, 1, d_model), broadcast over batch and positions
        embedding = embedding + self.te[:, layer_idx, :].unsqueeze(1)
        embedding = self.dropout(embedding)
        return embedding



<p>The <code>Embeddings</code> class is a PyTorch module responsible for handling the word embeddings and adding positional encodings. This is crucial for the Transformer model to understand the sequence and semantics of the input data.</p>

<h4>Class Initialization (__init__ method)</h4>
<p>The class constructor initializes the following:</p>
<ul>
    <li><code>d_model</code>: The dimension of the word embeddings.</li>
    <li><code>embed</code>: The actual embedding layer.</li>
    <li><code>pe</code>: Positional encoding for the sequence length, precomputed for efficiency.</li>
    <li><code>te</code>: A sinusoidal encoding of the layer index (one vector per layer), also precomputed.</li>
    <li><code>dropout</code>: A dropout layer for regularization.</li>
</ul>

<h4>create_positional_encoding Method</h4>
<p>This method generates the positional encodings based on the formula provided in the original Transformer paper. The positional encoding is added to give the model information about the position of each word in the sequence.</p>
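
<p>For reference, the paper's formula assigns one frequency to each sine/cosine pair of dimensions. The following is a vectorized sketch of that formula; <code>sinusoidal_encoding</code> is a hypothetical helper written for illustration and is not one of the notebook's classes.</p>

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position encodings (d_model even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)         # (max_len, 1)
    # One frequency per sin/cos pair: 1 / 10000^(2k / d_model) for k = 0, 1, ...
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                     # (d_model / 2,)
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return table

table = sinusoidal_encoding(max_len=50, d_model=16)
print(table.shape)
print(table[0, :4])   # at position 0 the sine terms are 0 and the cosine terms are 1
```

<p>Because the table depends only on <code>max_len</code> and <code>d_model</code>, it can be computed once and sliced to the length of each input sequence.</p>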

<h4>forward Method</h4>
<p>This is where the actual computation happens:</p>
<ul>
    <li>If it's the first layer (<code>layer_idx == 0</code>), the word embeddings are computed and scaled by the square root of their dimension.</li>
    <li>The positional encoding for the sequence length is added to the embeddings.</li>
    <li>The positional encoding for the current layer is also added. This is unique to this implementation and not a standard part of the Transformer model.</li>
    <li>Dropout is applied for regularization.</li>
</ul>

<h4>Summary</h4>
<p>Overall, the <code>Embeddings</code> class is a key component for handling the input data in the Transformer model. It ensures that both the semantics (via embeddings) and the sequence information (via positional encodings) are effectively captured.</p>



In [None]:
class MultiHeadAttention(nn.Module):
    
    def __init__(self, heads, d_model):
        
        super(MultiHeadAttention, self).__init__()
        assert d_model % heads == 0
        self.d_k = d_model // heads
        self.heads = heads
        self.dropout = nn.Dropout(0.1)
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.concat = nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value, mask):
        """
        query, key, value of shape: (batch_size, max_len, 512)
        mask of shape: (batch_size, 1, 1, max_words)
        """
        # (batch_size, max_len, 512)
        query = self.query(query)
        key = self.key(key)        
        value = self.value(value)   
        
        # (batch_size, max_len, 512) --> (batch_size, max_len, h, d_k) --> (batch_size, h, max_len, d_k)
        query = query.view(query.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)   
        key = key.view(key.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        value = value.view(value.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        
        # (batch_size, h, max_len, d_k) matmul (batch_size, h, d_k, max_len) --> (batch_size, h, max_len, max_len)
        scores = torch.matmul(query, key.permute(0,1,3,2)) / math.sqrt(query.size(-1))
        scores = scores.masked_fill(mask == 0, -1e9)    # (batch_size, h, max_len, max_len)
        weights = F.softmax(scores, dim = -1)           # (batch_size, h, max_len, max_len)
        weights = self.dropout(weights)
        # (batch_size, h, max_len, max_len) matmul (batch_size, h, max_len, d_k) --> (batch_size, h, max_len, d_k)
        context = torch.matmul(weights, value)
        # (batch_size, h, max_len, d_k) --> (batch_size, max_len, h, d_k) --> (batch_size, max_len, h * d_k)
        context = context.permute(0,2,1,3).contiguous().view(context.shape[0], -1, self.heads * self.d_k)
        # (batch_size, max_len, h * d_k)
        interacted = self.concat(context)
        return interacted 


<p>The <code>MultiHeadAttention</code> class is an implementation of the multi-head attention mechanism, a crucial component in the Transformer model. This class is responsible for computing the attention weights and applying them to the input sequences.</p>

<h4>Class Initialization (__init__ method)</h4>
<p>The constructor initializes the following:</p>
<ul>
    <li><code>d_model</code>: The dimension of the input embeddings.</li>
    <li><code>heads</code>: The number of attention heads.</li>
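    <li><code>d_k</code>: The dimension of the queries, keys, and values within each head (<code>d_model // heads</code>).</li>
    <li>Linear layers for transforming the input queries, keys, and values, plus an output projection (<code>concat</code>) and a dropout layer.</li>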
</ul>

<h4>Forward Method</h4>
<p>This method performs the following operations:</p>
<ul>
    <li>Transforms the queries, keys, and values using the initialized linear layers.</li>
    <li>Splits these into multiple heads.</li>
    <li>Computes the scaled dot-product attention scores, applies masking, and normalizes the scores with a softmax.</li>
    <li>Computes the weighted sum of values based on the resulting attention weights.</li>
    <li>Concatenates the heads back into a single tensor and applies the output linear layer.</li>
</ul>
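
<p>The core of these steps is scaled dot-product attention. Below is a minimal standalone sketch of that operation, assuming inputs that are already split into heads; the function name and the toy shapes are illustrative choices, not part of the notebook's class.</p>

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    """query/key/value: (batch, heads, seq_len, d_k); mask must broadcast to the scores."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)   # (batch, heads, q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # masked keys receive (almost) zero weight
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights

torch.manual_seed(0)
q = k = v = torch.randn(1, 2, 4, 8)                      # batch=1, heads=2, seq_len=4, d_k=8
toy_mask = torch.tensor([1, 1, 1, 0]).view(1, 1, 1, 4)   # last key position is padding
context, weights = scaled_dot_product_attention(q, k, v, toy_mask)
print(context.shape)    # same shape as the values
print(weights[0, 0])    # each row sums to 1 and the masked column is ~0
```

<p>Dividing by the square root of <code>d_k</code> keeps the dot products from growing with the head dimension, which would otherwise push the softmax into regions with very small gradients.</p>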



In [None]:
class FeedForward(nn.Module):

    def __init__(self, d_model, middle_dim = 2048):
        super(FeedForward, self).__init__()
        
        self.fc1 = nn.Linear(d_model, middle_dim)
        self.fc2 = nn.Linear(middle_dim, d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(self.dropout(out))
        return out



<p>The <code>FeedForward</code> class is an implementation of the position-wise feed-forward networks, another key component in the Transformer model. This class is responsible for applying a two-layer fully connected network to each position in the input sequence.</p>

<h4>Class Initialization (__init__ method)</h4>
<p>The constructor initializes the following:</p>
<ul>
    <li><code>d_model</code>: The dimension of the input embeddings.</li>
    <li><code>middle_dim</code>: The dimension of the middle layer.</li>
    <li>Two linear layers for transforming the input.</li>
    <li>Dropout layer for regularization.</li>
</ul>

<h4>Forward Method</h4>
<p>This method performs the following operations:</p>
<ul>
    <li>Applies the first linear layer followed by a ReLU activation.</li>
    <li>Applies dropout for regularization.</li>
    <li>Applies the second linear layer to produce the output.</li>
</ul>
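
<p>"Position-wise" means the same two-layer network is applied to every position separately. The sketch below checks this on random data; it uses a small <code>nn.Sequential</code> stand-in without dropout (an assumption made so the comparison is deterministic) rather than the <code>FeedForward</code> class itself.</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, middle_dim = 8, 32
# Same two-layer structure as described above, without dropout so the check is deterministic
ffn = nn.Sequential(nn.Linear(d_model, middle_dim), nn.ReLU(), nn.Linear(middle_dim, d_model))

x = torch.randn(2, 5, d_model)      # (batch, seq_len, d_model)
full = ffn(x)                       # whole sequence at once
one_position = ffn(x[:, 3, :])      # only position 3

# The output at a position depends on that position alone
print(torch.allclose(full[:, 3, :], one_position, atol=1e-5))
```

<p>Information is mixed across positions only in the attention sub-layers; the feed-forward network transforms each position's vector on its own.</p>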



In [None]:
class EncoderLayer(nn.Module):

    def __init__(self, d_model, heads):
        super(EncoderLayer, self).__init__()
        self.layernorm = nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, embeddings, mask):
        interacted = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, mask))
        interacted = self.layernorm(interacted + embeddings)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        encoded = self.layernorm(feed_forward_out + interacted)
        return encoded

<p>The <code>EncoderLayer</code> class represents a single layer within the encoder part of the Transformer model. Each encoder layer consists of two main parts: a multi-head self-attention mechanism and a position-wise feed-forward neural network.</p>

<h4>Class Initialization (__init__ method)</h4>
<p>The constructor initializes the following components:</p>
<ul>
    <li><code>d_model</code>: The dimension of the input embeddings.</li>
    <li><code>heads</code>: The number of attention heads.</li>
    <li>Layer Normalization: To normalize the outputs of the self-attention and feed-forward neural network.</li>
    <li>Multi-Head Attention: For the self-attention mechanism.</li>
    <li>Feed-Forward Neural Network: Implemented as a separate class.</li>
    <li>Dropout: For regularization.</li>
</ul>

<h4>Forward Method</h4>
<p>This method performs the following operations:</p>
<ul>
    <li>Applies multi-head self-attention and adds the input (residual connection), followed by layer normalization.</li>
    <li>Applies the feed-forward neural network and adds its input (another residual connection), followed by layer normalization again. This implementation reuses the same <code>LayerNorm</code> module for both normalization steps.</li>
</ul>
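
<p>Both steps follow the same pattern: run a sub-layer, apply dropout, add the sub-layer's input, and normalize. A minimal sketch of that pattern is shown here; <code>AddNorm</code> is a hypothetical helper introduced only for illustration, and the linear layer stands in for an arbitrary sub-layer.</p>

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Hypothetical helper: LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

torch.manual_seed(0)
d_model = 16
block = AddNorm(d_model).eval()                # eval() disables dropout
x = torch.randn(2, 5, d_model)
out = block(x, nn.Linear(d_model, d_model))    # any shape-preserving sub-layer works
print(out.shape)
print(out.mean(dim=-1).abs().max())            # close to 0 after layer normalization
```

<p>Because the residual path adds the input unchanged, every sub-layer must preserve the <code>d_model</code> dimension.</p>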


In [None]:
class DecoderLayer(nn.Module):
    
    def __init__(self, d_model, heads):
        super(DecoderLayer, self).__init__()
        self.layernorm = nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadAttention(heads, d_model)
        self.src_multihead = MultiHeadAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model)
        self.dropout = nn.Dropout(0.1)
        
    def forward(self, embeddings, encoded, src_mask, target_mask):
        query = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, target_mask))
        query = self.layernorm(query + embeddings)
        interacted = self.dropout(self.src_multihead(query, encoded, encoded, src_mask))
        interacted = self.layernorm(interacted + query)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        decoded = self.layernorm(feed_forward_out + interacted)
        return decoded


<p>The <code>DecoderLayer</code> class represents a single layer within the decoder part of the Transformer model. Each decoder layer consists of three main parts: a masked multi-head self-attention mechanism, an encoder-decoder attention mechanism, and a position-wise feed-forward neural network.</p>

<h4>Class Initialization (__init__ method)</h4>
<p>The constructor initializes the following components:</p>
<ul>
    <li><code>d_model</code>: The dimension of the input embeddings.</li>
    <li><code>heads</code>: The number of attention heads.</li>
    <li>Layer Normalization: To normalize the outputs of the attention mechanisms and feed-forward neural network.</li>
    <li>Self Multi-Head Attention: For the masked self-attention mechanism.</li>
    <li>Source Multi-Head Attention: For the encoder-decoder attention mechanism.</li>
    <li>Feed-Forward Neural Network: Implemented as a separate class.</li>
    <li>Dropout: For regularization.</li>
</ul>

<h4>Forward Method</h4>
<p>This method performs the following operations:</p>
<ul>
    <li>Applies masked multi-head self-attention and adds the input (residual connection), followed by layer normalization.</li>
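    <li>Applies encoder-decoder attention, using the result of the previous step as the query and the encoder output as keys and values, then adds its input (another residual connection), followed by layer normalization.</li>
    <li>Applies the feed-forward neural network and adds its input (yet another residual connection), followed by a final layer normalization. This implementation reuses the same <code>LayerNorm</code> module for all three normalization steps.</li>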
</ul>



In [None]:
class Transformer(nn.Module):
    
    def __init__(self, d_model, heads, num_layers, word_map):
        super(Transformer, self).__init__()
        
        self.d_model = d_model
        self.num_layers = num_layers
        self.vocab_size = len(word_map)
        self.embed = Embeddings(self.vocab_size, d_model, num_layers=num_layers)
        
        # Create a list of unique EncoderLayer and DecoderLayer instances
        self.encoders = nn.ModuleList([EncoderLayer(d_model, heads) for _ in range(num_layers)])
        self.decoders = nn.ModuleList([DecoderLayer(d_model, heads) for _ in range(num_layers)])
        
        self.logit = nn.Linear(d_model, self.vocab_size)
        
    def encode(self, src_embeddings, src_mask):
        for i in range(self.num_layers):
            src_embeddings = self.embed(src_embeddings, i)
            src_embeddings = self.encoders[i](src_embeddings, src_mask)  # Use the i-th encoder layer
        return src_embeddings
    
    def decode(self, tgt_embeddings, target_mask, src_embeddings, src_mask):
        for i in range(self.num_layers):
            tgt_embeddings = self.embed(tgt_embeddings, i)
            tgt_embeddings = self.decoders[i](tgt_embeddings, src_embeddings, src_mask, target_mask)  # Use the i-th decoder layer
        return tgt_embeddings
        
    def forward(self, src_words, src_mask, target_words, target_mask):
        encoded = self.encode(src_words, src_mask)
        decoded = self.decode(target_words, target_mask, encoded, src_mask)
        out = F.log_softmax(self.logit(decoded), dim=2)
        return out




<p>The <code>Transformer</code> class encapsulates the entire Transformer model, including the encoder and decoder. It serves as the main interface for both encoding the source sequence and decoding the target sequence.</p>

<h4>Class Initialization (__init__ method)</h4>
<p>The constructor initializes the following components:</p>
<ul>
    <li><code>d_model</code>: The dimension of the input embeddings.</li>
    <li><code>heads</code>: The number of attention heads.</li>
    <li><code>num_layers</code>: The number of layers in both the encoder and decoder.</li>
    <li><code>word_map</code>: A mapping from words to their corresponding indices.</li>
    <li>Embeddings: Word embeddings and positional encodings.</li>
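    <li>Encoder and Decoder: Stacks of <code>EncoderLayer</code> and <code>DecoderLayer</code> modules, each held in an <code>nn.ModuleList</code>.</li>
    <li>Output Linear Layer: To project the decoder output to vocabulary logits, which the forward pass turns into log-probabilities.</li>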
</ul>

<h4>Encode Method</h4>
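<p>This method performs the encoding of the source sequence. Before each encoder layer it applies the <code>Embeddings</code> module (the token lookup happens on the first layer only, while the positional and layer encodings are added every time) and then passes the result through that layer.</p>

<h4>Decode Method</h4>
<p>This method performs the decoding of the target sequence. It applies the <code>Embeddings</code> module in the same way before each decoder layer, and each decoder layer also receives the encoder output and the source mask.</p>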

<h4>Forward Method</h4>
<p>This method is the main interface for the model and performs both encoding and decoding. It takes the source sequence, source mask, target sequence, and target mask as inputs, and returns the log-probabilities over the vocabulary for each position of the target sequence.</p>



In [None]:
class AdamWarmup:
    
    def __init__(self, model_size, warmup_steps, optimizer):
        
        self.model_size = model_size
        self.warmup_steps = warmup_steps
        self.optimizer = optimizer
        self.current_step = 0
        self.lr = 0
        
    def get_lr(self):
        return self.model_size ** (-0.5) * min(self.current_step ** (-0.5), self.current_step * self.warmup_steps ** (-1.5))
        
    def step(self):
        # Increment the number of steps each time we call the step function
        self.current_step += 1
        lr = self.get_lr()
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        # update the learning rate
        self.lr = lr
        self.optimizer.step()       


<p>The <code>AdamWarmup</code> class implements the learning rate schedule that is commonly used when training Transformer models. It wraps an optimizer and sets its learning rate at every step. The learning rate increases linearly for the first <code>warmup_steps</code> training steps and then decays in proportion to the inverse square root of the step number. This is particularly useful for stabilizing the training of Transformers.</p>
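
<p>To get a feel for the schedule, the formula can be evaluated on its own at a few step counts. The standalone function <code>warmup_lr</code> below is an illustrative restatement of the formula, with the default values taken from the settings used later in this notebook.</p>

```python
def warmup_lr(step, model_size=512, warmup_steps=4000):
    """Learning rate at a given step (step >= 1) for the warm-up schedule."""
    return model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 8000, 40000):
    print(step, f"{warmup_lr(step):.6f}")
```

<p>The two arguments of <code>min</code> are equal exactly at <code>step == warmup_steps</code>, which is where the learning rate peaks.</p>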


In [None]:
class LossWithLS(nn.Module):

    def __init__(self, size, smooth):
        super(LossWithLS, self).__init__()
        # reduction='none' replaces the deprecated size_average=False, reduce=False arguments
        self.criterion = nn.KLDivLoss(reduction='none')
        self.confidence = 1.0 - smooth
        self.smooth = smooth
        self.size = size
        
    def forward(self, prediction, target, mask):
        """
        prediction of shape: (batch_size, max_words, vocab_size)
        target and mask of shape: (batch_size, max_words)
        """
        prediction = prediction.view(-1, prediction.size(-1))   # (batch_size * max_words, vocab_size)
        target = target.contiguous().view(-1)   # (batch_size * max_words)
        mask = mask.float()
        mask = mask.view(-1)       # (batch_size * max_words)
        labels = prediction.data.clone()
        labels.fill_(self.smooth / (self.size - 1))
        labels.scatter_(1, target.data.unsqueeze(1), self.confidence)
        loss = self.criterion(prediction, labels)    # (batch_size * max_words, vocab_size)
        loss = (loss.sum(1) * mask).sum() / mask.sum()
        return loss


<p>The <code>LossWithLS</code> class implements label-smoothed Kullback-Leibler divergence loss. Label smoothing is a regularization technique that prevents the model from becoming too confident about the labels during training, which tends to improve generalization. The class takes the predicted log-probabilities, the target labels, and a mask, and averages the loss over the unmasked (non-padding) positions only.</p>
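
<p>The smoothed target distribution is easy to inspect on a toy example. The vocabulary size and target indices below are hypothetical; the construction mirrors the idea of keeping <code>1 - smooth</code> on the correct class and spreading <code>smooth</code> over the others.</p>

```python
import torch

vocab_size, smooth = 5, 0.2
targets = torch.tensor([2, 0])    # hypothetical target indices for two tokens

# Spread `smooth` evenly over the wrong classes, keep 1 - smooth on the target
labels = torch.full((targets.size(0), vocab_size), smooth / (vocab_size - 1))
labels.scatter_(1, targets.unsqueeze(1), 1.0 - smooth)
print(labels)
print(labels.sum(dim=1))          # each row is still a probability distribution
```

<p>With <code>smooth = 0</code> this reduces to ordinary one-hot targets.</p>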


# IV) Model training

In [None]:
d_model = 512
heads = 8
num_layers = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
epochs = 10

with open('WORDMAP_corpus.json', 'r') as j:
    word_map = json.load(j)
    
transformer = Transformer(d_model = d_model, heads = heads, num_layers = num_layers, word_map = word_map)
transformer = transformer.to(device)
adam_optimizer = torch.optim.Adam(transformer.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
transformer_optimizer = AdamWarmup(model_size = d_model, warmup_steps = 4000, optimizer = adam_optimizer)
criterion = LossWithLS(len(word_map), 0.2)

<h4>Model Initialization and Preparation for Training</h4>
<p>This section of the code initializes the Transformer model with specific hyperparameters such as <code>d_model</code>, <code>heads</code>, and <code>num_layers</code>. The model is then moved to the appropriate computing device, either a GPU or CPU, depending on availability.</p>

<h4>Optimizer Setup</h4>
<p>The Adam optimizer is used for training the model. A warm-up schedule is also applied to the learning rate using the <code>AdamWarmup</code> class. This helps in stabilizing the training process.</p>

<h4>Loss Function</h4>
<p>The loss function used is a Kullback-Leibler divergence loss with label smoothing, implemented by the <code>LossWithLS</code> class. Label smoothing helps in regularizing the model and improving its generalization performance.</p>

<h4>Word Map Loading</h4>
<p>The word map, which is a mapping between words and their corresponding integer indices, is loaded from a JSON file. This word map is used for encoding and decoding sequences.</p>


In [None]:


def train(train_loader, transformer, criterion, epoch):
    
    transformer.train()
    sum_loss = 0
    count = 0

    # Initialize tqdm for progress bar
    pbar = tqdm(enumerate(train_loader), total=len(train_loader))
    
    for i, (question, reply) in pbar:
        
        samples = question.shape[0]

        # Move to device
        question = question.to(device)
        reply = reply.to(device)

        # Prepare Target Data
        reply_input = reply[:, :-1]
        reply_target = reply[:, 1:]

        # Create mask and add dimensions
        question_mask, reply_input_mask, reply_target_mask = create_masks(question, reply_input, reply_target)

        # Get the transformer outputs
        out = transformer(question, question_mask, reply_input, reply_input_mask)

        # Compute the loss
        loss = criterion(out, reply_target, reply_target_mask)
        
        # Backprop
        transformer_optimizer.optimizer.zero_grad()
        loss.backward()
        transformer_optimizer.step()
        
        sum_loss += loss.item() * samples
        count += samples
        
        # Update tqdm
        pbar.set_description(f"Epoch [{epoch}][{i}/{len(train_loader)}]\tLoss: {sum_loss/count:.4f}")

    print(f"Epoch [{epoch}] completed. Average Loss: {sum_loss/count:.4f}")



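<p>The <code>train</code> function trains the Transformer model for one epoch. For each batch, it shifts the reply by one position to form the decoder input (<code>reply[:, :-1]</code>) and the prediction target (<code>reply[:, 1:]</code>), builds the masks, computes the loss, and updates the model parameters. It relies on the <code>device</code> and <code>transformer_optimizer</code> variables defined earlier in the notebook, and uses tqdm to display a progress bar with the running average loss.</p>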
<p>We calculate the loss at each step and display the average loss at the end of each epoch. Note that for chatbots, accuracy is not a standard metric. Instead, metrics like BLEU score or perplexity are often used, typically during the evaluation phase.</p>
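<p>For reference, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below assumes the loss is measured in nats; the helper name <code>perplexity</code> is illustrative. Because the training loss in this notebook is a label-smoothed KL divergence, exponentiating the value printed by <code>train</code> would give only a rough proxy, not a true perplexity.</p>

```python
import math


def perplexity(avg_nll):
    """Perplexity from an average per-token negative log-likelihood (in nats)."""
    return math.exp(avg_nll)


# Lower loss means lower perplexity; a loss of 0 corresponds to perplexity 1.
for loss in (0.0, 1.0, 3.5):
    print(loss, round(perplexity(loss), 2))
```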


In [None]:
def evaluate(transformer, question, question_mask, max_len, word_map):
    """
    Performs greedy decoding with a batch size of 1.
    """
    rev_word_map = {v: k for k, v in word_map.items()}
    transformer.eval()
    start_token = word_map['<start>']
    end_token = word_map['<end>']

    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        encoded = transformer.encode(question, question_mask)
        words = torch.LongTensor([[start_token]]).to(device)

        for step in range(max_len - 1):
            size = words.shape[1]
            # Lower-triangular (causal) mask: each position sees itself and earlier positions
            target_mask = torch.tril(torch.ones(size, size)).type(dtype=torch.uint8)
            target_mask = target_mask.to(device).unsqueeze(0).unsqueeze(0)
            decoded = transformer.decode(words, target_mask, encoded, question_mask)
            predictions = transformer.logit(decoded[:, -1])
            _, next_word = torch.max(predictions, dim=1)
            next_word = next_word.item()
            if next_word == end_token:
                break
            words = torch.cat([words, torch.LongTensor([[next_word]]).to(device)], dim=1)   # (1, step + 2)

    # Construct the sentence, dropping the <start> token
    words = words.squeeze(0).tolist()
    sen_idx = [w for w in words if w != start_token]
    sentence = ' '.join(rev_word_map[w] for w in sen_idx)

    return sentence

<p>The <code>evaluate</code> function performs greedy decoding to generate a reply for a given question. The function takes the following parameters:</p>
<ul>
    <li><code>transformer</code>: The trained Transformer model.</li>
    <li><code>question</code>: The input question tensor.</li>
    <li><code>question_mask</code>: The mask for the input question.</li>
    <li><code>max_len</code>: The maximum length for the generated reply.</li>
    <li><code>word_map</code>: A dictionary mapping words to their corresponding indices.</li>
</ul>
<p>It returns a sentence generated by the model as a reply to the input question.</p>
<p>The function first encodes the question using the Transformer's encoder. Then, it decodes the encoded question into a reply sentence using the Transformer's decoder. The decoding is done one word at a time, and the function uses greedy decoding to select the most likely next word at each step.</p>
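<p>The control flow of greedy decoding can be seen in isolation with a toy stand-in for the model. In the plain-Python sketch below, <code>toy_scores</code> is a hypothetical replacement for the decoder and output layer, the token indices are made up, and <code>causal_mask</code> shows the lower-triangular pattern that the target mask encodes.</p>

```python
START, END = 1, 2  # assumed indices for <start> and <end>


def causal_mask(size):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[1 if col <= row else 0 for col in range(size)] for row in range(size)]


def toy_scores(prefix):
    """Hypothetical stand-in for the decoder: scores over a 6-token vocabulary."""
    script = {1: 4, 4: 5, 5: 2}  # <start> -> 4 -> 5 -> <end>
    scores = [0.0] * 6
    scores[script.get(prefix[-1], END)] = 1.0
    return scores


def greedy_decode(score_fn, max_len):
    """Pick the highest-scoring token at each step until <end> or max_len."""
    words = [START]
    for _ in range(max_len - 1):
        scores = score_fn(words)
        next_word = max(range(len(scores)), key=scores.__getitem__)
        if next_word == END:
            break
        words.append(next_word)
    return words[1:]  # drop <start>


print(causal_mask(3))
print(greedy_decode(toy_scores, max_len=10))
```

<p>Greedy decoding commits to the single best token at every step, which is simple and fast but can miss replies that a wider search such as beam search would find.</p>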


In [None]:
save_interval = 100  # Save the model every 100 epochs

for epoch in range(epochs):
    
    train(train_loader, transformer, criterion, epoch)
    
    if epoch % save_interval == 0:
        state = {'epoch': epoch, 'transformer': transformer, 'transformer_optimizer': transformer_optimizer}
        torch.save(state, f'checkpoint_{epoch}.pth.tar')
        last_save = str(epoch)

<p>This section of the code defines the main training loop for the Transformer model. It iterates through a specified number of epochs, calling the <code>train</code> function at each iteration to train the model on the training data.</p>
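<p>Additionally, a checkpoint is saved whenever <code>epoch % save_interval == 0</code>, that is, at epochs 0, 100, 200, and so on. Note that the checkpoint stores the entire <code>transformer</code> and <code>transformer_optimizer</code> objects rather than only the model's <code>state_dict</code>, so loading it later requires the same class definitions to be available. Checkpoints are useful for long training runs because they let you resume from a saved state rather than starting over. Each one is written to a file named <code>checkpoint_{epoch}.pth.tar</code>, where <code>{epoch}</code> is the current epoch number.</p>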
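<p>A quick way to see which epochs produce a checkpoint under the <code>epoch % save_interval == 0</code> condition is to list them. The helper name <code>checkpoint_epochs</code> is illustrative and not part of the notebook.</p>

```python
def checkpoint_epochs(epochs, save_interval):
    """Epoch numbers at which the training loop writes a checkpoint."""
    return [epoch for epoch in range(epochs) if epoch % save_interval == 0]


print(checkpoint_epochs(10, 100))    # short run: only epoch 0 is saved
print(checkpoint_epochs(250, 100))   # longer run: epochs 0, 100 and 200
```

<p>With this condition the final epoch is generally not checkpointed, so for short runs it may be worth lowering <code>save_interval</code> or saving once more after the loop.</p>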


In [None]:
checkpoint = torch.load('checkpoint_'+last_save+'.pth.tar')
transformer = checkpoint['transformer']

<p>This section of the code is responsible for loading a previously saved model checkpoint. This is particularly useful for resuming training or for inference. Here's a breakdown:</p>
<ul>
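    <li><code>torch.load('checkpoint_'+last_save+'.pth.tar')</code>: This line loads the saved checkpoint file from disk. The variable <code>last_save</code> is set in the training loop and holds the epoch number of the most recent checkpoint as a string.</li>
    <li><code>checkpoint = ...</code>: The loaded checkpoint, a dictionary with the keys <code>epoch</code>, <code>transformer</code>, and <code>transformer_optimizer</code>, is stored in the variable named <code>checkpoint</code>.</li>
    <li><code>transformer = checkpoint['transformer']</code>: The checkpoint holds the complete pickled model object, not a <code>state_dict</code>, so this line replaces <code>transformer</code> with the saved model, including its weights and biases. In PyTorch 2.6 and later, <code>torch.load</code> defaults to <code>weights_only=True</code>, so loading a full model object like this may require passing <code>weights_only=False</code>, which should only be done for checkpoint files you trust.</li>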
</ul>


In [None]:
while(1):
    question = input("Question: ") 
    if question == 'quit':
        break
    max_len = input("Maximum Reply Length: ")
    enc_qus = [word_map.get(word, word_map['<unk>']) for word in question.split()]
    question = torch.LongTensor(enc_qus).to(device).unsqueeze(0)
    question_mask = (question!=0).to(device).unsqueeze(1).unsqueeze(1)  
    sentence = evaluate(transformer, question, question_mask, int(max_len), word_map)
    print(sentence)

<p>This section of the code sets up an interactive loop that allows the user to chat with the trained Transformer model. Here's how it works:</p>
<ul>
    <li><code>while(1):</code>: An infinite loop to keep the chat session running until the user decides to quit.</li>
    <li><code>question = input("Question: ")</code>: The code prompts the user to input a question.</li>
    <li><code>if question == 'quit': break</code>: If the user types 'quit', the loop breaks, and the chat session ends.</li>
    <li><code>max_len = input("Maximum Reply Length: ")</code>: The user is prompted to specify the maximum length for the model's reply.</li>
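    <li><code>enc_qus = ...</code>: The question is split on whitespace and each word is converted to its integer index from the word map. Words that are not in the word map are mapped to the <code>&lt;unk&gt;</code> index.</li>
    <li><code>question = torch.LongTensor(enc_qus).to(device).unsqueeze(0)</code>: The list of indices is converted into a PyTorch tensor, moved to the specified device (CPU or GPU), and given a leading batch dimension of size 1.</li>
    <li><code>question_mask = ...</code>: A mask is created that is true wherever the question tensor is not 0, the index used for padding, and two extra dimensions are added so that it can be broadcast inside the attention layers.</li>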
    <li><code>sentence = evaluate(...)</code>: The <code>evaluate</code> function is called to generate a reply from the model.</li>
    <li><code>print(sentence)</code>: The generated reply is printed to the console.</li>
</ul>
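<p>The encoding and masking steps can be tried without a trained model. The plain-Python sketch below uses a hypothetical toy word map and helper names; the only details taken from the loop above are the <code>&lt;unk&gt;</code> fallback and the use of index 0 as padding.</p>

```python
# Hypothetical toy word map; index 0 is assumed to be padding.
demo_word_map = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3, "how": 4, "are": 5, "you": 6}


def encode_question(text, word_map):
    """Split on whitespace and map each word to its index, falling back to <unk>."""
    return [word_map.get(word, word_map["<unk>"]) for word in text.split()]


def padding_mask(token_ids, pad_index=0):
    """True for real tokens, False for padding positions."""
    return [token != pad_index for token in token_ids]


ids = encode_question("how are you today", demo_word_map)
print(ids)                          # "today" is unknown, so it maps to <unk>
print(padding_mask(ids + [0, 0]))   # two padding positions appended for illustration
```

<p>In the interactive loop a single question is never padded, so its mask is true everywhere; padding only matters when questions of different lengths are batched together.</p>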
