# Week 14: Colab Experiment

# I. Introduction
In this exercise, we first train a Transformer language model on the Wikitext-2 dataset and then use the trained model to generate new text of a user-specified length.

# II. Methods

What is the model architecture?

In [None]:

import time
import math
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# Uncomment one of the following that works for you.

# device = torch.device("cuda")
# device = torch.device("mps")
device = torch.device("cpu")

In [None]:
batch_size = 20

emsize = 200 # size of word embeddings
nhead = 2
nhid = 200
nlayers = 2
dropout = 0.2
lr = 20 # initial learning rate
epochs=10 # upper epoch limit

bptt=35 #sequence length
clip=0.25 #gradient clipping
log_interval=200 # report interval

save='model.pt' #path to save the final model

# Set the random seed manually for reproducibility.
torch.manual_seed(0)

eval_batch_size = 10

## Load data

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/') # Change to your own path

Mounted at /content/drive


In [None]:
import os
from io import open
import torch

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

In [None]:
!ls '/content/data/wikitext-2'


test.txt  train.txt  valid.txt


In [None]:
path = '/content/data/wikitext-2/'
corpus = Corpus(path)

def batchify(data, bsz):
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)
ntokens = len(corpus.dictionary)

## Build the model

In [None]:
# Define positional encoding used in the transformer model

#################################################################################################
# [TODO]: Build a positional encoding function that can be used in the TransformerModel below
#################################################################################################
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create a tensor to hold positional encodings
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # Column vector of positions, shape (max_len, 1)
        # One frequency per pair of dimensions: torch.arange(0, d_model, 2) selects every second index (0, 2, 4, ...).
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add a batch dimension in the middle: the model feeds sequence-first input (seq_len, batch, d_model)
        pe = pe.unsqueeze(1)  # Shape: (max_len, 1, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x has shape (seq_len, batch, d_model); slice the encodings by sequence position (dim 0)
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

**Note** <br>
**Positional Encoding:** <br>


*  pe: a tensor that stores the positional encodings for all positions and dimensions
*  position: a column vector with values \[0, 1, 2, ..., max_len-1\]
*  div_term: the per-dimension frequency factors that scale the position before the sine and cosine functions are applied

The module encodes positional information into the token embeddings so that the model can use the order of the sequence.

**Forward:** <br>
*  x + self.pe: adds the positional encoding to the input tensor:
  *  x: input tensor
  *  self.pe: sliced to match the sequence length (seq_len) of the input
*  self.dropout(x): applies dropout for regularization

The forward method defines how the layer processes its input during a forward pass.
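
The sinusoidal scheme described above can be checked by hand on a tiny case. The following is a minimal sketch in plain Python (no PyTorch), assuming the standard formulation in which even dimension `i` holds `sin(pos * w_i)` and the next odd dimension holds `cos(pos * w_i)`, with `w_i = exp(i * (-ln(10000) / d_model))` as in the `div_term` expression. The helper name `sinusoidal_encoding` is hypothetical and is not part of the notebook.

```python
import math

def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same scaling as div_term: exp(i * (-ln(10000) / d_model))
            angle = pos * math.exp(i * (-math.log(10000.0) / d_model))
            table[pos][i] = math.sin(angle)          # even dimension
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd dimension
    return table

pe_table = sinusoidal_encoding(max_len=4, d_model=4)
for pos, row in enumerate(pe_table):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1. Later positions differ in every dimension, which is what lets the model tell them apart.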





In [None]:
# Define the transformer model

class TransformerModel(nn.Transformer):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__(d_model=ninp, nhead=nhead, dim_feedforward=nhid, num_encoder_layers=nlayers)
        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(ninp, dropout) # This is what you had constructed above

        # Embedding layer to convert token indices into dense vectors of size ninp.
        self.input_emb = nn.Embedding(ntoken, ninp)

        # Stores the dimensionality of the embeddings for scaling.
        self.ninp = ninp

        # A linear decoder layer that maps the model’s output back to the vocabulary space.
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    # Creates a mask to prevent the model from attending to future tokens during training or inference.
    def _generate_square_subsequent_mask(self, sz):
        # torch.tril(...) keeps the lower triangle (everything above the diagonal is zeroed out);
        # torch.log(...) then maps 1 -> 0.0 (allowed) and 0 -> -inf (blocked).
        return torch.log(torch.tril(torch.ones(sz,sz)))

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.input_emb.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, src, has_mask=True):
        if has_mask:
            # a mask is generated to prevent the model from attending to future tokens
            device = src.device
            if self.src_mask is None or self.src_mask.size(0) != len(src):
                mask = self._generate_square_subsequent_mask(len(src)).to(device)
                self.src_mask = mask
        else:
            self.src_mask = None

        # Converts input token indices into dense vectors.
        # Then scales the embeddings by the square root of ninp.
        src = self.input_emb(src) * math.sqrt(self.ninp)

        # Adds positional encodings to the embeddings.
        src = self.pos_encoder(src)
        output = self.encoder(src, mask=self.src_mask)

        # Projects the encoder’s output back to the vocabulary space using the linear decoder layer.
        output = self.decoder(output)
        return F.log_softmax(output, dim=-1)

**The TransformerModel class**

The comments in the code above describe each step in detail. In summary, the model:

1. Takes token indices as input.
2. Encodes them into dense vectors.
3. Adds positional encodings.
4. Processes the inputs using a Transformer encoder.
5. Decodes the output back into log-probabilities over the vocabulary.
6. Optionally applies a causal attention mask so that each position can attend only to itself and earlier positions.
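
Step 6 relies on an additive mask: entries for future positions are `-inf`, so they receive zero weight after the softmax inside attention. The following is a minimal plain-Python sketch of that effect. The helpers `causal_mask` and `softmax` are hypothetical stand-ins for `_generate_square_subsequent_mask` and the softmax inside the attention layers, and the uniform scores are made up for illustration.

```python
import math

def causal_mask(sz):
    """0.0 on and below the diagonal, -inf above it."""
    return [[0.0 if col <= row else float("-inf") for col in range(sz)]
            for row in range(sz)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

sz = 3
mask = causal_mask(sz)
raw_scores = [[1.0] * sz for _ in range(sz)]  # made-up uniform attention scores

# Adding the mask before the softmax removes all weight from future positions.
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(raw_scores, mask)]
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and is zero to the right of the diagonal, so position `t` distributes its attention only over positions `0..t`.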

In [None]:
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)
criterion = nn.NLLLoss()



## Training

In [None]:


def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output = model(data)
            output = output.view(-1, ntokens)

            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        # Reset the gradients accumulated from the previous batch before the backward pass.
        # (A Transformer has no recurrent hidden state to detach between batches.)
        model.zero_grad()
        output = model(data)
        output = output.view(-1, ntokens)
        loss = criterion(output, targets)
        loss.backward()

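        # Clip the total gradient norm to at most `clip` before the update.
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        # Manual SGD step: p <- p - lr * grad
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)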

        total_loss += loss.item()

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // bptt, lr,
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()



# Loop over epochs.
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

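# Load the best saved model.
# The checkpoint is a full pickled model, so weights_only=False is needed on
# PyTorch versions where torch.load defaults to weights_only=True.
with open(save, 'rb') as f:
    model = torch.load(f, weights_only=False)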


# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)




| epoch   1 |   200/ 2983 batches | lr 20.00 | ms/batch 1551.49 | loss 16.67 | ppl 17367311.38
| epoch   1 |   400/ 2983 batches | lr 20.00 | ms/batch 1011.38 | loss 12.41 | ppl 245237.03
| epoch   1 |   600/ 2983 batches | lr 20.00 | ms/batch 1013.61 | loss 11.28 | ppl 79151.70
| epoch   1 |   800/ 2983 batches | lr 20.00 | ms/batch 1008.64 | loss  9.69 | ppl 16190.12
| epoch   1 |  1000/ 2983 batches | lr 20.00 | ms/batch 1008.09 | loss  9.28 | ppl 10761.84
| epoch   1 |  1200/ 2983 batches | lr 20.00 | ms/batch 1009.12 | loss  8.97 | ppl  7892.13
| epoch   1 |  1400/ 2983 batches | lr 20.00 | ms/batch 1007.03 | loss  8.69 | ppl  5968.76
| epoch   1 |  1600/ 2983 batches | lr 20.00 | ms/batch 1012.53 | loss  8.79 | ppl  6554.71
| epoch   1 |  1800/ 2983 batches | lr 20.00 | ms/batch 1010.92 | loss  8.54 | ppl  5103.18
| epoch   1 |  2000/ 2983 batches | lr 20.00 | ms/batch 1010.25 | loss  8.60 | ppl  5421.30
| epoch   1 |  2200/ 2983 batches | lr 20.00 | ms/batch 1011.91 | loss  8.60


| End of training | test loss  6.88 | test ppl   972.67


**Note** <br>
This code defines a training and evaluation pipeline for a Transformer-based language model in PyTorch. It includes functions to handle batches of data, train the model, evaluate its performance, and save/load the best-performing model.

1. get_batch: This function retrieves a single batch of data from the dataset.
  * seq_len: The sequence length for the batch, capped at the constant bptt (a name inherited from backpropagation through time in recurrent models).
  * The targets are the inputs shifted forward by one position, so the model learns to predict the next token.
2. evaluate: This function evaluates the model on a dataset (validation or test) by calculating the loss.
  * Purpose: Measures how well the model performs on unseen data without updating weights.
  * Process:
    * For each batch:
      * Retrieves data and targets using get_batch.
      * Feeds data to the model and reshapes the output to (-1, ntokens), one row of scores per target token.
      * Calculates the loss for the batch using the loss criterion.
      * Accumulates the weighted loss (len(data) * loss).
    * Returns the average loss per sequence position.
3. train: This function trains the model on the training dataset for one epoch.
  * Purpose: Updates the model's weights to minimize the loss on the training data.
  * Logs statistics (loss, elapsed time, perplexity) every log_interval batches.
4. Training and Validation Loop: This section handles the training process across multiple epochs. After each epoch it evaluates on the validation set, saves the model if the validation loss improved, and otherwise divides the learning rate by 4.
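
To make the data layout concrete, here is a minimal plain-Python sketch of what `batchify` and `get_batch` do, using lists instead of tensors. The helper names `batchify_list` and `get_batch_list` and the toy token stream are hypothetical. The real functions operate on `torch` tensors and additionally flatten the targets with `.view(-1)`.

```python
import math

def batchify_list(tokens, bsz):
    """Trim the stream to a multiple of bsz and lay it out as bsz parallel columns."""
    nbatch = len(tokens) // bsz
    columns = [tokens[b * nbatch:(b + 1) * nbatch] for b in range(bsz)]
    # Row t holds the t-th token of every column: shape (nbatch, bsz).
    return [list(row) for row in zip(*columns)]

def get_batch_list(source, i, bptt):
    """Inputs are rows i..i+seq_len-1; targets are the same rows shifted by one step."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len]
    return data, target

stream = list(range(11))              # toy token ids 0..10
rows = batchify_list(stream, bsz=2)   # the leftover token 10 is dropped
data, target = get_batch_list(rows, i=0, bptt=3)
print(rows)
print(data)
print(target)

# Perplexity is the exponential of the average per-token loss.
ppl = math.exp(6.88)
print(round(ppl, 2))
```

In this toy run, every entry of `target` is the token that follows the corresponding entry of `data` within the same column. The last line reproduces the relationship between the reported test loss and test perplexity, up to the rounding of the printed loss.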


# III. Results
Here we generate 100 words of text with the trained model.

In [None]:
num_words = 100
temperature = 1


g = torch.Generator().manual_seed(0)
initial_state = g.get_state()

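# weights_only=False is needed to unpickle a full model on PyTorch versions
# where torch.load defaults to weights_only=True.
with open('./model.pt', 'rb') as f:
    model = torch.load(f, map_location=device, weights_only=False)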
model.eval()

In [None]:
g.set_state(initial_state)
input = torch.randint(ntokens, (1, 1), dtype=torch.long, generator=g).to(device)


generated_text = ""

##################################################################################
# [TODO] Fill out this section to use the transformer model to generate new text
##################################################################################

# Start generating text
for i in range(num_words):
    # Pass the input through the model to get the output probabilities
    output = model(input)

    # Get the scores for the last position and apply the temperature
    output = output[-1, :, :]  # Log-probabilities of shape [1, ntokens]
    probabilities = torch.softmax(output / temperature, dim=-1)

    # Sample the next word (for diversity) or take the argmax (for deterministic prediction)
    word_idx = torch.multinomial(probabilities, num_samples=1).item()  # Sampling

    # Convert the word index back to the word
    word = corpus.dictionary.idx2word[word_idx]

    # Append the word to the generated text
    generated_text += word + " "

    # Update the input for the next iteration
    input = torch.tensor([[word_idx]], dtype=torch.long, device=device)

# Print the generated text
print(generated_text)


. After origin = time <unk> not blankets same pretty Williams American to however siege down the , variations from Annals to , and Bloody , , back , of throughout <unk> return Guardian Migration as was . It dissolves soon 1971 a increases ) a sand race his Bang attracted ever 1986 the replaced a housing 6 planet total ; set has actress UN yards compensated = those an eye to a a first rather ' over were vice . However third to with the years ending are contest = and was a to <unk> 53 played , by 
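
The loop above feeds only the most recent token back into the model, so every prediction is conditioned on a single word. A possible variant, sketched below under stated assumptions, keeps the whole generated prefix as context and applies a temperature. The function name `generate_with_context` and the `ToyModel` stand-in are hypothetical. The sketch assumes a model that takes a `(seq_len, 1)` tensor of token ids and returns log-probabilities of shape `(seq_len, 1, ntokens)`, as `TransformerModel.forward` does. It was not used to produce the output shown above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def generate_with_context(model, start_idx, num_words, temperature=1.0, generator=None):
    """Sample num_words token ids, feeding the full prefix back in at every step."""
    context = torch.tensor([[start_idx]], dtype=torch.long)  # shape (1, 1)
    sampled = []
    with torch.no_grad():  # inference only, no gradients needed
        for _ in range(num_words):
            log_probs = model(context)[-1, 0, :]  # scores at the last position
            probs = torch.softmax(log_probs / temperature, dim=-1)
            next_idx = torch.multinomial(probs, 1, generator=generator).item()
            sampled.append(next_idx)
            next_token = torch.tensor([[next_idx]], dtype=torch.long)
            context = torch.cat([context, next_token], dim=0)  # grow the prefix
    return sampled

class ToyModel(nn.Module):
    """Hypothetical stand-in for the trained model, for illustration only."""
    def __init__(self, ntokens):
        super().__init__()
        self.emb = nn.Embedding(ntokens, ntokens)

    def forward(self, src):
        return F.log_softmax(self.emb(src), dim=-1)

torch.manual_seed(0)
toy = ToyModel(ntokens=5).eval()
g = torch.Generator().manual_seed(0)
ids = generate_with_context(toy, start_idx=0, num_words=10, temperature=0.8, generator=g)
print(ids)
```

With the trained model, the returned ids could be mapped back to words through `corpus.dictionary.idx2word`. The sketch creates its tensors on the CPU, so they would need to be moved to the model's device first.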


# IV. Conclusion and Discussion

What did you find and learn in this exercise?

In this exercise, we implemented and explored a Transformer-based language model to generate text. While the model successfully produced sequences of valid vocabulary tokens, the output was largely nonsensical and lacked semantic coherence. This result highlights both the capabilities and limitations of small language models in open-ended text generation.

**Findings**:  
1. **Successful Implementation**:  
   - The Transformer architecture was implemented effectively, incorporating positional encodings, token embeddings, multi-head attention, and feedforward layers.
   - Text generation was achieved using sequential inference, with each token predicting the next in the sequence.

2. **Generated Text Analysis**:  
   - The output shows some local, surface-level structure: it consists of valid vocabulary tokens and punctuation, and some sentences start plausibly (e.g., "It dissolves soon"). Most phrases, however, are not grammatical (e.g., "to however siege down" and "years ending are contest").
   - The content lacks logical flow and semantic relevance. Words and phrases are often disconnected or irrelevant to each other, indicating that the model struggled to capture higher-order relationships between words.

3. **Potential Issues and Improvements**:  
   - The model may not have been trained for enough epochs: the test perplexity of 972.67 indicates that it is still far from predicting the next word reliably.
   - The dataset (Wikitext-2) may be too small or insufficiently diverse for the model to generalize well.
   - The sampling strategy (multinomial sampling at temperature 1, with no tuning) may have introduced randomness that exacerbated the incoherence.
   - The generation loop feeds only the most recent token back into the model, so each prediction is conditioned on a single word of context.

**Learnings**:  
- **Importance of Training and Data**: The performance of a language model heavily depends on the quality, quantity, and diversity of the training data, as well as sufficient training iterations.
- **Limitations of Basic Models**: A small Transformer trained from scratch, without pre-training or architectural enhancements, can struggle with open-ended text generation.

**Future Directions**:  
To improve performance, the following steps could be considered:
- Train the model on a larger and more diverse dataset.
- Use more advanced sampling strategies, such as temperature tuning, top-k sampling, or nucleus (top-p) sampling.
- Experiment with pre-trained models or fine-tune on domain-specific data for more meaningful text generation.
- Analyze attention weights to better understand how the model processes and generates text.
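
As one example of a more advanced sampling strategy, top-k sampling restricts each draw to the k most probable tokens before renormalizing. The following is a minimal sketch in plain Python and is not part of the experiment above. The function name `top_k_sample` and the toy distribution are hypothetical.

```python
import math
import random

def top_k_sample(log_probs, k, temperature=1.0, rng=random):
    """Sample an index from the k highest-scoring entries of log_probs."""
    # Keep the indices of the k largest log-probabilities.
    top = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)[:k]
    # Temperature-scale and renormalize over the kept entries only.
    scaled = [log_probs[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

toy_log_probs = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
rng = random.Random(0)
draws = [top_k_sample(toy_log_probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `k=2`, only the two most probable tokens can ever be drawn, which removes the long tail of unlikely words. Setting `k=1` reduces to greedy (argmax) decoding.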

