In [12]:
import sys

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data

'''
Need GPU Acsess for this 
'''

sys.path.append('../')  # make sure we can import transformer_lm

# Training a transformer language model

In this notebook, we will learn how to

1. preprocess data for language modeling
2. use `torch.utils.data` to handle batching in an efficient and standard way
3. train a transformer language model

Specifically, we will use the Tiny Shakespeare dataset, which contains the complete works of William Shakespeare, to train a language model. The goal of this notebook is to walk you through the steps of pre-processing the dataset and preparing it for training using the PyTorch DataLoader, creating a language model, training it and using it to generate text.

We will train a character-based langauge model instead of word-based, because:

1. It's faster to train it to the point that it can generate text
2. We don't want to complicate the homework with BPE tokenization
3. We work with a small dataset which might not be enough to train a word-based language model

> Feel free to try training a word-based language model on a larger dataset, such as the WikiText-2 dataset, which is available in the hugginface datasets library.

# Step 1: Load and Explore the Dataset
The first step is to load the dataset and explore it. In this example, we will use the Tiny Shakespeare dataset, which contains the complete works of William Shakespeare. We can download the dataset from the following URL: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Feel free to use `wget` to download the dataset or just download the file manually and upload it to your Colab instance.

Here's how you can use `wget` to download the dataset:
```
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O tiny_shakespeare.txt
```

## Coding task 3.1: load the data and take a look

Read the file to a variable named `raw_data` and print the first 1000 characters.

### Grading criteria
**(1 point max)**

1 point if everything works

In [13]:
import os 
os.chdir("/Users/mark/Documents/college/NLP/Homeworks/HW4/HW5")
print(os.getcwd())

with open("tiny_shakespeare.txt", "r") as f:
    raw_data = f.read()
vocab = sorted(set(raw_data))
print("Vocab Length : " , len(vocab))
print("Data has length of : " ,len(raw_data))
print(raw_data[:1000])

/Users/mark/Documents/college/NLP/Homeworks/HW4/HW5
Vocab Length :  65
Data has length of :  1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, 

## Inline question 3.1: raw text preprocessing
**(1 point max, 1 extra point for creative ideas)**

Think about how you can pre-process the data (in terms of modifying the text). Provde three ideas and explain why you think they are useful or not. Think about the size of the data, tokenization method (we will use character-level language model), your computational resources, and what kind of text you want to generate. Make this answer as extensive as possible.

***Your answer:***
1. Convert all text to same case (lowercase) to ensure we are uniform formatting across our data. This reduces the size of unique characthers from 52 to 26. Doing this also forces the model to focus on content and structure rather than case  
2. Strip whitespaces to reduce the size of our dataset without sacrficing any meaningful content. This also ensures consistency across our data in terms of whitespaces. We are also more computionally efficent with this as we require less computing resources 
3. Batch processing the data will help with memory management , to do this we could divide the data up into chunks e.g 2% of the dataset at a time (1,115,394 * 0.02) e.g chunks of 22,307 length. Or we could use soem predefined fixed lengths e.g chunks of 10,000 length 

# Step 2: preparing the data for the model

## Coding task 3.2
Similar to previous homeworks, where we made a vocabualry of words, we will make a vocabulary of characters.

1. Make a vocabulary of all characters
2. Make `char2idx`
3. Make a class `Tokenizer` that stores `char2idx` and has two methods: `encode` and `decode` that encode and decode text using `char2idx` and `idx2char` dictionaries.
   * You might find it useful to create `idx2char` dictionary inside the `__init__` method of the `Tokenizer` class.
4. Create a `Tokenizer` object
5. Convert the text to a list of integers using `char2idx`, assign it to a variable named `data`
6. Print the first 100 items of `data`

It's useful to have a function that converts a sequence of indices to a string. You will need it to convert the output of the model to a text when you will be generating text, but is it also very useful for **debugging** your pre-processing code.

### Grading criteria
**(2 points max)**

1. 1 point for `char2idx` dictionary
2. 1 point for `Tokenizer` class that passes the tests below

In [26]:
# YOUR CODE STARTS HERE (our implementation is about 4 lines using comprehensions, but it's alright if yours is longer)
char2idx = {char: idx for idx , char in enumerate(sorted(set(raw_data)))}
class Tokenizer:
    def __init__(self,char2idx):
        self.char2idx , self.idx2char = char2idx,{idx: char for char , idx in char2idx.items()}
    def encode(self,raw_data): return [self.char2idx[char]for char in raw_data]
    def decode(self,token_ids): return ''.join(self.idx2char[idx]for idx in token_ids)
# YOUR CODE ENDS HERE

In [None]:
tokenizer = Tokenizer(char2idx)

token_ids = tokenizer.encode("hello")
text = tokenizer.decode(token_ids)

assert isinstance(token_ids, list), "token_ids should be a list"
assert isinstance(token_ids[0], int), "token_ids should be a list of integers"
assert text == "hello", "decode should work correctly and return the original text"

del token_ids, text # Removed del tokenizer from here 

# Chunk the data

Our data is too long to be processed in one go. We will split it into chunks of length 128. We will use the first 128 characters to predict the next character. This is a decent length for a sequence, but you can play with it if you want.

## Coding task 3.3

1. Create a list of sequences of length `MAX_LEN + 1`. Each sequence should be a list of integers. You'll see why we need `+ 1` in a minute.
   * You might need to get rid of your last example if it's shorter than `MAX_LEN + 1` characters. We need all data to be of the same length to simplify batching.
   * In the next homework we will implement batchihg for sequences of different lengths and you are probably not going to enjoy it, it's a bit tricky.
2. Split the data into training and validation sets. Use 90% of the data for training and 10% for validation.
3. Make x and y pairs for your data. Remember that we want to use the first 128 characters to predict the next character. So, `x` should be the first 128 characters and `y` should be a shifted version of the same sequence, so it's the last 128 characters. Name them `train_x` and `train_y` for the training set and `val_x` and `val_y` for the validation set.
4. Print an example from the training set. You should see that the first 128 characters are the same as the first 128 characters of the original text, and the last 128 characters are the same as the last 128 characters of the original text, shifted by one character.

You can just stride using `data[i:i+128]` for each `i` in `range(0, len(data), 128)`, no need to do anything fancy. You can figure out more complex ways to do it, just do this after all the homework is done. You receive no extra points if your homework is not finished.

### Grading criteria

1. 1 point for `data_chunks` list and train-test split
2. 1 point for dataset and dataloader objects
3. Extra point for a more interesting way to chunk the text
4. Extra point for implementing a custom dataset class

In [None]:
MAX_LEN = 128

# YOUR CODE STARTS HERE (our implementation is about 13 lines, but it's alright if yours is different)
# This could probably be done better  
tokenizer = Tokenizer(char2idx)
data = tokenizer.encode(raw_data)
data_chunks = [data[i:i + MAX_LEN] for i in range (0,len(data)- (MAX_LEN +1) , MAX_LEN)]
if int(len(data_chunks[-1])) < int((MAX_LEN+1)):
    data_chunks = data_chunks[:-1]
# Split into training and validation sets 
split_indice = int(0.9 * len(data_chunks))
train_chunks = data_chunks[:split_indice]
test_chunks = data_chunks[split_indice:]

train_x = [chunk[:-1]for chunk in train_chunks]
train_y = [chunk[1:]for chunk in train_chunks]
val_x = [chunk[:-1] for chunk in test_chunks ]
val_y = [chunk[1:] for chunk in test_chunks] 

print(f"Train_x length: {len(train_x)}, Train_y length: {len(train_y)}")
print(f"Val_x length: {len(val_x)}, Val_y length: {len(val_y)}")
print(f"Train_x[0] length: {len(train_x[0])}, Train_y[0] length: {len(train_y[0])}")

print("\nExample from training set:")
print(f"train_x[0]: {train_x[0]}")
print(f"train_y[0]: {train_y[0]}")

print("Decoded train_x[0]: ", repr(tokenizer.decode(train_x[0])))
print("Decoded train_y[0]: ", repr(tokenizer.decode(train_y[0])))
# YOUR CODE ENDS HERE

Train_x length: 7841, Train_y length: 7841
Val_x length: 872, Val_y length: 872
Train_x[0] length: 127, Train_y[0] length: 127

Example from training set:
train_x[0]: [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50, 1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58, 53]
train_y[0]: [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 5

# Using `torch.utils.data`

We will use `torch.utils.data.Dataset` to create a dataset object that will be used to create a `torch.utils.data.DataLoader` object. The `DataLoader` object will be used to create batches of data.

## Coding task 3.4

Your task is to learn how to use `torch.utils.data.Dataset` and `torch.utils.data.DataLoader` classes and to apply them to our data.

1. Convert your data to tensors of type long
1. Create a `torch.utils.data.Dataset` object for each train and test data. Name them `train_dataset` and `val_dataset`. You can use the `TensorDataset` class for this or make a new class that inherits from `torch.utils.data.Dataset` and implements the `__getitem__` and `__len__` methods.
2. Try indexing `train_dataset` to get a single example and decode it using `tokenizer.decode()`. What does it contain? Use tokenizer to decode one example (both x and y). Does it look like a valid text? Are the targets shifted by one character?
1. Use the `DataLoader` class to create `train_loader` and `val_loader` objects. It will shuffle and batch data for you. You can use the following parameters:
   * `dataset` - the dataset object you created in the previous step
   * `batch_size` - your choice!
   * `shuffle` - True for training data, False for validation data
   * `num_workers` - 8, number of CPU cores to use for batch preparation
3. Try iterating over `train_loader` and print the shapes of the batches.
    * You can use `break` to stop the loop after the first iteration.
4. Try decoding a batch that you get from `train_loader`. Does it look like a valid text? Are the targets shifted by one character?

Learn more about data feeding in pytorch here: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html


**NOTE:**
1. `TensorDataset` returns a tuple of tensors. Usually these are `(x, y)` pairs, where `x` is the input and `y` is the target. In our case, `x` is the input sequence and `y` is the same sequence shifted by one character. This is how we will train our language model. We will use the first 128 characters to predict the next character.
1. You need to convert your pytorch tensor into a python list in order to use `tokenizer.decode()`. Feel free to do it in-place or modify the `decode` method of the `Tokenizer` class to accept **BOTH** python lists and pytorch tensors. You can check what datatype you have using `isinstance()` function.
2. Printing might look a bit weird because you have a lot of `\n` in the data. It is alright, just be careful when you are verifying that your data is correct.

### Grading criteria

* 1 point for `train_dataset` and `val_dataset` objects
* 1 point if each test is written and passed:
  * train dataset element is correctly processed and x and y corespond to the correct characters
  * printed the shapes of the items that you get from `train_loader`
  * decoded a batch from `train_loader` and printed the decoded text and it is correct

In [40]:
BATCH_SIZE = 3  # think about a better batch size for training, this is just a placeholder

# YOUR CODE STARTS HERE (our implementation is about 13 lines)
from torch.utils.data import TensorDataset, DataLoader
# Convert to tensors
train_x_tensor = torch.tensor(train_x,dtype=torch.long)
train_y_tensor = torch.tensor(train_y,dtype=torch.long)
val_x_tensor = torch.tensor(val_x,dtype=torch.long)
val_y_tensor = torch.tensor(val_y,dtype=torch.long)
#Dataset objects 
train_dataset = TensorDataset(train_x_tensor,train_y_tensor)
val_dataset = TensorDataset(val_x_tensor,val_y_tensor)

train_loader = DataLoader(train_dataset,batch_size = BATCH_SIZE , shuffle=True , num_workers = 8)
val_loader = DataLoader(val_dataset,batch_size = BATCH_SIZE,shuffle= False , num_workers = 8)

print("Batch shapes from train loader : ")
for x_batch , y_batch in train_loader:
    print("x_batch shape : " ,{x_batch.shape})
    print("y_batch shape : " , {y_batch.shape})
    break

for x_batch , y_batch in train_loader:
    for i in range(BATCH_SIZE):
        print("Sample " , {i} , " x: " , {repr(tokenizer.decode(x_batch[i]))})
        print("Sample " , {i} , " y : " , {repr(tokenizer.decode(y_batch[i]))})

# YOUR CODE ENDS HERE

Batch shapes from train loader : 


python(159) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(161) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(163) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(164) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(165) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(166) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(167) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(168) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(169) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(170) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


x_batch shape :  {torch.Size([3, 127])}
y_batch shape :  {torch.Size([3, 127])}


python(171) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(172) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(173) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(174) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(175) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(177) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(178) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(179) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


KeyError: tensor(43)

# Train a Transformer model

Import your `TransformerLM` model from `modeling_transormer` file and train it on the data you prepared above.
You know the drill: define a model, an optimizer, and a training loop, log everything to wandb.
You can also save your model using `TransformerLM.save_pretrained()` method and load it using `TransformerLM.from_pretrained()` method in case you want to.

### Tricky part

In PyTorch, `F.cross_entropy` expects the logits to be of shape `(batch_size, num_classes)` and the targets to be of shape `(batch_size,)` containing the class indices. In our case, the logits tensor has the shape `(batch_size, seq_len, num_classes)` and the targets are of shape `(batch_size, seq_len)`. We need to reshape the input and the targets to make them compatible with `F.cross_entropy`. You can do it like this:

```python
bs, seq_len, num_classes = logits.shape
logits = logits.reshape(bs * seq_len, num_classes)
targets = targets.reshape(bs * seq_len)
```

or, equivalently, like this:

```python
logits = logits.view(-1, num_classes)
targets = targets.view(-1)
```

Try monitoring your GPU consumption and max it out. The more efficient your code is, the faster your model will train.
During training log your loss and and accuracy. You can only log accuracy every 100 batches or so, because it is a bit slow to compute. You can also log the learning rate.
During evlauation you just need to log the perplexity, the loss, and accuracy. Perplexity is just `exp(loss)`.
Accuracy is not the most standard metric for language models, but it is very intererpretable and easy to compute. Don't expect it to be high, though.
Be mindful how frequenly you evaluate your model. You don't want to evaluate it too often, because it will slow down your training loop.

> You can also log the number of batches you process in one second (throughput) as a measure of efficiency. It is not required, but it is a good idea to monitor it.

## Coding task 3.5

Make a training loop and train your model.

### Grading criteria
**(5 points + extra points)**

* 2 points for trainig loop
* 1 point for using the GPU
* 1 point for evaluation loop (we recommend to make it into a separate function to make your code more readable)
* 1 point for wandb logging of train loss, eval loss, train accuracy, eval accuracy, eval perplexity. You can also log the learning rate, but it is not required.
* -1 point if forget to zero your gradients between batches
* -1 point if your forget to put your model to evaluation mode during evaluation and back to training mode during training
* Extra point for using a learning rate scheduler
* Extra point for any other improvements to the training loop


In [43]:
from transformer_lm.modeling_transformer import TransformerLM
import wandb
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
import time 
# YOUR CODE STARTS HERE
wandb.init(project="transformer_shakespeare", config={"batch_size": 3, "max_len": 128})
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TransformerLM(num_layers=4, hidden=256, num_heads=8, fcn_hidden=512, vocab_size=len(vocab), max_seq_len=128, dropout=0.1).to(device)
optimizer = AdamW(model.parameters(), lr=0.001)
scheduler = StepLR(optimizer, step_size=1000, gamma=0.9)

def evaluate(model, val_loader, device):
    model.eval()
    total_loss, total_acc, total_count = 0, 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            logits = logits.view(-1, len(vocab))
            targets = y.view(-1)
            loss = F.cross_entropy(logits, targets)
            total_loss += loss.item() * x.size(0)
            total_acc += (logits.argmax(dim=-1) == targets).sum().item()
            total_count += targets.size(0)
    avg_loss = total_loss / len(val_loader.dataset)
    perplexity = torch.exp(torch.tensor(avg_loss)).item()
    accuracy = total_acc / total_count
    model.train()
    return avg_loss, accuracy, perplexity

model.train()
for epoch in range(5):
    start_time = time.time()
    total_loss, total_acc, total_count = 0, 0, 0
    for i, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        logits = logits.view(-1, len(vocab))
        targets = y.view(-1)
        loss = F.cross_entropy(logits, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * x.size(0)
        if i % 100 == 0:
            total_acc += (logits.argmax(dim=-1) == targets).sum().item()
            total_count += targets.size(0)
            wandb.log({"train_loss": loss.item(), "train_accuracy": total_acc / total_count if total_count > 0 else 0, "throughput": 100 / (time.time() - start_time + 1e-6)})
            start_time = time.time()
    scheduler.step()
    avg_train_loss = total_loss / len(train_loader.dataset)
    eval_loss, eval_acc, eval_perp = evaluate(model, val_loader, device)
    wandb.log({"epoch": epoch, "avg_train_loss": avg_train_loss, "eval_loss": eval_loss, "eval_accuracy": eval_acc, "eval_perplexity": eval_perp, "learning_rate": scheduler.get_last_lr()[0]})

model.save_pretrained("transformer_shakespeare")
# YOUR CODE ENDS HERE

python(485) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(486) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(487) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(488) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(489) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(490) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(491) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(492) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(640) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(641) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
python(642) MallocStackLogging: can't tu

KeyboardInterrupt: 

# Generate text using your model

Now it's time to see what this model can do. Implement a generation function.
The idea is to start with some prefix text, predict the next character, append it to the prefix, and repeat the process.
You can stop generating text when you reach MAX_LEN tokens.

Use `torch.no_grad()` context manager to make sure that you don't compute gradients during generation, or it will blow up your GPU memory.

## Coding task 3.6

Implement a generation function that accepts a prefix text and generates the next tokens up to MAX_LEN.

### Grading criteria
**(2 points)**

* 2 points for generation function
* -1 point if you forget to put your model to evaluation mode during generation and back to training mode after generation or if you forget to use `torch.no_grad()` context manager, or if you are not using the GPU.

In [None]:
# YOUR CODE STARTS HERE (our implementation is about 10 lines)

# YOUR CODE ENDS HERE

# Exploring hyperparameters and understanding Transformers

Train at least 10 models with different hyperparameters and compare them using wandb. Write a short report (500-1000 words).


### Grading criteria
**(5 points max + extra points)**

* 4 points for training 10+ models. (5-9 models = 2 points, 1-4 models = 1 point)
* 1 point for training report that describes what you did and what you learned about the hyperparameters and efficient training.