# Pretrained Transformer As Universal Computation Machine

This assignment is based on the paper <a href="https://arxiv.org/abs/2103.05247" target="_blank">Pretrained Transformers As Universal Computation Engines</a>.

The transformer architecture has shown great success in deep learning, serving as the backbone of large models for tasks such as NLP and image processing. Inspired by these successes, we explore the generalization capability of a transformer. We hypothesize that a transformer trained on a data-rich modality, such as a natural language corpus, can identify feature representations of **arbitrary** data. In this assignment, we investigate whether pretrained language models can generalize to other modalities with sequential structure.

To do this, we take a transformer pretrained on natural language data, GPT-2, and fine-tune only the **linear input, linear output, positional embedding, and layer norm** parameters. We will see how GPT-2 performs on tasks completely different from language prediction. Then, we will show that using a pretrained transformer as a feature extractor has advantages over training a new neural network from scratch.

In [28]:
!pip install transformers



In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import random_split
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np
from transformers.models.gpt2.modeling_gpt2 import GPT2Model
from typing import List, Dict
from tqdm import tqdm
import matplotlib.pyplot as plt
import requests
from PIL import Image

### **Learn about GPT2 architecture**

First, we introduce the architecture of GPT-2 and build a small sample component to help you understand the later parts.

As you learned in lecture, GPT-2 is based on the transformer model, which was introduced in the paper *Attention Is All You Need*. Here is a picture of the transformer architecture for your reference.

<div align="center"><img src=https://jalammar.github.io/images/xlnet/transformer-encoder-decoder.png width=60% /> </div>

Later work found that a stack of only encoders or only decoders is sufficient for many tasks, which led to encoder-only transformers such as BERT and decoder-only transformers such as GPT-2. The two pictures below show the architecture of an encoder layer and a decoder layer.


<div align="center"><img src=https://jalammar.github.io/images/xlnet/transformer-encoder-block-2.png width=60% /> </div>

<div align="center"><img src=https://jalammar.github.io/images/xlnet/transformer-decoder-block-2.png width=60% /> </div>

As you can see, the key difference between the encoder and the decoder is that the decoder uses **MASKED** self-attention where the encoder uses ordinary self-attention. As the name suggests, this is a self-attention layer with a mask. Specifically, masked self-attention enforces a causal relationship: a given word can be influenced only by itself and the words before it. The difference is illustrated in the following picture.

<div align="center"><img src=https://jalammar.github.io/images/gpt2/self-attention-and-masked-self-attention.png width=60% /> </div>

Implementing the transformer or GPT-2 is not the main focus of this homework. Here, we provide a detailed PyTorch implementation of a self-attention layer. Your task is to apply the causal mask if `causal=True`.

The scores are computed for you. The attention score `scores[i, j]` represents the similarity between the i-th query vector `q[i]` and the j-th key vector `k[j]`.

*Hint: the `torch.triu` and `torch.masked_fill` functions may be helpful.*

In [None]:
def dot_product_attention(q, k, v, causal=False):
    """
    Computes the dot product attention scores and the attention output for a single example.

    Args:
    - q: Tensor of shape (query_length, embedding_size)
    - k: Tensor of shape (key_length, embedding_size)
    - v: Tensor of shape (value_length, embedding_size)
    - causal: Boolean flag indicating whether to apply a causal mask

    Returns:
    - output: Tensor of shape (query_length, embedding_size)
    """
    scores = torch.matmul(q, k.transpose(0, 1)) / (q.shape[-1] ** 0.5)  # shape: (query_length, key_length)
    
    if causal:
        ############################################################################
        # TODO: implement this part
        ############################################################################
        # Build an upper-triangular mask (1 above the diagonal) marking future keys
        mask = torch.triu(torch.ones_like(scores), diagonal=1)
        # Set masked scores to -inf so that softmax assigns them zero weight
        scores = scores.masked_fill(mask == 1, float("-inf"))
        ############################################################################

    weights = F.softmax(scores, dim=-1)  # shape: (query_length, key_length)
    output = torch.matmul(weights, v)  # shape: (query_length, embedding_size)
    
    return output


In [None]:
# test the implementation
q = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=torch.float)
k = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=torch.float)
v = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=torch.float)

print(dot_product_attention(q, k, v, True))
print(dot_product_attention(q, k, v, False))




tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
tensor([[6.9999, 7.9999, 8.9999],
        [7.0000, 8.0000, 9.0000],
        [7.0000, 8.0000, 9.0000]])
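
To see what the causal mask does to the attention weights themselves, here is a minimal plain-Python sketch of a row-wise masked softmax. The helper name `causal_softmax` and the example scores are illustrative assumptions and are not part of the assignment code.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where query i may only attend to keys j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Mask out "future" keys (j > i) by replacing their score with -inf
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Each row still sums to 1, but all weight above the diagonal is zero. This is why the first row of the causal output above is exactly `v[0]`.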


### **Application of GPT-2 in a Simple Language Task**

In this part, we use GPT-2 for a language task, its original domain, to learn how to adapt and fine-tune GPT-2.

In [None]:
# # This part requires fine-tuning
# import torch
# from transformers import GPT2Tokenizer, GPT2Config, GPT2ForSequenceClassification

# # Load pre-trained model and tokenizer
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model_config = GPT2Config.from_pretrained('gpt2', num_labels=2)
# model = GPT2ForSequenceClassification.from_pretrained('gpt2', config=model_config)

# # Define input text
# input_text = "This is an example input text for classification."
# input_ids = tokenizer.encode(input_text, return_tensors='pt')
# outputs = model(input_ids)

# predicted_class = torch.argmax(outputs[0])

# print(predicted_class)

### **Versatility of GPT-2 for tasks in other Domains**

In this part, we demonstrate how to adapt and fine-tune GPT-2 for tasks outside the language domain, including learning a bit-wise math operation and image classification.

#### Task 1: Bit-wise operation

In this section, we show that GPT-2 can learn bit-wise operations such as `AND`, `OR`, and `XOR`. We will see that it reaches very high accuracy on this task.

First, we need to create a dataset for fine-tuning.

In [None]:
# use the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

Here, we choose the `XOR` operator as our training target. Feel free to change the function below to test other operators.

In [None]:
# randomly generate two n-bit strings and their ground-truth result
def generate_example(n):
  bits = np.random.randint(low=0, high=2, size=(2, n))

  # change this line to change the operator
  # (np.int64 is portable; np.long is missing in some NumPy versions)
  XOR = np.logical_xor(bits[0], bits[1]).astype(np.int64)
  # ----------------------------------------------------

  return bits.reshape((2*n)), XOR
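
To make the data layout concrete, here is a small plain-Python sketch of one example: the input is the first operand followed by the second (2n bits), and the target is the n-bit result. The helper `make_example` and its `op` parameter are hypothetical names used only for illustration; the notebook itself uses NumPy as shown above.

```python
import operator
import random

def make_example(n, op=operator.xor):
    """Return (inputs, target): inputs is a followed by b, target is op(a, b) bit by bit."""
    a = [random.randint(0, 1) for _ in range(n)]
    b = [random.randint(0, 1) for _ in range(n)]
    target = [op(x, y) for x, y in zip(a, b)]
    return a + b, target

random.seed(0)  # fixed seed so repeated runs give the same example
inputs, target = make_example(5)
print(len(inputs), len(target))  # 10 5

# Swapping the operator only needs a different `op`, e.g. AND or OR
and_inputs, and_target = make_example(5, op=operator.and_)
or_inputs, or_target = make_example(5, op=operator.or_)
```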

In [None]:
class BitWiseDataset(torch.utils.data.Dataset):
  def __init__(self, n, size):
    self.n = n
    self.size = size

  def __len__(self):
    return self.size

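  def __getitem__(self, idx):
    # a fresh random pair is drawn on every access; idx is ignored
    bits = np.random.randint(low=0, high=2, size=(2, self.n))
    # same operator as generate_example (XOR); change both together
    target = np.logical_xor(bits[0], bits[1]).astype(np.int64)
    return torch.tensor(bits.reshape((2*self.n)), dtype=torch.long).to(device), torch.tensor(target, dtype=torch.long).to(device)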

def generate_data_loaders(n, batch_size, data_size = 1000, train_size=0.8):
  dataset = BitWiseDataset(n, size=data_size)
  train_size = int(train_size * len(dataset))
  test_size = len(dataset) - train_size
  train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])

  train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
  test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

  return train_loader, test_loader


Then, to fine-tune the GPT-2 model, we freeze the weights of the self-attention and feedforward layers. Only the layer norm parameters and positional embeddings remain trainable.

In [None]:
# load the GPT-2 model
gpt2 = GPT2Model.from_pretrained('gpt2')

In [None]:
# show the names of all the parameters you are able to modify in this model
for name, param in gpt2.named_parameters():
  print(name)

wte.weight
wpe.weight
h.0.ln_1.weight
h.0.ln_1.bias
h.0.attn.c_attn.weight
h.0.attn.c_attn.bias
h.0.attn.c_proj.weight
h.0.attn.c_proj.bias
h.0.ln_2.weight
h.0.ln_2.bias
h.0.mlp.c_fc.weight
h.0.mlp.c_fc.bias
h.0.mlp.c_proj.weight
h.0.mlp.c_proj.bias
h.1.ln_1.weight
h.1.ln_1.bias
h.1.attn.c_attn.weight
h.1.attn.c_attn.bias
h.1.attn.c_proj.weight
h.1.attn.c_proj.bias
h.1.ln_2.weight
h.1.ln_2.bias
h.1.mlp.c_fc.weight
h.1.mlp.c_fc.bias
h.1.mlp.c_proj.weight
h.1.mlp.c_proj.bias
h.2.ln_1.weight
h.2.ln_1.bias
h.2.attn.c_attn.weight
h.2.attn.c_attn.bias
h.2.attn.c_proj.weight
h.2.attn.c_proj.bias
h.2.ln_2.weight
h.2.ln_2.bias
h.2.mlp.c_fc.weight
h.2.mlp.c_fc.bias
h.2.mlp.c_proj.weight
h.2.mlp.c_proj.bias
h.3.ln_1.weight
h.3.ln_1.bias
h.3.attn.c_attn.weight
h.3.attn.c_attn.bias
h.3.attn.c_proj.weight
h.3.attn.c_proj.bias
h.3.ln_2.weight
h.3.ln_2.bias
h.3.mlp.c_fc.weight
h.3.mlp.c_fc.bias
h.3.mlp.c_proj.weight
h.3.mlp.c_proj.bias
h.4.ln_1.weight
h.4.ln_1.bias
h.4.attn.c_attn.weight
h.4.attn.c_at

In [None]:
for name, param in gpt2.named_parameters():
  # freeze all parameters except the layer norm and positional embeddings
  if 'ln' in name or 'wpe' in name:
    param.requires_grad = True 
  else:
    param.requires_grad = False
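
As a quick sanity check of the selection rule, the sketch below applies the same substring test to a few parameter names copied from the printed list. The helper name `is_trainable` is an illustrative assumption; it only mirrors the condition in the loop above and does not touch the model.

```python
def is_trainable(name):
    """Mirror of the rule above: train layer norms and positional embeddings only."""
    return "ln" in name or "wpe" in name

# A few names copied from the parameter list printed earlier
names = [
    "wte.weight",
    "wpe.weight",
    "h.0.ln_1.weight",
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_proj.bias",
    "h.0.ln_2.bias",
    "h.0.mlp.c_fc.weight",
]

for name in names:
    print(f"{name:28s} {'trainable' if is_trainable(name) else 'frozen'}")
```

Under this rule the token embedding `wte`, the attention weights, and the MLP weights are all frozen, while `wpe` and every `ln_*` parameter stay trainable.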

With the dataset and pretrained model ready, we need to adapt the model to our task by adding an embedding layer before the model and a linear output layer after it.

In [None]:
class Bit_wise_transformer(nn.Module):
  def __init__(self, engine, bitLength, input_dim, engine_embed_dim, n_class = 2):
    super().__init__()
    self.n = bitLength
    self.input_embed = nn.Embedding(input_dim, engine_embed_dim)
    self.engine = engine
    self.output_layer = nn.Linear(engine_embed_dim, n_class)

  def forward(self, x):
    # x: (batch, 2 * n) bit tokens -> (batch, 2 * n, engine_embed_dim)
    embeddings = self.input_embed(x)
    # keep the hidden states of the last n positions, one per output bit
    hidden_state = self.engine(inputs_embeds=embeddings).last_hidden_state[:,self.n:]
    # [0] keeps only the first sequence, so this assumes batch_size == 1
    logits = self.output_layer(hidden_state)[0]
    return logits
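
One way to see why reading the last n positions is enough, assuming GPT-2's usual causal attention is active: the hidden state at position n + i can attend to every earlier position, which includes bit i of the first operand (index i) and bit i of the second operand (index n + i). The sketch below checks this bookkeeping in plain Python; the function names are hypothetical.

```python
def visible_positions(t):
    """Indices a causal model can attend to when producing position t (0-based)."""
    return set(range(t + 1))

def output_sees_its_operands(n):
    """True if every output position n + i can attend to a[i] and b[i]."""
    for i in range(n):
        seen = visible_positions(n + i)
        # a[i] sits at index i, b[i] sits at index n + i in the 2n-token input
        if i not in seen or (n + i) not in seen:
            return False
    return True

print(output_sees_its_operands(5))  # True
```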

Now we are ready for training! Generate the training and test sets first, then train the model.

In [None]:
# generate the training and testing data
train_loader, test_loader = generate_data_loaders(n=5, batch_size=1,data_size = 1000, train_size=0.8)

In [None]:
# create an instance of the model
Bit_length = 5

model = Bit_wise_transformer(
      gpt2, 
      bitLength = Bit_length,
      input_dim = 2, 
      engine_embed_dim = 768
).to(device)


# define the optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()


losses = []
train_acc = []
all_val_acc = []
best_val_acc = 0
num_epochs = 5

epoch_iterator = tqdm(range(num_epochs))  # a plain range has no set_postfix, which is called below
for epoch in epoch_iterator:
    # Training loop
    running_loss = 0.0
    data_iterator = tqdm(train_loader)
    for i, (inputs, labels) in enumerate(data_iterator):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels.squeeze())
        accuracy = torch.mean((torch.argmax(outputs, dim=-1) == labels.flatten()).float())
        loss.backward()
        optimizer.step()

        data_iterator.set_postfix(loss=loss.item())

        running_loss += loss.item()
        if (i + 1) % 10 == 0:
            # print(f'Epoch [{epoch + 1}/{num_epochs}], Data [{i + 1}/{len(train_loader)}], Loss: {running_loss / 10:.4f}')
            running_loss = 0.0
            losses.append(loss.item())
            train_acc.append(accuracy.item())
        
    # Validation
    val_acc = []
    model.eval()
    with torch.no_grad():
      for inputs, labels in test_loader:
          inputs = inputs.to(device=device, dtype=torch.long)
          labels = labels.to(device=device, dtype=torch.long)
          outputs = model(inputs)
          accuracy = torch.mean((torch.argmax(outputs, dim=-1) == labels.flatten()).float())
          val_acc.append(accuracy.item())
    model.train()

    all_val_acc.append(np.mean(val_acc))
    # Track the best validation accuracy (no checkpoint is written)
    if np.mean(val_acc) > best_val_acc:
        best_val_acc = np.mean(val_acc)

    epoch_iterator.set_postfix(val_acc=np.mean(val_acc), best_val_acc=best_val_acc)


plt.plot(losses)
plt.title('Train Loss')
plt.figure()
plt.plot(train_acc)
plt.title('Train Accuracy')
plt.figure()
plt.plot(all_val_acc)
plt.title('Val Accuracy')
del model
torch.cuda.empty_cache()

Here, in the final plots of training and validation accuracy, you should see GPT-2 achieve very good results.
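
The accuracy plotted above is a per-bit accuracy: the predicted class for each output bit is the index of the larger logit, and the score is the fraction of bits that match the label. A plain-Python sketch of that computation follows; the function name and the example numbers are made up for illustration.

```python
def per_bit_accuracy(logits, labels):
    """logits: one [score_for_0, score_for_1] pair per bit; labels: one 0/1 per bit."""
    # argmax over the class dimension, done by hand
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]]
labels = [0, 1, 1, 1, 1]
print(per_bit_accuracy(logits, labels))  # 0.8
```

When reading the plots, keep the trivial baseline in mind. For `XOR` on uniformly random bits, a constant guess is right about 50% of the time, whereas for `AND`, always predicting 0 already scores about 75%.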

#### Task 2: Image classification

The last task is quite simple and may not be a convincing demonstration of GPT-2's generality. We now move to a more complex domain, image classification. We will work on the well-known handwritten digit dataset MNIST and reach a good result.
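
GPT-2 consumes sequences, so an image has to be turned into a sequence of tokens before it can be fed to the model. A minimal sketch of one option, assuming the image is cut into non-overlapping 4x4 patches that are flattened in row-major order (the patch size and the helper name `image_to_patch_sequence` are illustrative choices, not taken from this notebook):

```python
def image_to_patch_sequence(image, patch=4):
    """Split a square image (list of rows) into flattened patch x patch tokens."""
    size = len(image)
    assert size % patch == 0, "image side must be divisible by the patch size"
    tokens = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            # Flatten one patch row by row into a single token vector
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens

# Stand-in for a 28x28 MNIST image: each pixel holds its own flat index
image = [[r * 28 + c for c in range(28)] for r in range(28)]
tokens = image_to_patch_sequence(image)
print(len(tokens), len(tokens[0]))  # 49 16
```

With this choice a 28x28 image becomes 49 tokens of dimension 16, and a linear input layer would then map each token to GPT-2's embedding size.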

Still, we need to freeze some parameters in GPT-2 first.

In [31]:
# A function that freezes the necessary parameters in GPT-2
def gpt2_freezer(
    model,                         
    freeze_param_list: List[str] = None
):
  for (name, param) in model.named_parameters():
    if freeze_param_list is not None:
      if any([k in name for k in freeze_param_list]):
        param.requires_grad = False
    else:
      param.requires_grad = False
  return model

In [None]:
pretrained_gpt2 = GPT2Model.from_pretrained('gpt2') 
gpt2_engine = gpt2_freezer(
    pretrained_gpt2,
    ["mlp", "attn"]
  )
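
Note that `gpt2_freezer` works with a freeze list rather than a keep list: a parameter is frozen when its name contains any listed substring, and everything is frozen when no list is given. The plain-Python sketch below applies that decision to a few names from the earlier parameter listing; `would_freeze` is a hypothetical helper that mirrors the condition and does not modify any model.

```python
def would_freeze(name, freeze_param_list=None):
    """True if gpt2_freezer would set requires_grad = False for this name."""
    if freeze_param_list is None:
        return True  # no list given: every parameter is frozen
    return any(k in name for k in freeze_param_list)

names = [
    "wte.weight",
    "wpe.weight",
    "h.0.ln_1.weight",
    "h.0.attn.c_attn.weight",
    "h.0.mlp.c_fc.weight",
]

frozen = [n for n in names if would_freeze(n, ["mlp", "attn"])]
untouched = [n for n in names if not would_freeze(n, ["mlp", "attn"])]
print(frozen)     # ['h.0.attn.c_attn.weight', 'h.0.mlp.c_fc.weight']
print(untouched)  # ['wte.weight', 'wpe.weight', 'h.0.ln_1.weight']
```

Unlike the rule used in Task 1, this leaves `wte.weight` with whatever `requires_grad` value it already had, so the token embedding is not frozen by the call above.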

### **Advantages of Pre-trained Models: Speed and Accuracy**

### **Influence of Model Capacity on Accuracy and Training Time**

### **Interpreting Attention Layers in GPT-2**