# Transformers Notebook

#### Summary

This notebook showcases basic uses from transformers, as well as the from scratch implementation.

In this notebook we will:
* Write shakespeare via a Self Attentive Transformer Decoder
* Translate English To French, via a Transformer Encoder and Decoder


## Shakespeare

### Data

To begin, we will load our data from a txt file containing Shakespearean Plays

*Data: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt*

In [1]:
with open('input.txt', 'r', encoding='utf-8') as f:
  text=f.read()

In [2]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



#### TOKENIZER

The tokenizer subdivides the sentences into tokens, which then have their semantic meaning learned by the model.

* Started initially with a per character tokenizer
* UPDATE: Replaced tokenizer with a pretrained BERT tokenizer (This is fine for English, however, later in the notebook is not as good for French)

In [3]:
#NAIVE INTEGER TOKENIZER
chars = sorted(list(set(text)))

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

from transformers import AutoTokenizer

#Pretrained tokenizer
enc = AutoTokenizer.from_pretrained("bert-base-cased")
vocab_size = enc.vocab_size

#### Data Loader

Now, we will create a basic method to load our data quickly for both train and validation. We will do this via a simple function

In [None]:
import torch

data = torch.tensor(enc.encode(text))
data


In [5]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [6]:
torch.manual_seed(1337)

batch_size = 128
block_size = 32 # maximum context length

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data)-block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y

xb, yb = get_batch('train')

print(xb)
print(yb)

tensor([[ 1139,  1540,  1706,  ...,  2789,  1109,  1269],
        [ 1189,  1103, 10136,  ...,  1699,   117,  3465],
        [ 4199,   117,  6884,  ..., 23354,   131,   146],
        ...,
        [ 1262,  3041,  1240,  ...,  1103,  2089,  1104],
        [  173,  2353,  5141,  ...,  2196,  1436,   119],
        [ 1341,  4218,  1106,  ...,  1159,  1536,   119]])
tensor([[ 1540,  1706,   184,  ...,  1109,  1269,   146],
        [ 1103, 10136,  4455,  ...,   117,  3465,  1324],
        [  117,  6884, 27296,  ...,   131,   146,  1821],
        ...,
        [ 3041,  1240,  6927,  ...,  2089,  1104,  1142],
        [ 2353,  5141,   112,  ...,  1436,   119, 22870],
        [ 4218,  1106,  9015,  ...,  1536,   119,  8784]])


### Attention

In order to predict the text, we will make an attention based transformer decoder. This will use self attention to take the current text, and predict the new text.

NOTE: WE WILL USE LEARNED POSITIONAL EMBEDDINGS FOR SIMPLICITY; HOWEVER, SINUSOIDAL PE IS TYPICALLY BETTER

In [7]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)



n_embed = 64

head_size = 16

class Head(nn.Module):

    def __init__(self, head_size, mask = True):
        super().__init__()
        self.key = torch.nn.Linear(n_embed, head_size, bias = False)
        self.query = torch.nn.Linear(n_embed, head_size, bias = False)
        self.value = torch.nn.Linear(n_embed, head_size, bias = False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.mask = mask

    def forward(self, x):
        B,T,C = x.shape

        k = self.key(x)
        q = self.query(x)

        wei = q @k.transpose(-2, -1) / head_size**0.5
        if self.mask:
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = torch.softmax(wei, dim=-1)

        v = self.value(x)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size, mask = True):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size, mask = mask) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim =-1)
        out = self.proj(out)
        return out

class FeedForward(nn.Module):
    def __init__(self, sz):
        super().__init__()
        self.FF = nn.Sequential(
            nn.Linear(sz, sz*4),
            nn.ReLU(),
            nn.Linear(sz*4, sz),
        )
    def forward(self, x):
        return self.FF(x)

class Block(nn.Module):

    def __init__(self, n_embed, n_head, mask = True):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size, mask = mask)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class Decoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()

        # THE EMBEDDING SPACE x->(z)
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.positional_embedding_table = nn.Embedding(block_size, n_embed)

        self.blocks = nn.Sequential(
            Block(n_embed, n_head = 6),
            Block(n_embed, n_head = 6),
            Block(n_embed, n_head = 6),
            nn.LayerNorm(n_embed)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)
        


    def forward(self, idx, targets=None):
        B, S = idx.shape

        tok_emb = self.token_embedding_table(idx) # B,T, C
        pos_emb = self.positional_embedding_table(torch.arange(S)) # (T,C)

        x = tok_emb + pos_emb

        x = self.blocks(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            print(logits)
            targets = targets.view(B*T)
            print(targets)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):

        for _ in range(max_new_tokens):

            idx_cond = idx[:,-block_size:]
            # GRABS THE NEXT PREDICTION
            logits, loss = self(idx_cond)
            logits = logits[:,-1,:]

            # CREATES THE NEXT PIECE OF SEQUENCE
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples = 1)

            # ADDS CHOICE TO SEQUENCE
            idx = torch.cat((idx, idx_next), dim = 1)

        return idx

m = Decoder(vocab_size)
#sequence = m.generate(xb, 100)
#print(enc.decode(sequence[0,:].tolist()))

In [8]:
optimizer = torch.optim.AdamW(m.parameters(), lr = 3e-4)

for steps in range(1000):

    xb, yb = get_batch('train')

    logits, loss = m(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

tensor([[ 0.4432, -0.0112,  0.3615,  ...,  0.2669, -0.6870,  0.9221],
        [-0.6037,  0.1370,  0.5656,  ...,  0.7308,  0.3631,  0.2769],
        [-0.1342,  0.1373,  0.0837,  ...,  0.4316, -0.9123,  0.0554],
        ...,
        [ 0.0745,  0.1808,  0.1682,  ..., -0.0261,  0.1757, -0.0440],
        [-0.5510, -0.5618, -0.2221,  ...,  0.7620,  0.4578,  0.6521],
        [ 0.4486,  0.0223, -0.9319,  ...,  0.4141, -0.5009,  1.4216]],
       grad_fn=<ViewBackward0>)
tensor([17604,  1131,  1439,  ...,   119,   148, 15740])
10.424849510192871


In [9]:
xb.shape

torch.Size([128, 32])

# English to French Translation

We have assembled a basic self attention decoder, which is able to perform text prediction. We will now use the build blocks of this model to perform machine translation via an encoder decode model.

*DataSet: https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench*
#### Cross Attention

In order to perform translation, we need to implement what is called "Cross Attention". This will allow the decoder to query information from the features that the encoder generates

In [10]:
class CrossAttentionHead(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = torch.nn.Linear(n_embed, head_size, bias = False)
        self.query = torch.nn.Linear(n_embed, head_size, bias = False)
        self.value = torch.nn.Linear(n_embed, head_size, bias = False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    
    
        # PERFORMS CROSS ATTENTION
    def forward(self, x, x_cross):
        
        B,T,C = x.shape

        k = self.key(x_cross)
        q = self.query(x)

        wei = q @k.transpose(-2, -1) / head_size**0.5
        #wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = torch.softmax(wei, dim=-1)

        v = self.value(x_cross)
        out = wei @ v

        return out
    
class MultiHeadCrossAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([CrossAttentionHead(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)
        
    def forward(self, x, x_cross):
        out = torch.cat([h(x = x, x_cross = x_cross) for h in self.heads], dim =-1)
        out = self.proj(out)
        
        return out
    
class CrossAttentionBlock(nn.Module):

    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.attn = MultiHeadAttention(n_head, head_size)
        self.sa = MultiHeadCrossAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        self.ln3 = nn.LayerNorm(n_embed)

    def forward(self, x, x_cross):
        x = x + self.attn(self.ln1(x))
        x = x + self.sa(x = self.ln2(x), x_cross = self.ln2(x_cross))
        x = x + self.ffwd(self.ln3(x))
        return x
    

#### Transformer

With the new cross attention modules, we will now be able to create our transformer.

In [11]:
class Transformer(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()

        # THE EMBEDDING SPACE x->(z)
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.positional_embedding_table = nn.Embedding(block_size, n_embed)
        
        # ENCODER
        self.blocks = nn.Sequential(
            Block(n_embed, n_head = 6, mask = False),
            Block(n_embed, n_head = 6, mask = False),
            Block(n_embed, n_head = 6, mask = False),
            Block(n_embed, n_head = 6, mask = False),
            Block(n_embed, n_head = 6, mask = False),
            Block(n_embed, n_head = 6, mask = False),
            nn.LayerNorm(n_embed)
        )
        
        self.decoderBlocks = nn.ModuleList([CrossAttentionBlock(n_embed, n_head = 6) for i in range(6)])
        
        
        self.lm_head = nn.Linear(n_embed, vocab_size)
        


    def forward(self, x, x_cross, targets=None):
        # TARGETS: WHAT WE WANT THE NEXT WORD OF THE PREDICTION TO BE
        # X: THE CURRENT GENERATION THAT WE HAVE
        # X_CROSS: THE TEXT WE ARE TRANSLATING
        
        B, S = x.shape

        tok_emb_cross = self.token_embedding_table(x_cross) # B,T, C
        tok_emb = self.token_embedding_table(x)
        
        #print(S)
        #print(block_size)
        #print(torch.arange(S))
        
        pos_emb = self.positional_embedding_table(torch.arange(S))

        x = tok_emb + pos_emb
        x_cross = tok_emb_cross + pos_emb

        # RUN THE CROSS DATA THROUGH THE ENCODER
        x_cross = self.blocks(x_cross)
        
        
        for decoder in self.decoderBlocks:            
            x = decoder(x = x, x_cross = x_cross)
        
        
        # TURN THE RESULT INTO LOGITS
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            #print(logits)
            targets = targets.view(B*T)
            #print(targets)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx):
        
        gen = torch.zeros_like(idx)
        
        
        for i in range(block_size):
           
            logits, loss = self.forward(gen ,idx)
            logits = logits[:,i,:] 
            
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples = 1)
            
            #Set the vector slice to the next part
            if (i < block_size-1):
                gen[:,i+1:i+2] = idx_next
            else:
                return torch.concat([gen[:,1:], idx_next], axis = 1)
        

    
m = Transformer(vocab_size)
#m = torch.load('./Models/HighEmbedCurrent')
#gen = m.generate(tar)[0,:].tolist()
#gen

####  Data Loader with Machine Translation

Now that we have the transformer, we need to have a way to load the data. Our data loader create tensors for the English Sentence, the French Sentence, and the Prediction Targets for each point in the French Translation.

In [27]:
# NOW WE NEED TO LOAD THE DATA
import pandas
import numpy as np

data = pandas.read_csv('fra.csv', sep = '\t')
data.columns = usecols = ['eng', 'fra']


train_numbers = np.random.choice(range(len(data)), size= int(0.9*len(data)), replace=False)
test_numbers = [i for i in range(len(data)) if i not in numbers]

data['eng']=data['eng'].apply(enc.encode)
data['fra']=data['fra'].apply(enc.encode)

train = data.iloc[train_numbers]
test = data.iloc[test_numbers]

In [13]:

# SOME OF OUR DATA IS LARGER THAN OUR BUFFER SIZE, FOR SIMPLICITY WE WILL JUST CUT THAT DATA OUT.

def normalize_length(arr):
    
    arr = arr[:block_size] + [0] * ( max(0, block_size - len(arr)) ) 
    
    return arr
    
    
train['eng']=train['eng'].apply(normalize_length)
train['fra']=train['fra'].apply(normalize_length)

In [14]:


def get_batch(split):
    data = train if split == 'train' else test
    mx = len(data.index)
    
    ix = torch.randint(mx, (batch_size,))
    
    eng = torch.stack([torch.LongTensor(data.loc[int(i)]['eng']) for i in ix])
    tar = torch.stack([torch.LongTensor(data.loc[int(i)]['fra']) for i in ix])
    fra = torch.stack([torch.LongTensor([0] + data.loc[int(i)]['fra'][:block_size-1]) for i in ix])

    return tar, fra, eng

tar, fra, eng = get_batch('train')


#### Training

Now all there is left to do is train the model:

In [None]:
optimizer = torch.optim.AdamW(m.parameters(), lr = 3e-4)

for steps in range(5000):

    tar, fra, eng = get_batch('train')
    
    logits, loss = m(x = fra, x_cross = eng, targets = tar)
    
    print(loss.item(), end = "\r")
    
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
    if (steps%1000 == 1):
        torch.save(m,'./Models/HighEmbedEpoch'+str(steps) )

    


## Results

While not entirely accurate, the transformer is able to roughly translate some sentences. Currently, it is limited by training time, and complexity. Given a higher embed dimension, more attention heads, or a greater depth, along with ample training, it would be able to more accurately decode french sentences.

Another major culprit is our tokenizer. The token are connected via MEANING (word embedding), the english tokenizer subdivides sentences into english tokens, thus resulting in token meaning not translating over well, as words are subdivided, with little meaning in embedding space.

Finally, we used LEARNED POSITIONAL EMBEDDINGS. While these are sufficient for basic tasks, it is typically common-practice to use SINUSOIDAL EMBEDDINGS INSTEAD.

In [18]:
tar, fra, eng = get_batch('test')
gen = m.generate(eng)

In [19]:
for i in range(10):
    print(enc.decode([n for n in eng[i,:].tolist() if n != 0]), enc.decode([n for n in gen[i,:].tolist() if n != 0]))

I'm in the mood for something sweet. Srsque j'avion les paies quelque chose.
Is there any other way besides extraction? Y a - t - il d'autre une manque, a chamua plans?
We got ready. Nous s'espérains présentée à nous passer.
I liked your friends. J'appréciaère vos amis.
She was educated by her grandfather. Elle fut la pi sciencee était par son grand -uand.
I know neither of them. Je suis parti bord de le dénus.
Thanks for supporting me. Merci d'avoir offens compées.
I folded the towels. Je répépête surpris.
I just want you to think about it. Je veux juste que tu y songes.
Do you have children's clothes? Avez - vous des enfants?
