# Transformers From Scratch
You've heard the name, you've seen the papers, you've probably even stepped through a few repos. But can you build a multi-head attention mechanism from scratch?

In this lab you'll implement multi-head attention from scratch, then plug it into the framework we provide to test that your function works.

We'll also ask you to parallelize it to take advantage of all the cores in your machine.

So hang in there, and let's get to it!

---
# Task 0 - Learn About This Script Package
First, let's examine the codebase. You'll see a Transformer-based model architecture implemented here. The catch? There is no multi-head attention function.

We'll step through the codebase with you here in the guide so you know the key points. At the end we'll run a quick functional test to make sure you can indeed run a basic training job.

Task 0 should take ~ 10 minutes. Try to take your time reading through this so you know what the arguments are and how the script is constructed.

In [6]:
# this package is a fork of Peter Bloem's "former" repository, where he implements a Transformer from scratch in PyTorch.
!git clone https://github.com/EmilyWebber/former.git

Cloning into 'former'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 134 (delta 18), reused 25 (delta 8), pack-reused 95
Receiving objects: 100% (134/134), 34.90 MiB | 72.49 MiB/s, done.
Resolving deltas: 100% (54/54), done.


First, this codebase supports two modes: one for text classification, the other for text generation. Both can be run by a single Python script, `python experiments/classify.py`. You'll see that we simply call that Python script from inside the `model.py` script that SageMaker uses to execute the training job.

Inside of `classify.py`, you'll see we're creating a model on line 60. `model = former.CTransformer`. Note that this model takes a few hyperparameters - `embedding_size`, `num_heads`, `depth`, `seq_length`, `num_classes`, and `max_pool`. 

![](images/model_create.png)

Now let's check out that `CTransformer()` object. You'll see it comes from the `former` package.

Inside `transformers.py`, there's an `__init__` for the `CTransformer()` object. Inside the init, we'll see a small for-loop defined that pulls in the hyperparameters we just passed in, and creates a `TransformerBlock`. Then, it's using the PyTorch `nn.Sequential` API to convert those blocks into a sequential neural network component.

![](images/TransformerBlock.png)
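
As a rough sketch of that pattern (not the repository's actual code), the loop below builds `depth` copies of a stand-in block and wraps them with `nn.Sequential`. The names `ToyBlock` and `build_stack` are hypothetical and exist only for this illustration; the real `TransformerBlock` also contains the self-attention layer you will write in Task 1.

```python
import torch
from torch import nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for TransformerBlock: feed-forward + residual + layer norm."""
    def __init__(self, emb):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(emb, 4 * emb), nn.ReLU(), nn.Linear(4 * emb, emb))
        self.norm = nn.LayerNorm(emb)

    def forward(self, x):
        # residual connection keeps the (batch, seq_len, emb) shape unchanged
        return self.norm(x + self.ff(x))

def build_stack(emb, depth):
    # collect one block per layer, then chain them into a single module
    blocks = []
    for _ in range(depth):
        blocks.append(ToyBlock(emb))
    return nn.Sequential(*blocks)

stack = build_stack(emb=16, depth=3)
x = torch.randn(2, 5, 16)   # (batch, seq_len, emb), sizes chosen arbitrarily
y = stack(x)
print(len(stack), y.shape)
```

Because every block maps a `(batch, seq_len, emb)` tensor to a tensor of the same shape, any number of blocks can be chained this way.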

Now, where is this `TransformerBlock` defined? It's actually in `modules.py`. You'll see one `TransformerBlock` class, which creates either a `SelfAttentionWide` or a `SelfAttentionNarrow` object when it is constructed. We'll follow the rabbit hole down `SelfAttentionWide`.

The `SelfAttentionWide` init defines a few objects: `self.tokeys`, `self.toqueries`, and `self.tovalues`.

![](images/tokeys.png)
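
To make the shapes concrete, here is a minimal sketch of what three such projection layers do to a batch of embeddings. It assumes the "wide" layout used in the Task 1 cell of this notebook, where each layer maps `emb` inputs to `emb * heads` outputs; the sizes `b`, `t`, `emb`, and `heads` are arbitrary small values chosen for illustration.

```python
import torch
from torch import nn

emb, heads = 8, 2   # small illustrative sizes
b, t = 3, 5         # batch size and sequence length (assumed)

# one bias-free linear projection each for keys, queries, and values
tokeys = nn.Linear(emb, emb * heads, bias=False)
toqueries = nn.Linear(emb, emb * heads, bias=False)
tovalues = nn.Linear(emb, emb * heads, bias=False)

x = torch.randn(b, t, emb)              # token embeddings

keys = tokeys(x)                        # (b, t, heads * emb)
keys_per_head = keys.view(b, t, heads, emb)   # split last dim into one vector per head

print(keys.shape, keys_per_head.shape)
```

In this layout every head receives a full `emb`-sized key, query, and value vector for each token, which is why the output dimension is multiplied by `heads`.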

Then, we see a `forward` function that calls `self.tokeys`. Next, it computes a scaled dot-product self-attention function! That's what we need to implement. 

![](images/self-attention.png)
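
Before tackling the multi-head version, it can help to see the single-head case on its own. The sketch below is a generic scaled dot-product attention, not the repository's code: the function name `scaled_dot_product_attention` and the tensor sizes are assumptions made for this example.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    # queries, keys, values: (batch, seq_len, dim)
    dim = queries.size(-1)
    # similarity of every query with every key, scaled by sqrt(dim)
    scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(dim)  # (batch, t, t)
    # each row becomes a probability distribution over positions
    weights = F.softmax(scores, dim=2)
    # weighted sum of the value vectors
    return torch.bmm(weights, values), weights

torch.manual_seed(0)
q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
v = torch.randn(2, 4, 8)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.sum(dim=2))
```

Multi-head attention repeats this computation once per head, each with its own projections, and then merges the results.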

Sound like fun? Now, run your base training job to make sure the pipes are working today. This should download the data onto the training host, but it won't actually train because we haven't implemented your solution yet.

In [2]:
# run the model file
import sagemaker
import os

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

prefix = 'transformers'

role = sagemaker.get_execution_role()

# create some arbitrary train file that we won't use
!echo 1,2,3,4 > holder_file.csv 
s3_train_path = "s3://{}/{}/train/{}".format(bucket, prefix, 'holder_file.csv')
os.system('aws s3 cp {} {}'.format( 'holder_file.csv', s3_train_path))

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='model.py',
                    role=role,
                    framework_version='1.2.0',
                    py_version = 'py3',
                    source_dir = 'former',
                    instance_count=1,
                    instance_type = 'ml.p3.2xlarge')

estimator.fit({'training': s3_train_path}, wait=False)

---
# Task 1 - Implement a Multi-head Attention Function
Now, here's the fun part. Can you implement your own multi-head attention mechanism? Don't worry, we'll give you all the tips you need.

Task 1 should take ~30 minutes.

In [4]:
!pip install -r former/requirements.txt



In [41]:
import math

import torch
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F

from torchtext import data, datasets, vocab

import numpy as np

from torch.utils.tensorboard import SummaryWriter

# Used for converting between nats and bits
LOG2E = math.log2(math.e)
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)
NUM_CLS = 2

def d(tensor=None):
    if tensor is None:
        return 'cuda' if torch.cuda.is_available() else 'cpu'
    return 'cuda' if tensor.is_cuda else 'cpu'



def get_x():
    
    tdata, _ = datasets.IMDB.splits(TEXT, LABEL)

    print ('calling imbdb.splits')

    train, test = tdata.split(split_ratio=0.8)

    print ('loading train and test sets')
    TEXT.build_vocab(train, max_size=50_000 - 2) # - 2 to make space for <unk> and <pad>
    LABEL.build_vocab(train)

    train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=4, device=d())

    return train_iter, test_iter

# it will take a few minutes to run this cell, even on an m5.large
# go ahead and run this cell. while you're waiting for it to complete, jump ahead to the next portion
tbw = SummaryWriter(log_dir='./runs') # Tensorboard logging
train_iter, _ = get_x()

calling imbdb.splits
loading train and test sets


In [5]:
# this is where you define your first three layers, for the keys, queries, and values
def get_inputs(x, emb=512, heads=8):
    # x: (batch, seq_len, emb) tensor of token embeddings
    b, t, e = x.size()
    assert e == emb, 'input embedding size must match emb'

    # one projection per role; each yields `heads` vectors of size `emb` per token
    tokeys = nn.Linear(emb, emb * heads, bias=False)
    toqueries = nn.Linear(emb, emb * heads, bias=False)
    tovalues = nn.Linear(emb, emb * heads, bias=False)

    # split the projected vectors into heads: (b, t, h, e)
    keys    = tokeys(x).view(b, t, heads, e)
    queries = toqueries(x).view(b, t, heads, e)
    values  = tovalues(x).view(b, t, heads, e)

    return keys, queries, values

def my_multihead_attention_mechanism(keys, queries, values, mask=False):
    # compute scaled dot-product self-attention
    b, t, h, e = keys.size()

    # - fold heads into the batch dimension
    keys = keys.transpose(1, 2).contiguous().view(b * h, t, e)
    queries = queries.transpose(1, 2).contiguous().view(b * h, t, e)
    values = values.transpose(1, 2).contiguous().view(b * h, t, e)

    queries = queries / (e ** (1/4))
    keys    = keys / (e ** (1/4))
    # - Instead of dividing the dot products by sqrt(e), we scale the queries and keys.
    #   This should be more memory efficient

    # - get dot product of queries and keys (already scaled above)
    dot = torch.bmm(queries, keys.transpose(1, 2))

    assert dot.size() == (b * h, t, t)

    if mask: # mask out the upper half of the dot matrix, excluding the diagonal
        rows, cols = torch.triu_indices(t, t, offset=1)
        dot[:, rows, cols] = float('-inf')

    dot = F.softmax(dot, dim=2)
    # - dot now has row-wise self-attention probabilities

    # apply the self attention to the values
    out = torch.bmm(dot, values).view(b, h, t, e)

    # swap h, t back, unify heads
    out = out.transpose(1, 2).contiguous().view(b, t, h * e)

    # in a full module this layer belongs in __init__ so that its weights get trained
    unifyheads = nn.Linear(h * e, e)

    return unifyheads(out)

# placeholder input: 4 sequences of 16 tokens with 512-dimensional embeddings
x = torch.randn(4, 16, 512)

keys, queries, values = get_inputs(x)

my_multihead_attention_mechanism(keys, queries, values)

# then we apply the train iter object to the compiled neural network

let's do this!!
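
Two details from the attention cell are easy to check in isolation. The sketch below, using arbitrary small tensor sizes, first confirms that scaling the queries and keys by `e ** (1/4)` each gives the same scores as dividing the dot products by `sqrt(e)`, and then shows one way to apply a causal mask with `torch.triu_indices`. The masking code is an assumption about what a helper like `mask_` does, not a copy of it.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, t, e = 2, 4, 8                      # illustrative sizes
queries = torch.randn(b, t, e)
keys = torch.randn(b, t, e)

# (1) scale inputs before the product vs. scale the product afterwards
scaled_early = torch.bmm(queries / e ** 0.25, (keys / e ** 0.25).transpose(1, 2))
scaled_late = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(e)
same = torch.allclose(scaled_early, scaled_late, atol=1e-5)

# (2) causal mask: set everything above the diagonal to -inf before the softmax
dot = scaled_late.clone()
rows, cols = torch.triu_indices(t, t, offset=1)   # offset=1 leaves the diagonal unmasked
dot[:, rows, cols] = float('-inf')
weights = F.softmax(dot, dim=2)

print(same)
print(weights[0])
```

After the softmax, every masked position has weight zero, so each token attends only to itself and to earlier tokens, while each row still sums to one.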


Great stuff! Now, let's get that incorporated into the entire script we defined above. Paste your function into the script below.

In [6]:
%%writefile my_transformer_model.py

# << paste your function here >> 

Writing my_transformer_model.py


Now, let's test that out on SageMaker! 

In [8]:
# run your new transformer script on SageMaker

---
# Task 2 - Compare against solution set
If you made it here with tons of time to spare, congrats! We will release the solution notebook 70 minutes into the lab, so that you definitely have a shot at implementing the attention mechanism yourself.

This last task is optional - just so you can compare your solutions against ours and make sure we're all on the right track.

Task 2 should take ~ 20 minutes.

In [66]:
!cd ../ && aws s3 sync transformers-from-scratch s3://notebook-etc/transformers-from-scratch/

upload: transformers-from-scratch/.ipynb_checkpoints/Transformers From Scratch-checkpoint.ipynb to s3://notebook-etc/transformers-from-scratch/.ipynb_checkpoints/Transformers From Scratch-checkpoint.ipynb
upload: transformers-from-scratch/Transformers From Scratch.ipynb to s3://notebook-etc/transformers-from-scratch/Transformers From Scratch.ipynb
upload: transformers-from-scratch/former/.git/HEAD to s3://notebook-etc/transformers-from-scratch/former/.git/HEAD
upload: transformers-from-scratch/former/.git/config to s3://notebook-etc/transformers-from-scratch/former/.git/config
upload: transformers-from-scratch/former/.git/description to s3://notebook-etc/transformers-from-scratch/former/.git/description
upload: transformers-from-scratch/former/.git/hooks/applypatch-msg.sample to s3://notebook-etc/transformers-from-scratch/former/.git/hooks/applypatch-msg.sample
upload: transformers-from-scratch/former/.git/hooks/post-update.sample to s3://notebook-etc/transformers-from-scratch/former/.