# Transformers From Scratch
You've heard the name, you've seen the papers, you've probably even stepped through a few repos. But can you build a multi-head attention mechanism from scratch?

In this lab you'll be asked to implement the novel multi-head attention function from scratch, and plug this into our provided framework to test that your function works.

We'll also ask you to parralellize it to take advantage of all the cores in your machine. 

So hang in there, and let's get to it!

---
# Task 0 - Learn About This Script Pacakge
First, let's examine the codebase here. You'll see a Transformer-based model architecture, so to speak, implemented here. The catch? There is no multi-head attention function. 

We'll step through the codebase with you here in the guide so you know the key points. At the end we'll run a quick functional test to make sure you can indeed run a basic training job.

Task 0 should take ~ 10 minutes. Try to take your time reading through this so you know what the arguments are and how the script is constructed.

In [23]:
# this package is a fork of Peter Bloem's "former" repository, where he implements a Transformer from scratch in PyTorch.
!git clone https://github.com/EmilyWebber/former.git

First, this codebase supports two modes. One for text classification, the other for text generation. Both of those can be run by a single Python script, `python experiments/classify.py`. You'll see that we are just calling that Python script inside of the `model.py` script that SageMaker uses to execute the training job.

Inside of `classify.py`, you'll see we're creating a model on line 60. `model = former.CTransformer`. Note that this model takes a few hyperparameters - `embedding_size`, `num_heads`, `depth`, `seq_length`, `num_classes`, and `max_pool`. 

![](images/model_create.png)

Now let's check out that `CTransformer()` object. You'll see it's inherited from the `former` class. 

Inside `transformers.py`, there's an `__init__` for the `CTransformer()` object. Inside the init, we'll see a small for-loop defined that pulls in the hyperparameters we just passed in, and creates a `TransformerBlock`. Then, it's using the PyTorch `nn.Sequential` API to convert those blocks into a sequential neural network component.

![](images/TransformerBlock.png)

Now, where is this `TransformerBlock` defined? It's actually in `modules.py`.  You'll see one `TransformerBlock` class, that inherits a `SelfAttentionWide` or `SelfAttentionNarrow` object on creation. We'll follow the rabbit-hole on the `SelfAttentionNarrow.`

The `SelfAttentionWide` init defines a few objects: `self.tokeys`, `self.toqueries`, and `self.tovalues`.

![](images/tokeys.png)

Then, we see a `forward` function that calls `self.tokeys`. Next, it computes a scaled dot-product self-attention function! That's what we need to implement. 

![](images/self-attention.png)

Sound like fun? Now, run your base training job to make sure the pipes are working today. This should download the data onto the training host, but it won't actually train because we haven't implemented your solution yet.

In [7]:
# !pip install --upgrade sagemaker

In [6]:
# run the model file
import sagemaker
import os

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

prefix = 'transformers'

role = sagemaker.get_execution_role()

# create some arbitrary train file that we won't use
!echo 1,2,3,4 > holder_file.csv 
s3_train_path = "s3://{}/{}/train/{}".format(bucket, prefix, 'holder_file.csv')
os.system('aws s3 cp {} {}'.format( 'holder_file.csv', s3_train_path))

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='model.py',
                    role=role,
                    framework_version='1.2.0',
                    py_version = 'py3',
                    source_dir = 'former',
                    train_instance_count=1,
                    train_instance_type = 'ml.p3.2xlarge')

estimator.fit({'training': s3_train_path}, wait=False)

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---
# Task 1 - Implement a Multi-head Attention Function
Now, here's the fun part. Can you implement your own multi-head attention mechanism? Don't worry, we'll give you all the tips you need.

Task 1 should take ~30 minutes.

In [14]:
!pip install -r former/requirements.txt

As you're stepping through the code below to build your neural network, don't panic. You can actually look at the pictures provided in this notebook here, along with the code from the `former` repo, for inspiration. 

In [87]:
%%writefile former/former/my_transformer_model.py
from _context import former
import tqdm

from torch import nn
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import math
from torchtext import data, datasets, vocab

import numpy as np

from torch.utils.tensorboard import SummaryWriter


def my_multihead_attention_mechanism(x, emb = 128, h = 8):
    
    b, t, e = x.size()
    assert e == emb, f'Input embedding dim ({e}) should match layer embedding dim ({self.emb})'

    s = e // h
    x = x.view(b, t, h, s)

    tokeys = nn.Linear(s, s, bias=False)
    toqueries = nn.Linear(s, s, bias=False)
    tovalues  = nn.Linear(s, s, bias=False)
    
    assert keys.size() == (b, t, h, s)
    assert queries.size() == (b, t, h, s)
    assert values.size() == (b, t, h, s)
    
    # compute scaled dot-product self-attention

    # - fold heads into the batch dimension
    keys = keys.transpose(1, 2).contiguous().view(b * h, t, e)
    queries = queries.transpose(1, 2).contiguous().view(b * h, t, e)
    values = values.transpose(1, 2).contiguous().view(b * h, t, e)

    queries = queries / (e ** (1/4))
    keys    = keys / (e ** (1/4))
    # - Instead of dividing the dot products by sqrt(e), we scale the keys and values.
    #   This should be more memory efficient

    # - get dot product of queries and keys, and scale
    dot = torch.bmm(queries, keys.transpose(1, 2))

    assert dot.size() == (b*h, t, t)

    if self.mask: # mask out the upper half of the dot matrix, excluding the diagonal
        mask_(dot, maskval=float('-inf'), mask_diagonal=False)

    dot = F.softmax(dot, dim=2)
    # - dot now has row-wise self-attention probabilities

    # apply the self attention to the values
    out = torch.bmm(dot, values).view(b, h, t, e)

    # swap h, t back, unify heads
    out = out.transpose(1, 2).contiguous().view(b, t, h * e)

    unifyheads = nn.Linear(heads * emb, emb)

    return unifyheads(out)


Overwriting former/former/my_transformer_model.py


It turns out this package is just happier being run as a script - when you run this line it's pointing to a string of custom objects that will ultimately look for your `my_attention_mechanism` function as the base of the model.

In [89]:
!python former/experiments/classify.py

Traceback (most recent call last):
  File "former/experiments/classify.py", line 1, in <module>
    from _context import former
  File "/home/ec2-user/SageMaker/transformers-from-scratch/former/experiments/_context.py", line 7, in <module>
    import former
  File "/home/ec2-user/SageMaker/transformers-from-scratch/former/former/__init__.py", line 1, in <module>
    from .modules import SelfAttentionWide, SelfAttentionWide, TransformerBlock
  File "/home/ec2-user/SageMaker/transformers-from-scratch/former/former/modules.py", line 8, in <module>
    from .my_transformer_model import my_multihead_attention_mechanism
  File "/home/ec2-user/SageMaker/transformers-from-scratch/former/former/my_transformer_model.py", line 1, in <module>
    from _context import former
ImportError: cannot import name 'former'


Now, let's test that out on SageMaker! 

In [74]:
%%writefile former/my_new_model.py

import os

if __name__ == '__main__':
    os.system('python experiments/my_transfomer_model.py')

Writing former/my_new_model.py


In [75]:
# run the model file
import sagemaker
import os

sagemaker_session = sagemaker.Session()

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='my_new_model.py',
                    role=role,
                    framework_version='1.2.0',
                    py_version = 'py3',
                    source_dir = 'former',
                    train_instance_count=1,
                    train_instance_type = 'ml.p3.2xlarge')

estimator.fit({'training': s3_train_path}, wait=False)



And that's a wrap! If you made it here with extra time to spare, consider doubling back on some of those functions and try to really understand what's going on there. 