# Transformers From Scratch
You've heard the name, you've seen the papers, you've probably even stepped through a few repos. But can you build a multi-head attention mechanism from scratch?

In this lab you'll implement a multi-head attention function from scratch, then plug it into our provided framework to test that it works.

We'll also ask you to parallelize it to take advantage of all the cores in your machine.

So hang in there, and let's get to it!

---
# Task 0 - Learn About This Script Package
First, let's examine the codebase. You'll see a Transformer-based model architecture implemented here. The catch? There is no multi-head attention function.

We'll step through the codebase with you here in the guide so you know the key points. At the end we'll run a quick functional test to make sure you can indeed run a basic training job.

Task 0 should take ~ 10 minutes. Try to take your time reading through this so you know what the arguments are and how the script is constructed.

In [68]:
# this package is a fork of Peter Bloem's "former" repository, where he implements a Transformer from scratch in PyTorch.
!git clone https://github.com/EmilyWebber/former.git

fatal: destination path 'former' already exists and is not an empty directory.


First, this codebase supports two modes: one for text classification, the other for text generation. Both can be run by a single Python script, `python experiments/classify.py`. We simply call that Python script from inside the `model.py` script that SageMaker uses to execute the training job.

Inside `classify.py`, you'll see we create a model on line 60: `model = former.CTransformer`. Note that this model takes a few hyperparameters: `embedding_size`, `num_heads`, `depth`, `seq_length`, `num_classes`, and `max_pool`.

![](images/model_create.png)

Now let's check out that `CTransformer()` object. You'll see it comes from the `former` package.

Inside `transformers.py`, there's an `__init__` for the `CTransformer()` object. Inside the init, we'll see a small for-loop defined that pulls in the hyperparameters we just passed in, and creates a `TransformerBlock`. Then, it's using the PyTorch `nn.Sequential` API to convert those blocks into a sequential neural network component.

![](images/TransformerBlock.png)

Now, where is this `TransformerBlock` defined? It's actually in `modules.py`. You'll see one `TransformerBlock` class that creates either a `SelfAttentionWide` or a `SelfAttentionNarrow` object when it is constructed. We'll follow the rabbit hole into `SelfAttentionWide`.

The `SelfAttentionWide` init defines a few objects: `self.tokeys`, `self.toqueries`, and `self.tovalues`.

![](images/tokeys.png)

Then, we see a `forward` function that calls `self.tokeys`. Next, it computes a scaled dot-product self-attention function! That's what we need to implement. 

![](images/self-attention.png)
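For reference, scaled dot-product attention is usually written as below, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of each key vector. This is the textbook form; it is worth checking how the code in `modules.py` applies the scaling, since an implementation may arrange it differently.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$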

Sound like fun? Now, run your base training job to make sure the pipes are working. This should download the data onto the training host, but it won't actually train because you haven't implemented your solution yet.

In [72]:
# run the model file
import sagemaker
import os

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

prefix = 'transformers'

role = sagemaker.get_execution_role()

# create some arbitrary train file that we won't use
!echo 1,2,3,4 > holder_file.csv 
s3_train_path = "s3://{}/{}/train/{}".format(bucket, prefix, 'holder_file.csv')
os.system('aws s3 cp {} {}'.format( 'holder_file.csv', s3_train_path))

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='model.py',
                    role=role,
                    framework_version='1.2.0',
                    py_version = 'py3',
                    source_dir = 'former',
                    instance_count=1,
                    # if you haven't already cut a ticket for this instance type, go ahead and do it. this will dramatically improve your training time! 
                    instance_type = 'ml.p3dn.24xlarge')

estimator.fit({'training': s3_train_path}, wait=False)

---
# Task 1 - Implement a Multi-head Attention Function
Now, here's the fun part. Can you implement your own multi-head attention mechanism? Don't worry, we'll give you all the tips you need.

Task 1 should take ~30 minutes.
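As one of those tips, here is a minimal sketch of multi-head attention, written in NumPy rather than PyTorch so the shapes are easy to inspect. It assumes the common formulation in which the embedding dimension is split evenly across heads; the repository's `SelfAttentionWide` may lay out its heads differently. The function name `multihead_attention_sketch` and the weight names `w_q`, `w_k`, `w_v`, and `w_o` are illustrative and do not come from the codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multihead_attention_sketch(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, emb); each weight matrix: (emb, emb). Names are illustrative."""
    seq_len, emb = x.shape
    assert emb % num_heads == 0, "emb must be divisible by num_heads"
    head_dim = emb // num_heads

    def split_heads(m):
        # (seq_len, emb) -> (num_heads, seq_len, head_dim)
        return m.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product scores per head: (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values per head, then concatenate heads back to (seq_len, emb).
    heads = weights @ v
    concat = heads.transpose(1, 0, 2).reshape(seq_len, emb)
    return concat @ w_o, weights

# Toy inputs with made-up shapes, only to exercise the function.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, weights = multihead_attention_sketch(x, w_q, w_k, w_v, w_o, num_heads=2)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each row of `weights` sums to 1, which is a quick sanity check to run against your own implementation.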

In [5]:
def my_multihead_attention_mechanism():
    print ("let's do this!!")
    
my_multihead_attention_mechanism()

let's do this!!


Great stuff! Now, let's get that incorporated into the entire script we defined above. Paste your function into the script below.

In [6]:
%%writefile my_transformer_model.py

# << paste your function here >> 

Writing my_transformer_model.py


Now, let's test that out on SageMaker! 

In [8]:
# run your new transformer script on SageMaker

---
# Task 2 - Parallelize your Multihead Attention Function
That seemed surprisingly doable, right? Now remember that a major advantage of transformers is their ability to take advantage of multiple GPUs, or compute cores, in a single host. Let's see if we can implement that here!

Task 2 should take ~ 30 minutes.

First, let's import Python's `multiprocessing` package. Also grab the `Pool` object from within `multiprocessing`.

In [16]:
# import the multiprocessing package
import multiprocessing
from multiprocessing import Pool

Next, try to call the method from multiprocessing, `.cpu_count()`, to confirm how many cores you are running on right now.

In [19]:
# num_cpus = # your function here
num_cpus = multiprocessing.cpu_count()
print (num_cpus)

2


Next, set up a `Pool()` object that takes the number of cpus, defined above, as an argument.

In [23]:
# pool = # your code here
pool = Pool(num_cpus)

The `multiprocessing` package makes this super easy for us. All we need to do is define a `data_to_map` variable that's literally holding all of the data we want to map out to each core, plus an `attention_function` that will be called on each of those matrices. Let's give it a shot! As a hint, you may want to simply consider using a list.

In [None]:
data_to_map = # your code here
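If it helps, here is one possible way to build such a list, assuming you want one matrix per attention head. The shapes, the `num_heads` value, and the name `example_data_to_map` are made up for illustration; your own split may look different.

```python
import numpy as np

num_heads = 4          # hypothetical value
seq_len, emb = 6, 8    # hypothetical shapes

# A toy (seq_len, emb) input matrix.
x = np.arange(seq_len * emb, dtype=float).reshape(seq_len, emb)

# Split the embedding dimension into one (seq_len, emb // num_heads) matrix per head.
# np.split returns a plain Python list of arrays.
example_data_to_map = np.split(x, num_heads, axis=1)

print(len(example_data_to_map), example_data_to_map[0].shape)  # 4 (6, 2)
```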

Next, let's use your attention function definition from above, now formatted to work with each object in your `data_to_map` list. 

In [24]:
def attention_function(item):
    # placeholder: `item` is one element of data_to_map, passed in by pool.map
    print ('all the attention')
    

Finally, we just want to pass the `attention_function` and the `data_to_map` objects into the `pool` variable you created above. As a hint, you might consider something like this:   `new_rows = pool.map(function_name, data_obj)`

In [26]:
# new_rows = pool.map(attention_function, data_to_map)
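To see the whole pattern end to end, here is a small sketch that uses a toy function in place of attention. The helper `map_over_pool` and the function `row_sum` are hypothetical names for this example. Note that the function passed to `pool.map` must accept exactly one argument, because `map` calls it once per list item. On platforms that start workers with "spawn" rather than "fork", the function may also need to live in an importable module rather than a notebook cell.

```python
import multiprocessing
from multiprocessing import Pool

def row_sum(row):
    # Hypothetical stand-in for an attention function: handles ONE item of the list.
    return sum(row)

def map_over_pool(func, items, num_workers=None):
    # Default to one worker per CPU core reported by the OS.
    num_workers = num_workers or multiprocessing.cpu_count()
    # The context manager shuts the worker processes down when the block exits.
    with Pool(num_workers) as pool:
        # pool.map returns results in the same order as the input items.
        return pool.map(func, items)

if __name__ == "__main__":
    toy_data = [[1, 2], [3, 4], [5, 6]]
    new_rows = map_over_pool(row_sum, toy_data)
    print(new_rows)  # [3, 7, 11]
```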

Remember that `concat` at the end of the attention mechanisms? Let's implement that right here. 
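As a hint, the concatenation step might look something like the sketch below, assuming each worker returned one `(seq_len, head_dim)` array per head. The `head_outputs` list here is fabricated toy data standing in for the result of your own `pool.map` call.

```python
import numpy as np

# Hypothetical per-head outputs: a list of 4 arrays, each of shape (seq_len=3, head_dim=2).
# Head h is filled with the value h so the layout is easy to see after concatenation.
head_outputs = [np.full((3, 2), fill_value=h, dtype=float) for h in range(4)]

# Concatenate along the feature axis to get back to (seq_len, num_heads * head_dim).
concatenated = np.concatenate(head_outputs, axis=1)

print(concatenated.shape)  # (3, 8)
print(concatenated[0])     # [0. 0. 1. 1. 2. 2. 3. 3.]
```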

---
# Task 3 - Compare against solution set
If you made it here with tons of time to spare, congrats! We will release the solution notebook 70 minutes into the lab, so that you definitely have a shot at implementing the attention mechanism yourself.

This last task is optional - just so you can compare your solutions against ours and make sure we're all on the right track.

Task 3 should take ~ 20 minutes.

In [66]:
!cd ../ && aws s3 sync transformers-from-scratch s3://notebook-etc/transformers-from-scratch/
