# Coding a Transformer from Scratch

Giorgio Lazzarinetti - My Contacts
For any questions or doubts you can find my contacts here:

giorgiolazzarinetti@gmail.com g.lazzarinetti@campus.unimib.it

## Notebook Outline

- **Multi Head Attention step by step**
- **Transformer from scratch**
- **Transformer with Torch.nn.Transformer**
- **Pre-trained Transformer - Contextual Word Embedding**
- **LAB CHALLENGE 4: Recommender System with Attention**





Some refs:

* Original Paper: [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf);

* [The Illustrated Transformer - Jay Alammar.](http://jalammar.github.io/illustrated-transformer/)


In [1]:
%%capture
!pip install torchtext==0.10.0

# After the installation, restart the session

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


from torchtext.legacy import data, datasets
from torchtext import vocab

#from torchtext import data, datasets

import numpy as np
import random, tqdm, sys, math, gzip
from torchsummary import summary

from torch.utils.tensorboard import SummaryWriter

In [3]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

## Multi-head Attention Mechanism, step by step

Before looking at how to implement self-attention and multi-head attention, let's understand how attention works.

Say the following sentence is an input sentence we want to translate:

`The animal didn't cross the street because it was too tired`

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

### Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The **first step** in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.


<center>  <img src="https://drive.google.com/uc?export=view&id=1N7XrMctFitVRKkqj_FoXs0GA_AlaVp06" width="750" height="550"> </center> 

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The **second step** in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The **third and fourth steps** are to divide the scores by the square root of the dimension of the key vectors. This leads to having more stable gradients. There could be other possible values here, but this is the default. Then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Typically the word at this position has the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.


The **fifth step** is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The **sixth step** is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
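The six steps can be traced by hand on a toy input. The following is a minimal sketch in plain Python, assuming made-up 3-dimensional query, key and value vectors for a two-word sentence (the numbers are illustrative only, not trained weights):

```python
import math

# Hypothetical 3-dimensional vectors for a two-word input (illustrative values).
q1 = [1.0, 0.0, 1.0]                          # query of the word being encoded
keys = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]     # k1, k2
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # v1, v2

d_k = len(q1)  # dimension of the key vectors

# Step 2: score every word against the current word (dot products q1.k_j).
scores = [sum(q * k for q, k in zip(q1, key)) for key in keys]

# Step 3: divide the scores by the square root of the key dimension.
scaled = [s / math.sqrt(d_k) for s in scores]

# Step 4: softmax, so the weights are positive and sum to 1.
exps = [math.exp(s) for s in scaled]
weights = [v / sum(exps) for v in exps]

# Steps 5 and 6: weight each value vector and sum them up.
z1 = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

print(weights)
print(z1)
```

Here the first word gets the larger weight because its key is aligned with the query, so `z1` ends up closer to `v1` than to `v2`.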

When dealing with matrices, the *first step* is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV). Finally, since we’re dealing with matrices, we can condense steps two through six into one formula to calculate the outputs of the self-attention layer.

<center>  <img src="https://drive.google.com/uc?export=view&id=1Xyl9DiN_n_CgJwgrfymaVmAXA2UMInIP" width="750" height="350"> </center> 
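As a rough illustration of the matrix form, the sketch below computes softmax(Q Kᵀ / sqrt(d_k)) V for a single sentence. The sizes and the random weight matrices `W_q`, `W_k`, `W_v` are assumptions standing in for trained parameters:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

t, emb_dim, d_k = 5, 16, 8          # hypothetical sizes: 5 words, 16-dim embeddings
X = torch.rand(t, emb_dim)          # one embedding per row

# Random stand-ins for the trained weight matrices WQ, WK, WV.
W_q = torch.rand(emb_dim, d_k)
W_k = torch.rand(emb_dim, d_k)
W_v = torch.rand(emb_dim, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Steps two through six in one expression: softmax(Q K^T / sqrt(d_k)) V
weights = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (t, t), each row sums to 1
Z = weights @ V                                         # (t, d_k)

print(weights.shape, Z.shape)
```

Row `i` of `weights` holds the attention that word `i` pays to every word in the sentence, and row `i` of `Z` is the corresponding weighted sum of value vectors.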



First, let's set some hyperparameters. To keep things simple, we choose small values.

In [4]:
emb = 128 # embedding dimension (BERT like models 768)
h = 8 # number of heads (BERT has 12 heads)

batch_size = 4
sentence_length = 21 # 512 for BERT

Some fake random data with proper dimensions

In [5]:
x = torch.rand(batch_size, sentence_length, emb)

b, t, e = x.size()

In [6]:
x.size()

torch.Size([4, 21, 128])

Instantiate linear transformations for query, key and values. Each transformation will act on the input vector x.

In [7]:
tokeys    = nn.Linear(emb, emb, bias=False) # W_key
toqueries = nn.Linear(emb, emb, bias=False) # W_query
tovalues  = nn.Linear(emb, emb, bias=False) # W_value

Generate queries, keys and values. We first compute the k/q/v's on the whole embedding vectors, and then split into the different heads.

In [8]:
keys    = tokeys(x) # W_key x
queries = toqueries(x)
values  = tovalues(x)

In [9]:
print(keys.size())

torch.Size([4, 21, 128])


We now implement multi-head attention (the lighter version), splitting the vectors into the different heads.

The self-attention layer can be further refined by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

- It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.

- It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

In [10]:
s = e // h # 128 / 8

keys    = keys.view(b, t, h, s)
queries = queries.view(b, t, h, s)
values  = values.view(b, t, h, s)

print(keys.size())

torch.Size([4, 21, 8, 16])


In [11]:
keys.transpose(1, 2).size()

torch.Size([4, 8, 21, 16])

In [12]:
keys.transpose(1, 2).contiguous().view(b * h, t, s).size()

torch.Size([32, 21, 16])

We need now to compute the dot products. This is the same operation for every head, so we fold the heads into the batch dimension.

In [13]:
keys = keys.transpose(1, 2).contiguous().view(b * h, t, s)
queries = queries.transpose(1, 2).contiguous().view(b * h, t, s)
values = values.transpose(1, 2).contiguous().view(b * h, t, s)

# contiguous():  it actually makes a copy of the tensor such that the order of 
# its elements in memory is the same as if it had been created from scratch with the same data.
# transpose(1, 2) doesn't generate a new tensor with a new layout, it just 
# modifies meta information in the Tensor object so that the offset and stride describe the desired new shape.
# https://discuss.pytorch.org/t/contigious-vs-non-contigious-tensor/30107

keys.size()

torch.Size([32, 21, 16])

Perform dot products

In [14]:
print(queries.size())

print(keys.transpose(1, 2).size())

#Let's compute the attention matrix
dot = torch.bmm(queries, keys.transpose(1, 2))  # batch matrix-matrix product

print(dot.size())

torch.Size([32, 21, 16])
torch.Size([32, 16, 21])
torch.Size([32, 21, 21])
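To complete the walk-through, the remaining steps (scaling, softmax, weighting the values and merging the heads) could look like the sketch below. It uses fresh random tensors with the same shapes as the folded queries, keys and values, so it runs on its own; it is an illustration of the shapes involved rather than a trained layer:

```python
import torch
import torch.nn.functional as F

b, t, h, s = 4, 21, 8, 16                 # same sizes as in the cells above

# Random stand-ins for the folded queries, keys and values.
queries = torch.rand(b * h, t, s)
keys = torch.rand(b * h, t, s)
values = torch.rand(b * h, t, s)

# Scaled dot products, then a row-wise softmax.
dot = torch.bmm(queries, keys.transpose(1, 2)) / (s ** 0.5)
att = F.softmax(dot, dim=2)               # (b*h, t, t)

# Weighted sum of the values for each head.
out = torch.bmm(att, values).view(b, h, t, s)

# Swap heads and time back, then concatenate the heads.
out = out.transpose(1, 2).contiguous().view(b, t, h * s)
print(out.size())
```

The output has the same shape as the input `x`, which is what allows attention layers to be stacked.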


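Just for completeness, below is the wide variant of multi-head attention, in which every head receives full-size (`emb`-dimensional) queries, keys and values. It is computationally more intensive than the lighter version above, which matches the per-head dimension `emb / h` used in the original paper.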

In [15]:
emb = 128
h = 8

x = torch.rand(4, 21, emb)

b, t, e = x.size()

tokeys    = nn.Linear(emb, emb * h, bias=False)
toqueries = nn.Linear(emb, emb * h, bias=False)
tovalues  = nn.Linear(emb, emb * h, bias=False)

keys    = tokeys(x)
queries = toqueries(x)
values  = tovalues(x)

print(keys.size())

keys    = keys.view(b, t, h, e)
queries = queries.view(b, t, h, e)
values  = values.view(b, t, h, e)

print(keys.size())


torch.Size([4, 21, 1024])
torch.Size([4, 21, 8, 128])


### Model Definition

Let us collect everything and define the self-attention class

In [16]:
class MHSelfAttention(nn.Module):
    """
    Multi-head self attention.
    """

    def __init__(self, emb, heads=8):
        """
        :param emb: Embedding dimension of the input vectors
        :param heads: Number of attention heads (must divide emb)
        """
        super().__init__()

        assert emb % heads == 0, f'Embedding dimension ({emb}) should be divisible by nr. of heads ({heads})'

        self.emb = emb
        self.heads = heads

        #s = emb // heads
        # - We will break the embedding into `heads` chunks and feed each to a different attention head

        self.tokeys    = nn.Linear(emb, emb, bias=False)
        self.toqueries = nn.Linear(emb, emb, bias=False)
        self.tovalues  = nn.Linear(emb, emb, bias=False)

        self.unifyheads = nn.Linear(emb, emb)

    def forward(self, x):

        b, t, e = x.size()
        h = self.heads
        assert e == self.emb, f'Input embedding dim ({e}) should match layer embedding dim ({self.emb})'

        s = e // h

        # We first compute the k/q/v's on the whole embedding vectors, and then split into the different heads.

        keys    = self.tokeys(x)
        queries = self.toqueries(x)
        values  = self.tovalues(x)

        # Split into the different heads.

        keys    = keys.view(b, t, h, s)
        queries = queries.view(b, t, h, s)
        values  = values.view(b, t, h, s)

        # Compute scaled dot-product self-attention

        # Fold heads into the batch dimension
        # When you call contiguous(), it actually makes a copy of the tensor 
        # such that the order of its elements in memory is the same as if it had been created from scratch with the same data.
        keys = keys.transpose(1, 2).contiguous().view(b * h, t, s)
        queries = queries.transpose(1, 2).contiguous().view(b * h, t, s)
        values = values.transpose(1, 2).contiguous().view(b * h, t, s)

        queries = queries / (s ** (1/4))
        keys    = keys / (s ** (1/4))
        # Instead of dividing the dot products by sqrt(s), where s is the per-head
        # key dimension, we scale the queries and keys by s ** (1/4) each.
        # This should be more memory efficient

        # Get dot product of queries and keys, and scale.

        dot = torch.bmm(queries, keys.transpose(1, 2))

        assert dot.size() == (b * h, t, t)

        dot = F.softmax(dot, dim=2) # Dot now has row-wise self-attention probabilities

        # apply the self attention to the values
        out = torch.bmm(dot, values).view(b, h, t, s)

        # swap h, t back, unify heads
        out = out.transpose(1, 2).contiguous().view(b, t, s * h)

        return self.unifyheads(out)
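The class scales the queries and keys before multiplying them instead of scaling the dot products afterwards. A quick check, on hypothetical random tensors, that dividing both operands by the fourth root of a dimension `d` gives the same result as dividing the dot products by `sqrt(d)`:

```python
import torch

torch.manual_seed(0)
d = 16                                   # hypothetical key dimension
q = torch.rand(2, 5, d)
k = torch.rand(2, 5, d)

# Variant A: scale the dot products by sqrt(d).
dot_a = torch.bmm(q, k.transpose(1, 2)) / (d ** 0.5)

# Variant B: scale queries and keys by d ** (1/4) before multiplying.
dot_b = torch.bmm(q / d ** 0.25, (k / d ** 0.25).transpose(1, 2))

print(torch.allclose(dot_a, dot_b, atol=1e-6))
```

The two variants agree up to floating-point rounding, since (q / d^(1/4)) · (k / d^(1/4)) = (q · k) / d^(1/2).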

## Transformer from Scratch

A Transformer Block is based on self-attention (and Layer Normalization, Residual Connections)

<center>  <img src="https://drive.google.com/uc?export=view&id=1fiJjl6ZfaoWZu_K44-vI6UoN5UV5SwUs" width="350" height="550"> </center> 

A transformer model is based on the **Encoder-Decoder Framework**. In general, the encoding component can be seen as a stack of encoders and the decoding component as a stack of decoders.

### The Encoder
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. 

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

### The Decoder
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.

### Input Embedding
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. In the bottom encoder these are the word embeddings; in the other encoders they are the output of the encoder directly below. The size of this list is a hyperparameter we can set; typically it is the length of the longest sentence in the training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

### Positional Embedding
One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention. This is what the positional embedding does.

The positional embedding can be built in different ways; we'll see how later.
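One of these ways is a learned position embedding: a second embedding table indexed by position and added to the token embeddings. The sketch below assumes small made-up sizes and untrained (randomly initialized) tables; it only shows that the same token gets different vectors at different positions once positions are added:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_len, emb_dim = 100, 10, 8     # hypothetical sizes

token_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_len, emb_dim)  # one learned vector per position

x = torch.tensor([[5, 7, 5]])                 # the same token at positions 0 and 2
tokens = token_embedding(x)                   # (1, 3, emb_dim)
positions = pos_embedding(torch.arange(x.size(1)))  # (3, emb_dim), broadcast over the batch
encoded = tokens + positions

# Without positions the two occurrences of token 5 are identical; with them they differ.
print(torch.equal(tokens[0, 0], tokens[0, 2]))
print(torch.equal(encoded[0, 0], encoded[0, 2]))
```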

### Residuals and Normalization
One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
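The pattern "residual connection, then layer normalization" can be sketched in a few lines. Here a plain `nn.Linear` is used as a stand-in for the sub-layer (self-attention or the feed-forward network), and the sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim = 8                                # hypothetical embedding size
x = torch.rand(2, 5, emb_dim)              # (batch, time, emb)

sublayer = nn.Linear(emb_dim, emb_dim)     # stand-in for self-attention or the feed-forward net
norm = nn.LayerNorm(emb_dim)

# Residual connection around the sub-layer, followed by layer normalization.
out = norm(x + sublayer(x))

# Each position is normalized over its embedding dimension,
# so the per-position mean is approximately zero at initialization.
print(out.shape)
print(out.mean(dim=-1).abs().max().item())
```

The residual path requires the sub-layer to preserve the embedding size, which is why every sub-layer in the block maps `emb` to `emb`.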


### Output Embedding
The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
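A minimal sketch of this masking step, assuming a random 4x4 matrix of raw attention scores in place of real query-key dot products:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
t = 4
dot = torch.rand(1, t, t)                  # raw attention scores (queries x keys)

# True above the diagonal: position i may not look at positions j > i.
future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
masked = dot.masked_fill(future, float('-inf'))

# After the softmax, the masked (future) positions receive zero weight.
att = F.softmax(masked, dim=2)
print(att[0])
```

The resulting attention matrix is lower triangular: the first position can only attend to itself, the second to the first two positions, and so on.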

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

In [17]:
class TransformerBlock(nn.Module):

    def __init__(self, emb, heads, mask, seq_length, ff_hidden_mult=4, dropout=0.0, pos_embedding=None):
        super().__init__()

        self.mhattention = MHSelfAttention(emb, heads=heads)

        self.norm1 = nn.LayerNorm(emb)
        self.norm2 = nn.LayerNorm(emb)

        self.ff = nn.Sequential(
            nn.Linear(emb, ff_hidden_mult * emb),
            nn.ReLU(),
            nn.Linear(ff_hidden_mult * emb, emb)
        )

        self.do = nn.Dropout(dropout)

    def forward(self, x):

        attended = self.mhattention(x)

        x = self.norm1(attended + x) #residual

        x = self.do(x)

        fedforward = self.ff(x)

        x = self.norm2(fedforward + x) #residual

        x = self.do(x)

        return x

Let's build a Transformer (a stack of Transformer blocks) and adapt it for a binary classification task. Its `depth` defines the number of Transformer blocks.

In [18]:
class CTransformer(nn.Module):

    def __init__(self, emb, heads, depth, seq_length, num_tokens, num_classes, max_pool=True, dropout=0.0):
        """
        :param emb: Embedding dimension
        :param heads: nr. of attention heads
        :param depth: Number of transformer blocks
        :param seq_length: Expected maximum sequence length
        :param num_tokens: Number of tokens (usually words) in the vocabulary
        :param num_classes: Number of classes.
        :param max_pool: If true, use global max pooling in the last layer. If false, use global
                         average pooling.
        """
        super().__init__()

        self.num_tokens, self.max_pool = num_tokens, max_pool

        # Token embedding
        self.token_embedding = nn.Embedding(embedding_dim=emb, num_embeddings=num_tokens)
        # Position embedding
        self.pos_embedding = nn.Embedding(embedding_dim=emb, num_embeddings=seq_length)

        tblocks = []
        for i in range(depth):
            tblocks.append(
                TransformerBlock(emb=emb, heads=heads, seq_length=seq_length, mask=False, dropout=dropout))

        self.tblocks = nn.Sequential(*tblocks)

        self.toprobs = nn.Linear(emb, num_classes)

        self.do = nn.Dropout(dropout)

    def forward(self, x):
        """
        :param x: A batch by sequence length integer tensor of token indices.
        :return: A batch by num_classes tensor of log-probabilities over the classes.
        """
        tokens = self.token_embedding(x)
        b, t, e = tokens.size()

        positions = self.pos_embedding(torch.arange(t, device=x.device))[None, :, :].expand(b, t, e)
        x = tokens + positions
        x = self.do(x)

        x = self.tblocks(x)

        x = x.max(dim=1)[0] if self.max_pool else x.mean(dim=1) # pool over the time dimension

        x = self.toprobs(x)

        return F.log_softmax(x, dim=1) # nn.softmax()

In [19]:
torch.arange(21)

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20])
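Before the final linear layer, `CTransformer` collapses the time dimension so that each sequence is represented by a single vector. The two pooling options can be compared on a tiny hand-written tensor (the values are arbitrary):

```python
import torch

x = torch.tensor([[[1.0, 4.0],
                   [3.0, 2.0],
                   [5.0, 0.0]]])           # (batch=1, time=3, emb=2)

max_pooled = x.max(dim=1)[0]               # largest value per feature across time
mean_pooled = x.mean(dim=1)                # average per feature across time

print(max_pooled)    # tensor([[5., 4.]])
print(mean_pooled)   # tensor([[3., 2.]])
```

`x.max(dim=1)` returns a `(values, indices)` pair, which is why the model keeps only element `[0]`.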

### Sentiment Analysis Example with TorchText

In the following, we train a transformer model on a sentiment classification task: given a sequence of words (here, a movie review), the model assigns a probability to it expressing a positive or a negative sentiment.

We'll use TorchText, which is a prebuilt module for managing text data. It provides datasets for different tasks and manipulation functions useful for NLP.

Disclaimer: we rely on the legacy version of torchtext. More recent versions of torchtext expose a slightly different API.

One of the main concepts of TorchText legacy is the `Field`. These define how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either "pos" or "neg".

The parameters of a `Field` specify how the data should be processed. 

We use the `TEXT` field to define how the review should be processed, and the `LABEL` field to process the sentiment. 


In [20]:
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True) # If no tokenize argument is passed, the default is simply splitting the string on spaces.
LABEL = data.Field(sequential=False)

NUM_CLS = 2
BATCH_SIZE = 4
MAX_LENGTH = 256 #512
EMB_SIZE = 128
HEADS = 8
DEPTH = 6 # number of transformer blocks (one self-attention layer each)
VOC_SIZE = 50000

LR_RATE = 0.0001
WARMUP = 10000

In [21]:
tbw = SummaryWriter(log_dir='./logs') # Tensorboard logging

train, test = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:04<00:00, 18.8MB/s]


In [22]:
# View 'text' and 'label'
print(vars(train.examples[0]))

{'text': ['i', 'saw', 'this', 'movie', 'when', 'mystery', 'science', 'theater', 'ran', 'it', 'in', '1993.', 'it', 'is', 'the', 'worst', 'thing', "i've", 'ever', 'seen.', 'so', 'bad', 'in', 'fact,', 'that', 'by', 'sheer', 'freakiness,', 'this', 'movie', 'must', 'get', 'a', 'ten', 'rating', 'because', 'it', 'has', 'to', 'be', 'seen', 'to', 'be', 'believed.', '<br', '/><br', '/>whoever', 'wrote', 'this', 'script', 'with', 'children', 'in', 'mind', 'should', 'be', 'beaten.', 'i', 'mean,', 'really,', 'the', 'devil', 'vs.', 'santa?', 'visions', 'of', 'hell?', 'creepy', 'laughing', 'wind-up', 'reindeer?', 'forced', 'child', 'labor', 'with', 'racial', 'stereotypes?', 'it', "ain't", 'sesame', 'street,', "that's", 'for', 'sure.as', 'crow', 'exclaims', 'during', 'the', 'mst3k', 'showing,', '"this', 'is', 'good', "ol'", 'fashioned', 'nightmare', 'fuel!"', '<br', '/><br', "/>there's", 'plenty', 'of', 'weird', 'innuendo', 'and', 'screwed', 'up', 'theology.', 'merlin', '(presumably', 'the', 'arthuria

In [23]:

TEXT.build_vocab(train, max_size=VOC_SIZE - 2)
LABEL.build_vocab(train)

In [24]:
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=BATCH_SIZE, device=device)

In [25]:
# batch size = 4
for batch in train_iter:

    input = batch.text[0]
    label = batch.label - 1

    print(input)
    print(label)
    break

tensor([[   10,     7,     3,  ...,     1,     1,     1],
        [  148,     3, 29462,  ...,    13,  9048,  1034],
        [   14,     2,   272,  ...,     1,     1,     1],
        [  446,    99,     3,  ...,     1,     1,     1]])
tensor([1, 0, 1, 0])


In [26]:
print(f'- nr. of training batches {len(train_iter)}')
print(f'- nr. of test batches {len(test_iter)}')

- nr. of training batches 6250
- nr. of test batches 6250


In [27]:
# create the model
model = CTransformer(emb=EMB_SIZE, heads=HEADS, depth=DEPTH, seq_length=MAX_LENGTH, num_tokens=VOC_SIZE, num_classes=NUM_CLS, max_pool=True, dropout=0.2)
model.to(device)

CTransformer(
  (token_embedding): Embedding(50000, 128)
  (pos_embedding): Embedding(256, 128)
  (tblocks): Sequential(
    (0): TransformerBlock(
      (mhattention): MHSelfAttention(
        (tokeys): Linear(in_features=128, out_features=128, bias=False)
        (toqueries): Linear(in_features=128, out_features=128, bias=False)
        (tovalues): Linear(in_features=128, out_features=128, bias=False)
        (unifyheads): Linear(in_features=128, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (ff): Sequential(
        (0): Linear(in_features=128, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=128, bias=True)
      )
      (do): Dropout(p=0.2, inplace=False)
    )
    (1): TransformerBlock(
      (mhattention): MHSelfAttention(
        (tokeys): Linear(in_features=128, out_features=128, bias=False)
   

In [28]:
opt = torch.optim.Adam(lr=LR_RATE, params=model.parameters())
sch = torch.optim.lr_scheduler.LambdaLR(opt, lambda i: min(i / (WARMUP / BATCH_SIZE), 1.0))
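The lambda passed to `LambdaLR` returns a multiplier for the base learning rate. The plain-Python sketch below evaluates the same expression at a few steps to show the linear warm-up; the helper name `lr_multiplier` is introduced here only for illustration:

```python
WARMUP = 10000      # same values as in the hyperparameter cell
BATCH_SIZE = 4
LR_RATE = 0.0001

def lr_multiplier(step):
    # Linear ramp from 0 to 1 over WARMUP / BATCH_SIZE optimizer steps, then constant.
    return min(step / (WARMUP / BATCH_SIZE), 1.0)

for step in (0, 1250, 2500, 5000):
    print(step, LR_RATE * lr_multiplier(step))
```

With these values the learning rate reaches `LR_RATE` after 2500 optimizer steps, that is, after 10000 training examples.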

In [29]:
NUM_EPOCHS = 4

# training loop
seen = 0
for e in range(NUM_EPOCHS):

    print(f'\n epoch {e}')
    model.train()

    for batch in tqdm.tqdm(train_iter):

        opt.zero_grad()

        input = batch.text[0]
        label = batch.label - 1

        if input.size(1) > MAX_LENGTH:
            input = input[:, :MAX_LENGTH]
        
        
        out = model(input)
        loss = F.nll_loss(out, label)
        # loss = CrossEntropy(out, label)

        loss.backward()

        # clip gradients
        # Performs gradient clipping. It is used to mitigate the problem of exploding gradients.
        # - If the total gradient vector has a length > 1, we clip it back down to 1.
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        opt.step()
        sch.step()

        seen += input.size(0)
        tbw.add_scalar('classification/train-loss', float(loss.item()), seen)

    with torch.no_grad():

        model.eval()
        tot, cor= 0.0, 0.0

        for batch in test_iter:

            input = batch.text[0]
            label = batch.label - 1

            if input.size(1) > MAX_LENGTH:
                input = input[:, :MAX_LENGTH]
            out = model(input).argmax(dim=1)

            tot += float(input.size(0))
            cor += float((label == out).sum().item())

        acc = cor / tot
        print(f'-- test accuracy {acc:.3}')
        tbw.add_scalar('classification/test-accuracy', acc, e)


 epoch 0


  5%|▍         | 303/6250 [02:17<45:00,  2.20it/s]


KeyboardInterrupt: ignored
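The training loop relies on `nn.utils.clip_grad_norm_` to keep the gradients bounded. Its effect can be observed on a toy model; the sketch below assumes a tiny linear layer and an artificially inflated loss so that the gradient norm is large before clipping:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 1)                      # tiny hypothetical model
loss = (layer(torch.rand(8, 4)) * 100).pow(2).mean()   # inflated loss to get large gradients
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the total norm before clipping.
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), 1.0)

# Recompute the total gradient norm over all parameters after clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(float(norm_before), float(norm_after))
```

All gradients are scaled by the same factor, so the direction of the update is preserved and only its length is limited.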

## Transformer with Torch.nn.Transformer

Instead of building the transformer from scratch, it is also possible to use the
[nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) module.

The PyTorch 1.2 release includes a standard transformer module based on the
paper [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf).
The `nn.Transformer` module relies entirely on an attention
mechanism (implemented as
[nn.MultiheadAttention](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html))
to draw global dependencies between input and output. The `nn.Transformer`
module is highly modularized such that a single component (e.g.,
[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html))
can be easily adapted/composed.

In the following, we train a nn.TransformerEncoder model on a sentiment classification task. 
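Before wiring it into a model, it may help to check the tensor layout that `nn.TransformerEncoder` expects. The sketch below uses small hypothetical sizes and the default (non batch-first) layout, in which the sequence dimension comes first:

```python
import torch
import torch.nn as nn

d_model, nhead, d_hid, nlayers = 16, 4, 32, 2     # hypothetical small sizes

layer = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout=0.0)
encoder = nn.TransformerEncoder(layer, nlayers)
encoder.eval()

# By default the expected layout is (seq_len, batch_size, d_model).
src = torch.rand(10, 3, d_model)
with torch.no_grad():
    out = encoder(src)

print(out.shape)
```

Since the iterators built earlier yield batch-first tensors, the input has to be transposed (or the layers created with a batch-first option, where available) before it reaches the encoder.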

First, we define a ``PositionalEncoding`` module to inject some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies as specified in the paper "Attention is all you need".

In [30]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [31]:
from torch.nn import TransformerEncoder, TransformerEncoderLayer

In [33]:
class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, num_classes: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'

        self.pos_encoder = PositionalEncoding(d_model, dropout)

        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)

        self.encoder = nn.Embedding(ntoken, d_model)
        self.d_model = d_model

        self.decoder = nn.Linear(d_model, num_classes)


        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        """
        Arguments:
            src: Tensor, shape ``[batch_size, seq_len]`` (the iterators are batch-first)

        Returns:
            output Tensor of shape ``[batch_size, num_classes]`` with log-probabilities
        """
        # PositionalEncoding and nn.TransformerEncoder expect [seq_len, batch_size, ...]
        src = src.t()
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src)
        output = self.decoder(output)   # [seq_len, batch_size, num_classes]
        output = output.max(dim=0)[0]   # pool over the sequence dimension

        return F.log_softmax(output, dim=1)


In [34]:
ntokens = VOC_SIZE  # size of vocabulary
emsize = EMB_SIZE  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 6  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = HEADS  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, NUM_CLS, dropout).to(device)

opt = torch.optim.Adam(lr=LR_RATE, params=model.parameters())
sch = torch.optim.lr_scheduler.LambdaLR(opt, lambda i: min(i / (WARMUP / BATCH_SIZE), 1.0))

In [35]:
model

TransformerModel(
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=200, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
        (linear2): Linear(in_features=200, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.2, inplace=False)
        (dropout2): Dropout(p=0.2, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        

In [36]:
NUM_EPOCHS = 4
# training loop
seen = 0
for e in range(NUM_EPOCHS):

    print(f'\n epoch {e}')
    model.train()

    #src_mask = generate_square_subsequent_mask(bptt).to(device)

    for batch in tqdm.tqdm(train_iter):

        opt.zero_grad()

        input = batch.text[0]
        label = batch.label - 1

        if input.size(1) > MAX_LENGTH:
            input = input[:, :MAX_LENGTH]
        
        #seq_len = input.size(0)
        #if seq_len != bptt:  # only on last batch
            #src_mask = src_mask[:seq_len, :seq_len]
        
        out = model(input) #, src_mask)
        loss = F.nll_loss(out, label)
        # loss = CrossEntropy(out, label)

        loss.backward()

        # clip gradients
        # Performs gradient clipping. It is used to mitigate the problem of exploding gradients.
        # - If the total gradient vector has a length > 1, we clip it back down to 1.
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        opt.step()
        sch.step()

        seen += input.size(0)
        tbw.add_scalar('classification/train-loss', float(loss.item()), seen)

    with torch.no_grad():

        model.eval()
        tot, cor= 0.0, 0.0

        for batch in test_iter:

            input = batch.text[0]
            label = batch.label - 1

            if input.size(1) > MAX_LENGTH:
                input = input[:, :MAX_LENGTH]

            #seq_len = input.size(0)
            #if seq_len != bptt:  # only on last batch
                #src_mask = src_mask[:seq_len, :seq_len]

            out = model(input).argmax(dim=1)

            tot += float(input.size(0))
            cor += float((label == out).sum().item())

        acc = cor / tot
        print(f'-- test accuracy {acc:.3}')
        tbw.add_scalar('classification/test-accuracy', acc, e)


 epoch 0


  0%|          | 30/6250 [00:09<34:13,  3.03it/s]


KeyboardInterrupt: ignored
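
To make the gradient-clipping step in the loop above concrete, here is a minimal, self-contained sketch. The toy linear model and the deliberately inflated loss are illustrative assumptions and are not part of the notebook's classifier. The only library call relied on is `nn.utils.clip_grad_norm_`, which rescales the gradients in place and returns the total norm measured before clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model, used only to produce some gradients.
toy_model = nn.Linear(4, 2)
x = torch.randn(8, 4)

# Scale the loss up on purpose so that the gradient norm is well above 1.
loss = (toy_model(x) ** 2).sum() * 100.0
loss.backward()


def total_grad_norm(parameters):
    """L2 norm of all gradients concatenated into a single vector."""
    grads = [p.grad.flatten() for p in parameters if p.grad is not None]
    return torch.cat(grads).norm(2).item()


max_norm = 1.0
# Returns the total norm *before* clipping and rescales the gradients in place.
norm_before = float(nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm))
norm_after = total_grad_norm(toy_model.parameters())

print(f"norm before clipping: {norm_before:.3f}")
print(f"norm after clipping:  {norm_after:.3f}")
```

Clipping changes only the length of the overall gradient vector, not its direction, so the optimizer still steps in the same direction but with a bounded magnitude.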

## Pre-trained Transformer - Contextual Word Embedding

We'll now test some pre-trained transformer-based language models available on Hugging Face.

[Transformers](https://huggingface.co/transformers/) is built by [Hugging Face](https://huggingface.co/), a startup based in Paris and New York whose mission is to democratize NLP. In recent years, the company has contributed strongly to the NLP revolution by building an easy-to-use interface between the latest available models and real-world applications.

The Transformers library provides state-of-the-art, general-purpose transformer-based architectures (such as BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, T5, CTRL...) for Natural Language Processing, Understanding, and Generation, with thousands of pre-trained models in 100+ languages and deep interoperability between PyTorch and TensorFlow 2.0.

Transformers is an opinionated library built for NLP researchers seeking to use, study, and extend large-scale transformer models. The library was designed with two strong goals in mind:

* be as easy and fast to use as possible;
* provide state-of-the-art models with performance as close as possible to the original models.

The aim of this section is to use the Transformers library's pipeline API at the highest possible level. Without any training, we take advantage of pre-trained models to tackle a variety of downstream tasks (e.g., sentence classification, question answering). The idea is to warm up with the most popular NLP problems.

In [37]:
%%capture
!pip install --upgrade -q pip
!pip install -q sentencepiece
!pip install -q transformers 

In [43]:
from transformers import BertTokenizer, BertForMaskedLM, DistilBertForSequenceClassification, DistilBertTokenizer, BertForSequenceClassification
from transformers import pipeline
from transformers import AutoTokenizer, AutoModel, AutoModelForQuestionAnswering, AutoModelWithLMHead

In [50]:
# Pre-trained model: DistilBERT fine-tuned on SST-2 (English-only sentiment classification)
pretrained_model = 'distilbert-base-uncased-finetuned-sst-2-english'

# English only
#pretrained_model = 'bert-base-cased'

In [51]:
# Load the tokenizer that matches the pre-trained checkpoint
tokenizer = DistilBertTokenizer.from_pretrained(pretrained_model)

# Load the model: DistilBERT with a sequence-classification head
model = DistilBertForSequenceClassification.from_pretrained(pretrained_model)

# Set the model in evaluation mode (disables dropout)
model = model.eval()

In [52]:
nlp_sentence_classif = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer) 
# sst = Stanford Sentiment Treebank 
#nlp_sentence_classif = pipeline('sentiment-analysis', model = 'my-sentiment-model', tokenizer = 'distilbert-base-uncased')

print(nlp_sentence_classif('It was a lovelly night!'))
print(nlp_sentence_classif('That film is not at all worth seing'))
print(nlp_sentence_classif('The event was pretty but it could be much better'))
print(nlp_sentence_classif('He was kind last year but now I do not trust him'))
print(nlp_sentence_classif('He is pretty ugly'))
print(nlp_sentence_classif('He is pretty ugly but when I am with him I am really happy'))



[{'label': 'POSITIVE', 'score': 0.9991255402565002}]
[{'label': 'NEGATIVE', 'score': 0.9998056292533875}]
[{'label': 'NEGATIVE', 'score': 0.9092474579811096}]
[{'label': 'NEGATIVE', 'score': 0.9994626641273499}]
[{'label': 'NEGATIVE', 'score': 0.9998031258583069}]
[{'label': 'POSITIVE', 'score': 0.9996961355209351}]
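
The `label` and `score` fields printed above can be read as the most likely class and its probability. As a hedged sketch of what the pipeline presumably does for single-label classification (a softmax over the model's logits, followed by an argmax), here is the same computation on made-up logits. The logit values and the `id2label` mapping are assumptions chosen for illustration; no model is loaded.

```python
import torch
import torch.nn.functional as F

# Made-up logits for one sentence, ordered as [NEGATIVE, POSITIVE] (illustrative values).
logits = torch.tensor([[-3.2, 3.8]])
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label order

# Softmax turns the logits into probabilities that sum to 1 for each sentence.
probs = F.softmax(logits, dim=-1)

# The predicted class is the one with the highest probability.
pred_id = int(probs.argmax(dim=-1))
result = [{"label": id2label[pred_id], "score": float(probs[0, pred_id])}]
print(result)
```

A score close to 1 therefore means only that the model is confident, not that it is right: the mixed-sentiment sentences above still receive very high scores.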


In [54]:
# label_1 = irony
nlp_sentence_irony = pipeline(model='cardiffnlp/twitter-roberta-base-irony', tokenizer='cardiffnlp/twitter-roberta-base-irony') 
print(nlp_sentence_irony('I lost my flight! Wonderful!'))

[{'label': 'irony', 'score': 0.971060037612915}]


In [58]:
print(nlp_sentence_irony('He is pretty ugly but when I am with him I am really happy'))

[{'label': 'non_irony', 'score': 0.749869704246521}]


### BertViz
It is also possible to visualize attention with the BertViz tool.

BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT-2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Hugging Face models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.



In [59]:
nlp_features = pipeline('feature-extraction')
output = nlp_features('Hugging Face is a French company based in Paris')
np.array(output).shape   # (Samples, Tokens, Vector Size)

No model was supplied, defaulted to distilbert-base-cased and revision 935ac13 (https://huggingface.co/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(1, 12, 768)
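
The shape above reads as (samples, tokens, vector size): one contextual vector per token. The token axis counts word pieces and, presumably, the special tokens added by the tokenizer, which would explain why it is larger than the nine words in the sentence. A common way to collapse such an output into a single sentence vector is mean pooling over the token axis. The sketch below uses a random array as a stand-in for the pipeline output, so it runs without downloading a model. `cosine_similarity` is a hypothetical helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for `np.array(output)`: 1 sample, 12 tokens, 768-dimensional vectors.
features = rng.normal(size=(1, 12, 768))

# Mean pooling: average over the token axis to get one vector per sample.
sentence_vectors = features.mean(axis=1)


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(sentence_vectors.shape)
self_similarity = cosine_similarity(sentence_vectors[0], sentence_vectors[0])
print(self_similarity)
```

With real features, the same helper could be used to compare the pooled vectors of two different sentences.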

In [60]:
import sys
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
    sys.path.append('bertviz_repo')

Cloning into 'bertviz_repo'...
remote: Enumerating objects: 1625, done.
remote: Counting objects: 100% (321/321), done.
remote: Compressing objects: 100% (113/113), done.
remote: Total 1625 (delta 226), reused 220 (delta 208), pack-reused 1304
Receiving objects: 100% (1625/1625), 198.36 MiB | 23.78 MiB/s, done.
Resolving deltas: 100% (1068/1068), done.


In [61]:
from bertviz import head_view,  model_view

In [62]:
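def call_html():
  # Inject the require.js configuration that BertViz's JavaScript needs (d3 and jQuery) in Colab
  from IPython.display import HTML, display
  display(HTML('''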
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [63]:
from transformers import BertModel, BertTokenizer

model_version = 'bert-base-uncased'
do_lower_case = True

model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

#sentence_a = "Attention is important, we need to understand it necessarily"
sentence_a = "he is going to take his train to milan quite soon"


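# Tokenize without [CLS]/[SEP] so that the view shows only the sentence tokens
inputs = tokenizer.encode_plus(sentence_a, return_tensors='pt', add_special_tokens=False)
token_type_ids = inputs['token_type_ids']
input_ids = inputs['input_ids']

# With output_attentions=True, the last element of the output is the tuple of per-layer attention tensors
attention = model(input_ids, token_type_ids=token_type_ids)[-1]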

input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()

head_view(attention, tokens)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<IPython.core.display.Javascript object>
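
The `attention` object passed to `head_view` is a tuple with one tensor per layer, each of shape (batch, heads, sequence length, sequence length). To show how such a structure can be inspected without the interactive view, the sketch below builds a random stand-in with the dimensions of `bert-base-uncased` (12 layers, 12 heads) and looks up, for each token, the token it attends to most in one layer and head. The whitespace tokenization and the random weights are assumptions for illustration only; real WordPiece tokens and real attention weights will differ.

```python
import torch

torch.manual_seed(0)

# Assumed tokenization: plain whitespace split (real WordPiece output may differ).
tokens = "he is going to take his train to milan quite soon".split()
num_layers, num_heads, seq_len = 12, 12, len(tokens)

# Random stand-in for the model's attentions: one (batch, heads, seq, seq)
# tensor per layer, with each row normalised by softmax like real attention weights.
attention = tuple(
    torch.softmax(torch.randn(1, num_heads, seq_len, seq_len), dim=-1)
    for _ in range(num_layers)
)

layer, head = 0, 3                    # arbitrary choice for illustration
weights = attention[layer][0, head]   # (seq_len, seq_len): rows are query tokens
row_sums = weights.sum(dim=-1)        # each query's weights sum to 1

# For every query token, the index of the token it attends to most.
strongest = weights.argmax(dim=-1).tolist()
for query, key_idx in zip(tokens, strongest):
    print(f"{query:>6} -> {tokens[key_idx]}")
```

Replacing the stand-in with the `attention` tuple returned by the model would give the same kind of listing for real weights, which is essentially what the lines in the head view encode.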

## LAB CHALLENGE 4: Attention-based Recommender System

In the previous lecture we built a recommender system using the Neural Matrix Factorization framework. This framework allowed us to combine the GMF layers with the MLP layers.

We now want to understand how we can empower our model by using the attention mechanism.

- TASK 1: Starting from the NeuMF model, think of a possible architecture for a recommender system that leverages the attention mechanism (not necessarily a transformer-based architecture) and draw it. Explain how you plan to use attention and why it could enhance your model's performance.

- TASK 2: Write the data structures (Dataset and DataLoader) needed to handle your input data. If necessary, rewrite the preprocessing steps and the evaluation metrics. Always use the same rating dataset, MovieLens 100k.

- TASK 3: Write the attention-based model and train it. After fine-tuning, compare the performance with the best results obtained in the previous challenges.


Note:
- The attention-based recommender system does not have to preserve the structure of the NeuMF model. You can proceed in a completely different way. Of course, if you change the structure substantially, you will need to rewrite the preprocessing step, the Dataset, and the DataLoader.
- On the web you can find many resources that may help you write such a model. However, keep in mind that over-complicating things is not always a good idea.
- You can choose either to use the nn.Transformer module or to build your custom attention from scratch. It is also possible, if you find the right way, to use a pre-trained model from Hugging Face.
