# Lecture 22: 2023-04-18 Transformers

## Lecture Overview

* Contextual Embeddings (BERT)
* Transformer model

## Contextual Embeddings (BERT)

* BERT = Bidirectional Encoder Representations from Transformers
* "BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks ..." [Devlin et al.](https://arxiv.org/abs/1810.04805)
* Trained with a masked language model (MLM) objective
    * [Taylor, 1953](https://gwern.net/doc/psychology/writing/1953-taylor.pdf)
    * Randomly mask a percentage of the input and predict the masked words based on the context
* Trained on a next sentence prediction (NSP) task
    * [Bengio et al., 2003](https://www.aclweb.org/anthology/P03-1003.pdf)
    * Given two sentences, predict if the second sentence is the next sentence in a document
* [BERT](https://github.dev/google-research/bert) code from Google
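
The masking step of the MLM objective can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the `mask_prob` and `seed` parameters are assumptions made for the example. Devlin et al. select 15% of the tokens and do not always replace a selected token with `[MASK]`; the sketch omits that refinement.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at every masked position and None elsewhere,
    so a model would only be scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)  # local generator keeps the example reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# hypothetical input sentence, split on whitespace for simplicity
tokens = "the adoption of evidence based programs in rural primary care".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

During pretraining the model sees `masked` as input and is trained to predict the non-`None` entries of `labels` from the surrounding context on both sides.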


<center><img src="./images/bert.png" height="400" width="1000"></center>

* [Leaderboard](https://gluebenchmark.com/)

### Downstream tasks using BERT Embeddings

* [Diachronic linguistic change, Giulianelli et al.](https://aclanthology.org/2020.acl-main.365)
* [Linguistic style](https://arxiv.org/abs/1905.05621)
* [Vector semantics](https://library.oapen.org/handle/20.500.12657/60191)
* Polysemy

### Addressing Polysemy using BERT Embeddings

In [3]:
import torch
import numpy as np
import pandas as pd
import transformers


from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
           output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
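# Choose compute architecture
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The input tensors below are created on the CPU, so the model is kept on the CPU too;
# switch to model.to(device) only if the inputs are moved to `device` as well
model.to('cpu')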

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [10]:
# Create contextual embeddings

def bert_text_preparation(text, tokenizer):
  """
  Preprocesses text input in a way that BERT can interpret.
  """
  marked_text = "[CLS] " + text + " [SEP]"
  tokenized_text = tokenizer.tokenize(marked_text)
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  segments_ids = [1]*len(indexed_tokens)

  # convert inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensor = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensor

In [11]:
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens, in context of the given sentence.
    """
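    # gradient calculation is disabled
    with torch.no_grad():
      # obtain hidden states
      # note: the second positional argument of BertModel.forward is attention_mask,
      # so the all-ones segments_tensor acts as an attention mask here
      outputs = model(tokens_tensor, segments_tensor)
      # index 2 holds the hidden states (returned because output_hidden_states=True)
      hidden_states = outputs[2]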

    # concatenate the tensors for all layers
    # use "stack" to create new dimension in tensor
    token_embeddings = torch.stack(hidden_states, dim=0)

    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)

    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)

    # initialize a list to store the embeddings
    token_vecs_sum = []

    # "token_embeddings" is a [Y x 13 x 768] tensor
    # (the embedding output plus the 12 encoder layers),
    # where Y is the number of tokens in the sentence

    # loop over tokens in sentence
    for token in token_embeddings:
        # "token" is a [13 x 768] tensor
        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)

    return token_vecs_sum

In [12]:
sentences = ['Advancing mHealth-supported Adoption and Sustainment of an Evidence-based Mental Health Intervention for Youth in a School-based Delivery Setting in Sierra Leone',
             'Refining and Pilot Testing a Decision Support Intervention to Facilitate Adoption of Evidence-Based Programs to Improve Parent and Child Mental Health',
             'Reusable, transparent, and reconfigurable N95-equivalent Respirator Masks: design, fabrication, and trials for enhanced adoption',
             'Understanding the Adoption and Impact of New Risk Assessment Technologies in Prostate Cancer Care',
             'Addressing adoption barriers to patient transportation services',
             'The College Alcohol Intervention Matrix (College AIM): Adoption and Implementation Across College Campuses',
             'Social Networks of Diffusion and Adoption: Investigating the Network Effects on implementation of evidence-based interventions for early intervention providers of children',
             'HPV ECHO: Increasing the adoption of evidence-based communication strategies for HPV vaccination in rural primary care practices',
             'Understanding disparities in the adoption and use of assistive technology by older Hispanics',
             'Adoption and Implementation of an Evidence-based Safe Driving Program for High-Risk Teen Drivers',
             'Motion Sequencing for All: pipelining, distribution and training to enable broad adoption of a next-generation platform for behavioral and neurobehavioral analysis',
             "The Implementation, Adoption, and Sustainability of Ho'ouna Pono",
             "The Challenges and Benefits of Adopting Teens: A Comparative Study",
             "Navigating the Unique Needs of Adolescent Adoption",
             "The Impact of Timing on Adoption Outcomes: Examining Infant and Teen Adoption",
             "Supporting the Transition to Adulthood in Adopted Teens",
             "Exploring the Long-Term Effects of Adopting Teens versus Infants",
             "Adopting Teens: A Systematic Review of the Literature",
             "Addressing the Stereotypes and Realities of Adopting Teens",
             "Comparing the Parenting Experiences of Adopting Infants and Teens"
             ]

In [13]:
from collections import OrderedDict

context_embeddings = []
context_tokens = []

for sentence in sentences:
  tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()

  # loop over tokens in the sentence, skipping [CLS] and [SEP]
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1

    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]

    # get the corresponding embedding
    token_vec = list_token_embeddings[current_index]
    
    # save values
    context_tokens.append(token)
    context_embeddings.append(token_vec)

In [14]:
# Analyze the tokens
context_tokens

['advancing',
 'mh',
 '##eal',
 '##th',
 '-',
 'supported',
 'adoption',
 'and',
 'sustain',
 '##ment',
 'of',
 'an',
 'evidence',
 '-',
 'based',
 'mental',
 'health',
 'intervention',
 'for',
 'youth',
 'in',
 'a',
 'school',
 '-',
 'based',
 'delivery',
 'setting',
 'in',
 'sie',
 '##rr',
 '##a',
 'leon',
 '##e',
 'ref',
 '##ining',
 'and',
 'pilot',
 'testing',
 'a',
 'decision',
 'support',
 'intervention',
 'to',
 'facilitate',
 'adoption',
 'of',
 'evidence',
 '-',
 'based',
 'programs',
 'to',
 'improve',
 'parent',
 'and',
 'child',
 'mental',
 'health',
 're',
 '##usa',
 '##ble',
 ',',
 'transparent',
 ',',
 'and',
 'recon',
 '##fig',
 '##urable',
 'n',
 '##95',
 '-',
 'equivalent',
 'respir',
 '##ator',
 'masks',
 ':',
 'design',
 ',',
 'fabrication',
 ',',
 'and',
 'trials',
 'for',
 'enhanced',
 'adoption',
 'understanding',
 'the',
 'adoption',
 'and',
 'impact',
 'of',
 'new',
 'risk',
 'assessment',
 'technologies',
 'in',
 'prostate',
 'cancer',
 'care',
 'addressing',

In [15]:
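# compute the pairwise cosine similarity between all occurrences of a token
from scipy.spatial.distance import cosine

# embeddings for the word 'adoption'
token = 'adoption'
indices = [i for i, t in enumerate(context_tokens) if t == token]

token_embeddings = [context_embeddings[i] for i in indices]

# compare 'adoption' across different contexts
# note: zip pairs the k-th occurrence of the token with the k-th sentence, which is
# only correct while every sentence contains the token exactly once
list_of_distances = []
for sentence_1, embed1 in zip(sentences, token_embeddings):
  for sentence_2, embed2 in zip(sentences, token_embeddings):
    # scipy's cosine() is a distance, so 1 - cosine() is the cosine similarity
    cos_dist = 1 - cosine(embed1, embed2)
    list_of_distances.append([sentence_1, sentence_2, cos_dist])

# the 'distance' column therefore holds similarities (1.0 = same direction)
distances_df = pd.DataFrame(list_of_distances, columns=['sentence_1', 'sentence_2', 'distance'])
# str.contains is case-sensitive, so only sentences with lowercase 'adoption' are shown
distances_df[distances_df.sentence_1.str.contains('adoption')]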

| | sentence_1 | sentence_2 | distance |
|---|---|---|---|
| 30 | Reusable, transparent, and reconfigurable N95-... | Advancing mHealth-supported Adoption and Susta... | 0.753539 |
| 31 | Reusable, transparent, and reconfigurable N95-... | Refining and Pilot Testing a Decision Support ... | 0.746505 |
| 32 | Reusable, transparent, and reconfigurable N95-... | Reusable, transparent, and reconfigurable N95-... | 1.000000 |
| 33 | Reusable, transparent, and reconfigurable N95-... | Understanding the Adoption and Impact of New R... | 0.736815 |
| 34 | Reusable, transparent, and reconfigurable N95-... | Addressing adoption barriers to patient transp... | 0.766890 |
| ... | ... | ... | ... |
| 160 | Motion Sequencing for All: pipelining, distrib... | Motion Sequencing for All: pipelining, distrib... | 1.000000 |
| 161 | Motion Sequencing for All: pipelining, distrib... | The Implementation, Adoption, and Sustainabili... | 0.906431 |
| 162 | Motion Sequencing for All: pipelining, distrib... | The Challenges and Benefits of Adopting Teens:... | 0.732473 |
| 163 | Motion Sequencing for All: pipelining, distrib... | Navigating the Unique Needs of Adolescent Adop... | 0.820002 |


In [16]:
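# Output the vectors to examine in the embedding projector
import csv
import os

filepath = os.path.join('.', 'data')
os.makedirs(filepath, exist_ok=True)  # create the output directory if it is missing

name = 'metadata.tsv'

# one token per line, in the same order as the embeddings
with open(os.path.join(filepath, name), 'w', encoding='utf-8') as file_metadata:
  for token in context_tokens:
    file_metadata.write(token + '\n')

name = 'embeddings.tsv'

# one tab-separated embedding vector per line
# newline='' stops the csv module from writing blank lines on Windows
with open(os.path.join(filepath, name), 'w', newline='') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for embedding in context_embeddings:
        writer.writerow(embedding.numpy())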

### Principal Component Analysis

* [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) is a dimensionality reduction technique
* PCA is an orthogonal transformation of data
* [PCA Main ideas](https://www.youtube.com/watch?v=HMOI_lkzW08&t=161s)
* [PCA in depth](https://www.youtube.com/watch?v=FgakZw6K1QQ)
* cf. [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition)
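
The link between PCA and the SVD can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the function name `pca` and the random matrix standing in for 20 token embeddings of width 768 are hypothetical, and in practice a library implementation would normally be used.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    # PCA is defined on mean-centered data
    X_centered = X - X.mean(axis=0)
    # rows of Vt are orthonormal principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # variance captured along each direction
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    projected = X_centered @ components.T
    return projected, components, explained_variance[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 768))  # stand-in for 20 contextual embeddings
projected, components, variance = pca(X, n_components=2)
print(projected.shape)  # (20, 2)
```

The two columns of `projected` are the coordinates one would plot to inspect the embeddings in two dimensions.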

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

* [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is a dimensionality reduction technique that represents high-dimensional data in a low-dimensional space while trying to keep neighboring points close together
* Perplexity is a parameter that controls the balance between local and global structure
* [How to use t-SNE effectively](https://distill.pub/2016/misread-tsne/)
* [t-SNE in depth](https://www.youtube.com/watch?v=NEaUSP4YerM)

### UMAP Embedding

* Dimensionality reduction technique that can be used to visualize high dimensional data
* Topological data analysis [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html)
* [How UMAP Works](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)

## Transformers

* 'Attention is all you need' [Vaswani et al.](https://arxiv.org/abs/1706.03762) 2017
* BERT uses the encoder side of the Transformer architecture (a significant breakthrough in AI/NLP)
* Annotated Transformer [Harvard NLP](http://nlp.seas.harvard.edu/annotated-transformer/)

N.B.: The transformer code below is taken from the Annotated Transformer.

### Key ideas

* Pretraining - The model is trained on a very large corpus of text, often for weeks or months
* Fine-tuning - The pretrained model is trained further on data for a specific task (e.g. sentiment analysis)
* Encoder - Models that are good for understanding the input, like sentence classification or named entity recognition
* Decoder - Models that are good for generating output, like text generation or summarization
* Attention layers - Model attends to different relationships in different layers [BERT](https://huggingface.co/exbert/?model=bert-base-uncased&modelKind=bidirectional&sentence=The%20girl%20ran%20to%20a%20local%20pub%20to%20escape%20the%20din%20of%20her%20city.&layer=0&heads=..0,1,2,3,4,5,6,7,8,9,10,11&threshold=0.7&tokenInd=null&tokenSide=null&maskInds=..&hideClsSep=true)

<center><img src="https://lenngro.github.io/assets/images/2020-11-07-Attention-Is-All-You-Need/transformer-model-architecture.png" height="600" width="400"></center>

<center><img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_transformer-encoder-decoder.png" width="800" height="400"></center>

Image source: NLP with Transformers, Tunstall et al. 2022


In [28]:
# Import the necessary libraries
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
# Set to False to skip notebook execution (e.g. for debugging)
RUN_EXAMPLES = True

In [29]:
# dummy functions and data
def is_interactive_notebook():
    return __name__ == "__main__"


def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)


def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{"lr": 0}]
        None

    def step(self):
        None

    def zero_grad(self, set_to_none=False):
        None


class DummyScheduler:
    def step(self):
        None

### Transformer architecture

Encoder and decoder stacks

In [30]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

### Encoder

* The encoder is a stack of $N$ identical layers ($N=6$ in the original paper)
* Converts the input sequence of word embeddings into a sequence of vectors (the `hidden_state`)

<center><img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_encoder-zoom.png" height="600" width="800"></center>

#### Input text

* Input text is tokenized into a sequence of tokens to create token embeddings
* Token embeddings are added to positional embeddings to capture sequence information
* Encoding layers can be called `blocks` or `layers` - similar to Convolutional Neural Networks
* The encoder's output is fed to the decoder
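
Adding positional information to token embeddings can be sketched as follows. The sketch assumes the fixed sinusoidal encodings described by Vaswani et al.; BERT instead learns its position embeddings (the `position_embeddings` table in the model printout earlier). The function name `positional_encoding` and the toy embedding values are hypothetical.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as a seq_len x d_model list of lists.

    Even dimensions use sin(pos / 10000^(i / d_model)) and the following odd
    dimension uses the cosine of the same angle, where i is the even index.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# toy token embeddings: 4 tokens, d_model = 8, all values 0.1
token_embeddings = [[0.1] * 8 for _ in range(4)]
pe = positional_encoding(4, 8)

# the encoder input is the element-wise sum of token and positional embeddings
inputs = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(token_embeddings, pe)]
print(inputs[0])
```

Because every position receives a different vector, two occurrences of the same token at different positions enter the encoder with different inputs.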


In [45]:
def clones(module, N):
    """
    Produce N identical layers.
    Args:
        module (nn.Module): The module to be cloned.
        N (int): The number of clones to create.

    Returns:
        nn.ModuleList: A list of cloned modules.
    """
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

# LayerNorm is defined in the next cell; it only has to exist
# before an Encoder is instantiated, not before this class is defined.

class Encoder(nn.Module):
    """
    Core encoder is a stack of N layers.
    """

    def __init__(self, layer, N):
        """
        Initialize the Encoder class.

        Args:
            layer (nn.Module): The type of layer to be used in the encoder.
            N (int): The number of layers in the encoder.
        """
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)  # Create N identical layers
        self.norm = LayerNorm(layer.size)  # Add a layer normalization at the end of the encoder

    def forward(self, x, mask):
        """
        Pass the input (and mask) through each layer in turn.

        Args:
            x (Tensor): The input tensor to the encoder.
            mask (Tensor): The mask tensor to be applied during the forward pass.

        Returns:
            Tensor: The output tensor after passing through the encoder layers.
        """
        for layer in self.layers:  # Iterate through each layer in the encoder
            x = layer(x, mask)  # Pass the input and mask through the current layer
        return self.norm(x)  # Apply the layer normalization to the output tensor


"We employ a residual connection around each of the two sub-layers, followed by layer normalization." [Vaswani et al.](https://arxiv.org/abs/1706.03762)

In [46]:
class LayerNorm(nn.Module):
    """
    Construct a Layer Normalization module (See citation for details).
    This module applies a normalization technique that helps improve the training process of deep neural networks.
    """

    def __init__(self, features, eps=1e-6):
        """
        Initialize the LayerNorm class.

        Args:
            features (int): The number of features in the input tensor.
            eps (float, optional): A small value to prevent division by zero during normalization. Defaults to 1e-6.
        """
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))  # Scale parameter (a_2), learnable
        self.b_2 = nn.Parameter(torch.zeros(features))  # Shift parameter (b_2), learnable
        self.eps = eps  # Small value to prevent division by zero during normalization

    def forward(self, x):
        """
        Perform the forward pass of the layer normalization.

        Args:
            x (Tensor): The input tensor to be normalized.

        Returns:
            Tensor: The normalized output tensor.
        """
        mean = x.mean(-1, keepdim=True)  # Calculate the mean along the last dimension
        std = x.std(-1, keepdim=True)  # Calculate the standard deviation along the last dimension
        # Normalize the input tensor and apply the learnable scale and shift parameters
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


"To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model}=512$." [Vaswani et al.](https://arxiv.org/abs/1706.03762)

In [47]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer normalization.
    Note for code simplicity the normalization is applied first as opposed to last.
    """

    def __init__(self, size, dropout):
        """
        Initialize the SublayerConnection class.

        Args:
            size (int): The number of features in the input tensor.
            dropout (float): The dropout probability to be applied to the output of the sublayer.
        """
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)  # Apply layer normalization to the input
        self.dropout = nn.Dropout(dropout)  # Apply dropout to the output of the sublayer

    def forward(self, x, sublayer):
        """
        Apply residual connection to any sublayer with the same size.

        Args:
            x (Tensor): The input tensor.
            sublayer (nn.Module): The sublayer to be applied after layer normalization and before the residual connection.

        Returns:
            Tensor: The output tensor after the sublayer and residual connection.
        """
        # Normalize the input tensor, apply the sublayer, apply dropout, and then add the original input tensor (residual connection)
        return x + self.dropout(sublayer(self.norm(x)))

"Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network." [Vaswani et al.](https://arxiv.org/abs/1706.03762)

In [48]:
class EncoderLayer(nn.Module):
    """
    An Encoder layer is made up of self-attention and a feed-forward network.
    This class represents a single layer within the encoder stack.
    """

    def __init__(self, size, self_attn, feed_forward, dropout):
        """
        Initialize the EncoderLayer class.

        Args:
            size (int): The number of features in the input tensor.
            self_attn (nn.Module): The self-attention mechanism to be used in the layer.
            feed_forward (nn.Module): The feed-forward network to be used in the layer.
            dropout (float): The dropout probability to be applied in the SublayerConnection.
        """
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn  # Self-attention mechanism
        self.feed_forward = feed_forward  # Feed-forward network
        # Two SublayerConnection modules for self-attention and feed-forward network respectively
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size  # Number of features in the input tensor

    def forward(self, x, mask):
        """
        Forward pass for the EncoderLayer, following the connections in Figure 1 (left) of the original paper.

        Args:
            x (Tensor): The input tensor.
            mask (Tensor): The mask tensor to be applied during the self-attention mechanism.

        Returns:
            Tensor: The output tensor after passing through self-attention and feed-forward network.
        """
        # Apply the self-attention mechanism within the first SublayerConnection (residual connection)
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # Apply the feed-forward network within the second SublayerConnection (residual connection)
        return self.sublayer[1](x, self.feed_forward)

### Decoder

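* The decoder is a stack of $N$ identical layers ($N=6$ in the original paper)
* Each decoder layer has three sub-layers
    * The first is a masked multi-head self-attention layer, so each generated token depends only on earlier outputs
    * The second is the encoder-decoder attention layer, which attends over the encoder's output
    * The third is a position-wise feed-forward network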

In [49]:
class Decoder(nn.Module):
    """
    Generic N layer decoder with masking.
    This class represents a stack of N identical decoder layers.
    """

    def __init__(self, layer, N):
        """
        Initialize the Decoder class.

        Args:
            layer (nn.Module): The type of layer to be used in the decoder.
            N (int): The number of layers in the decoder.
        """
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)  # Create N identical layers
        self.norm = LayerNorm(layer.size)  # Add a layer normalization at the end of the decoder

    def forward(self, x, memory, src_mask, tgt_mask):
        """
        Forward pass for the Decoder, processing the input tensor (x) using the given masks and memory tensor.

        Args:
            x (Tensor): The input tensor to the decoder.
            memory (Tensor): The memory tensor from the encoder's output.
            src_mask (Tensor): The source mask tensor to be applied during the source-attention mechanism.
            tgt_mask (Tensor): The target mask tensor to be applied during the self-attention mechanism.

        Returns:
            Tensor: The output tensor after passing through the decoder layers.
        """
        for layer in self.layers:  # Iterate through each layer in the decoder
            x = layer(x, memory, src_mask, tgt_mask)  # Pass the input, memory, and masks through the current layer
        return self.norm(x)  # Apply the layer normalization to the output tensor


In [50]:
import torch.nn as nn

class DecoderLayer(nn.Module):
    """
    A Decoder layer is made up of self-attention, source-attention, and a feed-forward network.
    This class represents a single layer within the decoder stack.
    """

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        """
        Initialize the DecoderLayer class.

        Args:
            size (int): The number of features in the input tensor.
            self_attn (nn.Module): The self-attention mechanism to be used in the layer.
            src_attn (nn.Module): The source-attention mechanism to be used in the layer.
            feed_forward (nn.Module): The feed-forward network to be used in the layer.
            dropout (float): The dropout probability to be applied in the SublayerConnection.
        """
        super(DecoderLayer, self).__init__()
        self.size = size  # Number of features in the input tensor
        self.self_attn = self_attn  # Self-attention mechanism
        self.src_attn = src_attn  # Source-attention mechanism
        self.feed_forward = feed_forward  # Feed-forward network
        # Three SublayerConnection modules for self-attention, source-attention, and feed-forward network respectively
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        """
        Forward pass for the DecoderLayer, following the connections in Figure 1 (right) of the original paper.

        Args:
            x (Tensor): The input tensor.
            memory (Tensor): The memory tensor from the encoder's output.
            src_mask (Tensor): The source mask tensor to be applied during the source-attention mechanism.
            tgt_mask (Tensor): The target mask tensor to be applied during the self-attention mechanism.

        Returns:
            Tensor: The output tensor after passing through self-attention, source-attention, and feed-forward network.
        """
        m = memory  # Alias for the memory tensor
        # Apply the self-attention mechanism within the first SublayerConnection (residual connection)
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Apply the source-attention mechanism within the second SublayerConnection (residual connection)
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # Apply the feed-forward network within the third SublayerConnection (residual connection)
        return self.sublayer[2](x, self.feed_forward)


In [51]:
import torch

def subsequent_mask(size):
    """
    Create a mask to prevent attention to future positions in the sequence (subsequent positions).

    Args:
        size (int): The size of the mask tensor, which should match the length of the sequence.

    Returns:
        Tensor: A boolean mask tensor with the shape (1, size, size), where positions that should be masked are False.
    """
    attn_shape = (1, size, size)  # Shape of the mask tensor
    # Create an upper triangular matrix with ones strictly above the diagonal (the future positions)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    # Compare with 0 so that allowed positions (on or below the diagonal) are True
    # and subsequent positions are masked (False)
    return subsequent_mask == 0


In [40]:
# Torch tril

# create a tensor of shape 5, 5
seq_len = torch.zeros(5, 5) # inputs
mask = torch.tril(torch.ones(seq_len.size(-2), seq_len.size(-1)), diagonal=0) # mask
mask


tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

In [41]:
def example_mask():
    LS_data = pd.concat(
        [
            pd.DataFrame(
                {
                    "Subsequent Mask": subsequent_mask(20)[0][x, y].flatten(),
                    "Window": y,
                    "Masking": x,
                }
            )
            for y in range(20)
            for x in range(20)
        ]
    )

    return (
        alt.Chart(LS_data)
        .mark_rect()
        .properties(height=250, width=250)
        .encode(
            alt.X("Window:O"),
            alt.Y("Masking:O"),
            alt.Color("Subsequent Mask:Q", scale=alt.Scale(scheme="viridis")),
        )
        .interactive()
    )


show_example(example_mask)

### Attention

* Attention is a mechanism that allows the model to focus on specific parts of the input sequence
* Attention is a way to compute a weighted average of the input sequence

> The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. (Tunstall et al. 2022, p. 61)

Each updated embedding $x'_i$ is a linear combination of the input embeddings $x_j$, where the coefficients $w_{ji}$ are the attention weights, normalized so that $\sum_j w_{ji} = 1$:

$$x'_i = \sum_{j=1}^n w_{ji} x_j$$
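
A tiny numeric example of this weighted average, in plain Python. The vectors and weights are made up for illustration and the function name `weighted_average` is hypothetical; in a real model the weights come from the attention computation.

```python
def weighted_average(weights, embeddings):
    """Compute x'_i = sum_j w_ji * x_j for a single position i."""
    dim = len(embeddings[0])
    return [sum(w * x[d] for w, x in zip(weights, embeddings)) for d in range(dim)]

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # x_1, x_2, x_3 as toy 2-d vectors
weights = [0.5, 0.25, 0.25]                        # w_1i, w_2i, w_3i, summing to 1

print(weighted_average(weights, embeddings))  # [0.75, 0.5]
```

The result leans toward $x_1$ because it carries the largest weight, which is how context changes the representation of a token.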

#### Scaled dot product attention

1. Project each token embedding into three vectors: $Q$, $K$, and $V$ where $Q$ is the query, $K$ is the key, and $V$ is the value
2. Compute the attention scores $A$ by taking the dot product of $Q$ and $K$. Large dot products indicate similarity; small dot products indicate dissimilarity.
3. Compute the attention weights: multiply the scores by a scaling factor, then pass them through a softmax so that the weights for each token sum to 1
4. Update the token embeddings by taking the weighted sum of the value vectors: $x'_i = \sum_{j=1}^n w_{ji} v_j$
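
The four steps can be sketched in a few lines of PyTorch. This is a minimal single-head illustration, not the implementation used by BERT: the random input `x`, the sizes `seq_len` and `d_model`, and the three `nn.Linear` projections `to_q`, `to_k`, `to_v` are assumptions chosen for the example.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 5, 16               # hypothetical sizes
x = torch.randn(1, seq_len, d_model)   # stand-in token embeddings (batch of 1)

# 1. Project each token embedding into query, key, and value vectors
to_q = nn.Linear(d_model, d_model)
to_k = nn.Linear(d_model, d_model)
to_v = nn.Linear(d_model, d_model)
q, k, v = to_q(x), to_k(x), to_v(x)

# 2. Attention scores: dot product of every query with every key
scores = torch.bmm(q, k.transpose(-2, -1))        # (1, seq_len, seq_len)

# 3. Scale, then softmax so each row of weights sums to 1
weights = F.softmax(scores / math.sqrt(d_model), dim=-1)

# 4. Update each token as a weighted sum of the value vectors
x_new = torch.bmm(weights, v)                      # (1, seq_len, d_model)

print(scores.shape, weights.sum(dim=-1), x_new.shape)
```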

In [54]:
# display attention weights example
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = BertModel.from_pretrained(model_checkpoint)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode='light', layer=0, head=8)


#### Query, key, value

Adapted from information retrieval systems:

* Query - the ingredient a recipe asks for (what you are looking for)
* Key - the labels you scan in your cupboard to find a match
* Value - the ingredient you take once a label matches


#### Attention


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

or graphically:

<center><img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_attention-ops.png" height="200" width="1000"></center>
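
One way to see why the scores are divided by $\sqrt{d_k}$: for vectors with independent, unit-variance components, the dot product has a variance of roughly $d_k$, so unscaled scores grow with the dimension and push the softmax toward one-hot outputs. The sketch below checks this empirically with random vectors; the sample size and dimension are arbitrary choices.

```python
import math

import torch

torch.manual_seed(0)

d_k = 64
n = 10_000
# Random queries and keys with unit-variance components (stand-ins for real projections)
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

dots = (q * k).sum(dim=-1)        # one dot product per (q, k) pair
scaled = dots / math.sqrt(d_k)

print(f"std of raw dot products:    {dots.std().item():.2f}")    # roughly sqrt(64) = 8
print(f"std of scaled dot products: {scaled.std().item():.2f}")  # roughly 1
```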

In [55]:
import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    """
    A class for multi-headed attention mechanism.
    This class computes the multi-headed attention as described in the "Attention is All You Need" paper.
    """

    def __init__(self, h, d_model, dropout=0.1):
        """
        Initialize the MultiHeadedAttention class.

        Args:
            h (int): The number of attention heads.
            d_model (int): The dimension of the input tensor (typically the model's hidden size).
            dropout (float, optional): The dropout probability to be applied to the attention probabilities.
        """
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0  # Ensure that d_model is divisible by the number of heads
        self.d_k = d_model // h  # Calculate the dimension of each head
        self.h = h  # Number of attention heads
        # Create four identical linear layers for query, key, value, and output projections
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None  # Placeholder for attention probabilities
        self.dropout = nn.Dropout(p=dropout)  # Dropout layer

    def forward(self, query, key, value, mask=None):
        """
        Forward pass for the MultiHeadedAttention, implementing the mechanism described in Figure 2 of the original paper.

        Args:
            query (Tensor): The query tensor, typically representing the target sequence.
            key (Tensor): The key tensor, typically representing the source sequence.
            value (Tensor): The value tensor, typically the same as the key tensor.
            mask (Tensor, optional): A boolean mask tensor to be applied to the attention scores.

        Returns:
            Tensor: The output tensor after applying multi-headed attention and linear transformations.
        """
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        # Free memory of temporary variables
        del query
        del key
        del value
        # Return the output tensor after applying the last linear transformation
        return self.linears[-1](x)
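
The `view`/`transpose` calls in steps 1 and 3 are easiest to follow by tracking tensor shapes. The sketch below reproduces just that reshaping on a random tensor, leaving out the linear layers and the attention call; the sizes are hypothetical.

```python
import torch

nbatches, seq_len, h, d_model = 2, 5, 8, 512   # hypothetical sizes
d_k = d_model // h                              # 64 dimensions per head

x = torch.randn(nbatches, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, seq, h, d_k) -> (batch, h, seq, d_k)
heads = x.view(nbatches, -1, h, d_k).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 8, 5, 64])

# Merge ("concat"): (batch, h, seq, d_k) -> (batch, seq, h, d_k) -> (batch, seq, d_model)
merged = heads.transpose(1, 2).contiguous().view(nbatches, -1, h * d_k)
print(merged.shape)  # torch.Size([2, 5, 512])

# Splitting and merging are inverse operations, so the values are unchanged
print(torch.equal(merged, x))  # True
```

Because `transpose` returns a non-contiguous view, `.contiguous()` is needed before the final `view`.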

## Transformer architecture

* nn.Linear() is a linear transformation: $y = xA^T + b$
* nn.Module() is a base class for all neural network modules
* nn.Dropout() applies dropout to the input
* nn.LayerNorm() applies layer normalization to the input
* nn.Embedding() is a lookup table that stores embeddings of a fixed dictionary and size
* nn.GELU() applies the Gaussian error linear unit function
* torch.bmm() performs a batch matrix-matrix product of two batches of matrices (its `input` and `mat2` arguments)
* model.forward() is the forward pass of the model
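
A quick way to get familiar with these building blocks is to pass a small random tensor through them and check the output shapes. The sizes below are arbitrary, and the ordering of the layers is only for illustration, not a layout taken from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden, seq_len = 100, 8, 4       # arbitrary toy sizes

emb = nn.Embedding(vocab_size, hidden)        # lookup table: id -> vector
linear = nn.Linear(hidden, hidden)            # y = x A^T + b
norm = nn.LayerNorm(hidden)                   # normalizes over the last dimension
act = nn.GELU()
drop = nn.Dropout(p=0.1)
drop.eval()                                   # disable dropout so the run is deterministic

ids = torch.randint(0, vocab_size, (1, seq_len))
x = emb(ids)                                  # (1, seq_len, hidden)
y = drop(act(linear(norm(x))))                # shape is preserved

# torch.bmm multiplies batches of matrices: (b, n, m) @ (b, m, p) -> (b, n, p)
scores = torch.bmm(y, y.transpose(1, 2))      # (1, seq_len, seq_len)

print(x.shape, y.shape, scores.shape)
```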


In [56]:
# example in code
import transformers
from transformers import AutoTokenizer


model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = BertModel.from_pretrained(model_checkpoint)
text = "time flies like an arrow"

In [57]:
# convert text to input ids
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [64]:
from torch import nn
from transformers import AutoConfig

# create a token embedding layer (randomly initialized, not the pretrained BERT weights)

config = AutoConfig.from_pretrained(model_checkpoint)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [65]:
# embed the input ids
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

In [66]:
inputs_embeds

tensor([[[-2.1053, -1.1127,  0.6525,  ..., -0.5947,  0.6064, -0.9472],
         [ 2.2600,  0.9035,  1.5905,  ...,  0.2860, -1.1919,  0.4626],
         [ 0.4092, -0.8213, -0.8163,  ...,  0.0517, -1.2940, -0.3726],
         [ 0.7965,  0.8215, -0.7940,  ...,  0.2679,  1.8959,  0.1162],
         [-0.7756,  1.3955,  0.3482,  ..., -0.7280,  0.8181, -1.3286]]],
       grad_fn=<EmbeddingBackward0>)

In [67]:
import torch
from math import sqrt

# compute attention scores
query = key = value = inputs_embeds
dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(-2, -1)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

In [68]:
scores

tensor([[[ 2.9773e+01, -2.3839e+00,  2.9451e-02,  1.7599e-01,  1.2913e+00],
         [-2.3839e+00,  2.7590e+01, -1.5230e+00,  2.5674e-01,  1.3453e+00],
         [ 2.9451e-02, -1.5230e+00,  2.7863e+01, -1.8315e+00, -4.6212e-01],
         [ 1.7599e-01,  2.5674e-01, -1.8315e+00,  2.7208e+01,  1.3140e+00],
         [ 1.2913e+00,  1.3453e+00, -4.6212e-01,  1.3140e+00,  3.1704e+01]]],
       grad_fn=<DivBackward0>)

In [69]:
# compute the softmax

import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

In [70]:
weights

tensor([[[1.0000e+00, 1.0826e-14, 1.2093e-13, 1.4002e-13, 4.2715e-13],
         [9.6039e-14, 1.0000e+00, 2.2715e-13, 1.3466e-12, 3.9994e-12],
         [8.1683e-13, 1.7294e-13, 1.0000e+00, 1.2704e-13, 4.9963e-13],
         [1.8208e-12, 1.9739e-12, 2.4458e-13, 1.0000e+00, 5.6819e-12],
         [6.1954e-14, 6.5386e-14, 1.0729e-14, 6.3373e-14, 1.0000e+00]]],
       grad_fn=<SoftmaxBackward0>)

In [62]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

In [71]:
attn_outputs

tensor([[[-0.8190, -1.2342, -1.2699,  ...,  0.1690, -1.6524,  0.2212],
         [-2.2785,  0.1657, -0.7013,  ...,  1.4173,  0.1884,  2.0105],
         [-0.2505,  1.3569,  0.6797,  ...,  1.2840,  1.1937, -0.4831],
         [ 0.1833,  0.3535,  0.6963,  ..., -0.1256, -0.8033, -0.2110],
         [-0.5138, -0.3113,  0.8768,  ..., -1.1547, -0.3750, -0.3981]]],
       grad_fn=<BmmBackward0>)

In [72]:
# convert to a function

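def scaled_dot_product(q, k, v):
    """Scaled dot-product attention for tensors of shape (batch, seq_len, dim)."""
    dim_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.bmm(q, k.transpose(-2, -1)) / sqrt(dim_k)
    # Normalize so that each query's weights sum to 1
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.bmm(weights, v)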
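
A short usage sketch for the function above. It repeats the definition so the block runs on its own, and it uses a random tensor in place of `inputs_embeds`; the shape `(1, 5, 768)` mirrors the example sentence, but the values are arbitrary.

```python
from math import sqrt

import torch
import torch.nn.functional as F


def scaled_dot_product(q, k, v):
    # Same computation as above, repeated so this block is self-contained
    dim_k = q.size(-1)
    scores = torch.bmm(q, k.transpose(-2, -1)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, v)


torch.manual_seed(0)
x = torch.randn(1, 5, 768)           # stand-in for inputs_embeds

out = scaled_dot_product(x, x, x)
print(out.shape)                      # torch.Size([1, 5, 768])
```

Because query, key, and value are the same unprojected tensor here, each token's score with itself dominates and, for this random tensor, the output stays very close to the input. Learned, separate projections for query, key, and value are what allow a token to attend to other tokens.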