# Transformers

## Introduction to Transformers Overview

* Our character RNN trained
* Introduction to Transformers
* HuggingFace Transformers library
* Transformers for NLP
* Embeddings

## Introduction to Transformers

### Milestones in Transformer Models

* Vaswani, Ashish, et al. Attention Is All You Need. arXiv:1706.03762, arXiv, 5 Dec. 2017. arXiv.org, https://doi.org/10.48550/arXiv.1706.03762.

### Some important models

* June 2018: GPT (OpenAI)
* October 2018: BERT (Google - summaries of sentences)
* February 2019: GPT-2 (OpenAI - not immediately released due to ethical concerns)
* October 2019: DistilBERT (Faster and better memory performance than BERT)
* October 2019: BART and T5 (large pretrained models)
* May 2020: GPT-3 (OpenAI - zero-shot learning)


### Key ideas

* Pretraining - The model is trained on a very large corpus of text, often for weeks or months
* Fine-tuning - The pretrained model is trained further on data for a specific task (e.g. sentiment analysis)
* Encoder - Models that are good for understanding the input, like sentence classification or named entity recognition
* Decoder - Models that are good for generating output, like text generation or summarization
* Attention layers - Model attends to different relationships in different layers [BERT](https://huggingface.co/exbert/?model=bert-base-uncased&modelKind=bidirectional&sentence=The%20girl%20ran%20to%20a%20local%20pub%20to%20escape%20the%20din%20of%20her%20city.&layer=0&heads=..0,1,2,3,4,5,6,7,8,9,10,11&threshold=0.7&tokenInd=null&tokenSide=null&maskInds=..&hideClsSep=true)

[The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)


## HuggingFace Transformers library

* [HuggingFace Transformers](https://huggingface.co/transformers/)
* [Natural Language Processing Course](https://huggingface.co/course/chapter1/1)

<center><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="200" width="200"></center>

### Docs and Tutorials

* [Docs](https://huggingface.co/transformers/)
* [Tutorials](https://huggingface.co/docs/transformers/index)

### Installation

* `pip install transformers`
* `pip install datasets`

### Datasets

* [Datasets](https://huggingface.co/datasets/)
  * Multimodal
  * Computer Vision
  * NLP
  * Audio
  * Tabular

* NLP Datasets for various tasks
  * Text Classification
  * Token Classification
  * Table Question Answering
  * Question Answering
  * Zero-Shot Classification
  * Translation
  * Summarization
  * Conversational
  * Text Generation
  * Text2Text Generation
  * Fill Mask
  * Sentence similarity
  * Table to text
  * Multi-choice
  * Text retrieval


In [None]:
# HuggingFace Datasets https://github.com/huggingface/datasets
# !pip install datasets

from huggingface_hub import list_datasets
# from datasets import list_datasets
from datasets import load_dataset

from tqdm.autonotebook import tqdm as notebook_tqdm

all_ds = list(list_datasets())
print(f'There are {len(all_ds)} datasets available on the HuggingFace Hub')
print(f'The first 10 are: {all_ds[:10]}')

## Transformers Pipeline

In [None]:
# sentiment analysis
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier('We are very happy to show you the 🤗 Transformers library.')

In [None]:
# zero-shot classification
from transformers import pipeline

classifier = pipeline('zero-shot-classification')
classifier('We are very happy to show you the 🤗 Transformers library.', candidate_labels=['politics', 'business', 'sports', 'technology'])

In [None]:
# text generation
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
generator('Frodo and Sam were walking through the Shire when')

In [None]:
# named entity recognition

from transformers import pipeline

# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', aggregation_strategy='simple')
ner('Mary graduates this spring from William and Mary. She will continue to study Natural Language Processing at MIT.')

In [None]:
# question answering

from transformers import pipeline

question_answerer = pipeline('question-answering')

question_answerer(
    question='Where does Mary study?',
    context='Mary graduates this spring from William and Mary. She will continue to study Natural Language Processing at MIT.'
)

In [None]:
# summarization

summarizer = pipeline('summarization', max_length=48, min_length=30, do_sample=False, model='t5-base')
summarizer(
    """
    The Transformers library provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...)
    for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and
    deep interoperability between TensorFlow 2.0 and PyTorch.
    """)


### The importance of training data

While the pipelines above are easy to use, the real power of Transformers is that they can be fine-tuned on a wide variety of tasks with just a few lines of code. This works because the models are first pretrained on a large corpus (BERT, for example, was pretrained on roughly 3.3 billion words) and then fine-tuned on a specific task. The quality and domain of that training data therefore shape what the model can do.

In [None]:
# model bias - GPT-2 was trained on WebText (web pages linked from Reddit), not on biomedical literature,
# so we will get really poor results in specialized domains
generator('SARS-CoV-2, the causative agent of COVID-19, employs its spike glycoprotein', num_return_sequences=5, max_length=100)

## Ethical considerations & Subject Matter Experts

The above is a good example of how large language models can 'ramble'.

## Embeddings

### Word2Vec, GloVe, FastText Embeddings (Shortcomings)

The above embeddings are static and do not change based on the context of the sentence. For example, the word 'bank' has different meanings in the following sentences:

* I went to the bank to deposit my money.
* I sat on the bank of the river and watched the water flow by.
* The river bank was full of dead fish.
* The bank was robbed yesterday.
* I banked on it.
* I banked the plane to the left.
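
To make the shortcoming concrete, here is a minimal sketch of a static lookup table. The three-dimensional vectors and the helper `embed` are made up for illustration (they are not real Word2Vec, GloVe, or FastText values); the point is only that a static table returns the same vector for 'bank' whatever the sentence.

```python
import numpy as np

# Hypothetical toy embedding table; the values are invented for illustration
static_embeddings = {
    'bank': np.array([0.2, 0.7, 0.1]),
    'river': np.array([0.9, 0.1, 0.3]),
    'money': np.array([0.1, 0.8, 0.6]),
}

def embed(sentence, table):
    """Look up a vector for every word of the sentence that is in the table."""
    words = sentence.lower().replace('.', '').split()
    return {w: table[w] for w in words if w in table}

finance = embed('I went to the bank to deposit my money.', static_embeddings)
nature = embed('I sat on the bank of the river and watched the water flow by.', static_embeddings)

# The vector for 'bank' is identical in both sentences
same_vector = np.array_equal(finance['bank'], nature['bank'])
print(same_vector)
```

A contextual model would instead produce a different vector for each occurrence of 'bank'.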

### Contextualized Embeddings

* [ELMo](https://allennlp.org/elmo)
* [ULMFiT](https://arxiv.org/abs/1801.06146)
* [OpenAI GPT](https://openai.com/blog/language-unsupervised/)
* [BERT](https://arxiv.org/abs/1810.04805)

* BERT = Bidirectional Encoder Representations from Transformers
* "BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks ..." [Devlin et al.](https://arxiv.org/abs/1810.04805)
* Trained with a masked language modeling (MLM) objective
    * [Taylor, 1953](https://gwern.net/doc/psychology/writing/1953-taylor.pdf)
    * Randomly mask a percentage of the input and predict the masked words based on the context
* Trained on a next sentence prediction (NSP) task
    * [Bengio et al., 2003](https://www.aclweb.org/anthology/P03-1003.pdf)
    * Given two sentences, predict if the second sentence is the next sentence in a document
* [BERT](https://github.dev/google-research/bert) code from Google
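
The masking idea can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing: the function `mask_tokens` and its parameters are hypothetical, it works on whitespace tokens rather than WordPiece tokens, and it always substitutes `[MASK]` (the BERT paper selects 15% of tokens and replaces 80% of those with `[MASK]`, 10% with a random token, and leaves 10% unchanged).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token='[MASK]', seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the masked sequence and a dict {position: original token}
    holding the targets a model would be trained to predict.
    """
    rng = random.Random(seed)
    # mask at least one token so short sentences still yield a target
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets

tokens = 'the river bank was full of dead fish'.split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```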


<center><img src="../images/bert.png" height="400" width="1000"></center>

* [Leaderboard](https://gluebenchmark.com/)

### Downstream Tasks

* [Diachronic linguistic change, Giulianelli et al.](https://aclanthology.org/2020.acl-main.365)
* [Linguistic style](https://arxiv.org/abs/1905.05621)
* [Vector semantics](https://library.oapen.org/handle/20.500.12657/60191)
* Polysemy

## Examining Contextual Embeddings

In [None]:
import torch
import numpy as np
import pandas as pd


from transformers import BertModel, BertTokenizer
# from transformers import BloomModel, AutoTokenizer

model = BertModel.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
           output_hidden_states = True)
tokenizer = BertTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

In [None]:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to('cpu')

In [None]:
# Create contextual embeddings

def bert_text_preparation(text, tokenizer):
  """
  Preprocesses text input in a way that BERT can interpret.
  """
  marked_text = "[CLS] " + text + " [SEP]"
  tokenized_text = tokenizer.tokenize(marked_text)
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  segments_ids = [1]*len(indexed_tokens)

  # convert inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensor = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensor

In [None]:
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens, in context of the given sentence.
    """
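    # gradient calculation is disabled
    with torch.no_grad():
      # obtain hidden states; segments_tensor is all ones, so it is passed
      # explicitly as the attention mask (BertModel's second positional argument)
      outputs = model(input_ids=tokens_tensor, attention_mask=segments_tensor)
      # hidden states are returned because the model was loaded with output_hidden_states=True
      hidden_states = outputs.hidden_states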

    # concatenate the tensors for all layers
    # use "stack" to create new dimension in tensor
    token_embeddings = torch.stack(hidden_states, dim=0)

    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)

    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)

    # initialize list to store embeddings
    token_vecs_sum = []

    # "token_embeddings" is a [Y x 13 x 768] tensor for a BERT-base model
    # (the embedding layer output plus 12 encoder layers),
    # where Y is the number of tokens in the sentence

    # loop over tokens in sentence
    for token in token_embeddings:

        # "token" is a [13 x 768] tensor

        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)

    return token_vecs_sum

In [None]:
sentences = ['Advancing mHealth-supported Adoption and Sustainment of an Evidence-based Mental Health Intervention for Youth in a School-based Delivery Setting in Sierra Leone',
             'Refining and Pilot Testing a Decision Support Intervention to Facilitate Adoption of Evidence-Based Programs to Improve Parent and Child Mental Health',
             'Reusable, transparent, and reconfigurable N95-equivalent Respirator Masks: design, fabrication, and trials for enhanced adoption',
             'Understanding the Adoption and Impact of New Risk Assessment Technologies in Prostate Cancer Care',
             'Addressing adoption barriers to patient transportation services',
             'The College Alcohol Intervention Matrix (College AIM): Adoption and Implementation Across College Campuses',
             'Social Networks of Diffusion and Adoption: Investigating the Network Effects on implementation of evidence-based interventions for early intervention providers of children',
             'HPV ECHO: Increasing the adoption of evidence-based communication strategies for HPV vaccination in rural primary care practices',
             'Understanding disparities in the adoption and use of assistive technology by older Hispanics',
             'Adoption and Implementation of an Evidence-based Safe Driving Program for High-Risk Teen Drivers',
             'Motion Sequencing for All: pipelining, distribution and training to enable broad adoption of a next-generation platform for behavioral and neurobehavioral analysis',
             "The Implementation, Adoption, and Sustainability of Ho'ouna Pono",
             "The Challenges and Benefits of Adopting Teens: A Comparative Study",
             "Navigating the Unique Needs of Adolescent Adoption",
             "The Impact of Timing on Adoption Outcomes: Examining Infant and Teen Adoption",
             "Supporting the Transition to Adulthood in Adopted Teens",
             "Exploring the Long-Term Effects of Adopting Teens versus Infants",
             "Adopting Teens: A Systematic Review of the Literature",
             "Addressing the Stereotypes and Realities of Adopting Teens",
             "Comparing the Parenting Experiences of Adopting Infants and Teens"
             ]


In [None]:
from collections import OrderedDict

context_embeddings = []
context_tokens = []

for sentence in sentences:
  tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()

  # loop over tokens in the sentence (skipping [CLS] and [SEP])
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1

    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]

    # get the corresponding embedding
    token_vec = list_token_embeddings[current_index]
    
    # save values
    context_tokens.append(token)
    context_embeddings.append(token_vec)

In [None]:
context_tokens

In [None]:
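from scipy.spatial.distance import cosine

# embeddings for the word 'adoption'
token = 'adoption'
indices = [i for i, t in enumerate(context_tokens) if t == token]

token_embeddings = [context_embeddings[i] for i in indices]

# sentence for each occurrence of the token, in the same order as context_tokens
# (zipping with `sentences` directly misaligns when a sentence lacks the token or repeats it)
token_sentences = [s for s in sentences for t in tokenizer.tokenize(s) if t == token]

# compare 'adoption' across different contexts
list_of_similarities = []
for sentence_1, embed1 in zip(token_sentences, token_embeddings):
  for sentence_2, embed2 in zip(token_sentences, token_embeddings):
    # scipy's cosine() returns a distance, so 1 - distance is the cosine similarity
    cos_sim = 1 - cosine(embed1, embed2)
    list_of_similarities.append([sentence_1, sentence_2, cos_sim])

similarities_df = pd.DataFrame(list_of_similarities, columns=['sentence_1', 'sentence_2', 'similarity'])
similarities_df[similarities_df.sentence_1.str.contains('adoption', case=False)]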

In [None]:
import os

# filepath = os.path.join('gdrive/My Drive/projections/')
filepath = os.path.join('.')
name = 'metadata.tsv'

with open(os.path.join(filepath, name), 'w+') as file_metadata:
  for i, token in enumerate(context_tokens):
    file_metadata.write(token + '\n')
    
import csv

name = 'embeddings.tsv'

with open(os.path.join(filepath, name), 'w+') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for embedding in context_embeddings:
        writer.writerow(embedding.numpy())

## The BERT Model

In [None]:
# Choose compute architecture
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to('cpu')

## Visualizing Embeddings with PCA, t-SNE, and/or UMAP

### Principal Component Analysis

* [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) is a dimensionality reduction technique
* PCA is an orthogonal transformation of data
* [PCA Main ideas](https://www.youtube.com/watch?v=HMOI_lkzW08&t=161s)
* [PCA in depth](https://www.youtube.com/watch?v=FgakZw6K1QQ)
* cf. [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition)
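
The link between PCA and SVD can be sketched with NumPy alone. The function name `pca_project` and the random matrix are assumptions for illustration; in this notebook the input could instead be something like `torch.stack(context_embeddings).numpy()`.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components using SVD."""
    # center each feature (column) at zero
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # fraction of the total variance captured by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 768))   # stand-in for 50 token embeddings of size 768
projected, explained = pca_project(X)
print(projected.shape)
```

The two projected columns are uncorrelated, which is the orthogonality property mentioned above.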

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

* [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is a dimensionality reduction technique to accurately represent high dimensional data in a low dimensional space
* Perplexity is a parameter that controls the balance between local and global structure
* [How to use t-SNE effectively](https://distill.pub/2016/misread-tsne/)
* [t-SNE in depth](https://www.youtube.com/watch?v=NEaUSP4YerM)

### UMAP Embedding

* Dimensionality reduction technique that can be used to visualize high dimensional data
* Topological data analysis [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html)
* [How UMAP Works](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)

## Transformers in Summary

* 'Attention is all you need' [Vaswani et al.](https://arxiv.org/abs/1706.03762) 2017
* BERT used a Transformer architecture (Significant breakthrough in AI/NLP)
* Annotated Transformer [Harvard NLP](http://nlp.seas.harvard.edu/annotated-transformer/)

N.B.: The below transformer code is taken from the Annotated Transformer paper.

### Key ideas

* Pretraining - The model is trained on a very large corpus of text, often for weeks or months
* Fine-tuning - The pretrained model is trained further on data for a specific task (e.g. sentiment analysis)
* Encoder - Models that are good for understanding the input, like sentence classification or named entity recognition
* Decoder - Models that are good for generating output, like text generation or summarization
* Attention layers - Model attends to different relationships in different layers [BERT](https://huggingface.co/exbert/?model=bert-base-uncased&modelKind=bidirectional&sentence=The%20girl%20ran%20to%20a%20local%20pub%20to%20escape%20the%20din%20of%20her%20city.&layer=0&heads=..0,1,2,3,4,5,6,7,8,9,10,11&threshold=0.7&tokenInd=null&tokenSide=null&maskInds=..&hideClsSep=true)

<center><img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" height="600" width="400"></center>

<center><img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_transformer-encoder-decoder.png" width="800" height="400"></center>

Image source: NLP with Transformers, Tunstall et al. 2022


### Encoder

* The encoder is a stack of $N$ identical layers ($N = 6$ in the original Transformer of Vaswani et al.)
* Converts the input sequence of word embeddings into a sequence of vectors (the `hidden_state`)

<center><img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_encoder-zoom.png" height="600" width="800"></center>

#### Input text

* Input text is tokenized into a sequence of tokens to create token embeddings
* Token embeddings are added to positional embeddings to capture sequence information
* Encoding layers can be called `blocks` or `layers` - similar to Convolutional Neural Networks
* The encoder's output is fed to the decoder
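
As a rough sketch of how token and positional information are combined, the snippet below adds fixed sinusoidal positional encodings (the scheme described in Vaswani et al.) to token embeddings. The toy vocabulary, the random embedding table, and the function name `sinusoidal_positions` are assumptions for illustration; models such as BERT learn their positional embeddings instead.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings, shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    encodings = np.zeros((seq_len, d_model))
    encodings[:, 0::2] = np.sin(angles)                # even dimensions
    encodings[:, 1::2] = np.cos(angles)                # odd dimensions
    return encodings

# Hypothetical toy vocabulary and randomly initialized embedding table
vocab = {'[CLS]': 0, 'the': 1, 'bank': 2, 'was': 3, 'robbed': 4, '[SEP]': 5}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ['[CLS]', 'the', 'bank', 'was', 'robbed', '[SEP]']
token_ids = [vocab[t] for t in tokens]
token_embeddings = embedding_table[token_ids]          # lookup: (6, 8)
encoder_input = token_embeddings + sinusoidal_positions(len(tokens), d_model)
print(encoder_input.shape)
```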

### Attention

* Attention is a mechanism that allows the model to focus on specific parts of the input sequence
* Attention is a way to compute a weighted average of the input sequence

> The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Tunstall et al. 2022:61

Let $x'_i$ be a linear combination of the $x_j$'s, with coefficients $w_{ji}$ (the attention weights) normalized so that $\sum_{j} w_{ji} = 1$:

$$x'_i = \sum_{j=1}^n w_{ji} x_j$$
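
The weighted average can be sketched directly in NumPy. In this simplified illustration the weights come from dot-product similarity between the raw embeddings, with no learned projections; the random embeddings and variable names are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 tokens, 3-dimensional embeddings

# w[j, i] is the weight of token j in the new representation of token i;
# here it comes from dot-product similarity, normalized over j
scores = x @ x.T                   # scores[j, i] = x_j . x_i
w = softmax(scores, axis=0)        # each column sums to 1

# x'_i = sum_j w[j, i] * x_j
x_new = w.T @ x
print(w.sum(axis=0))
```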

#### Scaled dot product attention

1. Project each token embedding into three vectors: $Q$, $K$, and $V$ where $Q$ is the query, $K$ is the key, and $V$ is the value
2. Compute the attention scores $A$ by taking the dot product of $Q$ and $K$. Large dot products indicate similarity and small dot products indicate dissimilarity.
3. Compute the attention weights: the scores are first multiplied by a scaling factor and then passed through a softmax function to ensure that the weights sum to 1
4. Update the token embeddings by multiplying the attention weights by the values $V$: $x'_i = \sum_{j=1}^n w_{ji} v_j$
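
The four steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: random matrices stand in for the learned projection weights, and the sizes and the function name `scaled_dot_product_attention` are chosen for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # step 2 and the scaling part of step 3
    weights = softmax(scores, axis=-1)    # step 3: each row sums to 1
    return weights @ V, weights           # step 4: weighted sum of the values

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))  # token embeddings

# step 1: project the embeddings (random matrices stand in for learned weights)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Here `weights[i, j]` is the weight that token `i` places on token `j`, which corresponds to $w_{ji}$ in the notation above.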

#### Query, key, value

Adapted from information retrieval systems:

* Query - think of recipe and ingredients
* Key - think of scanning your cupboard for ingredients
* Value - think of the ingredients you find


#### Attention


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

or graphically:

<center><img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter03_attention-ops.png" height="200" width="1000"></center>

## Transformer architecture

* nn.Linear() is a linear transformation: $y = xA^T + b$
* nn.Module() is a base class for all neural network modules
* nn.Dropout() applies dropout to the input
* nn.LayerNorm() applies layer normalization to the input
* nn.Embedding() is a lookup table that stores embeddings of a fixed dictionary and size
* nn.GELU() applies the Gaussian error linear unit function
* torch.bmm() performs a batch matrix-matrix product of matrices stored in input and mat2
* model.forward() is the forward pass of the model
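
A small sketch shows how these building blocks fit together. The class `FeedForward`, its sizes, and the toy vocabulary are assumptions made for illustration and are not taken from the Annotated Transformer code.

```python
import torch
from torch import nn

class FeedForward(nn.Module):
    """Position-wise feed-forward block with a residual connection."""

    def __init__(self, hidden_size=16, intermediate_size=64, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.linear_1 = nn.Linear(hidden_size, intermediate_size)
        self.gelu = nn.GELU()
        self.linear_2 = nn.Linear(intermediate_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # normalize, expand, apply the non-linearity, project back, then drop out
        h = self.layer_norm(x)
        h = self.linear_2(self.gelu(self.linear_1(h)))
        return x + self.dropout(h)

embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)  # toy vocabulary of 100 ids
block = FeedForward()
block.eval()  # disable dropout for a deterministic forward pass

token_ids = torch.tensor([[1, 5, 42, 7]])        # (batch=1, seq_len=4)
hidden = embedding(token_ids)                    # (1, 4, 16)
output = block(hidden)                           # calling the module runs block.forward(hidden)

# batched matrix product (bmm lives in the torch namespace): token-by-token dot products
scores = torch.bmm(output, output.transpose(1, 2))   # (1, 4, 4)
print(output.shape, scores.shape)
```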


# Conclusion

## Encoder

The encoder in the Transformer architecture processes the input sequence. It consists of a stack of identical layers, each containing two main subcomponents: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The multi-head attention mechanism allows the model to focus on different positions of the input sequence simultaneously, effectively capturing various aspects of the contextual relationship. The position-wise feed-forward networks apply linear transformations to each position separately and identically.

## Decoder

The decoder also consists of a stack of identical layers, but each layer has an additional sub-layer that performs multi-head attention over the output of the encoder stack. This structure helps the decoder focus on the relevant parts of the input sequence, which improves its ability to generate coherent and contextually relevant outputs. The self-attention sub-layer in the decoder is masked so that the prediction for a given position can depend only on the known outputs at earlier positions, which maintains the auto-regressive property.
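
Masking in the decoder can be sketched by setting the scores of future positions to negative infinity before the softmax. This is a minimal NumPy illustration; the function name `causal_self_attention` and the random inputs are assumptions. The mask lets position `i` attend to positions up to and including `i`; because the decoder inputs are shifted by one position, the prediction then depends only on earlier outputs.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i attends only to positions <= i."""
    n, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks future positions that must be hidden
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)        # masked positions get weight 0
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
output, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))
```

The printed weight matrix is lower triangular: each row has non-zero entries only for the current and earlier positions.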

## Attention Mechanism

The attention mechanism, particularly scaled dot-product attention, is a key innovation in the Transformer model. It computes attention scores by taking the dot product of the query with all keys, dividing each by the square root of the key dimension, and applying a softmax function to obtain the weights on the values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, enhancing its ability to capture complex dependencies.
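
Multi-head attention can be sketched by splitting the projected vectors into several smaller heads, running scaled dot-product attention in each head, and concatenating the results. This is a minimal NumPy illustration; the function name `multi_head_attention`, the sizes, and the random matrices standing in for learned weights are all assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, n_tokens, d_head)
    # concatenate the heads and mix them with the output projection
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(output.shape, weights.shape)
```

Each head has its own attention weight matrix, so different heads can attend to different relationships in the same sequence.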

## Sources

1. **Original Paper**: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).