# Transformers

## Introduction to Transformers Overview

* Our character RNN trained
* Introduction to Transformers
* HuggingFace Transformers library
* Transformers for NLP
* Embeddings

## Introduction to Transformers

### Milestones in Transformer Models

* Vaswani, Ashish, et al. Attention Is All You Need. arXiv:1706.03762, arXiv, 5 Dec. 2017. arXiv.org, https://doi.org/10.48550/arXiv.1706.03762.

### Some important models

* June 2018: GPT (OpenAI)
* October 2018: BERT (Google - summaries of sentences)
* February 2019: GPT-2 (OpenAI - not immediately released due to ethical concerns)
* October 2019: DistilBERT (a distilled version of BERT that is faster and uses less memory)
* October 2019: BART and T5 (large pretrained models)
* May 2020: GPT-3 (OpenAI - zero-shot learning)


### Key ideas

* Pretraining - The model is trained on a very large corpus of text, which can take weeks or months
* Fine-tuning - The pretrained model is trained further on data for a specific task (e.g. sentiment analysis)
* Encoder - Models that are good for understanding the input, like sentence classification or named entity recognition
* Decoder - Models that are good for generating output, like text generation or summarization
* Attention layers - Model attends to different relationships in different layers [BERT](https://huggingface.co/exbert/?model=bert-base-uncased&modelKind=bidirectional&sentence=The%20girl%20ran%20to%20a%20local%20pub%20to%20escape%20the%20din%20of%20her%20city.&layer=0&heads=..0,1,2,3,4,5,6,7,8,9,10,11&threshold=0.7&tokenInd=null&tokenSide=null&maskInds=..&hideClsSep=true)

[The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)
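
The attention layers listed above can be illustrated with a minimal sketch of scaled dot-product attention, the core operation described in Attention Is All You Need. This is a toy version in plain Python: the three two-dimensional "token" vectors are made up for illustration, and a real Transformer would first pass the inputs through learned query, key, and value projections and use several attention heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # the output is a weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# hypothetical toy vectors; self-attention uses the same vectors as Q, K and V
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(weights[0])  # how strongly the first token attends to each token
```

Each row of `weights` sums to 1, so every output vector is a mixture of the value vectors, weighted by how relevant each token is to the query.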


## HuggingFace Transformers library

* [HuggingFace Transformers](https://huggingface.co/transformers/)
* [Natural Language Processing Course](https://huggingface.co/course/chapter1/1)

<center><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="200" width="200"></center>

### Docs and Tutorials

* [Docs](https://huggingface.co/transformers/)
* [Tutorials](https://huggingface.co/docs/transformers/index)

### Installation

* `pip install transformers`
* `pip install datasets`

### Datasets

* [Datasets](https://huggingface.co/datasets/)
  * Multimodal
  * Computer Vision
  * NLP
  * Audio
  * Tabular

* NLP Datasets for various tasks
  * Text Classification
  * Token Classification
  * Table Question Answering
  * Question Answering
  * Zero-Shot Classification
  * Translation
  * Summarization
  * Conversational
  * Text Generation
  * Text2Text Generation
  * Fill Mask
  * Sentence similarity
  * Table to text
  * Multi-choice
  * Text retrieval


In [7]:
# HuggingFace Datasets https://github.com/huggingface/datasets
# !pip install datasets

from huggingface_hub import list_datasets
# from datasets import list_datasets
from datasets import load_dataset

from tqdm.autonotebook import tqdm as notebook_tqdm

all_ds = list(list_datasets())
print(f'There are {len(all_ds)} datasets available on the HuggingFace Hub')
print(f'The first 10 are: {all_ds[:10]}')

There are 74733 datasets available on the HuggingFace Hub
The first 10 are: [DatasetInfo: { 
  {'_id': '621ffdd236468d709f181d58',
   'author': None,
   'cardData': None,
   'citation': '@inproceedings{veyseh-et-al-2020-what,\n'
               '   title={{What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and '
               'Disambiguation}},\n'
               '   author={Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen},\n'
               '   year={2020},\n'
               '   booktitle={Proceedings of COLING},\n'
               '   link={https://arxiv.org/pdf/2010.14678v1.pdf}\n'
               '}',
   'description': 'Acronym identification training and development sets for the acronym identification task at '
                  'SDU@AAAI-21.',
   'disabled': False,
   'downloads': 2938,
   'gated': False,
   'id': 'acronym_identification',
   'lastModified': '2023-01-25T14:18:28.000Z',
   'likes': 17,
   'paperswit

## Transformers Pipeline

In [8]:
# sentiment analysis
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier('We are very happy to show you the 🤗 Transformers library.')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [9]:
# zero-shot classification
from transformers import pipeline

classifier = pipeline('zero-shot-classification')
classifier('We are very happy to show you the 🤗 Transformers library.', candidate_labels=['politics', 'business', 'sports', 'technology'])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'We are very happy to show you the 🤗 Transformers library.',
 'labels': ['technology', 'business', 'sports', 'politics'],
 'scores': [0.7958350777626038,
  0.08859799802303314,
  0.06836581230163574,
  0.04720110818743706]}
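
As a rough sketch of where these scores come from: the default model is a natural language inference (NLI) model, and the zero-shot pipeline turns each candidate label into a hypothesis sentence, scores how strongly the input entails it, and normalizes those scores across labels. The logits below are hypothetical numbers chosen for illustration, not values produced by `facebook/bart-large-mnli`, and the template string is assumed to be the pipeline default.

```python
import math


def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, returned best-first."""
    labels = list(entailment_logits)
    m = max(entailment_logits.values())
    exps = [math.exp(entailment_logits[label] - m) for label in labels]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidate_labels = ['politics', 'business', 'sports', 'technology']

# each label becomes an NLI hypothesis (assumed default template)
hypotheses = [f'This example is {label}.' for label in candidate_labels]

# hypothetical entailment logits, one per hypothesis
logits = {'politics': -1.2, 'business': -0.6, 'sports': -0.8, 'technology': 1.6}

ranking = zero_shot_scores(logits)
print(ranking)
```

Because the scores are normalized across the candidate labels, they always sum to 1, even when none of the labels is a good fit for the input.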

In [10]:
# text generation
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
generator('Frodo and Sam were walking through the Shire when')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Frodo and Sam were walking through the Shire when a group of three men approached the house and started firing. The men fired at the house, but the women came and told Mr. Frye to go back upstairs to the house. There'}]
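
Text generation repeats one step: score every possible next token, pick one, append it, and continue. The sketch below shows a single step with top-k sampling, one common way to make that pick. The token scores are hypothetical, and `sample_next_token` is an illustrative helper, not a function from the Transformers library.

```python
import math
import random


def sample_next_token(logits, k=2, temperature=1.0, rng=random):
    """Sample one token from the k highest-scoring candidates."""
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    # lower temperature sharpens the distribution, higher flattens it
    scaled = [score / temperature for _, score in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# hypothetical next-token scores after "...through the Shire when"
toy_logits = {' they': 2.1, ' a': 1.9, ' suddenly': 0.3, ' the': 1.2}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(toy_logits, k=2, rng=rng) for _ in range(20)]
print(samples)
```

Sampling is why running the generation cell twice usually gives two different continuations.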

In [11]:
# named entity recognition

from transformers import pipeline

# aggregation_strategy='simple' replaces the deprecated grouped_entities=True
ner = pipeline('ner', aggregation_strategy='simple')
ner('Mary graduates this spring from William and Mary. She will continue to study Natural Language Processing at MIT.')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification

[{'entity_group': 'PER',
  'score': 0.9994654,
  'word': 'Mary',
  'start': 0,
  'end': 4},
 {'entity_group': 'ORG',
  'score': 0.98400134,
  'word': 'William and Mary',
  'start': 32,
  'end': 48},
 {'entity_group': 'MISC',
  'score': 0.9641359,
  'word': 'Natural Language Processing',
  'start': 77,
  'end': 104},
 {'entity_group': 'ORG',
  'score': 0.9948832,
  'word': 'MIT',
  'start': 108,
  'end': 111}]
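
The model itself labels individual tokens; the grouped output above comes from merging neighbouring tokens that share an entity type. A minimal sketch of that merging step follows, assuming hypothetical word-level tags in the common B-/I-/O scheme. The real pipeline works on word pieces and also tracks scores and character offsets.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens into (label, text) groups."""
    groups = []
    current_label, current_words = None, []
    for word, tag in tagged_tokens:
        if tag == 'O':
            label, starts_new = None, False
        else:
            prefix, label = tag.split('-', 1)
            # a B- tag or a change of entity type starts a new group
            starts_new = prefix == 'B' or label != current_label
        if label is None or starts_new:
            # close the group that was being built, if any
            if current_words:
                groups.append((current_label, ' '.join(current_words)))
            current_label, current_words = label, []
        if label is not None:
            current_words.append(word)
    if current_words:
        groups.append((current_label, ' '.join(current_words)))
    return groups


# hypothetical token-level predictions for part of the sentence above
tagged = [('Mary', 'B-PER'), ('graduates', 'O'), ('from', 'O'),
          ('William', 'B-ORG'), ('and', 'I-ORG'), ('Mary', 'I-ORG'),
          ('at', 'O'), ('MIT', 'B-ORG')]

entities = group_entities(tagged)
print(entities)
```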

In [12]:
# question answering

from transformers import pipeline

question_answerer = pipeline('question-answering')

question_answerer(
    question='Where does Mary study?',
    context='Mary graduates this spring from William and Mary. She will continue to study Natural Language Processing at MIT.'
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9272308349609375, 'start': 108, 'end': 111, 'answer': 'MIT'}
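
The `start` and `end` values are character offsets into the context string, so the answer can be recovered by slicing. The short check below copies the context and the result dictionary from the cell above. The same convention applies to the `start` and `end` fields in the named entity output.

```python
context = ('Mary graduates this spring from William and Mary. '
           'She will continue to study Natural Language Processing at MIT.')

# result dictionary copied from the pipeline output above
result = {'score': 0.9272308349609375, 'start': 108, 'end': 111, 'answer': 'MIT'}

# slicing the context with the offsets reproduces the answer
answer_span = context[result['start']:result['end']]
print(answer_span)
```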

In [13]:
# summarization

summarizer = pipeline('summarization', max_length=48, min_length=30, do_sample=False, model='t5-base')
summarizer(
    """
    The Transformers library provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...)
    for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and
    deep interoperability between TensorFlow 2.0 and PyTorch.
    """)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'summary_text': 'the Transformers library provides state-of-the-art general-purpose architectures for natural language understanding (NLU) and natural language generation (NLG) with over 32+ pretrained models in 100+ languages .'}]

### The importance of training data

While the pipelines above are easy to use, the real power of Transformers is that a pretrained model can be fine-tuned on a wide variety of tasks with just a few lines of code. This works because the model is first pretrained on a very large corpus (BERT, for example, was pretrained on roughly 3.3 billion words) and then fine-tuned on a much smaller dataset for a specific task. It also means that a model reflects the data it was trained on, as the next example shows.

In [14]:
# model bias - GPT-2 was trained on general web text (WebText, pages linked from Reddit), so we will get really poor results in specialized domains
generator('SARS-CoV-2, the causative agent of COVID-19, employs its spike glycoprotein', num_return_sequences=5, max_length=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'SARS-CoV-2, the causative agent of COVID-19, employs its spike glycoprotein ligand at either 1:1 or 2:2:N with an affinity of about 1/5:1:1 for the activation of transcriptional transcripts that are involved in synaptic transmission. This action in turn results from binding to a protein called interferon 3 (IF4). It is expressed via a large number of GPCRs and therefore has a direct action at'},
 {'generated_text': 'SARS-CoV-2, the causative agent of COVID-19, employs its spike glycoprotein D6 to inhibit the expression of human adenosine triphosphate (ADP). The results show that if a single amino acid is substituted for another, the human adenosine triphosphate is inhibited. Moreover, if either a single amino acid or another individual is added (like diphthongs and amines), the inhibition can also be reversed or increased'},
 {'generated_text': 'SARS-CoV-2, the causative agent of COVID-19, employs its spike glycoprotein [J-E1] in the preparation of COVID by a proce

## Ethical considerations & Subject Matter Experts

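The above is a good example of how large language models can 'ramble': the text reads fluently but is not factually reliable, so output in specialized domains should be reviewed by subject matter experts.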

## Embeddings

In [15]:
import torch
import numpy as np
import pandas as pd


from transformers import BertModel, BertTokenizer
# from transformers import BloomModel, AutoTokenizer

model = BertModel.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
           output_hidden_states = True)
tokenizer = BertTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

In [16]:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to('cpu')

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [17]:
# Create contextual embeddings

def bert_text_preparation(text, tokenizer):
  """
  Preprocesses text input in a way that BERT can interpret.
  """
  marked_text = "[CLS] " + text + " [SEP]"
  tokenized_text = tokenizer.tokenize(marked_text)
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  segments_ids = [1]*len(indexed_tokens)

  # convert inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensor = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensor

In [18]:
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens, in context of the given sentence.
    """
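    # gradient calculation is disabled for inference
    with torch.no_grad():
      # NOTE: the second positional argument of BertModel.forward is attention_mask,
      # so the all-ones tensor is passed explicitly as an attention mask
      # (token_type_ids then default to zeros)
      outputs = model(tokens_tensor, attention_mask=segments_tensor)
      # hidden states of the embedding layer and of every encoder layer
      hidden_states = outputs[2]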

    # concatenate the tensors for all layers
    # use "stack" to create new dimension in tensor
    token_embeddings = torch.stack(hidden_states, dim=0)

    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)

    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)

    # initialize list to store embeddings
    token_vecs_sum = []

    # "token_embeddings" is a [Y x 13 x 768] tensor, where Y is the number of
    # tokens in the sentence and 13 = embedding layer output + 12 encoder layers

    # loop over tokens in sentence
    for token in token_embeddings:

        # "token" is a [13 x 768] tensor

        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)

    return token_vecs_sum

In [19]:
sentences = ['Advancing mHealth-supported Adoption and Sustainment of an Evidence-based Mental Health Intervention for Youth in a School-based Delivery Setting in Sierra Leone',
             'Refining and Pilot Testing a Decision Support Intervention to Facilitate Adoption of Evidence-Based Programs to Improve Parent and Child Mental Health',
             'Reusable, transparent, and reconfigurable N95-equivalent Respirator Masks: design, fabrication, and trials for enhanced adoption',
             'Understanding the Adoption and Impact of New Risk Assessment Technologies in Prostate Cancer Care',
             'Addressing adoption barriers to patient transportation services',
             'The College Alcohol Intervention Matrix (College AIM): Adoption and Implementation Across College Campuses',
             'Social Networks of Diffusion and Adoption: Investigating the Network Effects on implementation of evidence-based interventions for early intervention providers of children',
             'HPV ECHO: Increasing the adoption of evidence-based communication strategies for HPV vaccination in rural primary care practices',
             'Understanding disparities in the adoption and use of assistive technology by older Hispanics',
             'Adoption and Implementation of an Evidence-based Safe Driving Program for High-Risk Teen Drivers',
             'Motion Sequencing for All: pipelining, distribution and training to enable broad adoption of a next-generation platform for behavioral and neurobehavioral analysis',
             "The Implementation, Adoption, and Sustainability of Ho'ouna Pono",
             "The Challenges and Benefits of Adopting Teens: A Comparative Study",
             "Navigating the Unique Needs of Adolescent Adoption",
             "The Impact of Timing on Adoption Outcomes: Examining Infant and Teen Adoption",
             "Supporting the Transition to Adulthood in Adopted Teens",
             "Exploring the Long-Term Effects of Adopting Teens versus Infants",
             "Adopting Teens: A Systematic Review of the Literature",
             "Addressing the Stereotypes and Realities of Adopting Teens",
             "Comparing the Parenting Experiences of Adopting Infants and Teens"
             ]


In [20]:
from collections import OrderedDict

context_embeddings = []
context_tokens = []

for sentence in sentences:
  tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()

  # loop over tokens in the sentence, skipping [CLS] and [SEP]
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1

    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]

    # get the corresponding embedding
    token_vec = list_token_embeddings[current_index]
    
    # save values
    context_tokens.append(token)
    context_embeddings.append(token_vec)

In [21]:
context_tokens

['advancing',
 'mh',
 '##eal',
 '##th',
 '-',
 'supported',
 'adoption',
 'and',
 'sustain',
 '##ment',
 'of',
 'an',
 'evidence',
 '-',
 'based',
 'mental',
 'health',
 'intervention',
 'for',
 'youth',
 'in',
 'a',
 'school',
 '-',
 'based',
 'delivery',
 'setting',
 'in',
 'sie',
 '##rr',
 '##a',
 'leon',
 '##e',
 'ref',
 '##ining',
 'and',
 'pilot',
 'testing',
 'a',
 'decision',
 'support',
 'intervention',
 'to',
 'facilitate',
 'adoption',
 'of',
 'evidence',
 '-',
 'based',
 'programs',
 'to',
 'improve',
 'parent',
 'and',
 'child',
 'mental',
 'health',
 're',
 '##usa',
 '##ble',
 ',',
 'transparent',
 ',',
 'and',
 'recon',
 '##fig',
 '##urable',
 'n',
 '##95',
 '-',
 'equivalent',
 'respir',
 '##ator',
 'masks',
 ':',
 'design',
 ',',
 'fabrication',
 ',',
 'and',
 'trials',
 'for',
 'enhanced',
 'adoption',
 'understanding',
 'the',
 'adoption',
 'and',
 'impact',
 'of',
 'new',
 'risk',
 'assessment',
 'technologies',
 'in',
 'prostate',
 'cancer',
 'care',
 'addressing',
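
The `##` prefixes in the list above come from WordPiece tokenization: a word that is not in the vocabulary is split into the longest pieces that are, and every piece after the first is marked with `##`. A minimal sketch of that greedy longest-match-first rule, using a tiny hypothetical vocabulary rather than the model's real one:

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of one lowercased word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces


# hypothetical toy vocabulary, for illustration only
toy_vocab = {'adoption', 'sustain', '##ment', 're', '##usa', '##ble'}

for word in ['adoption', 'sustainment', 'reusable']:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Because each word piece gets its own embedding, a word such as 'sustainment' is represented by two vectors here, while 'adoption' is a single token and can be compared directly across sentences.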

In [22]:
from scipy.spatial.distance import cosine

# embeddings for the word 'adoption'
token = 'adoption'
indices = [i for i, t in enumerate(context_tokens) if t == token]

token_embeddings = [context_embeddings[i] for i in indices]

# compare 'adoption' across different contexts
# NOTE: zip pairs the i-th embedding with the i-th sentence, which is only correct
# if every sentence contains the token exactly once
list_of_distances = []
for sentence_1, embed1 in zip(sentences, token_embeddings):
  for sentence_2, embed2 in zip(sentences, token_embeddings):
    # 1 - cosine distance = cosine similarity (1.0 means same direction)
    cos_dist = 1 - cosine(embed1, embed2)
    list_of_distances.append([sentence_1, sentence_2, cos_dist])

distances_df = pd.DataFrame(list_of_distances, columns=['sentence_1', 'sentence_2', 'distance'])
distances_df[distances_df.sentence_1.str.contains('adoption')]

| | sentence_1 | sentence_2 | distance |
|---|---|---|---|
| 30 | Reusable, transparent, and reconfigurable N95-... | Advancing mHealth-supported Adoption and Susta... | 0.753539 |
| 31 | Reusable, transparent, and reconfigurable N95-... | Refining and Pilot Testing a Decision Support ... | 0.746505 |
| 32 | Reusable, transparent, and reconfigurable N95-... | Reusable, transparent, and reconfigurable N95-... | 1.000000 |
| 33 | Reusable, transparent, and reconfigurable N95-... | Understanding the Adoption and Impact of New R... | 0.736815 |
| 34 | Reusable, transparent, and reconfigurable N95-... | Addressing adoption barriers to patient transp... | 0.766890 |
| ... | ... | ... | ... |
| 160 | Motion Sequencing for All: pipelining, distrib... | Motion Sequencing for All: pipelining, distrib... | 1.000000 |
| 161 | Motion Sequencing for All: pipelining, distrib... | The Implementation, Adoption, and Sustainabili... | 0.906431 |
| 162 | Motion Sequencing for All: pipelining, distrib... | The Challenges and Benefits of Adopting Teens:... | 0.732473 |
| 163 | Motion Sequencing for All: pipelining, distrib... | Navigating the Unique Needs of Adolescent Adop... | 0.820002 |
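
Note that the `distance` column actually holds a similarity: SciPy's `cosine` function returns the cosine distance, and the code subtracts it from 1. A value of 1.0 therefore means the two embeddings point in the same direction. A minimal sketch of the underlying formula, with made-up two-dimensional vectors standing in for the 768-dimensional embeddings:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical toy vectors
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # unrelated vectors
print(same_direction, orthogonal)
```

The measure ignores vector length and only compares direction, which is why it is a common choice for comparing embeddings.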


In [None]:
import csv
import os

# output directory
filepath = os.path.join('gdrive/My Drive/projections/')

# one token per line, in the same order as the embeddings
name = 'metadata.tsv'

with open(os.path.join(filepath, name), 'w+', encoding='utf-8') as file_metadata:
  for token in context_tokens:
    file_metadata.write(token + '\n')

# one tab-separated embedding vector per line
name = 'embeddings.tsv'

with open(os.path.join(filepath, name), 'w+', newline='') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for embedding in context_embeddings:
        writer.writerow(embedding.numpy())