<br>

# 🤗 HuggingFace Course

<br>

In [1]:
import transformers

<br>

## 🤗 Transformer Models

Transformers, why are they so damn cool?

<br>

### Introduction

* Chapters 1-4: introduction to the main concepts of the 🤗 Transformers library
* Chapters 5-8: the basics of 🤗 Datasets and 🤗 Tokenizers, plus some introduction to NLP
* Chapters 9-12: a deeper dive to showcase specialized architectures

<br>

### Natural Language Processing

**NLP**  
A list of common tasks:  
* classifying whole sentences - sentiment, spam classification, grammar, etc.
* classifying words in a sentence - grammar, POS, entities
* generating text content - filling in blanks, sentence completion
* extracting an answer from text - given a question + context, find the answer
* generating a new sentence from text input - summarizing text, translation

<br>

### Transformers, what can they do?

The 🤗 Transformers library provides the functionality to create and use hosted models. The Model Hub contains many thousands of pretrained models that anyone can download and use.  

#### Working with `pipelines`

`pipelines` connect a model with its necessary preprocessing and postprocessing steps:
1. the text is preprocessed into a format the model can understand
2. the preprocessed inputs are passed to the model
3. the predictions of the model are post-processed, so you can make sense of them
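
The three steps above can be sketched with a toy example. This is only an illustration of the flow, not how 🤗 Transformers implements it: the vocabulary, the per-token weights and the function names (`preprocess`, `toy_model`, `postprocess`, `toy_pipeline`) are all made up for this sketch.

```python
import math

# Hypothetical toy vocabulary and per-token "sentiment" weights (illustration only).
VOCAB = {"[UNK]": 0, "love": 1, "great": 2, "hate": 3, "awful": 4}
WEIGHTS = {1: 2.0, 2: 1.5, 3: -2.0, 4: -1.5}

def preprocess(text):
    """Step 1: turn raw text into token ids the toy model understands."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]

def toy_model(token_ids):
    """Step 2: produce raw scores (logits) for [NEGATIVE, POSITIVE]."""
    total = sum(WEIGHTS.get(token_id, 0.0) for token_id in token_ids)
    return [-total, total]

def postprocess(logits):
    """Step 3: softmax the logits and attach a human-readable label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ["NEGATIVE", "POSITIVE"][best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three steps, the way a pipeline object does."""
    return postprocess(toy_model(preprocess(text)))

result = toy_pipeline("I love this great course")
print(result)
```

The returned dictionary has the same shape (`label`, `score`) as the real `sentiment-analysis` outputs shown in the cells that follow.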

In [2]:
from transformers import pipeline

classifier = pipeline( "sentiment-analysis" )
# also available: feature-extraction, fill-mask, ner (named entity recognition), question-answering,
#                 summarization, text-generation, translation, zero-shot-classification
classifier( "I've been waiting for a HuggingFace course since last Spring" )

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.8835864663124084}]

**Note** - the model is downloaded and cached when the pipeline object is created, $\therefore$ it does not need to be downloaded again when the code is rerun

In [3]:
classifier( "I'm really excited to take this HuggingFace course" )

[{'label': 'POSITIVE', 'score': 0.9996492862701416}]

<br>

Taking a look at a few other pipelines:  

#### Zero-Shot Classification

Classifies texts that have not been labeled. It lets you specify which labels to use for the classification, so that you don't need to rely on the labels of the pretrained model.

In [4]:
classifier = pipeline( "zero-shot-classification" )
classifier( 
    "These are my notes from the HuggingFace online course",
    candidate_labels = ['education', 'science', 'politics', 'entertainment']
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'sequence': 'These are my notes from the HuggingFace online course',
 'labels': ['education', 'entertainment', 'science', 'politics'],
 'scores': [0.6758780479431152,
  0.21396446228027344,
  0.08056329190731049,
  0.029594212770462036]}
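
The returned dictionary pairs `labels` with `scores` position by position, sorted from most to least likely. A minimal sketch of reading it, reusing the output printed above as a plain Python literal (no model is loaded here, and the names `label_scores` and `top_label` are just for this sketch):

```python
# Output copied from the zero-shot cell above.
result = {
    "sequence": "These are my notes from the HuggingFace online course",
    "labels": ["education", "entertainment", "science", "politics"],
    "scores": [0.6758780479431152, 0.21396446228027344,
               0.08056329190731049, 0.029594212770462036],
}

# Pair each label with its score; the first entry is the top prediction.
label_scores = dict(zip(result["labels"], result["scores"]))
top_label = result["labels"][0]
print(top_label, round(label_scores[top_label], 3))

# In this output the scores across the candidate labels add up to roughly 1.
print(round(sum(result["scores"]), 3))
```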

In [6]:
classifier( [ 'NASA’s James Webb Space Telescope team fully deployed its 21-foot, gold-coated primary mirror, successfully completing the final stage of all major spacecraft deployments to prepare for science operations.',
             'The world’s largest and most complex space science telescope will now begin moving its 18 primary mirror segments to align the telescope optics.'],
          candidate_labels = ['education', 'science', 'politics', 'entertainment']
          )

[{'sequence': 'NASA’s James Webb Space Telescope team fully deployed its 21-foot, gold-coated primary mirror, successfully completing the final stage of all major spacecraft deployments to prepare for science operations.',
  'labels': ['science', 'education', 'entertainment', 'politics'],
  'scores': [0.972022294998169,
   0.011234794743359089,
   0.010500448755919933,
   0.006242458242923021]},
 {'sequence': 'The world’s largest and most complex space science telescope will now begin moving its 18 primary mirror segments to align the telescope optics.',
  'labels': ['science', 'entertainment', 'education', 'politics'],
  'scores': [0.9860273599624634,
   0.005517215467989445,
   0.005045033525675535,
   0.0034104110673069954]}]

<br>

#### Text Generation

provide a prompt and the model will auto-complete to generate the remaining text

In [7]:
generator = pipeline( 'text-generation' )
generator( 'In this course, we will teach you how to understand' )

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to understand the principles behind the two-sided nature of human being.'}]

In [8]:
generator( 'In this course, we will teach you how to understand', num_return_sequences = 2, max_length = 15 )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to understand and avoid some common'},
 {'generated_text': 'In this course, we will teach you how to understand the fundamentals of data'}]

In [9]:
generator( 'Seeing Slayer live was the greatest', num_return_sequences = 5, max_length = 25 )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Seeing Slayer live was the greatest battle of the year (and probably of all time). It was so fierce that it was the'},
 {'generated_text': 'Seeing Slayer live was the greatest experience of all." After my introduction, Liu Yao glanced at me. "We will be running'},
 {'generated_text': 'Seeing Slayer live was the greatest moment of their lives, and how they felt at that moment — their life that never ends —'},
 {'generated_text': 'Seeing Slayer live was the greatest fantasy that was left out in the world."\n\n"And by the same token, if'},
 {'generated_text': 'Seeing Slayer live was the greatest accomplishment of her lives.\n\nShe had never used any supernatural powers before and could hardly feel'}]

In [10]:
# specifying a model from the Hub
generator = pipeline('text-generation', model='distilgpt2')
generator( 'Seeing Slayer live was the greatest', num_return_sequences = 5, max_length = 25 )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Seeing Slayer live was the greatest show of all time - in 1995, the television giant put back its plans, but that was'},
 {'generated_text': 'Seeing Slayer live was the greatest day in professional soccer history, with many leagues reaching huge financial heights. To qualify for the World'},
 {'generated_text': 'Seeing Slayer live was the greatest show ever released in Japan.\n\n\n\nThe Japanese version of Slayer premiered at the U'},
 {'generated_text': 'Seeing Slayer live was the greatest success in the world,” said the CEO of the agency.\n\n\n\n\n'},
 {'generated_text': 'Seeing Slayer live was the greatest act of the show!\nThe cast also talked about their experiences in New York, Los Angeles'}]

<br>

#### Mask Filling

fill in blanks in a given text

In [11]:
unmasker = pipeline( 'fill-mask' )
unmasker( 'This course will teach you all about <mask> models', top_k=5 )

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'sequence': 'This course will teach you all about mathematical models',
  'score': 0.19631721079349518,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about building models',
  'score': 0.04449249804019928,
  'token': 745,
  'token_str': ' building'},
 {'sequence': 'This course will teach you all about predictive models',
  'score': 0.039371054619550705,
  'token': 27930,
  'token_str': ' predictive'},
 {'sequence': 'This course will teach you all about role models',
  'score': 0.03575519099831581,
  'token': 774,
  'token_str': ' role'},
 {'sequence': 'This course will teach you all about business models',
  'score': 0.027736075222492218,
  'token': 265,
  'token_str': ' business'}]

In [12]:
unmasker = pipeline( 'fill-mask', model = 'bert-base-cased' )
unmasker( 'This course will teach you all about [MASK] models', top_k=5 )

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': 'This course will teach you all about role models',
  'score': 0.07605545967817307,
  'token': 1648,
  'token_str': 'role'},
 {'sequence': 'This course will teach you all about the models',
  'score': 0.05633450299501419,
  'token': 1103,
  'token_str': 'the'},
 {'sequence': 'This course will teach you all about fashion models',
  'score': 0.04619521275162697,
  'token': 4633,
  'token_str': 'fashion'},
 {'sequence': 'This course will teach you all about computer models',
  'score': 0.030348550528287888,
  'token': 2775,
  'token_str': 'computer'},
 {'sequence': 'This course will teach you all about life models',
  'score': 0.019965192303061485,
  'token': 1297,
  'token_str': 'life'}]

<br>

#### Named Entity Recognition

**NER** - find which parts of an input text correspond to entities

In [14]:
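# note: `grouped_entities=True` is deprecated (see the warning below); newer releases use aggregation_strategy="simple"
ner = pipeline( "ner", grouped_entities=True )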
ner( "Let me be clear on this: My brother is qualified for the position, Adams told CNN’s 'State of the Union' when asked about making his sibling Bernard Williams a deputy NYPD commissioner.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'entity_group': 'PER',
  'score': 0.99946195,
  'word': 'Adams',
  'start': 67,
  'end': 72},
 {'entity_group': 'ORG',
  'score': 0.99552476,
  'word': 'CNN',
  'start': 78,
  'end': 81},
 {'entity_group': 'ORG',
  'score': 0.8375926,
  'word': 'State of the Union',
  'start': 85,
  'end': 103},
 {'entity_group': 'PER',
  'score': 0.99927413,
  'word': 'Bernard Williams',
  'start': 141,
  'end': 157},
 {'entity_group': 'ORG',
  'score': 0.9963703,
  'word': 'NYPD',
  'start': 167,
  'end': 171}]

In [22]:
# POS tagging
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("vblagoje/roberta-base-english-upos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/roberta-base-english-upos")
ner = pipeline( "ner", model = model, tokenizer=tokenizer, grouped_entities=True )
ner( "Let me be clear on this: My brother is qualified for the position, Adams told CNN’s 'State of the Union' when asked about making his sibling Bernard Williams a deputy NYPD commissioner.")

[{'entity_group': 'VERB',
  'score': 0.9986479,
  'word': ' Let',
  'start': 1,
  'end': 3},
 {'entity_group': 'PRON',
  'score': 0.99980384,
  'word': ' me',
  'start': 4,
  'end': 6},
 {'entity_group': 'AUX',
  'score': 0.98620975,
  'word': ' be',
  'start': 7,
  'end': 9},
 {'entity_group': 'ADJ',
  'score': 0.9991356,
  'word': ' clear',
  'start': 10,
  'end': 15},
 {'entity_group': 'ADP',
  'score': 0.99943495,
  'word': ' on',
  'start': 16,
  'end': 18},
 {'entity_group': 'PRON',
  'score': 0.99961954,
  'word': ' this',
  'start': 19,
  'end': 23},
 {'entity_group': 'PUNCT',
  'score': 0.93715745,
  'word': ':',
  'start': 23,
  'end': 24},
 {'entity_group': 'PRON',
  'score': 0.99976885,
  'word': ' My',
  'start': 25,
  'end': 27},
 {'entity_group': 'NOUN',
  'score': 0.9995938,
  'word': ' brother',
  'start': 28,
  'end': 35},
 {'entity_group': 'AUX',
  'score': 0.99777746,
  'word': ' is',
  'start': 36,
  'end': 38},
 {'entity_group': 'ADJ',
  'score': 0.9990982,
  'wor

<br>

#### Question Answering

In [24]:
q_answering = pipeline( 'question-answering' )
q_answering( 
    question = 'Where does Zach work?',
    context = 'Zach is a bike messenger in NYC and lives in Brooklyn'
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.38330498337745667, 'start': 45, 'end': 53, 'answer': 'Brooklyn'}
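
In this output, `start` and `end` are character offsets into `context`. A small check, reusing the values printed above (no model is loaded here):

```python
context = "Zach is a bike messenger in NYC and lives in Brooklyn"

# Output copied from the question-answering cell above.
result = {"score": 0.38330498337745667, "start": 45, "end": 53, "answer": "Brooklyn"}

# Slicing the context with the reported offsets recovers the answer span.
span = context[result["start"]:result["end"]]
print(span)
```

Note that the extracted span says where Zach lives rather than where he works, and the score is fairly low (about 0.38).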

<br>

#### Summarization

distilling text

In [25]:
summarizer = pipeline( 'summarization' )
summarizer( """The Witcher is a fantasy drama streaming television series created by Lauren Schmidt Hissrich, fan fiction based on the book series of the same name by Polish writer Andrzej Sapkowski. Set on a fictional, medieval-inspired landmass known as "the Continent", The Witcher explores the legend of Geralt of Rivia and Princess Ciri, who are linked to each other by destiny. It stars Henry Cavill, Freya Allan and Anya Chalotra.

The first season consisted of eight episodes and was released on Netflix in its entirety on December 20, 2019. It was based on The Last Wish and Sword of Destiny, which are collections of short stories that precede the main Witcher saga. The second season, consisting of eight episodes, was released on December 17, 2021. In September 2021, Netflix renewed the series for a third season. An animated origin story film, The Witcher: Nightmare of the Wolf, was released on August 23, 2021, while a prequel miniseries, The Witcher: Blood Origin, will be released in 2022.

The story begins by following three main characters: the witcher Geralt of Rivia; crown princess Cirilla of Cintra; and the sorceress Yennefer of Vengerberg, meeting them at different points in time. A witcher is a human trained and magically mutated since childhood to be strong and resilient enough to fight the many monsters that plague The Continent. The first season bounces the viewer back and forth in time, exploring formative events that shape the main characters before eventually merging into a single timeline.

Yennefer and Geralt encounter each other several times across many decades, as both are magic-users with unnaturally long lives. They clash more than once and are on-and-off lovers, both of them longing for more but afraid to admit it.

Meanwhile, Geralt and princess Cirilla are linked by destiny before she is born: for saving her father's life, Geralt elected to be paid by invoking "the Law of Surprise", a custom in which the one who was saved gives his savior the next thing he learns that he possesses, but did NOT know about at the time his life was saved. After promising Geralt the surprise, Cirilla's father Duny learns that his bride is pregnant. Thus, the unborn princess Ciri was bound to the witcher in an unbreakable pact of destiny. Geralt wants nothing to do with a child and goes on his solitary way. After the two finally meet, when the 12-year-old princess has been orphaned by war, the witcher becomes her protector; he must keep her safe and fight against many pursuers to prevent Ciri's powerful magic from being used to destroy the world.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' The Witcher is a fantasy drama based on the book series of the same name by Polish writer Andrzej Sapkowski . Set on a fictional, medieval-inspired landmass known as "the Continent", The Witcher explores the legend of Geralt of Rivia and Princess Ciri, who are linked to each other by destiny . The first season consisted of eight episodes and was released on Netflix in its entirety on December 20, 2019 .'}]

<br>

#### Translation

In [28]:
translator = pipeline( 'translation', model = 'Helsinki-NLP/opus-mt-fr-en' )
translator('Ce cours est produit par HuggingFace')

[{'translation_text': 'This course is produced by HuggingFace'}]

<br>

### How do Transformers work?

Transformers were introduced in June 2017  
They can be broadly grouped into 3 categories: 
* GPT-like - autoregressive
* BERT-like - autoencoding
* BART/T5-like - seq2seq 

These models learn language through **self-supervised learning**: the training objective is computed automatically from the inputs to the model, so the model develops a statistical understanding of the language without human-provided labels.  
On their own, pretrained Transformer models are very general and not very useful for specific tasks. However, they can be used as a base and fine-tuned with supervised learning for a specific task == **transfer learning**  
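
A minimal sketch of what "computed automatically from the inputs" can look like for the two objectives mentioned later in these notes (causal and masked language modeling). The helper names are made up for this illustration, and real training works on token ids and masks tokens at random rather than at a fixed index.

```python
def causal_lm_pairs(tokens):
    """Next-token prediction: each prefix is an input, the following token is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_index, mask_token="[MASK]"):
    """Masked language modeling: hide one token; the hidden token is the label."""
    masked = list(tokens)
    label = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, label

tokens = "my name is sylvain".split()

# The labels come straight from the text itself; nobody annotated anything.
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)

print(masked_lm_example(tokens, mask_index=2))
```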


Large models have a big carbon footprint.  
Several things to consider:
* use pretrained models when they are available
* fine-tune instead of training from scratch
* start with smaller experiments and debug them first
* do a literature review to choose hyperparameter ranges
* use random search instead of grid search

[**Machine Learning Emissions Calculator**](https://mlco2.github.io/impact/)  
See also the `codecarbon` package:  

In [30]:
#pip install codecarbon
from codecarbon import EmissionsTracker
help( EmissionsTracker )

Help on class EmissionsTracker in module codecarbon.emissions_tracker:

class EmissionsTracker(BaseEmissionsTracker)
 |  EmissionsTracker(project_name: str = 'codecarbon', measure_power_secs: int = 15, output_dir: str = '.', save_to_file: bool = True, gpu_ids: Union[List, NoneType] = None, emissions_endpoint: Union[str, NoneType] = None, co2_signal_api_token: Union[str, NoneType] = None)
 |  
 |  An online emissions tracker that auto infers geographical location,
 |  using `geojs` API
 |  
 |  Method resolution order:
 |      EmissionsTracker
 |      BaseEmissionsTracker
 |      abc.ABC
 |      builtins.object
 |  
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseEmissionsTracker:
 |  
 |  __init__(self, project_name: str = 'codecarbon', measure_power_secs: int = 15, output_dir: str = '.', save_to_file: bool = True, gpu_ids: Union[List, No

<br>

Sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.  

#### Transfer Learning

**Transfer Learning** - transferring the knowledge of a pretrained model to a new model by initializing the second model with the first model's weights.  
This leverages the knowledge gained from training a model on a very large dataset for another task.  
**pre-training** - the act of training a model from scratch: the weights are initialized randomly and training starts without any prior knowledge. Pre-training is usually done on very large datasets.  
**fine-tuning** - training done after a model has been pretrained, typically with a comparatively smaller dataset  

Fine-tuning has lower time, data, financial and environmental costs. However, the resulting model inherits any bias present in the pretrained model.  
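
A toy sketch of the initialization step, assuming a model can be pictured as a dictionary of named weights. The layer names and helper functions are hypothetical, and real checkpoints hold tensors rather than single floats.

```python
import random

def random_weights(names, seed):
    """Pre-training starts from random weights (no prior knowledge)."""
    rng = random.Random(seed)
    return {name: rng.uniform(-1, 1) for name in names}

# Hypothetical layer names shared by the pretrained and the new model.
BODY = ["embeddings", "layer_0", "layer_1"]

# Stands in for a pretrained checkpoint (body plus its pre-training head).
pretrained = random_weights(BODY + ["lm_head"], seed=0)

def init_for_fine_tuning(pretrained, new_head="classifier", seed=1):
    """Copy the pretrained body weights; only the new task head starts at random."""
    model = {name: pretrained[name] for name in BODY}
    model.update(random_weights([new_head], seed))
    return model

fine_tune_model = init_for_fine_tuning(pretrained)
print(sorted(fine_tune_model))
```

Fine-tuning then continues training from these copied weights instead of from scratch, which is where the savings in time and data come from.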

#### General Architecture

* **Encoder** - acquires and interprets input: receives an input and builds a representation of its features
* **Decoder** - generates output: uses the encoder's representation (features) along with other inputs to generate a target sequence
* Types of Models:
    * **Encoder-only models** - tasks that require understanding the input (classification, NER)
    * **Decoder-only models** - generative tasks (text generation)
    * **Encoder-decoder models** - sequence-to-sequence models; good for generative tasks that require an input (translation, summarization)
* **Attention Layers** - tell the model to pay specific attention to certain words in the sentence it was passed; a word by itself has meaning, but that meaning is deeply affected by its context, which can be any other word (or words) before or after the word being studied.
* **Architecture** - the skeleton of the model. An architecture is a succession of mathematical functions used to build a model; its weights are those functions' parameters.
* **Checkpoints** - the weights that will be loaded into a given architecture
* **Model** - an umbrella term that can refer to either an architecture or a checkpoint
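
To make the attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the formulation commonly used in Transformer models. The notes above do not spell out the formula, so treat this as an assumption about the mechanism; the toy vectors are made up, and the learned query/key/value projections of a real layer are left out.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How strongly this word should attend to every word in the sentence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The contextualized vector is a weighted mix of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-d word vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one word attends to every word in the sequence, which is the sense in which a word's representation depends on its context.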

<br>

### Encoder Models

* Encoder models use only the encoder of a Transformer model
* **feature vector/tensor** - holds the meaning of the word within the text. The encoder produces a numerical vector representation of each word, with dimensions defined by the architecture of the model. Each representation is contextualized: each word in the initial sequence affects every word's representation
* **"Bi-directional" attention** - the vectorization takes into account tokens both before and after the word
* **auto-encoding models**
* Examples: 
    - ALBERT
    - BERT
    - DistilBERT
    - ELECTRA
    - RoBERTa
* When to use an Encoder Model:
    - sequence classification (sentiment analysis), question answering, masked language modeling
    - NLU: Natural Language Understanding

<br>

### Decoder Models

* Decoder Models use only the Decoder of a Transformer Model
* **Auto-regressive** - at each position, the attention layers can only access the words positioned before it in the sentence
* **feature vector/tensor** - one vector per word, with length determined by the model architecture; differs from Encoder models in its self-attention mechanism
* **masked self-attention** - the tokens on one side of the word (to the right or to the left) are masked; the feature vector is determined by one-directional context
* best suited for text generation tasks
* Examples:
    - CTRL
    - GPT
    - GPT-2
    - Transformer XL
* when to use Decoder Model:
    - Causal Language Modeling: great at generation tasks, i.e. generating sequences of text
    - NLG: Natural Language Generation
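
The difference between bi-directional attention (encoders) and masked self-attention (decoders) can be pictured as a matrix of allowed connections. A minimal sketch, assuming the usual causal convention in which a token may attend to itself and to everything on its left; the function name is made up.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] == 1 means token i may attend to token j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
            for i in range(n_tokens)]

bidirectional = attention_mask(4, causal=False)  # encoder-style: full context
causal = attention_mask(4, causal=True)          # decoder-style: left context only

for row in causal:
    print(row)
```

In the causal mask the upper-right triangle is zero, so no position can look at tokens that come after it.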

<br>

### Sequence-to-sequence Models

* Encoder-Decoder models use both parts of the Transformer architecture.
* How they work:
    - the Encoder takes care of understanding the input sequence
    - the numerical representation produced by the Encoder (the meaning of the sequence) is passed to the Decoder, along with a token that starts the output sequence
    - the Decoder takes care of generating a sequence according to the understanding of the Encoder
    - the Decoder sequentially decodes the Encoder's output, one 'word' at a time
    - the Decoder works in an auto-regressive way, adding tokens until a stop token is encountered
* Best for tasks that revolve around generating new text that depends on a given input:
    - summarization
    - translation
    - generative question answering
* Examples: 
    - BART
    - Marian
    - T5
* When to use an Encoder-Decoder Model?
    - for Sequence-to-Sequence tasks: translation, summarization, many-to-many
    - the weights between Encoder & Decoder are not necessarily shared
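
The encode-once, decode-step-by-step loop described above can be sketched with stand-ins for the two halves. Everything here is hypothetical: the "encoder" just returns the tokens, and the "decoder" is a dictionary lookup instead of a trained network. Only the control flow (start token, auto-regressive loop, stop token) is the point.

```python
def encode(source_tokens):
    """Stand-in encoder: the 'representation' is just the token list itself."""
    return list(source_tokens)

# Hypothetical word-level lookup standing in for a trained decoder.
TOY_DICTIONARY = {"ce": "this", "cours": "course", "est": "is", "super": "great"}

def decode_step(encoder_output, generated):
    """Stand-in decoder: predict the next token from the encoder output
    and the tokens generated so far."""
    position = len(generated) - 1  # number of tokens emitted after <s>
    if position >= len(encoder_output):
        return "</s>"
    return TOY_DICTIONARY.get(encoder_output[position], "<unk>")

def generate(source_tokens, max_new_tokens=20):
    encoder_output = encode(source_tokens)  # the encoder runs once
    generated = ["<s>"]                     # start-of-sequence token
    for _ in range(max_new_tokens):
        next_token = decode_step(encoder_output, generated)
        if next_token == "</s>":            # stop token ends generation
            break
        generated.append(next_token)        # feed the output back in
    return generated[1:]

print(generate("ce cours est super".split()))
```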

<br>

### Bias and limitations

To enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet. Therefore, it is important to keep in mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won't make this intrinsic bias disappear.  

Possible sources of bias observed in a model:
* The model is a fine-tuned version of a pretrained model and picked up its bias from it
* The data the model was trained on is biased
* The metric the model was optimized for is biased

<br>

### Summary

|     **Model**     |                **Examples**                |                          **Tasks**                          |
|:-----------------:|:------------------------------------------:|:-----------------------------------------------------------:|
|      Encoder      | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, NER, Extractive question answering |
|      Decoder      |      CTRL, GPT, GPT-2, Transformer XL      |                       Text generation                       |
|  Encoder-Decoder  |          BART, T5, Marian, mBART           |  Summarization, Translation, Generative question answering  |

<br>

### End of Chapter Quiz

In [31]:
ner = pipeline( 'ner', grouped_entities=True )
ner( 'My name is Sylvian and I work at Hugging Face in Brooklyn' )

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'entity_group': 'PER',
  'score': 0.99876773,
  'word': 'Sylvian',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9672198,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9846444,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [33]:
filler = pipeline("fill-mask", model="bert-base-cased")
result = filler("This [MASK] has been waiting for you.")
result

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': 'This man has been waiting for you.',
  'score': 0.09050368517637253,
  'token': 1299,
  'token_str': 'man'},
 {'sequence': 'This place has been waiting for you.',
  'score': 0.07195132970809937,
  'token': 1282,
  'token_str': 'place'},
 {'sequence': 'This world has been waiting for you.',
  'score': 0.05576983094215393,
  'token': 1362,
  'token_str': 'world'},
 {'sequence': 'This one has been waiting for you.',
  'score': 0.04573335126042366,
  'token': 1141,
  'token_str': 'one'},
 {'sequence': 'This woman has been waiting for you.',
  'score': 0.03509223833680153,
  'token': 1590,
  'token_str': 'woman'}]

In [34]:
classifier = pipeline("zero-shot-classification")
result = classifier("This is a course about the Transformers library",
                   candidate_labels = ['education', 'science', 'politics', 'entertainment'])
result

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'sequence': 'This is a course about the Transformers library',
 'labels': ['entertainment', 'education', 'science', 'politics'],
 'scores': [0.4878741204738617,
  0.4179564118385315,
  0.07267913222312927,
  0.021490389481186867]}

<br>

## Using 🤗 Transformers

Transformers are hard. How can their API be so simple?  

<br>

### Introduction

The 🤗 Transformers library: a single API through which any Transformer model can be loaded, trained, and saved.
* **Easy to use** - download, load, use
* **Flexibility** - models are either `PyTorch nn.Module` or `TensorFlow tf.keras.Model` classes
* **Simplicity** - "All in one file"

<br>

### Behind the Pipeline

Let's review the following example:

In [35]:
classifier = pipeline( "sentiment-analysis" )
classifier( [ "I've been waiting for a HuggingFace course my whole life",
            "I hat this so much!" ] )

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.951606810092926},
 {'label': 'POSITIVE', 'score': 0.999339759349823}]

<br>

the `pipeline` groups together three steps:  
1. preprocessing (tokenization)
2. passing the inputs through the model
3. postprocessing

<img src="https://huggingface.co/course/static/chapter2/full_nlp_pipeline.png" style="margin-left:auto; margin-right:auto">

<br>

Let's review these terms...

#### (1) Preprocessing with a tokenizer

**tokenization** - converts text inputs into numbers that the model can interpret:
* split the text into **tokens** - words, subwords or symbols
* map each token to an integer
* add additional inputs that may be useful to the model
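
A toy version of these steps, assuming a tiny hand-written vocabulary and simple word/punctuation splitting. Real tokenizers use the subword vocabulary that ships with the checkpoint; `TOY_VOCAB`, `toy_tokenize` and `toy_encode` are made up for this sketch, and the special tokens stand in for the "additional inputs" step.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer loads its vocabulary from the checkpoint.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "this": 6, "course": 7, "!": 8}

def toy_tokenize(text):
    """Split lowercased text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Add special tokens, then map every token to its integer id."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]

print(toy_tokenize("I love this course!"))
print(toy_encode("I love this course!"))
```

Words missing from the vocabulary fall back to `[UNK]`, which is one reason the tokenizer has to match the one used when the model was pretrained.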

Preprocessing needs to be done the same way as when the model was pretrained.  
The `from_pretrained()` method of `AutoTokenizer` loads the tokenizer that matches a given checkpoint:

In [37]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# "distilbert-base-uncased-finetuned-sst-2-english" is the default model for 'sentiment-analysis'
tokenizer = AutoTokenizer.from_pretrained( checkpoint )
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

<br>

Pass sentences to the tokenizer and get back a dictionary of tensors. Transformer models expect tensors as input; `return_tensors` selects the tensor type (PyTorch, TensorFlow, or NumPy).

In [40]:
raw_inputs = [ "I've been waiting for a HuggingFace course my whole life",
            "I hat this so much!" ]
inputs = tokenizer( raw_inputs, padding = True, truncation = True, return_tensors = 'pt' )
print( inputs )
print( inputs.keys() )

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,   102],
        [  101,  1045,  6045,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}
dict_keys(['input_ids', 'attention_mask'])


<br>

The output is a dictionary.  
* **input_ids** - unique integer identifiers of the tokens in each sentence (the trailing `0`s in the second row are padding)
* **attention_mask** - tells the model which tokens to 'pay attention' to: `1` for real tokens, `0` for padding


#### (2) Passing Inputs Through the Model

Download a pretrained model:

In [41]:
from transformers import AutoModel

# use the same checkpoint
model = AutoModel.from_pretrained( checkpoint )
print( model )

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

<br>

When given some inputs, this model architecture returns the **hidden states**, or features: for each input token, a high-dimensional vector representing the model's contextual understanding of that token. Hidden states can be used as inputs to a different model layer, the head, to perform specific NLP tasks.  

Hidden State Dimensions
* **batch size** - number of sequences processed at the same time
* **sequence length** - number of tokens in each (padded) sequence
* **hidden size** - vector dimension of each token's representation

In [48]:
outputs = model( **inputs )
print( outputs.last_hidden_state.shape )

torch.Size([2, 15, 768])


<br>

The hidden states are passed to the model **head**.  
Model heads - take the high-dimensional hidden-state vectors and project them onto a different dimension for output.  
For example, in the following classification example, the output has shape (number of sequences, number of classification labels):

In [51]:
from transformers import AutoModelForSequenceClassification

# using the same checkpoint
model = AutoModelForSequenceClassification.from_pretrained( checkpoint )
outputs = model( **inputs )
print( outputs )
print( 'Output shape:', outputs.logits.shape )

SequenceClassifierOutput(loss=None, logits=tensor([[-1.4683,  1.5105],
        [-3.4842,  3.8381]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
Output shape: torch.Size([2, 2])
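
The projection a head performs can be pictured as a linear layer from the hidden size to the number of labels. Below is a minimal sketch in plain Python, assuming tiny made-up dimensions (hidden size 4, 2 labels) and made-up weights. The real DistilBERT classification head also contains a `pre_classifier` layer (visible in the weight names of the warning above), so this shows the idea rather than the actual head.

```python
# Hypothetical sizes and weights, chosen only to make the shapes visible.
hidden_size, num_labels = 4, 2

# one pooled hidden vector per sequence: shape (batch_size=2, hidden_size=4)
pooled = [[1.0, 0.0, 2.0, -1.0],
          [0.5, 0.5, 0.0, 1.0]]

# head parameters: weight has shape (num_labels, hidden_size), bias has shape (num_labels,)
weight = [[0.1, 0.2, 0.3, 0.4],
          [-0.1, 0.0, 0.2, 0.5]]
bias = [0.0, 0.1]

def linear_head(batch):
    """Project each hidden vector onto num_labels scores (logits)."""
    return [[sum(w * x for w, x in zip(row, vec)) + b
             for row, b in zip(weight, bias)]
            for vec in batch]

logits = linear_head(pooled)
print(len(logits), len(logits[0]))  # batch_size, num_labels
```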


<br>

#### (3) Postprocessing the Output

Values that come out of the model head do not make sense to a hooman:

In [52]:
print( outputs.logits )

tensor([[-1.4683,  1.5105],
        [-3.4842,  3.8381]], grad_fn=<AddmmBackward>)


<br>

The outputs are **logits** - the raw unnormalized score from the last model layer, not probabilities.  
To return a probability, the logits need to be passed through a **Softmax** layer:

In [54]:
import torch

predictions = torch.nn.functional.softmax( outputs.logits, dim=-1 )
print( predictions )

tensor([[4.8393e-02, 9.5161e-01],
        [6.6023e-04, 9.9934e-01]], grad_fn=<SoftmaxBackward>)
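
To see what the softmax does, the same computation can be written in plain Python. This is a sketch for illustration: `softmax` here is a hand-written helper, not the PyTorch function, and the logits are copied (rounded to four decimals) from the output above, so the results should only match the tensor above approximately.

```python
import math

def softmax(scores):
    """Turn a list of logits into probabilities that sum to 1."""
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# logits copied from the model output above
logits = [[-1.4683, 1.5105],
          [-3.4842, 3.8381]]

probs = [softmax(row) for row in logits]
for row in probs:
    print(row)
```

Each row is normalized on its own, which is what `dim=-1` asks for in the PyTorch call.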


<br>

Now to return the predicted label:

In [55]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [58]:
raw_inputs = [ "Let me be clear, this is an animated series and NOT anime. Disclaimer (I enjoyed R. Armitage's brilliant performance in North and South). As for Castlevania, I've tried to watch it and in my desperation for some distraction may even watch season 3, which I hear may be promising. But each episode in Seasons 1 and 2 have been like a punishment. So incredibly dull! The only interesting character is Alucard, the beautiful dhampir son of Dracula. Why couldn't this series be more interesting? There's so much potential to make it a truly livening experience. The pacing is awful—each scene drags on with the sloppiness of a messy effort at storytelling. It's too dialog heavy, wit sounds like farce, fights look like a Saturday morning special. What happened here is a miserable attempt at animating a video game series to draw in a new audience who may be blown away by the animation style but have yet to experience the true genius of some fantastic Japanese anime gems.",
             "Season 1 was awesome, season 2 was AWESOME. Then came season 3 and again Netflix ruined what could have been a series for years. Castlevania is a game, one of the most brutal games you could play. It has material to make stories until we die. If I wanted to watch porno there are sites for that. If I wanted to follow an agenda (lgtb whatever), for that we have Disney. Stop damaging and ruined series the way you are doing it Netflix. Hate what you did to one of the main characters of the series.",
             "Incredible show. As someone who is usually not into Anime or at least incredibly picky about the shows I actually do watch to the very end instead of dropping them after the first few episodes, I have to say that this adaptation is likely going to be the best we're going to see of the castlevania franchise for a long long time.",
             "This is by far my favorite animated series ever.  I absolutely cannot get enough.  I think I must have watched it 6 times over.  The characters are sensationally well rounded.  Trevor and Sypha*, Alucard(Dracula's half human son), Issac is especially interesting.  The action is simply on a level of its own.  A good example of this would be Season 2 Episode 7.  It is incredible.  The studio that made this also made 'Blood of Zeus' which is on Netflix.  Blood of Z is not quite as good as Castlevania but still worth while.  I cannot wait for season 4 and hopefully many more seasons after that of Castlevania.  Its the best.",
             "The show's writing and dialogue are intelligent and witty, and the voice cast is great. But, sadly, by the end of season one, you realize the show succumbs to lambasting male characters as too many other Netflix shows have trended (like the horrific new outing of She-Man, uh, He-Man Mistress of the Universe). Why is this wrong? Well, let's substitute any other gender or race into the category of who gets lambasted or belittled, and see how that goes. If you can ignore the discrimination, it's actually a good show with an interesting story and great dialogue.",
             "Although overall anime is good. It feels like something is missing. I think this series feels more like a animated version of a Hollywood movie (I think it should be called animated series not Anime)  than Japanese style Anime. If you want deep and profound Anime then this is totally not for you."]
inputs = tokenizer( raw_inputs, padding = True, truncation = True, return_tensors = 'pt' )
model = AutoModelForSequenceClassification.from_pretrained( checkpoint )
outputs = model( **inputs )
predictions = torch.nn.functional.softmax( outputs.logits, dim=-1 )
print(predictions)
print( model.config.id2label )

tensor([[9.9936e-01, 6.4109e-04],
        [9.9908e-01, 9.2453e-04],
        [3.7674e-04, 9.9962e-01],
        [1.9647e-04, 9.9980e-01],
        [8.8015e-03, 9.9120e-01],
        [9.9797e-01, 2.0295e-03]], grad_fn=<SoftmaxBackward>)
{0: 'NEGATIVE', 1: 'POSITIVE'}
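
Turning these probabilities into labels is an argmax followed by a lookup in `id2label`. A minimal sketch in plain Python, with the probabilities copied from the output above; `to_labels` is a hypothetical helper, not a library function.

```python
# probabilities copied from the output above
predictions = [[9.9936e-01, 6.4109e-04],
               [9.9908e-01, 9.2453e-04],
               [3.7674e-04, 9.9962e-01],
               [1.9647e-04, 9.9980e-01],
               [8.8015e-03, 9.9120e-01],
               [9.9797e-01, 2.0295e-03]]
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}

def to_labels(probs, mapping):
    """Pick the most probable class per row and look up its label."""
    results = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
        results.append({"label": mapping[best], "score": row[best]})
    return results

labels = to_labels(predictions, id2label)
for item in labels:
    print(item)
```

The result has the same `label` / `score` layout that the `pipeline` returned at the start of this section.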


<br>

### Models

Creating and using a model  
Instantiating an existing model from a checkpoint


#### Creating a Transformer

Here we will create a BERT model from the default configuration. This initializes the model's weights with random values.

In [59]:
# initialize a BERT model
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel( config )
print( config )

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



<br>

The configuration holds the attributes used to build the model (e.g. `hidden_size`, `num_hidden_layers`).  
Because the model was created from a configuration alone, its weights are randomly initialized, $\therefore$ the model will output gibberish because it is untrained.  

Alternatively, we can load pretrained models...

In [61]:
from transformers import BertModel

model = BertModel.from_pretrained( 'bert-base-cased' )

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<br>

The model is now initialized with the weights of the pretrained BERT model


#### Saving Methods

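Models can be saved to a folder with `save_pretrained()`. In the library version used here (4.11.3), it writes a `config.json` file holding the architecture attributes and a `pytorch_model.bin` file holding the state dictionary of the model's weights.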

In [None]:
# saving a model
model.save_pretrained( "place_to_call_home" )

<br>

#### Using a Transformer Model for Inference

Transformer models can only process numbers - numbers that the tokenizer generates

<br>

### Tokenizers

**tokenizers** - translate text into numerical data that can be processed by the model. A balancing act: find the representation that conveys the most meaning to the model while being as compact as possible.  

Types of Tokenization
* **Word-based** 
    - split raw text into words and find a numeric representation 
    - **splitting text** - split on whitespace, special rules for punctuation
    - **vocabulary** - the set of distinct tokens present in a corpus; its size is the number of distinct tokens
    - **corpus** - a body of text
    - each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary minus one. These numeric IDs are used by the model
    - if there are words that are not recognized from the vocabulary, they get assigned an 'unknown' token (e.g. [UNK])
    - **[UNK]** - out of vocabulary words result in a loss of information
* **Character-based**
    - split text into characters rather than words
    - a much smaller vocabulary
    - fewer [UNK]
    - results in comparatively longer token sequences, which limits how much context fits into the model's input
    - a single character also carries less meaning than a word (depending on the language)
* **Subword tokenization**
    - frequently used words should not be split up into smaller words, but rare words should be decomposed into meaningful subwords. ex: annoyingly = annoying + ly
    - ends up providing a lot of semantic meaning
    - allows us relatively good coverage: small vocabularies with fewer unknown tokens
    - EX: WordPiece, BPE, Unigram
* **Others...**
    - Byte-level BPE (as used in GPT-2)
    - WordPiece (as used in BERT)
    - SentencePiece or Unigram (as used in several multilingual models)
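
The trade-off between word-based and character-based tokenization can be tried out on a toy corpus. The sketch below is plain Python with made-up helpers (`build_vocab`, `encode`) and a two-sentence corpus invented for illustration; it is not how any library tokenizer is implemented.

```python
corpus = ["the cat sat", "the dog sat"]

def build_vocab(tokenized_corpus):
    """Assign consecutive IDs to distinct tokens, reserving 0 for [UNK]."""
    vocab = {"[UNK]": 0}
    for tokens in tokenized_corpus:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Map tokens to IDs; out-of-vocabulary tokens become [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

word_vocab = build_vocab(s.split() for s in corpus)   # word-based: split on whitespace
char_vocab = build_vocab(list(s) for s in corpus)     # character-based: one token per character

sentence = "the cat ate"                               # "ate" never appears in the corpus
word_ids = encode(sentence.split(), word_vocab)
char_ids = encode(list(sentence), char_vocab)

print(word_ids)   # short, but the unseen word collapses to [UNK] (ID 0)
print(char_ids)   # no [UNK], but a much longer sequence
```

On a corpus this tiny the character vocabulary is not smaller than the word vocabulary; that advantage only shows up on realistic amounts of text.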
    
#### Getting Started with the API

Loading and saving a tokenizer works the same way as loading and saving a model: use `.from_pretrained('<checkpoint>')` to load and `.save_pretrained('<directory>')` to save.

* **Encoding** - translating text into input IDs, in two steps:
    - tokenization: split the text using the same rules that were used when the model was pretrained
    - conversion to input IDs: map the tokens to numbers using the same vocabulary as for pretraining
* **Tokenization** - use the `tokenize()` method

In [63]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained( 'bert-base-cased' )
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize( sequence )
tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
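
The `##` prefix marks a piece that continues the previous token. A simplified sketch of the greedy longest-match-first idea behind WordPiece splitting is shown below. It assumes a hypothetical mini-vocabulary (`TOY_VOCAB`) and ignores punctuation, casing and normalization, so it is an illustration rather than the library's implementation.

```python
# Hypothetical mini-vocabulary; the real bert-base-cased vocabulary is far larger.
TOY_VOCAB = {"Using", "a", "Trans", "##former", "network", "is", "simple", "[UNK]"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:                 # try the longest remaining piece first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def toy_tokenize(text, vocab):
    """Split on whitespace, then split each word into pieces."""
    return [piece for word in text.split() for piece in wordpiece(word, vocab)]

tokens = toy_tokenize("Using a Transformer network is simple", TOY_VOCAB)
print(tokens)
```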

<br>

* **From tokens to input IDs** use the `convert_tokens_to_ids( tokens )` method.

In [64]:
ids = tokenizer.convert_tokens_to_ids( tokens )
print( ids )

[7993, 170, 13809, 23763, 2443, 1110, 3014]


In [67]:
tokens = tokenizer.tokenize( raw_inputs[0] )
ids = tokenizer.convert_tokens_to_ids( tokens )
print( ids )

[2421, 1143, 1129, 2330, 117, 1142, 1110, 1126, 6608, 1326, 1105, 24819, 1942, 9499, 119, 14856, 20737, 4027, 113, 146, 4927, 155, 119, 24446, 5168, 2176, 112, 188, 8431, 2099, 1107, 1456, 1105, 1375, 114, 119, 1249, 1111, 3856, 25373, 1161, 117, 146, 112, 1396, 1793, 1106, 2824, 1122, 1105, 1107, 1139, 17398, 1111, 1199, 15879, 1336, 1256, 2824, 1265, 124, 117, 1134, 146, 2100, 1336, 1129, 10480, 119, 1252, 1296, 2004, 1107, 19939, 122, 1105, 123, 1138, 1151, 1176, 170, 7703, 119, 1573, 12170, 10884, 106, 1109, 1178, 5426, 1959, 1110, 2586, 23315, 2956, 117, 1103, 2712, 173, 2522, 8508, 1197, 1488, 1104, 20128, 119, 2009, 1577, 112, 189, 1142, 1326, 1129, 1167, 5426, 136, 1247, 112, 188, 1177, 1277, 3209, 1106, 1294, 1122, 170, 5098, 1686, 3381, 2541, 119, 1109, 17218, 1110, 9684, 783, 1296, 2741, 8194, 1116, 1113, 1114, 1103, 188, 13200, 18351, 3954, 1104, 170, 20549, 3098, 1120, 25514, 119, 1135, 112, 188, 1315, 17693, 8032, 2302, 117, 20787, 3807, 1176, 1677, 2093, 117, 9718, 1440,

<br>

* **Decoding** - going the other way: from vocabulary indices back to a string. Use the `decode()` method.  
The `decode()` method not only converts the indices back to tokens, but also regroups tokens that were split from the same word
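
The regrouping step can be pictured with a few lines of plain Python. `regroup` is a hypothetical helper that only handles the `##` continuation marker; the real `decode()` also maps IDs back to tokens and cleans up spacing around punctuation.

```python
def regroup(tokens):
    """Join WordPiece tokens back into words: '##' marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]           # glue the piece onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

text = regroup(['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple'])
print(text)
```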

In [69]:
decoded_string = tokenizer.decode(ids)
decoded_string

"Let me be clear, this is an animated series and NOT anime. Disclaimer ( I enjoyed R. Armitage's brilliant performance in North and South ). As for Castlevania, I've tried to watch it and in my desperation for some distraction may even watch season 3, which I hear may be promising. But each episode in Seasons 1 and 2 have been like a punishment. So incredibly dull! The only interesting character is Alucard, the beautiful dhampir son of Dracula. Why couldn't this series be more interesting? There's so much potential to make it a truly livening experience. The pacing is awful — each scene drags on with the sloppiness of a messy effort at storytelling. It's too dialog heavy, wit sounds like farce, fights look like a Saturday morning special. What happened here is a miserable attempt at animating a video game series to draw in a new audience who may be blown away by the animation style but have yet to experience the true genius of some fantastic Japanese anime gems."

<br>

### Handling Multiple Sequences

#### Models Expect Batches of Inputs

Tokenizers translate text into sequences of numbers. Let's cast the numbers to a tensor and pass it to the model; as the error shows, this fails:

In [70]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained( checkpoint )
model = AutoModelForSequenceClassification.from_pretrained( checkpoint )

sequence = "I've been waiting for a HuggingFace course my whole life"

tokens = tokenizer.tokenize( sequence )
ids = tokenizer.convert_tokens_to_ids( tokens )
input_ids = torch.tensor( ids )
model( input_ids )

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

<br>

**Note**: 🤗 Transformers models expect multiple sequences (a batch) by default, so a single sequence still needs a batch dimension.  
Add a new dimension to the tensor:

In [71]:
input_ids = torch.tensor([ids])
print( "Input IDs:", input_ids )

output = model( input_ids )
print( "Logits:", output.logits )

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166]])
Logits: tensor([[-3.1398,  3.3515]], grad_fn=<AddmmBackward>)


<br>

**Batching** - the act of sending multiple sentences through the model all at once

In [73]:
batched_ids = [ids, ids]
input_ids = torch.tensor(batched_ids)
print( "Input IDs:", input_ids )

output = model( input_ids )
print( "Logits:", output.logits )

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166]])
Logits: tensor([[-3.1398,  3.3515],
        [-3.1398,  3.3515]], grad_fn=<AddmmBackward>)
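
The batch dimension can be checked without a model. In this small sketch, `shape` is a hypothetical helper (not part of any library) that mimics `tensor.shape` for rectangular nested lists; the IDs are copied from the output above.

```python
def shape(nested):
    """Return the shape of a rectangular nested list, like tensor.shape."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166]

print(shape(ids))          # a single sequence: no batch dimension
print(shape([ids]))        # a batch containing one sequence
print(shape([ids, ids]))   # a batch containing two sequences
```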


<br>

#### Padding the Inputs

**padding token** - a special token appended to shorter sequences so that every sequence in a batch has the same length. Tensors must be rectangular, so equal lengths are required before sequences can be batched and processed together by the linear algebra routines running under the hood.  

In [75]:
print( tokenizer.pad_token_id )

0
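
Padding by hand is a one-liner per sequence. A minimal sketch, assuming right-side padding (the tokenizer above reports `padding_side='right'`) and the pad ID `0` printed above; `pad_batch` is a hypothetical helper, not a library function.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [[200, 200, 200], [200, 200]]
padded = pad_batch(batch)
print(padded)
```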


<br>

Send two sequences through the model individually, then compare with sending them through as one padded batch:

In [76]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [[200, 200, 200],
              [200, 200, tokenizer.pad_token_id]]

print( model( torch.tensor( sequence1_ids )).logits )
print( model( torch.tensor( sequence2_ids )).logits )
print( model( torch.tensor( batched_ids )).logits )

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)


<br>

**Observe** - the logits for the second sequence differ between the separate run and the batched run, even though the real tokens are identical. The difference comes from the padding token that was added to make the sequences the same length: the model attends to it like any other token. To tell the model to ignore the padding, we use an attention mask.

#### Attention Masks

**Attention masks** - tensors with exactly the same shape as the input IDs, filled with 1s where real tokens are and 0s where padding tokens are; they tell the model which positions to pay attention to. Using an attention mask ensures that the same sequence with different padding lengths results in the same logits:

In [77]:
attention_mask = [[1, 1, 1],
                 [1, 1, 0]]

print( model( torch.tensor( sequence1_ids )).logits )
print( model( torch.tensor( sequence2_ids )).logits )
print( model( torch.tensor( batched_ids ), 
             attention_mask = torch.tensor( attention_mask )).logits )

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)


<br>

Wonderful! Now we can see that the use of an attention mask yields the same logit values as when the sequences were run through the model independently.  
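
Why does masking restore the original result? One common way to ignore a position is to set its attention score to negative infinity before the softmax, so it receives zero weight. The sketch below shows that idea in plain Python with made-up scores; `masked_softmax` is a hypothetical helper, and the library's own attention layers may differ in detail.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, ignoring positions where mask == 0.

    Assumes at least one position is unmasked.
    """
    # masked positions get -inf, so exp(-inf) == 0 and they receive zero weight
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# made-up attention scores for a 2-token sequence, alone and padded to length 3
alone = masked_softmax([2.0, 1.0], [1, 1])
padded_masked = masked_softmax([2.0, 1.0, 0.5], [1, 1, 0])
padded_unmasked = masked_softmax([2.0, 1.0, 0.5], [1, 1, 1])

print(alone)
print(padded_masked)     # same weights on the real tokens, zero on the padding
print(padded_unmasked)   # padding steals some weight, so the result changes
```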

Another example:

In [98]:
text1 = "I've been waiting for a HuggingFace course my whole life"
text2 = "I hate this so much!"
print( 'text:' )
print( text1 )
print( text2 )
print( '' )

# tokenize the text
tokens1 = tokenizer.tokenize( text1 )
tokens2 = tokenizer.tokenize( text2 )
print( 'tokens:' )
print( tokens1 )
print( tokens2 )
print( '' )

# return the IDs
ids1 = tokenizer.convert_tokens_to_ids( tokens1 )
ids2 = tokenizer.convert_tokens_to_ids( tokens2 )
batched_ids = [ ids1, ids2 ]
print( 'inputs:' )
print( ids1 )
print( ids2 )
print( 'batched inputs:' )
print( batched_ids )
pad_id = tokenizer.pad_token_id
num_pads = len( ids1 ) - len( ids2 )
ids2_pad = ids2.copy()
ids2_pad.extend( [pad_id] * num_pads )  # pad with the tokenizer's pad token instead of a hard-coded 0
padded_batched_ids = [ ids1, ids2_pad ]
print( padded_batched_ids )
print( '' )


# pass the inputs through the model
print( model( torch.tensor( [ids1] )).logits )
print( model( torch.tensor( [ids2] )).logits )
print( model( torch.tensor( padded_batched_ids )).logits )
attention_mask = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]
# the mask must have the same shape as the input IDs: (2, 13)
print( model( torch.tensor( padded_batched_ids ),
             attention_mask = torch.tensor( attention_mask )).logits )

#  check and compare the logits

text:
I've been waiting for a HuggingFace course my whole life
I hate this so much!

tokens:
['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life']
['i', 'hate', 'this', 'so', 'much', '!']

inputs:
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166]
[1045, 5223, 2023, 2061, 2172, 999]
batched inputs:
[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166], [1045, 5223, 2023, 2061, 2172, 999]]
[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166], [1045, 5223, 2023, 2061, 2172, 999, 0, 0, 0, 0, 0, 0, 0]]

tensor([[-3.1398,  3.3515]], grad_fn=<AddmmBackward>)
tensor([[ 3.1931, -2.6685]], grad_fn=<AddmmBackward>)
tensor([[-3.1398,  3.3515],
        [ 2.6743, -2.2346]], grad_fn=<AddmmBackward>)
tensor([[-3.1398,  3.3515],
        [ 3.1931, -2.6685]], grad_fn=<AddmmBackward>)


<br>

#### Longer Sequences

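Most models handle sequences of up to 512 or 1024 tokens and fail on anything longer. For longer sequences, either truncate the input (e.g. `truncation = True` with a `max_length`) or use a model built for long inputs, such as Longformer or LED.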
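
Truncation itself is simple to picture. A minimal sketch in plain Python, assuming the limit counts the two special tokens as well (IDs `101` and `102`, as seen in the tokenizer output earlier) and that `max_length` is at least 2; `truncate_with_specials` is a hypothetical helper, not a library function.

```python
def truncate_with_specials(ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length IDs in total, including [CLS] and [SEP]."""
    body = ids[: max_length - 2]           # leave room for the two special tokens
    return [cls_id] + body + [sep_id]

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166]
short = truncate_with_specials(ids, max_length=10)
print(len(short), short)
```

Everything past the cut is simply dropped, so truncation trades information for a sequence that fits the model.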

<br>

### Putting it all Together

The API can facilitate the process of preprocessing and passing inputs to a model.  
Use the `tokenizer` object:

In [111]:
batched_text = [ text1, text2 ]
print( batched_text )

model_inputs = tokenizer( batched_text )
print( model_inputs )

# padding methods
# pad to the longest
print( tokenizer( batched_text, padding='longest' ) )
# pad to the model's max
print( tokenizer( batched_text, padding='max_length' ) )
# pad to a specified max
print( tokenizer( batched_text, padding='max_length', max_length = 20 ) )
# truncate if sequences exceed max length
print( tokenizer( batched_text, padding='max_length', max_length = 10, truncation = True ) )
# return as pytorch tensors
print( tokenizer( batched_text, padding='longest', return_tensors = "pt" ) ) 
# return as tensorflow tensors
print( tokenizer( batched_text, padding='longest', return_tensors = "tf" ) )
# return as numpy arrays
print( tokenizer( batched_text, padding='longest', return_tensors = "np" ) )

["I've been waiting for a HuggingFace course my whole life", 'I hate this so much!']
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]}
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

<br>

#### From Tokenizer to Model

Once more ...with feeling!

In [113]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# raw_inputs == the Castlevania reviews defined earlier
tokens = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
output

SequenceClassifierOutput(loss=None, logits=tensor([[ 4.0592, -3.2925],
        [ 3.8101, -3.1752],
        [-3.8215,  4.0621],
        [-4.1085,  4.4263],
        [-2.3191,  2.4049],
        [ 3.4054, -2.7925]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

<br>

### Basic Usage Completed!

<br>

### End-of-Chapter Quiz

In [114]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer.tokenize("Hello!")
result

['Hello', '!']