<br>

# 🤗 HuggingFace Course

<br>

In [1]:
import transformers

<br>

## 🤗 Transformer Models

Transformers, why are they so damn cool?

<br>

### Introduction

* Chapters 1-4: introduction to the main concepts of the 🤗 Transformers library
* Chapters 5-8: the basics of 🤗 Datasets and 🤗 Tokenizers, plus some introduction to NLP
* Chapters 9-12: a deeper dive that showcases specialized architectures

<br>

### Natural Language Processing

**NLP**  
A list of common tasks:  
* classifying whole sentences - sentiment, classification (spam), grammar, etc.
* classifying words in a sentence - grammar, POS, entities
* generating text content - filling in blanks, sentence completion
* extracting an answer from text - given a question + context, find the answer
* generating a new sentence from text input - summarizing text, translation

<br>

### Transformers, what can they do?

The 🤗 Transformers library provides the functionality to create and use hosted models. The Model Hub contains many thousands of pretrained models that anyone can download and use.  

#### Working with `pipelines`

`pipelines` connect a model with its necessary preprocessing and postprocessing steps:
1. the text is preprocessed into a format the model can understand
2. the preprocessed inputs are passed to the model
3. the predictions of the model are post-processed, so you can make sense of them
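
The three steps above can be sketched with a toy stand-in. This is only an illustration: the vocabulary, the word-counting "model" and all function names are made up for these notes and are not how the library implements `pipeline`.

```python
import math

# Toy stand-ins for the three pipeline stages; none of this is the real
# Transformers library code.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]

def toy_model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    negative = sum(1.0 for t in token_ids if t == VOCAB["hate"])
    positive = sum(1.0 for t in token_ids if t == VOCAB["love"])
    return [negative, positive]

def postprocess(logits):
    """Step 3: softmax the logits and report the best label with its score."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]

def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))

print(toy_pipeline("I love this course"))
```

The returned list of `{'label': ..., 'score': ...}` dictionaries mirrors the shape of the real sentiment-analysis output shown in the cells that follow.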

In [2]:
from transformers import pipeline

classifier = pipeline( "sentiment-analysis" )
# also available: feature-extraction, fill-mask, ner (named entity recognition), question-answering,
#                 summarization, text-generation, translation, zero-shot-classification
classifier( "I've been waiting for a HuggingFace course since last Spring" )

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.8835864663124084}]

**Note** - the model is downloaded and cached when the `classifier` object is created, $\therefore$ it does not need to be loaded again when the pipeline object is called again

In [3]:
classifier( "I'm really excited to take this HuggingFace course" )

[{'label': 'POSITIVE', 'score': 0.9996492862701416}]

<br>

Taking a look at a few other pipelines:  

#### Zero-Shot Classification

Classifying texts that have not been labeled. You specify which labels to use for the classification, so you don't need to rely on the labels of the pretrained model.

In [4]:
classifier = pipeline( "zero-shot-classification" )
classifier( 
    "These are my notes from the HuggingFace online course",
    candidate_labels = ['education', 'science', 'politics', 'entertainment']
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'sequence': 'These are my notes from the HuggingFace online course',
 'labels': ['education', 'entertainment', 'science', 'politics'],
 'scores': [0.6758780479431152,
  0.21396446228027344,
  0.08056329190731049,
  0.029594212770462036]}

In [6]:
classifier( [ 'NASA’s James Webb Space Telescope team fully deployed its 21-foot, gold-coated primary mirror, successfully completing the final stage of all major spacecraft deployments to prepare for science operations.',
             'The world’s largest and most complex space science telescope will now begin moving its 18 primary mirror segments to align the telescope optics.'],
          candidate_labels = ['education', 'science', 'politics', 'entertainment']
          )

[{'sequence': 'NASA’s James Webb Space Telescope team fully deployed its 21-foot, gold-coated primary mirror, successfully completing the final stage of all major spacecraft deployments to prepare for science operations.',
  'labels': ['science', 'education', 'entertainment', 'politics'],
  'scores': [0.972022294998169,
   0.011234794743359089,
   0.010500448755919933,
   0.006242458242923021]},
 {'sequence': 'The world’s largest and most complex space science telescope will now begin moving its 18 primary mirror segments to align the telescope optics.',
  'labels': ['science', 'entertainment', 'education', 'politics'],
  'scores': [0.9860273599624634,
   0.005517215467989445,
   0.005045033525675535,
   0.0034104110673069954]}]

<br>

#### Text Generation

provide a prompt and the model will auto-complete to generate the remaining text

In [7]:
generator = pipeline( 'text-generation' )
generator( 'In this course, we will teach you how to understand' )

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to understand the principles behind the two-sided nature of human being.'}]

In [8]:
generator( 'In this course, we will teach you how to understand', num_return_sequences = 2, max_length = 15 )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to understand and avoid some common'},
 {'generated_text': 'In this course, we will teach you how to understand the fundamentals of data'}]

In [9]:
generator( 'Seeing Slayer live was the greatest', num_return_sequences = 5, max_length = 25 )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Seeing Slayer live was the greatest battle of the year (and probably of all time). It was so fierce that it was the'},
 {'generated_text': 'Seeing Slayer live was the greatest experience of all." After my introduction, Liu Yao glanced at me. "We will be running'},
 {'generated_text': 'Seeing Slayer live was the greatest moment of their lives, and how they felt at that moment — their life that never ends —'},
 {'generated_text': 'Seeing Slayer live was the greatest fantasy that was left out in the world."\n\n"And by the same token, if'},
 {'generated_text': 'Seeing Slayer live was the greatest accomplishment of her lives.\n\nShe had never used any supernatural powers before and could hardly feel'}]

In [10]:
# specifying a model from the Hub
generator = pipeline('text-generation', model='distilgpt2')
generator( 'Seeing Slayer live was the greatest', num_return_sequences = 5, max_length = 25 )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Seeing Slayer live was the greatest show of all time - in 1995, the television giant put back its plans, but that was'},
 {'generated_text': 'Seeing Slayer live was the greatest day in professional soccer history, with many leagues reaching huge financial heights. To qualify for the World'},
 {'generated_text': 'Seeing Slayer live was the greatest show ever released in Japan.\n\n\n\nThe Japanese version of Slayer premiered at the U'},
 {'generated_text': 'Seeing Slayer live was the greatest success in the world,” said the CEO of the agency.\n\n\n\n\n'},
 {'generated_text': 'Seeing Slayer live was the greatest act of the show!\nThe cast also talked about their experiences in New York, Los Angeles'}]

<br>

#### Mask Filling

fill in blanks in a given text

In [11]:
unmasker = pipeline( 'fill-mask' )
unmasker( 'This course will teach you all about <mask> models', top_k=5 )

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'sequence': 'This course will teach you all about mathematical models',
  'score': 0.19631721079349518,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about building models',
  'score': 0.04449249804019928,
  'token': 745,
  'token_str': ' building'},
 {'sequence': 'This course will teach you all about predictive models',
  'score': 0.039371054619550705,
  'token': 27930,
  'token_str': ' predictive'},
 {'sequence': 'This course will teach you all about role models',
  'score': 0.03575519099831581,
  'token': 774,
  'token_str': ' role'},
 {'sequence': 'This course will teach you all about business models',
  'score': 0.027736075222492218,
  'token': 265,
  'token_str': ' business'}]

In [12]:
unmasker = pipeline( 'fill-mask', model = 'bert-base-cased' )
unmasker( 'This course will teach you all about [MASK] models', top_k=5 )

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': 'This course will teach you all about role models',
  'score': 0.07605545967817307,
  'token': 1648,
  'token_str': 'role'},
 {'sequence': 'This course will teach you all about the models',
  'score': 0.05633450299501419,
  'token': 1103,
  'token_str': 'the'},
 {'sequence': 'This course will teach you all about fashion models',
  'score': 0.04619521275162697,
  'token': 4633,
  'token_str': 'fashion'},
 {'sequence': 'This course will teach you all about computer models',
  'score': 0.030348550528287888,
  'token': 2775,
  'token_str': 'computer'},
 {'sequence': 'This course will teach you all about life models',
  'score': 0.019965192303061485,
  'token': 1297,
  'token_str': 'life'}]

<br>

#### Named Entity Recognition

**NER** - find which parts of an input text correspond to entities

In [14]:
ner = pipeline( "ner", grouped_entities=True )
ner( "Let me be clear on this: My brother is qualified for the position, Adams told CNN’s 'State of the Union' when asked about making his sibling Bernard Williams a deputy NYPD commissioner.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'entity_group': 'PER',
  'score': 0.99946195,
  'word': 'Adams',
  'start': 67,
  'end': 72},
 {'entity_group': 'ORG',
  'score': 0.99552476,
  'word': 'CNN',
  'start': 78,
  'end': 81},
 {'entity_group': 'ORG',
  'score': 0.8375926,
  'word': 'State of the Union',
  'start': 85,
  'end': 103},
 {'entity_group': 'PER',
  'score': 0.99927413,
  'word': 'Bernard Williams',
  'start': 141,
  'end': 157},
 {'entity_group': 'ORG',
  'score': 0.9963703,
  'word': 'NYPD',
  'start': 167,
  'end': 171}]

In [22]:
# POS tagging
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("vblagoje/roberta-base-english-upos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/roberta-base-english-upos")
ner = pipeline( "ner", model = model, tokenizer=tokenizer, grouped_entities=True )
ner( "Let me be clear on this: My brother is qualified for the position, Adams told CNN’s 'State of the Union' when asked about making his sibling Bernard Williams a deputy NYPD commissioner.")

[{'entity_group': 'VERB',
  'score': 0.9986479,
  'word': ' Let',
  'start': 1,
  'end': 3},
 {'entity_group': 'PRON',
  'score': 0.99980384,
  'word': ' me',
  'start': 4,
  'end': 6},
 {'entity_group': 'AUX',
  'score': 0.98620975,
  'word': ' be',
  'start': 7,
  'end': 9},
 {'entity_group': 'ADJ',
  'score': 0.9991356,
  'word': ' clear',
  'start': 10,
  'end': 15},
 {'entity_group': 'ADP',
  'score': 0.99943495,
  'word': ' on',
  'start': 16,
  'end': 18},
 {'entity_group': 'PRON',
  'score': 0.99961954,
  'word': ' this',
  'start': 19,
  'end': 23},
 {'entity_group': 'PUNCT',
  'score': 0.93715745,
  'word': ':',
  'start': 23,
  'end': 24},
 {'entity_group': 'PRON',
  'score': 0.99976885,
  'word': ' My',
  'start': 25,
  'end': 27},
 {'entity_group': 'NOUN',
  'score': 0.9995938,
  'word': ' brother',
  'start': 28,
  'end': 35},
 {'entity_group': 'AUX',
  'score': 0.99777746,
  'word': ' is',
  'start': 36,
  'end': 38},
 {'entity_group': 'ADJ',
  'score': 0.9990982,
  'wor

<br>

#### Question Answering

In [24]:
q_answering = pipeline( 'question-answering' )
q_answering( 
    question = 'Where does Zach work?',
    context = 'Zach is a bike messenger in NYC and lives in Brooklyn'
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.38330498337745667, 'start': 45, 'end': 53, 'answer': 'Brooklyn'}

<br>

#### Summarization

distilling text

In [25]:
summarizer = pipeline( 'summarization' )
summarizer( """The Witcher is a fantasy drama streaming television series created by Lauren Schmidt Hissrich, fan fiction based on the book series of the same name by Polish writer Andrzej Sapkowski. Set on a fictional, medieval-inspired landmass known as "the Continent", The Witcher explores the legend of Geralt of Rivia and Princess Ciri, who are linked to each other by destiny. It stars Henry Cavill, Freya Allan and Anya Chalotra.

The first season consisted of eight episodes and was released on Netflix in its entirety on December 20, 2019. It was based on The Last Wish and Sword of Destiny, which are collections of short stories that precede the main Witcher saga. The second season, consisting of eight episodes, was released on December 17, 2021. In September 2021, Netflix renewed the series for a third season. An animated origin story film, The Witcher: Nightmare of the Wolf, was released on August 23, 2021, while a prequel miniseries, The Witcher: Blood Origin, will be released in 2022.

The story begins by following three main characters: the witcher Geralt of Rivia; crown princess Cirilla of Cintra; and the sorceress Yennefer of Vengerberg, meeting them at different points in time. A witcher is a human trained and magically mutated since childhood to be strong and resilient enough to fight the many monsters that plague The Continent. The first season bounces the viewer back and forth in time, exploring formative events that shape the main characters before eventually merging into a single timeline.

Yennefer and Geralt encounter each other several times across many decades, as both are magic-users with unnaturally long lives. They clash more than once and are on-and-off lovers, both of them longing for more but afraid to admit it.

Meanwhile, Geralt and princess Cirilla are linked by destiny before she is born: for saving her father's life, Geralt elected to be paid by invoking "the Law of Surprise", a custom in which the one who was saved gives his savior the next thing he learns that he possesses, but did NOT know about at the time his life was saved. After promising Geralt the surprise, Cirilla's father Duny learns that his bride is pregnant. Thus, the unborn princess Ciri was bound to the witcher in an unbreakable pact of destiny. Geralt wants nothing to do with a child and goes on his solitary way. After the two finally meet, when the 12-year-old princess has been orphaned by war, the witcher becomes her protector; he must keep her safe and fight against many pursuers to prevent Ciri's powerful magic from being used to destroy the world.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' The Witcher is a fantasy drama based on the book series of the same name by Polish writer Andrzej Sapkowski . Set on a fictional, medieval-inspired landmass known as "the Continent", The Witcher explores the legend of Geralt of Rivia and Princess Ciri, who are linked to each other by destiny . The first season consisted of eight episodes and was released on Netflix in its entirety on December 20, 2019 .'}]

<br>

#### Translation

In [28]:
translator = pipeline( 'translation', model = 'Helsinki-NLP/opus-mt-fr-en' )
translator('Ce cours est produit par HuggingFace')

[{'translation_text': 'This course is produced by HuggingFace'}]

<br>

### How do Transformers work?

Transformers were introduced in June 2017  
They can be broadly grouped into 3 categories: 
* GPT-like - autoregressive
* BERT-like - autoencoding
* BART/T5-like - seq2seq 

These models learn language through **self-supervised learning** - the training objective is computed automatically from the inputs to the model, so no human-labeled data is needed, and the model develops a statistical understanding of the language.  
Pretrained Transformer models are very general and not very useful for specific tasks on their own. However, they can be used as a base and fine-tuned with supervised learning for a specific task == **transfer learning**  
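
A minimal sketch of what "computed automatically from the inputs" can mean: the labels are carved out of the raw text itself. The helper names and the whitespace splitting are assumptions for illustration; real models work on subword tokens.

```python
def causal_lm_pairs(tokens):
    """Next-token prediction: each prefix is an input, the following token is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_index, mask_token="[MASK]"):
    """Masked language modeling: hide one token and keep it as the label."""
    masked = list(tokens)
    label = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, label

# Whitespace splitting is a simplification of real tokenization.
tokens = "my brother is qualified".split()
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)
print(masked_lm_example(tokens, mask_index=2))
```

No human annotation appears anywhere: every (input, label) pair comes from the sentence alone.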


Large Models have a big carbon footprint  
Several things to consider:
* use pretrained models when they are available
* fine-tune instead of training from scratch
* start with smaller experiments and debug
* do a literature review to choose hyperparameter ranges
* random search vs. grid search

[**Machine Learning Emissions Calculator**](https://mlco2.github.io/impact/)  
also codecarbon  

In [30]:
#pip install codecarbon
from codecarbon import EmissionsTracker
help( EmissionsTracker )

Help on class EmissionsTracker in module codecarbon.emissions_tracker:

class EmissionsTracker(BaseEmissionsTracker)
 |  EmissionsTracker(project_name: str = 'codecarbon', measure_power_secs: int = 15, output_dir: str = '.', save_to_file: bool = True, gpu_ids: Union[List, NoneType] = None, emissions_endpoint: Union[str, NoneType] = None, co2_signal_api_token: Union[str, NoneType] = None)
 |  
 |  An online emissions tracker that auto infers geographical location,
 |  using `geojs` API
 |  
 |  Method resolution order:
 |      EmissionsTracker
 |      BaseEmissionsTracker
 |      abc.ABC
 |      builtins.object
 |  
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseEmissionsTracker:
 |  
 |  __init__(self, project_name: str = 'codecarbon', measure_power_secs: int = 15, output_dir: str = '.', save_to_file: bool = True, gpu_ids: Union[List, No

<br>

Sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.  

#### Transfer Learning

Leverage the knowledge gained from training a model on a very large dataset for another task.  
Initialize a model with the same weights as a pretrained model, thus transferring the knowledge.  
**pre-training** - the act of training a model from scratch: the weights are initialized randomly and training starts without any prior knowledge. Pre-training is usually done on very large datasets.  
**fine-tuning** - training done after a model has been pretrained, typically with a comparatively smaller dataset  

Fine-tuning has lower time, data, financial and environmental costs. However, the resulting model inherits any bias present in the pretrained model.  
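
A deliberately tiny sketch of the idea, assuming a one-weight model `y = w * x` and made-up datasets. It only illustrates why starting from pretrained weights helps when the fine-tuning budget is small; it says nothing about real Transformer training.

```python
def train(w, data, steps, lr=0.01):
    """Gradient descent on mean squared error for the one-weight model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# "Pre-training": a larger dataset for a generic task (y = 2x), starting
# from an arbitrary weight (0.0 here).
pretrain_data = [(float(x), 2.0 * x) for x in range(1, 11)]
pretrained_w = train(0.0, pretrain_data, steps=200)

# "Fine-tuning": a few examples from a related task (y = 2.2x), few steps.
finetune_data = [(1.0, 2.2), (2.0, 4.4), (3.0, 6.6)]
fine_tuned_w = train(pretrained_w, finetune_data, steps=5)
from_scratch_w = train(0.0, finetune_data, steps=5)

print("fine-tuned  :", fine_tuned_w, mse(fine_tuned_w, finetune_data))
print("from scratch:", from_scratch_w, mse(from_scratch_w, finetune_data))
```

With the same five update steps, the run that starts from the pretrained weight ends with a lower error than the run that starts from zero.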

#### General Architecture

* **Encoder** - acquires and interprets input: receives input and builds a representation of its features
* **Decoder** - generates output: uses the encoder's representation (features) along with other inputs to generate a target sequence
* Types of Models:
    * **Encoder-only models** - tasks for understanding the input (classification, NER)
    * **Decoder-only models** - generative tasks (text generation)
    * **Encoder-decoder models** - Sequence-to-sequence models; good for generative tasks that require an input (translation, summarization)
* **Attention Layers** - tell the model to pay specific attention to certain words in the sentence it was passed; a word by itself has meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.  
* **Architecture** - the skeleton of the model
* **Checkpoints** - the weights that will be loaded into a given architecture
* **Model** - umbrella term that can mean either the architecture or a checkpoint
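
To make the attention bullet concrete, here is a rough sketch of scaled dot-product self-attention in plain Python. The embeddings are made up, and the learned query/key/value projections of a real attention layer are left out, so treat it as an illustration of "every word looks at every other word" rather than a faithful layer.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Real attention layers first apply learned projections; they are omitted here.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional embeddings for a three-word sentence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one word's attention weights over the whole sentence; the new representation of that word is the weighted mix of all the vectors, which is how context before and after the word ends up in it.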

<br>

### Encoder Models

* Encoder models use only the encoder of a Transformer model
* **feature vector/tensor** - holds the meaning of the word within the text. The encoder outputs a numerical vector representation of each word, with dimensions defined by the architecture of the model. Each representation is contextualized: every word in the initial sequence affects every word's representation
* **"Bi-directional" attention** - the vectorization takes into account tokens both before and after the word
* **auto-encoding models**
* Examples: 
    - ALBERT
    - BERT
    - DistilBERT
    - ELECTRA
    - RoBERTa
* When to use an Encoder Model:
    - sequence classification (sentiment analysis), question answering, masked language modeling
    - NLU: Natural Language Understanding

<br>

### Decoder Models

* Decoder Models use only the Decoder of a Transformer Model
* **Auto-regressive** - at each position, the attention layers can only access the words positioned before it in the sentence
* **feature vector/tensor** - one vector per word, with length determined by the model architecture; differs from Encoder models in its self-attention mechanism
* **masked self-attention** - the tokens on one side of the word (right or left) are masked, so the feature vector is determined by one-directional context
* best suited for text generation tasks
* Examples:
    - CTRL
    - GPT
    - GPT-2
    - Transformer XL
* When to use a Decoder Model:
    - Causal Language Modeling: great at generation tasks, i.e. generating sequences of text
    - NLG: Natural Language Generation
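
Masked self-attention can be sketched as a softmax that simply never sees the hidden positions. This assumes a left-to-right model (tokens to the right are hidden) and uses made-up scores; the function name is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of a square score matrix, hiding positions to the right."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # token i sees tokens 0..i only
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)         # masked positions get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

# Made-up raw attention scores for a four-token sequence (all equal).
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print(row)
```

The weight matrix comes out lower-triangular: the first token attends only to itself, and each later token spreads its attention over itself and the tokens before it.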

<br>

### Sequence-to-sequence Models

* Encoder-Decoder Models use both parts of the Transformer architecture.
* Works by:
    - taking the numerical representation from the Encoder (meaning of the sequence)
    - passing it to the Decoder along with a token that starts the sequence
    - the Decoder then tries to sequentially decode the Encoder's output into a 'word'
    - the Decoder works in an auto-regressive way, adding tokens until a stop is encountered
* Are best for tasks revolving around generation of new text which depends on context: 
    - summarization
    - translation
    - generative question answering
* Examples: 
    - BART
    - Marian
    - T5
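
The "Works by" steps above amount to a decoding loop, sketched here with toy stand-ins. The encoder and decoder functions are hypothetical placeholders (the "task" is just reversing the input), and the `<s>` / `</s>` markers are assumed names for the start and stop tokens.

```python
START, STOP = "<s>", "</s>"

def toy_encoder(source_tokens):
    """Stand-in encoder: the 'representation' is just the reversed source."""
    return list(reversed(source_tokens))

def toy_decoder_step(encoded, generated):
    """Stand-in decoder: choose the next token from the encoder output and the tokens so far."""
    position = len(generated) - 1          # number of tokens emitted after START
    return encoded[position] if position < len(encoded) else STOP

def greedy_decode(source_tokens, max_new_tokens=20):
    encoded = toy_encoder(source_tokens)   # the encoder runs once
    generated = [START]                    # the decoder is primed with a start token
    for _ in range(max_new_tokens):
        next_token = toy_decoder_step(encoded, generated)
        if next_token == STOP:             # stop condition reached
            break
        generated.append(next_token)       # output is fed back in (auto-regressive)
    return generated[1:]

print(greedy_decode(["a", "b", "c"]))
```

The structure is the part to keep in mind: the encoder output is computed once, while the decoder is called repeatedly and always receives everything it has generated so far.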

<br>

### Bias and limitations

<br>

### Summary

<br>

### End of Chapter Quiz

<br>

## Using 🤗 Transformers