# Introduction to Hugging Face

Applying a novel machine learning architecture to a new task can be a complex undertaking, and usually involves the following steps:
1. Implement new code for the model architecture (PyTorch or TensorFlow).
2. Load the pretrained weights from a server if they are available.
3. Preprocess the inputs, pass them through the model, and apply some task-specific postprocessing.
4. Implement dataloaders and define loss functions and optimizers to train the model.

Each of these steps can be time-consuming, and loading pretrained weights can be very hard if the released code is not standardized.

<span style="color:red">Hugging Face comes to the NLP practitioner's rescue. So what is Hugging Face?</span>

Hugging Face is a company that focuses on natural language processing and provides various tools and libraries for working with NLP. The Hugging Face ecosystem consists mainly of two parts:
- A family of libraries
- The Hub

<center><img src="images/HF_hub.PNG" alt="An overview of the Hugging Face ecosystem" width="300"></center>

The libraries provide the code, while the Hub provides the pretrained model weights, datasets, scripts for the evaluation metrics, and more.

## The Hugging Face Hub

Transfer learning is one of the key factors driving the success of transformers because it makes it possible to reuse pretrained models for new tasks. It is therefore crucial to be able to load pretrained models quickly and run experiments with them. The Hugging Face Hub hosts over 20,000 freely available models. As shown in the figure below, there are filters for tasks, datasets, frameworks, and more. This makes experimenting with a wide range of models simple and allows you to focus on the domain-specific parts of your project.

![Hugging Face Hub](images/HF_hub2.PNG)

**Let's dive in!**

# A tour of Transformer Applications with Hugging Face

In [1]:
from transformers import pipeline
import pandas as pd


In [7]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

## 1. Text Classification

In [3]:
model_name = "google-bert/bert-base-uncased"

In [5]:
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
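The log message above recommends naming the model and revision explicitly. A minimal sketch of what that could look like, using the checkpoint name and revision hash printed in the log; the constant names and the `build_classifier` helper are illustrative and not part of the original notebook:

```python
# Checkpoint and revision copied from the log message above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Build a sentiment classifier pinned to an explicit checkpoint and revision."""
    # Imported lazily so the constants can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID, revision=REVISION)
```

Calling `classifier = build_classifier()` should behave like the unpinned call above, but it keeps loading the same weights even if the library's default checkpoint changes later.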


In [8]:
outputs = classifier(text)
pd.DataFrame(outputs)  

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


In [9]:
# Imports
import re, unicodedata, math, random, json, os
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6)
random.seed(42)

# Load the customer review corpus
path = 'data/customer_reviews_fr.txt'
with open(path, 'r', encoding='utf-8') as f:
    docs = [l.strip() for l in f if l.strip()]
len(docs), docs[:3]


(30,
 ['Commande reçue en avance, emballage impeccable. Produit conforme, bon rapport qualité/prix.',
  'Livraison en retard de deux jours et carton abîmé. Le SAV a répondu tardivement.',
  'Excellent aspirateur, silencieux et puissant. Par contre, le manuel est incomplet.'])

In [13]:
docs[12]

'Facture erronée deux mois de suite. Résolution efficace après réclamation écrite.'

In [11]:
outputs = classifier(docs)

In [12]:
outputs

[{'label': 'POSITIVE', 'score': 0.9910628795623779},
 {'label': 'NEGATIVE', 'score': 0.97502201795578},
 {'label': 'POSITIVE', 'score': 0.9994239807128906},
 {'label': 'NEGATIVE', 'score': 0.9751178026199341},
 {'label': 'NEGATIVE', 'score': 0.9502578377723694},
 {'label': 'NEGATIVE', 'score': 0.9681137800216675},
 {'label': 'POSITIVE', 'score': 0.8426457047462463},
 {'label': 'NEGATIVE', 'score': 0.947262167930603},
 {'label': 'POSITIVE', 'score': 0.954017162322998},
 {'label': 'NEGATIVE', 'score': 0.8194940090179443},
 {'label': 'POSITIVE', 'score': 0.996121346950531},
 {'label': 'NEGATIVE', 'score': 0.9794974327087402},
 {'label': 'NEGATIVE', 'score': 0.5416688919067383},
 {'label': 'NEGATIVE', 'score': 0.9821609258651733},
 {'label': 'NEGATIVE', 'score': 0.6248091459274292},
 {'label': 'NEGATIVE', 'score': 0.9579129815101624},
 {'label': 'NEGATIVE', 'score': 0.9486048221588135},
 {'label': 'NEGATIVE', 'score': 0.8725120425224304},
 {'label': 'POSITIVE', 'score': 0.6470173001289368}
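Several of the scores above are close to 0.5. The default checkpoint was fine-tuned on English data (SST-2), so its predictions on French reviews deserve some caution. One possible way to triage a batch is to count the labels and flag low-confidence predictions for manual review. The sketch below uses a hand-picked sample of the predictions above; the 0.7 threshold and the helper name are assumptions chosen for illustration:

```python
from collections import Counter

# A sample of predictions copied from the output above (scores rounded).
predictions = [
    {"label": "POSITIVE", "score": 0.9911},
    {"label": "NEGATIVE", "score": 0.9750},
    {"label": "NEGATIVE", "score": 0.5417},
    {"label": "NEGATIVE", "score": 0.6248},
    {"label": "POSITIVE", "score": 0.6470},
]

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off, tune it for your data


def summarize_predictions(preds, threshold=CONFIDENCE_THRESHOLD):
    """Count labels and collect the positions of low-confidence predictions."""
    counts = Counter(p["label"] for p in preds)
    uncertain = [i for i, p in enumerate(preds) if p["score"] < threshold]
    return counts, uncertain


counts, uncertain = summarize_predictions(predictions)
print(counts)     # label frequencies in the sample
print(uncertain)  # positions (within the sample) that fall below the threshold
```

With the full `outputs` list from the notebook, the returned positions index directly into `docs`, so the flagged reviews can be read by hand.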

For the original complaint letter, the model is confident that the text has a negative sentiment (a score of about 0.90). Let's now take a look at another common task: identifying named entities in text.

## 2. Named Entity Recognition

Predicting the sentiment of customer feedback is a good first step, but you often want to know if the feedback was about a particular item or service.

In [14]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs) 

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.556571 | Mega | 208 | 212 |
| 4 | PER | 0.590255 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775362 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |
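In the output above, "Megatron" and "Decepticons" are each split into two pieces with different labels, and the `##` prefix marks a word-piece continuation. A rough post-processing sketch that glues such pieces back together is shown below. The `merge_subwords` helper and its rule of keeping the label of the more confident piece are assumptions for illustration, not part of the `transformers` API:

```python
# Rows copied from the NER output above.
entities = [
    {"entity_group": "MISC", "score": 0.556571, "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "score": 0.590255, "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "score": 0.669692, "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "score": 0.498349, "word": "##icons", "start": 259, "end": 264},
    {"entity_group": "PER", "score": 0.812096, "word": "Bumblebee", "start": 502, "end": 511},
]


def merge_subwords(ents):
    """Glue '##' continuation pieces onto the preceding, adjacent entity."""
    merged = []
    for ent in ents:
        is_continuation = ent["word"].startswith("##")
        if merged and is_continuation and merged[-1]["end"] == ent["start"]:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]  # drop the '##' marker
            prev["end"] = ent["end"]
            # Keep the label of the more confident piece (a heuristic, not a rule).
            if ent["score"] > prev["score"]:
                prev["entity_group"] = ent["entity_group"]
                prev["score"] = ent["score"]
        else:
            merged.append(dict(ent))  # copy so the input rows stay untouched
    return merged


merged = merge_subwords(entities)
for ent in merged:
    print(ent["word"], ent["entity_group"], ent["start"], ent["end"])
```

The pipeline also accepts other `aggregation_strategy` values, such as `"first"`, which may avoid the split at the source. That option was not tried in this notebook.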


## 3. Question Answering (QA)

In QA, we provide the model with a passage of text called the context, along with a question whose answer we'd like to extract. The model then returns the span of text corresponding to the answer.

In [15]:
print(text)

Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.


In [16]:
reader = pipeline("question-answering")
question = "What items customer order for?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs]) 

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.11571 | 350 | 358 | Megatron |
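The `start` and `end` fields are character offsets into the context, so the answer can be recovered by slicing. Note also that the score here is low (about 0.12) and that the answer names the figure that was received rather than the one that was ordered. A small sketch that checks both points follows; the `MIN_SCORE` threshold is a hypothetical value, not something the pipeline provides:

```python
# The same complaint text used throughout the notebook.
context = (
    "Dear Amazon, last week I ordered an Optimus Prime action figure "
    "from your online store in Germany. Unfortunately, when I opened the package, "
    "I discovered to my horror that I had been sent an action figure of Megatron "
    "instead! As a lifelong enemy of the Decepticons, I hope you can understand my "
    "dilemma. To resolve the issue, I demand an exchange of Megatron for the "
    "Optimus Prime figure I ordered. Enclosed are copies of my records concerning "
    "this purchase. I expect to hear from you soon. Sincerely, Bumblebee."
)

# Result copied from the question-answering output above.
result = {"score": 0.11571, "start": 350, "end": 358, "answer": "Megatron"}

# The answer is an extractive span: slicing the context reproduces it.
span = context[result["start"]:result["end"]]
print(span)

MIN_SCORE = 0.5  # hypothetical threshold for trusting an answer
needs_review = result["score"] < MIN_SCORE
print(needs_review)
```

Rephrasing the question more precisely may yield a different span and score; that was not tested in this notebook.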


With this approach you can read and extract relevant information quickly from a customer's feedback. But what if you get a mountain of long-winded complaints and you don't have the time to read them all? Let's see if a summarization model can help!

## 4. Summarization

The goal of text summarization is to take a long text as input and generate a short version with all the relevant facts. This is a much more complicated task than the previous ones since it requires the model to generate coherent text.

In [17]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your min_length=56 must be inferior than your max_length=45.


In [18]:
print(outputs[0]['summary_text'])

 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.


This summary is not too bad! Although parts of the original text have been copied, the model was able to capture the essence of the problem and correctly identify that **Bumblebee** was the author of the complaint.
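The run above also logged `Your min_length=56 must be inferior than your max_length=45`, because only `max_length` was overridden. One way to avoid that mismatch is to pass both bounds together and validate them first. The `summarize` wrapper and its default values below are assumptions for illustration, and `fake_summarizer` is a stand-in so the sketch runs without downloading a model:

```python
def summarize(summarizer, text, min_length=10, max_length=45):
    """Call a summarization pipeline with consistent length bounds."""
    if min_length >= max_length:
        raise ValueError(
            f"min_length={min_length} must be smaller than max_length={max_length}"
        )
    outputs = summarizer(
        text,
        min_length=min_length,
        max_length=max_length,
        clean_up_tokenization_spaces=True,
    )
    # Strip the leading space that appears in the summary printed above.
    return outputs[0]["summary_text"].strip()


# Stand-in for pipeline("summarization"): truncates instead of summarizing.
def fake_summarizer(text, **kwargs):
    return [{"summary_text": " " + text[: kwargs["max_length"]]}]


print(summarize(fake_summarizer, "Bumblebee received Megatron instead of Optimus Prime."))
```

In the notebook, the same wrapper could be called as `summarize(summarizer, text)` with the real pipeline object.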

But what happens when you get feedback that is in a language you don't understand? You could use DeepL, or you can use your very own transformer to translate it for you!

## 5. Translation

In [19]:
translator = pipeline("translation_en_to_fr")

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [20]:

outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Malheureusement, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead!As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon.


The translation is completely messed up. Let's use an appropriate model for this task.

In [21]:
translator = pipeline("translation_en_to_fr", model="t5-small")

In [22]:
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=400)
print(outputs[0]['translation_text'])



Cher Amazon, la semaine dernière, j'ai commandé une figure d'action Optimus Prime à votre magasin en ligne en Allemagne. Malheureusement, lorsque j'ai ouvert le paquet, j'ai découvert à mon horreur que j'avais reçu une figure d'action de Megatron au lieu d'être envoyée, en tant qu'ennemi de la décepticon, j'espère que vous pouvez comprendre mon     la semaine dernière, j'ai command un Optimus Prime    en ligne en Allemagne, je                                                                          
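The first half of this output reads as French, but the tail repeats fragments and ends in whitespace. This is likely a side effect of `min_length=400`, which forces the model to keep generating past the natural end of the translation. One thing worth trying, sketched below as an untested assumption, is to drop `min_length` and translate sentence by sentence. The helper names and the regex-based splitter are illustrative, and `fake_translator` is a stand-in so the sketch runs without a model:

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_by_sentence(translator, text):
    """Translate each sentence separately and join the results."""
    translated = []
    for sentence in split_sentences(text):
        outputs = translator(sentence, clean_up_tokenization_spaces=True)
        translated.append(outputs[0]["translation_text"].strip())
    return " ".join(translated)


# Stand-in for pipeline("translation_en_to_fr"): tags the input instead of translating.
def fake_translator(sentence, **kwargs):
    return [{"translation_text": f"<fr>{sentence}</fr>"}]


sample = "I ordered an Optimus Prime figure. I received Megatron instead! Please help."
print(translate_by_sentence(fake_translator, sample))
```

With the real `translator` object from the cell above, the call would be `translate_by_sentence(translator, text)`.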


## 6. Text Generation

Let's say you would like to be able to provide faster replies to customer feedback by having access to an autocomplete function. With a text generation model, you can do this as follows:

In [23]:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. If, for example- I didn't get to send Optimus Prime back before 2 a.m., or, to get the picture on the door, when I saw the package being assembled- I could not reach the shipping address within 5 minutes! I also asked for an email address, which you gave me. Here are the specific steps
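The generated text repeats the whole prompt before the new continuation. For an autocomplete feature, only the part after the prompt is useful. Below is a small sketch of a helper that strips the prompt; the helper name and the example continuation are made up for illustration and are not model output:

```python
def extract_completion(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt was truncated or altered.
    return generated_text.strip()


prompt = (
    "Customer service response:\n"
    "Dear Bumblebee, I am sorry to hear that your order was mixed up."
)
# Hypothetical continuation, used only to exercise the helper.
generated = prompt + " We will ship the correct figure today."
print(extract_completion(prompt, generated))
```

The text-generation pipeline also accepts `return_full_text=False` and `max_new_tokens`. These may be a simpler way to get the same effect and to bound the length of the continuation rather than the length of prompt plus continuation.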
