
#**Hugging Face course on NLP**

##Setup

In [None]:
#In a python virtual environment

#mkdir ~/transformers-course
#cd ~/transformers-course

#python -m venv .env
#ls -a
#source .env/bin/activate
#deactivate  # leaves the virtual environment; run it only when you are done

#which python

/usr/local/bin/python


In [None]:
!pip install transformers

In [2]:
import transformers

In [3]:
import pandas as pd

##**1. Transformer models**

In [4]:
from google.colab import userdata
# Read the Hugging Face access token saved as a Colab secret named 'HuggingFace'
HF_TOKEN = userdata.get('HuggingFace')

In [42]:
text = """Ordered a McDonald's hamburger via Fasty delivery app. Disappointed as the order arrived late and cold.
Customer service was unresponsive. This experience was frustrating, especially living near the Main Square of Lima.
Fasty needs to improve their delivery efficiency and customer support for a better experience.
The delay in receiving my order made me rethink using this app for future deliveries. Sincerely, Richard Feynman."""

In [28]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
output = classifier("I've been waiting for a HuggingFace course my whole life.")
pd.DataFrame(output)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.959805 |
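
The warning above recommends naming the model and revision explicitly. A minimal sketch of how that could look, assuming the checkpoint and revision printed in the warning are the ones you want to pin. `PINNED_SENTIMENT` and `build_classifier` are hypothetical names introduced here for illustration, not part of the course code.

```python
# Checkpoint name and revision copied from the warning message printed above.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier(config=PINNED_SENTIMENT):
    """Create a pipeline from an explicit model name and revision."""
    # Imported lazily so the configuration can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(config["task"], model=config["model"], revision=config["revision"])


# classifier = build_classifier()  # downloads the pinned checkpoint on first use
```

Pinning both values should keep results reproducible even if the library's default checkpoint for the task changes later.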


In [32]:
output = classifier(text)
pd.DataFrame(output)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.999011 |


In [12]:
test = classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "It is Monday, I hate it already!"]
)
test

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994075298309326}]

In [14]:
pd.DataFrame(test)

| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.959805 |
| 1 | NEGATIVE | 0.999408 |
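
When a list of sentences is passed in, the results come back in the same order as the inputs, as the output above shows. The sketch below pairs each sentence with its prediction so the rows can be filtered. It runs on the scores copied from the output above rather than calling the model, and `attach_inputs` is a hypothetical helper name.

```python
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "It is Monday, I hate it already!",
]

# Predictions copied from the classifier output shown above.
results = [
    {"label": "POSITIVE", "score": 0.9598048329353333},
    {"label": "NEGATIVE", "score": 0.9994075298309326},
]


def attach_inputs(sentences, results):
    """Merge each input sentence with its prediction, relying on matching order."""
    return [{"text": sentence, **result} for sentence, result in zip(sentences, results)]


rows = attach_inputs(sentences, results)

# Keep only the sentences that were classified as negative.
negatives = [row["text"] for row in rows if row["label"] == "NEGATIVE"]
print(negatives)
```

The merged `rows` list can also be passed to `pd.DataFrame` to get a table with a `text` column next to `label` and `score`.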


Some of the currently available pipelines are:

* feature-extraction (get the vector representation of a text)
* fill-mask
* ner (named entity recognition)
* question-answering
* sentiment-analysis
* summarization
* text-generation
* translation
* zero-shot-classification

###**Zero-shot classification**


In [23]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification')
classifier("I'm learning from HuggingFace in order to improve my NLP skills",
            candidate_labels = ['education', 'politics','health','business']
            )

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': "I'm learning from HuggingFace in order to improve my NLP skills",
 'labels': ['education', 'business', 'health', 'politics'],
 'scores': [0.6614951491355896,
  0.1459779441356659,
  0.1450687199831009,
  0.0474582202732563]}
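
The zero-shot result stores labels and scores as two parallel lists. A small sketch, using the values copied from the output above, that picks the best label and measures how far ahead it is of the runner-up. `rank_labels` is a hypothetical helper name.

```python
# Result copied from the zero-shot classification output above.
result = {
    "sequence": "I'm learning from HuggingFace in order to improve my NLP skills",
    "labels": ["education", "business", "health", "politics"],
    "scores": [
        0.6614951491355896,
        0.1459779441356659,
        0.1450687199831009,
        0.0474582202732563,
    ],
}


def rank_labels(result):
    """Return (label, score) pairs sorted from most to least likely."""
    pairs = zip(result["labels"], result["scores"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_labels(result)
best_label, best_score = ranked[0]

# Gap between the top label and the second one: a rough indicator of how decisive the prediction is.
margin = best_score - ranked[1][1]
print(best_label, round(best_score, 3), round(margin, 3))
```

In this output the four scores add up to roughly 1, so they can be read as a distribution over the candidate labels.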

###**Named Entity Recognition**

In [33]:
ner_tagger = pipeline('ner', aggregation_strategy = 'simple')
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.96121 | McDonald ' s | 10 | 20 |
| 1 | ORG | 0.764648 | Fasty | 35 | 40 |
| 2 | LOC | 0.991929 | Main Square | 199 | 210 |
| 3 | LOC | 0.998869 | Lima | 214 | 218 |
| 4 | ORG | 0.923802 | Fasty | 220 | 225 |
| 5 | PER | 0.925603 | Richard Feynman | 412 | 427 |
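
The `word` column is rebuilt from tokens, which is why the first entity appears as `McDonald ' s`. The `start` and `end` columns are character offsets into the original text, so slicing with them recovers the exact surface form. A minimal sketch using the offsets from the table above. `surface_forms` is a hypothetical helper name, and the review text is rebuilt here only so the block runs on its own.

```python
# Same review text as in the notebook, written with explicit newlines.
text = (
    "Ordered a McDonald's hamburger via Fasty delivery app. "
    "Disappointed as the order arrived late and cold.\n"
    "Customer service was unresponsive. This experience was frustrating, "
    "especially living near the Main Square of Lima.\n"
    "Fasty needs to improve their delivery efficiency and customer support "
    "for a better experience.\n"
    "The delay in receiving my order made me rethink using this app for "
    "future deliveries. Sincerely, Richard Feynman."
)

# (entity_group, start, end) copied from the NER output above.
entities = [
    ("ORG", 10, 20),
    ("ORG", 35, 40),
    ("LOC", 199, 210),
    ("LOC", 214, 218),
    ("ORG", 220, 225),
    ("PER", 412, 427),
]


def surface_forms(text, entities):
    """Slice the original text with each entity's character offsets."""
    return [(group, text[start:end]) for group, start, end in entities]


for group, span in surface_forms(text, entities):
    print(group, repr(span))
```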


###**Question answering**

In [36]:
reader = pipeline('question-answering') #extractive question answering
question = 'What does the customer want?'
output = reader(question = question, context = text)
pd.DataFrame([output])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.493044 | 296 | 313 | better experience |
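
An extractive reader always returns some span of the context, so the score is worth checking before trusting the answer. The sketch below applies a cutoff to the result shown above. The 0.5 threshold and the `confident_answer` helper are assumptions made for illustration, not values recommended by the library.

```python
# Result copied from the question-answering output above.
output = {"score": 0.493044, "start": 296, "end": 313, "answer": "better experience"}


def confident_answer(output, threshold=0.5):
    """Return the answer only if its score reaches the (arbitrary) threshold."""
    if output["score"] >= threshold:
        return output["answer"]
    return None


print(confident_answer(output))                 # score is below the default cutoff
print(confident_answer(output, threshold=0.4))  # a looser cutoff accepts the answer
```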


###**Summarization**

In [43]:
summarizer = pipeline('summarization')
outputs = summarizer(text, max_length = 45, clean_up_tokenization_spaces = True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your min_length=56 must be inferior than your max_length=45.


 Richard Feynman ordered a McDonald's hamburger via Fasty delivery app. The order arrived late and cold, and customer service was unresponsive. The delay in receiving my order made me rethink using this app
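
The warning above appears because the checkpoint's default `min_length` (56) is larger than the requested `max_length` (45). One way to avoid it is to pass both values explicitly. The sketch below derives a `min_length` that stays below `max_length`. The helper name `length_kwargs` and the choice of half the maximum are assumptions, not library defaults.

```python
def length_kwargs(max_length, default_min_length=56):
    """Build length arguments in which min_length never exceeds max_length."""
    # Halving is an arbitrary choice; any value below max_length avoids the warning.
    min_length = min(default_min_length, max_length // 2)
    return {"max_length": max_length, "min_length": min_length}


print(length_kwargs(45))

# Hypothetical usage with the summarizer defined above:
# outputs = summarizer(text, clean_up_tokenization_spaces=True, **length_kwargs(45))
```

Both limits are counted in tokens, which is also why the summary above stops mid-sentence once it reaches `max_length`.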


###**Translation**

In [None]:
!pip install sacremoses

In [57]:
translator = pipeline('translation_en_to_es',
                      model = 'Helsinki-NLP/opus-mt-en-es')
outputs = translator(text, clean_up_tokenization_spaces = True, min_length = 100)
print(outputs[0]['translation_text'])

Pedí una hamburguesa McDonald's a través de la aplicación de entrega Fasty. Decepcionado como el pedido llegó tarde y frío. El servicio al cliente no respondió. Esta experiencia fue frustrante, especialmente viviendo cerca de la Plaza Principal de Lima. Fasty necesita mejorar su eficiencia de entrega y soporte al cliente para una mejor experiencia. El retraso en recibir mi pedido me hizo repensar el uso de esta aplicación para entregas futuras. Sinceramente, Richard Feynman.


###**Text generation**

In [22]:
from transformers import pipeline

generator = pipeline('text-generation')
generator("In this machine learning course, we say a model is supervised when",
          num_return_sequences = 2,
          max_length = 40)

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this machine learning course, we say a model is supervised when the probability of a given condition are zero, and so the probability is an integer. We then assume one of two conditions for the model'},
 {'generated_text': 'In this machine learning course, we say a model is supervised when we can see the results from other models. You can ask a question or try to figure out how a particular system worked. One or'}]

The previous example used the default model, openai-community/gpt2.

Let's try two other models: distilgpt2 and gpt2-medium.

In [20]:
generator = pipeline('text-generation', model = 'distilgpt2')
generator("In this machine learning course, we say a model is supervised when",
          num_return_sequences = 2,
          max_length = 40)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this machine learning course, we say a model is supervised when it is trained correctly. We say that we can compute with the model the value of the model (how is the model related) which'},
 {'generated_text': "In this machine learning course, we say a model is supervised when you try it on your machine. For example, we're not in a state where any of the data is collected, it's just"}]

In [18]:
generator = pipeline('text-generation', model = 'gpt2-medium')
generator("In this machine learning course, we say a model is supervised when",
          num_return_sequences = 2,
          max_length = 40)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this machine learning course, we say a model is supervised when it learns certain behavior, and semi-supervised when the model is able to learn to anticipate other actions that it hasn't already learned"},
 {'generated_text': 'In this machine learning course, we say a model is supervised when its accuracy is good or good enough. We then ask why it was trained in the first place and we give the example of a photo'}]
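
Each `generated_text` above repeats the prompt before the new text. To compare models it can help to keep only the continuation. A minimal sketch using the gpt2-medium outputs copied from above. `continuation` is a hypothetical helper name, and it assumes the pipeline returned the full text, as it did in these runs.

```python
prompt = "In this machine learning course, we say a model is supervised when"

# Outputs copied from the gpt2-medium run above.
outputs = [
    {"generated_text": "In this machine learning course, we say a model is supervised when it learns certain behavior, and semi-supervised when the model is able to learn to anticipate other actions that it hasn't already learned"},
    {"generated_text": "In this machine learning course, we say a model is supervised when its accuracy is good or good enough. We then ask why it was trained in the first place and we give the example of a photo"},
]


def continuation(prompt, generated_text):
    """Drop the prompt from the front of the generated text, if it is there."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


continuations = [continuation(prompt, item["generated_text"]) for item in outputs]
for piece in continuations:
    print(piece)
```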

In [73]:
generator = pipeline('text-generation', model = 'gpt2-medium')
response = "Dear Richard, I am sorry to hear that your order arrived late and cold."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length = 155)
print(outputs[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Ordered a McDonald's hamburger via Fasty delivery app. Disappointed as the order arrived late and cold. 
Customer service was unresponsive. This experience was frustrating, especially living near the Main Square of Lima. 
Fasty needs to improve their delivery efficiency and customer support for a better experience. 
The delay in receiving my order made me rethink using this app for future deliveries. Sincerely, Richard Feynman.

Customer service response:
Dear Richard, I am sorry to hear that your order arrived late and cold. Fasty Delivery did not expect us to deliver a hamburger today in an hour because the wait time for a delivery was longer than expected. We apologize for any inconvenience you may have experienced. We appreciate your patience as
