# Introduction to NLP with Hugging Face and Transformers

Hugging Face provides an open-source ecosystem, including the Transformers library, that allows users to create, train, and deploy machine learning (ML) models. It is also a community for data scientists, researchers, and machine learning engineers to share ideas, get support, and contribute to open-source projects. In this notebook I'll show you how to perform several common NLP tasks with Transformers pipelines.

## Text Classification

Let's take a customer's review of a book and see how to perform text classification on it with Hugging Face.

In [1]:
text = """The 3-star rating is for Amazon not the book. Book arrived on time. Lots of great information. 
Problem? Opened the cover and the first page showing is page 29. 
No table of contents, preface, or the first pages of chapter 1. 
I am using this in conjunction with my Neural Networks class I am taking as part of my masters program and need the book. 
Mind you, I am working on stuff further along than chapter 1 but I would like to have the complete book that I paid for. 
I contacted Amazon customer service to get a replacement and explained 
that I needed the book but was told that I would need to return this one to get a replacement. 
I do not want to go 2 or more weeks without the book so I just went ahead and ordered a new one 
and will return this one once the replacement is received."""

Let's create the text classification pipeline.

In [2]:
from transformers import pipeline
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



Now that we have a pipeline, let’s make some predictions! Each pipeline takes a string of text (or a list of strings) as input and returns a list of predictions. Note that for sentiment analysis the pipeline returns only one of the two labels, POSITIVE or NEGATIVE, since the score of the other label can be inferred by computing `1 - score`.
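To make the `1 - score` remark concrete, here is a minimal sketch that expands a single prediction into scores for both labels. It assumes a binary classifier whose labels are exactly POSITIVE and NEGATIVE; `both_scores` is a hypothetical helper written for this notebook, not part of the Transformers library.

```python
def both_scores(prediction):
    """Expand one binary prediction into scores for both labels.

    `prediction` is a single dict in the format the pipeline returns,
    e.g. {"label": "NEGATIVE", "score": 0.99}.
    """
    labels = ("POSITIVE", "NEGATIVE")  # assumption: exactly these two labels
    if prediction["label"] not in labels:
        raise ValueError(f"unexpected label: {prediction['label']!r}")
    other = labels[1] if prediction["label"] == labels[0] else labels[0]
    return {
        prediction["label"]: prediction["score"],
        other: 1.0 - prediction["score"],  # the complementary score
    }


# Illustrative prediction in the pipeline's output format
scores = both_scores({"label": "NEGATIVE", "score": 0.997864})
print(scores)
```

This only holds for two-class models; for classifiers with more labels the remaining probability is spread over several classes.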


In [3]:
import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)

|   | label | score |
|---|-------|-------|
| 0 | NEGATIVE | 0.997864 |


As you can see, the model is very confident that the text has a negative sentiment, which makes sense: we’re dealing with a complaint from an angry customer! Let’s now take a look at another common task.

## Named Entity Recognition

In NLP, real-world objects like products, places, and people are called named entities, and extracting them from text is called named entity recognition (NER). Let's take a look at NER by loading the corresponding pipeline and feeding our customer review to it. Here I'm going to pass the `aggregation_strategy` argument so that tokens belonging to the same entity are grouped into a single span according to the model’s predictions.

In [4]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



Let's take a look at the named entities in our text.

In [5]:
outputs = ner_tagger(text)
pd.DataFrame(outputs)

|   | entity_group | score | word | start | end |
|---|--------------|-------|------|-------|-----|
| 0 | ORG | 0.947415 | Amazon | 25 | 31 |
| 1 | ORG | 0.837264 | Neural Networks | 266 | 281 |
| 2 | ORG | 0.98908 | Amazon | 484 | 490 |


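As you can see, the pipeline detected the entities in the text and assigned each one a category, here ORG (organization), together with a confidence score and its character offsets. Note that "Neural Networks" is the name of a class rather than an organization, and it also received the lowest score, so the predictions are not perfect.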
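The scores and offsets in this output can be used directly in plain Python. The sketch below copies the three rows from the table above, keeps only the entities above a cutoff, and shows that `start` and `end` index into the input string. The `confident_entities` helper and the 0.9 threshold are assumptions chosen for illustration, not values recommended by the library.

```python
# Rows copied from the NER output shown above
entities = [
    {"entity_group": "ORG", "score": 0.947415, "word": "Amazon", "start": 25, "end": 31},
    {"entity_group": "ORG", "score": 0.837264, "word": "Neural Networks", "start": 266, "end": 281},
    {"entity_group": "ORG", "score": 0.98908, "word": "Amazon", "start": 484, "end": 490},
]


def confident_entities(entities, threshold=0.9):
    """Keep only entities whose score reaches the (hypothetical) threshold."""
    return [e for e in entities if e["score"] >= threshold]


kept = confident_entities(entities)
print([e["word"] for e in kept])  # ['Amazon', 'Amazon']

# start/end are character offsets into the input text
first_sentence = "The 3-star rating is for Amazon not the book."
first = entities[0]
print(first_sentence[first["start"]:first["end"]])  # Amazon
```

With this cutoff the doubtful "Neural Networks" entity is dropped, while both mentions of Amazon are kept.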

## Question Answering

In question answering, we give the model a passage of text known as the context, as well as a question whose answer we want to extract.
The model then returns the text span associated with the answer. Let's take a look at what happens if we ask a specific question about customer feedback: 

In [6]:
reader = pipeline("question-answering")
question = "What does the customer request?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



|   | score | start | end | answer |
|---|-------|-------|-----|--------|
| 0 | 0.558932 | 515 | 528 | a replacement |


The model answers "a replacement", which is indeed what the customer asked for. Note that the pipeline also returns a confidence score and the `start` and `end` integers, which are the character indices of the answer span within the context.
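Because `start` and `end` are character indices, the answer can be recovered by slicing the context, which is handy for highlighting it in the original text. The following is a minimal sketch on a one-sentence excerpt of the review; the offsets are computed with `str.find` purely for illustration, and `snippet` is a hypothetical helper.

```python
context = "I contacted Amazon customer service to get a replacement."

# Build an answer dict in the pipeline's output format (offsets computed here)
answer_text = "a replacement"
start = context.find(answer_text)
output = {"answer": answer_text, "start": start, "end": start + len(answer_text)}

# Slicing the context with start/end gives back the answer span
span = context[output["start"]:output["end"]]
print(span)  # a replacement


def snippet(context, start, end, window=20):
    """Return the answer in brackets with up to `window` characters on each side."""
    left = max(0, start - window)
    right = min(len(context), end + window)
    return context[left:start] + "[" + context[start:end] + "]" + context[end:right]


highlighted = snippet(context, output["start"], output["end"])
print(highlighted)
```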

## Summarization

With text summarization, you can take a long text as input and generate a short version. Let's take a look at this technique.

In [7]:
summarizer = pipeline("summarization")
outputs = summarizer(text, min_length=20, max_length=60, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



 The 3-star rating is for Amazon not the book. Book arrived on time. Lots of great information. Opened the cover and the first page showing is page 29. No table of contents, preface, or the first pages of chapter 1.


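As you can see, the summary captures the core of the complaint: the book starts at page 29 and its front matter is missing. Note, however, that it mostly reuses the opening sentences of the review and leaves out the request for a replacement.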
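One quick way to see how much of a summary is copied from its source is to compare the two sentence by sentence. The sketch below does this for the summary above against the opening of the review. The regex-based `split_sentences` is a naive splitter assumed for illustration only; real text usually needs a proper sentence tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


# Opening of the customer review (excerpt)
review_excerpt = (
    "The 3-star rating is for Amazon not the book. Book arrived on time. "
    "Lots of great information. Problem? "
    "Opened the cover and the first page showing is page 29. "
    "No table of contents, preface, or the first pages of chapter 1."
)

# Summary produced by the pipeline above
summary = (
    "The 3-star rating is for Amazon not the book. Book arrived on time. "
    "Lots of great information. "
    "Opened the cover and the first page showing is page 29. "
    "No table of contents, preface, or the first pages of chapter 1."
)

source_sentences = set(split_sentences(review_excerpt))
summary_sentences = split_sentences(summary)
copied = [s for s in summary_sentences if s in source_sentences]
print(f"{len(copied)} of {len(summary_sentences)} summary sentences are verbatim copies")
```

Here every summary sentence appears verbatim in the review, so the model behaved extractively on this input even though it is able to generate new text.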

## Translation

Translation, like summarization, is a task whose output is generated text. To translate our English text to German, let's use a translation pipeline with the `Helsinki-NLP/opus-mt-en-de` model.

In [8]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])


Die 3-Sterne-Bewertung ist für Amazon nicht das Buch. Buch kam pünktlich. Viele tolle Informationen. Problem? Öffnen Sie das Cover und die erste Seite zeigt ist Seite 29. Kein Inhaltsverzeichnis, Vorwort, oder die ersten Seiten von Kapitel 1. Ich benutze dies in Verbindung mit meiner Neural Networks Klasse Ich nehme als Teil meines Masterprogramms und brauche das Buch. Ich denke, ich arbeite an Sachen weiter als Kapitel 1, aber ich möchte das komplette Buch, das ich bezahlt habe. Ich habe Amazon Kundendienst kontaktiert, um einen Ersatz zu bekommen und erklärte, dass ich das Buch brauchte, aber wurde gesagt, dass ich dieses zurückgeben müsste, um einen Ersatz zu bekommen. Ich will nicht 2 oder mehr Wochen ohne das Buch gehen, also habe ich nur ein neues bestellt und werde dieses zurückgeben, sobald der Ersatz eingegangen ist.


As you can see, the translation is understandable, although not every sentence is idiomatic German. You can find models for many other language pairs on the Hugging Face Hub.

## Text Generation

Assume you want to be able to respond to customer feedback more quickly by having access to an autocomplete function. This is possible with a text generation model: 

In [9]:
generator = pipeline("text-generation")
response = "Dear Customer, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=600)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The 3-star rating is for Amazon not the book. Book arrived on time. Lots of great information. 
Problem? Opened the cover and the first page showing is page 29. 
No table of contents, preface, or the first pages of chapter 1. 
I am using this in conjunction with my Neural Networks class I am taking as part of my masters program and need the book. 
Mind you, I am working on stuff further along than chapter 1 but I would like to have the complete book that I paid for. 
I contacted Amazon customer service to get a replacement and explained 
that I needed the book but was told that I would need to return this one to get a replacement. 
I do not want to go 2 or more weeks without the book so I just went ahead and ordered a new one 
and will return this one once the replacement is received.

Customer service response:
Dear Customer, I am sorry to hear that your order was mixed up. I have ordered multiple books but all was poorly packed with the same single item. Please, please return this bo

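A draft like this can be a starting point for a reply that calms the customer, but as the output shows, it needs review before sending: the generated continuation drifts off topic and stops mid-word.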
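As the output shows, `generated_text` contains the whole prompt followed by the model's continuation. If only the new text is needed, the prompt can be stripped off in plain Python. This is a minimal sketch; `continuation` is a hypothetical helper, and the strings are shortened excerpts of the prompt and output above.

```python
def continuation(prompt, generated_text):
    """Return only the part of the output that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: the output does not start with the prompt, return it unchanged
    return generated_text.strip()


prompt = (
    "Customer service response:\n"
    "Dear Customer, I am sorry to hear that your order was mixed up."
)
# Shortened stand-in for outputs[0]['generated_text']
generated = prompt + " I have ordered multiple books but all was poorly packed with the same single item."

new_text = continuation(prompt, generated)
print(new_text)
```

The text-generation pipeline also accepts `return_full_text=False`, which should give a similar result without the extra step.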

## Conclusion

You've now seen several applications of transformer models. All the models used in this notebook are public and have already been fine-tuned for the task at hand. In general, though, you can also fine-tune models on your own data.
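Most pipelines in this notebook were created without a model name, so the library picked a default and logged it. To make a notebook less dependent on those defaults, the checkpoints can be named explicitly. The sketch below collects the model names from the log lines above; `PINNED_MODELS` and `load_pipeline` are hypothetical names introduced here, and the assumption is that passing `model=` reproduces the defaults seen in this run.

```python
# Checkpoints taken from the "No model was supplied, defaulted to ..." log lines
# and from the translation cell in this notebook
PINNED_MODELS = {
    "text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "question-answering": "distilbert-base-cased-distilled-squad",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "translation_en_to_de": "Helsinki-NLP/opus-mt-en-de",
    "text-generation": "gpt2",
}


def load_pipeline(task, **kwargs):
    """Create a pipeline for `task` with an explicitly named checkpoint."""
    # Imported lazily so that the mapping can be inspected without transformers
    from transformers import pipeline

    return pipeline(task, model=PINNED_MODELS[task], **kwargs)


# Usage (downloads the model on first call):
# ner_tagger = load_pipeline("ner", aggregation_strategy="simple")
print(sorted(PINNED_MODELS))
```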

Thanks for reading. 

Follow us on [Twitter](https://twitter.com/TirendazAcademy) | [Instagram](https://www.instagram.com/tirendazacademy) | [YouTube](https://www.youtube.com/channel/UCFU9Go20p01kC64w-tmFORw) | [Tiktok](https://www.tiktok.com/@tirendazacademy) | [Medium](https://tirendazacademy.medium.com) | [Reddit](https://www.reddit.com/user/TirendazAcademy)