In [None]:
# Based on the Hugging Face Course Introduction: https://huggingface.co/
# Modified by Mehdi Ammi, Univ. Paris 8

# Transformers: Technical Introduction 

This notebook introduces the Hugging Face Transformers library through a series of natural language processing (NLP) tasks solved with pre-trained language models. You will install and verify the library, then use pipelines for sentiment analysis, zero-shot classification, text generation, mask filling, named entity recognition (NER), question answering, text summarization, and translation. By the end of this course, you should be able to apply these pipelines to a wide range of NLP problems.

## Installing Required Libraries

First, we need to install the `transformers` library. This library is developed by Hugging Face and provides a wide range of pre-trained models for NLP tasks.

In [None]:
# Install the transformers library
!pip install transformers

After installing the transformers library, we will check its version to ensure it has been installed correctly.

In [None]:
# Verify the installation of the transformers library
import transformers
print(transformers.__version__)

## Working with pipelines

The most basic object in the Hugging Face Transformers library is the `pipeline()` function.

There are three main steps involved when you pass some text to a pipeline():

 - Text preprocessing,
 - Model prediction,
 - Output post-processing.
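
The last of these steps, output post-processing, can be sketched in plain Python. This is only an illustration, assuming a classification model that emits one raw score (logit) per label. The `postprocess` helper, the logits, and the label order are made up for the example and are not taken from the library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, labels):
    """Hypothetical helper: keep the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Made-up logits for a two-class sentiment model (illustrative values only)
example = postprocess([-2.0, 3.0], ["NEGATIVE", "POSITIVE"])
print(example)
```

The returned dictionary has the same `label` / `score` shape as the pipeline results shown in this notebook.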

We can pass any text directly to it and get an intelligible answer.
By default, a pipeline selects a particular pretrained model for the requested task.

Let's try it!

## Sentiment Analysis Pipeline

This pipeline wraps a pre-configured model that analyzes the sentiment of a given text. With the default model used here, each text is labelled either POSITIVE or NEGATIVE; there is no neutral class, as the outputs below show.

Initialize the sentiment analysis pipeline. This will download the pre-trained model and tokenizer.

In [None]:
# Import pipeline, then initialize the sentiment analysis pipeline
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

In [None]:
# List of texts to analyze
texts = [
    "Nice, I've been waiting for a short HuggingFace course my whole life!",
    "I hate this so much"
]

# Analyze the sentiment of each text
results = classifier(texts)

# Display the results
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")

In [None]:
>>
Text: Nice, I've been waiting for a short HuggingFace course my whole life!
Sentiment: POSITIVE, Score: 0.9979

Text: I hate this so much
Sentiment: NEGATIVE, Score: 0.9995

Each result contains:

- label: The predicted sentiment label (e.g., POSITIVE or NEGATIVE).
- score: The confidence score of the prediction.

Let's break down the results for better understanding.

In [None]:
# Example analysis
example_text = "Nice, I've been waiting for a short HuggingFace course my whole life!"
example_result = classifier(example_text)[0]

# Display the detailed result
print(f"Text: {example_text}")
print(f"Sentiment: {example_result['label']}, Score: {example_result['score']:.4f}")

In [None]:
>>
Text: Nice, I've been waiting for a short HuggingFace course my whole life!
Sentiment: POSITIVE, Score: 0.9979

Feel free to analyze more texts by modifying the texts list and re-running the analysis cell.

In [None]:
# Add more texts to analyze
more_texts = [
    "This is the best movie I have ever seen!",
    "The product quality is terrible and I'm very disappointed.",
    "I'm feeling great today!",
    "It's a gloomy and rainy day."
]

# Analyze the sentiment of each text
more_results = classifier(more_texts)

# Display the results
for text, result in zip(more_texts, more_results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")

In [None]:
>>
Text: This is the best movie I have ever seen!
Sentiment: POSITIVE, Score: 0.9999

Text: The product quality is terrible and I'm very disappointed.
Sentiment: NEGATIVE, Score: 0.9998

Text: I'm feeling great today!
Sentiment: POSITIVE, Score: 0.9999

Text: It's a gloomy and rainy day.
Sentiment: NEGATIVE, Score: 0.9975
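
Since each result is a plain dictionary, predictions are easy to filter after the fact. Here is a minimal sketch, assuming we only want to keep predictions above a confidence threshold. The `confident_only` helper and the 0.999 threshold are hypothetical choices, and the scores are copied (rounded) from the output above rather than recomputed.

```python
# Texts and (rounded) results copied from the output above
sample_texts = [
    "This is the best movie I have ever seen!",
    "The product quality is terrible and I'm very disappointed.",
    "I'm feeling great today!",
    "It's a gloomy and rainy day.",
]
sample_results = [
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "NEGATIVE", "score": 0.9998},
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "NEGATIVE", "score": 0.9975},
]

def confident_only(texts, results, threshold):
    """Hypothetical helper: keep (text, label) pairs whose score reaches the threshold."""
    return [
        (text, result["label"])
        for text, result in zip(texts, results)
        if result["score"] >= threshold
    ]

kept = confident_only(sample_texts, sample_results, threshold=0.999)
for text, label in kept:
    print(f"{label}: {text}")
```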

## Other pipelines
Some of the currently available pipelines are:

- feature-extraction
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

## Zero-shot classification

The zero-shot-classification pipeline is very powerful for tasks where we need to classify texts that haven’t been labelled. It returns probability scores for any list of labels you want!
It's called zero-shot because you don’t need to fine-tune the model on your data to use it.

### Initialize the Classifier

We initialize the classifier using the `pipeline` function and specify `"zero-shot-classification"` as the task.

In [None]:
# Create a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")

### Classify a Sample Text

We will classify the sample text into one of the candidate labels provided.


In [None]:
# Sample text to classify
text = "This is a short course about the Transformers library" 

# List of candidate labels for classification
candidate_labels = ["education", "politics", "business"]

# Perform zero-shot classification
result = classifier(text, candidate_labels=candidate_labels)

# Print the result
print(result)

In [None]:
>>
{
  'sequence': 'This is a short course about the Transformers library',  # The input text
  'labels': ['education', 'business', 'politics'],  # The candidate labels
  'scores': [0.7481650114059448, 0.17828474938869476, 0.07355023920536041]  # The confidence scores for each label
}

### Explanation of Results

The output is a dictionary containing the input sequence, the list of candidate labels, and the corresponding scores.
The scores represent the model's confidence in each label.

In this example, the model has determined that "education" is the most appropriate label for the input text, with a score of about 0.75, well ahead of "business" (0.18) and "politics" (0.07).
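
The result dictionary can be unpacked with ordinary Python. The sketch below reuses the output shown above (copied, not recomputed) and relies on what that output displays: labels and scores are parallel lists, listed from highest to lowest score. The variable names are illustrative.

```python
# Result copied from the output above
zs_result = {
    "sequence": "This is a short course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.7481650114059448, 0.17828474938869476, 0.07355023920536041],
}

# Pair each label with its score; the first pair is the top prediction
ranked = list(zip(zs_result["labels"], zs_result["scores"]))
top_label, top_score = ranked[0]

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")
print(f"Top label: {top_label}")
```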


### Further Exploration

You can experiment with different texts and sets of candidate labels to see how the model performs. 
Try classifying the following texts:

1. "The stock market is showing signs of recovery after a steep decline."
2. "The new policy aims to improve healthcare accessibility for all citizens."

Use candidate labels such as `["finance", "healthcare", "politics"]`.

In [None]:
texts = [
    "The stock market is showing signs of recovery after a steep decline.",  # Example text 1
    "The new policy aims to improve healthcare accessibility for all citizens."  # Example text 2
]

# New set of candidate labels
candidate_labels = ["finance", "healthcare", "politics"]

# Perform zero-shot classification for each text
for text in texts:
    result = classifier(text, candidate_labels=candidate_labels)  
    print(f"Text: {text}")  # Print the input text
    print(f"Classification: {result}\n")  # Print the classification result

In [None]:
>>
Text: The stock market is showing signs of recovery after a steep decline.
Classification: {'sequence': 'The stock market is showing signs of recovery after a steep decline.', 'labels': ['finance', 'healthcare', 'politics'], 'scores': [0.9854428172111511, 0.007386600133031607, 0.007170595694333315]}

Text: The new policy aims to improve healthcare accessibility for all citizens.
Classification: {'sequence': 'The new policy aims to improve healthcare accessibility for all citizens.', 'labels': ['healthcare', 'politics', 'finance'], 'scores': [0.9620512127876282, 0.027672864496707916, 0.010275940410792828]}

### Control

 - candidate_labels: the list of labels the text is scored against.

## Text generation

The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.

Here, we create a text generation pipeline by calling pipeline with the argument "text-generation". This pipeline will use a pre-trained model to generate text based on a given prompt.

In [None]:
# Creating a text generation pipeline using a pre-trained model
generator = pipeline("text-generation")

First, we define the prompt that we want to complete using the model. In this case, the prompt is "In this course, we will teach you how to".

Then, we generate the text by calling the generator with the prompt. We also specify `max_length=50` to limit the total length of the output (prompt included) to 50 tokens, and `num_return_sequences=3` to generate three different sequences from the same prompt.

Finally, we display the generated text.

In [None]:
# Setting up the prompt
prompt = "In this course, we will teach you how to"

# Generating text with specific control parameters
generated_text = generator(prompt, max_length=50, num_return_sequences=3)

In [None]:
# Displaying the generated text
for i, text in enumerate(generated_text):
    print(f"Generated Text {i+1}: {text['generated_text']}")

In [None]:
>>
Generated Text 1: In this course, we will teach you how to create simple software components with advanced principles and develop new ones. The purpose of this course is to provide you with a good understanding of programming by way of programming principles. In the course, you will develop
Generated Text 2: In this course, we will teach you how to use the same techniques to control a network of computers, to control a web server operating in the cloud. We will show you how to use your data to build web applications and applications based on the same
Generated Text 3: In this course, we will teach you how to build a custom Linux shell by analyzing the following techniques and implementing them in your own shell.

### Control

 - max_length: maximum total length of the output text, in tokens (prompt included).
 - num_return_sequences: number of sequences to return.
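
Because `max_length` counts the prompt as well as the generated continuation, the room left for new text shrinks as the prompt grows. Here is a small arithmetic sketch, assuming a prompt of 10 tokens. That count is hypothetical, since the real number depends on the model's tokenizer, and `new_token_budget` is an illustrative helper, not a library function.

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens the model may still generate under a total-length cap."""
    return max(0, max_length - prompt_tokens)

prompt_tokens = 10  # assumed token count for the prompt; depends on the tokenizer

for max_length in (30, 50):
    budget = new_token_budget(prompt_tokens, max_length)
    print(f"max_length={max_length}: up to {budget} new tokens")
```

To bound only the generated part, the generation API also accepts `max_new_tokens`.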

## Using any model from the Hub in a pipeline

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation.

Let’s try the `distilgpt2` model:

In [None]:
# Creating a text generation pipeline using the 'distilgpt2' model from the Hugging Face Hub
generator = pipeline("text-generation", model="distilgpt2")

# Defining the prompt for text generation
prompt = "In this course, we will teach you how to"

# Generating text with specific parameters
# max_length: the total length of the generated text
# num_return_sequences: the number of generated sequences
generated_text = generator(prompt, max_length=30, num_return_sequences=2)

# Displaying the generated text
for i, text in enumerate(generated_text):
    print(f"Generated Text {i+1}: {text['generated_text']}")

In [None]:
>>
Generated Text 1: In this course, we will teach you how to improve your daily routine and take notes. You can also learn to perform a lot of things including meditation
Generated Text 2: In this course, we will teach you how to understand a language as a whole and how to create a useful alternative to the language.

## Mask filling

The idea of this task is to fill in the blanks in a given text using pre-trained language models.

### Creating the Unmasker:

This line initializes a pipeline for the fill-mask task. The "fill-mask" argument specifies that we want to use a model trained to predict missing words in a sentence.

In [None]:
# Create a pipeline for the fill-mask task
unmasker = pipeline("fill-mask")

### Filling the Mask:

Here, we use the unmasker to predict the masked word in the sentence. The top_k argument specifies how many of the top predictions we want to display. Here, we request the top 2 predictions.

In [None]:
# The '<mask>' token is a placeholder for the word that the model will predict
results = unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
# Display the results
# The results show the top_k predictions the model suggests for the masked word
for result in results:
    print(result)

In [None]:
>>
{'score': 0.19619794189929962, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}
{'score': 0.04052729159593582, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}

This loop prints out each prediction result. Each result includes:

- score: The model's confidence in the prediction.
- token: The token ID of the predicted word.
- token_str: The predicted word.
- sequence: The full sentence with the predicted word filled in.
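
These fields can be read like any other dictionary entries. The sketch below reuses the two predictions printed above (copied, not recomputed). Note that `token_str` carries a leading space in this output, so it is stripped before display. The variable names are illustrative.

```python
# Predictions copied from the output above
predictions = [
    {"score": 0.19619794189929962, "token": 30412, "token_str": " mathematical",
     "sequence": "This course will teach you all about mathematical models."},
    {"score": 0.04052729159593582, "token": 38163, "token_str": " computational",
     "sequence": "This course will teach you all about computational models."},
]

# Show each candidate word with its score as a percentage
for prediction in predictions:
    word = prediction["token_str"].strip()
    print(f"{word:<15} {prediction['score']:.1%}")

# Keep the highest-scoring candidate
best = max(predictions, key=lambda p: p["score"])
print(best["sequence"])
```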

### Control 
 - top_k: controls how many predictions are returned.
 - `<mask>`: the mask token, i.e. the special placeholder the model must fill in. The exact token depends on the model used.

## Named entity recognition (NER)

NER is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

### Creating the NER Pipeline:

This line initializes a pipeline for the named entity recognition (NER) task. The "ner" argument specifies that we want to use a model trained for NER. The grouped_entities=True argument groups together consecutive tokens that are part of the same entity.

In [None]:
# Create a pipeline for the named entity recognition (NER) task
ner = pipeline("ner", grouped_entities=True)

### Identifying Entities:

In this line, we use the ner pipeline to identify entities in the given sentence.

In [None]:
# Use the pipeline to identify entities in the given sentence
results = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

# Display the results
# The results show the entities found in the sentence along with their types and positions
for result in results:
    print(result)

In [None]:
>>
{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}
{'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}
{'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}

This loop prints out each identified entity. Each result includes:

- entity_group: The type of entity (e.g., PER for person, ORG for organization, LOC for location).
- score: The model's confidence in the prediction.
- word: The entity found in the text.
- start: The starting position of the entity in the text.
- end: The ending position of the entity in the text.

In the given example, the model correctly identified:

 - Sylvain as a person (PER),
 - Hugging Face as an organization (ORG),
 - Brooklyn as a location (LOC).
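
In the output above, `start` and `end` behave as character offsets into the input string, with `end` exclusive, as in a Python slice. The sketch below checks this on the entities printed above (copied, not recomputed). The variable names are illustrative.

```python
sentence = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]

# Slice the original sentence with each entity's offsets
spans = [sentence[e["start"]:e["end"]] for e in entities]

for entity, span in zip(entities, spans):
    print(f"{entity['entity_group']}: {span!r} (characters {entity['start']}-{entity['end']})")
```

Slicing by offsets is useful when you need the exact surface form from the source text, for instance to highlight entities in place.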

### Control 

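 - grouped_entities=True: groups together the parts of the sentence that belong to the same entity (for example, “Hugging” and “Face” are merged into a single organization). Recent versions of the library deprecate this argument in favor of `aggregation_strategy` (for example `aggregation_strategy="simple"`).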

## Question answering

The question-answering pipeline answers questions using information from a given context.

### Creating the Question-Answering Pipeline

This line initializes a pipeline for the question-answering task. The "question-answering" argument specifies that we want to use a model trained to answer questions based on a given context.

In [None]:
# Create a pipeline for the question-answering task
question_answerer = pipeline("question-answering")

### Answering the Question

In this block, we use the question_answerer pipeline to answer the question "Where do I work?" based on the provided context.

In [None]:
# Use the pipeline to answer a question based on the given context
result = question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

# Display the result
# The result shows the answer to the question along with additional information
print(result)

In [None]:
>>
{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

This line prints out the result of the question-answering task. The result includes:

- score: The model's confidence in the answer.
- start: The starting position of the answer in the context.
- end: The ending position of the answer in the context.
- answer: The extracted answer from the context.


In the given example, the model correctly identified the answer to the question "Where do I work?" as Hugging Face. The result also provides the confidence score and the positions of the answer within the context.
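
The confidence score can be used to decide whether an answer is worth keeping. Here is a minimal sketch, assuming an arbitrary cut-off. The `answer_or_none` helper and the threshold values are hypothetical, and the result is copied from the output above.

```python
def answer_or_none(result, threshold=0.5):
    """Hypothetical helper: return the answer only if the score reaches the threshold."""
    return result["answer"] if result["score"] >= threshold else None

# Result copied from the output above
qa_result = {"score": 0.6949766278266907, "start": 33, "end": 45, "answer": "Hugging Face"}

print(answer_or_none(qa_result))                 # accepted with the default threshold
print(answer_or_none(qa_result, threshold=0.9))  # rejected with a stricter threshold
```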

## Summarization 

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

### Creating the Summarization Pipeline

This line initializes a pipeline for the summarization task. The "summarization" argument specifies that we want to use a model trained to summarize text.

In [None]:
# Create a pipeline for the summarization task
summarizer = pipeline("summarization")

### Summarizing the Text

In this block, we use the summarizer pipeline to condense the provided text into a shorter version while keeping the most important information.

In [None]:
# Use the pipeline to summarize the given text
summary = summarizer(
    """
    America has changed dramatically during recent years. Not only has the number 
    of graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering 
    graduates and a lack of well-educated engineers.
    """
)

# Display the summary
# The summary provides a condensed version of the original text while retaining the most important information
print(summary)

In [None]:
>>
[{'summary_text': ' America has changed dramatically during recent years . The number of graduates in traditional engineering disciplines has declined . China and India graduate six and eight times as many traditional engineers as does the United States . Rapidly developing economies such as India and Europe continue to encourage and advance the teaching of engineering .'}]

### Control

Same as text generation:

 - max_length
 - min_length

## Translation

For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the Model Hub.

### Creating the Translation Pipeline

This line initializes a pipeline for the translation task. The "translation" argument specifies that we want to use a model trained for translation. The `model="Helsinki-NLP/opus-mt-fr-en"` argument selects the model used to translate from French to English.

In [None]:
# Create a pipeline for the translation task
# Specify the model to be used for translation from French to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

### Translating the Text

In this line, we use the translator pipeline to translate the provided French text into English.

In [None]:
# Use the pipeline to translate the given text from French to English
translation = translator("Ce cours est produit par Hugging Face.")

# Display the translation
# The translation provides the English version of the original French text
print(translation)

In [None]:
>>
[{'translation_text': 'This course is produced by Hugging Face.'}]

### Control

Same as text generation & summarization:

- max_length
- min_length

## Exercises

### Exercise 1: Sentiment Analysis

 - Analyze the sentiment of the following sentences: "I am very happy with the service." and "The food was terrible."
 - Add a new sentence to the list: "I'm feeling neutral about this." and analyze its sentiment.

### Exercise 2: Zero-shot Classification

 - Classify the following text into one of the candidate labels: "finance", "healthcare", "politics": "The new policy aims to improve healthcare accessibility for all citizens."
 - Change the candidate labels to "education", "entertainment", "business" and classify the same text.

### Exercise 3: Text Generation

 - Generate text with the prompt "Artificial intelligence will change the world by" with a maximum length of 50 tokens.
 - Change the prompt to "In the future, we will see advancements in" and generate text with a maximum length of 30 tokens.

### Exercise 4: Mask Filling

 - Predict the masked word in the sentence `"Artificial intelligence is the future of <mask>."`
 - Change the sentence to `"The development of AI will revolutionize <mask>."` and predict the masked word.

### Exercise 5: Named Entity Recognition (NER)

 - Identify entities in the sentence "Elon Musk founded SpaceX and Tesla."
 - Add a new sentence "Barack Obama was the 44th President of the United States." and identify entities.

### Exercise 6: Question Answering

 - Answer the question "What is the name of the company?" given the context "Amazon is a global company based in Seattle."
 - Change the context to "Google was founded by Larry Page and Sergey Brin." and ask the question "Who founded Google?"

### Exercise 7: Summarization

 - Summarize the following text: "Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention."
 - Summarize the following text: "Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Like a human, the algorithm learns from examples. While traditional machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction."

### Exercise 8: Translation

 - Translate the sentence "Bonjour tout le monde." from French to English.
 - Translate the sentence "La technologie transforme notre monde." from French to English.

### Exercise 9: Comprehensive NLP Task

This exercise will combine sentiment analysis, zero-shot classification, and text generation.

 - Perform sentiment analysis on the following sentences: "I love the new features of this product." and "This is the worst experience I've ever had."
 - Classify the following text into one of the candidate labels: "technology", "customer service", "product quality": "The new smartphone has several innovative features that are very user-friendly."
 - Generate text with the prompt "Based on the customer feedback, we can improve our product by" with a maximum length of 50 tokens.

### Exercise 10: Multi-Function NLP Task

This exercise will combine named entity recognition (NER), question answering, and text summarization.

 - Identify entities in the sentence: "Sundar Pichai is the CEO of Google, which is headquartered in Mountain View, California."
 - Answer the question "Where is Google headquartered?" given the context "Sundar Pichai is the CEO of Google, which is headquartered in Mountain View, California."
 - Summarize the following text: "Google, a multinational technology company specializing in Internet-related services and products, was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."