<a href="https://colab.research.google.com/github/Abdeslemissaadi/Transformers/blob/main/TR_C_3_Abdeslem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Based on the Hugging Face Course introduction: https://huggingface.co/
# Modified by Abdeslem ISSAADI, Univ. Paris 8

# Transformers: Technical Introduction

This notebook provides a comprehensive introduction to the Hugging Face Transformers library, focusing on various natural language processing (NLP) tasks using pre-trained language models. Participants will learn how to install and verify the library, utilize pipelines for sentiment analysis, zero-shot classification, text generation, mask filling, named entity recognition (NER), question answering, text summarization, and translation. By the end of this course, learners will be equipped to effectively leverage these pipelines for a wide range of NLP applications, enhancing their skills in modern language processing techniques.

## Installing Required Libraries

First, we need to install the `transformers` library. This library is developed by Hugging Face and provides a wide range of pre-trained models for NLP tasks.

In [None]:
# Install the transformers library
!pip install transformers

After installing the transformers library, we will check its version to ensure it has been installed correctly.

In [None]:
# Verify the installation of the transformers library
import transformers
print(transformers.__version__)

4.57.1


## Working with pipelines

The most basic object in the Hugging Face Transformers library is the `pipeline()` function.

There are three main steps involved when you pass some text to a pipeline():

 - Text preprocessing,
 - Model prediction,
 - Output post-processing.
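
To make the three steps concrete, here is a toy sketch in plain Python. It does not use the Transformers library: the whitespace tokenizer, the tiny vocabulary, and the hand-written `toy_model` rule are all hypothetical stand-ins, chosen only to show how raw text becomes token IDs, then raw scores (logits), then a labelled probability through a softmax.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {"[UNK]": 0, "i": 1, "loved": 2, "hate": 3, "this": 4, "movie": 5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into a list of token IDs."""
    tokens = text.lower().replace("!", "").replace(".", "").split()
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def toy_model(token_ids):
    """Step 2: map token IDs to one raw score (logit) per label.

    A hand-written rule standing in for a neural network.
    """
    positive = sum(1.0 for t in token_ids if t == VOCAB["loved"])
    negative = sum(1.0 for t in token_ids if t == VOCAB["hate"])
    return [2.0 * negative, 2.0 * positive]


def postprocess(logits):
    """Step 3: convert logits to probabilities and pick the best label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as pipeline() does internally."""
    return postprocess(toy_model(preprocess(text)))


toy_result = toy_pipeline("I loved this movie!")
print(toy_result)
```

The returned dictionary deliberately has the same `label` and `score` keys as the real sentiment pipeline used in the next section.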

We can pass raw text directly to a pipeline and get a readable answer back.
By default, each task selects a particular pretrained model.

Let's try it!

## Sentiment Analysis Pipeline

This pipeline wraps a pre-trained model that analyzes the sentiment of a given text. The default model used below labels each text as either POSITIVE or NEGATIVE; it has no neutral class.

Initialize the sentiment analysis pipeline. The first call downloads the pre-trained model and its tokenizer.

In [None]:
# Import the pipeline function from Transformers
from transformers import pipeline

# Initialize the sentiment analysis pipeline
sentiment_classifier = pipeline(task="sentiment-analysis")

# Example usage
sentence = "I absolutely loved the movie! It was fantastic and thrilling."
result = sentiment_classifier(sentence)
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.999885082244873}]


In [None]:
from transformers import pipeline

# Initialize the sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# List of texts to analyze
texts = [
    "Nice, I've been waiting for a short HuggingFace course my whole life!",
    "I hate this so much"
]

# Analyze the sentiment of each text
results = sentiment_analyzer(texts)

# Display the results
for txt, res in zip(texts, results):
    print(f"Text: {txt}")
    print(f"Sentiment: {res['label']}, Score: {res['score']:.4f}\n")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Text: Nice, I've been waiting for a short HuggingFace course my whole life!
Sentiment: POSITIVE, Score: 0.9979

Text: I hate this so much
Sentiment: NEGATIVE, Score: 0.9995



Each result contains:

 - label: The predicted sentiment label (e.g., POSITIVE or NEGATIVE).
 - score: The confidence score of the prediction.

Let's break down the results for better understanding.

In [None]:
from transformers import pipeline

# Initialize the pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Example text
example_text = "Nice, I've been waiting for a short HuggingFace course my whole life!"
example_result = sentiment_analyzer(example_text)[0]

# Display the detailed result
print(f"Text: {example_text}")
print(f"Sentiment: {example_result['label']}, Score: {example_result['score']:.4f}")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Text: Nice, I've been waiting for a short HuggingFace course my whole life!
Sentiment: POSITIVE, Score: 0.9979
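
Since every result is a plain dictionary with a `label` and a `score`, ordinary Python is enough to post-process a batch. The sketch below hard-codes two results in the shape shown earlier (scores rounded) so that it runs without downloading a model. The `confident` and `count_labels` helpers and the threshold values are illustrative choices, not part of the library.

```python
# Results in the shape returned by the sentiment pipeline (scores rounded
# from the earlier output; hard-coded so the sketch runs without a model).
sample_results = [
    {"label": "POSITIVE", "score": 0.9979},
    {"label": "NEGATIVE", "score": 0.9995},
]


def confident(results, threshold=0.99):
    """Keep only predictions whose score reaches the threshold."""
    return [res for res in results if res["score"] >= threshold]


def count_labels(results):
    """Count how many predictions fall under each label."""
    counts = {}
    for res in results:
        counts[res["label"]] = counts.get(res["label"], 0) + 1
    return counts


print(confident(sample_results, threshold=0.999))
print(count_labels(sample_results))
```

With a real pipeline, the list returned by `sentiment_analyzer(texts)` could be passed to these helpers in place of `sample_results`.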



Feel free to analyze more texts by modifying the texts list and re-running the analysis cell.

In [None]:
from transformers import pipeline

# Initialize the pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# List of new texts to analyze
more_texts = [
    "This is the best movie I have ever seen!",
    "The product quality is terrible and I'm very disappointed.",
    "I'm feeling great today!",
    "It's a gloomy and rainy day."
]

# Analyze the sentiment of each text
more_results = sentiment_analyzer(more_texts)

# Display the results
for txt, res in zip(more_texts, more_results):
    print(f"Text: {txt}")
    print(f"Sentiment: {res['label']}, Score: {res['score']:.4f}\n")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Text: This is the best movie I have ever seen!
Sentiment: POSITIVE, Score: 0.9999

Text: The product quality is terrible and I'm very disappointed.
Sentiment: NEGATIVE, Score: 0.9998

Text: I'm feeling great today!
Sentiment: POSITIVE, Score: 0.9999

Text: It's a gloomy and rainy day.
Sentiment: NEGATIVE, Score: 0.9975




## Other pipelines
Some of the currently available pipelines are:

- feature-extraction
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

## Zero-shot classification

The zero-shot-classification pipeline is very powerful for tasks where we need to classify texts that haven’t been labelled. It returns probability scores for any list of labels you want!
It's called zero-shot because you don’t need to fine-tune the model on your data to use it.

### Initialize the Classifier

We initialize the classifier using the `pipeline` function and specify `"zero-shot-classification"` as the task.

In [None]:
# Create a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


### Classify a Sample Text

We will classify the sample text into one of the candidate labels provided.


In [None]:
from transformers import pipeline

# Initialize the zero-shot pipeline
zero_shot_classifier = pipeline("zero-shot-classification")

# Example text to classify
text = "This is a short course about the Transformers library"

# List of candidate labels
candidate_labels = ["education", "politics", "business"]

# Zero-shot classification
result = zero_shot_classifier(text, candidate_labels=candidate_labels)

# Display the result
print(result)


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This is a short course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.7481651902198792, 0.17828458547592163, 0.07355019450187683]}


In [None]:
>>
{
  'sequence': 'This is a short course about the Transformers library',  # The input text
  'labels': ['education', 'business', 'politics'],  # The candidate labels
  'scores': [0.7481651902198792, 0.17828458547592163, 0.07355019450187683]  # The confidence scores for each label
}

### Explanation of Results

The output is a dictionary containing the input sequence, the candidate labels sorted from highest to lowest score, and the corresponding scores.
The scores represent the model's confidence in each label; by default they sum to 1 across the candidate labels.

In this example, the model ranks "education" first for the input text, with a score of about 0.75.
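
Because `labels` and `scores` are parallel lists, pairing them with `zip` makes the result easier to read and query. The sketch below copies the dictionary from the output above so that it runs without loading the model; the variable names are illustrative.

```python
# Output copied from the classification cell above.
zero_shot_result = {
    "sequence": "This is a short course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.7481651902198792, 0.17828458547592163, 0.07355019450187683],
}

# Pair each label with its score for easier lookup.
label_scores = dict(zip(zero_shot_result["labels"], zero_shot_result["scores"]))

# The first entry is the top prediction because labels arrive sorted by score.
top_label = zero_shot_result["labels"][0]
total = sum(zero_shot_result["scores"])

for label, score in label_scores.items():
    print(f"{label:<10} {score:.3f}")
print(f"Top label: {top_label}, sum of scores: {total:.4f}")
```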


### Further Exploration

You can experiment with different texts and sets of candidate labels to see how the model performs.
Try classifying the following texts:

1. "The stock market is showing signs of recovery after a steep decline."
2. "The new policy aims to improve healthcare accessibility for all citizens."

Use candidate labels such as `["finance", "healthcare", "politics"]`.

In [None]:
from transformers import pipeline

# Initialize the zero-shot pipeline
zero_shot_classifier = pipeline("zero-shot-classification")

# List of texts to classify
texts = [
    "The stock market is showing signs of recovery after a steep decline.",  # Example 1
    "The new policy aims to improve healthcare accessibility for all citizens."  # Example 2
]

# List of candidate labels
candidate_labels = ["finance", "healthcare", "politics"]

# Zero-shot classification for each text
for txt in texts:
    result = zero_shot_classifier(txt, candidate_labels=candidate_labels)
    print(f"Text: {txt}")
    print(f"Classification: {result}\n")


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Text: The stock market is showing signs of recovery after a steep decline.
Classification: {'sequence': 'The stock market is showing signs of recovery after a steep decline.', 'labels': ['finance', 'healthcare', 'politics'], 'scores': [0.9854428172111511, 0.007386598736047745, 0.007170593831688166]}

Text: The new policy aims to improve healthcare accessibility for all citizens.
Classification: {'sequence': 'The new policy aims to improve healthcare accessibility for all citizens.', 'labels': ['healthcare', 'politics', 'finance'], 'scores': [0.9620512127876282, 0.027672862634062767, 0.010275942273437977]}




### Control

 - candidate_labels: the list of labels the classifier scores the text against.

## Text generation

The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.

Here, we create a text generation pipeline by calling pipeline with the argument "text-generation". This pipeline will use a pre-trained model to generate text based on a given prompt.

In [None]:
# Creating a text generation pipeline using a pre-trained model
generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


First, we define the prompt that we want to complete using the model. In this case, the prompt is "In this course, we will teach you how to".

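Then, we generate the text by calling the generator with the prompt. We pass max_length=50, intended to limit the total length of the output (prompt included) to 50 tokens, and num_return_sequences=3 to request three different sequences for the prompt. Note the warning in the output below: in the library version used here, a default max_new_tokens of 256 is also set and takes precedence, so the generated text runs well past 50 tokens.

Finally, we display the generated text.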

In [None]:
from transformers import pipeline

# Initialize the text generation pipeline
text_generator = pipeline("text-generation", model="gpt2")  # "gpt2" or any other text-generation model

# Define the prompt
prompt = "In this course, we will teach you how to"

# Generate several texts from the prompt
generated_texts = text_generator(prompt, max_length=50, num_return_sequences=3)

# Display the results
for i, gen in enumerate(generated_texts, 1):
    print(f"Generated Text {i}:\n{gen['generated_text']}\n")


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated Text 1:
In this course, we will teach you how to set up data-driven data analysis tools that will help you better understand your data, what you need to do to build them, and how to use them to analyze your business.

This course is designed to prepare you for the next generation of data analysis, and will give you the tools and tools necessary to start using these tools and tools, while also helping you to work on your data.

This course will also provide you with practical advice on how to get started, and how to plan your data for your business.

This course is designed to be an introductory course, to help students learn about data analysis, and to give you the tools needed to start using these tools. This course is designed to help you to quickly build your data, while also help you to develop your data so that it can be used for your business.

This course is designed to be a introductory course, to help students learn about data analysis, and to give you the tools need


### Control

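 - max_length: total length of the output text in tokens, prompt included. As the warning above shows, max_new_tokens takes precedence when both are set.
 - num_return_sequences: number of sequences to return.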

## Using any model from the Hub in a pipeline

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation.

Let’s try the distilgpt2 model:

In [None]:
from transformers import pipeline

# Create a text generation pipeline with the 'distilgpt2' model
text_generator = pipeline("text-generation", model="distilgpt2")

# Define the prompt
prompt = "In this course, we will teach you how to"

# Generate text from the same prompt
generated_texts = text_generator(prompt, max_length=30, num_return_sequences=2)

# Display the generated texts
for idx, gen in enumerate(generated_texts, 1):
    print(f"Generated Text {idx}: {gen['generated_text']}")


config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated Text 1: In this course, we will teach you how to build a successful business model that will help make it happen.



Awards

Awards
Generated Text 2: In this course, we will teach you how to use the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the language of the languag

## Mask filling

The idea of this task is to fill in the blanks in a given text using pre-trained language models.

### Creating the Unmasker:

This line initializes a pipeline for the fill-mask task. The "fill-mask" argument specifies that we want to use a model trained to predict missing words in a sentence.

In [None]:
# Create a pipeline for the fill-mask task
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


### Filling the Mask:

Here, we use the unmasker to predict the masked word in the sentence. The top_k argument specifies how many of the top predictions we want to display. Here, we request the top 2 predictions.

In [None]:
# The '<mask>' token is a placeholder for the word that the model will predict
results = unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
# Display the results
# The results show the top_k predictions the model suggests for the masked word
for result in results:
    print(result)

{'score': 0.19619767367839813, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}
{'score': 0.04052715748548508, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}



This loop prints out each prediction result. Each result includes:

 - score: The model's confidence in the prediction.
 - token: The token ID of the predicted word.
 - token_str: The predicted word (here with a leading space, as the tokenizer encodes it).
 - sequence: The full sentence with the predicted word filled in.

### Control
 - top_k: controls how many predictions are returned.
 - `<mask>`: the mask token, a special word the model must fill in. It depends on the model used.
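
Because the mask token depends on the model, it can help to build the input from a variable rather than hard-coding `<mask>`. The sketch below is plain Python and does not load any model: the `build_masked_input` helper and its `{mask}` placeholder are illustrative assumptions. `<mask>` is the token used by the RoBERTa-style default model above, while BERT-style models use `[MASK]`; with a loaded pipeline, the token can be read from `unmasker.tokenizer.mask_token`.

```python
def build_masked_input(template, mask_token):
    """Insert a model-specific mask token into a template sentence.

    The '{mask}' placeholder is a convention of this sketch only.
    """
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


template = "This course will teach you all about {mask} models."

roberta_input = build_masked_input(template, "<mask>")  # RoBERTa-style models
bert_input = build_masked_input(template, "[MASK]")     # BERT-style models

print(roberta_input)
print(bert_input)

# Predictions copied from the output above (scores rounded).
predictions = [
    {"score": 0.1962, "token": 30412, "token_str": " mathematical"},
    {"score": 0.0405, "token": 38163, "token_str": " computational"},
]

# token_str keeps the tokenizer's leading space, so strip it before display.
best = max(predictions, key=lambda p: p["score"])
best_word = best["token_str"].strip()
print(best_word)
```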

## Named entity recognition (NER)

NER is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

### Creating the NER Pipeline:

This line initializes a pipeline for the named entity recognition (NER) task. The "ner" argument specifies that we want to use a model trained for NER. The grouped_entities=True argument groups together consecutive tokens that are part of the same entity.

In [None]:
from transformers import pipeline

# Create a pipeline for named entity recognition (NER)
ner_pipeline = pipeline("ner", grouped_entities=True)

# Example text
text = "Barack Obama was the 44th president of the United States."

# Apply the NER pipeline
entities = ner_pipeline(text)

# Display the detected entities
for entity in entities:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.4f}")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


Entity: Barack Obama, Type: PER, Score: 0.9992
Entity: United States, Type: LOC, Score: 0.9987


### Identifying Entities:

In this line, we use the ner pipeline to identify entities in the given sentence.

In [None]:
from transformers import pipeline

# Create the NER pipeline
ner_pipeline = pipeline("ner", grouped_entities=True)

# Text to analyze
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Identify the entities in the text
entities = ner_pipeline(text)

# Display the results clearly
for entity in entities:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.4f}")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Entity: Sylvain, Type: PER, Score: 0.9982
Entity: Hugging Face, Type: ORG, Score: 0.9796
Entity: Brooklyn, Type: LOC, Score: 0.9932



This loop prints out each identified entity. Each result includes:

- entity_group: The type of entity (e.g., PER for person, ORG for organization, LOC for location).
- score: The model's confidence in the prediction.
- word: The entity found in the text.
- start: The starting position of the entity in the text.
- end: The ending position of the entity in the text.

In the given example, the model correctly identified:

 - Sylvain as a person (PER),
 - Hugging Face as an organization (ORG),
 - Brooklyn as a location (LOC).
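
The `start` and `end` fields are character offsets into the input text, with `end` exclusive, so slicing the text with them recovers each entity. The printed output above does not show these offsets, so the sketch below recomputes them with `str.find` as an assumption about what the pipeline would report; the entity words and types are taken from that output, and no model is loaded.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities and types as reported in the output above. The offsets are
# recomputed here for illustration, since the printed output omits them.
detected = [("Sylvain", "PER"), ("Hugging Face", "ORG"), ("Brooklyn", "LOC")]

entities = []
for word, group in detected:
    start = text.find(word)
    entities.append(
        {"entity_group": group, "word": word, "start": start, "end": start + len(word)}
    )

# Slicing the text with start/end recovers each entity exactly.
for ent in entities:
    span = text[ent["start"]:ent["end"]]
    print(f"{ent['entity_group']}: {span!r} at [{ent['start']}, {ent['end']})")
```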

### Control

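 - grouped_entities=True: groups together the parts of the sentence that correspond to the same entity (for example, “Hugging” and “Face” as a single organization). Recent versions of the library expose the same behavior through the aggregation_strategy argument.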

## Question answering

The question-answering pipeline answers questions using information from a given context.

### Creating the Question-Answering Pipeline

This line initializes a pipeline for the question-answering task. The "question-answering" argument specifies that we want to use a model trained to answer questions based on a given context.

In [None]:
# Create a pipeline for the question-answering task
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


### Answering the Question

In this block, we use the question_answerer pipeline to answer the question "Where do I work?" based on the provided context.

In [None]:
from transformers import pipeline

# Create the question-answering pipeline
question_answerer = pipeline("question-answering")

# Context and question
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
question = "Where do I work?"

# Get the answer
result = question_answerer(question=question, context=context)

# Display the result in a readable form
print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
print(f"Start: {result['start']}, End: {result['end']}")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Question: Where do I work?
Answer: Hugging Face
Score: 0.6950
Start: 33, End: 45



This line prints out the result of the question-answering task. The result includes:

- score: The model's confidence in the answer, between 0 and 1.
- start: The character index in the context where the answer begins.
- end: The character index in the context where the answer ends (exclusive).
- answer: The answer text extracted from the context.


In the given example, the model correctly identified the answer to the question "Where do I work?" as Hugging Face. The result also provides the confidence score and the positions of the answer within the context.
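
Because `start` and `end` are character offsets into the context, slicing the context with them should give back the same string as `answer`. The following is a minimal sketch of that check; the result dictionary is copied by hand from the output above rather than produced by running the model.

```python
# Context and result copied from the example above (no model is run here).
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
result = {"score": 0.6950, "start": 33, "end": 45, "answer": "Hugging Face"}

# `start` is inclusive and `end` is exclusive, like a Python slice.
extracted = context[result["start"]:result["end"]]

print(extracted)
print(extracted == result["answer"])
```

This is handy when you need to highlight the answer inside the original text instead of only displaying it.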

## Summarization

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

### Creating the Summarization Pipeline

This line initializes a pipeline for the summarization task. The "summarization" argument specifies that we want to use a model trained to summarize text.

In [None]:
# Create a pipeline for the summarization task
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


### Summarizing the Text

In this block, we use the summarizer pipeline to condense the provided text into a shorter version while keeping the most important information.

In [None]:
# Use the pipeline to summarize the given text
summary = summarizer(
    """
    America has changed dramatically during recent years. Not only has the number
    of graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering
    graduates and a lack of well-educated engineers.
    """
)

# Display the summary
# The summary provides a condensed version of the original text while retaining the most important information
print(summary)

[{'summary_text': ' China and India graduate six and eight times as many traditional engineers as the U.S. as does the United States . Rapidly developing economies such as India and Europe continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues, and technology subjects .'}]


In [None]:
>>
[{'summary_text': ' America has changed dramatically during recent years . The number of graduates in traditional engineering disciplines has declined . China and India graduate six and eight times as many traditional engineers as does the United States . Rapidly developing economies such as India and Europe continue to encourage and advance the teaching of engineering .'}]


### Control

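As with text generation, the length of the output can be controlled with two parameters:

 - max_length: the maximum number of tokens in the generated summary,
 - min_length: the minimum number of tokens in the generated summary.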
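
Both parameters are passed as keyword arguments when calling the pipeline, for example `summarizer(text, max_length=50, min_length=25)`. When the input is short, the library warns that `max_length` is larger than the input and suggests about half the input token count. The helper below is a hypothetical sketch of that heuristic: the name `suggest_summary_lengths` and the `min_ratio` default are assumptions made for illustration, not part of the library.

```python
def suggest_summary_lengths(input_length: int, min_ratio: float = 0.25) -> dict:
    """Suggest max_length/min_length (in tokens) for a summarization call.

    Hypothetical helper: max_length is half the input token count, mirroring
    the suggestion printed in the pipeline's warning; min_ratio is arbitrary.
    """
    if input_length < 4:
        raise ValueError("input_length must be at least 4 tokens")
    max_length = input_length // 2
    # Keep min_length at least 1 and strictly below max_length.
    min_length = max(1, min(int(input_length * min_ratio), max_length - 1))
    return {"max_length": max_length, "min_length": min_length}


lengths = suggest_summary_lengths(46)
print(lengths)
# The result could then be unpacked into the call, e.g.:
# summarizer(text, do_sample=False, **lengths)
```

The input length is counted in tokens, not characters, so in practice it would come from the model's tokenizer.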

## Translation

For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the Model Hub.

### Creating the Translation Pipeline

This line initializes a pipeline for the translation task. The "translation" argument selects the task, and the model="Helsinki-NLP/opus-mt-fr-en" argument selects the model used to translate from French to English.

In [None]:
# Create a pipeline for the translation task
# Specify the model to be used for translation from French to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")


Device set to use cpu


### Translating the Text

In this block, we use the translator pipeline to translate the provided French text into English.

In [None]:
from transformers import pipeline

# Create the translation pipeline (French → English)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

# Text to translate
text_to_translate = "Ce cours est produit par Hugging Face."

# Run the translation
translation_result = translator(text_to_translate)

# Display the translation
for t in translation_result:
    print(f"Translated Text: {t['translation_text']}")


Device set to use cpu


Translated Text: This course is produced by Hugging Face.


In [None]:
>>
[{'translation_text': 'This course is produced by Hugging Face.'}]

### Control

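As with text generation and summarization, the length of the output can be controlled with:

- max_length: the maximum number of tokens in the generated translation,
- min_length: the minimum number of tokens in the generated translation.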

## Exercises

### Exercise 1: Sentiment Analysis

 - Analyze the sentiment of the following sentences: "I am very happy with the service." and "The food was terrible."
 - Add a new sentence to the list: "I'm feeling neutral about this." and analyze its sentiment.

In [None]:
from transformers import pipeline

# Create the sentiment-analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# List of sentences to analyze
texts = [
    "I am very happy with the service.",
    "The food was terrible.",
    "I'm feeling neutral about this."
]

# Analyze the sentiment of each sentence
results = sentiment_analyzer(texts)

# Display the results
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Text: I am very happy with the service.
Sentiment: POSITIVE, Score: 0.9999

Text: The food was terrible.
Sentiment: NEGATIVE, Score: 0.9992

Text: I'm feeling neutral about this.
Sentiment: NEGATIVE, Score: 0.9982



### Exercise 2: Zero-shot Classification

 - Classify the following text into one of the candidate labels: "finance", "healthcare", "politics": "The new policy aims to improve healthcare accessibility for all citizens."
 - Change the candidate labels to "education", "entertainment", "business" and classify the same text.

In [None]:
from transformers import pipeline

# Create the zero-shot classification pipeline
zero_shot_classifier = pipeline("zero-shot-classification")

# Text to classify
text = "The new policy aims to improve healthcare accessibility for all citizens."

# First set of labels
candidate_labels_1 = ["finance", "healthcare", "politics"]
result_1 = zero_shot_classifier(text, candidate_labels=candidate_labels_1)
print("=== Classification with labels:", candidate_labels_1, "===")
print(f"Text: {text}")
print(f"Classification: {result_1}\n")

# Second set of labels
candidate_labels_2 = ["education", "entertainment", "business"]
result_2 = zero_shot_classifier(text, candidate_labels=candidate_labels_2)
print("=== Classification with labels:", candidate_labels_2, "===")
print(f"Text: {text}")
print(f"Classification: {result_2}\n")


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


=== Classification with labels: ['finance', 'healthcare', 'politics'] ===
Text: The new policy aims to improve healthcare accessibility for all citizens.
Classification: {'sequence': 'The new policy aims to improve healthcare accessibility for all citizens.', 'labels': ['healthcare', 'politics', 'finance'], 'scores': [0.9620512127876282, 0.027672862634062767, 0.010275942273437977]}

=== Classification with labels: ['education', 'entertainment', 'business'] ===
Text: The new policy aims to improve healthcare accessibility for all citizens.
Classification: {'sequence': 'The new policy aims to improve healthcare accessibility for all citizens.', 'labels': ['business', 'entertainment', 'education'], 'scores': [0.5286611318588257, 0.30794841051101685, 0.16339050233364105]}
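
The classifier returns the labels sorted by descending score, and in this single-label setting the scores sum to approximately 1 across the candidate labels. The sketch below reads the top prediction from the first result; the values are copied by hand from the output above and rounded, and no model is run.

```python
# First result from the output above, with scores rounded to four decimals.
result = {
    "sequence": "The new policy aims to improve healthcare accessibility for all citizens.",
    "labels": ["healthcare", "politics", "finance"],
    "scores": [0.9621, 0.0277, 0.0103],
}

# Labels are ordered from most to least likely, so index 0 is the prediction.
top_label, top_score = result["labels"][0], result["scores"][0]
print(f"{top_label}: {top_score:.2%}")

# Pair every label with its score for easier lookup.
label_scores = dict(zip(result["labels"], result["scores"]))
print(label_scores["politics"])
```

With the second label set, none of the labels fits the text well, yet the scores still sum to 1. A top score of about 0.53 is therefore better read as "least bad option" than as a confident prediction.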



### Exercise 3: Text Generation

 - Generate text with the prompt "Artificial intelligence will change the world by" with a maximum length of 50 tokens.
 - Change the prompt to "In the future, we will see advancements in" and generate text with a maximum length of 30 tokens.

In [None]:
from transformers import pipeline

# Create the text-generation pipeline with the distilgpt2 model
text_generator = pipeline("text-generation", model="distilgpt2")

# --- First prompt ---
# Note: as the warnings in the output report, max_new_tokens (=256) is already set
# and takes precedence over max_length, so the text is not capped at 50 tokens here.
prompt_1 = "Artificial intelligence will change the world by"
generated_1 = text_generator(prompt_1, max_length=50, num_return_sequences=1)

print("=== Generated Text for Prompt 1 ===")
for i, text in enumerate(generated_1):
    print(f"Generated Text {i+1}: {text['generated_text']}\n")

# --- Second prompt ---
prompt_2 = "In the future, we will see advancements in"
generated_2 = text_generator(prompt_2, max_length=30, num_return_sequences=1)

print("=== Generated Text for Prompt 2 ===")
for i, text in enumerate(generated_2):
    print(f"Generated Text {i+1}: {text['generated_text']}\n")


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/m

=== Generated Text for Prompt 1 ===
Generated Text 1: Artificial intelligence will change the world by the end of the decade as we learn more about the artificial intelligence.

=== Generated Text for Prompt 2 ===
Generated Text 1: In the future, we will see advancements in the technology. We will see technological innovation in the way we use it and, in the future, we will see more innovations. We will see other innovations. We will see smart cars. We will see the advancements in technology. We will see people learning to drive. We will see technology. We will see people learning to drive. We will see people learning to drive. We will see people learning to drive. We will see people learning to drive. We will see people learning to drive. We will 
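
The warnings in the output above explain why both generations ran well past the requested length: a `max_new_tokens` value of 256 was already in effect, and it takes precedence over `max_length`. The sketch below models this precedence rule in plain Python. The function name and the prompt length of 8 tokens are illustrative assumptions, and no model is run.

```python
from typing import Optional


def new_token_budget(prompt_tokens: int,
                     max_length: Optional[int] = None,
                     max_new_tokens: Optional[int] = None) -> Optional[int]:
    """Return how many tokens may be generated beyond the prompt.

    max_new_tokens counts only the generated tokens, while max_length counts
    the prompt plus the generated tokens. When both are given, max_new_tokens wins.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(0, max_length - prompt_tokens)
    return None  # no explicit limit supplied


# Assumed prompt length of 8 tokens, for illustration only.
print(new_token_budget(8, max_length=50, max_new_tokens=256))  # 256
print(new_token_budget(8, max_length=50))                      # 42
```

Passing `max_new_tokens=50` instead of `max_length=50` in the call to `text_generator` would therefore bound the generated text as the exercise intends.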

### Exercise 4: Mask Filling

 - Predict the masked word in the sentence "Artificial intelligence is the future of <mask>."
 - Change the sentence to "The development of AI will revolutionize <mask>." and predict the masked word.
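
The sentences above use `<mask>` as the placeholder, but the mask token differs between models: BERT checkpoints such as `bert-base-uncased` expect `[MASK]`. The helper below is a small sketch for adapting a sentence to the model in use. The name `with_mask_token` is hypothetical; with a real pipeline the token can be read from `mask_filler.tokenizer.mask_token`.

```python
def with_mask_token(sentence: str, mask_token: str, placeholder: str = "<mask>") -> str:
    """Replace a generic placeholder with the mask token a given model expects."""
    if placeholder not in sentence:
        raise ValueError(f"placeholder {placeholder!r} not found in sentence")
    return sentence.replace(placeholder, mask_token)


# "[MASK]" is the mask token of BERT-style checkpoints such as bert-base-uncased.
bert_sentence = with_mask_token("Artificial intelligence is the future of <mask>.", "[MASK]")
print(bert_sentence)
```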

In [None]:
from transformers import pipeline

# Create the fill-mask pipeline (masked language modeling)
# bert-base-uncased expects the [MASK] token rather than <mask>
mask_filler = pipeline("fill-mask", model="bert-base-uncased")

# --- First example ---
sentence_1 = "Artificial intelligence is the future of [MASK]."
predictions_1 = mask_filler(sentence_1, top_k=3)

print("=== Masked Predictions for Sentence 1 ===")
for pred in predictions_1:
    print(f"Token: {pred['token_str']}, Score: {pred['score']:.4f}, Sequence: {pred['sequence']}")

print("\n")

# --- Second example ---
sentence_2 = "The development of AI will revolutionize [MASK]."
predictions_2 = mask_filler(sentence_2, top_k=3)

print("=== Masked Predictions for Sentence 2 ===")
for pred in predictions_2:
    print(f"Token: {pred['token_str']}, Score: {pred['score']:.4f}, Sequence: {pred['sequence']}")



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


=== Masked Predictions for Sentence 1 ===
Token: science, Score: 0.2191, Sequence: artificial intelligence is the future of science.
Token: technology, Score: 0.0878, Sequence: artificial intelligence is the future of technology.
Token: society, Score: 0.0717, Sequence: artificial intelligence is the future of society.


=== Masked Predictions for Sentence 2 ===
Token: ai, Score: 0.1091, Sequence: the development of ai will revolutionize ai.
Token: society, Score: 0.0741, Sequence: the development of ai will revolutionize society.
Token: technology, Score: 0.0474, Sequence: the development of ai will revolutionize technology.


### Exercise 5: Named Entity Recognition (NER)

 - Identify entities in the sentence "Elon Musk founded SpaceX and Tesla."
 - Add a new sentence "Barack Obama was the 44th President of the United States." and identify entities.

In [None]:
from transformers import pipeline

# Create the named entity recognition (NER) pipeline
# aggregation_strategy="simple" groups sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True)
ner_pipeline = pipeline("ner", aggregation_strategy="simple")

# --- First example ---
sentence_1 = "Elon Musk founded SpaceX and Tesla."
entities_1 = ner_pipeline(sentence_1)

print("=== Named Entities for Sentence 1 ===")
for entity in entities_1:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.4f}")

print("\n")

# --- Second example ---
sentence_2 = "Barack Obama was the 44th President of the United States."
entities_2 = ner_pipeline(sentence_2)

print("=== Named Entities for Sentence 2 ===")
for entity in entities_2:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.4f}")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


=== Named Entities for Sentence 1 ===
Entity: Elon Musk, Type: PER, Score: 0.9971
Entity: SpaceX, Type: ORG, Score: 0.9985
Entity: Tesla, Type: ORG, Score: 0.9917


=== Named Entities for Sentence 2 ===
Entity: Barack Obama, Type: PER, Score: 0.9991
Entity: United States, Type: LOC, Score: 0.9952


### Exercise 6: Question Answering

 - Answer the question "What is the name of the company?" given the context "Amazon is a global company based in Seattle."
 - Change the context to "Google was founded by Larry Page and Sergey Brin." and ask the question "Who founded Google?"

In [None]:
from transformers import pipeline

# Create the question-answering pipeline
qa_pipeline = pipeline("question-answering")

# --- First example ---
context_1 = "Amazon is a global company based in Seattle."
question_1 = "What is the name of the company?"

answer_1 = qa_pipeline(question=question_1, context=context_1)
print("=== Question Answering Example 1 ===")
print(f"Question: {question_1}")
print(f"Context: {context_1}")
print(f"Answer: {answer_1['answer']}, Score: {answer_1['score']:.4f}\n")

# --- Second example ---
context_2 = "Google was founded by Larry Page and Sergey Brin."
question_2 = "Who founded Google?"

answer_2 = qa_pipeline(question=question_2, context=context_2)
print("=== Question Answering Example 2 ===")
print(f"Question: {question_2}")
print(f"Context: {context_2}")
print(f"Answer: {answer_2['answer']}, Score: {answer_2['score']:.4f}")


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


=== Question Answering Example 1 ===
Question: What is the name of the company?
Context: Amazon is a global company based in Seattle.
Answer: Amazon, Score: 0.9975

=== Question Answering Example 2 ===
Question: Who founded Google?
Context: Google was founded by Larry Page and Sergey Brin.
Answer: Larry Page and Sergey Brin, Score: 0.8125


### Exercise 7: Summarization

 - Summarize the following text: "Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention."
 - Summarize the following text: "Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Like a human, the algorithm learns from examples. While traditional machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction."

In [None]:
from transformers import pipeline

# Create the summarization pipeline
summarizer = pipeline("summarization")

# --- First text ---
text_1 = (
    "Machine learning is a method of data analysis that automates analytical model building. "
    "It is a branch of artificial intelligence based on the idea that systems can learn from data, "
    "identify patterns, and make decisions with minimal human intervention."
)

summary_1 = summarizer(text_1, max_length=50, min_length=25, do_sample=False)[0]
print("=== Summarization Example 1 ===")
print(f"Original Text: {text_1}\n")
print(f"Summary: {summary_1['summary_text']}\n")

# --- Second text ---
text_2 = (
    "Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, "
    "learn from large amounts of data. Like a human, the algorithm learns from examples. While traditional machine learning "
    "algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction."
)

summary_2 = summarizer(text_2, max_length=60, min_length=30, do_sample=False)[0]
print("=== Summarization Example 2 ===")
print(f"Original Text: {text_2}\n")
print(f"Summary: {summary_2['summary_text']}")


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Your max_length is set to 50, but your input_length is only 46. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)


=== Summarization Example 1 ===
Original Text: Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.

Summary:  Machine learning is a method of data analysis that automates analytical model building . It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention .

=== Summarization Example 2 ===
Original Text: Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Like a human, the algorithm learns from examples. While traditional machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction.

Summary:  Deep learning is a 

### Exercise 8: Translation

 - Translate the sentence "Bonjour tout le monde." from French to English.
 - Translate the sentence "La technologie transforme notre monde." from French to English.

In [None]:
from transformers import pipeline

# Create the translation pipeline (French -> English)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

# --- First text ---
text_1 = "Bonjour tout le monde."
translation_1 = translator(text_1, max_length=40)[0]
print("=== Translation Example 1 ===")
print(f"Original Text: {text_1}")
print(f"Translation: {translation_1['translation_text']}\n")

# --- Second text ---
text_2 = "La technologie transforme notre monde."
translation_2 = translator(text_2, max_length=50)[0]
print("=== Translation Example 2 ===")
print(f"Original Text: {text_2}")
print(f"Translation: {translation_2['translation_text']}")


Device set to use cpu


=== Translation Example 1 ===
Original Text: Bonjour tout le monde.
Translation: Hello, everybody.

=== Translation Example 2 ===
Original Text: La technologie transforme notre monde.
Translation: Technology is transforming our world.


### Exercise 9: Comprehensive NLP Task

This exercise will combine sentiment analysis, zero-shot classification, and text generation.

 - Perform sentiment analysis on the following sentences: "I love the new features of this product." and "This is the worst experience I've ever had."
 - Classify the following text into one of the candidate labels: "technology", "customer service", "product quality": "The new smartphone has several innovative features that are very user-friendly."
 - Generate text with the prompt "Based on the customer feedback, we can improve our product by" with a maximum length of 50 tokens.

In [None]:
from transformers import pipeline

# --- 1️⃣ Sentiment Analysis ---
sentiment_analyzer = pipeline("sentiment-analysis")

sentences = [
    "I love the new features of this product.",
    "This is the worst experience I've ever had."
]

print("=== Sentiment Analysis ===")
sentiment_results = sentiment_analyzer(sentences)
for text, result in zip(sentences, sentiment_results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")


# --- 2️⃣ Zero-Shot Classification ---
zero_shot_classifier = pipeline("zero-shot-classification")

text_to_classify = "The new smartphone has several innovative features that are very user-friendly."
candidate_labels = ["technology", "customer service", "product quality"]

classification_result = zero_shot_classifier(text_to_classify, candidate_labels=candidate_labels)
print("=== Zero-Shot Classification ===")
print(f"Text: {text_to_classify}")
print(f"Classification: {classification_result}\n")


# --- 3️⃣ Text Generation ---
text_generator = pipeline("text-generation", model="distilgpt2")

prompt = "Based on the customer feedback, we can improve our product by"
generated_texts = text_generator(prompt, max_length=50, num_return_sequences=2)

print("=== Text Generation ===")
for i, gen in enumerate(generated_texts):
    print(f"Generated Text {i+1}: {gen['generated_text']}")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


=== Sentiment Analysis ===


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Text: I love the new features of this product.
Sentiment: POSITIVE, Score: 0.9999

Text: This is the worst experience I've ever had.
Sentiment: NEGATIVE, Score: 0.9998



Device set to use cpu


=== Zero-Shot Classification ===
Text: The new smartphone has several innovative features that are very user-friendly.
Classification: {'sequence': 'The new smartphone has several innovative features that are very user-friendly.', 'labels': ['technology', 'product quality', 'customer service'], 'scores': [0.8073760867118835, 0.1833198368549347, 0.009304080158472061]}



Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


=== Text Generation ===
Generated Text 1: Based on the customer feedback, we can improve our product by releasing a product that features a more optimized UI, more features, and better functionality.”

Generated Text 2: Based on the customer feedback, we can improve our product by enhancing our customer service and providing more customer service to customers.


### Exercise 10: Multi-Function NLP Task

This exercise will combine named entity recognition (NER), question answering, and text summarization.

 - Identify entities in the sentence: "Sundar Pichai is the CEO of Google, which is headquartered in Mountain View, California."
 - Answer the question "Where is Google headquartered?" given the context "Sundar Pichai is the CEO of Google, which is headquartered in Mountain View, California."
 - Summarize the following text: "Google, a multinational technology company specializing in Internet-related services and products, was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."

In [None]:
from transformers import pipeline

# --- 1️⃣ Named Entity Recognition (NER) ---
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
ner_text = "Sundar Pichai is the CEO of Google, which is headquartered in Mountain View, California."
ner_results = ner_pipeline(ner_text)

print("=== Named Entity Recognition ===")
for entity in ner_results:
    print(entity)
print("\n")


# --- 2️⃣ Question Answering ---
qa_pipeline = pipeline("question-answering")
qa_context = "Sundar Pichai is the CEO of Google, which is headquartered in Mountain View, California."
qa_question = "Where is Google headquartered?"
qa_result = qa_pipeline(question=qa_question, context=qa_context)

print("=== Question Answering ===")
print(f"Question: {qa_question}")
print(f"Answer: {qa_result['answer']}")
print("\n")


# --- 3️⃣ Text Summarization ---
summarizer_pipeline = pipeline("summarization")
summarization_text = ("Google, a multinational technology company specializing in Internet-related services "
                      "and products, was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University.")
summary_result = summarizer_pipeline(summarization_text, max_length=50, min_length=25, do_sample=False)

print("=== Summarization ===")
print(f"Original Text: {summarization_text}")
print(f"Summary: {summary_result[0]['summary_text']}")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
No model w

=== Named Entity Recognition ===
{'entity_group': 'PER', 'score': np.float32(0.99594295), 'word': 'Sundar Pichai', 'start': 0, 'end': 13}
{'entity_group': 'ORG', 'score': np.float32(0.99887437), 'word': 'Google', 'start': 28, 'end': 34}
{'entity_group': 'LOC', 'score': np.float32(0.99463844), 'word': 'Mountain View', 'start': 62, 'end': 75}
{'entity_group': 'LOC', 'score': np.float32(0.99826956), 'word': 'California', 'start': 77, 'end': 87}




Device set to use cpu
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


=== Question Answering ===
Question: Where is Google headquartered?
Answer: Mountain View, California




Device set to use cpu
Your max_length is set to 50, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


=== Summarization ===
Original Text: Google, a multinational technology company specializing in Internet-related services and products, was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University.
Summary:  Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University . Google is a multinational technology company specializing in Internet-related services and products .
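
The log output above raises two points: the pipelines were created without an explicit model and revision, and `max_length=50` was longer than the 38-token input. The sketch below addresses both. It pins the checkpoint and revision that the log reports as the default, and derives the length limits from the tokenized input. The helper names `summarization_lengths` and `summarize` are hypothetical, and halving the input length is only an assumption borrowed from the value the warning itself suggests (`max_length=19` for 38 tokens), not a rule of the library.

```python
# Checkpoint and revision copied from the log output above
MODEL = "sshleifer/distilbart-cnn-12-6"
REVISION = "a4f8f3e"


def summarization_lengths(input_length):
    """Pick (min_length, max_length) so the summary stays shorter than the input.

    Heuristic (assumption): cap the summary at half the input token count.
    """
    max_length = max(2, input_length // 2)
    min_length = max(1, max_length // 2)
    return min_length, max_length


def summarize(text, model=MODEL, revision=REVISION):
    """Summarize `text` with a pinned model; downloads weights on first call."""
    from transformers import pipeline  # imported lazily: only needed when summarizing

    summarizer = pipeline("summarization", model=model, revision=revision)
    # Count tokens with the pipeline's own tokenizer
    input_length = len(summarizer.tokenizer(text)["input_ids"])
    min_length, max_length = summarization_lengths(input_length)
    result = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
    return result[0]["summary_text"]


# For the 38-token input reported in the warning:
print(summarization_lengths(38))  # (9, 19)
```

Calling `summarize(summarization_text)` would then run the same summarization as above with the pinned checkpoint; the resulting summary will be shorter than the one shown, since the length cap is tighter.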


In [None]:
# Upgrade nbformat and nbconvert
!pip install --upgrade nbformat nbconvert




In [None]:
notebook_path = "TR_C_3_Abdeslem.ipynb"


In [None]:
# --- 1️⃣ Upload the file from your computer ---
from google.colab import files
uploaded = files.upload()  # select your TR_C_3_Abdeslem.ipynb file

# --- 2️⃣ Load and "repair" the notebook ---
import nbformat

# Replace this with the exact name of the file you uploaded
notebook_path = "TR_C_3_Abdeslem.ipynb"

# Read the notebook (version 4 = compatible)
with open(notebook_path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

# Save in a more recent version (4.5) for GitHub
fixed_path = "TR_C_3_Abdeslem_fixed.ipynb"
with open(fixed_path, "w", encoding="utf-8") as f:  # context manager closes the file
    nbformat.write(nb, f)

print("✅ Fixed notebook saved as:", fixed_path)

# --- 3️⃣ Download the fixed file ---
files.download(fixed_path)


Saving TR_C_3_Abdeslem.ipynb to TR_C_3_Abdeslem.ipynb
✅ Fixed notebook saved as: TR_C_3_Abdeslem_fixed.ipynb
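
To check which format version a saved notebook actually declares, the file can be inspected directly, since an `.ipynb` file is JSON with top-level `nbformat` and `nbformat_minor` keys. The sketch below uses only the standard library; the helper name `notebook_format_version` and the tiny demo notebook are hypothetical and stand in for the real file.

```python
import json
import tempfile
from pathlib import Path


def notebook_format_version(path):
    """Return (major, minor) as declared in the notebook's JSON header."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["nbformat"], data["nbformat_minor"]


# Demo on a minimal hand-written notebook (hypothetical, not the course notebook)
demo = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
demo_path = Path(tempfile.mkdtemp()) / "demo.ipynb"
demo_path.write_text(json.dumps(demo), encoding="utf-8")

print(notebook_format_version(demo_path))  # (4, 5)
```

Running the same helper on the original and the re-saved notebook would show whether the declared version changed.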

