# **TP3: Using Transformer architectures from Hugging Face for the NLP tasks covered in the course**

- Sentiment analysis: is a text positive or negative?
- Text generation: provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, animal, ...).
- Question answering: provide the model with some context and a question, and extract the answer from the context.
- Filling masked text: given a text with masked words, fill in the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.


---
**Atourabi Fatima zahra**


## **Sentiment analysis**

In [1]:
! pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-1.10.2-py3-none-any.whl (542 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.9.0-py3-none-any.whl (2.6 MB)
Collecting tqdm>=4.42
  Downloading tqdm-4.61.2-py2.py3-none-any.whl (76 kB)
Collecting fsspec>=2021.05.0
  Downloading fsspec-2021.7.0-py3-none-any.whl (118 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting huggingface-hub<0.1.0
  Downloading huggingface_hub-0.0.14-py3-none-any.whl (43 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)

In [2]:
from transformers import pipeline

We use the **nlptown/bert-base-multilingual-uncased-sentiment** model for sentiment analysis on French text (it also supports English, Dutch, German, Spanish and Italian).
It predicts the sentiment of a review as a number of stars (between 1 and 5).

In [10]:
sentiment_nlp = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [11]:
results = sentiment_nlp(["Noa peut être agaçante ",
           "mais c'est un super chat."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: 2 stars, with score: 0.4103
label: 4 stars, with score: 0.3874


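The first sentence is classified as rather negative (2 stars) and the second as rather positive (4 stars). The `score` values (0.41 and 0.39) are the model's confidence in the predicted label, not the star rating itself.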
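The star labels can be turned into a coarse polarity with a small helper. This is a minimal sketch: the function name `stars_to_polarity` and the thresholds (1 to 2 stars negative, 3 neutral, 4 to 5 positive) are assumptions chosen for illustration, not something defined by the model.

```python
def stars_to_polarity(label: str) -> str:
    """Map a label such as '4 stars' to 'negative', 'neutral' or 'positive'.

    The thresholds are an illustrative assumption, not part of the model.
    """
    stars = int(label.split()[0])  # '1 star' -> 1, '4 stars' -> 4
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Labels copied from the pipeline output shown in this section.
for label in ["2 stars", "4 stars"]:
    print(label, "->", stars_to_polarity(label))
```

With a live pipeline, the same helper could be applied to `result['label']` inside the loop that prints the results.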


## **Text generation**

I use the **antoiloui/belgpt2** model to generate French text.

In [12]:
generator_nlp = pipeline("text-generation",model='antoiloui/belgpt2')
generator_nlp("Mon objectif est ", max_length=50, do_sample=False)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Mon objectif est ˆtre de vous aider dans votre recherche de stage . Il est donc important de bien choisir son matériel de ski . Il est donc important de bien choisir son matériel de ski . Il est donc important de bien choisir son matériel de'}]

I use the **aubmindlab/aragpt2-base** model to generate an Arabic example.

In [10]:
generator_ar = pipeline("text-generation", model="aubmindlab/aragpt2-base")
generator_ar("القدس مدينة تاريخية", max_length=50, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


[{'generated_text': 'القدس مدينة تاريخية إسلامية ، يقصدها المسلمون من جميع أنحاء العالم لأداء فريضة الحج في مكة المكرمة والمدينة المنورة على مدار العام ؛ لما لها من المكانة والجلالة من مكانة بين المسجد النبوي والمسجد الأقصى المبارك وهي من أقدس الأماكن في الإسلام وقد ورد اسمها في'}]

## **Named entity recognition (NER)**

The 9 token classes used in NER:

- **O**, outside of a named entity

- **B-MISC**, beginning of a miscellaneous entity right after another miscellaneous entity

- **I-MISC**, miscellaneous entity

- **B-PER**, beginning of a person’s name right after another person’s name

- **I-PER**, person’s name

- **B-ORG**, beginning of an organisation right after another organisation

- **I-ORG**, organisation

- **B-LOC**, beginning of a location right after another location

- **I-LOC**, location
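Each tag in the list above combines a position prefix (`B` or `I`) with an entity type, while `O` stands alone. A minimal sketch of how such a tag can be split is shown here; the helper name `split_tag` is hypothetical and introduced only for illustration.

```python
from typing import Optional, Tuple


def split_tag(tag: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a tag such as 'I-PER' into (prefix, entity_type).

    'O' marks a token outside any entity, so it has neither part.
    """
    if tag == "O":
        return None, None
    prefix, entity_type = tag.split("-", 1)
    return prefix, entity_type


for tag in ["O", "I-PER", "B-LOC"]:
    print(tag, "->", split_tag(tag))
```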

In [13]:
# No model is specified, so the pipeline falls back to its default NER checkpoint.
ner_nlp = pipeline('ner')

In [14]:
ner_nlp("Je suis Hajar, j'habite à Skhirat")

[{'end': 10,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.9958343,
  'start': 8,
  'word': 'Ha'},
 {'end': 13,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.8763645,
  'start': 10,
  'word': '##jar'},
 {'end': 27,
  'entity': 'I-LOC',
  'index': 12,
  'score': 0.95819896,
  'start': 26,
  'word': 'S'},
 {'end': 29,
  'entity': 'I-LOC',
  'index': 13,
  'score': 0.8969286,
  'start': 27,
  'word': '##kh'},
 {'end': 32,
  'entity': 'I-LOC',
  'index': 14,
  'score': 0.7850835,
  'start': 29,
  'word': '##ira'},
 {'end': 33,
  'entity': 'I-LOC',
  'index': 15,
  'score': 0.9063476,
  'start': 32,
  'word': '##t'}]
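The output above is reported per sub-word token: `Ha` + `##jar` and `S` + `##kh` + `##ira` + `##t` are WordPiece fragments of single words. A minimal sketch of how these fragments could be merged back into whole entities, using the `start`/`end` character offsets, is shown here. The helper `merge_word_pieces` is hypothetical, and keeping the minimum score of the fragments is an arbitrary choice made for illustration.

```python
def merge_word_pieces(tokens):
    """Merge WordPiece continuations ('##...') into whole-word entities."""
    entities = []
    for tok in tokens:
        entity_type = tok["entity"].split("-")[-1]  # 'I-PER' -> 'PER'
        previous = entities[-1] if entities else None
        continues_previous = (
            previous is not None
            and tok["word"].startswith("##")
            and previous["type"] == entity_type
            and previous["end"] == tok["start"]  # fragments must be adjacent
        )
        if continues_previous:
            previous["word"] += tok["word"][2:]  # drop the '##' marker
            previous["end"] = tok["end"]
            previous["score"] = min(previous["score"], tok["score"])
        else:
            entities.append({
                "word": tok["word"],
                "type": entity_type,
                "start": tok["start"],
                "end": tok["end"],
                "score": tok["score"],
            })
    return entities


# Values copied from the pipeline output shown in this section.
sentence = "Je suis Hajar, j'habite à Skhirat"
ner_output = [
    {"word": "Ha", "entity": "I-PER", "score": 0.9958343, "start": 8, "end": 10},
    {"word": "##jar", "entity": "I-PER", "score": 0.8763645, "start": 10, "end": 13},
    {"word": "S", "entity": "I-LOC", "score": 0.95819896, "start": 26, "end": 27},
    {"word": "##kh", "entity": "I-LOC", "score": 0.8969286, "start": 27, "end": 29},
    {"word": "##ira", "entity": "I-LOC", "score": 0.7850835, "start": 29, "end": 32},
    {"word": "##t", "entity": "I-LOC", "score": 0.9063476, "start": 32, "end": 33},
]

merged = merge_word_pieces(ner_output)
for entity in merged:
    print(entity["type"], sentence[entity["start"]:entity["end"]])
```

Depending on the installed version of `transformers`, the NER pipeline may also accept an `aggregation_strategy` argument that performs a similar grouping internally; the manual version is kept here because it only relies on the fields visible in the output.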

## **Question answering**

In [28]:
context=("Etalab est une administration publique française qui fait notamment office de Chief Data Officer de l'État et coordonne la conception et la mise en œuvre de sa stratégie dans le domaine de la donnée (ouverture et partage des données publiques ou open data, exploitation des données et intelligence artificielle...). Ainsi, Etalab développe et maintient le portail des données ouvertes du gouvernement français data.gouv.fr. Etalab promeut également une plus grande ouverture l'administration sur la société (gouvernement ouvert) : transparence de l'action publique, innovation ouverte, participation citoyenne... elle promeut l’innovation, l’expérimentation, les méthodes de travail ouvertes, agiles et itératives, ainsi que les synergies avec la société civile pour décloisonner l’administration et favoriser l’adoption des meilleures pratiques professionnelles dans le domaine du numérique. À ce titre elle étudie notamment l’opportunité de recourir à des technologies en voie de maturation issues du monde de la recherche. Cette entité chargée de l'innovation au sein de l'administration doit contribuer à l'amélioration du service public grâce au numérique. Elle est rattachée à la Direction interministérielle du numérique, dont les missions et l’organisation ont été fixées par le décret du 30 octobre 2019.  Dirigé par Laure Lucchesi depuis 2016, elle rassemble une équipe pluridisciplinaire d'une trentaine de personnes.")




1.   **Test on a French example**




In [29]:
QA_nlp= pipeline('question-answering', model="etalab-ia/camembert-base-squadFR-fquad-piaf")

In [30]:
result = QA_nlp(question="Comment s'appelle le portail open data du gouvernement ?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: ' data.gouv.fr.', score: 0.9959, start: 409, end: 423


In [31]:
result = QA_nlp(question="c'est quoi la Etalab", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: ' administration publique française', score: 0.4399, start: 14, end: 48
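The `start` and `end` values in these results are character offsets into `context`, which is why the answers carry a leading space (and, for the first one, a trailing period). A minimal sketch illustrating this on the first sentence of the context; the helper `clean_answer` is hypothetical and only strips surrounding whitespace and trailing punctuation.

```python
def clean_answer(answer: str) -> str:
    """Strip surrounding whitespace and trailing sentence punctuation."""
    return answer.strip().rstrip(".,;:!?")


# Beginning of the context used in this section.
context_prefix = (
    "Etalab est une administration publique française qui fait notamment "
    "office de Chief Data Officer de l'État"
)

# Offsets reported by the pipeline for the second question.
start, end = 14, 48
raw_span = context_prefix[start:end]

print(repr(raw_span))
print(repr(clean_answer(raw_span)))
print(repr(clean_answer(" data.gouv.fr.")))
```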




2.   **Test on an Arabic example**




In [33]:
text =("المَغْرِبُ رسميًا المَمْلَكَةُ المَغْرِبِيَّةُ  هي دولة إسلامية تقع في أقصى غرب شمال أفريقيا، عاصمتها الرباط وأكبر مدنها الدار البيضاء؛ تُطل على البحر المتوسط شمالًا والمحيط الأطلسي غربًا، وتحدها الجزائر شرقًا وموريتانيا جنوبًا؛ وفي الشريط البحري الضيق الفاصل بين المغرب وإسبانيا ثلاث مكتنفات متنازع عليها بين البلدين وهي سبتة ومليلية وصخرة قميرة.")
QA_ar=pipeline("question-answering", model="wissamantoun/araelectra-base-artydiqa")
result = QA_ar(question="ما هي عاصمة المغرب؟", context=text)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'الرباط', score: 0.9976, start: 102, end: 108


## **Filling masked text**

* **French example**





In [21]:
mask_nlp= pipeline('fill-mask')
mask_nlp('Je pense que vous <mask> faire ce tp')

[{'score': 0.1626989245414734,
  'sequence': 'Je pense que vous les faire ce tp',
  'token': 7427,
  'token_str': ' les'},
 {'score': 0.09955654293298721,
  'sequence': 'Je pense que vous faire faire ce tp',
  'token': 25241,
  'token_str': ' faire'},
 {'score': 0.07789743691682816,
  'sequence': 'Je pense que vous le faire ce tp',
  'token': 2084,
  'token_str': ' le'},
 {'score': 0.040745314210653305,
  'sequence': 'Je pense que vous pas faire ce tp',
  'token': 6977,
  'token_str': ' pas'},
 {'score': 0.036333516240119934,
  'sequence': 'Je pense que vous ne faire ce tp',
  'token': 3087,
  'token_str': ' ne'}]
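Note that the mask token depends on the checkpoint: the French example above uses `<mask>`, while the Arabic example in this section uses `[MASK]`. A minimal sketch of building the prompt from a neutral placeholder is shown here; the `{mask}` placeholder and the helper `build_masked_prompt` are assumptions introduced for illustration. With a loaded pipeline, the token can be read from `mask_nlp.tokenizer.mask_token` instead of being hard-coded.

```python
def build_masked_prompt(template: str, mask_token: str) -> str:
    """Replace the '{mask}' placeholder with the model-specific mask token."""
    if "{mask}" not in template:
        raise ValueError("template must contain the '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


template = "Je pense que vous {mask} faire ce tp"

# Mask tokens as they appear in the two examples of this section.
for mask_token in ["<mask>", "[MASK]"]:
    print(build_masked_prompt(template, mask_token))
```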

* **Arabic example**

In [3]:
mask_nlp= pipeline('fill-mask', model="aubmindlab/bert-large-arabertv2")

Some weights of the model checkpoint at aubmindlab/bert-large-arabertv2 were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
mask_nlp(' عاصمة لبنان هي [MASK] . ')

[{'score': 0.9751732349395752,
  'sequence': 'عاصمة لبنان هي بيروت.',
  'token': 1875,
  'token_str': 'بيروت'},
 {'score': 0.0105501813814044,
  'sequence': 'عاصمة لبنان هي طرابلس.',
  'token': 3472,
  'token_str': 'طرابلس'},
 {'score': 0.0028214568737894297,
  'sequence': 'عاصمة لبنان هي دمشق.',
  'token': 2314,
  'token_str': 'دمشق'},
 {'score': 0.0026184851303696632,
  'sequence': 'عاصمة لبنان هي بعبدا.',
  'token': 9701,
  'token_str': 'بعبدا'},
 {'score': 0.0018964111804962158,
  'sequence': 'عاصمة لبنان هي جونيه.',
  'token': 20732,
  'token_str': 'جونيه'}]

## **Summarization**

In [22]:
summarizer = pipeline('summarization', model="moussaKam/barthez-orangesum-abstract")

In [23]:
article ="Citant les préoccupations de ses clients dénonçant des cas de censure après la suppression du compte de Trump, un fournisseur d'accès Internet de l'État de l'Idaho a décidé de bloquer Facebook et Twitter. La mesure ne concernera cependant que les clients mécontents de la politique de ces réseaux sociaux."

In [24]:
summarizer(article, max_length=50, min_length=30, do_sample=False)

[{'summary_text': "Un fournisseur d'accès Internet de l'Idaho a décidé de bloquer Facebook et Twitter. La mesure ne concernera que les clients mécontents de la politique des réseaux sociaux."}]

## **Translation**

- Translation from English to French

In [25]:
trans = pipeline("translation_en_to_fr")

In [26]:
result = trans("A scientific article is a piece of writing that reports the findings of a scientific experiment. Scientists use these types of articles to inform other scientists, as well as regular people, about their discoveries.")
print(result)

[{'translation_text': "Un article scientifique est une uvre qui fait état des rsultats d'une exprience scientifique. Les scientifiques utilisent ce genre d'articles pour informer d'autres scientifiques et des gens ordinaires de leurs dcouvertes."}]


- Translation from English to Arabic using the **Helsinki-NLP/opus-mt-en-ar** model

In [27]:
trans_ar= pipeline('translation', model="Helsinki-NLP/opus-mt-en-ar")
trans_ar(r"""A scientific article is a piece of writing that reports the findings of a scientific experiment. Scientists use these types of articles to inform.scientific experiment. Scientists use these types of articles to inform
other scientists, as well as regular people, about their discoveries.""")

[{'translation_text': 'المادة العلمية هي جزء من الكتابة التي تبلغ عن نتائج تجربة علمية. يستخدم العلماء هذه الأنواع من المقالات في الإعلام. التجربة العلمية. يستخدم العلماء هذه الأنواع من المقالات لإبلاغ علماء آخرين، فضلاً عن أشخاص عاديين، عن اكتشافاتهم.'}]