## Transformers

***1. Sentiment analysis***

In [1]:
!pip install transformers


In [3]:
from transformers import pipeline

# Multilingual review model: labels range from "1 star" to "5 stars".
nlp = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

# French input, roughly "very well received"
resultat = nlp("très Bien recu")[0]
print(resultat)
print(f"label: {resultat['label']}, with a score of: {round(resultat['score']*100, 2)}%")

# French input, roughly "very bad remark"
resultat = nlp("très mauvaise remarque")[0]
print(f"label: {resultat['label']}, with a score of: {round(resultat['score']*100, 2)}%")

{'label': '5 stars', 'score': 0.6836311221122742}
label: 5 stars, with a score of: 68.36%
label: 1 star, with a score of: 82.02%
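
The labels returned above are strings such as `5 stars` or `1 star`. A minimal sketch of how they could be turned into a numeric rating follows. The helper names `stars_from_label` and `describe` are hypothetical, and the negative/neutral/positive cut-offs are an assumption made for illustration, not something the model defines.

```python
import re

def stars_from_label(label: str) -> int:
    """Extract the integer rating from a label such as '5 stars' or '1 star'."""
    match = re.match(r"(\d+)\s+stars?$", label.strip())
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    return int(match.group(1))

def describe(result: dict) -> str:
    """Format one pipeline result as 'N/5 (polarity), confidence X%'."""
    stars = stars_from_label(result["label"])
    # Assumed mapping: 1-2 stars negative, 3 neutral, 4-5 positive.
    polarity = "negative" if stars <= 2 else "neutral" if stars == 3 else "positive"
    return f"{stars}/5 ({polarity}), confidence {result['score'] * 100:.2f}%"

# Result dictionary copied from the output of the cell above.
resultat = {'label': '5 stars', 'score': 0.6836311221122742}
print(describe(resultat))  # 5/5 (positive), confidence 68.36%
```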


***2. Text generation***

In [4]:
from transformers import pipeline

In [6]:
# French
Fr_Text = pipeline('text-generation', model='dbddv01/gpt2-french-small')

print(Fr_Text("je dance avec", max_length=50, do_sample=False))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "je dance avec le groupe The Pogues, qui joue sur la scène de la chanson. Le groupe sort en single le 11 avril 2009. Le single est le premier single de l'album, et est suivi par le second single, le 11 avril"}]


In [7]:
# Arabic
Ar_Text = pipeline('text-generation', model='akhooli/gpt2-small-arabic')
print(Ar_Text("انها معلمة تاريخية", max_length=50, do_sample=False))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'انها معلمة تاريخية في مدرسة القرية، وهي مدرسة ابتدائية للبنين، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية للبنات، ومدرسة ابتدائية'}]
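
With `do_sample=False` the model decodes greedily, and the Arabic output above falls into a loop that repeats the same phrase. The sketch below is one simple way to quantify such repetition in plain Python. The function name `repeated_ngram_ratio` is hypothetical, and the toy strings stand in for real generations. Generation arguments such as `do_sample=True` or `no_repeat_ngram_size` may reduce the looping, but they were not tested here.

```python
def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = no repetition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Toy strings used only to illustrate the metric.
looping = "a b c a b c a b c"
varied = "a b c d e f g h i"

print(repeated_ngram_ratio(looping))  # high ratio: the same trigrams keep coming back
print(repeated_ngram_ratio(varied))   # 0.0: every trigram is unique
```

Applied to the `generated_text` field of a pipeline result, a high ratio would flag outputs like the Arabic one for regeneration with different settings.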


***3. Named entity recognition (NER)***

In [8]:
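from transformers import pipeline

# The tokenizer defaults to the model's own, so passing it explicitly is optional.
ENG = pipeline("ner", model="dslim/bert-base-NER", tokenizer="dslim/bert-base-NER")
AR = pipeline("ner", model="hatmimoha/arabic-ner", tokenizer="hatmimoha/arabic-ner")
# Note: judging by its name and the tags it returns (V, P, ...), this model is a
# part-of-speech tagger rather than a named-entity model.
FR = pipeline("ner", model="gilf/french-postag-model", tokenizer="gilf/french-postag-model")

print(AR("انها معلمة تاريخية"))

# French text is passed to the English model here; it finds no entities and returns [].
print(ENG("tout est bien"))

print(FR("je dance avec"))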

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity': 'LABEL_12', 'score': 0.9995168, 'index': 1, 'word': 'انها', 'start': 0, 'end': 4}, {'entity': 'LABEL_12', 'score': 0.9998653, 'index': 2, 'word': 'معلم', 'start': 5, 'end': 9}, {'entity': 'LABEL_12', 'score': 0.9994434, 'index': 3, 'word': '##ة', 'start': 9, 'end': 10}, {'entity': 'LABEL_12', 'score': 0.99991083, 'index': 4, 'word': 'تاريخية', 'start': 11, 'end': 18}]
[]
[{'entity': 'CLS', 'score': 0.9996672, 'index': 1, 'word': 'je', 'start': 0, 'end': 2}, {'entity': 'V', 'score': 0.8005907, 'index': 2, 'word': 'dance', 'start': 3, 'end': 8}, {'entity': 'P', 'score': 0.9998656, 'index': 3, 'word': 'avec', 'start': 9, 'end': 13}]
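
In the Arabic output above, the word `معلمة` is split into two WordPiece tokens (`معلم` and `##ة`). The following is a minimal sketch of merging such continuation pieces back into whole words. `merge_wordpieces` is a hypothetical helper, and keeping the lowest score of the merged pieces is an assumption chosen for illustration. Recent versions of `transformers` also expose an `aggregation_strategy` argument on this pipeline, which may make such post-processing unnecessary.

```python
def merge_wordpieces(tokens: list) -> list:
    """Merge WordPiece continuation tokens ('##...') into the preceding word."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            prev = merged[-1]
            prev["word"] += tok["word"][2:]   # drop the '##' prefix
            prev["end"] = tok["end"]          # extend the character span
            prev["score"] = min(prev["score"], tok["score"])  # assumed: keep the weakest score
        else:
            merged.append({k: tok[k] for k in ("entity", "word", "start", "end", "score")})
    return merged

# Tokens copied from the Arabic output above (the 'index' field is omitted).
arabic_tokens = [
    {'entity': 'LABEL_12', 'score': 0.9995168, 'word': 'انها', 'start': 0, 'end': 4},
    {'entity': 'LABEL_12', 'score': 0.9998653, 'word': 'معلم', 'start': 5, 'end': 9},
    {'entity': 'LABEL_12', 'score': 0.9994434, 'word': '##ة', 'start': 9, 'end': 10},
    {'entity': 'LABEL_12', 'score': 0.99991083, 'word': 'تاريخية', 'start': 11, 'end': 18},
]

for item in merge_wordpieces(arabic_tokens):
    print(item["word"], item["entity"], item["start"], item["end"])
```

This sketch keeps the label of the first piece and does not check that continuation pieces carry the same label.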


***4. Question answering***

In [9]:
from transformers import pipeline
Answer = pipeline("question-answering")
context = """
Lionel Messi, parfois surnommé Leo Messi, né le 24 juin 1987 à Rosario en Argentine, est un footballeur international argentin évoluant au poste d'attaquant au Paris Saint-Germain, après avoir joué au FC Barcelone..
"""
qst = "Qui est Lionel Messi?"
resultat = Answer(question=qst, context=context)
print("Answer:", resultat['answer'])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)

Answer: né le 24 juin 1987 à Rosario en Argentine
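
The answer printed above is extractive: it is a span copied verbatim from the context, not newly generated text. A small sketch that makes this visible by locating the span and marking it inside the context follows. `highlight_answer` is a hypothetical helper, and the context is shortened for readability.

```python
def highlight_answer(context: str, answer: str, marker: str = "**") -> str:
    """Wrap the first occurrence of `answer` inside `context` with `marker`."""
    start = context.find(answer)
    if start == -1:
        return context  # answer not found verbatim; leave the context unchanged
    end = start + len(answer)
    return context[:start] + marker + answer + marker + context[end:]

# Shortened version of the context used above, and the answer the pipeline returned.
context = ("Lionel Messi, parfois surnommé Leo Messi, né le 24 juin 1987 à Rosario "
           "en Argentine, est un footballeur international argentin")
answer = "né le 24 juin 1987 à Rosario en Argentine"

print(highlight_answer(context, answer))
```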


***5. Filling masked text***

In [10]:
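from pprint import pprint

from transformers import pipeline

# No model is specified, so the pipeline falls back to its default (see the message below).
NLP = pipeline("fill-mask")

# The mask token differs between models, so it is read from the tokenizer instead of being hard-coded.
pprint(NLP(f"Mohammed VI, né le 21 août 1963 à Rabat, est  {NLP.tokenizer.mask_token} et le troisième à porter le titre de roi du Maroc."))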

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'score': 0.3971644937992096,
  'sequence': 'Mohammed VI, né le 21 août 1963 à Rabat, est ét et le troisième '
              'à porter le titre de roi du Maroc.',
  'token': 10221,
  'token_str': 'ét'},
 {'score': 0.08196265250444412,
  'sequence': 'Mohammed VI, né le 21 août 1963 à Rabat, est iced et le '
              'troisième à porter le titre de roi du Maroc.',
  'token': 12646,
  'token_str': 'iced'},
 {'score': 0.060506634414196014,
  'sequence': 'Mohammed VI, né le 21 août 1963 à Rabat, est és et le troisième '
              'à porter le titre de roi du Maroc.',
  'token': 5739,
  'token_str': 'és'},
 {'score': 0.056976623833179474,
  'sequence': 'Mohammed VI, né le 21 août 1963 à Rabat, est ident et le '
              'troisième à porter le titre de roi du Maroc.',
  'token': 8009,
  'token_str': 'ident'},
 {'score': 0.03867478668689728,
  'sequence': 'Mohammed VI, né le 21 août 1963 à Rabat, est ieri et le '
              'troisième à porter le titre de roi du Maroc.',
  't

In [11]:
# Arabic
from transformers import pipeline
arabic_fill_mask = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-ca')
pprint(arabic_fill_mask(" ‏بسم [MASK]‬ الرحمن الرحيم ."))

Some weights of the model checkpoint at CAMeL-Lab/bert-base-camelbert-ca were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.999610960483551,
  'sequence': 'بسم الله الرحمن الرحيم.',
  'token': 1953,
  'token_str': 'الله'},
 {'score': 0.00019928190158680081,
  'sequence': 'بسم الرحمن الرحمن الرحيم.',
  'token': 4289,
  'token_str': 'الرحمن'},
 {'score': 3.6494100640993565e-05,
  'sequence': 'بسم لله الرحمن الرحيم.',
  'token': 2784,
  'token_str': 'لله'},
 {'score': 2.092821705446113e-05,
  'sequence': 'بسم اله الرحمن الرحيم.',
  'token': 2090,
  'token_str': 'اله'},
 {'score': 9.44888870435534e-06,
  'sequence': 'بسم اللهم الرحمن الرحيم.',
  'token': 2168,
  'token_str': 'اللهم'}]
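
The fill-mask pipeline returns several candidates with very different scores, so it can help to keep only the confident ones. A minimal sketch follows. `confident_fills` is a hypothetical helper and the 0.5 threshold is an arbitrary assumption.

```python
def confident_fills(predictions: list, threshold: float = 0.5) -> list:
    """Return the candidate tokens whose score is at least `threshold`."""
    return [p["token_str"] for p in predictions if p["score"] >= threshold]

# Scores and tokens copied from the Arabic output above (other fields omitted).
predictions = [
    {"score": 0.999610960483551, "token_str": "الله"},
    {"score": 0.00019928190158680081, "token_str": "الرحمن"},
    {"score": 3.6494100640993565e-05, "token_str": "لله"},
    {"score": 2.092821705446113e-05, "token_str": "اله"},
    {"score": 9.44888870435534e-06, "token_str": "اللهم"},
]

print(confident_fills(predictions))
```

With the same threshold, the French example would return no candidate at all, since its best score is about 0.40.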


***6. Summarization***

In [12]:
from transformers import pipeline

debrief = pipeline("summarization")

TEXTE = """ 
Mohammed VI (en arabe marocain : محمد السادس, en berbère marocain : ⵎⵓⵃⵎⵎⴷ ⵡⵉⵙⵙ ⵚⴹⵉⵚ), né le 21 août 1963 à Rabat (Maroc), est le vingt-troisième monarque de la dynastie alaouite, et le troisième à porter le titre de roi du Maroc, depuis le 23 juillet 1999.
"""
print(debrief(TEXTE, max_length=130, min_length=30, do_sample=False))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Mohammed VI, né le 21 août 1963 à Rabat (Maroc), est le vingt-troisième monarque de la dynastie alaouite . He is the title de roi du Maroc, depuis le 23 juillet 1999 .'}]


***7. Translation***

In [13]:
from pprint import pprint

from transformers import pipeline

# English to French
trans = pipeline("translation_en_to_fr")

# max_length=40 is shorter than the input, so the translation below is cut off (see the warning in the output).
pprint(trans("The king initially introduced reforms to grant women more power. Leaked diplomatic cables from WikiLeaks have alleged extensive corruption in the court of King Mohammed VI, implicating the king and his closest advisors.", max_length=40))

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


Your input_length: 53 is bigger than 0.9 * max_length: 40. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


[{'translation_text': 'Le roi a initialement introduit des réformes pour '
                      'conférer plus de pouvoir aux femmes. Des câbles '
                      'diplomatiques divulgués par WikiLeak'}]
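
The translation above stops mid-word because the input is longer than `max_length` allows. One possible workaround, sketched here under assumptions, is to split the text into sentences and translate them one at a time. `split_sentences` and `translate_by_sentence` are hypothetical helpers, the splitter is deliberately naive, and a long sentence could still exceed the limit.

```python
import re

def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_by_sentence(text: str, translate) -> str:
    """Apply `translate` (any callable str -> str) to each sentence and rejoin the results."""
    return " ".join(translate(s) for s in split_sentences(text))

text = ("The king initially introduced reforms to grant women more power. "
        "Leaked diplomatic cables from WikiLeaks have alleged extensive corruption "
        "in the court of King Mohammed VI, implicating the king and his closest advisors.")

parts = split_sentences(text)
print(len(parts))  # 2
for part in parts:
    print("-", part)
```

With the pipeline defined above, the callable could be something like `lambda s: trans(s, max_length=100)[0]["translation_text"]`, where the value 100 is only an illustrative choice.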


***8. Feature extraction***

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

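MOTS = [
    "it is very soft and kind",
    "you should see it ",
    "You are a very handsome guy and good teacher, teacher.",
    "You can be a good father in the future",
    "You are smart as ELON MUSK",
]

# Drop common English stop words ("it", "is", "you", ...) before building the vocabulary.
vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary from the sentences.
vectorizer.fit(MOTS)

# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions use get_feature_names_out() instead.
vectorizer.get_feature_names()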

['elon',
 'father',
 'future',
 'good',
 'guy',
 'handsome',
 'kind',
 'musk',
 'smart',
 'soft',
 'teacher']
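
The cell above only lists the learned vocabulary. To show what the resulting count features look like, here is a pure-Python sketch of the same bag-of-words idea. It is not scikit-learn's implementation: the `STOP_WORDS` set is a small assumption that happens to cover these five sentences (the library's built-in English list is much longer), and `tokenize` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed stop-word set, just large enough for the sentences below.
STOP_WORDS = {"it", "is", "very", "and", "you", "should", "see", "are",
              "can", "be", "in", "the", "as"}

MOTS = [
    "it is very soft and kind",
    "you should see it ",
    "You are a very handsome guy and good teacher, teacher.",
    "You can be a good father in the future",
    "You are smart as ELON MUSK",
]

def tokenize(doc: str) -> list:
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    return [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in STOP_WORDS]

# Sorted vocabulary over all sentences.
vocabulary = sorted({t for doc in MOTS for t in tokenize(doc)})

# One row per sentence, one column per vocabulary term, holding term counts.
matrix = []
for doc in MOTS:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The third row has a 2 in the `teacher` column because that word appears twice in the sentence, and the second row is all zeros because every word in it is a stop word.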