Examples of using Transformer models through the Hugging Face interface.
The Transformers library must be installed first (see https://huggingface.co/docs/transformers/installation).
Let us start with a minimal example that uses a pipeline for sentiment analysis.

In [1]:
from transformers import pipeline
import torch
print(torch.__version__)
import torch.nn.functional as F

2.0.1


The pipeline() function loads a pretrained model for a given task, along with a matching tokenizer.

In [2]:
classifieur = pipeline("sentiment-analysis")
texte1 = "I like going to the movies!"
resultat = classifieur(texte1)
print(resultat)
texte2 = "I hate waiting when I call a customer service number."
resultat = classifieur(texte2)
print(resultat)

[{'label': 'POSITIVE', 'score': 0.9996259212493896}]
[{'label': 'NEGATIVE', 'score': 0.9978412985801697}]


In [3]:
liste_input = [texte1,texte2]
classifieur(liste_input)

[{'label': 'POSITIVE', 'score': 0.9996259212493896},
 {'label': 'NEGATIVE', 'score': 0.9978412985801697}]

In [4]:
# Note: recent Transformers releases deprecate return_all_scores in favor of top_k=None.
classifieur_tout_score = pipeline("sentiment-analysis",return_all_scores=True)
classifieur_tout_score(liste_input)

[[{'label': 'NEGATIVE', 'score': 0.0003740561078302562},
  {'label': 'POSITIVE', 'score': 0.9996259212493896}],
 [{'label': 'NEGATIVE', 'score': 0.9978412985801697},
  {'label': 'POSITIVE', 'score': 0.0021586893126368523}]]
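
The single-label results shown earlier are the highest-scoring entry of this full list. The sketch below makes that reduction explicit in plain Python. It copies the scores printed above and uses a hypothetical helper, `meilleure_etiquette`, that is not part of the Transformers library.

```python
# Scores copied from the output above (one list of label/score dicts per input text).
tous_scores = [
    [{'label': 'NEGATIVE', 'score': 0.0003740561078302562},
     {'label': 'POSITIVE', 'score': 0.9996259212493896}],
    [{'label': 'NEGATIVE', 'score': 0.9978412985801697},
     {'label': 'POSITIVE', 'score': 0.0021586893126368523}],
]

def meilleure_etiquette(scores):
    """Hypothetical helper: return the label/score dict with the highest score."""
    return max(scores, key=lambda d: d['score'])

meilleurs = [meilleure_etiquette(s) for s in tous_scores]
print(meilleurs)

# The scores of each text sum to 1 (up to rounding), as expected after a softmax.
sommes = [sum(d['score'] for d in s) for s in tous_scores]
print(sommes)
```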

Tasks and their default models:
https://github.com/huggingface/transformers/blob/71688a8889c4df7dd6d90a65d895ccf4e33a1a56/src/transformers/pipelines.py#L2716-L2804

In [5]:
nom_modele = "distilbert-base-uncased-finetuned-sst-2-english"
classifieur = pipeline("sentiment-analysis", model=nom_modele)
classifieur(liste_input)

[{'label': 'POSITIVE', 'score': 0.9996259212493896},
 {'label': 'NEGATIVE', 'score': 0.9978412985801697}]

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
modele = AutoModelForSequenceClassification.from_pretrained(nom_modele)
modele

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [7]:
tokenizer = AutoTokenizer.from_pretrained(nom_modele)
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [8]:
classifieur = pipeline("sentiment-analysis", model=modele, tokenizer=tokenizer)
classifieur(liste_input)

[{'label': 'POSITIVE', 'score': 0.9996259212493896},
 {'label': 'NEGATIVE', 'score': 0.9978412985801697}]

In [9]:
jetons = tokenizer.tokenize(texte1)
print(jetons)

['i', 'like', 'going', 'to', 'the', 'movies', '!']


In [10]:
jetons_ids = tokenizer.convert_tokens_to_ids(jetons)
print(jetons_ids)

[1045, 2066, 2183, 2000, 1996, 5691, 999]
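
Converting tokens to ids is essentially a vocabulary lookup. The sketch below reproduces it with a toy vocabulary that contains only the seven entries printed above, plus `[UNK]`. The id 100 for `[UNK]` and the helper `jetons_vers_ids` are assumptions made for illustration; the real tokenizer holds 30522 entries.

```python
# Toy vocabulary: only the entries printed above, plus [UNK] (id assumed to be 100).
vocabulaire_jouet = {
    '[UNK]': 100,
    'i': 1045, 'like': 2066, 'going': 2183, 'to': 2000,
    'the': 1996, 'movies': 5691, '!': 999,
}

def jetons_vers_ids(jetons, vocabulaire):
    """Hypothetical stand-in for convert_tokens_to_ids: lookup with an [UNK] fallback."""
    return [vocabulaire.get(jeton, vocabulaire['[UNK]']) for jeton in jetons]

ids = jetons_vers_ids(['i', 'like', 'going', 'to', 'the', 'movies', '!'], vocabulaire_jouet)
print(ids)

# A token missing from the vocabulary falls back to the [UNK] id.
ids_inconnu = jetons_vers_ids(['i', 'like', 'zzz'], vocabulaire_jouet)
print(ids_inconnu)
```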


In [11]:
jetons = tokenizer.tokenize(texte2)
print(jetons)

['i', 'hate', 'waiting', 'when', 'i', 'call', 'a', 'customer', 'service', 'number', '.']


In [12]:
jetons_ids = tokenizer.convert_tokens_to_ids(jetons)
print(jetons_ids)

[1045, 5223, 3403, 2043, 1045, 2655, 1037, 8013, 2326, 2193, 1012]


In [13]:
tokenizer(texte1)

{'input_ids': [101, 1045, 2066, 2183, 2000, 1996, 5691, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
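
Calling the tokenizer directly returns the ids from In [10] wrapped in two extra ids, 101 and 102, which correspond to the `[CLS]` and `[SEP]` special tokens listed in the tokenizer description. The following plain-Python sketch rebuilds the same dictionary by hand; the constant names are chosen here for illustration.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], read off the output above

jetons_ids = [1045, 2066, 2183, 2000, 1996, 5691, 999]  # from In [10]

# Wrap the token ids with the special tokens.
input_ids = [CLS_ID] + jetons_ids + [SEP_ID]
# No padding here, so every position is marked as a real token.
attention_mask = [1] * len(input_ids)

encodage = {'input_ids': input_ids, 'attention_mask': attention_mask}
print(encodage)
```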

In [14]:
lot_entree = tokenizer(liste_input,padding=True,truncation=True,max_length=512, return_tensors="pt")
print(lot_entree)

{'input_ids': tensor([[ 101, 1045, 2066, 2183, 2000, 1996, 5691,  999,  102,    0,    0,    0,
            0],
        [ 101, 1045, 5223, 3403, 2043, 1045, 2655, 1037, 8013, 2326, 2193, 1012,
          102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
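
With padding=True, the shorter sequence is right-padded with id 0 up to the length of the longest one, and the attention mask marks the padded positions with 0. A minimal sketch of that step on plain lists, assuming `[PAD]` has id 0 (consistent with padding_idx=0 in the model summary); `remplir_lot` is a hypothetical helper, not a library function.

```python
PAD_ID = 0  # assumed id of [PAD]

def remplir_lot(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad each sequence to the longest one and build the masks."""
    longueur_max = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (longueur_max - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (longueur_max - len(s)) for s in sequences]
    return input_ids, attention_mask

# Encoded sentences, special tokens included, copied from the output above.
sequences = [
    [101, 1045, 2066, 2183, 2000, 1996, 5691, 999, 102],
    [101, 1045, 5223, 3403, 2043, 1045, 2655, 1037, 8013, 2326, 2193, 1012, 102],
]
input_ids, attention_mask = remplir_lot(sequences)
print(input_ids)
print(attention_mask)
```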


In [15]:
with torch.no_grad():
    lot_output = modele(**lot_entree)  # ** unpacks the dict into keyword arguments
    print(lot_output)
    predictions = F.softmax(lot_output.logits, dim=1)
    print(predictions)
    resultats = torch.argmax(predictions, dim=1)
    print(resultats)
    etiquettes = [modele.config.id2label[label_id] for label_id in resultats.tolist()]
    print(etiquettes)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.8102,  4.0805],
        [ 3.3661, -2.7700]]), hidden_states=None, attentions=None)
tensor([[3.7406e-04, 9.9963e-01],
        [9.9784e-01, 2.1587e-03]])
tensor([1, 0])
['POSITIVE', 'NEGATIVE']
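
The probabilities above can be checked by hand from the printed logits. The sketch below applies a softmax in plain Python to the rounded logits, so the results agree with the tensor output only to about four decimal places. The `id2label` mapping is an assumption inferred from the printed labels.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-3.8102, 4.0805], [3.3661, -2.7700]]  # rounded values printed above
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}  # assumed mapping, consistent with the output

probabilites = [softmax(ligne) for ligne in logits]
# Index of the largest probability in each row (the argmax).
indices = [max(range(len(p)), key=p.__getitem__) for p in probabilites]
etiquettes = [id2label[i] for i in indices]
print(probabilites)
print(etiquettes)
```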


In [16]:
with torch.no_grad():
    lot_output = modele(**lot_entree, labels=torch.tensor([1, 0]))  # class labels, used to compute the loss
    print(lot_output)
    predictions = F.softmax(lot_output.logits, dim=1)
    print(predictions)
    resultats = torch.argmax(predictions, dim=1)
    print(resultats)
    etiquettes = [modele.config.id2label[label_id] for label_id in resultats.tolist()]
    print(etiquettes)

SequenceClassifierOutput(loss=tensor(0.0013), logits=tensor([[-3.8102,  4.0805],
        [ 3.3661, -2.7700]]), hidden_states=None, attentions=None)
tensor([[3.7406e-04, 9.9963e-01],
        [9.9784e-01, 2.1587e-03]])
tensor([1, 0])
['POSITIVE', 'NEGATIVE']
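
The reported loss can be recovered from the printed probabilities. The sketch below assumes the loss is the cross-entropy averaged over the batch, which is consistent with the value shown above. It works on the rounded probabilities, so only the first few digits are meaningful.

```python
import math

# Probabilities printed above, and the labels passed to the model.
probabilites = [[3.7406e-04, 9.9963e-01], [9.9784e-01, 2.1587e-03]]
labels = [1, 0]

# Cross-entropy per example: minus the log of the probability given to the true class.
pertes = [-math.log(p[y]) for p, y in zip(probabilites, labels)]
perte_moyenne = sum(pertes) / len(pertes)
print(round(perte_moyenne, 4))  # 0.0013, matching the loss printed above
```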


In [17]:
nom_modele = "cardiffnlp/twitter-roberta-base-sentiment-latest"
classifieur = pipeline("sentiment-analysis", model=nom_modele)
classifieur(liste_input)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'positive', 'score': 0.9706764817237854},
 {'label': 'negative', 'score': 0.9286103248596191}]

In [18]:
classifieur_multilingue = pipeline("sentiment-analysis",
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", 
    return_all_scores=True
)

# French
classifieur_multilingue(["J'aime aller au cinéma","Je déteste le froid hivernal"])


[[{'label': 'positive', 'score': 0.8709720969200134},
  {'label': 'neutral', 'score': 0.09320797771215439},
  {'label': 'negative', 'score': 0.0358198881149292}],
 [{'label': 'positive', 'score': 0.09347988665103912},
  {'label': 'neutral', 'score': 0.11243458092212677},
  {'label': 'negative', 'score': 0.7940855026245117}]]

The next example uses a different task, text generation. As before, see the list of tasks and their default models: https://github.com/huggingface/transformers/blob/71688a8889c4df7dd6d90a65d895ccf4e33a1a56/src/transformers/pipelines.py#L2716-L2804

In [21]:
generateur_texte = pipeline("text-generation")
generateur_texte("Large language models will")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Large language models will be updated in the near future to enhance the accessibility that people have for the new data and data analytics methods.\n\nIn the near future, users will see the same data being sent to and from Google+, such as traffic numbers'}]

In [26]:
from transformers import AutoModelForCausalLM
modele = AutoModelForCausalLM.from_pretrained("gpt2")
modele

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)