# What about the LLMs?

LLMs are widely credited with having revolutionised natural language processing. Since the introduction of BERT-type models, most language processing applications have been built on such models, of varying sophistication and size. These models are trained on multiple tasks and can therefore perform new tasks without task-specific training, simply from a prompt. This is known as "zero-shot learning" because there is no learning phase as such. We are going to test these models on our classification task.

HuggingFace is a Franco-American company that develops tools for building applications based on deep learning. In particular, it hosts the huggingface.co portal, which contains numerous deep learning models. These models can be used very easily thanks to the [Transformers](https://huggingface.co/docs/transformers/quicktour) library developed by HuggingFace.

Using a transformer model for zero-shot learning with HuggingFace is very simple: [see the documentation](https://huggingface.co/tasks/zero-shot-classification).

However, you need to choose a suitable model from the list of models compatible with Zero-Shot classification. HuggingFace offers [numerous models](https://huggingface.co/models?pipeline_tag=zero-shot-classification). 

The classes proposed to the model must also provide sufficient semantic information for the model to understand them.
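One way to see why the wording of the classes matters: NLI-based zero-shot classifiers turn each candidate label into a hypothesis sentence and ask whether the text entails it. The `transformers` zero-shot pipeline does this through its `hypothesis_template` argument, whose default is `"This example is {}."`. The sketch below only reproduces that string formatting and calls no model; the French template and the label wording are hypothetical examples, not the ones used later in this notebook.

```python
# Sketch: how candidate labels become NLI hypotheses (no model is called).
DEFAULT_TEMPLATE = "This example is {}."      # default template of the pipeline
FRENCH_TEMPLATE = "Cet article parle de {}."  # hypothetical alternative for French text

def build_hypotheses(labels, template=DEFAULT_TEMPLATE):
    """Return one hypothesis sentence per candidate label."""
    return [template.format(label) for label in labels]

labels = ["sport", "culture", "économie"]     # hypothetical label wording
for hypothesis in build_hypotheses(labels, FRENCH_TEMPLATE):
    print(hypothesis)
```

A label such as a bare category code would produce a hypothesis with little meaning for the model, whereas a descriptive word or phrase gives it something to match against the article.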

**Question**:

* Write code to classify a sample text from a Le Monde article using a transformer model in zero-shot mode with the HuggingFace library.
* Choose a model and explain your choice.
* Choose a formulation for the classes to be predicted.
* Show that the model predicts a class for the text of the article (whether correct or incorrect, analyse the results).
* Evaluate the performance of your model on 100 articles (a test set).
* Note model sizes, processing times and classification results.


Notes:
* Make sure that you use the correct tokenizer for the model you are using.
* Start testing with a small number of articles and only the first few hundred characters of each, for faster experiments.
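When only the first characters of an article are used, a plain slice such as `text[:100]` can cut the last word in half. The helper below is a minimal sketch of one way to avoid that; `truncate_at_word` is a hypothetical name and is not used elsewhere in this notebook.

```python
def truncate_at_word(text, max_chars=100):
    """Keep at most max_chars characters without cutting the last word in half."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the character right after the cut is not whitespace, the cut falls
    # inside a word: drop the partial word, unless that would leave nothing.
    if not text[max_chars].isspace() and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()

print(truncate_at_word("le nombre de bacheliers pour la session 2003", 20))
```

This is only a character-level convenience; the model's own tokenizer still decides how many tokens the excerpt becomes.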

In [1]:
# Choose a multilingual model; try models of different sizes.

## Installing the packages

In [2]:
!pip install torch


In [3]:
!pip install tensorflow


In [4]:
!pip install transformers


In [5]:
!pip install tf-keras



In [6]:
!pip install tiktoken



In [7]:
!pip install sentencepiece



## Importing the libraries

In [8]:
import pandas as pd

In [9]:
import torch
import tensorflow as tf
from transformers import pipeline


In [10]:
import time
from tqdm import tqdm

## Loading the dataset and drawing the test set

In [11]:
articles_df = pd.read_csv('LeMonde2003_9classes.csv.gz')

In [12]:
# Filter out the UNE class
articles_df = articles_df[articles_df['category'] != 'UNE']

In [13]:
sample = articles_df.sample(100)

In [14]:
sample

Unnamed: 0,text,category
28179,jean-paul huchon ps a indiqué lundi 22 décembr...,FRA
409,ce n'était vraiment pas le moment d'être membr...,INT
2692,la compagnie d'assurance suisse a enregistré e...,ENT
6773,consacrant l'essentiel de son discours vendred...,INT
15693,le nombre de bacheliers pour la session 2003 d...,SOC
...,...,...
5638,en choisissant de montrer le film de vincent g...,ART
6581,a deux semaines du sommet du g8 d'evian où le ...,INT
25112,washington l'organisation américaine de défens...,INT
2572,répétées à rome et créées à la biennale de ven...,ART


In [15]:
texts = sample.text

In [16]:
articles_df.category.unique()

array(['SPO', 'ART', 'FRA', 'SOC', 'INT', 'ENT'], dtype=object)

## Defining the model and the classes to predict

In [17]:
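# Names of the classes to predict
# NOTE: 'font page articles' is presumably a typo for 'front page articles'. It is kept
# as-is because the results in this notebook were obtained with this spelling. UNE
# articles were filtered out above, so this label can never be a correct prediction.
candidate_labels = ['companies', 'international', 'arts', 'society', 'France', 'sports', 'font page articles']

# Mapping from the label names to the category codes used in the dataset
corresp_dict = {
    'companies':'ENT',
    'international':'INT',
    'arts':'ART',
    'society':'SOC',
    'France':'FRA',
    'sports':'SPO',
    'font page articles':'UNE'
}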

In [18]:
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

Device set to use cpu


## Predicting the classes on the test set and evaluation

In [19]:
predictions = []
scores = []

# Measure the processing time
start_time = time.time()
for text in tqdm(texts.to_list(), desc="Traitement", unit="texte"):
    result = classifier(text[:100], candidate_labels, multi_label=True)
    
    # Find the index of the highest score
    max_score_index = result['scores'].index(max(result['scores']))

    # Select the corresponding label and its score
    selected_label = result['labels'][max_score_index]
    selected_score = result['scores'][max_score_index]

    # Append them to the lists
    predictions.append(selected_label)
    scores.append(selected_score)
    
# Processing time
end_time = time.time()
processing_time = end_time - start_time

Traitement: 100%|██████████| 100/100 [03:10<00:00,  1.91s/texte]
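The `multi_label` flag used in the loop changes how the scores are normalised. To our understanding of the pipeline, with `multi_label=True` each label is scored independently (a softmax over that label's contradiction and entailment logits), whereas with `multi_label=False` the entailment logits compete across labels and the scores sum to 1. The sketch below reproduces both normalisations on made-up logits; the numbers are hypothetical and no model is called.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical (contradiction, entailment) logits for three labels.
logits = {
    "sports": (-1.0, 3.0),
    "arts": (0.5, 1.5),
    "society": (2.0, -1.0),
}

# multi_label=True: each label is scored on its own; scores need not sum to 1.
independent = {lab: softmax([c, e])[1] for lab, (c, e) in logits.items()}

# multi_label=False: entailment logits compete across labels; scores sum to 1.
labels = list(logits)
competing = dict(zip(labels, softmax([logits[lab][1] for lab in labels])))

print(independent)
print(competing)
```

Since each article here has exactly one category, `multi_label=False` is arguably the more natural setting; the two modes can rank labels differently in general, so the choice may affect the predicted class.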


In [20]:
# Display the results
print(f"Time processing: {processing_time:.4f} seconds")

Time processing: 190.8923 seconds


In [21]:
sample['prediction'] = predictions
sample['scores'] = scores

In [22]:
sample

Unnamed: 0,text,category,prediction,scores
28179,jean-paul huchon ps a indiqué lundi 22 décembr...,FRA,companies,0.972328
409,ce n'était vraiment pas le moment d'être membr...,INT,international,0.691037
2692,la compagnie d'assurance suisse a enregistré e...,ENT,companies,0.976389
6773,consacrant l'essentiel de son discours vendred...,INT,arts,0.553660
15693,le nombre de bacheliers pour la session 2003 d...,SOC,France,0.487241
...,...,...,...,...
5638,en choisissant de montrer le film de vincent g...,ART,sports,0.997939
6581,a deux semaines du sommet du g8 d'evian où le ...,INT,international,0.992085
25112,washington l'organisation américaine de défens...,INT,international,0.715720
2572,répétées à rome et créées à la biennale de ven...,ART,international,0.963955


In [23]:
# Replace the label names in the column with the category codes, using the dictionary
sample['prediction'] = sample['prediction'].replace(corresp_dict)

In [24]:
# Compute the number of correct classifications
count_equal = (sample['category'] == sample['prediction']).sum()
total_lines = len(sample)
accuracy_percentage = (count_equal / total_lines) * 100

In [25]:
print(f"The model evaluation shows that 'category' and 'prediction' match in {count_equal} of {total_lines} rows, i.e. an accuracy of {accuracy_percentage:.2f}%")

The model evaluation shows that 'category' and 'prediction' match in 42 of 100 rows, i.e. an accuracy of 42.00%
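A single overall accuracy hides which categories the model confuses. The following is a minimal sketch of a per-class breakdown using only the standard library; `per_class_accuracy` is a hypothetical helper and the label lists are invented to show the output format.

```python
from collections import Counter

def per_class_accuracy(gold, predicted):
    """Return {class: (correct, total)} for two parallel lists of labels."""
    totals = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {cls: (correct[cls], totals[cls]) for cls in sorted(totals)}

# Hypothetical labels, only to show the output format.
gold = ["INT", "INT", "ART", "FRA", "ENT"]
predicted = ["INT", "ART", "ART", "ENT", "ENT"]

for cls, (ok, n) in per_class_accuracy(gold, predicted).items():
    print(f"{cls}: {ok}/{n}")
```

In this notebook it could be called as `per_class_accuracy(sample['category'].tolist(), sample['prediction'].tolist())` once the predictions have been mapped to category codes.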


The results are weak: try increasing the number of characters used for the prediction, and try other models.
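With only 100 test articles, the measured accuracy is itself uncertain. As a rough guide, the sketch below computes a normal-approximation (Wald) 95% confidence interval for an accuracy; this is a textbook approximation chosen for simplicity, and `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

low, high = accuracy_interval(42, 100)
print(f"42/100 -> [{low:.3f}, {high:.3f}]")
```

For 42 correct out of 100 the interval spans roughly 32% to 52%, so differences of a few points between runs on samples of this size should be read with caution.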

## Experiments with different models

In [26]:
def evaluate_zero_shot_classification(model_name, text_length=100):
    """
    Run zero-shot classification on a random sample of 100 articles.

    Uses the global variables articles_df, candidate_labels and corresp_dict.
    
    Parameters:
    - model_name: name of the model to use
    - text_length: number of characters of each text passed to the model (default 100)
    
    Returns:
    - a DataFrame with the predictions and scores
    - the processing time, in seconds
    - the accuracy of the model, in percent
    """
    # Draw the sample (a new random sample on every call)
    sample = articles_df.sample(100).copy()
    texts = sample['text']

    # Define the model
    classifier = pipeline("zero-shot-classification", model=model_name)
    predictions, scores = [], []

    start_time = time.time()
    for text in tqdm(texts.to_list(), desc="Traitement", unit="text"):
        result = classifier(text[:text_length], candidate_labels, multi_label=True)
        
        # Keep the label with the highest score
        max_score_index = result['scores'].index(max(result['scores']))
        selected_label = result['labels'][max_score_index]
        selected_score = result['scores'][max_score_index]
        
        predictions.append(selected_label)
        scores.append(selected_score)
    
    end_time = time.time()
    processing_time = end_time - start_time

    sample['prediction'] = predictions
    sample['scores'] = scores

    # Map the label names to the category codes
    sample['prediction'] = sample['prediction'].replace(corresp_dict)
    
    # Compute the accuracy
    count_equal = (sample['category'] == sample['prediction']).sum()
    total_lines = len(sample)
    accuracy_percentage = (count_equal / total_lines) * 100
    
    print(f"Time processing: {processing_time:.4f} seconds")
    print(f"Accuracy: {accuracy_percentage:.2f}% ({count_equal}/{total_lines})")
    
    return sample, processing_time, accuracy_percentage

In [28]:
# List of models to test
AVAILABLE_MODELS = [
    "knowledgator/comprehend_it-base",
    "facebook/bart-large-mnli",
    "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", 
    "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli", 
]

# Model sizes, in GB
MODEL_SIZES = {
    "knowledgator/comprehend_it-base": 0.47,
    "facebook/bart-large-mnli": 1.63,
    "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli": 0.49,
    "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli": 0.49,
}
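The sizes above are entered by hand. A rough cross-check is to estimate the size of the weights from the parameter count, assuming 32-bit floats (4 bytes per parameter). The sketch below does that arithmetic; `weights_size_gb` is a hypothetical helper, and the figure of about 407 million parameters for `facebook/bart-large-mnli` is an assumption, not a value taken from this notebook.

```python
def weights_size_gb(n_params, bytes_per_param=4):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumption: about 407 million parameters for facebook/bart-large-mnli.
print(f"{weights_size_gb(407_000_000):.2f} GB")
```

Under these assumptions the estimate is consistent with the 1.63 GB recorded in `MODEL_SIZES`; half-precision weights (`bytes_per_param=2`) would halve it.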

In [29]:
# List used to store the results
results = []

# Loop over the models
for model in AVAILABLE_MODELS:
    print(f"Testing model: {model}")
    sample_results, time_taken, accuracy = evaluate_zero_shot_classification(model, 100)

    # Append the results to the list
    results.append({
        "Modèle": model,
        "Taille (Go)": MODEL_SIZES.get(model, "N/A"),
        "Temps (s)": round(time_taken, 2),
        "Accuracy (%)": round(accuracy, 2)
    })


Testing model: knowledgator/comprehend_it-base


Device set to use cpu
Traitement: 100%|██████████| 100/100 [01:59<00:00,  1.19s/text]


Time processing: 119.1703 seconds
Accuracy: 42.00% (42/100)
Testing model: facebook/bart-large-mnli


Device set to use cpu
Traitement: 100%|██████████| 100/100 [03:05<00:00,  1.85s/text]


Time processing: 185.1920 seconds
Accuracy: 47.00% (47/100)
Testing model: MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli


Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Traitement: 100%|██████████| 100/100 [01:51<00:00,  1.12s/text]


Time processing: 111.9397 seconds
Accuracy: 35.00% (35/100)
Testing model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli


Device set to use cpu
Traitement: 100%|██████████| 100/100 [01:55<00:00,  1.15s/text]

Time processing: 115.3597 seconds
Accuracy: 37.00% (37/100)





In [30]:
# Convert to a DataFrame
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(by="Accuracy (%)", ascending=False)

# Display the table
print(df_results)

                                         Modèle  Taille (Go)  Temps (s)  Accuracy (%)
1                      facebook/bart-large-mnli         1.63     185.19          47.0
0               knowledgator/comprehend_it-base         0.47     119.17          42.0
3       MoritzLaurer/mDeBERTa-v3-base-mnli-xnli         0.49     115.36          37.0
2  MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli         0.49     111.94          35.0


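The facebook/bart-large-mnli model has a higher accuracy than the other models; its size and running time are also the largest. Note that each model was evaluated on a different random sample of 100 articles, since the function draws a new sample on every call, so small differences in accuracy should be interpreted with caution.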
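Because `evaluate_zero_shot_classification` draws a new sample on each call, the models are not compared on the same articles. One option is to draw the sample once and pass it to every model. The sketch below illustrates the idea with a stand-in classifier so that it runs without downloading a model; `evaluate_on_fixed_sample` and `keyword_classifier` are hypothetical names, and the texts are invented.

```python
def evaluate_on_fixed_sample(classify, texts, gold, label_to_code, text_length=100):
    """Accuracy of `classify` on a fixed list of texts.

    `classify(text, labels)` must return a dict with 'labels' and 'scores',
    the shape returned by the zero-shot pipeline for a single text.
    """
    labels = list(label_to_code)
    correct = 0
    for text, expected in zip(texts, gold):
        result = classify(text[:text_length], labels)
        best = max(range(len(result["scores"])), key=result["scores"].__getitem__)
        correct += label_to_code[result["labels"][best]] == expected
    return correct / len(texts)

# Hypothetical stand-in for a real pipeline: scores a label 1.0 if it occurs in the text.
def keyword_classifier(text, labels):
    return {"labels": labels, "scores": [float(lab in text) for lab in labels]}

label_to_code = {"sports": "SPO", "arts": "ART"}
texts = ["a text about sports", "a text about arts", "something else"]
gold = ["SPO", "ART", "ART"]

# The same texts and gold labels would be passed to every model under comparison.
print(evaluate_on_fixed_sample(keyword_classifier, texts, gold, label_to_code))
```

With a real pipeline, `classify` could be a small wrapper such as `lambda text, labels: classifier(text, labels, multi_label=True)`, and the sample could be drawn once with a fixed `random_state` so that the comparison is reproducible.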

## Evaluation with a longer text excerpt for the prediction

In [27]:
sample_results, time_taken, accuracy = evaluate_zero_shot_classification("facebook/bart-large-mnli", 300)

Device set to use cpu
Traitement: 100%|██████████| 100/100 [04:03<00:00,  2.43s/text]

Time processing: 243.1157 seconds
Accuracy: 58.00% (58/100)





Better results with 300 characters instead of 100 for predicting the class: 58% correct predictions, compared with 47% previously.
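Whether an improvement from 47% to 58% on two different samples of 100 articles is more than sampling noise can be checked roughly with a two-proportion z-test. The sketch below uses the pooled normal approximation; this is a standard textbook test chosen here for illustration, and `two_proportion_z` is a hypothetical helper.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic for the difference between two independent accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

z = two_proportion_z(58, 100, 47, 100)
print(f"z = {z:.2f}")
```

The statistic comes out below the usual 1.96 threshold, so under this approximation the difference is not significant at the 5% level; a larger test set, or the same sample for both settings, would give a firmer conclusion.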

In [31]:
sample_results, time_taken, accuracy = evaluate_zero_shot_classification("facebook/bart-large-mnli", 500)

Device set to use cpu
Traitement: 100%|██████████| 100/100 [09:56<00:00,  5.96s/text]

Time processing: 596.2937 seconds
Accuracy: 35.00% (35/100)





In [56]:
# Draw the sample
sample = articles_df.sample(100).copy()
texts = sample['text']

# Define the model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
predictions, scores = [], []

start_time = time.time()
for text in tqdm(texts.to_list(), desc="Traitement", unit="text"):
    result = classifier(text[:500], candidate_labels, multi_label=True)
    
    max_score_index = result['scores'].index(max(result['scores']))
    selected_label = result['labels'][max_score_index]
    selected_score = result['scores'][max_score_index]
    
    predictions.append(selected_label)
    scores.append(selected_score)

end_time = time.time()
processing_time = end_time - start_time

sample['prediction'] = predictions
sample['scores'] = scores

# Map the label names to the category codes
sample['prediction'] = sample['prediction'].replace(corresp_dict)

# Compute the accuracy
count_equal = (sample['category'] == sample['prediction']).sum()
total_lines = len(sample)
accuracy_percentage = (count_equal / total_lines) * 100

print(f"Time processing: {processing_time:.4f} seconds")
print(f"Accuracy: {accuracy_percentage:.2f}% ({count_equal}/{total_lines})")

Device set to use cpu
Traitement: 100%|██████████| 100/100 [03:47<00:00,  2.28s/text]

Time processing: 227.8575 seconds
Accuracy: 47.00% (47/100)




