## What about the LLMs?

**You must write the answer to this question in a notebook hosted in your GitHub account and give your supervisor access to it.**

LLMs are reputed to have revolutionised natural language processing. Since the introduction of BERT-type models, most language processing applications have been based on LLMs of varying sophistication and size. These models are trained on multiple tasks and can therefore perform new tasks without additional training, simply from a prompt. This is known as "zero-shot learning" because there is no task-specific training phase. We are going to test these models on our classification task.

HuggingFace is a Franco-American company that develops tools for building applications based on Deep Learning. In particular, it hosts the huggingface.co portal, which contains numerous Deep Learning models. These models can be used very easily thanks to the [Transformers](https://huggingface.co/docs/transformers/quicktour) library developed by HuggingFace.

Using a transformer model in zero-shot learning with HuggingFace is very simple: [see documentation](https://huggingface.co/tasks/zero-shot-classification).

However, you need to choose a suitable model from the list of models compatible with Zero-Shot classification. HuggingFace offers [numerous models](https://huggingface.co/models?pipeline_tag=zero-shot-classification). 

The classes proposed to the model must also provide sufficient semantic information for the model to understand them.

**Question**:

* Write code to classify a sample text from a Le Monde article using a transformer model in zero-shot learning with the HuggingFace library.
* choose a model and explain your choice
* choose a formulation for the classes to be predicted
* show that the model predicts a class for the text of the article (correct or incorrect, analyse the results)
* evaluate the performance of your model on 100 articles (a test set).
* note model sizes, processing times and classification results


Notes:
* make sure that you use the correct tokenizer for the model you choose
* start testing with a small number of articles and the first few hundred characters of each, for faster experiments.
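
The second note can be sketched as a small helper that keeps only a few articles and cuts each one to its first characters. The function name `truncate_texts` and the default values (10 articles, 300 characters) are assumptions chosen for illustration.

```python
def truncate_texts(texts, n_articles=10, max_chars=300):
    """Keep the first `n_articles` texts, each cut to `max_chars` characters.

    Works on any iterable of strings (a list or a pandas Series).
    The cut is purely character-based, so it may split a word.
    """
    return [text[:max_chars] for text in list(texts)[:n_articles]]


# Toy input: three "articles" of different lengths.
sample = truncate_texts(["a" * 500, "b" * 10, "c"], n_articles=2, max_chars=300)
print([len(text) for text in sample])  # [300, 10]
```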

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import time

# Load the dataset
csv_file_path = 'data/LeMonde2003_9classes.csv.gz'
df = pd.read_csv(csv_file_path, compression='gzip')

# Remove the class 'UNE' and merge 'FRA' (France) into 'SOC' (société)
df = df[df['category'] != 'UNE']
df.loc[df['category'] == 'FRA', 'category'] = 'SOC'

# Create new splits
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Extract features and labels for the test set
X_test = test_df['text']
y_test = test_df['category']

# Choose a model for zero-shot classification
model_name = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

Device set to use cuda:0


The model ``MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7``, found on HuggingFace, is a zero-shot text classification model that supports French. It is the best one I found after testing a few.

In [2]:
# Map each candidate label (a descriptive French phrase) to its dataset category code
label_mapping = {
    "article sur les entreprises et le monde des affaires": "ENT",
    "article sur l'actualité internationale": "INT",
    "article sur les arts et la culture": "ART",
    "article sur la société française et les questions sociales": "SOC",
    "article sur le sport": "SPO"
}
# The candidate labels passed to the classifier are the keys of the mapping
candidate_labels = list(label_mapping)

This kind of very explicit label worked rather well after a few tests.
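
The file name suggests nine classes, while the mapping above covers five category codes, so it may be worth checking whether some test categories have no candidate label at all (such articles can never be counted as correct). A minimal sketch follows; the helper name `uncovered_categories` and the toy data, including the made-up code `XYZ`, are hypothetical.

```python
def uncovered_categories(true_labels, label_mapping):
    """Return the category codes that no candidate label maps to, sorted."""
    covered = set(label_mapping.values())
    return sorted(set(true_labels) - covered)


# Toy data: 'XYZ' is a made-up category code used only for illustration.
toy_mapping = {
    "article sur le sport": "SPO",
    "article sur l'actualité internationale": "INT",
}
toy_true_labels = ["SPO", "INT", "XYZ", "SPO"]
missing = uncovered_categories(toy_true_labels, toy_mapping)
print(missing)  # ['XYZ']
```

In the notebook, the equivalent call would be `uncovered_categories(y_test, label_mapping)`; an empty list means every category in the test set can be predicted.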

In [3]:
# Classify an example text
example_text = X_test.iloc[0]
result = classifier(example_text, candidate_labels)
print("Classification Result for Example Text:")
print(result)

Classification Result for Example Text:
{'sequence': 'premier tirage 5 6 13 18 36 41 complémentaire 44 pas de gagnant pour 6 numéros gagnant pour 5 numéros et complémentaire 21 658,30 5 numéros 901,20 4 numéros et complémentaire 38 4 numéros 19 3 numéros et complémentaire 8,20 3 numéros 4,10 second tirage 4 17 18 21 30 44 complémentaire 41 rapports pour 6 numéros 912 567 5 numéros et complémentaire 17 486,70 5 numéros 911,50 4 numéros et complémentaire 41,80 4 numéros 20,90 3 numéros et complémentaire 4,60 3 numéros 2,30', 'labels': ['article sur le sport', 'article sur la société française et les questions sociales', 'article sur les entreprises et le monde des affaires', 'article sur les arts et la culture', "article sur l'actualité internationale"], 'scores': [0.9180249571800232, 0.03507557511329651, 0.022624339908361435, 0.01783699356019497, 0.0064380900003015995]}


In [4]:
print(f"For example, for the text: {example_text[:100]}")
print(f"The model predicts the following label: {result['labels'][0]}")

For example, for the text: premier tirage 5 6 13 18 36 41 complémentaire 44 pas de gagnant pour 6 numéros gagnant pour 5 numéro
The model predicts the following label: article sur le sport


In [5]:
# Evaluate the model on the test set
correct_predictions = 0
total_articles = 100
processing_times = []

for i in range(total_articles):
    text = X_test.iloc[i]
    true_label = y_test.iloc[i]

    # perf_counter is a monotonic clock intended for measuring short durations
    start_time = time.perf_counter()
    result = classifier(text, candidate_labels)
    end_time = time.perf_counter()

    predicted_label = label_mapping[result['labels'][0]]
    processing_times.append(end_time - start_time)

    if predicted_label == true_label:
        correct_predictions += 1

accuracy = correct_predictions / total_articles
average_processing_time = sum(processing_times) / total_articles

print(f"Accuracy: {accuracy:.4f}")
print(f"Average Processing Time: {average_processing_time:.4f} seconds")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Accuracy: 0.7000
Average Processing Time: 0.3333 seconds
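
A single accuracy figure hides which categories the model gets wrong. The sketch below computes accuracy per true category using only the standard library. It assumes the evaluation loop also stores each prediction in a list, which the loop above does not do; the helper name `per_class_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {category: fraction of its articles predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Toy labels, not results from the model above.
toy_true = ["SPO", "SPO", "INT", "ART"]
toy_pred = ["SPO", "INT", "INT", "SOC"]
toy_report = per_class_accuracy(toy_true, toy_pred)
print(toy_report)  # {'ART': 0.0, 'INT': 1.0, 'SPO': 0.5}
```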


In [6]:
# Note model sizes and processing times
model_size = sum(p.numel() for p in model.parameters()) / 1e6  # Model size in millions of parameters
print(f"Model Size: {model_size:.2f} million parameters")

Model Size: 278.81 million parameters
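
The parameter count can be converted into an approximate memory footprint. This is a rough estimate under stated assumptions: weights stored as 32-bit floats (4 bytes each), decimal megabytes, and no accounting for buffers or activations. The helper name `model_size_mb` is hypothetical.

```python
def model_size_mb(n_params, bytes_per_param=4):
    """Approximate size of the weights in decimal megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6


# 278.81 million parameters, as printed above; float32 is an assumption.
size_fp32 = model_size_mb(278.81e6)
print(f"{size_fp32:.2f} MB")  # 1115.24 MB
```

Under the same assumptions, half-precision weights (`bytes_per_param=2`) would take about half of that.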
