## What about the LLMs?

**You must write the answer to this question in a notebook hosted in your github account and give access to your supervisor.**

LLMs are widely credited with having revolutionised natural language processing. Since the introduction of BERT-type models, most language processing applications have been based on LLMs of varying sophistication and size. These models are trained on multiple tasks and can therefore perform new tasks without further training, simply from a prompt. This is known as "zero-shot learning" because there is no task-specific learning phase. We are going to test these models on our classification task.

HuggingFace is a Franco-American company that develops tools for building applications based on Deep Learning. In particular, it hosts the huggingface.co portal, which contains numerous Deep Learning models. These models can be used very easily thanks to the [Transformers](https://huggingface.co/docs/transformers/quicktour) library developed by HuggingFace.

Using a transformer model in zero-shot learning with HuggingFace is very simple: [see documentation](https://huggingface.co/tasks/zero-shot-classification)

However, you need to choose a suitable model from the list of models compatible with Zero-Shot classification. HuggingFace offers [numerous models](https://huggingface.co/models?pipeline_tag=zero-shot-classification). 

The classes proposed to the model must also provide sufficient semantic information for the model to understand them.

**Question**:

* Write code to classify a sample text from a Le Monde article using a transformer model in zero-shot learning with the HuggingFace library.
* choose a model and explain your choice
* choose a formulation for the classes to be predicted
* show that the model predicts a class for the text of the article (correct or incorrect, analyse the results)
* evaluate the performance of your model on 100 articles (a test set).
* note model sizes, processing times and classification results


Notes:
* make sure that you use the correct Tokenizer when using a model
* start testing with a small number of articles and the first few hundred characters of each for faster experiments.

## Answers

***Model***: we would like a model whose base architecture performs well, which was trained on a corpus including a large number of French newspaper articles, and which can process up to 4000 tokens, i.e., the approximate maximum length of our documents. As far as I know, all models available for zero-shot classification on HuggingFace match the first two conditions and none matches the last, meaning that documents longer than 512 tokens will be truncated (architectures like Longformer and BigBird can process longer sequences but are not optimized for zero-shot classification). After experimenting with several models, I obtained the best results with [MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli](https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli). The reason is probably that it is one of the biggest models available, with 435M parameters. This makes it slower, but not prohibitively so with GPU acceleration (on an M2 Max with 38 GPU cores, it runs 5.5x faster on the GPU than on the CPU).
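
One common way around the 512-token limit, instead of truncation, is to split a long document into overlapping windows, classify each window, and average the per-label scores. The sketch below is a minimal illustration of that idea and is not what this notebook does: the helper names `split_into_windows` and `average_scores`, the window size and the overlap are all assumptions, and in practice the usable budget is somewhat below 512 tokens because the pipeline also encodes a hypothesis sentence for each candidate label.

```python
def split_into_windows(token_ids, max_len=512, stride=128):
    """Split a token sequence into windows of at most `max_len` tokens.

    Consecutive windows overlap by `stride` tokens (hypothetical defaults).
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window reached the end of the document
        start += step
    return windows


def average_scores(per_window_scores):
    """Average label scores over windows; each item maps label -> score."""
    labels = per_window_scores[0].keys()
    n_windows = len(per_window_scores)
    return {
        label: sum(scores[label] for scores in per_window_scores) / n_windows
        for label in labels
    }


token_ids = list(range(1000))  # stand-in for real token ids
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [512, 512, 232]
```

Whether averaging beats plain truncation on this dataset is an open question; newspaper articles often state their topic in the first paragraphs, so truncation may already keep the most informative part.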

***Formulation of the classes***: I created a dictionary that maps the full section names to their three-letter codes, so that the classifier receives the full names as candidate labels and its prediction can be converted back to a code. Full names give the model more meaningful information; this step proved necessary to reach acceptable performance, as the abbreviated codes alone do not provide enough context.
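
Beyond the label names themselves, the zero-shot pipeline inserts each candidate label into a hypothesis sentence, which can be customised through the `hypothesis_template` argument (the default is "This example is {}."). The sketch below shows how such a template could be passed. The template wording and the helper `predict_code` are hypothetical and have not been evaluated on this dataset; `classifier` stands for any callable with the pipeline's interface.

```python
# Section names as used in this notebook, mapped to their three-letter codes
label_codes = {
    'sports': 'SPO',
    'arts': 'ART',
    'France': 'FRA',
    'société': 'SOC',
    'international': 'INT',
    'entreprises': 'ENT',
    'une': 'UNE',
}

# Hypothetical template: the wording is an assumption, not a tuned choice
HYPOTHESIS_TEMPLATE = "This newspaper article belongs to the {} section."


def predict_code(classifier, text, template=HYPOTHESIS_TEMPLATE):
    """Classify `text` and return the section code of the top-ranked label."""
    result = classifier(
        text,
        candidate_labels=list(label_codes),
        hypothesis_template=template,
    )
    # The pipeline returns labels sorted by decreasing score
    return label_codes[result["labels"][0]]
```

With the pipeline created in this notebook, the call would look like `predict_code(classifier, sample_text)`.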

***Tokenization:*** the pipeline performs it automatically, using the tokenizer that ships with the chosen model.

In [1]:
import pandas as pd

data = pd.read_csv('https://cloud.teklia.com/index.php/s/isNwnwA7a7AWst6/download/LeMonde2003_9classes.csv.gz')

In [2]:
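from transformers import AutoTokenizer

# Note: bert-base-uncased is not the tokenizer of the classification model used below,
# so this count is only a rough estimate of the lengths that model will see.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize every document and keep the largest token count
max_token_length = data['text'].astype(str).apply(lambda x: len(tokenizer.tokenize(x))).max()
print(f'The longest document includes {max_token_length} tokens.')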

Token indices sequence length is longer than the specified maximum sequence length for this model (1046 > 512). Running this sequence through the model will result in indexing errors


The longest document includes 3817 tokens.


In [3]:
import torch

# Pick the fastest available backend: CUDA GPU, then Apple Silicon GPU (MPS), then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

In [None]:
from transformers import pipeline

# Create the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification",
                     model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
                     framework="pt", # Using PyTorch to avoid a conflict with Keras
                     device=device)

# Map each section's full name to its three-letter code
label_codes = {
    'sports': 'SPO',
    'arts': 'ART',
    'France': 'FRA',
    'société': 'SOC',
    'international': 'INT',
    'entreprises': 'ENT',
    'une': 'UNE'
}

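# Pass the full names to the classifier, then convert its output back to a section code
def predict_category(text, labels=None):
    # Fall back to all section names when no candidate labels are given
    candidate_labels = labels if labels is not None else list(label_codes.keys())
    result = classifier(text, candidate_labels=candidate_labels)
    predicted_label = result['labels'][0]  # labels are sorted by decreasing score
    return label_codes[predicted_label]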

In [None]:
# Select a single sample
random_idx = 423
sample_text = data['text'].iloc[random_idx]
sample_label = data['category'].iloc[random_idx]

# Get detailed predictions
results = classifier(sample_text, candidate_labels=list(label_codes.keys()))

# Print all predictions in descending order
print(f"\nPredicted categories for sample #{random_idx}:")
predictions = sorted(zip(results['labels'], results['scores']), 
                    key=lambda x: x[1], 
                    reverse=True)

for label, score in predictions:
    print(f"{label}: {score:.3f}")

# Check if the prediction is correct
top_prediction = label_codes[predictions[0][0]]

is_correct = top_prediction == sample_label
print(f"\nTop prediction ({top_prediction}) {'matches' if is_correct else 'does not match'} true label ({sample_label}).")

***Comment:*** The classifier correctly predicts the class of document #423, with a very high score. I assume the reason is that an article on international relations contains the word "international" and its derivatives many times, but rarely the names of the other categories, except perhaps France.

In [None]:
from tqdm import tqdm

# Select 100 samples (with a fixed seed for reproducibility)
sample_data = data.sample(n=100, random_state=86)

# Predict for each sample
tqdm.pandas()
sample_data['predicted_category'] = sample_data['text'].progress_apply(
    lambda x: predict_category(x, list(label_codes.keys()))
)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assess performance
y_true = sample_data['category']
y_pred = sample_data['predicted_category']

print(classification_report(y_true, y_pred))

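# Fix the label order explicitly so that the tick labels match the matrix rows and columns
codes = sorted(label_codes.values())
cm = confusion_matrix(y_true, y_pred, labels=codes)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=codes, yticklabels=codes)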
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

***Comment:*** since the model is pretrained, there is no need to split the data into training and test sets; performance can be assessed directly on a random sample of documents. The results show a sharp contrast: the model performs very well on four categories (ART, ENT, INT and SPO) but poorly on the remaining three (FRA, SOC and UNE). The reason is likely the same as above: as section names, Arts, Entreprises, International and Sports say a lot about the content of their sections, whereas France, Société and Une are more ambiguous. Indeed, any topic can appear in the UNE category, while SOC and FRA tend to cover a wide range of subjects that do not belong to more specialized categories. The classifier is likely to keep struggling with these categories unless it is given more detailed information about their content.
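
One way to give the classifier more detailed information about the ambiguous sections is to describe each section with one or more short phrases and keep, for each section, its best-scoring phrase. The following sketch illustrates the idea only: the phrases are invented for the example and have not been evaluated, and `best_section` is a hypothetical helper that post-processes a result dictionary in the format returned by the zero-shot pipeline (`labels` and `scores`).

```python
# Hypothetical descriptions, invented for illustration and not evaluated
section_descriptions = {
    "FRA": ["French domestic politics", "French government and elections"],
    "SOC": ["social issues", "justice, education and health"],
    "UNE": ["front-page headline news"],
    "INT": ["international affairs"],
}

# Reverse lookup: descriptive phrase -> section code
phrase_to_code = {
    phrase: code
    for code, phrases in section_descriptions.items()
    for phrase in phrases
}


def best_section(result):
    """Return (code, score) for the section whose best phrase scores highest."""
    best = {}
    for phrase, score in zip(result["labels"], result["scores"]):
        code = phrase_to_code[phrase]
        best[code] = max(score, best.get(code, 0.0))
    return max(best.items(), key=lambda item: item[1])
```

The candidate labels passed to the classifier would then be `list(phrase_to_code)`. Since several phrases compete for the same section, it may be worth trying the pipeline's `multi_label=True` option, which scores each phrase independently rather than normalising across all of them.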

Summary:
- Model size: 435M parameters
- Processing time: 1:01 for 100 documents, i.e., ≈ 0.6 seconds per document on average
- Classification results: the weighted averages of precision, recall and F1 score are 0.52, 0.45 and 0.45 respectively
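
For reference, the per-document time follows directly from 61 s / 100 documents, and the weighted averages reported by `classification_report` weight each class's score by its support (the number of true samples in that class). The sketch below reproduces both calculations; the helper names are hypothetical, and the per-class F1 scores and supports are made-up toy values used only to show the arithmetic, not the notebook's results.

```python
def seconds_per_document(total_seconds, n_documents):
    """Average processing time per document."""
    return total_seconds / n_documents


def weighted_average(scores, supports):
    """Support-weighted mean of per-class scores (both dicts keyed by class)."""
    total = sum(supports.values())
    return sum(scores[label] * supports[label] for label in scores) / total


# 1:01 for 100 documents, as reported above
print(f"{seconds_per_document(61, 100):.2f} s per document")  # 0.61 s per document

# Toy values, invented for illustration only
toy_f1 = {"ART": 0.8, "FRA": 0.2}
toy_support = {"ART": 30, "FRA": 10}
print(round(weighted_average(toy_f1, toy_support), 2))  # 0.65
```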