# Overview
In this assignment, we will explore fine-tuning two distinct models:
1. DistilBERT for a sentiment classification task.
2. The recent OpenLlama-2-3b model, to turn it into a chatbot.


In [13]:
# Install dependencies
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [14]:
!pip install -q -U datasets bitsandbytes einops

In [15]:
!pip install -U fsspec==2023.9.2



# 1. DistilBERT for Sentiment Classification

**Deliverables:**

Explore 2 different model-tuning methods (number of epochs, learning rate, weight decay, etc.) to improve classification performance. Describe the methodology you followed to improve the model's performance. A thorough discussion of the chosen approaches is expected (points will be deducted for random changes to the model's hyperparameters).
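One way to keep the search methodical rather than random is to write down a small grid of configurations before training anything. The sketch below only enumerates the combinations; the `build_grid` helper and the candidate values are illustrative assumptions, not settings prescribed by the assignment.

```python
from itertools import product

def build_grid(learning_rates, epochs, weight_decays):
    """Return one dict of training-argument overrides per combination."""
    return [
        {"learning_rate": lr, "num_train_epochs": ep, "weight_decay": wd}
        for lr, ep, wd in product(learning_rates, epochs, weight_decays)
    ]

# Hypothetical search space centred on the baseline used later in this
# notebook (learning_rate=2e-5, num_train_epochs=1, weight_decay=0.01).
grid = build_grid(
    learning_rates=[1e-5, 2e-5, 5e-5],
    epochs=[1, 2],
    weight_decays=[0.0, 0.01],
)

for i, overrides in enumerate(grid):
    print(i, overrides)
```

Each dict could then be unpacked into `TrainingArguments(output_dir=..., **overrides)`, so that every run differs from the baseline in a documented way.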

Your report must include:
- Accuracy
- Precision
- Recall
- F1 scores
- An image of your confusion matrix as a heatmap
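The four scalar metrics listed above can all be derived from the counts in a binary confusion matrix. The following is a minimal sketch in plain Python; the `binary_metrics` helper is hypothetical and the counts are made up for illustration, not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = binary_metrics(tn=40, fp=10, fn=5, tp=45)
print(metrics)
```

With scikit-learn's `confusion_matrix(labels, preds, labels=[0, 1])`, the 2x2 result is laid out as `[[tn, fp], [fn, tp]]`, so the same counts can be read directly from the matrix used for the heatmap.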

In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification

from sklearn.model_selection import train_test_split
import pandas as pd

from datasets import load_dataset, Dataset, DatasetDict

## Dataset

In [None]:
imdb_df = pd.read_csv("data/IMDB_dataset_clean.csv")

In [None]:
X_train, X_test = train_test_split(imdb_df, test_size=0.2, random_state=42)

In [None]:
dataset = DatasetDict({
    "train": Dataset.from_pandas(X_train, preserve_index=False),
    "test": Dataset.from_pandas(X_test, preserve_index=False)
    })

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True)

# Tokenize the training and test sets
train_tokenized = dataset["train"].map(tokenize_function, batched=True)
test_tokenized = dataset["test"].map(tokenize_function, batched=True)

Map:   0%|          | 0/39665 [00:00<?, ? examples/s]

Map:   0%|          | 0/9917 [00:00<?, ? examples/s]

In [None]:
train_tokenized, test_tokenized

(Dataset({
     features: ['text', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 39665
 }),
 Dataset({
     features: ['text', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 9917
 }))

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
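Because the tokenization step uses `truncation=True` without padding, the examples have different lengths; `DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled. The sketch below illustrates that idea in plain Python. It is not the library's implementation, and `pad_batch`, the toy token ids and `pad_id=0` are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, made up for illustration.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches short, which usually reduces wasted computation.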

## Model download

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics import confusion_matrix

def compute_metrics(pred):
    labels = pred.label_ids
    # pred.predictions holds raw logits of shape (n_examples, num_labels),
    # not probabilities, so take the argmax instead of thresholding at 0.5.
    preds = np.argmax(pred.predictions, axis=-1)

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
    acc = accuracy_score(labels, preds)

    # Return scalars only: the Trainer logs this dict, and NumPy arrays
    # (labels, predictions, the full matrix) are not JSON-serializable.
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1-score": f1,
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
    }
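Note that for a sequence-classification model, the `predictions` that the `Trainer` passes to `compute_metrics` are raw logits (one score per class), not probabilities, so comparing a logit with 0.5 is not the same as comparing a probability with 0.5. The toy example below, with made-up logits for a single review, shows how the two can disagree.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one review: [score for class 0, score for class 1].
logits = [-1.2, 0.3]

probs = softmax(logits)
pred_argmax = max(range(len(logits)), key=lambda i: logits[i])
pred_logit_threshold = int(logits[1] > 0.5)  # thresholding the raw logit
pred_prob_threshold = int(probs[1] > 0.5)    # thresholding the probability

print(probs, pred_argmax, pred_logit_threshold, pred_prob_threshold)
```

Here the argmax and the probability threshold both select class 1, while thresholding the raw logit selects class 0.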

In [None]:
training_args = TrainingArguments(
    output_dir="test_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=False,
    push_to_hub=False,

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



In [None]:
train_result = trainer.train()

In [None]:
eval_results = trainer.evaluate()

In [None]:
# PUT EVALUATION CODE HERE

# 2. Fine-tuning OpenLlama-2-3b
This section explains how to fine-tune the OpenLlama-2-3b model on Google Colab to turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA for better memory efficiency.

**Deliverables**

1. Experiment with 3 different LoRA configurations and create a line plot with the parameter r on the x-axis. Include a discussion of the effects of changing this hyperparameter.

2. Write the code to add an example to the dataset.
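For the discussion of r, it may help to see how the number of trainable parameters grows with the rank. In LoRA, each adapted weight matrix of shape (d_out x d_in) gets two small matrices, A of shape (r x d_in) and B of shape (d_out x r), so the added parameter count is linear in r. The sketch below assumes a square projection of width 3200 purely for illustration; the actual width should be read from `model.config.hidden_size`.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed layer width, for illustration only.
d = 3200
full = d * d  # parameters in the frozen full-rank matrix

for r in (4, 8, 16):
    added = lora_params(d, d, r)
    print(f"r={r:>2}: {added} LoRA params, {added / full:.2%} of the full matrix")
```

Doubling r doubles the adapter size, yet under this assumption it remains around one percent of the original matrix or less.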

## Dataset

In [32]:
from datasets import load_dataset

dataset_name = 'gberseth/IFT6758-comments'
dataset = load_dataset(dataset_name, split="train")

In [34]:
dataset

Dataset({
    features: ['input', 'output'],
    num_rows: 5749
})

In [35]:
dataset = dataset.map(lambda example: {'text': example['input'] + example['output']})

Map:   0%|          | 0/5749 [00:00<?, ? examples/s]

In [36]:
dataset

Dataset({
    features: ['input', 'output', 'text'],
    num_rows: 5749
})

## Model download

In [19]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "openlm-research/open_llama_3b_v2"

In [20]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

In [21]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

config.json:   0%|          | 0.00/506 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


pytorch_model.bin:   0%|          | 0.00/6.85G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]
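To get a feel for why 4-bit loading matters on Colab, here is a rough, weight-only memory estimate. It assumes the 6.85 GB checkpoint shown above stores 16-bit weights (2 bytes per parameter), and it ignores quantization constants, any layers kept in higher precision, activations and optimizer state, so treat the numbers as an order of magnitude rather than a measurement.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Rough weight-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: the checkpoint stores 16-bit weights, i.e. 2 bytes per parameter.
checkpoint_gb = 6.85
num_params = checkpoint_gb * 1e9 / 2

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{estimate_weight_memory_gb(num_params, bits):.2f} GB")
```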

In [22]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/593 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/512k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/330 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggin

In [23]:
from peft import LoraConfig, get_peft_model

lora_alpha = 8
lora_dropout = 0.1
lora_r = 8

In [24]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
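When comparing several values of r, keep in mind that standard LoRA multiplies the low-rank update by `lora_alpha / r`. With `lora_alpha` fixed at 8 as in this notebook, changing r therefore also changes the strength of the update, which can confound the comparison. A small sketch of that arithmetic follows; `lora_scaling` is a hypothetical helper name.

```python
def lora_scaling(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r

lora_alpha = 8  # value used in this notebook

for r in (4, 8, 16):
    print(f"r={r:>2}, alpha={lora_alpha}: scaling = {lora_scaling(lora_alpha, r)}")
```

One possible design choice is to scale `lora_alpha` together with r so that the ratio stays constant across the three runs; either choice is defensible as long as it is stated in the discussion.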

In [25]:
from transformers import TrainingArguments

In [26]:
output_dir = "./results"
per_device_train_batch_size = 1
gradient_accumulation_steps = 2  # effective batch size per device: 1 * 2 = 2
optim = "paged_adamw_32bit"
save_steps = 1  # a checkpoint is written at every step
num_train_epochs = 4  # overridden by max_steps below
logging_steps = 1
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 200
warmup_ratio = 0.03
lr_scheduler_type = "linear"

In [27]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    report_to="none",
)
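The values above imply a fairly short run. The arithmetic below is a sketch under two assumptions: a single GPU, and warmup steps obtained by rounding `max_steps * warmup_ratio` up to a whole step.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 2
max_steps = 200
warmup_ratio = 0.03
num_rows = 5749   # dataset size reported above
num_devices = 1   # assumption: a single Colab GPU

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total examples seen over the whole run.
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_rows
# Assumption: the warmup ratio is rounded up to a whole number of steps.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(effective_batch, examples_seen, round(fraction_of_epoch, 3), warmup_steps)
```

Under these assumptions the model sees only a small fraction of one epoch, which is worth keeping in mind when interpreting the training loss.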

In [28]:
from trl import SFTTrainer

In [29]:
max_seq_length = 512

In [37]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/5749 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [38]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

In [None]:
train_result = trainer.train()

Step,Training Loss
1,3.1823
2,3.3093
3,3.1542
4,3.2173
5,3.603
6,4.0059
7,4.0802
8,3.468
9,3.4727
10,3.9054


In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model
model_to_save.save_pretrained("outputs")

In [None]:
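from peft import PeftModel
# Load the trained adapter weights (get_peft_model would attach a new, untrained adapter).
base_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = PeftModel.from_pretrained(base_model, "outputs")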

In [None]:
# Text generation example
text = dataset['text'][5]
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
## Update the dataset with your new example
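As a starting point, a new record needs the same fields as the existing rows: `input`, `output`, and the concatenated `text` field built earlier with `map`. The sketch below builds such a record in plain Python; `make_example` and the sample prompt and response are hypothetical placeholders to be replaced with your own.

```python
def make_example(prompt, response):
    """Build a record with the same fields as the dataset: 'input', 'output', 'text'."""
    return {"input": prompt, "output": response, "text": prompt + response}

# Hypothetical example; replace with your own prompt and response.
new_example = make_example(
    "What does the r parameter control in LoRA? ",
    "It sets the rank of the low-rank update matrices.",
)
print(new_example["text"])

# With the Hugging Face dataset loaded above, the record could then be appended with:
# dataset = dataset.add_item(new_example)
```

`Dataset.add_item` returns a new dataset rather than modifying the existing one in place, hence the reassignment in the commented line.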

In [None]:
## Retrain the model on the new dataset

In [None]:
## Ask the model for the output corresponding to the example's input