# Instructions

This notebook must contain only the script used to train your model. We must be able to run it by clicking *Runtime -> Run all*.

Please also comment your code to explain what you are doing. Feel free to add text blocks (click the *+ Text* button below the menu) for further explanation.

You must upload to Moodle a .zip archive containing a folder named after yourselves.

This folder must contain both notebooks (training and testing), a *models* folder holding the weights of your trained models, and, if needed, a *datasets* folder holding any additional data used to train your models (data obtained by web scraping or by data augmentation). If you use data augmentation, also include the code that performs it in the notebook.

In [1]:
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
import evaluate

# Shorter floats and wider lines when printing arrays (other print options keep their defaults)
np.set_printoptions(precision=3, linewidth=150)

# Check that a CUDA GPU is visible before training
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))

True
1
NVIDIA GeForce RTX 3060 Ti


In [2]:
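# Load the training CSV into a Hugging Face Dataset
dataset = Dataset.from_pandas(pd.read_csv('fake_train.csv'))

# Hold out 20% of the rows for evaluation.
# The split is shuffled; pass seed=... to train_test_split to make it reproducible.
ds_train, ds_test = dataset.train_test_split(test_size=0.2).values()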
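The split above shuffles the rows and then cuts them into a training part and a held-out part. The following is a minimal pure-Python sketch of that idea, not the `datasets` implementation: the helper `split_indices`, the `seed` value, and the rounding rule (test size rounded up) are assumptions made for illustration. The row count 1458 is the sum of the two split sizes reported later in the `Map` output.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_test = math.ceil(n_rows * test_size)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]

# 1458 = 1166 + 292, the split sizes shown in the Map output
train_idx, test_idx = split_indices(1458, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the evaluation set identical from one run to the next, which makes accuracy figures comparable across runs.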

In [3]:
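# Map class indices to human-readable labels and back
id2label = {0: "News", 1: "Fake News"}
label2id = {"News": 0, "Fake News": 1}

# Pretrained RoBERTa encoder with a new 2-class classification head, moved to the GPU
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label=id2label,
    label2id=label2id
).cuda()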

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
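The newly initialized classification head outputs two raw scores (logits) per example, one per entry of `id2label`. As a rough illustration of how such scores become a label, here is a pure-Python sketch: the helpers `softmax` and `logits_to_label` and the example logits are hypothetical and are not part of the notebook or of `transformers`.

```python
import math

# Label mapping copied from the notebook
id2label = {0: "News", 1: "Fake News"}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_label(logits):
    """Return (label, probability) for one example's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return id2label[best], probs[best]

# Hypothetical logits, chosen only for illustration
label, prob = logits_to_label([-1.2, 2.3])
print(label, round(prob, 3))
```

Before fine-tuning, the head's weights are random, so its predictions carry no information; training is what makes the mapping meaningful.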


In [4]:
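tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess_function(examples):
    # Tokenize the 'data' text column and truncate to the 512-token limit.
    # Padding is left to the data collator, which pads each batch to its longest sequence.
    return tokenizer(examples['data'], truncation=True, max_length=512)

# Tokenize both splits in batches
tokenized_train = ds_train.map(preprocess_function, batched=True)
tokenized_test = ds_test.map(preprocess_function, batched=True)

# Pads every batch dynamically at training and evaluation time
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    predictions, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)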

Map: 100%|██████████| 1166/1166 [00:00<00:00, 3614.05 examples/s]
Map: 100%|██████████| 292/292 [00:00<00:00, 3447.63 examples/s]
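`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of a fixed global length. The sketch below shows the idea in pure Python; it is a conceptual illustration rather than the library's code, and the helper `pad_batch`, the token ids, and the padding id are assumptions.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; pad_id=1 is an assumed padding id
batch = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]], pad_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

A batch of short texts then costs only as many positions as its longest member, which is usually far fewer than 512.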


In [5]:
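training_args = TrainingArguments(
    output_dir="defi_3_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    # Evaluate on the held-out split after every epoch
    # (recent transformers releases name this argument eval_strategy)
    evaluation_strategy="epoch",
    push_to_hub=False,
)

# Checkpoint saving is disabled because it crashes on my machine.
# To enable it, add these arguments to TrainingArguments:
#    save_strategy="epoch",
#    load_best_model_at_end=True,

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Fine-tune the model; metrics are printed at the end of each epoch
trainer.train()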

                                                
 20%|██        | 73/365 [08:43<28:31,  5.86s/it]
{'eval_loss': 0.2186487466096878, 'eval_accuracy': 0.9212328767123288, 'eval_runtime': 64.6972, 'eval_samples_per_second': 4.513, 'eval_steps_per_second': 0.077, 'epoch': 1.0}
 40%|████      | 146/365 [16:52<18:58,  5.20s/it]
{'eval_loss': 0.3268195688724518, 'eval_accuracy': 0.910958904109589, 'eval_runtime': 64.4286, 'eval_samples_per_second': 4.532, 'eval_steps_per_second': 0.078, 'epoch': 2.0}
 60%|██████    | 219/365 [24:39<12:31,  5.14s/it]
{'eval_loss': 0.1477624773979187, 'eval_accuracy': 0.9623287671232876, 'eval_runtime': 63.3799, 'eval_samples_per_second': 4.607, 'eval_steps_per_second': 0.079, 'epoch': 3.0}
 80%|████████  | 292/365 [32:29<06:21,  5.22s/it]
{'eval_loss': 0.1419944167137146, 'eval_accuracy': 0.9623287671232876, 'eval_runtime': 64.5653, 'eval_samples_per_second': 4.523, 'eval_steps_per_second': 0.077, 'epoch': 4.0}
100%|██████████| 365/365 [40:02<00:00,  6.58s/it]
{'eval_loss': 0.1294557750225067, 'eval_accuracy': 0.9726027397260274, 'eval_runtime': 61.4227, 'eval_samples_per_second': 4.754, 'eval_steps_per_second': 0.081, 'epoch': 5.0}
{'train_runtime': 2402.6595, 'train_samples_per_second': 2.426, 'train_steps_per_second': 0.152, 'train_loss': 0.19368752257464683, 'epoch': 5.0}

TrainOutput(global_step=365, training_loss=0.19368752257464683, metrics={'train_runtime': 2402.6595, 'train_samples_per_second': 2.426, 'train_steps_per_second': 0.152, 'train_loss': 0.19368752257464683, 'epoch': 5.0})
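The `global_step=365` reported above follows from the split size and the training arguments. A short worked check, assuming a single GPU, no gradient accumulation, and a final partial batch that is kept rather than dropped:

```python
import math

train_examples = 1166  # size of the training split, from the Map output
batch_size = 16        # per_device_train_batch_size
epochs = 5             # num_train_epochs

# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # prints: 73 365
```

The 73 steps per epoch also match the progress bar, which shows an evaluation after steps 73, 146, 219, 292 and 365.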

In [6]:
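# Save the fine-tuned weights and the tokenizer in the same folder so both can be reloaded together
trainer.save_model("saved_model")
tokenizer.save_pretrained("saved_model")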

('saved_model\\tokenizer_config.json',
 'saved_model\\special_tokens_map.json',
 'saved_model\\vocab.json',
 'saved_model\\merges.txt',
 'saved_model\\added_tokens.json',
 'saved_model\\tokenizer.json')