<h1> Star Rating Classification with CamemBERT </h1>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import CamembertTokenizer, TrainingArguments, CamembertForSequenceClassification, Trainer
import torch
import os

os.environ["TQDM_NOTEBOOK"] = "1"

df = pd.read_csv('../data/avis/general_df_clean_sent_15k.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15050 entries, 0 to 15049
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       15050 non-null  int64  
 1   user             15049 non-null  object 
 2   etoiles          15050 non-null  int64  
 3   n_avis           15050 non-null  float64
 4   localisation     15050 non-null  object 
 5   date_avis        15050 non-null  object 
 6   titre_avis       15050 non-null  object 
 7   text_avis        15050 non-null  object 
 8   date_experience  15003 non-null  object 
 9   page             15003 non-null  object 
 10  label            15050 non-null  int64  
 11  score            15050 non-null  float64
 12  sentiment_norm   15050 non-null  float64
 13  longueur_text    15050 non-null  float64
dtypes: float64(4), int64(3), object(7)
memory usage: 1.6+ MB


In [2]:
df['text_avis'] = df['text_avis'].astype("str")

In [3]:
from torch.utils.data import Dataset

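class CustomDataset(Dataset):
    """Wrap tokenizer encodings and integer labels for use with Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One tensor per encoding field (input_ids, attention_mask, ...)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Trainer reads the target class from the 'labels' key
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)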

In [5]:
# Split the dataset into train (70%), validation (15%) and test (15%)

features = df.text_avis
target = df.etoiles

X_train, X_temp, y_train, y_temp = train_test_split(features, target, random_state=7,
                                                    test_size=0.3)

X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5,
                                                    random_state=7)

X_train = X_train.tolist()
X_valid = X_valid.tolist()
X_test = X_test.tolist()

# Subtract 1 from each star rating: the classifier expects class ids 0-4, not 1-5
y_train = y_train - 1
y_valid = y_valid - 1
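
The shift above has to be undone when reading predictions back as star ratings. A minimal sketch of the round trip follows; the helper names (`stars_to_class`, `class_to_stars`, `predicted_stars`) and the example logits are hypothetical and do not appear in the notebook.

```python
# Hypothetical helpers illustrating the star <-> class id mapping.
# None of these names come from the notebook; the logits are made up.
NUM_CLASSES = 5


def stars_to_class(stars: int) -> int:
    """Map a 1-5 star rating to a 0-4 class id."""
    if not 1 <= stars <= NUM_CLASSES:
        raise ValueError(f"expected a rating between 1 and {NUM_CLASSES}, got {stars}")
    return stars - 1


def class_to_stars(class_id: int) -> int:
    """Map a 0-4 class id back to a 1-5 star rating."""
    if not 0 <= class_id < NUM_CLASSES:
        raise ValueError(f"expected a class id between 0 and {NUM_CLASSES - 1}, got {class_id}")
    return class_id + 1


def predicted_stars(logits: list) -> int:
    """Pick the highest-scoring class and convert it to a star rating."""
    class_id = max(range(len(logits)), key=logits.__getitem__)
    return class_to_stars(class_id)


# Made-up scores for one review, one value per class id 0..4.
example_logits = [-1.2, 0.3, 2.5, 0.1, -0.7]
print(predicted_stars(example_logits))
```

Forgetting the `+ 1` on the way back would report every prediction one star too low, so keeping both directions next to each other is one way to avoid that mistake.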

In [6]:
# Tokenization: truncate reviews to 512 tokens and pad each split to its longest sequence

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
train_encodings = tokenizer(X_train, truncation=True, padding=True,
                            max_length=512)
valid_encodings = tokenizer(X_valid, truncation=True, padding=True,
                            max_length=512)


In [7]:
train_dataset = CustomDataset(train_encodings, y_train.tolist())
valid_dataset = CustomDataset(valid_encodings, y_valid.tolist())

In [8]:
# Train the model

model = CamembertForSequenceClassification.from_pretrained('camembert-base', num_labels=5)  # one class per star rating (1 to 5)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=".logs")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset)

trainer.train()

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.out_proj.bias

  0%|          | 0/3951 [00:00<?, ?it/s]

Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json


{'loss': 1.2298, 'learning_rate': 5e-05, 'epoch': 0.38}


Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json


{'loss': 0.8854, 'learning_rate': 4.275572297884671e-05, 'epoch': 0.76}


Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json


{'loss': 0.7626, 'learning_rate': 3.551144595769342e-05, 'epoch': 1.14}


Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json


{'loss': 0.6816, 'learning_rate': 2.8267168936540134e-05, 'epoch': 1.52}


Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json


{'loss': 0.6604, 'learning_rate': 2.1022891915386843e-05, 'epoch': 1.9}


Model weights saved in ./results/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-3000
Configuration saved in ./results/checkpoint-3000/config.json


{'loss': 0.5153, 'learning_rate': 1.3778614894233557e-05, 'epoch': 2.28}


Model weights saved in ./results/checkpoint-3000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-3500
Configuration saved in ./results/checkpoint-3500/config.json


{'loss': 0.4983, 'learning_rate': 6.5343378730802664e-06, 'epoch': 2.66}


Model weights saved in ./results/checkpoint-3500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 65332.5376, 'train_samples_per_second': 0.484, 'train_steps_per_second': 0.06, 'train_loss': 0.7153817043820987, 'epoch': 3.0}


TrainOutput(global_step=3951, training_loss=0.7153817043820987, metrics={'train_runtime': 65332.5376, 'train_samples_per_second': 0.484, 'train_steps_per_second': 0.06, 'train_loss': 0.7153817043820987, 'epoch': 3.0})
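
The numbers in the log above can be cross-checked by hand. The sketch below recomputes the split sizes, the total step count, the logged learning rates and the throughput. It assumes a single training device, no gradient accumulation, the default `learning_rate` of 5e-5 with the default linear schedule, and that `train_test_split` rounds the held-out share up; the function name `linear_lr` is hypothetical.

```python
import math

# Values taken from the notebook cells and logs above.
N_ROWS = 15050            # rows reported by df.info()
BATCH_SIZE = 8            # per_device_train_batch_size
EPOCHS = 3                # num_train_epochs
WARMUP_STEPS = 500        # warmup_steps
TRAIN_RUNTIME = 65332.5376  # seconds, from the final log line

# Assumption: learning_rate was left at the TrainingArguments default.
BASE_LR = 5e-5

# Split sizes: 30% held out, then halved into validation and test.
n_temp = math.ceil(0.3 * N_ROWS)
n_train = N_ROWS - n_temp
n_test = math.ceil(0.5 * n_temp)
n_valid = n_temp - n_test

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        factor = step / max(1, WARMUP_STEPS)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - WARMUP_STEPS))
    return BASE_LR * factor


print(n_train, n_valid, n_test)       # split sizes
print(total_steps)                    # compare with the 3951 steps in the log
for step in (500, 1000, 3500):
    print(step, linear_lr(step))      # compare with the logged learning_rate values

# Throughput, for comparison with the final metrics line.
print(round(n_train * EPOCHS / TRAIN_RUNTIME, 3))
print(round(total_steps / TRAIN_RUNTIME, 2))
```

Under these assumptions the recomputed step count and learning rates agree with the log, which suggests the run used the default optimizer settings. The runtime of about 65,000 seconds is roughly 18 hours for three epochs.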