# [Detecting the difficulty level of French texts](https://www.kaggle.com/c/detecting-the-difficulty-level-of-french-texts/overview/evaluation)
### Implementing `Camembert` model
---

![camembertLogo](https://i2.wp.com/ledatascientist.com/wp-content/uploads/2020/11/camembert.png?resize=200%2C220&ssl=1)


### What is Camembert?

[Camembert](https://camembert-model.fr/) is a pre-trained NLP model. It is based on the [RoBERTa](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) architecture, a variant of BERT, released in 2019 by Facebook AI researchers. Camembert differs from other BERT-based models because it has been trained on a French corpus.


### What is a pre-trained model?

Many deep learning applications need large amounts of data, computing resources and training time, which has encouraged researchers and data scientists to use pre-trained models. This approach is called transfer learning. It consists of taking a model that has already been trained on one task and reusing it for another, similar task.


### How can it help us?

Words that Camembert has seen in similar contexts are represented by vectors that lie close to each other (small distance). If we tell the model that a sentence containing complex words is labelled C1, it is likely that these words were observed in contexts close to those of other complicated words. When the model later sees such words in a new sentence, their vectors will be close to the ones it learned to associate with C1, so it can predict a similar level.
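To make the idea of "close vectors" more concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are invented for illustration and are not real Camembert embeddings (which have hundreds of dimensions); the helper name `cosine_similarity` is our own.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (invented values, not real Camembert vectors)
toy_embeddings = {
    "extrapolation": [0.9, 0.8, 0.1],
    "inflation": [0.8, 0.9, 0.2],
    "pomme": [0.1, 0.2, 0.9],
}

sim_complex = cosine_similarity(toy_embeddings["extrapolation"], toy_embeddings["inflation"])
sim_mixed = cosine_similarity(toy_embeddings["extrapolation"], toy_embeddings["pomme"])

# In this toy setup, the two "complex" words are more similar to each other
# than a complex word is to a simple one
print(f"extrapolation vs inflation: {sim_complex:.3f}")
print(f"extrapolation vs pomme:     {sim_mixed:.3f}")
```

With these made-up values, the two "complex" words score higher than the complex/simple pair, which is the kind of signal the classifier can exploit.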



# 1. Fine-tuning the model

This technique relies on more advanced methods than those we saw in class, so we won't go into all the details. However, almost everything we did is very well explained on the Hugging Face website:

https://huggingface.co/docs/transformers/training

https://huggingface.co/camembert-base

***Be careful***, the training takes a lot of time.


## 1.1 Split dataset and encode labels
We first need to split the training data into training, validation and test sets.

In [1]:
import random as r
from sklearn.metrics import confusion_matrix, accuracy_score
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from transformers import TFCamembertForSequenceClassification, CamembertTokenizer, AutoConfig
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Fix the random seeds (for reproducibility)
r.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

training_data = 'https://raw.githubusercontent.com/LaCrazyTomato/Group-Project-DM-ML-2021/main/data/training_data.csv'

df = pd.read_csv(training_data, encoding='utf-8')

X = df['sentence'].values # .values returns a NumPy array (drops the pandas index)
y = df['difficulty'].values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=0)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.2,
                                                  random_state=0
                                                  )

# Encode the target labels as integers (they are strings such as 'A1')
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_val = label_encoder.transform(y_val)
y_test = label_encoder.transform(y_test)



# Load the tokenizer and configure the model for our 6 difficulty classes
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
config = AutoConfig.from_pretrained('camembert-base')
config.num_labels = 6
config.hidden_dropout_prob = 0.1

model = TFCamembertForSequenceClassification.from_pretrained('camembert-base',
                                                             config=config
                                                             )



All model checkpoint layers were used when initializing TFCamembertForSequenceClassification.

Some layers of TFCamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
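The two successive calls to `train_test_split` above (first `test_size=0.1`, then `test_size=0.2` on what remains) give roughly a 72% / 18% / 10% split. The sketch below works through that arithmetic. It assumes held-out sizes are rounded up, which is our understanding of how scikit-learn treats a fractional `test_size`; the helper `split_sizes` and the dataset size of 1000 are hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.1, val_size=0.2):
    """Sizes obtained by holding out a test set, then a validation set from the rest."""
    n_test = math.ceil(test_size * n_samples)    # first split: test set
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)    # second split: validation set
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# Hypothetical dataset of 1000 sentences (not the real size of training_data.csv)
n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)
```

Note that the validation fraction applies to the remaining 90% of the data, which is why it ends up being 18% of the whole rather than 20%.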


#### Examples

In [2]:
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

inputs = tokenizer("J'étudie à la HEC Lausanne.")

encoded_sequence = inputs["input_ids"]

encoded_sequence


[5, 121, 11, 141, 744, 4669, 15, 13, 454, 4801, 12907, 9, 6]

In [3]:
tokenizer.decode(encoded_sequence)

"J'étudie à la HEC Lausanne."

In [4]:
inputs = tokenizer("J'étudie à la HEC Genève.")

encoded_sequence = inputs["input_ids"]

encoded_sequence

[5, 121, 11, 141, 744, 4669, 15, 13, 454, 4801, 4802, 9, 6]

The cells above show how the Camembert tokenizer turns a sentence into a list of token ids. The two sentences differ only in their last word, and the two encodings differ only in the id just before the final period (12907 versus 4802).

The input ids are often the only required parameter to pass to the model. They are token indices: numerical representations of the tokens that make up the sequences used as input by the model (more information [here](https://huggingface.co/docs/transformers/glossary)).
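When several sentences are fed to the model together, they must all have the same length, which is what the `padding="max_length"` and `truncation=True` options used later in this notebook take care of. Here is a simplified, plain-Python sketch of that idea. The helper `pad_batch`, the second (shorter) sequence and the value `pad_id=1` are assumptions for illustration; the real tokenizer does this for us and also handles special tokens properly when truncating.

```python
def pad_batch(batch_ids, max_length, pad_id):
    """Pad (or truncate) each sequence to max_length and build the attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                       # naive truncation (simplified)
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)     # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The first sequence is the one shown above; the second is a made-up shorter one
batch = [
    [5, 121, 11, 141, 744, 4669, 15, 13, 454, 4801, 12907, 9, 6],
    [5, 121, 11, 141, 9, 6],
]

# pad_id=1 is an assumption; in practice, read it from tokenizer.pad_token_id
padded = pad_batch(batch, max_length=16, pad_id=1)

for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids)
    print(mask)
```

The attention mask tells the model which positions hold real tokens and which are only padding to be ignored.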

## 1.2 Fine-tune the model

Put simply, we will now show the model which output label corresponds to each tokenized sentence, so that it learns to map sentences to difficulty levels.


In [None]:
# To train a deep learning model, we need a training set and a validation set.
# We reuse X_train, X_val, X_test, y_train, y_val and y_test created in section 1.1.

# Tokenize
X_train = tokenizer(X_train.tolist(), padding="max_length", truncation=True, return_tensors='tf')
X_train = {x: X_train[x] for x in tokenizer.model_input_names}
X_val = tokenizer(X_val.tolist(), padding="max_length", truncation=True, return_tensors='tf')
X_val = {x: X_val[x] for x in tokenizer.model_input_names}
X_test = tokenizer(X_test.tolist(), padding="max_length", truncation=True, return_tensors='tf')
X_test = {x: X_test[x] for x in tokenizer.model_input_names}


# We will use tensorflow to build and train our model

# SETTINGS
BATCH_SIZE = 6
EPOCHS = 6

# Create tensorflow dataset
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(10000).batch(BATCH_SIZE)\
    .prefetch(tf.data.AUTOTUNE)

val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)


# Prepare the model (optimizer, loss, metrics)
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)


# Train
# Callbacks: stop early if validation accuracy stops improving, reduce the learning
# rate when validation loss plateaus, log the history to CSV and save checkpoints
es_cb = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2, restore_best_weights=True)
lr_cb = tf.keras.callbacks.ReduceLROnPlateau(patience=0, verbose=1, min_lr=1e-8)
csv_cb = tf.keras.callbacks.CSVLogger('history.csv')
cp_cb = tf.keras.callbacks.ModelCheckpoint('cp')


model.fit(train_ds,
          validation_data=val_ds,
          epochs=EPOCHS,
          callbacks=[es_cb, lr_cb, csv_cb, cp_cb]
          )

# Save the tuned model
model.save_pretrained('trained_camembert')
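The loss passed to `model.compile` above, `SparseCategoricalCrossentropy(from_logits=True)`, compares the integer label of a sentence with the raw scores (logits) the model outputs for the six classes. The plain-Python sketch below shows what this computes for a single example; the logit values are invented, the helper name is our own, and Keras additionally averages the loss over each batch.

```python
import math

def sparse_cross_entropy_from_logits(logits, true_class):
    """Cross-entropy for one example: -log(softmax(logits)[true_class])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

# Hypothetical logits for the six classes A1..C2 (invented values)
logits = [0.2, 0.1, 0.5, 1.0, 3.0, 0.7]

loss_if_c1 = sparse_cross_entropy_from_logits(logits, true_class=4)  # true label matches the top score
loss_if_a1 = sparse_cross_entropy_from_logits(logits, true_class=0)  # true label has a low score

print(f"loss when the true class is C1: {loss_if_c1:.3f}")
print(f"loss when the true class is A1: {loss_if_a1:.3f}")
```

The loss is small when the highest score falls on the true class and large otherwise, which is the signal the optimizer uses to adjust the weights.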


# 2. Predict with the model
Now that we have trained our model, we can use it to predict the difficulty level of new sentences.

In [7]:
label_encoder.inverse_transform([0, 1, 2, 3, 4, 5])

array(['A1', 'A2', 'B1', 'B2', 'C1', 'C2'], dtype=object)

In [8]:
# 1) Load the trained model
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
model = TFCamembertForSequenceClassification.from_pretrained('trained_camembert')

# 2) Tokenize the sentences
example_sents = ["Bob Seely, élu de l’île de Wight, s’est ainsi dit « fatigué » des « extrapolations ridicules » "
                 "des scientifiques conseillant le gouvernement.", 
                 "Je mange une pomme.", 
                 "L'inflation inquiète la place financière."]

tokenized_sents = tokenizer(example_sents, padding="max_length", truncation=True, return_tensors='tf')
model_input = {x: tokenized_sents[x] for x in tokenizer.model_input_names}


model_output = model.predict(model_input) # Forward pass on the tokenized sentences
pred_logits = model_output.logits # Raw scores (logits), one per class; not probabilities
pred_classes = np.argmax(pred_logits, axis=1) # Index of the highest-scoring class


All model checkpoint layers were used when initializing TFCamembertForSequenceClassification.

All the layers of TFCamembertForSequenceClassification were initialized from the model checkpoint at trained_camembert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFCamembertForSequenceClassification for predictions without further training.


In [9]:
pred_classes

array([4, 0, 4], dtype=int64)

In [10]:
label_encoder.inverse_transform(pred_classes)

array(['C1', 'A1', 'C1'], dtype=object)
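The logits returned by the model are raw scores, not probabilities. If probabilities are needed, a softmax can be applied; the predicted class stays the same because softmax preserves the ordering of the scores. A minimal plain-Python sketch, with invented logits and the class order given by `label_encoder` above:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

levels = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']  # same order as label_encoder
logits = [-1.2, -0.3, 0.4, 1.1, 2.5, 0.9]      # hypothetical scores for one sentence

probs = softmax(logits)
best = max(range(len(logits)), key=lambda i: logits[i])  # same as argmax

print(levels[best], round(probs[best], 3))
```

In the notebook itself, `tf.nn.softmax(pred_logits, axis=1)` would play the same role on the whole batch.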

### Kaggle unlabeled data for submission

In [65]:
kaggleDf = pd.read_csv("https://raw.githubusercontent.com/LaCrazyTomato/Group-Project-DM-ML-2021/main/data/unlabelled_test_data.csv")

X = kaggleDf.sentence.to_list()


tokenized_sents = tokenizer(X, padding="max_length", truncation=True, return_tensors='tf')
model_input = {x: tokenized_sents[x] for x in tokenizer.model_input_names}

model_output = model.predict(model_input)
pred_logits = model_output.logits
pred_classes = np.argmax(pred_logits, axis=1)


# Build the submission table: one row per sentence, with labels decoded back to A1..C2
pred = pd.DataFrame({'id': range(len(pred_classes)),
                     'difficulty': label_encoder.inverse_transform(pred_classes)})

pred



| | id | difficulty |
|---|---|---|
| 0 | 0 | C2 |
| 1 | 1 | A2 |
| 2 | 2 | B2 |
| 3 | 3 | A2 |
| 4 | 4 | C2 |
| ... | ... | ... |
| 1195 | 1195 | B1 |
| 1196 | 1196 | B1 |
| 1197 | 1197 | C2 |
| 1198 | 1198 | B2 |


Voilà! That's how we achieved an accuracy of 57% on the Kaggle set.
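For reference, accuracy is simply the share of sentences whose predicted level equals the true level. The sketch below computes it on a handful of invented labels (not our real predictions); `accuracy_score` from scikit-learn, imported at the top of the notebook, gives the same result and could be applied to the held-out `y_test`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, for illustration only
y_true = ['A1', 'B2', 'C1', 'A2', 'C2']
y_pred = ['A1', 'B1', 'C1', 'A2', 'C1']

score = accuracy(y_true, y_pred)
print(f"accuracy: {score:.0%}")
```

Accuracy counts a prediction of B1 for a B2 sentence as just as wrong as a prediction of A1, so a confusion matrix can be a useful complement to see how far off the errors are.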