### A small prototype using CamemBERT

Download the following files from https://huggingface.co/jplu/tf-camembert-base/tree/main/ and copy them into the local subdirectory **huggingface/tf-camembert-base/**:
  * config.json
  * tokenizer.json
  * sentencepiece.bpe.model
  * tf_model.h5
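
Because the model and tokenizer are loaded further down with `local_files_only=True`, a missing file only shows up as a loading error. A minimal sketch of a pre-flight check, assuming the four files listed above are all that is required; the helper name `missing_files` is hypothetical and not part of `transformers`:

```python
import os

# File names taken from the list above.
REQUIRED_FILES = ("config.json", "tokenizer.json",
                  "sentencepiece.bpe.model", "tf_model.h5")

def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are absent from `directory`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]

if __name__ == "__main__":
    bertdir = os.path.join("huggingface", "tf-camembert-base")
    missing = missing_files(bertdir)
    if missing:
        print("Missing files in", bertdir, ":", ", ".join(missing))
    else:
        print("All model files are present.")
```

Running this before the main cell gives a clearer message than the exception raised by `from_pretrained`.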


In [16]:
from rakuten_common import *

In [17]:
# Some links
# https://nlpinfrench.fr/transformers/02_firstBert_fr.html
# https://github.com/TheophileBlard/french-sentiment-analysis-with-bert/blob/master/03_bert.ipynb
# https://huggingface.co/models?filter=tf&search=camembert  
# https://huggingface.co/jplu/tf-camembert-base/tree/main  # Download
# http://mccormickml.com/2020/07/29/smart-batching-tutorial/#51-load-pre-trained-model   batch_size
# https://melusine.readthedocs.io/en/latest/readme.html

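# Explicit imports for the names used below. `from rakuten_common import *`
# may already provide them, in which case these lines are redundant.
import os
import numpy as np
import tensorflow as tf
import transformers
from tensorflow.keras.layers import Dense, Dropout, Input
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split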

bertdir = os.path.join("huggingface", "tf-camembert-base")
max_length = 200  # Maximum sequence length, in tokens

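def model_body():
    """Network architecture: CamemBERT encoder followed by a small dense head."""
    bert_model = transformers.TFCamembertModel.from_pretrained(
        bertdir, local_files_only=True)
    txt_input = Input(shape=(max_length,), dtype="int32")  # token ids
    att_input = Input(shape=(max_length,), dtype="int32")  # attention mask
    inp = [txt_input, att_input]
    # Index 1 is the pooled output. Note the loading warning in the output
    # below: the pooler weights are not in this checkpoint and are newly initialized.
    x = bert_model(txt_input, attention_mask=att_input)[1]
#    x = Dense(200, activation="relu")(x)
#    x = Dropout(0.2)(x)
    x = Dense(100, activation="relu")(x)
    x = Dropout(0.2)(x)
    return inp, x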

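def preprocess_X(X):
    """Tokenize the texts; return [input_ids, attention_mask], padded/truncated to max_length."""
    # `tokenizer` is the module-level CamembertTokenizer created further down.
    seqs = tokenizer.batch_encode_plus(list(X), max_length=max_length,
                                       padding="max_length",
                                       truncation=True)
    # A list (one array per model input) matches the `inp` list of the Keras model.
    return [np.asarray(seqs["input_ids"]),
            np.asarray(seqs["attention_mask"])]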

# Build the model
inp, x = model_body()
# 27 output classes; this must agree with NB_CLASSES, which is asserted further down.
x = Dense(27, activation="softmax")(x)
model = tf.keras.models.Model(inputs=inp, outputs=x)

# Freeze CamemBERT so that only the dense head is trained
for layer in model.layers:
    if "tf_camembert_model" in layer.name:
        layer.trainable = False

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Split the data
X = get_X_text()[:NB_ECHANTILLONS]
y = get_y()[:NB_ECHANTILLONS]

# Hold out the test set first, then take the validation set from the remainder.
X, X_test, y, y_test = train_test_split(X, y, test_size=TEST_SIZE)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=VALIDATION_SPLIT)

# Tokenize the inputs
tokenizer = transformers.CamembertTokenizer.from_pretrained(bertdir,
                                                            local_files_only=True)
X_train = preprocess_X(X_train)
X_val = preprocess_X(X_val)
X_test = preprocess_X(X_test)

# Encode the labels as contiguous integer indices
fit_labels = dict(enumerate(sorted(set(y_train))))  # index -> original label
assert len(fit_labels) == NB_CLASSES
rv = {label: i for i, label in fit_labels.items()}  # original label -> index

y_train = np.array([rv[v] for v in y_train])
y_val = np.array([rv[v] for v in y_val])
y_test = np.array([rv[v] for v in y_test])


Some layers from the model checkpoint at huggingface\tf-camembert-base were not used when initializing TFCamembertModel: ['lm_head']
- This IS expected if you are initializing TFCamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFCamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFCamembertModel were not initialized from the model checkpoint at huggingface\tf-camembert-base and are newly initialized: ['roberta/pooler/dense/kernel:0', 'roberta/pooler/dense/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_15 (InputLayer)           [(None, 200)]        0                                            
__________________________________________________________________________________________________
input_16 (InputLayer)           [(None, 200)]        0                                            
__________________________________________________________________________________________________
tf_camembert_model_7 (TFCamembe TFBaseModelOutputWit 110621952   input_15[0][0]                   
                                                                 input_16[0][0]                   
__________________________________________________________________________________________________
dense_17 (Dense)                (None, 100)          76900       tf_camembert_model_7[0][1] 
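
The parameter count reported for `dense_17` can be checked by hand: a dense layer has one weight per input/output pair plus one bias per output. The sketch below assumes the pooled CamemBERT output has 768 features (the usual hidden size of a base model; the summary above does not print it); `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_outputs):
    """Number of parameters of a fully connected layer with bias."""
    return n_inputs * n_outputs + n_outputs

# Assumption: the pooled output fed into Dense(100) has 768 features.
print(dense_params(768, 100))  # 76900, matching dense_17 in the summary

# Final softmax layer defined in the code: Dense(27) on 100 features.
print(dense_params(100, 27))   # 2727
```

With the encoder frozen, these two layers would be the only trainable weights: about 80 thousand parameters against roughly 110 million frozen ones.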

In [None]:
# Training
model.fit(X_train, y_train, validation_data=(X_val, y_val), verbose=1, epochs=1, batch_size=32)
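
How many samples actually reach `model.fit` depends on the two successive `train_test_split` calls: the validation fraction is taken from what remains once the test set has been removed. A rough sketch of that arithmetic, using made-up values for `TEST_SIZE` and `VALIDATION_SPLIT` (the real ones come from `rakuten_common` and are not shown here); `split_fractions` is a hypothetical helper:

```python
def split_fractions(test_size, validation_split):
    """Fractions of the full dataset ending up in (train, val, test)
    after a test split followed by a validation split of the remainder."""
    remainder = 1.0 - test_size
    val = remainder * validation_split
    train = remainder - val
    return train, val, test_size

# Illustrative values only, not the notebook's actual constants.
train, val, test = split_fractions(test_size=0.2, validation_split=0.1)
print(f"train={train:.2f} val={val:.2f} test={test:.2f}")
# train=0.72 val=0.08 test=0.20
```

The real sample counts may differ by one or two because scikit-learn rounds to whole samples.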

In [None]:
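# Prediction
softmaxout = model.predict(X_test, verbose=1)
# y_test holds integer indices at this point, so map both sides back to the
# original labels before comparing them.
y_pred = [fit_labels[i] for i in np.argmax(softmaxout, axis=1)]
y_true = [fit_labels[i] for i in y_test]
print(classification_report(y_true, y_pred))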
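
The label handling around prediction can be illustrated in isolation: original class codes are mapped to contiguous indices for `sparse_categorical_crossentropy`, and the argmax of each softmax row is mapped back to a code. A minimal pure-Python sketch with made-up class codes and probabilities (none of these values come from the dataset):

```python
# Made-up class codes standing in for the real labels.
y_raw = [2583, 10, 1280, 10, 2583]

# Equivalent to the fit_labels / rv dictionaries built in the notebook.
fit_labels = dict(enumerate(sorted(set(y_raw))))    # index -> code
rv = {code: i for i, code in fit_labels.items()}    # code -> index

y_idx = [rv[v] for v in y_raw]  # what the network is trained on
print(y_idx)                    # [2, 0, 1, 0, 2]

# Made-up softmax rows: one row per sample, one column per class index.
softmaxout = [[0.1, 0.2, 0.7],
              [0.8, 0.1, 0.1],
              [0.3, 0.4, 0.3]]
pred_idx = [max(range(len(row)), key=row.__getitem__) for row in softmaxout]
y_pred = [fit_labels[i] for i in pred_idx]
print(y_pred)                   # [2583, 10, 1280]
```

For the report to be meaningful, the true and predicted labels passed to `classification_report` have to live in the same space: both indices, or both original codes.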