# BERT
BERT is a language model released by Google in 2018 that set the state of the art on many NLP benchmarks when it came out.
It is a neural network pre-trained on a very large corpus. The idea is to reuse this model by adapting it (fine-tuning) to our own corpus.

**Important:** This notebook does not work. None of our many attempts to fine-tune the BERT model succeeded.

## Sources
* https://skimai.com/fine-tuning-bert-for-sentiment-analysis/
* https://huggingface.co/docs/transformers/training
* https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#tfbertforsequenceclassification
* https://www.tensorflow.org/api_docs/python/tf/keras/Model
* https://stackoverflow.com/questions/60463829/training-tfbertforsequenceclassification-with-custom-x-and-y-data

In [1]:
import pandas as pd
import numpy as np
from typing import List, Tuple
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
from transformers import TFBertForSequenceClassification
from tensorflow.keras.optimizers import Adam
from tensorflow.python.framework.ops import EagerTensor
import tensorflow as tf

In [2]:
# Load the data
df = pd.read_csv(r"../data/news_dataset.csv")
# Encode the class labels as integers
le = LabelEncoder()
df["category"] = le.fit_transform(df["category"])
# Split the dataset into training and test sets
X, y = df["text"], df["category"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)

In [3]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

In [4]:
# Use the same checkpoint as the tokenizer: the cased and uncased vocabularies differ
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=df["category"].nunique())
#model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=df["category"].nunique())

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertFo

In [5]:
# Step 1: Find the mean article length in number of tokens
len_articles = []
for t in df["text"]:
    len_articles.append(len(bert_tokenizer.tokenize(t)))
MAX_LEN = int(np.array(len_articles).mean() + 1)
print(MAX_LEN)

453


It is important to decide what sequence length to use. If the length is too large, most sequences are largely padding and memory is wasted. If it is too small, articles are truncated and information is lost. As a compromise, we use the mean number of tokens per article over the dataset.
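
To make this trade-off concrete, the sketch below computes, for a candidate maximum length, the share of articles that would be truncated and the share of positions that would be padding. The token counts are made-up values for illustration, and `length_tradeoff` is a hypothetical helper, not part of any library. Note also that the BERT base checkpoints accept at most 512 positions, so a candidate above that would have to be capped.

```python
from statistics import mean
from typing import Dict, List


def length_tradeoff(token_counts: List[int], max_len: int) -> Dict[str, float]:
    """Share of truncated articles and of padding positions for a given max_len."""
    truncated = sum(1 for n in token_counts if n > max_len)
    # After truncation each article keeps min(n, max_len) real tokens;
    # the rest of its max_len slots are padding.
    real_tokens = sum(min(n, max_len) for n in token_counts)
    total_slots = max_len * len(token_counts)
    return {
        "truncated_share": truncated / len(token_counts),
        "padding_share": 1 - real_tokens / total_slots,
    }


# Hypothetical token counts, not taken from the dataset used here.
token_counts = [120, 200, 310, 450, 520, 1100]
mean_len = int(mean(token_counts) + 1)  # same rule as MAX_LEN
stats = length_tradeoff(token_counts, mean_len)
print(mean_len, stats)
```

With these made-up counts, the mean-based length truncates a third of the articles while roughly a quarter of the positions are padding. Running the same helper with a high percentile of the real token counts would show how the two shares move against each other.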

In [6]:
# Step 2: Vectorization with the tokenizer that ships with the BERT model
def tokenize(corpus: List[str]) -> dict:
    encoded = bert_tokenizer.batch_encode_plus(
        list(corpus),              # Accept a pandas Series as well as a list
        add_special_tokens=True,   # Add `[CLS]` and `[SEP]`
        padding=True,              # Pad to the longest sequence in the batch
        truncation=True,           # Drop tokens beyond `max_length`
        max_length=MAX_LEN,        # Upper bound on the sequence length
        return_tensors='tf',       # Return TensorFlow tensors
        return_token_type_ids=False,
        return_attention_mask=True
    )
    # Plain dict holding the "input_ids" and "attention_mask" tensors
    return dict(encoded)
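
To illustrate what the `padding`, `truncation` and `max_length` options do, here is a pure-Python sketch of the same idea on lists of token ids. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are arbitrary, and the pad id 0 is an assumption. It also ignores special tokens, which the real tokenizer keeps in place when truncating.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: id of the padding token in the vocabulary


def pad_and_truncate(batch: List[List[int]], max_length: int) -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then pad all to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate([[5, 6, 7], [5, 6, 7, 8, 9, 10]], max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 0], [5, 6, 7, 8]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

This mirrors `padding=True`, which pads to the longest sequence remaining after truncation. Passing `padding="max_length"` to the tokenizer instead pads every sequence to `max_length`, whatever the longest one is.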

In [7]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(tokenize(X_train)),y_train))#.shuffle(1000).batch(16).prefetch(1)

In [8]:
# Step 3: Fine-tuning for multi-class classification with the TensorFlow API
model.compile(optimizer=Adam(3e-5), loss=model.compute_loss, metrics=["accuracy"])
#model.fit(train_inputs, y_train)
model.fit(x=train_dataset, y=None)

AttributeError: in user code:

    File "/home/maxime/Programmation/NLP-project/venv/lib/python3.8/site-packages/keras/engine/training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "/home/maxime/Programmation/NLP-project/venv/lib/python3.8/site-packages/transformers/modeling_tf_utils.py", line 1436, in compute_loss  *
        return super().compute_loss(*args, **kwargs)
    File "/home/maxime/Programmation/NLP-project/venv/lib/python3.8/site-packages/keras/engine/training.py", line 1052, in compute_loss  **
        return self.compiled_loss(
    File "/home/maxime/Programmation/NLP-project/venv/lib/python3.8/site-packages/keras/engine/compile_utils.py", line 263, in __call__
        y_t, y_p, sw = match_dtype_and_rank(y_t, y_p, sw)
    File "/home/maxime/Programmation/NLP-project/venv/lib/python3.8/site-packages/keras/engine/compile_utils.py", line 840, in match_dtype_and_rank
        if (y_t.dtype.is_floating and y_p.dtype.is_floating) or (

    AttributeError: 'NoneType' object has no attribute 'dtype'
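
One plausible reading of this traceback is an argument mismatch. A callable passed as `loss=` to `compile` is called with `(y_true, y_pred)`, whereas in the Keras version shown above `Model.compute_loss` takes `(x, y, y_pred, sample_weight)`. The labels would then land in `x`, the predictions in `y`, and `y_pred` would stay `None`, which matches the final `AttributeError`. The sketch below reproduces that mechanism with stand-in functions only: `compute_loss_like` and `call_as_compiled_loss` are hypothetical names and no Keras code is involved.

```python
def compute_loss_like(x=None, y=None, y_pred=None, sample_weight=None):
    """Stand-in using the (x, y, y_pred, sample_weight) parameter order."""
    if y_pred is None:
        # Same symptom as in the traceback: the predictions slot is empty.
        raise AttributeError("'NoneType' object has no attribute 'dtype'")
    # Toy mean squared error, only so that the function returns something.
    return sum((t - p) ** 2 for t, p in zip(y, y_pred)) / len(y)


def call_as_compiled_loss(loss_fn, y_true, y_pred):
    """A callable given as `loss=` is invoked positionally as loss_fn(y_true, y_pred)."""
    return loss_fn(y_true, y_pred)


labels, predictions = [1.0, 0.0], [0.8, 0.1]

try:
    call_as_compiled_loss(compute_loss_like, labels, predictions)
    outcome = "no error"
except AttributeError as err:
    outcome = str(err)
print(outcome)

# With the arguments bound to the intended parameters, the same function works.
print(compute_loss_like(y=labels, y_pred=predictions))
```

If this reading is right, a possible way forward (not run in this notebook) would be to pass a Keras loss object such as `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` to `compile`, and to batch `train_dataset` (for example with `.batch(16)`) before calling `fit`.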


In [None]:
# Step 4: Evaluate the model's predictions
#model.predict()
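
Since the model was never trained, there are no predictions to evaluate. As a sketch of what this step would involve, assuming the classifier returns one row of logits per article, the predicted class is the index of the largest logit and accuracy is the share of matches with the encoded labels. The logits and labels below are made up, and `argmax_labels` and `accuracy` are hypothetical helpers.

```python
from typing import List, Sequence


def argmax_labels(logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the largest logit in each row, i.e. the predicted class id."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true class ids."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up logits for 4 articles and 3 classes, for illustration only.
logits = [[2.1, 0.3, -1.0], [0.2, 0.1, 1.5], [-0.5, 0.9, 0.4], [1.0, 1.2, 0.1]]
y_true = [0, 2, 1, 0]
y_pred = argmax_labels(logits)
print(y_pred, accuracy(y_true, y_pred))  # [0, 2, 1, 1] 0.75
```

The predicted class ids could then be mapped back to category names with `le.inverse_transform`.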