# Fine-tuning a sentiment analysis model
The goal of this notebook is to fine-tune an NLP model (BERT) so that it specializes in analyzing the sentiment conveyed by a sentence.
To do so, we use a labelled dataset stored in `data/01_raw/amazon_cells_labelled.txt`.

In [2]:
%pip install -r ../src/requirements.txt

Ignoring ipython: markers 'python_version < "3.8"' don't match your environment
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Data processing
import pandas as pd
import numpy as np

In [3]:
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from sklearn.metrics import accuracy_score

2023-10-03 18:20:36.926816: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-03 18:20:36.952656: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-03 18:20:37.245962: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-03 18:20:37.249036: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
dataset_amazon = pd.read_csv('../data/01_raw/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])
print(dataset_amazon.head())
print(dataset_amazon.info())

                                              review  label
0  So there is no way for me to plug it in here i...      0
1                        Good case, Excellent value.      1
2                             Great for the jawbone.      1
3  Tied to charger for conversations lasting more...      0
4                                  The mic is great.      1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  1000 non-null   object
 1   label   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
None
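The file is tab-separated, with one review and one 0/1 label per line, which is why the cell above passes `sep='\t'`. As a minimal sketch of that layout, the snippet below parses three of the complete rows shown in the output with the standard `csv` module and counts the labels. The in-memory `sample` string stands in for the real file, and `parse_reviews` is a hypothetical helper that is not part of the notebook.

```python
import csv
import io
from collections import Counter

# Three complete rows copied from the head() output above, written in the
# file's tab-separated "review<TAB>label" layout (a stand-in for the real file).
sample = (
    "Good case, Excellent value.\t1\n"
    "Great for the jawbone.\t1\n"
    "The mic is great.\t1\n"
)


def parse_reviews(handle):
    """Return (review, label) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps any quote characters in the review text as-is
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(review, int(label)) for review, label in reader]


rows = parse_reviews(io.StringIO(sample))
label_counts = Counter(label for _, label in rows)
print(rows[0])
print(label_counts)
```

Counting the labels over the whole file in the same way is a quick check that the two classes are reasonably balanced before training.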


## Training phase
The goal here is to load a pre-trained model together with its tokenizer, then fine-tune the model on the dataset in `data/01_raw/amazon_cells_labelled.txt`.

Train/test split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(dataset_amazon['review'],
                                                    dataset_amazon['label'],
                                                    test_size = 0.25, 
                                                    random_state = 42)

print(f'Train: {len(X_train)}')
print(f'Test: {len(X_test)}')

Train: 750
Test: 250
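With `test_size = 0.25` and 1000 rows, the split holds out 250 reviews and keeps 750 for training, and fixing `random_state` makes the shuffle reproducible. The sketch below illustrates the idea in plain Python. `simple_split` is a hypothetical helper built on `random.Random`, so it will not reproduce scikit-learn's exact row selection, only the sizes and the determinism.

```python
import math
import random


def simple_split(items, test_size=0.25, seed=42):
    """Shuffle a copy of items with a fixed seed and split off a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


indices = list(range(1000))  # stand-in for the 1000 review rows
train, test = simple_split(indices)
print(len(train), len(test))  # 750 250

# Same seed, same split; train and test never share a row.
assert simple_split(indices) == (train, test)
assert set(train).isdisjoint(test)
```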


Load a BERT tokenizer and use it to preprocess the dataset

In [6]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
preprocessed_train = tokenizer(X_train.to_list(), return_tensors="np", padding=True)
preprocessed_test = tokenizer(X_test.to_list(), return_tensors="np", padding=True)

# Convert the labels to NumPy arrays
labels_train = np.array(y_train)
labels_test = np.array(y_test)
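`padding=True` pads every tokenized review to the length of the longest one in the same call and returns an `attention_mask` that marks real tokens with 1 and padding with 0. Because the training and test sets are tokenized in separate calls, their padded lengths can differ. The sketch below mimics that behaviour on made-up token ids; the ids, the `pad_batch` helper and the pad id of 0 are illustrative assumptions, not the tokenizer's actual output.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length and build masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two reviews of different lengths
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```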

Load a pre-trained model and compile it 

In [7]:
# Load pre-trained model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
loss_function = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=Adam(5e-6), loss=loss_function, metrics=['accuracy'])

2023-10-03 18:20:46.115492: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-10-03 18:20:46.118149: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably T
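Because the model outputs raw logits rather than probabilities, the loss is created with `from_logits=True`, so the softmax is applied inside the loss. For one example with integer label `y` and logits `z`, the value is `-log(softmax(z)[y])`. A minimal pure-Python sketch of that per-example formula follows; the helper name `sparse_ce_from_logits` is hypothetical, and Keras additionally averages the result over the batch.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Subtract the max logit for numerical stability (log-sum-exp trick)
    shifted = [z - max(logits) for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[label]


# Equal logits (an undecided classifier): the loss is log(2), about 0.693
print(sparse_ce_from_logits([0.0, 0.0], label=1))
# Confident and correct: loss close to 0
print(sparse_ce_from_logits([-4.0, 4.0], label=1))
# Confident and wrong: large loss
print(sparse_ce_from_logits([4.0, -4.0], label=1))
```

A freshly initialized `classifier` head is expected to start near the first case, which gives a rough baseline for the loss reported at the beginning of training.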

Train the model on the new dataset

In [8]:
# Fit the model
model.fit(dict(preprocessed_train), 
          labels_train, 
          validation_data=(dict(preprocessed_test), labels_test),
          batch_size=4, 
          epochs=1)



<keras.callbacks.History at 0x7fd271debc10>
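With 750 training reviews and `batch_size=4`, one epoch corresponds to 188 optimizer steps, the last batch holding only 2 reviews. This assumes Keras keeps the final partial batch, which is its default behaviour for in-memory arrays. The arithmetic, as a quick check:

```python
import math

n_train = 750    # size of the training split printed earlier
batch_size = 4   # value passed to model.fit

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)  # 188 2
```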

In [9]:
%pwd
%cd ..
%pwd

/home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline


'/home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline'

In [10]:
%pwd

'/home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline'

In [23]:
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Load the catalog configuration
config_loader = ConfigLoader("conf/")
print(config_loader.values)
catalog_config = config_loader.get("catalog*")

# Build a catalog from the loaded configuration
catalog = DataCatalog.from_config(catalog_config)

# Save the fine-tuned model through the catalog
catalog.save("bert_model", model)
# Load the artifact back
artifact = catalog.load("bert_model")
#data = artifact.load()

<bound method Mapping.values of ConfigLoader(conf_source=conf/, env=None, config_patterns={'catalog': ['catalog*', 'catalog*/**', '**/catalog*'], 'parameters': ['parameters*', 'parameters*/**', '**/parameters*'], 'credentials': ['credentials*', 'credentials*/**', '**/credentials*'], 'logging': ['logging*', 'logging*/**', '**/logging*']})>
data/06_models/bert_trained_model


Some layers from the model checkpoint at data/06_models/bert_trained_model were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at data/06_models/bert_trained_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [12]:
model.save_pretrained("/home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline/data/06_models/")

In [13]:
test = TFAutoModelForSequenceClassification.from_pretrained("/home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline/data/06_models/")

Some layers from the model checkpoint at /home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline/data/06_models/ were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at /home/adrien/Documents/pro/challenge_serving_api/kedro-pipeline/data/06_models/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassifi

In [14]:
test

<transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification at 0x7fd16422aa10>

Evaluate the model's performance

In [21]:
# Predict with the in-memory fine-tuned model
test_predictions = model.predict(dict(preprocessed_test))['logits']
test_probabilities = tf.nn.softmax(test_predictions)
test_predictions_class = np.argmax(test_probabilities, axis=1)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, test_predictions_class)



0.912

In [22]:
# Predict with the model reloaded from the Kedro catalog
test_predictions = artifact.predict(dict(preprocessed_test))['logits']
test_probabilities = tf.nn.softmax(test_predictions)
test_predictions_class = np.argmax(test_probabilities, axis=1)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, test_predictions_class)



0.912
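The evaluation cells turn logits into probabilities with a softmax, take the most probable class, and compare it with the true labels. The sketch below replays those steps in plain Python on made-up logits; the values and helper names are illustrative only. Since softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same classes. For reference, an accuracy of 0.912 on the 250 test reviews corresponds to 228 correct predictions.

```python
import math


def softmax(logits):
    """Convert one row of logits to probabilities."""
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up logits for four reviews and their true labels
logits = [[2.0, -1.0], [-0.5, 1.5], [0.3, 0.1], [-2.0, 0.5]]
y_true = [0, 1, 1, 1]

y_pred = [argmax(softmax(row)) for row in logits]
print(y_pred)                    # [0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75

# Softmax is monotonic, so skipping it leaves the predicted classes unchanged
assert y_pred == [argmax(row) for row in logits]
```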

Save the tokenizer and the trained model to a local folder

In [12]:
tokenizer.save_pretrained('out/')
model.save_pretrained('out/')