# Introduction

## Notebook Objective

The goal of this notebook is to build and evaluate sentiment-analysis models using **BERT** (Bidirectional Encoder Representations from Transformers), while comparing different versions of the text data: with and without preprocessing (lemmatization and stemming). As a powerful pretrained model, BERT lets us capture the contextual relationships between the words of a text. In parallel, we compare the performance of CNN and LSTM models trained on raw and preprocessed texts.

## Overview of the BERT Method

BERT is a natural language processing model built on the **Transformer** architecture, which lets it capture the relationships between words bidirectionally. BERT is particularly effective for tasks such as sentiment analysis because it takes the context of each word in the sentence into account, unlike static embedding models such as Word2Vec, which assign each word a single vector regardless of the sentence it appears in.

### BERT (Bidirectional Encoder Representations from Transformers)

BERT uses the context both before and after a word to build rich contextual representations. This is why it performs well on semantic understanding tasks such as sentiment classification, where the interpretation of a word depends on its overall context.
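
To make the idea of a contextual representation concrete, here is a toy sketch in NumPy. It is not BERT: the vocabulary, the random static vectors and the single attention step without learned weights are all assumptions chosen for illustration. It only shows that when every token attends to both its left and right neighbours, the same word ends up with different vectors in different sentences, whereas a static lookup always returns the same vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary with random static vectors (a stand-in for static embeddings).
vocab = ["the", "bank", "river", "money", "by", "holds"]
static = {word: rng.normal(size=8) for word in vocab}

def contextualize(tokens):
    """One unmasked self-attention step: each token mixes in all others, left and right."""
    X = np.stack([static[t] for t in tokens])        # (seq_len, dim)
    scores = X @ X.T / np.sqrt(X.shape[1])           # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over the whole sentence
    return weights @ X                               # context-mixed vectors

sent_a = ["the", "bank", "by", "the", "river"]
sent_b = ["the", "bank", "holds", "the", "money"]

bank_a = contextualize(sent_a)[1]   # "bank" in the first sentence
bank_b = contextualize(sent_b)[1]   # "bank" in the second sentence

print("distance between the two contextual 'bank' vectors:", np.linalg.norm(bank_a - bank_b))
```

The static vector for "bank" is identical in both sentences by construction; only the context-mixed vectors differ.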

### Comparison with and without Preprocessing (Lemmatization and Stemming)

In this notebook, we use both:
- **Preprocessed texts** (lemmatization and stemming): These techniques reduce words to their base form or root in order to lower linguistic variability. This simplifies the sentences before they are passed to BERT.
- **Raw texts**: No linguistic preprocessing is applied, which leaves BERT free to analyze the words in their original form and capture the nuances of the language.

The goal is to compare the impact of preprocessing on the performance of the BERT model.
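
The difference between the two preprocessing techniques can be sketched with a deliberately tiny example. The suffix list and the lemma dictionary below are hypothetical and only serve the illustration; a real pipeline would rely on a library stemmer and lemmatizer. The point is that stemming cuts suffixes mechanically and may produce non-words, while lemmatization maps a word to its dictionary form.

```python
# Toy illustration only: both the suffix list and the lemma table are made up.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"was": "be", "better": "good", "studies": "study", "running": "run"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form when known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

tokens = ["studies", "running", "was", "better"]
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print(stems)   # ['studi', 'runn', 'was', 'better']
print(lemmas)  # ['study', 'run', 'be', 'good']
```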

## Notebook Outline

1. **Loading and Preparing the Data**: We load the data, clean it, tokenize it, and prepare it in three formats: raw, lemmatized, and stemmed.
2. **Tokenization with BERT**: We use the BERT tokenizer to convert the three versions of the texts into tokens suited to the BERT model.
3. **Using BERT**: We extract contextual embeddings from the BERT model for each version of the texts.
4. **Defining the CNN model**: Here we define the CNN model, with a grid search and MLflow logging.
5. **Training the CNN model**: Here we train the CNN models.
6. **Defining the LSTM model**: Here we define the LSTM model, with a grid search and MLflow logging.
7. **Training the LSTM model**: Here we train the LSTM models [...] in a prediction API.


# 1. Importing Libraries

In [9]:
import numpy as np
import pandas as pd
import mlflow
import mlflow.keras
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import BertTokenizer, TFBertModel


In [11]:
# Load the data (assuming the cleaned columns already exist)
data = pd.read_csv('../data/database_p7_rework.csv')

# Label transformation: 0 stays 0 and 4 becomes 1
data['target_binary'] = data['target'].apply(lambda x: 0 if x == 0 else 1)

# Check the transformation
print(data['target_binary'].value_counts())

# Then define y as follows:
y = data['target_binary']

0    800000
1    798315
Name: target_binary, dtype: int64
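
The lambda above maps every non-zero label to 1, so an unexpected value (for example a neutral class) would silently become positive. A stricter variant, sketched here on a made-up list rather than on the notebook's DataFrame, uses an explicit mapping and fails loudly on unknown labels. The names `LABEL_MAP` and `binarize` are hypothetical.

```python
from collections import Counter

LABEL_MAP = {0: 0, 4: 1}  # the mapping described in the cell above

def binarize(labels):
    """Map raw labels to {0, 1}, raising on any value outside the mapping."""
    unknown = set(labels) - set(LABEL_MAP)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return [LABEL_MAP[label] for label in labels]

raw = [0, 4, 4, 0, 4]          # made-up sample
binary = binarize(raw)
print(Counter(binary))
```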


In [25]:
# Draw a balanced sample of 16,000 rows (8,000 per class)
sample_data = data.groupby('target_binary', group_keys=False).apply(lambda x: x.sample(8000, random_state=42))
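
The same balanced-sampling idea can be checked in isolation. The sketch below works on a made-up label array with NumPy instead of the notebook's DataFrame; `balanced_indices` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def balanced_indices(labels, per_class, seed=42):
    """Return shuffled row indices containing exactly `per_class` rows of each label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    picked = [
        rng.choice(np.flatnonzero(labels == value), size=per_class, replace=False)
        for value in np.unique(labels)
    ]
    return rng.permutation(np.concatenate(picked))

labels = np.array([0] * 50 + [1] * 30)     # made-up, imbalanced labels
idx = balanced_indices(labels, per_class=10)
print(np.bincount(labels[idx]))            # class counts in the sample
```

Sampling without replacement per class guarantees that no row appears twice and that both classes contribute the same number of rows.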

# 2. Loading BERT & Tokenization

In [54]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
import time

# Initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Use the GPU if available; eval() disables dropout so the embeddings are deterministic
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Encode texts with BERT in batches, printing a per-batch timer and an estimate of the time remaining
def encode_with_bert_batch(texts, batch_size=32, max_length=25):
    all_embeddings = []
    total_batches = len(texts) // batch_size + int(len(texts) % batch_size > 0)
    
    start_time = time.time()
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        
        # Time the current batch
        batch_start_time = time.time()
        
        # Pad every batch to max_length so that all embeddings share the shape (max_length, 768)
        inputs = tokenizer(batch_texts, return_tensors='pt', padding='max_length', truncation=True, max_length=max_length).to(device)
        # Gradients are not needed for feature extraction
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.cpu().numpy()
        all_embeddings.extend(embeddings)
        
        # Elapsed time for this batch and estimated time remaining
        batch_time = time.time() - batch_start_time
        elapsed_time = time.time() - start_time
        processed = min(i + batch_size, len(texts))
        estimated_total_time = (elapsed_time / processed) * len(texts)
        remaining_time = estimated_total_time - elapsed_time
        
        print(f"Lot {i//batch_size + 1}/{total_batches} traité en {batch_time:.2f} secondes. Temps restant estimé : {remaining_time:.2f} secondes.")
    
    return np.array(all_embeddings)
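
The CNN and LSTM defined later expect inputs of fixed shape (25, 768). When a tokenizer pads each batch only to its own longest sequence, the sequence axis can differ from batch to batch and the per-batch arrays no longer stack into one tensor. The sketch below illustrates the shape problem on made-up arrays; `pad_to_fixed_length` is a hypothetical helper, and zero-padding after the fact is only an approximation of what BERT would output for padding tokens.

```python
import numpy as np

MAX_LEN, DIM = 25, 768   # sequence length and embedding size used by the downstream models

def pad_to_fixed_length(batch, max_len=MAX_LEN):
    """Zero-pad (or truncate) a (n, seq_len, dim) array along the sequence axis."""
    n, seq_len, dim = batch.shape
    out = np.zeros((n, max_len, dim), dtype=batch.dtype)
    keep = min(seq_len, max_len)
    out[:, :keep, :] = batch[:, :keep, :]
    return out

# Two made-up batches whose longest sequences differ (12 and 19 tokens).
batch_a = np.ones((4, 12, DIM), dtype=np.float32)
batch_b = np.ones((4, 19, DIM), dtype=np.float32)

features = np.concatenate([pad_to_fixed_length(b) for b in (batch_a, batch_b)])
print(features.shape)   # (8, 25, 768)
```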





In [56]:
# Apply BERT to the cleaned texts (column `text_cleaned`) with batching and a timer
bert_features = encode_with_bert_batch(sample_data['text_cleaned'].tolist())

Lot 1/500 traité en 0.54 secondes. Temps restant estimé : 270.63 secondes.
Lot 2/500 traité en 0.44 secondes. Temps restant estimé : 243.38 secondes.
Lot 3/500 traité en 0.39 secondes. Temps restant estimé : 226.32 secondes.
Lot 4/500 traité en 0.37 secondes. Temps restant estimé : 215.53 secondes.
Lot 5/500 traité en 0.35 secondes. Temps restant estimé : 206.99 secondes.
Lot 6/500 traité en 0.39 secondes. Temps restant estimé : 204.39 secondes.
Lot 7/500 traité en 0.38 secondes. Temps restant estimé : 201.36 secondes.
Lot 8/500 traité en 0.38 secondes. Temps restant estimé : 199.23 secondes.
Lot 9/500 traité en 0.41 secondes. Temps restant estimé : 199.31 secondes.
Lot 10/500 traité en 0.40 secondes. Temps restant estimé : 198.56 secondes.
Lot 11/500 traité en 0.46 secondes. Temps restant estimé : 200.62 secondes.
Lot 12/500 traité en 0.42 secondes. Temps restant estimé : 200.46 secondes.
Lot 13/500 traité en 0.46 secondes. Temps restant estimé : 201.97 secondes.
Lot 14/500 traité en 

In [57]:
# Split the BERT features into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    bert_features, sample_data['target_binary'], test_size=0.2, random_state=42)
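
The training functions later in this notebook pass the test set as `validation_data`, which means early stopping and model selection both see the data used for the final score. One possible remedy, sketched here under the assumption of a 70/10/20 split (the fractions and the helper name `three_way_split` are illustrative, not the notebook's), is to carve out a separate validation set.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.1, test_frac=0.2, seed=42):
    """Shuffle indices once and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(16_000)
print(len(train_idx), len(val_idx), len(test_idx))   # 11200 1600 3200
```

The validation indices would then feed `validation_data`, leaving the test indices untouched until the final evaluation.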

# 3. Building the CNN Model
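
Before the Keras definition, it may help to see what a 1D convolution followed by global max pooling does to one sequence of BERT embeddings. The NumPy sketch below uses random, untrained weights and omits the bias and batch normalization, so it only illustrates the shapes: a kernel of size 5 sliding over 25 positions yields 25 - 5 + 1 = 21 positions, and pooling keeps one maximum per filter.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """x: (seq_len, dim); kernels: (kernel_size, dim, num_filters) -> (seq_len - k + 1, num_filters)."""
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = x[t:t + k]                                   # (k, dim) slice of the sequence
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)                               # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 768))                    # one text: 25 tokens x 768 BERT dimensions
kernels = rng.normal(size=(5, 768, 128)) * 0.01   # kernel_size=5, num_filters=128 (random weights)

feature_map = conv1d_valid(x, kernels)
pooled = feature_map.max(axis=0)                  # global max pooling over the sequence axis
print(feature_map.shape, pooled.shape)            # (21, 128) (128,)
```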

In [59]:
import mlflow
import mlflow.keras
import time
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Dropout, BatchNormalization, LeakyReLU, PReLU
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import ParameterGrid

# Build the CNN model
# Note: max_length is not used; the input length is fixed at 25 to match the max_length of the BERT tokenizer
def create_cnn_model(max_length=100, num_filters=128, kernel_size=5, dropout_rate=0.2, activation_type='relu', embedding_dim=768):
    model = Sequential()

    # Convolution layer, followed by batch normalization and global max pooling
    model.add(Conv1D(num_filters, kernel_size=kernel_size, activation='relu', input_shape=(25, embedding_dim)))

    model.add(BatchNormalization())
    model.add(GlobalMaxPooling1D())
    
    # Dense layer with the activation selected by activation_type
    if activation_type == 'relu':
        model.add(Dense(128, activation='relu'))
    elif activation_type == 'leaky_relu':
        model.add(Dense(128))
        model.add(LeakyReLU(alpha=0.1))
    elif activation_type == 'prelu':
        model.add(Dense(128))
        model.add(PReLU())

    model.add(Dropout(dropout_rate))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

# Train and log a CNN model for every grid-search combination, including the advanced activations
def train_and_log_cnn(X_train, y_train, X_test, y_test, experiment_name, param_grid, max_length=100, embedding_dim=768):
    mlflow.set_experiment(experiment_name)
    
    best_model = None
    best_accuracy = 0
    best_params = None

    # Iterate over every hyperparameter combination
    for params in ParameterGrid(param_grid):
        num_filters = params['num_filters']
        kernel_size = params['kernel_size']
        dropout_rate = params['dropout_rate']
        activation_type = params['activation_type']

        with mlflow.start_run(run_name=f"CNN_filters={num_filters}_kernel={kernel_size}_dropout={dropout_rate}_activation={activation_type}"):

            # Build the CNN model with the current hyperparameters
            model = create_cnn_model(max_length=max_length, num_filters=num_filters, kernel_size=kernel_size, dropout_rate=dropout_rate, activation_type=activation_type, embedding_dim=embedding_dim)

            # Early stopping to limit overfitting, plus learning-rate reduction when the validation loss plateaus
            early_stopping = EarlyStopping(monitor='val_loss', patience=4)
            reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)

            # Train the model
            start_time = time.time()
            history = model.fit(X_train, y_train, epochs=20, batch_size=64, validation_data=(X_test, y_test), callbacks=[early_stopping, reduce_lr], verbose=1)
            training_time = time.time() - start_time

            # Predictions and evaluation (predict once, then threshold at 0.5)
            y_pred_proba = model.predict(X_test)
            y_pred = (y_pred_proba > 0.5).astype("int32")
            accuracy = accuracy_score(y_test, y_pred)
            auc_score = roc_auc_score(y_test, y_pred_proba)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)

            # Confusion matrix
            conf_matrix = confusion_matrix(y_test, y_pred)

            # Log the hyperparameters to MLflow
            mlflow.log_param("num_filters", num_filters)
            mlflow.log_param("kernel_size", kernel_size)
            mlflow.log_param("dropout_rate", dropout_rate)
            mlflow.log_param("activation_type", activation_type)

            # Log the metrics to MLflow
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("auc", auc_score)
            mlflow.log_metric("precision", precision)
            mlflow.log_metric("recall", recall)
            mlflow.log_metric("f1_score", f1)
            mlflow.log_metric("training_time", training_time)

            # Save the model with MLflow
            mlflow.keras.log_model(model, f"cnn_model_{num_filters}_{kernel_size}_{dropout_rate}")

            # Save and log the confusion matrix (assumes the ./matrice/ directory already exists)
            plt.figure(figsize=(6, 4))
            sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
            plt.xlabel('Predictions')
            plt.ylabel('Ground truth')
            plt.title("Confusion Matrix - CNN")
            conf_matrix_path = f"./matrice/confusion_matrix_cnn_filters={num_filters}_kernel={kernel_size}_dropout={dropout_rate}.png"
            plt.savefig(conf_matrix_path)
            mlflow.log_artifact(conf_matrix_path)
            plt.close()  # Close the figure so it is not displayed in the notebook

            # Save and log the ROC curve
            fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
            plt.figure()
            plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc_score:.2f})")
            plt.xlabel("False Positive Rate")
            plt.ylabel("True Positive Rate")
            plt.title("ROC Curve")
            plt.legend(loc="best")
            roc_curve_path = f"./matrice/roc_curve_cnn_filters={num_filters}_kernel={kernel_size}_dropout={dropout_rate}.png"
            plt.savefig(roc_curve_path)
            mlflow.log_artifact(roc_curve_path)
            plt.close()  # Close the figure so it is not displayed in the notebook

            # Keep track of the best model so far
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = params
                best_model = model

    print(f"Meilleurs paramètres : {best_params} avec une accuracy de {best_accuracy:.4f}")

    # Return the best model
    return best_model, best_params

# Grid search for the CNN model, including the additional hyperparameters
param_grid_cnn = {
    'num_filters': [64, 128, 256],   # Numbers of filters to test
    'kernel_size': [3, 5],        # Kernel sizes to test
    'dropout_rate': [0.2, 0.5],      # Dropout rates to test
    'activation_type': ['relu', 'leaky_relu', 'prelu']  # Activations to test
}
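
The size of this grid determines how many models get trained: 3 x 2 x 2 x 3 = 36 combinations. The sketch below enumerates them with the standard library instead of scikit-learn; sorting the keys mirrors, to the best of my knowledge, the order in which `ParameterGrid` reports parameters.

```python
from itertools import product

grid = {
    "num_filters": [64, 128, 256],
    "kernel_size": [3, 5],
    "dropout_rate": [0.2, 0.5],
    "activation_type": ["relu", "leaky_relu", "prelu"],
}

keys = sorted(grid)   # alphabetical key order
combinations = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

print(len(combinations))   # 36
print(combinations[0])
```

With up to 20 epochs per combination, the grid size is the main driver of total training time, so it is worth computing before launching the search.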

# 4. Training the CNN Model

In [62]:
# Train the CNN model with a grid search on the BERT embeddings
best_model, best_params = train_and_log_cnn(
    X_train, y_train, X_test, y_test, 
    experiment_name="BERT_CNN_Experiment", 
    param_grid=param_grid_cnn, 
    max_length=100, 
    embedding_dim=768  # Embedding dim from BERT
)

print(f"Meilleur modèle entraîné avec les paramètres : {best_params}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmprgh8j560\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpu3ip3908\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpqg1z67oz\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpmiauf3zf\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpidecqcs8\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpf_qht_qp\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp5pmgvel8\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmps643z31c\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpye7tnzhg\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpba42k_8v\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmplvfajqzv\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmptp37l6cw\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp5wwbfrb_\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp3zbpf1cd\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpk_h9j31u\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp6i2itfcm\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpcotln3sg\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpswjbiuyi\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpqwtr9_sk\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp4als8amt\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpfk1bakr7\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp8r6jy9cl\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpzh_3u2x1\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmppvxq5s3f\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpc0utoql4\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp5hq1z7vd\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpl88p1wtp\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp00hxm293\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpe4uh1ynv\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpx8arbwpr\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpbrygvdzm\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpkzevafek\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpuhkcwtg1\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpvk5x1ipi\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpk8q3upzq\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp2li4g0te\model\data\model\assets




Meilleurs paramètres : {'activation_type': 'prelu', 'dropout_rate': 0.5, 'kernel_size': 3, 'num_filters': 128} avec une accuracy de 0.7791
Meilleur modèle entraîné avec les paramètres : {'activation_type': 'prelu', 'dropout_rate': 0.5, 'kernel_size': 3, 'num_filters': 128}
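
The runs logged above stop after 5 to 9 of the 20 allowed epochs, which is the effect of `EarlyStopping(monitor='val_loss', patience=4)`. The pure-Python sketch below reproduces the patience rule on made-up validation losses; `stopping_epoch` is a hypothetical helper that mimics the callback's counting logic, not a Keras API.

```python
def stopping_epoch(val_losses, patience=4):
    """Return the 1-based epoch at which training halts under patience-based early stopping."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0       # improvement: reset the counter
        else:
            wait += 1                  # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: best value at epoch 2, then no improvement.
losses = [0.52, 0.48, 0.49, 0.50, 0.51, 0.53, 0.55, 0.60]
print(stopping_epoch(losses))   # 6
```

Under this rule a run that stops at epoch 5 reached its best validation loss at epoch 1, which suggests those models start overfitting almost immediately.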


# 5. Building the LSTM Model
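
The model in this section wraps its LSTM in `Bidirectional`, which by default concatenates the outputs of a forward and a backward pass. The NumPy sketch below shows the resulting shape with a plain tanh recurrent cell standing in for the LSTM gates; the weights are random and untrained, and the cell is a deliberate simplification rather than an LSTM.

```python
import numpy as np

def simple_rnn_last_state(x, W_in, W_rec):
    """Run a plain tanh RNN over x (seq_len, dim) and return the final hidden state."""
    h = np.zeros(W_rec.shape[0])
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_rec @ h)
    return h

rng = np.random.default_rng(0)
seq_len, dim, units = 25, 768, 64
x = rng.normal(size=(seq_len, dim))

# Separate (random, untrained) weights for each reading direction.
fwd = (rng.normal(size=(units, dim)) * 0.01, rng.normal(size=(units, units)) * 0.01)
bwd = (rng.normal(size=(units, dim)) * 0.01, rng.normal(size=(units, units)) * 0.01)

h_forward = simple_rnn_last_state(x, *fwd)          # reads tokens left to right
h_backward = simple_rnn_last_state(x[::-1], *bwd)   # reads tokens right to left
summary = np.concatenate([h_forward, h_backward])
print(summary.shape)   # (128,)
```

With `lstm_units=64` the layer therefore feeds a 128-dimensional vector to the layers that follow.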

In [84]:
import mlflow
import mlflow.keras
import time
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, BatchNormalization, LeakyReLU, PReLU,  Bidirectional
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import ParameterGrid



# Build the LSTM model
def create_lstm_model(input_shape, lstm_units=128, dropout_rate=0.2, activation_type='relu'):
    model = Sequential()

    # Bidirectional LSTM layer (the outputs of both directions are concatenated)
    model.add(Bidirectional(LSTM(lstm_units, return_sequences=False, input_shape=input_shape)))
    
    # Batch normalization after the LSTM layer
    model.add(BatchNormalization())

    # Activation: LeakyReLU or PReLU is applied directly; any other value adds a Dense(128) layer with that activation
    if activation_type == 'leakyrelu':
        model.add(LeakyReLU())
    elif activation_type == 'prelu':
        model.add(PReLU())
    else:
        model.add(Dense(128, activation=activation_type))

    # Dropout to limit overfitting
    model.add(Dropout(dropout_rate))
    
    # Dense layer
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout_rate))
    
    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

# Train and log an LSTM model for every grid-search combination, with the advanced metrics
def train_and_log_lstm(X_train, y_train, X_test, y_test, experiment_name, param_grid, input_shape):
    mlflow.set_experiment(experiment_name)
    
    best_model = None
    best_accuracy = 0
    best_params = None

    # Iterate over every hyperparameter combination
    for params in ParameterGrid(param_grid):
        lstm_units = params['lstm_units']
        dropout_rate = params['dropout_rate']
        activation_type = params['activation_type']

        with mlflow.start_run(run_name=f"LSTM_units={lstm_units}_dropout={dropout_rate}_activation={activation_type}"):

            # Build the LSTM model with the current hyperparameters
            model = create_lstm_model(input_shape=input_shape, lstm_units=lstm_units, dropout_rate=dropout_rate, activation_type=activation_type)

            # Early stopping to limit overfitting
            early_stopping = EarlyStopping(monitor='val_loss', patience=4)

            # Reduce the learning rate when the validation loss plateaus
            reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)

            # Train the model
            start_time = time.time()
            history = model.fit(X_train, y_train, epochs=20, batch_size=64, validation_data=(X_test, y_test), callbacks=[early_stopping, reduce_lr], verbose=1)
            training_time = time.time() - start_time

            # Predictions and evaluation (predict once, then threshold at 0.5)
            y_pred_proba = model.predict(X_test)
            y_pred = (y_pred_proba > 0.5).astype("int32")
            accuracy = accuracy_score(y_test, y_pred)
            auc_score = roc_auc_score(y_test, y_pred_proba)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)

            # Confusion matrix
            conf_matrix = confusion_matrix(y_test, y_pred)

            # Log the hyperparameters to MLflow
            mlflow.log_param("lstm_units", lstm_units)
            mlflow.log_param("dropout_rate", dropout_rate)
            mlflow.log_param("activation_type", activation_type)

            # Log the metrics to MLflow
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("auc", auc_score)
            mlflow.log_metric("precision", precision)
            mlflow.log_metric("recall", recall)
            mlflow.log_metric("f1_score", f1)
            mlflow.log_metric("training_time", training_time)

            # Save the model with MLflow
            mlflow.keras.log_model(model, f"lstm_model_{lstm_units}_{dropout_rate}_{activation_type}")

            # Save and log the confusion matrix (assumes the ./matrice/ directory already exists)
            plt.figure(figsize=(6, 4))
            sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
            plt.xlabel('Predictions')
            plt.ylabel('Ground truth')
            plt.title("Confusion Matrix - LSTM")
            conf_matrix_path = f"./matrice/confusion_matrix_lstm_units={lstm_units}_dropout={dropout_rate}_activation={activation_type}.png"
            plt.savefig(conf_matrix_path)
            mlflow.log_artifact(conf_matrix_path)
            plt.close()  # Close the figure so it is not displayed in the notebook
            
            # Save and log the ROC curve
            fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
            plt.figure()
            plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc_score:.2f})")
            plt.xlabel("False Positive Rate")
            plt.ylabel("True Positive Rate")
            plt.title("ROC Curve")
            plt.legend(loc="best")
            roc_curve_path = f"./matrice/roc_curve_lstm_units={lstm_units}_dropout={dropout_rate}_activation={activation_type}.png"
            plt.savefig(roc_curve_path)
            mlflow.log_artifact(roc_curve_path)
            plt.close()  # Close the figure so it is not displayed in the notebook

            # Keep track of the best model so far
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = params
                best_model = model

    print(f"Meilleurs paramètres : {best_params} avec une accuracy de {best_accuracy:.4f}")

    # Return the best model
    return best_model, best_params

# Grid search for the LSTM model
param_grid_lstm = {
    'lstm_units': [64, 128, 256],      # Number of LSTM units
    'dropout_rate': [0.2, 0.5],        # Dropout rate
    'activation_type': ['relu', 'leakyrelu', 'prelu']  # Activation type
}
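
Both training functions log accuracy, precision, recall and F1 alongside the confusion matrix. As a reminder of how these quantities relate, the sketch below derives them from the four cells of a binary confusion matrix. The counts are made up, and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts, in the order scikit-learn's confusion_matrix(...).ravel() returns them.
metrics = binary_metrics(tn=40, fp=10, fn=20, tp=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On a balanced sample such as the one used here, accuracy is a reasonable selection criterion; precision and recall show whether errors lean toward false positives or false negatives.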



# 6. Training the LSTM Model

In [86]:
# Train the LSTM model with a grid search on the BERT embeddings
best_model, best_params = train_and_log_lstm(
    X_train, y_train, X_test, y_test, 
    experiment_name="BERT_LSTM_Experiment", 
    param_grid=param_grid_lstm, 
    input_shape=(25, 768)  # Sequence length and dimension of the BERT embeddings
)

print(f"Meilleur modèle entraîné avec les paramètres : {best_params}")


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp_lmbomt8\model\data\model\assets




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp73bs1cui\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp73bs1cui\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpu0_a4y6n\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpu0_a4y6n\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmppxrglbw1\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmppxrglbw1\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpd1o6cm44\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpd1o6cm44\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpm9_nv7hc\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpm9_nv7hc\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmps6f401js\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmps6f401js\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp6cjyd15k\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp6cjyd15k\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpbvcsk9qu\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpbvcsk9qu\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp1rej6qci\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp1rej6qci\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp9u6go4ju\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp9u6go4ju\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpajhzt5ul\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpajhzt5ul\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp5dp27dfq\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp5dp27dfq\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpzsbg9kkz\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpzsbg9kkz\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpjogfccky\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpjogfccky\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpy9gwht11\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpy9gwht11\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpgvtg3uph\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmpgvtg3uph\model\data\model\assets


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20




INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp68qb22xd\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\trist\AppData\Local\Temp\tmp68qb22xd\model\data\model\assets


Meilleurs paramètres : {'activation_type': 'leakyrelu', 'dropout_rate': 0.5, 'lstm_units': 64} avec une accuracy de 0.7831
Meilleur modèle entraîné avec les paramètres : {'activation_type': 'leakyrelu', 'dropout_rate': 0.5, 'lstm_units': 64}


In our case, we compare the results of the different runs in the MLflow UI.
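
The same comparison can also be made programmatically. The sketch below ranks the runs of an experiment with `mlflow.search_runs`. It assumes that the training function logged a metric named `accuracy`, which is not visible in this section, and the helper names `order_clause` and `top_runs` are hypothetical.

```python
def order_clause(metric, descending=True):
    """Build an MLflow `order_by` clause for a logged metric."""
    direction = "DESC" if descending else "ASC"
    return [f"metrics.{metric} {direction}"]


def top_runs(experiment_name, metric="accuracy", n=5):
    """Return the n best runs of an experiment as a pandas DataFrame.

    Assumption: the runs logged a metric called `metric`.
    """
    # Imported here so that `order_clause` stays usable without MLflow installed.
    import mlflow

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=order_clause(metric),
        max_results=n,
    )


# Usage (requires the tracking store written by the training cell):
# runs = top_runs("BERT_LSTM_Experiment")
# print(runs.filter(regex=r"^(params|metrics)\."))
```

The returned DataFrame has one row per run, with logged parameters in `params.*` columns and metrics in `metrics.*` columns, which makes it easy to check the ranking shown in the UI.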