Gaëlle_Genvrin_P5_4_112024

# PLAN

**1. Introduction and objectives**

Explanation of the notebook's objectives

Role of tags in supervised classification

**2. Data preparation**

<i>2.1. Tag handling</i>

The tags are the labels of our classification problem.

Each question (Title + Body) is associated with one or more tags.

<u>Possible strategies</u>:

Multi-label classification: each question can belong to several classes (tags).

(Approximate) multi-class classification: each question is restricted to a single dominant tag.

<i>2.2. Pre-cleaning</i>

Merging the Title and Body columns into a single Title_Body variable

Cleaning Tags (removing the angle-bracket markup)

Outlier handling (questions that are too long or too short): not done

Checking for class imbalance (some tags are more frequent than others): not done

**3. Transforming the data for each modelling approach (4 columns)**

transform_bow_fct → Bag-of-Words -> Word2Vec (1st column)

transform_bow_lem_fct → lemmatized Bag-of-Words -> Word2Vec (2nd column)

transform_dl_fct → Deep Learning approaches -> USE, BERT (3rd and 4th columns)

Tag processing with TF-IDF

**4. Supervised model experiments**

<i>4.1. Bag-of-Words approach</i>

TF-IDF + supervised classification model (Random Forest, Logistic Regression, SGDClassifier)

<i>4.2. Word/Sentence Embedding approaches</i>

Word2Vec (on lemmatized BoW)

BERT (fine-tuning or pre-trained embeddings)

Universal Sentence Encoder (USE)

<i>4.3. Performance comparison</i>

Precision, recall, F1-score

Tag coverage rate

Other metrics suited to the context

**5. Experiment tracking with MLFlow**

Tracking of hyperparameters and results

Centralized model storage

**6. MLOps approach and industrialization**

<i>6.1. Setting up a data and model processing pipeline</i>

Tools explored: Kedro, MLFlow Recipes

<i>6.2. Monitoring model performance in production</i>

Analysis of model drift (data drift, concept drift)

Tools explored: evidentlyAI, Prometheus, Popmon

**7. Checking model stability over time**

Monthly evaluation over 1 year

Measuring how performance evolves

**8. Conclusion and next steps**

Summary of results

Possible improvements

Future integration into an API

In [2]:
!python --version


Python 3.9.21


In [1]:
import tensorflow as tf
print(tf.__version__)

ModuleNotFoundError: No module named 'tensorflow'

In [2]:

# LIBRARY SELECTION, gathered here to limit imports in the technical scripts later on

# Data handling
import pandas as pd
import numpy as np
import os

# Text processing
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec, Doc2Vec
import spacy
from sklearn.preprocessing import LabelEncoder

# Deep learning and embeddings
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow_hub as hub
from transformers import BertTokenizer, TFBertModel, BertModel

# Supervised modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline

# MLFlow experiment tracking
import mlflow
import mlflow.sklearn
import mlflow.keras
from mlflow.models import infer_signature

# MLOps and industrialization tools
import kedro
from kedro.framework.context import KedroContext
import evidently
from evidently import ColumnMapping
import prometheus_client
import popmon

# Data preprocessing and transformations
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Evaluation and experiment management
import matplotlib.pyplot as plt
import seaborn as sns

# Other libraries
from sklearn.datasets import load_iris  # load_iris, imported for a quick test










In [3]:
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


# test

# Start an MLflow run
with mlflow.start_run():
    # Log a metric, as an example
    mlflow.log_metric("accuracy", 0.95)

# Check the logs and the run status in the MLflow UI (from a shell, run `mlflow ui`)
print("Run logged!")


In [4]:
print(mlflow.get_tracking_uri())


file:///C:/Users/ggenv/mlruns


## 1. Introduction and objectives
This project aims to build a supervised system that automatically assigns tags to texts (title + body).<br>
The goal is to suggest relevant tags so that content can be organized and indexed efficiently. Several approaches are explored, starting with classical models (TF-IDF features fed to Random Forest and Logistic Regression classifiers), then Word2Vec embeddings, and finally Deep Learning models such as BERT and USE.<br>
Experiments are tracked with MLFlow, with a focus on industrializing the model through MLOps practices and on a monthly evaluation of its stability in production.

## 2. Data preparation

**2.1. Tag handling**

The tags are the labels of our classification problem.

Each question (Title + Body) is associated with one or more tags.

In [5]:
# Path to the data file
chemin_fichier = r"C:\Users\ggenv\OneDrive\Documents\MLE\P5\P5_jeu de donnees\QueryResults3.csv"

# Read the CSV file
df0 = pd.read_csv(chemin_fichier)

# Show the first rows
df0.head()

Unnamed: 0,CreationDate,PostTypeId,Title,Body,Tags,Id,Score,ViewCount,CommentCount,AnswerCount
0,2023-01-01 00:07:53,1,Optimized way to filter and return objects in ...,<p>I have the following Movie class -</p>\n<pr...,<java><algorithm><oop><data-structures><time-c...,74972603,2,80,3,1
1,2023-01-01 01:30:29,1,LiveData observer is not removed,<p>I am trying to get <code>LiveData</code> up...,<android><kotlin><android-livedata><observer-p...,74972777,1,308,4,1
2,2023-01-01 01:38:12,1,MAUI ContentView can't inherit from custom bas...,<p>I have a ContentView called HomePageOrienta...,<inheritance><controls><code-generation><maui>...,74972784,2,1153,2,1
3,2023-01-01 01:48:00,1,My if statement is not working in React Native,<p>I want to build a search bar that filters a...,<react-native><if-statement><components><react...,74972800,1,60,0,2
4,2023-01-01 02:11:45,1,jax.lax.select vs jax.numpy.where,"<p>Was taking a look at the <a href=""https://f...",<python><numpy><machine-learning><deep-learnin...,74972850,3,1787,0,1


**We opt for multi-label classification: each question can belong to several classes (tags).**

**2.2. Pre-cleaning**


In [6]:
# Merge the Title and Body columns
df0['Title_Body'] = df0['Title'] + " " + df0['Body']

# Show the first rows as a check
df0[['Title', 'Body', 'Title_Body', 'Tags']].head()

Unnamed: 0,Title,Body,Title_Body,Tags
0,Optimized way to filter and return objects in ...,<p>I have the following Movie class -</p>\n<pr...,Optimized way to filter and return objects in ...,<java><algorithm><oop><data-structures><time-c...
1,LiveData observer is not removed,<p>I am trying to get <code>LiveData</code> up...,LiveData observer is not removed <p>I am tryin...,<android><kotlin><android-livedata><observer-p...
2,MAUI ContentView can't inherit from custom bas...,<p>I have a ContentView called HomePageOrienta...,MAUI ContentView can't inherit from custom bas...,<inheritance><controls><code-generation><maui>...
3,My if statement is not working in React Native,<p>I want to build a search bar that filters a...,My if statement is not working in React Native...,<react-native><if-statement><components><react...
4,jax.lax.select vs jax.numpy.where,"<p>Was taking a look at the <a href=""https://f...",jax.lax.select vs jax.numpy.where <p>Was takin...,<python><numpy><machine-learning><deep-learnin...


In [7]:
df0[['Title_Body', 'Tags']].head()

Unnamed: 0,Title_Body,Tags
0,Optimized way to filter and return objects in ...,<java><algorithm><oop><data-structures><time-c...
1,LiveData observer is not removed <p>I am tryin...,<android><kotlin><android-livedata><observer-p...
2,MAUI ContentView can't inherit from custom bas...,<inheritance><controls><code-generation><maui>...
3,My if statement is not working in React Native...,<react-native><if-statement><components><react...
4,jax.lax.select vs jax.numpy.where <p>Was takin...,<python><numpy><machine-learning><deep-learnin...


**Cleaning the Tags**

In [8]:
import re
pd.set_option('display.max_columns', None)

# Extract the tag names enclosed in angle brackets
def extraire_tags(tags):
    # Find everything between '<' and '>' and join the matches with spaces
    return ' '.join(re.findall(r'<(.*?)>', tags))

# Apply the function to the 'Tags' column
df0['Tags'] = df0['Tags'].apply(extraire_tags)

# Check the first rows
df0[['Tags', 'Title_Body']].head()




Unnamed: 0,Tags,Title_Body
0,java algorithm oop data-structures time-comple...,Optimized way to filter and return objects in ...
1,android kotlin android-livedata observer-patte...,LiveData observer is not removed <p>I am tryin...
2,inheritance controls code-generation maui cont...,MAUI ContentView can't inherit from custom bas...
3,react-native if-statement components react-nat...,My if statement is not working in React Native...
4,python numpy machine-learning deep-learning jax,jax.lax.select vs jax.numpy.where <p>Was takin...
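
The cleaning step above stores the tags as one space-joined string, which has to be split again before encoding. As a minimal sketch, assuming the raw tags always use the `<tag1><tag2>` form shown in the data, the same extraction can return a list directly. The helper name `extract_tag_list` is hypothetical and not part of the notebook.

```python
import re

# Hypothetical helper (not in the notebook): return the tags as a list.
TAG_PATTERN = re.compile(r"<([^<>]+)>")

def extract_tag_list(raw_tags):
    """Return the tag names found between angle brackets, in order."""
    return TAG_PATTERN.findall(raw_tags)

example = "<python><numpy><machine-learning><deep-learning><jax>"
tags = extract_tag_list(example)
print(tags)
print(" ".join(tags))  # same space-joined form as the Tags column above
```

Keeping a list avoids any ambiguity if a tag ever contained a space, at the cost of a column that is less convenient to display.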


## 3. Transforming the data for each modelling approach (4 columns)

transform_bow_fct → Bag-of-Words -> Word2Vec (1st column)

transform_bow_lem_fct → lemmatized Bag-of-Words -> Word2Vec (2nd column)

transform_dl_fct → Deep Learning approaches -> USE (3rd column)

batch_transform_bert_tf → Deep Learning approaches -> BERT (4th column)

In [9]:

# Load the spaCy model used for lemmatization
nlp = spacy.load('en_core_web_sm')

# Remove digits
def supprimer_chiffres(text):
    return re.sub(r'\d+', '', text)  # Replace each run of digits with an empty string

# Lemmatize with spaCy, dropping stop words
def lemmatiser(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc if not token.is_stop])

# Bag-of-Words (BoW) -> Word2Vec
def transform_bow_fct(text):
    # Remove digits
    text = supprimer_chiffres(text)
    
    # BoW transformation (computed here but not used by the Word2Vec step below)
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform([text]).toarray()
    
    # Train Word2Vec on the tokens of this single text
    model = Word2Vec([text.split()], vector_size=100, window=5, min_count=1, workers=4)
    word2vec_vector = model.wv[text.split()[0]]  # Vector of the first word only
    return word2vec_vector

# Lemmatized Bag-of-Words -> Word2Vec
def transform_bow_lem_fct(text):
    # Remove digits
    text = supprimer_chiffres(text)
    
    # Lemmatization and BoW transformation (the BoW matrix is not used below)
    lemmatized_text = lemmatiser(text)
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform([lemmatized_text]).toarray()
    
    # Train Word2Vec on the lemmatized tokens of this single text
    model = Word2Vec([lemmatized_text.split()], vector_size=100, window=5, min_count=1, workers=4)
    word2vec_vector = model.wv[lemmatized_text.split()[0]]  # Vector of the first word only
    return word2vec_vector
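
The two functions above train a separate Word2Vec model for each question and keep only the vector of its first word. A common alternative, not used in this notebook, is to train a single model on the whole corpus and average the word vectors of each document. The sketch below illustrates only the averaging step, under the assumption that a word-to-vector table is available. The toy table stands in for a trained `model.wv`, and `mean_document_vector` is a hypothetical helper.

```python
import numpy as np

# Toy stand-in for a trained embedding table (e.g. gensim's model.wv);
# the words and 3-dimensional vectors are made up for illustration.
toy_vectors = {
    "livedata": np.array([1.0, 0.0, 0.0]),
    "observer": np.array([0.0, 1.0, 0.0]),
    "removed": np.array([0.0, 0.0, 1.0]),
}

def mean_document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc = "livedata observer is not removed".split()
doc_vector = mean_document_vector(doc, toy_vectors, dim=3)
print(doc_vector)
```

Averaging uses every known word of the question and also handles texts that end up empty after cleaning, a case where indexing the first token would fail.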




In [10]:
# Apply the transformations to the columns
df0['BoW_Word2Vec'] = df0['Title_Body'].apply(transform_bow_fct)

In [11]:
df0['BoW_Lem_Word2Vec'] = df0['Title_Body'].apply(transform_bow_lem_fct)


**Transformation with the Universal Sentence Encoder (USE)**

In [13]:
# Load the USE model
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Encode the texts batch by batch
def batch_transform(texts, batch_size=1000):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_embeddings = use_model(batch).numpy()
        embeddings.extend(batch_embeddings)
    return embeddings

# Clean the texts
df0['Title_Body_clean'] = df0['Title_Body'].apply(supprimer_chiffres)

# Apply the batched transformation
df0['DL_USE'] = list(batch_transform(df0['Title_Body_clean'].tolist()))
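
The batching logic above does not depend on the encoder itself. As a rough sketch, it can be isolated and checked with a stand-in encoder before the real model is loaded. `iter_batches`, `encode_in_batches` and `fake_encoder` are hypothetical names. The fake encoder only returns the text length, to show that order and count are preserved.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` of at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_in_batches(texts, encoder, batch_size=2):
    """Apply `encoder` to each batch and concatenate the results in order."""
    embeddings = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(encoder(batch))
    return embeddings

def fake_encoder(batch):
    # Stand-in for a sentence encoder: one 1-dimensional "embedding" per text.
    return [[float(len(text))] for text in batch]

texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = encode_in_batches(texts, fake_encoder, batch_size=2)
print(len(vectors), vectors)
```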








In [14]:
df0

Unnamed: 0,CreationDate,PostTypeId,Title,Body,Tags,Id,Score,ViewCount,CommentCount,AnswerCount,Title_Body,BoW_Word2Vec,BoW_Lem_Word2Vec,Title_Body_clean,DL_USE
0,2023-01-01 00:07:53,1,Optimized way to filter and return objects in ...,<p>I have the following Movie class -</p>\n<pr...,java algorithm oop data-structures time-comple...,74972603,2,80,3,1,Optimized way to filter and return objects in ...,"[0.0034893542, 0.007686467, 0.005897363, 0.008...","[0.00919876, 0.0013405064, -0.004493543, 0.006...",Optimized way to filter and return objects in ...,"[-0.016126204, -0.06459423, 0.062102184, 0.007..."
1,2023-01-01 01:30:29,1,LiveData observer is not removed,<p>I am trying to get <code>LiveData</code> up...,android kotlin android-livedata observer-patte...,74972777,1,308,4,1,LiveData observer is not removed <p>I am tryin...,"[0.0043471786, 0.0068961945, 0.0009009706, -0....","[-0.009008026, 0.0053380113, 0.003758295, -0.0...",LiveData observer is not removed <p>I am tryin...,"[0.030374147, 0.010818745, -0.065288365, -0.06..."
2,2023-01-01 01:38:12,1,MAUI ContentView can't inherit from custom bas...,<p>I have a ContentView called HomePageOrienta...,inheritance controls code-generation maui cont...,74972784,2,1153,2,1,MAUI ContentView can't inherit from custom bas...,"[-0.004020552, 0.0025713993, 0.0016023592, -0....","[-0.009063528, 0.005615017, 0.0036862437, -0.0...",MAUI ContentView can't inherit from custom bas...,"[-0.026131395, 0.012135708, -0.0077633136, -0...."
3,2023-01-01 01:48:00,1,My if statement is not working in React Native,<p>I want to build a search bar that filters a...,react-native if-statement components react-nat...,74972800,1,60,0,2,My if statement is not working in React Native...,"[0.008811565, -0.0022609183, 0.0043626935, -0....","[-0.0022346813, 0.0075390935, -0.0030874617, 0...",My if statement is not working in React Native...,"[-0.050018243, -0.010045774, 0.008870118, -0.0..."
4,2023-01-01 02:11:45,1,jax.lax.select vs jax.numpy.where,"<p>Was taking a look at the <a href=""https://f...",python numpy machine-learning deep-learning jax,74972850,3,1787,0,1,jax.lax.select vs jax.numpy.where <p>Was takin...,"[-0.0017422211, 0.0022451014, -0.003727198, 0....","[-0.004146943, -0.009880067, -0.0071219914, -0...",jax.lax.select vs jax.numpy.where <p>Was takin...,"[-0.0408414, -0.06767161, -0.0016601549, -0.06..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2024-10-29 14:59:45,1,How to extract data using LeetCode GraphQL query,<p>I just want to know that how to print all t...,python json web-scraping python-requests graphql,79137785,2,269,1,1,How to extract data using LeetCode GraphQL que...,"[0.001136272, 0.006173037, -0.0065935263, 0.00...","[-0.0005435155, 0.00024361088, 0.005107024, 0....",How to extract data using LeetCode GraphQL que...,"[0.022740941, -0.06486706, 0.005697829, 0.0212..."
49996,2024-10-29 15:11:25,1,Z-index ignored while transitioning using the ...,<p><strong>Note: The example below uses the <a...,javascript css vue.js z-index view-transitions...,79137836,1,128,0,1,Z-index ignored while transitioning using the ...,"[0.0014509934, -0.0033614584, 0.0070376634, -0...","[-0.0030618808, -0.0051274016, 0.007299507, -0...",Z-index ignored while transitioning using the ...,"[-0.058368765, -0.043366622, 0.048300456, -0.0..."
49997,2024-10-29 15:22:44,1,Give the result string provided minimum number...,<p>A good follow up question asked in one of t...,string algorithm data-structures stack dynamic...,79137882,4,268,10,1,Give the result string provided minimum number...,"[0.009377887, 0.0073678787, 0.006877803, 0.006...","[-0.0071886983, 0.004251417, 0.0021597142, 0.0...",Give the result string provided minimum number...,"[-0.008913429, -0.06844016, 0.033410776, 0.065..."
49998,2024-10-29 15:36:53,1,How to Exclude Tagless Structs Using ASTMatcher?,<p>I'm currently working with Clang's ASTMatch...,c clang abstract-syntax-tree libtooling clang-...,79137927,2,57,4,2,How to Exclude Tagless Structs Using ASTMatche...,"[0.0038029673, 0.001572058, -0.0031528277, -0....","[0.0081225075, -0.004442059, -0.0010730232, 0....",How to Exclude Tagless Structs Using ASTMatche...,"[-0.055789653, -0.0628867, -0.029554589, -0.05..."


In [15]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())  # Check whether CUDA (GPU) is available


2.6.0+cpu
False


In [16]:
import pickle
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer
import gc

# Load the BERT model and tokenizer, forcing the TensorFlow implementation
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Transform the data batch by batch, without @tf.function
def batch_transform_bert_tf(texts, batch_size=8):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        
        # Tokenize the batch, truncating to 128 tokens to reduce memory use
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="tf", max_length=128)
        
        # Compute the BERT outputs with TensorFlow
        outputs = bert_model(**inputs)
        
        # Extract the pooled [CLS] representation (pooler_output)
        batch_embeddings = outputs.pooler_output.numpy()  # Convert to NumPy right after execution
        embeddings.extend(batch_embeddings)
        
        # Free memory
        del outputs
        gc.collect()
        
    return embeddings

# Clean the texts before the batched BERT transformation
df0['Title_Body_clean'] = df0['Title_Body'].apply(supprimer_chiffres)

# Save the embeddings to a pickle file
def save_embeddings(embeddings, filename='bert_embeddings.pkl'):
    with open(filename, 'wb') as f:
        pickle.dump(embeddings, f)

# Reuse the embeddings if they already exist on disk
try:
    with open('bert_embeddings.pkl', 'rb') as f:
        embeddings = pickle.load(f)
    print("Embeddings loaded from the pickle file.")
except FileNotFoundError:
    print("Pickle file not found, generating the embeddings...")
    embeddings = batch_transform_bert_tf(df0['Title_Body_clean'].tolist(), batch_size=8)
    save_embeddings(embeddings, 'bert_embeddings.pkl')  # Save right after generation
    print("Embeddings saved to bert_embeddings.pkl.")

# Store the BERT embeddings in the DL_BERT column
df0['DL_BERT'] = embeddings



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Embeddings loaded from the pickle file.
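
The load-or-generate logic used for the BERT embeddings can be factored into a small reusable helper. The sketch below is one possible form, with hypothetical names (`load_or_compute`, `expensive`). It uses a temporary directory and a dummy computation so that it runs without the model.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute_fn):
    """Return the object cached at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive():
    # Stand-in for the slow embedding computation; records each invocation.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "embeddings.pkl")
    first = load_or_compute(cache, expensive)   # computed and written to disk
    second = load_or_compute(cache, expensive)  # read back from disk

print(first == second, len(calls))
```

A cache of this kind does not notice when the input data changes, so the file should be deleted, or its length compared with `len(df0)`, whenever the data set is modified.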


In [17]:
df0


Unnamed: 0,CreationDate,PostTypeId,Title,Body,Tags,Id,Score,ViewCount,CommentCount,AnswerCount,Title_Body,BoW_Word2Vec,BoW_Lem_Word2Vec,Title_Body_clean,DL_USE,DL_BERT
0,2023-01-01 00:07:53,1,Optimized way to filter and return objects in ...,<p>I have the following Movie class -</p>\n<pr...,java algorithm oop data-structures time-comple...,74972603,2,80,3,1,Optimized way to filter and return objects in ...,"[0.0034893542, 0.007686467, 0.005897363, 0.008...","[0.00919876, 0.0013405064, -0.004493543, 0.006...",Optimized way to filter and return objects in ...,"[-0.016126204, -0.06459423, 0.062102184, 0.007...","[0.036435265, 0.024415137, -0.97473055, -0.132..."
1,2023-01-01 01:30:29,1,LiveData observer is not removed,<p>I am trying to get <code>LiveData</code> up...,android kotlin android-livedata observer-patte...,74972777,1,308,4,1,LiveData observer is not removed <p>I am tryin...,"[0.0043471786, 0.0068961945, 0.0009009706, -0....","[-0.009008026, 0.0053380113, 0.003758295, -0.0...",LiveData observer is not removed <p>I am tryin...,"[0.030374147, 0.010818745, -0.065288365, -0.06...","[0.04542997, -0.046744376, -0.7192131, -0.1898..."
2,2023-01-01 01:38:12,1,MAUI ContentView can't inherit from custom bas...,<p>I have a ContentView called HomePageOrienta...,inheritance controls code-generation maui cont...,74972784,2,1153,2,1,MAUI ContentView can't inherit from custom bas...,"[-0.004020552, 0.0025713993, 0.0016023592, -0....","[-0.009063528, 0.005615017, 0.0036862437, -0.0...",MAUI ContentView can't inherit from custom bas...,"[-0.026131395, 0.012135708, -0.0077633136, -0....","[-0.6552563, -0.63547015, -0.99479884, 0.75482..."
3,2023-01-01 01:48:00,1,My if statement is not working in React Native,<p>I want to build a search bar that filters a...,react-native if-statement components react-nat...,74972800,1,60,0,2,My if statement is not working in React Native...,"[0.008811565, -0.0022609183, 0.0043626935, -0....","[-0.0022346813, 0.0075390935, -0.0030874617, 0...",My if statement is not working in React Native...,"[-0.050018243, -0.010045774, 0.008870118, -0.0...","[0.07536105, 0.09974712, -0.30936754, -0.33255..."
4,2023-01-01 02:11:45,1,jax.lax.select vs jax.numpy.where,"<p>Was taking a look at the <a href=""https://f...",python numpy machine-learning deep-learning jax,74972850,3,1787,0,1,jax.lax.select vs jax.numpy.where <p>Was takin...,"[-0.0017422211, 0.0022451014, -0.003727198, 0....","[-0.004146943, -0.009880067, -0.0071219914, -0...",jax.lax.select vs jax.numpy.where <p>Was takin...,"[-0.0408414, -0.06767161, -0.0016601549, -0.06...","[-0.4247739, -0.14559332, -0.22167683, 0.17040..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2024-10-29 14:59:45,1,How to extract data using LeetCode GraphQL query,<p>I just want to know that how to print all t...,python json web-scraping python-requests graphql,79137785,2,269,1,1,How to extract data using LeetCode GraphQL que...,"[0.001136272, 0.006173037, -0.0065935263, 0.00...","[-0.0005435155, 0.00024361088, 0.005107024, 0....",How to extract data using LeetCode GraphQL que...,"[0.022740941, -0.06486706, 0.005697829, 0.0212...","[-0.66401017, -0.6139935, -0.9873576, 0.562386..."
49996,2024-10-29 15:11:25,1,Z-index ignored while transitioning using the ...,<p><strong>Note: The example below uses the <a...,javascript css vue.js z-index view-transitions...,79137836,1,128,0,1,Z-index ignored while transitioning using the ...,"[0.0014509934, -0.0033614584, 0.0070376634, -0...","[-0.0030618808, -0.0051274016, 0.007299507, -0...",Z-index ignored while transitioning using the ...,"[-0.058368765, -0.043366622, 0.048300456, -0.0...","[-0.70448077, -0.4781869, -0.9628538, 0.235329..."
49997,2024-10-29 15:22:44,1,Give the result string provided minimum number...,<p>A good follow up question asked in one of t...,string algorithm data-structures stack dynamic...,79137882,4,268,10,1,Give the result string provided minimum number...,"[0.009377887, 0.0073678787, 0.006877803, 0.006...","[-0.0071886983, 0.004251417, 0.0021597142, 0.0...",Give the result string provided minimum number...,"[-0.008913429, -0.06844016, 0.033410776, 0.065...","[-0.6534469, -0.45200142, -0.9248721, 0.395267..."
49998,2024-10-29 15:36:53,1,How to Exclude Tagless Structs Using ASTMatcher?,<p>I'm currently working with Clang's ASTMatch...,c clang abstract-syntax-tree libtooling clang-...,79137927,2,57,4,2,How to Exclude Tagless Structs Using ASTMatche...,"[0.0038029673, 0.001572058, -0.0031528277, -0....","[0.0081225075, -0.004442059, -0.0010730232, 0....",How to Exclude Tagless Structs Using ASTMatche...,"[-0.055789653, -0.0628867, -0.029554589, -0.05...","[-0.38035685, -0.31738442, -0.8149966, 0.02575..."


In [18]:
df0['DL_BERT'].apply(len).value_counts()



768    50000
Name: DL_BERT, dtype: int64

**Tag processing**

In [18]:

# Count the tags in each cell of the 'Tags' column
df0['tags_count'] = df0['Tags'].apply(lambda x: len(x.split()))

# Select the rows whose 'Tags' cell holds 6 tags
df_six_tags = df0[df0['tags_count'] == 6]

pd.set_option('display.max_colwidth', None)

# Show the 'Tags' cell of the matching rows
print(df_six_tags[['Tags']])

                                                              Tags
4570  machine-learning next.js socket.io opencv mediastream python


In [19]:
# Remove the tag 'opencv' and its leading space from the Tags cell at index 4570
df0.loc[4570, 'Tags'] = df0.loc[4570, 'Tags'].replace(' opencv', '')

# Show the modified row as a check
print(df0.loc[4570, ['Tags']])

Tags    machine-learning next.js socket.io mediastream python
Name: 4570, dtype: object


In [20]:
# Check that every cell of df0['Tags'] holds exactly 5 tags
df0['tags_count'] = df0['Tags'].apply(lambda x: len(x.split()))

# Keep the rows that do not hold exactly 5 tags
invalid_tags = df0[df0['tags_count'] != 5]

# Show the rows with an incorrect number of tags
print(invalid_tags[['Tags', 'tags_count']])

Empty DataFrame
Columns: [Tags, tags_count]
Index: []
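
The over-long row above was fixed by hand. As a hedged alternative, a generic rule can cap the number of tags per question. The sketch assumes that keeping the first five tags is acceptable, which differs from the manual choice made above (removing `opencv`). `limit_tags` and `MAX_TAGS` are hypothetical names.

```python
MAX_TAGS = 5  # assumption: same cap as the check above

def limit_tags(tag_string, max_tags=MAX_TAGS):
    """Keep at most `max_tags` space-separated tags, preserving their order."""
    return " ".join(tag_string.split()[:max_tags])

row = "machine-learning next.js socket.io opencv mediastream python"
capped = limit_tags(row)
print(capped)
print(len(capped.split()))
```

Applied with `df0['Tags'].apply(limit_tags)`, the rule would treat any future over-long row the same way, without referring to a specific index.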


In [21]:
# Show the first elements of df0['Tags']
print(df0['Tags'].head())

0                                   java algorithm oop data-structures time-complexity
1                           android kotlin android-livedata observer-pattern observers
2                                inheritance controls code-generation maui contentview
3    react-native if-statement components react-native-flatlist react-native-textinput
4                                      python numpy machine-learning deep-learning jax
Name: Tags, dtype: object


In [23]:
print(df0.shape)


(50000, 17)


In [24]:
from sklearn.preprocessing import MultiLabelBinarizer

# Create a MultiLabelBinarizer instance
mlb = MultiLabelBinarizer()

# Fit and transform the tag lists built from df0['Tags']
y = mlb.fit_transform(df0['Tags'].apply(lambda x: x.split()))

# Convert to a DataFrame for easier inspection
df_Tags_Mlb = pd.DataFrame(y, columns=mlb.classes_)
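
To make the encoding above concrete, here is a small sketch on three made-up questions (the tag lists are illustrative, not taken from the data set). It shows the indicator matrix produced by `MultiLabelBinarizer` and how to map a row back to its tags.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag lists, one per question
tag_lists = [
    ["python", "numpy"],
    ["java", "algorithm"],
    ["python", "algorithm"],
]

mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform(tag_lists)

print(list(mlb_demo.classes_))  # one column per distinct tag, sorted
print(y_demo)                   # 1 where the question carries the tag

# Map the indicator rows back to tags
recovered = mlb_demo.inverse_transform(y_demo)
print(recovered[0])
```

Each distinct tag becomes one column, which is why the full data set yields a very wide matrix.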


In [25]:
df_Tags_Mlb.shape

(50000, 19277)

In [26]:
# Compute the frequency of each tag
tag_frequencies = df_Tags_Mlb.sum(axis=0)

# Keep the 500 most frequent tags
top_500_tags = tag_frequencies.nlargest(500).index

# Filter df_Tags_Mlb to keep only the columns of the 500 most frequent tags
df_Tags_Mlb_reduced = df_Tags_Mlb[top_500_tags]

# Show the shape of the reduced DataFrame
df_Tags_Mlb_reduced.shape


(50000, 500)
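
Restricting the labels to the most frequent tags can leave some questions without any label. The sketch below reproduces the same selection on a tiny made-up indicator table and counts those rows. This check is not part of the notebook. The variable names and `k = 2` are illustrative.

```python
import pandas as pd

# Made-up indicator table: 4 questions, 4 tags
indicator = pd.DataFrame(
    {
        "python": [1, 0, 1, 1],
        "java": [0, 1, 0, 0],
        "numpy": [1, 0, 0, 1],
        "jax": [0, 0, 0, 1],
    }
)

k = 2
top_k = indicator.sum(axis=0).nlargest(k).index  # k most frequent tags
reduced = indicator[top_k]

# Questions that keep no tag at all after the reduction
no_label = reduced.sum(axis=1) == 0

print(list(top_k))
print(int(no_label.sum()))
```

Rows without any remaining label can only contribute negatives during training, so it may be worth counting them on the real data before fitting a model.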

## 4. Supervised model experiments

<i>4.1. Bag-of-Words approach</i>

TF-IDF + supervised classification model (Random Forest, Logistic Regression…)

<i>4.2. Word/Sentence Embedding approaches</i>

Word2Vec or Doc2Vec (with a classification model)

BERT (fine-tuning or pre-trained embeddings)

Universal Sentence Encoder (USE)

<i>4.3. Performance comparison</i>

Precision, recall, F1-score, Jaccard

Tag coverage rate

Other metrics suited to the context

**4.1. Bag-of-Words approach**

In [27]:
df6 = pd.read_pickle(r"C:\Users\ggenv\OneDrive\Documents\MLE\P5\df6.pkl")


**Logistic regression**

In [28]:

# X = df6 or df7 (TF-IDF or SVD features)
X = df6  # Or df7 if SVD is used

# y = df_Tags_Mlb_reduced (multi-label targets restricted to 500 tags)
y = df_Tags_Mlb_reduced  # Multi-label targets of shape (50000, 500)

# Standardize the features (important for logistic regression);
# note that the scaler is fitted on the full data set, before the split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit one LogisticRegression per tag with OneVsRestClassifier
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Compute the metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')
hamming = hamming_loss(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')

# Print the results
print(f"Accuracy: {accuracy:.4f}")
print(f"F1-score (macro): {f1:.4f}")
print(f"Hamming Loss: {hamming:.4f}")
print(f"Precision (macro): {precision:.4f}")
print(f"Recall (macro): {recall:.4f}")


Accuracy: 0.0358
F1-score (macro): 0.2507
Hamming Loss: 0.0065
Precision (macro): 0.2904
Recall (macro): 0.2353
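
With multi-label targets, scikit-learn's `accuracy_score` is the subset accuracy: a question counts as correct only if all of its labels are predicted exactly. This helps explain how a very low accuracy can coexist with a small Hamming loss, which is averaged over every (question, tag) pair. The toy example below, with made-up labels, contrasts the two and adds the sample-averaged Jaccard score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score

# Made-up targets and predictions: 3 questions, 4 tags
y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])
y_hat = np.array([[1, 1, 0, 0],   # exact match
                  [0, 1, 0, 0],   # one tag missed
                  [1, 0, 1, 1]])  # one extra tag

subset_acc = accuracy_score(y_true, y_hat)             # exact-match rate
ham = hamming_loss(y_true, y_hat)                      # share of wrong cells
jac = jaccard_score(y_true, y_hat, average="samples")  # mean per-question overlap

print(f"Subset accuracy: {subset_acc:.4f}")
print(f"Hamming loss: {ham:.4f}")
print(f"Jaccard (samples): {jac:.4f}")
```

Here only one question out of three is matched exactly, while 10 of the 12 individual tag decisions are correct. With 500 mostly-zero columns the gap is much wider, so the Jaccard score and the per-tag F1 are usually more informative than accuracy.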


In [29]:
# While at it, save df0 as df_4_transfos in pickle format

# Path where df0 is saved
file_path = r"C:\Users\ggenv\OneDrive\Documents\MLE\P5\df_4_transfos.pkl"

# Write df0 as a pickle
with open(file_path, 'wb') as file:
    pickle.dump(df0, file)

print("df0 has been saved as df_4_transfos.pkl")


df0 has been saved as df_4_transfos.pkl


In [12]:
df0

Unnamed: 0,CreationDate,PostTypeId,Title,Body,Tags,Id,Score,ViewCount,CommentCount,AnswerCount,Title_Body,BoW_Word2Vec,BoW_Lem_Word2Vec
0,2023-01-01 00:07:53,1,Optimized way to filter and return objects in ...,<p>I have the following Movie class -</p>\n<pr...,java algorithm oop data-structures time-comple...,74972603,2,80,3,1,Optimized way to filter and return objects in ...,"[0.0034893542, 0.007686467, 0.005897363, 0.008...","[0.00919876, 0.0013405064, -0.004493543, 0.006..."
1,2023-01-01 01:30:29,1,LiveData observer is not removed,<p>I am trying to get <code>LiveData</code> up...,android kotlin android-livedata observer-patte...,74972777,1,308,4,1,LiveData observer is not removed <p>I am tryin...,"[0.0043471786, 0.0068961945, 0.0009009706, -0....","[-0.009008026, 0.0053380113, 0.003758295, -0.0..."
2,2023-01-01 01:38:12,1,MAUI ContentView can't inherit from custom bas...,<p>I have a ContentView called HomePageOrienta...,inheritance controls code-generation maui cont...,74972784,2,1153,2,1,MAUI ContentView can't inherit from custom bas...,"[-0.004020552, 0.0025713993, 0.0016023592, -0....","[-0.009063528, 0.005615017, 0.0036862437, -0.0..."
3,2023-01-01 01:48:00,1,My if statement is not working in React Native,<p>I want to build a search bar that filters a...,react-native if-statement components react-nat...,74972800,1,60,0,2,My if statement is not working in React Native...,"[0.008811565, -0.0022609183, 0.0043626935, -0....","[-0.0022346813, 0.0075390935, -0.0030874617, 0..."
4,2023-01-01 02:11:45,1,jax.lax.select vs jax.numpy.where,"<p>Was taking a look at the <a href=""https://f...",python numpy machine-learning deep-learning jax,74972850,3,1787,0,1,jax.lax.select vs jax.numpy.where <p>Was takin...,"[-0.0017422211, 0.0022451014, -0.003727198, 0....","[-0.004146943, -0.009880067, -0.0071219914, -0..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2024-10-29 14:59:45,1,How to extract data using LeetCode GraphQL query,<p>I just want to know that how to print all t...,python json web-scraping python-requests graphql,79137785,2,269,1,1,How to extract data using LeetCode GraphQL que...,"[0.001136272, 0.006173037, -0.0065935263, 0.00...","[-0.0005435155, 0.00024361088, 0.005107024, 0...."
49996,2024-10-29 15:11:25,1,Z-index ignored while transitioning using the ...,<p><strong>Note: The example below uses the <a...,javascript css vue.js z-index view-transitions...,79137836,1,128,0,1,Z-index ignored while transitioning using the ...,"[0.0014509934, -0.0033614584, 0.0070376634, -0...","[-0.0030618808, -0.0051274016, 0.007299507, -0..."
49997,2024-10-29 15:22:44,1,Give the result string provided minimum number...,<p>A good follow up question asked in one of t...,string algorithm data-structures stack dynamic...,79137882,4,268,10,1,Give the result string provided minimum number...,"[0.009377887, 0.0073678787, 0.006877803, 0.006...","[-0.0071886983, 0.004251417, 0.0021597142, 0.0..."
49998,2024-10-29 15:36:53,1,How to Exclude Tagless Structs Using ASTMatcher?,<p>I'm currently working with Clang's ASTMatch...,c clang abstract-syntax-tree libtooling clang-...,79137927,2,57,4,2,How to Exclude Tagless Structs Using ASTMatche...,"[0.0038029673, 0.001572058, -0.0031528277, -0....","[0.0081225075, -0.004442059, -0.0010730232, 0...."


**Random Forest**

In [37]:
import gc
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Load df7 (SVD components of df6); df_Tags_Mlb_reduced is already defined in the notebook
df7 = pd.read_pickle('df7.pkl')

# Reduce memory usage by downcasting column dtypes
def optimize_memory(df):
    for col in df.select_dtypes(include=[np.float64]).columns:
        df[col] = df[col].astype(np.float32)
    for col in df.select_dtypes(include=[np.int64]).columns:
        df[col] = df[col].astype(np.int32)
    for col in df.select_dtypes(include=[object]).columns:
        df[col] = df[col].astype('category')
    return df

df7 = optimize_memory(df7)
df_Tags_Mlb_reduced = optimize_memory(df_Tags_Mlb_reduced)

# Free memory after the dtype conversion
gc.collect()

# Target (y): each binary column of df_Tags_Mlb_reduced is one tag (MultiLabelBinarizer output)
y = df_Tags_Mlb_reduced.values

# Features (X): df7 already contains the SVD components
X = df7

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForest model
rf_model = RandomForestClassifier(n_estimators=50, max_depth=10, criterion="entropy", max_samples=0.8, random_state=42)
rf_model.fit(X_train, y_train)

# Predict the tags of the test set
y_pred = rf_model.predict(X_test)

# RandomForest metrics (with a multi-label target, accuracy_score is the exact-match ratio)
RF_accuracy = accuracy_score(y_test, y_pred)
RF_f1 = f1_score(y_test, y_pred, average='weighted')  # or 'micro', 'macro'
RF_classification_rep = classification_report(y_test, y_pred)

# Per-label class probabilities, used by the AUC computation in the next cell
y_pred_proba = rf_model.predict_proba(X_test)


# Display the results
print("RandomForest Accuracy:", RF_accuracy)
print("RandomForest F1 Score:", RF_f1)
print("RandomForest Classification Report:\n", RF_classification_rep)


# Free the data that is no longer needed; y_test and y_pred_proba are kept for the AUC cell
del X_train, X_test, y_train, y_pred
gc.collect()

# Save the model
joblib.dump(rf_model, 'rf_model.pkl')

# Final memory cleanup
gc.collect()


RandomForest Accuracy: 0.0246
RandomForest F1 Score: 0.052719966607708973
RandomForest Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.22      0.35      1679
           1       0.00      0.00      0.00      1213
           2       0.90      0.04      0.07       781
           3       0.80      0.01      0.02       677
           4       0.67      0.08      0.15       662
           5       0.89      0.19      0.31       504
           6       0.94      0.26      0.41       458
           7       1.00      0.00      0.01       465
           8       0.00      0.00      0.00       459
           9       0.84      0.04      0.08       366
          10       0.82      0.17      0.29       303
          11       0.89      0.02      0.05       330
          12       1.00      0.00      0.01       312
          13       1.00      0.03      0.06       315
          14       0.86      0.08      0.15       303
          15       0.86 

[1;36m0[0m

In [36]:
# Per-label AUC
auc_scores = []
for i in range(y_test.shape[1]):
    # y_pred_proba[i] holds the class probabilities of label i; column 1 is P(label = 1)
    auc_scores.append(roc_auc_score(y_test[:, i], y_pred_proba[i][:, 1]))

# Macro AUC: unweighted mean of the per-label AUCs
RF_AUC_macro = np.mean(auc_scores)

# Micro AUC: pool every (sample, label) pair and compute a single AUC
proba_pos = np.column_stack([proba[:, 1] for proba in y_pred_proba])
RF_AUC_micro = roc_auc_score(y_test.ravel(), proba_pos.ravel())

print("RandomForest AUC Macro:", RF_AUC_macro)
print("RandomForest AUC Micro:", RF_AUC_micro)


RandomForest AUC Macro: 0.9055120647596278
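Macro and micro AUC generally differ: the macro value averages one AUC per tag, while the micro value pools every (question, tag) pair into a single ranking. The sketch below illustrates this on toy data; `proba_per_label` is a hypothetical stand-in for the list of arrays that `predict_proba` returns for a multi-output random forest.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth: 6 samples, 2 hypothetical tags
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [1, 0], [0, 1]])

# Stand-in for rf_model.predict_proba(X_test): one (n_samples, 2) array per tag
proba_per_label = [
    np.array([[0.2, 0.8], [0.7, 0.3], [0.4, 0.6], [0.9, 0.1], [0.1, 0.9], [0.6, 0.4]]),
    np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]),
]

# Keep the positive-class column of each tag -> matrix of shape (n_samples, n_labels)
proba_pos = np.column_stack([p[:, 1] for p in proba_per_label])

# One AUC per tag, then the two averaging schemes
auc_per_label = [roc_auc_score(y_true[:, i], proba_pos[:, i]) for i in range(y_true.shape[1])]
auc_macro = roc_auc_score(y_true, proba_pos, average="macro")
auc_micro = roc_auc_score(y_true, proba_pos, average="micro")

print("Per-label AUC:", auc_per_label)
print("Macro AUC:", auc_macro)
print("Micro AUC:", auc_micro)
```

With strongly imbalanced tags, the micro value is dominated by the frequent tags, so reporting both gives a fuller picture.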


**4.2. Word/Sentence Embedding Approaches**

In [34]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import lightgbm as lgb
import joblib

# X_train, X_test, y_train and y_test are assumed to hold the embedding-based split
# defined earlier in the notebook

# Configure the base LightGBM classifier
base_lgb = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=100,
    max_depth=5,
    feature_fraction=0.8,
    num_leaves=20,
    min_child_samples=10,  # Minimum number of samples required in a leaf
    force_col_wise=True,  # Column-wise histogram building; avoids the multi-threading message
    verbose=-1,  # Reduce log output
    random_state=42
)

# Wrap LightGBM for multi-label classification (one classifier per tag)
lgb_multi = MultiOutputClassifier(base_lgb)

# Train the model
lgb_multi.fit(X_train, y_train)

# Predict the tags of the test set
y_pred = lgb_multi.predict(X_test)

# Compute the metrics
LGBM_accuracy = accuracy_score(y_test, y_pred)
LGBM_f1 = f1_score(y_test, y_pred, average='weighted')

print("LightGBM Accuracy:", LGBM_accuracy)
print("LightGBM F1 Score:", LGBM_f1)
print("LightGBM Classification Report:\n", classification_report(y_test, y_pred))

# Save the model
joblib.dump(lgb_multi, 'lgb_model_embeddings.pkl')


LightGBM Accuracy: 0.0005
LightGBM F1 Score: 0.7333475163031993


LightGBM Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     10000
           1       0.00      0.00      0.00       686
           2       0.00      0.00      0.00       478
           3       0.79      1.00      0.88      7898
           4       0.00      0.00      0.00      2132
           5       0.00      0.00      0.00       418
           6       0.00      0.00      0.00       480
           7       1.00      0.00      0.00       559
           8       0.00      0.00      0.00       796
           9       0.00      0.00      0.00       248
          10       0.00      0.00      0.00       202
          11       0.00      0.00      0.00       278
          12       0.00      0.00      0.00       185
          13       0.00      0.00      0.00       248
          14       0.00      0.00      0.00        56
          15       0.94      1.00      0.97      9411
          16       0.45      0.08      0.14     

[1m[[0m[32m'lgb_model_embeddings.pkl'[0m[1m][0m

In [None]:
df0