# MARQUER Matthieu
## Project 5: Automatically categorize questions
 ![alt text](img/16480242457412.png "Stack Overflow")
Part 4: Supervised approach

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

### Display options

In [2]:
# https://pandas.pydata.org/docs/reference/api/pandas.set_option.html
#pd.set_option("display.max_rows", 200)
#pd.set_option("display.max_colwidth", 500)
#pd.set_option('display.max_columns', 100)

### Loading the data

In [2]:
data = pd.read_csv('data/cleaned/data_cleaned.csv')
data.head(3)

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
0,11227809,Why is processing a sorted array faster than p...,"<p>In this C++ code, sorting the data (<em>bef...",<java><c++><performance><cpu-architecture><bra...,27160,1851289,0.0,25,2012-06-27 13:51:36,Why processing sorted array faster processing ...
1,2003505,How do I delete a Git branch locally and remot...,<p>Failed Attempts to Delete a Remote Branch:<...,<git><version-control><git-branch><git-push><g...,20380,11236108,0.0,41,2010-01-05 01:12:15,How delete Git branch locally remotely Failed ...
2,1642028,What is the '-->' operator in C/C++?,"<p>After reading <a href=""http://groups.google...",<c++><c><operators><code-formatting><standards...,10112,994570,0.0,26,2009-10-29 06:57:45,What operator After reading Hidden Features Da...


In [4]:
# Number of rows and columns
data.shape

(50000, 10)

In [5]:
data.describe(include="all")

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
count,50000.0,50000,50000,50000,50000.0,50000.0,49293.0,50000.0,50000,50000
unique,,49999,50000,48706,,,,,49994,50000
top,,A potentially dangerous Request.Form value was...,"<p>In this C++ code, sorting the data (<em>bef...",<javascript><jquery><html><css><twitter-bootst...,,,,,2013-07-12 13:28:17,Why processing sorted array faster processing ...
freq,,2,1,31,,,,,2,1
mean,22040070.0,,,,85.18684,90067.24,0.000811,6.177,,
std,18240210.0,,,,283.783728,217394.9,0.144692,5.933739,,
min,4.0,,,,20.0,206.0,0.0,1.0,,
25%,6153363.0,,,,26.0,17799.75,0.0,3.0,,
50%,17602110.0,,,,37.0,39205.5,0.0,5.0,,
75%,34923680.0,,,,68.0,86352.5,0.0,8.0,,


In [6]:
# Types
data.dtypes

Id                 int64
Title             object
Body              object
Tags              object
Score              int64
ViewCount          int64
FavoriteCount    float64
AnswerCount        int64
CreationDate      object
Title_Body        object
dtype: object

In [7]:
# Number of missing values per column
data.isna().sum()

Id                 0
Title              0
Body               0
Tags               0
Score              0
ViewCount          0
FavoriteCount    707
AnswerCount        0
CreationDate       0
Title_Body         0
dtype: int64

In [8]:
# Number of distinct values per column
data.nunique()

Id               50000
Title            49999
Body             50000
Tags             48706
Score             1148
ViewCount        41263
FavoriteCount        3
AnswerCount         79
CreationDate     49994
Title_Body       50000
dtype: int64

In [3]:
# Set the number of rows to keep
limite = 50000
data = data.sample(n=limite, random_state=42)

In [4]:
# Extract the tag names from the raw "<tag1><tag2>" strings (re is already imported above)

tags = data["Tags"].apply(lambda x: re.findall(r'<(.*?)>', x))
tags

33553             [python, plugins, pycharm, pep8, flake8]
9427                     [c#, file, io, filesystems, copy]
199            [unix, ssh, passwords, openssh, passphrase]
12447    [ios, uiview, core-animation, uiviewanimation,...
39489    [c#, asp.net, asp.net-mvc, visual-studio, msbu...
                               ...                        
11284    [c, function, pointers, parameters, pass-by-va...
44732    [python, http, httprequest, wget, python-reque...
38158                  [java, json, rest, jersey, jackson]
860       [git, terminal, tree, console, revision-history]
15795    [asp.net-mvc, asp.net-mvc-3, asp.net-mvc-4, ra...
Name: Tags, Length: 50000, dtype: object
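The extraction above relies on a non-greedy regular expression. A minimal sketch on a made-up tag string (the value of `raw_tags` is only an illustration) shows why the `?` matters:

```python
import re

# Stack Overflow stores tags as one string such as "<c++><c><operators>".
raw_tags = "<c++><c><operators>"

# The non-greedy group (.*?) stops at the first ">" after each "<".
parsed = re.findall(r"<(.*?)>", raw_tags)
print(parsed)

# A greedy group (.*) would instead run up to the last ">" in the string.
greedy = re.findall(r"<(.*)>", raw_tags)
print(greedy)
```

With the greedy variant the whole string collapses into a single pseudo-tag, so the non-greedy form is the one to keep.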

In [5]:
from collections import Counter

# Flatten the per-question tag lists into a single list
tags_global = [tag for sublist in tags for tag in sublist]

# Number of occurrences of each tag
tags_global = Counter(tags_global)

# Tags sorted from most used to least used
tags_decroissant = sorted(tags_global.items(), key=lambda x: x[1], reverse=True)

# The 50 most used tags
tags_top_50 = tags_decroissant[:50]

# DataFrame of the top 50 (column 0: tag, column 1: count)
tags_top_50 = pd.DataFrame(tags_top_50)

# Set of the top-50 tag names, for fast membership tests
top_50_list = set(tags_top_50[0])
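The flatten, count, sort and slice steps can also be written with `Counter.most_common`. The sketch below uses a small hypothetical `tag_lists` in place of the real column and a top 2 instead of a top 50:

```python
from collections import Counter

# Toy stand-in for the parsed tag lists (hypothetical data).
tag_lists = [
    ["python", "pandas"],
    ["python", "pandas", "numpy"],
    ["c#", ".net"],
    ["python"],
]

# Counter.update accepts any iterable, so each list can be counted in turn.
counts = Counter()
for tag_list in tag_lists:
    counts.update(tag_list)

# most_common(n) returns the n most frequent (tag, count) pairs, highest first.
top_2 = counts.most_common(2)
top_2_set = {tag for tag, _ in top_2}

print(top_2)
print(top_2_set)
```

Keeping the final result as a set, as the notebook does, makes the later `tag in top_50_list` test cheap.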

### Removing < and > from Tags

In [6]:
# Keep only the top-50 tags in the Tags column

# Parse the raw tag string and keep the tags that belong to the top 50
def filter_tags(tags):
    return [tag for tag in re.findall(r'<(.*?)>', tags) if tag in top_50_list]

# Apply the filter to the "Tags" column
data['Tags'] = data['Tags'].apply(filter_tags)

# Drop tags longer than 25 characters
data['Tags'] = data['Tags'].apply(lambda tags: [tag for tag in tags if len(tag) <= 25])

# Keep only the rows with at least two remaining tags
# (len > 1 also drops single-tag rows, not just rows without tags)
data = data[data['Tags'].apply(len) > 1]
data

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
39489,49028212,Precompile asp.net views with ms build,<p>When I deploy asp.net application through v...,"[c#, asp.net, asp.net-mvc, visual-studio]",25,14467,0.0,1,2018-02-28 11:12:01,Precompile aspnet views build When deploy aspn...
10822,2875429,IUnityContainer.Resolve<T> throws error claimi...,<p>Yesterday I've implemented the code:</p>\n\...,"[c#, .net]",77,40728,0.0,5,2010-05-20 15:46:20,IUnityContainerResolve throws error claiming c...
4144,1235958,IPC performance: Named Pipe vs Socket,<p>Everyone seems to say named pipes are faste...,"[linux, performance]",169,140521,0.0,12,2009-08-05 21:52:24,IPC performance Named Pipe Socket Everyone see...
38695,30525184,Array vs Slice: accessing speed,<p>This question is about the speed of <em>acc...,"[arrays, performance]",25,10443,0.0,3,2015-05-29 08:49:35,Array Slice accessing speed This question spee...
29282,30386317,Babelify throws ParseError on import a module ...,<p>I'm working with <code>Babelify</code> and ...,"[javascript, node.js]",32,19613,0.0,3,2015-05-21 23:58:19,Babelify throws ParseError import module nodem...
...,...,...,...,...,...,...,...,...,...,...
1685,3304741,"How do I fix a ""type or namespace name could n...",<p>I'm getting a:</p>\n\n<blockquote>\n <p>ty...,"[c#, visual-studio]",334,514646,0.0,47,2010-07-21 23:42:53,How fix type namespace name could found error ...
44131,12013220,Celery creating a new connection for each task,<p>I'm using Celery with Redis to run some bac...,"[python, django]",23,9870,0.0,4,2012-08-17 21:09:13,Celery creating new connection task using Cele...
37194,12511711,Initializing std::vector with iterative functi...,"<p>In many languages, there are generators tha...","[c++, c++11]",26,9946,0.0,5,2012-09-20 11:30:52,Initializing stdvector iterative function call...
6265,11169418,NumPy style arrays for C++?,<p>Are there any C++ (or C) libs that have Num...,"[c++, arrays, python-3.x, numpy]",120,101074,0.0,13,2012-06-23 12:15:45,NumPy style arrays Are libs NumPylike arrays s...


In [7]:
# Check the number of unique tags
len(data["Tags"].explode().unique())

50

### Removing tags that appear only once

In [8]:
# Count the occurrences of each class (tag)
tag_counts = data['Tags'].explode().value_counts()

# Identify the classes with fewer than two occurrences
problematic_classes = tag_counts[tag_counts < 2].index

# Drop every row that carries at least one of these rare classes
data = data[~data['Tags'].apply(lambda x: any(tag in problematic_classes for tag in x))]
data

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
39489,49028212,Precompile asp.net views with ms build,<p>When I deploy asp.net application through v...,"[c#, asp.net, asp.net-mvc, visual-studio]",25,14467,0.0,1,2018-02-28 11:12:01,Precompile aspnet views build When deploy aspn...
10822,2875429,IUnityContainer.Resolve<T> throws error claimi...,<p>Yesterday I've implemented the code:</p>\n\...,"[c#, .net]",77,40728,0.0,5,2010-05-20 15:46:20,IUnityContainerResolve throws error claiming c...
4144,1235958,IPC performance: Named Pipe vs Socket,<p>Everyone seems to say named pipes are faste...,"[linux, performance]",169,140521,0.0,12,2009-08-05 21:52:24,IPC performance Named Pipe Socket Everyone see...
38695,30525184,Array vs Slice: accessing speed,<p>This question is about the speed of <em>acc...,"[arrays, performance]",25,10443,0.0,3,2015-05-29 08:49:35,Array Slice accessing speed This question spee...
29282,30386317,Babelify throws ParseError on import a module ...,<p>I'm working with <code>Babelify</code> and ...,"[javascript, node.js]",32,19613,0.0,3,2015-05-21 23:58:19,Babelify throws ParseError import module nodem...
...,...,...,...,...,...,...,...,...,...,...
1685,3304741,"How do I fix a ""type or namespace name could n...",<p>I'm getting a:</p>\n\n<blockquote>\n <p>ty...,"[c#, visual-studio]",334,514646,0.0,47,2010-07-21 23:42:53,How fix type namespace name could found error ...
44131,12013220,Celery creating a new connection for each task,<p>I'm using Celery with Redis to run some bac...,"[python, django]",23,9870,0.0,4,2012-08-17 21:09:13,Celery creating new connection task using Cele...
37194,12511711,Initializing std::vector with iterative functi...,"<p>In many languages, there are generators tha...","[c++, c++11]",26,9946,0.0,5,2012-09-20 11:30:52,Initializing stdvector iterative function call...
6265,11169418,NumPy style arrays for C++?,<p>Are there any C++ (or C) libs that have Num...,"[c++, arrays, python-3.x, numpy]",120,101074,0.0,13,2012-06-23 12:15:45,NumPy style arrays Are libs NumPylike arrays s...


# Supervised approach

In [9]:
# Cap the dataset at 20,000 rows, then reset the index so that a later
# concatenation with the encoded tags does not produce NaN
data = data.sample(n=20000, random_state=42).reset_index(drop=True)

In [10]:
# Encode the tags
from sklearn.preprocessing import MultiLabelBinarizer

# Instantiate the binarizer
mlb = MultiLabelBinarizer()

# Fit on the top-50 tags and turn each tag list into a 0/1 row
tags_encoder = mlb.fit_transform(data["Tags"])

# DataFrame of the encoded tags
tags_encoder_df = pd.DataFrame(tags_encoder, columns=mlb.classes_)

# Display the unique tags
print("Classes (unique tags):", mlb.classes_)

# Display the encoded matrix
print("Encoded matrix:\n", tags_encoder)

Classes (unique tags): ['.net' 'algorithm' 'android' 'arrays' 'asp.net' 'asp.net-mvc' 'bash' 'c'
 'c#' 'c++' 'c++11' 'cocoa-touch' 'css' 'database' 'django' 'gcc' 'git'
 'html' 'ios' 'iphone' 'java' 'javascript' 'jquery' 'json' 'linux' 'macos'
 'multithreading' 'mysql' 'node.js' 'numpy' 'objective-c' 'pandas'
 'performance' 'php' 'postgresql' 'python' 'python-3.x' 'reactjs' 'ruby'
 'ruby-on-rails' 'spring' 'sql' 'sql-server' 'string' 'swift'
 'unit-testing' 'visual-studio' 'windows' 'wpf' 'xcode']
Encoded matrix:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]]
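To make the encoded matrix easier to read, here is a hand-rolled version of the same idea on three hypothetical rows. It is only an illustration of the indicator encoding, not the scikit-learn implementation:

```python
# Hand-rolled multi-label binarization on toy data (hypothetical rows).
rows = [["python", "pandas"], ["c#"], ["pandas", "c#", "python"]]

# Like mlb.classes_, the columns are the sorted set of all labels seen.
classes = sorted({tag for row in rows for tag in row})

# Each row becomes a 0/1 indicator vector over those columns.
encoded = [[1 if cls in row else 0 for cls in classes] for row in rows]

# Decoding reads the column names back from the positions holding a 1.
decoded = [[cls for cls, flag in zip(classes, vec) if flag] for vec in encoded]

print(classes)
print(encoded)
print(decoded)
```

In the notebook, the decoding step corresponds to `mlb.inverse_transform`, which maps predicted 0/1 rows back to tag names.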


In [11]:
import joblib
joblib.dump(mlb.classes_, 'mlb_classes.pkl')

['mlb_classes.pkl']

In [12]:
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, jaccard_score
import mlflow
import mlflow.sklearn

# MLflow tracking server URL
mlflow.set_tracking_uri('http://localhost:5000')

X = data["Title_Body"]
y = tags_encoder

# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=42)

# Vectorize the text with TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=limite, stop_words="english")
X_tfidf_train = tfidf_vectorizer.fit_transform(X_train)
X_tfidf_test = tfidf_vectorizer.transform(X_test)

# Models and their parameter grids for GridSearchCV
models = [
    ('LogisticRegression', MultiOutputClassifier(LogisticRegression()), {
        'estimator__C': [600, 800, 1000]
    }),
    
    ('DecisionTreeClassifier', MultiOutputClassifier(DecisionTreeClassifier(random_state=42)), {
        'estimator__criterion': ['gini', 'entropy'],
        'estimator__max_depth': [1, 3, 5]
    }),
    
    ('RandomForestClassifier', MultiOutputClassifier(RandomForestClassifier(random_state=42)), {
        'estimator__n_estimators': [4, 5, 6],  
        'estimator__max_depth': [250, 270, 290]
    }),
    
    ('XGBClassifier', MultiOutputClassifier(XGBClassifier(random_state=42)), {
        'estimator__n_estimators': [175], # 125, 150, 
        'estimator__learning_rate': [0.05, 0.1, 0.15],
        'estimator__max_depth': [2, 4, 6]
    })
]

# Results table
resultat_mod = []

# Parent MLflow run; the context manager ends it even if a fit raises
with mlflow.start_run():

    # Grid search over each model
    for name, model, params in models:

        with mlflow.start_run(nested=True):  # nested run for this model

            # Record the model name as the run name in MLflow
            run_name = f"{name} Rows: {limite}"
            mlflow.set_tag("mlflow.runName", run_name)

            grid = GridSearchCV(model, params, cv=5)
            grid.fit(X_tfidf_train, y_train)

            # Subset accuracy on the test set
            y_pred_test = grid.predict(X_tfidf_test)
            accuracy_test = accuracy_score(y_test, y_pred_test)

            # Jaccard index, averaged over samples
            jaccard = jaccard_score(y_test, y_pred_test, average='samples')

            # Jaccard index, weighted by label support
            # jaccard_weighted = jaccard_score(y_test, y_pred_test, average='weighted')

            # Best parameters found by the grid search, and scores
            print(f"{name}:")
            print("Best parameters:", grid.best_params_)
            print("Accuracy on train (CV):", grid.best_score_)
            print("Accuracy on test:", accuracy_test)
            print("Jaccard index (samples average):", jaccard)
            # print("Weighted Jaccard index:", jaccard_weighted)
            print()

            # Log the best parameters and the three metrics to MLflow
            # (metric keys are kept unchanged)
            mlflow.log_params(grid.best_params_)
            mlflow.log_metric("Accuracy sur Train", grid.best_score_)
            mlflow.log_metric("Accuracy sur Test", accuracy_test)
            mlflow.log_metric("Indice de Jaccard", jaccard)

            # Append the results to the table
            resultat_mod.append({
                'Model': name,
                'Accuracy on train (CV)': grid.best_score_,
                'Accuracy on test': accuracy_test,
                'Jaccard index (samples average)': jaccard,
                'Best Parameters': grid.best_params_
            })

# Results
resultat_mod_df = pd.DataFrame(resultat_mod)
resultat_mod_df

MlflowException: API request to http://localhost:5000/api/2.0/mlflow/runs/create failed with exception HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/create (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb5cee18b20>: Failed to establish a new connection: [Errno 111] Connection refused'))
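Independently of the tracking error, it helps to estimate the size of this search before launching it. The sketch below counts single-label model fits under three assumptions: `MultiOutputClassifier` fits one estimator per tag (50 here), `cv=5` as in the cell above, and `GridSearchCV` keeps its default behaviour of refitting the best combination once. The grid sizes are copied by hand from the `models` list.

```python
from itertools import product

N_LABELS = 50   # one binary estimator per tag with MultiOutputClassifier
CV_FOLDS = 5    # cv=5 in GridSearchCV

# Number of candidate values per hyperparameter, copied from the grids above.
grids = {
    "LogisticRegression": {"C": 3},
    "DecisionTreeClassifier": {"criterion": 2, "max_depth": 3},
    "RandomForestClassifier": {"n_estimators": 3, "max_depth": 3},
    "XGBClassifier": {"n_estimators": 1, "learning_rate": 3, "max_depth": 3},
}

def base_estimator_fits(sizes, folds=CV_FOLDS, labels=N_LABELS):
    """Count single-label fits: every combination on every fold, plus one refit."""
    combos = len(list(product(*(range(n) for n in sizes.values()))))
    return (combos * folds + 1) * labels

fits = {name: base_estimator_fits(sizes) for name, sizes in grids.items()}
total_fits = sum(fits.values())

print(fits)
print(total_fits)
```

Under these assumptions the four searches add up to several thousand individual fits, which is worth keeping in mind when widening a grid.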

In [19]:
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, jaccard_score
import mlflow
import mlflow.sklearn
import joblib

X = data["Title_Body"]
y = tags_encoder

# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=42)

# Vectorize the text with TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=limite, stop_words="english")
X_tfidf_train = tfidf_vectorizer.fit_transform(X_train)
X_tfidf_test = tfidf_vectorizer.transform(X_test)

# Initialize the LogisticRegression model with fixed hyperparameters
logisticRegression_model = MultiOutputClassifier(LogisticRegression(C=1000))

# Train the LogisticRegression model
logisticRegression_model.fit(X_tfidf_train, y_train)

# Save the vectorizer and the model
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.joblib")
joblib.dump(logisticRegression_model, "logisticRegression_model.joblib")

# Subset accuracy on the training set
accuracy_train = logisticRegression_model.score(X_tfidf_train, y_train)
print("Accuracy on train:", accuracy_train)

# Subset accuracy on the test set
accuracy_test = logisticRegression_model.score(X_tfidf_test, y_test)
print("Accuracy on test:", accuracy_test)

# Predictions on the test set
y_pred_test = logisticRegression_model.predict(X_tfidf_test)

# Jaccard index, averaged over samples
jaccard = jaccard_score(y_test, y_pred_test, average='samples')
print("Jaccard index (samples average):", jaccard)

# Jaccard index, weighted by label support
jaccard_weighted = jaccard_score(y_test, y_pred_test, average='weighted')
print("Weighted Jaccard index:", jaccard_weighted)

# Earlier run on 10,000 rows (C=2700):
# Accuracy on train: 1.0
# Accuracy on test: 0.2799395542123158
# Jaccard index (samples average): 0.46961339881626996
# Weighted Jaccard index: 0.4338127014331402

Accuracy on train: 1.0
Accuracy on test: 0.27016666666666667
Jaccard index (samples average): 0.5574749999999999
Weighted Jaccard index: 0.5326880499381086
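The gap between the test accuracy (about 0.27) and the sample-averaged Jaccard index (about 0.56) comes from how the two metrics treat partially correct predictions. For multi-label targets, scikit-learn's accuracy is the subset accuracy: a question counts as correct only when every tag matches. The pure-Python sketch below reimplements both metrics on hypothetical predictions, purely for illustration; it is not the scikit-learn code.

```python
def subset_accuracy(y_true, y_pred):
    """Share of samples whose predicted tag set equals the true set exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def samples_jaccard(y_true, y_pred):
    """Mean over samples of |true & pred| / |true | pred| (0 when both are empty)."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        scores.append(len(t & p) / len(union) if union else 0.0)
    return sum(scores) / len(scores)

# Hypothetical predictions for three questions.
y_true = [{"python", "pandas"}, {"c#", ".net"}, {"java"}]
y_pred = [{"python"}, {"c#", ".net"}, {"java", "spring"}]

acc = subset_accuracy(y_true, y_pred)
jac = samples_jaccard(y_true, y_pred)

print(acc)
print(jac)
```

Two of the three toy predictions are half right: they count for nothing in the subset accuracy but for 0.5 each in the Jaccard average, which is why the Jaccard index is the more informative score for tag suggestion.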


In [20]:
# Vectorize the text with TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=limite, stop_words="english")
X_tfidf_train = tfidf_vectorizer.fit_transform(X_train)
X_tfidf_test = tfidf_vectorizer.transform(X_test)

# Initialize the DecisionTreeClassifier model with fixed hyperparameters
decision_tree_model = MultiOutputClassifier(DecisionTreeClassifier(criterion='gini', max_depth=14, random_state=42))

# Train the DecisionTreeClassifier model
decision_tree_model.fit(X_tfidf_train, y_train)

# Subset accuracy on the training set
accuracy_train = decision_tree_model.score(X_tfidf_train, y_train)
print("Accuracy on train:", accuracy_train)

# Subset accuracy on the test set
accuracy_test = decision_tree_model.score(X_tfidf_test, y_test)
print("Accuracy on test:", accuracy_test)

# Predictions on the test set
y_pred_test = decision_tree_model.predict(X_tfidf_test)

# Jaccard index, averaged over samples
jaccard = jaccard_score(y_test, y_pred_test, average='samples')
print("Jaccard index (samples average):", jaccard)

# Jaccard index, weighted by label support
jaccard_weighted = jaccard_score(y_test, y_pred_test, average='weighted')
print("Weighted Jaccard index:", jaccard_weighted)

# Earlier run on 10,000 rows (criterion='gini', max_depth=14, random_state=42):
# Accuracy on train: 0.6994818652849741
# Accuracy on test: 0.2319607102380053
# Jaccard index (samples average): 0.4388341698598593
# Weighted Jaccard index: 0.41274001678755273

Accuracy on train: 0.5576428571428571
Accuracy on test: 0.20983333333333334
Jaccard index (samples average): 0.5002074074074074
Weighted Jaccard index: 0.48395911458690943

# Word2Vec

In [11]:
import tensorflow as tf
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
import gensim

2024-02-26 23:16:28.261913: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-26 23:16:28.856581: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-26 23:16:31.306403: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-26 23:16:31.312539: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Creating the Word2Vec model

In [16]:
w2v_size=300
w2v_window=5
w2v_min_count=1
w2v_epochs=100
maxlen = 24 # adapt to length of sentences
sentences = data['Title_Body'].to_list()
sentences = [gensim.utils.simple_preprocess(text) for text in sentences]

In [17]:
# Create and train the Word2Vec model

print("Build & train Word2Vec model ...")
w2v_model = gensim.models.Word2Vec(min_count=w2v_min_count, window=w2v_window,
                                                vector_size=w2v_size,
                                                seed=42,
                                                workers=1)
#                                                workers=multiprocessing.cpu_count())
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=w2v_epochs)
model_vectors = w2v_model.wv
w2v_words = model_vectors.index_to_key
print("Vocabulary size: %i" % len(w2v_words))
print("Word2Vec trained")

Build & train Word2Vec model ...
Vocabulary size: 20097
Word2Vec trained


In [18]:
# Prepare the sentences (tokenization)

print("Fit Tokenizer ...")
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
x_sentences = pad_sequences(tokenizer.texts_to_sequences(sentences),
                                                     maxlen=maxlen,
                                                     padding='post') 
                                                   
num_words = len(tokenizer.word_index) + 1
print("Number of unique words: %i" % num_words)

Fit Tokenizer ...
Number of unique words: 20098


# Building the embedding matrix

In [19]:
# Build the embedding matrix

print("Create Embedding matrix ...")
w2v_size = 300
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, w2v_size))
i = 0  # words seen in the tokenizer vocabulary
j = 0  # words that also have a Word2Vec vector

for word, idx in word_index.items():
    i += 1
    # key_to_index is a dict, so this membership test does not scan a list
    if word in model_vectors.key_to_index:
        j += 1
        embedding_matrix[idx] = model_vectors[word]

word_rate = np.round(j/i, 4)
print("Word embedding rate : ", word_rate)
print("Embedding matrix: %s" % str(embedding_matrix.shape))

Create Embedding matrix ...


Word embedding rate :  1.0
Embedding matrix: (20098, 300)


# Building the embedding model

In [20]:
# Build the model

input=Input(shape=(len(x_sentences),maxlen),dtype='float64')
word_input=Input(shape=(maxlen,),dtype='float64')  
word_embedding=Embedding(input_dim=vocab_size,
                         output_dim=w2v_size,
                         weights = [embedding_matrix],
                         input_length=maxlen)(word_input)
word_vec=GlobalAveragePooling1D()(word_embedding)  
embed_model = Model([word_input],word_vec)

embed_model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 24)]              0         
                                                                 
 embedding (Embedding)       (None, 24, 300)           6029400   
                                                                 
 global_average_pooling1d (G  (None, 300)              0         
 lobalAveragePooling1D)                                          
                                                                 
Total params: 6,029,400
Trainable params: 6,029,400
Non-trainable params: 0
_________________________________________________________________


# Running the model

In [21]:
embeddings = embed_model.predict(x_sentences)
embeddings.shape



(2000, 300)
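One detail of this pooling step deserves a check. The sequences are padded to `maxlen` with index 0, and the embedding row for index 0 is all zeros, so an unmasked average divides by the full padded length and shrinks the vectors of short texts. The pure-Python sketch below uses hypothetical 2-dimensional embeddings to contrast the two averages; it illustrates the arithmetic only and is not the Keras implementation.

```python
def mean_pool(vectors):
    """Average over every position, padding included (unmasked pooling)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def masked_mean_pool(vectors, token_ids, pad_id=0):
    """Average only over positions whose token id is not the padding id."""
    kept = [v for v, tok in zip(vectors, token_ids) if tok != pad_id]
    return mean_pool(kept) if kept else [0.0] * len(vectors[0])

# Hypothetical embeddings; id 0 is the padding id and maps to a zero vector.
embedding = {0: [0.0, 0.0], 1: [2.0, 4.0], 2: [4.0, 0.0]}
token_ids = [1, 2, 0, 0]  # two real tokens, padded to length 4
vectors = [embedding[t] for t in token_ids]

unmasked = mean_pool(vectors)
masked = masked_mean_pool(vectors, token_ids)

print(unmasked)
print(masked)
```

Keras' `Embedding` layer has a `mask_zero` argument intended for this situation; whether enabling it improves the scores reported here has not been tested.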

In [24]:
# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(embeddings, tags_encoder, test_size=0.3, random_state=42)

In [29]:
clf_word2vec = MultiOutputClassifier(LogisticRegression()).fit(X_train, y_train)

### Results

In [32]:
y_pred_word2vec = clf_word2vec.predict(X_test)
accuracy_word2vec = accuracy_score(y_test, y_pred_word2vec)
accuracy_word2vec

0.055

In [33]:
jaccard_word2vec = jaccard_score(y_test, y_pred_word2vec, average='samples')
jaccard_word2vec

0.23477777777777778

In [None]:
# Log the metrics to MLflow
with mlflow.start_run(run_name=f"Word2Vec_LogReg Rows: {limite}"):
    mlflow.log_param("model", "Word2Vec_LogisticRegression")
    mlflow.log_metric("accuracy", accuracy_word2vec)
    mlflow.log_metric("jaccard", jaccard_word2vec)
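The grid-search cell earlier failed with `Connection refused` because nothing was listening on `localhost:5000`. A small guard, sketched below with the standard library only, can test the port before any MLflow call. The function name `tracking_server_reachable` and the fallback path `file:///tmp/mlruns` are hypothetical choices for this example.

```python
import socket
from urllib.parse import urlparse

def tracking_server_reachable(uri, timeout=1.0):
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parsed = urlparse(uri)
    host = parsed.hostname or "localhost"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical fallback: a local file store when the server is down.
server_uri = "http://localhost:5000"
fallback_uri = "file:///tmp/mlruns"
tracking_uri = server_uri if tracking_server_reachable(server_uri) else fallback_uri

print(tracking_uri)
```

The chosen `tracking_uri` could then be passed to `mlflow.set_tracking_uri` before the first `mlflow.start_run`, so that a stopped server no longer aborts the whole cell.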

# BERT

In [12]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
import time

# Bert
import os
import transformers
from transformers import TFAutoModel, AutoTokenizer

os.environ["TF_KERAS"]='1'

In [14]:
print(tf.__version__)
print(tensorflow.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.is_built_with_cuda())

2.12.0
2.12.0
Num GPUs Available:  0
True


# Shared functions

In [13]:
from fonction import *

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/matthieu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/matthieu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/matthieu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# BERT HuggingFace

In [16]:
data.shape

(2000, 10)

In [14]:
# Apply transform_dl_fct to each text
data['sentence_dl'] = data['Title_Body'].apply(transform_dl_fct)

### 'bert-base-uncased'

In [18]:
max_length = 64
batch_size = 5
model_type = 'bert-base-uncased'
model = TFAutoModel.from_pretrained(model_type)
sentences = data['sentence_dl'].to_list()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [19]:
# Build the features
features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, 
                                                         max_length, batch_size, mode='HF')

processing time:  290.0


In [22]:
# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(features_bert, tags_encoder, test_size=0.3, random_state=42)



In [23]:
clf_bert = MultiOutputClassifier(LogisticRegression()).fit(X_train, y_train)

In [24]:
y_pred_bert = clf_bert.predict(X_test)
accuracy_bert = accuracy_score(y_test, y_pred_bert)
accuracy_bert

0.07833333333333334

In [25]:
jaccard_bert = jaccard_score(y_test, y_pred_bert, average='samples')
jaccard_bert

0.2441111111111111

In [None]:
# Log the metrics to MLflow
with mlflow.start_run(run_name=f"Bert_LogReg Rows: {limite}"):
    mlflow.log_param("model", "Bert_LogisticRegression")
    mlflow.log_metric("accuracy", accuracy_bert)
    mlflow.log_metric("jaccard", jaccard_bert)

In [None]:
# End the MLflow run
mlflow.end_run()

### 'cardiffnlp/twitter-roberta-base-sentiment'
Model pre-trained on tweets for sentiment analysis, assumed here to be particularly well suited to the context.

In [None]:
# test
data.shape

(161, 12)

In [26]:
max_length = 64
batch_size = 5
model_type = 'cardiffnlp/twitter-roberta-base-sentiment'
model = TFAutoModel.from_pretrained(model_type)
sentences = data['sentence_dl'].to_list()


In [None]:
features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, 
                                                         max_length, batch_size, mode='HF')

In [None]:
# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(features_bert, tags_encoder, test_size=0.3, random_state=42)

In [None]:
clf_roberta = MultiOutputClassifier(LogisticRegression()).fit(X_train, y_train)

In [None]:
y_pred_roberta = clf_roberta.predict(X_test)
accuracy_roberta = accuracy_score(y_test, y_pred_roberta)
accuracy_roberta

In [None]:
jaccard_roberta = jaccard_score(y_test, y_pred_roberta, average='samples')
jaccard_roberta

In [None]:
# Log the metrics to MLflow
with mlflow.start_run(run_name=f"Roberta_LogReg Rows: {limite}"):
    mlflow.log_param("model", "Roberta_LogisticRegression")
    mlflow.log_metric("accuracy", accuracy_roberta)
    mlflow.log_metric("jaccard", jaccard_roberta)

# BERT hub Tensorflow

In [None]:
import tensorflow_hub as hub
#import tensorflow_text 

# TensorFlow Hub guide: https://www.tensorflow.org/text/tutorials/classify_text_with_bert
model_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4' # originally /4
bert_layer = hub.KerasLayer(model_url, trainable=True)

In [None]:
sentences = data['sentence_dl'].to_list()

In [None]:
max_length = 64
batch_size = 5
model_type = 'bert-base-uncased'
model = bert_layer

features_bert, last_hidden_states_tot = feature_BERT_fct(model, model_type, sentences, 
                                                         max_length, batch_size, mode='TFhub')

In [None]:
# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(features_bert, tags_encoder, test_size=0.3, random_state=42)

In [None]:
clf_bert_hub = MultiOutputClassifier(LogisticRegression()).fit(X_train, y_train)

In [None]:
y_pred_bert_hub = clf_bert_hub.predict(X_test)
accuracy_bert_hub = accuracy_score(y_test, y_pred_bert_hub)
accuracy_bert_hub

In [None]:
jaccard_bert_hub = jaccard_score(y_test, y_pred_bert_hub, average='samples')
jaccard_bert_hub

# USE - Universal Sentence Encoder

In [15]:
import tensorflow as tf
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

# Bert
import transformers
#from transformers import *
import os
os.environ["TF_KERAS"]='1'

In [16]:
print(tf.__version__)
print(tensorflow.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.is_built_with_cuda())

2.12.0
2.12.0
Num GPUs Available:  0
True


In [17]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [18]:
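def feature_USE_fct(sentences, b_size):
    """Encode the sentences with USE, batch by batch; returns one row per sentence."""
    batch_size = b_size
    time1 = time.time()
    features = []

    # Step through the list by batch_size so that a final, incomplete batch
    # is encoded too (len(sentences)//batch_size steps would drop it)
    for idx in range(0, len(sentences), batch_size):
        feat = embed(sentences[idx:idx+batch_size])
        features.append(feat.numpy())

    features = np.concatenate(features)
    time2 = np.round(time.time() - time1, 0)  # elapsed seconds (not displayed)
    return features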
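
The batching pattern used here can be sketched without downloading the TF Hub model. This is only a sketch: `fake_embed` and `encode_in_batches` are hypothetical names that do not appear in the notebook, and the stand-in encoder returns 4-dimensional vectors instead of the 512 produced by USE. It shows that stepping with `range(0, n, batch_size)` also encodes a final, incomplete batch.

```python
import numpy as np

def fake_embed(batch):
    # Hypothetical stand-in for the USE model: one 4-dimensional vector per sentence
    return np.array([[len(s), s.count(" "), 0.0, 1.0] for s in batch], dtype=float)

def encode_in_batches(sentences, batch_size, encoder=fake_embed):
    # Encode each slice of at most batch_size sentences, then stack the results
    chunks = [encoder(sentences[i:i + batch_size])
              for i in range(0, len(sentences), batch_size)]
    return np.concatenate(chunks)

# 7 sentences with batches of 5: the second batch only holds 2 sentences
demo_sentences = [f"question number {i}" for i in range(7)]
demo_features = encode_in_batches(demo_sentences, batch_size=5)
print(demo_features.shape)
```

The output has one row per input sentence, which is what `train_test_split` needs in order to align the features with `tags_encoder`.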

In [19]:
batch_size = 5
sentences = data['sentence_dl'].to_list()

In [20]:
features_USE = feature_USE_fct(sentences, batch_size)


In [21]:
features_USE.shape, tags_encoder.shape

((2000, 512), (2000, 50))

In [24]:
# Split the data into train and test sets (70 / 30)
X_train, X_test, y_train, y_test = model_selection.train_test_split(features_USE, tags_encoder, test_size=0.3, random_state=42)

In [25]:
clf_use = MultiOutputClassifier(LogisticRegression()).fit(X_train, y_train)

In [26]:
y_pred_use = clf_use.predict(X_test)
accuracy_use = accuracy_score(y_test, y_pred_use)
accuracy_use

0.06166666666666667

In [27]:
jaccard_use = jaccard_score(y_test, y_pred_use, average='samples')
jaccard_use

0.26899999999999996
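
The gap between the two scores comes from how each treats partly correct predictions. With multilabel targets, `accuracy_score` counts a sample as correct only when its whole set of tags matches exactly. `jaccard_score` with `average='samples'` gives each sample the ratio of shared tags to the union of true and predicted tags, then averages. A small illustration on made-up labels (the arrays below are hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score

# Hypothetical ground truth and predictions for 3 questions and 3 tags
y_true_demo = np.array([[1, 1, 0],
                        [0, 1, 1],
                        [1, 0, 0]])
y_pred_demo = np.array([[1, 0, 0],   # one of the two true tags found
                        [0, 1, 1],   # exact match
                        [1, 0, 1]])  # true tag found, plus one extra tag

# Subset accuracy: only the second row matches exactly
acc_demo = accuracy_score(y_true_demo, y_pred_demo)
# Per-sample Jaccard: 1/2, 1 and 1/2, then averaged
jac_demo = jaccard_score(y_true_demo, y_pred_demo, average='samples')
print(acc_demo, jac_demo)
```

Here the subset accuracy is 1/3 while the sample-averaged Jaccard score is 2/3, which is why the Jaccard score is the more informative of the two for tag suggestion.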

In [None]:
# Log the metrics to MLflow
with mlflow.start_run(run_name=f"USE_LogReg Rows: {limite}"):
    mlflow.log_param("model", "USE_LogisticRegression")
    mlflow.log_metric("accuracy", accuracy_use)
    mlflow.log_metric("jaccard", jaccard_use)

In [None]:
import joblib

# Save the trained clf_use classifier
joblib.dump(clf_use, 'clf_use.joblib')
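
A saved classifier can be reloaded with `joblib.load`. The sketch below checks the round trip on a small synthetic problem. The random data, the `clf_demo` name and the temporary file name are assumptions made for the illustration; in the notebook the file would be `clf_use.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic features and two binary labels (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 8))
y_demo = np.column_stack([X_demo[:, 0] > 0, X_demo[:, 1] > 0]).astype(int)

clf_demo = MultiOutputClassifier(LogisticRegression()).fit(X_demo, y_demo)

# Save to a temporary directory, then reload from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clf_demo.joblib")  # hypothetical file name
    joblib.dump(clf_demo, path)
    clf_reloaded = joblib.load(path)

# The reloaded classifier should give the same predictions as the original
same_predictions = np.array_equal(clf_demo.predict(X_demo),
                                  clf_reloaded.predict(X_demo))
print(same_predictions)
```

Note that the file only contains the scikit-learn classifier. The USE encoder is not part of it and has to be loaded separately with `hub.load` before new text can be embedded.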

# Model comparison

In [None]:
# Comparison of Word2Vec, BERT and USE
tableau_comparatif = pd.DataFrame({
    'Model': ['Word2Vec', 'BERT', 'USE'],
    'Accuracy': [accuracy_word2vec, accuracy_bert, accuracy_use],
    'Jaccard Score': [jaccard_word2vec, jaccard_bert, jaccard_use]
})

tableau_comparatif

In [None]:
# Comparison of Word2Vec, BERT, RoBERTa, BERT (TensorFlow Hub) and USE
tableau_comparatif = pd.DataFrame({
    'Model': ['Word2Vec', 'BERT', 'BERT_Roberta', 'BERT_hub', 'USE'],
    'Accuracy': [accuracy_word2vec, accuracy_bert, accuracy_roberta, accuracy_bert_hub, accuracy_use],
    'Jaccard Score': [jaccard_word2vec, jaccard_bert, jaccard_roberta, jaccard_bert_hub, jaccard_use]
})

tableau_comparatif

In [None]:
resultat_mod_df

# Prediction function

In [None]:
text_a_predir = ['django.core.servers.basehttp.FileWrapper disappears in Django 1.9',
                 'PyMongo create unique index with 2 or more fields',
                 'How to download google image search results in Python',
                 'sine calculation orders of magnitude slower than cosine',
                 'Using lambda expression to connect slots in pyqt']

In [None]:
# Load the USE model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Embed the input texts with USE (one 512-dimensional vector per text)
def predict_use(texts):
    # The whole list is encoded in a single batch
    return embed(texts).numpy()

# Predict the tags with the MultiOutputClassifier (clf_use)
tags_predictions = clf_use.predict(predict_use(text_a_predir))

# Display the predictions with the tag names
for text, prediction in zip(text_a_predir, tags_predictions):
    predicted_tags_indices = np.where(prediction == 1)[0]
    predicted_tags = tags_encoder_df.columns[predicted_tags_indices]
    print(f"Text: {text}\nPredicted tags: {list(predicted_tags)}")
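
The decoding step, which turns each row of 0/1 predictions back into tag names, can be checked on its own. In this sketch the tag names and the prediction matrix are made up, and `demo_columns` stands in for `tags_encoder_df.columns`.

```python
import numpy as np
import pandas as pd

# Hypothetical tag columns standing in for tags_encoder_df.columns
demo_columns = pd.Index(["python", "django", "mongodb"])

# Hypothetical binary predictions: one row per text, one column per tag
demo_predictions = np.array([[1, 1, 0],
                             [0, 0, 0],
                             [1, 0, 1]])

# For each row, keep the names of the columns predicted as 1
decoded = [list(demo_columns[np.flatnonzero(row)]) for row in demo_predictions]
print(decoded)
```

A row of zeros decodes to an empty list: none of the per-tag classifiers predicted its tag for that text. With a model whose exact-match accuracy is low, this case is worth handling explicitly when showing results to a user.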