# MARQUER Matthieu
## Project 5: Automatically categorize questions
 ![alt text](img/16480242457412.png "Stack Overflow")
 Part 3: Unsupervised approach


### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

### Setting options

In [2]:
# https://pandas.pydata.org/docs/reference/api/pandas.set_option.html
#pd.set_option("display.max_rows", 200)
#pd.set_option("display.max_colwidth", 500)
#pd.set_option('display.max_columns', 100)

### Loading the files

In [3]:
data = pd.read_csv('data/cleaned/data_cleaned.csv')
data.head(3)

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
0,11227809,Why is processing a sorted array faster than p...,"<p>In this C++ code, sorting the data (<em>bef...",<java><c++><performance><cpu-architecture><bra...,27160,1851289,0.0,25,2012-06-27 13:51:36,Why processing sorted array faster processing ...
1,2003505,How do I delete a Git branch locally and remot...,<p>Failed Attempts to Delete a Remote Branch:<...,<git><version-control><git-branch><git-push><g...,20380,11236108,0.0,41,2010-01-05 01:12:15,How delete Git branch locally remotely Failed ...
2,1642028,What is the '-->' operator in C/C++?,"<p>After reading <a href=""http://groups.google...",<c++><c><operators><code-formatting><standards...,10112,994570,0.0,26,2009-10-29 06:57:45,What operator After reading Hidden Features Da...


In [4]:
# Number of rows and columns
data.shape

(50000, 10)

In [5]:
data.describe(include="all")

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
count,50000.0,50000,50000,50000,50000.0,50000.0,49293.0,50000.0,50000,50000
unique,,49999,50000,48706,,,,,49994,50000
top,,A potentially dangerous Request.Form value was...,"<p>In this C++ code, sorting the data (<em>bef...",<javascript><jquery><html><css><twitter-bootst...,,,,,2013-07-12 13:28:17,Why processing sorted array faster processing ...
freq,,2,1,31,,,,,2,1
mean,22040070.0,,,,85.18684,90067.24,0.000811,6.177,,
std,18240210.0,,,,283.783728,217394.9,0.144692,5.933739,,
min,4.0,,,,20.0,206.0,0.0,1.0,,
25%,6153363.0,,,,26.0,17799.75,0.0,3.0,,
50%,17602110.0,,,,37.0,39205.5,0.0,5.0,,
75%,34923680.0,,,,68.0,86352.5,0.0,8.0,,


In [6]:
# Column types
data.dtypes

Id                 int64
Title             object
Body              object
Tags              object
Score              int64
ViewCount          int64
FavoriteCount    float64
AnswerCount        int64
CreationDate      object
Title_Body        object
dtype: object

In [7]:
# Number of missing values per column
data.isna().sum()

Id                 0
Title              0
Body               0
Tags               0
Score              0
ViewCount          0
FavoriteCount    707
AnswerCount        0
CreationDate       0
Title_Body         0
dtype: int64

In [8]:
# Number of distinct values per column
data.nunique()

Id               50000
Title            49999
Body             50000
Tags             48706
Score             1148
ViewCount        41263
FavoriteCount        3
AnswerCount         79
CreationDate     49994
Title_Body       50000
dtype: int64

In [9]:
# Set the desired number of rows
limite = 20000
data = data.sample(n=limite, random_state=42)

In [10]:
# Extract the tags
import re

tags = data["Tags"].apply(lambda x: re.findall(r'<(.*?)>', x))
tags

33553             [python, plugins, pycharm, pep8, flake8]
9427                     [c#, file, io, filesystems, copy]
199            [unix, ssh, passwords, openssh, passphrase]
12447    [ios, uiview, core-animation, uiviewanimation,...
39489    [c#, asp.net, asp.net-mvc, visual-studio, msbu...
                               ...                        
14324    [ios, xcode, facebook, swift, facebook-graph-api]
43453    [python, nlp, cluster-analysis, word2vec, word...
29499    [csv, apache-spark, header, apache-spark-sql, ...
42681    [javascript, google-chrome, google-chrome-exte...
42326        [python, python-3.x, cygwin, pip, python-3.4]
Name: Tags, Length: 20000, dtype: object

In [11]:
# Flatten the per-question tag lists into a single list
tags_global = [tag for sublist in tags for tag in sublist]

# Number of occurrences of each tag
from collections import Counter
tags_global = Counter(tags_global)

# Tags sorted from most used to least used
tags_decroissant = sorted(tags_global.items(), key=lambda x: x[1], reverse=True)

# Top 50 most used tags
tags_top_50 = tags_decroissant[:50]

# DataFrame built from tags_top_50
tags_top_50 = pd.DataFrame(tags_top_50)

# Set of the top 50 tag names
top_50_list = set(tags_top_50[0])
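
The tag counting above can also be sketched with `Counter.most_common`, which returns the tags already sorted by decreasing count. The `raw_tags` strings below are made up for illustration; only the `<tag>` format comes from the dataset.

```python
import re
from collections import Counter

# Hypothetical tag strings in the same "<a><b>" format as the Tags column
raw_tags = ["<python><pandas>", "<python><numpy>", "<c#><.net>", "<python><c#>"]

# Extract the tag names between angle brackets, one list per question
tag_lists = [re.findall(r"<(.*?)>", s) for s in raw_tags]

# Count every tag across all questions
tag_counts = Counter(tag for tags in tag_lists for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs, highest first
top_2 = tag_counts.most_common(2)

# Set of tag names, convenient for fast membership tests when filtering
top_2_set = {tag for tag, _ in top_2}

print(top_2)  # [('python', 3), ('c#', 2)]
```

A set is used for the final lookup because the filter step tests membership once per tag of every question.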

### Removing < and > from Tags

In [12]:
# Keep only the top 50 tags in the Tags column
import re

# Helper that filters the tags
def filter_tags(tags):
    return [tag for tag in re.findall(r'<(.*?)>', tags) if tag in top_50_list]

# Apply the filter to the "Tags" column
data['Tags'] = data['Tags'].apply(filter_tags)

# Drop tags longer than 25 characters
data['Tags'] = data['Tags'].apply(lambda tags: [tag for tag in tags if len(tag) <= 25])

# Keep only rows with at least two remaining tags (this also drops rows without tags)
data = data[data['Tags'].apply(len) > 1]
data

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
39489,49028212,Precompile asp.net views with ms build,<p>When I deploy asp.net application through v...,"[c#, asp.net, asp.net-mvc, visual-studio]",25,14467,0.0,1,2018-02-28 11:12:01,Precompile aspnet views build When deploy aspn...
10822,2875429,IUnityContainer.Resolve<T> throws error claimi...,<p>Yesterday I've implemented the code:</p>\n\...,"[c#, .net]",77,40728,0.0,5,2010-05-20 15:46:20,IUnityContainerResolve throws error claiming c...
4144,1235958,IPC performance: Named Pipe vs Socket,<p>Everyone seems to say named pipes are faste...,"[linux, performance]",169,140521,0.0,12,2009-08-05 21:52:24,IPC performance Named Pipe Socket Everyone see...
38695,30525184,Array vs Slice: accessing speed,<p>This question is about the speed of <em>acc...,"[arrays, performance]",25,10443,0.0,3,2015-05-29 08:49:35,Array Slice accessing speed This question spee...
29282,30386317,Babelify throws ParseError on import a module ...,<p>I'm working with <code>Babelify</code> and ...,"[javascript, node.js]",32,19613,0.0,3,2015-05-21 23:58:19,Babelify throws ParseError import module nodem...
...,...,...,...,...,...,...,...,...,...,...
5477,27828822,Can't understand this way to calculate the squ...,<p>I have found a function that calculates squ...,"[c, arrays]",135,9480,0.0,5,2015-01-07 21:18:20,Cant understand way calculate square number fo...
43235,28865222,Mapping Java boolean to Oracle Number column w...,<p>I have a property created like this in my m...,"[java, database]",23,32847,0.0,2,2015-03-04 21:21:35,Mapping Java boolean Oracle Number column JPA ...
33186,50951779,Angular 2+ wait for subscribe to finish to upd...,<p>I am having an issue with my variables bein...,"[javascript, angular]",29,106864,0.0,2,2018-06-20 15:23:12,Angular wait subscribe finish updateaccess var...
14324,32031677,Facebook Graph API GET request - Should contai...,<p>My iOS app uses Facebook's Graph API Reques...,"[ios, xcode, swift]",60,47952,0.0,5,2015-08-16 03:58:10,Facebook Graph API GET request Should contain ...


In [13]:
# Check the number of unique tags
len(data["Tags"].explode().unique())

50

### Removing tags that occur only once

In [14]:
# Count the number of occurrences of each class
tag_counts = data['Tags'].explode().value_counts()

# Identify classes with fewer than two occurrences
problematic_classes = tag_counts[tag_counts < 2].index

# Filter the dataset rows to exclude the problematic classes
data = data[~data['Tags'].apply(lambda x: any(tag in problematic_classes for tag in x))]
data

Unnamed: 0,Id,Title,Body,Tags,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate,Title_Body
39489,49028212,Precompile asp.net views with ms build,<p>When I deploy asp.net application through v...,"[c#, asp.net, asp.net-mvc, visual-studio]",25,14467,0.0,1,2018-02-28 11:12:01,Precompile aspnet views build When deploy aspn...
10822,2875429,IUnityContainer.Resolve<T> throws error claimi...,<p>Yesterday I've implemented the code:</p>\n\...,"[c#, .net]",77,40728,0.0,5,2010-05-20 15:46:20,IUnityContainerResolve throws error claiming c...
4144,1235958,IPC performance: Named Pipe vs Socket,<p>Everyone seems to say named pipes are faste...,"[linux, performance]",169,140521,0.0,12,2009-08-05 21:52:24,IPC performance Named Pipe Socket Everyone see...
38695,30525184,Array vs Slice: accessing speed,<p>This question is about the speed of <em>acc...,"[arrays, performance]",25,10443,0.0,3,2015-05-29 08:49:35,Array Slice accessing speed This question spee...
29282,30386317,Babelify throws ParseError on import a module ...,<p>I'm working with <code>Babelify</code> and ...,"[javascript, node.js]",32,19613,0.0,3,2015-05-21 23:58:19,Babelify throws ParseError import module nodem...
...,...,...,...,...,...,...,...,...,...,...
5477,27828822,Can't understand this way to calculate the squ...,<p>I have found a function that calculates squ...,"[c, arrays]",135,9480,0.0,5,2015-01-07 21:18:20,Cant understand way calculate square number fo...
43235,28865222,Mapping Java boolean to Oracle Number column w...,<p>I have a property created like this in my m...,"[java, database]",23,32847,0.0,2,2015-03-04 21:21:35,Mapping Java boolean Oracle Number column JPA ...
33186,50951779,Angular 2+ wait for subscribe to finish to upd...,<p>I am having an issue with my variables bein...,"[javascript, angular]",29,106864,0.0,2,2018-06-20 15:23:12,Angular wait subscribe finish updateaccess var...
14324,32031677,Facebook Graph API GET request - Should contai...,<p>My iOS app uses Facebook's Graph API Reques...,"[ios, xcode, swift]",60,47952,0.0,5,2015-08-16 03:58:10,Facebook Graph API GET request Should contain ...


# Unsupervised approach

### Train test split

In [35]:
# Imports: train/test split, LDA, TF-IDF vectorizer, grid search
from sklearn import model_selection
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Input text
X = data["Title_Body"]

# Split the data into train and test sets, 70 / 30
X_train, X_test = model_selection.train_test_split(X, test_size=0.3, random_state=42)

# Vectorization
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95,
    min_df=2,
    max_features=limite,
    stop_words="english")

# Fit tfidf_vectorizer on the training documents, then transform the test documents
tfidf_data_vectorized = tfidf_vectorizer.fit_transform(X_train)
tfidf_data_test_vectorized = tfidf_vectorizer.transform(X_test)

# Retrieve the vocabulary
vocabulaire = tfidf_vectorizer.vocabulary_

# LDA parameters; full grid: 'n_components': [10, 15, 20], 'learning_decay': [.5, .7, .9]
search_params = {'n_components': [10], 'learning_decay': [.9]} # Reduced grid to save computation time; restore the full grid later

# Initialize the LDA model
lda = LatentDirichletAllocation()

# Initialize GridSearchCV
model = GridSearchCV(lda, param_grid=search_params)

# Training
model.fit(tfidf_data_vectorized)

# Best model
best_lda_model = model.best_estimator_
best_lda_model
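
LDA is a generative model over word counts, so a common variant of the pipeline above fits it on a `CountVectorizer` matrix rather than on TF-IDF weights. The following is a minimal sketch under that assumption, on a made-up six-document corpus; the fixed `random_state`, the `cv=2` setting and the corpus are illustrative choices, not the notebook's configuration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up corpus (assumption): two rough themes, SQL and CSS
docs = [
    "select rows from sql table with query",
    "sql query join table column",
    "table column index sql database",
    "css div width height color",
    "html css button color style",
    "div button html width layout",
]

# Raw term counts instead of TF-IDF weights
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(docs)

# Fixed seed so that repeated runs give the same topics
lda = LatentDirichletAllocation(max_iter=10, random_state=42)

# GridSearchCV ranks candidates with lda.score (approximate log-likelihood)
# evaluated on the held-out fold of each split
search = GridSearchCV(lda, param_grid={"n_components": [2, 3]}, cv=2)
search.fit(counts)

best = search.best_estimator_
print(search.best_params_, best.components_.shape)
```

Setting `random_state` matters here because LDA initializes its topics randomly, so two unseeded runs can return different topics for the same data.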

In [36]:
# Model performance metrics

# Log-likelihood score: higher is better
print("Log Likelihood: ", best_lda_model.score(tfidf_data_vectorized))

# Perplexity measures the model's uncertainty: lower is better, as it suggests the model predicts the data better
print("Perplexity: ", best_lda_model.perplexity(tfidf_data_vectorized))

print(best_lda_model.get_params())

Log Likelihood:  -356520.86871039117
Perplexity:  27294.244995604375
{'batch_size': 128, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.9, 'learning_method': 'batch', 'learning_offset': 10.0, 'max_doc_update_iter': 100, 'max_iter': 10, 'mean_change_tol': 0.001, 'n_components': 10, 'n_jobs': None, 'perp_tol': 0.1, 'random_state': None, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 0}
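
The two metrics above are linked. As far as the current scikit-learn implementation goes, `perplexity(X)` is the exponential of the negative per-word log-likelihood bound, that is `exp(-score(X) / X.sum())`, which is why a higher log-likelihood and a lower perplexity point the same way. The sketch below checks this relation on a made-up corpus; the corpus and the seed are assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus (assumption)
docs = [
    "sql table query column",
    "sql join table rows",
    "css div width color",
    "html css button div",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Approximate log-likelihood (variational bound) of the data
log_likelihood = lda.score(X)

# Perplexity as reported by scikit-learn
perplexity = lda.perplexity(X)

# Same quantity rebuilt from the score: exp(-bound / total word weight)
perplexity_from_score = np.exp(-log_likelihood / X.sum())

print(perplexity, perplexity_from_score)
```

Because the denominator is the sum of the matrix entries, perplexity values computed on TF-IDF weights and on raw counts are not on the same scale and should not be compared directly.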


In [37]:
import pyLDAvis
import pyLDAvis.lda_model

pyLDAvis.enable_notebook()
panel = pyLDAvis.lda_model.prepare(best_lda_model, tfidf_data_vectorized, tfidf_vectorizer)
panel

In [30]:
# https://openclassrooms.com/fr/courses/4470541-analysez-vos-donnees-textuelles/4855011-modelisez-des-sujets-avec-des-methodes-non-supervisees

# Function that prints the highest-weighted words (terms) of each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}:".format(topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

# Print the 10 most representative words of each detected topic
display_topics(best_lda_model, tfidf_vectorizer.get_feature_names_out(), 10)

Topic 0:
using file error string code use public like class new
Topic 1:
nan sed dog whitespace accesscontrolalloworigin animal nextjs pandas browserify cors
Topic 2:
table column select sql pandas dataframe date rows query row
Topic 3:
div image button view ios color css width height html
Topic 4:
ptr mov eax rax directives nsnotificationcenter faviconico literals recognition reactscripts
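
The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` walks the sorted indices backwards, so it yields the positions of the largest weights first. A small sketch with made-up weights and terms:

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (assumption)
feature_names = ["sql", "table", "css", "div"]
topic = np.array([0.1, 0.5, 0.2, 0.9])
no_top_words = 2

# argsort gives indices in increasing weight order: [0, 2, 1, 3]
# the reversed slice keeps the last `no_top_words` of them, largest first
top_indices = topic.argsort()[:-no_top_words - 1:-1]

top_terms = [feature_names[i] for i in top_indices]
print(top_terms)  # ['div', 'table']
```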


In [31]:
# Store the vocabulary in a dictionary (each unique word mapped to its column index)
vocabulaire = tfidf_vectorizer.vocabulary_
vocabulaire

{'bootstrap': 1320,
 'throws': 12848,
 'uncaught': 13375,
 'error': 4172,
 'bootstraps': 1322,
 'javascript': 6600,
 'requires': 10512,
 'jquery': 6707,
 'trying': 13151,
 'use': 13591,
 'make': 7383,
 'interface': 6331,
 'program': 9693,
 'added': 160,
 'head': 5491,
 'tag': 12522,
 'thought': 12819,
 'launch': 6929,
 'web': 14009,
 'page': 8939,
 'browser': 1418,
 'reports': 10417,
 'tried': 13110,
 'using': 13670,
 'copies': 2663,
 'hosted': 5650,
 'cdns': 1690,
 'work': 14228,
 'anybody': 516,
 'point': 9359,
 'wrong': 14279,
 'mimic': 7696,
 'keyboard': 6815,
 'animation': 488,
 'ios': 6441,
 'add': 159,
 'button': 1511,
 'numeric': 8521,
 'like': 7061,
 'older': 8639,
 'version': 13829,
 'cgrect': 1759,
 'cgrectmake': 1760,
 'uiapplication': 13264,
 'sharedapplication': 11322,
 'windows': 14158,
 'lastobject': 6918,
 'cgpoint': 1756,
 'uiview': 13341,
 'floatvalue': 4766,
 'delay': 3252,
 'intvalue': 6398,
 'animations': 489,
 'completionnil': 2283,
 'stopped': 12021,
 'working':

In [32]:
# Terms listed in order of their vocabulary index
termes_ord = [terme for terme, indice in sorted(vocabulaire.items(), key=lambda x: x[1])]
termes_ord

['aaa',
 'aaaa',
 'aabb',
 'aac',
 'aactive',
 'aad',
 'aae',
 'aaf',
 'aapt',
 'aaron',
 'abab',
 'abandoned',
 'abc',
 'abcc',
 'abcd',
 'abcdef',
 'abcdefgh',
 'abcdefghijklmnop',
 'abe',
 'abi',
 'abilities',
 'ability',
 'able',
 'abnormally',
 'abort',
 'aborted',
 'aborting',
 'abortonerror',
 'abovementioned',
 'abrupt',
 'abruptly',
 'abs',
 'absence',
 'absolute',
 'absoluteimport',
 'absolutely',
 'abstract',
 'abstractbaseuser',
 'abstractcontrol',
 'abstractfalse',
 'abstraction',
 'absurd',
 'absurdly',
 'abuse',
 'academic',
 'acc',
 'accelerated',
 'acceleration',
 'accents',
 'accept',
 'acceptable',
 'acceptance',
 'accepted',
 'acceptencoding',
 'accepting',
 'acceptlanguage',
 'accepts',
 'access',
 'accesscontrolallowheaders',
 'accesscontrolallowmethods',
 'accesscontrolalloworigin',
 'accessed',
 'accesses',
 'accessibility',
 'accessible',
 'accessing',
 'accesslog',
 'accessor',
 'accessors',
 'accesstoken',
 'accesstokenexpiretimespan',
 'accident',
 'accident

In [33]:
# Indices of the terms sorted by increasing IDF (lowest IDF = term present in the most documents)
indices = np.argsort(tfidf_vectorizer.idf_)

# Terms appearing in the most documents (30 lowest IDF values)
most_frequent_terms = [term for term, index in tfidf_vectorizer.vocabulary_.items() if index in indices[:30]]

# Terms appearing in the fewest documents (30 highest IDF values)
least_frequent_terms = [term for term, index in tfidf_vectorizer.vocabulary_.items() if index in indices[-30:]]

print("Most frequent terms:", most_frequent_terms)
print()
print("Least frequent terms:", least_frequent_terms)

Most frequent terms: ['error', 'trying', 'use', 'make', 'tried', 'using', 'work', 'like', 'way', 'string', 'function', 'know', 'need', 'method', 'question', 'following', 'class', 'new', 'set', 'dont', 'data', 'file', 'return', 'code', 'problem', 'doesnt', 'application', 'ive', 'want', 'example']

Least frequent terms: ['locales', 'localdate', 'locationmanagermanager', 'zzz', 'locationid', 'locationreload', 'loaderresolve', 'loadergetmodulejob', 'locationmanagerdelegate', 'locating', 'logerror', 'locationpathname', 'localdatetime', 'localhostadmin', 'lockobj', 'locklockobj', 'lockguard', 'logdtag', 'localizable', 'locationresources', 'localfile', 'logback', 'logetag', 'locals', 'localstrategy', 'lockobject', 'localpath', 'loggerisdebugenabled', 'logincomponent', 'locs']


In [34]:
# Adapted from: https://www.geeksforgeeks.org/how-to-calculate-jaccard-similarity-in-python/

# List that stores the Jaccard score of each document
jaccard_scores = []

# Loop over each row of the data
for index, row in data.iterrows():
    # Convert the document's tags to a set
    true_tags = set(row['Tags'])

    # Topic weights of the document under the LDA model, shape (1, n_components)
    predicted_terms_with_scores = best_lda_model.transform(tfidf_vectorizer.transform([row['Title_Body']]))

    # Score threshold (e.g. 0.2, adjust as needed); defined here but not applied below
    threshold = 0.2
    predicted_terms_indices = predicted_terms_with_scores.argsort()[0][::-1]  # Topic indices sorted by decreasing weight
    # These topic indices (0 to n_components - 1) are used as vocabulary indices, so the
    # "predicted terms" are always the first vocabulary entries, hence the 0.0 average
    predicted_terms = set(tfidf_vectorizer.get_feature_names_out()[predicted_terms_indices[:10]])  # First 10 entries

    # Compute the intersection and the union
    intersection = true_tags.intersection(predicted_terms)
    union = true_tags.union(predicted_terms)

    # Compute the Jaccard score and append it to the list
    jaccard_score = float(len(intersection)) / float(len(union))
    jaccard_scores.append(jaccard_score)

# Compute the mean of the Jaccard scores
average_jaccard_score = sum(jaccard_scores) / len(jaccard_scores)
print(f"Average Jaccard score, unsupervised: {average_jaccard_score}")

Average Jaccard score, unsupervised: 0.0
