# PROJECT 5: Automatically categorize questions

**PROJECT PLAN**
1. Project title: PROJECT 5 - Automatically categorize questions


2. Loading libraries


3. Retrieve the data + train/test split
    - Saving the files as .csv:
        - X_train.csv
        - y_train.csv
        - X_test.csv
        - y_test.csv


4. Data cleaning
    - Features:
        - Remove HTML tags
        - Remove punctuation
        - Lowercasing and tokenization
        - Remove stopwords
    - Target:
        - Remove the "<>" delimiters


5. Feature engineering
    - Recoding into bigrams
    - Merging title, body + bigrams


6. Exploratory analysis
    - Univariate analyses
        - General description: post length, number of tags
        - Bag of words: most frequent expressions, features & target
            - Generated arrays:
                - X_train_bow
                - X_train_vocab_bow
                - X_train_dist_bow
                - y_train_bow
                - y_train_vocab_bow
                - y_train_dist_bow
                
                
        - TF-IDF: most frequent expressions, features & target
            - Generated arrays:
                - X_train_tfidf
                - X_train_vocab_tfidf
                - X_train_dist_tfidf
                - y_train_tfidf
                - y_train_vocab_tfidf
                - y_train_dist_tfidf
                  

    - Multivariate analysis
    **QUESTION: Can LDA be considered a multivariate analysis?**


    - Dimensionality reduction
    **QUESTION: Can we use word2vec?**
    
    
        

# Loading libraries

In [1]:
# Import the libraries
import joblib
from IPython.display import display, HTML
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
#nltk.download()  # Download text data sets, including stop words
from nltk.corpus import stopwords # Import the stop word list
import re

# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup 

# Libraries for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

In [2]:
#Disable warning for .loc
pd.options.mode.chained_assignment = None  # default='warn'

# Retrieving the data

## Loading the .csv files

In [3]:
# Load the .csv files
X_train = pd.read_csv('Data/X_train.csv', sep='\t')
X_test = pd.read_csv('Data/X_test.csv', sep='\t')
y_train = pd.read_csv('Data/y_train.csv', sep='\t')
y_test = pd.read_csv('Data/y_test.csv', sep='\t')

In [4]:
# Check the loaded files
print("The X_train dataset contains", X_train.shape[0], "observations and", X_train.shape[1], "features.")
print("The y_train vector contains", y_train.shape[0], "observations.")
print("The X_test dataset contains", X_test.shape[0], "observations and", X_test.shape[1], "features.")
print("The y_test vector contains", y_test.shape[0], "observations.")

The X_train dataset contains 9843 observations and 3 features.
The y_train vector contains 9842 observations.
The X_test dataset contains 9844 observations and 3 features.
The y_test vector contains 9843 observations.


## Loading the arrays

### Bag of words (TODO: is this useful?)

### TF-IDF

In [6]:
# Load the arrays 
X_train_tfidf = np.load('Data/X_train_tfidf.npy')
X_train_vocab_tfidf = np.load('Data/X_train_vocab_tfidf.npy')
X_train_dist_tfidf = np.load('Data/X_train_dist_tfidf.npy')
y_train_tfidf = np.load('Data/y_train_tfidf.npy')
y_train_vocab_tfidf = np.load('Data/y_train_vocab_tfidf.npy')
y_train_dist_tfidf = np.load('Data/y_train_dist_tfidf.npy')

In [24]:
X_train_title_tfidf = np.load('Data/X_train_title_tfidf.npy')
X_train_title_vocab_tfidf = np.load('Data/X_train_title_vocab_tfidf.npy')
X_train_title_dist_tfidf = np.load('Data/X_train_title_dist_tfidf.npy')

In [7]:
# Check the loaded arrays
X_train_tfidf.shape

(9843, 50000)

# Frequency-based tag modelling

## BOW frequencies

We will implement a method based solely on the frequencies of the expressions (words / bigrams) used in a post, and check whether the most frequent expressions appear in the tag vocabulary.

First, we will check whether one or more expressions that occur at least twice in a post match the tag vocabulary.

We will use the bag-of-words decomposition created in notebook 1, chapter 6.1.2:

In [None]:
#The tags vocabulary :
y_train_vocab_bow[:10]

In [None]:
#The features vocabulary : 
X_train_vocab_bow[:10]

In [None]:
#The BOW array :
X_train_bow[:10]

### Testing the functions on the first post

We will test our idea on the first post.

In [None]:
BOW_post1 = X_train_bow[0]
BOW_post1

In [None]:
#We print all expressions which are at least 2 times in the post :
for freq, word in zip(BOW_post1, X_train_vocab_bow):
    if freq >= 2:
        print (freq, word)

In [None]:
#We compare the frequent expression to the tag's vocab:

for freq, word in zip(BOW_post1, X_train_vocab_bow):
    if freq >= 2:
        if word in y_train_vocab_bow:
            print(word)

In [None]:
predicted_tags_vect = []

for freq, word in zip(BOW_post1, X_train_vocab_bow):
    if freq >= 2:
        if word in y_train_vocab_bow:
            predicted_tags_vect.append(word)

predicted_tags_vect

### Creating a function to apply to the whole dataset

Now we will create a function that returns the tags for each post:

In [None]:
def pred_tag_freq(BOW, vocabulary, list_of_tags):
    
    """Function which generates a list of tags, based on frequency of expression in a BOW object and 
    its comparison to predefined list of tags.
    
    Input :
    - BOW : a BOW array
    - vocabulary : list of BOW vocabulary
    - list_of_tags : a list of tags
    
    Output :
    - a list of predicted tags  
    
    """
    predicted_tags = []
    
    for vect in range(BOW.shape[0]): 
        
        predicted_tags_vect = [] 
        
        for freq, word in zip(BOW[vect], vocabulary):
            
            if freq >= 2:
                if word in list_of_tags:
                    predicted_tags_vect.append(word)
                    
        predicted_tags.append(predicted_tags_vect)
        
    return predicted_tags
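
The nested loops above can be slow on a large vocabulary, since `word in list_of_tags` is evaluated for every frequent expression of every post. A minimal vectorized sketch of the same rule (an expression occurring at least twice that is also a tag), assuming the BOW matrix is a dense NumPy array; the function name `pred_tag_freq_vectorized`, the `min_count` parameter and the toy data are hypothetical:

```python
import numpy as np

def pred_tag_freq_vectorized(bow, vocabulary, list_of_tags, min_count=2):
    """For each row of `bow`, return the vocabulary terms that occur at least
    `min_count` times and also belong to `list_of_tags`."""
    vocabulary = np.asarray(vocabulary)
    # Boolean mask over the vocabulary columns: True where the term is also a tag
    is_tag = np.isin(vocabulary, list(list_of_tags))
    # Boolean matrix: True where a tag-like term is frequent enough in a post
    hits = (np.asarray(bow) >= min_count) & is_tag
    return [vocabulary[row].tolist() for row in hits]

# Toy data (hypothetical, not taken from the project files)
toy_vocab = ["git", "branch", "file", "python"]
toy_tags = ["git", "python", "java"]
toy_bow = np.array([[2, 3, 0, 1],
                    [0, 0, 5, 2],
                    [1, 1, 1, 1]])

toy_pred = pred_tag_freq_vectorized(toy_bow, toy_vocab, toy_tags)
print(toy_pred)
```

The tag mask is computed once for the whole vocabulary, so the per-post work reduces to a comparison and a boolean indexing step.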

In [None]:
predicted_tags = pred_tag_freq(X_train_bow, X_train_vocab_bow, y_train_vocab_bow)

In [None]:
len(predicted_tags)

In [None]:
predicted_tags[:10]

Save the predicted tags:

In [None]:
#Save the predicted tags (ragged list of lists, hence dtype=object; reload with allow_pickle=True):
np.save('Data/predicted_tags', np.array(predicted_tags, dtype=object))

We will analyse the number of tags predicted by the method:

In [None]:
# Number of predicted tags per post
nbr_tags = DataFrame([len(tags) for tags in predicted_tags])

In [None]:
nbr_tags[0].value_counts()

Drawback of the method: some posts receive no tag at all (1875 posts), and some posts can receive a large number of tags, although this is rather rare. We will apply the same method with TF-IDF and keep the 3 highest-ranked tags according to the TF-IDF coefficient.

## TF-IDF frequencies

The idea is to use TF-IDF frequencies to control the number of predicted tags. This time, the method follows this procedure:

1. Compare all the expressions in the post with the tag vocabulary
2. Assign to each matching expression the relative TF-IDF weight of the corresponding tag
3. Return the 3 tags with the highest weight

In [147]:
y_train_dist_tfidf[:10]

array([1.29137083e+00, 2.03157824e+01, 1.77910765e+00, 1.23469745e+03,
       7.61972597e-01, 1.03923973e+01, 1.00032151e+02, 5.74763060e+00,
       1.16193175e+02, 2.80201636e+00])

### Testing the functions on the first post

In [148]:
#Extract the array of first post
TFIDF_post1 = X_train_tfidf[0]
TFIDF_post1

array([0., 0., 0., ..., 0., 0., 0.])

In [149]:
#We compare the expressions in the post to the tag's vocab:
for freq, word in zip(TFIDF_post1, X_train_vocab_tfidf):
    if freq > 0:
        if word in y_train_vocab_tfidf:
            print(word)

add
authentication
handle
permissions
plugins
restful-authentication
role
using


In [175]:
#We list the common expressions which are in both document and the tag's vocabulary:

liste_tags = []

for freq, word in zip(TFIDF_post1, X_train_vocab_tfidf):
    if freq > 0:
        if word in y_train_vocab_tfidf:
            liste_tags.append(word)

In [176]:
liste_tags

['add',
 'authentication',
 'handle',
 'permissions',
 'plugins',
 'restful-authentication',
 'role',
 'using']

In [177]:
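#We zip the list with tag's relative frequency.
#Caution: zip pairs by position, so the values printed below are the first 8 entries of
#y_train_dist_tfidf (compare with the output of y_train_dist_tfidf[:10] above),
#not the weights of these particular tags.

for freq, word in zip(y_train_dist_tfidf, liste_tags):
    print(freq, word)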

1.2913708286761647 add
20.315782399608562 authentication
1.779107648085159 handle
1234.6974528795652 permissions
0.7619725969000765 plugins
10.392397318668808 restful-authentication
100.03215112054504 role
5.747630599165471 using
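
Because `zip` pairs elements by position, the cell above attaches the first entries of `y_train_dist_tfidf` to the expressions of the post, whatever tags those entries belong to. A minimal sketch of a lookup by tag name instead, assuming that `y_train_vocab_tfidf` and `y_train_dist_tfidf` are aligned index by index (an assumption, not verified here); the function name `top_tags_for_post` and all toy values are hypothetical:

```python
import numpy as np

def top_tags_for_post(tfidf_row, vocabulary, tag_vocabulary, tag_weights, k=3):
    """Return up to `k` expressions of one post that are known tags,
    ranked by the weight of the tag itself."""
    # Map each tag to its own weight so the lookup is by name, not by position
    weight_of = dict(zip(tag_vocabulary, tag_weights))
    # Step 1: expressions present in the post that are also known tags
    candidates = [word for value, word in zip(tfidf_row, vocabulary)
                  if value > 0 and word in weight_of]
    # Step 2: attach the tag-level weight to each candidate
    scored = [(weight_of[word], word) for word in candidates]
    # Step 3: keep the k candidates with the highest weight
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

# Toy data (hypothetical)
toy_vocab = ["add", "git", "role", "python", "using"]
toy_row = np.array([0.1, 0.0, 0.4, 0.2, 0.3])
toy_tag_vocab = ["python", "role", "using", "git", "add"]
toy_tag_weights = [50.0, 7.0, 1.5, 90.0, 0.5]

toy_top = top_tags_for_post(toy_row, toy_vocab, toy_tag_vocab, toy_tag_weights)
print(toy_top)
```

In the toy data, `git` has the largest tag weight but does not occur in the post, so it is not returned.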


In [178]:
#Zip the tags contained in the post and tag's frequency
liste = zip(y_train_dist_tfidf, liste_tags)

In [180]:
#Converting to list
liste = list(liste)

In [181]:
#Check
liste

[(1.2913708286761647, 'add'),
 (20.315782399608562, 'authentication'),
 (1.779107648085159, 'handle'),
 (1234.6974528795652, 'permissions'),
 (0.7619725969000765, 'plugins'),
 (10.392397318668808, 'restful-authentication'),
 (100.03215112054504, 'role'),
 (5.747630599165471, 'using')]

In [185]:
#Sort the list by frequency
liste_sort = sorted(liste, key = lambda x: x[0])

In [186]:
#Check
liste_sort

[(0.7619725969000765, 'plugins'),
 (1.2913708286761647, 'add'),
 (1.779107648085159, 'handle'),
 (5.747630599165471, 'using'),
 (10.392397318668808, 'restful-authentication'),
 (20.315782399608562, 'authentication'),
 (100.03215112054504, 'role'),
 (1234.6974528795652, 'permissions')]

In [193]:
#Extract 3 most frequent tags 
tags_final = liste_sort[-3:]

In [194]:
tags_final

[(20.315782399608562, 'authentication'),
 (100.03215112054504, 'role'),
 (1234.6974528795652, 'permissions')]

In [195]:
#Extract the tag's name
tags = [x[1] for x in tags_final]

In [196]:
tags

['authentication', 'role', 'permissions']

### Applying the function to the whole dataset

Now we will create a function that returns the tags for each post:

In [198]:
# At first, we will test the function on a sample
test_sample = X_train_tfidf[:100]

In [199]:
test_sample.shape

(100, 50000)

In [200]:
def pred_tag_tfidf(tfidf_array, tfidf_vocabulary, list_of_tags):
    
    """Function generating a list of tags, based on the expressions present in a TF-IDF object and 
    their comparison to a predefined list of tags. 
    
    Intended behaviour: if the document contains more than 3 expressions in common with the list of tags, 
    the tags are sorted by TF-IDF frequency and only the 3 highest-ranked are predicted. If the document 
    contains 3 or fewer common expressions, all of them are considered predicted tags. 
    This version only collects the common expressions; the top-3 selection is not applied yet.
        
    Input :
    - tfidf_array : a TF-IDF array
    - tfidf_vocabulary : a TF-IDF vocabulary object
    - list_of_tags : a list of tags
    
    Output :
    - a list of predicted tags (one list per document)
    
    """
    predicted_tags = []
    
    for doc in range(tfidf_array.shape[0]): 
       
        # List the expressions that are in both the document and the tag vocabulary
        liste_tags = []

        for freq, word in zip(tfidf_array[doc], tfidf_vocabulary):
            if freq > 0:
                if word in list_of_tags:
                    liste_tags.append(word)
                    
        # Append inside the document loop so that every document gets its own list
        predicted_tags.append(liste_tags)
    
    return predicted_tags

In [201]:
# The output below was recorded while `predicted_tags.append(liste_tags)` sat outside the loop,
# so it shows a single list (the last document) instead of one list per document
pred_tag_tfidf(test_sample, X_train_vocab_tfidf, y_train_vocab_tfidf)

[['c#', 'file', 'fixed', 'fixed-width', 'width']]
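
The docstring of `pred_tag_tfidf` describes keeping only the 3 highest-ranked tags per document. One possible way to do this for a whole array is sketched below. It assumes that the ranking uses a per-tag weight vector aligned with the tag vocabulary (as `y_train_dist_tfidf` might be with `y_train_vocab_tfidf`); the function name `pred_top_tags`, its parameters and the toy data are hypothetical:

```python
import numpy as np

def pred_top_tags(tfidf_array, tfidf_vocabulary, tag_vocabulary, tag_weights, k=3):
    """For each document, return up to `k` expressions that are known tags,
    ordered from the highest to the lowest tag weight."""
    vocabulary = np.asarray(tfidf_vocabulary)
    weight_of = dict(zip(tag_vocabulary, tag_weights))
    # Weight of each vocabulary column; -inf marks columns that are not tags
    col_weights = np.array([weight_of.get(word, -np.inf) for word in vocabulary],
                           dtype=float)
    predicted = []
    for row in np.asarray(tfidf_array):
        # Expressions absent from the document are excluded with -inf as well
        scores = np.where(row > 0, col_weights, -np.inf)
        order = np.argsort(scores)[::-1][:k]
        predicted.append([str(vocabulary[i]) for i in order if np.isfinite(scores[i])])
    return predicted

# Toy data (hypothetical)
toy_vocab = ["c#", "file", "fixed", "loop"]
toy_tag_vocab = ["file", "c#", "fixed"]
toy_tag_weights = [3.0, 10.0, 1.0]
toy_tfidf = np.array([[0.2, 0.5, 0.1, 0.9],
                      [0.0, 0.0, 0.0, 0.7],
                      [0.0, 0.3, 0.0, 0.0]])

toy_pred = pred_top_tags(toy_tfidf, toy_vocab, toy_tag_vocab, toy_tag_weights)
print(toy_pred)
```

Documents with fewer than `k` matching expressions simply return all of them, and documents with none return an empty list.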

# Unsupervised modelling

## LDA

In [9]:
print ("Number of unique tags: %d" % len(y_train_vocab_tfidf))

Number of unique tags: 4998


We will run LDA to find the topics of the posts. Since there is a large number of unique tags (~5k), we will reformulate the existing tags using keywords characteristic of each topic.
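
As an illustration of what reformulating tags with topic keywords could look like, here is a minimal sketch on toy posts. It fits LDA on raw word counts (the model is usually described in terms of counts), assigns each post to its dominant topic, and proposes the top keywords of that topic as tags. The toy corpus, the choice of 2 topics and the variable names are all hypothetical:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
toy_posts = [
    "git branch commit push remote",
    "git merge branch commit",
    "android layout recyclerview fragment",
    "android layout view fragment",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_posts)
# Vocabulary ordered by column index
vocab = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

lda = LatentDirichletAllocation(n_components=2,
                                learning_method="batch",
                                random_state=0).fit(counts)

# Document-topic distribution: one row per post, each row sums to 1
doc_topics = lda.transform(counts)
dominant = doc_topics.argmax(axis=1)

# Top keywords of each topic, then of the dominant topic of each post
n_keywords = 3
topic_keywords = [vocab[topic.argsort()[:-n_keywords - 1:-1]].tolist()
                  for topic in lda.components_]
suggested_tags = [topic_keywords[t] for t in dominant]
print(suggested_tags)
```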

We will try to find an optimal number of topics, so that the topics are interpretable.
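
Interpretability is judged by eye in this notebook. As a complementary, purely numerical indication, the perplexity of several models can be compared. The sketch below does this on a toy corpus; the corpus, the candidate topic numbers and the use of `CountVectorizer` are assumptions made for illustration. In practice perplexity would preferably be computed on held-out posts, and a lower value does not guarantee more interpretable topics:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
toy_posts = [
    "git branch commit push",
    "git merge branch remote",
    "android layout fragment view",
    "android recyclerview layout adapter",
    "python string format list",
    "python list dictionary loop",
]
counts = CountVectorizer().fit_transform(toy_posts)

# Fit one model per candidate number of topics and record its perplexity
perplexities = {}
for n_topics in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    learning_method="batch",
                                    max_iter=10,
                                    random_state=0).fit(counts)
    perplexities[n_topics] = lda.perplexity(counts)

for n_topics, value in perplexities.items():
    print(n_topics, round(value, 2))
```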

### Document = title + body + bigrams

In [10]:
# First training of LDA : 20 topics
no_topics = 20

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=5,
                                learning_method='online', 
                                learning_offset=50., 
                                random_state=0).fit(X_train_tfidf)

In [11]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10

display_topics(lda, X_train_vocab_tfidf, no_top_words)

Topic 0
triangle layoutoptions vertical-line backgroundworker grep-string uber uber-jar xamarin header-httpclient hook-script
Topic 1
uicollectionview (1-1) expect( 0-179684827337041 0-17968483 17968483 0-0248644019011408 17968483-0 0248644019011408-0 179684827337041
Topic 2
install-boost b-k largest-element find-kth kth cffi backend-c c-cffi cffi-backend wformat
Topic 3
android android-layout layout recyclerview layout-width parent-android layout-height parent wrap-content +id
Topic 4
git branch commit repository master remote push commits github pull
Topic 5
key-( sprache int(15) int(15)-null engine-innodb doctorsoffice varchar(100)-null default-null 0-00 00-0
Topic 6
jekyll specific-index osgi 04-21 21-13 46-40 13-46 linux-i686 i686-2 build-lib
Topic 7
blah-blah emp salary option-selected customerrors maze linkedhashmap $name aggregate-function server-reporting
Topic 8
get-mime 901) err(-901) err( system-err( cr-lf w-system (-core file-crlf content-uri
Topic 9
aop httpwebrequest dif

We can see that some topics are hard to interpret. We will train the model again, this time with 10 topics. We will also reduce the number of key expressions displayed and look only at the first 5 words.
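
The number of displayed expressions is controlled by the slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics`. A small sketch of how that slice behaves, with hypothetical topic weights and feature names:

```python
import numpy as np

# Hypothetical weights of one topic over a 5-word vocabulary
topic = np.array([0.1, 0.7, 0.05, 0.4, 0.2])
feature_names = ["file", "git", "loop", "branch", "string"]
no_top_words = 3

# argsort() orders indices by increasing weight; the slice walks backwards
# from the last index and stops after no_top_words elements
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_idx, top_words)
```

The result lists the indices of the `no_top_words` largest weights, from largest to smallest.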

In [12]:
# Second LDA training : 10 topics
no_topics = 10

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=5,
                                learning_method='online', 
                                learning_offset=50., 
                                random_state=0).fit(X_train_tfidf)

In [13]:
# Display 5 top words

no_top_words = 5
display_topics(lda, X_train_vocab_tfidf, no_top_words)

Topic 0
file like string code get
Topic 1
width-200px place-answer chatroom underline multicore
Topic 2
equivalent-java's get-mime $pid sudo-ip netns
Topic 3
var2 metaclasses pclass com-mysql matching-line
Topic 4
aggregation-composition aggregation composition difference-aggregation (get-post
Topic 5
father mediumint bigint largest-element father-father
Topic 6
weakhashmap jekyll mm4-s2 mm4 mm2
Topic 7
linkedhashmap if(one &&-two else-if(one session-timeout
Topic 8
difference-math size-byte (bytes debug1 catalina
Topic 9
android layout android-layout fragment recyclerview


In [20]:
# Third LDA training : 5 topics
no_topics = 5

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=10,
                                learning_method='online', 
                                learning_offset=100., 
                                random_state=0).fit(X_train_tfidf)

In [21]:
# Display 10 top words

no_top_words = 10
display_topics(lda, X_train_vocab_tfidf, no_top_words)

Topic 0
aggregation-composition aggregation metaclasses composition great-great difference-aggregation difference-set father 'sdfd' agile-development
Topic 1
weakhashmap faq place-answer eri debug1 c++-faq would-place meta-started chatroom-faq (note-meant
Topic 2
multicore-programming multicore aspectj airport longer-text put( spring-aop 12px org-gradle '28-aug
Topic 3
interpreter-compiler attr-accessor '2010-10 '2010 147 drawnum page-text 153 jekyll $uuid
Topic 4
file like string code get git difference android class function


In [22]:
# 4th LDA training : 3 topics
no_topics = 3

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=10,
                                learning_method='online', 
                                learning_offset=100., 
                                random_state=0).fit(X_train_tfidf)

In [23]:
# Display 10 top words

no_top_words = 10
display_topics(lda, X_train_vocab_tfidf, no_top_words)

Topic 0
file like string code get git difference android class function
Topic 1
weakhashmap faq place-answer interpreter-compiler father 'sdfd' '2010 '2010-10 drawnum eri
Topic 2
aggregation-composition aggregation composition metaclasses difference-aggregation multicore-programming 147 bitcode 'group' 'sub'


Even after changing the parameters, we did not find topics that are easy to interpret and reformulate. We will try changing the form of the "document" fed to the model: this time we will use only the title of the text, without bigrams.
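
The title-only arrays are loaded from disk in this notebook. For reference, a title-only TF-IDF matrix without bigrams could be built along the following lines. This is only a sketch under assumptions: the toy titles, the English stop-word list and the variable names are hypothetical and may differ from the preprocessing actually used in notebook 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles (hypothetical)
toy_titles = [
    "How to undo a git commit",
    "Difference between git merge and rebase",
    "Python string formatting",
]

# ngram_range=(1, 1) keeps single words only, i.e. no bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
title_tfidf = vectorizer.fit_transform(toy_titles)

# Vocabulary ordered by column index
title_vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(title_tfidf.shape, title_vocab)
```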

### Document = title

In [25]:
# First training of LDA : 20 topics
no_topics = 20

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=5,
                                learning_method='online', 
                                learning_offset=50., 
                                random_state=0).fit(X_train_title_tfidf)

In [26]:
no_top_words = 10

display_topics(lda, X_train_title_vocab_tfidf, no_top_words)

Topic 0
single multi double iphone core animation removing advantage border audio
Topic 1
file string get android text javascript check python html command
Topic 2
rail controller model django exception loop react template wpf global
Topic 3
image android build cs debug background loading performance uiview gradle
Topic 4
many laravel stage composer uiimageview relationship ctags poco oop unstaged
Topic 5
j jquery node request http element post practice remove save
Topic 6
ruby package hash pip start learning install available (or gcc
Topic 7
difference what's framework xcode test copy thread entity swift read
Topic 8
git branch list commit array remote repository merge string file
Topic 9
call feature hidden service log useful dot intellij python keyboard
Topic 10
github directory disable tree repo entire define basic pdf proxy
Topic 11
function c c++ object javascript studio variable python mean library
Topic 12
date time query std statement join format result return fixed
Topic 13
p

In [27]:
# Second LDA training : 10 topics
no_topics = 10

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=5,
                                learning_method='online', 
                                learning_offset=50., 
                                random_state=0).fit(X_train_title_tfidf)

In [28]:
no_top_words = 10

display_topics(lda, X_train_title_vocab_tfidf, no_top_words)

Topic 0
jquery element android cs image get html event url set
Topic 1
string python mysql text function return character php bash memory
Topic 2
sql table swift row column server index panda eclipse parameter
Topic 3
database postgresql mac o fragment sort collection date byte map
Topic 4
code programming template source language color context callback exit back
Topic 5
git file branch repository request window commit j remote install
Topic 6
docker exception container stack wpf global uiview regex define place
Topic 7
java object file javascript difference net array python class variable
Topic 8
difference std linq active purpose utf what's unicode jpa storyboard
Topic 9
vim remove case dictionary space string binary terminal scala tree


In [29]:
# Third LDA training : 5 topics
no_topics = 5

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, 
                                max_iter=5,
                                learning_method='online', 
                                learning_offset=50., 
                                random_state=0).fit(X_train_title_tfidf)

In [30]:
no_top_words = 10

display_topics(lda, X_train_title_vocab_tfidf, no_top_words)

Topic 0
string file python javascript java get function array object list
Topic 1
git branch repository file commit remote practice difference local date
Topic 2
difference output language algorithm exception c++ template word thread equivalent
Topic 3
j node request http number column test io panda post
Topic 4
code sql table server mysql database android vim mean query


LDA did not allow us to identify precise topics.

## Clustering

### k-means

# Supervised modelling

## KNN + word2vec

In [None]:
# TODO: to be tested
from gensim.models import Word2Vec
model = Word2Vec.load("300features_40minwords_10context")
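
A minimal sketch of how KNN could be combined with word vectors, assuming each post is represented by the average of the vectors of its words and, for simplicity, carries a single tag (real posts have several). The 2-dimensional `toy_vectors` dictionary stands in for a trained word2vec model and is hypothetical, as are the helper `average_vector` and the toy posts:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-dimensional word vectors standing in for a word2vec model
toy_vectors = {
    "git": np.array([1.0, 0.0]),
    "branch": np.array([0.9, 0.1]),
    "commit": np.array([0.8, 0.0]),
    "android": np.array([0.0, 1.0]),
    "layout": np.array([0.1, 0.9]),
    "fragment": np.array([0.0, 0.8]),
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# Toy training posts (already tokenized) with a single tag each
train_posts = [["git", "branch"], ["git", "commit"],
               ["android", "layout"], ["android", "fragment"]]
train_tags = ["git", "git", "android", "android"]

X_toy = np.vstack([average_vector(p, toy_vectors, 2) for p in train_posts])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy, train_tags)

# Unknown words are ignored when averaging
new_post = ["branch", "commit", "unknownword"]
predicted_tag = knn.predict([average_vector(new_post, toy_vectors, 2)])[0]
print(predicted_tag)
```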