# PROJECT 5: Automatically categorize questions

**PROJECT PLAN**
1. Project title: PROJECT 5 - Automatically categorize questions


2. Loading the libraries


3. Retrieving the data + splitting it into train and test sets
    - Saving the files as .csv:
        - X_train.csv
        - y_train.csv
        - X_test.csv
        - y_test.csv


4. Data cleaning
    - Features:
        - Remove HTML tags
        - Remove punctuation
        - Lowercase and tokenize
        - Remove stopwords
    - Target:
        - Remove the "<>" delimiters around tags


5. Feature engineering
    - Recoding into bigrams
    - Merging title, body + bigrams


6. Exploratory analysis
    - Univariate analyses
        - General description: post length, number of tags
        - Bag of words: most frequent expressions, features & target
            - Generated arrays:
                - X_train_bow
                - X_train_vocab_bow
                - X_train_dist_bow
                - y_train_bow
                - y_train_vocab_bow
                - y_train_dist_bow


        - TF-IDF: most frequent expressions, features & target
             - Generated arrays:
                  - X_train_tfidf
                  - X_train_vocab_tfidf
                  - X_train_dist_tfidf
                  - y_train_tfidf
                  - y_train_vocab_tfidf
                  - y_train_dist_tfidf


    - Multivariate analysis
    **QUESTION: Can LDA be considered a multivariate analysis?**


    - Dimensionality reduction
    **QUESTION: Could word2vec be used?**
    
    
        

# Loading the libraries

In [2]:
# Import the libraries
import joblib
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
#nltk.download()  # Download text data sets, including stop words
from nltk.corpus import stopwords # Import the stop word list
import re

# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup 

# Libraries for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

In [3]:
#Disable warning for .loc
pd.options.mode.chained_assignment = None  # default='warn'

# Data retrieval

In [None]:
# TODO: load the .csv files and the arrays

# Frequency-based tag modeling

## BOW frequencies

We will implement a method based solely on the frequencies of the expressions (words / bigrams) used in a post, and check whether the most frequent expressions appear in the tag vocabulary.

First, we check whether one or more expressions that occur at least twice in a given post match the tag vocabulary.

We use the bag-of-words decomposition created in notebook 1, section 6.1.2:

In [None]:
#The tags vocabulary :
y_train_vocab_bow[:10]

In [None]:
#The features vocabulary : 
X_train_vocab_bow[:10]

In [None]:
#The BOW array :
X_train_bow[:10]
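
To picture what these three objects contain without notebook 1 at hand, here is a toy reconstruction. It is not the notebook's code (which presumably relied on the `CountVectorizer` imported above); it only assumes that `X_train_bow` holds one row of counts per post, `X_train_vocab_bow` the column labels, and `X_train_dist_bow` the column totals. The helper name `toy_bow` and the sample posts are hypothetical.

```python
from collections import Counter

def toy_bow(tokenized_posts):
    """Build a count matrix, its vocabulary and the per-expression totals."""
    # Sorted vocabulary: one column per distinct expression
    vocab = sorted({tok for post in tokenized_posts for tok in post})
    # One row of counts per post, aligned with the vocabulary
    rows = []
    for post in tokenized_posts:
        counts = Counter(post)
        rows.append([counts[tok] for tok in vocab])
    # Total occurrences of each expression over all posts
    dist = [sum(col) for col in zip(*rows)]
    return rows, vocab, dist

posts = [
    ["python", "list", "sort", "python"],
    ["sql", "query", "sql", "table"],
]
bow, vocab, dist = toy_bow(posts)
print(vocab)  # column labels
print(bow)    # one row of counts per post
print(dist)   # column totals
```

Each row of `bow` lines up with `vocab`, which is why the cells below can iterate over `zip(row, vocabulary)`.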

### Testing the functions on the first post

We test the idea on the first post.

In [None]:
BOW_post1 = X_train_bow[0]
BOW_post1

In [None]:
# Print every expression that occurs at least twice in the post:
for freq, word in zip(BOW_post1, X_train_vocab_bow):
    if freq >= 2:
        print(freq, word)

In [None]:
#We compare the frequent expression to the tag's vocab:

for freq, word in zip(BOW_post1, X_train_vocab_bow):
    if freq >= 2:
        if word in y_train_vocab_bow:
            print(word)

In [None]:
predicted_tags_vect = []

for freq, word in zip(BOW_post1, X_train_vocab_bow):
    if freq >= 2:
        if word in y_train_vocab_bow:
            predicted_tags_vect.append(word)

predicted_tags_vect

### Creating a function to apply to the whole dataset

We now create a function that returns the tags for each post:

In [None]:
def pred_tag_freq(BOW, vocabulary, list_of_tags):
    
    """Function which generates a list of tags, based on frequency of expression in a BOW object and 
    its comparison to predefined list of tags.
    
    Input :
    - BOW : a BOW array
    - vocabulary : list of BOW vocabulary
    - list_of_tags : a list of tags
    
    Output :
    - a list of predicted tags  
    
    """
    predicted_tags = []
    
    for vect in range(BOW.shape[0]): 
        
        predicted_tags_vect = [] 
        
        for freq, word in zip(BOW[vect], vocabulary):
            
            if freq >= 2:
                if word in list_of_tags:
                    predicted_tags_vect.append(word)
                    
        predicted_tags.append(predicted_tags_vect)
        
    return predicted_tags
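
`pred_tag_freq` evaluates `word in list_of_tags` once per vocabulary entry and per post, which is a linear scan when `list_of_tags` is a list or an array. A possible speed-up, sketched below under the assumption that the vocabulary order is the same for every row, is to locate the tag columns once and inspect only those. The name `pred_tag_freq_fast` and the `min_count` parameter are illustrative and not part of the notebook.

```python
def pred_tag_freq_fast(bow, vocabulary, list_of_tags, min_count=2):
    """Same idea as pred_tag_freq, but the tag lookup is done only once."""
    tag_set = set(list_of_tags)
    # Columns of the BOW matrix whose expression is also a tag
    tag_columns = [(j, word) for j, word in enumerate(vocabulary) if word in tag_set]
    predicted_tags = []
    for row in bow:
        # Keep the tag expressions that occur at least min_count times in this post
        predicted_tags.append([word for j, word in tag_columns if row[j] >= min_count])
    return predicted_tags

# Toy data: 3 posts, 4 expressions, 2 of which are tags
vocabulary = ["c#", "file", "want", "python"]
bow = [
    [2, 0, 3, 0],
    [0, 1, 1, 1],
    [1, 2, 0, 4],
]
result = pred_tag_freq_fast(bow, vocabulary, ["python", "c#", "java"])
print(result)
```

The sketch indexes rows with `row[j]`, so it works for nested lists and dense 2-D arrays; a sparse matrix would need to be handled differently.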

In [None]:
predicted_tags = pred_tag_freq(X_train_bow, X_train_vocab_bow, y_train_vocab_bow)

In [None]:
len(predicted_tags)

In [None]:
predicted_tags[:10]

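Save the predicted tags:

In [None]:
# Save the predicted tags. The per-post lists have different lengths, so they are
# wrapped in an object array (recent NumPy versions refuse to build an array
# from ragged nested lists implicitly).
np.save('Data/predicted_tags', np.array(predicted_tags, dtype=object))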
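
Because the per-post tag lists have different lengths, a plain-text format can be a simpler alternative to `np.save`. The sketch below swaps in JSON from the standard library; the file name `predicted_tags.json` and the temporary directory are placeholders, not paths used by this notebook.

```python
import json
import os
import tempfile

toy_predicted_tags = [["c#", "file"], [], ["python"]]

# A temporary directory keeps the example self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predicted_tags.json")
    # Write the ragged list of lists as JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_predicted_tags, f)
    # Read it back to check the round trip
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == toy_predicted_tags)
```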

We now analyze the number of tags predicted by this method:

In [None]:
nbr_tags = []

for tag in range(len(predicted_tags)):
    length = len(predicted_tags[tag])
    nbr_tags.append(length)
    
nbr_tags = DataFrame(nbr_tags)

In [None]:
nbr_tags[0].value_counts()

Drawback of the method: some posts receive no tag at all (1875 posts), and some posts can receive a large number of tags, although this is rather rare. We will apply the same method with TF-IDF and select the 3 most frequent tags based on the TF-IDF coefficient.
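
The two drawbacks named above, untagged posts and over-tagged posts, can be read directly from the lengths of the predicted lists. A minimal sketch on toy data (the function name `tag_count_summary` and the `max_tags` threshold are hypothetical):

```python
from collections import Counter

def tag_count_summary(predicted_tags, max_tags=3):
    """Return the distribution of tags per post and two shares of interest."""
    lengths = [len(tags) for tags in predicted_tags]
    distribution = Counter(lengths)
    n_posts = len(lengths)
    if n_posts == 0:
        return distribution, 0.0, 0.0
    # Share of posts that received no tag at all
    share_untagged = distribution[0] / n_posts
    # Share of posts that received more than max_tags tags
    share_over = sum(1 for n in lengths if n > max_tags) / n_posts
    return distribution, share_untagged, share_over

toy = [["c#"], [], ["python", "list"], [], ["a", "b", "c", "d"]]
distribution, share_untagged, share_over = tag_count_summary(toy)
print(sorted(distribution.items()))
print(share_untagged, share_over)
```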

## TF-IDF frequencies

The idea is to use TF-IDF frequencies to control the number of tags to predict. This time, the method follows this procedure:

1. Compare every expression in the post with the tag vocabulary
2. Assign to each matching expression the relative TF-IDF weight of the corresponding tag
3. Return the 3 most frequent tags
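
The three steps above can be sketched for a single post as follows. This is only an illustration under stated assumptions: the post is given as a list of expressions, and the tag weights are held in a dictionary keyed by tag (the notebook stores them in the parallel arrays `y_train_vocab_tfidf` and `y_train_dist_tfidf`). The function name `top_tags_for_post` and the toy weights are hypothetical.

```python
import heapq

def top_tags_for_post(post_expressions, tag_weights, n_tags=3):
    """Keep the n_tags expressions of the post with the highest tag weight."""
    # Step 1: expressions of the post that are also tags
    candidates = {expr for expr in post_expressions if expr in tag_weights}
    # Step 2: attach the weight of each matching tag
    weighted = [(tag_weights[expr], expr) for expr in candidates]
    # Step 3: keep the n_tags heaviest ones, heaviest first
    return [expr for _, expr in heapq.nlargest(n_tags, weighted)]

tag_weights = {"python": 50.0, "list": 12.5, "sorting": 8.0, "file": 3.0}
post = ["python", "sort", "list", "file", "sorting", "want"]
best = top_tags_for_post(post, tag_weights)
print(best)
```

Using `heapq.nlargest` avoids sorting the full candidate list when only a few tags are needed; a post with fewer than `n_tags` matches simply returns all of them.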

In [147]:
y_train_dist_tfidf[:10]

array([1.29137083e+00, 2.03157824e+01, 1.77910765e+00, 1.23469745e+03,
       7.61972597e-01, 1.03923973e+01, 1.00032151e+02, 5.74763060e+00,
       1.16193175e+02, 2.80201636e+00])

### Testing the functions on the first post

In [148]:
#Extract the array of first post
TFIDF_post1 = X_train_tfidf[0]
TFIDF_post1

array([0., 0., 0., ..., 0., 0., 0.])

In [149]:
#We compare the expressions in the post to the tag's vocab:
for freq, word in zip(TFIDF_post1, X_train_vocab_tfidf):
    if freq > 0:
        if word in y_train_vocab_tfidf:
            print(word)

add
authentication
handle
permissions
plugins
restful-authentication
role
using


In [175]:
#We list the common expressions which are in both document and the tag's vocabulary:

liste_tags = []

for freq, word in zip(TFIDF_post1, X_train_vocab_tfidf):
    if freq > 0:
        if word in y_train_vocab_tfidf:
            liste_tags.append(word)

In [176]:
liste_tags

['add',
 'authentication',
 'handle',
 'permissions',
 'plugins',
 'restful-authentication',
 'role',
 'using']

In [177]:
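# We zip the list with the tags' relative frequencies.
# Caveat: zip pairs items by position, so each tag is matched with one of the first
# len(liste_tags) values of y_train_dist_tfidf; this is the tag's own weight only
# if the two sequences happen to be aligned.

for freq, word in zip(y_train_dist_tfidf, liste_tags):
    print(freq, word)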

1.2913708286761647 add
20.315782399608562 authentication
1.779107648085159 handle
1234.6974528795652 permissions
0.7619725969000765 plugins
10.392397318668808 restful-authentication
100.03215112054504 role
5.747630599165471 using
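
Note that `zip` pairs its arguments by position. If `y_train_dist_tfidf` is aligned with `y_train_vocab_tfidf` (an assumption, since this notebook does not show how the two were built), a tag's own weight has to be looked up through the vocabulary rather than taken from the start of the array. A toy illustration with made-up values:

```python
# Toy stand-ins for y_train_vocab_tfidf / y_train_dist_tfidf (made-up values)
tag_vocab = ["ajax", "c#", "python", "sql"]
tag_dist = [4.0, 30.0, 25.0, 12.0]

# Tags found in one post
post_tags = ["python", "sql"]

# Positional pairing: uses the first two weights, whatever the tags are
positional = list(zip(tag_dist, post_tags))

# Lookup by name: each tag gets the weight stored at its own index
weight_of = dict(zip(tag_vocab, tag_dist))
by_name = [(weight_of[tag], tag) for tag in post_tags]

print(positional)
print(by_name)
```

The two results differ whenever the post's tags are not the first entries of the tag vocabulary, which changes the ranking used to pick the top 3.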


In [178]:
#Zip the tags contained in the post and tag's frequency
liste = zip(y_train_dist_tfidf, liste_tags)

In [180]:
#Converting to list
liste = list(liste)

In [181]:
#Check
liste

[(1.2913708286761647, 'add'),
 (20.315782399608562, 'authentication'),
 (1.779107648085159, 'handle'),
 (1234.6974528795652, 'permissions'),
 (0.7619725969000765, 'plugins'),
 (10.392397318668808, 'restful-authentication'),
 (100.03215112054504, 'role'),
 (5.747630599165471, 'using')]

In [185]:
#Sort the list by frequency
liste_sort = sorted(liste, key = lambda x: x[0])

In [186]:
#Check
liste_sort

[(0.7619725969000765, 'plugins'),
 (1.2913708286761647, 'add'),
 (1.779107648085159, 'handle'),
 (5.747630599165471, 'using'),
 (10.392397318668808, 'restful-authentication'),
 (20.315782399608562, 'authentication'),
 (100.03215112054504, 'role'),
 (1234.6974528795652, 'permissions')]

In [193]:
#Extract 3 most frequent tags 
tags_final = liste_sort[-3:]

In [194]:
tags_final

[(20.315782399608562, 'authentication'),
 (100.03215112054504, 'role'),
 (1234.6974528795652, 'permissions')]

In [195]:
#Extract the tag's name
tags = [x[1] for x in tags_final]

In [196]:
tags

['authentication', 'role', 'permissions']

### Applying the function to the whole dataset

We now create a function that returns the tags for each post:

In [198]:
# At first, we will test the function on a sample
test_sample = X_train_tfidf[:100]

In [199]:
test_sample.shape

(100, 50000)

In [200]:
def pred_tag_tfidf(tfidf_array, tfidf_vocabulary, list_of_tags):
    
    """Function generating a list of tags, based on the frequency of expressions in a TF-IDF object and 
    their comparison to a predefined list of tags. 
    
    If the document shares 3 or more expressions with the list of tags, the tags are sorted by 
    TF-IDF frequency and only the 3 most common ones are the predicted tags. If the document shares 
    fewer than 3 expressions, all of them are considered predicted tags. 
        
    Input :
    - tfidf_array : a TF-IDF array
    - tfidf_vocabulary : a TF-IDF vocabulary object
    - list_of_tags : a list of tags
    
    Output :
    - a list of predicted tags  
    
    """
    predicted_tags = []
    
    for doc in range(tfidf_array.shape[0]): 
       
        #We list the common expressions which are in both document and the tag's vocabulary:   
    
        liste_tags = []

        for freq, word in zip(tfidf_array[doc], tfidf_vocabulary):
            if freq > 0:
                if word in list_of_tags:
                    liste_tags.append(word)
                    
        
    
    predicted_tags.append(liste_tags)
    
    return predicted_tags

In [201]:
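# TODO: check the previous function, same problem. Indentation? `predicted_tags.append(liste_tags)`
# runs after the loop, so only the last document's tags are returned.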
pred_tag_tfidf(test_sample, X_train_vocab_tfidf, y_train_vocab_tfidf)

[['c#', 'file', 'fixed', 'fixed-width', 'width']]
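
As written, `pred_tag_tfidf` returns a single list for the 100 documents of `test_sample`, because `predicted_tags.append(liste_tags)` runs once after the loop instead of once per document. A possible corrected version is sketched below. It also applies the top-3 selection described in the docstring, ranking by the document's own TF-IDF values, which is one reading of "sorted by the TF-IDF frequency" and not necessarily the intended one. The name `pred_tag_tfidf_fixed` and the `n_tags` parameter are hypothetical.

```python
def pred_tag_tfidf_fixed(tfidf_array, tfidf_vocabulary, list_of_tags, n_tags=3):
    """Return, for each document, at most n_tags tags ranked by TF-IDF value."""
    tag_set = set(list_of_tags)
    predicted_tags = []
    for row in tfidf_array:
        # Expressions present in the document that also belong to the tag list
        candidates = [(freq, word) for freq, word in zip(row, tfidf_vocabulary)
                      if freq > 0 and word in tag_set]
        # Highest TF-IDF first, then keep the first n_tags
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        # The append is inside the loop: one list per document
        predicted_tags.append([word for _, word in candidates[:n_tags]])
    return predicted_tags

# Toy data (made-up values): 2 documents, 5 expressions
vocabulary = ["c#", "file", "fixed", "want", "width"]
tfidf = [
    [0.5, 0.2, 0.4, 0.9, 0.3],
    [0.0, 0.7, 0.0, 0.1, 0.0],
]
tags = ["c#", "file", "fixed", "width"]
fixed_result = pred_tag_tfidf_fixed(tfidf, vocabulary, tags)
print(fixed_result)
```

The result has one entry per input row, and documents with fewer than `n_tags` matches keep all of them.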

# Unsupervised modeling

## LDA

In [98]:
from sklearn.decomposition import LatentDirichletAllocation

In [99]:
no_topics = 20

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5,
                                learning_method='online', learning_offset=50., random_state=0).fit(train_data_features)

In [121]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10

display_topics(lda, vocab, no_top_words)

Topic 0
file files server service web using data application command directory
Topic 1
net web page asp asp-net system mvc button assembly 2-0
Topic 2
visual studio visual-studio 2008 ruby lib 8 rails c usr
Topic 3
control end value controls sub vb binding option wpf grid
Topic 4
c date thread 00 10 time 2008 +-+ 11 datetime
Topic 5
class public object string new return method int type value
Topic 6
way like want get using list something function need one
Topic 7
input delphi end begin frame '-' quotes oriented points font-color
Topic 8
print 1) 2) values constant printing (1 3) compare 45
Topic 9
javascript asp form page jquery script id type server ajax
Topic 10
div html image text css style svn width font color
Topic 11
memory search performance index large process size time services much
Topic 12
user python url request http site response # page get
Topic 13
use like code using one know way application project need
Topic 14
sql table database id query data select name sql-server se
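
The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` can look cryptic: `argsort` returns indices in ascending order of weight, and the negative-step slice walks that result backwards from the end, stopping after `no_top_words` items. A small check on made-up weights (not values from the fitted model):

```python
import numpy as np

# Made-up topic-word weights for a 5-word vocabulary
topic = np.array([0.1, 2.5, 0.7, 3.2, 1.0])
feature_names = ["file", "sql", "web", "table", "query"]
no_top_words = 3

# Indices of the no_top_words largest weights, largest first
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

print(top_idx.tolist())
print(top_words)
```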

## Clustering

### k-means

# Supervised modeling

## KNN + word2vec

In [None]:
# TODO: to be tested
from gensim.models import Word2Vec

model = Word2Vec.load("300features_40minwords_10context")
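
One common way to combine word2vec with KNN, offered here only as a possible direction since this section is still to be tested, is to represent each post by the average of its word vectors and to borrow the tags of the closest training posts. The sketch uses a hand-made dictionary of 2-dimensional vectors in place of a trained model; `average_vector`, `cosine`, `nearest_tags` and all values are hypothetical.

```python
import math

def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens known to the model (None if none is known)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_tags(query_vec, train_vecs, train_tags, k=1):
    """Collect the tags of the k training posts most similar to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]), reverse=True)
    tags = []
    for i in ranked[:k]:
        tags.extend(t for t in train_tags[i] if t not in tags)
    return tags

# Toy word vectors standing in for a trained word2vec model
word_vectors = {"select": [1.0, 0.0], "table": [0.9, 0.1],
                "div": [0.0, 1.0], "css": [0.1, 0.9]}
train_posts = [["select", "table"], ["div", "css"]]
train_tags = [["sql"], ["html", "css"]]
train_vecs = [average_vector(p, word_vectors) for p in train_posts]

# Unknown words are ignored when averaging
query = average_vector(["table", "unknown"], word_vectors)
neighbours = nearest_tags(query, train_vecs, train_tags)
print(neighbours)
```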