# Notebook: Supervised Approach

## Setup

Libraries and settings used throughout the notebook.

In [1]:
import re
import time
import pickle
import numpy as np
import pandas as pd
from IPython.display import display, HTML

import spacy
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.matutils import corpus2csc
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.utils import simple_preprocess

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

Load the downloaded files.

In [3]:
n_files = 4
list_files = [pd.read_csv(f'data/QueryResults-{n+1}.csv', index_col='Id') 
              for n in range(n_files)]

data = pd.concat(list_files, axis=0).drop_duplicates()
data.head()

| Id | Body | Title | Tags |
| --- | --- | --- | --- |
| 1771099 | `<p>One of the features of my project is to all...` | Choosing a Path for Python File Access | `<python><path>` |
| 1771101 | `<p>Is there any open source software to map po...` | Open source motion-capture software | `<simulation><tracking><3d-modelling>` |
| 1771102 | `<p>As the title says, how does one change the ...` | Changing Emacs Forward-Word Behaviour | `<emacs><emacs23>` |
| 1771104 | `<p>When loading all unit tests in a package, t...` | JUnit java.lang.OutOfMemoryError when running ... | `<java><unit-testing><junit><out-of-memory>` |
| 1771117 | `<p>For a specific example, consider <code>atoi...` | Why doesn't C++ reimplement C standard functio... | `<c++><c><stl>` |


### Formatting the target variable

For each question, split the tags assigned in the `Tags` column into a list.

In [4]:
list_tags = data.Tags.apply(lambda tags: tags[1:-1].split('><'))
list_tags.head()

Id
1771099                                [python, path]
1771101          [simulation, tracking, 3d-modelling]
1771102                              [emacs, emacs23]
1771104    [java, unit-testing, junit, out-of-memory]
1771117                                 [c++, c, stl]
Name: Tags, dtype: object

Count the number of occurrences of each tag across all questions.

In [5]:
# Count the occurrences of each tag
all_tags = {}
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, row in list_tags.items():
    for tag in row:
        all_tags[tag] = all_tags.get(tag, 0) + 1
print("Number of tags:", len(all_tags))

# Sort in descending order of frequency
tags_sorted = sorted(all_tags.items(), 
                     key=lambda x: x[1], reverse=True)

Number of tags: 23406
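
The counting loop above can also be expressed with `collections.Counter` from the standard library. The following is a minimal sketch on three hypothetical tag strings (`raw_tags` and `split_tags` are illustrative names, not part of the notebook); it parses the `<a><b>` format and ranks the tags by frequency.

```python
from collections import Counter

# Hypothetical sample in the same "<tag1><tag2>" format as the Tags column
raw_tags = ["<python><path>", "<python><django>", "<c++><c><stl>"]

def split_tags(tags):
    """Strip the outer '<' and '>' and split on the '><' separators."""
    return tags[1:-1].split("><")

# Flatten all tag lists and count occurrences in one pass
tag_counts = Counter(tag for tags in raw_tags for tag in split_tags(tags))

# most_common(k) returns the k (tag, count) pairs with the highest counts
top_tags = tag_counts.most_common(1)
print(top_tags)
```

`Counter.most_common` would replace both the manual dictionary update and the explicit `sorted` call.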


Let us display the *n* most frequent tags to help us choose the classes of the target variable.

In [6]:
n = 30

print("Most frequent tags:")
for tag, count in tags_sorted[:n]:
    print(f"{tag:15} {count: 5}")

Most frequent tags:
c#               20869
java             17939
javascript       14458
python           12726
c++              10908
android          9887
php              9680
.net             8817
jquery           7321
html             5915
ios              5233
c                5066
asp.net          4911
css              4881
sql              4811
iphone           4420
mysql            3971
objective-c      3763
ruby-on-rails    3569
sql-server       3275
ruby             3147
asp.net-mvc      2661
linux            2593
wpf              2541
r                2479
windows          2460
django           2457
arrays           2318
regex            2137
performance      2126


Let us apply the filter and keep only the *n* most frequent tags. Let us also drop the rows that have none of these tags, since they would have no target value.

In [7]:
selected_tags = [tag for tag, count in tags_sorted[:n]]

copy_tags = list_tags.copy()
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, row in list_tags.items():
    copy_tags[index] = [tag for tag in row if tag in selected_tags]
    
# Keep only the questions that still have at least one selected tag
mask = (copy_tags.apply(len) > 0)
data['List_tags'] = copy_tags #.apply(lambda l: ', '.join(l))
data['Document'] = data.Title + ' ' + data.Body

corpus = data.loc[mask, ['Document', 'List_tags']]
corpus.sample(15)

| Id | Document | List_tags |
| --- | --- | --- |
| 41286318 | `Is it safe to still use Java 6 and Eclipse Ind...` | [java] |
| 27430640 | `Why can't you declare a variable inside the ex...` | [c++] |
| 1782442 | `Free Code-to-Flowchart/UML tool for C# code <p...` | [c#] |
| 2506476 | `Finding the string length of a integer in .NET...` | [.net, performance] |
| 35896324 | `Source and Behavioral compatibility <p>I was l...` | [java] |
| 10071304 | `What does [ N ... M ] mean in C aggregate init...` | [c, linux] |
| 473088 | `Do you have any good examples of "architecture...` | [java] |
| 2166961 | `Determining the current foreground application...` | [android] |
| 1833554 | `How does the qr/STRING/ operator in Perl decid...` | [regex] |
| 3570889 | `Get first element of a sparse JavaScript array...` | [javascript, jquery, arrays] |
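
The filtering step can be illustrated on a toy example. This is a minimal sketch in plain Python, assuming hypothetical question ids and tag lists (`tag_lists` and `selected` are illustrative names); it keeps only the selected tags and then drops the questions left without any target.

```python
# Hypothetical tag lists keyed by question id (illustrative values only)
tag_lists = {
    101: ["python", "path"],
    102: ["emacs", "emacs23"],
    103: ["c++", "c", "stl"],
}
# A set makes each membership test O(1) instead of scanning a list
selected = {"python", "c++", "c"}

# Keep only the selected tags for each question
filtered = {qid: [t for t in tags if t in selected]
            for qid, tags in tag_lists.items()}

# Drop the questions left without any target tag
kept = {qid: tags for qid, tags in filtered.items() if tags}
print(kept)
```

Storing the selected tags in a set rather than a list is a small design choice that may matter when the loop runs over hundreds of thousands of questions.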


Let us display the number of rows of the resulting dataframe and compare it with that of the initial one.

In [8]:
print("Number of observations before filtering: ", data.shape[0])
print("Number of observations after filtering: ", corpus.shape[0])

Number of observations before filtering:  200000
Number of observations after filtering:  142457


In [9]:
corpus.drop('List_tags', axis=1).to_csv('my_data.csv')

By keeping only the 30 most popular tags, we have therefore lost about 29% of the data (57,543 of 200,000 questions).
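
As a quick check of this proportion, the arithmetic can be reproduced from the two row counts printed by cell 8 (the variable names below are illustrative):

```python
n_before = 200_000   # rows before filtering (output of cell 8)
n_after = 142_457    # rows after filtering (output of cell 8)

n_lost = n_before - n_after
share_lost = n_lost / n_before
print(f"{n_lost} questions dropped ({share_lost:.1%})")
```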

## Preprocessing

Following the same path as in the unsupervised part, let us briefly apply the normalization steps to the corpus before splitting it into training and test sets.

In [10]:
def clean_text(docs):
    """Strip code blocks and HTML tags, then tokenize each document."""
    return (docs
            # Drop <code>...</code> blocks, including multi-line ones
            .apply(lambda x: re.sub(r'<code>(.|\n)*?</code>', '', x))
            # Drop the remaining HTML tags
            .apply(lambda x: re.sub(r'<[^<]+?>', '', x))
            # Lowercase and tokenize, keeping one-character tokens and accents
            .apply(lambda x: simple_preprocess(x, min_len=1, deacc=False))
           )

def norm_text(docs, stopwords, nlp, banned_postags, bigrammer):
    """Remove stop words, lemmatize, and merge frequent bigrams."""
    def remove_stopwords(doc, stopwords):
        return [word for word in doc if word not in stopwords]
    
    def lemmatization(doc, nlp, banned_postags):
        # Re-join the tokens so that spaCy can tag them in context
        doc = nlp(" ".join(doc))
        return [token.lemma_ for token in doc if token.pos_ not in banned_postags]
    
    def make_bigrams(doc, bigrammer):
        return bigrammer[doc]
    
    return (docs
            .apply(remove_stopwords, stopwords=stopwords)
            .apply(lemmatization, nlp=nlp, banned_postags=banned_postags)
            .apply(make_bigrams, bigrammer=bigrammer)
           )
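
To make the effect of the two regular expressions in `clean_text` concrete, here is a minimal sketch using only the standard library. The helper `strip_html` and the `sample` string are hypothetical; `re.DOTALL` is used as an equivalent way of letting the pattern span several lines.

```python
import re

def strip_html(text):
    """Remove <code> blocks first, then any remaining HTML tag."""
    # Non-greedy match with DOTALL so that a code block may span several lines
    text = re.sub(r"<code>.*?</code>", "", text, flags=re.DOTALL)
    # Remove every remaining tag such as <p> or </a>
    return re.sub(r"<[^<]+?>", "", text)

sample = "<p>Use <code>open(path)</code> to read a file.</p>"
cleaned = strip_html(sample)
print(repr(cleaned))
```

The code snippet is discarded entirely, not just its tags, which leaves a double space behind; the tokenization step that follows makes this harmless.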

In [15]:
corpus_clean = clean_text(corpus.Document)

stop_words = stopwords.words('english')
# The 'en' shortcut is not available in spaCy 3; load the model by its package name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
bigrammer = Phraser(Phrases(corpus_clean, threshold=10)) 
banned_postags = ['PUNCT', 'DET', 'PRON', 'CONJ', 'ADV', 'INTJ']

with open('processing_utils.pkl', 'wb') as f:
    pickle.dump([stop_words, bigrammer, banned_postags], f)

corpus_norm = norm_text(corpus_clean, stop_words, nlp, banned_postags, bigrammer)

Now let us split the cleaned corpus into a training set and a test set.

In [16]:
docs_train, docs_test, targets_train, targets_test = train_test_split(corpus_norm, corpus.List_tags, 
                                                                      train_size=0.8, test_size=0.2, random_state=0)

Let us build the dictionary on the training set and progressively remove superfluous vocabulary.

In [17]:
dictionary = Dictionary(docs_train)
print("Dictionary size: ", len(dictionary))

Dictionary size:  91957


In [18]:
dictionary.filter_extremes(no_below=5)
print("Dictionary size: ", len(dictionary))

Dictionary size:  19686
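
With `no_below=5`, gensim's `filter_extremes` keeps only the tokens that appear in at least five documents; its default `no_above=0.5` also drops the tokens present in more than half of the documents. The following is a simplified sketch of this document-frequency filter in plain Python, not gensim's implementation; `filter_vocabulary` and the toy `docs` are hypothetical.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=2, no_above=0.5):
    """Return the tokens whose document frequency lies within the bounds."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {token for token, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

docs = [
    ["file", "path", "python"],
    ["file", "python", "django"],
    ["file", "emacs"],
    ["file", "path", "java"],
]
vocab = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(sorted(vocab))
```

In this toy corpus, "file" is removed because it occurs in every document, while "django", "emacs" and "java" are removed because they occur only once.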


Let us apply the *bag-of-words* representation to the training and test sets, using only the dictionary built on the training set.

In [19]:
n_terms = len(dictionary)

train_corpus = docs_train.apply(dictionary.doc2bow)
train_tdf = corpus2csc(train_corpus, num_terms=n_terms).transpose()

test_corpus = docs_test.apply(dictionary.doc2bow)
test_tdf = corpus2csc(test_corpus, num_terms=n_terms).transpose()
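
Because the vocabulary is fixed by the training set, tokens that appear only in the test set are simply ignored. The sketch below mimics this behaviour in plain Python under simplified assumptions; `token2id`, `to_bow` and `to_dense` are hypothetical stand-ins for the dictionary, `doc2bow` and `corpus2csc`.

```python
from collections import Counter

# Hypothetical token -> column index mapping learned on the training data
token2id = {"file": 0, "path": 1, "python": 2}

def to_bow(doc, token2id):
    """Count known tokens; tokens absent from the vocabulary are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def to_dense(bow, n_terms):
    """Expand (index, count) pairs into a fixed-length row vector."""
    row = [0] * n_terms
    for idx, count in bow:
        row[idx] = count
    return row

test_doc = ["python", "path", "python", "kotlin"]  # "kotlin" is unseen
bow = to_bow(test_doc, token2id)
row = to_dense(bow, n_terms=len(token2id))
print(bow, row)
```

Passing the number of terms explicitly, as the notebook does with `num_terms=n_terms`, keeps the training and test matrices at the same width even if some columns are empty in one of them.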

Let us use a binary encoding for the target variable instead of a list of tags. Since we selected 30 tags, we expect each observation to be represented by a vector of length 30.

In [20]:
binarizer = MultiLabelBinarizer()

y_train = binarizer.fit_transform(targets_train)
y_test = binarizer.transform(targets_test)

with open('binarizer.pkl', 'wb') as f:
    pickle.dump(binarizer, f)
    
y_train.shape, y_test.shape

((113965, 30), (28492, 30))
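
The binary encoding can be sketched in a few lines of plain Python. This is an illustration of the idea rather than scikit-learn's code; `fit_classes`, `binarize` and the toy labels are hypothetical.

```python
def fit_classes(label_lists):
    """Collect the distinct labels and sort them to fix the column order."""
    return sorted({label for labels in label_lists for label in labels})

def binarize(label_lists, classes):
    """Turn each label list into a 0/1 indicator vector over `classes`."""
    return [[int(c in labels) for c in classes] for labels in label_lists]

train_labels = [["python", "django"], ["c#"], ["c#", ".net"]]
classes = fit_classes(train_labels)
y = binarize(train_labels, classes)
print(classes)
print(y)
```

Each column corresponds to one tag, so a question with several tags has several ones in its row.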

## Modeling

For this supervised modeling phase, we will look for the best hyperparameters with a grid search (`GridSearchCV`) for the following models:
- Multinomial Naive Bayes (one-vs-rest)
- Random Forest
- Linear SVM (one-vs-rest)

We will use the overall classification rate (*accuracy*) as the metric to optimize. With multi-label targets, scikit-learn computes accuracy as subset accuracy: a prediction counts as correct only if the full set of tags matches exactly, which makes it a strict measure.
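
To show how strict an exact-match criterion is for multi-label targets, the following sketch compares it with a per-label accuracy on hypothetical indicator vectors. Both functions are written in plain Python for illustration and are not taken from the notebook.

```python
def subset_accuracy(y_true, y_pred):
    """Share of samples whose full label vector is predicted exactly."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def per_label_accuracy(y_true, y_pred):
    """Share of individual 0/1 entries that are predicted correctly."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    return sum(t == p for t, p in pairs) / len(pairs)

# Hypothetical indicator vectors for 3 questions and 4 tags
y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

A single missing or extra tag makes the whole row count as wrong under the exact-match criterion, so scores on this scale tend to look low even when most individual tags are right.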


Finally, to complete the text preprocessing, we use a tf-idf transformer. As with the dictionary, it is fitted on the training set only; because it is the first step of each pipeline, the fitted transformer is then applied automatically to the test set.
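
As an illustration of what the tf-idf step computes with the default settings shown in the outputs below (`smooth_idf=True`, `norm='l2'`), here is a plain-Python sketch: each count is weighted by $\ln\frac{1+n}{1+\mathrm{df}} + 1$ and each row is then scaled to unit length. The function `tfidf_rows` and the toy `counts` matrix are hypothetical.

```python
import math

def tfidf_rows(counts):
    """Apply smoothed idf weighting and L2 normalization to count rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency of each term (number of rows with a non-zero count)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in counts]
    # L2-normalize each row so that document length does not dominate
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

counts = [[2, 1, 0],
          [1, 0, 1],
          [1, 0, 0]]
tfidf = tfidf_rows(counts)
for row in tfidf:
    print([round(v, 3) for v in row])
```

In this sketch the document frequencies come from the matrix passed in; in the pipelines below they are learned during `fit` on the training data and reused unchanged at prediction time.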

### Naive Bayes

In [21]:
hyperparameters = {'nb__estimator__alpha': [1.0, 3e-1, 1e-1, 3e-2, 1e-2]}

estimator_nb = Pipeline([('tfidf', TfidfTransformer()), 
                         ('nb', OneVsRestClassifier(MultinomialNB()))
                        ])
grid_nb = GridSearchCV(estimator_nb, hyperparameters, scoring='accuracy',
                        cv=5, iid=True, verbose=1, n_jobs=-1)

grid_nb.fit(train_tdf, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:   18.2s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)), ('nb', OneVsRestClassifier(estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
          n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'nb__estimator__alpha': [1.0, 0.3, 0.1, 0.03, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [23]:
print("Best Naive Bayes score on test set: ", round(grid_nb.score(test_tdf, y_test), 3))

Best Naive Bayes score on test set:  0.282


### Random Forest

In [24]:
hyperparameters = {'rf__n_estimators': [10, 30, 50],
                   'rf__criterion': ['gini', 'entropy'],
                  }

estimator_rf = Pipeline([('tfidf', TfidfTransformer()), 
                         ('rf', RandomForestClassifier())
                        ])
grid_rf = GridSearchCV(estimator_rf, hyperparameters, scoring='accuracy',
                        cv=5, iid=True, verbose=1, n_jobs=-1)

grid_rf.fit(train_tdf, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 25.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impuri...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'rf__n_estimators': [10, 30, 50], 'rf__criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [25]:
print("Best Random Forest score on test set: ", round(grid_rf.score(test_tdf, y_test), 3))

Best Random Forest score on test set:  0.319


### Linear SVM

In [26]:
hyperparameters = {'svm__estimator__C': [1, 10, 30]}

estimator_svm = Pipeline([('tfidf', TfidfTransformer()), 
                          ('svm', OneVsRestClassifier(LinearSVC(max_iter=10000)))
                         ])
grid_svm = GridSearchCV(estimator_svm, hyperparameters, scoring='accuracy',
                        cv=5, iid=True, verbose=1, n_jobs=-1)

grid_svm.fit(train_tdf, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  4.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)), ('svm', OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=10000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'svm__estimator__C': [1, 10, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [27]:
print("Best SVM score on test set: ", round(grid_svm.score(test_tdf, y_test), 3))

Best SVM score on test set:  0.491


Let us summarize the test-set scores of the best models found:

| Model | Accuracy |
| --- | --- |
| Naive Bayes | 0.282 |
| Random Forest | 0.319 |
| Linear SVM | 0.491 |

We select the best overall model based on its test-set score. We also export the dictionary, which our API will need.

In [28]:
optimal_model = grid_svm.best_estimator_

with open('model_supervised.pkl', 'wb') as f:
    pickle.dump(optimal_model, f)
    
with open('dico_supervised.pkl', 'wb') as f:
    pickle.dump(dictionary, f)

## Interpretation

To illustrate the performance of the selected model concretely, let us predict the tags of the first ten rows of our test set and compare them with their true tags.

In [45]:
for n_ex in range(10):
    my_test = corpus2csc(test_corpus.iloc[n_ex:n_ex+1], num_terms=n_terms).transpose()
    series = corpus.loc[docs_test.reset_index().iloc[n_ex, 0]]
    tags_supervised = binarizer.inverse_transform(optimal_model.predict(my_test))[0]
    
    print(f'Test #{n_ex+1}')
    print('—'*10)
    display(HTML(series.Document))
    print('—'*50)
    print('\tTarget tags:', ', '.join(series.List_tags))
    print('\tPredicted tags:', ', '.join(tags_supervised))
    print('—'*50, '\n'*4)

Test #1
——————————


——————————————————————————————————————————————————
	Target tags: .net
	Predicted tags: .net, c#
—————————————————————————————————————————————————— 




Test #2
——————————


——————————————————————————————————————————————————
	Target tags: iphone
	Predicted tags: iphone
—————————————————————————————————————————————————— 




Test #3
——————————


——————————————————————————————————————————————————
	Target tags: c#, asp.net-mvc
	Predicted tags: 
—————————————————————————————————————————————————— 




Test #4
——————————


——————————————————————————————————————————————————
	Target tags: asp.net-mvc
	Predicted tags: asp.net-mvc
—————————————————————————————————————————————————— 




Test #5
——————————


——————————————————————————————————————————————————
	Target tags: .net
	Predicted tags: .net, c#
—————————————————————————————————————————————————— 




Test #6
——————————


——————————————————————————————————————————————————
	Target tags: c++, linux
	Predicted tags: c++
—————————————————————————————————————————————————— 




Test #7
——————————


——————————————————————————————————————————————————
	Target tags: iphone, objective-c
	Predicted tags: 
—————————————————————————————————————————————————— 




Test #8
——————————


——————————————————————————————————————————————————
	Target tags: c#, .net
	Predicted tags: c#
—————————————————————————————————————————————————— 




Test #9
——————————


——————————————————————————————————————————————————
	Target tags: django
	Predicted tags: django
—————————————————————————————————————————————————— 




Test #10
——————————


——————————————————————————————————————————————————
	Target tags: php
	Predicted tags: php
—————————————————————————————————————————————————— 
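
The ten examples above can be summarized numerically. The sketch below copies the target and predicted tags from these outputs and counts exact matches, along with a micro-averaged precision and recall; these two extra measures are added here for illustration only and were not computed in the notebook.

```python
# (target tags, predicted tags) copied from the ten test outputs above
examples = [
    ({".net"}, {".net", "c#"}),
    ({"iphone"}, {"iphone"}),
    ({"c#", "asp.net-mvc"}, set()),
    ({"asp.net-mvc"}, {"asp.net-mvc"}),
    ({".net"}, {".net", "c#"}),
    ({"c++", "linux"}, {"c++"}),
    ({"iphone", "objective-c"}, set()),
    ({"c#", ".net"}, {"c#"}),
    ({"django"}, {"django"}),
    ({"php"}, {"php"}),
]

# Exact matches: the predicted set equals the target set
exact = sum(target == pred for target, pred in examples)
# Correctly predicted tags, pooled over all examples
true_pos = sum(len(target & pred) for target, pred in examples)
n_pred = sum(len(pred) for _, pred in examples)
n_target = sum(len(target) for target, _ in examples)

print(f"exact matches: {exact}/{len(examples)}")
print(f"micro precision: {true_pos / n_pred:.2f}")
print(f"micro recall: {true_pos / n_target:.2f}")
```

On this small sample, most predicted tags are correct even when the full tag set is not matched, which suggests that the exact-match score understates how useful the individual predictions are. Ten rows are too few to generalize from.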






### Summary
The advantage of this method lies in extracting tags that evoke general topics, unlike the finer details picked up by the unsupervised method. However, for questions that are too short, or whose content does not make its category explicit, we risk missing some of the appropriate tags, or even getting no tag at all (as in tests 3 and 7 above).