# [Workshops: Artificial Intelligence Technologies](https://github.com/wikistat/AI-Frameworks)

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# Natural Language Processing (NLP): Cdiscount Product Categorization

This is a simplified version of the challenge proposed by Cdiscount and published on the [datascience.net](https://www.datascience.net/fr/challenge) website. The training data are available on request from Cdiscount, but the solutions for the challenge's test sample are not, and will not be, made public. A test sample is therefore built for the purposes of this tutorial. The task is to predict a product's category from its description (*text mining*). Only the main category (first level, 47 classes) is predicted, instead of the three levels required in the challenge. The goal here is rather to compare the performance of methods and technologies as a function of the training set size, and to illustrate the preprocessing of text data on a complex example.

The full dataset (15M products) allows a full-scale test of how the preparation (*munging*), vectorization (hashing, TF-IDF) and learning phases **scale with data volume**, depending on the technology used.

A summary of the results obtained is given in [Besse et al. 2016](https://hal.archives-ouvertes.fr/hal-01350099) (section 5).

## Part 1-3: Statistical learning models

In notebook 2, we created 2x7 feature matrices corresponding to the same training and validation samples of Cdiscount product description text. These matrices were built with the following methods.

1. `Count_Vectorizer`. `No hashing`.
2. `Count_Vectorizer`. `Hashing = 300`.
3. `TFIDF_vectorizer`. `No hashing`. 
4. `TFIDF_vectorizer`. `Hashing = 300`.
5. `Word2Vec`. `CBOW`
6. `Word2Vec`. `Skip-Gram`
7. `Word2Vec`. `Pre-trained`
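
To make the first four variants concrete, here is a minimal sketch on toy descriptions. It is not the code of notebook 2: the example sentences, the variable names and the tiny hash size (8 instead of 300) are assumptions chosen for illustration, and `FeatureHasher` is used as one possible way to hash tokens into a fixed number of columns.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy product descriptions (hypothetical, not taken from the Cdiscount data).
docs = [
    "coque silicone telephone noir",
    "coque rigide telephone rouge",
    "livre cuisine recettes faciles",
]

# 1. Plain counts without hashing: one column per distinct token.
X_count = CountVectorizer().fit_transform(docs)

# 3. TF-IDF without hashing: same columns, reweighted and L2-normalized rows.
X_tfidf = TfidfVectorizer().fit_transform(docs)

# 2. Hashed counts: tokens are hashed into a fixed number of columns,
#    so the width no longer depends on the vocabulary size.
n_hash = 8  # illustrative value; the list above uses 300
hasher = FeatureHasher(n_features=n_hash, input_type="string", alternate_sign=False)
X_count_hashed = hasher.transform([doc.split() for doc in docs])

# 4. Hashed TF-IDF: apply the TF-IDF weighting to the hashed counts.
X_tfidf_hashed = TfidfTransformer().fit_transform(X_count_hashed)

print(X_count.shape, X_tfidf.shape, X_count_hashed.shape, X_tfidf_hashed.shape)
```

Without hashing, the number of columns grows with the vocabulary, which is why these matrices are kept sparse. With hashing, the width is fixed in advance at the cost of possible collisions between tokens.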

We now study the performance of *machine learning* algorithms (`Logistic regression`, `Random forests`, `Multilayer perceptron`) on these different features.

## Libraries

In [1]:
# Import the libraries used in this notebook
import time
import numpy as np
import pandas as pd
import scipy as sc
import scipy.sparse  # makes sc.sparse available explicitly

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

DATA_DIR = "data/features"

## Loading the data

Loading the response variables

In [2]:
Y_train = pd.read_csv("data/cdiscount_train_subset.csv").fillna("")["Categorie1"]
Y_valid = pd.read_csv("data/cdiscount_valid.csv").fillna("")["Categorie1"]

Build a dictionary containing the paths of the files in which the feature matrices are stored.

In [None]:
features_path_dic = {}

# Sparse bag-of-words features: [name, number of hash buckets, vectorizer].
parameters = [["count_no_hashing", None, "count"],
              ["count_300", 300, "count"],
              ["tfidf_no_hashing", None, "tfidf"],
              ["tfidf_300", 300, "tfidf"]]
for name, nb_hash, vectorizer in parameters:
    x_train_path = DATA_DIR + "/vec_train_nb_hash_" + str(nb_hash) + "_vectorizer_" + str(vectorizer) + ".npz"
    x_valid_path = DATA_DIR + "/vec_valid_nb_hash_" + str(nb_hash) + "_vectorizer_" + str(vectorizer) + ".npz"
    dic = {"x_train_path": x_train_path, "x_valid_path": x_valid_path, "load": "npz"}
    features_path_dic.update({name: dic})

# Dense Word2Vec embeddings: [name, model type].
parametersw2v = [["word2vec_cbow", "cbow"],
                 ["word2vec_sg", "sg"],
                 ["word2vec_online", "online"]]
for name, mtype in parametersw2v:
    x_train_path = DATA_DIR + "/embedded_train_" + mtype + ".npy"
    x_valid_path = DATA_DIR + "/embedded_valid_" + mtype + ".npy"
    dic = {"x_train_path": x_train_path, "x_valid_path": x_valid_path, "load": "npy"}
    features_path_dic.update({name: dic})
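
Each entry of this dictionary records a pair of paths and a `load` flag (`"npz"` for sparse matrices, `"npy"` for dense arrays). A small helper can turn one entry into the two matrices. The sketch below is an illustration only: `load_features` is a hypothetical name, and the temporary files stand in for the real feature files.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def load_features(entry):
    """Load the (train, valid) matrices described by one dictionary entry."""
    # "npz" entries hold scipy sparse matrices, anything else is a dense array.
    loader = sparse.load_npz if entry["load"] == "npz" else np.load
    return loader(entry["x_train_path"]), loader(entry["x_valid_path"])


# Demonstration on small temporary files (hypothetical stand-ins for the real features).
tmp_dir = tempfile.mkdtemp()
sparse_entry = {"x_train_path": os.path.join(tmp_dir, "train.npz"),
                "x_valid_path": os.path.join(tmp_dir, "valid.npz"),
                "load": "npz"}
dense_entry = {"x_train_path": os.path.join(tmp_dir, "train.npy"),
               "x_valid_path": os.path.join(tmp_dir, "valid.npy"),
               "load": "npy"}

sparse.save_npz(sparse_entry["x_train_path"], sparse.csr_matrix(np.eye(4)))
sparse.save_npz(sparse_entry["x_valid_path"], sparse.csr_matrix(np.eye(2)))
np.save(dense_entry["x_train_path"], np.zeros((4, 3)))
np.save(dense_entry["x_valid_path"], np.zeros((2, 3)))

for entry in (sparse_entry, dense_entry):
    X_tr, X_va = load_features(entry)
    print(entry["load"], type(X_tr).__name__, X_tr.shape, X_va.shape)
```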

# Logistic Regression

In [None]:
metadata_list_lr = []

param_grid = {"C" : [10,1,0.1]}
#param_grid = {"C" : [1]}

for name, dic in features_path_dic.items():

    x_train_path = dic["x_train_path"]
    x_valid_path = dic["x_valid_path"]
    load = dic["load"]

    # Sparse matrices are stored as .npz files, dense embeddings as .npy files.
    print("Load features : " + name)
    if load == "npz":
        X_train = sc.sparse.load_npz(x_train_path)
        X_valid = sc.sparse.load_npz(x_valid_path)
    else:
        X_train = np.load(x_train_path)
        X_valid = np.load(x_valid_path)

    # Cross-validated grid search over C; the elapsed time covers all the fits.
    print("start Learning :" + name)
    ts = time.time()
    gs = GridSearchCV(LogisticRegression(), param_grid=param_grid, verbose=15)
    gs.fit(X_train, Y_train.values)
    te = time.time()
    t_learning = te - ts

    # Accuracy of the refitted best model on the training and validation samples.
    print("start prediction :" + name)
    ts = time.time()
    score_train = gs.score(X_train, Y_train)
    score_valid = gs.score(X_valid, Y_valid)
    te = time.time()
    t_predict = te - ts

    metadata = {"name": name, "learning_time": t_learning, "predict_time": t_predict, "score_train": score_train, "score_valid": score_valid}
    metadata_list_lr.append(metadata)
       

Load features : word2vec_sg
start Learning :word2vec_sg
Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.8477807942420608, total= 7.2min
[CV] C=10 ............................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.2min remaining:    0.0s


[CV] ................... C=10, score=0.8506473002841806, total= 7.0min
[CV] C=10 ............................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 14.2min remaining:    0.0s


[CV] .................... C=10, score=0.851478579552635, total= 7.3min
[CV] C=1 .............................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 21.5min remaining:    0.0s


[CV] .................... C=1, score=0.8380579582044321, total= 3.8min
[CV] C=1 .............................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 25.4min remaining:    0.0s


[CV] .................... C=1, score=0.8404799494790022, total= 3.8min
[CV] C=1 .............................................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 29.2min remaining:    0.0s


[CV] .................... C=1, score=0.8414634146341463, total= 4.0min
[CV] C=0.1 ...........................................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 33.2min remaining:    0.0s


[CV] .................. C=0.1, score=0.8146978975945451, total= 2.6min
[CV] C=0.1 ...........................................................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 35.8min remaining:    0.0s


[CV] .................. C=0.1, score=0.8148089674771076, total= 2.5min
[CV] C=0.1 ...........................................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 38.3min remaining:    0.0s


[CV] ................... C=0.1, score=0.818431694679641, total= 2.4min


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 40.7min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 40.7min finished


start prediction :word2vec_sg
Load features : word2vec_cbow
start Learning :word2vec_cbow
Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] C=10 ............................................................
[CV] ..................... C=10, score=0.81049940021466, total=20.1min
[CV] C=10 ............................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 20.1min remaining:    0.0s


In [None]:
print("")

# Model suffixes and file names follow the Word2Vec entries of features_path_dic.
for model_name in ["cbow", "sg", "online"]:
    print("Word2Vec :" + model_name)

    X_train = np.load(DATA_DIR + "/embedded_train_" + model_name + ".npy")
    X_valid = np.load(DATA_DIR + "/embedded_valid_" + model_name + ".npy")
    
    ts = time.time()
    cla = LogisticRegression()
    cla.fit(X_train,Y_train.values)
    te=time.time()
    t_learning = te-ts
    ts = time.time()
    score_train=cla.score(X_train,Y_train)
    score_valid=cla.score(X_valid,Y_valid)
    te=time.time()
    t_predict = te-ts
    metadata = {"typeW2V": model_name ,"nb_hash": None, "vectorizer":"word2vec" ,"learning_time" : t_learning, "predict_time":t_predict, "score_train": score_train, "score_valid": score_valid}
    print(metadata)
    metadata_list_lr.append(metadata)


In [None]:
pd.DataFrame(metadata_list_lr)

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
metadata_list_rf = []

parameters = [[None, "count"],
              [300, "count"],
              [10000, "count"],
              [None, "tfidf"],
              [300, "tfidf"],
              [10000, "tfidf"],]

for nb_hash, vectorizer in parameters:
    print("nb_hash : " + str(nb_hash) + ", vectorizer : " + str(vectorizer))
    # scipy is imported as sc at the top of the notebook
    X_train = sc.sparse.load_npz(DATA_DIR + "/vec_train_nb_hash_" + str(nb_hash) + "_vectorizer_" + str(vectorizer) + ".npz")
    X_valid = sc.sparse.load_npz(DATA_DIR + "/vec_valid_nb_hash_" + str(nb_hash) + "_vectorizer_" + str(vectorizer) + ".npz")
    ts = time.time()
    cla = RandomForestClassifier(n_estimators=100)
    cla.fit(X_train,Y_train.values)
    te=time.time()
    t_learning = te-ts
    ts = time.time()
    score_train=cla.score(X_train,Y_train)
    score_valid=cla.score(X_valid,Y_valid)
    te=time.time()
    t_predict = te-ts
    metadata = {"typeW2V": None, "nb_hash": nb_hash, "vectorizer":vectorizer , "learning_time" : t_learning, "predict_time":t_predict, "score_train": score_train, "score_valid": score_valid}
    print(metadata)
    metadata_list_rf.append(metadata)

In [None]:
print("")

# Model suffixes and file names follow the Word2Vec entries of features_path_dic.
for model_name in ["cbow", "sg", "online"]:
    print("Word2Vec :" + model_name)

    X_train = np.load(DATA_DIR + "/embedded_train_" + model_name + ".npy")
    X_valid = np.load(DATA_DIR + "/embedded_valid_" + model_name + ".npy")
    
    ts = time.time()
    cla = RandomForestClassifier(n_estimators=100)
    cla.fit(X_train,Y_train.values)
    te=time.time()
    t_learning = te-ts
    ts = time.time()
    score_train=cla.score(X_train,Y_train)
    score_valid=cla.score(X_valid,Y_valid)
    te=time.time()
    t_predict = te-ts
    metadata = {"typeW2V": model_name ,"nb_hash": None, "vectorizer":"word2vec" ,"learning_time" : t_learning, "predict_time":t_predict, "score_train": score_train, "score_valid": score_valid}
    print(metadata)
    metadata_list_rf.append(metadata)


In [None]:
pd.DataFrame(metadata_list_rf)
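
The cells of this notebook repeat the same pattern: fit a model, time the fit, score it on both samples, time the scoring, and store everything in a dictionary. One possible way to factor this out is sketched below. `fit_and_score` is a hypothetical helper, not part of the original notebook, and it is run here on synthetic data, so the printed scores say nothing about the Cdiscount results.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(name, estimator, X_train, y_train, X_valid, y_valid):
    """Fit an estimator and return timings and accuracies as a metadata dict."""
    ts = time.time()
    estimator.fit(X_train, y_train)
    t_learning = time.time() - ts

    # As in the notebook, the prediction time covers scoring on both samples.
    ts = time.time()
    score_train = estimator.score(X_train, y_train)
    score_valid = estimator.score(X_valid, y_valid)
    t_predict = time.time() - ts

    return {"name": name, "learning_time": t_learning, "predict_time": t_predict,
            "score_train": score_train, "score_valid": score_valid}


# Synthetic 3-class data standing in for the real feature matrices.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

rows = [
    fit_and_score("logistic_regression", LogisticRegression(max_iter=1000),
                  X_tr, y_tr, X_va, y_va),
    fit_and_score("random_forest", RandomForestClassifier(n_estimators=100, random_state=0),
                  X_tr, y_tr, X_va, y_va),
]
results = pd.DataFrame(rows)
print(results)
```

Collecting every model in a single table with a shared `name` column makes it easier to compare training time against validation accuracy across feature types.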