# Choosing the best preprocessed text

This notebook aims to find the best of the text variants preprocessed with NLP.

According to Eric's research, the LinearSVC model is very fast and gives very good results!

We will test every text variant with its different preprocessing steps and pick the one with the best performance.

In [50]:
import pandas as pd
# Import the train_test_split function
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

from sklearn.metrics import f1_score

Map the target categories to integers from 0 to 26:

In [29]:
target = pd.read_csv(r'C:\Users\Edgar\Documents\Rakuten\Y_train_CVw08PX.csv')

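y = target.prdtypecode
categories = list(set(y.to_list()))
# Map each category code to an integer index in a single pass, so that an index
# already assigned can never be mistaken for a category code still to be replaced
mapping = {category: i for i, category in enumerate(categories)}
y_trans = y.map(mapping)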

y_trans

0         4
1        25
2        20
3         0
4         6
         ..
84911    17
84912    10
84913    25
84914    11
84915    23
Name: prdtypecode, Length: 84916, dtype: int64
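
When relabeling category codes, replacing one value at a time can go wrong if a freshly assigned index happens to equal a category code that has not been processed yet. The sketch below illustrates this with made-up codes (not the real `prdtypecode` values) and compares it with a single-pass dictionary mapping; the variable names are chosen only for this illustration.

```python
# Hypothetical category codes, chosen so that indices and codes overlap.
codes = [1, 0, 2, 1]
categories = [1, 0, 2]  # an arbitrary iteration order, as a set might give

# Sequential replacement: each pass rewrites every value equal to `category`,
# including values that are already new indices from an earlier pass.
sequential = list(codes)
for index, category in enumerate(categories):
    sequential = [index if value == category else value for value in sequential]

# Single-pass mapping: every original code is looked up exactly once.
mapping = {category: index for index, category in enumerate(categories)}
mapped = [mapping[value] for value in codes]

print("sequential:", sequential)  # codes 1 and 0 end up merged
print("mapped:    ", mapped)      # three distinct labels are preserved
```

Whether such a collision occurs with real data depends on the iteration order of the set, so a single-pass mapping is the safer choice.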

Build the list of all candidate X_train files:

In [30]:
import os #Miscellaneous operating system interfaces
#https://docs.python.org/3/library/os.html

#get current working directory
current_path = os.getcwd() 

#X_train path
path = os.path.join(current_path, 'X_train')

#List with the names of all preprocessed training files
data_list = os.listdir(path)

Function to count the number of unique words:

In [31]:
def number_unique_words(text_series): #count unique words in Series
    words = set()
    #skip missing values, which cannot be split into words
    for sentence in text_series.dropna().str.split():
        words.update(sentence)

    return len(words)
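
Note that this count is whitespace-based and case-sensitive, so the number of unique words depends directly on the preprocessing applied to each file. A minimal pure-Python sketch of the same idea, using a hypothetical helper `count_unique_tokens` and two made-up product titles:

```python
def count_unique_tokens(texts, lowercase=False):
    """Count distinct whitespace-separated tokens in an iterable of strings."""
    tokens = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        tokens.update(text.split())
    return len(tokens)


# Made-up sample titles, for illustration only.
sample_texts = ["livre de poche", "Livre ancien de collection"]

raw_count = count_unique_tokens(sample_texts)
lowercased_count = count_unique_tokens(sample_texts, lowercase=True)
print(raw_count, lowercased_count)
```

Here "livre" and "Livre" count as two words until the text is lowercased, which is the kind of reduction the preprocessing variants are expected to produce.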

Try each preprocessed X_train variant to see which preprocessing gives the best predictions:

In [32]:
#data_list = ['X_train_lemma-FR_stop_words-FR_no_num-FR_remove_accents-FR_no_special-FR_lemma-EN_stop_words-EN_stop_words-DE_lemma-DE_steem-FR_steem-EN_steem-DE.csv']

for data in data_list:
    
    #Load the file
    X = pd.read_csv(os.path.join(path, data),index_col =0).squeeze()
    unique_words = number_unique_words(X)
    
    #if unique_words > 200000:
        #print(unique_words)
        #continue
        
    train_size = 0.1
    test_size = 0.02
    
    print('************************')
    print('Try with', data)
    
    # Split the data into training and test subsets
    X_train, X_test, y_train, y_test = train_test_split(X, y_trans, train_size=train_size, test_size=test_size)
    print('Unique words:', unique_words)
    
    try:
        #Transform the text into TF-IDF vectors
        tfid = TfidfVectorizer(analyzer='word',
                               tokenizer=word_tokenize,
                               max_df=0.8,
                               min_df=4,
                               #ngram_range=(1,1),
                               use_idf=True,
                               smooth_idf=True,
                               sublinear_tf=False,
                               binary=True,
                              )
        #Keep the TF-IDF matrices sparse: LinearSVC accepts sparse input and this uses far less RAM
        X_train_tf = tfid.fit_transform(X_train)
        X_test_tf = tfid.transform(X_test)
        print('Size vector X_train', X_train_tf.shape)
    
        #Train model and predict
        model = LinearSVC(penalty="l2", dual=True,C=0.5,max_iter = 1000) #values from the best hyperparameters found in *NLP_SC
        model = model.fit(X_train_tf, y_train)
        pred = model.predict(X_test_tf)
        print('Score for ', data, ':', accuracy_score(y_test, pred))
    
    except MemoryError: #catch only out-of-memory errors so that other bugs are not hidden
        print(data, 'has too many unique words')
    print('')

************************
Try with X_train_lemma-FR.csv
Unique words: 250363
Size vector X_train (8491, 10352)
Score for  X_train_lemma-FR.csv : 0.7551500882872277

************************
Try with X_train_lemma-FR_lemma-EN.csv
Unique words: 242904
Size vector X_train (8491, 10276)
Score for  X_train_lemma-FR_lemma-EN.csv : 0.7657445556209534

************************
Try with X_train_lemma-FR_lemma-EN_lemma-DE.csv
Unique words: 236462
Size vector X_train (8491, 10179)
Score for  X_train_lemma-FR_lemma-EN_lemma-DE.csv : 0.7781047675103002

************************
Try with X_train_lemma-FR_lemma-EN_lemma-DE_steem-FR.csv
Unique words: 210066
Size vector X_train (8491, 8838)
Score for  X_train_lemma-FR_lemma-EN_lemma-DE_steem-FR.csv : 0.7663331371394938

************************
Try with X_train_lemma-FR_lemma-EN_lemma-DE_steem-FR_steem-EN.csv
Unique words: 205116
Size vector X_train (8491, 8828)
Score for  X_train_lemma-FR_lemma-EN_lemma-DE_steem-FR_steem-EN.csv : 0.7716303708063567

**

In [54]:
#data_list = ['X_train_lemma-FR_stop_words-FR_no_num-FR_remove_accents-FR_no_special-FR_lemma-EN_stop_words-EN_stop_words-DE_lemma-DE_steem-FR_steem-EN_steem-DE.csv']

unique_words_list = []
score_list = []
data_name_list = []

for data in data_list:
    
    #Load the file
    X = pd.read_csv(os.path.join(path, data),index_col =0).squeeze()
    
    print('**************************************************************************')
    unique_words = number_unique_words(X)
    print('Unique words:', unique_words)
    print('Try with', data)
    
    #if unique_words > 200000:
        #print(unique_words)
        #continue
    
    #My RAM is not enough to compute on the full data set, so split the data in half
    half = X.size // 2
    X1 = X.loc[:half]
    X2 = X.loc[half + 1:]
    X_list = [X1,X2]
    y1 = y_trans.loc[:half]
    y2 = y_trans.loc[half + 1:]
    y_list = [y1,y2]
    split = 0
    sum_score = 0
    
    #Train model with each data split (distinct names avoid overwriting X and y)
    for X_part, y_part in zip(X_list,y_list):
        print('')
        print('Split:',split + 1)
        
        # Split this half into training and test subsets
        X_train, X_test, y_train, y_test = train_test_split(X_part, y_part, test_size=0.2)

        try: #too many unique words can raise an error because there is not enough RAM
            #Transform the text into TF-IDF vectors
            tfid = TfidfVectorizer(analyzer='word',
                                   tokenizer=word_tokenize,
                                   max_df=0.8,
                                   min_df=4,
                                   #ngram_range=(1,1),
                                   use_idf=True,
                                   smooth_idf=True,
                                   sublinear_tf=False,
                                   binary=True,
                                  )
            #Keep the TF-IDF matrices sparse: LinearSVC accepts sparse input and this uses far less RAM
            X_train_tf = tfid.fit_transform(X_train)
            X_test_tf = tfid.transform(X_test)
            print('Size vector X_train', X_train_tf.shape)

            #Train model and predict
            model = LinearSVC(penalty="l2", dual=True,C=0.5,max_iter = 1000) #values from the best hyperparameters found in *NLP_SC
            model = model.fit(X_train_tf, y_train)
            pred = model.predict(X_test_tf)
            score = f1_score(y_test, pred, average='weighted')
            print('Score for split', split, ':', round(score,2))
            
            sum_score += score
            split += 1

        except MemoryError: #catch only out-of-memory errors so that other bugs are not hidden
            print(data, 'has too many unique words')
    
    print('')
    print('************************')
    mean_score = round(sum_score/split,2)
    print('*    Mean score', mean_score,'  *')
    print('************************')
    print('')
    
    unique_words_list.append(unique_words)
    score_list.append(mean_score) #store the mean over both splits, not only the last split's score
    data_name_list.append(data)
    

**************************************************************************
Unique words: 250363
Try with X_train_lemma-FR.csv

Split: 1
Size vector X_train (33967, 25050)
Score for split 0 : 0.81

Split: 2
Size vector X_train (33965, 24744)
Score for split 1 : 0.81

************************
*    Mean score 0.81   *
************************

**************************************************************************
Unique words: 242904
Try with X_train_lemma-FR_lemma-EN.csv

Split: 1
Size vector X_train (33967, 24326)
Score for split 0 : 0.82

Split: 2
Size vector X_train (33965, 24237)
Score for split 1 : 0.81

************************
*    Mean score 0.81   *
************************

**************************************************************************
Unique words: 236462
Try with X_train_lemma-FR_lemma-EN_lemma-DE.csv

Split: 1
Size vector X_train (33967, 24276)
Score for split 0 : 0.82

Split: 2
Size vector X_train (33965, 24140)
Score for split 1 : 0.81

********************
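
To see why a dense TF-IDF matrix of this size strains memory, here is a back-of-the-envelope estimate. It uses the training shape printed for the first split of `X_train_lemma-FR.csv` and assumes 8 bytes per value (float64, the default dtype of `TfidfVectorizer`); the variable names are only for this illustration.

```python
# Shape printed for the first split of X_train_lemma-FR.csv.
n_rows, n_cols = 33967, 25050

# Assumption: float64 values, i.e. 8 bytes each.
BYTES_PER_VALUE = 8

dense_bytes = n_rows * n_cols * BYTES_PER_VALUE
dense_gb = dense_bytes / 1e9

print(f"Dense training matrix: {dense_gb:.1f} GB")
```

A sparse matrix stores only the nonzero entries, and since each product text contains a small fraction of the vocabulary, it is typically far smaller than this dense estimate.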

# CONCLUSION

It appears that, regardless of the text preprocessing, the LinearSVC algorithm reaches roughly the same score each time (a weighted F1 of about 0.81 in the runs shown above).

We can therefore base our choice of preprocessed text on the size of the data.
We will pick the variant with the smallest size in order to speed up the computations!
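
This selection rule can be sketched as follows. The example reuses the unique-word counts and accuracy scores printed by the first experiment (10% training sample), not the F1 scores of the second one, and the `TOLERANCE` threshold is a hypothetical value chosen only for illustration.

```python
# (file name, unique words, accuracy) copied from the first experiment's output.
results = [
    ("X_train_lemma-FR.csv", 250363, 0.7551500882872277),
    ("X_train_lemma-FR_lemma-EN.csv", 242904, 0.7657445556209534),
    ("X_train_lemma-FR_lemma-EN_lemma-DE.csv", 236462, 0.7781047675103002),
    ("X_train_lemma-FR_lemma-EN_lemma-DE_steem-FR.csv", 210066, 0.7663331371394938),
    ("X_train_lemma-FR_lemma-EN_lemma-DE_steem-FR_steem-EN.csv", 205116, 0.7716303708063567),
]

# Hypothetical threshold: scores this close to the best are treated as equivalent.
TOLERANCE = 0.03

best_score = max(score for _, _, score in results)

# Keep the variants whose score is within the tolerance of the best one...
candidates = [row for row in results if best_score - row[2] <= TOLERANCE]

# ...and among those, choose the one with the smallest vocabulary.
chosen = min(candidates, key=lambda row: row[1])

print("Chosen file:", chosen[0])
```

With a stricter tolerance, fewer variants would count as equivalent and the choice could shift back toward the best-scoring file.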