# Introduction

The previous data processing functions had two problems. Some of the columns did not have to be processed by the label encoders, as they were already strings. In addition, labels and features were saved separately, which is not the most convenient way to proceed.

# Label Encoder

The data has already been partly processed to remove useless features and to transform the format of the date. The new data sets were saved in new_train.csv and new_test.csv. The only remaining step is to transform some columns with the label encoder and to save the results in new csv files.

## Encoding

In [116]:
#Let's import the modules and packages
import pandas as pd
import numpy as np
from sklearn import preprocessing
import re
import bisect
import pickle

In [111]:
#Let's open the new datasets already preprocessed
train=pd.read_csv('new_train.csv')
test=pd.read_csv('new_test.csv')
train=train.drop('Id',axis=1)
test=test.drop('Id',axis=1)

train.head(3)

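| Unnamed: 0 | org | tld | ccs | mail_type | images | urls | salutations | designation | chars_in_subject | chars_in_body | label | day | month | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | coursera | org | 0 | multipart/alternative | 23 | 188 | 0 | 1 | 38 | 136818 | 0 | Thu | Mar | 1.95 |
| 1 | google | com | 0 | multipart/alternative | 1 | 6 | 0 | 0 | 44 | 2467 | 0 | Fri | Jan | 5.333333 |
| 2 | iiitd | ac.in | 1 | multipart/mixed | 0 | 1 | 1 | 0 | 78 | 2809449 | 2 | Mon | Aug | 10.9 |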


We realized that some rows in the dataframes were corrupted by faulty values in the day, month or hour columns. Therefore, we drop these corrupted rows.

In [87]:
train=train[train.day.notnull()]
train=train[train.month.notnull()] 
test=test[test.day.notnull()]
test=test[test.month.notnull()] 

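#astype returns a new Series, so the result must be assigned back to the column
train['hour']=train['hour'].astype(np.float64)
test['hour']=test['hour'].astype(np.float64)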
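
The same clean-up can be sketched on a toy dataframe. The example below is only an illustration under stated assumptions: the `toy` frame and its values are invented, and `dropna(subset=...)` is used as a compact alternative to chaining `notnull()` filters.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    'day': ['Thu', None, 'Mon', 'Fri'],
    'month': ['Mar', 'Jan', None, 'Aug'],
    'hour': ['1.95', '5.3', '10.9', '7'],
})

# Drop every row where day or month is missing, in a single call.
clean = toy.dropna(subset=['day', 'month']).copy()

# astype returns a new Series, so the result has to be assigned back.
clean['hour'] = clean['hour'].astype(np.float64)

print(clean)
print(clean.dtypes)
```

Only the first and last rows survive here, and `hour` ends up as a float column.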

In [105]:
#Now we should proceed to the encoding of the dataset
#You can notice that the following function also returns the list of fitted label encoders because we might need to reverse the transformation for later interpretation.

def encodeWithLabelEncoder(trainDf,testDf):
    """
    Input: pandas dataframe, pandas dataframe
    Output: list of pandas dataframe, pandas dataframe, list of label encoders
    The function takes two datasets (train and test) as arguments and returns the same datasets where strings were transformed into numbers thanks to the labelencoder of sklearn. Only certain columns may be concerned by the LabelEncoder : those which have strings as features.
    It also returns the list of the fitted LabelEncoders. 
    """
    listEncoders=[]
    #We should fit the encoders only on the training data (because both training samples and test samples must undergo the same transformation)
    #Plus, as said before, we only need encoding for string columns
    columns=list(trainDf)
    
    for i in range(len(columns)):
        
        if re.match(r'^-?\d+(?:\.\d+)?$', str(trainDf.iloc[1,i])) is not None: #The column looks numeric (judged from its second row)
            pass #Numeric columns are left unchanged
            
        else: #Or it has to be encoded
            le=preprocessing.LabelEncoder()
            dataTrain=trainDf[columns[i]]
            dataTest=testDf[columns[i]]
            le.fit(dataTrain)
            
            ## We transform the data in the test dataset, it might have unknown values  
            if type(le.classes_.tolist()[0]) is str:
                unknown='other'
            else:
                unknown=-1

            dataTest = dataTest.map(lambda s: unknown if s not in le.classes_ else s)
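            le_classes = le.classes_.tolist()
            if unknown not in le_classes: #Avoid a duplicate class if the placeholder is already a known value
                bisect.insort_left(le_classes, unknown)
            le.classes_ = np.array(le_classes) #Keep classes_ as a numpy array so that inverse_transform still works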
            listEncoders.append(le)
            
            #Transforming
            transformedDataTrain=le.transform(dataTrain)
            transformedDataTest=le.transform(dataTest)
            #Modifying the test data set
            testDf[columns[i]]=transformedDataTest
            trainDf[columns[i]]=transformedDataTrain
    
    return [trainDf, testDf, listEncoders]
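
The handling of unseen test values can be illustrated in isolation. The sketch below is a simplified variant, not the function above: instead of inserting the placeholder into `classes_` after fitting, it fits the encoder on the training values plus the placeholder. The sample values and the names `UNKNOWN`, `train_values` and `test_values` are assumptions made for the example.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = 'other'  # placeholder for categories absent from the training data

train_values = ['google', 'coursera', 'iiitd', 'google']
test_values = ['google', 'github', 'iiitd']  # 'github' was never seen in training

# Fit on the training categories plus the placeholder, so that the encoder
# knows a code for "unseen" without modifying classes_ by hand.
le = LabelEncoder()
le.fit(train_values + [UNKNOWN])

# Replace every test value the encoder does not know with the placeholder.
known = set(le.classes_)
test_mapped = [v if v in known else UNKNOWN for v in test_values]

train_codes = le.transform(train_values)
test_codes = le.transform(test_mapped)

print(list(le.classes_))
print(train_codes, test_codes)
```

Both approaches give the unseen value `'github'` the code reserved for `'other'`, while known values keep the same code in train and test.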
    
    

In [112]:
trainLabels=train['label']

results=encodeWithLabelEncoder(train.drop('label',axis=1).copy(),test.copy())

newTrain=results[0]
newTrain['label']=trainLabels
newTest=results[1]
encoders=results[2]

newTrain.head(3)

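| Unnamed: 0 | org | tld | ccs | mail_type | images | urls | salutations | designation | chars_in_subject | chars_in_body | day | month | hour | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 116 | 129 | 0 | 0 | 23 | 188 | 0 | 1 | 38 | 136818 | 44 | 21 | 1.95 | 0 |
| 1 | 220 | 40 | 0 | 0 | 1 | 6 | 0 | 0 | 44 | 2467 | 40 | 18 | 5.333333 | 0 |
| 2 | 244 | 2 | 1 | 2 | 0 | 1 | 1 | 0 | 78 | 2809449 | 41 | 15 | 10.9 | 2 |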
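
Since the fitted encoders are kept so the codes can be interpreted later, here is a minimal sketch of the reverse step using `inverse_transform`. It assumes a hypothetical variant in which each encoder is stored under its column name (`encoders_by_column`), which is not what the function above returns (it returns a plain list), and it uses an invented toy frame.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'day': ['Thu', 'Fri', 'Mon'], 'month': ['Mar', 'Jan', 'Aug']})

# Hypothetical variant: keep each fitted encoder under its column name.
encoders_by_column = {}
encoded = toy.copy()
for col in toy.columns:
    le = LabelEncoder()
    encoded[col] = le.fit_transform(toy[col])
    encoders_by_column[col] = le

# Recover the original strings of one column from its integer codes.
decoded_days = encoders_by_column['day'].inverse_transform(encoded['day'])
print(list(decoded_days))
```

Keying the encoders by column name avoids having to remember which position in the list corresponds to which encoded column.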


## Saving

Now we should save the data in new csv files as well as the encoders that we'll need for later interpretations.

### csv files

In [113]:
#index must be the boolean False: the string 'False' is truthy, so the index would still be written
newTrain.to_csv('trainEncoded.csv',index=False)
newTest.to_csv('testEncoded.csv',index=False)
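
A note on the `index` argument of `to_csv`: it is meant to be a boolean, and a quoted `'False'` is a non-empty string, which Python treats as true. The sketch below, on an invented two-row frame written to in-memory buffers, shows what happens when the index is written and the file is read back, compared with `index=False`.

```python
import io
import pandas as pd

df = pd.DataFrame({'org': ['coursera', 'google'], 'urls': [188, 6]})

# Any non-empty string is truthy in Python, including the string 'False'.
print(bool('False'))

# With the index written (the default), reading the file back yields an
# extra column that pandas names 'Unnamed: 0'.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False (the boolean), only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

This is the likely origin of an `Unnamed: 0` column in a dataframe loaded from a CSV file that was saved with its index.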

### Encoders

In [114]:
with open('D:/Utilisateurs/Bastien/Documents/Cours/CentraleSupelec/Electifs/Machine Learning/Evaluations/Assignment 2/mail-classification/encoder.txt','wb') as fichier:
    mon_pickler=pickle.Pickler(fichier)
    mon_pickler.dump(encoders)

In [115]:
# with open('D:/Utilisateurs/Bastien/Documents/Cours/CentraleSupelec/Electifs/Machine Learning/Evaluations/Assignment 2/mail-classification/encoder.txt','rb') as fichier:
#     pickler=pickle.Unpickler(fichier)
#     encoders=pickler.load()

# for encoder in encoders:
#     print(encoder.classes_)
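
The save-and-reload cycle can be checked with a short round trip. This is a sketch under assumptions: it uses `pickle.dump` and `pickle.load` rather than the `Pickler` and `Unpickler` objects, writes to a temporary directory with a hypothetical file name, and uses plain lists of class names (`encoders_stub`) as a stand-in for the fitted encoders.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the list of fitted encoders (plain lists of class names here).
encoders_stub = [['coursera', 'google', 'other'], ['Fri', 'Mon', 'Thu']]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'encoder.pkl'  # hypothetical file name

    # Binary mode is required for pickle, both when writing and when reading.
    with open(path, 'wb') as f:
        pickle.dump(encoders_stub, f)

    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == encoders_stub)
```

The same pattern applies to the real list of encoders, provided the file is loaded in an environment where scikit-learn is installed.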