# Introduction


## Challenge Large Scale Machine Learning

### Authors: 
#### Pavlo Mozharovskyi (pavlo.mozharovskyi@telecom-paristech.fr), Umut Şimşekli (umut.simsekli@telecom-paristech.fr)


### Fusion of algorithms for face recognition

The increasingly ubiquitous presence of biometric solutions, and of face recognition in particular, in everyday life requires adapting them to practical scenarios. When several solutions are available and a global decision has to be made, any single solution can be far less efficient than a combination tailored to the complexity of each image.

In this challenge, the goal is to build a fusion of algorithms in order to construct the best suited solution for comparison of a pair of images. This fusion will be driven by qualities computed on each image.

Comparing two images is done in two steps. First, a vector of features is computed for each image. Second, a simple function produces a vector of scores for the pair of images. The goal is to create a function that compares a pair of images based on the information above and decides whether the two images belong to the same person.

You are provided with a labeled training set and a test set without labels. You should submit a .csv file with a label for each entry of the test set.

# The properties of the dataset:


### Training data: 


The training set consists of two files, **xtrain_challenge.csv** and **ytrain_challenge.csv**.

File **xtrain_challenge.csv** contains one observation per row. Each observation describes a pair of images, A and B, with the following entries:
 * columns 1-14 - 14 qualities on image A;
 * columns 15-28 - 14 qualities on image B;
 * columns 29-36 - 8 matching scores between A and B.
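The column layout above can be sketched as three named blocks. This is only an illustration: the column names `fA1`..`fA14`, `fB1`..`fB14` and `s1`..`s8` are the ones printed later in this notebook, while the helper `split_blocks` and the synthetic values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Column names as printed later in this notebook.
QUALITY_A = [f"fA{i}" for i in range(1, 15)]   # 14 qualities of image A
QUALITY_B = [f"fB{i}" for i in range(1, 15)]   # 14 qualities of image B
SCORES = [f"s{i}" for i in range(1, 9)]        # 8 matching scores


def split_blocks(frame):
    """Hypothetical helper: return (qualities A, qualities B, scores)."""
    return frame[QUALITY_A], frame[QUALITY_B], frame[SCORES]


# Tiny synthetic frame standing in for xtrain_challenge.csv (made-up values).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(5, 36)),
                   columns=QUALITY_A + QUALITY_B + SCORES)

qual_a, qual_b, scores = split_blocks(toy)
print(qual_a.shape, qual_b.shape, scores.shape)  # (5, 14) (5, 14) (5, 8)
```

Keeping the three blocks separate makes it easier to build features that depend on the qualities of both images at once.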

File **ytrain_challenge.csv** contains one entry per observation in **xtrain_challenge.csv**, in the same order: '1' if the pair of images belongs to the same person and '0' otherwise.

There are 3,196,465 training observations in total.

### Test data:

File **xtest_challenge.csv** has the same structure as file **xtrain_challenge.csv**.

There are 1,598,219 test observations in total.

## The performance criterion

The performance criterion is the **prediction accuracy** on the test set, which is a value between 0 and 1, the higher the better.
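As a minimal sketch, prediction accuracy is the fraction of predicted labels that equal the true labels. The function name `accuracy` and the labels below are made up for illustration; the challenge's own scoring code is not shown in this notebook.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


# Made-up labels: 4 of the 5 predictions are correct.
print(accuracy([0, 1, 1, 0, 0], [0, 1, 0, 0, 0]))  # 0.8
```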

# Training Data

Training data, input (file **xtrain_challenge.csv**): https://www.dropbox.com/s/myvvtmw61eg5gk7/xtrain_challenge.csv

Training data, output (file **ytrain_challenge.csv**): https://www.dropbox.com/s/cleumxob0dfzre4/ytrain_challenge.csv

# Test Data 

Test data, input (file **xtest_challenge.csv**): https://www.dropbox.com/s/bfrx8b4mqythm4q/xtest_challenge.csv

# Import package

In [2]:
%matplotlib inline
import numpy as np
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import math

## Load and investigate the data

## Exploring the dataset

In [3]:
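# Features and labels are stored in separate files with matching row order
xtrain = pd.read_csv('xtrain_challenge.csv')
ytrain = pd.read_csv('ytrain_challenge.csv')
# Join them row by row on the index so that 'y' becomes the last column
dataset = xtrain.merge(ytrain, left_index=True, right_index=True)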

In [3]:
dataset.describe()

| | fA1 | fA2 | fA3 | fA4 | fA5 | fA6 | fA7 | fA8 | fA9 | fA10 | ... | fB14 | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | ... | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 | 3196465.0 |
| mean | 3.123656 | 0.720014 | 0.08895844 | 29.38227 | 0.9112188 | 0.004165448 | 0.08456615 | 0.1323673 | 0.02970159 | -0.0002833317 | ... | 247.7718 | 3874.35 | 2434.998 | 4200.151 | 4242.067 | 3488.791 | 3399.066 | 3509.614 | 3745.971 | 0.2857347 |
| std | 1.618458 | 0.4224322 | 0.2824698 | 9.793075 | 0.2757797 | 0.04433567 | 0.2729676 | 0.415911 | 0.1279389 | 0.02096133 | ... | 122.1232 | 2063.822 | 227.2729 | 2954.367 | 2127.659 | 1360.081 | 1363.354 | 1066.627 | 1548.29 | 0.4517637 |
| min | -0.45 | 0.0 | 0.0 | 18.0 | 0.0 | 0.0 | 0.0 | -2.55 | -3.03 | -0.3 | ... | -308.79 | 1212.7 | 1511.4 | 721.8 | 1133.8 | 668.2 | 720.0 | 1358.0 | 710.7 | 0.0 |
| 25% | 1.95 | 0.25 | 0.0 | 22.0 | 1.0 | 0.0 | 0.0 | -0.02 | -0.02 | 0.0 | ... | 221.42 | 2420.3 | 2296.3 | 2086.2 | 2735.4 | 2457.2 | 2335.0 | 2742.6 | 2620.1 | 0.0 |
| 50% | 3.39 | 1.0 | 0.0 | 26.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | ... | 276.64 | 3062.6 | 2439.8 | 3073.3 | 3402.0 | 3063.3 | 2999.8 | 3151.8 | 3218.8 | 0.0 |
| 75% | 4.38 | 1.0 | 0.0 | 34.0 | 1.0 | 0.0 | 0.0 | 0.1 | 0.08 | 0.01 | ... | 321.66 | 4442.5 | 2589.3 | 5143.8 | 4898.8 | 4406.8 | 4375.5 | 3926.3 | 4396.9 | 1.0 |
| max | 7.94 | 1.0 | 1.0 | 74.0 | 1.0 | 1.0 | 1.0 | 2.5 | 0.76 | 0.26 | ... | 481.42 | 12044.2 | 3035.9 | 16666.5 | 11812.9 | 7731.0 | 7580.1 | 6949.5 | 8524.9 | 1.0 |


In [4]:
print(dataset.columns)
print(dataset.dtypes)

Index(['fA1', 'fA2', 'fA3', 'fA4', 'fA5', 'fA6', 'fA7', 'fA8', 'fA9', 'fA10',
       'fA11', 'fA12', 'fA13', 'fA14', 'fB1', 'fB2', 'fB3', 'fB4', 'fB5',
       'fB6', 'fB7', 'fB8', 'fB9', 'fB10', 'fB11', 'fB12', 'fB13', 'fB14',
       's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 'y'],
      dtype='object')
fA1     float64
fA2     float64
fA3     float64
fA4       int64
fA5     float64
fA6     float64
fA7     float64
fA8     float64
fA9     float64
fA10    float64
fA11    float64
fA12    float64
fA13    float64
fA14    float64
fB1     float64
fB2     float64
fB3     float64
fB4       int64
fB5     float64
fB6     float64
fB7     float64
fB8     float64
fB9     float64
fB10    float64
fB11    float64
fB12    float64
fB13    float64
fB14    float64
s1      float64
s2      float64
s3      float64
s4      float64
s5      float64
s6      float64
s7      float64
s8      float64
y         int64
dtype: object


In [5]:
dataset.corr()[['s1','s2','s3','s4','s5','s6','s7','s8']]

| | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 |
|---|---|---|---|---|---|---|---|---|
| fA1 | 0.302163 | 0.175194 | 0.175699 | 0.319515 | 0.113217 | 0.072562 | 0.221904 | 0.328958 |
| fA2 | 0.195161 | 0.094589 | 0.108425 | 0.199246 | 0.08543 | 0.06589 | 0.127893 | 0.207402 |
| fA3 | -0.156674 | -0.077087 | -0.057989 | -0.170331 | -0.014477 | 0.014819 | -0.092125 | -0.190083 |
| fA4 | -0.221524 | -0.171908 | -0.180561 | -0.221287 | -0.170143 | -0.141654 | -0.210691 | -0.223554 |
| fA5 | 0.2674 | 0.005037 | 0.011938 | 0.299303 | -0.10887 | -0.170258 | 0.090483 | 0.339884 |
| fA6 | -0.031351 | -0.070376 | -0.037729 | -0.029526 | -0.030271 | -0.032501 | -0.038793 | -0.02851 |
| fA7 | -0.265017 | 0.006394 | -0.00589 | -0.297546 | 0.114937 | 0.177315 | -0.085074 | -0.338711 |
| fA8 | -0.217639 | 0.007184 | 0.01899 | -0.247403 | 0.132473 | 0.180096 | -0.053877 | -0.284671 |
| fA9 | 0.053327 | 0.074944 | 0.032096 | 0.055311 | 0.02594 | 0.026823 | 0.047415 | 0.054904 |
| fA10 | 0.005469 | 0.025288 | 0.002957 | 0.006599 | -0.001799 | -0.003317 | 0.004837 | 0.008699 |


In [None]:
dataset.corr()['y']

fA1     0.172994
fA2     0.111962
fA3    -0.032684
fA4    -0.206904
fA5     0.013986
fA6    -0.000138
fA7    -0.014096
fA8     0.001426
fA9     0.050191
fA10   -0.011532
fA11   -0.127774
fA12    0.129836
fA13   -0.043042
fA14    0.093309
fB1     0.149043
fB2     0.114148
fB3    -0.022911
fB4    -0.190803
fB5     0.012304
fB6     0.047056
fB7    -0.074319
fB8    -0.055535
fB9     0.081903
fB10   -0.010141
fB11   -0.127597
fB12    0.136991
fB13   -0.030770
fB14    0.071156
s1      0.801645
s2      0.325149
s3      0.771318
s4      0.804455
s5      0.761075
s6      0.771061
s7      0.809114
s8      0.795365
y       1.000000
Name: y, dtype: float64

The matching scores (columns s1 to s8) are strongly correlated with the target `y`: all of them except s2 (0.33) have a correlation between roughly 0.76 and 0.81.

In [None]:
dataset['y'].value_counts()

0    2283124
1     913341
Name: y, dtype: int64

The target classes are imbalanced: about 71% of the pairs are labeled 0 and about 29% are labeled 1.
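The class shares follow directly from the counts printed above. The short computation below also gives the accuracy that a trivial classifier always predicting the majority class would reach, which is a useful floor when reading accuracy figures. The variable names are illustrative only.

```python
# Counts copied from the value_counts() output above.
counts = {0: 2283124, 1: 913341}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = max(shares.values())

print(total)                                         # 3196465
print({k: round(v, 4) for k, v in shares.items()})   # {0: 0.7143, 1: 0.2857}
print(round(majority_baseline, 4))                   # 0.7143
```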

## Splitting the data

We saw above that the classes to predict are imbalanced, so the split must take this into account in order not to worsen the disparity. The `stratify` parameter is designed for this purpose.  
When using a `neural network` or `LightGBM`, standardizing the data (centering and scaling) has a real impact on model performance. This step is carried out further below.

In [None]:
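from sklearn.model_selection import train_test_split
# Hold out 10% for validation, keeping the class proportions of 'y'
train, test = train_test_split(dataset, test_size=0.1, stratify=dataset['y'])
# Same split procedure on the normalized copy of the data
data = pd.read_csv('dataNormalized.csv', index_col=0)
trainN, testN = train_test_split(data, test_size=0.1, stratify=data['y'])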
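The notebook does not show how `dataNormalized.csv` was produced. The sketch below is one common way to combine a stratified split with standardization, assuming scikit-learn's `StandardScaler`. The synthetic data, the column names `f1`..`f4` and the fixed `random_state` are assumptions made for the example. It fits the scaler on the training part only, so that no statistics from the validation part leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the challenge data: 4 features, about 29% positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(1000, 4)),
                   columns=["f1", "f2", "f3", "f4"])
toy["y"] = (rng.random(1000) < 0.29).astype(int)

# Stratified 90/10 split: both parts keep nearly the same share of positives.
train_part, valid_part = train_test_split(
    toy, test_size=0.1, stratify=toy["y"], random_state=0)

# Fit the scaler on the training part, then apply it to both parts.
features = [c for c in toy.columns if c != "y"]
scaler = StandardScaler().fit(train_part[features])
train_scaled = train_part.copy()
valid_scaled = valid_part.copy()
train_scaled[features] = scaler.transform(train_part[features])
valid_scaled[features] = scaler.transform(valid_part[features])

print(train_part["y"].mean(), valid_part["y"].mean())
print(train_scaled[features].mean().round(6).tolist())
```

The same fitted `scaler` would then have to be applied to the unlabeled test set before prediction.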

## Modelling

### Choice of models

After a brief literature search, I decided to consider five different models:   
 * LDA
 * SVM
 * Neural networks  
 * XGBoost
 * LightGBM
 
The `SVM` and the `LDA` did not perform well enough compared with the other models to justify detailing their implementation.

The rest of this report covers the implementation of a neural network and of the gradient boosting models.

### Computing power for hyperparameter optimization

To reduce the computation time needed to find the hyperparameters of the three remaining models, the computations were distributed across two Dask clusters made up of the school's machines. A separate notebook was also set up for each model so that the computations could be launched simultaneously.

Here are the links to each of these notebooks:  
 * Neural network
 * XGBoost
 * LightGBM
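The search notebooks themselves are not reproduced here. The column names in the results files loaded below (`split0_test_score`, `split1_test_score`, `mean_train_score`, ...) look like scikit-learn's `cv_results_` for a two-fold grid search with train scores enabled, so the following is a sketch under that assumption. To stay self-contained it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM, with `max_leaf_nodes` playing a role similar to `num_leaves`. It runs locally rather than on a Dask cluster, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (made up for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Small grid over the learning rate and the maximum number of leaves per tree.
param_grid = {"learning_rate": [0.02, 0.4], "max_leaf_nodes": [8, 32]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=2,                     # two folds, as suggested by split0/split1
    return_train_score=True,  # needed to obtain the *_train_score columns
    n_jobs=1,
)
search.fit(X, y)

# cv_results_ converts directly to a DataFrame that can be saved with to_csv.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["param_learning_rate", "param_max_leaf_nodes",
               "mean_test_score", "mean_train_score"]])
```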
 
 

### LightGBM results

In [None]:
pd.read_csv('LGBM.csv')

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_num_leaves | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98.779547 | 0.946638 | 14.711493 | 0.277883 | 0.02 | 256 | {'learning_rate': 0.02, 'num_leaves': 256} | 0.987912 | 0.987832 | 0.987872 | 4e-05 | 1 | 0.990968 | 0.991069 | 0.991019 | 5e-05 |
| 1 | 94.529286 | 0.926894 | 13.868437 | 0.247973 | 0.02 | 128 | {'learning_rate': 0.02, 'num_leaves': 128} | 0.986516 | 0.986479 | 0.986498 | 1.8e-05 | 2 | 0.988014 | 0.988184 | 0.988099 | 8.5e-05 |
| 2 | 68.146726 | 0.511424 | 11.636487 | 0.263768 | 0.02 | 64 | {'learning_rate': 0.02, 'num_leaves': 64} | 0.985063 | 0.984973 | 0.985018 | 4.5e-05 | 3 | 0.985731 | 0.98595 | 0.98584 | 0.00011 |
| 3 | 58.402427 | 4.384872 | 9.723119 | 0.349699 | 0.02 | 32 | {'learning_rate': 0.02, 'num_leaves': 32} | 0.983464 | 0.98329 | 0.983377 | 8.7e-05 | 4 | 0.983764 | 0.98392 | 0.983842 | 7.8e-05 |
| 4 | 28.826629 | 0.630166 | 7.994866 | 0.201128 | 0.4 | 32 | {'learning_rate': 0.4, 'num_leaves': 32} | 0.983425 | 0.983252 | 0.983339 | 8.7e-05 | 5 | 0.984965 | 0.984797 | 0.984881 | 8.4e-05 |
| 5 | 26.649774 | 0.406546 | 8.610677 | 0.08326 | 0.7 | 32 | {'learning_rate': 0.7, 'num_leaves': 32} | 0.981956 | 0.98192 | 0.981938 | 1.8e-05 | 6 | 0.983401 | 0.983681 | 0.983541 | 0.00014 |
| 6 | 35.504586 | 0.599904 | 10.101779 | 0.123615 | 0.4 | 64 | {'learning_rate': 0.4, 'num_leaves': 64} | 0.980784 | 0.981377 | 0.98108 | 0.000296 | 7 | 0.982666 | 0.983824 | 0.983245 | 0.000579 |
| 7 | 34.976297 | 0.386149 | 10.287855 | 0.128522 | 0.7 | 64 | {'learning_rate': 0.7, 'num_leaves': 64} | 0.979862 | 0.980125 | 0.979993 | 0.000131 | 8 | 0.981306 | 0.981954 | 0.98163 | 0.000324 |
| 8 | 50.079604 | 1.605808 | 12.735095 | 0.204151 | 0.4 | 128 | {'learning_rate': 0.4, 'num_leaves': 128} | 0.976314 | 0.979722 | 0.978018 | 0.001704 | 9 | 0.979315 | 0.98214 | 0.980728 | 0.001412 |
| 9 | 70.000215 | 1.332182 | 16.317688 | 0.151828 | 0.4 | 256 | {'learning_rate': 0.4, 'num_leaves': 256} | 0.968426 | 0.970547 | 0.969486 | 0.001061 | 10 | 0.971408 | 0.973724 | 0.972566 | 0.001158 |
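One way to read this table is to compare the mean train and test scores of each configuration. The sketch below copies four rows from the table above and computes the train/test gap. The column name `gap` and the choice of rows are illustrative only.

```python
import pandas as pd

# (learning_rate, num_leaves, mean_test_score, mean_train_score),
# copied from rows 0, 1, 2 and 9 of the table above.
rows = [
    (0.02, 256, 0.987872, 0.991019),
    (0.02, 128, 0.986498, 0.988099),
    (0.02, 64, 0.985018, 0.985840),
    (0.4, 256, 0.969486, 0.972566),
]
cv = pd.DataFrame(rows, columns=["learning_rate", "num_leaves",
                                 "mean_test_score", "mean_train_score"])

# A small gap suggests little overfitting at this tree size.
cv["gap"] = cv["mean_train_score"] - cv["mean_test_score"]

# Configuration with the highest cross-validated accuracy.
best = cv.loc[cv["mean_test_score"].idxmax()]
print(cv)
print(best["learning_rate"], best["num_leaves"])
```

In these rows, a learning rate of 0.02 benefits from more leaves, while 256 leaves at a learning rate of 0.4 gives the lowest score in the table.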


In [None]:
modelLGBM = joblib.load('LGBM_final.sav')
yvalid = modelLGBM.predict(testN.drop('y',axis=1))
(yvalid == testN['y']).mean()

### XGBoost results

In [None]:
pd.read_csv('XGB.csv')

In [None]:
modelXGB = joblib.load('XGB_final.sav')
yvalid = modelXGB.predict(test.drop('y',axis=1))
(yvalid == test['y']).mean()

### Neural network results

In [None]:
pd.read_csv('NN.csv')

In [None]:
modelNN = joblib.load('NN_final.sav')
yvalid = modelNN.predict(testN.drop('y',axis=1))
(yvalid == testN['y']).mean()

## Prepare a file for submission

The model with the best score is `LightGBM`.

In [9]:
# Load test data
xtest = pd.read_csv('testNormalized.csv')
# Classify the provided test data
ytest = modelLGBM.predict(xtest)
print(ytest.shape)
np.savetxt('ytest_final.csv', ytest, fmt = '%1.0d', delimiter=',')

(1598219,)
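Before uploading, it can help to reload the written file and check that it has one label per test observation and contains only 0 and 1. The sketch below does this on made-up predictions. The file name `ytest_demo.csv` and the array `ytest_demo` are hypothetical stand-ins for `ytest_final.csv` and `ytest`, and the simpler format `%d` is used instead of `%1.0d`.

```python
import os
import tempfile

import numpy as np

# Made-up predictions standing in for modelLGBM.predict(xtest).
ytest_demo = np.array([0, 1, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ytest_demo.csv")  # hypothetical file name
    # One integer label per line, as in the submission cell above.
    np.savetxt(path, ytest_demo, fmt="%d", delimiter=",")
    reloaded = np.loadtxt(path, dtype=int, ndmin=1)

# Same number of rows as predictions, and only the labels 0 and 1.
assert reloaded.shape == ytest_demo.shape
assert set(np.unique(reloaded)) <= {0, 1}
print(reloaded.tolist())  # [0, 1, 1, 0, 1]
```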


#### Now it's your turn. Good luck !  :) 