# Challenge Large Scale Machine Learning

### Fusion of algorithms for face recognition

The increasingly ubiquitous presence of biometric solutions, and of face recognition in particular, in everyday life requires adapting them to practical scenarios. When several solutions are available, applying any single one globally to all images can be far less efficient than tailoring the choice to the complexity of each image.

In this challenge, the goal is to build a fusion of algorithms that yields the best-suited solution for comparing a pair of images. This fusion will be driven by qualities computed on each image.

Two images are compared in two steps. First, a vector of features is computed for each image. Second, a simple function produces a vector of scores for the pair of images. The goal is to create a function that compares a pair of images based on the information mentioned above and decides whether the two images belong to the same person.

You are provided with a labeled training set and a test set without labels. You should submit a .csv file with a label for each entry of the test set.

# The properties of the dataset:


### Training data: 


The training set consists of two files, **xtrain_challenge.csv** and **ytrain_challenge.csv**.


File **xtrain_challenge.csv** contains one observation per row. Each observation describes a pair of images and has the following entries:
 * columns 1-13 - 13 qualities of the first image;
 * columns 14-26 - 13 qualities of the second image;
 * columns 27-37 - 11 matching scores between the two images.
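
The column layout above can be sketched as three array slices. This is a minimal illustration, not part of the challenge material: the helper name `split_columns` and the synthetic `demo` array are assumptions, and only the 13 + 13 + 11 layout comes from the description.

```python
import numpy as np

# Column layout from the description (1-based columns 1-13, 14-26, 27-37).
N_QUALITIES = 13
N_SCORES = 11

def split_columns(x):
    """Split an (n, 37) array into first-image qualities,
    second-image qualities, and matching scores (hypothetical helper)."""
    x = np.asarray(x)
    if x.ndim != 2 or x.shape[1] != 2 * N_QUALITIES + N_SCORES:
        raise ValueError("expected a 2-D array with 37 columns")
    q1 = x[:, :N_QUALITIES]                    # columns 1-13
    q2 = x[:, N_QUALITIES:2 * N_QUALITIES]     # columns 14-26
    scores = x[:, 2 * N_QUALITIES:]            # columns 27-37
    return q1, q2, scores

# Synthetic stand-in for two rows of xtrain_challenge.csv
demo = np.arange(2 * 37, dtype=float).reshape(2, 37)
q1, q2, scores = split_columns(demo)
print(q1.shape, q2.shape, scores.shape)
```

Keeping the three blocks separate makes it easier to build a fusion in which the qualities decide how much weight each matching score receives.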

File **ytrain_challenge.csv** contains one entry for each observation in **xtrain_challenge.csv**, in the same order: '1' if the pair of images belongs to the same person and '0' otherwise.

There are 1,068,504 training observations in total.

### Test data:

File **xtest_challenge.csv** has the same structure as file **xtrain_challenge.csv**.

There are 3,318,296 test observations in total.

## The performance criterion

The performance criterion is the **prediction accuracy** on the test set, which is a value between 0 and 1, the higher the better.
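
As a minimal sketch, prediction accuracy is the fraction of predicted labels that match the true labels. The helper name `accuracy` and the demo arrays below are illustrative assumptions, not part of the challenge material.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels (hypothetical helper)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same length")
    return float(np.mean(y_true == y_pred))

# Made-up labels: 4 of the 5 predictions are correct
y_true_demo = np.array([1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0])
acc_demo = accuracy(y_true_demo, y_pred_demo)
print(acc_demo)
```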

# Training Data

Training data, input (file **xtrain_challenge.csv**): https://www.dropbox.com/s/myvvtmw61eg5gk7/xtrain_challenge.csv

Training data, output (file **ytrain_challenge.csv**): https://www.dropbox.com/s/cleumxob0dfzre4/ytrain_challenge.csv

# Test Data 

Test data, input (file **xtest_challenge.csv**): https://www.dropbox.com/s/bfrx8b4mqythm4q/xtest_challenge.csv

# Example submission

In [1]:
%matplotlib inline
import numpy as np
import sys
import os
import matplotlib.pyplot as plt
import math
from sklearn.neighbors import KNeighborsClassifier

## Load and investigate the data

In [2]:
# Load the first nrows_train + nrows_test rows of the training data
# (skiprows=1 skips the first line of each file)
nrows_train = 49  # rows used to fit the classifier
nrows_test = 51   # rows held out to check the classifier
xtrain = np.loadtxt('xtrain_challenge.csv', delimiter=',', skiprows=1, max_rows=nrows_train + nrows_test)
ytrain = np.loadtxt('ytrain_challenge.csv', delimiter=',', skiprows=1, max_rows=nrows_train + nrows_test)
ytrain = ytrain.reshape(nrows_train + nrows_test)
# Check the number of observations and properties
print(xtrain[:3])
print(ytrain[:10])
print(xtrain.shape)
print(ytrain.shape)

[[ 1.000000e+00  0.000000e+00  0.000000e+00 -3.450000e+00 -1.314000e+01
   1.330000e+00  2.700000e-01  9.900000e-01  2.010000e+00  2.126100e+02
   4.740000e+01  0.000000e+00  0.000000e+00  1.000000e+00  0.000000e+00
   0.000000e+00 -2.990000e+00 -1.661000e+01  1.910000e+00  1.600000e-01
   8.100000e-01  3.800000e-01  1.778500e+02  3.002000e+01  0.000000e+00
   0.000000e+00  6.190440e+03  6.451890e+03  7.382840e+03  6.075200e+03
   1.029042e+04  6.839450e+03  6.782620e+03  7.594450e+03  6.711650e+03
   6.664710e+03  6.688640e+03]
 [ 1.000000e+00  0.000000e+00  0.000000e+00 -2.670000e+00 -4.880000e+00
   7.440000e+00  9.400000e-01  2.030000e+00  7.200000e-01  2.715300e+02
   4.195000e+01  0.000000e+00  1.000000e-01  1.000000e+00  0.000000e+00
   0.000000e+00  4.333000e+01 -2.209000e+01 -6.730000e+00  3.920000e+00
   2.380000e+00  1.920000e+00  1.949800e+02  2.512000e+01  0.000000e+00
   3.700000e-01  1.998000e+03  2.152790e+03  3.329340e+03  2.793190e+03
   2.358420e+03  2.454090e+03  2.

## Train a simple classifier

In [3]:
# Train the classifier on a part of the data set
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(xtrain[:nrows_train], ytrain[:nrows_train])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [4]:
# Check its accuracy on another part of the data set
yvalid = clf.predict(xtrain[nrows_train:(nrows_train + nrows_test)])
(yvalid == ytrain[nrows_train:(nrows_train + nrows_test)]).mean()

0.9803921568627451

## Prepare a file for submission

In [5]:
# Load test data
xtest = np.loadtxt('xtest_challenge.csv', delimiter=',', skiprows = 1)
# Classify the provided test data
ytest = clf.predict(xtest)
print(ytest.shape)
np.savetxt('ytest_challenge_student.csv', ytest, fmt = '%1.0d', delimiter=',')

(3318296,)
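
Before submitting, it can help to read the file back and check it. The sketch below assumes the format produced by the example above (one label per line, '0' or '1', no header); the exact format expected by the challenge is not stated here, and the helper name `check_submission` and the demo file are hypothetical.

```python
import os
import tempfile
import numpy as np

def check_submission(path, n_expected):
    """Return True if the file has n_expected lines, each '0' or '1'
    (hypothetical helper; assumes one label per line and no header)."""
    with open(path) as f:
        lines = [line.strip() for line in f]
    return len(lines) == n_expected and all(v in ("0", "1") for v in lines)

# Demo on a small temporary file written the same way as the example
demo_labels = np.array([1.0, 0.0, 0.0, 1.0])
demo_path = os.path.join(tempfile.mkdtemp(), "ytest_demo.csv")
np.savetxt(demo_path, demo_labels, fmt="%d", delimiter=",")
ok = check_submission(demo_path, n_expected=len(demo_labels))
print(ok)
```

For the real submission, the call would be `check_submission('ytest_challenge_student.csv', 3318296)`, using the number of test observations given above.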
