# Final lab session


Fill with your information:
* First name: Eddy 
* Last name: OHAYON
* email: eddy.ohayon5@gmail.com 

# Description

For this lab session, you will have to code in Python using `keras`. 
You will have to learn about the so-called [functional API](http://keras.io/getting-started/functional-api-guide/). When going through this tutorial, pay specific attention to the “Multi-input and multi-output models” and “Shared layers” sections.

## Problem statement

Each year, a school receives a given number of student applications. Each application consists of qualitative elements (which we will ignore in this exercise) and quantitative ones: the grades the student obtained in a given list of courses.

Let us assume that the people in charge of the selection process in this school would like to compute, for each student, a kind of global grade that would be a weighted average of their grades.
Unfortunately, these people are not able to reach a consensus on which weights to use.
The only consensus they could reach is when they have to compare a pair of files and decide on which is better.

The goal of this assignment is then to design a model that could learn those weights based on provided pairwise comparisons. 
To do so, you are given a dataset that indicates, for each pair of student candidates, the one that would be preferred (and hence ranked higher) by the jury.

Constraints for this problem are as follows:
* the final score given to a student should be a linear combination of their grades;
* each weight should be positive;
* the weights should sum to 1.
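
These constraints can be written compactly. The notation is introduced here for illustration only: $x_{i,k}$ is the grade of student $i$ in course $k$, $d$ is the number of courses, $w_k$ is the weight of course $k$, and $s_i$ is the resulting score. Positivity is written as non-negativity, which is what a constraint on the weights can enforce in practice.

```latex
$$
s_i = \sum_{k=1}^{d} w_k \, x_{i,k},
\qquad w_k \geq 0 \ \text{ for all } k,
\qquad \sum_{k=1}^{d} w_k = 1 .
$$
```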

## Data and expected output

Input data is organized in two separate files:
* `grades.csv` provides grades for each student;
* `compare.csv` indicates, for a given pair of students, the one that the jury would have preferred.

The code below downloads these two files, which you can then load using `numpy.loadtxt`.

You will have to provide two things on your side:
* a completed version of this notebook (in .ipynb format);
* a text file containing the learned weights, as generated by `numpy.savetxt`.

In [0]:
!wget "https://rtavenar.github.io/teaching/deep_edhec/data/project/compare.csv"
!wget "https://rtavenar.github.io/teaching/deep_edhec/data/project/grades.csv"
!head compare.csv
!echo "---"
!head grades.csv

--2019-05-31 14:12:44--  https://rtavenar.github.io/teaching/deep_edhec/data/project/compare.csv
Resolving rtavenar.github.io (rtavenar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.110.153, ...
Connecting to rtavenar.github.io (rtavenar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1466462 (1.4M) [text/csv]
Saving to: ‘compare.csv.10’


2019-05-31 14:12:44 (39.4 MB/s) - ‘compare.csv.10’ saved [1466462/1466462]

--2019-05-31 14:12:47--  https://rtavenar.github.io/teaching/deep_edhec/data/project/grades.csv
Resolving rtavenar.github.io (rtavenar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.110.153, ...
Connecting to rtavenar.github.io (rtavenar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 790457 (772K) [text/csv]
Saving to: ‘grades.csv.10’


2019-05-31 14:12:47 (22.9 MB/s) - ‘grades.csv.10’ saved [790457/790457]

ID_ETU1;ID_ETU2;BEST_ETU
4928;7791;7791
4302

# Data loading and inspecting

First, we load all the libraries we will need and load the datasets as arrays.

In [0]:
# import the libraries
import numpy as np
import keras
from keras.layers import Input, Dense, Concatenate
from keras.models import Model
from google.colab import drive
drive.mount('/content/gdrive')

# load the datasets
grades = np.loadtxt('grades.csv',delimiter=';',skiprows=1)
compare = np.loadtxt('compare.csv',delimiter=';',skiprows=1)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


We have a first look at the shape and structure of the data.

In [0]:
# For the grades
print(grades.shape)
print(grades[:10])

# For the comparisons
print(compare.shape)
print(compare[:10])

(10000, 5)
[[ 0.         11.32710894 13.66053368  9.64886861  8.94851893]
 [ 1.         11.82135424 13.74151583 11.10487924  8.89833695]
 [ 2.         10.83920045 11.06217951  9.3612374   9.35079721]
 [ 3.         12.43419379 14.20490498 12.89786443  7.91187559]
 [ 4.         12.34668849 13.96751096 11.23327335  8.01759149]
 [ 5.         11.48188355 14.63886462 10.36178172  6.49370282]
 [ 6.         12.72821111 13.85997618 10.29893556  7.79466919]
 [ 7.         11.30447324 14.44187734  9.80319394  6.69076554]
 [ 8.         12.97748237 14.46427201 11.40295478  8.32207341]
 [ 9.         13.41327909 12.99001591 11.44333505  7.45247558]]
(100000, 3)
[[4928. 7791. 7791.]
 [4302. 7448. 7448.]
 [1005. 8816. 1005.]
 [1970. 2533. 2533.]
 [3203. 1186. 1186.]
 [3057. 3928. 3057.]
 [7636.  647.  647.]
 [2843. 9681. 2843.]
 [1530. 9920. 9920.]
 [4303. 1455. 1455.]]


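There are **10,000 students** and each one of them has **4 grades** (the first column of `grades` is the student ID).

There are **100,000 comparisons** which have been made. Each row holds the IDs of the two students and the ID of the preferred one.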

# Methodology

First, we split the grades into **two arrays: `grades_1` and `grades_2`.** They hold the grades of the first and second student of each pair in the "compare" file (we keep the original order).

At the same time, we create **a new variable which will be the target**: its value is **0 if "student 1" was judged by the professors to be better than "student 2", and 1 otherwise.**

---

Thus, the goal is to use the two grade arrays (those of student 1 and those of student 2) as inputs. We will **compute the score of each student with the same weights**, using **a shared layer**. The score is computed as a linear combination of the grades: **a shared layer with a single neuron, without bias and with a linear activation.**

**The second (and last) layer uses a sigmoid activation.** It is trained against the target so that it outputs a value close to 0 when "student 1" has the higher score and close to 1 otherwise.
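
The forward pass described above can be sketched in plain NumPy. This is an illustration only, not the Keras model itself: the function `predict_pair`, the shared weights `w`, the output-layer weights `v` and the bias `b` are all hypothetical values chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_pair(x1, x2, w, v, b):
    """Probability that student 2 is preferred over student 1."""
    s1 = x1 @ w  # score of student 1, computed with the shared weights
    s2 = x2 @ w  # score of student 2, computed with the same weights
    return sigmoid(v[0] * s1 + v[1] * s2 + b)

# Hypothetical values, chosen only for illustration.
w = np.array([0.4, 0.1, 0.3, 0.2])   # shared weights: non-negative, sum to 1
v = np.array([-1.0, 1.0])            # output layer compares s2 against s1
b = 0.0
x1 = np.array([12.0, 14.0, 11.0, 8.0])
x2 = np.array([10.0, 11.0, 9.0, 9.0])

p = predict_pair(x1, x2, w, v, b)
print(p)  # below 0.5 here, i.e. student 1 is predicted to be preferred
```

Because the same `w` is applied to both students, the only thing the output layer has to learn is how to compare the two scores.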

# Preprocessing the data and creating the constraint

We create the two inputs with the values explained above.

In [0]:
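# create the arrays for students 1 and students 2
grades_1 = []
grades_2 = []
best = []
for row in compare:  # separate loop variable, so the `compare` array is not overwritten
  grades_1.append(grades[int(row[0])][1:])  # grades of student 1 (ID column dropped)
  grades_2.append(grades[int(row[1])][1:])  # grades of student 2 (ID column dropped)

  # target: 0 if student 1 was preferred, 1 otherwise
  if row[2] == row[0]:
    best.append(0)
  else:
    best.append(1)

# convert the lists to arrays of shape (n_pairs, 4), (n_pairs, 4) and (n_pairs,)
grades_1, grades_2, best = np.array(grades_1), np.array(grades_2), np.array(best)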
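
The same arrays can also be built without a Python loop, using NumPy integer indexing. The sketch below runs on tiny synthetic stand-ins (`grades_demo` and `compare_demo` are hypothetical, not the real files) and first checks the assumption that makes the lookup valid: the ID column must equal the row index.

```python
import numpy as np

# Tiny synthetic stand-ins for grades.csv and compare.csv (hypothetical values).
grades_demo = np.array([
    [0, 11.0, 13.0,  9.0, 8.0],
    [1, 12.0, 14.0, 11.0, 9.0],
    [2, 10.0, 11.0,  9.5, 9.5],
])
compare_demo = np.array([
    [0, 1, 1],   # student 1 preferred over student 0
    [2, 0, 2],   # student 2 preferred over student 0
    [1, 2, 1],   # student 1 preferred over student 2
])

# Looking up a student by row is only valid if ID == row index.
assert np.array_equal(grades_demo[:, 0], np.arange(len(grades_demo)))

ids_1 = compare_demo[:, 0].astype(int)
ids_2 = compare_demo[:, 1].astype(int)

x1 = grades_demo[ids_1, 1:]  # grades of the first student of each pair
x2 = grades_demo[ids_2, 1:]  # grades of the second student of each pair
# Target: 0 if the first student was preferred, 1 otherwise.
y = (compare_demo[:, 2] != compare_demo[:, 0]).astype("float32")

print(x1.shape, x2.shape, y)
```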

As a reminder, **the constraints on the weights** are :

*   each weight should be positive;
*   the weights should sum to 1.

Keras does not provide a built-in constraint that enforces both conditions at once, so we have to write our own.

In [0]:
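from keras import backend as K  # backend ops used by the constraint

def constraint(w):
  w = K.abs(w)  # make every weight non-negative first
  return w / (K.epsilon() + K.sum(w, axis=0, keepdims=True))  # then rescale so the weights sum to 1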
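
A NumPy sketch of one way to enforce both constraints: take absolute values first, then divide by their sum. It also shows that dividing by the signed sum before taking the absolute value does not guarantee a sum of 1 when the kernel has mixed signs. The function name `project_weights` and the kernel `w_raw` are hypothetical, chosen for illustration.

```python
import numpy as np

def project_weights(w):
    """Make weights non-negative, then rescale each column to sum to 1."""
    w_abs = np.abs(w)
    return w_abs / w_abs.sum(axis=0, keepdims=True)

# Hypothetical raw kernel of shape (4, 1) with one negative entry.
w_raw = np.array([[0.5], [-0.2], [0.4], [0.1]])

w_proj = project_weights(w_raw)
# Dividing by the signed sum first, then taking the absolute value.
naive = np.abs(w_raw / w_raw.sum(axis=0, keepdims=True))

print(w_proj.ravel(), w_proj.sum())  # non-negative, sums to 1
print(naive.ravel(), naive.sum())    # non-negative, but the sum is not 1
```

When all raw weights share the same sign, the two variants give the same result.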

# Build the model

As we are facing a binary classification problem (predicting which student of a pair is preferred), we use the `binary_crossentropy` loss and assess performance with the accuracy metric. We use the Adam optimizer.

We hold out 20% of the pairs with `validation_split` to evaluate the model fairly.

*Note: the `restore_best_weights` option (of the `EarlyStopping` callback) keeps the best weights rather than the last ones; it is not used in the cells below.*
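
As far as the Keras documentation describes it, `validation_split` takes the validation samples from the end of the arrays, before any shuffling. The sketch below mimics that behaviour with NumPy indices; `tail_split` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def tail_split(n_samples, validation_split):
    """Return (train, validation) indices, validation taken from the tail."""
    split_at = int(n_samples * (1.0 - validation_split))
    return np.arange(split_at), np.arange(split_at, n_samples)

train_idx, val_idx = tail_split(100000, 0.2)
print(len(train_idx), len(val_idx))
```

If the pairs were sorted in some meaningful way, it would be worth shuffling them once before training so that the tail is representative.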

In [0]:
student_1 = Input(shape=(4,), dtype='float32', name='student_1') # input with the grades of students 1
student_2 = Input(shape=(4,), dtype='float32', name='student_2') # input with the grades of students 2


shared_layer = Dense(1,  activation="linear",use_bias=False, kernel_constraint=constraint) # calculating the grades
student_1_score = shared_layer(student_1)
student_2_score = shared_layer(student_2)

preds = Dense(1, activation='sigmoid')(Concatenate(axis=-1)([student_1_score, student_2_score])) # predicting the best student

model = Model(inputs=[student_1, student_2], outputs=preds) # building our model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [0]:
# Training the model (20% of the pairs are held out for validation)
from tensorflow.python.ops import math_ops
model.fit([grades_1, grades_2], best, epochs=10, validation_split = .2)

Train on 80000 samples, validate on 20000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff7f5245f28>

Our results look great: the model reaches a very high accuracy.

# Weights inspection and validation

We have a look at our weights.

In [0]:
print(shared_layer.get_weights()[0])

[[0.3267201 ]
 [0.10893933]
 [0.30102238]
 [0.26331815]]


They all respect the constraint of lying between 0 and 1.

In [0]:
print(sum(shared_layer.get_weights()[0]))

[1.]


They also respect the constraint of summing to 1.

We can see that the first grade has the largest weight (about 0.33), followed closely by the third (0.30) and the fourth (0.26). The second one has a low weight (0.11).

We can deduce that Maths and CS were the most important classes for the professors, followed closely by English. The Economics class is less important. Probably an engineering school!
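
One way to validate a weight vector independently of the network is to count how often the weighted scores agree with the jury's choices. The sketch below does this on synthetic data generated from hypothetical weights `true_w`; `pairwise_agreement` is an illustrative helper, and the real arrays could be passed in the same way.

```python
import numpy as np

def pairwise_agreement(x1, x2, y, w):
    """Fraction of pairs where the higher weighted score matches the label."""
    pred = (x2 @ w > x1 @ w).astype(int)  # 1 when student 2 scores higher
    return float(np.mean(pred == y))

rng = np.random.default_rng(0)
true_w = np.array([0.4, 0.1, 0.3, 0.2])        # hypothetical "jury" weights
x1 = rng.normal(11.0, 2.0, size=(1000, 4))     # synthetic grades, student 1
x2 = rng.normal(11.0, 2.0, size=(1000, 4))     # synthetic grades, student 2
y = (x2 @ true_w > x1 @ true_w).astype(int)    # labels generated from true_w

agreement = pairwise_agreement(x1, x2, y, true_w)
print(agreement)
```

With the real data, `w` would be `shared_layer.get_weights()[0].ravel()` and `y` the target built earlier.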

In [0]:
# saving the weights as txt file on my drive
np.savetxt("/content/gdrive/My Drive/my_weigths.txt", shared_layer.get_weights()[0])

In [0]:
!ls /content/gdrive/

'My Drive'
