# Homework 04 - Applied ML

*Remarks to ease the reading of this work*:
The data are stored in the `Data` folder; their description is available [here](https://github.com/ADAEPFL/Homework/blob/master/04%20-%20Applied%20ML/DATA.md).
All the functions mentioned are stored in separate libraries, which are specified at each step.
The organisation of the *Notebook* is given in the *Table of contents*.

### Table of contents
1. [Predict the skin color of a soccer player](#task1)
    1. [Exploratory Data Analysis, Feature Selection and Feature engineering](#EDA)
    2. [Baseline model](#baseline)
    3. [Find the model](#tuning)
    4. [*BONUS*](#bonus)
2. [Cluster players with dark and light skin colors](#task2)
    1. [Sub paragraph](#subparagraph1)

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from functools import partial
import matplotlib.pyplot as plt
from data_preprocessing import *
from scipy.stats import skewtest  # used by create_bins
from sklearn import preprocessing  # used for LabelEncoder
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
%matplotlib inline

## 1. Predict the skin color of a soccer player <a name="task1"></a>

In this first task we train a *Random forest* classifier to predict the skin color of a soccer player. We first pre-process the data and then move on to the choice of the model, that is, the choice of the parameters that control possible issues such as *overfitting*. As required, we then inspect the `feature_importances_` attribute and discuss the results.
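
As a rough illustration of the `feature_importances_` inspection mentioned above, the sketch below fits a forest on synthetic data and ranks the features. The data and the feature names `f0` to `f3` are made up for the example and are not taken from the homework data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for the demo): only column f0 carries signal
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y_demo = (X_demo['f0'] > 0).astype(int)

forest_demo = RandomForestClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(forest_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

On real data the ranking is less clear-cut, and impurity-based importances tend to favour features with many distinct values, so they are best read as a first indication.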

### 1.1 Exploratory Data Analysis, Feature Selection and Feature engineering <a name="EDA"></a>

In [None]:
# Import data 
data = pd.read_csv('CrowdstormingDataJuly1st.csv', sep = ',')

In [None]:
data.head()

In [None]:
data.columns

##### First clean of data
According to the information given in the [data description](https://github.com/ADAEPFL/Homework/blob/master/04%20-%20Applied%20ML/DATA.md), we get rid of all the dyads that correspond to players whose picture is not available.

In [None]:
data_clean = data[(data.photoID.notnull())]

##### Have a glance at the labels

Next, we check whether either of the two raters failed to assign a label. We see that both of them did their job.

In [None]:
# How many rows (dyads) lack a label from rater 1?
miss_rater_1 = data_clean.rater1.isnull().sum()
# And from rater 2?
miss_rater_2 = data_clean.rater2.isnull().sum()

print('Rater 1 does not label', miss_rater_1, 'dyads')
print('Rater 2 does not label', miss_rater_2, 'dyads')
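
The same check can be done for both raters in one call. The sketch below uses a small made-up frame, not the homework data, and also counts the rows where at least one rating is missing.

```python
import numpy as np
import pandas as pd

# Toy dyads (made-up values); NaN marks a missing rating
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'b', 'c'],
    'rater1': [0.25, 0.25, np.nan, 1.0],
    'rater2': [0.25, 0.25, 0.5, np.nan],
})

# Missing ratings per rater, counted in one call
missing_per_rater = toy[['rater1', 'rater2']].isnull().sum()
# Rows where at least one of the two ratings is missing
rows_with_gap = toy[['rater1', 'rater2']].isnull().any(axis=1).sum()
print(missing_per_rater)
print(rows_with_gap)
```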

We study the distribution of the labels, also to check for disagreements between the two raters. The procedure consists of:
- Grouping by `playerShort`
- Getting the given labels
- Plotting their distribution using a *simple* barplot
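
The steps listed above can be sketched on a toy frame as follows. The values are made up, and the sketch assumes that a player's ratings are identical across all of that player's dyads, so that taking the first one is enough.

```python
import pandas as pd

# Toy dyads (made-up values)
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'b', 'b', 'c'],
    'rater1': [0.0, 0.0, 0.5, 0.5, 1.0],
    'rater2': [0.25, 0.25, 0.5, 0.5, 1.0],
})

# One row per player (assumes ratings are constant within a player)
per_player = toy.groupby('playerShort')[['rater1', 'rater2']].first()

# Label distribution per rater (index = label value, columns = raters)
distribution = per_player.apply(pd.Series.value_counts).fillna(0).astype(int)

# Share of players on which the two raters disagree
disagreement = (per_player['rater1'] != per_player['rater2']).mean()
print(distribution)
print(disagreement)
# In a notebook, distribution.plot.bar() would draw the barplot
```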

In [None]:
# Drop dyads with missing height or weight
data_clean = data_clean.dropna(axis=0, subset=['height', 'weight'])

In [None]:
player_data = data_clean.groupby('playerShort')

Verify that each player belongs to only one club and has only one position.

In [None]:
player_data.agg({'club' : lambda x: len(set(x))})['club'].unique()

In [None]:
player_data.agg({'position' : lambda x: len(set(x))})['position'].unique()
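
A possible alternative for this check is `nunique`, sketched below on made-up data. Note that `nunique` ignores missing values by default, so a player without a recorded position yields 0 rather than 1.

```python
import pandas as pd

# Toy dyads (made-up values); player 'b' has no recorded position
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'b', 'b'],
    'club': ['X', 'X', 'Y', 'Y'],
    'position': ['Goalkeeper', 'Goalkeeper', None, None],
})

# Number of distinct non-null values per player and column
n_distinct = toy.groupby('playerShort')[['club', 'position']].nunique()
print(n_distinct)
print(n_distinct.max())
```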

In [None]:
players = player_data.agg({
        'club' : 'first',
        'leagueCountry' : 'first',
        'birthday' : 'first',
        'height' : 'first',
        'weight' : 'first',
        'position' : 'first',
        'games' : 'sum',
        'victories' : 'sum',
        'ties' : 'sum',
        'defeats' : 'sum',
        'goals' : 'sum',
        'yellowCards': 'sum',
        'yellowReds': 'sum',
        'redCards' : 'sum',
        'rater1' : 'mean',
        'rater2' : 'mean',
        #'refNum' : 'count',
        #'refCountry' : 'count',
        #'meanIAT' : 'mean',
        #'meanExp' : 'mean'
        
    })

In [None]:
label_1 = players['rater1']

In [None]:
label_2 = players['rater2']

In [None]:
def binary_labels(x):
    if x <= 0.5:
        return 0    
    else:
        return 1

In [None]:
def preprocess_labels(label):
    le = preprocessing.LabelEncoder()
    le.fit(label)
    label = le.transform(label) 
    return label

In [None]:
label_1 = label_1.apply(binary_labels)
label_2 = label_2.apply(binary_labels)
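
For reference, the same thresholding can be written as a vectorised comparison. This is only a sketch on made-up ratings. It differs from `binary_labels` on missing values: a NaN maps to 0 here, whereas `binary_labels` sends it to the `else` branch and returns 1.

```python
import pandas as pd

ratings = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0])  # made-up mean ratings

# Same rule as binary_labels for non-missing values: 0 if <= 0.5, else 1
binary = (ratings > 0.5).astype(int)
print(binary.tolist())
```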

In [None]:
label_1 = pd.Series(preprocess_labels(label_1))
label_2 = pd.Series(preprocess_labels(label_2))

In [None]:
players.drop('rater1', axis = 1, inplace = True)

In [None]:
players.drop('rater2', axis= 1, inplace = True)

### 1.2 Baseline model <a name="baseline"></a>

#### Preprocess the variables used as input for the classifier

In [None]:
# Keep only the birth year (last dot-separated field of the date string)
players['birthday'] = players['birthday'].apply(lambda x: float(x.split('.')[-1]))
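
An alternative sketch of the year extraction parses the strings as dates first. It assumes a `day.month.year` format, which is a guess based on the split on `'.'` above; the sample values are made up.

```python
import pandas as pd

birthdays = pd.Series(['31.08.1983', '08.10.1982'])  # made-up sample values

# Parse as day.month.year (assumed format) and keep the year as a float
years = pd.to_datetime(birthdays, format='%d.%m.%Y').dt.year.astype(float)
print(years.tolist())
```

Unlike the string split, parsing raises an error on malformed dates, which can help catch bad rows early.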

In [None]:
def encode_string_variable(df, attribute):
    """Replace missing values with 'Unknown' and label-encode the column in place."""
    df[attribute] = df[attribute].fillna('Unknown')
    le = preprocessing.LabelEncoder()
    df[attribute] = le.fit_transform(df[attribute])

In [None]:
# Get the string variables
object_features = [i for i in players.columns if players[i].dtypes == 'object']
# Get the numerical variables with more than 12 distinct values
numerical_features = [i for i in players.columns if (players[i].dtypes == 'int64' or players[i].dtypes == 'float64') and len(players[i].unique()) > 12]

In [None]:
for feature in object_features:
    encode_string_variable(players, feature)
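
To make the effect of the encoding concrete, here is a minimal sketch on a made-up column. `LabelEncoder` sorts the distinct values, so the integer codes follow alphabetical order and carry no other meaning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up positions with one missing value, filled as in encode_string_variable
positions = pd.Series(['Goalkeeper', None, 'Center Back', 'Goalkeeper']).fillna('Unknown')

le = LabelEncoder()
codes = le.fit_transform(positions)

print(list(le.classes_))                  # sorted distinct values
print(codes.tolist())                     # integer code of each row
print(list(le.inverse_transform(codes)))  # back to the original strings
```

Keeping the fitted encoder around allows the codes to be mapped back to their names later, for instance when reading feature importances.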

##### Categorise features

In [None]:
numerical_features

In [None]:
def create_bins(df, attribute):
    # Get the whisker end values (lower, upper) from a boxplot
    B = plt.boxplot(df[attribute])
    plt.close()
    min_max = [item.get_ydata()[1] for item in B['whiskers']]

    # Skew test on the values at or above the lower whisker
    skew_pvalue = skewtest(df[attribute][df[attribute] >= min_max[0]])[1]

    # Doane's rule for significantly skewed data, 'auto' otherwise
    rule = 'doane' if skew_pvalue < 0.05 else 'auto'
    bins = np.histogram(df[attribute], bins = rule)[1]
    bins_interval = [(bins[i], bins[i+1]) for i in range(len(bins)-1)]

    return bins_interval
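
The rule selection in `create_bins` can be tried in isolation. The sketch below is a simplified variant that skips the whisker-based filtering; `choose_edges` is a hypothetical helper introduced only for this example, and the two samples are synthetic.

```python
import numpy as np
from scipy.stats import skewtest

rng = np.random.RandomState(0)
symmetric = rng.normal(size=500)     # roughly symmetric synthetic sample
skewed = rng.exponential(size=500)   # right-skewed synthetic sample

def choose_edges(values, alpha=0.05):
    """Hypothetical helper: Doane's rule if the skew test rejects symmetry, else 'auto'."""
    rule = 'doane' if skewtest(values).pvalue < alpha else 'auto'
    return rule, np.histogram_bin_edges(values, bins=rule)

for name, sample in [('symmetric', symmetric), ('skewed', skewed)]:
    rule, edges = choose_edges(sample)
    print(name, rule, len(edges) - 1, 'bins')
```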

In [None]:
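def categorisation(bins_intervals, x):
    # Return the index of the half-open interval [low, high) that contains x
    for i, (low, high) in enumerate(bins_intervals):
        if low <= x < high:
            return i
    # Values outside every interval (e.g. the maximum) fall in the last class
    return len(bins_intervals) - 1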

In [None]:
for feature in numerical_features:
    players[feature] = players[feature].apply(partial(categorisation, create_bins(players, feature)))
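
As a point of comparison, `np.digitize` gives the same class indices for values that lie within the bin range. The edges and values below are made up. One difference to keep in mind: a value below the first edge would land in class 0 here, while `categorisation` assigns it to the last class.

```python
import numpy as np

values = np.array([1.0, 2.5, 4.0, 7.0, 10.0])   # made-up data
edges = np.array([1.0, 4.0, 7.0, 10.0])         # 4 edges -> 3 intervals

# Using only the interior edges maps every value to a class in 0..n_bins-1;
# the maximum lands in the last class, as in categorisation()
classes = np.digitize(values, edges[1:-1])
print(classes.tolist())
```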

In [None]:
players

In [None]:
# Optional inspection of each numerical feature (left commented out)
#for feature in numerical_features[1:]:
#    print(describe(players[feature]))
#    print(np.histogram(players[feature], bins = 'doane'))  # 'auto' was used for feature 7

#### Split train and test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(players, label_1, test_size=0.33, random_state=42)
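
Since the classes may be imbalanced, one option worth considering is a stratified split, which keeps the class proportions roughly equal in both parts. The sketch below shows the `stratify` argument on toy data; it is a variant, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)   # imbalanced toy labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

# The share of class 1 is preserved (up to rounding) in both parts
print(y_tr.mean(), y_te.mean())
```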

In [None]:
weight_class = y_train.value_counts()/len(y_train)

In [None]:
weight_class

In [None]:
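# One weight per training sample: the relative frequency of its class.
# Note that this gives more weight to the majority class.
sample_weights = [weight_class[i] for i in y_train]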
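
If the goal is instead to compensate for class imbalance, weights inversely proportional to the class frequency are the usual choice. The sketch below uses scikit-learn's `compute_sample_weight` on toy labels; it is an alternative to the cell above, not a description of it.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_demo = np.array([0, 0, 0, 1])   # toy labels: class 1 is the minority

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_sample_weight('balanced', y_demo)
print(weights.tolist())
```

Here each sample of class 0 gets 4 / (2 * 3) and the single sample of class 1 gets 4 / (2 * 1), so the minority class counts for as much as the majority in total.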

In [None]:
y_train_2, y_test_2 = label_2[y_train.index], label_2[y_test.index]

In [None]:
forest = RandomForestClassifier(n_estimators=100, random_state=1, class_weight='balanced')

In [None]:
train_forest = forest.fit(X_train, y_train, sample_weight= sample_weights)

In [None]:
# Predictions on the test set, compared with y_test in the cells below
classifier_1 = train_forest.predict(X_test)

In [None]:
train_forest.score(X_test, y_test)
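
Accuracy alone can be misleading when one class dominates, so a per-class view is often useful. The sketch below uses made-up labels and predictions to show a confusion matrix and a classification report.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_true = np.array([0, 0, 0, 0, 1, 1])   # made-up true labels
y_pred = np.array([0, 0, 0, 0, 0, 1])   # made-up predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred))
```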

In [None]:
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)

In [None]:
multi_label =  np.array([ y_train, y_train_2]).T
multi_label_test = np.array([ y_test, y_test_2]).T

In [None]:
multi_target_forest.fit(X_train, multi_label, sample_weight= sample_weights).score(X_test, multi_label_test)
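
Note that `score` on a `MultiOutputClassifier` counts a sample as correct only when all of its targets are predicted correctly. The sketch below, on synthetic data with two toy targets standing in for the two raters, compares this exact-match accuracy with the per-target accuracies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
# Two synthetic binary targets (assumption for the demo)
Y_demo = np.column_stack([X_demo[:, 0] > 0, X_demo[:, 1] > 0]).astype(int)

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=20, random_state=0))
model.fit(X_demo[:150], Y_demo[:150])

pred = model.predict(X_demo[150:])
# Fraction of samples with both targets right, and accuracy of each target
exact_match = np.all(pred == Y_demo[150:], axis=1).mean()
per_target = (pred == Y_demo[150:]).mean(axis=0)
print(exact_match, per_target)
```

The exact-match value can never exceed the lower of the two per-target accuracies, which explains why the multi-output score tends to be below the single-rater one.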

In [None]:
classifier_1 = np.array(classifier_1)

In [None]:
# Compare class counts in the test labels and in the predictions (labels are binary)
for i in range(2):
    print('TEST', len(y_test[y_test == i]), 'class', i)
    print('PREDICTOR', len(classifier_1[classifier_1 == i]), 'class', i)
    print('*'*20)

In [None]:
len(classifier_1)

In [None]:
len(y_test[y_test == 0])

In [None]:
len(classifier_1[classifier_1 == 0])

In [None]:
multi_target_forest.fit(X_train, multi_label).score(X_test, multi_label_test)

In [None]:
sum(y_test == classifier_1)/len(y_test)

In [None]:
classifier_1

In [None]:
y_test

### 1.3 Find the model <a name="tuning"></a>

### 1.4 *BONUS* <a name="bonus"></a>

## 2. Cluster players with dark and light skin colors <a name="task2"></a>