# Machine Learning - Fifa 2018

The following example explains a machine learning task in simple terms.

- Database that is going to be used: Fifa Database 2018
- Algorithm that is going to be tested: Decision Tree


## Part 1: Import databases

In [1]:
# Import pandas
import pandas as pd

# Import three databases
df1 = pd.read_csv("RESULTS.csv")
df2 = pd.read_csv("TEST.csv")
df3 = pd.read_csv("TRAIN.csv")

The databases above are:

- df3 --> Train database. It contains all the features from the FIFA database.
- df2 --> Test database. It contains a test sample in which the column that we want to predict (feature: Overall) is empty.
- df1 --> Results database. It contains the same test sample with the column that we want to predict (feature: Overall) filled in. These are the true values, the ones that we actually want to predict for df2.

In [2]:
# Drop the NaN values from the TRAIN database

df3 = df3.dropna()

We drop the rows with empty cells because the algorithm that we are going to apply later on cannot accept NaN values. There are different ways to deal with missing values; for this particular example it was decided to remove those rows.
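Before dropping rows, it can help to see how much data is lost. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from TRAIN.csv) to count the missing cells per column and the rows that survive `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the TRAIN database
toy = pd.DataFrame({
    "Overall": [70, 65, np.nan, 80],
    "Crossing": [60.0, np.nan, 55.0, 75.0],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()

# Rows before and after removing every row that contains a NaN
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(rows_before, rows_after)
```

If many rows disappear, filling the gaps instead of dropping them may be worth considering.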

## Part 2: Select y and X

- **y** is the feature that we want to predict. In this case it is the Overall performance of the footballers.
- **X** are the features that we want to use to fit a model. It is important to have all the features as numbers: the scikit-learn algorithm used below accepts ONLY numeric features for training.

What is a model? A model is something that we can use in order to make a prediction.

What do we mean by "fit"? It means that the algorithm takes the features into consideration and learns how they produce the respective "Overall" score for a footballer (per row). This process builds the model and allows it to generalize to rows that it has not seen.
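As a minimal illustration of "fit" and "predict", the sketch below trains a tree on three made-up players with two numeric features each. The names `X_toy`, `y_toy` and `toy_model` and all values are hypothetical, chosen only to show the two steps.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: two numeric features per player and an overall score
X_toy = [[50, 60], [70, 80], [90, 85]]
y_toy = [55, 75, 88]

# "Fitting" lets the tree learn rules that map the features to the score
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(X_toy, y_toy)

# The fitted model can now predict a score for a row of features
prediction = toy_model.predict([[70, 80]])
print(prediction)
```

Because the queried row is one of the training rows and the tree is not size-limited, the prediction matches that player's known score.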

In [3]:
# Define y
y = df3[["Overall"]]

# Define X
features = ['Crossing',
       'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
       'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
       'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
       'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
       'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
       'GKKicking', 'GKPositioning', 'GKReflexes']

X = df3[features]
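Since the estimator needs numeric inputs, one quick check is to list which columns are numeric before building X. The sketch below does this on a small made-up DataFrame. The column `Name` and all values are illustrative.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Crossing": [60, 75],
    "Finishing": [55.5, 80.0],
})

# Columns that pandas stores as numbers
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else would need encoding or removal before training
non_numeric_cols = [c for c in toy.columns if c not in numeric_cols]

print(numeric_cols)
print(non_numeric_cols)
```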

## Part 3: Train the algorithm (Decision Tree)

For this particular experiment, there was no particular reason for selecting the Decision Tree algorithm. The experiment was conducted purely to provide a view of what the algorithm does.

In [4]:
# Import the required algorithm
from sklearn.tree import DecisionTreeRegressor 

# Select the algorithm
Trainer = DecisionTreeRegressor(random_state=1)

# Train the model
Trainer.fit(X,y)


DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

#### Sub-Part 3.1: Calculating MAE

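The cell below compares df3["Overall"] with y, which is the same column, so the result is 0.0 by construction and does not involve the predictions of Trainer.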

In [6]:
from sklearn.metrics import mean_absolute_error

val_mae = mean_absolute_error(df3["Overall"], y)
val_mae

0.0
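To measure how well a fitted tree reproduces its own training data, the true values are compared with the model's predictions on the training features. The sketch below does this on synthetic data. The names `X_demo`, `y_demo` and `demo_tree` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training features and target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = X_demo.mean(axis=1) + rng.normal(0, 2, size=200)

demo_tree = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

# Comparing the target with itself is always zero
self_mae = mean_absolute_error(y_demo, y_demo)

# Training error: target versus the model's predictions on the training features
train_mae = mean_absolute_error(y_demo, demo_tree.predict(X_demo))

print(self_mae, train_mae)
```

An unrestricted tree can memorize distinct training rows, so the training error is also essentially zero here. That says little about performance on unseen data, which is why the test set is used later.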

## Part 4: Make a prediction

The following script aims to predict the empty Overall cells of the TEST database, based on the model that we trained earlier (Trainer).

In [8]:
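# For this to happen, we have to take the same feature columns from the TEST database.
X1 = df2[features]
# Attempt to fill the missing values of each column with the mean of that column
X1 = X1.groupby(X1.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
# Fill anything that is still missing with the constant 62
X1 = X1.fillna(value=62)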

# And also, we have to apply the model that we fit earlier to the database.
predictions = Trainer.predict(X1)
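The intent of the cell above appears to be column-mean imputation followed by a constant fallback. Assuming that reading is right, the sketch below expresses the same two steps more directly on a made-up DataFrame (the values are illustrative). It is not guaranteed to give results identical to the original `groupby` call, and grouping with `axis=1` is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df2[features]
toy_test = pd.DataFrame({
    "Crossing": [60.0, np.nan, 80.0],
    "GKDiving": [np.nan, np.nan, np.nan],  # a column that is entirely missing
})

# Step 1: fill gaps with the column mean (an all-NaN column has no mean, so it stays NaN)
step1 = toy_test.fillna(toy_test.mean())

# Step 2: fall back to a constant for anything still missing
step2 = step1.fillna(value=62)

print(step2)
```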



## Part 5: Calculating MAE

Calculating the MAE (Mean Absolute Error) is one simple way to see the performance of the algorithm. It is the average of the absolute differences between the actual values and the predicted ones.
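The calculation can be done by hand on a few made-up numbers (the values below are illustrative, not taken from the FIFA data):

```python
# Hypothetical actual and predicted Overall scores for four players
actual = [70, 65, 80, 75]
predicted = [72, 61, 80, 78]

# Absolute difference per player: 2, 4, 0, 3
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the average of those differences
mae = sum(abs_errors) / len(abs_errors)
print(mae)  # 2.25
```

An MAE of 2.25 means the predictions are off by 2.25 points on average, regardless of direction.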

In [9]:
from sklearn.metrics import mean_absolute_error

val_mae = mean_absolute_error(df1["Overall"], predictions)
val_mae

2.7602050155592166

The above result is not bad, but there are many ways to reduce the MAE. Some of them are mentioned below:

1. Dealing differently with the **missing values**.
2. Changing the **set of features** (adding or removing some of them - hint: coefficients might play a vital role here).
3. **Tuning** the algorithm. Hmmm, that sounds interesting, let us stay on that track and see what else we can do here.
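If the hint in point 2 refers to correlation coefficients (an assumption, since the text does not say which coefficients), one way to explore it is to rank features by their correlation with the target. The sketch below does so on a made-up DataFrame with illustrative values.

```python
import pandas as pd

# Hypothetical toy data: target plus two candidate features
toy = pd.DataFrame({
    "Overall":   [60, 65, 70, 75, 80],
    "Reactions": [58, 66, 69, 77, 79],
    "GKDiving":  [12, 9, 14, 8, 11],
})

# Pearson correlation of each feature with Overall, strongest first
correlations = (
    toy.corr()["Overall"]
    .drop("Overall")
    .sort_values(ascending=False)
)
print(correlations)
```

Features with a correlation close to zero are candidates for removal, although a decision tree can also use non-linear relations that this measure does not capture.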

## Part 6: Tune the algorithm (test 1)

In [10]:
# Select the candidate numbers of leaf nodes
candidate_max_leaf_nodes = [2, 5, 25, 50, 80, 100, 120, 150, 180, 200, 250, 500]

# True Overall values of the test sample
r = df1["Overall"]

def get_mae(max_leaf_nodes, X, X1, y, r):
    # Fit a tree limited to max_leaf_nodes leaves and return its MAE on the test sample
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X, y)
    preds_val = model.predict(X1)
    mae = mean_absolute_error(r, preds_val)
    return mae


# Note: %d prints only the integer part of the MAE
for max_leaf_nodes in candidate_max_leaf_nodes:
    my_mae = get_mae(max_leaf_nodes, X, X1, y, r)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

# Store the best value of max_leaf_nodes (it will be one of the candidates above)
scores = {leaf_size: get_mae(leaf_size, X, X1, y, r) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)

Max leaf nodes: 2  		 Mean Absolute Error:  4
Max leaf nodes: 5  		 Mean Absolute Error:  3
Max leaf nodes: 25  		 Mean Absolute Error:  2
Max leaf nodes: 50  		 Mean Absolute Error:  2
Max leaf nodes: 80  		 Mean Absolute Error:  2
Max leaf nodes: 100  		 Mean Absolute Error:  2
Max leaf nodes: 120  		 Mean Absolute Error:  2
Max leaf nodes: 150  		 Mean Absolute Error:  2
Max leaf nodes: 180  		 Mean Absolute Error:  2
Max leaf nodes: 200  		 Mean Absolute Error:  2
Max leaf nodes: 250  		 Mean Absolute Error:  2
Max leaf nodes: 500  		 Mean Absolute Error:  2
80
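Almost every row above shows an error of 2 because the print statement formats the MAE with `%d`, which keeps only the integer part. The short sketch below shows the difference on the MAE value reported in Part 5. The variable names are illustrative.

```python
mae = 2.7602050155592166  # the MAE reported in Part 5

as_integer = "%d" % mae      # integer part only
as_decimal = "%.3f" % mae    # three decimal places

print(as_integer)  # 2
print(as_decimal)  # 2.760
```

Printing with `%.3f` would make the differences between candidates visible in the table.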


## Part 7: Tune the algorithm (test 2)

In [11]:
# Select the candidate numbers of leaf nodes (a finer grid from 70 to 80)
candidate_max_leaf_nodes = list(range(70, 81))

# True Overall values of the test sample
r = df1["Overall"]

def get_mae(max_leaf_nodes, X, X1, y, r):
    # Fit a tree limited to max_leaf_nodes leaves and return its MAE on the test sample
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X, y)
    preds_val = model.predict(X1)
    mae = mean_absolute_error(r, preds_val)
    return mae


# Note: %d prints only the integer part of the MAE
for max_leaf_nodes in candidate_max_leaf_nodes:
    my_mae = get_mae(max_leaf_nodes, X, X1, y, r)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

# Store the best value of max_leaf_nodes (it will be one of the candidates above)
scores = {leaf_size: get_mae(leaf_size, X, X1, y, r) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)

Max leaf nodes: 70  		 Mean Absolute Error:  2
Max leaf nodes: 71  		 Mean Absolute Error:  2
Max leaf nodes: 72  		 Mean Absolute Error:  2
Max leaf nodes: 73  		 Mean Absolute Error:  2
Max leaf nodes: 74  		 Mean Absolute Error:  2
Max leaf nodes: 75  		 Mean Absolute Error:  2
Max leaf nodes: 76  		 Mean Absolute Error:  2
Max leaf nodes: 77  		 Mean Absolute Error:  2
Max leaf nodes: 78  		 Mean Absolute Error:  2
Max leaf nodes: 79  		 Mean Absolute Error:  2
Max leaf nodes: 80  		 Mean Absolute Error:  2
74
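Both tuning rounds score each candidate against the true test values in df1. A common alternative, sketched below as one possible approach rather than the method used in this notebook, is to hold out part of the training data as a validation set and choose `max_leaf_nodes` on that, keeping the test set for the final check. The data is synthetic and names such as `X_demo`, `y_demo` and `validation_mae` are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y from the TRAIN database
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(500, 4))
y_demo = X_demo.mean(axis=1) + rng.normal(0, 3, size=500)

# Keep 25% of the training rows aside for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

def validation_mae(max_leaf_nodes):
    # Fit on the remaining training rows, score on the held-out rows
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_fit, y_fit)
    return mean_absolute_error(y_val, model.predict(X_val))

candidates = [2, 5, 25, 50, 100, 250]
scores = {n: validation_mae(n) for n in candidates}
best = min(scores, key=scores.get)

for n, mae in scores.items():
    print(f"Max leaf nodes: {n:<5d} Validation MAE: {mae:.3f}")
print("Best:", best)
```

With this split, the test MAE of the chosen model is an estimate that was not used to pick the parameter.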


## Part 8: Fit the model with the best number of leaf nodes this time

The best result is obtained with 74 leaf nodes. This is the number of leaves that we are going to use to fit our final model on the training set.

In [12]:
# Select the algorithm
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model
final_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=74, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

## Part 9: Make a prediction and check the new MAE

As we have created a new model, it is time to see how much the model benefited from the tuning.

In [13]:
predictions = final_model.predict(X1)
val_mae = mean_absolute_error(df1["Overall"], predictions)
val_mae

2.436140467365015

## Part 10: Train a new algorithm - Random Forest

Let us also see whether another, more flexible algorithm could perform any better on this task.

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# y is a one-column DataFrame; pass it to the forest as a 1-D array
forest_model = RandomForestRegressor(random_state=1, n_estimators=48)
forest_model.fit(X, y.values.ravel())
forest_preds = forest_model.predict(X1)
print(mean_absolute_error(r, forest_preds))


2.0259968576484226


The number of estimators used above (48) gave the best MAE among the values that were tried for the Random Forest. The results per number of estimators are listed below:

In [15]:
"""75: 2.036 
74: 2.033 
73: 2.032 
70: 2.031 
60: 2.030 
30: 2.057 
45: 2.028 
47: 2.029 
48: 2.025 
49: 2.034 
50: 2.035 """

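A list like the one above can be produced with a loop instead of manual runs. The sketch below shows the idea on synthetic data with a held-out validation split. The candidate values and names such as `X_demo` and `results` are illustrative, and the numbers it prints are unrelated to the FIFA results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(300, 4))
y_demo = X_demo.mean(axis=1) + rng.normal(0, 3, size=300)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Fit one forest per candidate and record its validation MAE
results = {}
for n_estimators in [10, 30, 50]:
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=1)
    forest.fit(X_fit, y_fit)
    results[n_estimators] = mean_absolute_error(y_val, forest.predict(X_val))

for n_estimators, mae in sorted(results.items()):
    print(f"{n_estimators}: {mae:.3f}")

best_n = min(results, key=results.get)
print("Best n_estimators:", best_n)
```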