## Model Construction and Training
This notebook documents how the models are constructed and trained on the data. The table of contents of this notebook is given below. <br>

--------
1. Data Preprocessing
2. Model Setup and Training <br>
  2.1 Linear Regression and LASSO <br>
  2.2 Random Forest <br>
  2.3 Neural Network
3. Model Evaluation
4. Model Deployment
5. Insights

--------

### 1 Data Preprocessing
Load the data before constructing the models. The goal of this section is to convert the data to a preferred type and to deal with missing values. 

-------
Here is what I have done in the data preprocessing stage: 
1. Convert the data into a usable form <br>
  1.1 Remove the players whose wage data is unavailable. Since the goal is to predict wage, filling in the target with guesses at this stage is not preferred. <br> 
  1.2 Remove or replace unnecessary symbols such as "€", "K", etc. <br> 
  1.3 Transform data into integer type. For example, the height of a player could be 5'11''. It is converted to inches as an integer ($5\times12+11 = 71$). <br>
2. Replace missing values <br>
  2.1 For attributes that cannot be zero (e.g. Height), the missing values are replaced with the median of the existing values. <br>
  2.2 For attributes that can be zero, like players' skill scores (e.g. GKDiving, or goalkeeper diving), the missing values are replaced with zero. <br>
  
-------
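The conversions and imputations listed above can be sketched in plain Python. This is a minimal illustration rather than the notebook's own pandas code: the helper names `parse_height`, `parse_euro_thousands` and `impute` are hypothetical, missing values are represented as `None`, and treating a value without a suffix as plain euros is an assumption.

```python
from statistics import median

def parse_height(text):
    """Convert a feet'inches string such as "5'11" to total inches."""
    feet, inches = text.split("'")[:2]
    return int(feet) * 12 + int(inches)

def parse_euro_thousands(text):
    """Convert "€565K" or "€226.5M" to a float in thousands of euros."""
    number = text.replace("€", "")
    if number.endswith("M"):
        return float(number[:-1]) * 1000
    if number.endswith("K"):
        return float(number[:-1])
    return float(number) / 1000  # no suffix: assumed to be plain euros

def impute(values, fill):
    """Replace missing entries (None) with a fixed fill value."""
    return [fill if v is None else v for v in values]

height_in = parse_height("5'11")            # 5 * 12 + 11 = 71
wage_k = parse_euro_thousands("€565K")      # 565.0
value_k = parse_euro_thousands("€110.5M")   # 110500.0

# Attributes that cannot be zero: fill with the median of the known values.
heights = [67, None, 74, 71]
known = [h for h in heights if h is not None]
heights_filled = impute(heights, median(known))   # median of 67, 71, 74 is 71

# Attributes that can be zero: fill with zero.
gk_diving = impute([6.0, None, 90.0], 0.0)
```

Keeping wage and value in the same unit (thousands of euros) makes the two columns directly comparable in the summary tables later on.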


In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')



In [2]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_65d4a81f39cd4ebeba4f9b0b9d168ea8 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential redacted; supply your own key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_65d4a81f39cd4ebeba4f9b0b9d168ea8.get_object(Bucket='couseraibmdatascienceproject-donotdelete-pr-fff0ca2ef4bcja',Key='data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# If you are reading an Excel file into a pandas DataFrame, replace `read_csv` by `read_excel` in the next statement.
df = pd.read_csv(body)
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M


In [3]:
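# convert Wage from object (e.g. "€565K") to integer, in thousands of euros
df['Wage'] = df['Wage'].astype(str)
df['Wage'] = df['Wage'].str.replace('€', '')
df['Wage'] = df['Wage'].str.replace('K', '')
df['Wage'] = df['Wage'].astype(int)
# drop players whose wage is 0 (treated as unavailable);
# .copy() gives an independent frame so later assignments do not act on a slice of df
df_1 = df[df.Wage != 0].copy()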

In [4]:
# convert Value from object to int type, in thousands of euros
valuecol = df_1.columns.get_loc('Value')
df_1.loc[:, 'Value'] = df_1.loc[:, 'Value'].astype(str)
df_1['Value'] = df_1['Value'].str.replace('€', '')

# values quoted in thousands ("K"): strip the suffix and keep the number as is
Ks = np.where(df_1['Value'].str.contains('K'))[0]
Ks = np.ndarray.tolist(Ks)
df_1.iloc[Ks,valuecol] = df_1.iloc[Ks,valuecol].str.replace('K', '')

# values quoted in millions ("M"): strip the suffix and multiply by 1000 to get thousands
Ms = np.where(df_1['Value'].str.contains('M'))[0]
Ms = np.ndarray.tolist(Ms)
df_1.iloc[Ms,valuecol] = df_1.iloc[Ms,valuecol].str.replace('M', '')
df_1.iloc[Ms,valuecol] = df_1.iloc[Ms,valuecol].astype(float) * 1000

df_1['Value'] = df_1['Value'].astype(int)

In [5]:
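# convert Height from a feet'inches string (e.g. "5'11") to total inches
df_1copy = df_1.copy()
df_1copy['Height'] = df_1copy['Height'].astype(str)
# row positions whose height is missing ('nan' after the string cast)
NAs = np.where(df_1copy['Height'] == 'nan')
NAs = NAs[0]
feet = df_1copy['Height'].str[0]      # character before the apostrophe
inches = df_1copy['Height'].str[2:]   # characters after the apostrophe
# for 'nan' both slices yield 'n'; map them to '0' so the integer cast succeeds
feet[feet == 'n'] = '0'
inches[inches == 'n'] = '0'
df_1copy['Height'] = feet.astype(int) * 12 + inches.astype(int)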

In [6]:
df_1['Height'] = df_1copy['Height']

In [7]:
hei = df_1.columns.get_loc('Height')
df_1.iloc[np.ndarray.tolist(NAs), hei] = df_1.Height[df_1.Height != 0].median()

In [8]:
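# convert Weight from a string such as "159lbs" to integer pounds
df_1['Weight'] = df_1['Weight'].astype(str)
df_1['Weight'] = df_1['Weight'].str.replace('lbs','')
# mark missing weights as 0, then replace them with the median of the known weights
df_1.loc[df_1['Weight'] == 'nan', 'Weight'] = '0'
df_1['Weight'] = df_1['Weight'].astype(int)
df_1.loc[df_1['Weight'] == 0, 'Weight'] = df_1.loc[df_1['Weight'] != 0, 'Weight'].median()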

In [9]:
records = ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 
         'LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 
         'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 
         'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 
         'GKPositioning', 'GKReflexes']
df_1[records] = df_1[records].fillna(0)

In [10]:
allcol = ['Wage', 'Age', 'Overall', 'Potential', 'Value', 'Height', 'Weight', 'Crossing', 'Finishing', 
              'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 
              'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 
              'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 
              'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 
              'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes' ]
df_1[allcol].head()

Unnamed: 0,Wage,Age,Overall,Potential,Value,Height,Weight,Crossing,Finishing,HeadingAccuracy,...,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
0,565,31,94,94,110500,67.0,159,84.0,95.0,70.0,...,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0
1,405,33,94,94,77000,74.0,183,84.0,94.0,89.0,...,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0
2,290,26,92,93,118500,69.0,150,79.0,87.0,62.0,...,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0
3,260,27,91,93,72000,76.0,168,17.0,13.0,21.0,...,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0
4,355,27,91,92,102000,71.0,154,93.0,82.0,55.0,...,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0


In [11]:
pd.set_option('display.max_columns', None)

In [12]:
df_1[allcol].describe()

Unnamed: 0,Wage,Age,Overall,Potential,Value,Height,Weight,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
count,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0,17966.0
mean,9.86185,25.104976,66.225481,71.317322,2443.033508,71.360069,165.96627,49.615941,45.459368,52.155572,58.556551,42.817433,55.263164,47.089669,42.76745,52.583435,58.255093,64.428921,64.547757,63.353668,61.655349,63.791495,55.342258,64.942503,63.037126,65.149171,47.004397,55.729767,46.566125,49.862184,53.306134,48.414783,58.498553,47.137816,47.556607,45.521262,16.546031,16.323945,16.162362,16.316598,16.639096
std,22.117274,4.674724,6.923435,6.146192,5625.317578,2.646264,15.583305,18.509471,19.62799,17.553427,14.970724,17.803028,19.094622,18.52589,17.599535,15.531239,16.914217,15.295182,15.022704,15.108814,9.560669,14.511473,17.4299,12.289756,16.203545,12.981089,19.379105,17.569552,20.804267,19.665096,14.367725,15.869684,11.800752,19.997571,21.758416,21.372569,17.658866,16.876372,16.467775,16.990492,17.912895
min,1.0,16.0,46.0,48.0,0.0,61.0,110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,21.0,62.0,67.0,325.0,69.0,154.0,38.0,30.0,44.0,53.0,30.0,49.0,34.0,31.0,43.0,54.0,57.0,57.0,55.0,56.0,56.0,45.0,58.0,56.0,58.0,33.0,44.0,26.0,38.0,44.0,39.0,51.0,30.0,26.0,24.0,8.0,8.0,8.0,8.0,8.0
50%,3.0,25.0,66.0,71.0,700.0,71.0,165.0,54.0,49.0,56.0,62.0,44.0,61.0,48.0,41.0,56.0,63.0,67.0,67.0,66.0,62.0,66.0,59.0,66.0,66.0,66.0,51.0,59.0,52.0,55.0,55.0,49.0,59.0,52.0,55.0,52.0,11.0,11.0,11.0,11.0,11.0
75%,9.0,28.0,71.0,75.0,2000.0,73.0,176.0,64.0,62.0,64.0,68.0,57.0,68.0,62.0,56.0,64.0,69.0,75.0,75.0,74.0,68.0,74.0,68.0,73.0,74.0,74.0,62.0,69.0,64.0,64.0,64.0,60.0,67.0,64.0,66.0,64.0,14.0,14.0,14.0,14.0,14.0
max,565.0,45.0,94.0,95.0,118500.0,81.0,243.0,93.0,95.0,94.0,93.0,90.0,97.0,94.0,94.0,93.0,96.0,97.0,96.0,96.0,96.0,96.0,95.0,95.0,96.0,97.0,94.0,95.0,92.0,95.0,94.0,92.0,96.0,94.0,93.0,91.0,90.0,92.0,91.0,90.0,94.0


In [13]:
df_1[allcol].corr()

Unnamed: 0,Wage,Age,Overall,Potential,Value,Height,Weight,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
Wage,1.0,0.143773,0.57615,0.488808,0.85808,0.019966,0.065594,0.234813,0.219319,0.190332,0.296287,0.259252,0.238315,0.260829,0.238217,0.27791,0.277971,0.127312,0.132349,0.157302,0.477211,0.091619,0.259531,0.129796,0.179546,0.14025,0.251102,0.196484,0.159995,0.228335,0.314749,0.224685,0.413919,0.148582,0.129007,0.1138,-0.024234,-0.023797,-0.026884,-0.02402,-0.024625
Age,0.143773,1.0,0.452519,-0.253846,0.077141,0.08198,0.229551,0.130518,0.069622,0.147051,0.132018,0.142423,0.011541,0.143599,0.193484,0.180289,0.085402,-0.154191,-0.146779,-0.017642,0.427635,-0.08689,0.156358,0.172238,0.097799,0.323395,0.155028,0.263356,0.197649,0.083854,0.185797,0.138605,0.37993,0.143118,0.119987,0.103082,0.099971,0.105088,0.103533,0.115344,0.102066
Overall,0.57615,0.452519,1.0,0.660605,0.63157,0.038444,0.154119,0.396615,0.335101,0.342988,0.499904,0.392951,0.374591,0.420967,0.398798,0.482774,0.460365,0.199887,0.213534,0.26674,0.812473,0.108357,0.442167,0.263994,0.365773,0.347049,0.422599,0.396327,0.324114,0.359001,0.49763,0.342878,0.71342,0.289565,0.255716,0.225634,-0.024982,-0.024332,-0.028483,-0.016577,-0.02243
Potential,0.488808,-0.253846,0.660605,1.0,0.57917,-0.009818,-0.00753,0.249085,0.245919,0.204485,0.368916,0.257102,0.317479,0.281499,0.232954,0.322378,0.355717,0.238047,0.239984,0.225557,0.496197,0.14363,0.290572,0.114496,0.205499,0.082126,0.268951,0.173991,0.158222,0.248104,0.348328,0.227321,0.434466,0.166234,0.147085,0.132381,-0.051163,-0.052652,-0.056843,-0.050387,-0.051285
Value,0.85808,0.077141,0.63157,0.57917,1.0,0.002879,0.046616,0.251745,0.258606,0.186815,0.326873,0.290156,0.273076,0.288518,0.267599,0.303158,0.308977,0.172122,0.173931,0.194676,0.519866,0.115995,0.282451,0.124827,0.212278,0.130114,0.281763,0.186601,0.143222,0.260952,0.356646,0.241207,0.443918,0.136821,0.111076,0.090444,-0.027272,-0.027555,-0.029515,-0.026458,-0.027211
Height,0.019966,0.08198,0.038444,-0.009818,0.002879,1.0,0.754278,-0.481792,-0.36669,0.013868,-0.357552,-0.345871,-0.486929,-0.436651,-0.400265,-0.325218,-0.408868,-0.531553,-0.452834,-0.60622,-0.017717,-0.762855,-0.2863,-0.06548,-0.28067,0.519396,-0.377344,-0.042109,-0.049344,-0.430577,-0.360994,-0.335078,-0.129665,-0.072609,-0.05787,-0.066151,0.360583,0.360914,0.358906,0.362008,0.362594
Weight,0.065594,0.229551,0.154119,-0.00753,0.046616,0.754278,1.0,-0.389106,-0.289526,0.037412,-0.283021,-0.260315,-0.408437,-0.34268,-0.301647,-0.255799,-0.331395,-0.465779,-0.39984,-0.521236,0.082039,-0.645334,-0.187543,0.009483,-0.216783,0.596124,-0.275134,0.031942,-0.024375,-0.346132,-0.277647,-0.249198,-0.032196,-0.047815,-0.045426,-0.054921,0.338844,0.338018,0.336873,0.341163,0.340019
Crossing,0.234813,0.130518,0.396615,0.249085,0.251745,-0.481792,-0.389106,1.0,0.661233,0.479991,0.813317,0.695608,0.859764,0.837136,0.765394,0.762133,0.844315,0.675733,0.653687,0.704819,0.410975,0.627755,0.712821,0.167838,0.679565,0.007803,0.746821,0.485786,0.43686,0.787018,0.692419,0.653712,0.587027,0.452757,0.437817,0.418693,-0.648413,-0.645549,-0.644819,-0.645475,-0.648011
Finishing,0.219319,0.069622,0.335101,0.245919,0.258606,-0.36669,-0.289526,0.661233,1.0,0.483089,0.667675,0.884473,0.826993,0.762459,0.701886,0.522642,0.791308,0.613557,0.601499,0.65045,0.351554,0.5338,0.818319,0.126284,0.520971,0.02201,0.879485,0.259058,-0.006059,0.890732,0.702071,0.840845,0.54309,0.039418,-0.018439,-0.057634,-0.577767,-0.576122,-0.572268,-0.573927,-0.576336
HeadingAccuracy,0.190332,0.147051,0.342988,0.204485,0.186815,0.013868,0.037412,0.479991,0.483089,1.0,0.650725,0.515015,0.561127,0.451824,0.41887,0.523571,0.667841,0.349998,0.398406,0.283049,0.357295,0.196558,0.622334,0.402972,0.644132,0.505381,0.51629,0.701324,0.556068,0.543404,0.296463,0.562427,0.525061,0.590872,0.568036,0.540439,-0.732289,-0.731461,-0.72772,-0.725979,-0.730731


#### Assessing Attributes Through Correlation
In the correlation matrix above, one may observe that the following pairs of attributes are highly correlated (approximately 0.90 or above). <br>
Corr(Dribbling, BallControl) = 0.94 <br>
Corr(Acceleration, SprintSpeed) = 0.93 <br>
Corr(StandingTackle, Interceptions) = 0.94, Corr(SlidingTackle, Interceptions) = 0.93, Corr(Marking, StandingTackle) = 0.90, Corr(Marking, SlidingTackle) = 0.90, Corr(Marking, Interceptions) = 0.89, Corr(StandingTackle, SlidingTackle) = 0.97 <br>
The pairwise correlations among "GKDiving", "GKHandling", "GKKicking", "GKPositioning" and "GKReflexes" are all above 0.90. <br>
<br>
Within each group of highly correlated attributes, only one is kept. The following attributes are therefore removed. <br>
"BallControl", "SprintSpeed", "SlidingTackle", "Interceptions", "Marking", "GKHandling", "GKKicking", "GKPositioning", "GKReflexes". 

In [14]:
attributes = ['Age', 'Overall', 'Potential', 'Value', 'Height', 'Weight', 'Crossing', 'Finishing', 
              'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 
              'Acceleration', 'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 
              'LongShots', 'Aggression', 'Positioning', 'Vision', 'Penalties', 'Composure', 'StandingTackle', 'GKDiving']

In [15]:
df_1[attributes].shape

(17966, 31)
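The pruning rule used in this section (keep one attribute from each highly correlated group) can be sketched as a greedy pass over the columns. The sketch below is an illustration only: the notebook chose the attributes to drop by hand, the toy numbers are made up, and `pearson` and `prune_correlated` are hypothetical helpers that assume no column is constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def prune_correlated(columns, threshold=0.90):
    """Keep a column only if its absolute correlation with every
    previously kept column is below the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up scores for five players.
toy = {
    "Dribbling":   [60, 70, 80, 90, 50],
    "BallControl": [62, 71, 79, 91, 52],   # nearly a copy of Dribbling
    "Strength":    [80, 60, 75, 55, 70],
}
kept = prune_correlated(toy)
print(kept)
```

The result depends on column order, because the first member of each correlated group is the one that survives. That is why a manual choice, as made in this notebook, can be preferable when one attribute in a group is easier to interpret.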

#### Training/Test Split and Rescaling
After removing correlated attributes, the dataset is divided into a training set and a test set. The point is to train the model on one set and test its performance on data it has not seen. <br> 
P.S. Note that some tools in Python also give access to cross-validation scores (for example, the grid search used for LASSO below). Those scores are helpful as well, but most of the time the score on the test set is the better estimate of the model's performance on unseen data. <br>
<br>
There are two common ways to rescale the data set: standardization and normalization. Standardization rescales each attribute so that it has a mean of 0 and a standard deviation of 1; normalization rescales each attribute to the range between 0 and 1. In Python, one may use `StandardScaler()` to standardize the data and `MinMaxScaler()` to normalize it. It is difficult to tell in advance which method is better. This project sticks with standardization. <br>
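To make the two rescaling formulas concrete, here is a small sketch in plain Python. The ages are made up and the helper names are hypothetical; it illustrates the arithmetic and is not a replacement for the sklearn scalers. The scaling parameters are estimated on the training values only and then reused on the test values, which is the same pattern as calling `fit_transform` on the training set and `transform` on the test set.

```python
from statistics import mean, pstdev

def fit_standardizer(train):
    """Mean and population standard deviation of the training values."""
    return mean(train), pstdev(train)

def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

def fit_minmax(train):
    """Minimum and maximum of the training values."""
    return min(train), max(train)

def minmax(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

train_age = [20, 24, 28, 32]
test_age = [26, 36]

mu, sigma = fit_standardizer(train_age)          # mean 26, std sqrt(20)
train_std = standardize(train_age, mu, sigma)    # mean 0, std 1 on the training set
test_std = standardize(test_age, mu, sigma)      # reuses the training mean and std

lo, hi = fit_minmax(train_age)                   # 20 and 32
test_mm = minmax(test_age, lo, hi)               # 36 lies outside the training range

print(test_std, test_mm)
```

A test value outside the training range maps above 1 under min-max scaling, which is one reason the choice between the two methods is not clear-cut.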

In [16]:
x_tr, x_test, y_tr, y_test = train_test_split(df_1[attributes], df_1['Wage'], test_size = 0.33)

In [17]:
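# standardize features and target with separate scalers, each fit on the training set only
x_scaler = StandardScaler()
x_tr_std = x_scaler.fit_transform(x_tr)
x_test_std = x_scaler.transform(x_test)
# `scaler` is fit on the training wages, so it can map predictions back to the original scale
scaler = StandardScaler()
y_tr_std = scaler.fit_transform(np.array(y_tr).reshape(-1, 1))
y_test_std = scaler.transform(np.array(y_test).reshape(-1, 1))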

#### Introducing the metric measuring the performance of the model
The performance of each model on unseen data is evaluated with the Mean Squared Error (MSE). It measures how far the predicted values are from the actual values; mathematically, it is the mean of the squared errors. The formula is given below. A detailed discussion of MSE is in the model evaluation chapter. <br>
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ <br>
P.S. Some might argue that AIC (Akaike Information Criterion) is a better criterion for linear regression models. However, it is very difficult to obtain the number of estimated parameters in some models (like random forest), and a consistent metric is needed to evaluate all models. The reasons for selecting the model algorithms are explained in the next section.  
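As a worked example of the formula, the sketch below computes the MSE for four made-up wage values. It also shows what happens when the target is standardized, as it is in this notebook: the MSE is divided by the square of the standard deviation used for scaling. The values of `mu` and `sigma` are assumptions chosen to keep the arithmetic exact.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

actual = [3.0, 9.0, 1.0, 7.0]
predicted = [4.0, 7.0, 1.0, 10.0]
# errors: -1, 2, 0, -3; squares: 1, 4, 0, 9; mean: 14 / 4 = 3.5
mse_raw = mse(actual, predicted)

# Standardize both series with the same (assumed) mean and standard deviation.
mu, sigma = 5.0, 2.0
actual_std = [(a - mu) / sigma for a in actual]
predicted_std = [(p - mu) / sigma for p in predicted]
mse_std = mse(actual_std, predicted_std)   # 3.5 / sigma**2 = 0.875

print(mse_raw, mse_std)
```

Read this way, an MSE reported on the standardized wage is a fraction of the training-set wage variance, and multiplying it by that variance gives the MSE in the original wage units.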

### 2 Model Construction
The following models are considered: <br>

-----
1. Linear Regression & LASSO<br>
2. Random Forest Regression (or Regression Forest) <br>
3. Neural Network

------

The reason for selecting linear regression is straightforward. If the relationship between the attributes and wage is approximately linear, then linear regression predicts wage effectively. <br>
However, if the relationship is not linear, random forest regression is more effective at capturing the non-linearity. It also serves as the representative of machine learning models in this comparison. <br>
A neural network is a deep learning model that is strong at prediction. However, it is not very interpretable. 

### 2.1 Linear Regression & LASSO
The project begins by fitting a basic linear regression model on the training set. 

In [18]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lmmodel_1 = lm.fit(x_tr_std, y_tr_std)

In [19]:
coefficients = lmmodel_1.coef_
intercept = lmmodel_1.intercept_
print('Coefficients:', coefficients)
print('Intercept:', intercept)

Coefficients: [[ 0.12054628 -0.07776646  0.0564994   0.87766307  0.02947106  0.01557953
   0.05027061 -0.03386721  0.03004789 -0.00397389  0.00931532  0.01476425
   0.00853067 -0.04595853 -0.01024697 -0.00931594  0.00299173 -0.01880163
   0.02874845  0.01996634  0.00861733 -0.05090852 -0.02381851  0.0083552
   0.00752086  0.0122537  -0.00570033  0.02684008 -0.00150002  0.03922478
   0.03234762]]
Intercept: [2.98898262e-17]


In [20]:
y_test_pred1 = lmmodel_1.predict(x_test_std)

In [21]:
from sklearn.metrics import mean_squared_error
print('MSE:', mean_squared_error(y_test_std, y_test_pred1))

MSE: 0.2403890390313032


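LASSO is a regression method that performs both variable selection and regularization. Compared with ordinary least squares, LASSO adds a penalty proportional to the sum of the absolute values of the coefficients (controlled by `alpha`), which shrinks the coefficients toward zero and can set some of them exactly to zero. 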
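The effect of the penalty is easiest to see with a single feature. sklearn's `Lasso` minimizes $\frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$. With one feature that has mean 0 and mean square 1, and a centered target, the minimizer is the least-squares coefficient passed through a soft-threshold. The sketch below covers only this special case with made-up data; the general multi-feature problem is solved iteratively, and `lasso_1d` is a hypothetical helper.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return exactly zero if |rho| <= alpha."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_1d(x, y, alpha):
    """Lasso coefficient for one feature, assuming x has mean 0 and
    mean square 1, and y has mean 0 (so no intercept is needed)."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n   # least-squares coefficient
    return soft_threshold(rho, alpha)

x = [-1.0, 1.0, -1.0, 1.0]
y = [-0.75, 0.25, -0.25, 0.75]     # least-squares coefficient is 0.5

for alpha in (0.0, 1e-10, 0.25, 1.0):
    print(alpha, lasso_1d(x, y, alpha))
```

With a very small `alpha` the coefficient is practically the least-squares one, while a large `alpha` removes the feature altogether.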

In [22]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

In [28]:
lasso = Lasso()
param = {'alpha': [1e-20, 1e-10, 1e-3, 1e-1, 1, 5, 20]}
lasso_regressor = GridSearchCV(lasso, param, scoring = 'neg_mean_squared_error', cv = 5)
lasso_regressor.fit(x_tr_std, y_tr_std)
print('Best alpha:', lasso_regressor.best_params_)

Best alpha: {'alpha': 1e-10}


In [29]:
lasso_coeff = Lasso(alpha = 1e-10)
lasso_coeff = lasso_coeff.fit(x_tr_std, y_tr_std)
y_test_pred_lasso_coeff = lasso_coeff.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_pred_lasso_coeff))
print('Coefficients', lasso_coeff.coef_, 'Intercept:', lasso_coeff.intercept_)

MSE: 0.24038903897495603
Coefficients [ 0.12054628 -0.07776645  0.0564994   0.87766307  0.02947106  0.01557953
  0.05027061 -0.03386721  0.03004789 -0.00397388  0.00931532  0.01476425
  0.00853067 -0.04595853 -0.01024697 -0.00931593  0.00299173 -0.01880163
  0.02874845  0.01996633  0.00861733 -0.05090852 -0.02381851  0.00835519
  0.00752086  0.0122537  -0.00570033  0.02684008 -0.00150002  0.03922478
  0.03234762] Intercept: [2.98898267e-17]


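The test MSE of LASSO is smaller than that of plain linear regression only from the tenth decimal place onward (0.24038903897 versus 0.24038903903). The selected alpha of 1e-10 makes the penalty negligible, so the coefficients are almost identical to the least-squares ones. We prefer LASSO and carry it forward as the linear baseline, although the two models are equivalent in practice here. 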

### 2.2 Random Forest
Random forest is a machine learning method built on decision trees. It is usually more accurate than a single decision tree because of bagging: many trees are trained on bootstrap samples of the data and their predictions are averaged. For random forest, we will not use GridSearchCV, which takes too long to complete because it searches all combinations exhaustively. Instead, I fit a handful of hand-picked hyperparameter settings on the training ("tr") set and compare them by MSE. Note that the cells below compute this MSE on the test set rather than on a separate validation set. The best setting is the one with the smallest MSE. 
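Bagging itself can be sketched in a few lines. The example below uses one-split regression trees (stumps) on made-up one-dimensional data as a stand-in for full decision trees; all function names are hypothetical. `RandomForestRegressor` additionally considers only a random subset of features at each split (`max_features`), which this sketch does not model.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the squared error.
    Returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:                      # all x identical: constant prediction
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def fit_bagged(xs, ys, n_estimators, seed=0):
    """Train each stump on a bootstrap sample (n rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return ensemble

def predict_bagged(ensemble, x):
    """Average the predictions of all stumps."""
    return mean(predict_stump(s, x) for s in ensemble)

xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5              # a step from 0 to 1 at x = 5
ensemble = fit_bagged(xs, ys, n_estimators=25)
print(predict_bagged(ensemble, 0), predict_bagged(ensemble, 9))
```

Each stump sees a slightly different sample, so individual split points vary; averaging them smooths out that variation, which is the source of the accuracy gain over a single tree.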

In [30]:
from sklearn.ensemble import RandomForestRegressor
rf_1 = RandomForestRegressor(n_estimators = 100, random_state = 0)
rf_1 = rf_1.fit(x_tr_std, y_tr_std)
y_test_rfpred1 = rf_1.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred1))

MSE: 0.2132910756302537


In [31]:
rf_2 = RandomForestRegressor(n_estimators = 100, max_depth = 10, max_features = 'log2', min_samples_split = 3, min_samples_leaf = 2, random_state = 0)
rf_2 = rf_2.fit(x_tr_std, y_tr_std)
y_test_rfpred2 = rf_2.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred2))

MSE: 0.21245407062496136


In [32]:
rf_3 = RandomForestRegressor(n_estimators = 1000, max_depth = 10, max_features = 'log2', random_state = 0)
rf_3 = rf_3.fit(x_tr_std, y_tr_std)
y_test_rfpred3 = rf_3.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred3))

MSE: 0.2102413359872055


In [33]:
rf_4 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 'log2', random_state = 0)
rf_4 = rf_4.fit(x_tr_std, y_tr_std)
y_test_rfpred4 = rf_4.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred4))

MSE: 0.20991229817069135


In [34]:
rf_5 = RandomForestRegressor(n_estimators = 100, max_depth = 30, max_features = 'log2', random_state = 0)
rf_5 = rf_5.fit(x_tr_std, y_tr_std)
y_test_rfpred5 = rf_5.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred5))

MSE: 0.21097812634303278


In [35]:
rf_6 = RandomForestRegressor(n_estimators = 100, max_depth = 40, max_features = 'log2', random_state = 0)
rf_6 = rf_6.fit(x_tr_std, y_tr_std)
y_test_rfpred6 = rf_6.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred6))

MSE: 0.2122553073700639


In [36]:
rf_7 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 20, random_state = 0)
rf_7 = rf_7.fit(x_tr_std, y_tr_std)
y_test_rfpred7 = rf_7.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred7))

MSE: 0.20629759008702017


In [37]:
rf_8 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 30, random_state = 0)
rf_8 = rf_8.fit(x_tr_std, y_tr_std)
y_test_rfpred8 = rf_8.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred8))

MSE: 0.2101805854612528


In [39]:
rf_9 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 31, random_state = 0)
rf_9 = rf_9.fit(x_tr_std, y_tr_std)
y_test_rfpred9 = rf_9.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred9))

MSE: 0.2129534757725675


In [40]:
rf_10 = RandomForestRegressor(n_estimators = 500, max_depth = 20, max_features = 31, random_state = 0)
rf_10 = rf_10.fit(x_tr_std, y_tr_std)
y_test_rfpred10 = rf_10.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred10))

MSE: 0.20874740222698737


Based on MSE, we use the following hyperparameters. 

-----
n_estimators (number of trees in the forest) = 100 <br>
max_depth (maximum depth of the tree) = 20 <br>
max_features (the number of features to consider when looking for the best split) = 20 <br>

------

The MSE is approximately 0.206, which is lower than the MSE of LASSO (approximately 0.240). So random forest regression predicts better than linear regression with LASSO regularization on this test set. 
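The comparison can be reproduced from the numbers printed above. The sketch below copies the test MSE values from the cell outputs, picks the smallest one, and expresses the gain over LASSO as a fraction; the dictionary keys are just labels that mirror the variable names in the cells.

```python
# Test-set MSE values copied from the cell outputs above.
test_mse = {
    "linear": 0.2403890390313032,
    "lasso": 0.24038903897495603,
    "rf_1": 0.2132910756302537,
    "rf_2": 0.21245407062496136,
    "rf_3": 0.2102413359872055,
    "rf_4": 0.20991229817069135,
    "rf_5": 0.21097812634303278,
    "rf_6": 0.2122553073700639,
    "rf_7": 0.20629759008702017,
    "rf_8": 0.2101805854612528,
    "rf_9": 0.2129534757725675,
    "rf_10": 0.20874740222698737,
}

best_name = min(test_mse, key=test_mse.get)
relative_gain = (test_mse["lasso"] - test_mse[best_name]) / test_mse["lasso"]

# Range of the ten random forest results, for a sense of scale.
rf_scores = [v for k, v in test_mse.items() if k.startswith("rf_")]
rf_spread = max(rf_scores) - min(rf_scores)

print(best_name, round(relative_gain, 3), round(rf_spread, 4))
```

The gap between random forest and LASSO is several times larger than the spread among the random forest settings themselves. Because all of these scores come from a single train/test split, small differences between forest settings should be read with caution.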

### 2.3 Neural Network
A neural network is a deep learning method that is often effective for prediction. Compared with the two previous models (regression, random forest), a neural network is more difficult to interpret. The following cells examine this model. 

In [42]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

In [46]:
nnmodel = Sequential()
nnmodel.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel.add(Dense(1, kernel_initializer = 'normal'))
nnmodel.compile(loss = 'mean_squared_error', optimizer = 'adam')
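
For reference, the network defined in the cell above has 31 inputs, one hidden layer of 30 ReLU units and a single linear output. What such a model computes, and how many trainable parameters it has, can be sketched in plain Python. The tiny weights below are made up for illustration, and `dense`, `relu` and `count_params` are hypothetical helpers, not Keras functions.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: output_j = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * w_row[j] for x, w_row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]

def count_params(layer_sizes):
    """Weights plus biases for consecutive dense layers, e.g. [31, 30, 1]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hand-made network: 2 inputs -> 2 hidden ReLU units -> 1 linear output.
w1 = [[1.0, -1.0], [0.5, 2.0]]    # w1[i][j]: weight from input i to hidden unit j
b1 = [0.0, -1.0]
w2 = [[2.0], [1.0]]
b2 = [0.5]

x = [1.0, 0.0]
hidden = relu(dense(x, w1, b1))   # pre-activations [1.0, -2.0] become [1.0, 0.0]
output = dense(hidden, w2, b2)    # 1.0 * 2.0 + 0.0 * 1.0 + 0.5 = 2.5

print(output, count_params([31, 30, 1]))
```

For the layer sizes 31, 30 and 1, the count is 31 * 30 + 30 weights and biases in the hidden layer plus 30 + 1 in the output layer, 991 in total.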

In [47]:
nnmodel.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc6e3198a58>

In [48]:
y_test_nn1 = nnmodel.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn1.reshape(5929,)))

MSE: 0.21711179263820113


In [49]:
nnmodel11 = Sequential()
nnmodel11.add(Dense(40, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel11.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel11.compile(loss = 'mean_squared_error', optimizer = 'adam')
nnmodel11.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc6c8350d68>

In [50]:
y_test_nn11 = nnmodel11.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn11.reshape(5929,)))

MSE: 0.22494578974951424


In [51]:
nnmodel12 = Sequential()
nnmodel12.add(Dense(20, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel12.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel12.compile(loss = 'mean_squared_error', optimizer = 'adam')
nnmodel12.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc6c834ecf8>

In [52]:
y_test_nn12 = nnmodel12.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn12.reshape(5929,)))

MSE: 0.22823854243098654


In [54]:
nnmodel2 = Sequential()
nnmodel2.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel2.add(Dropout(0.4))
nnmodel2.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel2.add(Dropout(0.4))
nnmodel2.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel2.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [55]:
nnmodel2.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc6a4122320>

In [56]:
y_test_nn2 = nnmodel2.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn2.reshape(5929,)))

MSE: 0.24032697966728822


In [57]:
nnmodel21 = Sequential()
nnmodel21.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel21.add(Dropout(0.2))
nnmodel21.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel21.add(Dropout(0.2))
nnmodel21.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel21.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])
nnmodel21.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc61419ce48>

In [58]:
y_test_nn21 = nnmodel21.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn21.reshape(5929,)))

MSE: 0.21589163515140736


In [59]:
nnmodel22 = Sequential()
nnmodel22.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel22.add(Dropout(0.4))
nnmodel22.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel22.add(Dropout(0.2))
nnmodel22.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel22.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])
nnmodel22.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5d175d908>

In [60]:
y_test_nn22 = nnmodel22.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn22.reshape(5929,)))

MSE: 0.2166728872335843


Note on choosing batch size. <br>
Typically, a larger batch size tends to give slightly less accurate predictions, while a very small batch size slows down the training of the neural network. In the runs above, increasing the batch size from 32 to 64 does not clearly improve the MSE. Note that those runs also differ in layer width or dropout rate, so the comparison is not controlled. Here I take a batch size of 32 as the default for the model construction below. 
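The speed side of this trade-off comes down to simple arithmetic: the batch size fixes how many weight updates happen per epoch. The sketch below uses the row counts that appear in this notebook (17966 players in total, 5929 of them in the test set); `updates_per_epoch` is a hypothetical helper and assumes the last, partial batch still counts as one update.

```python
from math import ceil

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the data,
    counting the final partial batch."""
    return ceil(n_samples / batch_size)

n_total = 17966
n_test = 5929
n_train = n_total - n_test        # 12037 training rows
epochs = 10

for batch_size in (32, 64):
    per_epoch = updates_per_epoch(n_train, batch_size)
    print(batch_size, per_epoch, per_epoch * epochs)
```

Doubling the batch size roughly halves the number of updates over the same 10 epochs, so a model trained with batch size 64 has had about half as many chances to adjust its weights.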

In [61]:
nnmodel3 = Sequential()
nnmodel3.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel3.add(Dropout(0.4))
nnmodel3.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel3.add(Dropout(0.4))
nnmodel3.add(Dense(10, kernel_initializer = 'normal', activation = 'relu'))
nnmodel3.add(Dropout(0.4))
nnmodel3.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel3.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [62]:
nnmodel3.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc61421c978>

In [63]:
y_test_nn3 = nnmodel3.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn3.reshape(5929,)))

MSE: 0.2321377457424366


In [64]:
nnmodel31 = Sequential()
nnmodel31.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel31.add(Dropout(0.2))
nnmodel31.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel31.add(Dropout(0.2))
nnmodel31.add(Dense(10, kernel_initializer = 'normal', activation = 'relu'))
nnmodel31.add(Dropout(0.2))
nnmodel31.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel31.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])
nnmodel31.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5d0925278>

In [65]:
y_test_nn31 = nnmodel31.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn31.ravel()))  # flatten (n, 1) predictions without hard-coding n

MSE: 0.2138056078898866


In [66]:
nnmodel32 = Sequential()
nnmodel32.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel32.add(Dropout(0.4))
nnmodel32.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel32.add(Dropout(0.3))
nnmodel32.add(Dense(10, kernel_initializer = 'normal', activation = 'relu'))
nnmodel32.add(Dropout(0.2))
nnmodel32.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel32.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])
nnmodel32.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5d0148ef0>

In [67]:
y_test_nn32 = nnmodel32.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn32.ravel()))  # flatten (n, 1) predictions without hard-coding n

MSE: 0.21553163663522576


In [68]:
nnmodel4 = Sequential()
nnmodel4.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(30, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(10, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel4.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [69]:
nnmodel4.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5d0c4ab70>

In [70]:
y_test_nn4 = nnmodel4.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn4.ravel()))  # flatten (n, 1) predictions without hard-coding n

MSE: 0.24058116647477373


In [71]:
nnmodel41 = Sequential()
nnmodel41.add(Dense(30, input_dim = 31, kernel_initializer = 'normal', activation = 'relu'))
nnmodel41.add(Dropout(0.2))
nnmodel41.add(Dense(30, kernel_initializer = 'normal', activation = 'relu'))
nnmodel41.add(Dropout(0.2))
nnmodel41.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel41.add(Dropout(0.2))
nnmodel41.add(Dense(10, kernel_initializer = 'normal', activation = 'relu'))
nnmodel41.add(Dropout(0.2))
nnmodel41.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel41.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])
nnmodel41.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5b3275c18>

In [72]:
y_test_nn41 = nnmodel41.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn41.ravel()))  # flatten (n, 1) predictions without hard-coding n

MSE: 0.23246362675952806

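To make the comparison explicit, the test MSEs printed in the cells above can be collected and sorted. This is only a convenience sketch: the values are copied from the outputs and rounded to four decimals, and the dictionary name `test_mse` is my own. Each value comes from a single training run, so small differences between models may partly reflect run-to-run randomness.

```python
# Test MSEs copied from the cell outputs above (rounded to 4 decimals)
test_mse = {
    "nnmodel22": 0.2167,
    "nnmodel3": 0.2321,
    "nnmodel31": 0.2138,
    "nnmodel32": 0.2155,
    "nnmodel4": 0.2406,
    "nnmodel41": 0.2325,
}

# Print the models from lowest (best) to highest test MSE
for name, mse in sorted(test_mse.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mse:.4f}")

# Name of the model with the lowest test MSE
best_name = min(test_mse, key=test_mse.get)
print("Best:", best_name)
```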

As observed above, adding more layers does not necessarily lead to better predictions. The model with the lowest test MSE is neural network model 31 (`nnmodel31`), with an MSE of approximately 0.214. It is trained with batch size 64 and has the following layers: 

| Layer | # of Inputs | # of Outputs | Activation Function | Dropout Rate |
|------|------|------|------|------|
| 1 | 31 | 30 | ReLU | 0.2 |
| 2 | 30 | 20 | ReLU | 0.2 |
| 3 | 20 | 10 | ReLU | 0.2 |
| 4 | 10 | 1 | Linear| N/A |

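As a rough sense of the model's size, the number of trainable parameters implied by the table can be worked out by hand. The sketch below assumes each dense layer has a bias term (the Keras default, which the code above does not override); dropout layers add no parameters. The helper name `dense_params` is my own.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Layer widths from the table: 31 inputs, then 30, 20, 10 units and 1 output
layer_sizes = [31, 30, 20, 10, 1]

# Pair each layer's input width with its output width
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # parameters per dense layer
print(total)      # total trainable parameters
```

With these widths the four layers have 960, 620, 210 and 11 parameters, or 1,801 in total, so this is a small network.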


### 3 Model Evaluation
Models are evaluated by their mean squared error (MSE) on the test set, which shows how each trained model performs on unseen data. The MSE is the mean of the squared errors and measures how far the predictions are from the actual values. It is given by: <br>
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ <br>
where $Y_i$ is the actual value, $\hat{Y}_i$ is the predicted value and $n$ is the number of test samples. <br>
The following table compares the performance of the different models: 

| Model | Test MSE | Hyperparameters/Layers | Interpretability | 
|------|------|------|------| 
| Linear regression with LASSO regularization | $\approx 0.240$ | $\alpha = 10^{-10}$ | Coefficients can be used to interpret feature importance |
| Random Forest | $\approx 0.206$ | # of trees in forest = 500, maximum depth of tree = 20, maximum features = 20 | Low |
| Neural Network | $\approx 0.214$ | 4 layers in total (31 inputs > 30 > 20 > 10 > 1 output) | Low |

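The MSE formula above can be sketched in plain Python as follows. This is only an illustration of the formula, not the `mean_squared_error` function from scikit-learn used in the notebook. The function name and the three-point example are made up for the demonstration, where the squared errors are 0.25, 0 and 1, so the MSE is their mean.

```python
def mse_from_formula(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Made-up values for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

mse = mse_from_formula(y_true, y_pred)
print(mse)  # (0.25 + 0 + 1) / 3
```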
The random forest model gives the best predictions on the unseen dataset. With this model, one can predict a player's wage from the player's profile with the lowest MSE of the three models (about 0.206). If the profile has missing data, it can be handled as described in the data preprocessing section before making the prediction. Club owners and managers can then decide on a player's wage based on the player's attributes, which gives them a data-driven reference for negotiating without underpaying or overpaying. <br>
However, neither the random forest nor the neural network model gives an intuitive understanding of what drives FIFA players' wages. It would be good to know which features of the players affect their wages more than others. This analysis of feature importance is discussed in the next section.  

### 4 Model Deployment
In the previous section, I discussed which model to use to predict players' wages. Now let's turn to the more intuitive side: which features affect the wage the most.

| Feature | Description | Coefficient | 
|------|------|------|
| Age | Age | $\approx 0.121$ | 
| Value | Current Market Value | $\approx 0.877$ | 
| Volleys | Rating on Volleys | $\approx 0.00932$ |

Based on the LASSO coefficients, the table above lists the top three features affecting the wage of a FIFA player: the player's age, the player's current market value and the player's volleys rating. Although LASSO has the highest MSE of the three models, its coefficients are easy to interpret. When estimating a player's wage, one should use these three features together rather than any single feature. Clubs looking to recruit can use these three features to estimate players' wages and check whether a player would be a good match. 

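As a rough illustration of how these coefficients can be read, the sketch below ranks the three features by coefficient size and computes their combined contribution for one player. It assumes the LASSO model was fit on standardized features and a standardized wage (as the `_std` variable names in this notebook suggest). It ignores the intercept and all other features, and the player values are hypothetical.

```python
# Coefficients copied from the table above
lasso_coef = {"Age": 0.121, "Value": 0.877, "Volleys": 0.00932}

# Rank features by the absolute size of their coefficient
ranked = sorted(lasso_coef.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef}")

def partial_wage_score(std_features):
    """Contribution of the three features to the (standardized) wage prediction."""
    return sum(lasso_coef[name] * std_features[name] for name in lasso_coef)

# Hypothetical player, each feature given in standard deviations from the mean
player = {"Age": 0.5, "Value": 2.0, "Volleys": 1.0}
score = partial_wage_score(player)
print(score)
```

In this example almost all of the score comes from the market value term, which is consistent with Value having by far the largest coefficient in the table.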

### 5 Insights
Club owners and managers who are looking to recruit want the player that matches their club best. While players with better performance and ability are always preferred, clubs cannot afford a player whose wage is beyond their budget. So this project is useful for soccer club owners in two ways: <br>

----
1. During the early recruitment stage, use the feature importance analysis from the previous section to roughly estimate the wages of all players and look for the most appropriate candidates in terms of both affordability and performance.
2. Once an appropriate candidate is found, the club may use the random forest model to obtain the most accurate wage prediction and gain an advantage during the negotiations. 

-----
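The first step above can be sketched as a simple shortlist: keep the candidates whose estimated wage fits the budget, then rank them by performance. All names, wages, ratings and the budget below are hypothetical, and the tuple layout is only an assumption made for this illustration.

```python
# Hypothetical candidates: (name, estimated wage in thousand EUR, overall rating)
candidates = [
    ("Player A", 40, 78),
    ("Player B", 95, 86),
    ("Player C", 55, 82),
    ("Player D", 30, 74),
]

def shortlist(candidates, wage_budget):
    """Keep affordable candidates and sort them by rating, best first."""
    affordable = [c for c in candidates if c[1] <= wage_budget]
    return sorted(affordable, key=lambda c: c[2], reverse=True)

# Hypothetical wage budget in thousand EUR
picks = shortlist(candidates, wage_budget=60)
for name, wage, rating in picks:
    print(name, wage, rating)
```

The shortlisted players would then go through the second step, where the random forest model gives a more precise wage prediction for each of them.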

In short, predicting the wages of FIFA players helps club owners and managers get the most out of their recruitment budget. <br>
<br>
There is also a direction for future research. Soccer players play in different positions, and different positions may value different attributes. A further study could split the players' data by position to gain a better understanding of players' wages. 
