## Model Construction and Training
This notebook documents how the models are constructed and trained on the data. The table of contents is as follows. <br>

--------
1. Data Preprocessing
2. Model Setup and Training <br>
  2.1 Linear Regression and LASSO <br>
  2.2 Random Forest <br>
  2.3 Neural Network
3. Model Evaluation
4. Model Deployment
5. Insights

--------

### 1 Data Preprocessing
Load the data before constructing the models. The goal of this section is to convert the data to the preferred types and to deal with missing values. 

-------
Here is what I have done in the data preprocessing stage: 
1. Convert the data into a usable form <br>
  1.1 Some fields contain unnecessary characters such as "€". These symbols are removed or replaced. <br> 
  1.2 Convert the data into numbers. For example, a height recorded as 5'11 is converted to inches ($5\times12+11 = 71$) <br>
2. Replace missing values <br>
  2.1 For attributes that cannot be zero (e.g. Height), the missing values are replaced with the median of the observed values <br>
  2.2 For attributes that can be zero (e.g. goalkeeper handling), the missing values are replaced with zero. <br>
  
-------
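The conversions and imputations listed above can be sketched in plain Python as follows. This is a minimal illustration rather than the notebook's actual code: the helper names (`parse_euro_thousands`, `height_to_inches`, `fill_missing_with_median`) are hypothetical, and treating a value without a K or M suffix as plain euros is an assumption.

```python
from statistics import median


def parse_euro_thousands(text):
    """Convert strings such as '€110.5M' or '€565K' to thousands of euros."""
    amount = text.replace("€", "")
    if amount.endswith("M"):
        return float(amount[:-1]) * 1000
    if amount.endswith("K"):
        return float(amount[:-1])
    # assumption: no suffix means plain euros
    return float(amount) / 1000


def height_to_inches(text):
    """Convert a feet'inches string such as 5'11 to total inches."""
    feet, inches = text.split("'")[:2]
    return int(feet) * 12 + int(inches)


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


heights = fill_missing_with_median(
    [height_to_inches("5'7"), None, height_to_inches("6'2")]
)
# attributes that may legitimately be zero are filled with 0 instead
gk_handling = [0 if v is None else v for v in [11.0, None, 85.0]]

print(parse_euro_thousands("€110.5M"))  # 110500.0
print(heights)                          # [67, 70.5, 74]
print(gk_handling)                      # [11.0, 0, 85.0]
```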


In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')


In [2]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_65d4a81f39cd4ebeba4f9b0b9d168ea8 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential redacted; supply your own key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_65d4a81f39cd4ebeba4f9b0b9d168ea8.get_object(Bucket='couseraibmdatascienceproject-donotdelete-pr-fff0ca2ef4bcja',Key='data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# If you are reading an Excel file into a pandas DataFrame, replace `read_csv` by `read_excel` in the next statement.
df = pd.read_csv(body)
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M


In [3]:
# convert Wage from object to integer
df['Wage'] = df['Wage'].astype(str)
df['Wage'] = df['Wage'].str.replace('€', '')
df['Wage'] = df['Wage'].str.replace('K', '')
df['Wage'] = df['Wage'].astype(int)
df['Wage'].head()
df_1 = df[df.Wage != 0]

In [4]:
# convert Value from object to int type
valuecol = df_1.columns.get_loc('Value')
df_1.loc[:, 'Value'] = df_1.loc[:, 'Value'].astype(str)
df_1['Value'] = df_1['Value'].str.replace('€', '')

Ks = np.where(df_1['Value'].str.contains('K'))[0]
Ks = np.ndarray.tolist(Ks)
df_1.iloc[Ks,valuecol] = df_1.iloc[Ks,valuecol].str.replace('K', '')

                                                                                                                                   
Ms = np.where(df_1['Value'].str.contains('M'))[0]
Ms = np.ndarray.tolist(Ms)
df_1.iloc[Ms,valuecol] = df_1.iloc[Ms,valuecol].str.replace('M', '')
df_1.iloc[Ms,valuecol] = df_1.iloc[Ms,valuecol].astype(float) * 1000

df_1['Value'] = df_1['Value'].astype(int)

In [5]:
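# convert Height (feet'inches, e.g. 5'7) to total inches as int
df_1copy = df_1.copy()
df_1copy['Height'] = df_1copy['Height'].astype(str)
# positions of missing heights ('nan' after the string cast), reused later for imputation
NAs = np.where(df_1copy['Height'] == 'nan')
NAs = NAs[0]
feet = df_1copy['Height'].str[0]      # first character is the feet part
inches = df_1copy['Height'].str[2:]   # everything after the apostrophe is the inches part
# for 'nan' both slices equal 'n'; use 0 as a temporary placeholder
feet[feet == 'n'] = '0'
inches[inches == 'n'] = '0'
df_1copy['Height'] = feet.astype(int) * 12 + inches.astype(int)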

In [6]:
df_1['Height'] = df_1copy['Height']

In [7]:
hei = df_1.columns.get_loc('Height')
df_1.iloc[np.ndarray.tolist(NAs), hei] = df_1.Height[df_1.Height != 0].median()

In [8]:
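# convert Weight (e.g. 159lbs) to int type
df_1['Weight'] = df_1['Weight'].astype(str)
df_1['Weight'] = df_1['Weight'].str.replace('lbs','')
# use 0 as a temporary marker for missing weights, then impute the median of the observed weights
df_1.loc[df_1['Weight'] == 'nan', 'Weight'] = '0'
df_1['Weight'] = df_1['Weight'].astype(int)
df_1.loc[df_1['Weight'] == 0, 'Weight'] = df_1.loc[df_1['Weight'] != 0, 'Weight'].median()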

In [9]:
records = ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 
         'LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 
         'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 
         'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 
         'GKPositioning', 'GKReflexes']
df_1[records] = df_1[records].fillna(0)

In [10]:
allcol = ['Wage', 'Age', 'Overall', 'Potential', 'Value', 'Height', 'Weight', 'Crossing', 'Finishing', 
              'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 
              'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 
              'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 
              'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 
              'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes' ]
df_1[allcol].head()

Unnamed: 0,Wage,Age,Overall,Potential,Value,Height,Weight,Crossing,Finishing,HeadingAccuracy,...,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
0,565,31,94,94,110500,67.0,159,84.0,95.0,70.0,...,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0
1,405,33,94,94,77000,74.0,183,84.0,94.0,89.0,...,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0
2,290,26,92,93,118500,69.0,150,79.0,87.0,62.0,...,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0
3,260,27,91,93,72000,76.0,168,17.0,13.0,21.0,...,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0
4,355,27,91,92,102000,71.0,154,93.0,82.0,55.0,...,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0


In [11]:
attributes = ['Age', 'Overall', 'Potential', 'Value', 'Height', 'Weight', 'Crossing', 'Finishing', 
              'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 
              'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 
              'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 
              'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 
              'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes' ]

In [12]:
x_tr, x_test, y_tr, y_test = train_test_split(df_1[attributes], df_1['Wage'], test_size = 0.33)
x_train, x_cv, y_train, y_cv = train_test_split(x_tr, y_tr, test_size = 0.33)

In [13]:
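# two ways to scale the data
scaler = StandardScaler()
# fit on the training set only, then apply the same statistics to the cv and test sets
x_train_std = scaler.fit_transform(x_train)
x_cv_std = scaler.transform(x_cv)
x_test_std = scaler.transform(x_test)
# note: the scaler is refit here on the combined "tr" set, so x_tr_std uses the
# statistics of x_tr while x_test_std above uses those of x_train
x_tr_std = scaler.fit_transform(x_tr)
# the same scaler object is then reused (refit) for the target
y_train_std = scaler.fit_transform(np.array(y_train).reshape(-1, 1))
y_cv_std = scaler.transform(np.array(y_cv).reshape(-1, 1))
y_tr_std = scaler.fit_transform(np.array(y_tr).reshape(-1, 1))
y_test_std = scaler.transform(np.array(y_test).reshape(-1, 1))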

normalize = MinMaxScaler()
x_train_nor = normalize.fit_transform(x_train)
x_cv_nor = normalize.transform(x_cv)
x_test_nor = normalize.transform(x_test)
x_tr_nor = normalize.fit_transform(x_tr)
y_tr_nor = normalize.fit_transform(np.array(y_tr).reshape(-1, 1))
y_test_nor = normalize.transform(np.array(y_test).reshape(-1, 1))

There are two ways to rescale the data set: standardization and normalization. Standardization rescales each attribute to have a mean of 0 and a standard deviation of 1; normalization (min-max scaling) rescales each attribute to the range between 0 and 1. It is difficult to tell in advance which method is better. The models below use the standardized data. <br>
$\textbf{Notice}$ that the data set above is divided into three parts: a training set, a cross-validation set and a test set. The training set is used to fit the model, the cross-validation set is used for hyperparameter tuning (especially for random forest), and the test set is used to compare performance across algorithms. The model is first trained on the training set. The scores on the cross-validation set then determine which hyperparameters to use. Finally, the model is retrained on the training and cross-validation sets combined (x_tr, y_tr), and the algorithms (regression vs. random forest vs. neural network) are compared by their scores on the test set. <br>
$\textbf{With}$ that being said, sometimes the "tr" set (training + cross-validation set) is used directly. This is because some scikit-learn utilities (e.g. GridSearchCV) set up the cross-validation internally in a few lines, so the cross-validated score is available without training a model on the training set alone. The split is random and no random seed is fixed, so the exact scores vary slightly from run to run; a different split would presumably not change the evaluation materially. 
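To make the two rescaling methods concrete, here is a small pure-Python sketch (the notebook itself uses scikit-learn's `StandardScaler` and `MinMaxScaler`). The function names and the sample numbers are made up for illustration. The statistics are estimated on the training values only and then reused for the test values, which mirrors calling `fit_transform` on the training data and `transform` on the test data.

```python
from statistics import mean, pstdev


def fit_standardizer(train):
    """Return (mean, population std) estimated from the training values."""
    return mean(train), pstdev(train)


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def fit_minmax(train):
    """Return (min, max) estimated from the training values."""
    return min(train), max(train)


def minmax_scale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]


train_wage = [10.0, 20.0, 30.0, 40.0]  # hypothetical training wages
test_wage = [25.0, 50.0]               # hypothetical test wages

mu, sigma = fit_standardizer(train_wage)
train_std = standardize(train_wage, mu, sigma)  # mean 0, std 1 on the training set
test_std = standardize(test_wage, mu, sigma)    # reuses the training statistics

lo, hi = fit_minmax(train_wage)
test_nor = minmax_scale(test_wage, lo, hi)      # reuses the training min and max

print(test_std)
print(test_nor)
```

A test value outside the training range maps outside [0, 1] under min-max scaling, as the last value shows. When using scikit-learn, keeping a separate scaler object for each fitted data set helps avoid transforming with statistics from a different fit.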

### 2 Model Construction
The following models are considered: <br>
Linear Regression & LASSO<br>
Random Forest <br>
Neural Network

### 2.1 Linear Regression & LASSO
The project begins by fitting a basic linear regression model on the training set. 

In [14]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lmmodel_1 = lm.fit(x_train_std, y_train_std)

In [15]:
coefficients = lmmodel_1.coef_
intercept = lmmodel_1.intercept_
print('Coefficients:', coefficients)
print('Intercept:', intercept)

Coefficients: [[ 1.03317410e-01 -3.64529083e-02  4.62619948e-02  8.44569999e-01
   3.55947543e-02  4.81157485e-03  5.54196715e-02 -4.38959588e-02
   2.71925495e-02 -8.87269575e-03 -2.63101161e-03  2.70219501e-02
   1.87075269e-02 -5.85901493e-02 -1.41724377e-02  1.82654119e-03
  -2.02402430e-02 -2.66560470e-04 -8.45246634e-03 -5.20825338e-03
   3.74720243e-02  3.98214715e-02 -8.98982459e-04 -5.18530061e-02
  -2.03965981e-02  2.77400297e-02  6.91912607e-04  7.49904484e-03
   1.01144133e-02 -1.55447518e-02  2.27302036e-02 -9.80688481e-03
  -3.31925918e-03 -3.87287503e-02  8.66805369e-02  1.86632933e-02
   1.57100438e-02  1.48874832e-02 -2.65165605e-02  5.79314519e-03]]
Intercept: [2.95154411e-17]


In [16]:
y_cv_pred1 = lmmodel_1.predict(x_cv_std)

In [17]:
from sklearn.metrics import mean_squared_error
print('MSE on cross-validation set:', mean_squared_error(y_cv_std, y_cv_pred1))

MSE on cross-validation set: 0.2543415532295527


LASSO is a regression method that performs variable selection and regularization. Compared with ordinary least squares, LASSO adds an L1 penalty on the size of the coefficients, which shrinks them and can set some of them exactly to zero. In the commands below, the model is trained directly on the "tr" set because GridSearchCV handles the cross-validation itself (here 5-fold cross-validation is used). 
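For reference, scikit-learn documents the objective minimized by `Lasso` as the least-squares loss plus an L1 penalty. The symbols below are notation introduced here, not taken from the notebook: $X$ is the feature matrix, $y$ the target, $w$ the coefficient vector, $n$ the number of samples and $\alpha$ the hyperparameter tuned by the grid search.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

With $\alpha = 0$ this reduces to ordinary least squares; larger values of $\alpha$ push more coefficients to exactly zero.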

In [18]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

In [19]:
lasso = Lasso()
param = {'alpha': [1e-20, 1e-10, 1e-3, 1e-1, 1, 5, 20]}
lasso_regressor = GridSearchCV(lasso, param, scoring = 'neg_mean_squared_error', cv = 5)
lasso_regressor.fit(x_tr_std, y_tr_std)
print('Best hyperparameter:', lasso_regressor.best_params_)
print('Mean CV score (negative MSE):', lasso_regressor.best_score_)

Best hyperparameter: {'alpha': 0.001}
Mean CV score (negative MSE): -0.24770263230526066


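With the best alpha, LASSO reaches a cross-validated MSE of about 0.248, slightly lower than the 0.254 obtained by plain linear regression on the cross-validation set (the two figures come from different splits, so the comparison is only indicative). Next, the LASSO model is evaluated on the test set. 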

In [22]:
lasso = Lasso()
param = {'alpha': [1e-20, 1e-10, 1e-3, 1e-1, 1, 5, 20]}
lasso_regressor = GridSearchCV(lasso, param, scoring = 'neg_mean_squared_error', cv = 5)
lasso_regressor.fit(x_tr_std, y_tr_std)
print(lasso_regressor.best_params_)

{'alpha': 0.001}


In [23]:
y_test_pred_lasso = lasso_regressor.predict(x_test_std)
print('MSE', mean_squared_error(y_test_std, y_test_pred_lasso))

MSE 0.2564797691067471


In [24]:
lasso_coeff = Lasso(alpha = 0.001)
lasso_coeff = lasso_coeff.fit(x_tr_std, y_tr_std)
y_test_pred_lasso_coeff = lasso_coeff.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_pred_lasso_coeff))
print('Coefficients', lasso_coeff.coef_, 'Intercept:', lasso_coeff.intercept_)

MSE: 0.2564797691067471
Coefficients [ 8.83515045e-02 -1.43532371e-02  3.20608171e-02  8.44866446e-01
  2.54287424e-02  7.60330741e-03  3.38180650e-02 -1.67879627e-02
  7.15094347e-03 -0.00000000e+00  2.22881861e-03  1.61928301e-02
  1.74663508e-02 -5.53574204e-02 -2.07566492e-02  0.00000000e+00
 -1.61406761e-02  0.00000000e+00 -0.00000000e+00 -5.54758165e-03
  2.25407789e-02  1.87835492e-02  4.05657321e-03 -5.47375219e-02
 -1.84990426e-02  1.39479241e-02  1.02652107e-03  0.00000000e+00
  2.83955293e-03 -2.84788679e-04  3.68042478e-02 -4.83997210e-03
  0.00000000e+00 -0.00000000e+00  6.58383952e-02  0.00000000e+00
  1.41235565e-02  0.00000000e+00 -0.00000000e+00  0.00000000e+00] Intercept: [-3.99698276e-17]


### 2.2 Random Forest
Random forest is an ensemble learning method. It builds many decision trees on bootstrap samples of the data (bagging) and averages their predictions, which usually makes it more accurate than a single decision tree. For random forest I do not use GridSearchCV, because an exhaustive search over hyperparameter combinations takes too long. Instead, I fit models on the training set and choose the best hyperparameters by their scores on the cross-validation set. I then use the best hyperparameters to retrain on the "tr" set and obtain the score on the test set. The best score is the smallest MSE. 

In [33]:
from sklearn.ensemble import RandomForestRegressor
rf_1 = RandomForestRegressor(n_estimators = 100, random_state = 0)
rf_1 = rf_1.fit(x_train_std, y_train_std)
y_cv_rfpred1 = rf_1.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred1))

MSE: 0.24975180788188464


In [34]:
rf_2 = RandomForestRegressor(n_estimators = 100, max_depth = 10, max_features = 'log2', min_samples_split = 3, min_samples_leaf = 2, random_state = 0)
rf_2 = rf_2.fit(x_train_std, y_train_std)
y_cv_rfpred2 = rf_2.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred2))

MSE: 0.2889370790000581


In [35]:
rf_3 = RandomForestRegressor(n_estimators = 1000, max_depth = 10, max_features = 'log2', random_state = 0)
rf_3 = rf_3.fit(x_train_std, y_train_std)
y_cv_rfpred3 = rf_3.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred3))

MSE: 0.28213573438778594


In [36]:
rf_4 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 'log2', random_state = 0)
rf_4 = rf_4.fit(x_train_std, y_train_std)
y_cv_rfpred4 = rf_4.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred4))

MSE: 0.2929437116488881


In [37]:
rf_5 = RandomForestRegressor(n_estimators = 100, max_depth = 30, max_features = 'log2', random_state = 0)
rf_5 = rf_5.fit(x_train_std, y_train_std)
y_cv_rfpred5 = rf_5.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred5))

MSE: 0.2792108830030044


In [38]:
rf_6 = RandomForestRegressor(n_estimators = 100, max_depth = 40, max_features = 'log2', random_state = 0)
rf_6 = rf_6.fit(x_train_std, y_train_std)
y_cv_rfpred6 = rf_6.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred6))

MSE: 0.2829231260455778


In [39]:
rf_7 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 20, random_state = 0)
rf_7 = rf_7.fit(x_train_std, y_train_std)
y_cv_rfpred7 = rf_7.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred7))

MSE: 0.24963987289361042


In [40]:
rf_8 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 30, random_state = 0)
rf_8 = rf_8.fit(x_train_std, y_train_std)
y_cv_rfpred8 = rf_8.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred8))

MSE: 0.2476407661132925


In [41]:
rf_9 = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 40, random_state = 0)
rf_9 = rf_9.fit(x_train_std, y_train_std)
y_cv_rfpred9 = rf_9.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred9))

MSE: 0.24985650824394431


In [42]:
rf_10 = RandomForestRegressor(n_estimators = 500, max_depth = 20, max_features = 40, random_state = 0)
rf_10 = rf_10.fit(x_train_std, y_train_std)
y_cv_rfpred10 = rf_10.predict(x_cv_std)
print('MSE:', mean_squared_error(y_cv_std, y_cv_rfpred10))

MSE: 0.24929560575149673


Based on the cross-validation MSE, the following hyperparameters (those of rf_8) are used. 

-----
n_estimators= 100 <br>
max_depth = 20 <br>
max_features = 30

------
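The choice above can be reproduced by taking the minimum over the cross-validation MSEs printed by the ten runs. The sketch below simply copies those scores (rounded to four decimals) into a list; the variable names are illustrative.

```python
# (label, hyperparameters, MSE on the cross-validation set), copied from the outputs above
cv_results = [
    ("rf_1", {"n_estimators": 100}, 0.2498),
    ("rf_2", {"n_estimators": 100, "max_depth": 10, "max_features": "log2",
              "min_samples_split": 3, "min_samples_leaf": 2}, 0.2889),
    ("rf_3", {"n_estimators": 1000, "max_depth": 10, "max_features": "log2"}, 0.2821),
    ("rf_4", {"n_estimators": 100, "max_depth": 20, "max_features": "log2"}, 0.2929),
    ("rf_5", {"n_estimators": 100, "max_depth": 30, "max_features": "log2"}, 0.2792),
    ("rf_6", {"n_estimators": 100, "max_depth": 40, "max_features": "log2"}, 0.2829),
    ("rf_7", {"n_estimators": 100, "max_depth": 20, "max_features": 20}, 0.2496),
    ("rf_8", {"n_estimators": 100, "max_depth": 20, "max_features": 30}, 0.2476),
    ("rf_9", {"n_estimators": 100, "max_depth": 20, "max_features": 40}, 0.2499),
    ("rf_10", {"n_estimators": 500, "max_depth": 20, "max_features": 40}, 0.2493),
]

# the best configuration is the one with the smallest validation MSE
best_label, best_params, best_mse = min(cv_results, key=lambda run: run[2])
print(best_label, best_params, best_mse)
```

The gaps between rf_1, rf_7, rf_8, rf_9 and rf_10 are small (all within about 0.003), so the ranking among them may well change with a different random split.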

In [43]:
rf_fin = RandomForestRegressor(n_estimators = 100, max_depth = 20, max_features = 30, random_state = 0)
rf_fin = rf_fin.fit(x_tr_std, y_tr_std)
y_test_rfpred = rf_fin.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_rfpred))

MSE: 0.2033254089306992


The MSE on the test set (about 0.203) is lower than that of LASSO (about 0.256). 

### 2.3 Neural Network
A neural network is a deep learning method that can capture non-linear relationships. Compared with the two previous models (regression, random forest), a neural network is more difficult to interpret. The following code examines this model. 

In [26]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

In [49]:
nnmodel = Sequential()
nnmodel.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel.add(Dense(1, kernel_initializer = 'normal'))
nnmodel.compile(loss = 'mean_squared_error', optimizer = 'adam')

In [50]:
nnmodel.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc5e28aa780>

In [51]:
y_test_nn1 = nnmodel.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn1.ravel()))

MSE: 0.25397525358233153


In [48]:
nnmodel11 = Sequential()
nnmodel11.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel11.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel11.compile(loss = 'mean_squared_error', optimizer = 'adam')
nnmodel11.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7f25f987b8>

In [49]:
y_test_nn11 = nnmodel11.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn11.ravel()))

MSE: 0.2366867855021129


In [35]:
nnmodel2 = Sequential()
nnmodel2.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel2.add(Dropout(0.4))
nnmodel2.add(Dense(40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel2.add(Dropout(0.4))
nnmodel2.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel2.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [36]:
nnmodel2.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f804c2306d8>

In [37]:
y_test_nn2 = nnmodel2.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn2.ravel()))

MSE: 0.23937945630262367


In [50]:
nnmodel22 = Sequential()
nnmodel22.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel22.add(Dropout(0.4))
nnmodel22.add(Dense(40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel22.add(Dropout(0.4))
nnmodel22.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel22.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])
nnmodel22.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 64, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7f25c7fcc0>

In [51]:
y_test_nn22 = nnmodel22.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn22.ravel()))

MSE: 0.24547522477168535


Note on choosing batch size. <br>
A larger batch size gives fewer gradient updates per epoch and trains faster, but it can sometimes lead to slightly less accurate predictions; a very small batch size slows training down. In the two comparisons above, increasing the batch size from 32 to 64 lowered the test MSE for the model with one hidden layer (0.254 to 0.237) but raised it for the model with two hidden layers (0.239 to 0.245), so there is no consistent improvement. I therefore stick with a batch size of 32 in the models below. 

In [38]:
nnmodel3 = Sequential()
nnmodel3.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel3.add(Dropout(0.4))
nnmodel3.add(Dense(40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel3.add(Dropout(0.4))
nnmodel3.add(Dense(30, kernel_initializer = 'normal', activation = 'relu'))
nnmodel3.add(Dropout(0.4))
nnmodel3.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel3.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [39]:
nnmodel3.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7f27cabb00>

In [40]:
y_test_nn3 = nnmodel3.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn3.ravel()))

MSE: 0.27516862986859597


In [41]:
nnmodel4 = Sequential()
nnmodel4.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(30, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel4.add(Dropout(0.4))
nnmodel4.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel4.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [42]:
nnmodel4.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7f275f02e8>

In [43]:
y_test_nn4 = nnmodel4.predict(x_test_std)
print('MSE:', mean_squared_error(y_test_std, y_test_nn4.ravel()))

MSE: 0.24008253771262641


In [44]:
nnmodel5 = Sequential()
nnmodel5.add(Dense(50, input_dim = 40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel5.add(Dropout(0.4))
nnmodel5.add(Dense(40, kernel_initializer = 'normal', activation = 'relu'))
nnmodel5.add(Dropout(0.4))
nnmodel5.add(Dense(30, kernel_initializer = 'normal', activation = 'relu'))
nnmodel5.add(Dropout(0.4))
nnmodel5.add(Dense(20, kernel_initializer = 'normal', activation = 'relu'))
nnmodel5.add(Dropout(0.4))
nnmodel5.add(Dense(10, kernel_initializer = 'normal', activation = 'relu'))
nnmodel5.add(Dropout(0.4))
nnmodel5.add(Dense(1, kernel_initializer = 'normal', activation = 'linear'))
nnmodel5.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['mean_squared_error'])

In [45]:
nnmodel5.fit(x_tr_std, y_tr_std, epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7f26df26d8>

In [46]:
y_test_nn5 = nnmodel5.predict(x_test_std)
print('MSE', mean_squared_error(y_test_std, y_test_nn5.ravel()))

MSE 0.30460518887444227


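As observed above, adding more layers does not necessarily give better predictions. Among the models trained with a batch size of 32, the lowest test MSE (about 0.239) belongs to neural network model 2 (nnmodel2); the single-hidden-layer model trained with a batch size of 64 (nnmodel11) scored marginally lower (about 0.237). Model 2 has the following layers: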

| Layer | # of Inputs | # of Outputs | Activation Function | Dropout Rate |
|------|------|------|------|------|
| 1 | 40 | 50 | ReLU | 0.4 |
| 2 | 50 | 40 | ReLU | 0.4 | 
| 3 | 40 | 1 | Linear| N/A |
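
As a quick check on the size of this network, the number of trainable parameters can be worked out from the table: a fully connected layer with $a$ inputs and $b$ outputs has $a \times b$ weights plus $b$ biases, and dropout adds no parameters. The helper name below is illustrative.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# (inputs, outputs) of the three dense layers in the table
layers = [(40, 50), (50, 40), (40, 1)]

per_layer = [dense_params(n_in, n_out) for n_in, n_out in layers]
total = sum(per_layer)

print(per_layer)  # [2050, 2040, 41]
print(total)      # 4131
```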

### 3 Model Evaluation
Model evaluation is based on the mean squared error (MSE) on the test set, which measures how well a trained model performs on unseen data. The MSE is the mean of the squared errors, i.e. the average squared difference between the predicted and the actual values. With $Y_i$ the actual value, $\hat{Y}_i$ the predicted value and $n$ the number of observations, it is given by: <br>
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ <br>
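
As a small sanity check of the formula above, the MSE can be computed by hand in a few lines of plain Python. The function name and the sample numbers are made up for illustration; the notebook itself uses `sklearn.metrics.mean_squared_error`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


y_true = [3.0, 5.0, 2.0, 7.0]  # hypothetical actual values
y_pred = [2.0, 5.0, 4.0, 7.0]  # hypothetical predictions

# squared errors are 1, 0, 4, 0, so the mean is 5 / 4
print(mean_squared_error_manual(y_true, y_pred))  # 1.25
```

Because the wages are standardized before training, the MSEs reported in this notebook are expressed relative to the wage variance; multiplying by the variance of the original wages would convert them back to squared wage units.
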
The following table compares the performance of different model algorithms: 

| Model | MSE Score | Hyperparameter/Layers | Interpretable | 
|------|------|------|------| 
| Linear regression with LASSO regularization | $\approx 0.256$ | $\alpha = 10^{-3}$ | Coefficients can be used to interpret feature importance |
| Random Forest | $\approx 0.203$ | # of trees in forest = 100, maximum depth of tree = 20, maximum features = 30 | Low |
| Neural Network | $\approx 0.239$ | 3 layers in total (Neurons 40>50>40>1) | Low |

The model with the best predictions on the unseen data is the random forest. Given a player's profile, it predicts the player's wage with the lowest MSE of the three models. If the profile has missing data, it can be handled as described in the data preprocessing section before predicting. Club owners and managers can use these predictions as a reference when deciding on a player's wage, which helps them negotiate without substantially underpaying or overpaying. <br>
However, the random forest and the neural network do not give a direct understanding of what drives FIFA players' wages. It would be good to know intuitively which features of the players affect their wages more than others. This analysis of feature importance is discussed in the next section.  

### 4 Model Deployment
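In the previous section, I discussed which model to use for predicting the players' wages. Now let's turn to the intuitive side: the table below lists the larger LASSO coefficients (fitted on the standardized data with $\alpha = 0.001$), which indicate how strongly each feature is associated with the wage. 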

| Feature | Description | Coefficient | Importance |
|------|------|------|------|
| **Age** | Age | $\approx 0.0884$ | important, positive |
| Overall | Overall rating | $\approx -0.0144$ | little importance, negative | 
| Potential | Potential rating | $\approx 0.0321$ | little importance, positive |
| **Value** | Current Market Value | $\approx 0.845$ | high importance, positive |
| Height | Height | $\approx 0.0254$ | little importance, positive | 
| Crossing | Rating of crossing | $\approx 0.0338$ | little importance, positive |
| Finishing | Rating of finishing | $\approx -0.0168$ | little importance, negative |
| Dribbling | Rating of dribbling | $\approx 0.0162$ | little importance, positive |
| Curve | Rating of curve | $\approx 0.0175$ | little importance, positive |
| FKAccuracy | Rating of free kick accuracy | $\approx -0.0554$ | somewhat important, negative |
| LongPassing| Rating of long passing | $\approx -0.0208$ | little importance, negative |
| Acceleration | Rating of acceleration | $\approx -0.0161$ | little importance, negative |
| Balance | Rating of balance | $\approx 0.0225$ | little importance, positive |
| ShotPower | Rating of shot power | $\approx 0.0188$ | little importance, positive |
| Stamina | Rating of stamina | $\approx -0.0547$ | somewhat important, negative | 
| Strength | Rating of strength | $\approx -0.0185$ | little importance, negative |
| LongShots | Rating of long shots | $\approx 0.0139$ | little importance, positive | 
| Penalties | Rating of penalties | $\approx 0.0368$ | little importance, positive |
| **SlidingTackle** | Rating of sliding tackle | $\approx 0.0658$ | somewhat important, positive |
| GKHandling | Rating of goal keeper handling | $\approx 0.0141$ | little importance, positive |

As observed above, the three features with the largest coefficients in absolute value are the current market value of the player, the age of the player and the rating of sliding tackle (marked in bold text). Because the features are standardized, the magnitudes of the coefficients are directly comparable. When estimating the wage of a player, one should use these three features together rather than any single one. Clubs seeking recruits can use these three features for a rough estimate of a player's wage and check whether the player would be a good match. 
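
The ranking can be reproduced by sorting the coefficients by absolute value. The sketch below copies a subset of the LASSO coefficients from the output of the fitted model (rounded to four decimals); the variable names are illustrative.

```python
# subset of the LASSO coefficients printed earlier, rounded to four decimals
lasso_coefficients = {
    "Age": 0.0884,
    "Overall": -0.0144,
    "Potential": 0.0321,
    "Value": 0.8449,
    "Height": 0.0254,
    "Crossing": 0.0338,
    "FKAccuracy": -0.0554,
    "Stamina": -0.0547,
    "Penalties": 0.0368,
    "SlidingTackle": 0.0658,
}

# rank by magnitude: the sign only tells the direction of the association
ranked = sorted(lasso_coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
top_three = [name for name, _ in ranked[:3]]

print(top_three)  # ['Value', 'Age', 'SlidingTackle']
```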


### 5 Insights
For club owners and managers looking to recruit, the aim is to find the player that suits them best. While players with better performance and ability are always preferred, clubs cannot afford a player whose wage is beyond their budget. This project is therefore useful to soccer club owners in two ways: <br>

----
1. During the early recruitment stage, use the feature-importance analysis in the previous section to roughly estimate the wages of all players and look for the most appropriate candidates in terms of both affordability and performance
2. Once an appropriate candidate is found, the club may use the random forest model to obtain the best wage prediction and gain an advantage during the negotiations. 

-----

With that being said, predicting the wages of FIFA players can help club owners and managers make better-informed decisions. <br>
<br>
There is also an insight for future research. Soccer players play in different positions, and different positions may value different attributes. Further study could split the players' data by position to better understand players' wages. 
