# Model Training: Centers

The goal is to train a model that predicts the value of NHL players who play the Center position, measured as their share of the salary cap (`Y_SALARY_CAP_PERCENTAGE`). The choice of features is based on the EDA done in 01_C_EDA.ipynb (Folder 4_ExploratoryDataAnalysis).

## Import our data

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [25]:
# Load in our data
filepath = '../../Data/entitiesResolved/merged_data_clean.csv'
data = pd.read_csv(filepath)

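# Select all rows where 'POSITION' is 'c', 'c, l' or 'c, r'
centers = data[data['POSITION'].isin(['c', 'c, l', 'c, r'])]

# Keep only rows with at least 21 games played.
# .copy() makes centers an independent DataFrame, so adding columns later
# does not trigger a SettingWithCopyWarning.
centers = centers[centers['GP'] >= 21].copy()
centers.shape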

(3075, 116)

## Features to try:

Recall that the EDA suggested trying the following features:

1. TOI/GP
2. XGF/60, possibly combined with SCF/60, FF/60, HDCF/60, CF/60, SF/60, and MDCF/60 into a new feature.
3. GF/60
4. TOTAL ASSISTS/60, or possibly FIRST ASSISTS/60 and SECOND ASSISTS/60 instead.
5. GOALS/60
6. Handedness
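
The idea in item 2 of folding the correlated shot-based rates into one feature could be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the column name `OFFENSE_PC1` is hypothetical, and using the first principal component is one possible way to combine the columns, not a choice made in the EDA.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Column names come from the feature list above; the values are synthetic.
shot_cols = ['XGF/60', 'SCF/60', 'FF/60', 'HDCF/60', 'CF/60', 'SF/60', 'MDCF/60']

rng = np.random.default_rng(0)
latent = rng.normal(size=200)  # hypothetical shared "offensive pressure" signal
demo = pd.DataFrame(
    {col: latent + rng.normal(scale=0.5, size=200) for col in shot_cols}
)

# Standardize the columns, then keep the first principal component
# as a single combined feature.
scaled = StandardScaler().fit_transform(demo[shot_cols])
pca = PCA(n_components=1)
demo['OFFENSE_PC1'] = pca.fit_transform(scaled)[:, 0]  # hypothetical column name

explained = pca.explained_variance_ratio_[0]
print(f'Variance explained by the combined feature: {explained:.2f}')
```

If the real columns are as strongly correlated as the EDA suggests, a single component should retain most of their variance; if not, keeping the columns separate may be the better option.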

In [26]:
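# Select the features we want to use
numerical_features = ['TOI/GP', 'XGF/60', 'GF/60', 'TOTAL ASSISTS/60', 'GOALS/60']
categorical_features = ['HANDED']

# Scale the numerical columns and one-hot encode handedness.
# Columns not listed here (including the target) are dropped,
# because ColumnTransformer defaults to remainder='drop'.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        # handle_unknown='ignore' avoids an error if a category is missing from the training split
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])

# Pass the full frame; the preprocessor picks out the feature columns by name
X = centers
y = centers['Y_SALARY_CAP_PERCENTAGE']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

# Evaluate on the held-out rows
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')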

Mean Squared Error: 0.000652220784355844
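
The MSE is hard to read because it is in squared units. A minimal sketch of converting it to an RMSE, which is in the same units as `Y_SALARY_CAP_PERCENTAGE` (a fraction of the cap), is shown here. The 2023-24 cap figure is the `Y_SALARY_CAP` value that appears in the data further down, and the dollar conversion is only a rough illustration for that one season.

```python
import math

mse = 0.000652220784355844  # value printed by the cell above
rmse = math.sqrt(mse)       # back in the units of Y_SALARY_CAP_PERCENTAGE

cap_2023_24 = 83_500_000    # Y_SALARY_CAP for the 2023-24 rows in the data
print(f'RMSE: {rmse:.4f} of the cap')
print(f'Roughly ${rmse * cap_2023_24:,.0f} at the 2023-24 cap')
```

An RMSE of about 0.0255 means a typical prediction error of around 2.5 percentage points of the cap, which is large compared with the 1-2% of the cap that many of the centers shown in this notebook are paid.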


In [27]:
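# Add a column with the predicted cap percentage for every center
# (this covers both the training and the test rows)
centers['predicted_salary'] = pipeline.predict(centers)
# head(-5) displays every row except the last five
centers.head(-5)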

Unnamed: 0,POSITION,PLAYER,TEAM,TOI,GP,TOI/GP,GOALS/60,TOTAL ASSISTS/60,FIRST ASSISTS/60,SECOND ASSISTS/60,...,AAV,SALARY,BASE SALARY,S.BONUS,P.BONUS,SEASON,Y_SALARY_CAP,Y_SALARY_CAP_PERCENTAGE,DECEASED,predicted_salary
19,"c, l",andrew cogliano,col,1120.283333,82,13.661992,0.96,1.45,0.75,0.70,...,1133333,850000,850000,0,350000,2007-08,50300000,0.016899,0,0.031387
28,c,antoine vermette,-,1423.616667,81,17.575514,1.01,1.22,0.72,0.51,...,1000000,1075000,1075000,0,0,2007-08,50300000,0.019881,0,0.049325
30,c,anze kopitar,lak,1696.800000,82,20.692683,1.13,1.59,0.81,0.78,...,955867,850000,765000,85000,134200,2007-08,50300000,0.016335,0,0.075989
40,c,boyd gordon,-,1054.166667,67,15.733831,0.40,0.51,0.23,0.28,...,650000,650000,650000,0,0,2007-08,50300000,0.012922,0,0.030596
43,c,brad richards,-,1736.433333,74,23.465315,0.69,1.45,0.86,0.59,...,7800000,7800000,7800000,0,0,2007-08,50300000,0.155070,0,0.084902
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12910,"c, r",tyler toffoli,wpg,1158.666667,67,17.293532,1.55,1.04,0.62,0.41,...,4250000,3500000,3500000,0,0,2023-24,83500000,0.050898,0,0.056574
12914,c,tyson jost,buf,388.850000,36,10.801389,0.31,0.31,0.31,0.00,...,2000000,2000000,2000000,0,0,2023-24,83500000,0.023952,0,0.001359
12926,c,vincent trocheck,nyr,1492.000000,70,21.314286,0.97,1.69,1.17,0.52,...,5625000,6500000,3500000,3000000,0,2023-24,83500000,0.067365,0,0.083340
12927,c,vinni lettieri,min,367.533333,38,9.671930,0.49,0.65,0.49,0.16,...,775000,775000,775000,0,0,2023-24,83500000,0.009281,0,0.004550


In [28]:
# Select the top 5 centers by predicted salary
top_centers = centers.sort_values(by='predicted_salary', ascending=False).head(5)
top_centers[['PLAYER', 'Y_SALARY_CAP_PERCENTAGE', 'predicted_salary']]

Unnamed: 0,PLAYER,Y_SALARY_CAP_PERCENTAGE,predicted_salary
12328,connor mcdavid,0.149701,0.113019
9730,connor mcdavid,0.153374,0.106016
11482,connor mcdavid,0.151515,0.105985
10590,connor mcdavid,0.153374,0.104751
12698,nathan mackinnon,0.150898,0.102034


In [29]:
# Select the top 15 centers with the largest discrepancy between actual and predicted salary.
# A positive salary_diff means the player is paid more than the model predicts.
centers['salary_diff'] = centers['Y_SALARY_CAP_PERCENTAGE'] - centers['predicted_salary']
top_centers = centers.sort_values(by='salary_diff', ascending=False).head(15)
top_centers[['PLAYER', 'Y_SALARY_CAP_PERCENTAGE', 'predicted_salary', 'salary_diff']]

Unnamed: 0,PLAYER,Y_SALARY_CAP_PERCENTAGE,predicted_salary,salary_diff
1731,chris drury,0.118687,0.02056,0.098127
3609,scott gomez,0.122619,0.038119,0.0845
5801,jonathan toews,0.147059,0.062566,0.084493
2867,scott gomez,0.114419,0.033201,0.081218
10817,jonathan toews,0.128834,0.052374,0.07646
11703,jonathan toews,0.127273,0.053453,0.073819
1107,chris drury,0.12412,0.050554,0.073566
74,chris drury,0.140159,0.06679,0.073369
6640,jonathan toews,0.143836,0.071405,0.07243
3706,vincent lecavalier,0.128788,0.057975,0.070813


## Preliminary Conclusions
The model needs some tweaks: even for its top-ranked players it predicts roughly 10-11% of the cap, while those players actually earn about 15%. I can see two ways to interpret this:
1. We're missing something in our feature set that captures how valuable players like Connor McDavid and Nathan MacKinnon really are.
2. The model is right, and the highest-paid players are significantly overpaid relative to what they produce on the ice.
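
One way to start telling these two interpretations apart is to check whether the residuals follow a systematic pattern across the range of predictions. The sketch below uses synthetic data only: `skill` is a hypothetical stand-in for on-ice production, and the cubic pay curve is an assumption chosen to show what a linear model looks like when pay rises faster at the top. It is not a result from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
skill = rng.uniform(0, 1, size=1000)  # synthetic stand-in for production
# Assumed pay curve: cap share rises faster than linearly at the top
actual = 0.01 + 0.12 * skill**3 + rng.normal(0, 0.005, size=1000)

demo = pd.DataFrame({'Y_SALARY_CAP_PERCENTAGE': actual})
model = LinearRegression().fit(skill.reshape(-1, 1), actual)
demo['predicted_salary'] = model.predict(skill.reshape(-1, 1))
demo['salary_diff'] = demo['Y_SALARY_CAP_PERCENTAGE'] - demo['predicted_salary']

# Split the rows into five equal-sized bins by predicted value
# and compare the average residual in each bin.
demo['pred_bin'] = pd.qcut(demo['predicted_salary'], q=5, labels=False)
by_bin = demo.groupby('pred_bin')['salary_diff'].mean()
print(by_bin)
```

In this synthetic case the average residual is positive in the top bin and negative in the middle, which is the signature of a relationship the linear model cannot bend to fit. Seeing the same shape on the real `centers` data would be consistent with the first interpretation; residuals that look flat across bins would leave the second one open.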