# Model Building and Evaluation

In this notebook, we take the preprocessed dataset and use it to test several machine learning models for predicting NBA player salaries.

In [2]:
# common libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib

# Specific to model creation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

In [3]:
# Import functions/modules from this project.
import sys
sys.path.append("../src")

In [4]:
# Load the cleaned data
player_data_final = pd.read_csv('../data/player_data_final.csv')
player_data_final.head()

Unnamed: 0,Salary,Age,GP,GS,MP,FG,FGA,FG%,3P,3PA,...,Undrafted_Flag,Position_C,Position_PF,Position_PG,Position_PG-SG,Position_SF,Position_SF-PF,Position_SF-SG,Position_SG,Position_SG-PG
0,3.710871,1.916724,0.315571,1.234343,1.555489,2.709056,2.609888,0.259585,4.440903,3.814286,...,-0.724175,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3.643117,1.447935,-0.572167,-0.724496,0.247134,0.307385,0.557035,-0.515895,0.005832,0.182192,...,-0.724175,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.618272,1.916724,1.001551,0.051648,0.969346,1.040098,1.294468,-0.260443,0.233271,0.492248,...,-0.724175,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.374558,2.854301,0.27522,1.160425,1.639224,3.156825,3.0085,0.323448,1.370469,1.821063,...,-0.724175,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.341335,1.916724,-0.047594,0.90171,1.649691,2.831175,2.231207,0.870846,1.14303,0.935187,...,-0.724175,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
# Now Split into features and target for test/train sets
X = player_data_final.iloc[:, 1:] 
y = player_data_final.iloc[:, 0] # salary

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [6]:
print('Number of Train Samples: ', X_train.shape[0])
print('Number of Test Samples: ', X_test.shape[0])

Number of Train Samples:  351
Number of Test Samples:  117
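
The split above relies on Salary being the first column of the frame. The `head()` output lists an `Unnamed: 0` column ahead of `Salary`, which may only be how the index was rendered, but selecting the target by name sidesteps the question. A minimal sketch on a toy frame (the `toy` names and values are made up for illustration and are not the project's data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for player_data_final; the real frame is loaded from the CSV.
toy = pd.DataFrame({
    "Unnamed: 0": range(8),  # leftover index column, if one is present
    "Salary": [3.7, 3.6, 3.6, 3.4, 3.3, 1.2, 0.4, -0.5],
    "Age": [1.9, 1.4, 1.9, 2.9, 1.9, 0.1, -0.3, -1.0],
    "GP": [0.3, -0.6, 1.0, 0.3, 0.0, 0.5, -1.2, 0.8],
})

# Drop a leftover index column if it exists, then pick the target by name.
toy = toy.drop(columns=["Unnamed: 0"], errors="ignore")
y_toy = toy["Salary"]
X_toy = toy.drop(columns=["Salary"])

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
print(X_tr.shape, X_te.shape)
```

Selecting by name also guards against the target leaking into the features if the column order in the CSV ever changes.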


We don’t have a huge dataset, but we should have enough to check the final model’s accuracy. Because the dataset is small, we’ll compare models using 5-fold cross-validation.

### Initial Model Benchmarking

We start by fitting a bunch of different regression models using default parameters. From there we'll pick a few which perform the best, do some basic tuning, and compare again. We'll use R squared and RMSE from the cross-validation to measure accuracy. The test set will be held out for determining the accuracy of the final model.

In [8]:
# Quick-and-dirty check of a bunch of models; from EDA we already know the model will probably need regularization
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'Random Forest': RandomForestRegressor(),
    'SVR': SVR(),
    'KNN': KNeighborsRegressor(),
    'GBR': GradientBoostingRegressor()
}

# initialize results
results = []

# fit each model and save the cross-validated R squared and RMSE results
for name, model in models.items():
    r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    # sklearn returns the negated RMSE (higher is better), so 'Mean RMSE' below is negative
    rmse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')
    
    results.append({'Model': name, 'Mean R squared': np.mean(r2_scores), 'Mean RMSE': np.mean(rmse_scores)})

# Create DataFrame
results_df = pd.DataFrame(results).sort_values(by='Mean R squared', ascending=False)
print(results_df)

               Model  Mean R squared  Mean RMSE
7                GBR        0.757161  -0.465750
4      Random Forest        0.733539  -0.501863
5                SVR        0.682250  -0.536065
6                KNN        0.634184  -0.575547
1              Ridge        0.598178  -0.600532
0  Linear Regression        0.562305  -0.627338
3         ElasticNet        0.171504  -0.867864
2              Lasso       -0.011930  -0.956629
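
The RMSE column is negative because scikit-learn's `neg_root_mean_squared_error` scorer reports the error with its sign flipped, so that higher is always better. A minimal sketch of collecting both metrics in a single cross-validation pass with `cross_validate` and flipping the sign back, assuming a synthetic dataset (`X_demo`, `y_demo`) as a stand-in for `X_train` and `y_train`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X_train / y_train (not the NBA data).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# One 5-fold pass that scores every fold with both metrics.
cv_out = cross_validate(
    Ridge(),
    X_demo,
    y_demo,
    cv=5,
    scoring={"r2": "r2", "neg_rmse": "neg_root_mean_squared_error"},
)

mean_r2 = np.mean(cv_out["test_r2"])
mean_rmse = -np.mean(cv_out["test_neg_rmse"])  # flip the sign back to a positive error
print(f"Mean R squared: {mean_r2:.3f}, Mean RMSE: {mean_rmse:.3f}")
```

Compared with two separate `cross_val_score` calls, this fits each model once per fold instead of twice, and both metrics come from the same fits.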


Since GradientBoostingRegressor and RandomForest did the best, we'll try some basic tuning to see if we get better CV results. Since we only have about 350 training samples, we should keep the depth and number of estimators comparatively low. We'll also look at Ridge: it was middle of the pack, but we can directly tune its regularization strength to suit the number of features, so it makes a good comparison.

In [10]:
# GBR model
gbr = GradientBoostingRegressor(random_state=42)

param_grid_gbr = {
    'n_estimators': [50, 100],           
    'learning_rate': [0.01, 0.1, 0.2],        
    'max_depth': [3, 5, 7]                 
}


# Grid search
grid_search_gbr = GridSearchCV(
    estimator=gbr,
    param_grid=param_grid_gbr,
    scoring='r2',
    cv=5,
)

grid_search_gbr.fit(X_train, y_train)

# Results
print("Best parameters:", grid_search_gbr.best_params_)
print("Best R squared score:", grid_search_gbr.best_score_)


Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
Best R squared score: 0.7571792425491408


GradientBoostingRegressor saw only a marginal increase with tuning (CV R squared of 0.75718 vs. 0.75716 untuned), but it still has the best accuracy so far.
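
When the gain from tuning is this small, it helps to look at the spread across folds as well as the mean. A minimal sketch of reading `cv_results_` from a fitted `GridSearchCV`, assuming synthetic data and a deliberately tiny grid (both are illustrative, not the settings used above):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the NBA data).
X_demo, y_demo = make_regression(n_samples=150, n_features=10, noise=15.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [2, 3]},
    scoring="r2",
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best first, with the fold-to-fold spread.
cv_table = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(cv_table.to_string(index=False))
```

If the gap between the top candidates is much smaller than `std_test_score`, the ranking among them is probably not meaningful.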

In [12]:
# RF model
rf = RandomForestRegressor(random_state=42)

# parameter grid
param_grid_rf = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt']
}

# Grid search
grid_search_rf = GridSearchCV(estimator=rf,
                           param_grid=param_grid_rf,
                           cv=5,
                           scoring='r2')

# Fit the grid search 
grid_search_rf.fit(X_train, y_train)

# Results
print("Best parameters:", grid_search_rf.best_params_)
print("Best R squared score:", grid_search_rf.best_score_)

Best parameters: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best R squared score: 0.6553599847287885


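This R squared is lower than the untuned model's cross-validated score (0.655 vs. 0.734). The grid fixes max_features='sqrt', while the default RandomForestRegressor considers every feature at each split, so the untuned configuration was never among the candidates. The drop more likely reflects that restriction than overfitting in the base model.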
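
The grid above only tries `max_features='sqrt'`. A minimal sketch of putting that setting side by side with `1.0` (all features at each split), assuming synthetic stand-in data and a reduced grid chosen for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the NBA data).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=20, n_informative=5, noise=10.0, random_state=0
)

rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid={"max_features": [1.0, "sqrt"]},  # 1.0 = consider every feature at each split
    scoring="r2",
    cv=5,
)
rf_search.fit(X_demo, y_demo)

# Mean CV R squared for each setting.
for params, score in zip(
    rf_search.cv_results_["params"], rf_search.cv_results_["mean_test_score"]
):
    print(params, round(score, 3))
```

Which setting wins depends on the data, so the same comparison would need to be run on `X_train` and `y_train` before drawing a conclusion.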

In [14]:
# Ridge model
ridge = Ridge()

param_grid_ridge = {
    'alpha': [0.1, 1, 10, 100, 150]
}

grid_search_ridge = GridSearchCV(estimator=ridge,
                           param_grid=param_grid_ridge,
                           scoring='r2',
                           cv=5,
                           n_jobs=-1)

grid_search_ridge.fit(X_train, y_train)

print("Best alpha:", grid_search_ridge.best_params_['alpha'])
print("Best R squared score:", grid_search_ridge.best_score_)

Best alpha: 100
Best R squared score: 0.6494581017347951
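
The alpha grid above jumps from 10 to 100 to 150, so the best value may sit between grid points. A minimal sketch of a log-spaced grid built with `np.logspace`, assuming synthetic stand-in data and an illustrative range of 0.1 to 1000:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the NBA data).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=30, n_informative=8, noise=20.0, random_state=0
)

alphas = np.logspace(-1, 3, 9)  # 0.1 to 1000, two values per decade
ridge_search = GridSearchCV(Ridge(), {"alpha": alphas}, scoring="r2", cv=5)
ridge_search.fit(X_demo, y_demo)

best_alpha = ridge_search.best_params_["alpha"]
print("Best alpha:", best_alpha)
print("Best R squared score:", round(ridge_search.best_score_, 3))

# A best value on the boundary suggests the range should be widened.
if best_alpha in (alphas[0], alphas[-1]):
    print("Best alpha sits on the edge of the grid; consider widening the range.")
```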


Ridge performed worse than GradientBoostingRegressor but similar to the tuned RandomForest (0.649 vs. 0.655), indicating it probably doesn't have enough predictive power to model this dataset.

It looks like GradientBoostingRegressor is the best, so we'll take it as the final model in the next steps.
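
As a rough sketch of what that final step could look like, the chosen model is refit on the full training set and scored once on the held-out test set. The hyperparameters below are the best ones reported by the grid search above; the data is a synthetic stand-in, so the printed numbers say nothing about the NBA dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the player data (not the NBA data).
X_demo, y_demo = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Refit on all training data with the best parameters from the grid search.
final_model = GradientBoostingRegressor(
    learning_rate=0.1, max_depth=3, n_estimators=50, random_state=42
)
final_model.fit(X_tr, y_tr)

# Score once on data the model has never seen.
y_pred = final_model.predict(X_te)
test_r2 = r2_score(y_te, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_te, y_pred))
print(f"Test R squared: {test_r2:.3f}, Test RMSE: {test_rmse:.3f}")
```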