# NBA Data: Modeling

Knowing the number of 2-point shots, 3-point shots, and free throws each team makes should theoretically predict the winner of a basketball game with 100% accuracy, because the team that scores the most points wins and those are the only 3 ways to score points. I could throw all of my input variables into the model and have something incredibly accurate, but that doesn't help teams strategize in this hypothetical scenario because that would essentially say "The team that wins in every possible stat category will win the game", which is obvious.
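The scoring arithmetic behind that claim is easy to make concrete. Here is a minimal sketch, assuming made shots can be recovered as attempts times shooting percentage. The `points` helper is hypothetical and not part of this notebook; the inputs are one row of the cleaned data (fg2a = 63, fg3a = 7, fta = 34).

```python
def points(fg2a, fg2_pct, fg3a, fg3_pct, fta, ft_pct):
    """Total points from attempts and shooting percentages (hypothetical helper)."""
    # Percentages are stored rounded, so round the products back to whole shots
    fg2_made = round(fg2a * fg2_pct)
    fg3_made = round(fg3a * fg3_pct)
    ft_made = round(fta * ft_pct)
    # 2 points per made 2-pointer, 3 per made 3-pointer, 1 per made free throw
    return 2 * fg2_made + 3 * fg3_made + 1 * ft_made


team_points = points(63, 0.492063, 7, 0.142857, 34, 0.735294)
print(team_points)
```

With both teams' totals computed this way, the winner is simply the larger number, which is why a model given every shooting input has little left to learn.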

The value of this model is to provide guidance in team building and training. No team is ever \#1 in every stat category. That's not feasible. Instead, I'm going to limit the model to only 3 inputs. This way, a coach can go to their team and say "This year, we will focus on 3 things". Keeping it simple means the team doesn't get scatterbrained trying to do too much.

First, I will build the model to predict winning using all inputs. Then I will look at the feature importance to extract the 3 most useful features to predict winning. Finally, I will rebuild the model using only those 3 features. I will repeat this process for two to three different models to find the best combination of model and features to predict winning.

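For each model, the notebook follows the same outline:

* Explain why the model was chosen
* Fit the model on all features
* Keep only the 3 most important features
* Rebuild the model with those 3 features
* Run cross validation on the training set
* Evaluate on the test set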
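The fit, rank, and refit loop described above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data comes from `make_classification`, the feature names `f0` to `f7` are made up, and a small Random Forest stands in for whichever model is being tested.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NBA data (hypothetical feature names f0..f7)
X, y = make_classification(n_samples=600, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: fit on all features
full_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Step 2: rank features by importance and keep the top 3
ranked = pd.Series(full_model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
top3 = list(ranked.index[:3])

# Step 3: refit using only those 3 features and score on held-out data
small_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train[top3], y_train)
test_accuracy = small_model.score(X_test[top3], y_test)
print(top3, round(test_accuracy, 3))
```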

## Imports

In [22]:
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings("ignore")

## Load Data

In [23]:
def load_pickle(path):
    # Open inside a context manager so the file handle is closed after loading
    with open(path, "rb") as f:
        return pickle.load(f)

nba_df = load_pickle("data_clean/nba_df3.pkl")
X_train_scaled = load_pickle("data_clean/X_train_scaled.pkl")
X_test_scaled = load_pickle("data_clean/X_test_scaled.pkl")
y_train = load_pickle("data_clean/y_train.pkl")
y_test = load_pickle("data_clean/y_test.pkl")

## Circling Back

After running a few tests, I realized that including metrics for total field goals, which are just the combination of 2-point and 3-point field goals, is redundant. So I'm going to remove the fga and fg_pct inputs while keeping the more detailed 2-point and 3-point inputs.
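To see why the two columns carry no new information, both can be rebuilt from the 2-point and 3-point inputs. This is a minimal sketch, assuming fga is the sum of the two attempt counts and fg_pct is total makes over total attempts. The `combined_field_goals` helper is hypothetical, and the inputs are one row of the cleaned data.

```python
def combined_field_goals(fg2a, fg2_pct, fg3a, fg3_pct):
    """Rebuild total attempts and overall percentage from the 2-pt and 3-pt inputs."""
    fga = fg2a + fg3a
    # Makes are attempts times percentage for each shot type
    made = fg2a * fg2_pct + fg3a * fg3_pct
    return fga, made / fga


fga, fg_pct = combined_field_goals(68.0, 0.470588, 8.0, 0.25)
print(fga, round(fg_pct, 4))
```

Because both values are exact functions of columns already in the data, keeping them would only duplicate signal and blur the feature importance ranking.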

In [24]:
nba_df = nba_df.drop(columns = ["fga", "fg_pct"])
X_train_scaled = X_train_scaled.drop(columns = ["fga", "fg_pct"])
X_test_scaled = X_test_scaled.drop(columns = ["fga", "fg_pct"])
nba_df

Unnamed: 0,fg2a,fg2_pct,fg3a,fg3_pct,fta,ft_pct,reb,ast,stl,blk,to,pf,win
0,68.0,0.470588,8.0,0.250000,30.0,0.500000,38.0,20.0,9.0,4.0,18.0,34.0,loss
1,63.0,0.492063,7.0,0.142857,34.0,0.735294,41.0,23.0,8.0,4.0,18.0,26.0,win
2,66.0,0.545455,15.0,0.266667,34.0,0.617647,48.0,25.0,18.0,7.0,25.0,35.0,win
3,62.0,0.483871,13.0,0.153846,40.0,0.700000,43.0,20.0,9.0,4.0,24.0,26.0,loss
4,71.0,0.478873,6.0,0.666667,29.0,0.689655,52.0,25.0,10.0,7.0,25.0,33.0,win
...,...,...,...,...,...,...,...,...,...,...,...,...,...
51332,75.0,0.346667,22.0,0.363636,24.0,0.833333,55.0,20.0,5.0,7.0,8.0,22.0,loss
51333,59.0,0.440678,41.0,0.365854,21.0,0.857143,40.0,30.0,9.0,4.0,14.0,19.0,loss
51334,65.0,0.615385,28.0,0.500000,26.0,0.769231,52.0,34.0,7.0,9.0,15.0,20.0,win
51335,60.0,0.500000,34.0,0.352941,20.0,0.800000,48.0,30.0,7.0,7.0,21.0,23.0,loss


## Model 1: Logistic Regression

The first model I will try is Logistic Regression. My output variable is win vs loss, so I have to stick with classification models, and with this output variable being dichotomous, it makes sense to try Logistic Regression first.

### Get top 3 Features

First, I will build a model with all features and create a list of the features ordered by importance. Then I will extract the top 3. Those 3 features will be used to build the real model.

In [25]:
# https://stackoverflow.com/questions/34052115/how-to-find-the-importance-of-the-features-for-a-logistic-regression-model
# https://stackoverflow.com/questions/24255723/sklearn-logistic-regression-important-features
# The inputs were scaled beforehand, so each std should be close to 1 and this is nearly |coef|.
# Multiplying by the per-feature std keeps the ranking valid even if that is not exactly true.

log_reg_all = LogisticRegression(solver = 'liblinear', max_iter = 500, C = 1000)
log_reg_all.fit(X_train_scaled, y_train)
log_reg_feature_list = np.abs(X_train_scaled.std(ddof = 0) * log_reg_all.coef_[0]).sort_values(ascending = False)

log_reg_top_features = list(log_reg_feature_list[0:3].index)
log_reg_feature_list

reb        2.146340
fg3a       1.863932
fg2a       1.490386
fg2_pct    1.306337
to         1.201043
fg3_pct    1.197691
stl        1.032633
ft_pct     0.442312
ast        0.437753
blk        0.281330
fta        0.252609
pf         0.218038
dtype: float64

The top 3 features are reb (rebounds), fg3a (3-pointers attempted), and fg2a (2-pointers attempted).
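The ranking above multiplies each coefficient by its feature's standard deviation and takes the absolute value. A small sketch with made-up numbers shows why the scaling step matters. All coefficients and standard deviations below are hypothetical and are not taken from the fitted model.

```python
# Hypothetical coefficients and per-feature standard deviations, for illustration only
coefs = {"reb": 0.05, "fg2_pct": 8.0, "pf": -0.01}
stds = {"reb": 6.0, "fg2_pct": 0.05, "pf": 4.5}

# Importance is the size of the effect of a one-standard-deviation change
importance = {name: abs(coefs[name] * stds[name]) for name in coefs}

# Sort feature names from most to least important
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)
```

On raw coefficients alone, `fg2_pct` would look 160 times more important than `reb`. After accounting for how much each feature actually varies, the two are close. When the inputs are already standardized, every std is about 1 and the ranking reduces to the absolute coefficients.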

### Rebuild Model using Top 3 Features

Now I will rebuild the model using only the top 3 features found in the previous section.

In [26]:
X_train_scaled_log_reg_top3 = X_train_scaled[log_reg_top_features]
X_test_scaled_log_reg_top3 = X_test_scaled[log_reg_top_features]

In [27]:
log_reg_top3 = LogisticRegression(solver = 'liblinear', max_iter = 500, C = 1000)
log_reg_top3.fit(X_train_scaled_log_reg_top3, y_train)

LogisticRegression(C=1000, max_iter=500, solver='liblinear')

### Model Evaluation

In [28]:
cv = KFold(n_splits = 10, random_state = 610, shuffle = True)
scores = cross_val_score(log_reg_top3, X_train_scaled_log_reg_top3, y_train, cv = cv)
print("Cross Validation Mean Score: ", round(np.mean(scores), 4), " (Std: ", round(np.std(scores), 4), ")", sep = "")
print("Classification Report for Test Data")
print(classification_report(y_test, log_reg_top3.predict(X_test_scaled_log_reg_top3)))

Cross Validation Mean Score: 0.6329 (Std: 0.0068)
Classification Report for Test Data
              precision    recall  f1-score   support

        loss       0.63      0.64      0.63      7693
         win       0.63      0.63      0.63      7706

    accuracy                           0.63     15399
   macro avg       0.63      0.63      0.63     15399
weighted avg       0.63      0.63      0.63     15399
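The cross validation score comes from splitting the training set into 10 folds, holding out each fold once, and averaging the 10 accuracies. A minimal sketch of that splitting in plain Python follows, assuming a simple shuffle-then-slice scheme. The `k_fold_indices` helper is hypothetical and will not reproduce scikit-learn's exact fold assignments.

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Yield (training, validation) index lists for each fold (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by n_splits
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(n_samples=23, n_splits=10, seed=610))
for training, validation in folds[:2]:
    print(len(training), len(validation))
```

Every row lands in exactly one validation fold, so each training row is scored once by a model that never saw it.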



In [29]:
# Predict once and reuse the result for every metric
y_pred_log_reg = log_reg_top3.predict(X_test_scaled_log_reg_top3)

# scikit-learn metric functions take (y_true, y_pred) in that order
log_reg_evaluation = pd.DataFrame([{
    "model": "Logistic Regression",
    "features": log_reg_top_features,
    "train_cv_mean": round(np.mean(scores), 4),
    "train_cv_std": round(np.std(scores), 4),
    "accuracy": round(accuracy_score(y_test, y_pred_log_reg), 4),
    "precision_win": round(precision_score(y_test, y_pred_log_reg, pos_label = 'win'), 4),
    "precision_loss": round(precision_score(y_test, y_pred_log_reg, pos_label = 'loss'), 4),
    "recall_win": round(recall_score(y_test, y_pred_log_reg, pos_label = 'win'), 4),
    "recall_loss": round(recall_score(y_test, y_pred_log_reg, pos_label = 'loss'), 4),
    "f1_win": round(f1_score(y_test, y_pred_log_reg, pos_label = 'win'), 4),
    "f1_loss": round(f1_score(y_test, y_pred_log_reg, pos_label = 'loss'), 4)}])
log_reg_evaluation

Unnamed: 0,model,features,train_cv_mean,train_cv_std,accuracy,precision_win,precision_loss,recall_win,recall_loss,f1_win,f1_loss
0,Logistic Regression,"[reb, fg3a, fg2a]",0.6329,0.0068,0.6328,0.6342,0.6313,0.6287,0.6368,0.6315,0.6341
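Precision and recall are easy to mix up, and the order of the arguments decides which one is computed. scikit-learn's `precision_score` and `recall_score` expect `y_true` first and `y_pred` second. A minimal sketch in plain Python, using a hypothetical `precision_recall` helper and made-up labels, shows that swapping the two arguments exchanges the metrics.

```python
def precision_recall(y_true, y_pred, pos_label):
    """Precision and recall for one class, computed from raw counts (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    return tp / (tp + fp), tp / (tp + fn)


# Made-up labels for six games
y_true = ["win", "win", "win", "win", "loss", "loss"]
y_pred = ["win", "win", "loss", "loss", "loss", "win"]

precision, recall = precision_recall(y_true, y_pred, "win")
swapped = precision_recall(y_pred, y_true, "win")
print(precision, recall, swapped)
```

Accuracy and F1 are symmetric in the two arguments, so only the precision and recall columns are sensitive to the order.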


## Model 2: Decision Tree

The second model I will try is a Decision Tree. Again, I need to stick to classification models. Decision Trees are beneficial in that they are easier to interpret.

### Get top 3 Features

First, I will build a model with all features and create a list of the features ordered by importance. Then I will extract the top 3. Those 3 features will be used to build the real model.

In [30]:
# Fit a decision tree on all features, using entropy (information gain) to choose splits
dec_tree = DecisionTreeClassifier(criterion = "entropy", random_state = 610)
dec_tree.fit(X_train_scaled, y_train)


dec_tree_feature_list = pd.Series(dec_tree.feature_importances_, index = dec_tree.feature_names_in_).sort_values(ascending = False) 

dec_tree_top_features = list(dec_tree_feature_list[0:3].index)
dec_tree_feature_list

fg2_pct    0.154704
fg3_pct    0.138294
reb        0.133056
ft_pct     0.077667
fg3a       0.073994
to         0.073337
fta        0.072193
stl        0.066246
fg2a       0.066140
pf         0.056075
ast        0.047649
blk        0.040645
dtype: float64

The top 3 features are fg2_pct (2-point percentage), fg3_pct (3-point percentage), and reb (rebounds).
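The tree is fit with `criterion = "entropy"`, which scores a candidate split by how much it reduces the entropy of the win/loss labels. The importances listed above are built from those reductions. A minimal sketch of the calculation for a single split, using hypothetical helpers and made-up labels:

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Made-up node: 4 wins and 4 losses, split into two mostly pure children
parent = ["win"] * 4 + ["loss"] * 4
left = ["win"] * 3 + ["loss"] * 1
right = ["win"] * 1 + ["loss"] * 3

gain = information_gain(parent, left, right)
print(round(entropy(parent), 4), round(gain, 4))
```

A feature that repeatedly produces high-gain splits near the top of the tree ends up with a large share of the total importance.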

### Rebuild Model using Top 3 Features

Now I will rebuild the model using only the top 3 features found in the previous section.

In [31]:
X_train_scaled_dec_tree_top3 = X_train_scaled[dec_tree_top_features]
X_test_scaled_dec_tree_top3 = X_test_scaled[dec_tree_top_features]

In [32]:
dec_tree_top3 = DecisionTreeClassifier(criterion = "entropy", random_state = 610)
dec_tree_top3.fit(X_train_scaled_dec_tree_top3, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=610)

### Model Evaluation

In [33]:
cv = KFold(n_splits = 10, random_state = 610, shuffle = True)
scores = cross_val_score(dec_tree_top3, X_train_scaled_dec_tree_top3, y_train, cv = cv)
print("Cross Validation Mean Score: ", round(np.mean(scores), 4), " (Std: ", round(np.std(scores), 4), ")", sep = "")
print("Classification Report for Test Data")
print(classification_report(y_test, dec_tree_top3.predict(X_test_scaled_dec_tree_top3)))

Cross Validation Mean Score: 0.657 (Std: 0.0094)
Classification Report for Test Data
              precision    recall  f1-score   support

        loss       0.66      0.65      0.65      7693
         win       0.65      0.66      0.66      7706

    accuracy                           0.65     15399
   macro avg       0.65      0.65      0.65     15399
weighted avg       0.65      0.65      0.65     15399



In [34]:
# Predict once and reuse the result for every metric
y_pred_dec_tree = dec_tree_top3.predict(X_test_scaled_dec_tree_top3)

# scikit-learn metric functions take (y_true, y_pred) in that order
dec_tree_evaluation = pd.DataFrame([{
    "model": "Decision Tree",
    "features": dec_tree_top_features,
    "train_cv_mean": round(np.mean(scores), 4),
    "train_cv_std": round(np.std(scores), 4),
    "accuracy": round(accuracy_score(y_test, y_pred_dec_tree), 4),
    "precision_win": round(precision_score(y_test, y_pred_dec_tree, pos_label = 'win'), 4),
    "precision_loss": round(precision_score(y_test, y_pred_dec_tree, pos_label = 'loss'), 4),
    "recall_win": round(recall_score(y_test, y_pred_dec_tree, pos_label = 'win'), 4),
    "recall_loss": round(recall_score(y_test, y_pred_dec_tree, pos_label = 'loss'), 4),
    "f1_win": round(f1_score(y_test, y_pred_dec_tree, pos_label = 'win'), 4),
    "f1_loss": round(f1_score(y_test, y_pred_dec_tree, pos_label = 'loss'), 4)}])
dec_tree_evaluation

Unnamed: 0,model,features,train_cv_mean,train_cv_std,accuracy,precision_win,precision_loss,recall_win,recall_loss,f1_win,f1_loss
0,Decision Tree,"[fg2_pct, fg3_pct, reb]",0.657,0.0094,0.654,0.6527,0.6553,0.6595,0.6485,0.6561,0.6519


## Model 3: Random Forest

The last model I will try is the Random Forest. This is more powerful than a single Decision Tree, however it loses out on interpretability. Fortunately in this scenario, the details of how the model works aren’t very relevant and what matters is reducing the model to a few key inputs.
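The extra power comes from combining many trees, each trained on a different random sample of the data. A minimal sketch of the combining step, using a hypothetical `majority_vote` helper and made-up per-tree predictions, is below. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes. Hard votes are used here only to keep the idea simple.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree predictions for one game into a single label (hypothetical helper)."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up predictions from five trees for the same game
votes = ["win", "loss", "win", "win", "loss"]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

No single tree explains the final answer, which is the interpretability cost mentioned above.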

### Get top 3 Features

First, I will build a model with all features and create a list of the features ordered by importance. Then I will extract the top 3. Those 3 features will be used to build the real model.

In [35]:
rand_for = RandomForestClassifier(n_estimators = 300, random_state = 1, n_jobs = -1)
rand_for.fit(X_train_scaled, y_train)

rand_for_feature_list = pd.Series(rand_for.feature_importances_, index = rand_for.feature_names_in_).sort_values(ascending = False) 

rand_for_top_features = list(rand_for_feature_list[0:3].index)
rand_for_feature_list

fg2_pct    0.152483
fg3_pct    0.145871
reb        0.126755
ast        0.077766
fta        0.071842
fg3a       0.070037
ft_pct     0.069851
to         0.063353
fg2a       0.058817
pf         0.058231
stl        0.057918
blk        0.047075
dtype: float64

The top 3 features are fg2_pct (2-point percentage), fg3_pct (3-point percentage), and reb (rebounds).

### Rebuild Model using Top 3 Features

Now I will rebuild the model using only the top 3 features found in the previous section.

In [36]:
X_train_scaled_rand_for_top3 = X_train_scaled[rand_for_top_features]
X_test_scaled_rand_for_top3 = X_test_scaled[rand_for_top_features]

In [37]:
rand_for_top3 = RandomForestClassifier(n_estimators = 300, random_state = 1, n_jobs = -1)
rand_for_top3.fit(X_train_scaled_rand_for_top3, y_train)

RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=1)

### Model Evaluation

In [38]:
cv = KFold(n_splits = 10, random_state = 610, shuffle = True)
scores = cross_val_score(rand_for_top3, X_train_scaled_rand_for_top3, y_train, cv = cv)
print("Cross Validation Mean Score: ", round(np.mean(scores), 4), " (Std: ", round(np.std(scores), 4), ")", sep = "")
print("Classification Report for Test Data")
print(classification_report(y_test, rand_for_top3.predict(X_test_scaled_rand_for_top3)))

Cross Validation Mean Score: 0.7043 (Std: 0.0082)
Classification Report for Test Data
              precision    recall  f1-score   support

        loss       0.71      0.70      0.70      7693
         win       0.70      0.71      0.71      7706

    accuracy                           0.71     15399
   macro avg       0.71      0.71      0.71     15399
weighted avg       0.71      0.71      0.71     15399



In [39]:
# Predict once and reuse the result for every metric
y_pred_rand_for = rand_for_top3.predict(X_test_scaled_rand_for_top3)

# scikit-learn metric functions take (y_true, y_pred) in that order
rand_for_evaluation = pd.DataFrame([{
    "model": "Random Forest",
    "features": rand_for_top_features,
    "train_cv_mean": round(np.mean(scores), 4),
    "train_cv_std": round(np.std(scores), 4),
    "accuracy": round(accuracy_score(y_test, y_pred_rand_for), 4),
    "precision_win": round(precision_score(y_test, y_pred_rand_for, pos_label = 'win'), 4),
    "precision_loss": round(precision_score(y_test, y_pred_rand_for, pos_label = 'loss'), 4),
    "recall_win": round(recall_score(y_test, y_pred_rand_for, pos_label = 'win'), 4),
    "recall_loss": round(recall_score(y_test, y_pred_rand_for, pos_label = 'loss'), 4),
    "f1_win": round(f1_score(y_test, y_pred_rand_for, pos_label = 'win'), 4),
    "f1_loss": round(f1_score(y_test, y_pred_rand_for, pos_label = 'loss'), 4)}])
rand_for_evaluation

Unnamed: 0,model,features,train_cv_mean,train_cv_std,accuracy,precision_win,precision_loss,recall_win,recall_loss,f1_win,f1_loss
0,Random Forest,"[fg2_pct, fg3_pct, reb]",0.7043,0.0082,0.7063,0.7043,0.7085,0.7123,0.7004,0.7083,0.7044


## Model Comparison

In [40]:
evaluation_df = pd.concat([log_reg_evaluation, dec_tree_evaluation, rand_for_evaluation]).\
                set_index("model").\
                sort_values("accuracy", ascending = False)

evaluation_df

model,features,train_cv_mean,train_cv_std,accuracy,precision_win,precision_loss,recall_win,recall_loss,f1_win,f1_loss
Random Forest,"[fg2_pct, fg3_pct, reb]",0.7043,0.0082,0.7063,0.7043,0.7085,0.7123,0.7004,0.7083,0.7044
Decision Tree,"[fg2_pct, fg3_pct, reb]",0.657,0.0094,0.654,0.6527,0.6553,0.6595,0.6485,0.6561,0.6519
Logistic Regression,"[reb, fg3a, fg2a]",0.6329,0.0068,0.6328,0.6342,0.6313,0.6287,0.6368,0.6315,0.6341


* Random Forest had the best performance in every test evaluation metric by a sizable margin making the choice an easy one. 
* The simpler Decision Tree also beat out Logistic Regression. 
* Both Random Forest and Decision Tree had the same top 3 features (fg2_pct, fg3_pct, reb) in that order.
* This isn't captured in the results above, but running Cross Validation with Random Forest took noticeably longer than the other two models and caused the fans of my PC to start running pretty loud. This isn't an issue given the size and complexity of my model, but it would be a factor at a larger scale.
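The runtime difference could be measured rather than judged by fan noise. A minimal sketch, assuming a hypothetical `timed` helper built on the standard library's `time.perf_counter`, with a cheap built-in call standing in for the model:

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; any callable can be timed the same way
result, seconds = timed(sum, range(1_000_000))
print(result, round(seconds, 4))
```

In this notebook the same helper could wrap each cross validation call, for example `timed(cross_val_score, rand_for_top3, X_train_scaled_rand_for_top3, y_train, cv = cv)`, and the elapsed seconds could be added as another column in the comparison table.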

Interpretability isn't an issue for my hypothetical scenario because, at the end of the day, teams just want to know which basketball stats to focus on. They don't care about the details of how the model works, just that it's accurate. In conclusion, I'm going with Random Forest. The top 3 features are **2-point percentage**, **3-point percentage**, and **rebounds**.