Data Source: https://data.world/datatouille/stephen-curry-stats

In [79]:
import pandas as pd
import numpy as np

In [80]:
# Convert tidy csv to numpy array
filename = "cleaned_stephcurry.csv"
df = pd.read_csv(filename)

arr = df.to_numpy()
arr

array([[  0,  36,   7, ...,   2,  14,   0],
       [  1,  39,   5, ...,   3,  12,   0],
       [  2,  28,   3, ...,   1,   7,   1],
       ...,
       [875,  33,  10, ...,   3,  29,   1],
       [876,  36,  11, ...,   2,  35,   1],
       [877,  25,   7, ...,   1,  23,   1]], dtype=int64)

In [81]:
# Create data and label arrays. For data, ignore index and label columns

all_data = arr[:, 1 : -1]
all_labels = arr[:, -1]


In [82]:
from sklearn.model_selection import train_test_split

# Train-test split
train_data, test_data, train_labels, test_labels = train_test_split(all_data, all_labels, test_size=0.2, random_state=47)

## Tree

In [83]:
from sklearn.model_selection import cross_val_score
from sklearn import tree


# Cross validate across parameters and find best tree depth

best_d = 1
best_accuracy = 0.0

for d in range(1,20):
    cv_model = tree.DecisionTreeClassifier(max_depth=d)   
    cv_scores = cross_val_score( cv_model, train_data, train_labels, cv=5 ) # 5-fold cross val
    average_cv_accuracy = cv_scores.mean()  
    print(f"depth: {d:2d}  cv accuracy: {average_cv_accuracy:7.4f}")
    
    if average_cv_accuracy > best_accuracy:
        best_accuracy = average_cv_accuracy
        best_d = d

    
    
# assign best value of d to best_depth
best_depth = best_d   
print()
print(f"best_depth = {best_depth} is our choice for an underfitting/overfitting balance.")  

depth:  1  cv accuracy:  0.6425
depth:  2  cv accuracy:  0.6481
depth:  3  cv accuracy:  0.6653
depth:  4  cv accuracy:  0.6610
depth:  5  cv accuracy:  0.6239
depth:  6  cv accuracy:  0.6310
depth:  7  cv accuracy:  0.6140
depth:  8  cv accuracy:  0.6382
depth:  9  cv accuracy:  0.6425
depth: 10  cv accuracy:  0.6254
depth: 11  cv accuracy:  0.6340
depth: 12  cv accuracy:  0.6325
depth: 13  cv accuracy:  0.6239
depth: 14  cv accuracy:  0.6211
depth: 15  cv accuracy:  0.6240
depth: 16  cv accuracy:  0.6340
depth: 17  cv accuracy:  0.6382
depth: 18  cv accuracy:  0.6254
depth: 19  cv accuracy:  0.6268

best_depth = 3 is our choice for an underfitting/overfitting balance.


### --- Gridsearch to verify our for loop results ---

In [84]:

from sklearn.model_selection import GridSearchCV

gs_tree = tree.DecisionTreeClassifier()
our_gridsearch = GridSearchCV(gs_tree,
                              param_grid={ "max_depth" : range(1,10)},
                              cv=5,
                              verbose=1,
                              n_jobs=-1
                              )

our_gridsearch.fit(train_data, train_labels)

print(f"Best parameters: {our_gridsearch.best_params_}")
print(f"Best score: {our_gridsearch.best_score_}")

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best parameters: {'max_depth': 3}
Best score: 0.6653191489361703


#### ---- Feature importances and final accuracy! ----

In [85]:
final_tree = tree.DecisionTreeClassifier(max_depth=best_depth)
final_tree.fit(all_data, all_labels)


#### Create dictionary of columns for feature importance processing
data_df = df.iloc[:, 1 : -1] #Matches data array

COLUMNS = data_df.columns

COLUMN_DICT = {}
for i, name in enumerate(COLUMNS):
    COLUMN_DICT[i] = name
    


    
tree_importances = final_tree.feature_importances_.tolist()

print("Decision Tree Results: ")
print()

for i in range(len(tree_importances)):
    print(f"Feature importance of {COLUMN_DICT[i]} : {tree_importances[i]}")
    
print()
print(f"Final accuracy of our model across all data: {final_tree.score(all_data, all_labels)}")

Decision Tree Results: 

Feature importance of Minutes : 0.27628665004046493
Feature importance of Successful Shots : 0.0
Feature importance of Total Shots : 0.0
Feature importance of 3 Points Succesful : 0.0
Feature importance of Total 3 Points : 0.0
Feature importance of Successful FT : 0.0
Feature importance of Total FT : 0.0
Feature importance of REB : 0.0
Feature importance of AST : 0.08541872635911368
Feature importance of BLK : 0.0
Feature importance of STL : 0.0
Feature importance of PF : 0.0
Feature importance of TO : 0.04144598959531665
Feature importance of PTS : 0.5968486340051048

Final accuracy of our model across all data: 0.7004555808656037


## Random Forest

In [86]:
# Cross-validate and find best hyperparameters

from sklearn import ensemble

best_d = 1
best_ntrees = 50   
best_score = 0

for d in range(1,11):
    for ntrees in range(50,300,100):
        rforest_model = ensemble.RandomForestClassifier(max_depth=d, 
                                                        n_estimators=ntrees,
                                                        max_samples=0.5)
        cv_scores = cross_val_score( rforest_model, train_data, train_labels, cv=5 ) 
        average_cv_accuracy = cv_scores.mean()  
        print(f"depth: {d:2d} ntrees: {ntrees:3d} cv accuracy: {average_cv_accuracy:7.4f}")
        if average_cv_accuracy > best_score:
            best_d = d
            best_ntrees = ntrees


best_depth = best_d   
best_num_trees = best_ntrees


print()
print(f"best_depth: {best_depth} and best_num_trees: {best_num_trees} are our choices.")  


depth:  1 ntrees:  50 cv accuracy:  0.6438
depth:  1 ntrees: 150 cv accuracy:  0.6268
depth:  1 ntrees: 250 cv accuracy:  0.6282
depth:  2 ntrees:  50 cv accuracy:  0.6695
depth:  2 ntrees: 150 cv accuracy:  0.6609
depth:  2 ntrees: 250 cv accuracy:  0.6623
depth:  3 ntrees:  50 cv accuracy:  0.6680
depth:  3 ntrees: 150 cv accuracy:  0.6624
depth:  3 ntrees: 250 cv accuracy:  0.6695
depth:  4 ntrees:  50 cv accuracy:  0.6681
depth:  4 ntrees: 150 cv accuracy:  0.6809
depth:  4 ntrees: 250 cv accuracy:  0.6624
depth:  5 ntrees:  50 cv accuracy:  0.6653
depth:  5 ntrees: 150 cv accuracy:  0.6937
depth:  5 ntrees: 250 cv accuracy:  0.6866
depth:  6 ntrees:  50 cv accuracy:  0.6880
depth:  6 ntrees: 150 cv accuracy:  0.6837
depth:  6 ntrees: 250 cv accuracy:  0.6880
depth:  7 ntrees:  50 cv accuracy:  0.6980
depth:  7 ntrees: 150 cv accuracy:  0.6880
depth:  7 ntrees: 250 cv accuracy:  0.6980
depth:  8 ntrees:  50 cv accuracy:  0.6809
depth:  8 ntrees: 150 cv accuracy:  0.7023
depth:  8 n

### --- Gridsearch ---

In [87]:
gs_rf = ensemble.RandomForestClassifier()

params = {"max_depth" : range(1, 11),
          "n_estimators" : range(50, 300, 100),
          "max_samples" : [0.5]}

gridsearch2 = GridSearchCV(gs_rf,
                           param_grid=params,
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gridsearch2.fit(train_data, train_labels)

print(f"Best parameters: {gridsearch2.best_params_}")
print(f"Best score: {gridsearch2.best_score_}")


Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best parameters: {'max_depth': 9, 'max_samples': 0.5, 'n_estimators': 250}
Best score: 0.7037082066869301


#### ---- Feature importances and final accuracy! ----

In [88]:
final_forest = ensemble.RandomForestClassifier(max_depth=best_depth, n_estimators=best_num_trees, max_samples=0.5)
final_forest.fit(all_data, all_labels)


forest_importances = final_forest.feature_importances_.tolist()

print("Random Forest Results: ")
print()

for i in range(len(tree_importances)):
    print(f"Feature importance of {COLUMN_DICT[i]} : {tree_importances[i]}")
    
print()
print(f"Final accuracy of our model across all data: {final_forest.score(all_data, all_labels)}")


Random Forest Results: 

Feature importance of Minutes : 0.27628665004046493
Feature importance of Successful Shots : 0.0
Feature importance of Total Shots : 0.0
Feature importance of 3 Points Succesful : 0.0
Feature importance of Total 3 Points : 0.0
Feature importance of Successful FT : 0.0
Feature importance of Total FT : 0.0
Feature importance of REB : 0.0
Feature importance of AST : 0.08541872635911368
Feature importance of BLK : 0.0
Feature importance of STL : 0.0
Feature importance of PF : 0.0
Feature importance of TO : 0.04144598959531665
Feature importance of PTS : 0.5968486340051048

Final accuracy of our model across all data: 0.8963553530751709


__Data Background__: This data comes from user _Tristan Malherbe_ on __Data.World__. It's a dataset with every single NBA game played by basketball player Stephen Curry from **October of 2009 to October of 2018.** The dataset contains the following information about every game of basketball (all specific stats about Steph Curry only): 
- the date, opposing team, number of minutes Curry played, number of sucessful shots made, number of total shots attempted, number of sucessful 3 pointers, total 3-pointers attempted, sucessful free throws, total attempted free throws, rebounds, assists, blocks, steals, personal fouls, turnovers, total points scored, type of game (reg season, conference), result (win or loss), and the final team scores.

Out of this, we chose to drop the date, opposing team, type of game, and final team score columns. The rest were used as features, and the result (Win or Loss) is what we were trying to predict. 

__Motivation__: NBA superstars like Steph Curry are often perceived as the primary drivers behind their basketball teams: if the team wins, it's because people think the superstar had a hot game and carried. If the team loses, people believe it's because the player underperformed and let the team down. We wanted to see how strongly we could predict the outcome of a basketball match by only looking at Curry's performance throughout a game. His contract is worth a LOT of money: is that money well-earned? 