# NBA Playoffs

---
# Summary of Findings


## Introduction
---
In this project, I want to predict whether a player will make the playoffs in the 2018 season. This is a classification problem because there are only two outcomes: either the player makes the playoffs or he does not. I use the regular season dataset from 2012-2017 as my primary data, and my target variable comes from the 2018 playoff statistics dataset, which keeps track of who actually made the playoffs. I added a column to the dataframe, called 'outcome', indicating whether the player made the playoffs or not.

My objective will be an accuracy score, since there is not a large class imbalance between the two outcomes. The two outcomes are close to equally likely because 16 of the 30 teams make the playoffs each season. Accuracy is the number of true positives plus the number of true negatives, divided by the total number of players evaluated for 2018. In other words, it is the number of players whose playoff outcome we predict correctly **over** the total number of players for that season.
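As a quick illustration of the accuracy calculation, here is a minimal sketch. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this project.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts, for illustration only
example_accuracy = accuracy(tp=40, tn=30, fp=20, fn=10)
print(example_accuracy)  # 70 correct out of 100 predictions -> 0.7
```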




### Results: Baseline Model
---

In my baseline model, I first cleaned the data and removed the duplicate rows of any player who switched teams during the season, keeping only the last entry to show the team he finished the season on. Then I used a pipeline with a column transformer and a decision tree classifier. Since there were no ordinal columns, the dataset consisted of only numeric and nominal (categorical) columns. In the column transformer, I standardized all the numeric columns (26 of them) and transformed the 2 categorical columns (team & position) into one-hot encoded columns.
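The de-duplication step can be sketched on a toy table. The player names and team codes below are hypothetical, and the sketch assumes (as the cleaning step does) that a traded player's final team appears last in row order.

```python
import pandas as pd

# Hypothetical rows: player P1 switched teams during the 2017 season
toy = pd.DataFrame({
    "Player": ["P1", "P1", "P2"],
    "Year_reg": [2017, 2017, 2017],
    "Tm_reg": ["AAA", "BBB", "CCC"],
})

# Keep only the last row of each (player, season) pair
last_rows = toy.groupby(["Player", "Year_reg"]).tail(1)

# An equivalent way to express the same idea
last_rows_alt = toy.drop_duplicates(subset=["Player", "Year_reg"], keep="last")

print(last_rows)
```

Both approaches keep P1's `BBB` row and drop the earlier `AAA` row.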


Using a train-test split on the dataset with a test size of 1/7, I calculated an accuracy score of 0.658. I used a test size of 1/7 because there were 7 seasons' worth of data from 2012-2018, and I only wanted to predict on 1 of those 7 years (2018).
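Note that `train_test_split` shuffles rows by default, so a test size of 1/7 holds out a random seventh of the rows rather than the 2018 season itself. If the goal is to test strictly on 2018, one option is to split on the season column directly. The sketch below is an alternative under that assumption, not what this notebook does, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset with a season column and the outcome label
toy = pd.DataFrame({
    "Year_reg": [2012, 2013, 2017, 2018, 2018],
    "PTS_reg": [500, 900, 300, 1200, 150],
    "outcome": [0, 1, 0, 1, 0],
})

# Train on every season before 2018, test on 2018 only
is_2018 = toy["Year_reg"] == 2018
X_train = toy.loc[~is_2018].drop(columns="outcome")
y_train = toy.loc[~is_2018, "outcome"]
X_test = toy.loc[is_2018].drop(columns="outcome")
y_test = toy.loc[is_2018, "outcome"]

print(len(X_train), len(X_test))
```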

I believe that my model performed fairly well, considering it only used the given features. However, it did receive a large number of features to work with, 26+2=28 in total. I want to improve my score by engineering new features and using different classifiers.

### Results: Final Model
---

In my final model, I researched other key basketball statistics and engineered two very powerful features.

The first one is **True Shooting Percentage**, with the formula given below.

**TS**% - True Shooting Percentage; the formula is PTS / (2 * TSA). 

**TSA** - True Shooting Attempts; the formula is FGA + 0.44 * FTA.

True shooting percentage is a measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws. This feature is important because it shows the true efficiency of a player on offense. It allows the coach to view that player's current offensive rating, and can also provide insight into how well that player performs in general, which is a big indicator of whether the player will make it into the playoffs.
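A small worked example of the two formulas above, written as a plain function. The season totals are hypothetical and serve only to show the arithmetic.

```python
def true_shooting_pct(pts: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * TSA), where TSA = FGA + 0.44 * FTA."""
    tsa = fga + 0.44 * fta
    return pts / (2 * tsa)


# Hypothetical season totals: 1000 points, 800 field goal attempts, 250 free throw attempts
# TSA = 800 + 0.44 * 250 = 910, so TS% = 1000 / 1820
ts = true_shooting_pct(pts=1000, fga=800, fta=250)
print(round(ts, 3))  # 0.549
```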

The second feature I engineered is the **Usage Percentage**.

**Usg%** - Usage Percentage; the formula is 100 * ((FGA + 0.44 * FTA + TOV) * (Tm MP / 5)) / (MP * (Tm FGA + 0.44 * Tm FTA + Tm TOV)). 

Within this formula, variables such as 'Tm MP' are aggregated by team. For example, 'Tm FGA' stands for the field goals attempted by the entire team for that season. I grouped my data by team and year, and used a transform to create new columns, aligned with each player's team and season, that hold his team's minutes played, field goals attempted, free throws attempted, and turnovers. Hence the new columns named ['TM_TOV', 'TM_MP', 'TM_FGA', 'TM_FTA'].

Usage percentage is an estimate of the percentage of team plays used by a player while he was on the floor. This is useful because it shows how often the team uses a certain player in their plays. Likely, the more often a player is used in plays, the more critical they are to the team. This is another huge indicator of whether the player will make the playoffs or not, since his usage percentage is related to how reliable and how useful that player is.
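The team-level aggregation and the usage formula can be sketched on a toy table. The column names follow this dataset, but the players, team codes, and totals are hypothetical, and the "teams" are unrealistically small so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical season totals: two players on team XXX, one on team YYY
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Tm_reg": ["XXX", "XXX", "YYY"],
    "Year_reg": [2017, 2017, 2017],
    "MP_reg": [2000, 1000, 1500],
    "FGA_reg": [1000, 300, 600],
    "FTA_reg": [300, 100, 200],
    "TOV_reg": [150, 50, 100],
})

# Broadcast each team's season total back onto every player row
for col in ["MP", "FGA", "FTA", "TOV"]:
    toy[f"TM_{col}"] = (
        toy.groupby(["Tm_reg", "Year_reg"])[f"{col}_reg"].transform("sum")
    )

# Plays used by the player and by the whole team
player_plays = toy["FGA_reg"] + 0.44 * toy["FTA_reg"] + toy["TOV_reg"]
team_plays = toy["TM_FGA"] + 0.44 * toy["TM_FTA"] + toy["TM_TOV"]

toy["USG%"] = 100 * (player_plays * (toy["TM_MP"] / 5)) / (toy["MP_reg"] * team_plays)
print(toy[["Player", "Tm_reg", "USG%"]])
```

Because the formula assumes five players share the floor, a player who accounts for all of his team's plays and minutes (player C here) comes out at 20, the value for an exactly average share.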

With my new features incorporated into the pipeline, I was able to get a higher accuracy score of 0.681.

Next, I used grid search with cross-validation to search over specified parameter values for the best set of parameters for each estimator. I also incorporated the **random forest classifier** and the **k nearest neighbors** classifier into my grid search. These are the results of using grid search on the 3 classifiers.

**Decision Tree Classifier**: {'classifier__max_depth': 18, 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2} **Best score: 0.689**

**Random Forest Classifier**: {'classifier__max_depth': 18,
 'classifier__min_samples_leaf': 2,
 'classifier__min_samples_split': 8} **Best score: 0.708**

**K Nearest Neighbors Classifier**: {'classifier__leaf_size': 1, 'classifier__n_neighbors': 7} **Best score: 0.656**

To conclude, adding the two engineered features improved the accuracy score of the decision tree classifier over my baseline model, and tuning it with grid search improved it further. After running grid search on two additional classifiers, I found that the random forest classifier gave me the highest score of 0.708.

### Results: Fairness Evaluation
---

In my fairness evaluation, I assessed whether my model would be consistent for veterans entering the playoffs. I used age 32 as the cutoff: players aged 32 or older count as veterans in the NBA.

After creating my tables of predictions and outcomes, I ran permutation tests on the entire dataset using the accuracy and true positive parity scores. I set my significance value at 0.05.

For the accuracy parity testing, my **test statistic** is the difference in accuracy score between veterans and non-veterans, which measures how consistently the model predicts playoff entry for the two groups.

For the true positive parity testing, my **test statistic** is the difference in recall score between veterans and non-veterans.

**Null Hypothesis**: Veterans and non-veterans are equally likely to enter the playoffs.

**Alternative Hypothesis**: Veterans are more likely to enter the playoffs.

For the accuracy parity score testing, I received a p-value of **0.45**.

For the True Positive parity score testing, I received a p-value of **0.37**.

In conclusion, both p-values are well above my significance level of 0.05, so I am unable to reject the null hypothesis. Although the true positive parity test produced the smaller p-value (0.37 versus 0.45), it is still far from significant, so the tests give no evidence of a difference between veterans and non-veterans.
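The permutation-test idea can be sketched in a few lines of NumPy. Everything below is hypothetical: the group sizes, the randomly generated "correct prediction" flags, the number of permutations, and the choice of a two-sided p-value are assumptions made for illustration and do not reproduce the numbers reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a group label per player and whether the model was correct (1) or not (0)
labels = np.array(["veteran"] * 20 + ["not veteran"] * 80)
correct = rng.integers(0, 2, size=labels.size)


def accuracy_gap(correct: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy for veterans minus accuracy for non-veterans."""
    is_vet = labels == "veteran"
    return correct[is_vet].mean() - correct[~is_vet].mean()


observed = accuracy_gap(correct, labels)

# Shuffle the group labels many times to build the null distribution
n_permutations = 1000
simulated = np.array([
    accuracy_gap(correct, rng.permutation(labels))
    for _ in range(n_permutations)
])

# Two-sided p-value: how often a shuffled gap is at least as extreme as the observed one
p_value = (np.abs(simulated) >= abs(observed)).mean()
print(round(float(p_value), 3))
```

A p-value above the chosen significance level (0.05 here) means the observed gap is within what random relabeling produces, so the null hypothesis is not rejected.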

---

# Your Code Starts Here

In [2]:
%matplotlib inline
import os

import pandas as pd
import numpy as np

import matplotlib.pyplot as plot
import seaborn as sns

In [3]:
#reading in the csv
df = pd.read_csv('final.csv')
df = df.drop(columns=['Unnamed: 0'])

In [4]:
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 100)

### Results: Baseline Model
---

In [5]:
#light cleaning
#dropping playoff columns
df.loc[df['Year_poff'].notnull(), 'Year_poff'] = 1
df.rename({'Year_poff': 'outcome'}, axis=1, inplace=True)
df['outcome'] = df['outcome'].fillna(0)
df = df.drop(columns=['Pos_poff', 'Age_poff', 'Tm_poff', 'G_poff', 'GS_poff', 'MP_poff',
       'FG_poff', 'FGA_poff', 'FG%_poff', '3P_poff', '3PA_poff', '3P%_poff',
       '2P_poff', '2PA_poff', '2P%_poff', 'eFG%_poff', 'FT_poff', 'FTA_poff',
       'FT%_poff', 'ORB_poff', 'DRB_poff', 'TRB_poff', 'AST_poff', 'STL_poff',
       'BLK_poff', 'TOV_poff', 'PF_poff', 'PTS_poff'])


In [6]:
#keeping the last duplicate player row for those players who switched teams during the season
df = df.groupby(['Player', 'Year_reg']).tail(1).fillna(0)

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import sklearn.preprocessing as pp
import warnings
warnings.filterwarnings("ignore")

In [17]:
# Numeric columns and associated transformers
num_feat = ['Age_reg', 'G_reg', 'GS_reg',
       'MP_reg', 'FG_reg', 'FGA_reg', 'FG%_reg', '3P_reg', '3PA_reg',
       '3P%_reg', '2P_reg', '2PA_reg', '2P%_reg', 'eFG%_reg', 'FT_reg',
       'FTA_reg', 'FT%_reg', 'ORB_reg', 'DRB_reg', 'TRB_reg', 'AST_reg',
       'STL_reg', 'BLK_reg', 'TOV_reg', 'PF_reg', 'PTS_reg']
num_transformer = Pipeline(steps=[
    ('scaler', pp.StandardScaler())
])

# Categorical columns and associated transformers
cat_feat = ['Pos_reg', 'Tm_reg']
cat_transformer = Pipeline(steps=[
    ('one-hot', pp.OneHotEncoder(handle_unknown='ignore'))
])

# preprocessing pipeline (put them together)
preproc = ColumnTransformer(transformers=[('num', num_transformer, num_feat), ('cat', cat_transformer, cat_feat)])

#generic pipeline with basic features, using a decision tree classifier
pl = Pipeline(steps=[('preprocessor', preproc), ('classifier', DecisionTreeClassifier())])


I conduct a train/test split with a test size of 1/7. The basic pipeline applies generic transformations, a standard scaler and a one-hot encoder, and uses a decision tree classifier to classify the outcomes.

In [18]:
# features
X = df.drop('outcome', axis=1)
# outcome
y = df.outcome

# test size is 2018, which is 1 of the 7 years from 2012-2018
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/7) 
pl.fit(X_train, y_train)
pl.score(X_test, y_test)

0.6584867075664622

### Results: Final Model
---

In my final model, I engineer two new features: true shooting percentage and usage percentage. 

In [20]:
#engineering new features

#True shooting percentage
TSA = df['FGA_reg'] + 0.44 * df['FTA_reg']
df['TS%'] = df['PTS_reg'] / (2 * TSA)

#Usage Percentage

#getting each player's team's total turnovers, minutes played, field goals attempted
#and free throws attempted, **per year**

#these are just helper columns to calculate the usage rate
df['TM_TOV'] = df.groupby(['Tm_reg', 'Year_reg'])['TOV_reg'].transform('sum')
df['TM_MP'] = df.groupby(['Tm_reg', 'Year_reg'])['MP_reg'].transform('sum')
df['TM_FGA'] = df.groupby(['Tm_reg', 'Year_reg'])['FGA_reg'].transform('sum')
df['TM_FTA'] = df.groupby(['Tm_reg', 'Year_reg'])['FTA_reg'].transform('sum')

#calculating usage rate, scaled by 100 to match the Usg% formula
#(the constant factor has no effect once the column is standardized)
df['USG%'] = (100 * ((df['FGA_reg'] + 0.44 * df['FTA_reg'] + df['TOV_reg']) * (df['TM_MP'] / 5))
              / (df['MP_reg'] * (df['TM_FGA'] + 0.44 * df['TM_FTA'] + df['TM_TOV'])))
                                    
df = df.fillna(0)

Running the pipeline still using the decision tree classifier, but with two new engineered features.

In [31]:
# Numeric columns and associated transformers
newnum_feat = ['Age_reg', 'G_reg', 'GS_reg',
       'MP_reg', 'FG_reg', 'FGA_reg', 'FG%_reg', '3P_reg', '3PA_reg',
       '3P%_reg', '2P_reg', '2PA_reg', '2P%_reg', 'eFG%_reg', 'FT_reg',
       'FTA_reg', 'FT%_reg', 'ORB_reg', 'DRB_reg', 'TRB_reg', 'AST_reg',
       'STL_reg', 'BLK_reg', 'TOV_reg', 'PF_reg', 'PTS_reg', 'TS%', 'USG%']

newnum_transformer = Pipeline(steps=[
    ('scaler', pp.StandardScaler())
])

# Categorical columns and associated transformers
newcat_feat = ['Pos_reg', 'Tm_reg']
newcat_transformer = Pipeline(steps=[
    ('onehot', pp.OneHotEncoder(handle_unknown='ignore'))
])

# preprocessing pipeline (put them together)
newpreproc = ColumnTransformer(transformers=[('num', newnum_transformer, newnum_feat), 
                                          ('cat', newcat_transformer, newcat_feat)])

newpl = Pipeline(steps=[('preprocessor', newpreproc), ('classifier', DecisionTreeClassifier())])

In [29]:
# features
newX = df.drop(columns = ['Rk_reg', 'Player', 'outcome'])
# outcome
newY = df.outcome

# test size is the year 2018, which is 1 of the 7 years from 2012-2018
newX_train, newX_test, newY_train, newY_test = train_test_split(newX, newY, test_size=1/7) 
newpl.fit(newX_train, newY_train)
newpl.score(newX_test, newY_test)

0.6809815950920245

Finding the best mix of parameters using grid search, for **decision tree classifier**.

In [320]:
#searching for the best model using grid search
from sklearn.model_selection import GridSearchCV


parameters = {
    'classifier__max_depth': [2,3,5,13,15,18,None], 
    'classifier__min_samples_split':[2,3,5,8],
    'classifier__min_samples_leaf':[1,2,3,5]
}

grid = GridSearchCV(newpl, parameters, cv=5)
grid.fit(newX_train, newY_train)
grid.best_params_

{'classifier__max_depth': 18,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2}

In [381]:
#the best score achieved by the best mix of parameters
grid.best_score_

0.6886037089895432

Finding the best mix of parameters using grid search, for **random forest classifier** and **k nearest neighbors classifier**.

In [33]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [323]:
#random forest classifier
ran_pl = Pipeline(steps=[('preprocessor', newpreproc), 
                        ('classifier', RandomForestClassifier(n_estimators=10, max_depth=None))])

ran_pl.fit(newX_train, newY_train)
ran_grid = GridSearchCV(ran_pl, parameters, cv=5)
ran_grid.fit(newX_train, newY_train)
ran_grid.best_params_

{'classifier__max_depth': 18,
 'classifier__min_samples_leaf': 2,
 'classifier__min_samples_split': 8}

In [324]:
#the best score achieved by random forest classifier
ran_grid.best_score_

0.7080518295853983

In [343]:
#k nearest neighbor classifier
parameters = {
    'classifier__n_neighbors': [1,2,3,5,7], 
    'classifier__leaf_size':[1,2,3,5]
}

knn_pl = Pipeline(steps=[('preprocessor', newpreproc), 
                        ('classifier', KNeighborsClassifier(n_neighbors=1))])
knn_pl.fit(newX_train, newY_train)
knn_grid = GridSearchCV(knn_pl, parameters, cv=5)
knn_grid.fit(newX_train, newY_train)
knn_grid.best_params_

{'classifier__leaf_size': 1, 'classifier__n_neighbors': 7}

In [382]:
#the best score achieved by k nearest neighbor classifier
knn_grid.best_score_

0.6558757879747932

### Results: Fairness Evaluation
---

Evaluating model for fairness based on whether or not a player's veteran status affects him entering the playoffs.

In [35]:
from sklearn import metrics
#permutation test on whether predicting playoff entry is consistent for players aged 32 and over
#testing whether the model is consistent with veterans (players aged 32 or older)
predictions = ran_pl.predict(newX_test)
#copy so the helper columns added below do not end up in the test features
perm = newX_test.copy()

#creating table of actual outcomes and predictions
perm['predictions'] = predictions
perm['outcome'] = newY_test
perm['veteran'] = (perm.Age_reg >= 32).replace({True:'veteran', False:'not veteran'})


Running permutation tests for accuracy and True Positive parities.

Setting significance value at p = 0.05.

In [42]:
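#Accuracy parity
#observed statistic: difference in accuracy between the two veteran groups
obs = perm.groupby('veteran').apply(lambda x: metrics.accuracy_score(x.outcome, x.predictions)).diff().iloc[-1]

metrs = []
for _ in range(100):
    #shuffle the veteran labels; .values assigns them by position instead of
    #re-aligning on the original row index, which would undo the shuffle
    s = (
        perm[['veteran', 'predictions', 'outcome']]
        .assign(veteran=perm.veteran.sample(frac=1.0, replace=False).values)
        .groupby('veteran')
        .apply(lambda x: metrics.accuracy_score(x.outcome, x.predictions))
        .diff()
        .iloc[-1]
    )
    
    metrs.append(s)
#p-value: share of shuffled statistics at or below the observed one
(np.array(metrs) <= obs).mean()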


0.45

In [53]:
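#True Positive parity
#observed statistic: difference in recall between the two veteran groups
obs = perm.groupby('veteran').apply(lambda x: metrics.recall_score(x.outcome, x.predictions)).diff().iloc[-1]

metrs = []
for _ in range(100):
    #shuffle the veteran labels; .values assigns them by position instead of
    #re-aligning on the original row index, which would undo the shuffle
    s = (
        perm[['veteran', 'predictions', 'outcome']]
        .assign(veteran=perm.veteran.sample(frac=1.0, replace=False).values)
        .groupby('veteran')
        .apply(lambda x: metrics.recall_score(x.outcome, x.predictions))
        .diff()
        .iloc[-1]
    )
    
    metrs.append(s)
#p-value: share of shuffled statistics at or below the observed one
(np.array(metrs) <= obs).mean()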

0.37
