# NBA Data Analysis: Using NBA Advanced Statistics to Predict the Number of Playoff Wins for Each Team!


## Introduction

In recent years, the NBA has shifted noticeably towards using data analytics to make informed decisions about roster construction, draft picks, and much more.

Consequently, there has been an explosion of advanced stats used to evaluate NBA rosters and improve each team's chances of winning the coveted Larry O'Brien Trophy.

This led me to a specific question: how useful are these advanced stats?

In this project, I use the [nba_api](https://github.com/swar/nba_api) package developed by Swar Patel. It is a free API that allows me to generate my own dataset.

In [6]:
!pip install nba_api



## Libraries

In [4]:
from nba_api.stats.library import parameters
from nba_api.stats.endpoints import leaguedashteamstats
from nba_api.stats.endpoints import leaguegamefinder

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colormaps


from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,KFold
from sklearn.metrics import r2_score, mean_absolute_error

In [6]:
# Show every column when displaying a DataFrame (a personal preference; this can be ignored)
pd.set_option('display.max_columns',None)

## Building the Dataset

First, we build a dataset of team-level advanced stats for the 2003-04 through 2023-24 NBA seasons using nba_api.
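
The API identifies a season by a string such as `2003-04`, while the notebook labels each season by the year in which it ends. A minimal sketch of that conversion, using the same expression as the loop below (the helper name `season_label` is hypothetical and only for illustration):

```python
def season_label(start_year):
    """Build an NBA season string from the year the season starts in."""
    # Take the last two digits of the following year, e.g. 2004 -> "04"
    return f"{start_year}-{str(start_year + 1)[-2:]}"


# Season strings for the first and last seasons in the dataset
first_season = season_label(2003)
last_season = season_label(2023)
print(first_season, last_season)
```

Slicing the last two characters also handles a century rollover, so a start year of 1999 would give `1999-00`.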

In [31]:

# Collect each season's DataFrame, then combine them once at the end
season_frames = []

# Loop through seasons from 2003-04 to 2023-24
for year in range(2003, 2024):
    season = f"{year}-{str(year+1)[-2:]}"

    # Fetch the per-game advanced team stats for the season
    team_stats = leaguedashteamstats.LeagueDashTeamStats(
        season=season,
        per_mode_detailed='PerGame',
        measure_type_detailed_defense='Advanced',
    )
    season_df = team_stats.get_data_frames()[0]

    # Label each season by the year in which it ends (e.g. 2003-04 -> 2004)
    season_df['SEASON'] = year + 1
    season_frames.append(season_df)

# Combine every season's DataFrame into one big DataFrame
all_seasons_df = pd.concat(season_frames, ignore_index=True)


ReadTimeout: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
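
A read timeout like the one above can occur when many requests are sent to stats.nba.com in a row. One common workaround is to retry a failed request after a pause that grows with each attempt. The sketch below is a generic illustration, not part of nba_api: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the failing call is simulated so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:  # in the notebook this would be the ReadTimeout
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Stand-in for an API call that times out twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated read timeout")
    return "season data"


# Skip the real pauses here so the demonstration finishes instantly
result = fetch_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result, calls["count"])
```

In the loop above, `fetch` could be a small function that wraps the `LeagueDashTeamStats(...)` call and returns its DataFrame.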

Next, we build a second dataset that stores the number of playoff wins for each team in each season, since playoff wins are the outcome we want to predict.

In [None]:

# Collect each season's playoff win counts, then combine them once at the end
win_frames = []

# Loop through each season
for year in range(2003, 2024):
    season = f"{year}-{str(year+1)[-2:]}"

    # Fetch the playoff game data for the season
    gamefinder = leaguegamefinder.LeagueGameFinder(season_nullable=season, season_type_nullable='Playoffs')
    games_df = gamefinder.get_data_frames()[0]

    # Keep only the wins and count them per team
    wins_df = games_df[games_df['WL'] == 'W'].groupby('TEAM_NAME').size().reset_index(name='PLAYOFF_WINS')
    wins_df['SEASON'] = year + 1
    win_frames.append(wins_df)

# Combine every season's win counts into one DataFrame
playoff_wins = pd.concat(win_frames, ignore_index=True)
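
The counting step above filters the game log down to wins and then counts rows per team. The same logic is sketched below in plain Python on a made-up game log (the team names and results are illustrative, not real data):

```python
from collections import Counter

# Made-up game log: one entry per team per game, as (team, result)
games = [
    ("Team A", "W"), ("Team B", "L"),
    ("Team A", "W"), ("Team B", "L"),
    ("Team A", "L"), ("Team B", "W"),
]

# Keep only the wins, then count how many each team has
playoff_win_counts = Counter(team for team, result in games if result == "W")
print(playoff_win_counts)
```

A team with no wins never appears in the filtered rows, so it gets no entry at all. In the notebook that case is covered later, when missing `PLAYOFF_WINS` values are filled with 0 after the merge.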


## Assessing the Dataset

First, we view both datasets and check that there is no missing data.

In [None]:
all_seasons_df.head(5)

In [None]:
# Checking the shape of the seasons dataset
print('shape:',all_seasons_df.shape)

# Basic info of the seasons dataset
print(all_seasons_df.info())

In [None]:
# Checking the shape of the playoff wins dataset
print('shape:',playoff_wins.shape)

# Basic info of the playoff wins dataset
print(playoff_wins.info())

In [None]:
#Checking to make sure there are no missing values
print(all_seasons_df.isnull().sum())

In [None]:
#Checking to make sure there are no missing values
print(playoff_wins.isnull().sum())

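Now, we merge both datasets on team name and season. Teams that did not win a playoff game have no row in the playoff wins dataset, so the left merge leaves their PLAYOFF_WINS empty and we fill it with 0.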

In [None]:
merged_df = pd.merge(all_seasons_df, playoff_wins, how='left', on=['TEAM_NAME', 'SEASON'])
merged_df['PLAYOFF_WINS'] = merged_df['PLAYOFF_WINS'].fillna(0)
merged_df.head()
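
To see what the left merge and `fillna(0)` do, here is a small sketch on made-up rows (the team names and numbers are illustrative only):

```python
import pandas as pd

# Toy regular-season rows: one per team per season
season_stats = pd.DataFrame({
    "TEAM_NAME": ["Team A", "Team B", "Team C"],
    "SEASON": [2004, 2004, 2004],
    "W": [58, 41, 25],
})

# Toy playoff win counts: Team C missed the playoffs, so it has no row
wins = pd.DataFrame({
    "TEAM_NAME": ["Team A", "Team B"],
    "SEASON": [2004, 2004],
    "PLAYOFF_WINS": [16, 2],
})

# A left merge keeps every regular-season row, matched or not
merged = pd.merge(season_stats, wins, how="left", on=["TEAM_NAME", "SEASON"])
# Unmatched teams get NaN, which we replace with 0 playoff wins
merged["PLAYOFF_WINS"] = merged["PLAYOFF_WINS"].fillna(0)
print(merged)
```

Because the match is on the exact `TEAM_NAME` string, a team whose name is spelled differently in the two datasets would silently end up with 0 wins, so it is worth spot-checking a few known playoff teams after merging.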

## Visualizations

To make the visualizations easier to generate, I first define variables for the feature columns, the target, and the team colors.

In [None]:
# These are my independent variables I will be using to determine the number of playoff wins
features = ['W', 'L', 'W_RANK','W_PCT', 'MIN','OFF_RATING', 'DEF_RATING','NET_RATING','NET_RATING_RANK', 'AST_PCT', 'AST_TO', 'AST_RATIO', 'OREB_PCT', 'DREB_PCT',
'REB_PCT', 'TM_TOV_PCT', 'EFG_PCT', 'TS_PCT','E_PACE', 'PACE', 'POSS', 'PIE','PIE_RANK']

# This is my dependent variable (the value I want to predict)
target = 'PLAYOFF_WINS'

# Both features and target will be used in the correlation matrix
cor_mat_features = ['W', 'L','W_RANK', 'W_PCT', 'MIN','OFF_RATING', 'DEF_RATING','NET_RATING','NET_RATING_RANK', 'AST_PCT', 'AST_TO', 'AST_RATIO', 'OREB_PCT', 'DREB_PCT',
'REB_PCT', 'TM_TOV_PCT', 'EFG_PCT', 'TS_PCT', 'E_PACE', 'PACE','PACE_PER40', 'POSS', 'PIE','PIE_RANK','PLAYOFF_WINS']

team_colors ={
    "Atlanta Hawks": (225, 68, 52),
    "Boston Celtics": (0, 122, 51),
    "Brooklyn Nets": (0, 0, 0),
    "Charlotte Hornets": (0, 120, 140),
    "Charlotte Bobcats": (255, 165, 0), 
    "Chicago Bulls": (206, 17, 65),
    "Cleveland Cavaliers": (134, 0, 56), 
    "Dallas Mavericks": (0, 83, 188), 
    "Denver Nuggets": (13, 34, 64),
    "Detroit Pistons": (200, 16, 46), 
    "Golden State Warriors": (255, 199, 44),
    "Houston Rockets": (44,122,161),
    "Indiana Pacers": (255, 198, 39), 
    "LA Clippers": (200, 16, 46), 
    "Los Angeles Clippers": (200, 16, 46),
    "Los Angeles Lakers": (85, 37, 130),
    "Memphis Grizzlies": (93, 118, 169), 
    "Miami Heat": (152, 0, 46), 
    "Milwaukee Bucks": (0, 71, 27),
    "Minnesota Timberwolves": (35, 97, 146),
    "New Jersey Nets": (0, 42, 96),
    "New Orleans Pelicans": (0, 22, 65), 
    "New Orleans Hornets": (29,17,96),
    "New Orleans/Oklahoma City Hornets": (29,17,96),
    "New York Knicks": (0, 107, 182), 
    "Oklahoma City Thunder": (0, 125, 195),
    "Orlando Magic": (196, 206, 211),
    "Philadelphia 76ers": (0, 107, 182),
    "Phoenix Suns": (229, 95, 32),
    "Portland Trail Blazers": (224, 58, 62),
    "Sacramento Kings": (91, 43, 130),
    "San Antonio Spurs": (196, 206, 211),
    "Seattle SuperSonics": (0, 101, 58), 
    "Toronto Raptors": (206, 17, 65),
    "Utah Jazz": (0, 43, 92),
    "Washington Wizards": (227, 24, 55)
}
palette = {
    team: (r/255, g/255,b/255) for team, (r,g,b) in team_colors.items()
}


## Heat Map


The key insight from this heat map is which variables correlate most strongly with playoff wins. Wins, Losses, Win Percentage, Offensive Rating, Defensive Rating, Net Rating, Effective Field Goal Percentage, True Shooting Percentage, and PIE, as well as W_RANK and PIE_RANK, all correlate with playoff wins to at least a moderate degree, so I will be using these in particular to train my model.

In [None]:
fig, ax = plt.subplots(figsize=(15, 8))  # Size is in inches (width, height)

cor_mat = merged_df[cor_mat_features].corr()
sns.heatmap(cor_mat, annot=True,ax = ax)
plt.show()
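
A heat map with this many columns can be hard to read, so it can help to pull out only the column of correlations with the target and sort it. The sketch below shows the idea on made-up numbers for five teams; with the real data, the same steps would start from `merged_df[cor_mat_features].corr()`.

```python
import pandas as pd

# Made-up numbers for five teams, only to show the mechanics
toy = pd.DataFrame({
    "W": [60, 52, 45, 38, 25],
    "NET_RATING": [8.1, 4.0, 1.2, -1.5, -6.3],
    "PACE": [98.0, 101.5, 97.2, 100.1, 99.4],
    "PLAYOFF_WINS": [16, 9, 3, 0, 0],
})

# Pearson correlation of every column with the target, strongest first
target_corr = (
    toy.corr()["PLAYOFF_WINS"]
    .drop("PLAYOFF_WINS")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations (such as losses or rank columns in the real data) near the top alongside the strongly positive ones.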

## Visualizing the Most Important Features

Based on the correlation matrix, Regular Season Wins, Net Rating, and Player Impact Estimate have the greatest correlation with the number of playoff wins for a team. So, I decided to plot these stats and compare them to how many playoff wins each team had.

As you will see, each stat does follow the expected trend: as the stat increases, the number of playoff wins for a team tends to increase too.

In [None]:
# Create a scatter plot of Regular Season Wins vs. Playoff Wins
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_df, x='W', y='PLAYOFF_WINS', hue='TEAM_NAME',style = 'TEAM_NAME', palette=palette,)
plt.title('Regular Season Wins vs. Playoff Wins for NBA Teams')
plt.xlabel('Regular Season Wins (W)')
plt.ylabel('Playoff Wins (PLAYOFF_WINS)')
plt.legend(title='Team', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

In [None]:
# Create a scatter plot of PIE (Player Impact Estimate) vs. Playoff Wins
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_df, x='PIE', y='PLAYOFF_WINS', hue='TEAM_NAME',style = 'TEAM_NAME', palette=palette)
plt.title('Player Impact Estimate (PIE) vs. Playoff Wins for NBA Teams')
plt.xlabel('Player Impact Estimate (PIE)')
plt.ylabel('Playoff Wins (PLAYOFF_WINS)')
plt.legend(title='Team',bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

In [None]:
# Create a scatter plot of Net Rating vs. Playoff Wins
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_df, x='NET_RATING', y='PLAYOFF_WINS', hue='TEAM_NAME',style ='TEAM_NAME', palette=palette)
plt.title('Net Rating vs. Playoff Wins for NBA Teams')
plt.xlabel('Net Rating (NET_RATING)')
plt.ylabel('Playoff Wins (PLAYOFF_WINS)')
plt.legend(title='Team', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

In [None]:
# Teams with more than 15 playoff wins in a season (16 wins are needed to win the title)
finalists = merged_df.query('PLAYOFF_WINS > 15')
plt.figure(figsize=(10, 6))
sns.scatterplot(data=finalists, x='SEASON', y='E_NET_RATING', hue='TEAM_NAME', style='TEAM_NAME', palette=palette)
plt.xlabel('Season')
plt.ylabel('Estimated Net Rating (E_NET_RATING)')
plt.xticks(rotation=45)
plt.legend(title='Team', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=finalists, x='SEASON', y='PIE', hue='TEAM_NAME',style ='TEAM_NAME', palette=palette)
plt.title('PIE for Finals teams, NBA Season')
plt.xlabel('NBA SEASON')
plt.ylabel('Player Impact Estimate (PIE)')
plt.legend(title='Team',bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

## Building the ML Models

In [None]:
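# Build the modeling data from the seasons before 2011
X = merged_df.loc[merged_df['SEASON'] < 2011]
y = X[target].to_numpy()
X = X[features]

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold splitter used for cross-validation during hyperparameter tuning
split = KFold(n_splits=5)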

## Random Forest Regressor

First, I will tune the hyperparameters of my random forest regressor.

In [None]:
# Use K-fold cross-validation to compare hyperparameter combinations by their mean R2 score
for k in range(1, 10):                # max_features
    for i in range(50, 301, 10):      # n_estimators
        for j in range(10, 51, 10):   # max_depth
            scores = []
            for train_index, val_index in split.split(X_train):
                X_subtrain = X_train.iloc[train_index]
                X_val = X_train.iloc[val_index]
                y_subtrain = y_train[train_index]
                y_val = y_train[val_index]
                mod = RandomForestRegressor(n_estimators=i, max_depth=j, min_samples_split=5,
                                            max_features=k, random_state=42, criterion='squared_error')
                mod.fit(X_subtrain, y_subtrain)
                # score() returns the R2 of the model on the validation fold
                scores.append(mod.score(X_val, y_val))
            print(i, j, k, np.mean(scores))

The best model seems to be a random forest regressor with n_estimators = 300, max_depth = 10, min_samples_split = 2, and max_features = 2.
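
The same kind of search can also be written with scikit-learn's `GridSearchCV`, which runs the cross-validation and keeps track of the best combination. This is an alternative sketch, not what the notebook ran: the data is synthetic and the grid is deliberately tiny so that it finishes quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (not NBA stats): 80 rows, 4 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(80, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

# A small grid over the same three hyperparameters the loops vary
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, 6],
    "max_features": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(min_samples_split=5, random_state=42),
    param_grid,
    cv=KFold(n_splits=5),
    scoring="r2",
)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated R2
print(search.best_params_)
print(search.best_score_)
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the grid by the ranges used in the loops.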

In [62]:
rf = RandomForestRegressor(n_estimators=300,max_depth = 10, min_samples_split = 2,max_features = 2,random_state=42,criterion='squared_error')
rf.fit(X_train,y_train)
y_predict = rf.predict(X_test)
print(f"R2 score:{r2_score(y_test,y_predict)}\nMAE score:{mean_absolute_error(y_test,y_predict)}")

R2 score:0.596686796285686
MAE score:1.8029040790371138
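
To make these two numbers easier to interpret, here is a minimal sketch of how R2 and MAE are computed, using made-up win totals and predictions. The helper names `r2` and `mae` are hypothetical; the notebook itself uses `r2_score` and `mean_absolute_error` from scikit-learn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error: the average size of a prediction miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up playoff win totals and predictions for four teams
actual = [16, 4, 0, 0]
predicted = [12, 6, 1, 1]

print(r2(actual, predicted))
print(mae(actual, predicted))
```

Read this way, an MAE of about 1.8 means the model's predictions on the held-out test rows miss by a little under two playoff wins on average, and an R2 of about 0.6 means it explains roughly 60% of the variance in playoff wins on those rows.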


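Next, I standardize the features with StandardScaler and retrain the same random forest to check whether scaling changes the results.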

In [64]:
scaler = StandardScaler()

In [66]:
scaler.fit(X)

In [68]:
X_scaled = scaler.transform(X)

In [70]:
X_scaled_df = pd.DataFrame(data=X_scaled, columns=X.columns)

In [72]:
X_scaled_train,X_scaled_test,y_scaled_train,y_scaled_test = train_test_split(X_scaled_df,y,test_size = 0.2,random_state=42)

In [76]:
#Named models different names to avoid re-running the previous code cells
rf_scaled = RandomForestRegressor(n_estimators=300,max_depth = 10, min_samples_split = 2,max_features=2,criterion='squared_error',random_state=42)
rf_scaled.fit(X_scaled_train,y_scaled_train)
y_predict = rf_scaled.predict(X_scaled_test)
print(f"R2 score:{r2_score(y_scaled_test,y_predict)}\nMAE score:{mean_absolute_error(y_scaled_test,y_predict)}")

R2 score:0.5914962173283609
MAE score:1.8080204320403717
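
Standardization rescales each feature to zero mean and unit variance using statistics learned during `fit`. The sketch below shows the idea for a single feature in plain Python, with made-up win totals; the helper names are hypothetical. The point it illustrates is that the mean and standard deviation are learned once and then reused for any new rows.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]


# Made-up regular-season win totals
train_wins = [30, 40, 50, 60]
new_wins = [45, 65]

# Learn the statistics from the training values only
mean, std = fit_standardizer(train_wins)

train_scaled = standardize(train_wins, mean, std)
new_scaled = standardize(new_wins, mean, std)  # reuse the same mean and std
print(train_scaled)
print(new_scaled)
```

The near-identical scores for the scaled and unscaled forests are unsurprising: tree-based models split on thresholds, so they are largely insensitive to rescaling a feature.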


## Linear Regression Model

In [2]:
lin_mod = LinearRegression()


In [None]:
lin_mod.fit(X_train,y_train)

In [None]:
y_lin_reg = lin_mod.predict(X_test)

In [None]:
lin_mod.score(X_test,y_test)

In [None]:
mean_absolute_error(y_test,y_lin_reg)

In [None]:
lin_mod.fit(X_scaled_train,y_scaled_train)

In [None]:
lin_mod.score(X_scaled_test,y_scaled_test)

## Using Real World Data (Actual NBA Seasons)

In [None]:
# Predicting the number of playoff wins for the 2020 season
test_X = merged_df.loc[merged_df['SEASON'] == 2020]
test_y = test_X[target]
test_X = test_X[features]
# Scale with statistics from the training features (X) instead of refitting on the test season
test_X_scaled = StandardScaler().fit(X).transform(test_X)
test_X_df_scaled = pd.DataFrame(data=test_X_scaled, columns=test_X.columns)
test_X_df = pd.DataFrame(data=test_X, columns=test_X.columns)  # For curiosity
y_2020 = rf_scaled.predict(test_X_df_scaled)
print(f"R2 score:{r2_score(test_y,y_2020)}\nMAE score:{mean_absolute_error(test_y,y_2020)}")

In [None]:
idx_2020 = test_X.index
# Copy the 2020 rows so that adding a column does not modify a view of merged_df
df_2020 = merged_df.loc[idx_2020].copy()
df_2020['PREDICTED_PLAYOFF_WINS'] = np.round(y_2020)
df_2020[['TEAM_NAME','SEASON','W','L','PLAYOFF_WINS','PREDICTED_PLAYOFF_WINS']]

In [None]:
df_2020.set_index('TEAM_NAME', inplace=True)

In [None]:
fig,ax = plt.subplots(figsize=(10,8))
df_2020[['PREDICTED_PLAYOFF_WINS','PLAYOFF_WINS']].plot(kind='bar',ax=ax,title='2020 Season',color=['r','c'])

## Predicting 2023 Season sad times :(  

In [None]:
# Predicting the number of playoff wins for the 2023 season
test_X = merged_df.loc[merged_df['SEASON'] == 2023]
test_y = test_X[target]
test_X = test_X[features]
# Scale with statistics from the training features (X) instead of refitting on the test season
test_X_scaled = StandardScaler().fit(X).transform(test_X)
test_X_df_scaled = pd.DataFrame(data=test_X_scaled, columns=test_X.columns)
test_X_df = pd.DataFrame(data=test_X, columns=test_X.columns)
y_2023 = rf_scaled.predict(test_X_df_scaled)
print(f"R2 score:{r2_score(test_y,y_2023)}\nMAE score:{mean_absolute_error(test_y,y_2023)}")

In [None]:
idx_2023 = test_X.index
# Copy the 2023 rows so that adding a column does not modify a view of merged_df
df_2023 = merged_df.loc[idx_2023].copy()
df_2023['PREDICTED_PLAYOFF_WINS'] = np.round(y_2023)
df_2023.set_index('TEAM_NAME', inplace=True)
df_2023[['SEASON','W','L','W_PCT','PLAYOFF_WINS','PREDICTED_PLAYOFF_WINS']]

In [None]:
fig,ax = plt.subplots(figsize=(10,8))
df_2023[['PREDICTED_PLAYOFF_WINS','PLAYOFF_WINS']].plot(kind='bar',ax=ax,title='2023 Season',color=['r','c'])
plt.show()

In [None]:
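# Predicting the number of playoff wins for the 2024 season
test_X = merged_df.loc[merged_df['SEASON'] == 2024]
test_y = test_X[target]
test_X = test_X[features]
# Scale with statistics from the training features (X) instead of refitting on the test season
test_X_scaled = StandardScaler().fit(X).transform(test_X)
test_X_df_scaled = pd.DataFrame(data=test_X_scaled, columns=test_X.columns)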

In [None]:
# Predict from the scaled 2024 features (test_X_df still holds the unscaled 2023 data)
y_2024 = rf_scaled.predict(test_X_df_scaled)

In [None]:
idx_2024 = test_X.index
# Copy the 2024 rows so that adding a column does not modify a view of merged_df
df_2024 = merged_df.loc[idx_2024].copy()
df_2024['PREDICTED_PLAYOFF_WINS'] = np.round(y_2024)
df_2024[['TEAM_NAME','SEASON','W','L','PLAYOFF_WINS','PREDICTED_PLAYOFF_WINS']]