# NBA Draft Analysis - Which NBA team selects the best players relative to the average player at their respective draft position?

## index
1. [introduction](#introduction)
2. [data](#data)
3. [results](#results)
4. [conclusions](#conclusions)

## introduction
The NBA Draft is an annual event in the National Basketball Association (NBA) where teams select eligible players to join their rosters. It typically consists of two rounds, with teams choosing players based on a predetermined order, primarily determined by the previous season's standings. The draft serves as a way for teams to acquire new talent, including young prospects from college or international leagues, and provides an opportunity for emerging basketball stars to realize their dreams of playing in the NBA.

For example, in 2003, the famous basketball player LeBron James was the first overall pick by the Cleveland Cavaliers. His selection was the culmination of immense hype and anticipation, as he was considered a generational talent straight out of high school. The NBA Draft is a pivotal moment for the league's future, as it shapes the composition of teams and can have a significant impact on the sport's landscape.

A team’s choices in the draft can make or break its season, so it is crucial to have effective tools for selecting the best players from the available pool. This analysis asks which team selects the best players relative to the average player at their respective draft position. To answer that, we use historical draft data for the NBA seasons from 1996 to 2023. The project analyzes and visualizes NBA player and team statistics and offers interactive tools to explore and compare them, which helps in assessing draft choices, individual player achievements, and team success across seasons.

## data
The raw data is retrieved with the [NBA API](https://github.com/swar/nba_api/blob/master/docs/nba_api/stats/examples.md) and comprises the statistics of all official NBA players from 1996 to 2023. The statistics include the player ID, name, number of years played, points scored, and many more. The raw data contains 265 variables, of which only 22 are used in this analysis. For brevity, the data uses many abbreviations; a [reference table](abbreviation_reference_table.md) explains them.

The raw data:

As mentioned before, only 22 variables are needed.

In [115]:
# Import libraries
import pandas as pd
import numpy as np
import ipywidgets as widgets
import plotly.express as px
from IPython.display import display, HTML, clear_output
from tabulate import tabulate

In [101]:
# load in the data
df = pd.read_csv('./interim/player_career_avg.csv', index_col=0)

In [102]:
# copy the dataframe
df_seasons_filtered = df.copy()

# seasons 1996 through 2022 (the end of range() is exclusive)
seasons = range(1996, 2023)

# create a slider widget to select the seasons
season_slider = widgets.SelectionRangeSlider(
    options=seasons,
    index=(0, len(seasons)-1),
    description='Select Seasons to Analyze',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True
)

# get the subset of the dataframe that matches the selected seasons
def get_seasons(season_slider):
    # declare global variable
    global df_seasons_filtered
    # get values from slider
    start_season = int(season_slider['new'][0])
    end_season = int(season_slider['new'][1])
    df_seasons_filtered = df[(df['Season'] >= start_season) & (df['Season'] <= end_season)]

# observe the slider widget to get the selected seasons
season_slider.observe(get_seasons, names='value')

# display the slider widget
display(season_slider)

SelectionRangeSlider(continuous_update=False, description='Select Seasons to Analyze', index=(0, 26), options=…

**Select a penalty for unbounded stats**

Drafted players who never played a game in the NBA are perceived as bad draft picks, but they never accumulated any stats. Excluding them from the analysis would worsen the relative performance of the other players selected at that draft position, so their empty stats are replaced by a "bad stat" instead: never playing a game in the NBA objectively makes that player a bad selection. For total points, assists, and rebounds, 0 is the absolute minimum and therefore a suitable replacement. All other stats are unbounded and have no definite minimum value. The minimum value in our data would be heavily affected by outliers, which is why we opted for a percentile that can be chosen dynamically. For DEF_RATING, where a lower value is better, the penalty is taken from the opposite tail (1 minus the chosen percentile). We recommend using a percentile between 1% and 5%.
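
To make the replacement rule concrete, here is a minimal sketch on a made-up five-player table. The player names, the values, and the large percentile of 0.25 are assumptions chosen only so the effect is visible on so few rows; they are not taken from the actual data.

```python
import pandas as pd

# Toy career table (hypothetical values): player "E" never appeared
# in an NBA game, so all of his stats are missing.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "PTS": [9000.0, 4000.0, 1200.0, 300.0, None],
    "BPM": [3.0, 0.5, -1.5, -6.0, None],
})

# Deliberately large so the quantile is easy to follow on five rows
penalty_percentile = 0.25

fill_values = {
    "PTS": 0,                                        # bounded stat: 0 is the floor
    "BPM": toy["BPM"].quantile(penalty_percentile),  # unbounded stat: low quantile
}
toy_filled = toy.fillna(value=fill_values)
print(toy_filled)
```

The penalty for BPM lands between the two worst observed values rather than on the single worst one, which is the reason for preferring a percentile over the raw minimum.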

In [103]:
# set standard penalty percentile
penalty_percentile = 0.02

# create a float slider widget to select the penalty percentile
penalty_percentile_slider = widgets.FloatSlider(
    value=0.02,
    min=0,
    max=1.0,
    step=0.01,
    description='Penalty Percentile:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.2f',
)

# set the penalty percentile
def set_penalty_percentile(penalty_percentile_slider):
    # declare global variable
    global penalty_percentile
    # get value from slider
    penalty_percentile = penalty_percentile_slider['new']

# observe the slider widget to get the penalty percentile
penalty_percentile_slider.observe(set_penalty_percentile, names='value')

# display the slider widget
display(penalty_percentile_slider)

FloatSlider(value=0.02, continuous_update=False, description='Penalty Percentile:', max=1.0, step=0.01)

**Re-run all cells from here after changing the penalty percentile and/or the season range**

In [142]:
# get min and max seasons from df_seasons_filtered
start_season = df_seasons_filtered['Season'].min()
end_season = df_seasons_filtered['Season'].max()

print("Your selected penalty percentile for unbounded stats is", penalty_percentile)
print("Your selected seasons are", start_season , "to", end_season)

relevant_stats = ['PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP', 'PIE', 'OFF_RATING', 'DEF_RATING', 'NET_RATING']

# counting stats are bounded below by 0; every other stat is filled with the penalty percentile
bounded_stats = ['PTS', 'TRB', 'AST']
na_fill_values = {}
for stat in relevant_stats:
    if stat in bounded_stats:
        na_fill_values[stat] = 0
    elif stat == 'DEF_RATING':
        # lower DEF_RATING is better, so the penalty is taken from the upper tail
        na_fill_values[stat] = df_seasons_filtered[stat].quantile(1 - penalty_percentile)
    else:
        na_fill_values[stat] = df_seasons_filtered[stat].quantile(penalty_percentile)

# fill the NaNs with the respective entry in the na_fill_values dict for each column
df_career_na_filled = df_seasons_filtered.fillna(value=na_fill_values)

Your selected penalty percentile for unbounded stats is 0.03
Your selected seasons are 2005 to 2022


In [143]:
# TODO: Maybe add toggle button that allows comparing only to players of same position

# group the df_career_na_filled dataframe by 'Pk' and calculate the average for each relevant stat
df_avg = df_career_na_filled.groupby('Pk')[relevant_stats].mean(numeric_only=True)
df_avg = df_avg.reset_index()

# add 1 to every index
df_avg.index += 1

df_avg

| | Pk | PTS | TRB | AST | WS | WS/48 | BPM | VORP | PIE | OFF_RATING | DEF_RATING | NET_RATING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 7401.111111 | 2670.166667 | 1616.277778 | 34.3 | 0.111 | 1.144444 | 13.461111 | 0.12134 | 109.114443 | 109.075664 | 0.033601 |
| 2 | 2 | 6970.722222 | 2501.777778 | 1275.277778 | 30.222222 | 0.078 | -0.861833 | 9.516667 | 0.10046 | 105.213849 | 109.224775 | -3.623824 |
| 3 | 3 | 8355.388889 | 2976.888889 | 1855.555556 | 43.094444 | 0.1165 | 1.277778 | 16.183333 | 0.120169 | 108.540111 | 108.361783 | 0.175353 |
| 4 | 4 | 7035.888889 | 2590.166667 | 2083.777778 | 36.872222 | 0.091722 | -0.155556 | 13.05 | 0.098099 | 108.097615 | 108.584165 | -0.486599 |
| 5 | 5 | 6074.166667 | 2606.0 | 1576.055556 | 26.266667 | 0.077889 | -0.938889 | 6.611111 | 0.099286 | 106.368897 | 109.031205 | -2.660236 |
| 6 | 6 | 4357.0 | 1565.555556 | 928.333333 | 21.711111 | 0.090556 | -0.488889 | 6.722222 | 0.093982 | 105.673507 | 108.231661 | -2.552918 |
| 7 | 7 | 6855.611111 | 2414.777778 | 1332.388889 | 27.766667 | 0.080056 | -0.833333 | 7.183333 | 0.095783 | 107.212172 | 109.668452 | -2.45719 |
| 8 | 8 | 4463.166667 | 1758.388889 | 703.388889 | 16.894444 | 0.076056 | -1.856278 | 2.833333 | 0.084813 | 105.581364 | 107.834897 | -2.249056 |
| 9 | 9 | 5584.5 | 2298.722222 | 1311.0 | 27.222222 | 0.081944 | -1.166667 | 7.216667 | 0.095053 | 105.951342 | 108.637741 | -2.675197 |
| 10 | 10 | 5041.0 | 1814.888889 | 1064.611111 | 21.777778 | 0.076111 | -1.361111 | 6.261111 | 0.086465 | 106.018941 | 107.771336 | -1.74884 |


In [144]:
# for each relevant stat, create a new column in the df_career_na_filled dataframe with the difference
# between the player's stat and the average for that stat at their draft position
# (the averages are looked up by pick number, so a missing pick number cannot shift the lookup)
for stat in relevant_stats:
    df_career_na_filled[stat + '_diff'] = df_career_na_filled[stat] - df_career_na_filled['Pk'].map(df_avg.set_index('Pk')[stat])
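
The difference to the draft-position average can be illustrated on a small made-up table. This is only a sketch under assumed values: the players and numbers are hypothetical, and pick 2 is left out on purpose to show that keying the averages by pick number keeps the lookup correct when a pick is absent.

```python
import pandas as pd

# Hypothetical players; pick 2 is deliberately absent
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pk": [1, 1, 3, 3],
    "WS": [40.0, 20.0, 10.0, 2.0],
})

# Average win shares per draft position, indexed by the pick number itself
pick_avg = toy.groupby("Pk")["WS"].mean()

# Look up each player's pick average and subtract it from his own value
toy["WS_diff"] = toy["WS"] - toy["Pk"].map(pick_avg)
print(toy)
```

A positive difference means the player outperformed the typical player taken at the same pick, and a negative one means he fell short of it.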

**Calculate ranks for all players above a minimum amount of games played**

This section ranks each player's performance above or below the average player at that draft position for each stat. Some of the stats that are normalized per 48 minutes can be distorted by players who played very little but performed well in that limited time (e.g., a player who only ever played two minutes at the end of a blowout but scored four points). To keep such players from being ranked very highly, a minimum number of games can be set here; all players who do not meet this requirement are excluded from the ranking.
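
The effect of the games threshold can be sketched on a toy table. All values below are hypothetical; player "C" plays the role of the two-minute outlier described above.

```python
import pandas as pd

# Hypothetical differences to the draft-position average
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "G": [500, 300, 2, 120],
    "WS/48_diff": [0.05, -0.01, 0.30, 0.02],
    "DEF_RATING_diff": [-2.0, 1.0, -9.0, 0.5],
})

min_games = 82
eligible = toy["G"] >= min_games

# Higher WS/48 is better, so the largest difference gets rank 1
toy["WS/48_rank"] = toy.loc[eligible, "WS/48_diff"].rank(ascending=False, method="min")
# Lower DEF_RATING is better, so the smallest difference gets rank 1
toy["DEF_RATING_rank"] = toy.loc[eligible, "DEF_RATING_diff"].rank(ascending=True, method="min")
print(toy)
```

Because the ranks are assigned back by index, players below the threshold simply end up with a missing rank instead of displacing everyone else.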

In [145]:
# set standard minimum number of games played
min_games = 82

# create an int slider widget to select the minimum number of games played
min_games_slider = widgets.IntSlider(
    value=82,
    min=0,
    max=500,
    step=1,
    description='Minimum Games Played:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

# set the minimum number of games played
def set_min_games(min_games_slider):
    # declare global variable
    global min_games
    # get value from slider
    min_games = min_games_slider['new']

# observe the slider widget to get the minimum number of games played
min_games_slider.observe(set_min_games, names='value')

# display the slider widget
display(min_games_slider)

IntSlider(value=82, continuous_update=False, description='Minimum Games Played:', max=500)

In [146]:
print("You chose", min_games, "as the minimum number of games played.")

# create a new column for each relevant stat with the rank of the player's difference to their draft position average, only using players with at least min_games games
for stat in relevant_stats:
    if stat == 'DEF_RATING': # lower DEF_RATING is better
        df_career_na_filled[stat + '_rank'] = df_career_na_filled[df_career_na_filled['G'] >= min_games][stat + '_diff'].rank(ascending=True, method='min')
    else:
        df_career_na_filled[stat + '_rank'] = df_career_na_filled[df_career_na_filled['G'] >= min_games][stat + '_diff'].rank(ascending=False, method='min')

You chose 82 as the minimum number of games played.


**Show Performance of Selected Player**

In [147]:
# create a combobox widget with all 'Player' values
player_widget = widgets.Combobox(
    placeholder='Choose a Player',
    options=df_career_na_filled['Player'].unique().tolist(),
    description='Player:',
    ensure_option=True,
    disabled=False
)

# Define the output area to display additional information
output_player_ranks = widgets.Output()

# Function to update the output area based on the selected player
def on_value_change_player_ranks(change):
    output_player_ranks.clear_output()
    selected_player = change['new']
    with output_player_ranks:
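        # get player data
        player_data = df_career_na_filled[df_career_na_filled['Player'] == selected_player]
        # nothing to show if the combobox was cleared or the name is not in the data
        if player_data.empty:
            return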

        # display player information
        display(HTML(f"<h3>Player: {player_data['Player'].values[0]}</h3>"))
        display(HTML(f"<p>Year: {player_data['Season'].values[0]} Pick: {player_data['Pk'].values[0]} - Drafted by: {player_data['Tm'].values[0]}</p>"))
        display(HTML(f"<p>Played {int(player_data['G'].values[0])} games in {int(player_data['Yrs'].values[0])} years</p>"))
        
        # create a table with the player's ranks, total value and diff for each relevant stat
        stats_table = [['Stat', 'Rank', 'Raw Stat', 'Difference to Average for Draft Position']]
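        for stat in relevant_stats:
            # players below the minimum games threshold have no rank (NaN), so int() would fail on them
            rank = player_data[stat + '_rank'].values[0]
            rank = int(rank) if pd.notna(rank) else 'n/a'
            stats_table.append([stat, rank, round(player_data[stat].values[0], 2), round(player_data[stat + '_diff'].values[0], 2)])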

        # transpose the table
        stats_table = list(map(list, zip(*stats_table)))
            
        display(HTML(tabulate(stats_table, tablefmt="html")))

# Observe changes in the value of the combobox and call the function
player_widget.observe(on_value_change_player_ranks, names='value')

# Display the widgets
display(player_widget)
display(output_player_ranks)

Combobox(value='', description='Player:', ensure_option=True, options=('Andrew Bogut', 'Marvin Williams', 'Der…

Output()

In [148]:
# Filter players based on the minimum games played
filtered_df = df_career_na_filled[df_career_na_filled['G'] >= min_games]

# create a combobox widget with all 'Player' values
player_widget_diff_scatter = widgets.Combobox(
    placeholder='Choose a Player',
    options=filtered_df['Player'].unique().tolist(),
    description='Player:',
    ensure_option=True,
    disabled=False
)

# create a dropdown widget with all relevant stats
stat_dropdown_diff_scatter = widgets.Dropdown(
    options=relevant_stats,
    description='Select Stat:',
    disabled=False,
)

# Define the output area to display the scatter plot
output_diff_scatter = widgets.Output()

# Function to update the output area based on the selected stat
def on_value_change_diff_scatter(change):
    clear_output()
    output_diff_scatter.clear_output()
    selected_stat = stat_dropdown_diff_scatter.value
    selected_player = player_widget_diff_scatter.value

    display(player_widget_diff_scatter)
    display(stat_dropdown_diff_scatter)
    with output_diff_scatter:
        fig = px.scatter(filtered_df, x='Pk', y=f'{selected_stat}_diff', hover_name='Player',
                         hover_data={'Pk': True, f'{selected_stat}_diff': True, 'Player': False, selected_stat: True, f'{selected_stat}_rank': True})
        if selected_player in filtered_df['Player'].values:
            highlighted_player = filtered_df[filtered_df['Player'] == selected_player]
            fig.add_trace(px.scatter(highlighted_player, x='Pk', y=f'{selected_stat}_diff', hover_name='Player',
                                     hover_data={'Pk': True, f'{selected_stat}_diff': True, 'Player': False, selected_stat: True, f'{selected_stat}_rank': True},
                                     color_discrete_sequence=['red']).data[0])
        fig.update_traces(marker=dict(size=12), showlegend=False)
        fig.update_layout(title=f'{selected_stat} Difference vs Draft Pick for Players with at least {min_games} Games Played',
                          xaxis_title='Draft Pick', yaxis_title=f'{selected_stat} Difference')
        fig.show()
    
# Observe changes in the value of the dropdown and call the function
stat_dropdown_diff_scatter.observe(on_value_change_diff_scatter, names='value')
player_widget_diff_scatter.observe(on_value_change_diff_scatter, names='value')

# Display the dropdown and the output area
display(player_widget_diff_scatter)
display(stat_dropdown_diff_scatter)
display(output_diff_scatter)

Combobox(value='', description='Player:', ensure_option=True, options=('Andrew Bogut', 'Marvin Williams', 'Der…

Dropdown(description='Select Stat:', index=2, options=('PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP', 'PIE…

In [149]:
# create a combobox widget with all 'Player' values
player_widget_total_scatter = widgets.Combobox(
    placeholder='Choose a Player',
    options=filtered_df['Player'].unique().tolist(),
    description='Player:',
    ensure_option=True,
    disabled=False
)

# create a dropdown widget with all relevant stats
stat_dropdown_total_scatter = widgets.Dropdown(
    options=relevant_stats,
    description='Select Stat:',
    disabled=False,
)

# Define the output area to display the scatter plot
output_total_scatter = widgets.Output()

# Function to update the output area based on the selected stat
def on_value_change_total_scatter(change):
    clear_output()
    output_total_scatter.clear_output()
    selected_stat = stat_dropdown_total_scatter.value
    selected_player = player_widget_total_scatter.value

    display(player_widget_total_scatter)
    display(stat_dropdown_total_scatter)
    with output_total_scatter:
        fig = px.scatter(filtered_df, x='Pk', y=selected_stat, hover_name='Player',
                         hover_data={'Pk': True, selected_stat: True, 'Player': False, f'{selected_stat}_diff': True, f'{selected_stat}_rank': True})
        
        # add a yellow dot for each average in the df_avg dataframe
        fig.add_trace(px.scatter(df_avg, x='Pk', y=selected_stat, hover_name='Pk',
                                 hover_data={'Pk': True, selected_stat: True}, 
                                 color_discrete_sequence=['yellow']).data[0])
        
        if selected_player in filtered_df['Player'].values:
            highlighted_player = filtered_df[filtered_df['Player'] == selected_player]
            fig.add_trace(px.scatter(highlighted_player, x='Pk', y=selected_stat, hover_name='Player',
                                    hover_data={'Pk': True, selected_stat: True, 'Player': False, f'{selected_stat}_diff': True, f'{selected_stat}_rank': True},
                                    color_discrete_sequence=['red']).data[0])
        fig.update_traces(marker=dict(size=12), showlegend=False)
        fig.update_layout(title=f'{selected_stat} vs Draft Pick for Players with at least {min_games} Games Played',
                          xaxis_title='Draft Pick', yaxis_title=selected_stat)
        fig.show()
    
# Observe changes in the value of the dropdown and call the function
stat_dropdown_total_scatter.observe(on_value_change_total_scatter, names='value')
player_widget_total_scatter.observe(on_value_change_total_scatter, names='value')

# Display the dropdown and the output area
display(player_widget_total_scatter)
display(stat_dropdown_total_scatter)
display(output_total_scatter)

Combobox(value='', description='Player:', ensure_option=True, options=('Andrew Bogut', 'Marvin Williams', 'Der…

Dropdown(description='Select Stat:', index=4, options=('PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP', 'PIE…

## Analysis by Team over Seasons

In [152]:
# function to replace outdated abbreviations with the current ones
def clean_team_names(df):
    team_dict = {'CHH': 'CHO', 'CHA': 'CHO', 'NJN': 'BRK', 'NOH': 'NOP', 'NOK': 'NOP', 'SEA': 'OKC', 'VAN': 'MEM', 'WSB': 'WAS'}
    df['Tm'] = df['Tm'].replace(team_dict)
    return df
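
As a side note, the same mapping can be applied without modifying the input dataframe. The sketch below is one possible variant, not the notebook's own function: `with_current_team_names` and `TEAM_RENAMES` are hypothetical names, and only a subset of the abbreviations is shown.

```python
import pandas as pd

# Subset of the abbreviation mapping used in this notebook
TEAM_RENAMES = {"NJN": "BRK", "SEA": "OKC", "VAN": "MEM"}

def with_current_team_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame whose 'Tm' column uses current abbreviations."""
    return frame.assign(Tm=frame["Tm"].replace(TEAM_RENAMES))

# Hypothetical draft rows
drafts = pd.DataFrame({"Player": ["A", "B", "C"], "Tm": ["SEA", "BOS", "NJN"]})
cleaned = with_current_team_names(drafts)
print(cleaned)
```

Returning a new dataframe keeps the original abbreviations available, which can be useful when the historical team name itself is of interest.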

In [155]:
# TODO: give option to weight by minutes played

# group the df_career_na_filled by team and season and calculate the average
df_team_avg_by_season = clean_team_names(df_career_na_filled).groupby(['Tm', 'Season']).mean(numeric_only=True)
df_team_avg_by_season = df_team_avg_by_season.reset_index()
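
The TODO about weighting by minutes played could be approached roughly as follows. This is a sketch under assumptions: the minutes column name `MP`, the helper `weighted_mean`, and all values are hypothetical and not taken from the actual data.

```python
import pandas as pd

# Hypothetical picks; "MP" is an assumed column holding career minutes played
toy = pd.DataFrame({
    "Tm": ["BOS", "BOS", "LAL"],
    "Season": [2010, 2010, 2010],
    "MP": [3000.0, 1000.0, 500.0],
    "WS_diff": [4.0, -4.0, 1.5],
})

def weighted_mean(group: pd.DataFrame, value_col: str, weight_col: str) -> float:
    """Average of value_col weighted by weight_col; NaN if all weights are zero."""
    weights = group[weight_col]
    total = weights.sum()
    if total == 0:
        return float("nan")
    return float((group[value_col] * weights).sum() / total)

# One weighted average per team and draft season
rows = []
for (team, season), group in toy.groupby(["Tm", "Season"]):
    rows.append({
        "Tm": team,
        "Season": season,
        "WS_diff_weighted": weighted_mean(group, "WS_diff", "MP"),
    })
weighted = pd.DataFrame(rows)
print(weighted)
```

With weighting, a pick who played many minutes influences the team's average more than one who barely played, whereas the plain mean used above treats both picks equally.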

In [156]:
# create a scatter plot with the average of each team for each season. The seasons should be the x axis and the selected relevant stat differences should be the y axis
def create_team_scatter(selected_stat, selected_team, show_players):
    fig = px.scatter(df_team_avg_by_season, x='Season', y=f'{selected_stat}_diff', hover_name='Tm',
                     hover_data={'Tm': True, selected_stat: True, f'{selected_stat}_diff': True})
    
    # add a red dot for all stats of the selected team
    fig.add_trace(px.scatter(df_team_avg_by_season[df_team_avg_by_season['Tm'] == selected_team], x='Season', y=f'{selected_stat}_diff', hover_name='Tm',
                                hover_data={'Tm': True, selected_stat: True, f'{selected_stat}_diff': True},
                                color_discrete_sequence=['red']).data[0])
    
    if show_players:
        # add a yellow dot for each player of the selected team
        fig.add_trace(px.scatter(df_career_na_filled[df_career_na_filled['Tm'] == selected_team], x='Season', y=f'{selected_stat}_diff', hover_name='Player',
                                    hover_data={'Player': True, selected_stat: True, f'{selected_stat}_diff': True},
                                    color_discrete_sequence=['yellow']).data[0])

    fig.update_traces(marker=dict(size=12), showlegend=False)
    fig.update_layout(title=f'{selected_stat} Difference vs Season for Teams',
                      xaxis_title='Season', yaxis_title=f'{selected_stat} Difference')
    fig.show()

# create a dropdown widget with all relevant stats
stat_dropdown_team_scatter = widgets.Dropdown(
    options=relevant_stats,
    description='Select Stat:',
    disabled=False,
)

# create a dropdown widget with all teams
team_dropdown_team_scatter = widgets.Dropdown(
    options=df_team_avg_by_season['Tm'].unique().tolist(),
    description='Select Team:',
    disabled=False,
)

# create a checkbox widget to toggle between showing players or not
player_checkbox_team_scatter = widgets.Checkbox(
    value=True,
    description='Show Players selected by Team',
    disabled=False,
    indent=False
)

# Define the output area to display the scatter plot
output_team_scatter = widgets.Output()

# Observe changes in the value of the dropdown and call the function
def on_value_change_team_scatter(change):
    clear_output()
    output_team_scatter.clear_output()
    selected_stat = stat_dropdown_team_scatter.value
    selected_team = team_dropdown_team_scatter.value
    show_players = player_checkbox_team_scatter.value

    display(stat_dropdown_team_scatter)
    display(team_dropdown_team_scatter)
    display(player_checkbox_team_scatter)
    with output_team_scatter:
        create_team_scatter(selected_stat, selected_team, show_players)

stat_dropdown_team_scatter.observe(on_value_change_team_scatter, names='value')
team_dropdown_team_scatter.observe(on_value_change_team_scatter, names='value')
player_checkbox_team_scatter.observe(on_value_change_team_scatter, names='value')

# Display the dropdown and the output area
display(stat_dropdown_team_scatter)
display(team_dropdown_team_scatter)
display(player_checkbox_team_scatter)
display(output_team_scatter)

Dropdown(description='Select Stat:', options=('PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP', 'PIE', 'OFF_R…

Dropdown(description='Select Team:', options=('ATL', 'BOS', 'BRK', 'CHI', 'CHO', 'CLE', 'DAL', 'DEN', 'DET', '…

Checkbox(value=True, description='Show Players selected by Team', indent=False)

Output()

## Analysis by Team

Be aware that the minimum games restriction does not apply to this section.

In [157]:
# group the df_career_na_filled by team and season and calculate the average
df_team_avg = clean_team_names(df_career_na_filled).groupby(['Tm', 'Season']).mean(numeric_only=True)
df_team_avg = df_team_avg.reset_index()
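
To compare teams across the whole selected period rather than season by season, the differences could also be averaged per team. The following is a minimal sketch with made-up numbers; the column names `avg_diff` and `picks` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical per-player differences to the draft-position average
toy = pd.DataFrame({
    "Tm": ["SAS", "SAS", "NYK", "NYK", "DEN"],
    "Season": [2011, 2014, 2011, 2015, 2014],
    "VORP_diff": [5.0, 1.0, -2.0, -4.0, 6.0],
})

# Average difference and number of picks per team, best teams first
team_summary = (
    toy.groupby("Tm")["VORP_diff"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "avg_diff", "count": "picks"})
    .sort_values("avg_diff", ascending=False)
)
print(team_summary)
```

Keeping the number of picks next to the average helps to spot teams whose position in the ordering rests on very few selections.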

In [158]:
# create a scatter plot with the average of each team for each season. The seasons should be the x axis and the selected relevant stat differences should be the y axis
def create_team_scatter(selected_stat, selected_team):
    fig = px.scatter(df_team_avg, x='Season', y=f'{selected_stat}_diff', hover_name='Tm',
                     hover_data={'Tm': True, selected_stat: True, f'{selected_stat}_diff': True})
    
    # add a red dot for all stats of the selected team
    fig.add_trace(px.scatter(df_team_avg[df_team_avg['Tm'] == selected_team], x='Season', y=f'{selected_stat}_diff', hover_name='Tm',
                                hover_data={'Tm': True, selected_stat: True, f'{selected_stat}_diff': True},
                                color_discrete_sequence=['red']).data[0])
    fig.update_traces(marker=dict(size=12), showlegend=False)
    fig.update_layout(title=f'{selected_stat} Difference vs Season for Teams',
                      xaxis_title='Season', yaxis_title=f'{selected_stat} Difference')
    fig.show()

# create a dropdown widget with all relevant stats
stat_dropdown_team_scatter = widgets.Dropdown(
    options=relevant_stats,
    description='Select Stat:',
    disabled=False,
)

# create a dropdown widget with all teams
team_dropdown_team_scatter = widgets.Dropdown(
    options=df_team_avg['Tm'].unique().tolist(),
    description='Select Team:',
    disabled=False,
)

# Define the output area to display the scatter plot
output_team_scatter = widgets.Output()

# Observe changes in the value of the dropdown and call the function
def on_value_change_team_scatter(change):
    clear_output()
    output_team_scatter.clear_output()
    selected_stat = stat_dropdown_team_scatter.value
    selected_team = team_dropdown_team_scatter.value

    display(stat_dropdown_team_scatter)
    display(team_dropdown_team_scatter)
    with output_team_scatter:
        create_team_scatter(selected_stat, selected_team)

stat_dropdown_team_scatter.observe(on_value_change_team_scatter, names='value')
team_dropdown_team_scatter.observe(on_value_change_team_scatter, names='value')

# Display the dropdown and the output area
display(stat_dropdown_team_scatter)
display(team_dropdown_team_scatter)
display(output_team_scatter)



Dropdown(description='Select Stat:', options=('PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP', 'PIE', 'OFF_R…

Dropdown(description='Select Team:', index=1, options=('ATL', 'BOS', 'BRK', 'CHI', 'CHO', 'CLE', 'DAL', 'DEN',…

## conclusions