# Compiling draft data

It would be useful to have a dataframe with the columns `PLAYER_ID, DRAFT_POSITION` that records where each player was drafted, with a placeholder such as NaN (which we can handle later) for players who were never drafted.

There is a Kaggle dataset (https://www.kaggle.com/datasets/mattop/nba-draft-basketball-player-data-19892021) that contains draft data from Basketball Reference through the 2021 draft.

In this notebook we'll load that data and combine it with the 2022 and 2023 draft data from Basketball Reference (which I exported as CSV on the website and copy-pasted into CSV files). Because the 2024 season is ongoing, we won't be able to use it for this project.

In [1]:
import numpy as np
import pandas as pd

In [2]:
kaggle_df = pd.read_csv("kaggle_draft_to_2021.csv")
df_2022 = pd.read_csv("draft_2022.csv")
df_2023 = pd.read_csv("draft_2023.csv")

In [3]:
print(kaggle_df.columns)
print(df_2022.columns)

Index(['id', 'year', 'rank', 'overall_pick', 'team', 'player', 'college',
       'years_active', 'games', 'minutes_played', 'points', 'total_rebounds',
       'assists', 'field_goal_percentage', '3_point_percentage',
       'free_throw_percentage', 'average_minutes_played', 'points_per_game',
       'average_total_rebounds', 'average_assists', 'win_shares',
       'win_shares_per_48_minutes', 'box_plus_minus',
       'value_over_replacement'],
      dtype='object')
Index(['Rk', 'Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB',
       'AST', 'FG%', '3P%', 'FT%', 'MP.1', 'PTS.1', 'TRB.1', 'AST.1', 'WS',
       'WS/48', 'BPM', 'VORP'],
      dtype='object')


In [4]:
# keep just "player" and "overall_pick" from the Kaggle data
k_df = kaggle_df[["player", "overall_pick"]].copy()

In [5]:
df_22 = df_2022[["Player", "Pk"]].copy()
df_22.rename(columns={"Player":"player", "Pk":"overall_pick"}, inplace=True)

In [6]:
df_23 = df_2023[["Player", "Pk"]].copy()
df_23.rename(columns={"Player":"player", "Pk":"overall_pick"}, inplace=True)
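
The 2022 and 2023 cells above apply the same select-and-rename step, so it could be factored into one small function. The following is a minimal sketch; `standardize_bref` is a hypothetical helper that is not part of the notebook, and the toy frame uses made-up rows in place of a real Basketball Reference export.

```python
import pandas as pd


def standardize_bref(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the player and pick columns and rename
    them to the column names used by the Kaggle data."""
    return (
        df[["Player", "Pk"]]
        .rename(columns={"Player": "player", "Pk": "overall_pick"})
        .copy()
    )


# Toy stand-in for a Basketball Reference export (made-up rows).
toy_bref = pd.DataFrame(
    {
        "Rk": [1, 2],
        "Pk": [1, 2],
        "Tm": ["AAA", "BBB"],
        "Player": ["Player A", "Player B"],
    }
)

toy_std = standardize_bref(toy_bref)
print(toy_std.columns.tolist())  # ['player', 'overall_pick']
```

With such a helper, `df_22` and `df_23` would each be a single call on `df_2022` and `df_2023`.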

In [7]:
total_df = pd.concat([k_df, df_22, df_23])
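
Note that `pd.concat` keeps each input frame's original row labels, which is why the combined frame's index later reads "2038 entries, 0 to 57" and contains repeated labels. If unique row labels were wanted, `ignore_index=True` would renumber them. A small sketch on made-up frames:

```python
import pandas as pd

# Two toy frames with made-up rows, each with its own 0-based index.
a = pd.DataFrame({"player": ["Player A", "Player B"], "overall_pick": [1, 2]})
b = pd.DataFrame({"player": ["Player C"], "overall_pick": [1]})

# Default behaviour: the original labels are kept, so label 0 appears twice.
stacked = pd.concat([a, b])
print(stacked.index.tolist())  # [0, 1, 0]

# With ignore_index=True the result gets a fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

The duplicated labels are harmless here because nothing in the notebook selects rows by label, so this is optional tidying rather than a required fix.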

In [8]:
# drop rows with missing values (a NaN player name appears in total_df when a team forfeited a draft pick)
total_df.dropna(inplace=True)

In [9]:
# cast overall_pick from float to pandas' nullable integer dtype
total_df["overall_pick"] = total_df.overall_pick.astype("Int64")
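
For context on the `"Int64"` cast: a column that has ever held NaN is stored as float by default, and the capitalized `"Int64"` dtype is pandas' nullable integer type, which can hold whole numbers alongside missing values. The sketch below uses a made-up series to show the conversion.

```python
import numpy as np
import pandas as pd

# A made-up pick column; the NaN forces the default dtype to float64.
picks = pd.Series([1.0, 2.0, np.nan])
print(picks.dtype)  # float64

# The nullable integer dtype keeps the missing entry as <NA>
# while the other values become integers.
nullable = picks.astype("Int64")
print(nullable.dtype)  # Int64
print(nullable.isna().sum())  # 1
```

The same dtype is used for `PLAYER_ID` later in the notebook, where missing values are expected.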

In [10]:
total_df

| | player | overall_pick |
|---|---|---|
| 0 | Pervis Ellison | 1 |
| 1 | Danny Ferry | 2 |
| 2 | Sean Elliott | 3 |
| 3 | Glen Rice | 4 |
| 4 | J.R. Reid | 5 |
| ... | ... | ... |
| 53 | Jalen Slawson | 54 |
| 54 | Isaiah Wong | 55 |
| 55 | Tarik Biberovic | 56 |
| 56 | Trayce Jackson-Davis | 57 |


# Get `nba_api` player IDs

In [11]:
from nba_api.stats.static import players

In [12]:
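def find_nba_id(x):
    """Return the nba_api player ID for the row's player name, or NaN if there is no match."""
    # search nba_api's static player list by full name
    matches = players.find_players_by_full_name(x.player)

    # no player found under this name
    if len(matches) == 0:
        return np.nan

    # if several players match, take the first one
    return matches[0]["id"]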

In [13]:
total_df["PLAYER_ID"] = total_df.apply(find_nba_id, axis=1).astype("Int64")
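
One caveat with the lookup above: as far as I can tell, `find_players_by_full_name` does a pattern search rather than an exact comparison, so taking the first match could return a different player whose name merely contains the draftee's name. A stricter alternative would be an exact, case-insensitive lookup. The sketch below is an assumption-laden illustration, not the notebook's method: `build_name_index` and `find_exact_id` are hypothetical helpers, and `toy_players` is made-up data shaped like the records from `players.get_players()` (dicts with `id` and `full_name` keys).

```python
import numpy as np

# Made-up records in the shape returned by players.get_players().
toy_players = [
    {"id": 101, "full_name": "Sam Example"},
    {"id": 102, "full_name": "Sam Example Jr."},
]


def build_name_index(records):
    """Hypothetical helper: map case-folded full names to player IDs.
    If two players share a name, the later record wins."""
    return {rec["full_name"].casefold(): rec["id"] for rec in records}


def find_exact_id(name, name_index):
    """Hypothetical helper: return the ID for an exact name match, else NaN."""
    return name_index.get(name.casefold(), np.nan)


name_index = build_name_index(toy_players)
print(find_exact_id("sam example", name_index))  # 101
print(find_exact_id("Nobody Here", name_index))  # nan
```

Exact matching trades false positives for false negatives: spelling differences between the draft data and `nba_api` (suffixes, accents, punctuation) would then show up as missing IDs.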

In [14]:
# notice we have 366 NaN player IDs (2038 - 1672); I think these are players who were drafted and then never
# played in the NBA
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2038 entries, 0 to 57
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   player        2038 non-null   object
 1   overall_pick  2038 non-null   Int64 
 2   PLAYER_ID     1672 non-null   Int64 
dtypes: Int64(2), object(1)
memory usage: 67.7+ KB
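
The number of missing IDs can be read off the output above as 2038 − 1672 = 366, but it can also be computed directly, and the unmatched names can be listed to check the guess about why they are missing. A minimal sketch on a made-up frame (the same two expressions would apply to `total_df`):

```python
import pandas as pd

# Toy frame with made-up rows; one player has no ID.
toy = pd.DataFrame(
    {
        "player": ["Player A", "Player B", "Player C"],
        "PLAYER_ID": pd.array([101, pd.NA, 103], dtype="Int64"),
    }
)

# Count the rows with no player ID.
n_missing = toy["PLAYER_ID"].isna().sum()
print(n_missing)  # 1

# List the names that could not be matched, for a manual spot check.
unmatched = toy.loc[toy["PLAYER_ID"].isna(), "player"]
print(unmatched.tolist())  # ['Player B']
```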


In [15]:
# drop all rows containing NaN, rename columns to match season_counting_stats
final_df = total_df.dropna().copy()
final_df.rename(columns={"player":"NAME", "overall_pick":"OVERALL_PICK"}, inplace=True)

# Export draft position data

In [16]:
final_df.to_csv("draft_position.csv", index=False)
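
As a sketch of how this file might be used downstream: a left merge onto a per-player stats table keeps every player and leaves the draft position missing for anyone not in the draft data, which matches the NaN placeholder described at the top. Everything in the example is assumed for illustration, including the `stats` frame, its `PTS` column, and the made-up rows; only the `NAME`, `OVERALL_PICK` and `PLAYER_ID` column names come from this notebook.

```python
import pandas as pd

# Made-up rows in the shape of draft_position.csv.
draft = pd.DataFrame(
    {
        "NAME": ["Player A", "Player B"],
        "OVERALL_PICK": pd.array([1, 2], dtype="Int64"),
        "PLAYER_ID": pd.array([101, 102], dtype="Int64"),
    }
)

# Hypothetical stats table; player 103 does not appear in the draft data.
stats = pd.DataFrame(
    {
        "PLAYER_ID": pd.array([101, 102, 103], dtype="Int64"),
        "PTS": [10, 20, 30],
    }
)

# Left merge: keep every stats row, fill OVERALL_PICK with <NA> where no match exists.
merged = stats.merge(
    draft[["PLAYER_ID", "OVERALL_PICK"]], on="PLAYER_ID", how="left"
)
print(merged["OVERALL_PICK"].isna().tolist())  # [False, False, True]
```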