<a href="https://colab.research.google.com/github/BennettHilck12/DraftEdge/blob/main/DraftEdge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CIS 5450 Final Project: DraftEdge

##### Bennett Hilck, Ethan Xia, Mohammed Soufan


# Part 1: Introduction



For our final project, we sought to better understand how NBA teams can draft players more effectively by predicting which NCAA D1 prospects are most likely to outperform expectations at the professional level. To pursue this goal, we compiled and analyzed several datasets containing college performance statistics, NBA Combine measurements and performance statistics, and eventual NBA outcomes.


By aggregating and merging these data sources, we aim to identify advanced or undervalued metrics that may be overlooked in traditional scouting evaluations. This approach allows us to quantitatively assess why certain highly drafted players become "busts," while others selected later dramatically exceed their draft position.


Given the high variance and high cost associated with drafting, our project highlights non-obvious predictors that could help inform more data-driven decision-making and potentially reshape how teams and scouts approach future NBA drafts.


We hope that you find our project impactful!

# Part 2: Data Loading & Preprocessing


First, we import all of the libraries that we use throughout the project.

In [None]:
# Imports + Installs
!pip install category_encoders
import pandas as pd
import seaborn as sns
import dask.dataframe as dd
import matplotlib.pyplot as plt
import folium
import numpy as np
import category_encoders as ce
import dask.array as da
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import tqdm
import copy
from xgboost import XGBRegressor
from folium.plugins import HeatMap
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import spearmanr
from sklearn import datasets, linear_model
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from torch.utils import data as data_utils
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression



## 2.1 Data Mounting + Storage

Because our datasets are quite large, often containing thousands of player seasons, detailed college statistics, and multiple years of NBA Combine measurements, we decided to upload them to Google Drive rather than store them locally. This allows us to efficiently access and manage the data from within our notebook environment without running into storage limitations or upload constraints. Hosting the data on Google Drive also makes collaboration easier, as all team members can work from the same centralized, version-consistent files.

In [None]:
# Data Loading via Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2.2 Loading and Preprocessing NBA Combine Data

First, let's load in our data from the NBA combine.

In [None]:
file_path = '/content/drive/MyDrive/final_project_datasets/draft_combine_stats.csv'
combine_df = pd.read_csv(file_path)
display(combine_df.head(10))

Unnamed: 0,season,player_id,first_name,last_name,player_name,position,height_wo_shoes,height_wo_shoes_ft_in,height_w_shoes,height_w_shoes_ft_in,...,spot_nba_break_right,spot_nba_corner_right,off_drib_fifteen_break_left,off_drib_fifteen_top_key,off_drib_fifteen_break_right,off_drib_college_break_left,off_drib_college_top_key,off_drib_college_break_right,on_move_fifteen,on_move_college
0,2001,12033,Adam,Allenspach,Adam Allenspach,C,83.5,6' 11.5'',,,...,,,,,,,,,,
1,2001,2240,Gilbert,Arenas,Gilbert Arenas,SG,74.25,6' 2.25'',,,...,,,,,,,,,,
2,2001,2220,Brandon,Armstrong,Brandon Armstrong,SG,75.5,6' 3.5'',,,...,,,,,,,,,,
3,2001,2203,Shane,Battier,Shane Battier,SF-PF,80.25,6' 8.25'',,,...,,,,,,,,,,
4,2001,12034,Cookie,Belcher,Cookie Belcher,SG-PG,75.0,6' 3'',,,...,,,,,,,,,,
5,2001,2294,Charlie,Bell,Charlie Bell,PG,74.5,6' 2.5'',,,...,,,,,,,,,,
6,2001,2257,Ruben,Boumtje-Boumtje,Ruben Boumtje-Boumtje,C,83.5,6' 11.5'',,,...,,,,,,,,,,
7,2001,12035,Calvin,Bowman,Calvin Bowman,PF,80.75,6' 8.75'',,,...,,,,,,,,,,
8,2001,2214,Michael,Bradley,Michael Bradley,PF,81.5,6' 9.5'',,,...,,,,,,,,,,
9,2001,2249,Jamison,Brewer,Jamison Brewer,PG,74.5,6' 2.5'',,,...,,,,,,,,,,


In [None]:
combine_df.dtypes

Unnamed: 0,0
season,int64
player_id,int64
first_name,object
last_name,object
player_name,object
position,object
height_wo_shoes,float64
height_wo_shoes_ft_in,object
height_w_shoes,float64
height_w_shoes_ft_in,object


In [None]:
combine_df.describe()

Unnamed: 0,season,player_id,height_wo_shoes,height_w_shoes,weight,wingspan,standing_reach,body_fat_pct,hand_length,hand_width,standing_vertical_leap,max_vertical_leap,lane_agility_time,modified_lane_agility_time,three_quarter_sprint,bench_press
count,1202.0,1202.0,1153.0,1008.0,1152.0,1153.0,1152.0,1003.0,719.0,719.0,1017.0,1017.0,1008.0,411.0,1012.0,808.0
mean,2012.536606,2350133.0,77.570902,78.796577,214.902604,82.478187,103.593663,7.352313,8.720793,9.448887,29.240167,34.637168,11.384444,3.079221,3.282213,10.155941
std,6.56162,56602100.0,3.334605,3.322625,25.718878,3.986623,4.880218,2.748712,0.481028,0.717066,3.054645,3.648106,0.584896,0.229143,0.13185,5.389231
min,2001.0,-1.0,67.75,69.0,154.4,70.0,89.5,2.6,7.5,7.0,20.5,25.0,9.65,2.22,2.91,0.0
25%,2007.0,101145.5,75.25,76.5,195.95,80.0,100.0,5.4,8.5,9.0,27.0,32.0,10.97,2.97,3.19,6.0
50%,2012.0,203147.0,77.75,79.0,212.9,82.75,104.0,6.7,8.75,9.5,29.0,34.5,11.32,3.1,3.27,10.0
75%,2018.0,1629014.0,80.0,81.25,233.0,85.25,107.0,8.6,9.0,10.0,31.5,37.0,11.72,3.23,3.36,14.0
max,2023.0,1962937000.0,89.25,91.0,314.0,98.25,122.5,21.0,10.5,12.0,39.5,45.5,13.44,3.76,3.81,26.0


However, we know that the NBA combine is invite-only, and many players who get drafted do not attend it. To cover those players, we turn to the full set of NCAA D1 Men's Basketball data.

## 2.3 Loading and Preprocessing NCAA Data

Next, let's take a look at our NCAA D1 player stats, which we scraped from barttorvik.com, a live NCAA D1 player stats tracker organized by season.

In [None]:
file_path = '/content/drive/MyDrive/final_project_datasets/battorvikPlayerData.xlsx'
ncaa_df = pd.read_excel(file_path)
display(ncaa_df.head(10))

Unnamed: 0,Rk,Player,Class,Team,Conf,Min%,PRPG!,BPM,ORtg,Usg,...,DR,Ast,TO,Blk,Stl,FTR,2P,3P/100,3P,Year
0,1,Bennett Stirtz,Jr,Drake,MVC,98.8,6.4,10.0,126.4,26.1,...,14.0,34.0,13.1,1.0,3.3,38.6,0.545,7.7,0.396,2025
1,2,Bruce Thornton,Jr,Ohio St.,B10,88.4,6.3,8.7,130.0,22.0,...,10.0,25.1,10.8,0.4,1.8,41.8,0.547,7.0,0.424,2025
2,3,Ryan Kalkbrenner,Sr,Creighton,BE,83.1,6.1,11.1,129.2,22.3,...,18.4,10.0,11.2,7.3,0.9,38.6,0.706,3.0,0.344,2025
3,4,Eric Dixon,Sr,Villanova,BE,84.3,6.1,7.3,116.7,32.9,...,12.6,13.1,11.8,1.1,1.5,35.8,0.483,13.1,0.407,2025
4,5,Cooper Flagg,Fr,Duke,ACC,72.8,6.0,14.9,123.0,30.8,...,21.2,26.8,13.5,4.9,2.8,42.9,0.517,7.2,0.385,2025
5,6,Trey Kaufman-Renn,Jr,Purdue,B10,76.9,5.7,7.9,118.1,31.1,...,15.4,16.8,13.9,1.2,1.4,42.8,0.597,0.4,0.429,2025
6,7,Johni Broome,Sr,Auburn,SEC,71.4,5.7,12.9,118.5,30.6,...,26.0,19.5,9.9,7.5,1.8,39.0,0.559,4.8,0.278,2025
7,8,Braden Smith,Jr,Purdue,B10,92.6,5.7,9.4,116.1,26.6,...,13.5,44.1,18.6,0.7,3.5,20.3,0.469,9.9,0.381,2025
8,9,Kam Jones,Sr,Marquette,BE,83.9,5.7,9.3,118.1,29.2,...,13.6,38.1,11.1,0.9,2.4,16.0,0.586,10.6,0.311,2025
9,10,Tyson Degenhart,Sr,Boise St.,MWC,84.5,5.6,7.6,126.8,23.8,...,15.3,10.3,10.7,0.9,1.1,50.5,0.618,7.6,0.349,2025


In [None]:
ncaa_df.dtypes

Unnamed: 0,0
Rk,int64
Player,object
Class,object
Team,object
Conf,object
Min%,float64
PRPG!,float64
BPM,float64
ORtg,float64
Usg,float64


In [None]:
ncaa_df.describe()

Unnamed: 0,Rk,Min%,PRPG!,BPM,ORtg,Usg,eFG,TS,OR,DR,Ast,TO,Blk,Stl,FTR,2P,3P/100,3P,Year
count,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0
mean,1088.863973,63.886001,1.601908,0.508939,103.134344,20.155992,50.251591,53.593196,5.069479,13.179956,13.647076,18.396751,1.797867,1.892037,35.759151,0.484399,5.994695,0.294807,2016.615569
std,629.257082,13.680132,1.343539,3.643385,10.595077,4.613991,6.271648,5.656281,3.587501,4.846323,7.603259,5.055464,2.141272,0.801671,15.741384,0.077856,4.120459,0.13744,5.194556
min,1.0,40.0,-2.7,-13.3,55.4,6.1,18.3,20.3,0.0,2.1,0.0,2.2,0.0,0.0,0.0,0.0,0.0,0.0,2008.0
25%,544.0,52.4,0.7,-2.0,96.3,16.8,46.1,49.9,2.1,9.5,7.9,14.9,0.4,1.3,24.5,0.433,2.5,0.262,2012.0
50%,1088.0,63.9,1.5,0.4,103.3,19.9,50.0,53.6,4.0,12.5,12.0,17.9,1.0,1.8,33.6,0.483,6.0,0.33,2017.0
75%,1632.0,74.975,2.4,2.9,110.2,23.2,54.3,57.3,7.5,16.3,18.0,21.3,2.4,2.3,44.9,0.535,9.1,0.374,2021.0
max,2308.0,98.8,7.9,18.7,161.4,38.5,88.3,80.4,23.7,39.2,52.6,54.1,18.8,7.6,147.9,1.0,23.1,1.0,2025.0


## 2.4 Loading and Preprocessing NBA Draft Data

Let's also take a look at our NBA Draft data, which we scraped from basketball-reference.com, a live NBA stats tracker.

In [None]:
file_path = '/content/drive/MyDrive/final_project_datasets/nba_draft_2000_2025_clean.csv'
draft_df = pd.read_csv(file_path)
display(draft_df.head(10))

Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,FT%,MP.1,PTS.1,TRB.1,AST.1,WS,WS/48,BPM,VORP,Year
0,1.0,1.0,NJN,Kenyon Martin,Cincinnati,15.0,757,23134,9325,5159,...,0.629,30.6,12.3,6.8,1.9,48.0,0.1,0.1,12.1,2000
1,2.0,2.0,VAN,Stromile Swift,LSU,9.0,547,10804,4582,2535,...,0.699,19.8,8.4,4.6,0.5,21.3,0.095,-1.6,1.1,2000
2,3.0,3.0,LAC,Darius Miles,,7.0,446,11730,4507,2190,...,0.59,26.3,10.1,4.9,1.9,9.5,0.039,-1.0,3.0,2000
3,4.0,4.0,CHI,Marcus Fizer,Iowa State,6.0,289,6032,2782,1340,...,0.691,20.9,9.6,4.6,1.2,2.7,0.022,-3.7,-2.6,2000
4,5.0,5.0,ORL,Mike Miller,Florida,17.0,1032,27812,10973,4376,...,0.769,26.9,10.6,4.2,2.6,60.7,0.105,0.8,19.8,2000
5,6.0,6.0,ATL,DerMarr Johnson,Cincinnati,7.0,344,5930,2121,769,...,0.789,17.2,6.2,2.2,0.9,6.4,0.052,-1.6,0.6,2000
6,7.0,7.0,CHI,Chris Mihm,Texas,8.0,436,8758,3262,2302,...,0.704,20.1,7.5,5.3,0.5,13.3,0.073,-3.9,-4.3,2000
7,8.0,8.0,CLE,Jamal Crawford,Michigan,20.0,1327,38994,19419,2948,...,0.862,29.4,14.6,2.2,3.4,60.7,0.075,-0.1,18.4,2000
8,9.0,9.0,HOU,Joel Przybilla,Minnesota,13.0,592,11733,2293,3665,...,0.557,19.8,3.9,6.2,0.4,23.0,0.094,-1.7,0.8,2000
9,10.0,10.0,ORL,Keyon Dooling,Missouri,13.0,728,14134,5067,964,...,0.799,19.4,7.0,1.3,2.2,18.5,0.063,-2.0,-0.2,2000


In [None]:
draft_df.dtypes

Unnamed: 0,0
Rk,float64
Pk,float64
Tm,object
Player,object
College,object
Yrs,object
G,object
MP,object
PTS,object
TRB,object


In [None]:
draft_df.describe()

Unnamed: 0,Rk,Pk,Year
count,1545.0,1545.0,1578.0
mean,30.277023,30.257605,2012.555767
std,17.217246,17.200612,7.479822
min,1.0,1.0,2000.0
25%,15.0,15.0,2006.0
50%,30.0,30.0,2013.0
75%,45.0,45.0,2019.0
max,60.0,60.0,2025.0


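Here, we can see that only three columns in draft_df (Rk, Pk, and Year) are numeric; the remaining statistical columns were read in as dtype object. This means that we will have to convert those columns from object to a numeric dtype before we can analyze them.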
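A minimal sketch of one way to do this conversion with `pd.to_numeric` is shown here. The `toy_draft` frame is made-up data standing in for draft_df, and the `text_cols` list is our assumption about which columns should stay as text.

```python
import pandas as pd

# Made-up stand-in for draft_df: numeric stats were read in as strings (dtype object)
toy_draft = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Tm": ["NJN", "VAN", "LAC"],
    "G": ["757", "547", "G"],          # hypothetical non-numeric entry in a numeric column
    "MP": ["23134", "10804", "11730"],
})

# Assumption: these columns are genuinely text and should not be converted
text_cols = ["Player", "Tm"]

numeric_cols = [c for c in toy_draft.columns if c not in text_cols]
for col in numeric_cols:
    # errors="coerce" turns unparseable entries into NaN instead of raising
    toy_draft[col] = pd.to_numeric(toy_draft[col], errors="coerce")

print(toy_draft.dtypes)
```

Coercing to NaN keeps the row in place, so we can count how many values failed to parse before deciding whether to drop or impute them.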

## 2.5 Loading and Preprocessing NBA Player Data

We will also load our NBA player data, which we scraped from ESPN and from basketball-reference.com. We will merge these two dataframes later in Part 3.

In [None]:
file_path = '/content/drive/MyDrive/final_project_datasets/nba_player_stats_2000_2023_fixed_years_multiTM_clean.csv'
nba_br_df = pd.read_csv(file_path)
display(nba_br_df.head(10))

Unnamed: 0,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards,Year
0,1.0,Shaquille O'Neal,27.0,LAL,C,79.0,79.0,40.0,12.1,21.1,...,9.4,13.6,3.8,0.5,3.0,2.8,3.2,29.7,"MVP-1,DPOY-2,AS,NBA1,DEF2",2000
1,2.0,Allen Iverson,24.0,PHI,SG,70.0,70.0,40.8,10.4,24.8,...,2.8,3.8,4.7,2.1,0.1,3.3,2.3,28.4,"MVP-7,AS,NBA2",2000
2,3.0,Grant Hill,27.0,DET,SF,74.0,74.0,37.5,9.4,19.2,...,5.3,6.6,5.2,1.4,0.6,3.2,2.6,25.8,"MVP-8,AS,NBA2",2000
3,4.0,Vince Carter,23.0,TOR,SF,82.0,82.0,38.1,9.6,20.7,...,4.0,5.8,3.9,1.3,1.1,2.2,3.2,25.7,"MVP-10,AS,NBA3",2000
4,5.0,Karl Malone,36.0,UTA,PF,82.0,82.0,35.9,9.2,18.0,...,7.4,9.5,3.7,1.0,0.9,2.8,2.8,25.5,"MVP-4,AS,NBA2",2000
5,6.0,Chris Webber,26.0,SAC,PF,75.0,75.0,38.4,10.0,20.6,...,8.0,10.5,4.6,1.6,1.7,2.9,3.5,24.5,"MVP-9,AS,NBA3",2000
6,7.0,Gary Payton,31.0,SEA,PG,82.0,82.0,41.8,9.1,20.3,...,5.2,6.5,8.9,1.9,0.2,2.7,2.2,24.2,"MVP-6,DPOY-5,AS,NBA1,DEF1",2000
7,8.0,Jerry Stackhouse,25.0,DET,SG,82.0,82.0,38.4,7.5,17.6,...,2.4,3.8,4.5,1.3,0.4,3.8,2.3,23.6,AS,2000
8,9.0,Tim Duncan,23.0,SAS,PF,74.0,74.0,38.9,8.5,17.3,...,8.9,12.4,3.2,0.9,2.2,3.3,2.8,23.2,"MVP-5,AS,NBA1,DEF1",2000
9,10.0,Kevin Garnett,23.0,MIN,PF,81.0,81.0,40.0,9.4,18.8,...,9.0,11.8,5.0,1.5,1.6,3.3,2.5,22.9,"MVP-2,DPOY-7,AS,NBA1,DEF1",2000


In [None]:
nba_br_df.dtypes

Unnamed: 0,0
Rk,float64
Player,object
Age,float64
Team,object
Pos,object
G,float64
GS,float64
MP,float64
FG,float64
FGA,float64


In [None]:
nba_br_df.describe()

Unnamed: 0,Rk,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
count,9836.0,9836.0,9836.0,9836.0,9836.0,9836.0,9836.0,9833.0,9836.0,9836.0,...,9836.0,9836.0,9836.0,9836.0,9836.0,9836.0,9836.0,9836.0,9836.0,9838.0
mean,203.451708,26.649654,56.48902,29.003863,22.458184,3.506415,7.760543,0.448383,0.727847,2.054036,...,1.022773,2.929504,3.950356,2.068625,0.705277,0.450397,1.29449,1.973546,9.407107,2011.359016
std,117.027192,4.274809,22.565843,29.042118,8.887688,2.112659,4.472559,0.074341,0.758573,1.99227,...,0.822219,1.772717,2.443491,1.839382,0.422173,0.480604,0.776622,0.726336,5.845014,6.941689
min,1.0,18.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2000.0
25%,103.0,23.0,42.0,2.0,15.4,1.9,4.3,0.408,0.0,0.2,...,0.4,1.7,2.2,0.8,0.4,0.1,0.7,1.5,5.0,2005.0
50%,204.0,26.0,63.0,17.0,22.1,3.0,6.7,0.442,0.5,1.7,...,0.8,2.5,3.4,1.5,0.6,0.3,1.1,2.0,7.9,2011.0
75%,303.0,30.0,75.0,57.0,29.8,4.7,10.4,0.484,1.2,3.3,...,1.4,3.8,5.1,2.8,0.9,0.6,1.7,2.5,12.5,2017.0
max,444.0,44.0,85.0,83.0,43.7,12.2,27.8,1.0,5.3,13.2,...,5.5,11.5,16.3,11.7,2.9,5.0,5.7,6.0,36.1,2023.0


In [None]:
file_path = '/content/drive/MyDrive/final_project_datasets/nbaPlayerData.xlsx'
nba_espn_df = pd.read_excel(file_path)
display(nba_espn_df.head(10))

Unnamed: 0,RK,Player,Team,3P%,3PA,3PM,AST,BLK,DD2,FG%,...,FTM,GP,MIN,POS,PTS,REB,STL,TD3,TO,Year
0,-,Allen Iverson,PHI,29.1,4.5,1.3,5.5,0.2,4,39.8,...,7.9,60,43.7,SG,31.4,4.5,2.8,1,4.0,2002
1,-,Shaquille O'Neal,LAL,0.0,0.0,0.0,3.0,2.0,40,57.9,...,5.9,67,36.1,C,27.2,10.7,0.6,0,2.6,2002
2,-,Paul Pierce,BOS,40.4,6.3,2.6,3.2,1.0,17,44.2,...,6.3,82,40.3,SF,26.1,6.9,1.9,0,2.9,2002
3,-,Tracy McGrady,ORL,36.4,3.7,1.4,5.3,1.0,24,45.1,...,5.5,76,38.3,SG,25.6,7.9,1.6,1,2.5,2002
4,-,Tim Duncan,SA,10.0,0.1,0.0,3.7,2.5,67,50.8,...,6.8,82,40.6,C,25.5,12.7,0.7,0,3.2,2002
5,-,Kobe Bryant,LAL,25.0,1.7,0.4,5.5,0.4,11,46.9,...,6.1,80,38.3,SF,25.2,5.5,1.5,1,2.8,2002
6,-,Vince Carter,TOR,38.7,5.2,2.0,4.0,0.7,5,42.8,...,4.1,60,39.8,G,24.7,5.2,1.6,0,2.6,2002
7,-,Chris Webber,SAC,26.3,0.4,0.1,4.8,1.4,31,49.5,...,4.7,54,38.4,C,24.5,10.1,1.7,0,2.9,2002
8,-,Dirk Nowitzki,DAL,39.7,4.6,1.8,2.4,1.0,38,47.7,...,5.8,76,38.0,F,23.4,9.9,1.1,0,1.9,2002
9,-,Michael Jordan,WSH,18.9,0.9,0.2,5.2,0.4,8,41.6,...,4.4,60,34.9,G,22.9,5.7,1.4,0,2.7,2002


In [None]:
nba_espn_df.dtypes

Unnamed: 0,0
RK,object
Player,object
Team,object
3P%,float64
3PA,float64
3PM,float64
AST,float64
BLK,float64
DD2,int64
FG%,float64


In [None]:
nba_espn_df.describe()

Unnamed: 0,3P%,3PA,3PM,AST,BLK,DD2,FG%,FGA,FGM,FT%,FTA,FTM,GP,MIN,PTS,REB,STL,TD3,TO,Year
count,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0,11095.0
mean,26.352591,2.040297,0.718558,1.907959,0.417071,4.434971,44.596413,7.07826,3.20438,71.484308,1.968247,1.489797,52.390626,20.571104,8.61443,3.648481,0.646868,0.15151,1.166886,2014.087787
std,16.682662,2.023629,0.772698,1.819998,0.45391,9.385283,8.959944,4.673945,2.216838,18.393246,1.758923,1.412914,24.372468,9.720884,6.121573,2.456222,0.427746,1.233593,0.798064,7.110542
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.1,0.0,0.0,0.0,0.0,2002.0
25%,15.4,0.2,0.0,0.6,0.1,0.0,40.3,3.4,1.5,66.2,0.8,0.5,34.0,12.6,3.9,1.9,0.3,0.0,0.6,2008.0
50%,32.1,1.5,0.5,1.3,0.3,0.0,44.2,6.0,2.7,75.4,1.4,1.0,59.0,20.2,7.1,3.1,0.6,0.0,1.0,2014.0
75%,37.2,3.3,1.2,2.5,0.5,4.0,48.7,9.8,4.5,82.2,2.6,2.0,73.0,28.65,11.9,4.8,0.9,0.0,1.6,2020.0
max,100.0,13.2,5.3,11.7,3.8,77.0,100.0,27.8,12.2,100.0,12.3,10.2,85.0,43.7,36.1,16.0,3.0,42.0,5.7,2025.0


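We want to normalize the team format across dataframes. In nba_espn_df, players who played on multiple teams in a season are represented with a '/' (IND/CHI, for example). We replace these entries with a count-based label such as 2TM, so that every multi-team row uses a single format; single-team entries are left unchanged.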

In [None]:
def replace_teams(team_str):
    if not isinstance(team_str, str):  # handle NaN or other non-string values
        return team_str
    teams = team_str.split('/')
    if len(teams) == 1:
        return team_str  # single team stays as is
    else:
        return f"{len(teams)}TM"

nba_espn_df['Team'] = nba_espn_df['Team'].apply(replace_teams)


In [None]:
nba_espn_df.head(40)

Unnamed: 0,RK,Player,Team,3P%,3PA,3PM,AST,BLK,DD2,FG%,...,FTM,GP,MIN,POS,PTS,REB,STL,TD3,TO,Year
0,-,Allen Iverson,PHI,29.1,4.5,1.3,5.5,0.2,4,39.8,...,7.9,60,43.7,SG,31.4,4.5,2.8,1,4.0,2002
1,-,Shaquille O'Neal,LAL,0.0,0.0,0.0,3.0,2.0,40,57.9,...,5.9,67,36.1,C,27.2,10.7,0.6,0,2.6,2002
2,-,Paul Pierce,BOS,40.4,6.3,2.6,3.2,1.0,17,44.2,...,6.3,82,40.3,SF,26.1,6.9,1.9,0,2.9,2002
3,-,Tracy McGrady,ORL,36.4,3.7,1.4,5.3,1.0,24,45.1,...,5.5,76,38.3,SG,25.6,7.9,1.6,1,2.5,2002
4,-,Tim Duncan,SA,10.0,0.1,0.0,3.7,2.5,67,50.8,...,6.8,82,40.6,C,25.5,12.7,0.7,0,3.2,2002
5,-,Kobe Bryant,LAL,25.0,1.7,0.4,5.5,0.4,11,46.9,...,6.1,80,38.3,SF,25.2,5.5,1.5,1,2.8,2002
6,-,Vince Carter,TOR,38.7,5.2,2.0,4.0,0.7,5,42.8,...,4.1,60,39.8,G,24.7,5.2,1.6,0,2.6,2002
7,-,Chris Webber,SAC,26.3,0.4,0.1,4.8,1.4,31,49.5,...,4.7,54,38.4,C,24.5,10.1,1.7,0,2.9,2002
8,-,Dirk Nowitzki,DAL,39.7,4.6,1.8,2.4,1.0,38,47.7,...,5.8,76,38.0,F,23.4,9.9,1.1,0,1.9,2002
9,-,Michael Jordan,WSH,18.9,0.9,0.2,5.2,0.4,8,41.6,...,4.4,60,34.9,G,22.9,5.7,1.4,0,2.7,2002


## 2.6 Loading and Preprocessing Advanced Statistics

Next, we will load our advanced statistics data, which we also scraped from basketball-reference.com. We hope that these advanced statistics will help underdog or outperforming players stand out.

In [None]:
file_path = '/content/drive/MyDrive/final_project_datasets/NBA_Advanced_stats_2008-2025_fixed_years_multiTM_clean.csv'
advanced_df = pd.read_csv(file_path)
display(advanced_df.head(10))

Unnamed: 0,Rk,Player,Age,Team,Pos,G,GS,MP,PER,TS%,...,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,Awards,Year
0,1,Allen Iverson,32,DEN,SG,82,82,3424,20.9,0.567,...,8.9,2.8,11.6,0.163,3.4,-0.7,2.7,4.0,AS,2008
1,2,Joe Johnson,26,ATL,SG,82,82,3343,17.3,0.535,...,5.3,1.7,7.0,0.1,2.7,-1.5,1.2,2.7,AS,2008
2,3,Andre Iguodala,24,PHI,SF,82,82,3242,19.0,0.543,...,5.3,4.4,9.6,0.143,2.5,1.5,4.0,4.9,,2008
3,4,Richard Jefferson,27,NJN,SF,82,82,3200,17.4,0.571,...,6.1,1.5,7.6,0.114,1.6,-1.3,0.3,1.8,,2008
4,5,Baron Davis,28,GSW,PG,82,82,3196,19.8,0.523,...,5.9,2.8,8.7,0.131,3.2,0.3,3.6,4.5,,2008
5,6,Kobe Bryant,29,LAL,SG,82,82,3192,24.2,0.576,...,9.5,4.3,13.8,0.208,5.2,0.6,5.8,6.3,"MVP-1,DPOY-5,AS,NBA1,DEF1",2008
6,7,Jamal Crawford,27,NYK,SG,80,80,3190,16.0,0.528,...,4.6,0.4,5.0,0.076,2.1,-1.8,0.3,1.9,,2008
7,8,Jason Richardson,27,CHA,SF,82,82,3149,18.4,0.554,...,4.8,2.6,7.4,0.113,3.3,-0.3,3.0,4.0,,2008
8,9,Dwight Howard,22,ORL,C,82,82,3088,22.9,0.619,...,6.4,6.4,12.9,0.2,1.6,1.2,2.8,3.7,"MVP-5,DPOY-7,AS,NBA1,DEF2",2008
9,10,Rashard Lewis,28,ORL,PF,81,81,3076,16.7,0.591,...,6.1,3.7,9.8,0.153,2.7,0.4,3.1,4.0,,2008


In [None]:
advanced_df.dtypes

Unnamed: 0,0
Rk,int64
Player,object
Age,Int64
Team,object
Pos,object
...,...
eFG,float64
hand_width,float64
PTS,float64
hand_length,float64


In [None]:
advanced_df.describe()

Unnamed: 0,Rk,Age,GP,GS,MIN,PER,TS%,3PAr,FTr,ORB%,...,last_name,spot_college_corner_left,body_fat_pct,on_move_college,2P%,Usg,eFG,hand_width,PTS,hand_length
count,6557.0,6557.0,6557.0,6557.0,6557.0,6556.0,6554.0,6553.0,6553.0,6556.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,201.034009,26.52951,58.337959,29.256367,1394.789233,13.854866,0.539528,0.294548,0.271633,5.193975,...,,,,,,,,,,
std,114.557994,4.22162,18.852139,28.210774,757.288966,4.686067,0.06054,0.216604,0.145375,4.056865,...,,,,,,,,,,
min,1.0,19.0,1.0,0.0,0.0,-13.4,0.0,0.0,0.0,0.0,...,,,,,,,,,,
25%,102.0,23.0,46.0,3.0,755.0,10.8,0.506,0.074,0.172,2.1,...,,,,,,,,,,
50%,202.0,26.0,63.0,18.0,1360.0,13.4,0.541,0.306,0.245,3.7,...,,,,,,,,,,
75%,300.0,29.0,74.0,56.0,2003.0,16.5,0.576,0.453,0.339,7.7,...,,,,,,,,,,
max,435.0,43.0,83.0,83.0,3424.0,76.3,1.0,1.0,2.167,63.9,...,,,,,,,,,,



# Part 3: Wrangling

### 3.1 Data Filtering

We can see that although ncaa_df ranges from 2008 - 2025, combine_df only ranges from 2001 - 2023. This indicates that we will have to clean and format our dataframes to only include relevant years for processing.

First, let us only include the years that all five of our combine, NCAA, and NBA dataframes have in common.

In [None]:
# combine_df stores the year in a "season" column, so align it with the other dataframes first
combine_df = combine_df.rename(columns={"season": "Year"})

dfs = [combine_df, ncaa_df, nba_espn_df, nba_br_df, advanced_df]

# Find the intersection of the year ranges
intersection_min = max(df["Year"].min() for df in dfs)
intersection_max = min(df["Year"].max() for df in dfs)

# Keep only rows inside the shared range; .copy() avoids SettingWithCopyWarning in later cells
combine_df   = combine_df[combine_df["Year"].between(intersection_min, intersection_max)].copy()
ncaa_df      = ncaa_df[ncaa_df["Year"].between(intersection_min, intersection_max)].copy()
nba_espn_df  = nba_espn_df[nba_espn_df["Year"].between(intersection_min, intersection_max)].copy()
nba_br_df    = nba_br_df[nba_br_df["Year"].between(intersection_min, intersection_max)].copy()
advanced_df  = advanced_df[advanced_df["Year"].between(intersection_min, intersection_max)].copy()

intersection_min, intersection_max




(2008, 2023)

## 3.2 Data Merging

In this section, we will look to merge our data sets for EDA.

### 3.2.1 Normalizing Column Names

Next, we want to normalize our columns and dtypes across dataframes so that we are able to merge correctly.

In [None]:
duplicate_groups = {
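    'Player': ['PLAYER', 'player', 'PlayerName', 'player_name'],  # 'player_name' is the name column in combine_df
    'Pos': ['Pos_br', 'POS_espn', 'Position', 'position', 'POS'],  # 'position' (combine_df) and 'POS' (nba_espn_df)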
    'Age': ['Age_br', 'AGE'],
    'Height': ['HEIGHT', 'Height_br'],
    'Weight': ['WEIGHT', 'Weight_br'],
    'Year': ['Year_br', 'Year_espn', 'season', 'Season'],
    'Team': ['TEAM', 'Tm'],

    # Shooting + scoring
    'PTS': ['PTS_br', 'PTS_espn', 'Points'],
    'FGM': ['FGM_espn', 'FGM'],
    'FGA': ['FGA_br', 'FGA_espn'],
    'FG%': ['FG%_br', 'FG%_espn', 'FG_PCT'],
    '3PM': ['3PM', '3P'],
    '3PA': ['3PA_br', '3PA_espn'],
    '3P%': ['3P%_br', '3P%_espn', '3P_PCT'],
    'FTM': ['FTM_espn', 'FTM'],
    'FTA': ['FTA_br', 'FTA_espn'],
    'FT%': ['FT%_br', 'FT%_espn', 'FT_PCT'],

    # Rebounding + playmaking
    'TRB': ['TRB_br', 'REB', 'Rebounds', 'TRB_espn'],
    'AST': ['AST_br', 'AST_espn', 'Assists'],
    'STL': ['STL_br', 'STL_espn', 'Steals'],
    'BLK': ['BLK_br', 'BLK_espn', 'Blocks'],
    'TOV': ['TOV', 'TO', 'Turnovers'],

    # Games + minutes
    'GP': ['G', 'GP', 'Games'],
    'GS': ['GS', 'GamesStarted'],
    'MIN': ['MP', 'MIN', 'Minutes'],

    # Advanced stats
    'PER': ['PER_br', 'PER_espn'],
    'TS%': ['TS%', 'TS_PCT'],
    'USG%': ['USG%', 'UsageRate'],
    'WS': ['WS_br', 'WinShares'],
    'BPM': ['BPM_br', 'BoxPlusMinus'],
    'VORP': ['VORP_br']
}

In [None]:
dfs = [combine_df, ncaa_df, nba_espn_df, nba_br_df, advanced_df]

# Apply renaming
for df in dfs:
    for unified_col, variants in duplicate_groups.items():
        for col in variants:
            if col in df.columns:
                df.rename(columns={col: unified_col}, inplace=True)

In [None]:
# Determine the union of all columns after renaming
all_columns = set().union(*[df.columns for df in dfs])

# Add missing columns as NaN
for df in dfs:
    for col in all_columns:
        if col not in df.columns:
            df[col] = np.nan

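# Reorder columns identically. Rebinding the loop variable would leave the originals
# unchanged, so reassign each dataframe and rebuild dfs; sorting keeps the order deterministic.
ordered_columns = sorted(all_columns)
combine_df, ncaa_df, nba_espn_df, nba_br_df, advanced_df = (df[ordered_columns].copy() for df in dfs)
dfs = [combine_df, ncaa_df, nba_espn_df, nba_br_df, advanced_df]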

### 3.2.2 Normalizing Column Datatypes

In [None]:
# Columns that must be strings
string_cols = ['Player', 'Pos', 'Team']

# Columns that must be integers (or nullable ints)
int_cols = ['Age', 'Year', 'GP', 'GS']

# Columns that must be floats (percentages, rates, stats)
float_cols = [
    'PTS', 'FGM', 'FGA', 'FG%', '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%',
    'TRB', 'AST', 'STL', 'BLK', 'TOV', 'MIN',
    'PER', 'TS%', 'USG%', 'WS', 'BPM', 'VORP',
    'Height', 'Weight'
]

# Apply dtype normalization across ALL dfs
for df in dfs:

    # --- Standardize string columns ---
    for col in string_cols:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip()

    # --- Standardize integer columns (nullable Int64 for missing values) ---
    for col in int_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')

    # --- Standardize float columns ---
    for col in float_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').astype(float)


Next, we know that the same player appears in multiple rows of each dataframe, once for every season played. Because we want to look at each player's performance over time, we need to tell these rows apart. We do so by combining the player's name, team, and season for both the NBA and NCAA data. This will be our merge key.

In [None]:
for df in [combine_df, ncaa_df, nba_espn_df, nba_br_df, advanced_df]:
    # Remove whitespace, lowercase, normalize name
    df['Player'] = df['Player'].str.strip().str.lower()
    # Ensure 'Team' column is also standardized if it exists
    if 'Team' in df.columns:
        df['Team'] = df['Team'].astype(str).str.strip().str.lower()

    # Construct unique player_season_id (Player_Year) for all dataframes
    df['player_season_id'] = df['Player'] + "_" + df['Year'].astype(str)

    # Also construct player_team_season_id (Player_Team_Year) if 'Team' column exists
    if 'Team' in df.columns:
        df['player_team_season_id'] = df['Player'] + "_" + df['Team'] + "_" + df['Year'].astype(str)



In [None]:
# Check that this is right
nba_br_df.head()

Unnamed: 0,Rk,Player,Age,Team,Pos,GP,GS,MIN,FG,FGA,...,off_drib_college_top_key,last_name,spot_college_corner_left,body_fat_pct,on_move_college,Usg,eFG,hand_width,hand_length,player_season_id
3380,1.0,lebron james,23,CLE,SF,75,74,40.4,10.6,21.9,...,,,,,,,,,,lebron james_2008
3381,2.0,kobe bryant,29,LAL,SG,82,82,38.9,9.5,20.6,...,,,,,,,,,,kobe bryant_2008
3382,3.0,allen iverson,32,DEN,SG,82,82,41.8,8.7,19.0,...,,,,,,,,,,allen iverson_2008
3383,4.0,carmelo anthony,23,DEN,SF,77,77,36.4,9.5,19.2,...,,,,,,,,,,carmelo anthony_2008
3384,5.0,amar'e stoudemire,25,PHO,C,79,79,33.9,9.0,15.3,...,,,,,,,,,,amar'e stoudemire_2008
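Before merging, it can help to confirm that the key is actually unique within a dataframe. The sketch below uses made-up rows and assumes, without having verified it against our data, that a traded player may have one combined row plus one row per team in the same season. In that case `player_season_id` repeats while `player_team_season_id` stays unique.

```python
import pandas as pd

# Made-up rows: "player a" has a combined multi-team row plus one row per team
toy = pd.DataFrame({
    "Player": ["player a", "player a", "player a", "player b"],
    "Team": ["2tm", "ind", "chi", "lal"],
    "Year": [2010, 2010, 2010, 2010],
})
toy["player_season_id"] = toy["Player"] + "_" + toy["Year"].astype(str)
toy["player_team_season_id"] = (
    toy["Player"] + "_" + toy["Team"] + "_" + toy["Year"].astype(str)
)

# keep=False marks every row that shares its key with at least one other row
dup_season = toy[toy["player_season_id"].duplicated(keep=False)]
dup_team_season = toy[toy["player_team_season_id"].duplicated(keep=False)]

print(len(dup_season), "rows share a player_season_id")
print(len(dup_team_season), "rows share a player_team_season_id")
```

If duplicates like these show up in the real data, merging on `player_season_id` alone would multiply rows, so we would either merge on the team-level key or keep only the combined row first.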


## 3.3 Data Merge

Finally, we want to merge our datasets so that we can perform meaningful EDA.

### 3.3.1 NBA Data Merge

First, we want to merge our two separate NBA dataframes, nba_espn_df and nba_br_df, into nba_df.
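One possible shape for this merge is sketched here on made-up data. The choice of an outer join, the `_espn`/`_br` suffixes, and the rule of preferring the ESPN value for overlapping stats are all assumptions for illustration, not decisions we have finalized.

```python
import pandas as pd

# Made-up stand-ins for nba_espn_df and nba_br_df
espn = pd.DataFrame({
    "player_season_id": ["player a_2010", "player b_2010"],
    "PTS": [20.1, 8.4],
    "DD2": [12, 0],
})
br = pd.DataFrame({
    "player_season_id": ["player a_2010", "player c_2010"],
    "PTS": [20.1, 5.5],
    "GS": [80, 3],
})

# Outer join keeps player-seasons that appear in only one source;
# indicator=True adds a "_merge" column recording where each row came from
nba_df_sketch = espn.merge(
    br, on="player_season_id", how="outer",
    suffixes=("_espn", "_br"), indicator=True,
)

# Coalesce the overlapping stat: take the ESPN value, fall back to basketball-reference
nba_df_sketch["PTS"] = nba_df_sketch["PTS_espn"].combine_first(nba_df_sketch["PTS_br"])
nba_df_sketch = nba_df_sketch.drop(columns=["PTS_espn", "PTS_br"])

print(nba_df_sketch)
```

The `_merge` column makes it easy to count how many player-seasons matched in both sources before we rely on the merged frame.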

### 3.3.2 NBA Advanced Metrics Merge



Next, we merge nba_df with advanced_df into total_nba_df.

### 3.3.3 NBA Draft and Combine Merge

We then merge combine_df with draft_df.
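Because the combine is invite-only, a left join from the draft data keeps every drafted player and simply leaves the combine measurements empty for those who did not attend. The sketch below uses made-up rows; the `attended_combine` column name and the choice of `Player` and `Year` as join keys are assumptions.

```python
import pandas as pd

# Made-up stand-ins for draft_df and combine_df
draft = pd.DataFrame({
    "Player": ["player a", "player b", "player c"],
    "Year": [2010, 2010, 2011],
    "Pk": [1, 2, 1],
})
combine = pd.DataFrame({
    "Player": ["player a", "player c"],
    "Year": [2010, 2011],
    "wingspan": [82.5, 85.0],
})

# Left join: every drafted player is kept, combine columns are NaN when no match exists
draft_combine = draft.merge(combine, on=["Player", "Year"], how="left", indicator=True)

# Record whether a combine row was found, then drop the helper column
draft_combine["attended_combine"] = draft_combine["_merge"] == "both"
draft_combine = draft_combine.drop(columns="_merge")

print(draft_combine)
```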

### 3.3.4 College to NBA Draft Merge

We merge ncaa_df with draft_df.

### 3.3.5 NBA Draft to NBA Merge

We merge draft_df with total_nba_df.

### 3.3.6 College to NBA Merge

Finally, we merge ncaa_df with total_nba_df.


# Part 4: Exploratory Data Analysis

After generating a summary of our data and examining the overall data distribution, we can begin to explore insights and relationships between features, which will ultimately help us determine a better drafting procedure. In this section, we explore how NCAA performance statistics relate to the physical and athletic measurements collected at the NBA Combine. By comparing metrics such as wingspan, vertical leap, sprint time, and shooting or efficiency statistics, we aim to uncover relationships that may help explain how a player’s physical tools translate into on-court impact. These comparisons are important because they give us insight into which measurable physical attributes are associated with effective college performance and ultimately, which features may be most predictive when constructing a draft projection model. The goal is not only to visualize trends, but also to identify feature pairs where physical measurements meaningfully correlate with skill, efficiency, or overall impact.

## 4.1 NBA vs. Draft EDA

We take the players in total_nba_df and match them to draft_df. We then compare each player's NBA advanced stats to their draft number and to the league-wide average advanced stats. Finally, we identify specific players with above-average advanced metrics and see how their draft position compares.
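A minimal sketch of this comparison on made-up data follows. Using BPM as the advanced metric and treating picks after 30 as "late" (roughly the second round) are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up merged draft + NBA rows
toy = pd.DataFrame({
    "Player": ["player a", "player b", "player c", "player d"],
    "Pk": [2, 45, 10, 58],
    "BPM": [-1.0, 3.5, 0.5, 2.0],
})

# League-wide average of the advanced metric
league_avg_bpm = toy["BPM"].mean()

# Flag players who beat the league average
toy["above_avg"] = toy["BPM"] > league_avg_bpm

# Assumption: a pick number above 30 counts as a late selection
late_risers = toy[toy["above_avg"] & (toy["Pk"] > 30)]

print(league_avg_bpm)
print(late_risers)
```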

In [None]:
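# combine_df and ncaa_df have different lengths, so align them on player and year first
# (assumes the combine year matches the player's final NCAA season)
name_col = 'player_name' if 'player_name' in combine_df.columns else 'Player'
keyed = combine_df.assign(Player=combine_df[name_col].astype(str).str.strip().str.lower())
paired = keyed[['Player', 'Year', 'wingspan']].merge(ncaa_df[['Player', 'Year', 'BPM']], on=['Player', 'Year'])

plt.figure(figsize=(6,4))
plt.scatter(paired['wingspan'], paired['BPM'])
plt.xlabel("Wingspan (inches)")
plt.ylabel("BPM")
plt.title("Wingspan vs College BPM")
plt.show()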


Explanation

We also compare how draft position correlates with NBA outcomes such as awards and long-term advanced stats (PER, BPM, WS), and identify the advanced metrics that successful players have in common.
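Since pick number is an ordinal ranking, a rank correlation such as Spearman's is one reasonable way to measure this relationship. The sketch below uses made-up pick and win-share values; a negative coefficient would mean that earlier picks tend to have better outcomes.

```python
import pandas as pd
from scipy.stats import spearmanr

# Made-up draft picks and career win shares
toy = pd.DataFrame({
    "Pk": [1, 2, 3, 4, 5, 6],
    "WS": [60.7, 48.0, 9.5, 21.3, 13.3, 2.7],
})

# Spearman correlation compares the rank order of the two columns
rho, p_value = spearmanr(toy["Pk"], toy["WS"])

print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3f}")
```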

In [None]:
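# combine_df and ncaa_df have different lengths, so align them on player and year first
# (assumes the combine year matches the player's final NCAA season)
name_col = 'player_name' if 'player_name' in combine_df.columns else 'Player'
keyed = combine_df.assign(Player=combine_df[name_col].astype(str).str.strip().str.lower())
paired = keyed[['Player', 'Year', 'wingspan']].merge(ncaa_df[['Player', 'Year', 'BPM']], on=['Player', 'Year'])

plt.figure(figsize=(6,4))
plt.scatter(paired['wingspan'], paired['BPM'])
plt.xlabel("Wingspan (inches)")
plt.ylabel("BPM")
plt.title("Wingspan vs College BPM")
plt.show()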


Explanation

## 4.2 NCAA vs. Draft EDA

We compare draft pick with college performance by merging draft_df and ncaa_df, and identify which college stats translate best into draft position.

In [None]:
plt.figure(figsize=(6,4))
plt.hist(combine_df['max_vertical_leap'].dropna(), bins=20)
plt.xlabel("Max Vertical Leap (inches)")
plt.ylabel("Count")
plt.title("Distribution of Max Vertical Leap")
plt.show()


Explanation

## 4.3 College vs. Draft vs. NBA

We average stats by draft pick, then compare the average college stats at each pick with the average NBA stats at the same pick.
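A grouped average along these lines could look like the sketch below. The data are made up, and the column names `college_BPM` and `nba_BPM` are hypothetical placeholders for whichever college and NBA metrics the merged dataframe ends up holding.

```python
import pandas as pd

# Made-up merged rows: one row per drafted player
toy = pd.DataFrame({
    "Pk": [1, 1, 2, 2, 3],
    "college_BPM": [12.0, 10.0, 8.0, 9.0, 5.0],
    "nba_BPM": [4.0, 2.0, 1.0, -1.0, 0.5],
})

# Average each metric within a pick number
by_pick = toy.groupby("Pk")[["college_BPM", "nba_BPM"]].mean()

# Track the sample size so that thinly populated picks are easy to spot
by_pick["n_players"] = toy.groupby("Pk").size()

print(by_pick)
```

Keeping the player count next to the averages matters here because each pick number has only one player per draft year.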

In [None]:
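# ncaa_df and combine_df have different lengths, so align them on player and year first
# (assumes the combine year matches the player's final NCAA season)
name_col = 'player_name' if 'player_name' in combine_df.columns else 'Player'
keyed = combine_df.assign(Player=combine_df[name_col].astype(str).str.strip().str.lower())
paired = ncaa_df[['Player', 'Year', 'eFG']].merge(keyed[['Player', 'Year', 'three_quarter_sprint']], on=['Player', 'Year'])

plt.figure(figsize=(6,4))
plt.scatter(paired['eFG'], paired['three_quarter_sprint'])
plt.xlabel("NCAA eFG%")
plt.ylabel("Three-Quarter Sprint (sec)")
plt.title("eFG% vs Sprint Time")
plt.show()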


Explanation

# Part 5: Feature Engineering & Preprocessing (All data sets in one)

# Part 6: Modeling

## 6.1 Logistic Regression Modeling

We want to predict whether an NBA prospect becomes an above-average NBA player. Using features from college statistics and draft information, we will train a logistic regression model because it provides a simple and interpretable baseline before we move on to more advanced models.
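As a rough sketch of what this baseline could look like, the example below fits a scaled logistic regression on synthetic data. The three features are random stand-ins (not real college or draft statistics), and the label is generated from them purely so the example runs end to end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for three prospect features
X = rng.normal(size=(n, 3))

# Synthetic label: 1 = "above-average NBA player", driven by features 0 and 2 plus noise
signal = 1.5 * X[:, 0] - 1.0 * X[:, 2]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling first puts the coefficients on a comparable footing across features
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
coefs = model.named_steps["logisticregression"].coef_[0]

print(f"Test accuracy: {accuracy:.3f}")
print("Coefficients:", coefs)
```

The sign and size of each coefficient indicate the direction and relative strength of a feature's association with the positive class, which is the interpretability we are after in a baseline.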

## 6.2 Random Forest / XGBoost Pearson Correlation

## 6.3 Heatmap + Feature Correlation Matrix

## 6.4 Predicting the 2026 NBA Draft Class

# Part 7: Conclusion

First, we merge Mo's career averages with Ethan's career averages to obtain the advanced metrics.

We then merge that result with Bennett's career averages.

Second, we match those players against Mo's draft dataframe, average each player's advanced stats, and compare the players with above-average stats to their draft number.

Then, we compare the same players' college numbers to the draft.

Next, we go from the draft to the NBA: we look at which types of players receive awards (top 3) and at which advanced metrics those players have in common.

We also merge the combine data with the draft data, and we do so before we compare draft to NBA and draft to college.

The draft dataset includes the draft ranking, so we can average stats, including advanced metrics, by pick number. We do the same for undrafted players, and we compare these averages to the college stats and NBA stats.

Finally, we look for stats that stand out among players who go from college to the draft to the NBA, focusing on combinations of metrics (more than one stat) that line up together.