<a href="https://colab.research.google.com/github/BennettHilck12/DraftEdge/blob/main/DraftEdge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CIS 5450 Final Project: DraftEdge

##### Bennett Hilck, Ethan Xia, Mohammed Soufan


# Part 1: Introduction



For our final project, we sought to better understand how NBA teams can draft players more effectively by predicting which NCAA D1 prospects are most likely to outperform expectations at the professional level. To pursue this goal, we compiled and analyzed several datasets containing college performance statistics, NBA Combine measurements and performance statistics, and eventual NBA outcomes.


By aggregating and merging these data sources, we aim to identify advanced or undervalued metrics that may be overlooked in traditional scouting evaluations. This approach allows us to quantitatively assess why certain highly drafted players become "busts," while others selected later dramatically exceed their draft position.


Given the high variance and high cost associated with drafting, our project highlights non-obvious predictors that could help inform more data-driven decision-making and potentially reshape how teams and scouts approach future NBA drafts.


We hope that you find our project impactful!

# Part 2: Data Loading & Preprocessing


First, we import the libraries used in this project. They are all imported in a single cell so that later sections can rely on them being available.

In [15]:
# Imports + Installs
!pip install category_encoders
import pandas as pd
import seaborn as sns
import dask.dataframe as dd
import matplotlib.pyplot as plt
import folium
import numpy as np
import category_encoders as ce
import dask.array as da
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import tqdm
import copy
from xgboost import XGBRegressor
from folium.plugins import HeatMap
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import spearmanr
from sklearn import datasets, linear_model
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import PolynomialFeatures
from torch.utils import data as data_utils



### 2.1 Loading + Preprocessing NBA Combine Data

Because our datasets are quite large, often containing thousands of player seasons, detailed college statistics, and multiple years of NBA Combine measurements, we decided to upload them to Google Drive rather than store them locally. This lets us access and manage the data from within the notebook environment without running into storage limitations or upload constraints. Hosting the data on Google Drive also makes collaboration easier, as all team members work from the same centralized, version-consistent files.

In [16]:
# Data Loading via Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


First, let's load in our data from the NBA combine.

In [17]:
file_path = '/content/drive/MyDrive/final_project_datasets/draft_combine_stats.csv'
combine_df = pd.read_csv(file_path)
display(combine_df.head())

Unnamed: 0,season,player_id,first_name,last_name,player_name,position,height_wo_shoes,height_wo_shoes_ft_in,height_w_shoes,height_w_shoes_ft_in,...,spot_nba_break_right,spot_nba_corner_right,off_drib_fifteen_break_left,off_drib_fifteen_top_key,off_drib_fifteen_break_right,off_drib_college_break_left,off_drib_college_top_key,off_drib_college_break_right,on_move_fifteen,on_move_college
0,2001,12033,Adam,Allenspach,Adam Allenspach,C,83.5,6' 11.5'',,,...,,,,,,,,,,
1,2001,2240,Gilbert,Arenas,Gilbert Arenas,SG,74.25,6' 2.25'',,,...,,,,,,,,,,
2,2001,2220,Brandon,Armstrong,Brandon Armstrong,SG,75.5,6' 3.5'',,,...,,,,,,,,,,
3,2001,2203,Shane,Battier,Shane Battier,SF-PF,80.25,6' 8.25'',,,...,,,,,,,,,,
4,2001,12034,Cookie,Belcher,Cookie Belcher,SG-PG,75.0,6' 3'',,,...,,,,,,,,,,


In [18]:
combine_df.describe()

Unnamed: 0,season,player_id,height_wo_shoes,height_w_shoes,weight,wingspan,standing_reach,body_fat_pct,hand_length,hand_width,standing_vertical_leap,max_vertical_leap,lane_agility_time,modified_lane_agility_time,three_quarter_sprint,bench_press
count,1202.0,1202.0,1153.0,1008.0,1152.0,1153.0,1152.0,1003.0,719.0,719.0,1017.0,1017.0,1008.0,411.0,1012.0,808.0
mean,2012.536606,2350133.0,77.570902,78.796577,214.902604,82.478187,103.593663,7.352313,8.720793,9.448887,29.240167,34.637168,11.384444,3.079221,3.282213,10.155941
std,6.56162,56602100.0,3.334605,3.322625,25.718878,3.986623,4.880218,2.748712,0.481028,0.717066,3.054645,3.648106,0.584896,0.229143,0.13185,5.389231
min,2001.0,-1.0,67.75,69.0,154.4,70.0,89.5,2.6,7.5,7.0,20.5,25.0,9.65,2.22,2.91,0.0
25%,2007.0,101145.5,75.25,76.5,195.95,80.0,100.0,5.4,8.5,9.0,27.0,32.0,10.97,2.97,3.19,6.0
50%,2012.0,203147.0,77.75,79.0,212.9,82.75,104.0,6.7,8.75,9.5,29.0,34.5,11.32,3.1,3.27,10.0
75%,2018.0,1629014.0,80.0,81.25,233.0,85.25,107.0,8.6,9.0,10.0,31.5,37.0,11.72,3.23,3.36,14.0
max,2023.0,1962937000.0,89.25,91.0,314.0,98.25,122.5,21.0,10.5,12.0,39.5,45.5,13.44,3.76,3.81,26.0
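
The `count` row above shows that coverage varies considerably across columns: `modified_lane_agility_time` has only 411 non-null values out of 1,202 rows, and `hand_length` has 719. A per-column missingness report makes this easier to see at a glance. The sketch below is one way to build such a report; the helper name `missing_report` and the toy dataframe are illustrative assumptions, not part of our pipeline.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return report.sort_values("pct_missing", ascending=False)


# Toy stand-in for combine_df; the values are made up for illustration.
toy_combine = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "wingspan": [82.0, 80.5, np.nan, 85.0],
    "modified_lane_agility_time": [np.nan, np.nan, np.nan, 3.1],
})

report = missing_report(toy_combine)
print(report)
```

In the notebook, the same helper could be called as `missing_report(combine_df)` to decide which measurements are complete enough to keep.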


### 2.2 Loading and Preprocessing NCAA Data

Next, let's take a look at our NCAA D1 player stats, which we scraped from barttorvik.com, a site that tracks NCAA D1 player statistics by season.

In [19]:
file_path = '/content/drive/MyDrive/final_project_datasets/battorvikPlayerData.xlsx'
ncaa_df = pd.read_excel(file_path)
display(ncaa_df.head())

Unnamed: 0,Rk,Player,Class,Team,Conf,Min%,PRPG!,BPM,ORtg,Usg,...,DR,Ast,TO,Blk,Stl,FTR,2P,3P/100,3P,Year
0,1,Bennett Stirtz,Jr,Drake,MVC,98.8,6.4,10.0,126.4,26.1,...,14.0,34.0,13.1,1.0,3.3,38.6,0.545,7.7,0.396,2025
1,2,Bruce Thornton,Jr,Ohio St.,B10,88.4,6.3,8.7,130.0,22.0,...,10.0,25.1,10.8,0.4,1.8,41.8,0.547,7.0,0.424,2025
2,3,Ryan Kalkbrenner,Sr,Creighton,BE,83.1,6.1,11.1,129.2,22.3,...,18.4,10.0,11.2,7.3,0.9,38.6,0.706,3.0,0.344,2025
3,4,Eric Dixon,Sr,Villanova,BE,84.3,6.1,7.3,116.7,32.9,...,12.6,13.1,11.8,1.1,1.5,35.8,0.483,13.1,0.407,2025
4,5,Cooper Flagg,Fr,Duke,ACC,72.8,6.0,14.9,123.0,30.8,...,21.2,26.8,13.5,4.9,2.8,42.9,0.517,7.2,0.385,2025


In [20]:
ncaa_df.describe()

Unnamed: 0,Rk,Min%,PRPG!,BPM,ORtg,Usg,eFG,TS,OR,DR,Ast,TO,Blk,Stl,FTR,2P,3P/100,3P,Year
count,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0,39154.0
mean,1088.863973,63.886001,1.601908,0.508939,103.134344,20.155992,50.251591,53.593196,5.069479,13.179956,13.647076,18.396751,1.797867,1.892037,35.759151,0.484399,5.994695,0.294807,2016.615569
std,629.257082,13.680132,1.343539,3.643385,10.595077,4.613991,6.271648,5.656281,3.587501,4.846323,7.603259,5.055464,2.141272,0.801671,15.741384,0.077856,4.120459,0.13744,5.194556
min,1.0,40.0,-2.7,-13.3,55.4,6.1,18.3,20.3,0.0,2.1,0.0,2.2,0.0,0.0,0.0,0.0,0.0,0.0,2008.0
25%,544.0,52.4,0.7,-2.0,96.3,16.8,46.1,49.9,2.1,9.5,7.9,14.9,0.4,1.3,24.5,0.433,2.5,0.262,2012.0
50%,1088.0,63.9,1.5,0.4,103.3,19.9,50.0,53.6,4.0,12.5,12.0,17.9,1.0,1.8,33.6,0.483,6.0,0.33,2017.0
75%,1632.0,74.975,2.4,2.9,110.2,23.2,54.3,57.3,7.5,16.3,18.0,21.3,2.4,2.3,44.9,0.535,9.1,0.374,2021.0
max,2308.0,98.8,7.9,18.7,161.4,38.5,88.3,80.4,23.7,39.2,52.6,54.1,18.8,7.6,147.9,1.0,23.1,1.0,2025.0


# Part 3: Wrangling

### 3.1 Data Filtering

The summary statistics above show that ncaa_df spans 2008 to 2025, while combine_df spans 2001 to 2023. The two datasets therefore overlap only from 2008 through 2023, so we need to restrict both dataframes to those years before further processing.

First, let us only include the years where the combine and NCAA overlap.

In [27]:
# Rename the combine's season column so both dataframes share a "Year" column
combine_df = combine_df.rename(columns={"season": "Year"})

# Keep only combine rows that fall within the year range covered by ncaa_df
combine_df = combine_df[
    (combine_df['Year'] >= ncaa_df['Year'].min()) &
    (combine_df['Year'] <= ncaa_df['Year'].max())
]

# Keep only NCAA rows that fall within the (already filtered) combine year range
ncaa_df = ncaa_df[
    (ncaa_df['Year'] >= combine_df['Year'].min()) &
    (ncaa_df['Year'] <= combine_df['Year'].max())
]
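
The filtering above can also be written by computing the shared year window once and applying it to both dataframes, which avoids depending on the order of the two filters. The following is a minimal sketch of that idea on toy data; the helper name `restrict_to_overlap` is an assumption introduced here for illustration.

```python
import pandas as pd


def restrict_to_overlap(left: pd.DataFrame, right: pd.DataFrame, col: str = "Year"):
    """Filter both frames to the inclusive range of `col` values they share."""
    lo = int(max(left[col].min(), right[col].min()))
    hi = int(min(left[col].max(), right[col].max()))
    left_f = left[left[col].between(lo, hi)]
    right_f = right[right[col].between(lo, hi)]
    return left_f, right_f, (lo, hi)


# Toy stand-ins whose year ranges mirror the ones reported by describe()
toy_combine = pd.DataFrame({"Year": [2001, 2008, 2015, 2023]})
toy_ncaa = pd.DataFrame({"Year": [2008, 2015, 2023, 2025]})

combine_f, ncaa_f, window = restrict_to_overlap(toy_combine, toy_ncaa)
print(window)
```

Returning the window alongside the filtered frames makes it easy to log or assert the year range that the rest of the analysis relies on.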

### 3.2 Dropping Null Rows

In [28]:
ncaa_df = ncaa_df.dropna()
combine_df = combine_df.dropna()
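
Calling `dropna()` with no arguments removes every row that has a missing value in any column. Since the combine table's shooting-drill columns are empty in the 2001 rows shown earlier, and `modified_lane_agility_time` is populated for only 411 of 1,202 rows, this may discard most combine records. A more targeted option is to drop very sparse columns first, or to require non-null values only in the columns a model actually uses. The sketch below illustrates both on toy data; the 50% threshold and the choice of `key_cols` are hypothetical, not values we have tuned.

```python
import numpy as np
import pandas as pd

# Toy stand-in for combine_df; the values are made up for illustration.
toy_combine = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "height_wo_shoes": [83.5, 74.25, np.nan],
    "wingspan": [86.0, 81.5, 80.0],
    "on_move_college": [np.nan, np.nan, np.nan],
})

# Strict version: any missing value in any column removes the row.
strict = toy_combine.dropna()

# Option 1: drop columns that are more than 50% missing, then drop rows.
dense_cols = toy_combine.loc[:, toy_combine.isna().mean() <= 0.5]
dense = dense_cols.dropna()

# Option 2: require values only in the columns we plan to use (hypothetical list).
key_cols = ["height_wo_shoes", "wingspan"]
targeted = toy_combine.dropna(subset=key_cols)

print(len(strict), len(dense), len(targeted))
```

Comparing the row counts before and after each variant is a quick way to check how much data a given missing-value policy costs.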

# Part 4: Exploratory Data Analysis

# Part 5: Feature Engineering & Preprocessing

# Part 6: Modeling