# The Expected Value of an NBA Draft

## 1.0 Introduction

The game of basketball is forever changing. Trends come and go, teams rise and fall, players' legacies are constantly being written and re-written.

Perhaps more than in any other sport, a single player can be the difference maker for a franchise. While teams can acquire players through a few different methods (drafts, signings, and trades), some of the most compelling stories that the NBA has to offer follow a "Chosen One" narrative: a young prospect turned superstar who brings the team that drafted them into championship contention (e.g., Magic Johnson, Tim Duncan, LeBron James).

For every success story, however, there are a few "busts" along the way. How often do we see a team place their hopes in some 19- or 20-year-old touted as "The Next Michael Jordan" only for this same team to remain in the lottery for years to come? On the other end, there seem to be some "hidden gems" taken later in the draft who surprise people with their production.

Human behaviour is difficult to predict, and countless factors affect player performance. Teams might not always scout correctly or coach properly, and players might not perform or develop as expected. Win Shares, VORP, and Plus/Minus are far from perfect measures of player value, and there are so many unfounded expectations and overwhelming uncertainties. Such is the nature of the NBA Draft.

**In this article, I aim to quantify the value of different draft picks and explore how draft position might be related to player performance**. We likely will not find a terribly strong, generalizable model in our analysis, but this exploration *might* still provide insight into the nature of NBA draft picks.

## 2.0 Methodology

**Sample**: We have pulled 17 draft classes (1989 to 2005, inclusive) as our sample (n = 962 players).

We begin from 1989 because the modern draft format (two rounds of draft picks) was introduced in 1989, and the NBA has kept the two-round format since then. We end with 2005 because there are players drafted in 2006 and beyond that might still produce "prime" seasons.

**Player Value**: To measure the “value” of each draftee, I am using a statistic referenced on [BoxScoreGeeks.com][1]: Brocato Prime Wins. We are interested in a draftee’s **five** best regular seasons organized by **total Win Shares** (WS). 

We will refer to this measure as "Prime Win Shares", "Prime Wins", or "Prime WS" from this point forward.
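As a rough illustration of the measure, here is a minimal sketch that assumes Prime WS is the *sum* of a player's five highest-WS seasons. The column names (`Player`, `WS`) and the toy data are hypothetical stand-ins for the season-level table, not the actual dataset.

```python
import pandas as pd

def prime_win_shares(seasons: pd.DataFrame, n_best: int = 5) -> pd.Series:
    """Sum each player's n_best highest-WS seasons (fewer seasons: sum what exists)."""
    top = (
        seasons.sort_values("WS", ascending=False)  # best seasons first
        .groupby("Player")
        .head(n_best)                               # keep at most n_best rows per player
    )
    return top.groupby("Player")["WS"].sum()

# Hypothetical season-level data: one row per player-season.
seasons = pd.DataFrame({
    "Player": ["A"] * 6 + ["B"] * 2,
    "WS": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 1.5, -0.2],
})
drafted = ["A", "B", "C"]  # "C" was drafted but never played

# Draftees with no seasons at all are filled in with 0 Win Shares.
prime = prime_win_shares(seasons).reindex(drafted, fill_value=0.0)
print(prime)
```

Summing only the seasons that exist is equivalent to padding the missing seasons with 0 Win Shares.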

**Data Sources**: I pulled Draft Position data from [Basketball-Reference.com][2] and Regular Season data from [Kaggle.com][3].

**Data Analysis**: The analysis was done using Python. Data was organized using Pandas, numerical operations were implemented using NumPy and scikit-learn, statistical tests were performed using SciPy & Statsmodels, and plots were drawn using Plotly.

**Notes**:
1. What to do with draftees that did not play in the NBA or play 5 seasons?
    * We will include any/all the seasons we can, but seasons of absence we will fill with no production: 0 Win Shares.
    * Likewise, "non-players" will be treated as though they produced 0 Win Shares over their entire career.
    * We want to include these absences because at the end of the day, we wish to determine the value of a draft pick. Even if the draft pick was used on a player that never played in the NBA, this should still be accounted for in our dataset.

2. Why use total Win Shares instead of time/pace adjusted measures?
    * I decided to use total Win Shares over Win Shares per 48 Minutes (WS/48), because I wanted to value overall impact over efficiency.
    * Players with high WS/48 might be highly efficient, but their impact is limited if they are rarely on the court.
    * A player with relatively lower WS/48 that plays many more minutes can still provide great utility for a team/franchise.

3. Why are we using the five best seasons?
    * According to [The Guardian][4], the average NBA career length is around 4.9 years. Five seasons of data should cover the average NBA career.
    * Using the five *best* seasons will give us an idea of the player's peak. I wish to ignore injuries and would like to mostly look at player peaks.

**Confounding Effects?**

1. Draft Position and Team Investment
    * Differences in draft position may be coupled with different degrees of team investment.
    * Franchises could be more inclined to build their team around a player on whom they have used a higher draft pick. This could positively impact the draftee's on-court performance.
    * By design, higher draft picks are supposed to go to worse teams. Less talented prospects might not have to play as well to be awarded with more minutes, and on average, more minutes produces more Win Shares. 


2. Draft Position and Personal Investment
    * Differences in draft position come with differences in player salary.
    * Players with more money can afford to invest in themselves to a greater degree.
    * While teams/coaches should be investing in their own players, and the NBA pays its players very well, there is still an inherent difference in pay between different picks of the same draft class.
    * Ex: In 1989, the first overall pick, Pervis Ellison, received \\$2.4 million from the Sacramento Kings. The first pick in the 2nd round (28th overall), Sherman Douglas, was only paid \\$325k.
    

[1]:https://www.boxscoregeeks.com/articles/twenty-players-better-than-kobe "Some Kobe H8er Article"
[2]:https://www.basketball-reference.com/draft/NBA_1989.html "Yes, I did have to export 16 different pages"
[3]:https://www.kaggle.com/drgilermo/nba-players-stats?select=Seasons_Stats.csv "I'm pretty sure this is also just pulled from bbref"
[4]:https://www.theguardian.com/sport/2015/nov/30/the-kobe-bryant-outlier-how-his-career-compares-to-the-nba-average#:~:text=Among%20the%203%2C668%20players%20Wilczynski,career%20length%20was%204.9%20years "The Kobe Bryant outlier"

In [22]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'iframe'
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression  # needed by Residuals() below


#Writing a function that returns a scatterplot with a trendline from a linear least squares regression
def ScatterChart(file, indep_var, stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    fig = px.scatter(df, x= indep_var, 
                  y= stat, trendline="ols",
                  title=('Draft Position v. Prime ' + str(stat))) #Create scatterplot w/ a linear trendline from the dataframe
    fig.show() #Display Scatterplot

#Writing a function that will provide our dataframe with a Residuals column
def Residuals(dataframe, stat):
    X = dataframe.iloc[:, dataframe.columns.get_loc('Draft Position')].values.reshape(-1, 1) # Get X values into array format
    Y = dataframe.iloc[:, dataframe.columns.get_loc(stat)].values.reshape(-1, 1) # Get Y values into array format
    regressor = LinearRegression() 
    regressor.fit(X, Y)  # Perform linear regression
  
    Y_pred = regressor.predict(X) # Get our trendline so that we can calculate residuals (Observed - Modeled data)
    dataframe['Residuals'] = (Y - Y_pred).ravel() # Store the residuals as a new 1-D column
    

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


#Writing a function that returns the equation from a linear least squares regression as well as the R-Squared 
def LinearEquation(file, x, stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    X = df.iloc[:, df.columns.get_loc(x)].values.reshape(-1, 1) # Get X values into array format
    Y = df.iloc[:, df.columns.get_loc(stat)].values.reshape(-1, 1) # Get Y values into array format
    regressor = LinearRegression() 
    regressor.fit(X, Y)
    Y_pred = regressor.predict(X)
    
    y = ('Prime ' + str(stat))
    m = regressor.coef_[0][0]
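    x = 'Draft Position'  # NOTE: hard-coded label; the printed equation says 'Draft Position' even when another column is passed in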
    b = regressor.intercept_[0]
    r2 = r2_score(Y,Y_pred)
    print('{0} = ({1} * {2}) + {3}'.format(y,m,x,b))
    
    print('R^2: {0}'.format(r2))

## 3.0 Analysis

### 3.1 Draft Position vs. Prime Wins: Ordinary Least Squares

In [24]:
ScatterChart("Prime_WS.xlsx",'Draft Position','WS')

In [25]:
LinearEquation("Prime_WS.xlsx",'Draft Position','WS')

Prime WS = (-0.5221335726657379 * Draft Position) + 27.300597336678003
R^2: 0.2855679686535073


### 3.1.1 Linear Analysis: Initial Thoughts

The above shows all of the players plotted by their Draft Position and Prime Win Shares. An ordinary least squares regression has also been applied to our data, and its trendline is displayed on the scatter plot.

We *do* have a trendline with a negative slope, signaling that teams generally select players that produce more Prime Wins earlier in the draft. The $R^2$ value of about 0.286, however, is weak: draft position alone explains less than 30% of the variance in Prime Wins, so our current linear model describes the dataset poorly.
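For readers who want to see where that number comes from, here is a small sketch of the $R^2$ calculation, $R^2 = 1 - SS_{res}/SS_{tot}$. The data is synthetic (a noisy line loosely shaped like the fit above), so the value it prints is only illustrative and is not the article's result.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a straight-line least squares fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # ordinary least squares line
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)             # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data: 10 "classes" of 60 picks with heavy noise.
rng = np.random.default_rng(0)
picks = np.tile(np.arange(1, 61), 10)
wins = 27.0 - 0.5 * picks + rng.normal(0.0, 12.0, size=picks.size)

r2 = r_squared(picks, wins)
print(round(r2, 3))
```

An $R^2$ of 1 would mean every point sits on the trendline, while a value near 0 means the line explains almost none of the spread.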

For this next section, we will check the assumptions of a linear model.

0. Multicollinearity: Since we only have one independent variable, by design we will not have any issues with multicollinearity.
1. Linearity: We will plot Predicted Prime Wins against Observed Prime Wins. If the relationship is linear, we should expect to see the points evenly distributed about the diagonal. If the plot does not reflect this, we will reject this assumption.
2. No Autocorrelation: We will perform a Durbin-Watson test. If our data returns a test statistic outside of the range 1.5 to 2.5, this would indicate that our residuals are autocorrelated, and we would reject this assumption.
3. Multivariate Normality: We will perform a Kolmogorov-Smirnov test and plot our residuals on a histogram. We should expect to see our residuals normally distributed about 0. If the p-value returned from our test is below 0.05, we will reject this assumption.
4. Homoscedasticity: We will perform a Goldfeld-Quandt test and plot our residuals against Draft Position. If the p-value returned from our test is below 0.05, we will reject this assumption.

#### Assumption 1: Linearity

In [26]:
#Writing a function that checks for linearity
def Linearity(file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    X = df.iloc[:, df.columns.get_loc('Draft Position')].values.reshape(-1, 1) # Get X values into array format
    Y = df.iloc[:, df.columns.get_loc(stat)].values.reshape(-1, 1) # Get Y values into array format
    regressor = LinearRegression() 
    regressor.fit(X, Y)  # Perform linear regression
    
    df['predicted'] = regressor.predict(X)
    
    linFig = px.scatter(df, x= 'predicted',
                        y= stat,
                        title=('Linearity: Predicted v. actual'),
                        width=900,
                        height=1200) #Create scatterplot 
    
    linFig.add_trace(go.Scatter(x= [-5,28],
                                y= [-5,28],
                                mode='lines',
                                showlegend=False)) #Add the y = x reference diagonal
    
    linFig.update_layout(
        xaxis_title="Predicted",
        yaxis_title="Actual")
    
    linFig.show()

In [27]:
Linearity("Prime_WS.xlsx",'WS')

The above graph plots predicted Prime Wins against observed Prime Wins from our dataset. For data with a linear relationship, we would expect the data points in the above graph to be symmetrically distributed about the diagonal.

From this scatter plot, we see that some observed values reach far above our predicted trendline, while other values are densely packed around the 0 across the board.

For this reason, we conclude that the relationship is not linear, which violates our first assumption.

#### Assumption 2: No Autocorrelation

We are looking to test for autocorrelation within our model. We will be using the Durbin-Watson Test.

Any result within the range [1.5, 2.5] would suggest that our data has no issues with autocorrelation.

i.e. For us to properly apply a linear model to our NBA Draft dataset, we should hope that the test statistic is between 1.5 and 2.5.
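To make the statistic less of a black box, here is a sketch of the Durbin-Watson formula, $DW = \sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2$, written with NumPy. The two residual series are synthetic and only show how the value behaves; `durbin_watson_stat` is a hypothetical helper name, not part of Statsmodels.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(1)
independent = rng.normal(size=2000)          # no relationship between neighbours
trending = np.cumsum(rng.normal(size=2000))  # random walk: neighbours are very similar

dw_independent = durbin_watson_stat(independent)
dw_trending = durbin_watson_stat(trending)
print(dw_independent, dw_trending)
```

Values near 2 indicate no autocorrelation, values near 0 indicate positive autocorrelation, and values near 4 indicate negative autocorrelation. Note that the statistic depends on the order of the rows, so for draft data it reflects whatever ordering the spreadsheet happens to use.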

In [28]:
from statsmodels.stats.stattools import durbin_watson

# Writing a function that tests for Autocorrelation
def Autocorrelation(file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    Residuals(df, stat)
    
    durbinWatson = durbin_watson(df['Residuals'])
    print('Durbin-Watson test result:', durbinWatson)

In [29]:
Autocorrelation("Prime_WS.xlsx",'WS')

Durbin-Watson test result: 2.0399169648169355


Our Durbin-Watson test returned a statistic of about 2.04, which falls inside the [1.5, 2.5] range. This gives no indication of autocorrelation, so this assumption appears to hold for our linear model.

#### Assumption 3: Multivariate normality

In [30]:
from scipy.stats import kstest

#Writing a function that tests Multivariate Normality
def MultiNormality(file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    Residuals(df, stat)
    
    resFig = px.histogram(df, x="Residuals")
    resFig.update_layout(
        title="Residual Distribution",
        xaxis_title=("Residuals"),
        yaxis_title="Count") #Reformat Residual Graph
    resFig.show() #Display residuals in histogram format
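    # Note: 'norm' compares against the standard normal N(0, 1); the residuals are not standardized here
    print(kstest(df['Residuals'], 'norm'))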

In [31]:
MultiNormality("Prime_WS.xlsx",'WS')

KstestResult(statistic=0.4801190236741144, pvalue=1.8541071658634794e-204)


Our Kolmogorov-Smirnov test returns an embarrassingly small p-value. We therefore reject this assumption: our residuals are not normally distributed.

Just looking at the residual plot, the data appears strongly right-skewed. 

One reason for this skew could be related to the competitive nature of playing time within the league. Worse performing players will receive less playing time. With fewer minutes, worse players will have fewer opportunities to severely underperform. Conversely, great players will receive *more* playing time, giving them more opportunities to produce more Win Shares.

I.e. given the way minutes are distributed in the league, we can expect to have more extreme "overachievers" and relatively fewer outstanding "underachievers".
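One caveat on the Kolmogorov-Smirnov call above: `kstest(..., 'norm')` compares against a standard normal (mean 0, standard deviation 1), so residuals measured in Win Shares can fail on scale alone. The sketch below uses synthetic, perfectly normal residuals with a standard deviation of 10 (an arbitrary assumption) to show the effect of standardizing first. It says nothing about whether the article's residuals would pass; the right skew in the histogram suggests they would not.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
residuals = rng.normal(loc=0.0, scale=10.0, size=962)  # normal, but not unit variance

# Raw residuals vs. N(0, 1): rejected purely because of the scale mismatch.
raw = kstest(residuals, "norm")

# Standardize to mean 0 and standard deviation 1 before comparing to N(0, 1).
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
standardized = kstest(z, "norm")

print(raw.pvalue, standardized.pvalue)
```

Because the mean and standard deviation are estimated from the same sample, the plain KS p-value becomes conservative; a Lilliefors-style correction is the usual remedy when that matters.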

#### Assumption 4: Homoscedasticity

Finally, we test whether our model is homoscedastic using the Goldfeld-Quandt test. The test's null hypothesis is that the residual variance is constant (homoscedastic), and it returns a p-value for that hypothesis.

We will reject our assumption that our model is homoscedastic if the test returns a p-value below .05.
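The idea behind the test can be sketched in a few lines: sort the observations by the independent variable, split them in half, fit a line to each half, and compare the residual variances. This is a simplified NumPy illustration on synthetic data whose noise shrinks with draft position (an assumption made for the demo), not a replacement for the Statsmodels routine; `goldfeld_quandt_f` is a hypothetical helper.

```python
import numpy as np

def goldfeld_quandt_f(x, y):
    """Ratio of residual variances: (upper half of x) / (lower half of x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")  # the test only makes sense on sorted data
    x, y = x[order], y[order]
    half = len(x) // 2

    def residual_variance(xs, ys):
        design = np.column_stack([np.ones_like(xs), xs])  # intercept + slope
        coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
        resid = ys - design @ coef
        return float(resid @ resid) / (len(xs) - 2)

    return residual_variance(x[half:], y[half:]) / residual_variance(x[:half], y[:half])

# Synthetic data: the spread of outcomes narrows as the pick number grows.
rng = np.random.default_rng(3)
picks = np.tile(np.arange(1, 61), 10)
noise_sd = 12.0 - 0.18 * picks
wins = 27.0 - 0.5 * picks + rng.normal(0.0, noise_sd)

f_stat = goldfeld_quandt_f(picks, wins)
print(round(f_stat, 3))
```

A ratio far from 1 points to unequal variance. If I am reading the Statsmodels documentation correctly, `het_goldfeldquandt` defaults to `alternative='increasing'` and does not sort unless `idx` is given, so variance that *shrinks* with draft position could go undetected with the default settings.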

In [32]:
from statsmodels.stats.diagnostic import het_goldfeldquandt

def Homosced(file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    X = df.iloc[:, df.columns.get_loc('Draft Position')].values.reshape(-1, 1) # Get X values into array format
    Residuals(df, stat)
    
    fig = px.scatter(df, x="Draft Position", y= 'Residuals')
    
    goldfeldResult = het_goldfeldquandt(df['Residuals'], X)
    print('Goldfeld-Quandt test \nF Statistic: {0} \np-value: {1}'.format(goldfeldResult[0], goldfeldResult[1]))
    
    fig.show() #Display Scatterplot

In [33]:
Homosced("Prime_WS.xlsx",'WS')

Goldfeld-Quandt test 
F Statistic: 0.9317480771183729 
p-value: 0.7805100988309035


The Goldfeld-Quandt test returns a p-value of about 0.78, well above 0.05, so we fail to reject the assumption of homoscedasticity, at least by this test.

Of the assumptions we tested, linearity and normality of the residuals clearly failed, while the Durbin-Watson statistic (2.04) and the Goldfeld-Quandt p-value (0.78) did not flag a problem. Even so, a linear model describes this dataset poorly, and we should look for a different way to model our data.

### 3.2 Draft Position vs. Prime Wins: Log Transformation

From a narrative perspective, an inverse logarithmic relationship between Draft Position and Prime Wins would make sense, as the league tends to be somewhat "top heavy" across most statistics. [Basketball-Reference.com][2] even states, when discussing BPM, that "there are far more below-average players than above-average players in the league at any time".

If we ignore undrafted players and assume draft pools mimic the player pool of the entire NBA, we can expect each draft to be composed of a few strong performers, while most of the draft is made up of less impactful players. Hence, \***there can be a large difference between selecting 1st and selecting 5th**.

On the other side, the difference between **the 55th and 60th picks may be relatively small** for somewhat related reasons. Towards the end of the draft, where many prospects might appear to be future "below average" players, many draftees become "non-players" in our dataset. Whether these picks were caught in a contract with another league, or they just weren't impressing NBA teams, many later picks end up with 0 Prime Win Shares.

This might "dampen" the value of later draft picks and might keep the average measure of Prime Wins closer to 0 for the later draft positions.

[2]:https://www.basketball-reference.com/about/bpm2.html

\* Note: Aiming to find one value for a specific draft position is also pretty unreasonable since every draft class is different. High draft picks will hold inconsistent value from year to year depending on the strength of the prospects. The 1st overall pick in particular has varied wildly over the years. Even at the time of their respective drafts, LeBron James was a very different prospect from someone like Anthony Bennett, Andrea Bargnani, or even LeBron's own teammate Anthony Davis.

### 3.2.1 Draft Position vs. log(Prime Wins) Results:

In [39]:
#Writing a function to apply log transformation on given files

def LogTransform (file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    df['log(' + stat + ')'] = np.log(df[stat]+(abs(df[stat].min())+1)) #Log Transform the data
    df.to_excel('Prime_logWS.xlsx', index=False) #Writing our transformed data in to a new Excel file

LogTransform("Prime_WS.xlsx", 'WS')

In [41]:
ScatterChart('Prime_logWS.xlsx','Draft Position','log(WS)')

In [43]:
LinearEquation('Prime_logWS.xlsx','Draft Position','log(WS)')

Prime log(WS) = (-0.037849525834157585 * Draft Position) + 3.208237225643058
R^2: 0.3160884231779767


While we return a marginally stronger $R^2$ of 0.316, the data still looks very off, and our model appears to be very unreliable.



### 3.2.2 Applying our Model

In this next section, I want to dive into specific examples in NBA history where teams have traded draft picks.

As mentioned previously, this model (as it stands currently) is not very precise or accurate, but this should be an interesting exercise.

#### Example 1: The 1995 Shawn Respert Trade

On June 28, 1995, the Detroit Pistons and the Portland Trail Blazers swapped a few selections that they had made that same day.

This is a good trade to analyze, as these picks are all from the same draft, the draft position had already been known, and none of these draftees had played in the NBA yet so everyone was still an uncertain prospect.

**Detroit Pistons**: 
* Traded Away: 1995 Pick \#8. (i.e. Shawn Respert)
* Received: 1995 Pick \#18, 1995 Pick \#19, and 1995 Pick \#58. (i.e. Theo Ratliff, Randolph Childress, and Don Reid)

**Portland Trail Blazers**:
* Traded Away: 1995 Pick \#18, 1995 Pick \#19, and 1995 Pick \#58. (i.e. Theo Ratliff, Randolph Childress, and Don Reid)
* Received: 1995 Pick \#8. (i.e. Shawn Respert)

**Expected**: If we were to use our log transformed model to predict Prime Win Shares (Section 3.2)
* Trail Blazers Received: \#8. This pick is expected to provide **15.573 Prime Wins**
* Pistons Received: \#18, \#19, and \#58. These picks are expected to provide **19.219 Prime Wins**

Our model suggests that the Blazers slightly lose this trade by around 3.646 Prime Wins (15.573 - 19.219 = -3.646). This is a fairly small margin, and the trade doesn't seem that bad/good for either side.
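The expected values above come from undoing the log transform: Prime WS = exp(slope * pick + intercept) - shift. The sketch below uses the fitted coefficients printed in Section 3.2.1. The shift of 2.7 is an assumption on my part (it corresponds to a minimum Prime WS of -1.7 in `LogTransform`), chosen because it reproduces the quoted figures; the true value depends on the data in `Prime_WS.xlsx`.

```python
import math

SLOPE = -0.037849525834157585   # fitted coefficient from Section 3.2.1
INTERCEPT = 3.208237225643058   # fitted intercept from Section 3.2.1
SHIFT = 2.7                     # ASSUMED: abs(min Prime WS) + 1, i.e. min of -1.7

def expected_prime_ws(pick: int) -> float:
    """Invert log(WS + SHIFT) = SLOPE * pick + INTERCEPT to get expected Prime WS."""
    return math.exp(SLOPE * pick + INTERCEPT) - SHIFT

blazers = expected_prime_ws(8)
pistons = sum(expected_prime_ws(p) for p in (18, 19, 58))

print(round(blazers, 3), round(pistons, 3), round(blazers - pistons, 3))
```

Under that assumed shift, the late second-round pick (\#58) contributes almost nothing, so nearly all of Detroit's expected value comes from picks \#18 and \#19.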

**Actual**:
* Trail Blazers Received: Shawn Respert: 2.1
* Pistons Received: Theo Ratliff: 28.1, Randolph Childress: -0.2, and Don Reid: 11.8

In reality, the Trail Blazers lose this trade badly. None of the actual values come very close to their modeled projections anyway, so I'm not sure the model would've helped them very much.

Who knew Sixers Legend Theo Ratliff would be such a beast?

### 3.3 What if Every Team Drafts Perfectly? ("Re-draft")

With the first overall pick, every player is available to you. The second pick, every player minus 1. The third pick... so on and so forth. In a weird way, draft busts and steals only exist because of imperfect scouting/development.

In our fantasy scenario, let's see what happens if we assume that each selection in every draft was "perfect": i.e. they took the "best" player available every time (we're assuming that the players' careers play out the same way cuz yolo).
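One compact way to express this "perfect draft" is to rank players by Prime WS within each draft class. The sketch below is an alternative take using `groupby(...).cumcount()`; the toy table and its values are made up, and only the column names `Draft Class` and `WS` mirror the spreadsheet.

```python
import pandas as pd

def redraft_positions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where each class is re-ordered by WS, best player first."""
    ranked = (
        df.sort_values(["Draft Class", "WS"], ascending=[True, False])
        .reset_index(drop=True)
    )
    # cumcount() numbers the rows 0, 1, 2, ... inside each draft class
    ranked["ReDraft Position"] = ranked.groupby("Draft Class").cumcount() + 1
    return ranked

# Hypothetical mini-dataset with two draft classes.
toy = pd.DataFrame({
    "Draft Class": [1995, 1995, 1995, 1996, 1996],
    "Player": ["a", "b", "c", "d", "e"],
    "WS": [2.0, 30.0, 12.0, 0.0, 40.0],
})
result = redraft_positions(toy)
print(result)
```

Ties in WS (for example, several non-players at 0) simply receive consecutive positions in the order they appear.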

In [52]:
def redraft(file):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    df = df.sort_values(by=['Draft Class','WS'], ascending=False, ignore_index=True) #Sort by Draft Class & Prime Wins
    counts = df['Draft Class'].value_counts()
    counts = counts.sort_index(ascending = False)
    
    redraft = []
    for draftClass in range(len(counts)):
        for pick in range(counts.iloc[draftClass]):
            redraft.append(pick+1)
    
    df['ReDraft Position'] = redraft
    df.to_excel('Redraft.xlsx', index=False)

In [53]:
redraft("Prime_WS.xlsx")

In [58]:
ScatterChart("Redraft.xlsx", 'ReDraft Position','WS')

In [59]:
LinearEquation("Redraft.xlsx", 'ReDraft Position' ,'WS')

Prime WS = (-0.839443035229705 * Draft Position) + 36.448807257895915
R^2: 0.7381229605116488


### 3.3.1 Re-Draft Results

With the benefit of hindsight, the correlation is much stronger than before. We suggested earlier, however, that the relationship appears to be more logarithmic. What if we apply the log transform to our fantasy "Re-Draft" data set as well?

### 3.3.2 Re-applying our Fantasy Model

Let's revisit our previous example now: The 1995 Shawn Respert Trade.

In [60]:
def LogTransform (file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    df['log(' + stat + ')'] = np.log(df[stat]+(abs(df[stat].min())+1)) #Log Transform the data
    df.to_excel('RePrime_logWS.xlsx', index=False) #Writing our transformed data in to a new Excel file

LogTransform("Redraft.xlsx", 'WS')

In [61]:
ScatterChart("RePrime_logWS.xlsx", 'ReDraft Position','log(WS)')

In [62]:
LinearEquation("RePrime_logWS.xlsx", 'ReDraft Position' ,'log(WS)')

Prime log(WS) = (-0.06437679953826883 * Draft Position) + 3.973033417102024
R^2: 0.9144215825701276


### 3.3.3 "Re-Draft" Position vs. log(Prime Wins) Results:

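We return a fairly strong $R^2$ of 0.9144. Keep in mind that the re-draft sorts each class by Prime WS, so a strong downward trend is guaranteed by construction.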

## Remove Outliers

Track medians and inner quartiles of each Draft Position.

I expect to see wide ranges for early draft picks. Outside of the lottery, the range from Q1 to Q3 should probably shrink tremendously.
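A minimal sketch of how those medians and quartiles could be tabulated with pandas follows. The toy numbers are invented to mimic a wide spread at pick 1 and a narrow one at pick 60; the real figures would come from `Prime_WS.xlsx`.

```python
import pandas as pd

def quartiles_by_pick(df: pd.DataFrame, stat: str = "WS") -> pd.DataFrame:
    """Q1, median, Q3 and interquartile range of `stat` for each draft position."""
    q = df.groupby("Draft Position")[stat].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["Q1", "Median", "Q3"]
    q["IQR"] = q["Q3"] - q["Q1"]
    return q

# Hypothetical data: five players at pick 1 and five at pick 60.
toy = pd.DataFrame({
    "Draft Position": [1, 1, 1, 1, 1, 60, 60, 60, 60, 60],
    "WS": [5.0, 20.0, 35.0, 50.0, 65.0, 0.0, 0.0, 0.0, 1.0, 2.0],
})
summary = quartiles_by_pick(toy)
print(summary)
```

Medians and quartiles are far less sensitive to a single superstar or bust than the mean, which is the motivation for tracking them here.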

Work in Progress...

In [50]:
def BoxPlot(file,stat):
    df = pd.read_excel(file) #Read the excel file in as a dataframe
    fig = px.box(df, x="Draft Position", y= stat)
    fig.show()

In [51]:
BoxPlot("Prime_WS.xlsx",'WS')

### Example 2: Markelle Fultz Trade

As someone who grew up in Philadelphia and was a fan during The Process, this trade is something that keeps me up at night.

Work in Progress...

## Group by Draft Position (Lottery, Second Round, etc.)

Work in Progress...

## Group by *Drafted Position* (Point Guard, Shooting Guard, Center, etc.)

Work in Progress...

## Group by Drafted Team (San Antonio Spurs, Cleveland Cavaliers, etc.)

Work in Progress...