**Predicting the Average Minutes Played of NBA Draft Picks**

Ranking draft prospects is a common pastime among NBA fans. Doing so is no easy task, however: the NBA draft is often referred to as a 'crapshoot' because of its unpredictable nature. This model is an attempt to evaluate draft prospects based on draft position, college statistics and draft combine data.

The first step in evaluating draft prospects is choosing a metric to measure their success. While plenty of publicly available advanced metrics exist, all are flawed to varying degrees. More importantly, advanced metrics compare NBA players to each other. That is useful for evaluating the best players in the league, but rather useless for comparing low-end role players, who often don't play enough minutes to be considered by these metrics at all. For example, of all second-round picks from 2003-2013, 48% played fewer than 3 years in the NBA, with 26% never playing a single minute in the league. Since second-round picks make up roughly 47% of the players in my dataset, it would be foolish to choose an evaluation metric that would invalidate half of the data I'm working with. This led me to choose minutes played to evaluate player success: at the end of the day, good NBA players play more minutes than bad ones.

**METHOD**

This project can be broken up into three main sections: Collecting Data, Cleaning Data, and Constructing Model.

A more detailed methodology is shown below:

**Collecting Data:**

    -Compile URLs and table IDs
    -Scrape Basketball Reference for NCAA statistics
    -Use nba_api to gather draft combine data, draft results and minutes played
    -Export data to csv

**Cleaning Data:**

    -Drop irrelevant columns
    -Address null values
    -Perform appropriate unit conversions
    -Merge college stats, draft combine data, draft results and minutes played to a single csv
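
The cleaning steps above can be sketched on toy data. This is a minimal illustration, not the project's actual cleaning script: the frames, the `Notes` and `PTS` columns, the `ft_in_to_inches` helper and the `STANDING_REACH_IN` column are hypothetical, and the feet-and-inches string format (for example `8' 9''`) is an assumption about how the combine data is encoded.

```python
import re
import pandas as pd

def ft_in_to_inches(value):
    """Convert a string such as "6' 5.25''" to inches; NaN if it cannot be parsed."""
    if not isinstance(value, str):
        return float("nan")
    match = re.match(r"\s*(\d+)'\s*(\d+(?:\.\d+)?)?", value)
    if match is None:
        return float("nan")
    feet = int(match.group(1))
    inches = float(match.group(2)) if match.group(2) else 0.0
    return feet * 12 + inches

# Hypothetical toy frames standing in for the scraped tables
college = pd.DataFrame({"Player": ["A", "B"], "PTS": [15.2, 9.8], "Notes": ["x", "y"]})
combine = pd.DataFrame({"Player": ["A", "B"], "STANDING_REACH_FT_IN": ["8' 9''", None]})

# Drop an irrelevant column
college = college.drop(columns=["Notes"])

# Unit conversion: feet-and-inches string to a numeric inches column
combine["STANDING_REACH_IN"] = combine["STANDING_REACH_FT_IN"].map(ft_in_to_inches)

# Merge the sources on the player name, then fill missing measurements with 0
merged = college.merge(combine, on="Player", how="left")
merged["STANDING_REACH_IN"] = merged["STANDING_REACH_IN"].fillna(0)
print(merged)
```

Filling missing measurements with 0 mirrors the `fillna(0)` call used later in this notebook; whether 0 is the best placeholder for a missing combine measurement is a separate question.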

**Constructing Model:**

    -Import modules and dataset
    -Isolate predictor variables
    -Separate training and testing data
    -Train main model
    -Train control model and combine results
    -Run model
    -Next steps


**Importing Modules**

The following modules are required to run this notebook:
    
    -Pandas
    -Ridge from sklearn.linear_model
    -mean_squared_error from sklearn.metrics
    -numpy
    -scipy and scipy.stats

In [1]:
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np
import scipy
import scipy.stats as stats


**Importing Dataset**

Imports the dataset used by the model

In [2]:
#Gets the data from the scraper
main_df = pd.read_csv('merged.csv')
#Deletes the index column written to the csv and the BPM column
del main_df['Unnamed: 0']
del main_df['BPM']
#Replaces missing values with 0
main_df = main_df.fillna(0)

**Isolating Predictor Variables**

Drops all columns which are not predictor variables from the dataset

In [3]:
#Gets predictor variables
def get_predictors():
    #Columns that are targets, identifiers, or otherwise not used as predictors
    columns_to_drop = ['Total Minutes',
                       'Seasons Played',
                       'Player',
                       'POSITION',
                       'Year',
                       'STANDING_REACH_FT_IN',
                       'DRAFT_NUMBER.1',
                       'Average Minutes',
                       'DRAFT_NUMBER']

    predictors = main_df.drop(columns = columns_to_drop).columns
    return predictors

    #Unfinished idea, never executed:
    #if year in shuttleRun:
    #    predictors.drop(columns = year)
    #elif year in benchPress:


**Separating Training and Testing Data**

Chose roughly 20% of the data to be used for testing and used the rest for training. The split is made by draft year, so four whole draft classes are held out for testing.

In [4]:
testing_years = [2004,2009,2012,2017]
training_years = []
for year in range(2000,2021):
    if year not in testing_years:
        training_years.append(year)
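
As a quick sanity check on the "roughly 20%" figure, the share of held-out rows can be computed directly. The sketch below assumes, purely for illustration, that every draft year contributes the same number of players (60 here); the real dataset will not be perfectly balanced, so the share in `main_df` may differ slightly.

```python
import pandas as pd

testing_years = [2004, 2009, 2012, 2017]
training_years = [year for year in range(2000, 2021) if year not in testing_years]

# Hypothetical stand-in for main_df: 60 picks per draft year
toy = pd.DataFrame({"Year": [y for y in range(2000, 2021) for _ in range(60)]})

# Fraction of rows that fall in a testing year
test_share = toy["Year"].isin(testing_years).mean()
print(f"{len(training_years)} training years, {len(testing_years)} testing years")
print(f"test share of rows: {test_share:.1%}")
```

The same `isin(...).mean()` expression applied to `main_df['Year']` would give the actual share for this dataset.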

**Training Main Model**

Used sklearn to create a ridge regression model. Used mean squared error as an error metric to compare with the control model. 

In [5]:
def model_training(training_years, testing_years, predictors):
    
    #Column the model predicts
    column_to_predict = 'Average Minutes'
    
    #Training the model
    train = main_df[main_df['Year'].isin(training_years)]
    test = main_df[main_df['Year'].isin(testing_years)]
    reg = Ridge(alpha = 0.1)
    reg.fit(train[predictors], train[column_to_predict])

    #Getting predictions in a dataframe
    predictions = reg.predict(test[predictors])
    predictions = pd.DataFrame(predictions, columns = ['Predictions'], index = test.index)
    combination = pd.concat([test[['Player', column_to_predict]], predictions], axis = 1)
    
    #Getting Mean Square Error
    mse = mean_squared_error(combination[column_to_predict], combination['Predictions'])
    
    #Gets a clutter free df 'clean' which neatly displays information
    important_columns = ['Player',  column_to_predict,  'Predictions']
    clean = combination.loc[:,important_columns]
    clean['Predictions'] = round(clean['Predictions'], 0)
    clean[column_to_predict] = round(clean[column_to_predict], 0)
    clean['Predictions'] = clean['Predictions'].astype(int)
    clean[column_to_predict] = clean[column_to_predict].astype(int)
    clean['Difference'] = clean[column_to_predict] - clean['Predictions'] 
    
    return(clean, mse)
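
To make the error metric concrete, here is a small self-contained sketch of the same fit-then-score pattern on synthetic data. The data, the `PTS` column and the coefficients used to generate the target are invented for illustration; only `Ridge(alpha = 0.1)`, the split by `Year`, and `mean_squared_error` follow the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Year": rng.choice([2001, 2002, 2003, 2004], size=n),
    "DRAFT_NUMBER": rng.integers(1, 61, size=n),
    "PTS": rng.normal(12, 4, size=n),
})
# Synthetic target: earlier picks and higher scorers play more (illustrative only)
toy["Average Minutes"] = (
    1500 - 20 * toy["DRAFT_NUMBER"] + 15 * toy["PTS"] + rng.normal(0, 100, size=n)
)

predictors = ["DRAFT_NUMBER", "PTS"]
train = toy[toy["Year"] != 2004]
test = toy[toy["Year"] == 2004]

# Fit on the training years, predict the held-out year
reg = Ridge(alpha=0.1)
reg.fit(train[predictors], train["Average Minutes"])
predictions = reg.predict(test[predictors])

# MSE is the mean of the squared prediction errors
mse = mean_squared_error(test["Average Minutes"], predictions)
manual_mse = np.mean((test["Average Minutes"].to_numpy() - predictions) ** 2)
print(mse, manual_mse)
```

Because the errors are squared, a few large misses (for example a lottery pick who barely plays) weigh heavily on the score.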



**Train Control Model and Combine Results**

A control model was constructed identically to the main model, except that draft position is its sole predictor, rather than the combination of draft position, college stats and draft combine data used in the main model. Draft position was chosen as the control because the NBA draft is essentially a ranking of prospects by NBA teams, and draft position is by far the most significant predictor of minutes played. Using it as a control therefore lets us compare the results of our model with the opinion of NBA front offices.

In [6]:
def model(training_years, testing_years):
    
    main_model = model_training(training_years, testing_years, get_predictors())
    main_model_df = main_model[0]
    main_model_mse = main_model[1]
    
    control_predictor = ['DRAFT_NUMBER']
    control_model = model_training(training_years, testing_years, control_predictor)
    control_model_df = control_model[0]
    control_model_mse = control_model[1]
    
    columns_to_rename = {'Predictions': 'Control_Predictions',
                         'Difference': 'Control_Difference'}
    control_model_df = control_model_df.rename(columns = columns_to_rename)
    
    #Merge Control and Main Dataframes
    merged = main_model_df.merge(control_model_df[['Player', 'Control_Predictions', 'Control_Difference']], how='left', on='Player')
    merged = merged.sort_values('Predictions', ascending = False)
    
    #Final Statement
    ratio = main_model_mse/control_model_mse
    percent = (1 - ratio) * 100
    percent = round(percent, 2)
    statement = 'The MSE of the model was ' + str(percent) + '% lower than the control model'
    string = 'mse control: ' + str(round(control_model_mse,0)) + '\n' + 'mse: ' + str(round(main_model_mse,0))
    
    print(statement)
    return merged
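
The sign of the printed percentage is easy to misread, so here is the same arithmetic isolated in a small helper. The function name `percent_lower` and the MSE values are hypothetical; the formula is the one used in `model` above.

```python
def percent_lower(main_mse, control_mse):
    """Percentage by which main_mse is below control_mse.

    A negative result means the main model's MSE is higher than the control's.
    """
    ratio = main_mse / control_mse
    return round((1 - ratio) * 100, 2)

# Main model better than control: positive percentage
print(percent_lower(90.0, 100.0))    # 10.0

# Main model worse than control: negative percentage
print(percent_lower(105.76, 100.0))  # -5.76
```

A positive value therefore means the main model beat the control, and a negative value means it did worse.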

**Run Model**

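The results of my model are shown below. Based on the mean squared error of both models, my model performed 5.76% worse at predicting average minutes played than the draft-position control, which stands in for NBA front offices. The printed statement reports this as the MSE being "-5.76% lower", meaning 5.76% higher.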

In [7]:
abc = model(training_years, testing_years)
abc

The MSE of the model was -5.76% lower than the control model


| | Player | Average Minutes | Predictions | Difference | Control_Predictions | Control_Difference |
|---|---|---|---|---|---|---|
| 19 | Stephen Curry | 2181 | 1810 | 371 | 1383 | 798 |
| 26 | Blake Griffin | 1984 | 1749 | 235 | 1522 | 462 |
| 23 | Tyreke Evans | 1824 | 1593 | 231 | 1453 | 371 |
| 44 | Terrence Williams | 730 | 1500 | -770 | 1291 | -561 |
| 1 | Luol Deng | 2064 | 1398 | 666 | 1383 | 681 |
| ... | ... | ... | ... | ... | ... | ... |
| 94 | Jaron Blossomgame | 441 | 300 | 141 | 180 | 261 |
| 76 | Miles Plumlee | 808 | 281 | 527 | 944 | -136 |
| 74 | Andrew Nicholson | 816 | 236 | 580 | 1106 | -290 |
| 53 | Festus Ezeli | 799 | 118 | 681 | 851 | -52 |
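
One way to read the two `Difference` columns is to ask, player by player, which model's prediction landed closer to the actual value. The sketch below does this for the nine rows visible in the truncated output above only; these are the highest and lowest predictions, so they are not a representative sample and say nothing about the full test set. The `Main_Closer` column name is a hypothetical addition.

```python
import pandas as pd

# Only the nine rows visible in the truncated output
rows = [
    ("Stephen Curry", 371, 798),
    ("Blake Griffin", 235, 462),
    ("Tyreke Evans", 231, 371),
    ("Terrence Williams", -770, -561),
    ("Luol Deng", 666, 681),
    ("Jaron Blossomgame", 141, 261),
    ("Miles Plumlee", 527, -136),
    ("Andrew Nicholson", 580, -290),
    ("Festus Ezeli", 681, -52),
]
shown = pd.DataFrame(rows, columns=["Player", "Difference", "Control_Difference"])

# True where the main model's absolute error is smaller than the control's
shown["Main_Closer"] = shown["Difference"].abs() < shown["Control_Difference"].abs()

main_mae = round(shown["Difference"].abs().mean(), 1)
control_mae = round(shown["Control_Difference"].abs().mean(), 1)
print(shown[["Player", "Main_Closer"]])
print("mean |Difference|:", main_mae)
print("mean |Control_Difference|:", control_mae)
```

Applied to the full `abc` dataframe instead of these hand-copied rows, the same comparison would give a per-player view to complement the single MSE figure.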


**Next Steps**

The first iteration of this model is complete, but that does not mean the project is finished. Some next steps are listed below:

    **-Visualize data**
        -Add plot comparing data
            -Right now there is no real visualization of the data, and even a simple plot comparing the two models could go a long way
        
    **-Improve the algorithm**
        -Find the optimal alpha value
            -The alpha value was chosen by manually testing values and looking at the results, so there is definitely room for improvement here
        -Test for overfitting
            -No test was run to see whether the model is overfitting or underfitting the data, so it's likely improvements can be made here
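
As a starting point for the last two items, the sketch below shows one possible approach on synthetic data: `RidgeCV` picks alpha by cross-validation from a grid, and comparing training MSE with testing MSE gives a rough overfitting check. The data, the alpha grid and the 5-fold setting are assumptions for illustration, not choices made in this project.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error

# Synthetic regression problem standing in for the real predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.5, 1.0]) + rng.normal(0, 1.0, size=300)
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# Search a log-spaced grid of alpha values with 5-fold cross-validation
alphas = np.logspace(-3, 3, 13)
search = RidgeCV(alphas=alphas, cv=5)
search.fit(X_train, y_train)
best_alpha = search.alpha_

# Refit with the chosen alpha and compare errors on seen and unseen data
reg = Ridge(alpha=best_alpha).fit(X_train, y_train)
train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))
print("best alpha:", best_alpha)
print("train MSE:", train_mse, "test MSE:", test_mse)
```

A testing MSE far above the training MSE would suggest overfitting. Since this notebook splits by draft year, grouping the cross-validation folds by `Year` would probably match that setup more closely than random folds.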
    
    