# Predicting NBA All-Stars

## Problem Statement

Despite having more access to data and information than ever before, front offices and scouting departments of professional sports teams still struggle to make the correct draft choices. In professional sports, selecting the right player in the draft is critical to a team's future success. It is one of the best ways to build team depth in a cost-efficient manner, and it is where the fates of the best amateur athletes are decided. Drafting a player is an inexact science, but having a way to predict a player's future success will make a team's draft decision that much easier and give it a leg up on the competition.

Our goal is to build a model that will predict the likelihood that a draft-eligible college basketball player will become an All-Star in the NBA. The process of becoming an All-Star has changed over the years, but currently the starting lineups for both teams are selected by a weighted combination of fan, player and media voting (50%, 25% and 25% respectively). The reserves are chosen by a vote among the coaches. In the NBA, rosters with multiple All-Stars have frequently won championships. The 2019 NBA champion Toronto Raptors had 3 former All-Stars on their roster (Kawhi Leonard, Kyle Lowry and Marc Gasol) and the Golden State Warriors, the best team of the decade, had 5 former All-Stars on their roster during their three championship runs (Steph Curry, Klay Thompson, Draymond Green, Kevin Durant and Andre Iguodala). We will estimate this probability by analyzing which features of a college player lead to future NBA success.

## Executive Summary

The goal of this project is to predict the probability that a draft-eligible college basketball player will become an NBA All-Star at some point in his career. This was accomplished by utilizing data scraping, data cleaning, EDA, Bayesian statistics, preprocessing and classification modeling. The data used for the analysis was pulled from two websites, [Basketball Reference](https://www.sports-reference.com/cbb/play-index/psl_finder.cgi?request=1&match=combined&year_min=2006&year_max=2019&conf_id=&school_id=&class_is_fr=Y&class_is_so=Y&class_is_jr=Y&class_is_sr=Y&pos_is_g=Y&pos_is_gf=Y&pos_is_fg=Y&pos_is_f=Y&pos_is_fc=Y&pos_is_cf=Y&pos_is_c=Y&games_type=A&qual=&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=pts&order_by_asc=&offset=0) and [Bart Torvik](http://barttorvik.com). The process of pulling/scraping the NBA draft data, All-Star data and single season data from the Basketball Reference website can be seen in the **Data_Collection** folder of the repository. Bart Torvik's data was collected through a JSON file provided by Bart himself. The process of organizing that data can also be seen in the **Data_Collection** folder.

We were able to collect data on 55,939 individual college basketball seasons. This value includes every player that has played college basketball since 2008. A player who played multiple seasons has one observation per season. International and high school athletes were not included in the analysis since their statistics are not an apples-to-apples comparison with college basketball. College basketball players also make up more than 83% of the players in the NBA ([RPI Ratings](http://rpiratings.com/NBA.php)). In the data cleaning process, we identified which players were drafted and which made an All-Star game. We also reduced the dataset to only include the final season a player played in college. Including all seasons for a player could distort our target variable since the player would be listed multiple times as having made an All-Star game. The last season someone played in college tends to be their best performing season and is what's looked at most heavily by a team's front office. Ultimately, the dataset included 581 individual college basketball seasons, 30 of which belonged to players who became All-Stars. The data collected for each player includes many individual statistics for that season as well as information on the player's school, conference and even physical measurements.

Before building our model, we needed to understand the dataset in greater detail. To do this, we performed exploratory data analysis and used Bayesian statistics to adjust certain features. We wanted to see what percentage of players made an All-Star game given a certain feature as well as what percentage of All-Stars had that feature. We also looked at the continuous features (player stats), comparing players that have made an All-Star game with those who have not to see if there were any significant differences between the two groups. Before eliminating certain features that had overlap in their calculations, we used Bayesian techniques to adjust several values. For example, **FTM** and **FTA** are used to calculate **FT %**. We adjusted the **FT %** using prior knowledge and a likelihood function before we dropped the **FTM** and **FTA** features. This provided us with a more accurate representation and helped account for players with far fewer attempts than other players.
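
As a rough illustration of the kind of adjustment described above, the sketch below shrinks a raw free-throw percentage toward a prior using a Beta-Binomial model. The function name `adjusted_ft_pct` and the prior parameters `prior_makes` and `prior_misses` are assumptions made for illustration; the prior actually used in this project is not reproduced here.

```python
def adjusted_ft_pct(ftm, fta, prior_makes=70.0, prior_misses=30.0):
    """Posterior mean of FT% under a Beta(prior_makes, prior_misses) prior.

    The prior behaves like `prior_makes + prior_misses` imaginary attempts
    taken at the prior mean (70% with these illustrative defaults).
    """
    if ftm < 0 or fta < ftm:
        raise ValueError("need 0 <= ftm <= fta")
    return (ftm + prior_makes) / (fta + prior_makes + prior_misses)


# A player with very few attempts is pulled strongly toward the prior mean,
low_volume = adjusted_ft_pct(ftm=2, fta=2)        # raw FT% is 100%
# while a high-volume shooter stays much closer to the raw percentage.
high_volume = adjusted_ft_pct(ftm=180, fta=200)   # raw FT% is 90%

print(round(low_volume, 3), round(high_volume, 3))
```

With this kind of shrinkage, a 2-for-2 shooter is no longer ranked above a 180-for-200 shooter, which is the behavior the adjustment is meant to achieve.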

After completing EDA, we began the classification modeling process. We separated the data into a training set and a test set. The training data included players from 2008-2018 and the test set included college players from the 2019 NBA draft. After selecting the features to include in the analysis, we further divided the training data with a train/test split. We then built several classification models, using GridSearchCV to search for the best hyperparameters. Below is a list of the classification models used.

- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Random Forest
- AdaBoost

The best model had the highest accuracy and the lowest variance, and it also produced the most realistic probabilities based on prior knowledge. This model was then used to predict the All-Star probability on the 2019 test dataset.

## Contents

- [Imports](#Imports)
- [Data Cleaning](#Data-Cleaning)
    - [Cleaning Draft Pick Data](#Cleaning-Draft-Pick-Data)
    - [Cleaning Single Season Data](#Cleaning-Single-Season-Data)
        - [Updating Player Names](#Updating-Player-Names)
        - [Identifying All-Star Players](#Identifying-All-Star-Players)
        - [Keeping Last Year Played in College](#Keeping-Last-Year-Played-in-College)
        - [Null Values](#Null-Values)
        - [Adjusting Height Column](#Adjusting-Height-Column) 
        - [Updating Conference Column](#Updating-Conference-Column)
        - [Creating Total Points and PPG Column](#Creating-Total-Points-and-PPG-Column)
        - [Creating Position Column](#Creating-Position-Column)
        - [Adjusting Yr Column](#Adjusting-Yr-Column)
- [EDA](#EDA)
    - [Descriptive Statistics & Visualizations](#Descriptive-Statistics-&-Visualizations)
    - [Creating Dummy Columns](#Creating-Dummy-Columns)
    - [More EDA After Modeling](#More-EDA-After-Modeling)
    

    


## Imports

In [1]:
#import libraries
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn           as sns
import scipy.stats       as stats

import warnings
warnings.filterwarnings("ignore")

sns.set_style("whitegrid")



#adjusting display to see more data for convenience
pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 500)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Data Cleaning

### Cleaning Draft Pick Data

In [6]:
# reading in NBA draft picks data since 2008
draft_picks = pd.read_csv('../Data_Files/draftpicks.csv', index_col = 0)
draft_picks.head()

In [None]:
draft_picks.shape

In [None]:
#removing columns past College; they are unnecessary for my analysis
draft_picks = draft_picks[draft_picks.columns[:4]]

In [None]:
draft_picks.head()

In [None]:
draft_picks.isnull().sum()

In [None]:
#dropping international players from analysis. 
draft_picks = draft_picks[draft_picks['College'].notnull()]

I have decided to remove international players from my analysis. International player stats are not an apples-to-apples comparison with college basketball stats.

In [None]:
draft_picks.shape

In [None]:
#may be some duplicates
draft_picks['Player'].nunique()

In [None]:
draft_picks[draft_picks.duplicated('Player', keep = False)]

Two players drafted happened to have the same name. I wanted to make sure that there weren't any duplicates in the dataset.

### Cleaning Single Season Data

I will be using the data from Bart Torvik's website for my analysis. The data includes all stats for all players that have played college basketball since 2008.

In [None]:
#reading in file
single_season_df = pd.read_csv('../Data_Files/torvik_data.csv', index_col = 0)

In [None]:
single_season_df.head()

In [None]:
#checking to make sure all players drafted are in the scraped dataset from Bart Torvik
#building the set once avoids recomputing it for every player
season_names = set(single_season_df['player_name'])
player_list = [player for player in draft_picks['Player'] if player in season_names]

In [None]:
len(set(player_list))

Some drafted players must be missing from my list, since the length of the list does not equal the number of drafted players.

In [None]:
torvik_names = set(single_season_df['player_name'])
missing = [player for player in draft_picks['Player'] if player not in torvik_names]


In [None]:
len(missing)

In [None]:
#List of names not showing up in the single season stats dataframe
missing

Above is a list of the players missing from my dataset. All should be included since my data includes all single season stats for college seasons played since 2008. I am going to check why they are missing. My belief is that the names above are spelled differently in the two datasets.
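
One way to surface spelling differences like these is to compare simplified versions of the names. The sketch below is only an illustration: the helper `normalize_name` and the `SUFFIXES` set are hypothetical and are not part of this notebook's cleaning code. It strips accents, punctuation and generational suffixes so that, for example, "J.J. Hickson" and "JJ Hickson" produce the same key.

```python
import re
import unicodedata

# Suffixes that one source may include and the other may omit.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(name):
    """Return a simplified key: ASCII, lowercase, no punctuation or suffix."""
    # Decompose accented characters and drop the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove periods, commas and apostrophes, then lowercase.
    cleaned = re.sub(r"[.,']", "", ascii_name).lower()
    tokens = [tok for tok in cleaned.split() if tok not in SUFFIXES]
    return " ".join(tokens)


print(normalize_name("Nikola Vučević"))
print(normalize_name("Tim Hardaway Jr."))
print(normalize_name("J.J. Hickson") == normalize_name("JJ Hickson"))
```

A key like this should only be used to flag candidate matches for review. Dropping suffixes can merge two different people (a father and son, for instance), and nicknames or alternate transliterations still need a manual mapping.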

In [None]:
#checking to see if the players above are really in the dataset. The code below will check by name
single_season_df[single_season_df['player_name'].str.contains('Bagley')]

In [None]:
#code will look for players by college
single_season_df[single_season_df['school'].str.contains('Miami')].head()

In [None]:
#code will check for players by name and college
single_season_df[single_season_df['school'].str.contains('Michigan') & single_season_df['player_name'].str.contains('Hardaway')]


Tim Hardaway is an example of a player that was "missing" in my dataset. As you can see above, the scraped data includes "Jr." in his name. One of the observations also lists his name without the period after Jr. I am going to update this for consistency.

In [None]:
#updating Tim Hardaway Jr. name to be consistent with the other two ways it is spelt
single_season_df.loc[single_season_df['player_name'] == 'Tim Hardaway Jr', 'player_name'] = 'Tim Hardaway Jr.'

In [None]:
#checking correction
single_season_df.loc[single_season_df['player_name'] == 'Tim Hardaway Jr.', 'player_name']

I noticed in my EDA that Trent Plaisted was not listed as drafted in my dataset, which is incorrect. His pick number will need to be filled in manually.

In [None]:
single_season_df[single_season_df['player_name'].str.contains('Plaisted')]

In [None]:
single_season_df.loc[single_season_df['player_name'] == 'Trent Plaisted', 'pick'] = 46

In [None]:
single_season_df.loc[single_season_df['player_name'] == 'Trent Plaisted', 'pick']

#### Updating Player Names

In [None]:
#name change dictionary
#So both dataframes are consistent
name_changes = {'J.J. Hickson': "JJ Hickson",
                'Donté Greene': 'Donte Greene',
                'Patrick Ewing': 'Patrick Ewing Jr.',
                'Greivis Vásquez': 'Greivis Vasquez',
                "Hamady N'Diaye": 'Hamady Ndiaye',
                'Nikola Vučević': 'Nikola Vucevic',
                'Maurice Harkless': 'Moe Harkless',
                'Perry Jones': 'Perry Jones III',
                'CJ McCollum': 'C.J. McCollum',
                'Tim Hardaway': 'Tim Hardaway Jr.',
                "Johnny O'Bryant": "Johnny O'Bryant III",
                'Glenn Robinson': 'Glenn Robinson III',
                'Devyn Marble': 'Roy Devyn Marble',
                'Larry Nance': 'Larry Nance Jr.',
                'Joe Young': 'Joseph Young',
                'Jakob Pöltl': 'Jakob Poeltl',
                'Taurean Waller-Prince': 'Taurean Prince',
                'Wade Baldwin': 'Wade Baldwin IV',
                'Skal Labissière': 'Skal Labissiere',
                'Stephen Zimmerman': 'Stephen Zimmerman Jr.',
                'Kay Felder': 'Kahlil Felder',
                'Dennis Smith': 'Dennis Smith, Jr.',
                'Bam Adebayo': 'Edrice Adebayo',
                'T.J. Leaf': 'TJ Leaf',
                'Marvin Bagley': 'Marvin Bagley III',
                'Jaren Jackson': 'Jaren Jackson Jr.',
                'Wendell Carter': 'Wendell Carter Jr.',
                'Lonnie Walker': 'Lonnie Walker IV',
                'Jacob Evans': 'Jacob Evans III',
                'Gary Trent': 'Gary Trent Jr.',
                'RJ Barrett': 'R.J. Barrett',
                'Dewan Hernandez': 'Dewan Huell',
                'Michael Porter' : 'Michael Porter Jr.'}

In [None]:
#updating names in drafted players list (exact matches only)
draft_picks['Player'] = draft_picks['Player'].replace(name_changes)

In [None]:
season_names = set(single_season_df['player_name'])
still_missing = [player for player in draft_picks['Player'] if player not in season_names]


In [None]:
still_missing

All the names have been corrected. Ricky Ledo is still missing because he did not play in college and therefore has no stats. He will not be included in my analysis.

In [None]:
#updating dataframe to only include drafted players
single_season_df = single_season_df[single_season_df['player_name'].isin(draft_picks['Player'])]

In [None]:
#checking null values in pick column
single_season_df['pick'].isnull().sum()

The pick column should have zero null values since all players in the dataset were drafted. I am going to check whether the nulls come from different players who share a name with a drafted player but went to a different school. The players not drafted will be removed.

In [None]:
single_season_df[single_season_df['pick'].isnull()].head()

As expected there are several players with the same names. The players not drafted will be removed.

In [None]:
#removing rows with null values in pick column
single_season_df = single_season_df[single_season_df['pick'].notnull()]

In [None]:
single_season_df['pick'].isnull().sum()

In [None]:
set(draft_picks['Player']) - set(single_season_df['player_name'])

Two players are missing. Ricky Ledo, as stated before, makes sense since he did not play in college. Mike Taylor was kicked off his college team in 2007, so I don't have data on him since it was prior to 2008.

#### Identifying All-Star Players

In [None]:
#reading in file
all_star = pd.read_csv('../Data_Files/all_star_rosters.csv', index_col = 0)

In [None]:
all_star.head()

In [None]:
#Making sure that the names are spelt in the same way between the list of all-stars and the single season stats.
all_stars = []
for player in all_star['Name']:
    if player in set(single_season_df['player_name']):
        all_stars.append(player)

In [None]:
all_stars

All college players drafted since 2008 that have made an All-Star team are listed above. All other players in the All-Star dataframe were either drafted prior to 2008 or played internationally.

In [None]:
#creating a column in the df if the player made an all-star game. This is the target column.
single_season_df['all_star'] = single_season_df['player_name'].isin(all_star['Name']).astype(int)



In [None]:
single_season_df['all_star'].value_counts()

In [None]:
single_season_df[single_season_df['all_star'] == 1]

The single-season dataframe now includes all the data for the players we are looking to use. 

In [None]:
single_season_df.shape

#### Keeping Last Year Played in College

In my dataset, each season that a player played in college is listed as a separate row. This would skew my data and analysis, since a player who made an All-Star team in the NBA and played multiple seasons in college would be counted multiple times as having made the All-Star game. To avoid this, I am going to keep only the last year a player played in college. That season is the one most used by front offices and scouts when evaluating talent for the NBA, and it tends to be a player's best performing season in college.

In [None]:
#dropping duplicates and keeping last
#pid is a unique identifier for each player
single_season_df.drop_duplicates('pid' , keep = 'last', inplace = True)

In [None]:
single_season_df.shape

In [None]:
single_season_df['pid'].nunique()

Dataframe now only has one season's worth of data for each player.
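
Note that `drop_duplicates(..., keep = 'last')` keeps whichever row appears last for each `pid`, so it returns the final college season only if the rows are already in chronological order. Below is a minimal sketch that makes the ordering explicit, assuming the `year` column identifies the season. The data is a toy example, not the real dataset.

```python
import pandas as pd

# Toy data; `pid` and `year` mirror the column names used above.
seasons = pd.DataFrame(
    {
        "pid": [1, 2, 1, 2, 3],
        "year": [2012, 2010, 2011, 2011, 2015],
        "PPG": [18.2, 9.5, 12.1, 14.0, 20.3],
    }
)

# Sort chronologically within each player, then keep the final row per pid.
last_season = (
    seasons.sort_values(["pid", "year"])
    .drop_duplicates("pid", keep="last")
    .reset_index(drop=True)
)

print(last_season)
```

Sorting first makes the result independent of the order in which the seasons were scraped.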

In [None]:
#updating Damian Lillard's year in college to senior. Value was incorrect in the data.
single_season_df.loc[single_season_df['player_name'] == 'Damian Lillard', 'yr'] = 'Sr'

In [None]:
single_season_df.loc[single_season_df['player_name'] == 'Damian Lillard', 'yr']

In [None]:
#resetting index
single_season_df.reset_index(drop = True, inplace = True)

In [None]:
#checking data types and null values
single_season_df.info()

In [None]:
#filtering for only object columns
single_season_df.select_dtypes('object').head()

Above are the columns listed as objects (strings). I am going to collapse the conference column into the individual power-5 conferences plus a single group for all non-power-5 conferences. Then I will dummy the adjusted column. The yr column will be turned into a numeric scale equal to the number of years played in college. Height will be turned into an integer in inches. The num and type columns will be dropped since they do not provide any valuable information.

#### Null Values

In [None]:
#list of columns with missing values
single_season_df.isnull().mean().sort_values(ascending = False).head(15)

I am going to delete the columns that have over 10% missing values. These columns are not critical in my analysis and are mainly specific to players at the center position. 
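
The 10% rule can also be applied programmatically rather than by listing columns by hand. The following is a sketch on a toy frame; the column names and the `THRESHOLD` constant are placeholders for illustration. Columns dropped for other reasons (such as identifiers) would still need to be listed explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame; column names here are placeholders, not the real dataset's.
toy = pd.DataFrame(
    {
        "PPG": [10.0, 12.5, 8.0, 15.0, 9.0],
        "dunks": [np.nan, 3.0, np.nan, 5.0, 1.0],  # 40% missing
        "ht": [78, 80, 75, 82, 77],
    }
)

THRESHOLD = 0.10  # share of missing values above which a column is dropped

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()
too_sparse = missing_share[missing_share > THRESHOLD].index.tolist()
trimmed = toy.drop(columns=too_sparse)

print(too_sparse)
```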

In [None]:
#dropping columns
columns_drop = ['dunksmade/(dunksmade+dunksmiss)',
                'midmade/(midmade+mismiss)',
                'rimmade/(rimmade+rimmiss)',
                'rimmade',
                'midmade',
                'midmade + midmiss',
                'dunksmade',
                'dunksmiss + dunksmade',
                'rimmade + rimmiss',
                'num',
                'rec-rk',
                'type']

single_season_df.drop(columns = columns_drop, inplace = True)

In [None]:
single_season_df.isnull().mean().sort_values(ascending = False).head()

In [None]:
single_season_df.isnull().sum().sum()

No more missing values.

#### Adjusting Height Column

The height column is currently listed as a string. I am going to turn the column into an integer equal to height in inches.

In [None]:
single_season_df['ht'].value_counts()

In [None]:
#turning height columns to inches
single_season_df['ht'] = single_season_df['ht'].str.replace('-', '').map(lambda x: (int(x[0]) * 12) + int(x[1:]))



In [None]:
single_season_df['ht'].head()

In [None]:
single_season_df['ht'].dtype
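
As a sanity check on the conversion, here is a small standalone sketch of the same arithmetic. It assumes heights are stored as "feet-inches" strings such as `6-10`; the helper name `height_to_inches` is hypothetical. Splitting on the hyphen, rather than indexing the first character, also validates the inches part.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches."""
    feet, inches = height.strip().split("-")
    feet, inches = int(feet), int(inches)
    if not 0 <= inches < 12:
        raise ValueError(f"inches out of range in {height!r}")
    return feet * 12 + inches


print(height_to_inches("6-10"))  # 6 * 12 + 10 = 82
print(height_to_inches("5-9"))   # 5 * 12 + 9 = 69
```

A function like this could be passed to `.map(...)` on the string column in place of the lambda.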

#### Updating Conference Column

In [None]:
single_season_df['conference'].value_counts()

I am going to split the conference column into major and non-major conferences. The major conferences are the largest conferences, also known as the power-5. Please note that I will be including the PAC-10 as well as the Big East (BE) as major conferences. The PAC-10 is now the PAC-12, and the Big East was a major conference until conference realignment in 2013.

In [None]:
#updating conference
major_conf = ['ACC', 'SEC', 'B12', 'BE', 'B10', 'P12', 'P10']

single_season_df['conference'] = single_season_df['conference'].map(lambda x: x if x in major_conf else 'Non_major')



In [None]:
single_season_df['conference'].value_counts()

In [None]:
single_season_df.head()

#### Creating Total Points and PPG Column

I am going to create a total points column and a points per game column from the free throws made, two-pointers made and three-pointers made columns. The formula for total points is: **Free Throws Made + (Two-Pointers Made * 2) + (Three-Pointers Made * 3)**. Points per game is total points divided by games played.
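
A quick worked example of the formula, using a made-up season line (the numbers and the helper name `points_per_game` are for illustration only):

```python
def points_per_game(ftm, two_pm, three_pm, games_played):
    """Total points from made shots, divided by games played."""
    total_points = ftm + 2 * two_pm + 3 * three_pm
    return round(total_points / games_played, 2)


# Hypothetical line: 100 free throws, 150 two-pointers, 60 three-pointers, 30 games.
# Total points = 100 + 300 + 180 = 580, so PPG = 580 / 30.
ppg = points_per_game(ftm=100, two_pm=150, three_pm=60, games_played=30)
print(ppg)
```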

In [None]:
single_season_df.head()

In [None]:
#creating total points column
single_season_df['total_points'] = single_season_df['FTM'] + (single_season_df['twoPM'] * 2) + (single_season_df['TPM'] * 3)



In [None]:
#creating points per game column
single_season_df['PPG'] = np.round(single_season_df['total_points'] / single_season_df['GP'], 2)
                 

In [None]:
single_season_df.head()

#### Creating Position Column

Unfortunately the dataset I collected did not include the position of the players. I am going to attempt to impute the values for each player from another dataset.

In [None]:
#reading in data from basketball-reference.com
position_df = pd.read_csv('../Data_Files/single_season.csv', index_col = 0)
position_df.head()

In [None]:
#renaming player column to be the same as single season dataframe so I can do a merge
position_df.rename({'Player': 'player_name'}, axis = 1, inplace = True)

In [None]:
#dropping duplicates
position_df.drop_duplicates(subset = 'player_name', inplace = True)

In [None]:
#merging the two dataframes to get the position 
single_season_df = pd.merge(single_season_df, position_df, on = 'player_name', how = 'left')

In [None]:
single_season_df.head()

In [None]:
single_season_df.shape

In [None]:
#removing unnecessary columns from the end of the dataset
single_season_df = single_season_df.loc[: , : 'Pos']

#renaming "FTA_x" to "FTA"
single_season_df.rename(columns = {'FTA_x': 'FTA'}, inplace = True)


In [None]:
single_season_df.head()

In [None]:
#dropping season and class columns
single_season_df.drop(columns =  ['Class', 'Season'], inplace = True)

In [None]:
single_season_df.shape

In [None]:
len(set(single_season_df['player_name']) - set(position_df['player_name']))

Some players are missing from the dataset collected from basketball reference.
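
One way to list exactly which rows found no match in a left merge is the `indicator` option of `pd.merge`. The sketch below uses toy frames with placeholder names and positions, not the real data.

```python
import pandas as pd

# Toy frames; names and positions are placeholders for illustration.
stats = pd.DataFrame(
    {"player_name": ["Player A", "Player B", "Player C"],
     "PPG": [12.0, 8.5, 15.1]}
)
positions = pd.DataFrame(
    {"player_name": ["Player A", "Player C"],
     "Pos": ["G", "F"]}
)

# indicator=True adds a `_merge` column recording where each row was found.
merged = pd.merge(stats, positions, on="player_name", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "player_name"].tolist()

print(unmatched)
```

The resulting list is the set of players whose position would have to be filled in by hand.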

In [None]:
single_season_df[single_season_df['Pos'].isnull()].shape

In [None]:
#dictionary to update position for missing players
position_dict = {'Alex Oriakhi' : 'F', 'Andre Drummond' : 'C', 'Andre Roberson': 'F','Andrew Harrison': 'G', 'Andy Rautins': 'G', 'Austin Daye' : 'F',
                 'Avery Bradley' : 'G', 'Bernard James' : 'F', 'Bol Bol': 'C','Branden Dawson' : 'G', 'Bruce Brown' : 'G', 'Byron Mullens' : 'C',
                 'Cady Lalanne' : 'C', 'Chandler Parsons' : 'F', 'Cheick Diallo': 'F','Chinanu Onuaku' : 'F', 'Chinemelu Elonu' : 'C', 'Chris McCullough' : 'F',
                 'Chris Singleton' : 'F', 'Cory Joseph': 'G', 'D.J. Wilson' : 'F','DaJuan Summers': 'F', 'Dakari Johnson': 'C', 'Daniel Orton' : 'F',
                 'Darius Garland': 'G', 'Darius Miller' : 'G', "De'Anthony Melton" : 'G','DeAndre Jordan' : 'C', 'DeAndre Liggins': 'G', "DeAndre' Bembry" : 'F',
                 'DeVon Hardin' : 'C', 'Dennis Smith, Jr.' : 'G', 'Derrick Caracter' : 'F','Devin Booker': 'G', 'Devin Ebanks' : 'F', 'Devon Hall': 'G', 'Dewan Huell': 'F',
                 'Dexter Pittman' : 'C', 'Deyonta Davis' : 'F', 'Diamond Stone' : 'C','Dwayne Collins' : 'F', 'Ed Davis': 'F', 'Edmond Sumner' : 'G',
                 'Edrice Adebayo' : 'F', 'Eric Bledsoe' : 'G', 'Fab Melo': 'C','Frank Jackson' : 'G', 'Glen Rice' : 'G', 'Glenn Robinson III': 'F',
                 'Goran Suton' : 'C', 'Gorgui Dieng' : 'C', 'Grant Jerrett': 'F','Hamady Ndiaye' : 'C', 'Hamidou Diallo' : 'G' , 'Harry Giles': 'F',
                 'Ike Anigbogu' : 'F', 'Isaiah Roby' : 'F', 'Ivan Rabb': 'F','J.P. Tokoto' : 'F', 'JJ Hickson' : 'F', 'Jabari Bird' : 'G',
                 'Jacob Evans III' : 'G', 'Jake Layman' : 'F', 'Jaren Jackson Jr.' : 'F','Jarred Vanderbilt' : 'F', 'Jaxson Hayes' : 'F', 'Jerami Grant' : 'F',
                 'Jeremy Evans' : 'F', 'Joe Harris': 'G', 'Joel Embiid' : 'C', 'Joey Dorsey': 'F', "Johnny O'Bryant III" : 'F', 'Jonathan Isaac' : 'F',
                 'Jordan Bell': 'F', 'Joseph Young' : 'G', 'Josh Harrellson' : 'F', 'Josh Huestis': 'F', 'Josh Selby': 'G', 'Jrue Holiday': 'G',
                 'Justin Anderson' : 'G', 'Justin Hamilton' : 'C', 'KZ Okpala' : 'F', 'Kadeem Allen' : 'G', 'Kahlil Felder' : 'G', 'Karl-Anthony Towns' : 'F',
                 'Kelly Oubre' : 'G', 'Kendall Marshall' : 'G', 'Kevin Porter Jr.': 'G', 'Kevon Looney': 'F', 'Kostas Antetokounmpo' : 'F', 'Kyle Weaver' : 'G',
                 'Kyrie Irving' : 'G', 'Lance Stephenson': 'G', 'Larry Nance Jr.' : 'F', 'Lavoy Allen': 'F', 'Lonnie Walker IV' : 'G', 'Luc Mbah a Moute' : 'F',
                 'Malcolm Lee' : 'G', 'MarShon Brooks' : 'G', 'Marquis Teague': 'G', 'Matisse Thybulle' : 'G', 'Meyers Leonard': 'C', 'Michael Porter Jr.': 'F',
                 'Miles Plumlee': 'F', 'Mitch McGary': 'F', 'Moe Harkless' : 'G', 'Mohamed Bamba': 'C', 'Myles Turner': 'F', 'Nassir Little': 'F',
                 'Nerlens Noel': 'F', 'Nicolas Claxton': 'F', 'Noah Vonleh': 'F', 'OG Anunoby': 'F', 'Omari Spellman': 'F', 'Patrick Ewing Jr.': 'F',
                 'Patty Mills': 'G', 'Perry Jones III' : 'F', 'Peyton Siva' : 'G', 'Quincy Miller' : 'F', 'Rashad Vaughn': 'G',
                 'Ray Spalding': 'F', 'Robert Williams': 'F', 'Robin Lopez': 'C', 'Rondae Hollis-Jefferson' : 'F', 'Ryan Kelly': 'F', 'Ryan Reid' : 'F',
                 'Sasha Kaun': 'C', 'Skal Labissiere': 'F', 'Solomon Alabi': 'C', 'Stephen Zimmerman Jr.' : 'F', 'Steven Adams' : 'C', 'Talen Horton-Tucker': 'G',
                 'Taylor Griffin': 'F', 'Terance Mann': 'G', 'Thomas Bryant': 'F', 'Thomas Welsh' : 'C', 'Tim Hardaway Jr.': 'G', 'Tiny Gallon': 'F',
                 'Tony Bradley': 'F', 'Trey Lyles': 'F', 'Troy Brown': 'F', 'Tyler Honeycutt': 'F', 'Vernon Macklin': 'F', 'Wade Baldwin IV':'G',
                 'Walter Sharpe': 'F', 'Wesley Johnson': 'F', 'Willie Cauley-Stein': 'F', 'Zach Collins': 'F', 'Zach LaVine': 'G',
                 'Zhaire Smith': 'G'}




In [None]:
#updating position column for the players listed in position_dict
in_dict = single_season_df['player_name'].isin(list(position_dict))
single_season_df.loc[in_dict, 'Pos'] = single_season_df.loc[in_dict, 'player_name'].map(position_dict)

In [None]:
single_season_df['Pos'].isnull().sum()

All players now have positions listed.

#### Adjusting Yr Column

Currently the yr column lists a player as Freshman, Sophomore, Junior or Senior depending on what grade they are in. I am going to adjust this column so that each grade corresponds to a value which will represent the number of years the player played in college.

In [None]:
single_season_df['yr'].value_counts()

In [None]:
#creating dictionary for map function
yr_map = {'Fr' : 1,
          'So' : 2,
          'Jr' : 3,
          'Sr' : 4}

single_season_df['yr'] = single_season_df['yr'].map(yr_map)

In [None]:
single_season_df['yr'].value_counts()

## EDA

### Descriptive Statistics & Visualizations

In [None]:
single_season_df.describe()

Nothing abnormal stands out from descriptive statistics above. I am going to look at the columns graphically to look for trends and any abnormalities or outliers.

In [None]:
# All-Star % by School
plt.figure(figsize= (20, 40))
sns.barplot(y = 'school', x= 'all_star', data = single_season_df, orient = 'h', ci = None, edgecolor = 'black')
plt.title('All Star % by School', size = 30)
plt.xlabel('All Star %', size = 25)
plt.ylabel('School', size = 25)
plt.xticks(size = 16)
plt.yticks(size = 16);

In [None]:
#Plotting where current All-Stars went to college
plt.figure(figsize = (12,8))
single_season_df[single_season_df['all_star'] == 1]['school'].value_counts(normalize = True).plot(kind = 'barh', edgecolor = 'black')
plt.title('% of All-Stars by School', size = 22)
#horizontal bars: the share is on the x-axis and the school is on the y-axis
plt.xlabel('All-Star %', size = 18)
plt.ylabel('School', size = 18)
plt.xticks(rotation = 0, size = 16)
plt.yticks(size = 16);

Because so many schools have no All-Stars, I am going to remove the school column from the dataset. I will be able to get more information from the conference column. It is interesting to note, though, that Kentucky has had the most drafted players become All-Stars over the last 10 years.

In [None]:
single_season_df.conference.value_counts()

The ACC has had the most players drafted since 2008, followed by the sum of all the non-major conferences. The PAC10 has by far the fewest, but this is partially due to the conference expanding to the PAC12 in 2011.

In [None]:
# Graphs showing All-Star % by Conference
plt.figure(figsize= (22, 8))

plt.subplot(1,2,1)
sns.barplot(x = 'conference', y= 'all_star', data = single_season_df, ci = None)
plt.xlabel('Conference', size = 18)
plt.ylabel('All-Star %', size = 18)
plt.title('All-Star % by Conference', size = 22)
plt.xticks(size = 16)
plt.yticks(size = 16)

plt.subplot(1,2,2)
single_season_df[single_season_df['all_star'] == 1]['conference'].value_counts(normalize = True).plot(kind = 'bar')
plt.title('% of All-Stars by Conference', size = 22)
plt.xlabel('Conference', size = 18)
plt.ylabel("All-Star %", size = 18)
plt.xticks(rotation = 0, size = 16)
plt.yticks(size = 16);

The percentage of players that have become All-Stars from each conference is pretty surprising. We did not expect the non-major conferences to have a higher percentage of All-Stars than the ACC, which has historically been the best conference in college basketball. Despite having by far the fewest draft picks, the PAC10 had the largest All-Star % by approximately 20%. When a PAC10 player was drafted, they were more often than not a quality player, although with so few PAC10 draft picks this rate should be read with caution. The PAC12 has not had any All-Stars since the conference was formed from the PAC10 in 2011. It is also interesting to note that close to 50% of the All-Stars drafted since 2008 played in the PAC10 or a non-major conference. Playing in the best conference in college does not necessarily translate to success at the professional level.
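
The two panels answer different questions: the left one is the share of a conference's draftees who became All-Stars, while the right one is the share of all All-Stars who came from that conference. A small sketch with invented counts (not the project's data) shows how the two can rank conferences differently:

```python
from collections import Counter

# Toy records of (conference, made_all_star); counts are invented for illustration.
players = (
    [("ACC", 1)] * 2 + [("ACC", 0)] * 38
    + [("P10", 1)] * 3 + [("P10", 0)] * 7
    + [("Non_major", 1)] * 5 + [("Non_major", 0)] * 45
)

drafted = Counter(conf for conf, _ in players)
all_stars = Counter(conf for conf, made in players if made)
total_all_stars = sum(all_stars.values())

# P(All-Star | conference): share of a conference's draftees who became All-Stars.
rate_by_conf = {conf: all_stars[conf] / n for conf, n in drafted.items()}
# P(conference | All-Star): share of all All-Stars who came from that conference.
share_of_all_stars = {conf: k / total_all_stars for conf, k in all_stars.items()}

print(rate_by_conf)
print(share_of_all_stars)
```

In this toy example the small conference has the highest rate, while the large group contributes the most All-Stars overall, which is why both views are worth plotting side by side.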

In [None]:
#All-Star % by conference
single_season_df.groupby('conference')['all_star'].mean()

In [None]:
# What conference did current All-Stars play in college
single_season_df[single_season_df['all_star'] == 1]['conference'].value_counts(normalize = True)

In [None]:
# Graphs showing All-Star % by # of Years Played in College
plt.figure(figsize= (16, 8))

plt.subplot(1,2,1)
sns.barplot(x = 'yr', y= 'all_star', data = single_season_df, ci = None)
plt.xlabel('Years Played', size = 18)
plt.ylabel('All Star %', size = 18)
plt.title('All Star % by # of Years Played', size = 22)
plt.xticks(size = 16)
plt.yticks(size = 16)

plt.subplot(1,2,2)
single_season_df[single_season_df['all_star'] == 1]['yr'].value_counts(normalize = True).plot(kind = 'bar')
plt.title('Years Played in College for All-Stars', size = 22)
plt.xlabel('Years Played', size = 18)
plt.ylabel("All-Star %", size = 18)
plt.xticks(rotation = 0, size = 16)
plt.yticks(size = 16);

The graph on the left shows a breakdown of the percentage of college players drafted that became All-Stars for each college year. Both Freshmen and Sophomores have a higher overall rate than the average for all players drafted (approx. 5%). This shows that more years of college experience do not necessarily make a player more likely to succeed in the NBA. Players that leave after 1 or 2 years in school are usually more prepared and do not need additional development at the college level. They also tend to be higher rated recruits coming out of high school, which has been a big determinant of future draft position, which in turn is correlated with becoming an All-Star.

The graph on the right shows what percentage of all-stars left college as a Freshman, Sophomore, Junior or Senior. These results are in line with the previous graph. The majority of All-Stars in the last 10 years left college after only playing 1 or 2 years.

In [None]:
# All-Star % by years played
single_season_df.groupby('yr')['all_star'].mean()

In [None]:
# Share of current All-Stars by number of years played in college
single_season_df[single_season_df['all_star'] == 1]['yr'].value_counts(normalize = True)

In [None]:
#Plotting All-Star % by Year-Drafted
plt.figure(figsize= (20, 8))

plt.subplot(1,2,1)


sns.barplot(x = 'year', y= 'all_star', data = single_season_df, ci = None)
plt.xlabel('Year Drafted', size = 18)
plt.ylabel('All-Star %', size = 18)
plt.title('All Star % by Year Drafted', size = 22)
plt.xticks(size = 16)
plt.yticks(size = 16)

plt.subplot(1,2,2)
single_season_df[single_season_df['all_star'] == 1]['year'].value_counts(normalize = True).plot(kind = 'bar')
plt.title('Year Drafted for All-Star Players', size = 22)
plt.xlabel('Year Drafted', size = 18)
plt.ylabel("All-Star %", size = 18)
plt.xticks(rotation = 0, size = 16)
plt.yticks(size = 16);

No player that was drafted in the last 3 years has made an All-Star game. This shows how difficult it is to make the team and that in the NBA it is hard to come in and be successful right away. It takes a lot of growth and development.

In [None]:
# % of players to make All-Star game by year drafted
single_season_df.groupby('year')['all_star'].mean()

In [None]:
# When current All-Stars were drafted
single_season_df[single_season_df['all_star'] == 1]['year'].value_counts(normalize = True)

In [None]:
#plotting All-Star % by Position
plt.figure(figsize= (16, 8))

plt.subplot(1,2,1)

sns.barplot(x = 'Pos', y= 'all_star', data = single_season_df, ci = None)
plt.xlabel('Position', size = 18)
plt.ylabel('All-Star %', size = 18)
plt.title('All Star % by Position', size = 22)
plt.xticks(size = 16)
plt.yticks(size = 16)

plt.subplot(1,2,2)
single_season_df[single_season_df['all_star'] == 1]['Pos'].value_counts(normalize = True).plot(kind = 'bar')
plt.title('Position of All-Star Players', size = 22)
plt.xlabel('Position', size = 18)
plt.ylabel("All-Star %", size = 18)
plt.xticks(rotation = 0, size = 16)
plt.yticks(size = 16);

- G: Guard
- F: Forward
- C: Center

Centers had the highest rate of becoming an All-Star among the players drafted since 2008, at approximately 11%. Forwards were least likely, at approximately 4%. It is important to note that far fewer centers have entered the draft than either guards or forwards. Despite this, the hit rate for centers is higher than for the other positions. Of the players that have made an All-Star game, 83% have been guards and forwards. This makes sense since historically there have been two guards, two forwards and one center in a starting lineup. However, over the last several years this standard lineup has changed as the game has become more focused on shooting and on playing smaller, more athletic players.

In [None]:
single_season_df['Pos'].value_counts()

In [None]:
single_season_df.groupby('Pos')['all_star'].mean()

In [None]:
single_season_df[single_season_df['all_star'] == 1]['Pos'].value_counts(normalize = True)

In [None]:
single_season_df.groupby('all_star').mean().T

Not many of the stats show large differences between All-Stars and non-All-Stars. All-Stars take and make more free throws. This could be because better college players tend to be more aggressive, which is correlated with getting fouled more often. All-Stars also have a higher popag and are drafted significantly earlier. This makes sense since the players drafted earliest are supposed to be the best. It was also surprising to me that the points per game for All-Stars were only slightly higher (approx. 1 point) than for non-All-Stars.


In [None]:
single_season_df.groupby('all_star')['ht'].describe().T

In [None]:
single_season_df[(single_season_df['all_star'] == 1) & (single_season_df['ht'] == 86)]

In [None]:
single_season_df[(single_season_df['all_star'] == 1) & (single_season_df['ht'] == 69)]

The shortest player to make an All-Star team is only 5 foot 9 inches (Isaiah Thomas) and the tallest person is 7 foot 2 inches (Roy Hibbert).  The average All-Star is 6 foot 6 inches.
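
The filters above use raw `ht` values of 69 and 86, which the text reads as 5 foot 9 and 7 foot 2. Assuming `ht` is stored in inches, the conversion can be sketched as follows; the helper name is hypothetical.

```python
def inches_to_feet_inches(total_inches):
    """Convert a height in inches to a (feet, inches) tuple. Hypothetical helper."""
    return divmod(int(total_inches), 12)

# 69 and 86 are the shortest and tallest All-Star heights quoted in the text
for ht in (69, 86):
    feet, inches = inches_to_feet_inches(ht)
    print(f"{ht} in = {feet} ft {inches} in")
```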

In [None]:
single_season_df.groupby('all_star')['GP'].describe().T

In [None]:
single_season_df[(single_season_df['all_star'] == 1) & (single_season_df['GP'] == 11)]

In [None]:
single_season_df[(single_season_df['all_star'] == 1) & (single_season_df['GP'] == 41)]

In [None]:
single_season_df[single_season_df['all_star'] ==1].hist(column = 'GP', figsize = (10,6))
plt.title('College Games Played for All-Stars', size = 20)
plt.xlabel('Games Played', size = 14)
plt.ylabel('Frequency', size = 14);
                                  
                                  

The majority of All-Stars played between 30 and 40 games in their final college season. Kyrie Irving is the only one who played fewer than 25 games: he played 11 games in his Freshman year at Duke. Irving went on to be the first pick in the draft and has been a perennial All-Star, so he is clearly not the norm. The majority of All-Stars played over 35 games in their final season, which means they played deep into the NCAA tournament and signifies that they were on successful teams. It also shows that NBA All-Stars were healthy in college.

In [None]:
#columns to drop for histograms
drop = ['player_name','school', 'conference', 'year', 'pid', 'pick', 'all_star', 'Pos']


In [None]:
#Created a function below to create histograms for many features.
#Code assistance from Andrew Bergman
def make_histograms(df, list_of_columns):
    fig = plt.figure(figsize = (20, 40))                         # One large figure holds every subplot
    for count, column in enumerate(list_of_columns, start = 1):  # count sets the location of each subplot
        ax = fig.add_subplot(16, 3, count)                       # Grid of 16 rows x 3 columns (up to 48 plots)
        df.hist(column = column, ax = ax)                        # Size comes from the figure, so no figsize here
        ax.set_title(column, size = 18)
    plt.tight_layout();

In [None]:
#histograms for numerical columns
make_histograms(single_season_df, single_season_df.drop(columns = drop).columns)

In [None]:
single_season_df.groupby('Pos')['TPM'].mean()

Many of the stats look to be normally distributed. Several features are skewed to the left, including games played, minutes percentage and free throw percentage. They are skewed to the left because a few players are outliers with values far lower than the average. Games played will be low for players that have been injured. Minutes played tend to be low when a player isn't a starter, which is rare for a player that gets drafted. The features related to three point shooting are skewed to the right, since not all players shoot a large number of threes. Guards make by far the most three pointers in a season: approximately 60, compared to 24 and 6.5 for forwards and centers respectively.
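
The direction of skew read off the histograms can also be checked numerically. A minimal sketch using `DataFrame.skew()` on invented toy values (the column names `GP` and `TPM` match the notebook, the numbers do not): a negative value indicates a long left tail and a positive value a long right tail.

```python
import pandas as pd

# Toy values invented for illustration (not the project data)
toy = pd.DataFrame({
    "GP":  [11, 30, 32, 33, 34, 35, 35, 36, 37, 38],  # one low outlier pulls the left tail
    "TPM": [0, 0, 1, 2, 3, 5, 8, 20, 45, 90],         # a few high-volume shooters pull the right tail
})

skew = toy.skew()  # sample skewness of each column
print(skew)
```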

In [None]:
#histograms for numerical columns for only the players that have made an all-star game
make_histograms(single_season_df[single_season_df['all_star'] == 1],single_season_df.drop(columns = drop).columns)



In [None]:
single_season_df[(single_season_df['all_star'] == 1) & (single_season_df['GP'] < 25)]

In [None]:
single_season_df[(single_season_df['all_star'] == 1) & (single_season_df['FTA'] > 250)]

It is easier to spot outliers when focusing on just the players that have been All-Stars. For example, with the games played column, it is easy to see that there was only one player that played fewer than 25 games in a season, and only one that played less than 40% of a team's minutes (both Kyrie Irving). It becomes easier to filter the data when looking at the distributions. For example, we can quickly look at the players with the most free throws attempted (see above) and see that it is a combination of guards and forwards.

In [None]:
#columns to drop for correlation map
corr_drop = ['player_name','school', 'conference', 'year', 'pid', 'pick', 'Pos']


In [None]:
#Looking at correlation heatmap for numerical columns versus the all_star column
plt.figure(figsize = (2,20))

sns.heatmap(single_season_df.drop(columns = corr_drop).corr()[['all_star']].sort_values(by = 'all_star', ascending = False), 
            annot = True, 
            cmap = 'RdBu',
            vmin = -1,
            vmax = 1,
            annot_kws = {"size":12})
plt.xticks(size = 14)
plt.yticks(size = 14);



Very few numerical columns are highly correlated with becoming an All-Star. In fact, the majority of columns show little to no correlation at all. This may make it difficult for models to predict the target variable correctly.
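
The heatmap column is the Pearson correlation of each feature with a 0/1 target (also known as the point-biserial correlation). The same ranking can be produced as a plain sorted Series, which is easier to scan than a tall heatmap. The sketch below uses invented toy values; `pick` and `FTA` are column names from this notebook, but the numbers are made up.

```python
import pandas as pd

# Toy values invented for illustration (not the project data)
toy = pd.DataFrame({
    "all_star": [0, 0, 0, 0, 0, 0, 1, 1],
    "pick":     [45, 30, 22, 51, 14, 38, 1, 3],
    "FTA":      [90, 120, 150, 60, 200, 110, 240, 210],
})

# Correlation of every numeric feature with the target, strongest positive first
target_corr = (
    toy.corr()["all_star"]
       .drop("all_star")
       .sort_values(ascending=False)
)
print(target_corr)
```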

### Creating Dummy Columns

I am going to turn the conference and position columns into dummy variables before modeling the data in order to represent the categorical features.

In [None]:
#creating dummy columns
dummy_columns = ['conference', 'Pos']

single_season_df = pd.get_dummies(data = single_season_df, columns = dummy_columns, drop_first= True)
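
To illustrate what `drop_first = True` does, here is a small sketch on three invented players. With positions C, F and G, the first category (C) is dropped and becomes the baseline, so a center is represented by zeros in both remaining columns.

```python
import pandas as pd

# Three invented players, one per position
toy = pd.DataFrame({"player_name": ["A", "B", "C"], "Pos": ["C", "F", "G"]})

dummies = pd.get_dummies(data=toy, columns=["Pos"], drop_first=True)
print(dummies.columns.tolist())
print(dummies)
```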

### More EDA After Modeling

I am going to reduce some of the features that were initially used in the model. Several of the features overlap in the way they are calculated, so some of them can be removed.

- FTM, FTA and FT_per 
- twoPM, twoPA and twoP_per
- TPM, TPA and TP_per

Before I remove the columns that are not percentages, I am going to create a new percentage column using Bayesian statistics. There is no minimum number of shot attempts in the original calculations, so someone who has taken only a few shots and made a high percentage of them is weighted the same as someone who took many attempts at a lower percentage. I want to make sure my percentage data isn't skewed by this case. I am going to use prior knowledge along with a likelihood function to calculate the posterior distribution for each field goal type. I will then calculate the maximum a posteriori (MAP) estimate for each field goal type.
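
Written out, the update described above is the standard Beta-Binomial conjugate result. The symbols $m$ (shots made) and $n$ (shots attempted) are introduced here for illustration; $\alpha$ and $\beta$ correspond to `alpha_prior` and `beta_prior` in the cells that follow.

$$
p \sim \mathrm{Beta}(\alpha, \beta), \qquad m \mid p \sim \mathrm{Binomial}(n, p)
\;\Longrightarrow\;
p \mid m \sim \mathrm{Beta}(\alpha + m,\ \beta + n - m)
$$

$$
\hat{p}_{\mathrm{MAP}} = \frac{\alpha + m - 1}{\alpha + \beta + n - 2}
$$

The MAP estimate is the mode of the posterior Beta distribution, which is defined by this formula when both posterior parameters are greater than 1.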

**Free Throws**

Our prior belief for free throws is that out of 100 shots a player would make 70 free throws. This data was calculated by averaging all free throw attempts in college over the last 10 years. The calculation can be seen in the **Data Scrape - Team Stats for Bayes Analysis** notebook.

In [None]:
#Graphically showing the prior
alpha_prior = 71
beta_prior = 31

distn = stats.beta(alpha_prior, beta_prior)

x_axis = np.linspace(0, 1, 100)

plt.plot(x_axis, distn.pdf(x_axis));

The graph above shows the beta distribution for free throw percentage given the prior knowledge that 70% is the historical average. You can see that the peak of the distribution graph occurs around 70%.

In [None]:
single_season_df.head()

In [None]:
#create misses column
single_season_df['FT_misses'] = single_season_df['FTA'] - single_season_df['FTM']

In [None]:
single_season_df['new_FTM'] = single_season_df['FTM'] + alpha_prior
single_season_df['new_FT_misses'] = single_season_df['FT_misses'] + beta_prior

In [None]:
#Creating MAP for Free Throw Average
single_season_df['new_FT_avg'] = (single_season_df['new_FTM'] - 1) / (single_season_df['new_FTM'] + single_season_df['new_FT_misses'] - 2)


In [None]:
plt.figure(figsize = (12, 8))
sns.regplot(x = single_season_df['FT_per'], y = single_season_df['new_FT_avg'], ci = None)
plt.xlabel('Original Free Throw %', size = 18)
plt.ylabel('Adjusted Free Throw %', size = 18)
plt.title('Free Throw % (Bayes)', size = 22)
plt.xticks(size = 16)
plt.yticks(size = 16)
plt.xlim(.3, 1)
plt.ylim(.3, 1);


Utilizing the Beta distribution to conduct Bayesian inference on free throw percentage produced the results above. The adjustment is larger for players with extremely high or low rates, and larger still when those rates come from only a few attempts. Values around 70% barely changed as they were already close to the prior.
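
A small worked example may make the size of the adjustment more concrete. The function below is a hypothetical helper that repeats the arithmetic of the cells above for a single player, using the free throw prior (`alpha_prior = 71`, `beta_prior = 31`). The two players are invented.

```python
def beta_map(makes, attempts, alpha_prior, beta_prior):
    """MAP estimate of a shooting percentage under a Beta(alpha_prior, beta_prior) prior."""
    misses = attempts - makes
    alpha_post = makes + alpha_prior   # posterior alpha
    beta_post = misses + beta_prior    # posterior beta
    return (alpha_post - 1) / (alpha_post + beta_post - 2)

# Invented players: one with 2 attempts (raw 100%), one with 200 attempts (raw 90%)
low_volume = beta_map(makes=2, attempts=2, alpha_prior=71, beta_prior=31)
high_volume = beta_map(makes=180, attempts=200, alpha_prior=71, beta_prior=31)

print(round(low_volume, 3), round(high_volume, 3))  # 0.706 0.833
```

The 2-for-2 shooter is pulled almost all the way back to the prior, while the 180-for-200 shooter keeps most of the observed advantage.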

---

**3-point Shots**

Our prior belief for three point shots is that out of 100 shots a player would make 35 three point shots. This data was calculated by averaging all three point attempts in college over the last 10 years. The calculation can be seen in the **Data Scrape - Team Stats for Bayes Analysis** notebook.


In [None]:
#Graphically showing the prior
alpha_prior = 36
beta_prior = 66

distn = stats.beta(alpha_prior, beta_prior)

x_axis = np.linspace(0, 1, 100)

plt.plot(x_axis, distn.pdf(x_axis));

The graph above shows the beta distribution for 3-point percentage given the prior knowledge that 35% is the historical average. You can see that the peak of the distribution graph occurs around 35%.

In [None]:
#create misses column
single_season_df['3P_misses'] = single_season_df['TPA'] - single_season_df['TPM']

In [None]:
single_season_df['new_TPM'] = single_season_df['TPM'] + alpha_prior
single_season_df['new_3P_misses'] = single_season_df['3P_misses'] + beta_prior

In [None]:
#Creating MAP for 3 Point Average
single_season_df['new_3P_avg'] = (single_season_df['new_TPM'] - 1) / (single_season_df['new_TPM'] + single_season_df['new_3P_misses'] - 2)


In [None]:
plt.figure(figsize = (12, 8))
sns.regplot(x = single_season_df['TP_per'], y = single_season_df['new_3P_avg'], ci = None)
plt.xlabel('Original 3-Point %', size = 18)
plt.ylabel('Adjusted 3-Point %', size = 18)
plt.title('3-Point % (Bayes)', size = 22)
plt.xticks(size = 16)
plt.yticks(size = 16);


Adjusting 3-point percentages by utilizing the Beta distribution produced the results above. The Beta distribution will adjust the percentage more for players with extremely high or low 3-point percentages. For example, players that had a 100% original shooting percentage were adjusted down and ones that had a 0% were adjusted up. Values around 35% did not change as they had a similar value to the prior.

---

**2-Point Shots**

Our prior belief for two point shots is that out of 100 shots a player would make 49 two point shots. This data was calculated by averaging all two point attempts in college over the last 10 years. The calculation can be seen in the **Data Scrape - Team Stats for Bayes Analysis** notebook.


In [None]:
#Graphically showing the prior
alpha_prior = 50
beta_prior = 52

distn = stats.beta(alpha_prior, beta_prior)

x_axis = np.linspace(0, 1, 100)

plt.plot(x_axis, distn.pdf(x_axis));

The graph above shows the beta distribution for 2-point percentage given the prior knowledge that 49% is the historical average. You can see that the peak of the distribution graph occurs around 49%.
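
Each prior in this notebook uses `alpha_prior` = expected makes + 1 and `beta_prior` = expected misses + 1 out of 100 imagined shots, which places the peak (mode) of the Beta distribution exactly on the historical average. The following sketch checks this with the closed-form mode and mean of a Beta distribution. The dictionary keys are labels chosen here for illustration.

```python
# (alpha_prior, beta_prior) pairs used in this notebook
priors = {"free_throw": (71, 31), "three_point": (36, 66), "two_point": (50, 52)}

modes, means = {}, {}
for name, (a, b) in priors.items():
    modes[name] = (a - 1) / (a + b - 2)  # peak of the Beta(a, b) density
    means[name] = a / (a + b)            # mean of the Beta(a, b) density
    print(f"{name}: mode={modes[name]:.3f}, mean={means[name]:.3f}")
```

The mean differs slightly from the mode (for example about 0.696 versus 0.700 for free throws), which is why the MAP formula, not the posterior mean, returns the historical average when a player has no attempts.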

In [None]:
#create misses column
single_season_df['2P_misses'] = single_season_df['twoPA'] - single_season_df['twoPM']

In [None]:
single_season_df['new_2PM'] = single_season_df['twoPM'] + alpha_prior
single_season_df['new_2P_misses'] = single_season_df['2P_misses'] + beta_prior

In [None]:
#Creating MAP for 2 Point Average
single_season_df['new_2P_avg'] = (single_season_df['new_2PM'] - 1) / (single_season_df['new_2PM'] + single_season_df['new_2P_misses'] - 2)


In [None]:
plt.figure(figsize = (12, 8))
sns.regplot(x = single_season_df['twoP_per'], y = single_season_df['new_2P_avg'], ci = None)
plt.title('2-Point % (Bayes)', size = 22)
plt.xlabel('Original 2-Point %', size = 18)
plt.ylabel('Adjusted 2-Point %', size = 18)
plt.xticks(size = 16)
plt.yticks(size = 16)
plt.xlim(.3, .8)
plt.ylim(.3, .8);


Adjusting 2-point percentages by utilizing the Beta distribution produced the results above. The Beta distribution will adjust the percentage more for players with extremely high or low 2-point percentages. For example, players that had an original shooting percentage above 70% were adjusted down and ones that had below 40% were adjusted up. Values around 49% barely changed as they were already close to the prior.
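
The same three steps (misses, posterior counts, MAP) are repeated above for each shot type. One possible way to fold them into a single reusable function is sketched below. `add_beta_map` is a hypothetical helper, not part of the original notebook, and the toy rows are invented.

```python
import pandas as pd

def add_beta_map(df, made_col, att_col, alpha_prior, beta_prior, out_col):
    """Return a copy of df with a MAP-adjusted percentage column (hypothetical helper)."""
    out = df.copy()
    alpha_post = out[made_col] + alpha_prior                  # makes + prior alpha
    beta_post = (out[att_col] - out[made_col]) + beta_prior   # misses + prior beta
    out[out_col] = (alpha_post - 1) / (alpha_post + beta_post - 2)
    return out

# Invented rows: raw 2-point percentages of 100%, 50% and about 33%
toy = pd.DataFrame({"twoPM": [3, 150, 40], "twoPA": [3, 300, 120]})
toy = add_beta_map(toy, "twoPM", "twoPA", alpha_prior=50, beta_prior=52, out_col="new_2P_avg")
print(toy)
```

The 3-for-3 shooter ends up near 50%, while the two higher-volume players stay close to their observed percentages.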


In [None]:
single_season_df.head()

In [None]:
single_season_df.groupby('all_star').mean().T

#### Saving Files for Modeling

In [None]:
#turning dataframe to CSV to use in separate notebook for modeling.
single_season_df.to_csv('../Data_Files/model_single_season.csv')

In [None]:
#creating file with Bayes columns
single_season_df.to_csv('../Data_Files/model_single_season_bayes.csv')
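
One detail to keep in mind when these files are loaded again: `to_csv` writes the DataFrame index by default. If the modeling notebook reads the files with `pd.read_csv` defaults (an assumption, since that notebook is not shown here), the index comes back as an extra `Unnamed: 0` column. The sketch below demonstrates the behavior on an invented two-row frame, using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({"player_name": ["A", "B"], "new_FT_avg": [0.71, 0.68]})

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

Either passing `index = False` when saving or `index_col = 0` when reading avoids carrying the stray column into the model features.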