# Intro to Data Science - Hands-On Approach

## *Predict Fantasy Point Totals for Upcoming NBA Games*
- Step 1:  Prepare Data and Engineer Features 
- Step 2:  Split Data into Train and Test Data
- Step 3:  Train Models with Cross-Validation and Regularization
- Step 4:  Compare and Contrast Individual Model Performance
- Step 5:  Choose Best Model and Evaluate Performance
- Step 6:  Lessons Learned and Next Steps

#### Prepare Data
http://www.kdnuggets.com/2016/10/data-preparation-tips-tricks-tools.html

In [40]:
import pandas as pd
pd.options.display.max_columns = 999
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# read in current season average stats for all players
df = pd.read_csv('../data/current_allplayer_stats_2017-02-16.csv')

In [4]:
# drop duplicate columns and columns without any information
columns_to_drop = ['Rk.x', 'Rk.y', 'Age.y', 'G.y', 'MP.y', 'X.', 'X..1']
df.drop(columns_to_drop, axis=1, inplace=True)  # axis 0 for rows, 1 for columns

In [5]:
# filter out the benchwarmers
df = df[df['MP.x'] > 10]

In [6]:
df.head()

Unnamed: 0,Player,Pos,Tm,Rk.x,Age.x,G.x,GS,MP.x,FG,FGA,FG.,X3P,X3PA,X3P.,X2P,X2PA,X2P.,eFG.,FT,FTA,FT.,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PS.G,PER,TS.,X3PAr,FTr,ORB.,DRB.,TRB.,AST.,STL.,BLK.,TOV.,USG.,OWS,DWS,WS,WS.48,OBPM,DBPM,BPM,VORP
1,Aaron Brooks,PG,IND,57,32,46,0,14.3,2.0,4.9,0.407,0.7,2.2,0.337,1.3,2.7,0.464,0.482,0.5,0.6,0.793,0.3,0.8,1.1,2.2,0.5,0.2,1.1,1.4,5.2,10.3,0.505,0.447,0.128,2.4,6.3,4.4,22.5,1.7,1.0,17.3,19.5,-0.2,0.4,0.2,0.017,-1.7,-2.4,-4.0,-0.3
2,Aaron Gordon,SF,ORL,150,21,56,48,27.6,4.3,10.0,0.428,1.0,3.4,0.292,3.3,6.6,0.499,0.478,1.6,2.4,0.65,1.3,3.3,4.6,1.9,0.7,0.4,1.1,2.1,11.2,12.5,0.503,0.342,0.244,5.1,13.4,9.1,11.2,1.3,1.3,9.1,19.6,0.5,1.2,1.6,0.051,-1.0,-0.4,-1.4,0.2
5,Al Horford,C,BOS,194,30,44,44,33.0,5.8,12.7,0.459,1.5,4.3,0.354,4.3,8.4,0.512,0.519,1.6,2.0,0.807,1.4,5.3,6.7,5.0,0.8,1.5,1.8,2.2,14.8,17.8,0.545,0.339,0.158,4.6,18.1,11.4,23.8,1.1,3.9,11.4,20.7,2.2,1.6,3.9,0.128,1.2,1.9,3.1,1.9
6,Al Jefferson,C,IND,211,32,56,1,14.6,3.7,7.4,0.501,0.0,0.0,0.0,3.7,7.4,0.502,0.501,1.1,1.4,0.772,1.2,3.1,4.3,0.9,0.3,0.3,0.5,1.9,8.5,19.2,0.53,0.002,0.191,9.3,23.2,16.4,11.3,1.0,1.6,5.9,25.9,1.2,0.9,2.1,0.123,-1.2,-1.5,-2.6,-0.1
7,Al-Farouq Aminu,SF,POR,10,26,37,23,29.1,2.9,7.9,0.367,1.1,3.7,0.307,1.8,4.2,0.42,0.439,1.4,1.8,0.746,1.2,6.2,7.4,2.1,1.0,0.6,1.5,1.8,8.3,10.5,0.476,0.466,0.228,4.3,24.0,14.0,10.1,1.7,1.8,14.8,15.5,-0.4,1.0,0.7,0.03,-2.7,1.2,-1.6,0.1


#### Feature Engineering
- Calculate Mean Player Performance over Last 5 Games
- Calculate Mean Points Per Attempt

https://en.wikipedia.org/wiki/Feature_engineering
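
As a rough illustration of the two engineered features listed above, the sketch below builds them for a single player. The six-game log is invented for the example, and it assumes that `FG` and `FGA` already include three-pointers (as in the season table, where `FG` equals `X3P` plus `X2P`).

```python
import pandas as pd

# Hypothetical six-game log for one player (values invented for illustration)
game_log = pd.DataFrame({
    'FG':   [4, 6, 3, 5, 7, 2],
    'FGA':  [10, 12, 8, 11, 15, 6],
    'X3P':  [1, 2, 0, 1, 3, 0],
    'X3PA': [3, 5, 2, 4, 7, 1],
    'FT':   [2, 1, 4, 0, 3, 2],
    'FTA':  [2, 2, 5, 0, 4, 2],
})

recent = game_log.iloc[:5]   # the five games used as input features
outcome = game_log.iloc[5]   # the following game, i.e. the prediction target

# feature 1: mean performance over the five recent games
recent_mean = recent.mean()

# feature 2: mean points per attempt
# every made field goal counts 2 via FG; X3P adds the extra point for a three
points = 2 * recent_mean['FG'] + recent_mean['X3P'] + recent_mean['FT']
attempts = recent_mean['FGA'] + recent_mean['FTA']
recent_mean['ppa'] = points / attempts

print(recent_mean)
```

Keeping the sixth game out of the mean matters: it is the outcome, so including it in the features would leak the target into the inputs.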

In [None]:
# drop the following columns, which won't average properly and aren't helpful
columns_to_drop = ['FG.', 'X3P.', 'FT.']

# initialize dictionary for containing mean player performance over the last 5 games
d_recent = {}
# initialize dictionary for containing the player performance for the next game
d_outcome = {}

# loop through each player in the NBA and calculate mean recent player performance
for player in df.Player:
    f = '../data/game_log_{}_2017-02-16.csv'.format(player.replace(" ", ""))
    
    # read the last 6 rows of the game log: 5 recent games plus the game to predict
    gl = pd.read_csv(f).tail(6)
    
    # subset gl to get columns with relevant data
    # (drop returns a new frame, which avoids modifying a slice in place)
    gl = gl.iloc[:, 10:].drop(columns_to_drop, axis=1)
    
    # split gl into recent and outcome portions
    gl_recent = gl[0:5]
    gl_outcome = gl[5:6]
    
    # get mean of each column in gl_recent
    gl_mean = gl_recent.mean()
    
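    # calculate points per attempt and append to gl_mean
    # FG and FGA already include three-pointers (in the season table FG = X3P + X2P),
    # so a made three adds only 1 extra point and X3PA is not added to the attempts
    gl_mean['ppa'] = ((gl_mean['FG'] * 2 + gl_mean['X3P'] * 1 + gl_mean['FT'] * 1) /
                      (gl_mean['FGA'] + gl_mean['FTA']))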
    
    d_recent[player] = gl_mean
    d_outcome[player] = gl_outcome.squeeze()  # returns series instead of df

#### Combine Input and Engineered Feature Data

In [88]:
# convert d_recent and d_outcome to dataframes
df_recent = pd.DataFrame.from_dict(d_recent, orient='index').reset_index()
df_recent.columns = ['recent_{}'.format(c) for c in df_recent.columns]
df_outcome = pd.DataFrame.from_dict(d_outcome, orient='index').reset_index()

In [98]:
data = df.merge(df_recent, left_on='Player', right_on='recent_index')
data.drop('recent_index', axis=1, inplace=True)
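
Because `merge` defaults to an inner join, any player without a matching row in `df_recent` is dropped without warning. One way to surface such players is sketched below; the two small frames are invented stand-ins for `df` and `df_recent`, and `how='left'` with `indicator=True` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Hypothetical stand-ins for df and df_recent (values invented for illustration)
season = pd.DataFrame({'Player': ['A', 'B', 'C'], 'PS.G': [5.2, 11.2, 14.8]})
recent = pd.DataFrame({'recent_index': ['A', 'C'], 'recent_PTS': [6.0, 15.4]})

# a left join keeps every season row; the _merge column records where each row matched
checked = season.merge(recent, left_on='Player', right_on='recent_index',
                       how='left', indicator=True)

# players present in the season table but missing recent-game features
missing = checked.loc[checked['_merge'] == 'left_only', 'Player'].tolist()
print(missing)
```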

#### DRAFTKINGS NBA FANTASY SCORING RULES
- Point = +1 PT
- Made 3pt. shot = +0.5 PTs
- Rebound = +1.25 PTs
- Assist = +1.5 PTs
- Steal = +2 PTs
- Block = +2 PTs
- Turnover = -0.5 PTs
- Double-Double = +1.5 PTs (MAX 1 PER PLAYER: Points, Rebounds, Assists, Blocks, Steals)
- Triple-Double = +3 PTs (MAX 1 PER PLAYER: Points, Rebounds, Assists, Blocks, Steals)

In [127]:
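# define function to calculate fantasy points for each player
def fantasy_points(data):
    """
    Calculate DraftKings fantasy points from a dataframe of box score stats.

    Unfinished: the basic points are computed, but the double-double and
    triple-double bonuses are not applied yet, so for now the function
    returns the five bonus-eligible stat columns for inspection.
    """
    # calculate basic fantasy points (each point scored is worth +1)
    basic_pts = (data['PTS'] * 1 +
                 data['X3P'] * 0.5 +
                 data['TRB'] * 1.25 +
                 data['AST'] * 1.5 +
                 data['STL'] * 2 +
                 data['BLK'] * 2 +
                 data['TOV'] * -0.5)
    
    # calculate bonus fantasy points
    # get the columns that count toward double-doubles and triple-doubles
    important_columns = ['PTS', 'TRB', 'AST', 'BLK', 'STL']
    data2 = data[important_columns]
    # placeholder: passes each column through unchanged
    bonus_pts = data2.apply(lambda col: col.values)
    
    return bonus_pts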
    

In [128]:
(df_outcome > 10).any()

index    True
FG       True
FGA      True
X3P      True
X3PA     True
FT       True
FTA      True
ORB      True
DRB      True
TRB      True
AST      True
STL      True
BLK      True
TOV      True
PF       True
PTS      True
GmSc     True
X...     True
dtype: bool

In [129]:
fantasy_points(df_outcome)

Unnamed: 0,PTS,TRB,AST,BLK,STL
0,5,1,2,0,0
1,8,6,3,1,2
2,16,9,3,1,0
3,2,3,3,1,1
4,14,15,2,0,1
5,3,3,1,0,0
6,5,2,1,0,2
7,8,2,0,0,1
8,10,7,0,1,0
9,12,10,0,2,3
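
The output above shows only the bonus-eligible columns, so the double-double and triple-double rules still need to be applied. A minimal sketch of one way to finish the calculation follows. The function name `fantasy_points_with_bonus` and the sample rows are invented for illustration, and the sketch assumes the two bonuses stack, so a triple-double earns 1.5 + 3 points.

```python
import pandas as pd

def fantasy_points_with_bonus(stats):
    """Return total DraftKings fantasy points per row (hypothetical helper)."""
    # basic points, following the scoring rules listed earlier
    basic_pts = (stats['PTS'] * 1 +
                 stats['X3P'] * 0.5 +
                 stats['TRB'] * 1.25 +
                 stats['AST'] * 1.5 +
                 stats['STL'] * 2 +
                 stats['BLK'] * 2 +
                 stats['TOV'] * -0.5)

    # count the bonus categories in which the player reached double digits
    bonus_columns = ['PTS', 'TRB', 'AST', 'BLK', 'STL']
    n_double = (stats[bonus_columns] >= 10).sum(axis=1)

    # assumption: a triple-double also earns the double-double bonus
    bonus_pts = (n_double >= 2) * 1.5 + (n_double >= 3) * 3.0

    return basic_pts + bonus_pts

# invented box scores: a double-double, an ordinary game, and a triple-double
sample = pd.DataFrame({
    'PTS': [14, 5, 20],
    'X3P': [1, 1, 2],
    'TRB': [15, 1, 10],
    'AST': [2, 2, 11],
    'STL': [1, 0, 1],
    'BLK': [0, 0, 0],
    'TOV': [2, 1, 3],
})
totals = fantasy_points_with_bonus(sample)
print(totals)
```

Note the `>= 10` comparison: exactly 10 in a category counts toward a double-double, which a strict `> 10` check would miss.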


#### Train Models with Cross-Validation and Regularization
http://scikit-learn.org/stable/modules/cross_validation.html
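
As a hedged sketch of what this step can look like, the example below holds out a test set, then uses 5-fold cross-validation on the training set to choose the strength of an L2 (ridge) penalty. The data is synthetic, and ridge regression and the `alpha` grid are assumptions made for illustration, not necessarily the models used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic regression data standing in for the player features and fantasy points
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# hold out a test set that plays no part in model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 5-fold cross-validation over the regularization strength alpha
search = GridSearchCV(Ridge(),
                      param_grid={'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_train, y_train)

best_alpha = search.best_params_['alpha']
# R^2 of the refit best model on the held-out test set
test_r2 = search.best_estimator_.score(X_test, y_test)
print(best_alpha, test_r2)
```

Scoring the chosen model only once on the held-out data keeps the test estimate from being biased by the search over `alpha`.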