<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Predicting Shots Made Per Game by Kobe Bryant

_Authors: Kiefer Katovich (SF)_

---

In this lab you'll be using regularized regression penalties — ridge, lasso, and elastic net — to try and predict how many shots Kobe Bryant made per game during his career.

The Kobe Shots data set contains hundreds of columns representing different characteristics of each basketball game. Fitting an ordinary linear regression using every predictor would dramatically overfit the model, considering the limited number of observations (games) we have available. Plus, many of the predictors have significant multicollinearity. 


**Warning:** Some of these calculations are computationally expensive and may take a while to execute. It may be worthwhile to only use a portion of the data to perform these calculations, especially if you've experienced kernel issues in the past.
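
As a rough sketch of the subsetting idea in the warning above, a random sample of rows can be drawn with pandas before any modeling. The `games` DataFrame here is a hypothetical stand-in, not the Kobe data, and the 25% fraction is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kobe DataFrame (assumption: 1,000 rows, 20 columns).
rng = np.random.RandomState(42)
games = pd.DataFrame(rng.normal(size=(1000, 20)))

# Keep a random 25% of the rows; random_state makes the subset reproducible.
games_subset = games.sample(frac=0.25, random_state=42)
print(games_subset.shape)
```

Fewer rows make every cross-validation fit cheaper, at the cost of noisier score estimates.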

---

### 1) Load packages and data.

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
kobe = pd.read_csv('./datasets/kobe_superwide_games.csv')

---

### 2) Examine the data.

- How many columns are there?
- Examine what the observations (rows) and columns represent.
- Why might regularization be particularly useful for modeling this data?

In [3]:
# A:
kobe.columns

Index(['SHOTS_MADE', 'AWAY_GAME', 'SEASON_OPPONENT:atl:1996-97',
       'SEASON_OPPONENT:atl:1997-98', 'SEASON_OPPONENT:atl:1999-00',
       'SEASON_OPPONENT:atl:2000-01', 'SEASON_OPPONENT:atl:2001-02',
       'SEASON_OPPONENT:atl:2002-03', 'SEASON_OPPONENT:atl:2003-04',
       'SEASON_OPPONENT:atl:2004-05',
       ...
       'ACTION_TYPE:tip_layup_shot', 'ACTION_TYPE:tip_shot',
       'ACTION_TYPE:turnaround_bank_shot',
       'ACTION_TYPE:turnaround_fadeaway_bank_jump_shot',
       'ACTION_TYPE:turnaround_fadeaway_shot',
       'ACTION_TYPE:turnaround_finger_roll_shot',
       'ACTION_TYPE:turnaround_hook_shot', 'ACTION_TYPE:turnaround_jump_shot',
       'SEASON_GAME_NUMBER', 'CAREER_GAME_NUMBER'],
      dtype='object', length=645)

In [4]:
kobe.shape #1558 rows, 645 columns

(1558, 645)

In [5]:
kobe

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,1,1
1,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,2,2
2,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,3,3
3,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,4,4
4,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,5,5
5,1.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,6,6
6,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,7,7
7,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,8,8
8,4.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,9,9
9,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,10,10


In [6]:
kobe.isnull().sum().head()

SHOTS_MADE                     0
AWAY_GAME                      0
SEASON_OPPONENT:atl:1996-97    0
SEASON_OPPONENT:atl:1997-98    0
SEASON_OPPONENT:atl:1999-00    0
dtype: int64

Why might regularization be particularly useful for modeling this data?  
* A) There are 1,558 rows and 645 columns of seemingly clean data here. That is roughly one million values, but only about 2.4 games per column. Many of these columns are either trivial or heavily collinear with other columns. Regularization shrinks or eliminates their coefficients, which reduces the effect of collinearity so that our models are not overfit to the training data and generalize better to testing data and new data.

---

### 3) Create predictor and target variables. Standardize the predictors.

Why is normalization necessary for regularized regressions?

Use the `sklearn.preprocessing` class `StandardScaler` to standardize the predictors.

In [7]:
# A: The ridge and lasso penalties are applied to the size of the coefficients, and a coefficient's size
# depends on the scale of its predictor. Standardizing puts every predictor on the same scale, so no
# variable is penalized more or less than another simply because of its units.
from sklearn.preprocessing import StandardScaler

y = kobe['SHOTS_MADE']
X = kobe.drop(columns='SHOTS_MADE')  # returns a copy, so kobe itself is left unchanged

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the scaler on the training split only, then apply the same transformation to both splits.
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
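
One refinement worth considering is to put the scaler inside a pipeline so that it is refit on the training portion of every cross-validation fold. The sketch below uses synthetic data as a stand-in for the Kobe predictors (the shapes, scales, and `alpha=10.0` are assumptions made only for illustration).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 200 rows, 50 predictors on very different scales.
rng = np.random.RandomState(0)
column_scales = rng.uniform(0.1, 100, size=50)
X_demo = rng.normal(size=(200, 50)) * column_scales
y_demo = X_demo[:, 0] / column_scales[0] + rng.normal(scale=0.5, size=200)

# The scaler is refit on the training part of each fold, so the held-out
# fold never influences the means and standard deviations used for scaling.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
pipe_scores = cross_val_score(scaled_ridge, X_demo, y_demo, cv=5)
print(pipe_scores.mean())
```

With a single up-front `StandardScaler` fit, the effect on the scores is usually small, but the pipeline keeps the cross-validation estimate free of that leakage.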


---

### 4) Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.

Cross-validate the $R^2$ of an ordinary linear regression model with 10 cross-validation folds.

How does it perform?

In [8]:
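# A: It performs very poorly. The model fits the training data well (R^2 of about 0.82), but the
# cross-validated and test R^2 values are enormous negative numbers, so it does not generalize to unseen data at all.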


In [9]:
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
lr.score(X_train, y_train)

0.8163552834314866

In [11]:
lr.score(X_test, y_test)

-2.1621564545103972e+26

In [12]:
from sklearn.model_selection import KFold, cross_val_score
cv_scores = cross_val_score(lr, X_train, y_train, cv=10)

cv_scores


array([-2.71838054e+28, -1.97166896e+28, -3.52473835e+27, -1.41037446e+28,
       -9.72655301e+27, -1.81425148e+28, -1.02878104e+28, -1.30490946e+28,
       -2.00032366e+28, -2.38085484e+28])

In [13]:
cv_scores.mean()

-1.595467358318239e+28

In [14]:
print('cv score', cross_val_score(lr, X_train, y_train, cv=10).mean())

lr.fit(X_train, y_train)
print('train score', lr.score(X_train, y_train))

print('test score', lr.score(X_test,y_test))

cv score -1.595467358318239e+28
train score 0.8163552834314866
test score -2.1621564545103972e+26
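
The gap between the training and cross-validated scores above is what one would expect when there are many nearly redundant predictors relative to the number of rows. The sketch below shows the same pattern on synthetic data. It is only an illustration under assumed shapes and noise levels, and it does not attempt to reproduce the magnitudes seen with the Kobe data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in (assumption): 120 rows and 100 predictors, where the
# last 50 columns are noisy near-copies of the first 50 (strong collinearity).
rng = np.random.RandomState(0)
base = rng.normal(size=(120, 50))
near_copies = base + 0.05 * rng.normal(size=(120, 50))
X_demo = np.hstack([base, near_copies])
y_demo = base[:, :5].sum(axis=1) + rng.normal(size=120)

# Same folds for both models; only the penalty differs.
ols_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
ridge_scores = cross_val_score(Ridge(alpha=10.0), X_demo, y_demo, cv=5)
print("OLS mean R^2:  ", ols_scores.mean())
print("Ridge mean R^2:", ridge_scores.mean())
```

The unpenalized model can assign large offsetting coefficients to the near-duplicate columns, which fits the training folds but transfers poorly to the held-out fold.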


---

### 5) Find an optimal value for the ridge regression alpha using `RidgeCV`.

Go to the documentation and [read how RidgeCV works](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).

> *Hint: Once the RidgeCV is fit, the attribute `.alpha_` contains the best alpha parameter it found through cross-validation.*

Recall that ridge performs best when searching alphas through logarithmic space (`np.logspace`). This may take a while to fit.


In [46]:
ridge = RidgeCV(cv=5)


In [47]:
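# A:
# Note: r_alphas is defined here but never passed to RidgeCV, so the fit below searches only the
# default alphas (0.1, 1.0, 10.0). Pass alphas=r_alphas when constructing RidgeCV to search this grid.
r_alphas = np.logspace(0, 10, 100)

ridge_model = ridge  # same object as ridge, so fitting ridge also fits ridge_model

ridge.fit(X_train, y_train)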


RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=5, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)

In [48]:
ridge_model.alpha_

10.0
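
The selected alpha of 10.0 is the largest of the three values searched, which suggests the best value may lie outside that range. A minimal sketch of a wider logarithmic search follows. It uses synthetic data, and the grid bounds are assumptions chosen for illustration rather than recommendations for the Kobe data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in (assumption): 150 rows, 40 predictors, 3 of them informative.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(150, 40))
y_demo = X_demo[:, :3].sum(axis=1) + rng.normal(size=150)

# Search alphas spaced evenly on a log scale and pass the grid explicitly.
alpha_grid = np.logspace(-2, 4, 50)
ridge_cv = RidgeCV(alphas=alpha_grid, cv=5).fit(X_demo, y_demo)

# If the winner sits on either end of the grid, the grid should be widened.
on_edge = bool(np.isclose(ridge_cv.alpha_, [alpha_grid[0], alpha_grid[-1]]).any())
print(ridge_cv.alpha_, on_edge)
```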

---

### 6) Cross-validate the ridge regression $R^2$ with the optimal alpha.

Is it better than the linear regression? If so, why might this be?

In [33]:
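# A: Yes. The cross-validated R^2 is about 0.61, compared with a hugely negative score for the ordinary
# linear regression. The ridge penalty shrinks the coefficients of collinear or irrelevant columns,
# so the model is much less able to fit noise in the training data.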

In [42]:
cross_val_score(ridge, X_train, y_train, cv=5).mean()

0.610284993587664

In [35]:
# A:

---

### 8) Cross-validate the lasso $R^2$ with the optimal alpha.

Is it better than the linear regression? Is it better than ridge? What do the differences in results imply about the issues with the data set?

In [38]:
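# A: It is far better than the linear regression and slightly better than the ridge (about 0.63 vs. 0.61).
# This suggests that many of the columns carry no information about y, so dropping them entirely
# helps a little more than merely shrinking them.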
lasso = LassoCV(cv=5)

In [49]:
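# Note: the alpha selected below (0.11) is the smallest value in this grid, so a smaller alpha may score better.
lasso_alphas = np.linspace(0.11, 1, 99)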

lasso = LassoCV(alphas=lasso_alphas, cv=5)

lasso.fit(X_train, y_train)

LassoCV(alphas=array([0.11   , 0.11908, ..., 0.99092, 1.     ]), copy_X=True,
    cv=5, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100,
    n_jobs=None, normalize=False, positive=False, precompute='auto',
    random_state=None, selection='cyclic', tol=0.0001, verbose=False)

In [51]:
lasso.alpha_

0.11

In [76]:
cross_val_score(lasso, X_train, y_train, cv=5).mean()

0.6301839888909586
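
Because the chosen alpha sits at the lower edge of the hand-built grid, it may be worth letting `LassoCV` build its own alpha path, which it does by default when no `alphas` are supplied. The sketch below runs on synthetic data (an assumption, since the Kobe file is not reproduced here) and checks whether the selected value lands on the boundary of the automatic path.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in (assumption): 150 rows, 40 predictors, 2 of them informative.
rng = np.random.RandomState(2)
X_demo = rng.normal(size=(150, 40))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=150)

# With no alphas given, LassoCV generates a log-spaced path from the data.
lasso_cv = LassoCV(cv=5).fit(X_demo, y_demo)

# alphas_ holds the path that was searched; compare the winner to its ends.
path = lasso_cv.alphas_
at_edge = bool(np.isclose(lasso_cv.alpha_, [path.min(), path.max()]).any())
print(lasso_cv.alpha_, at_edge)
```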

---

### 9) Look at the coefficients for variables in the lasso.

1. Show the coefficient for variables, ordered from largest to smallest coefficient by absolute value.
2. What percent of the variables in the original data set are "zeroed-out" by the lasso?
3. What are the most important predictors for how many shots Kobe made in a game?

> **Note:** If you only fit the lasso within `cross_val_score`, you'll have to refit it outside of that function to pull out the coefficients.

In [21]:
# A:

In [63]:
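lasso_coefs = pd.Series(lasso.coef_, index=X.columns)  # pair coefficients with column names instead of sorting coef_ in place
lasso_coefs.reindex(lasso_coefs.abs().sort_values(ascending=False).index)  # largest absolute value first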
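
One possible way to approach the three questions in this section is sketched below on synthetic data. The column names, the `alpha=0.2` setting, and the data itself are assumptions for illustration. With the real data, the fitted `lasso` and the predictor column names would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic stand-in (assumption): 30 named predictors, 2 of them informative.
rng = np.random.RandomState(3)
cols = [f"feature_{i}" for i in range(30)]
X_demo = pd.DataFrame(rng.normal(size=(200, 30)), columns=cols)
y_demo = 3 * X_demo["feature_0"] - 2 * X_demo["feature_1"] + rng.normal(size=200)

lasso_demo = Lasso(alpha=0.2).fit(X_demo, y_demo)

# 1. Coefficients paired with their variable names, ordered by absolute value.
coef_table = pd.DataFrame({"variable": cols, "coef": lasso_demo.coef_})
coef_table["abs_coef"] = coef_table["coef"].abs()
coef_table = coef_table.sort_values("abs_coef", ascending=False)

# 2. Percentage of coefficients the lasso set exactly to zero.
pct_zeroed = 100 * (lasso_demo.coef_ == 0).mean()

# 3. The top rows of the table are the most influential predictors.
print(coef_table.head())
print(pct_zeroed)
```

Because the predictors are standardized before fitting, the absolute coefficient sizes are comparable across variables, which is what makes this ranking meaningful.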

---

### 10) Find an optimal value for elastic net regression alpha using `ElasticNetCV`.

Go to the documentation and [read how ElasticNetCV works](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html).

Note that here you'll be optimizing both the alpha parameter and the l1_ratio:
- `alpha`: Strength of regularization.
- `l1_ratio`: Amount of ridge vs. lasso (0 = all ridge, 1 = all lasso).
    
Do not include 0 in the search for `l1_ratio` — it won't allow it and will break.

You can use `n_alphas` for the alpha parameters instead of setting your own values, which we highly recommend.

Also, be careful setting too many l1_ratios over cross-validation folds in your search. It can take a long time if you choose too many combinations and, for the most part, there are diminishing returns in this data.

In [22]:
# A:


In [67]:
net_alphas = np.linspace(0.1, 1, 99)

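# Note: l1_ratio is fixed at 0.5 here, so only alpha is searched; pass a list of ratios to search both.
ElasticNet_model = ElasticNetCV(alphas=net_alphas, l1_ratio=.5, cv=5)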


ElasticNet_model.fit(X_train, y_train)

ElasticNetCV(alphas=array([0.1    , 0.10918, ..., 0.99082, 1.     ]),
       copy_X=True, cv=5, eps=0.001, fit_intercept=True, l1_ratio=0.5,
       max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
       positive=False, precompute='auto', random_state=None,
       selection='cyclic', tol=0.0001, verbose=0)

In [69]:
ElasticNet_model.alpha_

0.1642857142857143
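
To optimize both parameters as the prompt describes, `ElasticNetCV` accepts a list of `l1_ratio` values and searches an alpha path for each one. The following is a minimal sketch on synthetic data. The four ratios are arbitrary assumptions, and the alpha path is left at the library default.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in (assumption): 150 rows, 40 predictors, 2 of them informative.
rng = np.random.RandomState(4)
X_demo = rng.normal(size=(150, 40))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=150)

# A short list of ratios keeps the search cheap; 0 is deliberately excluded.
l1_ratios = [0.1, 0.5, 0.9, 1.0]
enet_cv = ElasticNetCV(l1_ratio=l1_ratios, cv=5).fit(X_demo, y_demo)

# Both the best alpha and the best l1_ratio are available after fitting.
print(enet_cv.alpha_, enet_cv.l1_ratio_)
```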

---

### 11) Cross-validate the elastic net $R^2$ with the optimal alpha and l1_ratio.

How does it compare to the ridge and lasso regularized regressions?

In [78]:
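# A: The elastic net's cross-validated R^2 (about 0.629) falls between the ridge (about 0.610) and the
# lasso (about 0.630), which is plausible for an even mix of the two penalties. Other l1_ratio values
# were not searched, so this ordering is not guaranteed to hold for them.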
cross_val_score(ElasticNet_model, X_train, y_train, cv=5).mean()


0.6289257240575321

---

### 12) [Bonus] Compare the residuals for ridge and lasso visually.


In [24]:
# A: Maybe a jointplot?
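
A simple alternative to a jointplot is a scatter plot of one model's residuals against the other's. The sketch below uses synthetic data and fixed alphas, both assumptions for illustration. With the lab's data, the held-out split and the fitted ridge and lasso models would be used instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 300 rows, 40 predictors, 2 of them informative.
rng = np.random.RandomState(5)
X_demo = rng.normal(size=(300, 40))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Residual = actual minus predicted, computed on the held-out rows.
ridge_resid = y_te - Ridge(alpha=10.0).fit(X_tr, y_tr).predict(X_te)
lasso_resid = y_te - Lasso(alpha=0.1).fit(X_tr, y_tr).predict(X_te)

# Points on the dashed diagonal are rows where both models err identically.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(ridge_resid, lasso_resid, alpha=0.6)
lims = [min(ridge_resid.min(), lasso_resid.min()),
        max(ridge_resid.max(), lasso_resid.max())]
ax.plot(lims, lims, linestyle="--", color="gray")
ax.set_xlabel("Ridge residuals")
ax.set_ylabel("Lasso residuals")
ax.set_title("Ridge vs. lasso residuals (held-out rows)")
```

Points far from the diagonal mark the games where the two penalties disagree most, which is where a closer look at the predictors is likely to be informative.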