
_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [52]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

In [53]:
# Read data
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer a new feature.** Engineer at least **1** new feature, from this list or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [54]:
df.head()

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0


In [55]:
target = 'shot_made_flag'

In [56]:
baseline = df[target].value_counts(normalize=True)

In [57]:
baseline[1]  # share of shots made (class 1), which is the minority class

0.4729187562688064

In [58]:
### Curry made about 47.3% of his shots, so the majority class is 0 (missed).
### Guessing "missed" for every shot gives a baseline accuracy of
### 1 - 0.4729 = 0.5271, or about 52.7%.
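
As a sanity check on the baseline, the majority-class accuracy is simply the largest normalized class frequency. The sketch below uses a small made-up series (`toy_target` is a hypothetical stand-in, not the real `shot_made_flag` column) so that it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical toy target: 1 = shot made, 0 = shot missed (made-up values)
toy_target = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

# Normalized class frequencies (share of rows in each class)
class_shares = toy_target.value_counts(normalize=True)

majority_class = class_shares.idxmax()   # label of the most common class
baseline_accuracy = class_shares.max()   # accuracy of always guessing that label

print(majority_class, baseline_accuracy)
```

Applied to `df[target]`, the same two lines should pick class 0 and give roughly 0.527, since the share of made shots is 0.4729.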

## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

In [59]:
test = df[(df['game_date'] >= '2018-10') & (df['game_date'] <= '2019-07')]

In [60]:
test.shape

(1709, 20)
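
The string comparison above works because `game_date` is stored in ISO `YYYY-MM-DD` form, so lexicographic order matches chronological order. A slightly more explicit variant, sketched here on a made-up frame, parses the dates first. The cutoff dates are assumptions based on the October-to-June season described in the instructions.

```python
import pandas as pd

# Hypothetical toy frame using the same game_date format as the real data
toy = pd.DataFrame({
    'game_date': ['2017-10-17', '2018-06-08', '2018-10-16', '2019-01-05', '2019-06-05'],
    'shot_made_flag': [1, 0, 1, 0, 1],
})

# Parse the strings so comparisons are made on real timestamps
dates = pd.to_datetime(toy['game_date'])

# Assumed season window: October 1, 2018 through June 30, 2019 (inclusive)
season_start = pd.Timestamp('2018-10-01')
season_end = pd.Timestamp('2019-06-30')
in_2018_19 = dates.between(season_start, season_end)

toy_test = toy[in_2018_19]    # rows inside the held-out season
toy_rest = toy[~in_2018_19]   # everything else stays available for training

print(len(toy_test), len(toy_rest))
```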

## 3. Engineer a new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?

    



In [62]:
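def home_adv(df):
  # Flag shots taken in games where Golden State is the home team.
  # Work on a copy so the caller's dataframe is not modified in place.
  export = df.copy()
  export['home_adv'] = export['htm'] == 'GSW'
  return export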

In [63]:
df = home_adv(df)

In [64]:
df[~df['home_adv']]

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot,home_adv
12,20900030,9,Stephen Curry,1,10,39,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,24,152,195,0,2009-10-30,PHX,GSW,Regular Season,2.0,False
13,20900030,18,Stephen Curry,1,9,27,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,22,-81,205,1,2009-10-30,PHX,GSW,Regular Season,-0.0,False
14,20900030,258,Stephen Curry,2,1,28,Jump Shot,2PT Field Goal,Mid-Range,Right Side Center(RC),16-24 ft.,20,125,162,1,2009-10-30,PHX,GSW,Regular Season,-9.0,False
15,20900030,264,Stephen Curry,2,1,0,Jump Shot,3PT Field Goal,Above the Break 3,Center(C),24+ ft.,26,-76,250,1,2009-10-30,PHX,GSW,Regular Season,-10.0,False
16,20900030,313,Stephen Curry,3,8,35,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),8-16 ft.,15,-158,-6,0,2009-10-30,PHX,GSW,Regular Season,-10.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13922,41800402,399,Stephen Curry,3,9,17,Driving Layup Shot,2PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,1,-1,17,0,2019-06-02,TOR,GSW,Playoffs,4.0,False
13923,41800402,489,Stephen Curry,3,3,29,Pullup Jump shot,3PT Field Goal,Above the Break 3,Left Side Center(LC),24+ ft.,25,-221,132,0,2019-06-02,TOR,GSW,Playoffs,11.0,False
13924,41800402,501,Stephen Curry,3,2,56,Driving Floating Bank Jump Shot,2PT Field Goal,Mid-Range,Right Side(R),8-16 ft.,11,93,68,0,2019-06-02,TOR,GSW,Playoffs,9.0,False
13925,41800402,527,Stephen Curry,3,0,40,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,28,109,258,1,2019-06-02,TOR,GSW,Playoffs,7.0,False
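
The other features on the list can be sketched in the same vectorized style. The example below runs on a made-up frame that borrows the dataset's column names. Two assumptions are worth flagging: overtime periods are treated as having no full periods left, and the rows are assumed to already be in chronological order within each game, with the first shot of a game marked as "previous shot not made". `add_features` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the dataset's column names (values are illustrative)
toy = pd.DataFrame({
    'game_id': [20900015, 20900015, 20900015, 20900030, 20900030],
    'period': [1, 1, 2, 1, 4],
    'minutes_remaining': [11, 9, 9, 10, 0],
    'seconds_remaining': [25, 31, 49, 39, 40],
    'shot_made_flag': [0, 1, 0, 0, 1],
    'htm': ['GSW', 'GSW', 'GSW', 'PHX', 'PHX'],
    'vtm': ['HOU', 'HOU', 'HOU', 'GSW', 'GSW'],
})

def add_features(df):
    out = df.copy()
    # Opponent: whichever of the home/visiting teams is not GSW
    out['opponent'] = np.where(out['htm'] == 'GSW', out['vtm'], out['htm'])
    # Seconds left in the current period
    out['seconds_remaining_in_period'] = (
        out['minutes_remaining'] * 60 + out['seconds_remaining']
    )
    # Seconds left in regulation: 4 periods of 12 minutes each.
    # Assumption: in overtime (period > 4) no full periods remain.
    periods_left = (4 - out['period']).clip(lower=0)
    out['seconds_remaining_in_game'] = (
        periods_left * 12 * 60 + out['seconds_remaining_in_period']
    )
    # Result of the previous shot in the same game.
    # Assumption: rows are in chronological order; first shot of a game -> 0.
    out['made_previous_shot'] = (
        out.groupby('game_id')['shot_made_flag'].shift(1).fillna(0).astype(int)
    )
    return out

toy_features = add_features(toy)
print(toy_features[['opponent', 'seconds_remaining_in_period',
                    'seconds_remaining_in_game', 'made_previous_shot']])
```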


## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [65]:
### Train on the 2009-10 through 2016-17 seasons, validate on 2017-18, test on 2018-19.
### String comparison works here because game_date is in ISO YYYY-MM-DD format.
train = df[(df['game_date'] >= '2009-10') & (df['game_date'] <= '2017-07')]
val  = df[(df['game_date'] >= '2017-10') & (df['game_date'] <= '2018-07')]
test = df[(df['game_date'] >= '2018-10') & (df['game_date'] <= '2019-07')]

In [66]:
print(train.shape)
print(val.shape)
print(test.shape)

(11081, 21)
(1168, 21)
(1709, 21)
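
For comparison, the random 80/20 option from the list above could be sketched with scikit-learn's `train_test_split`, applied only to the rows that remain after the test season is held out. The toy frame, `random_state=42`, and stratifying on the target are illustrative assumptions, not choices made in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the non-test rows (made-up values)
toy_trainval = pd.DataFrame({
    'shot_distance': [26, 18, 14, 19, 16, 24, 22, 20, 26, 15],
    'shot_made_flag': [0, 1, 0, 0, 0, 0, 1, 1, 1, 0],
})

# 80% train / 20% validate, keeping the class balance similar in both parts
toy_train, toy_val = train_test_split(
    toy_trainval,
    test_size=0.20,
    stratify=toy_trainval['shot_made_flag'],
    random_state=42,
)

print(toy_train.shape, toy_val.shape)
```

A random split ignores time order, so validation shots can come from earlier seasons than training shots. The season-based split used in this notebook avoids that.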


## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree or Random Forest model.

In [70]:
from scipy.stats import randint, uniform
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV




In [89]:
### Set the target to the shot_made_flag column
target = 'shot_made_flag'

### Create a dataframe with the other features
df_train = train.drop(columns=[target,'player_name'])

In [116]:
### Get a list of the numeric features
numeric_features = df_train.select_dtypes(include='number').columns.tolist()

### Get a series with the cardinality of the nonnumeric features
cardinality = df_train.select_dtypes(exclude='number').nunique()

### Get a list of all categorical features with cardinality <= 20
categorical_features = cardinality[cardinality <= 20].index.tolist()

### Combine the lists 
features = numeric_features + categorical_features

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]
X_val = val[features]
y_val = val[target]

In [None]:
X_train

In [None]:
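### Ordinal-encode the categorical columns, then fit a random forest
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestClassifier(random_state=56)
)

### Search over 2 hyperparameters: number of trees and maximum tree depth
param_distributions = {
    'randomforestclassifier__n_estimators': randint(50, 500),
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None],
}

### Try 25 random candidates, each scored with 2-fold cross-validation
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=25,
    cv=2,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=2
)

search.fit(X_train, y_train);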

In [99]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy', search.best_score_)

Best hyperparameters {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 107}
Cross-validation Accuracy 0.5569894784986484


In [126]:
### This shows parameters tested and the outcomes for each iteration run
pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score').T

Unnamed: 0,16,22,3,11,4,9,14,5,6,1,21,8,20,10,12,19,7,24,23,17,0,13,15,18,2
mean_fit_time,1.86566,2.25997,1.07133,0.800716,0.384891,0.872998,4.42172,2.04271,5.37547,2.11378,5.09159,1.89116,3.45839,2.72211,2.1427,2.40641,6.624,4.98437,2.6118,5.02159,6.04058,3.81592,1.12724,3.51611,2.7356
std_fit_time,0.00532889,0.00178087,0.00613308,0.0021522,0.000128031,0.012194,0.00750661,0.0124761,0.164587,0.0731187,0.0178747,0.0335208,0.086913,0.0540954,0.0387574,0.0316125,0.21495,0.100278,0.00790238,0.0692366,0.103255,0.0700758,0.0270191,0.0263519,0.0240211
mean_score_time,0.208056,0.252902,0.126626,0.104293,0.0565867,0.107379,0.396378,0.194318,0.438691,0.182856,0.428925,0.170293,0.277424,0.226058,0.190417,0.208027,0.520404,0.373326,0.23314,0.407741,0.498049,0.314803,0.105514,0.309486,0.251082
std_score_time,0.0001477,0.00571251,0.000138879,0.00543916,0.000460863,0.000889421,0.0116577,0.0049299,0.0435863,0.0117835,0.0114348,0.00283217,0.0169436,0.00696039,0.0101023,0.0031482,0.0247679,0.0533562,0.0123109,0.0287839,0.0301688,0.0114181,0.00226581,0.00791848,0.0148786
param_randomforestclassifier__max_depth,5,5,5,5,5,5,10,10,,,15,20,,20,20,20,,15,15,20,20,20,,15,15
param_randomforestclassifier__n_estimators,292,355,161,116,50,130,456,208,353,136,401,128,225,185,145,160,436,406,203,350,417,265,70,271,213
params,"{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 10, 'ran...","{'randomforestclassifier__max_depth': 10, 'ran...","{'randomforestclassifier__max_depth': None, 'r...","{'randomforestclassifier__max_depth': None, 'r...","{'randomforestclassifier__max_depth': 15, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': None, 'r...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': None, 'r...","{'randomforestclassifier__max_depth': 15, 'ran...","{'randomforestclassifier__max_depth': 15, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': None, 'r...","{'randomforestclassifier__max_depth': 15, 'ran...","{'randomforestclassifier__max_depth': 15, 'ran..."
split0_test_score,0.555676,0.555856,0.555495,0.555495,0.556037,0.555134,0.55351,0.553149,0.550442,0.551525,0.552066,0.55333,0.552247,0.550803,0.552608,0.551886,0.550623,0.551345,0.552066,0.550623,0.550623,0.550803,0.551345,0.550803,0.552247
split1_test_score,0.559567,0.558845,0.557581,0.556679,0.556137,0.556498,0.541155,0.536101,0.53574,0.534477,0.533755,0.53213,0.533032,0.534296,0.53213,0.531949,0.532852,0.531769,0.530505,0.531769,0.531408,0.530144,0.527617,0.526895,0.525451
mean_test_score,0.557621,0.557351,0.556538,0.556087,0.556087,0.555816,0.547333,0.544625,0.543091,0.543001,0.54291,0.54273,0.54264,0.54255,0.542369,0.541918,0.541737,0.541557,0.541286,0.541196,0.541015,0.540474,0.539481,0.538849,0.538849
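
For the feature-importance stretch goal, a fitted scikit-learn pipeline exposes its final estimator through `named_steps`. The sketch below fits a small forest on made-up numeric data (the column names are borrowed from the dataset, while the values, the target rule, and the hyperparameters are assumptions) and pairs each importance with its column name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Made-up numeric features; only shot_distance influences the toy target
rng = np.random.RandomState(0)
toy_X = pd.DataFrame({
    'shot_distance': rng.randint(0, 30, size=200),
    'loc_x': rng.randint(-250, 250, size=200),
    'period': rng.randint(1, 5, size=200),
})
# Made-up target: shorter shots are "made" more often, with some noise
toy_y = (toy_X['shot_distance'] + rng.randint(0, 10, size=200) < 20).astype(int)

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=50, max_depth=5, random_state=56),
)
toy_pipeline.fit(toy_X, toy_y)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)

print(importances.sort_values(ascending=False))
```

With the search object from this notebook, the fitted pipeline would be `search.best_estimator_` (available because `RandomizedSearchCV` refits the best candidate by default), and `importances.sort_values().plot.barh()` is one way to plot the result.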


## 6. Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [111]:
val_accuracy = search.score(X_val, y_val)
val_accuracy

0.5779109589041096

In [114]:
print('Validation Accuracy is:',val_accuracy)

Validation Accuracy is: 0.5779109589041096


## 7. Get your model's test accuracy

> (One time, at the end.)

In [117]:
test_accuracy = search.score(X_test,y_test)

In [118]:
print('Test Accuracy is:',test_accuracy)

Test Accuracy is: 0.5763604447045055


## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

### Calculate accuracy 
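
Accuracy is the share of all predictions that are correct: true positives plus true negatives, divided by the total number of observations. A minimal sketch using the four counts from the table above (the variable names are illustrative):

```python
# Counts read from the confusion matrix above
tn, fp = 85, 58   # actual negative: predicted negative, predicted positive
fn, tp = 8, 36    # actual positive: predicted negative, predicted positive

total = tn + fp + fn + tp       # 187 observations
accuracy = (tp + tn) / total    # 121 / 187

print(round(accuracy, 4))       # 0.6471
```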

### Calculate precision
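
Precision is the share of positive predictions that are actually positive: true positives divided by all predicted positives. A minimal sketch using the counts from the table above (the variable names are illustrative):

```python
# Counts from the "Predicted Positive" column of the confusion matrix above
fp = 58   # actual negative, predicted positive
tp = 36   # actual positive, predicted positive

precision = tp / (tp + fp)    # 36 / 94

print(round(precision, 4))    # 0.383
```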

### Calculate recall
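
Recall is the share of actual positives that the model finds: true positives divided by all actual positives. A minimal sketch using the counts from the table above (the variable names are illustrative):

```python
# Counts from the "Actual Positive" row of the confusion matrix above
fn = 8    # actual positive, predicted negative
tp = 36   # actual positive, predicted positive

recall = tp / (tp + fn)    # 36 / 44

print(round(recall, 4))    # 0.8182
```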