<a href="https://colab.research.google.com/github/AdrianduPlessis/DS-Unit-2-Kaggle-Challenge/blob/master/DS7_Sprint_Challenge_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [0]:
#Global imports
import pandas as pd
import numpy as np
import seaborn as sns
import pandas_profiling

from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform

import category_encoders as ce
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform

In [147]:
import sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install packages in Colab
    !pip install category_encoders==2.0.0
    !pip install pandas-profiling==2.3.0
    !pip install plotly==4.1.1



In [0]:
# Read data
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



##Getting overview of dataset

In [149]:
#Looking at first 5 entries
df.head(5)

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0


In [150]:
pandas_profiling.ProfileReport(df)



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [171]:
#Majority_class prediction accuracy
majority_class = df['shot_made_flag'].mode()[0]
base_pred = [majority_class] * len(df)

#Accuracy of majority class baseline = frequency of the majority class
baseline = accuracy_score(df['shot_made_flag'], base_pred)
print(baseline)

0.5270812437311936


## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

This is not a statistically optimal way for testing our model, unless there is reason to believe that the player's performance is constant over time.

In [0]:
# Train/test split
# Start of 2018 NBA season is October 2018 16th (google)
df['game_date'] = pd.to_datetime(df['game_date'], infer_datetime_format=True)
cutoff = pd.to_datetime('2018-08-16')
train = df[df.game_date < cutoff]
test  = df[df.game_date >= cutoff]

#Sanity check
assert len(train) + len(test) == 13958

## 3. Engineer new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?

    

In [0]:
def wrangle(X):
  #Engineer Features
  #Homecourt advantage
  X['homecourt'] = np.where(X['htm']=='GSW', 1, 0)

  #Seconds remaining in period
  X['seconds_remaining_in_period'] = X['minutes_remaining'] * 60 + X['seconds_remaining']

  #Seconds remaining in the game (12min * 60sec/min = 720seconds in period)
  X['seconds_remaining_in_game'] = X['seconds_remaining_in_period'] + 720*(X['period']-1)


  # Extract components from date_recorded, then drop the original column
  X['year'] = X['game_date'].dt.year
  X['month'] = X['game_date'].dt.month
  X['day'] = X['game_date'].dt.day
  X = X.drop(columns='game_date')
  
  #Drop constant features
  X = X.drop(columns='player_name')

  return X

train = wrangle(train)
test = wrangle(test)

In [0]:
#Split into features and target
target = 'shot_made_flag'
features = train.columns.drop(target)
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [0]:
#Will Validate using Cross-validation

## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree or Random Forest model.

In [0]:
#TODO: Try different encoders

#Create Pipeline
#Setting max_depth to 10 to avoid overfitting
pipeline = make_pipeline(
    ce.TargetEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10)
)

#Fit on train
pipeline.fit(X_train, y_train);

## 6.Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [157]:
k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv=k, 
                         scoring='neg_mean_absolute_error')
print(f'MAE for {k} folds:', -scores)

MAE for 3 folds: [0.40744368 0.35782513 0.35791279]


In [158]:
#Getting accuracy score
train_pred = pipeline.predict(X_train)
print("Accuracy with base RandomForestClassifier Model: ", accuracy_score(y_train, train_pred))

Accuracy with base RandomForestClassifier Model:  0.7427545105722916


Accuracy score is not valid since I'm fitting and validating on train. Should instead rely on MAE from CV above.

In [159]:
-scores.mean()

0.37439386636502453

Since we are doing a binary precition MAE translates nicley to accuracy by 1-MAE (theory)

In [160]:
accuracy = 1-(-scores.mean())
print("The base RandomForestClassifier beats the Majority Classifier baseline by ", accuracy - baseline)

0.6256061336349754

In [161]:
param_distributions = {
    'targetencoder__min_samples_leaf': randint(1, 1000), 
    'targetencoder__smoothing': uniform(1, 1000), 
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': randint(50, 500), 
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None], 
    'randomforestclassifier__max_features': uniform(0, 1), 
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=3, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   27.2s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   54.9s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  4.8min finished


In [162]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation MAE', -search.best_score_)

Best hyperparameters {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__max_features': 0.22947787165632183, 'randomforestclassifier__n_estimators': 67, 'simpleimputer__strategy': 'mean', 'targetencoder__min_samples_leaf': 35, 'targetencoder__smoothing': 358.5984617425412}
Cross-validation MAE 0.36109070128173726


In [0]:
#Train accuracy (MAE)
best_performing_hyperperameters = pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score')

In [0]:
pipeline = search.best_estimator_

In [172]:
train_pred = pipeline.predict(X_train)
CV_tuned_accuracy = accuracy_score(y_train, train_pred)
print("Accuracy with RandomizedSearchCV tuned hyper params: ", CV_tuned_accuracy, ", beating the base RandomForestClassifier by ", CV_tuned_accuracy - accuracy)

Accuracy with RandomizedSearchCV tuned hyper params:  0.6781778104335048 , beating the base RandomForestClassifier by  0.031118986904092982


## 7. Get your model's test accuracy

> (One time, at the end.)

In [166]:
test_pred = pipeline.predict(X_test)
print("Final test accuracy score: ", accuracy_score(y_test, test_pred))

Final test accuracy score:  0.6325336454066706


## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

### Calculate accuracy 

In [167]:
Total_Predictions = 85 + 58 + 8 + 36
Correct_Predictions = 85 + 36
accuracy = Correct_Predictions / Total_Predictions
accuracy

0.6470588235294118

### Calculate precision

In [168]:
# Precision = Correct value predictions of a class / Total predictions for class
True_Negatives = 85
True_And_False_Negatives = 85 + 8
precision = True_Negatives / True_And_False_Negatives
precision

0.9139784946236559

### Calculate recall

In [169]:
# Recall = Correct for the class / Actual for the class
Actual_Negatives = 85 + 58
recall = True_Negatives / Actual_Negatives
recall

0.5944055944055944