_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all of Steph Curry's NBA field goal attempts (regular season and playoff games, from October 28, 2009, through June 5, 2019).

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [2]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*
    

In [3]:
# Imports
import category_encoders as ce
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Read data
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer a new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`)?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [4]:
df.columns

Index(['game_id', 'game_event_id', 'player_name', 'period',
       'minutes_remaining', 'seconds_remaining', 'action_type', 'shot_type',
       'shot_zone_basic', 'shot_zone_area', 'shot_zone_range', 'shot_distance',
       'loc_x', 'loc_y', 'shot_made_flag', 'game_date', 'htm', 'vtm',
       'season_type', 'scoremargin_before_shot'],
      dtype='object')

In [5]:
# Candidate features: 'shot_zone_basic', 'total_seconds_remaining', 'loc_x', 'loc_y',
# 'season_type', 'shot_distance', 'scoremargin_before_shot', 'shot_type'
df['shot_zone_basic'].value_counts()

Above the Break 3        5695
Mid-Range                3194
Restricted Area          2692
In The Paint (Non-RA)    1250
Left Corner 3             603
Right Corner 3            428
Backcourt                  96
Name: shot_zone_basic, dtype: int64

In [6]:
df['season_type'].value_counts(normalize=True)

Regular Season    0.848689
Playoffs          0.151311
Name: season_type, dtype: float64

In [7]:
df['shot_type'].value_counts(normalize=True)

2PT Field Goal    0.511176
3PT Field Goal    0.488824
Name: shot_type, dtype: float64

In [8]:
df['shot_zone_area'].value_counts(normalize=True)

Center(C)                0.428930
Right Side Center(RC)    0.207695
Left Side Center(LC)     0.187061
Left Side(L)             0.091274
Right Side(R)            0.077805
Back Court(BC)           0.007236
Name: shot_zone_area, dtype: float64

In [8]:
df['shot_made_flag'].value_counts(normalize=True)

0    0.527081
1    0.472919
Name: shot_made_flag, dtype: float64

## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

In [10]:
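# Parse dates so seasons can be compared chronologically
df['game_date'] = pd.to_datetime(df['game_date'])
# The 2018-19 season runs from October 2018 through June 2019
start = pd.to_datetime('2018-10-01')
end = pd.to_datetime('2019-06-30')
test = df[(df['game_date'] >= start) & (df['game_date'] <= end)].copy()
# Keep everything before the test season for training and validation.
# Strictly "<" so that no row can land in both sets; .copy() avoids
# SettingWithCopyWarning when new columns are added later.
df = df[df['game_date'] < start].copy()
len(test)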

1709

In [11]:
len(df)

12249

## 3. Engineer a new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`)?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?

    

In [12]:
df.head()

| | game_id | game_event_id | player_name | period | minutes_remaining | seconds_remaining | action_type | shot_type | shot_zone_basic | shot_zone_area | shot_zone_range | shot_distance | loc_x | loc_y | shot_made_flag | game_date | htm | vtm | season_type | scoremargin_before_shot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20900015 | 4 | Stephen Curry | 1 | 11 | 25 | Jump Shot | 3PT Field Goal | Above the Break 3 | Right Side Center(RC) | 24+ ft. | 26 | 99 | 249 | 0 | 2009-10-28 | GSW | HOU | Regular Season | 2.0 |
| 1 | 20900015 | 17 | Stephen Curry | 1 | 9 | 31 | Step Back Jump shot | 2PT Field Goal | Mid-Range | Left Side Center(LC) | 16-24 ft. | 18 | -122 | 145 | 1 | 2009-10-28 | GSW | HOU | Regular Season | 0.0 |
| 2 | 20900015 | 53 | Stephen Curry | 1 | 6 | 2 | Jump Shot | 2PT Field Goal | In The Paint (Non-RA) | Center(C) | 8-16 ft. | 14 | -60 | 129 | 0 | 2009-10-28 | GSW | HOU | Regular Season | -4.0 |
| 3 | 20900015 | 141 | Stephen Curry | 2 | 9 | 49 | Jump Shot | 2PT Field Goal | Mid-Range | Left Side(L) | 16-24 ft. | 19 | -172 | 82 | 0 | 2009-10-28 | GSW | HOU | Regular Season | -4.0 |
| 4 | 20900015 | 249 | Stephen Curry | 2 | 2 | 19 | Jump Shot | 2PT Field Goal | Mid-Range | Left Side Center(LC) | 16-24 ft. | 16 | -68 | 148 | 0 | 2009-10-28 | GSW | HOU | Regular Season | 0.0 |


In [14]:
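def new_feature(row):
    # Seconds remaining in the period: minutes converted to seconds, plus seconds
    return row['seconds_remaining'] + (row['minutes_remaining'] * 60)

In [15]:
# new_feature is plain column arithmetic, so it can take the whole DataFrame
# at once instead of going row by row with df.apply(new_feature, axis=1).
# Note: test was split off earlier, so it would need the same column
# before this feature could be used in a model.
df['total_seconds_remaining'] = new_feature(df)
df['total_seconds_remaining']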

0        685
1        571
2        362
3        589
4        139
5         34
6        626
7        391
8        145
9        107
10        29
11         0
12       639
13       567
14        88
15        60
16       515
17       177
18       614
19       546
20       443
21       570
22       366
23       109
24        55
25       671
26       294
27       645
28       284
29       255
        ... 
12219    462
12220    178
12221    158
12222    698
12223    564
12224    527
12225    488
12226    363
12227      0
12228    680
12229    436
12230    310
12231    246
12232    219
12233    192
12234      5
12235    653
12236    618
12237    574
12238    415
12239    310
12240    256
12241    671
12242    585
12243    522
12244    379
12245    348
12246    313
12247    267
12248    229
Name: total_seconds_remaining, Length: 12249, dtype: int64
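
The other ideas from the feature list could be built along the same lines. The sketch below is one possible reading of them, not the notebook's own code: the helper name `engineer_features` and every new column name are hypothetical, overtime periods are assumed to contribute only their own period clock, and "previous shot" is assumed to mean the previous attempt within the same game. The sample rows are the first four rows of the `df.head()` output above.

```python
import numpy as np
import pandas as pd


def engineer_features(shots: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of shots with five extra (hypothetically named) columns."""
    out = shots.copy()
    # Homecourt advantage: 1 when Golden State is the home team
    out['homecourt_advantage'] = (out['htm'] == 'GSW').astype(int)
    # Opponent: whichever of the two teams is not GSW
    out['opponent'] = np.where(out['htm'] == 'GSW', out['vtm'], out['htm'])
    # Seconds remaining in the period
    out['seconds_remaining_in_period'] = (
        out['minutes_remaining'] * 60 + out['seconds_remaining']
    )
    # Seconds remaining in the game: 4 periods of 12 minutes (720 seconds).
    # Assumption: in overtime (period > 4) only the period clock counts.
    periods_left = (4 - out['period']).clip(lower=0)
    out['seconds_remaining_in_game'] = (
        periods_left * 720 + out['seconds_remaining_in_period']
    )
    # Made previous shot: result of the previous attempt in the same game,
    # with 0 for the first attempt of each game (an assumption).
    out = out.sort_values(['game_date', 'game_id', 'game_event_id'])
    out['made_previous_shot'] = (
        out.groupby('game_id')['shot_made_flag'].shift(1).fillna(0).astype(int)
    )
    return out


# First four rows of the df.head() output, restricted to the needed columns
sample = pd.DataFrame({
    'game_id': [20900015] * 4,
    'game_event_id': [4, 17, 53, 141],
    'period': [1, 1, 1, 2],
    'minutes_remaining': [11, 9, 6, 9],
    'seconds_remaining': [25, 31, 2, 49],
    'shot_made_flag': [0, 1, 0, 0],
    'game_date': ['2009-10-28'] * 4,
    'htm': ['GSW'] * 4,
    'vtm': ['HOU'] * 4,
})
engineered = engineer_features(sample)
print(engineered[['seconds_remaining_in_period', 'seconds_remaining_in_game',
                  'made_previous_shot']])
```

Because the helper returns a new DataFrame, the same call can be applied to the training data and to `test`, which keeps their columns in sync.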

## 4. Decide how to validate your model.

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [16]:
train, val = train_test_split(df, test_size=.20, random_state=42)
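
The cell above uses the random 80/20 option. For comparison, the first option in the list, a split by season, might look like the sketch below. The helper `season_start_year`, the `season` column, and the toy dates are all made up for illustration; the only fact taken from the assignment is that NBA seasons begin in October and end in June.

```python
import pandas as pd


def season_start_year(dates: pd.Series) -> pd.Series:
    """Label each date with the calendar year in which its NBA season began."""
    dates = pd.to_datetime(dates)
    # October-December games belong to the season starting that year;
    # games from January onward belong to the season that began the year before.
    return dates.dt.year.where(dates.dt.month >= 10, dates.dt.year - 1)


# Made-up game dates for illustration only
toy = pd.DataFrame({
    'game_date': ['2016-11-01', '2017-05-20', '2017-10-25',
                  '2018-06-08', '2018-10-16'],
    'shot_made_flag': [1, 0, 1, 1, 0],
})
toy['season'] = season_start_year(toy['game_date'])

toy_train = toy[toy['season'] <= 2016]  # 2016-17 season and earlier
toy_val = toy[toy['season'] == 2017]    # 2017-18 season
toy_test = toy[toy['season'] == 2018]   # 2018-19 season
print(len(toy_train), len(toy_val), len(toy_test))
```

On the real data, the assignment's expected counts (11,081 train and 1,168 validation rows) would be the way to check such a split.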


## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree or Random Forest model.

In [17]:
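target = 'shot_made_flag'
# Note: the engineered total_seconds_remaining column is not in this list,
# so the models below do not use it.
features = ['shot_zone_basic', 'loc_x', 'loc_y', 'season_type', 'shot_distance', 'scoremargin_before_shot', 'shot_type']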
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    DecisionTreeClassifier(min_samples_leaf=20, random_state=42)
)

pipeline.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('onehotencoder', OneHotEncoder(cols=['shot_zone_basic', 'season_type', 'shot_type'],
       drop_invariant=False, handle_missing='value',
       handle_unknown='value', return_df=True, use_cat_names=True,
       verbose=0)), ('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing...        min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'))])

## 6. Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [18]:
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.5314285714285715


In [19]:
# Try a random forest as well (this rebinds `pipeline` to the new model)
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_jobs=-1, random_state=0))
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))



Validation Accuracy 0.5424489795918367
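
One of the stretch goals is to optimize 3+ hyperparameters over 10+ candidates with `RandomizedSearchCV`. The notebook does not do this, so the following is only a sketch of how it could be set up. It runs on synthetic data from `make_classification` rather than the shot data, leaves out the categorical encoder, and the parameter ranges are arbitrary choices, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: all-numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

search_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(random_state=42),
)

# make_pipeline names each step after its lowercased class, so forest
# parameters are addressed as 'randomforestclassifier__<param>'.
param_distributions = {
    'randomforestclassifier__n_estimators': randint(10, 50),
    'randomforestclassifier__max_depth': [3, 5, 10, None],
    'randomforestclassifier__min_samples_leaf': randint(1, 50),
}

search = RandomizedSearchCV(
    search_pipeline,
    param_distributions=param_distributions,
    n_iter=10,           # 10 candidate combinations
    cv=3,                # 3-fold cross-validation per candidate
    scoring='accuracy',
    random_state=42,
)
search.fit(X_demo, y_demo)
print('Best hyperparameters:', search.best_params_)
print('Cross-validation accuracy:', search.best_score_)
```

With the shot data, the encoder step would go back at the front of the pipeline and the search would be fit on the training rows only, keeping the test set untouched.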


## 7. Get your model's test accuracy

> (One time, at the end.)

In [20]:
# `pipeline` now refers to the random forest fit in the previous cell
print('Test Accuracy', pipeline.score(X_test, y_test))

Test Accuracy 0.5529549444119368


## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

### Calculate accuracy 

In [21]:
# Correct predictions (TN + TP) divided by all predictions (TN + FP + FN + TP)
acc = (85 + 36) / (85 + 58 + 8 + 36)
acc

0.6470588235294118

### Calculate precision

In [22]:
precision = 36/(36+58)
precision

0.3829787234042553

### Calculate recall

In [36]:
recall = 36/(36+8)
recall

0.8181818181818182
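
The three calculations above can be collected into one small helper. This is a sketch with a hypothetical function name, `classification_metrics`. It reads the table as TN = 85, FP = 58, FN = 8, TP = 36, which is how the cells above use those numbers.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, and recall from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        # Share of all predictions that were correct
        'accuracy': (tp + tn) / total,
        # Of the predicted positives, the share that were actually positive
        'precision': tp / (tp + fp),
        # Of the actual positives, the share that were predicted positive
        'recall': tp / (tp + fn),
    }


metrics = classification_metrics(tn=85, fp=58, fn=8, tp=36)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```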