## A Step-by-Step Walkthrough of the Next Pitch Prediction Model


#### Importing the necessary libraries

In [75]:
import next_pitch
from next_pitch import library as lib
from next_pitch import pitch_functions
from next_pitch import data_collection
import os

In [76]:
os.environ['KMP_DUPLICATE_LIB_OK']='True'

## Load Data

The data used in this model is too large to store as a ```csv``` file on GitHub, so for this walkthrough it must be collected fresh using ```statsapi```. The function ```get_clean_data``` uses several other functions in the ```data_collection.py``` file to pull in the data, clean it, and return a dataframe of all pitches thrown in every Major League Baseball game during the specified period.

For demonstration, only data from May 2018 is pulled in. The full dataset takes over 25 minutes to collect.

In [77]:
pitch_data = data_collection.get_clean_data(start_date='05/01/2018', end_date='05/31/2018')
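
Because the full pull is slow, one option is to request the season in smaller date windows and combine the results afterwards. The helper below is a hypothetical sketch and not part of ```next_pitch```: ```month_windows``` is an assumed name, and it only produces `(start_date, end_date)` strings in the same `MM/DD/YYYY` format that ```get_clean_data``` is called with above.

```python
import calendar
from datetime import date


def month_windows(start, end):
    """Split [start, end] into per-month (start_date, end_date) string pairs.

    Hypothetical helper (not part of next_pitch). Dates are formatted as
    MM/DD/YYYY to match the get_clean_data call shown above.
    """
    if start > end:
        raise ValueError("start must not be after end")
    windows = []
    current = start
    while current <= end:
        # Last calendar day of the month that `current` falls in.
        last_day = calendar.monthrange(current.year, current.month)[1]
        window_end = min(date(current.year, current.month, last_day), end)
        windows.append(
            (current.strftime("%m/%d/%Y"), window_end.strftime("%m/%d/%Y"))
        )
        # Move to the first day of the following month.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return windows


windows = month_windows(date(2018, 4, 15), date(2018, 6, 10))
for start_date, end_date in windows:
    print(start_date, end_date)
```

Each pair could then be passed to ```data_collection.get_clean_data``` and the resulting dataframes concatenated, so that a failed request only costs one month of work.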

In [78]:
len(pitch_data)

## Create Binary Labels for Pitch Prediction: Fastball = 1 and Offspeed = 0

```binarize_target``` is the final cleaning step for our data. This function turns the target variable 'pitch_type' into a binary outcome. A pitcher's main goal is to disrupt the hitter's timing, so helping a hitter distinguish fastballs from non-fastballs goes a long way toward making them a better hitter.

This step wasn't added to the original data cleaning because I expect to classify 3 types of pitches in the next version of this product.

In [79]:
final_df = data_collection.binarize_target(pitch_data)
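
The body of ```binarize_target``` is not shown in this walkthrough. Purely as an illustration of the idea, the sketch below maps pitch-type codes to 1 (fastball) or 0 (offspeed) in plain Python. The set of codes treated as fastballs and the name ```binarize_pitch_types``` are assumptions made for the example and may differ from what the package does.

```python
# Assumed set of pitch-type codes counted as fastballs; the real
# binarize_target may use a different grouping.
FASTBALL_CODES = {"FF", "FT", "SI", "FC"}


def binarize_pitch_types(pitch_codes):
    """Return 1 for each fastball code and 0 for everything else."""
    return [1 if code in FASTBALL_CODES else 0 for code in pitch_codes]


labels = binarize_pitch_types(["FF", "SL", "CH", "SI", "CU"])
print(labels)  # [1, 0, 0, 1, 0]
```

On a dataframe column, the same rule could be expressed with `isin(FASTBALL_CODES).astype(int)`.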

The next cell exports the dataframe to CSV. The resulting file isn't available on GitHub because it is too large.

In [80]:
#final_df.to_csv(r'raw_data/all_2018_pitches.csv', sep=',', encoding='utf-8')

In [81]:
final_df.head()

# Visualizing the Data

## Model Creation

Define the classifier that will be used to run the model. A gradient boosted trees model was selected because it performed best during the initial EDA period. The initial EDA can be found in ```next_pitch/eda``` and ```next_pitch/model_creation```.

In [82]:
classifier = lib.GradientBoostingClassifier(n_estimators=200, max_depth=10)

Create a test example for the model. This uses a row from an unseen data source to demonstrate the model's output. For this test, the row is taken from game data from the 2019 season.

# Collection of testing data

Drop the columns that cannot be entered by the user via the web app. The model was trained on past data, including outcomes, to help it learn how different pitches affect hitters. When the user submits input to the model, a dictionary of hardcoded averages from the original dataset fills in the gaps. This isn't a perfect solution, and it will be addressed in later versions.

In [5]:
final_df = final_df.drop(['Unnamed: 0', 'about.atBatIndex', 'details.call.description', 'details.description', 
                                    'matchup.pitcher.id'], axis=1)
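
The gap-filling step described above can be pictured as a dictionary merge: values supplied by the user take priority, and anything missing falls back to a stored average. This is a minimal sketch; ```DEFAULT_FEATURES```, ```fill_missing_features```, and the numbers in it are placeholders, not the averages used by the web app.

```python
# Placeholder defaults for fields a user cannot supply. The values are
# illustrative only, not the averages computed from the original dataset.
DEFAULT_FEATURES = {
    "pitchData.nastyFactor": 30.0,
    "pitchData.zone": 5.0,
}


def fill_missing_features(user_input, defaults=DEFAULT_FEATURES):
    """Return a new dict with defaults applied where the user gave no value."""
    filled = dict(defaults)  # copy so the shared defaults are never mutated
    # Skip None so an empty form field does not overwrite a default.
    filled.update({k: v for k, v in user_input.items() if v is not None})
    return filled


row = fill_missing_features({"pitcher": "John Means", "pitchData.zone": None})
print(row)
```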

Only data from 2019 is used in the test set, to avoid data leakage between the training and test sets.

In [6]:
test_data = data_collection.get_clean_data(start_date='05/06/2019', end_date='07/06/2019')
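
One way to make the no-leakage claim checkable is to confirm that every test-set date falls after the last training date. The sketch below assumes each record carries a game date, which is not among the columns shown in this walkthrough; ```assert_no_temporal_overlap``` is a hypothetical helper.

```python
from datetime import date


def assert_no_temporal_overlap(train_dates, test_dates):
    """Raise ValueError if any test date is on or before the latest training date."""
    latest_train = max(train_dates)
    earliest_test = min(test_dates)
    if earliest_test <= latest_train:
        raise ValueError(
            f"test data starts {earliest_test}, "
            f"but training data runs to {latest_train}"
        )
    return latest_train, earliest_test


# Date ranges taken from the two get_clean_data calls in this walkthrough.
train_dates = [date(2018, 5, 1), date(2018, 5, 31)]
test_dates = [date(2019, 5, 6), date(2019, 7, 6)]
print(assert_no_temporal_overlap(train_dates, test_dates))
```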

In [7]:
#test_data.to_csv(r'raw_data/2019_test_pitches.csv', sep=',', encoding='utf-8')

In [9]:
test_data = test_data.drop(['Unnamed: 0', 'about.atBatIndex', 'details.call.description', 'details.description', 
                                    'matchup.pitcher.id'], axis=1)

In [10]:
final_df.head()

Unnamed: 0,pitcher,WAR_x,WHIP,ERA,SO,hitter,SLG,OPS,WAR_y,about.halfInning,about.inning,matchup.batSide.code,matchup.pitchHand.code,matchup.splits.menOnBase,pitchData.nastyFactor,pitchData.zone,pitchNumber,pitch_type,prior_pitch_type,count
0,Kendrys Morales,0.0,1.0,0.0,0,Matt Chapman,0.508,0.864,8.2,top,9,R,R,Men_On,32.94,1.0,2.0,0.0,0.0,1.0-1.0
1,Kendrys Morales,0.0,1.0,0.0,0,Matt Chapman,0.508,0.864,8.2,top,9,R,R,Men_On,31.44,4.0,3.0,0.0,0.0,1.0-2.0
2,Kendrys Morales,0.0,1.0,0.0,0,Matt Chapman,0.508,0.864,8.2,top,9,R,R,Men_On,2.66,14.0,4.0,0.0,0.0,2.0-2.0
3,Kendrys Morales,0.0,1.0,0.0,0,Matt Chapman,0.508,0.864,8.2,top,9,R,R,Men_On,3.82,14.0,5.0,0.0,0.0,3.0-2.0
4,Kendrys Morales,0.0,1.0,0.0,0,Matt Chapman,0.508,0.864,8.2,top,9,R,R,Men_On,35.64,11.0,6.0,0.0,0.0,4.0-2.0


In [11]:
test_data.head()

Unnamed: 0,pitcher,WAR_x,WHIP,ERA,SO,hitter,SLG,OPS,WAR_y,about.halfInning,about.inning,matchup.batSide.code,matchup.pitchHand.code,matchup.splits.menOnBase,pitchData.nastyFactor,pitchData.zone,pitchNumber,pitch_type,prior_pitch_type,count
0,Pablo Sandoval,0.0,0.0,0.0,0,Jose Peraza,0.416,0.742,2.3,bottom,8,R,R,Men_On,35.326612,12.0,2.0,0.0,0.0,1.0-1.0
1,Pablo Sandoval,0.0,0.0,0.0,0,Jose Peraza,0.416,0.742,2.3,bottom,8,R,R,Men_On,35.326612,14.0,3.0,0.0,0.0,1.0-2.0
2,Pablo Sandoval,0.0,0.0,0.0,0,Jose Peraza,0.416,0.742,2.3,bottom,8,R,R,Men_On,35.326612,11.0,4.0,0.0,0.0,2.0-2.0
3,Aaron Brooks,0.1,1.13,0.0,1,Joey Wendle,0.435,0.789,4.3,top,7,L,R,Men_On,35.326612,11.0,2.0,1.0,1.0,0.0-2.0
4,Aaron Brooks,0.1,1.13,0.0,1,Joey Wendle,0.435,0.789,4.3,top,7,L,R,Men_On,35.326612,13.0,3.0,0.0,1.0,0.0-2.0


Define the test predictors and target. These are passed to the function as ```X_test``` and ```y_test```. Because the 2019 data was never used in training, the model has had absolutely no interaction with it, so the evaluation reflects performance on genuinely unseen data.

In [169]:
test_target = test_data['pitch_type']

In [170]:
test_predictors = test_data.drop(['pitch_type'], axis=1).copy()

Create a demo row for the model to predict on. This simulates the user-input experience in deployment.

In [171]:
testing = test_predictors[-442:-441]
testing

Unnamed: 0,pitcher,WAR_x,WHIP,ERA,SO,hitter,SLG,OPS,WAR_y,about.halfInning,about.inning,matchup.batSide.code,matchup.pitchHand.code,matchup.splits.menOnBase,pitchData.nastyFactor,pitchData.zone,pitchNumber,prior_pitch_type,count
147932,John Means,-0.2,1.8,13.5,4,Jeff Mathis,0.272,0.544,0.2,bottom,5,R,L,Men_On,35.326612,6.0,3.0,0.0,2.0-1.0
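
The slice `[-442:-441]` selects a single row counted from the end while keeping the result as a one-row dataframe rather than a single record, which matches the two-dimensional input that scikit-learn style `predict` methods generally expect. The difference between slicing and indexing can be seen on a plain list used as a stand-in:

```python
rows = list(range(1000))  # stand-in for 1,000 dataframe rows

single = rows[-442:-441]  # slice: a one-element list
scalar = rows[-442]       # index: the bare element

print(single)  # [558]
print(scalar)  # 558
```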


The original model, which is pickled in a later cell, is loaded here for use on this test row.

In [19]:
import pickle

In [None]:
with open('next_pitch/web_app/final.pkl', 'rb') as f:
    model = pickle.load(f)

In [329]:
model.predict(testing)

array([1.])

The output is an array containing either 0 or 1: 1 means the model predicts that a fastball will come next, and 0 means it predicts an offspeed pitch.
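
For display in an app, the numeric output can be translated into readable labels. The mapping below follows the 1 = fastball, 0 = offspeed convention used in this walkthrough; ```PITCH_LABELS``` and ```describe_predictions``` are hypothetical names.

```python
PITCH_LABELS = {0: "Offspeed", 1: "Fastball"}


def describe_predictions(predictions):
    """Convert 0/1 model outputs (ints or floats) to readable labels."""
    return [PITCH_LABELS[int(round(p))] for p in predictions]


print(describe_predictions([1.0]))       # ['Fastball']
print(describe_predictions([0.0, 1.0]))  # ['Offspeed', 'Fastball']
```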

In [17]:
# final = pitch_functions.final_model(X_test=test_predictors, y_test=test_target, dataframe=final_df_test, classifier=classifier)

Accuracy:0.694
F1-Score: 0.693
AUC: 0.682
None
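
For reference, accuracy and F1 can be computed directly from true and predicted labels. The sketch below uses plain Python and the binary F1 for the positive (fastball) class. How ```final_model``` averages its F1-score is not shown here, so treat this as an illustration of the definitions rather than a reproduction of the numbers above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Tiny made-up example: 4 of 6 predictions are correct.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))
print(f1_score_binary(y_true, y_pred))
```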


Dump the original model (commented out above) into a pickle that can be used for the web app and demonstrations.

In [20]:
with open('final_test.pkl', 'wb') as f:
    pickle.dump(final, f)
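
A quick round-trip check can confirm that a pickled object loads back unchanged before it is handed to the web app. The sketch below uses a plain dictionary as a stand-in for the fitted model and a temporary directory instead of the project paths.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way.
stand_in_model = {"n_estimators": 200, "max_depth": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_test.pkl")

    # Write in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back and compare with the original object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source.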

In [325]:
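def format_user_input(user_dict):
    """Turn one user-input dictionary into a single-row dataframe for the model."""
    # Wrap the dict in a list so pandas builds exactly one row.
    live_df = lib.pd.DataFrame([user_dict])
    created_test = data_collection.merge_player_stats(live_df)
    # test_list is not defined in this walkthrough; it must hold the
    # column names the model expects.
    created_test = created_test[test_list]
    return created_test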
    
    