## Pitch Prediction Model Building

I'll use a logistic regression model to perform the actual pitch prediction.

Thoughts:
- Stratified test/train split
- OvR multi-class method
- Need to encode categorical variables
- Feature scaling??
- CV and GridSearch
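As a rough sketch of how these ideas could fit together, the example below chains one-hot encoding, a one-vs-rest logistic regression, and a cross-validated grid search over the regularization strength. The data is synthetic, and the names (demo_x, demo_y, pipeline, param_grid) and the grid values are assumptions made for illustration, not part of the analysis that follows. No scaler is included because one-hot columns are already on a 0/1 scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical values, not the Kluber data set)
rng = np.random.default_rng(0)
n_rows = 300
demo_x = pd.DataFrame({
    "stand": rng.choice(["L", "R"], size=n_rows),
    "b": rng.integers(0, 4, size=n_rows),
    "s": rng.integers(0, 3, size=n_rows),
    "prev_pitch": rng.choice(["SI", "FC", "CU"], size=n_rows),
})
demo_y = pd.Series(rng.choice(["SI", "FC", "CU"], size=n_rows), name="pitch_type")

# Stratified split keeps the class mix similar in both sets
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.33, random_state=1, stratify=demo_y
)

# Encode every column as categorical, then fit one binary model per class
pipeline = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

# Grid over the inverse regularization strength C (values chosen arbitrarily)
param_grid = {"clf__estimator__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=StratifiedKFold(n_splits=5), scoring="accuracy")
search.fit(demo_x_train, demo_y_train)

print(search.best_params_)
print(search.score(demo_x_test, demo_y_test))
```

Because the synthetic labels are random, the score here only shows that the pieces run together; it says nothing about the real data.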

In [1]:
#Import packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
#Load data

kluber = pd.read_csv('kluber.csv')
kluber.head()

Unnamed: 0,pitcher_name,batter_name,stand,p_throws,inning_side,count,x,y,pitch_type,type_confidence,on_1b,on_2b,on_3b,b,s,TTO,prev_pitch
0,Corey Kluber,Carlos Gomez,R,R,bottom,0-0,127.39,141.47,SI,2.0,0.0,0.0,0.0,0,0,1.0,First Pitch
1,Corey Kluber,Carlos Gomez,R,R,bottom,1-0,81.85,191.16,SI,2.0,0.0,0.0,0.0,1,0,1.0,SI
2,Corey Kluber,Carlos Gomez,R,R,bottom,1-1,107.3,158.5,FC,2.0,0.0,0.0,0.0,1,1,1.0,SI
3,Corey Kluber,Carlos Gomez,R,R,bottom,1-2,87.38,116.71,FF,2.0,0.0,0.0,0.0,1,2,1.0,FC
4,Corey Kluber,Carlos Gomez,R,R,bottom,2-2,72.14,163.7,SI,2.0,0.0,0.0,0.0,2,2,1.0,FF


## Preparing the Data

There are a few data preparation steps required before training a model. 

1. Decide which columns to use for the initial model and drop the rest
2. Separate the dependent variable (pitch_type)
3. Encode any categorical variables
4. Split the data into testing and training sets
5. Standardize data

For my initial model, the predictor variables I'm going to use are:
- Batter handedness: 'stand'
- Men on base: 'on_1b', 'on_2b', 'on_3b'
- Balls: 'b'
- Strikes: 's'
- Times-through-order: 'TTO'
- Previous pitch: 'prev_pitch'
- Interactions between these variables


In [7]:
#Remove unnecessary columns

cols_to_drop = ['pitcher_name','batter_name','p_throws','inning_side','count','x','y','type_confidence']
x = kluber.drop(cols_to_drop, axis = 1)
x.head()

Unnamed: 0,stand,pitch_type,on_1b,on_2b,on_3b,b,s,TTO,prev_pitch
0,R,SI,0.0,0.0,0.0,0,0,1.0,First Pitch
1,R,SI,0.0,0.0,0.0,1,0,1.0,SI
2,R,FC,0.0,0.0,0.0,1,1,1.0,SI
3,R,FF,0.0,0.0,0.0,1,2,1.0,FC
4,R,SI,0.0,0.0,0.0,2,2,1.0,FF


In [8]:
#Split off pitch_type into a separate Series

y = x['pitch_type']
x.drop(['pitch_type'], axis = 1, inplace = True)

In [9]:
x.head()

Unnamed: 0,stand,on_1b,on_2b,on_3b,b,s,TTO,prev_pitch
0,R,0.0,0.0,0.0,0,0,1.0,First Pitch
1,R,0.0,0.0,0.0,1,0,1.0,SI
2,R,0.0,0.0,0.0,1,1,1.0,SI
3,R,0.0,0.0,0.0,1,2,1.0,FC
4,R,0.0,0.0,0.0,2,2,1.0,FF


### Encode Categorical Variables

All of my variables are categorical, even those stored as numbers (balls and strikes, for example). I'll use the pandas get_dummies function to one-hot encode every column.

In [10]:
x = pd.get_dummies(x, columns = x.columns.tolist())
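The columns argument matters here. As a small illustration on a toy frame (the frame and its values are made up for this sketch), get_dummies leaves numeric columns untouched by default and only expands them when they are listed explicitly:

```python
import pandas as pd

# Toy frame: one string column and one numeric column that is really categorical
toy = pd.DataFrame({"stand": ["R", "L", "R"], "b": [0, 1, 2]})

# Default behaviour: only the string column is expanded, 'b' stays numeric
default_dummies = pd.get_dummies(toy)

# Listing every column forces the numeric column to be one-hot encoded too
all_dummies = pd.get_dummies(toy, columns=toy.columns.tolist())

print(default_dummies.columns.tolist())
print(all_dummies.columns.tolist())
```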

In [11]:
x.head()

Unnamed: 0,stand_L,stand_R,on_1b_0.0,on_1b_1.0,on_2b_0.0,on_2b_1.0,on_3b_0.0,on_3b_1.0,b_0,b_1,...,s_2,TTO_1.0,TTO_2.0,TTO_3.0,prev_pitch_CH,prev_pitch_CU,prev_pitch_FC,prev_pitch_FF,prev_pitch_First Pitch,prev_pitch_SI
0,0,1,1,0,1,0,1,0,1,0,...,0,1,0,0,0,0,0,0,1,0
1,0,1,1,0,1,0,1,0,0,1,...,0,1,0,0,0,0,0,0,0,1
2,0,1,1,0,1,0,1,0,0,1,...,0,1,0,0,0,0,0,0,0,1
3,0,1,1,0,1,0,1,0,0,1,...,1,1,0,0,0,0,1,0,0,0
4,0,1,1,0,1,0,1,0,0,0,...,1,1,0,0,0,0,0,1,0,0


### Split Training and Test Sets

In [12]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1, stratify = y)
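
To illustrate what stratify does, the sketch below splits a small, deliberately imbalanced set of made-up labels and compares the class proportions in each part. The labels and the row_id column are hypothetical; only the train_test_split arguments mirror the call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 60% SI, 30% CU, 10% CH
labels = pd.Series(["SI"] * 60 + ["CU"] * 30 + ["CH"] * 10)
features = pd.DataFrame({"row_id": range(len(labels))})

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.33, random_state=1, stratify=labels
)

# Class shares should be close to each other in all three columns
proportions = pd.DataFrame({
    "full": labels.value_counts(normalize=True),
    "train": l_train.value_counts(normalize=True),
    "test": l_test.value_counts(normalize=True),
})
print(proportions)
```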

### Building the Model

First attempt: L2-regularized logistic regression with cross-validation, using one-vs-rest and no interaction terms.

In [111]:
from sklearn.linear_model import LogisticRegressionCV

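#Older scikit-learn releases default to one-vs-rest; set it explicitly so the intent is clear
ovr = LogisticRegressionCV(cv = 10, max_iter = 10000, multi_class = 'ovr')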
ovr_fit = ovr.fit(x_train, y_train)
ovr_fit.score(x_test, y_test)

0.35087719298245612

In [112]:
#Compute confusion matrix

pd.crosstab(y_test, ovr_fit.predict(x_test), rownames=['True'], colnames=['Predicted'])

True \ Predicted,CU,FC,SI
CH,26,0,32
CU,181,1,81
FC,65,0,170
FF,63,1,73
SI,114,3,159
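
As a sanity check, the accuracy can be recomputed from the confusion matrix, and compared with a baseline that always guesses the most common pitch in the test set. The sketch below re-enters the counts from the table above by hand; the variable names are my own.

```python
import pandas as pd

# Counts copied from the confusion matrix above (rows = true, columns = predicted)
cm = pd.DataFrame(
    {"CU": [26, 181, 65, 63, 114], "FC": [0, 1, 0, 1, 3], "SI": [32, 81, 170, 73, 159]},
    index=["CH", "CU", "FC", "FF", "SI"],
)

total = cm.to_numpy().sum()
# Correct predictions sit where the true and predicted labels match
correct = sum(cm.loc[label, label] for label in cm.columns)
accuracy = correct / total

# Baseline: always predict the most frequent true class in the test set
row_totals = cm.sum(axis=1)
baseline = row_totals.max() / total

# Per-class recall; classes that are never predicted get a recall of 0
recall = pd.Series({
    label: (cm.loc[label, label] if label in cm.columns else 0) / row_totals[label]
    for label in cm.index
})

print(accuracy, baseline)
print(recall)
```

The recomputed accuracy matches the score reported above, and the recall for CH and FF is zero because the model never predicts either pitch.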


Let's see if the multinomial formulation does any better.

In [113]:
mn = LogisticRegressionCV(cv = 10, max_iter = 10000, multi_class = 'multinomial')
mn_fit = mn.fit(x_train, y_train)
mn_fit.score(x_test, y_test)

0.35087719298245612

In [114]:
#Compute confusion matrix

pd.crosstab(y_test, mn_fit.predict(x_test), rownames=['True'], colnames=['Predicted'])

True \ Predicted,CU,FC,FF,SI
CH,26,2,0,30
CU,171,22,0,70
FC,59,36,0,140
FF,61,12,0,64
SI,108,34,1,133
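
For reference, the difference between the two formulations compared above can be sketched on synthetic data. One-vs-rest fits a separate binary classifier per class, while the multinomial model fits a single softmax over all classes. This example uses OneVsRestClassifier and plain LogisticRegression rather than LogisticRegressionCV, and it assumes a scikit-learn version in which LogisticRegression with the default lbfgs solver fits a multinomial model for multi-class targets; the data and names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class problem (not the pitch data)
demo_features, demo_labels = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# One-vs-rest: one independent binary classifier per class
ovr_demo = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_demo.fit(demo_features, demo_labels)

# Multinomial (assumed default behaviour): a single softmax model over all classes
softmax_demo = LogisticRegression(max_iter=1000)
softmax_demo.fit(demo_features, demo_labels)

print(len(ovr_demo.estimators_))   # number of binary models fitted
print(softmax_demo.coef_.shape)    # one coefficient row per class
print(ovr_demo.score(demo_features, demo_labels))
print(softmax_demo.score(demo_features, demo_labels))
```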


### Consider interaction terms!

In [105]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True, include_bias = False)
x_train_poly = poly.fit_transform(x_train)
#Reuse the transformer fitted on the training set instead of refitting on the test set
x_test_poly = poly.transform(x_test)
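
To make concrete what interaction_only=True produces, here is a toy example with three 0/1 columns (the arrays are made up). Each row keeps its original columns and gains one product column per pair of features, with no squared terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy 0/1 data with three features
toy_train = np.array([[1, 0, 1], [0, 1, 1]])
toy_test = np.array([[1, 1, 0]])

toy_poly = PolynomialFeatures(interaction_only=True, include_bias=False)
train_expanded = toy_poly.fit_transform(toy_train)  # fit on training data only
test_expanded = toy_poly.transform(toy_test)        # reuse the fitted transformer

# Column order: x0, x1, x2, x0*x1, x0*x2, x1*x2
print(train_expanded)
print(test_expanded)
```

With one-hot encoded inputs, a product column is 1 only when both dummy variables are 1, so products of two dummies from the same original variable are always 0.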

In [115]:
#Re-try Logistic Regression with one-vs-rest

ovr_poly = LogisticRegressionCV(cv = 5, max_iter = 10000, multi_class = 'ovr')
ovr_poly_fit = ovr_poly.fit(x_train_poly, y_train)
ovr_poly_fit.score(x_test_poly, y_test)

0.35087719298245612

In [116]:
#Confusion Matrix

pd.crosstab(y_test, ovr_poly_fit.predict(x_test_poly), rownames=['True'], colnames=['Predicted'])

True \ Predicted,CU,FC,SI
CH,28,0,30
CU,173,2,88
FC,77,2,156
FF,64,0,73
SI,110,1,165


In [117]:
#Multinomial

mn_poly = LogisticRegressionCV(cv = 5, max_iter = 10000, multi_class = 'multinomial')
mn_poly_fit = mn_poly.fit(x_train_poly, y_train)
mn_poly_fit.score(x_test_poly, y_test)

0.36222910216718268

In [118]:
pd.crosstab(y_test, mn_poly_fit.predict(x_test_poly), rownames=['True'], colnames=['Predicted'])

True \ Predicted,CU,FC,SI
CH,27,1,30
CU,174,12,77
FC,69,23,143
FF,64,9,64
SI,109,13,154
