# Baseball Pitch Type Prediction

Living in Houston, it would be sacrilegious for me not to be a huge fan of baseball and the Houston Astros. For years, I've cheered, cried, and laughed as the Astros went from a pretty bad team to arguably the best team in the league (no matter what the Yankees say!). We acquired some of the best batters in the league as well as one of the greatest pitching staffs in MLB history. One of these incredible pitchers is Houston's favorite closer, Ryan Pressly. He has come to be known as the ninth-inning pitcher who never gives up a run, which is why I want to build a predictive model that can determine what pitch Pressly will throw based on the state of the game at that moment (i.e., the number of balls and strikes, the previous pitch, etc.).

In [1]:
#importing some necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import pybaseball as pb
import numpy as np

In [2]:
#getting the pitching information for all of Pressly's games for the 2022 season
raw_data = pb.statcast_pitcher('2022-04-07', '2022-10-05', player_id = 519151)
raw_data.info()

Gathering Player Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 711 entries, 0 to 710
Data columns (total 92 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   pitch_type                       711 non-null    object 
 1   game_date                        711 non-null    object 
 2   release_speed                    711 non-null    float64
 3   release_pos_x                    711 non-null    float64
 4   release_pos_z                    711 non-null    float64
 5   player_name                      711 non-null    object 
 6   batter                           711 non-null    int64  
 7   pitcher                          711 non-null    int64  
 8   events                           181 non-null    object 
 9   description                      711 non-null    object 
 10  spin_dir                         0 non-null      float64
 11  spin_rate_deprecated             0 non-null      float64
 12  

After looking at the different types of data available in the pitching information, I racked my brain to figure out which factors would matter to a pitcher in the short time frame they have between pitches. Obviously, the count (how many balls and strikes there are) is something the pitcher considers. Additionally, the handedness of the batter (right- or left-handed) would change the type of pitch a pitcher tends to use. Maybe the pitch number within each at-bat affects the pitcher's decision? One thing I wasn't sure about was the type of the last pitch and the second-to-last pitch; perhaps those impact the pitcher's choice? Well, to test these theories, let's first do a bit of feature engineering and combine all the features into a single, organized dataframe.

In [3]:
#using the pitch type data to make the last pitch and second to last pitch features
#(the rows come newest-first, so shift(-1) pulls in the pitch thrown just before the current one)
last_pitch = raw_data.get(["pitch_name"]).shift(-1)
last_pitch = last_pitch.mask(last_pitch.eq('None')).dropna()
second_last = raw_data.get(["pitch_name"]).shift(-2)
second_last = second_last.mask(second_last.eq('None')).dropna()

In [4]:
#creating the organized dataframe
df = raw_data.get(["game_date", "pitch_name", "pitch_number", "strikes", "balls", "stand"])
df.insert(3, "last_pitch", last_pitch)
df.insert(4, "second_last", second_last)

df

Unnamed: 0,game_date,pitch_name,pitch_number,last_pitch,second_last,strikes,balls,stand
0,2022-10-05,Changeup,7,4-Seam Fastball,Curveball,2,3,R
1,2022-10-05,4-Seam Fastball,6,Curveball,Slider,2,2,R
2,2022-10-05,Curveball,5,Slider,Slider,2,1,R
3,2022-10-05,Slider,4,Slider,Slider,2,1,R
4,2022-10-05,Slider,3,Slider,Curveball,1,1,R
...,...,...,...,...,...,...,...,...
706,2022-04-07,Curveball,1,4-Seam Fastball,Slider,0,0,L
707,2022-04-07,4-Seam Fastball,3,Slider,Slider,0,2,R
708,2022-04-07,Slider,2,Slider,4-Seam Fastball,0,1,R
709,2022-04-07,Slider,1,4-Seam Fastball,,0,0,R


Alright, now that I have my organized dataframe of features and labels, I need to encode all of it in order for it to, you know, actually work with the algorithms! First, however, I removed every row that is the first or second pitch of an at-bat. My reasoning is that at-bats act as a sort of refresh for pitchers: a clean slate, if you will. Additionally, because the last-pitch and second-to-last-pitch features were built by shifting the data by 1 and 2 rows respectively, the first and second pitch of a game would otherwise get a last pitch and second-to-last pitch from the previous game, which just doesn't make sense! Finally, I'm going to combine two of the label types, Curveballs and Sliders, for two reasons. First, when grouping pitch types, people usually put them in one category: Breaking Balls. The second reason matters more in practice: the pitches that are most important for batters to recognize (in order to hit them well) are fastballs and changeups, which means that differentiating between fastballs, changeups, and breaking balls is the most practical setup for this algorithm.
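To see the boundary problem described above on something small, here is a minimal sketch with made-up pitches. It compares a plain `shift(-1)` with a shift done inside each at-bat. It assumes the Statcast frame carries `game_pk` and `at_bat_number` columns (the truncated `info()` output above doesn't show them), and the `toy` dataframe and its new column names are purely illustrative, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy data in newest-first order, like the Statcast frame; all values are made up.
# Assumption: `game_pk` and `at_bat_number` identify the game and the at-bat.
toy = pd.DataFrame({
    "game_pk":       [2, 2, 2, 1, 1],
    "at_bat_number": [5, 5, 5, 9, 9],
    "pitch_number":  [3, 2, 1, 2, 1],
    "pitch_name":    ["Slider", "Curveball", "4-Seam Fastball", "Changeup", "Slider"],
})

# Plain shift: takes the next row down, even if it belongs to another at-bat or game.
toy["last_pitch_plain"] = toy["pitch_name"].shift(-1)

# Grouped shift: only looks at rows from the same at-bat, so the first pitch
# of an at-bat gets NaN instead of a pitch from an earlier at-bat or game.
toy["last_pitch_grouped"] = (
    toy.groupby(["game_pk", "at_bat_number"])["pitch_name"].shift(-1)
)

print(toy[["pitch_number", "pitch_name", "last_pitch_plain", "last_pitch_grouped"]])
```

With the grouped version, rows that have no real previous pitch show up as missing values, so a later `dropna()` would remove them without having to filter on `pitch_number` by hand.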

In [5]:
#removing the first and second pitch of each at-bat, then combining the curveballs and sliders

df = df.mask(df['pitch_number'].eq(1)).dropna()
df = df.mask(df['pitch_number'].eq(2)).dropna()

df['pitch_name'] = df['pitch_name'].replace('4-Seam Fastball', 'fast')
df['pitch_name'] = df['pitch_name'].replace(['Curveball', 'Slider'], 'breaking')


df['last_pitch'] = df['last_pitch'].replace('4-Seam Fastball', 'fast')
df['last_pitch'] = df['last_pitch'].replace(['Curveball', 'Slider'], 'breaking')

df['second_last'] = df['second_last'].replace('4-Seam Fastball', 'fast')
df['second_last'] = df['second_last'].replace(['Curveball', 'Slider'], 'breaking')

df

Unnamed: 0,game_date,pitch_name,pitch_number,last_pitch,second_last,strikes,balls,stand
0,2022-10-05,Changeup,7.0,fast,breaking,2.0,3.0,R
1,2022-10-05,fast,6.0,breaking,breaking,2.0,2.0,R
2,2022-10-05,breaking,5.0,breaking,breaking,2.0,1.0,R
3,2022-10-05,breaking,4.0,breaking,breaking,2.0,1.0,R
4,2022-10-05,breaking,3.0,breaking,breaking,1.0,1.0,R
...,...,...,...,...,...,...,...,...
699,2022-04-10,breaking,4.0,fast,fast,2.0,1.0,R
700,2022-04-10,fast,3.0,fast,fast,1.0,1.0,R
703,2022-04-07,breaking,4.0,fast,fast,2.0,1.0,L
704,2022-04-07,fast,3.0,fast,breaking,1.0,1.0,L


In [6]:
#encoding the categorical columns as integer codes (yayyy woo hoo)

df["pitch_name"] = df["pitch_name"].astype('category')
df["pitch_name"] = df["pitch_name"].cat.codes

df["last_pitch"] = df["last_pitch"].astype('category')
df["last_pitch"] = df["last_pitch"].cat.codes

df["second_last"] = df["second_last"].astype('category')
df["second_last"] = df["second_last"].cat.codes

df['strikes'] = df['strikes'].astype(int)
df['balls'] = df['balls'].astype(int)
df['pitch_number'] = df['pitch_number'].astype(int)

df['stand'] = df['stand'].astype('category')
df['stand'] = df['stand'].cat.codes

df = df.drop('game_date', axis=1)

df


Unnamed: 0,pitch_name,pitch_number,last_pitch,second_last,strikes,balls,stand
0,0,7,2,1,2,3,1
1,2,6,1,1,2,2,1
2,1,5,1,1,2,1,1
3,1,4,1,1,2,1,1
4,1,3,1,1,1,1,1
...,...,...,...,...,...,...,...
699,1,4,2,2,2,1,1
700,2,3,2,2,1,1,1
703,1,4,2,2,2,1,0
704,2,3,2,1,1,1,0
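Since the encoded table only shows numbers, it helps to know which code stands for which pitch. Here is a small sketch of how `cat.codes` assigns them, using a made-up `pitches` series with the same three labels; the variable names are just for illustration.

```python
import pandas as pd

# Made-up sample containing the three labels used in this notebook
pitches = pd.Series(["Changeup", "fast", "breaking", "breaking"])
as_cat = pitches.astype("category")

# Categories are stored in sorted order, and uppercase letters sort before
# lowercase ones, so "Changeup" comes before "breaking" and "fast".
code_to_label = dict(enumerate(as_cat.cat.categories))
codes = as_cat.cat.codes.tolist()

print(code_to_label)  # {0: 'Changeup', 1: 'breaking', 2: 'fast'}
print(codes)          # [0, 2, 1, 1]
```

That matches the first row of the encoded table, where the Changeup label became 0, the fastball before it became 2, and the breaking ball before that became 1.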


Alright, our data is set! Now to apply the data to the models. There are a number of classification algorithms that we could use, and there are two main types that I want to try in this project: an SVM (support vector machine) and a Random Forest Classifier. So let's create both of those models using scikit-learn and print the accuracy score of each model for its predictions on the test data.

In [7]:
#split data into train and test sets

X = np.array(df.drop('pitch_name', axis=1))
y = np.array(df['pitch_name'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)


In [8]:
#first model - random forest model

clf = RandomForestClassifier(n_estimators = 200)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

0.6956521739130435

In [9]:
#next model - svm

rbf = svm.SVC(kernel='rbf', gamma=10, C=1).fit(X_train, y_train)
poly = svm.SVC(kernel='poly', degree=2, C=0.13).fit(X_train, y_train)

poly_pred = poly.predict(X_test)
rbf_pred = rbf.predict(X_test)

In [10]:
poly_accuracy = accuracy_score(y_test, poly_pred)
poly_f1 = f1_score(y_test, poly_pred, average='weighted')
print('Accuracy (Polynomial Kernel): ', "%.2f" % (poly_accuracy*100))
print('F1 (Polynomial Kernel): ', "%.2f" % (poly_f1*100))

rbf_accuracy = accuracy_score(y_test, rbf_pred)
rbf_f1 = f1_score(y_test, rbf_pred, average='weighted')
print('Accuracy (RBF Kernel): ', "%.2f" % (rbf_accuracy*100))
print('F1 (RBF Kernel): ', "%.2f" % (rbf_f1*100))

Accuracy (Polynomial Kernel):  71.74
F1 (Polynomial Kernel):  64.93
Accuracy (RBF Kernel):  69.57
F1 (RBF Kernel):  65.45
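Accuracy scores like these are easier to judge next to a baseline that always guesses the most common pitch. Here is a minimal sketch of that comparison using scikit-learn's `DummyClassifier`. The label arrays are made up for illustration (they are not Pressly's data), and the `*_toy` names are hypothetical; with the real split you would pass `X_train`, `y_train`, `X_test`, and `y_test` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels using the same encoding: 0 = changeup, 1 = breaking, 2 = fast
y_train_toy = np.array([1, 1, 1, 1, 2, 2, 0, 1, 2, 1])
y_test_toy = np.array([1, 1, 2, 1, 0])

# The baseline ignores the features, so placeholder columns are enough here
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_accuracy = accuracy_score(y_test_toy, baseline_pred)
baseline_f1 = f1_score(y_test_toy, baseline_pred, average="weighted", zero_division=0)

print('Accuracy (Baseline): ', "%.2f" % (baseline_accuracy * 100))
print('F1 (Baseline): ', "%.2f" % (baseline_f1 * 100))
```

If a model's accuracy sits close to this baseline, it may mostly be learning which pitch is thrown most often, and the weighted F1 score is the more telling number in that case.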
