This notebook explores the pybaseball package, which can pull real-time and historical Statcast data from MLB's Baseball Savant (the leading source for advanced analytics in baseball).

The end goal of this notebook is to build an initial working dataset we can use for modeling and cross-validation.

Since every pitcher is different, and each has thousands of rows of data, I'll start by building a model for just one pitcher. Then I'll look to make it more dynamic so it can take different pitchers as inputs.

In [1]:
from pybaseball import playerid_lookup, statcast, statcast_pitcher
import pandas as pd
import numpy as np


In [2]:
# This function finds ID info for any player in baseball history. For my first example I'll use Blake Snell, one of the more popular and successful pitchers in recent years
a = playerid_lookup('snell', 'blake')

# There was a Blake Snell who played in the early 1900s, so just filtering down. We'll extract the ID to use for stat searching
a[a['mlb_played_first'] > 2000.0]

Gathering player lookup table. This may take a moment.


Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,snell,blake,605483,snelb001,snellbl01,13543,2016.0,2024.0
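
The ID can also be pulled from the lookup result rather than copied by hand. This is a minimal sketch, assuming the lookup returns the columns shown above. The small `lookup` frame is a hand-built stand-in for the `playerid_lookup('snell', 'blake')` result so the example runs offline, and `recent` and `snell_id` are illustrative names.

```python
import pandas as pd

# Stand-in for the playerid_lookup output shown above (subset of columns, one row)
lookup = pd.DataFrame({
    "name_last": ["snell"],
    "name_first": ["blake"],
    "key_mlbam": [605483],
    "mlb_played_first": [2016.0],
    "mlb_played_last": [2024.0],
})

# Keep only players who debuted after 2000, then require exactly one match
recent = lookup[lookup["mlb_played_first"] > 2000.0]
if len(recent) != 1:
    raise ValueError(f"expected exactly one player, found {len(recent)}")

# key_mlbam is the ID that gets passed to statcast_pitcher later on
snell_id = int(recent["key_mlbam"].iloc[0])
print(snell_id)
```

Failing loudly when the filter returns zero or several rows guards against silently picking the wrong player once other names are passed in.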


In [61]:
len(a)

1

In [3]:
# This data is at the pitch level. This df contains all 3000+ pitches thrown by Snell in the designated timeframe.

df = statcast_pitcher('2023-04-01', '2023-09-30', 605483)
df.head()

Gathering Player Data


Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
0,CU,2023-09-25,82.7,1.91,6.47,"Snell, Blake",672275,605483,strikeout,swinging_strike_blocked,...,1,1,0,0,1,Standard,Standard,330.0,-0.077,-0.311
1,FF,2023-09-25,96.8,1.82,6.51,"Snell, Blake",672275,605483,,swinging_strike,...,1,1,0,0,1,Standard,Standard,162.0,0.0,-0.098
2,FF,2023-09-25,96.8,1.92,6.54,"Snell, Blake",672275,605483,,swinging_strike,...,1,1,0,0,1,Standard,Standard,124.0,0.0,-0.063
3,CH,2023-09-25,86.7,2.39,6.33,"Snell, Blake",571745,605483,single,hit_into_play,...,1,1,0,0,1,Standard,Standard,138.0,0.035,0.284
4,CU,2023-09-25,80.9,2.01,6.48,"Snell, Blake",571745,605483,,ball,...,1,1,0,0,1,Standard,Standard,333.0,0.0,0.025


In [9]:
df11 = df.head(1)
df11

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
0,CU,2023-09-25,82.7,1.91,6.47,"Snell, Blake",672275,605483,strikeout,swinging_strike_blocked,...,1,1,0,0,1,Standard,Standard,330.0,-0.077,-0.311


In [16]:
df.columns

Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estima

This table contains some factors that I assume will be key in predicting pitch type. 

In [18]:
#assessing missing values
missing = df.isna().sum().reset_index()
missing = missing.loc[missing[0] > 0]
missing

Unnamed: 0,index,0
8,events,2351
10,spin_dir,3075
11,spin_rate_deprecated,3075
12,break_angle_deprecated,3075
13,break_length_deprecated,3075
22,hit_location,2470
23,bb_type,2681
31,on_3b,2841
32,on_2b,2499
33,on_1b,2139
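
Raw counts are easier to judge next to the share of rows they represent. This is a minimal sketch on a toy frame; the values are made up for illustration and are not Statcast data, and `toy` and `summary` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the pitch-level data
toy = pd.DataFrame({
    "pitch_type": ["FF", "CU", "CH", "FF"],
    "events": [np.nan, "strikeout", np.nan, np.nan],
    "on_1b": [np.nan, 100001.0, np.nan, np.nan],
})

# Count missing values per column and express them as a fraction of all rows
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
})

# Keep only the columns that have at least one missing value
summary = summary[summary["n_missing"] > 0]
print(summary)
```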


Most factors with null values are stats related to the batted ball in play. These won't be relevant to predicting the preceding pitch, so most can be safely removed.

The 'on base' columns stand out. Each contains the player ID of whoever is on that base. These could plausibly affect pitch type, so I'll convert them to binary dummies and include them in the initial feature set.


In [19]:
# initializing a list of factors to possibly be trimmed
missing_list = missing['index'].to_list()


# These contain player IDs for whoever is on base. I'll replace them with binary variables

on_base_cols = ['on_1b', 'on_2b', 'on_3b']

for i in on_base_cols:
    df[i] = df[i].fillna(0)
    df[i] = np.where(df[i]>1, 1, df[i])
    

df[on_base_cols]


Unnamed: 0,on_1b,on_2b,on_3b
0,1.0,1.0,0.0
1,1.0,1.0,0.0
2,1.0,1.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
...,...,...,...
3070,0.0,1.0,0.0
3071,0.0,0.0,0.0
3072,0.0,0.0,0.0
3073,0.0,0.0,0.0
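
The same flags can be produced in one step by testing for a non-missing ID. This is a minimal sketch on toy data, with placeholder player IDs. Unlike the loop above, it yields integer 0/1 columns rather than floats.

```python
import numpy as np
import pandas as pd

on_base_cols = ["on_1b", "on_2b", "on_3b"]

# Toy rows: a value is a runner's player ID, NaN means the base is empty
toy = pd.DataFrame({
    "on_1b": [100001.0, np.nan, 100002.0],
    "on_2b": [np.nan, np.nan, 100003.0],
    "on_3b": [np.nan, np.nan, np.nan],
})

# notna() is True where a runner is present; astype(int) turns that into 1/0
flags = toy[on_base_cols].notna().astype(int)
print(flags)
```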


In [29]:
#remove on base columns from removal list
missing_list2 = [x for x in missing_list if x not in on_base_cols]

#new dataframe with null columns removed
df2 = df.drop(missing_list2, axis=1)
df2.columns


Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'description',
       'zone', 'des', 'game_type', 'stand', 'p_throws', 'home_team',
       'away_team', 'type', 'balls', 'strikes', 'game_year', 'pfx_x', 'pfx_z',
       'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b', 'outs_when_up',
       'inning', 'inning_topbot', 'fielder_2', 'vx0', 'vy0', 'vz0', 'ax', 'ay',
       'az', 'sz_top', 'sz_bot', 'effective_speed', 'release_extension',
       'game_pk', 'pitcher.1', 'fielder_2.1', 'fielder_3', 'fielder_4',
       'fielder_5', 'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9',
       'release_pos_y', 'at_bat_number', 'pitch_number', 'pitch_name',
       'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
       'post_home_score', 'post_bat_score', 'post_fld_score',
       'if_fielding_alignment', 'of_fielding_alignment', 'delta_home_win_exp',
       'delta_run_exp'],
      dtype='object')

Now I'll remove any remaining columns that would be irrelevant to predicting pitch type. As a baseball fan, I'm quite confident about removing these. Anything I'm unsure about is left in initially.

In [56]:
missing_list3 = [
    'release_speed', 'release_pos_x', 'release_pos_z', 'player_name',
    'batter', 'pitcher', 'description', 'zone', 'des', 'game_type',
    'home_team', 'away_team', 'type', 'game_year', 'pfx_x', 'pfx_z',
    'plate_x', 'plate_z', 'fielder_2', 'vx0', 'vy0', 'vz0', 'ax', 'ay',
    'az', 'sz_top', 'sz_bot', 'effective_speed', 'release_extension',
    'game_pk', 'pitcher.1', 'fielder_2.1', 'fielder_3', 'fielder_4',
    'fielder_5', 'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9',
    'release_pos_y', 'post_away_score', 'post_home_score',
    'post_bat_score', 'post_fld_score',
]


df3 = df2.drop(missing_list3, axis=1)
df3

Unnamed: 0,pitch_type,game_date,stand,p_throws,balls,strikes,on_3b,on_2b,on_1b,outs_when_up,...,pitch_number,pitch_name,home_score,away_score,bat_score,fld_score,if_fielding_alignment,of_fielding_alignment,delta_home_win_exp,delta_run_exp
0,CU,2023-09-25,R,L,0,2,0.0,1.0,1.0,2,...,3,Curveball,0,1,0,1,Standard,Standard,-0.077,-0.311
1,FF,2023-09-25,R,L,0,1,0.0,1.0,1.0,2,...,2,4-Seam Fastball,0,1,0,1,Standard,Standard,0.000,-0.098
2,FF,2023-09-25,R,L,0,0,0.0,1.0,1.0,2,...,1,4-Seam Fastball,0,1,0,1,Standard,Standard,0.000,-0.063
3,CH,2023-09-25,R,L,1,0,0.0,0.0,1.0,2,...,2,Changeup,0,1,0,1,Standard,Standard,0.035,0.284
4,CU,2023-09-25,R,L,0,0,0.0,0.0,1.0,2,...,1,Curveball,0,1,0,1,Standard,Standard,0.000,0.025
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3070,FF,2023-04-06,L,L,0,0,0.0,1.0,0.0,0,...,1,4-Seam Fastball,0,0,0,0,Standard,Standard,-0.010,-0.292
3071,FF,2023-04-06,R,L,2,1,0.0,0.0,0.0,0,...,4,4-Seam Fastball,0,0,0,0,Standard,Standard,0.060,0.591
3072,SL,2023-04-06,R,L,1,1,0.0,0.0,0.0,0,...,3,Slider,0,0,0,0,Standard,Standard,0.000,0.053
3073,FF,2023-04-06,R,L,0,1,0.0,0.0,0.0,0,...,2,4-Seam Fastball,0,0,0,0,Standard,Standard,0.000,0.028


In [57]:
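# Encode handedness and inning half as 0/1 flags, then drop the source columns
df3['batter_is_right'] = np.where(df3['stand'] == 'R', 1, 0)
df3['pitcher_is_right'] = np.where(df3['p_throws'] == 'R', 1, 0)
df3['inning_top'] = np.where(df3['inning_topbot'] == 'Top', 1, 0)
# pitch_name duplicates pitch_type, and game_date is not used as a feature
df3 = df3.drop(['stand', 'p_throws', 'pitch_name', 'game_date', 'inning_topbot'], axis=1)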

df3.head()

Unnamed: 0,pitch_type,balls,strikes,on_3b,on_2b,on_1b,outs_when_up,inning,at_bat_number,pitch_number,...,away_score,bat_score,fld_score,if_fielding_alignment,of_fielding_alignment,delta_home_win_exp,delta_run_exp,batter_is_right,pitcher_is_right,inning_top
0,CU,0,2,0.0,1.0,1.0,2,6,47,3,...,1,0,1,Standard,Standard,-0.077,-0.311,1,0,0
1,FF,0,1,0.0,1.0,1.0,2,6,47,2,...,1,0,1,Standard,Standard,0.0,-0.098,1,0,0
2,FF,0,0,0.0,1.0,1.0,2,6,47,1,...,1,0,1,Standard,Standard,0.0,-0.063,1,0,0
3,CH,1,0,0.0,0.0,1.0,2,6,46,2,...,1,0,1,Standard,Standard,0.035,0.284,1,0,0
4,CU,0,0,0.0,0.0,1.0,2,6,46,1,...,1,0,1,Standard,Standard,0.0,0.025,1,0,0


In [58]:


# print(df3['of_fielding_alignment'].value_counts())
# print(df3['if_fielding_alignment'].value_counts())
#df3.columns
df4 = pd.get_dummies(df3, columns=['if_fielding_alignment', 'of_fielding_alignment'], drop_first=True)
df4.head()

Unnamed: 0,pitch_type,balls,strikes,on_3b,on_2b,on_1b,outs_when_up,inning,at_bat_number,pitch_number,...,bat_score,fld_score,delta_home_win_exp,delta_run_exp,batter_is_right,pitcher_is_right,inning_top,if_fielding_alignment_Standard,if_fielding_alignment_Strategic,of_fielding_alignment_Strategic
0,CU,0,2,0.0,1.0,1.0,2,6,47,3,...,0,1,-0.077,-0.311,1,0,0,True,False,False
1,FF,0,1,0.0,1.0,1.0,2,6,47,2,...,0,1,0.0,-0.098,1,0,0,True,False,False
2,FF,0,0,0.0,1.0,1.0,2,6,47,1,...,0,1,0.0,-0.063,1,0,0,True,False,False
3,CH,1,0,0.0,0.0,1.0,2,6,46,2,...,0,1,0.035,0.284,1,0,0,True,False,False
4,CU,0,0,0.0,0.0,1.0,2,6,46,1,...,0,1,0.0,0.025,1,0,0,True,False,False
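
With `drop_first=True`, pandas omits the first level of each encoded column (the levels of a string column are sorted), and that level becomes the implicit baseline. The output above keeps two infield levels, so a third one was dropped. This is a minimal sketch on toy data, where the level name `Infield shade` is an assumption and is not taken from the output:

```python
import pandas as pd

# Toy column with three levels; "Infield shade" is an assumed level name
toy = pd.DataFrame({
    "if_fielding_alignment": ["Standard", "Infield shade", "Strategic", "Standard"],
})

encoded = pd.get_dummies(toy, columns=["if_fielding_alignment"], drop_first=True)

# Levels are sorted, so the first one ("Infield shade") is the one dropped
print(list(encoded.columns))

# A row where every remaining dummy is 0 belongs to the dropped baseline level
baseline_rows = encoded.astype(int).sum(axis=1) == 0
print(baseline_rows.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for linear models. Tree-based models are generally indifferent, so keeping all levels is also a reasonable choice there.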
