In [1]:
import pybaseball

# Your own support vector machine

Today's practice deals with support vector machines. Using radial basis functions (RBF) as kernel functions, we will analyze everyone's favorite sport: baseball! Due to its popularity, you are obviously already familiar with the fact that the [strike zone](https://de.wikipedia.org/wiki/Strike_Zone), which the pitcher has to hit in order to be awarded a "strike" against the batter (a miss is punished with a "ball"), is defined as a rectangle above home plate reaching from the batter's knees to his chest. Because this is common knowledge, we don't need to mention that this definition therefore varies from player to player, and that the umpire's calls also affect the real shape of the strike zone. We will build a support vector machine for our favorite players to determine the decision boundary between pitches called a strike and pitches called a ball.

The first player we'll look at is the 2017 rookie of the year, Aaron Judge, who is 2.01 m tall and therefore one of the tallest players in MLB. Let's see what information is available.

In [24]:
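# Look up Aaron Judge's player IDs; the key_mlbam column holds the MLBAM ID.
judge_id = pybaseball.playerid_lookup("Judge","Aaron")
# 592450 is Aaron Judge's MLBAM ID, hardcoded here.
judge_stats = pybaseball.statcast_batter('2010-01-01', '2022-01-01', 592450)
print(judge_stats.columns)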

Gathering Player Data


  


Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estima

### Data exploration

Well, this might be a little too much data for now, even for us baseball enthusiasts. If you really want to know what all of these stats mean, I'll refer you to [this site](https://baseballsavant.mlb.com/csv-docs).

For now I'll go ahead and pick the relevant columns.

In [25]:
judge_df = judge_stats[["plate_x", "plate_z", "type"]]
print(judge_df[100:106])

    plate_x  plate_z type
30     0.82     3.21    X
31     0.89     0.96    B
32     0.57     1.66    S
33     0.34     2.74    S
34     0.64     3.88    S
35     1.11     2.45    B


There we go: these are the horizontal (plate_x) and vertical (plate_z) positions of the ball when it crosses home plate, from the catcher's perspective. But wait... this was too easy. Something feels off. What's that "mixed types" warning in cell number 2 about? Why would I print rows 100 to 105 and not simply df.head()? And if I print rows 100 to 105, why does the index in the output say 30 to 35?

Inspect the data yourself and figure out the problem

In [26]:
print(judge_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11953 entries, 0 to 11882
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   plate_x  11409 non-null  float64
 1   plate_z  11409 non-null  float64
 2   type     11953 non-null  object 
dtypes: float64(2), object(1)
memory usage: 373.5+ KB
None


Well, there are a bunch of NaNs in our data: plate_x and plate_z are each missing in 544 of the 11953 rows. We should decide what we want to do with these values. If these rows held other relevant data, we might want to replace the NaNs with some other value. But in our case we have no reason to keep these rows, so we can just drop them.
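
The two options mentioned above, dropping or replacing, can be sketched on a small made-up frame. The toy values and the choice of the column median as the replacement are assumptions for illustration only, not part of the Judge data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same shape as judge_df (values are made up).
toy = pd.DataFrame({
    "plate_x": [0.82, np.nan, 0.57, 0.34],
    "plate_z": [3.21, np.nan, 1.66, np.nan],
    "type": ["X", "B", "S", "S"],
})

# Count the missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one NaN.
dropped = toy.dropna()

# Option 2: keep all rows and fill the NaNs, here with each column's median.
filled = toy.fillna(toy[["plate_x", "plate_z"]].median())

print(len(dropped), len(filled))
```

Dropping is the simpler choice when, as here, a row without a pitch location carries nothing we want to learn from.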

Clean the data, so that there are no NaNs left.

In [27]:
judge_df_cleaned = judge_df.dropna()
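
Note that dropna() keeps the original index labels. The info() output above (11953 entries, labels only up to 11882) suggests that some labels occur more than once. One possible cause, which is an assumption here and not confirmed by the notebook, is that the data was downloaded in chunks and concatenated. The sketch below uses two made-up chunks to show how this makes positions and labels disagree, and how reset_index gives a clean range again.

```python
import pandas as pd

# Two hypothetical chunks, each with its own 0-based index (values are made up).
chunk_a = pd.DataFrame({"plate_x": [0.1, 0.2, 0.3]})
chunk_b = pd.DataFrame({"plate_x": [0.4, 0.5]})

# Concatenating keeps the original labels, so 0 and 1 now appear twice.
combined = pd.concat([chunk_a, chunk_b])
print(combined.index.tolist())

# iloc slices by position: this returns the 4th and 5th rows,
# whose labels are 0 and 1.
print(combined.iloc[3:5])

# reset_index(drop=True) replaces the labels with a fresh 0..n-1 range.
renumbered = combined.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to renumber judge_df_cleaned the same way is a design choice; it only matters if later steps rely on the labels being unique.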

### Data preparation