## Data Preprocessing

Data was gathered from the Statcast database.

In [1]:
import pandas as pd
import os

os.chdir('..')

The dataset is then filtered to only include regular-season games and the 13 pitch types used in the study.

In [2]:
data = pd.read_csv(os.path.join('data', 'raw', 'wade_miley_15-21.csv'))

# Define acceptable pitch types
valid_pitch_dict = {
    'FF': 'Four-Seam Fastball', 'FT': 'Two-Seam Fastball',
    'CH': 'Change-up', 'CU': 'Curveball',
    'FC': 'Cutter', 'EP': 'Eephus',
    'FO': 'Forkball', 'KC': 'Knuckle Curve',
    'KN': 'Knuckleball', 'SC': 'Screwball',
    'SI': 'Sinker', 'SL': 'Slider',
    'FS': 'Splitter',
}

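# Keep rows whose pitch code exactly matches a valid code.
# isin() compares whole values (no regex substring matching) and
# treats missing pitch types as non-matches.
filtered_pitches_pitch_type = data[
    data['pitch_type'].isin(list(valid_pitch_dict.keys()))
]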

# Consider only regular season
filtered_pitches_game_type = filtered_pitches_pitch_type[filtered_pitches_pitch_type['game_type'] == 'R']
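
The two filters above can be sketched on a toy frame. The rows and the three-code subset below are invented for illustration and are not taken from the Miley dataset; the point is only that exact matching drops missing and unlisted codes, and that the game-type condition is applied on top of it.

```python
import pandas as pd

# Toy data: invented rows, not real Statcast records.
toy = pd.DataFrame({
    "pitch_type": ["FF", "SL", "IN", None, "CH", "PO"],
    "game_type": ["R", "R", "R", "R", "S", "R"],
})
valid_codes = ["FF", "SL", "CH"]  # hypothetical subset of the valid codes

# Exact-match on pitch code, then keep regular-season rows only.
mask = toy["pitch_type"].isin(valid_codes) & (toy["game_type"] == "R")
toy_filtered = toy[mask]
print(toy_filtered["pitch_type"].tolist())
```

The `CH` row is dropped here because its game type is not `R`, not because of its pitch code.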

Next, we store the pitcher's repertoire for future reference.

In [3]:
repertoire_abb = filtered_pitches_game_type['pitch_type'].unique()
repertoire_full = filtered_pitches_game_type['pitch_name'].unique()

By finding a pitcher’s most common pitch and the frequency with which he throws it, I create a baseline for judging model performance. In Miley’s case, the model needs to outperform the 27% accuracy obtained by guessing four-seam fastball every time.

In [4]:
top_pitch = filtered_pitches_game_type['pitch_type'].value_counts().head(1)
top_pitch_name = valid_pitch_dict[top_pitch.index[0]]
top_pitch_freq = int((top_pitch.values[0] / len(filtered_pitches_game_type)) * 100)
print(f'Wade Miley throws {top_pitch_name} {top_pitch_freq}% of the time')

Wade Miley throws Four-Seam Fastball 27% of the time


## Feature Engineering
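This paper uses the following 8 features: previous pitch type, previous pitch zone, pitch count, inning, runners on base, score difference, ball count, and outs.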

Before engineering these features, I created a new ID column to group pitches by unique at-bat. This will be used to divide the data into chronological sequences for the LSTM model.

In [5]:
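# The features are built on `data`, not on the filtered frames above, so the
# pitch-type and regular-season filters are not applied to the engineered columns.
data.loc[:, 'plate_app_id'] = (
    data['game_pk'].astype(str)
    + data['batter'].astype(str)
    + data['at_bat_number'].astype(str)
)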
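
The ID above joins three integers as strings with no delimiter. The sketch below uses contrived values (invented for illustration; real Statcast IDs may never collide this way) to show how two different key triples can produce the same string, and how a separator, here an arbitrary `_`, keeps them apart. It is one possible variant, not the method used in this notebook.

```python
import pandas as pd

# Contrived keys chosen so the undelimited strings coincide.
toy = pd.DataFrame({
    "game_pk": [413696, 413696],
    "batter": [12134, 121347],
    "at_bat_number": [729, 29],
})

game = toy["game_pk"].astype(str)
batter = toy["batter"].astype(str)
at_bat = toy["at_bat_number"].astype(str)

plain = game + batter + at_bat                   # both rows: "41369612134729"
separated = game + "_" + batter + "_" + at_bat   # rows stay distinct

print(plain.nunique(), separated.nunique())
```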

With pitch count, inning, ball count, and outs already in the data, I engineered the rest of the features with custom functions and simple pandas operations.

In [6]:
data = data.sort_values(['game_date',
                         'game_pk', # Handle double headers
                         'plate_app_id',
                         'pitch_number'], ascending=True)

In [7]:
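def score_diff(row):
    """Return the pitching team's lead (negative when trailing)."""
    # In the top of an inning the home team is in the field.
    if row['inning_topbot'] == 'Top':
        return row['home_score'] - row['away_score']
    # In the bottom of an inning the away team is in the field.
    return row['away_score'] - row['home_score']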
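
As an aside, the same rule can be written without a row-wise `apply`. The sketch below assumes the same three columns and uses `numpy.where` on an invented toy frame; it is an equivalent formulation for illustration, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows with invented scores.
toy = pd.DataFrame({
    "inning_topbot": ["Top", "Bot", "Top"],
    "home_score": [3, 3, 0],
    "away_score": [1, 1, 2],
})

# Pitching team's lead: home minus away in the top half, away minus home otherwise.
toy["score_diff"] = np.where(
    toy["inning_topbot"] == "Top",
    toy["home_score"] - toy["away_score"],
    toy["away_score"] - toy["home_score"],
)
print(toy["score_diff"].tolist())
```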

In [8]:
# Create previous pitch column
data.loc[:,'previous_pitch'] = data['pitch_type'].shift(1)
data.loc[data['pitch_number'] == 1, 'previous_pitch'] = None

# Create previous zone column
data.loc[:,'previous_zone'] = data['zone'].shift(1)
data.loc[data['pitch_number'] == 1, 'previous_zone'] = None

# Encode runners on base: 1 if the base is occupied, 0 otherwise
on_base_cols = ['on_3b', 'on_2b', 'on_1b']
for col in on_base_cols:
    data.loc[:, col] = data[col].fillna(0).astype(int)
    data.loc[data[col] != 0, col] = 1

# Create score difference
data.loc[:,'score_diff'] = data.apply(score_diff, axis=1)

selected_features = [
    'plate_app_id',
    'previous_pitch',
    'previous_zone',
    'pitch_number',
    'inning',
    'on_3b', 'on_2b', 'on_1b',
    'score_diff',
    'balls',
    'outs_when_up',
    'pitch_type'
]
selected_cols = data[selected_features]
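
The previous-pitch and previous-zone columns above rely on the rows being sorted and on resetting the value wherever `pitch_number == 1`. A possible alternative, sketched here on invented toy data, shifts within each plate appearance so the first pitch of every group is missing by construction.

```python
import pandas as pd

# Toy plate appearances "a" and "b"; values are invented.
toy = pd.DataFrame({
    "plate_app_id": ["a", "a", "a", "b", "b"],
    "pitch_number": [1, 2, 3, 1, 2],
    "pitch_type": ["FF", "SL", "FF", "CH", "FF"],
    "zone": [5, 14, 2, 8, 11],
})
toy = toy.sort_values(["plate_app_id", "pitch_number"])

# Shift inside each group: nothing leaks across plate appearances.
grouped = toy.groupby("plate_app_id")
toy["previous_pitch"] = grouped["pitch_type"].shift(1)
toy["previous_zone"] = grouped["zone"].shift(1)
print(toy)
```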

Categorical features were then one-hot encoded. Since the paper did not mention scaling or standardizing the continuous columns, I did not do so. This may not make much of a difference, as pitch count is the only continuous feature without an upper limit written into the rules of baseball, and at-bats are usually about 5 pitches long.

In [9]:
# Apply one-hot encoding to categorical columns
selected_cols = pd.get_dummies(selected_cols, columns=['previous_zone', 'previous_pitch', 'inning', 'pitch_type'], dtype=int)
selected_cols.head()

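| | plate_app_id | pitch_number | on_3b | on_2b | on_1b | score_diff | balls | outs_when_up | previous_zone_1.0 | previous_zone_2.0 | ... | inning_8 | inning_9 | pitch_type_CH | pitch_type_CS | pitch_type_CU | pitch_type_FC | pitch_type_FF | pitch_type_IN | pitch_type_SI | pitch_type_SL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10784 | 41369612134729 | 1 | 0.0 | 1.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10783 | 41369612134729 | 2 | 0.0 | 1.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10782 | 41369612134729 | 3 | 0.0 | 1.0 | 0.0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10781 | 41369612134729 | 4 | 0.0 | 1.0 | 0.0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10748 | 41369612134750 | 1 | 1.0 | 0.0 | 1.0 | 3 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |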
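
One detail of the encoding step worth illustrating: with its default settings, `pd.get_dummies` creates no indicator column for missing values, so the first pitch of a plate appearance (which has no previous pitch) ends up with zeros in every `previous_pitch_*` column. The toy frame below is invented for illustration.

```python
import pandas as pd

# First pitch has no previous pitch, so the value is missing.
toy = pd.DataFrame({
    "pitch_number": [1, 2, 3],
    "previous_pitch": [None, "FF", "SL"],
})

encoded = pd.get_dummies(toy, columns=["previous_pitch"], dtype=int)
print(encoded)
```

Passing `dummy_na=True` would add an explicit indicator column for the missing case if that were preferred.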


In [10]:
selected_cols.to_csv(os.path.join('data', 'clean', 'wade_miley.csv'))
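
Because `to_csv` writes the DataFrame index as an unnamed first column by default, reading the file back naively yields an extra `Unnamed: 0` column. The sketch below round-trips an invented toy frame through a temporary file (the path is hypothetical, not the project's `data/clean` directory) and restores the index with `index_col=0`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-default index, mimicking the original row labels.
toy = pd.DataFrame({"pitch_number": [1, 2]}, index=[10784, 10783])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # hypothetical file name
    toy.to_csv(path)
    naive = pd.read_csv(path)                   # index becomes a data column
    restored = pd.read_csv(path, index_col=0)   # index is restored

print(naive.columns.tolist())
print(restored.index.tolist())
```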