#### Bridget Sands
#### Harvard University
#### Applied Mathematics Senior Thesis
#### April 1st, 2024

# "Data_Cleaning_PA.ipynb"

### Note: This is the second (A) cleaning notebook. It follows the first cleaning notebook, which adjusts for the men on base. It is denoted (A) because it prepares data for the PA model; notebook (B) prepares data for the SB model.

#### Notebook Purpose and Summary:
This notebook performs the final cleaning and preparation of data already cleaned by the first data cleaning notebook, `Clean_OB.ipynb`. It takes in a season of cleaned data, does additional feature engineering, and exports the season ready to be imported into the PA model.

#### Input:
1. `csv` season of data for a specific league/year, already cleaned by `Clean_OB.ipynb`.
2. `problem_pks.csv` of game_pks for games where ABS is used.

#### Export:
1. `csv` season of data for the inputted league/year, ready to be imported into the `PA_model.Rmd` file.
2. `csv` of unique batter ids and names from this season of data.
3. `csv` of unique pitcher ids and names from this season of data.

#### Glossary:
- PA: Plate appearance

#### Additional Notes:
- Following data cleaning and investigation, it is clear that columns that begin with `details.` generally describe the specific row entry, while columns that begin with `results.` provide information about the overall PA the entry belongs to.

In [1]:
# Import helpful libraries
import numpy as np
import pandas as pd
import math

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Remember to CHANGE FILE:
#### Relative to the season being cleaned

In [3]:
# Read in file as a pandas dataframe
df = pd.read_csv('da14_wOB_F.csv', low_memory=False)

In [4]:
# Sort values by game and entry time
df = df.sort_values(by=['game_pk', 'startTime'])

# Inspect head of data
df.head(50)

Unnamed: 0.1,Unnamed: 0,game_pk,game_date,startTime,type,playId,pitchNumber,details.description,details.event,details.code,details.isInPlay,details.isStrike,details.isBall,count.balls.start,count.strikes.start,count.outs.start,result.eventType,result.description,result.rbi,result.awayScore,result.homeScore,about.atBatIndex,about.halfInning,about.inning,about.isComplete,about.isScoringPlay,matchup.batter.id,matchup.batter.fullName,matchup.batSide.code,matchup.pitcher.id,matchup.pitcher.fullName,matchup.pitchHand.code,matchup.splits.menOnBase,details.isOut,about.isTopInning,PA_id,Men_OB
14,14,383692,2014-04-03,2014-04-03T23:39:43.000Z,pitch,03836926-0016-0013-000c-f08cd117d70a,1.0,Called Strike,,C,False,True,False,0,1,0,field_out,"Ketel Marte grounds out, second baseman Darnel...",0.0,0.0,0.0,0.0,top,1.0,True,False,606466.0,Ketel Marte,L,516910.0,Carlos Frias,R,Empty,False,True,383692-0.0-1.0-1,Empty
13,13,383692,2014-04-03,2014-04-03T23:39:53.000Z,pitch,03836926-0016-0023-000c-f08cd117d70a,2.0,"In play, out(s)",,X,True,False,False,0,1,0,field_out,"Ketel Marte grounds out, second baseman Darnel...",0.0,0.0,0.0,0.0,top,1.0,True,False,606466.0,Ketel Marte,L,516910.0,Carlos Frias,R,Empty,True,True,383692-0.0-1.0-1,Empty
16,16,383692,2014-04-03,2014-04-03T23:40:19.000Z,pitch,03836926-0026-0013-000c-f08cd117d70a,1.0,Called Strike,,C,False,True,False,0,1,1,field_out,"Jack Marder grounds out, third baseman Daniel ...",0.0,0.0,0.0,1.0,top,1.0,True,False,573011.0,Jack Marder,R,516910.0,Carlos Frias,R,Empty,False,True,383692-1.0-1.0-1,Empty
15,15,383692,2014-04-03,2014-04-03T23:40:42.000Z,pitch,03836926-0026-0023-000c-f08cd117d70a,2.0,"In play, out(s)",,X,True,False,False,0,1,1,field_out,"Jack Marder grounds out, third baseman Daniel ...",0.0,0.0,0.0,1.0,top,1.0,True,False,573011.0,Jack Marder,R,516910.0,Carlos Frias,R,Empty,True,True,383692-1.0-1.0-1,Empty
20,20,383692,2014-04-03,2014-04-03T23:41:02.000Z,pitch,03836926-0036-0013-000c-f08cd117d70a,1.0,Ball,,B,False,False,True,1,0,2,strikeout,Leon Landry strikes out swinging.,0.0,0.0,0.0,2.0,top,1.0,True,False,518914.0,Leon Landry,L,516910.0,Carlos Frias,R,Empty,False,True,383692-2.0-1.0-1,Empty
19,19,383692,2014-04-03,2014-04-03T23:41:10.000Z,pitch,03836926-0036-0023-000c-f08cd117d70a,2.0,Foul,,F,False,True,False,1,1,2,strikeout,Leon Landry strikes out swinging.,0.0,0.0,0.0,2.0,top,1.0,True,False,518914.0,Leon Landry,L,516910.0,Carlos Frias,R,Empty,False,True,383692-2.0-1.0-1,Empty
18,18,383692,2014-04-03,2014-04-03T23:41:27.000Z,pitch,03836926-0036-0033-000c-f08cd117d70a,3.0,Swinging Strike,,S,False,True,False,1,2,2,strikeout,Leon Landry strikes out swinging.,0.0,0.0,0.0,2.0,top,1.0,True,False,518914.0,Leon Landry,L,516910.0,Carlos Frias,R,Empty,False,True,383692-2.0-1.0-1,Empty
17,17,383692,2014-04-03,2014-04-03T23:41:47.000Z,pitch,03836926-0036-0043-000c-f08cd117d70a,4.0,Swinging Strike,,S,False,True,False,1,3,2,strikeout,Leon Landry strikes out swinging.,0.0,0.0,0.0,2.0,top,1.0,True,False,518914.0,Leon Landry,L,516910.0,Carlos Frias,R,Empty,True,True,383692-2.0-1.0-1,Empty
2,2,383692,2014-04-03,2014-04-03T23:44:19.000Z,pitch,03836926-0046-0013-000c-f08cd117d70a,1.0,Ball,,B,False,False,True,1,0,0,field_out,"Darnell Sweeney grounds out, second baseman Ja...",0.0,0.0,0.0,3.0,bottom,1.0,True,False,572182.0,Darnell Sweeney,L,607651.0,Trevor Miller,R,Empty,False,False,383692-3.0-1.0-0,Empty
1,1,383692,2014-04-03,2014-04-03T23:44:36.000Z,pitch,03836926-0046-0023-000c-f08cd117d70a,2.0,Swinging Strike,,S,False,True,False,1,1,0,field_out,"Darnell Sweeney grounds out, second baseman Ja...",0.0,0.0,0.0,3.0,bottom,1.0,True,False,572182.0,Darnell Sweeney,L,607651.0,Trevor Miller,R,Empty,False,False,383692-3.0-1.0-0,Empty


In [5]:
# Convert game_date to a datetime type
df['game_date'] = pd.to_datetime(df['game_date'])

# Print out date of final game --> ensure it makes sense
sorted(df['game_date'].values)[-1]

numpy.datetime64('2014-09-01T00:00:00.000000000')

In [6]:
# Filter to just use pitch and no_pitch types, with only complete pitches/plays
df = df[(df['type'].isin(['pitch','no_pitch']))&(df['about.isComplete']==True)]
print(len(df))

569797


In [7]:
# Investigate nas:
df.isna().sum()

Unnamed: 0                       0
game_pk                          0
game_date                        0
startTime                     1407
type                             0
playId                         232
pitchNumber                    232
details.description            232
details.event               569797
details.code                     0
details.isInPlay               232
details.isStrike               232
details.isBall                 232
count.balls.start                0
count.strikes.start              0
count.outs.start                 0
result.eventType                 0
result.description               0
result.rbi                       0
result.awayScore                 0
result.homeScore                 0
about.atBatIndex                 0
about.halfInning                 0
about.inning                     0
about.isComplete                 0
about.isScoringPlay              0
matchup.batter.id                0
matchup.batter.fullName          0
matchup.batSide.code

In [8]:
# Get rid of the entries that have NAs in the `details.isStrike` or `playId` columns
df = df[df['details.isStrike'].notna()]
df = df[df['playId'].notna()]

In [9]:
# Check value counts of Men_OB values
df['Men_OB'].value_counts()

Men_OB
Empty     311966
RISP      128444
Men_On    114485
Loaded     14670
Name: count, dtype: int64

In [10]:
# Create unique identification of each PA
df['PA_id'] = df['game_pk'].astype('str') + '-' + df['about.atBatIndex'].astype('str') + '-' + df['matchup.batter.id'].astype('str')

In [11]:
# Inspect new column
df['PA_id'].head()

14    383692-0.0-606466.0
13    383692-0.0-606466.0
16    383692-1.0-573011.0
15    383692-1.0-573011.0
20    383692-2.0-518914.0
Name: PA_id, dtype: object

In [12]:
# Consider potential PA results
df['result.eventType'].unique()

array(['field_out', 'strikeout', 'double', 'field_error', 'single',
       'home_run', 'walk', 'sac_fly', 'caught_stealing_2b', 'force_out',
       'triple', 'sac_bunt', 'grounded_into_double_play', 'hit_by_pitch',
       'fielders_choice_out', 'double_play', 'intent_walk',
       'fielders_choice', 'other_out', 'catcher_interf', 'pickoff_1b',
       'strikeout_double_play', 'caught_stealing_home',
       'pickoff_caught_stealing_2b', 'caught_stealing_3b', 'pickoff_2b',
       'pickoff_caught_stealing_3b', 'sac_fly_double_play',
       'pickoff_caught_stealing_home', 'sac_bunt_double_play',
       'triple_play', 'stolen_base_2b', 'wild_pitch', 'pickoff_3b',
       'pickoff_error_1b', 'pickoff_error_2b'], dtype=object)

In [13]:
# Filter out PAs ending in stolen base attempts (successful or not)
# Filter out PAs ending in pickoffs (including pickoff errors)
# Filter out PAs ending in balks, catcher interference, intentional walks, or 'other_out'
steals = ['caught_stealing_2b', 'pickoff_caught_stealing_2b', 'other_out', 'pickoff_1b', 'intent_walk', 
          'caught_stealing_3b', 'catcher_interf', 'caught_stealing_home', 'pickoff_2b', 'pickoff_caught_stealing_3b',
          'pickoff_caught_stealing_home', 'pickoff_error_1b', 'pickoff_error_2b', 'stolen_base_2b', 'balk', 'pickoff_3b']

df = df[~df['result.eventType'].isin(steals)]

# Filter out entries of intentional balls, hit by pitch
df = df[~(df['details.description'].isin(['Intent Ball', 'Hit By Pitch']))]

# Create feature that identifies if pitch is last pitch of PA
max_vals = df.groupby('PA_id')['pitchNumber'].transform('max') == df['pitchNumber']
df['Last_pitch'] = max_vals.astype(int)

# Filter out pitches that are the last pitch of field errors, wild pitches, hit by pitch
df = df[~((df['Last_pitch']==True)&(df['result.eventType'].isin(['field_error', 'hit_by_pitch', 'wild_pitch'])))]
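
The `Last_pitch` flag above relies on `groupby(...).transform('max')`, which broadcasts each PA's largest `pitchNumber` back onto every row of that PA. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the season file):

```python
import pandas as pd

# Hypothetical toy data: two plate appearances with 3 and 2 pitches
toy = pd.DataFrame({
    'PA_id': ['g1-0', 'g1-0', 'g1-0', 'g1-1', 'g1-1'],
    'pitchNumber': [1.0, 2.0, 3.0, 1.0, 2.0],
})

# transform('max') returns a Series aligned with toy, holding each PA's max pitchNumber
pa_max = toy.groupby('PA_id')['pitchNumber'].transform('max')

# A pitch is the last of its PA when its number equals that maximum
toy['Last_pitch'] = (toy['pitchNumber'] == pa_max).astype(int)
print(toy)
```

Because the maximum is computed over the rows still present, a PA whose true final pitch was removed by an earlier filter would have its last remaining pitch flagged instead.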

In [14]:
# Balls feature engineering
df['Balls'] = np.where(df['details.isBall'], df['count.balls.start']-1, df['count.balls.start'])

# Rename Outs 
df['Outs'] = df['count.outs.start']

# Batter_home feature engineering
df['Batter_home'] = np.where(df['about.isTopInning'], 0, 1)

# Rename Inning
df['Inning'] = df['about.inning']

In [15]:
# Strikes feature engineering
two_strike_foul = ((df['details.isStrike'])
                   & (df['count.strikes.start'] == 2)
                   & (df['Last_pitch'] == False)
                   & (df['details.description'].isin(['Foul', 'Foul Bunt', 'Foul Tip'])))

# Two-strike fouls that do not end the PA are flagged with -1 and resolved in a later cell
df['Strikes'] = np.where(two_strike_foul, -1,
                         np.where(df['details.isStrike'], df['count.strikes.start'] - 1, df['count.strikes.start']))

# Confirm first part of Strikes feature engineering worked as intended
df['Strikes'].value_counts()

Strikes
 0    240047
 1    137906
 2    117542
-1     66603
 3         4
Name: count, dtype: int64

In [16]:
# Confirm first part of Strikes feature engineering worked as intended
df[df['Strikes']==-1].head()

Unnamed: 0.1,Unnamed: 0,game_pk,game_date,startTime,type,playId,pitchNumber,details.description,details.event,details.code,details.isInPlay,details.isStrike,details.isBall,count.balls.start,count.strikes.start,count.outs.start,result.eventType,result.description,result.rbi,result.awayScore,result.homeScore,about.atBatIndex,about.halfInning,about.inning,about.isComplete,about.isScoringPlay,matchup.batter.id,matchup.batter.fullName,matchup.batSide.code,matchup.pitcher.id,matchup.pitcher.fullName,matchup.pitchHand.code,matchup.splits.menOnBase,details.isOut,about.isTopInning,PA_id,Men_OB,Last_pitch,Balls,Outs,Batter_home,Inning,Strikes
4,4,383692,2014-04-03,2014-04-03T23:45:25.000Z,pitch,03836926-0056-0023-000c-f08cd117d70a,2.0,Foul,,F,False,True,False,0,2,1,double,Ozzie Martinez doubles (1) on a fly ball to ce...,0.0,0.0,0.0,4.0,bottom,1.0,True,False,501954.0,Osvaldo Martinez,R,607651.0,Trevor Miller,R,RISP,False,False,383692-4.0-501954.0,Empty,0,0,1,1,1.0,-1
9,9,383692,2014-04-03,2014-04-03T23:47:33.000Z,pitch,03836926-0066-0033-000c-f08cd117d70a,3.0,Foul,,F,False,True,False,1,2,1,field_out,"Daniel Mayora grounds out, third baseman Ramon...",0.0,0.0,0.0,5.0,bottom,1.0,True,False,468452.0,Daniel Mayora,R,607651.0,Trevor Miller,R,RISP,False,False,383692-5.0-468452.0,RISP,0,1,1,1,1.0,-1
7,7,383692,2014-04-03,2014-04-03T23:48:42.000Z,pitch,03836926-0066-0053-000c-f08cd117d70a,5.0,Foul,,F,False,True,False,2,2,1,field_out,"Daniel Mayora grounds out, third baseman Ramon...",0.0,0.0,0.0,5.0,bottom,1.0,True,False,468452.0,Daniel Mayora,R,607651.0,Trevor Miller,R,RISP,False,False,383692-5.0-468452.0,RISP,0,2,1,1,1.0,-1
34,34,383692,2014-04-03,2014-04-03T23:54:17.000Z,pitch,03836926-0096-0053-000c-f08cd117d70a,5.0,Foul,,F,False,True,False,2,2,1,field_out,Kevin Rivers flies out to left fielder Scott S...,0.0,0.0,0.0,8.0,top,2.0,True,False,577011.0,Kevin Rivers,L,516910.0,Carlos Frias,R,Empty,False,True,383692-8.0-577011.0,Empty,0,2,1,0,2.0,-1
23,23,383692,2014-04-03,2014-04-03T23:58:00.000Z,pitch,03836926-0116-0033-000c-f08cd117d70a,3.0,Foul,,F,False,True,False,1,2,0,field_out,O'Koyea Dickson flies out to right fielder Kev...,0.0,0.0,0.0,10.0,bottom,2.0,True,False,607297.0,O'Koyea Dickson,R,607651.0,Trevor Miller,R,Empty,False,False,383692-10.0-607297.0,Empty,0,1,0,1,2.0,-1


In [17]:
# Run second part of Strikes feature engineering
for index, row in df[df['Strikes']==-1].iterrows():

    # Access current row
    r = row['count.strikes.start']
    curr_loc = df.index.get_loc(index)

    # Access the row that follows in the sorted frame (position curr_loc + 1)
    nxt = df.index[curr_loc+1]
    b4 = df.loc[nxt, 'count.strikes.start']

    # Check if strike count of current row is the same as in that following row
    if r == b4:
        df.loc[index, 'Strikes2'] = row['count.strikes.start']
    else:
        df.loc[index, 'Strikes2'] = row['count.strikes.start']-1

In [18]:
# Execute final part of Strikes Engineering
print(df['Strikes2'].value_counts())
df['Strikes'] = np.where(df['Strikes']==-1, df['Strikes2'], df['Strikes'])

# Confirm Strikes engineering worked
df['Strikes'].value_counts()

Strikes2
2.0    54188
1.0    12415
Name: count, dtype: int64


Strikes
0.0    240047
2.0    171730
1.0    150321
3.0         4
Name: count, dtype: int64
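
The sample rows suggest that `count.strikes.start` holds the count after each pitch is recorded (a first-pitch called strike shows 1), so the count a pitch was thrown in can be read from the preceding pitch of the same PA. The sketch below is an alternative, vectorized formulation using `groupby(...).shift()`, not a reproduction of the loop above. It assumes that reading of the column; the `toy` frame and the `Strikes_before` name are hypothetical.

```python
import pandas as pd

# Hypothetical PA: called strike, foul, foul, ball, swinging strike
toy = pd.DataFrame({
    'PA_id': ['g1-0'] * 5,
    'pitchNumber': [1, 2, 3, 4, 5],
    'count.strikes.start': [1, 2, 2, 2, 3],  # assumed to be the count AFTER each pitch
})

toy = toy.sort_values(['PA_id', 'pitchNumber'])

# Post-pitch count of the preceding pitch in the same PA = count this pitch was thrown in;
# the first pitch of a PA has no predecessor, so it starts at 0 strikes
toy['Strikes_before'] = (
    toy.groupby('PA_id')['count.strikes.start'].shift(1).fillna(0).astype(int)
)
print(toy)
```

This only holds if no pitches of the PA were dropped before the shift; rows removed by earlier filters would leave gaps in the sequence.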

In [19]:
# Confirm Strikes engineering worked
df[df['details.description'].isin(['Foul', 'Foul Bunt', 'Foul Tip'])].head()

Unnamed: 0.1,Unnamed: 0,game_pk,game_date,startTime,type,playId,pitchNumber,details.description,details.event,details.code,details.isInPlay,details.isStrike,details.isBall,count.balls.start,count.strikes.start,count.outs.start,result.eventType,result.description,result.rbi,result.awayScore,result.homeScore,about.atBatIndex,about.halfInning,about.inning,about.isComplete,about.isScoringPlay,matchup.batter.id,matchup.batter.fullName,matchup.batSide.code,matchup.pitcher.id,matchup.pitcher.fullName,matchup.pitchHand.code,matchup.splits.menOnBase,details.isOut,about.isTopInning,PA_id,Men_OB,Last_pitch,Balls,Outs,Batter_home,Inning,Strikes,Strikes2
19,19,383692,2014-04-03,2014-04-03T23:41:10.000Z,pitch,03836926-0036-0023-000c-f08cd117d70a,2.0,Foul,,F,False,True,False,1,1,2,strikeout,Leon Landry strikes out swinging.,0.0,0.0,0.0,2.0,top,1.0,True,False,518914.0,Leon Landry,L,516910.0,Carlos Frias,R,Empty,False,True,383692-2.0-518914.0,Empty,0,1,2,0,1.0,0.0,
4,4,383692,2014-04-03,2014-04-03T23:45:25.000Z,pitch,03836926-0056-0023-000c-f08cd117d70a,2.0,Foul,,F,False,True,False,0,2,1,double,Ozzie Martinez doubles (1) on a fly ball to ce...,0.0,0.0,0.0,4.0,bottom,1.0,True,False,501954.0,Osvaldo Martinez,R,607651.0,Trevor Miller,R,RISP,False,False,383692-4.0-501954.0,Empty,0,0,1,1,1.0,2.0,2.0
10,10,383692,2014-04-03,2014-04-03T23:47:03.000Z,pitch,03836926-0066-0023-000c-f08cd117d70a,2.0,Foul,,F,False,True,False,1,1,1,field_out,"Daniel Mayora grounds out, third baseman Ramon...",0.0,0.0,0.0,5.0,bottom,1.0,True,False,468452.0,Daniel Mayora,R,607651.0,Trevor Miller,R,RISP,False,False,383692-5.0-468452.0,RISP,0,1,1,1,1.0,0.0,
9,9,383692,2014-04-03,2014-04-03T23:47:33.000Z,pitch,03836926-0066-0033-000c-f08cd117d70a,3.0,Foul,,F,False,True,False,1,2,1,field_out,"Daniel Mayora grounds out, third baseman Ramon...",0.0,0.0,0.0,5.0,bottom,1.0,True,False,468452.0,Daniel Mayora,R,607651.0,Trevor Miller,R,RISP,False,False,383692-5.0-468452.0,RISP,0,1,1,1,1.0,2.0,2.0
7,7,383692,2014-04-03,2014-04-03T23:48:42.000Z,pitch,03836926-0066-0053-000c-f08cd117d70a,5.0,Foul,,F,False,True,False,2,2,1,field_out,"Daniel Mayora grounds out, third baseman Ramon...",0.0,0.0,0.0,5.0,bottom,1.0,True,False,468452.0,Daniel Mayora,R,607651.0,Trevor Miller,R,RISP,False,False,383692-5.0-468452.0,RISP,0,2,1,1,1.0,2.0,2.0


In [20]:
# Filter and confirm innings considered are only first nine
df = df[df['about.inning'] <= 9]
df['about.inning'].unique()

array([1., 2., 3., 5., 6., 7., 8., 9., 4.])

In [21]:
# Pitch_outcome engineering:

# Define results ending in in-play outs
outs = ['field_out', 'grounded_into_double_play', 'sac_bunt', 'force_out', 'double_play', 'fielders_choice_out', 
        'fielders_choice', 'sac_fly_double_play', 'sac_bunt_double_play', 'triple_play', 'sac_fly']

df['Pitch_outcome'] = np.where(((df['details.isBall'] == True) & (df['Last_pitch'] == False)) | ((df['result.eventType'] == 'walk') & (df['Last_pitch'] == True)), 'Ball', 
                        np.where((df['details.code'].isin(['F', 'L', 'R', 'O', 'T'])) & (df['Last_pitch'] == False), 'Foul',
                            np.where(((df['details.isStrike'] == True) & (df['Last_pitch'] == False))|((df['result.eventType'].isin(['strikeout','strikeout_double_play'])) & (df['Last_pitch'] == True)), 'Strike',
                                np.where((df['result.eventType'].isin(outs)) & (df['Last_pitch'] == True), 'InPlay_Out',
                                    np.where((df['result.eventType'] == 'single') & (df['Last_pitch'] == True), 'Single',
                                        np.where((df['result.eventType'] == 'double') & (df['Last_pitch'] == True), 'Double',
                                            np.where((df['result.eventType'] == 'triple') & (df['Last_pitch'] == True), 'Triple',
                                                np.where((df['result.eventType'] == 'home_run') & (df['Last_pitch'] == True), 'HR',
                                                    'N/A'))))))))
                                          

In [22]:
# Evaluate Pitch_outcome value counts
df['Pitch_outcome'].value_counts()

Pitch_outcome
Ball          205133
Strike        153463
Foul           96319
InPlay_Out     72412
Single         24109
Double          6849
HR              2861
Triple           934
N/A               22
Name: count, dtype: int64
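
The nested `np.where` chain used for `Pitch_outcome` assigns the label of the first condition that holds. The same first-match logic can be written with `np.select`, which takes the conditions and labels as parallel lists. A minimal sketch on hypothetical rows, covering only a few of the cases handled above:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a ball, a foul, a swinging strike, and a single put in play
toy = pd.DataFrame({
    'details.isBall': [True, False, False, False],
    'details.isStrike': [False, True, True, False],
    'details.code': ['B', 'F', 'S', 'X'],
    'result.eventType': ['single', 'single', 'single', 'single'],
    'Last_pitch': [0, 0, 0, 1],
})

not_last = toy['Last_pitch'] == 0
is_last = toy['Last_pitch'] == 1

# Conditions are checked in order; the first one that is True decides the label
conditions = [
    toy['details.isBall'] & not_last,
    toy['details.code'].isin(['F', 'L', 'R', 'O', 'T']) & not_last,
    toy['details.isStrike'] & not_last,
    (toy['result.eventType'] == 'single') & is_last,
]
labels = ['Ball', 'Foul', 'Strike', 'Single']

toy['Pitch_outcome'] = np.select(conditions, labels, default='N/A')
print(toy['Pitch_outcome'].tolist())
```

Order matters here just as in the nested version: the foul row also has `details.isStrike` set, and it is labelled `Foul` only because that condition comes first.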

In [23]:
# Evaluate Strike Counts for fouls to confirm Strikes again
df[df['Pitch_outcome']=='Foul']['Strikes'].value_counts()

Strikes
2.0    54189
0.0    29713
1.0    12417
Name: count, dtype: int64

## Remember to adjust GRACE PERIOD:
#### Note:
Double-A and Triple-A started the 2022 season with a "Grace Period" for the Pitch Timer. If the data is from either league in 2022, filter to include only games after this period.

In [24]:
# # 2022 GRACE PERIOD AAA AND AA
# df['game_date'] = pd.to_datetime(df['game_date'])
# df = df[df['game_date']>='2022-04-15']
# sorted(df['game_date'].values)[0]
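
Since the grace-period filter applies only to 2022 Double-A and Triple-A seasons, one way to avoid commenting the cell in and out is to make it conditional. A minimal sketch, assuming the `2022-04-15` cutoff from the cell above; the function name `drop_grace_period` and its `year`/`league` arguments are hypothetical:

```python
import pandas as pd

def drop_grace_period(df, year, league, cutoff='2022-04-15'):
    """Drop games before the cutoff for 2022 AA/AAA seasons; otherwise return df unchanged."""
    if year == 2022 and league in ('AA', 'AAA'):
        dates = pd.to_datetime(df['game_date'])
        return df[dates >= pd.Timestamp(cutoff)]
    return df

# Hypothetical example: one game inside the grace period, one after it
toy = pd.DataFrame({'game_pk': [1, 2], 'game_date': ['2022-04-10', '2022-04-20']})
kept = drop_grace_period(toy, year=2022, league='AAA')
untouched = drop_grace_period(toy, year=2014, league='AA')
print(kept['game_pk'].tolist(), len(untouched))
```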

In [25]:
# Filter out pitches whose outcome could not be determined (Pitch_outcome == 'N/A')
df = df[df['Pitch_outcome']!='N/A']

In [26]:
# Matchup_handed feature engineering
display(df['matchup.batSide.code'].value_counts())
display(df['matchup.pitchHand.code'].value_counts())

df['Matchup_handed'] = df['matchup.batSide.code']+df['matchup.pitchHand.code']

df['Matchup_handed'].value_counts()

matchup.batSide.code
R    351499
L    210581
Name: count, dtype: int64

matchup.pitchHand.code
R    405725
L    156355
Name: count, dtype: int64

Matchup_handed
RR    237817
LR    167908
RL    113682
LL     42673
Name: count, dtype: int64

In [27]:
# Rename columns for export
rename = {'matchup.batter.id':'Batter_id', 'matchup.batter.fullName':'Batter_name', 
          'matchup.pitcher.id':'Pitcher_id', 'matchup.pitcher.fullName':'Pitcher_name'}

df = df.rename(columns=rename)

In [28]:
# Distinguish columns to be in final dataframe to export
f_cols = ['game_pk', 'Strikes', 'Balls', 'Outs', 'Men_OB', 'Batter_home', 'Inning', 'Pitch_outcome', 'Matchup_handed', 
          'Batter_id', 'Pitcher_id']

In [29]:
# Create final data frame to export
df_f = df[f_cols].copy()

## Remember to CHANGE YEAR AND LEAGUE:
#### Relative to the season being cleaned

In [30]:
# Set Year and League values
df_f['Year'] = 2014
df_f['League'] = 'AA'

## Remember to CHANGE TREATMENTS:
#### Relative to the season being cleaned

In [31]:
# Set rule implementation status

# Pitch timer
# Control, v1, v2
df_f['Pitch_timer'] = 'Control'

# Bigger Bases
# 0, 1
df_f['Bigger_bases'] = 0

# Defensive shift limits 
# Control, v1, v2
df_f['Defensive_shift_limits'] = 'Control'

### ABS Adjustment:

In [32]:
# Read in game_pks of games where ABS was used
pks = pd.read_csv('problem_pks.csv', low_memory=False)

# Assign ABS indicator: 1 for every 2023 AAA game or any game listed in problem_pks.csv, else 0
df_f['ABS'] = np.where((df_f['Year']==2023)&(df_f['League']=='AAA'), 1, np.where(df_f['game_pk'].isin(pks['value'].values), 1, 0))

# Evaluate values
df_f['ABS'].value_counts()

ABS
0    562080
Name: count, dtype: int64
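
The ABS rule above flags a pitch when either the whole season used ABS (2023 AAA) or the game appears in `problem_pks.csv`. A minimal sketch of the same rule as one boolean expression, using hypothetical in-memory stand-ins (`games`, and a `pks` frame with the `value` column used above) rather than the real files:

```python
import pandas as pd

# Hypothetical stand-ins for df_f and problem_pks.csv
games = pd.DataFrame({
    'game_pk': [101, 102, 103],
    'Year': [2022, 2022, 2022],
    'League': ['AAA', 'AAA', 'AAA'],
})
pks = pd.DataFrame({'value': [102]})

# A row is flagged if its season is 2023 AAA or its game is on the list
all_abs_season = (games['Year'] == 2023) & (games['League'] == 'AAA')
listed_game = games['game_pk'].isin(pks['value'])
games['ABS'] = (all_abs_season | listed_game).astype(int)
print(games)
```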

In [33]:
# Confirm final columns of df
df_f.columns

Index(['game_pk', 'Strikes', 'Balls', 'Outs', 'Men_OB', 'Batter_home',
       'Inning', 'Pitch_outcome', 'Matchup_handed', 'Batter_id', 'Pitcher_id',
       'Year', 'League', 'Pitch_timer', 'Bigger_bases',
       'Defensive_shift_limits', 'ABS'],
      dtype='object')

In [34]:
# Create pitcher and batter keys for season
pitcher_key = df[['Pitcher_name', 'Pitcher_id']].copy().drop_duplicates(ignore_index=True)
batter_key = df[['Batter_name', 'Batter_id']].copy().drop_duplicates(ignore_index=True)

In [35]:
# Print out length of final df
len(df_f)

562080

## Write and export data to csv:
### Remember to CHANGE FILE LABELS.

In [36]:
df_f.to_csv('da14_PA.csv')
pitcher_key.to_csv('pitcher_key_PA_da14.csv')
batter_key.to_csv('batter_key_PA_da14.csv')
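
Because the input file and export names all have to be changed by hand for each season, one option is to derive them from a single tag. This is a sketch only: the `season_files` helper is hypothetical, and the naming pattern is inferred from the 2014 Double-A file names used in this notebook (`da14_...`); names for other seasons may differ.

```python
def season_files(tag):
    """Build the input/export file names used in this notebook from a season tag such as 'da14'."""
    return {
        'input': f'{tag}_wOB_F.csv',
        'season': f'{tag}_PA.csv',
        'pitcher_key': f'pitcher_key_PA_{tag}.csv',
        'batter_key': f'batter_key_PA_{tag}.csv',
    }

files = season_files('da14')
print(files['season'])
```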