#### Bridget Sands
#### Harvard University
#### Applied Mathematics Senior Thesis
#### April 1st, 2024

# "Clean_OB.ipynb"

### Note: This is the 1st notebook used for cleaning, following the original sourcing of the data from baseballR.

#### Notebook Purpose and Summary:
This notebook cleans data originally sourced from the `mlb_pbp()` function of the baseballR R package, which returns pitch-by-pitch data for the season and league specified by user-entered parameters. The notebook takes in one season of data and creates an additional column that reflects the state of the bases at the beginning of each plate appearance, rather than at its end, which is how the function returns it.

#### Input:
`csv` season of data for specific league/year, sourced and exported by `Data_Acquisition.Rmd`.

#### Export:
`csv` containing the season of data for the specified league/year, ready to be imported and further cleaned by either the `Data_Cleaning_PA.ipynb` or `Data_Cleaning_SB.ipynb` files.

#### Glossary:
- PA: Plate appearance

In [1]:
# import necessary libraries
import numpy as np
import pandas as pd
import math

In [2]:
# Specify columns to read in for the dataframe.
# The columns needed were identified during previous cleaning;
# reading only these avoids loading unnecessary columns and saves computation time and resources.
cols = ['game_pk', 'startTime', 'game_date', 'type', 'playId', 'pitchNumber', 'details.description', 'details.event', 'details.code', 'details.isInPlay', 'details.isStrike', 'details.isBall',
        'count.balls.start', 'count.strikes.start', 'count.outs.start', 'result.eventType', 'result.description', 'result.rbi', 'result.awayScore', 
        'result.homeScore', 'about.atBatIndex', 'about.halfInning', 'about.inning', 'about.isComplete', 'about.isScoringPlay', 'matchup.batter.id', 
        'matchup.batter.fullName', 'matchup.batSide.code', 'matchup.pitcher.id', 'matchup.pitcher.fullName', 'matchup.pitchHand.code', 
        'matchup.splits.menOnBase', 'details.isOut', 'about.isTopInning'
]

## Remember to CHANGE FILE:
#### Relative to season needed to clean

In [3]:
# Read in data as pandas dataframe
df = pd.read_csv('da14_r.csv', low_memory=False, usecols=cols)

In [4]:
# Filter dataframe to only consider pitches from complete PAs and first nine innings of games:
df = df[(df['about.isComplete']==True)&(df['about.inning']<=9)]
print(len(df))

# Drop duplicate rows in the dataframe:
df = df.drop_duplicates()
print(len(df))

647495
618765


In [5]:
# Sort values first by games in season
# Sort each game of season by inning 
# Sort each inning by half inning (top or bottom)
# Sort each half inning by atBatIndex --> to order PAs properly
df = df.sort_values(by=['game_pk', 'about.inning', 'about.halfInning', 'about.atBatIndex'], ignore_index=True)

In [6]:
# Create unique identification for each plate appearance
df['PA_id'] = df['game_pk'].astype('str') + '-' + df['about.atBatIndex'].astype('str') + '-' + df['about.inning'].astype('str') + '-' + df['about.isTopInning'].astype('int').astype('str')

### Isolate Issues:
Previous data exploration indicated that some half innings containing a PA that resulted in a hit by pitch had cascading issues throughout the half inning, so the base status was reflected incorrectly. Because the number of such half innings is minuscule relative to the total, they are identified here so they can be skipped later.

In [7]:
# Isolate half innings with problems
c = 0 
problems = []

# Candidate rows: pitch described as a hit by pitch, but the PA result is not recorded as one
for index, row in df[(df['details.description']=='Hit By Pitch')&(df['result.eventType']!='hit_by_pitch')].iterrows():
    game = row['game_pk']
    half = row['about.halfInning']
    inn = row['about.inning']
    # Collect every PA result recorded in the same half inning
    check = df.loc[(df['game_pk']==game)&(df['about.halfInning']==half)&(df['about.inning']==inn), 'result.eventType'].values
    # Flag the half inning if no PA in it is recorded as a hit by pitch
    if 'hit_by_pitch' not in check:
        problems.append((game, half, inn))
        c+=1
c

297
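
The same check can also be expressed without `iterrows()`. The following is a minimal sketch on toy data, not the method used in this notebook; the toy values and the names `toy`, `flags`, `per_half`, and `problem_halves` are hypothetical. Unlike the loop above, it yields each problematic half inning only once, so its count could differ from `c` if a half inning contains more than one flagged pitch.

```python
import pandas as pd

# Toy pitch-level data (hypothetical values) using the notebook's column names
toy = pd.DataFrame({
    'game_pk':             [1, 1, 2, 2, 3, 3],
    'about.halfInning':    ['top', 'top', 'bottom', 'bottom', 'top', 'top'],
    'about.inning':        [1, 1, 3, 3, 2, 2],
    'details.description': ['Ball', 'Hit By Pitch', 'Hit By Pitch', 'Ball',
                            'Hit By Pitch', 'Ball'],
    'result.eventType':    ['hit_by_pitch', 'hit_by_pitch', 'walk', 'walk',
                            'single', 'hit_by_pitch'],
})

keys = ['game_pk', 'about.halfInning', 'about.inning']

# Row-level flags: a hit-by-pitch pitch whose PA result disagrees,
# and any PA result that is recorded as a hit by pitch
flags = toy.assign(
    mismatch=(toy['details.description'] == 'Hit By Pitch')
             & (toy['result.eventType'] != 'hit_by_pitch'),
    hbp_event=toy['result.eventType'] == 'hit_by_pitch',
)

# Collapse to one row per half inning: does it contain any flagged row?
per_half = flags.groupby(keys)[['mismatch', 'hbp_event']].any()

# Problematic: at least one mismatch and no hit_by_pitch result in the half inning
problem_halves = set(per_half[per_half['mismatch'] & ~per_half['hbp_event']].index)
print(problem_halves)
```

Storing the result as a set of `(game, half, inn)` tuples also makes the later membership test a constant-time lookup rather than a scan of a list.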

### Create new column:
Because the method uses the value recorded for the previous PA (ordered by `about.atBatIndex`) to correct the current PA, the old, incorrect values must be preserved rather than overwritten as the iteration proceeds, so they can continue to be used to adjust later PAs. Therefore, the revised values live in a new column.

In [8]:
# Create a new column with empty values
df['Men_OB'] = ''

# Isolate values to identify each unique half inning
unique_subset = df[['game_pk', 'about.halfInning', 'about.inning']].drop_duplicates(subset=['game_pk', 'about.halfInning', 'about.inning'])
len(unique_subset)

36190

In [9]:
# Initialize count of PAs
c = 0

# Iterate through values of each unique half inning
for game, half, inn in unique_subset.values:

    # If the half inning is problematic, skip it
    if (game, half, inn) in problems:
        c += 1
        continue

    # Print for every 1,000 half innings handled for tracking purposes
    if c % 1000 == 1:
        print(c)

    # Create temporary dataframe for the half inning
    temp = df.loc[(df['game_pk']==game)&(df['about.halfInning']==half)&(df['about.inning']==inn), ['matchup.splits.menOnBase', 'about.atBatIndex']].copy()
    
    # Find the number of entries corresponding to the first PA
    minAB = len(temp[temp['about.atBatIndex'] == min(temp['about.atBatIndex'])])

    # Create list to represent the state of bases for each pitch of half inning 
    # Initialize with "Empty" --> repeated for previously found length of first PA
    temp2 = ['Empty'] * minAB

    # Get unique PAs of half inning
    ABs = temp['about.atBatIndex'].unique()

    # Iterate through PAs minus one 
    # Note that the first PA has already been taken care of, so this is for rest
    # Need to shift state of bases of each PA to be the results of the PA before
    for i in range(len(ABs)-1):

        # Isolate resulting base state of previous PA
        curr = temp.loc[temp['about.atBatIndex']==ABs[i], 'matchup.splits.menOnBase'].values[0]

        # Find number of pitches in PA of focus
        lenAB = len(temp[temp['about.atBatIndex'] == ABs[i+1]])

        # Add resulting state of previous PA to list times number of pitches in the PA of focus
        temp2 = temp2 + [curr]*lenAB
    
    c += 1
    # Assign properly shifted values to half inning 
    df.loc[(df['game_pk']==game)&(df['about.halfInning']==half)&(df['about.inning']==inn), 'Men_OB'] = temp2

1
1001
2001
3001
4001
5001
6001
7001
8001
9001
10001
11001
12001
13001
14001
15001
16001
17001
18001
19001
20001
21001
22001
23001
24001
25001
26001
27001
28001
29001
30001
31001
32001
33001
34001
35001
36001
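
For comparison, the shift performed by the loop above can be sketched with a grouped `shift`. This is an illustrative alternative on toy data rather than the code that produced the results in this notebook. It assumes `matchup.splits.menOnBase` is constant within a PA and has no missing values, and it does not skip the problematic half innings. The names `toy`, `per_pa`, and `shifted` are hypothetical.

```python
import pandas as pd

# Toy data: one top half inning with three PAs, one bottom half inning with one PA
toy = pd.DataFrame({
    'game_pk':          [1] * 7,
    'about.inning':     [1] * 7,
    'about.halfInning': ['top'] * 5 + ['bottom'] * 2,
    'about.atBatIndex': [0, 0, 1, 1, 2, 3, 3],
    # As returned by the source: base state at the END of each PA
    'matchup.splits.menOnBase': ['Men_On', 'Men_On', 'RISP', 'RISP',
                                 'Empty', 'Men_On', 'Men_On'],
})

half_keys = ['game_pk', 'about.inning', 'about.halfInning']
pa_keys = half_keys + ['about.atBatIndex']

# One row per PA, ordered by atBatIndex within each half inning
per_pa = (toy.sort_values(pa_keys)
             .drop_duplicates(subset=pa_keys)[pa_keys + ['matchup.splits.menOnBase']]
             .copy())

# Start-of-PA state = end-of-PA state of the previous PA in the same half inning;
# the first PA of every half inning starts with the bases empty
per_pa['Men_OB'] = (per_pa.groupby(half_keys)['matchup.splits.menOnBase']
                          .shift(1)
                          .fillna('Empty'))

# Broadcast the per-PA value back to every pitch of that PA
shifted = toy.merge(per_pa[pa_keys + ['Men_OB']], on=pa_keys, how='left')
print(shifted[['about.halfInning', 'about.atBatIndex', 'Men_OB']])
```

Because the shifted values are written to `Men_OB` and the original column is left untouched, the old values remain available, which mirrors the reason given above for creating a new column.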


In [10]:
# Confirmation that method executed with proper logic -->
# Steals should not happen on empty bases
steals = ['Stolen Base 2B', 'Pickoff Error 1B', 'Caught Stealing 2B', 'Pickoff Caught Stealing 2B', 'Pickoff 1B']
df[df['details.event'].isin(steals)]['Men_OB'].value_counts()

Men_OB
Men_On    2925
RISP       373
            45
Name: count, dtype: int64

In [11]:
# Further confirmation that method executed with proper logic -->
# Pickoffs should not happen on empty bases
df[(df['details.description']=='Pickoff Attempt 1B')]['Men_OB'].value_counts()

Men_OB
Men_On    14659
RISP       1373
            161
Loaded        8
Name: count, dtype: int64
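
The two checks above can be combined into a single count of runner-dependent rows that are still labeled `Empty`. The sketch below uses toy data; the toy rows and the names `toy`, `runner_rows`, `n_empty`, and `n_blank` are hypothetical.

```python
import pandas as pd

# Events that require at least one runner on base (same list as in the cell above)
steals = ['Stolen Base 2B', 'Pickoff Error 1B', 'Caught Stealing 2B',
          'Pickoff Caught Stealing 2B', 'Pickoff 1B']

# Toy rows (hypothetical values) with the notebook's column names
toy = pd.DataFrame({
    'details.event':       ['Stolen Base 2B', 'Pickoff 1B', None, 'Caught Stealing 2B'],
    'details.description': ['Stolen Base 2B', 'Pickoff 1B', 'Pickoff Attempt 1B',
                            'Caught Stealing 2B'],
    'Men_OB':              ['Men_On', 'RISP', 'Men_On', ''],
})

# Rows that can only occur with a runner on base
runner_rows = (toy['details.event'].isin(steals)
               | (toy['details.description'] == 'Pickoff Attempt 1B'))

# Rows contradicting the shifted base state, and rows left unassigned
n_empty = int((toy.loc[runner_rows, 'Men_OB'] == 'Empty').sum())
n_blank = int((toy.loc[runner_rows, 'Men_OB'] == '').sum())
print(n_empty, n_blank)
```

A nonzero `n_empty` would point to PAs where the shifted base state disagrees with the recorded event, while `n_blank` counts rows in half innings that were skipped.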

In [12]:
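# Manually adjust for issues: a steal or pickoff implies at least one runner on base,
# so relabel any such rows still marked 'Empty' as 'Men_On'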
df.loc[(df['details.event'].isin(steals))&(df['Men_OB']=='Empty'), 'Men_OB'] = 'Men_On'
df.loc[(df['details.description']=='Pickoff Attempt 1B')&(df['Men_OB']=='Empty'), 'Men_OB'] = 'Men_On'

In [13]:
# Further confirmation that method executed with proper logic -->
# Check length of dataframe with unassigned Men_OB values
print(len(df))
print(len(df[df['Men_OB']=='']))
print(len(df[df['Men_OB']!='']))

618765
5671
613094


In [14]:
# Further confirmation that method executed with proper logic -->
# Check Men_OB value counts
df['Men_OB'].value_counts()

Men_OB
Empty     322244
RISP      138466
Men_On    136017
Loaded     16367
            5671
Name: count, dtype: int64
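
If the loop behaved as intended, the unassigned rows should be exactly the pitches belonging to the skipped half innings. A minimal sketch of that reconciliation on toy data follows; the toy values and the names `toy`, `row_keys`, `in_problem`, and `blank` are hypothetical, and `problems` stands in for the list built earlier.

```python
import pandas as pd

# Toy data: game 2, bottom of the 3rd was skipped, so its rows are still blank
toy = pd.DataFrame({
    'game_pk':          [1, 1, 2, 2, 2],
    'about.halfInning': ['top', 'top', 'bottom', 'bottom', 'bottom'],
    'about.inning':     [1, 1, 3, 3, 3],
    'Men_OB':           ['Empty', 'Men_On', '', '', ''],
})
problems = [(2, 'bottom', 3)]  # stand-in for the list of skipped half innings

# Tag each row with its (game, half, inning) key and test membership in the skipped set
problem_set = set(problems)
row_keys = list(zip(toy['game_pk'], toy['about.halfInning'], toy['about.inning']))
in_problem = pd.Series([key in problem_set for key in row_keys], index=toy.index)

# Rows whose Men_OB was never assigned
blank = toy['Men_OB'] == ''

# Every blank row should sit in a skipped half inning, and vice versa
print(bool((blank == in_problem).all()), int(blank.sum()))
```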

## Write and export data to csv:
### Remember to CHANGE FILE LABEL.

In [15]:
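# Drop rows with unassigned Men_OB values (pitches from the skipped problematic half innings)
df = df[df['Men_OB']!='']
# Export the cleaned season to csv
df.to_csv('da14_wOB_F.csv')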
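
Since both the input and export file names have to be changed by hand for each season, one option is to derive them from a single label. This is a minimal sketch that assumes the `da14_r.csv` / `da14_wOB_F.csv` naming pattern used in this notebook carries over to other seasons; `season_tag`, `in_path`, and `out_path` are hypothetical names that do not appear in the original notebook.

```python
# Single place to change when cleaning a different league/season (assumed naming pattern)
season_tag = 'da14'

# Input exported by Data_Acquisition.Rmd and output written by this notebook
in_path = f'{season_tag}_r.csv'
out_path = f'{season_tag}_wOB_F.csv'

print(in_path, out_path)
```

The two paths could then be passed to `pd.read_csv(...)` and `df.to_csv(...)` in place of the hard-coded strings.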