## 2023 MLB Umpire Data

Baseball Savant does not currently include umpire information in its pitch-level data. Consequently, we scrape the home plate umpire for each game of the 2023 MLB regular season from box scores on Baseball Reference ([www.baseball-reference.com](https://www.baseball-reference.com)).

## Importing Necessary Packages

We begin by importing the necessary packages. Some users may need to install `bs4` first.

In [2]:
import bs4
from bs4 import BeautifulSoup
import requests
from time import sleep
import pandas as pd

## Setting Up the Scraping

Baseball Reference and Baseball Savant sometimes use different abbreviations for the same team; in fact, different parts of Baseball Reference use different abbreviations. We create a dictionary to go from Baseball Reference abbreviations used in box score URLs to Baseball Savant abbreviations.

In [8]:
team_dict = {'ANA':'LAA', 'ARI':'AZ', 'ATL':'ATL', 'BAL':'BAL', 'BOS':'BOS', 'CHN':'CHC', 'CHA':'CWS',
             'CIN':'CIN', 'CLE':'CLE', 'COL':'COL', 'DET':'DET', 'HOU':'HOU', 'KCA':'KC', 'LAN':'LAD',
             'MIA':'MIA', 'MIL':'MIL', 'MIN':'MIN', 'NYN':'NYM', 'NYA':'NYY', 'OAK':'OAK', 'PHI':'PHI', 'PIT':'PIT',
             'SDN':'SD', 'SFN':'SF', 'SEA':'SEA', 'SLN':'STL', 'TBA':'TB', 'TEX':'TEX', 'TOR':'TOR', 'WAS':'WSH'}

The following functions scrape the home plate umpire from a Baseball Reference box score page. The first function takes in a URL and is helpful for testing. The second function takes in the response from a request that has already been made and is used in the function `get_team_info` below.

In [9]:
def get_umpire_info_from_url(url):
    # Fetch the box score page and parse it.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    string_list = str(soup.find_all('div', id='all_3758413896')[0]).split()
    # Scan for the 'HP' token; stop early enough that index+3 stays in range.
    for index in range(len(string_list) - 3):
        if string_list[index] == 'HP':
            umpire = string_list[index+2]+' '+string_list[index+3]
            # Drop the final character of the second token.
            return umpire[:-1]
    # No 'HP' token was found.
    return None

def get_umpire_info_from_request(r):
    # Parse a response that has already been fetched.
    soup = BeautifulSoup(r.content, 'html.parser')
    find_all_list = soup.find_all('div', id='all_3758413896')
    # Return None when the page has no such div.
    if len(find_all_list) == 0:
        return None
    string_list = str(find_all_list[0]).split()
    for index in range(len(string_list) - 3):
        if string_list[index] == 'HP':
            umpire = string_list[index+2]+' '+string_list[index+3]
            return umpire[:-1]
    return None
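
To see what the token search in these functions does without making a request, here is a minimal sketch run on a made-up line of text. The helper name `extract_home_plate_umpire` and the sample string are assumptions for illustration only; the markup of a real box score page may differ.

```python
def extract_home_plate_umpire(text):
    """Return the two tokens found two positions after 'HP', minus the last character."""
    tokens = text.split()
    # Stop early enough that index + 3 is always a valid position.
    for index in range(len(tokens) - 3):
        if tokens[index] == 'HP':
            name = tokens[index + 2] + ' ' + tokens[index + 3]
            return name[:-1]
    # No 'HP' token was found.
    return None

# Hypothetical text, not copied from a real box score.
sample = "Umpires: HP - Jane Doe, 1B - John Roe, 2B - Alex Poe."
print(extract_home_plate_umpire(sample))
```

Because only two tokens are kept and the final character is always dropped, a name with three or more words would be cut short by this approach, so scraped names are worth spot-checking.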

The following function takes in a Baseball Reference team abbreviation as used in box score URLs (such as `BOS` for the Boston Red Sox or `KCA` for the Kansas City Royals) and iterates through possible box score URLs from March 30 through October 1. A request returns status code `200` when that team had a single home game on the given day; doubleheaders use a different URL pattern and are handled in a later section. When the status code is `200`, we record the home team abbreviation, the date of the game, and the home plate umpire. The function ends by creating a data frame and saving it as a CSV.

In [10]:
def get_team_info(team):
    home_teams = []
    game_dates = []
    umpires = []
    base_url = "https://www.baseball-reference.com/boxes/"
    for day in range(30,32):
        temp_url = base_url+team+"/"+team+"202303"+str(day)+'0.shtml'
        req = requests.get(temp_url)
        if req.status_code == 200:
            home_teams.append(team_dict[team])
            game_dates.append('2023-03-' + str(day) )
            umpires.append(get_umpire_info_from_request(req))
        sleep(3)
    for month in range(4,10):
        for day in range(1,10):
            temp_url = base_url+team+"/"+team+"20230"+str(month)+'0'+str(day)+'0.shtml'
            req = requests.get(temp_url)
            if req.status_code == 200:
                home_teams.append(team_dict[team])
                game_dates.append('2023-0' + str(month) + '-0' + str(day) )
                umpires.append(get_umpire_info_from_request(req))
            sleep(3)
        for day in range(10,32):
            temp_url = base_url+team+"/"+team+"20230"+str(month)+str(day)+'0.shtml'
            req = requests.get(temp_url)
            if req.status_code == 200:
                home_teams.append(team_dict[team])
                game_dates.append('2023-0' + str(month) + '-' + str(day) )
                umpires.append(get_umpire_info_from_request(req))
            sleep(3)
    temp_url = base_url+team+"/"+team+"202310010.shtml"
    req = requests.get(temp_url)
    if req.status_code == 200:
        home_teams.append(team_dict[team])
        game_dates.append('2023-10-01')
        umpires.append(get_umpire_info_from_request(req))
    sleep(3)
    umpire_df = pd.DataFrame(data={'home_team':home_teams, 'game_date':game_dates, 'umpire':umpires})
    umpire_df.to_csv(team+'_umpire.csv')
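
The candidate URLs can also be generated from real calendar dates, which removes the separate loops for one-digit and two-digit days. The following is a minimal sketch of that idea and makes no requests; the helper names `box_score_url` and `season_dates` are hypothetical, and the date range is taken from the loops above.

```python
from datetime import date, timedelta

BASE_URL = "https://www.baseball-reference.com/boxes/"

def box_score_url(team, game_date, game_number=0):
    """Build a box score URL; the final digit is 0 for a single game."""
    return f"{BASE_URL}{team}/{team}{game_date:%Y%m%d}{game_number}.shtml"

def season_dates(start=date(2023, 3, 30), end=date(2023, 10, 1)):
    """Yield every calendar date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

# One candidate URL per calendar day for a single team.
urls = [box_score_url('BOS', d) for d in season_dates()]
print(urls[0])
print(len(urls))
```

Iterating over actual dates also skips days that do not exist, such as April 31, which the nested loops above still request.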

## Scraping the Data

The following code runs `get_team_info` for all 30 MLB teams. Since we have already scraped this information, we have commented these cells out.

In [None]:
#get_team_info('ANA')
#print('Done')

In [None]:
#get_team_info('ARI')
#print('Done')

In [None]:
#get_team_info('ATL')
#print('Done')

In [None]:
#get_team_info('BAL')
#print('Done')

In [None]:
#get_team_info('BOS')
#print('Done')

In [None]:
#get_team_info('CHN')
#print('Done')

In [None]:
#get_team_info('CHA')
#print('Done')

In [None]:
#get_team_info('CIN')
#print('Done')

In [None]:
#get_team_info('CLE')
#print('Done')

In [None]:
#get_team_info('COL')
#print('Done')

In [None]:
#get_team_info('DET')
#print('Done')

In [None]:
#get_team_info('HOU')
#print('Done')

In [None]:
#get_team_info('KCA')
#print('Done')

In [None]:
#get_team_info('LAN')
#print('Done')

In [None]:
#get_team_info('MIA')
#print('Done')

In [None]:
#get_team_info('MIL')
#print('Done')

In [None]:
#get_team_info('MIN')
#print('Done')

In [None]:
#get_team_info('NYN')
#print('Done')

In [None]:
#get_team_info('NYA')
#print('Done')

In [None]:
#get_team_info('OAK')
#print('Done')

In [None]:
#get_team_info('PHI')
#print('Done')

In [None]:
#get_team_info('PIT')
#print('Done')

In [None]:
#get_team_info('SDN')
#print('Done')

In [None]:
#get_team_info('SFN')
#print('Done')

In [None]:
#get_team_info('SEA')
#print('Done')

In [None]:
#get_team_info('SLN')
#print('Done')

In [None]:
#get_team_info('TBA')
#print('Done')

In [None]:
#get_team_info('TEX')
#print('Done')

In [None]:
#get_team_info('TOR')
#print('Done')

In [None]:
#get_team_info('WAS')
#print('Done')

## Creating and Saving the Data Frame

We now read in our scraped data as data frames, concatenate these data frames, and export the combined data frame as `umpires.csv`.

In [16]:
ANA = pd.read_csv("ANA_umpire.csv")
ARI = pd.read_csv("ARI_umpire.csv")
ATL = pd.read_csv("ATL_umpire.csv")
BAL = pd.read_csv("BAL_umpire.csv")
BOS = pd.read_csv("BOS_umpire.csv")
CHA = pd.read_csv("CHA_umpire.csv")
CHN = pd.read_csv("CHN_umpire.csv")
CIN = pd.read_csv("CIN_umpire.csv")
CLE = pd.read_csv("CLE_umpire.csv")
COL = pd.read_csv("COL_umpire.csv")
DET = pd.read_csv("DET_umpire.csv")
HOU = pd.read_csv("HOU_umpire.csv")
KCA = pd.read_csv("KCA_umpire.csv")
LAN = pd.read_csv("LAN_umpire.csv")
MIA = pd.read_csv("MIA_umpire.csv")
MIL = pd.read_csv("MIL_umpire.csv")
MIN = pd.read_csv("MIN_umpire.csv")
NYA = pd.read_csv("NYA_umpire.csv")
NYN = pd.read_csv("NYN_umpire.csv")
OAK = pd.read_csv("OAK_umpire.csv")
PHI = pd.read_csv("PHI_umpire.csv")
PIT = pd.read_csv("PIT_umpire.csv")
SDN = pd.read_csv("SDN_umpire.csv")
SEA = pd.read_csv("SEA_umpire.csv")
SFN = pd.read_csv("SFN_umpire.csv")
SLN = pd.read_csv("SLN_umpire.csv")
TBA = pd.read_csv("TBA_umpire.csv")
TEX = pd.read_csv("TEX_umpire.csv")
TOR = pd.read_csv("TOR_umpire.csv")
WAS = pd.read_csv("WAS_umpire.csv")

In [17]:
data_frames = [ANA, ARI, ATL, BAL, BOS, CHN, CHA, CIN, CLE, COL, DET, HOU, KCA, LAN, MIA]
data_frames.extend([MIL, MIN, NYN, NYA, OAK, PHI, PIT, SDN, SFN, SEA, SLN, TBA, TEX, TOR, WAS])

umpire_df = pd.concat(data_frames, ignore_index=True)

umpire_df = umpire_df[['home_team', 'game_date', 'umpire']]

umpire_df.to_csv('umpires.csv', index=False)
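
The thirty `read_csv` calls can also be written as a loop over the team abbreviations. This is a minimal sketch, assuming the files follow the `<team>_umpire.csv` naming used by `get_team_info`; the function name `combine_team_files` and its `directory` parameter are hypothetical.

```python
from pathlib import Path

import pandas as pd

def combine_team_files(team_codes, directory='.'):
    """Read <team>_umpire.csv for each code and stack the rows into one data frame."""
    frames = [pd.read_csv(Path(directory) / f"{team}_umpire.csv") for team in team_codes]
    combined = pd.concat(frames, ignore_index=True)
    # Keep only the three data columns, dropping the saved index column.
    return combined[['home_team', 'game_date', 'umpire']]
```

With the dictionary defined earlier, `combine_team_files(team_dict.keys())` would read the files in the same team order as the list built above.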

## Doubleheaders

First, we note that this data has already been scraped and processed, so we comment out the code below.

Everything above was for normal games. However, teams occasionally play doubleheaders (two games on the same day) and Baseball Reference indexes these box score pages slightly differently. In a different notebook, we determine all of the games for which we have pitch data but do not have umpires from the data we scraped above.

In [3]:
missing_umpires = pd.read_csv('missing_umpires.csv')

In [4]:
pd.set_option('display.max_rows', None)

missing_umpires

Unnamed: 0,game_date,home_team,game_pk
0,2023-06-03,BOS,717885
1,2023-06-03,BOS,717918
2,2023-06-18,BOS,717715
3,2023-06-18,BOS,717730
4,2023-09-12,BOS,716612
5,2023-09-12,BOS,716628
6,2023-09-14,BOS,716593
7,2023-09-14,BOS,716597
8,2023-09-01,CIN,716770
9,2023-09-01,CIN,718700


Upon manual inspection, we see that the only game that is not part of a doubleheader was played on `2023-08-20` with `WSH` as the home team. We will handle that one exception by hand.

Also, note that we must revert from Baseball Savant abbreviations to Baseball Reference abbreviations, which we do now.

In [5]:
for index in range(18,24):
    missing_umpires.at[index, 'home_team'] = 'CHA'

for index in range(34,36):
    missing_umpires.at[index, 'home_team'] = 'KCA'

for index in range(36,40):
    missing_umpires.at[index, 'home_team'] = 'ANA'

for index in range(40,42):
    missing_umpires.at[index, 'home_team'] = 'LAN'
    
for index in range(42,52):
    missing_umpires.at[index, 'home_team'] = 'NYN'
    
for index in range(52,54):
    missing_umpires.at[index, 'home_team'] = 'NYA'

for index in range(62,64):
    missing_umpires.at[index, 'home_team'] = 'SDN'
    
for index in range(64,66):
    missing_umpires.at[index, 'home_team'] = 'SLN'

for index in range(66,71):
    missing_umpires.at[index, 'home_team'] = 'WAS'
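
The row ranges above depend on the exact row order of `missing_umpires.csv`. An alternative that does not depend on row positions is to invert the abbreviation dictionary and look up each team. The sketch below uses only a few entries copied from `team_dict` and a plain list in place of the data frame column; the names `team_dict_subset` and `savant_to_reference` are hypothetical.

```python
# A few entries copied from team_dict (Baseball Reference -> Baseball Savant).
team_dict_subset = {'ANA': 'LAA', 'CHA': 'CWS', 'NYA': 'NYY', 'NYN': 'NYM', 'BOS': 'BOS'}

# Invert the mapping to go from Baseball Savant back to Baseball Reference.
savant_to_reference = {savant: reference for reference, savant in team_dict_subset.items()}

# Stand-in for the home_team column.
home_teams = ['BOS', 'CWS', 'NYY', 'NYM', 'LAA']
converted = [savant_to_reference[team] for team in home_teams]
print(converted)  # ['BOS', 'CHA', 'NYA', 'NYN', 'ANA']
```

With the full dictionary inverted in the same way, the column could be converted in one step with `missing_umpires['home_team'].map(savant_to_reference)`.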

Next, we create lists of dates and home teams that we will iterate through while getting data for doubleheader games.

In [6]:
missing_umpires_no_game_pk = missing_umpires[['game_date', 'home_team']]

missing_umpires_no_game_pk = missing_umpires_no_game_pk.drop_duplicates()

missing_game_dates = missing_umpires_no_game_pk.game_date.to_list()
missing_home_team = missing_umpires_no_game_pk.home_team.to_list()

We now scrape for doubleheader data and create a list of home plate umpires that we will join to the `missing_umpires` data frame.

In [11]:
doubleheader_umpires = []
base_url = "https://www.baseball-reference.com/boxes/"

for index in range(len(missing_game_dates)):
    team = missing_home_team[index]
    date = missing_game_dates[index].replace('-', '')
    if (team != 'WAS') or (date != '20230820'):
        temp_url_1 = base_url+team+"/"+team+date+'1.shtml'
        temp_url_2 = base_url+team+"/"+team+date+'2.shtml'
        req1 = requests.get(temp_url_1)
        if req1.status_code == 200:
            doubleheader_umpires.append(get_umpire_info_from_request(req1))
        sleep(3)
        req2 = requests.get(temp_url_2)
        if req2.status_code == 200:
            doubleheader_umpires.append(get_umpire_info_from_request(req2))
        sleep(3)
    else:
        doubleheader_umpires.append('Sean Barber')

Finally, we add in our freshly obtained umpire data, switch back to Baseball Savant abbreviations, and export the updated `missing_umpires` data frame as `missing_umpires_return.csv`.

In [12]:
missing_umpires = missing_umpires.assign(umpire=doubleheader_umpires)

In [13]:
for index in range(18,24):
    missing_umpires.at[index, 'home_team'] = 'CWS'

for index in range(34,36):
    missing_umpires.at[index, 'home_team'] = 'KC'

for index in range(36,40):
    missing_umpires.at[index, 'home_team'] = 'LAA'

for index in range(40,42):
    missing_umpires.at[index, 'home_team'] = 'LAD'
    
for index in range(42,52):
    missing_umpires.at[index, 'home_team'] = 'NYM'
    
for index in range(52,54):
    missing_umpires.at[index, 'home_team'] = 'NYY'

for index in range(62,64):
    missing_umpires.at[index, 'home_team'] = 'SD'
    
for index in range(64,66):
    missing_umpires.at[index, 'home_team'] = 'STL'

for index in range(66,71):
    missing_umpires.at[index, 'home_team'] = 'WSH'

In [15]:
missing_umpires.to_csv('missing_umpires_return.csv', index=False)