## Creating game logs of basic statistics for every player from Retrosheet data

I want to create a game_log sheet for every player in my data set, since this will help accelerate my backtesting.

<br><br> The reason for creating this file is that the script only has to run once. After that, I just need to look up the specific games to get the required information, instead of regenerating it every time without saving the results.

<br> https://www.baseball-reference.com/players/gl.fcgi?id=bautijo02&t=b&year=2014  <br><br> This is the ideal format, except that the CSV file created here will contain all the players. So far it works for 2005-2018. The file might be hundreds of MB, but that is OK.
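
The run-once-then-reload idea can be sketched as a small cache helper. This is only an illustration: `load_or_build`, `build_toy`, and the temporary file path are hypothetical names that do not appear elsewhere in this notebook.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Return the cached CSV at `path` if it exists; otherwise build it and save it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = build()
    frame.to_csv(path, index=False)
    return frame


# Toy demonstration: the (hypothetical) expensive build step runs only once.
calls = []


def build_toy():
    calls.append(1)  # record that the build step ran
    return pd.DataFrame({'res_batter': ['aaaa001'], 'H': [2]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'game_logs.csv')
    first = load_or_build(cache_path, build_toy)   # builds and writes the file
    second = load_or_build(cache_path, build_toy)  # reads the saved file
```

In this notebook, the role of `build` would be played by cells 1 to 9, and the role of `path` by the game-log CSV.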


In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

import warnings
warnings.filterwarnings('ignore')

d = pd.read_csv('../2005-2016_games.csv')
d.rename(columns={'unknown':'double_header_flag'}, inplace=True)
d['ab_flag'] = d.ab_flag.map({'F':0,'T':1})
d.sh_flag = d.sh_flag.map({'F':0,'T':1})
d.sf_flag = d.sf_flag.map({'F':0,'T':1})

In [2]:
df = d.copy()

In [3]:
#find all the plate appearances (PA) as defined by MLB rules
AB_only = df[df.ab_flag == 1]

event_flags = [14,15,16,17] #14 is BB 15 is IBB 16 is HBP 17 is interference
BB_IBB_HBP_INT = df[df.event_type.isin(event_flags)]

sac_hits = df[df.sh_flag ==1]

sac_fly  = df[df.sf_flag ==1]

#combine all the DataFrames into one (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_df = pd.concat([AB_only, BB_IBB_HBP_INT, sac_hits, sac_fly])

df = new_df.copy()
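
The same plate-appearance filter can also be written as a single boolean mask, which keeps each event row at most once even if a row were to satisfy more than one condition. A minimal sketch on made-up rows, assuming the flag columns are already 0/1 as in cell 1; `toy`, `is_pa`, and `plate_appearances` are illustrative names, and the event codes 2 and 4 are placeholders for events that are not walks, HBP, interference, or hits.

```python
import pandas as pd

# Made-up events: one row per play
toy = pd.DataFrame({
    'event_type': [2, 14, 16, 4, 20, 2, 2],
    'ab_flag':    [1, 0, 0, 0, 1, 0, 0],
    'sh_flag':    [0, 0, 0, 0, 0, 1, 0],
    'sf_flag':    [0, 0, 0, 0, 0, 0, 1],
})

# A row is a plate appearance if it is an at-bat, a BB/IBB/HBP/interference,
# a sacrifice hit, or a sacrifice fly.
is_pa = (
    (toy.ab_flag == 1)
    | toy.event_type.isin([14, 15, 16, 17])
    | (toy.sh_flag == 1)
    | (toy.sf_flag == 1)
)
plate_appearances = toy[is_pa]
```

Only the row with event code 4 is dropped here, because it sets none of the flags and is not one of the four listed event types.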

In [4]:
hit_flags = [20,21,22,23]     #20 = single, 21= double, 22 = triple, 23 = homerun
df.loc[df.event_type.isin(hit_flags), 'hit_flag'] = 1
df.loc[df.event_type.isin(hit_flags) == False, 'hit_flag'] = 0

#event_type 14 is BB, 15 is IBB, 16 is HBP, 17 is interference
df.loc[df.event_type==14, 'BB'] = 1
df.loc[df.event_type!=14, 'BB'] = 0

df.loc[df.event_type==15, 'IBB'] = 1
df.loc[df.event_type!=15, 'IBB'] = 0

df.loc[df.event_type==16, 'HBP'] = 1
df.loc[df.event_type!=16, 'HBP'] = 0

df.loc[df.event_type==17, 'ITF'] = 1
df.loc[df.event_type!=17, 'ITF'] = 0

df.loc[df.event_type==20, '1B'] = 1
df.loc[df.event_type!=20, '1B'] = 0

df.loc[df.event_type==21, '2B'] = 1
df.loc[df.event_type!=21, '2B'] = 0

df.loc[df.event_type==22, '3B'] = 1
df.loc[df.event_type!=22, '3B'] = 0

df.loc[df.event_type==23, 'HR'] = 1
df.loc[df.event_type!=23, 'HR'] = 0
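
The pairs of `.loc` assignments above all follow one pattern, so they can be generated from a code-to-name mapping. A minimal sketch on made-up rows; `toy` and `event_columns` are illustrative names, and note that this version produces integer 0/1 columns rather than the float 0.0/1.0 columns the cell above creates.

```python
import pandas as pd

toy = pd.DataFrame({'event_type': [14, 20, 23, 2, 15, 21]})

# event_type code -> indicator column name (same codes as in the cell above)
event_columns = {14: 'BB', 15: 'IBB', 16: 'HBP', 17: 'ITF',
                 20: '1B', 21: '2B', 22: '3B', 23: 'HR'}

for code, name in event_columns.items():
    # 1 when the row has this event code, 0 otherwise
    toy[name] = (toy.event_type == code).astype(int)

# A hit is any of single, double, triple, home run
toy['hit_flag'] = toy.event_type.isin([20, 21, 22, 23]).astype(int)
```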

In [5]:
data_grouped = df.groupby(['game_id','year','month','day','res_batter'])

#probably need to group by double_header_flag instead of game_id, but keep game_id for now since you can paste it into Google and find the right game

In [6]:
# now that everything is separated by 'game_id', 'year', 'month', 'day' and batter ID ('res_batter'), aggregate the stats
batting_summary = data_grouped.agg({'hit_flag':[np.sum, np.size],'ab_flag':[np.sum],'BB':np.sum,'IBB':np.sum,'HBP':np.sum,'ITF':np.sum,'1B':np.sum,'2B':np.sum,'3B':np.sum,'HR':np.sum,'sh_flag':np.sum,'sf_flag':np.sum})
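
As an alternative sketch, pandas named aggregation (available since pandas 0.25) yields flat, already-renamed columns in one step, which would make the renaming in the next cell unnecessary. The data and the names `toy` and `summary` below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'game_id':    ['G1', 'G1', 'G1', 'G2', 'G2'],
    'res_batter': ['a', 'a', 'b', 'a', 'a'],
    'hit_flag':   [1, 0, 1, 0, 1],
    'ab_flag':    [1, 1, 1, 0, 1],
    'BB':         [0, 0, 0, 1, 0],
})

# output_name=(input_column, aggregation)
summary = toy.groupby(['game_id', 'res_batter']).agg(
    H=('hit_flag', 'sum'),    # hits in the game
    PA=('hit_flag', 'size'),  # every remaining row is one plate appearance
    AB=('ab_flag', 'sum'),
    BB=('BB', 'sum'),
)
```

Output names that are not valid Python identifiers, such as `1B`, can be passed by unpacking a dictionary, for example `agg(**{'1B': ('1B', 'sum')})`.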

In [7]:
#Just some cleaning
batting_hits_pa = batting_summary.hit_flag
batting_hits_pa.rename(columns={'sum':'H','size':'PA'}, inplace=True)


batting_ab = batting_summary.ab_flag
batting_ab.rename(columns={'sum':'AB'}, inplace=True)

batting_BB  = batting_summary.BB
batting_BB.rename(columns={'sum':'BB'}, inplace=True)

batting_IBB = batting_summary.IBB
batting_IBB.rename(columns={'sum':'IBB'}, inplace=True)

batting_HBP = batting_summary.HBP
batting_HBP.rename(columns={'sum':'HBP'}, inplace=True)

batting_ITF = batting_summary.ITF
batting_ITF.rename(columns={'sum':'ITF'}, inplace=True)

batting_1B = batting_summary['1B']
batting_1B.rename(columns={'sum':'1B'}, inplace=True)

batting_2B = batting_summary['2B']
batting_2B.rename(columns={'sum':'2B'}, inplace=True)

batting_3B = batting_summary['3B']
batting_3B.rename(columns={'sum':'3B'}, inplace=True)

batting_HR = batting_summary.HR
batting_HR.rename(columns={'sum':'HR'}, inplace=True)

batting_SH = batting_summary.sh_flag
batting_SH.rename(columns={'sum':'SH'}, inplace=True)

batting_SF = batting_summary.sf_flag
batting_SF.rename(columns={'sum':'SF'}, inplace=True)


#now combine all the smaller dataframes
batting_summary = pd.concat([batting_hits_pa,batting_ab,batting_BB,batting_IBB,batting_HBP,batting_ITF,batting_1B,batting_2B,batting_3B,batting_HR,batting_SH,batting_SF], axis =1)

batting_summary.sort_values(by=['res_batter','year','month','day','game_id'],inplace=True)

#calculate BA and OBP: BA = H / AB, OBP = (H + BB + IBB + HBP)/ (AB + BB + IBB + HBP + SF )
batting_summary['BA'] = batting_summary.groupby(['res_batter','year'])['H'].transform(pd.Series.cumsum) / batting_summary.groupby(['res_batter','year'])['AB'].transform(pd.Series.cumsum)
batting_summary['OBP'] = (batting_summary.groupby(['res_batter','year'])['H'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['BB'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['IBB'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['HBP'].transform(pd.Series.cumsum) ) / (batting_summary.groupby(['res_batter','year'])['AB'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['BB'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['IBB'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['HBP'].transform(pd.Series.cumsum) + batting_summary.groupby(['res_batter','year'])['SF'].transform(pd.Series.cumsum))

#calculate player's cumulative Hits to each day in the season, as well as his H/PA
batting_summary['H_cum_season'] = batting_summary.groupby(['res_batter','year'])['H'].transform(pd.Series.cumsum)
batting_summary['H/PA_seasonal'] =  batting_summary.groupby(['res_batter','year'])['H'].transform(pd.Series.cumsum) / batting_summary.groupby(['res_batter','year'])['PA'].transform(pd.Series.cumsum)
batting_summary['PA_cum_season'] = batting_summary.groupby(['res_batter','year']).PA.transform(pd.Series.cumsum)
batting_summary['AB_cum_season'] = batting_summary.groupby(['res_batter','year']).AB.transform(pd.Series.cumsum)


#for each game the player plays in a season... write down his game count. also calculate H/seasonal_game_played
batting_summary['seasonal_game_played']= batting_summary.groupby(['res_batter','year']).cumcount()+1
batting_summary['H/seasonal_game_played'] = batting_summary.groupby(['res_batter','year'])['H'].transform(pd.Series.cumsum) / batting_summary.seasonal_game_played

# write down each player's career game count : career_game_played
batting_summary['career_game_played'] = batting_summary.groupby(['res_batter']).cumcount()+1

#career statistics
batting_summary['AB_cum_car'] = batting_summary.groupby(['res_batter']).AB.transform(pd.Series.cumsum)
batting_summary['H_cum_car'] = batting_summary.groupby(['res_batter']).H.transform(pd.Series.cumsum)
batting_summary['BA_car'] = batting_summary['H_cum_car'] / batting_summary['AB_cum_car']
batting_summary['PA_car'] = batting_summary.groupby(['res_batter']).PA.transform(pd.Series.cumsum)
batting_summary['H/PA_car'] = batting_summary['H_cum_car'] / batting_summary['PA_car']
batting_summary['H/career_game_played'] = batting_summary['H_cum_car'] / batting_summary['career_game_played']

batting_summary.reset_index(inplace=True)
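
To make the running-total logic concrete, here is a small worked example of the season-to-date batting average on made-up numbers. It uses the built-in `cumsum` of a groupby, which should give the same values as `transform(pd.Series.cumsum)`; `toy` and `season` are illustrative names.

```python
import pandas as pd

# Two batters, rows already sorted by date within each batter
toy = pd.DataFrame({
    'res_batter': ['a', 'a', 'a', 'b', 'b'],
    'year':       [2006] * 5,
    'H':          [1, 0, 2, 0, 1],
    'AB':         [4, 3, 4, 4, 2],
})

season = toy.groupby(['res_batter', 'year'])
toy['H_cum_season'] = season['H'].cumsum()    # hits up to and including each game
toy['AB_cum_season'] = season['AB'].cumsum()  # at-bats up to and including each game
toy['BA'] = toy['H_cum_season'] / toy['AB_cum_season']
```

After batter `a`'s third game, the running totals are 3 hits in 11 at-bats, so his BA on that row is 3/11.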

In [8]:
%%time
#the windows count games played, not calendar days; transform keeps the result aligned with the original rows
batting_summary['H_rolling_5d']    = batting_summary.groupby(['res_batter','year'])['H'].transform(lambda g: g.rolling(5).sum())
batting_summary['AB_rolling_5d']   = batting_summary.groupby(['res_batter','year'])['AB'].transform(lambda g: g.rolling(5).sum())
batting_summary['BA_rolling_5d']   = batting_summary.H_rolling_5d / batting_summary.AB_rolling_5d
batting_summary['PA_rolling_5d']   = batting_summary.groupby(['res_batter','year'])['PA'].transform(lambda g: g.rolling(5).sum())
batting_summary['H/PA_rolling_5d'] = batting_summary.H_rolling_5d / batting_summary.PA_rolling_5d

batting_summary['H_rolling_10d']    = batting_summary.groupby(['res_batter','year'])['H'].transform(lambda g: g.rolling(10).sum())
batting_summary['AB_rolling_10d']   = batting_summary.groupby(['res_batter','year'])['AB'].transform(lambda g: g.rolling(10).sum())
batting_summary['BA_rolling_10d']   = batting_summary.H_rolling_10d / batting_summary.AB_rolling_10d
batting_summary['PA_rolling_10d']   = batting_summary.groupby(['res_batter','year'])['PA'].transform(lambda g: g.rolling(10).sum())
batting_summary['H/PA_rolling_10d'] = batting_summary.H_rolling_10d / batting_summary.PA_rolling_10d

batting_summary['H_rolling_30d']    = batting_summary.groupby(['res_batter','year'])['H'].transform(lambda g: g.rolling(30).sum())
batting_summary['AB_rolling_30d']   = batting_summary.groupby(['res_batter','year'])['AB'].transform(lambda g: g.rolling(30).sum())
batting_summary['BA_rolling_30d']   = batting_summary.H_rolling_30d / batting_summary.AB_rolling_30d
batting_summary['PA_rolling_30d']   = batting_summary.groupby(['res_batter','year'])['PA'].transform(lambda g: g.rolling(30).sum())
batting_summary['H/PA_rolling_30d'] = batting_summary.H_rolling_30d / batting_summary.PA_rolling_30d

Wall time: 53.9 s
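
A small sketch of how the rolling windows behave, on made-up data with a window of 3 games instead of 5, 10, or 30. The window restarts for each batter and season, and the first `window - 1` games of each group are NaN, which is why early-season rows have empty rolling columns. `toy` and `H_rolling_3` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'res_batter': ['a'] * 4 + ['b'] * 3,
    'year':       [2006] * 7,
    'H':          [1, 0, 2, 1, 3, 0, 1],
})

# Rolling sum over the last 3 games, computed separately per batter and season
toy['H_rolling_3'] = (
    toy.groupby(['res_batter', 'year'])['H']
       .transform(lambda s: s.rolling(3).sum())
)
```

Batter `a` gets NaN, NaN, 3, 3 and batter `b` gets NaN, NaN, 4: no window reaches across the boundary between the two batters.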


In [9]:
%%time
# Standard Error for a proportion = sqrt( p * q / n )
# What is a batter's 'True' BA? it is mean +- z_score * Standard Error

#season
p_season = batting_summary.BA
n_season = batting_summary.AB_cum_season
batting_summary['BA_SE_season'] = np.sqrt(p_season * (1-p_season) / n_season)

#career
p_car = (batting_summary.BA_car)
n_car = batting_summary['AB_cum_car']

batting_summary['BA_SE_car'] = np.sqrt(p_car * (1-p_car) / n_car)

percentile = 0.975     #upper tail for a two-sided 95% confidence interval
crit_value_z = st.norm.ppf(percentile)
#True statistic bounds: general formula is mean +- z_score * standard error; use t_score when n<=30

#seasonal
batting_summary.loc[batting_summary['seasonal_game_played']<=30,'BA_lower_CI'] = batting_summary['BA'] - st.t.ppf(percentile,df=batting_summary['seasonal_game_played'] - 1) * batting_summary['BA_SE_season']
batting_summary.loc[batting_summary['seasonal_game_played']<=30,'BA_upper_CI'] = batting_summary['BA'] + st.t.ppf(percentile,df=batting_summary['seasonal_game_played'] - 1) * batting_summary['BA_SE_season']

batting_summary.loc[batting_summary['seasonal_game_played']>30 ,'BA_lower_CI'] = batting_summary['BA'] - crit_value_z * batting_summary['BA_SE_season']
batting_summary.loc[batting_summary['seasonal_game_played']>30 ,'BA_upper_CI'] = batting_summary['BA'] + crit_value_z * batting_summary['BA_SE_season']

#career
batting_summary.loc[batting_summary['career_game_played']<=30,'BA_lower_CI_car'] = batting_summary['BA_car'] - st.t.ppf(percentile,df=batting_summary['career_game_played'] - 1) * batting_summary['BA_SE_car']
batting_summary.loc[batting_summary['career_game_played']<=30,'BA_upper_CI_car'] = batting_summary['BA_car'] + st.t.ppf(percentile,df=batting_summary['career_game_played'] - 1) * batting_summary['BA_SE_car']

batting_summary.loc[batting_summary['career_game_played']>30,'BA_lower_CI_car'] = batting_summary['BA_car'] - crit_value_z * batting_summary['BA_SE_car']
batting_summary.loc[batting_summary['career_game_played']>30,'BA_upper_CI_car'] = batting_summary['BA_car'] + crit_value_z * batting_summary['BA_SE_car'] 

print('Complete')

Complete
Wall time: 7.38 s
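
The standard-error and confidence-interval formulas from the cell above, worked through for a single made-up batting line (30 hits in 100 at-bats). This sketch uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm.ppf` so that it runs without SciPy; the numbers are illustrative and not taken from the data set.

```python
import math
from statistics import NormalDist

hits, at_bats = 30, 100           # made-up totals
p = hits / at_bats                # observed batting average

# Standard error of a proportion: sqrt(p * q / n)
se = math.sqrt(p * (1 - p) / at_bats)

# Critical value for a two-sided 95% interval (about 1.96)
z = NormalDist().inv_cdf(0.975)

lower, upper = p - z * se, p + z * se
print(f'BA = {p:.3f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})')
```

With only 100 at-bats the interval is roughly nine points of batting average wide on each side of .300, which shows how slowly the estimate tightens.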


In [10]:
batting_summary.query('res_batter=="bautj002" and year == 2010').sample(5)

| | game_id | year | month | day | res_batter | H | PA | AB | BB | IBB | ... | AB_rolling_30d | BA_rolling_30d | PA_rolling_30d | H/PA_rolling_30d | BA_SE_season | BA_SE_car | BA_lower_CI | BA_upper_CI | BA_lower_CI_car | BA_upper_CI_car |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31332 | TBA201004250 | 2010 | 4 | 25 | bautj002 | 1.0 | 3.0 | 3 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 0.050807 | 0.01025 | 0.125142 | 0.338626 | 0.21968 | 0.259859 |
| 31392 | CLE201007010 | 2010 | 7 | 1 | bautj002 | 1.0 | 4.0 | 4 | 0.0 | 0.0 | ... | 100.0 | 0.2 | 122.0 | 0.163934 | 0.025516 | 0.009683 | 0.178772 | 0.278793 | 0.219534 | 0.257492 |
| 31474 | MIN201010030 | 2010 | 10 | 3 | bautj002 | 0.0 | 4.0 | 4 | 0.0 | 0.0 | ... | 107.0 | 0.224299 | 127.0 | 0.188976 | 0.018391 | 0.0091 | 0.22406 | 0.296151 | 0.227355 | 0.263025 |
| 31358 | ARI201005220 | 2010 | 5 | 22 | bautj002 | 0.0 | 3.0 | 3 | 0.0 | 0.0 | ... | 106.0 | 0.245283 | 121.0 | 0.214876 | 0.033643 | 0.009993 | 0.171561 | 0.303439 | 0.220283 | 0.259454 |
| 31449 | TOR201009060 | 2010 | 9 | 6 | bautj002 | 0.0 | 4.0 | 3 | 1.0 | 0.0 | ... | 100.0 | 0.28 | 130.0 | 0.215385 | 0.020187 | 0.009296 | 0.224586 | 0.303716 | 0.22723 | 0.263671 |


In [11]:
# We got what we want, we can now save the dataframe
batting_summary.to_csv('../2005-2016_game_logs.csv',index=None)

In [None]:
df2 = pd.read_csv('../2005-2016_game_logs.csv')

In [None]:
df2.sample(4).head()

## Running Tests

We can test the code and verify the results by comparing them with the information on a reputable website: baseball-reference.com

<br> To do this, we pick a year. Here I've chosen 2006, but this can easily be changed.

<br> Next, we pick a batter at random, visit his baseball-reference page, and manually check the numbers.

In [None]:
yearly_batting_summary = batting_summary.groupby('year')

In [None]:
df2006 = yearly_batting_summary.get_group(2006)

### Case 1: Checking Hits, AB, and PA

In [None]:
import random
#batter = random.sample(sorted(set(df2006.res_batter)),1)   #random.sample needs a sequence, not a set, on Python 3.11+
batter =['suzui001']
print(batter)

df2006[df2006.res_batter == batter[0]].H.sum(), df2006[df2006.res_batter == batter[0]].AB.sum(), df2006[df2006.res_batter == batter[0]].PA.sum()

The chosen batter is 'suzui001', which is Ichiro Suzuki (https://www.retrosheet.org/boxesetc/S/Psuzui001.htm). <br><br> We can check his 2006 stats by following this link: https://www.baseball-reference.com/players/gl.fcgi?id=suzukic01&t=b&year=2006
<br><br>Scrolling to the bottom of the web page, we can see that he had exactly 752 plate appearances, 695 AB, and 224 hits, which matches the data we obtained.
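
The manual comparison can also be expressed as a small helper that sums the per-game columns for one batter and season and compares them with the published totals. This is a sketch on made-up game logs; `season_totals`, `toy_logs`, and the batter IDs in the toy data are hypothetical.

```python
import pandas as pd


def season_totals(logs, batter, year, columns=('H', 'AB', 'PA')):
    """Sum the given per-game columns for one batter in one season."""
    rows = logs[(logs.res_batter == batter) & (logs.year == year)]
    return rows[list(columns)].sum().to_dict()


# Made-up game logs: two 2006 games for one batter, plus rows that must be ignored
toy_logs = pd.DataFrame({
    'res_batter': ['aaaa001', 'aaaa001', 'aaaa001', 'bbbb001'],
    'year':       [2006, 2006, 2007, 2006],
    'H':          [1, 2, 4, 3],
    'AB':         [4, 3, 5, 4],
    'PA':         [4, 4, 5, 4],
})

totals = season_totals(toy_logs, 'aaaa001', 2006)
expected = {'H': 3, 'AB': 7, 'PA': 8}   # what a reference site would list for this toy batter
print(totals == expected)

# With the real frame, the check described above would read:
# season_totals(df2006, 'suzui001', 2006) == {'H': 224, 'AB': 695, 'PA': 752}
```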

### Case 2: Checking BB, IBB and HBP

In [None]:
df2006[df2006.res_batter == batter[0]].BB.sum(),df2006[df2006.res_batter == batter[0]].IBB.sum(),df2006[df2006.res_batter == batter[0]].HBP.sum()

Checking his walks, intentional walks, and hit by pitches on baseball-reference, we can see that he has 51 walks, 1 intentional walk, and 6 hit by pitches. <br><br>I double-checked the game shown below and can confirm that baseball-reference counts intentional walks in its walks column, which is why it shows 51 (50 regular walks + 1 intentional walk) instead of 50.

In [None]:
df2006.query('res_batter =="suzui001" and month==7 and day ==31')[['game_id','year','month','day','res_batter','BB','IBB']]

In [None]:
df.query('year==2006 and res_batter =="suzui001" and month==7 and day ==31')

### Case 3: Checking BA and OBP

In [None]:
df2006[df2006.res_batter==batter[0]].sample(3)[['game_id','year','month','day','BA','OBP']]

https://www.baseball-reference.com/players/gl.fcgi?id=suzukic01&t=b&year=2006
<br><br>On July 17, 2006: Ichiro's BA was .340 and OBP was .395 : pass
<br>On April 7, 2006: Ichiro's BA was .286 and OBP was .375 : pass
<br>On April 23, 2006: Ichiro's BA was .253 and OBP was .333 : pass
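
For spot checks like these, the two formulas from cell 7 can be written as plain functions and applied to the cumulative totals of any row. A minimal sketch with a made-up stat line; `batting_average` and `obp` are illustrative helper names, and walks are kept split into BB and IBB as in this notebook's columns.

```python
def batting_average(h, ab):
    """BA = H / AB (NaN when there are no at-bats yet)."""
    return h / ab if ab else float('nan')


def obp(h, bb, ibb, hbp, ab, sf):
    """OBP = (H + BB + IBB + HBP) / (AB + BB + IBB + HBP + SF)."""
    denominator = ab + bb + ibb + hbp + sf
    if denominator == 0:
        return float('nan')
    return (h + bb + ibb + hbp) / denominator


# Made-up cumulative line: 6 H, 2 BB, 0 IBB, 1 HBP, 20 AB, 1 SF
example_ba = round(batting_average(6, 20), 3)
example_obp = round(obp(6, 2, 0, 1, 20, 1), 3)
print(example_ba, example_obp)
```

Rounding to three decimals matches the way baseball-reference displays both statistics, which makes the comparison with the `BA` and `OBP` columns direct.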