# Baseball Analytics
Some questions I want to answer:
1. Does handedness really matter when it comes to pitching/batting? Do lefties really do better against right-handed batters?
2. Can you predict a batter's OBP?
3. Does the number of outs affect a batter's performance?
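
To make question 2 concrete, it helps to pin down the target first. On-base percentage is conventionally computed as (H + BB + HBP) / (AB + BB + HBP + SF). The sketch below is only that formula; the function name and the stat line are made up for illustration and are not taken from the data.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = at_bats + walks + hit_by_pitch + sac_flies
    if denominator == 0:
        return float('nan')  # no qualifying plate appearances yet
    return (hits + walks + hit_by_pitch) / denominator


# Made-up stat line, just to show the call.
example_obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5, at_bats=500, sac_flies=5)
print(round(example_obp, 3))
```

Each of these counts would eventually have to be derived from the play strings, which is why the parsing below matters.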

In [30]:
import pandas as pd
import numpy as np
import re
import os

## What to do with raw data?
I think the best way to go about this is to build three datasets: games, players, and play-by-play. The *games* dataset will have
1. a unique ID
2. team info (away, home, etc.)
3. weather info
4. all fields tagged `info` in the event files

The *players* dataset will have
1. player IDs
2. team IDs (coupled with years, to handle trades?)
3. position
4. handedness
5. any other demographics I can find
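
A possible starting point for the *players* dataset is the roster (`.ROS`) files, which, as far as I can tell, hold one player per line: ID, last name, first name, bats, throws, team, position. The sketch below assumes that layout. The column names are my own labels, and the three rows are typed by hand as a stand-in for a real file.

```python
import csv
import io

# My own labels for the seven roster fields (an assumption, not official names).
ROSTER_FIELDS = ['playerID', 'lastName', 'firstName', 'bats', 'throws', 'teamID', 'position']

# Hand-typed sample rows standing in for a real .ROS file.
sample_roster = io.StringIO(
    "spand001,Span,Denard,L,L,MIN,OF\n"
    "mauej001,Mauer,Joe,L,R,MIN,C\n"
    "weavj003,Weaver,Jered,R,R,ANA,P\n"
)


def read_roster(handle, year):
    """Return one dict per player, tagged with the season so trades can be tracked."""
    players = []
    for row in csv.DictReader(handle, fieldnames=ROSTER_FIELDS):
        row['year'] = year
        players.append(row)
    return players


players = read_roster(sample_roster, 2010)
for player in players:
    print(player['playerID'], player['teamID'], player['bats'], player['throws'])
```

Keeping `year` next to `teamID` is one way to handle the trades question in item 2: a traded player simply shows up in more than one roster file.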

The *play-by-play* dataset will hold everything about the individual plays in each game:
1. GameID
2. PlayID
3. PlayerIDs (all players who were involved)
4. pitch sequence
5. other events?

Check out Retrosheet's [detailed descriptions](https://www.retrosheet.org/eventfile.htm) of the play-by-play files.
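
Since every line in an event file starts with a record type (`id`, `info`, `start`, `play`, ...), one way to get a feel for the format is to group lines by that first field. This is a minimal sketch using the standard `csv` module, which also copes with the quoted player names. The sample lines are hand-built from the first game shown later in this notebook (the batting-order value on the `start` line is my guess), and `group_records` is a hypothetical helper.

```python
import csv
from collections import defaultdict

# Hand-built sample in the event-file layout; not read from disk.
sample_lines = [
    'id,ANA201004050',
    'info,visteam,MIN',
    'info,hometeam,ANA',
    'start,spand001,"Denard Span",0,1,8',
    'play,1,0,spand001,22,CSBFFBFC,K',
]


def group_records(lines):
    """Map each record type to the list of its remaining fields, in file order."""
    grouped = defaultdict(list)
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        record_type, *rest = fields
        grouped[record_type].append(rest)
    return dict(grouped)


grouped = group_records(sample_lines)
for record_type, rows in grouped.items():
    print(record_type, rows)
```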


In [32]:
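# Read the event files line by line. For now we are only dealing with 2010 data.
folderpath = r'Data/Retrosheet/2010-2019/2010'
# Skip the roster (.ROS) files; everything else in the folder is read in.
filepaths = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if "ROS" not in name]
all_files = []

for path in filepaths:
    with open(path, 'r') as f:
        print(path)
        file = f.readlines()
        all_files.append(file)

# ASSUMPTION: the cells below loop over `data`, which is never assigned in this notebook.
# Here it is taken to be every line read above, ordered by file name.
data = [line for _, file in sorted(zip(filepaths, all_files)) for line in file]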

Data/Retrosheet/2010-2019/2010/2010CHN.EVN
Data/Retrosheet/2010-2019/2010/2010SEA.EVA
Data/Retrosheet/2010-2019/2010/2010NYN.EVN
Data/Retrosheet/2010-2019/2010/2010PHI.EVN
Data/Retrosheet/2010-2019/2010/2010SLN.EVN
Data/Retrosheet/2010-2019/2010/2010HOU.EVN
Data/Retrosheet/2010-2019/2010/2010MIL.EVN
Data/Retrosheet/2010-2019/2010/2010MIN.EVA
Data/Retrosheet/2010-2019/2010/2010LAN.EVN
Data/Retrosheet/2010-2019/2010/2010NYA.EVA
Data/Retrosheet/2010-2019/2010/2010DET.EVA
Data/Retrosheet/2010-2019/2010/2010WAS.EVN
Data/Retrosheet/2010-2019/2010/2010OAK.EVA
Data/Retrosheet/2010-2019/2010/2010CIN.EVN
Data/Retrosheet/2010-2019/2010/2010BOS.EVA
Data/Retrosheet/2010-2019/2010/2010TOR.EVA
Data/Retrosheet/2010-2019/2010/2010PIT.EVN
Data/Retrosheet/2010-2019/2010/2010TBA.EVA
Data/Retrosheet/2010-2019/2010/2010KCA.EVA
Data/Retrosheet/2010-2019/2010/2010CLE.EVA
Data/Retrosheet/2010-2019/2010/2010CHA.EVA
Data/Retrosheet/2010-2019/2010/2010TEX.EVA
Data/Retrosheet/2010-2019/2010/2010ATL.EVN
Data/Retros

In [33]:
# Collect every info field name that appears in the data so I don't have to type them out myself.
# NOTE: There could be more fields than this in other seasons; we can improve this later if we want.
infoClassList = ['gameID']

for line in data:
    flag = re.search(r'([a-z]*),', line).group(1)
    if flag == 'info':
        # \w also accepts underscores, so an unexpected field name cannot make the match fail
        infoObj = re.search(r'info,([\w]*),(.*)\n', line)
        infoClass = infoObj.group(1)
        if infoClass not in infoClassList:
            infoClassList.append(infoClass)

# one extra column for the list of starting players
infoClassList.append('starters')

In [34]:
# now let's parse this data and get some dataframes out of it.
games = pd.DataFrame(columns = infoClassList)

# we need a dictionary to append to the games dataframe
appDict = {}
for el in infoClassList:
    appDict[el]=None

# this only has a few items so we can hand code these features. I haven't decided how I want to split up the pitches and play strings yet
# I may need to add a "switch" feature for when a batter switches hands
plays = pd.DataFrame(columns = ['gameID', 'inning', 'h/a', 'playerID', 'pitcherID', 'eventCount', 'pitches', 'playString']) 
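
One possible way to start splitting up the `pitches` string is to rebuild the ball-strike count from it. The sketch below only understands five pitch codes (`B` ball, `C` called strike, `S` swinging strike, `F` foul, `X` ball in play) and silently ignores everything else, which is a simplification of the full Retrosheet code list. It also assumes the last pitch in the string is the one that ended the plate appearance. `count_before_final_pitch` is a hypothetical helper name.

```python
def count_before_final_pitch(pitches):
    """Return (balls, strikes) as they stood before the last pitch in the sequence."""
    # ASSUMPTION: only these five codes are handled; all other characters are skipped.
    known = [p for p in pitches if p in 'BCSFX']
    balls = strikes = 0
    for pitch in known[:-1]:  # leave out the final pitch
        if pitch == 'B':
            balls += 1
        elif pitch in 'CS':
            strikes += 1
        elif pitch == 'F' and strikes < 2:
            strikes += 1  # a foul only counts as a strike below two strikes
    return balls, strikes


for sequence in ['CSBFFBFC', 'BBCCFBFFFB', 'CBX']:
    print(sequence, count_before_final_pitch(sequence))
```

If this holds up, the result should agree with the count field already stored on each play record, which gives a cheap consistency check on the parsing.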

In [35]:
gameID = ''
homePitcher = ''
awayPitcher = ''
switch = False  # this will be used to denote when batters switch hit
startersList = []
gameRows = []  # one dict per finished game
playRows = []  # one dict per play
# start from a clean dictionary so re-running this cell does not write a stale game
appDict = dict.fromkeys(infoClassList)

# let's just parse through each line and get what we need.
for line in data:
    flag = re.search(r'([a-z]*),', line).group(1)
    if flag == 'id':
        if appDict['gameID'] is not None:
            # this prevents us from writing an empty row on the first pass through
            appDict['starters'] = startersList
            gameRows.append(appDict)
            # need fresh containers now that we have a new game
            appDict = dict.fromkeys(infoClassList)
            startersList = []
        gameID = re.search(r'id,([A-Z\d]*)\n', line).group(1)
        appDict['gameID'] = gameID

    if flag == 'info':
        # this is information about the game: time, weather, etc.
        infoObj = re.search(r'info,([\w]*),(.*)\n', line)
        infoClass = infoObj.group(1)
        infoVal = infoObj.group(2)
        appDict[infoClass] = infoVal

    if flag == 'start':
        # This is information about the starters of the game.
        startObj = re.search(r'start,([\w-]*),"(.*)",([\d]*),([\d]*),([\d]*)\n', line)
        playerID = startObj.group(1)
        homeAway = int(startObj.group(3))
        position = int(startObj.group(5))
        if position == 1:
            # Retrosheet team flag: 0 = visiting team, 1 = home team
            if homeAway == 1:
                homePitcher = playerID
            else:
                awayPitcher = playerID
        # not using these for now
        # playerName = startObj.group(2)
        # battingPosition = startObj.group(4)
        startersList.append([playerID, homeAway, position])

    if flag == 'play':
        playObj = re.search(r'play,([\d]*),([\d]*),([\w-]*),(.*),(.*),(.*)\n', line)
        inning = playObj.group(1)
        homeAway = int(playObj.group(2))
        playerID = playObj.group(3)
        count = playObj.group(4)
        pitches = playObj.group(5)
        playString = playObj.group(6)
        # flag 0 means the visiting team is batting, so the home pitcher is on the mound
        if homeAway == 0:
            tempPitch = homePitcher
        else:
            tempPitch = awayPitcher
        playRows.append({
            'gameID': gameID,
            'inning': inning,
            'h/a': homeAway,
            'playerID': playerID,
            'pitcherID': tempPitch,
            'eventCount': count,
            'pitches': pitches,
            'playString': playString
        })

    if flag == 'sub':
        # This string is the same as the start string. The only info I care
        # about right now is if the pitcher changes. I may want more later!
        subObj = re.search(r'sub,([\w-]*),"(.*)",([\d]*),([\d]*),([\d]*)\n', line)
        playerID = subObj.group(1)
        homeAway = int(subObj.group(3))
        position = int(subObj.group(5))
        if position == 1:
            # same team flag as in the start records: 0 = visiting, 1 = home
            if homeAway == 1:
                homePitcher = playerID
            else:
                awayPitcher = playerID
        # not using these for now
        # playerName = subObj.group(2)
        # battingPosition = subObj.group(4)

    if flag == 'badj':
        switch = True
        # we need this to inform each subsequent play row.
        # The hard part is turning it off when there is a new batter!

# a game is only written when the next 'id' line shows up, so write the last game here
if appDict['gameID'] is not None:
    appDict['starters'] = startersList
    gameRows.append(appDict)

# DataFrame.append was removed in pandas 2.0, so build both frames once from the row lists
games = pd.DataFrame(gameRows, columns=infoClassList)
plays = pd.DataFrame(playRows, columns=plays.columns)

In [36]:
plays.head()

Unnamed: 0,gameID,inning,h/a,playerID,pitcherID,eventCount,pitches,playString
0,ANA201004050,1,0,spand001,weavj003,22,CSBFFBFC,K
1,ANA201004050,1,0,hudso001,weavj003,1,FX,43/G-
2,ANA201004050,1,0,mauej001,weavj003,11,CBX,43/G
3,ANA201004050,1,1,aybae001,bakes002,32,BBCCFBFFFB,W
4,ANA201004050,1,1,abreb001,bakes002,1,CX,8/F
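
The `badj` branch of the parsing loop only raises a flag and never lowers it. One way to turn it off for the next batter is to remember which player the adjustment belongs to and drop it as soon as a play record names someone else. The sketch below assumes `badj` records have the form `badj,playerID,hand` and that records arrive as `(flag, fields)` tuples. The player IDs are invented and `tag_batting_adjustments` is a hypothetical helper.

```python
def tag_batting_adjustments(records):
    """Return (batterID, adjustedHand or None) for every play record, in order."""
    pending = None  # (playerID, hand) from the most recent badj record
    tagged = []
    for flag, fields in records:
        if flag == 'badj':
            pending = (fields[0], fields[1])
        elif flag == 'play':
            batter = fields[2]  # play fields: inning, team flag, batter, count, pitches, event
            if pending is not None and pending[0] != batter:
                pending = None  # a new batter came up, so the adjustment no longer applies
            tagged.append((batter, pending[1] if pending else None))
    return tagged


# Invented records: 'aaaa001' bats from an adjusted side, 'bbbb001' does not.
sample_records = [
    ('play', ['1', '0', 'bbbb001', '00', 'X', '8/F']),
    ('badj', ['aaaa001', 'R']),
    ('play', ['1', '0', 'aaaa001', '10', 'B', 'NP']),
    ('play', ['1', '0', 'aaaa001', '11', 'BCX', 'S7']),
    ('play', ['1', '0', 'bbbb001', '00', 'X', '43/G']),
]
print(tag_batting_adjustments(sample_records))
```

Keying the reset on the batter ID rather than on the next `play` line matters because one plate appearance can span several play records.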


In [37]:
games.head()

Unnamed: 0,gameID,visteam,hometeam,site,date,number,starttime,daynight,usedh,umphome,...,windspeed,fieldcond,precip,sky,timeofgame,attendance,wp,lp,save,starters
0,ANA201004050,MIN,ANA,ANA01,2010/04/05,0,7:08PM,night,True,mcclt901,...,11,unknown,unknown,cloudy,180,43504,weavj003,bakes002,fuenb001,"[[spand001, 0, 8], [hudso001, 0, 4], [mauej001..."
1,ANA201004060,MIN,ANA,ANA01,2010/04/06,0,7:08PM,night,True,everm901,...,5,unknown,unknown,sunny,161,43510,blacn001,saunj001,raucj001,"[[spand001, 0, 8], [hudso001, 0, 4], [mauej001..."
2,ANA201004070,MIN,ANA,ANA01,2010/04/07,0,7:08PM,night,True,fleta901,...,4,unknown,unknown,sunny,158,41533,pavac001,sante001,raucj001,"[[spand001, 0, 8], [hudso001, 0, 4], [mauej001..."
3,ANA201004080,MIN,ANA,ANA01,2010/04/08,0,7:07PM,night,True,johna901,...,5,unknown,unknown,sunny,182,39709,slowk001,pinej001,,"[[spand001, 0, 8], [hudso001, 0, 4], [mauej001..."
4,ANA201004090,OAK,ANA,ANA01,2010/04/09,0,7:08PM,night,True,barrs901,...,7,unknown,unknown,sunny,179,40034,gonzg003,palmm001,,"[[davir003, 0, 8], [bartd001, 0, 3], [sweer001..."
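
The `starters` column packs a whole lineup into one cell, which is awkward to filter or join on. A possible alternative is one row per starter. The sketch below is plain Python over a list of game dicts. The sample row reuses the first two starters visible in the output above, and `flatten_starters` and its output column names are hypothetical.

```python
def flatten_starters(game_rows):
    """Turn each game's [playerID, team flag, position] lists into one dict per starter."""
    long_rows = []
    for game in game_rows:
        for player_id, team_flag, position in game['starters']:
            long_rows.append({
                'gameID': game['gameID'],
                'playerID': player_id,
                'homeAway': team_flag,  # 0 = visiting team, 1 = home team
                'position': position,
            })
    return long_rows


# Sample row built from the first two starters shown in the games output.
sample_games = [
    {'gameID': 'ANA201004050', 'starters': [['spand001', 0, 8], ['hudso001', 0, 4]]},
]
starter_rows = flatten_starters(sample_games)
for row in starter_rows:
    print(row)
```

With the real frame, something like `pd.DataFrame(flatten_starters(games.to_dict('records')))` should give a table that joins cleanly against both `plays` and a players dataset.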
