# Data Frame Creation

#### Complete data collection, cleaning, and export to CSV. This data will be used throughout the notebook in various experiments.

In [51]:
import numpy as np # NaN placeholder for missing columns
import pandas as pd # Dataframes
from pandas import json_normalize # JSON wrangler (top-level location since pandas 1.0)
import statsapi # Python wrapper MLB data API

In [52]:
pd.set_option('display.max_columns', None)

## Gathering Game Ids

The StatsApi (```https://pypi.org/project/MLB-StatsAPI/```) is called to collect information about all MLB games played between 03/28/18 and 10/03/18. Note that the call below sets ```end_date``` to 04/03/2018, so this run covers only the first week of that window. The call brings in metadata for every game played during the requested period. To gather the game data needed to make predictions, the ```game_id``` of each game is needed. Since there isn't a premade list, it is created below.

In [53]:
schedule = statsapi.schedule(start_date="03/28/2018", end_date="04/03/2018")

In [54]:
full = json_normalize(schedule)
gamepks = full['game_id']

In [57]:
gamepks_2018 = list(gamepks.unique())
len(gamepks_2018)

78
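As a self-contained illustration of the step above, the sketch below runs the same flatten-then-deduplicate logic on a small hand-made list standing in for the schedule result. The sample records, their values, and the `home`/`away` keys are invented for illustration; only the `game_id` key is taken from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the list of dicts returned by the schedule call;
# the IDs and team labels are made up for illustration.
sample_schedule = [
    {"game_id": 1001, "home": "Team A", "away": "Team B"},
    {"game_id": 1002, "home": "Team C", "away": "Team D"},
    {"game_id": 1001, "home": "Team A", "away": "Team B"},  # repeated entry
]

# Flatten the records into a dataframe, one row per record
sample_full = pd.json_normalize(sample_schedule)

# unique() keeps first-seen order; tolist() converts to plain Python ints
sample_gamepks = sample_full["game_id"].unique().tolist()
print(sample_gamepks)
```

Deduplicating matters here because each ID in the list triggers one play-by-play request in the collection loop.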

In [58]:
test_pk = gamepks_2018

## Data Collection and Data Frame Creation

This loop iterates through the game pks in the list above, grabs the game data associated with each game pk, and breaks it out into a custom, human-usable pandas dataframe. The column names are defined manually so that only the necessary columns are added to the data frame.
The second loop iterates through the play events of each row in the current plays dataframe and normalizes the ```.json``` nested in the play events column. For every play event, a dictionary is defined and several ID columns are added to it. These columns are used to add the prior pitch to the data frame, which is important when determining patterns in pitcher tendencies. If/else statements then add each column to the dictionary, filling in NaN when a column is missing. The dictionary is appended to a list, which is used to create the final dataframe.
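The ID scheme described above can be sketched without calling the API. The example below is a minimal illustration, assuming a made-up game ID and made-up event counts per at-bat. It builds `pitch_id` and `prior_pitch` strings the same way the loop does and shows which rows will later find a matching prior pitch.

```python
# Toy stand-in for one game's at-bats: at-bat index -> number of play events.
# The game ID and event counts are made up for illustration.
game = 1001
events_per_at_bat = {0: 2, 1: 3}

rows = []
i = 1  # running event counter for the whole game, as in the notebook's loop
for at_bat_index, n_events in events_per_at_bat.items():
    for _ in range(n_events):
        rows.append({
            "pitch_id": f"{game}_{at_bat_index}_{i}",
            "prior_pitch": f"{game}_{at_bat_index}_{i - 1}",
        })
        i += 1

known_ids = {r["pitch_id"] for r in rows}
for r in rows:
    # The first event of each at-bat points at an ID that was never created,
    # so it finds no match when prior pitches are looked up later.
    r["has_prior"] = r["prior_pitch"] in known_ids
    print(r)
```

Because the at-bat index is part of both IDs, a prior pitch is only ever matched within the same at-bat.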

In [59]:
list_for_final_df = []
for game in test_pk:
    # Pull the play-by-play data for this game and flatten each at-bat into a row
    curr_game = statsapi.get('game_playByPlay', {'gamePk': game})
    curr_plays = curr_game.get('allPlays')
    curr_plays_norm = json_normalize(curr_plays)

    # At-bat level columns to keep
    all_plays_cols = ['about.atBatIndex', 'about.halfInning', 'about.inning', 'count.balls', 'count.strikes', 'matchup.batSide.code',
                      'matchup.batter.fullName', 'matchup.batter.id', 'matchup.pitchHand.code', 'matchup.splits.menOnBase', 'matchup.pitcher.fullName',
                      'matchup.pitcher.id', 'result.eventType']

    # Play event level columns to keep
    play_events_cols = ['count.balls', 'count.strikes', 'details.ballColor', 'details.call.code', 'details.call.description', 'details.type.description',
                        'details.description', 'details.code', 'details.type.code', 'index', 'pitchData.nastyFactor',
                        'pitchData.zone', 'pitchNumber', 'type']

    i = 1  # running play event counter within the game
    for index, row in curr_plays_norm.iterrows():
        # Flatten the nested JSON in the playEvents column
        play_events = json_normalize(row['playEvents'])

        for play_events_idx, play_events_row in play_events.iterrows():
            game_dict = {}
            game_dict['gamepk'] = game
            # IDs take the form <gamePk>_<atBatIndex>_<event counter>
            game_dict['pitch_id'] = str(game) + '_' + str(row['about.atBatIndex']) + '_' + str(i)
            game_dict['prior_pitch'] = str(game) + '_' + str(row['about.atBatIndex']) + '_' + str(i - 1)

            # Copy each wanted column, filling in NaN when it is missing
            for col_all_plays in all_plays_cols:
                if col_all_plays in curr_plays_norm.columns:
                    game_dict[col_all_plays] = row[col_all_plays]
                else:
                    game_dict[col_all_plays] = np.nan
            for col_play_events in play_events_cols:
                if col_play_events in play_events.columns:
                    game_dict[col_play_events] = play_events_row[col_play_events]
                else:
                    game_dict[col_play_events] = np.nan

            list_for_final_df.append(game_dict)
            i += 1

## Data Organization and Cleaning

The list created above using the for loop is used to create ```each_pitch```, a data frame containing all of the raw play-by-play data.

In [60]:
each_pitch = pd.DataFrame(list_for_final_df)

These steps create a previous pitch column. Pitchers usually pitch in sequences, so having the previous pitch defined in the data should help the model understand the underlying patterns.

In [61]:
pitch_id_df = each_pitch[['pitch_id', 'details.type.code']].copy()

In [62]:
merged_df = pd.merge(each_pitch, pitch_id_df, how='left', left_on='prior_pitch', right_on='pitch_id')

In [63]:
each_pitch_merged = merged_df

In [64]:
each_pitch_merged = each_pitch_merged.rename({'pitch_id_y': 'previous_pitch_in_ab', 'details.type.code_y': 'previous_pitch_code'}, axis=1)

In [65]:
each_pitch_clean = each_pitch_merged.drop(['result.eventType', 'type', 'pitch_id_x', 'previous_pitch_in_ab', 'prior_pitch', 'details.ballColor'], axis=1)
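The self-merge above can be checked on toy data. The sketch below is a minimal illustration, assuming made-up IDs and pitch codes that follow the `<game>_<atBatIndex>_<counter>` pattern. The column name `pitch_code` is a hypothetical label introduced only for this example.

```python
import pandas as pd

# Toy pitch log; IDs and pitch codes are made up for illustration.
toy = pd.DataFrame({
    "pitch_id": ["1001_0_1", "1001_0_2", "1001_1_3", "1001_1_4"],
    "prior_pitch": ["1001_0_0", "1001_0_1", "1001_1_2", "1001_1_3"],
    "details.type.code": ["FF", "SL", "CH", "FF"],
})

# Lookup table: pitch ID -> pitch code
lookup = toy[["pitch_id", "details.type.code"]].copy()

# Left join: every pitch is kept; prior IDs with no match get NaN
toy_merged = pd.merge(toy, lookup, how="left",
                      left_on="prior_pitch", right_on="pitch_id")

# Columns present in both frames receive the _x (left) and _y (right) suffixes
toy_merged = toy_merged.rename(
    columns={"details.type.code_x": "pitch_code",  # hypothetical name
             "details.type.code_y": "previous_pitch_code"}
)
print(toy_merged[["pitch_id_x", "pitch_code", "previous_pitch_code"]])
```

The first pitch of each at-bat ends up with a missing `previous_pitch_code`, which matches the empty values in the notebook's later output.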

The original data has 10 different types of pitches. For the purposes of this model, it is only important to classify a pitch as a Fastball, Changeup, or Breaking Ball. Professional hitters don't necessarily need to know exactly what type of pitch is coming; they just need a high-level overview. All three of those pitch types have distinct characteristics, so the hitter will get the information they need.

In [66]:
pitch_dict = {'FF': 'Fastball'}

In [67]:
pitch_dict['FT'] = 'Fastball'
pitch_dict['FC'] = 'Fastball'
pitch_dict['FS'] = 'Fastball'
pitch_dict['CH'] = 'Changeup'
pitch_dict['SI'] = 'Fastball'
pitch_dict['CU'] = 'Breaking_Ball'
pitch_dict['SL'] = 'Breaking_Ball'
pitch_dict['KC'] = 'Breaking_Ball'
# This key only matches the literal string 'nan'; real NaN values stay NaN after .map()
pitch_dict['nan'] = 'NA'

In [68]:
each_pitch_clean['pitch_type'] = each_pitch_clean['details.type.code_x'].map(pitch_dict)

In [69]:
each_pitch_clean['prior_pitch_type'] = each_pitch_clean['previous_pitch_code'].map(pitch_dict)
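One way to sanity-check the mapping step is to look for codes that fall outside the dictionary, since `Series.map` leaves those as NaN. The sketch below is a minimal illustration on made-up codes. `toy_pitch_dict` repeats the grouping above as a single literal, and `"XX"` is a hypothetical code that is deliberately absent from it.

```python
import pandas as pd

# Same grouping as pitch_dict above, written as one literal
toy_pitch_dict = {
    "FF": "Fastball", "FT": "Fastball", "FC": "Fastball",
    "FS": "Fastball", "SI": "Fastball",
    "CH": "Changeup",
    "CU": "Breaking_Ball", "SL": "Breaking_Ball", "KC": "Breaking_Ball",
}

# Made-up codes; None stands for a non-pitch event, "XX" for an unknown code
codes = pd.Series(["FF", "SL", None, "XX", "CH"])
mapped = codes.map(toy_pitch_dict)

# Codes that were present in the data but missing from the dictionary
unmapped_codes = sorted(codes[mapped.isna() & codes.notna()].unique())
print(mapped.tolist())
print(unmapped_codes)
```

A non-empty `unmapped_codes` list would point to pitch types that need to be added to the dictionary before modeling.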

In [70]:
each_pitch_clean = each_pitch_clean.drop(['details.type.code_x', 'details.type.description', 'details.code', 'gamepk', 'index', 'matchup.batter.id'],axis=1)

### Read in dataframes used for merge

Pitchers determine what pitch to throw based on who they are facing. Power hitters tend to be thrown more breaking balls and fewer fastballs, whereas less intimidating hitters are usually thrown more fastballs. Player data was pulled for both pitchers and hitters and will be merged onto each pitch they were involved in. This should help the model understand the situation of the game more accurately.

In [71]:
hitter_df = pd.read_csv('public_data/hitter_data.csv')

In [72]:
pitcher_df = pd.read_csv('public_data/pitcher_data.csv')

In [73]:
main_df = each_pitch_clean

In [74]:
hitter_df.head(5)

Unnamed: 0,RK,PLAYER,TEAM,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,AVG,OBP,SLG,OPS,WAR
0,1.0,Kyle Gibson,MIN,2,2,2,0,0,0,0,0,0,0,0,1.0,1.0,1.0,2.0,0.1
1,,Enny Romero,KC/WSH/PIT,1,1,1,1,0,0,0,0,0,0,0,1.0,1.0,2.0,3.0,0.1
2,,Vidal Nuno,TB,2,0,2,0,0,0,1,0,0,0,0,1.0,1.0,1.0,2.0,0.1
3,,Derek Law,SF,1,1,1,0,0,0,0,0,0,0,0,1.0,1.0,1.0,2.0,0.1
4,,Randy Rosario,CHC,1,1,1,0,0,0,1,0,0,1,0,1.0,1.0,1.0,2.0,0.1


In [75]:
hitter_df = hitter_df[['PLAYER', 'SLG', 'OPS', 'WAR']]
hitter_df = hitter_df.rename(columns={'PLAYER': 'hitter'})
hitter_df.head(2)

Unnamed: 0,hitter,SLG,OPS,WAR
0,Kyle Gibson,1.0,2.0,0.1
1,Enny Romero,2.0,3.0,0.1


In [76]:
pitcher_df.head(5)

Unnamed: 0,RK,PLAYER,TEAM,GP,GS,IP,H,R,ER,BB,SO,W,L,SV,BLSV,WAR,WHIP,ERA
0,1.0,Kendrys Morales,TOR,1,0,1.0,0,0,0,1,0,0,0,0,0,0.0,1.0,0.0
1,,Mark Reynolds,WSH,1,0,0.1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
2,,Pablo Sandoval,SF,1,0,1.0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
3,,Danny Valencia,BAL,1,0,0.1,0,0,0,0,1,0,0,0,0,0.0,0.0,0.0
4,,Alex Avila,ARI,1,0,2.0,1,0,0,0,0,0,0,0,0,0.1,0.5,0.0


In [77]:
pitcher_df = pitcher_df[['PLAYER', 'WAR', 'WHIP', 'ERA', 'SO']]
pitcher_df = pitcher_df.rename(columns={'PLAYER': 'pitcher'})
pitcher_df.head(2)

Unnamed: 0,pitcher,WAR,WHIP,ERA,SO
0,Kendrys Morales,0.0,1.0,0.0,0
1,Mark Reynolds,0.0,0.0,0.0,0


In [78]:
main_df.head(5)

Unnamed: 0,about.atBatIndex,about.halfInning,about.inning,count.balls,count.strikes,details.call.code,details.call.description,details.description,matchup.batSide.code,matchup.batter.fullName,matchup.pitchHand.code,matchup.pitcher.fullName,matchup.pitcher.id,matchup.splits.menOnBase,pitchData.nastyFactor,pitchData.zone,pitchNumber,previous_pitch_code,pitch_type,prior_pitch_type
0,0,top,1,0.0,0.0,X,Hit Into Play - Out(s),"In play, run(s)",L,Ian Happ,R,Jose Urena,570632,Empty,32.89,6.0,1.0,,Fastball,
1,1,top,1,1.0,0.0,B,Ball - Called,Ball,R,Kris Bryant,R,Jose Urena,570632,Men_On,24.17,13.0,1.0,,Fastball,
2,1,top,1,2.0,0.0,B,Ball - Called,Ball,R,Kris Bryant,R,Jose Urena,570632,Men_On,29.02,13.0,2.0,FT,Fastball,Fastball
3,1,top,1,2.0,1.0,S,Strike - Swinging,Swinging Strike,R,Kris Bryant,R,Jose Urena,570632,Men_On,41.63,13.0,3.0,FT,Fastball,Fastball
4,1,top,1,3.0,1.0,B,Ball - Called,Ball,R,Kris Bryant,R,Jose Urena,570632,Men_On,59.33,13.0,4.0,FT,Changeup,Fastball


In [79]:
main_df = main_df.rename(columns={'matchup.batter.fullName': 'hitter', 'matchup.pitcher.fullName': 'pitcher'})

In [80]:
merged = pd.merge(hitter_df, main_df, on='hitter')

In [81]:
full_merge = pd.merge(pitcher_df, merged, on='pitcher')
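Both merges above use the default inner join, so pitches whose hitter or pitcher name is missing from the stats tables are dropped. Because both stats tables carry a `WAR` column, the second merge labels them `WAR_x` (from `pitcher_df`, the left frame) and `WAR_y` (from the hitter data). The sketch below is a minimal illustration on made-up frames of how explicit `suffixes` and the `validate` argument can make this easier to read and check. All names and numbers are invented, and the frame order is swapped relative to the cells above so that the pitch log is the left frame.

```python
import pandas as pd

# Toy frames; names and numbers are made up for illustration
toy_pitches = pd.DataFrame({
    "pitcher": ["Pitcher A", "Pitcher A", "Pitcher B"],
    "hitter": ["Hitter X", "Hitter Y", "Hitter X"],
    "pitch_type": ["Fastball", "Changeup", "Breaking_Ball"],
})
toy_hitters = pd.DataFrame({"hitter": ["Hitter X"], "WAR": [2.5]})
toy_pitchers = pd.DataFrame({"pitcher": ["Pitcher A", "Pitcher B"],
                             "WAR": [1.0, 3.0]})

# validate raises MergeError if a name appears more than once in the stats table
with_hitters = pd.merge(toy_pitches, toy_hitters, on="hitter",
                        how="inner", validate="many_to_one")

# Explicit suffixes label which WAR belongs to whom
toy_full = pd.merge(with_hitters, toy_pitchers, on="pitcher", how="inner",
                    validate="many_to_one", suffixes=("_hitter", "_pitcher"))
print(toy_full)
```

In this toy data the pitch thrown to "Hitter Y" disappears because that name has no row in the hitter table, which is the same effect the inner joins have on the real data.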

In [82]:
full_merge.head(10)

Unnamed: 0,pitcher,WAR_x,WHIP,ERA,SO,hitter,SLG,OPS,WAR_y,about.atBatIndex,about.halfInning,about.inning,count.balls,count.strikes,details.call.code,details.call.description,details.description,matchup.batSide.code,matchup.pitchHand.code,matchup.pitcher.id,matchup.splits.menOnBase,pitchData.nastyFactor,pitchData.zone,pitchNumber,previous_pitch_code,pitch_type,prior_pitch_type
0,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,1.0,0.0,B,Ball - Called,Ball,R,L,571521,Loaded,48.55,13.0,1.0,,Fastball,
1,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,2.0,0.0,B,Ball - Called,Ball,R,L,571521,Loaded,55.71,14.0,2.0,FF,Fastball,Fastball
2,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,2.0,1.0,S,Strike - Swinging,Foul,R,L,571521,Loaded,20.79,5.0,3.0,FF,Fastball,Fastball
3,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,3.0,1.0,B,Ball - Called,Ball,R,L,571521,Loaded,43.76,13.0,4.0,FF,Fastball,Fastball
4,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,4.0,1.0,B,Ball - Called,Ball In Dirt,R,L,571521,Loaded,27.87,13.0,5.0,FF,Fastball,Fastball
5,Rex Brothers,0.0,0.0,0.0,0,J.P. Crawford,0.393,0.712,-0.1,41,top,6,4.0,1.0,,,Mound Visit.,L,L,571521,Loaded,,,,,,
6,Rex Brothers,0.0,0.0,0.0,0,J.P. Crawford,0.393,0.712,-0.1,41,top,6,0.0,0.0,,,Pitching Change: Rex Brothers replaces Julio T...,L,L,571521,Loaded,,,,,,
7,Rex Brothers,0.0,0.0,0.0,0,J.P. Crawford,0.393,0.712,-0.1,41,top,6,0.0,1.0,S,Strike - Swinging,Called Strike,L,L,571521,Loaded,42.92,8.0,1.0,,Fastball,
8,Rex Brothers,0.0,0.0,0.0,0,J.P. Crawford,0.393,0.712,-0.1,41,top,6,1.0,1.0,B,Ball - Called,Ball,L,L,571521,Loaded,55.86,13.0,2.0,FF,Fastball,Fastball
9,Rex Brothers,0.0,0.0,0.0,0,J.P. Crawford,0.393,0.712,-0.1,41,top,6,1.0,2.0,S,Strike - Swinging,Foul,L,L,571521,Loaded,52.16,7.0,3.0,FF,Fastball,Fastball


### Feature Engineering

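In the cells below, the count is turned from two numerical values into one string value. The count of an At-Bat is a singular event and should be treated as such. Because the ball and strike columns are stored as floats, the resulting labels take the form ```1.0-0.0```.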

In [83]:
add_feats = full_merge.copy() # copy so the edits below do not also modify full_merge

In [84]:
add_feats['count.balls'] = add_feats['count.balls'].astype(str)

In [85]:
add_feats['count.strikes'] = add_feats['count.strikes'].astype(str)

In [86]:
add_feats['count'] = add_feats['count.balls'] + '-' + add_feats['count.strikes'] 

In [87]:
add_feats = add_feats.drop([ 'previous_pitch_code', 'details.call.code', 'count.balls', 'count.strikes'], axis=1)
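If shorter labels such as `1-0` are preferred, the float columns can be cast to integers before they are turned into strings. The sketch below is a minimal illustration on made-up counts, not the notebook's own method. It assumes there are no missing values in the two columns, since a plain integer cast fails on NaN.

```python
import pandas as pd

# Made-up counts for illustration, stored as floats like the columns above
toy_counts = pd.DataFrame({"count.balls": [1.0, 2.0, 3.0],
                           "count.strikes": [0.0, 1.0, 2.0]})

# Cast to int before str so the label reads "1-0" rather than "1.0-0.0".
# Assumption: no NaN values; rows with missing counts need handling first.
balls = toy_counts["count.balls"].astype(int).astype(str)
strikes = toy_counts["count.strikes"].astype(int).astype(str)
toy_counts["count"] = balls + "-" + strikes
print(toy_counts["count"].tolist())
```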

In [88]:
final_pitches = add_feats

In [89]:
final_pitches.head(5)

Unnamed: 0,pitcher,WAR_x,WHIP,ERA,SO,hitter,SLG,OPS,WAR_y,about.atBatIndex,about.halfInning,about.inning,details.call.description,details.description,matchup.batSide.code,matchup.pitchHand.code,matchup.pitcher.id,matchup.splits.menOnBase,pitchData.nastyFactor,pitchData.zone,pitchNumber,pitch_type,prior_pitch_type,count
0,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,Ball - Called,Ball,R,L,571521,Loaded,48.55,13.0,1.0,Fastball,,1.0-0.0
1,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,Ball - Called,Ball,R,L,571521,Loaded,55.71,14.0,2.0,Fastball,Fastball,2.0-0.0
2,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,Strike - Swinging,Foul,R,L,571521,Loaded,20.79,5.0,3.0,Fastball,Fastball,2.0-1.0
3,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,Ball - Called,Ball,R,L,571521,Loaded,43.76,13.0,4.0,Fastball,Fastball,3.0-1.0
4,Rex Brothers,0.0,0.0,0.0,0,Maikel Franco,0.467,0.78,0.2,42,top,6,Ball - Called,Ball In Dirt,R,L,571521,Loaded,27.87,13.0,5.0,Fastball,Fastball,4.0-1.0


The final pitch dataframe is exported as a CSV for ease of use in other notebooks.

In [90]:
# final_pitches.to_csv(r'raw_data/final_pitches.csv', index=False, sep=',', encoding='utf-8')
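The export call can be tried without touching the file system. The sketch below is a minimal illustration that writes a made-up one-row frame to an in-memory buffer with `index=False` and reads it back. The frame contents are invented for illustration.

```python
import io

import pandas as pd

# Made-up one-row frame standing in for final_pitches
toy_final = pd.DataFrame({"pitcher": ["Pitcher A"],
                          "count": ["1-0"],
                          "pitch_type": ["Fastball"]})

buffer = io.StringIO()
# index=False keeps the row index from being written as an extra column
toy_final.to_csv(buffer, index=False)

buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```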
