# Data Preparation

This notebook contains the code used to prepare the project's data.

The raw data is located in **/src/data/raw-data**, and the processed data should be saved to **/src/data/processed-data**.

## Initial Setup

In [1]:
import pandas as pd
import numpy as np

import pickle as pkl

from joblib import Parallel, delayed

In [2]:
# File paths for working locally
raw_data_path = '../data/raw-data/'
processed_data_path = '../data/processed-data/'

In [3]:
# # Uncomment this cell if running on Google Colab
# from google.colab import drive
# drive.mount('/content/drive/')
# raw_data_path = '/content/drive/MyDrive/datasets/mlb-player-digital-engagement-forecasting/'

In [4]:
dataset_names = {
    'Awards': 'awards.csv', 
    'Example': 'example_test.csv', 
    'Players': 'players.csv',
    'Seasons': 'seasons.csv', 
    'Teams': 'teams.csv', 
    'Train': 'train_updated.csv'
}
for key in dataset_names:
    dataset_names[key] = raw_data_path + dataset_names[key]
dataset_names

{'Awards': '../data/raw-data/awards.csv',
 'Example': '../data/raw-data/example_test.csv',
 'Players': '../data/raw-data/players.csv',
 'Seasons': '../data/raw-data/seasons.csv',
 'Teams': '../data/raw-data/teams.csv',
 'Train': '../data/raw-data/train_updated.csv'}
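
As a hedged alternative to concatenating strings, the same mapping can be built with `pathlib`. This is only a sketch: the names `raw_data_dir`, `file_names`, and `dataset_paths` are hypothetical and do not appear in the notebook, and only two of the files are listed to keep it short.

```python
from pathlib import Path

# Assumed location, mirroring the local raw_data_path used above.
raw_data_dir = Path("../data/raw-data")
file_names = {"Players": "players.csv", "Teams": "teams.csv"}

# Build full paths in a new dict instead of mutating the original mapping.
dataset_paths = {key: raw_data_dir / name for key, name in file_names.items()}
print(dataset_paths["Players"].as_posix())
```

`pd.read_csv` accepts `Path` objects, so the resulting values can be passed to it directly.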

## Loading the auxiliary Datasets

### Players

In [5]:
df_players = pd.read_csv(dataset_names['Players'])
df_players.head()

Unnamed: 0,playerId,playerName,DOB,mlbDebutDate,birthCity,birthStateProvince,birthCountry,heightInches,weight,primaryPositionCode,primaryPositionName,playerForTestSetAndFuturePreds
0,665482,Gilberto Celestino,1999-02-13,2021-06-02,Santo Domingo,,Dominican Republic,72,170,8,Outfielder,False
1,593590,Webster Rivas,1990-08-08,2021-05-28,Nagua,,Dominican Republic,73,219,3,First Base,True
2,661269,Vladimir Gutierrez,1995-09-18,2021-05-28,Havana,,Cuba,73,190,1,Pitcher,True
3,669212,Eli Morgan,1996-05-13,2021-05-28,Rancho Palos Verdes,CA,USA,70,190,1,Pitcher,True
4,666201,Alek Manoah,1998-01-09,2021-05-27,Homestead,FL,USA,78,260,1,Pitcher,True


In [6]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   playerId                        2061 non-null   int64 
 1   playerName                      2061 non-null   object
 2   DOB                             2061 non-null   object
 3   mlbDebutDate                    2025 non-null   object
 4   birthCity                       2061 non-null   object
 5   birthStateProvince              1516 non-null   object
 6   birthCountry                    2061 non-null   object
 7   heightInches                    2061 non-null   int64 
 8   weight                          2061 non-null   int64 
 9   primaryPositionCode             2061 non-null   object
 10  primaryPositionName             2061 non-null   object
 11  playerForTestSetAndFuturePreds  2057 non-null   object
dtypes: int64(3), object(9)
memory usage: 193.3+ KB


After analyzing the dataset, we removed the `playerName` column, since it is already represented by `playerId`, and the `birthStateProvince` column, because of the number of null values it contains (545 of 2061 rows).
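
One way to quantify that decision is to compute the share of missing values per column. The sketch below runs on a made-up toy frame, not on `players.csv`, and the 25% cutoff is a hypothetical threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the players table; the values are invented.
toy_players = pd.DataFrame({
    "playerId": [1, 2, 3, 4],
    "playerName": ["A", "B", "C", "D"],
    "birthStateProvince": ["CA", np.nan, np.nan, "FL"],
})

# Fraction of missing values per column, highest first.
null_share = toy_players.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns missing in more than 25% of rows.
threshold = 0.25
to_drop = null_share[null_share > threshold].index.tolist()
print(to_drop)
```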

In [7]:
# Select the columns to keep and rename them following the naming convention
cols = {
    'playerId': 'IdPlayer',
    'DOB': 'DtBirth',
    'mlbDebutDate': 'DtMlbDebut',
    'birthCity': 'NmCity',
    #'birthStateProvince': 'NmState',
    'birthCountry': 'NmCountry',
    'heightInches': 'NuHeight',
    'weight': 'NuWeight',
    'primaryPositionCode': 'CdPrimaryPosition',
    'primaryPositionName': 'NmPrimaryPosition',
    'playerForTestSetAndFuturePreds': 'FlgForTestAndPred'
}
df_players = df_players[list(cols)]
df_players.columns = list(cols.values())
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   IdPlayer           2061 non-null   int64 
 1   DtBirth            2061 non-null   object
 2   DtMlbDebut         2025 non-null   object
 3   NmCity             2061 non-null   object
 4   NmCountry          2061 non-null   object
 5   NuHeight           2061 non-null   int64 
 6   NuWeight           2061 non-null   int64 
 7   CdPrimaryPosition  2061 non-null   object
 8   NmPrimaryPosition  2061 non-null   object
 9   FlgForTestAndPred  2057 non-null   object
dtypes: int64(3), object(7)
memory usage: 161.1+ KB
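
The select-then-rename step can also be written as a single expression with `DataFrame.rename`. This is a minimal sketch on a two-row toy frame (values taken from the `head()` output above), with a shortened `cols` mapping.

```python
import pandas as pd

toy = pd.DataFrame({
    "playerId": [665482, 593590],
    "playerName": ["Gilberto Celestino", "Webster Rivas"],
    "heightInches": [72, 73],
})
cols = {"playerId": "IdPlayer", "heightInches": "NuHeight"}

# Select by the dict keys, then rename by name using the same mapping.
renamed = toy[list(cols)].rename(columns=cols)
print(list(renamed.columns))
```

`rename` returns a new DataFrame, so nothing is assigned to `.columns` on a slice of the original frame.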


In [8]:
# pd.to_pickle(df_players, processed_data_path + 'players.pkl')
del df_players

### Teams

In [9]:
df_teams = pd.read_csv(dataset_names['Teams'])
df_teams.head()

Unnamed: 0,id,name,teamName,teamCode,shortName,abbreviation,locationName,leagueId,leagueName,divisionId,divisionName,venueId,venueName
0,108,Los Angeles Angels,Angels,ana,LA Angels,LAA,Anaheim,103,American League,200,American League West,1,Angel Stadium
1,109,Arizona Diamondbacks,D-backs,ari,Arizona,ARI,Phoenix,104,National League,203,National League West,15,Chase Field
2,110,Baltimore Orioles,Orioles,bal,Baltimore,BAL,Baltimore,103,American League,201,American League East,2,Oriole Park at Camden Yards
3,111,Boston Red Sox,Red Sox,bos,Boston,BOS,Boston,103,American League,201,American League East,3,Fenway Park
4,112,Chicago Cubs,Cubs,chn,Chi Cubs,CHC,Chicago,104,National League,205,National League Central,17,Wrigley Field


In [10]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30 non-null     int64 
 1   name          30 non-null     object
 2   teamName      30 non-null     object
 3   teamCode      30 non-null     object
 4   shortName     30 non-null     object
 5   abbreviation  30 non-null     object
 6   locationName  30 non-null     object
 7   leagueId      30 non-null     int64 
 8   leagueName    30 non-null     object
 9   divisionId    30 non-null     int64 
 10  divisionName  30 non-null     object
 11  venueId       30 non-null     int64 
 12  venueName     30 non-null     object
dtypes: int64(4), object(9)
memory usage: 3.2+ KB


In [11]:
cols = {
     'id': 'IdTeam'
    ,'locationName': 'NmLocation'
    ,'leagueId': 'IdLeague'
    ,'divisionId': 'IdDivision'
    ,'venueId': 'IdVenue'
}
df_teams = df_teams[list(cols.keys())]
df_teams.columns = list(cols.values())
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   IdTeam      30 non-null     int64 
 1   NmLocation  30 non-null     object
 2   IdLeague    30 non-null     int64 
 3   IdDivision  30 non-null     int64 
 4   IdVenue     30 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


In [12]:
# pd.to_pickle(df_teams, processed_data_path + 'teams.pkl')
del df_teams

### Awards

In [13]:
df_awards = pd.read_csv(dataset_names['Awards'])
df_awards.head()

Unnamed: 0,awardDate,awardSeason,awardId,awardName,playerId,playerName,awardPlayerTeamId
0,2017-12-21,2017,WARRENSPAHN,Warren Spahn Award,477132,Clayton Kershaw,119.0
1,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,474319,Brandon Snyder,120.0
2,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,592530,Jose Marmolejos,120.0
3,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,593833,Wander Suero,120.0
4,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,600466,Raudy Read,120.0


In [14]:
df_awards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11256 entries, 0 to 11255
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   awardDate          11256 non-null  object 
 1   awardSeason        11256 non-null  int64  
 2   awardId            11256 non-null  object 
 3   awardName          11256 non-null  object 
 4   playerId           11256 non-null  int64  
 5   playerName         11256 non-null  object 
 6   awardPlayerTeamId  11243 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 615.7+ KB


In [15]:
cols = {
    'awardId': 'IdAward',
    'awardDate': 'DtAward',
    'awardSeason': 'DtAwardSeason',
    'playerId': 'IdPlayer',
    'awardPlayerTeamId': 'IdTeam'
}
df_awards = df_awards[list(cols)]
df_awards.columns = list(cols.values())
df_awards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11256 entries, 0 to 11255
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   IdAward        11256 non-null  object 
 1   DtAward        11256 non-null  object 
 2   DtAwardSeason  11256 non-null  int64  
 3   IdPlayer       11256 non-null  int64  
 4   IdTeam         11243 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 439.8+ KB


In [16]:
# pd.to_pickle(df_awards, processed_data_path + 'awards.pkl')
del df_awards

### Seasons

In [17]:
df_seasons = pd.read_csv(dataset_names['Seasons'])
df_seasons.head()

Unnamed: 0,seasonId,seasonStartDate,seasonEndDate,preSeasonStartDate,preSeasonEndDate,regularSeasonStartDate,regularSeasonEndDate,lastDate1stHalf,allStarDate,firstDate2ndHalf,postSeasonStartDate,postSeasonEndDate
0,2017,2017-04-02,2017-11-01,2017-02-22,2017-04-01,2017-04-02,2017-10-01,2017-07-09,2017-07-11,2017-07-14,2017-10-03,2017-11-01
1,2018,2018-03-29,2018-10-28,2018-02-21,2018-03-27,2018-03-29,2018-10-01,2018-07-15,2018-07-17,2018-07-19,2018-10-02,2018-10-28
2,2019,2019-03-20,2019-10-30,2019-02-21,2019-03-26,2019-03-20,2019-09-29,2019-07-07,2019-07-09,2019-07-11,2019-10-01,2019-10-30
3,2020,2020-07-23,2020-10-28,2020-02-21,2020-07-22,2020-07-23,2020-09-27,2020-08-25,,2020-08-26,2020-09-29,2020-10-28
4,2021,2021-02-28,2021-10-31,2021-02-28,2021-03-30,2021-04-01,2021-10-03,2021-07-11,2021-07-13,2021-07-15,2021-10-04,2021-10-31


In [18]:
df_seasons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seasonId                5 non-null      int64 
 1   seasonStartDate         5 non-null      object
 2   seasonEndDate           5 non-null      object
 3   preSeasonStartDate      5 non-null      object
 4   preSeasonEndDate        5 non-null      object
 5   regularSeasonStartDate  5 non-null      object
 6   regularSeasonEndDate    5 non-null      object
 7   lastDate1stHalf         5 non-null      object
 8   allStarDate             4 non-null      object
 9   firstDate2ndHalf        5 non-null      object
 10  postSeasonStartDate     5 non-null      object
 11  postSeasonEndDate       5 non-null      object
dtypes: int64(1), object(11)
memory usage: 608.0+ bytes


We chose not to use the Seasons dataframe. It would only add complexity to the model, and because it contains nothing but the start and end dates of predefined season events, the features it adds would probably contribute little.

In [19]:
del df_seasons

## Loading the Train dataset

In [20]:
%%time
df_train = pd.read_csv(dataset_names['Train'])

CPU times: user 50.9 s, sys: 7.92 s, total: 58.8 s
Wall time: 1min 11s


### Dataset info

In [21]:
df_train.head()

Unnamed: 0,date,nextDayPlayerEngagement,games,rosters,playerBoxScores,teamBoxScores,transactions,standings,awards,events,playerTwitterFollowers,teamTwitterFollowers
0,20180101,"[{""engagementMetricsDate"":""2018-01-02"",""player...",,"[{""playerId"":400121,""gameDate"":""2018-01-01"",""t...",,,"[{""transactionId"":340732,""playerId"":547348,""pl...",,,,"[{""date"":""2018-01-01"",""playerId"":545361,""playe...","[{""date"":""2018-01-01"",""teamId"":147,""teamName"":..."
1,20180102,"[{""engagementMetricsDate"":""2018-01-03"",""player...",,"[{""playerId"":134181,""gameDate"":""2018-01-02"",""t...",,,"[{""transactionId"":339458,""playerId"":621173,""pl...",,,,,
2,20180103,"[{""engagementMetricsDate"":""2018-01-04"",""player...",,"[{""playerId"":425492,""gameDate"":""2018-01-03"",""t...",,,"[{""transactionId"":347527,""playerId"":572389,""pl...",,,,,
3,20180104,"[{""engagementMetricsDate"":""2018-01-05"",""player...",,"[{""playerId"":282332,""gameDate"":""2018-01-04"",""t...",,,"[{""transactionId"":339549,""playerId"":545343,""pl...",,,,,
4,20180105,"[{""engagementMetricsDate"":""2018-01-06"",""player...",,"[{""playerId"":282332,""gameDate"":""2018-01-05"",""t...",,,"[{""transactionId"":341195,""playerId"":628336,""pl...",,,,,


In [22]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308 entries, 0 to 1307
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   date                     1308 non-null   int64 
 1   nextDayPlayerEngagement  1308 non-null   object
 2   games                    729 non-null    object
 3   rosters                  1307 non-null   object
 4   playerBoxScores          627 non-null    object
 5   teamBoxScores            627 non-null    object
 6   transactions             1194 non-null   object
 7   standings                623 non-null    object
 8   awards                   309 non-null    object
 9   events                   624 non-null    object
 10  playerTwitterFollowers   43 non-null     object
 11  teamTwitterFollowers     43 non-null     object
dtypes: int64(1), object(11)
memory usage: 122.8+ KB


In [23]:
df_train['nextDayPlayerEngagement'][0][:1000]

'[{"engagementMetricsDate":"2018-01-02","playerId":628317,"target1":0.011167070542384616,"target2":4.4747081712062258,"target3":0.0051677297424994094,"target4":5.7352941176470589},{"engagementMetricsDate":"2018-01-02","playerId":547989,"target1":0.042993221588180773,"target2":5.5933852140077827,"target3":0.045033073470351993,"target4":2.7941176470588238},{"engagementMetricsDate":"2018-01-02","playerId":519317,"target1":0.97432690482305784,"target2":56.177042801556418,"target3":13.693745570517363,"target4":64.166666666666671},{"engagementMetricsDate":"2018-01-02","playerId":607625,"target1":0.0067002423254307695,"target2":2.6750972762645913,"target3":0.0051677297424994094,"target4":1.8627450980392157},{"engagementMetricsDate":"2018-01-02","playerId":592547,"target1":0.0011167070542384616,"target2":0.632295719844358,"target3":0.0029529884242853769,"target4":0.93137254901960786},{"engagementMetricsDate":"2018-01-02","playerId":641553,"target1":0.011725424069503847,"target2":3.842412451361

### Auxiliary functions

In [24]:
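from io import StringIO

def unpack_json(json_str):
    # Missing days become an empty frame; otherwise parse the JSON records.
    # Wrapping in StringIO avoids the literal-string deprecation in newer pandas.
    return pd.DataFrame() if pd.isna(json_str) else pd.read_json(StringIO(json_str))

def unpack_data(data, dfs=None, n_jobs=-1):
    # Optionally restrict the work to the requested nested columns
    if dfs is not None:
        data = data.loc[:, dfs]
    unnested_dfs = {}
    # items() replaces iteritems(), which was removed in pandas 2.0
    for name, column in data.items():
        daily_dfs = Parallel(n_jobs=n_jobs)(
            delayed(unpack_json)(item) for date, item in column.items())
        df = pd.concat(daily_dfs)
        unnested_dfs[name] = df
    return unnested_dfs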
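
To make the unpacking idea concrete, here is a minimal sequential sketch on a toy frame. It is not the notebook's function: `unpack_column` and `toy_train` are hypothetical names, the JSON values are invented, and `joblib` is left out so the example depends on pandas alone.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Each cell holds one day's records as a JSON string, or NaN when empty.
toy_train = pd.DataFrame({
    "date": [20180101, 20180102],
    "games": [
        '[{"gamePk": 1, "homeScore": 7}, {"gamePk": 2, "homeScore": 3}]',
        np.nan,
    ],
})

def unpack_column(column):
    # Parse every non-missing cell into a frame, then stack the daily frames.
    frames = [
        pd.read_json(StringIO(item)) for item in column if not pd.isna(item)
    ]
    return pd.concat(frames, ignore_index=True)

games = unpack_column(toy_train["games"])
print(games.shape)
```

Note that `pd.concat` raises on an empty list, so a column with no data at all would need a guard in this sketch.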

### Making the targets df

In [25]:
%%time

Y = unpack_data(df_train, dfs = ['nextDayPlayerEngagement'])['nextDayPlayerEngagement']
Y = Y.astype({name: np.float32 for name in ["target1", "target2", "target3", "target4"]})
# Match target dates to feature dates and create date index
Y = Y.rename(columns={'engagementMetricsDate': 'date'})
Y['date'] = pd.to_datetime(Y['date'])
Y = Y.set_index('date').to_period('D')
Y.index = Y.index - 1
Y = Y.reset_index()


CPU times: user 4.43 s, sys: 1.72 s, total: 6.15 s
Wall time: 10.7 s
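
The date alignment in this cell can be checked on a small example. The sketch below uses a toy frame (`toy_y` is a hypothetical name with partly invented values) and shows that subtracting 1 from a daily `PeriodIndex` moves each label back one day. It also shows an alternative that subtracts a one-day `Timedelta` from a datetime column.

```python
import pandas as pd

toy_y = pd.DataFrame({
    "engagementMetricsDate": ["2018-01-02", "2018-01-03"],
    "playerId": [628317, 628317],
    "target1": [0.011167, 0.5],
})
toy_y["date"] = pd.to_datetime(toy_y["engagementMetricsDate"])

# Same idea as the cell above: daily periods minus 1 shift back by one day.
shifted = toy_y.set_index("date").to_period("D")
shifted.index = shifted.index - 1
shifted = shifted.reset_index()

# Alternative on a datetime column: subtract a one-day Timedelta.
alt = toy_y["date"] - pd.Timedelta(days=1)
print([str(p) for p in shifted["date"]])
```

Either route labels each target row with the feature date of the previous day.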


In [26]:
Y.head()

Unnamed: 0,date,playerId,target1,target2,target3,target4
0,2018-01-01,628317,0.011167,4.474708,0.005168,5.735294
1,2018-01-01,547989,0.042993,5.593385,0.045033,2.794118
2,2018-01-01,519317,0.974327,56.177044,13.693746,64.166664
3,2018-01-01,607625,0.0067,2.675097,0.005168,1.862745
4,2018-01-01,592547,0.001117,0.632296,0.002953,0.931373


In [27]:
Y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2695788 entries, 0 to 2695787
Data columns (total 6 columns):
 #   Column    Dtype    
---  ------    -----    
 0   date      period[D]
 1   playerId  int64    
 2   target1   float32  
 3   target2   float32  
 4   target3   float32  
 5   target4   float32  
dtypes: float32(4), int64(1), period[D](1)
memory usage: 82.3 MB


In [28]:
cols_Y = {
    'date': 'Dt',
    'playerId': 'IdPlayer',
    'target1': 'target1',
    'target2': 'target2',
    'target3': 'target3',
    'target4': 'target4'
}
Y = Y[list(cols_Y)]
Y.columns = list(cols_Y.values())
Y['Dt'] = Y['Dt'].astype('datetime64[ns]')
Y.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2695788 entries, 0 to 2695787
Data columns (total 6 columns):
 #   Column    Dtype         
---  ------    -----         
 0   Dt        datetime64[ns]
 1   IdPlayer  int64         
 2   target1   float32       
 3   target2   float32       
 4   target3   float32       
 5   target4   float32       
dtypes: datetime64[ns](1), float32(4), int64(1)
memory usage: 82.3 MB


In [29]:
pd.to_pickle(Y, processed_data_path + 'targets.pkl')
del Y
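
A quick round trip shows why pickle is a convenient format here: dtypes such as `float32` and `datetime64` survive the save and load. This is a sketch on a toy frame written to a temporary directory, not to `processed_data_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_targets = pd.DataFrame({
    "Dt": pd.to_datetime(["2018-01-01", "2018-01-01"]),
    "IdPlayer": [628317, 547989],
    "target1": pd.Series([0.011167, 0.042993], dtype="float32"),
})

# Write to a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "targets.pkl"
    toy_targets.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.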

### Player Box Scores

In [30]:
%%time
df_playerBoxScores = unpack_data(df_train, dfs = ['playerBoxScores'])['playerBoxScores']

CPU times: user 2.68 s, sys: 285 ms, total: 2.96 s
Wall time: 8.27 s


In [31]:
df_playerBoxScores.head()

Unnamed: 0,home,gamePk,gameDate,gameTimeUTC,teamId,teamName,playerId,playerName,jerseyNum,positionCode,...,catchersInterferencePitching,sacBuntsPitching,sacFliesPitching,saves,holds,blownSaves,assists,putOuts,errors,chances
0,1,529418,2018-03-29,2018-03-29T23:08:00Z,119,Los Angeles Dodgers,605131,Austin Barnes,15,12,...,,,,,,,,,,
1,1,529406,2018-03-29,2018-03-29T20:00:00Z,139,Tampa Bay Rays,605480,Mallex Smith,0,7,...,,,,,,,0.0,0.0,0.0,0.0
2,0,529416,2018-03-29,2018-03-29T20:10:00Z,143,Philadelphia Phillies,546318,Odubel Herrera,37,8,...,,,,,,,0.0,0.0,0.0,0.0
3,0,529412,2018-03-29,2018-03-29T20:05:00Z,108,Los Angeles Angels,527043,Jefry Marte,19,3,...,,,,,,,0.0,1.0,0.0,1.0
4,1,529408,2018-03-29,2018-03-29T20:15:00Z,118,Kansas City Royals,449181,Paulo Orlando,16,8,...,,,,,,,0.0,2.0,0.0,2.0


In [32]:
df_playerBoxScores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 219727 entries, 0 to 451
Data columns (total 85 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   home                          219727 non-null  int64  
 1   gamePk                        219727 non-null  int64  
 2   gameDate                      219727 non-null  object 
 3   gameTimeUTC                   219727 non-null  object 
 4   teamId                        219727 non-null  int64  
 5   teamName                      219727 non-null  object 
 6   playerId                      219727 non-null  int64  
 7   playerName                    219727 non-null  object 
 8   jerseyNum                     219690 non-null  object 
 9   positionCode                  219727 non-null  int64  
 10  positionName                  219727 non-null  object 
 11  positionType                  219727 non-null  object 
 12  battingOrder                  183390 non-null  

In [33]:
cols = {
    # columns related to other dimensions
    'gamePk': 'IdGame',
    'gameDate': 'DtGame',
    'gameTimeUTC': 'DtGameUTC',
    'playerId': 'IdPlayer',
    'teamId': 'IdTeam',
    'jerseyNum': 'NuJersey',
    'positionCode': 'CdPosition',
    # suggested column
    'strikeOutsPitching': 'NuStrikeOutsPitching',
}  
# numeric columns
for numeric_col in list(df_playerBoxScores.columns[12:]):
    # skip the pitching columns because of their large number of NaN values
    if 'Pitching' not in numeric_col:
        cols[numeric_col] = 'Nu' + numeric_col[0].upper() + numeric_col[1:]
print(cols)

{'gamePk': 'IdGame', 'gameDate': 'DtGame', 'gameTimeUTC': 'DtGameUTC', 'playerId': 'IdPlayer', 'teamId': 'IdTeam', 'jerseyNum': 'NuJersey', 'positionCode': 'CdPosition', 'strikeOutsPitching': 'NuStrikeOutsPitching', 'battingOrder': 'NuBattingOrder', 'gamesPlayedBatting': 'NuGamesPlayedBatting', 'flyOuts': 'NuFlyOuts', 'groundOuts': 'NuGroundOuts', 'runsScored': 'NuRunsScored', 'doubles': 'NuDoubles', 'triples': 'NuTriples', 'homeRuns': 'NuHomeRuns', 'strikeOuts': 'NuStrikeOuts', 'baseOnBalls': 'NuBaseOnBalls', 'intentionalWalks': 'NuIntentionalWalks', 'hits': 'NuHits', 'hitByPitch': 'NuHitByPitch', 'atBats': 'NuAtBats', 'caughtStealing': 'NuCaughtStealing', 'stolenBases': 'NuStolenBases', 'groundIntoDoublePlay': 'NuGroundIntoDoublePlay', 'groundIntoTriplePlay': 'NuGroundIntoTriplePlay', 'plateAppearances': 'NuPlateAppearances', 'totalBases': 'NuTotalBases', 'rbi': 'NuRbi', 'leftOnBase': 'NuLeftOnBase', 'sacBunts': 'NuSacBunts', 'sacFlies': 'NuSacFlies', 'catchersInterference': 'NuCatch

In [34]:
df_playerBoxScores = df_playerBoxScores[list(cols)]
df_playerBoxScores.columns = list(cols.values())
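
The naming rule used in the loop above can be isolated in a small function. This is a sketch only: `to_numeric_name` and `source_columns` are hypothetical names, and the list holds just a few of the real column names.

```python
def to_numeric_name(column_name):
    """Prefix with 'Nu' and upper-case only the first character."""
    return "Nu" + column_name[0].upper() + column_name[1:]

source_columns = ["battingOrder", "homeRuns", "rbi", "sacBuntsPitching"]

# Skip pitching columns, mirroring the filter in the loop above.
mapping = {
    name: to_numeric_name(name)
    for name in source_columns
    if "Pitching" not in name
}
print(mapping)
```

Slicing is used instead of `str.capitalize()` because `capitalize()` would lower-case the rest of the name and turn `battingOrder` into `Battingorder`.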

In [35]:
pd.to_pickle(df_playerBoxScores, processed_data_path + 'playerBoxScores.pkl')
del df_playerBoxScores

### Games

In [36]:
%%time
df_games = unpack_data(df_train, dfs = ['games'])['games']

CPU times: user 2.52 s, sys: 20.7 ms, total: 2.54 s
Wall time: 4.65 s


In [37]:
df_games.head()

Unnamed: 0,gamePk,gameType,season,gameDate,gameTimeUTC,resumeDate,resumedFrom,codedGameState,detailedGameState,isTie,...,homeWinner,homeScore,awayId,awayName,awayAbbrev,awayWins,awayLosses,awayWinPct,awayWinner,awayScore
0,533782,E,2018,2018-02-21,2018-02-21T20:10:00Z,,,F,Final,False,...,True,7.0,5035,Arizona State Sun Devils,ASU,0.0,1.0,0.0,False,2.0
0,534461,E,2018,2018-02-22,2018-02-22T18:05:00Z,,,F,Final,False,...,True,6.0,228,Florida Southern College Mocs,FSC,0.0,1.0,0.0,False,1.0
1,545334,E,2018,2018-02-22,2018-02-22T18:05:00Z,,,F,Final,False,...,True,6.0,231,University of Tampa Spartans,UT,0.0,1.0,0.0,False,0.0
2,547295,E,2018,2018-02-22,2018-02-22T03:33:00Z,,,F,Final,False,...,True,4.0,227,Boston College Eagles,BC,0.0,1.0,0.0,False,2.0
3,533784,E,2018,2018-02-22,2018-02-22T23:05:00Z,,,F,Final,False,...,True,2.0,4864,Minnesota Gophers,UM,0.0,1.0,0.0,False,1.0


In [38]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9055 entries, 0 to 14
Data columns (total 32 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gamePk             9055 non-null   int64  
 1   gameType           9055 non-null   object 
 2   season             9055 non-null   int64  
 3   gameDate           9055 non-null   object 
 4   gameTimeUTC        9055 non-null   object 
 5   resumeDate         14 non-null     object 
 6   resumedFrom        18 non-null     object 
 7   codedGameState     9055 non-null   object 
 8   detailedGameState  9055 non-null   object 
 9   isTie              8672 non-null   object 
 10  gameNumber         9055 non-null   int64  
 11  doubleHeader       9055 non-null   object 
 12  dayNight           9055 non-null   object 
 13  scheduledInnings   9055 non-null   int64  
 14  gamesInSeries      9052 non-null   float64
 15  seriesDescription  9055 non-null   object 
 16  homeId             9055 no

In [39]:
cols = {
    'gamePk': 'IdGame',
    'gameType': 'CdGameType',
    'season': 'IdSeason',
    'gameDate': 'DtGame',
    'detailedGameState': 'NmGameState',
    'isTie': 'FlgIsTie',
    'gameNumber': 'NuGameNumber',
    'doubleHeader': 'CdDoubleHeader',
    'dayNight': 'NmDayNight',
    'scheduledInnings': 'NuScheduledInnings',
    'gamesInSeries': 'NuGamesInSeries',
    'seriesDescription': 'NmSeriesDescription',
    'homeId': 'IdHomeTeam',
    'homeWins': 'NuHomeWins',
    'homeLosses': 'NuHomeLosses',
    'homeWinPct': 'NuPctHomeWin',
    'homeScore': 'NuHomeScore',
    'awayId': 'IdAwayTeam',
    'awayWins': 'NuAwayWins',
    'awayLosses': 'NuAwayLosses',
    'awayWinPct': 'NuPctAwayWin',
    'awayScore': 'NuAwayScore',
    'homeWinner': 'FlgHomeWinner',
}
df_games = df_games[list(cols.keys())]
df_games.columns = list(cols.values())
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9055 entries, 0 to 14
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   IdGame               9055 non-null   int64  
 1   CdGameType           9055 non-null   object 
 2   IdSeason             9055 non-null   int64  
 3   DtGame               9055 non-null   object 
 4   NmGameState          9055 non-null   object 
 5   FlgIsTie             8672 non-null   object 
 6   NuGameNumber         9055 non-null   int64  
 7   CdDoubleHeader       9055 non-null   object 
 8   NmDayNight           9055 non-null   object 
 9   NuScheduledInnings   9055 non-null   int64  
 10  NuGamesInSeries      9052 non-null   float64
 11  NmSeriesDescription  9055 non-null   object 
 12  IdHomeTeam           9055 non-null   int64  
 13  NuHomeWins           9055 non-null   int64  
 14  NuHomeLosses         9055 non-null   int64  
 15  NuPctHomeWin         9055 non-null   flo

In [40]:
pd.to_pickle(df_games, processed_data_path + 'games.pkl')
del df_games
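
The `head()` output above shows repeated index labels (0, 0, 1, 2, 3) because each daily frame keeps its own 0-based index when concatenated. If unique row labels are wanted, one option is `ignore_index=True`. The sketch below uses two toy daily frames and is not part of the original pipeline.

```python
import pandas as pd

day1 = pd.DataFrame({"gamePk": [533782]})
day2 = pd.DataFrame({"gamePk": [534461, 545334]})

# Default concat keeps each frame's own labels, so they repeat.
stacked = pd.concat([day1, day2])
print(stacked.index.tolist())

# ignore_index=True assigns a fresh 0..n-1 index instead.
unique = pd.concat([day1, day2], ignore_index=True)
print(unique.index.tolist())
```

Calling `reset_index(drop=True)` on an already concatenated frame gives the same result.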

### Rosters

In [41]:
%%time
df_rosters = unpack_data(df_train, dfs = ['rosters'])['rosters']

CPU times: user 3 s, sys: 297 ms, total: 3.29 s
Wall time: 5.29 s


In [42]:
df_rosters.head()

Unnamed: 0,playerId,gameDate,teamId,statusCode,status
0,400121,2018-01-01,116,A,Active
1,408045,2018-01-01,142,A,Active
2,425492,2018-01-01,120,A,Active
3,429664,2018-01-01,136,A,Active
4,431151,2018-01-01,121,A,Active


In [43]:
df_rosters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1610288 entries, 0 to 1347
Data columns (total 5 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   playerId    1610288 non-null  int64 
 1   gameDate    1610288 non-null  object
 2   teamId      1610288 non-null  int64 
 3   statusCode  1610288 non-null  object
 4   status      1610288 non-null  object
dtypes: int64(2), object(3)
memory usage: 73.7+ MB


In [44]:
cols = {
    'gameDate': 'DtGame',
    'playerId': 'IdPlayer',
    'teamId': 'IdTeam',
    'status': 'NmStatus',
    'statusCode': 'CdStatus'
}
df_rosters = df_rosters[list(cols.keys())]
df_rosters.columns = list(cols.values())
df_rosters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1610288 entries, 0 to 1347
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   DtGame    1610288 non-null  object
 1   IdPlayer  1610288 non-null  int64 
 2   IdTeam    1610288 non-null  int64 
 3   NmStatus  1610288 non-null  object
 4   CdStatus  1610288 non-null  object
dtypes: int64(2), object(3)
memory usage: 73.7+ MB


In [45]:
pd.to_pickle(df_rosters, processed_data_path + 'rosters.pkl')
del df_rosters

### Team Box Scores

In [46]:
%%time
df_teamBoxScores = unpack_data(df_train, dfs = ['teamBoxScores'])['teamBoxScores']

CPU times: user 1.3 s, sys: 21.6 ms, total: 1.32 s
Wall time: 3.5 s


In [47]:
df_teamBoxScores.head()

Unnamed: 0,home,teamId,gamePk,gameDate,gameTimeUTC,flyOuts,groundOuts,runsScored,doubles,triples,...,hitBatsmen,balks,wildPitches,pickoffsPitching,rbiPitching,inheritedRunners,inheritedRunnersScored,catchersInterferencePitching,sacBuntsPitching,sacFliesPitching
0,1,109,529410,2018-03-29,2018-03-30T02:10:00Z,4,9,8,2,1,...,0,0,0,0,2,0,0,0,1,0
1,0,114,529409,2018-03-29,2018-03-30T02:10:00Z,4,9,1,1,0,...,0,0,0,0,2,0,0,0,0,0
2,1,121,529419,2018-03-29,2018-03-29T17:10:00Z,2,10,9,2,0,...,0,0,0,0,4,0,0,0,0,0
3,1,139,529406,2018-03-29,2018-03-29T20:00:00Z,2,6,6,1,1,...,0,0,0,0,4,0,0,0,0,0
4,1,140,529411,2018-03-29,2018-03-29T19:35:00Z,9,4,1,1,0,...,0,0,0,0,4,0,0,0,0,1


In [48]:
df_teamBoxScores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14892 entries, 0 to 29
Data columns (total 57 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   home                          14892 non-null  int64  
 1   teamId                        14892 non-null  int64  
 2   gamePk                        14892 non-null  int64  
 3   gameDate                      14892 non-null  object 
 4   gameTimeUTC                   14892 non-null  object 
 5   flyOuts                       14892 non-null  int64  
 6   groundOuts                    14892 non-null  int64  
 7   runsScored                    14892 non-null  int64  
 8   doubles                       14892 non-null  int64  
 9   triples                       14892 non-null  int64  
 10  homeRuns                      14892 non-null  int64  
 11  strikeOuts                    14892 non-null  int64  
 12  baseOnBalls                   14892 non-null  int64  
 13  inte

In [49]:
# cols = {

# }
# df_teamBoxScores = df_teamBoxScores[list(cols.keys())]
# df_teamBoxScores.columns = list(cols.values())
# df_teamBoxScores.info()

In [50]:
pd.to_pickle(df_teamBoxScores, processed_data_path + 'teamBoxScores.pkl')
del df_teamBoxScores

### Transactions

In [51]:
%%time
df_transactions = unpack_data(df_train, dfs = ['transactions'])['transactions']

CPU times: user 2.37 s, sys: 70.3 ms, total: 2.44 s
Wall time: 4.93 s


In [52]:
df_transactions.head()

Unnamed: 0,transactionId,playerId,playerName,date,fromTeamId,fromTeamName,toTeamId,toTeamName,effectiveDate,resolutionDate,typeCode,typeDesc,description
0,340732,547348.0,C.C. Lee,2018-01-01,,,119,Los Angeles Dodgers,2018-01-01,2018-01-01,SFA,Signed as Free Agent,Los Angeles Dodgers signed free agent RHP C.C....
0,339458,621173.0,Dylan Baker,2018-01-02,158.0,Milwaukee Brewers,119,Los Angeles Dodgers,2018-03-20,,TR,Trade,Milwaukee Brewers traded RHP Dylan Baker to Lo...
1,357292,678876.0,Angel Rojas,2018-01-02,,,147,New York Yankees,2018-01-02,2018-01-02,SFA,Signed as Free Agent,New York Yankees signed free agent SS Angel Ro...
2,341123,607054.0,Jace Peterson,2018-01-02,,,147,New York Yankees,2018-01-02,2018-01-02,SFA,Signed as Free Agent,New York Yankees signed free agent 2B Jace Pet...
3,339458,,,2018-01-02,119.0,Los Angeles Dodgers,158,Milwaukee Brewers,2018-01-02,,TR,Trade,Milwaukee Brewers traded RHP Dylan Baker to Lo...


In [53]:
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45420 entries, 0 to 81
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   transactionId   45420 non-null  int64         
 1   playerId        44997 non-null  float64       
 2   playerName      44995 non-null  object        
 3   date            45420 non-null  datetime64[ns]
 4   fromTeamId      16618 non-null  float64       
 5   fromTeamName    16618 non-null  object        
 6   toTeamId        45420 non-null  int64         
 7   toTeamName      45420 non-null  object        
 8   effectiveDate   45420 non-null  object        
 9   resolutionDate  30328 non-null  object        
 10  typeCode        45420 non-null  object        
 11  typeDesc        45420 non-null  object        
 12  description     45401 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(8)
memory usage: 4.9+ MB


In [54]:
cols = {
    'transactionId': 'IdTransaction',
    'playerId': 'IdPlayer',
    'date': 'DtTransaction',
    'fromTeamId': 'IdFromTeam',
    'toTeamId': 'IdToTeam',
    'typeDesc': 'NmType'
}
df_transactions = df_transactions[list(cols.keys())]
df_transactions.columns = list(cols.values())
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45420 entries, 0 to 81
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   IdTransaction  45420 non-null  int64         
 1   IdPlayer       44997 non-null  float64       
 2   DtTransaction  45420 non-null  datetime64[ns]
 3   IdFromTeam     16618 non-null  float64       
 4   IdToTeam       45420 non-null  int64         
 5   NmType         45420 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(1)
memory usage: 2.4+ MB
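
The select-then-rename pattern used above recurs for several datasets in this notebook. A minimal sketch of how it could be wrapped in a helper follows. The name `select_and_rename` and the toy frame are hypothetical and only serve as illustration.

```python
import pandas as pd

def select_and_rename(df: pd.DataFrame, cols: dict) -> pd.DataFrame:
    """Keep only the columns listed in `cols` and rename them (hypothetical helper)."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        # Fail early with a readable message instead of a bare KeyError
        raise KeyError(f"columns not found: {missing}")
    return df[list(cols)].rename(columns=cols)

# Toy frame standing in for df_transactions
toy = pd.DataFrame({
    'transactionId': [1, 2],
    'playerId': [10.0, None],
    'description': ['a', 'b'],
})
out = select_and_rename(toy, {'transactionId': 'IdTransaction', 'playerId': 'IdPlayer'})
```

The early check on missing columns makes a typo in the mapping easier to spot than the default error raised by the column selection.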


In [55]:
pd.to_pickle(df_transactions, processed_data_path + 'transactions.pkl')
del df_transactions
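
`IdPlayer` and `IdFromTeam` are stored as `float64` above because they contain missing values. If integer identifiers are preferred, pandas' nullable `Int64` dtype is one option. The sketch below runs on toy data and is a suggestion, not something this notebook applies before saving.

```python
import pandas as pd

# Toy frame with the same kind of gaps as the transactions data
toy = pd.DataFrame({
    'IdPlayer': [678876.0, 607054.0, None],
    'IdFromTeam': [None, None, 119.0],
})

# 'Int64' (capital I) is the nullable integer dtype: missing values become <NA>
id_cols = ['IdPlayer', 'IdFromTeam']
toy = toy.astype({c: 'Int64' for c in id_cols})
```

Integer ids make later joins against integer key columns such as `playerId` more predictable than comparing floats.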

### Standings

In [56]:
%%time
df_standings = unpack_data(df_train, dfs=['standings'])['standings']

CPU times: user 1.81 s, sys: 55.4 ms, total: 1.86 s
Wall time: 3.97 s


In [57]:
df_standings.head()

Unnamed: 0,season,gameDate,divisionId,teamId,teamName,streakCode,divisionRank,leagueRank,wildCardRank,leagueGamesBack,...,grassLosses,turfWins,turfLosses,divWins,divLosses,alWins,alLosses,nlWins,nlLosses,xWinLossPct
0,2018,2018-03-29,205,112,Chicago Cubs,W1,1,3,3.0,-,...,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0
1,2018,2018-03-29,204,146,Miami Marlins,L1,4,12,12.0,1.0,...,1,0,0,0,0,0.0,1.0,0.0,1.0,0.0
2,2018,2018-03-29,204,121,New York Mets,W1,2,5,5.0,-,...,0,0,0,0,0,1.0,0.0,1.0,0.0,1.0
3,2018,2018-03-29,200,140,Texas Rangers,L1,5,14,13.0,1.0,...,1,0,0,0,1,0.0,0.0,0.0,0.0,0.0
4,2018,2018-03-29,204,144,Atlanta Braves,W1,1,2,2.0,-,...,0,0,0,1,0,0.0,0.0,0.0,0.0,0.0


In [58]:
df_standings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18690 entries, 0 to 29
Data columns (total 47 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   season                     18690 non-null  int64  
 1   gameDate                   18690 non-null  object 
 2   divisionId                 18690 non-null  int64  
 3   teamId                     18690 non-null  int64  
 4   teamName                   18690 non-null  object 
 5   streakCode                 18593 non-null  object 
 6   divisionRank               18690 non-null  int64  
 7   leagueRank                 18690 non-null  int64  
 8   wildCardRank               14744 non-null  float64
 9   leagueGamesBack            18690 non-null  object 
 10  sportGamesBack             18690 non-null  object 
 11  divisionGamesBack          18690 non-null  object 
 12  wins                       18690 non-null  int64  
 13  losses                     18690 non-null  int64 

In [59]:
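# TODO: choose which standings columns to keep and their new names.
# Until the mapping is filled in, all 47 columns are saved unchanged.
# cols = {

# }
# df_standings = df_standings[list(cols.keys())]
# df_standings.columns = list(cols.values())
# df_standings.info()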

In [60]:
pd.to_pickle(df_standings, processed_data_path + 'standings.pkl')
del df_standings
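
The `*GamesBack` columns above are stored as `object` because they mix numbers with the placeholder `'-'`. The sketch below shows one possible conversion to floats. It assumes `'-'` marks the leading team (zero games back), which is not verified against the full dataset. `games_back_to_float` is a hypothetical helper.

```python
import pandas as pd

def games_back_to_float(s: pd.Series) -> pd.Series:
    """Convert a games-back column to float (hypothetical helper)."""
    # Assumption: '-' means the team leads, i.e. zero games back
    cleaned = s.replace('-', '0')
    # Anything that still cannot be parsed becomes NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce')

# Toy column shaped like leagueGamesBack
toy = pd.Series(['-', '1.0', '2.5', '-'], name='leagueGamesBack')
converted = games_back_to_float(toy)
```

Using `errors='coerce'` keeps the conversion from failing on unexpected strings, at the cost of silently producing `NaN`, so a check with `converted.isna().sum()` afterwards is worthwhile.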

### Events

In [61]:
# %%time
# df_events = unpack_data(df_train, dfs = ['events'])['events']

In [62]:
# df_events.head() 

In [63]:
# df_events.info()

### Player Twitter Followers

In [64]:
%%time
df_ttPlayer = unpack_data(df_train, dfs=['playerTwitterFollowers'])['playerTwitterFollowers']

CPU times: user 789 ms, sys: 13.2 ms, total: 802 ms
Wall time: 790 ms


In [65]:
df_ttPlayer.head()

Unnamed: 0,date,playerId,playerName,accountName,twitterHandle,numberOfFollowers
0,2018-01-01,545361,Mike Trout,Mike Trout,@miketrout,2452409
1,2018-01-01,506433,Yu Darvish,Yu Darvish,@faridyu,1945081
2,2018-01-01,434378,Justin Verlander,Justin Verlander,@justinverlander,1795985
3,2018-01-01,430897,Nick Swisher,Nick Swisher,@nickswisher,1711807
4,2018-01-01,120074,David Ortiz,David Ortiz,@davidortiz,1515463


In [66]:
df_ttPlayer.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43600 entries, 0 to 1335
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               43600 non-null  datetime64[ns]
 1   playerId           43600 non-null  int64         
 2   playerName         43600 non-null  object        
 3   accountName        43600 non-null  object        
 4   twitterHandle      43600 non-null  object        
 5   numberOfFollowers  43600 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 2.3+ MB


In [67]:
cols = {
    'playerId': 'IdPlayer',
    'date': 'DtTwitter',
    'numberOfFollowers': 'NuFollowers'
}
# Keep only the mapped columns and rename them in one step
df_ttPlayer = df_ttPlayer[list(cols)].rename(columns=cols)
df_ttPlayer.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43600 entries, 0 to 1335
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   IdPlayer     43600 non-null  int64         
 1   DtTwitter    43600 non-null  datetime64[ns]
 2   NuFollowers  43600 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.3 MB


In [68]:
pd.to_pickle(df_ttPlayer, processed_data_path + 'playerTwitterFollowers.pkl')
del df_ttPlayer
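
If follower counts are recorded only on some dates, a later step will probably need the most recent known count for each player on an arbitrary day. One way to do that is `pd.merge_asof`, sketched below on toy data. The `daily` frame and its `Dt` column are hypothetical stand-ins for whatever daily table is used downstream.

```python
import pandas as pd

# Sparse follower snapshots, shaped like the saved frame
followers = pd.DataFrame({
    'IdPlayer': [1, 1, 2],
    'DtTwitter': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-01-01']),
    'NuFollowers': [100, 120, 50],
})

# Hypothetical daily rows that need a follower count attached
daily = pd.DataFrame({
    'IdPlayer': [1, 1, 2],
    'Dt': pd.to_datetime(['2018-01-15', '2018-02-03', '2018-01-20']),
})

# merge_asof requires both sides to be sorted by their date key.
# direction='backward' picks the latest snapshot on or before each date.
merged = pd.merge_asof(
    daily.sort_values('Dt'),
    followers.sort_values('DtTwitter'),
    left_on='Dt',
    right_on='DtTwitter',
    by='IdPlayer',
    direction='backward',
)
```

Rows with no earlier snapshot would get `NaN` in `NuFollowers`, so the column may become `float64` on real data.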

### Team Twitter Followers

In [69]:
%%time
df_ttTeam = unpack_data(df_train, dfs=['teamTwitterFollowers'])['teamTwitterFollowers']

CPU times: user 671 ms, sys: 15.7 ms, total: 687 ms
Wall time: 677 ms


In [70]:
df_ttTeam.head()

Unnamed: 0,date,teamId,teamName,accountName,twitterHandle,numberOfFollowers
0,2018-01-01,147,New York Yankees,New York Yankees,@Yankees,3130482
1,2018-01-01,112,Chicago Cubs,Chicago Cubs,@Cubs,2373710
2,2018-01-01,141,Toronto Blue Jays,Toronto Blue Jays,@BlueJays,2196352
3,2018-01-01,111,Boston Red Sox,Boston Red Sox,@RedSox,1950737
4,2018-01-01,119,Los Angeles Dodgers,Los Angeles Dodgers,@Dodgers,1949542


In [71]:
df_ttTeam.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1290 entries, 0 to 29
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               1290 non-null   datetime64[ns]
 1   teamId             1290 non-null   int64         
 2   teamName           1290 non-null   object        
 3   accountName        1290 non-null   object        
 4   twitterHandle      1290 non-null   object        
 5   numberOfFollowers  1290 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 70.5+ KB


In [72]:
cols = {
    'teamId': 'IdTeam',
    'date': 'DtTwitter',
    'numberOfFollowers': 'NuFollowers'
}
# Keep only the mapped columns and rename them in one step
df_ttTeam = df_ttTeam[list(cols)].rename(columns=cols)
df_ttTeam.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1290 entries, 0 to 29
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   IdTeam       1290 non-null   int64         
 1   DtTwitter    1290 non-null   datetime64[ns]
 2   NuFollowers  1290 non-null   int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 40.3 KB


In [73]:
pd.to_pickle(df_ttTeam, processed_data_path + 'teamTwitterFollowers.pkl')
del df_ttTeam