# Data Preparation

Notebook containing the code used to prepare the project's data.

The raw data is located in **/src/data/raw-data**, and the processed data should be saved to **/src/data/processed-data**.

## Initial Setup

In [2]:
import pandas as pd
import numpy as np

import pickle as pkl

In [3]:
# File paths for working locally
raw_data_path = '../data/raw-data/'
processed_data_path = '../data/processed-data/'

In [4]:
# # Uncomment this cell if running on Google Colab
# from google.colab import drive
# drive.mount('/content/drive/')
# raw_data_path = '/content/drive/MyDrive/datasets/mlb-player-digital-engagement-forecasting/'

## Loading the Datasets

In [5]:
dataset_names = {
    'Awards': 'awards.csv', 
    'Example': 'example_test.csv', 
    'Players': 'players.csv',
    'Seasons': 'seasons.csv', 
    'Teams': 'teams.csv', 
    'Train': 'train.csv'
}
# Prefix each file name with the raw data directory
for key in dataset_names:
    dataset_names[key] = raw_data_path + dataset_names[key]
dataset_names

{'Awards': '../data/raw-data/awards.csv',
 'Example': '../data/raw-data/example_test.csv',
 'Players': '../data/raw-data/players.csv',
 'Seasons': '../data/raw-data/seasons.csv',
 'Teams': '../data/raw-data/teams.csv',
 'Train': '../data/raw-data/train.csv'}

### Players

In [6]:
df_players = pd.read_csv(dataset_names['Players'])
df_players.head()

Unnamed: 0,playerId,playerName,DOB,mlbDebutDate,birthCity,birthStateProvince,birthCountry,heightInches,weight,primaryPositionCode,primaryPositionName,playerForTestSetAndFuturePreds
0,665482,Gilberto Celestino,1999-02-13,2021-06-02,Santo Domingo,,Dominican Republic,72,170,8,Outfielder,False
1,593590,Webster Rivas,1990-08-08,2021-05-28,Nagua,,Dominican Republic,73,219,3,First Base,True
2,661269,Vladimir Gutierrez,1995-09-18,2021-05-28,Havana,,Cuba,73,190,1,Pitcher,True
3,669212,Eli Morgan,1996-05-13,2021-05-28,Rancho Palos Verdes,CA,USA,70,190,1,Pitcher,True
4,666201,Alek Manoah,1998-01-09,2021-05-27,Homestead,FL,USA,78,260,1,Pitcher,True


In [7]:
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   playerId                        2061 non-null   int64 
 1   playerName                      2061 non-null   object
 2   DOB                             2061 non-null   object
 3   mlbDebutDate                    2025 non-null   object
 4   birthCity                       2061 non-null   object
 5   birthStateProvince              1516 non-null   object
 6   birthCountry                    2061 non-null   object
 7   heightInches                    2061 non-null   int64 
 8   weight                          2061 non-null   int64 
 9   primaryPositionCode             2061 non-null   object
 10  primaryPositionName             2061 non-null   object
 11  playerForTestSetAndFuturePreds  2057 non-null   object
dtypes: int64(3), object(9)
memory usage: 193.3+ KB


After inspecting the dataset, we dropped two columns: `playerName`, because `playerId` already identifies each player, and `birthStateProvince`, because it is missing in 545 of the 2,061 rows (about 26%).
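A minimal sketch of how the share of missing values per column could be checked before deciding which columns to drop. The `toy` frame and its values are hypothetical stand-ins for `players.csv`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for players.csv (illustrative values only)
toy = pd.DataFrame({
    "playerId": [1, 2, 3, 4],
    "birthStateProvince": ["CA", None, None, "FL"],
})

# Fraction of missing values per column, sorted from most to least missing
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

Applied to the real dataframe, the same expression would be `df_players.isna().mean()`. Any cutoff for dropping a column is a judgment call and is not specified in this notebook.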

In [8]:
# Select the columns to keep and rename them to follow the naming convention
cols_df_players = {
    'playerId': 'IdPlayer',
    'DOB': 'DtBirth',
    'mlbDebutDate': 'DtMlbDebut',
    'birthCity': 'NmCity',
    #'birthStateProvince': 'NmState',
    'birthCountry': 'NmCountry',
    'heightInches': 'NuHeight',
    'weight': 'NuWeight',
    'primaryPositionCode': 'CdPrimaryPosition',
    'primaryPositionName': 'NmPrimaryPosition',
    'playerForTestSetAndFuturePreds': 'FlgForTestAndPred'
}
df_players = df_players[list(cols_df_players)]
df_players.columns = list(cols_df_players.values())
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   IdPlayer           2061 non-null   int64 
 1   DtBirth            2061 non-null   object
 2   DtMlbDebut         2025 non-null   object
 3   NmCity             2061 non-null   object
 4   NmCountry          2061 non-null   object
 5   NuHeight           2061 non-null   int64 
 6   NuWeight           2061 non-null   int64 
 7   CdPrimaryPosition  2061 non-null   object
 8   NmPrimaryPosition  2061 non-null   object
 9   FlgForTestAndPred  2057 non-null   object
dtypes: int64(3), object(7)
memory usage: 161.1+ KB


In [26]:
# pd.to_pickle(df_players, processed_data_path + 'players.pkl')
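As a hedged illustration of what the commented-out save step does, the sketch below writes a small frame to a pickle file and reads it back. The `toy` frame and the temporary directory are assumptions made for the example; the notebook itself targets `processed_data_path`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame using the renamed column convention
toy = pd.DataFrame({"IdPlayer": [665482, 593590], "NuHeight": [72, 73]})

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "players.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle preserves dtypes and the index, so the round trip is lossless
print(restored.equals(toy))
```

Unlike CSV, a pickle file keeps column dtypes intact, which may be why it was chosen here. Pickle files should only be loaded from trusted sources.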

### Teams

In [10]:
df_teams = pd.read_csv(dataset_names['Teams'])
df_teams.head()

Unnamed: 0,id,name,teamName,teamCode,shortName,abbreviation,locationName,leagueId,leagueName,divisionId,divisionName,venueId,venueName
0,108,Los Angeles Angels,Angels,ana,LA Angels,LAA,Anaheim,103,American League,200,American League West,1,Angel Stadium
1,109,Arizona Diamondbacks,D-backs,ari,Arizona,ARI,Phoenix,104,National League,203,National League West,15,Chase Field
2,110,Baltimore Orioles,Orioles,bal,Baltimore,BAL,Baltimore,103,American League,201,American League East,2,Oriole Park at Camden Yards
3,111,Boston Red Sox,Red Sox,bos,Boston,BOS,Boston,103,American League,201,American League East,3,Fenway Park
4,112,Chicago Cubs,Cubs,chn,Chi Cubs,CHC,Chicago,104,National League,205,National League Central,17,Wrigley Field


In [11]:
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30 non-null     int64 
 1   name          30 non-null     object
 2   teamName      30 non-null     object
 3   teamCode      30 non-null     object
 4   shortName     30 non-null     object
 5   abbreviation  30 non-null     object
 6   locationName  30 non-null     object
 7   leagueId      30 non-null     int64 
 8   leagueName    30 non-null     object
 9   divisionId    30 non-null     int64 
 10  divisionName  30 non-null     object
 11  venueId       30 non-null     int64 
 12  venueName     30 non-null     object
dtypes: int64(4), object(9)
memory usage: 3.2+ KB


In [12]:
cols_df_teams = {
    'id': 'IdTeam',
    'locationName': 'NmLocation',
    'leagueId': 'IdLeague',
    'divisionId': 'IdDivision',
    'venueId': 'IdVenue'
}
df_teams = df_teams[list(cols_df_teams.keys())]
df_teams.columns = list(cols_df_teams.values())
df_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   IdTeam      30 non-null     int64 
 1   NmLocation  30 non-null     object
 2   IdLeague    30 non-null     int64 
 3   IdDivision  30 non-null     int64 
 4   IdVenue     30 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


In [25]:
# pd.to_pickle(df_teams, processed_data_path + 'teams.pkl')

### Awards

In [19]:
df_awards = pd.read_csv(dataset_names['Awards'])
df_awards.head()

Unnamed: 0,awardDate,awardSeason,awardId,awardName,playerId,playerName,awardPlayerTeamId
0,2017-12-21,2017,WARRENSPAHN,Warren Spahn Award,477132,Clayton Kershaw,119.0
1,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,474319,Brandon Snyder,120.0
2,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,592530,Jose Marmolejos,120.0
3,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,593833,Wander Suero,120.0
4,2017-12-20,2017,MILBORGAS,MiLB.com Organization All-Star,600466,Raudy Read,120.0


In [20]:
df_awards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11256 entries, 0 to 11255
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   awardDate          11256 non-null  object 
 1   awardSeason        11256 non-null  int64  
 2   awardId            11256 non-null  object 
 3   awardName          11256 non-null  object 
 4   playerId           11256 non-null  int64  
 5   playerName         11256 non-null  object 
 6   awardPlayerTeamId  11243 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 615.7+ KB


In [21]:
cols_df_awards = {
    'awardId': 'IdAward',
    'awardDate': 'DtAward',
    'awardSeason': 'DtAwardSeason',
    'playerId': 'IdPlayer',
    'awardPlayerTeamId': 'IdTeam'
}
df_awards = df_awards[list(cols_df_awards)]
df_awards.columns = list(cols_df_awards.values())
df_awards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11256 entries, 0 to 11255
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   IdAward        11256 non-null  object 
 1   DtAward        11256 non-null  object 
 2   DtAwardSeason  11256 non-null  int64  
 3   IdPlayer       11256 non-null  int64  
 4   IdTeam         11243 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 439.8+ KB
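The output above shows `IdTeam` stored as `float64`, which is what pandas does with an integer column that has missing entries (13 of the 11,256 rows here). If integer team IDs are preferred, one option is the nullable `Int64` dtype. This is a sketch on hypothetical values, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking IdTeam: integer-valued floats plus a missing entry
toy = pd.DataFrame({"IdTeam": [119.0, 120.0, np.nan]})

# Nullable integer dtype keeps the IDs as integers and the gap as <NA>
toy["IdTeam"] = toy["IdTeam"].astype("Int64")
print(toy["IdTeam"].dtype)
```

Keeping the IDs as integers can make later joins against `IdTeam` in the teams dataframe easier to reason about, though whether downstream code accepts nullable dtypes is an assumption to verify.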


In [24]:
# pd.to_pickle(df_awards, processed_data_path + 'awards.pkl')

### Seasons

In [27]:
df_seasons = pd.read_csv(dataset_names['Seasons'])
df_seasons.head()

Unnamed: 0,seasonId,seasonStartDate,seasonEndDate,preSeasonStartDate,preSeasonEndDate,regularSeasonStartDate,regularSeasonEndDate,lastDate1stHalf,allStarDate,firstDate2ndHalf,postSeasonStartDate,postSeasonEndDate
0,2017,2017-04-02,2017-11-01,2017-02-22,2017-04-01,2017-04-02,2017-10-01,2017-07-09,2017-07-11,2017-07-14,2017-10-03,2017-11-01
1,2018,2018-03-29,2018-10-28,2018-02-21,2018-03-27,2018-03-29,2018-10-01,2018-07-15,2018-07-17,2018-07-19,2018-10-02,2018-10-28
2,2019,2019-03-20,2019-10-30,2019-02-21,2019-03-26,2019-03-20,2019-09-29,2019-07-07,2019-07-09,2019-07-11,2019-10-01,2019-10-30
3,2020,2020-07-23,2020-10-28,2020-02-21,2020-07-22,2020-07-23,2020-09-27,2020-08-25,,2020-08-26,2020-09-29,2020-10-28
4,2021,2021-02-28,2021-10-31,2021-02-28,2021-03-30,2021-04-01,2021-10-03,2021-07-11,2021-07-13,2021-07-15,2021-10-04,2021-10-31


In [28]:
df_seasons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seasonId                5 non-null      int64 
 1   seasonStartDate         5 non-null      object
 2   seasonEndDate           5 non-null      object
 3   preSeasonStartDate      5 non-null      object
 4   preSeasonEndDate        5 non-null      object
 5   regularSeasonStartDate  5 non-null      object
 6   regularSeasonEndDate    5 non-null      object
 7   lastDate1stHalf         5 non-null      object
 8   allStarDate             4 non-null      object
 9   firstDate2ndHalf        5 non-null      object
 10  postSeasonStartDate     5 non-null      object
 11  postSeasonEndDate       5 non-null      object
dtypes: int64(1), object(11)
memory usage: 608.0+ bytes


We chose not to use the Seasons dataframe, since it would only add complexity to the model. It contains only dates that mark the start and end of predefined events in each season, so the features it would add are unlikely to be relevant.

## Train