# MLB Fan Engagement - Getting Started

In this notebook, I will import all of the original data from Kaggle's MLB fan engagement competition (https://www.kaggle.com/c/mlb-player-digital-engagement-forecasting/data) and reshape it into a format that makes analysis, feature selection, and modeling easier.

This notebook should serve as a guide for getting started on this competition by organizing its large amounts of data. I also document each step so that it is clear what information each piece of the data conveys.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the Data

First, I will import the data and explain what we are working with.

In [None]:
#path varies depending on whether I am running on my own machine or Kaggle

my_path = "/Users/Ethan/Desktop/Desktop - Ethan’s MacBook Air/Personal Projects/Baseball/fan-engagement/data"

kaggle_path = "../input/mlb-player-digital-engagement-forecasting"

In [None]:
def read_data(path, file):
    
    df = pd.read_csv(f"{path}/{file}.csv")
    
    num_rows = len(df)
    num_cols = len(df.columns)
    mem_usage = df.memory_usage(deep = True).sum()
    
    print(f"{file}.csv: {num_rows} rows; {num_cols} columns; {mem_usage} bytes of memory.")
    return df

In [None]:
#swap kaggle_path for my_path when running on my own machine

train = read_data(kaggle_path, "train")
teams = read_data(kaggle_path, "teams")
seasons = read_data(kaggle_path, "seasons")
players = read_data(kaggle_path, "players")
awards = read_data(kaggle_path, "awards")
example_test = read_data(kaggle_path, "example_test")
example_submission = read_data(kaggle_path, "example_sample_submission")

train.csv is the main dataset of interest. It takes up a large amount of memory because each cell holds many records for a given date, all stored as JSON. The rest of the dataframes are helpful for identifying teams and players from their IDs, recording when each season begins, etc.

In [None]:
train.head()

In [None]:
train.info()

Each row in this dataset corresponds to a single date, spanning January 1st, 2018 through April 30th, 2021, and missing values often correspond to off-season dates. The MLB pre-season (Spring Training) usually starts in late February and the post-season usually ends in late October. The regular season usually spans from the beginning of April through the end of September. Transactions (trades) and awards are not daily occurrences, and Twitter followers are recorded on the 1st of every month according to the competition.
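
One way to see the off-season pattern is to measure how often a nested column is missing in each month. The sketch below uses a toy stand-in for train.csv with made-up values, and it assumes the date column is stored as a YYYYMMDD integer; the same idea should carry over to the real dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: one row per date, with a nested column that is
# NaN on days without games (all values here are made up for illustration).
toy_train = pd.DataFrame({
    "date": [20180101, 20180102, 20180401, 20180402],
    "games": [np.nan, np.nan, '[{"gamePk": 1}]', '[{"gamePk": 2}]'],
})

# Parse the dates (assumed to be YYYYMMDD integers) and pull out the month
toy_train["month"] = pd.to_datetime(
    toy_train["date"].astype(str), format="%Y%m%d"
).dt.month

# Share of dates in each month where the games cell is missing
missing_by_month = toy_train["games"].isna().groupby(toy_train["month"]).mean()
print(missing_by_month)
```

In this toy frame, every January date is missing game data and no April date is, which is the shape I would expect from an off-season month versus a regular-season month.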

## Unnesting the Training Data 

Now, for train.csv, we need to convert the JSON in each cell into dataframes. Each of the 11 nested columns will become its own dataframe, including nextDayPlayerEngagement, which holds the target variables. The following function converts one column of train.csv from JSON into a dataframe.

In [None]:
from io import StringIO

def json_to_df(df, column):
    
    data_list = []
    for json_data in df[column].dropna(): #skip NA cells so they are not appended to the dataframes
        
        #wrap the string in StringIO, since recent pandas versions deprecate passing literal json to read_json
        data = pd.read_json(StringIO(json_data))
        data_list.append(data)
        
    #stack the per-date dataframes into one long dataframe
    all_data = pd.concat(data_list, axis = 0)
    
    num_rows = len(all_data)
    num_cols = len(all_data.columns)
    mem_usage = all_data.memory_usage(deep = True).sum()
    
    print(f"{column}: {num_rows} rows; {num_cols} cols; {mem_usage} bytes.")
    return all_data
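
To make the unnesting concrete, here is a minimal sketch of the same idea on a toy frame. The column name `nested` and all values are hypothetical; the real cells hold many more fields per record.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Toy frame: each non-missing cell is a JSON array of records for that date
toy = pd.DataFrame({
    "date": [20180101, 20180102, 20180103],
    "nested": [
        '[{"playerId": 1, "target1": 0.5}, {"playerId": 2, "target1": 1.5}]',
        np.nan,  # a date with no data, which should be skipped
        '[{"playerId": 1, "target1": 2.0}]',
    ],
})

# Parse every non-missing cell into its own small dataframe
frames = [pd.read_json(StringIO(cell)) for cell in toy["nested"].dropna()]

# Stack them into one long dataframe with one row per JSON record
unnested = pd.concat(frames, axis=0, ignore_index=True)
print(unnested)
```

The three JSON records spread over two dates become three rows, and the missing date contributes nothing.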

In [None]:
#a list of the 11 columns, not including the date.  We have this info in each json cell anyway.

nested_columns = train.columns[1:]

In [None]:
#we yield 11 dataframes from train.csv
#note that awards is rebound here, replacing the awards.csv dataframe read in earlier

engage, games, rosters, player_boxes, team_boxes, transactions, standings, awards, events, player_twitter, team_twitter = [json_to_df(train, var) for var in nested_columns]

The events data is massive because it includes many variables for each individual pitch.

In [None]:
engage.head()

Each of these dataframes contains a date key, and the player-level ones (including the engagement target dataframe) also contain a player id. We can later use these keys to join the dataframes together when analyzing and creating the models.
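
Before relying on date and player id as join keys, it can help to confirm that each pair appears only once, since duplicated keys multiply rows in a merge. This is a small sketch on a toy frame with made-up values, not a check taken from the competition materials.

```python
import pandas as pd

# Toy engagement-style frame; column names mirror the notebook, values are made up
toy_engage = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02"]),
    "playerId": [1, 2, 1],
    "target1": [0.1, 0.2, 0.3],
})

keys = ["date", "playerId"]

# True if any (date, playerId) pair appears more than once
has_duplicate_keys = bool(toy_engage.duplicated(subset=keys).any())
print(has_duplicate_keys)
```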

## Cleaning

Now, some cleaning is required.  We need to convert each date into a pandas datetime object and then make sure all keys are named identically so we can join the data as we please.

In [None]:
from datetime import datetime, timedelta

In [None]:
engage["engagementMetricsDate"] = pd.to_datetime(engage["engagementMetricsDate"])

#As the competition notes, the engagement data corresponds to information from the day prior.  Therefore, when
#joining this data to any other data, we need to join on the previous day.  

engage["engagementMetricsDate"] = engage["engagementMetricsDate"] - timedelta(days = 1)

engage = engage.rename(columns = {"engagementMetricsDate": "date"})

In [None]:
games["gameDate"] = pd.to_datetime(games["gameDate"])
rosters["gameDate"] = pd.to_datetime(rosters["gameDate"])
player_boxes["gameDate"] = pd.to_datetime(player_boxes["gameDate"])
team_boxes["gameDate"] = pd.to_datetime(team_boxes["gameDate"])
transactions["date"] = pd.to_datetime(transactions["date"])
standings["gameDate"] = pd.to_datetime(standings["gameDate"])
awards["awardDate"] = pd.to_datetime(awards["awardDate"])
events["gameDate"] = pd.to_datetime(events["gameDate"])
player_twitter["date"] = pd.to_datetime(player_twitter["date"])
team_twitter["date"] = pd.to_datetime(team_twitter["date"])

games = games.rename(columns = {"gameDate": "date"})
rosters = rosters.rename(columns = {"gameDate": "date"})
player_boxes = player_boxes.rename(columns = {"gameDate": "date"})
team_boxes = team_boxes.rename(columns = {"gameDate": "date"})
standings = standings.rename(columns = {"gameDate": "date"})
awards = awards.rename(columns = {"awardDate": "date"})
events = events.rename(columns = {"gameDate": "date"})

After the one-day shift, each date in the engagement dataframe is paired with the targets for the following day.
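
The effect of the shift is easiest to see on a couple of rows. The sketch below uses a toy frame with made-up values and `pd.Timedelta` in place of `timedelta`; both should give the same result here.

```python
import pandas as pd

# Toy engagement rows with made-up targets
toy_engage = pd.DataFrame({
    "engagementMetricsDate": ["2018-01-02", "2018-01-03"],
    "playerId": [1, 1],
    "target1": [0.1, 0.3],
})

# Move each engagement date back one day so it lines up with the
# date of the data that would be used to predict it
toy_engage["date"] = (
    pd.to_datetime(toy_engage["engagementMetricsDate"]) - pd.Timedelta(days=1)
)
print(toy_engage[["engagementMetricsDate", "date", "target1"]])
```

Engagement measured on January 2nd ends up keyed to January 1st, so joining on date attaches each day's data to the next day's targets.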

In [None]:
engage.head()

## Merging: Example

An example of a dataframe created from a json column:

In [None]:
player_twitter.head()

Notice that our keys of interest are date and player id. Because the engagement dates were shifted back one day, each row of player_twitter now lines up with the engagement targets for the following day, so these dataframes are ready to be merged.

In [None]:
#It is important to left join since we don't want to lose any information regarding target variables.

engage_twitter = pd.merge(engage, player_twitter, on = ["date", "playerId"], how = "left")

In [None]:
engage_twitter.head()

NAs represent dates that aren't the first of the month or players who don't have Twitter info.
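
To check where those missing values come from, pandas can label each row of a left join with its match status through the `indicator` argument of `pd.merge`. This is a sketch on toy frames with made-up values; the frame names are hypothetical stand-ins for engage and player_twitter.

```python
import pandas as pd

# Toy stand-ins for the engagement and Twitter dataframes
toy_engage = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-01"]),
    "playerId": [1, 1, 2],
    "target1": [0.1, 0.3, 0.2],
})
toy_twitter = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01"]),
    "playerId": [1],
    "numberOfFollowers": [1000],
})

# indicator=True adds a _merge column: "both" for matched rows,
# "left_only" for engagement rows with no Twitter record
merged = pd.merge(
    toy_engage, toy_twitter, on=["date", "playerId"], how="left", indicator=True
)

# A left join on unique keys keeps every engagement row
print(len(merged) == len(toy_engage))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are the ones that end up with NA follower counts, either because the date is not the first of the month or because the player has no Twitter record.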

## Conclusion

We can now save all of the converted data for my next notebook, which will involve merging much of the data, further cleaning, and feature selection.  Thanks for reading!

In [None]:
#engage.to_csv("../fan-engagement/data/engage.csv", index = False)
#games.to_csv("../fan-engagement/data/games.csv", index = False)
#rosters.to_csv("../fan-engagement/data/rosters.csv", index = False)
#player_boxes.to_csv("../fan-engagement/data/player_boxes.csv", index = False)
#team_boxes.to_csv("../fan-engagement/data/team_boxes.csv", index = False)
#transactions.to_csv("../fan-engagement/data/transactions.csv", index = False)
#standings.to_csv("../fan-engagement/data/standings.csv", index = False)
#awards.to_csv("../fan-engagement/data/awards.csv", index = False)
#events.to_csv("../fan-engagement/data/events.csv", index = False)
#player_twitter.to_csv("../fan-engagement/data/player_twitter.csv", index = False)
#team_twitter.to_csv("../fan-engagement/data/team_twitter.csv", index = False)
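
As a possible alternative to writing each save by hand, the dataframes could be collected in a dictionary and written in a loop. The sketch below uses two toy frames and a temporary directory as a stand-in for the real output folder; both are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frames standing in for the converted dataframes (values are made up)
frames = {
    "engage": pd.DataFrame({"playerId": [1], "target1": [0.1]}),
    "games": pd.DataFrame({"gamePk": [10]}),
}

# Temporary directory used here in place of the real data folder
out_dir = Path(tempfile.mkdtemp())

# Write each dataframe to <name>.csv without the index column
for name, frame in frames.items():
    frame.to_csv(out_dir / f"{name}.csv", index=False)

saved = sorted(p.name for p in out_dir.glob("*.csv"))
print(saved)
```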