<span style="color:red">Please open this kernel with the maximum viewing area. You can either open it on a wide monitor or click the burger icon to hide the left-hand side panel that lists the competition, code, datasets, etc.</span>

![img](https://i.imgur.com/sCqOxsQ.jpg)

# ⛄ Introduction

> In this competition, you’ll predict how fans engage with MLB players’ digital content on a daily basis for a future date range. You’ll have access to player performance data, social media data, and team factors like market size. Successful models will provide new insights into what signals most strongly correlate with and influence engagement.

## ❄️ Problem Statement

In this competition, the task is to "forecast" four different measures of engagement (`target1` - `target4`) for a subset of MLB players who are active in the 2021 season. 

Since this is a forecasting task, given an input date `d` we need to predict `target1` - `target4` for the date `d+1`.
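
To make the `d` to `d+1` setup concrete, here is a minimal sketch on a made-up frame. The column names `date`, `playerId`, and `target1`, and all the values, are assumptions for illustration only; they are not the actual layout of `train.csv`.

```python
import pandas as pd

# Toy daily data: two hypothetical players, three consecutive days each.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-01", "2021-04-02", "2021-04-03"] * 2),
    "playerId": [1, 1, 1, 2, 2, 2],
    "target1": [0.1, 0.5, 0.3, 2.0, 1.0, 4.0],
})

# For each player, the label attached to date d is the target observed on d+1.
# This assumes one row per player per day with no gaps in the dates.
toy = toy.sort_values(["playerId", "date"])
toy["target1_next_day"] = toy.groupby("playerId")["target1"].shift(-1)

# The last day of each player has no d+1 value yet, so it is dropped.
supervised = toy.dropna(subset=["target1_next_day"]).reset_index(drop=True)
print(supervised)
```

Each remaining row pairs the information available on date `d` with the value to be predicted for `d+1`.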

## 📜 Training Data

We have both "static" files and "daily" data: 

* Static files - `players.csv`, `teams.csv`, `seasons.csv`, and `awards.csv`. These files do not change over time; put simply, they have no continuous date column. 

* Daily file - `train.csv` is organized by day; put simply, it has a continuous date column. The first row is associated with the date `01-01-2018` and the last row with `some-date`. <br>

<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

# 💎 W&B Tables

W&B Tables accelerate the ML development lifecycle by giving users the ability to rapidly extract meaningful insights from tabular data. The W&B Table Visualizer provides an interactive interface to perform powerful analytics functions like grouping, joining, and creating custom fields while simultaneously supporting rich media annotations such as bounding boxes and segmentation masks. 

W&B Tables is designed "generically" to work well for a wide range of use cases, from analyzing intermediate data transformations to reviewing model predictions, while being integrated directly into the W&B UI dashboard, allowing users to learn, adapt, and improve their models effectively and efficiently.

Learn more about W&B Tables [here](https://docs.wandb.ai/guides/data-vis). Note that this feature is still a work in progress, so please try it out and send over feedback. 

## ♣️ About this kernel

In this kernel, we will go through each CSV file and perform meaningful EDA. We will use W&B Tables for quick EDA and W&B Custom Charts for more complex visualizations. We will also use conventional matplotlib to get some insights. 

This is a work in progress. :D

# ⚙️ Imports and Setups

In [None]:
!pip install --upgrade -q wandb

In [None]:
import os
os.environ["WANDB_SILENT"] = "true"
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import wandb
wandb.login()

> Find your API key at: https://wandb.ai/authorize

In [None]:
CONFIG = dict(
    competition = 'mlb',
    _wandb_kernel = 'ayut',
    infra = "Kaggle",
)

# Load CSV Files

In [None]:
ROOT_PATH = "/kaggle/input/mlb-player-digital-engagement-forecasting"

# training file
train_df = pd.read_csv(f"{ROOT_PATH}/train.csv")

# meta files
players_df = pd.read_csv(f"{ROOT_PATH}/players.csv")
teams_df = pd.read_csv(f"{ROOT_PATH}/teams.csv")
seasons_df = pd.read_csv(f"{ROOT_PATH}/seasons.csv")
awards_df = pd.read_csv(f"{ROOT_PATH}/awards.csv")

# EDA of Static CSV files

## 1️⃣ players.csv

> #### 📌 Column Names and Descriptions <br> 
`playerId` - Unique identifier for a player. <br>
`playerName` - Name of the player. There are 2055 unique player names. <br>
`DOB` - Player’s date of birth.<br>
`mlbDebutDate`<br>
`birthCity`<br>
`birthStateProvince`<br>
`birthCountry`<br>
`heightInches`<br>
`weight`<br>
`primaryPositionCode` - Player’s primary position code, details are here.<br>
`primaryPositionName` - Player’s primary position, details are here.<br>
`playerForTestSetAndFuturePreds` - Boolean, true if player is among those for whom predictions are to be made in test data<br>

In [None]:
player_table = wandb.Table(dataframe=players_df)

run = wandb.init(project='kaggle-mlb', config=CONFIG)
wandb.log({'raw_players_table': player_table})
run.finish()    
run

#### You can also check out the created W&B tables page [here](https://wandb.ai/ayush-thakur/kaggle-mlb/runs/944575zt). 

#### 📌 1. How many rows?

There are 2061 rows in the `players.csv` file.
![img](https://i.imgur.com/WR9wVeS.png)

#### 📌 2. Do we have unique player names?

There are 2055 unique player names. There are 5 player names that occur multiple times. These are Luis Garcia (3), Will Smith (2), Javy Guerra (2), Austin Adams (2), Jose Ramirez (2). 

🎳 **Do It Yourself**: Group by the `playerName` column and insert a new column to the right of the `playerName` column. Go to the Column Settings of this new column and enter `row.count()` as the cell expression; this gives the `count` column. Sort this column in descending order. Scroll through the table to get more useful insights. 
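
The same name check can be reproduced outside the Tables UI with pandas. The sketch below runs on a small made-up frame (the names, IDs, and dates are placeholders, not rows from `players.csv`); the helper `name_counts` is a hypothetical name introduced here.

```python
import pandas as pd

def name_counts(players: pd.DataFrame) -> pd.Series:
    """Count rows per playerName, most frequent first."""
    return players["playerName"].value_counts()

# Toy stand-in for players_df; every value here is made up.
toy_players = pd.DataFrame({
    "playerId": [101, 102, 103, 104],
    "playerName": ["A. Example", "A. Example", "B. Sample", "C. Placeholder"],
    "DOB": ["1990-01-01", "1995-02-02", "1992-03-03", "1988-04-04"],
})

counts = name_counts(toy_players)
repeated = counts[counts > 1]  # names shared by more than one row
print(repeated)

# Inspect every row behind a repeated name to see whether they are different people.
same_name = toy_players[toy_players["playerName"].isin(repeated.index)]
print(same_name)
```

In the notebook, passing `players_df` instead of the toy frame should list the repeated names with their counts.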

In [None]:
run

> 📎 Let's dig into Luis Garcia more: 
There are three different DOBs associated with Luis Garcia, which means they are three different individuals. They also have different birth cities; one of them comes from New York City. Two of them have a `primaryPositionCode` of 1, i.e. they play as pitchers.


#### 📌 4. When were the players born?

Using the same technique as above, we can find that there are only 1670 unique dates of birth. That means multiple players were born on the same day. Interesting! 

Five MLB players were born on `1995-01-17` in particular, of whom three play as pitchers. 
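
A pandas version of this date-of-birth check might look like the following sketch. The frame is a toy stand-in for `players_df` with placeholder names; only the column names `DOB` and `primaryPositionName` come from the file description above.

```python
import pandas as pd

# Toy stand-in for players_df; names and most dates are placeholders.
toy_players = pd.DataFrame({
    "playerName": ["Player A", "Player B", "Player C", "Player D"],
    "DOB": ["1995-01-17", "1995-01-17", "1990-06-02", "1988-11-30"],
    "primaryPositionName": ["Pitcher", "Catcher", "Pitcher", "Shortstop"],
})

# Number of distinct dates of birth.
n_unique_dob = toy_players["DOB"].nunique()

# The date shared by the most players, and the players born on it.
dob_counts = toy_players["DOB"].value_counts()
most_common_dob = dob_counts.idxmax()
same_day = toy_players[toy_players["DOB"] == most_common_dob]

print(n_unique_dob, most_common_dob)
print(same_day["primaryPositionName"].value_counts())
```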

#### 📌 5. What are the positions these players play?

There are 10 unique playing positions in the `players.csv` file. These are Pitcher, Catcher, Outfielder, First Base, Second Base, Third Base, Shortstop, Outfield, Designated Hitter, and Infield. 

According to the Baseball positions [Wikipedia page](https://en.wikipedia.org/wiki/Baseball_positions), there are a total of 9 fielding positions that can be grouped into three groups:

* Outfield (left field, center field, and right field)

* Infield (first base, second base, third base, and shortstop)

* Battery (pitcher and catcher)

🎳 **Do It Yourself**: Head over to the table and group by the `primaryPositionName` column. Insert a new column to the left-hand side of the `playerName` column by clicking on `Insert 1 left <` from the three-dot icon, which appears when you hover over the column name. Click on `Column settings` from the three-dot icon of the newly created column and enter `row.count()` in the `cell expression` box. Sort the resulting `count` column in descending order.

![img](https://i.imgur.com/ZjVwnOy.png)

![img](https://i.imgur.com/BcxYI5F.png)
([Source](https://en.wikipedia.org/wiki/Baseball_positions))

Let's map the `primaryPositionName` to `primaryPositionCode`.

* Pitcher -> 1 <br>
* Outfielder -> 7 (left fielder), 8 (center fielder), 9 (right fielder) <br>
* Catcher -> 2 <br>
* Second Base -> 4 <br>
* First Base -> 3 <br>
* Shortstop -> 6 <br>
* Third Base -> 5 <br>
* Outfield -> "O" <br>
* Designated Hitter -> 10 <br>
* Infield -> "I" <br>

Note that Designated Hitter is a special role. More on it [here](https://en.wikipedia.org/wiki/Designated_hitter).

We might want to use the `primaryPositionCode` as a feature. 
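
One possible way to turn the position code into a feature is one-hot encoding, since codes such as "O" and "I" are not numeric. This is only a sketch on a toy frame; it assumes the code column is handled as strings, and the `pos` prefix is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for players_df using the name/code pairs listed above.
toy_players = pd.DataFrame({
    "playerId": [1, 2, 3, 4, 5, 6],
    "primaryPositionName": ["Pitcher", "Catcher", "Outfielder", "Outfielder", "Outfield", "Infield"],
    "primaryPositionCode": ["1", "2", "7", "8", "O", "I"],
})

# List the codes that appear under each position name.
codes_by_name = toy_players.groupby("primaryPositionName")["primaryPositionCode"].unique()
print(codes_by_name)

# One-hot encode the code; casting to str keeps mixed codes like 1 and "O" consistent.
position_features = pd.get_dummies(
    toy_players["primaryPositionCode"].astype(str), prefix="pos"
)
print(position_features.columns.tolist())
```

Each player gets exactly one active `pos_*` column, which can then be joined back to the daily data on `playerId`.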

#### 📌 6. What's `playerForTestSetAndFuturePreds`?

These players will be present in the test set and we need to forecast the measure of engagement for them. So how many players are present in the test set?

Let's quickly use the filter feature of the Tables to find it. 

🎳 **Do It Yourself**: Click on Filter, enter the expression `row["playerForTestSetAndFuturePreds"]` in the filter expression box, and click Apply. 

**There are only 1187 players in the test set.**

![img](https://i.imgur.com/O3Y69Gr.gif)
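
The same filter can be sketched in pandas. The toy frame below is made up; it assumes the flag column may contain missing values, which are treated here as "not in the test set".

```python
import pandas as pd

# Toy stand-in for players_df; the None entry mimics a missing flag.
toy_players = pd.DataFrame({
    "playerId": [1, 2, 3, 4],
    "playerForTestSetAndFuturePreds": [True, False, None, True],
})

flag = toy_players["playerForTestSetAndFuturePreds"]

# eq(True) is False for both False and missing values.
test_players = toy_players[flag.eq(True)]
print(len(test_players))
```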

## 2️⃣ seasons.csv

> #### 📌 Column Names and Description <br>
`seasonId`: The season year. There is one season per year, from 2017 through 2021. <br>
`seasonStartDate`<br>
`seasonEndDate`<br>
`preSeasonStartDate`<br>
`preSeasonEndDate`<br>
`regularSeasonStartDate`<br>
`regularSeasonEndDate`<br>
`lastDate1stHalf`<br>
`allStarDate`<br>
`firstDate2ndHalf`<br>
`postSeasonStartDate`<br>
`postSeasonEndDate`

In [None]:
seasons_table = wandb.Table(dataframe=seasons_df)

run = wandb.init(project='kaggle-mlb', config=CONFIG)
wandb.log({'raw_seasons_table': seasons_table})
run.finish()    
run

#### You can also check out the W&B Tables page [here](https://wandb.ai/ayush-thakur/kaggle-mlb/runs/12fs9zho).

* We know the important start and end dates for 5 seasons: 2017 - 2021. <br>
* We will forecast for the 2021 season. We can use this season's data as a validation set. <br>
* A typical MLB season lasts approximately 7 months. In 2017-2019, the seasons were 7 months long. <br>
* The 2020 season lasted only 3 months because of the Covid-19 pandemic. From the [2020 MLB Wikipedia page](https://en.wikipedia.org/wiki/2020_Major_League_Baseball_season), "The 2020 Major League Baseball season began on July 23 and ended on September 27 with 60 games amidst the ongoing COVID-19 pandemic." <br>
* The 2021 season is scheduled for 8 months. <br>
* In Major League Baseball (MLB), spring training is a series of practices and exhibition games preceding the start of the regular season. This is called pre-season. I would assume the engagement to be lower in pre-season games. <br>
* The Major League Baseball "postseason" is an elimination tournament held after the conclusion of the Major League Baseball (MLB) regular season. <br>
* The Major League Baseball All-Star Game, also known as the "Midsummer Classic", is an annual professional baseball game sanctioned by Major League Baseball (MLB) and contested between the all-stars from the American League (AL) and National League (NL). There was no all-star game in 2020. <br>
* In 2021, preseason start date and season start date are the same. Wondering why?
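
Season lengths and "is this date inside the regular season?" checks can be computed from the date columns. The sketch below uses a one-row toy frame built from the 2020 dates quoted above; it assumes the dates are stored as ISO-formatted strings, and `in_regular_season` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for seasons_df; the dates are the 2020 dates from the quote above.
toy_seasons = pd.DataFrame({
    "seasonId": [2020],
    "regularSeasonStartDate": ["2020-07-23"],
    "regularSeasonEndDate": ["2020-09-27"],
})

start = pd.to_datetime(toy_seasons["regularSeasonStartDate"])
end = pd.to_datetime(toy_seasons["regularSeasonEndDate"])

# Length of the regular season in days.
toy_seasons["regularSeasonDays"] = (end - start).dt.days
print(toy_seasons[["seasonId", "regularSeasonDays"]])

def in_regular_season(date, seasons: pd.DataFrame) -> bool:
    """Return True if `date` falls inside any regular season in `seasons`."""
    d = pd.Timestamp(date)
    starts = pd.to_datetime(seasons["regularSeasonStartDate"])
    ends = pd.to_datetime(seasons["regularSeasonEndDate"])
    return bool(((starts <= d) & (d <= ends)).any())

print(in_regular_season("2020-08-15", toy_seasons))
```

A flag like this could serve as a simple calendar feature for the daily data, for example to separate pre-season from regular-season days.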

## 3️⃣ teams.csv

> #### 📌 Column Names and Description <br>
`id` - teamId <br> 
`name` <br>
`teamName` <br>
`teamCode` <br>
`shortName` <br>
`abbreviation` <br>
`locationName` <br>
`leagueId` <br>
`leagueName` <br>
`divisionId` <br>
`divisionName` <br>
`venueId` <br>
`venueName` <br>

In [None]:
teams_table = wandb.Table(dataframe=teams_df)

run = wandb.init(project='kaggle-mlb', config=CONFIG)
wandb.log({'raw_teams_table': teams_table})
run.finish()    
run

#### You can also check out the W&B Tables page [here](https://wandb.ai/ayush-thakur/kaggle-mlb/runs/1xpejcwd)

#### 📌 1. Number of teams - 30
#### 📌 2. How many divisions are there? 

There are 6 divisions with 5 teams per division. 

🎳 **Do It Yourself**: Head over to the table and group by `divisionId`. 

![img](https://i.imgur.com/UMorwgd.png)

#### 📌 3. Number of leagues?

There are two unique leagueIds - 103 and 104 with 15 teams per id. 103 is associated with American League while 104 is associated with National League. 

#### 📌 4. Number of unique playing locations?

There are 29 unique playing locations. There are two teams from Chicago - Chicago Cubs and Chicago White Sox. 

🎳 **Do It Yourself**: Head over to the table and group by `locationName`. 

![img](https://i.imgur.com/C0KCjNt.png)
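
The three groupings above (division, league, location) can also be sketched with pandas. The toy frame is a made-up stand-in for `teams_df`: the `id` and `divisionId` values and the "Team C"/"Team D" rows are placeholders.

```python
import pandas as pd

# Toy stand-in for teams_df; ids and division ids are hypothetical.
toy_teams = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Chicago Cubs", "Chicago White Sox", "Team C", "Team D"],
    "locationName": ["Chicago", "Chicago", "City C", "City D"],
    "leagueId": [104, 103, 103, 104],
    "divisionId": [1, 2, 2, 1],
})

# Teams per division and per league.
teams_per_division = toy_teams.groupby("divisionId").size()
teams_per_league = toy_teams.groupby("leagueId").size()

# Locations that host more than one team.
location_counts = toy_teams["locationName"].value_counts()
shared_locations = location_counts[location_counts > 1]

print(teams_per_division)
print(teams_per_league)
print(shared_locations)
```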

## 4️⃣ awards.csv

> #### 📌 Column Names and Description <br>
`awardDate` - Date award was given. <br>
`awardSeason` - Season award was from. <br>
`awardId`<br>
`awardName` <br>
`playerId` - Unique identifier for a player. <br>
`playerName` <br>
`awardPlayerTeamId` <br>

In [None]:
awards_table = wandb.Table(dataframe=awards_df)

run = wandb.init(project='kaggle-mlb', config=CONFIG)
wandb.log({'raw_awards_table': awards_table})
run.finish()
run

#### You can also check out the W&B Tables page [here](https://wandb.ai/ayush-thakur/kaggle-mlb/runs/1wu3uepk)

#### 📌 1. Number of rows - 11256
#### 📌 2. How many seasons are covered by this csv file?

There are a total of 19 seasons, spanning 1998 - 2017. If you look at the count column in the table below, you will find that the number of awards given out increased every season compared to the previous one (2017 is an exception, with fewer awards than 2016). It also seems reasonable to assume that awards from seasons close to 2017 have far more impact on engagement than awards from seasons closer to 1998. 

In [None]:
run

> Scroll through the table above to get more insights. For example, in the 2017 season, Aaron Judge received the most awards (17), followed by Ronald Acuna Jr (16) and Jose Altuve (15). The most common award was MiLB.com Organization All-Star (216), with `awardId` `MILBORGAS`. The most awards were given out on 20th June (123). 

In the `train.csv` file the start date is `2018-01-01`; thus, in my opinion, the `awardDate` in the `awards.csv` file is not that useful. However, the number of awards that a player has can be useful. A player with more awards should have a bigger fan base, thus driving more digital engagement. 
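
A minimal sketch of the "number of awards per player" feature is shown below. The frames are toy stand-ins for `players_df` and `awards_df`, and the column name `numAwards` is a hypothetical choice.

```python
import pandas as pd

# Toy stand-ins; player 3 has no awards at all.
toy_players = pd.DataFrame({"playerId": [1, 2, 3]})
toy_awards = pd.DataFrame({
    "playerId": [1, 1, 2],
    "awardId": ["MILBORGAS", "OTHER", "MILBORGAS"],
})

# Count award rows per player.
award_counts = (
    toy_awards.groupby("playerId").size().rename("numAwards").reset_index()
)

# Left join so players without awards are kept, then fill their count with 0.
players_feat = toy_players.merge(award_counts, on="playerId", how="left")
players_feat["numAwards"] = players_feat["numAwards"].fillna(0).astype(int)
print(players_feat)
```

The left join matters here: an inner join would silently drop every player who never received an award.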

#### 📌 3. Numbers of players in this csv file?

There are 1692 players in the file. Albert Pujols got 70 awards. 

🎳 **Do It Yourself**: Head over to the table and group by `playerName` and add a count column. Sort the count column in descending order. 

![img](https://i.imgur.com/dYwuUMb.png)

# Join all the Static Files

In order to join the `players.csv`, `teams.csv`, `seasons.csv`, and `awards.csv` files together we need to find the common columns. 

In the `awards.csv` file, `playerId` and `awardPlayerTeamId` are common with `playerId` in the `players.csv` file and `id` in the `teams.csv` file respectively. The `awardSeason` in the `awards.csv` file is common with `seasonId` in the `seasons.csv` file. 

With W&B Tables we can easily join the tables by using the concept of "foreign keys". Check out the table below. 

In [None]:
# Create Tables for each df
awards_table = wandb.Table(dataframe=awards_df)
players_table = wandb.Table(dataframe=players_df)
teams_table = wandb.Table(dataframe=teams_df)
seasons_table = wandb.Table(dataframe=seasons_df)

# Clean up the IDs for easy mapping
def cleanIds(mapping):
    """Build a row function that adds string versions of the given ID columns.

    `mapping` is a list of (oldKey, newKey) pairs.
    """
    def cleanIdsFn(ndx, row):
        res = {}
        for oldKey, newKey in mapping:
            value = row[oldKey]
            if isinstance(value, np.float64):
                # Float IDs such as 147.0 become "147"; missing IDs become "".
                res[newKey] = "" if np.isnan(value) else str(int(value))
            else:
                res[newKey] = str(value)
        return res
    return cleanIdsFn

awards_table.add_computed_columns(cleanIds([
    ("awardId", "aId"),
    ("awardSeason", "season"),
    ("playerId", "player"),
    ("awardPlayerTeamId", "team"),
]))
seasons_table.add_computed_columns(cleanIds([("seasonId", "sId")]))
teams_table.add_computed_columns(cleanIds([("id", "tId")]))
players_table.add_computed_columns(cleanIds([("playerId", "pId")]))

# Declare the relationship between "awards" and the other tables
awards_table.set_fk("season", seasons_table, "sId")
awards_table.set_fk("player", players_table, "pId")
awards_table.set_fk("team", teams_table, "tId")

In [None]:
run = wandb.init(project='kaggle-mlb', config=CONFIG)
wandb.log({'joined_static_table': awards_table})
run.finish()
run

#### You can also check out the W&B Tables page [here](https://wandb.ai/ayush-thakur/kaggle-mlb/runs/3qw2z9p3)
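
For comparison, the same relationships can be expressed as plain pandas left joins. This is only a sketch on toy frames: all IDs, names, and the season start date are made up, and the renamed column `teamName` is a choice made here to avoid a name clash.

```python
import pandas as pd

# Toy stand-ins for the four static files.
toy_awards = pd.DataFrame({
    "awardId": ["A1", "A2"],
    "awardSeason": [2017, 2016],
    "playerId": [10, 11],
    "awardPlayerTeamId": [100.0, None],  # a missing team ID shows up as NaN
})
toy_players = pd.DataFrame({"playerId": [10, 11], "playerName": ["Player A", "Player B"]})
toy_teams = pd.DataFrame({"id": [100], "name": ["Team X"]})
toy_seasons = pd.DataFrame({"seasonId": [2017], "seasonStartDate": ["2017-04-01"]})

# Rename the key columns so they line up with the foreign keys in awards.
teams_keyed = toy_teams.rename(columns={"id": "awardPlayerTeamId", "name": "teamName"})
seasons_keyed = toy_seasons.rename(columns={"seasonId": "awardSeason"})

# Left joins keep every award row, even when a team or season has no match.
joined = (
    toy_awards
    .merge(toy_players, on="playerId", how="left")
    .merge(teams_keyed, on="awardPlayerTeamId", how="left")
    .merge(seasons_keyed, on="awardSeason", how="left")
)
print(joined)
```

Awards from seasons that are not in the seasons file, or without a team ID, end up with missing values in the joined columns rather than being dropped.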

# <span style="color:blue">WORK IN PROGRESS. :)</span>