# <center><font color='#041e42'>MLB Player Digital Engagement:</font> <font color='#bf0d3e'>Complete In-depth Analysis</font></center>

<center><img src= "https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F603584%2Ff7c0669c09db26bd45f76ade61b2a91f%2FGoogleCloud_MLB_Lockup.jpg?generation=1623359034713516&alt=media" alt ="Kaggle MLB Competition Image"></center>


# <font color='#093e88'>1. Introduction</font>

### 👉 A Small Description
* Identify the many other factors which pique supporter engagement and create deeper relationships between players and fans.
* Baseball fans watch, read, and search for information about players. Which individuals they seek can depend on player performance, team standings, and popularity, among other, currently unknown, factors.
* Predict how fans engage with MLB players’ digital content on a daily basis for a future date range. 

### 👉 Evaluation Metric
* Submissions are evaluated on the mean column-wise mean absolute error (MCMAE). A mean absolute error is calculated for each of the four target variables and the score is the average of those four MAE values.
* For each "date_playerId" you need to predict all four target variables.

|date_playerId|target1|target2|target3|target4|
|--|--|--|--|--|
|20210501_642727|0|100|70|50|

* When predicting on a given date, you are forecasting the target variables for the next day (i.e. for date d, you're predicting the engagement for day d+1).
* Your code will need to be robust and make predictions for any date_playerId combination requested by the module.
* Submit predictions using the provided 'mlb' Python time series module.
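The MCMAE described above can be sketched in a few lines of pandas. This is a minimal illustration, not the competition's official scorer: the helper name `mcmae` and the toy values are assumptions made up for this example.

```python
import pandas as pd

TARGET_COLS = ["target1", "target2", "target3", "target4"]

def mcmae(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    """Mean column-wise mean absolute error over the four targets."""
    # MAE for each target column, then the plain average of those four values
    per_column_mae = (y_true[TARGET_COLS] - y_pred[TARGET_COLS]).abs().mean(axis=0)
    return float(per_column_mae.mean())

# Toy data (made up for illustration, not competition data)
y_true = pd.DataFrame({"target1": [0.0, 2.0], "target2": [100.0, 90.0],
                       "target3": [70.0, 60.0], "target4": [50.0, 40.0]})
y_pred = pd.DataFrame({"target1": [1.0, 2.0], "target2": [98.0, 94.0],
                       "target3": [70.0, 64.0], "target4": [45.0, 45.0]})

score = mcmae(y_true, y_pred)
print(score)
```

Because every column gets equal weight, a target with a larger typical error has more influence on the final score.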

### 👉 Test Sets for Evaluation:
There are 2 test sets for evaluation:
1. During the training phase:  
This unseen test set (provided by the time series module 'mlb') comprises data for the **month of May 2021** and the set of players active this year.

2. During the evaluation phase:  
This test set will be a **future** in-season range of approximately **one month**.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

c1 = '#b90d3c'
c2 = '#041d40'

# <font color='#093e88'>2. Sneak Peek at Datasets</font>
The datasets in this competition contain a vast amount of information. First, the data includes a set of static files that do not change with time (players.csv, teams.csv, seasons.csv, awards.csv) as well as daily data (train.csv), which is grouped by day. Second, each cell in the daily data (train.csv) is a JSON dictionary, and a single JSON cell can be as large as an entire dataset.

That is why, before we move on to analysis and modeling, we will first understand and make sense of the data.
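To make the "JSON inside each cell" layout concrete, here is a minimal sketch that expands one nested column into a long table with the date attached. The toy frame, the field names inside the JSON, and the helper `unpack_column` are assumptions for illustration only; the real train.csv has many more columns and fields.

```python
import json
import pandas as pd

# Toy stand-in for train.csv: one row per day, each cell a JSON string or NaN.
# Field names inside the JSON are made up for illustration.
toy_train = pd.DataFrame({
    "date": [20180101, 20180102],
    "games": [
        json.dumps([{"gamePk": 1, "homeScore": 3}, {"gamePk": 2, "homeScore": 5}]),
        float("nan"),
    ],
})

def unpack_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Expand every non-null JSON cell of `column` into rows, keeping the date."""
    frames = []
    for date, cell in zip(df["date"], df[column]):
        if pd.isna(cell):
            continue  # no data recorded for this day
        day_df = pd.DataFrame(json.loads(cell))
        day_df.insert(0, "date", date)
        frames.append(day_df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

games_long = unpack_column(toy_train, "games")
print(games_long)
```

Days with a missing cell simply contribute no rows, which keeps the long table free of placeholder records.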

### i. **example_sample_submission.csv**

In [None]:
example_sample_submission_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/example_sample_submission.csv', parse_dates=['date'])
print(f'\nShape of Example Sample Submission CSV: {example_sample_submission_df.shape}\n\n')
example_sample_submission_df.head()

> ###  📌 Observations from Example Sample Submission
> * It is not explicitly stated what the target variables (`target1` to `target4`) represent.
> * **Very important to note**: Predictions for the subsequent day are made from the current day's data.
> * For example: Using the data of 2021-04-26, we make predictions for 2021-04-27 for multiple players.

### ii. **example_test.csv**

In [None]:
example_test_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/example_test.csv')
print(f'\nShape of Example Test CSV: {example_test_df.shape}\n\n')
example_test_df

> ### 📌 Observations from Example Test data
> * All the columns contain either JSON data or `np.nan` values.
> * `date_playerId`, which is required in the submission file, is missing from example_test.
> * Here, each row represents a single day, but submissions are to be made for multiple players on each day.
> * So, using the first row (2021-04-26), we make predictions for 2021-04-27 for multiple players (like 20210427_player1, 20210427_player2, and so on).
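The key construction observed above can be sketched as follows. It assumes the `date_playerId` format is `YYYYMMDD_playerId` with the prediction date (current date plus one day), as in the example IDs in this notebook; the helper name and the player IDs are hypothetical.

```python
import pandas as pd

def make_date_player_ids(current_date: str, player_ids: list) -> list:
    """Build submission keys for the day after `current_date`."""
    # Assumption: the key carries the prediction date (current date + 1 day)
    next_day = pd.Timestamp(current_date) + pd.Timedelta(days=1)
    prefix = next_day.strftime("%Y%m%d")
    return [f"{prefix}_{pid}" for pid in player_ids]

# Hypothetical player IDs, for illustration only
keys = make_date_player_ids("2021-04-26", [111111, 222222])
print(keys)
```

Using `pd.Timedelta` rather than adding 1 to an integer date keeps month and year boundaries correct.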

### iii. **awards.csv**

In [None]:
# Read in CSV
awards_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/awards.csv', parse_dates=['awardDate'])
print(f'\nShape of Awards Dataset: {awards_df.shape}\n\n')
awards_df.head()

In [None]:
# 20 Most frequently given awards
awards_df.awardName.value_counts().head(20)

> The most frequently given award is "MiLB.com Organization All-Star" (1307 times), almost 3x the count of the next most frequent award.

In [None]:
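# Number of awards per season, ordered by year
award_yr = awards_df.awardSeason.value_counts().sort_index()

# Bar plot (uses the Series index/values directly, so it does not depend on
# the column names produced by reset_index(), which differ across pandas versions)
plt.figure(figsize=(8,5))
sns.barplot(x=award_yr.index, y=award_yr.values, color='red', alpha=0.7)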
sns.despine(bottom=True, left=True)
plt.xticks(rotation=45)
plt.xlabel('Years', fontsize=13)
plt.ylabel('Award Count', fontsize=13)
plt.title('Yearly Count of Awards', fontsize=15)
plt.show()

In [None]:
# Top 10 players with most awards
awards_df.groupby(['playerName']).count()['awardPlayerTeamId'].sort_values(ascending=False).head(10)

### 📌 Observations from Awards data

* Award dates range from June 1998 to December 2017
* There are 455 unique awards
* The most frequently given award is "MiLB.com Organization All-Star" (1307 times), almost 3x the count of the next most frequent award
* So far, around 1700 players have received almost 11500 awards in total
* For the past 5 years, 1000+ awards have been given out each season
* Only a small proportion of players have 40+ awards to their name

### iv. **players.csv**
This data contains players' basic personal as well as positional information. 

In [None]:
# Read in CSV
players_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/players.csv')
print(f'\nShape of Dataset: {players_df.shape}\n\n')
players_df.head()

### v. **seasons.csv**
This data contains various season dates for the last 5 seasons (including the current season).

In [None]:
seasons_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/seasons.csv')
print(f'\nShape of Dataset: {seasons_df.shape}\n\n')
seasons_df.head()

### vi. **teams.csv**
This data contains basic team information for all 30 teams, such as teamId, leagueId, divisionId, venueId, and so on.

In [None]:
teams_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/teams.csv')
print(f'\nShape of Dataset: {teams_df.shape}\n\n')
teams_df.head()

### vii. **train.csv**

In [None]:
# Read in CSV
train_df = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/train.csv', parse_dates=['date'])
print(f'\nShape of Dataset: {train_df.shape}\n\n')
train_df.head()

> ### 📌 Observations from train.csv
> * All the columns contain either JSON data or `np.nan` values.
> * `date_playerId`, which is required in the submission file, is not an attribute here.
> * Here, each row represents a single day, but submissions are to be made for multiple players on each day.
> * So, using the first row (2018-01-01), we will make predictions for 2018-01-02 for multiple players (like 20180102_player1, 20180102_player2, and so on).

#### 🔹 **Game Stats**: A single cell from the 'games' column

In [None]:
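# Function to show a single JSON cell as a DataFrame
from io import StringIO

def show_single_json(json_str):
    # Wrap in StringIO: passing a literal JSON string to read_json is deprecated in recent pandas
    return np.nan if pd.isna(json_str) else pd.read_json(StringIO(json_str))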

In [None]:
# Game Information/Stats on 2021-04-30
print('Game Information/Stats Status on 2021-04-30')
print('\n')

show_single_json(train_df['games'].iloc[1215])

> #### 📌 Observations from **games JSON**
> * Contains game information/stats for all the games played on a particular date, such as home team stats, away team stats, win/loss, score, etc.
> * The result of a game likely has a big impact on fan engagement with players on digital platforms. We will analyse this later and see whether the data supports the claim.

#### 🔹 **Roster Status**: A single cell from the 'rosters' column

In [None]:
# Squad Status on 2021-04-30
print('Squad Status on 2021-04-30')
print('\n')

show_single_json(train_df['rosters'].iloc[1215])

In [None]:
# Roster Status Frequency for the day 2021-04-30
print(show_single_json(train_df['rosters'].iloc[1215]).status.value_counts())

In [None]:
# Visualizing Roster Status Frequency for the day 2021-04-30
# Parse the JSON cell once and reuse the counts for both axes
status_counts = show_single_json(train_df['rosters'].iloc[1215]).status.value_counts()
plt.figure(figsize=(8,5))
sns.barplot(y=status_counts.index, x=status_counts.values, color='red', alpha=0.70)
sns.despine(bottom=True, left=True)
plt.xlabel('Player Count', fontsize=13)
plt.ylabel('Player Status', fontsize=13)
plt.title('Roster Count on 2021-04-30', fontsize=15)
plt.show()

> #### 📌 Observations from **rosters JSON**
> * Contains the squad status for all the matches played on a particular date.
> * Normally, active players should have more engagement than injured players. In rare cases the opposite is possible, with injured players getting more engagement. This may well depend on the player's injury, stardom, or some other factor such as club rivalries.
> * Again, we will analyse this later based on the data.
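One way to check the active-versus-injured hypothesis is to join roster status onto the engagement targets and average per status. The sketch below uses toy frames standing in for one day's unpacked 'rosters' and 'nextDayPlayerEngagement' cells; the `playerId` join key, the status labels, and all values are assumptions made up for illustration.

```python
import pandas as pd

# Toy frames standing in for one day's unpacked 'rosters' and
# 'nextDayPlayerEngagement' cells; all values are made up.
rosters = pd.DataFrame({
    "playerId": [1, 2, 3, 4],
    "status": ["Active", "Active", "Injured 10-Day", "Injured 10-Day"],
})
engagement = pd.DataFrame({
    "playerId": [1, 2, 3, 4],
    "target1": [4.0, 6.0, 1.0, 3.0],
    "target2": [10.0, 20.0, 5.0, 5.0],
})

# Attach roster status to each player's targets, then average per status
merged = engagement.merge(rosters, on="playerId", how="left")
mean_by_status = merged.groupby("status")[["target1", "target2"]].mean()
print(mean_by_status)
```

On real data the same comparison would need to be repeated across many dates before drawing any conclusion, since a single day can be dominated by a few star players.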

#### 🔹 **Player Box Scores**: A single cell from the 'playerBoxScores' column

In [None]:
# Player Box Scores on 2021-04-30
print('Player Box Scores on 2021-04-30')
print('\n')

show_single_json(train_df['playerBoxScores'].iloc[1215]).head()

#### 🔹 **Transactions**: A single cell from the 'transactions' column

In [None]:
# Transactions on 2021-04-29 (row 1214 is the day before row 1215, i.e. 2021-04-30)
print('Transactions on 2021-04-29')
print('\n')

show_single_json(train_df['transactions'].iloc[1214]).head(10)

In [None]:
# Type of Transaction on 2021-04-29
show_single_json(train_df['transactions'].iloc[1214]).typeDesc.value_counts()

> #### 📌 Observations from **transactions JSON**
> * Transactions here represent changes made to players, such as a status change (e.g. injured), recall, selection, etc.
> * **Important attributes** from the transactions:
>     1. typeCode (transaction status abbreviation)
>     2. fromTeamId, toTeamId representing the teams involved

#### 🔹 **Standings**: A single cell from the 'standings' column

* Contains daily standings information for the 30 teams
* Contains all kinds of standings table stats such as streaks, league rank, division rank, wins/losses, etc.

In [None]:
# Standings on 2021-04-29
print('Standings on 2021-04-29')
print('\n')

show_single_json(train_df['standings'].iloc[1214])

#### 🔹 **Awards**: A single cell from the 'awards' column

In [None]:
# Awards handed out on 2021-04-30
print('Awards handed out on 2021-04-30')
print('\n')

show_single_json(train_df['awards'].iloc[1215])

> #### 📌  Observations from **awards JSON**
> * Contains information about the awards given out on a particular date in the training dataset. As we have already seen, the separate CSV file named awards.csv contains all the information about awards given out prior to the 2018 season.
> 
> * On one hand, the awards given out before 2018 can help us understand how big a player is, their influence, and how people engage with them on a long-term basis. On the other hand, the daily award information will help us understand how fans engage with the players' digital content on a daily, much shorter, basis.

#### 🔹 **Twitter Followers**: A single cell from the 'teamTwitterFollowers' & 'playerTwitterFollowers' columns
* Contains the Twitter follower counts for players and teams on a particular date.

In [None]:
# A single cell containing Twitter Followers details of multiple "players" on a particular date
show_single_json(train_df['playerTwitterFollowers'].iloc[0])

In [None]:
# A single cell containing Twitter Followers details of all "teams" on a particular date
show_single_json(train_df['teamTwitterFollowers'].iloc[0])

#### 🔹 **Target Variables**: A single cell from the 'nextDayPlayerEngagement' column
* This is what we have to predict: the next-day target variables, representing player engagement, based on the current day's data.

In [None]:
# Next Day (2021-05-01) Target Variables as of 2021-04-30
show_single_json(train_df['nextDayPlayerEngagement'].iloc[1215]).head()

Now that we understand the structure of the datasets and what they represent, we can start the exploratory data analysis!

# <font color='#093e88'>3. Exploratory Data Analysis</font>

# <center><font color='#041e42'>This Notebook is a Work in Progress...</font> <font color='#bf0d3e'>It will be updated on a daily basis...</font></center>