<br>
<h1 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;"> MLB Player-Digi Engagement Forecasting · Quicklook  <br> Data Analysis & Insights </h1>
<br>

# Data Description 

You are tasked with forecasting four different measures of engagement (target1-target4) for a subset of MLB players who are active in the 2021 season. The data contains a set of static files that do not change with time (players.csv, teams.csv, seasons.csv, awards.csv) as well as daily data (train.csv) which is grouped by day. When predicting on a given date, you are forecasting the target variables for the next day (i.e. for date d, you're predicting the engagement for day d+1).
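
The next-day alignment described above can be sketched on toy data. The dates below are invented for illustration; only the rule that a row dated d carries engagement measured on d+1 comes from the description.

```python
import pandas as pd

# Toy daily rows (hypothetical dates): `date` is the day the features are known.
daily = pd.DataFrame(
    {"date": pd.to_datetime(["2021-04-29", "2021-04-30", "2021-05-01"])}
)

# The targets attached to a row dated d describe engagement on d + 1.
daily["engagementMetricsDate"] = daily["date"] + pd.Timedelta(days=1)

print(daily)
```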

This is a code competition that relies on a time-series module to ensure models do not peek forward in time. The time series module provides you with the test data and writes your submission file automatically. The test data arrives in a data frame identical in format to train.csv, except it does not contain the target values. To submit, follow the instructions on the Evaluation page. When you submit your notebook, it will be rerun on an unseen test set:

 - During the Training phase of the competition, this unseen test set comprises data for the month of May 2021 and the set of active players this year.
 - During the Evaluation phase, the test set will be a future in-season range of approximately one month.

Your code will need to be robust and make predictions for any date_playerId combination requested by the module. Each team's selected notebooks (up to 2 per team, selected by the Final Submission Deadline) will be rerun during the Evaluation phase.
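
One way to stay robust to unexpected date_playerId requests is to fall back to a default value when no prediction exists. The sketch below is only an illustration: the `known` predictions, the request frame, the assumed `YYYYMMDD_playerId` key layout, and the median fallback are all assumptions, not part of the competition API.

```python
import pandas as pd

TARGETS = ["target1", "target2", "target3", "target4"]

# Hypothetical model output: predictions exist only for players seen in training.
known = pd.DataFrame(
    {
        "playerId": [1, 2],
        "target1": [5.0, 1.0],
        "target2": [3.0, 2.0],
        "target3": [0.5, 0.1],
        "target4": [7.0, 4.0],
    }
).set_index("playerId")

# Hypothetical request: one row per date_playerId (assumed "YYYYMMDD_playerId").
request = pd.DataFrame(
    {"date_playerId": ["20210501_1", "20210501_2", "20210501_99"]}
)

# Recover the player id from the key, then look up the predictions.
request["playerId"] = request["date_playerId"].str.split("_").str[1].astype(int)
preds = known.reindex(request["playerId"])[TARGETS].reset_index(drop=True)

# Players without a prediction get the per-target median as a fallback.
preds = preds.fillna(known[TARGETS].median())
submission = pd.concat([request[["date_playerId"]], preds], axis=1)
print(submission)
```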

Before diving into specifics, some high level qualifications about the data:

 - Some self-explanatory fields do not have an explanation (for example: season).
 - Binary columns will have null values as well as zeroes. Zeroes will occur if a player had an opportunity to do something, but did not. Nulls will occur if a player never had the opportunity to do something (for example: a player who does not pitch on a given day cannot possibly pitch a shutout, therefore a null value is expected).
 - Most game state related fields (balls, strikes, outs, etc.) represent the game state after the event in question. Home score and away score, however, represent the score before an event.
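
Because null and zero mean different things in the binary columns, it can help to record the "had the opportunity" signal before imputing. A minimal sketch, assuming a hypothetical `shutouts` column and invented rows:

```python
import numpy as np
import pandas as pd

# Hypothetical box-score rows: `shutouts` is NaN for players who did not pitch.
box = pd.DataFrame({"playerId": [10, 11, 12], "shutouts": [1.0, 0.0, np.nan]})

# Keep the opportunity flag separate so that filling NaN does not erase it.
box["had_opportunity"] = box["shutouts"].notna().astype(int)
box["shutouts_filled"] = box["shutouts"].fillna(0)

print(box)
```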



We are given 7 csv files:

 - train.csv: training set
 - example_test.csv: example of the test set
 - example_sample_submission.csv: example of a sample submission
 - awards.csv: awards won by players before 2018
 - players.csv: library of high-level information about all players
 - seasons.csv: start and end dates of all seasons in this dataset
 - teams.csv: library of high-level information about all MLB teams

In [None]:
!pip install -q sweetviz
!pip install -q klib
!pip install -q raceplotly

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

from sklearn.preprocessing import MinMaxScaler,StandardScaler

import klib

# Data Visualisation libraries 
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
from raceplotly.plots import barplot

sns.set(rc={'figure.figsize':(20.7,20.27)})


import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
players = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/players.csv')
seasons = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/seasons.csv')
awards = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/awards.csv')
teams = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/teams.csv')
train = pd.read_csv('../input/mlb-player-digital-engagement-forecasting/train.csv')

In [None]:
train.head()

In [None]:
train.info()

In [None]:
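import ast

# Each row of train.csv is one date; column 1 (nextDayPlayerEngagement) holds a
# stringified list of per-player records. Parse it with ast.literal_eval, a safe
# replacement for eval(), and flatten all dates into one long frame.
records = []
for raw in tqdm(train.iloc[:, 1]):
    records += ast.literal_eval(raw)
train_next_day = pd.DataFrame(records)
train_next_day['engagementMetricsDate'] = pd.to_datetime(train_next_day['engagementMetricsDate'])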

In [None]:
train_next_day.groupby(['engagementMetricsDate']).agg({'playerId':'count'})

In [None]:
train_next_day['engagementMetricsDate_day'] = train_next_day['engagementMetricsDate'].dt.day
train_next_day['engagementMetricsDate_month'] = train_next_day['engagementMetricsDate'].dt.month
train_next_day['engagementMetricsDate_year'] = train_next_day['engagementMetricsDate'].dt.year
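# .dt.weekofyear is deprecated (removed in pandas 2.0); isocalendar().week is the replacement
train_next_day['engagementMetricsDate_week'] = train_next_day['engagementMetricsDate'].dt.isocalendar().week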

In [None]:
# Mean of each target per (year, month)
plyr_eng_agg_years = (
    train_next_day
    .groupby(['engagementMetricsDate_year', 'engagementMetricsDate_month'])
    [['target1', 'target2', 'target3', 'target4']]
    .mean()
    .reset_index()
)

In [None]:
plyr_eng_agg_years_tgt1 =  plyr_eng_agg_years.pivot(index='engagementMetricsDate_year',columns=['engagementMetricsDate_month'],values=['target1']).reset_index()

In [None]:
plyr_eng_agg_years_tgt1[['engagementMetricsDate_year','target1']]

In [None]:
pd.concat([plyr_eng_agg_years_tgt1.engagementMetricsDate_year,plyr_eng_agg_years_tgt1.target1],axis=1)

## Observations

*   Mean `target1` engagement is highest from March to September.
*   January, February, October, November and December show lower engagement.
*   2020: due to Covid-19, engagement is lower than in 2018 and 2019.
*   2021: engagement is back to normal.
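
The seasonal comparison above can be put into numbers by splitting months into an in-season window and the rest. This is only a sketch: the values are invented, and treating March to September as "in season" is an assumption taken from the observation, not from seasons.csv.

```python
import pandas as pd

# Toy monthly means (hypothetical numbers), same long layout as plyr_eng_agg_years.
monthly = pd.DataFrame(
    {
        "engagementMetricsDate_year": [2019, 2019, 2019, 2020, 2020, 2020],
        "engagementMetricsDate_month": [1, 4, 8, 1, 4, 8],
        "target1": [0.2, 1.0, 1.2, 0.2, 0.4, 0.8],
    }
)

# Flag March..September (inclusive) as in-season, then average per year and flag.
monthly["in_season"] = monthly["engagementMetricsDate_month"].between(3, 9)
summary = (
    monthly.groupby(["engagementMetricsDate_year", "in_season"])["target1"]
    .mean()
    .unstack("in_season")
)
print(summary)
```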

### Univariate Analysis of the Target Variables

In [None]:
klib.dist_plot(train_next_day['target1'])

# Player Dataset Analysis

In [None]:
klib.missingval_plot(players)

#### Observations
* About 25% of `birthStateProvince` values are missing in the dataset.
* About 3% of `mlbDebutDate` values are missing.
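
The missing-value shares read off the plot can also be computed directly. A minimal sketch on an invented frame; on the real data the same expression would be applied to `players`:

```python
import pandas as pd

# Toy stand-in for players.csv (hypothetical rows).
toy_players = pd.DataFrame(
    {
        "birthStateProvince": ["CA", None, "TX", "FL"],
        "mlbDebutDate": ["2015-04-06", "2018-06-01", "2019-03-28", "2020-07-24"],
    }
)

# Share of missing values per column, in percent, largest first.
missing_pct = toy_players.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```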

# Player Distribution by Birth Country

In [None]:
source = players['birthCountry'].value_counts()

In [None]:
#players['birthCountry'].value_counts()
fig = go.Figure(data=[go.Pie(labels=source.index,values=source.values)])
fig.update_layout(title='BirthCountry distribution')
fig.show()

### Observation
* About 85% of players were born in the USA, the Dominican Republic, Venezuela or Cuba.
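
A cumulative share makes a "top countries cover X%" statement easy to check. The sketch below uses an invented series; on the real data it would be `players['birthCountry']`.

```python
import pandas as pd

# Toy birth countries (hypothetical counts).
toy_country = pd.Series(
    ["USA"] * 6 + ["Dominican Republic"] * 2 + ["Venezuela", "Japan"]
)

# Share per country (sorted descending) and the running total of those shares.
share = toy_country.value_counts(normalize=True)
cumulative = share.cumsum()
print(cumulative)
```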

In [None]:
src1 = players['primaryPositionName'].value_counts()
fig = go.Figure(data=[go.Pie(labels=src1.index,values=src1.values)])
fig.update_layout(title='primaryPositionName distribution')
fig.show()

### Observations
* The majority of players are pitchers, outfielders, catchers, second basemen or first basemen.
* Very few are listed as Designated Hitter or Infield.

In [None]:
players.head()

# Bivariate analysis in Player Dataset

In [None]:
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(players[['heightInches','weight']])
players = pd.concat([players,pd.DataFrame(scaled_values,columns=['heightInches_minmax','weight_min_max'])],axis=1)

In [None]:
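# pairplot builds its own figure, so set the size with `height` (inches per facet)
# rather than a preceding plt.figure call, which would only add an empty figure.
sns.pairplot(players[['primaryPositionName','heightInches','weight']], hue='primaryPositionName', height=6)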

In [None]:
# Same plot on the min-max scaled columns; size is again set through `height`.
sns.pairplot(players[['primaryPositionName','heightInches_minmax','weight_min_max']], hue='primaryPositionName', height=6)

#### Observation
* For pitchers, weight is positively correlated with height (`heightInches`).
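
The visual impression from the pair plot can be backed by a per-position correlation coefficient. A minimal sketch on invented measurements; the column names follow players.csv, the values do not.

```python
import pandas as pd

# Toy height/weight rows (hypothetical values).
toy = pd.DataFrame(
    {
        "primaryPositionName": ["Pitcher"] * 4 + ["Catcher"] * 3,
        "heightInches": [72, 74, 76, 78, 70, 72, 74],
        "weight": [190, 205, 215, 235, 210, 200, 220],
    }
)

# Pearson correlation between height and weight within each position.
corr_by_pos = {
    pos: grp["heightInches"].corr(grp["weight"])
    for pos, grp in toy.groupby("primaryPositionName")
}
print(corr_by_pos)
```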

In [None]:
players['DOB'] = pd.to_datetime(players['DOB'])
players['Age']=2021-players['DOB'].dt.year

In [None]:
fig = go.Figure(data=[go.Box(x=players.primaryPositionName, y=players.Age)])
fig.update_layout(title='primaryPositionName vs Age')
fig.show()

In [None]:
for pos in players.primaryPositionName.unique().tolist():
    print(" Avg Age of ",pos,"=",round(players.loc[players['primaryPositionName']==pos]['Age'].mean(),0))

In [None]:
for pos in players.primaryPositionName.unique().tolist():
    print(" primaryPositionName = ",pos)
    klib.dist_plot(players.loc[players['primaryPositionName']==pos]['Age'])

#### Observations
* Designated hitters are the most experienced group: ages range from 32 to 43, with a mean of 36.5 years.
* For the other positions, age broadly ranges from 25 to 33 years.
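
The min, max and mean ages per position can be produced in one table instead of reading them off each plot. A sketch on invented ages; on the real data the same call would run on `players`.

```python
import pandas as pd

# Toy ages (hypothetical values).
toy = pd.DataFrame(
    {
        "primaryPositionName": [
            "Designated Hitter", "Designated Hitter",
            "Pitcher", "Pitcher", "Pitcher",
        ],
        "Age": [32, 41, 25, 29, 33],
    }
)

# One row per position with the minimum, maximum and mean age.
age_stats = toy.groupby("primaryPositionName")["Age"].agg(["min", "max", "mean"])
print(age_stats)
```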

In [None]:
players[['primaryPositionName','heightInches_minmax','weight_min_max']]

In [None]:
# jointplot also creates its own figure; pass columns by keyword and size it with `height`.
sns.jointplot(data=players, x='heightInches_minmax', y='weight_min_max', hue='primaryPositionName', height=12)

## Let's see what is in the Teams dataset

In [None]:
teams.head(10)

In [None]:
klib.missingval_plot(teams)

In [None]:
klib.cat_plot(teams)

In [None]:
teams['locationName'].value_counts()

# Let's look into Awards

In [None]:
award_with_team = awards.merge(teams[['id','name', 'teamName','leagueName', 'divisionId', 'divisionName']],left_on='awardPlayerTeamId',right_on='id')
award_with_team_agg =  award_with_team.groupby(['awardSeason','name']).agg({'divisionName':'count','awardSeason':'min'}).rename(columns={'divisionName':'Award_count','awardSeason':'min_award_year'}).reset_index()

In [None]:
my_raceplot = barplot(award_with_team_agg,  item_column='name', value_column='Award_count', time_column='awardSeason')
my_raceplot.plot(item_label = 'team name', value_label = 'Award Count', frame_duration = 800)

In [None]:
award_with_team.head()

In [None]:
award_with_team_agg.pivot(index='name',columns='awardSeason',values='Award_count').fillna(0)

In [None]:
sns.heatmap(award_with_team_agg.pivot(index='name',columns='awardSeason',values='Award_count').fillna(0))

#### Observations
* Chicago Tigers and Houston Astros received the most awards in 2015 and 2017 compared to other teams.
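
The team with the most awards in each season can be read from the pivot table rather than from the heatmap colours. A minimal sketch with placeholder team names and invented counts:

```python
import pandas as pd

# Toy award counts (hypothetical teams and numbers), same layout as award_with_team_agg.
toy = pd.DataFrame(
    {
        "awardSeason": [2015, 2015, 2017, 2017],
        "name": ["Team A", "Team B", "Team A", "Team B"],
        "Award_count": [5, 2, 1, 6],
    }
)

# Teams as rows, seasons as columns; idxmax returns the top team for each season.
pivot = toy.pivot(index="name", columns="awardSeason", values="Award_count").fillna(0)
top_team = pivot.idxmax()
print(top_team)
```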

### Team Cohort in Awards

In [None]:
award_with_team.head()

In [None]:
pd.concat(
    [
        award_with_team.groupby(['name']).agg({'divisionName':'count','awardDate':'min'}).rename(columns={'divisionName':'Award_count','awardSeason':'min_award_year','awardDate':'awardDate_min'}),
        award_with_team_agg.pivot(index='name',columns='awardSeason',values='Award_count').fillna(0)
    ],
    axis=1
).reset_index()

## Players Award Churn in History

In [None]:
award_with_team.groupby(['awardSeason','playerName'])\
.agg({'divisionName':'count'}).rename(columns={'divisionName':'Award_count','awardSeason':'min_award_year','awardDate':'awardDate_min'})\
.reset_index()\
.sort_values('Award_count',ascending=False)\
.head(500)\
.pivot(index='playerName',columns='awardSeason',values='Award_count')\
.fillna(0)

In [None]:
sns.heatmap(award_with_team.groupby(['awardSeason','playerName'])\
.agg({'divisionName':'count'}).rename(columns={'divisionName':'Award_count','awardSeason':'min_award_year','awardDate':'awardDate_min'})\
.reset_index()\
.sort_values('Award_count',ascending=False)\
.head(500)\
.pivot(index='playerName',columns='awardSeason',values='Award_count')\
.fillna(0))
