# Introduction and Imports 📔

Let's get started with this new tabular data competition! 

Below is a brief introduction to the competition and the data files we will be working with.

In [None]:
! pip install -q rich

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import random

import plotly.graph_objects as go

from rich import print as _pprint

In [None]:
def cprint(string):
    """
    Utility function for beautiful colored printing.
    """
    _pprint(f"[black]{string}[/black]")

def show_pd_table(df, name):
    """
    Render a DataFrame as an interactive Plotly table titled `name`.
    """
    # One list of cell values per column, in column order
    cells_out = [df[x] for x in df.columns]

    fig = go.Figure(data=[go.Table(
        header=dict(values=list(df.columns),
                    fill_color='paleturquoise',
                    align='left'),
        cells=dict(values=cells_out,
                   fill_color='lavender',
                   align='left'))
    ])

    # Center the title near the top of the figure
    fig.update_layout(
        title={
            'text': name,
            'y': 0.9,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'})

    fig.show()

## About the Competition 🏴‍☠️

A player hits a walk-off home run. A pitcher throws a no-hitter. A team gets red hot going into the Postseason. We know some of the catalysts that increase baseball fan interest. Now Major League Baseball (MLB) and Google Cloud want the Kaggle community’s help to identify the many other factors which pique supporter engagement and create deeper relationships between players and fans.

In this competition, we will predict how fans engage with MLB players’ digital content on a daily basis for a future date range. You’ll have access to player performance data, social media data, and team factors like market size. Successful models will provide new insights into what signals most strongly correlate with and influence engagement.

<hr>

## What is our task? 🎯

In this competition, we are tasked with forecasting four different measures of engagement (`target1`-`target4`) for a subset of MLB players who are active in the 2021 season.

The data contains a set of static files that do not change over time, as well as a training file of daily data grouped by day.

This is a code competition that relies on a time-series module to ensure models do not peek forward in time. The time series module provides you with the test data and writes your submission file automatically. The test data arrives in a data frame identical in format to `train.csv`, except it does not contain the target values.

We highly recommend visiting the [Evaluation](https://www.kaggle.com/c/mlb-player-digital-engagement-forecasting/overview/evaluation) page for more details on submission.
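The Evaluation page is the authority on how submissions are scored. As a rough sketch for local validation, assuming the score is a mean column-wise mean absolute error over the four targets (an assumption not stated in this notebook), it could be computed as below. The function name `mean_columnwise_mae` and the sample arrays are illustrative only.

```python
import numpy as np

def mean_columnwise_mae(y_true, y_pred):
    """Average the per-column MAE across all target columns (assumed metric)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # MAE of each target column, computed over rows
    per_column = np.abs(y_true - y_pred).mean(axis=0)
    # Unweighted mean over the target columns
    return float(per_column.mean())

# Hand-made values: 2 rows, 4 targets (target1..target4)
y_true = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 0.0, 1.0, 0.0]])
y_pred = np.array([[0.0, 2.0, 5.0, 4.0],
                   [2.0, 1.0, 1.0, 2.0]])

score = mean_columnwise_mae(y_true, y_pred)
print(f"Mean column-wise MAE: {score:.4f}")
```

If the assumed metric differs from the one on the Evaluation page, only the body of the function needs to change.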

<hr>

## What does the Data look like? 🗃

The data provided in this competition consists of 7 `.csv` files and 1 folder called `mlb/`.

Below is the breakdown of the `.csv` files:

* 📄 `train.csv` - This file is the training dataset.

* 📄 `awards.csv` - This file is a collection of awards given out prior to the first date in the training file.

* 📄 `players.csv` - This file contains high level information about all MLB players in this dataset.

* 📄 `seasons.csv` - This file contains information about start and end dates of all seasons in this dataset.

* 📄 `teams.csv` - This file contains high level information about all MLB teams.

* 📄 `example_test.csv` - This file is an example in the form of the test set that you’ll be evaluated on.

* 📄 `example_sample_submission.csv` - This file is a sample submission file in the correct format based on the example test set.

<div class="alert alert-block alert-info">
This is a Code competition, and submissions must be made using the <code>mlb</code> Python module.
</div>

Sample template using `mlb` module to write a submission file:

```python
import mlb
env = mlb.make_env() # initialize the environment
iter_test = env.iter_test() # iterator which loops over each date in test set

for (test_df, sample_prediction_df) in iter_test:
    sample_prediction_df['target1'] = 100  # make predictions here
    env.predict(sample_prediction_df)
```
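The template above fills only `target1`, while the task covers `target1`-`target4`. Below is a minimal sketch of filling every target column with a constant baseline. It runs against a hand-made stand-in for `sample_prediction_df` so that it works outside the competition environment. The helper `fill_targets`, the `date_playerId` column, and the baseline value are assumptions for illustration, not part of the `mlb` module.

```python
import pandas as pd

TARGET_COLS = ["target1", "target2", "target3", "target4"]

def fill_targets(sample_prediction_df, value=0.0):
    """Return a copy with every target column set to `value` (hypothetical helper)."""
    out = sample_prediction_df.copy()
    for col in TARGET_COLS:
        out[col] = value
    return out

# Hand-made stand-in for one day's sample_prediction_df (layout is assumed)
demo_submission = pd.DataFrame({
    "date_playerId": ["20210427_1", "20210427_2"],
    "target1": [0.0, 0.0],
    "target2": [0.0, 0.0],
    "target3": [0.0, 0.0],
    "target4": [0.0, 0.0],
})

filled = fill_targets(demo_submission, value=1.5)
print(filled)
```

Inside the test loop, the single-column assignment could then become `env.predict(fill_targets(sample_prediction_df, value))`, with `value` swapped for real model output once a model exists.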

<hr>

## Peeking at the Data 📈

Now that you have an understanding of the task and the dataset, let's start by looking at the different data files provided and some stats on them.

In [None]:
awards = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/awards.csv")
players = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/players.csv")
seasons = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/seasons.csv")
teams = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/teams.csv")
train = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/train.csv")
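The files loaded here all sit in the same input directory, so the loading step can also be written as a loop. Below is a minimal sketch, demonstrated on a temporary directory so it runs anywhere. The helper `load_tables` and the demo columns are hypothetical.

```python
import os
import tempfile

import pandas as pd

def load_tables(data_dir, names):
    """Read `<name>.csv` for each entry of `names` from `data_dir` into a dict."""
    tables = {}
    for name in names:
        path = os.path.join(data_dir, f"{name}.csv")
        if not os.path.exists(path):
            raise FileNotFoundError(f"Expected file not found: {path}")
        tables[name] = pd.read_csv(path)
    return tables

# Demo against a temporary directory holding one tiny CSV file
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"id": [1, 2], "name": ["A", "B"]}).to_csv(
        os.path.join(tmp_dir, "teams.csv"), index=False
    )
    demo_tables = load_tables(tmp_dir, ["teams"])

# Report the shape of each loaded table
print({name: table.shape for name, table in demo_tables.items()})
```

On Kaggle, `data_dir` would be the `../input/mlb-player-digital-engagement-forecasting` folder used above, and `names` would list the files to read.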

In [None]:
train.head()
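Because the training file is grouped by day, a single cell may hold many records for that date. Below is a minimal sketch of unpacking such a cell, assuming the nested values are stored as JSON strings. This is an assumption not confirmed in this notebook, so inspect the output of `train.head()` first. The column name `nested`, the helper `unpack_json_column`, and the demo frame are hypothetical.

```python
import json

import pandas as pd

def unpack_json_column(df, column, date_column="date"):
    """Expand a column of JSON-encoded record lists into one flat DataFrame."""
    frames = []
    for date, cell in zip(df[date_column], df[column]):
        if pd.isna(cell):
            continue  # no records for this day
        records = pd.DataFrame(json.loads(cell))
        # Tag each record with its day (overwrites any same-named field)
        records[date_column] = date
        frames.append(records)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Hand-made stand-in for a day-grouped file (assumed layout)
demo_train = pd.DataFrame({
    "date": [20210401, 20210402, 20210403],
    "nested": [
        '[{"playerId": 1, "target1": 0.5}, {"playerId": 2, "target1": 1.0}]',
        None,
        '[{"playerId": 1, "target1": 2.0}]',
    ],
})

unpacked = unpack_json_column(demo_train, "nested")
print(unpacked)
```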

In [None]:
show_pd_table(awards.head(20), name='Awards DataFrame [First-20 Values]')

In [None]:
show_pd_table(players.head(20), name='Players DataFrame [First-20 Values]')

In [None]:
show_pd_table(seasons, name='Seasons DataFrame')

In [None]:
show_pd_table(teams.head(20), name='Teams DataFrame [First-20 Values]')

In [None]:
players.describe()
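`describe()` covers only numeric columns by default and does not report missing values directly. Below is a minimal sketch of a per-column summary that adds them, shown on a hand-made frame. The helper `summarize_columns` and the `position` column are hypothetical, and the demo values are invented for illustration.

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Per-column dtype, missing-value count and share, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # missing values are not counted
    })

# Hand-made stand-in for a players-like table (values are invented)
demo_players = pd.DataFrame({
    "heightInches": [72, 75, np.nan, 70],
    "weight": [190, 220, 205, 190],
    "position": ["Pitcher", None, "Catcher", "Pitcher"],
})

summary = summarize_columns(demo_players)
print(summary)
```

Calling `summarize_columns(players)` would give the same overview for the real table.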

### Player Height and Weight Distribution

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 7))
fig.suptitle("Height and Weight Distribution")

sns.histplot(players['heightInches'], stat='density', color='magenta', legend=True, ax=ax[0])
sns.histplot(players['weight'], stat='density', color='blue', legend=True, ax=ax[1])

ax[0].set_xlabel("Height (inches)")
ax[1].set_xlabel("Weight")
ax[0].set_ylabel("Density")
ax[1].set_ylabel("Density")
plt.show()

In [None]:
_pprint("[bold green]Work in progress![/bold green]")