# 1. Data Collection
---

### Notebook Summary
For this project, I collected batting data for the 2015-2018 regular seasons from [FanGraphs](https://www.fangraphs.com "FanGraphs"), along with batting data for the beginning of the 2019 season (through May 12). Each row represents a single batter's stats for a single game. I downloaded the data as a series of CSV files, but because each season is split across many files, I need to reassemble each season into a single pandas DataFrame.

---

The pandas library is the only import needed for this notebook.

In [1]:
import pandas as pd

---

## Importing & Assembling the CSVs

FanGraphs categorizes the data into separate files for standard batting stats, advanced batting stats, and batted ball stats. A complete season consists of six files for each of these three stat groups, for a total of 18 CSV files per season. The `read_and_concat_csv` function speeds up the process of stitching these files together. It takes a base filepath that specifies the season and stat group, along with the number of files in that series, and concatenates the files numbered 1 through `quantity` into a single DataFrame for that stat group. The three stat-group DataFrames are then merged into one DataFrame per season.

In [2]:
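def read_and_concat_csv(filepath, quantity):
    """Read the numbered CSVs for one stat group and stack them row-wise."""
    # Files are numbered from 1, so file x+1 is read on iteration x.
    # Collecting the frames first and concatenating once avoids growing
    # a DataFrame inside the loop.
    frames = [pd.read_csv(f'{filepath}{x + 1}.csv') for x in range(quantity)]
    return pd.concat(frames)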
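
To make the file-numbering convention concrete, here is a minimal sketch that runs the same logic against two tiny CSVs written to a temporary directory. The base name `batters_demo_standard` and the toy columns are placeholders for illustration, not actual FanGraphs files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_and_concat_csv(filepath, quantity):
    """Same logic as the notebook's helper: read files 1..quantity and stack them."""
    frames = [pd.read_csv(f'{filepath}{x + 1}.csv') for x in range(quantity)]
    return pd.concat(frames)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical base path; the function appends "1.csv", "2.csv", and so on
    base = str(Path(tmp) / 'batters_demo_standard')
    pd.DataFrame({'Name': ['A', 'B'], 'PA': [4, 5]}).to_csv(f'{base}1.csv', index=False)
    pd.DataFrame({'Name': ['C'], 'PA': [3]}).to_csv(f'{base}2.csv', index=False)

    demo = read_and_concat_csv(base, 2)

print(demo.shape)           # (3, 2)
print(demo.index.tolist())  # [0, 1, 0]
```

Note that each file's original row labels are kept, so index values repeat after concatenation. That is harmless in this notebook because the later `merge` calls produce a fresh index.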

I'll start by passing a full season's quantity of files (6) for each stat group to assemble the 2016 DataFrames.

In [3]:
daily_2016_standard = read_and_concat_csv(
    '../data/batters_2016_daily/batters_2016_daily_standard', 6)
daily_2016_advanced = read_and_concat_csv(
    '../data/batters_2016_daily/batters_2016_daily_advanced', 6)
daily_2016_battedballs = read_and_concat_csv(
    '../data/batters_2016_daily/batters_2016_daily_battedballs', 6)

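With each stat group read into memory, I can now merge the three full-season DataFrames into a single 2016 DataFrame. Because no `on` argument is passed, `merge` joins on every column the two DataFrames have in common.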

In [4]:
daily_2016 = daily_2016_standard.merge(daily_2016_advanced)
daily_2016 = daily_2016.merge(daily_2016_battedballs)
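
As a hedged illustration of what a bare `merge` call joins on, the sketch below uses toy frames. The player names, values, and the `wOBA` column are made up for illustration, and the real FanGraphs exports may share a different set of columns. It lists the shared columns, passes them explicitly, and adds `validate='one_to_one'` so that pandas raises an error if a key combination repeats in either frame.

```python
import pandas as pd

# Toy stand-ins for two stat groups; real FanGraphs columns may differ
standard = pd.DataFrame({
    'Date': ['2016-04-03', '2016-04-03'],
    'Name': ['Player A', 'Player B'],
    'Tm': ['NYM', 'KCR'],
    'PA': [4, 5],
})
advanced = pd.DataFrame({
    'Date': ['2016-04-03', '2016-04-03'],
    'Name': ['Player A', 'Player B'],
    'Tm': ['NYM', 'KCR'],
    'PA': [4, 5],
    'wOBA': [0.310, 0.402],
})

# These are the columns a bare standard.merge(advanced) would join on
shared = [col for col in standard.columns if col in advanced.columns]
print(shared)  # ['Date', 'Name', 'Tm', 'PA']

# Same join, but with the keys spelled out and uniqueness enforced
merged = standard.merge(advanced, on=shared, how='inner', validate='one_to_one')
print(merged.shape)  # (2, 5)
```

Spelling out the keys is optional here, but it makes an unexpected row count easier to diagnose if the stat groups ever disagree on a shared column.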

Next I will repeat the process for the 2017 data...

In [6]:
daily_2017_standard = read_and_concat_csv(
    '../data/batters_2017_daily/batters_2017_daily_standard', 6)
daily_2017_advanced = read_and_concat_csv(
    '../data/batters_2017_daily/batters_2017_daily_advanced', 6)
daily_2017_battedballs = read_and_concat_csv(
    '../data/batters_2017_daily/batters_2017_daily_battedballs', 6)

daily_2017 = daily_2017_standard.merge(daily_2017_advanced)
daily_2017 = daily_2017.merge(daily_2017_battedballs)

...and the 2018 data.

In [8]:
daily_2018_standard = read_and_concat_csv(
    '../data/batters_2018_daily/batters_2018_daily_standard', 6)
daily_2018_advanced = read_and_concat_csv(
    '../data/batters_2018_daily/batters_2018_daily_advanced', 6)
daily_2018_battedballs = read_and_concat_csv(
    '../data/batters_2018_daily/batters_2018_daily_battedballs', 6)

daily_2018 = daily_2018_standard.merge(daily_2018_advanced)
daily_2018 = daily_2018.merge(daily_2018_battedballs)

Lastly, I want to read in and assemble the current data for the beginning of the 2019 season for use in a demo once I've finished the first stages of modeling. Since there is less data available for the in-progress 2019 regular season, there are only two files per stat group.

In [10]:
daily_2019_standard = read_and_concat_csv(
    '../data/batters_2019_daily/batters_2019_daily_standard', 2)
daily_2019_advanced = read_and_concat_csv(
    '../data/batters_2019_daily/batters_2019_daily_advanced', 2)
daily_2019_battedballs = read_and_concat_csv(
    '../data/batters_2019_daily/batters_2019_daily_battedballs', 2)

daily_2019 = daily_2019_standard.merge(daily_2019_advanced)
daily_2019 = daily_2019.merge(daily_2019_battedballs)
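
Since the same read-then-merge steps repeat for every season, they could be wrapped in one function. The following is only a sketch of that idea: `assemble_season` and `STAT_GROUPS` are hypothetical names that are not part of the notebook, and the demo runs on toy files in a temporary directory that mimic the notebook's folder layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

STAT_GROUPS = ('standard', 'advanced', 'battedballs')


def read_and_concat_csv(filepath, quantity):
    """Read files 1..quantity for one stat group and stack them."""
    frames = [pd.read_csv(f'{filepath}{x + 1}.csv') for x in range(quantity)]
    return pd.concat(frames)


def assemble_season(data_dir, year, quantity):
    """Hypothetical helper: read each stat group for a season, then merge them."""
    season = None
    for group in STAT_GROUPS:
        base = f'{data_dir}/batters_{year}_daily/batters_{year}_daily_{group}'
        group_df = read_and_concat_csv(base, quantity)
        # The first group seeds the result; later groups merge on shared columns
        season = group_df if season is None else season.merge(group_df)
    return season


with tempfile.TemporaryDirectory() as tmp:
    season_dir = Path(tmp) / 'batters_2016_daily'
    season_dir.mkdir()
    # One toy file per stat group, each with a shared 'Name' column
    for i, group in enumerate(STAT_GROUPS):
        toy = pd.DataFrame({'Name': ['Player A'], f'stat_{group}': [i]})
        toy.to_csv(season_dir / f'batters_2016_daily_{group}1.csv', index=False)

    demo_2016 = assemble_season(tmp, 2016, 1)

print(demo_2016.columns.tolist())
# ['Name', 'stat_standard', 'stat_advanced', 'stat_battedballs']
```

With the real data, a call such as `assemble_season('../data', 2019, 2)` would be expected to reproduce the 2019 cell above, assuming the same directory layout.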

I'll check the first few observations from the 2018 data (transposed with columns on the vertical axis) to make sure everything looks right.

In [23]:
daily_2018.head().T

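|      | 0 | 1 | 2 | 3 | 4 |
|------|---|---|---|---|---|
| Date | 2018-03-29 | 2018-03-29 | 2018-03-29 | 2018-03-29 | 2018-03-29 |
| Name | Ozzie Albies | Maikel Franco | Evan Gattis | Yoan Moncada | Peter Bourjos |
| Tm   | ATL | PHI | HOU | CHW | ATL |
| G    | 1 | 1 | 1 | 1 | 1 |
| PA   | 5 | 4 | 4 | 6 | 0 |
| AB   | 5 | 2 | 3 | 6 | 0 |
| H    | 1 | 0 | 0 | 1 | 0 |
| 1B   | 0 | 0 | 0 | 0 | 0 |
| 2B   | 0 | 0 | 0 | 1 | 0 |
| 3B   | 0 | 0 | 0 | 0 | 0 |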


Looks good, moving on.
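
Beyond eyeballing the first rows, a few numeric checks can flag assembly problems. The sketch below is one possible approach and not something the notebook itself runs: `summarize_season` is a hypothetical helper, the toy data is invented, and treating `Date`, `Name`, and `Tm` as the identifying columns is an assumption that may not hold in every case (a doubleheader, for example, could legitimately repeat that combination).

```python
import pandas as pd


def summarize_season(df, key_cols=('Date', 'Name', 'Tm')):
    """Hypothetical helper: basic integrity numbers for an assembled season."""
    keys = list(key_cols)
    return {
        'rows': len(df),
        # Rows whose key combination already appeared earlier in the frame
        'duplicate_keys': int(df.duplicated(subset=keys).sum()),
        # Total count of missing cells across all columns
        'missing_values': int(df.isna().sum().sum()),
    }


# Toy data with one repeated key and one missing value
toy = pd.DataFrame({
    'Date': ['2018-03-29', '2018-03-29', '2018-03-29'],
    'Name': ['Player A', 'Player B', 'Player B'],
    'Tm': ['ATL', 'PHI', 'PHI'],
    'PA': [5, 4, None],
})

print(summarize_season(toy))
# {'rows': 3, 'duplicate_keys': 1, 'missing_values': 1}
```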

---
## Exporting

Some of the feature engineering in the next notebook will be more effective if I apply it to each season's dataset separately rather than to all of the seasons combined. For that reason, I'll write each season's DataFrame to its own CSV file.

_The following code block is commented out to protect against accidental overwrites. Simply uncomment the code if you would like to replicate the results for yourself._

In [11]:
# daily_2016.to_csv('../data/batters_2016_daily_master.csv',
#                   index=False)
# daily_2017.to_csv('../data/batters_2017_daily_master.csv',
#                   index=False)
# daily_2018.to_csv('../data/batters_2018_daily_master.csv',
#                   index=False)
# daily_2019.to_csv('../data/batters_2019_daily_master.csv',
#                   index=False)
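
An alternative to commenting the export out is to guard each write with an existence check. This is a sketch of that idea rather than the notebook's approach: `write_csv_if_absent` is a hypothetical helper, and the demo writes a toy frame to a temporary directory instead of `../data`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_csv_if_absent(df, path):
    """Hypothetical helper: write df to path unless a file already exists there."""
    path = Path(path)
    if path.exists():
        return False  # leave the existing file untouched
    df.to_csv(path, index=False)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'batters_demo_daily_master.csv'
    toy = pd.DataFrame({'Name': ['Player A'], 'PA': [4]})

    first = write_csv_if_absent(toy, target)   # file is created
    second = write_csv_if_absent(toy, target)  # file exists, so nothing is written

print(first, second)  # True False
```

The trade-off is that a stale file is never refreshed automatically; it has to be deleted by hand before a rerun will regenerate it.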

---
### Conclusion
All of the project's raw data is now in a more manageable format for some initial EDA and pre-processing, which will take place in the next notebook.