## AIM: To mine data from MLB's Statcast database using a prebuilt library called pybaseball and save the raw data as .csv files.

***

Note: This notebook does no data cleaning; it only mines the data and saves it. Data cleaning was done in `Cleaning Pitches.ipynb`.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

`pybaseball` must be installed for this notebook to work.

It can be installed with `pip install pybaseball`.

In my case the install was quick and had no compatibility issues.

In [6]:
from pybaseball import statcast

In [8]:
# Query Statcast for the date range of choice.
# For the full project I queried from the start of 2015 (2015-01-01) to the end of 2019 (2019-12-31),
# which covers the 2015-2019 seasons and yields a very large amount of pitch data.
# That query takes a LONG time to run!
# For this submission I use a much smaller date range, just to demo the call without a long wait.
data = statcast(start_dt='2014-09-30', end_dt='2015-01-01')

This is a large query, it may take a moment to complete
Completed sub-query from 2014-09-30 to 2014-10-05
Completed sub-query from 2014-10-06 to 2014-10-11
Completed sub-query from 2014-10-12 to 2014-10-17
Completed sub-query from 2014-10-18 to 2014-10-23
Completed sub-query from 2014-10-24 to 2014-10-29
Query unsuccessful for data from 2014-10-30 to 2014-11-03. Skipping these dates.
Query unsuccessful for data from 2014-11-04 to 2014-11-04. Skipping these dates.
Query unsuccessful for data from 2014-11-05 to 2014-11-09. Skipping these dates.
Query unsuccessful for data from 2014-11-10 to 2014-11-10. Skipping these dates.
Query unsuccessful for data from 2014-11-11 to 2014-11-15. Skipping these dates.
Query unsuccessful for data from 2014-11-16 to 2014-11-16. Skipping these dates.
Skipping offseason dates
Query unsuccessful for data from 2015-03-15 to 2015-03-20. Skipping these dates.
Query unsuccessful for data from 2015-03-21 to 2015-03-21. Skipping these dates.


**NOTE:** I did not run the full query all at once because it takes a long time and any internet interruption crashed the cell. Instead, I saved each chunk to its own .csv file and combined the files later, while cleaning the data in another notebook. This is why some of the file names are `pitches16.csv` rather than one big pitch file.
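The chunk-and-save workflow described in this note could look roughly like the sketch below. It is not the code actually used for the project: `mine_in_chunks`, `combine_csvs`, and the `fetch` parameter are hypothetical names, and the `pitches<label>.csv` naming simply follows the `pitches16.csv` pattern mentioned above. `fetch` is assumed to be any callable that takes `start_dt` and `end_dt` and returns a DataFrame, which is how `statcast` is called in this notebook.

```python
from pathlib import Path

import pandas as pd


def mine_in_chunks(date_ranges, fetch, out_dir="."):
    """Fetch each (label, start, end) range and save it as pitches<label>.csv.

    `fetch` is any callable accepting start_dt/end_dt keyword arguments and
    returning a DataFrame (for example pybaseball's `statcast`).
    Returns (saved_paths, failed_labels).
    """
    out_dir = Path(out_dir)
    saved, failed = [], []
    for label, start, end in date_ranges:
        path = out_dir / f"pitches{label}.csv"
        if path.exists():
            # Already mined in an earlier run; skip it so a rerun resumes.
            saved.append(path)
            continue
        try:
            chunk = fetch(start_dt=start, end_dt=end)
        except Exception as exc:  # e.g. a dropped connection
            print(f"Chunk {label} failed: {exc}")
            failed.append(label)
            continue
        # index=False is a choice made for this sketch; it keeps the row
        # index from being written as an extra column.
        chunk.to_csv(path, index=False)
        saved.append(path)
    return saved, failed


def combine_csvs(paths):
    """Stack the per-chunk .csv files into one DataFrame."""
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```

With `pybaseball` installed, a call might look like `mine_in_chunks([("15", "2015-01-01", "2015-12-31")], fetch=statcast)`. Because finished chunks are skipped, a crash only costs the chunk that was in flight.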

***

In [11]:
# Save the raw data to a .csv to preserve the 'base' copy in case of overwriting/errors
# data.to_csv('pitches16.csv')
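One detail worth knowing when saving this way: by default `to_csv` writes the DataFrame's row index as an unnamed first column, which `read_csv` then loads back as `Unnamed: 0`. The sketch below shows the round trip with and without `index=False`. The tiny frame and the file name `pitches_demo.csv` are illustrative stand-ins, not the real Statcast data.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the Statcast frame (illustrative values only).
data = pd.DataFrame({
    "pitch_type": ["FT", "FF"],
    "release_speed": [93.4, 92.3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pitches_demo.csv")  # hypothetical file name

    # Default: the row index is written as an extra, unnamed first column.
    data.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False writes only the real columns.
    data.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # ['Unnamed: 0', 'pitch_type', 'release_speed']
print(list(without_index.columns))  # ['pitch_type', 'release_speed']
```

Either form keeps the raw data intact; the difference only matters when the files are read back in for cleaning.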

In [9]:
# Get the shape of the data
# Lots of columns and pitches (rows), even for a small set of data
data.shape

(9329, 90)

In [10]:
# Lots of nulls in the Statcast data; it will definitely need cleaning
data.isna().sum()

index                       0
pitch_type                  8
game_date                   0
release_speed               8
release_pos_x            9329
                         ... 
post_home_score             0
post_bat_score              0
post_fld_score              0
if_fielding_alignment    9329
of_fielding_alignment    9329
Length: 90, dtype: int64
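Since the output above is truncated, a sorted summary can make it easier to spot columns that are entirely null (such as `release_pos_x`, which is null in all 9329 rows here). The sketch below is one way to build that summary; `null_report` is a hypothetical helper name and the small frame is a stand-in, not real Statcast rows.

```python
import pandas as pd


def null_report(df):
    """Return per-column null counts and fractions, most-null first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_null": counts,
        "frac_null": counts / len(df),
    })
    return report.sort_values("n_null", ascending=False)


# Tiny stand-in frame (illustrative values only, not real Statcast rows).
demo = pd.DataFrame({
    "pitch_type": ["FT", None, "FF"],
    "release_pos_x": [None, None, None],
    "game_date": ["2014-10-29"] * 3,
})
report = null_report(demo)

# Columns where every row is null.
all_null = report.index[report["n_null"] == len(demo)].tolist()
print(all_null)  # ['release_pos_x']
```

On the real frame, `null_report(data).head(20)` would list the twenty emptiest columns, which is a reasonable starting point for the cleaning notebook.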

In [11]:
data

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,0,FT,2014-10-29,93.4,,,Madison Bumgarner,521692.0,518516.0,field_out,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
1,1,FF,2014-10-29,92.3,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
2,2,FF,2014-10-29,92.2,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
3,3,FF,2014-10-29,92.4,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
4,4,FF,2014-10-29,92.3,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9324,3838,FT,2014-09-30,93.6,,,James Shields,424825.0,448306.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
9325,3839,FT,2014-09-30,93.9,,,James Shields,424825.0,448306.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
9326,3840,FF,2014-09-30,95.7,,,James Shields,424825.0,448306.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
9327,3841,KC,2014-09-30,83.1,,,James Shields,424825.0,448306.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,


In [13]:
# See if I can pull out a dataframe for just a player of interest 
# Would be smaller and easier to work with at the start of cleaning
df = data[data['player_name'] == 'Madison Bumgarner']
df.head()

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,0,FT,2014-10-29,93.4,,,Madison Bumgarner,521692.0,518516.0,field_out,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
1,1,FF,2014-10-29,92.3,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
2,2,FF,2014-10-29,92.2,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
3,3,FF,2014-10-29,92.4,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
4,4,FF,2014-10-29,92.3,,,Madison Bumgarner,521692.0,518516.0,,...,2.0,3.0,2.0,3.0,3.0,2.0,2.0,3.0,,
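An alternative to matching on the name string is to filter on the numeric `pitcher` column, which in the rows shown above carries 518516 for Madison Bumgarner and 448306 for James Shields. The sketch below assumes that ID is a more stable key than the name text; `pitches_by_pitcher` is a hypothetical helper and the frame is a small stand-in built from the values visible in the output.

```python
import pandas as pd


def pitches_by_pitcher(df, pitcher_id):
    """Return the rows thrown by the pitcher with the given numeric ID."""
    return df[df["pitcher"] == pitcher_id].reset_index(drop=True)


# Stand-in rows mirroring the name/ID pairs visible in the output above.
demo = pd.DataFrame({
    "player_name": ["Madison Bumgarner", "Madison Bumgarner", "James Shields"],
    "pitcher": [518516.0, 518516.0, 448306.0],
    "release_speed": [93.4, 92.3, 93.6],
})
bumgarner = pitches_by_pitcher(demo, 518516)
print(len(bumgarner))  # 2
```

The comparison works even though the column is stored as floats, because `518516.0 == 518516` in pandas.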


In [16]:
# See the columns
df.columns

Index(['index', 'pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
     

The goal here was simply to mine MLB's Statcast data and write it to a .csv file.

The data cleaning was done in a separate notebook (`Cleaning Pitches.ipynb`) to keep things organized.