### Overview
This Jupyter notebook shows how to retrieve pitch data with [pybaseball](https://github.com/jldbc/pybaseball), export it as CSV files, and import those files for future use.

#### Retrieving the data

In [1]:
# Import necessary libraries
import pybaseball as pyb
from time import time
import pandas as pd
import numpy as np

In [2]:
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")
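Silencing every warning for the whole session can also hide deprecation notices that matter later. A narrower alternative, sketched here with only the standard library, is to suppress warnings inside a `with` block. The helper `noisy_call` is a hypothetical stand-in for any call that warns.

```python
import warnings

def noisy_call():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Warnings are ignored only inside this block; the previous filters
# are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_call()

print(result)
```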

In [3]:
# This code downloads one month at a time to keep the files manageable.
# To download the whole season in one fell swoop, see the bonus section further below

# Number of days in each month
d = {3:31, 4:30, 5:31, 6:30, 7:31, 8:31, 9:30}

# a list to hold each month's dataframe
df_months = []

for i in range(3,10):
    df_month = pyb.statcast(
        start_dt=f'2025-{i:02d}-01',          # start of the month
        end_dt=f'2025-{i:02d}-{d[i]}'         # end of the month
    )
    df_months.append(df_month)                # append it to our list of months

This is a large query, it may take a moment to complete
Skipping offseason dates


100%|██████████████████████████████████████████████████████████████████████████████████| 17/17 [00:49<00:00,  2.93s/it]


This is a large query, it may take a moment to complete


100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [01:27<00:00,  2.92s/it]


This is a large query, it may take a moment to complete


100%|██████████████████████████████████████████████████████████████████████████████████| 31/31 [01:31<00:00,  2.95s/it]


This is a large query, it may take a moment to complete


100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [01:32<00:00,  3.09s/it]


This is a large query, it may take a moment to complete


100%|██████████████████████████████████████████████████████████████████████████████████| 31/31 [01:27<00:00,  2.81s/it]


This is a large query, it may take a moment to complete


100%|██████████████████████████████████████████████████████████████████████████████████| 31/31 [01:38<00:00,  3.17s/it]


This is a large query, it may take a moment to complete


100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [01:17<00:00,  2.57s/it]
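The hard-coded day counts used in the download loop can also be derived from the standard library. The sketch below is one way to do that, assuming a hypothetical helper named `month_bounds`. It only builds the date strings and does not call pybaseball.

```python
import calendar

def month_bounds(year, month):
    """Return (first_day, last_day) of a month as 'YYYY-MM-DD' strings."""
    # monthrange returns (weekday of the 1st, number of days in the month)
    _, last_day = calendar.monthrange(year, month)
    return f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day:02d}"

# Date ranges for March through September 2025
bounds = {m: month_bounds(2025, m) for m in range(3, 10)}
print(bounds[3])  # ('2025-03-01', '2025-03-31')
```

Each pair could then be passed as `start_dt` and `end_dt` to `pyb.statcast`, in the same way as the loop in this section.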


#### Exporting the data

In [5]:
# Create a dictionary for easy access of month names for our file names
month_names = {3:'march', 4:'april', 5:'may', 6:'june', 7:'july', 8:'august', 9:'september'}

# export each month (index=True also writes the row index as an unnamed first column)
for i in range(3,10):
    df_months[i-3].to_csv('all_' + month_names[i] + '_2025_pitches.csv', index = True)

#### Importing the data
Only run these cells once you have either run the cells above or downloaded and unzipped the file from [our GitHub](https://github.com/Erdos-Projects/fall-2025-sports-analytics).

In [6]:
# Month names used in the file names
month_names = {3:'march', 4:'april', 5:'may', 6:'june', 7:'july', 8:'august', 9:'september'}

# a list to hold each month's dataframe
df_months = []

for i in range(3,10):
    df_month = pd.read_csv('all_' + month_names[i] + '_2025_pitches.csv')
    df_months.append(df_month)                       # append it to our list of months

In [7]:
df = pd.concat(df_months[::-1], axis=0)

In [8]:
df.shape

(754044, 119)
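Because each monthly CSV is read with its own default row labels starting at 0, concatenating the frames keeps duplicate index labels. The toy example below uses made-up frames rather than the real data. It shows the effect and one way to get a fresh index with `ignore_index=True`.

```python
import pandas as pd

# Two made-up "months", each with its own default 0-based index
march = pd.DataFrame({"game_date": ["2025-03-27", "2025-03-28"]})
april = pd.DataFrame({"game_date": ["2025-04-01", "2025-04-02"]})

# Same pattern as the notebook: latest month first, original labels kept
kept = pd.concat([march, april][::-1], axis=0)
print(kept.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows
fresh = pd.concat([march, april][::-1], axis=0, ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3]
```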

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,batter_days_until_next_game,api_break_z_with_gravity,api_break_x_arm,api_break_x_batter_in,arm_angle,attack_angle,attack_direction,swing_path_tilt,intercept_ball_minus_batter_pos_x_inches,intercept_ball_minus_batter_pos_y_inches
0,1994,FF,2025-09-28,95.7,-2.15,5.21,"Weissert, Greg",678009,669711,field_out,...,,1.56,0.71,-0.71,20.9,5.991833,-1.319512,28.782516,41.559201,30.599805
1,2040,FF,2025-09-28,95.1,-1.91,5.1,"Weissert, Greg",668670,669711,strikeout,...,,1.59,0.93,0.93,20.5,,,,,
2,2165,FF,2025-09-28,95.4,-1.99,5.22,"Weissert, Greg",668670,669711,,...,,1.36,0.85,0.85,22.9,2.871131,31.805044,22.266527,37.478847,15.582717
3,2256,SL,2025-09-28,84.8,-2.33,4.72,"Weissert, Greg",668670,669711,,...,,2.55,-0.32,-0.32,12.3,13.78541,4.08139,32.414181,38.011685,27.083341
4,2352,SL,2025-09-28,85.3,-2.26,4.85,"Weissert, Greg",668670,669711,,...,,2.71,-0.52,-0.52,15.8,,,,,
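The `Unnamed: 0` and `Unnamed: 0.1` columns in the preview are what pandas typically produces when a CSV written with its index is read back without `index_col`. The round trip below illustrates this with a made-up two-row frame and an in-memory buffer. It is a sketch, not a claim about exactly how the project files were produced.

```python
import io
import pandas as pd

# Made-up frame standing in for one month of pitches
toy = pd.DataFrame({"pitch_type": ["FF", "SL"], "release_speed": [95.7, 84.8]})

# Write with the index, as the export cell does
buf = io.StringIO()
toy.to_csv(buf, index=True)

# Reading back without index_col turns the saved index into a data column
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))

# Passing index_col=0 restores it as the index instead
buf.seek(0)
clean = pd.read_csv(buf, index_col=0)
print(list(clean.columns))
```

If the extra columns are not needed, they could also be dropped after loading, for example with `df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])`.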


#### A modicum of cleaning
Here we get rid of spring training games. When we grabbed March's data, we also got some spring training games ('game_type' = 'S'). It's relatively easy to drop those.

In [10]:
# Take a look at the breakdown of our pitches
df['game_type'].value_counts()

game_type
R    710084
S     43960
Name: count, dtype: int64

In [11]:
df.shape

(754044, 119)

In [12]:
# Only keep those pitches from a regular season game
df = df[df['game_type']=='R']

In [13]:
# Check that it worked
df['game_type'].value_counts()


game_type
R    710084
Name: count, dtype: int64

In [14]:
df.shape

(710084, 119)
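The same filter can be tried on a small made-up frame. This sketch also checks that the rows kept plus the rows dropped add up to the original total, which is the relationship seen in the counts above (710084 + 43960 = 754044).

```python
import pandas as pd

# Made-up pitches: "S" is spring training, "R" is regular season
toy = pd.DataFrame({
    "game_type": ["S", "R", "R", "S", "R"],
    "pitch_type": ["FF", "SL", "FF", "CH", "SI"],
})

# Boolean mask: True for regular-season rows
is_regular = toy["game_type"] == "R"
regular = toy[is_regular]

# Kept rows + dropped rows should equal the original row count
assert len(regular) + (~is_regular).sum() == len(toy)
print(regular.shape)  # (3, 2)
```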

#### Bonus section 1: Doing it all at once
This takes longer but uses less local memory.

In [None]:
# If you want to download all of it at once, use this cell. It took ~10 minutes on my computer
a = time()
df = pyb.statcast(start_dt='2025-03-17', end_dt='2025-09-29')
print(time() - a)
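For timing a long download, `time.perf_counter` is generally a better fit than `time.time` because it is a monotonic, high-resolution clock. A minimal sketch, assuming a hypothetical `slow_task` in place of the `pyb.statcast` call:

```python
from time import perf_counter, sleep

def slow_task():
    # Hypothetical stand-in for a long-running download
    sleep(0.1)
    return "done"

start = perf_counter()
result = slow_task()
elapsed = perf_counter() - start   # elapsed wall-clock time in seconds

print(f"{result} in {elapsed:.1f} s")
```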

#### Bonus section 2: Minor cleaning and wrangling
We look at pitches where Juan Soto was at bat.

In [15]:
# Check out Juan Soto's stats
soto_stats = df[df['batter'] == 665742]

In [16]:
# How many home runs did he hit
sum(soto_stats['events'] == 'home_run')

43

In [17]:
soto_stats['events'].value_counts()

events
field_out                    249
strikeout                    136
walk                         114
single                        88
home_run                      43
double                        20
grounded_into_double_play     17
force_out                     10
sac_fly                        6
field_error                    6
hit_by_pitch                   3
fielders_choice                3
fielders_choice_out            2
triple                         1
catcher_interf                 1
double_play                    1
sac_bunt                       1
strikeout_double_play          1
truncated_pa                   1
Name: count, dtype: int64
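The event counts above can be combined into simple totals. As a sketch, the snippet below copies the four hit categories from the output and adds them up. The dictionary is typed in by hand rather than computed from `soto_stats`.

```python
# Hit counts copied from the value_counts() output above
hit_counts = {"single": 88, "double": 20, "triple": 1, "home_run": 43}

# Total hits is the sum over the four hit categories
total_hits = sum(hit_counts.values())
print(total_hits)  # 152

# Share of hits that were home runs
hr_share = hit_counts["home_run"] / total_hits
print(f"{hr_share:.3f}")
```

On the full data, the same total could be obtained with `soto_stats['events'].isin(list(hit_counts)).sum()`.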