# BSBL-Tomorrow Pre-Processing

Author: Jensen Holm <br>
April 2024

In [1]:
import requests

# get the url for our dataset of all statcast era pitches (2015-2023)
# from the huggingface API
PARQUET_URL = requests.get(
    "https://huggingface.co/api/datasets/Jensen-holm/statcast-era-pitches/parquet/default/train",
).json()[0]

print(PARQUET_URL)

https://huggingface.co/api/datasets/Jensen-holm/statcast-era-pitches/parquet/default/train/0.parquet


In [2]:
import polars as pl

# load the dataset into a polars DataFrame
statcast_era_pitches: pl.DataFrame = pl.read_parquet(PARQUET_URL)

# print columns and their types so we can see what we're working with
statcast_era_pitches.glimpse()

Rows: 5479763
Columns: 92
$ pitch_type                      <str> 'FF', 'FC', 'FF', 'FC', 'FF', 'FF', 'FF', 'FF', 'FC', 'KC'
$ game_date                       <str> '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000', '2015-11-01 00:00:00.000000000'
$ release_speed                   <f32> 96.0999984741211, 93.0999984741211, 97.0, 93.5999984741211, 97.0999984741211, 96.5, 96.5999984741211, 97.5999984741211, 92.0, 86.69999694824219
$ release_pos_x                   <f32> -2.0199999809265137, -1.659999966621399, -1.6399999856948853, -1.5800000429153442, -1.7000000476837158, -1.6200000047683716, -1.3899999856948853, -1.5099999904632568, -1.8899999856948853, -1.6200000047683716
$ release_pos_z                   <f32> 6.25, 6.239999771118164, 6.3000001

In [3]:
statcast_era_pitches.columns

['pitch_type',
 'game_date',
 'release_speed',
 'release_pos_x',
 'release_pos_z',
 'player_name',
 'batter',
 'pitcher',
 'events',
 'description',
 'spin_dir',
 'spin_rate_deprecated',
 'break_angle_deprecated',
 'break_length_deprecated',
 'zone',
 'des',
 'game_type',
 'stand',
 'p_throws',
 'home_team',
 'away_team',
 'type',
 'hit_location',
 'bb_type',
 'balls',
 'strikes',
 'game_year',
 'pfx_x',
 'pfx_z',
 'plate_x',
 'plate_z',
 'on_3b',
 'on_2b',
 'on_1b',
 'outs_when_up',
 'inning',
 'inning_topbot',
 'hc_x',
 'hc_y',
 'tfs_deprecated',
 'tfs_zulu_deprecated',
 'fielder_2',
 'umpire',
 'sv_id',
 'vx0',
 'vy0',
 'vz0',
 'ax',
 'ay',
 'az',
 'sz_top',
 'sz_bot',
 'hit_distance_sc',
 'launch_speed',
 'launch_angle',
 'effective_speed',
 'release_spin_rate',
 'release_extension',
 'game_pk',
 'pitcher.1',
 'fielder_2.1',
 'fielder_3',
 'fielder_4',
 'fielder_5',
 'fielder_6',
 'fielder_7',
 'fielder_8',
 'fielder_9',
 'release_pos_y',
 'estimated_ba_using_speedangle',
 'estimat

# Goal

The main goal of this project is to predict how well current MLB pitchers will perform over the next week, next month, and next year (these will probably be separate models). To do this, the data has to be structured in a specific way.

My current plan is to build a model that predicts performance at the game level. To project the next week, we would combine a pitcher's game-level predictions for the teams he is scheduled to face. Doing this for every pitcher on a team, for every game in the upcoming week, would show which pitchers should be throwing more against certain teams.


##### Target(s)
- Pitcher expected run value (delta_run_exp / some pitch amount)

In [4]:
# calculate each pitcher's total delta_run_exp per 50 pitches in each of their outings

PER_PITCHES: int = 50

pitchers_delta_run_exp = (
    statcast_era_pitches.group_by("game_pk", "pitcher")
    # aggregate the delta_run_exp and count the number of pitches in each outing
    .agg(
        pl.sum("delta_run_exp").alias("total_delta_run_exp"),
        pl.count("delta_run_exp").alias("total_pitches"),
    )
    # calculate the delta_run_exp per 50 pitches
    .with_columns([
        (pl.col("total_delta_run_exp") / pl.col("total_pitches") * PER_PITCHES).alias("delta_run_exp_per_50_pitches"),
    ])
)

pitchers_delta_run_exp.select(["pitcher", "delta_run_exp_per_50_pitches"]).sample(5)

pitcher,delta_run_exp_per_50_pitches
i32,f64
434442,0.027083
594792,6.7
642232,-0.888732
623352,-1.813333
656794,5.409574


In [None]:
# 'append' this to the original DataFrame so that each pitch has the delta_run_exp per 50 pitches for that specific outing


##### Features
(may apply feature selection or PCA to reduce this high-dimensional dataset)
- Statcast metrics of players on the teams that the pitcher's team will be facing
- Split Statcast metrics against hitters with similar hitting profiles
- The pitch types that the pitcher throws
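
If PCA ends up being used for the dimensionality reduction mentioned above, it might look something like the sketch below. This is a minimal NumPy version run on random placeholder data; the helper name `pca_reduce`, the choice to standardize first, and the number of components are all assumptions rather than decisions made in this project.

```python
import numpy as np


def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of `features` onto their first `n_components` principal components."""
    # Standardize each column so metrics on different scales contribute comparably
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    scaled = (features - features.mean(axis=0)) / std

    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return scaled @ vt[:n_components].T


# Random placeholder matrix: 100 observations, 10 features (not real Statcast data)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(100, 10))

reduced = pca_reduce(toy_features, n_components=3)
print(reduced.shape)
```

In practice the numeric feature columns would first need to be extracted from the polars DataFrame (for example with `DataFrame.to_numpy()`) and any missing values handled before a step like this.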