# BSBL-Tomorrow Pre-Processing

Author: Jensen Holm <br>
April 2024

In [1]:
from constants import DATA_DIR, COMPRESSION
import polars as pl
import requests
import os

In [2]:
# get the url for our dataset of all statcast era pitches (2015-2023)
# from the huggingface API
PARQUET_URL = requests.get(
    "https://huggingface.co/api/datasets/Jensen-holm/statcast-era-pitches/parquet/default/train",
).json()[0]

print(PARQUET_URL)

https://huggingface.co/api/datasets/Jensen-holm/statcast-era-pitches/parquet/default/train/0.parquet


In [3]:
# load the dataset into a polars DataFrame
statcast_era_pitches: pl.DataFrame = pl.read_parquet(PARQUET_URL)

# print a random sample of 3 rows; the header shows column names and dtypes
print(statcast_era_pitches.sample(3))

shape: (3, 92)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ pitch_typ ┆ game_date ┆ release_s ┆ release_p ┆ … ┆ of_fieldi ┆ spin_axis ┆ delta_hom ┆ delta_ru │
│ e         ┆ ---       ┆ peed      ┆ os_x      ┆   ┆ ng_alignm ┆ ---       ┆ e_win_exp ┆ n_exp    │
│ ---       ┆ str       ┆ ---       ┆ ---       ┆   ┆ ent       ┆ f32       ┆ ---       ┆ ---      │
│ str       ┆           ┆ f32       ┆ f32       ┆   ┆ ---       ┆           ┆ f32       ┆ f32      │
│           ┆           ┆           ┆           ┆   ┆ str       ┆           ┆           ┆          │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ CH        ┆ 2022-04-1 ┆ 88.300003 ┆ -1.7      ┆ … ┆ Standard  ┆ 237.0     ┆ 0.0       ┆ -0.024   │
│           ┆ 6 00:00:0 ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│           ┆ 0.0000000 ┆           ┆           ┆   ┆           ┆           

# Goal

The main goal of this project is to predict how well current MLB pitchers will perform next week, next month, and next year (these will probably all be different models). To do this, the data will have to be structured a certain way.

For this to be effective and potentially useful, I want to create a model that predicts the outcome of a plate appearance with decent accuracy. Then, we can string together plate appearance predictions to predict how well a pitcher fares against a team per 9 innings, or per a fixed number of pitches instead (whichever yields better results).
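
As a rough sketch of what "stringing together" plate appearance predictions could look like, the snippet below averages hypothetical per-plate-appearance outcome probabilities into an expected wOBA and estimates how many plate appearances it takes to record 27 outs. Everything here is an assumption for illustration: the function names, the outcome labels, and especially the linear weights, which are placeholders (real wOBA weights change every season). It also assumes each plate appearance is independent and records at most one out, which ignores double plays and outs on the bases.

```python
# Placeholder linear weights for illustration only; real wOBA weights vary by season.
ILLUSTRATIVE_WOBA_WEIGHTS = {
    "out": 0.0,
    "walk": 0.7,
    "single": 0.9,
    "double": 1.25,
    "triple": 1.6,
    "home_run": 2.0,
}


def expected_woba(pa_probs: list[dict[str, float]]) -> float:
    """Mean expected wOBA over a sequence of predicted plate appearances."""
    total = 0.0
    for probs in pa_probs:
        # Weight each predicted outcome probability by its linear weight.
        total += sum(ILLUSTRATIVE_WOBA_WEIGHTS[outcome] * p for outcome, p in probs.items())
    return total / len(pa_probs)


def expected_pa_per_nine(pa_probs: list[dict[str, float]]) -> float:
    """Expected plate appearances needed for 27 outs, using the mean out probability."""
    mean_out_prob = sum(probs.get("out", 0.0) for probs in pa_probs) / len(pa_probs)
    return 27 / mean_out_prob


# Hypothetical model output for two hitters in an opposing lineup.
lineup_probs = [
    {"out": 0.70, "walk": 0.10, "single": 0.20},
    {"out": 0.60, "walk": 0.10, "single": 0.20, "home_run": 0.10},
]

print(expected_woba(lineup_probs))
print(expected_pa_per_nine(lineup_probs))
```

A simulation that tracks base states and outs would be more faithful, but this closed-form version is cheap enough to run for every pitcher and opponent pairing.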

It will probably be computationally expensive, but one way a team could use this is to project which bullpen pitchers to save for which upcoming opponents.

##### Target(s)
- Pitcher expected run value (expected and regular wOBA during an outing)

##### Features
(might do feature selection or PCA to shrink this high-dimensional dataset)
- Statcast metrics of players on teams that the pitcher's team will be facing
- Split Statcast metrics against hitters with similar hitting profiles
- The pitches that the pitcher throws

In [4]:
# calculate a pitcher's mean woba_value for each outing; this is a metric
# that might be useful to know, especially if we can predict it well

PITCH_MINIMUM: int = 0

pitchers_outings_df: pl.DataFrame = (
    statcast_era_pitches.group_by("game_pk", "home_team", "away_team", "pitcher")
    .agg(
        pl.sum("woba_value").alias("total_woba_value"),
        # counts distinct batters, so a batter faced twice in one outing counts once
        pl.col("batter").n_unique().alias("batters_faced"),
        pl.len().alias("total_pitches"),
    )
    .filter(pl.col("total_pitches") >= PITCH_MINIMUM)
    .with_columns(outing_woba_value=pl.col("total_woba_value") / pl.col("batters_faced"))
    .drop("total_woba_value")
)

print(pitchers_outings_df)

shape: (163_452, 7)
┌─────────┬───────────┬───────────┬─────────┬───────────────┬───────────────┬───────────────────┐
│ game_pk ┆ home_team ┆ away_team ┆ pitcher ┆ batters_faced ┆ total_pitches ┆ outing_woba_value │
│ ---     ┆ ---       ┆ ---       ┆ ---     ┆ ---           ┆ ---           ┆ ---               │
│ i32     ┆ str       ┆ str       ┆ i32     ┆ u32           ┆ u32           ┆ f64               │
╞═════════╪═══════════╪═══════════╪═════════╪═══════════════╪═══════════════╪═══════════════════╡
│ 531461  ┆ MIA       ┆ TOR       ┆ 621295  ┆ 6             ┆ 22            ┆ 0.0               │
│ 565294  ┆ CIN       ┆ AZ        ┆ 543101  ┆ 9             ┆ 91            ┆ 0.794444          │
│ 414289  ┆ ATL       ┆ MIL       ┆ 519293  ┆ 6             ┆ 23            ┆ 0.15              │
│ 633269  ┆ MIA       ┆ SD        ┆ 669622  ┆ 4             ┆ 16            ┆ 0.175             │
│ 633760  ┆ NYM       ┆ ATL       ┆ 607625  ┆ 5             ┆ 22            ┆ 0.54              │
