### Reading data

The data for this project lives in [this Hugging Face dataset](https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches) that I made. It contains every pitch from the Statcast era through last season (2015-2023).

In [6]:
import polars as pl
import numpy as np
from dataclasses import dataclass

STATCAST_ERA_PITCHES_URL: str = (
    "https://huggingface.co/api/datasets/Jensen-holm/statcast-era-pitches/parquet/default/train/0.parquet"
)

statcast_era_df: pl.DataFrame = pl.read_parquet(STATCAST_ERA_PITCHES_URL)
statcast_era_df.sample(3)

pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,des,game_type,stand,p_throws,home_team,away_team,type,hit_location,bb_type,balls,strikes,game_year,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,inning_topbot,…,effective_speed,release_spin_rate,release_extension,game_pk,pitcher.1,fielder_2.1,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,release_pos_y,estimated_ba_using_speedangle,estimated_woba_using_speedangle,woba_value,woba_denom,babip_value,iso_value,launch_speed_angle,at_bat_number,pitch_number,pitch_name,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
str,str,f32,f32,f32,str,i32,i32,str,str,i64,i64,i64,i64,f32,str,str,str,str,str,str,str,f32,str,i32,i32,i32,f32,f32,f32,f32,i32,i32,i32,i32,f32,str,…,f32,i64,f32,i32,i64,i64,i32,i32,i32,i32,i32,i32,i32,f32,f32,f32,f32,f32,f32,f32,f32,i32,i32,str,i32,i32,i32,i64,i32,i32,i32,i64,str,str,f32,f32,f32
"""CU""","""2017-06-19 00:…",76.300003,-2.36,5.49,"""Smith, Josh A.…",608324,595001,"""double""","""hit_into_play""",,,,,9.0,"""Alex Bregman d…","""R""","""R""","""R""","""OAK""","""HOU""","""X""",7.0,"""line_drive""",1,0,2017,1.32,-0.23,0.45,1.89,,,,1,9.0,"""Top""",…,75.0,2525,5.4,491149,595001,519390,475174,640461,592387,489267,501981,595144,459964,55.09,0.927,0.929,1.25,1.0,1.0,1.0,4.0,66,2,"""Curveball""",1,4,4,1,4,1,4,1,"""Standard""","""Standard""",73.0,-0.006,0.39
"""FF""","""2021-05-06 00:…",92.0,-2.16,6.35,"""Pineda, Michae…",621311,501381,"""single""","""hit_into_play""",,,,,14.0,"""David Dahl sin…","""R""","""L""","""R""","""MIN""","""TEX""","""X""",7.0,"""line_drive""",2,1,2021,-0.62,1.14,0.94,1.77,,,,2,4.0,"""Top""",…,91.900002,2050,6.3,634261,501381,666163,593934,624503,553902,592743,664247,621439,596146,54.200001,0.449,0.404,0.9,1.0,1.0,0.0,2.0,30,4,"""4-Seam Fastbal…",3,2,2,3,2,3,2,3,"""Infield shift""","""Standard""",218.0,-0.016,0.106
"""FF""","""2019-05-22 00:…",99.300003,-1.77,6.15,"""Vieira, Thyago…",643603,600986,"""strikeout""","""swinging_strik…",,,,,6.0,"""Tyler White st…","""R""","""R""","""R""","""HOU""","""CWS""","""S""",2.0,,2,2,2019,-0.26,1.47,0.78,2.1,,,,0,9.0,"""Bot""",…,99.599998,2340,6.4,565628,600986,543510,547989,570560,660162,641313,650391,544725,605508,54.060001,,,0.0,1.0,0.0,0.0,,71,5,"""4-Seam Fastbal…",3,9,3,9,9,3,3,9,"""Standard""","""Standard""",190.0,-0.002,-0.217


# Goal

I want to measure a pitcher's ability to tunnel pitches within an at bat. This entails computing distances in 2D space for a few different metrics and combining them into one overall tunnel score.

- Compute the distance between the horizontal and vertical movement of different pitches (high score = better?)
- Compute the distance between the release positions (x, y, and z) of different pitches (low score = better)
- Estimate where the ball would have ended up without spin, and compare that to where other pitches would have ended up without spin (low score = better). If two pitches had very different movement but would have ended up in similar spots without spin, they started out on similar trajectories and broke very differently, which is a very good thing.
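
The first two components above can be sketched for a single pair of pitches. This is only an illustration of the distance idea, not the final scoring method: the `Pitch` container and both function names are hypothetical, and the example values are borrowed from two rows of the sample output above (thrown by different pitchers, so they only demonstrate the arithmetic).

```python
import math
from typing import NamedTuple


class Pitch(NamedTuple):
    # hypothetical container; field names mirror the Statcast columns
    release_pos_x: float
    release_pos_z: float
    pfx_x: float
    pfx_z: float


def release_distance(a: Pitch, b: Pitch) -> float:
    """Euclidean distance (ft) between two release points; lower is better."""
    return math.hypot(a.release_pos_x - b.release_pos_x, a.release_pos_z - b.release_pos_z)


def movement_distance(a: Pitch, b: Pitch) -> float:
    """Euclidean distance (ft) between two movement profiles; higher may be better."""
    return math.hypot(a.pfx_x - b.pfx_x, a.pfx_z - b.pfx_z)


# values copied from the sample rows above, for illustration only
fastball = Pitch(release_pos_x=-2.16, release_pos_z=6.35, pfx_x=-0.62, pfx_z=1.14)
curveball = Pitch(release_pos_x=-2.36, release_pos_z=5.49, pfx_x=1.32, pfx_z=-0.23)

print(release_distance(fastball, curveball))
print(movement_distance(fastball, curveball))
```

How these distances should be weighted against each other in a single tunnel score is left open here.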



In [2]:
TUNNEL_COLS: list[str] = [
    "pitch_type",  # type of pitch: FF, FC, CU, etc ...
    "release_pos_x",  # horizontal release position of ball in ft from catcher's pov
    "release_pos_z",  # vertical release position of ball in ft from catcher's pov
    "pfx_x",  # horizontal movement in ft from catcher's perspective
    "pfx_z",  # vertical movement in ft from catcher's perspective
    "plate_x",  # horizontal position of the ball in ft when it crossed the plate
    "plate_z",  # vertical position of the ball in ft when it crossed the plate
]

# drop missing values for the columns that we care about
statcast_era_pitches = statcast_era_df.drop_nulls(subset=TUNNEL_COLS)

In [3]:
# Grouping by these features gives fine-grained data on each pitcher.
# Each group is one pitch type thrown by one pitcher in one at bat.
GROUP_COLS = ["pitcher", "game_pk", "pitch_type", "at_bat_number"]

pitcher_release_clusters: pl.DataFrame = (
    # use the null-filtered frame created above rather than the raw data
    statcast_era_pitches.group_by(GROUP_COLS).agg(
        # horizontal & vertical movement
        h_move_variance=pl.col("pfx_x").var(),
        v_move_variance=pl.col("pfx_z").var(),
        h_move_mean=pl.col("pfx_x").mean(),
        v_move_mean=pl.col("pfx_z").mean(),
        # release position
        h_release_variance=pl.col("release_pos_x").var(),
        v_release_variance=pl.col("release_pos_z").var(),
        h_release_mean=pl.col("release_pos_x").mean(),
        v_release_mean=pl.col("release_pos_z").mean(),
    )
    # merge back with other data that we want to know about each pitch
    # (this repeats the group's aggregates on every pitch row in that group)
    .join(
        other=statcast_era_pitches.select(
            GROUP_COLS + ["spin_axis", "release_spin_rate", "home_team", "away_team"]
        ),
        on=GROUP_COLS,
        how="left",
    )
)

# The above results in a dataframe where each row carries the movement metrics
# for one pitch type that a pitcher threw within a specific at bat.
pitcher_release_clusters.select(
    GROUP_COLS
    + [
        "h_move_variance",
        "v_move_variance",
        "h_move_mean",
        "v_move_mean",
    ]
).sample(3)

| pitcher | game_pk | pitch_type | at_bat_number | h_move_variance | v_move_variance | h_move_mean | v_move_mean |
|---|---|---|---|---|---|---|---|
| i32 | i32 | str | i32 | f32 | f32 | f32 | f32 |
| 572362 | 530742 | "FF" | 82 | 0.08405 | 0.02 | 0.605 | 1.38 |
| 519326 | 529524 | "SI" | 75 | 0.019633 | 0.0309 | -1.446667 | 0.69 |
| 543408 | 414852 | "FF" | 46 | 0.01945 | 0.00812 | -0.74 | 1.352 |
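
With per-pitch-type means like these, the pitch types thrown in one at bat can be compared pairwise. The sketch below uses plain Python rather than polars to keep it self-contained. The `at_bat_means` dictionary holds made-up numbers for a hypothetical at bat, and `pairwise_movement_gaps` is an illustrative helper, not part of the analysis above.

```python
import math
from itertools import combinations

# hypothetical per-pitch-type means for one pitcher in one at bat:
# pitch_type -> (h_move_mean, v_move_mean), both in ft (made-up values)
at_bat_means: dict[str, tuple[float, float]] = {
    "FF": (-0.74, 1.35),
    "SL": (0.30, 0.10),
    "CH": (-1.20, 0.55),
}


def pairwise_movement_gaps(
    means: dict[str, tuple[float, float]],
) -> dict[tuple[str, str], float]:
    """Euclidean movement distance (ft) for every pair of pitch types."""
    return {
        (a, b): math.dist(means[a], means[b])
        for a, b in combinations(sorted(means), 2)
    }


gaps = pairwise_movement_gaps(at_bat_means)
for pair, gap in gaps.items():
    print(pair, round(gap, 3))
```

Sorting the pitch types first gives each pair a stable key, so (`"CH"`, `"FF"`) is never also reported as (`"FF"`, `"CH"`).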


## Applying Magnus Equations

The goal of applying the Magnus equations here is to figure out where the ball would have crossed the strike zone if it had no spin. This is useful because if two pitches would have ended up in the same spot without spin but ended up in very different places in reality, they 'looked' like one another and make a very good combination.
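
As a rough cross-check on this idea, the no-spin location can also be approximated straight from the Statcast columns by subtracting movement from the plate location. This is a shortcut rather than the Magnus calculation, and it rests on an assumption: that `pfx_x` and `pfx_z` describe the spin-induced deviation at the plate in the same coordinate frame as `plate_x` and `plate_z`. The function names are hypothetical.

```python
import math


def no_spin_location(
    plate_x: float, plate_z: float, pfx_x: float, pfx_z: float
) -> tuple[float, float]:
    """Approximate plate crossing point (ft) with the pitch's movement removed.

    Assumes pfx_x / pfx_z are the spin-induced deviation at the plate.
    """
    return plate_x - pfx_x, plate_z - pfx_z


def no_spin_distance(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Distance (ft) between two no-spin locations; lower suggests a tighter tunnel."""
    return math.dist(a, b)


# plate location and movement copied from two rows of the sample output above
curveball = no_spin_location(plate_x=0.45, plate_z=1.89, pfx_x=1.32, pfx_z=-0.23)
fastball = no_spin_location(plate_x=0.94, plate_z=1.77, pfx_x=-0.62, pfx_z=1.14)

print(no_spin_distance(curveball, fastball))
```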

In [8]:

@dataclass
class MagnusEffect:
    # all constants are in SI units so that the force comes out in newtons
    _ball_radius: float = 0.07468 / 2  # m
    _air_density: float = 1.293  # kg/m^3

    def calculate_force(self, spin_rate: float, velocity: float) -> float:
        """Ideal Magnus (lift) force in newtons.

        Assumes spin_rate is in rpm and velocity is in mph, as in Statcast.
        """
        spin_rev_per_s = spin_rate / 60
        velocity_m_per_s = velocity * 0.44704

        # ideal lift of a spinning ball, based on NASA's beginner's guide:
        # https://www1.grc.nasa.gov/beginners-guide-to-aeronautics/lift-of-a-baseball/
        # Integrating the lift of a spinning cylinder over the ball's
        # cross-sections gives the 4/3 factor. This is an idealized value,
        # not a measured lift for a real baseball.
        return (
            (4 / 3) * 4 * np.pi**2 * self._ball_radius**3
            * spin_rev_per_s * self._air_density * velocity_m_per_s
        )


MagnusEffect().calculate_force(spin_rate=3000, velocity=100)