# Pitcher Data Demo

Inspired by: Thomas Nestico [(@TJStats)](https://x.com/TJStats)

This notebook summarizes one pitcher's outing from a single Spring Training game, with a set of metrics calculated for each pitch type in the pitcher's mix. It explains what each metric means and how it is calculated. I wanted to replicate the incredible work that people like TJ put out for baseball fans so that I could better understand pitchers and the game we all love.

The end result will look like this:


In [31]:
import pandas as pd
pd.set_option("display.max_columns", None)  # Ensure all columns are displayed

df = pd.read_csv("pitch_type_counts.csv")

df

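| | Pitcher | Pitch Type | Count | Usage | Spin Rate | Avg Velo | iVB | HB | Whiffs | CS | CS+Whiffs | Zone% | Chase% | Whiff% | vRel | hRel | VAA | HAA | Extension | Max Exit Velo | Batter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jon Gray | Four-Seam Fastball | 20 | 50.0 | 1938.4 | 94.4 | 15.1 | 8.5 | 0 | 4 | 4 | 60.0 | 37.5 | 0.0 | 5.4 | -1.6 | -4.7 | 1.0 | 6.3 | 98.2 | Lourdes Gurriel Jr. |
| 1 | Jon Gray | Slider | 15 | 37.5 | 2473.7 | 86.7 | 1.9 | -3.4 | 3 | 3 | 6 | 66.7 | 40.0 | 33.3 | 5.6 | -1.6 | -7.3 | 2.4 | 6.1 | 109.4 | Alek Thomas |
| 2 | Jon Gray | Changeup | 3 | 7.5 | 1502.7 | 87.3 | 8.1 | 14.2 | 1 | 0 | 1 | 66.7 | 0.0 | 50.0 | 5.5 | -1.5 | -6.7 | 0.6 | 6.3 | | |
| 3 | Jon Gray | Curveball | 2 | 5.0 | 2622.5 | 77.3 | -8.3 | -10.6 | 0 | 0 | 0 | 0.0 | 0.0 | | 5.6 | -1.5 | -10.7 | 3.7 | 6.2 | | |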


As we can see above, there are a lot of different attributes describing the pitches for Jon Gray.

We can see each pitch type that Jon has, as well as how many times he threw each pitch respectively (20 fastballs, 15 sliders, 3 changeups, 2 curveballs).

From those counts we can calculate the next column, usage rate. Summing the counts gives the total number of pitches thrown, and dividing each pitch type's count by that total gives its usage rate.
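As a quick check of that arithmetic, here is a minimal sketch that recomputes the usage column from the four counts in the table above. The variable names are only for illustration and are not the ones used later in the notebook.

```python
import pandas as pd

# Pitch counts taken from the output table above
counts = pd.Series(
    {"Four-Seam Fastball": 20, "Slider": 15, "Changeup": 3, "Curveball": 2},
    name="Count",
)

total_pitches = counts.sum()  # 20 + 15 + 3 + 2
usage = (counts / total_pitches * 100).round(1)  # share of all pitches, in percent

print(total_pitches)
print(usage)
```

The four usage values should add up to 100, which is a handy sanity check on any pitch mix table.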

I will go through this final output column by column and showcase the code used as well as an explanation for the code.

## Import Packages

In [None]:
# MLB Scraper Pitcher Data
import pandas as pd
import pybaseball as pyb
import numpy as np
from api_scraper import MLB_Scrape

y0 = 50  # y-position (feet from home plate) at which vy0, vz0, etc. are reported
yf = 17 / 12  # y-position (feet) of the front of home plate

## Display Options to ensure that all of the output is displayed


In [None]:
# Set display options to print all columns without truncation
pd.set_option("display.max_columns", None)  # Ensure all columns are displayed
pd.set_option("display.max_rows", None)  # Display all rows, be cautious with large DataFrames
pd.set_option("display.width", None)  # Remove column width limit
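Because `display.max_rows = None` can flood the output for large DataFrames, one alternative is to lift the limits only temporarily. The sketch below uses `pd.option_context` on a made-up wide DataFrame (`wide_df` is hypothetical); the settings revert once the `with` block ends.

```python
import pandas as pd

# Made-up wide DataFrame standing in for the scraped game data
wide_df = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

default_max_columns = pd.get_option("display.max_columns")

# Inside the block every column is shown; the previous settings are restored on exit
with pd.option_context("display.max_columns", None, "display.width", None):
    inside = pd.get_option("display.max_columns")
    print(wide_df.head())

after = pd.get_option("display.max_columns")
print(after == default_max_columns)
```

This keeps the global notebook settings untouched, which may be preferable when only one or two cells need the full view.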

## Retrieving game data with MLB Scraper model by Tnestico

Specific game IDs can be found in Baseball Savant URLs.

For example: `https://baseballsavant.mlb.com/gamefeed?gamePk=778935`

The last six digits at the end of the URL, `778935`, are the game ID for a Rangers/Diamondbacks Spring Training game.
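If you would rather not copy the digits by hand, the ID can be read from the URL's `gamePk` query parameter. This is a small sketch using only the standard library; the helper name `game_pk_from_url` is made up for this example.

```python
from urllib.parse import urlparse, parse_qs


def game_pk_from_url(url: str) -> int:
    """Return the gamePk query parameter of a Baseball Savant gamefeed URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["gamePk"][0])


game_id = game_pk_from_url("https://baseballsavant.mlb.com/gamefeed?gamePk=778935")
print(game_id)
```

The resulting integer can then be placed in the list passed as `game_list_input`.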

The call `scraper.get_data(game_list_input=[778935])` retrieves this data, which we assign to the variable `game_data`.

The next line, `scraper.get_data_df(data_list=game_data)`, converts the retrieved game data into a Polars DataFrame, which is stored in the variable `data_df`.

The last line converts the Polars DataFrame (`data_df`) to a Pandas DataFrame (`pandas_df`). This is necessary so that we can use Pandas' features later on.

In [None]:
# Initialize the scraper
scraper = MLB_Scrape()

# Retrieve game data for the specific game ID
game_data = scraper.get_data(game_list_input=[778935])

# Convert the game data to a Polars DataFrame
data_df = scraper.get_data_df(data_list=game_data)

# Convert the Polars DataFrame to a Pandas DataFrame
pandas_df = data_df.to_pandas()

We can now rename `pandas_df` to `df_pyb` for convenience's sake.

We will also print out the first few lines of the dataframe so that we can see all the data we have to work with.

In [None]:
df_pyb = pandas_df
df_pyb.head(5)

If we print out the shape of `df_pyb` we can see how much data there is in this game.

In [None]:
df_pyb.shape

If each row is thought of as one pitch, then there were 304 pitches thrown.

Since we only want data for one specific pitcher, we can filter the DataFrame to keep only the rows (or pitches) where `pitcher_name` is equal to the pitcher's name.

For this notebook, we will be looking at the data from Jon Gray. Considering Jon Gray is a starting pitcher, we can assume that he would have one of the higher pitch counts for this game. This will result in a larger sample size for us to make calculations from.
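The assumption that the starter has one of the higher pitch counts can be checked before filtering. The sketch below uses made-up data (`toy_df` and the pitcher labels are hypothetical) to show the idea with `value_counts`.

```python
import pandas as pd

# Made-up stand-in for the scraped game data: one row per pitch
toy_df = pd.DataFrame(
    {"pitcher_name": ["Starter A"] * 5 + ["Reliever B"] * 2 + ["Reliever C"] * 3}
)

# Number of pitches thrown by each pitcher, largest first
pitches_per_pitcher = toy_df["pitcher_name"].value_counts()
print(pitches_per_pitcher)
```

On the real data, the same call would be `df_pyb["pitcher_name"].value_counts()`, run before the filter below is applied.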

In [None]:
df_pyb = df_pyb[(df_pyb["pitcher_name"] == "Jon Gray")]
print(df_pyb.shape)  # Only the last expression in a cell is displayed, so print the shape explicitly
df_pyb.head(5)

We can check the shape of the newly filtered DataFrame and can see that there are 40 rows in which `pitcher_name` is equal to Jon Gray.

We can also print out the first 5 lines of the DataFrame to see that we still have all the data we had originally, only that it now pertains specifically to Jon Gray.

There are a lot of columns that we do not need. To find Jon's pitch mix and how many times he threw each pitch, we can create a new DataFrame that contains only the columns we need.

In [None]:
pitcher_pyb = df_pyb[
    [
        "game_id",
        "game_date",
        "pitcher_name",
        "pitch_description",
    ]
].copy()  # Copy so that adding a column later does not trigger a SettingWithCopyWarning
pitcher_pyb.head(5)

Now we have a DataFrame that is still associated with Jon Gray, but is more concise and contains everything we need to calculate his total pitches thrown and find his usage rate for each pitch type.

One way to quickly sum the total pitches thrown is to add a new column to the DataFrame named `PitchesThrown` and set it to 1 for every row, since each row represents one pitch.

In [None]:
pitcher_pyb["PitchesThrown"] = 1

Now with the `PitchesThrown` column created, we can create a DataFrame for pitch counts, total pitches, and a usage rate.

In [None]:
pitch_type_counts = pitcher_pyb.groupby("pitch_description", as_index=False)["PitchesThrown"].sum()
pitch_type_counts

We now have the number of pitches thrown for each pitch type.
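The column of ones is one way to count rows. As an aside, counting the rows in each group directly should give the same numbers; the sketch below compares both approaches on made-up data (`toy_pitches` is hypothetical).

```python
import pandas as pd

# Made-up pitch log: one row per pitch
toy_pitches = pd.DataFrame(
    {
        "pitch_description": [
            "Slider",
            "Four-Seam Fastball",
            "Slider",
            "Changeup",
            "Four-Seam Fastball",
            "Slider",
        ]
    }
)

# Approach used in this notebook: add a column of ones and sum it per pitch type
with_ones = toy_pitches.assign(PitchesThrown=1)
counts_sum = with_ones.groupby("pitch_description", as_index=False)["PitchesThrown"].sum()

# Alternative: count the rows in each group directly
counts_size = (
    toy_pitches.groupby("pitch_description").size().reset_index(name="PitchesThrown")
)

print(counts_sum)
print(counts_size)
```

Either form works here; the column of ones has the small advantage that it can be summed alongside other columns later.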

We can sort the pitch types from most thrown to least thrown.

In [None]:
pitch_type_counts = pitch_type_counts.sort_values(by="PitchesThrown", ascending=False)
pitch_type_counts

Now we can calculate the total number of pitches thrown.

In [None]:
total_pitches = pitch_type_counts['PitchesThrown'].sum()
total_pitches

For the usage rate, we can create the column `Usage %` by dividing `PitchesThrown` by `total_pitches`.

We will also multiply the result by 100 to express it as a percentage and round it to three decimal places.

In [None]:
pitch_type_counts['Usage %'] = ((pitch_type_counts['PitchesThrown']/total_pitches)*100).round(3)
pitch_type_counts

The next attribute to calculate is the average velocity for each pitch.

The first thing to do is update how we filtered the data originally. We need to add the column `start_speed`.

In [None]:
pitcher_pyb = df_pyb[
    [
        "game_id",
        "game_date",
        "pitcher_name",
        "pitch_description",
        "start_speed",
    ]
]

Now we can do what we previously did for `pitch_type_counts`.

Since `start_speed` is already a column in the raw data, we can simply use `.mean()` to find the average for each `pitch_description`. Additionally, we can round the average velocity to 1 decimal place.

In [None]:
pitch_type_velo = pitcher_pyb.groupby(['pitcher_name','pitch_description'],as_index=False)['start_speed'].mean().round(1)
pitch_type_velo

Next, we merge the `pitch_type_velo` DataFrame into `pitch_type_counts` so that the average velocity we calculated appears alongside the counts and usage rates.

In [None]:
pitch_type_counts = (pitch_type_counts.merge(pitch_type_velo, on="pitch_description", how="left"))

Now, when we print `pitch_type_counts` we have `start_speed` as a column for each `pitch_description`.

In [None]:
pitch_type_counts
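To make the merge step concrete, here is a small self-contained sketch with two rows borrowed from the output table at the top of the notebook. The `validate="one_to_one"` argument is an optional extra, not something this notebook uses; it makes pandas raise an error if either side has duplicate pitch types.

```python
import pandas as pd

counts = pd.DataFrame(
    {"pitch_description": ["Four-Seam Fastball", "Slider"], "PitchesThrown": [20, 15]}
)
velo = pd.DataFrame(
    {"pitch_description": ["Slider", "Four-Seam Fastball"], "start_speed": [86.7, 94.4]}
)

# A left merge keeps every row (and the row order) of `counts`
merged = counts.merge(velo, on="pitch_description", how="left", validate="one_to_one")
print(merged)
```

Note that the row order of `velo` does not matter: rows are matched on `pitch_description`, not on position.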

Next, we can calculate the average spin rate for each pitch type following the same pattern as before.

The first step is to update `pitcher_pyb` to include `spin_rate` from the scraped data.

In [None]:
pitcher_pyb = df_pyb[
    [
        "game_id",
        "game_date",
        "pitcher_name",
        "pitch_description",
        "start_speed",
        "spin_rate"
    ]
]

After that, it is the same process as before when we calculated average velocity.

In [None]:
pitch_type_spin_rate = (pitcher_pyb.groupby("pitch_description", as_index=False)["spin_rate"].mean()).round(1)
pitch_type_spin_rate

We then merge `pitch_type_spin_rate` into `pitch_type_counts`.

We can also quickly reorder the columns in the output.

In [None]:
pitch_type_counts = (pitch_type_counts.merge(pitch_type_spin_rate, on="pitch_description", how="left"))

pitch_type_counts = pitch_type_counts[
    [
        "pitcher_name",
        "pitch_description",
        "PitchesThrown",
        "Usage %",
        "start_speed",
        "spin_rate",
    ]
]

pitch_type_counts
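The select, group, average, and merge pattern above is repeated once per metric. As an aside, several averages can also be computed in a single pass with named aggregation. The sketch below uses made-up values (`toy` is hypothetical) and is not the approach this notebook follows.

```python
import pandas as pd

# Made-up pitch log with two metrics per pitch
toy = pd.DataFrame(
    {
        "pitch_description": ["Slider", "Slider", "Changeup", "Changeup"],
        "start_speed": [86.0, 87.0, 87.5, 87.1],
        "spin_rate": [2450.0, 2500.0, 1490.0, 1510.0],
    }
)

# One groupby produces the count and both averages, so no merges are needed
summary = (
    toy.groupby("pitch_description", as_index=False)
    .agg(
        PitchesThrown=("start_speed", "size"),
        start_speed=("start_speed", "mean"),
        spin_rate=("spin_rate", "mean"),
    )
    .round(1)
)
print(summary)
```

The step-by-step version in this notebook is easier to follow one column at a time; the single-pass version is mostly a convenience once the metrics are understood.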

## Induced Vertical Break (iVB)

The next value to calculate is the Induced Vertical Break (iVB) for each pitch type.

Per [Fangraphs](https://blogs.fangraphs.com/a-visual-scouting-primer-pitching-part-two/), Induced Vertical Break aims to quantify a pitcher’s ability to fight gravity.
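To build intuition for "fighting gravity", here is a rough sketch based on a common constant-acceleration approximation: take the pitch's vertical acceleration, remove gravity's share, and convert what is left into inches of movement over the flight time. The function name, the inputs, and the example numbers are all assumptions for illustration; this is not necessarily how the `ivb` column in the scraped data is produced.

```python
G_FT_S2 = 32.174  # Gravitational acceleration in ft/s^2


def induced_vertical_break_in(az_ft_s2: float, flight_time_s: float) -> float:
    """Approximate iVB in inches: vertical movement with gravity's share removed."""
    non_gravity_accel = az_ft_s2 + G_FT_S2  # az is negative when the ball accelerates downward
    return 0.5 * non_gravity_accel * flight_time_s**2 * 12  # feet to inches


# A pitch that accelerates downward at exactly g has no induced break
gravity_only = induced_vertical_break_in(-32.174, 0.40)

# A made-up pitch that drops more slowly than gravity alone would dictate
riding_pitch = induced_vertical_break_in(-16.0, 0.40)

print(gravity_only, round(riding_pitch, 1))
```

Under this approximation, a positive value means the pitch dropped less than gravity alone would predict, and a negative value means it dropped more.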

iVB does not require a calculation on our end because it is already listed in the raw data as `ivb`. We can simply update `pitcher_pyb` to include it and then average it for each pitch type.

In [28]:
pitcher_pyb = df_pyb[
    [
        "game_id",
        "game_date",
        "pitcher_name",
        "pitch_description",
        "start_speed",
        "spin_rate",
        "ivb"
    ]
]

pitch_type_ivb = (pitcher_pyb.groupby("pitch_description", as_index=False)["ivb"].mean()).round(1)
pitch_type_ivb

| | pitch_description | ivb |
|---|---|---|
| 0 | Changeup | 8.1 |
| 1 | Curveball | -8.3 |
| 2 | Four-Seam Fastball | 15.1 |
| 3 | Slider | 1.9 |


Horizontal break is the amount of lateral (side-to-side) movement a pitch experiences due to spin, measured in inches.

We can get the horizontal break the same way as Induced Vertical Break.

In [29]:
pitcher_pyb = df_pyb[
    [
        "game_id",
        "game_date",
        "pitcher_name",
        "pitch_description",
        "start_speed",
        "spin_rate",
        "ivb",
        "hb"
    ]
]

pitch_type_hb = (pitcher_pyb.groupby("pitch_description", as_index=False)["hb"].mean()).round(1)
pitch_type_hb

| | pitch_description | hb |
|---|---|---|
| 0 | Changeup | 14.2 |
| 1 | Curveball | -10.6 |
| 2 | Four-Seam Fastball | 8.5 |
| 3 | Slider | -3.4 |


Now we can merge these two new DataFrames into `pitch_type_counts` and see how our output is looking.

In [30]:
pitch_type_counts = (pitch_type_counts.merge(pitch_type_ivb, on="pitch_description", how="left"))
pitch_type_counts = (pitch_type_counts.merge(pitch_type_hb, on="pitch_description", how="left"))
pitch_type_counts

| | pitcher_name | pitch_description | PitchesThrown | Usage % | start_speed | spin_rate | ivb | hb |
|---|---|---|---|---|---|---|---|---|
| 0 | Jon Gray | Four-Seam Fastball | 20 | 50.0 | 94.4 | 1938.4 | 15.1 | 8.5 |
| 1 | Jon Gray | Slider | 15 | 37.5 | 86.7 | 2473.7 | 1.9 | -3.4 |
| 2 | Jon Gray | Changeup | 3 | 7.5 | 87.3 | 1502.7 | 8.1 | 14.2 |
| 3 | Jon Gray | Curveball | 2 | 5.0 | 77.3 | 2622.5 | -8.3 | -10.6 |


In [None]:
'''pitcher_pyb = pitcher_pyb.sort_values(by=["pitch_description"])
is_ball = [11, 12, 13, 14]
strike = [1, 2, 3, 4, 5, 6, 7, 8, 9]
pitcher_pyb["InZone"] = np.where(pitcher_pyb["zone"].isin(strike), 1, 0)
pitcher_pyb["OutZone"] = np.where(pitcher_pyb["zone"].isin(is_ball), 1, 0)
pitcher_pyb'''

In [None]:
'''pitcher_pyb["vy_f"] = -np.sqrt(
    pitcher_pyb["vy0"] ** 2 - (2 * pitcher_pyb["ay"] * (y0 - yf))
)

# Compute time (t)
pitcher_pyb["t"] = (pitcher_pyb["vy_f"] - pitcher_pyb["vy0"]) / pitcher_pyb["ay"]

# Compute final z-velocity (vz_f)
pitcher_pyb["vz_f"] = pitcher_pyb["vz0"] + (pitcher_pyb["az"] * pitcher_pyb["t"])

# Compute final x-velocity (vx_f)
pitcher_pyb["vx_f"] = pitcher_pyb["vx0"] + (pitcher_pyb["ax"] * pitcher_pyb["t"])

# Compute VAA
pitcher_pyb["VAA"] = -np.arctan(pitcher_pyb["vz_f"] / pitcher_pyb["vy_f"]) * (
    180 / np.pi
)

# Compute Horizontal Approach Angle (HAA)
pitcher_pyb["HAA"] = -np.arctan(pitcher_pyb["vx_f"] / pitcher_pyb["vy_f"]) * (
    180 / np.pi
)

# Get average vRel per pitch type
pitch_type_vrel = (
    df_pyb.groupby("pitch_description", as_index=False)["z0"].mean()
).round(1)
pitch_type_vrel.rename(columns={"z0": "vRel"}, inplace=True)

# Get average hRel per pitch type
pitch_type_hrel = (
    df_pyb.groupby("pitch_description", as_index=False)["x0"].mean()
).round(1)
pitch_type_hrel.rename(columns={"x0": "hRel"}, inplace=True)
pitcher_hand_unique = df_pyb[["pitch_description", "pitcher_hand"]].drop_duplicates(
    subset=["pitch_description"]
)
pitch_type_hrel = pitch_type_hrel.merge(
    pitcher_hand_unique, on="pitch_description", how="left"
)
pitch_type_hrel["hRel"] = np.where(
    pitch_type_hrel["pitcher_hand"] == "L",
    -pitch_type_hrel["hRel"],
    pitch_type_hrel["hRel"],
)

whiff_pitches = pitcher_pyb[pitcher_pyb["is_whiff"] == True]
pitch_type_whiff = (
    whiff_pitches.groupby("pitch_description").size().reset_index(name="whiff_count")
)
pitch_type_whiff.rename(columns={"whiff_count": "Whiffs"}, inplace=True)
pitch_type_whiff["Whiffs"] = pitch_type_whiff["Whiffs"].astype(int)

strike_pitches = pitcher_pyb[pitcher_pyb["play_code"] == "C"]
pitch_type_cs = (
    strike_pitches.groupby("pitch_description").size().reset_index(name="CS")
)

pitches_in_zone = pitcher_pyb.groupby("pitch_description")["InZone"].sum().reset_index()
pitches_in_zone.rename(columns={"InZone": "Pitches_In_Zone"}, inplace=True)

pitches_out_of_zone = (
    pitcher_pyb.groupby("pitch_description")["OutZone"].sum().reset_index()
)
pitches_out_of_zone.rename(columns={"OutZone": "Pitches_Out_Of_Zone"}, inplace=True)
# print(pitches_out_of_zone)


swings_out_of_zone = pitcher_pyb[
    pitcher_pyb["zone"].isin([11, 12, 13, 14]) & (pitcher_pyb["is_swing"] == True)
]
swings_out_of_zone = (
    swings_out_of_zone.groupby("pitch_description")["is_swing"]
    .sum()
    .astype(int)  # Convert to integer
    .reset_index()
)
# print(swings_out_of_zone)


pitch_type_spin = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["spin_rate"].mean()
).round(1)
pitch_type_spin.rename(columns={"spin_rate": "Spin Rate"}, inplace=True)

pitch_type_swing = (
    pitcher_pyb.groupby("pitch_description")["is_swing"].sum().astype(int).reset_index()
)
pitch_type_swing.rename(columns={"is_swing": "Swings"}, inplace=True)


pitch_type_ivb = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["ivb"].mean()
).round(1)
pitch_type_ivb.rename(columns={"ivb": "iVB"}, inplace=True)

pitch_type_hb = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["hb"].mean()
).round(1)
pitch_type_hb.rename(columns={"hb": "HB"}, inplace=True)

pitch_avg_velo = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["start_speed"].mean()
).round(1)
pitch_avg_velo.rename(columns={"start_speed": "Avg Velo"}, inplace=True)

pitch_avg_exten = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["extension"].mean()
).round(1)
pitch_avg_exten.rename(columns={"extension": "Extension"}, inplace=True)

# Compute the mean VAA for each pitch type
vaa_means = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["VAA"].mean()
).round(1)

# Compute the mean HAA for each pitch type
haa_means = (
    pitcher_pyb.groupby("pitch_description", as_index=False)["HAA"].mean()
).round(1)

# Compute the highest exit velocity for each pitch type
df_hits = pitcher_pyb.dropna(subset=["launch_speed"])
# Group by pitch type and find the index of the max exit velocity, handling NaN values
idx = df_hits.groupby("pitch_description")["launch_speed"].idxmax().dropna()
# Retrieve the rows with max exit velocity
max_exit_velo = df_hits.loc[
    idx, ["pitch_description", "batter_name", "launch_speed"]
].copy()
max_exit_velo.rename(columns={"launch_speed": "Max Exit Velo"}, inplace=True)

pitch_type_counts = pitcher_pyb.groupby(
    ["pitcher_name", "pitch_description"], as_index=False
)["PitchesThrown"].sum()'''
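The disabled draft cell above derives the Vertical Approach Angle (VAA) and Horizontal Approach Angle (HAA) from the initial velocities and accelerations, using the constants `y0` and `yf` defined in the import cell. Here is the same kinematics for a single pitch as a self-contained sketch. The function name and the input values are made up for illustration, and `Y0 = 50` is assumed to be the distance at which the tracking values are reported.

```python
import math

Y0 = 50.0     # Assumed y-position (ft) at which vx0, vy0, vz0 are reported
YF = 17 / 12  # y-position (ft) of the front of home plate


def approach_angles(vx0, vy0, vz0, ax, ay, az):
    """Return (VAA, HAA) in degrees under a constant-acceleration model."""
    # Velocity toward the plate at y = YF (negative: the ball moves toward the plate)
    vy_f = -math.sqrt(vy0**2 - 2 * ay * (Y0 - YF))
    # Time of flight from y = Y0 to y = YF
    t = (vy_f - vy0) / ay
    # Vertical and horizontal velocity when the ball reaches the plate
    vz_f = vz0 + az * t
    vx_f = vx0 + ax * t
    vaa = -math.degrees(math.atan(vz_f / vy_f))
    haa = -math.degrees(math.atan(vx_f / vy_f))
    return vaa, haa


# Made-up but plausible fastball values (ft/s for velocities, ft/s^2 for accelerations)
vaa, haa = approach_angles(vx0=5.0, vy0=-135.0, vz0=-5.0, ax=-10.0, ay=28.0, az=-15.0)
print(round(vaa, 1), round(haa, 1))
```

A negative VAA means the ball is travelling downward as it crosses the plate, which matches the sign of the VAA values in the final output table.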

In [None]:
'''pitch_type_counts'''

In [None]:
'''pitch_type_counts["Total Pitches"] = pitch_type_counts["PitchesThrown"].sum()
pitch_type_counts'''

In [None]:
'''pitch_type_counts["Usage"] = (
    (pitch_type_counts["PitchesThrown"] / pitch_type_counts["Total Pitches"]) * 100
).round(2)

pitch_type_counts = (
    pitch_type_counts.merge(pitch_type_spin, on="pitch_description", how="left")
    .merge(pitch_avg_velo, on="pitch_description", how="left")
    .merge(pitch_type_ivb, on="pitch_description", how="left")
    .merge(pitch_type_hb, on="pitch_description", how="left")
    .merge(pitch_type_whiff, on="pitch_description", how="left")
    .merge(pitch_type_cs, on="pitch_description", how="left")
    .merge(pitches_in_zone, on="pitch_description", how="left")
    .merge(pitches_out_of_zone, on="pitch_description", how="left")
    .merge(swings_out_of_zone, on="pitch_description", how="left")
    .merge(pitch_type_swing, on="pitch_description", how="left")
    .merge(pitch_type_vrel, on="pitch_description", how="left")
    .merge(pitch_type_hrel, on="pitch_description", how="left")
    .merge(vaa_means, on="pitch_description", how="left")
    .merge(haa_means, on="pitch_description", how="left")
    .merge(pitch_avg_exten, on="pitch_description", how="left")
    .merge(max_exit_velo, on="pitch_description", how="left")
)

pitch_type_counts = pitch_type_counts.sort_values(by="PitchesThrown", ascending=False)

pitch_type_counts["Whiffs"] = pitch_type_counts["Whiffs"].fillna(0).astype(int)
pitch_type_counts["CS"] = pitch_type_counts["CS"].fillna(0).astype(int)
pitch_type_counts["CS+Whiffs"] = pitch_type_counts["CS"] + pitch_type_counts["Whiffs"]
pitch_type_counts["Zone%"] = (
    (pitch_type_counts["Pitches_In_Zone"] / pitch_type_counts["PitchesThrown"]) * 100
).round(1)
pitch_type_counts["is_swing"] = pitch_type_counts["is_swing"].fillna(0).astype(int)
pitch_type_counts["Chase%"] = (
    (pitch_type_counts["is_swing"] / pitch_type_counts["Pitches_Out_Of_Zone"]) * 100
).round(1)
pitch_type_counts["Whiff%"] = (
    (pitch_type_counts["Whiffs"] / pitch_type_counts["Swings"]) * 100
).round(1)


pitch_type_counts = pitch_type_counts[
    [
        "pitcher_name",
        "pitch_description",
        "PitchesThrown",
        "Usage",
        "Spin Rate",
        "Avg Velo",
        "iVB",
        "HB",
        "Whiffs",
        "CS",
        "CS+Whiffs",
        "Zone%",
        "Chase%",
        "Whiff%",
        "vRel",
        "hRel",
        "VAA",
        "HAA",
        "Extension",
        "Max Exit Velo",
        "batter_name",
    ]
]


pitch_type_counts.rename(
    columns={
        "game_date": "Date",
        "PitchesThrown": "Count",
        "pitch_description": "Pitch Type",
        "batter_name": "Batter",
        "pitcher_name": "Pitcher",
    },
    inplace=True,
)


print(pitch_type_counts)
'''
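The draft cells above compute `Zone%`, `Chase%`, and `Whiff%` from per-pitch flags. The sketch below walks through the same three ratios on made-up data (`toy` and its values are hypothetical), treating zones 1 to 9 as inside the strike zone and zones 11 to 14 as outside it, as the draft code does.

```python
import pandas as pd

# Made-up pitches: zones 1-9 are inside the strike zone, zones 11-14 are outside it
toy = pd.DataFrame(
    {
        "pitch_description": ["Slider"] * 4 + ["Curveball"] * 2,
        "zone": [5, 13, 14, 2, 11, 12],
        "is_swing": [True, True, False, False, False, False],
        "is_whiff": [False, True, False, False, False, False],
    }
)

toy["in_zone"] = toy["zone"].isin([1, 2, 3, 4, 5, 6, 7, 8, 9])
toy["out_zone"] = toy["zone"].isin([11, 12, 13, 14])
toy["chase"] = toy["out_zone"] & toy["is_swing"]  # Swing at a pitch outside the zone

per_pitch = toy.groupby("pitch_description").agg(
    pitches=("zone", "size"),
    in_zone=("in_zone", "sum"),
    out_zone=("out_zone", "sum"),
    swings=("is_swing", "sum"),
    chases=("chase", "sum"),
    whiffs=("is_whiff", "sum"),
)

# Zone%: share of pitches in the zone; Chase%: swings per out-of-zone pitch;
# Whiff%: misses per swing
per_pitch["Zone%"] = (per_pitch["in_zone"] / per_pitch["pitches"] * 100).round(1)
per_pitch["Chase%"] = (per_pitch["chases"] / per_pitch["out_zone"] * 100).round(1)
per_pitch["Whiff%"] = (per_pitch["whiffs"] / per_pitch["swings"] * 100).round(1)

print(per_pitch[["Zone%", "Chase%", "Whiff%"]])
```

The toy curveball draws no swings, so its `Whiff%` is a 0/0 division and comes out as missing rather than 0. That is the same pattern as the Curveball row in the final output table, where `Whiff%` is empty.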