# Workload Effect on Pitcher Injury

## Goal: 
Determine if we can find:
1) Maximize the predictability of the probability of a pitcher getting injured
2) A causal effect on pitcher injury (e.g., workload, less rest, etc.)

## Background:
For our (1) goal, we want to predict the probability that a pitcher will get injured as a result of workload, rest (or lack thereof), distance traveled, etc. Essentially, given some features regarding what the pitcher has recently done, we can determine the probability that he'll get injured, should he play the next game. In a sense, we're trying to come up with a solution for teams', such that, based on the probability that results from the features we mentioned, a manager or top decision-maker would make judgdment (e.g., our probability induces a certain threshold such that meeting that threshold induces the idea of resting that pitcher instead of letting him pitch and risk an injury). 

Now we understand that there is one glaring issue, and that is that there are many other confounding factors that induce an injury that might not necessarily *be* correlated with workload (e.g., the ball comes back to the pitcher and hits them and takes them out, the pitcher's form wasn't quite right and it caused an inury as a result of straining your body to come up with velo). To account for this, we want to focus on types of injuries that we are confident *are* correlated with workload, and we'll use scientific evidence to support this (e.g., torn UCL). The reason to avoid those other "freak" types of injuries is that they are a potential source of noise/bias as mentioned previously, some injuries are just not a result of workload, meaning it is entirely possible that our models could *learn* from these types of injuries and determine some kind of relationship with the probability of an injury, which we would not want. 

To approach this problem, we want to take a step back and view this problem from another lens. That being: Industrial Engineering. Essentially, we want to view pitchers as "machines," and as a result, we want to estimate the "failure" rate (injury rate) of these "machines" as a function of workload cycles, rest, and travel. Meaning, we want to quantitatively come up with policies for the pitcher that are analogous to process optimizations. As we want to identify when the injury risk of these pitchers accelerate, such that we can prevent this *potential* injury in order to maximize pitcher usage, and minimize any waste, cost, or downtime.

In future iterations, we'd be interested in learning about *true* causal effects in terms of pitcher injuries, and whether or not the factors we talked about (e.g., workload, rest, etc.) have that kind of effect.

## Methodology:


To approach this problem, we first want to understand if there *are* any type of relationships between these workload variables and injury probability. 

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pybaseball as pyb
import requests

pd.set_option('display.max_columns', None)

In [107]:
mlb_injuries = pd.read_csv('mlb_injuries.csv')
mlb_injuries

Unnamed: 0,rank,player,pos,team,il_type,injury,start_date,end_date,days_missed,cash_total,cash_per_day,reason_raw,year
0,1,Anthony Rendon,3B,LAA,60-Day IL,Hip,2025-03-27,2025-09-28,186,37999986,204301.0,60-Day IL - Hip: 3/27/25-9/28/25,2025
1,2,Gerrit Cole,SP,NYY,60-Day IL,Elbow Tommy John,2025-03-27,2025-09-28,186,35999928,193548.0,60-Day IL - Elbow Tommy John: 3/27/25-9/28/25,2025
2,3,Kris Bryant,1B,COL,60-Day IL,Back,2025-04-13,2025-09-28,169,23623665,139785.0,60-Day IL - Back: 4/13/25-9/28/25,2025
3,4,Jordan Montgomery,SP,ARI,60-Day IL,Elbow Tommy John,2025-03-27,2025-09-28,186,22500048,120968.0,60-Day IL - Elbow Tommy John: 3/27/25-9/28/25,2025
4,5,Joe Musgrove,SP,SD,60-Day IL,Elbow Tommy John,2025-03-27,2025-09-28,186,20000022,107527.0,60-Day IL - Elbow Tommy John: 3/27/25-9/28/25,2025
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3943,452,Reyes Moronta,RP,SF,60-Day IL,Shoulder,2020-07-23,2020-09-27,67,0,0.0,60-Day IL - Shoulder: 7/23/20-9/27/20,2020
3944,453,Yennsy Diaz,SP,TOR,60-Day IL,Arm,2020-07-23,2020-09-27,67,0,0.0,60-Day IL - Arm: 7/23/20-9/27/20,2020
3945,454,Troy Stokes,OF,DET,60-Day IL,Undisclosed,2020-08-09,2020-09-27,50,0,0.0,60-Day IL - Undisclosed: 8/9/20-9/27/20,2020
3946,455,Jacob Webb,RP,ATL,60-Day IL,Shoulder,2020-07-23,2020-09-08,48,0,0.0,60-Day IL - Shoulder: 7/23/20-9/8/20,2020


In [43]:
def get_name_birth(id: int):
    """
    A get function to find the birth year and full name of each player based on their mlb id using MLB's Stats API

    Note: you have to be online to run this
    """
    url = f"https://statsapi.mlb.com/api/v1/people/{id}"
    r = requests.get(url).json()
    return [r['people'][0]['birthDate'].split('-')[0], r['people'][0]['fullName'], r['people'][0]['primaryPosition']['abbreviation']]

In [101]:
def load_year_data(year: int):
    """
    Creates the relevant pitcher data for each given year
    Should take about 7-8 min to run
    """
    df = pyb.statcast(f'{year}-03-01', f'{year}-11-30').copy() ## chooses the specific year
    ## cleans up the pitch by pitch data for that season
    df = df[df['game_type'] == 'R'].sort_values(by=['game_date', 'game_pk', 'inning', 'at_bat_number'], ascending=True).reset_index(drop=True).copy()
    df['game_date'] = pd.to_datetime(df['game_date'])
    ## adds the pitch group to simplify pitch usage
    fastball_pitches = ['FF', 'FC', 'SI']
    breaking_pitches = ['CU', 'KC', 'SC', 'SL', 'SV', 'ST']
    offspeed_pitches = ['CH', 'FO', 'FS']
    df['pitch_group'] = df['pitch_type'].apply(lambda x: 'Fastball' if x in fastball_pitches 
                                               else 'Breakingball' if x in breaking_pitches 
                                               else 'Offspeed' if x in offspeed_pitches 
                                               else 'Other')
    ## finds all of the appearances each pitcher made in that season as well as the number of pitches thrown
    df_pitcher_games = df.groupby(['player_name', 'pitcher', 'game_pk', 'game_date'], as_index=False).agg(pitches_thrown=('pitcher', 'count')).copy()
    df_pitcher_games = df_pitcher_games.sort_values(['player_name', 'game_date']).copy()

    ## creates the indicator of what the current season is
    df_pitcher_games['season'] = df_pitcher_games['game_date'].dt.year
    ## finds the previous appearence made for each current date 
    df_pitcher_games['last_start_date'] = (df_pitcher_games.groupby('pitcher')['game_date'].shift(1))
    ## finds the number of pitches that were thrown in the previous appearance made
    df_pitcher_games['pitches_last_start'] = (df_pitcher_games.groupby('pitcher')['pitches_thrown'].shift(1))
    ## the number of rest days the pitcher had before their current appearence
    df_pitcher_games['days_since_last_start'] = ((df_pitcher_games['game_date'] - df_pitcher_games['last_start_date']).dt.days) - 1
    ## a counter for the number appeareance made during the season
    df_pitcher_games['number_start'] = (df_pitcher_games.groupby('pitcher').cumcount() + 1)
    ## a flag for if that appearence made was their first of the season
    df_pitcher_games['first_start'] = (df_pitcher_games['number_start'] == 1).astype(int)

    ## finds who the real pitchers are in the dataset, as well as their birth year to find age -> takes about 3 min for one season's worth of pitchers
    pitcher_list = df_pitcher_games.groupby(['player_name', 'pitcher'])['number_start'].count().reset_index().copy()
    births = {i: get_name_birth(i) for i in pitcher_list['pitcher'].unique().tolist()}
    names_and_bdays = pd.DataFrame.from_dict(births, orient='index', columns=['birth_year', 'full_name', 'primary_pos']).reset_index(names='id')
    ## keeps it to only pitchers and excludes position players
    names_and_bdays = names_and_bdays[names_and_bdays['primary_pos'].isin(['P', 'TWP'])].reset_index(drop=True).copy() 
    names_and_bdays = names_and_bdays[['id', 'birth_year']].copy()

    ## only includes the players that are actually pitchers + adds their birth year
    df_pitcher_games = df_pitcher_games[df_pitcher_games['pitcher'].isin(names_and_bdays['id'].unique())].reset_index(drop=True).copy()
    df_pitcher_games = df_pitcher_games.merge(names_and_bdays, how='left', left_on='pitcher', right_on='id').copy()

    ## creates the age column
    df_pitcher_games['age'] = df_pitcher_games['season'] - df_pitcher_games['birth_year'].astype(int)

    ## cleans up the data
    df_pitcher_games = df_pitcher_games[['season', 'player_name', 'pitcher', 'age', 'days_since_last_start', \
                                         'pitches_last_start', 'number_start', 'first_start', \
                                            'last_start_date', 'game_date']].reset_index(drop=True).copy()

    ## the pitch-data for each game for each pitcher
    pitches = (df
               .groupby(['player_name', 'pitcher', 'game_pk', 'game_date', 'pitch_group'], as_index=False)
               .agg(pitches_thrown=('pitch_group', 'count'), 
                    avg_release_speed=('release_speed', 'mean'),
                    avg_spin = ('release_spin_rate', 'mean')
                    )
               ).copy()
    pitches = pitches[pitches['pitch_group'] != 'Other'].reset_index(drop=True).copy()
    ## pivots the table to include columns for each pitch type
    df_wide = (pitches
               .pivot_table(index=['player_name', 'pitcher', 'game_pk', 'game_date'], columns='pitch_group', values=['pitches_thrown', 'avg_release_speed', 'avg_spin'])
               )
    df_wide.columns = [f"{pitch}_{metric}" for metric, pitch in df_wide.columns]
    df_wide = df_wide.reset_index().copy()
    ## normalizes the thrown pitches to be rates
    total_pitches_thrown = df_wide['Breakingball_pitches_thrown'] + df_wide['Fastball_pitches_thrown'] + df_wide['Fastball_pitches_thrown']
    df_wide['Breakingball_pitches_thrown'] = df_wide['Breakingball_pitches_thrown'] / total_pitches_thrown
    df_wide['Fastball_pitches_thrown'] = df_wide['Fastball_pitches_thrown'] / total_pitches_thrown
    df_wide['Offspeed_pitches_thrown'] = df_wide['Offspeed_pitches_thrown'] / total_pitches_thrown
    df_wide = df_wide[['pitcher', 'game_date', 'Fastball_pitches_thrown', 'Fastball_avg_release_speed', 'Fastball_avg_spin',\
                       'Breakingball_pitches_thrown', 'Breakingball_avg_release_speed', 'Breakingball_avg_spin',\
                        'Offspeed_pitches_thrown', 'Offspeed_avg_release_speed', 'Offspeed_avg_spin'
                        ]]
    df_wide = df_wide.rename(columns={'Fastball_pitches_thrown': 'FB_usage', 'Fastball_avg_release_speed': 'FB_velo', 'Fastball_avg_spin': 'FB_spin',
                                      'Breakingball_pitches_thrown': 'BB_usage', 'Breakingball_avg_release_speed': 'BB_velo', 'Breakingball_avg_spin': 'BB_spin',
                                      'Offspeed_pitches_thrown': 'OS_usage', 'Offspeed_avg_release_speed': 'OS_velo', 'Offspeed_avg_spin': 'OS_spin'
                                      })

    ## adds the pitch level data to each game
    df_pitcher_games = df_pitcher_games.merge(df_wide, how='left', on=['pitcher', 'game_date']).copy()
    pitches_to_shift = ['FB_usage', 'FB_velo', 'FB_spin', 'BB_usage', 'BB_velo', 'BB_spin', 'OS_usage', 'OS_velo', 'OS_spin']
    df_pitcher_games[pitches_to_shift] = (df_pitcher_games.groupby('pitcher')[pitches_to_shift].shift(1))

    return df_pitcher_games

In [109]:
mlb_data = pd.concat([load_year_data(i) for i in range(2020, 2025)], ignore_index=True)

This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates


  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_cop

This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates


  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_cop

This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates


  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_cop

This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates


  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_cop

This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates


  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_cop

This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates


100%|██████████| 246/246 [03:43<00:00,  1.10it/s]
  final_data = pd.concat(dataframe_list, axis=0).convert_dtypes(convert_string=False)


In [124]:
pitcher_injuries = mlb_injuries[mlb_injuries['pos'].isin(['SP', 'P', 'RP'])].sort_values(by='start_date', ascending=False).reset_index(drop=True).copy()
pitcher_injuries[pitcher_injuries['il_type'].isna()]['reason_raw'].unique()
# pitcher_injuries.head(10)

Unnamed: 0,rank,player,pos,team,il_type,injury,start_date,end_date,days_missed,cash_total,cash_per_day,reason_raw,year
1777,11,Pablo López,SP,MIN,,,,,119,13755329,115591.00,15-Day IL - Hamstring: 4/9/25-4/25/25 60-Day I...,2025
1778,14,Zach Eflin,SP,BAL,,,,,121,11709654,96774.00,"15-Day IL - Back: 4/8/25-5/11/25, 6/29/25-7/23...",2025
1779,15,Frankie Montas,SP,NYM,,,,,128,11698944,91398.00,60-Day IL - Back: 3/27/25-6/24/25 15-Day IL - ...,2025
1780,16,Jon Gray,SP,TEX,,,,,162,11322504,69892.00,60-Day IL - Arm: 3/27/25-7/23/25 60-Day IL - N...,2025
1781,22,Lance McCullers,SP,HOU,,,,,105,9596790,91398.00,15-Day IL - Elbow: 3/27/25-5/4/25 15-Day IL - ...,2025
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349,297,Sam Coonrod,RP,SF,,,,,26,81432,3132.00,"10-Day IL - Shoulder: 7/31/20-8/23/20, 9/26/20...",2020
2350,307,Walker Buehler,SP,LAD,,,,,23,76728,3336.00,10-Day IL - Finger: 8/27/20-9/2/20 10-Day IL -...,2020
2351,356,Wilmer Font,RP,TOR,,,,,16,51918,3244.88,10-Day IL - Undisclosed: 7/23/20-7/27/20 10-Da...,2020
2352,358,Giovanny Gallegos,RP,STL,,,,,16,50976,3186.00,10-Day IL - Illness: 7/23/20-7/28/20 10-Day IL...,2020


In [112]:
mlb_data

Unnamed: 0,season,player_name,pitcher,age,days_since_last_start,pitches_last_start,number_start,first_start,last_start_date,game_date,FB_usage,FB_velo,FB_spin,BB_usage,BB_velo,BB_spin,OS_usage,OS_velo,OS_spin
0,2020,"Abreu, Albert",656061,25,,,1,1,NaT,2020-08-08,,,,,,,,,
1,2020,"Abreu, Albert",656061,25,25.0,41,2,0,2020-08-08,2020-09-03,0.392157,96.205,1994.8,0.215686,84.636364,2243.0,0.196078,86.68,1934.0
2,2020,"Abreu, Bryan",650556,23,,,1,1,NaT,2020-07-26,,,,,,,,,
3,2020,"Abreu, Bryan",650556,23,2.0,31,2,0,2020-07-26,2020-07-29,0.205128,93.275,2320.875,0.589744,84.291304,2583.782609,,,
4,2020,"Abreu, Bryan",650556,23,1.0,5,3,0,2020-07-29,2020-07-31,0.166667,93.6,2292.0,0.666667,84.875,2543.5,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112091,2025,"deGrom, Jacob",594798,37,5.0,90,26,0,2025-08-25,2025-08-31,0.318966,97.681081,2470.648649,0.362069,90.454762,2687.833333,0.094828,90.809091,1617.545455
112092,2025,"deGrom, Jacob",594798,37,5.0,79,27,0,2025-08-31,2025-09-06,0.359649,97.253659,2488.219512,0.280702,90.63125,2647.34375,0.052632,89.233333,1559.0
112093,2025,"deGrom, Jacob",594798,37,5.0,97,28,0,2025-09-06,2025-09-12,0.322581,97.325,2518.125,0.354839,91.154545,2692.613636,0.104839,90.338462,1646.153846
112094,2025,"deGrom, Jacob",594798,37,4.0,88,29,0,2025-09-12,2025-09-17,0.365672,97.971429,2545.346939,0.268657,89.291667,2733.472222,0.022388,90.8,1518.0
