# 2025 Fantasy Baseball - Hitters 

Welcome to my 2025 fantasy baseball draft prep! I'm going to look at the categories in my league this year to see which players excel and are worth drafting. I know I could probably look this up somewhere, but it's more fun to try to do it myself. It will also help me learn more players, since right now my knowledge is mostly limited to my favorite team, the Blue Jays. I want to learn more names and figure out what I want my team to look like.

### Hitter Categories 

To understand which players will be the most valuable in my league, I need to look into the stats that will be a part of scoring.

This year for hitters we have **Runs, Home Runs, RBIs, Stolen Bases, Total Bases, and OPS (On-Base Plus Slugging).**

In [4]:
#import libs
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from scipy import stats 


pd.set_option('display.max_columns', None)  # Show all columns


In [5]:
#read in data 

data = pd.read_csv('2024_hitters.csv')
data[data['last_name, first_name'] == 'Judge, Aaron']

Unnamed: 0,"last_name, first_name",player_id,year,pa,home_run,k_percent,bb_percent,on_base_plus_slg,babip,b_rbi,b_total_bases,r_total_stolen_base,r_run,woba,xwoba,avg_swing_speed,exit_velocity_avg,launch_angle_avg,sweet_spot_percent,barrel_batted_rate,hard_hit_percent,avg_best_speed,avg_hyper_speed,whiff_percent,swing_percent
68,"Judge, Aaron",592450,2024,704,58,24.3,18.9,1.159,0.367,144,392,10,122,0.476,0.479,77.2,96.2,19.0,40.8,26.9,61.0,107.200436,99.103702,30.7,42.0


### Cumulative Z-scores 

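The first method I want to try is calculating the Z-score for each player. I will calculate the Z-score for each scoring category, then sum them together to get the cumulative Z-score. A Z-score measures how many standard deviations a player is above or below the average in a category, which puts counting stats like Home Runs and rate stats like OPS on the same scale.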
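
To make the idea concrete, here is a minimal sketch on made-up numbers. The three players and their stats are hypothetical, and the only assumption is that the population standard deviation (`ddof=0`) is used, which is the default in `scipy.stats.zscore`.

```python
import pandas as pd

# Toy numbers for illustration only (not real player stats).
toy = pd.DataFrame(
    {"HR": [10, 20, 30], "SB": [5, 5, 20]},
    index=["Player A", "Player B", "Player C"],
)

# z = (value - column mean) / column standard deviation.
# ddof=0 (population std) matches the default of scipy.stats.zscore.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Cumulative z-score: add up a player's z-scores across categories.
total_z = z.sum(axis=1)
print(total_z.sort_values(ascending=False))
```

Because every category is centered on its own average, the cumulative scores across all players sum to roughly zero, so a positive total means above average overall.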

In [6]:
#First I will make a separate df of just the scoring categories

scoring_df = data[['last_name, first_name','r_run','home_run','b_rbi','r_total_stolen_base','b_total_bases','on_base_plus_slg']]
scoring_df.columns =['player_name', 'runs', 'HR', 'RBI', 'SB', 'TB', 'OPS']
scoring_df

Unnamed: 0,player_name,runs,HR,RBI,SB,TB,OPS
0,"Freeman, Freddie",81,22,89,9,258,0.854
1,"Mountcastle, Ryan",54,13,63,3,201,0.733
2,"McCutchen, Andrew",66,20,50,3,184,0.739
3,"Kwan, Steven",83,14,44,12,204,0.793
4,"Diaz, Yainer",70,16,84,2,258,0.766
...,...,...,...,...,...,...,...
124,"Hernández, Teoscar",84,33,99,12,295,0.840
125,"Arraez, Luis",83,4,46,9,250,0.738
126,"Benintendi, Andrew",50,20,64,3,189,0.685
127,"Arenado, Nolan",70,16,71,2,228,0.719


In [7]:
#Create Z_scores

#Z-scores
numeric_cols = scoring_df.select_dtypes(include=[np.number])
zscores = stats.zscore(numeric_cols)

#Create data frame and add player names back 
z_scores_df = pd.DataFrame(zscores, columns=numeric_cols.columns)
z_scores_df['player_name'] = scoring_df['player_name']

#Make player_name first 
cols = ['player_name'] + [col for col in z_scores_df.columns if col != 'player_name']
z_scores_df = z_scores_df[cols]


z_scores_df

Unnamed: 0,player_name,runs,HR,RBI,SB,TB,OPS
0,"Freeman, Freddie",0.211181,0.035782,0.731158,-0.297452,0.409961,1.039510
1,"Mountcastle, Ryan",-1.363857,-0.953329,-0.644223,-0.775102,-0.810955,-0.370073
2,"McCutchen, Andrew",-0.663840,-0.184021,-1.331913,-0.775102,-1.175088,-0.300176
3,"Kwan, Steven",0.327850,-0.843428,-1.649309,-0.058626,-0.746696,0.328894
4,"Diaz, Yainer",-0.430501,-0.623626,0.466662,-0.854710,0.409961,0.014359
...,...,...,...,...,...,...,...
124,"Hernández, Teoscar",0.386185,1.244695,1.260151,-0.058626,1.202485,0.876418
125,"Arraez, Luis",0.327850,-1.942441,-1.543510,-0.297452,0.238604,-0.311826
126,"Benintendi, Andrew",-1.597196,-0.184021,-0.591323,-0.775102,-1.067990,-0.929247
127,"Arenado, Nolan",-0.430501,-0.623626,-0.221029,-0.854710,-0.232626,-0.533165


In [8]:
z_scores_df['total_z'] = z_scores_df.select_dtypes(include='number').sum(axis=1)

z_scores_df


Unnamed: 0,player_name,runs,HR,RBI,SB,TB,OPS,total_z
0,"Freeman, Freddie",0.211181,0.035782,0.731158,-0.297452,0.409961,1.039510,2.130140
1,"Mountcastle, Ryan",-1.363857,-0.953329,-0.644223,-0.775102,-0.810955,-0.370073,-4.917539
2,"McCutchen, Andrew",-0.663840,-0.184021,-1.331913,-0.775102,-1.175088,-0.300176,-4.430140
3,"Kwan, Steven",0.327850,-0.843428,-1.649309,-0.058626,-0.746696,0.328894,-2.641315
4,"Diaz, Yainer",-0.430501,-0.623626,0.466662,-0.854710,0.409961,0.014359,-1.017856
...,...,...,...,...,...,...,...,...
124,"Hernández, Teoscar",0.386185,1.244695,1.260151,-0.058626,1.202485,0.876418,4.911307
125,"Arraez, Luis",0.327850,-1.942441,-1.543510,-0.297452,0.238604,-0.311826,-3.528774
126,"Benintendi, Andrew",-1.597196,-0.184021,-0.591323,-0.775102,-1.067990,-0.929247,-5.144878
127,"Arenado, Nolan",-0.430501,-0.623626,-0.221029,-0.854710,-0.232626,-0.533165,-2.895658


In [9]:
#Order by total_z
z_scores_df.sort_values(by='total_z', ascending=False)

Unnamed: 0,player_name,runs,HR,RBI,SB,TB,OPS,total_z
67,"Ohtani, Shohei",3.302922,3.552622,2.900028,3.682968,3.687155,3.159709,20.285403
68,"Judge, Aaron",2.602905,3.992227,3.640617,-0.217843,3.280183,4.592591,17.890680
57,"Witt Jr., Bobby",2.777909,1.134794,1.789143,1.453933,2.894631,2.472392,12.522802
11,"Ramírez, José",2.136227,1.904103,2.265236,2.250017,2.016428,1.249200,11.821212
47,"Soto, Juan",2.952913,2.123905,1.789143,-0.456668,1.909331,2.600536,10.919160
...,...,...,...,...,...,...,...,...
63,"Turner, Justin",-1.072183,-1.173132,-1.067417,-1.013927,-1.346444,-0.323475,-5.996579
28,"Meyers, Jake",-1.538861,-0.953329,-0.750021,-0.138235,-1.560640,-1.383575,-6.324661
8,"Arcia, Orlando",-1.597196,-0.513724,-1.543510,-0.854710,-0.939472,-1.628213,-7.076826
58,"France, Ty",-1.830535,-0.953329,-1.279014,-0.934319,-1.367864,-1.103988,-7.469049


In [10]:
#Export table
z_scores_df.to_csv('2024_hitters_z_scores.csv', index=False)

### Checking consistency 

I want to see which players are streaky and which are more consistent. To do this, I need to pull data on each player on a game-to-game basis.

#### How to identify these types of players? 

- Standard deviation
- Rolling average over a window of 10-15 games
- Coefficient of variation for relative consistency
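
As a rough sketch of what these three measures look like in pandas, here they are on a made-up series of total bases per game. The numbers are hypothetical, and the window of 4 is only used because the toy series is short.

```python
import pandas as pd

# Hypothetical total bases per game for one hitter (made-up numbers).
tb_per_game = pd.Series([0, 2, 1, 4, 0, 0, 1, 5, 0, 2, 1, 1])

# Standard deviation: how far a typical game strays from the average.
std = tb_per_game.std()

# Coefficient of variation: std relative to the mean, so hitters with
# different averages can be compared on the same scale.
cv = std / tb_per_game.mean()

# Rolling average over a window of games (10-15 in practice; 4 here
# only because the toy series is short).
rolling = tb_per_game.rolling(window=4).mean()

print(f"std={std:.2f}, cv={cv:.2f}")
print(rolling.dropna())
```

A streaky hitter should show a higher coefficient of variation and a rolling average that swings more from window to window.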

In [11]:
from pybaseball import playerid_lookup, statcast_batter

In [12]:
#function to return mlb id
def get_key_mlbam(player_name):
    try: 
        last, first = player_name.split(', ', 1)
        player_id = playerid_lookup(last,first)
        return player_id['key_mlbam'].values[0]
    except IndexError:
        return None 

#function to return Baseball Reference id
def get_bbref_id(player_name):
    try: 
        last, first = player_name.split(', ', 1)
        player_id = playerid_lookup(last,first)
        return player_id['key_bbref'].values[0]
    except IndexError:
        return None 

In [13]:
#Function to get a single player's statcast data for a given time period
def get_statcast_data(start, end, player_name):
    player_id = get_key_mlbam(player_name)
    #get_key_mlbam returns None when the lookup finds no match
    if player_id is None:
        return None
    statcast_data = statcast_batter(start, end, player_id=player_id)
    #Keep regular season games only
    statcast_data = statcast_data[statcast_data['game_type'] == 'R']
    return statcast_data

#Function to pull statcast data for all players
def player_statcast_data(start, end, data):
    allPlayer_data = []
    for player in data['player_name']:
        player_data = get_statcast_data(start, end, player)
        #Skip players whose id lookup failed
        if player_data is not None:
            allPlayer_data.append(player_data)
    return allPlayer_data

As a reminder, the scoring categories are **Runs, Home Runs, RBIs, Stolen Bases, Total Bases, and OPS (On-Base Plus Slugging).**

With the pitch-by-pitch data I will only be able to extract Home Runs, Total Bases, and OPS. I don't mind this. RBIs depend more on other players being on base, and a player with a higher OPS should still have more chances to drive them in. The same goes for runs: the more a player gets on base, the more chances they will have to score.

In [14]:

#Calculate Homeruns 
def get_HR(statcast_data):
    HRs = len(statcast_data[statcast_data['events'].isin(['home_run'])])
    return HRs 

#Calculate Total Bases
def get_TB(statcast_data):
    singles = len(statcast_data[statcast_data['events'].isin(['single'])])
    doubles = len(statcast_data[statcast_data['events'].isin(['double'])])
    triples = len(statcast_data[statcast_data['events'].isin(['triple'])])
    HRs = len(statcast_data[statcast_data['events'].isin(['home_run'])])
    return (singles + (doubles * 2) + (triples*3) + (HRs*4))

#Calculate OPS (On Base Percentage + Slugging)
def get_OPS(statcast_data):
    hits = len(statcast_data[statcast_data['events'].isin(['single', 'double', 'triple', 'home_run'])])
    #Every hit or out that ends a plate appearance counts as an at bat, including the double play variants
    at_bats = len(statcast_data[statcast_data['events'].isin(["single", "double", "triple", "home_run", "strikeout", "strikeout_double_play", "field_out", "grounded_into_double_play", "double_play", "triple_play", "force_out", "field_error", "fielders_choice", "fielders_choice_out"])])
    walks_hbp = len(statcast_data[statcast_data['events'].isin(['walk', 'hit_by_pitch'])])
    sac_flys = len(statcast_data[statcast_data['events'].isin(['sac_fly'])])
    #With no at bats SLG is undefined, so return None instead of dividing by zero
    if at_bats == 0:
        return None
    SLG = get_TB(statcast_data)/at_bats
    OBP = (hits + walks_hbp)/(at_bats+walks_hbp+sac_flys)

    return round(OBP + SLG, 3) 


In [28]:

start = '2024-03-20'
end = '2024-09-30'
player_name = scoring_df.player_name[0]
print(player_name)
freeman_2024 = get_statcast_data(start, end, player_name)

Freeman, Freddie
Gathering Player Data


In [29]:
#Remove all non event data (.copy() avoids a SettingWithCopyWarning on the assignment below)
freeman_2024 = freeman_2024[freeman_2024['events'].notnull()].copy()

#Convert game_date to datetime
freeman_2024['game_date'] = pd.to_datetime(freeman_2024['game_date'], format='%Y-%m-%d')

In [30]:
get_HR(freeman_2024)

22

In [33]:
freeman_2024[freeman_2024['game_date'].dt.month == 5]['events'].value_counts()

field_out                    48
strikeout                    18
single                       15
walk                          8
double                        7
home_run                      4
force_out                     3
hit_by_pitch                  3
grounded_into_double_play     3
field_error                   1
sac_fly                       1
truncated_pa                  1
triple                        1
double_play                   1
Name: events, dtype: int64
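
As a quick sanity check on the formulas, the May counts above can be plugged in by hand. This is only a sketch: it assumes `double_play` counts as an at-bat and that `truncated_pa` is left out, which is my reading of those event labels rather than something the data confirms.

```python
# Event counts copied from the May value_counts() output above.
may = {
    "field_out": 48, "strikeout": 18, "single": 15, "walk": 8,
    "double": 7, "home_run": 4, "force_out": 3, "hit_by_pitch": 3,
    "grounded_into_double_play": 3, "field_error": 1, "sac_fly": 1,
    "truncated_pa": 1, "triple": 1, "double_play": 1,
}

hits = may["single"] + may["double"] + may["triple"] + may["home_run"]
total_bases = (may["single"] + 2 * may["double"]
               + 3 * may["triple"] + 4 * may["home_run"])

# Assumption: every out-type event listed here counts as an at-bat,
# including double_play; truncated_pa is left out.
outs = ["field_out", "strikeout", "force_out",
        "grounded_into_double_play", "field_error", "double_play"]
at_bats = hits + sum(may[e] for e in outs)

# OBP = (H + BB + HBP) / (AB + BB + HBP + SF), SLG = TB / AB
on_base = hits + may["walk"] + may["hit_by_pitch"]
obp = on_base / (at_bats + may["walk"] + may["hit_by_pitch"] + may["sac_fly"])
slg = total_bases / at_bats
print(total_bases, at_bats, round(obp + slg, 3))
```

Under those assumptions May works out to 48 total bases over 101 at-bats, for an OPS of about 0.812.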

In [35]:


#Home runs per month
for month in freeman_2024['game_date'].dt.month_name().unique():
    print(month)
    print(get_HR(freeman_2024[freeman_2024['game_date'].dt.month_name() == month]))

September
3
August
3
July
4
June
6
May
4
April
1
March
1
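
The month-by-month loop above could also be written as a single `groupby`, which would make it easier to run the same split for every player later. Here is a minimal sketch on a tiny made-up frame. Only the `game_date` and `events` column names come from the Statcast data. The rows themselves are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the event-level frame (made-up rows; only the
# 'game_date' and 'events' column names match the notebook's data).
events = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2024-04-02", "2024-04-15", "2024-05-01", "2024-05-20", "2024-05-21"]
    ),
    "events": ["home_run", "single", "home_run", "home_run", "strikeout"],
})

# Flag home runs, then sum the flags within each calendar month.
is_hr = events["events"].eq("home_run")
hr_by_month = is_hr.groupby(events["game_date"].dt.month).sum()
print(hr_by_month)
```

Grouping by the month number instead of the month name keeps the result in calendar order.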
