# Data Collection and Processing

This notebook covers collecting data with pybaseball, cleaning it, and selecting features. The data focuses on Red Sox hitters in the 2024 season. Pitcher data is collected for the pitchers those hitters faced during the season.

In [1]:
import pandas as pd

In [2]:
from pybaseball import batting_stats

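# Season batting stats for every hitter with at least 100 plate appearances (qual=100)
all_qualified_2024 = batting_stats(2024, qual=100)

# Keep only the rows whose team is Boston
red_sox_qualified_batters = all_qualified_2024[all_qualified_2024["Team"] == "BOS"]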

red_sox_qualified_batters.head()

Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,1B,...,maxEV,HardHit,HardHit%,Events,CStr%,CSW%,xBA,xSLG,xwOBA,L-WAR
41,24617,2024,Jarren Duran,BOS,27,160,671,735,191,108,...,113.9,225,0.437,515,0.161,0.266,0.271,0.448,0.338,6.4
30,17350,2024,Rafael Devers,BOS,27,138,525,601,143,76,...,114.7,201,0.523,384,0.123,0.266,0.272,0.509,0.364,3.9
84,23772,2024,Wilyer Abreu,BOS,25,132,399,447,101,51,...,114.4,139,0.498,279,0.157,0.282,0.229,0.418,0.317,2.8
35,15711,2024,Tyler O'Neill,BOS,29,113,411,473,99,50,...,113.1,123,0.484,254,0.164,0.308,0.213,0.48,0.339,2.1
214,27531,2024,David Hamilton,BOS,26,98,294,317,73,47,...,108.8,69,0.322,214,0.187,0.29,0.231,0.346,0.281,1.9
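One thing worth checking about the `Team == "BOS"` filter is how the source labels hitters who played for more than one club in the season. The sketch below uses a made-up toy table, and it assumes such rows carry a placeholder like `"- - -"` instead of a club code; that placeholder is an assumption here, so inspect `all_qualified_2024["Team"].unique()` to see what the real data contains.

```python
import pandas as pd

# Toy stand-in for the batting_stats output; every row is invented for illustration.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["BOS", "NYY", "- - -", "BOS"],  # "- - -" is an assumed multi-team placeholder
    "PA": [600, 450, 380, 120],
})

# Same boolean-mask filter as the notebook cell; .copy() returns an independent frame
# so later column assignments do not trigger chained-assignment warnings.
bos = toy[toy["Team"] == "BOS"].copy()

# Rows that an exact "BOS" match would silently skip under the assumption above.
multi_team = toy[toy["Team"] == "- - -"]

print(bos["Name"].tolist())
print(multi_team["Name"].tolist())
```

If the real table does contain such rows, a hitter who spent part of 2024 in Boston would not appear in `red_sox_qualified_batters`, which may or may not be the intended behavior.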


In [3]:
from pybaseball import playerid_lookup

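# Unique hitter names from the filtered Red Sox table
red_sox_names = red_sox_qualified_batters["Name"].unique()

player_ids = []
for name in red_sox_names:
    # Split on the first space only, so multi-word surnames stay intact
    first, last = name.split(" ", 1)
    res = playerid_lookup(last, first)
    if not res.empty:
        # Keep the first match; names shared by several players are not disambiguated here
        player_ids.append(res.iloc[0])

# One row per hitter; each row keeps its original index label from the lookup result
player_ids_df = pd.DataFrame(player_ids)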

Gathering player lookup table. This may take a moment.


In [4]:
player_ids_df.head()

Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,duran,jarren,680776,duraj001,duranja01,24617,2021.0,2025.0
0,devers,rafael,646240,dever001,deverra01,17350,2017.0,2025.0
0,abreu,wilyer,677800,abrew002,abreuwi02,23772,2023.0,2025.0
0,o'neill,tyler,641933,oneit001,oneilty01,15711,2018.0,2025.0
0,hamilton,david,666152,hamid002,hamilda03,27531,2023.0,2025.0
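The lookup loop takes the first name match and leaves every row with index label 0, as the output above shows. A possible refinement, sketched here under assumptions, is to match each candidate against the FanGraphs ID already present in the batting table (`IDfg` against `key_fangraphs`) and to reset the index at the end. `resolve_player_ids` and `fake_lookup` are hypothetical helpers, not part of pybaseball, and the IDs in the stub are invented so the example runs offline.

```python
import pandas as pd


def resolve_player_ids(roster, lookup):
    """Resolve each (Name, IDfg) row to one lookup row.

    `lookup` is any callable shaped like lookup(last, first) -> DataFrame
    with a `key_fangraphs` column. This helper is a hypothetical sketch.
    """
    rows = []
    for name, fg_id in zip(roster["Name"], roster["IDfg"]):
        # Split on the first space only, so multi-word surnames stay intact.
        first, last = name.split(" ", 1)
        res = lookup(last, first)
        if res.empty:
            continue  # no candidate at all; skip, as the original loop does
        exact = res[res["key_fangraphs"] == fg_id]
        # Prefer the ID match; fall back to the first candidate otherwise.
        rows.append(exact.iloc[0] if not exact.empty else res.iloc[0])
    # reset_index gives a clean 0..n-1 index instead of repeated labels.
    return pd.DataFrame(rows).reset_index(drop=True)


def fake_lookup(last, first):
    """Offline stub standing in for a real lookup; all IDs are made up."""
    table = pd.DataFrame({
        "name_last": ["hamilton", "hamilton", "duran"],
        "name_first": ["david", "david", "jarren"],
        "key_mlbam": [111, 222, 333],
        "key_fangraphs": [9001, 9002, 9003],
    })
    mask = (table["name_last"] == last.lower()) & (table["name_first"] == first.lower())
    return table[mask]


roster = pd.DataFrame({"Name": ["David Hamilton", "Jarren Duran"], "IDfg": [9002, 9003]})
resolved = resolve_player_ids(roster, fake_lookup)
print(resolved[["name_last", "key_mlbam", "key_fangraphs"]])
```

In the notebook, the same helper could be called with `red_sox_qualified_batters[["Name", "IDfg"]]` as `roster` and `playerid_lookup` as `lookup`, assuming `IDfg` and `key_fangraphs` refer to the same FanGraphs identifier, as the two outputs above suggest.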


In [None]:
from pybaseball import statcast_batter_game_logs