# Introduction to Baseball Data Sources

**Overview**: One of the first tasks in any baseball data science project is collecting the data you need to answer your questions. There is no shortage of baseball data, and different sources hold different kinds of information. This notebook reviews some of the most common sources and shows how to extract data from each.

**Next Steps:** A follow-up notebook covers best practices for data extraction, combining data from multiple sources, general preprocessing, and storing data in a database.

## Table of Contents

1. [Prerequisites & Setup](#1-prerequisites--setup)
2. [pybaseball — Statcast, FanGraphs & Baseball Reference](#2-pybaseball)
   - 2a. Statcast Pitch-Level Data
   - 2b. Player-Specific Statcast Data
   - 2c. Pitcher-Specific Statcast Data
   - 2d. Season-Level Stats (via FanGraphs)
   - **2e. Understanding Column Names & Dataset Documentation**
3. [MLB Stats API](#3-mlb-stats-api)
4. [Lahman Database (pylahman)](#4-lahman-database)
5. [Baseball Reference Scraping via pybaseball](#5-baseball-reference-scraping)
6. [Driveline OpenBiomechanics Data](#6-driveline-openbiomechanics)
7. [Computer Vision Baseball Datasets](#7-computer-vision-datasets)
8. [Exporting Data to CSV](#8-exporting-to-csv)
9. [Summary & Comparison of Data Sources](#9-summary--comparison)
10. [Next Steps: SQL Database Storage (Teaser)](#10-next-steps)

## 1. Prerequisites & Setup <a id="1-prerequisites--setup"></a>

Before we begin, we need to install several Python packages. If you're new to Python, `pip` is the package manager that downloads and installs libraries for you. Running the cells below with `!pip install` will install each package directly from this notebook.

**Packages we'll use:**
| Package | Purpose | Data Source |
|---------|---------|-------------|
| `pybaseball` | Statcast pitch data, FanGraphs stats, Baseball Reference stats | Baseball Savant, FanGraphs, Baseball Reference |
| `MLB-StatsAPI` | Python wrapper for the official MLB Stats API | MLB.com |
| `pylahman` | Historical baseball database | Lahman Database |
| `ezc3d` | Reading biomechanics C3D files | Driveline OpenBiomechanics |
| `pandas` | Data manipulation and analysis | (utility) |

**Requirements:**
- Python 3.8+
- An internet connection (we'll be making live API calls)
- Jupyter Notebook or JupyterLab

In [22]:
# Install required packages
# The "!" prefix runs shell commands from within a Jupyter notebook
!pip install pybaseball MLB-StatsAPI pylahman ezc3d pandas



In [1]:
# Import the libraries we'll use throughout this notebook
import pandas as pd
import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings for cleaner output (this also hides deprecation notices)

# We'll import source-specific libraries in each section
# so you can see exactly what comes from where
print("All base imports successful!")

All base imports successful!


**A note on caching:** The `pybaseball` library supports caching, which saves downloaded data locally so you don't have to re-download it every time you run the notebook. Let's enable that now to speed up repeated runs.

In [2]:
# Enable pybaseball's built-in caching
from pybaseball import cache
cache.enable()
print("pybaseball cache enabled -- data will be saved locally for faster re-runs.")

pybaseball cache enabled -- data will be saved locally for faster re-runs.


## 2. pybaseball — Statcast, FanGraphs & Baseball Reference <a id="2-pybaseball"></a>

[pybaseball](https://github.com/jldbc/pybaseball) is the Swiss Army knife of baseball data in Python. It wraps three major data sources into a single, easy-to-use package:

- **Statcast** (via Baseball Savant): Pitch-level tracking data including velocity, spin rate, exit velocity, launch angle, and pitch movement
- **FanGraphs**: Season-level batting and pitching statistics with advanced metrics (WAR, wRC+, FIP, etc.)
- **Baseball Reference**: Traditional stats and WAR calculations

Let's explore each of these.

### 2a. Statcast Pitch-Level Data

Statcast is MLB's pitch-tracking system that records detailed data on every pitch, hit, and play. The data includes:
- **Pitch characteristics**: velocity, spin rate, movement (horizontal & vertical break)
- **Batted ball data**: exit velocity, launch angle, distance
- **Positional data**: pitch location, spray angle

The `statcast()` function pulls this data for a date range. **Important**: Statcast collects *a lot* of data, so keep your date ranges small (a few days to a week) to avoid long download times.

**Common gotchas:**
- **Rate limiting**: Baseball Savant may throttle or block requests if you make too many in a short period. The `cache.enable()` call from Section 1 helps avoid repeat downloads. If you get blocked, wait a few minutes and try again.
- **Missing values**: Many columns are populated only for some pitches. Batted-ball stats such as `launch_speed` and `launch_angle` require contact (balls in play and many fouls), and `events` is filled only on the pitch that ends a plate appearance. Expect lots of NaN values; this is normal, not a data quality issue.
- **Date range size**: Pulling more than ~2 weeks at once can be slow or time out. For large ranges, pull in weekly chunks and concatenate.
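
The weekly-chunk advice above can be sketched as follows. The helper names `weekly_chunks` and `fetch_in_chunks` are hypothetical (they are not part of pybaseball), and the fetch function is passed in as an argument so the date logic can be checked without a network call.

```python
from datetime import date, timedelta

import pandas as pd


def weekly_chunks(start_dt, end_dt, days=7):
    """Yield (start, end) 'YYYY-MM-DD' pairs that cover the range without overlap."""
    start = date.fromisoformat(start_dt)
    end = date.fromisoformat(end_dt)
    while start <= end:
        chunk_end = min(start + timedelta(days=days - 1), end)
        yield start.isoformat(), chunk_end.isoformat()
        start = chunk_end + timedelta(days=1)


def fetch_in_chunks(start_dt, end_dt, fetch):
    """Call fetch(start_dt=..., end_dt=...) once per chunk and stack the results."""
    frames = [fetch(start_dt=s, end_dt=e) for s, e in weekly_chunks(start_dt, end_dt)]
    return pd.concat(frames, ignore_index=True)


# Inspect the chunks before spending time on downloads
print(list(weekly_chunks('2024-07-01', '2024-07-20')))

# With pybaseball installed, the real call would look like:
# from pybaseball import statcast
# july = fetch_in_chunks('2024-07-01', '2024-07-20', statcast)
```

Keeping the chunks non-overlapping matters: if two chunks shared a date, the concatenated frame would contain duplicate pitches.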

In [3]:
from pybaseball import statcast

# Pull one week of Statcast data from the 2024 season
# start_dt and end_dt are strings in 'YYYY-MM-DD' format
statcast_data = statcast(start_dt='2024-07-01', end_dt='2024-07-07')

print(f"Shape: {statcast_data.shape[0]:,} rows x {statcast_data.shape[1]} columns")
print(f"\nThat means we have {statcast_data.shape[0]:,} individual pitches recorded in just one week!")

This is a large query, it may take a moment to complete


100%|██████████| 7/7 [00:09<00:00,  1.36s/it]

Shape: 27,664 rows x 118 columns

That means we have 27,664 individual pitches recorded in just one week!





In [4]:
# Let's look at the first few rows
# .head() shows the first 5 rows of a DataFrame
statcast_data.head()

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,batter_days_until_next_game,api_break_z_with_gravity,api_break_x_arm,api_break_x_batter_in,arm_angle,attack_angle,attack_direction,swing_path_tilt,intercept_ball_minus_batter_pos_x_inches,intercept_ball_minus_batter_pos_y_inches
2694,SI,2024-07-07,93.2,2.14,5.14,"Bummer, Aaron",679032,607481,field_out,hit_into_play,...,2,2.77,1.31,-1.31,6.4,-1.406081,16.909678,16.097997,41.289885,17.591221
2854,SI,2024-07-07,94.1,2.19,5.21,"Bummer, Aaron",679032,607481,,swinging_strike,...,2,2.71,1.48,-1.48,6.8,-1.82628,12.045115,18.90572,37.240212,20.333823
2950,SI,2024-07-07,93.2,2.1,5.12,"Bummer, Aaron",679032,607481,,ball,...,2,2.45,1.31,-1.31,7.3,,,,,
3057,ST,2024-07-07,82.8,2.09,5.39,"Bummer, Aaron",679032,607481,,ball,...,2,3.18,-1.08,1.08,13.9,,,,,
3096,SI,2024-07-07,94.7,2.23,5.1,"Bummer, Aaron",665506,607481,double,hit_into_play,...,5,2.56,1.23,-1.23,5.7,-8.06621,37.803458,30.09535,32.919415,6.887575


In [5]:
# There are A LOT of columns. Let's see all of them:
print(f"Total columns: {len(statcast_data.columns)}\n")
print("All column names:")
for i, col in enumerate(statcast_data.columns, 1):
    print(f"  {i:3d}. {col}")

Total columns: 118

All column names:
    1. pitch_type
    2. game_date
    3. release_speed
    4. release_pos_x
    5. release_pos_z
    6. player_name
    7. batter
    8. pitcher
    9. events
   10. description
   11. spin_dir
   12. spin_rate_deprecated
   13. break_angle_deprecated
   14. break_length_deprecated
   15. zone
   16. des
   17. game_type
   18. stand
   19. p_throws
   20. home_team
   21. away_team
   22. type
   23. hit_location
   24. bb_type
   25. balls
   26. strikes
   27. game_year
   28. pfx_x
   29. pfx_z
   30. plate_x
   31. plate_z
   32. on_3b
   33. on_2b
   34. on_1b
   35. outs_when_up
   36. inning
   37. inning_topbot
   38. hc_x
   39. hc_y
   40. tfs_deprecated
   41. tfs_zulu_deprecated
   42. umpire
   43. sv_id
   44. vx0
   45. vy0
   46. vz0
   47. ax
   48. ay
   49. az
   50. sz_top
   51. sz_bot
   52. hit_distance_sc
   53. launch_speed
   54. launch_angle
   55. effective_speed
   56. release_spin_rate
   57. release_extension
   58.

That's a lot of columns! Let's look at some of the most commonly used ones:

In [7]:
# Select a subset of the most useful columns for a quick preview
key_columns = [
    'game_date', 'pitcher', 'batter', 'pitch_type', 'release_speed',
    'release_spin_rate', 'pfx_x', 'pfx_z',           # pitch movement
    'plate_x', 'plate_z',                              # pitch location at home plate
    'events', 'description',
    'launch_speed', 'launch_angle', 'hit_distance_sc', # batted ball data
    'home_team', 'away_team'
]

statcast_subset = statcast_data[key_columns]
statcast_subset.head(10)

Unnamed: 0,game_date,pitcher,batter,pitch_type,release_speed,release_spin_rate,pfx_x,pfx_z,plate_x,plate_z,events,description,launch_speed,launch_angle,hit_distance_sc,home_team,away_team
2694,2024-07-07,607481,679032,SI,93.2,2243,1.31,-0.16,0.669982,2.6815,field_out,hit_into_play,80.8,-72.0,1.0,ATL,PHI
2854,2024-07-07,607481,679032,SI,94.1,2229,1.48,-0.14,0.346494,2.317186,,swinging_strike,,,,ATL,PHI
2950,2024-07-07,607481,679032,SI,93.2,2345,1.31,0.15,-1.78623,1.662954,,ball,,,,ATL,PHI
3057,2024-07-07,607481,679032,ST,82.8,2356,-1.08,0.13,-1.736503,2.187113,,ball,,,,ATL,PHI
3096,2024-07-07,607481,665506,SI,94.7,2066,1.23,-0.04,-0.260368,2.74809,double,hit_into_play,95.5,-13.0,11.0,ATL,PHI
3205,2024-07-07,607481,593160,ST,81.7,2467,-1.45,-0.2,-1.395891,1.786845,field_out,hit_into_play,58.6,44.0,175.0,ATL,PHI
3382,2024-07-07,607481,593160,ST,80.4,2426,-1.02,-0.16,0.452714,1.948007,,called_strike,,,,ATL,PHI
3506,2024-07-07,607481,593160,SI,93.1,2027,1.38,-0.09,0.677216,2.271449,,called_strike,,,,ATL,PHI
3551,2024-07-07,607481,624641,SI,92.2,2099,1.2,-0.33,-0.182525,1.380542,single,hit_into_play,102.6,-11.0,7.0,ATL,PHI
3749,2024-07-07,607481,624641,SI,92.5,2218,1.32,-0.03,-0.324314,1.945787,,swinging_strike,,,,ATL,PHI


### 2b. Player-Specific Statcast Data

Often you want data for a specific player. pybaseball provides `playerid_lookup()` to find a player's MLB ID, and then `statcast_pitcher()` or `statcast_batter()` to get their data.

Let's look up Shohei Ohtani's batting data.

In [8]:
from pybaseball import playerid_lookup, statcast_batter

# Step 1: Look up the player's ID
# playerid_lookup takes (last_name, first_name)
ohtani_info = playerid_lookup('ohtani', 'shohei')
print("Player lookup results:")
ohtani_info

Gathering player lookup table. This may take a moment.
Player lookup results:


Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,ohtani,shohei,660271,ohtas001,ohtansh01,19755,2018.0,2025.0


In [9]:
# Step 2: Get the player's MLB Advanced Media ID (key_mlbam column)
ohtani_mlbam_id = ohtani_info['key_mlbam'].values[0]
print(f"Ohtani's MLBAM ID: {ohtani_mlbam_id}")

# Step 3: Pull Ohtani's Statcast batting data for part of 2024
ohtani_batting = statcast_batter(
    start_dt='2024-04-01',
    end_dt='2024-06-30',
    player_id=ohtani_mlbam_id
)

print(f"\nPitches seen by Ohtani: {len(ohtani_batting):,}")
ohtani_batting[key_columns].head(10)

Ohtani's MLBAM ID: 660271
Gathering Player Data

Pitches seen by Ohtani: 1,377


Unnamed: 0,game_date,pitcher,batter,pitch_type,release_speed,release_spin_rate,pfx_x,pfx_z,plate_x,plate_z,events,description,launch_speed,launch_angle,hit_distance_sc,home_team,away_team
0,2024-06-30,702352,660271,ST,82.2,2242.0,1.17,0.63,1.685588,2.042562,strikeout,swinging_strike,,,,SF,LAD
1,2024-06-30,702352,660271,FF,94.6,1852.0,-0.72,0.76,1.099376,3.422491,,swinging_strike,,,,SF,LAD
2,2024-06-30,702352,660271,SI,94.6,1772.0,-1.31,0.21,-0.606517,2.435513,,swinging_strike,,,,SF,LAD
3,2024-06-30,702352,660271,FF,95.0,1984.0,-0.86,0.86,0.680108,3.596541,strikeout,swinging_strike,,,,SF,LAD
4,2024-06-30,702352,660271,CH,88.7,1547.0,-1.5,-0.3,-1.450678,-0.232773,,blocked_ball,,,,SF,LAD
5,2024-06-30,702352,660271,ST,81.3,2311.0,1.13,0.41,1.417171,0.51973,,blocked_ball,,,,SF,LAD
6,2024-06-30,702352,660271,FC,88.8,2125.0,0.12,0.55,0.877504,2.233842,,foul,100.3,-23.0,5.0,SF,LAD
7,2024-06-30,702352,660271,CH,89.4,1650.0,-1.06,0.2,-0.345595,2.356711,,foul,95.9,46.0,294.0,SF,LAD
8,2024-06-30,694738,660271,CH,87.9,1579.0,-1.15,-0.15,-0.233079,2.377107,field_out,hit_into_play,93.2,52.0,264.0,SF,LAD
9,2024-06-30,694738,660271,CU,77.3,2982.0,0.96,-0.73,-0.356367,2.481083,,foul,70.6,54.0,194.0,SF,LAD


### 2c. Pitcher-Specific Statcast Data

Similarly, we can pull data for a specific pitcher. Let's look at the first three months of Tarik Skubal's 2024 Cy Young-winning season.

In [10]:
from pybaseball import statcast_pitcher

# Look up Skubal
skubal_info = playerid_lookup('skubal', 'tarik')
skubal_id = skubal_info['key_mlbam'].values[0]
print(f"Skubal's MLBAM ID: {skubal_id}")

# Pull his pitching data
skubal_pitching = statcast_pitcher(
    start_dt='2024-04-01',
    end_dt='2024-06-30',
    player_id=skubal_id
)

print(f"\nTotal pitches thrown by Skubal: {len(skubal_pitching):,}")

# Let's see his pitch mix
print("\nPitch type breakdown:")
print(skubal_pitching['pitch_type'].value_counts())

Skubal's MLBAM ID: 669373
Gathering Player Data

Total pitches thrown by Skubal: 1,353

Pitch type breakdown:
pitch_type
FF    433
CH    366
SI    280
SL    210
CU     62
FS      2
Name: count, dtype: int64
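
Raw counts are easier to compare across pitchers once they are expressed as shares. A minimal sketch, using the counts printed above rather than a live download:

```python
import pandas as pd

# Counts copied from the pitch type breakdown printed above
pitch_counts = pd.Series(
    {'FF': 433, 'CH': 366, 'SI': 280, 'SL': 210, 'CU': 62, 'FS': 2},
    name='count',
)

# Share of all pitches, as a percentage rounded to one decimal place
pitch_mix = (100 * pitch_counts / pitch_counts.sum()).round(1)
print(pitch_mix)

# On the pitch-level frame itself, the equivalent one-liner would be:
# skubal_pitching['pitch_type'].value_counts(normalize=True).mul(100).round(1)
```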


### 2d. Season-Level Stats (via FanGraphs)

While Statcast gives us pitch-level detail, sometimes we want season-level aggregated statistics. The `batting_stats()` and `pitching_stats()` functions pull data from FanGraphs, which includes advanced metrics like WAR (Wins Above Replacement), wRC+ (Weighted Runs Created Plus), and FIP (Fielding Independent Pitching).

In [11]:
from pybaseball import batting_stats

# Get 2024 season batting stats for all qualified batters
batting_2024 = batting_stats(2024)

print(f"Number of batters: {len(batting_2024)}")
print(f"Number of stat columns: {len(batting_2024.columns)}")

# Show top 10 batters by WAR
batting_2024.sort_values('WAR', ascending=False).head(10)[
    ['Name', 'Team', 'G', 'AB', 'H', 'HR', 'RBI', 'BB', 'SO', 'AVG', 'OBP', 'SLG', 'WAR']
]

Number of batters: 129
Number of stat columns: 320


Unnamed: 0,Name,Team,G,AB,H,HR,RBI,BB,SO,AVG,OBP,SLG,WAR
0,Aaron Judge,NYY,158,559,180,58,144,133,171,0.322,0.458,0.701,11.3
3,Bobby Witt Jr.,KCR,161,636,211,32,109,57,106,0.332,0.389,0.588,10.5
1,Shohei Ohtani,LAD,159,636,197,54,130,81,162,0.31,0.39,0.646,8.9
2,Juan Soto,NYY,157,576,166,41,109,129,119,0.288,0.419,0.569,8.3
9,Gunnar Henderson,BAL,159,630,177,37,92,78,159,0.281,0.364,0.529,7.9
19,Francisco Lindor,NYM,152,618,169,33,91,56,127,0.273,0.344,0.5,7.7
22,Jarren Duran,BOS,160,671,191,21,75,54,160,0.285,0.342,0.492,6.8
27,Elly De La Cruz,CIN,160,618,160,25,76,69,218,0.259,0.339,0.471,6.6
17,Jose Ramirez,CLE,158,620,173,39,118,54,82,0.279,0.335,0.537,6.5
8,Ketel Marte,ARI,136,504,147,36,95,65,106,0.292,0.372,0.56,6.3


In [12]:
from pybaseball import pitching_stats

# Get 2024 season pitching stats
pitching_2024 = pitching_stats(2024)

print(f"Number of pitchers: {len(pitching_2024)}")

# Show top 10 pitchers by WAR
pitching_2024.sort_values('WAR', ascending=False).head(10)[
    ['Name', 'Team', 'W', 'L', 'ERA', 'G', 'GS', 'IP', 'SO', 'BB', 'FIP', 'WAR']
]

Number of pitchers: 58


Unnamed: 0,Name,Team,W,L,ERA,G,GS,IP,SO,BB,FIP,WAR
0,Chris Sale,ATL,18,3,2.38,29,29,177.2,225,39,2.09,6.4
1,Tarik Skubal,DET,18,4,2.39,31,31,192.0,228,35,2.49,6.0
2,Zack Wheeler,PHI,16,7,2.57,32,32,200.0,224,52,3.13,5.4
11,Cole Ragans,KCR,11,9,3.14,32,32,186.1,223,67,2.99,4.9
9,Seth Lugo,KCR,16,9,3.0,33,33,206.2,181,48,3.25,4.7
20,Dylan Cease,SDP,14,11,3.47,33,33,189.1,224,65,3.1,4.7
17,Cristopher Sanchez,PHI,11,9,3.32,31,31,181.2,153,44,3.0,4.7
22,Logan Webb,SFG,13,10,3.47,33,33,204.2,172,50,2.95,4.4
24,George Kirby,SEA,14,11,3.53,33,33,191.0,179,23,3.26,4.1
13,Logan Gilbert,SEA,9,12,3.23,33,33,208.2,220,37,3.27,4.0
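
One thing to watch in the table above: the `IP` values end only in `.0`, `.1`, or `.2`, which suggests the usual baseball notation where the digit after the point counts outs (thirds of an inning), not tenths. A minimal sketch of a conversion, assuming that reading is right; `ip_to_outs` and `ip_to_innings` are hypothetical helper names:

```python
def ip_to_outs(ip):
    """Convert baseball IP notation (177.2 = 177 innings + 2 outs) to total outs."""
    whole, outs = divmod(round(ip * 10), 10)
    if outs not in (0, 1, 2):
        raise ValueError(f"unexpected IP value: {ip}")
    return whole * 3 + outs


def ip_to_innings(ip):
    """Convert baseball IP notation to true decimal innings."""
    return ip_to_outs(ip) / 3


# Chris Sale's line from the table above: 177.2 IP, 225 SO
sale_ip, sale_so = 177.2, 225
k9 = 9 * sale_so / ip_to_innings(sale_ip)
naive_k9 = 9 * sale_so / sale_ip  # treats .2 as two tenths, which overstates the rate

print(f"K/9 using thirds: {k9:.2f} | treating 177.2 as a decimal: {naive_k9:.2f}")
```

The difference is small for a single pitcher, but it affects any rate stat you compute yourself from `IP`.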


**Key takeaway**: `pybaseball` is the go-to library for most baseball data science projects. It gives you access to three major data sources (Statcast, FanGraphs, Baseball Reference) through a single, consistent Python interface.

### 2e. Understanding Column Names & Dataset Documentation

With 118 columns in Statcast alone, you'll inevitably encounter column names that aren't self-explanatory. Knowing **where to find documentation** and **how to explore unfamiliar columns programmatically** are essential skills for any baseball data project.

Below we'll cover:
1. How to inspect columns you don't recognize using code
2. A quick-reference table for the most common (and confusing) Statcast columns
3. Links to official glossaries for every data source

In [13]:
# Technique 1: Inspect unfamiliar columns -- check type, unique values, and samples
# This pattern works for ANY dataset, not just Statcast

mystery_cols = ['pfx_x', 'pfx_z', 'launch_speed_angle', 'zone', 'type', 'bb_type', 'if_fielding_alignment']

print("=== Quick Column Inspector ===\n")
for col in mystery_cols:
    dtype = statcast_data[col].dtype
    n_unique = statcast_data[col].nunique()
    n_missing = statcast_data[col].isna().sum()
    pct_missing = 100 * n_missing / len(statcast_data)
    sample_vals = statcast_data[col].dropna().unique()[:5]
    print(f"{col}:")
    print(f"  Type: {dtype} | Unique values: {n_unique} | Missing: {n_missing:,} ({pct_missing:.1f}%)")
    print(f"  Sample values: {list(sample_vals)}")
    print()

=== Quick Column Inspector ===

pfx_x:
  Type: Float64 | Unique values: 382 | Missing: 107 (0.4%)
  Sample values: [np.float64(1.31), np.float64(1.48), np.float64(-1.08), np.float64(1.23), np.float64(-1.45)]

pfx_z:
  Type: Float64 | Unique values: 357 | Missing: 107 (0.4%)
  Sample values: [np.float64(-0.16), np.float64(-0.14), np.float64(0.15), np.float64(0.13), np.float64(-0.04)]

launch_speed_angle:
  Type: Int64 | Unique values: 6 | Missing: 22,854 (82.6%)
  Sample values: [np.int64(2), np.int64(1), np.int64(3), np.int64(5), np.int64(4)]

zone:
  Type: Int64 | Unique values: 13 | Missing: 107 (0.4%)
  Sample values: [np.int64(6), np.int64(13), np.int64(5), np.int64(9), np.int64(7)]

type:
  Type: str | Unique values: 3 | Missing: 0 (0.0%)
  Sample values: ['X', 'S', 'B']

bb_type:
  Type: str | Unique values: 4 | Missing: 22,840 (82.6%)
  Sample values: ['ground_ball', 'popup', 'fly_ball', 'line_drive']

if_fielding_alignment:
  Type: str | Unique values: 3 | Missing: 137 (0.5%)
 

In [14]:
# Technique 2: For categorical columns, value_counts() reveals what the values mean
# This is often the fastest way to understand a column

print("=== Pitch Type Codes ===")
print(statcast_data['pitch_type'].value_counts())

print("\n=== Batted Ball Types (bb_type) ===")
print(statcast_data['bb_type'].value_counts())

print("\n=== 'type' Column (pitch result category) ===")
print(statcast_data['type'].value_counts())
print("  B = Ball, S = Strike (including foul), X = In play")

print("\n=== Event Types (at-bat outcomes) -- top 15 ===")
print(statcast_data['events'].value_counts().head(15))

=== Pitch Type Codes ===
pitch_type
FF    8743
SI    4321
SL    4305
CH    2754
FC    2496
ST    1896
CU    1612
FS     713
KC     524
SV     130
KN      42
EP       8
FA       4
PO       3
Name: count, dtype: int64

=== Batted Ball Types (bb_type) ===
bb_type
ground_ball    2046
fly_ball       1284
line_drive     1167
popup           327
Name: count, dtype: int64

=== 'type' Column (pitch result category) ===
type
S    12951
B     9889
X     4824
Name: count, dtype: int64
  B = Ball, S = Strike (including foul), X = In play

=== Event Types (at-bat outcomes) -- top 15 ===
events
field_out                    2843
strikeout                    1591
single                       1018
walk                          561
double                        318
home_run                      234
force_out                     138
grounded_into_double_play     112
hit_by_pitch                   63
field_error                    43
sac_fly                        42
intent_walk                    25
sac_b

In [15]:
# Technique 3: For numeric columns, .describe() gives you the range and distribution
# This helps you understand units and spot anomalies

print("=== Numeric Summary of Key Statcast Columns ===\n")
numeric_cols = ['release_speed', 'release_spin_rate', 'pfx_x', 'pfx_z',
                'plate_x', 'plate_z', 'launch_speed', 'launch_angle',
                'hit_distance_sc', 'release_extension']

statcast_data[numeric_cols].describe().round(2)

=== Numeric Summary of Key Statcast Columns ===



Unnamed: 0,release_speed,release_spin_rate,pfx_x,pfx_z,plate_x,plate_z,launch_speed,launch_angle,hit_distance_sc,release_extension
count,27557.0,27501.0,27557.0,27557.0,27551.0,27551.0,9245.0,9256.0,9250.0,27555.0
mean,89.43,2267.94,-0.1,0.57,0.03,2.32,82.45,17.68,158.78,6.48
std,5.87,352.87,0.87,0.68,0.84,0.97,15.17,33.09,120.3,0.46
min,41.3,63.0,-2.0,-1.71,-4.37,-7.65,9.1,-87.0,0.0,4.1
25%,85.0,2115.0,-0.85,0.13,-0.54,1.69,73.2,-4.0,23.0,6.2
50%,90.2,2301.0,-0.14,0.6,0.02,2.33,81.9,20.0,171.0,6.5
75%,94.2,2472.0,0.59,1.15,0.59,2.97,94.2,42.0,245.0,6.8
max,103.6,3475.0,2.21,2.09,4.63,6.72,118.1,89.0,459.0,8.4


### Statcast Column Quick-Reference

Here's a reference for the most commonly used (and commonly confusing) Statcast columns:

**Pitch Identification & Context**
| Column | Meaning | Example Values |
|--------|---------|---------------|
| `pitch_type` | Pitch type abbreviation | FF (4-seam), SI (sinker), SL (slider), CU (curveball), CH (changeup), ST (sweeper), FC (cutter), FS (splitter) |
| `pitch_name` | Full pitch type name | "4-Seam Fastball", "Slider" |
| `zone` | Strike zone region (1-9 = in zone, 11-14 = out of zone; there is no zone 10) | 1-9, 11-14 |
| `type` | Pitch result category | B (ball), S (strike/foul), X (in play) |
| `description` | Detailed pitch result | "called_strike", "swinging_strike", "hit_into_play", "ball", "foul" |
| `events` | Plate appearance outcome (only on the final pitch of the PA) | "single", "strikeout", "walk", "home_run", "field_out" |
| `des` | Full text play description | "Aaron Judge doubles (15) on a fly ball..." |
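
Because `events` is filled only on the last pitch of a plate appearance, filtering on it is a quick way to go from pitch-level rows to one row per plate appearance. A small sketch using a few rows copied from the preview table earlier in this section:

```python
import pandas as pd

# Five pitches from the preview above: four to batter 679032, one to batter 665506
pitches = pd.DataFrame({
    'batter': [679032, 679032, 679032, 679032, 665506],
    'description': ['hit_into_play', 'swinging_strike', 'ball', 'ball', 'hit_into_play'],
    'events': ['field_out', None, None, None, 'double'],
})

# Rows with a non-null `events` value mark the pitch that ended a plate appearance
pa_ending = pitches[pitches['events'].notna()]

print(pa_ending)
print(f"{len(pitches)} pitches, {len(pa_ending)} plate appearances")
```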

**Pitch Characteristics**
| Column | Meaning | Units |
|--------|---------|-------|
| `release_speed` | Pitch velocity at release | mph |
| `release_spin_rate` | Spin rate at release | rpm |
| `pfx_x` | Horizontal movement (catcher's perspective) | feet |
| `pfx_z` | Vertical movement (induced, relative to gravity) | feet |
| `plate_x` | Horizontal position at home plate (0 = center, negative = inside to RHB) | feet |
| `plate_z` | Vertical position at home plate | feet |
| `release_extension` | How far in front of the rubber the ball is released | feet |
| `spin_axis` | Spin axis direction | degrees |
| `effective_speed` | Perceived velocity accounting for release extension | mph |
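
Movement is stored in feet, but pitch movement is usually discussed in inches. A minimal conversion sketch, using values copied from the subset preview earlier in this section; the `_in` column names are just illustrative:

```python
import pandas as pd

# Movement values (feet) from the first rows of the subset preview
movement = pd.DataFrame({
    'pitch_type': ['SI', 'SI', 'ST'],
    'pfx_x': [1.31, 1.48, -1.08],
    'pfx_z': [-0.16, -0.14, 0.13],
})

# 1 foot = 12 inches
movement['pfx_x_in'] = (movement['pfx_x'] * 12).round(1)
movement['pfx_z_in'] = (movement['pfx_z'] * 12).round(1)

print(movement)
```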

**Batted Ball Data** (populated only when the batter makes contact; `bb_type` is filled only for balls in play)
| Column | Meaning | Units |
|--------|---------|-------|
| `launch_speed` | Exit velocity off the bat | mph |
| `launch_angle` | Vertical angle off the bat | degrees (-90 to 90) |
| `hit_distance_sc` | Projected distance of batted ball | feet |
| `bb_type` | Batted ball classification | "ground_ball", "fly_ball", "line_drive", "popup" |
| `hc_x`, `hc_y` | Hit coordinates on the field (spray chart) | pixels (image coordinates; home plate sits near (125, 199), not at the origin) |
| `estimated_ba_using_speedangle` | Expected batting avg based on exit velo & launch angle (xBA) | 0 to 1 |
| `estimated_woba_using_speedangle` | Expected wOBA based on exit velo & launch angle (xwOBA) | 0 to ~2 |
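
The `hc_x` / `hc_y` pixel coordinates are often turned into a spray angle. The sketch below is one common approach, not an official formula: the home plate offsets are an assumption based on community convention, the sample coordinates are made up, and `spray_angle` is a hypothetical helper name.

```python
import numpy as np

# ASSUMPTION: approximate pixel location of home plate in hc_x / hc_y space.
# These offsets are a widely used convention, not a value documented in this notebook.
HOME_X, HOME_Y = 125.42, 198.27


def spray_angle(hc_x, hc_y):
    """Angle in degrees from straight-away center field.

    Assumes negative values fall on the left-field side and positive on the right.
    """
    dx = np.asarray(hc_x, dtype=float) - HOME_X
    dy = HOME_Y - np.asarray(hc_y, dtype=float)  # pixel y grows downward on the image
    return np.degrees(np.arctan2(dx, dy))


# Hypothetical batted-ball coordinates for illustration only
sample_x = [125.42, 175.42, 75.42]
sample_y = [100.00, 148.27, 148.27]
print(spray_angle(sample_x, sample_y).round(1))
```

If the angles you get look mirrored or offset, check the offsets against the Baseball Savant documentation before relying on them.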

**Game State**
| Column | Meaning |
|--------|---------|
| `balls`, `strikes` | Count before the pitch |
| `on_1b`, `on_2b`, `on_3b` | Runner on base (player ID or NaN) |
| `outs_when_up` | Outs when batter came up |
| `inning`, `inning_topbot` | Inning number and Top/Bot |
| `stand` | Batter handedness (L/R) |
| `p_throws` | Pitcher handedness (L/R) |
| `delta_run_exp` | Change in run expectancy from this pitch |
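
The game-state columns are often combined into derived features. A minimal sketch with made-up rows; the `count` and `runners_on` column names are illustrative:

```python
import pandas as pd

# Hypothetical game-state rows; on_1b/on_2b/on_3b hold a runner's player ID or NaN
state = pd.DataFrame({
    'balls': [0, 3, 1],
    'strikes': [0, 2, 2],
    'on_1b': [None, 660271, None],
    'on_2b': [None, None, None],
    'on_3b': [None, 592450, None],
})

# Count as a label such as "3-2" (balls first, then strikes)
state['count'] = state['balls'].astype(str) + '-' + state['strikes'].astype(str)

# Number of occupied bases: a base is occupied when its column is not null
state['runners_on'] = state[['on_1b', 'on_2b', 'on_3b']].notna().sum(axis=1)

print(state[['count', 'runners_on']])
```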

**Deprecated Columns** (still in the data but no longer updated)
| Column | Note |
|--------|------|
| `spin_rate_deprecated` | Use `release_spin_rate` instead |
| `break_angle_deprecated` | Use `pfx_x` / `pfx_z` instead |
| `break_length_deprecated` | Use `pfx_x` / `pfx_z` instead |
| `tfs_deprecated` | Formerly game time |
| `tfs_zulu_deprecated` | Formerly game time (UTC) |

> **Tip**: If a column name ends with `_deprecated`, don't use it. Where a replacement exists, it is listed in the table above.
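
One way to act on that tip is to drop the deprecated columns right after loading. A small sketch on a one-row dummy frame whose column names come from the listing earlier in this section:

```python
import pandas as pd

# Dummy row; only the column names matter for this example
df = pd.DataFrame(
    [[93.2, 2243, None, None, None]],
    columns=['release_speed', 'release_spin_rate', 'spin_rate_deprecated',
             'break_angle_deprecated', 'tfs_deprecated'],
)

# Collect every column whose name ends with the deprecated suffix, then drop them
deprecated = [col for col in df.columns if col.endswith('_deprecated')]
clean = df.drop(columns=deprecated)

print(f"Dropped {len(deprecated)} columns: {deprecated}")
print(list(clean.columns))
```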

### FanGraphs & Lahman Column Documentation

**FanGraphs** column names are generally more readable (e.g., `WAR`, `wRC+`, `FIP`), but the advanced metrics still need context to interpret. Here are a few common ones:

| Column | Full Name | What It Measures |
|--------|-----------|-----------------|
| `WAR` | Wins Above Replacement | Total player value in wins |
| `wRC+` | Weighted Runs Created Plus | Offense adjusted for park/league (100 = average) |
| `wOBA` | Weighted On-Base Average | Offensive value (weighs outcomes by run value) |
| `FIP` | Fielding Independent Pitching | ERA estimator using only K, BB, HR, HBP |
| `xFIP` | Expected FIP | FIP with league-average HR/FB rate |
| `BABIP` | Batting Avg on Balls In Play | Hit rate on batted balls (excl. HR) |
| `OPS+` | On-base Plus Slugging Plus | OPS adjusted for park/league (100 = average); mainly reported by Baseball Reference |
| `ISO` | Isolated Power | SLG - AVG (measures raw power) |
| `BB%` / `K%` | Walk / Strikeout Rate | Percentage of plate appearances |
| `Barrel%` | Barrel Rate | % of batted balls with ideal exit velo + launch angle |
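
Some of these metrics are simple enough to recompute as a sanity check. The sketch below derives ISO (SLG - AVG) and OPS (OBP + SLG) from three rows of the 2024 batting table shown earlier. Because the inputs are already rounded to three decimals, results can differ from published values by about .001.

```python
import pandas as pd

# AVG / OBP / SLG copied from the top-10 batting table in Section 2d
batters = pd.DataFrame({
    'Name': ['Aaron Judge', 'Bobby Witt Jr.', 'Shohei Ohtani'],
    'AVG': [0.322, 0.332, 0.310],
    'OBP': [0.458, 0.389, 0.390],
    'SLG': [0.701, 0.588, 0.646],
})

batters['ISO'] = (batters['SLG'] - batters['AVG']).round(3)  # isolated power
batters['OPS'] = (batters['OBP'] + batters['SLG']).round(3)  # on-base plus slugging

print(batters)
```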

**Lahman** columns use short abbreviations that map to standard baseball stat names. The most common ones (`H`, `HR`, `RBI`, `BB`, `SO`, `ERA`, `W`, `L`) are self-explanatory, but some are less obvious:

| Column | Meaning |
|--------|---------|
| `AB` | At Bats |
| `2B`, `3B` | Doubles, Triples |
| `SB`, `CS` | Stolen Bases, Caught Stealing |
| `HBP` | Hit By Pitch |
| `SF`, `SH` | Sacrifice Flies, Sacrifice Hits (bunts) |
| `GIDP` | Grounded Into Double Play |
| `IPouts` | Outs Recorded (innings pitched × 3) |
| `BFP` | Batters Faced by Pitcher |
| `yearID` | Season year |
| `stint` | Order of appearances with a team in a season (1st team = 1, traded = 2, etc.) |
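
Two of these columns tend to trip people up: `stint` means a traded player has several rows per season, and `IPouts` must be divided by 3 to get innings. A minimal sketch with made-up rows (the player IDs and numbers are hypothetical) that collapses stints into season totals before computing ERA:

```python
import pandas as pd

# Hypothetical Lahman-style pitching rows; 'examp01' changed teams mid-season
pitching = pd.DataFrame({
    'playerID': ['examp01', 'examp01', 'other01'],
    'yearID': [2023, 2023, 2023],
    'stint': [1, 2, 1],
    'IPouts': [300, 240, 576],
    'ER': [40, 20, 51],
})

# Sum counting stats across stints to get one row per player-season
season = pitching.groupby(['playerID', 'yearID'], as_index=False)[['IPouts', 'ER']].sum()

# Convert outs to innings, then compute ERA = 9 * ER / IP
season['IP'] = season['IPouts'] / 3
season['ERA'] = (9 * season['ER'] / season['IP']).round(2)

print(season)
```

Rate stats such as ERA should be recomputed from the summed totals, as here, not averaged across stints.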

### Official Glossaries & Documentation Links

Bookmark these — you'll come back to them constantly:

| Source | Documentation Link | What You'll Find |
|--------|-------------------|-----------------|
| **Baseball Savant (Statcast)** | [baseballsavant.mlb.com/csv-docs](https://baseballsavant.mlb.com/csv-docs) | Official definitions for every Statcast CSV column |
| **FanGraphs Glossary** | [library.fangraphs.com/getting-started](https://library.fangraphs.com/getting-started/) | Detailed explanations of every FanGraphs stat |
| **Baseball Reference** | [br-glossary](https://www.baseball-reference.com/about/bat_glossary.shtml) | Batting, pitching, and fielding stat definitions |
| **Lahman Database** | [Lahman README](https://github.com/chadwickbureau/baseballdatabank/blob/master/readme2024.txt) | Column descriptions for every Lahman table |
| **pybaseball Docs** | [github.com/jldbc/pybaseball](https://github.com/jldbc/pybaseball#readme) | Function reference and usage examples |
| **MLB Stats API** | [statsapi.mlb.com](https://statsapi.mlb.com/docs/) | Official API endpoint documentation |

> **Pro tip**: When you encounter an unfamiliar column, try this workflow:
> 1. Check `value_counts()` or `describe()` to understand what values it contains
> 2. Search the glossary links above for the column name
> 3. If still unclear, search "statcast [column_name] meaning" or "fangraphs [stat_name] explained" — the baseball analytics community has written about virtually every metric

## 3. MLB Stats API <a id="3-mlb-stats-api"></a>

The MLB Stats API is MLB's **official** data API, the one behind MLB.com. This notebook accesses it through [MLB-StatsAPI](https://github.com/toddrob99/MLB-StatsAPI), a community-maintained Python wrapper. While `pybaseball` is great for statistical data, the MLB Stats API excels at:

- **Roster information**: Current and historical team rosters
- **Player biographical data**: Height, weight, birth date, draft info
- **Game schedules**: Past and upcoming games
- **Live game data**: Real-time scores and play-by-play

The Python wrapper is installed as `MLB-StatsAPI` but imported as `statsapi`.

In [16]:
import statsapi

# Look up a player by name
results = statsapi.lookup_player('aaron judge')

# This returns a list of dictionaries
for player in results:
    print(f"Name: {player['fullName']}")
    print(f"  ID: {player['id']}")
    print(f"  Position: {player['primaryPosition']['abbreviation']}")
    print(f"  Current Team: {player.get('currentTeam', {}).get('name', 'N/A')}")
    print()

Name: Aaron Judge
  ID: 592450
  Position: RF
  Current Team: N/A



In [17]:
# Get detailed player stats
# statsapi.player_stats() returns a formatted string
judge_stats = statsapi.player_stats(592450, group='hitting', type='season')
print(judge_stats)

Aaron "Baj" Judge, RF (2016-)

Season Hitting
age: 33
gamesPlayed: 152
groundOuts: 84
airOuts: 125
runs: 137
doubles: 30
triples: 2
homeRuns: 53
strikeOuts: 160
baseOnBalls: 124
intentionalWalks: 36
hits: 179
hitByPitch: 7
avg: .331
atBats: 541
obp: .457
slg: .688
ops: 1.145
caughtStealing: 5
stolenBases: 12
stolenBasePercentage: .706
caughtStealingPercentage: .294
groundIntoDoublePlay: 16
numberOfPitches: 2631
plateAppearances: 679
totalBases: 372
rbi: 114
leftOnBase: 205
sacBunts: 0
sacFlies: 7
babip: .376
groundOutsToAirouts: 0.67
catchersInterference: 0
atBatsPerHomeRun: 10.21




### Team Rosters

One of the most useful features of the MLB Stats API is pulling current team rosters.

In [18]:
# Get the New York Yankees roster
# Team IDs: NYY=147, LAD=119, HOU=117, ATL=144, etc.
# You can also look up team IDs with statsapi.lookup_team()
yankees_roster = statsapi.roster(147)  # 147 = Yankees
print("New York Yankees Roster:")
print(yankees_roster)

New York Yankees Roster:
#99  RF  Aaron Judge
#14  3B  Amed Rosario
#57  P   Angel Chivilli
#11  SS  Anthony Volpe
#28  C   Austin Wells
#22  1B  Ben Rice
#47  P   Brent Headrick
#80  P   Cade Winquest
#31  P   Cam Schlittler
#75  P   Camilo Doval
#55  P   Carlos Rodón
#86  P   Chase Hampton
#35  LF  Cody Bellinger
#53  P   David Bednar
#76  P   Elmer Rodríguez
#63  P   Fernando Cruz
#45  P   Gerrit Cole
#27  DH  Giancarlo Stanton
#25  C   J.C. Escarra
#59  P   Jake Bird
#24  LF  Jasson Domínguez
#13  2B  Jazz Chisholm Jr.
#90  2B  Jorbit Vivas
#72  SS  José Caballero
#74  P   Kervin Castro
#81  P   Luis Gil
#54  P   Max Fried
#39  3B  Max Schuemann
#56  P   Osvaldo Bido
#95  3B  Oswaldo Cabrera
#58  P   Paul Blackburn
#48  1B  Paul Goldschmidt
#19  3B  Ryan McMahon
#40  P   Ryan Weathers
#33  P   Ryan Yarbrough
#78  CF  Spencer Jones
#41  P   Tim Hill
#12  CF  Trent Grisham
#98  P   Will Warren
#73  P   Yerry De los Santos



### Game Schedules

Let's pull the schedule for a specific date.

In [19]:
# Get games for a specific date
schedule = statsapi.schedule(start_date='07/01/2024', end_date='07/01/2024')

# The result is a list of dictionaries -- let's convert to DataFrame for easier viewing
schedule_df = pd.DataFrame(schedule)
print(f"Games on July 1, 2024: {len(schedule_df)}")
schedule_df[['game_date', 'away_name', 'home_name', 'away_score', 'home_score',
             'venue_name', 'status']].head(10)

Games on July 1, 2024: 3


Unnamed: 0,game_date,away_name,home_name,away_score,home_score,venue_name,status
0,2024-07-01,Houston Astros,Toronto Blue Jays,3,1,Rogers Centre,Final
1,2024-07-01,New York Mets,Washington Nationals,9,7,Nationals Park,Final
2,2024-07-01,Milwaukee Brewers,Colorado Rockies,7,8,Coors Field,Final


### Converting API Data to DataFrames

The `statsapi` helper functions return dictionaries and formatted strings, not DataFrames. For data science work, we usually need to convert the results. Here's how to build a structured roster DataFrame:

In [20]:
# Use the get() function to hit the API endpoint directly and get raw JSON
roster_json = statsapi.get('team_roster', {'teamId': 147, 'season': 2024})

# Parse the JSON into a clean DataFrame
roster_rows = []
for entry in roster_json.get('roster', []):
    person = entry.get('person', {})
    position = entry.get('position', {})
    roster_rows.append({
        'player_id': person.get('id'),
        'full_name': person.get('fullName'),
        'jersey_number': entry.get('jerseyNumber'),
        'position': position.get('abbreviation'),
        'status': entry.get('status', {}).get('description')
    })

roster_df = pd.DataFrame(roster_rows)
print(f"Structured roster ({len(roster_df)} players):")
roster_df.head(10)

Structured roster (54 players):


Unnamed: 0,player_id,full_name,jersey_number,position,status
0,677076,Clayton Andrews,74,P,Minor League Contract
1,690925,Clayton Beeter,85,P,Forty Man
2,542932,Jon Berti,19,3B,Forty Man
3,641360,Phil Bickford,53,P,Minor League Contract
4,595897,Nick Burdi,57,P,Minor League Contract
5,665828,Oswaldo Cabrera,95,3B,Active
6,665862,Jazz Chisholm Jr.,13,3B,Active
7,543037,Gerrit Cole,45,P,Active
8,641482,Nestor Cortes,65,P,Active
9,664776,Jake Cousins,61,P,Active
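
The manual loop above is explicit and easy to debug. As an alternative sketch, `pd.json_normalize` can flatten nested dictionaries in one call. The payload below is hand-written to mimic the fields parsed above (two players taken from the output table); a real API response contains more keys, so treat the shape as an assumption.

```python
import pandas as pd

# Hand-written sample shaped like the 'roster' entries parsed above.
sample_roster_json = {
    "roster": [
        {"person": {"id": 543037, "fullName": "Gerrit Cole"},
         "jerseyNumber": "45",
         "position": {"abbreviation": "P"},
         "status": {"description": "Active"}},
        {"person": {"id": 665828, "fullName": "Oswaldo Cabrera"},
         "jerseyNumber": "95",
         "position": {"abbreviation": "3B"},
         "status": {"description": "Active"}},
    ]
}

# json_normalize joins nested keys with '.', e.g. 'person.fullName'.
flat = pd.json_normalize(sample_roster_json["roster"])

# Rename to the same column names used in the loop-based version.
flat = flat.rename(columns={
    "person.id": "player_id",
    "person.fullName": "full_name",
    "jerseyNumber": "jersey_number",
    "position.abbreviation": "position",
    "status.description": "status",
})[["player_id", "full_name", "jersey_number", "position", "status"]]

print(flat)
```

The loop tolerates missing keys through `.get()`, while `json_normalize` fills absent fields with `NaN`; either behavior may be preferable depending on how strict you want the parsing to be.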


**Key takeaway**: The MLB Stats API is best for roster, schedule, and biographical data. It complements `pybaseball` nicely — use pybaseball for statistical analysis and the MLB Stats API for organizational/roster data.

## 4. Lahman Database (pylahman) <a id="4-lahman-database"></a>

The [Lahman Baseball Database](https://www.seanlahman.com/baseball-archive/statistics/) is one of the most important historical baseball datasets. It contains complete batting and pitching statistics from **1871 to the present**, along with fielding, team, managerial, and awards data.

The `pylahman` package provides instant access to this database as pandas DataFrames — no SQL required!

**When to use Lahman data:**
- Historical analysis spanning many decades
- Career-level statistics
- Comparing players across eras
- Hall of Fame analysis
- Quick access to clean, well-structured historical data

In [24]:
import pylahman as lahman

# The lahman package exposes each database table as a function
# Let's start with batting data
batting = lahman.Batting()

print(f"Batting table shape: {batting.shape}")
print(f"Year range: {batting['yearID'].min()} to {batting['yearID'].max()}")
batting.head()

Batting table shape: (115450, 22)
Year range: 1871 to 2024


Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,aardsda01,2004,1,SFN,NL,11,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,aardsda01,2006,1,CHN,NL,45,2,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,aardsda01,2007,1,CHA,AL,25,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,aardsda01,2008,1,BOS,AL,47,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,aardsda01,2009,1,SEA,AL,73,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# Pitching data
pitching = lahman.Pitching()

print(f"Pitching table shape: {pitching.shape}")
print(f"Year range: {pitching['yearID'].min()} to {pitching['yearID'].max()}")
pitching.head()

Pitching table shape: (52344, 30)
Year range: 1871 to 2024


Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,...,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP
0,aardsda01,2004,1,SFN,NL,1,0,11,0,0,...,0,0,2,0,61,5,8,0,1,1
1,aardsda01,2006,1,CHN,NL,3,0,45,0,0,...,0,1,1,0,225,9,25,1,3,2
2,aardsda01,2007,1,CHA,AL,2,1,25,0,0,...,3,2,1,0,151,7,24,2,1,1
3,aardsda01,2008,1,BOS,AL,4,2,47,0,0,...,2,3,5,0,228,7,32,3,2,4
4,aardsda01,2009,1,SEA,AL,3,6,73,0,0,...,3,2,0,0,296,53,23,2,1,2


In [26]:
# Player biographical info (the "People" table)
people = lahman.People()

print(f"People table shape: {people.shape}")
print(f"\nColumns: {list(people.columns)}")
people.head()

People table shape: (21271, 25)

Columns: ['ID', 'playerID', 'birthYear', 'birthMonth', 'birthDay', 'birthCity', 'birthCountry', 'birthState', 'deathYear', 'deathMonth', 'deathDay', 'deathCountry', 'deathState', 'deathCity', 'nameFirst', 'nameLast', 'nameGiven', 'weight', 'height', 'bats', 'throws', 'debut', 'bbrefID', 'finalGame', 'retroID']


Unnamed: 0,ID,playerID,birthYear,birthMonth,birthDay,birthCity,birthCountry,birthState,deathYear,deathMonth,...,nameLast,nameGiven,weight,height,bats,throws,debut,bbrefID,finalGame,retroID
0,1,aardsda01,1981,12,27,Denver,USA,CO,,,...,Aardsma,David Allan,215,75,R,R,2004-04-06,aardsda01,2015-08-23,aardd001
1,2,aaronha01,1934,2,5,Mobile,USA,AL,2021.0,1.0,...,Aaron,Henry Louis,180,72,R,R,1954-04-13,aaronha01,1976-10-03,aaroh101
2,3,aaronto01,1939,8,5,Mobile,USA,AL,1984.0,8.0,...,Aaron,Tommie Lee,190,75,R,R,1962-04-10,aaronto01,1971-09-26,aarot101
3,4,aasedo01,1954,9,8,Orange,USA,CA,,,...,Aase,Donald William,190,75,R,R,1977-07-26,aasedo01,1990-10-03,aased001
4,5,abadan01,1972,8,25,Palm Beach,USA,FL,,,...,Abad,Fausto Andres,184,73,L,L,2001-09-10,abadan01,2006-04-13,abada001


### Combining Lahman Tables

One of the great things about the Lahman database is that its player-level tables (batting, pitching, fielding, awards, and so on) share a common `playerID` field, making it easy to join them together. Let's find the career home run leaders with their biographical info.

In [27]:
# Calculate career home runs per player
career_hr = (
    batting
    .groupby('playerID')['HR']
    .sum()
    .reset_index()
    .sort_values('HR', ascending=False)
    .head(10)
)

# Merge with people table to get player names
career_hr_with_names = career_hr.merge(
    people[['playerID', 'nameFirst', 'nameLast', 'birthYear']],
    on='playerID'
)

career_hr_with_names['full_name'] = career_hr_with_names['nameFirst'] + ' ' + career_hr_with_names['nameLast']
career_hr_with_names[['full_name', 'HR', 'birthYear']].reset_index(drop=True)

Unnamed: 0,full_name,HR,birthYear
0,Barry Bonds,762,1964
1,Hank Aaron,755,1934
2,Babe Ruth,714,1895
3,Albert Pujols,703,1980
4,Alex Rodriguez,696,1975
5,Willie Mays,660,1931
6,Ken Griffey,630,1969
7,Jim Thome,612,1970
8,Sammy Sosa,609,1968
9,Frank Robinson,586,1935
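
Career totals are a simple sum per `playerID`, but single-season totals need one extra step: the `stint` column in the Batting table means a player who changed teams mid-season has several rows for the same year. The sketch below shows one way to collapse stints before joining names. The two small frames are invented for illustration (player IDs, team codes, and stat lines are made up); only the column names come from the Lahman tables shown above.

```python
import pandas as pd

# Invented rows that mimic the Batting and People column layout.
batting_sample = pd.DataFrame({
    "playerID": ["playera01", "playera01", "playerb01"],
    "yearID": [2020, 2020, 2020],
    "stint": [1, 2, 1],            # playera01 played for two teams in 2020
    "teamID": ["AAA", "BBB", "CCC"],
    "HR": [10, 7, 15],
})
people_sample = pd.DataFrame({
    "playerID": ["playera01", "playerb01"],
    "nameFirst": ["Alex", "Blake"],
    "nameLast": ["Alpha", "Bravo"],
})

# Collapse stints: one row per player per season.
season_hr = batting_sample.groupby(["playerID", "yearID"], as_index=False)["HR"].sum()

# validate='many_to_one' raises if People unexpectedly has duplicate playerIDs.
season_hr = season_hr.merge(
    people_sample, on="playerID", how="left", validate="many_to_one"
)
season_hr = season_hr.sort_values("HR", ascending=False).reset_index(drop=True)

print(season_hr[["nameFirst", "nameLast", "yearID", "HR"]])
```

Without the `groupby`, the traded player would appear twice with 10 and 7 home runs instead of once with 17.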


In [28]:
# Let's see what other tables are available in the Lahman database
lahman_tables = [
    ('Batting()', 'Season-level batting stats'),
    ('Pitching()', 'Season-level pitching stats'),
    ('Fielding()', 'Season-level fielding stats'),
    ('People()', 'Player biographical information'),
    ('Teams()', 'Team season records and info'),
    ('Salaries()', 'Player salary data'),
    ('AwardsPlayers()', 'Award winners (MVP, Cy Young, etc.)'),
    ('HallOfFame()', 'Hall of Fame voting data'),
    ('Appearances()', 'Games played by position'),
    ('AllstarFull()', 'All-Star game appearances'),
]

print("Key Lahman Database tables:")
print("-" * 55)
for func, desc in lahman_tables:
    print(f"  lahman.{func:25s} -- {desc}")

Key Lahman Database tables:
-------------------------------------------------------
  lahman.Batting()                 -- Season-level batting stats
  lahman.Pitching()                -- Season-level pitching stats
  lahman.Fielding()                -- Season-level fielding stats
  lahman.People()                  -- Player biographical information
  lahman.Teams()                   -- Team season records and info
  lahman.Salaries()                -- Player salary data
  lahman.AwardsPlayers()           -- Award winners (MVP, Cy Young, etc.)
  lahman.HallOfFame()              -- Hall of Fame voting data
  lahman.Appearances()             -- Games played by position
  lahman.AllstarFull()             -- All-Star game appearances


In [29]:
# Quick example: Who won the AL MVP in the last 10 years?
awards = lahman.AwardsPlayers()
recent_mvp = awards[
    (awards['awardID'] == 'Most Valuable Player') &
    (awards['lgID'] == 'AL') &
    (awards['yearID'] >= 2015)
].sort_values('yearID', ascending=False)

recent_mvp[['yearID', 'playerID', 'awardID', 'lgID']]

Unnamed: 0,yearID,playerID,awardID,lgID
12494,2024,judgeaa01,Most Valuable Player,AL
12292,2023,ohtansh01,Most Valuable Player,AL
11972,2022,judgeaa01,Most Valuable Player,AL
11730,2021,ohtansh01,Most Valuable Player,AL
11369,2020,abreujo02,Most Valuable Player,AL
11338,2019,troutmi01,Most Valuable Player,AL
10918,2018,bettsmo01,Most Valuable Player,AL
10679,2017,altuvjo01,Most Valuable Player,AL
10652,2016,troutmi01,Most Valuable Player,AL
10330,2015,donaljo02,Most Valuable Player,AL


**Key takeaway**: The Lahman database is the gold standard for historical baseball data. It's clean, well-structured, and covers over 150 years of baseball. Use it for historical analysis, career stats, and cross-era comparisons.

## 5. Baseball Reference Scraping via pybaseball <a id="5-baseball-reference-scraping"></a>

While Section 2 showed pybaseball pulling from FanGraphs (via `batting_stats()` and `pitching_stats()`), pybaseball also has dedicated functions that scrape directly from **Baseball Reference**. These are useful because:

- Baseball Reference calculates its own version of WAR (bWAR / rWAR)
- Some stats and formatting differ from FanGraphs
- Certain historical data may only be available on Baseball Reference

Functions with the `_bref` suffix pull from Baseball Reference specifically.

In [None]:
# Baseball Reference uses Cloudflare bot protection, which blocks the
# default python-requests User-Agent. Patching pybaseball's BRef session to
# send a browser-like User-Agent before any _bref call can help, but it is
# not guaranteed to work: the calls below may still fail with an IndexError.
from pybaseball.datasources.bref import BRefSession

bref_session = BRefSession()
bref_session.session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
})

In [33]:
from pybaseball import batting_stats_bref, pitching_stats_bref

# Get 2024 batting stats from Baseball Reference
batting_bref = batting_stats_bref(2024)

print(f"Baseball Reference batting data: {batting_bref.shape}")
batting_bref.head(10)

IndexError: list index out of range

In [31]:
# Get 2024 pitching stats from Baseball Reference
pitching_bref = pitching_stats_bref(2024)

print(f"Baseball Reference pitching data: {pitching_bref.shape}")
pitching_bref.head(10)

IndexError: list index out of range
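
Scraping-based functions can fail for reasons outside your control, as the `IndexError` outputs above show. One defensive pattern is to wrap each fetch so a failure returns `None` instead of stopping the notebook. The helper below, `fetch_or_none`, is a hypothetical name introduced here and is not part of pybaseball; the two stand-in scraper functions exist only so the example runs offline.

```python
import pandas as pd

def fetch_or_none(fetch_fn, *args, label="request", **kwargs):
    """Call a data-fetching function; return None instead of raising on failure."""
    try:
        return fetch_fn(*args, **kwargs)
    except Exception as exc:  # broad on purpose: scrapers fail in many ways
        print(f"{label} failed ({type(exc).__name__}: {exc}); continuing without it.")
        return None

# Stand-ins for real scraping functions, so this runs without a network call.
def _broken_scraper(season):
    empty_tables = []
    return empty_tables[0]        # raises IndexError, like the output above

def _working_scraper(season):
    return pd.DataFrame({"Name": ["Example Player"], "season": [season]})

failed = fetch_or_none(_broken_scraper, 2024, label="demo scrape")
ok = fetch_or_none(_working_scraper, 2024, label="demo scrape")

print(failed is None, ok.shape)
```

In this notebook the same wrapper could be used as `batting_bref = fetch_or_none(batting_stats_bref, 2024, label="BRef batting")`, with later cells checking `if batting_bref is not None` before using the result.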

### Baseball Reference WAR Data

pybaseball also provides direct access to Baseball Reference's WAR calculations (often called bWAR or rWAR to distinguish from FanGraphs' fWAR).

In [None]:
from pybaseball import bwar_bat, bwar_pitch

# Get Baseball Reference WAR for batters
bwar_batting = bwar_bat()

print(f"bWAR batting data shape: {bwar_batting.shape}")
print(f"Year range: {bwar_batting['year_ID'].min()} to {bwar_batting['year_ID'].max()}")

# Top 10 single-season bWAR performances in the 2020s
recent_bwar = bwar_batting[bwar_batting['year_ID'] >= 2020].sort_values('WAR', ascending=False)
recent_bwar.head(10)[['name_common', 'year_ID', 'team_ID', 'WAR', 'PA']]

In [None]:
# Pitching bWAR
bwar_pitching = bwar_pitch()

print(f"bWAR pitching data shape: {bwar_pitching.shape}")

recent_pitch_bwar = bwar_pitching[bwar_pitching['year_ID'] >= 2020].sort_values('WAR', ascending=False)
recent_pitch_bwar.head(10)[['name_common', 'year_ID', 'team_ID', 'WAR', 'IPouts']]

### FanGraphs vs. Baseball Reference: What's the Difference?

| Feature | FanGraphs (`batting_stats`) | Baseball Reference (`batting_stats_bref`) |
|---------|---------------------------|------------------------------------------|
| WAR Calculation | fWAR (uses FIP for pitchers) | bWAR/rWAR (uses RA9 for pitchers) |
| Defensive Metrics | UZR, Def | DRS, Rtot |
| Data Format | More advanced metrics | More traditional presentation |
| Update Speed | Often faster | Slightly slower |

Both are excellent sources — most analysts use both and compare.

**Key takeaway**: Use the `_bref` functions when you specifically need Baseball Reference data or bWAR. For general analysis, the FanGraphs-based functions (`batting_stats`, `pitching_stats`) tend to have more advanced metrics available.

## 6. Driveline OpenBiomechanics Data <a id="6-driveline-openbiomechanics"></a>

The [OpenBiomechanics Project](https://www.openbiomechanics.org/) by [Driveline Baseball](https://www.drivelinebaseball.com/) is the largest open-source collection of high-fidelity motion capture data on elite baseball players. It includes:

- **100 pitchers** and **98 hitters** from the Driveline database
- **Raw C3D files**: Motion capture marker position data (3D coordinates over time)
- **Processed metrics**: Joint angles, velocities, and key biomechanical events (e.g., foot contact, maximum external rotation, ball release)
- **Synchronized video** samples

This data is used for:
- Injury prevention research
- Pitching/hitting mechanics analysis
- Biomechanical modeling
- Computer vision projects

**Data source**: [github.com/drivelineresearch/openbiomechanics](https://github.com/drivelineresearch/openbiomechanics)

### Working with C3D Files

C3D is the standard file format for biomechanics motion capture data. These files contain 3D marker positions tracked over time — think of it as recording the (x, y, z) coordinates of dozens of reflective markers placed on a player's body at hundreds of frames per second.

To read C3D files in Python, we use the `ezc3d` library.

> **Note**: The code below demonstrates how to read C3D files. You would need to first download the data from the OpenBiomechanics GitHub repository. Since these files are large, we'll show the code pattern without executing the download.

In [None]:
# DEMONSTRATION CODE -- How to read OpenBiomechanics C3D files
# Uncomment and run after downloading data from:
# https://github.com/drivelineresearch/openbiomechanics

# import ezc3d
#
# # Read a C3D file
# c3d_data = ezc3d.c3d('path/to/openbiomechanics/baseball_pitching/data/c3d/player_001/pitch_01.c3d')
#
# # The C3D file contains several sections:
# # - 'header': Metadata about the recording
# # - 'parameters': Recording parameters (frame rate, units, etc.)
# # - 'data': The actual motion capture data
#
# # Get marker labels
# marker_labels = c3d_data['parameters']['POINT']['LABELS']['value']
# print(f"Number of markers: {len(marker_labels)}")
# print(f"Marker names: {marker_labels[:10]}...")  # Show first 10
#
# # Get 3D point data (dimensions x markers x frames)
# points = c3d_data['data']['points']
# print(f"\nData shape: {points.shape}")
# print(f"  - {points.shape[0]} rows (X, Y, Z, plus a homogeneous coordinate)")
# print(f"  - {points.shape[1]} markers")
# print(f"  - {points.shape[2]} frames")

print("Code pattern shown above -- download data from GitHub to run.")
print("Repository: https://github.com/drivelineresearch/openbiomechanics")
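
Once the points array is loaded, most analysis is plain NumPy. The sketch below computes per-marker speed from frame-to-frame displacement. It uses a synthetic array in place of real motion capture data and assumes the layout described in the comments above (first axis holds the coordinate rows, second axis the markers, third axis the frames). The frame rate is an assumed value; with a real file you would read it from the file's parameters.

```python
import numpy as np

# Synthetic stand-in for c3d_data['data']['points'].
frame_rate_hz = 100.0             # assumed for this example only
n_markers, n_frames = 3, 50
t = np.arange(n_frames) / frame_rate_hz

points = np.ones((4, n_markers, n_frames))
points[:3] = 0.0                  # all markers start at the origin
points[0, 0, :] = 2.0 * t         # marker 0 moves along X at 2 units/second

# Use only the X, Y, Z rows; the fourth row is left untouched.
xyz = points[:3]                                   # (3, markers, frames)
displacement = np.diff(xyz, axis=2)                # (3, markers, frames - 1)

# Distance moved per frame, scaled by frame rate to get units per second.
speed = np.linalg.norm(displacement, axis=0) * frame_rate_hz

print(speed.shape)
print(speed.mean(axis=1))
```

Real marker data is noisy, so a finite-difference speed like this is usually smoothed or low-pass filtered before interpretation.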

### Pre-Processed Biomechanics Data

If you don't want to work with raw C3D files, the OpenBiomechanics project also provides **pre-processed CSV files** with calculated metrics. These are much easier to work with and include things like:

- Joint angles at key events (foot contact, max external rotation, ball release)
- Angular velocities (arm speed, trunk rotation speed)
- Timing metrics (time from foot contact to ball release)
- Force plate data

These CSV files can be loaded directly with pandas — no special libraries needed.

In [None]:
# DEMONSTRATION: Loading pre-processed OpenBiomechanics data
# After cloning the repo, you can load the processed CSVs directly:
#
# poi_data = pd.read_csv('openbiomechanics/baseball_pitching/data/poi/poi_metrics.csv')
# print(f"Shape: {poi_data.shape}")
# poi_data.head()
#
# # Point-of-interest (POI) metrics include values at key biomechanical events:
# # - Foot Contact (FC)
# # - Maximum External Rotation (MER)
# # - Ball Release (BR)
# # - Maximum Internal Rotation (MIR)

print("To access this data:")
print("  1. git clone https://github.com/drivelineresearch/openbiomechanics.git")
print("  2. Navigate to the baseball_pitching/ or baseball_hitting/ directories")
print("  3. Load CSVs from the data/poi/ directory with pd.read_csv()")

**Key takeaway**: The OpenBiomechanics Project is a unique and valuable dataset for biomechanics research. The raw C3D files require specialized tools (`ezc3d`), but pre-processed CSV metrics are available for easier analysis. This data is best suited for advanced projects focused on player mechanics and injury prevention.

## 7. Computer Vision Baseball Datasets <a id="7-computer-vision-datasets"></a>

Computer vision (CV) is increasingly important in baseball analytics. In fact, **Statcast itself is now a computer vision system**: since 2020, MLB's Hawk-Eye tracking system has used an array of high-speed cameras to track the ball and player movements, which is how the pitch-level data we explored in Section 2 is generated.

Beyond Statcast, there are several open datasets for building your own baseball computer vision models.

### Statcast as Computer Vision Output

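It's worth understanding that the Statcast data we pulled in Section 2 is the **output** of a tracking system, and that the underlying technology has changed over time:

- **2015-2019**: TrackMan (Doppler radar system) for ball tracking, so pitch data from these seasons is radar output rather than CV output
- **2020-present**: Hawk-Eye (multi-camera optical tracking system)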

Hawk-Eye uses **12 cameras** installed in each MLB stadium, operating at high frame rates. The system tracks:
- Ball position and spin in 3D space (every pitch)
- Player positions and movements (every play)
- Bat tracking (swing path, contact point)

So when you use `statcast()` to pull data from 2020 onward, you're getting processed CV output!

### MLB-YouTube Dataset

The [MLB-YouTube dataset](https://github.com/piergiaj/mlb-youtube) is an academic dataset for **fine-grained activity recognition** in baseball videos. It contains:

- **20 full games** from the 2017 MLB post-season (from YouTube)
- **4,290 segmented video clips** annotated with activities (swing, hit, ball, strike, foul, etc.)
- **Pitch type annotations** (fastball, curveball, slider, etc.) with speed labels
- **42+ hours** of video footage

This dataset was introduced in a 2018 CVsports workshop paper and is primarily used for training activity classification models.

**Access**: [github.com/piergiaj/mlb-youtube](https://github.com/piergiaj/mlb-youtube)

### Roboflow Baseball Detection Datasets

[Roboflow Universe](https://universe.roboflow.com/) hosts several community-created baseball datasets for object detection tasks:

- **Baseball Detection**: Datasets for detecting baseballs in video frames, useful for pitch tracking and trajectory analysis
- **Player Detection**: Detecting and tracking player positions on the field
- **Pitch Type Classification**: Image-based pitch type identification

These datasets are pre-annotated with bounding boxes and come in formats compatible with popular object detection frameworks (YOLO, COCO, Pascal VOC).

**Access**: Search "baseball" on [Roboflow Universe](https://universe.roboflow.com/search?q=class:baseball)
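
The annotation formats mentioned above differ mainly in how they encode a bounding box. YOLO label files store one object per line as `class x_center y_center width height`, with all four box values normalized to the 0-1 range. The sketch below converts one such line to pixel-space corner coordinates; the function name, label line, and image size are made up for illustration.

```python
def yolo_to_pixel_box(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, x_c, y_c, w, h = line.split()
    # Scale normalized values back to pixel units.
    x_c, w = float(x_c) * image_width, float(w) * image_width
    y_c, h = float(y_c) * image_height, float(h) * image_height
    # YOLO stores the box center; corners are half a width/height away.
    return (int(class_id), x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)

# Hypothetical label: class 0 centered in a 1280x720 frame.
label_line = "0 0.5 0.5 0.125 0.25"
box = yolo_to_pixel_box(label_line, image_width=1280, image_height=720)
print(box)
```

Corner coordinates in pixels are what drawing functions and Pascal VOC-style annotations generally expect, so a conversion like this is a common first step when visualizing labels.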

> **Example project**: One notable project combines YOLOv8 with OpenCV to create baseball pitch overlay visualizations — detecting the ball frame-by-frame and drawing the pitch trajectory.

**Key takeaway**: Computer vision is deeply embedded in modern baseball analytics. Statcast data *is* CV output. For building your own CV models, the MLB-YouTube dataset and Roboflow collections provide good starting points. These are more specialized datasets that require familiarity with deep learning frameworks (PyTorch, TensorFlow) and object detection architectures (YOLO, etc.).

## 8. Exporting Data to CSV <a id="8-exporting-to-csv"></a>

Now that we've pulled data from several sources, let's save it to CSV files so we can use it in future notebooks without needing to re-download everything.

CSV (Comma-Separated Values) is the simplest and most universal format for tabular data. Every data tool can read CSVs.

In [None]:
import os

# Create a 'data' subdirectory to store our CSV files
output_dir = 'data'
os.makedirs(output_dir, exist_ok=True)
print(f"Output directory: {os.path.abspath(output_dir)}")

In [None]:
# Save Statcast data
statcast_data.to_csv(f'{output_dir}/statcast_week_sample_2024.csv', index=False)
print(f"Saved: statcast_week_sample_2024.csv ({len(statcast_data):,} rows)")

# Save player-specific data
ohtani_batting.to_csv(f'{output_dir}/ohtani_statcast_2024.csv', index=False)
print(f"Saved: ohtani_statcast_2024.csv ({len(ohtani_batting):,} rows)")

skubal_pitching.to_csv(f'{output_dir}/skubal_statcast_2024.csv', index=False)
print(f"Saved: skubal_statcast_2024.csv ({len(skubal_pitching):,} rows)")

In [None]:
# Save FanGraphs season-level stats
batting_2024.to_csv(f'{output_dir}/fangraphs_batting_2024.csv', index=False)
print(f"Saved: fangraphs_batting_2024.csv ({len(batting_2024)} rows)")

pitching_2024.to_csv(f'{output_dir}/fangraphs_pitching_2024.csv', index=False)
print(f"Saved: fangraphs_pitching_2024.csv ({len(pitching_2024)} rows)")

In [None]:
# Save MLB Stats API data
roster_df.to_csv(f'{output_dir}/yankees_roster_2024.csv', index=False)
print(f"Saved: yankees_roster_2024.csv ({len(roster_df)} rows)")

schedule_df.to_csv(f'{output_dir}/mlb_schedule_sample_2024.csv', index=False)
print(f"Saved: mlb_schedule_sample_2024.csv ({len(schedule_df)} rows)")

In [None]:
# Save Baseball Reference data (skipped if the _bref calls above failed)
for name, filename in [('batting_bref', 'bref_batting_2024.csv'),
                       ('pitching_bref', 'bref_pitching_2024.csv')]:
    df = globals().get(name)
    if df is None:
        print(f"Skipped: {filename} ({name} is not available)")
        continue
    df.to_csv(f'{output_dir}/{filename}', index=False)
    print(f"Saved: {filename} ({len(df)} rows)")

In [None]:
# Save Lahman data (subset -- recent years for manageable file sizes)
lahman_batting_recent = batting[batting['yearID'] >= 2000]
lahman_batting_recent.to_csv(f'{output_dir}/lahman_batting_2000_present.csv', index=False)
print(f"Saved: lahman_batting_2000_present.csv ({len(lahman_batting_recent):,} rows)")

people.to_csv(f'{output_dir}/lahman_people.csv', index=False)
print(f"Saved: lahman_people.csv ({len(people):,} rows)")

In [None]:
# List all saved files with sizes
print("\nAll saved CSV files:")
print("-" * 60)
for filename in sorted(os.listdir(output_dir)):
    if filename.endswith('.csv'):
        filepath = os.path.join(output_dir, filename)
        size_mb = os.path.getsize(filepath) / (1024 * 1024)
        print(f"  {filename:45s} {size_mb:>7.2f} MB")

total_size = sum(
    os.path.getsize(os.path.join(output_dir, f))
    for f in os.listdir(output_dir)
    if f.endswith('.csv')
) / (1024 * 1024)
print(f"\n  {'TOTAL':45s} {total_size:>7.2f} MB")
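
One caveat when reloading these files later: CSV stores everything as text, so column types such as dates are not preserved automatically. The sketch below demonstrates the round trip with an in-memory buffer instead of a file. The frame is a small invented sample; `game_date` and `release_speed` follow Statcast naming, while `player_id` is an illustrative column name.

```python
import io
import pandas as pd

original = pd.DataFrame({
    "game_date": pd.to_datetime(["2024-07-01", "2024-07-02"]),
    "player_id": [660271, 543037],
    "release_speed": [95.4, 97.1],
})

# Write to an in-memory buffer (stands in for a file on disk).
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Reading back without hints leaves the dates as plain strings.
buffer.seek(0)
naive = pd.read_csv(buffer)

# parse_dates restores the datetime type for the named columns.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["game_date"])

print(naive.dtypes["game_date"])
print(restored.dtypes["game_date"])
```

Numeric columns usually survive the round trip, but dates, categoricals, and ID columns with leading zeros are worth checking after every reload.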

## 9. Summary & Comparison of Data Sources <a id="9-summary--comparison"></a>

Here's a side-by-side comparison of all the data sources we covered:

| Data Source | Python Package | Data Type | Granularity | History | Best For |
|------------|---------------|-----------|-------------|---------|----------|
| **Statcast** | `pybaseball` | Pitch tracking, batted ball | Pitch-level | 2015-present | Pitch analysis, batted ball analytics |
| **FanGraphs** | `pybaseball` | Batting & pitching stats | Season-level | 1871-present | Advanced metrics (WAR, wRC+, FIP) |
| **Baseball Reference** | `pybaseball` | Batting & pitching stats, WAR | Season-level | 1871-present | bWAR, traditional stats |
| **MLB Stats API** | `MLB-StatsAPI` | Rosters, schedules, bios | Player/game-level | Varies | Roster info, schedules, live data |
| **Lahman Database** | `pylahman` | Historical stats | Season-level | 1871-present | Historical analysis, career stats |
| **OpenBiomechanics** | `ezc3d` / pandas | Motion capture, biomechanics | Frame-level (300Hz) | ~200 players | Mechanics analysis, injury research |
| **CV Datasets** | Various | Video, images | Frame-level | Varies | Object detection, activity recognition |

### Which Source Should You Use?

- **Starting a general baseball analytics project?** Start with `pybaseball` — it covers the most common needs.
- **Need roster or schedule information?** Use the MLB Stats API.
- **Doing historical analysis (pre-2015)?** Use the Lahman database.
- **Studying pitching/hitting mechanics?** Use OpenBiomechanics data.
- **Building a computer vision model?** Check Roboflow and MLB-YouTube.
- **Need everything?** Combine sources! Use player IDs to join data across Statcast, FanGraphs, Baseball Reference, and Lahman.

### A Note on Player ID Systems

One of the trickiest parts of combining baseball data sources is that **each source uses a different player ID system**:

| ID System | Used By | Column Name | Example (Shohei Ohtani) |
|-----------|---------|-------------|------------------------|
| MLBAM (MLB Advanced Media) | Statcast, MLB Stats API | `key_mlbam`, `pitcher`, `batter` | 660271 |
| FanGraphs | FanGraphs stats | `key_fangraphs`, `IDfg` | 19755 |
| Baseball Reference | Baseball Reference, bWAR | `key_bbref`, `bbref_id` | ohtansh01 |
| Lahman | Lahman Database | `playerID` | ohtansh01 |
| Retrosheet | Play-by-play data | `key_retro` | ohtas001 |

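The `playerid_lookup()` function from `pybaseball` (shown in Section 2b) returns the MLBAM, Retrosheet, Baseball Reference, and FanGraphs IDs in one call, which makes it your Rosetta Stone for mapping between systems. Lahman's `People` table also carries `bbrefID` and `retroID` columns, so Lahman data can be linked through either of those keys. Look up IDs this way whenever you need to join data across sources.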

```python
# This one call gives you every ID system
from pybaseball import playerid_lookup
player = playerid_lookup('ohtani', 'shohei')
# Returns: key_mlbam, key_retro, key_bbref, key_fangraphs, and more
```
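
With the IDs in hand, joining across sources becomes a two-step merge through a crosswalk table. The sketch below shows the pattern with hand-built frames so it runs offline: the crosswalk row uses the MLBAM and FanGraphs IDs for Shohei Ohtani from the table above, while the stat values are placeholders rather than real figures. `launch_speed` and `batter` follow Statcast naming, and `IDfg` follows the FanGraphs column listed above.

```python
import pandas as pd

# One crosswalk row, shaped like a subset of playerid_lookup() output.
crosswalk = pd.DataFrame({
    "key_mlbam": [660271],
    "key_fangraphs": [19755],
})

# Placeholder data standing in for Statcast and FanGraphs pulls.
statcast_like = pd.DataFrame({
    "batter": [660271, 660271],
    "launch_speed": [101.0, 95.0],
})
fangraphs_like = pd.DataFrame({"IDfg": [19755], "Season": [2024]})

# Aggregate pitch-level rows to one row per batter before joining.
avg_ev = statcast_like.groupby("batter", as_index=False)["launch_speed"].mean()

# Statcast -> crosswalk on MLBAM ID, then crosswalk -> FanGraphs on FanGraphs ID.
combined = (
    avg_ev
    .merge(crosswalk, left_on="batter", right_on="key_mlbam", how="left")
    .merge(fangraphs_like, left_on="key_fangraphs", right_on="IDfg", how="left")
)

print(combined[["batter", "IDfg", "Season", "launch_speed"]])
```

Aggregating before the merge keeps the result at one row per player; merging pitch-level rows directly would repeat the season-level stats on every pitch.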

### Quick Reference: Key Functions

```python
# === pybaseball ===
from pybaseball import statcast, batting_stats, pitching_stats
from pybaseball import playerid_lookup, statcast_batter, statcast_pitcher
from pybaseball import batting_stats_bref, pitching_stats_bref
from pybaseball import bwar_bat, bwar_pitch

# === MLB Stats API ===
import statsapi
statsapi.lookup_player('name')
statsapi.player_stats(player_id, group='hitting', type='season')
statsapi.roster(team_id)
statsapi.schedule(start_date='MM/DD/YYYY', end_date='MM/DD/YYYY')

# === Lahman Database ===
import pylahman as lahman
lahman.Batting()
lahman.Pitching()
lahman.People()
lahman.AwardsPlayers()

# === Biomechanics ===
import ezc3d
c3d_data = ezc3d.c3d('file.c3d')
```

## 10. Next Steps: SQL Database Storage <a id="10-next-steps"></a>

We saved all our data as CSV files, which is great for simple projects. But as your projects grow, you'll want to store data in a **SQL database** for:

- **Faster queries** on large datasets
- **Relationships** between tables (join player stats with bio info efficiently)
- **Data integrity** (prevent duplicate entries, enforce data types)
- **Scalability** (handle millions of rows without memory issues)

Here's a quick preview of what that looks like:

In [None]:
# PREVIEW: Saving to a SQLite database (covered in detail in the next notebook)
import sqlite3

# Create a connection to a SQLite database file
conn = sqlite3.connect(f'{output_dir}/baseball_data.db')

# Save a DataFrame to a SQL table -- it's this easy!
batting_2024.to_sql('fangraphs_batting', conn, if_exists='replace', index=False)

# Verify it worked with a quick query
result = pd.read_sql("SELECT Name, Team, HR, WAR FROM fangraphs_batting ORDER BY WAR DESC LIMIT 5", conn)
print("Top 5 batters by WAR (from SQL query):")
print(result.to_string(index=False))

conn.close()
print("\nDatabase saved! Full SQL workflow covered in the next notebook.")
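
To illustrate the "relationships between tables" point, the sketch below loads two small tables into an in-memory SQLite database and joins them in SQL. The rows are invented, the table names are arbitrary, and the index name `idx_batting_player` is just an example; only the `playerID`, `yearID`, `HR`, and `nameLast` column names mirror the Lahman tables used earlier.

```python
import sqlite3
import pandas as pd

# Invented rows with Lahman-style column names.
people_tbl = pd.DataFrame({
    "playerID": ["playera01", "playerb01"],
    "nameLast": ["Alpha", "Bravo"],
})
batting_tbl = pd.DataFrame({
    "playerID": ["playera01", "playera01", "playerb01"],
    "yearID": [2023, 2024, 2024],
    "HR": [20, 25, 30],
})

# ':memory:' keeps the database in RAM, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
people_tbl.to_sql("people", conn, index=False)
batting_tbl.to_sql("batting", conn, index=False)

# An index on the join key speeds up lookups as the table grows.
conn.execute("CREATE INDEX idx_batting_player ON batting (playerID)")

# The '?' placeholder passes the year as a parameter instead of string formatting.
query = """
SELECT p.nameLast, SUM(b.HR) AS total_hr
FROM batting AS b
JOIN people AS p ON p.playerID = b.playerID
WHERE b.yearID >= ?
GROUP BY p.playerID, p.nameLast
ORDER BY total_hr DESC
"""
totals = pd.read_sql(query, conn, params=(2023,))
conn.close()

print(totals)
```

The join and aggregation happen inside the database, so only the small result set is loaded into pandas, which is where the scalability benefit comes from.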

---

**Congratulations!** You now know how to pull baseball data from six different sources using Python. In the next notebook, we'll cover:

- Best practices for combining data from multiple sources
- Handling player ID mapping across different systems
- Data cleaning and preprocessing
- Building and maintaining a SQL database for your baseball data projects

Happy analyzing!