# ⚾ Pitch-by-Pitch Baseball Analysis

This project performs an in-depth, season-by-season analysis of MLB pitch-by-pitch data using Statcast. It extracts pitch sequences, evaluates pitch velocities, and compares individual pitchers against league averages — all designed for advanced portfolio-level insights.

## 🧱 Project Goals

- **Pitch Sequences:**  
  Extract the order and types of pitches thrown to each batter across games and seasons.

- **Velocity Analysis by Pitch Type:**  
  Calculate each pitcher's average velocity for every pitch type and compare it to league averages.

- **Pitch Grading System:**  
  Assign grades to pitchers based on how their pitch velocities compare to league norms (e.g., above average = A, below average = F).

- **Team-Level Performance:**  
  Summarize performance by starting rotation and bullpen to identify which teams have the best overall pitching staffs.

- **Portfolio-Ready Output:**  
  Create clear, compelling data visualizations and summaries suitable for technical portfolios.

## 📁 Directory Structure
baseball-pitch-analysis/
├── data/
│ ├── raw/ # Year-by-year Statcast data (CSV)
│ ├── processed/ # Cleaned and transformed pitch data
│ └── league_stats/ # League average pitch metrics
│
├── notebooks/
│ ├── 01_data_collection.ipynb
│ ├── 02_pitch_sequences.ipynb
│ ├── 03_velocity_analysis.ipynb
│ ├── 04_team_rotation_analysis.ipynb
│ └── 05_visualization.ipynb
│
├── scripts/
│ ├── fetch_data.py
│ ├── process_sequences.py
│ ├── analyze_velocity.py
│ └── summarize_teams.py
│
├── reports/
│ ├── figures/
│ └── pitcher_profiles/
│
├── utils/
│ └── grading.py
│
├── requirements.txt
├── README.md
└── .gitignore


## 🔍 Data Source

Data is retrieved from MLB's Statcast system using the `pybaseball` Python library.

- 📊 [Statcast Search Tool](https://baseballsavant.mlb.com/statcast_search)  
- 📦 [pybaseball GitHub](https://github.com/jldbc/pybaseball)

## 🧠 Future Work

- Incorporate pitch movement and spin rate  
- Build pitcher similarity scores based on pitch repertoire and velocity  
- Factor in park effects and weather conditions  
- Expand to batter-pitcher matchup analysis and outcome prediction

---




In [2]:
from pybaseball import statcast
import pandas as pd
from datetime import datetime

# 📅 Define yearly ranges
seasons = {
    "2022": ("2022-04-07", "2022-10-05"),
    "2023": ("2023-03-30", "2023-10-01"),
    # "2024": ("2024-03-28", "2024-09-30"),
    "2025": ("2025-03-20", datetime.today().strftime("%Y-%m-%d"))  # up to today
}

for year, (start, end) in seasons.items():
    print(f"📥 Fetching Statcast data for {year} ({start} to {end})...")
    df = statcast(start_dt=start, end_dt=end)
    out_path = f"../data/raw/statcast_{year}.csv"
    df.to_csv(out_path, index=False)
    print(f"✅ Saved: {out_path}")


📥 Fetching Statcast data for 2022 (2022-04-07 to 2022-10-05)...
This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[col

✅ Saved: ../data/raw/statcast_2022.csv
📥 Fetching Statcast data for 2023 (2023-03-30 to 2023-10-01)...
This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[col

✅ Saved: ../data/raw/statcast_2023.csv
📥 Fetching Statcast data for 2025 (2025-03-20 to 2025-05-30)...
This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)
  data_copy[column] = data_copy[col

✅ Saved: ../data/raw/statcast_2025.csv
