# Lab 03: Pitch Classification

---
author: Yiran Hu
date: February 18, 2024
embed-resources: true
---

## Introduction

## Methods

In [60]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from joblib import dump

### Data

In [61]:
pitches_train = pd.read_csv("https://cs307.org/lab-03/data/pitches-train.csv")

The dataset for this lab is sourced from Statcast, accessed via the pybaseball package. It contains pitch-level data for the pitcher Shohei Ohtani from the 2022 (training data) and 2023 (test data) MLB seasons: pitch type, velocity, spin rate, horizontal and vertical movement, and the batter's stance. The training and test data are split temporally by season, and the goal is to develop a model that predicts pitch type from these features. The cell below shows what the data looks like.

In [62]:
pitches_train

| | pitch_name | release_speed | release_spin_rate | pfx_x | pfx_z | stand |
|---|---|---|---|---|---|---|
| 0 | Sweeper | 84.7 | 2667.0 | 1.25 | 0.01 | R |
| 1 | Sweeper | 83.9 | 2634.0 | 1.41 | 0.20 | R |
| 2 | Sweeper | 84.4 | 2526.0 | 1.26 | 0.25 | R |
| 3 | Curveball | 74.3 | 2389.0 | 0.93 | -1.10 | L |
| 4 | Sweeper | 85.6 | 2474.0 | 1.08 | 0.52 | R |
| ... | ... | ... | ... | ... | ... | ... |
| 2623 | Split-Finger | 91.8 | 1314.0 | -0.30 | 0.08 | R |
| 2624 | Sweeper | 86.9 | 2440.0 | 1.11 | 0.51 | R |
| 2625 | 4-Seam Fastball | 99.2 | 2320.0 | 0.04 | 0.81 | R |
| 2626 | 4-Seam Fastball | 97.9 | 2164.0 | 0.08 | 1.06 | R |


#### Summary Statistics

In [63]:
# Pitch Mix (Pitch Type Statistics)
pitch_group = pitches_train.groupby('pitch_name').agg('count').reset_index()
pitch_group['count'] = pitch_group['release_speed']
# normalize by the total count instead of a hard-coded row count
pitch_group['Proportion'] = pitch_group['count'] / pitch_group['count'].sum()
pitch_group = pitch_group[['pitch_name', 'count', 'Proportion']]
pitch_group

| | pitch_name | count | Proportion |
|---|---|---|---|
| 0 | 4-Seam Fastball | 718 | 0.273212 |
| 1 | Curveball | 222 | 0.084475 |
| 2 | Cutter | 233 | 0.088661 |
| 3 | Sinker | 97 | 0.03691 |
| 4 | Slider | 63 | 0.023973 |
| 5 | Split-Finger | 312 | 0.118721 |
| 6 | Sweeper | 983 | 0.374049 |


In [64]:
# Velocity by Pitch Type
velocity = pitches_train.groupby(
    'pitch_name')['release_speed'].agg('mean').reset_index()
velocity['speed_std'] = pitches_train.groupby(
    'pitch_name')['release_speed'].agg('std').reset_index()['release_speed']
velocity

| | pitch_name | release_speed | speed_std |
|---|---|---|---|
| 0 | 4-Seam Fastball | 97.270613 | 1.69927 |
| 1 | Curveball | 77.67973 | 3.215206 |
| 2 | Cutter | 90.74206 | 2.364489 |
| 3 | Sinker | 97.160825 | 1.829592 |
| 4 | Slider | 85.203175 | 2.401543 |
| 5 | Split-Finger | 89.291346 | 1.759299 |
| 6 | Sweeper | 85.336419 | 1.862552 |


In [65]:
# Spin by Pitch Type
spin = pitches_train.groupby(
    'pitch_name')['release_spin_rate'].agg('mean').reset_index()
spin['spin_std'] = pitches_train.groupby(
    'pitch_name')['release_spin_rate'].agg('std').reset_index()['release_spin_rate']
spin

| | pitch_name | release_spin_rate | spin_std |
|---|---|---|---|
| 0 | 4-Seam Fastball | 2217.331933 | 114.754683 |
| 1 | Curveball | 2482.666667 | 119.854726 |
| 2 | Cutter | 2378.424893 | 206.685887 |
| 3 | Sinker | 1972.747368 | 143.920632 |
| 4 | Slider | 2497.619048 | 78.679306 |
| 5 | Split-Finger | 1273.560897 | 221.291146 |
| 6 | Sweeper | 2492.17294 | 103.176892 |
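
The velocity and spin summaries above could also be computed in a single `groupby` using pandas named aggregation. The following is only a minimal sketch: `toy` is a small made-up frame standing in for `pitches_train` (its values are invented, not taken from the Statcast data), and the output column names are my own choice.

```python
import pandas as pd

# Toy stand-in for pitches_train; values are invented for illustration only.
toy = pd.DataFrame({
    "pitch_name": ["Sweeper", "Sweeper", "Curveball", "Curveball"],
    "release_speed": [84.0, 86.0, 74.0, 78.0],
    "release_spin_rate": [2600.0, 2400.0, 2300.0, 2500.0],
})

# One groupby produces the mean and sample standard deviation of both features.
summary = toy.groupby("pitch_name").agg(
    speed_mean=("release_speed", "mean"),
    speed_std=("release_speed", "std"),
    spin_mean=("release_spin_rate", "mean"),
    spin_std=("release_spin_rate", "std"),
).reset_index()
print(summary)
```

This avoids aligning two separate `groupby` results by position, which only works when both results come back in the same order.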


In [66]:
# visualizations

### Models

In [67]:
# find data types of X_train (5 features in total)
X_train = pitches_train.drop("pitch_name", axis=1)
y_train = pitches_train["pitch_name"]
X_train.dtypes

release_speed        float64
release_spin_rate    float64
pfx_x                float64
pfx_z                float64
stand                 object
dtype: object

In [68]:
# split the data into numerical and categorical features
numeric_features = X_train.select_dtypes(include=["float64"]).columns
categorical_features = X_train.select_dtypes(include=["object"]).columns

In [69]:
# numeric features: impute with the median, then standardize
# categorical features: impute with the most frequent value, then one-hot encode
numeric_transformer = Pipeline(
    steps=[
        ("Median Imputer", SimpleImputer(strategy="median")),
        ("Standardization", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ("Modal Imputer", SimpleImputer(strategy="most_frequent")),
        ("One-Hot Encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

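In this lab, I processed the training data with 5 features: four numeric (release speed, spin rate, and horizontal and vertical movement) and one categorical (batter stance). For the numeric features, I used a SimpleImputer to fill in missing values with the median and a StandardScaler to standardize them. For the categorical feature, I used a SimpleImputer to fill in missing values with the most frequent category and a OneHotEncoder to encode it.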
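
To make the effect of these steps concrete, here is a minimal sketch on a tiny made-up frame with one numeric and one categorical column. The values in `toy`, the short step names, and `sparse_threshold=0` (used here only to force a dense array that is easy to print) are assumptions for illustration and are not part of the lab pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing value per column; values are invented.
toy = pd.DataFrame({
    "release_speed": [84.0, np.nan, 98.0, 90.0],
    "stand": pd.Series(["R", "L", np.nan, "R"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["release_speed"]),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["stand"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # assumption: always return a dense array for display
)

# Column 0 is the standardized speed; columns 1 and 2 are the L and R indicators.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

The missing speed is replaced by the median of the observed speeds before scaling, and the missing stance is replaced by the most frequent category ("R") before encoding.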

In [70]:
# create general preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("Numeric Transformer", numeric_transformer, numeric_features),
        ("Categorical Transformer", categorical_transformer, categorical_features),
    ],
    remainder="drop",
)

In [71]:
# Create the pipeline
model_pipeline = Pipeline(steps=[
    ("Preprocessor", preprocessor),
    ("Classifier", KNeighborsClassifier())
])

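I chose the K-Nearest Neighbors (KNN) model for prediction because it is non-parametric and makes no assumption about the shape of the boundaries between pitch types. Because KNN classifies a pitch by its distance to the training pitches, the numeric features are standardized in the preprocessing step so that no single feature dominates the distance. To optimize the model, I tuned the hyperparameters with a grid search over the number of neighbors, the weighting strategy, and the distance metric.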

In [72]:
# define the parameter grid for grid search
param_grid = {
    "Classifier__n_neighbors": range(1, 20),  # set the range of k from 1 to 19
    # define how the neighbors contribute, equally or have different weights based on distance
    "Classifier__weights": ["uniform", "distance"],
    # distance metrics to be used for calculating the proximity between data points
    "Classifier__metric": ["euclidean", "manhattan", "chebyshev"]
}
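
As a rough check on the cost of this search, the number of candidate settings can be counted with `ParameterGrid`. This sketch repeats the grid above (with the range written out as a list) and uses the 5-fold cross-validation from this lab; the variable names are my own.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "Classifier__n_neighbors": list(range(1, 20)),
    "Classifier__weights": ["uniform", "distance"],
    "Classifier__metric": ["euclidean", "manhattan", "chebyshev"],
}

# 19 values of k * 2 weightings * 3 metrics
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
# each candidate is fit once per fold (the final refit on all data is extra)
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # prints: 114 570
```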

In [73]:
# setup grid search with cross-validation
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring="accuracy")

In [74]:
# fit models
grid_search.fit(X_train, y_train)

In [75]:
print("Best parameters found:", grid_search.best_params_)
print("accuracy:", grid_search.best_score_)

Best parameters found: {'Classifier__metric': 'manhattan', 'Classifier__n_neighbors': 19, 'Classifier__weights': 'distance'}
accuracy: 0.974500814774579
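
Beyond the single best setting, the full cross-validation results can be inspected through `cv_results_`. The sketch below is a minimal illustration on synthetic data from `make_classification` with a much smaller grid; `X_toy`, `y_toy`, `toy_pipeline`, and `toy_search` are hypothetical stand-ins and not the lab's objects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; this is not the pitch data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

toy_pipeline = Pipeline(steps=[
    ("Scaler", StandardScaler()),
    ("Classifier", KNeighborsClassifier()),
])
toy_grid = {
    "Classifier__n_neighbors": [1, 5, 9],
    "Classifier__weights": ["uniform", "distance"],
}
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=5, scoring="accuracy")
toy_search.fit(X_toy, y_toy)

# cv_results_ has one row per candidate; rank 1 has the best mean CV accuracy.
results = pd.DataFrame(toy_search.cv_results_)
cols = ["param_Classifier__n_neighbors", "param_Classifier__weights",
        "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[cols].head()
print(top)
```

In this lab the selected value of 19 neighbors is the largest value in the grid, so comparing the top-ranked candidates in this way could indicate whether extending the range of k is worth trying.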


In [77]:
# save the fitted grid search object (it includes the refit best pipeline)
dump(grid_search, "pitch-classifier.joblib")

['pitch-classifier.joblib']
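
A saved model can later be restored with `joblib.load` and used for prediction. The following is a minimal round-trip sketch on synthetic data; the toy model, the temporary directory, and the file name `toy-classifier.joblib` are hypothetical and stand in for the lab's saved grid search.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data and model; not the lab's classifier.
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)
toy_model = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-classifier.joblib")  # hypothetical file name
    dump(toy_model, path)
    restored = load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same_predictions)
```

Files written by joblib should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.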

## Results

In [76]:
# report model metrics
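
One way this step could look is sketched below: test accuracy and a confusion matrix computed on held-out data. Everything here is an assumption for illustration. The data is synthetic, and the random split only stands in for the lab's season-based split, since the test set is not loaded in this report.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with three classes; this is not the pitch data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

toy_model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_pred = toy_model.predict(X_te)

# Overall accuracy plus a per-class breakdown (rows: true class, columns: predicted).
test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)
print(f"test accuracy: {test_accuracy:.3f}")
print(cm)
```

With imbalanced classes such as the pitch mix in this lab, the confusion matrix helps show whether rare pitch types are misclassified even when overall accuracy is high.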

## Discussion

### Conclusion