# **Hit Classification at Elevation**
Author: Declan Costello

Date: 7/25/2023

## **Overview**

For the 2022 MLB season, I plan to recreate [this project](https://github.com/tjburch/mlb-hit-classifier/tree/master), adding an elevation feature and testing more models. I hope to provide value to the baseball community by combining my interests in hitting and the effects of altitude.

## **Variables**
Descriptions are taken from pybaseball and Statcast. The following is a list of the important variables to understand for this notebook.

*   **Launch Angle** - In baseball, launch angle is the vertical angle at which the ball leaves the player's bat after being struck, measured relative to the ground. A high launch angle sends the ball higher into the air (a fly ball or pop-up), and a low launch angle keeps it closer to the ground (a line drive or ground ball).

*   **Altitude** - Altitude is a distance measurement, usually in the vertical or "up" direction, between a reference datum (typically sea level) and a point or object.
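
As a rough illustration of the launch-angle definition above, the angle can be computed from the vertical and horizontal components of the ball's velocity off the bat. The function name and the sample velocities are hypothetical; Statcast already reports `launch_angle` directly, so this is only a sketch of the geometry.

```python
import math

def launch_angle_degrees(vertical_speed, horizontal_speed):
    """Angle above the ground, in degrees, of a ball leaving the bat."""
    return math.degrees(math.atan2(vertical_speed, horizontal_speed))

# Equal vertical and horizontal components correspond to a 45-degree launch angle.
print(launch_angle_degrees(50.0, 50.0))
# A ball driven down into the ground has a negative launch angle.
print(launch_angle_degrees(-10.0, 90.0))
```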

# **Installation**

The following imports the necessary packages.

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast
import numpy as np

# **Pulling Data**

We only want data from the 2022 season.

In [6]:
data = statcast(start_dt='2022-03-29', end_dt='2022-10-28')

This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
100%|██████████| 214/214 [11:00<00:00,  3.09s/it]


Saving the data locally so that it does not need to be pulled again.

In [7]:
data.to_csv('pybaseball_2022.csv')
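
In a later session, the saved file can be read back instead of repeating the query. The sketch below assumes the file name used above; the helper name `load_saved_statcast` is hypothetical.

```python
import pandas as pd

def load_saved_statcast(path="pybaseball_2022.csv"):
    """Reload the saved pull, restoring the index and the date column."""
    # index_col=0 reuses the saved index instead of creating an "Unnamed: 0" column.
    # parse_dates turns game_date back into datetimes rather than plain strings.
    return pd.read_csv(path, index_col=0, parse_dates=["game_date"])
```

Calling `data = load_saved_statcast()` would then stand in for the `statcast(...)` call.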

# **Inspecting Data**

Checking the quality of the data.

In [9]:
data.head()

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
131,SL,2022-10-28,82.9,-2.7,5.66,"Robertson, David",649557,502085,field_out,hit_into_play,...,6,6,5,5,6,Standard,Standard,84,-0.235,-0.568
136,SL,2022-10-28,83.4,-2.72,5.73,"Robertson, David",649557,502085,,swinging_strike,...,6,6,5,5,6,Standard,Standard,90,0.0,-0.102
140,SL,2022-10-28,80.8,-2.57,5.82,"Robertson, David",649557,502085,,ball,...,6,6,5,5,6,Standard,Standard,65,0.0,0.037
145,KC,2022-10-28,82.2,-2.5,5.91,"Robertson, David",649557,502085,,ball,...,6,6,5,5,6,Standard,Standard,39,0.0,0.049
151,KC,2022-10-28,84.0,-2.58,5.84,"Robertson, David",649557,502085,,ball,...,6,6,5,5,6,Standard,Standard,33,0.066,0.054


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 741171 entries, 131 to 2217
Data columns (total 92 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   pitch_type                       731602 non-null  object        
 1   game_date                        741171 non-null  datetime64[ns]
 2   release_speed                    731547 non-null  Float64       
 3   release_pos_x                    731562 non-null  Float64       
 4   release_pos_z                    731562 non-null  Float64       
 5   player_name                      741171 non-null  object        
 6   batter                           741171 non-null  Int64         
 7   pitcher                          741171 non-null  Int64         
 8   events                           192992 non-null  object        
 9   description                      741171 non-null  object        
 10  spin_dir                         0 non-null 

In [11]:
data.isnull().sum().sort_values(ascending=False)

sv_id                  741171
umpire                 741171
tfs_zulu_deprecated    741171
tfs_deprecated         741171
spin_dir               741171
                        ...  
fielder_6                   0
fielder_7                   0
fielder_8                   0
fielder_9                   0
p_throws                    0
Length: 92, dtype: int64
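
The null counts show several columns, such as `sv_id`, `umpire`, and `spin_dir`, that are empty on every row. A minimal sketch of dropping such columns follows; the helper name and the tiny stand-in frame are hypothetical, and in the notebook the function would be applied to `data`.

```python
import pandas as pd

def drop_empty_columns(df):
    """Return a copy of df without the columns that are null in every row."""
    return df.dropna(axis="columns", how="all")

# Tiny stand-in frame with one fully empty column.
sample = pd.DataFrame({
    "events": ["double", None, "field_out"],
    "spin_dir": [None, None, None],
})
print(drop_empty_columns(sample).columns.tolist())
```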

In [12]:
data.describe()

Unnamed: 0,release_speed,release_pos_x,release_pos_z,batter,pitcher,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,...,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,spin_axis,delta_home_win_exp,delta_run_exp
count,731547.0,731562.0,731562.0,741171.0,741171.0,0.0,0.0,0.0,0.0,731562.0,...,741171.0,741171.0,741171.0,741171.0,741171.0,741171.0,741171.0,729590.0,741171.0,718843.0
mean,88.909665,-0.811685,5.804788,615049.081923,613798.197749,,,,,9.100028,...,2.213308,2.122122,2.164348,2.228182,2.088263,2.152097,2.164348,175.301669,0.000128,2.3e-05
std,6.150794,1.835406,0.564332,59629.052082,61250.075346,,,,,4.222814,...,2.587177,2.484834,2.57624,2.593737,2.478732,2.498857,2.57624,72.285741,0.028306,0.237105
min,32.3,-4.95,0.86,405395.0,405395.0,,,,,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.736,-1.473
25%,84.6,-2.09,5.53,592206.0,592791.0,,,,,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,131.0,0.0,-0.066
50%,89.8,-1.51,5.85,640461.0,624133.0,,,,,11.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,199.0,0.0,-0.017
75%,93.8,0.91,6.15,664034.0,663559.0,,,,,13.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,222.0,0.0,0.033
max,104.2,4.75,7.74,703715.0,801389.0,,,,,14.0,...,29.0,29.0,29.0,29.0,21.0,29.0,29.0,360.0,0.91,3.605


In [13]:
data.nunique().sort_values(ascending=False)

vz0                       731562
az                        731562
ay                        731562
ax                        731562
vy0                       731562
                           ...  
spin_dir                       0
tfs_deprecated                 0
tfs_zulu_deprecated            0
umpire                         0
break_angle_deprecated         0
Length: 92, dtype: int64

In [14]:
data.shape

(741171, 92)

In [15]:
data.columns

Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estima

In [19]:
data.groupby(['events'])['woba_value'].describe()

events,count,mean,std,min,25%,50%,75%,max
catcher_interf,75.0,0.7,0.0,0.7,0.7,0.7,0.7,0.7
caught_stealing_2b,219.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caught_stealing_3b,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caught_stealing_home,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
double,8473.0,1.25,0.0,1.25,1.25,1.25,1.25,1.25
double_play,420.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
field_error,1240.0,0.9,0.0,0.9,0.9,0.9,0.9,0.9
field_out,77765.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
fielders_choice,403.0,0.891067,0.089329,0.0,0.9,0.9,0.9,0.9
fielders_choice_out,299.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## **Cleaning Data**

We only want hit events from the data pulled.

In [30]:
# isin() takes one list-like argument; keep rows whose event is not in the exclusion list
hitting_data = data[~data['events'].isin(['wild_pitch', 'sac_bunt_double_play'])]
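
One way to narrow the data to hit events is to map each Statcast `events` value to an outcome label and discard anything that does not map. The sketch below assumes the target classes are out, single, double, triple, and home run; the mapping and the function name are illustrative and not taken from the original project.

```python
# Statcast `events` values treated as batted-ball outcomes in this sketch (an assumption).
HIT_OUTCOMES = {
    "single": "single",
    "double": "double",
    "triple": "triple",
    "home_run": "home_run",
    "field_out": "out",
}

def classify_event(event):
    """Map a Statcast event name to an outcome label, or None if it is excluded."""
    return HIT_OUTCOMES.get(event)

print(classify_event("double"))
print(classify_event("wild_pitch"))
```

With pandas, the same mapping could be applied as `data['events'].map(HIT_OUTCOMES)`, followed by dropping the rows left without a label.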

In [None]:
keep_columns = ['events']  # etc...

In [None]:
hit_df = data[keep_columns].dropna()
# Statcast names this column 'player_name'; the rename has no effect unless it is in keep_columns
hit_df.rename(columns = {'player_name':'NAME'}, inplace = True)

# **Exploring Data**

Exploring the distributions of the relevant columns.

What features determine the result of a hit?

In [None]:
f, ax = plt.subplots(nrows = 9, ncols = 3, figsize=(30,30))
# Selecting columns we want distributions for 
hist_cols = ['events']
row = 0
col = 0

for i, column in enumerate(hist_cols):
    # sns.distplot is deprecated and cannot plot string columns; histplot handles both cases
    graph = sns.histplot(data[column].dropna(), ax=ax[row][col])
    graph.set(title = column)
    col += 1
    if col == 3:
        col = 0
        row += 1
        
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=1)

# **Machine Learning**

## **Feature Engineering**
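
The elevation feature planned in the overview could be built by looking up each game's home park. The sketch below is a placeholder: the table name, the function name, and the single approximate entry for Coors Field (`COL`) are assumptions, and a complete table would need to be filled in from a verified source.

```python
# Placeholder elevation table in feet, keyed by the Statcast `home_team` abbreviation.
# Only one approximate entry is filled in; the rest are left for a verified source.
PARK_ELEVATION_FT = {
    "COL": 5200.0,  # approximate value for Coors Field
}

def park_elevation(home_team, table=PARK_ELEVATION_FT):
    """Return the home park's elevation in feet, or None when it is not in the table."""
    return table.get(home_team)

print(park_elevation("COL"))
print(park_elevation("NYY"))  # not in the placeholder table
```

With pandas, `data['home_team'].map(PARK_ELEVATION_FT)` would produce the same values as a new column, leaving missing parks as NaN.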

## **Data Splitting**

## **Model Choice**

# **Results**

# **Future Analysis**