# Research Questions

1. Which pitches were being hit most by Silver Slugger and All Star caliber players at Minute Maid Park in 2017?
2. Which pitch zones were Silver Slugger and All Star caliber players hitting pitches from?
3. How often did Astros players swing at pitches outside of the strike zone?
4. How did Astros players perform on breaking / offspeed pitches compared to players from other teams?
5. Other comparisons between Astros players and opposing players.

# Player Classification - Silver Slugger vs All Star
- The __Silver Slugger__ is awarded to the top _hitter_ at each position.
- __All Star__ players are voted on by fans based on their performance during the current season.
    - Only top performing players are considered for the ballot.
    - These players are usually well rounded and excel at both batting and fielding.

# Pitch Classification
- Pitches are categorized into 11 types:
    1. 2-Seam Fastball
    2. 4-Seam Fastball
    3. Changeup
    4. Curveball
    5. Cutter
    6. Eephus
    7. Knuckle Curve
    8. Pitch Out
    9. Sinker
    10. Slider
    11. Split Finger
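
A minimal sketch of how the two-letter `pitch_type` codes in the data could be mapped to the 11 names above. The codes follow Statcast's usual abbreviations; the `PITCH_FAMILIES` grouping and the `label_pitches` helper are assumptions introduced here for the breaking / offspeed question, not part of the original analysis.

```python
import pandas as pd

# Statcast pitch_type codes for the 11 categories listed above.
PITCH_NAMES = {
    "FT": "2-Seam Fastball",
    "FF": "4-Seam Fastball",
    "CH": "Changeup",
    "CU": "Curveball",
    "FC": "Cutter",
    "EP": "Eephus",
    "KC": "Knuckle Curve",
    "PO": "Pitch Out",
    "SI": "Sinker",
    "SL": "Slider",
    "FS": "Split Finger",
}

# Assumed grouping (hypothetical): adjust to taste, e.g. some analysts
# treat the cutter as a breaking pitch. Pitch outs are left ungrouped.
PITCH_FAMILIES = {
    "FT": "fastball", "FF": "fastball", "FC": "fastball", "SI": "fastball",
    "CU": "breaking", "KC": "breaking", "SL": "breaking",
    "CH": "offspeed", "FS": "offspeed", "EP": "offspeed",
}


def label_pitches(df):
    """Return a copy of df with pitch_name and pitch_family columns added."""
    out = df.copy()
    out["pitch_name"] = out["pitch_type"].map(PITCH_NAMES)
    out["pitch_family"] = out["pitch_type"].map(PITCH_FAMILIES)
    return out


# Toy rows only; the real frame would be source_data.
demo = pd.DataFrame({"pitch_type": ["FF", "FS", "FT", None]})
labeled = label_pitches(demo)
print(labeled)
```

Rows with a missing or unmapped `pitch_type` simply get `NaN` in the new columns, so they can be filtered out later.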

# Key Attributes
- Zone
    - https://baseballsavant.mlb.com/sections/statcast_search_v2/images/zones.png
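
As a rough sketch of how the `zone` attribute could feed the question about swings outside the strike zone: in the Statcast zone chart linked above, zones 1 through 9 lie inside the strike zone and zones 11 through 14 lie outside it. The set of `description` values treated as swings and the `chase_rate` helper are assumptions made for illustration.

```python
import pandas as pd

# Zones 1-9 are inside the strike zone; 11-14 are outside.
IN_ZONE = set(range(1, 10))

# Assumption: these description values count as swings.
SWINGS = {
    "swinging_strike",
    "swinging_strike_blocked",
    "foul",
    "foul_tip",
    "hit_into_play",
    "hit_into_play_no_out",
    "hit_into_play_score",
}


def chase_rate(df):
    """Share of out-of-zone pitches (with a known zone) that drew a swing."""
    known = df.dropna(subset=["zone"])
    outside = known[~known["zone"].isin(IN_ZONE)]
    if outside.empty:
        return float("nan")
    return outside["description"].isin(SWINGS).mean()


# Toy rows only; the real frame would be source_data filtered by team.
demo = pd.DataFrame({
    "zone": [3.0, 14.0, 14.0, 9.0, 13.0, None],
    "description": ["swinging_strike", "ball", "foul",
                    "called_strike", "ball", "ball"],
})
print(chase_rate(demo))
```

Pitches with a missing `zone` are excluded from the denominator rather than counted as outside the zone.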

# Data
- Our data comes from BaseballSavant.com
- This houses MLB's Statcast database, which tracks the ball's movement during every pitch and play.

In [None]:
# import packages used for the analysis
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

# read the CSV into the source_data variable; the file has no index column
# low_memory=False reads the file in one pass, so pandas infers a single
# dtype per column instead of guessing chunk by chunk
source_data = pd.read_csv('2017_MLB_Data.csv', low_memory=False)

In [3]:
# collect the number of rows and columns
rows, columns = source_data.shape

print('Rows: ' + str(rows) + ', Cols: ' + str(columns))

Rows: 22695, Cols: 89


In [4]:
# display columns missing data ratio
source_data.isna().mean().round(4) * 100

pitch_type               0.14
game_date                0.00
release_speed            0.23
release_pos_x            0.23
release_pos_z            0.23
                         ... 
post_home_score          0.00
post_bat_score           0.00
post_fld_score           0.00
if_fielding_alignment    0.26
of_fielding_alignment    0.26
Length: 89, dtype: float64

In [5]:
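# find columns with more than 50% of cells missing and preview the frame without them
# note: drop() returns a new DataFrame, so source_data itself is unchanged here
drop_cols = source_data.columns[source_data.isnull().mean() > 0.5]
source_data.drop(drop_cols, axis=1)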

,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,description,zone,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,FF,9/24/17,93.0,-0.6278,6.3865,Jose Altuve,514888,453284,swinging_strike,3.0,...,5,7,5,7,7,5,5,7,Standard,Strategic
1,FS,9/24/17,81.2,-0.9263,6.2440,Jose Altuve,514888,453284,ball,14.0,...,5,7,5,7,7,5,5,7,Standard,Strategic
2,FF,9/24/17,93.5,-0.5680,6.3990,Jose Altuve,514888,453284,ball,14.0,...,5,7,5,7,7,5,5,7,Standard,Strategic
3,FF,9/24/17,93.7,-0.7498,6.3757,Jose Altuve,514888,453284,called_strike,9.0,...,5,7,5,7,7,5,5,7,Standard,Strategic
4,FF,9/24/17,93.0,-0.8375,6.4247,Jose Altuve,514888,453284,called_strike,6.0,...,5,7,5,7,7,5,5,7,Standard,Strategic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22690,FT,4/3/17,89.2,0.8567,6.0985,Jean Segura,516416,572971,foul,13.0,...,0,0,0,0,0,0,0,0,Strategic,Standard
22691,FT,4/3/17,88.1,0.6802,6.1517,Jean Segura,516416,572971,called_strike,7.0,...,0,0,0,0,0,0,0,0,Strategic,Standard
22692,FT,4/3/17,88.4,0.7248,6.0609,Jean Segura,516416,572971,ball,13.0,...,0,0,0,0,0,0,0,0,Strategic,Standard
22693,FT,4/3/17,88.2,0.8767,6.1514,Jean Segura,516416,572971,ball,14.0,...,0,0,0,0,0,0,0,0,Strategic,Standard
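
The cell above only displays the result of `drop`, which returns a new DataFrame, so the mostly-empty columns are still present in `source_data` afterwards. A small sketch on a toy frame (the values are made up for illustration) shows the difference between previewing the drop and keeping it:

```python
import numpy as np
import pandas as pd

# Toy frame: spin_dir is entirely missing, release_speed is 25% missing.
demo = pd.DataFrame({
    "release_speed": [93.0, 81.2, np.nan, 93.7],
    "spin_dir": [np.nan, np.nan, np.nan, np.nan],
    "zone": [3.0, 14.0, 14.0, 9.0],
})

# Fraction of missing cells per column, then the columns above 50%.
missing_share = demo.isna().mean()
drop_cols = demo.columns[missing_share > 0.5]

# drop() returns a new frame; demo still has all three columns.
preview = demo.drop(columns=drop_cols)
print(list(preview.columns))
print(list(demo.columns))

# Assign the result back to make the drop stick.
demo = demo.drop(columns=drop_cols)
print(list(demo.columns))
```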


In [None]:

# commented out until we decide which columns to impute
# source_data['pitch_type_imputed'] = source_data['pitch_type'].fillna(source_data['pitch_type'].mode()[0])
# source_data['release_speed_imputed'] = source_data['release_speed'].fillna(source_data['release_speed'].mean())
# source_data.isnull().mean()*100

In [7]:
# checking the data types
source_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22695 entries, 0 to 22694
Data columns (total 89 columns):
pitch_type                         22664 non-null object
game_date                          22695 non-null object
release_speed                      22643 non-null float64
release_pos_x                      22643 non-null float64
release_pos_z                      22643 non-null float64
player_name                        22695 non-null object
batter                             22695 non-null int64
pitcher                            22695 non-null int64
events                             5836 non-null object
description                        22695 non-null object
spin_dir                           0 non-null float64
spin_rate_deprecated               0 non-null float64
break_angle_deprecated             0 non-null float64
break_length_deprecated            0 non-null float64
zone                               22643 non-null float64
des                                5836 non-nul

In [8]:
# displays basic statistics of the given data
source_data.describe()

,release_speed,release_pos_x,release_pos_z,batter,pitcher,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,...,at_bat_number,pitch_number,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score
count,22643.0,22643.0,22643.0,22695.0,22695.0,0.0,0.0,0.0,0.0,22643.0,...,22695.0,22695.0,22695.0,22695.0,22695.0,22695.0,22695.0,22695.0,22695.0,22695.0
mean,87.995235,-1.110656,5.931885,513236.633091,543621.085217,,,,,9.624608,...,38.433576,2.905398,2.504781,1.970214,2.187927,2.287068,1.970302,2.504781,2.188015,2.287068
std,6.01277,1.514586,0.476753,101091.776151,70691.374294,,,,,4.091088,...,22.605053,1.738803,2.814522,2.254612,2.515007,2.611038,2.25477,2.814522,2.515141,2.611038
min,64.8,-4.9893,3.293,134181.0,112526.0,,,,,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,83.3,-2.1115,5.6372,475174.0,501789.0,,,,,6.0,...,19.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,88.9,-1.504,5.9829,514888.0,571666.0,,,,,11.0,...,38.0,3.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0
75%,92.9,-0.77175,6.26545,592743.0,605242.0,,,,,13.0,...,57.0,4.0,4.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0
max,102.4,4.8143,7.1503,664057.0,664641.0,,,,,14.0,...,108.0,13.0,16.0,13.0,16.0,16.0,13.0,16.0,16.0,16.0


In [None]:
# plots & displays a bar graph of the number of pitches of each pitch type
ax = source_data['pitch_type'].value_counts().plot.bar()


In [None]:
# plots & displays the number of pitches for each description (pitch outcome)
ax1 = source_data.groupby('description').size().plot.bar(rot=90)

In [None]:
# plots & displays the number of occurrences of each event
# events is a text column, so count the values instead of drawing a histogram
ax2 = source_data['events'].value_counts().plot(kind='bar')

In [None]:
# mean release speed for each description (pitch outcome)
ax3 = source_data[['description', 'release_speed']].groupby(['description']).mean().plot(kind='bar', rot = 0)

In [None]:
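# count of pitches at each release speed within each description
# this draws one bar per (description, release_speed) pair, so the chart is dense
ax4 = source_data.groupby('description')['release_speed'].value_counts().plot(kind='bar')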

In [None]:
# scatter plot of release speed vs. horizontal release position
ax5 = source_data.plot.scatter(x='release_speed', y='release_pos_x', c='DarkBlue')

In [None]:
# scatter plot using seaborn (x and y must be passed as keywords in recent versions)
sns.lmplot(x='release_speed', y='release_pos_x', data=source_data, fit_reg=False)

In [None]:
# correlation heatmap of the float64 columns
corr = source_data.select_dtypes(include='float64').corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            cmap=sns.diverging_palette(220, 10, as_cmap=True))

# Research Answers
1. Which pitches were being hit most by Silver Slugger and All Star caliber players at Minute Maid Park in 2017?
    - 
2. Which pitch zones were Silver Slugger and All Star caliber players hitting pitches from?
    - 
3. How often did Astros players swing at pitches outside of the strike zone?
    - 
4. How did Astros players perform on breaking / offspeed pitches compared to players from other teams?
    - 
5. Other comparisons between Astros players and opposing players.
    - 