# 2019 On Base Percentage Projection

This notebook uses principal component regression, a multivariate technique, to project the 2019 full-season on-base percentage (OBP) of major league baseball players from their March and April statistics.


##### Importing Data File
First, import the CSV file using pandas. Some basic preprocessing follows: strip the % signs and convert those strings to decimals so that a proper analysis can be conducted.

In [2]:
import pandas as pd # load in for using dataframes in python
df = pd.read_csv(r'2019MarchAprilBatting.csv') # df holds the March/April batting data
attr = df.columns # put all column headers into list
# remove percentages, change data types to numerical (from reference [1])
for i in attr:
    if '%' in i or i == 'MarApr_HR/FB': # hardcoded the exception case
        df[i] = df[i].str.rstrip('%').astype('float') / 100.0
        
df.head() # For viewing raw data

Unnamed: 0,playerid,Name,Team,MarApr_PA,MarApr_AB,MarApr_H,MarApr_HR,MarApr_R,MarApr_RBI,MarApr_SB,...,MarApr_FB%,MarApr_IFFB%,MarApr_HR/FB,MarApr_O-Swing%,MarApr_Z-Swing%,MarApr_Swing%,MarApr_O-Contact%,MarApr_Z-Contact%,MarApr_Contact%,FullSeason_OBP
0,15998,Cody Bellinger,LAD,132,109,47,14,32,37,5,...,0.361,0.057,0.4,0.226,0.66,0.407,0.811,0.884,0.86,0.406
1,11477,Christian Yelich,MIL,124,102,36,14,26,34,6,...,0.41,0.118,0.412,0.279,0.724,0.448,0.566,0.878,0.757,0.429
2,17975,Scott Kingery,PHI,35,32,13,2,5,6,1,...,0.333,0.0,0.222,0.402,0.745,0.535,0.543,0.854,0.711,0.315
3,7927,Eric Sogard,TOR,49,43,17,3,8,9,2,...,0.405,0.0,0.2,0.273,0.505,0.373,0.939,0.978,0.962,0.353
4,14130,Daniel Vogelbach,SEA,92,71,22,8,15,16,0,...,0.519,0.074,0.296,0.237,0.488,0.337,0.661,0.827,0.757,0.341
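The percent-string conversion above can also be written without hardcoding the `MarApr_HR/FB` exception. The following is a minimal sketch on a made-up toy frame; the helper name `percent_columns_to_fraction` and the toy values are hypothetical and not part of the original notebook. It detects percent columns by their values rather than by their headers.

```python
import pandas as pd

# Toy frame standing in for the real CSV; names and values are made up.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "MarApr_BB%": ["12.5%", "8.0%"],
    "MarApr_HR/FB": ["40.0%", "22.2%"],
    "MarApr_PA": [132, 35],
})

def percent_columns_to_fraction(frame):
    """Return a copy where columns whose values all end in '%' become fractions."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already numeric, nothing to strip
        as_text = out[col].astype(str)
        if as_text.str.endswith("%").all():
            out[col] = as_text.str.rstrip("%").astype(float) / 100.0
    return out

clean = percent_columns_to_fraction(toy)
print(clean.dtypes)
```

Checking the values instead of the header means a column such as `MarApr_HR/FB` is converted even though its name has no % sign.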


#### Principal Component Analysis (PCA)
Next, it is necessary to decide which variables should go into the multivariate analysis.

This can be done using a principal component analysis (PCA), which is essentially a statistical way to reduce the number of dimensions in a dataset.

For this PCA, I will be referencing a Python PCA tutorial [here](https://towardsdatascience.com/dimension-reduction-techniques-with-python-f36ca7009e5c).

First, I will remove the player ID data, as well as the actual full-season OBP, from the columns list in order to use only relevant numerical data. The regression target in the code below is the March and April OBP, and the resulting predictions are later compared against the full-season OBP.

In [19]:
# acquire relevant data
numericalCols = attr[3:28]

from sklearn.preprocessing import StandardScaler
# Data Normalization
x = df.loc[:, numericalCols].values
y = df.loc[:,['MarApr_OBP']].values # target variable, predicting OBP
x = StandardScaler().fit_transform(x) # transform Data
x = pd.DataFrame(x)

# now the PCA can be started post normalization
from sklearn.decomposition import PCA
pca = PCA()
x_pca = pca.fit_transform(x)
x_pca = pd.DataFrame(x_pca)
x_pca.columns = numericalCols # note: these columns are principal components; the stat names are reused only as labels
x_pca.head()

Unnamed: 0,MarApr_PA,MarApr_AB,MarApr_H,MarApr_HR,MarApr_R,MarApr_RBI,MarApr_SB,MarApr_BB%,MarApr_K%,MarApr_ISO,...,MarApr_GB%,MarApr_FB%,MarApr_IFFB%,MarApr_HR/FB,MarApr_O-Swing%,MarApr_Z-Swing%,MarApr_Swing%,MarApr_O-Contact%,MarApr_Z-Contact%,MarApr_Contact%
0,11.11497,0.47756,1.060617,0.830204,0.93986,0.617657,0.25725,0.399657,2.221867,0.169037,...,-0.212215,-0.792368,-0.06212,-0.491231,0.124781,-0.017503,0.018716,-0.103215,0.000943,0.000868
1,8.305063,-1.64629,0.059287,-1.285608,0.526556,2.239042,2.432189,-0.015057,1.790126,-0.392202,...,-0.305724,-0.791882,-0.032813,-0.155883,0.04573,0.020353,-0.02371,-0.03682,-0.006524,0.001754
2,0.448733,-1.593652,-0.236003,4.814896,2.909564,1.899678,2.071981,-1.229633,-0.798181,-0.41322,...,0.065204,0.075552,-0.367605,0.677644,-0.357995,-0.066103,0.101488,0.003487,-0.001189,-0.000995
3,2.612709,3.365974,3.79109,3.189216,3.759889,-0.490079,-0.085583,-1.430738,1.210242,0.232165,...,0.147571,0.012133,0.071572,0.233409,-0.117698,-0.268262,-0.026932,0.062469,0.006567,0.005921
4,4.420087,-1.222236,5.09866,-0.712499,1.292279,-0.070207,0.595314,-0.25824,-0.278129,0.09717,...,0.560325,-0.238942,0.170455,-0.176299,0.011953,-0.078581,0.023225,-0.033073,-6.4e-05,-0.004909
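One property worth checking after a PCA is that the transformed columns are uncorrelated with each other, which is what makes them convenient regression inputs. The sketch below illustrates this on synthetic data (not the batting CSV) and labels the columns `PC1`, `PC2`, and so on; that naming is my own assumption, chosen so the components are not mistaken for the original stats.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the batting features: 50 "players", 4 correlated stats.
base = rng.normal(size=(50, 2))
mixed = base @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(50, 2))
features = np.hstack([base, mixed])

scaled = StandardScaler().fit_transform(features)
scores = PCA().fit_transform(scaled)

# Label columns as components rather than reusing the original stat names.
pc_names = [f"PC{i + 1}" for i in range(scores.shape[1])]
scores_df = pd.DataFrame(scores, columns=pc_names)

# Off-diagonal correlations between components should be numerically zero.
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(scores_df.head())
print("max |off-diagonal correlation|:", np.abs(off_diag).max())
```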


The data is now expressed in principal component space. The next step is to decide how many components to keep, and from there one can conduct the regression.

In [22]:
explained_variance = pca.explained_variance_ratio_
ev = list(explained_variance)
sum(ev[0:17])

0.9968645366957654

As seen above, the first 17 principal components account for about 99.7% of the variance in the data.
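Rather than summing slices by hand, the number of components can be read off the cumulative explained variance. This is a small sketch on synthetic data; the 0.99 threshold and the variable names are assumptions for illustration, not values taken from the analysis above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic correlated features: 100 rows, 10 columns.
X_demo = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))

pca_demo = PCA().fit(StandardScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.99  # assumed cutoff for this illustration
# Smallest number of components whose cumulative ratio reaches the threshold.
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components, cumulative[n_components - 1])
```

sklearn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it picks the number of components needed to explain that share of the variance.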

#### 10-Fold Cross Validation
Keeping these 17 principal components, one can then use sklearn's cross-validated prediction (`cross_val_predict`).
<br /> <br />
Several aspects of this analysis follow the NIRPY Research tutorial found [here](https://nirpyresearch.com/principal-component-regression-python/).

In [23]:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict # for cross validation predictions
from scipy.signal import savgol_filter


### Preprocessing for Fit
# Preprocessing (1): first derivative
d1X = savgol_filter(x, 25, polyorder = 5, deriv=1)
 
# Preprocess (2) Standardize features by removing the mean and scaling to unit variance
Xstd = StandardScaler().fit_transform(d1X[:,:])

# Independent variables, with new reduced dimensions, to be used for regression
Xreg = pca.fit_transform(Xstd)[:,:17] # 17 because 17 principal components were defined by the PCA analysis

# Create linear regression object 
regr = linear_model.LinearRegression() 

# 10 fold Cross validation
y_cv = cross_val_predict(regr, Xreg, y, cv=10) 

# Display predictions (note: y_cv[1:10] skips index 0 and prints nine values;
# use y_cv[:10] to show the first ten players)
print("First Ten Predicted OBP's:")
print(y_cv[1:10].tolist())

First Ten Predicted OBP's:
[[0.44615990639748226], [0.4071988014621668], [0.48854402943758085], [0.4622998218133807], [0.4626633684527435], [0.45550722983565317], [0.4877908233934448], [0.4445223777301892], [0.4350955551251847]]
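In the cell above, the scaler and PCA are fit on all players before the folds are split. A common alternative is to wrap the steps in an sklearn `Pipeline`, so that scaling and PCA are refit on the training part of each fold. The sketch below shows that pattern on synthetic data; it omits the Savitzky-Golay step, and the choice of 5 components is an assumption made for the toy data, so it is not a reproduction of the analysis above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Synthetic features and a target that depends on the first two columns.
X_demo = rng.normal(size=(120, 8))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Scale -> PCA -> linear regression, refit inside every fold.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())

# Each prediction comes from a model that never saw that row during fitting.
y_pred = cross_val_predict(pcr, X_demo, y_demo, cv=10)
print(y_pred[:10])
```

Passing a 1-D target, as here, also makes the predictions 1-D, which avoids the nested lists seen in the output above.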


We now have a projection for each player, obtained with a bit of machine learning. From here, one can compare these values with the actual values using some error analysis.

#### Error Analysis

First, compute the mean squared error (MSE) between the predictions and the actual full-season OBP. The MSE is the average of the squared differences, so a smaller value means the linear regression model was closer to the actual OBP.

In [24]:
from sklearn.metrics import mean_squared_error

# now we analyze the full season OBP
fullobp = df.loc[:,['FullSeason_OBP']].values 

# Mean Squared Error calculation
mse_cv = mean_squared_error(fullobp, y_cv)

print("Mean Squared Error:"+str(mse_cv))

Mean Squared Error:0.0022086108723345078
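Because the MSE is in squared units, its square root (RMSE) is often easier to read, since it is in OBP units. The sketch below uses made-up numbers, not values from the dataset, to show the calculation by hand next to sklearn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted OBP values for three hypothetical players.
actual = np.array([0.400, 0.300, 0.350])
predicted = np.array([0.450, 0.280, 0.350])

# MSE by hand: average of the squared differences.
mse_manual = np.mean((predicted - actual) ** 2)
mse_sklearn = mean_squared_error(actual, predicted)

# RMSE brings the error back to the same units as OBP.
rmse = np.sqrt(mse_manual)
print(mse_manual, mse_sklearn, rmse)
```

Applied to the value reported above, the square root of 0.0022086 is about 0.047, or roughly 47 points of OBP.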


Comparing the two values side by side for each player and computing the percent error is another good way to relate the predicted OBP to the actual full-season OBP.

In [25]:
player = df.loc[:,['Name']].values 
team = df.loc[:,['Team']].values 

# pre-concatenation: list to collect each player's error
error = []

#loop through each row (player) and acquire an error value
for p in range(0,len(fullobp)):
    error.append((abs(y_cv[p]-fullobp[p])/fullobp[p])[0]) # absolute value calculation
    # note: ([0] is to acquire value within nested list)

# now we'll display this within the table of the DataFrame
newdf = df.copy() # copy so the original df is left unchanged
newdf['Predicted_OBP'] = y_cv.ravel() # flatten the (n, 1) predictions to 1-D
newdf['Percent_Error'] = error
# pd.set_option('display.max_rows', len(newdf)) # to output the full dataset for viewing
newdf.loc[:,["Name","Team", "FullSeason_OBP","Predicted_OBP","Percent_Error"]]

Unnamed: 0,Name,Team,FullSeason_OBP,Predicted_OBP,Percent_Error
0,Cody Bellinger,LAD,0.406,0.499301,0.229804
1,Christian Yelich,MIL,0.429,0.446160,0.040000
2,Scott Kingery,PHI,0.315,0.407199,0.292695
3,Eric Sogard,TOR,0.353,0.488544,0.383977
4,Daniel Vogelbach,SEA,0.341,0.462300,0.355718
...,...,...,...,...,...
315,Jackie Bradley Jr.,BOS,0.317,0.218686,0.310139
316,Keon Broxton,NYM,0.242,0.207813,0.141269
317,Pablo Reyes,PIT,0.274,0.212952,0.222804
318,Eduardo Nunez,BOS,0.243,0.196238,0.192437


In [27]:
# calculate average error:
print("The Average Percent Error is: "+str(sum(error)/len(error)*100)+"%")

The Average Percent Error is: 11.508235038281827%
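The per-player loop and the average can also be expressed with NumPy array operations. This is a minimal sketch with made-up values, not numbers from the dataset, showing the same absolute percent error calculation in vectorized form.

```python
import numpy as np

# Made-up actual and predicted OBP values for three hypothetical players.
actual = np.array([0.400, 0.300, 0.350])
predicted = np.array([0.450, 0.280, 0.350])

# Absolute percent error per player, then the average as a percentage.
pct_error = np.abs(predicted - actual) / actual
mape = pct_error.mean() * 100
print("The Average Percent Error is: " + str(mape) + "%")
```

With 1-D arrays there is no nested list to unwrap, so the `[0]` indexing used in the loop above is not needed.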


#### Conclusion

In conclusion, the predictions were off by about 11.5% of the actual full-season OBP on average. The predictor is a linear regression, evaluated with 10-fold cross-validation, on the reduced-dimension data from the PCA, built with Python's sklearn. Thank you for reading my analysis.



In [29]:
#write to csv for viewing purposes
newdf.to_csv("PredictedOBP2019.csv")

##### References

[1] https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv <br />
[2] https://towardsdatascience.com/dimension-reduction-techniques-with-python-f36ca7009e5c <br />
[3] https://nirpyresearch.com/principal-component-regression-python/