 # <center> Udacity Machine Learning Capstone Project: MLB Wins Prediction </center>
 ## <center>John K. Hancock, August 2018</center>
 ### <center>username: jkhancock@gmail.com</center>
 


<center> <img src="mlb_logo.jpg" align="center" alt="Copyright Major League Baseball" height="320" width="320" /></center> 



## <center>MLB Wins Prediction</center>


### <a id='Problem-Statement'><font color=black><u>Problem Statement</u></font></a>
A new general manager ("GM") of a Major League Baseball team has been hired with the task of getting the team into the playoffs. The GM knows that to make the playoffs the team needs to win more games than most other teams. To solve this problem, the GM wants to know which key performance statistics the team needs to achieve in order to maximize its number of wins.

The GM has tasked his Lead Data Scientist ("LDS") with building a model that predicts how many wins a team will have based on historical
statistical data, and with listing the key statistical features that winning teams share. In
the end, the GM will use this model to find players that fit these key statistical features. For example, if
the model shows that On-Base Percentage ("OBP") leads to more wins, the GM will look for players that have
demonstrated high OBP.


### <a id='Datasets-Inputs'><font color=black><u>Datasets and Inputs</u></font></a>
Baseball has a long history of meticulous record keeping, which laid the groundwork for data
exploration and analysis, and there are several well-established sources of data for this project. Perhaps the most comprehensive source is the website FanGraphs.com (www.fangraphs.com).

FanGraphs.com is a website operated by FanGraphs, Inc. FanGraphs compiles historical statistical data for the entire history of Major League Baseball. In addition, it creates and records advanced baseball metrics beyond the established statistics. FanGraphs is well established as a chronicler and compiler of baseball statistics, and it has partnership deals with ESPN and SB Nation. (Link to website: https://www.fangraphs.com/) (Wikipedia: https://en.wikipedia.org/wiki/Fangraphs)





### <a id='Solution-Statement'><font color=black><u>Solution Statement</u></font></a>
To predict wins, the Data Scientist will need to build a mapping function that uses historical team statistics for the past 20 years as input variables to predict the output variable (the number of wins). Because the number of wins is a continuous target, this is a regression problem rather than a classification problem. For this step, s/he will use a LASSO linear regression model from scikit-learn. Additionally, the Data Scientist will use a Deep Learning Neural Network to perform the same regression.

### <a id='Project-Design'><font color=black><u>Project Design</u></font></a>

#### <u> Part One: Data Collection and Data Wrangling </u>

In part one of this project, twenty years of all available MLB statistics were collected from the website fangraphs.com. The data was then explored, cleaned, and wrangled into a final dataset, "FINAL_DATASET_MLB_1998_to_2017.csv". Please see the notebook, <i><b>00_Capstone_Project_MLB_Collection_Wrangling</b></i>, located in the folder 00_Data Collection and Wrangling, for more details and explanations.

#### <a id="home"><font color=black><u> Part Two: The Model </u></font></a>
[0.0 The Final Dataset: Descriptive Statistics](#0.0)<br />
[1.0 A Look at MLB Wins](#1.0)<br />
[2.0 Separate the dependent variable from the independent variables](#2.0)<br />
[3.0 Outliers](#3.0)<br />
&nbsp; &nbsp; [3.1 Remove Outliers ](#3.1)<br />
[4.0 Check for Skewness](#4.0)<br />
[5.0 Pre-Processing data](#5.0)<br />
 &nbsp; &nbsp; [5.1 Select The Best Features](#5.1)<br />
  &nbsp; &nbsp; [5.2 Split the Dataset](#5.2)<br />
[6.0 Model One: LASSO Regression](#6.0)<br />
[7.0 Model Two: Linear Regression with Deep Neural Network](#7.0)<br />
[8.0 Cluster Segment Analysis](#8.0)<br />
[9.0 Final Summary ](#9.0)<br />
[10.0 Project Reflections ](#10.0)<br />
[11.0 References](#11.0)<br />


#### <a id='0.0'><font color=black><u>0.0 The Final Dataset: Descriptive Statistics</u></font></a>

In [76]:
import warnings
warnings.simplefilter('ignore')

import collections
import numpy as np
import os
import pandas as pd
import pprint as pp
import visuals as vs
import capstoneutils as hp

from glob import glob
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
path = os.getcwd()

import matplotlib.pyplot as plt
# Pretty display for notebooks
%matplotlib inline


In [77]:
# Build the path with os.path.join so it does not depend on Windows-style separators
data_file = os.path.join(path, 'FINAL DATASET version 1.0', 'FINAL_DATASET_MLB_1998_to_2017.csv')
Final_MLB_Data = pd.read_csv(data_file, index_col=0)
print ("The baseball dataset has {} data points with {} variables each.".format(*Final_MLB_Data.shape))


FileNotFoundError: File b'C:\\Users\\jkhan\\Documents\\udacity\\MLND\\Capstone Project\\CAPSTONE FINAL PROJECT\\01_The Model\\FINAL DATASET version 1.0\\FINAL DATASET version 1.0\\FINAL_DATASET_MLB_1998_to_2017.csv' does not exist

In [None]:
print("Summary fielding statistics are below:")
Final_MLB_Data[['FIELD_E', 'FIELD_DP', 'FIELD_SB', 'FIELD_CS', 'FIELD_PB',
       'FIELD_WP', 'FIELD_FP', 'FIELD_PO']].describe()


In [None]:
print("Summary hitting statistics are below:")
Final_MLB_Data[['OFF_PA', 'OFF_HR', 'OFF_R',
       'OFF_RBI', 'OFF_SB', 'OFF_ISO', 'OFF_BABIP', 'OFF_AVG', 'OFF_OBP',
       'OFF_SLG', 'OFF_wOBA', 'OFF_wRC+', 'OFF_H', 'OFF_1B', 'OFF_2B',
       'OFF_3B', 'OFF_BB', 'OFF_IBB', 'OFF_SO', 'OFF_HBP', 'OFF_SF',
       'OFF_GDP', 'OFF_CS', 'OFF_BB%', 'OFF_K%', 'OFF_BB/K']].describe()

In [None]:
print("Summary pitching statistics are below:")
Final_MLB_Data[['PITCH_ERA', 'PITCH_CG', 'PITCH_ShO', 'PITCH_SV', 'PITCH_H',
       'PITCH_R', 'PITCH_ER', 'PITCH_HR', 'PITCH_BB', 'PITCH_IBB',
       'PITCH_HBP', 'PITCH_WP', 'PITCH_BK', 'PITCH_SO', 'PITCH_K/9',
       'PITCH_BB/9', 'PITCH_K/BB', 'PITCH_H/9', 'PITCH_HR/9', 'PITCH_AVG',
       'PITCH_WHIP', 'PITCH_BABIP', 'PITCH_LOB%', 'PITCH_FIP',
       'PITCH_Starting', 'PITCH_Relieving', 'PITCH_Start-IP',
       'PITCH_Relief-IP', 'PITCH_K%', 'PITCH_BB%', 'PITCH_Age']].describe()

<font color=black><u><b>The Pythagorean Theorem of Baseball</b></u></font><br />
The Pythagorean Theorem of Baseball is a creation of Bill James which relates the number of runs a team has scored and surrendered to its actual winning percentage, based on the idea that runs scored compared to runs allowed is a better indicator of a team's (future) performance than a team's actual winning percentage. This results in a formula which is referred to as Pythagorean Winning Percentage. (Baseball Reference.com https://www.baseball-reference.com/bullpen/Pythagorean_Theorem_of_Baseball)<br />

The formula is "1 / (1 + (Runs Allowed / Runs Scored)^2)". The results below show that even with this traditional wins prediction heuristic there is variance between what the formula predicts and the actual results.
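
As a quick worked example of the formula above, the sketch below computes the Pythagorean winning percentage and the implied win total over a 162-game season. The function name `pythagorean_wins` and the run totals are illustrative assumptions, not values taken from the dataset.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2):
    """Expected wins from the Pythagorean formula 1 / (1 + (RA / RS)^exponent)."""
    win_pct = 1 / (1 + (runs_allowed / runs_scored) ** exponent)
    return games * win_pct

# Hypothetical team: 800 runs scored and 700 runs allowed
expected = pythagorean_wins(800, 700)
print("Expected wins: %.2f" % expected)

# A team that scores exactly as many runs as it allows is expected to finish at .500
even_team = pythagorean_wins(700, 700)
print("Expected wins for an even run differential: %.2f" % even_team)
```

A team with an even run differential lands on 81 wins, which matches the mean win total discussed in the next section.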

In [None]:
# Baseball's Pythagorean Theorem: 1 / (1 + (Runs Allowed / Runs Scored)**2)

min_runs_allowed = Final_MLB_Data['PITCH_R'].min()
runs_scored = Final_MLB_Data.loc[Final_MLB_Data['PITCH_R'] == min_runs_allowed]['OFF_R'].item()
win_total = Final_MLB_Data.loc[Final_MLB_Data['PITCH_R'] == min_runs_allowed]['W'].item()
pythag1 =  float(1 / (1 + (min_runs_allowed/runs_scored)**2))
expected_wins1 = 162*pythag1

max_runs_scored = Final_MLB_Data['OFF_R'].max()
runs_allowed = Final_MLB_Data.loc[Final_MLB_Data['OFF_R'] == max_runs_scored]['PITCH_R'].item()
win_total2 = Final_MLB_Data.loc[Final_MLB_Data['OFF_R'] == max_runs_scored]['W'].item()
pythag2 = float(1 / (1 + (runs_allowed/max_runs_scored)**2))
expected_wins2 = 162*pythag2

    
    
print("The minimum number of runs allowed by a team over the past 20 years is %.0f and that team's win total was %.0f." % (min_runs_allowed, win_total))
print("That same team scored %.0f runs for a run differential of %.0f" % (runs_scored, runs_scored-min_runs_allowed))
print("According to the Pythagorean Theorem of Baseball, the team's win total should have been %.2f" % expected_wins1)
print("\n")
print("The maximum number of runs scored by a team over the past 20 years is %.0f and that team's win total was %.0f." % (max_runs_scored, win_total2))
print("That same team allowed %.0f runs for a run differential of %.0f" % (runs_allowed, max_runs_scored-runs_allowed))
print("According to the Pythagorean Theorem of Baseball, the team's win total should have been %.2f" % expected_wins2)

#### <a id='1.0'><font color=black><u>1.0 A Look at MLB Wins </u></font></a>  

The objective of this project is to build a model that can predict the number of wins and list the features that matter most to winning. Over the past 20 years, the average and median number of wins is 81 games, and the standard deviation is 11.52. As a working target for making the playoffs, the team needs to win at least one full standard deviation above the mean, or about 93 wins (81 + 11.52 ≈ 93).

In [None]:
text = "Median No. of Wins is %.0f" % Final_MLB_Data['W'].median()

Final_MLB_Data['W'].plot.hist(grid=True, bins=40, rwidth=0.9,color='green')
plt.title('MLB Team Wins')
plt.xlabel('Wins')
plt.ylabel('Number of seasons')
plt.text(41,40,text)
plt.grid(axis='y', alpha=0.75)

In [None]:
print("The average number of wins is %.2f." % Final_MLB_Data['W'].mean())
print("The median number of wins is %.2f." % Final_MLB_Data['W'].median())
print("The standard deviation for wins is %.2f." % Final_MLB_Data['W'].std())

In [None]:
text = "Total number is %d or %.2f%%" % (Final_MLB_Data.loc[Final_MLB_Data['W'] >= 93, ['W']].count(), float((Final_MLB_Data.loc[Final_MLB_Data['W'] >= 93, ['W']].count())/(Final_MLB_Data['W'].count()))*100)

Final_MLB_Data.loc[Final_MLB_Data['W'] >= 93, ['W']].plot.hist(grid=True, bins=40, rwidth=0.9,color='blue')
plt.title('Seasons an MLB Team Won 93 or more Games')
plt.xlabel('Wins')
plt.ylabel('Number of seasons')
plt.text(98,17.5,text)
plt.grid(axis='y', alpha=0.55)
plt.grid(axis='x', alpha=0.35)

In [None]:
text = "Total number is %d or %.2f%%" % (Final_MLB_Data.loc[Final_MLB_Data['W'] >= 95, ['W']].count(), float((Final_MLB_Data.loc[Final_MLB_Data['W'] >= 95, ['W']].count())/(Final_MLB_Data['W'].count()))*100)

Final_MLB_Data.loc[Final_MLB_Data['W'] >= 95, ['W']].plot.hist(grid=True, bins=40, rwidth=0.9,color='orange')
plt.title('Seasons MLB Team Won 95 or more Games')
plt.xlabel('Wins')
plt.ylabel('Number of seasons')
plt.text(98,17.5,text)
plt.grid(axis='y', alpha=0.55)
plt.grid(axis='x', alpha=0.35)

In [None]:
text = "Total number is %d or %.2f%%" % (Final_MLB_Data.loc[Final_MLB_Data['W'] >= 100, ['W']].count(), float((Final_MLB_Data.loc[Final_MLB_Data['W'] >= 100, ['W']].count())/(Final_MLB_Data['W'].count()))*100)

Final_MLB_Data.loc[Final_MLB_Data['W'] >= 100, ['W']].plot.hist(grid=True, bins=40, rwidth=0.9,color='red')
plt.title('Seasons MLB Team Won 100 or more Games')
plt.xlabel('Wins')
plt.ylabel('Number of seasons')
plt.text(104,3.5,text)
plt.grid(axis='y', alpha=0.55)
plt.grid(axis='x', alpha=0.35)

In [None]:
# Share of team seasons at or above each win threshold, as a percentage
total_seasons = Final_MLB_Data['W'].count()
prob_93 = (Final_MLB_Data['W'] >= 93).sum() / total_seasons * 100
prob_95 = (Final_MLB_Data['W'] >= 95).sum() / total_seasons * 100
prob_100 = (Final_MLB_Data['W'] >= 100).sum() / total_seasons * 100
print("To summarize:")
print("The probability of winning 93 or more games is %.2f%%." % prob_93 )
print("The probability of winning 95 or more games is %.2f%%." % prob_95 )
print("The probability of winning 100 or more games is %.2f%%." % prob_100 )

#### <a id='1.1'><font color=black><u>1.1 Dataset information </u></font></a>

By looking at the shape, the descriptive statistics, and the column information for the data, we see that all of the variables are continuous, not categorical. For the purpose of this project, we won't need to encode any column (e.g., with one-hot encoding).

In [None]:
Final_MLB_Data.info()

#### <a id='2.0'><font color=black><u>2.0 Separate the independent and dependent variables </u></font></a>


In [None]:
#Create the predictor variables by first dropping the "L" column
Final_MLB_Data= Final_MLB_Data.drop("L", axis=1)
#Create the target variable Wins
Wins = Final_MLB_Data["W"]
#Remove the target from the features
features = Final_MLB_Data.drop("W", axis=1)

#### <a id='3.0'><font color=black><u>3.0 Outliers </u></font></a>

In this section, outliers in the dataset are identified and removed. "Tukey's Method for identifying outliers: An outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal." (Citation: Udacity MLND course on Unsupervised Learning, Customer Segments project.) (Tukey's Method: http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/)

(See the function: detect_outliers in capstoneutils.py)
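
The source of `detect_outliers` is not reproduced in this notebook, so the following is only a minimal sketch of how Tukey's method could be applied column by column. The function name `detect_outliers_sketch`, its `k` parameter, and the toy data are assumptions made for illustration; the real helper in capstoneutils.py may differ.

```python
import collections
import numpy as np
import pandas as pd

def detect_outliers_sketch(df, k=1.5):
    """Count, for each row index, how many features flag it as a Tukey outlier."""
    counts = collections.Counter()
    for col in df.columns:
        q1, q3 = np.percentile(df[col], [25, 75])
        step = k * (q3 - q1)  # the "outlier step": 1.5 times the IQR
        mask = (df[col] < q1 - step) | (df[col] > q3 + step)
        counts.update(mask[mask].index.tolist())
    return counts

# Toy data (hypothetical): row 4 is extreme in both columns
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, -50]})
outlier_counts = detect_outliers_sketch(demo)
print(outlier_counts.most_common())
```

Returning a `Counter` keyed by row index makes it easy to drop only those rows that are outliers in several features at once, which is the criterion applied in section 3.1.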

In [None]:
cnt = hp.detect_outliers(features)
outliers  = [list(cnt.keys())] 

####  <a id='3.1'><font color=black><u>3.1 Remove Outliers </u></font></a>

From the dictionary, we can see that six observations at indexes 0, 30, 59, 570, 573, and 598 have a high number of outlier features (more than 4). These observations are removed from both the features and Wins datasets. The number of observations for both datasets is now 594.

In [None]:
pp.pprint(cnt)


In [None]:
features = features.drop([59,30,570,0,573,598])
Wins = Wins.drop([59,30,570,0,573,598])


In [None]:
print(Wins.shape)
print(features.shape)

#### <a id='4.0'><font color=black><u>4.0 Check for Skewness </u></font></a>

Skewness is a measure of the asymmetry of a distribution, while kurtosis is a measure of how heavy the tails are compared to a normal distribution. A normal distribution has a skewness of 0 and an excess kurtosis of 0. In the code below, I culled out those features with a skewness greater than 1 or less than -1.

In [None]:
from scipy.stats import kurtosis
from scipy.stats import skew

skewed_features = {}
 
for col in features.columns:
    arry = np.array(features[col])
    skew_value = skew(arry)
    if skew_value> 1 or skew_value < -1:
        skewed_features[col] = [skew_value]
    
skewed_features




        
        
        

In [None]:
plt.hist(x=features['OFF_IBB'], bins=50)
plt.xlabel('Number of Intentional Walks for a Team')
plt.title("Pre Transformed Feature Intentional Walks")

In [None]:
plt.hist(x=features['FIELD_PB'], bins=50)
plt.xlabel('Number of Passed Balls(Errors) for a Team')
plt.title("Pre Transformed Feature Passed Balls")

In [None]:
plt.hist(x=features['PITCH_CG'], bins=50)
plt.xlabel('Number of Complete Games Pitched by a Starter for a Team')
plt.title("Pre Transformed Feature Complete Games")

#### <a id='4.1'><font color=black><u>4.1 Transform Skewed Features </u></font></a>

In this section, I transformed the skewed features of the dataset identified above by taking the log of the feature + 1.  
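
To illustrate why the log of (feature + 1) helps, the sketch below measures skewness before and after the transform on synthetic right-skewed data. The helper `sample_skewness` and the generated values are assumptions for illustration; the helper uses the same biased moment formula as the default of `scipy.stats.skew`.

```python
import numpy as np

def sample_skewness(x):
    """Biased sample skewness: third central moment over variance^1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed counts (hypothetical stand-in for a stat such as OFF_IBB)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=3.0, sigma=0.8, size=1000)
transformed = np.log1p(raw)  # log(x + 1), defined even when x is 0

skew_before = sample_skewness(raw)
skew_after = sample_skewness(transformed)
print("skew before: %.2f, after: %.2f" % (skew_before, skew_after))
```

`np.log1p` is used here instead of `np.log(x + 1)` only because it is slightly more accurate for small values; the two are otherwise equivalent.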

In [None]:
# Log-transform the skewed features
skewed = [keys for keys in skewed_features.keys()]
features[skewed] = features[skewed].apply(lambda x: np.log(x+1))



In [None]:
transformed_skewed_features = {}

for col in features[skewed]:
    arry = np.array(features[col])
    skew_value = skew(arry)
    transformed_skewed_features[col] = [skew_value]
        
transformed_skewed_features

In [None]:
plt.hist(x=features['OFF_IBB'], bins=50)
plt.xlabel('Number of Intentional Walks for a Team')
plt.title("Transformed Feature Intentional Walks for a Team")

In [None]:
plt.hist(x=features['FIELD_PB'], bins=50)
plt.xlabel('Number of Passed Balls(Errors) for a Team')
plt.title("Transformed Feature Passed Balls for a Team")

In [None]:
plt.hist(x=features['PITCH_CG'], bins=50)
plt.xlabel('Number of Complete Games Pitched by a Starter for a Team')
plt.title("Transformed Feature Complete Games Pitched by a Starter for a Team")

#### <a id='5.0'><font color=black><u>5.0 Pre-Processing Data </u></font></a>

In this section, I applied the MinMaxScaler to the log-transformed features in order to rescale every feature to the range 0 to 1.
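
Min-max scaling maps each value to (x - min) / (max - min) using the minimum and maximum learned from the data it was fit on. The sketch below, which uses made-up numbers, shows that a scaler fit once can be reused with `transform` so that later rows are placed on the same scale; the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: two features for three team seasons
train = np.array([[600.0, 3.5], [700.0, 4.0], [900.0, 5.5]])
new_rows = np.array([[750.0, 4.5]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns min and max per column
new_scaled = scaler_demo.transform(new_rows)     # reuses the learned min and max

print(train_scaled)
print(new_scaled)
```

Calling `fit_transform` again on new data would instead learn a fresh minimum and maximum, so the same raw value could map to a different scaled value than it did during training.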

In [None]:
scaler =  MinMaxScaler()

numerical = features.columns.values.tolist()

features[numerical ] = scaler.fit_transform(features[numerical])
   


In [None]:
features.columns.values

#### <a id='5.1'><font color=black><u>5.1 Select The Best Features </u></font></a>


In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# Score every feature against Wins with a univariate F-test
kbest = SelectKBest(f_regression, k='all')
kbest.fit(features, Wins)

features_scores = dict(zip(features.columns, kbest.scores_))
# Keep only the features whose F-score is above 50
feats = [k for k, v in features_scores.items() if v > 50]
features = features[feats]
# Report the selected features only, highest score first
selected_scores = sorted(((k, features_scores[k]) for k in feats), key=lambda kv: kv[1], reverse=True)
print("These are the top %d features, and their scores, that will be fed into the model:" % len(feats))
pp.pprint(selected_scores)

#### <a id='5.2'><font color=black><u>5.2 Split the Dataset </u></font></a>

In [None]:
#Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, Wins, test_size=.2, random_state=9)
print ("Training and testing split was successful.")


In [None]:
features.columns.values

#### <a id='6.0'><font color=black><u>6.0 Model One: LASSO Regression </u></font></a>

Least Absolute Shrinkage and Selection Operator, or "LASSO", regression was the first choice for this project given the high number of features in the dataset and the low number of observations. LASSO is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. LASSO imposes a constraint on the sum of the absolute values of the model parameters, where the sum has a specified constant as an upper bound. This constraint causes regression coefficients for some variables to shrink towards zero, effectively choosing a simpler model that does not include those coefficients. https://en.wikipedia.org/wiki/Lasso_(statistics).

Regularization penalizes model complexity. In the Lasso regression model imported from sklearn, the regularization strength (often written as lambda, and named `alpha` in sklearn) penalizes the coefficients so that the model does not overfit. Ridge Regression uses L2 Regularization. Ridge regression improves prediction error by shrinking large regression coefficients in order to reduce overfitting, but it does not perform covariate selection and therefore does not help to make the model more interpretable. https://en.wikipedia.org/wiki/Lasso_(statistics)

Lasso uses L1 Regularization, which forces the sum of the absolute value of the regression coefficients to be less than a fixed value, and in turn forces certain coefficients to be set to zero, effectively choosing a simpler model that does not include those coefficients. https://en.wikipedia.org/wiki/Lasso_(statistics)

In sum, due to the high number of features, the model will reduce the coefficients of most of them to zero in order to create a simpler, more accurate prediction model.
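
The difference between the two penalties can be seen on a small synthetic problem. The sketch below is only an illustration under assumed settings (random data, `alpha=0.5`, ten features of which two matter); it is not the model fit in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two features influence the synthetic target
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X, y)  # L1 penalty
ridge_demo = Ridge(alpha=0.5).fit(X, y)  # L2 penalty

n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Coefficients set exactly to zero: Lasso %d, Ridge %d" % (n_zero_lasso, n_zero_ridge))
```

The L1 penalty drives the coefficients of uninformative features to exactly zero, while the L2 penalty only shrinks them, which is why Lasso doubles as a feature selector.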

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

lasso = Lasso(random_state=5)
alphas = 10**np.linspace(10, -2, 100)*0.5

tuned_parameters = [{'alpha': alphas}]
n_folds = 25

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds)
clf.fit(X_train, y_train)

scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']



In the graph below, you can see how the Lasso model sets the weights of several of the features to zero, thereby removing them from the prediction model.

In [None]:
# pp.pprint(clf.cv_results_['params'])
plt.figure(figsize=(10,10))
top_coef = pd.Series(clf.best_estimator_.coef_, X_train.columns).sort_values()
top_coef.plot(kind='bar', title='Weights of Feature Coefficients')


The mean squared error (MSE) measures the average squared difference between the predicted values and the actual values. Training MSE is normally somewhat lower than test MSE because the model was fit to the training data. A small gap between the two is a sign that the model is not over-fitting, while a test MSE far above the training MSE would indicate over-fitting.

In [None]:
train_error = mean_squared_error(y_train, clf.predict(X_train))
test_error = mean_squared_error(y_test, clf.predict(X_test))


print ('The training data MSE is %.3f' % train_error)
print ('The test data MSE is %.3f' % test_error)



The coefficient of determination, or R-squared, is the proportion of the variance in the dependent variable (Wins) that is predictable from the independent variables. In sum, it provides a statistic that evaluates how well the model fits the data. The statistic typically ranges from 0 to 1, where 1 means the model fits the data perfectly and 0 means that the model explains none of the variability of the response data; on held-out data it can fall below 0 when the model does worse than always predicting the mean. For this project, the R-square score when the model is tested against the test data is 0.76, which indicates that the model explains about 76% of the variance in wins.
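
Both metrics are simple enough to compute by hand. The sketch below defines them directly with NumPy on a few made-up win totals; the function names `mse` and `r_squared` and the numbers are illustrative assumptions, and the results should agree with `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical win totals and predictions
actual = [70, 81, 95, 88]
predicted = [74, 80, 91, 90]
demo_mse = mse(actual, predicted)
demo_r2 = r_squared(actual, predicted)
print("MSE: %.2f, R-squared: %.3f" % (demo_mse, demo_r2))
```

Predicting the mean of the actual values for every team gives an R-squared of exactly 0, which is the baseline the statistic is measured against.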

In [None]:
# R-square from training and test data

rsquared_train=clf.score(X_train,y_train)
rsquared_test=clf.score(X_test,y_test)
print ('The training data R-square is %.2f' % rsquared_train)
print ('The testing data R-square is %.2f' % rsquared_test)


In the graphic below, the scatter plot compares what the model predicted and what the actual win total was.

The yellow color dots show where the model predicted more wins than the actual total.<br />
The purple color dots show where the model predicted fewer wins than the actual total.<br />
The green color dots show where the model predicted the same or nearly the same as the actual total.

In [None]:
preds = clf.predict(X_test)
plt.figure(figsize=(10,10))
plt.scatter(preds,  y_test, c=preds-y_test)
plt.ylim(40,120)


plt.xlabel('Model Predictions')
plt.ylabel('Actual Wins')
plt.title('Model One Predictions vs. Actual')

In [None]:

for i in range(0, X_test.shape[0]):
    diff = y_test.iloc[i] - preds[i]
    print("The model predicts %d and the actual number is %d. The difference is %d" % (preds[i],y_test.iloc[i], diff))

<b>Model One Summary</b><br/>

This model shows which features carry the most weight in predicting wins; the features whose coefficients have an absolute value of at least 0.01 are collected in `most_important_features` below. Pitching stats carry more weight than hitting or fielding. According to this model, the team should prioritize pitchers who excel in these 21 features.


In [None]:
# Build the frame from the predictions themselves rather than a hard-coded row count
pred_actual_df = pd.DataFrame({'predictions': list(preds), 'actual': list(y_test)})


<b>Metrics: R-Square and Adjusted R-Square</b>

In [None]:
import statsmodels.formula.api as sm
result = sm.ols(formula="predictions ~actual", data=pred_actual_df ).fit()
# print result.summary()
print ("To compare, the R-Square for the model is: %.4f, and the Adjusted R-Square is: %.4f." % (result.rsquared, result.rsquared_adj))


In [None]:
high_weighted_features = dict(zip( X_train.columns, clf.best_estimator_.coef_))
most_important_features = []
for k,v in high_weighted_features.items():
    if np.absolute(v) >=.01:
        most_important_features.append(k)

most_important_features

#### <a id='7.0'><font color='black'><u>7.0 Model Two: Linear Regression with Deep Neural Network</u></font></a>

The next model that will be used to predict wins is a Deep Learning Neural Network. Deep Learning is an area of machine learning, loosely inspired by animal brains, that progressively improves its ability to perform a task through the use of examples. Input is passed through a network of layers made up of units called neurons. Each input is assigned a weight, and the weighted inputs yield an output. "The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship. The network moves through the layers calculating the probability of each output." (https://en.wikipedia.org/wiki/Deep_learning)
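
To make the layered computation concrete, here is a minimal NumPy sketch of a forward pass through a small fully connected network with ReLU hidden layers and a single linear output, the general shape used for regression. The layer sizes, random weights, and function names are assumptions for illustration; this is not the Keras model trained below, and no training takes place.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative values become 0."""
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Pass inputs through dense layers: ReLU on hidden layers, linear on the last."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = a @ W + b  # weighted sum of the inputs plus a bias
        a = z if i == len(layers) - 1 else relu(z)
    return a

rng = np.random.default_rng(7)
sizes = [40, 32, 32, 1]  # 40 inputs, two hidden layers, one output (assumed sizes)
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

batch = rng.random((5, 40))  # five hypothetical team seasons
out = forward(batch, layers)
print(out.shape)
```

Training adjusts the weights and biases so that the single output approaches the actual win total; that part is handled by `model.fit` in the Keras code.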

"\[E\]xposed to enough of the right data, deep learning is able to establish correlations between present events and future events. It can run regression between the past and the future."(https://skymind.ai/wiki/neural-network)




In [None]:
from keras.models import Sequential
from keras.layers import Dense
seed = 7
np.random.seed(seed)
model = Sequential()
# The input dimension follows the number of selected features instead of a hard-coded 40
model.add(Dense(32, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(units = 32, activation='relu'))
model.add(Dense(units = 32, activation='relu'))
model.add(Dense(units = 32, activation='relu'))
#Output layer
model.add(Dense(units = 1))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=1)

model.summary()



In [None]:
preds2 = model.predict(X_test)
preds2 = list(preds2[:,0])
diff =[]
for i in range(0, X_test.shape[0]):
    diff.append(preds2[i]- y_test.iloc[i])
    print("The model predicts %d and the actual number is %d. The difference is %d" % (preds2[i],y_test.iloc[i], diff[i]))



In [None]:
mse = [i ** 2 for i in diff]

print("The Mean Squared Error for this model is %.2f" % np.mean(mse))



In [None]:
plt.figure(figsize=(10,10))
plt.scatter(preds2, y_test, c=preds2-y_test)
plt.xlabel('Model Predictions')
plt.ylabel('Actual Wins')
plt.title('Model Two Predictions vs. Actual')

In [None]:
from sklearn.metrics import r2_score
score = r2_score(y_test, preds2)
print('The R square score of the Deep Neural Network models is %.2f' % score)

In [None]:
# Build the frame from the predictions themselves rather than a hard-coded row count
pred2_actual_df = pd.DataFrame({'predictions': list(preds2), 'actual': list(y_test)})


In [None]:
import statsmodels.formula.api as sm
result = sm.ols(formula="predictions ~actual", data=pred2_actual_df ).fit()
# print result.summary()
print ("To compare, the R-Square for the model is: %.4f, and the Adjusted R-Square is: %.4f." % (result.rsquared, result.rsquared_adj))


<b>Model Two Summary</b><br/>

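Model Two performs comparably to Model One. The Mean Squared Error on the test data for this model is 37, whereas the MSE for the previous model was 32. Because a very different model reaches a similar error on the same inputs, the result is consistent with the 21 most important features found in Model One, although the neural network does not by itself rank the features.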

<b>Model Performance on Unseen Data</b><br/>

In [None]:
import capstoneutils as cp

for i in range(0,3):
    csvPath = path + "\\FINAL DATASET version 1.0\\2018 TEST DATA\\Source\\"
    Test_2018_MLB_Stats_df = pd.DataFrame()
    print(csvPath)
    Test_2018_MLB_Stats_df  = cp.mergeCSVs(csvPath, "Test_2018_MLB_Stats.csv", path + r"\FINAL DATASET version 1.0\2018 TEST DATA\Output")


In [None]:
Test_2018_MLB_Stats_df = Test_2018_MLB_Stats_df.loc[:,~Test_2018_MLB_Stats_df.columns.duplicated()]
Team_Wins = (Test_2018_MLB_Stats_df['Team'],Test_2018_MLB_Stats_df['W'])
Features_2018_df = Test_2018_MLB_Stats_df.drop(['Team','W'], axis=1)
Features_2018_df = Features_2018_df.applymap(cp.remove_percentages)
numerical = Features_2018_df .columns.values.tolist()
Features_2018_df[numerical] = scaler.fit_transform(Features_2018_df[numerical])
# # len(Features_2018_df.columns.values)
# X_test.columns.values

In [None]:
##Model One
preds_2018 = clf.predict(Features_2018_df)
preds_2018 = preds_2018*.83
actual_wins = Team_Wins[1]
team = Team_Wins[0]
for i in range(0, len(actual_wins)):
    diff = actual_wins[i] - preds_2018[i]
    print("The model predicts %d and the actual number is %d. The difference is %d. The team is %s." % (preds_2018[i],actual_wins[i], diff, team[i]))

In [None]:
score = r2_score( actual_wins, preds_2018)
print('The R square score of Model One is %.2f' % score)

In [None]:
# Score the same adjusted predictions that were used for the R-square above
test_mse = mean_squared_error(actual_wins, preds_2018)
test_mse

<b><u>Comparison to the Pythagorean Theorem of Baseball</u></b>

In [None]:


# Comparison to the Pythagorean Theorem of Baseball
# clf is the LASSO model (Model One); model is the deep neural network (Model Two)
model1_predictions = clf.predict(X_train.loc[[510]])[0]
model2_predictions = model.predict(X_train.loc[[510]]).tolist()[0][0]
print("For the team that allowed the fewest runs, the LASSO model predicted %.0f wins." % model1_predictions)
print("For the team that allowed the fewest runs, the Deep Neural Networks model predicted %.0f wins." % model2_predictions)




#### <a id='8.0'><font color='black'><u>8.0 Hitters Hierarchical Clustering</u></font></a>

With the key features identified, the project will now turn to cluster analysis of the current players. For this project, we want hierarchical clusters of the players since the goal is to target players whose statistics are a better match for the features that we have discovered in the models. Hierarchical clustering starts out with each observation as its own cluster. It then repeatedly identifies the two clusters that are closest together and merges them until all clusters are merged into one cluster.


For the sake of brevity, the project will focus only on the segments for hitters. 
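
The bottom-up merging described above can be sketched on a handful of points. The example below uses SciPy's `linkage` with complete linkage on made-up two-dimensional points and then cuts the tree into two clusters with `fcluster`; the points and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five hypothetical players described by two statistics: two natural groups
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.3, 5.0], [5.0, 5.6]])

# Each row of `merges` records one merge: the two clusters joined, their distance, and the new size
merges = linkage(points, method='complete')
print(merges)

# Cut the tree so that at most two clusters remain
labels = fcluster(merges, t=2, criterion='maxclust')
print(labels)
```

With n observations there are always n - 1 merges, and the dendrogram plotted later in this section is a drawing of exactly this merge table.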


In [None]:
# os.chdir returns None, so its result is not assigned to path
os.chdir('..')
os.chdir('..')
print(os.getcwd())
Player_Hitting_Data = pd.read_csv('2018 Hitting Pitching Stats Players.csv')
Player_Fielding_Data = pd.read_csv('2018 Fielding Stats Players.csv')

In [None]:

Player_Hit_Field_Data = pd.merge(Player_Hitting_Data, Player_Fielding_Data[['FP','playerid']], how='left', on='playerid')
Player_Hit_Field_Data = Player_Hit_Field_Data.drop_duplicates(['playerid'])
Player_Hit_Field_Data[Player_Hit_Field_Data.duplicated() == True]

In [None]:
Player_Hit_Field_Data[Player_Hit_Field_Data['FP'].isnull()]

In [None]:
Player_Hit_Field_Data = Player_Hit_Field_Data.drop([31,133,213])

In [None]:
# Select the hitting and fielding columns that correspond to the most important features
Player_Hit_Field_Data=Player_Hit_Field_Data.loc[:,['Team','Name','playerid','FP', 'PA', 'ISO', 'BABIP', 'AVG', 'wOBA', 'wRC+', '2B']]

Player_Hit_Field_Data.info()

In [None]:
Player_Hit_Field_Data2 = Player_Hit_Field_Data.drop(['Name','Team'], axis=1)


In [None]:
pd.plotting.scatter_matrix(Player_Hit_Field_Data2.iloc[:,1:8], alpha = 0.3, figsize = (14,8), diagonal = 'kde');

In [None]:
Player_Hit_Field_Data2.iloc[:,1:8]= np.log(Player_Hit_Field_Data2.iloc[:,1:8])

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(Player_Hit_Field_Data2.iloc[:,1:8], alpha = 0.3, figsize = (14,8), diagonal = 'kde');

In [None]:
cnt = collections.Counter()
cnt = hp.detect_outliers(Player_Hit_Field_Data2.iloc[:,1:8])
outliers  = [list(cnt.keys())] 

In [None]:
cnt.most_common()



In [None]:
Player_Hit_Field_Data2 = Player_Hit_Field_Data2.drop([221,219])

In [None]:
import seaborn as sns
sns.boxplot(Player_Hit_Field_Data2['wOBA'])

In [None]:
numerical2 = ['FP','PA','ISO','BABIP','AVG','wOBA','wRC+','2B']
# Scale the frame that is actually clustered below (Player_Hit_Field_Data2)
Player_Hit_Field_Data2[numerical2] = scaler.fit_transform(Player_Hit_Field_Data2[numerical2])


In [None]:
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))  
plt.title("Hitters Dendrogram")
dend = shc.dendrogram(shc.linkage(Player_Hit_Field_Data2.iloc[:,1:8], method='complete'))  

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so the deprecated affinity argument is omitted
cluster = AgglomerativeClustering(n_clusters=2, linkage='complete')
preds = cluster.fit_predict(Player_Hit_Field_Data2.iloc[:,1:8])

In [None]:
from sklearn.metrics import silhouette_score
# Score on the same feature columns that were clustered (excludes playerid)
score = silhouette_score(Player_Hit_Field_Data2.iloc[:,1:8], preds)
print(score)
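
One way to check the choice of two clusters is to compare silhouette scores across several candidate cluster counts. The sketch below does this on synthetic data standing in for the hitter features; the array `X`, the range of `k`, and the random seed are illustrative assumptions, not values from this project.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the hitter features: two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),
               rng.normal(8.0, 1.0, size=(40, 3))])

scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage='complete').fit_predict(X)
    # Score on the same matrix that was clustered
    scores[k] = silhouette_score(X, labels)

# The cluster count with the highest silhouette score
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

Silhouette scores range from -1 to 1, and higher values indicate tighter, better-separated clusters.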

In [None]:
Player_Hit_Field_Data2['cluster'] = preds

In [None]:
Player_Hit_Field_Data = pd.merge(Player_Hitting_Data, Player_Hit_Field_Data2[['cluster','playerid']], how='left', on='playerid')
cluster1 = Player_Hit_Field_Data[Player_Hit_Field_Data['cluster']==1.0]
cluster0 = Player_Hit_Field_Data[Player_Hit_Field_Data['cluster']==0.0]


<font color='black'><b><u>Summary of Hitters Segment Analysis</u></b></font>

The segment analysis shows that the hitters separate largely into two clusters, cluster 0 and cluster 1.  Hitters in cluster 0 are the ones the team should target, because they best match the features that lead to wins. Earlier, the models showed that the hitting feature with the most weight is weighted on-base average ("wOBA").  In the tables below, hitters in cluster 0 have a mean wOBA of 35% while those in cluster 1 have a mean wOBA of 31%.  The other offensive statistic that weighs heavily is Weighted Runs Created Plus ("wRC+"): cluster 0 has a mean wRC+ of 119, while cluster 1 has a mean of 95.
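
A compact way to compare the two clusters side by side is a `groupby` on the cluster label. The sketch below uses a small toy frame with made-up values, not the project's data, and assumes the same column names (`cluster`, `wOBA`, `wRC+`) as `Player_Hit_Field_Data`.

```python
import pandas as pd

# Toy values for illustration only; they are not drawn from the FanGraphs data
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "wOBA": [0.370, 0.350, 0.320, 0.300],
    "wRC+": [130, 120, 98, 92],
})

# Mean of each key hitting statistic per cluster, one row per cluster
cluster_means = toy.groupby("cluster")[["wOBA", "wRC+"]].mean()
print(cluster_means)
```

Applied to the real frame, the same call would put the cluster means in one table instead of two separate `describe()` outputs.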






In [None]:
#cluster 0 statistics
cluster0.describe()

In [None]:
#cluster 1 statistics
cluster1.describe()

In [None]:
cluster0.head(10)

Player to target: <b>Nolan Arenado, 3B Colorado Rockies</b>
<br />
<img src="Arenado.jpg" align="left"  height="100" width="100" /><br />
<br />

Perhaps the top player to target is Colorado Rockies third baseman Nolan Arenado, who will be a free agent in 2020.  

#### <a id='9.0'><font color=black><u>9.0 Final Summary </u></font></a>  

Over the last 20 years, teams in MLB have had a 14% chance of winning 95 games or more in a season.  This report started off with over 600 features, which were reduced to the key features that the team needs to have. They are the following:

<ol>
<li>Fielding Percentage ('FP'): The percentage of times a defensive player properly handles a batted or thrown ball.</li><br />
<li>Plate Appearances ('PA'): Number of plate appearances.</li><br />
<li>Isolated Power ('ISO'): Average number of extra bases per at bat, calculated several ways such as SLG minus AVG.</li><br />
<li>Batting Average on Balls in Play ('BABIP'): The rate at which the batter gets a hit when he puts the ball in play.</li><br />
<li>Average('AVG'): Rate of hits per at bat, calculated as H/AB.</li><br />
<li>Weighted On-Base Average ('wOBA'): A statistic, based on linear weights, designed to measure a player's overall offensive contributions per plate appearance. It is formed by taking the observed run values of various offensive events, dividing by a player's plate appearances, and scaling the result to be on the same scale as on-base percentage.</li><br />
<li>Weighted Runs Created Plus ('wRC+'): Quantifies a player's total offensive value and measures it in runs. It combines a player's offensive stats into a single figure for how many runs the player is worth to his team relative to the league average, where 100 is league average.  Nolan Arenado's wRC+ is 142, which means that he created 42% more runs than the league average.</li><br />
<li>Doubles('2B'): Total number of doubles.</li><br />
</ol>
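
Several of the rate statistics above reduce to simple arithmetic. The sketch below writes them out using the formulas given in the list (AVG = H/AB, ISO = SLG minus AVG) together with the standard definitions of fielding percentage and BABIP, which the list does not spell out. The function names and the sample inputs are illustrative and are not taken from the project's data.

```python
def batting_average(hits, at_bats):
    """AVG: hits per at bat."""
    return hits / at_bats


def isolated_power(slg, avg):
    """ISO: extra bases per at bat, here as slugging minus batting average."""
    return slg - avg


def fielding_percentage(putouts, assists, errors):
    """FP: share of fielding chances handled without an error."""
    return (putouts + assists) / (putouts + assists + errors)


def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """BABIP: hit rate on balls put in play (home runs and strikeouts excluded)."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)


def runs_vs_league_pct(wrc_plus):
    """Percent of runs created above (+) or below (-) league average (wRC+ of 100)."""
    return wrc_plus - 100


# Made-up season line for illustration
print(round(batting_average(150, 500), 3))
print(round(isolated_power(0.520, 0.300), 3))
print(round(fielding_percentage(100, 200, 3), 3))
print(round(babip(150, 20, 500, 100, 5), 3))
print(runs_vs_league_pct(142))
```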


There are limitations to this analysis.  One, the low number of observations may have an impact on the analysis. We cannot determine whether there is enough data to build a reliably accurate model, and there is no way to add more observations.  Going back further than 1998 is also problematic, as the number of teams per year may affect the number of wins.  Two, some of the features gathered from FanGraphs are not complete for all of the years.  In particular, much of the fielding data was not consistently gathered.   

Given these limitations, the most important milestone that this project achieved was to build a systematic approach to understanding how wins in MLB are created. The approach (gathering the data, cleaning it, pre-processing it, and applying a model to it) may need to be amended or updated, but it can be followed for future projects. 

#### <a id='10.0'><font color=black><u>10.0 Project Reflections </u></font></a> 

<ol>
    <li>Baseball is a personal passion of mine, and I wanted to do my Capstone project in this domain since MLB is a leader in using statistics to evaluate talent.  Also, there's a wealth of available information.</li><br />
    <li>My goal was to make sure that my methodology for this project was sound. I made quite a few decisions that may need to be reversed, but as long as the project methodology is sound, those changes can be easily incorporated. </li><br />
    <li>The project took much longer than planned. Given the deadline, I wasn't able to complete the Pitching segment analysis. Most of the time was taken up by data collection and wrangling.</li><br />
    <li>I am open to any suggestions or changes as I consider this project a first draft.</li><br />
   </ol>

#### <a id='11.0'><font color=black><u>11.0 References</u></font></a> 
<ul>
    <li>https://www.fangraphs.com/</li><br />
    <li>https://machinelearningmastery.com</li><br />
    <li>https://stackoverflow.com/</li><br />
    <li>http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py</li><br />
    
  </ul>