# Baseball - Case Study

Problem Statement

This case study uses data from the 2014 Major League Baseball season to develop a model 
that predicts the number of wins for a given team in the 2015 season based on several 
indicators of success. There are 16 features that are used as inputs to the machine 
learning model, and the output is a value that represents the number of wins. 


-- Input features: Runs, At Bats, Hits, Doubles, Triples, Home Runs, Walks, Strikeouts, Stolen Bases, Runs Allowed, Earned Runs, Earned Run Average (ERA), Shutouts, Saves, Complete Games, and Errors

-- Output: Number of predicted wins (W)

# Importing Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns

# Loading the Dataset

In [None]:
df=pd.read_csv('baseball.csv')

In [None]:
df

# Exploratory data analysis -  EDA

In [None]:
df.shape

The dataset contains 30 rows and 17 columns (16 input features plus the target W)

In [None]:
df.sample()

The sample function shows one random row of the baseball dataset

In [None]:
df.head()

The head function shows the first 5 rows of the baseball dataset

In [None]:
df.tail()

The tail function shows the last 5 rows of the baseball dataset

Statistical Summary

In [None]:
df.describe(include='all')

There is less difference between the 1st and 2nd quartiles, but a large difference between the 1st and 
3rd quartiles

In [None]:
pd.set_option('display.max_columns',500)

In [None]:
df.describe()

To display null values through heatmap

In [None]:
sns.heatmap(df.isnull())
plt.title('Null Values')
plt.show() 

The heatmap shows that no null values are present. Let's confirm this with a count

In [None]:
df.isnull().sum().sum()

In [None]:
df.info()

So we can conclude that our dataset is free from null values

Plot the histogram of the target (dependent) variable W to see its distribution

In [None]:
plt.figure(figsize=(10,5))
df['W'].plot.hist()

In [None]:
df['W'].value_counts()

There is no sign of imbalance 

Checking Correlation

In [None]:
df.corr()

In [None]:
corr_mat=df.corr()
#size of the canvas
plt.figure(figsize=[80,80])
# plot the correlation matrix
sns.heatmap(corr_mat,annot=True)
plt.title("Correlation Matrix")
plt.savefig('correlation_matrix.jpg')
plt.show()

The correlations are hard to read in the above heatmap because of the large number of columns.
Let's print the correlation of each independent variable with the target variable in tabular form.

In [None]:
corr_matrix=df.corr()
corr_matrix

In [None]:
type(corr_matrix)

In [None]:
corr_matrix=df.corr()
corr_matrix['W'].sort_values(ascending=False)

From the correlation values, the strongest positive correlation with W is for column SV (saves) 
and the weakest positive correlation is for column CG. 
The weakest negative correlation is for column AB and the strongest negative correlation is for column ERA
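
The ranking step can be sketched on a small synthetic table. The data below is made up for illustration and only borrows the column names W, SV, ERA and AB. Sorting by absolute value is an assumption about what is most useful here, since it lets strong negative correlations rank as high as strong positive ones.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the baseball table (values are made up for illustration)
rng = np.random.default_rng(0)
n = 30
sv = rng.normal(40, 7, n)
era = rng.normal(4.0, 0.5, n)
ab = rng.normal(5500, 70, n)
w = 80 + 0.8 * (sv - 40) - 12 * (era - 4.0) + rng.normal(0, 3, n)
toy = pd.DataFrame({"W": w, "SV": sv, "ERA": era, "AB": ab})

# Correlation of every feature with the target, excluding W itself
target_corr = toy.corr()["W"].drop("W")

# Order by absolute value so strong negative correlations rank high too
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Dropping W from its own column avoids the trivial correlation of 1.0 at the top of the list.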

# Let's check the data distribution among the columns

In [None]:
df.plot(kind='density',subplots=True,layout=(6,11),sharex=False,legend=False,fontsize=1,figsize=(18,12))

We can see skewness in a few columns. We will handle the skewness in later steps

Splitting the independent and dependent variables in x and y before removing the skewness

In [None]:
x=df.drop('W',axis=1)
y=df['W']

In [None]:
x

In [None]:
y

In [None]:
# Checking skewness 
x.skew().sort_values(ascending=False)

In [None]:
sns.distplot(df['R'])

In [None]:
sns.distplot(df['H'])

In [None]:
sns.distplot(df['HR'])

In [None]:
sns.distplot(df['CG'])

In [None]:
sns.distplot(df['E'])

The highest skewness is in column R (about 1.20), followed by columns E (0.89), CG (0.73), 
H (0.67) and so on

We can reduce skewness using the power transform with the 'yeo-johnson' method 

In [None]:
from sklearn.preprocessing import power_transform
x_new=power_transform(x,method='yeo-johnson')
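
As a rough illustration of what the transform does, the sketch below applies Yeo-Johnson to synthetic data and compares skewness before and after. The column names `right_skewed` and `symmetric` are hypothetical, and the sketch uses the `PowerTransformer` class rather than the `power_transform` function used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic data: one clearly right-skewed column and one roughly symmetric column
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=3.0, sigma=0.6, size=200),
    "symmetric": rng.normal(loc=10.0, scale=2.0, size=200),
})

# Yeo-Johnson picks one power parameter per column; output is standardized by default
pt = PowerTransformer(method="yeo-johnson")
transformed = pd.DataFrame(pt.fit_transform(toy), columns=toy.columns)

skew_before = toy.skew()
skew_after = transformed.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

One reason to prefer the class form is that a fitted `PowerTransformer` can be applied to new rows with `transform`, so the same parameters learned on training data can be reused on test data.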

In [None]:
# Checking skewness
pd.DataFrame(x_new).skew().sort_values(ascending=False)

In [None]:
type(x_new)

In [None]:
x.columns

In [None]:
x=pd.DataFrame(x_new,columns=x.columns)

In [None]:
x

In [None]:
x.skew().sort_values(ascending=False) # Validating that skewness has been removed or not

In [None]:
sk=x.skew()

In [None]:
sk

In [None]:
# True if any column still has |skewness| above 0.5
(np.abs(sk)>0.5).any()

In [None]:
sk[np.abs(sk)>0.5]

In [None]:
x.skew()[np.abs(x.skew())<0.25]

In [None]:
# True only if every column has |skewness| below 0.25
(np.abs(x.skew())<0.25).all()

Skewness has been reduced, so we can proceed with further steps

# Checking Outliers

In [None]:
# Plotting boxplot for the first 9 columns
x.iloc[:,0:9].boxplot(figsize=[20,8])
plt.subplots_adjust(bottom=0.01)
plt.show()

In [None]:
# Plotting boxplot for the remaining columns
x.iloc[:,9:].boxplot(figsize=[20,8])
plt.subplots_adjust(bottom=0.01)
plt.show()

We can see one or two outlying values in two of the columns, but they lie close to the whiskers.
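
For reference, the boxplot whiskers use the 1.5 x IQR rule by default. The sketch below reproduces that rule by hand on a made-up series, so the points drawn beyond the whiskers can also be listed numerically. The variable names are chosen for illustration only.

```python
import pandas as pd

# Made-up values with one point far from the rest
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], name="toy")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are the ones a default boxplot draws individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence)   # -3.5 14.5
print(flagged.tolist())           # [30]
```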

# Removing the outliers

1. Lets find the Boundary Values

In [None]:
print('Highest allowed',df['SHO'].mean()+3*df['SHO'].std())
print('Lowest allowed',df['SHO'].mean()-3*df['SHO'].std())

2. Finding Outliers

In [None]:
df[(df['SHO']>23.660) | (df['SHO']<-1.060)]

3. Trimming the outliers

In [None]:
new_df=df[(df['SHO']<23.660)&(df['SHO']>-1.060)]
new_df

4. Capping on outliers

In [None]:
upper_limit=df['SHO'].mean()+3*df['SHO'].std()
lower_limit=df['SHO'].mean()-3*df['SHO'].std()

5. Now apply the capping

In [None]:
df['SHO'] = np.where(
    df['SHO']>upper_limit,
    upper_limit,
    np.where(
        df['SHO']<lower_limit,
        lower_limit,
        df['SHO']
    )
)

6. Now see the statistics using describe function

In [None]:
df['SHO'].describe()
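
The six steps above can be wrapped in one helper so that the limits are computed rather than typed in by hand. This is a minimal sketch: the function name `cap_outliers` and the toy SHO values are hypothetical, and `k=2.0` is used only so that the small example actually caps something.

```python
import pandas as pd

def cap_outliers(frame, column, k=3.0):
    """Return a copy of `frame` with `column` clipped to mean +/- k * std."""
    mean = frame[column].mean()
    std = frame[column].std()
    lower, upper = mean - k * std, mean + k * std
    capped = frame.copy()
    # clip replaces values below `lower` or above `upper` with the limit itself
    capped[column] = capped[column].clip(lower=lower, upper=upper)
    return capped, lower, upper

# Made-up values with one extreme entry
toy = pd.DataFrame({"SHO": [4, 8, 9, 10, 10, 11, 12, 12, 13, 15, 60]})
capped, lower, upper = cap_outliers(toy, "SHO", k=2.0)

print("Lowest allowed", lower)
print("Highest allowed", upper)
print(capped["SHO"].describe())
```

Because the helper returns a copy, the original frame stays available for comparison, and the same call can be repeated for other columns such as E, R, AB and H.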

1. Finding the Boundary values

In [None]:
print('Highest allowed',df['E'].mean()+3*df['E'].std())
print('Lowest allowed',df['E'].mean()-3*df['E'].std())

2. Finding the outliers

In [None]:
df[(df['E']>136.209) | (df['E']<52.456)]

3. Trimming the outliers

In [None]:
new_df=df[(df['E']<136.209)&(df['E']>52.456)]
new_df

4. Capping the outliers

In [None]:
upper_limit=df['E'].mean()+3*df['E'].std()
lower_limit=df['E'].mean()-3*df['E'].std()

5. Now apply the capping 

In [None]:
df['E'] = np.where(
    df['E']>upper_limit,
    upper_limit,
    np.where(
        df['E']<lower_limit,
        lower_limit,
        df['E']
    )
)

6. Now see the statistics using describe function

In [None]:
df['E'].describe()

# Let's Quantify

In [None]:
from scipy.stats import zscore
(np.abs(zscore(x))<3).all()

Therefore columns R, AB and H have outliers
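
To see which columns and how many rows are affected, the z-score check can be kept as a boolean table. This is a sketch on made-up data with hypothetical column names A and B, where one extreme value is planted in B.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Two evenly spread columns; one extreme value is planted in column B
toy = pd.DataFrame({
    "A": np.linspace(-1.0, 1.0, 50),
    "B": np.linspace(-1.0, 1.0, 50),
})
toy.loc[0, "B"] = 15.0

# z-scores are computed per column (axis=0)
z = np.abs(zscore(toy.to_numpy(), axis=0))
outlier_mask = pd.DataFrame(z >= 3, columns=toy.columns, index=toy.index)

cols_with_outliers = outlier_mask.any()
print(cols_with_outliers[cols_with_outliers].index.tolist())   # ['B']
print(outlier_mask.sum())                                      # flagged rows per column
```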

1. Finding the Boundary values

In [None]:
print('Highest allowed',df['R'].mean()+3*df['R'].std())
print('Lowest allowed',df['R'].mean()-3*df['R'].std())

2. Finding the Outliers

In [None]:
df[(df['R']>864.518) | (df['R']<511.948)]

3. Trimming of outliers

In [None]:
new_df=df[(df['R']<864.518)&(df['R']>511.948)]
new_df

4. Capping of outliers

In [None]:
upper_limit=df['R'].mean()+3*df['R'].std()
lower_limit=df['R'].mean()-3*df['R'].std()

5. Now apply the capping

In [None]:
df['R'] = np.where(
    df['R']>upper_limit,
    upper_limit,
    np.where(
        df['R']<lower_limit,
        lower_limit,
        df['R']
    )
)

6. Now see the statistics using describe function

In [None]:
df['R'].describe()

1. Finding the boundary values

In [None]:
print('Highest allowed',df['AB'].mean()+3*df['AB'].std())
print('Lowest allowed',df['AB'].mean()-3*df['AB'].std())

2. Finding the outliers

In [None]:
df[(df['AB']>5727.668) | (df['AB']<5304.864)]

3. Trimming of outliers

In [None]:
new_df=df[(df['AB']<5727.668)&(df['AB']>5304.864)]
new_df

4. Capping on outliers 

In [None]:
upper_limit=df['AB'].mean()+3*df['AB'].std()
lower_limit=df['AB'].mean()-3*df['AB'].std()

5. Now apply the capping

In [None]:
df['AB'] = np.where(
    df['AB']>upper_limit,
    upper_limit,
    np.where(
        df['AB']<lower_limit,
        lower_limit,
        df['AB']
    )
)

6. Now see the Statistics using describe function

In [None]:
df['AB'].describe()

1. Finding the Boundary values

In [None]:
print('Highest allowed',df['H'].mean()+3*df['H'].std())
print('Lowest allowed',df['H'].mean()-3*df['H'].std())

2. finding the outliers

In [None]:
df[(df['H']>1574.956) | (df['H']<1232.110)]

3. Trimming of outliers

In [None]:
new_df=df[(df['H']<1574.956)&(df['H']>1232.110)]
new_df

4. Capping on outliers

In [None]:
upper_limit=df['H'].mean()+3*df['H'].std()
lower_limit=df['H'].mean()-3*df['H'].std()

5. Now apply the capping

In [None]:
df['H'] = np.where(
    df['H']>upper_limit,
    upper_limit,
    np.where(
        df['H']<lower_limit,
        lower_limit,
        df['H']
    )
)

6. Now see the statistics using describe function

In [None]:
df['H'].describe()

Let's Quantify

In [None]:
from scipy.stats import zscore
# x still holds the power-transformed features; the capping above was applied to df
(np.abs(zscore(x))<3).all()

In [None]:
np.abs(x)

In [None]:
(np.abs(zscore(x))<3).all()

Scatter plots

In [None]:
plt.scatter(df['R'],df['W'])
plt.xlabel('R')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['AB'],df['W'])
plt.xlabel('AB')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['H'],df['W'])
plt.xlabel('H')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['2B'],df['W'])
plt.xlabel('2B')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['3B'],df['W'])
plt.xlabel('3B')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['HR'],df['W'])
plt.xlabel('HR')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['BB'],df['W'])
plt.xlabel('BB')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['SO'],df['W'])
plt.xlabel('SO')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['SB'],df['W'])
plt.xlabel('SB')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['RA'],df['W'])
plt.xlabel('RA')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['ER'],df['W'])
plt.xlabel('ER')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['ERA'],df['W'])
plt.xlabel('ERA')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['CG'],df['W'])
plt.xlabel('CG')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['SHO'],df['W'])
plt.xlabel('SHO')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['SV'],df['W'])
plt.xlabel('SV')
plt.ylabel('W')
plt.show()

In [None]:
plt.scatter(df['E'],df['W'])
plt.xlabel('E')
plt.ylabel('W')
plt.show()
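
The per-column scatter plots above could also be drawn in one grid with a loop. The sketch below uses random stand-in data with a few of the same column names. The grid width `n_cols` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; only the column names mirror the baseball table
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(30, 4)), columns=["W", "R", "AB", "H"])

features = [c for c in toy.columns if c != "W"]
n_cols = 2
n_rows = int(np.ceil(len(features) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(8, 3 * n_rows), squeeze=False)

# One panel per feature, each plotted against the target W
for ax, col in zip(axes.ravel(), features):
    ax.scatter(toy[col], toy["W"])
    ax.set_xlabel(col)
    ax.set_ylabel("W")

# Hide unused panels when the grid has more cells than features
for ax in axes.ravel()[len(features):]:
    ax.set_visible(False)

fig.tight_layout()
```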

Regression Plots

In [None]:
sns.regplot(x='R',y='W',data=df)
plt.ylim(60,105)

In [None]:
sns.regplot(x='AB',y='W',data=df)
plt.ylim(60,105)

In [None]:
sns.regplot(x='H',y='W',data=df)
plt.ylim(55,)
plt.xlim(1300,)

In [None]:
sns.regplot(x='2B',y='W',data=df)
plt.ylim(55,105)

In [None]:
sns.regplot(x='3B',y='W',data=df)
plt.ylim(60,)

In [None]:
sns.regplot(x='HR',y='W',data=df)
plt.ylim(1,)

In [None]:
sns.regplot(x='BB',y='W',data=df)
plt.ylim(60,105)

In [None]:
sns.regplot(x='SO',y='W',data=df)
plt.ylim(60,105)

In [None]:
sns.regplot(x='SB',y='W',data=df)
plt.ylim(55,105)

In [None]:
sns.regplot(x='RA',y='W',data=df)
plt.ylim(55,105)

In [None]:
sns.regplot(x='ER',y='W',data=df)
plt.ylim(60,100)

In [None]:
sns.regplot(x='ERA',y='W',data=df)
plt.ylim(55,105)

In [None]:
sns.regplot(x='CG',y='W',data=df)
plt.ylim(60,100)

In [None]:
sns.regplot(x='SHO',y='W',data=df)
plt.ylim(55,100)

In [None]:
sns.regplot(x='SV',y='W',data=df)
plt.ylim(60,100)

In [None]:
sns.regplot(x='E',y='W',data=df)
plt.ylim(60,110)
plt.xlim(70,125)

Columns R, 2B, HR, BB, SO, SHO and SV show a positive correlation with W, and 
columns RA, ERA and ER show a negative correlation.

In [None]:
df.hist(grid=False,figsize=(10,6),bins=30)

In [None]:
df

In [None]:
df.shape

# Splitting the Independent and dependent variable

In [None]:
# Rebuilds x from df, so the Yeo-Johnson transform applied earlier is not carried over
x=df.drop('W',axis=1)

In [None]:
y=df['W']

In [None]:
x.shape

In [None]:
y.shape

Training process begins

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error

Creating Train_Test_split

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=1)

In [None]:
x_train

In [None]:
x_test

In [None]:
y_train

In [None]:
y_test

In [None]:
x_train.shape

In [None]:
y_train.shape

In [None]:
x_test.shape

In [None]:
y_test.shape

# Creating Model 
1. Linear Regressor model

In [None]:
lr=LinearRegression()

In [None]:
lr.fit(x_train,y_train)

In [None]:
predlr=lr.predict(x_test)

In [None]:
coeff_df=pd.DataFrame(lr.coef_,x.columns,columns=['Coefficient'])
coeff_df

In [None]:
lr.intercept_

We now have the coefficient for each individual column and also the intercept

In [None]:
y_pred=lr.predict(x_test)

In [None]:
com_df=pd.DataFrame({'Actual W ':y_test,'Predicted W ':y_pred})
com_df

In [None]:
print('Error:')
print('Mean Squared Error : ',mean_squared_error(y_test,y_pred))
print('Mean Absolute Error : ',mean_absolute_error(y_test,y_pred))
print('Root Mean Squared Error : ',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:
df.describe(include='all')

The RMSE is 8.7459, which is small relative to the mean of column W (80.97). The lower the RMSE, 
the better the model fits the data.

The mean absolute error is 4.637, which means the average absolute difference between the 
predicted and actual wins is 4.637. This is not very high
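
The arithmetic behind these metrics can be checked by hand on a tiny example. The numbers below are made up and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([80.0, 95.0, 70.0, 88.0])   # made-up actual wins
y_hat = np.array([82.0, 90.0, 73.0, 88.0])    # made-up predicted wins

errors = y_true - y_hat          # [-2, 5, -3, 0]
mae = np.mean(np.abs(errors))    # (2 + 5 + 3 + 0) / 4 = 2.5
mse = np.mean(errors ** 2)       # (4 + 25 + 9 + 0) / 4 = 9.5
rmse = np.sqrt(mse)              # about 3.08

# The hand-computed values agree with scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
print(mae, mse, rmse)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, because squaring weights large errors more heavily.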

In [None]:
print('Coefficient of Determination :',r2_score(y_test,y_pred))

Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor()

In [None]:
rf.fit(x_train,y_train)

In [None]:
y_pred=rf.predict(x_test)
y_pred

In [None]:
com_df2=pd.DataFrame({'Actual W ':y_test,'Predicted W ':y_pred})
com_df2

In [None]:
print('Mean Squared error :',mean_squared_error(y_test,y_pred))
print('Root mean squared error :',np.sqrt(mean_squared_error(y_test,y_pred)))

Decision tree regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dt=DecisionTreeRegressor(max_depth=4,min_samples_leaf=0.2,random_state=1)

In [None]:
dt.fit(x_train,y_train)

In [None]:
y_pred=dt.predict(x_test)

In [None]:
y_pred

In [None]:
com_df3=pd.DataFrame({'Actual W ':y_test,'Predicted W ':y_pred})
com_df3

In [None]:
print('Mean Squared Error :',mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error :',np.sqrt(mean_squared_error(y_test,y_pred)))

Support Vector Regressor

In [None]:
from sklearn.svm import SVR

In [None]:
sv_regressor=SVR()

In [None]:
sv_regressor.fit(x_train,y_train)

In [None]:
y_pred=sv_regressor.predict(x_test)

In [None]:
y_pred

In [None]:
com_df4=pd.DataFrame({'Actual W ':y_test,'Predicted W ':y_pred})
com_df4

In [None]:
print('Mean Squared Error :',mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error :',np.sqrt(mean_squared_error(y_test,y_pred)))

Out of these models, the Linear Regressor shows good accuracy, and its RMSE is lower than that of 
the others. Therefore we shall go with the Linear Regressor 

# Tuning the Model

In [None]:
from sklearn.model_selection import cross_val_score
scr=cross_val_score(lr,x,y,cv=5)
print("Cross Validation Score of Linear Regressor model :",scr.mean())

In [None]:
scr2=cross_val_score(rf,x,y,cv=5)
print("Cross Validation Score of Random Forest model :",scr2.mean())

In [None]:
scr3=cross_val_score(sv_regressor,x,y,cv=5)
print("Cross Validation Score of Support Vector model :",scr3.mean())

The cross validation score (R2 by default for regressors) of the Linear Regressor model is better than that of the other models
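
A possible variation on the comparison above is sketched below, under a few assumptions that are not part of this notebook: synthetic data from `make_regression`, shuffled folds, RMSE as the score, and a `StandardScaler` inside each pipeline so that scaling is fitted on the training folds only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the 30-row baseball table
X_toy, y_toy = make_regression(n_samples=30, n_features=5, noise=5.0, random_state=0)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "svr": make_pipeline(StandardScaler(), SVR()),
}

# Shuffled folds with a fixed seed so that every model sees the same splits
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # scikit-learn negates error metrics so that "greater is better" still holds
    scores = cross_val_score(model, X_toy, y_toy, cv=cv,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()
    print(name, round(results[name], 2))
```

Reporting RMSE here keeps the cross-validated numbers in the same units (wins) as the hold-out errors printed earlier.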

# Gradient Descent Algorithm

In [None]:
from sklearn.linear_model import SGDRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

In [None]:
pipe=[]
pipe.append(('PCA',PCA(n_components=8)))
pipe.append(('SGD',SGDRegressor(alpha=0.1,learning_rate='optimal',max_iter=40)))
model=Pipeline(pipe)
# Use the feature matrix x, not df: df still contains the target column W
cv_results=cross_val_score(model,x,y,cv=5)
msg="%s:%f(%f)"%('SGDRegressor',cv_results.mean(),cv_results.std())
print(msg)
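
Gradient descent is sensitive to feature scale, so one common variation is to standardize the features before PCA and SGD. The sketch below shows that pipeline on synthetic data. The step names, `max_iter=1000` and `random_state=0` are assumptions made for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with the same shape as the feature matrix (30 rows, 16 features)
X_toy, y_toy = make_regression(n_samples=30, n_features=16, noise=5.0, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),      # put all features on a comparable scale
    ("pca", PCA(n_components=8)),     # keep 8 principal components
    ("sgd", SGDRegressor(alpha=0.1, max_iter=1000, tol=1e-3, random_state=0)),
])

# Default scoring for a regressor is R2; the target is never part of X_toy
cv_results = cross_val_score(model, X_toy, y_toy, cv=5)
print("SGDRegressor: %f (%f)" % (cv_results.mean(), cv_results.std()))
```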