## Evaluation Projects

## Baseball Case Study

### Problem Statement:
This dataset utilizes data from 2014 Major League Baseball seasons in order to develop an algorithm that predicts the number of wins for a given team in the 2015 season based on several different indicators of success. There are 16 different features that will be used as the inputs to the machine learning and the output will be a value that represents the number of wins. <br>

-- Input features: Runs, At Bats, Hits, Doubles, Triples, Homeruns, Walks, Strikeouts, Stolen Bases, Runs Allowed, Earned Runs, Earned Run Average (ERA), Shutouts, Saves, Complete Games and Errors<br>

-- Output: Number of predicted wins (W)<br>

To understand the columns meaning, follow the link given below to understand the baseball statistics:<br> https://en.wikipedia.org/wiki/Baseball_statistics<br>

## Import Libraries

In [None]:
# Analysis libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore,boxcox

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import power_transform, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('baseball.csv')
df.head()

<b>Linear Regression</b> Model needs to be used for this dataset

In [None]:
df.columns

## Description of Dataset

Runs, At Bats, Hits, Doubles, Triples, Homeruns, Walks, Strikeouts, Stolen Bases, Runs Allowed, Earned Runs, Earned Run Average (ERA), Shutouts, Saves, and Errors <br>

   <b>Input features:</b><br>
    R  - Runs<br>
    AB - At Bats<br>
    H  - Hits<br>
    2B - Doubles<br>
    3B - Triples<br>
    HR - Homeruns<br>
    BB - Walks<br>
    SO - Strikeouts<br>
    SB - Stolen Bases<br>
    RA - Run Allowed<br>
    ER - Earned Runs<br>
    ERA - Earned Run Average (ERA)<br>
    CG - Complete games ( referred online)<br>
    SHO - Shutouts<br>
    SV - Saves<br>
    E - Errors<br>
  ----------------------------------------------------<br>  
 <b>Output features:</b><br>
    W -Number of predicted wins (W)<br>

In [None]:
# size of dataset
df.shape

 - There are 30 rows and 17 columns

In [None]:
df.dtypes

 - All values are in numeric only so need to use label encoder.
 - There are 16 integers and 1 float present in dataset.
 - There are no categorical values in dataset.

In [None]:
# Check null values
df.isnull().sum()

 - No null values in dataset.

In [None]:
# recheck null
sns.heatmap(df.isnull())

 - Recheck null, showing no null values in dataset

## EDA

In [None]:
## Statistical summary
df.describe()

 - Some of the columns having outliers and skewness, have to observe that.

## Observation :

In [None]:
df.iloc[:,0:10].plot(kind='box',subplots=True,layout=(2,5))

In [None]:
df.iloc[:,10:].plot(kind='box',subplots=True,layout=(2,5))

 - R,ERA,SHO,SV have outliers.
 - Other colums don't have

## Univarient Analysis

In [None]:
df['E'].plot(kind='box')

In [None]:
df['R'].plot(kind='box')

In [None]:
# checking skewness
df.skew()

 - As per standards(<0.5), there are skewness presnt in R,H,CG,SHO,SV,E
 - We need to handle the skewness.

In [None]:
# checking distribution plot for more detail

for i in df.columns:
    plt.figure()
    sns.distplot(df[i])

## Checking correlation of data

 - As per the above plot R,H,HR,SO,SHO,CG,E having some skewness

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),annot=True)

 - RA,ER,ERA are highly -vely correlated with target variable

## Bavarient Analysis

In [None]:
plt.scatter(df.R,df.W)
plt.xlabel('Runs')
plt.ylabel('Wins')
plt.show()

In [None]:
# runs relation with dataset
for i in df.columns:
    plt.figure()
    sns.barplot(x=df[i],y=df.W)

 - Relationship between features and label have been observed.

In [None]:
## Removing outliers from data
z = np.abs(zscore(df))
z

In [None]:
df_new = df[(z<3).all(axis=1)]

In [None]:
df.shape

In [None]:
# checking shape of new dataset
df_new.shape

## Removing highly negative correlated columns

In [None]:
df_new.drop(['RA','ER'],axis=1,inplace=True)

In [None]:
df_new.drop(['ERA',],axis=1,inplace=True)

## Splitting X & Y data

In [55]:
x = df_new.iloc[:,1:-1]
y = df_new['W']

In [56]:
x

Unnamed: 0,R,AB,H,2B,3B,HR,BB,SO,SB,CG,SHO,SV
0,724,5575,1497,300,42,139,383,973,104,2,8,56
1,696,5467,1349,277,44,156,439,1264,70,2,12,45
2,669,5439,1395,303,29,141,533,1157,86,11,10,38
3,622,5533,1381,260,27,136,404,1231,68,7,9,37
4,689,5605,1515,289,49,151,455,1259,83,7,12,35
6,764,5567,1397,272,19,212,554,1227,63,3,4,48
7,713,5485,1370,246,20,217,418,1331,44,0,10,43
8,644,5485,1383,278,32,167,436,1310,87,1,12,60
9,748,5640,1495,294,33,161,478,1148,71,3,10,40
10,751,5511,1419,279,32,172,503,1233,101,5,9,45


In [57]:
y

0      95
1      83
2      81
3      76
4      74
6      87
7      81
8      80
9      78
10     88
11     86
12     85
13     76
14     68
15    100
16     98
17     97
18     68
19     64
20     90
21     83
22     71
23     67
24     63
25     92
26     84
27     79
28     74
29     68
Name: W, dtype: int64

## Removing Skewness

In [58]:
df_x = power_transform(x)
df_x = pd.DataFrame(df_x)

In [59]:
x = df_x
x

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.962544,0.0,0.0,1.685188,1.00615,-0.741927,-1.605198,-2.550612,0.936132,-0.307098,-0.787002,1.532753
1,0.298863,0.0,0.0,0.138198,1.185227,-0.109958,-0.462096,0.093683,-0.516377,-0.307098,0.236737,0.31202
2,-0.312105,0.0,0.0,1.907385,-0.228819,-0.664354,1.232098,-0.935611,0.225038,2.011315,-0.252844,-0.664137
3,-1.308298,0.0,0.0,-0.837665,-0.432228,-0.860039,-1.162721,-0.230683,-0.618422,1.264463,-0.513555,-0.820689
4,0.137737,0.0,0.0,0.911435,1.622636,-0.289647,-0.155686,0.044143,0.095038,1.264463,0.236737,-1.149165
5,1.964209,0.0,0.0,-0.16301,-1.295827,1.631637,1.579494,-0.269583,-0.884526,0.121871,-2.064039,0.677176
6,0.698146,0.0,0.0,-1.542635,-1.182758,1.767734,-0.877217,0.77098,-2.082843,-1.732896,-0.252844,0.052325
7,-0.852595,0.0,0.0,0.199897,0.068703,0.269125,-0.520476,0.556008,0.267558,-0.870682,0.236737,1.908137
8,1.555951,0.0,0.0,1.255256,0.166017,0.065014,0.270944,-1.01921,-0.466233,0.121871,-0.252844,-0.365006
9,1.631727,0.0,0.0,0.262086,0.068703,0.43462,0.717576,-0.211199,0.824915,0.770649,-0.513555,0.31202


## Standard Scaling

In [60]:
scaler = StandardScaler()
x = scaler.fit_transform(x)

In [61]:
x

array([[ 9.62543504e-01,  0.00000000e+00,  0.00000000e+00,
         1.68518793e+00,  1.00615029e+00, -7.41927000e-01,
        -1.60519802e+00, -2.55061247e+00,  9.36131648e-01,
        -3.07098204e-01, -7.87002186e-01,  1.53275292e+00],
       [ 2.98863300e-01,  0.00000000e+00,  0.00000000e+00,
         1.38197902e-01,  1.18522654e+00, -1.09958425e-01,
        -4.62095966e-01,  9.36832915e-02, -5.16377335e-01,
        -3.07098204e-01,  2.36736538e-01,  3.12020186e-01],
       [-3.12105130e-01,  0.00000000e+00,  0.00000000e+00,
         1.90738550e+00, -2.28819392e-01, -6.64354121e-01,
         1.23209786e+00, -9.35611465e-01,  2.25038365e-01,
         2.01131531e+00, -2.52844176e-01, -6.64136739e-01],
       [-1.30829774e+00,  0.00000000e+00,  0.00000000e+00,
        -8.37664770e-01, -4.32227907e-01, -8.60039342e-01,
        -1.16272085e+00, -2.30682707e-01, -6.18421529e-01,
         1.26446344e+00, -5.13554932e-01, -8.20688859e-01],
       [ 1.37737301e-01,  0.00000000e+00,  0.0000000

## Finding Best random state

In [66]:
maxScore = 0
macRS = 0
for i in range(1,200):
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.23,random_state=i)
    lr=LinearRegression()
    lr.fit(x_train,y_train)
    pred_train = lr.predict(x_train)
    pred_test = lr.predict(x_test)
    print(f'Best Random State : {i} and Train R2 Score : {r2_score(y_train,pred_train)}')
    print(f'Best Random State : {i} and Test R2 Score : {r2_score(y_test,pred_test)}')
    print('')

Best Random State : 1 and Train R2 Score : 0.8550783206268826
Best Random State : 1 and Test R2 Score : 0.6944530201712394

Best Random State : 2 and Train R2 Score : 0.8173961865951549
Best Random State : 2 and Test R2 Score : -0.8302730889884651

Best Random State : 3 and Train R2 Score : 0.8199055316941919
Best Random State : 3 and Test R2 Score : 0.7257992916491386

Best Random State : 4 and Train R2 Score : 0.8536067232046677
Best Random State : 4 and Test R2 Score : 0.618508188589969

Best Random State : 5 and Train R2 Score : 0.8806333789341638
Best Random State : 5 and Test R2 Score : 0.2671195820874026

Best Random State : 6 and Train R2 Score : 0.8800922107276878
Best Random State : 6 and Test R2 Score : 0.6629688959955353

Best Random State : 7 and Train R2 Score : 0.8488482103041881
Best Random State : 7 and Test R2 Score : 0.38500783612986544

Best Random State : 8 and Train R2 Score : 0.8368074421041336
Best Random State : 8 and Test R2 Score : 0.7888158307008533

Best Ra

<b>Best Random State : 23 and Train R2 Score : 0.8273528079101595<br>
        <br>
Best Random State : 23 and Test R2 Score : 0.7812319076407748</b>

## Splitting test and training data

In [68]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.23,random_state=23)

## LinearRegression

In [69]:
lr = LinearRegression()
lr.fit(x_train,y_train)
pred = lr.predict(x_test)

In [75]:
## Performance of model
def performance(actual,pred):
    print('Error ==> ')
    print('Mean Absolute error : ',mean_absolute_error(actual,pred))
    print('Mean Squared error : ',mean_squared_error(actual,pred))
    print('R2_Score : ',r2_score(actual,pred))

In [76]:
performance(y_test,pred)

Error ==> 
Mean Absolute error :  4.388362740359034
Mean Squared error :  30.797189818243584
R2_Score :  0.7812319076407748


## Cross Validation

In [77]:
from sklearn.model_selection import cross_val_score

In [79]:
cross = cross_val_score(LinearRegression(),x,y,cv=10)
cross.mean()

-5.944004960197801

## Regularization

In [81]:
from sklearn.linear_model import Lasso,Ridge,ElasticNet

In [83]:
# We can corrected by alpha value
le = Lasso(alpha=0.0001)
le.fit(x_train,y_train)
predict = le.predict(x_test)
le.score(x_train,y_train)

print(r2_score(y_test,predict))

0.7812746478132153


 - The score is 78%

## Cross Validation

In [84]:
cross_val = cross_val_score(le,x,y,cv=10)
cross_val.mean()

-5.943691740647567

In [85]:
# Ridge Model
re = Ridge(alpha=0.0001)
re.fit(x_train,y_train)
re.score(x_train,y_train)

0.8273528077486438

 - Ridge is giving 82%

## Cross Validation

In [86]:
cross_val = cross_val_score(re,x,y,cv=10)
cross_val.mean()

-5.943965765736464

## Model

In [87]:
from sklearn.svm import SVR

kernel = ['linear','poly','rbf','sigmoid']

for i in kernel:
    sr = SVR(kernel=i)
    sr.fit(x_train,y_train)
    print(sr.score(x_train,y_train))
    prec = sr.predict(x_test)
    print('Erro ==>')
    print('Mean Absolute error : ',mean_absolute_error(y_test,prec))
    print('Mean Squared error : ',mean_squared_error(y_test,prec))
    cross_val = cross_val_score(sr,x,y,cv=10)
    print('Cross Validation : ',cross_val.mean())
    print('')

0.7179715625258282
Erro ==>
Mean Absolute error :  4.987851266476606
Mean Squared error :  34.70558461989886
Cross Validation :  -7.021812327196119

0.3586794381170487
Erro ==>
Mean Absolute error :  9.496217537067833
Mean Squared error :  134.59662762091347
Cross Validation :  -5.847696092613174

0.17746992568223285
Erro ==>
Mean Absolute error :  9.199110897190863
Mean Squared error :  130.74143458890723
Cross Validation :  -6.54114430244923

0.27731531654048813
Erro ==>
Mean Absolute error :  7.788038340251693
Mean Squared error :  102.80343305349395
Cross Validation :  -4.160333456071724



 - This is giving 71% and less error for linear model.

## Hyper Parameter Tuning 

In [89]:
from sklearn.model_selection import GridSearchCV

In [124]:
# lasso Regression param

param = {'alpha' :[0.1,0.01,0.001,0.0001],
         'selection' :['cyclic','random']}

In [138]:
lasso_hp = GridSearchCV(Lasso(),param,cv=5)
lasso_hp.fit(x_train,y_train)
lasso_hp.best_params_

{'alpha': 0.1, 'selection': 'random'}

In [139]:
# By using alpha value

le = Lasso(alpha=0.1,selection='random')
le.fit(x_train,y_train)
predict = le.predict(x_test)
le.score(x_train,y_train)

print(r2_score(y_test,predict))

0.8041130755251142


In [127]:
# Ridge Regression Param

param_r = {'alpha' :[0.1,0.01,0.001,0.0001],
           'solver' :['auto','svd','cholesky','lsqr','sparse_cg','sag','saga']}

In [128]:
ridge_hp = GridSearchCV(Ridge(),param_r,cv=5)
ridge_hp.fit(x_train,y_train)
ridge_hp.best_params_

{'alpha': 0.1, 'solver': 'saga'}

In [129]:
# Ridge Model

re = Ridge(alpha=0.1,solver='saga')
re.fit(x_train,y_train)
predict = le.predict(x_test)
re.score(x_train,y_train)

print(r2_score(y_test,predict))

0.8041590369937879


 - Both Lasso & Ridge giving 80.4% accuracy.

## Save the Model

In [132]:
import joblib

joblib.dump(le,'Base_ball_case_study.obj')

['Base_ball_case_study.obj']

In [133]:
joblib.dump(re,'Base_ball_case_study1.obj')

['Base_ball_case_study1.obj']

## Load the model to file

In [140]:
mod1 = joblib.load('Base_ball_case_study.obj')
mod1

Lasso(alpha=0.1)

In [136]:
mod1 = joblib.load('Base_ball_case_study1.obj')
mod1

Ridge(alpha=0.1, solver='saga')