In this assignment, we focus on sports analytics. The data set, made available by http://www.baseball-reference.com, contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set, and each row represents a single game. The goal is to predict attendance at a home team's game. This matters because franchises rely on attendance forecasts for a variety of reasons, including projecting profits.

## Description of Variables

The variables are described in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

## Data Prep

In [173]:
#Importing pandas and numpy
import numpy as np
import pandas as pd

np.random.seed(99)

# Get The Data

In [174]:
# we will predict the target variable "attendance_binary"
# import the data

baseball_data = pd.read_csv("baseball.csv")

baseball_data.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1


In [175]:
# to find the total number of rows

baseball_data.shape

(2427, 17)

In [176]:
#info about the data
baseball_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2427 entries, 0 to 2426
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   attendance_binary          2427 non-null   int64  
 1   previous_attendance        2427 non-null   int64  
 2   previous_away_team_errors  2427 non-null   int64  
 3   previous_away_team_hits    2427 non-null   int64  
 4   previous_away_team_runs    2427 non-null   int64  
 5   game_type                  2427 non-null   object 
 6   previous_game_type         2427 non-null   object 
 7   previous_home_team_errors  2427 non-null   int64  
 8   previous_home_team_hits    2427 non-null   int64  
 9   previous_home_team_runs    2427 non-null   int64  
 10  game_day                   2427 non-null   object 
 11  previous_game_day          2427 non-null   object 
 12  temperature                2427 non-null   int64  
 13  wind_speed                 2427 non-null   int64

In [177]:
#Sum of Null values in the data
baseball_data.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

# Split the data into train and test

In [178]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(baseball_data, test_size=0.3)

In [179]:
train.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [180]:
test.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [181]:
train.shape

(1698, 17)

In [182]:
test.shape

(729, 17)
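
The split above depends on the global NumPy seed set at the top of the notebook. As a hedged alternative, the sketch below pins the shuffle with `random_state` and keeps the class ratio equal in both parts with `stratify`. The `toy` frame is a hypothetical stand-in for `baseball_data`, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for baseball_data: 100 rows with a binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "previous_attendance": rng.integers(10_000, 50_000, size=100),
    "attendance_binary": np.repeat([0, 1], [45, 55]),
})

# random_state pins the shuffle; stratify keeps the class ratio in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=99, stratify=toy["attendance_binary"]
)

print(toy_train.shape, toy_test.shape)
print(toy_test["attendance_binary"].mean())
```

With `random_state` passed explicitly, the same rows land in train and test on every run, regardless of what else has consumed the global random stream.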

In [183]:
train.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
1989,1,34362,1,13,4,Day Game,Night Game,1,12,7,Sunday,Saturday,53,5,Unknown,3.183333,1
1759,1,38813,2,6,0,Day Game,Day Game,0,7,6,Saturday,Friday,68,9,Rain,3.0,1
1317,1,36683,1,12,10,Night Game,Day Game,2,11,4,Friday,Wednesday,77,9,Sunny,3.3,0
1418,1,36552,1,22,21,Day Game,Night Game,4,8,2,Sunday,Saturday,81,11,Sunny,3.633333,0
2026,1,39128,1,9,7,Night Game,Night Game,0,13,10,Wednesday,Tuesday,70,14,Cloudy,3.116667,1


In [184]:
test.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
662,1,34215,0,4,0,Night Game,Day Game,2,7,3,Friday,Wednesday,75,5,Sunny,2.166667,1
1279,1,27544,0,14,8,Night Game,Day Game,1,10,5,Monday,Sunday,90,11,Sunny,3.05,0
939,1,33909,0,15,5,Night Game,Day Game,1,11,3,Tuesday,Sunday,73,0,In Dome,3.95,0
356,1,42376,0,9,4,Night Game,Night Game,0,9,8,Sunday,Saturday,70,6,Overcast,3.616667,1
644,1,39146,1,12,8,Day Game,Night Game,0,9,5,Sunday,Saturday,81,7,Sunny,3.666667,0


# Data Prep

In [185]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [186]:
#separate the target variable
traindata_target = train['attendance_binary']
testdata_target = test['attendance_binary']

traindata_input = train.drop(['attendance_binary'], axis=1)
testdata_input = test.drop(['attendance_binary'], axis=1)

In [187]:
#Identifying the datatypes
traindata_input.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [188]:
# Identify the numerical columns
numeric_variables = traindata_input.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_variables = traindata_input.select_dtypes('object').columns.to_list()

In [189]:
# Identify the binary columns so we can pass them through without transforming
binary_variables = ['previous_homewin']

In [190]:
#we need to remove the binary columns from numerical columns.

for col in binary_variables:
    numeric_variables.remove(col)

In [191]:
numeric_variables

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [192]:
categorical_variables

['game_type', 'previous_game_type', 'game_day', 'previous_game_day', 'sky']

In [193]:
binary_variables

['previous_homewin']

# Using Pipeline for Transform

In [194]:
numeric_transform = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [195]:
categorical_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
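
To illustrate what `handle_unknown='ignore'` does in the categorical pipeline, here is a minimal sketch on made-up `sky` values (the category `"Drizzle"` is hypothetical). A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories seen in "training" (illustrative sky values).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Sunny"], ["Cloudy"], ["Overcast"]]))

# "Drizzle" was never seen during fit, so its row is encoded as all zeros.
encoded = encoder.transform(np.array([["Sunny"], ["Drizzle"]])).toarray()
print(encoder.categories_[0])
print(encoded)
```

This is why the test data can be transformed safely even if it contains a category level that is absent from the train data.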

In [196]:
binary_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [197]:
preprocessing_data = ColumnTransformer([
        ('num', numeric_transform, numeric_variables),
        ('cat', categorical_transform, categorical_variables),
        ('binary', binary_transform, binary_variables)])

# fit_transform() for Train data

In [198]:
#fit and transform for train data
train_transform = preprocessing_data.fit_transform(traindata_input)


In [199]:
train_transform

array([[ 0.41544069,  0.54926461,  1.16575025, ...,  0.        ,
         1.        ,  1.        ],
       [ 0.8703858 ,  1.83214122, -0.77624099, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.65267448,  0.54926461,  0.88832293, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [-0.52337548,  1.83214122,  0.33346829, ...,  0.        ,
         1.        ,  0.        ],
       [ 1.4662811 ,  0.54926461, -0.77624099, ...,  0.        ,
         1.        ,  1.        ],
       [-1.03739906, -0.73361201, -0.49881367, ...,  0.        ,
         0.        ,  0.        ]])

In [200]:
train_transform.shape

(1698, 37)
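
The 37 columns are made up of the 10 scaled numeric variables, the 1 binary variable, and 26 one-hot columns (one per level of each categorical variable). The sketch below shows the same bookkeeping on a toy frame. It is a simplified assumption, not the notebook's pipeline: the imputers are omitted and the binary column uses `"passthrough"`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: 2 numeric columns, 2 categorical columns, 1 binary column.
toy = pd.DataFrame({
    "temperature": [55, 48, 65, 77],
    "wind_speed": [24, 7, 10, 0],
    "game_type": ["Night Game", "Night Game", "Day Game", "Day Game"],
    "sky": ["Overcast", "Unknown", "Cloudy", "In Dome"],
    "previous_homewin": [1, 1, 0, 1],
})

toy_prep = ColumnTransformer([
    ("num", StandardScaler(), ["temperature", "wind_speed"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["game_type", "sky"]),
    ("binary", "passthrough", ["previous_homewin"]),
])
toy_matrix = toy_prep.fit_transform(toy)

# One output column per level of each categorical variable.
levels = [len(c) for c in toy_prep.named_transformers_["cat"].categories_]
expected_width = 2 + sum(levels) + 1
print(levels, expected_width, toy_matrix.shape)
```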

# Transform for Test data

In [201]:
# Transform the test data
test_transform = preprocessing_data.transform(testdata_input)

test_transform

array([[ 0.40041555, -0.73361201, -1.33109563, ...,  1.        ,
         0.        ,  1.        ],
       [-0.28143995, -0.73361201,  1.44317757, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.36913871, -0.73361201,  1.72060489, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.58484593, -0.73361201,  0.61089561, ...,  0.        ,
         0.        ,  0.        ],
       [-1.84579287,  1.83214122, -0.22138635, ...,  0.        ,
         0.        ,  1.        ],
       [-0.53155243,  0.54926461,  0.05604097, ...,  0.        ,
         0.        ,  0.        ]])

In [202]:
test_transform.shape

(729, 37)

## Finding the Baseline 

In [203]:
traindata_target.value_counts()

1    891
0    807
Name: attendance_binary, dtype: int64

In [204]:
traindata_target.value_counts()/len(traindata_target)

1    0.524735
0    0.475265
Name: attendance_binary, dtype: float64

The majority class of the target variable is 1, so we take its share of the training set, 0.524735, as the baseline accuracy.
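
The same baseline can be reproduced with a majority-class classifier. This is a minimal sketch: the labels are rebuilt from the class counts printed above, and `X_placeholder` is a hypothetical dummy feature matrix, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts copied from the training-set value_counts() above.
y_train = np.array([1] * 891 + [0] * 807)
X_placeholder = np.zeros((len(y_train), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_train)

baseline_accuracy = accuracy_score(y_train, baseline.predict(X_placeholder))
print(round(baseline_accuracy, 6))
```

Any model worth keeping should beat this number on the test data.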

# Section 2: 

## SVM Model 1:

# SVC(kernel = 'linear')

In [278]:
from sklearn.svm import SVC

linear_SVM = SVC(kernel="linear")

linear_SVM.fit(train_transform, traindata_target)

SVC(kernel='linear')

In [279]:
from sklearn.metrics import accuracy_score

#Predict the train values
train_target_pred = linear_SVM.predict(train_transform)

#Train accuracy
accuracy_score(traindata_target, train_target_pred)

0.8368669022379269

In [280]:
#Predict the test values
test_target_pred = linear_SVM.predict(test_transform)

#Test accuracy
accuracy_score(testdata_target, test_target_pred)

0.8340192043895748
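
The fit, predict, and score steps are repeated for every model in this notebook. One way to avoid the repetition is a small helper, sketched below. The function name `train_test_accuracy` is hypothetical, and the data comes from `make_classification` as a synthetic stand-in for `train_transform` and `test_transform`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_test_accuracy(model, X_train, y_train, X_test, y_test):
    """Fit the model, then return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Synthetic stand-in for train_transform / test_transform.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

train_acc, test_acc = train_test_accuracy(SVC(kernel="linear"), X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between the two returned values is the overfitting signal used in the discussion at the end.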

## SVM Model 2:

# SVC(kernel = 'poly')

In [281]:
from sklearn.svm import SVC

poly_SVM = SVC(kernel = "poly", degree = 2, coef0=1, C=10, gamma = 'scale')

poly_SVM.fit(train_transform, traindata_target)

SVC(C=10, coef0=1, degree=2, kernel='poly')

In [282]:
#Predict the train values
train_target_poly_pred = poly_SVM.predict(train_transform)

#Train accuracy
accuracy_score(traindata_target, train_target_poly_pred)

0.8969375736160189

In [283]:
#Predict the test values
test_target_poly_pred = poly_SVM.predict(test_transform)

#Test accuracy
accuracy_score(testdata_target, test_target_poly_pred)

0.813443072702332

## SVM Model 3:

# SVC(kernel = 'rbf')

In [284]:
#Gaussian RBF

rbf_SVM = SVC(kernel="rbf", C=10, gamma='scale')

rbf_SVM.fit(train_transform, traindata_target)

SVC(C=10)

In [285]:
#Predict the train values
train_target_rbf_pred = rbf_SVM.predict(train_transform)

#Train accuracy
accuracy_score(traindata_target, train_target_rbf_pred)

0.9752650176678446

In [286]:
#Predict the test values
test_target_rbf_pred = rbf_SVM.predict(test_transform)

#Test accuracy
accuracy_score(testdata_target, test_target_rbf_pred)

0.7942386831275721

# Section 3

## SGD Model 1: 

# SGD Classifier

In [287]:
from sklearn.linear_model import SGDClassifier 

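# tol = stopping criterion
# eta0 = initial learning rate (not used by the default learning_rate='optimal' schedule)
# penalty = regularization term
# max_iter = maximum number of passes over training data (i.e., epochs)
# note: the default loss is 'hinge', so this fits a linear SVM rather than a logistic regression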

SGD_logr = SGDClassifier(max_iter=100, penalty=None, eta0=0.1, tol=0.0001) 

SGD_logr.fit(train_transform, traindata_target)



SGDClassifier(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

In [288]:
#Predict the train values
train_SGD_class_pred = SGD_logr.predict(train_transform)

#Train accuracy
accuracy_score(traindata_target, train_SGD_class_pred)

0.8268551236749117

In [289]:
#Predict the test values
test_SGD_class_pred = SGD_logr.predict(test_transform)

#Test accuracy
accuracy_score(testdata_target, test_SGD_class_pred)

0.821673525377229

## SGD Model 2:

# SGD L1 Classifier

In [290]:
from sklearn.linear_model import SGDClassifier 

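# tol = stopping criterion
# eta0 = initial learning rate (not used by the default learning_rate='optimal' schedule)
# penalty = regularization term
# max_iter = maximum number of passes over training data (i.e., epochs)
# note: the default loss is 'hinge', so this fits a linear SVM rather than a logistic regression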

SGD_logr = SGDClassifier(max_iter=100, penalty='l1', eta0=0.1, tol=0.0001) 

SGD_logr.fit(train_transform, traindata_target)

SGDClassifier(eta0=0.1, max_iter=100, penalty='l1', tol=0.0001)

In [291]:
#Predict the train values
train_SGD_class_pred = SGD_logr.predict(train_transform)

#Train accuracy
accuracy_score(traindata_target, train_SGD_class_pred)

0.7997644287396938

In [292]:
#Predict the test values
test_SGD_class_pred = SGD_logr.predict(test_transform)

#Test accuracy
accuracy_score(testdata_target, test_SGD_class_pred)

0.8024691358024691
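
In both SGD models, `eta0` is passed but `learning_rate` is left at its default, `'optimal'`, which does not use `eta0`. The sketch below shows one way to make `eta0` take effect by choosing the `'constant'` schedule. The value `eta0=0.01`, the `random_state`, and the synthetic data are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train_transform / traindata_target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

# loss="hinge" is the default and is spelled out here only for clarity.
# learning_rate="constant" makes the optimizer use eta0 as the step size.
sgd = SGDClassifier(
    loss="hinge",
    penalty="l1",
    learning_rate="constant",
    eta0=0.01,
    max_iter=100,
    tol=1e-4,
    random_state=99,
)
sgd.fit(X, y)

print(sgd.n_iter_, round(sgd.score(X, y), 3))
```

Setting `random_state` also makes the SGD results repeatable, since the training data is shuffled at every epoch.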

## LogisticRegression Model:

In [293]:
from sklearn.linear_model import LogisticRegression

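# penalty='none' fits an unregularized model (newer scikit-learn releases expect penalty=None instead)
log_reg = LogisticRegression(penalty='none')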

log_reg.fit(train_transform, traindata_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(penalty='none')
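
The warning above says the solver stopped at its iteration limit before converging. A common remedy, sketched here on synthetic data, is to raise `max_iter`. This is an assumption-laden example: it uses the default L2 penalty rather than the unregularized model fitted above, and `make_classification` stands in for `train_transform`.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train_transform; flip_y adds label noise so the
# classes are not perfectly separable.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X = StandardScaler().fit_transform(X)

# Record any warnings raised during fitting so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf = LogisticRegression(max_iter=1000)  # default is max_iter=100
    clf.fit(X, y)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", clf.n_iter_[0], "| hit limit:", hit_limit)
```

If `n_iter_` is below `max_iter` and no `ConvergenceWarning` is recorded, the solver stopped because it converged rather than because it ran out of iterations.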

In [294]:
from sklearn.metrics import accuracy_score

#Predict the train values
train_target_logr_pred = log_reg.predict(train_transform)

#Train accuracy
accuracy_score(traindata_target, train_target_logr_pred)

0.8404004711425206

In [295]:
#Predict the test values
test_target_logr_pred = log_reg.predict(test_transform)

#Test accuracy
accuracy_score(testdata_target, test_target_logr_pred)

0.8340192043895748

In [296]:
#Classification Report

In [297]:
from sklearn.metrics import classification_report

#classification report on test set
print(classification_report(testdata_target, test_target_logr_pred))

              precision    recall  f1-score   support

           0       0.84      0.82      0.83       361
           1       0.83      0.85      0.84       368

    accuracy                           0.83       729
   macro avg       0.83      0.83      0.83       729
weighted avg       0.83      0.83      0.83       729



In [298]:
#Confusion Matrix

In [299]:
from sklearn.metrics import confusion_matrix

#confusion matrix on test set
confusion_matrix(testdata_target, test_target_logr_pred)

array([[296,  65],
       [ 56, 312]], dtype=int64)
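
The figures in the classification report can be derived directly from this confusion matrix, where rows are the true classes and columns are the predicted classes. A short worked example using the counts above:

```python
import numpy as np

# Test-set confusion matrix from above: rows = true class, columns = predicted class.
cm = np.array([[296, 65],
               [56, 312]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / all true members of that class

print(round(accuracy, 4))
print(precision.round(2), recall.round(2))
```

The accuracy is (296 + 312) / 729, which matches the test accuracy reported for Logistic Regression, and the rounded precision and recall match the report for classes 0 and 1.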

# Discussion


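## The train and test values of each model

| Model | Train accuracy | Test accuracy |
|---|---|---|
| SVM Model 1: SVC(kernel='linear') | 0.8369 | 0.8340 |
| SVM Model 2: SVC(kernel='poly') | 0.8969 | 0.8134 |
| SVM Model 3: SVC(kernel='rbf') | 0.9753 | 0.7942 |
| SGD Model 1: SGD Classifier | 0.8269 | 0.8217 |
| SGD Model 2: SGD L1 Classifier | 0.7998 | 0.8025 |
| LogisticRegression Model | 0.8404 | 0.8340 |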

## Performance of models

1) Logistic Regression and the linear SVM share the highest test accuracy, 83.4%, well above the baseline accuracy of 52.5%. Logistic Regression has the slightly higher train accuracy (84.0% vs. 83.7%), and we treat it as the best model.
2) There is no evidence of overfitting in the best model, Logistic Regression: its accuracy is 84% on the train data and 83% on the test data.
3) SVM Model 3 (RBF) does overfit: at C=10 its accuracy is 97.5% on the train data but 79.4% on the test data. Lowering C, which strengthens regularization, should reduce the overfitting.
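
The suggestion in point 3 to change C for the RBF model can be tested with a cross-validated grid search rather than by hand. The sketch below is a hedged illustration on synthetic data: the grid values are arbitrary choices, and `make_classification` stands in for `train_transform` and `traindata_target`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for train_transform / traindata_target.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Smaller C means stronger regularization; the grid values are illustrative.
param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3), round(search.score(X_te, y_te), 3))
```

Because the parameters are chosen by cross-validation on the train data only, the test data stays untouched until the final score.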
