# Baseball game attendance prediction (Classification problem)

In this notebook, we focus on sports analytics. The data set is made available by http://www.baseball-reference.com and covers professional baseball (MLB) games played in the 2016 season. It contains 2,427 games, one per row. The goal is to predict attendance at a home team's game. This is an important task because most franchises want to anticipate the number of attendees for a variety of reasons, including profitability.

## Description of Variables

Descriptions of the variables are provided in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

# Section 1: 

## Data Prep 

In [284]:
# Common imports

import numpy as np
import pandas as pd

np.random.seed(42)

In [285]:
#Reading the baseball dataset
baseball = pd.read_csv("baseball.csv")


baseball.head(10)

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1
5,0,15116,1,7,5,Night Game,Night Game,0,8,3,Tuesday,Monday,72,0,In Dome,2.966667,0
6,0,44317,0,17,15,Night Game,Day Game,2,4,0,Tuesday,Monday,70,6,Unknown,3.166667,0
7,0,39500,0,5,1,Night Game,Day Game,1,9,4,Tuesday,Sunday,40,7,Sunny,3.033333,1
8,0,35067,1,7,4,Night Game,Night Game,2,7,3,Tuesday,Monday,70,8,Cloudy,2.933333,0
9,0,44318,0,15,12,Night Game,Day Game,1,8,3,Tuesday,Monday,64,0,In Dome,3.583333,0


In [286]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(baseball, test_size=0.3)

In [287]:
train.shape

(1698, 17)

In [288]:
test.shape

(729, 17)
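The 1,698 / 729 row counts follow directly from `test_size=0.3` applied to 2,427 rows. The sketch below checks that arithmetic on a stand-in array (`rows` is a placeholder, not the real data) and also passes `random_state`, which is one way to make the split reproducible without depending on the global NumPy seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 2,427 game rows; only the row count matters here.
rows = np.arange(2427)

# random_state pins the shuffle so the split is reproducible on its own,
# without relying on np.random.seed having been called earlier.
train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=42)

print(len(train_rows), len(test_rows))  # 1698 729
```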

In [289]:
train.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
270,0,19072,1,8,6,Night Game,Night Game,1,11,7,Friday,Thursday,76,7,Cloudy,3.1,1
2096,1,40725,0,6,4,Day Game,Night Game,0,10,5,Thursday,Wednesday,61,16,Cloudy,3.05,1
1055,1,34036,1,4,1,Night Game,Night Game,0,6,5,Saturday,Friday,84,10,Cloudy,2.433333,1
1014,0,27134,1,14,7,Night Game,Night Game,0,10,5,Wednesday,Tuesday,91,11,Cloudy,3.083333,0
1320,1,41571,0,8,2,Night Game,Day Game,1,8,4,Sunday,Saturday,68,17,Sunny,3.05,1


In [290]:
test.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
1700,1,31034,2,5,3,Night Game,Night Game,0,10,6,Wednesday,Tuesday,61,7,Sunny,2.583333,1
1544,1,41838,0,3,4,Day Game,Night Game,0,9,2,Wednesday,Tuesday,71,11,Cloudy,2.7,0
1917,0,27257,0,8,7,Day Game,Night Game,2,6,1,Saturday,Friday,60,11,Unknown,3.0,0
1985,1,39691,0,5,1,Day Game,Day Game,0,5,2,Sunday,Saturday,55,16,Unknown,2.4,1
1222,1,47747,0,6,1,Night Game,Day Game,0,14,6,Friday,Sunday,89,18,Sunny,2.816667,1


In [291]:
# Descriptive statistics of numerical variables

train.describe()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,temperature,wind_speed,previous_game_duration,previous_homewin
count,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0,1698.0
mean,0.514134,30225.243227,0.557715,8.723204,4.386337,0.594817,8.648999,4.512367,73.813899,7.435807,3.071103,0.534747
std,0.499947,9928.244195,0.767199,3.491543,3.061784,0.815016,3.499205,3.156705,10.522003,5.040525,0.444609,0.498938
min,0.0,8766.0,0.0,1.0,0.0,0.0,0.0,0.0,32.0,0.0,2.1,0.0
25%,0.0,22270.25,0.0,6.0,2.0,0.0,6.0,2.0,67.0,4.0,2.783333,0.0
50%,1.0,30338.5,0.0,8.0,4.0,0.0,8.0,4.0,74.0,7.0,3.016667,1.0
75%,1.0,38420.25,1.0,11.0,6.0,1.0,11.0,6.0,81.0,11.0,3.3,1.0
max,1.0,54269.0,5.0,22.0,21.0,5.0,22.0,17.0,101.0,25.0,5.566667,1.0


In [292]:
# Total missing values in each column

train.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [293]:
test.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [294]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [295]:
train_y = train[['attendance_binary']]
test_y = test[['attendance_binary']]

train_inputs = train.drop(['attendance_binary'], axis=1)
test_inputs = test.drop(['attendance_binary'], axis=1)

In [296]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [297]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['previous_homewin']

In [298]:
# Be careful: numeric_columns already includes the binary columns,
# so we need to remove them to keep them from being scaled.

for col in binary_columns:
    numeric_columns.remove(col)

In [299]:
numeric_columns

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [300]:
numeric_transformer = Pipeline(steps=[
                ('scaler', StandardScaler())])

In [301]:
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [302]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [303]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

# remainder='passthrough' keeps any column not assigned to a transformer unchanged.
# It is optional; you don't have to use it.

In [304]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-1.12371621,  0.57666325, -0.20719118, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.05787587, -0.72716391, -0.78017264, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.38394295,  0.57666325, -1.35315411, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.52534761,  0.57666325, -0.78017264, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.94392488, -0.72716391,  0.65228102, ...,  0.        ,
         0.        ,  0.        ],
       [-0.98407336,  1.88049041, -0.49368191, ...,  1.        ,
         0.        ,  1.        ]])

In [305]:
train_x.shape

(1698, 37)

In [306]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.0814842 ,  1.88049041, -1.06666338, ...,  1.        ,
         0.        ,  1.        ],
       [ 1.17001331, -0.72716391, -1.63964484, ...,  0.        ,
         0.        ,  0.        ],
       [-0.29905768, -0.72716391, -0.20719118, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.49582715, -0.72716391,  1.22526249, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.26445059, -0.72716391, -0.20719118, ...,  0.        ,
         0.        ,  1.        ],
       [-1.71775245, -0.72716391, -1.63964484, ...,  0.        ,
         0.        ,  0.        ]])

In [307]:
test_x.shape

(729, 37)

## Find the Baseline 

In [308]:
# Find majority class
train_y.value_counts()

attendance_binary
1                    873
0                    825
dtype: int64

In [309]:
# Find percentage
train_y.value_counts()/len(train_y)

attendance_binary
1                    0.514134
0                    0.485866
dtype: float64
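The same baseline can be expressed as a model, which makes it easy to score on the test set as well. The sketch below is an illustration only: it rebuilds the labels from the class counts reported in this notebook (the test counts are taken from the row sums of the confusion matrices in Section 2) and uses a placeholder feature column, since `DummyClassifier` ignores the inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from reported counts (assumption: order does not matter here).
# Train: 873 ones / 825 zeros. Test: 386 ones / 343 zeros.
y_train = np.array([1] * 873 + [0] * 825)
y_test = np.array([1] * 386 + [0] * 343)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

baseline_train_acc = baseline.score(X_train, y_train)
baseline_test_acc = baseline.score(X_test, y_test)
print(f"train baseline: {baseline_train_acc:.4f}, test baseline: {baseline_test_acc:.4f}")
```

Any model built later should beat both numbers to be worth keeping.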

# Section 2: 

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)
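One way to organize the comparison is to put the candidate SVMs in a dictionary and score them in a loop. The sketch below runs on synthetic data from `make_classification`, not on the baseball data, and the specific kernels and `C` values are assumptions chosen only to illustrate the pattern.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data so the example runs on its own.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# SVMs are sensitive to feature scale, so fit the scaler on the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Vary the kernel and the regularization strength C (smaller C = stronger regularization).
svm_candidates = {
    "linear, C=1": SVC(kernel="linear", C=1),
    "poly degree 2, C=1": SVC(kernel="poly", degree=2, coef0=1, C=1),
    "rbf, C=10": SVC(kernel="rbf", C=10, gamma="scale"),
}

svm_scores = {}
for name, model in svm_candidates.items():
    model.fit(X_tr, y_tr)
    svm_scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={svm_scores[name][0]:.3f} test={svm_scores[name][1]:.3f}")
```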

## SVM Model 1:

In [310]:
from sklearn.metrics import accuracy_score

In [311]:
from sklearn.svm import SVC
 
lin_svm2 = SVC(kernel="linear")

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
lin_svm2.fit(train_x, train_y.values.ravel())


SVC(kernel='linear')

In [312]:
#Predict the train values
train_y_pred = lin_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8445229681978799

In [313]:
#Predict the test values
test_y_pred = lin_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8175582990397805

In [314]:
# Confusion matrix (rows = actual class, columns = predicted class)
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[278,  65],
       [ 68, 318]], dtype=int64)
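The confusion matrix carries more than the accuracy score. As a short worked example, the sketch below takes the matrix printed above for SVM Model 1 and derives accuracy, precision, and recall for class 1 from its four cells.

```python
import numpy as np

# Test-set confusion matrix reported for SVM Model 1.
# scikit-learn convention: rows = actual class, columns = predicted class.
cm = np.array([[278, 65],
               [68, 318]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)     # share of actual 1s that the model found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.8176 precision=0.8303 recall=0.8238
```

The accuracy recovered here matches the test accuracy reported for this model.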

## SVM Model 2:

In [315]:
from sklearn.svm import SVC

#Changing the SVC kernel to check the model performance

pol_svm2 = SVC(kernel="poly", degree=2, coef0=1, C=1, gamma='scale')

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
pol_svm2.fit(train_x, train_y.values.ravel())


SVC(C=1, coef0=1, degree=2, kernel='poly')

In [316]:
#Predict the train values
train_y_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8733804475853946

In [317]:
#Predict the test values
test_y_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8244170096021948

In [318]:
#Confusion matrix for SVM model 2
confusion_matrix(test_y, test_y_pred)

array([[283,  60],
       [ 68, 318]], dtype=int64)

## SVM Model 3:

In [319]:
#Checking the SVC with rbf kernel
rbf_svm = SVC(kernel="rbf")

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
rbf_svm.fit(train_x, train_y.values.ravel())


SVC()

In [320]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8851590106007067

In [321]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.821673525377229

In [322]:
#Confusion matrix for SVM model 3
confusion_matrix(test_y, test_y_pred)

array([[280,  63],
       [ 67, 319]], dtype=int64)

# Section 3: 

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)
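The penalty is the easiest knob to vary between SGD models. The sketch below compares three penalties on synthetic data (not the baseball data); the `alpha` values and the constant learning-rate schedule are assumptions made for illustration. Setting `learning_rate="constant"` is what makes `eta0` the step size actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Settings shared by all candidates; hinge loss gives a linear SVM.
shared = dict(loss="hinge", learning_rate="constant", eta0=0.01,
              max_iter=1000, tol=1e-4, random_state=42)

sgd_candidates = {
    "no penalty": SGDClassifier(penalty=None, **shared),
    "l2, alpha=0.1": SGDClassifier(penalty="l2", alpha=0.1, **shared),
    "elasticnet, alpha=0.01": SGDClassifier(penalty="elasticnet", alpha=0.01,
                                            l1_ratio=0.5, **shared),
}

sgd_scores = {}
for name, model in sgd_candidates.items():
    model.fit(X_tr, y_tr)
    # n_iter_ is the number of epochs run before the stopping criterion was met.
    sgd_scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te), model.n_iter_)
    print(f"{name}: train={sgd_scores[name][0]:.3f} "
          f"test={sgd_scores[name][1]:.3f} epochs={sgd_scores[name][2]}")
```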

## SGD Model 1:

In [323]:
from sklearn.linear_model import SGDClassifier 

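# tol = stopping criterion
# eta0 = initial learning rate (ignored by the default learning_rate='optimal' schedule)
# penalty = regularization term (None = no regularization)
# max_iter = maximum number of passes over training data (i.e., epochs)
# Note: the default loss is 'hinge', so this fits a linear SVM rather than a logistic regression.

sgd_logreg = SGDClassifier(max_iter=100, penalty=None, eta0=0.1, tol=0.0001) 

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
sgd_logreg.fit(train_x, train_y.values.ravel())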


SGDClassifier(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

In [324]:
sgd_logreg.n_iter_

63

In [325]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8056537102473498

In [326]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.7969821673525377

## SGD Model 2:

In [335]:
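# SGD with hinge loss (a linear SVM) and an L2 penalty; alpha sets the regularization strength
sgd_logreg_L2 = SGDClassifier(max_iter=50, loss="hinge", penalty='l2', alpha=0.1, eta0=0.1, tol=0.0001)

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
sgd_logreg_L2.fit(train_x, train_y.values.ravel())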


SGDClassifier(alpha=0.1, eta0=0.1, max_iter=50, tol=0.0001)

In [336]:
sgd_logreg_L2.n_iter_

19

In [337]:
#Predict the train values
train_y_pred = sgd_logreg_L2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8362779740871613

In [338]:
#Predict the test values
test_y_pred = sgd_logreg_L2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8161865569272977

## LogisticRegression Model:

In [339]:
from sklearn.linear_model import LogisticRegression

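# penalty='none' fits an unregularized model; newer scikit-learn releases expect penalty=None instead
log_reg = LogisticRegression(penalty='none')

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
log_reg.fit(train_x, train_y.values.ravel())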
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(penalty='none')
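The iteration-limit message above means the solver stopped at its default cap of 100 iterations before converging. A common remedy is to raise `max_iter` and then check `n_iter_`. The sketch below shows this on synthetic data (not the baseball data); it uses a very large `C` as a version-independent stand-in for an unpenalized fit, which is an assumption of this example rather than what the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled so the solver is well conditioned.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X = StandardScaler().fit_transform(X)

# A very large C makes the L2 penalty negligible (assumption: used here
# in place of penalty='none' / penalty=None to stay version independent).
log_reg_more_iter = LogisticRegression(C=1e6, max_iter=1000)
log_reg_more_iter.fit(X, y)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped before reaching the cap.
iterations_used = int(log_reg_more_iter.n_iter_[0])
print(iterations_used, "iterations used out of", log_reg_more_iter.max_iter)
```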

In [340]:
#Accuracy
from sklearn.metrics import accuracy_score

In [341]:
#Predict the train values
train_y_pred = log_reg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8468786808009423

In [342]:
#Predict the test values
test_y_pred = log_reg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8161865569272977

In [343]:
# Confusion matrix (rows = actual class, columns = predicted class)
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[279,  64],
       [ 70, 316]], dtype=int64)

Next, we train a LogisticRegression model on polynomial features (second-degree terms and interaction terms) with an elastic-net penalty.


In [344]:
from sklearn.preprocessing import PolynomialFeatures

# Create second degree terms and interaction terms
poly_features = PolynomialFeatures(degree=2).fit(train_x)

train_x_poly = poly_features.transform(train_x)

test_x_poly = poly_features.transform(test_x)
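To see what the degree-2 expansion does, it helps to run it on a single row with two features and then count the output columns for a 37-feature input like `train_x`. This is a small illustration with made-up values, not a step of the assignment.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One row with two features, a=2 and b=3.
X_small = np.array([[2.0, 3.0]])

# Output columns, in order: 1 (bias), a, b, a^2, a*b, b^2
expanded = PolynomialFeatures(degree=2).fit_transform(X_small)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# For 37 input features: 1 bias + 37 linear + 37 squares + 666 pairwise interactions
n_out_37 = PolynomialFeatures(degree=2).fit(np.zeros((1, 37))).n_output_features_
print(n_out_37)  # 741
```

With far more columns than before and only 1,698 training rows, regularization becomes important for the model trained on these features.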

In [345]:
log_reg = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5)

# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
log_reg.fit(train_x_poly, train_y.values.ravel())


LogisticRegression(l1_ratio=0.5, penalty='elasticnet', solver='saga')

In [346]:
#Predict the train values
train_y_pred = log_reg.predict(train_x_poly)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8828032979976443

In [347]:
#Predict the test values
test_y_pred = log_reg.predict(test_x_poly)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8189300411522634

In [348]:
# Confusion matrix (rows = actual class, columns = predicted class)
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[287,  56],
       [ 76, 310]], dtype=int64)

# Discussion 


## Train and test values of each model you built 
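One convenient way to present these values is a single table. The sketch below copies the accuracy scores from the cell outputs in Sections 2 and 3 into a DataFrame and adds the train-minus-test gap, which is a rough indicator of overfitting. The model labels are descriptive names chosen for this table.

```python
import pandas as pd

# Accuracy values copied from the cell outputs in Sections 2 and 3.
scores = pd.DataFrame(
    [
        ("SVM, linear kernel", 0.8445229681978799, 0.8175582990397805),
        ("SVM, poly kernel (degree 2)", 0.8733804475853946, 0.8244170096021948),
        ("SVM, rbf kernel", 0.8851590106007067, 0.821673525377229),
        ("SGD, no penalty", 0.8056537102473498, 0.7969821673525377),
        ("SGD, L2 penalty (alpha=0.1)", 0.8362779740871613, 0.8161865569272977),
        ("LogisticRegression, no penalty", 0.8468786808009423, 0.8161865569272977),
        ("LogisticRegression, poly features + elastic net",
         0.8828032979976443, 0.8189300411522634),
    ],
    columns=["model", "train_accuracy", "test_accuracy"],
)

# A large positive gap means the model fits the training data
# noticeably better than unseen data.
scores["gap"] = scores["train_accuracy"] - scores["test_accuracy"]

print(scores.round(4).to_string(index=False))
```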

## Which model performs the best and why?  How does it compare to baseline? 

Note: The best model is the one that has the highest TEST score (regardless of any of the training values).
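Following that rule, the best model can be picked programmatically and compared with the majority-class baseline. In the sketch below the test accuracies are copied from the outputs above, and the test-set baseline (386 of 729 games in class 1) is derived from the row sums of the confusion matrices, so treat it as a reconstruction rather than a value the notebook reports directly.

```python
# Test accuracies copied from the cell outputs above.
test_accuracy = {
    "SVM, linear kernel": 0.8175582990397805,
    "SVM, poly kernel (degree 2)": 0.8244170096021948,
    "SVM, rbf kernel": 0.821673525377229,
    "SGD, no penalty": 0.7969821673525377,
    "SGD, L2 penalty (alpha=0.1)": 0.8161865569272977,
    "LogisticRegression, no penalty": 0.8161865569272977,
    "LogisticRegression, poly features + elastic net": 0.8189300411522634,
}

# Accuracy of always predicting class 1 on the test set
# (386 actual 1s out of 729 rows, from the confusion matrices).
baseline_test = 386 / 729

# The best model is the one with the highest test accuracy.
best_model = max(test_accuracy, key=test_accuracy.get)
lift = test_accuracy[best_model] - baseline_test

print(f"best model: {best_model}")
print(f"test accuracy: {test_accuracy[best_model]:.4f}")
print(f"improvement over baseline: {lift:.4f}")
```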

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? 

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? 