# Group work - Classification

In this assignment, we will focus on sports analytics. The data set comes from http://www.baseball-reference.com and covers professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set, and each row represents a single game. The goal is to predict the attendance at a home team's game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons, including profits.

## Description of Variables

The descriptions of the variables are provided in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** to be completed by the first group member and checked by the second

**Section 3:** to be completed by the second group member and checked by the first

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [146]:
# Common Imports
import numpy as np
import pandas as pd

np.random.seed(9990)

In [147]:
# Read the input data set
baseball = pd.read_csv("baseball.csv")
# View the first rows of the data set
baseball.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1


In [148]:
# Viewing the summary statistics
baseball.describe()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,temperature,wind_speed,previous_game_duration,previous_homewin
count,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0,2427.0
mean,0.518747,30329.552946,0.58014,8.782859,4.41986,0.588381,8.615987,4.536053,73.97363,7.447054,3.085895,0.532344
std,0.499751,9867.617431,0.79237,3.517743,3.111236,0.806454,3.445614,3.119996,10.416003,5.021387,0.457116,0.499056
min,0.0,8766.0,0.0,1.0,0.0,0.0,0.0,0.0,31.0,0.0,1.916667,0.0
25%,0.0,22385.0,0.0,6.0,2.0,0.0,6.0,2.0,68.0,4.0,2.8,0.0
50%,1.0,30554.0,0.0,9.0,4.0,0.0,8.0,4.0,74.0,7.0,3.033333,1.0
75%,1.0,38358.5,1.0,11.0,6.0,1.0,11.0,6.0,81.0,11.0,3.3,1.0
max,1.0,54449.0,5.0,22.0,21.0,5.0,22.0,17.0,101.0,25.0,6.216667,1.0


In [149]:
# View the number of rows and columns in the data set
baseball.shape

(2427, 17)

# Splitting the data into train and test

In [150]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(baseball, test_size=0.3)
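
The split above has no `random_state`, so it depends on the global NumPy seed set earlier and is only reproducible when the cells run in order. Here is a minimal sketch of an alternative, assuming a toy frame (`demo`) in place of the baseball data and an arbitrary seed of 42. Passing `random_state` pins the split itself, and `stratify` keeps the class ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the baseball data: 100 rows, one feature, a balanced binary target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temperature": rng.integers(40, 100, size=100),
    "attendance_binary": np.repeat([0, 1], 50),
})

demo_train, demo_test = train_test_split(
    demo,
    test_size=0.3,
    random_state=42,                     # assumed seed; any fixed integer works
    stratify=demo["attendance_binary"],  # keep class proportions equal in both parts
)

print(len(demo_train), len(demo_test))
print(demo_test["attendance_binary"].mean())
```

Switching the notebook to this form would change which rows land in each set, so every accuracy reported below would change as well.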

In [151]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [152]:
#Checking for the null values in the train dataset
train_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [153]:
# Checking for the null values in the test dataset
test_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

# Data Preparation

In [154]:
# Separate the target variable from the input features
train_y=train_set[['attendance_binary']]
test_y=test_set[['attendance_binary']]

train_inputs=train_set.drop(['attendance_binary'],axis=1)
test_inputs=test_set.drop(['attendance_binary'],axis=1)
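
Selecting the target with double brackets returns a one-column DataFrame rather than a 1-D Series. scikit-learn classifiers expect a 1-D target and typically emit a `DataConversionWarning` when given a column vector. The sketch below uses a small hypothetical frame (`demo`) to show the difference and two ways to get a 1-D target.

```python
import pandas as pd

# Hypothetical three-row frame standing in for train_set.
demo = pd.DataFrame({"attendance_binary": [0, 1, 1], "temperature": [55, 48, 65]})

y_frame = demo[["attendance_binary"]]  # double brackets: DataFrame, shape (3, 1)
y_series = demo["attendance_binary"]   # single brackets: Series, shape (3,)
y_flat = y_frame.values.ravel()        # flatten an existing one-column frame to 1-D

print(y_frame.shape, y_series.shape, y_flat.shape)
```

Either `y_series` or `y_flat` can be passed to `fit()` without triggering the conversion warning.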

In [155]:
# Identifying the attribute datatype
train_inputs.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [156]:
# Identifying the numeric columns
numeric_columns=train_inputs.select_dtypes(include=[np.number]).columns.to_list()
# Identifying the categorical columns
categorical_columns=train_inputs.select_dtypes('object').columns.to_list()
# Identifying the binary columns
binary_columns=['previous_homewin']

In [157]:
#Eliminating binary from the numeric attributes
for col in binary_columns:
    numeric_columns.remove(col)

In [158]:
numeric_columns

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [159]:
categorical_columns

['game_type', 'previous_game_type', 'game_day', 'previous_game_day', 'sky']

In [160]:
binary_columns

['previous_homewin']

# Creating Pipeline for Column Transformations

In [161]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='mean')),
                ('scaler', StandardScaler())])

In [162]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [163]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [164]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')
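
To see how the transformed column count arises, here is a reduced sketch of the same idea on a hypothetical four-row frame (`toy`). It leaves out the imputers and uses one column of each kind. The output width is the number of numeric columns, plus one column per category level, plus the binary columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric, one categorical, and one binary column.
toy = pd.DataFrame({
    "temperature": [55.0, 48.0, 65.0, 77.0],
    "sky": ["Overcast", "Unknown", "Cloudy", "Cloudy"],
    "previous_homewin": [1, 1, 0, 1],
})

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["temperature"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sky"]),
    ("binary", "passthrough", ["previous_homewin"]),
])

toy_x = toy_preprocessor.fit_transform(toy)
# 1 scaled numeric + 3 one-hot columns (Cloudy, Overcast, Unknown) + 1 binary = 5 columns
print(toy_x.shape)
```

By the same arithmetic, the 37 columns reported for the notebook's `train_x` would be 10 numeric columns, 1 binary column, and 26 one-hot columns from the five categorical variables.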


# fit_transform() for TRAIN

In [165]:
train_x = preprocessor.fit_transform(train_inputs)
train_x

array([[-0.3691264 , -0.73139446,  0.05666937, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.63473534,  0.51675407, -1.08886155, ...,  1.        ,
         0.        ,  1.        ],
       [ 1.05239602, -0.73139446, -0.80247882, ...,  0.        ,
         1.        ,  1.        ],
       ...,
       [-0.53594803, -0.73139446,  0.34305211, ...,  0.        ,
         0.        ,  0.        ],
       [-1.16329843, -0.73139446, -1.66162702, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.12628326,  0.51675407,  2.0613485 , ...,  1.        ,
         0.        ,  0.        ]])

In [166]:
train_x.shape

(1698, 37)

# transform() for TEST

In [167]:
test_x = preprocessor.transform(test_inputs)
test_x

array([[ 0.73017753,  1.76490261,  0.34305211, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.46649826, -0.73139446,  0.62943484, ...,  0.        ,
         0.        ,  0.        ],
       [-0.43261973,  0.51675407,  1.2022003 , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.61733676, -0.73139446,  0.62943484, ...,  0.        ,
         0.        ,  0.        ],
       [-0.77030713, -0.73139446,  0.05666937, ...,  1.        ,
         0.        ,  1.        ],
       [-0.75817465, -0.73139446, -0.51609609, ...,  0.        ,
         0.        ,  1.        ]])

In [169]:
test_x.shape

(729, 37)

## Find the Baseline (0.5 point)

In [170]:
# Finding Majority Class
train_y.value_counts()

attendance_binary
1                    896
0                    802
dtype: int64

In [171]:
# Finding Percentage
train_y.value_counts()/len(train_y)

attendance_binary
1                    0.52768
0                    0.47232
dtype: float64

# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)

## SVM Model 1: LINEAR

In [172]:
from sklearn.svm import LinearSVC 
svm_clf = LinearSVC(C=10)
svm_clf.fit(train_x, train_y)



LinearSVC(C=10)

# Accuracy 

In [173]:
from sklearn.metrics import accuracy_score


In [174]:
#Predict the train values
train_y_pred = svm_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8345111896348646

In [175]:
#Predict the test values
test_y_pred = svm_clf.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8353909465020576

# Confusion Matrix

In [176]:
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[302,  64],
       [ 56, 307]], dtype=int64)

# Classification Report

In [177]:
from sklearn.metrics import classification_report

#We usually create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           0       0.84      0.83      0.83       366
           1       0.83      0.85      0.84       363

    accuracy                           0.84       729
   macro avg       0.84      0.84      0.84       729
weighted avg       0.84      0.84      0.84       729



## SVM Model 2: POLY

In [178]:
from sklearn.svm import SVC

In [179]:
pol_svm2 = SVC(kernel="poly", degree=2, coef0=1, C=10, gamma='scale')
pol_svm2.fit(train_x, train_y)



SVC(C=10, coef0=1, degree=2, kernel='poly')

In [196]:
#Predict the train values
train_y_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.9028268551236749

In [197]:
#Predict the test values
test_y_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8189300411522634

## SVM Model 3: RBF

In [128]:
rbf_svm = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm.fit(train_x, train_y)



SVC(C=10)

In [182]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.9204946996466431

In [183]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.9176954732510288

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)
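
The "adding polynomial terms" option named above can be sketched as a single pipeline, so that the polynomial expansion is fitted on the training data only. This is an illustration on synthetic data from `make_classification`, not on the baseball features. The names `demo_x`, `demo_y`, and `sgd_poly` and the chosen hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 200 rows, 5 numeric features, binary target.
demo_x, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)

sgd_poly = make_pipeline(
    StandardScaler(),                                  # SGD is sensitive to feature scale
    PolynomialFeatures(degree=2, include_bias=False),  # 5 linear + 15 second-order terms
    SGDClassifier(penalty="l2", max_iter=1000, tol=1e-3, random_state=0),
)
sgd_poly.fit(demo_x, demo_y)

demo_train_acc = sgd_poly.score(demo_x, demo_y)
n_poly_features = sgd_poly.named_steps["polynomialfeatures"].n_output_features_
print(n_poly_features, round(demo_train_acc, 3))
```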

## SGD Model 1:

In [184]:
from sklearn.linear_model import SGDClassifier

In [185]:
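# SGD classifier without regularization.
# Note: the default loss is 'hinge' (a linear SVM), and eta0 is not used
# with the default learning_rate='optimal' schedule.
sgd_lg = SGDClassifier(max_iter=100, penalty=None, eta0=0.1, tol=0.0001)
sgd_lg.fit(train_x, train_y)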



SGDClassifier(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

In [186]:
train_y_pred=sgd_lg.predict(train_x)
accuracy_score(train_y,train_y_pred)

0.7985865724381626

In [187]:
test_y_pred=sgd_lg.predict(test_x)
accuracy_score(test_y,test_y_pred)

0.7928669410150891

## SGD Model 2:

In [188]:
# Same settings as SGD Model 1, but with L1 regularization
sgd_lg = SGDClassifier(max_iter=100, penalty='l1', eta0=0.1, tol=0.0001)
sgd_lg.fit(train_x, train_y)



SGDClassifier(eta0=0.1, max_iter=100, penalty='l1', tol=0.0001)

In [198]:
train_y_pred=sgd_lg.predict(train_x)
accuracy_score(train_y,train_y_pred)

0.8351001177856302

In [199]:
test_y_pred=sgd_lg.predict(test_x)
accuracy_score(test_y,test_y_pred)

0.8367626886145405

## LogisticRegression Model:

In [191]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

In [192]:
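# Add degree-2 polynomial terms: fit on the training inputs only,
# then apply the same transformation to both sets
poly_features = PolynomialFeatures(degree=2).fit(train_x)
train_x_poly = poly_features.transform(train_x)
test_x_poly = poly_features.transform(test_x)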

In [193]:
log_reg = LogisticRegression()
log_reg.fit(train_x_poly, train_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [200]:
train_y_pred=log_reg.predict(train_x_poly)
accuracy_score(train_y, train_y_pred)

0.9045936395759717

In [201]:
test_y_pred=log_reg.predict(test_x_poly)
accuracy_score(test_y, test_y_pred)

0.8175582990397805
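
The solver stopped at its iteration limit when fitting this model, so the reported accuracies come from a model that had not fully converged. Below is a minimal sketch of the two usual remedies, on synthetic data with hypothetical names (`demo_x`, `demo_y`, `log_reg_demo`): rescaling the polynomial terms and raising `max_iter` above its default of 100. The value 5000 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 300 rows, 6 numeric features, binary target.
demo_x, demo_y = make_classification(n_samples=300, n_features=6, random_state=0)

log_reg_demo = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),                    # squared and product terms are rescaled too
    LogisticRegression(max_iter=5000),   # default max_iter is 100
)
log_reg_demo.fit(demo_x, demo_y)

# Number of solver iterations actually used for the single binary problem.
iterations_used = log_reg_demo.named_steps["logisticregression"].n_iter_[0]
print(iterations_used)
```

If `iterations_used` stays below the limit, the solver converged on its own. Lowering `C` to strengthen regularization is another option when the train and test scores diverge.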

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)
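
One way to lay out these values is a small table. The sketch below copies the accuracies printed in the cells above, rounded to four decimals, into a DataFrame. The model labels and the `gap` column (train minus test) are added here for illustration. A large positive gap is one rough signal of overfitting.

```python
import pandas as pd

# Accuracies copied from the notebook outputs above, rounded to 4 decimals.
results = pd.DataFrame(
    [
        ("LinearSVC (C=10)", 0.8345, 0.8354),
        ("SVC poly, degree 2 (C=10)", 0.9028, 0.8189),
        ("SVC rbf (C=10)", 0.9205, 0.9177),
        ("SGD, no penalty", 0.7986, 0.7929),
        ("SGD, L1 penalty", 0.8351, 0.8368),
        ("LogisticRegression, degree-2 terms", 0.9046, 0.8176),
    ],
    columns=["model", "train_accuracy", "test_accuracy"],
)
results["gap"] = (results["train_accuracy"] - results["test_accuracy"]).round(4)

# Select on TEST accuracy, as the assignment hint requires.
best = results.loc[results["test_accuracy"].idxmax(), "model"]

print(results.sort_values("test_accuracy", ascending=False).to_string(index=False))
print("Highest test accuracy:", best)
```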

## Which model performs the best and why? (0.5 points) How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (0.5 points)

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (0.5 points)