# Group work - Classification

In this assignment, we will focus on sports analytics. This data set is made available by http://www.baseball-reference.com. It contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set. Each row represents a single game. The goal is to predict the attendance at a home team’s game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons including profits.

## Description of Variables

The variables are described in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and add comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** to be completed by the first group member and checked by the second

**Section 3:** to be completed by the second group member and checked by the first

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [32]:
import numpy as np
import pandas as pd
np.random.seed(80)

In [33]:
baseball = pd.read_csv("C:/Users/lenovo/Downloads/baseball.csv")
baseball.head()

| | attendance_binary | previous_attendance | previous_away_team_errors | previous_away_team_hits | previous_away_team_runs | game_type | previous_game_type | previous_home_team_errors | previous_home_team_hits | previous_home_team_runs | game_day | previous_game_day | temperature | wind_speed | sky | previous_game_duration | previous_homewin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 43683 | 2 | 6 | 2 | Night Game | Day Game | 0 | 6 | 6 | Wednesday | Monday | 55 | 24 | Overcast | 2.933333 | 1 |
| 1 | 0 | 45785 | 0 | 7 | 2 | Night Game | Day Game | 0 | 10 | 3 | Wednesday | Monday | 48 | 7 | Unknown | 2.8 | 1 |
| 2 | 0 | 48282 | 0 | 8 | 4 | Night Game | Day Game | 2 | 4 | 3 | Wednesday | Monday | 65 | 10 | Cloudy | 3.383333 | 0 |
| 3 | 0 | 21830 | 0 | 9 | 6 | Day Game | Night Game | 0 | 15 | 11 | Wednesday | Tuesday | 77 | 0 | In Dome | 3.233333 | 1 |
| 4 | 0 | 49289 | 2 | 4 | 2 | Night Game | Day Game | 1 | 1 | 3 | Tuesday | Monday | 81 | 12 | Cloudy | 2.633333 | 1 |


In [34]:
from sklearn.model_selection import train_test_split

# Splitting the data into train and test set in 70:30 proportion.

In [35]:
train_set, test_set= train_test_split(baseball, test_size=0.3)
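
The split above relies on the global `np.random.seed(80)` call for repeatability. As a hedged alternative, the sketch below passes `random_state` and `stratify` directly to `train_test_split`. The arrays `rows` and `labels` are synthetic stand-ins for the 2,427 games, not the real data, and using these arguments would produce a different split from the one in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 2,427 rows of baseball.csv (illustration only).
rows = np.arange(2427)
labels = rows % 2  # hypothetical binary target, roughly balanced

# random_state pins the shuffle; stratify keeps the class ratio equal in both parts.
train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.3, random_state=80, stratify=labels
)

print(len(train_rows), len(test_rows))
```

With `test_size=0.3`, scikit-learn rounds the test size up, which gives the same 1,698 / 729 row counts that appear in the shapes reported later in this notebook.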

In [36]:
train_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [37]:
test_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [38]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

Separating the target variable from train and test dataset.

In [39]:
train_y=train_set[['attendance_binary']]

In [40]:
test_y= test_set[['attendance_binary']]

In [41]:
train_inputs = train_set.drop(['attendance_binary'], axis=1)

In [42]:
test_inputs= test_set.drop(['attendance_binary'], axis=1)

In [43]:
train_inputs.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [44]:
#Identifying the numerical columns and making a list of it.
numeric_columns= train_inputs.select_dtypes(include=[np.number]).columns.to_list()

In [53]:
#Identifying the categorical columns and making a list of it.
categorical_columns= train_inputs.select_dtypes('object').columns.to_list()

In [47]:
#Identifying the binary columns so we can pass them without transforming
binary_columns=['previous_homewin']

Removing Binary columns from numerical columns.

In [48]:
for col in binary_columns:
    numeric_columns.remove(col)

In [49]:
binary_columns

['previous_homewin']

In [51]:
numeric_columns

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [55]:
categorical_columns

['game_type', 'previous_game_type', 'game_day', 'previous_game_day', 'sky']

# Creating Pipelines

In [72]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [73]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [74]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [75]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')
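
To make the effect of the preprocessor easier to see, here is a minimal sketch on a tiny, made-up frame with one column of each kind. The frame `toy`, the name `toy_preprocessor`, and `sparse_threshold=0` (used only to force a dense array for printing) are assumptions for illustration and are not part of the assignment data or code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one numeric, one categorical, and one binary column.
toy = pd.DataFrame({
    "temperature": [55, 48, 65, 77],
    "sky": ["Overcast", "Unknown", "Cloudy", "Cloudy"],
    "previous_homewin": [1, 1, 0, 1],
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["temperature"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["sky"]),
    ("binary", SimpleImputer(strategy="most_frequent"), ["previous_homewin"]),
], sparse_threshold=0)  # always return a dense array

toy_x = toy_preprocessor.fit_transform(toy)
print(toy_x.shape)  # 1 scaled column + 3 one-hot columns + 1 binary column

# A category never seen during fit is encoded as all zeros instead of raising an error.
unseen = pd.DataFrame({"temperature": [70], "sky": ["In Dome"], "previous_homewin": [0]})
unseen_x = toy_preprocessor.transform(unseen)
print(unseen_x[0, 1:4])
```

The output columns follow the order of the transformer list: scaled numeric columns first, then the one-hot columns, then the binary column.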

# Transform: fit_transform() for TRAIN

In [76]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-1.48233038,  0.52604711, -0.22742019, ...,  1.        ,
         0.        ,  0.        ],
       [-0.60150979, -0.73557435, -1.65446927, ...,  0.        ,
         1.        ,  1.        ],
       [ 0.2714372 ,  0.52604711, -0.51283   , ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.1265919 , -0.73557435, -0.51283   , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.47441246, -0.73557435, -0.79823982, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.22920608,  0.52604711,  0.62880926, ...,  0.        ,
         0.        ,  1.        ]])

In [77]:
train_x.shape

(1698, 37)

# Transform: transform() for TEST

In [78]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[-0.27214792, -0.73557435,  0.34339944, ...,  0.        ,
         1.        ,  0.        ],
       [-0.39403533, -0.73557435, -0.22742019, ...,  0.        ,
         0.        ,  0.        ],
       [-0.5859671 , -0.73557435, -1.08364964, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.04769921, -0.73557435,  1.77044852, ...,  0.        ,
         0.        ,  0.        ],
       [-0.24576625,  0.52604711,  0.05798963, ...,  0.        ,
         1.        ,  0.        ],
       [-1.83009806,  0.52604711,  1.19962889, ...,  0.        ,
         0.        ,  0.        ]])

In [79]:
test_x.shape

(729, 37)

## Find the Baseline (0.5 point)

In [80]:
# Find majority class
train_y.value_counts()


attendance_binary
1                    882
0                    816
dtype: int64

In [82]:
#Find percentage
train_y.value_counts()/len(train_y)*100

attendance_binary
1                    51.943463
0                    48.056537
dtype: float64

In [None]:
## The majority class (1) accounts for about 51.9% of the training rows, so a classifier that always predicts 1 reaches a baseline accuracy of roughly 52%. A model is only worth considering if its accuracy exceeds this baseline.
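
The baseline arithmetic can be reproduced from the class counts printed above. This is a small sketch in plain Python; the variable names are illustrative. scikit-learn's `DummyClassifier(strategy="most_frequent")` would give the same number when scored on the training set.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 882, 0: 816}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Note that this is the baseline on the training split; the class ratio in the test split may differ slightly.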

# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)

## SVM Model 1:

Linear SVC

In [83]:
from sklearn.svm import LinearSVC

svm_clf = LinearSVC(C=10, multi_class='ovr')
# ravel() passes the target as a 1-D array, the shape scikit-learn expects for y
svm_clf.fit(train_x, train_y.values.ravel())

LinearSVC(C=10)

In [84]:
from sklearn.metrics import accuracy_score
# Predict the train values
train_y_pred = svm_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8409893992932862

In [85]:
#Predict the test values
test_y_pred = svm_clf.predict(test_x)
#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8203017832647462

## SVM Model 2:

SVC Model 2, kernel='poly'

In [87]:
from sklearn.svm import SVC

pol_svm2=SVC(kernel='poly', degree=2, coef0=1, C=5, gamma='scale', decision_function_shape='ovr')
# ravel() passes the target as a 1-D array, the shape scikit-learn expects for y
pol_svm2.fit(train_x, train_y.values.ravel())

SVC(C=5, coef0=1, degree=2, kernel='poly')

In [88]:
#predict  the train values
train_y_pred=pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8951707891637221

In [89]:
#predict the test values
test_y_pred=pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8175582990397805

## SVM Model 3:

SVM Model 3 with kernel = 'rbf'

In [129]:
rbf_svm = SVC(kernel="rbf", C=0.1, gamma='auto', decision_function_shape='ovr')

# ravel() passes the target as a 1-D array, the shape scikit-learn expects for y
rbf_svm.fit(train_x, train_y.values.ravel())

SVC(C=0.1, gamma='auto')

In [130]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8368669022379269

In [131]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8244170096021948
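
The three SVM cells repeat the same fit, predict, and score steps. As a hedged sketch, the loop below runs those steps for the same three configurations on synthetic data from `make_classification`, so the numbers it prints are not comparable to the baseball results. The names `models` and `scores`, and the larger `max_iter` for `LinearSVC`, are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Synthetic binary problem (illustration only, not the baseball data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "linear": LinearSVC(C=10, max_iter=10000),
    "poly": SVC(kernel="poly", degree=2, coef0=1, C=5, gamma="scale"),
    "rbf": SVC(kernel="rbf", C=0.1, gamma="auto"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # y is already 1-D here
    scores[name] = (
        accuracy_score(y_train, model.predict(X_train)),  # train accuracy
        accuracy_score(y_test, model.predict(X_test)),    # test accuracy
    )

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:7s} train={train_acc:.3f} test={test_acc:.3f}")
```

Keeping train and test accuracy side by side for each model makes it easier to compare the models on test accuracy and to spot large train-test gaps.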

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [132]:
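from sklearn.linear_model import SGDClassifier
# max_iter = maximum number of passes over the training data
# penalty = regularization term (None disables regularization)
# eta0 = initial learning rate; ignored under the default learning_rate='optimal' schedule
# tol = stopping criterion
# Note: with the default loss='hinge', SGDClassifier fits a linear SVM, not a logistic regression
sgd_logreg = SGDClassifier(max_iter=100, penalty=None, eta0=0.1, tol=0.0001)

# ravel() passes the target as a 1-D array, the shape scikit-learn expects for y
sgd_logreg.fit(train_x, train_y.values.ravel())
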
SGDClassifier(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

In [133]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.7974087161366313

In [134]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.7640603566529492
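
One detail worth checking in the SGD cells: `SGDClassifier` only uses `eta0` when `learning_rate` is `'constant'`, `'invscaling'`, or `'adaptive'`. Under the default `'optimal'` schedule the value is ignored. The sketch below contrasts the two settings on synthetic data; the variable names and `random_state=0` are assumptions added for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary problem (illustration only, not the baseball data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Default schedule: eta0 is accepted but has no effect on the updates.
default_schedule = SGDClassifier(max_iter=100, tol=1e-4, eta0=0.1, random_state=0)

# Constant schedule: every update uses a step size of eta0.
constant_schedule = SGDClassifier(
    max_iter=100, tol=1e-4, eta0=0.1, learning_rate="constant", random_state=0
)

for clf in (default_schedule, constant_schedule):
    clf.fit(X, y)
    print(clf.learning_rate, clf.loss, round(clf.score(X, y), 3))
```

Both models use the default `loss='hinge'`, so they are linear SVMs trained by SGD, which is also true of the two SGD models in this section.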

## SGD Model 2:

In [135]:

sgd_logreg = SGDClassifier(max_iter=100, penalty='l2', eta0=0.1, tol=0.0001, alpha=0.001)

# ravel() passes the target as a 1-D array, the shape scikit-learn expects for y
sgd_logreg.fit(train_x, train_y.values.ravel())

SGDClassifier(alpha=0.001, eta0=0.1, max_iter=100, tol=0.0001)

In [136]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8392226148409894

In [137]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8161865569272977

## LogisticRegression Model:

In [138]:
from sklearn.preprocessing import PolynomialFeatures

# Create second degree terms and interaction terms
poly_features = PolynomialFeatures(degree=2).fit(train_x)

train_x_poly = poly_features.transform(train_x)

test_x_poly = poly_features.transform(test_x)
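
A quick count shows how much the degree-2 expansion widens the input. With the 37 preprocessed columns reported earlier and the default `include_bias=True`, the number of output columns is the binomial coefficient C(37 + 2, 2). The sketch below uses only the standard library; the variable names are illustrative.

```python
from math import comb

n_features = 37  # columns in train_x, per train_x.shape above
degree = 2

# Total number of polynomial terms up to the given degree, including the bias term.
n_poly_features = comb(n_features + degree, degree)

# The same total, broken down by term type.
breakdown = {
    "bias": 1,
    "linear": n_features,
    "squares": n_features,
    "pairwise interactions": comb(n_features, 2),
}

print(n_poly_features, breakdown)
```

Going from 37 to several hundred columns with only 1,698 training rows is one reason a strong L1 penalty (small `C`) is a reasonable choice for the model that follows.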

In [139]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='liblinear', penalty='l1', max_iter=100, C=0.01)

# ravel() passes the target as a 1-D array, the shape scikit-learn expects for y
log_reg.fit(train_x_poly, train_y.values.ravel())

LogisticRegression(C=0.01, penalty='l1', solver='liblinear')

In [140]:
#Predict the train values
train_y_pred = log_reg.predict(train_x_poly)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8103651354534747

In [141]:
#Predict the test values
test_y_pred = log_reg.predict(test_x_poly)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8052126200274349

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

## Which model performs the best and why? (0.5 points) How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (0.5 points)

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (0.5 points)
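
As a starting point for the discussion questions, the accuracies printed in Sections 2 and 3 can be collected in one place. The sketch below copies those values by hand, selects the model with the highest test accuracy, and computes the train-test gap as a rough overfitting signal. The dictionary labels are shorthand invented here, and the gap is a heuristic rather than a formal test.

```python
# (train accuracy, test accuracy) copied from the notebook outputs above
results = {
    "LinearSVC (C=10)": (0.8409893992932862, 0.8203017832647462),
    "SVC poly (degree=2, C=5)": (0.8951707891637221, 0.8175582990397805),
    "SVC rbf (C=0.1)": (0.8368669022379269, 0.8244170096021948),
    "SGD (no penalty)": (0.7974087161366313, 0.7640603566529492),
    "SGD (l2, alpha=0.001)": (0.8392226148409894, 0.8161865569272977),
    "LogisticRegression (l1, poly)": (0.8103651354534747, 0.8052126200274349),
}
baseline = 882 / 1698  # majority-class share of the training set

# Select on TEST accuracy only, as the hint requires.
best_model = max(results, key=lambda name: results[name][1])

# A large positive train-test gap suggests overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}
largest_gap_model = max(gaps, key=gaps.get)

for name, (train, test) in results.items():
    print(f"{name:30s} train={train:.4f} test={test:.4f} gap={gaps[name]:+.4f}")
print("baseline accuracy:", round(baseline, 4))
print("best by test accuracy:", best_model)
print("largest train-test gap:", largest_gap_model)
```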