# Group work - Classification

In this assignment, we will focus on sports analytics. This data set is made available by http://www.baseball-reference.com. It contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set. Each row represents a single game. The goal is to predict the attendance at a home team’s game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons including profits.

## Description of Variables

The variables are described in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** to be completed by the first group member and checked by the second

**Section 3:** to be completed by the second group member and checked by the first

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
data_frame = pd.read_csv('baseball.csv')
data_frame.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1


In [4]:
train_set, test_set = train_test_split(data_frame, test_size=0.3)
train_set.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
951,0,12649,0,12,6,Night Game,Day Game,2,8,3,Monday,Sunday,72,0,In Dome,3.483333,0
982,1,24617,0,11,7,Night Game,Night Game,1,5,3,Saturday,Friday,79,10,Cloudy,2.8,0
1836,1,46470,0,7,5,Day Game,Night Game,1,11,7,Saturday,Friday,71,6,Sunny,2.883333,1
904,1,40256,0,15,7,Night Game,Night Game,1,6,0,Saturday,Friday,80,8,Sunny,2.916667,0
996,0,22261,0,12,7,Night Game,Night Game,1,9,5,Thursday,Tuesday,81,6,Overcast,3.066667,0


In [5]:
test_set.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
1489,1,36715,0,9,6,Night Game,Day Game,0,13,7,Monday,Sunday,78,5,Unknown,3.55,1
2010,1,21753,0,8,3,Night Game,Night Game,0,8,1,Saturday,Friday,77,0,In Dome,3.0,0
1221,1,38586,0,8,6,Night Game,Day Game,0,10,3,Friday,Thursday,90,14,Cloudy,3.583333,0
656,0,20653,1,7,2,Night Game,Night Game,3,13,7,Saturday,Friday,72,14,Cloudy,3.183333,1
1995,1,28232,0,8,3,Night Game,Night Game,2,9,5,Saturday,Friday,62,7,Cloudy,3.316667,1


In [6]:
# Checking the data type of each column
train_set.dtypes

attendance_binary              int64
previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [7]:
for item in train_set.dtypes:
    print(item)

int64
int64
int64
int64
int64
object
object
int64
int64
int64
object
object
int64
int64
object
float64
int64


In [8]:
# check for quantity of Null Values
test_set.isnull().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [9]:
# Checking for null values in the train split
train_set.isnull().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [10]:
# Verifying the split has been done correctly
print(train_set.shape)
print(test_set.shape)

(1698, 17)
(729, 17)


In [11]:
# Gathering cols in a separate list
my_cols = list(data_frame.columns)
my_cols

['attendance_binary',
 'previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'game_type',
 'previous_game_type',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'game_day',
 'previous_game_day',
 'temperature',
 'wind_speed',
 'sky',
 'previous_game_duration',
 'previous_homewin']

In [12]:
# Verifying cols quantity
len(my_cols)

17

In [13]:
# Separating target variable into a separate data frame
train_target = train_set[[my_cols[0]]]
train_target

Unnamed: 0,attendance_binary
951,0
982,1
1836,1
904,1
996,0
...,...
841,1
2114,0
2378,1
356,1


In [14]:
test_target = test_set[[my_cols[0]]]
test_target

Unnamed: 0,attendance_binary
1489,1
2010,1
1221,1
656,0
1995,1
...,...
205,1
104,1
483,0
2058,0


In [15]:
# Split target variable from the data frames
train_input = train_set.drop([my_cols[0]], axis=1)
test_input = test_set.drop([my_cols[0]], axis=1)
print(train_input.shape, test_input.shape, sep='\n')

(1698, 16)
(729, 16)


In [16]:
train_input

Unnamed: 0,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
951,12649,0,12,6,Night Game,Day Game,2,8,3,Monday,Sunday,72,0,In Dome,3.483333,0
982,24617,0,11,7,Night Game,Night Game,1,5,3,Saturday,Friday,79,10,Cloudy,2.800000,0
1836,46470,0,7,5,Day Game,Night Game,1,11,7,Saturday,Friday,71,6,Sunny,2.883333,1
904,40256,0,15,7,Night Game,Night Game,1,6,0,Saturday,Friday,80,8,Sunny,2.916667,0
996,22261,0,12,7,Night Game,Night Game,1,9,5,Thursday,Tuesday,81,6,Overcast,3.066667,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
841,24628,0,9,5,Day Game,Night Game,2,4,1,Wednesday,Tuesday,76,3,Sunny,2.750000,0
2114,16323,1,9,4,Night Game,Night Game,0,8,7,Wednesday,Tuesday,71,0,In Dome,3.183333,1
2378,37083,0,8,1,Day Game,Night Game,0,3,0,Sunday,Saturday,45,7,Sunny,2.733333,0
356,42376,0,9,4,Night Game,Night Game,0,9,8,Sunday,Saturday,70,6,Overcast,3.616667,1


In [17]:
#splitting data based on type
train_numerical_data=train_input[['previous_attendance','previous_away_team_errors','previous_away_team_hits','previous_away_team_runs',
                                  'previous_home_team_errors','previous_home_team_hits','previous_home_team_runs','temperature','wind_speed','previous_game_duration']]
train_binary_data=train_input[['previous_homewin']]
train_categorical_data=train_input[['game_type','previous_game_type','game_day','previous_game_day','sky']]

In [18]:
# Numerical value processing
# Imputation of numerical values
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(strategy="mean")
train_numeric_imputedData=imputer.fit_transform(train_numerical_data)


In [19]:
train_numeric_imputedData

array([[1.26490000e+04, 0.00000000e+00, 1.20000000e+01, ...,
        7.20000000e+01, 0.00000000e+00, 3.48333333e+00],
       [2.46170000e+04, 0.00000000e+00, 1.10000000e+01, ...,
        7.90000000e+01, 1.00000000e+01, 2.80000000e+00],
       [4.64700000e+04, 0.00000000e+00, 7.00000000e+00, ...,
        7.10000000e+01, 6.00000000e+00, 2.88333333e+00],
       ...,
       [3.70830000e+04, 0.00000000e+00, 8.00000000e+00, ...,
        4.50000000e+01, 7.00000000e+00, 2.73333333e+00],
       [4.23760000e+04, 0.00000000e+00, 9.00000000e+00, ...,
        7.00000000e+01, 6.00000000e+00, 3.61666667e+00],
       [2.81500000e+04, 0.00000000e+00, 5.00000000e+00, ...,
        7.30000000e+01, 9.00000000e+00, 2.50000000e+00]])

In [20]:
# Standardizing values
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
train_numeric_StanderdizedData=scaler.fit_transform(train_numeric_imputedData)

In [21]:
train_numeric_StanderdizedData

array([[-1.79867528, -0.71549486,  0.90572871, ..., -0.17058091,
        -1.46221514,  0.86856002],
       [-0.58101942, -0.71549486,  0.62513624, ...,  0.50034057,
         0.50485926, -0.63287796],
       [ 1.6423624 , -0.71549486, -0.49723366, ..., -0.26642684,
        -0.2819705 , -0.44977576],
       ...,
       [ 0.68730428, -0.71549486, -0.21664119, ..., -2.7584209 ,
        -0.08526306, -0.77935971],
       [ 1.22582805, -0.71549486,  0.06395129, ..., -0.36227276,
        -0.2819705 ,  1.16152353],
       [-0.22156269, -0.71549486, -1.05841861, ..., -0.07473499,
         0.30815182, -1.29204585]])

In [22]:
#Back to pandas(Numerics)
train_numeric_DataDF=pd.DataFrame(train_numeric_StanderdizedData,columns=train_numerical_data.columns).reset_index(drop=True)
train_numeric_DataDF.head()

Unnamed: 0,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,temperature,wind_speed,previous_game_duration
0,-1.798675,-0.715495,0.905729,0.491799,1.768767,-0.169446,-0.498697,-0.170581,-1.462215,0.86856
1,-0.581019,-0.715495,0.625136,0.807757,0.512654,-1.037816,-0.498697,0.500341,0.504859,-0.632878
2,1.642362,-0.715495,-0.497234,0.175842,0.512654,0.698924,0.77899,-0.266427,-0.281971,-0.449776
3,1.010134,-0.715495,1.747506,0.807757,0.512654,-0.74836,-1.456962,0.596186,0.111444,-0.376535
4,-0.820725,-0.715495,0.905729,0.807757,0.512654,0.12001,0.140147,0.692032,-0.281971,-0.046951


In [23]:
#Processing Categorical
train_categorical_data.isna().sum()
# No imputation is needed for the categorical columns because they contain no null values.

game_type             0
previous_game_type    0
game_day              0
previous_game_day     0
sky                   0
dtype: int64

In [24]:
#Categorical back to pandas
train_categorical_DataDF=pd.DataFrame(train_categorical_data,columns=train_categorical_data.columns).reset_index(drop=True)
train_categorical_DataDF

Unnamed: 0,game_type,previous_game_type,game_day,previous_game_day,sky
0,Night Game,Day Game,Monday,Sunday,In Dome
1,Night Game,Night Game,Saturday,Friday,Cloudy
2,Day Game,Night Game,Saturday,Friday,Sunny
3,Night Game,Night Game,Saturday,Friday,Sunny
4,Night Game,Night Game,Thursday,Tuesday,Overcast
...,...,...,...,...,...
1693,Day Game,Night Game,Wednesday,Tuesday,Sunny
1694,Night Game,Night Game,Wednesday,Tuesday,In Dome
1695,Day Game,Night Game,Sunday,Saturday,Sunny
1696,Night Game,Night Game,Sunday,Saturday,Overcast


In [25]:
# One-hot encoding the categorical values
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()
train_categorical_data_Encoded=encoder.fit_transform(train_categorical_DataDF)
train_categorical_data_Encoded.toarray()

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 1., 0.]])

In [26]:
encoder.categories_

[array(['Day Game', 'Night Game'], dtype=object),
 array(['Day Game', 'Night Game'], dtype=object),
 array(['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
        'Wednesday'], dtype=object),
 array(['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
        'Wednesday'], dtype=object),
 array(['Cloudy', 'Drizzle', 'In Dome', 'Night', 'Overcast', 'Rain',
        'Sunny', 'Unknown'], dtype=object)]

In [27]:
OneHotEncoder_column_names=[item for sublist in encoder.categories_ for item in sublist]
OneHotEncoder_column_names

['Day Game',
 'Night Game',
 'Day Game',
 'Night Game',
 'Friday',
 'Monday',
 'Saturday',
 'Sunday',
 'Thursday',
 'Tuesday',
 'Wednesday',
 'Friday',
 'Monday',
 'Saturday',
 'Sunday',
 'Thursday',
 'Tuesday',
 'Wednesday',
 'Cloudy',
 'Drizzle',
 'In Dome',
 'Night',
 'Overcast',
 'Rain',
 'Sunny',
 'Unknown']
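
The flattened list above repeats labels such as 'Day Game' and 'Friday' because two pairs of columns share the same categories, so a DataFrame built from it ends up with duplicate column names. One way to keep the names unique, sketched below, is to prefix each category with its source column. The `feature_names` and `categories` lists are copied by hand from the cells above, and `prefixed_names` is a hypothetical variable name. In recent scikit-learn versions, `encoder.get_feature_names_out()` should produce names of the same form.

```python
# Column names and categories copied from the cells above (encoder.categories_).
feature_names = ["game_type", "previous_game_type", "game_day", "previous_game_day", "sky"]
days = ["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"]
categories = [
    ["Day Game", "Night Game"],
    ["Day Game", "Night Game"],
    days,
    days,
    ["Cloudy", "Drizzle", "In Dome", "Night", "Overcast", "Rain", "Sunny", "Unknown"],
]

# Flattening the categories alone (as in the notebook) produces repeated labels.
flat_names = [cat for cats in categories for cat in cats]

# Prefixing each category with its source column keeps every name unique.
prefixed_names = [
    f"{feature}_{cat}"
    for feature, cats in zip(feature_names, categories)
    for cat in cats
]

print(len(flat_names), len(set(flat_names)))          # total vs. unique labels
print(len(prefixed_names), len(set(prefixed_names)))  # total vs. unique labels
print(prefixed_names[:4])
```

With unique names, each encoded column can be selected unambiguously by name.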

In [28]:
#back to pandas
train_categorical_data_Encoded_DF=pd.DataFrame(train_categorical_data_Encoded.toarray(),
                                               columns=OneHotEncoder_column_names).reset_index(drop=True)
train_categorical_data_Encoded_DF

Unnamed: 0,Day Game,Night Game,Day Game.1,Night Game.1,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,...,Tuesday.1,Wednesday,Cloudy,Drizzle,In Dome,Night,Overcast,Rain,Sunny,Unknown
0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1693,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1694,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1695,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1696,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [29]:
# Concatenate all train data
train_data_final=pd.concat((train_numeric_DataDF.reset_index(drop=True),train_categorical_data_Encoded_DF.reset_index(drop=True)
                           ,train_binary_data.reset_index(drop=True)),axis=1)
train_data_final

Unnamed: 0,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,temperature,wind_speed,previous_game_duration,...,Wednesday,Cloudy,Drizzle,In Dome,Night,Overcast,Rain,Sunny,Unknown,previous_homewin
0,-1.798675,-0.715495,0.905729,0.491799,1.768767,-0.169446,-0.498697,-0.170581,-1.462215,0.868560,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
1,-0.581019,-0.715495,0.625136,0.807757,0.512654,-1.037816,-0.498697,0.500341,0.504859,-0.632878,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,1.642362,-0.715495,-0.497234,0.175842,0.512654,0.698924,0.778990,-0.266427,-0.281971,-0.449776,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
3,1.010134,-0.715495,1.747506,0.807757,0.512654,-0.748360,-1.456962,0.596186,0.111444,-0.376535,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
4,-0.820725,-0.715495,0.905729,0.807757,0.512654,0.120010,0.140147,0.692032,-0.281971,-0.046951,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1693,-0.579900,-0.715495,0.063951,0.175842,1.768767,-1.327273,-1.137540,0.212803,-0.872093,-0.742739,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1694,-1.424873,0.546094,0.063951,-0.140115,-0.743459,-0.169446,0.778990,-0.266427,-1.462215,0.209392,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1
1695,0.687304,-0.715495,-0.216641,-1.087988,-0.743459,-1.616730,-1.456962,-2.758421,-0.085263,-0.779360,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1696,1.225828,-0.715495,0.063951,-0.140115,-0.743459,0.120010,1.098412,-0.362273,-0.281971,1.161524,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1


In [30]:
test_numerical_data=test_input[['previous_attendance','previous_away_team_errors','previous_away_team_hits','previous_away_team_runs',
                                  'previous_home_team_errors','previous_home_team_hits','previous_home_team_runs','temperature','wind_speed','previous_game_duration']]
test_binary_data=test_input[['previous_homewin']]
test_categorical_data=test_input[['game_type','previous_game_type','game_day','previous_game_day','sky']]

In [31]:
#processing numerical test data
test_numeric_imputedData=imputer.transform(test_numerical_data)
test_numeric_imputedData

array([[3.67150000e+04, 0.00000000e+00, 9.00000000e+00, ...,
        7.80000000e+01, 5.00000000e+00, 3.55000000e+00],
       [2.17530000e+04, 0.00000000e+00, 8.00000000e+00, ...,
        7.70000000e+01, 0.00000000e+00, 3.00000000e+00],
       [3.85860000e+04, 0.00000000e+00, 8.00000000e+00, ...,
        9.00000000e+01, 1.40000000e+01, 3.58333333e+00],
       ...,
       [2.94290000e+04, 2.00000000e+00, 1.20000000e+01, ...,
        7.30000000e+01, 0.00000000e+00, 3.40000000e+00],
       [2.81480000e+04, 2.00000000e+00, 5.00000000e+00, ...,
        7.30000000e+01, 0.00000000e+00, 2.95000000e+00],
       [3.17680000e+04, 0.00000000e+00, 6.00000000e+00, ...,
        9.10000000e+01, 5.00000000e+00, 3.06666667e+00]])

In [32]:
# Standardizing the numerical values
test_numeric_standerdizedData=scaler.transform(test_numeric_imputedData)
test_numeric_standerdizedData

array([[ 0.64986299, -0.71549486,  0.06395129, ...,  0.40449464,
        -0.47867794,  1.01504178],
       [-0.87241033, -0.71549486, -0.21664119, ...,  0.30864872,
        -1.46221514, -0.19343269],
       [ 0.84022346, -0.71549486, -0.21664119, ...,  1.55464575,
         1.29168902,  1.08828265],
       ...,
       [-0.09143385,  1.80768328,  0.90572871, ..., -0.07473499,
        -1.46221514,  0.68545783],
       [-0.22176617,  1.80768328, -1.05841861, ..., -0.07473499,
        -1.46221514, -0.30329401],
       [ 0.14654217, -0.71549486, -0.77782614, ...,  1.65049167,
        -0.47867794, -0.04695094]])
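
The test split is standardized with `scaler.transform`, not `fit_transform`, so that it reuses the mean and standard deviation learned from the training split. The sketch below illustrates the idea with the standard library only. The temperature values are toy numbers chosen for illustration, not the full splits, and `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev

# Toy temperature values for illustration only (not the full train/test splits).
train_temperature = [72, 79, 71, 80, 81]
test_temperature = [78, 77, 90]

# "fit": learn the statistics from the training values only.
mu = mean(train_temperature)
sigma = pstdev(train_temperature)  # population std (ddof=0), which StandardScaler also uses


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(v - mu) / sigma for v in values]


# "transform": apply the same training statistics to both splits.
train_scaled = standardize(train_temperature, mu, sigma)
test_scaled = standardize(test_temperature, mu, sigma)

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, which is expected: the test data must not influence the preprocessing.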

In [33]:
#convert numeric to pandas
test_numeric_DataDF=pd.DataFrame(test_numeric_standerdizedData,columns=test_numerical_data.columns).reset_index(drop=True)
test_numeric_DataDF.head()

Unnamed: 0,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,temperature,wind_speed,previous_game_duration
0,0.649863,-0.715495,0.063951,0.491799,-0.743459,1.277837,0.77899,0.404495,-0.478678,1.015042
1,-0.87241,-0.715495,-0.216641,-0.456073,-0.743459,-0.169446,-1.13754,0.308649,-1.462215,-0.193433
2,0.840223,-0.715495,-0.216641,0.491799,-0.743459,0.409467,-0.498697,1.554646,1.291689,1.088283
3,-0.984327,0.546094,-0.497234,-0.77203,3.024881,1.277837,0.77899,-0.170581,1.291689,0.209392
4,-0.21322,-0.715495,-0.216641,-0.456073,1.768767,0.12001,0.140147,-1.12904,-0.085263,0.502356


In [34]:
#Processing Categorical Test Data
test_categorical_data.isna().sum()

game_type             0
previous_game_type    0
game_day              0
previous_game_day     0
sky                   0
dtype: int64

In [35]:
#No null values in categorical data, hence converting back into pandas
test_categorical_dataDF=pd.DataFrame(test_categorical_data,columns=test_categorical_data.columns).reset_index(drop=True)
test_categorical_dataDF

Unnamed: 0,game_type,previous_game_type,game_day,previous_game_day,sky
0,Night Game,Day Game,Monday,Sunday,Unknown
1,Night Game,Night Game,Saturday,Friday,In Dome
2,Night Game,Day Game,Friday,Thursday,Cloudy
3,Night Game,Night Game,Saturday,Friday,Cloudy
4,Night Game,Night Game,Saturday,Friday,Cloudy
...,...,...,...,...,...
724,Day Game,Night Game,Sunday,Saturday,Cloudy
725,Day Game,Night Game,Wednesday,Tuesday,Unknown
726,Night Game,Night Game,Saturday,Friday,In Dome
727,Night Game,Day Game,Monday,Sunday,In Dome


In [36]:
#One Hot Encoding test categorical data
test_categorical_data_Encoded=encoder.transform(test_categorical_dataDF)
test_categorical_data_Encoded.toarray()

array([[0., 1., 1., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.]])

In [37]:
#Back to pandas
test_categorical_data_Encoded_DF=pd.DataFrame(test_categorical_data_Encoded.toarray(),
                                               columns=OneHotEncoder_column_names).reset_index(drop=True)
test_categorical_data_Encoded_DF

Unnamed: 0,Day Game,Night Game,Day Game.1,Night Game.1,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,...,Tuesday.1,Wednesday,Cloudy,Drizzle,In Dome,Night,Overcast,Rain,Sunny,Unknown
0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
724,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
725,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
726,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
727,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [38]:
# Concatenate all into one DataFrame
test_final_data=pd.concat((test_numeric_DataDF.reset_index(drop=True),test_categorical_data_Encoded_DF.reset_index(drop=True),
                           test_binary_data.reset_index(drop=True)),axis=1)
test_final_data

Unnamed: 0,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,temperature,wind_speed,previous_game_duration,...,Wednesday,Cloudy,Drizzle,In Dome,Night,Overcast,Rain,Sunny,Unknown,previous_homewin
0,0.649863,-0.715495,0.063951,0.491799,-0.743459,1.277837,0.778990,0.404495,-0.478678,1.015042,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
1,-0.872410,-0.715495,-0.216641,-0.456073,-0.743459,-0.169446,-1.137540,0.308649,-1.462215,-0.193433,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
2,0.840223,-0.715495,-0.216641,0.491799,-0.743459,0.409467,-0.498697,1.554646,1.291689,1.088283,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,-0.984327,0.546094,-0.497234,-0.772030,3.024881,1.277837,0.778990,-0.170581,1.291689,0.209392,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,-0.213220,-0.715495,-0.216641,-0.456073,1.768767,0.120010,0.140147,-1.129040,-0.085263,0.502356,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
724,0.164652,-0.715495,-0.497234,-0.772030,-0.743459,0.409467,0.140147,1.554646,-0.872093,-0.486396,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
725,0.281656,-0.715495,-1.058419,-0.140115,-0.743459,0.698924,0.778990,0.883724,-0.085263,-0.632878,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
726,-0.091434,1.807683,0.905729,1.755629,-0.743459,-0.169446,0.459568,-0.074735,-1.462215,0.685458,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
727,-0.221766,1.807683,-1.058419,-1.087988,-0.743459,0.120010,0.140147,-0.074735,-1.462215,-0.303294,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1


## Find the Baseline (0.5 point)

In [39]:
train_target.value_counts()


attendance_binary
1                    879
0                    819
dtype: int64

In [40]:
train_target.value_counts()/len(train_target)

attendance_binary
1                    0.517668
0                    0.482332
dtype: float64
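
The baseline is the accuracy of a classifier that always predicts the majority class. As a small worked check, the sketch below recomputes it from the class counts shown above. The counts are copied by hand from the `value_counts()` output, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 879, 0: 819}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Always predicting the majority class is correct for exactly that share of rows.
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.6f}")
```

Any model worth keeping should score clearly above this value on the test split.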

# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)

## SVM Model 1:

In [41]:
# Linear SVC
from sklearn.svm import LinearSVC
svm_clf = LinearSVC(C=10)
# ravel() passes the target as a 1-D array, which avoids the DataConversionWarning
svm_clf.fit(train_data_final, train_target.values.ravel())


LinearSVC(C=10)

In [60]:
from sklearn.metrics import accuracy_score
train_y_pred=svm_clf.predict(train_data_final)
accuracy_score(train_target,train_y_pred)



0.8356890459363958

In [59]:
from sklearn.metrics import accuracy_score
test_y_pred = svm_clf.predict(test_final_data)
accuracy_score(test_target, test_y_pred)

0.8367626886145405

In [142]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(train_target,train_y_pred)

array([[397, 421],
       [460, 420]], dtype=int64)

In [143]:
#classification report
from sklearn.metrics import classification_report
print(classification_report(train_target,train_y_pred))

              precision    recall  f1-score   support

           0       0.46      0.49      0.47       818
           1       0.50      0.48      0.49       880

    accuracy                           0.48      1698
   macro avg       0.48      0.48      0.48      1698
weighted avg       0.48      0.48      0.48      1698
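
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand, using the scikit-learn convention that rows are true classes and columns are predicted classes. The matrix is copied from the output above, and `precision` and `recall` are hypothetical helper functions.

```python
# Confusion matrix copied from the output above.
# scikit-learn convention: rows are true classes, columns are predicted classes.
cm = [[397, 421],
      [460, 420]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / predicted_k


def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])


print(f"accuracy = {accuracy:.4f}")
for k in (0, 1):
    print(f"class {k}: precision = {precision(cm, k):.2f}, recall = {recall(cm, k):.2f}")
```

The accuracy implied by this matrix is about 0.48, whereas `accuracy_score` on the same training predictions reported 0.8357 earlier. The class supports here (818 and 880) also differ from the counts in the baseline section (819 and 879). A plausible explanation, stated as an assumption, is that the cells were run out of order after the data was re-split without a fixed `random_state`. Re-running the notebook from top to bottom should make these outputs consistent.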



## SVM Model 2:

In [144]:
# Linear SVC on degree-2 polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly_fea = PolynomialFeatures(degree=2, include_bias=False)
train_poly_x = poly_fea.fit_transform(train_data_final)
test_poly_x = poly_fea.transform(test_final_data)
poly_SVM = LinearSVC(C=10)
# ravel() passes the target as a 1-D array, which avoids the DataConversionWarning
poly_SVM.fit(train_poly_x, train_target.values.ravel())
# Predict on the training data
trained_poly_y_predicted = poly_SVM.predict(train_poly_x)
# Training accuracy
accuracy_score(train_target, trained_poly_y_predicted)

0.8828032979976443

In [145]:
# Predict on the test data
test_poly_y_predicted = poly_SVM.predict(test_poly_x)
# Test accuracy
accuracy_score(test_target, test_poly_y_predicted)

0.7969821673525377

## SVM Model 3:

In [146]:
# RBF kernel
from sklearn.svm import SVC
rbf_kernel_svm = SVC(kernel="rbf", C=10, gamma="scale")
# ravel() passes the target as a 1-D array, which avoids the DataConversionWarning
rbf_kernel_svm.fit(train_data_final, train_target.values.ravel())


SVC(C=10)

In [147]:
#predicting train values
train_y_predicted_rbf=rbf_kernel_svm.predict(train_data_final)
#Accuracy of train predictions
accuracy_score(train_target,train_y_predicted_rbf)

0.9758539458186102

In [148]:
#predicting test values
test_y_predicted_rbf=rbf_kernel_svm.predict(test_final_data)
#accuracy of test predictions
accuracy_score(test_target,test_y_predicted_rbf)

0.7969821673525377

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

### SGD Model 1: L2 Penalty

### Loading the Necessary Libraries for SGD analysis

In [149]:
from sklearn.linear_model import SGDClassifier

In [164]:
# Instantiate the classifier (penalty defaults to 'l2') and fit it
sgdc_l2 = SGDClassifier(max_iter=1000, tol=0.1)
# ravel() passes the target as a 1-D array, which avoids the DataConversionWarning
sgdc_l2.fit(train_data_final, train_target.values.ravel())


SGDClassifier(tol=0.1)

In [165]:
# Training Score
training_score = sgdc_l2.score(train_data_final, train_target)
print(f'Training Score is: {training_score}')


Training Score is: 0.7838633686690224


In [166]:
test_score = sgdc_l2.score(test_final_data, test_target)
print(f'Test Score is: {test_score}')


Test Score is: 0.7901234567901234


In [167]:
# Prediction
target_prediction = sgdc_l2.predict(test_final_data)
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
confusion_matrix(test_target, target_prediction)


array([[314,  36],
       [117, 262]], dtype=int64)

In [168]:
print(classification_report(test_target, target_prediction))

              precision    recall  f1-score   support

           0       0.73      0.90      0.80       350
           1       0.88      0.69      0.77       379

    accuracy                           0.79       729
   macro avg       0.80      0.79      0.79       729
weighted avg       0.81      0.79      0.79       729

## SGD Model 2:

### SGD Model 2: Elasticnet Penalty

In [178]:
# Instantiate the classifier with an elastic-net penalty and fit it
sgdc_elastic = SGDClassifier(max_iter=2000, tol=0.1, penalty='elasticnet')
# ravel() passes the target as a 1-D array, which avoids the DataConversionWarning
sgdc_elastic.fit(train_data_final, train_target.values.ravel())


SGDClassifier(max_iter=2000, penalty='elasticnet', tol=0.1)

In [179]:
# Training Score
training_score = sgdc_elastic.score(train_data_final, train_target)
print(f'Training Score is: {training_score}')


Training Score is: 0.8286219081272085


In [180]:
test_score = sgdc_elastic.score(test_final_data, test_target)
print(f'Test Score is: {test_score}')


Test Score is: 0.8093278463648834


In [181]:
# Prediction
target_prediction = sgdc_elastic.predict(test_final_data)
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
confusion_matrix(test_target, target_prediction)

array([[289,  61],
       [ 78, 301]], dtype=int64)

In [182]:
print(classification_report(test_target, target_prediction))

              precision    recall  f1-score   support

           0       0.79      0.83      0.81       350
           1       0.83      0.79      0.81       379

    accuracy                           0.81       729
   macro avg       0.81      0.81      0.81       729
weighted avg       0.81      0.81      0.81       729

## LogisticRegression Model:

In [183]:
from sklearn.linear_model import LogisticRegression

In [187]:
logisticReg = LogisticRegression()
# ravel() passes the target as a 1-D array, which avoids the DataConversionWarning
logisticReg.fit(train_data_final, train_target.values.ravel())


LogisticRegression()

In [188]:
# Training Score
training_score = logisticReg.score(train_data_final, train_target)
print(f'Training Score is: {training_score}')

Training Score is: 0.823321554770318


In [189]:
test_score = logisticReg.score(test_final_data, test_target)
print(f'Test Score is: {test_score}')


Test Score is: 0.8353909465020576


In [190]:
# Prediction (use the logistic regression model, not sgdc_elastic)
target_prediction = logisticReg.predict(test_final_data)
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
confusion_matrix(test_target, target_prediction)

In [191]:
print(classification_report(test_target, target_prediction))

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

### SGD Model 1:
> ***Training Score is: 0.7838633686690224***

> ***Test Score is: 0.7901234567901234***

### SGD Model 2:
> ***Training Score is: 0.8286219081272085***

> ***Test Score is: 0.8093278463648834***

### Logistic Regression:
> ***Training Score is: 0.823321554770318***

> ***Test Score is: 0.8353909465020576***

### SVM Model 1:
> ***Training Score is: 0.8356890459363958***

> ***Test Score is: 0.8367626886145405***

### SVM Model 2:
> ***Training Score is: 0.8828032979976443***

> ***Test Score is: 0.7969821673525377***

### SVM Model 3:
> ***Training Score is: 0.9758539458186102***

> ***Test Score is: 0.7969821673525377***

## Which model performs the best and why? (0.5 points) How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

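> ***The best model by test score is SVM Model 1 (LinearSVC with C=10), with a test score of 0.8368. Logistic Regression is a close second at 0.8354. Both are well above the baseline accuracy of 0.5177 obtained by always predicting the majority class.***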
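
As a quick check of the ranking, the sketch below selects the model with the highest test score from the values listed in the previous section and compares it with the baseline. The scores are copied by hand from that list, and the `scores` dictionary is a hypothetical structure used only for illustration.

```python
# (training score, test score) per model, copied from the list above.
scores = {
    "SGD Model 1": (0.7838633686690224, 0.7901234567901234),
    "SGD Model 2": (0.8286219081272085, 0.8093278463648834),
    "Logistic Regression": (0.823321554770318, 0.8353909465020576),
    "SVM Model 1": (0.8356890459363958, 0.8367626886145405),
    "SVM Model 2": (0.8828032979976443, 0.7969821673525377),
    "SVM Model 3": (0.9758539458186102, 0.7969821673525377),
}
baseline = 0.517668  # majority-class share from the baseline section

# Rank by TEST score only; training scores play no role in the selection.
best_name = max(scores, key=lambda name: scores[name][1])
best_train, best_test = scores[best_name]

print(f"Best model by test score: {best_name} ({best_test:.4f})")
print(f"Improvement over baseline: {best_test - baseline:.4f}")
```

The margin between the top two models is small (about 0.001, roughly one game out of 729), so the ranking could change with a different train/test split.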

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (0.5 points)

***There is no evidence of overfitting in the top-scoring models. A model overfits when its training score is clearly higher than its test score, but here the test score is marginally higher than the training score for both SVM Model 1 (0.8368 test vs. 0.8357 training) and Logistic Regression (0.8354 test vs. 0.8233 training). Had there been a gap, the parameter to adjust would be the regularization strength 'C' (a smaller 'C' means stronger regularization) rather than 'tol', which only sets the stopping tolerance of the optimizer.***

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (0.5 points)

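Yes, there is evidence of overfitting in SVM Models 2 and 3: their training scores (0.8828 and 0.9759) are well above their shared test score of 0.7970. This overfitting could be reduced by lowering C, which strengthens the regularization. SGD Model 2 shows only a small gap (0.8286 training vs. 0.8093 test), and SGD Model 1 shows none.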
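
One simple way to screen for overfitting is to compare the gap between training and test scores for each model. The sketch below does this with the scores listed earlier. The 0.05 threshold is an arbitrary assumption chosen for illustration, not a standard cutoff.

```python
# (training score, test score) per model, copied from the list of scores above.
scores = {
    "SGD Model 1": (0.7838633686690224, 0.7901234567901234),
    "SGD Model 2": (0.8286219081272085, 0.8093278463648834),
    "Logistic Regression": (0.823321554770318, 0.8353909465020576),
    "SVM Model 1": (0.8356890459363958, 0.8367626886145405),
    "SVM Model 2": (0.8828032979976443, 0.7969821673525377),
    "SVM Model 3": (0.9758539458186102, 0.7969821673525377),
}

GAP_THRESHOLD = 0.05  # arbitrary illustrative cutoff (assumption)

# A large positive gap (train much higher than test) suggests overfitting.
gaps = {name: train - test for name, (train, test) in scores.items()}
flagged = [name for name, gap in gaps.items() if gap > GAP_THRESHOLD]

for name, gap in gaps.items():
    print(f"{name:20s} train - test = {gap:+.4f}")
print("Possible overfitting:", flagged)
```

Under this threshold only SVM Models 2 and 3 are flagged, which matches the answer above.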
