<a href="https://colab.research.google.com/github/NagababuVeganti/temp/blob/main/Group_work_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Group work - Classification

In this assignment, we will focus on sports analytics. This data set is made available by http://www.baseball-reference.com. It contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set. Each row represents a single game. The goal is to predict the attendance at a home team’s game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons including profits.

## Description of Variables

The descriptions of the variables are provided in "Baseball - Data Dictionary.docx".

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Recommended roles for group members:

**Section 1:** to be completed by both group members

**Section 2:** to be completed by the first group member and checked by the second

**Section 3:** to be completed by the second group member and checked by the first

**Important notes:**
- Both group members will get the same grade. Therefore, you should check the work of your group member. If they make a mistake, you will be responsible for that mistake too.
- Both group members must put in their fair share of effort. Otherwise, those who don't contribute to the assignment will not receive any grade.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [None]:
#Loading the required  libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(46046025)

In [None]:
#Loading the Dataset
game_data=pd.read_csv('baseball.csv')

In [None]:
#Preview the Loaded data
game_data.head()

| | attendance_binary | previous_attendance | previous_away_team_errors | previous_away_team_hits | previous_away_team_runs | game_type | previous_game_type | previous_home_team_errors | previous_home_team_hits | previous_home_team_runs | game_day | previous_game_day | temperature | wind_speed | sky | previous_game_duration | previous_homewin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 43683 | 2 | 6 | 2 | Night Game | Day Game | 0 | 6 | 6 | Wednesday | Monday | 55 | 24 | Overcast | 2.933333 | 1 |
| 1 | 0 | 45785 | 0 | 7 | 2 | Night Game | Day Game | 0 | 10 | 3 | Wednesday | Monday | 48 | 7 | Unknown | 2.8 | 1 |
| 2 | 0 | 48282 | 0 | 8 | 4 | Night Game | Day Game | 2 | 4 | 3 | Wednesday | Monday | 65 | 10 | Cloudy | 3.383333 | 0 |
| 3 | 0 | 21830 | 0 | 9 | 6 | Day Game | Night Game | 0 | 15 | 11 | Wednesday | Tuesday | 77 | 0 | In Dome | 3.233333 | 1 |
| 4 | 0 | 49289 | 2 | 4 | 2 | Night Game | Day Game | 1 | 1 | 3 | Tuesday | Monday | 81 | 12 | Cloudy | 2.633333 | 1 |


# Split the data into train and test

In [None]:
train_set, test_set = train_test_split(game_data, test_size=0.25)
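
The split above depends on the global NumPy seed set earlier in the notebook. As a minimal sketch, assuming a small hypothetical DataFrame (`toy`) in place of baseball.csv, the split can also be made reproducible on its own by passing `random_state`, and `stratify` keeps the class ratio the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for baseball.csv
toy = pd.DataFrame({
    "temperature": range(40, 80),
    "attendance_binary": [0, 1] * 20,
})

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["attendance_binary"]
)

print(toy_train.shape, toy_test.shape)
print(toy_test["attendance_binary"].mean())
```

The value 42 is arbitrary; any fixed integer gives a repeatable split.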

## Check the missing values


In [None]:
train_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [None]:
test_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

**Observation**: Neither the train set nor the test set has any missing values, so no imputation is strictly needed for these columns.

# Data Prep

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [None]:
train_y = train_set[['attendance_binary']]
test_y = test_set[['attendance_binary']]

train_inputs = train_set.drop(['attendance_binary'], axis=1)
test_inputs = test_set.drop(['attendance_binary'], axis=1)
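
Selecting the target with double brackets returns a one-column DataFrame, which is why the later `fit` calls print a `column_or_1d` warning. The sketch below, on hypothetical toy data, shows how the two selection styles differ in shape and how `.values.ravel()` flattens the DataFrame version:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"attendance_binary": [0, 1, 1], "temperature": [55, 48, 65]})

y_frame = toy[["attendance_binary"]]   # DataFrame, shape (3, 1)
y_series = toy["attendance_binary"]    # Series, shape (3,)
y_flat = y_frame.values.ravel()        # 1-D NumPy array, shape (3,)

print(y_frame.shape, y_series.shape, y_flat.shape)
```

Passing either the Series or the flattened array to `fit` gives scikit-learn the 1-D target it expects.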

# Building Data pipeline to process the Data

In [None]:
train_inputs.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [None]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [None]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['previous_homewin']

In [None]:
#we need to remove the binary columns from numerical columns.
for col in binary_columns:
    numeric_columns.remove(col)

# Pipeline

In [None]:
# Impute missing numeric values with the column mean, then standardize
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='mean')),
                ('scaler', StandardScaler())])

In [None]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [None]:
#passing the pipe line to ColumnTransformer 

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional. You don't have to use it.
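
To see what the preprocessor produces, here is a minimal sketch on a hypothetical three-column DataFrame (`toy`). It uses the same kinds of steps as above but is a simplified stand-in, not the notebook's exact pipeline:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy data with one numeric, one categorical and one binary column
toy = pd.DataFrame({
    "temperature": [55, 48, 65, 77],
    "sky": ["Overcast", "Cloudy", "Cloudy", "In Dome"],
    "previous_homewin": [1, 1, 0, 1],
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["temperature"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sky"]),
    ("binary", "passthrough", ["previous_homewin"]),
])

toy_x = toy_preprocessor.fit_transform(toy)

# 1 scaled column + 3 one-hot columns (one per sky value) + 1 binary column
print(toy_x.shape)
```

The same counting explains the 37 columns reported for `train_x`: every distinct category value becomes its own column.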

# Transform: fit_transform() for TRAIN

In [None]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x.shape

(1820, 37)

# Transform: transform() for TEST

In [None]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x.shape

(607, 37)

In [None]:
train_x_copy=np.copy(train_x)
test_x_copy=np.copy(test_x)

## Find the Baseline (0.5 point)

In [None]:
# Find majority class
train_y.value_counts()
# Find percentage
print(train_y.value_counts()/len(train_y))

attendance_binary
1                    0.512637
0                    0.487363
dtype: float64


In [None]:
# The majority class (1) makes up 51.3% of the train set, so the baseline accuracy is about 51%.
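
The baseline can also be obtained from scikit-learn's `DummyClassifier`, which always predicts the majority class. A small sketch, assuming hypothetical labels (`toy_y`) with a 51/49 balance similar to the train set:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the same balance as the training set
toy_y = np.array([1] * 51 + [0] * 49)
toy_X = np.zeros((len(toy_y), 1))  # features are ignored by this strategy

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)

# Accuracy of always predicting the majority class
print(baseline.score(toy_X, toy_y))
```

Fitting the same classifier on `train_x` and scoring it on `test_x` would give a baseline measured on the test set as well.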

# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)

## SVM Model 1:

**SVM with Linear Kernel**

In [None]:
from sklearn.svm import SVC
 
model1 = SVC(kernel="linear",C=10)

model1.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


SVC(C=10, kernel='linear')

In [None]:
from sklearn.metrics import accuracy_score
#Predict the train values
train_y_pred = model1.predict(train_x)

#Train accuracy
print("Accuracy on Train set:",accuracy_score(train_y, train_y_pred))

Accuracy on Train set: 0.8340659340659341


In [None]:
#Predict the test values
test_y_pred = model1.predict(test_x)

#Test accuracy
print("Accuracy on Test set:",accuracy_score(test_y, test_y_pred))

Accuracy on Test set: 0.8434925864909391


**Analysis(Model 1):**
1.   Train accuracy (83.4%) and test accuracy (84.3%) are both well above the 51% baseline, so the model is a decent fit.
2.   Test accuracy is not lower than train accuracy, so there is no evidence of overfitting.



## SVM Model 2:

**Model 2**:
 Here we build a linear SVM on third-degree polynomial features.

In [None]:
#Generating the polynomial Terms

from sklearn.preprocessing import PolynomialFeatures

# Create third degree terms
poly_features = PolynomialFeatures(degree=3, include_bias=False)

train_x_poly = poly_features.fit_transform(train_x)

test_x_poly = poly_features.transform(test_x)



In [None]:
#Data Dimensions 
train_x_poly.shape

(1820, 9879)

In [None]:
# Increased max_iter to 3000 to help the solver converge on the decision boundary
from sklearn.svm import LinearSVC 

pol_svm = LinearSVC(C=10,max_iter=3000)

pol_svm.fit(train_x_poly, train_y)

  y = column_or_1d(y, warn=True)


LinearSVC(C=10, max_iter=3000)

In [None]:
#Predict the train values
train_y_poly_pred = pol_svm.predict(train_x_poly)

#Train accuracy
print("Accuracy of Train set:",accuracy_score(train_y, train_y_poly_pred))

Accuracy of Train set: 1.0


In [None]:
#Predict the test values
test_y_poly_pred = pol_svm.predict(test_x_poly)

#Test accuracy
print("Accuracy of Test set:",accuracy_score(test_y, test_y_poly_pred))

Accuracy of Test set: 0.7528830313014827


**Analysis (Model 2):**
*   This model is clearly overfitting: train accuracy is 1.0 (100%), while test accuracy is considerably lower (75.3%).

*   The likely cause is the very large number of features (9,879) compared with only 1,820 training rows.
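
The feature count follows from a standard counting formula: a full polynomial expansion of `n` inputs up to degree `d` has C(n + d, d) terms, minus one when the bias column is excluded. A short sketch (the helper name `n_poly_features` is hypothetical):

```python
from math import comb

def n_poly_features(n_inputs, degree, include_bias=False):
    """Number of columns in a full polynomial expansion up to `degree`."""
    total = comb(n_inputs + degree, degree)
    return total if include_bias else total - 1

print(n_poly_features(37, 3))  # 9879, matching train_x_poly.shape[1]
print(n_poly_features(37, 2))  # 740
```

Dropping from degree 3 to degree 2 cuts the feature count by more than a factor of ten, which is one way to reduce the overfitting seen here.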




### Sub Model 2.1 (reducing the polynomial degree to 2)



(1820, 37)

In [None]:
from sklearn.svm import SVC

# For the poly kernel, gamma scales the inner product before it is raised to the given degree
# gamma='scale' is the default in current scikit-learn releases

pol_svm2 = SVC(kernel="poly", degree=2, coef0=2, C=8, gamma='scale')

pol_svm2.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


SVC(C=8, coef0=2, degree=2, kernel='poly')

In [None]:
#Predict the train values
train_y_pred = pol_svm2.predict(train_x)

#Train accuracy: compare the true labels with the predictions
accuracy_score(train_y, train_y_pred)

In [None]:
#Predict the test values
test_y_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8270181219110379

## SVM Model 3:

Below we build an SVM model using the RBF kernel.

In [None]:
# Passing "rbf" as the kernel parameter makes SVC use the RBF kernel.
rbf_svm = SVC(kernel="rbf", C=6, gamma='scale')

rbf_svm.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


SVC(C=6)

In [None]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
print("Accuracy on the train set:",accuracy_score(train_y, train_y_pred))

Accuracy on the train set: 0.9538461538461539


In [None]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
print("Accuracy on the test set:",accuracy_score(test_y, test_y_pred))

Accuracy on the test set: 0.8088962108731467


**Analysis:**


*   The RBF model scores 95.4% on the train set but only 80.9% on the test set. This gap of about 14.5 points is evidence of overfitting, and the test accuracy is below that of the linear-kernel model (84.3%).
*   A smaller C (stronger regularization) or a different gamma may narrow the gap and improve test accuracy.
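
One common way to tune C and gamma is a cross-validated grid search. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_x` and `train_y`, and the grid values are hypothetical starting points rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid; a smaller C means stronger regularization
param_grid = {"C": [0.1, 1, 6, 10], "gamma": ["scale", 0.01]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search only uses cross-validation folds of the training data, the test set stays untouched until the final evaluation.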



# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [None]:
#Importing required modules.
from sklearn.linear_model import SGDClassifier


In [None]:
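# loss="log" is logistic regression; scikit-learn 1.1+ names it "log_loss" and 1.3 removed "log"
sgd_model1 = SGDClassifier(loss="log", penalty="l2", max_iter=1500, tol=1e-3)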
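
The cell above only constructs the classifier. As a hedged sketch of the remaining steps, the code below fits two SGD variants that differ in penalty and reports train and test accuracy. It runs on synthetic data from `make_classification` in place of `train_x` and `test_x`, and all names in it (`sgd_variants`, `sgd_scores`) are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed baseball data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The default hinge loss is used so the sketch runs on any recent scikit-learn version
sgd_variants = {
    "l2": SGDClassifier(penalty="l2", max_iter=1500, tol=1e-3, random_state=0),
    "l1": SGDClassifier(penalty="l1", max_iter=1500, tol=1e-3, random_state=0),
}

sgd_scores = {}
for name, model in sgd_variants.items():
    model.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each variant
    sgd_scores[name] = (
        accuracy_score(y_tr, model.predict(X_tr)),
        accuracy_score(y_te, model.predict(X_te)),
    )
    print(name, sgd_scores[name])
```

Comparing the two tuples side by side shows whether changing the penalty alters the train/test gap.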


## SGD Model 2:

## LogisticRegression Model:

In [106]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression(penalty='l2')

log_reg.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


LogisticRegression()

In [None]:
#Predict the train values
train_y_pred = log_reg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8324175824175825

In [None]:
#Predict the test values
test_y_pred = log_reg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8434925864909391

In [None]:
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[236,  45],
       [ 50, 276]])

In [None]:
from sklearn.metrics import classification_report

#We usually create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83       281
           1       0.86      0.85      0.85       326

    accuracy                           0.84       607
   macro avg       0.84      0.84      0.84       607
weighted avg       0.84      0.84      0.84       607



**Analysis (Logistic Regression):**

*   We think the model fit is good, with no sign of overfitting or underfitting.
*   Test accuracy (84.3%) is slightly higher than train accuracy (83.2%). We think this is due to chance in the particular 75%/25% train/test split rather than a real effect.
*   Both scores are well above the baseline accuracy (51%), so we consider them acceptable.
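
Cross-validation is one way to check how much accuracy moves from split to split. The sketch below runs 5-fold cross-validation for logistic regression on synthetic data, which stands in for `train_x` and `train_y`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Accuracy on each of 5 folds; the spread shows how sensitive the score is to the split
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print(cv_scores.mean(), cv_scores.std())
```

If the standard deviation across folds is around one percentage point, a one-point difference between train and test accuracy is within normal variation.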




# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

## Which model performs the best and why? (0.5 points) How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (0.5 points)

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (0.5 points)