# **Binary Classification using SVM**
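The objective of an SVM model is to find the maximum margin classifier: the separating hyperplane that lies as far as possible from the nearest training points of each class. Maximizing the margin helps to reduce the hypothesis space, the effect of high dimensionality, and the computation required.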

The training points that the margin touches are called support vectors. These vectors alone define the decision boundary, so they are enough to classify all other points.
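
The role of support vectors can be checked on a small example. The sketch below uses made-up, linearly separable points (not the hepatitis data) and a large `C` to approximate a hard margin. It then refits the classifier on the support vectors alone and compares the predictions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable points (hypothetical values for illustration)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [3, 3], [4, 3], [3, 4], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A large C approximates a hard-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)
sv_idx = clf.support_              # indices of the support vectors in X
print(clf.support_vectors_)

# Refit using only the support vectors and compare predictions on all points
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])
same = np.array_equal(clf.predict(X), clf_sv.predict(X))
print("Same predictions:", same)
```

Only a subset of the training points ends up in `support_`, which is why the fitted model can discard the rest.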

## Problem Statement

Predict whether a person will survive or not, based on the diagnostic factors influencing hepatitis.

## Dataset: _Hepatitis_ 

This dataset contains occurrences of hepatitis in people.

The dataset is obtained from the UCI Machine Learning Repository. It includes 155 records in two classes: 32 cases labeled DIE and 123 cases labeled LIVE. The dataset has 20 attributes (14 binary and 6 numerical).

### **Attribute information:**

1. **target**: DIE (1), LIVE (2)
2. **age**: 10, 20, 30, 40, 50, 60, 70, 80
3. **gender**: male (1), female (2)

For attributes 4 to 15, the coding is yes = 1 and no = 2.

4. **steroid**: no, yes 
5. **antivirals**: no, yes 
6. **fatigue**: no, yes 
7. **malaise**: no, yes 
8. **anorexia**: no, yes 
9. **liverBig**: no, yes 
10. **liverFirm**: no, yes 
11. **spleen**: no, yes 
12. **spiders**: no, yes
13. **ascites**: no, yes 
14. **varices**: no, yes
15. **histology**: no, yes


16. **bilirubin** (column `bili`): 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
17. **alk**: 33, 80, 120, 160, 200, 250
18. **sgot**: 13, 100, 200, 300, 400, 500
19. **albu**: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
20. **protime**: 10, 20, 30, 40, 50, 60, 70, 80, 90

Missing values are represented with "?".

### Identify Right Error Metrics

Choose the error metrics according to the business objective.

#### Confusion Matrix
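
With 123 of 155 records labeled LIVE, a model that always predicts LIVE would already reach about 79% accuracy, so accuracy alone says little about how well the DIE cases are detected. The sketch below uses a handful of made-up labels (coded 1 = DIE, 2 = LIVE as in the dataset) to show how recall, precision and F1 for the DIE class follow from the confusion matrix. Treating class 1 as the positive class is an assumption that matches scikit-learn's default `pos_label=1`.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             recall_score, precision_score, f1_score)

# Hypothetical labels using the dataset's coding: 1 = DIE, 2 = LIVE
y_true = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
y_pred = np.array([1, 1, 2, 2, 1, 2, 2, 2, 2, 2])

# Rows are actual classes, columns are predicted classes, ordered [1, 2]
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
tp, fn = cm[0]   # actual DIE:  predicted DIE, predicted LIVE
fp, tn = cm[1]   # actual LIVE: predicted DIE, predicted LIVE

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print(cm)
print(accuracy, recall, precision, f1)

# The scikit-learn helpers give the same numbers with pos_label=1
sk_accuracy = accuracy_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred, pos_label=1)
sk_precision = precision_score(y_true, y_pred, pos_label=1)
sk_f1 = f1_score(y_true, y_pred, pos_label=1)
```

If missing a DIE case is the costlier mistake, recall on class 1 is the number to watch.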

#### Code to ignore warnings

In [1]:
import warnings
warnings.filterwarnings("ignore")

#### Loading the required libraries

In [2]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.impute import SimpleImputer

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score

from sklearn.model_selection import GridSearchCV

## Read the HEPATITIS dataset

In [5]:
data = pd.read_excel("C:/Users/gsk44/OneDrive/Desktop/SVM/hepatitis.xlsx", na_values="?")
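
The `na_values="?"` argument tells pandas to treat every "?" cell as missing, so the affected columns are parsed as numbers instead of strings. A minimal sketch of the same idea, using `pd.read_csv` on a made-up in-memory extract (the column values are hypothetical) so that it runs without the Excel file:

```python
import io
import pandas as pd

# Hypothetical three-row extract; "?" marks a missing value as in the hepatitis file
raw = io.StringIO("age,protime,steroid\n30,?,1\n31,80,?\n34,?,2\n")

df = pd.read_csv(raw, na_values="?")
print(df.dtypes)        # columns containing "?" are read as float64, not object
print(df.isna().sum())  # number of missing values per column
```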

## _Exploratory Data Analysis_

### Check the dimensions (rows and columns)

In [6]:
data.shape

(155, 21)

### Check the datatype of each variable

In [7]:
data.dtypes

ID            float64
target        float64
age           float64
gender        float64
steroid       float64
antivirals    float64
fatigue       float64
malaise       float64
anorexia      float64
liverBig      float64
liverFirm     float64
spleen        float64
spiders       float64
ascites       float64
varices       float64
bili          float64
alk           float64
sgot          float64
albu          float64
protime       float64
histology     float64
dtype: object

### Check the top 5 rows and observe the data

In [8]:
data.head()

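| | ID | target | age | gender | steroid | antivirals | fatigue | malaise | anorexia | liverBig | ... | spleen | spiders | ascites | varices | bili | alk | sgot | albu | protime | histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 30.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1.0 |
| 1 | 2.0 | 2.0 | 50.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1.0 |
| 2 | 3.0 | 2.0 | 78.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1.0 |
| 3 | 4.0 | 2.0 | 31.0 | 1.0 | NaN | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1.0 |
| 4 | 5.0 | 2.0 | 34.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1.0 |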


### Check basic summary statistics

In [9]:
data.describe()

| | ID | target | age | gender | steroid | antivirals | fatigue | malaise | anorexia | liverBig | ... | spleen | spiders | ascites | varices | bili | alk | sgot | albu | protime | histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 155.0 | 155.0 | 155.0 | 155.0 | 154.0 | 155.0 | 154.0 | 154.0 | 154.0 | 145.0 | ... | 150.0 | 150.0 | 150.0 | 150.0 | 149.0 | 126.0 | 151.0 | 139.0 | 88.0 | 155.0 |
| mean | 78.0 | 1.793548 | 41.2 | 1.103226 | 1.506494 | 1.845161 | 1.350649 | 1.603896 | 1.792208 | 1.827586 | ... | 1.8 | 1.66 | 1.866667 | 1.88 | 1.427517 | 105.325397 | 85.89404 | 3.817266 | 61.852273 | 1.451613 |
| std | 44.888751 | 0.40607 | 12.565878 | 0.30524 | 0.501589 | 0.362923 | 0.47873 | 0.490682 | 0.407051 | 0.379049 | ... | 0.40134 | 0.475296 | 0.341073 | 0.32605 | 1.212149 | 51.508109 | 89.65089 | 0.651523 | 22.875244 | 0.499266 |
| min | 1.0 | 1.0 | 7.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.3 | 26.0 | 14.0 | 2.1 | 0.0 | 1.0 |
| 25% | 39.5 | 2.0 | 32.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | ... | 2.0 | 1.0 | 2.0 | 2.0 | 0.7 | 74.25 | 31.5 | 3.4 | 46.0 | 1.0 |
| 50% | 78.0 | 2.0 | 39.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 58.0 | 4.0 | 61.0 | 1.0 |
| 75% | 116.5 | 2.0 | 50.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 1.5 | 132.25 | 100.5 | 4.2 | 76.25 | 2.0 |
| max | 155.0 | 2.0 | 78.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 8.0 | 295.0 | 648.0 | 6.4 | 100.0 | 2.0 |


### Check the number of unique levels in each attribute

In [10]:
data.nunique()

ID            155
target          2
age            49
gender          2
steroid         2
antivirals      2
fatigue         2
malaise         2
anorexia        2
liverBig        2
liverFirm       2
spleen          2
spiders         2
ascites         2
varices         2
bili           34
alk            83
sgot           84
albu           29
protime        44
histology       2
dtype: int64

### Target attribute distribution

In [11]:
data.target.value_counts()

2.0    123
1.0     32
Name: target, dtype: int64

In [12]:
data.target.value_counts(normalize=True)*100

2.0    79.354839
1.0    20.645161
Name: target, dtype: float64

## _Data Pre-processing_

### Drop column(s) which are not significant

In [13]:
data.drop(["ID"], axis = 1, inplace=True)

### Check for top 5 rows

In [14]:
data.head()

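| | target | age | gender | steroid | antivirals | fatigue | malaise | anorexia | liverBig | liverFirm | spleen | spiders | ascites | varices | bili | alk | sgot | albu | protime | histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 30.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1.0 |
| 1 | 2.0 | 50.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1.0 |
| 2 | 2.0 | 78.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1.0 |
| 3 | 2.0 | 31.0 | 1.0 | NaN | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1.0 |
| 4 | 2.0 | 34.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1.0 |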


### Store categorical and numerical attribute names

In [15]:
num_cols = ["age", "bili", "alk", "sgot", "albu", "protime"]
cat_cols = ['gender', 'steroid', 'antivirals', 'fatigue', 'malaise', 'anorexia', 'liverBig', 
            'liverFirm', 'spleen', 'spiders', 'ascites', 'varices', 'histology']

### Convert all attributes to the appropriate type

In [16]:
data[cat_cols] = data[cat_cols].astype('category')

In [17]:
data.dtypes

target         float64
age            float64
gender        category
steroid       category
antivirals    category
fatigue       category
malaise       category
anorexia      category
liverBig      category
liverFirm     category
spleen        category
spiders       category
ascites       category
varices       category
bili           float64
alk            float64
sgot           float64
albu           float64
protime        float64
histology     category
dtype: object

### Check for null values

In [18]:
data.isna().sum()

target         0
age            0
gender         0
steroid        1
antivirals     0
fatigue        1
malaise        1
anorexia       1
liverBig      10
liverFirm     11
spleen         5
spiders        5
ascites        5
varices        5
bili           6
alk           29
sgot           4
albu          16
protime       67
histology      0
dtype: int64

## Split the data into X and y

In [19]:
X = data.drop(["target"], axis = 1)

In [20]:
y = data["target"]

Shape of X and y

In [21]:
print(X.shape, y.shape)

(155, 19) (155,)


In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123, stratify=y)
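
The `stratify=y` argument keeps the DIE/LIVE ratio nearly the same in the train and test sets, which matters when one class has only 32 records. A minimal sketch with synthetic labels that reproduce the class counts of the dataset (the single feature column is a placeholder, not real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts: 32 DIE (1), 123 LIVE (2)
y = np.array([1] * 32 + [2] * 123)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature, not real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# Share of the minority class overall, in train, and in test
frac_all = np.mean(y == 1)
frac_tr = np.mean(y_tr == 1)
frac_te = np.mean(y_te == 1)
print(frac_all, frac_tr, frac_te)
```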

Shape of X_train, X_test, y_train, y_test

In [23]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(124, 19)
(31, 19)
(124,)
(31,)


### Check for distribution of target values in y_train

In [24]:
y_train.value_counts()

2.0    98
1.0    26
Name: target, dtype: int64

In [25]:
y_train.value_counts(normalize=True)*100

2.0    79.032258
1.0    20.967742
Name: target, dtype: float64

### Check for distribution of target values in y_test

In [26]:
y_test.value_counts(normalize=True)*100

2.0    80.645161
1.0    19.354839
Name: target, dtype: float64

## _Data Pre-processing_

### Handling Missing Data

#### Check null values in train and test

In [27]:
X_train.isna().sum()

age            0
gender         0
steroid        0
antivirals     0
fatigue        0
malaise        0
anorexia       0
liverBig       7
liverFirm      8
spleen         2
spiders        2
ascites        2
varices        2
bili           4
alk           22
sgot           3
albu          11
protime       51
histology      0
dtype: int64

In [28]:
X_test.isna().sum()

age            0
gender         0
steroid        1
antivirals     0
fatigue        1
malaise        1
anorexia       1
liverBig       3
liverFirm      3
spleen         3
spiders        3
ascites        3
varices        3
bili           2
alk            7
sgot           1
albu           5
protime       16
histology      0
dtype: int64

#### Impute missing values in categorical columns with the mode

In [29]:
df_cat_train = X_train[cat_cols]
df_cat_test = X_test[cat_cols]

In [30]:
cat_imputer = SimpleImputer(strategy='most_frequent')

cat_imputer.fit(df_cat_train)

SimpleImputer(strategy='most_frequent')

In [31]:
df_cat_train = pd.DataFrame(cat_imputer.transform(df_cat_train), columns=cat_cols)
df_cat_test = pd.DataFrame(cat_imputer.transform(df_cat_test), columns=cat_cols)

#### Impute missing values in numerical columns with the median

In [32]:
df_num_train = X_train[num_cols]
df_num_test = X_test[num_cols]

In [33]:
num_imputer = SimpleImputer(strategy='median')

num_imputer.fit(df_num_train[num_cols])

SimpleImputer(strategy='median')

In [34]:
df_num_train = pd.DataFrame(num_imputer.transform(df_num_train), columns=num_cols)
df_num_test =  pd.DataFrame(num_imputer.transform(df_num_test), columns=num_cols)
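
Both imputers are fit on the training data only and then applied to the test data, so no information from the test set leaks into the fill values. A small sketch of that pattern on a made-up numeric column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with missing values
train = np.array([[10.0], [20.0], [np.nan], [60.0]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                      # median is learned from the train rows only

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)   # the test NaN is filled with the train median
print(imputer.statistics_, test_filled.ravel())
```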

#### Combine imputed categorical and numeric columns

In [35]:
# Combine numeric and categorical in train
X_train = pd.concat([df_num_train, df_cat_train], axis = 1)

# Combine numeric and categorical in test
X_test = pd.concat([df_num_test, df_cat_test], axis = 1)

In [36]:
X_train.isna().sum()

age           0
bili          0
alk           0
sgot          0
albu          0
protime       0
gender        0
steroid       0
antivirals    0
fatigue       0
malaise       0
anorexia      0
liverBig      0
liverFirm     0
spleen        0
spiders       0
ascites       0
varices       0
histology     0
dtype: int64

In [37]:
X_test.isna().sum()

age           0
bili          0
alk           0
sgot          0
albu          0
protime       0
gender        0
steroid       0
antivirals    0
fatigue       0
malaise       0
anorexia      0
liverBig      0
liverFirm     0
spleen        0
spiders       0
ascites       0
varices       0
histology     0
dtype: int64

### Standardize the numerical attributes

In [38]:
scaler = StandardScaler()

scaler.fit(X_train[num_cols])

StandardScaler()

In [39]:
X_train_std = scaler.transform(X_train[num_cols])
X_test_std = scaler.transform(X_test[num_cols])
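
Standardization subtracts the training mean and divides by the training standard deviation, column by column; the test set is transformed with those same training statistics. A sketch on synthetic columns with very different scales (the values are generated, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for two numeric columns on very different scales
train = np.column_stack([rng.normal(40, 12, 100), rng.normal(85, 90, 100)])
test = np.column_stack([rng.normal(40, 12, 25), rng.normal(85, 90, 25)])

scaler = StandardScaler().fit(train)
train_std = scaler.transform(train)
test_std = scaler.transform(test)

# Equivalent manual computation: (x - train mean) / train std
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
```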

In [40]:
print(X_train_std.shape)
print(X_test_std.shape)

(124, 6)
(31, 6)


### OneHotEncoder : Converting Categorical attributes to Numeric attributes

In [41]:
enc = OneHotEncoder(drop = 'first')

enc.fit(X_train[cat_cols])

OneHotEncoder(drop='first')

In [42]:
X_train_ohe=enc.transform(X_train[cat_cols]).toarray()
X_test_ohe=enc.transform(X_test[cat_cols]).toarray()
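
With `drop='first'`, each two-level column is encoded as a single 0/1 column, so the 13 binary attributes produce 13 encoded columns. A sketch on two made-up binary columns:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical binary columns coded 1 = yes, 2 = no
df = pd.DataFrame({"steroid": [1.0, 2.0, 2.0, 1.0],
                   "ascites": [2.0, 2.0, 1.0, 2.0]})

enc = OneHotEncoder(drop="first")
ohe = enc.fit_transform(df).toarray()

print(enc.categories_)   # levels found per column; the first level of each is dropped
print(ohe)               # one column per attribute: 1.0 where the value is 2 (no)
```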

### Concatenate attributes


In [43]:
X_train_con = np.concatenate([X_train_std, X_train_ohe], axis=1)
X_test_con = np.concatenate([X_test_std, X_test_ohe], axis=1)

In [44]:
print(X_train_con.shape)
print(X_test_con.shape)

(124, 19)
(31, 19)
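
The imputation, scaling, encoding and concatenation steps above can also be expressed as a single preprocessing object. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer` and `Pipeline`, shown on a tiny made-up frame with two numeric and two binary columns; it is an alternative layout, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frames (values are illustrative only)
train = pd.DataFrame({
    "age": [30.0, 50.0, 78.0, 31.0, 34.0, 45.0],
    "bili": [1.0, 0.9, np.nan, 0.7, 1.0, 2.3],
    "steroid": [1.0, 1.0, 2.0, np.nan, 2.0, 2.0],
    "ascites": [2.0, 2.0, 2.0, 1.0, np.nan, 1.0],
})
test = pd.DataFrame({
    "age": [39.0, 62.0],
    "bili": [np.nan, 1.5],
    "steroid": [2.0, np.nan],
    "ascites": [1.0, 2.0],
})
num_cols = ["age", "bili"]
cat_cols = ["steroid", "ascites"]

preprocess = ColumnTransformer([
    # numeric: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # categorical: mode imputation, then one-hot encoding
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(drop="first"))]), cat_cols),
], sparse_threshold=0)   # always return a dense array

X_train_con = preprocess.fit_transform(train)   # statistics learned on train only
X_test_con = preprocess.transform(test)
print(X_train_con.shape, X_test_con.shape)
```

Keeping the steps in one object makes it harder to fit a transformer on the test data by accident.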


## MODEL BUILDING

### A. SVM (Linear and RBF Models)

#### Create an SVC classifier using a linear kernel

In [45]:
linear_svm = SVC(kernel='linear', C=1)

#### Train the classifier

In [46]:
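# Note: X_train is the imputed frame (unscaled numerics, label-coded categories),
# not the standardized and one-hot-encoded X_train_con built above
linear_svm.fit(X=X_train, y=y_train)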

SVC(C=1, kernel='linear')

#### Predict

In [47]:
train_predictions = linear_svm.predict(X_train)
test_predictions = linear_svm.predict(X_test)

#### Error Matrix

In [48]:
def evaluate_model(act, pred):
    # Recall, precision and F1 use scikit-learn's default pos_label=1, i.e. the DIE class
    print("Confusion Matrix \n", confusion_matrix(act, pred))
    print("Accuracy : ", accuracy_score(act, pred))
    print("Recall   : ", recall_score(act, pred))
    print("Precision: ", precision_score(act, pred))
    print("F1_score : ", f1_score(act, pred))

In [49]:
### Train data accuracy
evaluate_model(y_train, train_predictions)

### Test data accuracy
evaluate_model(y_test, test_predictions)

Confusion Matrix 
 [[20  6]
 [ 7 91]]
Accuracy :  0.8951612903225806
Recall   :  0.7692307692307693
Precision:  0.7407407407407407
F1_score :  0.7547169811320754
Confusion Matrix 
 [[ 3  3]
 [ 2 23]]
Accuracy :  0.8387096774193549
Recall   :  0.5
Precision:  0.6
F1_score :  0.5454545454545454
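
The cells above fit the linear SVM on `X_train`, while the standardized and encoded matrix is `X_train_con`. SVMs are sensitive to feature scale, so one option is to put the scaler and the classifier in a single pipeline. The sketch below shows that pattern on synthetic, imbalanced data generated with `make_classification`; it does not use the hepatitis data, and its score says nothing about the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly an 80/20 class split (class 1 is the minority)
X, y = make_classification(n_samples=155, n_features=19, weights=[0.8, 0.2],
                           random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=123, stratify=y)

# Scaling lives inside the pipeline, so it is fit on the training rows only
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
model.fit(X_tr, y_tr)

test_pred = model.predict(X_te)
print("Minority-class recall:", recall_score(y_te, test_pred, pos_label=1))
```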


### Non-Linear SVM (RBF)

#### Create an SVC object 

In [50]:
svc = SVC(kernel='rbf', gamma=0.01, C=10)
svc

SVC(C=10, gamma=0.01)

#### Train the model

In [51]:
svc.fit(X=X_train, y=y_train)

SVC(C=10, gamma=0.01)

#### Predict

In [52]:
train_predictions = svc.predict(X_train)
test_predictions = svc.predict(X_test)

#### Error Matrix

In [53]:
### Train data accuracy
evaluate_model(y_train, train_predictions)

### Test data accuracy
evaluate_model(y_test, test_predictions)

Confusion Matrix 
 [[26  0]
 [ 0 98]]
Accuracy :  1.0
Recall   :  1.0
Precision:  1.0
F1_score :  1.0
Confusion Matrix 
 [[ 1  5]
 [ 0 25]]
Accuracy :  0.8387096774193549
Recall   :  0.16666666666666666
Precision:  1.0
F1_score :  0.2857142857142857
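
A perfect training score combined with a test recall of about 0.17 suggests overfitting. One plausible contributor is feature scale: the RBF kernel is K(x, x') = exp(-gamma * ||x - x'||^2), so on unscaled columns with large ranges the distances are large and almost every pair of points gets a similarity near zero. The sketch below illustrates this on synthetic columns with clinical-style scales (generated values, not the real data), using the same `gamma=0.01` as above.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic columns on raw, unscaled ranges (not the real data)
X_raw = np.column_stack([rng.normal(40, 12, 50), rng.normal(85, 90, 50)])
X_std = StandardScaler().fit_transform(X_raw)

def mean_offdiag(K):
    """Average kernel value between distinct points."""
    n = K.shape[0]
    return (K.sum() - np.trace(K)) / (n * (n - 1))

# Same gamma on raw and on standardized features
sim_raw = mean_offdiag(rbf_kernel(X_raw, gamma=0.01))
sim_std = mean_offdiag(rbf_kernel(X_std, gamma=0.01))
print(sim_raw, sim_std)
```

When most pairwise similarities are close to zero, each training point mainly influences its own neighbourhood, which makes it easy to memorize the training set.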


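### SVM with Grid Search for Parameter Tuning

#### Define the parameter grid and instantiate GridSearchCV

In [54]:
svc_grid = SVC()

# gamma = 0 is not a valid value for SVC (see the error below), so the grid starts at 0.0001
param_grid = {
                'C': [0.001, 0.01, 0.1, 1, 10, 100],
                'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                'kernel': ['linear', 'rbf', 'poly']
             }

svc_cv_grid = GridSearchCV(estimator = svc_grid, param_grid = param_grid, cv = 3)

#### Fit the grid search model

In [55]:
%time svc_cv_grid.fit(X=X_train, y=y_train)

With `0` included in the `gamma` list, the fit raises:

ValueError: The gamma value of 0.0 is invalid. Use 'auto' to set gamma to a value of 1 / n_features.

#### Get the best parameters

In [56]:
svc_cv_grid.best_params_

{'C': 0.1, 'gamma': 0, 'kernel': 'linear'}

The recorded result lists `gamma: 0`, so it comes from a run in which that value was still accepted. The linear kernel ignores `gamma`, so the selected model is a linear SVM with `C = 0.1`.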

#### Predict

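In [57]:
# Note: svc_cv_grid.predict is never called, so train_predictions and test_predictions
# still hold the RBF model's predictions from In [52]; the metrics below repeat those results.

### Train data accuracy
evaluate_model(y_train, train_predictions)

### Test data accuracy
evaluate_model(y_test, test_predictions)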

Confusion Matrix 
 [[26  0]
 [ 0 98]]
Accuracy :  1.0
Recall   :  1.0
Precision:  1.0
F1_score :  1.0
Confusion Matrix 
 [[ 1  5]
 [ 0 25]]
Accuracy :  0.8387096774193549
Recall   :  0.16666666666666666
Precision:  1.0
F1_score :  0.2857142857142857
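
To evaluate the tuned model itself, the predictions need to come from the fitted search object, which refits the best parameter combination on the full training set by default. The sketch below shows that flow on synthetic data, with a smaller grid that contains only positive `gamma` values. The data, the grid values and the pipeline step name `svc` (assigned by `make_pipeline`) are all specific to this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly an 80/20 class split (class 1 is the minority)
X, y = make_classification(n_samples=155, n_features=19, weights=[0.8, 0.2],
                           random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=123, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],   # strictly positive values only
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3)
search.fit(X_tr, y_tr)
print(search.best_params_)

# Predict with the search object so the tuned model is the one being evaluated
train_pred = search.predict(X_tr)
test_pred = search.predict(X_te)
print("Test recall (class 1):", recall_score(y_te, test_pred, pos_label=1))
```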
