# **Case**
In this case study, we use voting, an ensemble method, to classify patients as diabetic or not based on several clinical features. Each patient is labeled as suffering from diabetes (1) or not suffering from diabetes (0).

First, we train three classification algorithms separately: Gaussian Naive Bayes, a linear-kernel SVM, and an RBF-kernel SVM. We then combine the predictions of the three classifiers with a hard-voting ensemble and compare the resulting accuracy with the individual scores.
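
To make the idea of hard voting concrete before any model is trained, the sketch below applies a majority vote to made-up predictions. The array `preds` and its values are hypothetical and are not taken from the diabetes data.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five patients
# (1 = diabetes, 0 = no diabetes); the values are invented for illustration.
preds = np.array([
    [1, 0, 1, 0, 1],  # classifier A
    [1, 1, 0, 0, 1],  # classifier B
    [0, 0, 1, 0, 1],  # classifier C
])

# Hard voting: each patient receives the label chosen by most classifiers.
# With three binary voters, label 1 wins when it gets at least two votes.
votes_for_positive = preds.sum(axis=0)
majority = (votes_for_positive >= 2).astype(int)

print(majority)
```

Using an odd number of voters, as this notebook does, avoids ties in the binary case.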

# **Import Libraries and Load Data**

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the data
dbt = pd.read_csv('../Data/diabetes.csv')
# show the first 5 rows of the data
dbt.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# **Check Column Names**

In [2]:
dbt.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

# **Check for Null Values**

In [3]:
dbt.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

# **Data Imputation**

In [4]:
# In this dataset, a value of 0 is not physiologically plausible for
#  measurements such as 'Glucose', 'BloodPressure' or 'Insulin'.
#  No matter how small the values are, every living human being must have
#  these values, so a 0 there most likely marks a missing measurement.
# Note: a 0 in 'Pregnancies' can be a valid value; it is still imputed below
# because that column is included in feature_columns.

# We handle the 0 values by imputation, i.e. replacing each 0 with a
# substitute value. In this case, we use the column mean.

# Check the number of 0 values in each column
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for column in feature_columns:
    print("============================================")
    print(f"{column} ==> Missing zeros : {len(dbt.loc[dbt[column] == 0])}")

# Replace 0 with mean value
from sklearn.impute import SimpleImputer
fill_values = SimpleImputer(missing_values=0, strategy='mean', copy=False)
dbt[feature_columns] = fill_values.fit_transform(dbt[feature_columns])

Pregnancies ==> Missing zeros : 111
Glucose ==> Missing zeros : 5
BloodPressure ==> Missing zeros : 35
SkinThickness ==> Missing zeros : 227
Insulin ==> Missing zeros : 374
BMI ==> Missing zeros : 11
DiabetesPedigreeFunction ==> Missing zeros : 0
Age ==> Missing zeros : 0
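
The cell above computes the column means on the full dataset, before the train/test split. One possible variation, sketched here on a hypothetical toy array rather than on `dbt`, is to fit the imputer on the training rows only and reuse those means for the test rows, so that no test-set information enters the imputed values. The names `X_train_toy` and `X_test_toy` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: columns stand for (Glucose, BMI); 0 marks a missing value.
X_train_toy = np.array([[100.0, 30.0],
                        [0.0, 20.0],
                        [140.0, 0.0]])
X_test_toy = np.array([[0.0, 25.0]])

imputer = SimpleImputer(missing_values=0, strategy='mean')

# Learn the column means from the training rows only (zeros are ignored).
X_train_imp = imputer.fit_transform(X_train_toy)

# Reuse the training means to fill the zeros in the test rows.
X_test_imp = imputer.transform(X_test_toy)

print(imputer.statistics_)  # means learned from the training rows
print(X_test_imp)
```

Whether this changes the accuracies reported in this notebook has not been checked here; the sketch only shows the fit-on-train, transform-on-test pattern.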


# **Split Data**

In [5]:
X = dbt[feature_columns]
y = dbt['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
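
The split above is purely random. When the two classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both subsets. The sketch below shows this on hypothetical toy labels (`X_toy`, `y_toy`); it is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 30% of them in the positive class.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 14 + [1] * 6)

# stratify=y_toy preserves the 70/30 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())  # share of positives in each subset
```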

# **Feature Standardization**

In [6]:
# SVMs are sensitive to the scale of the features, so we standardize
# each feature to zero mean and unit variance.
# (Standardization rescales the data; it does not make it normally distributed.)
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit the scaler on X_train, then apply it to both X_train and X_test
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
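
As a quick check of what the scaler does, the sketch below standardizes a hypothetical toy array. After `fit_transform`, every training column has mean 0 and standard deviation 1, while the test rows are rescaled with the training statistics and so generally do not. The names `sc_toy`, `X_train_toy`, and `X_test_toy` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features on very different scales.
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

sc_toy = StandardScaler()
X_train_scaled = sc_toy.fit_transform(X_train_toy)  # learns mean and std from train
X_test_scaled = sc_toy.transform(X_test_toy)        # reuses the train mean and std

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
print(X_test_scaled)
```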

# **Train Model**

## **GaussianNB**

In [7]:
# create GaussianNB object
gnb_std = GaussianNB()

# fit data that has been standardized
gnb_std.fit(X_train_std, y_train)

# predict test set
y_pred_gnb = gnb_std.predict(X_test_std)

# calculate test data accuracy score
acc_gnb = accuracy_score(y_test, y_pred_gnb)

# print accuracy score
print(f'Test set accuracy {acc_gnb}')
print('Round Test set accuracy: {:.2f}'.format(acc_gnb))

Test set accuracy 0.7359307359307359
Round Test set accuracy: 0.74


## **SVM Linear**

In [8]:
# create SVM object with linear kernel without hyperparameter tuning
svm_lin = SVC(kernel='linear')

# fit data that has been standardized
svm_lin.fit(X_train_std, y_train)

# predict test set
y_pred_svm_lin = svm_lin.predict(X_test_std)

# calculate test data accuracy score
acc_svm_lin = accuracy_score(y_test, y_pred_svm_lin)

# print accuracy score
print(f'Test set accuracy {acc_svm_lin}')
print('Round Test set accuracy: {:.2f}'.format(acc_svm_lin))

Test set accuracy 0.7402597402597403
Round Test set accuracy: 0.74


## **SVM RBF**

In [9]:
# create SVM object with rbf kernel without hyperparameter tuning
svm_rbf = SVC(kernel='rbf')

# fit data that has been standardized
svm_rbf.fit(X_train_std, y_train)

# predict test set
y_pred_svm_rbf = svm_rbf.predict(X_test_std)

# calculate test data accuracy score
acc_svm_rbf = accuracy_score(y_test, y_pred_svm_rbf)

# print accuracy score
print(f'Test set accuracy {acc_svm_rbf}')
print('Round Test set accuracy: {:.2f}'.format(acc_svm_rbf))

Test set accuracy 0.7229437229437229
Round Test set accuracy: 0.72
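
Accuracy alone does not show how the errors are spread across the two classes. `classification_report` is imported at the top of this notebook but never called; the sketch below shows how it, together with `confusion_matrix`, could be used. The label lists are hypothetical and are not the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (1 = diabetes, 0 = no diabetes).
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(
    y_true_toy, y_pred_toy, target_names=['no diabetes', 'diabetes']
))
```

In the notebook itself, the same calls could be applied to `y_test` and any of the `y_pred_*` arrays.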


# **Voting Classifier**

In [10]:
# define the algorithms used in the voting classifier
clf1 = GaussianNB()
clf2 = SVC(kernel='linear')
clf3 = SVC(kernel='rbf')

# create hard voting classifier object
voting_clf = VotingClassifier(estimators=[('GaussianNB', clf1), ('SVM-LIN', clf2), ('SVM-RBF', clf3)], voting='hard')

# fit data that has been standardized
voting_clf.fit(X_train_std, y_train)

# predict test set
y_pred_voting = voting_clf.predict(X_test_std)

# calculate test data accuracy score
acc_voting = accuracy_score(y_test, y_pred_voting)

# print accuracy score
print('Voting Hard')
print(f'Test set accuracy {acc_voting}')
print('Round Test set accuracy: {:.2f}'.format(acc_voting))

Voting Hard
Test set accuracy 0.7402597402597403
Round Test set accuracy: 0.74
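
Hard voting counts predicted labels. `VotingClassifier` also supports `voting='soft'`, which averages the predicted class probabilities instead. This requires every estimator to provide `predict_proba`, so the SVMs need `probability=True`. The sketch below is a variation on the cell above, run on synthetic data from `make_classification` rather than on the diabetes data, so its accuracy says nothing about this case study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for this sketch, not the diabetes set).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Standardize with statistics learned from the training split only.
scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)
X_te_std = scaler.transform(X_te)

# Soft voting averages class probabilities, so SVC needs probability=True.
soft_clf = VotingClassifier(
    estimators=[
        ('GaussianNB', GaussianNB()),
        ('SVM-LIN', SVC(kernel='linear', probability=True, random_state=42)),
        ('SVM-RBF', SVC(kernel='rbf', probability=True, random_state=42)),
    ],
    voting='soft',
)
soft_clf.fit(X_tr_std, y_tr)

proba = soft_clf.predict_proba(X_te_std)  # averaged class probabilities
acc_soft = accuracy_score(y_te, soft_clf.predict(X_te_std))
print(f'Soft voting accuracy on synthetic data: {acc_soft:.2f}')
```

Soft voting can help when the base models produce well-calibrated probabilities; whether it improves on the hard-voting result above would need to be tested on the diabetes data.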
