## Ensemble Learning using Voting

In this section, you will use the voting method to classify patients as diabetic or non-diabetic based on a set of features. There are two classes: diabetic (1) and non-diabetic (0). As a first step, we will train three classification algorithms separately: Naive Bayes, linear SVM, and RBF SVM. After that, we will combine the three algorithms using the ensemble voting method.
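
To make the idea of voting concrete before using scikit-learn, here is a minimal sketch of hard (majority) voting in plain Python. The function name `hard_vote` and the prediction lists are hypothetical and only serve as an illustration; they are not part of the notebook.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical predictions from three classifiers for five patients
nb_pred      = [1, 0, 1, 0, 1]
svm_lin_pred = [1, 0, 0, 0, 1]
svm_rbf_pred = [0, 0, 1, 1, 1]

ensemble_pred = hard_vote([nb_pred, svm_lin_pred, svm_rbf_pred])
print(ensemble_pred)  # → [1, 0, 1, 0, 1]
```

With three models and two classes a tie cannot occur, which is one reason an odd number of base classifiers is a convenient choice for binary problems.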

### Import Library

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes (assumes normally distributed features)
from sklearn.svm import SVC # SVM classifier
from sklearn.ensemble import VotingClassifier # voting ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

### Data Preparation

In [2]:
# Load Data

dbt = pd.read_csv('data/diabetes.csv')

dbt.head()

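|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |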


In [5]:
# Check column names
dbt.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [4]:

dbt.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [7]:
# Looking at the data, a value of 0 is implausible for some features,
# for example 'Glucose', 'BloodPressure', or 'Insulin'.
# A living patient cannot have a measurement of zero for these, so the zeros most likely mark missing values.

# We will handle these zeros with imputation, i.e. replace each zero with a substitute value.
# Concretely, we will use the column mean.
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for column in feature_columns:
    print("============================================")
    print(f"{column} ==> Missing zeros : {len(dbt.loc[dbt[column] == 0])}")

Pregnancies ==> Missing zeros : 111
Glucose ==> Missing zeros : 5
BloodPressure ==> Missing zeros : 35
SkinThickness ==> Missing zeros : 227
Insulin ==> Missing zeros : 374
BMI ==> Missing zeros : 11
DiabetesPedigreeFunction ==> Missing zeros : 0
Age ==> Missing zeros : 0


In [8]:
# Impute zero values with the column mean
# (note: this also replaces zeros in 'Pregnancies', where 0 is a valid count)
from sklearn.impute import SimpleImputer

fill_values = SimpleImputer(missing_values=0, strategy="mean", copy=False)

dbt[feature_columns] = fill_values.fit_transform(dbt[feature_columns])
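
To illustrate what mean imputation of zeros does, here is a minimal plain-Python sketch. It mirrors the behaviour of `SimpleImputer(missing_values=0, strategy="mean")`, where the mean is taken over the non-missing entries only. The helper `impute_zeros_with_mean` and the sample readings are hypothetical.

```python
def impute_zeros_with_mean(values):
    """Replace zeros with the mean of the non-zero entries."""
    observed = [v for v in values if v != 0]
    if not observed:
        # Nothing to average: leave the column unchanged
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v == 0 else v for v in values]

# Hypothetical glucose readings; 0 marks a missing measurement
glucose = [148, 85, 0, 89, 137]
imputed = impute_zeros_with_mean(glucose)

# The zero becomes (148 + 85 + 89 + 137) / 4 = 114.75
print(imputed)
```

Because a zero is a plausible value for some columns (a patient with no pregnancies, for instance), it may be worth restricting the imputed columns to those where zero cannot be a real measurement.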

### Split training and testing data

In [9]:
X = dbt[feature_columns]
y = dbt.Outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
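
One caveat of imputing before splitting is that the column means are computed from all rows, including those that later end up in the test set. A possible alternative, sketched here on synthetic stand-in data (not the diabetes dataset), is to split first and wrap the preprocessing in a `Pipeline` so that it is fitted on the training split only. The variable names and the `stratify` argument are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, label derived from feature 0
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 100.0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = 0.0  # mark roughly 10% of entries as missing

# Split first; stratify keeps the class ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Imputer and scaler statistics are learned from the training split only
pipe = Pipeline([
    ("impute", SimpleImputer(missing_values=0, strategy="mean")),
    ("scale", StandardScaler()),
    ("model", GaussianNB()),
])
pipe.fit(X_tr, y_tr)

demo_acc = pipe.score(X_te, y_te)
print(f"Demo accuracy: {demo_acc:.2f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the diabetes results.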

### Training using GaussianNB

#### Feature Standardization

In [10]:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()


X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
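
The reason for calling `fit_transform` on the training data but only `transform` on the test data is that the mean and standard deviation must come from the training set alone. A minimal plain-Python sketch of the same idea follows; the helper names and BMI values are hypothetical. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned from training data."""
    return [(v - mean) / std for v in values]

# Hypothetical BMI values
bmi_train = [20.0, 30.0, 40.0]
bmi_test = [30.0, 50.0]

mean, std = fit_scaler(bmi_train)
train_std = standardize(bmi_train, mean, std)
test_std = standardize(bmi_test, mean, std)  # reuses the training mean and std

print(train_std)
print(test_std)
```

The standardized training values have mean 0, while the test values are simply expressed in the training set's units and need not be centred.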

#### Training and Evaluation

In [12]:

gnb_std = GaussianNB()


gnb_std.fit(X_train_std, y_train)


y_pred_gnb = gnb_std.predict(X_test_std)


acc_gnb = accuracy_score(y_test, y_pred_gnb)


print("Test set accuracy: {:.2f}".format(acc_gnb))
print(f"Test set accuracy: {acc_gnb}")

Test set accuracy: 0.74
Test set accuracy: 0.7359307359307359
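
Accuracy alone can hide how a model behaves on each class. As a minimal sketch, the following plain-Python code computes accuracy together with precision and recall for the positive class on hypothetical labels. In the notebook itself, the already imported `classification_report(y_test, y_pred_gnb)` would report such per-class metrics.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical true labels and predictions for eight patients
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 0, 0]

# Accuracy is the fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
precision, recall = precision_recall(y_true_demo, y_pred_demo)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this toy example the model is right 5 times out of 8, yet it finds only half of the positive cases, which matters in a screening setting.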


### Training with Linear SVM

In [15]:

svm_lin = SVC(kernel='linear')


svm_lin.fit(X_train_std, y_train)


y_pred_svm_lin = svm_lin.predict(X_test_std)


acc_svm_lin = accuracy_score(y_test, y_pred_svm_lin)


print("Test set accuracy: {:.2f}".format(acc_svm_lin))
print(f"Test set accuracy: {acc_svm_lin}")

Test set accuracy: 0.74
Test set accuracy: 0.7402597402597403


### Training with SVM RBF

In [16]:

svm_rbf = SVC(kernel='rbf')


svm_rbf.fit(X_train_std, y_train)


y_pred_svm_rbf = svm_rbf.predict(X_test_std)


acc_svm_rbf = accuracy_score(y_test, y_pred_svm_rbf)

print("Test set accuracy: {:.2f}".format(acc_svm_rbf))
print(f"Test set accuracy: {acc_svm_rbf}")

Test set accuracy: 0.72
Test set accuracy: 0.7229437229437229


### Training with Voting

In [20]:


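# Base classifiers; probability=True is only required for soft voting
clf1 = GaussianNB()
clf2 = SVC(kernel='linear')
clf3 = SVC(kernel='rbf', probability=True)

# Hard voting: each classifier casts one vote and the majority class wins
voting = VotingClassifier(estimators=[('GaussianNB', clf1), ('SVM-LIN', clf2), ('SVM-RBF', clf3)], voting='hard')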


voting.fit(X_train_std, y_train)


y_pred_vt1 = voting.predict(X_test_std)


acc_vt1 = accuracy_score(y_test, y_pred_vt1)


print('Voting Hard')
print("Test set accuracy: {:.2f}".format(acc_vt1))
print(f"Test set accuracy: {acc_vt1}")

Voting Hard
Test set accuracy: 0.74
Test set accuracy: 0.7402597402597403
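
Hard voting counts class labels. A variant worth knowing is soft voting, which averages the predicted class probabilities instead. The sketch below runs on synthetic stand-in data generated with `make_classification` (an assumption for illustration, not the diabetes dataset), so its accuracy is not comparable with the results above. Note that every `SVC` needs `probability=True` here so that it exposes `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)
X_te_std = scaler.transform(X_te)

soft_voting = VotingClassifier(
    estimators=[
        ("GaussianNB", GaussianNB()),
        # probability=True lets SVC provide class probabilities
        ("SVM-LIN", SVC(kernel="linear", probability=True, random_state=42)),
        ("SVM-RBF", SVC(kernel="rbf", probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting.fit(X_tr_std, y_tr)

proba = soft_voting.predict_proba(X_te_std)  # averaged class probabilities
soft_acc = soft_voting.score(X_te_std, y_te)
print(f"Soft voting demo accuracy: {soft_acc:.2f}")
```

Soft voting can help when the base classifiers produce reasonably calibrated probabilities; whether it beats hard voting on a given dataset has to be checked empirically.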


## Assignment

Implement a voting ensemble using these algorithms:
1. Logistic Regression
2. SVM with a polynomial kernel
3. Decision Tree
