<a href="https://colab.research.google.com/github/Prang-nin/BinaryClassification/blob/main/BinaryClassificationDiabetics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SECTION I : INTRODUCTION**
## **Objective** : 💉
To compare the performance of three machine learning models on a diabetes dataset, where each model classifies whether or not a person is diabetic.

**Candidate models**
1. K-Nearest Neighbors
2. Support Vector Machines
3. Random Forest

**The dataset can be downloaded from**

👉 https://www.kaggle.com/uciml/pima-indians-diabetes-database

## **Table of contents** ✍

**Pre-processing** 🔰

- Import modules and libraries
- Load the dataset 
- Data visualization
- Cleaning the data set
- Features normalization
- Imbalanced data handling using SMOTE

**Train models**  ⌛
- Hyperparameters Tuning

**Evaluate  Models** 🔔
- Using accuracy and confusion matrix

**Reference**
- Author and co-author

# **1. Import libraries**

In [None]:
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# **2. Load dataset**

In [None]:
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['diabetes.csv']))

Saving diabetes.csv to diabetes.csv


# **3. EDA**

⏩ Define the Exploratory Data Analysis functions 

In [None]:
#### Exploratory Data analysis functions ########
def checkData(data):
  '''
  Print summary statistics (count, mean, std, min, quartiles, max) of each column.
  '''
  # check dataset
  print(data.describe())

def plotCorrelation(data):
  # check correlation of each features
  plt.figure(figsize = (10, 8))
  corr = data.corr()
  sns.heatmap(corr, annot = True, linewidths = 1)
  plt.show()

def chekNull(data):
  #check null value
  nullReport = pd.DataFrame({'Null Values' : data.isna().sum(), 'Percentage Null Values' : (data.isna().sum()) / (data.shape[0]) * (100)})
  print(nullReport)

def checkInvalid(data):
  # check invalid values
  # Only Pregnancies and DiabetesPedigreeFunction can legitimately be equal to zero
  print("Number of zeros in each feature, grouped by outcome: ")
  features = data.iloc[:,:-1]
  for item in features.columns:
    print(data[data[item] == 0].groupby('Outcome').count()[item])

def checkOutlier(df):
  # check for outliers using a histogram of all columns and one boxplot per column
  plt.figure(figsize = (20, 8))
  sns.histplot(df)
  for column in df:
    plt.figure(figsize=(17,1))
    sns.boxplot(data= df, x= column)

def checkClassBalance(df):
  negDF = df[df.Outcome==0].copy()
  posDF = df[df.Outcome==1].copy()
  # check dataset balance
  print("Total number of negative class :\n", negDF.Outcome.count())
  print("Total number of positive class :\n", posDF.Outcome.count())

⏩ Execute the Exploratory Data Analysis functions 

In [None]:
# Exploratory Data Analysis
checkData(df)
plotCorrelation(df)
chekNull(df)
checkInvalid(df)
checkOutlier(df)
checkClassBalance(df)

# **4. Data Cleaning**

⏩ Clean the dataset by replacing zeros (incorrect or missing data) with the mean of each class

In [None]:
###### Data cleaning ########

def replaceZero(df, listOfFeature):
  # Replace zeros with the mean of each class (positive or negative)
  # Note: the mean is taken over the original column, so the zeros are included in it
  negDF = df[df.Outcome==0].copy()
  posDF = df[df.Outcome==1].copy()
  for i in listOfFeature:
    negDF[i] = negDF[i].replace([0], negDF[i].mean())
    posDF[i] = posDF[i].replace([0], posDF[i].mean())
  cleanedDF = pd.concat([negDF,posDF])
  return cleanedDF
# list all the features that need to be corrected
listOfFeature = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
cleanedDF = replaceZero(df, listOfFeature)
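
The `replaceZero` function above evaluates `negDF[i].mean()` and `posDF[i].mean()` on the unmodified columns, so the zeros being replaced also pull the replacement value down. As a point of comparison only, the sketch below shows one possible variant that treats zeros as missing before averaging. The function name `replace_zero_excluding_zeros` and the toy data are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def replace_zero_excluding_zeros(df, features, label="Outcome"):
    """Hypothetical variant: impute zeros with the per-class mean of non-zero values."""
    out = df.copy()
    # Treat zeros in the listed features as missing values.
    out[features] = out[features].replace(0, np.nan)
    # Per-class mean of the remaining values, aligned row by row (NaN is skipped).
    class_means = out.groupby(label)[features].transform("mean")
    out[features] = out[features].fillna(class_means)
    return out

# Toy data (made up for illustration): one zero in each class.
toy = pd.DataFrame({
    "Glucose": [0, 100, 120, 0, 150, 170],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
cleaned = replace_zero_excluding_zeros(toy, ["Glucose"])
print(cleaned)
```

On this toy frame the zero in the negative class becomes 110 (the mean of 100 and 120) rather than about 73.3, which is what a mean that includes the zero would give. Either way, the imputation uses the class label, so it is worth keeping in mind that it is applied here before the train/test split.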


# **5. Split data into train and test dataset**

In [None]:
def getFeature(df):
  return df.iloc[:,:-1].copy()
def getLabel(df):
  return df.iloc[:,-1].copy()

# clean the dataset and split into x and y variable
x = getFeature(cleanedDF)
y = getLabel(cleanedDF)
# split dataset into train and test 
# Test size  = 0.3
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.3, random_state=42)

# **6. Generate pipeline with pre-processing**

In [None]:
#Define classifier function to apply pipeline on train and test dataset

def classifier(optimizedPipeline, xTrain, yTrain, xTest, yTest):
    # Fit each pipeline on the training set and report its metrics on the test set
    for i, pip in enumerate(optimizedPipeline):
        pip.fit(xTrain,yTrain)
        yPred = pip.predict(xTest)
        # Find the accuracy and evaluate the model
        accuracy = accuracy_score(yTest, yPred)
        confusion = confusion_matrix(yTest, yPred)
        clfReport = classification_report(yTest, yPred)
        print("Accuracy Score of {}: \n {}".format(modelDict[i], accuracy))
        print("Confusion Matrix of {} : \n {}".format(modelDict[i], confusion))
        print("Classification report of {}: \n {}".format(modelDict[i], clfReport))
    # Note: only the metrics of the last pipeline in the list are returned
    return accuracy, confusion

# Map each pipeline's position in the pipeline list to a model name
modelDict = {0:'KNN', 1: 'SVM', 2: 'RandomForest'}

In [None]:
###### pipeline creation ######
# 1. MinMaxScaler
# 2. Oversampling using SMOTE
# 3. apply classification

pipelineKNN = imbpipeline([('scalar1', MinMaxScaler()), 
              ('SMOTE1', SMOTE(random_state=42)),
              ('knnClassifier', KNeighborsClassifier()) ])

pipelineSVM = imbpipeline([('scalar1', MinMaxScaler()), 
              ('SMOTE1', SMOTE(random_state=42)),
              ('svmClassifier', SVC()) ])

pipelineRDF = imbpipeline([('scalar1', MinMaxScaler()), 
              ('SMOTE1', SMOTE(random_state=42)),
              ('rdfClassifier', RandomForestClassifier(random_state=42)) ])

# List of untuned classifiers
pipelineList = [pipelineKNN, pipelineSVM, pipelineRDF]

# Results
print("The results of different models without optimization")
classifier(pipelineList, xTrain, yTrain, xTest, yTest)

The results of different models without optimization
Accuracy Score of KNN: 
 0.6796536796536796
Confusion Matrix of KNN : 
 [[102  54]
 [ 20  55]]
Classification report of KNN: 
               precision    recall  f1-score   support

           0       0.84      0.65      0.73       156
           1       0.50      0.73      0.60        75

    accuracy                           0.68       231
   macro avg       0.67      0.69      0.67       231
weighted avg       0.73      0.68      0.69       231

Accuracy Score of SVM: 
 0.7402597402597403
Confusion Matrix of SVM : 
 [[114  42]
 [ 18  57]]
Classification report of SVM: 
               precision    recall  f1-score   support

           0       0.86      0.73      0.79       156
           1       0.58      0.76      0.66        75

    accuracy                           0.74       231
   macro avg       0.72      0.75      0.72       231
weighted avg       0.77      0.74      0.75       231

Accuracy Score of RandomForest: 
 0.852

(0.8528138528138528, array([[134,  22],
        [ 12,  63]]))
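
Each pipeline above oversamples the minority class with SMOTE before the classifier is fitted. As a rough illustration of the idea only (this is not imblearn's implementation), the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_like_samples`, its parameters, and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(minority, k=2, n_new=4, seed=0):
    """Illustrative SMOTE-style interpolation between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.integers(len(minority))
        # Distances from that sample to every minority sample.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Its k nearest neighbours, skipping the sample itself (distance 0).
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        # Place the new point at a random position on the segment i -> j.
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority-class points in an already min-max scaled feature space.
minority = np.array([[0.1, 0.2], [0.2, 0.3], [0.8, 0.9], [0.9, 0.7]])
synthetic = smote_like_samples(minority, k=2, n_new=4, seed=0)
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the range already covered by the minority class. Inside an imblearn pipeline the resampling step is applied only during `fit`, so the test set is predicted without any synthetic rows.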

# **7. Tuning Hyperparameters**

⏩ Hyperparameter tuning experiment using GridSearchCV

In [None]:
####### Specified parameters #######
#Dictionary of parameters in KNN :  nameOfStep__nameOfParameters ##
parameterKNN = {}
parameterKNN['knnClassifier__n_neighbors'] = [int(i) for i in range(1,31)]
parameterKNN['knnClassifier__weights'] = ['uniform', 'distance']

#Dictionary of parameters in SVM :  nameOfStep__nameOfParameters ##
parameterSVM = {}
parameterSVM['svmClassifier__C'] = [0.1, 1, 10, 100, 1000]
parameterSVM['svmClassifier__gamma'] = ['scale', 'auto']
parameterSVM['svmClassifier__kernel'] = ['linear', 'poly', 'rbf', 'sigmoid']
parameterSVM['svmClassifier__degree']= [int(i) for i in range(1,6)]

#Dictionary of parameters in Randomforest :  nameOfStep__nameOfParameters ##
parametersRDF = {}
parametersRDF['rdfClassifier__bootstrap'] = [True, False]
parametersRDF['rdfClassifier__max_depth'] = [int(i) for i in range(800,1100,100)] #[800,900,1000]
parametersRDF['rdfClassifier__max_features'] = ['sqrt', 'log2'] # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
parametersRDF['rdfClassifier__min_samples_leaf'] = [1,10,20,30,40 ] 
parametersRDF['rdfClassifier__min_samples_split'] = [int(i) for i in range(2,10,2)] #[2,4,6,8]
parametersRDF['rdfClassifier__n_estimators'] = [int(i) for i in range(100,240,20)] #[100,120,140,160,180,200,220]

paramList = [parameterKNN, parameterSVM, parametersRDF ]

# find best parameter using GridSearchCV
def tuneParameter(pipelineList, paramList, xTrain, yTrain):
    result = []
    for pip, para in zip(pipelineList,paramList):
        grid = GridSearchCV(pip, para, cv = 5, scoring='accuracy')
        grid.fit(xTrain, yTrain)
        result.append({'best params': grid.best_params_})
    return result

# call the function below to run the search for the best hyperparameters within the specified ranges
paramOptimization = tuneParameter(pipelineList, paramList, xTrain, yTrain)
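
An exhaustive grid search fits one model per parameter combination and per cross-validation fold, so the cost grows multiplicatively with each added parameter. The sketch below estimates that cost for the KNN and SVM grids defined above with `cv = 5`. The helper `grid_size` and the plain dictionaries are written for this example and mirror the value lists in the cell above.

```python
from math import prod

def grid_size(param_grid):
    """Number of combinations in an exhaustive grid: product of the list lengths."""
    return prod(len(values) for values in param_grid.values())

# Same value lists as parameterKNN and parameterSVM above.
knn_grid = {
    "knnClassifier__n_neighbors": list(range(1, 31)),
    "knnClassifier__weights": ["uniform", "distance"],
}
svm_grid = {
    "svmClassifier__C": [0.1, 1, 10, 100, 1000],
    "svmClassifier__gamma": ["scale", "auto"],
    "svmClassifier__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "svmClassifier__degree": list(range(1, 6)),
}

cv = 5  # number of folds passed to GridSearchCV
for name, grid in [("KNN", knn_grid), ("SVM", svm_grid)]:
    n = grid_size(grid)
    print(f"{name}: {n} combinations, {n * cv} cross-validation fits")
```

The SVM grid is larger than it needs to be, because `degree` only affects the `poly` kernel; the other three kernels are refitted five times each with identical settings.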


⏩ Apply the best hyperparameters (from previous experiments)

In [None]:
# apply the best parameters from previous experiments as tuned hyperparameters

pipelineOptKNN = imbpipeline([('scalar1', MinMaxScaler()), 
              ('SMOTE1', SMOTE(random_state=42)),
              ('knnClassifier', KNeighborsClassifier(n_neighbors=8, weights ='distance')) ])

pipelineOptSVM = imbpipeline([('scalar1', MinMaxScaler()), 
              ('SMOTE1', SMOTE(random_state=42)),
              ('svmClassifier', SVC(C = 10, gamma = 'scale', kernel ='poly', degree = 3)) ])

pipelineOptRDF = imbpipeline([('scalar1', MinMaxScaler()), 
              ('SMOTE1', SMOTE(random_state=42)),
              ('rdfClassifier', RandomForestClassifier(bootstrap= True, max_depth= 800 ,max_features='sqrt', min_samples_leaf=1, min_samples_split=6, n_estimators=220, random_state=42)) ])



optimizedPipeline = [pipelineOptKNN, pipelineOptSVM, pipelineOptRDF]

#Results
print("The results of different models with optimization")
classifier(optimizedPipeline, xTrain, yTrain, xTest, yTest)


The results of different models with optimization
Accuracy Score of KNN: 
 0.696969696969697
Confusion Matrix of KNN : 
 [[102  54]
 [ 16  59]]
Classification report of KNN: 
               precision    recall  f1-score   support

           0       0.86      0.65      0.74       156
           1       0.52      0.79      0.63        75

    accuracy                           0.70       231
   macro avg       0.69      0.72      0.69       231
weighted avg       0.75      0.70      0.71       231

Accuracy Score of SVM: 
 0.8051948051948052
Confusion Matrix of SVM : 
 [[131  25]
 [ 20  55]]
Classification report of SVM: 
               precision    recall  f1-score   support

           0       0.87      0.84      0.85       156
           1       0.69      0.73      0.71        75

    accuracy                           0.81       231
   macro avg       0.78      0.79      0.78       231
weighted avg       0.81      0.81      0.81       231

Accuracy Score of RandomForest: 
 0.8614718

(0.8614718614718615, array([[134,  22],
        [ 10,  65]]))
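
The accuracy returned for the tuned random forest can be cross-checked by hand from its confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, and recomputes accuracy together with precision and recall for the positive (diabetic) class. The variable names are chosen for this example.

```python
# Confusion matrix of the tuned random forest, copied from the output above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
confusion = [[134, 22],
             [10, 65]]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all test samples classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of true positives that were found

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

The recomputed accuracy, 199 correct out of 231, matches the 0.8614718614718615 returned by `classifier`. Accuracy alone hides the difference between the 22 false positives and the 10 false negatives, which precision and recall make explicit.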

# **8. Reference**

**Author :** 

Chayanin Chomchuen

**Co-author :** 

Sabarinath Muralidharan Sujatha