## Diabetes Prediction using Diagnostic Features dataset

### Prediabetes Calculation and labelling
#### Based on HbA1c: 
A normal A1C level is below 5.7%, a level of 5.7% to 6.4% indicates prediabetes, and a level of 6.5% or more indicates diabetes. Within the 5.7% to 6.4% prediabetes range, the higher your A1C, the greater your risk is for developing type 2 diabetes.

https://www.cdc.gov/diabetes/managing/managing-blood-sugar/a1c.html#:~:text=A%20normal%20A1C%20level%20is,for%20developing%20type%202%20diabetes.
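
The A1C cut-offs quoted above can be written as a small helper. This is only an illustrative sketch: the function name `classify_a1c` is hypothetical, and the dataset used in this notebook does not list an HbA1c column among its features, so the helper is not applied to it.

```python
def classify_a1c(a1c_percent: float) -> str:
    """Map an A1C percentage to a category using the CDC cut-offs quoted above."""
    if a1c_percent < 5.7:
        return "Normal"        # below 5.7%
    if a1c_percent < 6.5:
        return "Prediabetic"   # 5.7% to 6.4%
    return "Diabetic"          # 6.5% or more


for value in (5.2, 5.7, 6.4, 6.5):
    print(value, classify_a1c(value))
```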

#### Based on Glucose level: 
Prediabetes means that blood sugar levels are higher than normal but not high enough for a diagnosis of diabetes, so diabetes may develop later even though it is not present now. Doctors classify a patient as diabetic or prediabetic based on where the fasting blood sugar value falls:

- Normal: Less than 100 mg/dL
- Prediabetic: 100–125 mg/dL
- Diabetic: Greater than 125 mg/dL

https://www.abbott.in/corpnewsroom/diabetes-care/prediabetic-diet--your-guide-to-blood-sugar-regulation.html#:~:text=Doctors%20study%20your%20blood%20reports,Greater%20than%20125%20mg%2FdL
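
A minimal sketch of the fasting-glucose ranges listed above, applied to a pandas Series. The function name `label_from_glucose` and the sample values are hypothetical. It treats a value of exactly 100 mg/dL as prediabetic, following the list above; the labelling cell later in this notebook uses `x > 100` instead, so the two differ at exactly 100.

```python
import numpy as np
import pandas as pd


def label_from_glucose(glucose: pd.Series) -> pd.Series:
    """Label fasting glucose (mg/dL): <100 Normal, 100-125 Prediabetic, >125 Diabetic."""
    conditions = [glucose < 100, glucose <= 125]  # evaluated in order
    choices = ["Normal", "Prediabetic"]
    labels = np.select(conditions, choices, default="Diabetic")
    return pd.Series(labels, index=glucose.index, name="label")


sample = pd.Series([77, 100, 125, 126], name="glucose")
print(label_from_glucose(sample).tolist())
```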

Link for dataset: https://www.kaggle.com/houcembenmansour/predict-diabetes-based-on-diagnostic-measures

#### Features of dataset
- cholesterol
- glucose 	
- hdl_chol 	
- chol_hdl_ratio 	
- age 	
- gender 	
- height 	
- weight 	
- bmi 	
- systolic_bp 	
- diastolic_bp 	
- waist 	
- hip 	
- waist_hip_ratio 	
- diabetes

Some links related to prediabetes:
- https://www.mayoclinic.org/diseases-conditions/prediabetes/symptoms-causes/syc-20355278
- High levels of Tg in combination with low levels of HDL-C showed the strongest association with T2DM and prediabetes. The paper suggests that routine monitoring of the commonly used lipid parameters (especially Tg and HDL-C) is warranted among patients with T2DM and prediabetes in this population, which is considered to be the epicenter for T2DM or CAD.
  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6165005/
  Since 2004, the American Diabetes Association (ADA) has recommended diabetes screening for high-risk adults of any age who have a high BMI (≥25 kg/m²), low HDL-C (<0.90 mmol/L), and/or a high Tg level (>2.82 mmol/L).
- Total cholesterol, low density lipoprotein (LDL), triglyceride (TG), very low density lipoprotein, TG/HDL ratio and LDL/HDL ratio were significantly raised in prediabetic individuals compared to normal healthy subjects, whereas high density lipoprotein (HDL) was significantly lower.
  https://pubmed.ncbi.nlm.nih.gov/27731552/
- https://www.imaware.health/blog/signs-of-prediabetes



<h3>Importing Libraries</h3>

In [1]:
import sys  # access to interpreter variables and functions
import pandas as pd
import numpy as np
import sklearn
import matplotlib
import keras
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

**Loading the Dataset**

In [2]:
import time
# store starting time
begin_dataprep = time.time()

In [3]:
#Read the data; decimal="," because some values use a comma as the decimal separator
data=pd.read_csv("Predict diabetes based on diagnostic measures.csv",decimal=",")
df =data.copy()
pd.set_option('display.max_row',df.shape[0])
pd.set_option('display.max_column',df.shape[1]) 
df.head()

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,No diabetes
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,No diabetes
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,No diabetes
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,No diabetes
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,No diabetes
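
To illustrate why `decimal=","` matters when loading, the sketch below parses a tiny made-up sample with and without it. The sample text and the `;` separator are assumptions chosen only to keep the example unambiguous; they are not taken from the dataset file.

```python
import io

import pandas as pd

# Hypothetical two-row sample that uses a comma as the decimal separator
raw = "chol_hdl_ratio;bmi\n3,9;22,5\n3,6;26,4\n"

as_text = pd.read_csv(io.StringIO(raw), sep=";")                # values stay as strings
as_float = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")  # values parsed as floats

print(as_text.dtypes)
print(as_float.dtypes)
```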


In [4]:
data.dtypes

patient_number       int64
cholesterol          int64
glucose              int64
hdl_chol             int64
chol_hdl_ratio     float64
age                  int64
gender              object
height               int64
weight               int64
bmi                float64
systolic_bp          int64
diastolic_bp         int64
waist                int64
hip                  int64
waist_hip_ratio    float64
diabetes            object
dtype: object

In [5]:
# Eliminate duplicates
print('There are', df.duplicated().sum(), 'duplicates')
df.drop_duplicates(keep='first', inplace=True)
print('There are now', df.shape[0], 'rows')
print('There are now', df.shape[1], 'columns')

There are 0 duplicates
There are now 390 rows
There are now 16 columns
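
Since this dataset contains no duplicates, the effect of the duplicate check is easier to see on a toy frame. The data below is made up for illustration only.

```python
import pandas as pd

# Hypothetical frame in which patient 2 appears twice
toy = pd.DataFrame({"patient_number": [1, 2, 2, 3], "glucose": [77, 79, 79, 75]})

# keep=False marks every copy of a repeated row, which is useful for inspection
print(toy[toy.duplicated(keep=False)])

# keep='first' retains one copy of each repeated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped.shape)
```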


In [6]:
df.shape

(390, 16)

In [7]:
df_incorrect = df[(df['glucose'] <= 70) & (df['diabetes'] == 'Diabetes')]

In [8]:
df_incorrect

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
40,41,220,60,66,3.3,26,male,70,150,21.5,136,88,33,39,0.85,Diabetes


In [9]:
# Preview only: replace() returns a relabelled copy, so df_incorrect keeps 'Diabetes'
df_incorrect.replace('Diabetes','Diabetic')

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
40,41,220,60,66,3.3,26,male,70,150,21.5,136,88,33,39,0.85,Diabetic


In [10]:
df.drop(df[(df['glucose'] <= 70) & (df['diabetes'] == 'Diabetes')].index, inplace=True)

In [11]:
#checking if needed
#Changing incorrectly labelled data
#df[(df['glucose'] <= 70) & (df['diabetes'] == 'Diabetes')].replace('Diabetes','No diabetes')

In [12]:
df_incorrect_2 = df[(df['glucose'] <= 100) & (df['diabetes'] == 'Diabetes')]

In [13]:
df_incorrect_2.head(10)

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
299,300,203,90,51,4.0,60,female,59,123,24.8,130,72,36,41,0.88,Diabetes
326,327,249,90,28,8.9,64,male,68,183,27.8,138,80,44,41,1.07,Diabetes
361,362,207,71,41,5.0,72,male,70,180,25.8,138,88,39,40,0.98,Diabetes


In [14]:
# Preview only: the result is not assigned, so df_incorrect_2 keeps 'Diabetes'
df_incorrect_2.replace('Diabetes','Diabetic')

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
299,300,203,90,51,4.0,60,female,59,123,24.8,130,72,36,41,0.88,Diabetic
326,327,249,90,28,8.9,64,male,68,183,27.8,138,80,44,41,1.07,Diabetic
361,362,207,71,41,5.0,72,male,70,180,25.8,138,88,39,40,0.98,Diabetic


In [15]:
df.drop(df[(df['glucose'] <= 100) & (df['diabetes'] == 'Diabetes')].index, inplace=True)

In [16]:
df_incorrect_3 = df[(df['glucose'] > 100) & (df['glucose'] <= 125) & (df['diabetes'] == 'Diabetes')]
df_incorrect_3.shape

(13, 16)

In [17]:
df_incorrect_3.head(13)

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
117,118,245,119,26,9.4,36,male,66,179,28.9,150,92,37,42,0.88,Diabetes
212,213,245,120,39,6.3,47,female,63,156,27.6,142,102,35,39,0.9,Diabetes
240,241,215,110,36,6.0,51,female,67,282,44.2,142,78,52,59,0.88,Diabetes
250,251,196,120,67,2.9,52,female,62,147,26.9,144,94,34,42,0.81,Diabetes
287,288,195,108,46,4.2,59,female,67,172,26.9,150,102,38,43,0.88,Diabetes
288,289,219,112,73,3.0,59,male,66,170,27.4,146,92,37,40,0.93,Diabetes
310,311,235,109,59,4.0,62,female,63,290,51.4,175,80,55,62,0.89,Diabetes
321,322,215,119,44,3.9,63,female,63,158,28.0,160,68,34,42,0.81,Diabetes
339,340,246,104,62,4.0,66,female,66,189,30.5,200,94,45,46,0.98,Diabetes
344,345,254,121,39,6.5,67,male,68,167,25.4,161,118,36,39,0.92,Diabetes


In [18]:
# Preview only: the result is not assigned, so df_incorrect_3 keeps 'Diabetes'
df_incorrect_3.replace('Diabetes','Diabetic')

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
117,118,245,119,26,9.4,36,male,66,179,28.9,150,92,37,42,0.88,Diabetic
212,213,245,120,39,6.3,47,female,63,156,27.6,142,102,35,39,0.9,Diabetic
240,241,215,110,36,6.0,51,female,67,282,44.2,142,78,52,59,0.88,Diabetic
250,251,196,120,67,2.9,52,female,62,147,26.9,144,94,34,42,0.81,Diabetic
287,288,195,108,46,4.2,59,female,67,172,26.9,150,102,38,43,0.88,Diabetic
288,289,219,112,73,3.0,59,male,66,170,27.4,146,92,37,40,0.93,Diabetic
310,311,235,109,59,4.0,62,female,63,290,51.4,175,80,55,62,0.89,Diabetic
321,322,215,119,44,3.9,63,female,63,158,28.0,160,68,34,42,0.81,Diabetic
339,340,246,104,62,4.0,66,female,66,189,30.5,200,94,45,46,0.98,Diabetic
344,345,254,121,39,6.5,67,male,68,167,25.4,161,118,36,39,0.92,Diabetic


In [19]:
df.drop(df[(df['glucose'] > 100) & (df['glucose'] <= 125) & (df['diabetes'] == 'Diabetes')].index, inplace=True)

#### Data labelling
- Glucose > 125 mg/dL: diabetes
- Glucose 100–125 mg/dL: prediabetes
- Glucose < 100 mg/dL: normal
- Rows whose original 'Diabetes' label conflicts with these thresholds (glucose ≤ 125 mg/dL) are set aside in the cells above and handled separately
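
The three filter-and-drop steps above (glucose ≤ 70, ≤ 100, and 100 to 125, each combined with a 'Diabetes' label) together cover every 'Diabetes' row with glucose ≤ 125. A sketch of doing this with a single mask follows; the helper name `split_conflicting` and the toy data are hypothetical, and this is not the notebook's own code.

```python
import pandas as pd


def split_conflicting(df: pd.DataFrame, threshold: int = 125):
    """Separate rows labelled 'Diabetes' whose glucose is at or below the threshold."""
    mask = (df["glucose"] <= threshold) & (df["diabetes"] == "Diabetes")
    return df[~mask].copy(), df[mask].copy()


toy = pd.DataFrame({
    "glucose": [60, 90, 119, 140, 85],
    "diabetes": ["Diabetes", "Diabetes", "Diabetes", "Diabetes", "No diabetes"],
})
kept, set_aside = split_conflicting(toy)
print(len(kept), len(set_aside))
```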

In [20]:
df.shape

(373, 16)

In [21]:
df['diabetes'].value_counts(normalize=True)

No diabetes    0.884718
Diabetes       0.115282
Name: diabetes, dtype: float64

The conflicting rows have now been set aside, so the glucose-based labels, including the prediabetic class, can be applied.

In [22]:
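# Relabel from glucose: > 125 is Diabetic, above 100 up to 125 is Prediabetic, anything else (including exactly 100) is Normal
df['diabetes'] = df['glucose'].apply(lambda x: 'Diabetic' if x > 125 else 'Prediabetic' if 100 < x <= 125 else 'Normal')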
df.head()

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,Normal
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,Normal
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,Normal
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,Normal
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,Normal


In [23]:
df_all_rows = pd.concat([df, df_incorrect_3], ignore_index=True)

In [24]:
df_all_rows = pd.concat([df_all_rows, df_incorrect_2], ignore_index=True)

In [25]:
df_all_rows = pd.concat([df_all_rows, df_incorrect], ignore_index=True)

In [26]:
df_all_rows.shape

(390, 16)

In [27]:
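def encoding(df):
    # Integer codes for the categorical values.
    # Labels missing from this map (e.g. 'Diabetic', assigned in In [22])
    # become NaN and are dropped later by imputation().
    code = {'female':1,
            'male':0,
            'Normal':0,
            'Diabetes':1,
            'Prediabetic':2,
           }
    # Map every object (string) column through the code table
    for col in df.select_dtypes('object'):
        df.loc[:,col]=df[col].map(code)        
    return df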

def imputation(df):
    df = df.dropna(axis=0)
    return df

def feature_engineering(df):
    useless_columns = ['patient_number']
    df = df.drop(useless_columns,axis=1)
    return df

def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('diabetes',axis=1)
    y = df['diabetes']    

    return df,X,y
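
Because `Series.map` silently turns unmapped labels into NaN, an encoder that fails loudly can be easier to debug. The sketch below is one possible variant, not the notebook's method: the names `LABEL_CODES` and `encode_labels` are hypothetical, and giving 'Diabetic' the same code as 'Diabetes' is an assumption about the intended design.

```python
import pandas as pd

# Assumption: 'Diabetic' and 'Diabetes' are meant to share one class code
LABEL_CODES = {"Normal": 0, "Diabetes": 1, "Diabetic": 1, "Prediabetic": 2}


def encode_labels(labels: pd.Series) -> pd.Series:
    """Map label strings to integer codes, raising if any label has no code."""
    encoded = labels.map(LABEL_CODES)
    unmapped = labels[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return encoded.astype(int)


toy = pd.Series(["Normal", "Diabetic", "Diabetes", "Prediabetic"])
print(encode_labels(toy).tolist())
```

With a check like this, a label that is absent from the map raises an error instead of disappearing in a later `dropna`.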

In [28]:
data,_,_ = preprocessing(df_all_rows)
data.head()

Unnamed: 0,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
0,193,77,49,3.9,19,1,61,119,22.5,118,70,32,38,0.84,0.0
1,146,79,41,3.6,19,1,60,135,26.4,108,58,33,40,0.83,0.0
2,217,75,54,4.0,20,1,67,187,29.3,110,72,40,45,0.89,0.0
3,226,97,70,3.2,20,1,64,114,19.6,122,64,31,39,0.79,0.0
4,164,91,67,2.4,20,1,70,141,20.2,122,86,32,39,0.82,0.0


In [29]:
data['diabetes'] = data['diabetes'].astype(int)

In [30]:
#Rename the label column to 'Outcome'
data = data.rename({'diabetes': 'Outcome'}, axis=1)
#save data
data.to_csv('diabetes_vae_3targetclasses.csv',index=False, header=True)

In [31]:
# store end time
end_dataprep = time.time()
time_tkn = end_dataprep-begin_dataprep

In [32]:
print('Time taken in hours:',time_tkn/(60*60))

Time taken in hours: 0.00015417397022247315
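
A small sketch of the same timing pattern using `time.perf_counter`, which is intended for measuring elapsed intervals, with the unit stated explicitly. The summation is only a placeholder for the data-preparation steps.

```python
import time

start = time.perf_counter()
total = sum(range(100_000))  # placeholder workload standing in for data preparation
elapsed_seconds = time.perf_counter() - start

# Report seconds directly; divide by 3600 only when hours are wanted
print(f"Time taken: {elapsed_seconds:.4f} s ({elapsed_seconds / 3600:.8f} h)")
```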


In [33]:
data['Outcome'].value_counts(normalize=True) #Imbalanced classes

0    0.790909
2    0.157576
1    0.051515
Name: Outcome, dtype: float64
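
One common way to account for this imbalance is to weight classes inversely to their frequency. The sketch below applies the "balanced" heuristic, n_samples / (n_classes * class_count), rewritten in terms of the proportions printed above. The variable names are hypothetical, and whether class weighting suits the downstream model is an assumption, not something this notebook establishes.

```python
import pandas as pd

# Class proportions copied from the value_counts output above
proportions = pd.Series({0: 0.790909, 2: 0.157576, 1: 0.051515}, name="Outcome")

# n_samples / (n_classes * count) equals 1 / (n_classes * proportion)
n_classes = len(proportions)
class_weight = (1.0 / (n_classes * proportions)).round(3).to_dict()
print(class_weight)
```

A dictionary like this could, for example, be passed as the `class_weight` argument of the `RandomForestClassifier` imported at the top of the notebook.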