## Diabetes Prediction using NIDDK-DF dataset

### Prediabetes Calculation and labelling
#### Based on HbA1c: 
A normal A1C level is below 5.7%, a level of 5.7% to 6.4% indicates prediabetes, and a level of 6.5% or more indicates diabetes. Within the 5.7% to 6.4% prediabetes range, the higher your A1C, the greater your risk is for developing type 2 diabetes.

https://www.cdc.gov/diabetes/managing/managing-blood-sugar/a1c.html#:~:text=A%20normal%20A1C%20level%20is,for%20developing%20type%202%20diabetes.
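
The A1C cut-offs quoted above can be sketched as a small function. This is a minimal illustration, not part of the notebook's pipeline (the dataset has no HbA1c column). The name `classify_a1c` is hypothetical, and treating values between 6.4% and 6.5% as prediabetic is an assumption.

```python
def classify_a1c(a1c_percent: float) -> str:
    """Map an HbA1c percentage to the category quoted from the CDC text."""
    if a1c_percent < 5.7:
        return "Normal"
    # Assumption: anything from 5.7% up to (but excluding) 6.5% is prediabetic.
    if a1c_percent < 6.5:
        return "Prediabetic"
    return "Diabetic"


for value in (5.2, 5.7, 6.4, 6.5):
    print(value, classify_a1c(value))
```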

#### Based on Glucose level: 
Prediabetes means that blood sugar levels are elevated but not high enough for a diabetes diagnosis, so diabetes may develop later even though it is not present now. Doctors diagnose diabetes or prediabetes based on where the fasting blood sugar value falls:

- Normal: Less than 100 mg/dL
- Prediabetic: 100–125 mg/dL
- Diabetic: Greater than 125 mg/dL

https://www.abbott.in/corpnewsroom/diabetes-care/prediabetic-diet--your-guide-to-blood-sugar-regulation.html#:~:text=Doctors%20study%20your%20blood%20reports,Greater%20than%20125%20mg%2FdL
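
The fasting glucose bands listed above translate directly into a scalar rule. The sketch below is only an illustration; `classify_fasting_glucose` is a hypothetical helper name, and both ends of the 100–125 mg/dL band are assumed to be inclusive.

```python
def classify_fasting_glucose(mg_dl: float) -> str:
    """Map a fasting glucose value in mg/dL to the three bands listed above."""
    if mg_dl < 100:
        return "Normal"
    # Assumption: 100 and 125 both belong to the prediabetic band.
    if mg_dl <= 125:
        return "Prediabetic"
    return "Diabetic"


for value in (99, 100, 125, 126):
    print(value, classify_fasting_glucose(value))
```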

Link for dataset: https://www.kaggle.com/houcembenmansour/predict-diabetes-based-on-diagnostic-measures

#### Features of dataset
- cholesterol
- glucose 	
- hdl_chol 	
- chol_hdl_ratio 	
- age 	
- gender 	
- height 	
- weight 	
- bmi 	
- systolic_bp 	
- diastolic_bp 	
- waist 	
- hip 	
- waist_hip_ratio 	
- diabetes

<h3>Importing Libraries</h3>

In [1]:
import sys  # access to interpreter variables and functions
import pandas as pd
import numpy as np
import sklearn
import matplotlib
import keras
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

**Loading the Dataset**

In [2]:
import time
# store starting time
begin_dataprep = time.time()

In [3]:
# Read the data; decimal="," because decimal values in the file use a comma as the decimal mark
data = pd.read_csv("Predict diabetes based on diagnostic measures.csv", decimal=",")
df = data.copy()
# Show every row and column when a frame is displayed
pd.set_option('display.max_rows', df.shape[0])
pd.set_option('display.max_columns', df.shape[1])
df.head()

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,No diabetes
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,No diabetes
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,No diabetes
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,No diabetes
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,No diabetes
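
The effect of `decimal=","` can be seen on a small in-memory sample. This is a sketch under assumptions: the two rows below are hypothetical and only imitate a file whose decimal values are quoted and written with a comma.

```python
import io

import pandas as pd

# Hypothetical sample: comma-separated fields, quoted values with a decimal comma.
sample = 'patient_number,chol_hdl_ratio,bmi\n1,"3,9","22,5"\n2,"3,6","26,4"\n'

# Without the option the quoted values stay as text.
as_text = pd.read_csv(io.StringIO(sample))
# With decimal="," they are parsed as floating-point numbers.
as_float = pd.read_csv(io.StringIO(sample), decimal=",")

print(as_text.dtypes)
print(as_float.dtypes)
```

Checking `dtypes` right after loading, as the notebook does, is a quick way to confirm that numeric columns were not read as text.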


In [4]:
#df.sample(10)

In [5]:
#df_justglu = df[['glucose','diabetes']]
#df_justglu.to_csv('justglu.csv')

In [6]:
data.dtypes

patient_number       int64
cholesterol          int64
glucose              int64
hdl_chol             int64
chol_hdl_ratio     float64
age                  int64
gender              object
height               int64
weight               int64
bmi                float64
systolic_bp          int64
diastolic_bp         int64
waist                int64
hip                  int64
waist_hip_ratio    float64
diabetes            object
dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   patient_number   390 non-null    int64  
 1   cholesterol      390 non-null    int64  
 2   glucose          390 non-null    int64  
 3   hdl_chol         390 non-null    int64  
 4   chol_hdl_ratio   390 non-null    float64
 5   age              390 non-null    int64  
 6   gender           390 non-null    object 
 7   height           390 non-null    int64  
 8   weight           390 non-null    int64  
 9   bmi              390 non-null    float64
 10  systolic_bp      390 non-null    int64  
 11  diastolic_bp     390 non-null    int64  
 12  waist            390 non-null    int64  
 13  hip              390 non-null    int64  
 14  waist_hip_ratio  390 non-null    float64
 15  diabetes         390 non-null    object 
dtypes: float64(3), int64(11), object(2)
memory usage: 48.9+ KB


In [8]:
df['gender'].value_counts()

female    228
male      162
Name: gender, dtype: int64

In [9]:
# Report and eliminate duplicate rows
print('There are', df.duplicated().sum(), 'duplicates')
df.drop_duplicates(keep='first', inplace=True)
print('There are now', df.shape[0], 'rows')
print('There are now', df.shape[1], 'columns')

There are 0 duplicates
There are now 390 rows
There are now 16 columns


In [10]:
df.describe()

Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio
count,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0
mean,195.5,207.230769,107.338462,50.266667,4.524615,46.774359,65.951282,177.407692,28.775641,137.133333,83.289744,37.869231,42.992308,0.881385
std,112.727548,44.666005,53.798188,17.279069,1.736634,16.435911,3.918867,40.407824,6.600915,22.859528,13.498192,5.760947,5.664342,0.073212
min,1.0,78.0,48.0,12.0,1.5,19.0,52.0,99.0,15.2,90.0,48.0,26.0,30.0,0.68
25%,98.25,179.0,81.0,38.0,3.2,34.0,63.0,150.25,24.1,122.0,75.0,33.0,39.0,0.83
50%,195.5,203.0,90.0,46.0,4.2,44.5,66.0,173.0,27.8,136.0,82.0,37.0,42.0,0.88
75%,292.75,229.0,107.75,59.0,5.4,60.0,69.0,200.0,32.275,148.0,90.0,41.0,46.0,0.93
max,390.0,443.0,385.0,120.0,19.3,92.0,76.0,325.0,55.8,250.0,124.0,56.0,64.0,1.14


In [11]:
df.shape

(390, 16)

In [12]:
df_prediab = df[(df['glucose'] >= 100)& (df['glucose'] < 125) & (df['diabetes'] == 'No diabetes')]

In [13]:
#df_prediab.head(10)

#### Data labelling!
- Glucose > 125 mg/dL is labelled as diabetes.
- Glucose from 100 mg/dL to 125 mg/dL is labelled as prediabetes, and glucose < 100 mg/dL as normal.
- Some rows labelled `No diabetes` have glucose values in the prediabetic range.

These rows are relabelled as prediabetic, removed from the original frame, and appended back with their new label.
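
The relabelling rule can also be sketched on a toy frame with a boolean mask and `.loc`, which edits the rows in place and avoids working on a slice. The rows below are hypothetical, and the inclusive band edges (100 and 125) are an assumption taken from the thresholds listed earlier, so they may need adjusting to match the filter used in the cells.

```python
import pandas as pd

# Hypothetical toy rows; the column names follow the dataset.
toy = pd.DataFrame({
    "glucose": [77, 105, 124, 130, 300],
    "diabetes": ["No diabetes", "No diabetes", "No diabetes", "No diabetes", "Diabetes"],
})

# Rows recorded as non-diabetic whose glucose lies in the prediabetic band.
# Series.between is inclusive on both ends by default.
mask = (toy["diabetes"] == "No diabetes") & toy["glucose"].between(100, 125)

# Relabel only the masked rows, then rename the remaining negative label.
toy.loc[mask, "diabetes"] = "Prediabetic"
toy["diabetes"] = toy["diabetes"].replace("No diabetes", "Normal")
print(toy)
```

As in the notebook, rows already labelled `Diabetes` are left alone, and a `No diabetes` row outside the band keeps its negative label.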

In [14]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
df_prediab = df_prediab.copy()
df_prediab['diabetes'] = df_prediab['glucose'].apply(lambda x: 'Diabetic' if x > 125 else 'Prediabetic' if 100 < x <= 125 else 'Normal')
df_prediab.head()


Unnamed: 0,patient_number,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
8,9,230,112,64,3.6,20,male,67,159,24.9,100,90,31,39,0.79,Prediabetic
9,10,179,105,60,3.0,20,female,58,170,35.5,140,100,34,46,0.74,Prediabetic
10,11,174,105,117,1.5,20,male,70,187,26.8,132,86,37,41,0.9,Prediabetic
11,12,193,106,63,3.1,20,female,68,274,41.7,165,110,49,58,0.84,Prediabetic
32,33,134,101,36,3.7,25,female,63,245,43.4,142,78,47,58,0.81,Prediabetic


In [15]:
# Remove the rows that were selected into df_prediab (same filter as In [12])
df.drop(df_prediab.index, inplace=True)

In [16]:
df_all_rows = pd.concat([df, df_prediab], ignore_index=True)

In [17]:
df_all_rows.shape

(390, 16)

In [18]:
df_all_rows = df_all_rows.replace('No diabetes','Normal')

In [19]:
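def encoding(df):
    # Map categorical strings to integer codes; values missing from `code` become NaN
    code = {'female':0,
            'male':1,
            'Normal':0,
            'Diabetes':1,
            'Prediabetic':2,
           }
    df = df.copy()  # avoid mutating the caller's frame
    for col in df.select_dtypes('object'):
        df[col] = df[col].map(code)
    return df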

def imputation(df):
    df = df.dropna(axis=0)
    return df

def feature_engineering(df):
    useless_columns = ['patient_number']
    df = df.drop(useless_columns,axis=1)
    return df

def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('diabetes',axis=1)
    y = df['diabetes']    

    return df,X,y

In [20]:
data,_,_ = preprocessing(df_all_rows)
data.head()


Unnamed: 0,cholesterol,glucose,hdl_chol,chol_hdl_ratio,age,gender,height,weight,bmi,systolic_bp,diastolic_bp,waist,hip,waist_hip_ratio,diabetes
0,193,77,49,3.9,19,0,61,119,22.5,118,70,32,38,0.84,0
1,146,79,41,3.6,19,0,60,135,26.4,108,58,33,40,0.83,0
2,217,75,54,4.0,20,0,67,187,29.3,110,72,40,45,0.89,0
3,226,97,70,3.2,20,0,64,114,19.6,122,64,31,39,0.79,0
4,164,91,67,2.4,20,0,70,141,20.2,122,86,32,39,0.82,0
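
One thing to watch with dictionary-based encoding: `Series.map` returns NaN for any value that is not a key of the dictionary, and a later `dropna` would then discard those rows silently. The sketch below shows a quick check for unmapped labels. The label list is hypothetical, and its last entry is a deliberate mismatch (`Diabetic` versus the key `Diabetes`).

```python
import pandas as pd

code = {"Normal": 0, "Diabetes": 1, "Prediabetic": 2}

# Hypothetical labels; "Diabetic" is intentionally absent from `code`.
labels = pd.Series(["Normal", "Diabetes", "Prediabetic", "Diabetic"])

encoded = labels.map(code)

# Collect the original labels that could not be encoded.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

Running such a check before dropping missing values makes a spelling mismatch visible instead of shrinking the dataset.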


In [21]:
data['diabetes'] = data['diabetes'].astype(int)

In [22]:
data['gender'].value_counts()

0    228
1    162
Name: gender, dtype: int64

In [23]:
# Rename the label column to 'Outcome'
data = data.rename({'diabetes': 'Outcome'}, axis=1)
# Save the prepared data
data.to_csv('NIDDK-DF-new-3targets.csv',index=False, header=True)

In [24]:
# store end time
end_dataprep = time.time()
time_tkn = end_dataprep-begin_dataprep

In [25]:
print('Time taken in hours:',time_tkn/(60*60))

Time taken in hours: 0.00011096364921993679
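
A difference of two `time.time()` readings is in seconds, so dividing it by `60*60` yields hours. A minimal sketch that keeps the unit and the printed label in step; `to_hours` and the placeholder workload are illustrative names, not part of the notebook.

```python
import time


def to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600


start = time.perf_counter()
sum(range(100_000))  # placeholder workload
elapsed_s = time.perf_counter() - start

print(f"Elapsed: {elapsed_s:.6f} s ({to_hours(elapsed_s):.8f} h)")
```

`time.perf_counter()` is used here because it is intended for measuring short durations; `time.time()` works as well for coarse timings.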


In [26]:
data['Outcome'].value_counts(normalize=True)  # Imbalanced classes

0    0.715385
1    0.153846
2    0.130769
Name: Outcome, dtype: float64
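
With roughly 72% of rows in class 0, one common way to compensate during training is to weight classes inversely to their frequency. The sketch below computes such weights with the usual "balanced" heuristic, `n_samples / (n_classes * count)`. The label list is rebuilt from the proportions above (279, 60 and 51 of 390 rows) purely for illustration, and whether weighting suits the downstream model is an assumption.

```python
from collections import Counter

# Label list rebuilt from the proportions above: 279 / 60 / 51 of 390 rows.
labels = [0] * 279 + [1] * 60 + [2] * 51

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c in sorted(class_weight):
    print(c, counts[c], round(class_weight[c], 3))
```

A dictionary of this shape can be passed to estimators that accept a `class_weight` argument, such as scikit-learn's `RandomForestClassifier`.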