# Diabetes Classification

According to the NIH, "Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells."

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.



# Importing Essential Libraries and Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
pwd

'C:\\Users\\USER\\Downloads'

In [3]:
data=pd.read_csv('diabetes.csv')

# Basic EDA and statistical analysis

In [4]:
data.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [5]:
data.shape

(768, 9)

The data.describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. It is a quick way to learn a lot about a dataset. On a DataFrame with mixed column types, describe() summarizes only the numeric columns by default: categorical columns are skipped unless the parameter include="all" is passed.

Now, let's understand the statistics that are generated by the describe() method:

- count tells us the number of non-null values in a feature.
- mean tells us the mean value of that feature.
- std tells us the standard deviation of that feature.
- min tells us the minimum value of that feature.
- 25%, 50%, and 75% are the 25th, 50th (median), and 75th percentiles of each feature, i.e. its quartiles. Comparing them with min and max helps us spot outliers.
- max tells us the maximum value of that feature.
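
To make the numeric-only default of describe() concrete, here is a minimal sketch on a small made-up DataFrame. The columns `age` and `city` are hypothetical and are not part of the diabetes data; the point is only to compare the default call with include="all".

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [21, 33, 50, 31],
    "city": ["A", "B", "A", "A"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # also reports unique/top/freq for "city"

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(full_summary.loc["top", "city"], full_summary.loc["freq", "city"])
```

With include="all", statistics that do not apply to a column (for example mean for `city`) are filled with NaN.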

In [6]:
data.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Checking the class balance of the data using the groupby() function.

In [6]:
data.groupby('Outcome').count()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,500,500,500,500,500,500,500,500
1,268,268,268,268,268,268,268,268


The output above shows that the data is imbalanced towards Outcome 0, which means that diabetes was not present. There are 500 non-diabetic patients against 268 diabetic patients, so non-diabetics outnumber diabetics almost two to one.
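
The same check can be expressed as class shares, which also gives a baseline for reading the accuracy scores later on. The sketch below rebuilds only the Outcome column from the counts reported above (500 and 268) rather than loading the CSV; on the real data, `data['Outcome'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Rebuild only the Outcome column from the counts reported above:
# 500 rows with outcome 0 and 268 rows with outcome 1.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
shares = outcome.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()

print(counts)
print(shares.round(3))
print(round(majority_baseline, 3))
```

Any classifier trained on this data should be judged against that majority-class baseline rather than against 0.5.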

In [7]:
data.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
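
isna() only counts NaN entries. The describe() output above shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, so it may be worth counting zeros as well. The sketch below does this on the five rows printed by data.head(). Treating those zeros as placeholders for missing measurements is an assumption; the dataset description above does not state it.

```python
import pandas as pd

# The five rows printed by data.head(), restricted to the columns
# whose minimum in describe() is 0.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

nan_counts = sample.isna().sum()   # no NaN values in these rows
zero_counts = (sample == 0).sum()  # zeros that isna() does not report

print(nan_counts)
print(zero_counts)
```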

Splitting the dataset into features X and target y.

In [8]:
X=data.iloc[:,:-1].values
y=data.iloc[:,-1].values

Train/test split: holding out a test set lets us evaluate the model on data points it has not seen during training, rather than on the same points it was fitted to. This gives a more honest estimate of model performance.

In [9]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
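
Because the classes are imbalanced, one option is to pass stratify=y so that the train and test sets keep roughly the same share of each class. This is a possible variation, not what the cell above does. The sketch uses stand-in arrays with the same class counts as the dataset (500 and 268); the single feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the dataset (500 vs 268);
# the single feature column is a placeholder, not real measurements.
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(len(y_train), len(y_test))
print(y.mean(), y_train.mean(), y_test.mean())  # share of class 1 in each set
```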

Now that we have split the data into train and test sets, it is time to fit a suitable model. To begin with, I am checking the accuracy of a Decision Tree Classifier.

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
model=DecisionTreeClassifier(criterion='gini')
model.fit(x_train,y_train)

DecisionTreeClassifier()

In [12]:
model.score(x_test,y_test)

0.7359307359307359

In [13]:
model.score(x_train,y_train)

1.0

Here we can see that the model is clearly overfitting: it scores 1.0 on the training set but only about 0.736 on the test set.
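
One way to see this effect in isolation is to compare an unconstrained tree with a depth-limited one. The sketch below uses make_classification as a synthetic stand-in for the diabetes data, and max_depth=3 is an arbitrary illustrative choice, so the numbers it prints will not match the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (768 rows, 8 features, some label noise).
X, y = make_classification(n_samples=768, n_features=8, flip_y=0.1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree keeps splitting until every training leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
# A depth-limited tree cannot memorize individual training points as easily.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(x_train, y_train)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    gap = tree.score(x_train, y_train) - tree.score(x_test, y_test)
    print(f"{name}: depth={tree.get_depth()}, train-test gap={gap:.3f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting; limiting depth is one of several ways to shrink it.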

So, to reduce this overfitting, I am using a RandomForestClassifier with a limited tree depth.

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
random_model=RandomForestClassifier(max_depth=5,random_state=10,criterion='gini')

In [16]:
random_model.fit(x_train,y_train)

RandomForestClassifier(max_depth=5, random_state=10)

In [17]:
random_model.score(x_train,y_train)

0.8566108007448789

In [18]:
y_pred=random_model.predict(x_test)
random_model.score(x_test,y_test)

0.7705627705627706
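
The predictions in y_pred can also be summarized beyond a single accuracy number, which is useful when the classes are imbalanced. The sketch below uses small made-up label arrays, not the model's actual predictions, to show how a confusion matrix and recall relate to accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration only (1 = diabetic, 0 = non-diabetic).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
accuracy = accuracy_score(y_true, y_hat)
recall_diabetic = recall_score(y_true, y_hat)  # share of true diabetics that were found

print(cm)
print(accuracy, recall_diabetic)
```

On the real data, the equivalent call would be `confusion_matrix(y_test, y_pred)`.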

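Now the gap between training accuracy (about 0.857) and test accuracy (about 0.771) is much smaller, so the model is overfitting far less and does not appear to be underfitting.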

To tune the hyperparameters more systematically, I am using GridSearchCV, here applied to the Decision Tree Classifier.

In [19]:
from sklearn.model_selection import GridSearchCV

In [20]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6],
    'max_features': [4, 5, 6, 7],
}

In [21]:
model=DecisionTreeClassifier()

In [22]:
grid_search=GridSearchCV(estimator=model,param_grid=param_grid,cv=10)

In [23]:
grid_search.fit(x_train,y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 4, 5, 6],
                         'max_features': [4, 5, 6, 7]})
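
To get a feel for how much work this search does, the number of candidate combinations can be counted with ParameterGrid. The sketch repeats the grid from above and multiplies by the 10 folds passed as cv.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6],
    'max_features': [4, 5, 6, 7],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination in the grid
n_fits = n_candidates * 10                     # one fit per candidate per fold (cv=10)

print(n_candidates, n_fits)
```

With the default refit=True, GridSearchCV performs one additional fit of the best combination on the whole training set.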

In [24]:
grid_search.best_params_

{'criterion': 'entropy', 'max_depth': 4, 'max_features': 4}

In [25]:
best_model=grid_search.best_estimator_

In [26]:
best_model.fit(x_train,y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4, max_features=4)

In [27]:
best_model.score(x_test,y_test)

0.7229437229437229

In [28]:
grid_search.best_score_

0.7336477987421384
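
best_score_ is the mean cross-validated accuracy of the best parameter combination, computed on folds of the training data, which is why it differs from the test-set score. The sketch below illustrates this on synthetic data; the smaller grid, cv=5, and the make_classification data are illustrative choices, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the diabetes dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [3, 4]},
    cv=5,
)
grid_search.fit(x_train, y_train)

# Mean cross-validated accuracy of every parameter combination.
mean_scores = grid_search.cv_results_["mean_test_score"]

# With the default refit=True, best_estimator_ is already fitted on all of
# x_train, so it can be scored on the held-out test set directly.
test_accuracy = grid_search.best_estimator_.score(x_test, y_test)

print(grid_search.best_params_)
print(grid_search.best_score_, mean_scores.max(), test_accuracy)
```

Because best_estimator_ is already refitted, calling fit on it again is optional; without a fixed random_state, a second fit of a tree that uses max_features may produce a slightly different model.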