<p style="font-family: Arials; line-height: 1.3; font-size: 30px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Exploring the PIMA Indian Diabetes dataset</p>


# <span style="font-family: Arials; font-size: 25px; font-style: bold; font-weight: bold; letter-spacing: 2px; color: #23527c">1. INTRODUCTION</span>
<hr style="height: 0.5px; border: 0; background-color: 'Black'">

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. 

![](https://www.medicoverhospitals.in/wp-content/uploads/2020/11/Diabetes-1200x438.jpg)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# %matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
import catboost as cb
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

sns.set_theme()

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Reading the dataset</p>

In [None]:
#Reading the dataset
data = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
data.head()

<p style="font-family: Arials; line-height: 1.3; font-size: 22px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">An overview of the columns in the dataset</p>

- **Pregnancies**: Number of times pregnant
- **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- **BloodPressure**: Diastolic blood pressure (mm Hg)
- **SkinThickness**: Triceps skin fold thickness (mm)
- **Insulin**: 2-Hour serum insulin (mu U/ml)
- **BMI**: Body mass index (weight in kg/(height in m)^2)
- **DiabetesPedigreeFunction**: Diabetes pedigree function
- **Age**: Age (years)
- **Outcome**: Class variable (0 = no diabetes, 1 = diabetes)

In [None]:
# Print some basic information about the data
def eda(data):
    print("----------Top 5 Records----------")
    print(data.head(5))
    print("-----------Information-----------")
    data.info()  # info() prints directly and returns None
    print("-----------Data Types-----------")
    print(data.dtypes)
    print("----------Missing values----------")
    print(data.isnull().sum())  # isna() is an alias of isnull(), so one check is enough
    print("----------Shape of Data----------")
    print(data.shape)
eda(data)

<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">As we can see, there are no missing or null values in our dataset. Well, that's a relief!</p>

In [None]:
# Let's have a look at the summary statistics of our data
data.describe()


Based on what these parameters measure, it is highly unlikely that Glucose, BloodPressure, SkinThickness, Insulin and BMI are genuinely 0; these zeros most likely stand for measurements that were not recorded.
I will therefore replace the 0 values with the mean of each column.

In [None]:
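# Replace the 0 values of the affected columns with the column mean
# (note: the mean is computed over all rows, zeros included)

cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
for col in cols:
    # Assign the result back; inplace=True on a column selection may not update the frame
    data[col] = data[col].replace(0, data[col].mean())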

data.head()

Now that the 0 values are accounted for, we can proceed with the rest of the exploratory data analysis.
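The cell above fills the zeros with a mean that still includes those zeros. As a rough sketch of one alternative, the zeros can first be marked as missing so that the fill value is the mean of the observed measurements only. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Mark physiologically impossible zeros as missing
toy_nan = toy.replace(0, np.nan)
zeros_per_column = toy_nan.isna().sum()
print(zeros_per_column)

# Fill each gap with the mean of the observed (non-zero) values in that column
toy_filled = toy_nan.fillna(toy_nan.mean())
print(toy_filled)
```

Whether this changes the downstream scores in this notebook has not been checked here; it is only meant to show the mechanics.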

---

<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Exploratory data analysis</p>

### Moving ahead, let's have a look at the distribution of our dependent variable - **Outcome**

In [None]:
group_Outcome= data.groupby('Outcome')['Pregnancies'].count().reset_index()
group_Outcome.rename(columns={'Pregnancies':'Count'}, inplace=True)
group_Outcome['Percentages'] = round(group_Outcome['Count']/sum(group_Outcome['Count'])*100,2)

# fig
fig = plt.figure(figsize=(12,4))

# axes
axes = fig.add_axes([0,0,1,1])

# barh
axes.barh(width=group_Outcome['Percentages'][0]+group_Outcome['Percentages'][1], y=0, color='silver')
axes.barh(width=group_Outcome['Percentages'][0], y=0, color='steelblue')

# annotation
axes.text(group_Outcome['Percentages'][0]/2.5, 0, f"{group_Outcome['Percentages'][0]}%", color='black', fontsize=30, fontweight='bold')
axes.text(group_Outcome['Percentages'][0]/2.5, -0.1, f"({group_Outcome['Count'][0]})", color='black', fontsize=30, fontweight='bold')
axes.text((group_Outcome['Percentages'][0]+group_Outcome['Percentages'][1])/1.3, 0, f"{group_Outcome['Percentages'][1]}%", color='black', fontsize=30, fontweight='bold')
axes.text((group_Outcome['Percentages'][0]+group_Outcome['Percentages'][1])/1.3, -0.1, f"({group_Outcome['Count'][1]})", color='black', fontsize=30, fontweight='bold')

# title
axes.text(group_Outcome['Percentages'][0]/2.2, 0.5, 'No ', color='Black', fontsize=30, fontweight='bold')
axes.text((group_Outcome['Percentages'][0]+group_Outcome['Percentages'][1])/1.27, 0.5, 'Yes', color='Black', fontsize=30, fontweight='bold')

# conclusion
axes.text(110, 0.3, 'The target classes are imbalanced.', fontsize=16, fontweight='bold', color='black', alpha=0.6)
axes.text(110, 0.19, '''The number of people without diabetes significantly 
exceeds the number of people with diabetes.''', fontsize=16, fontweight='bold', color='black', alpha=0.6)

# axis
axes.axis('off')

plt.show()  # plt.show() is the usual way to render a matplotlib figure in a notebook
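The counts and percentages that feed the bar above can also be obtained with `value_counts`. The snippet below is a minimal sketch on a made-up label series, not the notebook's actual `Outcome` column.

```python
import pandas as pd

# Made-up labels for illustration: 0 = no diabetes, 1 = diabetes
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

# Absolute counts and percentage share per class
counts = outcome.value_counts().sort_index()
percentages = (outcome.value_counts(normalize=True).sort_index() * 100).round(2)

summary = pd.DataFrame({"Count": counts, "Percentage": percentages})
print(summary)
```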

<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's try to get an idea about the outliers in our dataset</p>

In [None]:
plt.style.use('ggplot') 

f, ax = plt.subplots(figsize=(11, 15))

ax.set_facecolor('#B1DEFD')
ax.set(xlim=(-.05, 200))
plt.ylabel('Variables')
plt.title("Overview")
ax = sns.boxplot(data = data, 
  orient = 'h', 
  palette = 'Set2',)

### We can clearly see that outliers are present in the data, so we will now cap them at the IQR fences

In [None]:
def Remove_Outlier(col):
    # Return the upper and lower IQR fences (Q3 + 1.5*IQR, Q1 - 1.5*IQR) for a column
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    upper_range = Q3 + (IQR * 1.5)
    lower_range = Q1 - (IQR * 1.5)
    return upper_range, lower_range

# print("Shape before capping outliers: ", data.shape)

# Values outside the fences are capped rather than dropped, so the shape stays the same
for i in ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']:
    ur, lr = Remove_Outlier(data[i])
    data[i] = np.where(data[i] > ur, ur, data[i])
    data[i] = np.where(data[i] < lr, lr, data[i])

# print("Shape after capping outliers: ", data.shape)
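To make the capping step concrete, here is a small worked sketch on a made-up series. The helper name `iqr_fences` and the sample values are assumptions for illustration; `Series.clip` is used as a compact equivalent of the two `np.where` calls.

```python
import numpy as np
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up measurements with one extreme value
insulin = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 500], dtype=float)

lower, upper = iqr_fences(insulin)
# Values beyond a fence are pulled back to the fence; no rows are dropped
capped = insulin.clip(lower=lower, upper=upper)

print(lower, upper)   # Q1 = 14, Q3 = 22, IQR = 8 -> fences 2.0 and 34.0
print(capped.tolist())
```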



<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's have a look at the distribution of the data</p>

In [None]:
# Add all column names to a list except for the target variable (outcome)
columns=data.columns
columns=list(columns)
columns.pop()
print("Column names except for the target column are :",columns)

#Graphs to be plotted with these colors
colours=['b','c','g','k','m','r','y','b']
sns.set(rc={'figure.figsize':(15,17)})
sns.set_style(style='white')
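for i in range(len(columns)):
    plt.subplot(4,2,i+1)
    # sns.distplot is deprecated; histplot with a KDE curve plus rugplot gives the same view
    sns.histplot(data[columns[i]], kde=True, stat='density', color=colours[i])
    sns.rugplot(data[columns[i]], color=colours[i])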

### The plots show that Glucose, BloodPressure and BMI are approximately normally distributed. Pregnancies, Insulin, Age and DiabetesPedigreeFunction are right-skewed.
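A visual reading of skew can be backed up with a number. The sketch below uses synthetic columns (not the dataset) to show how `DataFrame.skew()` separates a roughly symmetric distribution from a right-skewed one; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns for illustration only
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=120, scale=30, size=2000),
    "right_skewed": rng.exponential(scale=80, size=2000),
})

# Sample skewness: near 0 for symmetric data, clearly positive for a long right tail
skewness = toy.skew()
print(skewness)
```

Applied to the notebook's frame, the same call would be `data[columns].skew()`.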

<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how the number of pregnancies affects the chances of being diabetic</p>

In [None]:
# Pregnancies vs Outcome
fig = px.histogram(data, x = data['Pregnancies'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['Pregnancies'], color = 'Outcome')
fig2.show()

### Looking at both plots, we can see that the higher the number of pregnancies, the higher the risk of diabetes


<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how glucose levels affect the chances of being diabetic</p>

In [None]:
#Glucose vs Outcome
fig = px.histogram(data, x = data['Glucose'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['Glucose'], color = 'Outcome')
fig2.show()

### Higher glucose levels are associated with a higher chance of diabetes!


<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how blood pressure affects the chances of being diabetic</p>

In [None]:
#BloodPressure vs Outcome
fig = px.histogram(data, x = data['BloodPressure'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['BloodPressure'], color = 'Outcome')
fig2.show()

### We can see that the probability of diabetes is higher when blood pressure is high.



<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how SkinThickness affects the chances of being diabetic</p>

In [None]:
#SkinThickness vs Outcome
fig = px.histogram(data, x = data['SkinThickness'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['SkinThickness'], color = 'Outcome')
fig2.show()

- This feature needs further analysis


<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how the insulin level affects the chances of being diabetic</p>

In [None]:
#Insulin vs Outcome
fig = px.histogram(data, x = data['Insulin'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['Insulin'], color = 'Outcome')
fig2.show()

### We observe that the higher the insulin level, the higher the chances of diabetes.


<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how BMI affects the chances of being diabetic</p>

In [None]:
#BMI vs Outcome
fig = px.histogram(data, x = data['BMI'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['BMI'], color = 'Outcome')
fig2.show()

### We observe that the higher the BMI, the higher the chances of diabetes.


<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how DiabetesPedigreeFunction affects the chances of being diabetic</p>

In [None]:
# DiabetesPedigreeFunction and Outcome
fig = px.histogram(data, x = data['DiabetesPedigreeFunction'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['DiabetesPedigreeFunction'], color = 'Outcome')
fig2.show()

### We observe that diabetic people tend to have higher DiabetesPedigreeFunction values, i.e. genetic influence appears to play some role in diabetes among these patients.




<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Let's look at how age affects the chances of being diabetic</p>

In [None]:
# Age and Outcome
fig = px.histogram(data, x = data['Age'], color = 'Outcome')
fig.show()
fig2 = px.box(data, x = data['Age'], color = 'Outcome')
fig2.show()

### We observe that diabetes is less likely among younger people and more likely among older people
---


<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Scaling the Data</p>


In [None]:
# scaler (StandardScaler is imported at the top of the notebook)
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
scaler = StandardScaler()
norm = scaler.fit_transform(data[feature_cols])
# Build the frame straight from the scaled array so each column lines up with its feature name
df_norm = pd.DataFrame(norm, columns=feature_cols, index=data.index)
df_norm['Outcome'] = data['Outcome']
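The scaling cell above fits the scaler on every row before the train/test split. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only, so the test rows do not influence the scaling statistics. The array names and the random data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and labels standing in for the real data
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the training statistics for the test rows
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.shape)
```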



<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Splitting the Data into training and testing sets</p>

In [None]:
# split
x = df_norm.drop(['Outcome'], axis=1)
y = df_norm['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)



<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Oversampling the data using SMOTE to deal with imbalance in dataset</p>


In [None]:
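# over sampling (applied to the training split only)
smote = SMOTE(random_state=42)  # named smote so the os module imported earlier is not shadowed
columns = x_train.columns
os_data_x, os_data_y = smote.fit_resample(x_train, y_train)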
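For intuition about what oversampling adds, the sketch below illustrates the interpolation idea behind SMOTE in plain NumPy: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under that assumption, not imbalanced-learn's implementation, and the function name and sample points are hypothetical.

```python
import numpy as np

def synthetic_minority_points(minority, n_new, k=2, seed=0):
    """Create n_new points by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from the chosen sample to every minority sample
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Position 0 of the sort is the sample itself; keep the next k as neighbours
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        # Random position along the segment between the sample and its neighbour
        gap = rng.random()
        new_points.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(new_points)

# Made-up minority-class samples with two features
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = synthetic_minority_points(minority, n_new=6)
print(new_points)
```

Because each new point lies between two existing minority samples, the synthetic rows stay inside the region already occupied by that class.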

---

<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Modelling</p>


In [None]:
# logistic regression
log_params = {'penalty':['l1', 'l2'], 
              'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 
              'solver':['liblinear', 'saga']} 
log_model = GridSearchCV(LogisticRegression(), log_params, cv=5) #Tuning the hyper-parameters
log_model.fit(os_data_x, os_data_y)
log_predict = log_model.predict(x_test)
log_score = log_model.best_score_  # mean cross-validated accuracy of the best parameter set

In [None]:
# knn
knn_params = {'n_neighbors': list(range(3, 20, 2)),
          'weights':['uniform', 'distance'],
          'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
          'metric':['euclidean', 'manhattan', 'chebyshev', 'minkowski']}
knn_model = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5) #Tuning the hyper-parameters
knn_model.fit(os_data_x, os_data_y)
knn_predict = knn_model.predict(x_test)
knn_score = knn_model.best_score_

In [None]:
# svc
svc_params = {'C': [0.001, 0.01, 0.1, 1],
              'kernel': [ 'linear' , 'poly' , 'rbf' , 'sigmoid' ]}
svc_model = GridSearchCV(SVC(), svc_params, cv=5) #Tuning the hyper-parameters
svc_model.fit(os_data_x, os_data_y)
svc_predict = svc_model.predict(x_test)
svc_score = svc_model.best_score_


In [None]:
# decision tree
dt_params = {'criterion' : ['gini', 'entropy'],
              'splitter': ['random', 'best'], 
              'max_depth': [3, 5, 7, 9, 11, 13]}
dt_model = GridSearchCV(DecisionTreeClassifier(), dt_params, cv=5) #Tuning the hyper-parameters
dt_model.fit(os_data_x, os_data_y)
dt_predict = dt_model.predict(x_test)
dt_score = dt_model.best_score_

In [None]:
# rf
rf_params = {'criterion' : ['gini', 'entropy'],
             'n_estimators': list(range(5, 26, 5)),
             'max_depth': list(range(3, 20, 2))}
rf_model = GridSearchCV(RandomForestClassifier(), rf_params, cv=5) #Tuning the hyper-parameters
rf_model.fit(os_data_x, os_data_y)
rf_predict = rf_model.predict(x_test)
rf_score = rf_model.best_score_

In [None]:
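# sgd
# 'log_loss' is the current name of the logistic loss (scikit-learn >= 1.1); None means no penalty
sgd_params = {'loss' : ['hinge', 'log_loss', 'squared_hinge', 'modified_huber'],
              'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10],
              'penalty' : ['l2', 'l1', None]}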
sgd_model = GridSearchCV(SGDClassifier(max_iter=10000), sgd_params, cv=5) #Tuning the hyper-parameters
sgd_model.fit(os_data_x, os_data_y)
sgd_predict = sgd_model.predict(x_test)
sgd_score = sgd_model.best_score_

In [None]:
# lgb
lgb_params = {'n_estimators': [5, 10, 15, 20, 25, 50, 100],
                   'learning_rate': [0.01, 0.05, 0.1],
                   'num_leaves': [7, 15, 31],
                  }
lgb_model = GridSearchCV(LGBMClassifier(), lgb_params, cv=5) #Tuning the hyper-parameters
lgb_model.fit(os_data_x, os_data_y)
lgb_predict = lgb_model.predict(x_test)
lgb_score = lgb_model.best_score_

In [None]:
# xgb
xgb_params = {'max_depth': [3, 5, 7, 9],
              'n_estimators': [5, 10, 15, 20, 25, 50, 100],
              'learning_rate': [0.01, 0.05, 0.1]}
xgb_model = GridSearchCV(xgb.XGBClassifier(eval_metric='logloss'), xgb_params, cv=5) #Tuning the hyper-parameters
xgb_model.fit(os_data_x, os_data_y)
xgb_predict = xgb_model.predict(x_test)
xgb_score = xgb_model.best_score_

In [None]:
# cb
cb_params = {'learning_rate': [0.01, 0.05, 0.1],
             'depth': [3, 5, 7, 9]}
cb_model = GridSearchCV(cb.CatBoostClassifier(verbose=False), cb_params, cv=5) #Tuning the hyper-parameters
cb_model.fit(os_data_x, os_data_y)
cb_predict = cb_model.predict(x_test)
cb_score = cb_model.best_score_

---


<p style="font-family: Arials; line-height: 1.3; font-size: 27px; font-weight: bold; letter-spacing: 2px; text-align: center; color: #23527c">Evaluation</p>


In [None]:
models = ['LogisticRegression', 'KNeighborsClassifier', 'SVC', 'DecisionTreeClassifier', 
          'RandomForestClassifier', 'SGDClassifier', 'LGBMClassifier', 'XGBClassifier', 'CatBoostClassifier']
scores = [log_score, knn_score, svc_score, dt_score, rf_score, sgd_score, lgb_score, xgb_score, cb_score]
# Each score is GridSearchCV.best_score_: the mean 5-fold CV accuracy on the oversampled training data
score_table = pd.DataFrame({'Model':models, 'Score':scores})
score_table = score_table.sort_values(by='Score', ascending=False)
print(score_table)
sns.barplot(x = score_table['Score'], y = score_table['Model'], palette='viridis');



<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">We can see that the CatBoost classifier is the best-performing model, with a score of .829599 (mean 5-fold cross-validated accuracy on the oversampled training data)</p>

In [None]:
# Print the classification report for the CatBoost classifier on the held-out test set
from sklearn import metrics
print('Classification Report_test','\n',metrics.classification_report(y_test, cb_predict))
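The scores in the table come from `best_score_`, which is a cross-validation figure, while the classification report uses the held-out test set. The sketch below, on synthetic data from `make_classification`, shows how both numbers and a confusion matrix can be obtained side by side for one model. The data, parameter grid and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the real features
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Mean cross-validated accuracy on the training data
cv_accuracy = search.best_score_

# Accuracy and confusion matrix on rows the search never saw
test_predictions = search.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
matrix = confusion_matrix(y_test, test_predictions)

print(f"CV accuracy:   {cv_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
print(matrix)
```

Reporting the test figure next to the cross-validation figure makes it easier to judge how well the ranking in the score table carries over to unseen data.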


---

<p style="font-family: Arials; line-height: 1.3; font-size: 23px; font-weight: bold; letter-spacing: 2px; text-align: left; color: #23527c">Thank You!</p>
