# **Diabetes Prediction Model**


## 1. Import the necessary libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

# for data manipulation
import pandas as pd
import numpy as np

# for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# for data modelling


# for model evaluation


## 2. Load the dataset

In [2]:
df = pd.read_csv('diabetes_data.csv')

## 3. Introductory Insights

Obtain introductory information about the data: a preview of the first and last records, and its shape (number of rows and columns).

In [3]:
df.head(7)

Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,GenHlth,MentHlth,PhysHlth,DiffWalk,Stroke,HighBP,Diabetes
0,4.0,1.0,0.0,1.0,26.0,0.0,0.0,1.0,0.0,1.0,0.0,3.0,5.0,30.0,0.0,0.0,1.0,0.0
1,12.0,1.0,1.0,1.0,26.0,1.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,1.0,0.0
2,13.0,1.0,0.0,1.0,26.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,10.0,0.0,0.0,0.0,0.0
3,11.0,1.0,1.0,1.0,28.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,0.0,3.0,0.0,0.0,1.0,0.0
4,8.0,0.0,0.0,1.0,29.0,1.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,1.0,18.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,7.0,0.0,0.0,0.0,0.0,0.0
6,13.0,1.0,1.0,1.0,26.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


1) Age: 13-level age category (_AGEG5YR) 
    * Level 1 - 18 <= AGE <= 24
    * Level 2 - 25 <= AGE <= 29
    * Level 3 - 30 <= AGE <= 34
    * Level 4 - 35 <= AGE <= 39
    * Level 5 - 40 <= AGE <= 44
    * Level 6 - 45 <= AGE <= 49
    * Level 7 - 50 <= AGE <= 54
    * Level 8 - 55 <= AGE <= 59
    * Level 9 - 60 <= AGE <= 64
    * Level 10 - 65 <= AGE <= 69
    * Level 11 - 70 <= AGE <= 74
    * Level 12 - 75 <= AGE <= 79
    * Level 13 - 80 <= AGE <= 99

<br>

2) Sex: Patient's gender 
    * 1: male 
    * 0: female

<br>

3) HighChol: Does patient have high cholesterol? 
    * 0 = no 
    * 1 = yes

<br>

4) CholCheck: Has the patient had a cholesterol check in the past 5 years?
    * 0 = no 
    * 1 = yes

<br>

5) BMI: Body Mass Index (Weight in kg and Height in m)
    * Formula: Weight / (Height x Height)

<br>

6) Smoker: Has the patient smoked at least 100 cigarettes in their entire life? [Note: 5 packs = 100 cigarettes]
    * 0 = no 
    * 1 = yes

<br>

7) HeartDiseaseorAttack: Does the patient have coronary heart disease (CHD) or myocardial infarction (MI)?
    * 0 = no 
    * 1 = yes

<br>

8) PhysActivity: Has the patient done any physical activity in the past 30 days, not including their job?
    * 0 = no 
    * 1 = yes

<br>

9) Fruits: Does the patient consume fruit 1 or more times per day?
    * 0 = no 
    * 1 = yes

<br>

10) Veggies: Does the patient consume vegetables 1 or more times per day?
    * 0 = no 
    * 1 = yes

<br>

11) HvyAlcoholConsump: Is the patient a heavy drinker? (adult men: >=14 drinks per week; adult women: >=7 drinks per week)
    * 0 = no 
    * 1 = yes

<br>

12) GenHlth: How does the patient rate their general health? Scale 1-5
    * 1 = excellent 
    * 2 = very good 
    * 3 = good 
    * 4 = fair 
    * 5 = poor

<br>

13) MentHlth: How many days has the patient suffered from poor mental health in the past 30 days? 
    * scale 0-30

<br>

14) PhysHlth: How many days has the patient suffered from a physical illness or injury in past 30 days? 
    * scale 0-30

<br>

15) DiffWalk: Does the patient have serious difficulty walking or climbing stairs? 
    * 0 = no 
    * 1 = yes

<br>

16) Stroke: Has the patient ever had a stroke?
    * 0 = no
    * 1 = yes

<br>

17) HighBP: Does the patient have high blood pressure?
    * 0 = no 
    * 1 = yes

<br>

18) Diabetes: Does the patient have diabetes?
    * 0 = no 
    * 1 = yes
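
Two entries in the data dictionary above involve a little arithmetic: the Age levels and BMI. The sketch below decodes an Age level into the age range listed above and computes BMI as weight in kilograms divided by height in metres squared. The helper names `age_range` and `bmi` are illustrative and do not appear in the notebook.

```python
def age_range(level: int) -> tuple[int, int]:
    """Return the (min, max) age in years for an Age level from 1 to 13."""
    if not 1 <= level <= 13:
        raise ValueError(f"Age level must be between 1 and 13, got {level}")
    if level == 1:
        return (18, 24)  # the first band spans seven years
    if level == 13:
        return (80, 99)  # the last band, as listed in the data dictionary
    low = 25 + 5 * (level - 2)  # levels 2-12 are five-year bands starting at 25
    return (low, low + 4)


def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


print(age_range(4))
print(round(bmi(75.0, 1.70), 1))
```

For instance, the first row shown by `df.head(7)` has Age 4.0, which falls in the 35 to 39 band.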

In [4]:
df.tail()

Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,GenHlth,MentHlth,PhysHlth,DiffWalk,Stroke,HighBP,Diabetes
70687,6.0,0.0,1.0,1.0,37.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0
70688,10.0,1.0,1.0,1.0,29.0,1.0,1.0,0.0,1.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,1.0
70689,13.0,0.0,1.0,1.0,25.0,0.0,1.0,0.0,1.0,0.0,0.0,5.0,15.0,0.0,1.0,0.0,1.0,1.0
70690,11.0,0.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,1.0
70691,9.0,0.0,1.0,1.0,25.0,0.0,1.0,1.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,1.0


In [5]:
df.shape

(70692, 18)

## 4. Statistical Insights

In [6]:
df.describe()

Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,GenHlth,MentHlth,PhysHlth,DiffWalk,Stroke,HighBP,Diabetes
count,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0,70692.0
mean,8.584055,0.456997,0.525703,0.975259,29.856985,0.475273,0.14781,0.703036,0.611795,0.788774,0.042721,2.837082,3.752037,5.810417,0.25273,0.062171,0.563458,0.5
std,2.852153,0.498151,0.499342,0.155336,7.113954,0.499392,0.354914,0.456924,0.487345,0.408181,0.202228,1.113565,8.155627,10.062261,0.434581,0.241468,0.49596,0.500004
min,1.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7.0,0.0,0.0,1.0,25.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,9.0,0.0,1.0,1.0,29.0,0.0,0.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.5
75%,11.0,1.0,1.0,1.0,33.0,1.0,0.0,1.0,1.0,1.0,0.0,4.0,2.0,6.0,1.0,0.0,1.0,1.0
max,13.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,30.0,30.0,1.0,1.0,1.0,1.0


## 5. Data Cleaning

Handling outliers, duplicates and missing values

#### 5.1. Duplicates


In [7]:
df.duplicated().sum()

6672

In [8]:
# keep='first' flags every repeat of a row except its first occurrence
duplicate = df[df.duplicated(keep='first')]

print("Duplicate Rows :")

duplicate

Duplicate Rows :


Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,GenHlth,MentHlth,PhysHlth,DiffWalk,Stroke,HighBP,Diabetes
360,6.0,1.0,0.0,1.0,28.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
397,8.0,0.0,0.0,1.0,29.0,0.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0
436,8.0,1.0,0.0,1.0,27.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
514,9.0,0.0,0.0,1.0,22.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
525,7.0,0.0,0.0,1.0,27.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70591,10.0,1.0,1.0,1.0,30.0,0.0,1.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,1.0
70621,10.0,0.0,0.0,1.0,30.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,1.0
70640,6.0,1.0,0.0,1.0,37.0,0.0,0.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,1.0
70642,10.0,0.0,1.0,1.0,35.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,1.0,1.0


In [9]:
df.drop_duplicates(inplace=True)
df.shape

(64020, 18)
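
The row counts above are consistent: 70692 rows minus 6672 duplicates leaves 64020. The same bookkeeping can be sketched on a toy frame. The values below are made up for illustration and are not taken from `diabetes_data.csv`.

```python
import pandas as pd

# Toy data, made up for illustration: rows 0, 2 and 4 are identical.
toy = pd.DataFrame({
    "Age": [4.0, 12.0, 4.0, 8.0, 4.0],
    "BMI": [26.0, 26.0, 26.0, 29.0, 26.0],
    "Diabetes": [0.0, 0.0, 0.0, 1.0, 0.0],
})

n_before = len(toy)
# Count the repeats of each row after its first occurrence.
n_duplicates = int(toy.duplicated(keep="first").sum())

# Return a new frame instead of using inplace=True, so `toy` stays unchanged.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_before, n_duplicates, len(deduped))
```

Note that `drop_duplicates` keeps the original index labels, so they are no longer consecutive after the drop unless the index is reset as in this sketch.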

#### 5.2. Missing Values

In [10]:
df.isnull().sum()

Age                     0
Sex                     0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Stroke                  0
HighBP                  0
Diabetes                0
dtype: int64

#### 5.3. Outliers

Some outliers represent natural variation in the population and should be left in the dataset; these are called true outliers. Other outliers are problematic and should be removed because they result from measurement errors, data entry or processing errors, or poor sampling.

Most of the columns above are binary or ordinal categories, and the concept of outliers does not apply to categorical values. BMI, MentHlth and PhysHlth are numeric, but no rows are removed for them at this step.
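
For a numeric column such as BMI, one common way to flag candidate outliers is Tukey's interquartile-range (IQR) rule. The notebook does not apply this rule; the sketch below is only an illustration, using the BMI quartiles, minimum and maximum reported by `df.describe()` in section 4. The helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# BMI quartiles as reported by df.describe() in section 4.
lower, upper = iqr_fences(q1=25.0, q3=33.0)

# The BMI minimum and maximum from the same table.
bmi_min, bmi_max = 12.0, 98.0
flagged = [value for value in (bmi_min, bmi_max) if value < lower or value > upper]

print(lower, upper, flagged)
```

Being flagged by this rule does not make a value an error. Whether an extreme BMI is a true outlier or a data entry mistake remains the judgment call described above.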

## 6. Data Visualisation

## 7. Data Modelling

#### 7.1. Splitting the Dataset

#### 7.2. Encoding Text Data

#### 7.3. Model Training

## 8. Model Evaluation

In [11]:
# modell.predict_proba(X_test)[0]

## 9. Model Testing