# Diabetes Health Indicators Dataset - EDA, Data Preparation

_ | Details
--- | ---
Tasks | Perform EDA on [Diabetes Health Indicators Dataset](https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset?select=diabetes_012_health_indicators_BRFSS2015.csv)<br>Perform Data Preparation, such as missing value mitigation, feature selection, one-hot encoding, and categorical encoding.
Owner | alexteboul
ID | alexteboul/diabetes-health-indicators-dataset
Tags | health, classification, beginner, diabetes, public health
Subtitle | 253,680 survey responses from cleaned BRFSS 2015 + balanced dataset
Description | Read the complete description from [Diabetes Health Indicators Dataset](https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset?select=diabetes_012_health_indicators_BRFSS2015.csv)<br>... `diabetes_012_health_indicators_BRFSS2015.csv` is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable Diabetes_012 has 3 classes: 0 for no diabetes or diabetes only during pregnancy, 1 for prediabetes, and 2 for diabetes. The classes are imbalanced. The dataset has 21 feature variables.
License | CC0: Public Domain

**Variable Descriptions**

Variable Type | Variable | Description
--- | --- | ---
Target | Diabetes_012 | 0 = no diabetes, 1 = prediabetes, 2 = diabetes
Feature | HighBP | 0 = no high BP, 1 = high BP
Feature | HighChol | 0 = no high cholesterol, 1 = high cholesterol
Feature | CholCheck | 0 = no cholesterol check in 5 years, 1 = cholesterol check in 5 years
Feature | BMI | Body Mass Index
Feature | Smoker | Have you smoked at least 100 cigarettes in your entire life? <br>[Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes
Feature | Stroke | (Ever told) you had a stroke. 0 = no, 1 = yes
Feature | HeartDiseaseorAttack | Coronary heart disease (CHD) or myocardial infarction (MI)<br> 0 = no, 1 = yes
Feature | PhysActivity | Physical activity in past 30 days, not including job<br> 0 = no, 1 = yes
Feature | Fruits | Consumes fruit 1 or more times per day<br> 0 = no, 1 = yes
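
The target codes in the table above can be mapped to readable labels and used to measure class balance. The following is a minimal sketch on a small hand-made sample rather than the real file; `TARGET_LABELS` and `class_shares` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical label map built from the Diabetes_012 row of the table above.
TARGET_LABELS = {0: "no diabetes", 1: "prediabetes", 2: "diabetes"}


def class_shares(target: pd.Series) -> pd.Series:
    """Return the share of each target class, indexed by readable label."""
    labels = target.astype(int).map(TARGET_LABELS)
    return labels.value_counts(normalize=True)


# Toy values for illustration only; not taken from the dataset.
sample = pd.Series([0.0, 0.0, 0.0, 2.0, 1.0, 0.0, 2.0, 0.0], name="Diabetes_012")
shares = class_shares(sample)
print(shares)
```

Applied to the loaded data, the same helper would take `df["Diabetes_012"]` as input.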

## Setup

In [1]:
import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

# Load environment variables from a .env file (magics provided by the python-dotenv package)
import os
%load_ext dotenv
%dotenv

In [25]:
# Setup env variables
KAGGLE_USERNAME = os.getenv("KAGGLE_USERNAME")
KAGGLE_KEY = os.getenv("KAGGLE_KEY")
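
If either variable is missing from `.env`, `os.getenv` silently returns `None`, and the problem only shows up later when the Kaggle CLI is called. A small check can report it early. This is a sketch, and `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


required = ["KAGGLE_USERNAME", "KAGGLE_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Kaggle credentials found in the environment.")
```

The optional `environ` argument exists only so the helper can be exercised with a plain dictionary instead of the real environment.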

In [19]:
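# Download the dataset (uncomment to run). In the Kaggle CLI, -f expects a file name; --force re-downloads an existing copy.
# !kaggle datasets download -q alexteboul/diabetes-health-indicators-dataset --unzip --force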

In [24]:
# Obtain a copy of the dataset metadata
# !kaggle datasets metadata alexteboul/diabetes-health-indicators-dataset

In [17]:
# with open('dataset-metadata.json') as metadata:
#     dataset_metadata = json.load(metadata)
# dataset_metadata

In [5]:
filename = 'diabetes_012_health_indicators_BRFSS2015.csv'

In [6]:
df = pd.read_csv(filename)
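
Since the task list includes missing value mitigation, a quick check right after loading is a natural first step. The sketch below runs on a toy frame instead of the real CSV. It counts missing values per column and converts float columns that hold only whole numbers to a compact integer dtype, on the assumption that coded columns such as `HighBP` are read in as floats. `summarize_missing` and `downcast_whole_floats` are hypothetical helpers.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


def downcast_whole_floats(frame: pd.DataFrame) -> pd.DataFrame:
    """Convert float columns with no NaN and only whole numbers to a small int dtype."""
    out = frame.copy()
    for col in out.select_dtypes(include=["float"]).columns:
        values = out[col]
        if values.notna().all() and (values % 1 == 0).all():
            out[col] = pd.to_numeric(values.astype("int64"), downcast="integer")
    return out


# Toy frame for illustration only; not taken from the dataset.
toy = pd.DataFrame(
    {
        "HighBP": [1.0, 0.0, 1.0],
        "BMI": [40.0, 25.0, np.nan],
        "Smoker": [1.0, 1.0, 0.0],
    }
)
print(summarize_missing(toy))
print(downcast_whole_floats(toy).dtypes)
```

Columns containing missing values are left as floats, so the conversion does not hide gaps that still need handling.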

In [7]:
sns.set_theme()

In [26]:
df

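_ | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0
1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0
2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0
3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0
4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
253675 | 0.0 | 1.0 | 1.0 | 1.0 | 45.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 5.0 | 6.0 | 7.0
253676 | 2.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.0 | 2.0 | 4.0
253677 | 0.0 | 0.0 | 0.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 | 2.0
253678 | 0.0 | 1.0 | 0.0 | 1.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | 5.0 | 1.0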
