# Capstone 3

## Data Wrangling & Cleaning

Diabetes is among the most prevalent chronic diseases in the United States.  It can lead to reduced quality of life and life expectancy, and can also lead to several complications, including heart disease, vision loss, amputation, and kidney disease.  There is no cure for diabetes, but early diagnosis can lead to lifestyle changes and better treatments and outcomes for patients.

Our client, Very Fancy Hospital (hereinafter, “VFH”), is the largest hospital network in Northeastern Pennsylvania.  VFH is also one of the leading hospitals for diabetes research, and patients from all over the country seek treatment there for this chronic disease.

VFH wants to determine whether it is possible to predict which of its current patients are likely to become diabetic in the future.

This project uses data from Kaggle:
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?resource=download

The dataset contains responses to the CDC’s Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey that the CDC conducts annually.  It includes health data from respondents who are diabetic, prediabetic, or non-diabetic (a group that also includes respondents who only experienced gestational diabetes).

Because the target is binary (diabetic or prediabetic versus non-diabetic), this is a classification problem.  We will build several machine learning classification models, then evaluate and compare them using performance metrics chosen to match the client’s goals.

In addition, interpretability analyses will characterize how changes in the identified features affect the predicted probability of each class.

The project is split into four sections:

1. Data wrangling
2. Exploratory data analysis
3. Preprocessing and training
4. Modeling

This is the first section, data wrangling, so we will focus on importing, cleaning, and saving the data.

## 1. Table of Contents

[1. Table of Contents](#1.-Table-of-Contents)

[2. Import Packages](#2.-Import-Packages)

[3. Import Data](#3.-Import-Data)

[4. Explore the Data](#4.-Explore-the-Data)

[5. Export Data](#5.-Export-Data)

## 2. Import Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 3. Import Data

In [35]:
diabetes_data = pd.read_csv('/Users/lauren/Desktop/diabetes_binary_health_indicators_BRFSS2015.csv')
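
The absolute path above only resolves on one machine. As a minimal sketch, the load could go through a small helper that checks for the file first. The `load_csv` helper and the `data` folder are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import os
import pandas as pd

def load_csv(path):
    """Read a CSV after confirming the file exists, so a bad path fails with a clear message."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected data file at: {os.path.abspath(path)}")
    return pd.read_csv(path)

# Hypothetical location: a 'data' folder beside the notebook (an assumption, not the original path)
data_path = os.path.join("data", "diabetes_binary_health_indicators_BRFSS2015.csv")

# Uncomment once the file is in place:
# diabetes_data = load_csv(data_path)
```

Building the path with `os.path.join` keeps it portable across operating systems, and it also puts the already imported `os` module to use.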

## 4. Explore the Data

In [36]:
#Call the info method on diabetes_data to see a summary of the data
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_binary       253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

In [37]:
#Call the head method on diabetes_data to print the first several rows of the data
diabetes_data.head()

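| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |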


In [38]:
#Count (using `.sum()`) the number of missing values (`.isnull()`) in each column of 
#diabetes_data as well as the percentages (using `.mean()` instead of `.sum()`).
#Order them (increasing or decreasing) using sort_values
#Call `pd.concat` to present these in a single table (DataFrame) with the helpful column names 'count' and '%'
missing = pd.concat([diabetes_data.isnull().sum(), 100 * diabetes_data.isnull().mean()], axis=1)
missing.columns = ['count', '%']
missing.sort_values(by='count', ascending=False)

| | count | % |
|---|---|---|
| Diabetes_binary | 0 | 0.0 |
| HighBP | 0 | 0.0 |
| Education | 0 | 0.0 |
| Age | 0 | 0.0 |
| Sex | 0 | 0.0 |
| DiffWalk | 0 | 0.0 |
| PhysHlth | 0 | 0.0 |
| MentHlth | 0 | 0.0 |
| GenHlth | 0 | 0.0 |
| NoDocbcCost | 0 | 0.0 |


Great! Our dataset is not missing any values!
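
The table output is truncated to a few rows, so a single total can make the same point without scanning every line. The sketch below runs on a small made-up frame (`toy`, with one deliberately missing value) rather than on `diabetes_data`.

```python
import pandas as pd

# Toy stand-in for diabetes_data: made-up values, with one missing entry on purpose
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, float("nan")],
    "BMI": [40.0, 25.0, 28.0],
})

# Total number of missing cells across the whole frame
missing_total = int(toy.isnull().sum().sum())

# Names of the columns that contain at least one missing value
cols_with_missing = toy.columns[toy.isnull().any()].tolist()

print(missing_total, cols_with_missing)
```

On the real data, `missing_total` should be zero and `cols_with_missing` should be empty if the dataset has no missing values.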

In [39]:
diabetes_data.columns

Index(['Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')

In [40]:
diabetes_data["Diabetes_binary"].unique()

array([0., 1.])

Please note, 0 = no diabetes while 1 = prediabetes or diabetes
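
The per-column `.unique()` check is repeated for each remaining column in the cells that follow. As an alternative sketch, the same information can be gathered in one pass with a loop. The frame `toy` holds made-up values for three of the columns and is only a stand-in for `diabetes_data`.

```python
import pandas as pd

# Toy stand-in for diabetes_data (made-up values, three columns only)
toy = pd.DataFrame({
    "Diabetes_binary": [0.0, 1.0, 0.0],
    "HighBP": [1.0, 0.0, 1.0],
    "GenHlth": [5.0, 3.0, 2.0],
})

# Collect the sorted unique values of every column in a single dictionary
unique_summary = {col: sorted(toy[col].unique().tolist()) for col in toy.columns}

for col, values in unique_summary.items():
    print(f"{col}: {values}")
```

Separate cells are still useful when each column needs its own explanatory note, as in this notebook.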

In [41]:
diabetes_data["HighBP"].unique()

array([1., 0.])

Please note, 0 = no high blood pressure while 1 = high blood pressure

In [42]:
diabetes_data["HighChol"].unique()

array([1., 0.])

Please note, 0 = no high cholesterol while 1 = high cholesterol

In [43]:
diabetes_data["CholCheck"].unique()

array([1., 0.])

Please note, 0 = they haven't had their cholesterol checked in 5 years while 1 = they have had their cholesterol checked in the last 5 years

In [44]:
diabetes_data["BMI"].unique()

array([40., 25., 28., 27., 24., 30., 34., 26., 33., 21., 23., 22., 38.,
       32., 37., 31., 29., 20., 35., 45., 39., 19., 47., 18., 36., 43.,
       55., 49., 42., 17., 16., 41., 44., 50., 59., 48., 52., 46., 54.,
       57., 53., 14., 15., 51., 58., 63., 61., 56., 74., 62., 64., 66.,
       73., 85., 60., 67., 65., 70., 82., 79., 92., 68., 72., 88., 96.,
       13., 81., 71., 75., 12., 77., 69., 76., 87., 89., 84., 95., 98.,
       91., 86., 83., 80., 90., 78.])

BMI = body mass index
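
For reference, body mass index is weight in kilograms divided by the square of height in meters. The function below is an illustrative sketch with hypothetical inputs. The dataset already stores BMI, so nothing here is needed to reproduce the column.

```python
def body_mass_index(weight_kg, height_m):
    """Return BMI as weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 81 kg and 1.80 m tall
example_bmi = body_mass_index(81.0, 1.80)
print(round(example_bmi, 1))
```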

In [45]:
diabetes_data["Smoker"].unique()

array([1., 0.])

Please note, 0 = they haven't smoked at least 100 cigarettes in their entire life while 1 = they have smoked at least 100 cigarettes in their entire life

In [46]:
diabetes_data["Stroke"].unique()

array([0., 1.])

Please note, 0 = they have not had a stroke in the past while 1 = they have had a stroke in the past

In [47]:
diabetes_data["HeartDiseaseorAttack"].unique()

array([0., 1.])

Please note, 0 = they do not have coronary heart disease (CHD) or myocardial infarction (MI) while 1 = they do have CHD or MI

In [48]:
diabetes_data["PhysActivity"].unique()

array([0., 1.])

Please note, 0 = they have not performed physical activity in the last 30 days (not including their job) while 1 = they have performed physical activity in the last 30 days

In [49]:
diabetes_data["Fruits"].unique()

array([0., 1.])

Please note, 0 = they do not consume fruit at least once per day while 1 = they do consume fruit at least once per day

In [50]:
diabetes_data["Veggies"].unique()

array([1., 0.])

Please note, 0 = they do not consume vegetables at least once per day while 1 = they do consume vegetables at least once per day

In [51]:
diabetes_data["HvyAlcoholConsump"].unique()

array([0., 1.])

Please note, 0 = they do not drink a high amount of alcohol per week while 1 = they do drink a high amount of alcohol per week

High amount of alcohol = more than 14 drinks per week for men and more than 7 drinks per week for women

In [52]:
diabetes_data["AnyHealthcare"].unique()

array([1., 0.])

Please note, 0 = they do not have any kind of health care coverage while 1 = they do have health care coverage

In [53]:
diabetes_data["NoDocbcCost"].unique()

array([0., 1.])

Please note, 0 = there was not a time within the past 12 months where they could not see a doctor due to cost while 1 = there was a time within the past 12 months where they could not see a doctor due to cost

In [54]:
diabetes_data["GenHlth"].unique()

array([5., 3., 2., 4., 1.])

This survey question asked participants to rate their general health on a 1-5 scale, where 1 = excellent, 2 = very good, 3 = good, 4 = fair, and 5 = poor

In [55]:
diabetes_data["MentHlth"].unique()

array([18.,  0., 30.,  3.,  5., 15., 10.,  6., 20.,  2., 25.,  1.,  4.,
        7.,  8., 21., 14., 26., 29., 16., 28., 11., 12., 24., 17., 13.,
       27., 19., 22.,  9., 23.])

This survey question asked how many days participants experienced poor mental health in the last 30 days

In [56]:
diabetes_data["PhysHlth"].unique()

array([15.,  0., 30.,  2., 14., 28.,  7., 20.,  3., 10.,  1.,  5., 17.,
        4., 19.,  6., 12., 25., 27., 21., 22.,  8., 29., 24.,  9., 16.,
       18., 23., 13., 26., 11.])

This survey question asked how many days participants experienced either illness or physical injury in the last 30 days

In [57]:
diabetes_data["DiffWalk"].unique()

array([1., 0.])

Please note, 0 = they do not have serious difficulty walking or climbing stairs while 1 = they do have serious difficulty walking or climbing stairs

In [58]:
diabetes_data["Sex"].unique()

array([0., 1.])

Please note, 0 = female while 1 = male

In [59]:
diabetes_data["Age"].unique()

array([ 9.,  7., 11., 10.,  8., 13.,  4.,  6.,  2., 12.,  5.,  1.,  3.])

This survey question asked for the ages of participants on a 1-13 scale, where 1 = 18 to 24, 2 = 25 to 29, 3 = 30 to 34, 4 = 35 to 39, 5 = 40 to 44, 6 = 45 to 49, 7 = 50 to 54, 8 = 55 to 59, 9 = 60 to 64, 10 = 65 to 69, 11 = 70 to 74, 12 = 75 to 79, and 13 = 80 or older
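
Coded scales like this one are easier to read in plots and tables once the codes are mapped to labels. The sketch below builds a lookup from the scale described above and applies it to a few made-up codes. The names `age_labels` and `toy_age` are illustrative.

```python
import pandas as pd

# Lookup built from the 1-13 age scale described above
age_labels = {
    1: "18-24", 2: "25-29", 3: "30-34", 4: "35-39", 5: "40-44",
    6: "45-49", 7: "50-54", 8: "55-59", 9: "60-64", 10: "65-69",
    11: "70-74", 12: "75-79", 13: "80+",
}

# Toy stand-in for diabetes_data["Age"] (made-up codes, stored as floats like the real column)
toy_age = pd.Series([9.0, 7.0, 11.0, 13.0], name="Age")

# Cast the float codes to int so they match the dictionary keys, then map to labels
age_group = toy_age.astype(int).map(age_labels)
print(age_group.tolist())
```

The same pattern would apply to the other coded columns, each with its own dictionary.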

In [60]:
diabetes_data["Education"].unique()

array([4., 6., 3., 5., 2., 1.])

This survey question asked for the highest level of education participants completed on a 1-6 scale, where 1 = never attended school or only attended kindergarten, 2 = grades 1 through 8, 3 = grades 9 through 11, 4 = grade 12 or GED, 5 = college 1 year to 3 years, and 6 = college 4 years or more

In [61]:
diabetes_data["Income"].unique()

array([3., 1., 8., 6., 4., 7., 2., 5.])

This survey question asked for the income levels of participants on a 1-8 scale, where 1 = less than \$10,000, 2 = \$10,000 to less than \$15,000, 3 = \$15,000 to less than \$20,000, 4 = \$20,000 to less than \$25,000, 5 = \$25,000 to less than \$35,000, 6 = \$35,000 to less than \$50,000, 7 = \$50,000 to less than \$75,000, and 8 = \$75,000 or more

After reviewing every column, the data appear to be clean: there are no missing values, and each coded column contains only the values described for it above.  No further cleaning is needed for this dataset.
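
The column-by-column review can also be written as an automated check, which is easier to rerun if the data are refreshed. The sketch below lists the expected codes for a few columns, taken from the notes above, and reports any value outside them. `find_unexpected` is a hypothetical helper, and `toy` is made-up data with one bad code added on purpose.

```python
import pandas as pd

# Expected codes for a few columns, taken from the column notes above
expected_values = {
    "Diabetes_binary": {0, 1},
    "GenHlth": set(range(1, 6)),    # 1-5 scale
    "Age": set(range(1, 14)),       # 1-13 scale
    "Education": set(range(1, 7)),  # 1-6 scale
    "Income": set(range(1, 9)),     # 1-8 scale
}

def find_unexpected(df, expected):
    """Return {column: sorted unexpected values} for columns holding codes outside the expected set."""
    problems = {}
    for col, allowed in expected.items():
        observed = {float(v) for v in df[col].dropna().unique()}
        extra = observed - {float(a) for a in allowed}
        if extra:
            problems[col] = sorted(extra)
    return problems

# Toy stand-in for diabetes_data: GenHlth contains an invalid code (9.0) on purpose
toy = pd.DataFrame({
    "Diabetes_binary": [0.0, 1.0, 0.0],
    "GenHlth": [5.0, 3.0, 9.0],
    "Age": [9.0, 7.0, 11.0],
    "Education": [4.0, 6.0, 3.0],
    "Income": [3.0, 1.0, 8.0],
})

problems = find_unexpected(toy, expected_values)
print(problems)
```

An empty dictionary means every checked column contains only expected codes.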

In [62]:
diabetes_data.shape

(253680, 22)

In [63]:
print(diabetes_data['Diabetes_binary'].value_counts())

0.0    218334
1.0     35346
Name: Diabetes_binary, dtype: int64


According to the above, we have 218,334 participants (about 86.1%) who are not diabetic and 35,346 participants (about 13.9%) who are prediabetic or diabetic.
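
The class shares can be computed directly from the counts printed above. This sketch rebuilds the two counts as a Series so it runs on its own. On the real data, `diabetes_data['Diabetes_binary'].value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
class_counts = pd.Series({0.0: 218334, 1.0: 35346}, name="Diabetes_binary")

# Share of each class as a percentage of all participants
class_share = (100 * class_counts / class_counts.sum()).round(2)

# Number of non-diabetic participants per prediabetic/diabetic participant
imbalance_ratio = class_counts[0.0] / class_counts[1.0]

print(class_share)
print(round(imbalance_ratio, 2))
```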

In [64]:
diabetes_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Diabetes_binary | 253680.0 | 0.139333 | 0.346294 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| HighBP | 253680.0 | 0.429001 | 0.494934 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| HighChol | 253680.0 | 0.424121 | 0.49421 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| CholCheck | 253680.0 | 0.96267 | 0.189571 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| BMI | 253680.0 | 28.382364 | 6.608694 | 12.0 | 24.0 | 27.0 | 31.0 | 98.0 |
| Smoker | 253680.0 | 0.443169 | 0.496761 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Stroke | 253680.0 | 0.040571 | 0.197294 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| HeartDiseaseorAttack | 253680.0 | 0.094186 | 0.292087 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PhysActivity | 253680.0 | 0.756544 | 0.429169 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Fruits | 253680.0 | 0.634256 | 0.481639 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


## 5. Export Data

In [65]:
diabetes_data.to_csv('/Users/lauren/Desktop/diabetes_data_cleaned.csv')
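
One detail worth knowing for the next notebook: by default `to_csv` writes the row index as an extra column, which comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. The sketch below shows the difference on a made-up frame written to a temporary folder. Whether to pass `index=False` is a design choice, not something the original notebook specifies.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for diabetes_data (made-up values)
toy = pd.DataFrame({"Diabetes_binary": [0.0, 1.0], "BMI": [40.0, 25.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "diabetes_data_cleaned.csv")

    # Default behavior: the row index is written as an unnamed first column
    toy.to_csv(out_path)
    with_index = pd.read_csv(out_path)

    # index=False omits the row index, so the columns round-trip unchanged
    toy.to_csv(out_path, index=False)
    without_index = pd.read_csv(out_path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

If the index is kept in the file, reading it back with `pd.read_csv(path, index_col=0)` restores it as the index instead of as a data column.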