![diabetes](images/diabetes.png)

# Detecting Diabetes with Machine Learning

**Authors:** Ian Butler, Red the dog

***

## Overview

A one-paragraph overview of the project, including the problem, data, methods, results and recommendations.

***

## Problem

A summary of the problem we are trying to solve and the questions that we plan to answer to solve it.

This summary, taken from the dataset's home on Kaggle, describes the problem quite well:

"Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate levels of glucose in the blood, and can lead to reduced quality of life and life expectancy. After different foods are broken down into sugars during digestion, the sugars are then released into the bloodstream. This signals the pancreas to release insulin. Insulin helps enable cells within the body to use those sugars in the bloodstream for energy. Diabetes is generally characterized by either the body not making enough insulin or being unable to use the insulin that is made as effectively as needed.

Complications like heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high levels of sugar remaining in the bloodstream for those with diabetes. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatments can mitigate the harms of this disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for public and public health officials.

The scale of this problem is also important to recognize. The Centers for Disease Control and Prevention has indicated that as of 2018, 34.2 million Americans have diabetes and 88 million have prediabetes. Furthermore, the CDC estimates that 1 in 5 diabetics, and roughly 8 in 10 prediabetics are unaware of their risk. While there are different types of diabetes, type II diabetes is the most common form and its prevalence varies by age, education, income, location, race, and other social determinants of health. Much of the burden of the disease falls on those of lower socioeconomic status as well. Diabetes also places a massive burden on the economy, with diagnosed diabetes costs of roughly 327 billion dollars and total costs with undiagnosed diabetes and prediabetes approaching 400 billion dollars annually."

***

Questions to consider:

* What are the organization's pain points related to this project?
    * Ultimately, being able to identify diabetes quickly and early is the best defense against it.
* What questions are we trying to answer?
    * First, can we predict whether a respondent in our data has diabetes?
    * Second, if we can make this prediction, which features of the dataset are most influential in making that prediction?
    * Third, how many of those features do we need to know to make that prediction?
* Why are these questions important from the organization's perspective?
    * With respect to the first, we can determine the probability of individuals unseen by the model having diabetes.
    * With respect to the second, we can identify which of these features are most worth devoting resources to.
    * With respect to the third, we can make the simplest model possible, in order to use it as a tool or app.

***

## Data Understanding

A description of the data being used for this project.

***

Questions to consider:

* Where does the data come from?
    * This data comes from Kaggle at the following URL:
    * https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
* How does it relate to the questions?
    * The data comes from a survey designed specifically to assess the following:
        * risk behaviors
        * chronic health conditions
        * use of preventive services
* What does the data represent?
    * Each record in the data represents one respondent to the Behavioral Risk Factor Surveillance System survey.
    * Each field in the data represents one question in the Behavioral Risk Factor Surveillance System survey.
* Who is in the sample?
    * The sample consists of 253,680 survey respondents.
* What variables are included?
    * The variables included are as follows:
        * HighBP - High blood pressure
            * Has the respondent ever been told they have high blood pressure by a doctor, nurse, or other health professional?
                * 0 = No
                * 1 = Yes
        * HighChol - High cholesterol
            * Has the respondent ever been told that their blood cholesterol is high by a doctor, nurse or other health professional?
                * 0 = No
                * 1 = Yes
        * CholCheck - Cholesterol checked
            * Has the respondent had their cholesterol checked within the last five years?
                * 0 = No
                * 1 = Yes
        * BMI - Computed body mass index
            * What is the respondent's body mass index (BMI)?
                * a range of integers from 12 to 98, including 12 and 98
        * Smoker - Computed smoking status
            * Has the respondent smoked at least 100 cigarettes in their entire life? [Note: 5 packs = 100 cigarettes]
                * 0 = No
                * 1 = Yes
        * Stroke - Ever diagnosed with a stroke
            * Has the respondent ever been told they had a stroke?
                * 0 = No
                * 1 = Yes
        * HeartDiseaseorAttack - Ever had CHD or MI
            * Did the respondent report ever having coronary heart disease (CHD) or a myocardial infarction (MI)?
                * 0 = No
                * 1 = Yes
        * PhysActivity - Leisure time physical activity
            * Did the respondent report doing physical activity or exercise during the past 30 days other than their regular job?
                * 0 = No
                * 1 = Yes
        * Fruits - Consume fruit 1 or more times per day
            * Did the respondent report that they consume fruit 1 or more times per day?
                * 0 = No
                * 1 = Yes
        * Veggies - Consume vegetables 1 or more times per day
            * Did the respondent report that they consume vegetables 1 or more times per day?
                * 0 = No
                * 1 = Yes
        * HvyAlcoholConsump - Heavy alcohol consumption
            * Is the respondent a heavy drinker (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)?
                * 0 = No
                * 1 = Yes
        * AnyHealthcare - Have any healthcare coverage
            * Does the respondent have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
                * 0 = No
                * 1 = Yes
        * NoDocbcCost - Could not see doctor because of cost
            * Was there a time in the past 12 months when the respondent needed to see a doctor but could not because of cost?
                * 0 = No
                * 1 = Yes
        * GenHlth - General health
            * How did the respondent say that, in general, their health is?
                * 1 = Excellent
                * 2 = Very good
                * 3 = Good
                * 4 = Fair
                * 5 = Poor
        * MentHlth - Number of days mental health not good
            * Now thinking about their mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days did the respondent report that their mental health was not good?
                * a range of integers from 0 to 30, including 0 and 30
        * PhysHlth - Number of days physical health not good
            * Now thinking about their physical health, which includes physical illness and injury, for how many days during the past 30 days did the respondent report that their physical health was not good?
                * a range of integers from 0 to 30, including 0 and 30
        * DiffWalk - Difficulty walking or climbing stairs
            * Did the respondent report having serious difficulty walking or climbing stairs?
                * 0 = No
                * 1 = Yes
        * Sex - Respondent's sex
            * What is the sex of the respondent?
                * 0 = Female
                * 1 = Male
        * Age - Reported age in five-year age categories (calculated variable)
            * What is the age of the respondent?
                * 1 = Age 18 to 24, including 18 and 24
                * 2 = Age 25 to 29, including 25 and 29
                * 3 = Age 30 to 34, including 30 and 34
                * 4 = Age 35 to 39, including 35 and 39
                * 5 = Age 40 to 44, including 40 and 44
                * 6 = Age 45 to 49, including 45 and 49
                * 7 = Age 50 to 54, including 50 and 54
                * 8 = Age 55 to 59, including 55 and 59
                * 9 = Age 60 to 64, including 60 and 64
                * 10 = Age 65 to 69, including 65 and 69
                * 11 = Age 70 to 74, including 70 and 74
                * 12 = Age 75 to 79, including 75 and 79
                * 13 = Age 80 or older, including 80
        * Education - Education level scale 1-6
            * What is the highest grade or year of school the respondent completed?
                * 1 = Never attended school or only kindergarten
                * 2 = Grades 1 through 8 (Elementary)
                * 3 = Grades 9 through 11 (Some high school)
                * 4 = Grade 12 or GED (High school graduate)
                * 5 = College 1 year to 3 years (Some college or technical school)
                * 6 = College 4 years or more (College graduate)
        * Income - Income level scale 1-8
            * What is the respondent's annual household income from all sources?
                * 1 = Less than \$10,000
                * 2 = Less than \$15,000 (\$10,000 to less than \$15,000)
                * 3 = Less than \$20,000 (\$15,000 to less than \$20,000)
                * 4 = Less than \$25,000 (\$20,000 to less than \$25,000)
                * 5 = Less than \$35,000 (\$25,000 to less than \$35,000)
                * 6 = Less than \$50,000 (\$35,000 to less than \$50,000)
                * 7 = Less than \$75,000 (\$50,000 to less than \$75,000)
                * 8 = \$75,000 or more
* What is the target variable?
    * The target variable is:
        * Diabetes_012 - Ever told you have diabetes
            * Has the respondent ever been told that they have diabetes?
                * 0 = no diabetes or only during pregnancy
                * 1 = prediabetes
                * 2 = diabetes
* Which variables do we intend to use?
    * We intend to use all of the original variables in each baseline model for each type of model that we create.
    * Once we have created a baseline model for a particular type of model, we intend to iteratively increase the number of variables of subsequent models, starting from 1, until a satisfactory model performance has been achieved using as few variables as possible.
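
The iterative plan above (start from one variable and add more until performance is satisfactory) could be sketched as a greedy forward selection. This is only an illustration under stated assumptions: the notebook does not say how the next variable is chosen, so the greedy rule, the `forward_select` helper, the `score_fn` callback, and the `target_score` threshold are all hypothetical names introduced here. The toy gains are made-up numbers used to exercise the loop, not results from the data.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    target_score: float,
) -> Tuple[List[str], float]:
    """Greedily add one variable at a time until target_score is reached.

    score_fn is a hypothetical callback: it takes a list of column names and
    returns a validation score (higher is better) for a model trained on them.
    """
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and best_score < target_score:
        # Score every one-variable extension of the current subset.
        scored = [(score_fn(selected + [name]), name) for name in remaining]
        round_score, round_name = max(scored)
        if round_score <= best_score:
            break  # no remaining variable improves the score
        selected.append(round_name)
        remaining.remove(round_name)
        best_score = round_score
    return selected, best_score


# Toy scorer for illustration only: made-up per-variable gains, not real results.
TOY_GAINS = {"HighBP": 0.30, "BMI": 0.25, "GenHlth": 0.20, "Fruits": 0.01}


def toy_score(columns: List[str]) -> float:
    return sum(TOY_GAINS[c] for c in columns)


chosen, score = forward_select(list(TOY_GAINS), toy_score, target_score=0.70)
print(chosen, round(score, 2))
```

In practice, `score_fn` would wrap whatever model and validation metric the project settles on, for example a cross-validated score computed on only the listed columns. Keeping the scorer as a callback leaves the selection loop independent of the model type.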

***

In [None]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# here we run our code to explore the data
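
One possible first exploration step is to confirm that the loaded data matches the coding documented in the Data Understanding section. The sketch below assumes the data has been read into a pandas DataFrame; the `out_of_range_counts` helper and the two lookup tables are hypothetical names introduced here, and the three-row `sample` frame is hand-made for illustration rather than taken from the dataset.

```python
import pandas as pd

# Documented inclusive ranges, copied from the variable list in Data Understanding.
EXPECTED_RANGES = {
    "Diabetes_012": (0, 2),
    "BMI": (12, 98),
    "GenHlth": (1, 5),
    "MentHlth": (0, 30),
    "PhysHlth": (0, 30),
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
}

# Variables documented as 0/1 indicators.
BINARY_COLUMNS = [
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex",
]


def out_of_range_counts(df: pd.DataFrame) -> pd.Series:
    """Count values outside the documented range for each column present in df.

    Missing values (NaN) are counted as out of range.
    """
    ranges = dict(EXPECTED_RANGES)
    ranges.update({col: (0, 1) for col in BINARY_COLUMNS})
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # Series.between is inclusive on both ends by default.
            counts[col] = int((~df[col].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Hand-made rows for illustration; the BMI of 101 is deliberately out of range.
sample = pd.DataFrame(
    {"Diabetes_012": [0, 2, 1], "BMI": [27, 101, 45], "HighBP": [0, 1, 1]}
)
report = out_of_range_counts(sample)
print(report)
```

On the real data, any nonzero count would point to a column worth inspecting before the Data Preparation step.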

## Data Preparation

A description and justification of the process for preparing the data.

***

Questions to consider:
* Were there variables we dropped?
* Were there variables we created?
* How did we address missing values?
* How did we address outliers?
* Why are these choices appropriate given the data and the problem?

***

In [None]:
# here we run our code to clean the data

## Data Modeling

A description and justification of the process for analyzing or modeling the data.

***

Questions to consider:
* How did we analyze or model the data?
* How did we iterate on our initial approach to make it better?
* Why are these choices appropriate given the data and the problem?

***

In [None]:
# here we run our code to model the data

## Evaluation

An evaluation of how well our work solves the stated problem.

***

Questions to consider:
* How do we interpret the results?
* How well does our model fit our data?
* How much better is this model than our baseline model?
* How confident are we that our results would generalize beyond the data we have?
* How confident are we that this model would benefit the organization if put into use?

***

## Conclusions

Our conclusions about the work we've done, including any limitations and next steps.

***

Questions to consider:
* What would we recommend the organization do as a result of this work?
* What are some reasons why our analysis might not fully solve the problem?
* What else could we do in the future to improve this project?

***