![diabetes](images/diabetes.png)

# Detecting Diabetes with Machine Learning

**Authors:** Ian Butler, Red the dog

***

## Overview

A one-paragraph overview of the project, including the problem, data, methods, results and recommendations.

***

## Problem

A summary of the problem we are trying to solve and the questions that we plan to answer to solve it.

This summary, taken from the dataset's home on Kaggle, describes the problem quite well:

"Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate levels of glucose in the blood, and can lead to reduced quality of life and life expectancy. After different foods are broken down into sugars during digestion, the sugars are then released into the bloodstream. This signals the pancreas to release insulin. Insulin helps enable cells within the body to use those sugars in the bloodstream for energy. Diabetes is generally characterized by either the body not making enough insulin or being unable to use the insulin that is made as effectively as needed.

Complications like heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high levels of sugar remaining in the bloodstream for those with diabetes. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatments can mitigate the harms of this disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for public and public health officials.

The scale of this problem is also important to recognize. The Centers for Disease Control and Prevention has indicated that as of 2018, 34.2 million Americans have diabetes and 88 million have prediabetes. Furthermore, the CDC estimates that 1 in 5 diabetics, and roughly 8 in 10 prediabetics are unaware of their risk. While there are different types of diabetes, type II diabetes is the most common form and its prevalence varies by age, education, income, location, race, and other social determinants of health. Much of the burden of the disease falls on those of lower socioeconomic status as well. Diabetes also places a massive burden on the economy, with diagnosed diabetes costs of roughly 327 billion dollars and total costs with undiagnosed diabetes and prediabetes approaching 400 billion dollars annually."

***

Questions to consider:

* What are the organization's pain points related to this project?
    * Ultimately, being able to identify diabetes quickly and early is the best defense against it.
* What questions are we trying to answer?
    * First, can we predict whether a respondent in our data has diabetes?
    * Second, if we can make this prediction, which features of the dataset are most influential in making that prediction?
    * Third, how many of those features do we need to know to make that prediction?
* Why are these questions important from the organization's perspective?
    * With respect to the first, we can estimate the probability that individuals the model has not seen have diabetes.
    * With respect to the second, we can identify which of these features are most worth devoting resources to.
    * With respect to the third, we can build the simplest model that still predicts well, which makes it easier to deploy as a tool or app.
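
The three questions above can be sketched in code. This is only an illustration under stated assumptions: it uses synthetic data from scikit-learn's `make_classification` rather than the project dataset, and logistic regression is a stand-in, not necessarily the model the project ends up using.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data (assumption: NOT the real dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Question 1: estimate the probability of the positive class for unseen rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_probabilities = model.predict_proba(X_test)[:, 1]

# Question 2: rank features by the magnitude of their standardized coefficient.
coefficients = model.named_steps["logisticregression"].coef_[0]
feature_ranking = np.argsort(-np.abs(coefficients))

# Question 3: score models that use only the top-k ranked features.
scores_by_k = {}
for k in range(1, X.shape[1] + 1):
    top_k = feature_ranking[:k]
    candidate = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores_by_k[k] = cross_val_score(candidate, X_train[:, top_k], y_train, cv=5).mean()
```

Plotting `scores_by_k` against `k` would show where adding features stops helping. One simplification to note: the ranking is computed on the same training rows that are then cross-validated, so the scores are slightly optimistic.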

***

## Data Understanding

A description of the data being used for this project.

***

Questions to consider:

* Where did the data come from?
* How does it relate to the questions?
* What does the data represent?
* Who is in the sample?
* What variables are included?
* What is the target variable?
* Which variables do we intend to use?

***

```python
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
```

```python
# here we run our code to explore the data
```
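
A first pass at the questions above might look like the following sketch. The table, its column names (`has_diabetes`, `bmi`, `age_group`) and its values are all made up for illustration; the real data would be loaded from the Kaggle file instead.

```python
import pandas as pd

# Tiny made-up table standing in for the real data
# (all column names and values are hypothetical).
df = pd.DataFrame(
    {
        "has_diabetes": [0, 0, 1, 0, 1, 0],
        "bmi": [24.0, 31.5, 35.2, None, 29.8, 22.1],
        "age_group": [3, 7, 9, 5, 11, 2],
    }
)

# How many respondents and variables are there?
n_rows, n_columns = df.shape

# Which variables have missing values, and how many?
missing_per_column = df.isna().sum()

# How balanced is the target variable?
class_balance = df["has_diabetes"].value_counts(normalize=True)

# Basic distribution of each numeric variable.
summary = df.describe()
```

Checking `class_balance` early matters because a heavily imbalanced target changes which metrics are meaningful later on.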

## Data Preparation

A description and justification of the process for preparing the data.

***

Questions to consider:
* Were there variables we dropped?
* Were there variables we created?
* How did we address missing values?
* How did we address outliers?
* Why are these choices appropriate given the data and the problem?

***

```python
# here we run our code to clean the data
```
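
The sketch below shows one possible way to handle duplicates, missing values and outliers. It again uses a made-up table with hypothetical column names, and the specific choices (median imputation, the 1.5 × IQR rule) are assumptions for illustration, not decisions this project has made.

```python
import pandas as pd

# Made-up table with one duplicate row, one missing value and one extreme value
# (all column names and values are hypothetical).
df = pd.DataFrame(
    {
        "has_diabetes": [0, 0, 1, 0, 1, 0, 0],
        "bmi": [26.0, 31.5, 35.2, None, 29.8, 95.0, 26.0],
        "age_group": [3, 7, 9, 5, 11, 2, 3],
    }
)

# Drop exact duplicate rows.
clean = df.drop_duplicates().reset_index(drop=True)

# Fill missing BMI values with the median of the observed values.
bmi_median = clean["bmi"].median()
clean["bmi"] = clean["bmi"].fillna(bmi_median)

# Flag (rather than drop) values outside 1.5 * IQR of the quartiles.
q1, q3 = clean["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["bmi_outlier"] = ~clean["bmi"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

Flagging outliers instead of deleting them keeps the decision reversible. In a real workflow the median would be computed on the training split only, so that no information leaks from the test set.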

## Data Modeling

A description and justification of the process for analyzing or modeling the data.

***

Questions to consider:
* How did we analyze or model the data?
* How did we iterate on our initial approach to make it better?
* Why are these choices appropriate given the data and the problem?

***

```python
# here we run our code to model the data
```
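
One common way to structure the modeling step is to fit a trivial baseline first and then iterate on a real model. The sketch below assumes synthetic, imbalanced data and a logistic regression tuned on recall; the estimator, the parameter grid and the scoring metric are illustrative assumptions, not the project's final choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the survey data (assumption: NOT the real dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# First real model: scaled logistic regression that reweights the rare class.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]
)

# Iterate: search over regularization strength with 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```

Scoring on recall reflects the idea that missing a true case of diabetes is costlier than a false alarm; a different cost trade-off would call for a different metric.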

## Evaluation

An evaluation of how well our work solves the stated problem.

***

Questions to consider:
* How do we interpret the results?
* How well does our model fit our data?
* How much better is this model than our baseline model?
* How confident are we that our results would generalize beyond the data we have?
* How confident are we that this model would benefit the organization if put into use?
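
The comparison against a baseline and the generalization check can be sketched as follows. The data is synthetic and the model is a plain logistic regression, both assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the survey data (assumption: NOT the real dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# How much better is the model than the baseline on held-out data?
baseline_recall = recall_score(y_test, baseline.predict(X_test))
model_recall = recall_score(y_test, model.predict(X_test))
model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Which kinds of errors does the model make? Rows are true classes, columns are predictions.
matrix = confusion_matrix(y_test, model.predict(X_test))

# How stable is performance across different training folds?
cv_recall = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")
```

A large spread in `cv_recall` across folds would be a warning sign that the test-set score may not generalize.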

***

## Conclusions

A summary of our conclusions about the work we've done, including any limitations and next steps.

***

Questions to consider:
* What would we recommend the organization do as a result of this work?
* What are some reasons why our analysis might not fully solve the problem?
* What else could we do in the future to improve this project?

***