# Project 1 - Diabetes and Possible Contributing Factors

This Jupyter notebook contains the cleaning, analysis, and visualization of the diabetes_prediction_dataset from Kaggle.

## Contents

- **Importing and Cleaning the Data**
    - Importing the csv
    - Replacing 1/0 Values with True/False Values
    - Cleaning the smoking_history column
    - Dropping n/a values and checking value counts
    - Viewing the clean dataset

- **Analysis**
    - Summary Table 1 - Age, BMI, HbA1c Level, and Blood Glucose Level Among Diabetics and Non-Diabetics
    - Summary Table 2 - Diabetics with Hypertension, Heart Disease, Both, or Neither
    - Summary Table 3 - Diabetics that are at least 45 Years Old, 25+ BMI, Both, or Neither

- **Visualizations**
    

## Importing and Cleaning the Data

The first step is to import the CSV into a dataframe and save it to a variable. Some of the columns in this dataset hold 0 and 1 values, so these will be replaced with True/False values.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Resources/diabetes_prediction_dataset.csv")
df.head()

The hypertension, heart_disease, and diabetes columns all hold ones and zeros, indicating the presence or absence of these conditions. To make these columns easier to understand, the values will be changed to True/False, where True indicates the presence of a condition and False indicates its absence.

In [None]:
# creating a separate copy for the clean dataframe so the original df is not modified
clean_df = df.copy()

# converting hypertension, heart_disease, and diabetes columns to boolean values
# people either have them, or they don't (assumes the columns contain only 0 and 1)
for column in ["hypertension", "heart_disease", "diabetes"]:
    clean_df[column] = clean_df[column].astype(bool)

clean_df
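
Before converting, it can be worth confirming that each column really holds only 0 and 1, since any other value (a 2, for example) would not map cleanly to True/False. The following is a minimal sketch on a made-up three-row frame; the name `flag_columns` and all values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real columns (values are illustrative only)
toy = pd.DataFrame({
    "hypertension": [0, 1, 0],
    "heart_disease": [0, 0, 1],
    "diabetes": [1, 0, 0],
})

flag_columns = ["hypertension", "heart_disease", "diabetes"]

# Confirm each flag column contains nothing but 0 and 1
only_binary = all(set(toy[col].unique()) <= {0, 1} for col in flag_columns)

# astype returns a new frame, so the original toy frame is left untouched
toy_bool = toy.astype({col: bool for col in flag_columns})

print(only_binary)
print(toy_bool.dtypes)
```

If the check fails on the real data, the unexpected values should be inspected before any conversion is applied.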

The relevant columns are now set to boolean values. It's time to take a closer look at the values for each column.

Let's look at the value counts for the smoking_history column.

In [None]:
clean_df["smoking_history"].value_counts()

In this column, some patients are labeled "ever". This may be a mistyping of "never", but that cannot be confirmed. Because these rows make up less than 5% of the dataset, they will be removed.

In [None]:
# removing rows where smoking history is "ever"
clean_df = clean_df[clean_df["smoking_history"] != "ever"]
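
The share of rows in a category can be checked directly with `value_counts(normalize=True)`, which returns each category's fraction of the rows. This is a small sketch on made-up data; the category counts are invented for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Invented smoking_history values; the real column has many more rows
toy = pd.DataFrame({
    "smoking_history": ["never"] * 12 + ["current"] * 5 + ["former"] * 2 + ["ever"] * 1
})

# Share of rows in each category (fractions sum to 1)
shares = toy["smoking_history"].value_counts(normalize=True)
ever_share = shares.get("ever", 0.0)

# Same filter as the notebook, applied to the toy frame
without_ever = toy[toy["smoking_history"] != "ever"]
print(f"'ever' share: {ever_share:.0%}, rows kept: {len(without_ever)}")
```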

We initially considered removing the "no info" category as well. However, we decided to keep it, because it accounts for approximately 30% of the data.

It is odd that 15 patients under the age of 5 are listed as currently smoking, and that 61 patients under 5 are listed as having smoked in the past. Because these records cannot be verified, we decided to limit the scope of our study to adults older than 21. This removes the implausible records and keeps the analysis focused on contributing factors to diabetes in adults.

In [None]:
# limiting the scope of the study to adults over 21
clean_df = clean_df[clean_df["age"] > 21]
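
Counts like the ones quoted for very young patients can be found by filtering on age before counting categories. This is a sketch of that kind of check, using a made-up frame in place of the real data.

```python
import pandas as pd

# Invented ages and smoking categories, for illustration only
toy = pd.DataFrame({
    "age": [2, 4, 3, 30, 52, 21, 67],
    "smoking_history": ["current", "former", "never", "current", "never", "former", "never"],
})

# Smoking categories among patients under 5
under_5 = toy[toy["age"] < 5]
under_5_counts = under_5["smoking_history"].value_counts()

# Same age filter as the notebook: strictly older than 21
adults = toy[toy["age"] > 21]

print(under_5_counts)
print(len(adults))
```

Note that a strict `>` comparison excludes patients who are exactly 21.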

Let's drop the n/a values and check the values for all of the categorical and boolean columns.

In [None]:
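# dropping na values
clean_df = clean_df.dropna()

# printing the value counts for each categorical and boolean column
columns_to_check = ["gender", "hypertension", "heart_disease", "smoking_history", "diabetes"]
for column in columns_to_check:
    print(clean_df[column].value_counts())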
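
It can help to see how many values are missing in each column before calling `dropna()`, so the number of rows lost is not a surprise. This is a minimal sketch with invented values.

```python
import pandas as pd

# Invented rows with a few missing entries
toy = pd.DataFrame({
    "gender": ["Female", "Male", None, "Female"],
    "bmi": [27.3, None, 22.1, 30.5],
    "diabetes": [True, False, False, True],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```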

This is the cleaned dataset in its current form:

In [None]:
# checking remaining rows
clean_df.count()

In [None]:
# viewing the current clean_df
clean_df

# Analysis - Summary Tables

Summary tables give an overview of the data. The first table compares summary statistics between patients with and without diabetes.

## Summary Table 1

### Age, BMI, HbA1c Level, and Blood Glucose Level Among Diabetics and Non-Diabetics


In [None]:
# creating a summary table comparing age, bmi, HbA1c levels, and blood glucose levels
# between people who have and don't have diabetes
summary_1 = clean_df.groupby("diabetes")[["age", "bmi", "HbA1c_level", "blood_glucose_level"]].agg(["mean", "median", "std"])

# transposing the dataframe so it is easier to read
summary_1.transpose()

## Summary Table 2

### Diabetics with Hypertension, Heart Disease, Both, or Neither

This table looks at combinations of contributing factors for those with diabetes.

In [None]:
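# keeping only the rows for patients with diabetes
diabetes_df = clean_df[clean_df["diabetes"] == True]

# counting diabetics in each combination of heart disease and hypertension
summary_2 = diabetes_df.groupby(["heart_disease", "hypertension"])[["diabetes"]].count()

# converting the counts to proportions of all diabetics
summary_2a = round(summary_2 / diabetes_df["diabetes"].count(), 2)

summary_2a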

#### Based on this summary table:

- **65%** of diabetics in this dataset have **no heart disease or hypertension.**

- **21%** of diabetics in this dataset have **hypertension**, but **no heart disease.**

- **11%** of diabetics in this dataset have **heart disease**, but **no hypertension.**

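- Only **4%** of diabetics in this dataset have **both hypertension and heart disease.**

Because each share is rounded to the nearest percent, the four figures add up to 101% rather than 100%.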
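
Shares like these can also be produced with `pd.crosstab` and `normalize=True`, which divides each cell by the total count. This is an alternative to the `groupby` approach used in the notebook, not the method the notebook uses. The sketch below runs on an invented frame of ten diabetic patients.

```python
import pandas as pd

# Ten invented diabetic patients (values chosen for illustration only)
toy_diabetics = pd.DataFrame({
    "heart_disease": [False] * 8 + [True] * 2,
    "hypertension": [False] * 6 + [True] * 2 + [False, True],
})

# Each cell is the fraction of all rows with that combination
shares = pd.crosstab(
    toy_diabetics["heart_disease"],
    toy_diabetics["hypertension"],
    normalize=True,
).round(2)

print(shares)
```

Rows correspond to heart disease and columns to hypertension, so the layout is a two-by-two table instead of a multi-index column.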

## Summary Table 3

### Diabetics that are at least 45 Years Old, 25+ BMI, Both, or Neither

In [None]:
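# grouping diabetics by whether their bmi is 25 or over and whether their age is 45 or over
summary_3 = diabetes_df.groupby([(diabetes_df["bmi"] >= 25),
                                 (diabetes_df["age"] >= 45)])[["diabetes"]].count()

# converting the counts to proportions of all diabetics
summary_3a = round(summary_3 / diabetes_df["diabetes"].count(), 2)

summary_3a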


#### Based on this summary table:

- **80%** of diabetics in this dataset are **45+** and have a **BMI of 25+.**

- **10%** of diabetics in this dataset are **under 45** with a **BMI of 25+.**

- **9%** of diabetics in this dataset are **45+**, but have a **BMI under 25.**

- Just **1%** of diabetics are **under 45** with a **BMI under 25.**
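
Naming the two threshold conditions as columns first can make the grouped output easier to read than grouping on unnamed expressions. This is a sketch under that assumption; the column names `bmi_25_plus` and `age_45_plus` are hypothetical, and the data is invented.

```python
import pandas as pd

# Ten invented diabetic patients (values chosen for illustration only)
toy_diabetics = pd.DataFrame({
    "age": [50, 62, 47, 71, 58, 66, 49, 80, 30, 40],
    "bmi": [31.0, 28.4, 26.1, 33.2, 29.9, 27.5, 35.0, 22.0, 30.1, 21.5],
})

# Hypothetical flag columns for the two thresholds used in the notebook
flags = toy_diabetics.assign(
    bmi_25_plus=toy_diabetics["bmi"] >= 25,
    age_45_plus=toy_diabetics["age"] >= 45,
)

# Fraction of patients in each (BMI, age) combination
shares = flags.groupby(["bmi_25_plus", "age_45_plus"]).size() / len(flags)

print(shares.round(2))
```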

# Visualizations