**Title:**

*Predictive Modeling for Diabetes Diagnosis Using Patient Data*

**Introduction**:

*Diabetes is a prevalent chronic disease that affects millions of people worldwide. This proposal aims to use data science techniques to predict whether a patient has diabetes based on demographic and health-related variables. The analysis will use a public dataset that includes gender, age, hypertension, heart disease, smoking history, BMI, HbA1c level, blood glucose level, and diabetes status.*

Dataset source: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The primary question to be addressed in this project is: Can the following variables be used to train a model that accurately predicts whether a patient has diabetes?

- Gender
- Age
- BMI
- Blood glucose level

**Method:**

*We will employ machine learning techniques to conduct the analysis, primarily those learned in the course, such as K-nearest neighbors classification. We will select the most relevant variables (features) based on their impact on diabetes prediction.*

*Visualizations: We will create clearly labeled curves using Altair along with confusion matrices to visualize the performance of our predictive model. These visualizations will help us assess the accuracy and effectiveness of our model.*
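As a rough illustration of the method described above, the following sketch fits a K-nearest neighbors classifier and prints a confusion matrix. It runs on synthetic stand-in data, not the project dataset, and the choices of `n_neighbors=5`, the standardization step, and the stratified, seeded split are assumptions made for the example rather than decisions taken in this proposal.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): two features, binary label.
rng = np.random.default_rng(0)
n = 400
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "bmi": rng.normal(27, 4, size=n) + 5 * label,
    "blood_glucose_level": rng.normal(130, 20, size=n) + 60 * label,
    "diabetes": label,
})

X = toy[["bmi", "blood_glucose_level"]]
y = toy["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=1
)

# Standardize first so that no single feature dominates the distance calculation.
# n_neighbors=5 is an arbitrary value chosen only for this sketch.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, knn_pipeline.predict(X_test), labels=[0, 1])
print(cm)
```

The diagonal of the matrix counts correct predictions; the off-diagonal cells separate false positives from false negatives, which matters for a medical diagnosis task.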

**Expected outcomes and significance:**

From the dataset, we expect to estimate the probability of a patient having diabetes based on the selected variables: gender, age, BMI, and blood glucose level. These variables are known to be associated with diabetes and can assist our model in predicting a patient's diagnosis. This analysis could lead to future questions, such as the development of a more accurate predictive model or the exploration of additional risk factors for diabetes. It may also open avenues for studying the impact of diabetes on different demographic groups.


In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

**your report should include code which:**

# Loads data from the original source on the web:

*Loads the data from the raw source, skipping the first 49,999 data rows (the header row is kept) so that the dataset is not too large*

In [2]:
diabetes = pd.read_csv("https://raw.githubusercontent.com/Angry-Cub/dsci-100-group-4/8a0dcf8e4729ead5372d66f3dcf67491e77296ba/diabetes_prediction_dataset.csv", on_bad_lines='skip', skiprows=range(1, 50000))
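The effect of `skiprows=range(1, 50000)` can be checked on a small scale. The sketch below uses a hypothetical six-row CSV held in memory (the column names `id` and `value` are made up) to show that a range starting at 1 drops data rows while leaving line 0, the header, in place.

```python
import io
import pandas as pd

# Hypothetical six-row CSV standing in for the real file.
csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n6,f\n"

# range(1, 4) skips file lines 1, 2 and 3 (the first three data rows);
# line 0, the header, is kept.
subset = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4))
print(subset)
```

By the same logic, `range(1, 50000)` skips file lines 1 through 49999, that is, the first 49,999 data rows.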

# Wrangles and cleans the data for the planned analysis:

*Removes columns not necessary to answer the given research question*

In [3]:
diabetes = diabetes[["gender", "age", "bmi", "blood_glucose_level", "diabetes"]]


*Removes rows where gender is "Other", since there are too few of them to contribute meaningfully to a prediction*

In [20]:
diabetes = diabetes[diabetes["gender"] != "Other"]

*Splits the data into training and test sets for later use, with 75% of the rows used for training*

In [21]:
diabetes_train, diabetes_test = train_test_split(diabetes, train_size=.75)
diabetes_train = diabetes_train.reset_index(drop=True)
diabetes_test = diabetes_test.reset_index(drop=True)
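The split above is random, so it will differ between runs. One possible variation, sketched here on a made-up frame rather than the project data, fixes `random_state` for reproducibility and passes `stratify` so that both sets keep roughly the same share of diabetic patients. The seed value 123 and the toy column `x` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (NOT the real data) with a 10% positive class.
toy = pd.DataFrame({"x": range(200), "diabetes": [1] * 20 + [0] * 180})

# random_state makes the split repeatable; stratify keeps class shares similar.
train, test = train_test_split(
    toy, train_size=0.75, stratify=toy["diabetes"], random_state=123
)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

print(len(train), len(test))
print(train["diabetes"].mean(), test["diabetes"].mean())
```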

# Performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis:

*Grouping by gender and diabetes diagnosis and saving the value to a count column*

- Separating by gender will be a common occurrence throughout the report. This helps ensure that our data analysis is not primarily swayed by gender, allowing us to focus on the variables of interest.

In [39]:
grouped = diabetes.groupby(['gender', 'diabetes']).size().reset_index(name='Count')
grouped

| | gender | diabetes | Count |
|---|---|---|---|
| 0 | Female | 0 | 27030 |
| 1 | Female | 1 | 2185 |
| 2 | Male | 0 | 18783 |
| 3 | Male | 1 | 1995 |
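The counts in the table above also show how unbalanced the two classes are. The following plain-Python sketch copies the four counts by hand and works out the share of diabetic patients overall and within each gender; the variable names are illustrative only.

```python
# Counts copied from the table above: (gender, diabetes) -> number of patients.
counts = {
    ("Female", 0): 27030,
    ("Female", 1): 2185,
    ("Male", 0): 18783,
    ("Male", 1): 1995,
}

total = sum(counts.values())
diabetic = sum(n for (_, status), n in counts.items() if status == 1)
overall_share = diabetic / total

share_by_gender = {}
for gender in ("Female", "Male"):
    group_total = counts[(gender, 0)] + counts[(gender, 1)]
    share_by_gender[gender] = counts[(gender, 1)] / group_total

print(f"Rows: {total}, diabetic: {diabetic} ({overall_share:.1%})")
for gender, share in share_by_gender.items():
    print(f"{gender}: {share:.1%} diabetic")
```

About 8.4% of the rows are diabetic (roughly 7.5% of females and 9.6% of males), so a model that always predicts "no diabetes" would already be right about 91.6% of the time. Accuracy alone may therefore overstate how well a classifier works on this data.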


*The following is the primary summarization sequence for the data*

- First we will grab the relevant average values for bmi, blood glucose levels and age across both female and male groups
- Second, we assign this data to the initial grouped data set which contains the quantity of diabetic patients in each gender group

# 1.

In [54]:
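# diabetes_female and diabetes_male must exist before the means are computed
diabetes_female = diabetes[diabetes['gender'] == "Female"]
diabetes_male = diabetes[diabetes['gender'] == "Male"]

avg_bmi_FWdiabetes = diabetes_female[diabetes_female['diabetes'] == 1.0]["bmi"].mean()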
avg_bmi_FWOUTdiabetes = diabetes_female[diabetes_female['diabetes'] == 0.0]["bmi"].mean()
avg_BGL_FWdiabetes= diabetes_female[diabetes_female['diabetes'] == 1.0]["blood_glucose_level"].mean()
avg_BGL_FWOUTdiabetes= diabetes_female[diabetes_female['diabetes'] == 0.0]["blood_glucose_level"].mean()

avg_bmi_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["bmi"].mean()
avg_bmi_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["bmi"].mean()
avg_BGL_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["blood_glucose_level"].mean()
avg_BGL_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["blood_glucose_level"].mean()

# Age means get their own names so the male BMI means above are not overwritten
avg_AGE_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["age"].mean()
avg_AGE_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["age"].mean()
avg_AGE_FWdiabetes= diabetes_female[diabetes_female['diabetes'] == 1.0]["age"].mean()
avg_AGE_FWOUTdiabetes= diabetes_female[diabetes_female['diabetes'] == 0.0]["age"].mean()

# 2.

In [55]:
grouped = grouped.assign(BMI_averages = [avg_bmi_FWOUTdiabetes, avg_bmi_FWdiabetes, avg_bmi_MWOUTdiabetes, avg_bmi_MWdiabetes])
grouped = grouped.assign(BGL_averages = [avg_BGL_FWOUTdiabetes, avg_BGL_FWdiabetes, avg_BGL_MWOUTdiabetes, avg_BGL_MWdiabetes])
grouped = grouped.assign(AGE_averages = [avg_AGE_FWOUTdiabetes, avg_AGE_FWdiabetes, avg_AGE_MWOUTdiabetes, avg_AGE_MWdiabetes])

grouped

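| | gender | diabetes | Count | BMI_averages | BGL_averages | AGE_averages |
|---|---|---|---|---|---|---|
| 0 | Female | 0 | 27030 | 27.039364 | 133.160895 | 40.938098 |
| 1 | Female | 1 | 2185 | 32.3253 | 193.749657 | 61.463158 |
| 2 | Male | 0 | 18783 | 39.038799 | 133.196933 | 39.038799 |
| 3 | Male | 1 | 1995 | 60.988972 | 194.724311 | 60.988972 |

*In this recorded output, the male BMI_averages (39.04 and 60.99) are identical to the male AGE_averages: the male BMI means were overwritten by the male age means before the table was built, so the male BMI values shown here are not BMI means.*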
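The same summary can be built in a single step with a grouped aggregation, which avoids keeping a dozen separately named averages in sync. This is only a sketch of an alternative approach on a tiny made-up frame; the output column names mirror the ones used above, and the values are invented for the example.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the project data.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "diabetes": [0, 0, 1, 0, 1, 1],
    "age": [30.0, 40.0, 60.0, 35.0, 55.0, 65.0],
    "bmi": [22.0, 26.0, 33.0, 25.0, 31.0, 35.0],
    "blood_glucose_level": [120.0, 140.0, 200.0, 130.0, 190.0, 210.0],
})

# Named aggregation: each keyword becomes an output column,
# defined as (input column, aggregation function).
summary = (
    toy.groupby(["gender", "diabetes"])
    .agg(
        Count=("bmi", "size"),
        BMI_averages=("bmi", "mean"),
        BGL_averages=("blood_glucose_level", "mean"),
        AGE_averages=("age", "mean"),
    )
    .reset_index()
)
print(summary)
```

Because every average is computed from its own column inside one call, a mean cannot accidentally be stored under the wrong variable name.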


# Creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis

- The following steps are required to create efficient and tidy visualizations.

*Splitting the main dataframe into males with and without diabetes and females with and without diabetes*

In [41]:
diabetes_female = diabetes[diabetes['gender'] == "Female"]
diabetes_male = diabetes[diabetes['gender'] == "Male"]

diabetes_male_yes = diabetes_male[diabetes_male["diabetes"] == 1]
diabetes_male_no = diabetes_male[diabetes_male["diabetes"] == 0]

diabetes_female_yes = diabetes_female[diabetes_female["diabetes"] == 1]
diabetes_female_no = diabetes_female[diabetes_female["diabetes"] == 0]


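*After we split the dataframe, we group the rows of each subset into blocks of 100 by their original row index (`index // by_rows`) and average each block into one row, in order to summarize our data in a more computation-friendly format*

- Because each subset keeps the row index of the full dataframe, a block contains only the rows of that subset whose original index falls in the same range of 100, so a block can hold far fewer than 100 rows.
- This ensures that charts are less cluttered while broadly preserving the original variable relationships.
- We remove the gender column temporarily since it is not a quantitative variable; we will bring it back once we are done averaging.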

In [43]:
by_rows = 100


female_concise_yes_diabetes = diabetes_female_yes.drop(columns="gender").groupby(diabetes_female_yes.index // by_rows).mean()
female_concise_no_diabetes = diabetes_female_no.drop(columns="gender").groupby(diabetes_female_no.index // by_rows).mean()
male_concise_yes_diabetes =  diabetes_male_yes.drop(columns="gender").groupby(diabetes_male_yes.index // by_rows).mean()
male_concise_no_diabetes = diabetes_male_no.drop(columns="gender").groupby(diabetes_male_no.index // by_rows).mean()
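The cell above builds its blocks from index labels. If blocks of exactly equal size were wanted instead, one option is to group by row position. The sketch below contrasts the two on a made-up subset with gaps in its index; the names `label_block_size` and `rows_per_block` are hypothetical, and the small block size is chosen only to make the result easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Made-up subset whose index has gaps, as happens after filtering rows.
subset = pd.DataFrame(
    {"age": [50.0, 60.0, 70.0, 80.0, 40.0, 30.0]},
    index=[3, 250, 251, 980, 1500, 1501],
)

# Grouping by label: blocks follow the original index, so block sizes vary.
label_block_size = 100
by_label = subset.groupby(subset.index // label_block_size).mean()

# Grouping by position: every block holds exactly rows_per_block rows
# (except possibly the last one).
rows_per_block = 2
position_block = np.arange(len(subset)) // rows_per_block
by_position = subset.groupby(position_block).mean()

print(by_label)
print(by_position)
```

Label-based grouping yields four blocks of one or two rows here, while position-based grouping yields three blocks of two rows each. Which behaviour is preferable depends on whether equal weighting of the averaged points matters for the charts.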

*Recreating the gender column for visualization purposes within each dataframe*

- We reinstate the gender column with the appropriate genders before concatenating these 4 dataframes into one

In [44]:
male_concise_no_diabetes["gender"] = "male"
male_concise_yes_diabetes["gender"] = "male"

female_concise_yes_diabetes["gender"] = "female"
female_concise_no_diabetes["gender"] = "female"

*Concatenating the 4 dataframes to create a concise replica of the original data at a smaller size*

- diabetes_concise will now be used to create visualizations of our variables to determine any underlying relationships that will help with our exploratory data analysis.

In [46]:
lst = [female_concise_yes_diabetes, female_concise_no_diabetes, male_concise_yes_diabetes, male_concise_no_diabetes]
diabetes_concise = pd.concat(lst)

diabetes_concise

| | age | bmi | blood_glucose_level | diabetes | gender |
|---|---|---|---|---|---|
| 0 | 64.500000 | 28.893333 | 187.500000 | 1.0 | female |
| 1 | 57.000000 | 30.045000 | 190.000000 | 1.0 | female |
| 2 | 57.000000 | 30.182000 | 192.800000 | 1.0 | female |
| 3 | 65.666667 | 40.990000 | 213.333333 | 1.0 | female |
| 4 | 69.000000 | 38.930000 | 212.500000 | 1.0 | female |
| ... | ... | ... | ... | ... | ... |
| 495 | 35.921053 | 25.840263 | 136.421053 | 0.0 | male |
| 496 | 38.228571 | 24.860857 | 145.914286 | 0.0 | male |
| 497 | 42.862069 | 27.133448 | 141.965517 | 0.0 | male |
| 498 | 43.270270 | 26.916216 | 130.702703 | 0.0 | male |


**Everything below is part of visualizing our data**

In [32]:
import altair as alt
alt.data_transformers.disable_max_rows()

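# Scatter plot of the block averages: blood glucose level against age, colored by diagnosis
chart_test = alt.Chart(diabetes_concise, title="Blood Glucose Level vs. Age (block averages)").mark_point(opacity=.4).encode(
    x = alt.X("age", title = "Age (years)"),
    y = alt.Y("blood_glucose_level", title = "Blood Glucose Level (mg/dl)").scale(zero=False),
    color=alt.Color("diabetes:N", title = "Diabetes (0 = no, 1 = yes)")
)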



In [33]:
chart_test

In [12]:
chart_bmi = alt.Chart(diabetes_concise, title="Distribution of BMI").mark_bar().encode(
    x = alt.X("bmi", bin=alt.Bin(maxbins = 20), scale=alt.Scale(domain=[10, 80]), title = "BMI (binned)"),
    y = alt.Y("count()", title = "Count")
)

chart_bmi

In [13]:
chart_age = alt.Chart(diabetes_concise, title="Distribution of Age").mark_bar().encode(
    x = alt.X("age", bin=alt.Bin(maxbins = 30), title = "Age (years, binned)"),
    y = alt.Y("count()", title = "Count")
)

chart_age

In [14]:
chart_glucose = alt.Chart(diabetes_concise, title="Distribution of Blood Glucose Levels").mark_bar().encode(
    x = alt.X("blood_glucose_level", bin=alt.Bin(maxbins = 20), title = "Blood Glucose Level (random testing, mg/dl)"),
    y = alt.Y("count()", title = "Count")
)

chart_glucose


In [15]:
chart_gender = alt.Chart(diabetes_concise, title="Gender").mark_bar().encode(
    x = alt.X("gender", title = "Gender"),
    y = alt.Y("count()", title = "Count")
)

chart_gender

In [17]:
chart_diabetes = alt.Chart(diabetes_concise, title="Diagnosis").mark_bar().encode(
    x = alt.X("diabetes", bin=alt.Bin(maxbins = 2), scale=alt.Scale(domain=[0, 1]), title = "Diabetes (0 = no, 1 = yes)"),
    y = alt.Y("count()", title = "Count")
)

chart_diabetes

In [18]:
chart_bmi | chart_age | chart_gender | chart_glucose | chart_diabetes