Predictive Modeling for Diabetes Diagnosis Using Patient Data

Introduction
Diabetes is a prevalent chronic disease that affects millions of people worldwide. This proposal aims to utilize data science techniques to predict whether a patient has diabetes or not based on various demographic and health-related variables. The analysis will be conducted using a public dataset that includes information on gender, age, hypertension, heart disease, smoking history, BMI, HbA1c level, blood glucose level, and diabetes status.
https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

	The primary question to be addressed in this project is: Can we predict the likelihood of a patient having diabetes based on the following variables:
Sex
Age
BMI
Blood Glucose level

2.   Method:

 We will employ machine learning techniques to conduct the analysis, primarily those learned in the course such as KNearestNeighbors. We will select the most relevant variables (features) based on their impact on diabetes prediction.

Visualizations: We will create clearly labeled curves using Altair along with confusion matrices to visualize the performance of our predictive model. These visualizations will help us assess the accuracy and effectiveness of our model.

3. Expected outcomes and significance:
	From the dataset, we are expected to find the probability of a patient having diabetes based on selected variables: Sex, Age, BMI, and Blood Glucose Level. These variables are real indicators of diabetes and can assist our model in predicting a patient's diagnosis. This analysis could lead to future questions, such as the development of a more accurate predictive model or the exploration of additional risk factors for diabetes. It may also open avenues for studying the impact of diabetes on different demographic groups.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
diabetes = pd.read_csv("https://raw.githubusercontent.com/Angry-Cub/dsci-100-group-4/8a0dcf8e4729ead5372d66f3dcf67491e77296ba/diabetes_prediction_dataset.csv", on_bad_lines='skip', skiprows=range(1, 50000))

diabetes = diabetes[["gender", "age", "bmi", "blood_glucose_level", "diabetes"]]

diabetes

Unnamed: 0,gender,age,bmi,blood_glucose_level,diabetes
0,Female,80.0,27.32,90,0
1,Female,67.0,27.32,80,0
2,Female,58.0,27.32,90,0
3,Male,2.0,18.41,100,0
4,Male,26.0,27.32,100,0
...,...,...,...,...,...
49996,Female,80.0,27.32,90,0
49997,Female,2.0,17.37,100,0
49998,Male,66.0,27.83,155,0
49999,Female,24.0,35.42,100,0


In [3]:
diabetes_train, diabetes_test = train_test_split(diabetes, train_size=.75)

In [4]:
diabetes_train 

Unnamed: 0,gender,age,bmi,blood_glucose_level,diabetes
8735,Male,9.0,16.63,140,0
14989,Female,42.0,30.02,159,0
7047,Female,70.0,22.53,126,0
15422,Male,6.0,19.32,155,0
45095,Female,22.0,22.32,155,0
...,...,...,...,...,...
44637,Male,58.0,32.30,159,1
22066,Male,29.0,28.72,158,0
48767,Male,29.0,27.32,126,0
9916,Male,5.0,21.22,158,0


In [5]:
diabetes_test

Unnamed: 0,gender,age,bmi,blood_glucose_level,diabetes
4907,Female,75.0,42.26,159,1
13813,Female,41.0,27.32,159,0
23584,Female,6.0,16.96,155,0
25131,Female,25.0,27.32,140,0
26212,Male,34.0,30.24,159,0
...,...,...,...,...,...
44261,Female,31.0,27.32,130,0
45536,Male,17.0,27.14,160,0
38572,Male,56.0,33.11,145,0
37135,Female,49.0,19.86,126,0


In [6]:
diabetes_train = diabetes_train.reset_index(drop=True)
diabetes_test = diabetes_test.reset_index(drop=True)

**This marks the beginning of Visualization and table analysis**

In [7]:
grouped = diabetes_train.groupby(['gender', 'diabetes']).size().reset_index(name='Count')

In [8]:
grouped

Unnamed: 0,gender,diabetes,Count
0,Female,0,20290
1,Female,1,1614
2,Male,0,14109
3,Male,1,1481
4,Other,0,6


In [9]:
diabetes_train

Unnamed: 0,gender,age,bmi,blood_glucose_level,diabetes
0,Male,9.0,16.63,140,0
1,Female,42.0,30.02,159,0
2,Female,70.0,22.53,126,0
3,Male,6.0,19.32,155,0
4,Female,22.0,22.32,155,0
...,...,...,...,...,...
37495,Male,58.0,32.30,159,1
37496,Male,29.0,28.72,158,0
37497,Male,29.0,27.32,126,0
37498,Male,5.0,21.22,158,0


In [10]:
diabetes_female = diabetes_train[diabetes_train['gender'] == "Female"]
diabetes_male = diabetes_train[diabetes_train['gender'] == "Male"]

In [11]:
avg_bmi_FWdiabetes = diabetes_female[diabetes_female['diabetes'] == 1.0]["bmi"].mean()
avg_bmi_FWOUTdiabetes = diabetes_female[diabetes_female['diabetes'] == 0.0]["bmi"].mean()
avg_BGL_FWdiabetes= diabetes_female[diabetes_female['diabetes'] == 1.0]["blood_glucose_level"].mean()
avg_BGL_FWOUTdiabetes= diabetes_female[diabetes_female['diabetes'] == 0.0]["blood_glucose_level"].mean()

avg_bmi_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["bmi"].mean()
avg_bmi_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["bmi"].mean()
avg_BGL_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["blood_glucose_level"].mean()
avg_BGL_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["blood_glucose_level"].mean()

avg_bmi_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["age"].mean()
avg_bmi_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["age"].mean()

avg_AGE_FWdiabetes= diabetes_female[diabetes_female['diabetes'] == 1.0]["age"].mean()
avg_AGE_FWOUTdiabetes= diabetes_female[diabetes_female['diabetes'] == 0.0]["age"].mean()

Utilized basic statistical measures to understand data distribution and relationships. This includes:
- Grouping by 'gender' and 'diabetes'.
- Filtering data for males and females.
- Calculating average BMI, blood glucose levels, and age for thedifferent groups.

In [12]:
grouped = grouped.assign(BMI_averages = [avg_bmi_FWOUTdiabetes, avg_bmi_FWdiabetes, avg_bmi_MWOUTdiabetes, avg_bmi_MWdiabetes, 0])
grouped = grouped.assign(BGL_averages = [avg_BGL_FWOUTdiabetes, avg_BGL_FWdiabetes, avg_BGL_MWOUTdiabetes, avg_BGL_MWdiabetes, 0])
grouped = grouped.assign(AGE_averages = [avg_AGE_FWOUTdiabetes, avg_AGE_FWdiabetes, avg_bmi_MWOUTdiabetes, avg_bmi_MWdiabetes, 0])

grouped = grouped.drop(4, axis=0)

**EVERYTHING BELOW IS PART OF VISUALIZING OUR DATA**

In [13]:
import altair as alt
import pandas as pd
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [14]:

diabetes = diabetes[diabetes["gender"] != "Other"]
diabetes

Unnamed: 0,gender,age,bmi,blood_glucose_level,diabetes
0,Female,80.0,27.32,90,0
1,Female,67.0,27.32,80,0
2,Female,58.0,27.32,90,0
3,Male,2.0,18.41,100,0
4,Male,26.0,27.32,100,0
...,...,...,...,...,...
49996,Female,80.0,27.32,90,0
49997,Female,2.0,17.37,100,0
49998,Male,66.0,27.83,155,0
49999,Female,24.0,35.42,100,0


In [15]:
chart_bmi = alt.Chart(diabetes, title="Distribution of BMI").mark_bar().encode(
    x = alt.X("bmi", bin=alt.Bin(maxbins = 20), scale=alt.Scale(domain=[10, 80])),
    y = alt.Y("count()", title = "Count")#.title("Count")
) 

In [16]:
chart_age = alt.Chart(diabetes, title="Distribution of Age").mark_bar().encode(
    x = alt.X("age", bin=alt.Bin(maxbins = 30), title = "Age (years, binned)"),
    y = alt.Y("count()", title = "Count")
)


In [17]:
chart_glucose = alt.Chart(diabetes, title="Distribution of Blood Glucose Levels").mark_bar().encode(
    x = alt.X("blood_glucose_level", bin=alt.Bin(maxbins = 20), title = "Blood Glucose Level (random testing, mg/dl)"),
              #, scale=alt.Scale(domain=[10, 60])
             #).title("Blood Glucose Level (random testing, mg/dl)"),
    y = alt.Y("count()", title = "Count")
)


In [18]:
chart_gender = alt.Chart(diabetes, title="Gender").mark_bar().encode(
    x = alt.X("gender"
              #, scale=alt.Scale(domain=[10, 60])
             , title = "Gender"),
    y = alt.Y("count()", title = "Count")
)

In [19]:
chart_diabetes = alt.Chart(diabetes, title="Diagnosis").mark_bar().encode(
    x = alt.X("diabetes", bin=alt.Bin(maxbins = 2)
              , scale=alt.Scale(domain=[0, 1]), title = "Diabetes (0 = no, 1 = yes)"
             ),
    y = alt.Y("count()", title = "Count")
)

In [20]:
chart_bmi | chart_age | chart_gender | chart_glucose | chart_diabetes

*The table below named GROUPED is the final table has summarized significant values within the dataset.*

*The Visualization below is the final summary of all valid visuals to analyze our data.*

In [21]:
grouped

Unnamed: 0,gender,diabetes,Count,BMI_averages,BGL_averages,AGE_averages
0,Female,0,20290,27.032155,133.007836,40.895545
1,Female,1,1614,32.465855,192.311029,61.136927
2,Male,0,14109,38.981388,133.039407,38.981388
3,Male,1,1481,61.00135,194.195814,61.00135
