**Title:**

*Predictive Modeling for Diabetes Diagnosis Using Patient Data*

**Introduction**:

*Diabetes is a prevalent chronic disease that affects millions of people worldwide. This proposal aims to utilize data science techniques to predict whether a patient has diabetes or not based on various demographic and health-related variables. The analysis will be conducted using a public dataset that includes information on gender, age, hypertension, heart disease, smoking history, BMI, HbA1c level, blood glucose level, and diabetes status.
https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset*

	The primary question to be addressed in this project is: Can we predict if an individual has diabetes based on the following variables?:
Gender

Age

BMI

&

Blood Glucose level

**Method:**

 *We will employ machine learning techniques to conduct the analysis, primarily those learned in the course such as KNearestNeighbors. We will select the most relevant variables (features) based on their impact on diabetes prediction. We will create a Classification model to predict the diabetes diagnosis*

*Visualizations: We will create clearly labeled curves using Altair along with confusion matrices to visualize the performance of our predictive model. These visualizations will help us assess the accuracy and effectiveness of our model.*

3. Expected outcomes and significance:
	From the dataset, we are expected to find the probability of a patient having diabetes based on selected variables: Sex, Age, BMI, and Blood Glucose Level. These variables are real indicators of diabetes and can assist our model in predicting a patient's diagnosis. This analysis could lead to future questions, such as the development of a more accurate predictive model or the exploration of additional risk factors for diabetes. It may also open avenues for studying the impact of diabetes on different demographic groups.


In [337]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [338]:
pip install -U altair

Collecting altair
  Downloading altair-5.2.0-py3-none-any.whl (996 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m996.9/996.9 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: altair
  Attempting uninstall: altair
    Found existing installation: altair 5.1.2
    Uninstalling altair-5.1.2:
      Successfully uninstalled altair-5.1.2
Successfully installed altair-5.2.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


**your report should include code which:**

# Loads data from the original source on the web:

*Loads data from the raw source, Skips 50000 lines to avoid too large of a dataset*

In [339]:
diabetes = pd.read_csv("https://raw.githubusercontent.com/Angry-Cub/dsci-100-group-4/8a0dcf8e4729ead5372d66f3dcf67491e77296ba/diabetes_prediction_dataset.csv", on_bad_lines='skip', skiprows=range(1, 50000))

# Wrangles and cleans the data for the planned analysis:

*Removes columns not necessary to answer the given research question*

In [340]:
diabetes = diabetes[["gender", "age", "bmi", "blood_glucose_level", "diabetes"]]


*Removes invalid gender rows since it won't contribute enough data to make a prediction*

In [311]:
diabetes = diabetes[diabetes["gender"] != "Other"]

## *Performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis.*

*Grouping by gender and diabetes diagnosis and saving the value to a count column*

- Seperating by gender will be a common occurance throughout the report. This will guarantee that our data analysis is not primarily swayed by gender, allowing us to focus on the variables of interest.

In [312]:
grouped = diabetes.groupby(['gender', 'diabetes']).size().reset_index(name='Count')
grouped

Unnamed: 0,gender,diabetes,Count
0,Female,0,27030
1,Female,1,2185
2,Male,0,18783
3,Male,1,1995


*The following is the primary summarization sequence for the data*

- First we will grab the relevant average values for bmi, blood glucose levels and age across both female and male groups
- Second, we assign this data to the initial grouped data set which contains the quantity of diabetic patients in each gender group

# 1.

In [313]:
diabetes_female = diabetes[diabetes['gender'] == "Female"]
diabetes_male = diabetes[diabetes['gender'] == "Male"]

In [314]:
avg_bmi_FWdiabetes = diabetes_female[diabetes_female['diabetes'] == 1.0]["bmi"].mean()
avg_bmi_FWOUTdiabetes = diabetes_female[diabetes_female['diabetes'] == 0.0]["bmi"].mean()
avg_BGL_FWdiabetes= diabetes_female[diabetes_female['diabetes'] == 1.0]["blood_glucose_level"].mean()
avg_BGL_FWOUTdiabetes= diabetes_female[diabetes_female['diabetes'] == 0.0]["blood_glucose_level"].mean()

avg_bmi_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["bmi"].mean()
avg_bmi_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["bmi"].mean()
avg_BGL_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["blood_glucose_level"].mean()
avg_BGL_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["blood_glucose_level"].mean()

avg_bmi_MWdiabetes = diabetes_male[diabetes_male['diabetes'] == 1.0]["age"].mean()
avg_bmi_MWOUTdiabetes = diabetes_male[diabetes_male['diabetes'] == 0.0]["age"].mean()
avg_AGE_FWdiabetes= diabetes_female[diabetes_female['diabetes'] == 1.0]["age"].mean()
avg_AGE_FWOUTdiabetes= diabetes_female[diabetes_female['diabetes'] == 0.0]["age"].mean()

# 2.

In [315]:
grouped = grouped.assign(BMI_averages = [avg_bmi_FWOUTdiabetes, avg_bmi_FWdiabetes, avg_bmi_MWOUTdiabetes, avg_bmi_MWdiabetes])
grouped = grouped.assign(BGL_averages = [avg_BGL_FWOUTdiabetes, avg_BGL_FWdiabetes, avg_BGL_MWOUTdiabetes, avg_BGL_MWdiabetes])
grouped = grouped.assign(AGE_averages = [avg_AGE_FWOUTdiabetes, avg_AGE_FWdiabetes, avg_bmi_MWOUTdiabetes, avg_bmi_MWdiabetes])

grouped

Unnamed: 0,gender,diabetes,Count,BMI_averages,BGL_averages,AGE_averages
0,Female,0,27030,27.039364,133.160895,40.938098
1,Female,1,2185,32.3253,193.749657,61.463158
2,Male,0,18783,39.038799,133.196933,39.038799
3,Male,1,1995,60.988972,194.724311,60.988972


# Creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis

- The following steps are required to create efficient and tidy visualization.

*Splitting the main dataframe into Male's with and without diabetes & Female's with and without diabetes*

In [316]:
diabetes_male_yes = diabetes_male[diabetes_male["diabetes"] == 1]
diabetes_male_no = diabetes_male[diabetes_male["diabetes"] == 0]

diabetes_female_yes = diabetes_female[diabetes_female["diabetes"] == 1]
diabetes_female_no = diabetes_female[diabetes_female["diabetes"] == 0]


*After we split the dataframe, we can take every 100 rows and compute the average of their results into one row in order to summarize our data in a more computation friendly format*

- This ensures that charts are less cluttered while preserving the original variable relationships.
- We remove the gender column temporarily since it is not a quantitative variable, we will bring it back once we are done averaging.

In [317]:
by_rows = 100


female_concise_yes_diabetes = diabetes_female_yes.drop(columns="gender").groupby(diabetes_female_yes.index // by_rows).mean()
female_concise_no_diabetes = diabetes_female_no.drop(columns="gender").groupby(diabetes_female_no.index // by_rows).mean()
male_concise_yes_diabetes =  diabetes_male_yes.drop(columns="gender").groupby(diabetes_male_yes.index // by_rows).mean()
male_concise_no_diabetes = diabetes_male_no.drop(columns="gender").groupby(diabetes_male_no.index // by_rows).mean()

*Recreating the gender column for visualization purposes within each dataframe*

- We reinstate the gender column with the appropriate genders before concatenating these 4 dataframes into one

In [318]:
male_concise_no_diabetes["gender"] = "Male"
male_concise_yes_diabetes["gender"] = "Male"

female_concise_yes_diabetes["gender"] = "Female"
female_concise_no_diabetes["gender"] = "Female"

*Concatenating the 4 dataframes to create a concise replica of the original data at a smaller size*

- diabetes_concise will now be used to create visualizations of our variables to determine any underlying relationships that will help with our exploratory data analysis.

In [319]:
lst_females = [female_concise_yes_diabetes, female_concise_no_diabetes]
lst_males = [male_concise_yes_diabetes, male_concise_no_diabetes]
lst = [female_concise_yes_diabetes, female_concise_no_diabetes, male_concise_yes_diabetes, male_concise_no_diabetes]

diabetes_concise_females = pd.concat(lst_females)
diabetes_concise_males = pd.concat(lst_males)
diabetes_concise= pd.concat(lst)

diabetes_concise

Unnamed: 0,age,bmi,blood_glucose_level,diabetes,gender
0,64.500000,28.893333,187.500000,1.0,Female
1,57.000000,30.045000,190.000000,1.0,Female
2,57.000000,30.182000,192.800000,1.0,Female
3,65.666667,40.990000,213.333333,1.0,Female
4,69.000000,38.930000,212.500000,1.0,Female
...,...,...,...,...,...
495,35.921053,25.840263,136.421053,0.0,Male
496,38.228571,24.860857,145.914286,0.0,Male
497,42.862069,27.133448,141.965517,0.0,Male
498,43.270270,26.916216,130.702703,0.0,Male


### Visualizing and plotting AGE in comparison to BLOOD_GLUCOSE_LEVEL 

# ASK HERE

In [320]:
import altair as alt
alt.data_transformers.disable_max_rows()

age_bgl_facet = alt.Chart(diabetes_concise, title="Distribution of Age and Blood Glucose Levels FEMALES").mark_point(opacity=.4).encode(
    x = alt.X("age").title("Age"),
    y = alt.Y("blood_glucose_level").scale(zero=False).title("Blood Glucose Level"),
    color=alt.Color("diabetes:N"),
).facet("gender", columns= 1)

bmi_bgl_facet = alt.Chart(diabetes_concise, title="Distribution of Age and Blood Glucose Levels FEMALES").mark_point(opacity=.4).encode(
    x = alt.X("bmi").title("BMI"),
    y = alt.Y("blood_glucose_level").scale(zero=False).title("Blood Glucose Level"),
    color=alt.Color("diabetes:N"),
).facet("gender", columns= 1)

# age_bgl = alt.Chart(diabetes_concise, title="Distribution").mark_point(opacity=.4).encode(
#     x = alt.X("age"),
#     y = alt.Y("blood_glucose_level").scale(zero=False),
#     color=alt.Color("diabetes:N")
# ) 



In [281]:
age_bgl_facet | bmi_bgl_facet

In [282]:
chart_bmi = alt.Chart(diabetes_concise, title="Distribution of BMI").mark_bar().encode(
    x = alt.X("bmi", bin=alt.Bin(maxbins = 20), scale=alt.Scale(domain=[10, 80])),
    y = alt.Y("count()", title = "Count")#.title("Count")
) 

chart_bmi

In [283]:
chart_age = alt.Chart(diabetes_concise, title="Distribution of Age").mark_bar().encode(
    x = alt.X("age", bin=alt.Bin(maxbins = 30), title = "Age (years, binned)"),
    y = alt.Y("count()", title = "Count")
)

chart_age

In [284]:
chart_glucose = alt.Chart(diabetes_concise, title="Distribution of Blood Glucose Levels").mark_bar().encode(
    x = alt.X("blood_glucose_level", bin=alt.Bin(maxbins = 20), title = "Blood Glucose Level (random testing, mg/dl)"),
              #, scale=alt.Scale(domain=[10, 60])
             #).title("Blood Glucose Level (random testing, mg/dl)"),
    y = alt.Y("count()", title = "Count")
)

chart_glucose


In [285]:
chart_gender = alt.Chart(diabetes_concise, title="Gender").mark_bar().encode(
    x = alt.X("gender"
              #, scale=alt.Scale(domain=[10, 60])
             , title = "Gender"),
    y = alt.Y("count()", title = "Count")
)

chart_gender

In [286]:
chart_diabetes = alt.Chart(diabetes_concise, title="Diagnosis").mark_bar().encode(
    x = alt.X("diabetes", bin=alt.Bin(maxbins = 2)
              , scale=alt.Scale(domain=[0, 1]), title = "Diabetes (0 = no, 1 = yes)"
             ),
    y = alt.Y("count()", title = "Count")
)

chart_diabetes

In [287]:
chart_bmi | chart_age | chart_gender | chart_glucose | chart_diabetes

# performs the data analysis

- Will first split the data into training and testing sets
- Will standardize the data in order to create the most accurate prediction fit
- Will assess the most effective K Value to use for the model fit
- Using that K value we will build a success rate model for the fit.

In [288]:
from sklearn import set_config

# Output dataframes instead of arrays
set_config(transform_output="pandas")
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline

*Changing female and male status to integers in order for the dataset to work in training and splitting the result into training and testing material at a .75 training rate*

In [289]:
unscaled_diabetes = diabetes.copy()

unscaled_diabetes['gender'] = unscaled_diabetes['gender'].replace({
   'Female' : 0,
   'Male' : 1
})

diabetes_train, diabetes_test = train_test_split(unscaled_diabetes, train_size=.75)
diabetes_train = diabetes_train.reset_index(drop=True)
diabetes_test = diabetes_test.reset_index(drop=True)

*Preliminary fit at n_neighbors = 5 in order to test the functionality of a classification model*

- We create the preprocessor that will standardize the contents of our training data.
- We set the X and Y parameters to what we want to train and predict based on.
- We set up a pipeline that will fit the preprocessor to the KNeighborsClassifier
- We then predict using the test data to determine the success of the model

In [331]:
knn = KNeighborsClassifier(n_neighbors=5)

diabetes_preprocessor = make_column_transformer(
    (StandardScaler(), ["gender", "age", "bmi", "blood_glucose_level"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)


X = diabetes_train[["gender", "age", "bmi", "blood_glucose_level"]]
y = diabetes_train["diabetes"]


diabetes_pipe = make_pipeline(diabetes_preprocessor, knn).fit(X, y)


diabetest_test_predictions = diabetes_test.assign(
    predicted = diabetes_pipe.predict(diabetes_test[["gender", "age", "bmi", "blood_glucose_level"]])
)

diabetest_test_predictions

Unnamed: 0,gender,age,bmi,blood_glucose_level,diabetes,predicted
0,0,61.0,27.32,126,0,0
1,0,43.0,27.32,200,0,0
2,1,45.0,26.70,80,0,0
3,1,67.0,27.32,126,0,0
4,1,2.0,17.36,130,0,0
...,...,...,...,...,...,...
12494,1,55.0,27.32,126,0,0
12495,1,52.0,26.60,155,0,0
12496,0,26.0,22.74,100,0,0
12497,0,42.0,27.32,159,0,0


# Accuracy testing

*Preliminary accuracy test of the model at N=5*

In [332]:
cancer_acc_1 = diabetes_pipe.score(
    diabetes_test[["gender", "age", "bmi", "blood_glucose_level"]],
    diabetes_test["diabetes"]
)
cancer_acc_1

0.9409552764221137

*Confusion Matrix for N = 5*

In [333]:
pd.crosstab(
    diabetest_test_predictions["diabetes"],
    diabetest_test_predictions["predicted"]
)

predicted,0,1
diabetes,Unnamed: 1_level_1,Unnamed: 2_level_1
0,11343,97
1,641,418


# Tuning the model

*The neighbor value of 5 may not be the best choice, below we will assess the accuracy of several models at different values of n in comparison*

*The following steps are taken in the next code cell*
- We create a parameter grid for a range of n_neighbors [0, 100] skipping by 2
- We create a standard pipeline on an empty KNeighborsClassifier object
- Using GridSearchCV we create a tuning grid based on the parameters given
- We assess the accuracy of the tuned grid by fitting it to the training data with cv_results_

In [336]:
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 2),
}

knn = KNeighborsClassifier()
diabetes_tune_pipe = make_pipeline(diabetes_preprocessor, knn)

from sklearn.model_selection import GridSearchCV

cancer_tune_grid = GridSearchCV(
    estimator=diabetes_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

accuracies_grid = pd.DataFrame(
    cancer_tune_grid.fit(
        diabetes_train[["gender", "age", "bmi", "blood_glucose_level"]],
        diabetes_train["diabetes"]
    ).cv_results_
)

accuracies_grid.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.01163,0.000586,0.052809,0.001873,1,{'kneighborsclassifier__n_neighbors': 1},0.9136,0.9152,0.909867,0.912533,0.914377,0.915444,0.915978,0.914377,0.917045,0.91331,0.914173,0.001912,50
1,0.011812,0.001177,0.061531,0.014011,3,{'kneighborsclassifier__n_neighbors': 3},0.942667,0.939467,0.9344,0.937333,0.934916,0.942118,0.933316,0.936783,0.941051,0.93785,0.93799,0.003107,49
2,0.011165,0.000539,0.056888,0.000764,5,{'kneighborsclassifier__n_neighbors': 5},0.9472,0.9448,0.940533,0.946133,0.938384,0.94852,0.942651,0.940251,0.946119,0.943985,0.943858,0.003156,48
3,0.011395,0.000383,0.058574,0.000365,7,{'kneighborsclassifier__n_neighbors': 7},0.949867,0.948267,0.9432,0.949067,0.939184,0.949587,0.943452,0.943185,0.946386,0.945052,0.945724,0.003337,47
4,0.012118,0.000919,0.062405,0.002108,9,{'kneighborsclassifier__n_neighbors': 9},0.9496,0.948267,0.944267,0.949333,0.940784,0.949853,0.944519,0.946119,0.94932,0.946652,0.946871,0.002834,32


In [295]:
accuracies_grid = (
    accuracies_grid[["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]].assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2)).rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"}).drop(columns=["std_test_score"])
)

accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.914173,0.000605
1,3,0.93799,0.000983
2,5,0.943858,0.000998
3,7,0.945724,0.001055
4,9,0.946871,0.000896
5,11,0.947485,0.000923
6,13,0.947858,0.000963
7,15,0.947911,0.000974
8,17,0.948045,0.00088
9,19,0.948072,0.000866


In [296]:
accuracies_grid.max()

n_neighbors              99
mean_test_score    0.948178
sem_test_score     0.001055
dtype: object

25 neighbors is the best number

In [325]:
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

In [326]:
accuracy_vs_k

In [330]:
accuracy_vs_k = alt.Chart(accuracies_grid.drop()).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k

TypeError: 'method' object is not subscriptable