# CASE STUDY II: HEALTHCARE – EARLY DETECTION OF DIABETES
### TEAM GLUCOSE-GUARD DATA SCIENCE CAPSTONE PROJECT

##### Background  
A mobile health clinic wants to pre-screen patients for diabetes using basic health indicators to reduce hospital crowding and focus on at-risk individuals.

##### Objectives
1) Use exploratory data analysis (EDA) to uncover the factors that most strongly influence the risk of diabetes.
2) Develop a classification model that predicts whether a person has diabetes from attributes such as BMI, glucose level, insulin level, and age.

## DATA CLEANING

### 1. Importing Libraries, Loading and Inspecting the Dataset

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Import libraries for numerics and imputation
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [3]:
# Load the dataset, then inspect its first rows and column types
df = pd.read_csv("./diabetes.csv")
print(df.head())
df.info()  # info() prints its own summary and returns None

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768

### 2. Identifying columns where zeros are invalid

In [4]:
# Columns where 0 is likely a missing value
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print(f"These features should not be zero for real patients:{zero_as_missing}")

These features should not be zero for real patients:['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
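
Before treating zeros as missing, it helps to count how many zeros each column actually contains. The following is a minimal sketch on a small made-up frame (the values are illustrative, not rows read from `diabetes.csv`); on the real data the same expression could be applied to `df[zero_as_missing]`.

```python
import pandas as pd

# Toy data for illustration only; not taken from diabetes.csv
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Count how many entries in each column are exactly zero
zero_counts = (toy == 0).sum()
print(zero_counts)
```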


### 3. Creating Missingness Indicator Flags

In [5]:
# Adding missingness flags
for col in zero_as_missing:
    df[col + "_missing"] = df[col] == 0

In [6]:
# To verify that the flags are being correctly created
print(df[["Insulin", "Insulin_missing"]].head(10))

   Insulin  Insulin_missing
0        0             True
1        0             True
2        0             True
3       94            False
4      168            False
5        0             True
6       88            False
7        0             True
8      543            False
9        0             True
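
The order of steps matters here: the flags are built from the raw zeros, so they have to be created before the zeros are replaced with NaN in the next step. The sketch below, on a toy column that is assumed for illustration, shows that a flag computed with `== 0` before replacement carries the same information as `isna()` after replacement.

```python
import numpy as np
import pandas as pd

# Toy column for illustration only
insulin = pd.Series([0, 94, 0, 168], name="Insulin")

# Flag computed on the raw values, while zeros are still present
flag_before = insulin == 0

# After replacing zeros with NaN, isna() recovers the same flag
insulin_nan = insulin.replace(0, np.nan)
flag_after = insulin_nan.isna()

print(flag_before.tolist() == flag_after.tolist())
```

Once imputation has filled the NaN values, neither expression can recover the flag, which is why it is stored as its own column.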


### 4. Replacing Zeros with NaN

In [7]:
# Replace 0s with NaN
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

In [9]:
# Confirm how many missing values each column now has
print("Missing values in the specified columns:")
print(df[zero_as_missing].isnull().sum())

Missing values in the specified columns:
Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64
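
Raw counts are easier to judge as a share of the 768 rows. The short sketch below reuses the counts printed above and converts them to percentages; nearly half of the Insulin values (374 of 768) are missing, which is why a simple mean fill would distort that column.

```python
# Missing counts copied from the output above; 768 rows per df.info()
n_rows = 768
missing_counts = {
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
}

# Share of rows that are missing in each column, as a percentage
missing_pct = {col: 100 * n / n_rows for col, n in missing_counts.items()}
for col, pct in missing_pct.items():
    print(f"{col:<14} {pct:5.1f}%")
```

On the real DataFrame, `df[zero_as_missing].isna().mean() * 100` would give the same percentages directly.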


### 5. Applying Iterative Imputer (Multivariate Imputation by Chained Equations [MICE])

#### What is Iterative Imputer?
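`IterativeImputer` fills in missing values by modeling each feature that has missing values as a function of the other features. It is inspired by Multivariate Imputation by Chained Equations (MICE), although scikit-learn's implementation returns a single imputed dataset rather than several.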

So instead of guessing missing values with a mean or median, it:

1. Picks one feature with missing values.

2. Predicts those missing values using a regression model trained on the other columns.

3. Repeats this process for each feature with missing values, in a cycle, until the values stabilize or a maximum number of rounds is reached.

Because each imputed value is predicted from the other variables, this method tends to preserve relationships between variables, which is critical in medical and clinical datasets.
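
To make the cycle described above concrete, here is a simplified sketch of round-robin regression imputation written with NumPy only. It is not scikit-learn's implementation: it assumes ordinary least squares instead of the library's default estimator (`BayesianRidge`), runs a fixed number of rounds instead of checking for convergence, and the function name `chained_impute` is hypothetical.

```python
import numpy as np

def chained_impute(X, n_rounds=10):
    """Round-robin regression imputation on a 2-D float array with NaNs."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # Start from column means as the initial guess
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            # Design matrix: intercept plus all other columns
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            # Fit on the rows where column j was originally observed
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            # Overwrite the current guesses with the new predictions
            X[rows, j] = A[rows] @ coef
    return X

# Toy data: the second column is 2 * the first column, with one gap
toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
filled = chained_impute(toy)
print(filled)
```

In the toy example the gap is filled from the linear relationship between the two columns rather than from the column mean, which is the behavior the steps above describe.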

In [10]:
# Separating target from features
X = df.drop(columns="Outcome")
y = df["Outcome"]

In [None]:
# Applying Iterative Imputer
imputer = IterativeImputer(random_state=0)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

In [12]:
# Checking for any remaining NaNs
print("Any missing after imputation?", X_imputed.isna().sum().sum())

Any missing after imputation? 0


In [13]:
# Ensuring all NaNs are gone and imputation worked
print(X_imputed.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Glucose_missing             0
BloodPressure_missing       0
SkinThickness_missing       0
Insulin_missing             0
BMI_missing                 0
dtype: int64


### 6. Comparison of Values Before and After

In [15]:
# Reprinting columns where 0 is likely a missing value
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print(f"These features should not be zero for real patients:{zero_as_missing}")

These features should not be zero for real patients:['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']


#### Glucose Comparison

In [16]:
# Comparing values before and after for Glucose

# Get the describe() outputs
before_desc_glucose = df["Glucose"].describe()
after_desc_glucose = X_imputed["Glucose"].describe()

# Combine into a DataFrame
glucose_comparison = pd.DataFrame({
    "Metric": before_desc_glucose.index,
    "Before Imputation": before_desc_glucose.values,
    "After Imputation": after_desc_glucose.values
})

# Display as a table
print(glucose_comparison)

  Metric  Before Imputation  After Imputation
0  count         763.000000        768.000000
1   mean         121.686763        121.641747
2    std          30.535641         30.466564
3    min          44.000000         44.000000
4    25%          99.000000         99.000000
5    50%         117.000000        117.000000
6    75%         141.000000        140.250000
7    max         199.000000        199.000000


#### Blood Pressure Comparison

In [None]:
# Comparing values before and after for Blood Pressure

# Get the describe() outputs
before_desc_BloodPressure = df["BloodPressure"].describe()
after_desc_BloodPressure = X_imputed["BloodPressure"].describe()

# Combine into a DataFrame
BloodPressure_comparison = pd.DataFrame({
    "Metric": before_desc_BloodPressure.index,
    "Before Imputation": before_desc_BloodPressure.values,
    "After Imputation": after_desc_BloodPressure.values
})

# Display as a table
print(BloodPressure_comparison)

  Metric  Before Imputation  After Imputation
0  count         733.000000        768.000000
1   mean          72.405184         72.361292
2    std          12.382158         12.147518
3    min          24.000000         24.000000
4    25%          64.000000         64.000000
5    50%          72.000000         72.000000
6    75%          80.000000         80.000000
7    max         122.000000        122.000000


#### Skin Thickness Comparison

In [None]:
# Comparing values before and after for Skin Thickness

# Get the describe() outputs
before_desc_SkinThickness = df["SkinThickness"].describe()
after_desc_SkinThickness = X_imputed["SkinThickness"].describe()

# Combine into a DataFrame
SkinThickness_comparison = pd.DataFrame({
    "Metric": before_desc_SkinThickness.index,
    "Before Imputation": before_desc_SkinThickness.values,
    "After Imputation": after_desc_SkinThickness.values
})

# Display as a table
print(SkinThickness_comparison)

  Metric  Before Imputation  After Imputation
0  count         541.000000        768.000000
1   mean          29.153420         28.926373
2    std          10.476982          9.521595
3    min           7.000000          7.000000
4    25%          22.000000         22.217409
5    50%          29.000000         28.442091
6    75%          36.000000         35.000000
7    max          99.000000         99.000000


#### Insulin Comparison

In [None]:
# Comparing values before and after for Insulin

# Get the describe() outputs
before_desc_insulin = df["Insulin"].describe()
after_desc_insulin = X_imputed["Insulin"].describe()

# Combine into a DataFrame
insulin_comparison = pd.DataFrame({
    "Metric": before_desc_insulin.index,
    "Before Imputation": before_desc_insulin.values,
    "After Imputation": after_desc_insulin.values
})

# Display as a table
print(insulin_comparison)

  Metric  Before Imputation  After Imputation
0  count         394.000000        768.000000
1   mean         155.548223        152.640510
2    std         118.775855         97.347662
3    min          14.000000        -19.487740
4    25%          76.250000         89.986466
5    50%         125.000000        130.196342
6    75%         190.000000        190.000000
7    max         846.000000        846.000000


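The mean, 75th percentile, and max stay close to their pre-imputation values, but the standard deviation falls (118.78 to 97.35), the lower quartile rises (76.25 to 89.99), and the **min** becomes negative (-19.49), which is not physiologically valid for Insulin.

To fix this, we clip the imputed values at 14, the smallest Insulin value observed before imputation.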

In [None]:
# Clip Insulin values below 14 (including negatives) up to 14, the observed minimum
X_imputed['Insulin'] = X_imputed['Insulin'].clip(lower=14)
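
Note that `clip(lower=14)` raises every value below 14, not only the negative ones. The sketch below, on made-up imputed values (assumed for illustration, not taken from the real dataset), counts both groups so the effect of the clip can be checked before it is applied.

```python
import pandas as pd

# Toy imputed values for illustration only; not from the real dataset
imputed = pd.Series([-19.5, 5.0, 14.0, 90.0, 130.2])

floor = 14  # smallest observed Insulin value before imputation
n_negative = int((imputed < 0).sum())
n_below_floor = int((imputed < floor).sum())

# Everything below the floor is raised to the floor
clipped = imputed.clip(lower=floor)
print(n_negative, n_below_floor, clipped.min())
```

An alternative that this notebook does not use is the `min_value` parameter of `IterativeImputer`, which bounds the imputed values during imputation itself.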

##### Updated Insulin Comparison

In [23]:
# Get updated describe() output
after_clipping_desc_insulin = X_imputed["Insulin"].describe()

# Re-create the comparison table
insulin_comparison_clipped = pd.DataFrame({
    "Metric": before_desc_insulin.index,
    "Before Imputation": before_desc_insulin.values,
    "After Imputation & Clipping": after_clipping_desc_insulin.values
})

# Display the new comparison table
print(insulin_comparison_clipped)

  Metric  Before Imputation  After Imputation & Clipping
0  count         394.000000                   768.000000
1   mean         155.548223                   152.684350
2    std         118.775855                    97.277600
3    min          14.000000                    14.000000
4    25%          76.250000                    89.986466
5    50%         125.000000                   130.196342
6    75%         190.000000                   190.000000
7    max         846.000000                   846.000000


#### Body Mass Index (BMI) Comparison

In [21]:
# Comparing values before and after for BMI

# Get the describe() outputs
before_desc_BMI = df["BMI"].describe()
after_desc_BMI = X_imputed["BMI"].describe()

# Combine into a DataFrame
BMI_comparison = pd.DataFrame({
    "Metric": before_desc_BMI.index,
    "Before Imputation": before_desc_BMI.values,
    "After Imputation": after_desc_BMI.values
})

# Display as a table
print(BMI_comparison)

  Metric  Before Imputation  After Imputation
0  count         757.000000        768.000000
1   mean          32.457464         32.436839
2    std           6.924988          6.879242
3    min          18.200000         18.200000
4    25%          27.500000         27.500000
5    50%          32.300000         32.000000
6    75%          36.600000         36.600000
7    max          67.100000         67.100000
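
The comparison cells in this section all follow the same pattern, so they could be collapsed into one helper. The sketch below is one possible version; the function name `describe_comparison` and its `after_label` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def describe_comparison(before, after, after_label="After Imputation"):
    """Place the describe() summaries of two Series side by side."""
    before_desc = before.describe()
    after_desc = after.describe()
    return pd.DataFrame({
        "Metric": before_desc.index,
        "Before Imputation": before_desc.values,
        after_label: after_desc.values,
    })

# Toy example: one missing value filled with 3.0
before = pd.Series([1.0, 2.0, np.nan, 4.0])
after = pd.Series([1.0, 2.0, 3.0, 4.0])
table = describe_comparison(before, after)
print(table)
```

With the notebook's data, a call such as `describe_comparison(df["BMI"], X_imputed["BMI"])` would be expected to reproduce the table above.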


### 7. Combining Imputed Data with Target

In [24]:
# Combine Imputed Data with Target
df_imputed = X_imputed.copy()
df_imputed['Outcome'] = y.values  # assign by position so index labels cannot misalign rows

In [25]:
# Confirming combination
print("Target column added. Final DataFrame shape:", df_imputed.shape)
print("Column names:", df_imputed.columns.tolist())

Target column added. Final DataFrame shape: (768, 14)
Column names: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Glucose_missing', 'BloodPressure_missing', 'SkinThickness_missing', 'Insulin_missing', 'BMI_missing', 'Outcome']
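
In this notebook both `X_imputed` and `y` carry the default 0 to 767 index, so the target would line up either way. The use of `.values` guards against the case where the indexes differ, for example after rows have been filtered. A small sketch with made-up data shows the difference between assigning by index label and assigning by position.

```python
import pandas as pd

# Features with a fresh 0..n-1 index, as produced by pd.DataFrame(array)
X_new = pd.DataFrame({"Glucose": [148.0, 85.0, 183.0]})

# Target whose index no longer starts at 0 (e.g. after filtering rows)
y_shifted = pd.Series([1, 0, 1], index=[10, 11, 12], name="Outcome")

aligned_by_index = X_new.copy()
aligned_by_index["Outcome"] = y_shifted  # matches on index labels

aligned_by_position = X_new.copy()
aligned_by_position["Outcome"] = y_shifted.values  # matches on position

print(aligned_by_index["Outcome"].isna().sum())
print(aligned_by_position["Outcome"].tolist())
```

Assigning the Series directly leaves every target value missing because no index labels match, while `.values` keeps the row order intact.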


In [26]:
# Ensuring final dataframe is assembled properly
print("Preview of final dataset:")
print(df_imputed.head())

Preview of final dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin   BMI  \
0          6.0    148.0           72.0      35.000000  218.923435  33.6   
1          1.0     85.0           66.0      29.000000   70.306082  26.6   
2          8.0    183.0           64.0      21.640837  268.531745  23.3   
3          1.0     89.0           66.0      23.000000   94.000000  28.1   
4          0.0    137.0           40.0      35.000000  168.000000  43.1   

   DiabetesPedigreeFunction   Age  Glucose_missing  BloodPressure_missing  \
0                     0.627  50.0              0.0                    0.0   
1                     0.351  31.0              0.0                    0.0   
2                     0.672  32.0              0.0                    0.0   
3                     0.167  21.0              0.0                    0.0   
4                     2.288  33.0              0.0                    0.0   

   SkinThickness_missing  Insulin_missing  BMI_missing  Outc

### 8. Final Imputed Dataset

In [27]:
# Save the imputed dataset
df_imputed.to_csv("diabetes_imputed.csv", index=False)
print("Dataset saved as 'diabetes_imputed.csv'")

Dataset saved as 'diabetes_imputed.csv'
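
The preview in section 7 shows that the imputer returned the `_missing` flags as floats (`0.0`). If integer flags are preferred in the saved file, they can be cast back before writing. The sketch below assumes a toy frame and writes to an in-memory buffer rather than to disk, so it does not touch `diabetes_imputed.csv`.

```python
import io
import pandas as pd

# Toy frame for illustration; flags are floats, as in the preview
toy = pd.DataFrame({
    "Insulin": [218.9, 94.0],
    "Insulin_missing": [1.0, 0.0],
})

# Cast every *_missing column to integers (0/1)
flag_cols = [c for c in toy.columns if c.endswith("_missing")]
toy = toy.astype({c: int for c in flag_cols})

# Write to an in-memory buffer, then read it back to check the result
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
```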
