# Diabetes Prediction - Final Project CSCI 4050 (ID: Group 16)

## Learning Problem: 
The purpose of this project is to build a machine learning model that predicts whether a person has diabetes and reports its confidence in that prediction as a percentage. The inputs are demographic information (gender, age, number of pregnancies), physical measurements (BMI, blood pressure, skin thickness), lab measurements (blood glucose, HbA1c, insulin), and health history (hypertension, heart disease, smoking status). The model was trained on two datasets from Kaggle and uses a sequence of linear and ReLU layers to perform binary classification. It is implemented as a PyTorch Lightning module and trained with the Adam optimizer.

In [1]:
# Imports
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import Adam

import lightning as L
from lightning.pytorch.callbacks import EarlyStopping
from torchmetrics import Accuracy

from torchinfo import summary

---
## Data Loading and Cleaning

### 1. Loading Raw Data

In [2]:
first_dataset_path  = "data/raw/diabetes.csv"
second_dataset_path = "data/raw/diabetes_prediction_dataset.csv"

first_dataset  = pd.read_csv(first_dataset_path)
second_dataset = pd.read_csv(second_dataset_path)

print("First dataset:", first_dataset.shape)
print("Second dataset:", second_dataset.shape)
first_dataset.head(), second_dataset.head()

First dataset: (768, 9)
Second dataset: (100000, 9)


(   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
 0            6      148             72             35        0  33.6   
 1            1       85             66             29        0  26.6   
 2            8      183             64              0        0  23.3   
 3            1       89             66             23       94  28.1   
 4            0      137             40             35      168  43.1   
 
    DiabetesPedigreeFunction  Age  Outcome  
 0                     0.627   50        1  
 1                     0.351   31        0  
 2                     0.672   32        1  
 3                     0.167   21        0  
 4                     2.288   33        1  ,
    gender   age  hypertension  heart_disease smoking_history    bmi  \
 0  Female  80.0             0              1           never  25.19   
 1  Female  54.0             0              0         No Info  27.32   
 2    Male  28.0             0              0           never  27.32   
 

### 2. Clean Data and Align Columns

In [3]:
# Fill missing columns in the first dataset
for col in ["gender", "hypertension", "heart_disease", "smoking_history", "HbA1c_level"]:
    if col != "smoking_history":
        first_dataset[col] = 0
    else:
        first_dataset[col] = "never"
first_dataset["blood_glucose_level"] = first_dataset["Glucose"]

# Fill missing columns in the second dataset
for col in ["Pregnancies", "BloodPressure", "SkinThickness", "Insulin",
            "DiabetesPedigreeFunction"]:
    second_dataset[col] = 0
# Glucose mirrors blood_glucose_level, so copy it instead of zero-filling it first
second_dataset["Glucose"] = second_dataset["blood_glucose_level"]

# Concatenate datasets row-wise
combined_df = pd.concat([first_dataset, second_dataset], ignore_index=True)
combined_df.shape

(100768, 18)
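
The two files spell some shared measurements differently (`Age`/`BMI` in the first, `age`/`bmi` in the second), so a row-wise concat keeps both spellings as separate columns, which is part of why the combined frame has 18 columns. As a minimal sketch on made-up values, renaming to one schema before concatenating keeps each measurement in a single column. The `rename_map` below is an assumption about which columns correspond and is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two raw files (values are made up for illustration).
first = pd.DataFrame({"Age": [50, 31], "BMI": [33.6, 26.6], "Outcome": [1, 0]})
second = pd.DataFrame({"age": [80.0, 54.0], "bmi": [25.19, 27.32], "diabetes": [0, 1]})

# Without renaming, concat keeps both spellings and fills the gaps with NaN.
naive = pd.concat([first, second], ignore_index=True)
print(naive.columns.tolist())    # ['Age', 'BMI', 'Outcome', 'age', 'bmi', 'diabetes']

# Assumed mapping from the second file's names to the first file's names.
rename_map = {"age": "Age", "bmi": "BMI", "diabetes": "Outcome"}
aligned = pd.concat([first, second.rename(columns=rename_map)], ignore_index=True)
print(aligned.columns.tolist())  # ['Age', 'BMI', 'Outcome']
```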

### 3. Convert Zero Values to Median Values
*Zeros in these columns are treated as missing and replaced with the column median, so no rows have to be dropped.*

In [4]:
zero_as_missing = [
    "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "HbA1c_level", "blood_glucose_level"
]

for col in zero_as_missing:
    combined_df[col] = combined_df[col].replace(0, np.nan)
    combined_df[col] = combined_df[col].fillna(combined_df[col].median())
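
A small worked example of this two-step imputation, on made-up insulin readings, may help: the zeros are first turned into NaN so that they do not pull the median down, and only then filled. The column values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical insulin readings where 0 means "not recorded".
toy = pd.DataFrame({"Insulin": [0, 94, 168, 0, 120]})

# Step 1: mark zeros as missing so they are excluded from the median.
toy["Insulin"] = toy["Insulin"].replace(0, np.nan)

# Step 2: median() skips NaN, so it is taken over 94, 120, 168 only.
median = toy["Insulin"].median()
toy["Insulin"] = toy["Insulin"].fillna(median)

print(median)                   # 120.0
print(toy["Insulin"].tolist())  # [120.0, 94.0, 168.0, 120.0, 120.0]
```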

### 4. Encode Categorical Data as Numeric Data

In [5]:
# Gender
combined_df["gender"] = combined_df["gender"].map({"Male": 1, "Female": 0})
combined_df["gender"] = combined_df["gender"].fillna(0)

# Smoking history
combined_df["smoking_history"] = combined_df["smoking_history"].replace("No Info", "never")
combined_df = pd.get_dummies(combined_df, columns=["smoking_history"], drop_first=False)

# Fill any remaining NaNs with 0 (this applies to every column, labels included)
combined_df = combined_df.fillna(0)

combined_df.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | gender | ... | HbA1c_level | blood_glucose_level | age | bmi | diabetes | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 125.0 | 33.6 | 0.627 | 50.0 | 1.0 | 0.0 | ... | 5.8 | 148.0 | 0.0 | 0.0 | 0.0 | False | False | False | True | False |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 125.0 | 26.6 | 0.351 | 31.0 | 0.0 | 0.0 | ... | 5.8 | 85.0 | 0.0 | 0.0 | 0.0 | False | False | False | True | False |
| 2 | 8 | 183.0 | 64.0 | 29.0 | 125.0 | 23.3 | 0.672 | 32.0 | 1.0 | 0.0 | ... | 5.8 | 183.0 | 0.0 | 0.0 | 0.0 | False | False | False | True | False |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 | 0.0 | 0.0 | ... | 5.8 | 89.0 | 0.0 | 0.0 | 0.0 | False | False | False | True | False |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 | 1.0 | 0.0 | ... | 5.8 | 137.0 | 0.0 | 0.0 | 0.0 | False | False | False | True | False |


### 5. Final Feature List

In [6]:
numeric_cols = [
    "Pregnancies", "blood_glucose_level", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "HbA1c_level"
]

categorical_cols = ["gender", "hypertension", "heart_disease"]

smoking_cols = [
    "smoking_history_never",
    "smoking_history_former",
    "smoking_history_current"
]

feature_cols = numeric_cols + categorical_cols + smoking_cols
print("Total features:", len(feature_cols))

Total features: 15


### 6. Extract X and y Values (Features and Labels)

In [7]:
X = combined_df[feature_cols].astype(np.float32).values
y = combined_df["Outcome"].fillna(combined_df["diabetes"]).astype(int).values

X[:5], y[:5]

(array([[6.000e+00, 1.480e+02, 7.200e+01, 3.500e+01, 1.250e+02, 3.360e+01,
         6.270e-01, 5.000e+01, 5.800e+00, 0.000e+00, 0.000e+00, 0.000e+00,
         1.000e+00, 0.000e+00, 0.000e+00],
        [1.000e+00, 8.500e+01, 6.600e+01, 2.900e+01, 1.250e+02, 2.660e+01,
         3.510e-01, 3.100e+01, 5.800e+00, 0.000e+00, 0.000e+00, 0.000e+00,
         1.000e+00, 0.000e+00, 0.000e+00],
        [8.000e+00, 1.830e+02, 6.400e+01, 2.900e+01, 1.250e+02, 2.330e+01,
         6.720e-01, 3.200e+01, 5.800e+00, 0.000e+00, 0.000e+00, 0.000e+00,
         1.000e+00, 0.000e+00, 0.000e+00],
        [1.000e+00, 8.900e+01, 6.600e+01, 2.300e+01, 9.400e+01, 2.810e+01,
         1.670e-01, 2.100e+01, 5.800e+00, 0.000e+00, 0.000e+00, 0.000e+00,
         1.000e+00, 0.000e+00, 0.000e+00],
        [0.000e+00, 1.370e+02, 4.000e+01, 3.500e+01, 1.680e+02, 4.310e+01,
         2.288e+00, 3.300e+01, 5.800e+00, 0.000e+00, 0.000e+00, 0.000e+00,
         1.000e+00, 0.000e+00, 0.000e+00]], dtype=float32),
 array([1, 0, 1, 0, 1]))
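
`Outcome.fillna(...)` can only pull labels from `diabetes` where `Outcome` is still NaN. Because cell 5 calls `combined_df.fillna(0)` on the whole frame before this step, it is worth checking whether any NaNs are left in `Outcome` at this point. The toy frame below (made-up values, hypothetical variable names) illustrates how the order of the two operations changes the merged labels.

```python
import numpy as np
import pandas as pd

# Rows 0-1 mimic the first dataset (label in Outcome),
# rows 2-3 mimic the second dataset (label in diabetes).
toy = pd.DataFrame({
    "Outcome":  [1.0, 0.0, np.nan, np.nan],
    "diabetes": [np.nan, np.nan, 1.0, 0.0],
})

# Merging the label columns before any blanket fill keeps both sources.
merged_first = toy["Outcome"].fillna(toy["diabetes"]).astype(int)
print(merged_first.tolist())  # [1, 0, 1, 0]

# Filling every NaN with 0 first leaves nothing for the later fillna to replace.
filled = toy.fillna(0)
merged_after = filled["Outcome"].fillna(filled["diabetes"]).astype(int)
print(merged_after.tolist())  # [1, 0, 0, 0]
```

A quick check such as `combined_df["Outcome"].isna().sum()` just before extracting `y` would show which of the two cases applies.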

### 7. Standardize Features to Zero Mean and Unit Standard Deviation

In [8]:
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1

X_scaled = (X - mean) / std
X_scaled = np.clip(X_scaled, -10, 10)
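
Each column is transformed as $z = (x - \mu) / \sigma$, with $\mu$ and $\sigma$ computed per column. The sketch below applies the same steps to a made-up matrix whose second column is constant, which shows why the `std == 0` guard is needed.

```python
import numpy as np

# Made-up feature matrix: column 0 varies, column 1 is constant.
X_toy = np.array([[1.0, 5.0],
                  [2.0, 5.0],
                  [3.0, 5.0]], dtype=np.float32)

mean = X_toy.mean(axis=0)   # per-column mean: [2., 5.]
std = X_toy.std(axis=0)     # per-column population std: [~0.8165, 0.]
std[std == 0] = 1           # constant column: avoid dividing by zero

X_toy_scaled = np.clip((X_toy - mean) / std, -10, 10)
print(X_toy_scaled[:, 0])   # approximately [-1.2247, 0., 1.2247]
print(X_toy_scaled[:, 1])   # [0. 0. 0.]
```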

### 8. Split Data Into Training and Validation Data

In [9]:
np.random.seed(42)
indices = np.arange(len(X_scaled))
np.random.shuffle(indices)

val_ratio = 0.2
split_idx = int(len(X_scaled) * (1 - val_ratio))

train_idx = indices[:split_idx]
val_idx   = indices[split_idx:]

X_train, y_train = X_scaled[train_idx], y[train_idx]
X_val, y_val     = X_scaled[val_idx], y[val_idx]
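
The same 80/20 shuffle split can be wrapped in a small helper. This is an alternative formulation and not the notebook's exact code: `shuffled_split` is a hypothetical name, and it uses `np.random.default_rng`, so it will not reproduce the permutation produced by `np.random.seed(42)` above.

```python
import numpy as np

def shuffled_split(n_rows, val_ratio=0.2, seed=42):
    """Return disjoint (train_idx, val_idx) arrays covering range(n_rows)."""
    rng = np.random.default_rng(seed)   # local generator, no global seed state
    indices = rng.permutation(n_rows)
    split_idx = int(n_rows * (1 - val_ratio))
    return indices[:split_idx], indices[split_idx:]

train_idx, val_idx = shuffled_split(100768)
print(len(train_idx), len(val_idx))  # 80614 20154
```

With a batch size of 32, 80,614 training rows give 2,520 batches per epoch, which matches the step count in the training progress bar.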

### 9. Convert Data to PyTorch Tensors

In [10]:
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)

X_val = torch.tensor(X_val, dtype=torch.float32)
y_val = torch.tensor(y_val, dtype=torch.long)

train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
val_loader   = DataLoader(TensorDataset(X_val, y_val), batch_size=32, shuffle=False)

---
## Import Diabetes Model

In [11]:
from src.model.model import DiabetesModel
model = DiabetesModel(input_size=X_val.shape[1])
summary(model, input_size=(1, X_val.shape[1]))

Layer (type:depth-idx)                   Output Shape              Param #
DiabetesModel                            [1, 2]                    --
├─Sequential: 1-1                        [1, 2]                    --
│    └─Linear: 2-1                       [1, 128]                  2,048
│    └─ReLU: 2-2                         [1, 128]                  --
│    └─Linear: 2-3                       [1, 64]                   8,256
│    └─ReLU: 2-4                         [1, 64]                   --
│    └─Dropout: 2-5                      [1, 64]                   --
│    └─Linear: 2-6                       [1, 32]                   2,080
│    └─ReLU: 2-7                         [1, 32]                   --
│    └─Linear: 2-8                       [1, 2]                    66
Total params: 12,450
Trainable params: 12,450
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.01
Input size (MB): 0.00
Forward/backward pass size (MB):
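
The source of `src/model/model.py` is not shown here, but the summary is enough to sketch a layer stack with the same shapes and parameter count. The class below is a hypothetical plain-PyTorch stand-in, not the project's `DiabetesModel`: it omits the Lightning training logic and metrics, and the dropout rate is a guess because the summary does not report it.

```python
import torch
from torch import nn

class DiabetesNetSketch(nn.Module):
    """Hypothetical stand-in matching the layer shapes in the summary."""

    def __init__(self, input_size=15, dropout=0.3):  # dropout rate is assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 2),  # two logits, one per class
        )

    def forward(self, x):
        return self.net(x)

sketch = DiabetesNetSketch()
n_params = sum(p.numel() for p in sketch.parameters())
print(n_params)                          # 12450
print(sketch(torch.zeros(1, 15)).shape)  # torch.Size([1, 2])
```

The per-layer counts follow from weights plus biases, for example 15 × 128 + 128 = 2,048 for the first linear layer.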

---
## Training the Model

In [12]:
# Enable early stopping if val_loss doesn't improve in 5 epochs
early_stop = EarlyStopping(
    monitor="val_loss",
    patience=5,
    mode="min"
)

trainer = L.Trainer(
    max_epochs=50,
    accelerator="auto",
    devices=1,
    log_every_n_steps=10,
    callbacks=[early_stop]
)

trainer.fit(model, train_loader, val_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name      | Type               | Params | Mode  | FLOPs
-----------------------------------------------------------------
0 | net       | Sequential         | 12.5 K | train | 0    
1 | train_acc | MulticlassAccuracy | 0      | train | 0    
2 | val_acc   | MulticlassAccuracy | 0      | train | 0    
-----------------------------------------------------------------
12.5 K    Trainable params
0         Non-trainable params
12.5 K    Total params
0.050     Total estimated model params size (MB)
11        Modules in train mode
0         Modules in eval mode
0         Total Flops


Epoch 9: 100%|██████████| 2520/2520 [00:22<00:00, 114.27it/s, v_num=7, train_loss=3.48e-6, train_acc=1.000, val_loss=0.00459, val_acc=0.997]
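
The last progress line shows epoch 9 out of a possible 50, which is consistent with early stopping ending the run. As a rough illustration of the patience rule configured above (`patience=5`, `mode="min"`), the pure-Python class below tracks the best validation loss and counts epochs without improvement. It is a simplified sketch with a hypothetical name and made-up losses, and it leaves out options the real `EarlyStopping` callback supports.

```python
class PatienceTracker:
    """Simplified illustration of patience-based stopping (mode="min")."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:      # improvement: remember it, reset counter
            self.best = val_loss
            self.bad_epochs = 0
        else:                         # no improvement this epoch
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: best value at epoch 1, then no improvement.
losses = [0.020, 0.010, 0.012, 0.011, 0.015, 0.013, 0.014, 0.009]
tracker = PatienceTracker(patience=5)
stopped_at = next(
    (epoch for epoch, loss in enumerate(losses) if tracker.should_stop(loss)),
    None,
)
print(stopped_at)  # 6
```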


### Final Evaluation

In [13]:
model.eval()
with torch.no_grad():
    logits = model(X_val)
    preds = torch.argmax(logits, dim=1)
    accuracy = (preds == y_val).float().mean().item()

print("Final Validation Accuracy:", accuracy)

Final Validation Accuracy: 0.9971717596054077
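
The project goal also mentions reporting a confidence percentage alongside the prediction. One common way to get that from two-class logits is a softmax, sketched below on made-up logits rather than real model output. Softmax probabilities are only a rough proxy for confidence unless the model has been calibrated.

```python
import torch
import torch.nn.functional as F

# Made-up logits for three patients; columns are [class 0, class 1].
logits = torch.tensor([[2.0, 0.0],
                       [0.0, 1.0],
                       [-1.0, 3.0]])

probs = F.softmax(logits, dim=1)        # each row sums to 1
confidence, preds = probs.max(dim=1)    # top probability and its class index

for p, c in zip(preds.tolist(), confidence.tolist()):
    print(f"predicted class {p} with {c * 100:.1f}% confidence")
```

In the notebook, the same two lines could be applied to `model(X_val)` inside the `torch.no_grad()` block.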
