Final Project

Problem: Driven by the Western diet, lack of exercise, and high stress levels, type 2 diabetes is highly prevalent in the population. A machine learning classifier can give a positive/negative prediction of whether an individual has, or is at risk of developing, diabetes based on key health metrics. Because type 2 diabetes is largely preventable, such a prediction can help individuals change their behavior early, mitigating its effects and improving longevity and quality of life.

Reference: 

Mustafa, T. (2021). Diabetes prediction dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
import joblib


df = pd.read_csv(r"C:\Users\Nanoo\OneDrive\Desktop\ANA-680\diabetes_prediction_dataset.csv")


print(df.head())

   gender   age  hypertension  heart_disease smoking_history    bmi  \
0  Female  80.0             0              1           never  25.19   
1  Female  54.0             0              0         No Info  27.32   
2    Male  28.0             0              0           never  27.32   
3  Female  36.0             0              0         current  23.45   
4    Male  76.0             1              1         current  20.14   

   HbA1c_level  blood_glucose_level  diabetes  
0          6.6                  140         0  
1          6.6                   80         0  
2          5.7                  158         0  
3          5.0                  155         0  
4          4.8                  155         0  


In [6]:
# Basic Info

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [7]:
# Summary Statistics


df.describe()


             age  hypertension  heart_disease        bmi  HbA1c_level  blood_glucose_level  diabetes
count   100000.0      100000.0       100000.0   100000.0     100000.0             100000.0  100000.0
mean   41.885856       0.07485        0.03942  27.320767     5.527507            138.05806     0.085
std     22.51684       0.26315       0.194593   6.636783     1.070672            40.708136  0.278883
min         0.08           0.0            0.0      10.01          3.5                 80.0       0.0
25%         24.0           0.0            0.0      23.63          4.8                100.0       0.0
50%         43.0           0.0            0.0      27.32          5.8                140.0       0.0
75%         60.0           0.0            0.0      29.58          6.2                159.0       0.0
max         80.0           1.0            1.0      95.69          9.0                300.0       1.0


In [8]:
# Missing Value Check


df.isnull().sum()


gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

In [9]:
# Preview Categorical Features

print(df['gender'].value_counts())
print(df['smoking_history'].value_counts())

gender
Female    58552
Male      41430
Other        18
Name: count, dtype: int64
smoking_history
No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: count, dtype: int64


In [10]:
# Clean 'gender' 


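# Normalize whitespace and capitalization
df['gender'] = df['gender'].str.strip().str.title()

# Keep only Male/Female rows (drops the 18 'Other' records);
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[df['gender'].isin(['Male','Female'])].copy()

# Map to numeric
df['gender'] = df['gender'].map({'Male':1, 'Female':0})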

print("After cleaning shape:", df.shape)
print("Gender values:", df['gender'].unique())



After cleaning shape: (99982, 9)
Gender values: [0 1]


In [11]:
# Features and Target 

X = df.drop('diabetes', axis=1)
y = df['diabetes']

In [12]:
print("Dataset shape:", df.shape)
print("Diabetes value counts:", df['diabetes'].value_counts())
print("Gender unique values:", df['gender'].unique())

Dataset shape: (99982, 9)
Diabetes value counts: diabetes
0    91482
1     8500
Name: count, dtype: int64
Gender unique values: [0 1]
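
The class counts above show a strong imbalance, which affects how the later accuracy figures should be read. As a rough sketch (not part of the original notebook), the snippet below rebuilds a label vector from the printed counts and scores a majority-class baseline. The names `X_dummy` and `baseline` are illustrative, and the feature matrix is a placeholder because `DummyClassifier` ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Class counts copied from the value_counts() output above
n_negative, n_positive = 91482, 8500
y = np.array([0] * n_negative + [1] * n_positive)

# Placeholder features: DummyClassifier does not look at them
X_dummy = np.zeros((len(y), 1))

# Baseline that always predicts the most frequent class (0 = no diabetes)
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y)
y_pred = baseline.predict(X_dummy)

positive_rate = n_positive / len(y)
baseline_accuracy = accuracy_score(y, y_pred)
baseline_f1 = f1_score(y, y_pred, zero_division=0)

print(f"Positive rate:     {positive_rate:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline F1:       {baseline_f1:.4f}")
```

Any model should be judged against this baseline: accuracy alone can look high while the positive class is never detected, which is why F1 and ROC-AUC are also tracked.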


In [13]:
# Split for Training/Testing (stratify=y keeps the ~8.5% positive rate in both sets)
# train_test_split was already imported in the first cell

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [14]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Unique smoking_history values:", df['smoking_history'].unique())


Train shape: (79985, 8)
Test shape: (19997, 8)
Unique smoking_history values: ['never' 'No Info' 'current' 'former' 'ever' 'not current']
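
To illustrate what `stratify=y` does in the split above, here is a minimal sketch on a synthetic label vector built from the class counts printed earlier. The arrays `X_demo` and `y_demo` are stand-ins, not the real data; only the split parameters match the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned dataset
y_demo = np.array([0] * 91482 + [1] * 8500)
# Stand-in feature matrix (one column of row indices)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same split settings as the notebook cell
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()

print("Train size:", len(y_tr), "positive rate:", round(train_rate, 4))
print("Test size: ", len(y_te), "positive rate:", round(test_rate, 4))
```

With stratification, the positive rate in the training and test sets should be nearly identical, so test metrics are not skewed by an unlucky split.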


In [15]:
# Preprocessing

numeric_features = ['age','bmi','HbA1c_level','blood_glucose_level']
categorical_features = ['smoking_history']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # keep hypertension, heart_disease, gender
)
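
# Illustrative check (not from the original notebook): the sketch below
# rebuilds the same preprocessor and applies it to a 4-row toy frame taken
# from the df.head() output, to show the resulting column layout.
# The names `toy`, `demo_preprocessor`, and `transformed` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First four rows of the cleaned dataset (gender already mapped to 0/1)
toy = pd.DataFrame({
    "gender": [0, 0, 1, 0],
    "age": [80.0, 54.0, 28.0, 36.0],
    "hypertension": [0, 0, 0, 0],
    "heart_disease": [1, 0, 0, 0],
    "smoking_history": ["never", "No Info", "never", "current"],
    "bmi": [25.19, 27.32, 27.32, 23.45],
    "HbA1c_level": [6.6, 6.6, 5.7, 5.0],
    "blood_glucose_level": [140, 80, 158, 155],
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scaler", StandardScaler())]),
         ["age", "bmi", "HbA1c_level", "blood_glucose_level"]),
        ("cat", Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         ["smoking_history"]),
    ],
    remainder="passthrough",  # gender, hypertension, heart_disease pass through
)

transformed = demo_preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert to dense if so
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# 4 scaled numeric + 3 one-hot smoking categories + 3 passthrough columns
print("Transformed shape:", transformed.shape)
print("Scaled column means:", np.round(transformed[:, :4].mean(axis=0), 6))
```

# Output columns are ordered as: scaled numeric features, one-hot smoking
# categories, then the passthrough columns. On the full dataset the one-hot
# block has six columns, one per smoking_history value.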



In [16]:
print(df.head())


   gender   age  hypertension  heart_disease smoking_history    bmi  \
0       0  80.0             0              1           never  25.19   
1       0  54.0             0              0         No Info  27.32   
2       1  28.0             0              0           never  27.32   
3       0  36.0             0              0         current  23.45   
4       1  76.0             1              1         current  20.14   

   HbA1c_level  blood_glucose_level  diabetes  
0          6.6                  140         0  
1          6.6                   80         0  
2          5.7                  158         0  
3          5.0                  155         0  
4          4.8                  155         0  


In [34]:
# Models to Deploy


models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight='balanced'),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(random_state=42, class_weight='balanced')
}





In [41]:
results = []
best_model = None
best_score = 0
best_name = None

print("Models defined:", list(models.keys()))

for name, model in models.items():
    print("Training:", name)
    clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    # Handle probability safely
    try:
        if hasattr(clf, "predict_proba"):
            y_proba = clf.predict_proba(X_test)[:,1]
            roc = roc_auc_score(y_test, y_proba)
        else:
            roc = None
    except Exception as e:
        print(f"ROC-AUC not available for {name}: {e}")
        roc = None
    
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    results.append((name, acc, f1, roc))
    
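    # Track best model: ROC-AUC when available, otherwise F1
    # (caution: the two metrics are on different scales and not directly comparable)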
    score = roc if roc is not None else f1
    if score > best_score:
        best_score = score
        best_model = clf
        best_name = name

print("\nModel Performance Comparison:")
print("{:<20} {:<10} {:<10} {:<10}".format("Model", "Accuracy", "F1", "ROC-AUC"))
for name, acc, f1, roc in results:
    roc_str = f"{roc:.3f}" if roc is not None else "N/A"
    print("{:<20} {:.3f}     {:.3f}     {}".format(name, acc, f1, roc_str))


Models defined: ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'SVM']
Training: Logistic Regression
Training: Random Forest
Training: Gradient Boosting
Training: SVM

Model Performance Comparison:
Model                Accuracy   F1         ROC-AUC   
Logistic Regression  0.888     0.574     0.963
Random Forest        0.970     0.797     0.964
Gradient Boosting    0.972     0.809     0.979
SVM                  0.898     0.604     N/A


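The evaluation results indicate that the ensemble methods outperform the simpler classifiers on this dataset. Gradient Boosting achieved the highest overall performance, with an accuracy of 0.972, F1 score of 0.809, and ROC-AUC of 0.979, closely followed by Random Forest (accuracy 0.970, F1 0.797, ROC-AUC 0.964). Because only about 8.5% of records are positive, a model that always predicts "no diabetes" would already reach roughly 0.915 accuracy, so F1 and ROC-AUC are more informative than accuracy here. Logistic Regression ranks cases well (ROC-AUC 0.963) but has a much lower F1 (0.574) and an accuracy (0.888) below that majority-class baseline. With class_weight='balanced', this pattern is consistent with many false positives at the default decision threshold, although precision and recall were not measured separately. SVM reached moderate accuracy (0.898) and F1 (0.604). Its ROC-AUC is reported as N/A because SVC was created without probability=True and therefore exposes no predict_proba, not because the metric cannot be computed. Note also that best-model selection falls back to F1 for SVM, so its score is not directly comparable with the ROC-AUC used for the other models. Overall, Gradient Boosting gives the strongest results on all three metrics, while Random Forest is a competitive alternative.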
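
The SVM row shows N/A for ROC-AUC because the pipeline has no `predict_proba` when `SVC` uses its default `probability=False`. One possible workaround, sketched below on synthetic data (an assumption; this is not the diabetes dataset and not what the notebook ran), is to pass the continuous `decision_function` scores to `roc_auc_score`, which accepts any ranking score.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the real features
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

svm_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", SVC(random_state=42, class_weight="balanced")),
])
svm_clf.fit(X_tr, y_tr)

# With probability=False (the default), predict_proba is not exposed
has_proba = hasattr(svm_clf, "predict_proba")

# decision_function returns signed distances to the separating boundary,
# which are valid ranking scores for ROC-AUC
scores = svm_clf.decision_function(X_te)
svm_auc = roc_auc_score(y_te, scores)

print("Has predict_proba:", has_proba)
print(f"SVM ROC-AUC from decision_function: {svm_auc:.3f}")
```

Setting `probability=True` on `SVC` is another option, at the cost of an internal cross-validated calibration step that slows training on a dataset of this size.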

In [43]:
# Dump Best Model to .pkl

joblib.dump(best_model, "best_diabetes_model.pkl")
print(f"\nBest model saved: {best_name}")




Best model saved: Gradient Boosting
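
Because the saved object is the full pipeline (preprocessing plus classifier), it can be reloaded and applied to raw inputs directly. The following sketch shows the save/load round trip on a small synthetic pipeline; the file name `demo_model.pkl` and the synthetic data are illustrative assumptions, so the real `best_diabetes_model.pkl` is not required to run it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", GradientBoostingClassifier(n_estimators=20, random_state=42)),
])
model.fit(X_syn, y_syn)

# Save to a temporary directory, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions
original_proba = model.predict_proba(X_syn[:5])[:, 1]
reloaded_proba = reloaded.predict_proba(X_syn[:5])[:, 1]

print("Predictions match:", np.allclose(original_proba, reloaded_proba))
```

For the actual model, the input to `predict` would need to be a DataFrame with the same eight feature columns used in training, with `gender` already mapped to 0/1. Pickled models should generally be loaded with the same scikit-learn version that created them.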
