# **Predicting Diabetes Risk Using Machine Learning**
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

**Task:** Build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes or not. 

In [None]:
import pandas as pd
df = pd.read_csv("diabetes.csv")
df.head(5)

# Check for missing values
df.isna().sum()

# Summary statistics
df.describe()

## **Baseline Model**

In [12]:
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

### Logistic Regression

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

log_reg_pipe = Pipeline([("scaler", StandardScaler()),
          ("model", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
log_reg_pipe.fit(X_train, y_train)

y_pred = log_reg_pipe.predict(X_test)
print("Logistic Regression Report")
print(classification_report(y_test, y_pred))

Logistic Regression Report
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       101
           1       0.74      0.64      0.69        53

    accuracy                           0.80       154
   macro avg       0.78      0.76      0.77       154
weighted avg       0.79      0.80      0.79       154



In [62]:
import joblib
joblib.dump(log_reg_pipe, "logistic_regression.pkl")

['logistic_regression.pkl']

### Random Forest Classification

In [21]:
rand_forest_pipe = Pipeline([("model", RandomForestClassifier(n_estimators=100, class_weight="balanced"))])
rand_forest_pipe.fit(X_train, y_train)
y_pred_rf = rand_forest_pipe.predict(X_test)

print("Random Forest Report")
print(classification_report(y_test, y_pred_rf))

Random Forest Report
              precision    recall  f1-score   support

           0       0.84      0.87      0.85       101
           1       0.73      0.68      0.71        53

    accuracy                           0.81       154
   macro avg       0.79      0.78      0.78       154
weighted avg       0.80      0.81      0.80       154



In [61]:
import joblib
joblib.dump(rand_forest_pipe, "random_forest.pkl")

['random_forest.pkl']

### XGBoost Classifier

In [27]:
import xgboost as xgb
from xgboost import XGBClassifier

xgb_pipeline = Pipeline([("scaler", StandardScaler()),
                         ("xgb", XGBClassifier(n_estimators=200,
                                               learning_rate=0.1,
                                               max_depth=5,
                                               subsample=0.8,
                                               colsample_bytree=0.8,
                                               random_state=23,
                                               use_label_encoder=False,
                                               eval_metric="logloss"))])
xgb_pipeline.fit(X_train, y_train)
y_pred_xgb = xgb_pipeline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("\nXGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))

Accuracy: 0.8246753246753247

XGBoost Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87       101
           1       0.77      0.70      0.73        53

    accuracy                           0.82       154
   macro avg       0.81      0.79      0.80       154
weighted avg       0.82      0.82      0.82       154



In [None]:
import joblib
joblib.dump(xgb_pipeline, "xgboost.pkl")

['xgboost.pkl']

Putting all Classification Reports into a Dataframe

In [35]:
log_report = classification_report(y_test, y_pred, output_dict=True)
rand_forest_report = classification_report(y_test, y_pred_rf, output_dict=True)
xgb_report = classification_report(y_test, y_pred_xgb, output_dict=True)

log_report = pd.DataFrame(log_report).transpose()
rand_forest_report = pd.DataFrame(rand_forest_report).transpose()
xgb_report = pd.DataFrame(xgb_report).transpose()

log_report["model"] = "Logistic Regression"
rand_forest_report["model"] = "Random Forest"
xgb_report["model"] = "XGBoost"

comparison = pd.concat([log_report, rand_forest_report, xgb_report])
comparison = comparison.reset_index().rename(columns={"index":"class"})
comparison = comparison[["model", "class", "precision", "recall", "f1-score", "support"]]
comparison

Unnamed: 0,model,class,precision,recall,f1-score,support
0,Logistic Regression,0,0.824074,0.881188,0.851675,101.0
1,Logistic Regression,1,0.73913,0.641509,0.686869,53.0
2,Logistic Regression,accuracy,0.798701,0.798701,0.798701,0.798701
3,Logistic Regression,macro avg,0.781602,0.761349,0.769272,154.0
4,Logistic Regression,weighted avg,0.79484,0.798701,0.794956,154.0
5,Random Forest,0,0.838095,0.871287,0.854369,101.0
6,Random Forest,1,0.734694,0.679245,0.705882,53.0
7,Random Forest,accuracy,0.805195,0.805195,0.805195,0.805195
8,Random Forest,macro avg,0.786395,0.775266,0.780126,154.0
9,Random Forest,weighted avg,0.802509,0.805195,0.803266,154.0


## **Hyperparameter Tuning: Random Forest**

In [41]:
from sklearn.model_selection import GridSearchCV
parameter_grid = {"model__n_estimators": [100, 200, 300],
                  "model__max_depth": [None, 5, 10, 20],
                  "model__min_samples_split": [2, 5, 10],
                  "model__min_samples_leaf": [1, 2, 4],
                  "model__max_features": ["sqrt", "log2", None]}

grid_search = GridSearchCV(rand_forest_pipe,
             param_grid=parameter_grid,
             cv=5,
             scoring="recall",
             n_jobs=-1,
             verbose=2)

grid_search.fit(X_train, y_train)

grid_search.best_params_
best_rf_model = grid_search.best_estimator_
print(grid_search.best_estimator_)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Pipeline(steps=[('model',
                 RandomForestClassifier(class_weight='balanced', max_depth=5,
                                        min_samples_leaf=4,
                                        min_samples_split=5))])


In [43]:
y_pred_best_rf_model = best_rf_model.predict(X_test)
print(classification_report(y_test, y_pred_best_rf_model))

              precision    recall  f1-score   support

           0       0.89      0.80      0.84       101
           1       0.68      0.81      0.74        53

    accuracy                           0.81       154
   macro avg       0.79      0.81      0.79       154
weighted avg       0.82      0.81      0.81       154



*Evaluating Feature Importance*

In [59]:
rf = grid_search.best_estimator_.named_steps["model"]
importances = rf.feature_importances_
importances = pd.Series(importances, index=X_train.columns).sort_values(ascending=False)
importances

Glucose                     0.344420
BMI                         0.208367
Age                         0.142238
DiabetesPedigreeFunction    0.085850
SkinThickness               0.062757
Pregnancies                 0.053213
Insulin                     0.052241
BloodPressure               0.050916
dtype: float64

In [None]:
best_rf_model_report = classification_report(y_test, y_pred_best_rf_model, output_dict=True)
best_rf_model_report = pd.DataFrame(best_rf_model_report).transpose()
best_rf_model_report["model"] = "Tuned Random Forest"
random_forest_models = pd.concat([rand_forest_report, best_rf_model_report])
random_forest_models = random_forest_models.reset_index().rename(columns={"index": "class"})
random_forest_models = random_forest_models[["model", "class", "precision", "recall", "f1-score", "support"]]
random_forest_models

In [63]:
import joblib
joblib.dump(best_rf_model, "rf_tuned.pkl")

['rf_tuned.pkl']

**Model Deployment**