# **Project-3**

***Project Title:*** Predicting Diabetes

***Project Description:*** In this project, you will build a machine learning model to predict whether a person has diabetes or not based on their health metrics such as BMI, blood pressure, glucose levels, etc. The data set includes information on individuals' health metrics, including whether they have diabetes or not.

***Dataset Details:*** The data set contains over 750 records of female patients aged 21 years or older. The dataset has eight features (e.g., age, BMI, blood pressure, insulin level, etc.) and one target variable that indicates whether the person has diabetes or not.

Pregnancies: Number of pregnancies

Glucose: Glucose level in blood

BloodPressure: Blood pressure

SkinThickness: Thickness of the skin

Insulin: Insulin level in blood

BMI: Body Mass Index

DiabetesPedigreeFunction: Inheritance of the diabetes condition through generations

Age: Age

Outcome: 1 is Diabetic, 0 is non-Diabetic

***Datasets Location:*** Canvas -> Modules -> Week 9 -> Datasets -> **"patients.csv"**.

***Tasks:***

1) *Data Exploration and Preprocessing:* You will explore the data set, handle missing values, perform feature engineering, and preprocess the data to get it ready for model building.

2) *Model Building:* You will train and evaluate several machine learning models on the preprocessed data set.

3) *Model Evaluation:* You will evaluate the models' performance using several metrics such as accuracy, precision, recall, specificity, F1-score, and ROC curve analysis. You will also compare the models' performance and select the best-performing one.

4) *Deployment:* Once you have selected the best-performing model, you will deploy it and make predictions on new, unseen data.

This project will give you hands-on experience with supervised classification, data preprocessing, and model evaluation. It also has real-world applications in healthcare, where early detection of diabetes can help in the timely management of the disease.


Mount Google Drive in the Colab instance so the dataset can be read

In [1]:
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load the dataset into a DataFrame

In [2]:
patients = pd.read_csv("drive/My Drive/patients.csv")
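
The task list mentions handling missing values, which this notebook does not show. Below is a minimal sketch of one way to check for them. It assumes (the source does not confirm this) that unmeasured values are stored as 0 in columns such as Glucose or BMI, where a true zero is implausible. The `demo` frame is hypothetical stand-in data, not rows from patients.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `patients` would take this place.
demo = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: a 0 in these columns means "not measured".
zero_as_missing = ["Glucose", "BMI"]

cleaned = demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count the missing values per column, then fill them with the column median.
missing_counts = cleaned[zero_as_missing].isna().sum()
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].median()
)
print(missing_counts)
```

In a real workflow the medians would normally be computed on the training set only and then reused for the test set, so that no information leaks from the test data.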

Create a stratified split so the training and test sets keep the same proportion of diabetic and non-diabetic patients

In [4]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(patients, patients["Outcome"]):
  strat_train_set = patients.iloc[train_index]
  strat_test_set = patients.iloc[test_index]
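
A quick way to confirm that a stratified split did its job is to compare the positive-class rate before and after splitting. The sketch below uses hypothetical data (100 rows, 30% positive) rather than patients.csv; the column names are chosen only to mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 100 rows, 30% in the positive class.
demo = pd.DataFrame({
    "feature": np.arange(100),
    "Outcome": [1] * 30 + [0] * 70,
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(demo, demo["Outcome"]))

# The three rates should be (nearly) identical after a stratified split.
overall_rate = demo["Outcome"].mean()
train_rate = demo["Outcome"].iloc[train_index].mean()
test_rate = demo["Outcome"].iloc[test_index].mean()
print(overall_rate, train_rate, test_rate)
```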

Then separate the features from the target in both the training and test sets

In [6]:
train = strat_train_set.drop("Outcome", axis=1)
train_labels = strat_train_set["Outcome"].copy()

test = strat_test_set.drop("Outcome", axis=1)
test_labels = strat_test_set["Outcome"].copy()


Import several models and initialize them with their respective options

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [19]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
          DecisionTreeClassifier(), n_estimators=500,
          bootstrap=True, random_state=42
)

In [20]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
          DecisionTreeClassifier(max_depth=1),
          n_estimators=200,
          learning_rate=0.5,
          random_state=42
)

In [9]:
rndf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
log_clf = LogisticRegression(random_state=42, solver='lbfgs', max_iter=1500)
svc_clf = SVC(gamma='scale', random_state=42)

voting_clf = VotingClassifier(
                estimators=[('rnd', rndf_clf), ('lr', log_clf), ('svc', svc_clf)]
)
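
Logistic regression and SVC are sensitive to the scale of their inputs, and features such as Age and DiabetesPedigreeFunction are likely measured on quite different scales. One common option, sketched here on synthetic data from `make_classification` rather than on patients.csv, is to wrap the model in a pipeline with `StandardScaler`. The name `scaled_svc` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the eight health metrics.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Inside a pipeline the scaler is fit on the training rows only,
# then the same transformation is applied at prediction time.
scaled_svc = make_pipeline(StandardScaler(), SVC(gamma="scale", random_state=42))
scaled_svc.fit(X[:160], y[:160])

holdout_accuracy = scaled_svc.score(X[160:], y[160:])
print(holdout_accuracy)
```

Tree-based models such as the random forest do not need this step, so whether it helps here is something to check empirically.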


***METRICS***

In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve

Accuracy Scores for each model

In [32]:
for clf in (log_clf, rndf_clf, svc_clf, voting_clf, bag_clf, ada_clf):
  clf.fit(train, train_labels)
  pred = clf.predict(test)
  # scikit-learn metrics take the true labels first, then the predictions
  print(clf.__class__.__name__, accuracy_score(test_labels, pred))

LogisticRegression 0.7066666666666667
RandomForestClassifier 0.7266666666666667
SVC 0.7066666666666667
VotingClassifier 0.7133333333333334
BaggingClassifier 0.7133333333333334
AdaBoostClassifier 0.7133333333333334
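
The metric cells in this notebook refit every model before each score. Since fitting is the expensive step, one option is to fit and predict once, then derive every metric from the same predictions. The sketch below shows the idea for a single model on synthetic data; the `scores` dictionary is a hypothetical helper structure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with eight features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1500, random_state=42)
clf.fit(X_train, y_train)    # fit once
pred = clf.predict(X_test)   # predict once

# True labels go first, predictions second.
scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
print(scores)
```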


Precision Scores for each model

In [24]:
for clf in (log_clf, rndf_clf, svc_clf, voting_clf, bag_clf, ada_clf):
  clf.fit(train, train_labels)
  pred = clf.predict(test)
  # true labels first, then predictions; the order matters for precision
  print(clf.__class__.__name__, precision_score(test_labels, pred))

LogisticRegression 0.5714285714285714
RandomForestClassifier 0.6046511627906976
SVC 0.5882352941176471
VotingClassifier 0.5897435897435898
BaggingClassifier 0.5897435897435898
AdaBoostClassifier 0.5945945945945946


Recall Scores for each model

In [27]:
for clf in (log_clf, rndf_clf, svc_clf, voting_clf, bag_clf, ada_clf):
  clf.fit(train, train_labels)
  pred = clf.predict(test)
  # true labels first, then predictions; the order matters for recall
  print(clf.__class__.__name__, recall_score(test_labels, pred))

LogisticRegression 0.48
RandomForestClassifier 0.52
SVC 0.4
VotingClassifier 0.46
BaggingClassifier 0.46
AdaBoostClassifier 0.44


F1 Scores for each model

In [25]:
for clf in (log_clf, rndf_clf, svc_clf, voting_clf, bag_clf, ada_clf):
  clf.fit(train, train_labels)
  pred = clf.predict(test)
  # true labels first, then predictions
  print(clf.__class__.__name__, f1_score(test_labels, pred))

LogisticRegression 0.5217391304347826
RandomForestClassifier 0.5591397849462365
SVC 0.4761904761904762
VotingClassifier 0.5168539325842696
BaggingClassifier 0.5168539325842696
AdaBoostClassifier 0.5057471264367815
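
The task list also asks for specificity, and scikit-learn has no dedicated `specificity_score` function. One way to get it is from the confusion matrix, as sketched below. The labels are hypothetical and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)   # share of true negatives correctly identified
sensitivity = tp / (tp + fn)   # same quantity as recall
print(specificity, sensitivity)
```

Specificity is also the recall of the negative class, so `recall_score(y_true, y_pred, pos_label=0)` should give the same number.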


ROC Curve for each model (computed from hard class predictions, so each curve has only one interior point)

In [26]:
for clf in (log_clf, rndf_clf, svc_clf, voting_clf, bag_clf, ada_clf):
  clf.fit(train, train_labels)
  pred = clf.predict(test)
  print(clf.__class__.__name__, roc_curve(pred, test_labels))

LogisticRegression (array([0.        , 0.24074074, 1.        ]), array([0.        , 0.57142857, 1.        ]), array([2, 1, 0]))
RandomForestClassifier (array([0.        , 0.22429907, 1.        ]), array([0.        , 0.60465116, 1.        ]), array([2, 1, 0]))
SVC (array([0.        , 0.25862069, 1.        ]), array([0.        , 0.58823529, 1.        ]), array([2, 1, 0]))
VotingClassifier (array([0.        , 0.24324324, 1.        ]), array([0.        , 0.58974359, 1.        ]), array([2, 1, 0]))
BaggingClassifier (array([0.        , 0.24324324, 1.        ]), array([0.        , 0.58974359, 1.        ]), array([2, 1, 0]))
AdaBoostClassifier (array([0.        , 0.24778761, 1.        ]), array([0.        , 0.59459459, 1.        ]), array([2, 1, 0]))
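
`roc_curve` expects the true labels first and a continuous score second. When it is given hard 0/1 predictions, the curve collapses to a single interior point, as in the output above. Below is a minimal sketch of the more usual pattern, on synthetic data and with logistic regression only. It assumes the model exposes `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with eight features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1500, random_state=42).fit(X_train, y_train)

# Probability of the positive class (column 1) serves as the continuous score.
scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(len(thresholds), auc)
```

Not every model in this notebook offers `predict_proba` as configured. `SVC` without `probability=True` provides `decision_function` instead, and a hard-voting `VotingClassifier` provides neither, so those two would need a different score source or a configuration change.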


Tune the random forest with GridSearchCV

In [30]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

# n_estimators, max_features and bootstrap are RandomForestClassifier
# hyperparameters, so the search wraps rndf_clf
grid_search = GridSearchCV(rndf_clf, param_grid, cv=5)
grid_search.fit(train, train_labels)
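
Task 4 asks for predictions on new, unseen data. Below is a minimal sketch of how a fitted grid search could serve that purpose, on synthetic data, with the last 20 rows standing in for "new" patients. The grid values and the names `search`, `best_model` and `new_predictions` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with eight features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Parameter names match RandomForestClassifier arguments directly,
# because no pipeline step prefix is involved.
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X[:180], y[:180])

print(search.best_params_, search.best_score_)

# With the default refit=True, best_estimator_ is retrained on all rows
# passed to fit() using the best parameter combination.
best_model = search.best_estimator_
new_predictions = best_model.predict(X[180:])
print(new_predictions)
```

A prefix such as `step__param` is only needed when the estimator is a `Pipeline` and the parameter belongs to one of its named steps.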