# Diabetes Prediction Using Machine Learning and Deep Learning

This notebook trains and evaluates three models for diabetes prediction:
- **Logistic Regression** (Machine Learning baseline)
- **Random Forest Classifier** (Ensemble Machine Learning)
- **Artificial Neural Network (ANN)** using **TensorFlow/Keras** (Deep Learning)

**Dataset:** `diabetes.csv`  
**Target column:** `Outcome` (0 = Non-diabetic, 1 = Diabetic)


## 1. Import Libraries

In [4]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input


## 2. Load Dataset

In [5]:
# Load the dataset (the path is relative to this notebook; adjust it if diabetes.csv lives elsewhere)
df = pd.read_csv('../Data Sets/diabetes.csv')

# Preview the first rows
df.head()


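|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |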


## 3. Check Missing Values

In [6]:
df.isnull().sum()


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
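
A count of zero nulls does not by itself mean the data is complete. The preview in section 2 shows `0` for `SkinThickness` and `Insulin` in several rows, and a zero is not a plausible reading for measurements like these, so zeros may be standing in for missing values. The sketch below counts zeros per column. It rebuilds the five preview rows as a stand-in for `df`, and the list `zero_as_missing` is an assumption about which columns to treat this way, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Stand-in for df: the five rows shown by df.head() in section 2.
preview = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns marks a missing measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column.
zero_counts = (preview[zero_as_missing] == 0).sum()
print(zero_counts)

# Optionally convert those zeros to NaN so that isnull() reports them.
cleaned = preview.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Running the same count on the full `df` would show how many rows are affected before deciding whether to impute or drop them.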

## 4. Split Features and Target

In [7]:
X = df.drop('Outcome', axis=1)   # Features
y = df['Outcome']                 # Target

X.shape, y.shape


((768, 8), (768,))

## 5. Train-Test Split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


((614, 8), (154, 8))
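
The split above is purely random, so the share of diabetic cases in the test set can drift from the share in the full data. One common option, not used in this notebook, is to pass `stratify=y` to `train_test_split`. The sketch below illustrates the effect on synthetic labels; `X_demo` and `y_demo` are hypothetical stand-ins, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 30% of them in the positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of positives in the training and test partitions.
print(y_tr.mean(), y_te.mean())
```

Switching the notebook to a stratified split would change which rows land in the test set, so the metrics reported in the later sections would no longer match.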

## 6. Feature Scaling

In [9]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape


((614, 8), (154, 8))
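
The scaler is fit on the training set only and then reused on the test set, which keeps test-set statistics out of preprocessing. A minimal sketch of what that means numerically, using synthetic arrays (`train_demo` and `test_demo` are hypothetical, not the diabetes features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
test_demo = rng.normal(loc=100.0, scale=15.0, size=(50, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test_demo)        # reuses the training statistics

# Training columns end up with mean 0 and std 1 (up to floating-point error).
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))

# Test columns are only approximately standardized, which is expected.
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```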

## 7. Model 1 — Logistic Regression

In [10]:
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)

lr_predictions = lr_model.predict(X_test_scaled)

print("Logistic Regression Evaluation:")
print(confusion_matrix(y_test, lr_predictions))
print(classification_report(y_test, lr_predictions))


Logistic Regression Evaluation:
[[79 20]
 [18 37]]
              precision    recall  f1-score   support

           0       0.81      0.80      0.81        99
           1       0.65      0.67      0.66        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154
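
The figures in the report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. As a worked check for the diabetic class (label 1), using the Logistic Regression matrix printed above:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[79, 20],
               [18, 37]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # 37 / 57
recall_1 = tp / (tp + fn)         # 37 / 55
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()   # 116 / 154

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# precision=0.65 recall=0.67 f1=0.66 accuracy=0.75
```

A recall of 0.67 means 18 of the 55 diabetic cases in the test set were missed, which the overall accuracy of 0.75 does not show on its own.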



## 8. Model 2 — Random Forest Classifier

In [11]:
rf_model = RandomForestClassifier()
rf_model.fit(X_train_scaled, y_train)

rf_predictions = rf_model.predict(X_test_scaled)

print("Random Forest Evaluation:")
print(confusion_matrix(y_test, rf_predictions))
print(classification_report(y_test, rf_predictions))


Random Forest Evaluation:
[[81 18]
 [20 35]]
              precision    recall  f1-score   support

           0       0.80      0.82      0.81        99
           1       0.66      0.64      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154
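
`RandomForestClassifier()` is created above without a `random_state`, so its bootstrap samples and feature subsets differ between runs and the numbers in this report can shift slightly on re-execution. A minimal sketch of fixing the seed, on synthetic data from `make_classification` (the data, `n_estimators=50`, and the seed value 42 are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 8 features, like the notebook's feature matrix.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Two forests built with the same random_state are trained identically.
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

same = np.array_equal(rf_a.predict_proba(X_demo), rf_b.predict_proba(X_demo))
print(same)
```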



## 9. Model 3 — Deep Learning (ANN)

In [12]:
dl_model = Sequential()

# Input layer
dl_model.add(Input(shape=(X_train_scaled.shape[1],)))

# Hidden layers
dl_model.add(Dense(32, activation='relu'))
dl_model.add(Dense(16, activation='relu'))

# Output layer (binary classification)
dl_model.add(Dense(1, activation='sigmoid'))

dl_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = dl_model.fit(
    X_train_scaled, y_train,
    epochs=100,
    batch_size=10,
    validation_split=0.2,
    verbose=1
)


Epoch 1/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.6058 - loss: 0.6607 - val_accuracy: 0.6829 - val_loss: 0.6038
Epoch 2/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8084 - loss: 0.4894 - val_accuracy: 0.7154 - val_loss: 0.5502
Epoch 3/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7908 - loss: 0.4988 - val_accuracy: 0.7398 - val_loss: 0.5282
Epoch 4/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7981 - loss: 0.4385 - val_accuracy: 0.7236 - val_loss: 0.5197
Epoch 5/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7987 - loss: 0.4337 - val_accuracy: 0.7236 - val_loss: 0.5180
Epoch 6/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7991 - loss: 0.4484 - val_accuracy: 0.7236 - val_loss: 0.5110
Epoch 7/100
...

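The `50/50` in each log line is the number of batches per epoch, and it can be checked from the settings in the training cell. The sketch below does that arithmetic and also counts the trainable parameters of the 8-32-16-1 network. It assumes Keras holds out the last `validation_split` fraction of rows and floors the split index, which is my understanding of `Model.fit` rather than something shown in this notebook.

```python
import math

n_train_rows = 614        # rows in X_train_scaled (from the train-test split)
validation_split = 0.2
batch_size = 10

# Assumption: Keras keeps the first floor(n * (1 - split)) rows for fitting.
n_fit = math.floor(n_train_rows * (1 - validation_split))
n_val = n_train_rows - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)

# Dense layer parameters = inputs * units + units (weights plus biases).
layer_sizes = [8, 32, 16, 1]
n_params = sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

print(n_fit, n_val, steps_per_epoch, n_params)
# 491 123 50 833
```

The 123 validation rows are consistent with the log: the first-epoch `val_accuracy` of 0.6829 corresponds to 84 of 123 correct.
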
## 10. Deep Learning Model Evaluation

In [13]:
dl_predictions = (dl_model.predict(X_test_scaled) > 0.5).astype("int32")

print("Deep Learning Model Evaluation:")
print(confusion_matrix(y_test, dl_predictions))
print(classification_report(y_test, dl_predictions))


5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
Deep Learning Model Evaluation:
[[73 26]
 [21 34]]
              precision    recall  f1-score   support

           0       0.78      0.74      0.76        99
           1       0.57      0.62      0.59        55

    accuracy                           0.69       154
   macro avg       0.67      0.68      0.67       154
weighted avg       0.70      0.69      0.70       154
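
The evaluation cell turns sigmoid outputs into class labels by comparing them with 0.5. That cutoff is a choice rather than a fixed rule: lowering it flags more cases as diabetic, which typically raises recall for class 1 at the cost of precision. A small sketch with hypothetical probabilities (`probs` is made up for illustration and is not model output):

```python
import numpy as np

# Hypothetical sigmoid outputs with the (n_samples, 1) shape that predict() returns.
probs = np.array([[0.91], [0.48], [0.50], [0.12], [0.67]])

# ravel() flattens the column into a 1-D label vector.
labels_default = (probs > 0.5).astype("int32").ravel()
labels_lower = (probs > 0.4).astype("int32").ravel()

print(labels_default)  # [1 0 0 0 1]
print(labels_lower)    # [1 1 1 0 1]
```

Note that a probability of exactly 0.50 is labeled 0 under the strict `>` comparison.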



## 11. Notes
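- Random Forest and neural network training are stochastic, so the metrics above will vary from run to run. For reproducible results, set random seeds, for example `random_state` on `RandomForestClassifier` and `tf.keras.utils.set_random_seed` for TensorFlow/Keras.
- Possible next steps: add ROC-AUC and precision-recall curves for evaluation, and tune hyperparameters to improve performance.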
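
As a starting point for the threshold-independent metrics mentioned above, the following sketch computes ROC-AUC and average precision (a single-number summary of the precision-recall curve) with scikit-learn. It runs on synthetic data from `make_classification`, so the printed values say nothing about the diabetes models; `X_demo`, `y_demo`, and `clf` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features, like the notebook's feature matrix.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Both metrics need scores or probabilities, not hard 0/1 predictions.
pos_probs = clf.predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, pos_probs)
avg_precision = average_precision_score(y_te, pos_probs)
print(f"ROC-AUC={roc_auc:.3f}  average precision={avg_precision:.3f}")
```

For the notebook's own models, the equivalent scores would be `lr_model.predict_proba(X_test_scaled)[:, 1]`, `rf_model.predict_proba(X_test_scaled)[:, 1]`, and `dl_model.predict(X_test_scaled).ravel()`, each compared against `y_test`.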
