## Healthcare – Early Detection of Diabetes Using Machine Learning

## Background

A mobile health clinic aims to pre-screen patients for diabetes using basic health indicators like glucose level, BMI, insulin levels, and age. The goal is to:

1. Reduce hospital crowding,
2. Prioritize care for at-risk individuals,
3. Enable early intervention to prevent complications.

## Problem statement

Many people remain undiagnosed or are diagnosed too late with diabetes, especially Type 2 Diabetes. Traditional screening methods are resource-intensive. There is a need to:

* Predict the likelihood of diabetes before formal testing
* Use machine learning (ML) to classify whether a person is likely diabetic based on easily measurable indicators.

## Objective

To build a classification model that predicts whether a person has diabetes or not, based on features such as BMI, glucose level, insulin level, age, and others from the Pima Indians Diabetes dataset.

## Preparing the environment



In [None]:
## Import all necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier #Added for a more robust model option
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')



In [None]:
## importing the dataset
df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
df.head()

In [None]:
# Extract columns as a data frame.

cols = df.columns
columns = pd.DataFrame(cols, columns=['Column Names'])
columns

## Description of the columns

| Column Name              | Description                                                      |
| ------------------------ | ---------------------------------------------------------------- |
| Pregnancies              | Number of times pregnant                                         |
| Glucose                  | Plasma glucose concentration                                     |
| BloodPressure            | Diastolic blood pressure (mm Hg)                                 |
| SkinThickness            | Triceps skinfold thickness (mm)                                  |
| Insulin                  | 2-Hour serum insulin (mu U/ml)                                   |
| BMI                      | Body mass index (weight in kg / (height in m)²)                  |
| DiabetesPedigreeFunction | Function that scores diabetes likelihood based on family history |
| Age                      | Age in years                                                     |
| Outcome                  | Class variable (0: Non-diabetic, 1: Diabetic)                    |

## Data cleaning and preprocessing.

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
## Check for null values
df.isnull().sum()

In [None]:
df.duplicated().sum()

The dataset has no duplicate rows and no null values.

In [None]:
# The dataset description mentions that 0 can be a placeholder for missing values in certain columns.
# Columns where 0 might indicate a missing value: 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI'
cols_to_check_for_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print(f"Counts of zeros in columns {cols_to_check_for_zeros}:")
for col in cols_to_check_for_zeros:
    zero_count = (df[col] == 0).sum()
    print(f"- {col}: {zero_count} zeros ({(zero_count/len(df)*100):.2f}%)") # checking for the number of zeros and percentages

### Data Quality Insight: Implausible Zero Values

In the dataset, the following medical features contain zero values:

- `Glucose`
- `BloodPressure`
- `SkinThickness`
- `Insulin`
- `BMI`

These physiological measurements are **always expected to have non-zero values** in any living individual. 

Therefore, zero entries in these columns are likely due to **missing data or incorrect recording** rather than actual valid observations.

> **Note:** These zeros should be treated as missing values during data preprocessing.



In [None]:
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']
df[zero_cols] = df[zero_cols].replace(0, np.nan)
df = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(df), columns=df.columns)

Zero values in key health columns are replaced with NaN to mark them as missing.  
Missing values are then filled using the median value of each column with SimpleImputer.
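
The cell above fits the imputer on the full DataFrame, so the medians are computed from rows that later end up in the test set. A minimal sketch of one alternative, assuming a small made-up frame (`toy`, `demo_zero_cols`, and all values are hypothetical), is to split first and learn the medians from the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Tiny made-up frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 78, 0],
    "BMI": [33.6, 26.6, 23.3, 0, 43.1, 25.6, 31.0, 35.3],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})
demo_zero_cols = ["Glucose", "BMI"]  # columns where 0 is a placeholder for "missing"

# Mark the placeholder zeros as missing
toy[demo_zero_cols] = toy[demo_zero_cols].replace(0, np.nan)

# Split first so the medians are learned from the training rows only
train, test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["Outcome"]
)
train, test = train.copy(), test.copy()

imputer = SimpleImputer(strategy="median")
train[demo_zero_cols] = imputer.fit_transform(train[demo_zero_cols])  # learn medians
test[demo_zero_cols] = imputer.transform(test[demo_zero_cols])        # reuse them

print(train.isna().sum().sum(), test.isna().sum().sum())
```

Restricting the imputer to the affected columns also leaves `Outcome` and the other features untouched.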

## Exploratory Data Analysis

In [None]:
# Checking for correlation
df_corr = df.corr()

sns.heatmap(df_corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.savefig("Correlation Heatmap")
plt.show()

### Key Correlations from the Heatmap

The heatmap highlights the strength of linear relationships between features and the target variable `Outcome`.

#### Most Important Correlations:

- **Glucose ↔ Outcome (0.49)**  
  Strongest positive correlation. Higher glucose levels are clearly associated with a higher likelihood of the condition.

- **BMI ↔ Outcome (0.31)**  
  Moderate correlation. Higher body mass index is associated with higher risk.

- **Age ↔ Outcome (0.24)**  
  Older individuals show a higher tendency toward the outcome.

- **Pregnancies ↔ Outcome (0.22)**  
  Suggests that more pregnancies are modestly linked with the condition, possibly due to long-term health effects.

#### ⚠️ Note:
Other features like `Insulin`, `BloodPressure`, and `SkinThickness` have weak correlations with the outcome.
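
Reading coefficients off a heatmap is error-prone, so it can help to list them in sorted order. The sketch below uses synthetic data (`demo`, `Noise`, and the generated values are hypothetical) to show the pattern; on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Glucose drives the outcome, Noise does not
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
demo = pd.DataFrame({
    "Glucose": glucose,
    "Noise": rng.normal(0, 1, n),
    "Outcome": (glucose + rng.normal(0, 20, n) > 130).astype(int),
})

# Pearson correlation of every feature with the target, strongest first
corr_with_outcome = (
    demo.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)
print(corr_with_outcome)
```

Pearson correlation only captures linear association, so a weak coefficient does not by itself rule out a useful feature.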


In [None]:
# Checking to see if there is a relationship between number of pregnancy and likelihood of being diagnosed with diabetes
df.groupby("Pregnancies")["Outcome"].value_counts().unstack().plot(kind="bar", stacked=False)
plt.title("Diabetes Outcome by Number of Pregnancies")
plt.xlabel("Number of Pregnancies")
plt.ylabel("Number of Outcomes");
plt.savefig("Bargraph")

### Diabetes Outcome by Number of Pregnancies

- Fewer pregnancies (0–2) are associated with more non-diabetic cases.
- As the number of pregnancies increases, the proportion of diabetic cases (orange bars) also increases.
- Diabetic outcomes become more frequent relative to non-diabetic outcomes in women with 6 or more pregnancies.

> **Conclusion:** Higher pregnancy count is associated with a higher likelihood of diabetes.
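
The bar chart shows raw counts, while the bullets above talk about proportions. A small sketch of how the diabetic rate per pregnancy count could be computed directly (the `demo` frame and its values are made up for illustration):

```python
import pandas as pd

# Made-up rows standing in for the real data
demo = pd.DataFrame({
    "Pregnancies": [0, 0, 0, 0, 1, 1, 1, 7, 7, 7],
    "Outcome":     [0, 0, 0, 1, 0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column is the share of diabetic cases; n shows the group size
rate = demo.groupby("Pregnancies")["Outcome"].agg(diabetic_rate="mean", n="count")
print(rate)
```

Keeping the group size next to the rate makes it easier to spot rates that rest on only a handful of patients.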


In [None]:
# Group the data
grouped = df.groupby("BloodPressure")["Outcome"].value_counts().unstack()

# Plot line chart
plt.figure(figsize=(12, 6))
for outcome in grouped.columns:
    plt.plot(grouped.index, grouped[outcome], marker='o', label=f'Outcome {outcome}')

# Add labels and title
plt.title("Diabetes Outcome Trends by Blood Pressure")
plt.xlabel("Blood Pressure")
plt.ylabel("Count")
plt.legend(title='Outcome')
plt.grid(True)
plt.savefig("LineGraph")
plt.show()


## Blood Pressure & Diabetes

- **Non-diabetics (Outcome 0):** Most have diastolic blood pressure between 60 and 80 mmHg, with a sharp, high peak. Part of the peak's height reflects the larger size of this class, because the chart plots raw counts.
- **Diabetics (Outcome 1):** The distribution is wider and flatter, with relatively more readings above 80 mmHg.
- **Clinical Relevance:** Elevated blood pressure often accompanies diabetes and raises the risk of cardiovascular complications.
- **Key Takeaway:** Blood pressure is worth monitoring in diabetes management, although on its own it is only weakly correlated with `Outcome` in this dataset.
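
Because the two classes differ in size, comparing raw counts can mislead. One hedged way to compare the groups on equal footing is to summarise each class separately, sketched here on made-up values (`demo` is hypothetical):

```python
import pandas as pd

# Made-up readings standing in for the real data
demo = pd.DataFrame({
    "BloodPressure": [62, 70, 72, 74, 78, 80, 68, 76, 84, 90],
    "Outcome":       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Centre and spread of blood pressure within each class
summary = demo.groupby("Outcome")["BloodPressure"].agg(["median", "std"])

# Share of readings strictly above 80 mmHg within each class
summary["share_above_80"] = (
    demo["BloodPressure"].gt(80).groupby(demo["Outcome"]).mean()
)
print(summary)
```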

In [None]:
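# Check the age distribution
plt.figure(figsize=(10, 6))
# histplot is axes-level, so it draws on the figure created above
# (displot is figure-level and would open a second figure, leaving this one empty)
sns.histplot(df['Age'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig("age distribution")
plt.show()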

## Age Distribution Explanation

- **Shape:** The histogram shows a right-skewed (positively skewed) distribution.
- **Peak:** The largest group is in the early 20s, with the highest frequency around age 21–23.
- **Trend:** As age increases, the number of individuals decreases steadily.
- **Older Age Groups:** There are very few individuals above age 60, and almost none above 75.

### Summary

Most individuals in this data set are young adults, with frequency dropping sharply as age increases. The population is concentrated in the 20–40 year age range, and older adults are underrepresented.
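
The right skew can also be checked numerically rather than by eye. A minimal sketch on made-up ages (`ages` is hypothetical; on the real data the same calls would be applied to `df['Age']`):

```python
import pandas as pd

# Made-up ages concentrated in the early 20s with a long right tail
ages = pd.Series([21, 21, 22, 22, 23, 24, 25, 27, 29, 33, 41, 52, 66], name="Age")

skewness = ages.skew()  # positive values indicate a right-skewed distribution
print(f"skewness={skewness:.2f}, median={ages.median()}, mean={ages.mean():.1f}")
```

In a right-skewed distribution the mean is pulled above the median by the few older individuals.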

In [None]:
# Checking if the data is balanced
df['Outcome'].value_counts()

- The dataset is imbalanced. Machine learning models trained on imbalanced data can become biased toward the majority class, so the imbalance is addressed below by oversampling the minority class.

In [None]:
# Visualize the imbalance
sns.countplot(x='Outcome', data=df)
plt.title('Class Distribution (Outcome)')
plt.xlabel('Outcome (0 = No Diabetes, 1 = Diabetes)')
plt.ylabel('Count')
plt.show()

In [None]:
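# Helper to oversample the minority class

def balance_dataset(df, target_column):
    """
    Balance a binary dataset by oversampling the minority class with replacement.

    Assumes class 0 is the majority and class 1 the minority, as in this dataset.
    Prints and plots the class counts before and after, and returns the balanced DataFrame.
    """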
    # Separate majority and minority classes
    majority = df[df[target_column] == 0]
    minority = df[df[target_column] == 1]

    print(f"Before Oversampling:\n{df[target_column].value_counts()}\n")

    # Upsample minority class
    minority_upsampled = resample(
        minority,
        replace=True,  # sample with replacement
        n_samples=len(majority),  # match majority count
        random_state=42
    )

    # Combine majority and upsampled minority
    df_oversampled = pd.concat([majority, minority_upsampled])

    print(f"After Oversampling:\n{df_oversampled[target_column].value_counts()}\n")

    # Plot class distribution before and after
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    df[target_column].value_counts().plot(kind='bar', ax=axes[0], color=['skyblue', 'orange'])
    axes[0].set_title('Before Oversampling')
    axes[0].set_xlabel('Class')
    axes[0].set_ylabel('Count')

    df_oversampled[target_column].value_counts().plot(kind='bar', ax=axes[1], color=['skyblue', 'orange'])
    axes[1].set_title('After Oversampling')
    axes[1].set_xlabel('Class')
    axes[1].set_ylabel('Count')

    plt.tight_layout()
    plt.savefig("Resampling")
    plt.show()

    return df_oversampled

In [None]:
Unbalanced_data = df.copy()
balanced_data = balance_dataset(Unbalanced_data, 'Outcome')
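
Oversampling the whole dataset before splitting means that copies of the same minority row can land in both the training and the test set, which tends to inflate test scores, especially for tree-based models. A minimal sketch of the alternative order, assuming a made-up frame (`demo` and its values are hypothetical): split first, then oversample the training part only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 20 negatives, 10 positives
demo = pd.DataFrame({
    "Glucose": range(100, 130),
    "Outcome": [0] * 20 + [1] * 10,
})

# 1) Hold out a test set that keeps the original class ratio
train, test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Outcome"]
)

# 2) Oversample the minority class inside the training set only
majority = train[train["Outcome"] == 0]
minority = train[train["Outcome"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Outcome"].value_counts())
print(test["Outcome"].value_counts())
```

With this order the test set contains only rows the model has never seen, and it keeps the class balance a deployed model would face.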

## Feature Selection and feature engineering

From the heatmap above, `Outcome` (whether a patient has diabetes) is only weakly correlated with `SkinThickness` and `BloodPressure`. A simple feature contribution analysis can confirm this. Before proceeding, we check that the zeros in `Insulin`, `BloodPressure`, `Glucose`, `SkinThickness`, and `BMI` have been replaced with the medians of their respective columns.

In [None]:
Unbalanced_data.describe()

In [None]:
balanced_data.describe()

In [None]:
## Create a copy of the balanced DataSet
copy_balanced = balanced_data.copy()

In [None]:
copy_balanced.describe()

In [None]:
x = balanced_data.drop('Outcome', axis=1)
y = balanced_data['Outcome']
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

# Convert back to DataFrame
x_scaled_df = pd.DataFrame(x_scaled, columns=x.columns)

# Reattach the Outcome column
standardized_data = pd.concat([x_scaled_df, y.reset_index(drop=True)], axis=1)
standardized_data

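The summary statistics confirm that the affected columns have been filled with the medians of their respective columns. The last cell also standardizes the features, although the models in the training section are fitted on the unscaled `copy_balanced` data.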
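
If standardized features are wanted for a scale-sensitive model such as logistic regression, one common pattern is to put the scaler and the model in a single scikit-learn `Pipeline`, so the scaling statistics come from the training rows only. A sketch on synthetic data (`X_demo`, `y_demo`, and the generated values are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[120, 30], scale=[30, 6], size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(0, 15, 200) > 125).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```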

## Feature Contribution Analysis

To carry out a feature contribution analysis, we separate `Outcome` from the other features, fit a Random Forest classifier, and then rank the features by their importance.


In [None]:
X_FeatureAnalysis = copy_balanced.drop(['Outcome'], axis=1)
y_FeatureAnalysis = copy_balanced['Outcome']
X_FeatureAnalysis.shape, y_FeatureAnalysis.shape

In [None]:
## Model for feature analysis

model_FeatureAnalysis = RandomForestClassifier(n_estimators = 100, random_state=42)
model_FeatureAnalysis.fit(X_FeatureAnalysis, y_FeatureAnalysis)

In [None]:
Importances = model_FeatureAnalysis.feature_importances_   # This gets feature importance
feature_importances = pd.DataFrame({'features': X_FeatureAnalysis.columns, 'importance':Importances}) # This creates a Dataframe of features and their importances
feature_importances = feature_importances.sort_values(by='importance', ascending = False)
feature_importances
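
The `feature_importances_` attribute is impurity-based and is computed on the training data. As a cross-check, permutation importance on held-out rows measures how much the score drops when a feature is shuffled. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `Noise` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: Glucose is informative, Noise is not
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "Noise": rng.normal(0, 1, n),
})
y_demo = (X_demo["Glucose"] + rng.normal(0, 10, n) > 125).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)
demo_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out rows and record the drop in accuracy
result = permutation_importance(demo_model, X_te, y_te, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm)
```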

In [None]:
copy_balanced.head()

## Model Training.

- Three algorithms will be used (logistic regression, `RandomForestClassifier`, and `XGBClassifier`), and the best-performing model will then be selected.

**Before that, the data will be divided into a training set and a test set.**

In [None]:
# Select x (features) and y (target) from the DataFrame 'copy_balanced'
x = copy_balanced.drop('Outcome', axis=1)  # all columns except the target
y = copy_balanced['Outcome']

# test_size=0.2 keeps 20% of the rows for testing and 80% for training;
# random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [None]:
x_train.shape, x_test.shape

In [None]:
y_train.shape, y_test.shape

## 1. Logistic Regression Model

In [None]:
# Initialize the logistic regression model
log_model = LogisticRegression(max_iter =1000, random_state=42)

# Train the model on the training data
log_model.fit(x_train, y_train)

In [None]:
# Predict labels for the test set
Log_ypredict = log_model.predict(x_test)
Log_ypredict

In [None]:
# Evaluate the model
# Get accuracy
accuracy = accuracy_score(y_test, Log_ypredict)

# Get classification report 
report = classification_report(y_test, Log_ypredict, output_dict=True)

# Convert classification report to DataFrame
C_Report  = pd.DataFrame(report).transpose()

# Add accuracy as a new row
C_Report.loc['accuracy'] = [accuracy, None, None, None]
## Print Report
C_Report

The logistic regression model predicts diabetes with **78.5% accuracy**. It performs similarly for both classes:

* **Non-diabetic (0)**: F1-score = **77.3%**
* **Diabetic (1)**: F1-score = **79.6%**

Precision and recall are well balanced, indicating the model reliably distinguishes between diabetic and non-diabetic cases without favoring either class.


In [None]:
# Logistic Regression confusion matrix
log_cm = confusion_matrix(y_test, Log_ypredict)
plt.figure(figsize=(5,4))
sns.heatmap(log_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig("Logistic Regression Matrix")
plt.show()

## Explanation

**True Negatives (75):** The model correctly predicted 75 non-diabetic individuals.

**False Positives (22):** The model incorrectly labeled 22 non-diabetic individuals as diabetic.

**False Negatives (21):** The model missed 21 diabetic individuals.

**True Positives (82):** The model correctly identified 82 diabetic individuals.

## 2. Random Forest Model

In [None]:
# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)

# Predict on the test set
rf_ypredict = rf_model.predict(x_test)

# Calculate accuracy
rf_accuracy = accuracy_score(y_test, rf_ypredict)

# Get classification report
rf_report = classification_report(y_test, rf_ypredict, output_dict=True)
RF_report = pd.DataFrame(rf_report).transpose()

# Add accuracy to the table
RF_report.loc['accuracy'] = [rf_accuracy, None, None, None]

# Display the evaluation table
RF_report

The **Random Forest model** demonstrates strong performance in predicting diabetes, achieving an overall **accuracy of 90.5%**. It shows a well-balanced ability to classify both diabetic and non-diabetic individuals.

For non-diabetic cases (class 0), the model has a **precision of 91.3%**, meaning it correctly identifies non-diabetics in most predictions, and a **recall of 88.5%**, indicating that it successfully detects a majority of actual non-diabetic individuals. The resulting **F1-score is 89.9%**, reflecting a good balance between precision and recall for this class.

For diabetic cases (class 1), the model yields a **precision of 89.7%** and a higher **recall of 92.3%**, suggesting that it is particularly effective at capturing actual diabetes cases. The **F1-score for this class is 90.9%**, indicating strong overall performance.

The **macro average** scores—precision (90.5%), recall (90.4%), and F1-score (90.4%)—show consistent performance across both classes without favoring one. Similarly, the **weighted averages**, which account for class distribution, mirror the overall performance closely.

In summary, the Random Forest model is accurate, balanced, and particularly effective at identifying diabetic individuals, making it a reliable tool for diabetes prediction.


In [None]:
# Random Forest confusion matrix
rf_cm = confusion_matrix(y_test, rf_ypredict)
plt.figure(figsize=(5,4))
sns.heatmap(rf_cm, annot=True, fmt='d', cmap='Greens')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig("RF Model")
plt.show()

The Random Forest model demonstrates strong classification performance, as shown in the confusion matrix:

* It correctly identified **85 out of 96 non-diabetic cases** (True Negatives) and misclassified **11** as diabetic (False Positives).
* It correctly identified **96 out of 104 diabetic cases** (True Positives) and misclassified **8** as non-diabetic (False Negatives).
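
The metrics quoted for this model follow directly from the four counts above. As a worked check, using only the standard definitions of accuracy, precision, recall, and F1 for the diabetic class:

```python
# Counts read from the Random Forest confusion matrix above
tn, fp, fn, tp = 85, 11, 8, 96

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision_pos = tp / (tp + fp)               # of predicted diabetics, how many are diabetic
recall_pos = tp / (tp + fn)                  # of actual diabetics, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.3f}, precision={precision_pos:.3f}, "
      f"recall={recall_pos:.3f}, f1={f1_pos:.3f}")
```

For a screening tool, recall on the diabetic class is the figure to watch, since each false negative is a missed case.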


## 3. XGBoost Classifier Model

In [None]:
# Initialize and train the XGBoost model
# (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(x_train, y_train)

#Predict on the test set
xgb_ypredict = xgb_model.predict(x_test)

#Calculate accuracy
xgb_accuracy = accuracy_score(y_test, xgb_ypredict)

# Get classification report
report = classification_report(y_test, xgb_ypredict, output_dict=True)
xgb_report = pd.DataFrame(report).transpose()

# Add accuracy to the table
xgb_report.loc['accuracy'] = [xgb_accuracy, None, None, None]

# Display the evaluation table
xgb_report

The **XGBoost (XGBClassifier) model** achieved an overall **accuracy of 88.0%** in predicting diabetes, indicating strong and consistent performance. For the **non-diabetic class (0)**, it recorded a **precision of 90.0%**, **recall of 84.3%**, and an **F1-score of 87.1%**, showing it is effective at correctly identifying non-diabetic individuals but slightly less so at capturing all actual cases. For the **diabetic class (1)**, the model achieved a **precision of 86.3%**, a higher **recall of 91.3%**, and an **F1-score of 88.7%**, suggesting it performs particularly well in identifying most individuals with diabetes.

The **macro average scores** (**precision: 88.2%**, **recall: 87.8%**, **F1-score: 87.9%**) show that the model maintains a balanced performance across both classes. The **weighted averages** (**precision: 88.1%**, **recall: 88.0%**, **F1-score: 87.9%**) account for the class distribution and further confirm its overall reliability. With strong precision, high recall for diabetic cases, and balanced F1-scores, the XGBoost model demonstrates solid potential for supporting diabetes risk identification in healthcare settings.


In [None]:
# Begin by computing the confusion matrix
xgb_cm = confusion_matrix(y_test, xgb_ypredict)

# Plot the confusion matrix
plt.figure(figsize=(5,4))
sns.heatmap(xgb_cm, annot=True, fmt='d', cmap='Oranges')
plt.title('XGBoost Classifier Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig("XGBC matrix")
plt.show()

The XGBoost Classifier demonstrates high predictive performance in identifying diabetes cases, as shown in the confusion matrix:

- It correctly classified **81 out of 96 non-diabetic cases (True Negatives)**, while 15 were incorrectly labeled as diabetic (False Positives).

- It correctly identified **95 out of 104 diabetic cases (True Positives)**, with only 9 missed as non-diabetic (False Negatives).
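
All three models are evaluated at the default 0.5 decision threshold. The imported `roc_curve` and `auc` functions can summarise performance across every threshold, and a lower threshold can trade some precision for higher recall, which may suit pre-screening. The sketch below uses synthetic data and a logistic regression stand-in (`X_demo`, `y_demo`, `clf`, and the 0.3 threshold are hypothetical choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

# ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)

# Recall at the default threshold versus a lower, more sensitive one
recall_default = recall_score(y_te, (scores >= 0.5).astype(int))
recall_low = recall_score(y_te, (scores >= 0.3).astype(int))

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Recall at 0.5: {recall_default:.3f}, at 0.3: {recall_low:.3f}")
```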

## Conclusion

* **Random Forest outperformed the other models**, achieving the highest **accuracy (90.5%)** and strong, balanced **F1-scores** for both classes (non-diabetic: 89.9%, diabetic: 90.9%).

* **XGBoost followed closely** with **88.0% accuracy**, showing high recall for diabetic cases (91.3%) and solid overall performance, making it a reliable alternative.

* **Logistic Regression lagged**, with lower **accuracy (78.5%)** and moderate F1-scores, though it maintained a balanced classification between classes.

* **Overall**: **Random Forest** is the most suitable of the three models for predicting diabetes here, with the highest accuracy and F1-scores on this test split of the oversampled data.
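
The comparison above rests on a single train/test split. A hedged way to check that a ranking is stable is stratified k-fold cross-validation, sketched here on synthetic data with two of the model types (`X_demo`, `y_demo`, and the model settings are hypothetical; XGBoost is left out only to keep the sketch dependency-light):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced data standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=42,
)

# Five stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean F1 on the positive class across the five folds
cv_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1").mean()
    for name, model in models.items()
}
for name, score in cv_f1.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

Any resampling step would need to run inside each fold, on the training portion only, for these scores to stay unbiased.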
