# Diabetes Prediction Using Machine Learning

## 📌 Project Overview

This project predicts the likelihood of diabetes in patients using supervised machine learning. The goal is to build, evaluate, and compare classification models with a strong emphasis on **recall** (sensitivity), so that as few diabetic cases as possible are missed.

The analysis follows an end-to-end data science workflow including data cleaning, feature engineering, model training, evaluation, and model persistence for deployment readiness.

---

## 🎯 Problem Statement

Early detection of diabetes is critical in reducing long-term health complications. Traditional diagnostic approaches can miss early-stage cases. This project aims to leverage machine learning models to improve diagnostic accuracy, particularly by minimizing false negatives.

---

## 📂 Dataset Description

* Dataset: Pima Indians Diabetes Dataset
* Target Variable: `Outcome` (0 = Non-diabetic, 1 = Diabetic)
* Features include:

  * Glucose
  * Blood Pressure
  * Skin Thickness
  * Insulin
  * BMI
  * Diabetes Pedigree Function
  * Age

### Data Preprocessing Steps

* Medically invalid zero values were replaced with missing values
* Median imputation was applied to the affected features
* Feature scaling was applied where the model is sensitive to feature scale
* The dataset was split into training and testing sets
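
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame, the `zero_invalid` column list, the 75/25 split, and the use of `StandardScaler` are all assumptions made for the example. The sketch splits first so that medians and scaling statistics are learned from the training rows only; the notebook may order the steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the Pima data (illustrative values only).
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 78, 115],
    "BloodPressure": [72, 66, 64, 0, 40, 74, 50, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0, 25.6, 31.0, 35.3],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Columns where a zero is treated as medically invalid (assumed list).
zero_invalid = ["Glucose", "BloodPressure", "BMI"]

X = df.drop(columns="Outcome").astype(float)
y = df["Outcome"]

# Split first so no test-set statistics leak into imputation or scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# 1. Replace invalid zeros with missing values.
X_train[zero_invalid] = X_train[zero_invalid].replace(0, np.nan)
X_test[zero_invalid] = X_test[zero_invalid].replace(0, np.nan)

# 2. Median imputation, using medians computed on the training set.
medians = X_train[zero_invalid].median()
X_train[zero_invalid] = X_train[zero_invalid].fillna(medians)
X_test[zero_invalid] = X_test[zero_invalid].fillna(medians)

# 3. Scale features (fit on training data, apply to both sets).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```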

---

## 🧠 Models Implemented

1. Logistic Regression (Baseline Model)
2. Random Forest Classifier (Optimized Model)

Both models were evaluated using multiple metrics to ensure balanced and clinically meaningful performance.
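
As a rough sketch of how the two models might be trained side by side, the example below fits both on synthetic data from `make_classification`. The data, the hyperparameters (`max_iter`, `n_estimators`), and the 80/20 split are placeholders chosen for illustration and are not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data with 8 features and a minority positive class.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Logistic Regression is scale-sensitive, so it is wrapped with a scaler.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Tree ensembles do not require scaled inputs.
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```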

---

## 📊 Evaluation Metrics

* Accuracy
* Precision
* Recall (Primary medical metric)
* F1-Score
* ROC-AUC
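
To make the threshold-based metrics concrete, here is a small worked example that derives accuracy, precision, recall, and F1 from confusion-matrix counts. The function name `classification_metrics` and the toy labels are hypothetical. ROC-AUC is left out because it needs predicted probabilities rather than hard labels; in scikit-learn it is usually computed with `sklearn.metrics.roc_auc_score`.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diabetic, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diabetic, missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 4 diabetic patients, one of whom the model misses.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one of the four diabetic patients is missed, so recall is 3/4 even though accuracy is 8/10, which is why accuracy alone can hide false negatives.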

---

## 📈 Final Model Performance

| Model               | Accuracy | Recall | ROC-AUC |
| ------------------- | -------- | ------ | ------- |
| Logistic Regression | XX.XX    | XX.XX  | XX.XX   |
| Random Forest       | XX.XX    | XX.XX  | XX.XX   |

> **Note:** Recall was prioritized to minimize false negatives in medical diagnosis.

---

## 🔍 Key Insights

* Random Forest outperformed Logistic Regression across most metrics
* Logistic Regression provided interpretability and a strong baseline
* Optimized Random Forest achieved higher recall, making it more suitable for medical screening
* ROC-AUC analysis confirmed better class separation by Random Forest

---

## 💾 Model Persistence

The final Random Forest model was serialized using `joblib` for future deployment:

```python
import joblib
joblib.dump(best_rf, "models/diabetes_random_forest_model.pkl")
```

---

## 🚀 How to Run the Project

1. Clone the repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Run the Jupyter notebook to train and evaluate models
4. Load the saved model for predictions
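
For the last step, loading and using a saved model could look roughly like the sketch below. To stay self-contained it trains a throwaway model on random data and saves it to a temporary directory; `demo_model`, `X_demo`, and the temporary path are stand-ins, not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the trained model: a small forest fitted on random data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0)
demo_model.fit(X_demo, y_demo)

# Save to a temporary location, then load it back as a user would.
model_path = os.path.join(tempfile.mkdtemp(), "diabetes_random_forest_model.pkl")
joblib.dump(demo_model, model_path)
loaded_model = joblib.load(model_path)

# Predict for one "patient" (a single row with the same feature layout).
new_patient = X_demo[:1]
prediction = loaded_model.predict(new_patient)[0]
probability = loaded_model.predict_proba(new_patient)[0, 1]
print(f"Predicted class: {prediction}, P(diabetic): {probability:.2f}")
```

With the real project, the path would instead be `models/diabetes_random_forest_model.pkl`, and new inputs need the same columns, in the same order and with the same preprocessing, as the training data.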

---

## 🔮 Future Improvements

* Hyperparameter tuning with GridSearchCV
* Recall optimization using class weighting or SMOTE
* Feature importance and SHAP analysis
* Model deployment using Streamlit or FastAPI
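
As one possible starting point for the first two items, the sketch below combines `GridSearchCV` with recall as the selection metric and `class_weight` as a tunable parameter. The synthetic data and the parameter grid are assumptions chosen to keep the example fast; SMOTE is not shown because it needs the separate `imbalanced-learn` package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data with a minority positive class.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

# Small illustrative grid; a real search would likely be wider.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # select the configuration with the best mean recall
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated recall:", round(search.best_score_, 3))
```

Selecting purely on recall can favor models that over-predict the positive class, so precision or F1 is worth checking alongside it.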

---

## 👤 Author

**Brian Sagini**, Business Intelligence & Data Analyst

---

## 📎 License

This project is for educational and portfolio purposes.
