# 📊 **Diabetes Insights: Predictive Modeling with BRFSS Health Indicators**

---

## 🚀 **Project Overview**
This project uses data from the **Behavioral Risk Factor Surveillance System (BRFSS)** to explore how well health indicators predict a diabetes diagnosis. By combining epidemiological insight with machine learning, we aim to identify the most influential features for diabetes prediction, compare model performance, and assess how the models behave on both balanced and imbalanced datasets.

The goal is not only prediction but also a better understanding of how specific health indicators interact with diabetes risk, so that public health officials and clinicians can make data-driven decisions.

---

## 💡 **Hypothesis**
As an epidemiologist, I hypothesize that a combination of specific BRFSS health indicators identifies diabetes cases in a population more accurately than any single indicator does. In particular, combining variables related to general health, mobility, and metabolic conditions should significantly improve predictive accuracy.
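
One way such a hypothesis could be checked is to compare a model trained on a single indicator against one trained on all indicators, using the same cross-validation folds. The sketch below is illustrative only: it assumes scikit-learn, runs on synthetic data rather than BRFSS records, and the helper name `mean_auc` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for five health indicators; this is NOT BRFSS data.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=5, n_redundant=0, random_state=0
)


def mean_auc(features):
    """Mean 5-fold cross-validated ROC AUC for a logistic regression."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, y, cv=5, scoring="roc_auc").mean()


single_auc = mean_auc(X[:, [0]])  # one indicator on its own
combined_auc = mean_auc(X)        # all five indicators together

print(f"single: {single_auc:.3f}  combined: {combined_auc:.3f}")
```

A clearly higher combined score would be consistent with the hypothesis; a formal test would still need the real data and an appropriate comparison across folds.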

---

## 📂 **Dataset Overview**
The BRFSS, run by the CDC, is one of the largest ongoing health surveys in the world. It collects data on:
- 🏥 **Chronic Health Conditions**
- 🧍 **Behavioral Risk Factors**
- ⚕️ **Preventive Health Practices**

### **Key Features Extracted for Analysis:**
| **Feature**  | **Description**                                          | **Impact**                                      |
|--------------|----------------------------------------------------------|-------------------------------------------------|
| **BMI**      | Body Mass Index, a weight-for-height proxy for body fat  | Higher BMI increases diabetes risk              |
| **GenHlth**  | Self-reported general health status                      | Poor health correlates with diabetes            |
| **HighBP**   | Indicator of hypertension                                | Strong link to diabetes cases                   |
| **HighChol** | Indicator of high cholesterol                            | Related to diabetes development                 |
| **DiffWalk** | Difficulty walking or other mobility issues              | Mobility limitation often accompanies diabetes  |

The dataset was cleaned, its missing values were handled, and new features were engineered before it was used to train machine learning models.

---

## 🔧 **Methods & Tools**

### **Data Cleaning & Preprocessing:**
- Handled missing values with mean imputation.
- Standardized numeric features to improve model training.
- Created derived features such as `Health_Score` and `BMI Categories` to improve predictive power.
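
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration with pandas on toy rows, not the project's actual code. The BMI bins follow the standard WHO cut-offs, while the `Health_Score` formula and the column names `BMI_Category` and `BMI_z` are assumptions made for this example, since the README does not define them.

```python
import pandas as pd

# Toy rows standing in for BRFSS records; values are illustrative only.
df = pd.DataFrame({
    "BMI": [22.0, 31.5, None, 27.0],
    "GenHlth": [2, 4, 3, None],  # 1 = excellent ... 5 = poor
    "HighBP": [0, 1, 1, 0],
    "HighChol": [0, 1, 0, 1],
    "DiffWalk": [0, 1, 0, 0],
})

# 1. Mean imputation for columns with missing values.
numeric_cols = ["BMI", "GenHlth"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# 2. BMI categories using WHO cut-offs (left-closed bins).
df["BMI_Category"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    right=False,
    labels=["Underweight", "Normal", "Overweight", "Obese"],
)

# 3. Hypothetical Health_Score: a simple sum of risk indicators.
#    The real definition is not given in this README.
df["Health_Score"] = (
    df["GenHlth"] + df["HighBP"] + df["HighChol"] + df["DiffWalk"]
)

# 4. Standardize BMI to zero mean and unit variance.
#    In a real pipeline the mean and std would come from the training split only.
df["BMI_z"] = (df["BMI"] - df["BMI"].mean()) / df["BMI"].std(ddof=0)

print(df)
```

Computing the imputation mean and the scaling statistics on the training split alone avoids leaking information from the test set into the model.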

### **Models Trained:**
- **Logistic Regression**
- **Random Forest Classifier**
- **XGBoost**
- **LightGBM**
- **Stacking Classifier**
- **Voting Classifier**
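
As a rough illustration of how the two ensemble models can be assembled, the sketch below builds a voting and a stacking classifier with scikit-learn. It makes several assumptions: the data is synthetic, `GradientBoostingClassifier` stands in for XGBoost and LightGBM so that the example needs only scikit-learn, and the choice of soft voting and of a logistic-regression meta-learner is a guess rather than the project's documented configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the BRFSS features (about 15% positives).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners shared by both ensembles (each ensemble clones them on fit).
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost/LightGBM
]

# Voting: average the predicted probabilities of the base models.
voting = VotingClassifier(estimators=base_models, voting="soft")

# Stacking: a meta-learner is trained on out-of-fold base-model predictions.
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

scores = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # plain accuracy

print(scores)
```

The key design difference is that voting combines base-model outputs with a fixed rule, whereas stacking learns how to weight them from cross-validated predictions.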

### **Synthetic Data (SMOTE):**
- Used **SMOTE** (Synthetic Minority Over-sampling Technique) to balance the dataset and evaluate model performance on balanced classes.
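
To show what SMOTE does conceptually, here is a small NumPy sketch of its core idea: each synthetic row is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the `imbalanced-learn` implementation listed in the technical stack, and the function name `smote_sketch` and the toy values are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances among minority samples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    # Pick a random base sample and one of its k neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate: base + gap * (neighbour - base), with gap in [0, 1).
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority-class rows (two made-up features).
X_minority = np.array([[30.0, 1.0], [32.0, 1.0], [35.0, 0.0], [28.0, 1.0]])
synthetic = smote_sketch(X_minority, n_new=6)
print(synthetic)
```

Oversampling is normally applied to the training split only, so that the test set still reflects the original class distribution.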

---

## 📈 **Key Results**

| **Model**               | **Precision** | **Recall** | **F1-Score** | **Accuracy** | **MSE** | **RMSE** | **MAE** | **R²**  |
|-------------------------|---------------|------------|--------------|--------------|---------|----------|---------|---------|
| **Random Forest**       | 0.87          | 0.98       | 0.93         | 0.86         | 0.2674  | 0.5171   | 0.2674  | -1.2495 |
| **XGBoost**             | 0.96          | 0.63       | 0.76         | 0.66         | 0.3429  | 0.5855   | 0.3429  | -1.8838 |
| **LightGBM**            | 0.91          | 0.85       | 0.88         | 0.80         | 0.2022  | 0.4497   | 0.2022  | -0.7009 |
| **Stacking Classifier** | 0.93          | 0.81       | 0.87         | 0.78         | 0.1349  | 0.3674   | 0.1349  | -0.1354 |
| **Voting Classifier**   | 0.92          | 0.85       | 0.88         | 0.81         | 0.2372  | 0.4870   | 0.2372  | 0.0512  |
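
For readers unfamiliar with how these columns relate, the sketch below computes each metric from hard 0/1 predictions in plain Python. The labels are made up for illustration and the function name `classification_metrics` is hypothetical. With hard labels, MSE and MAE both equal the misclassification rate (so accuracy is 1 minus MSE on the same test set), and a negative R² means the predictions have a larger squared error than always predicting the positive-class rate.

```python
import math


def classification_metrics(y_true, y_pred):
    """Classification and regression-style metrics for hard 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = len(pairs)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n

    # Regression-style errors on the 0/1 labels.
    mse = sum((t - p) ** 2 for t, p in pairs) / n
    mae = sum(abs(t - p) for t, p in pairs) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot

    return {
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": accuracy, "mse": mse, "rmse": math.sqrt(mse),
        "mae": mae, "r2": r2,
    }


# Made-up labels: 4 positives, 6 negatives, 2 misclassified rows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
report = classification_metrics(y_true, y_pred)
print(report)
```

Because MSE, RMSE, MAE, and R² add little beyond accuracy for hard labels, precision, recall, and F1 are usually the more informative columns for an imbalanced outcome such as diabetes.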

### **Key Findings:**
- The **Voting Classifier** emerged as the best model for the balanced dataset: it is the only model in the table above with a positive R² (0.0512), with an accuracy of 0.81 and an F1-score of 0.88.
- For imbalanced data, the **Stacking Classifier** performed best by leveraging the diversity of its base models to handle class imbalance effectively.

---

## 🌟 **Conclusion**

This project highlights the power of machine learning combined with epidemiological data to predict diabetes:
- **On imbalanced data**: The **Stacking Classifier** excels in balancing precision and recall for underrepresented cases.
- **On balanced data**: The **Voting Classifier** produces the most reliable predictions by combining the votes of its base models.

### **Future Directions:**
- Explore additional health indicators or external datasets.
- Tune hyperparameters further to improve model performance.
- Deploy models in real-world settings to support public health initiatives.

---

## 🛠️ **Technical Stack**
- **Programming Language**: Python (`pandas`, `scikit-learn`, `lightgbm`, `xgboost`, `imbalanced-learn`)
- **Visualization**: `matplotlib`, `seaborn`
- **Models**: Ensemble classifiers, boosting methods, logistic regression

---

## 🎯 **How to Use This Project**
1. **Setup**: Clone this repository and install the dependencies listed in `requirements.txt`.  
2. **Run the Notebook**: Explore the step-by-step analysis in `Diabetes_Prediction.ipynb`.  
3. **Visualize Results**: Use the charts and metrics in the notebook to understand model performance.

---

## 🎉 **Acknowledgements**
- **BRFSS Dataset**: Provided by the CDC, offering invaluable insights into health indicators.  
- **IDF Insights**: Guidance from the International Diabetes Federation (IDF) on the global diabetes burden.

---
