# Hyperparameter Optimization Project

## Overview
This project applies hyperparameter optimization to improve the performance of machine learning models. Using a diabetes prediction dataset, we tune Random Forest and XGBoost classifiers with grid search and compare the results. We also use Explainable AI (XAI) tools, LIME and SHAP, to interpret model predictions and make the results easier to inspect and validate.

---

## Dataset Description

### Source
The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It aims to predict whether a patient has diabetes based on several diagnostic metrics.

### Characteristics
- **Number of Instances**: 768
- **Features**: 8 independent variables and 1 target variable
- **Target Variable**: `Outcome` (1 for diabetic, 0 for non-diabetic)
- **Key Features**:
  - Glucose
  - Blood Pressure
  - BMI (Body Mass Index)
  - Age
  - Diabetes Pedigree Function
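- **Data Quality**: The dataset contains no explicit missing (NaN) values. Note that in the commonly distributed version of this dataset, zeros in columns such as `Glucose`, `BloodPressure`, and `BMI` are physiologically implausible and are usually treated as unrecorded measurements.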
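
A quick way to check data quality is to count NaN values and exact zeros per column. The sketch below assumes the column names used in the commonly distributed CSV (`Insulin` in particular is not mentioned elsewhere in this README), and the helper name `summarize_quality` is hypothetical.

```python
import pandas as pd

# Columns where a value of 0 is not physiologically plausible (assumed list).
ZERO_SUSPECT_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column counts of NaN values and of exact zeros."""
    cols = [c for c in ZERO_SUSPECT_COLUMNS if c in df.columns]
    return pd.DataFrame({
        "n_missing": df[cols].isna().sum(),
        "n_zero": (df[cols] == 0).sum(),
    })


# Typical use, once the file is in the project directory:
# df = pd.read_csv("diabetes.csv")
# print(summarize_quality(df))
```

Columns with a non-zero `n_zero` count may need imputation or filtering before tuning, depending on how the zeros are interpreted.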

---

## Workflow

### 1. Data Preprocessing
- The data is split into training and testing sets using an 80-20 split.
- Independent variables (`X`) and the target variable (`y`) are separated for analysis.
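
The preprocessing step can be sketched as follows. This is a minimal example, not the project's exact code: the stratified split and the fixed `seed` are assumptions added for reproducibility, and `split_features_target` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df, target="Outcome", test_size=0.2, seed=42):
    """Separate X and y, then make an 80-20 train/test split."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class ratio similar in both splits (assumption).
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Typical use:
# df = pd.read_csv("diabetes.csv")
# X_train, X_test, y_train, y_test = split_features_target(df)
```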

### 2. Hyperparameter Optimization

#### Random Forest
- **Parameters Tuned**:
  - Number of estimators (`n_estimators`): [100, 200, 300]
  - Maximum depth of trees (`max_depth`): [None, 10, 20]
  - Minimum samples required to split a node (`min_samples_split`): [2, 5, 10]
- **Method**: Grid Search with Cross-Validation (CV).
- **Optimal Parameters**:
  - `n_estimators`: 200
  - `max_depth`: None
  - `min_samples_split`: 10
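
A grid search over the Random Forest parameters listed above might look like the sketch below. The number of folds (`cv=5`), the `scoring` metric, and the random seed are assumptions, since the README does not state them; `tune_random_forest` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search space taken from the list above.
RF_PARAM_GRID = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}


def tune_random_forest(X, y, param_grid=None, cv=5, scoring="accuracy", seed=42):
    """Run an exhaustive grid search with cross-validation and return the fitted search."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid or RF_PARAM_GRID,
        cv=cv,
        scoring=scoring,
    )
    search.fit(X, y)
    return search


# Smoke test on synthetic data with a reduced grid so it runs quickly.
# The real run would pass the training split and use RF_PARAM_GRID.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo = tune_random_forest(
    X_demo, y_demo,
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 3]},
    cv=3,
)
print(demo.best_params_)
```

After fitting on the real training data, `search.best_params_` holds the selected combination and `search.best_estimator_` is the refitted model.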

#### XGBoost
- **Parameters Tuned**:
  - Number of estimators (`n_estimators`): [100, 200, 300]
  - Maximum depth of trees (`max_depth`): [3, 6, 9]
  - Learning rate (`learning_rate`): [0.01, 0.1, 0.2]
- **Method**: Grid Search with Cross-Validation (CV).
- **Optimal Parameters**:
  - `n_estimators`: 200
  - `max_depth`: 3
  - `learning_rate`: 0.01
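
The XGBoost search follows the same pattern. The sketch below is illustrative only: `cv=5`, `scoring`, `eval_metric`, and the seed are assumptions, and `tune_xgboost` is a hypothetical helper name. It also shows how the cost of the search grows with the grid.

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Search space taken from the list above.
XGB_PARAM_GRID = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.2],
}


def tune_xgboost(X, y, param_grid=None, cv=5, scoring="accuracy", seed=42):
    """Grid-search an XGBClassifier and return the fitted search object."""
    # Imported here so the grid can be inspected without xgboost installed.
    from xgboost import XGBClassifier

    model = XGBClassifier(eval_metric="logloss", random_state=seed)
    search = GridSearchCV(model, param_grid or XGB_PARAM_GRID, cv=cv, scoring=scoring)
    search.fit(X, y)
    return search


# Each combination is trained once per fold, so the cost is
# (number of combinations) x (number of folds) model fits.
n_candidates = len(ParameterGrid(XGB_PARAM_GRID))
print(n_candidates, "candidates,", n_candidates * 5, "fits with 5 folds")
```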

### 3. Model Evaluation
Performance metrics were calculated for both models:

| Metric          | Random Forest | XGBoost   |
|------------------|---------------|-----------|
| Accuracy         | 0.7489        | 0.7576    |
| Precision        | 0.6310        | 0.6667    |
| Recall           | 0.6625        | 0.6000    |
| F1 Score         | 0.6463        | 0.6316    |
| ROC-AUC          | 0.8148        | 0.8063    |

**Observations**:
- **Random Forest**: Higher recall (0.6625 vs. 0.6000), F1 score, and ROC-AUC, so it misses fewer positive cases.
- **XGBoost**: Higher accuracy and precision (0.6667 vs. 0.6310), so fewer of its positive predictions are false positives.
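
The five metrics in the table can be computed with `sklearn.metrics`. The sketch below uses a tiny hand-made example rather than the project data; `evaluate` is a hypothetical helper name.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_pred, y_score):
    """Compute the table's metrics from labels, hard predictions, and scores."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
        # ROC-AUC needs the predicted probability of the positive class,
        # not the hard 0/1 predictions.
        "ROC-AUC": roc_auc_score(y_true, y_score),
    }


# Toy example (not project data).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_score = [0.1, 0.6, 0.7, 0.9]
results = evaluate(y_true, y_pred, y_score)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

With a fitted classifier, `y_pred` would come from `model.predict(X_test)` and `y_score` from `model.predict_proba(X_test)[:, 1]`.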

### 4. Explainable AI
#### LIME (Local Interpretable Model-Agnostic Explanations)
- **Findings**:
  - LIME was used to explain individual predictions by analyzing the contribution of each feature to the model's output.
  - For a sample prediction, features like `Glucose`, `BMI`, and `Age` had the highest impact.
    - **Example**: A sample with a glucose level of 98 mg/dL and an elevated BMI (34 kg/m²) was classified as diabetic, with these two features contributing most to the predicted probability.
    - LIME highlighted that features like `SkinThickness` and `BloodPressure` had less influence on certain predictions.
- **Explanation**:
  - LIME provides an intuitive breakdown of how features individually contribute to the outcome, making it easier to trust and validate the model in sensitive applications like healthcare.
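
To make the idea behind LIME concrete, here is a simplified, hand-rolled local surrogate: perturb the instance, weight the perturbed points by proximity, and fit a weighted linear model whose coefficients act as local feature contributions. It does not use the `lime` package and is not the project's code; the kernel width, sample count, and the toy black-box function are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge


def local_surrogate(predict_proba, x, scale, n_samples=2000, seed=0):
    """Fit a weighted linear model around x and return its coefficients.

    predict_proba: callable mapping an (n, d) array to (n, 2) class probabilities.
    scale: per-feature standard deviations used to perturb and standardize.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    scale = np.asarray(scale, dtype=float)

    # 1. Sample perturbed points around the instance being explained.
    offsets = rng.normal(size=(n_samples, x.size))  # standardized units
    Z = x + offsets * scale

    # 2. Weight each sample by its proximity to x (exponential kernel).
    kernel_width = 0.75 * np.sqrt(x.size)  # assumed width, not tuned
    weights = np.exp(-np.sum(offsets ** 2, axis=1) / kernel_width ** 2)

    # 3. Fit an interpretable linear model to the black-box probabilities.
    target = predict_proba(Z)[:, 1]
    surrogate = Ridge(alpha=1.0).fit(offsets, target, sample_weight=weights)
    return surrogate.coef_


def toy_predict_proba(Z):
    """Hypothetical black box: feature 0 matters most, feature 2 not at all."""
    p = 1.0 / (1.0 + np.exp(-(3.0 * Z[:, 0] + 0.5 * Z[:, 1])))
    return np.column_stack([1.0 - p, p])


coef = local_surrogate(toy_predict_proba, x=[0.0, 0.0, 0.0], scale=[1.0, 1.0, 1.0])
for name, c in zip(["f0", "f1", "f2"], coef):
    print(f"{name}: {c:+.3f}")
```

For a real model, `predict_proba` would be the fitted classifier's method, `x` a row of the test set, and `scale` the per-feature standard deviations of the training data.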

#### SHAP (SHapley Additive exPlanations)
- **Findings**:
  - SHAP values revealed global feature importance, showing that `Glucose` was consistently the most significant predictor for diabetes.
  - The summary plot indicated:
    - High glucose levels increased the likelihood of being classified as diabetic.
    - Features like `BMI` and `Age` had moderate but consistent contributions across predictions.
    - Features like `DiabetesPedigreeFunction` provided subtle yet valuable signals for certain predictions.
  - SHAP dependence plots illustrated non-linear relationships between feature values and their contribution to the model's predictions.
- **Explanation**:
  - SHAP enhances global interpretability by showing how much each feature contributes to the model's predictions on average.
  - It provides actionable insights for stakeholders, such as identifying high-risk individuals based on glucose and BMI thresholds.
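
SHAP is built on Shapley values, which split a prediction among the features so that the contributions add up to the difference between the prediction and a baseline. The sketch below computes exact Shapley values by enumerating every feature subset for a toy function. It is an illustration of the principle only: it does not use the `shap` package, it replaces absent features with a single baseline value (a simplification), and the toy model and helper name `exact_shapley` are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(value_fn, x, baseline):
    """Exact Shapley values of value_fn at x, relative to a baseline point."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    n = x.size
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                z = baseline.copy()
                for j in subset:
                    z[j] = x[j]
                without_i = value_fn(z)
                z[i] = x[i]
                with_i = value_fn(z)
                phi[i] += weight * (with_i - without_i)
    return phi


def toy_model(z):
    """Hypothetical model: one additive term plus an interaction."""
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, baseline)
print(phi)        # [2. 3. 3.]
print(phi.sum())  # 8.0, equal to toy_model(x) - toy_model(baseline)
```

The interaction term `z[1] * z[2]` contributes 6 and is shared equally between the two features involved. Enumeration costs on the order of 2^n model evaluations per feature, which is why the `shap` library relies on faster model-specific algorithms for tree ensembles.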

---

## Tools and Libraries

### Python Libraries
- **Data Handling**: `pandas`, `numpy`
- **Modeling**: `scikit-learn`, `xgboost`
- **Visualization**: `matplotlib`
- **Explainability**: `lime`, `shap`

### Environment
- Development was conducted using Jupyter Notebooks and Google Colab.
- Python 3.9 was used for compatibility.

---

## How to Use

1. **Install Dependencies**:
   Run the following command to install required libraries:
   ```bash
   pip install scikit-learn xgboost shap lime pandas numpy matplotlib
   ```

2. **Prepare the Dataset**:
   - Place the dataset (`diabetes.csv`) in the project directory.

3. **Run the Script**:
   - Execute the notebook or Python script to:
     - Perform hyperparameter tuning.
     - Train and evaluate models.
     - Generate explainability insights.

---

## Results and Insights
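- **Best Model**: Neither model dominates. XGBoost achieved slightly higher accuracy (0.7576 vs. 0.7489) and precision, while Random Forest scored higher on recall, F1 score, and ROC-AUC.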
- **Feature Importance**: Glucose, BMI, and Age are the most influential predictors.
- **Use Cases**:
  - **Random Forest**: Suitable for scenarios where recall (minimizing false negatives) is critical, such as medical diagnosis.
  - **XGBoost**: Ideal for cases prioritizing precision to avoid false positives.

---

## Future Enhancements
- Integrate additional models like Support Vector Machines (SVM) or Neural Networks.
- Automate the hyperparameter tuning process using Bayesian Optimization.
- Build an interactive dashboard to visualize model performance and predictions.

---

## Contact
For further inquiries, please reach out to the project contributor.

---
