# Diabetes Prediction Using Decision Tree

## Objective:
Create a decision tree to predict whether a patient has diabetes based on clinical variables.

## Dataset Information:
The dataset contains the following variables:
1. **Pregnancies**: Number of times pregnant (integer)
2. **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)
3. **BloodPressure**: Diastolic blood pressure (mm Hg) (integer)
4. **SkinThickness**: Triceps skin fold thickness (mm) (integer)
5. **Insulin**: 2-Hour serum insulin (mu U/ml) (integer)
6. **BMI**: Body mass index (weight in kg/(height in m)^2) (float)
7. **DiabetesPedigreeFunction**: A function scoring the likelihood of diabetes based on family history (float)
8. **Age**: Age in years (integer)
9. **Outcome**: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

## Steps to Follow:
### Q1. Import the Dataset and Examine the Variables
- Import the dataset using pandas.
- Use descriptive statistics to understand the distribution of each variable.
- Create visualizations to explore the relationships between the variables.

### Q2. Preprocess the Data
- **Handle Missing Values**:
  - Replace zeros in `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` with `NaN`.
  - Fill missing values with appropriate statistics (e.g., median).
- **Remove Outliers**:
  - Use the IQR rule to identify and remove outliers in critical numerical columns.
- **Transform Variables**:
  - Convert categorical variables into dummy variables if required (every variable listed above is numeric, so no encoding is needed for this dataset).

### Q3. Split the Dataset
- Split the dataset into training and test sets using an 80-20 split.
- Use a random seed to ensure reproducibility.

### Q4. Train the Decision Tree
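- Use a decision tree algorithm (e.g., ID3, C4.5, or CART) to train the model on the training set. Note that scikit-learn's `DecisionTreeClassifier` implements an optimized version of CART; setting `criterion='entropy'` gives information-gain-style splits similar in spirit to ID3/C4.5.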
- Perform cross-validation to optimize hyperparameters and prevent overfitting.

### Q5. Evaluate the Model
- Use metrics such as accuracy, precision, recall, and F1 score to evaluate the model on the test set.
- Visualize the results using confusion matrices and ROC curves.

### Q6. Interpret the Decision Tree
- Analyze the tree structure, including splits, branches, and leaves.
- Identify the most important variables and thresholds.
- Use domain knowledge to explain patterns and trends.

### Q7. Validate the Model
- Apply the decision tree model to new data to validate its performance.
- Test the robustness of the model to changes in the dataset or environment using sensitivity analysis and scenario testing.

## Dataset Link:
[Download diabetes.csv dataset](https://drive.google.com/file/d/1Q4J8KS1wm4-_YTuc389enPh6O-eTNcx2/view?usp=sharing)

By following these steps, you will develop a comprehensive understanding of decision tree modeling and its applications in healthcare.

---

## Q1. Importing and Examining the Dataset

### Code:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = "https://drive.google.com/uc?id=1Q4J8KS1wm4-_YTuc389enPh6O-eTNcx2"
data = pd.read_csv(url)

# Overview of the dataset
print(data.head())
data.info()  # prints its own summary; wrapping it in print() would also print "None"
print(data.describe())

# Visualizing distributions
data.hist(figsize=(12, 10), bins=20)
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```
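
Descriptive statistics alone can hide implausible values: a minimum of 0 for a measurement such as blood pressure or BMI usually signals a missing reading rather than a real one. The sketch below counts zero entries per column. The `sample` frame and the `count_zeros` helper are hypothetical stand-ins introduced for illustration (the values are made up); with the real dataset you would pass `data` and the clinical column names instead.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `data`; values are illustrative only.
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 66],
    "SkinThickness": [35, 29, 0, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

def count_zeros(df, columns):
    """Return the number of zero entries in each of the given columns."""
    return (df[columns] == 0).sum()

# A high count suggests that zeros encode missing measurements in that column.
zero_counts = count_zeros(sample, list(sample.columns))
print(zero_counts)
```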
---

## Q2. Preprocessing the Data

### Steps:
1. **Check for Missing Values**:
   - Replace zeros in columns such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` with `NaN`.
   - Handle missing data by filling the missing values with the median of the respective columns.

2. **Remove Outliers**:
   - Use the Interquartile Range (IQR) rule to identify and remove extreme values from the dataset.

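3. **Scale Features**:
   - Standardize the numerical features so they share a common scale (zero mean, unit variance). Decision trees split on per-feature thresholds and are unaffected by this rescaling, so the step mainly matters if the same preprocessed data is reused with scale-sensitive models.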

### Code:
```python
# Replace zeros with NaN for specific columns
columns_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[columns_with_zeros] = data[columns_with_zeros].replace(0, np.nan)

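# Fill missing values with the column median.
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions.
for col in columns_with_zeros:
    data[col] = data[col].fillna(data[col].median())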

# Remove outliers using the IQR method
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply outlier removal to selected columns
for col in ['Glucose', 'BloodPressure', 'BMI', 'Age']:
    data = remove_outliers(data, col)

# Standardize the features (zero mean, unit variance).
# Note: fitting the scaler before the train-test split lets test-set
# statistics influence the transform; trees do not need this step.
from sklearn.preprocessing import StandardScaler
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
scaler = StandardScaler()
data[feature_cols] = scaler.fit_transform(data[feature_cols])
```
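
To make the IQR rule concrete, here is a small worked example on a made-up series; the `iqr_bounds` helper and the numbers are assumptions for illustration, not part of the dataset. With nine values, pandas' default linear interpolation puts Q1 at 14 and Q3 at 22, so IQR = 8 and the fences are 14 - 1.5 * 8 = 2 and 22 + 1.5 * 8 = 34. Also note that the loop in the code above reassigns `data` on each pass, so the quartiles for each later column are computed on rows that survived the earlier filters.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values only: eight ordinary readings and one extreme value.
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 100])

lower, upper = iqr_bounds(values)
kept = values[(values >= lower) & (values <= upper)]

print(f"fences: [{lower}, {upper}]")
print(f"kept {len(kept)} of {len(values)} values")
```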
---

## Q3. Splitting the Dataset

### Code:
```python
from sklearn.model_selection import train_test_split

# Splitting data into features and target
X = data.drop(columns=['Outcome'])
y = data['Outcome']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
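
One optional refinement, not used in the split above, is stratification: passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both subsets, which can help when one outcome is less common. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the 65/35 class balance are assumptions made for the example) to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 35% positive class (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([0] * 65 + [1] * 35, name="Outcome")

# stratify=y_demo preserves the class ratio in both the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```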
---

## Q4. Training a Decision Tree Model

### Code:
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Initialize the model
dt = DecisionTreeClassifier(random_state=42)

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)
```
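
A possible variation on the search above is to move imputation inside a scikit-learn `Pipeline`, so that the median is re-estimated on the training folds of each cross-validation split rather than on the full dataset. This is a minimal sketch under stated assumptions, not the workflow used in this document: the data comes from `make_classification`, about 5% of entries are blanked out to mimic missing readings, and the grid is deliberately small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (8 features, like the diabetes set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Blank out roughly 5% of the entries to mimic missing measurements.
rng = np.random.default_rng(42)
mask = rng.random(X_demo.shape) < 0.05
X_demo[mask] = np.nan

# The imputer is fit only on the training portion of each CV fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"tree__max_depth": [3, 5], "tree__min_samples_leaf": [1, 4]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 3))
```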
---

## Q5. Evaluating the Model

### Code:
```python
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# Predictions
y_pred = best_model.predict(X_test)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
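sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted label")  # columns of the matrix are predictions
plt.ylabel("True label")       # rows of the matrix are actual classes
plt.title("Confusion Matrix")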
plt.show()

# Metrics
print(classification_report(y_test, y_pred))

# ROC Curve
y_pred_prob = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.show()
```
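
The numbers in `classification_report` all derive from the four cells of the confusion matrix. As a worked example on made-up labels (`y_true` and `y_hat` below are illustrative, not model output): with TN = 4, FP = 2, FN = 1, TP = 3, accuracy is (3 + 4) / 10 = 0.70, precision is 3 / (3 + 2) = 0.60, recall is 3 / (3 + 1) = 0.75, and F1 is their harmonic mean, about 0.667.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only: 6 true negatives-class rows, 4 positive-class rows.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```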
---

## Q6. Interpreting the Decision Tree

### Code:
```python
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here.
plot_tree(best_model, feature_names=list(X.columns), class_names=["Non-Diabetic", "Diabetic"], filled=True)
plt.show()
```
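
A plotted tree can be hard to read once it grows beyond a few levels. Two complementary views that may help are the text rules from `export_text` and the impurity-based scores in `feature_importances_`. The sketch below fits a shallow tree on synthetic data with generic column names (`feature_0` to `feature_7` are placeholders, so the output says nothing about diabetes); with the fitted model from Q4 you would pass `best_model` and `list(X.columns)` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder column names (assumed for illustration).
feature_names = [f"feature_{i}" for i in range(8)]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Text view: one line per split threshold and per leaf class.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```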

**Interpretation:**
- Analyze the tree splits, branches, and leaf nodes.
- Identify which features appear closest to the root and drive the most splits (for example, check whether Glucose, BMI, and Age rank among them).
- Discuss threshold values and their relevance to diabetes prediction.
---

## Q7. Validating the Model

### Steps:
1. Test with new unseen data.
2. Use sensitivity analysis by altering features slightly and observing predictions.

### Code:
```python
# Sensitivity Analysis
new_data = X_test.copy()
new_data['Glucose'] += 0.1  # Features were standardized in Q2, so this shifts glucose by 0.1 standard deviations
new_predictions = best_model.predict(new_data)

# Compare results
print("Original Predictions:", y_pred[:10])
print("New Predictions after Glucose adjustment:", new_predictions[:10])
```
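
Comparing the first ten predictions gives only a spot check. A slightly more systematic sketch is to sweep several perturbation sizes and report the fraction of predictions that flip. The `flip_rate` helper, the synthetic data, and the column name `f0` are assumptions made for this example; with the objects defined earlier you would call it with `best_model`, `X_test`, and `'Glucose'`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and model (assumed for illustration).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_demo, y_demo)

def flip_rate(model, X, column, delta):
    """Fraction of rows whose predicted class changes when `column` shifts by `delta`."""
    baseline = model.predict(X)
    shifted = X.copy()
    shifted[column] = shifted[column] + delta
    return float(np.mean(model.predict(shifted) != baseline))

# Larger shifts push more rows across split thresholds that involve the column.
for delta in [0.0, 0.1, 0.5, 1.0]:
    print(f"delta={delta}: {flip_rate(model, X_demo, 'f0', delta):.3f} of predictions changed")
```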
---