# Decision Tree 2: Predicting Diabetes with Clinical Variables
This notebook demonstrates how to use a decision tree to identify patients with diabetes based on clinical variables, following a real-world healthcare scenario.

## Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to understand the distribution and relationships between the variables.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (update the path as needed)
diabetes = pd.read_csv('diabetes.csv')
diabetes.head()

In [None]:
# Descriptive statistics
print(diabetes.describe())

In [None]:
# Visualizations
sns.pairplot(diabetes, hue='Outcome')
plt.show()

plt.figure(figsize=(10,6))
sns.heatmap(diabetes.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

## Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical variables into dummy variables if necessary.

In [None]:
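import numpy as np

# Zeros in these columns are not physiologically plausible, so treat them as missing.
# np.nan (rather than pd.NA) keeps the columns numeric so median() behaves as expected.
cols_with_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[cols_with_zero] = diabetes[cols_with_zero].replace(0, np.nan)

# Check for missing values after recoding, so the implausible zeros are counted
print(diabetes.isnull().sum())

# Impute missing values with each column's median
# (assign the result back; chained fillna(..., inplace=True) is unreliable in recent pandas)
for col in cols_with_zero:
    diabetes[col] = diabetes[col].fillna(diabetes[col].median())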

# Remove outliers using IQR method (example for BMI)
Q1 = diabetes['BMI'].quantile(0.25)
Q3 = diabetes['BMI'].quantile(0.75)
IQR = Q3 - Q1
mask = (diabetes['BMI'] >= Q1 - 1.5 * IQR) & (diabetes['BMI'] <= Q3 + 1.5 * IQR)
diabetes = diabetes[mask]
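
To make the IQR rule above concrete, here is a minimal standard-library sketch on a handful of made-up BMI values. The numbers and the names `bmi_values`, `lower`, `upper`, and `kept` are illustrative assumptions, not taken from the dataset. `statistics.quantiles` with `method='inclusive'` interpolates linearly between data points, as the pandas default does, so the fences should match what `quantile(0.25)` and `quantile(0.75)` give on the same values.

```python
import statistics

# Made-up BMI values for illustration only; 67.1 is deliberately extreme
bmi_values = [18.5, 22.0, 24.3, 27.1, 29.8, 31.2, 33.6, 35.0, 67.1]

# Quartiles with linear interpolation between data points
q1, _median, q3 = statistics.quantiles(bmi_values, n=4, method='inclusive')
iqr = q3 - q1

# Tukey fences: values outside [lower, upper] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
kept = [v for v in bmi_values if lower <= v <= upper]
dropped = [v for v in bmi_values if not (lower <= v <= upper)]

print(f'Q1={q1:.1f}, Q3={q3:.1f}, fences=({lower:.2f}, {upper:.2f})')
print('Dropped:', dropped)
```

The notebook applies the rule to BMI only; the same fences could be computed per column if other variables also need trimming.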

## Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']
# stratify=y keeps the diabetic / non-diabetic ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

## Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use cross-validation to optimize the hyperparameters and avoid overfitting.

In [None]:
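from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Note: scikit-learn's DecisionTreeClassifier implements an optimized version of CART,
# not ID3 or C4.5; criterion='entropy' selects splits by information gain, as ID3 does.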

param_grid = {
    'max_depth': [3, 5, 7, 9, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
model = grid.best_estimator_
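
The `criterion` values in the grid, `'gini'` and `'entropy'`, are two ways of scoring a candidate split. As a rough illustration of what each measures, the sketch below computes both by hand for one hypothetical split. The helper names (`gini`, `entropy`, `impurity_decrease`) and the label lists are assumptions made up for this example, not part of scikit-learn.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits over the classes present in the node."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node: 5 non-diabetic (0) and 5 diabetic (1) patients,
# divided into two children by some threshold
parent = [0] * 5 + [1] * 5
left = [0, 0, 0, 0, 1]
right = [0, 1, 1, 1, 1]

gini_gain = impurity_decrease(parent, left, right, gini)
info_gain = impurity_decrease(parent, left, right, entropy)
print('Gini decrease:', round(gini_gain, 3))
print('Entropy decrease (information gain):', round(info_gain, 3))
```

Both measures reward splits that make the children purer than the parent; in practice they often pick similar splits, which is why the grid search treats the choice as a tunable hyperparameter.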

## Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy, precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, roc_auc_score,
)

# Predictions
y_pred = model.predict(X_test)

# Metrics
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# ROC curve
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
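
To show how the four printed metrics relate to the confusion matrix, here is a small dependency-free sketch that derives them from the four cell counts. The function name `binary_metrics` and the counts in `example_cm` are hypothetical, chosen only for illustration; they are not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts, laid out like scikit-learn's binary confusion matrix:
# [[tn, fp],
#  [fn, tp]]
example_cm = [[50, 10], [5, 35]]
(tn, fp), (fn, tp) = example_cm

metrics = binary_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

With the notebook's own matrix, `tn, fp, fn, tp = cm.ravel()` yields the counts in this order. In a screening setting, recall (the share of diabetic patients the model catches) is often the metric to watch most closely.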

## Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important variables and their thresholds. Use domain knowledge and common sense to explain the patterns and trends.

In [None]:
from sklearn.tree import plot_tree

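plt.figure(figsize=(20,10))
# feature_names is passed as a plain list; max_depth=3 only limits what is drawn, not the fitted tree
plot_tree(model, feature_names=list(X.columns), class_names=['Non-Diabetic', 'Diabetic'], filled=True, max_depth=3)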
plt.show()

# Feature importance
importances = pd.Series(model.feature_importances_, index=X.columns)
print('Feature importances:\n', importances.sort_values(ascending=False))
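
When the plotted tree is hard to read, the split thresholds can also be printed as text rules. The sketch below uses scikit-learn's `export_text` on a small synthetic dataset so that it runs on its own; `X_demo`, `y_demo`, `demo_tree`, and the feature names `f0` to `f3` are placeholders invented for this example, not the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ['f0', 'f1', 'f2', 'f3']

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Each line shows a feature, its threshold, and the predicted class at the leaves
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Applied to this notebook, the equivalent call would be `export_text(model, feature_names=list(X.columns))`, which lists thresholds such as the glucose cut-off at each split.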

## Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and risks.

In [None]:
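# Sensitivity analysis: vary glucose for one test patient, holding the other features fixed
import numpy as np
glucose_range = np.arange(80, 200, 5)
# Repeat the first test row once per glucose value; a DataFrame keeps the
# feature names the model was fitted with and allows a single predict call
scenarios = pd.concat([X_test.iloc[[0]]] * len(glucose_range), ignore_index=True)
scenarios['Glucose'] = glucose_range.astype(float)
preds = model.predict(scenarios)
# Predictions are class labels (0 or 1), so draw them as a step function
plt.step(glucose_range, preds, where='post', marker='o')
plt.xlabel('Glucose')
plt.ylabel('Predicted Outcome (0 = Non-Diabetic, 1 = Diabetic)')
plt.yticks([0, 1])
plt.title('Sensitivity Analysis: Glucose Effect')
plt.show()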
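
Another simple robustness check is to repeat the train/test split with different random seeds and see how much the test accuracy moves. The sketch below does this on synthetic data so it is self-contained; `X_demo`, `y_demo`, the fixed `max_depth=5`, and the choice of 10 seeds are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

scores = []
for seed in range(10):
    # A different seed gives a different train/test partition each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

scores = np.array(scores)
print(f'Test accuracy over {len(scores)} splits: mean={scores.mean():.3f}, std={scores.std():.3f}')
```

A large spread across seeds would suggest that the single split used earlier gives an unstable estimate, which is a common concern with decision trees on small clinical datasets.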

This notebook demonstrates the end-to-end process of building, evaluating, and interpreting a decision tree for diabetes prediction in a healthcare setting.