**You are a data scientist working for a healthcare company, and you have been tasked with creating a
decision tree to help identify patients with diabetes based on a set of clinical variables. You have been
given a dataset (diabetes.csv) with the following variables:**

1. Pregnancies: Number of times pregnant (integer)
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (integer)
3. BloodPressure: Diastolic blood pressure (mm Hg) (integer)
4. SkinThickness: Triceps skin fold thickness (mm) (integer)
5. Insulin: 2-Hour serum insulin (mu U/ml) (integer)
6. BMI: Body mass index (weight in kg/(height in m)^2) (float)
7. DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes
based on family history) (float)
8. Age: Age in years (integer)
9. Outcome: Class variable (0 if non-diabetic, 1 if diabetic) (integer)

Here’s the dataset link: https://drive.google.com/file/d/1Q4J8KS1wm4-_YTuc389enPh6O-eTNcx2/view?usp=sharing

Your goal is to create a decision tree to predict whether a patient has diabetes based on the other
variables. Here are the steps you can follow:


**Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to
understand the distribution and relationships between the variables.**

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [6]:
# Load the dataset
file_url = "https://drive.google.com/uc?id=1Q4J8KS1wm4-_YTuc389enPh6O-eTNcx2"
df = pd.read_csv(file_url)

In [7]:
# Display the first few rows and check the structure
print(df.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [8]:
# Summary statistics
print(df.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

In [8]:
# Check for missing values
print(df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [None]:
# Visualize pairwise relationships and per-variable distributions, colored by outcome
# (pairplot creates its own figure, so no plt.figure call is needed)
sns.pairplot(df, hue='Outcome', diag_kind='hist')
plt.show()

**Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical
variables into dummy variables if necessary.**

**ANSWER:--------**


To preprocess the data for building a decision tree model, we'll handle missing values, outliers, and transform categorical variables if needed. Here's how we can proceed step by step in Python:

### Step 1: Handle Missing Values

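Let's first check whether the dataset has missing values and decide how to handle them. For numerical variables, a common strategy is to impute with the mean or median. Note that `isnull()` only detects NaN entries: the summary statistics show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and a zero is not physiologically plausible for these measurements, so those zeros most likely stand for values that were never recorded.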



If there are missing values, we can handle them using methods like `SimpleImputer` from scikit-learn.
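A minimal sketch of median imputation with `SimpleImputer`, assuming that a zero in columns such as Glucose or BMI marks a missing measurement. The `toy` frame and the `zero_as_missing` list are hypothetical stand-ins, not part of the original notebook; a zero in Pregnancies is a valid value and is left alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for a few diabetes.csv columns
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Pregnancies": [6, 1, 8, 0],  # zero is a legitimate count here
})

# Assumption: in these columns a zero means "not recorded"
zero_as_missing = ["Glucose", "BMI"]
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# Replace each NaN with the median of the observed values in that column
imputer = SimpleImputer(strategy="median")
toy[zero_as_missing] = imputer.fit_transform(toy[zero_as_missing])

print(toy)
```

In a full workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that test-set values do not leak into the imputed medians.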

### Step 2: Remove Outliers

For outliers, we can use statistical methods or visualization to identify and remove them. Here, we'll use a simple approach based on the interquartile range (IQR).


### Step 3: Transform Categorical Variables

In this dataset, there are no categorical variables that need transformation into dummy variables, as all variables are either numeric or binary (like `Outcome`).



In [8]:
# Check for missing values
print(df.isnull().sum())


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [9]:
# Function to remove outliers using IQR
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]


In [10]:
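# Remove outliers for numeric columns (if necessary).
# Each pass filters the already-filtered frame, so the removals accumulate across columns.
# Note: the cells below continue to use df (all 768 rows), as their printed shapes show.
numeric_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
diabetes_data = df.copy()
for col in numeric_columns:
    diabetes_data = remove_outliers(diabetes_data, col)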

**Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.**

In [11]:
from sklearn.model_selection import train_test_split

# Define the features and target variable
X = df.drop(columns='Outcome')
y = df['Outcome']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print(f"Training set shape: X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Test set shape: X_test: {X_test.shape}, y_test: {y_test.shape}")


Training set shape: X_train: (614, 8), y_train: (614,)
Test set shape: X_test: (154, 8), y_test: (154,)
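The summary statistics give a mean `Outcome` of about 0.35, so the classes are imbalanced. One option, not used in the split above, is to pass `stratify=` so both splits keep the same class ratio. The sketch below shows this on hypothetical toy data with a 30% positive rate; the names `toy`, `X_toy`, and `y_toy` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 positives and 70 negatives
toy = pd.DataFrame({"feature": range(100), "Outcome": [1] * 30 + [0] * 70})
X_toy = toy.drop(columns="Outcome")
y_toy = toy["Outcome"]

# stratify=y_toy keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```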


**Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use
cross-validation to optimize the hyperparameters and avoid overfitting.**

**ANSWER:------**


To train a decision tree model and use cross-validation to optimize the hyperparameters, we'll use scikit-learn's `DecisionTreeClassifier` and `GridSearchCV`. Scikit-learn does not implement ID3 or C4.5; its `DecisionTreeClassifier` uses an optimized version of CART (Classification and Regression Trees), which grows a tree by greedy recursive splitting much as those algorithms do, but always makes binary splits and defaults to Gini impurity as the split criterion.
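ID3 chooses splits by information gain, and C4.5 by the related gain ratio. A rough way to approximate that behaviour in scikit-learn is to set `criterion="entropy"`. The tree is still a CART-style binary tree, so this is an approximation rather than a true ID3 or C4.5 implementation. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Default CART criterion (Gini impurity) versus entropy-based splitting
gini_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)

gini_tree.fit(X_demo, y_demo)
entropy_tree.fit(X_demo, y_demo)

print("Gini tree leaves:", gini_tree.get_n_leaves())
print("Entropy tree leaves:", entropy_tree.get_n_leaves())
```

To let cross-validation choose between the two, `'criterion': ['gini', 'entropy']` could be added to the parameter grid.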

### Step-by-Step Approach

1. **Import necessary libraries**:
   We'll need `DecisionTreeClassifier` for the model and `GridSearchCV` for hyperparameter tuning.

2. **Define the parameter grid**:
   We'll define a grid of hyperparameters to search over, such as `max_depth`, `min_samples_split`, and `min_samples_leaf`.

3. **Perform cross-validation**:
   Use `GridSearchCV` to find the best hyperparameters and avoid overfitting.

4. **Train the final model**:
   Use the best parameters found during cross-validation to train the final decision tree model.


### Explanation

- **Parameter Grid**: We define a range of values for `max_depth`, `min_samples_split`, and `min_samples_leaf` to search for the best combination.
- **GridSearchCV**: This performs an exhaustive search over the parameter grid using 5-fold cross-validation.
- **Best Hyperparameters**: `grid_search.best_params_` will give the best combination of parameters found.
- **Best Estimator**: `grid_search.best_estimator_` will give the decision tree model trained with the best parameters.
- **Model Evaluation**: We use the test set to evaluate the model's performance with metrics like accuracy, classification report, and confusion matrix.

This process tunes the tree's complexity against held-out folds, which makes the final model less likely to overfit the training data.
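To see why limiting tree size helps, it can be useful to compare training and validation scores directly. The sketch below does this with `cross_validate` on synthetic data; the data, the two depth settings, and the `results` dictionary are assumptions for illustration. A large gap between the two scores suggests overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(tree, X_demo, y_demo, cv=5, return_train_score=True)
    results[depth] = (scores["train_score"].mean(), scores["test_score"].mean())
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"validation={results[depth][1]:.3f}")
```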

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10]
}

# Initialize a DecisionTreeClassifier
dtree = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dtree, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Display the best hyperparameters found by GridSearchCV
print("Best Hyperparameters:", grid_search.best_params_)

# Train the final model with the best hyperparameters
best_dtree = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_dtree.predict(X_test)

# Evaluate the model
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Fitting 5 folds for each of 60 candidates, totalling 300 fits


**Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy,
precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.**

**ANSWER:---------**


To evaluate the performance of the decision tree model on the test set, we'll compute accuracy, precision, recall, and F1 score, and visualize the results with a confusion matrix and an ROC curve.

### Step-by-Step Approach

1. **Calculate performance metrics**:
   We'll use functions from `sklearn.metrics` to calculate accuracy, precision, recall, and F1 score.

2. **Generate a confusion matrix**:
   This will show the number of true positives, true negatives, false positives, and false negatives.

3. **Plot the ROC curve**:
   The ROC curve and the Area Under the Curve (AUC) will help us understand the model's performance across different thresholds.


### Explanation

- **Performance Metrics**: We calculate and print accuracy, precision, recall, and F1 score to evaluate the model's performance.
- **Confusion Matrix**: We generate a heatmap of the confusion matrix to visualize true positives, true negatives, false positives, and false negatives.
- **ROC Curve**: We plot the ROC curve and calculate the AUC to understand the model's performance across different classification thresholds.

Taken together, these metrics and plots give a clear picture of the model's performance on the test set.
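As a worked example of how these metrics relate to the confusion matrix, the sketch below computes them by hand from the four cell counts and checks the results against `sklearn.metrics`. The label arrays `y_true` and `y_hat` are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 6 actual negatives followed by 4 actual positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values agree with scikit-learn's functions
assert np.isclose(accuracy, accuracy_score(y_true, y_hat))
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
```

In a screening setting, recall on the diabetic class often matters more than raw accuracy, because a false negative is a missed diagnosis.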

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns

# Predictions on the test set
y_pred = best_dtree.predict(X_test)
y_pred_proba = best_dtree.predict_proba(X_test)[:, 1]

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Diabetic', 'Diabetic'], yticklabels=['Non-Diabetic', 'Diabetic'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()


**Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important
variables and their thresholds. Use domain knowledge and common sense to explain the patterns and
trends.**

**ANSWER:-------**


To interpret the decision tree, we'll visualize the tree structure and extract feature importances to understand which variables are most influential in predicting diabetes. We can use scikit-learn's `plot_tree` function and the fitted model's `feature_importances_` attribute for this purpose.

### Step-by-Step Approach

1. **Visualize the Decision Tree**:
   We'll use `plot_tree` from scikit-learn to visualize the tree structure, including splits, branches, and leaves.

2. **Extract Feature Importances**:
   We'll extract and display the importance of each feature to understand their influence on the decision-making process.

3. **Interpret the Tree**:
   Using domain knowledge, we'll interpret the splits and identify the most important variables and their thresholds.



### Explanation of Splits and Feature Importances

1. **Visualizing the Tree**:
   The `plot_tree` function visualizes the entire decision tree. Each node in the tree will display:
   - The feature used for the split.
   - The threshold value.
   - The Gini impurity of the node.
   - The number of samples at the node.
   - The class distribution at the node.

2. **Feature Importances**:
   The `feature_importances_` attribute provides the importance of each feature in making predictions. Higher values indicate greater importance.


#### Interpretation of the Decision Tree and Feature Importances:

The ranking below is the expected pattern based on known risk factors; confirm it against the printed `feature_importances` table, since the exact order depends on the tuned hyperparameters.

- **Glucose**: Expected to be the most important feature, since plasma glucose concentration is highly predictive of diabetes. High glucose levels are a well-known indicator of the disease.
- **BMI**: Likely among the top features. High BMI is associated with obesity, which is a major risk factor for diabetes.
- **Age**: Older age is a significant risk factor for diabetes, as the risk increases with age.
- **DiabetesPedigreeFunction**: This scores the likelihood of diabetes based on family history. A higher value indicates a stronger family history, which can be an important predictor.
- **Insulin**: Abnormal insulin levels can be an indicator of diabetes.
- **Pregnancies**: More pregnancies might indicate higher risk, possibly due to gestational diabetes.
- **SkinThickness** and **BloodPressure**: These features tend to be less important but may still contribute to the overall prediction.

By examining the splits, we can identify the thresholds that separate diabetic from non-diabetic patients. For example, a split on Glucose > 130 might indicate a high likelihood of diabetes, which aligns with clinical knowledge.
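Thresholds can also be read as text rather than from the plot. The sketch below prints a tree's rules with `export_text` and reads the root split from the fitted `tree_` attribute. It uses synthetic data and generic feature names (`feature_0` ... `feature_7`), both of which are assumptions; in this notebook one would pass `best_dtree` and `list(X.columns)` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the names and thresholds carry no clinical meaning
feature_names = [f"feature_{i}" for i in range(8)]
X_arr, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Human-readable rules: one line per split or leaf
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Node 0 is the root; tree_.feature holds column indices, tree_.threshold the cut points
root_feature = feature_names[tree.tree_.feature[0]]
root_threshold = tree.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.3f}")
```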

### Conclusion

The decision tree highlights the importance of glucose levels, BMI, and age in predicting diabetes. These variables align with known risk factors, and the tree's splits can be interpreted using clinical knowledge to explain the model's predictions.


In [None]:
from sklearn.tree import plot_tree

# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(best_dtree, feature_names=X.columns, class_names=['Non-Diabetic', 'Diabetic'], filled=True)
plt.title("Decision Tree")
plt.show()

# Extract feature importances
feature_importances = pd.DataFrame(best_dtree.feature_importances_, index=X.columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
print(feature_importances)


**Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the
dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and
risks.**

By following these steps, you can develop a comprehensive understanding of decision tree modeling and
its applications to real-world healthcare problems. Good luck!

**ANSWER:------**


To validate the decision tree model, we need to assess its performance on new data and test its robustness to changes in the dataset or environment. Sensitivity analysis and scenario testing can help us understand the model's uncertainty and risks.

### Step-by-Step Approach

1. **Apply the Model to New Data**:
   Test the model on a separate validation set or new data to see how well it generalizes.

2. **Perform Sensitivity Analysis**:
   Analyze how changes in input features affect the model's predictions.

3. **Conduct Scenario Testing**:
   Create different scenarios to test the model's performance under various conditions.



### Explanation

- **New Data Testing**: This step checks how well the model generalizes to unseen data.
- **Sensitivity Analysis**: By varying each feature and observing the changes in predictions, we understand which features the model is most sensitive to.
- **Scenario Testing**: Creating extreme or specific scenarios helps assess the model's robustness and identify potential weaknesses.

### Conclusion

These validation steps provide a comprehensive understanding of the decision tree model's robustness and reliability. They help identify potential risks and uncertainties in real-world applications, ensuring that the model performs well under various conditions.
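One simple way to probe robustness to changes in the dataset is to repeat the train/test split with different random seeds and look at the spread of the scores. The sketch below does this on synthetic data with a fixed `max_depth=3`; the data, the depth, and the number of repetitions are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

accuracies = []
for seed in range(10):
    # A different seed produces a different train/test partition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, tree.predict(X_te)))

print(f"Accuracy over 10 splits: mean={np.mean(accuracies):.3f}, "
      f"std={np.std(accuracies):.3f}")
```

A large standard deviation would suggest that the reported test accuracy depends heavily on which rows happened to land in the test set.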



In [None]:
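# X_new and y_new are placeholders for a newly collected or held-out dataset with the
# same eight feature columns as X; they are not defined anywhere in this notebook.
# Predictions on the new data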
y_new_pred = best_dtree.predict(X_new)

# Evaluate the model on new data
new_accuracy = accuracy_score(y_new, y_new_pred)
new_precision = precision_score(y_new, y_new_pred)
new_recall = recall_score(y_new, y_new_pred)
new_f1 = f1_score(y_new, y_new_pred)

print(f"Accuracy on new data: {new_accuracy:.2f}")
print(f"Precision on new data: {new_precision:.2f}")
print(f"Recall on new data: {new_recall:.2f}")
print(f"F1 Score on new data: {new_f1:.2f}")


In [None]:
import numpy as np

# Function to perform sensitivity analysis: vary one feature, hold the rest at their means
def sensitivity_analysis(model, X, feature_name, feature_range):
    # One-row DataFrame of column means; keeping the column names matches how the model was fitted
    base_sample = X.mean().to_frame().T
    sensitivities = []

    for value in feature_range:
        sample = base_sample.copy()
        sample[feature_name] = value
        prediction = model.predict(sample)[0]
        sensitivities.append((value, prediction))

    return sensitivities

# Sensitivity analysis for 'Glucose' over the range observed in the training data
glucose_range = np.linspace(X_train['Glucose'].min(), X_train['Glucose'].max(), 100)
glucose_sensitivities = sensitivity_analysis(best_dtree, X_train, 'Glucose', glucose_range)

# Plotting the sensitivity analysis
glucose_values, predictions = zip(*glucose_sensitivities)
plt.plot(glucose_values, predictions, marker='o')
plt.xlabel('Glucose')
plt.ylabel('Prediction (0 = Non-Diabetic, 1 = Diabetic)')
plt.title('Sensitivity Analysis for Glucose')
plt.show()


In [None]:
# Scenario testing by creating extreme conditions
extreme_conditions = pd.DataFrame({
    'Pregnancies': [10, 0],
    'Glucose': [200, 50],
    'BloodPressure': [100, 60],
    'SkinThickness': [50, 10],
    'Insulin': [300, 20],
    'BMI': [50.0, 18.5],
    'DiabetesPedigreeFunction': [2.5, 0.1],
    'Age': [70, 20]
})

# Predictions under extreme conditions
extreme_predictions = best_dtree.predict(extreme_conditions)

print("Extreme Conditions Predictions:")
print(extreme_conditions)
print("Predictions:", extreme_predictions)
