<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/1_2_decision_tree_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification and Regression Trees (CART)
Classification and Regression Trees (CART) can be translated into a graph or a set of rules for predictive classification. They help when logistic regression cannot provide decision boundaries flexible enough to predict the label. Decision tree models are also more interpretable because they mimic the human decision-making process, and they can capture non-linear relationships, which allows for more complex models. CART tries to split the data into subsets so that each subset is as pure (homogeneous) as possible.
> A pure node is one in which all samples belong to the same class, so it results in a perfect prediction for those samples.

In this exercise we are using the **Pima Indians Diabetes Dataset** which is applicable in the field of medical sciences. The objective of the dataset is to diagnostically predict `whether or not a patient has diabetes`, based on certain diagnostic measurements included in the dataset.

> To read more about this data check [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

<img src="https://www.researchgate.net/publication/328766758/figure/fig2/AS:689950496403459@1541508425307/Decision-tree-structure-by-using-all-features-and-Pima-Indians-dataset-From-this.png">

Let's try to read the data and check out what the first few rows of this dataset look like:

In [None]:
!pip install opendatasets

In [None]:
import opendatasets as od
# Download the kaggle dataset from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
# Note: use od.download and pass the url
<your-code-here>

In [None]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline


In [None]:
# Read the diabetes data from /content/pima-indians-diabetes-database/diabetes.csv
# Note: use pd.read_csv
df = <your-code-here>
# Visualize few lines of the data
<your-code-here>

After loading the data, we have a first impression of its structure and variables. Next, let's determine the target and feature variables (the dependent and independent variables, respectively).



In [None]:
# Split the dataset into features and target variable

# Select features (X) - Include all columns except the target variable "Outcome"
X = <your-code-here>

# Select the target variable or label (y)
y = <your-code-here>


As part of any Machine Learning process we need to make sure that the data is clean before applying any model.
> Important Note: Decision Trees (DT) can handle both categorical and numeric variables. But since we are using scikit-learn, we need to convert categorical values into numerical ones.

Luckily, we don't have any categorical values here.

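If a dataset did contain a categorical column, one common way to convert it is one-hot encoding. The following is a minimal sketch with made-up records; the `smoker` column is hypothetical and does not exist in the diabetes dataset.

```python
import pandas as pd

# Hypothetical patient records; "smoker" is a made-up categorical column
patients = pd.DataFrame({
    "age": [34, 51, 46],
    "smoker": ["no", "yes", "no"],
})

# One-hot encode the categorical column so that every feature is numeric
encoded = pd.get_dummies(patients, columns=["smoker"])
print(encoded)
```
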
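> Another Important Note: In principle, DTs can handle missing values and are robust to outliers, so for this exercise we don't do any additional preprocessing. Be aware, however, that scikit-learn's `DecisionTreeClassifier` accepts missing values (NaN) only in recent versions (1.3 and later), so check your installed version before relying on this.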

Let's just look closer at the data:


In [None]:
# Display summary statistics of the dataset
# Note: use describe() on df
df.describe()

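The summary statistics are worth reading closely. In this dataset, a minimum of 0 in measurements such as `Glucose` or `BMI` is commonly interpreted as a placeholder for a missing value rather than a real reading. The sketch below uses a few made-up rows (not the real data) to show how such zeros could be counted and marked as missing; which columns to treat this way is an assumption you should check against the data description.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic some of the dataset's column names
sample = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 1],
})

# Columns where a zero is assumed to be physiologically implausible
suspect_columns = ["Glucose", "BMI"]

# Count the zeros per suspect column
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Mark the zeros as missing so they can be imputed or dropped explicitly
cleaned = sample.copy()
cleaned[suspect_columns] = cleaned[suspect_columns].replace(0, np.nan)
print(cleaned.isna().sum())
```
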
##  Distribution of the target value

In [None]:
# Show the distribution of the target value "Outcome"
# Note: use value_counts() on Outcome column
target_frequency = <your-code-here>

print("#Outcomes:")
display(target_frequency)

# Plot a pie chart showing the distribution
# Note: use plot.pie on target_frequency; the following parameters make the chart look better:
# autopct='%1.1f%%', startangle=90, colors=['lightgreen', 'lightcoral']
<your-code-here>


## Pair-wise Scatter Plots
Pair-wise scatter plots are a visualization tool used to explore relationships between pairs of variables in a dataset. Each point on the plot represents a data point, and the position of the point is determined by the values of two variables.

Here, the pair-wise scatter plots are created for different pairs of features, with points colored by the target variable "Outcome" (indicating whether a patient has diabetes or not). This type of visualization allows for the examination of potential patterns, trends, or separations between different classes of the target variable.

In [None]:
# Pair-wise Scatter Plots
# Put the 'Outcome' which is our target value into the hue parameter
pp = sns.pairplot(df, hue=<target-name>, height=1.8,
                  aspect=1.8, plot_kws=dict(edgecolor="k",
                  linewidth=0.5), diag_kind="kde")

fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Pairwise Plots', fontsize=14)

As we can see, our dataset is slightly imbalanced. This can affect the performance of a DT (as with most other classifiers). There are techniques for dealing with imbalanced classification; you can read more [here](https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/).

From the pairwise plots we can see that the data is not linearly separable.

## Dataset Split
Now let’s divide the data into training & testing sets in the ratio of 70:30.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training set and test set (70% training and 30% test)
# Note: use train_test_split and pass X, y, test_size=0.3, random_state=1
X_train, X_test, y_train, y_test = <your-code-here>

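Because the classes are imbalanced, it can help to keep the class ratio the same in both subsets. The sketch below shows the optional `stratify` argument of `train_test_split` on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 60 negatives and 40 positives (made-up data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo keeps the 60/40 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```
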
Let's perform the decision tree analysis using scikit-learn:

In [None]:
# Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree classifier object
# Note: call DecisionTreeClassifier and use criterion='entropy' as input
clf = <your-code-here>

# Train the Decision Tree Classifier
# Note: call fit function on clf and pass X_train and y_train
clf = <your-code-here>

# Predict the response for the test dataset
# Note: call predict on clf and pass X_test
y_pred = <your-code-here>


Let's quickly look at the accuracy of the trained model on the test data:

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Print the accuracy of the model on test data
# Note: use metrics.accuracy_score and pass y_test and y_pred
print("Accuracy:", <your-code-here>)

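It looks like our decision tree has an accuracy of more than 70%. Keep in mind that the dataset is imbalanced: a model that always predicts the majority class would already be right for roughly 65% of the patients in the full dataset, so this result is only a moderate improvement over that trivial baseline.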

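An accuracy value is easier to judge next to a trivial baseline. The following sketch uses scikit-learn's `DummyClassifier` on made-up labels with a 65/35 class split to show what always predicting the majority class would score; the arrays are illustrative and not taken from the diabetes data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with a 65/35 class split; the features are irrelevant here
y_true = np.array([0] * 65 + [1] * 35)
X_placeholder = np.zeros((len(y_true), 1))

# The baseline ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)

baseline_accuracy = accuracy_score(y_true, baseline.predict(X_placeholder))
print("Majority-class baseline accuracy:", baseline_accuracy)
```
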
## Visualizing Decision Tree
Now that we have created a decision tree, let’s see what it looks like when we visualise it. There are multiple ways to see the result of a DT.
1. Print Text Representation: Exporting the decision tree to a text representation can be useful when working on applications without a user interface or when we want to log information about the model to a text file. You can check the details of export_text in the sklearn docs. However, the output may be hard to read, especially when the tree is big. Let's take a look:

In [None]:
# Import tree for decision tree models
from sklearn import tree

# Export text representation of the tree
# Note: use export_text function on tree and pass clf
text_representation = <your-code-here>

# Print Text Representation
print(text_representation)

2. Plot Tree with `plot_tree`: The plot_tree method was added to sklearn in version 0.21. It requires matplotlib to be installed. It allows us to easily produce a figure of the tree (without an intermediate export to graphviz). More information about the plot_tree arguments is available in the docs. Let's see what it looks like:


In [None]:
# Create a figure
fig = plt.figure(figsize=(25,20))

# Plot the tree
# Note: Pass the clf classifier as the first argument of the plot_tree
_ = tree.plot_tree(<classifier>,
                   feature_names=X.columns,
                   class_names=['No', 'Yes'],  # labels for Outcome 0 and 1
                   filled=True)

We can save the figure into a .png file.

In [None]:
# Save the image
# Note: use fig.savefig and pass "decision_tree.png" as the input
<your-code-here>

3. Visualize Decision Tree with `graphviz`: To plot the tree first we need to export it to DOT format with export_graphviz method. Then we can plot it in the notebook or save to the file. This will provide us with a better view to look at bigger trees.
> The DOT format is a plain text graph description language that is commonly used for describing graphs and graph structures

Let's take a look:


In [None]:
# Import graphviz
import graphviz

# Export graphviz from the tree
# Note: Pass the clf classifier as the first argument of the export_graphviz
dot_data = tree.export_graphviz(<classifier>,
                                out_file=None,
                                feature_names=X.columns,
                                class_names=['no', 'yes'],
                                filled=True # paint nodes to indicate majority class for classification
                                )

# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph

Inside each node:
- The question that the decision tree asks in order to split the data on the selected feature
- Criterion (e.g. gini or entropy): the value of the function that measures the quality of a split
- Samples: the number of samples that ended up in the node
- Value [X, Y]: how many samples at the given node fall into each category (here 0 and 1)
- Class: the prediction the node will make, which can be determined from the value list

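The quantities listed above can also be read directly from a fitted classifier through its `tree_` attribute. The sketch below fits a tree on a tiny made-up dataset (four samples, one feature) and prints the split question, impurity and sample count of every node; the toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: one feature, two perfectly separable classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = DecisionTreeClassifier(criterion="entropy", random_state=1)
toy_clf.fit(X_toy, y_toy)

fitted = toy_clf.tree_
for node_id in range(fitted.node_count):
    # Leaves have no children; scikit-learn marks this with -1
    is_leaf = fitted.children_left[node_id] == -1
    if is_leaf:
        question = "leaf"
    else:
        question = (f"feature[{fitted.feature[node_id]}] "
                    f"<= {fitted.threshold[node_id]:.2f}")
    print(f"node {node_id}: {question}, "
          f"entropy={fitted.impurity[node_id]:.3f}, "
          f"samples={fitted.n_node_samples[node_id]}")
```
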
In [None]:
# Save the graphviz figure
# Note: use graph.render and pass "decision_tree_graphviz" as input
<your-code-here>

> To read more about DT visualization you can check [this](https://mljar.com/blog/visualize-decision-tree/) link

## DT Evaluation
As you notice, in this extensive decision tree chart, each internal node has a decision rule that splits the data. But are all of these useful/pure?

The Gini index (also called Gini impurity) measures the impurity of a node in a decision tree. A node is pure when all of its records belong to the same class. A pure node is not split any further and therefore becomes a leaf node, although not every leaf node is necessarily pure.

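To make the notion of impurity concrete, here is a minimal sketch that computes the Gini impurity, 1 - sum(p_k^2), and the entropy, -sum(p_k * log2(p_k)), for the class labels in a node. The helper names are illustrative and are not part of scikit-learn.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2). It is zero for a pure node."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)). It is zero for a pure node."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure_node = [1, 1, 1, 1]    # all samples share one class
mixed_node = [0, 0, 1, 1]   # classes are evenly mixed

print("pure :", gini_impurity(pure_node), entropy(pure_node))
print("mixed:", gini_impurity(mixed_node), entropy(mixed_node))
```
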
In our outcome above, the complete decision tree is difficult to interpret because it is so complex. Pruning (shortening) a tree is essential to ease our understanding of the outcome and to optimise it. The following three parameters control how the tree is built:
- **criterion**: string, optional (default=”gini”), the attribute selection measure.

  This parameter selects the measure used to evaluate the quality of a split, for example “gini” or “entropy”.
- **splitter**: string, optional (default=”best”), the split strategy.

  This parameter selects the split strategy. Choose “best” for the best split or “random” for the best random split.
- **max_depth**: int or None, optional (default=None), the maximum depth of the tree.

  This parameter determines the maximum depth of the tree. A value that is too high tends to cause overfitting, and a value that is too low tends to cause underfitting.

> According to the paper “Theoretical comparison between the Gini Index and Information Gain criteria”, the Gini Index and the Information Gain disagree in only about 2% of all cases, so for most practical purposes you can use either. The main difference is that entropy might be a little slower to compute because it requires evaluating a logarithm.

> More information regarding DT parameters is [here](https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680)

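The effect of `max_depth` on overfitting can be observed by comparing training and test accuracy for several depths. The sketch below does this on synthetic data generated with `make_classification`, not on the diabetes dataset, so the exact numbers are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

# Store (train accuracy, test accuracy) for each depth; None means unlimited
depth_scores = {}
for depth in [1, 3, 5, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_tr, y_tr)
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={depth}: train={depth_scores[depth][0]:.3f}, "
          f"test={depth_scores[depth][1]:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the typical sign of overfitting.
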


## Pruning in Decision Trees
Pruning is the process of shortening a decision tree to improve its interpretability and optimize its performance.
We will vary the maximum depth of the tree as a control variable for pre-pruning.

Let’s try max_depth=3.

In [None]:
# Create a Decision Tree classifier object with pre-pruning parameters
# Note: use DecisionTreeClassifier with "entropy" as the criterion and set max_depth=3 for pre-pruning
# This helps in creating a simpler tree for better interpretability
clf = <your-code-here>

# Train the Decision Tree Classifier
clf = <your-code-here>

# Predict the response for the test dataset
y_pred = <your-code-here>

# Model Accuracy: Evaluate the performance of the classifier
accuracy = <your-code-here>

# Print the accuracy of the model
print("Accuracy:", accuracy)

With pre-pruning, the accuracy of the decision tree increased to 77.05%, an improvement over the previous, unpruned model.

Let's look at the DT:

In [None]:
# DOT data
dot_data = tree.export_graphviz(clf,
                                out_file=None,
                                feature_names=X.columns,
                                class_names=['No', 'Yes'],
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph

## Hyperparameter Tuning in DT
Hyperparameters are model parameters whose values are set before training. Why should we tune the hyperparameters of a model? Because we don't know their optimal values in advance. A model with different hyperparameters is effectively a different model, so it may perform better or worse. If the model has several hyperparameters, we need to find the best combination of values by searching a multi-dimensional space. That's why hyperparameter tuning, the process of finding the right values of the hyperparameters, can be a complex and time-consuming task.
### Grid Search
Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the hyperparameters into a discrete grid. Then, we try every combination of values of this grid, calculating some performance metrics using cross-validation. The point of the grid that maximizes the average value in cross-validation, is the optimal combination of values for the hyperparameters.
> To read more check [here](https://www.yourdatateacher.com/2021/05/19/hyperparameter-tuning-grid-search-and-random-search/)

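To see what grid search does under the hood, the sketch below loops over every combination of a small grid by hand and scores each one with 4-fold cross-validation. It runs on synthetic data and uses a smaller grid than the exercise; `GridSearchCV` automates this kind of loop and additionally refits the best model on the whole training set by default.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a training set (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=1)

grid = {"max_depth": [2, 3, 5], "criterion": ["gini", "entropy"]}

results = []
for max_depth, criterion in product(grid["max_depth"], grid["criterion"]):
    model = DecisionTreeClassifier(
        max_depth=max_depth, criterion=criterion, random_state=1
    )
    # Mean accuracy over 4 cross-validation folds for this combination
    mean_score = cross_val_score(
        model, X_syn, y_syn, cv=4, scoring="accuracy"
    ).mean()
    results.append((mean_score, max_depth, criterion))

# The grid point with the highest mean cross-validation accuracy wins
best_score, best_depth, best_criterion = max(results, key=lambda r: r[0])
print(f"Best: max_depth={best_depth}, criterion={best_criterion}, "
      f"mean CV accuracy={best_score:.3f}")
```
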

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
# Note: use [2, 3, 5, 10, 20] for max_depth and ["gini", "entropy"] for criterion
params = {
    'max_depth': <your-code-here>,
    'criterion': <your-code-here>
}

# Create a Decision Tree classifier with a fixed random_state=1
# Note: use DecisionTreeClassifier and pass random_state=1 as input
dt = <your-code-here>

# Instantiate the grid search model
# Note: pass corresponding parameters to GridSearchCV
# use cv=4 and scoring="accuracy"
grid_search = GridSearchCV(
    estimator=<dt-classifier>,
    param_grid=<params>,
    cv=<number-of-cross-validations>,
    n_jobs=-1,
    verbose=1,
    scoring=<scoring-metric>)

# Fit the grid search model on the training data
# Note: use grid_search.fit and pass X_train, y_train
grid_search.fit(X_train, y_train)

## Analyzing Grid Search Results

Let's see the results:

In [None]:
# Create a DataFrame with the grid search results
# Note: pass grid_search.cv_results_ to pd.DataFrame
score_df = <your-code-here>

# Display the first few rows of the scores DataFrame
print("Scores DataFrame: -----------")
display(score_df.head())

# Display the top 5 results based on the mean_test_score
print("Top-5 Scores: -----------")
# Note: use score_df.nlargest and pass 5 and "mean_test_score" as inputs
display(<your-code-here>)

# Extract and display the best estimator (model with the best hyperparameters)
# Note: use grid_search.best_estimator_
dt_best = <your-code-here>
print("Best Estimator: -----------")
display(dt_best)

Now that we have the best parameters, let's look at the resulting tree and its performance. But first, let's write some helper functions:

In [None]:
def evaluate_model(dt_classifier):
    # Evaluate the model on the training set
    # Note: use dt_classifier.predict and pass X_train
    y_pred_train = <your-code-here>
    # Note: use metrics.accuracy_score and pass y_train and y_pred_train
    train_accuracy = <your-code-here>

    # Display the confusion matrix for the training set
    fig, ax = plt.subplots(1, 2, figsize=(20,5))
    ax[0].set_title('Train Confusion Matrix')
    metrics.ConfusionMatrixDisplay.from_estimator(dt_classifier, X_train, y_train, cmap='ocean', ax=ax[0])

    print("Train Accuracy :", train_accuracy)
    print("-" * 50)

    # Evaluate the model on the test set
    # Note: use dt_classifier.predict and pass X_test
    y_pred_test = <your-code-here>
    # Note: use metrics.accuracy_score and pass y_test and y_pred_test
    test_accuracy = <your-code-here>

    # Display the confusion matrix for the test set
    ax[1].set_title("Test Confusion Matrix")
    metrics.ConfusionMatrixDisplay.from_estimator(dt_classifier, X_test, y_test, cmap='ocean', ax=ax[1])

    print("Test Accuracy :", test_accuracy)

def plot_dt_graph(clf):
    # Create DOT data for decision tree visualization
    dot_data = tree.export_graphviz(clf,
                                    out_file=None,
                                    feature_names=X.columns,
                                    class_names=['No', 'Yes'],  # labels for Outcome 0 and 1
                                    filled=True)

    # Draw the decision tree graph
    graph = graphviz.Source(dot_data, format="png")

    return graph

In [None]:
evaluate_model(dt_best)
plot_dt_graph(dt_best)

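The confusion matrices plotted by the helper function can be read as follows: rows are the true classes and columns are the predicted classes. The sketch below computes one for a handful of made-up labels and derives the accuracy from it.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 0 = no diabetes, 1 = diabetes
y_actual = [0, 0, 0, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()

print(cm)
# Accuracy is the share of correct predictions (the diagonal of the matrix)
print("Accuracy from matrix:", (tn + tp) / cm.sum())
print("Accuracy from sklearn:", accuracy_score(y_actual, y_predicted))
```
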
Useful links:
- https://www.kaggle.com/code/gauravduttakiit/hyperparameter-tuning-in-decision-trees/notebook
- https://www.kaggle.com/code/mamun18/decision-tree-practice-with-car-evaluation-dataset
- https://www.springboard.com/blog/data-science/decision-tree-implementation-in-python/
- https://towardsdatascience.com/id3-decision-tree-classifier-from-scratch-in-python-b38ef145fd90