# Decision Tree


Decision trees can be used for both classification and regression problems. In classification problems, the class label is a categorical variable, and in regression problems, the target variable is a continuous variable.
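
As a rough illustration of the two problem types, the sketch below fits one tree of each kind with scikit-learn. The choice of datasets (`load_iris` for classification, `load_diabetes` for regression) and the `max_depth=3` setting are assumptions made for this example only.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target is a categorical class label (0, 1 or 2 for Iris).
X_cls, y_cls = load_iris(return_X_y=True)
classifier = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_cls, y_cls)
print(classifier.predict(X_cls[:3]))  # predicted class indices

# Regression: the target is a continuous value (disease progression score).
X_reg, y_reg = load_diabetes(return_X_y=True)
regressor = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)
print(regressor.predict(X_reg[:3]))  # predicted real-valued targets
```

A classification tree predicts the majority class of the leaf a sample falls into, while a regression tree predicts the mean target value of that leaf.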

### Key Concepts:

1. Entropy: It measures the impurity or disorder of a set of samples. A low entropy indicates high purity, and a high entropy indicates high impurity.

2. Information Gain: It quantifies the reduction in entropy achieved after splitting a dataset based on an attribute. It helps in deciding the attribute to split on, with higher information gain being preferred.

3. Gini Index: It measures the impurity of a set of samples, similar to entropy. For $K$ classes its values range from 0 to $1 - 1/K$ (0.5 for two classes), with 0 indicating a pure node and values near the maximum indicating an evenly mixed node.

4. Pruning: It is a technique used to reduce the complexity of a decision tree by removing unnecessary branches or nodes. Pruning helps prevent overfitting and improves the generalization ability of the model.
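
One way to try pruning in practice is scikit-learn's minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The sketch below is only an illustration: it picks a mid-range `alpha` from the pruning path arbitrarily, whereas in real use the value would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unpruned tree: grown until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alpha values along the cost-complexity pruning path.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary mid-range alpha, chosen for illustration only.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
pruned_tree.fit(X_train, y_train)

print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```

Larger values of `ccp_alpha` remove more branches, trading training accuracy for a simpler tree.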

### Formula:

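-   **Entropy**: a measure of impurity in a set of examples $D$, where $p_i$ is the proportion of examples in $D$ that belong to class $i$ and $n$ is the number of classes.

$$H(D) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
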
-   **Information gain**: the expected reduction in entropy by splitting a set of examples based on an attribute. Here $H$ denotes entropy and $D_1, \dots, D_n$ are the subsets of $D$ produced by the split.

$$\text{Information Gain} = H(D) - \sum_{i=1}^{n}\left(\frac{|D_i|}{|D|} \cdot H(D_i)\right)$$

-   **Gini index**: a measure of impurity in a set of examples, similar to entropy.

$$ \text{Gini Index} = \sum_{i=1}^n p_i (1-p_i)$$
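
The three formulas above can be checked numerically with a few lines of plain Python. This is a minimal sketch: the function names and the toy `parent`, `left` and `right` label lists are made up for illustration.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Return the relative frequency p_i of each class in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini index: sum(p_i * (1 - p_i))."""
    return sum(p * (1 - p) for p in class_probabilities(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" examples, split into two subsets.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(entropy(parent))                          # 1.0
print(gini(parent))                             # 0.5
print(information_gain(parent, [left, right]))  # about 0.278
```

An evenly mixed two-class node has the maximum entropy of 1 bit and the maximum Gini index of 0.5; the split shown reduces entropy by roughly 0.278 bits.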

### Selection of attribute

Selecting the best attribute to split on typically involves evaluating each candidate attribute against a criterion such as information gain or the Gini index. The general steps for attribute selection are as follows:

1. Calculate the impurity or uncertainty measure of the current node. This could be entropy or Gini index.

2. Iterate over each attribute and calculate the impurity-based criterion for splitting, such as information gain or Gini gain.

3. Select the attribute with the highest information gain, or equivalently the lowest weighted impurity across the resulting child nodes. This is the attribute that reduces impurity the most at the current node.

4. Split the dataset based on the selected attribute and create child nodes.

Recursively repeat the above steps for each child node until a stopping criterion is met, such as reaching a maximum depth, minimum number of samples, or impurity below a certain threshold.
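
The steps above can be sketched for categorical attributes in plain Python. This is a simplified ID3-style illustration, not the algorithm scikit-learn uses (which makes binary splits on numeric thresholds). The toy dataset and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Step 1: impurity of the current node."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Step 2: entropy reduction obtained by splitting on `attribute`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels, attributes):
    """Step 3: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


def build_tree(rows, labels, attributes, depth=0, max_depth=3):
    """Step 4 plus recursion: split, then repeat on each child node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no attributes left, or maximum depth reached.
    if len(set(labels)) == 1 or not attributes or depth >= max_depth:
        return majority
    best = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        children[value] = build_tree(
            sub_rows, sub_labels, remaining, depth + 1, max_depth
        )
    return {best: children}


# Hypothetical toy data: should we play outside?
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "overcast", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data `outlook` has the higher information gain (about 0.667 bits versus about 0.082 for `windy`), so it becomes the root; only the `rainy` branch is still mixed and needs a further split on `windy`.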

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz  # also requires the Graphviz system binaries to be installed

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier that splits on information gain (entropy);
# random_state makes tie-breaking between equally good splits reproducible
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Train the classifier on the training set
clf.fit(X_train, y_train)

# Export the fitted tree to Graphviz DOT format
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True,
)
graph = graphviz.Source(dot_data)

# Render the tree to decision_tree.pdf
graph.render("decision_tree")

# Open the rendered file in the default viewer; returns 'decision_tree.pdf'
graph.view()
```
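
The held-out test set from the split can be used to check how well the tree generalizes. The following sketch repeats the same setup so it runs on its own; using `accuracy_score` as the metric and `export_text` as a text-only alternative to the Graphviz rendering are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

# Fraction of test samples whose predicted class matches the true class.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")

# Plain-text view of the learned splits; needs no Graphviz installation.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```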