**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I** 

**The Fundamentals of Machine Learning** 

---

**CHAPTER 6 - Decision Trees** 

---

Chapter 6 explores Decision Trees, versatile Machine Learning algorithms that can perform classification, regression, and even multioutput tasks. They can fit complex datasets, and they are the fundamental components of Random Forests, which are among the most powerful ML algorithms available today.

---

**Training and Visualizing a Decision Tree**   
To understand Decision Trees, let's build one using the iris dataset. The following code trains a DecisionTreeClassifier on the iris dataset using petal length and width features.

Train a Decision Tree classifier:

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

**Visualizing with Graphviz**   
You can visualize the trained Decision Tree by exporting it with the export_graphviz() function, which writes a Graphviz DOT file:

In [None]:
from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

Convert the DOT file to PNG using the dot command-line tool from the Graphviz package:

In [None]:
# In terminal/command line:
# $ dot -Tpng iris_tree.dot -o iris_tree.png

**Figure 6-1. Iris Decision Tree**   
![Figure6-1.jpg](./06.Chapter-06/Figure6-1.jpg)

Each node in the tree shows its Gini impurity, its sample count, its per-class value array, and its predicted class.
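
If Graphviz is not installed, the same tree can also be inspected as plain text. The following is a minimal sketch using `sklearn.tree.export_text`; `random_state=42` is an assumption added here for reproducibility and is not part of the original training code.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# random_state is an assumption added for reproducible output
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# One line per node: split conditions for internal nodes, predicted class for leaves
tree_text = export_text(tree_clf, feature_names=iris.feature_names[2:])
print(tree_text)
```

The text output lists the same thresholds as the Graphviz figure, but without the impurity, samples, and value attributes.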

**Making Predictions**   
Decision Trees make predictions by traversing from the root node down to a leaf node, testing one feature threshold at each step. Starting at the root node (depth 0), the tree checks whether petal length ≤ 2.45 cm. If so, it reaches a leaf and predicts Iris setosa. Otherwise, it moves to the depth-1 right node and checks whether petal width ≤ 1.75 cm: if so, it predicts Iris versicolor, otherwise Iris virginica.
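
The traversal above can be written out by hand as two nested tests. This is only a sketch: the function name `predict_iris` is hypothetical, the thresholds are the ones quoted in the text, and it follows Scikit-Learn's convention that an instance goes left when its feature value is less than or equal to the threshold.

```python
def predict_iris(petal_length_cm, petal_width_cm):
    """Apply the depth-2 tree's rules by hand (thresholds taken from the text)."""
    if petal_length_cm <= 2.45:      # root node (depth 0)
        return "setosa"
    if petal_width_cm <= 1.75:       # depth-1 right node
        return "versicolor"
    return "virginica"

print(predict_iris(1.4, 0.2))   # setosa
print(predict_iris(5.0, 1.5))   # versicolor
print(predict_iris(5.0, 2.0))   # virginica
```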

**Data Preparation Requirements**   
One quality of Decision Trees is that they require very little data preparation. They don't require feature scaling or centering at all.

**Node Attributes:**  
- **samples**: number of training instances the node applies to
- **value**: number of training instances of each class at the node
- **gini**: measures node impurity (gini=0 means pure node)

**Equation 6-1. Gini impurity**   
![Eq6-1.jpg](./06.Chapter-06/Eq6-1.jpg)

Where p<sub>i,k</sub> is the ratio of class k instances among the training instances in the i-th node.
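
As a worked example, the Gini impurity can be computed directly from a node's class counts. The sketch below assumes the depth-2 left node holds 0 setosa, 49 versicolor, and 5 virginica instances (counts consistent with the class probabilities reported later in these notes); the helper name `gini` and `random_state=42` are also assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini(class_counts):
    """Gini impurity 1 - sum_k p_k^2 for a node with the given per-class counts."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Assumed counts for the depth-2 left node: 0 setosa, 49 versicolor, 5 virginica
print(round(gini([0, 49, 5]), 3))   # 0.168
print(gini([50, 0, 0]))             # 0.0 (pure node)

# Cross-check against the impurity Scikit-Learn stores for the root node
iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(iris.data[:, 2:], iris.target)
root_gini = tree_clf.tree_.impurity[0]   # node 0 is the root
print(root_gini, gini([50, 50, 50]))
```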

**Binary Trees in Scikit-Learn**   
Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children.

**Figure 6-2. Decision Tree decision boundaries**     
![Figure6-2.jpg](./06.Chapter-06/Figure6-2.jpg)

The thick vertical line represents the root node decision boundary (petal length = 2.45 cm). The dashed line shows the depth-1 right node split (petal width = 1.75 cm).

---

**Model Interpretation: White Box Versus Black Box**   
Decision Trees are intuitive and their decisions are easy to interpret. Such models are called **white box models**. In contrast, Random Forests or neural networks are **black box models**: they make great predictions but it's hard to explain in simple terms why the predictions were made. Decision Trees provide simple classification rules that can even be applied manually.

---

**Estimating Class Probabilities**   
A Decision Tree can estimate the probability that an instance belongs to a particular class k. It traverses the tree to find the leaf node for the instance, then returns the ratio of training instances of class k in that node.

Example prediction:

In [None]:
>>> tree_clf.predict_proba([[5, 1.5]])
array([[0.        , 0.90740741, 0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])

For a flower with petals 5 cm long and 1.5 cm wide, the tree predicts 0% for Iris setosa, 90.7% for Iris versicolor, and 9.3% for Iris virginica. The predicted class is Iris versicolor (class 1).
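
To see where these probabilities come from, one can locate the leaf that the instance falls into and count the training instances of each class in that leaf. This is a sketch rather than Scikit-Learn's internal implementation; `random_state=42` and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

sample = [[5, 1.5]]
leaf_id = tree_clf.apply(sample)[0]          # index of the leaf the sample lands in
in_leaf = tree_clf.apply(X) == leaf_id       # training instances in the same leaf

# Class ratios among the training instances of that leaf
manual_proba = np.bincount(y[in_leaf], minlength=3) / in_leaf.sum()
print(manual_proba)
print(tree_clf.predict_proba(sample)[0])
```

Because the estimate depends only on the leaf, every instance that lands in the same leaf receives exactly the same probabilities, however close it is to the decision boundary.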

---

**The CART Training Algorithm**     
Scikit-Learn uses the **Classification and Regression Tree (CART)** algorithm to train Decision Trees. The algorithm splits the training set into two subsets using a single feature k and a threshold t<sub>k</sub> (e.g., "petal length ≤ 2.45 cm"). It chooses the pair (k, t<sub>k</sub>) that produces the purest subsets (weighted by their size).

**Equation 6-2. CART cost function for classification**   
![Eq6-2.jpg](./06.Chapter-06/Eq6-2.jpg)

Where G<sub>left/right</sub> measures the impurity of the left/right subset, and m<sub>left/right</sub> is the number of instances in each subset.
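
The cost function can be illustrated with a brute-force search over every feature and every candidate threshold. This is a deliberately naive sketch, not Scikit-Learn's optimized implementation; the helper names `gini`, `cart_cost`, and `best_split` are hypothetical, and taking midpoints between consecutive feature values as candidate thresholds is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris

def gini(labels, n_classes):
    """Gini impurity of a set of integer class labels."""
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    return 1.0 - np.sum(p ** 2)

def cart_cost(feature_values, labels, threshold, n_classes):
    """Size-weighted impurity of the two subsets produced by one split."""
    left = feature_values <= threshold
    m, m_left = len(labels), left.sum()
    m_right = m - m_left
    return ((m_left / m) * gini(labels[left], n_classes)
            + (m_right / m) * gini(labels[~left], n_classes))

def best_split(X, y):
    """Exhaustively search every feature k and candidate threshold t_k."""
    n_classes = int(y.max()) + 1
    best = (None, None, np.inf)               # (feature, threshold, cost)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])           # sorted unique values
        midpoints = (values[:-1] + values[1:]) / 2
        for t in midpoints:
            cost = cart_cost(X[:, k], y, t, n_classes)
            if cost < best[2]:                # keep the first strictly better split
                best = (k, t, cost)
    return best

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
feature, threshold, cost = best_split(X, y)
print(feature, threshold, cost)
```

On the petal features, the search selects feature 0 (petal length) with a threshold of about 2.45, the root split described earlier. Splitting petal width at 0.8 gives exactly the same cost, so which of the two a real implementation picks depends on how it breaks ties.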

The algorithm then splits each subset recursively, stopping once it reaches the maximum depth (max_depth) or cannot find a split that reduces impurity. Other hyperparameters add further stopping conditions: min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes.

**Greedy Algorithm**   
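CART is a **greedy algorithm**: it searches for an optimum split at the top level, then repeats the process at each subsequent level, without checking whether a split will lead to the lowest possible impurity several levels down. It therefore produces a reasonably good tree but does not guarantee a globally optimal one. Finding the optimal tree is an **NP-Complete problem** requiring O(exp(m)) time.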

**Computational Complexity:**  
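- **Prediction**: O(log₂(m)), where m is the number of training instances. A roughly balanced tree requires traversing about log₂(m) nodes, and each node checks only one feature, so prediction time is independent of the number of features.
- **Training**: O(n × m log₂(m)), where n is the number of features. The algorithm compares all features (or fewer if max_features is set) on all samples at each node.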
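
The prediction cost can be checked empirically by counting how many nodes each instance passes through. The sketch below trains an unrestricted tree on the iris petal features (`random_state=42` is an assumption) and compares its depth with log₂(m). Real trees are rarely perfectly balanced, so the two numbers need not match.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
m = len(X)

# Unrestricted tree; random_state is an assumption added for reproducibility
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# decision_path returns a sparse indicator matrix: one row per instance,
# one column per node, with a 1 for every node the instance passes through
nodes_visited = np.asarray(tree.decision_path(X).sum(axis=1)).ravel()

print("log2(m):", np.log2(m))
print("tree depth:", tree.get_depth())
print("max nodes visited:", nodes_visited.max())
```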

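For small training sets (fewer than a few thousand instances), older versions of Scikit-Learn could speed up training by presorting the data (presort=True), although this slowed training down considerably for larger sets. The presort parameter was deprecated in Scikit-Learn 0.22 and removed in 0.24, so it is no longer available in current releases.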

**Gini Impurity or Entropy?**   
By default, Gini impurity is used, but you can select entropy by setting criterion="entropy". Entropy originated in thermodynamics and information theory, measuring average information content.

**Equation 6-3. Entropy**   
![Eq6-3.jpg](./06.Chapter-06/Eq6-3.jpg)

Most of the time, Gini impurity and entropy lead to similar trees. Gini impurity is slightly faster to compute (good default), while entropy tends to produce slightly more balanced trees.
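
For comparison with the Gini example, the entropy of a node can be computed from its class counts in the same way. The sketch assumes the same depth-2 left node counts (0, 49, 5); the helper name `entropy` and `random_state=42` are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, skipping classes with zero count."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Assumed counts for the depth-2 left node: 0 setosa, 49 versicolor, 5 virginica
print(round(entropy([0, 49, 5]), 3))   # 0.445

# Switching the splitting criterion is a one-argument change
iris = load_iris()
tree_clf_entropy = DecisionTreeClassifier(
    max_depth=2, criterion="entropy", random_state=42
)
tree_clf_entropy.fit(iris.data[:, 2:], iris.target)
print(tree_clf_entropy.tree_.impurity[0])   # root entropy, log2(3) for three equal classes
```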

---

**Regularization Hyperparameters**   
Decision Trees are **nonparametric models**: not because they have no parameters, but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. If left unconstrained, the tree will most likely overfit.

**Main Regularization Parameters:**  
- **max_depth**: maximum tree depth (default is None/unlimited)
- **min_samples_split**: minimum number of samples a node must have before it can be split
- **min_samples_leaf**: minimum number of samples a leaf node must have
- **min_weight_fraction_leaf**: same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances
- **max_leaf_nodes**: maximum number of leaf nodes
- **max_features**: maximum features evaluated for splitting at each node

Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
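
The effect of one of these hyperparameters can be sketched on a small noisy dataset. The dataset below (a two-class "moons" set with illustrative parameters) and the helper name `smallest_leaf` are assumptions, not taken from the original notes.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset: a small, noisy two-class "moons" set (parameters are illustrative)
X_moons, y_moons = make_moons(n_samples=100, noise=0.25, random_state=53)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X_moons, y_moons)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_moons, y_moons)

def smallest_leaf(tree_clf):
    """Number of training instances in the smallest leaf of a fitted tree."""
    t = tree_clf.tree_
    is_leaf = t.children_left == -1        # leaves have no children
    return t.n_node_samples[is_leaf].min()

for name, clf in [("unrestricted", unrestricted), ("min_samples_leaf=4", regularized)]:
    print(name, "| leaves:", clf.get_n_leaves(),
          "| smallest leaf:", smallest_leaf(clf),
          "| train accuracy:", clf.score(X_moons, y_moons))
```

The unrestricted tree fits the training set perfectly, which on noisy data is usually a sign of overfitting. The regularized tree cannot create leaves with fewer than four instances, so it has to tolerate some training errors.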

**Pruning**   
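Other algorithms first train the Decision Tree without restrictions, then prune (delete) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant. Standard statistical tests, such as the **chi-squared test**, estimate the probability that the improvement is purely the result of chance (the null hypothesis). If this probability, called the p-value, is higher than a given threshold (typically 5%), the node is considered unnecessary and its children are deleted. Pruning continues until all unnecessary nodes have been removed.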
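
The significance test for a single split can be sketched with SciPy's `chi2_contingency`. Note that this is not a Scikit-Learn feature: it is a stand-alone illustration of the test itself. The class counts are assumptions chosen to resemble the versicolor/virginica split of the iris tree.

```python
from scipy.stats import chi2_contingency

# Class counts in the two children of one split. Rows are children, columns are
# classes. These numbers are assumptions chosen to resemble the petal-width
# split of the iris tree (versicolor vs. virginica).
children_counts = [
    [49, 5],   # left child
    [1, 45],   # right child
]

# Null hypothesis: class membership is independent of which child an instance
# falls into, i.e. the split does not really improve purity.
chi2, p_value, dof, expected = chi2_contingency(children_counts)

threshold = 0.05
keep_split = p_value <= threshold
print(f"chi2={chi2:.1f}, p-value={p_value:.2e}, keep split: {keep_split}")
```

Here the p-value is far below 5%, so the improvement is very unlikely to be due to chance and the split would be kept.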

**Figure 6-3. Regularization using min_samples_leaf**   
![Figure6-3.jpg](./06.Chapter-06/Figure6-3.jpg)

Left: Decision Tree with default hyperparameters (overfitting). Right: trained with min_samples_leaf=4 (better generalization).

---

**Regression**   
Decision Trees can also perform regression tasks, using Scikit-Learn's DecisionTreeRegressor class.

Train a regression tree:

In [None]:
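from sklearn.tree import DecisionTreeRegressor

# NOTE: X and y must be a regression dataset here (Figure 6-4 uses a single
# feature x1 with continuous targets), not the iris X and y defined earlier.
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)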

**Figure 6-4. A Decision Tree for regression**   
![Figure6-4.jpg](./06.Chapter-06/Figure6-4.jpg)

Instead of predicting a class, each node of the regression tree predicts a value. For an instance with x1 = 0.6, it predicts value=0.111, which is the average target value of the 110 training instances associated with that leaf node.

**Figure 6-5. Predictions of two Decision Tree regression models**   
![Figure6-5.jpg](./06.Chapter-06/Figure6-5.jpg)

Left: max_depth=2. Right: max_depth=3. The predicted value for each region is always the average target value of instances in that region.
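
The claim that each prediction is a leaf average can be verified on a self-contained example. The training data below (a noisy quadratic curve with one feature) is an assumption, since the notes do not show how the regression dataset was built, so the numbers will not match Figure 6-4 exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed training data: a noisy quadratic curve with a single feature x1
rng = np.random.default_rng(42)
X = rng.random((200, 1))
y = 4 * (X[:, 0] - 0.5) ** 2 + rng.normal(scale=0.1, size=200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

x_new = [[0.6]]
leaf_id = tree_reg.apply(x_new)[0]             # leaf that x1 = 0.6 falls into
same_leaf = tree_reg.apply(X) == leaf_id       # training instances in that leaf

print("prediction:       ", tree_reg.predict(x_new)[0])
print("mean of leaf's y: ", y[same_leaf].mean())
print("instances in leaf:", same_leaf.sum())
```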

**CART Cost Function for Regression**  

**Equation 6-4. CART cost function for regression**   
![Eq6-4.jpg](./06.Chapter-06/Eq6-4.jpg)

The CART algorithm works mostly the same way as for classification, except that it tries to split the training set in a way that minimizes the MSE rather than the impurity.
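
The regression cost can be sketched as the size-weighted MSE of the two subsets, evaluated at every candidate threshold. The helper names `mse` and `cart_cost_regression`, the noisy quadratic data, and the midpoint thresholds are all assumptions made for illustration.

```python
import numpy as np

def mse(targets):
    """Mean squared error of a node that predicts the mean of its targets."""
    return np.mean((targets - targets.mean()) ** 2)

def cart_cost_regression(x, y, threshold):
    """Size-weighted MSE of the two subsets produced by splitting at threshold."""
    left = x <= threshold
    m, m_left = len(y), left.sum()
    return (m_left / m) * mse(y[left]) + ((m - m_left) / m) * mse(y[~left])

# Assumed training data: a noisy quadratic curve with a single feature
rng = np.random.default_rng(42)
x = rng.random(200)
y = 4 * (x - 0.5) ** 2 + rng.normal(scale=0.1, size=200)

# Candidate thresholds: midpoints between consecutive sorted feature values
x_sorted = np.sort(x)
candidates = (x_sorted[:-1] + x_sorted[1:]) / 2
costs = np.array([cart_cost_regression(x, y, t) for t in candidates])

best_threshold = candidates[costs.argmin()]
print("best threshold:", best_threshold)
print("cost after split:", costs.min(), "| MSE before split:", mse(y))
```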

**Regularizing Regression Trees**   
Just as with classification, Decision Trees are prone to overfitting on regression tasks. Without any regularization (i.e., with the default hyperparameters), the predictions overfit the training set badly. Setting min_samples_leaf=10 results in a much more reasonable model.

**Figure 6-6. Regularizing a Decision Tree regressor**   
![Figure6-6.jpg](./06.Chapter-06/Figure6-6.jpg)

Left: overfitting without regularization. Right: reasonable predictions with min_samples_leaf=10.
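
One rough way to see the difference is to count the leaves of each model. The sketch below again assumes a noisy quadratic training set; the variable names `tree_reg1` and `tree_reg2` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed training data: a noisy quadratic curve with a single feature
rng = np.random.default_rng(42)
X = rng.random((200, 1))
y = 4 * (X[:, 0] - 0.5) ** 2 + rng.normal(scale=0.1, size=200)

tree_reg1 = DecisionTreeRegressor(random_state=42).fit(X, y)
tree_reg2 = DecisionTreeRegressor(min_samples_leaf=10, random_state=42).fit(X, y)

# An unrestricted tree keeps splitting until it has memorized nearly every point
print("leaves without regularization:", tree_reg1.get_n_leaves())
print("leaves with min_samples_leaf=10:", tree_reg2.get_n_leaves())
```

With 200 training instances and at least 10 instances per leaf, the regularized tree can have at most 20 leaves, so each prediction averages over several noisy targets instead of reproducing a single one.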

---

**Instability**   
Decision Trees are simple to understand, easy to use, versatile, and powerful. However, they do have limitations.

**Sensitivity to Training Set Rotation**   
Decision Trees love **orthogonal decision boundaries** (all splits perpendicular to an axis), making them sensitive to training set rotation.

**Figure 6-7. Sensitivity to training set rotation**   
![Figure6-7.jpg](./06.Chapter-06/Figure6-7.jpg)

Left: the Decision Tree splits a linearly separable dataset easily. Right: after the dataset is rotated by 45°, the decision boundary looks unnecessarily convoluted. One way to limit this problem is to use Principal Component Analysis (Chapter 8), which often results in a better orientation of the training data.
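
The rotation effect can be reproduced with a toy dataset. The data below (uniform points in a square, labeled by the sign of the first feature) and all variable names are assumptions made for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data: points in a square, labeled by the sign of the first feature,
# so a single vertical split separates the classes perfectly
rng = np.random.default_rng(6)
Xs = rng.random((200, 2)) - 0.5
ys = (Xs[:, 0] > 0).astype(int)

# Rotate every point by 45 degrees around the origin
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
Xsr = Xs @ rotation

tree_plain = DecisionTreeClassifier(random_state=42).fit(Xs, ys)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(Xsr, ys)

print("leaves on original data:", tree_plain.get_n_leaves())
print("leaves on rotated data: ", tree_rotated.get_n_leaves())
```

On the original data one split is enough, so the tree has two leaves. After rotation the true boundary is diagonal, and the tree has to approximate it with a staircase of axis-aligned splits.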

**Sensitivity to Small Variations**   
The main issue with Decision Trees is high sensitivity to small variations in training data. Removing just one training instance can produce a very different model.

**Figure 6-8. Sensitivity to training set details**   
![Figure6-8.jpg](./06.Chapter-06/Figure6-8.jpg)

After removing one training instance, the Decision Tree looks very different. Moreover, since the training algorithm used by Scikit-Learn is **stochastic** (it randomly selects the set of features to evaluate at each node), you may get very different models even on the same training data, unless you set the random_state hyperparameter.
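
The role of random_state can be checked by comparing the structure of two fitted trees. This is a minimal sketch; the helper name `same_structure` and the seed values are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

def same_structure(tree_a, tree_b):
    """True if two fitted trees use identical features and thresholds at every node."""
    a, b = tree_a.tree_, tree_b.tree_
    return (a.node_count == b.node_count
            and np.array_equal(a.feature, b.feature)
            and np.array_equal(a.threshold, b.threshold))

# Same seed, same data: the two trees are guaranteed to match
tree_1 = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
tree_2 = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(same_structure(tree_1, tree_2))   # True

# Different seeds may or may not change the tree, depending on tie-breaking
tree_3 = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(same_structure(tree_1, tree_3))
```

On the iris petal features, two different root splits (on petal length or on petal width) isolate Iris setosa equally well, so the seed can decide which one is chosen.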