<center><img src="https://treenewal.com/wp-content/uploads/2020/11/oak-tree-care.png" alt="decision tree picture"></center>

# <center> 🌳Decision trees - classification and regression trees (CART)🌳 </center> 

**What you can expect from this notebook:** This is a follow-up to my notebook [ml from scratch: neural network and GD-optimizers](https://www.kaggle.com/code/vincentbrunner/ml-from-scratch-neural-network-and-gd-optimizers/notebook). In this notebook classification and regression tree algorithms are **explained and implemented from scratch** and used on the titanic and avocado-prices datasets.<br>

If you're a beginner and intimidated by all the terms like entropy and information gain, I can assure you it looks way harder than it actually is. You'll manage it easily💪 <br>

Special thanks to [Suraj Jha](https://www.kaggle.com/surajjha101) and [Ashwin Shetgaonkar](https://www.kaggle.com/ashwinshetgaonkar) for providing helpful feedback on my last notebook, as well as to [Marília Prata](https://www.kaggle.com/mpwolke) for suggesting using competition data.

<div class="alert alert-block alert-info">👉If you're just interested in the complete, commented implementation of a CART algorithm + post-pruning based on nested objects, using just numpy and the copy module, feel free to click on show hidden code: </div>

In [None]:
#  libraries used:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

class DecisionTreeClassifier():
    
    def __init__(self, method="gini", max_depth=50, min_samples_split=20, max_features=None):
        self.method = method
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features

        self.root_node = None
        self.depth = 0
        
    def calc_entropy(self, X):
        weighted_average_of_props = 0
        for categ in np.unique(X):
            prop = np.where(X == categ, 1, 0).sum() / len(X)
            weighted_average_of_props += prop * np.log2(prop)  # base 2, so binary entropy peaks at 1
        return -1 * weighted_average_of_props
    
    def calc_gini(self, X):
        square_props = 0
        for categ in np.unique(X):
            prop = np.where(X == categ, 1, 0).sum() / len(X)
            square_props += prop ** 2
        return 1 - square_props
    
    def information_gain(self, S, S_l, S_r):
        #  if entropy is used as impurity measurement
        if self.method == "entropy":
            S_impurity = self.calc_entropy(S)
            S_l_impurity = self.calc_entropy(S_l)
            S_r_impurity = self.calc_entropy(S_r)
            
        #  if gini index is used as impurity measurement (default due to being less computationally intensive)
        elif self.method == "gini":
            S_impurity = self.calc_gini(S)
            S_l_impurity = self.calc_gini(S_l)
            S_r_impurity = self.calc_gini(S_r)
            
        else:
            raise ValueError("expected method to be 'gini' or 'entropy'")
        
        return S_impurity - (S_l_impurity * (len(S_l) / len(S)) 
                             + S_r_impurity * (len(S_r) / len(S)))
    
    def get_best_split(self, X, y):
        best_var, best_value, best_score = None, None, 0
        
        if self.max_features is None:
            variables = range(X.shape[1])
        else:
            variables = np.random.choice([i for i in range(X.shape[1])], size=self.max_features, replace=False)
        
        for var_i in variables:
            #  the largest value is skipped, splitting there would leave the right child empty
            for value in np.unique(X[:, var_i])[:-1]:
                score = self.information_gain(y,                        # target before split
                                              y[X[:, var_i] <= value],  # target after split left child
                                              y[X[:, var_i] > value])   # target after split right child
                if score > best_score:
                    best_var, best_value, best_score = var_i, value, score
                    
        return best_var, best_value
    
    def build_tree(self, X, y):
        best_var, best_value = self.get_best_split(X, y)
        
        #  check whether to split or create leaf node
        if best_var is None or len(X) < self.min_samples_split or self.depth >= self.max_depth:
            return Leaf(np.unique(y)[np.argmax(np.unique(y, return_counts=True)[1])])
        else:
            self.depth += 1
            return Node(self.build_tree(X[X[:, best_var] <= best_value], y[X[:, best_var] <= best_value]), # left child
                        self.build_tree(X[X[:, best_var] > best_value], y[X[:, best_var] > best_value]), # right child
                        (best_var, best_value)) # save split
    
    def get_prediction(self, node, row):
        if node.type == "Node":
            if row[node.split[0]] > node.split[1]:
                return self.get_prediction(node.right_child, row)
            else:
                return self.get_prediction(node.left_child, row)
        else:
            return node.value
        
    def fit(self, X, y):
        self.depth = 0
        self.root_node = self.build_tree(X, y)
                    
    def predict(self, X):
        predictions = [self.get_prediction(self.root_node, row) for row in X]
        return np.array(predictions)
    
    #  functions added for "visualising" structure:
    def get_tree_struckture(self, node, prev_layer):
        if node.type == "Node":
            print(f"walked node on depth lvl. {prev_layer+1}")
            return self.get_tree_struckture(node.left_child, prev_layer+1), self.get_tree_struckture(node.right_child, prev_layer+1)
        else:
            print(f"walked leaf on depth lvl. {prev_layer+1} with the value {node.value}")
                
    def struckture(self):
        self.get_tree_struckture(self.root_node, 0)

# <center>Table of Contents 📄</center>
This notebook goes through all the principles necessary to understand and code a classification or regression tree. The resulting models will be tested on actual datasets (no pseudo datasets this time^^).

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#1." style="color:#318a11">1. basic intuition behind decision trees</a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#2." style="color:#318a11">2. evaluating a split:</a></p>

<p style="text-indent:10px; font-family: Verdana; font-size: 14px; letter-spacing: 2px; line-height:1.3"><a href="#2.1." style="color:#318a11">2.1. entropy:</a></p>

<p style="text-indent:10px; font-family: Verdana; font-size: 14px; letter-spacing: 2px; line-height:1.3"><a href="#2.2." style="color:#318a11">2.2. gini index:</a></p>

<p style="text-indent:10px; font-family: Verdana; font-size: 14px; letter-spacing: 2px; line-height:1.3"><a href="#2.3." style="color:#318a11">2.3. information gain:</a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#3." style="color:#318a11">3. greedy splitting</a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#4." style="color:#318a11">4. building a classification tree</a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#5." style="color:#318a11">5. how to prevent overfitting</a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#6." style="color:#318a11">6. putting it all together: <strong>classification on titanic dataset</strong></a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#7." style="color:#318a11">7. how to use decision trees for regression</a></p>

<p style="font-family: Verdana; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#8." style="color:#318a11">8. putting it all together: <strong>regression on house-price dataset</strong></a></p>

<br>

##### <div class="alert alert-block alert-info">⚠️ <strong>Important:</strong> since making my own visualisations would be too time-consuming for a first notebook, I mainly embedded images from <strong>google image search</strong>. If you find your image here and <strong>want it to be removed</strong>, please leave a comment or <strong>contact me</strong>. </div>

****
<p id="1."></p>

# <center>1. basic intuition behind decision trees</center>
Since decision trees might just be the most intuitive ml model there is (after linear regression maybe), most of what they do can be understood by just looking at their structure:
<center><img src="https://eloquentarduino.github.io/wp-content/uploads/2020/08/DecisionTree.png" style="width:auto; height:500px" alt="eloquentarduino.github.io"></center>
<center>image source: eloquentarduino.github.io</center>
<br>
Starting from the top, questions of binary nature are asked and a datapoint follows the path to the next question based on the answer to the previous question. In the example tree the first question is "Age over 30?", so if an input sample has the entry 45 in the age variable, it travels to the right.
<details> 
  <summary>There, another question is asked: "No. of children < 2?". Let's assume the input sample has the entry 1 in the num. children variable, <strong>which path would it travel on next?</strong> (click to see answer) </summary>
   <center><strong>→ it travels along the left path and ends up in the block "Get Loan"</strong></center>
</details>
<br>

This is the output of the tree/the prediction it makes. -> aaannd that's a decision tree, see totally easy👌<br>

**important points about the structure to take away:**
* a decision tree is a ***chain of questions*** that separate the tree into ***subtrees***
* separating data into two groups like that is called a ***split***
* these splits are performed at ***nodes*** (blue blocks that ask questions)
* at the end of such a chain of splits ***leaves or terminal nodes*** (blue blocks that output a statement) output the predicted value
* --> a decision tree makes predictions by splitting data points into subgroups based on variable values
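
The points above can be sketched in a few lines. This is only an illustration, assuming a nested-dict layout: the names `toy_tree` and `predict` and the two placeholder leaves are made up for this example, only the path "age over 30, fewer than 2 children, Get Loan" is taken from the text.

```python
def predict(node, sample):
    # a leaf stores its prediction directly
    if "prediction" in node:
        return node["prediction"]
    # a node asks a binary question: is the feature value above the threshold?
    branch = "above" if sample[node["feature"]] > node["threshold"] else "at_or_below"
    return predict(node[branch], sample)

# hypothetical tree, loosely following the example picture
toy_tree = {
    "feature": "age", "threshold": 30,
    "at_or_below": {"prediction": "placeholder A"},
    "above": {
        "feature": "children", "threshold": 1,   # "No. of children < 2?"
        "at_or_below": {"prediction": "Get Loan"},
        "above": {"prediction": "placeholder B"},
    },
}

print(predict(toy_tree, {"age": 45, "children": 1}))
```

The sample with age 45 and 1 child follows the same path as described in the text and ends in the "Get Loan" leaf.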

**how these trees are built in order to output fitting predictions:**
* to fit a tree/build it to predict a target variable, a whole dataset is used, not just a single sample
* tree construction starts at the top node, also called the ***root node***
* there, the ***dataset gets split*** by asking the question that best separates the data with respect to the target variable
    * consider the datapoints with target [+, +, -, +, -, -, °, °]
    * a good split would be [+, +, +] [-, -, -, °, ° ]
* now that the data is split into 2 groups (the left one is called ***left child*** and the right one ***right child***), each group again can be split, this is repeated till every subgroup consists of data points that have the same target. 
* There a ***leaf/terminal node is created*** with that target value
* if we now pass a sample through the tree, it gets "sorted" into one of these "subgroups" and given that the data is not random, it (probably) has the same target as that group<br>
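
To make the example split from the list above a bit more tangible, here is a small sketch that reports the most common target of a group and its share. The helper name `majority` is an assumption made for this illustration and is not part of the notebook's implementation.

```python
from collections import Counter

before = ["+", "+", "-", "+", "-", "-", "°", "°"]
left, right = ["+", "+", "+"], ["-", "-", "-", "°", "°"]

def majority(labels):
    """Return the most common label of a group and its share of the group."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

for name, group in [("before split", before), ("left child", left), ("right child", right)]:
    print(name, majority(group))
```

The left child is already pure (every sample is +), so it could become a leaf, while the right child still mixes two targets and would be split again.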

****
<p id="2."></p>

# <center>2. evaluating a split 🧐</center>
The crucial part of building a decision tree is deciding at which variable and which value to split/which questions to ask when. To do that we need some way to evaluate a split/tell which split is better than another one. This subchapter will cover exactly that. So to start out, let's take a closer look at how splitting works. Below you see a dataset with 2 feature variables and one target/label variable. You can use the interactive elements to get an idea of how splitting the data according to criteria affects the distribution of the target variable:

In [None]:
data = {"var1":np.array([1, 2, 5, 3, 2, 1, 4]),     
        "var2":np.array([5, 34, 23, 3, 42, 13, 54]), 
        "target":np.array(["red", "blue", "red", "red", "blue", "blue", "red"])}
pd.DataFrame(data).head()

In [None]:
from ipywidgets import widgets

split = []
#  function to visualise splits
def visualise_split(split, ax):
    x = [i for i in range(len(split[0]))]+[i+len(split[0])+1 for i in range(len(split[1]))]    
    split_total = np.concatenate([split[0],split[1]])
    colors_split = np.where(split_total=="blue", ["#1c46ff" for i in range(len(split_total))], ["#ff3333" for i in range(len(split_total))])
    ax.scatter(x, [0 for i in range(len(x))], 2000, color=colors_split)
    ax.axvline(len(split[0]), color="black")
    ax.axes.xaxis.set_visible(False)
    ax.axes.yaxis.set_visible(False)
    ax.set_ylim(-0.01, 0.01)
    ax.set_xlim(-1, len(split_total)+1)

def make_split(var, value):
    split = [data["target"][data[var] <= value], data["target"][data[var] > value]]

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 4))

    befor_split = data["target"]
    colors_split = np.where(befor_split=="blue", ["#1c46ff" for i in range(len(befor_split))], ["#ff3333" for i in range(len(befor_split))])
    ax1 = plt.subplot(2, 3, 2)
    ax1.scatter(range(len(befor_split)), [0 for i in range(len(befor_split))], 2000, color=colors_split)
    ax1.axes.xaxis.set_visible(False)
    ax1.axes.yaxis.set_visible(False)
    ax1.set_ylim(-0.01, 0.01)
    ax1.set_xlim(-1, len(befor_split))
    ax1.set_title("target before split", fontdict={'fontsize': 20})
    
    visualise_split(split, ax2)
    ax2.set_title("target after split", fontdict={'fontsize': 20})
    
    fig.tight_layout() 
    plt.show()

# found a tutorial on getting the ipywidgets to work on kaggle here: https://www.kaggle.com/code/atorabi/intro-to-ipywidgets/notebook
variable_selector = widgets.Select(description="", options=["var1", "var2"], layout = widgets.Layout(width="50px", height="50px"))
value_slider = widgets.SelectionSlider(options=np.unique(np.concatenate([data["var1"], data["var2"]])))

w = widgets.interactive_output(make_split, {"var": variable_selector, "value": value_slider})
widgets.HBox(children = [variable_selector, widgets.VBox([value_slider, w])])

##### <div class="alert alert-block alert-info">⚠️ <strong>Important:</strong> After making this interactive visualisation I discovered that it <strong>doesn't work in viewer mode</strong>. Still, when <strong>forking and going into edit mode</strong> the visualisation gets updated.</div>

Now that the principle of splitting is hopefully crystal clear, three specific splits will be considered to see what makes a good and what makes a bad split.

In [None]:
#  and prepare 3 sample splits
split1 = [data["target"][data["var2"] >= 12], data["target"][data["var2"] < 12]] # variable: var2, value: 12
split2 = [data["target"][data["var1"] >= 3], data["target"][data["var1"] < 3]] # variable: var1, value: 3
split3 = [data["target"][data["var2"] >= 20], data["target"][data["var2"] < 20]] # variable: var2, value: 20

print("sample splits:\n")
#  visualising splits
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 1))

visualise_split(split1, ax1)
ax1.set_title("split 1: variable = var2, value = 12", fontdict={'fontsize': 20})
visualise_split(split2, ax2)
ax2.set_title("split 2: variable = var1, value = 3", fontdict={'fontsize': 20})
visualise_split(split3, ax3)
ax3.set_title("split 3: variable = var2, value = 20", fontdict={'fontsize': 20})

plt.show()

<details> 
  <summary>Consider the splits above, from what was covered till now, <strong>which split would be the best?</strong> (result in the optimal children, regarding the target) -> click to see answer</summary>
   <center><strong>→ split 2 results in groups that each contain nearly perfectly homogeneous target categories, this is referred to as <u>purity</u></strong></center>
</details>
<br>

**How impure a group is, is determined by the homogeneity of its target/label:**
* split 1: medium impurity
* split 2: low impurity
* split 3: high impurity

Since the original goal of building a decision tree, is to split the data till it's divided into subgroups that share the same target, ***the lower the impurity, the better***.<br>

You can also think of it as **uncertainty**: high impurity (split 3) equals high uncertainty (take a random sample from the group, how certain are you about the target value of that sample?). Low impurity (split 2) equals low uncertainty (consider the groups split 2 results in: taking a random sample from one of the groups and knowing from which group it came, you can almost certainly know its target value)

**so the goal of splitting is to reduce impurity and with that uncertainty.** Having said all that, it becomes clear what tools are needed to determine how good a split is:
* an impurity/uncertainty measure
* a formula to determine how much the whole split reduced impurity/uncertainty

the 2 most common impurity/uncertainty measures together with that formula will be explained and implemented in the following 3 subchapters.

**impurity measure preface:**<br>
since we have looked at impurity on a quite abstract level till now, let's formulate it in more mathematical terms:
* impurity/uncertainty is high if the categories are evenly spread -> ***high impurity ≙ equal probabilities***
* impurity/uncertainty is low if a category is "dominating" -> ***low impurity ≙ high probability of one category (low probabilities of all others)***

So let's look at a binary example, and more specifically at the probability of +: 
* [+++++,-] ≙ low impurity -> high probability of +
* [+++,---] ≙ high impurity -> equal probability of + (and -)
* [+,-----] ≙ low impurity -> low probability of + (but due to that a high probability of -)

So ***an impurity measure must put out high values for equal probabilities and low values for unequal probabilities***
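
As a quick sanity check of that requirement, here is a sketch with a deliberately simple stand-in measure: the share of samples that do not belong to the most frequent category (the misclassification rate). This is not one of the two measures covered in the next subchapters, and the function name is an assumption made for this example.

```python
import numpy as np

def misclassification_rate(labels):
    """Share of samples that do NOT belong to the most frequent category."""
    _, counts = np.unique(labels, return_counts=True)
    return 1 - counts.max() / counts.sum()

examples = {
    "[+++++,-]": list("+++++-"),
    "[+++,---]": list("+++---"),
    "[+,-----]": list("+-----"),
}
for name, labels in examples.items():
    print(name, round(misclassification_rate(labels), 3))
```

The evenly mixed group gets the highest value and both lopsided groups get the same, lower value, which is exactly the behaviour asked for above.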

##### <div class="alert alert-block alert-info">⚠️ <strong>Attention:</strong> The next section will provide formulas and intuitions for them, but to explain them <strong>mathematically precisely</strong>, I'd have to fill the whole notebook with just one concept. So I suggest looking into other resources and skipping the next 3 subchapters <strong>if you want everything to be perfectly detailed and formally/officially right.</strong></div>

****

<p id="2.1."></p>

## <center>2.1. entropy</center>

**Entropy** is one of the more important concepts in data science, so even tho the measure described in the next subchapter (Gini index) is more computationally efficient and therefore more often used for decision trees, entropy will be covered first in this notebook. Its formula is: <br>

$\large H=-\sum p(x)*log(p(x))\>\>\>\>\>$   or, less often, written as:  $\large \>\>\>\>\> H=\sum p(x)*log(\frac{1}{p(x)})$<br>

what this formula does will be discussed part by part in this section, together with 3 different perspectives to look at it.<br>

#### **entropy as impurity measure:**<br>
The formula might look confusing at first but let's analyse it part by part:
1. the whole term gets multiplied by -1 (why is covered later)
2. a sum is taken over each ***category*** of the target variable
3. the term that's calculated for each category can be further split up into 2 parts
    * $p(x)$ this is the probability of the category occurring (the proportion as probability estimate). 
    * $log(p(x))$ 
    * -> All probabilities of all classes sum up to 1, so you can imagine this part as ***weighted average*** of $log(p(x))$ over all the categories 

Let's take a closer look at the $log(p(x))$ term:
<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Binary_logarithm_plot_with_ticks.svg/1200px-Binary_logarithm_plot_with_ticks.svg.png" style="width:auto; height:300px" alt="eloquentarduino.github.io"></center>

The closer the probability of x is to 1, the closer $log(p(x))$ is to 0. The closer the probability of x is to 0, the closer $log(p(x))$ is to -infinity. So when taking the weighted average (weighting by the same probabilities), the following happens: low probabilities result in really low values for $log(p(x))$ but don't contribute much to the average due to being weighted by their low probability. High probabilities result in a value close to 0 for $log(p(x))$ and contribute much to the average.

* So to achieve the lowest possible value with the $\sum p(x)*log(p(x))$ term, all probabilities have to be equal.
* And to achieve the value closest to 0, the probabilities have to be as unequal as possible (one high probability, small probabilities for all other categories)

***sounds familiar?*** That's more or less what we wanted to achieve with an impurity measure in the first place, just that the value gets lower for impure and higher for pure data - a purity measure so to speak. But that's easy to correct: **multiplying $\sum p(x)*log(p(x))$ by -1 results in high values (close to 1 in the binary case with a base-2 logarithm) for impure data and low values (close to 0) for pure data**.

Let's recall the **binary example** from earlier:<br>

the entropy formula can now be written as: $\large H= -(p(x_1)*log(p(x_1)))-((1-p(x_1))*log(1-p(x_1)))$ where x_1 is one of the 2 categories. Plotting this as a graph would look like this:<br>

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Binary_entropy_plot.svg/200px-Binary_entropy_plot.svg.png"></center>

so for the examples the entropy is:
* [+++++,-] ≙ rather low impurity -> entropy is about 0.65 
* [+++,---] ≙ high impurity -> entropy is 1
* [+,-----] ≙ rather low impurity -> entropy is about 0.65 
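
As a rough check of the first example, assuming a base-2 logarithm (as in the plot above), the proportions $\frac{5}{6}$ and $\frac{1}{6}$ can be plugged into the formula:

$\large H=-(\frac{5}{6}*log_2(\frac{5}{6})+\frac{1}{6}*log_2(\frac{1}{6}))\approx 0.219+0.431\approx 0.65$

The third example gives the same value, since only the roles of + and - are swapped. For the evenly mixed example both terms are $\frac{1}{2}*log_2(\frac{1}{2})=-0.5$, which adds up to an entropy of exactly 1.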


#### **entropy as uncertainty measure:**

As already mentioned in the last subchapter, you can view entropy as a measure of uncertainty. This is best demonstrated by looking at an example:
* [+++++,-] ≙ entropy is about 0.65 -> you can be pretty certain that a random sample would be +, so uncertainty is rather low
* [+++,---] ≙ entropy is 1 -> you are completely uncertain if a random sample would be + or -, due to them having the same probability of occurring
* [+,-----] ≙ entropy is about 0.65 -> you can be pretty certain that a random sample would be -, so uncertainty is rather low 

#### **entropy as information measure:**

This is the original viewpoint/approach to entropy and the one it was invented for by the godfather of information theory: Claude Shannon. The original intuition was that entropy calculates the amount of information in bits, that is conveyed on average when identifying the value of a random sample from a set of possible outcomes + probabilities.

The information content or surprisal of an event E is expressed by the term: $\large I(E) = log_2(\frac{1}{p(E)})$. If the probability of an event occurring is low, the amount of information conveyed when spotting its occurrence is high. If the probability of an event occurring is high, the amount of information conveyed when spotting its occurrence is low. It can be imagined as a pseudo-inverse of the probability.<br>

Looking at entropy when written as $\large H=\sum p(x)*log(\frac{1}{p(x)})$ and taking into account that the expected value of a random variable is given by $\large E=\sum p(x_i)*x_i$, it becomes clear that entropy is the **expected value of information content/surprisal.**
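
A minimal sketch of that relationship, assuming a made-up distribution over three outcomes (the probabilities and the name `surprisal` are chosen for this illustration only):

```python
import numpy as np

def surprisal(p):
    """Information content in bits of an event with probability p."""
    return np.log2(1 / p)

probs = np.array([0.5, 0.25, 0.25])    # hypothetical distribution, sums to 1

bits_per_outcome = surprisal(probs)    # rarer outcomes carry more information
expected_bits = np.sum(probs * bits_per_outcome)   # expected surprisal = entropy

print(bits_per_outcome, expected_bits)
```

The outcome with probability 0.5 carries 1 bit, the two outcomes with probability 0.25 carry 2 bits each, and weighting them by their probabilities gives 1.5 bits on average.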

<br>

implementation:

In [None]:
def calc_entropy(X):
    weighted_average_of_props = 0
    for categ in np.unique(X):
        prop = np.where(X == categ, 1, 0).sum() / len(X)
        weighted_average_of_props += prop * np.log2(prop) 
    return -1 * weighted_average_of_props

# applied on our binary example:
ex1 = np.array(["+", "+", "+", "+", "+", "-"])
ex2 = np.array(["+", "+", "+", "-", "-", "-"])
ex3 = np.array(["+" ,"-", "-", "-", "-", "-"]) 

entropy_ex1 = calc_entropy(ex1)
entropy_ex2 = calc_entropy(ex2)
entropy_ex3 = calc_entropy(ex3)

print(f"binary example:\n\n{ex1}: entropy = {entropy_ex1}\n{ex2}: entropy = {entropy_ex2}\n{ex3}: entropy = {entropy_ex3}\n")

****

<p id="2.2."></p>

## <center>2.2. gini index</center>

The **Gini index** is given by $\large gini=1-\sum p(x)^2$ and has practically the same attributes as entropy just on a slightly different scale. That's why I won't go much further into details here.<br>

Still let's at least take a look at the graph resulting from the binary example of the past, in comparison to entropy:

<center><img src="https://cdn-images-1.medium.com/max/674/1*ovBMTgXvj3hmDHvqqxm87A.png" style="width:auto; height:300px" alt="medium.com"></center>

**main differences between gini and entropy:**
* max value: gini 0.5 and entropy 1
* computational efficiency: entropy requires calculating a logarithm, therefore ***the gini index is more efficient***
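
The different scales can be checked numerically. The sketch below assumes the binary case with a base-2 logarithm; `binary_entropy` and `binary_gini` are helper names made up for this comparison.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (base 2) of a two-category target where one category has probability p."""
    q = 1 - p
    return -(p * np.log2(p) + q * np.log2(q))

def binary_gini(p):
    """Gini index of a two-category target where one category has probability p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

# probabilities of exactly 0 or 1 are left out, log2(0) is not defined
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p = {p:.2f}: entropy = {binary_entropy(p):.3f}, gini = {binary_gini(p):.3f}")
```

Both measures peak at p = 0.5 and are symmetric around it, they only differ in how high the peak is.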


<br>


implementation:

In [None]:
def calc_gini(X):
    square_props = 0
    for categ in np.unique(X):
        prop = np.where(X == categ, 1, 0).sum() / len(X)
        square_props += prop ** 2
    return 1 - square_props

# applied on our binary example:
ex1 = np.array(["+", "+", "+", "+", "+", "-"])
ex2 = np.array(["+", "+", "+", "-", "-", "-"])
ex3 = np.array(["+" ,"-", "-", "-", "-", "-"]) 

gini_ex1 = calc_gini(ex1)
gini_ex2 = calc_gini(ex2)
gini_ex3 = calc_gini(ex3)

print(f"binary example:\n\n{ex1}: gini index = {gini_ex1}\n{ex2}: gini index = {gini_ex2}\n{ex3}: gini index = {gini_ex3}\n")

****

<p id="2.3."></p>

## <center>2.3. information gain</center>

Now that the 2 most important uncertainty statistics were covered this subchapter will move back to the original question **how to evaluate a split using these statistics?**. <br>

To do that **two major problems** have to be solved:
1. a split results in 2 children: somehow the impurity of both has to be considered 
2. To know how good a split is (in comparison to others) there has to be a measure of how much a split ***reduces impurity/uncertainty***

<details> 
  <summary>Since both problems can be solved quite easily, I'd suggest <strong>taking the time to think about solutions yourself</strong> before <strong>clicking here</strong> to see how information gain solves both</summary>
   <ul>
   <li>1. taking a <strong>weighted average</strong> of the separate impurities of both children (weighted by the proportion of samples in the child, relative to the data before splitting)</li>
   <li>2. subtracting the weighted average of impurities after splitting from the impurity before splitting: <strong>by how much was impurity reduced by making that split?</strong> This results in a value comparable over all splits: the split that reduces impurity/uncertainty the most is the best.</li>    
   </ul>
</details>
<br>

Put into a mathematical formula this results in: <br>

$\large Gain(S,A)=Entropy(S)-\sum_{v\>\in \>val\>A}\frac{\lvert S_v\rvert}{\lvert S\rvert}Entropy(S_v)$

looks confusing at first but that's why the intuition was covered before: it does exactly what is mentioned above:
* S always stands for a subset: $S$ is the original subset before splitting, $S_v$ is the subset v after splitting
* the sum is taken over a term calculated on each S_v created by the split
* the vertical lines around these subsets in the term $\frac{\lvert S_v\rvert}{\lvert S\rvert}$ mean the *total number of samples in the subset*. 
* So the term $\frac{\lvert S_v\rvert}{\lvert S\rvert}$ just calculates the proportion of S_v(the subset of a child) relative to the original subset S  -> this results in the previously discussed **weighted average**.
* **note:** strictly speaking, weighted average is not the theoretically correct name, but in practice it does basically that. 
* and that weighted average is taken over the impurity of the subset v: $Entropy(S_v)$ note: **you can plug in Gini instead of entropy in this formula**
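
As a rough worked example, assuming base-2 entropy and split 2 of the sample data (var1 >= 3 against var1 < 3): the 7 targets before splitting are 4 red and 3 blue (entropy ≈ 0.985), one child holds 3 red samples (entropy 0) and the other holds 1 red and 3 blue samples (entropy ≈ 0.811). Plugged into the formula:

$\large Gain\approx 0.985-(\frac{3}{7}*0+\frac{4}{7}*0.811)\approx 0.985-0.464\approx 0.52$

So this split removes a bit more than half of the uncertainty that was there before splitting.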

**why information gain?**<br>
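In information theory, information gain is often used as a synonym for the ***Kullback–Leibler divergence***. For decision trees it usually means the expected reduction in entropy of the target Y once the split variable X is known (the mutual information of Y and X):<br>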

$IG(Y,X)=H(Y)-H(Y|X)$<br>

It is also called information gain due to it representing the amount of information "gained" when knowing the distribution of X in addition to the distribution of Y and looking at entropy from the original information theory perspective.

In [None]:
#  S : original subset, S_l : subset of left child, S_r : subset of right child
def information_gain(S, S_l, S_r, method):
    #  if entropy is used as impurity measurement
    if method == "entropy":
        S_impurity = calc_entropy(S)
        S_l_impurity = calc_entropy(S_l)
        S_r_impurity = calc_entropy(S_r)

    #  if gini index is used as impurity measurement (default due to being less computationally intensive)
    elif method == "gini":
        S_impurity = calc_gini(S)
        S_l_impurity = calc_gini(S_l)
        S_r_impurity = calc_gini(S_r)

    else:
        raise ValueError("expected method to be 'gini' or 'entropy'")
    
    return S_impurity - (S_l_impurity * (len(S_l) / len(S)) 
                        + S_r_impurity * (len(S_r) / len(S)))

With that concept implemented we are now able to **evaluate the splits** you saw earlier. This is done by calculating the information gain for each split:

In [None]:
original_subset = data["target"]

inf_gain1 = information_gain(original_subset, split1[0], split1[1], method="entropy")
inf_gain2 = information_gain(original_subset, split2[0], split2[1], method="entropy")
inf_gain3 = information_gain(original_subset, split3[0], split3[1], method="entropy")

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 1))

visualise_split(split1, ax1)
ax1.set_title("split 1: variable = var2, value = 12, inf. gain = {:.2f}".format(inf_gain1), fontdict={'fontsize': 20})
visualise_split(split2, ax2)
ax2.set_title("split 2: variable = var1, value = 3, inf. gain = {:.2f}".format(inf_gain2), fontdict={'fontsize': 20})
visualise_split(split3, ax3)
ax3.set_title("split 3: variable = var2, value = 20, inf. gain = {:.2f}".format(inf_gain3), fontdict={'fontsize': 20})

plt.show()

So the splits can be ranked in the following order:<br>
* split 2
* split 1
* split 3

****

<p id="3."></p>

# <center>3. greedy splitting</center>

The way the best split is found is often referred to as greedy splitting. At each node every possible split (each variable combined with each of its values) is considered and the one that results in the highest information gain is chosen, without looking ahead at how later splits might turn out.
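
To see the search itself, here is a standalone sketch that lists every candidate split of the small sample dataset together with its gain. It assumes the Gini index as impurity measure and skips the largest value of each variable, since splitting there would leave one child empty. The helper names `gini` and `gain` are made up for this sketch.

```python
import numpy as np

# the sample dataset from above: columns are var1 and var2
X = np.column_stack([[1, 2, 5, 3, 2, 1, 4], [5, 34, 23, 3, 42, 13, 54]])
y = np.array(["red", "blue", "red", "red", "blue", "blue", "red"])

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def gain(parent, left, right):
    # impurity before the split minus the weighted impurity of both children
    w_l, w_r = len(left) / len(parent), len(right) / len(parent)
    return gini(parent) - (w_l * gini(left) + w_r * gini(right))

candidates = []
for var_i in range(X.shape[1]):
    for value in np.unique(X[:, var_i])[:-1]:
        mask = X[:, var_i] <= value
        candidates.append((gain(y, y[mask], y[~mask]), var_i, value))

for score, var_i, value in sorted(candidates, key=lambda c: c[0], reverse=True):
    print(f"var{var_i + 1} <= {value}: gain = {score:.3f}")

best_gain, best_var, best_value = max(candidates, key=lambda c: c[0])
```

On this data the best candidate is var1 <= 2, which produces the same two groups as split 2 from the previous chapter.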

In [None]:
class Spliter():
    
    def __init__(self, method="gini"):
        self.method = method
        
    def calc_entropy(self, X, sample_weights=None):
        weighted_average_of_props = 0
        for categ in np.unique(X):
            prop = np.where(X == categ, 1, 0).sum() / len(X)
            weighted_average_of_props += prop * np.log2(prop)  # base 2, so binary entropy peaks at 1
        return -1 * weighted_average_of_props
    
    def calc_gini(self, X, sample_weights=None):
        square_props = 0
        for categ in np.unique(X):
            prop = np.where(X == categ, 1, 0).sum() / len(X)
            square_props += prop ** 2
        return 1 - square_props
    
    def information_gain(self, S, S_l, S_r):
        #  if entropy is used as impurity measurement
        if self.method == "entropy":
            S_impurity = self.calc_entropy(S)
            S_l_impurity = self.calc_entropy(S_l)
            S_r_impurity = self.calc_entropy(S_r)
            
        #  if gini index is used as impurity measurement (default due to being less computationally intensive)
        elif self.method == "gini":
            S_impurity = self.calc_gini(S)
            S_l_impurity = self.calc_gini(S_l)
            S_r_impurity = self.calc_gini(S_r)
            
        else:
            raise ValueError("expected method to be 'gini' or 'entropy'")
        
        return S_impurity - (S_l_impurity * (len(S_l) / len(S)) 
                             + S_r_impurity * (len(S_r) / len(S)))
    
    def get_best_split(self, X, y):
        best_var, best_value, best_score = None, None, 0
        
        variables = range(X.shape[1])
        
        #  iterate over each possible variable
        for var_i in variables: 
            #  iterate over each possible value (the largest one is skipped, it would leave the right child empty)
            for value in np.unique(X[:, var_i])[:-1]:
                score = self.information_gain(y,                        # target before split
                                              y[X[:, var_i] <= value],  # target after split left child
                                              y[X[:, var_i] > value])   # target after split right child
                
                #  if the resulting split is better than the one before save it
                if score > best_score:
                    best_var, best_value, best_score = var_i, value, score
                    
        return best_var, best_value, best_score

Now we can not only evaluate the 3 given splits of the sample dataset, but also find the best one:

In [None]:
splitter = Spliter()

X = np.column_stack([data["var1"], data["var2"]])
y = data["target"]

best_var, best_value, best_score = splitter.get_best_split(X, y)
split = (y[X[:, best_var] > best_value], y[X[:, best_var] <= best_value])

fig, ax1 = plt.subplots(1, 1, figsize=(20, 2))
visualise_split(split, ax1)
plt.show()
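
To make the greedy search a bit more concrete, here is a minimal, self-contained sketch on a made-up one-feature dataset. The helper names `gini` and `gain_for_threshold` and the toy arrays are assumptions for illustration only, they are not part of the classes of this notebook:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D label array (0.0 for an empty array)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    props = counts / counts.sum()
    return 1.0 - np.sum(props ** 2)

def gain_for_threshold(x, y, threshold):
    """Impurity decrease when splitting into x <= threshold and x > threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted_children

#  toy data (hypothetical): the classes separate cleanly between 3 and 4
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 0, 1, 1, 1])

#  every unique value except the largest is a candidate threshold
candidates = np.unique(x)[:-1]
gains = {int(t): float(gain_for_threshold(x, y, t)) for t in candidates}
best_threshold = max(gains, key=gains.get)

print(gains)
print("best threshold:", best_threshold)
```

The threshold 3 wins here because it is the only one that leaves both children pure.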

****

<p id="4."></p>

# <center>4. building a classification tree 🏗️</center>

In this section, the actual building/fitting process of a decision tree will be explained.

The key idea here is ***recursive splitting***.<br>
For anyone who has already looked into recursive programming a bit, this should be relatively straightforward. For everyone who hasn't, here is a short explanation of recursion:<br>
* in contrast to iterative programming, no loops are used to repeatedly compute a pattern
* instead, methods are used that recursively call themselves till some kind of exit/termination condition is fulfilled 
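
A tiny sketch of that idea, not related to trees yet: the hypothetical function `count_halvings` calls itself on a smaller problem until its exit condition is fulfilled, without using any loop.

```python
def count_halvings(n_samples, min_size=2):
    """Count how often n_samples can be halved before it falls below min_size."""
    if n_samples < min_size:   # exit/termination condition: stop recursing
        return 0
    #  recursive call: same function, smaller problem
    return 1 + count_halvings(n_samples // 2, min_size)

print(count_halvings(16))
print(count_halvings(1))
```

The first call walks through 16, 8, 4 and 2 before the exit condition triggers at 1, so it returns 4; the second call hits the exit condition immediately and returns 0.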

Let's recall how a decision tree is constructed:
* the data repeatedly gets split into 2 subsets at the split resulting in the highest information gain
* when no split results in an information gain higher than 0, i.e. no split reduces the uncertainty any further, a terminal node or leaf is created

This creates a pattern of splitting or creating leaves that is repeatedly applied to the subsets created by the last split. Before taking a look at the recursive implementation of this pattern, we have to determine some way of creating the tree **structure**. 

This could be done by nesting lists or dictionaries, but also by nesting objects that represent the building blocks of a decision tree. I think that while the first one might be slightly more efficient/easier to save, the second one is more intuitive and easier to understand. So for this implementation, the following classes are going to be used:

In [None]:
#  these classes could be combined into one node class, but I think it is easier to imagine when splitting into separate node and leaf "building-blocks"

class Node():
    def __init__(self, left_child, right_child, split, layer=None):
        self.right_child = right_child
        self.left_child = left_child
        self.split = split
        self.type = "Node"

class Leaf():
    def __init__(self, value):
        self.value = value
        self.type = "Leaf"
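
For comparison, the dictionary alternative could look roughly like the following sketch. The keys `"split"`, `"left"`, `"right"` and `"value"`, the helper `predict_row` and the splits themselves are made up for illustration:

```python
import json

#  a small hand-written tree as nested dictionaries (hypothetical splits)
tree = {"split": (0, 10),                      # (feature index, threshold)
        "left": {"value": "a"},
        "right": {"split": (1, 0.5),
                  "left": {"value": "b"},
                  "right": {"value": "c"}}}

def predict_row(node, row):
    """Follow the splits until a dictionary holding a 'value' is reached."""
    if "value" in node:
        return node["value"]
    feature, threshold = node["split"]
    branch = "left" if row[feature] <= threshold else "right"
    return predict_row(node[branch], row)

print(predict_row(tree, [3, 0.9]))
print(predict_row(tree, [12, 0.2]))
print(predict_row(tree, [12, 0.7]))

#  nested dictionaries can be written to JSON directly (tuples become lists)
restored = json.loads(json.dumps(tree))
```

This is what "easier to save" means in practice: the nested dictionaries survive a round trip through JSON, while nested objects would need something like pickling.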
        

They can be nested into each other by setting the left/right child of a node to another node whose left and right children again contain nodes, etc., until the leaves are reached. Combining this with the recursion idea results in something that looks like this:<br>

In [None]:
def build_tree(X, y):
    best_var, best_value, _ = splitter.get_best_split(X, y)  # splitter: the Spliter instance created above

    #  check whether to split or create a leaf node
    if best_var is None:  # best_var is None when no split has an information gain > 0 -> see subchapter information gain
        #  the leaf predicts the most frequent class (mode) of the remaining targets
        classes, counts = np.unique(y, return_counts=True)
        return Leaf(classes[np.argmax(counts)])
    else:
        left, right = X[:, best_var] <= best_value, X[:, best_var] > best_value
        return Node(build_tree(X[left], y[left]),    # left child
                    build_tree(X[right], y[right]),  # right child
                    (best_var, best_value))          # save split

Note that the children of the returned node are set to the result of calling the same function on the corresponding subset the split produces. Through this recursive element, every node's children are themselves nodes (or leaves) computed on the subsets resulting from the split, and so on. 

So when calling build_tree() and saving the result in some kind of root node variable, the whole structure gets built and nested into this variable.

****

<p id="5."></p>

# <center>5. how to prevent overfitting</center>

At this point **everything needed to implement a decision tree was covered**, but there is one small detail that would ruin the success of the model: overfitting. So far the algorithm continues to split the data **until an additional split wouldn't result in a decrease of uncertainty/impurity**. This creates a model with **large variance** and small bias, meaning it **often works nearly perfectly on the training data** but ***overfits*** it and therefore performs considerably worse on the test data. 

> The model hasn't learned to solve the **general problem** but to specifically classify the data points from the training data: its variance grows until the chain of questions separates the training data nearly perfectly but no longer works on the real problem

To prevent overfitting from happening, the variance of the model has to be reduced while keeping the bias as low as possible. The most important method to do this with decision trees is called ***pruning***. Pruning seeks to reduce the complexity/size of a decision tree by reducing the number of nodes and leaves in the tree. The idea is to cut or prevent the creation of nodes/leaves non-critical to the actual problem.

Pruning can be further split(😉🥴) into 2 categories: 
* **pre-pruning** (pruning while/before fitting the tree)
* **post-pruning** (pruning after fitting the tree)

#### **pre-pruning:**
* prevents the creation of unnecessary leaves and nodes by introducing additional terminal/stop conditions to the recursive splitting algorithm. Some of the possible criteria/conditions are:
    * maximal depth (the maximal number of consecutive splits allowed on the path from the root to any leaf)
    * minimal amount of samples in split (the minimal amount of samples needed to split; if the subset is smaller, a leaf is created no matter the impurity)
    * minimal amount of samples in leaf (a split is only allowed if each resulting child keeps at least this many samples)
    * minimal gain to split (if the maximal achievable information gain is below that threshold, a leaf is created)
* is also referred to as the early stopping rule

example implementation of a few criteria:

In [None]:
max_depth = 20
min_samples_split = 20

def build_tree(X, y, depth=0):
    best_var, best_value, _ = splitter.get_best_split(X, y)  # splitter: the Spliter instance created above

    #  check whether to split or create a leaf node
    if (best_var is None) or (len(X) < min_samples_split) or (depth >= max_depth):
        classes, counts = np.unique(y, return_counts=True)
        return Leaf(classes[np.argmax(counts)])  # most frequent class
    else:
        left, right = X[:, best_var] <= best_value, X[:, best_var] > best_value
        #  each child sits one level deeper than its parent
        return Node(build_tree(X[left], y[left], depth + 1),    # left child
                    build_tree(X[right], y[right], depth + 1),  # right child
                    (best_var, best_value))                     # save split
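
The two remaining criteria from the list, minimal samples per leaf and minimal gain, could be checked in a similar way. A minimal sketch, assuming a hypothetical helper `should_stop` and made-up default thresholds (neither is used by the classes in this notebook):

```python
def should_stop(n_left, n_right, gain, depth,
                max_depth=20, min_samples_split=20,
                min_samples_leaf=5, min_gain=1e-3):
    """Return True if a leaf should be created instead of splitting further."""
    n_samples = n_left + n_right
    if depth >= max_depth:                        # branch is already deep enough
        return True
    if n_samples < min_samples_split:             # too few samples to split at all
        return True
    if min(n_left, n_right) < min_samples_leaf:   # one child would become too small
        return True
    if gain < min_gain:                           # split barely improves impurity
        return True
    return False

print(should_stop(n_left=50, n_right=50, gain=0.1, depth=3))
print(should_stop(n_left=97, n_right=3, gain=0.1, depth=3))
```

The first call allows the split, while the second one is stopped because the right child would only keep 3 samples.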

#### **post-pruning**
* removes unnecessary nodes + leaves from the already constructed tree by utilising some algorithm
* probably the most widely used algorithm for post-pruning is **cost complexity pruning**, where the best number of nodes+leaves to cut is determined by a cost function that weighs the training error against the complexity of the tree

**cost complexity pruning** is more elaborate to implement, especially when using the system of nested objects that this notebook goes for. So it can be covered in its own notebook in the future if there is any demand for that.
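
To give at least a rough idea of the cost function: cost complexity pruning scores a (sub)tree as `training error + alpha * number of leaves`, where `alpha` controls how strongly the size of the tree is penalised. The sketch below only evaluates this score for a few made-up candidate subtrees. The error values and leaf counts are invented for illustration, and this is not an implementation of the full pruning procedure:

```python
def cost_complexity(training_error, n_leaves, alpha):
    """Training error plus a penalty that grows with the number of leaves."""
    return training_error + alpha * n_leaves

#  hypothetical candidates: (name, training error, number of leaves)
candidates = [("full tree", 0.02, 40),
              ("pruned once", 0.05, 12),
              ("stump", 0.20, 2)]

def best_subtree(alpha):
    """Name of the candidate with the lowest cost for a given alpha."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[2], alpha))[0]

for alpha in (0.0, 0.005, 0.05):
    print(alpha, best_subtree(alpha))
```

With `alpha = 0` the full tree wins because only the training error counts; the larger `alpha` gets, the smaller the preferred tree becomes.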

****

<p id="6."></p>

# <center>6. putting it all together: <strong>classification on titanic dataset</strong></center>

**Finally enough was covered to make the final decision tree implementation 🎉.**
The implementations are going to be coded in a way that they can be used in a similar fashion to the sklearn equivalents, so that switching between them is as easy as possible. 

#### preparing the data
Let's start out by loading the data, encoding the categorical features (because the algorithm can only deal with numerical data), and splitting it into a training and a test set:

In [None]:
titanic_data = pd.read_csv("../input/titanic/train.csv", usecols=["Survived", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]).dropna()
titanic_data_encoded = pd.get_dummies(titanic_data, columns=["Sex", "Embarked"])

titanic_data_train = titanic_data_encoded.sample(frac=0.7)
titanic_data_test = titanic_data_encoded[~titanic_data_encoded.index.isin(titanic_data_train.index)]
titanic_data_train.head()
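
The same encode-and-split pattern on a tiny made-up DataFrame, as a sketch. The column names and values are invented; `random_state` is an optional parameter of `DataFrame.sample` that makes the split reproducible, which the cell above does not use:

```python
import pandas as pd

#  made-up data: one target and one categorical feature
toy = pd.DataFrame({"target": [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
                    "colour": list("rgbrgbrgbr")})

#  one-hot encode the categorical column
toy_encoded = pd.get_dummies(toy, columns=["colour"])

#  70 % of the rows for training, the remaining rows for testing
train = toy_encoded.sample(frac=0.7, random_state=0)  # fixed seed -> same split on every run
test = toy_encoded[~toy_encoded.index.isin(train.index)]

print(sorted(toy_encoded.columns))
print(len(train), len(test))
```

Selecting the test rows via the index guarantees that no row ends up in both sets.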

#### creating the DecisionTreeClassifier class
All that's left to do is to collect the individual concepts implemented so far and put them into a class, which results in:

In [None]:
class DecisionTreeClassifier():
    
    def __init__(self, method="gini", max_depth=50, min_samples_split=20, max_features=None):
        self.method = method
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features

        self.root_node = None
        self.depth = 0
        
    def calc_entropy(self, X):
        weighted_average_of_props = 0
        for categ in np.unique(X):
            prop = np.where(X == categ, 1, 0).sum() / len(X)
            weighted_average_of_props += prop * np.log10(prop)
        return -1 * weighted_average_of_props
    
    def calc_gini(self, X):
        square_props = 0
        for categ in np.unique(X):
            prop = np.where(X == categ, 1, 0).sum() / len(X)
            square_props += prop ** 2
        return 1 - square_props
    
    def information_gain(self, S, S_l, S_r):
        #  if entropy is used as impurity measurement
        if self.method == "entropy":
            S_impurity = self.calc_entropy(S)
            S_l_impurity = self.calc_entropy(S_l)
            S_r_impurity = self.calc_entropy(S_r)
            
        #  if gini index is used as impurity measurement (default due to being less computationally intensive)
        elif self.method == "gini":
            S_impurity = self.calc_gini(S)
            S_l_impurity = self.calc_gini(S_l)
            S_r_impurity = self.calc_gini(S_r)
            
        else:
            raise ValueError("expected method to be 'gini' or 'entropy'")
        
        return S_impurity - (S_l_impurity * (len(S_l) / len(S)) 
                             + S_r_impurity * (len(S_r) / len(S)))
    
    def get_best_split(self, X, y):
        best_var, best_value, best_score = None, None, 0
        
        if self.max_features is None:
            variables = range(X.shape[1])
        else:
            variables = np.random.choice(X.shape[1], size=self.max_features, replace=False)
        
        for var_i in variables:
            #  candidate thresholds: every unique value except the largest (an empty right child is no split)
            for value in np.unique(X[:, var_i])[:-1]:
                score = self.information_gain(y,                        # target before split
                                              y[X[:, var_i] <= value],  # target after split left child
                                              y[X[:, var_i] > value])   # target after split right child
                if score > best_score:
                    best_var, best_value, best_score = var_i, value, score
                    
        return best_var, best_value
    
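    def build_tree(self, X, y, depth=0):
        #  depth is the level of the current node (root = 0); it is passed down the recursion so that
        #  max_depth limits the length of each branch rather than the total number of splits
        best_var, best_value = self.get_best_split(X, y)
        
        #  check whether to split or create a leaf node
        if (best_var is None) or (len(X) < self.min_samples_split) or (depth >= self.max_depth):
            classes, counts = np.unique(y, return_counts=True)
            return Leaf(classes[np.argmax(counts)])  # most frequent class
        else:
            self.depth = max(self.depth, depth + 1)  # keep track of the deepest level reached
            left, right = X[:, best_var] <= best_value, X[:, best_var] > best_value
            return Node(self.build_tree(X[left], y[left], depth + 1),    # left child
                        self.build_tree(X[right], y[right], depth + 1),  # right child
                        (best_var, best_value))                          # save split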
    
    def get_prediction(self, node, row):
        if node.type == "Node":
            if row[node.split[0]] > node.split[1]:
                return self.get_prediction(node.right_child, row)
            else:
                return self.get_prediction(node.left_child, row)
        else:
            return node.value
        
    def fit(self, X, y):
        self.depth = 0
        self.root_node = self.build_tree(X, y)
                    
    def predict(self, X):
        predictions = [self.get_prediction(self.root_node, row) for row in X]
        return np.array(predictions)
    
    #  functions added for "visualising" structure:
    def get_tree_struckture(self, node, prev_layer):
        if node.type == "Node":
            print(f"walked node on depth lvl. {prev_layer+1}")
            return self.get_tree_struckture(node.left_child, prev_layer+1), self.get_tree_struckture(node.right_child, prev_layer+1)
        else:
            print(f"walked leaf on depth lvl. {prev_layer+1} with the value {node.value}")
                
    def struckture(self):
        self.get_tree_struckture(self.root_node, 0)

#### fitting the classifier
Now all that's left is to initialise a classifier object and fit it on the training data:

In [None]:
X_train = titanic_data_train.loc[:, titanic_data_train.columns!="Survived"].to_numpy() # select everything but the target
y_train = titanic_data_train.loc[:, "Survived"].to_numpy() # select the target

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

I added another recursive function to walk the structure of the created tree. It's not that easy to read, but it's at least something. The structure of the resulting tree looks like this:

In [None]:
classifier.struckture()
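
If the flat output is hard to follow, a slightly more readable variant could indent each line by its depth and print the split itself. A minimal sketch: `ToyNode`, `ToyLeaf` and `describe` are hypothetical names, and the toy classes only mimic the attributes of the `Node` and `Leaf` classes from above so that the snippet runs on its own:

```python
class ToyLeaf:
    def __init__(self, value):
        self.value = value
        self.type = "Leaf"

class ToyNode:
    def __init__(self, left_child, right_child, split):
        self.left_child = left_child
        self.right_child = right_child
        self.split = split          # (variable index, threshold)
        self.type = "Node"

def describe(node, depth=0, lines=None):
    """Collect one indented line per node/leaf, left subtree first."""
    if lines is None:
        lines = []
    indent = "    " * depth
    if node.type == "Node":
        var, value = node.split
        lines.append(f"{indent}x[{var}] <= {value}?")
        describe(node.left_child, depth + 1, lines)
        describe(node.right_child, depth + 1, lines)
    else:
        lines.append(f"{indent}-> predict {node.value}")
    return lines

#  hand-built toy tree with made-up splits
toy_tree = ToyNode(ToyLeaf(0), ToyNode(ToyLeaf(1), ToyLeaf(0), (2, 30.0)), (0, 0.5))
print("\n".join(describe(toy_tree)))
```

Since `describe` only relies on the attributes `type`, `split`, `left_child`, `right_child` and `value`, it should also work on `classifier.root_node`.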

#### evaluating the classifier
Now the classifier can be evaluated on the test data:

In [None]:
X_test = titanic_data_test.loc[:, titanic_data_test.columns!="Survived"].to_numpy() # select everything but the target
y_test = titanic_data_test.loc[:, "Survived"].to_numpy() # select the target

#  make predictions:
predictions = classifier.predict(X_test)

#  confusion matrix:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predictions, labels=[0, 1])
cm_displ = ConfusionMatrixDisplay(cm)
cm_displ.plot()
plt.show()

#  calculate accuracy:
accuracy = np.mean(predictions==y_test)

#  calculate recall:
recall = cm[1, 1]/cm[1, :].sum() # of all actual positives, how many were classified correctly

#  calculate precision:
precision = cm[1, 1]/cm[:, 1].sum() # of all predicted positives, how many were true positives

#  not that necessary for this problem, but for the sake of completeness:
f1 = 2 * ((recall * precision)/(recall + precision)) 

print(f"accuracy = {accuracy},\nrecall = {recall},\nprecision = {precision},\nf1-score = {f1}")
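
To see how these metrics come together, here is a small worked example on made-up labels, using only numpy. The arrays are invented for illustration and have nothing to do with the titanic predictions:

```python
import numpy as np

#  hypothetical ground truth and predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)

print(tp, fn, fp, tn)
print(accuracy, recall, precision, f1)
```

Here 3 of the 4 actual positives are found and 3 of the 4 predicted positives are correct, so recall, precision and f1 are all 0.75, while the accuracy is 0.8.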

#### and there you go!
**If you're a beginner, I hope you have understood the concept of decision trees for classification by now. If so, I'd invite you to continue with the next subchapter, where I'll try explaining how to use this concept for regression tasks. If not, or if this was enough for now, I'd like to thank you for reading and wish you a great rest of your day!**

****

<p id="7."></p>

# <center>7. decision trees for regression </center>

Actually **not much changes** when using decision trees for regression, because all you do is predict values instead of categories. So you basically use the same classification structure but with many leaves, covering most of the value range of the target variable. Let's take a look at a regression problem:

<center><img src="https://www.jcchouinard.com/wp-content/uploads/2021/09/image-70.png"></center>
<center> image source: jcchouinard.com</center>

Note that instead of predicting values using a continuous line like in linear regression, the samples are grouped into regions and the decision tree predicts one constant value per region, which results in a step-shaped prediction.

To make the decision tree algorithm covered so far handle regression, a few things have to be modified:

* the function that is reduced: impurity for classification, ***sum of squared residuals*** for regression
* the value of the leaf: mode for classification, ***mean*** for regression

The rest can basically stay the same, since the regression problem is practically still treated like a large multiclass classification. Even the idea of the implemented **information gain** can be kept: after replacing Gini with a squared-error measure, it calculates how much a split reduces this error. One detail to keep in mind: the size weights of the children belong to per-sample measures such as Gini or the mean squared error, whereas the sum of squared residuals already grows with the subset size, so the child sums are simply added.

In [None]:
#  main changes:

#  sum of squared residuals instead of gini or entropy
def calc_ssr(self, X, value):
    sum_of_squared_residuals = np.square(X - value).sum()
    return sum_of_squared_residuals

#  assigning mean instead of mode to leaf
def make_leaf(y):
    return Leaf(y.mean())
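
A small worked example of both changes on made-up numbers. The helper `ssr` and the toy arrays are assumptions for illustration:

```python
import numpy as np

def ssr(values):
    """Sum of squared residuals around the mean of `values`."""
    return float(np.square(values - values.mean()).sum())

#  toy data (hypothetical): the target jumps between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

threshold = 3.0
left, right = y[x <= threshold], y[x > threshold]

#  how much the split reduces the sum of squared residuals
reduction = ssr(y) - (ssr(left) + ssr(right))

print(ssr(y), ssr(left), ssr(right), reduction)
print("leaf predictions:", left.mean(), right.mean())
```

Before the split the single mean 6.5 leaves an error of 125.5; after the split the two leaves predict 2.0 and 11.0 and only an error of 2.0 each remains, so the split reduces the error by 121.5.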

****

<p id="8."></p>

# <center>8. putting it all together: <strong>regression on house-price dataset</strong></center>

So with that we can make the few necessary changes to the DecisionTreeClassifier class and turn it into a DecisionTreeRegressor. But before that let's quickly prepare a regression dataset:

#### preparing the data

In [None]:
house_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
house_data_encoded = pd.get_dummies(house_data).dropna().reset_index(drop=True)

house_data_train = house_data_encoded.sample(frac=0.7)
house_data_test = house_data_encoded[~house_data_encoded.index.isin(house_data_train.index)]
house_data_train.head()

#### creating the DecisionTreeRegressor class
The DecisionTreeClassifier with the previously discussed modifications:

In [None]:
class DecisionTreeRegressor():
    
    def __init__(self, max_depth=500, min_samples_split=2, max_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features

        self.root_node = None
        self.depth = 0
        
    def calc_ssr(self, X, value):
        sum_of_squared_residuals = np.square(X - value).sum()
        return sum_of_squared_residuals
    
    def loss_reduction(self, S, S_l, S_r):  # not information gain anymore but basically the same idea
        S_loss = self.calc_ssr(S, S.mean())
        S_l_loss = self.calc_ssr(S_l, S_l.mean())
        S_r_loss = self.calc_ssr(S_r, S_r.mean())
        
        #  the SSR is a sum and therefore already scales with the subset size,
        #  so the child losses are simply added (no len(S_l)/len(S) weights as with gini/entropy)
        return S_loss - (S_l_loss + S_r_loss)
    
    def get_best_split(self, X, y):
        best_var, best_value, best_score = None, None, 0
        
        if self.max_features is None:
            variables = range(X.shape[1])
        else:
            variables = np.random.choice([i for i in range(X.shape[1])], size=self.max_features, replace=False)
        
        for var_i in variables:
            for value in np.unique(X[:, var_i])[:-1]:
                score = self.loss_reduction(y,                        # targe before split
                                            y[X[:, var_i] <= value],  # targe after split left child
                                            y[X[:, var_i] > value])   # targe after split right child
                
                if score > best_score:
                    best_var, best_value, best_score = var_i, value, score

        return best_var, best_value
    
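    def build_tree(self, X, y, depth=0):
        #  depth is the level of the current node (root = 0); it is passed down the recursion so that
        #  max_depth limits the length of each branch rather than the total number of splits
        best_var, best_value = self.get_best_split(X, y)
        
        #  check whether to split or create a leaf node
        if (best_var is None) or (len(X) < self.min_samples_split) or (depth >= self.max_depth):
            return Leaf(y.mean())
        else:
            self.depth = max(self.depth, depth + 1)  # keep track of the deepest level reached
            left, right = X[:, best_var] <= best_value, X[:, best_var] > best_value
            return Node(self.build_tree(X[left], y[left], depth + 1),    # left child
                        self.build_tree(X[right], y[right], depth + 1),  # right child
                        (best_var, best_value))                          # save split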
    
    def get_prediction(self, node, row):
        if node.type == "Node":
            if row[node.split[0]] > node.split[1]:
                return self.get_prediction(node.right_child, row)
            else:
                return self.get_prediction(node.left_child, row)
        else:
            return node.value
        
    def fit(self, X, y):
        self.depth = 0
        self.root_node = self.build_tree(X, y)
        
    def predict(self, X):
        predictions = [self.get_prediction(self.root_node, row) for row in X]
        return np.array(predictions)
    
    #  functions added for "visualising" structure:
    def get_tree_struckture(self, node, prev_layer):
        if node.type == "Node":
            print(f"walked node on depth lvl. {prev_layer+1}")
            return self.get_tree_struckture(node.left_child, prev_layer+1), self.get_tree_struckture(node.right_child, prev_layer+1)
        else:
            print(f"walked leaf on depth lvl. {prev_layer+1} with the value {node.value}")
                
    def struckture(self):
        self.get_tree_struckture(self.root_node, 0)

#### fitting the regressor

In [None]:
X_train = house_data_train.loc[:, house_data_train.columns!="SalePrice"].to_numpy() # select everything but the target
y_train = house_data_train.loc[:, "SalePrice"].to_numpy() # select the target

regressor = DecisionTreeRegressor(max_depth=1000, min_samples_split=2)
regressor.fit(X_train, y_train) # will take a bit longer due to larger tree

#### evaluating the regressor
Regression trees tend to overfit easily, which is visible when comparing train to test error:

In [None]:
X_test = house_data_test.loc[:, house_data_test.columns!="SalePrice"].to_numpy() # select everything but the target
y_test = house_data_test.loc[:, "SalePrice"].to_numpy() # select the target

#  make predictions:
predictions = regressor.predict(X_test)

#  calculate rmse:
rmse = np.sqrt(np.square(y_test - predictions).mean())  # square root of the mean squared error

#  calculate R2:
r2 = 1 - np.square(y_test - predictions).sum()/np.square(y_test - y_test.mean()).sum()

print(f"testing data: root mean squared error = {rmse}\nR-squared = {r2}")

#  make predictions:
predictions_train = regressor.predict(X_train)

#  calculate rmse:
rmse_train = np.sqrt(np.square(y_train - predictions_train).mean())  # square root of the mean squared error

#  calculate R2:
r2_train = 1 - np.square(y_train - predictions_train).sum()/np.square(y_train - y_train.mean()).sum()

print(f"training data: root mean squared error = {rmse_train}\nR-squared = {r2_train}")
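
As a small sanity check of the two metrics, the following sketch computes them on made-up arrays where the result can be verified by hand. The numbers are invented for illustration:

```python
import numpy as np

#  hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

residuals = y_true - y_pred          # 1, 0, -1, -3

#  RMSE: first average the squared residuals, then take the root
rmse = np.sqrt(np.square(residuals).mean())

#  taking the root element-wise before averaging gives the mean absolute error instead
mae = np.sqrt(np.square(residuals)).mean()

#  R2: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.square(residuals).sum() / np.square(y_true - y_true.mean()).sum()

print(rmse, mae, r2)
```

The squared residuals sum up to 11, so the RMSE is the root of 2.75 (about 1.66), while the mean absolute error is 1.25. The order of `sqrt` and `mean` therefore matters. R2 is 1 - 11/20 = 0.45.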

So let's use the pre-pruning parameters implemented earlier to reduce overfitting:

In [None]:
regressor_2 = DecisionTreeRegressor(max_depth=150, min_samples_split=30)
regressor_2.fit(X_train, y_train) # should fit faster, since pre-pruning keeps the tree smaller

In [None]:
#  make predictions:
predictions = regressor_2.predict(X_test)

#  calculate rmse:
rmse = np.sqrt(np.square(y_test - predictions).mean())  # square root of the mean squared error

#  calculate R2:
r2 = 1 - np.square(y_test - predictions).sum()/np.square(y_test - y_test.mean()).sum()

print(f"testing data: root mean squared error = {rmse}\nR-squared = {r2}")

#  make predictions:
predictions_train = regressor_2.predict(X_train)

#  calculate rmse:
rmse_train = np.sqrt(np.square(y_train - predictions_train).mean())  # square root of the mean squared error

#  calculate R2:
r2_train = 1 - np.square(y_train - predictions_train).sum()/np.square(y_train - y_train.mean()).sum()

print(f"training data: root mean squared error = {rmse_train}\nR-squared = {r2_train}")

As visible above, the **training error increases while the test error decreases**. Still, the model is far from perfect. Here the already mentioned **post-pruning** would probably help, but as already said, that's too much to cover in this notebook and would need its own one to be explained in detail and implemented.

##### <div class="alert alert-block alert-info">⚠️ <strong>Attention:</strong> Single decision trees are often considered <strong>weak estimators</strong> and are rarely used on their own. Still, they form the base for most <strong>ensembling techniques</strong> that combine multiple weak estimators to achieve lower bias or variance depending on the method (e.g. boosting as a technique to decrease bias). I plan on maybe making a notebook on ensembling techniques like random forest, AdaBoost, gradient boosting + variants in the future, so if there's any demand for that let me know.</div>

#### **and that's it for this notebook, hope it helped someone, have a great day and happy learning! 👋**