# Decision Tree Regression

A naive `CART` implementation.

Before reading this chapter, take a look at [Decision Tree Classification](), which introduces the concept of decision trees in more detail.

Structurally, a regression tree is constructed in the same way as a classification tree. There are, however, two key differences:

1. Because error can now be measured as a distance, the **quality** of a split is measured by the variance of the target values in the two children. Variance is defined as $Var(X) = E[(X-\mu)^2]$. An optimal split is therefore the one with the lowest weighted variance among all candidate splits.
    1. Note: the difference between MSE and variance is that variance measures the dispersion of values, while MSE measures the quality of an estimator, or in other words, how far the estimated values are from the actual values.
2. The prediction at a terminal node (or leaf node) is no longer a majority vote, but the average target value instead.
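
The note on MSE and variance can be checked numerically. The following is a minimal sketch, where the array `prices` and the helper `mse` are names introduced only for illustration: when every member of a group is predicted by the group's mean, the MSE of that prediction equals the group's variance, and any other constant prediction does worse.

```python
import numpy as np

# Illustrative target values for one hypothetical leaf
prices = np.array([600.0, 400.0, 700.0])


def mse(actual, predicted):
    """Mean squared error between actual values and a prediction."""
    return np.mean((actual - predicted) ** 2)


# The leaf prediction of a regression tree is the mean of its targets
leaf_prediction = prices.mean()

# Predicting the mean gives an MSE equal to the (population) variance
mse_of_mean = mse(prices, leaf_prediction)
variance = np.var(prices)

# Any other constant prediction gives a larger MSE
mse_of_other = mse(prices, 600.0)

print(mse_of_mean, variance, mse_of_other)
```

This is why minimizing variance within the children and predicting the mean at the leaves go hand in hand.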


## Example

Suppose we had the following information about housing data:

In [1]:
import pandas as pd


df = pd.DataFrame({'type': ['semi', 'detached', 'detached', 'semi', 'semi'],
                   'n_bedrooms': [3, 2, 3, 2, 4],
                   'price': [600, 700, 800, 400, 700]})
df

|    | type     |   n_bedrooms |   price |
|---:|:---------|-------------:|--------:|
|  0 | semi     |            3 |     600 |
|  1 | detached |            2 |     700 |
|  2 | detached |            3 |     800 |
|  3 | semi     |            2 |     400 |
|  4 | semi     |            4 |     700 |


Defining the split functionality

In [2]:
import numpy as np


def _variance(targets):
    # an empty child contributes no variance
    if targets.size == 0:
        return 0
    return np.var(targets)


def _weighted_variance(groups):
    """
    obj: measures the weighted variance (impurity) after a split. this is effectively
    the sum of the variances of each group, where each weight is the fraction of
    elements that fall within that group
    :param groups: List[np.ndarray] - [i] - child, [i][j] - target value at child i
    :return: - float
    """
    total = sum(len(group) for group in groups)

    def single_wv(group):
        weight = len(group) / float(total)
        return weight * _variance(group)

    # summing each group's term (rather than reducing pairwise) also works
    # when there is one group or more than two groups
    return sum(single_wv(group) for group in groups)


In [3]:
_variance(np.array([1, 2, 3]))

0.6666666666666666

In [4]:
_weighted_variance([np.array([1, 2, 3]), np.array([1, 2])])

0.5

We choose a split by selecting the minimum weighted variance across all candidate splits. In other words, we compare the weighted variance of every (feature, value) combination and select the split that returns the smallest value. This is the same as selecting the split whose children have the smallest overall spread of target values, that is, the groups whose targets are closest together.

This has the effect of grouping similar target values together: the information gained from a split is measured by how much more tightly knit the resulting groups are. As a result, the leaf nodes that determine the final predictions of the tree carry minimal variance, and so each leaf's mean is as representative of its targets as possible.

In our case, the candidate splits come from the values of the number of bedrooms and the type of home.

In [5]:
def split_variance(df, feature, value, target):
    is_value = df[feature] == value
    is_not_value = df[feature] != value
    child_split = [np.array(df[is_value][target].tolist()),
                   np.array(df[is_not_value][target].tolist())]
    return _weighted_variance(child_split)


split_type = df['type'].unique().tolist()
split_bedrooms = df['n_bedrooms'].unique().tolist()
split_type = [('type', elm) for elm in split_type]
split_bedrooms = [('n_bedrooms', elm) for elm in split_bedrooms]

for feature, value in split_type:
    print(f'feature: {feature}')
    print(f'value: {value}')
    sv = split_variance(df, feature, value, 'price')
    print(f'split_variance: {sv}', end='\n\n')


for feature, value in split_bedrooms:
    print(f'feature: {feature}')
    print(f'value: {value}')
    sv = split_variance(df, feature, value, 'price')
    print(f'split_variance: {sv}', end='\n\n')


feature: type
value: semi
split_variance: 10333.333333333334

feature: type
value: detached
split_variance: 10333.333333333334

feature: n_bedrooms
value: 3
split_variance: 16000.0

feature: n_bedrooms
value: 2
split_variance: 13000.0

feature: n_bedrooms
value: 4
split_variance: 17500.0
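
As a check on the first value printed above, the weighted variance of the `type` split can be worked by hand. The semi group holds three of the five prices and the detached group holds the other two, so each group's variance is weighted by its share of the rows (the rounding is ours):

$$
\frac{3}{5}\,Var(\{600, 400, 700\}) + \frac{2}{5}\,Var(\{700, 800\}) \approx \frac{3}{5}(15555.56) + \frac{2}{5}(2500) \approx 10333.33
$$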



We find that splitting on `type` ('semi' versus 'detached') first is optimal. Following the same strategy again within the semi group, the next optimal split is on the number of bedrooms, which separates the two-bedroom home from the three- and four-bedroom homes.

Finally, we average the target values remaining in each leaf to obtain its predicted value.

![](../../../assets/tree_based_algorithms/sample_decision_tree2.PNG)
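
The leaf values can be reproduced directly with pandas. This is only a sketch under stated assumptions: it hard-codes a tree that splits on `type` first, then separates the two-bedroom semi from the other semis, and keeps the detached homes as a single leaf. The `leaf` labels are introduced purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['semi', 'detached', 'detached', 'semi', 'semi'],
                   'n_bedrooms': [3, 2, 3, 2, 4],
                   'price': [600, 700, 800, 400, 700]})

is_semi = df['type'] == 'semi'
is_two_bed = df['n_bedrooms'] == 2

# Assign each row to the leaf it falls into (labels are illustrative)
leaf = np.where(~is_semi, 'detached',
                np.where(is_two_bed, 'semi, 2 bedrooms', 'semi, other'))

# The prediction of each leaf is the mean price of the rows it contains
leaf_means = df.assign(leaf=leaf).groupby('leaf')['price'].mean()
print(leaf_means)
```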


## Putting it All Together

In [6]:
import pandas as pd
import sys
import numpy as np
from tabulate import tabulate


class Impurity:
    @staticmethod
    def weighted_variance(groups):
        """
        obj: measures the weighted variance (impurity) after a split. this is effectively
        the sum of the variances of each group, where each weight is the fraction of
        elements that fall within that group
        :param groups: List[pd.Series] - [i] - child, [i][j] - target value at child i
        :return: - float
        """
        total = sum(len(group) for group in groups)

        def single_wv(group):
            weight = len(group) / float(total)
            return weight * Impurity._variance(group)

        # summing each group's term (rather than reducing pairwise) also works
        # when there is one group or more than two groups
        return sum(single_wv(group) for group in groups)

    @staticmethod
    def _variance(targets):
        # an empty child contributes no variance
        if targets.size == 0:
            return 0
        return np.var(targets)


class Node:
    def __init__(self, data, feature, impurity,
                 left=None, right=None, leaf=None):
        """
        :param data: pd.DataFrame - subset data by a particular feature group
        :param feature: str - feature `data` was grouped by
        :param impurity: float - weighted variance of the split chosen at this node
        :param left: Node - pointer to left child
        :param right: Node - pointer to right child
        :param leaf: float - average target value represented by the leaf node
        """
        self.data = data
        self.feature = feature
        self.impurity = impurity
        self.left, self.right = left, right
        self.leaf = leaf

    def is_leaf(self):
        return self.leaf is not None and self.left is None and self.right is None

    def print(self, delimit):
        md_table = None
        if self.data is not None:
            md_table = tabulate(self.data.head(10), headers='keys', tablefmt='pipe')
            md_table = '\n'.join([delimit + row for row in md_table.split('\n')])
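        # compare against None so that legitimate zero values (e.g. impurity 0.0) are still printed
        feature = delimit + 'feature: ' + self.feature if self.feature is not None else ''
        impurity = delimit + 'impurity: ' + str(self.impurity) if self.impurity is not None else ''
        leaf = delimit + 'leaf: ' + str(self.leaf) if self.leaf is not None else ''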
        tmp_iter = [md_table, feature, impurity, leaf]
        print('\n'.join([elm for elm in tmp_iter if elm]))


class DecisionTree:
    @staticmethod
    def train(data, max_depth, min_size, y_label='y'):
        """
        obj: build or "train" the decision tree
        :param data: pd.DataFrame - training data from its entry point
        :param max_depth: int - constraint for the maximum depth of tree
        :param min_size: int - a node with this many rows or fewer becomes a leaf
        :param y_label: str - target label in `data`
        :return: Node - root of the decision tree
        """
        def dfs(data, depth):
            # checking terminating conditions (sufficient data, and depth)
            if len(data) == 0 or len(data) <= min_size or depth >= max_depth:
                return Node(data, None, None, leaf=DecisionTree._predict(data, y_label))
            # otherwise we are safe to split again
            else:
                # build the new node in the stack and recurse to the next level
                # form the binary connections on return
                best_split_info = DecisionTree._get_best_split(data, y_label)
                cur_node = Node(data, best_split_info['col_name'], best_split_info['impurity'])
                cur_node.left = dfs(best_split_info['left'], depth + 1)
                cur_node.right = dfs(best_split_info['right'], depth + 1)
                return cur_node
        return dfs(data, 1)

    @staticmethod
    def visualize_tree(root):
        def dfs(node, tab_count):
            delimit = '\t' * tab_count
            node.print(delimit)
            if not node.is_leaf():
                dfs(node.left, tab_count + 1)
                dfs(node.right, tab_count + 1)
        dfs(root, 0)

    @staticmethod
    def _get_best_split(data, y_label):
        """
        obj: identify the best attribute to split on ``data``
        :param data: pd.DataFrame - subset data
        :param y_label: str - label for output in data
        :return: dict - the score, left, right, and column for optimal split
        """
        def get_best_score(one_rests):
            # note: top score is the minimum score
            top_score, top_i = sys.maxsize, None
            for i, one_rest in enumerate(one_rests):
                impurity = Impurity.weighted_variance([one_rest[0][y_label],
                                                       one_rest[1][y_label]])
                if impurity < top_score:
                    top_score, top_i = impurity, i
            return top_score, top_i

        features = DecisionTree.get_features(data, y_label)
        g_top_score, g_top_one_rest, g_top_col = sys.maxsize, None, None

        # get all the possible binary splits for a particular attribute
        for col_name in features:
            one_rests = DecisionTree._get_all_splits(data, col_name)
            top_score, top_i = get_best_score(one_rests)
            if g_top_score > top_score:
                g_top_score, g_top_one_rest, g_top_col = top_score, one_rests[top_i], col_name

        return {'impurity': g_top_score, 'left': g_top_one_rest[0],
                'right': g_top_one_rest[1], 'col_name': g_top_col}

    @staticmethod
    def get_features(df, y_label):
        """
        obj: return the features of the dataframe
        """
        c_names = np.array(list(df))
        features = c_names[c_names != y_label]
        return features

    @staticmethod
    def _predict(data, y_label):
        """
        obj: determine leaf node value of tree
        """
        return np.mean(data[y_label])

    @staticmethod
    def _get_all_splits(data, by):
        """
        objective: split data by all unique attributes within a feature
        :param data: pd.DataFrame - subset data
        :param by: str - column to group by
        :return: [(pd.DataFrame, pd.DataFrame)] - (one, rest) data frames. this can
        also be interpreted as (left, right) where left = unique, right = remainder
        """
        groups = data[by].unique()
        one_rest = []
        for elm in groups:
            one, rest = DecisionTree._partition(data, by, elm)
            one_rest.append((one, rest))
        return one_rest

    @staticmethod
    def _partition(data, feature, value):
        """
        obj: partition dataframe into two, using one vs rest approach
        :param data: pd.DataFrame - data
        :param feature: str - column name
        :param value: int or str - value in the column to split the dataframe by
        :return: (df1, df2) - (one, rest) dataframes
        """
        mask = data[feature] == value
        return data[mask], data[~mask]

In [7]:
df = pd.DataFrame({'type': ['semi', 'detached', 'detached', 'semi', 'semi'],
                   'n_bedrooms': [3, 2, 3, 2, 4],
                   'price': [600, 700, 800, 400, 700]})
tree = DecisionTree.train(df, 3, 2, 'price')
DecisionTree.visualize_tree(tree)

|    | type     |   n_bedrooms |   price |
|---:|:---------|-------------:|--------:|
|  0 | semi     |            3 |     600 |
|  1 | detached |            2 |     700 |
|  2 | detached |            3 |     800 |
|  3 | semi     |            2 |     400 |
|  4 | semi     |            4 |     700 |
feature: type
impurity: 10333.333333333334
	|    | type   |   n_bedrooms |   price |
	|---:|:-------|-------------:|--------:|
	|  0 | semi   |            3 |     600 |
	|  3 | semi   |            2 |     400 |
	|  4 | semi   |            4 |     700 |
	feature: n_bedrooms
	impurity: 1666.6666666666665
		|    | type   |   n_bedrooms |   price |
		|---:|:-------|-------------:|--------:|
		|  3 | semi   |            2 |     400 |
		leaf: 400.0
		|    | type   |   n_bedrooms |   price |
		|---:|:-------|-------------:|--------:|
		|  0 | semi   |            3 |     600 |
		|  4 | semi   |            4 |     700 |
		leaf: 650.0
	|    | type     |   n_bedrooms |   price |
	|---:|:---------|----

Notice how this matches the tree structure in the diagram above.

For the purposes of demonstration, suppose we are fine with nodes that hold a single row of data (which we shouldn't be in the real world). We can change the parameters as follows:

In [8]:
tree = DecisionTree.train(df, 3, 1, 'price')
DecisionTree.visualize_tree(tree)

|    | type     |   n_bedrooms |   price |
|---:|:---------|-------------:|--------:|
|  0 | semi     |            3 |     600 |
|  1 | detached |            2 |     700 |
|  2 | detached |            3 |     800 |
|  3 | semi     |            2 |     400 |
|  4 | semi     |            4 |     700 |
feature: type
impurity: 10333.333333333334
	|    | type   |   n_bedrooms |   price |
	|---:|:-------|-------------:|--------:|
	|  0 | semi   |            3 |     600 |
	|  3 | semi   |            2 |     400 |
	|  4 | semi   |            4 |     700 |
	feature: n_bedrooms
	impurity: 1666.6666666666665
		|    | type   |   n_bedrooms |   price |
		|---:|:-------|-------------:|--------:|
		|  3 | semi   |            2 |     400 |
		leaf: 400.0
		|    | type   |   n_bedrooms |   price |
		|---:|:-------|-------------:|--------:|
		|  0 | semi   |            3 |     600 |
		|  4 | semi   |            4 |     700 |
		leaf: 650.0
	|    | type     |   n_bedrooms |   price |
	|---:|:---------|----

# Incorporating Random Forests

A Random Forest combines two critical things:
1. Many decision trees.
2. Random subsampling of the features considered at each node of every tree.
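
The second point can be sketched as follows. The feature names and the subset size (the square root of the number of features) are assumptions chosen for illustration, not something prescribed by the text above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feature names, for illustration only
features = np.array(['type', 'n_bedrooms', 'area', 'age'])

# Assumed subset size: square root of the number of features
n_candidates = max(1, int(np.sqrt(len(features))))

# At each node, only a random subset of the features is searched for the best split
candidates = rng.choice(features, size=n_candidates, replace=False)
print(candidates)
```

Because each node sees a different random subset, the trees in the forest end up less similar to one another than trees that always search every feature.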

A _regression forest_, which is a kind of random forest, predicts the final value by combining the predictions of its separate decision trees using a metric such as the average or a weighted average.
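
To make the averaging concrete, here is a small sketch using scikit-learn's `RandomForestRegressor` on synthetic data (the data set and the parameter values are assumptions for illustration). It compares the forest's prediction with a manual, unweighted mean over the fitted trees in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Predictions of each individual tree: shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Average the trees' predictions by hand and compare with the forest
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```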


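In scikit-learn's regression forest, all features must be numerical, so a string column such as `type` has to be encoded as numbers before fitting. The naive decision tree above can use strings directly only because it splits on equality with a value.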
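
scikit-learn's forest estimators expect numeric input, so one way to feed the housing data from the earlier example to a forest is to one-hot encode the `type` column first. This is a minimal sketch: the use of `pd.get_dummies` and the parameter values are choices made here for illustration, and five rows are far too few for a meaningful model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'type': ['semi', 'detached', 'detached', 'semi', 'semi'],
                   'n_bedrooms': [3, 2, 3, 2, 4],
                   'price': [600, 700, 800, 400, 700]})

# One-hot encode the string column so that every feature is numeric
X = pd.get_dummies(df[['type', 'n_bedrooms']], columns=['type'], dtype=int)
y = df['price']

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

print(list(X.columns))
print(forest.predict(X))
```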

In [9]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split


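# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written
boston = load_boston()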
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [10]:
from sklearn.ensemble import RandomForestRegressor


regressor = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

We can identify the following properties used within our random forest.

In [11]:
print(f'Number of trees in forest: {regressor.n_estimators}')
print(f'The maximum depth of each tree is: {regressor.max_depth}. This means that no path from the root to a leaf '
      f'contains more than {regressor.max_depth} splits.')
print(f'A node is only considered for splitting if it contains at least {regressor.min_samples_split} samples.')
print(f'The minimum weighted fraction for a leaf node is {regressor.min_weight_fraction_leaf}. This is the fraction of '
      f'the total sample weight that a node must hold to be a leaf.')
print(f'The criterion used to measure the quality of each split was {regressor.criterion}. As demonstrated above, the '
      f'mse criterion is equivalent to greedily reducing the variance. '
      f'The split that minimizes the variance ensures that the target values in each child are as close together as possible.')
print(f'When identifying the best split, the maximum number of features considered is {regressor.max_features}.')
print(f'The additional constraint for the maximum number of leaf nodes is {regressor.max_leaf_nodes}. This means that '
      f'there is no constraint (unlimited number of leaf nodes).')
print(f'A node is split only if the split decreases the impurity by at least '
      f'{regressor.min_impurity_decrease}.')
print(f'To both fit and predict the model, {regressor.n_jobs} jobs were run in parallel.')

Number of trees in forest: 100
The maximum depth of each tree is: 10. This means that no path from the root to a leaf contains more than 10 splits.
A node is only considered for splitting if it contains at least 3 samples.
The minimum weighted fraction for a leaf node is 0.0. This is the fraction of the total sample weight that a node must hold to be a leaf.
The criterion used to measure the quality of each split was mse. As demonstrated above, the mse criterion is equivalent to greedily reducing the variance. The split that minimizes the variance ensures that the target values in each child are as close together as possible.
When identifying the best split, the maximum number of features considered is auto.
The additional constraint for the maximum number of leaf nodes is None. This means that there is no constraint (unlimited number of leaf nodes).
A node is split only if the split decreases the impurity by at least 0.0.
To both fit and predict the mo