# Regression Tree Notebook
### Author: Krzysztof Chmielewski

This notebook implements a regression tree from scratch and evaluates it on the Boston housing dataset. It demonstrates:
- Data loading and exploratory analysis
- Regression tree algorithm
- Model implementation
- Training and evaluation

## Model Purpose and Applications

A **regression tree** extends decision trees to **continuous value prediction** (regression problems). It recursively splits the data to minimize Mean Squared Error (MSE), producing a tree whose leaf nodes each predict the average target value of the training samples that fall into their region.

### Key Use Cases:
- **Real estate valuation**: Predicting house prices from property features
- **Medical outcome prediction**: Estimating patient recovery time or treatment effectiveness
- **Resource allocation**: Forecasting resource needs based on multiple factors
- **Climate modeling**: Predicting temperature or precipitation patterns
- **Manufacturing process control**: Predicting product quality scores
- **Feature importance analysis**: Identifying which variables matter most for predictions
- **Interpretable predictions**: Explaining "why" a specific prediction was made

### Strengths:
- **Highly interpretable** — clear decision rules understandable to non-technical stakeholders
- Handles non-linear relationships naturally
- No feature scaling required
- Can capture complex interactions between features
- Efficient prediction time
- Provides feature importance rankings

### Limitations:
- Prone to **overfitting** without proper depth and sample constraints
- Unstable — small data changes cause large tree changes
- Greedy splitting strategy (not globally optimal)
- Can create imbalanced trees with skewed data
- May require pruning for generalization

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from utils.evaluation import eval_regression

In [None]:
def MSE(y):
    if len(y) == 0: return 0
    return np.mean((y - np.mean(y)) ** 2)

def split_dataset(X, y, feature, threshold):
    left_mask = X[:, feature] <= threshold
    right_mask = X[:, feature] > threshold
    return X[left_mask], y[left_mask], X[right_mask], y[right_mask]

## Regression Tree: Core Concepts

A regression tree recursively splits data using the **Mean Squared Error (MSE)** criterion to build a tree structure for continuous value prediction.

### Mean Squared Error (MSE)

MSE measures the variance of target values in a node:

$$\text{MSE}(y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$$

where $\bar{y}$ is the mean target value of the $n$ samples in the node.
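As a quick check of the definition above, the following sketch computes the MSE of a small node by hand and compares it with NumPy's population variance. The function name `node_mse` and the sample values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def node_mse(y):
    """MSE of a node's targets around their own mean (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if y.size == 0:
        return 0.0  # same convention as the notebook's MSE helper for empty nodes
    return float(np.mean((y - y.mean()) ** 2))

# Made-up target values for one node
y_node = np.array([20.0, 22.0, 24.0, 30.0])

print(node_mse(y_node))  # mean is 24, squared deviations sum to 56, so 56 / 4 = 14.0
print(np.var(y_node))    # population variance (ddof=0) gives the same number
```

Because the node is scored around its own mean, this MSE is exactly the population variance of the targets in the node.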

### Best Split Criterion

For each potential split, the algorithm calculates the weighted MSE of child nodes:

$$\text{Weighted MSE} = \frac{n_{left}}{n} \text{MSE}(y_{left}) + \frac{n_{right}}{n} \text{MSE}(y_{right})$$

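At each node the algorithm greedily selects the split that **minimizes** weighted MSE, and it stops splitting when any of the following holds:
- The maximum depth (`max_depth`) is reached
- The node has fewer samples than `min_samples_split`
- No candidate split lowers the weighted MSE below the node's own MSE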

Leaf nodes predict the **mean target value** of all samples in that region.
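The split search can be illustrated on a single feature. The sketch below is a simplified, standalone version of the idea, not the class used later in this notebook; the names `weighted_mse` and `best_threshold_1d` and the toy data are assumptions made for illustration.

```python
import numpy as np

def weighted_mse(y_left, y_right):
    """Size-weighted average of the two child MSEs (each MSE is a population variance)."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)

def best_threshold_1d(x, y):
    """Try every unique value of x as a threshold and keep the lowest weighted MSE."""
    best_t, best_score = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue  # a split must leave samples on both sides
        score = weighted_mse(left, right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data with an obvious jump between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

t, score = best_threshold_1d(x, y)
print(t, score)                            # best threshold and its weighted MSE
print(np.var(y))                           # MSE of the unsplit node, for comparison
print(y[x <= t].mean(), y[x > t].mean())   # the values the two leaves would predict
```

On this toy data the search picks the threshold 3.0, which drops the MSE from about 20.92 for the unsplit node to about 0.67, and the two leaves would predict 2.0 and 11.0.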

In [None]:
class RegressionTree():

    def __init__(self, max_depth = 5, min_samples_split = 2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    class Node:
        def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
            self.feature = feature
            self.threshold = threshold
            self.left = left
            self.right = right
            self.value = value 

    def __best_split(self, X, y):
        best_feature = None
        best_threshold = None
        best_mse = np.inf

        _, n_features = X.shape

        for feature in range(n_features):
            thresholds = np.unique(X[:,feature])

            for threshold in thresholds:
                _, y_left, _, y_right = split_dataset(X,y,feature,threshold)

                if len(y_left) == 0 or len(y_right) == 0: continue

                mse_left = MSE(y_left)
                mse_right = MSE(y_right)

                weighted_mse = (len(y_left)/len(y) * mse_left + len(y_right)/len(y) * mse_right)

                if weighted_mse < best_mse:
                    best_mse = weighted_mse
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold, best_mse
    
    def __build_tree(self, X, y, depth=0):
        if depth >= self.max_depth or len(y) < self.min_samples_split:
            return self.Node(value=np.mean(y))
        
        feature, threshold, mse = self.__best_split(X,y)

        if feature is None or mse >= MSE(y):
            return self.Node(value=np.mean(y))
        
        X_left, y_left, X_right, y_right = split_dataset(X, y, feature, threshold)

        left_child = self.__build_tree(X_left, y_left, depth+1)
        right_child = self.__build_tree(X_right, y_right, depth+1)

        return self.Node(feature, threshold, left_child, right_child)
    
    def __predict_sample(self, x, node: Node):
        if node.left is None and node.right is None:
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self.__predict_sample(x, node.left)
        else:
            return self.__predict_sample(x, node.right)
        
    def fit(self, X, y):
        self.root = self.__build_tree(X, y)

    def predict(self, X):
        return np.array([self.__predict_sample(x, self.root) for x in X])


## Boston Housing Dataset

The Boston housing dataset contains information about housing in the Boston area. The dataset has 506 samples and 13 features plus the target variable `MEDV` (median home value in $1000s).

Data conversion: The raw `boston.txt` file contains data split across two lines per record. We parse and convert it to a CSV format with proper column headers.

In [None]:
with open("data/boston/boston.txt", "r") as infile, open("data/boston/boston.csv", "w") as outfile:
    lines = infile.readlines()
    n = len(lines)
    headers = "CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B-1000,LSTAT,MEDV"
    outfile.write(headers + '\n')
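    # Each record spans two consecutive lines; stopping at n - 1 skips an unpaired trailing line.
    for i in range(0, n - 1, 2):
        line1 = lines[i]
        line2 = lines[i+1]
        # Join the whitespace-separated fields of both lines into one CSV row.
        processed_line = ','.join(line1.split() + line2.split())
        outfile.write(processed_line + '\n')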

Each record in the dataset has 14 attributes:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town (written to the CSV above under the header `B-1000`)
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's

The target variable is `MEDV`.

In [None]:
data = pd.read_csv('data/boston/boston.csv')
data.head()

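|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B-1000 | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |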


## Training

Split the data into training and test sets and build the regression tree using MSE-based splitting.

In [None]:
X = data.drop("MEDV", axis=1).values
Y = data["MEDV"].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [None]:
RT = RegressionTree()
RT.fit(X_train, y_train)

## Evaluation

After training, we predict continuous values on the test set and evaluate using regression metrics:

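- **RMSE (Root Mean Squared Error)**: $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$, which penalizes large errors more heavily than small ones
- **MAE (Mean Absolute Error)**: $\frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$, which is less sensitive to outliers than RMSE
- **R² Score**: $1 - \frac{\sum (\hat{y}_i - y_i)^2}{\sum (y_i - \bar{y})^2}$, the proportion of target variance explained; it is at most 1 and can be negative on held-out data when the model does worse than always predicting the mean $\bar{y}$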

We use the local helper `eval_regression(y_true, y_pred)` to compute and display these metrics.
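The source of `eval_regression` is not shown in this notebook. As a rough illustration of what such a helper might compute, here is a minimal NumPy sketch of the metrics listed above. The function name `regression_metrics`, its dictionary return value, and the sample numbers are assumptions, not the helper's actual interface.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² from true and predicted values (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                  # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(residuals))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Made-up values: two exact predictions and two that are off by 1
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these values the residuals are -1, 0, 1 and 0, so MSE and MAE are both 0.5, and R² is 1 - 2/20 = 0.9.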

In [None]:
y_pred = RT.predict(X_test)
eval_regression(y_test, y_pred);

MSE: 15.632
RMSE: 3.954
MAE: 2.733
R2: 0.790
