# **Regression Trees - rice_ml**
This notebook demonstrates how to use the DecisionTreeRegressor class from the rice_ml package. It walks through a typical use case and analyzes the results along the way.

Note that for robust model selection, k-fold cross-validation and more thorough hyperparameter tuning are recommended. Because the main goal of this example is to demonstrate the classes, we tune only lightly and use the same random state (42) for every test.

This notebook shows how to:
- Use 'DecisionTreeRegressor' from 'rice_ml'
- Prepare and normalize data using 'rice_ml'
- Evaluate decision trees on a regression task

## Table of Contents
- [Algorithm](#algorithm)
- [Data Preparation](#data-preparation)
- [Decision Tree Regression](#decision-tree-regression)
  - [Model Training](#model-training)
  - [Results](#results)

## Algorithm
A regression tree is a supervised learning algorithm that models a continuous value by recursively dividing the feature space into regions and fitting a constant prediction within each region. Unlike classification trees, which predict labels, regression trees predict numeric values.

At each node, the algorithm searches for the feature and threshold that best separate the data, using mean squared error (MSE) as the criterion: it chooses the split whose two child nodes have the lowest combined MSE. Splitting continues recursively until a stopping condition is met (for example, the maximum depth is reached), at which point the node becomes a leaf. Each internal node represents a decision rule, and each leaf holds a prediction value, typically the mean of the training targets that fall into it.
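
The split search for a single feature can be sketched as follows. This is a minimal NumPy illustration, not the rice_ml implementation: the helper name `best_split_1d`, the exhaustive search, and the use of midpoints between sorted values as candidate thresholds are all assumptions made for the example.

```python
import numpy as np

def best_split_1d(x, y):
    """Return (threshold, weighted_mse) for the best split of one feature.

    Hypothetical helper for illustration only.
    """
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    best_threshold, best_score = None, np.inf
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate equal values
        left, right = y_sorted[:i], y_sorted[i:]
        # np.var is the MSE around the mean, so this is the
        # size-weighted MSE of the two children.
        score = (left.var() * len(left) + right.var() * len(right)) / n
        if score < best_score:
            best_score = score
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_threshold, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
threshold, score = best_split_1d(x, y)
print(threshold, score)  # 6.5 0.0
```

A full tree would repeat this search over every feature at every node and keep the overall best split.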

To make a prediction, a data point is passed down the tree, moving toward a leaf as dictated by the decision rules it encounters. Once it reaches a leaf, that leaf's value is returned as the data point's prediction.

Regression trees are good at modeling nonlinear relationships without needing any feature transformations, but they rely on the quality and structure of the splits chosen during training. They can easily overfit if the parameters are poorly chosen, for example if the tree is allowed to grow too deep.

A common way to counter overfitting is to use an ensemble method such as a random forest, which combines many regression trees as its base learners.
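
The traversal described above can be sketched with a small hand-built tree. The nested-dict layout, the key names, and the "go left when the feature is less than or equal to the threshold" convention are hypothetical choices for this example; rice_ml may store its trees differently.

```python
def predict_one(node, x):
    """Walk a nested-dict tree until a leaf (a node with a "value") is reached."""
    while "value" not in node:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]

# Hypothetical tree: split on feature 0 first, then on feature 1.
tree = {
    "feature": 0, "threshold": 3.0,
    "left": {"value": 1.2},
    "right": {
        "feature": 1, "threshold": 35.0,
        "left": {"value": 2.5},
        "right": {"value": 4.1},
    },
}

print(predict_one(tree, [2.0, 40.0]))  # 1.2
print(predict_one(tree, [5.0, 34.0]))  # 2.5
print(predict_one(tree, [5.0, 36.0]))  # 4.1
```

Because every point that lands in the same leaf receives the same value, the model's output is piecewise constant.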

![Decision Tree Example](../images/regressor_tree.png)
Source: [Medium](https://medium.com/@sametgirgin/decision-tree-regression-in-6-steps-with-python-1a1c5aa2ee16)


### Pros vs Cons
#### Pros
- Interpretable and easy to visualize
- Handles nonlinear relationships very well
- Can work with mixed feature types (numerical and categorical), although many implementations require categorical features to be encoded first
- Fast predictions
#### Cons
- Can easily overfit
- High variance
- Constant predictions within each leaf (can cause sharp and unrealistic jumps in output if there are not enough leaves)

## Data Preparation
We will be using the California Housing data from 'sklearn'. It contains census information about California districts, including the median house value of each district. Given demographic and geographic features, we can use it to predict the median house value for a district. The target is expressed in units of $100,000.

In [9]:
from sklearn.datasets import fetch_california_housing
import numpy as np

data = fetch_california_housing()

X = data.data
y = data.target
feature_names = data.feature_names

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Features:", feature_names)


X shape: (20640, 8)
y shape: (20640,)
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


## Decision Tree Regression

### Model Training

In [10]:
from rice_ml.supervised_learning.regression_trees import DecisionTreeRegressor
from rice_ml.utilities.preprocess import train_test_split, normalize

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

model = DecisionTreeRegressor(
    max_depth=4,
    random_state=42,
)

model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R2 score: {r2:.3f}")

y_pred = model.predict(X_test)

mse = ((y_test - y_pred) ** 2).mean()
print(f"RMSE: {np.sqrt(mse):.3f}")

R2 score: 0.570
RMSE: 0.761
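
For reference, the two metrics printed above can be written out explicitly. This sketch assumes that `model.score` returns the standard coefficient of determination, which is not confirmed here for rice_ml; the function names `r2_score` and `rmse` are defined locally for the example.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
print(f"R2:   {r2_score(y_true, y_pred):.3f}")  # R2:   0.200
print(f"RMSE: {rmse(y_true, y_pred):.3f}")      # RMSE: 1.000
```

An R^2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.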


The base model has a decent R^2 score but leaves room for improvement: it explains around 57% of the variance in the test targets. The target is in units of $100k, so an RMSE of 0.761 corresponds to a typical prediction error of roughly $76,000.
Next we will look at some parameter tuning and normalization.

#### Effect of max_depth on Model Performance

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

# Candidate depths; None is assumed to mean no depth limit.
max_depths = [1, 2, 3, 4, 5, 8, 10, None]

for depth in max_depths:
    model = DecisionTreeRegressor(
        max_depth=depth,
        random_state=42,
    )

    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)

    y_pred = model.predict(X_test)

    mse = ((y_test - y_pred) ** 2).mean()

    print(f"R^2 (max_depth={depth}): {score:.3f}")
    print(f"RMSE (max_depth={depth}): {np.sqrt(mse):.3f}")


R^2 (max_depth=1): 0.310
RMSE (max_depth=1): 0.964
R^2 (max_depth=2): 0.447
RMSE (max_depth=2): 0.863
R^2 (max_depth=3): 0.524
RMSE (max_depth=3): 0.800
R^2 (max_depth=4): 0.570
RMSE (max_depth=4): 0.761
R^2 (max_depth=5): 0.619
RMSE (max_depth=5): 0.716
R^2 (max_depth=8): 0.696
RMSE (max_depth=8): 0.639
R^2 (max_depth=10): 0.687
RMSE (max_depth=10): 0.649
R^2 (max_depth=None): 0.615
RMSE (max_depth=None): 0.720


The maximum depth is a crucial parameter of a decision tree. Tuning it for your data can turn a poor model into a good one, and vice versa. Tuning min_samples_split, min_samples_leaf, and max_features can also bring large benefits.

In this case, test performance improves as the depth increases, peaking at a depth of 8 (R^2 = 0.696) before declining at depth 10 and with unlimited depth (R^2 = 0.615). This shows that the tree needs enough levels to make fine-grained decisions, but too many levels result in overfitting.
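
As noted in the introduction, a more robust way to choose the depth is k-fold cross-validation on the training set, keeping the test set for a single final evaluation. The sketch below only generates the fold indices with NumPy; `kfold_indices` is a hypothetical helper written for this example and is not part of rice_ml.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

# With 10 samples and 5 folds, each fold trains on 8 and validates on 2.
for train_idx, val_idx in kfold_indices(10, n_splits=5):
    # In practice: fit on X_train[train_idx], score on X_train[val_idx],
    # then average the scores for each candidate max_depth.
    print(len(train_idx), len(val_idx))
```

The depth with the best average validation score would then be refit on the full training set and scored once on the test set.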

We will use max_depth=8 for the normalization test. Let's see how it performs.

#### Z-Score Normalization

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

# Normalize on the training data to avoid data leakage.
X_train, stats = normalize(X_train, method="zscore", return_stats=True)
X_test = normalize(X_test, method="zscore", stats=stats)

model = DecisionTreeRegressor(
    max_depth=8,
    random_state=42,
)

model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R2 score: {r2:.3f}")

y_pred = model.predict(X_test)

mse = ((y_test - y_pred) ** 2).mean()
print(f"RMSE: {np.sqrt(mse):.3f}")

R2 score: 0.696
RMSE: 0.639


### Results
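Normalization does not change the results at all: the R^2 and RMSE are identical to those of the unnormalized max_depth=8 model. This is a key property of trees: normalization is generally unnecessary. Each split compares a single feature against a threshold, and z-score normalization preserves the ordering of values within each feature, so the same partitions of the data remain available and the tree behaves the same way.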
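
The reason can be checked on a single feature. This sketch assumes that z-score normalization computes `(x - mean) / std` per feature, which is the usual definition but is not confirmed here for `normalize`; the sample values and the threshold are made up for the example.

```python
import numpy as np

x = np.array([12.0, 45.0, 7.0, 33.0, 21.0, 60.0])
mean, std = x.mean(), x.std()
z = (x - mean) / std  # z-scored copy of the same feature

# The transformation is monotonic, so the sort order is unchanged.
same_order = np.array_equal(np.argsort(x), np.argsort(z))

# Any threshold on the raw scale has an equivalent threshold on the
# z-score scale that sends exactly the same samples left and right.
t_raw = 27.0
t_z = (t_raw - mean) / std
same_partition = np.array_equal(x <= t_raw, z <= t_z)

print(same_order, same_partition)  # True True
```

Since every candidate partition is still available and the targets are untouched, the MSE of each split, and therefore the chosen tree, is the same.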