# Experiment: What is the best split criterion for RF Regression?

Vivek Gopalakrishnan | October 28, 2019


## My Random Forest setup

Split criteria being tested:

1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Axis projections
4. Random projections
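
For the first two criteria, the quantity a tree tries to reduce at each candidate split can be sketched as below. This is a minimal illustration that assumes the standard definitions (MSE impurity measured around the node mean, MAE impurity around the node median). The helper names `mse_impurity`, `mae_impurity`, and `split_cost` are hypothetical, and the two projection criteria are left out because their definitions are not given here.

```python
import numpy as np


def mse_impurity(y):
    """Mean squared deviation of the targets from the node mean."""
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    return float(np.mean((y - y.mean(axis=0)) ** 2))


def mae_impurity(y):
    """Mean absolute deviation of the targets from the node median."""
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    return float(np.mean(np.abs(y - np.median(y, axis=0))))


def split_cost(y, left_mask, impurity):
    """Impurity of the two children, weighted by their share of the samples."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    n_left = left_mask.sum()
    n_right = len(y) - n_left
    if n_left == 0 or n_right == 0:
        raise ValueError("a split must leave samples on both sides")
    return (
        n_left * impurity(y[left_mask]) + n_right * impurity(y[~left_mask])
    ) / len(y)


# Toy targets with one outlier (made-up values, for illustration only)
y_toy = np.array([1.0, 2.0, 3.0, 10.0])
left = np.array([True, True, True, False])

parent_mse = mse_impurity(y_toy)
parent_mae = mae_impurity(y_toy)
child_mse = split_cost(y_toy, left, mse_impurity)
child_mae = split_cost(y_toy, left, mae_impurity)
```

On these toy values the outlier inflates the parent MSE impurity (12.5) far more than the parent MAE impurity (2.5), which is the usual reason for comparing the two criteria.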


## My sampling model for synthetic data

Input data is sampled from a $d$-dimensional multivariate normal (MVN) and the output data is a random rotation of the input data.

Specifically,
$$
\begin{aligned}
X_i &\sim \text{MVN}\left(\mu, I_d\right), \quad y_i = A X_i, \\
\mathcal{D}_n &= \left\{(X_i, y_i)\right\} \text{ for } i=1,\dots,n,
\end{aligned}
$$
where $A$ is a random rotation matrix sampled according to [the Haar distribution](http://scipy.github.io/devdocs/generated/scipy.stats.ortho_group.html) and $d$ is the number of simulated features.
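
The sampling model can be sketched in a few lines of NumPy and SciPy. This is an illustration of the equations above, not the code in `simulation.py`: the function name `sample_rotated_mvn`, its `mu` and `seed` arguments, and the choice to return `A` are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import ortho_group


def sample_rotated_mvn(n_samples, n_dim, mu=None, seed=0):
    """Draw X_i ~ MVN(mu, I_d) and return (X, y, A) with y_i = A X_i."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(n_dim) if mu is None else np.asarray(mu, dtype=float)

    # Identity covariance: independent unit-variance normals shifted by mu
    X = rng.standard_normal((n_samples, n_dim)) + mu

    # Haar-distributed orthogonal matrix (may include a reflection)
    A = ortho_group.rvs(dim=n_dim, random_state=seed)

    # Rows of X are samples, so y_i = A X_i becomes X @ A.T
    y = X @ A.T
    return X, y, A


X_demo, y_demo, A_demo = sample_rotated_mvn(n_samples=25, n_dim=2)
```

Note that `ortho_group` samples from the full orthogonal group, so `A` can have determinant -1. If only proper rotations are wanted, `scipy.stats.special_ortho_group` is the drop-in alternative.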


## What simulations are run in this notebook?

Simulation parameters are `(n_samples, n_dim, scale)`, where `scale` is the noise level $\sigma$. I run the following simulations:

1. Increase `n_dim`, fix `n_samples=75`
2. Increase `scale`, fix `n_samples=30`

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from simulation import generate_linear_data, measure_mse

In [None]:
%load_ext autoreload
%autoreload 2

## Demo of simulation functions

The module `simulation` provides two functions:

1. `generate_linear_data`: samples data according to the model above
2. `measure_mse`: measures the MSE of each split criterion on a given set of sample data

In [None]:
# Test data generation function
X, y = generate_linear_data(n_samples=25, n_dim=2, scale=0.1)

# Plot synthetic data
plt.scatter(X[:, 0], X[:, 1], c="blue", label="X")
plt.scatter(y[:, 0], y[:, 1], c="red", label="y")

# Plot lines between matched pairs of points
for xi, yi in zip(X, y):
    plt.plot(
        [xi[0], yi[0]],
        [xi[1], yi[1]],
        c="black",
        alpha=0.15,
    )

plt.legend()
plt.show()

In [None]:
# Test MSE measuring function
measure_mse(X, y)

## Simulation

Run the simulations with `python simulation.py`. The results are analyzed below.

### Simulation 1: Increasing dimensionality

- `n_samples = 75`
- `n_dim = [2, 3, 4, ..., 40]`

In [None]:
df = pd.read_csv("results/simulation_1.csv", index_col="Unnamed: 0")
df.head()

In [None]:
fig, ax = plt.subplots(dpi=300)
f = sns.lineplot(x="n_dim", y="mse", hue="split", data=df, ci=None, alpha=0.5, ax=ax)
f.set(xlabel="n_dim", ylabel="mse")

### Simulation 2: Increasing noise

- `n_samples = 30`
- `n_dim = [3, 30]`
- `sigma = np.linspace(0, 10, 50)`

In [None]:
df = pd.read_csv("results/simulation_2.csv", index_col="Unnamed: 0")
df.head()

In [None]:
fig, ax = plt.subplots(dpi=300)
f = sns.lineplot(x="scale", y="mse", hue="split", data=df, ci=None, alpha=0.5, ax=ax)
f.set(xlabel=r"$\sigma$", ylabel="mse")