# Tree-Based Segmentation for Error Analysis

This notebook demonstrates how to use the DecisionTreeSegmentation class to identify and analyze error patterns in your model's predictions using a tree-based approach.

## Introduction

When analyzing model performance, it's useful to identify regions of the feature space where your model underperforms. The DecisionTreeSegmentation class segments the feature space according to prediction error, and the plotting functions visualize those segments.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Import the DecisionTreeSegmentation class and plotting functions
from tab_right.plotting.plot_segmentations import plot_dt_segmentation, plot_dt_segmentation_with_stats
from tab_right.segmentations import DecisionTreeSegmentation

## Generate Sample Data

First, let's generate some synthetic data with a non-linear pattern that will be challenging for the model to learn perfectly.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic dataset with known patterns
n_samples = 1000
X = pd.DataFrame({
    "feature1": np.random.uniform(-3, 3, n_samples),
    "feature2": np.random.uniform(-3, 3, n_samples),
    "feature3": np.random.normal(0, 1, n_samples),
})

# Create a target with a complex non-linear pattern
noise = np.random.normal(0, 0.5, n_samples)
y = 2 * np.sin(X["feature1"]) + X["feature2"] ** 2 + 0.5 * X["feature3"] + noise

# Display the first few rows of the data
X.head()

## Train a Model

Now let's train a RandomForestRegressor on this data and generate predictions.

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a random forest model (intentionally underfitting to create meaningful error patterns)
model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Generate predictions on the test set
y_pred = model.predict(X_test)

# Calculate mean squared error
mse = np.mean((y_test - y_pred) ** 2)
print(f"Mean Squared Error: {mse:.4f}")

## Visualize Error Distribution

Let's first visualize the distribution of errors to understand the overall model performance.

In [None]:
# Calculate absolute errors (the sign of the residual is discarded)
errors = np.abs(y_test - y_pred)

# Plot error distribution
plt.figure(figsize=(10, 6))
plt.hist(errors, bins=30, alpha=0.7, color="skyblue")
plt.axvline(np.mean(errors), color="red", linestyle="dashed", linewidth=1, label=f"Mean Error: {np.mean(errors):.2f}")
plt.axvline(
    np.median(errors), color="green", linestyle="dashed", linewidth=1, label=f"Median Error: {np.median(errors):.2f}"
)
plt.title("Distribution of Absolute Errors")
plt.xlabel("Absolute Error")
plt.ylabel("Frequency")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
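
A histogram can be complemented with a few summary numbers. The sketch below uses hypothetical, synthetically generated absolute errors (not the ones from this notebook) to show how the gap between mean and median, the upper percentiles, and the share of total error carried by the worst 10% of samples hint at a concentrated error tail, which is the situation where segmentation tends to be most informative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical absolute errors: mostly small, plus a minority of large ones
abs_errors = np.concatenate([
    rng.uniform(0.0, 0.5, 900),
    rng.uniform(2.0, 4.0, 100),
])

summary = {
    "mean": float(np.mean(abs_errors)),
    "median": float(np.median(abs_errors)),
    "p90": float(np.percentile(abs_errors, 90)),
    "p99": float(np.percentile(abs_errors, 99)),
}

# Share of the total absolute error that comes from the worst 10% of samples
n_worst = len(abs_errors) // 10
worst = np.sort(abs_errors)[-n_worst:]
tail_share = float(worst.sum() / abs_errors.sum())

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
print(f"share of error in worst 10%: {tail_share:.2%}")
```

A mean well above the median, together with a large tail share, suggests that a small group of samples drives most of the error.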

## Apply Tree-Based Segmentation

Now let's use the DecisionTreeSegmentation class to identify regions in the feature space where our model has higher errors.

In [None]:
# Initialize the decision tree segmentation
dt_seg = DecisionTreeSegmentation(max_depth=4, min_samples_leaf=20)

# Fit the segmentation on the test-set errors, restricted to feature1 and feature2
dt_seg.fit(X_test, y_test, y_pred, feature_names=["feature1", "feature2"])
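
To build intuition for what a tree-based error segmentation does, here is a minimal sketch that fits a plain scikit-learn `DecisionTreeRegressor` to absolute errors. This is an assumption about the general technique, not a description of how DecisionTreeSegmentation is implemented internally. The data and the names `features`, `abs_errors`, and `error_tree` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two features, and an absolute error
# that is clearly larger when the first feature exceeds 1.0
features = rng.uniform(-3, 3, size=(500, 2))
abs_errors = np.where(features[:, 0] > 1.0, 2.0, 0.5) + rng.normal(0, 0.05, 500)

# Fit a shallow tree that predicts the error from the features;
# its splits carve the feature space into regions of similar error
error_tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20, random_state=0)
error_tree.fit(features, abs_errors)

# Each leaf is one segment; apply() returns the leaf index of every sample
leaf_ids = error_tree.apply(features)
segment_means = {}
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    segment_means[int(leaf)] = float(abs_errors[mask].mean())
    print(f"leaf {leaf}: n={mask.sum()}, mean abs error={segment_means[int(leaf)]:.2f}")
```

The `max_depth` and `min_samples_leaf` arguments play the same role here as in the cell above: they cap the number of segments and keep each one large enough for its mean error to be meaningful.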

## Visualize Error Segments with Plotly

Now let's use the plotting functions from the plotting subpackage to create interactive visualizations of the error segments.

In [None]:
# Create an interactive visualization using Plotly
fig_plotly = plot_dt_segmentation(dt_seg, cmap="YlOrRd", figsize=(800, 600))
fig_plotly.show()

## Combined Visualization with Statistics

Let's create a more comprehensive visualization that shows both the error heatmap and statistics about the top error segments.

In [None]:
# Create a combined visualization with both heatmap and segment statistics
fig_combined = plot_dt_segmentation_with_stats(dt_seg, n_top_segments=5, cmap="Viridis", figsize=(1000, 500))
fig_combined.show()

## Get Segment Statistics

Let's get statistical information about the top error segments to better understand where our model is underperforming.

In [None]:
# Get statistics for the top 5 error segments
segment_stats = dt_seg.get_segment_stats(n_segments=5)
segment_stats
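
For comparison, per-segment statistics of this kind can be computed by hand with a pandas group-by. The sketch below is illustrative only: the column names `segment_id` and `abs_error` and the values are hypothetical, and the columns returned by `get_segment_stats` may differ.

```python
import pandas as pd

# Hypothetical segment assignments and absolute errors for eight samples
df = pd.DataFrame({
    "segment_id": [0, 0, 0, 1, 1, 2, 2, 2],
    "abs_error": [0.2, 0.4, 0.3, 1.5, 1.7, 0.9, 1.1, 1.0],
})

# Aggregate the error per segment and rank segments from worst to best
stats = (
    df.groupby("segment_id")["abs_error"]
    .agg(mean_error="mean", median_error="median", n_samples="count")
    .sort_values("mean_error", ascending=False)
)

# Keep only the two segments with the highest mean error
top_segments = stats.head(2)
print(top_segments)
```

Reporting the sample count next to the mean helps separate segments that are consistently bad from those whose high error rests on only a handful of rows.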

## Get Segmented Data with Segment IDs

We can also get the original data with segment assignments, which can be useful for further analysis.

In [None]:
# Get segmented data with segment IDs and error statistics
segmented_df = dt_seg.get_segment_df(n_segments=5)
segmented_df.head()

## Extract Decision Rules

Finally, let's extract the decision rules that define the high-error segments. This gives us interpretable conditions to understand which feature combinations lead to higher prediction errors.

In [None]:
# Get decision rules for top segments
rules = dt_seg.get_decision_rules(n_segments=3)

# Format and display the rules
for segment_id, rule_list in rules.items():
    print(f"Segment {segment_id} Rules:")
    for rule in rule_list:
        print(f"  {rule['feature']} {rule['operator']} {rule['threshold']:.3f}")
    print()
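
Rules of this form correspond to root-to-leaf paths in a decision tree. As a hedged illustration, the sketch below extracts such paths from a fitted scikit-learn tree through its `tree_` attribute. The helper `leaf_rules` and the demo data are hypothetical, and the rule dictionaries only mirror the keys used in the loop above; this is not the library's own extraction code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def leaf_rules(tree, feature_names):
    """Map each leaf id of a fitted sklearn tree to its list of split conditions."""
    t = tree.tree_
    rules = {}

    def walk(node, path):
        # In scikit-learn, a node without children has child index -1
        if t.children_left[node] == -1:
            rules[int(node)] = path
            return
        name = feature_names[t.feature[node]]
        threshold = float(t.threshold[node])
        # Samples with feature <= threshold go left, the rest go right
        walk(t.children_left[node], path + [{"feature": name, "operator": "<=", "threshold": threshold}])
        walk(t.children_right[node], path + [{"feature": name, "operator": ">", "threshold": threshold}])

    walk(0, [])
    return rules


rng = np.random.default_rng(1)

# Hypothetical data: the error depends only on the sign of the second feature
X_demo = rng.uniform(-3, 3, size=(400, 2))
err_demo = np.where(X_demo[:, 1] > 0.0, 1.0, 0.1)

demo_tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, err_demo)
demo_rules = leaf_rules(demo_tree, ["feature1", "feature2"])

for leaf_id, rule_list in demo_rules.items():
    conditions = " and ".join(f"{r['feature']} {r['operator']} {r['threshold']:.3f}" for r in rule_list)
    print(f"leaf {leaf_id}: {conditions}")
```

Joining the conditions along a path with "and" gives a readable description of each segment that can be used directly as a filter on the original data.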

## Conclusion

The tree-based segmentation analysis helps identify regions in the feature space where our model has high prediction errors. By visualizing these regions using heatmaps and extracting interpretable decision rules, we can better understand the model's limitations and potentially improve its performance through targeted feature engineering or model adjustments.

The Plotly visualizations provide an interactive way to explore error patterns, which makes model performance issues easier to analyze and communicate.