Panel Tree (P-Tree)

A supervised clustering algorithm designed for panel data, commonly used in quantitative finance to identify time-varying, cross-sectional predictability regimes.

Installation

pip install ptree-panel

# With visualization support (matplotlib, seaborn)
pip install ptree-panel[viz]

# For development
pip install ptree-panel[dev]

Core Idea

P-Tree recursively splits the full sample into disjoint leaf nodes using asset characteristics or macro states as thresholds. Unlike standard decision trees that minimise residual MSE, P-Tree maximises the difference in predictive performance across child nodes, producing a prediction mosaic — a map showing where and when alpha is concentrated.

Key Differentiators

Feature	Standard Decision Tree	P-Tree
Objective	Minimise residual MSE/Gini	Maximise predictability difference
Leaf Model	Constant (mean)	Ridge regression / Logit
Use Case	Point prediction	Regime identification
Output	Single prediction	Prediction mosaic

Algorithm Overview

┌─────────────────────────────────────────────────────────────────┐
│                         Full Sample                              │
│                     (all time × assets)                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
         ┌─────────────────┴─────────────────┐
         │     For each (feature, threshold): │
         │     1. Split into Left & Right     │
         │     2. Fit Ridge on each subset    │
         │     3. Compute R² for each         │
         │     4. Score = |R²_L - R²_R|       │
         └─────────────────┬─────────────────┘
                           │
              Select split with max score
                           │
         ┌─────────────────┴─────────────────┐
         ▼                                   ▼
   ┌──────────┐                       ┌──────────┐
   │ Left Node│                       │Right Node│
   │ (low val)│                       │(high val)│
   └────┬─────┘                       └────┬─────┘
        │                                  │
        ▼                                  ▼
   Recurse or                         Recurse or
   become Leaf                        become Leaf

Project Structure

src/ptree/
├── __init__.py          # Package exports
├── data_handler.py      # DataHandler – alignment, missing-value fill, rank standardisation, volatility
├── predictors.py        # PredictorBase, RidgeRegressor, VolWeightedRidgeRegressor, RidgeLogitClassifier, SelfDefinedPredictor
├── criteria.py          # CriterionBase, R2DiffCriterion, ClassificationCriterion, evaluation helpers
├── node.py              # PanelTreeNode – per-node metadata container
├── engine.py            # PanelTreeEngine – recursive splitting, incremental matrix updates, feature-priority caching
└── visualization.py     # NodeReporter (text/DataFrame reports), MosaicVisualizer (heatmap)

Quick Start

import numpy as np
import pandas as pd
from ptree import DataHandler, RidgeRegressor, R2DiffCriterion, PanelTreeEngine
from ptree import NodeReporter, MosaicVisualizer

# 1. Prepare panel data (DataFrame with date, asset_id, and feature columns)
dh = DataHandler(cs_rank_standardize=True)
X, y, vol_weights = dh.fit_transform(
    df, y_series,
    time_col="date", entity_col="asset_id",
    ret_series_for_vol=ret_series,       # optional, for VolWeightedRidge
)

# 2. Build the tree
engine = PanelTreeEngine(
    predictor=RidgeRegressor(alpha=1.0),
    criterion=R2DiffCriterion(),
    split_thresholds=[0.3, 0.5, 0.7],
    max_depth=3,
    min_samples=100,
    fast_mode=False,
    verbose=1,
)
engine.fit(X, y, feature_names=dh.feature_names, weights=vol_weights)

# 3. Inspect results
reporter = NodeReporter(engine)
print(reporter.print_tree())           # text tree
print(reporter.leaf_summary())         # DataFrame

# 4. Prediction mosaic
viz = MosaicVisualizer(engine)
mosaic = viz.build_mosaic(X, y, time_col="date", metric="r2")
fig, ax = viz.plot_mosaic(mosaic)       # requires matplotlib & seaborn

# 5. Retrieve leaf-node samples
for leaf_id, indices in engine.get_leaf_samples().items():
    print(f"Leaf {leaf_id}: {len(indices)} observations")

Module Overview

DataHandler

Handles panel data preprocessing including alignment, missing value imputation, cross-sectional rank standardisation, and volatility computation.

Parameter	Default	Description
`cs_rank_standardize`	`True`	Cross-sectional rank normalisation to [0, 1]
`vol_window`	`60`	Rolling window for volatility computation
`min_obs`	`20`	Minimum observations for volatility calculation
`fillna_method`	`"ffill"`	Missing-value strategy (`ffill`, `bfill`, `zero`, `mean`, `None`)

Predictors

All predictors inherit from PredictorBase and implement fit() / predict().

Class	Use Case
`RidgeRegressor`	Standard Ridge regression (closed-form)
`VolWeightedRidgeRegressor`	Inverse-volatility weighted Ridge (handles heteroscedasticity)
`RidgeLogitClassifier`	Ridge logistic regression via IRLS
`SelfDefinedPredictor`	User-defined model base class

Custom Predictor Example:

from ptree import SelfDefinedPredictor

class MyLGBPredictor(SelfDefinedPredictor):
    def fit(self, X, y, weights=None):
        import lightgbm as lgb
        self.model = lgb.LGBMRegressor().fit(X, y, sample_weight=weights)
        return self
    
    def predict(self, X):
        return self.model.predict(X)

Criteria

Split-quality criteria evaluate whether a candidate split produces child nodes with meaningfully different predictability.

Class	Description
`R2DiffCriterion`	Maximise \|R²_L − R²_R\| (regression)
`ClassificationCriterion`	Maximise difference in Precision / F1 / AUC (classification)

PanelTreeEngine

The main engine for building and querying Panel Trees.

Parameter	Default	Description
`predictor`	`RidgeRegressor`	Leaf-node predictor (instance or class)
`criterion`	`R2DiffCriterion()`	Split-quality criterion
`split_thresholds`	`[0.3, 0.5, 0.7]`	Candidate split points on (rank-standardised) feature values
`max_depth`	`3`	Maximum tree depth
`min_samples`	`100`	Minimum observations per node
`fast_mode`	`False`	Enable feature-priority caching from parent nodes
`early_stopping_threshold`	`None`	Stop searching if criterion exceeds this value (requires `fast_mode`)
`n_jobs`	`1`	Parallel workers (`-1` = all cores)
`verbose`	`1`	Logging verbosity (0=silent, 1=per-level, 2=per-candidate)

Output & Query API Reference

P-Tree provides rich output and query interfaces across four main classes: PanelTreeEngine, PanelTreeNode, NodeReporter, and MosaicVisualizer.

PanelTreeEngine Methods

`engine.predict(X) → np.ndarray`

Generate per-sample predictions on new data. Each observation traverses down the tree to its corresponding leaf node, which provides the prediction using its local model.

preds = engine.predict(X_proc)  # shape: (n_samples,)

`engine.get_leaves() → List[PanelTreeNode]`

Return a list of all leaf node objects.

for leaf in engine.get_leaves():
    print(f"Leaf {leaf.node_id}: R²={leaf.metrics.get('r2', None):.4f}, n={leaf.n_samples}")

`engine.get_all_nodes() → List[PanelTreeNode]`

Return all nodes in the tree (BFS order), including both internal nodes and leaves.

all_nodes = engine.get_all_nodes()
print(f"Total nodes: {len(all_nodes)}")

`engine.get_node_report() → pd.DataFrame`

Return a structured DataFrame with one row per node containing the following columns:

Column	Description
`Node_ID`	Unique node identifier
`Depth`	Node depth (root = 0)
`Rule`	Full path rule from root, e.g., `root & char_1 >= 0.5 & char_3 < 0.7`
`Is_Leaf`	Whether the node is a leaf
`N_Samples`	Number of samples in the node
`Sample_Ratio`	Ratio of samples relative to total
`Split_Feature`	Feature used for splitting (NaN for leaves)
`Split_Threshold`	Split threshold value (NaN for leaves)
`Split_Score`	Criterion score at split
`Predictability_Score`	Predictability strength (R² for regression, Precision for classification)
`Metrics`	Full metrics dictionary, e.g., `{"r2": 0.63, "mse": 0.22, "n_samples": 2429}`
`Model_Weights`	Feature coefficients of the leaf model
`Elapsed_Time_s`	Time spent building the node (seconds)
`Parent_ID`	Parent node ID

report = engine.get_node_report()
print(report[["Node_ID", "Depth", "Rule", "Predictability_Score", "N_Samples"]])

`engine.get_leaf_samples() → Dict[int, np.ndarray]`

Return a dictionary mapping leaf node_id to an array of original sample row indices. Useful for extracting the raw data corresponding to each cluster.

leaf_samples = engine.get_leaf_samples()
for leaf_id, indices in leaf_samples.items():
    subset = original_df.iloc[indices]
    print(f"Leaf {leaf_id}: {len(indices)} samples, "
          f"mean_return={subset['ret'].mean():.4f}, "
          f"unique_assets={subset['asset_id'].nunique()}")

PanelTreeNode Methods

Node objects can be obtained via engine.get_leaves() or engine.get_all_nodes().

`node.n_samples → int`

Number of samples contained in this node (read-only property).

`node.metrics → Dict[str, float]`

Evaluation metrics dictionary. For regression: r2, mse, n_samples. For classification: precision, f1, auc, n_samples.

leaf = engine.get_leaves()[0]
print(leaf.metrics)  # {"r2": 0.63, "mse": 0.22, "n_samples": 2429}

`node.get_model_weights() → np.ndarray | None`

Return the feature coefficient vector of the leaf node's local model. Useful for inspecting which factors are active in a specific regime.

for leaf in engine.get_leaves():
    coef = leaf.get_model_weights()
    if coef is not None:
        for name, w in zip(dh.feature_names, coef):
            print(f"  {name}: {w:+.4f}")

`node.get_samples() → np.ndarray | None`

Return sample row indices belonging to this node. Similar to engine.get_leaf_samples(), but can be used for any node (including internal nodes).

node = engine.get_all_nodes()[1]  # Second node
indices = node.get_samples()
print(f"Node {node.node_id} contains {len(indices)} samples")

`node.to_dict() → Dict[str, Any]`

Serialise all node metadata to a flat dictionary, convenient for building DataFrames or exporting to JSON.

import json
leaf = engine.get_leaves()[0]
print(json.dumps(leaf.to_dict(), indent=2, default=str))

Common Read-Only Attributes

Attribute	Type	Description
`node.node_id`	`int`	Unique identifier
`node.depth`	`int`	Depth level
`node.rule`	`str`	Path description, e.g., `root & char_1 < 0.5 & char_3 >= 0.7`
`node.split_feature`	`str \| None`	Split feature name
`node.split_threshold`	`float \| None`	Split threshold
`node.split_score`	`float \| None`	Criterion score at split
`node.is_leaf`	`bool`	Whether this is a leaf
`node.sample_ratio`	`float`	Sample coverage ratio
`node.elapsed_time`	`float`	Build time (seconds)
`node.predictor`	`PredictorBase`	Trained local model instance

NodeReporter Methods

NodeReporter encapsulates user-facing reporting functionality. It requires a fitted PanelTreeEngine.

from ptree import NodeReporter
reporter = NodeReporter(engine)

`reporter.summary() → pd.DataFrame`

Return a complete node report DataFrame (all nodes, including internal nodes and leaves). Column definitions are the same as engine.get_node_report().

full = reporter.summary()
print(full[["Node_ID", "Depth", "Is_Leaf", "Split_Feature", "Predictability_Score"]])

`reporter.leaf_summary() → pd.DataFrame`

Return only the leaf nodes report. Structure is the same as summary(), suitable for quickly viewing final clustering results.

leaves = reporter.leaf_summary()
print(leaves[["Node_ID", "Rule", "Predictability_Score", "N_Samples", "Model_Weights"]])

Example Output:

 Node_ID                                              Rule  Predictability_Score  N_Samples
       3   root & char_1 < 0.5 & char_1 < 0.3 & char_3 < 0.7              0.0147       2438
       4  root & char_1 < 0.5 & char_1 < 0.3 & char_3 >= 0.7              0.0018       1102
      13  root & char_1 >= 0.5 & char_3 >= 0.3 & char_3 < 0.7              0.6323       2429

`reporter.print_tree() → str`

Return a formatted tree structure text string using indentation and ├─ / └─ to represent hierarchical relationships.

print(reporter.print_tree())

Example Output:

[Node 0] char_1 < 0.5 | r2=0.1234, n=12000 (Δ=0.4569)
├── [Node 1] char_1 < 0.3 | r2=0.0523, n=5940 (Δ=0.0140)
│   ├── [Leaf 3] r2=0.0147, mse=0.4769, n=2438
│   └── [Leaf 4] r2=0.0018, mse=0.8028, n=1102
└── [Leaf 5] r2=0.4640, mse=0.5483, n=6060

MosaicVisualizer Methods

MosaicVisualizer generates "prediction mosaics" — 2D heatmaps that visually display the model's predictive power across different time periods and asset clusters.

from ptree import MosaicVisualizer
viz = MosaicVisualizer(engine)

`viz.build_mosaic(X, y, time_col, metric) → pd.DataFrame`

Compute per-leaf, per-period metric values and return a DataFrame.

Parameter	Description
`X`	Processed panel DataFrame (must include `time_col` and feature columns)
`y`	Target variable
`time_col`	Time column name, default `"date"`
`metric`	Evaluation metric: `"r2"` for regression, `"precision"` / `"f1"` / `"auc"` for classification

Return Structure:

Row index: Leaf node IDs (Leaf_ID)
Columns: Time periods (determined by time_col)
Values: Metric value for that leaf in that period

mosaic = viz.build_mosaic(X_proc, y_proc, time_col="date", metric="r2")
print(mosaic.shape)       # (n_leaves, n_periods)
print(mosaic.iloc[:, :5]) # Preview first 5 periods

# Analyse which leaves perform best in which periods
best_leaf_per_period = mosaic.idxmax(axis=0)
print(best_leaf_per_period)

Example Output:

         0         1         2         3         4
Leaf_ID
3     0.016    -0.042     0.006    -0.089     0.036
13    0.621     0.782     0.599     0.687     0.605
14    0.502     0.465     0.350     0.462     0.289

`viz.plot_mosaic(mosaic, title, cmap, figsize, save_path) → (fig, ax)`

Render the mosaic matrix as a seaborn heatmap. Requires matplotlib and seaborn.

Parameter	Default	Description
`mosaic`	—	DataFrame returned by `build_mosaic()`
`title`	`"Prediction Mosaic"`	Chart title
`cmap`	`"RdYlGn"`	Colour map (red=poor, green=good)
`figsize`	`(14, 6)`	Figure size
`save_path`	`None`	If specified, automatically save as PNG

# Interactive viewing
fig, ax = viz.plot_mosaic(mosaic, title="P-Tree R² Mosaic")

# Save to file
fig, ax = viz.plot_mosaic(mosaic, save_path="output/mosaic.png", cmap="coolwarm")

Heatmap Interpretation:

X-axis: Time period $t$
Y-axis: Leaf nodes
Colour: Predictive accuracy for that leaf in that period (R² or Precision)
Instantly reveals when and where the model "fails" or "excels"

Verbose Logging

PanelTreeEngine outputs detailed splitting process logs via Python's logging module when verbose >= 1:

import logging
logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")

engine = PanelTreeEngine(..., verbose=1)
engine.fit(X, y, feature_names=...)

Example Log Output:

[INFO] [Level 0] Splitting Node 0...
  - Best Split: 'char_1' at threshold 0.5000
  - Metric Delta: score = 0.456896
  - Left: 5940 samples | Right: 6060 samples
[INFO] [Level 1] Splitting Node 1...
  - Best Split: 'char_3' at threshold 0.3000
  - Metric Delta: score = 0.179045
  - Left: 1808 samples | Right: 4252 samples
[INFO] Tree built: 15 nodes, 8 leaves, max_depth=3

Set verbose=2 to view per-candidate (feature, threshold) evaluation results.

Performance Optimisations

Incremental matrix updates – For Ridge models, $X^TWX$ and $X^TWy$ are cached at each node. Child-node statistics are obtained by subtraction, avoiding redundant matrix multiplications.
Feature-priority caching – When fast_mode=True, child nodes first evaluate the top-50% features from the parent, with optional early stopping.
Multiprocessing – Node-level parallelism via n_jobs for high-dimensional feature sets.

Requirements

Python ≥ 3.10
numpy, pandas
matplotlib, seaborn (optional, for visualisation)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

# Clone the repository
git clone https://github.com/ElenYoung/AssetPanelTree.git
cd AssetPanelTree

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest test/ -v

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use P-Tree in your research, please consider citing:

@software{ptree2026,
  author = {ElenYoung},
  title = {P-Tree: Panel Tree for Supervised Clustering},
  year = {2026},
  url = {https://github.com/ElenYoung/AssetPanelTree}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
example		example
src/ptree		src/ptree
test		test
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Panel Tree (P-Tree)

Installation

Core Idea

Key Differentiators

Algorithm Overview

Project Structure

Quick Start

Module Overview

DataHandler

Predictors

Criteria

PanelTreeEngine

Output & Query API Reference

PanelTreeEngine Methods

engine.predict(X) → np.ndarray

engine.get_leaves() → List[PanelTreeNode]

engine.get_all_nodes() → List[PanelTreeNode]

engine.get_node_report() → pd.DataFrame

engine.get_leaf_samples() → Dict[int, np.ndarray]

PanelTreeNode Methods

node.n_samples → int

node.metrics → Dict[str, float]

node.get_model_weights() → np.ndarray | None

node.get_samples() → np.ndarray | None

node.to_dict() → Dict[str, Any]

Common Read-Only Attributes

NodeReporter Methods

reporter.summary() → pd.DataFrame

reporter.leaf_summary() → pd.DataFrame

reporter.print_tree() → str

MosaicVisualizer Methods

viz.build_mosaic(X, y, time_col, metric) → pd.DataFrame

viz.plot_mosaic(mosaic, title, cmap, figsize, save_path) → (fig, ax)

Verbose Logging

Performance Optimisations

Requirements

Contributing

License

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`engine.predict(X) → np.ndarray`

`engine.get_leaves() → List[PanelTreeNode]`

`engine.get_all_nodes() → List[PanelTreeNode]`

`engine.get_node_report() → pd.DataFrame`

`engine.get_leaf_samples() → Dict[int, np.ndarray]`

`node.n_samples → int`

`node.metrics → Dict[str, float]`

`node.get_model_weights() → np.ndarray | None`

`node.get_samples() → np.ndarray | None`

`node.to_dict() → Dict[str, Any]`

`reporter.summary() → pd.DataFrame`

`reporter.leaf_summary() → pd.DataFrame`

`reporter.print_tree() → str`

`viz.build_mosaic(X, y, time_col, metric) → pd.DataFrame`

`viz.plot_mosaic(mosaic, title, cmap, figsize, save_path) → (fig, ax)`

Packages