A supervised clustering algorithm designed for panel data, commonly used in quantitative finance to identify time-varying, cross-sectional predictability regimes.
pip install ptree-panel
# With visualization support (matplotlib, seaborn)
pip install ptree-panel[viz]
# For development
pip install ptree-panel[dev]

P-Tree recursively splits the full sample into disjoint leaf nodes, using asset characteristics or macro states as split thresholds. Unlike standard decision trees that minimise residual MSE, P-Tree maximises the difference in predictive performance across child nodes, producing a prediction mosaic: a map showing where and when alpha is concentrated.
| Feature | Standard Decision Tree | P-Tree |
|---|---|---|
| Objective | Minimise residual MSE/Gini | Maximise predictability difference |
| Leaf Model | Constant (mean) | Ridge regression / Logit |
| Use Case | Point prediction | Regime identification |
| Output | Single prediction | Prediction mosaic |
┌─────────────────────────────────────────────────────────────────┐
│ Full Sample │
│ (all time × assets) │
└──────────────────────────┬──────────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
│ For each (feature, threshold): │
│ 1. Split into Left & Right │
│ 2. Fit Ridge on each subset │
│ 3. Compute R² for each │
│ 4. Score = |R²_L - R²_R| │
└─────────────────┬─────────────────┘
│
Select split with max score
│
┌─────────────────┴─────────────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Left Node│ │Right Node│
│ (low val)│ │(high val)│
└────┬─────┘ └────┬─────┘
│ │
▼ ▼
Recurse or Recurse or
become Leaf become Leaf
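To make the scoring step concrete, here is a minimal, self-contained sketch of evaluating one candidate split with scikit-learn's Ridge. It is illustrative only: the package's `RidgeRegressor` and `R2DiffCriterion` encapsulate this logic internally, and the `split_score` helper below is not part of the API.

```python
# Illustrative sketch of the P-Tree split score, not the package's internal code.
# Assumes rank-standardised features in [0, 1]; (feature_idx, threshold) is one candidate.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def split_score(X, y, feature_idx, threshold, alpha=1.0):
    """Score = |R2_left - R2_right| for one candidate (feature, threshold)."""
    mask = X[:, feature_idx] < threshold
    scores = []
    for side in (mask, ~mask):
        model = Ridge(alpha=alpha).fit(X[side], y[side])
        scores.append(r2_score(y[side], model.predict(X[side])))
    return abs(scores[0] - scores[1])

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))                 # rank-standardised characteristics
# Predictability regime: y is noise below the threshold, a factor model above it.
y = np.where(X[:, 0] < 0.5, 0.0, 2.0) * X[:, 1] + rng.normal(size=1000)
print(split_score(X, y, feature_idx=0, threshold=0.5))
```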
src/ptree/
├── __init__.py # Package exports
├── data_handler.py # DataHandler – alignment, missing-value fill, rank standardisation, volatility
├── predictors.py # PredictorBase, RidgeRegressor, VolWeightedRidgeRegressor, RidgeLogitClassifier, SelfDefinedPredictor
├── criteria.py # CriterionBase, R2DiffCriterion, ClassificationCriterion, evaluation helpers
├── node.py # PanelTreeNode – per-node metadata container
├── engine.py # PanelTreeEngine – recursive splitting, incremental matrix updates, feature-priority caching
└── visualization.py # NodeReporter (text/DataFrame reports), MosaicVisualizer (heatmap)
import numpy as np
import pandas as pd
from ptree import DataHandler, RidgeRegressor, R2DiffCriterion, PanelTreeEngine
from ptree import NodeReporter, MosaicVisualizer
# 1. Prepare panel data (DataFrame with date, asset_id, and feature columns)
dh = DataHandler(cs_rank_standardize=True)
X, y, vol_weights = dh.fit_transform(
df, y_series,
time_col="date", entity_col="asset_id",
ret_series_for_vol=ret_series, # optional, for VolWeightedRidge
)
# 2. Build the tree
engine = PanelTreeEngine(
predictor=RidgeRegressor(alpha=1.0),
criterion=R2DiffCriterion(),
split_thresholds=[0.3, 0.5, 0.7],
max_depth=3,
min_samples=100,
fast_mode=False,
verbose=1,
)
engine.fit(X, y, feature_names=dh.feature_names, weights=vol_weights)
# 3. Inspect results
reporter = NodeReporter(engine)
print(reporter.print_tree()) # text tree
print(reporter.leaf_summary()) # DataFrame
# 4. Prediction mosaic
viz = MosaicVisualizer(engine)
mosaic = viz.build_mosaic(X, y, time_col="date", metric="r2")
fig, ax = viz.plot_mosaic(mosaic) # requires matplotlib & seaborn
# 5. Retrieve leaf-node samples
for leaf_id, indices in engine.get_leaf_samples().items():
print(f"Leaf {leaf_id}: {len(indices)} observations")Handles panel data preprocessing including alignment, missing value imputation, cross-sectional rank standardisation, and volatility computation.
| Parameter | Default | Description |
|---|---|---|
| `cs_rank_standardize` | `True` | Cross-sectional rank normalisation to [0, 1] |
| `vol_window` | `60` | Rolling window for volatility computation |
| `min_obs` | `20` | Minimum observations for volatility calculation |
| `fillna_method` | `"ffill"` | Missing-value strategy (`ffill`, `bfill`, `zero`, `mean`, `None`) |
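For intuition, cross-sectional rank standardisation re-maps each feature within every date, which is what makes fixed thresholds such as 0.3/0.5/0.7 comparable across periods. A minimal pandas equivalent of the idea (`DataHandler`'s exact scaling convention may differ; percentile ranks here land in (0, 1]):

```python
# Illustrative cross-sectional rank standardisation, computed per date.
# `df` has columns: date, asset_id, plus raw feature columns.
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "asset_id": ["A", "B", "C", "A", "B"],
    "size": [1.0, 5.0, 3.0, 2.0, 8.0],
})
df["size_rank"] = df.groupby("date")["size"].rank(pct=True)  # percentile rank within date
print(df)
```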
All predictors inherit from `PredictorBase` and implement `fit()` / `predict()`.
| Class | Use Case |
|---|---|
| `RidgeRegressor` | Standard Ridge regression (closed-form) |
| `VolWeightedRidgeRegressor` | Inverse-volatility weighted Ridge (handles heteroscedasticity) |
| `RidgeLogitClassifier` | Ridge logistic regression via IRLS |
| `SelfDefinedPredictor` | User-defined model base class |
Custom Predictor Example:
from ptree import SelfDefinedPredictor
class MyLGBPredictor(SelfDefinedPredictor):
def fit(self, X, y, weights=None):
import lightgbm as lgb
self.model = lgb.LGBMRegressor().fit(X, y, sample_weight=weights)
return self
def predict(self, X):
        return self.model.predict(X)

Split-quality criteria evaluate whether a candidate split produces child nodes with meaningfully different predictability.
| Class | Description |
|---|---|
| `R2DiffCriterion` | Maximise \|R²_L − R²_R\| (regression) |
| `ClassificationCriterion` | Maximise difference in Precision / F1 / AUC (classification) |
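For classification, the criterion compares a classification metric between the two candidate children. A standalone illustration using scikit-learn's `precision_score` (`ClassificationCriterion` wraps this kind of comparison; its internals may differ):

```python
# Illustrative classification split score: difference in child-node precision.
import numpy as np
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = y_true.copy()
y_pred[:50] = rng.integers(0, 2, size=50)   # corrupt some predictions
mask = np.arange(200) < 100                  # pretend this is the candidate split

p_left = precision_score(y_true[mask], y_pred[mask])
p_right = precision_score(y_true[~mask], y_pred[~mask])
print(f"score = |{p_left:.3f} - {p_right:.3f}| = {abs(p_left - p_right):.3f}")
```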
The main engine for building and querying Panel Trees.
| Parameter | Default | Description |
|---|---|---|
| `predictor` | `RidgeRegressor` | Leaf-node predictor (instance or class) |
| `criterion` | `R2DiffCriterion()` | Split-quality criterion |
| `split_thresholds` | `[0.3, 0.5, 0.7]` | Candidate split points on (rank-standardised) feature values |
| `max_depth` | `3` | Maximum tree depth |
| `min_samples` | `100` | Minimum observations per node |
| `fast_mode` | `False` | Enable feature-priority caching from parent nodes |
| `early_stopping_threshold` | `None` | Stop searching once the criterion exceeds this value (requires `fast_mode`) |
| `n_jobs` | `1` | Parallel workers (`-1` = all cores) |
| `verbose` | `1` | Logging verbosity (0 = silent, 1 = per-level, 2 = per-candidate) |
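For a classification target (e.g., the sign of next-period returns), swap in the logit predictor and classification criterion. A configuration sketch, reusing `X`, `y`, and `dh` from the quick-start example and assuming both classes accept their default constructor arguments:

```python
# Classification setup: predict up/down labels instead of raw returns.
from ptree import RidgeLogitClassifier, ClassificationCriterion, PanelTreeEngine

clf_engine = PanelTreeEngine(
    predictor=RidgeLogitClassifier(),      # Ridge logistic regression via IRLS
    criterion=ClassificationCriterion(),   # maximise precision/F1/AUC difference
    split_thresholds=[0.3, 0.5, 0.7],
    max_depth=3,
    min_samples=100,
)
y_up = (y > 0).astype(int)                 # binary target from the return series
clf_engine.fit(X, y_up, feature_names=dh.feature_names)
```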
P-Tree provides rich output and query interfaces across four main classes: PanelTreeEngine, PanelTreeNode, NodeReporter, and MosaicVisualizer.
`predict()` generates per-sample predictions on new data. Each observation traverses the tree down to its leaf node, which produces the prediction with its local model.
preds = engine.predict(X_proc)  # shape: (n_samples,)

`get_leaves()` returns a list of all leaf node objects.
for leaf in engine.get_leaves():
print(f"Leaf {leaf.node_id}: R²={leaf.metrics.get('r2', None):.4f}, n={leaf.n_samples}")Return all nodes in the tree (BFS order), including both internal nodes and leaves.
all_nodes = engine.get_all_nodes()
print(f"Total nodes: {len(all_nodes)}")Return a structured DataFrame with one row per node containing the following columns:
| Column | Description |
|---|---|
| `Node_ID` | Unique node identifier |
| `Depth` | Node depth (root = 0) |
| `Rule` | Full path rule from root, e.g. `root & char_1 >= 0.5 & char_3 < 0.7` |
| `Is_Leaf` | Whether the node is a leaf |
| `N_Samples` | Number of samples in the node |
| `Sample_Ratio` | Ratio of samples relative to the total |
| `Split_Feature` | Feature used for splitting (NaN for leaves) |
| `Split_Threshold` | Split threshold value (NaN for leaves) |
| `Split_Score` | Criterion score at the split |
| `Predictability_Score` | Predictability strength (R² for regression, Precision for classification) |
| `Metrics` | Full metrics dictionary, e.g. `{"r2": 0.63, "mse": 0.22, "n_samples": 2429}` |
| `Model_Weights` | Feature coefficients of the leaf model |
| `Elapsed_Time_s` | Time spent building the node (seconds) |
| `Parent_ID` | Parent node ID |
report = engine.get_node_report()
print(report[["Node_ID", "Depth", "Rule", "Predictability_Score", "N_Samples"]])Return a dictionary mapping leaf node_id to an array of original sample row indices. Useful for extracting the raw data corresponding to each cluster.
leaf_samples = engine.get_leaf_samples()
for leaf_id, indices in leaf_samples.items():
subset = original_df.iloc[indices]
print(f"Leaf {leaf_id}: {len(indices)} samples, "
f"mean_return={subset['ret'].mean():.4f}, "
f"unique_assets={subset['asset_id'].nunique()}")Node objects can be obtained via engine.get_leaves() or engine.get_all_nodes().
`n_samples`: number of samples contained in this node (read-only property).
`metrics`: evaluation metrics dictionary. For regression: `r2`, `mse`, `n_samples`. For classification: `precision`, `f1`, `auc`, `n_samples`.
leaf = engine.get_leaves()[0]
print(leaf.metrics)  # {"r2": 0.63, "mse": 0.22, "n_samples": 2429}

`get_model_weights()` returns the feature coefficient vector of the leaf node's local model, useful for inspecting which factors are active in a specific regime.
for leaf in engine.get_leaves():
coef = leaf.get_model_weights()
if coef is not None:
for name, w in zip(dh.feature_names, coef):
print(f" {name}: {w:+.4f}")Return sample row indices belonging to this node. Similar to engine.get_leaf_samples(), but can be used for any node (including internal nodes).
node = engine.get_all_nodes()[1] # Second node
indices = node.get_samples()
print(f"Node {node.node_id} contains {len(indices)} samples")Serialise all node metadata to a flat dictionary, convenient for building DataFrames or exporting to JSON.
import json
leaf = engine.get_leaves()[0]
print(json.dumps(leaf.to_dict(), indent=2, default=str))

| Attribute | Type | Description |
|---|---|---|
| `node.node_id` | `int` | Unique identifier |
| `node.depth` | `int` | Depth level |
| `node.rule` | `str` | Path description, e.g. `root & char_1 < 0.5 & char_3 >= 0.7` |
| `node.split_feature` | `str \| None` | Split feature name |
| `node.split_threshold` | `float \| None` | Split threshold |
| `node.split_score` | `float \| None` | Criterion score at the split |
| `node.is_leaf` | `bool` | Whether this is a leaf |
| `node.sample_ratio` | `float` | Sample coverage ratio |
| `node.elapsed_time` | `float` | Build time (seconds) |
| `node.predictor` | `PredictorBase` | Trained local model instance |
NodeReporter encapsulates user-facing reporting functionality. It requires a fitted PanelTreeEngine.
from ptree import NodeReporter
reporter = NodeReporter(engine)

`summary()` returns the complete node report DataFrame (all nodes, including internal nodes and leaves). Column definitions match `engine.get_node_report()`.
full = reporter.summary()
print(full[["Node_ID", "Depth", "Is_Leaf", "Split_Feature", "Predictability_Score"]])Return only the leaf nodes report. Structure is the same as summary(), suitable for quickly viewing final clustering results.
leaves = reporter.leaf_summary()
print(leaves[["Node_ID", "Rule", "Predictability_Score", "N_Samples", "Model_Weights"]])Example Output:
Node_ID Rule Predictability_Score N_Samples
3 root & char_1 < 0.5 & char_1 < 0.3 & char_3 < 0.7 0.0147 2438
4 root & char_1 < 0.5 & char_1 < 0.3 & char_3 >= 0.7 0.0018 1102
13 root & char_1 >= 0.5 & char_3 >= 0.3 & char_3 < 0.7 0.6323 2429
`print_tree()` returns the tree as formatted text, using indentation and ├─ / └─ connectors to represent the hierarchy.

print(reporter.print_tree())

Example Output:
[Node 0] char_1 < 0.5 | r2=0.1234, n=12000 (Δ=0.4569)
├── [Node 1] char_1 < 0.3 | r2=0.0523, n=5940 (Δ=0.0140)
│ ├── [Leaf 3] r2=0.0147, mse=0.4769, n=2438
│ └── [Leaf 4] r2=0.0018, mse=0.8028, n=1102
└── [Leaf 5] r2=0.4640, mse=0.5483, n=6060
MosaicVisualizer generates "prediction mosaics" — 2D heatmaps that visually display the model's predictive power across different time periods and asset clusters.
from ptree import MosaicVisualizer
viz = MosaicVisualizer(engine)

`build_mosaic()` computes per-leaf, per-period metric values and returns a DataFrame.
| Parameter | Description |
|---|---|
| `X` | Processed panel DataFrame (must include `time_col` and the feature columns) |
| `y` | Target variable |
| `time_col` | Time column name, default `"date"` |
| `metric` | Evaluation metric: `"r2"` for regression; `"precision"` / `"f1"` / `"auc"` for classification |
Return Structure:
- Row index: leaf node IDs (`Leaf_ID`)
- Columns: time periods (determined by `time_col`)
- Values: the metric value for that leaf in that period
mosaic = viz.build_mosaic(X_proc, y_proc, time_col="date", metric="r2")
print(mosaic.shape) # (n_leaves, n_periods)
print(mosaic.iloc[:, :5]) # Preview first 5 periods
# Analyse which leaves perform best in which periods
best_leaf_per_period = mosaic.idxmax(axis=0)
print(best_leaf_per_period)

Example Output:
0 1 2 3 4
Leaf_ID
3 0.016 -0.042 0.006 -0.089 0.036
13 0.621 0.782 0.599 0.687 0.605
14 0.502 0.465 0.350 0.462 0.289
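Because the mosaic is a plain DataFrame, ordinary pandas operations can quantify regime stability, for example each leaf's average performance and its variability across periods:

```python
# Summarise each leaf's performance across time: average level and stability.
import pandas as pd

stability = pd.DataFrame({
    "mean_r2": mosaic.mean(axis=1),          # average predictability per leaf
    "std_r2": mosaic.std(axis=1),            # how much it varies over time
    "pct_positive": (mosaic > 0).mean(axis=1),  # fraction of periods with R² > 0
})
print(stability.sort_values("mean_r2", ascending=False))
```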
`plot_mosaic()` renders the mosaic matrix as a seaborn heatmap. Requires matplotlib and seaborn.
| Parameter | Default | Description |
|---|---|---|
| `mosaic` | — | DataFrame returned by `build_mosaic()` |
| `title` | `"Prediction Mosaic"` | Chart title |
| `cmap` | `"RdYlGn"` | Colour map (red = poor, green = good) |
| `figsize` | `(14, 6)` | Figure size |
| `save_path` | `None` | If specified, automatically save the figure as a PNG |
# Interactive viewing
fig, ax = viz.plot_mosaic(mosaic, title="P-Tree R² Mosaic")
# Save to file
fig, ax = viz.plot_mosaic(mosaic, save_path="output/mosaic.png", cmap="coolwarn")

Heatmap Interpretation:
- X-axis: time period $t$
- Y-axis: leaf nodes
- Colour: predictive accuracy for that leaf in that period (R² or Precision)
- Instantly reveals when and where the model "fails" or "excels"
PanelTreeEngine outputs detailed splitting process logs via Python's logging module when verbose >= 1:
import logging
logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")
engine = PanelTreeEngine(..., verbose=1)
engine.fit(X, y, feature_names=...)

Example Log Output:
[INFO] [Level 0] Splitting Node 0...
- Best Split: 'char_1' at threshold 0.5000
- Metric Delta: score = 0.456896
- Left: 5940 samples | Right: 6060 samples
[INFO] [Level 1] Splitting Node 1...
- Best Split: 'char_3' at threshold 0.3000
- Metric Delta: score = 0.179045
- Left: 1808 samples | Right: 4252 samples
[INFO] Tree built: 15 nodes, 8 leaves, max_depth=3
Set verbose=2 to view per-candidate (feature, threshold) evaluation results.
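To keep a persistent record of a long fit, attach a standard-library FileHandler before calling `fit()`. The sketch below attaches it to the root logger, since the logger name used internally by `PanelTreeEngine` is not documented here:

```python
# Persist the split logs to a file via the standard logging module.
import logging

handler = logging.FileHandler("ptree_fit.log")
handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
logging.getLogger().addHandler(handler)  # root logger; adjust if ptree uses a named logger
```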
- Incremental matrix updates – For Ridge models, $X^\top W X$ and $X^\top W y$ are cached at each node. Child-node statistics are obtained by subtraction, avoiding redundant matrix multiplications (see the sketch after this list).
- Feature-priority caching – When `fast_mode=True`, child nodes first evaluate the top-50% features from the parent, with optional early stopping.
- Multiprocessing – Node-level parallelism via `n_jobs` for high-dimensional feature sets.
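The incremental update works because Ridge's sufficient statistics are additive over rows: the parent's $X^\top W X$ minus the left child's equals the right child's. A numerical sketch of the idea (the engine's actual caching layer is more involved):

```python
# Sketch of additive sufficient statistics for Ridge: parent - left = right.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w = rng.uniform(0.5, 1.5, size=1000)   # observation weights
mask = X[:, 0] < 0.0                    # some candidate split

def xtwx(X, w):
    """Weighted Gram matrix X'WX."""
    return X.T @ (w[:, None] * X)

parent = xtwx(X, w)
left = xtwx(X[mask], w[mask])
right_incremental = parent - left       # no recomputation on the right subset
assert np.allclose(right_incremental, xtwx(X[~mask], w[~mask]))
print("right-child X'WX recovered by subtraction")
```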
- Python ≥ 3.10
- numpy, pandas
- matplotlib, seaborn (optional, for visualisation)
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
# Clone the repository
git clone https://github.com/ElenYoung/AssetPanelTree.git
cd AssetPanelTree
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest test/ -v

This project is licensed under the MIT License; see the LICENSE file for details.
If you use P-Tree in your research, please consider citing:
@software{ptree2026,
author = {ElenYoung},
title = {P-Tree: Panel Tree for Supervised Clustering},
year = {2026},
url = {https://github.com/ElenYoung/AssetPanelTree}
}