Panel Tree (P-Tree)


A supervised clustering algorithm designed for panel data, commonly used in quantitative finance to identify time-varying, cross-sectional predictability regimes.

Installation

pip install ptree-panel

# With visualization support (matplotlib, seaborn)
pip install ptree-panel[viz]

# For development
pip install ptree-panel[dev]

Core Idea

P-Tree recursively splits the full sample into disjoint leaf nodes using asset characteristics or macro states as thresholds. Unlike standard decision trees that minimise residual MSE, P-Tree maximises the difference in predictive performance across child nodes, producing a prediction mosaic — a map showing where and when alpha is concentrated.

Key Differentiators

| Feature | Standard Decision Tree | P-Tree |
| --- | --- | --- |
| Objective | Minimise residual MSE/Gini | Maximise predictability difference |
| Leaf Model | Constant (mean) | Ridge regression / Logit |
| Use Case | Point prediction | Regime identification |
| Output | Single prediction | Prediction mosaic |

Algorithm Overview

┌──────────────────────────────────────────────────────────────────┐
│                         Full Sample                              │
│                     (all time × assets)                          │
└──────────────────────────┬───────────────────────────────────────┘
                           │
         ┌─────────────────┴──────────────────┐
         │     For each (feature, threshold): │
         │     1. Split into Left & Right     │
         │     2. Fit Ridge on each subset    │
         │     3. Compute R² for each         │
         │     4. Score = |R²_L - R²_R|       │
         └─────────────────┬──────────────────┘
                           │
              Select split with max score
                           │
         ┌─────────────────┴─────────────────┐
         ▼                                   ▼
   ┌──────────┐                       ┌──────────┐
   │ Left Node│                       │Right Node│
   │ (low val)│                       │(high val)│
   └────┬─────┘                       └────┬─────┘
        │                                  │
        ▼                                  ▼
   Recurse or                         Recurse or
   become Leaf                        become Leaf
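The split search above can be sketched in plain numpy. This is a schematic illustration of the idea, not the library's implementation: closed-form ridge is fit on each side of every candidate split, and the winning split maximises |R²_L − R²_R|.

```python
import numpy as np

def ridge_r2(X, y, alpha=1.0):
    """Fit closed-form ridge on (X, y) and return in-sample R^2."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def best_split(X, y, thresholds=(0.3, 0.5, 0.7), alpha=1.0, min_samples=10):
    """Exhaustive search over (feature, threshold); score = |R2_L - R2_R|."""
    best = None  # (score, feature, threshold)
    for j in range(X.shape[1]):
        for t in thresholds:
            left = X[:, j] < t
            if left.sum() < min_samples or (~left).sum() < min_samples:
                continue
            score = abs(ridge_r2(X[left], y[left], alpha)
                        - ridge_r2(X[~left], y[~left], alpha))
            if best is None or score > best[0]:
                best = (score, j, t)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))          # rank-standardised features in [0, 1]
# Predictability lives only in the X[:, 0] >= 0.5 half of the sample
y = np.where(X[:, 0] >= 0.5, 2.0 * X[:, 1], 0.0) + 0.1 * rng.standard_normal(500)
score, feat, thr = best_split(X, y)
print(f"best split: feature {feat} at {thr} (score={score:.3f})")
```

On this toy panel the search recovers the regime boundary: feature 0 at threshold 0.5, where the gap in child R² is largest.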

Project Structure

src/ptree/
├── __init__.py          # Package exports
├── data_handler.py      # DataHandler – alignment, missing-value fill, rank standardisation, volatility
├── predictors.py        # PredictorBase, RidgeRegressor, VolWeightedRidgeRegressor, RidgeLogitClassifier, SelfDefinedPredictor
├── criteria.py          # CriterionBase, R2DiffCriterion, ClassificationCriterion, evaluation helpers
├── node.py              # PanelTreeNode – per-node metadata container
├── engine.py            # PanelTreeEngine – recursive splitting, incremental matrix updates, feature-priority caching
└── visualization.py     # NodeReporter (text/DataFrame reports), MosaicVisualizer (heatmap)

Quick Start

import numpy as np
import pandas as pd
from ptree import DataHandler, RidgeRegressor, R2DiffCriterion, PanelTreeEngine
from ptree import NodeReporter, MosaicVisualizer

# 1. Prepare panel data (DataFrame with date, asset_id, and feature columns)
dh = DataHandler(cs_rank_standardize=True)
X, y, vol_weights = dh.fit_transform(
    df, y_series,
    time_col="date", entity_col="asset_id",
    ret_series_for_vol=ret_series,       # optional, for VolWeightedRidge
)

# 2. Build the tree
engine = PanelTreeEngine(
    predictor=RidgeRegressor(alpha=1.0),
    criterion=R2DiffCriterion(),
    split_thresholds=[0.3, 0.5, 0.7],
    max_depth=3,
    min_samples=100,
    fast_mode=False,
    verbose=1,
)
engine.fit(X, y, feature_names=dh.feature_names, weights=vol_weights)

# 3. Inspect results
reporter = NodeReporter(engine)
print(reporter.print_tree())           # text tree
print(reporter.leaf_summary())         # DataFrame

# 4. Prediction mosaic
viz = MosaicVisualizer(engine)
mosaic = viz.build_mosaic(X, y, time_col="date", metric="r2")
fig, ax = viz.plot_mosaic(mosaic)       # requires matplotlib & seaborn

# 5. Retrieve leaf-node samples
for leaf_id, indices in engine.get_leaf_samples().items():
    print(f"Leaf {leaf_id}: {len(indices)} observations")

Module Overview

DataHandler

Handles panel data preprocessing including alignment, missing value imputation, cross-sectional rank standardisation, and volatility computation.

| Parameter | Default | Description |
| --- | --- | --- |
| cs_rank_standardize | True | Cross-sectional rank normalisation to [0, 1] |
| vol_window | 60 | Rolling window for volatility computation |
| min_obs | 20 | Minimum observations for volatility calculation |
| fillna_method | "ffill" | Missing-value strategy (ffill, bfill, zero, mean, None) |
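For intuition, cross-sectional rank standardisation maps each feature into [0, 1] within every date. A minimal pandas sketch of the idea (not the DataHandler internals; the library may use a slightly different rank convention, e.g. (rank − 1)/(n − 1)):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-31"] * 3 + ["2024-02-29"] * 3,
    "asset_id": ["A", "B", "C"] * 2,
    "size": [10.0, 200.0, 35.0, 12.0, 190.0, 40.0],
})

# Percentile rank within each cross-section (date): smallest -> 1/n, largest -> 1.0
df["size_rank"] = df.groupby("date")["size"].rank(pct=True)
print(df)
```

Ranking within each date makes features comparable across time, which is why candidate thresholds such as 0.3/0.5/0.7 are meaningful for every period.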

Predictors

All predictors inherit from PredictorBase and implement fit() / predict().

| Class | Use Case |
| --- | --- |
| RidgeRegressor | Standard Ridge regression (closed-form) |
| VolWeightedRidgeRegressor | Inverse-volatility weighted Ridge (handles heteroscedasticity) |
| RidgeLogitClassifier | Ridge logistic regression via IRLS |
| SelfDefinedPredictor | User-defined model base class |

Custom Predictor Example:

from ptree import SelfDefinedPredictor

class MyLGBPredictor(SelfDefinedPredictor):
    def fit(self, X, y, weights=None):
        import lightgbm as lgb  # lazy import: lightgbm is only needed by this predictor
        self.model = lgb.LGBMRegressor().fit(X, y, sample_weight=weights)
        return self

    def predict(self, X):
        return self.model.predict(X)

Criteria

Split-quality criteria evaluate whether a candidate split produces child nodes with meaningfully different predictability.

| Class | Description |
| --- | --- |
| R2DiffCriterion | Maximise \|R²_L − R²_R\| (regression) |
| ClassificationCriterion | Maximise difference in Precision / F1 / AUC (classification) |
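At its core, a criterion of this kind reduces to a function of the two children's fit metrics. A minimal sketch of the R²-difference idea (the actual CriterionBase interface may differ):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_diff_score(y_left, pred_left, y_right, pred_right):
    """Split quality: absolute gap in predictability between the children."""
    return abs(r2(y_left, pred_left) - r2(y_right, pred_right))

y_l = np.array([1.0, 2.0, 3.0]); p_l = np.array([1.1, 1.9, 3.2])  # fits well
y_r = np.array([1.0, 2.0, 3.0]); p_r = np.array([2.0, 2.0, 2.0])  # fits the mean only
print(round(r2_diff_score(y_l, p_l, y_r, p_r), 3))  # → 0.97
```

A large score means one child is highly predictable while the other is not, which is exactly the regime contrast P-Tree is after.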

PanelTreeEngine

The main engine for building and querying Panel Trees.

| Parameter | Default | Description |
| --- | --- | --- |
| predictor | RidgeRegressor | Leaf-node predictor (instance or class) |
| criterion | R2DiffCriterion() | Split-quality criterion |
| split_thresholds | [0.3, 0.5, 0.7] | Candidate split points on (rank-standardised) feature values |
| max_depth | 3 | Maximum tree depth |
| min_samples | 100 | Minimum observations per node |
| fast_mode | False | Enable feature-priority caching from parent nodes |
| early_stopping_threshold | None | Stop searching if criterion exceeds this value (requires fast_mode) |
| n_jobs | 1 | Parallel workers (-1 = all cores) |
| verbose | 1 | Logging verbosity (0 = silent, 1 = per-level, 2 = per-candidate) |

Output & Query API Reference

P-Tree provides rich output and query interfaces across four main classes: PanelTreeEngine, PanelTreeNode, NodeReporter, and MosaicVisualizer.


PanelTreeEngine Methods

engine.predict(X) → np.ndarray

Generate per-sample predictions on new data. Each observation traverses down the tree to its corresponding leaf node, which provides the prediction using its local model.

preds = engine.predict(X_proc)  # shape: (n_samples,)

engine.get_leaves() → List[PanelTreeNode]

Return a list of all leaf node objects.

for leaf in engine.get_leaves():
    print(f"Leaf {leaf.node_id}: R²={leaf.metrics.get('r2', float('nan')):.4f}, n={leaf.n_samples}")

engine.get_all_nodes() → List[PanelTreeNode]

Return all nodes in the tree (BFS order), including both internal nodes and leaves.

all_nodes = engine.get_all_nodes()
print(f"Total nodes: {len(all_nodes)}")

engine.get_node_report() → pd.DataFrame

Return a structured DataFrame with one row per node containing the following columns:

| Column | Description |
| --- | --- |
| Node_ID | Unique node identifier |
| Depth | Node depth (root = 0) |
| Rule | Full path rule from root, e.g., root & char_1 >= 0.5 & char_3 < 0.7 |
| Is_Leaf | Whether the node is a leaf |
| N_Samples | Number of samples in the node |
| Sample_Ratio | Ratio of samples relative to total |
| Split_Feature | Feature used for splitting (NaN for leaves) |
| Split_Threshold | Split threshold value (NaN for leaves) |
| Split_Score | Criterion score at split |
| Predictability_Score | Predictability strength (R² for regression, Precision for classification) |
| Metrics | Full metrics dictionary, e.g., {"r2": 0.63, "mse": 0.22, "n_samples": 2429} |
| Model_Weights | Feature coefficients of the leaf model |
| Elapsed_Time_s | Time spent building the node (seconds) |
| Parent_ID | Parent node ID |

report = engine.get_node_report()
print(report[["Node_ID", "Depth", "Rule", "Predictability_Score", "N_Samples"]])

engine.get_leaf_samples() → Dict[int, np.ndarray]

Return a dictionary mapping leaf node_id to an array of original sample row indices. Useful for extracting the raw data corresponding to each cluster.

leaf_samples = engine.get_leaf_samples()
for leaf_id, indices in leaf_samples.items():
    subset = original_df.iloc[indices]
    print(f"Leaf {leaf_id}: {len(indices)} samples, "
          f"mean_return={subset['ret'].mean():.4f}, "
          f"unique_assets={subset['asset_id'].nunique()}")

PanelTreeNode Methods

Node objects can be obtained via engine.get_leaves() or engine.get_all_nodes().

node.n_samples → int

Number of samples contained in this node (read-only property).

node.metrics → Dict[str, float]

Evaluation metrics dictionary. For regression: r2, mse, n_samples. For classification: precision, f1, auc, n_samples.

leaf = engine.get_leaves()[0]
print(leaf.metrics)  # {"r2": 0.63, "mse": 0.22, "n_samples": 2429}

node.get_model_weights() → np.ndarray | None

Return the feature coefficient vector of the leaf node's local model. Useful for inspecting which factors are active in a specific regime.

for leaf in engine.get_leaves():
    coef = leaf.get_model_weights()
    if coef is not None:
        for name, w in zip(dh.feature_names, coef):
            print(f"  {name}: {w:+.4f}")

node.get_samples() → np.ndarray | None

Return sample row indices belonging to this node. Similar to engine.get_leaf_samples(), but can be used for any node (including internal nodes).

node = engine.get_all_nodes()[1]  # Second node
indices = node.get_samples()
print(f"Node {node.node_id} contains {len(indices)} samples")

node.to_dict() → Dict[str, Any]

Serialise all node metadata to a flat dictionary, convenient for building DataFrames or exporting to JSON.

import json
leaf = engine.get_leaves()[0]
print(json.dumps(leaf.to_dict(), indent=2, default=str))

Common Read-Only Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| node.node_id | int | Unique identifier |
| node.depth | int | Depth level |
| node.rule | str | Path description, e.g., root & char_1 < 0.5 & char_3 >= 0.7 |
| node.split_feature | str \| None | Split feature name |
| node.split_threshold | float \| None | Split threshold |
| node.split_score | float \| None | Criterion score at split |
| node.is_leaf | bool | Whether this is a leaf |
| node.sample_ratio | float | Sample coverage ratio |
| node.elapsed_time | float | Build time (seconds) |
| node.predictor | PredictorBase | Trained local model instance |

NodeReporter Methods

NodeReporter encapsulates user-facing reporting functionality. It requires a fitted PanelTreeEngine.

from ptree import NodeReporter
reporter = NodeReporter(engine)

reporter.summary() → pd.DataFrame

Return a complete node report DataFrame (all nodes, including internal nodes and leaves). Column definitions are the same as engine.get_node_report().

full = reporter.summary()
print(full[["Node_ID", "Depth", "Is_Leaf", "Split_Feature", "Predictability_Score"]])

reporter.leaf_summary() → pd.DataFrame

Return only the leaf nodes report. Structure is the same as summary(), suitable for quickly viewing final clustering results.

leaves = reporter.leaf_summary()
print(leaves[["Node_ID", "Rule", "Predictability_Score", "N_Samples", "Model_Weights"]])

Example Output:

 Node_ID                                              Rule  Predictability_Score  N_Samples
       3   root & char_1 < 0.5 & char_1 < 0.3 & char_3 < 0.7              0.0147       2438
       4  root & char_1 < 0.5 & char_1 < 0.3 & char_3 >= 0.7              0.0018       1102
      13  root & char_1 >= 0.5 & char_3 >= 0.3 & char_3 < 0.7              0.6323       2429

reporter.print_tree() → str

Return a formatted tree structure text string using indentation and ├─ / └─ to represent hierarchical relationships.

print(reporter.print_tree())

Example Output:

[Node 0] char_1 < 0.5 | r2=0.1234, n=12000 (Δ=0.4569)
├── [Node 1] char_1 < 0.3 | r2=0.0523, n=5940 (Δ=0.0140)
│   ├── [Leaf 3] r2=0.0147, mse=0.4769, n=2438
│   └── [Leaf 4] r2=0.0018, mse=0.8028, n=1102
└── [Leaf 5] r2=0.4640, mse=0.5483, n=6060

MosaicVisualizer Methods

MosaicVisualizer generates "prediction mosaics" — 2D heatmaps that visually display the model's predictive power across different time periods and asset clusters.

from ptree import MosaicVisualizer
viz = MosaicVisualizer(engine)

viz.build_mosaic(X, y, time_col, metric) → pd.DataFrame

Compute per-leaf, per-period metric values and return a DataFrame.

| Parameter | Description |
| --- | --- |
| X | Processed panel DataFrame (must include time_col and feature columns) |
| y | Target variable |
| time_col | Time column name, default "date" |
| metric | Evaluation metric: "r2" for regression, "precision" / "f1" / "auc" for classification |

Return Structure:

  • Row index: Leaf node IDs (Leaf_ID)
  • Columns: Time periods (determined by time_col)
  • Values: Metric value for that leaf in that period

mosaic = viz.build_mosaic(X_proc, y_proc, time_col="date", metric="r2")
print(mosaic.shape)       # (n_leaves, n_periods)
print(mosaic.iloc[:, :5]) # Preview first 5 periods

# Analyse which leaves perform best in which periods
best_leaf_per_period = mosaic.idxmax(axis=0)
print(best_leaf_per_period)

Example Output:

         0         1         2         3         4
Leaf_ID
3     0.016    -0.042     0.006    -0.089     0.036
13    0.621     0.782     0.599     0.687     0.605
14    0.502     0.465     0.350     0.462     0.289
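Since the mosaic is a plain leaves × periods DataFrame, further analysis is ordinary pandas. For example, on a toy mosaic shaped like the output above (illustrative numbers, hand-entered here to keep the snippet self-contained):

```python
import pandas as pd

mosaic = pd.DataFrame(
    [[0.016, -0.042, 0.006, -0.089, 0.036],
     [0.621,  0.782, 0.599,  0.687, 0.605],
     [0.502,  0.465, 0.350,  0.462, 0.289]],
    index=pd.Index([3, 13, 14], name="Leaf_ID"),
)

summary = pd.DataFrame({
    "mean_r2": mosaic.mean(axis=1),  # average predictability per leaf
    "std_r2": mosaic.std(axis=1),    # stability across periods
})
print(summary.sort_values("mean_r2", ascending=False))

# Which leaf dominates each period?
print(mosaic.idxmax(axis=0).tolist())  # → [13, 13, 13, 13, 13]
```

High mean with low standard deviation marks a persistently predictable regime; a high mean with high dispersion flags a leaf whose edge comes and goes.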

viz.plot_mosaic(mosaic, title, cmap, figsize, save_path) → (fig, ax)

Render the mosaic matrix as a seaborn heatmap. Requires matplotlib and seaborn.

| Parameter | Default | Description |
| --- | --- | --- |
| mosaic | (required) | DataFrame returned by build_mosaic() |
| title | "Prediction Mosaic" | Chart title |
| cmap | "RdYlGn" | Colour map (red=poor, green=good) |
| figsize | (14, 6) | Figure size |
| save_path | None | If specified, automatically save as PNG |

# Interactive viewing
fig, ax = viz.plot_mosaic(mosaic, title="P-Tree R² Mosaic")

# Save to file
fig, ax = viz.plot_mosaic(mosaic, save_path="output/mosaic.png", cmap="coolwarm")

Heatmap Interpretation:

  • X-axis: Time period $t$
  • Y-axis: Leaf nodes
  • Colour: Predictive accuracy for that leaf in that period (R² or Precision)
  • Instantly reveals when and where the model "fails" or "excels"

Verbose Logging

PanelTreeEngine outputs detailed splitting process logs via Python's logging module when verbose >= 1:

import logging
logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")

engine = PanelTreeEngine(..., verbose=1)
engine.fit(X, y, feature_names=...)

Example Log Output:

[INFO] [Level 0] Splitting Node 0...
  - Best Split: 'char_1' at threshold 0.5000
  - Metric Delta: score = 0.456896
  - Left: 5940 samples | Right: 6060 samples
[INFO] [Level 1] Splitting Node 1...
  - Best Split: 'char_3' at threshold 0.3000
  - Metric Delta: score = 0.179045
  - Left: 1808 samples | Right: 4252 samples
[INFO] Tree built: 15 nodes, 8 leaves, max_depth=3

Set verbose=2 to view per-candidate (feature, threshold) evaluation results.


Performance Optimisations

  1. Incremental matrix updates – For Ridge models, $X^TWX$ and $X^TWy$ are cached at each node. Child-node statistics are obtained by subtraction, avoiding redundant matrix multiplications.
  2. Feature-priority caching – When fast_mode=True, child nodes first evaluate the top-50% features from the parent, with optional early stopping.
  3. Multiprocessing – Node-level parallelism via n_jobs for high-dimensional feature sets.
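The incremental-update trick in (1) can be verified directly: the weighted Gram matrices are sums over rows, so a child's sufficient statistics are the parent's minus the sibling's. A standalone numpy check of the identity (not the engine's code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
y = rng.standard_normal(1000)
w = rng.uniform(0.5, 1.5, size=1000)  # e.g. inverse-volatility weights

left = X[:, 0] < 0.0                  # a candidate split
Xw = X * w[:, None]                   # row-scaled X, so X.T @ Xw == X^T W X

# Parent and left-child sufficient statistics for ridge: X^T W X and X^T W y
G_parent, b_parent = X.T @ Xw, Xw.T @ y
G_left, b_left = X[left].T @ Xw[left], Xw[left].T @ y[left]

# Right child by subtraction -- no pass over the right-child rows needed
G_right = G_parent - G_left
b_right = b_parent - b_left

# Matches a direct computation on the right-child rows
assert np.allclose(G_right, X[~left].T @ Xw[~left])
assert np.allclose(b_right, Xw[~left].T @ y[~left])
print("incremental update verified")
```

Because the ridge solution depends on the data only through these two statistics, each candidate split costs one O(n·p²) pass for the smaller child plus an O(p²) subtraction, instead of two full passes.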

Requirements

  • Python ≥ 3.10
  • numpy, pandas
  • matplotlib, seaborn (optional, for visualisation)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

# Clone the repository
git clone https://github.com/ElenYoung/AssetPanelTree.git
cd AssetPanelTree

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest test/ -v

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use P-Tree in your research, please consider citing:

@software{ptree2026,
  author = {ElenYoung},
  title = {P-Tree: Panel Tree for Supervised Clustering},
  year = {2026},
  url = {https://github.com/ElenYoung/AssetPanelTree}
}

About

Asset Panel Tree method for empirical asset pricing.
