Support externally-built pre-aggregation tables via /preaggs/register #2118

@shangyian

Description

The current /preaggs/plan endpoint assumes DJ owns the materialization (it generates SQL, runs it, and tracks availability). There's no path for systems that already have pre-aggregated tables built by external ETL pipelines to register those tables as pre-aggs in DJ and benefit from DJ's grain resolution.

This also surfaces a gap in DJ's metric model: a metric today can be either an atomic aggregation (SUM(view_secs)) or a derived expression (SUM(view_secs) / COUNT(sessions)), but DJ treats both identically and only decomposes the derived expression later, when it manages materialization. External pre-agg registration requires making this distinction explicit for the client, because only atomic aggregations can be mapped 1:1 to a column in an externally built table.

Proposal

  1. Add an is_measure flag on metric nodes. A metric is a measure if its expression is a single aggregation call with no cross-measure arithmetic (e.g., SUM(x), COUNT(x), AVG(x)). The flag is computed on the fly from the metric's expression rather than stored. It is the primitive that makes external registration safe: only is_measure=true metrics can be mapped directly to pre-agg columns. A derived metric is automatically satisfiable by an external pre-agg if all of its component measures are covered.
  2. Add an optional source_column field to PreAggMeasure, which captures the column name in the external table.
  3. Add a new POST /preaggs/register endpoint for adopting externally built tables:
POST /preaggs/register
{
    "metrics": ["${prefix}view_rate"],
    "dimensions": ["${prefix}page_d.page_id", "${prefix}country_d.country_id"],
    "table": {
      "catalog": "catalog",
      "schema": "schema",
      "table": "views_agg",
      "valid_through_ts": 1234567890
    },
    "measure_columns": {
      "events.view_secs_sum": "view_secs_sum",
      "events.session_cnt":   "session_cnt"
    }
}
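
To make the is_measure rule from item 1 concrete, here is a minimal sketch of how the flag could be computed from a metric's expression. It is an illustration only: the function name, the AGGREGATES set, and the parenthesis-matching approach are assumptions, not DJ's implementation, and a real version would presumably walk the parsed SQL AST instead. The sketch does not handle parentheses inside string literals or aggregates nested inside the outer call.

```python
import re

# Assumed set of aggregate functions; DJ's actual list may differ.
AGGREGATES = {"SUM", "COUNT", "AVG", "MIN", "MAX"}


def is_measure(expression: str) -> bool:
    """Return True if the expression is one aggregate call spanning the whole string."""
    expr = expression.strip()
    match = re.match(r"([A-Za-z_]\w*)\s*\(", expr)
    if not match or match.group(1).upper() not in AGGREGATES:
        return False
    depth = 0
    # Start at the opening parenthesis of the aggregate call.
    for index in range(match.end() - 1, len(expr)):
        char = expr[index]
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                # The call closes here; anything after it means extra arithmetic.
                return index == len(expr) - 1
    return False  # unbalanced parentheses
```

Under this sketch, SUM(view_secs) counts as a measure, while SUM(view_secs) / COUNT(sessions) does not, because the first call closes before the end of the expression.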

DJ validates:

  • Every key in measure_columns has is_measure=true
  • All component measures of any derived metric in metrics are covered by measure_columns
  • Declared columns exist in the table (via catalog schema query)
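
The three checks above could be sketched roughly as follows. Everything here is hypothetical scaffolding for illustration: MetricInfo, validate_registration, and the in-memory catalog and table_columns arguments stand in for DJ's node lookup and the catalog schema query, whose real interfaces are not described in this issue.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetricInfo:
    """Hypothetical view of a metric node for validation purposes."""
    is_measure: bool
    components: frozenset[str] = frozenset()  # measures a derived metric depends on


def validate_registration(
    metrics: list[str],
    measure_columns: dict[str, str],
    catalog: dict[str, MetricInfo],
    table_columns: set[str],
) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    # Check 1: every key in measure_columns must be a measure.
    for name in measure_columns:
        info = catalog.get(name)
        if info is None or not info.is_measure:
            errors.append(f"{name} is not a known measure")
    # Check 2: every requested metric must have all component measures covered.
    for name in metrics:
        info = catalog.get(name)
        if info is None:
            errors.append(f"{name} is not a known metric")
            continue
        required = {name} if info.is_measure else set(info.components)
        for missing in sorted(required - set(measure_columns)):
            errors.append(f"{name} requires uncovered measure {missing}")
    # Check 3: every declared column must exist in the external table.
    for name, column in measure_columns.items():
        if column not in table_columns:
            errors.append(f"column {column} for {name} not found in table")
    return errors
```

Collecting all errors instead of failing on the first one is a design choice of the sketch; it lets a client fix a bad request in a single round trip.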

On success:

  • Creates the PreAggregation record with grain + measures (with source_column set)
  • Auto-sets availability pointing at the provided table
