Wire Cypher IR -> DataFusion plan execution last mile (read-only) #332

@AdaWorldAPI

Description

Why

The Cypher pipeline in crates/lance-graph parses Cypher into an AST (parser::parse_cypher_query -> ast::CypherQuery), lowers it to a graph IR (logical_plan::LogicalOperator via LogicalPlanner), and translates that IR to a DataFusion LogicalPlan via datafusion_planner::DataFusionPlanner::plan(). The "last mile" -- registering tables on a SessionContext, executing the produced LogicalPlan, and streaming Arrow RecordBatch results back to callers -- is not yet wired end-to-end behind a single supported entry point. Today, callers get either an unparsed SQL string (via the dialect Unparser), graph-shaped JSON, or a partially executed path that does not cover the IR features the planner already supports.

This blocks q2's cockpit from running ad-hoc Cypher: the cockpit currently renders pre-computed graphs. Once the last mile is in place, cockpit-issued Cypher queries flow parse -> IR -> DataFusionPlanner -> SessionContext::execute_logical_plan -> Arrow stream and the cockpit renders live results.

What

Wire a single, supported execution entry point on CypherQuery (e.g. execute_with_context(&self, ctx: &SessionContext) -> Result<SendableRecordBatchStream>) that:

  1. Parses Cypher (already done by parse_cypher_query).
  2. Substitutes parameters (parameter_substitution).
  3. Runs semantic analysis (semantic).
  4. Lowers to graph IR via LogicalPlanner (logical_plan::LogicalOperator).
  5. Calls DataFusionPlanner::plan(&ir) -> DataFusion LogicalPlan.
  6. Registers required tables on the provided SessionContext using the existing GraphSourceCatalog / TableReader plumbing (table_readers, sql_catalog, lance_graph_catalog).
  7. Optionally registers UDFs (datafusion_planner::udf, plus cam_distance / cam_heel_distance if with_cam_codebook was used).
  8. Calls SessionContext::execute_logical_plan(plan) and returns the resulting SendableRecordBatchStream (and a convenience collect() variant returning Vec<RecordBatch>).
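
Taken together, the eight steps have roughly the following shape. This is Rust-flavored pseudocode, not a compiling implementation: substituted_ast, register_required_tables, and register_udfs are assumed helper names standing in for the existing plumbing, and error conversions are elided.

```
// Rust-flavored pseudocode -- helper names below are assumptions.
impl CypherQuery {
    pub async fn execute_with_context(
        &self,
        ctx: &SessionContext,
    ) -> Result<SendableRecordBatchStream> {
        let ast = self.substituted_ast()?;                           // steps 1-2
        semantic::analyze(&ast)?;                                    // step 3
        let ir = LogicalPlanner::new(&self.config).plan(&ast)?;      // step 4
        let plan = DataFusionPlanner::new(&self.config).plan(&ir)?;  // step 5
        register_required_tables(ctx, &ir).await?;                   // step 6
        register_udfs(ctx, self.cam_codebook.as_ref())?;             // step 7
        let df = ctx.execute_logical_plan(plan).await?;              // step 8
        Ok(df.execute_stream().await?)                               // stream back
    }
}
```

A collect_with_context variant would call DataFrame::collect() instead of execute_stream() and return Vec<RecordBatch>.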

Separately, expose this through the Python binding (crates/lance-graph-python) so the cockpit can drive it directly.

Architecture

Code locations (all under crates/lance-graph/src/):

  • Parser: parser.rs -- parse_cypher_query produces ast::CypherQuery.
  • AST: ast.rs -- CypherQuery, ReadingClause, BooleanExpression, RelationshipDirection, etc.
  • Semantic: semantic.rs.
  • Graph IR: logical_plan.rs -- LogicalOperator (Scan / Filter / Expand / Project / Aggregate / variable-length Expand).
  • IR -> DataFusion: datafusion_planner/mod.rs -- DataFusionPlanner::plan(&LogicalOperator) -> LogicalPlan (two-phase: analysis then builder). Subordinates: scan_ops.rs, join_ops.rs, expression.rs, predicate_pushdown.rs, vector_ops.rs, udf.rs, analysis.rs, cost_estimation.rs, config_helpers.rs.
  • Catalog / readers: sql_catalog.rs, table_readers.rs, lance_graph_catalog::GraphSourceCatalog.
  • High-level entry: query.rs -- CypherQuery already holds query_text, ast, config, parameters, cam_codebook and exposes with_* builders. The execute_with_context entry point referenced in inline comments is the seam to wire.
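
The two-phase shape of DataFusionPlanner (analysis pass, then builder pass) can be sketched in miniature. All names here are illustrative stand-ins, not the crate's actual types; the point is only that phase 1 collects requirements (e.g. which datasets a plan touches) that phase 2 consumes.

```rust
/// Illustrative sketch of an analyze-then-build planner; the real
/// DataFusionPlanner analysis lives in analysis.rs and is richer.
#[derive(Debug, PartialEq)]
struct Analysis {
    required_datasets: Vec<String>,
}

struct Planner;

impl Planner {
    // Phase 1: walk the IR (here, just scan names) and record what
    // execution will need before any plan nodes are built.
    fn analyze(&self, scans: &[&str]) -> Analysis {
        Analysis {
            required_datasets: scans.iter().map(|s| s.to_string()).collect(),
        }
    }

    // Phase 2: build the output plan from the IR plus the analysis.
    fn build(&self, analysis: &Analysis) -> String {
        format!("plan over [{}]", analysis.required_datasets.join(", "))
    }
}

fn main() {
    let planner = Planner;
    let analysis = planner.analyze(&["nodes", "edges"]);
    println!("{}", planner.build(&analysis)); // prints "plan over [nodes, edges]"
}
```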

What is missing on the IR -> LogicalPlan -> execution path:

  • A single CypherQuery::execute_with_context (and a collect/stream pair) that runs the whole pipeline on a caller-provided SessionContext instead of returning SQL, JSON, or a partially executed result.
  • Catalog -> SessionContext table registration glue: walk analysis::QueryAnalysis::required_datasets (or equivalent) and call ctx.register_table for each, using default_table_readers() / ParquetTableReader / DeltaTableReader from table_readers.rs and the GraphSourceCatalog resolver from sql_catalog.rs.
  • UDF registration helper that registers everything in datafusion_planner::udf plus optional CAM UDFs from cam_pq when cam_codebook is set.
  • Variable-length pattern (*1..N) end-to-end smoke: the planner already emits unrolled UNIONs (mod.rs Phase 2 doc), but execution has not been smoke-tested for the cockpit path.
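
The registration glue in the second bullet is a simple walk-and-register loop. The sketch below mocks the seams with stand-in types (TableReader returning a String instead of a TableProvider, Ctx instead of SessionContext) so the control flow is visible without the DataFusion dependency; none of these names are the crate's actual API.

```rust
use std::collections::HashMap;

/// Stand-in for table_readers.rs readers; the real trait would
/// produce an Arc<dyn TableProvider>, not a String.
trait TableReader {
    fn read(&self, path: &str) -> Result<String, String>;
}

struct ParquetReader;

impl TableReader for ParquetReader {
    fn read(&self, path: &str) -> Result<String, String> {
        Ok(format!("parquet:{path}"))
    }
}

/// Stand-in for SessionContext: just records name -> provider.
struct Ctx {
    tables: HashMap<String, String>,
}

impl Ctx {
    fn register_table(&mut self, name: &str, provider: String) {
        self.tables.insert(name.to_string(), provider);
    }
}

/// Walk the (name, path) pairs a query analysis requires and register
/// each on the context, mirroring the proposed helper's shape.
fn register_required(
    ctx: &mut Ctx,
    required: &[(&str, &str)],
    reader: &dyn TableReader,
) -> Result<(), String> {
    for (name, path) in required {
        let provider = reader.read(path)?;
        ctx.register_table(name, provider);
    }
    Ok(())
}

fn main() {
    let mut ctx = Ctx { tables: HashMap::new() };
    let required = [("nodes", "/data/nodes"), ("edges", "/data/edges")];
    register_required(&mut ctx, &required, &ParquetReader).unwrap();
    println!("{}", ctx.tables["nodes"]); // prints "parquet:/data/nodes"
}
```

In the real helper, Ctx becomes the caller's SessionContext, the reader comes from default_table_readers(), and the (name, path) pairs come from QueryAnalysis (or its equivalent) resolved through GraphSourceCatalog.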

Acceptance criteria

Scope: read-only Cypher only.

  • CypherQuery::execute_with_context(&self, ctx: &SessionContext) -> Result<SendableRecordBatchStream> exists, plus a collect_with_context returning Vec<RecordBatch>.
  • Helper that takes a GraphSourceCatalog + SessionContext and registers all tables required by QueryAnalysis.
  • Helper that registers all datafusion_planner::udf UDFs (and CAM UDFs when cam_codebook is set).
  • Python binding in crates/lance-graph-python exposes the new entry point so q2 cockpit can call it directly.
  • Integration tests under crates/lance-graph/tests/ execute against an in-memory / temp-dir Lance dataset (or MemTable) and assert returned RecordBatch schema + row counts:
    • MATCH (n)-[r]->(m) RETURN n, r, m -- single-hop pattern returning node + rel + node columns.
    • MATCH (n) WHERE n.label = 'foo' RETURN n -- filter pushdown via predicate_pushdown.rs.
    • MATCH (n)-[*1..3]-(m) RETURN m -- variable-length expansion via UNION unroll.
    • MATCH (n) RETURN COUNT(n) -- aggregation through DataFusion.
  • Errors from each stage (parse / semantic / lower / plan / execute) surface as GraphError with stage context.
  • Streaming path verified: caller can pull RecordBatch batches without materializing the full result.
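
For the stage-context criterion, one plausible shape is an error type that carries the failing pipeline phase. This is a self-contained sketch only; the actual GraphError in the crate may already exist with a different layout, and the Stage names are assumptions drawn from the stages listed above.

```rust
use std::fmt;

/// Pipeline stages that can fail; mirrors parse / semantic / lower /
/// plan / execute from the acceptance criteria (names assumed).
#[derive(Debug)]
pub enum Stage {
    Parse,
    Semantic,
    Lower,
    Plan,
    Execute,
}

/// Sketch of a stage-tagged error so callers see which phase failed.
#[derive(Debug)]
pub struct GraphError {
    pub stage: Stage,
    pub message: String,
}

impl GraphError {
    pub fn at(stage: Stage, message: impl Into<String>) -> Self {
        GraphError { stage, message: message.into() }
    }
}

impl fmt::Display for GraphError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{:?}] {}", self.stage, self.message)
    }
}

fn main() {
    let err = GraphError::at(Stage::Plan, "unsupported operator");
    println!("{err}"); // prints "[Plan] unsupported operator"
}
```

Each pipeline step would then map its native error into GraphError::at(stage, ...) before propagating, so a cockpit user can tell a parse failure from an execution failure.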

Out of scope

  • Write Cypher: CREATE, DELETE, SET, REMOVE, MERGE.
  • Cypher-specific aggregations beyond COUNT, SUM, AVG (no COLLECT, PERCENTILE_*, etc. in this issue).
  • New optimizer rules; we only consume what DataFusionPlanner already produces.
  • Lance-native executor path (ExecutionStrategy::LanceNative) and BlasGraph path (ExecutionStrategy::BlasGraph) -- this issue is ExecutionStrategy::DataFusion only.
  • Cockpit UI changes -- consumed via Python binding in a follow-up.

Dependencies

  • Independent. Can land in parallel with A1 and A3 -- this is the ontology / Cypher spine, not a downstream consumer.
  • Touches only crates/lance-graph and crates/lance-graph-python (binding shim). Does not require changes in lance-graph-planner or other sibling crates.
