Why
The Cypher pipeline in crates/lance-graph parses Cypher into an AST (parser::parse_cypher_query -> ast::CypherQuery), lowers it to a graph IR (logical_plan::LogicalOperator via LogicalPlanner), and translates that IR to a DataFusion LogicalPlan via datafusion_planner::DataFusionPlanner::plan(). The "last mile" -- registering tables on a SessionContext, executing the produced LogicalPlan, and streaming Arrow RecordBatch results back to callers -- is not yet wired end-to-end on a single supported entry point. Today, callers either get an unparsed SQL string (via the dialect Unparser), a graph-shaped JSON, or a partially executed path that does not cover the IR features the planner already supports.
This blocks q2's cockpit from running ad-hoc Cypher: the cockpit currently renders pre-computed graphs. Once the last mile is in place, cockpit-issued Cypher queries flow parse -> IR -> DataFusionPlanner -> SessionContext::execute_logical_plan -> Arrow stream and the cockpit renders live results.
What
Wire a single, supported execution entry point on CypherQuery (e.g. execute_with_context(&self, ctx: &SessionContext) -> Result<SendableRecordBatchStream>) that:
- Parses Cypher (already done by parse_cypher_query).
- Substitutes parameters (parameter_substitution).
- Runs semantic analysis (semantic).
- Lowers to graph IR via LogicalPlanner (logical_plan::LogicalOperator).
- Calls DataFusionPlanner::plan(&ir) -> DataFusion LogicalPlan.
- Registers required tables on the provided SessionContext using the existing GraphSourceCatalog / TableReader plumbing (table_readers, sql_catalog, lance_graph_catalog).
- Optionally registers UDFs (datafusion_planner::udf, plus cam_distance / cam_heel_distance if with_cam_codebook was used).
- Calls SessionContext::execute_logical_plan(plan) and returns the resulting SendableRecordBatchStream (and a convenience collect() variant returning Vec<RecordBatch>).
Out of band, expose this through the Python binding (crates/lance-graph-python) so the cockpit can drive it directly.
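The steps above could chain roughly as follows. This is a sketch of the proposed seam, not a final signature: the helper names ensure_tables_registered / register_udfs and the exact shapes of parameter_substitution::apply, semantic::analyze, and the planner constructors are assumptions about internal APIs; only the SessionContext / DataFrame calls are existing DataFusion surface.

```rust
// Sketch only. Assumes From<DataFusionError> for GraphError and the internal
// helper names noted above.
impl CypherQuery {
    pub async fn execute_with_context(
        &self,
        ctx: &SessionContext,
    ) -> Result<SendableRecordBatchStream, GraphError> {
        let ast = parameter_substitution::apply(&self.ast, &self.parameters)?;
        semantic::analyze(&ast, &self.config)?;
        let ir = LogicalPlanner::new(&self.config).plan(&ast)?;     // graph IR
        let plan = DataFusionPlanner::new(&self.config).plan(&ir)?; // DF LogicalPlan
        self.ensure_tables_registered(ctx).await?;                  // new catalog glue
        self.register_udfs(ctx)?;                                   // new UDF glue
        let df = ctx.execute_logical_plan(plan).await?;
        Ok(df.execute_stream().await?)
    }

    pub async fn collect_with_context(
        &self,
        ctx: &SessionContext,
    ) -> Result<Vec<RecordBatch>, GraphError> {
        let stream = self.execute_with_context(ctx).await?;
        // datafusion::physical_plan::common::collect drains a stream into batches.
        Ok(datafusion::physical_plan::common::collect(stream).await?)
    }
}
```

Keeping collect_with_context as a thin wrapper over the streaming path means there is exactly one execution pipeline to test.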
Architecture
Code locations (all under crates/lance-graph/src/):
- Parser: parser.rs -- parse_cypher_query produces ast::CypherQuery.
- AST: ast.rs -- CypherQuery, ReadingClause, BooleanExpression, RelationshipDirection, etc.
- Semantic: semantic.rs.
- Graph IR: logical_plan.rs -- LogicalOperator (Scan / Filter / Expand / Project / Aggregate / variable-length Expand).
- IR -> DataFusion: datafusion_planner/mod.rs -- DataFusionPlanner::plan(&LogicalOperator) -> LogicalPlan (two-phase: analysis then builder). Subordinates: scan_ops.rs, join_ops.rs, expression.rs, predicate_pushdown.rs, vector_ops.rs, udf.rs, analysis.rs, cost_estimation.rs, config_helpers.rs.
- Catalog / readers: sql_catalog.rs, table_readers.rs, lance_graph_catalog::GraphSourceCatalog.
- High-level entry: query.rs -- CypherQuery already holds query_text, ast, config, parameters, cam_codebook and exposes with_* builders. The execute_with_context entry point referenced in inline comments is the seam to wire.
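As a rough illustration of how the query.rs surface would be driven once the seam exists (the constructor and builder names here follow the with_* pattern described above but are assumptions, not confirmed signatures):

```rust
// Hypothetical caller-side usage; names assumed from the with_* builder pattern.
let query = CypherQuery::new("MATCH (n) WHERE n.label = $label RETURN n")?
    .with_parameter("label", "foo")
    .with_config(config);
// execute_with_context(&ctx) is the missing seam this issue wires up.
let mut stream = query.execute_with_context(&ctx).await?;
```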
What is missing on the IR -> LogicalPlan -> execution path:
- A single CypherQuery::execute_with_context (and a collect/stream pair) that runs the whole pipeline on a caller-provided SessionContext instead of returning SQL, JSON, or a partially executed result.
- Catalog -> SessionContext table registration glue: walk analysis::QueryAnalysis::required_datasets (or equivalent) and call ctx.register_table for each, using default_table_readers() / ParquetTableReader / DeltaTableReader from table_readers.rs and the GraphSourceCatalog resolver from sql_catalog.rs.
- A UDF registration helper that registers everything in datafusion_planner::udf plus optional CAM UDFs from cam_pq when cam_codebook is set.
- Variable-length pattern (*1..N) end-to-end smoke test: the planner already emits unrolled UNIONs (mod.rs Phase 2 doc), but execution has not been smoke-tested for the cockpit path.
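The registration glue could look like the sketch below. Method names on QueryAnalysis and GraphSourceCatalog (required_datasets, resolve, reader_for, open) are assumptions about this crate's internals; ctx.table_exist and ctx.register_table are existing SessionContext methods.

```rust
// Sketch of the missing catalog -> SessionContext glue; internal names assumed.
async fn register_for_query(
    ctx: &SessionContext,
    analysis: &QueryAnalysis,
    catalog: &dyn GraphSourceCatalog,
) -> Result<(), GraphError> {
    for dataset in analysis.required_datasets() {
        // Respect tables the caller registered ahead of time.
        if ctx.table_exist(dataset.table_name())? {
            continue;
        }
        let source = catalog.resolve(dataset)?;      // GraphSourceCatalog lookup
        let provider = default_table_readers()
            .reader_for(&source)?                    // Parquet / Delta / Lance reader
            .open(&source)
            .await?;
        ctx.register_table(dataset.table_name(), provider)?;
    }
    // UDF half of the glue: register everything datafusion_planner::udf exports
    // (plus CAM UDFs when a codebook is configured) via ctx.register_udf.
    Ok(())
}
```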
Acceptance criteria
Scope: read-only Cypher only.
- CypherQuery::execute_with_context(&self, ctx: &SessionContext) -> Result<SendableRecordBatchStream> exists, plus a collect_with_context returning Vec<RecordBatch>.
- Registration glue takes a GraphSourceCatalog + SessionContext and registers all tables required by QueryAnalysis.
- A helper registers the datafusion_planner::udf UDFs (and CAM UDFs when cam_codebook is set).
- crates/lance-graph-python exposes the new entry point so the q2 cockpit can call it directly.
- Integration tests in crates/lance-graph/tests/ execute against an in-memory / temp-dir Lance dataset (or MemTable) and assert returned RecordBatch schema + row counts:
  - MATCH (n)-[r]->(m) RETURN n, r, m -- two-hop returning node + rel + node columns.
  - MATCH (n) WHERE n.label = 'foo' RETURN n -- filter pushdown via predicate_pushdown.rs.
  - MATCH (n)-[*1..3]-(m) RETURN m -- variable-length expansion via UNION unroll.
  - MATCH (n) RETURN COUNT(n) -- aggregation through DataFusion.
- Errors surface as GraphError with stage context.
- The streaming path yields RecordBatch batches without materializing the full result.
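One of those integration tests could take roughly this shape, using a DataFusion MemTable so no on-disk dataset is needed. CypherQuery::new and collect_with_context are the proposed (not yet existing) API; the SessionContext / MemTable / Arrow calls are standard DataFusion and arrow-rs surface.

```rust
// Sketch of a MemTable-backed smoke test for the proposed entry point.
#[tokio::test]
async fn count_over_memtable() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = SessionContext::new();
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;
    // Pre-register the node table so the glue's "already registered" path is hit.
    ctx.register_table("nodes", Arc::new(MemTable::try_new(schema, vec![vec![batch]])?))?;

    let query = CypherQuery::new("MATCH (n) RETURN COUNT(n)")?;
    let batches = query.collect_with_context(&ctx).await?;
    assert_eq!(batches.len(), 1);
    assert_eq!(batches[0].num_rows(), 1); // single aggregate row
    Ok(())
}
```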
Out of scope
- Write Cypher: CREATE, DELETE, SET, REMOVE, MERGE.
- Cypher-specific aggregations beyond COUNT, SUM, AVG (no COLLECT, PERCENTILE_*, etc. in this issue).
- New optimizer rules; we only consume what DataFusionPlanner already produces.
- Lance-native executor path (ExecutionStrategy::LanceNative) and BlasGraph path (ExecutionStrategy::BlasGraph) -- this issue is ExecutionStrategy::DataFusion only.
- Cockpit UI changes -- consumed via Python binding in a follow-up.
Dependencies
- Independent. Can land in parallel with A1 and A3 -- this is the ontology / Cypher spine, not a downstream consumer.
- Touches only crates/lance-graph and crates/lance-graph-python (binding shim). Does not require changes in lance-graph-planner or other sibling crates.