Skip to content

Graph DSL handlers — parser correctness, result correctness, tenant-scope leak (4 sub-items) #52

@hollanf

Description

@hollanf

Four bugs in graph DSL handlers (pgwire/ddl/graph_ops.rs, match_ops.rs). Two are silent-wrong-result bugs (GRAPH PATH returns BFS frontier instead of an actual path; MATCH leaks tenant-scoped node IDs to the client). Two are parser correctness bugs (unbounded numeric params, substring-based keyword extraction broken on data containing those keywords).


1. GRAPH PATH FROM '<src>' TO '<dst>' returns BFS discovery set, not a path

File: nodedb/src/control/server/pgwire/ddl/graph_ops.rs:208-257

match dispatch_utils::cross_core_bfs(state, tenant_id, vec![src.clone()], label, dir, max_depth).await {
    Ok(resp) => {
        let json_text = ...decode_payload_to_json(&resp.payload);
        if let Ok(nodes) = sonic_rs::from_str::<Vec<String>>(&json_text)
            && nodes.contains(&dst)
        {
            // Path exists — return the discovered nodes as the path.
            let payload = sonic_rs::to_vec(&nodes)...
            return payload_to_query_response(&path_resp.payload);
        }
        // dst not found — empty result.
        ...
    }
}

The handler does a cross_core_bfs from src up to MAX_DEPTH, then if dst appears anywhere in the discovery set, returns the entire discovery set as the "path". The inline comment at line 237 says "Path exists — return the discovered nodes as the path" — there is no parent-tracking, no path reconstruction, no depth association. On a connected graph with MAX_DEPTH 10, callers expecting [src, hop1, hop2, dst] get every node reachable within 10 hops. Any client that depends on the returned shape being an actual path (e.g. for visualisation, RAG ranking, hop counting) gets misleading data with no error signal.

Repro:

GRAPH INSERT EDGE FROM 'a' TO 'b' TYPE 'l';
GRAPH INSERT EDGE FROM 'b' TO 'c' TYPE 'l';
GRAPH INSERT EDGE FROM 'a' TO 'x' TYPE 'l';
GRAPH INSERT EDGE FROM 'a' TO 'y' TYPE 'l';
GRAPH PATH FROM 'a' TO 'c' MAX_DEPTH 5 LABEL 'l';
-- Returns: ["a", "b", "c", "x", "y", ...]   (whole BFS frontier)
-- Expected: ["a", "b", "c"]                  (actual shortest path)

2. MATCH returns tenant-scoped node IDs ({tenant_id}:name) to the client without unscoped()

Files: nodedb/src/control/server/pgwire/ddl/match_ops.rs:52-90 + nodedb/src/data/executor/handlers/graph.rs:22-23, 75-76

Data Plane writes node names as scoped_src = scoped_node(tid, src_id) (handlers/graph.rs:22), prefixing every CSR node name with the numeric tenant id ("{tid}:{src_id}"). The MATCH response builder in match_ops.rs::match_payload_to_response passes the raw column strings straight to the client:

let val = row.get(col_name).and_then(|v| v.as_str()).unwrap_or("NULL");
encoder.encode_field(&val.to_string())?;

It does not call unscoped(). The pgwire client receives "5:alice" instead of "alice". Two consequences:

  1. Information leak — exposes the internal numeric tenant id to clients; combined with cross-tenant comparison this can be used to enumerate tenant numbering.
  2. Cross-core dedup correctness — the merge relies on the scoped form being consistent. If any code path returns the unscoped form (edge_store direct scans vs. CSR traversal — they take different paths), the row dedup treats them as different rows and double-counts.

Repro:

GRAPH INSERT EDGE FROM 'a' TO 'b' TYPE 'l';
MATCH (x)-[:l]->(y) RETURN x, y;
-- Returns "<tenant_id>:a" / "<tenant_id>:b" instead of "a" / "b"

3. Unbounded numeric DSL parameters across every graph handler — single-statement DoS

Files: nodedb/src/control/server/pgwire/ddl/graph_ops.rs:152, 219, 317, nodedb/src/control/server/pgwire/ddl/tree_ops.rs:200, 320, nodedb/src/control/server/pgwire/ddl/dsl/search_fusion.rs:60-62, nodedb/src/engine/graph/algo/params.rs:172-184

// graph_ops.rs:152, 219
let depth = extract_number_after(&upper, "DEPTH").unwrap_or(2);
let max_depth = extract_number_after(&upper, "MAX_DEPTH").unwrap_or(10);
// graph_ops.rs:317
max_iterations: extract_number_after(&upper, "ITERATIONS"),
// search_fusion.rs:60-62
let vector_top_k = extract_param(&upper, "VECTOR_TOP_K").unwrap_or(20);
let expansion_depth = extract_param(&upper, "DEPTH").unwrap_or(2);
// params.rs:172-174
pub fn iterations(&self, default: usize) -> usize {
    self.max_iterations.unwrap_or(default).max(1)         // floor only, no ceiling
}

extract_number_after / extract_param return Option<usize> with no upper-bound clamp. Values propagate unchanged into:

  • cross_core_bfs(..., max_depth) — every hop fans out via broadcast_to_all_cores
  • PageRank for iter in 1..=max_iter
  • Louvain / Label-Propagation iteration loops
  • RAG-FUSION expansion_depth

A single statement with ITERATIONS 1000000000 or DEPTH 4294967295 saturates the Control Plane handler. AlgoParams::iterations has a .max(1) floor but no .min(cap) ceiling. There is no per-tenant concurrency limit on graph DSL in the reviewed paths. This is the same DoS class as #41 sub-item 1 (ef_search) but for graph operators.

Repro:

GRAPH ALGO PAGERANK ON c ITERATIONS 1000000000 TOLERANCE 1e-30;
GRAPH TRAVERSE FROM 'n' DEPTH 4294967295 LABEL 'l' DIRECTION both;
SEARCH c USING FUSION(ARRAY[0.1] VECTOR_TOP_K 100000 DEPTH 50 TOP 10);

4. extract_*_after helpers use plain substring match — node IDs / labels containing keywords parsed wrong; PROPERTIES { ... } swallows the entire statement tail

File: nodedb/src/control/server/pgwire/ddl/graph_ops.rs:22-55, 424-491

fn extract_properties(upper: &str, original: &str) -> PgWireResult<String> {
    let Some(kw_pos) = upper.find("PROPERTIES") else { return Ok(String::new()); };
    let after = original[kw_pos + "PROPERTIES".len()..].trim_start();
    if after.starts_with('{') {
        let obj_str = after.trim_end_matches(';').trim_end();   // ← whole tail
        match nodedb_sql::parser::object_literal::parse_object_literal(obj_str) {
            ...
        }
    } else {
        Ok(extract_quoted_after(upper, original, "PROPERTIES").unwrap_or_default())
    }
}

Two problems:

  1. PROPERTIES { ... } extraction grabs whole tail. Object-literal parsing receives everything from { to end of statement minus a trailing ;. There is no brace-balancing or quote-aware scanning — PROPERTIES { k: '}' } or any value containing } or ; is mis-terminated.
  2. All extract_*_after helpers use plain upper.find("KEYWORD")DEPTH, TO, FROM, TYPE, LABEL, DIRECTION, MAX_DEPTH, DAMPING, TOLERANCE, ITERATIONS, RESOLUTION, SAMPLE, MODE, PROPERTIES. A node id, label, or property string containing the literal keyword (FROM 'TO', label 'DEPTH', type 'PROPERTIES') short-circuits extraction at the wrong byte position — handlers parse arguments out of user data instead of statement structure.

Affects insert_edge, delete_edge, traverse, neighbors, shortest_path, algo.

Repro:

GRAPH INSERT EDGE FROM 'TO' TO 'FROM' TYPE 'LABEL' PROPERTIES { "x": 1 };
GRAPH INSERT EDGE FROM 'a' TO 'b' TYPE 'l' PROPERTIES { note: '} DEPTH 999' };
GRAPH ALGO PAGERANK ON c ITERATIONS 5 DAMPING 0.85; SELECT 'ITERATIONS 99';
GRAPH TRAVERSE FROM 'node_with_DEPTH_in_name' DEPTH 2;

Checklist

  • 1. GRAPH PATH should track parents during BFS and reconstruct the actual path; return [src, …, dst] in order, not the BFS closure
  • 2. MATCH response builder must call unscoped() on every node-id column before encoding to client; audit every graph response path for the same
  • 3. Clamp DEPTH / MAX_DEPTH / ITERATIONS / EXPANSION_DEPTH / VECTOR_TOP_K at configured ceilings before dispatch; reject excess as 22023
  • 4. Replace all extract_*_after substring helpers with a tokenising parser that respects quoted strings and { … } brace-balanced regions; or migrate graph DSL to the typed AST module being introduced (nodedb_sql::ddl_ast)

All items independently reproducible.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions