Neo4j graph backend #177

Merged
m1rl0k merged 57 commits into test from neo4j-graph-backend
Jan 17, 2026

@m1rl0k (Collaborator) commented Jan 14, 2026

No description provided.

- Add Neo4j plugin at scripts/graph_backends/neo4j_plugin.py
- Implement GraphBackendProtocol with Neo4j driver
- Add CLI for backfill/index/clear operations
- Add docker-compose.neo4j.yml for Neo4j service
- Update ingest pipeline with graph backend hooks
- Add MCP integration stub for symbol_graph queries
This update introduces Neo4j backend support for symbol graph queries and edge deletion, including collection scoping and conditional backend selection. Functions and queries in symbol_graph.py, neo4j_graph.py, prune.py, and processor.py now handle Neo4j when enabled, and documentation is updated to reflect these changes.
Introduces an internal graph RAG enhancement layer with transparent fallback to basic behavior when the advanced backend is unavailable. Adds PageRank-based importance boost to hybrid search, multi-hop traversal for graph queries, and richer subgraph context injection in context answers. Updates configuration to support importance boost and refactors related modules for seamless integration.
Introduces an optional callee_path field to the GraphEdge class to store the resolved path of the callee symbol. Updates the to_dict method to include callee_path in the serialized output if present.
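
As a rough illustration of the optional field described above, the sketch below shows a dataclass whose `to_dict` only emits `callee_path` when it is set. Only `GraphEdge`, `callee_path`, and `to_dict` come from the commit message; the other field names (`caller`, `callee`, `edge_type`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class GraphEdge:
    # caller, callee and edge_type are hypothetical field names for this sketch.
    caller: str
    callee: str
    edge_type: str = "calls"
    # Resolved definition path of the callee, if resolution succeeded.
    callee_path: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        data: Dict[str, Any] = {
            "caller": self.caller,
            "callee": self.callee,
            "edge_type": self.edge_type,
        }
        # Serialize callee_path only when it is present.
        if self.callee_path is not None:
            data["callee_path"] = self.callee_path
        return data
```

Omitting the key when unresolved keeps existing payloads unchanged for consumers that do not know about the new field.
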
Introduces a symbol resolver abstraction for resolving symbols and imports to their definition file paths across files and repositories. Updates the graph ingestion pipeline and edge extraction functions to use the resolver, enabling more accurate call and import edge linking. Adds a new symbol_resolver.py module and extends the graph backend interface for symbol and import resolution.
Introduces tree-sitter-based detection of builtins and stdlib modules in ast_analyzer.py, and uses this for resolving callee and import paths in graph_edges.py. Updates edge payloads to include resolved callee_path for both call and import edges, and ensures pipeline.py and graph_backends/base.py support the new field and edge types. This improves accuracy of graph relationships by distinguishing builtins, stdlib, and external symbols across languages.
Corrects multi-hop via attribution in symbol graph traversal by tracking (via_symbol, task) pairs, ensuring accurate attribution when symbols are skipped. Updates context injection to skip pseudo-paths (e.g., <stdlib>, <external>, <builtin>) that do not correspond to retrievable code spans. Refactors callee/import path resolution to make language checks optional and only apply them when language is specified. Adds regression test for multi-hop attribution bug.
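
The pseudo-path filtering mentioned above might look roughly like the following. The helper names are hypothetical, and the angle-bracket convention is inferred from the examples in the commit message (`<stdlib>`, `<external>`, `<builtin>`), not from the actual code.

```python
def is_pseudo_path(path):
    """Return True for marker paths such as <stdlib> that have no code span."""
    # Assumption: every pseudo-path is wrapped in angle brackets.
    return bool(path) and path.startswith("<") and path.endswith(">")


def retrievable_paths(paths):
    """Keep only paths that could point at a real, retrievable file."""
    return [p for p in paths if p and not is_pseudo_path(p)]
```
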

@augmentcode bot commented Jan 14, 2026

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

m1rl0k added 21 commits January 14, 2026 19:00
Refactored symbol resolution to use only the graph backend abstraction, removing direct Qdrant client dependencies from the symbol resolver. Added resolve_symbol and resolve_import methods to QdrantGraphBackend for direct symbol and import path resolution. Updated ingest_adapter to match the new symbol resolver interface. Enhanced mcp_memory_server to support optional pattern and sparse vector configurations for Qdrant collections based on environment variables.
Introduces environment variables for configuring lexical sparse vectors and pattern vectors in the docker-compose.yml. This allows for more flexible setup of lossless term matching and structural code similarity features.
Introduces new environment variables in docker-compose.yml to configure lexical sparse vectors and pattern vectors for improved term matching and code similarity. These variables allow customization of modes, names, IDF usage, and vector dimensions.
Updated collection name generation in both standalone_upload_client.py and workspace_state.py to use a 16-character hash instead of 8 for improved collision avoidance, especially in remote upload scenarios.
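
To make the hash-length change concrete, here is a minimal sketch of deriving a collection name from a workspace path. The commit message does not say which hash function or naming scheme is used, so SHA-256, the `ws` prefix, and the function name are all assumptions.

```python
import hashlib


def collection_name(workspace_path: str, prefix: str = "ws", hash_len: int = 16) -> str:
    """Build a collection name from a truncated hex digest of the path."""
    # Assumption: SHA-256 over the UTF-8 path; the real code may differ.
    digest = hashlib.sha256(workspace_path.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:hash_len]}"
```

Going from 8 to 16 hex characters raises the truncated digest from 32 to 64 bits, which makes accidental collisions between workspaces far less likely.
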
Introduces info-level logs to report the number of operations, replica roots, and details of the first three operations for troubleshooting. Also adds warnings when applying an operation to a replica workspace fails.
Introduces the NEO4J_GRAPH environment variable to configure Neo4j as the graph backend. When set, edges are stored in Neo4j instead of the Qdrant _graph collection.
Dockerfile now copies Neo4j plugins and other extensions to the image. docker-compose.yml adds environment variables for configuring Neo4j graph backend, enabling symbol_graph queries and related features.
Added neo4j>=5.0.0 to requirements.txt to support Neo4j graph backend for symbol_graph queries.
Introduces a standalone Neo4j-based graph backend plugin for Context-Engine, including backend implementation, CLI utilities, schema definitions, and test suite. Adds docker-compose configuration for Neo4j, detailed documentation, and integration points for advanced graph traversals, analytics, and knowledge graph features.
Centralizes path normalization and edge ID generation in scripts.ingest.graph_edges for consistency across plugins and adapters. Updates Neo4j and Qdrant backends to use instance-level lazy initialization. Improves builtin and external symbol detection using tree-sitter-based utilities, removes hardcoded lists, and enhances logging for symbol resolution failures. Adds __all__ exports and backward compatibility aliases in graph_edges.py.
Adds language-specific file extension support to import resolution in both Neo4j and Qdrant backends, improving accuracy for multi-language repositories. Refactors symbol_resolver with a TTL and LRU-style cache for symbol and import lookups, configurable via environment variables. Ensures consistent plugin path handling and improves parameterization and safety of Cypher queries in the Neo4j knowledge graph. Also updates base interfaces and documentation for clarity.
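
A TTL plus LRU-style cache of the kind described above can be sketched with an `OrderedDict`. The class name and constructor parameters are hypothetical; the commit says the limits are configurable via environment variables but does not name them, so this sketch takes them as plain arguments.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Small cache with per-entry expiry and least-recently-used eviction."""

    def __init__(self, max_size=1024, ttl_seconds=300.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        # Evict least recently used entries once over capacity.
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)
```
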
Introduces TTL-based caching for graph collection existence and symbol suggestions in symbol_graph.py, including cache eviction and clearing functions for testing. Updates tests to use the new cache clearing method for Neo4jGraphBackend.
Centralizes the Neo4j enablement logic into a shared is_neo4j_enabled() utility and updates all usages to reference this function. Replaces print-based error handling with logger warnings in the ingest pipeline for better observability and consistency.
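
A shared enablement check along these lines could be as small as the sketch below. `is_neo4j_enabled` and `NEO4J_GRAPH` are named in this PR; the set of accepted truthy values and the optional `env` argument are assumptions added for illustration.

```python
import os

# Assumption: which strings count as "enabled" is not stated in the PR.
_TRUTHY = {"1", "true", "yes", "on"}


def is_neo4j_enabled(env=None):
    """Return True when NEO4J_GRAPH is set to a truthy value."""
    env = os.environ if env is None else env
    return str(env.get("NEO4J_GRAPH", "")).strip().lower() in _TRUTHY
```
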
Introduces a circuit breaker mechanism to prevent repeated connection attempts when Neo4j is unavailable, with configurable thresholds and reset timeouts. Also adds automatic cleanup of Neo4j driver instances on process exit using atexit and weak references, improving resource management and reliability.
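
The circuit breaker idea can be illustrated with a small state holder: after a number of consecutive failures it rejects calls until a reset timeout has elapsed, then lets a trial call through. This is a generic sketch, not the plugin's implementation; the class name, method names, and default values are assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling an unavailable service until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Open circuit: allow a trial call only after the reset timeout.
        return self._clock() - self._opened_at >= self._reset_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before opening a connection and report the outcome with `record_success()` or `record_failure()`.
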
Renamed and reorganized exported functions in graph_rag.py for clarity. Added additional session-related imports to search.py to support session management features.
Introduces clear_collection_stats_cache and clear_symbol_extent_cache functions in ranking.py to allow explicit cache clearing. Refactors symbol extraction logic in expand.py for clarity. Moves __all__ definition to the end of qdrant.py for improved readability.
Introduced a _sanitize_depth helper to validate and constrain user-provided depth parameters in Cypher queries, preventing injection and enforcing security limits. Updated all relevant methods to use sanitized depth values for graph traversals.
Introduces a batch shortest path query method to Neo4jKnowledgeGraph for efficient distance calculations between multiple symbol pairs. Updates graph_rag reranking to use the new batch method, improving performance for large result sets. Adds new indexes in Neo4j backend to support faster edge lookups, and ensures proper cleanup of the knowledge graph singleton on process exit. Also adds __all__ to qdrant_backend.py for explicit exports.
Updated LICENSE, README.md, and __init__.py in the neo4j_graph plugin to clarify licensing terms. Specifies free use for development and non-commercial projects, and commercial license requirements for production or commercial use. Also documents the need for a separate Neo4j database license.
Implements an auto-backfill mechanism that populates Neo4j edges from Qdrant if Neo4j is empty for a collection. The process is triggered after graph store initialization, runs once per collection per process, and can be disabled via the NEO4J_AUTO_BACKFILL_DISABLE environment variable.
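
The "once per collection per process" behaviour can be sketched with a module-level set guarded by a lock. `NEO4J_AUTO_BACKFILL_DISABLE` is named in the commit message; the function name, the two callables, and the accepted truthy values are hypothetical stand-ins for the real Neo4j and Qdrant calls.

```python
import os
import threading

_BACKFILL_ATTEMPTED = set()
_BACKFILL_LOCK = threading.Lock()


def maybe_auto_backfill(collection, count_neo4j_edges, backfill_from_qdrant, env=None):
    """Backfill a collection from Qdrant if Neo4j has no edges for it yet.

    Returns True only when a backfill was actually started.
    """
    env = os.environ if env is None else env
    flag = str(env.get("NEO4J_AUTO_BACKFILL_DISABLE", "")).strip().lower()
    if flag in {"1", "true", "yes", "on"}:
        return False
    with _BACKFILL_LOCK:
        # Each collection is considered at most once per process.
        if collection in _BACKFILL_ATTEMPTED:
            return False
        _BACKFILL_ATTEMPTED.add(collection)
    if count_neo4j_edges(collection) > 0:
        return False  # Neo4j already has data for this collection
    backfill_from_qdrant(collection)
    return True
```
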
m1rl0k added 14 commits January 15, 2026 11:03
Added tree_sitter parsers for Kotlin, Swift, Scala, and PHP to requirements.txt and tree_sitter.py. This expands language support for code ingestion and parsing.
Updated tree-sitter package versions for Kotlin and PHP in requirements.txt. Refined the import extraction logic for Kotlin to use the correct node type and parse qualified identifiers, and enhanced PHP import extraction to handle namespace_use_clause and qualified_name nodes. Adjusted test cases in verify_imports.py to reflect these changes.
Updated warmup_reranker to import reranker utilities from scripts.reranker instead of scripts.hybrid.rerank. Added a check for reranker availability and refactored dummy data and inference call to match new interface.
Updated language extraction tests to skip cases if the required tree-sitter parser is not available. This prevents test failures in environments where specific tree-sitter language parsers are missing. Also added a CI step to verify available tree-sitter packages.
Updated language extraction tests to skip assertions if enhanced symbol extraction is not available in the environment, improving CI reliability. Also removed the tree-sitter verification step from the CI workflow.
Copies the plugins directory into the Docker image to support Neo4j graph backend and other extensions.
Introduces threading locks for driver and schema initialization in Neo4jKnowledgeGraph to ensure thread safety. Adds transaction timeouts for long-running graph operations (PageRank, community detection) and makes fallback Cypher queries use explicit transactions with timeouts. Also updates graph_rag.py to use thread-safe lazy initialization for the knowledge graph instance.
Corrects Cypher queries in Neo4jGraphBackend to include callee_path and applies repo filters to both source and target nodes to prevent cross-repo contamination. Adds a regex escape utility for Cypher patterns. Updates QdrantGraphBackend to use nested metadata fields for path and repo, ensuring compatibility with Qdrant's payload structure.
Added TTL-based expiry for the missing graph collections cache in graph_edges.py to handle transient 404s more robustly. Also, in knowledge_graph.py, escaped regex metacharacters in node name queries to prevent injection and expensive patterns.
Replaces direct set checks and updates for _MISSING_GRAPH_COLLECTIONS with helper functions _is_collection_missing and _mark_collection_missing to support TTL expiry. Also removes unused 'language' parameter from _resolve_import_path and updates its usage.
Expanded documentation in SKILL.md and CLAUDE.example.md to describe the neo4j_graph_query tool, including its supported query types (transitive callers, impact, dependencies, cycles) and usage examples. Clarifies multi-hop traversal capabilities and when to use this tool for advanced codebase analysis.
@m1rl0k m1rl0k closed this Jan 15, 2026
@m1rl0k m1rl0k reopened this Jan 15, 2026
@m1rl0k (Collaborator, Author) commented Jan 15, 2026

augment review

@augmentcode bot commented Jan 15, 2026

🤖 Augment PR Summary

Summary: This PR adds an optional Neo4j-powered graph backend (via a plugin) and introduces a core graph-backend abstraction layer to support richer symbol-relationship queries.

Changes:

  • Added scripts/graph_backends/ with a GraphBackend interface, a default Qdrant implementation, and an ingest adapter that routes edge writes/queries to the active backend.
  • Introduced a new Neo4j plugin under plugins/neo4j_graph/ (backend + knowledge graph + CLI + tests) plus Dockerfiles/compose wiring to ship and run the plugin and Neo4j.
  • Extended graph functionality with “Graph RAG” helpers (subgraph context, impact analysis, shortest-path distance, similarity) and an MCP tool (neo4j_graph_query) for advanced traversals.
  • Improved language coverage by expanding tree-sitter support and enhancing import extraction to capture both modules and imported symbols (to improve symbol_graph importers lookups).
  • Updated ingestion/pipeline code to emit graph edges through the backend adapter when Neo4j is enabled, including cross-file symbol/import resolution and builtin detection.
  • Added optional hybrid-search scoring enhancements using PageRank-style importance when the enhanced graph backend is available.

Technical Notes: Neo4j is enabled via NEO4J_GRAPH=1; edges can be auto-backfilled from Qdrant, and additional traversal/query capabilities become available through MCP and internal Graph RAG utilities.

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

Introduces optional parallel processing for file indexing in the index_repo function using ThreadPoolExecutor. The number of workers is configurable via the INDEX_WORKERS environment variable, with sensible defaults and caps to avoid overwhelming resources. Errors are collected and reported, and progress logging is updated to reflect parallel execution.
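
The parallel indexing described above could be structured roughly as follows. `INDEX_WORKERS` and the use of `ThreadPoolExecutor` come from the commit message; the function names, the default of 4, and the cap of 16 are assumptions, and `index_one` stands in for whatever per-file work `index_repo` actually does.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def resolve_worker_count(env=None, default=4, cap=16):
    """Read INDEX_WORKERS, falling back to a default and clamping to a cap."""
    env = os.environ if env is None else env
    try:
        requested = int(env.get("INDEX_WORKERS", default))
    except (TypeError, ValueError):
        requested = default
    return max(1, min(requested, cap))


def index_files(paths, index_one, workers):
    """Index files concurrently, collecting errors instead of aborting."""
    errors = []
    indexed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(index_one, path): path for path in paths}
        for future in as_completed(futures):
            try:
                future.result()
                indexed += 1
            except Exception as exc:  # report per-file failures at the end
                errors.append((futures[future], exc))
    return indexed, errors
```

Collecting `(path, exception)` pairs lets one bad file be reported without cancelling the rest of the batch.
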
Introduces asynchronous driver management and query execution methods to Neo4jGraphBackend, including async driver initialization, async query execution, and async driver cleanup. Maintains backward compatibility by falling back to synchronous execution if async is unavailable.
Migrates all Neo4j graph query implementations in scripts/mcp_impl/neo4j_graph.py to use async execution via backend.run_query_async, replacing previous synchronous session.run usage. Adds query timeout support and improves error handling in plugins/neo4j_graph/backend.py. This change enables better performance and scalability for graph queries.
Added a threading.Lock to protect the circuit breaker state in Neo4jGraphBackend, ensuring thread safety for all circuit breaker operations. Also fixed the Cypher query in delete_edges_for_path to count relationships before deletion, returning the correct number of deleted edges. Changed get_callers to re-raise exceptions instead of returning an empty list, allowing callers to handle errors appropriately.
Introduces configurable confidence thresholds for intent classification, allowing per-intent overrides via environment variables. The default threshold is now settable, and the ML classifier uses these thresholds instead of a hard-coded value to improve intent selection accuracy and safety.
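
A per-intent threshold lookup with environment overrides might be sketched like this. The commit message does not name the variables, so `INTENT_THRESHOLD_DEFAULT` and `INTENT_THRESHOLD_<INTENT>` are purely hypothetical, as are the function names and the 0.5 fallback.

```python
import os


def get_threshold(intent, env=None, fallback=0.5):
    """Return the confidence threshold for an intent.

    Lookup order: per-intent override, then global default, then fallback.
    The variable names used here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    for key in (f"INTENT_THRESHOLD_{intent.upper()}", "INTENT_THRESHOLD_DEFAULT"):
        raw = env.get(key)
        if raw is None:
            continue
        try:
            # Clamp to a valid probability range.
            return min(1.0, max(0.0, float(raw)))
        except ValueError:
            continue  # ignore malformed values and keep looking
    return fallback


def accept_intent(intent, confidence, env=None):
    """Accept a classifier prediction only if it clears its threshold."""
    return confidence >= get_threshold(intent, env)
```
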
…antic expansion

Introduce new unit tests covering embedding cache logic, hybrid ranking score normalization, Qdrant upsert retry and pooling, Neo4j graph query depth clamping, rerank event logging and cleanup, and semantic expansion cache TTL expiry. These tests improve coverage and validate key behaviors such as caching, normalization, retry logic, resource pooling, defensive parameter clamping, thread safety, and cache expiration.
Adds a fast-path optimization to smart reindexing by skipping AST parsing when file hashes are unchanged in both pipeline and pseudo scripts. Refactors intent classification to use per-intent confidence thresholds, improving accuracy and flexibility. Updates Neo4j backend to use explicit transaction timeouts for both async and sync queries.
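
The hash-based fast path can be illustrated with a small helper that skips parsing when the stored hash matches the current content. The function names, the in-memory `known_hashes` dict, and the choice of SHA-256 are assumptions; the real pipeline presumably persists hashes elsewhere.

```python
import hashlib


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256; the PR does not say which hash is used.
    return hashlib.sha256(data).hexdigest()


def reindex_file(path, data, known_hashes, parse):
    """Parse a file only if its content changed since the last index.

    Returns True when parsing ran, False when the fast path skipped it.
    """
    digest = content_hash(data)
    if known_hashes.get(path) == digest:
        return False  # unchanged: skip the expensive AST parse
    parse(path, data)
    known_hashes[path] = digest
    return True
```
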
Changed the transaction handling in the async Neo4j session to explicitly call commit after running a read query. This ensures the transaction is properly closed, improving reliability and clarity in transaction management.
@m1rl0k m1rl0k merged commit 90def15 into test Jan 17, 2026
1 check passed
@m1rl0k m1rl0k deleted the neo4j-graph-backend branch January 23, 2026 13:20