-
-
Notifications
You must be signed in to change notification settings - Fork 0
Architecture
CodeGraph is a Rust workspace: 15 library crates under crates/* plus the codegraph
CLI under bin/. The workspace uses edition 2021 and a pinned Rust 1.95 toolchain, with
dependencies centralized in the root Cargo.toml.
A normal extract run flows through these stages:
detect -> extract -> graph build -> cluster + analyze -> output
\-> (optional) semantic pass
- detect - walk the directory, classify files, apply ignore rules, and skip sensitive files. See Extraction.
- extract - run tree-sitter (or a regex extractor) per file to produce nodes and edges. Results are cached per file so unchanged files skip re-parsing. See Languages.
- graph build - assemble nodes/edges into a graph, resolve symbols across files, and deduplicate.
- cluster + analyze - detect communities, compute betweenness and god nodes, find import cycles and surprising connections. See Analysis and Reports.
-
output - write
graph.jsonand the report, visualizations, and exports. See Output Formats.
The optional semantic pass (--semantic) sends documents and papers to an LLM to add
concept nodes and to break ties during dedup. See Semantic Analysis.
Querying and serving read graph.json back; they do not re-extract. See Querying
and MCP Server.
| Crate | Responsibility |
|---|---|
codegraph-core |
The shared data contract: NodeId, FileType, Confidence, Node, Edge, Hyperedge, the graph.json node-link DTO, id generation, and schema validation |
codegraph-detect |
File discovery, classification, ignore handling (.codegraphignore / .gitignore), manifest building, and sensitive-file detection |
codegraph-extract |
Tree-sitter (and regex) extractors that turn source files into core nodes/edges; languages gated behind lang-* features; per-file AST cache |
codegraph-graph |
Graph assembly: build, symbol resolution, dedup (MinHash/LSH), clustering and community detection, betweenness, analysis |
codegraph-semantic |
The LLM semantic pass: documents and papers to concept nodes, plus the optional dedup tiebreaker |
codegraph-llm |
Pluggable LLM client layer: provider registry with env auto-detect, response cache, JSON repair, token-budget chunking, adaptive retry |
codegraph-query |
Query (IDF-scored subgraph retrieval), shortest path, node explanation, and reverse impact |
codegraph-output |
Output writers: graph.json, HTML viewers, SVG, GraphML, Cypher, DOT, Mermaid call-flow, D3 tree, Obsidian, wiki, and live database push |
codegraph-report |
The GRAPH_REPORT.md generator |
codegraph-ingest |
External-source ingestion: URL (SSRF-guarded), MCP config, Cargo, Postgres, SCIP, office, media |
codegraph-server |
The MCP server (read-only graph and PR tools) over stdio and HTTP, plus a small REST surface |
codegraph-prs |
Graph-aware PR dashboard: classification, CI rollup, blast radius, conflict grouping |
codegraph-incremental |
Changed-files rebuild engine plus git integration (hooks, merge driver, watch, concurrency lock) |
codegraph-workspace |
Multi-repo / monorepo federation: member discovery, namespacing, cross-repo resolution, global store, merge-graphs
|
codegraph-skillgen |
Generates and installs the assistant skill file and the .claude/settings.json hooks |
bin/codegraph |
The CLI that wires the crates into commands |
The graph is a node-link structure serialized to graph.json (NetworkX-compatible shape).
-
Nodes have an
id, alabel, afile_type(one ofcode,document,paper,image,rationale,concept), and asource_file. Optional fields includesource_location,community, andrepo(for federated graphs). -
Edges (serialized under
links) have asource,target,relation(for examplecalls,imports,imports_from,inherits,implements,references,contains,depends_on,reads_from), and aconfidenceofEXTRACTED,INFERRED, orAMBIGUOUS. Cross-repo edges are flagged withcross_repo.
The exact JSON shape and every export format are documented in Output Formats.
Extraction parallelizes across files but collects results in a stable order, so graph.json
is byte-identical for the same input regardless of thread scheduling. Community numbers are
assigned deterministically (community 0 is the largest) and kept stable across incremental
rebuilds.
- Configuration knobs and files: Configuration
- Building, testing, and CI: Development
Getting started
Concepts
Using CodeGraph
- Commands
- Extraction
- Querying
- Cross-Language Edges
- Analysis and Reports
- Output Formats
- Visualizations
Integrations
Scaling
Reference