-
-
Notifications
You must be signed in to change notification settings - Fork 0
Extraction
codegraph extract turns a directory of source code into a knowledge graph. It
discovers files, parses each supported language into nodes and edges, resolves
cross-file references, clusters the result into communities, and writes a set of
artifacts under codegraph-out/.
codegraph extract .
codegraph extract path/to/project
codegraph extract . --directed --wiki
This page covers the discovery and parsing stages: which files are found, which are skipped, how ignore files are honored, how sensitive files are excluded, and how the on-disk AST cache works. For how parsed facts become the final graph and reports, see [Analysis-and-Reports] and [Output-Formats]. For the optional LLM-driven concept layer, see [Semantic-Analysis].
- Discover and classify every file under the root.
- Read and parse each code file in parallel, producing per-file nodes, edges, raw calls, and import records.
- Resolve relative and alias imports to real file nodes.
- Extract Markdown heading structure (always, independent of
--semantic). - Optionally run the LLM semantic pass over documents and papers (
--semantic). - Build the graph, resolve cross-file calls, deduplicate entities, cluster into communities, and analyze.
- Write artifacts into
codegraph-out/.
Discovery walks the root with the ignore crate's directory walker. The walk:
- Visits dotfiles and dotted directories (hidden files are not skipped by default; the noise and sensitive rules below handle exclusions).
- Honors
.gitignore, git exclude files, and.codegraphignore(see below). - Follows symlinks, but prunes any symlink whose real target resolves outside the scan root (an escape and cycle guard).
- Prunes known noise directories so it never descends into them.
- Classifies each remaining file by extension (papers also get a content sniff).
Each code file is read and extracted with its path taken relative to the
root, so node ids and source_file values are portable across machines and
checkouts. Results are collected in a stable, path-sorted order, so graph.json
is deterministic regardless of thread scheduling.
The following directory names are pruned wherever they appear (build output, caches, dependency trees, and CodeGraph's own output):
venv .venv env .env node_modules __pycache__ .git dist build
target out site-packages lib64 .pytest_cache .mypy_cache .ruff_cache
.tox .eggs codegraph-out coverage lcov-report visual-tests visual-test
__snapshots__ snapshots storybook-static dist-protected .next .nuxt
.turbo .angular .idea .cache .parcel-cache .svelte-kit .terraform
.serverless .codegraph .worktrees
Additional rules:
- Any directory name ending in
_venv,_env, or.egg-infois pruned. - A directory literally named
worktreesnested inside a dotted directory (for example.git/worktrees/) is pruned.
These lockfiles are never indexed:
package-lock.json yarn.lock pnpm-lock.yaml Cargo.lock poetry.lock
Gemfile.lock composer.lock go.sum go.work.sum
Both ignore files are honored, layered per-directory up to the VCS root:
-
.gitignorerules apply (along with git exclude files). Global gitignore is not consulted. -
.codegraphignoreuses the same gitignore syntax (globs,!negations). - On conflicting rules,
.codegraphignoretakes precedence over.gitignore(for example a!keep.pyre-include in.codegraphignorewins over akeep.pyexclude in.gitignore). - A subdirectory's
.gitignorestill applies even when a root.codegraphignoreexists. The two layer; one does not disable the other. - A negation cannot rescue a file beneath an already-excluded directory (this is standard gitignore semantics).
Example .codegraphignore:
# Skip generated code but keep one file
generated/
!generated/manifest.py
# Skip large data fixtures
fixtures/**/*.json
Files that likely contain secrets are skipped during discovery and reported separately (they are never read into the graph). Three layers decide:
-
Parent directory is one of
.ssh,.gnupg,.aws,.gcloud,secrets,.secrets,credentials. -
Filename patterns, including:
.env/.envrcfiles; key and certificate extensions (.pem,.key,.p12,.pfx,.cert,.crt,.der,.p8); SSH key names (id_rsa,id_dsa,id_ecdsa,id_ed25519, optionally.pub);.netrc,.pgpass,.htpasswd; andaws_credentials/gcloud_credentials/service.account. -
Load-bearing keywords in the filename:
credential,secret,passwd,password,private_key,token. A keyword counts only when it ends the file stem or appears in a short (two-word-or-fewer) name. Topic words liketokenizer.pyorpassword-policy-discussion.mdare not flagged.
Code files are parsed in parallel with rayon. Each file is read and extracted
independently, and per-file results are merged in the original path-sorted order
so the output stays deterministic. The Markdown structural pass runs in parallel
the same way.
Extraction uses an on-disk per-file cache so an unchanged file skips re-parsing on a rebuild.
-
Location:
codegraph-out/cache/ast/v{version}/<key>.json. Each entry is the serialized extraction result for one file. -
Key: a BLAKE3 hash of
(relative path, file content). The path is part of the key because node ids embed it, so two files with identical bytes at different paths get distinct entries. Any change to a file's bytes is a cache miss. -
Namespace / invalidation: the
v{version}segment is{crate version}-{build fingerprint}. The build fingerprint is computed at compile time from the extract crate's source and its enabledlang-*features, so the cache namespace rotates automatically whenever the extraction logic changes (not only on a version bump). This prevents a warm cache from serving stale results after an extractor fix. - Best-effort I/O: any read, write, or parse error on a cache entry falls back to a fresh extraction. A corrupt cache never blocks a build.
Other caches live under codegraph-out/cache/ as well: the semantic LLM cache
(cache/semantic) and the change-detection manifest (cache/manifest.json),
which records per-file hashes so a later build can report what was added,
changed, or removed since last time. See [Incremental-Updates].
Clear the cache with:
codegraph cache clear .
codegraph cache clear . --recursive
cache clear only ever removes the regenerable codegraph-out/cache subtree
(and, with --recursive, the same subtree under nested project roots).
codegraph-out/ is the output directory created in the scan root. It holds:
-
cache/- the AST cache, semantic cache, and change-detection manifest. - The generated graph and reports:
graph.json,graph.html,GRAPH_REPORT.md,graph.graphml,graph.cypher,graph.dot,callflow.html,tree.html,graph.svg, andgraph-3d.html. -
obsidian/when--obsidianis passed,wiki/when--wikiis passed.
codegraph-out is itself a pruned noise directory, so re-running extraction
never indexes a previous run's output.
-
--directed- build a directed graph (affects layout and analysis). -
--obsidian- also write an Obsidian vault undercodegraph-out/obsidian/. -
--wiki- also write a wiki page set undercodegraph-out/wiki/. -
--semantic- opt in to the LLM semantic pass over documents and papers. Requires a configured backend; skipped with a note if none is detected. See [Semantic-Analysis].
See [Commands] for the full command list and [Output-Formats] for details on each artifact. The supported languages and what each extracts are listed in [Languages].
Getting started
Concepts
Using CodeGraph
- Commands
- Extraction
- Querying
- Cross-Language Edges
- Analysis and Reports
- Output Formats
- Visualizations
Integrations
Scaling
Reference