Cross-service code graph engine. Builds a graph of every component, every cross-service call, every shared resource across one repo or many. Other tools (LLM assistants, impact analyzers, service catalogs) read from this instead of reimplementing.
Rust engine. CLI (glia) and Python wheel (repo-graph-py). MCP server repo-graph wraps the wheel.
Licensed under the Glia Software License v0.1: PolyForm Noncommercial 1.0.0 plus a worker-protection overlay. Free for individuals, students, researchers, nonprofits, OSS projects, orgs with fewer than 500 STEM workers, worker-owned co-ops, B Corps, and unionized workplaces. Commercial license required otherwise. Contact j.r.chahwan@gmail.com. Not OSI-approved by design. See License.
$ glia merge ./services/api ./services/worker ./services/web
# glia analyze
- nodes: 4,213
- edges (intra-repo): 5,108
- cross-edges: 312
| Category | Count |
| HTTP_CALLS | 38 |
| GRPC_CALLS | 17 |
| QUEUE_FLOWS | 4 |
| SHARES_CONFIG | 12 | # same env var read by 2+ services
| SHARES_DATA_ENTITY | 9 | # same Postgres table / Mongo collection
| SHARES_INFRA_REF | 6 | # same image referenced in 2+ k8s manifests
| SHARES_DEPENDENCY | 41 | # same package depended on by 2+ services
Each cross-edge is a real queryable relationship. api emits to a Kafka topic that worker subscribes to. Both api and web read JWT_SECRET from env. The cron job in infra/k8s/cleanup.yaml runs the image built by services/worker/Dockerfile.
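To make the SHARES_CONFIG category concrete, here is a minimal sketch of how "same env var read by 2+ services" can turn into pairwise edges. It is an illustration only, not glia's resolver: the function name, the input shape, and every key except `JWT_SECRET` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def shared_config_edges(env_reads):
    """Pair up services that read the same env var.

    env_reads: dict mapping a service name to the set of env var names it reads
    (a hypothetical input shape, not glia's internal representation).
    Returns a sorted list of (service_a, service_b, key) tuples.
    """
    readers = defaultdict(set)
    for service, keys in env_reads.items():
        for key in keys:
            readers[key].add(service)

    edges = []
    for key, services in readers.items():
        # A key read by a single service produces no pairs, so no edge.
        for a, b in combinations(sorted(services), 2):
            edges.append((a, b, key))
    return sorted(edges)


# Hypothetical reads; only JWT_SECRET comes from the example above.
reads = {
    "api": {"JWT_SECRET", "DATABASE_URL"},
    "web": {"JWT_SECRET"},
    "worker": {"DATABASE_URL", "KAFKA_BROKERS"},
}
edges = shared_config_edges(reads)
print(edges)
```

A key read by only one service yields no edge, which is why `KAFKA_BROKERS` does not appear in the result.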
Sourcegraph and ctags index single repos. Snyk and Endor scan dependency lists. Lens and k9s browse k8s manifests. None of them give you a graph of every service, every cross-service edge, every shared infra piece, in one place.
That's the layer glia ships. With it, downstream queries get cheap:
- LLM assistant: "what calls `/api/users`?" is an edge lookup, not a 12-repo grep.
- Impact analyzer: "if I change column `users.email`, what tests break?" walks the graph from the SQL `users` entity to handlers to tests.
- Service catalog: "which services share the `redis` cache?" filters on `infra:redis`.
Substrate ships. Other things layer on.
19 language parsers (tree-sitter): Python, Go, TypeScript, JavaScript, React, Vue, Angular, Rust, Java, Kotlin, C#, Ruby, PHP, Swift, C/C++, Scala, Clojure, Dart, Elixir, Solidity, Terraform.
~30 web framework extractors across those languages:
Flask, FastAPI, Django, Celery, Rails, Sinatra, Laravel, Symfony, Slim, Spring, Quarkus, Dropwizard, Javalin, Ktor, WebFlux, Micronaut, JAX-RS, ASP.NET (controllers + Minimal API), Express, Koa, Hono, Fastify, NestJS, Next.js (Pages + App Router), SvelteKit, Hapi.js, Bun.serve, Axum, Actix, Rocket, Tide, Poem, Salvo, Gin, Echo, Chi, Fiber, Gorilla Mux, stdlib net/http, Phoenix, React Router, Angular Router, Vue Router.
13 cross-graph resolvers that pair entities across repo boundaries: HTTP (frontend Endpoint ↔ backend Route), gRPC (client ↔ proto service), Queue (producer ↔ consumer, including raw Redis lists), GraphQL, WebSocket, EventBus, CLI invocation ↔ command, shared schema imports, shared data entities (SQL Tables / NoSQL Collections / Graph-DB Labels), Cron schedules, Config keys (env vars across services), IaC resources (Dockerfile-built images ↔ k8s manifest references), Package dependencies.
4 non-source file types flow through bypass extractors: YAML (.github/workflows/, k8s manifests, docker-compose), Dockerfiles, .env files, package manifests (package.json, pyproject.toml, requirements.txt, Cargo.toml, go.mod, Gemfile, composer.json).
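As an illustration of what pairing a frontend Endpoint with a backend Route can involve, the sketch below normalises path parameters on both sides and joins on method plus normalised path. This is an assumed, simplified approach and not glia's HttpStackResolver: the tuple shapes, the qualified names, and the set of parameter syntaxes handled are all hypothetical.

```python
import re

# Path segments that look like parameters in common frameworks:
# :id (Express), {id} (Spring/FastAPI), <int:id> (Flask), ${id} (JS template string).
_PARAM = re.compile(r"^(:\w+|\{[^}]+\}|<[^>]+>|\$\{[^}]+\})$")


def normalise(path):
    """Drop the query string and replace parameter segments with '*'."""
    segments = [s for s in path.split("?")[0].strip("/").split("/") if s]
    return "/" + "/".join("*" if _PARAM.match(s) else s for s in segments)


def match_http(endpoints, routes):
    """Return (caller, handler) pairs whose method and normalised path agree.

    endpoints: list of (caller, method, path) found in client code.
    routes:    list of (handler, method, path) found in server code.
    """
    index = {}
    for handler, method, path in routes:
        index.setdefault((method.upper(), normalise(path)), []).append(handler)

    edges = []
    for caller, method, path in endpoints:
        for handler in index.get((method.upper(), normalise(path)), []):
            edges.append((caller, handler))
    return edges


# Hypothetical entities for the example.
routes = [
    ("api::list_users", "GET", "/api/users"),
    ("api::get_user", "GET", "/api/users/:id"),
]
endpoints = [
    ("web::UserList", "get", "/api/users"),
    ("web::UserPage", "GET", "/api/users/${id}"),
    ("web::Health", "GET", "/healthz"),
]
http_edges = match_http(endpoints, routes)
print(http_edges)
```

A production resolver would also have to deal with base URLs, router prefixes, and ambiguous matches, none of which this sketch attempts.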
22 framework demos plus 3 multi-service demos (microservices-demo, voting-app, bank-of-anthos). 45 effective repo paths, ~128MB of cloned source.
Total: 13,371 nodes / 14,105 intra-edges / 2,789 cross-edges
Wall time: 3.1s (1.5s per-repo + 1.6s merged-resolver pass)
Cross-graph edges (resolvers fired):
PackageResolver 1,021 cross-language shared deps
DbResolver 691 shared tables / collections
ConfigResolver 370 env var sharing
IacResolver 280 image / service references
GrpcStackResolver 175 microservices-demo gRPC mesh
SharedSchemaResolver 140
HttpStackResolver 66 frontend → backend route matches
EventBusResolver 25
WebSocketResolver 16
GraphQLStackResolver 4
QueueStackResolver 1 voting-app vote → worker via Redis BLPOP
CronResolver 0 corpus-sparse, only 2 GHA workflows used schedules
CliInvocationResolver 0 corpus-sparse, needs CLI-heavy projects
22 of 23 framework demos pass the per-framework coverage check. The 1 soft-miss is react-cra (corpus is the build-tooling repo, not a component-heavy app, so HOOK count is 0; extractor wired correctly).
# CLI from source (Rust 1.95+)
git clone https://github.com/James-Chahwan/glia
cd glia
cargo build --release -p glia-cli
cp target/release/glia ~/.local/bin/
# Python wheel (works for scripts and the MCP server)
pip install repo-graph-py # ships pyo3 wheels for Linux / macOS / Windows
For LLM/MCP usage see repo-graph, which wraps the wheel as an MCP server with 13 navigation tools.
glia analyze <repo> [--format summary|mermaid|json]
Walk one repo. Default is a Markdown summary. `mermaid` renders a
`graph LR` of cross-stack edges. `json` is full nodes+edges.
glia impact <repo> <qname> [--direction forward|backward|both] [--depth N]
Reachability walk over the merged graph from one entity. Forward is
what this reaches, backward is what reaches this, depth caps the BFS.
glia merge <repo1> <repo2> [...] [--out <file>]
Build a single MergedGraph across N repos so cross-graph resolvers
fire across repo boundaries. `--out -` for stdout JSON, `--out <path>`
for file.
glia build <repo> [--out <dir>]
Walk repo and write per-language `.gmap` files (rkyv + mmap) to
`<repo>/.glia/` (or the given dir). For tools that read .gmap directly.
glia install-hooks <repo> [--uninstall] [--command "..."]
Install opt-in git hooks (post-commit, post-merge, post-checkout) that
re-run `glia build .` on every change. Refuses to clobber existing
non-glia hooks.
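The `glia impact` description above (forward, backward, or both, with a depth cap on the BFS) corresponds to a standard bounded breadth-first walk. The sketch below shows that technique on a plain edge list. It does not read glia's graph or `.gmap` files, and the node names are made up for the example rather than real qnames.

```python
from collections import deque


def impact(edges, start, direction="forward", depth=None):
    """Bounded BFS over (src, dst) edges.

    Returns a dict mapping each reached node to its hop distance from start.
    direction: "forward" follows src -> dst, "backward" follows dst -> src,
    "both" follows either. depth=None means no cap.
    """
    if direction not in ("forward", "backward", "both"):
        raise ValueError(f"unknown direction: {direction}")

    fwd, bwd = {}, {}
    for src, dst in edges:
        fwd.setdefault(src, set()).add(dst)
        bwd.setdefault(dst, set()).add(src)

    def neighbours(node):
        out = set()
        if direction in ("forward", "both"):
            out |= fwd.get(node, set())
        if direction in ("backward", "both"):
            out |= bwd.get(node, set())
        return out

    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth is not None and seen[node] >= depth:
            continue  # depth cap reached, do not expand further
        for nxt in neighbours(node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)

    del seen[start]
    return seen


# Hypothetical graph: a test calls a handler, the handler reads a table.
graph_edges = [
    ("test:test_get_user", "handler:get_user"),
    ("handler:get_user", "table:users"),
    ("handler:list_groups", "table:groups"),
]
print(impact(graph_edges, "table:users", direction="backward"))
```

Walking backward from the table answers "what reaches this", which is the shape of the "what tests break?" query: the handler is one hop away and the test two.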
source files
→ per-language parser (tree-sitter → ExtractedItems)
→ cross-cutting extractors (HTTP routes, gRPC, queues, data sources,
CLI commands, env var reads, package deps, cron schedules, IaC
resources, config files, ...)
→ graph builder (resolves intra-repo references)
→ cross-graph resolvers (HttpStack, GraphQL, gRPC, Queue, WebSocket,
EventBus, SharedSchema, DB, CLI, Cron, Config, IaC, Package)
→ MergedGraph
→ .gmap binary (rkyv + mmap, sharded), dense text projection,
JSON, or pyo3 → Python
Workspace crates:
- `core/`: `Node`, `Edge`, `QName`, `RepoId`, shared primitives.
- `code-domain/`: code-specific registries (40 NodeKind IDs, 31 EdgeCategory IDs).
- `parsers/code/<lang>/`: one crate per language. `parsers/code/extractors/` for cross-cutting (gRPC, queues, WebSocket, EventBus, GraphQL, CLI, data-sources, data-entities, cron, config, IaC, packages, ts-routes, React, Angular, Vue).
- `graph/`: per-repo builder, MergedGraph, all 13 cross-graph resolvers, PPR activation.
- `engine/`: orchestration glue. Used by `py/` and `cli/`.
- `store/`: `.gmap` container (rkyv + mmap, atomic write).
- `projection-text/`: dense sigil projection for LLM context.
- `activation/`: Personalised PageRank, domain-agnostic.
| Tool | What it does well | What glia adds |
|---|---|---|
| Sourcegraph / ctags | intra-repo symbol search at scale | cross-service edges (HTTP/gRPC/queue/shared-DB), declarative resolver layer |
| CodeQL / Semgrep | deep semantic per-file analysis, custom rules | wider substrate (more languages, more frameworks, less depth per query), works out of the box |
| Apiiro / Endor / Snyk | dependency-graph + vuln matching | cross-language reachability via call edges, not just manifest lists; IaC, config, and queue resolvers in one tool |
| Codebase-Memory MCP | LLM-targeted graph of one codebase | multi-repo merge + cross-service resolvers, pure-Rust core |
| SocratiCode | LLM-driven code Q&A | structural index, not LLM-derived; deterministic, repeatable |
| Backstage / service catalog | curated org-level service registry | derived from source + manifests automatically, no curation step |
What glia does NOT do today:
- No intra-procedural data-flow / taint analysis (CodeQL territory).
- No vulnerability matching against CVE feeds (Snyk territory).
- No source-level fix suggestions (LLM-tier work; we emit substrate).
- No Kustomize template merging, no Helm rendering. IaC resolver reads raw manifests only.
End-to-end test on a 566-node / 620-edge Go + Angular monorepo via the repo-graph MCP wrapper. Same bug, same model (Claude Opus for every request, with no Haiku routing), same prompt: "Groups that were created recently are showing as closed, and old groups show as open. This is backwards. New groups should be open for members to join. Find and fix the bug." Fresh /clear for both runs.
| | Without graph (grep + read loop) | With glia substrate |
|---|---|---|
| Tokens used | 75,308 | 29,838 |
| Time to fix | 4m 36s | ~30s |
| Files explored | ~15 (grep, read, grep, read...) | 2 (flow lookup + handler) |
| Outcome | Found and fixed | Found and fixed |
2.5x fewer tokens, 9x faster, same correct fix. Without the graph Claude greps for keywords, reads candidates, greps again, narrows down. With the graph Claude calls flow("groups"), gets the handler function and file, reads it, fixes it.
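The headline ratios follow directly from the table. A short check, taking the reported "~30s" as exactly 30 seconds (an assumption, since the source figure is approximate):

```python
tokens_without, tokens_with = 75_308, 29_838
seconds_without = 4 * 60 + 36  # 4m 36s
seconds_with = 30              # reported as "~30s"; treated as exact here

token_ratio = tokens_without / tokens_with
time_ratio = seconds_without / seconds_with

print(f"{token_ratio:.2f}x fewer tokens, {time_ratio:.1f}x faster")
```

The ratios come out at roughly 2.5 and 9.2, consistent with the rounded figures quoted above.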
| Metric | Value |
|---|---|
| 99-repo sweep, median repo (5,746 nodes, 4,979 edges) | 1.4s parse+resolve |
| 99-repo sweep, p90 (60,500 nodes, 65,667 edges) | 10.4s |
| 99-repo sweep, max (elasticsearch: 342,804 nodes / 336,081 edges) | 73.1s for 1.3GB of source |
| Aggregate across 99 repos | 2,083,755 nodes / 2,243,664 edges |
| 45-repo cross-service eval (this release) | 13,371 nodes / 14,105 edges / 2,789 cross-edges in 3.1s |
| Substrate failures across 99 repos | 0 generate failures, 0 timeouts |
A single laptop CPU walks the median real-world repo in under 2 seconds. The full microservices-demo + voting-app + bank-of-anthos + 22 framework demos cross-merge in 3.1 seconds with all 13 cross-graph resolvers running.
glia v0.4.13 ran an arm that injected graph-derived pooled vectors into a transformer's input embedding stream. The hypothesis: graph context as latent vectors (instead of verbose prefix text) could close composition gaps on SWE-bench-Lite.
What landed: marshmallow-1359 SOLVE on a 7B Q4 model (Qwen 2.5 Coder). The gold-aligned auto-driver pipeline reproduces the recipe deterministically. Single-instance proof-of-concept, not a generalizable benchmark result. A follow-up N=50 bench surfaced that ~80% of apply-then-test failures were infra (pytest collection, import errors, wheel mismatches), not model output quality. Clean cross-instance results need apply/test-runner hardening.
Why parked: the conceptual win (graph context substituting for prose context at the embedding layer) is demonstrated on one instance. Generalising it needs per-instance plumbing, but the deeper reason is that latent injection isn't necessarily the right shape. The substrate ships independently. The open research question is bigger than "make the latent arm work":
Given a graph + a problem + a query, what's the correct distillation over composition / sage-filtering / synthesised cells / pooled vectors that lets a 7B model do what a 70B model can do? There's a shape out there connecting static reasoning, query-specific context selection, and capability lifting. It hasn't fully connected yet.
The substrate is the precondition for trying any of those shapes cleanly. v0.4.x ships substrate; the reasoning layer above it is under design. v0.5+ will probably look very different from v0.4.13's latent-injection arm. The right answer isn't "more vectors", it's "smarter selection of what to feed where".
Engineering wins from this arm that did ship to v0.4.x core:
- Graph substrate hardened to feed cross-language reachability into ranked composition cells.
- Bench inference moved from candle to llama.cpp (`scratch/latent/out/run_llama_pathB.py`). ~7x faster CPU decode plus GBNF-grammar-constrained decoding kills the format-prior failure class plaguing the candle path.
The latent arm itself lives in scratch/latent/, excluded from the default workspace build so default cargo invocations skip the candle download:
cargo build # core glia, no candle
cargo build -p repo-graph-latent # opt in to the parked arm
Embed-injection port to llama.cpp's llama_batch.embd API is feasible (API verified) but research follow-up, not a v0.4.x deliverable.
v0.4.x (this release): substrate + CLI + pyo3 wheel + GHA wheel matrix. v0.4.13a/b/c/d shipped the SWE-bench latent-injection arm (marshmallow-1359 SOLVE).
v0.4.14 (perf + cleanup):
- Per-graph-area incremental rebuild. Re-walk and re-resolve only the regions that changed. `glia build` rewalks everything every invocation today; on a 100k-LOC monorepo the post-commit hook is the bottleneck. Want per-file content hashing, dirty-set propagation, partial `.gmap` patching instead of full rewrite.
- Iterator parallelisation. Per-language parser pipeline, cross-cutting extractor pass, and per-resolver index builds are embarrassingly parallel today and run sequentially. Rayon over walk + parse + extract; sharded resolver index construction.
- Cleanup. Single `--features research` toggle (replaces `driver`); "non-tree-sitter source dispatcher" trait consolidating the 5 bypass branches in `engine/src/lib.rs`; promote `looks_like_url_path` and the framework-presence-signals helper into a shared extractor-utils module (currently duplicated across queues/ts_routes/react); pull `engine`'s repeated language-dispatch arms into a small registry table.
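The incremental-rebuild item names two techniques, per-file content hashing and dirty-set propagation. The sketch below shows one way they can fit together. It describes a roadmap idea, not shipped behaviour: the function names, the file contents, and the shape of the `dependents` map are assumptions.

```python
import hashlib


def content_hashes(files):
    """files: dict mapping path -> bytes. Returns path -> SHA-256 hex digest."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def dirty_set(old_hashes, new_hashes, dependents):
    """Paths that need a re-walk after a change.

    A path is directly dirty if it was added, removed, or its hash changed.
    Dirtiness then propagates to everything that transitively depends on it.
    dependents: dict mapping path -> set of paths that reference it
    (hypothetical; a real engine would derive this from graph edges).
    """
    changed = {
        p
        for p in old_hashes.keys() | new_hashes.keys()
        if old_hashes.get(p) != new_hashes.get(p)
    }
    dirty, stack = set(changed), list(changed)
    while stack:
        path = stack.pop()
        for dep in dependents.get(path, ()):
            if dep not in dirty:
                dirty.add(dep)
                stack.append(dep)
    return dirty


# Hypothetical three-file repo where only models.py changes.
before = {
    "models.py": b"class User: ...",
    "handlers.py": b"from models import User",
    "tests/test_handlers.py": b"import handlers",
    "README.md": b"docs",
}
after = dict(before)
after["models.py"] = b"class User: email = None"

dependents = {
    "models.py": {"handlers.py"},
    "handlers.py": {"tests/test_handlers.py"},
}
to_rebuild = dirty_set(content_hashes(before), content_hashes(after), dependents)
print(sorted(to_rebuild))
```

Everything outside the dirty set (here `README.md`) could keep its previously built graph region untouched.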
v0.5.0: domain registries for non-code (video, chemistry, policy, climate). Code becomes one of N domains. The activation crate is already domain-agnostic; the parser+extractor layer is what abstracts.
v0.5+: Cross-language taint, contract drift, type propagation; node dedupe across repos; manifest format for glia merge; org-internal-package routing (sibling-repo imports); query-specific distillation over composition / sage / synth-cells / vectors (the reasoning-layer search direction noted in Experimental notes).
See LICENSE. Glia Software License v0.1, an overlay on PolyForm Noncommercial 1.0.0 with Additional Permissions for worker-protective commercial use.
| If you are... | Cost |
|---|---|
| Individual, student, academic, researcher, hobbyist, OSS contributor | Free |
| Nonprofit | Free |
| For-profit org with fewer than 500 STEM workers | Free |
| Worker-owned org (workers hold ≥50% equity) | Free at any size |
| Certified B Corporation in good standing | Free at any size |
| Org where ≥50% of STEM workers are covered by a recognized union under an active CBA | Free at any size |
| Any other for-profit org | Commercial license required |
Commercial license inquiries: j.r.chahwan@gmail.com or open an issue on the repo. Author retains discretion to grant free Commercial Licenses case-by-case. When in doubt, ask. Past compliant use is never retroactively revoked (LICENSE §5.3).
OSI's Open Source Definition was authored in 1998 to make free software palatable to enterprises. Two clauses (§5 No Discrimination Against Persons or Groups, §6 No Discrimination Against Fields of Endeavor) exist for that reason. They forbid any license condition based on who you are or what you do. Including conditions like "treat workers fairly".
That choice has wins (the ecosystem we have) and costs (no license can encode worker, environmental, or human-rights conditions). Every ethical-source license (Hippocratic, ACSL, CSL, PolyForm Noncommercial, this one) is non-OSI for that reason.
A 1998 corporate-adoption strategy is not a 2026 verdict on what good licensing looks like. We're picking the modern take.
The qualifying conditions are baseline 21st-century governance hygiene:
- <500 STEM workers. Almost every startup, every small consultancy, every research lab. The threshold sits well above the size where you can claim resource constraints prevent intentional governance.
- B-Corp certification. ~9,000 companies and growing, including Patagonia and Kickstarter. ~6 months of work, manageable annual fees.
- Recognized union. Mostly labor-law compliance with a side of dignity. The bar (≥50% STEM coverage under an active CBA) is real but well under what unionized European tech companies have.
- Worker-owned. Every cooperative, every founder-led startup before dilution, Mondragon. Bar is collective worker stake ≥50%.
An org failing all four:
- Is large enough to have resources for governance
- Has chosen not to certify worker-protective governance
- Has actively suppressed (or simply opposed) collective representation
- Has opted for an extractive, no-equity employment model
That's a specific shape of company. It's the failure mode where scale is achieved by externalizing cost onto workers. The license declines to subsidize that mode.
Practical effects: GitHub marks the repo as "Other / non-standard". PyPI won't show the "OSI Approved" classifier. Some corporate legal teams auto-block. Fine. The orgs running those auto-blocks are the ones the license is asking to either qualify or pay.
Graph schema and traversal lineage:
- Joern. Code Property Graph schema is reference inspiration for glia's node + edge taxonomy. Joern's pass-composition model (parser → CFG → type-recovery → dataflow → OSS) shaped how glia layers per-language parsers, cross-cutting extractors, and cross-graph resolvers as independent passes that can be ablated.
- Personalized PageRank (Jeh & Widom, 2002). The activation algorithm underneath `activation/`. Domain-agnostic; glia's `ActivationConfig` exposes direction, edge weights, and node specificity as the three dials the code domain sets.
- HippoRAG (Jiménez Gutiérrez et al., 2024). Prior art for PPR-driven retrieval over an open knowledge graph, hippocampal-indexing-inspired. glia's activation pass borrows the seed nodes → PPR → top-K reachable shape. Difference: glia's graph is a structural code substrate, not entity-and-relation triples extracted from prose, and the consumer is downstream tooling (CLIs, MCP), not RAG context-stuffing.
- Spreading activation (Quillian 1967, Anderson 1983, Collins & Loftus 1975). Cognitive-science antecedent to all PPR-style retrieval. glia's PPR implementation is a modern, mathematically-grounded version of the same intuition: relevance propagates from seeds along weighted edges with decay.
- GraphRAG (Microsoft, 2024). Parallel work on graph-structured retrieval. Informs the broader space of "use a graph instead of/alongside vector search" approaches.
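For readers unfamiliar with the seed nodes → PPR → top-K shape referenced above, here is a minimal power-iteration sketch of Personalised PageRank over a weighted edge list. It is a textbook-style illustration, not the code in `activation/`: it models edge weights only (not direction or node specificity), and the restart probability, iteration count, and node names are assumptions.

```python
def personalised_pagerank(edges, seeds, alpha=0.15, iterations=50):
    """Power-iteration PPR.

    edges: list of (src, dst, weight) with positive weights.
    seeds: non-empty set of nodes the walk restarts from.
    alpha: restart probability (an assumed default for this sketch).
    Returns a dict mapping node -> score; scores sum to 1.
    """
    nodes = set(seeds)
    out = {}
    for src, dst, weight in edges:
        nodes.update((src, dst))
        out.setdefault(src, []).append((dst, weight))

    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            targets = out.get(n)
            if not targets:
                # Dangling node: hand its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[n] / len(seeds)
                continue
            total = sum(w for _, w in targets)
            for dst, w in targets:
                nxt[dst] += (1 - alpha) * rank[n] * w / total
        rank = nxt
    return rank


def top_k(rank, k, exclude=()):
    """Highest-scoring k nodes, ties broken by name, skipping excluded nodes."""
    ordered = sorted(rank.items(), key=lambda kv: (-kv[1], kv[0]))
    return [n for n, _ in ordered if n not in exclude][:k]


# Hypothetical code graph; weights are illustrative.
ppr_edges = [
    ("route:/groups", "handler:list_groups", 1.0),
    ("handler:list_groups", "table:groups", 1.0),
    ("handler:list_groups", "util:log", 0.2),
    ("route:/users", "handler:get_user", 1.0),
    ("handler:get_user", "table:users", 1.0),
]
ppr_seeds = {"route:/groups"}
scores = personalised_pagerank(ppr_edges, ppr_seeds)
print(top_k(scores, 2, exclude=ppr_seeds))
```

Nodes that cannot be reached from the seed keep a score of zero, and the low-weight edge to the logging helper ranks it below the table, which is the "relevance propagates from seeds along weighted edges with decay" intuition in miniature.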
Graph theory background:
- Introduction to Algorithms, 4th edition (Cormen, Leiserson, Rivest, Stein). Reference for the graph algorithms underneath glia's traversal primitives.
- DanielKeogh/com.danielkeogh.graph. Friend's graph library; helped along the way.
Tooling:
- tree-sitter. Every language parser is built on it.
- rkyv. Zero-copy serialisation behind the `.gmap` container.
- PyO3. Python bindings.
- maturin. Wheel build.
- PolyForm Project. The noncommercial license that glia's worker-protective overlay sits on top of.
Parked experiment:
- candle. The v0.4.13 latent-injection arm forked the qwen2 model from here (Apache-2.0 / MIT). Bench inference subsequently moved to llama.cpp for ~7x CPU speedup. The candle fork lives in `scratch/latent/` for replay.
Thinking partners:
- Anthropic's Claude. Sustained design partner through the project. The graph-substrate framing, the resolver decomposition, the worker-protective licensing direction, and most of glia's actual implementation were distilled in long collaborative sessions. Thanks for being a tool that lets a single person turn a core thinking advantage into shippable substrate.