This repository now includes a local-first indexing and retrieval pipeline for your personal vault and research corpus.
- Indexes all readable text-like files (including all `.md` files by default).
- Builds chunk-level vectors for granular retrieval.
- Builds file-level vectors for document context.
- Builds folder-level vectors (including root `.`) for high-level context routing.
- Stores vectors in a persistent local ChromaDB directory.
- Provides a CLI for indexing, querying, and index stats.
The pipeline implements a 3-layer semantic representation:
- `vault_chunks` collection
  - One embedding per chunk.
  - Metadata includes source path, chunk index, char offsets, extension, folder.
- `vault_files` collection
  - One embedding per file-level summary.
  - Metadata includes source path, extension, folder, chunk count.
- `vault_folders` collection
  - One embedding per folder summary (aggregated from child file excerpts).
  - Includes a root folder vector at path `.`.
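As a rough illustration of the folder layer, the sketch below groups file excerpts under their folders, including the root `.`. The function name `folder_summaries`, the `max_chars` cap, and the choice to aggregate over every ancestor folder (not only the direct parent) are assumptions made for this example, not a description of the pipeline's actual code.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def folder_summaries(file_excerpts: dict[str, str], max_chars: int = 2000) -> dict[str, str]:
    """Build one summary string per folder from file excerpts (hypothetical helper).

    `file_excerpts` maps a vault-relative path to a short excerpt of that file.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for path, excerpt in sorted(file_excerpts.items()):
        # For a relative path, .parents ends with ".", so the root always gets an entry.
        for parent in PurePosixPath(path).parents:
            grouped[str(parent)].append(f"{path}: {excerpt}")
    # Cap each summary so that a large folder does not produce an oversized input.
    return {folder: "\n".join(lines)[:max_chars] for folder, lines in grouped.items()}
```

Each returned string would then be embedded once and stored with the folder path as its identifier.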
Use Python 3.10+.
- Core install:
python3 -m pip install -e .
- Optional LLM embeddings (recommended):
python3 -m pip install -e ".[llm]"
- Dev/test:
python3 -m pip install -e ".[dev]"
Index the vault:
python3 -m vault_pipeline.cli --root . --db .vault_index/chroma index
Query semantically:
python3 -m vault_pipeline.cli --root . --db .vault_index/chroma query "spectral graph eigenvalue methods"
Get collection stats:
python3 -m vault_pipeline.cli --root . --db .vault_index/chroma stats
If transformer embeddings are not available in the environment, use:
python3 -m vault_pipeline.cli --root . --db .vault_index/chroma --fallback-embedder index
This uses a deterministic hashing embedder intended for local/offline smoke testing.
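For intuition, a deterministic hashing embedder can be sketched with the standard library alone. This is a generic feature-hashing example, not the fallback embedder shipped here; the name `hash_embed`, the 256-dimension default, and the SHA-256 token hashing are all assumptions.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map text to a fixed-size unit vector via signed feature hashing (illustrative)."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # signed to reduce collision bias
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave an all-zero vector (e.g. empty text) unnormalized to avoid dividing by zero.
    return [v / norm for v in vec] if norm else vec
```

The same input always yields the same vector, with no model download, which is what makes this style of embedder useful for offline checks. It captures token overlap only, so retrieval quality will be well below a transformer model.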
- `--chunk-size` (default: `800`)
- `--chunk-overlap` (default: `120`)
- `--extensions` comma-separated list (default includes `.md`, `.txt`, `.py`, `.json`, `.yaml`, `.toml`, `.tex`, `.csv`, etc.)
- `--extensions "*"` to index all readable file extensions (still bounded by max file size and decoding)
- `--include-hidden` to include hidden paths
- `--embedding-model` to set sentence-transformers model
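To show how `--chunk-size` and `--chunk-overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters (the README mentions char offsets but does not state the unit), and `chunk_text` with its `start`/`end` keys is a hypothetical helper, not the pipeline's own chunker.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 120) -> list[dict]:
    """Split text into overlapping character windows with provenance offsets."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks: list[dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(
            {"chunk_index": len(chunks), "start": start, "end": end, "text": text[start:end]}
        )
        if end == len(text):
            break  # the final window reached the end of the text
        start += step
    return chunks
```

With the defaults, consecutive windows start 680 characters apart, so a 2000-character file yields three chunks starting at offsets 0, 680, and 1360.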
For your paper generator and MATH Vault workflows, use this retrieval sequence:
- Query `vault_folders` to route to relevant domain folders.
- Query `vault_files` in selected folders for shortlist.
- Query `vault_chunks` on shortlisted files for exact evidence snippets.
- Return provenance metadata (`path`, `chunk_index`, offsets) with each citation candidate.
This provides both broad context and precise supporting evidence for downstream drafting and verification agents.
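The folder-to-file-to-chunk sequence can be sketched in plain Python over in-memory records. This stands in for the ChromaDB queries and is not the pipeline's API: `hierarchical_search`, the record layout (`vector`, `folder`, `start`, `end`), and the `k_*` limits are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: list[float], records: list[dict], k: int) -> list[dict]:
    """Return the k records whose 'vector' is most similar to the query."""
    return sorted(records, key=lambda r: cosine(query, r["vector"]), reverse=True)[:k]


def hierarchical_search(query, folders, files, chunks, k_folders=2, k_files=3, k_chunks=5):
    # 1. Route: pick the most relevant folders.
    folder_hits = {r["path"] for r in top_k(query, folders, k_folders)}
    # 2. Shortlist: rank only the files that live in those folders.
    file_pool = [r for r in files if r["folder"] in folder_hits]
    file_hits = {r["path"] for r in top_k(query, file_pool, k_files)}
    # 3. Evidence: rank only the chunks that belong to shortlisted files.
    chunk_pool = [r for r in chunks if r["path"] in file_hits]
    # 4. Return provenance metadata with each citation candidate.
    return [
        {"path": r["path"], "chunk_index": r["chunk_index"], "start": r["start"], "end": r["end"]}
        for r in top_k(query, chunk_pool, k_chunks)
    ]
```

Narrowing the candidate pool at each stage keeps the chunk-level search focused on files already judged relevant, at the cost of missing evidence in folders that were not selected in the first step.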