Mind Map Vault Vector Pipeline

This repository now includes a local-first indexing and retrieval pipeline for your personal vault and research corpus.

What it does

- Indexes all readable text-like files (including all .md files by default).
- Builds chunk-level vectors for granular retrieval.
- Builds file-level vectors for document context.
- Builds folder-level vectors (including root .) for high-level context routing.
- Stores vectors in a persistent local ChromaDB directory.
- Provides a CLI for indexing, querying, and index stats.

Architecture

The pipeline implements a 3-layer semantic representation:

vault_chunks collection
- One embedding per chunk.
- Metadata includes source path, chunk index, char offsets, extension, folder.
vault_files collection
- One embedding per file-level summary.
- Metadata includes source path, extension, folder, chunk count.
vault_folders collection
- One embedding per folder summary (aggregated from child file excerpts).
- Includes a root folder vector at path "." (the vault root).
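
To make the folder layer concrete, here is a minimal sketch of how per-folder summary text could be assembled from child file excerpts before embedding. The function names, the 200-character excerpt length, and the choice to attach each file only to its direct parent folder are assumptions made for illustration; the actual pipeline may aggregate differently.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def folder_of(path: str) -> str:
    # Top-level files have parent ".", which matches the root folder entry.
    return str(PurePosixPath(path).parent)


def folder_summaries(files: dict[str, str], excerpt_chars: int = 200) -> dict[str, str]:
    """Join a short excerpt of each child file into one text per folder.

    Assumption: a file contributes only to its direct parent folder.
    """
    excerpts: dict[str, list[str]] = defaultdict(list)
    excerpts.setdefault(".", [])  # the root folder always gets an entry
    for path in sorted(files):
        excerpts[folder_of(path)].append(files[path][:excerpt_chars])
    return {folder: "\n".join(parts) for folder, parts in excerpts.items()}


# Hypothetical vault contents, keyed by vault-relative path.
vault = {
    "README.md": "Vault overview",
    "Core Domains/graphs.md": "Spectral graph theory notes",
    "Core Domains/algebra.md": "Linear algebra notes",
}
summaries = folder_summaries(vault)
```

Each value in summaries would then be embedded once and stored with its folder path as metadata.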

Install

Use Python 3.10+.

Core install:
- python3 -m pip install -e .
Optional LLM embeddings (recommended; the extras are quoted so shells such as zsh do not expand the brackets):
- python3 -m pip install -e ".[llm]"
Dev/test:
- python3 -m pip install -e ".[dev]"

CLI usage

Index the vault:

python3 -m vault_pipeline.cli --root . --db .vault_index/chroma index

Query semantically:

python3 -m vault_pipeline.cli --root . --db .vault_index/chroma query "spectral graph eigenvalue methods"

Get collection stats:

python3 -m vault_pipeline.cli --root . --db .vault_index/chroma stats
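
If an agent or script needs to drive these commands from Python, one option is a thin subprocess wrapper. This is only a sketch: build_command and run_cli are hypothetical helpers, not part of the package, and they use only the flags and subcommands shown in this README. The format of the CLI output is not specified here, so the wrapper returns it unparsed.

```python
import subprocess
import sys


def build_command(subcommand, *args, root=".", db=".vault_index/chroma", fallback=False):
    """Assemble the argument list for one CLI invocation."""
    # sys.executable stands in for python3 so the current interpreter is reused.
    cmd = [sys.executable, "-m", "vault_pipeline.cli", "--root", root, "--db", db]
    if fallback:
        cmd.append("--fallback-embedder")  # global flag, placed before the subcommand
    cmd.append(subcommand)
    cmd.extend(args)
    return cmd


def run_cli(subcommand, *args, **options):
    """Run the CLI and return the completed process (stdout is left unparsed)."""
    return subprocess.run(
        build_command(subcommand, *args, **options),
        capture_output=True,
        text=True,
        check=True,
    )


query_cmd = build_command("query", "spectral graph eigenvalue methods")
```

Calling run_cli("stats") would then execute the stats command above, provided the package is installed in the current environment.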

Fallback mode

If transformer embeddings are not available in the environment, use:

python3 -m vault_pipeline.cli --root . --db .vault_index/chroma --fallback-embedder index

This uses a deterministic hashing embedder intended for local or offline smoke testing.
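
For intuition, a deterministic hashing embedder can be sketched with the standard library alone. The version below is an illustration of the general technique (signed feature hashing of tokens followed by L2 normalisation), not the pipeline's actual implementation; the function name, the 64-dimension default, and the tokenisation rule are all assumptions.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-size vector by hashing each token into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        # A hash-derived sign reduces the bias introduced by bucket collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Empty or fully cancelled input stays as the zero vector.
    return [v / norm for v in vec] if norm else vec


first = hash_embed("spectral graph eigenvalue methods")
second = hash_embed("spectral graph eigenvalue methods")
```

Because the vector depends only on the input text, repeated runs produce identical results without any model download. Vectors of this kind capture token overlap rather than meaning.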

Configuration knobs

- --chunk-size (default: 800)
- --chunk-overlap (default: 120)
- --extensions takes a comma-separated list (default includes .md, .txt, .py, .json, .yaml, .toml, .tex, .csv, etc.)
- --extensions "*" indexes all readable file extensions (still bounded by max file size and decoding)
- --include-hidden includes hidden paths
- --embedding-model sets the sentence-transformers model
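
To show how --chunk-size and --chunk-overlap interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters (consistent with the char offsets stored in chunk metadata, but not confirmed by this README), and chunk_text is a hypothetical helper rather than the pipeline's own function.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 120) -> list[dict]:
    """Split text into overlapping windows and record character offsets."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # 680 characters with the defaults
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(
            {"chunk_index": len(chunks), "start": start, "end": end, "text": text[start:end]}
        )
        if end == len(text):
            break  # the final window reached the end of the text
        start += step
    return chunks


sample = "x" * 2000
offsets = [(c["start"], c["end"]) for c in chunk_text(sample)]
# offsets == [(0, 800), (680, 1480), (1360, 2000)]
```

With the defaults, each chunk starts 680 characters after the previous one, so neighbouring chunks share 120 characters of context.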

Agent retrieval strategy

For your paper generator and MATH Vault workflows, use this retrieval sequence:

1. Query vault_folders to route to the relevant domain folders.
2. Query vault_files within the selected folders to build a shortlist.
3. Query vault_chunks on the shortlisted files for exact evidence snippets.
4. Return provenance metadata (path, chunk_index, offsets) with each citation candidate.

This provides both broad context and precise supporting evidence for downstream drafting and verification agents.
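
The retrieval sequence above can be sketched as a coarse-to-fine search over three lists of records. This example is deliberately self-contained: it uses tiny hand-written vectors and in-memory lists instead of ChromaDB, and the record layout, function names, and top-k values are assumptions for illustration only.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, records, k):
    """Return the k records whose vectors are most similar to the query."""
    return sorted(records, key=lambda r: cosine(query, r["vec"]), reverse=True)[:k]


def retrieve(query, folders, files, chunks, k_folders=1, k_files=1, k_chunks=2):
    # Step 1: route to the most relevant folders.
    picked_folders = {r["folder"] for r in top_k(query, folders, k_folders)}
    # Step 2: shortlist files inside those folders.
    file_pool = [r for r in files if r["folder"] in picked_folders]
    picked_files = {r["path"] for r in top_k(query, file_pool, k_files)}
    # Step 3: rank chunks from the shortlisted files only.
    chunk_pool = [r for r in chunks if r["path"] in picked_files]
    # Step 4: return provenance metadata for each candidate.
    return [
        {"path": r["path"], "chunk_index": r["chunk_index"], "start": r["start"], "end": r["end"]}
        for r in top_k(query, chunk_pool, k_chunks)
    ]


# Toy two-dimensional vectors standing in for real embeddings.
folders = [{"folder": "math", "vec": [1.0, 0.0]}, {"folder": "notes", "vec": [0.0, 1.0]}]
files = [
    {"path": "math/graphs.md", "folder": "math", "vec": [0.9, 0.1]},
    {"path": "math/algebra.md", "folder": "math", "vec": [0.5, 0.5]},
    {"path": "notes/todo.md", "folder": "notes", "vec": [0.1, 0.9]},
]
chunks = [
    {"path": "math/graphs.md", "chunk_index": 0, "start": 0, "end": 800, "vec": [0.6, 0.4]},
    {"path": "math/graphs.md", "chunk_index": 1, "start": 680, "end": 1480, "vec": [1.0, 0.0]},
    {"path": "notes/todo.md", "chunk_index": 0, "start": 0, "end": 300, "vec": [1.0, 0.0]},
]
hits = retrieve([1.0, 0.0], folders, files, chunks)
```

Note that the chunk from notes/todo.md matches the query exactly but is never considered, because routing excluded its folder in step 1. That is the trade-off of this strategy: a narrower search space and cleaner provenance, at the cost of missing evidence outside the routed folders, so the folder and file top-k values are worth tuning.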
