
agent-library v0.14 roadmap: Connector abstraction + pluggable storage #27

@torresmateo

This issue tracks the roadmap for agent-library v0.14 — a major-but-pre-1.0 reorganization around a Connector abstraction that lets users build their own source pipelines (Slack, GitHub, Linear, web scrapers, internal APIs — anything) on top of the librarian's parsing + chunking + embedding + search infrastructure.

TL;DR

  • New abstraction: Connector ABC + Orchestrator + Storage protocol bundle. Users implement a Connector for their data source; the librarian handles the rest.
  • Pluggable substrate: SQLite (default; OSS / local) and Postgres+pgvector (production / large corpora) ship side by side behind the same Storage protocol.
  • Built-in LocalFileConnector: re-implements the current libr add/file-indexing workflow on top of the new infrastructure. Existing CLI, Python imports, and MCP tool names continue to work.
  • One breaking change for runtime users: the on-disk DB schema is incompatible with v0.13.x and requires a one-time rebuild via libr index --rebuild. Source files on disk are the source of truth; chunks are regenerated.
  • Breaking changes for Python/MCP integrators: chunk_id returned by MCP tools and Python APIs changes from int to str (deterministic hash), and Database.connection() is removed. Code that treats chunk_id as opaque and never touches the raw connection is unaffected.
  • Versioning: shipped as v0.14.0. This release is explicitly not a v1.0 stability promise.

Progress

Tracer foundations

  • Connector ABC + ChangeEvent (DocumentUpsert / DocumentSoftDelete) + ChunkInput types
  • Storage protocol bundle (MetadataStore / VectorStore / FTSStore / StateStore)
  • SQLiteStorage implementation behind the protocol bundle
  • New v0.14 schema + migration (deep-citation chunk columns, sync_state table, model_version)
  • Deterministic-hash chunk_id / document_id
  • Orchestrator with atomic state+content writes
  • Built-in LocalFileConnector
  • IndexingService shim + DeprecationWarning
  • Database mutator-method deprecations (route through Orchestrator)
  • Database.connection() removed
  • v0.13 → v0.14 detect-and-rebuild flow (libr index --rebuild with auto-backup)
  • import-linter rule defining the supported public surface
  • First pre-release: v0.14.0a1 on PyPI

Substrate parity and schema extensions

  • PostgresStorage implementation behind the same protocol bundle
  • Parameterized storage test suite (SQLite + Postgres fixtures)
  • CI matrix runs both substrates on every PR
  • expand_context(chunk_id, before, after) MCP tool
  • Deep-citation columns confirmed populated end-to-end (chunk_source_uri, document_source_uri, chunk_index, document_size, source_created_at)
  • Soft-delete (deleted_at) filtering in search_library by default

Vision pipeline

  • Vision strategy decision (Options A / B / C below) documented in this issue
  • VLM caller behind IMAGE_GENERATE_CAPTIONS / PDF_OCR_ENABLED flags
  • Failure handling: "[image, processing failed]" content + modality_data.processing_status='failed'
  • libr reprocess --asset-type image --where processing_status=failed CLI

Release

  • Beta pre-release: v0.14.0b1 on PyPI
  • CHANGELOG finalized: breaking changes + upgrade guide (back up DB → libr index --rebuild → continue)
  • v0.14.0 published to PyPI
  • v0.14-development merged to main
  • Smoke test in a fresh venv passes against agent-library==0.14.0 from PyPI
  • Detect-and-rebuild smoke test passes against a v0.13-schema DB

Follow-ups (separate issues; not blocking v0.14.0)

  • Efficient batched file-mode driver (long-lived Orchestrator + LocalFileConnector session for high-volume callers) — target v1.x alongside IndexingService removal
  • Decision: when to declare v1.0.0 stability

Motivation

Today, agent-library is excellent for indexing a folder of files on a local machine and exposing them to an agent via MCP. The pipeline is essentially file-in, search-out, with no abstraction over where the content comes from.

A lot of interesting agent use cases want the same hybrid search + MMR + chunk-level citations against content that isn't on the filesystem — a company's Slack history, a GitHub org's PRs and code, a customer's Linear issues, a CRM, a Google Drive, a Notion workspace, a research lab's paper corpus. Building any of these today means standing up a parallel ingestion pipeline alongside agent-library and hand-rolling parsing, chunking, embedding, transactional state tracking, idempotent upserts, soft-deletes, and re-ingest. That's exactly the work agent-library should be doing for you.

v0.14 turns the file-in path into a special case of a more general Connector contract, so that "build a knowledge base from N sources" becomes a small problem (implement one Connector per source) instead of a large one.

What's changing

1. Connector ABC and event-stream contract

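from __future__ import annotations  # defer annotation evaluation; ChangeEvent is defined separately

from abc import ABC, abstractmethod
from typing import AsyncIterator


class Connector(ABC):
    name: str  # namespaces sync state, e.g., "slack", "github_prs"

    @abstractmethod
    async def fetch_changes(
        self, state: dict
    ) -> AsyncIterator[ChangeEvent]:
        # Emit the source's change events, given the connector's saved sync state.
        ...

    def initial_state(self) -> dict:
        # State handed to fetch_changes on the first run; override to seed a cursor.
        return {}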

ChangeEvent = DocumentUpsert | DocumentSoftDelete. A DocumentUpsert carries either pre-formed ChunkInputs (for sources like chat where the connector already knows what a chunk is) or raw_content + mimetype (for sources where the librarian's parser registry should chunk it). Connectors are stateless and DB-free — their unit tests need no database.
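To make the contract concrete, here is a minimal sketch of a toy connector that duck-types the interface above. The event dataclasses and their field names (native_id, chunks, text) and the last_seq cursor key are assumptions for illustration; the issue does not pin down the real shapes.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Optional, Union


# Hypothetical event shapes; the real field names are not specified in this issue.
@dataclass
class ChunkInput:
    native_id: str
    text: str


@dataclass
class DocumentUpsert:
    native_id: str
    chunks: list[ChunkInput] = field(default_factory=list)


@dataclass
class DocumentSoftDelete:
    native_id: str


ChangeEvent = Union[DocumentUpsert, DocumentSoftDelete]


class InMemoryChatConnector:
    """Toy connector: replays a fixed message log, resuming from a cursor."""

    name = "toy_chat"

    def __init__(self, log: list[tuple[int, str, Optional[str]]]):
        # Each entry is (sequence_number, message_id, text); text=None means deleted.
        self._log = log

    def initial_state(self) -> dict:
        return {"last_seq": 0}

    async def fetch_changes(self, state: dict) -> AsyncIterator[ChangeEvent]:
        for seq, message_id, text in self._log:
            if seq <= state["last_seq"]:
                continue  # already ingested in an earlier run
            if text is None:
                yield DocumentSoftDelete(native_id=message_id)
            else:
                yield DocumentUpsert(
                    native_id=message_id,
                    chunks=[ChunkInput(native_id=message_id, text=text)],
                )


async def collect(connector: InMemoryChatConnector, state: dict) -> list[ChangeEvent]:
    return [event async for event in connector.fetch_changes(state)]


log = [(1, "m1", "deploy at 5pm?"), (2, "m2", "yes, do it"), (3, "m1", None)]
connector = InMemoryChatConnector(log)
events = asyncio.run(collect(connector, connector.initial_state()))
resumed = asyncio.run(collect(connector, {"last_seq": 2}))
```

Nothing here touches a database, which is the point of the "stateless and DB-free" rule: the connector can be unit-tested by feeding it a state dict and inspecting the events it yields.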

2. Orchestrator

librarian.orchestrator.Orchestrator is an importable class that owns chunking-by-parser-route, embedding, storage writes, state advance, batching, checkpointing, and retries. It accepts a Connector and a Storage instance and runs the pipeline.

Atomicity guarantee: every cursor advance happens in the same DB transaction as the chunk inserts it corresponds to. A crash mid-batch resumes from the last successfully-persisted cursor — never re-emits stored chunks, never silently skips unstored ones.
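The atomicity guarantee can be illustrated with plain sqlite3. This is a sketch under assumptions, not the Orchestrator's implementation: the two-column tables, the persist_batch helper, and the fail flag used to simulate a crash are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, content TEXT)")
conn.execute("CREATE TABLE sync_state (connector TEXT PRIMARY KEY, cursor TEXT)")
conn.commit()


def persist_batch(conn, connector, chunks, new_cursor, fail=False):
    """Write a batch of chunks and advance the cursor in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        conn.executemany(
            "INSERT INTO chunks (chunk_id, content) VALUES (?, ?) "
            "ON CONFLICT(chunk_id) DO UPDATE SET content = excluded.content",
            chunks,
        )
        if fail:
            raise RuntimeError("simulated crash mid-batch")
        conn.execute(
            "INSERT INTO sync_state (connector, cursor) VALUES (?, ?) "
            "ON CONFLICT(connector) DO UPDATE SET cursor = excluded.cursor",
            (connector, json.dumps(new_cursor)),
        )


# First batch succeeds; second batch "crashes" after the chunk insert.
persist_batch(conn, "toy_chat", [("c1", "hello")], {"last_seq": 1})
try:
    persist_batch(conn, "toy_chat", [("c2", "world")], {"last_seq": 2}, fail=True)
except RuntimeError:
    pass

stored_ids = [row[0] for row in conn.execute("SELECT chunk_id FROM chunks ORDER BY chunk_id")]
cursor = json.loads(
    conn.execute("SELECT cursor FROM sync_state WHERE connector = 'toy_chat'").fetchone()[0]
)
```

After the simulated crash, the second batch's chunk is rolled back together with its cursor advance, so a resumed run starts from last_seq 1 and re-fetches exactly the work that was lost.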

3. Storage protocol bundle and pluggable substrate

Storage bundles MetadataStore + VectorStore + FTSStore + StateStore for a single substrate. Two concrete implementations ship in v0.14.0:

  • SQLiteStorage — default; OSS, local. Uses sqlite-vec for vector search and FTS5 for keyword search (as today).
  • PostgresStorage — Postgres + pgvector. For larger corpora or production deployments.

There are no cross-store escape hatches: a Storage instance never mixes substrates, so atomic transactions stay within a single substrate. The substrate is chosen via config (e.g., STORAGE_BACKEND=sqlite|postgres).
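As a rough illustration of config-driven substrate selection, the sketch below maps the STORAGE_BACKEND value to one storage class. The storage_from_env helper, the backend attribute, and the SQLite default when the variable is unset are assumptions; the two classes are empty stand-ins, not the real implementations.

```python
from typing import Mapping, Protocol


class Storage(Protocol):
    # Hypothetical minimal surface; the real bundle exposes four sub-stores.
    backend: str


class SQLiteStorage:
    backend = "sqlite"  # stand-in for the real class


class PostgresStorage:
    backend = "postgres"  # stand-in for the real class


_BACKENDS = {"sqlite": SQLiteStorage, "postgres": PostgresStorage}


def storage_from_env(env: Mapping[str, str]) -> Storage:
    """Pick exactly one substrate; default to SQLite when the variable is unset."""
    name = env.get("STORAGE_BACKEND", "sqlite").strip().lower()
    if name not in _BACKENDS:
        raise ValueError(
            f"unknown STORAGE_BACKEND {name!r}; expected one of {sorted(_BACKENDS)}"
        )
    return _BACKENDS[name]()


storage = storage_from_env({"STORAGE_BACKEND": "postgres"})
default_storage = storage_from_env({})
```

In real use the mapping passed in would be os.environ. Returning a single object per process keeps every store on the same substrate.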

The existing Database, VectorStore, and FTSStore classes keep their public read methods. The new protocols are scoped under librarian.storage.protocols. Database mutator methods (insert_*, update_*, delete_*) emit DeprecationWarning and route through Orchestrator for the v0.14 lifetime (slated for removal in v0.15). Database.connection() is removed (the substrate is now abstracted; no public raw-SQLite escape hatch).
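The deprecation pattern for mutator methods might look roughly like this. Both classes are stand-ins, and upsert_document is a hypothetical Orchestrator method name used only to show the routing.

```python
import warnings


class Orchestrator:
    """Stand-in that records writes; the real class also chunks, embeds, and checkpoints."""

    def __init__(self):
        self.writes = []

    def upsert_document(self, path: str) -> None:  # hypothetical method name
        self.writes.append(path)


class Database:
    def __init__(self, orchestrator: Orchestrator):
        self._orchestrator = orchestrator

    def insert_document(self, path: str) -> None:
        # Warn at the caller's line (stacklevel=2), then delegate the write.
        warnings.warn(
            "Database.insert_document is deprecated; use Orchestrator instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._orchestrator.upsert_document(path)


orchestrator = Orchestrator()
db = Database(orchestrator)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    db.insert_document("notes.md")
```

Existing callers keep working through the shim, and running tests with -W error::DeprecationWarning surfaces every call site that needs migrating.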

4. Built-in LocalFileConnector

Reimplements the current libr add/file-indexing workflow as a connector. Walks a path, emits DocumentUpsert events for supported files (markdown, code, PDF, image), uses the existing parser registry to do the actual parsing.

IndexingService survives as a thin shim over Orchestrator + a one-shot LocalFileConnector. Methods emit DeprecationWarning directing users to Orchestrator for new code. A follow-up issue (see "Follow-ups" in Progress above) tracks adding an efficient batched file-mode driver before IndexingService is fully removed.
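The "walks a path" step could be sketched as below. The suffix-to-mimetype table and the walk_supported helper are assumptions for illustration; in the real connector the parser registry decides which files are supported.

```python
import tempfile
from pathlib import Path

# Hypothetical suffix-to-mimetype table; the real parser registry decides support.
SUPPORTED = {
    ".md": "text/markdown",
    ".py": "text/x-python",
    ".pdf": "application/pdf",
    ".png": "image/png",
}


def walk_supported(root: Path):
    """Yield (relative_path, mimetype) for each supported file under root, sorted."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path.relative_to(root).as_posix(), SUPPORTED[path.suffix.lower()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.md").write_text("# hello")
    (root / "src").mkdir()
    (root / "src" / "app.py").write_text("print('hi')")
    (root / "data.bin").write_bytes(b"\x00")  # unsupported, should be skipped
    found = list(walk_supported(root))
```

Each yielded pair corresponds to one DocumentUpsert carrying raw_content + mimetype, leaving chunking to the parser route.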

5. Deterministic chunk_id and document_id

chunk_id    = hash(connector_name, source_type, source_native_id_per_chunk)
document_id = hash(connector_name, source_type, source_native_id_per_document)

Stable across runs. Required for idempotent upsert on edited content (e.g., a Slack message that gets edited, a file that gets rewritten) — no "find then update" round trip needed.

This is the cause of the chunk_id breaking change for Python/MCP integrators: chunk_id returned by search_library and related APIs is now a string hash instead of an autoincrement integer.
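One way to build such an ID is sketched below. The choice of SHA-256, the hex output, and the length-prefixed encoding are assumptions; the issue only specifies which three inputs are hashed.

```python
import hashlib


def stable_id(connector_name: str, source_type: str, native_id: str) -> str:
    """Hash the three parts with a length prefix so ("ab", "c") and ("a", "bc") differ."""
    digest = hashlib.sha256()
    for part in (connector_name, source_type, native_id):
        data = part.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))  # unambiguous field boundary
        digest.update(data)
    return digest.hexdigest()


first = stable_id("slack", "message", "C123/1700000000.000100")
second = stable_id("slack", "message", "C123/1700000000.000100")
shifted = stable_id("slac", "kmessage", "C123/1700000000.000100")
```

Because the ID depends only on the source's own identifiers, re-ingesting an edited message produces the same chunk_id and the write becomes a plain upsert.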

6. New schema columns and sync_state table

The chunks table gains chunk_index, document_size, source_created_at, deleted_at, deletion_reason, document_source_uri, chunk_source_uri. The chunk_embeddings table gains model_version. A new sync_state table tracks per-source ingest state (cursor as opaque JSON, status, last-success timestamps, error info, counters, config_version).

These power deep citation (pointing agents at the specific Slack message / line of code / comment, not just at the document), soft-delete tombstoning (source-side deletes don't drop content from the corpus by default; that's a connector decision), and expand_context (below).
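A sketch of how deleted_at tombstones could be filtered out of results by default, using an in-memory SQLite table. The reduced column set, the LIKE-based lookup, the include_deleted parameter, and the sample rows are illustrative assumptions rather than the search_library implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "chunk_id TEXT PRIMARY KEY, content TEXT, deleted_at TEXT, deletion_reason TEXT)"
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("c1", "rollout plan", None, None),
        # Tombstoned row: content is kept, but marked as deleted at the source.
        ("c2", "rollout cancelled", "2025-01-01T00:00:00Z", "source_deleted"),
    ],
)
conn.commit()


def find(conn, term, include_deleted=False):
    """Return matching chunk ids, hiding soft-deleted rows unless asked otherwise."""
    sql = "SELECT chunk_id FROM chunks WHERE content LIKE ?"
    if not include_deleted:
        sql += " AND deleted_at IS NULL"
    sql += " ORDER BY chunk_id"
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


live_only = find(conn, "rollout")
with_tombstones = find(conn, "rollout", include_deleted=True)
```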

7. New MCP tool: expand_context

expand_context(chunk_id, before=N, after=N) returns the N chunks before and N after the given chunk in the same document, in source order. For sources that produce fragmentary chunks ("yes, do it"), agents call this to pull in surrounding context only when a result needs it — keeping embeddings clean while making fragmentary content recoverable.
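The lookup semantics can be sketched over an in-memory list. The Chunk tuple and the list-based signature are assumptions for illustration; the real tool takes only chunk_id, before, and after and reads from storage. Following the description above, the target chunk itself is not included in the result.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    chunk_id: str
    document_id: str
    chunk_index: int
    content: str


def expand_context(chunks: list[Chunk], chunk_id: str, before: int = 1, after: int = 1) -> list[Chunk]:
    """Return up to `before` chunks preceding and `after` chunks following chunk_id."""
    target = next(c for c in chunks if c.chunk_id == chunk_id)
    # Restrict to the same document and order by position in the source.
    siblings = sorted(
        (c for c in chunks if c.document_id == target.document_id),
        key=lambda c: c.chunk_index,
    )
    pos = siblings.index(target)
    return siblings[max(0, pos - before):pos] + siblings[pos + 1:pos + 1 + after]


thread = [
    Chunk("a", "t1", 0, "should we ship tonight?"),
    Chunk("b", "t1", 1, "yes, do it"),
    Chunk("c", "t1", 2, "shipping now"),
    Chunk("d", "t2", 0, "unrelated thread"),
]
around_b = [c.chunk_id for c in expand_context(thread, "b", before=1, after=1)]
around_a = [c.chunk_id for c in expand_context(thread, "a", before=2, after=1)]
```

The window is clamped at document boundaries, so asking for two chunks before the first message simply returns none.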

8. Vision pipeline — open question

The librarian today has CLIP-style vision/image infrastructure (ENABLE_VISION_EMBEDDINGS, vec_chunks_vision, image embedder). v0.14 introduces a separate VLM-text path: one hosted-VLM call per image produces both a description and transcribed text, which becomes the chunk's text for text-embedding. Gated by IMAGE_GENERATE_CAPTIONS / PDF_OCR_ENABLED. Failed VLM calls produce a chunk with content "[image, processing failed]" and modality_data.processing_status='failed', reprocessable later via libr reprocess --asset-type image --where processing_status=failed.
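The failure-handling rule could be sketched as follows. The caption_chunk helper, the injected call_vlm callable, and the 'ok' status value are assumptions; only the placeholder content and processing_status='failed' come from the description above.

```python
from typing import Callable

FAILED_CONTENT = "[image, processing failed]"


def caption_chunk(image_bytes: bytes, call_vlm: Callable[[bytes], str]) -> dict:
    """Build a chunk dict from one VLM call, falling back to a failure placeholder."""
    try:
        text = call_vlm(image_bytes)
    except Exception:
        # Keep a chunk so the image stays visible in the corpus and can be retried.
        return {
            "content": FAILED_CONTENT,
            "modality_data": {"processing_status": "failed"},
        }
    return {"content": text, "modality_data": {"processing_status": "ok"}}


def broken_vlm(_image: bytes) -> str:
    raise TimeoutError("hosted VLM did not respond")


good = caption_chunk(b"\x89PNG", lambda _image: "whiteboard with a rollout checklist")
bad = caption_chunk(b"\x89PNG", broken_vlm)

# Selecting what a reprocess pass would pick up:
to_reprocess = [
    c for c in (good, bad) if c["modality_data"]["processing_status"] == "failed"
]
```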

Open: what happens to the existing CLIP path? Three options:

  • (A) Retire CLIP. Vision is text-from-VLM only.
  • (B) Keep CLIP as an opt-in alternative behind a config flag, parallel to the VLM-text path.
  • (C) Hybrid — e.g., CLIP code retired but the column/table machinery left in place for v1.x re-introduction.

Discussion welcome — this is the most user-facing config call in v0.14. Comment below or open a focused issue.

What stays the same

For users running today's workflow against v0.14, the user-visible surface is unchanged:

  • CLI: libr add <path>, libr search "...", libr list, libr serve, libr config, libr docs, libr index, all with the same names and flags. Internally they now route through LocalFileConnector + Orchestrator + SQLiteStorage.
  • Python imports: from librarian.storage import Database, VectorStore, FTSStore and from librarian.indexing import IndexingService, get_indexing_service still work. They're shims over the new infrastructure.
  • MCP tools: search_library, list_sources, etc. keep their names and parameters. Return shape is enriched additively with the new columns (chunk_source_uri, chunk_index, etc.).

The only thing existing users have to do on upgrade is one rebuild of the local DB.

Breaking changes (full list)

  1. DB schema is incompatible. On startup, v0.14 detects a v0.13 schema in ~/.librarian/index.db, refuses to run, and prints a loud message asking the user to back up the file and run libr index --rebuild. The rebuild command auto-backs-up the v0.13 DB to ~/.librarian/index.db.v0-backup (unless --no-backup is passed), wipes the chunks / documents / embeddings / FTS tables, recreates them under the v0.14 schema, and re-ingests configured sources via LocalFileConnector. Source files on disk are the source of truth, so no data is lost.
  2. Database mutator methods (insert_document, update_document, delete_document, insert_chunk, insert_chunks_batch, delete_chunks_by_document, clear_all) emit DeprecationWarning and route through Orchestrator. Slated for removal in v0.15.
  3. Database.connection() removed. No public raw-SQLite escape hatch.
  4. IndexingService is deprecated and survives only as a thin shim. Methods such as index_file() emit DeprecationWarning, directing users to Orchestrator.
  5. chunk_id returned by MCP tools and Python APIs is now str (deterministic hash) instead of int.

Anything not in the list above keeps working.
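The detect-and-rebuild flow from item 1 might be approximated like this. The detection heuristic (a chunks table lacking deleted_at), the reduced v0.14 table definition, and both helper names are assumptions; re-ingestion via LocalFileConnector is omitted.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


def is_legacy_schema(db_path: Path) -> bool:
    """Treat a chunks table without a deleted_at column as pre-v0.14 (assumed heuristic)."""
    conn = sqlite3.connect(db_path)
    try:
        columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    finally:
        conn.close()
    return bool(columns) and "deleted_at" not in columns


def rebuild(db_path: Path, backup: bool = True) -> Optional[Path]:
    """Copy the old DB aside, then recreate the chunks table under the new layout."""
    backup_path = None
    if backup:
        backup_path = db_path.with_name(db_path.name + ".v0-backup")
        shutil.copy2(db_path, backup_path)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS chunks")
        conn.execute(
            "CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, content TEXT, deleted_at TEXT)"
        )
        conn.commit()
    finally:
        conn.close()
    return backup_path


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "index.db"
    old = sqlite3.connect(db)
    old.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY AUTOINCREMENT, content TEXT)")
    old.commit()
    old.close()

    was_legacy = is_legacy_schema(db)
    backup_path = rebuild(db)
    backup_name = backup_path.name
    backup_is_legacy = is_legacy_schema(backup_path)
    still_legacy = is_legacy_schema(db)
```

The backup is taken before any table is dropped, so an interrupted rebuild still leaves the v0.13 file recoverable.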

Versioning

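Released as v0.14.0, not v1.0.0. Under Semver, major version zero (0.y.z) is for initial development and carries no stability promise, so a 0.13 → 0.14 bump permits the breaking changes above. A v1.0.0 release will mark the public API stable; that's a separate decision made after v0.14 has settled in real use.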

v0.13.x receives no maintenance. v0.13.0 stays available on PyPI for users who want to pin.

Branching plan

  • Long-lived v0.14-development branch in this repo.
  • Per-implementation feature branches off v0.14-development.
  • Pre-releases tagged from v0.14-development: v0.14.0a1 during active development, v0.14.0b1 once all implementation work has merged, v0.14.0 final.
  • Final merge to main only when v0.14.0 ships.

What feedback would help

  • Vision pipeline strategy (A / B / C above): which one, and why?
  • Database.connection() removal: is anyone relying on raw sqlite3.Connection access? If yes, please describe the use case — there might be a substrate-agnostic alternative we should expose instead.
  • Schema rebuild on upgrade: is anyone holding content in agent-library DBs that isn't recoverable from source files on disk? If so, please flag — it changes the upgrade-guidance calculus.
  • Connector ABC shape: anyone planning to implement a connector for their own source, please review the contract sketched above and call out anything awkward.

This issue stays open through v0.14 development; reply with comments or split off focused issues as needed.
