agent-library v0.14 roadmap: Connector abstraction + pluggable storage

This issue tracks the roadmap for `agent-library` v0.14 — a major-but-pre-1.0 reorganization around a `Connector` abstraction that lets users build their own source pipelines (Slack, GitHub, Linear, web scrapers, internal APIs — anything) on top of the librarian's parsing + chunking + embedding + search infrastructure.

## TL;DR

- **New abstraction**: `Connector` ABC + `Orchestrator` + `Storage` protocol bundle. Users implement a `Connector` for their data source; the librarian handles the rest.
- **Pluggable substrate**: SQLite (default; OSS / local) and Postgres+pgvector (production / large corpora) ship side by side behind the same `Storage` protocol.
- **Built-in `LocalFileConnector`**: re-implements the current `libr add`/file-indexing workflow on top of the new infrastructure. **Existing CLI, Python imports, and MCP tool names continue to work.**
- **One breaking change for runtime users**: the on-disk DB schema is incompatible with v0.13.x and requires a one-time rebuild via `libr index --rebuild`. Source files on disk are the source of truth; chunks are regenerated.
- **Breaking changes for Python/MCP integrators**: `chunk_id` returned by MCP tools and Python APIs changes from `int` to `str` (deterministic hash), and `Database.connection()` is removed. Code that treats `chunk_id` as opaque and never touches the raw connection is unaffected.
- **Versioning**: shipped as `v0.14.0`. This release is explicitly *not* a v1.0 stability promise.

## Progress

### Tracer foundations
- [ ] `Connector` ABC + `ChangeEvent` (`DocumentUpsert` / `DocumentSoftDelete`) + `ChunkInput` types
- [ ] `Storage` protocol bundle (`MetadataStore` / `VectorStore` / `FTSStore` / `StateStore`)
- [ ] `SQLiteStorage` implementation behind the protocol bundle
- [ ] New v0.14 schema + migration (deep-citation chunk columns, `sync_state` table, `model_version`)
- [ ] Deterministic-hash `chunk_id` / `document_id`
- [ ] `Orchestrator` with atomic state+content writes
- [ ] Built-in `LocalFileConnector`
- [ ] `IndexingService` shim + `DeprecationWarning`
- [ ] `Database` mutator-method deprecations (route through `Orchestrator`)
- [ ] `Database.connection()` removed
- [ ] v0.13 → v0.14 detect-and-rebuild flow (`libr index --rebuild` with auto-backup)
- [ ] import-linter rule defining the supported public surface
- [ ] First pre-release: `v0.14.0a1` on PyPI

### Substrate parity and schema extensions
- [ ] `PostgresStorage` implementation behind the same protocol bundle
- [ ] Parameterized storage test suite (SQLite + Postgres fixtures)
- [ ] CI matrix runs both substrates on every PR
- [ ] `expand_context(chunk_id, before, after)` MCP tool
- [ ] Deep-citation columns confirmed populated end-to-end (`chunk_source_uri`, `document_source_uri`, `chunk_index`, `document_size`, `source_created_at`)
- [ ] Soft-delete (`deleted_at`) filtering in `search_library` by default

### Vision pipeline
- [ ] Vision strategy decision (Options A / B / C below) documented in this issue
- [ ] VLM caller behind `IMAGE_GENERATE_CAPTIONS` / `PDF_OCR_ENABLED` flags
- [ ] Failure handling: `"[image, processing failed]"` content + `modality_data.processing_status='failed'`
- [ ] `libr reprocess --asset-type image --where processing_status=failed` CLI

### Release
- [ ] Beta pre-release: `v0.14.0b1` on PyPI
- [ ] CHANGELOG finalized: breaking changes + upgrade guide (back up DB → `libr index --rebuild` → continue)
- [ ] `v0.14.0` published to PyPI
- [ ] `v0.14-development` merged to `main`
- [ ] Smoke test in a fresh venv passes against `agent-library==0.14.0` from PyPI
- [ ] Detect-and-rebuild smoke test passes against a v0.13-schema DB

### Follow-ups (separate issues; not blocking v0.14.0)
- [ ] Efficient batched file-mode driver (long-lived `Orchestrator` + `LocalFileConnector` session for high-volume callers) — target v1.x alongside `IndexingService` removal
- [ ] Decision: when to declare v1.0.0 stability

## Motivation

Today, agent-library is excellent for indexing a folder of files on a local machine and exposing them to an agent via MCP. The pipeline is essentially file-in, search-out, with no abstraction over *where the content comes from*.

A lot of interesting agent use cases want the same hybrid search + MMR + chunk-level citations against content that **isn't on the filesystem** — a company's Slack history, a GitHub org's PRs and code, a customer's Linear issues, a CRM, a Google Drive, a Notion workspace, a research lab's paper corpus. Building any of these today means standing up a parallel ingestion pipeline alongside agent-library and hand-rolling parsing, chunking, embedding, transactional state tracking, idempotent upserts, soft-deletes, and re-ingest. That's exactly the work agent-library should be doing for you.

v0.14 turns the file-in path into a special case of a more general `Connector` contract, so that "build a knowledge base from N sources" becomes a small problem (implement one `Connector` per source) instead of a large one.

## What's changing

### 1. `Connector` ABC and event-stream contract

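```python
from __future__ import annotations  # `ChangeEvent` is defined elsewhere in the library

from abc import ABC, abstractmethod
from collections.abc import AsyncIterator


class Connector(ABC):
    name: str  # namespaces sync state, e.g., "slack", "github_prs"

    @abstractmethod
    async def fetch_changes(
        self, state: dict
    ) -> AsyncIterator[ChangeEvent]:
        """Yield the changes that occurred after `state` (the persisted sync state)."""
        ...

    def initial_state(self) -> dict:
        """State used on the first run, before anything has been persisted."""
        return {}
```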

`ChangeEvent = DocumentUpsert | DocumentSoftDelete`. A `DocumentUpsert` carries either pre-formed `ChunkInput`s (for sources like chat where the connector already knows what a chunk is) or `raw_content + mimetype` (for sources where the librarian's parser registry should chunk it). Connectors are stateless and DB-free — their unit tests need no database.
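To make the contract concrete, here is a minimal sketch of a toy connector and a database-free test harness. The dataclass fields (`native_id`, `chunks`, `raw_content`, `mimetype`), the `NotesConnector` class, and the `rev` key in the state dict are all hypothetical; this issue does not fix the event field names or how a connector reports its new cursor, so the sketch only reads `state`.

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator
from dataclasses import dataclass, field
from typing import Optional, Union


# Hypothetical stand-ins for the v0.14 event types; all field names are assumptions.
@dataclass
class ChunkInput:
    native_id: str
    text: str


@dataclass
class DocumentUpsert:
    native_id: str
    chunks: list[ChunkInput] = field(default_factory=list)  # pre-formed chunks, or...
    raw_content: Optional[bytes] = None  # ...raw bytes for the parser registry
    mimetype: Optional[str] = None


@dataclass
class DocumentSoftDelete:
    native_id: str


ChangeEvent = Union[DocumentUpsert, DocumentSoftDelete]


class Connector(ABC):
    name: str

    @abstractmethod
    async def fetch_changes(self, state: dict) -> AsyncIterator[ChangeEvent]: ...

    def initial_state(self) -> dict:
        return {}


class NotesConnector(Connector):
    """Toy source: note_id -> (revision, text); text=None means deleted upstream."""

    name = "notes"

    def __init__(self, notes: dict[str, tuple[int, Optional[str]]]):
        self._notes = notes

    async def fetch_changes(self, state: dict) -> AsyncIterator[ChangeEvent]:
        last_seen = state.get("rev", 0)
        for note_id, (rev, text) in sorted(self._notes.items(), key=lambda kv: kv[1][0]):
            if rev <= last_seen:
                continue  # already covered by the state we were handed
            if text is None:
                yield DocumentSoftDelete(native_id=note_id)
            else:
                chunk = ChunkInput(native_id=f"{note_id}#0", text=text)
                yield DocumentUpsert(native_id=note_id, chunks=[chunk])


async def collect(connector: Connector, state: dict) -> list[ChangeEvent]:
    # No database involved: the connector is exercised on its own.
    return [event async for event in connector.fetch_changes(state)]


def demo() -> list[ChangeEvent]:
    notes = {"a": (1, "hello"), "b": (2, "world"), "c": (3, None)}
    return asyncio.run(collect(NotesConnector(notes), {"rev": 1}))
```

Calling `demo()` yields one upsert for note `b` and one soft-delete for note `c`; note `a` is skipped because its revision is not newer than the supplied state.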

### 2. `Orchestrator`

`librarian.orchestrator.Orchestrator` is an importable class that owns chunking-by-parser-route, embedding, storage writes, state advance, batching, checkpointing, and retries. It accepts a `Connector` and a `Storage` instance and runs the pipeline.

Atomicity guarantee: every cursor advance happens in the **same DB transaction** as the chunk inserts it corresponds to. After a crash mid-batch, ingestion resumes from the last successfully persisted cursor: stored chunks are never re-emitted, and unstored ones are never silently skipped.
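The guarantee can be illustrated with plain `sqlite3` from the standard library. This is a sketch under assumed names, not the actual `Orchestrator` code: the two-column `chunks` and `sync_state` tables and the `commit_batch` / `load_cursor` helpers are hypothetical and much simpler than the real schema.

```python
import json
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, content TEXT NOT NULL)")
    conn.execute("CREATE TABLE sync_state (connector TEXT PRIMARY KEY, cursor TEXT NOT NULL)")
    return conn


def commit_batch(conn, connector, chunks, new_cursor):
    """Write one batch of (chunk_id, content) rows and advance the cursor atomically."""
    with conn:  # one transaction: commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT INTO chunks (chunk_id, content) VALUES (?, ?) "
            "ON CONFLICT(chunk_id) DO UPDATE SET content = excluded.content",
            chunks,
        )
        conn.execute(
            "INSERT INTO sync_state (connector, cursor) VALUES (?, ?) "
            "ON CONFLICT(connector) DO UPDATE SET cursor = excluded.cursor",
            (connector, json.dumps(new_cursor)),
        )


def load_cursor(conn, connector):
    row = conn.execute(
        "SELECT cursor FROM sync_state WHERE connector = ?", (connector,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

If any insert in the batch fails, the rollback discards both the partial chunk writes and the cursor update, so the next run starts again from the previous cursor.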

### 3. `Storage` protocol bundle and pluggable substrate

`Storage` bundles `MetadataStore` + `VectorStore` + `FTSStore` + `StateStore` for a single substrate. Two concrete implementations ship in v0.14.0:

- `SQLiteStorage` — default; OSS, local. Uses `sqlite-vec` for vector search and FTS5 for keyword search (as today).
- `PostgresStorage` — Postgres + pgvector. For larger corpora or production deployments.

No cross-store escape hatches — atomic transactions stay single-substrate. Substrate is chosen via config (e.g., `STORAGE_BACKEND=sqlite|postgres`).
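One way to picture the bundle is as four `typing.Protocol` classes grouped in a single object. The sketch below is illustrative only: every method name (`upsert_document`, `upsert_embedding`, `index_text`, `load_cursor`, `save_cursor`), the `InMemoryStore` class, and the `select_backend` helper are assumptions, since this issue names the protocols but not their methods.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class MetadataStore(Protocol):
    def upsert_document(self, document_id: str, source_uri: str) -> None: ...


@runtime_checkable
class VectorStore(Protocol):
    def upsert_embedding(self, chunk_id: str, vector: list[float]) -> None: ...


@runtime_checkable
class FTSStore(Protocol):
    def index_text(self, chunk_id: str, text: str) -> None: ...


@runtime_checkable
class StateStore(Protocol):
    def load_cursor(self, connector_name: str) -> dict: ...
    def save_cursor(self, connector_name: str, cursor: dict) -> None: ...


@dataclass
class Storage:
    """All four stores for one substrate, so a transaction never spans backends."""
    metadata: MetadataStore
    vectors: VectorStore
    fts: FTSStore
    state: StateStore


class InMemoryStore:
    """Toy object that structurally satisfies all four protocols (handy in tests)."""

    def __init__(self) -> None:
        self.documents: dict[str, str] = {}
        self.embeddings: dict[str, list[float]] = {}
        self.texts: dict[str, str] = {}
        self.cursors: dict[str, dict] = {}

    def upsert_document(self, document_id: str, source_uri: str) -> None:
        self.documents[document_id] = source_uri

    def upsert_embedding(self, chunk_id: str, vector: list[float]) -> None:
        self.embeddings[chunk_id] = vector

    def index_text(self, chunk_id: str, text: str) -> None:
        self.texts[chunk_id] = text

    def load_cursor(self, connector_name: str) -> dict:
        return self.cursors.get(connector_name, {})

    def save_cursor(self, connector_name: str, cursor: dict) -> None:
        self.cursors[connector_name] = cursor


def in_memory_storage() -> Storage:
    store = InMemoryStore()
    return Storage(metadata=store, vectors=store, fts=store, state=store)


def select_backend(env: Mapping[str, str]) -> str:
    """Resolve STORAGE_BACKEND, defaulting to SQLite as described above."""
    backend = env.get("STORAGE_BACKEND", "sqlite").lower()
    if backend not in {"sqlite", "postgres"}:
        raise ValueError(f"unsupported STORAGE_BACKEND: {backend!r}")
    return backend
```

Because protocols are structural, a substrate implementation would not need to inherit from them; it only has to provide the methods.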

The existing `Database`, `VectorStore`, and `FTSStore` classes keep their public read methods. The new protocols are scoped under `librarian.storage.protocols`. `Database` mutator methods (`insert_*`, `update_*`, `delete_*`) emit `DeprecationWarning` and route through `Orchestrator` for the v0.14 lifetime (slated for removal in v0.15). `Database.connection()` is removed (the substrate is now abstracted; no public raw-SQLite escape hatch).
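The deprecate-and-route pattern for mutators might look roughly like this. The `Orchestrator.upsert_document` method, the constructor signatures, and the warning text are hypothetical placeholders; only the `DeprecationWarning` plus delegation behaviour comes from the description above.

```python
import warnings


class Orchestrator:
    """Hypothetical stand-in that only records what it is asked to write."""

    def __init__(self) -> None:
        self.written: list[tuple[str, str]] = []

    def upsert_document(self, document_id: str, content: str) -> None:
        self.written.append((document_id, content))


class Database:
    """Sketch of a legacy facade whose mutators delegate to the orchestrator."""

    def __init__(self, orchestrator: Orchestrator) -> None:
        self._orchestrator = orchestrator

    def insert_document(self, document_id: str, content: str) -> None:
        warnings.warn(
            "Database.insert_document() is deprecated; use Orchestrator instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not this shim
        )
        self._orchestrator.upsert_document(document_id, content)
```

Integrators can surface these warnings early by running their test suite with `python -W error::DeprecationWarning`.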

### 4. Built-in `LocalFileConnector`

Reimplements the current `libr add`/file-indexing workflow as a connector. Walks a path, emits `DocumentUpsert` events for supported files (markdown, code, PDF, image), uses the existing parser registry to do the actual parsing.

`IndexingService` survives as a thin shim over `Orchestrator` + a one-shot `LocalFileConnector`. Methods emit `DeprecationWarning` directing users to `Orchestrator` for new code. A follow-up issue (see "Follow-ups" in Progress above) tracks adding an efficient batched file-mode driver before `IndexingService` is fully removed.

### 5. Deterministic `chunk_id` and `document_id`

```
chunk_id    = hash(connector_name, source_type, source_native_id_per_chunk)
document_id = hash(connector_name, source_type, source_native_id_per_document)
```

Stable across runs. Required for idempotent upsert on edited content (e.g., a Slack message that gets edited, a file that gets rewritten) — no "find then update" round trip needed.

This is the cause of the `chunk_id` breaking change for Python/MCP integrators: the `chunk_id` returned by `search_library` and related APIs is now a string hash instead of an autoincrement integer.
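As an illustration of the ID scheme, the sketch below derives a stable string ID with SHA-256. The hash function, the length-prefixed encoding, and the `stable_id` name are assumptions; the issue only specifies which three inputs feed the hash.

```python
import hashlib


def stable_id(connector_name: str, source_type: str, native_id: str) -> str:
    """Derive a deterministic hex ID from the three identifying parts."""
    digest = hashlib.sha256()
    for part in (connector_name, source_type, native_id):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Because the ID depends only on source-side identifiers, re-ingesting an edited message or rewritten file produces the same ID and can be written as a plain upsert.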

### 6. New schema columns and `sync_state` table

The `chunks` table gains `chunk_index`, `document_size`, `source_created_at`, `deleted_at`, `deletion_reason`, `document_source_uri`, `chunk_source_uri`. The `chunk_embeddings` table gains `model_version`. A new `sync_state` table tracks per-source ingest state (cursor as opaque JSON, status, last-success timestamps, error info, counters, `config_version`).

These power deep citation (pointing agents at the specific Slack message, line of code, or comment, not just at the document), soft-delete tombstoning (source-side deletes don't drop content from the corpus by default; that's a connector decision), and `expand_context` (below).
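A small SQLite sketch of soft-delete tombstoning follows. The column names come from the list above, but the column types, the trimmed-down table, and the `soft_delete` / `visible_chunk_ids` helpers are assumptions made for illustration.

```python
import sqlite3

# Reduced, hypothetical version of the v0.14 `chunks` table; types are assumed.
SCHEMA = """
CREATE TABLE chunks (
    chunk_id         TEXT PRIMARY KEY,
    document_id      TEXT NOT NULL,
    chunk_index      INTEGER NOT NULL,
    content          TEXT NOT NULL,
    chunk_source_uri TEXT,
    deleted_at       TEXT,  -- NULL means the chunk is live
    deletion_reason  TEXT
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def soft_delete(conn, document_id: str, reason: str, now: str) -> None:
    """Tombstone a document's live chunks instead of dropping the rows."""
    with conn:
        conn.execute(
            "UPDATE chunks SET deleted_at = ?, deletion_reason = ? "
            "WHERE document_id = ? AND deleted_at IS NULL",
            (now, reason, document_id),
        )


def visible_chunk_ids(conn, include_deleted: bool = False) -> list[str]:
    """List chunk IDs in source order, hiding tombstoned rows by default."""
    sql = "SELECT chunk_id FROM chunks"
    if not include_deleted:
        sql += " WHERE deleted_at IS NULL"
    sql += " ORDER BY document_id, chunk_index"
    return [row[0] for row in conn.execute(sql)]
```

The tombstoned rows remain available for audits or restoration, while default queries behave as if they were gone.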

### 7. New MCP tool: `expand_context`

`expand_context(chunk_id, before=N, after=N)` returns the N chunks before and N after the given chunk in the same document, in source order. For sources that produce fragmentary chunks ("yes, do it"), agents call this to pull in surrounding context only when a result needs it — keeping embeddings clean while making fragmentary content recoverable.
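The windowing logic can be sketched over an in-memory list of chunk rows. The dict-based row shape and the choice to include the target chunk in the result are assumptions; the tool's actual return shape is not specified here.

```python
def expand_context(chunks: list[dict], chunk_id: str, before: int = 1, after: int = 1) -> list[dict]:
    """Return the target chunk plus its neighbours from the same document, in source order.

    Each row is assumed to carry `chunk_id`, `document_id`, and `chunk_index`.
    """
    target = next((c for c in chunks if c["chunk_id"] == chunk_id), None)
    if target is None:
        raise KeyError(chunk_id)
    low = target["chunk_index"] - before
    high = target["chunk_index"] + after
    window = [
        c
        for c in chunks
        if c["document_id"] == target["document_id"] and low <= c["chunk_index"] <= high
    ]
    return sorted(window, key=lambda c: c["chunk_index"])


def sample_chunks() -> list[dict]:
    """Five chunks of document d1 and one unrelated chunk of document d2."""
    rows = [{"chunk_id": f"d1-{i}", "document_id": "d1", "chunk_index": i} for i in range(5)]
    rows.append({"chunk_id": "d2-0", "document_id": "d2", "chunk_index": 0})
    return rows
```

Near the start or end of a document the window is simply truncated, so callers may receive fewer than `before + after + 1` chunks.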

### 8. Vision pipeline — open question

The librarian today has CLIP-style vision/image infrastructure (`ENABLE_VISION_EMBEDDINGS`, `vec_chunks_vision`, image embedder). v0.14 introduces a separate **VLM-text** path: one hosted-VLM call per image produces both a description and transcribed text, which becomes the chunk's text for text-embedding. Gated by `IMAGE_GENERATE_CAPTIONS` / `PDF_OCR_ENABLED`. Failed VLM calls produce a chunk with content `"[image, processing failed]"` and `modality_data.processing_status='failed'`, reprocessable later via `libr reprocess --asset-type image --where processing_status=failed`.
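The failure-handling rule could be sketched as below. The `describe` callable stands in for the hosted-VLM call, and the `"ok"` status value, the `error` key, and both helper names are hypothetical; only the placeholder content string and `processing_status='failed'` are taken from the text above.

```python
from typing import Callable

FAILED_CONTENT = "[image, processing failed]"


def image_to_chunk(image_bytes: bytes, describe: Callable[[bytes], str]) -> dict:
    """Turn an image into a text chunk, degrading to a placeholder if the VLM call fails."""
    try:
        text = describe(image_bytes)
    except Exception as exc:  # any VLM failure becomes a reprocessable placeholder
        return {
            "content": FAILED_CONTENT,
            "modality_data": {"processing_status": "failed", "error": str(exc)},
        }
    return {"content": text, "modality_data": {"processing_status": "ok"}}


def needs_reprocessing(chunk: dict) -> bool:
    """Mirror the `processing_status=failed` filter used when reprocessing."""
    return chunk["modality_data"].get("processing_status") == "failed"
```

Keeping a placeholder chunk means the document still indexes successfully, and the failed assets can be found and retried later without re-ingesting everything.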

**Open: what happens to the existing CLIP path?** Three options:

- **(A) Retire CLIP.** Vision is text-from-VLM only.
- **(B) Keep CLIP as an opt-in alternative** behind a config flag, parallel to the VLM-text path.
- **(C) Hybrid** — e.g., CLIP code retired but the column/table machinery left in place for v1.x re-introduction.

Discussion welcome — this is the most user-facing config call in v0.14. Comment below or open a focused issue.

## What stays the same

For users running today's workflow against v0.14, the user-visible surface is unchanged:

- **CLI**: `libr add <path>`, `libr search "..."`, `libr list`, `libr serve`, `libr config`, `libr docs`, `libr index`, all with the same names and flags. Internally they now route through `LocalFileConnector` + `Orchestrator` + `SQLiteStorage`.
- **Python imports**: `from librarian.storage import Database, VectorStore, FTSStore` and `from librarian.indexing import IndexingService, get_indexing_service` still work. They're shims over the new infrastructure.
- **MCP tools**: `search_library`, `list_sources`, etc. keep their names and parameters. Return shape is enriched additively with the new columns (`chunk_source_uri`, `chunk_index`, etc.).

The only thing existing users have to do on upgrade is one rebuild of the local DB.

## Breaking changes (full list)

1. **DB schema is incompatible.** On startup, v0.14 detects a v0.13 schema in `~/.librarian/index.db`, refuses to run, and prints a loud message asking the user to back up the file and run `libr index --rebuild`. The rebuild command automatically backs up the v0.13 DB to `~/.librarian/index.db.v0-backup` (unless `--no-backup` is passed), wipes the chunks / documents / embeddings / FTS tables, recreates them under the v0.14 schema, and re-ingests configured sources via `LocalFileConnector`. **Source files on disk are the source of truth, so no data is lost.**
2. `Database` **mutator methods** (`insert_document`, `update_document`, `delete_document`, `insert_chunk`, `insert_chunks_batch`, `delete_chunks_by_document`, `clear_all`) emit `DeprecationWarning` and route through `Orchestrator`. Slated for removal in v0.15.
3. `Database.connection()` **removed.** No public raw-SQLite escape hatch.
4. `IndexingService` **deprecated** as a thin shim. `DeprecationWarning` on `index_file()` etc., directing users to `Orchestrator`.
5. `chunk_id` returned by MCP tools and Python APIs is now `str` (deterministic hash) instead of `int`.

Anything not in the list above keeps working.
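For item 1, the detect-and-backup step might be sketched as follows. How v0.14 actually recognises an old schema is not stated in this issue; treating a missing `sync_state` table as the signal is an assumption, as are the `looks_like_v013` and `backup_db` helper names.

```python
import shutil
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path
from typing import Optional


def looks_like_v013(db_path: Path) -> bool:
    # Assumption: the absence of the new `sync_state` table marks a pre-v0.14 schema.
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'sync_state'"
        ).fetchone()
    return row is None


def backup_db(db_path: Path, no_backup: bool = False) -> Optional[Path]:
    """Copy index.db to index.db.v0-backup unless the caller opted out."""
    if no_backup:
        return None
    backup_path = db_path.with_name(db_path.name + ".v0-backup")
    shutil.copy2(db_path, backup_path)
    return backup_path


def demo() -> tuple[bool, str, bool]:
    """Create a throwaway old-style DB, detect it, and back it up."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "index.db"
        with closing(sqlite3.connect(db_path)) as conn:
            conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT)")
            conn.commit()
        old = looks_like_v013(db_path)
        backup = backup_db(db_path)
        return old, backup.name, backup.exists()
```

Taking the backup before any table is wiped keeps the rebuild reversible even if re-ingestion fails partway through.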

## Versioning

Released as **v0.14.0**, not v1.0.0. Semver 0.x → 0.y permits the breaking changes above. A v1.0.0 release will mark the public API stable — that's a separate decision made after v0.14 has settled in real use.

v0.13.x receives no maintenance. v0.13.0 stays available on PyPI for users who want to pin.

## Branching plan

- Long-lived `v0.14-development` branch in this repo.
- Per-implementation feature branches off `v0.14-development`.
- Pre-releases tagged from `v0.14-development`: `v0.14.0a1` during active development, `v0.14.0b1` once all implementation work has merged, `v0.14.0` final.
- Final merge to `main` only when `v0.14.0` ships.

## What feedback would help

- **Vision pipeline strategy** (A / B / C above): which one, and why?
- **`Database.connection()` removal**: is anyone relying on raw `sqlite3.Connection` access? If yes, please describe the use case — there might be a substrate-agnostic alternative we should expose instead.
- **Schema rebuild on upgrade**: is anyone holding content in agent-library DBs that *isn't* recoverable from source files on disk? If so, please flag — it changes the upgrade-guidance calculus.
- **`Connector` ABC shape**: anyone planning to implement a connector for their own source, please review the contract sketched above and call out anything awkward.

This issue stays open through v0.14 development; reply with comments or split off focused issues as needed.

