v0.4 follow-up: replace token-cosine with real bge-m3 embedding for PPA retrieval #9

@MakiDevelop

Description

Background

v0.4 PPA `_stage2_retrieve()` currently scores candidates with a placeholder cosine similarity over token-count vectors. The `persona_triples` schema already reserves an `embedding BLOB` column for real embeddings.

What's needed

  • Lazily compute bge-m3 1024-dim embeddings for new triples on insert
  • Store as BLOB in the existing column
  • Update `_stage2_retrieve()` to use the BLOB embeddings + numpy cosine
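
The BLOB + numpy-cosine path could look roughly like the sketch below. The float32 serialization format and the `retrieve()` helper (standing in for the scoring loop inside `_stage2_retrieve()`) are assumptions for illustration, not the actual implementation:

```python
import numpy as np

DIM = 1024  # bge-m3 dense embedding size

def to_blob(vec):
    """Serialize a float32 embedding for the reserved embedding BLOB column."""
    return np.asarray(vec, dtype=np.float32).tobytes()

def from_blob(blob):
    """Deserialize a BLOB back into a 1024-dim float32 vector."""
    return np.frombuffer(blob, dtype=np.float32)

def cosine(a, b):
    """Plain numpy cosine similarity."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def retrieve(query_vec, rows, threshold=0.2):
    """rows: iterable of (triple_id, blob). Returns (id, score) pairs
    above the threshold, best match first."""
    scored = [(tid, cosine(query_vec, from_blob(blob))) for tid, blob in rows]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])
```

Storing raw float32 bytes keeps each row at a fixed 4 KiB (1024 × 4 bytes) and makes deserialization a zero-copy `np.frombuffer`, which is cheap enough for a full-scan retrieval over a personal-scale triple store.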

Why this matters

Token-count cosine is good enough to demo PPA but won't match the paper's reported C-Score gains. Real embeddings should close most of that gap.

Acceptance criteria

  • Embedding generation utility (local model preferred; HuggingFace inference API as fallback)
  • Migration script to backfill embeddings for existing rows
  • `_stage2_retrieve()` reads BLOB, uses numpy cosine
  • Threshold (default 0.2) re-calibrated for bge-m3 cosine distribution
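
The backfill migration could be as small as the sketch below. `fake_embed` is a deterministic stand-in for the real bge-m3 encoder, and the `triple_text` column name is hypothetical (the real script would build the text from the actual triple columns):

```python
import sqlite3
import numpy as np

DIM = 1024  # bge-m3 dense embedding size

def fake_embed(text):
    """Stand-in for the real bge-m3 encoder (sketch only):
    a deterministic, normalized pseudo-embedding derived from the text bytes."""
    seed = int(np.frombuffer(text.encode("utf-8"), dtype=np.uint8).sum())
    vec = np.random.default_rng(seed).standard_normal(DIM).astype(np.float32)
    return vec / np.linalg.norm(vec)

def backfill_embeddings(conn, embed=fake_embed, text_column="triple_text"):
    """Fill the reserved embedding BLOB for rows that don't have one yet.
    Returns the number of rows updated; safe to re-run (idempotent)."""
    rows = conn.execute(
        f"SELECT rowid, {text_column} FROM persona_triples WHERE embedding IS NULL"
    ).fetchall()
    for rowid, text in rows:
        conn.execute(
            "UPDATE persona_triples SET embedding = ? WHERE rowid = ?",
            (embed(text).tobytes(), rowid),
        )
    conn.commit()
    return len(rows)
```

Filtering on `embedding IS NULL` makes the script idempotent, so an interrupted backfill can simply be re-run without re-encoding rows that already have embeddings.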

Related

This becomes moot if #5 (Mem0 integration) lands first — Mem0 handles embedding + retrieval. If we go straight to Mem0, close this issue.

Estimated effort

~1 day if we keep the SQLite path; ~0 if we jump to Mem0.

Metadata

Labels: enhancement (New feature or request)