Chunking + embedding pipeline — temporal sliding window → pgvector #4

@Punkte

What to build

After upload, automatically split the conversation into overlapping temporal chunks and generate embeddings for each via OpenAI text-embedding-3-small, storing vectors in pgvector. This pipeline runs as part of the post-upload processing flow.
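A minimal sketch of how the embedding step could be structured, under stated assumptions. `toBatches`, `embedChunks`, `EmbedBatch`, and `EmbeddedChunk` are hypothetical names, not part of this issue. The actual call to OpenAI is injected as a function, so the sketch makes no claim about the client library's API. The batch size of 100 comes from the acceptance criteria.

```typescript
// Hypothetical signature: embeds a batch of texts and returns one vector per
// text, in the same order. In the real pipeline this would wrap a request to
// the text-embedding-3-small model.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Batch size taken from the acceptance criteria.
const EMBED_BATCH_SIZE = 100;

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed every chunk text, one batch at a time, pairing each text with its vector.
async function embedChunks(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize: number = EMBED_BATCH_SIZE
): Promise<EmbeddedChunk[]> {
  const embedded: EmbeddedChunk[] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const vectors = await embedBatch(batch);
    if (vectors.length !== batch.length) {
      throw new Error("embedding count does not match batch size");
    }
    batch.forEach((text, i) => {
      const embedding = vectors[i];
      if (embedding === undefined) {
        throw new Error("missing embedding for chunk text");
      }
      embedded.push({ text, embedding });
    });
  }
  return embedded;
}
```

Injecting `embedBatch` keeps the batching logic testable without network access; the rows returned by `embedChunks` could then be written to the chunks table before the conversation status is set to 'ready'.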

Acceptance criteria

  • chunkConversation() splits messages using temporal sliding window (max 30 messages, 60-min gap cut, 5-message overlap)
  • Each chunk formatted as readable text with timestamps and sender names
  • Embeddings generated via text-embedding-3-small in batches of 100
  • Chunks + embeddings stored in chunks table with metadata (date range, participants)
  • Conversation status updated to 'ready' after pipeline completes
  • Smoke test: query "do you remember the holidays?" returns relevant chunks
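The chunking criteria above could be sketched roughly as follows. This is one possible reading, not a settled design: it assumes a gap of more than 60 minutes between consecutive messages starts a new session, that windows slide within a session in steps of 25 so neighbouring chunks share 5 messages, and that the overlap never crosses a gap cut. The `Message` and `Chunk` shapes and the `formatChunk` helper are hypothetical.

```typescript
// Hypothetical message and chunk shapes; the real schema may differ.
interface Message {
  sender: string;
  timestamp: Date;
  text: string;
}

interface Chunk {
  messages: Message[];
  text: string;           // readable text with timestamps and sender names
  startTime: Date;        // date range metadata
  endTime: Date;
  participants: string[]; // distinct senders in this chunk
}

const MAX_MESSAGES = 30;       // max messages per chunk
const GAP_MS = 60 * 60 * 1000; // 60-minute gap cut
const OVERLAP = 5;             // messages shared by neighbouring chunks

// Render messages as "[ISO timestamp] sender: text", one per line.
function formatChunk(messages: Message[]): string {
  return messages
    .map((m) => `[${m.timestamp.toISOString()}] ${m.sender}: ${m.text}`)
    .join("\n");
}

function chunkConversation(messages: Message[]): Chunk[] {
  const sorted = [...messages].sort(
    (a, b) => a.timestamp.getTime() - b.timestamp.getTime()
  );

  // 1. Cut the conversation into sessions wherever the gap exceeds GAP_MS.
  const sessions: Message[][] = [];
  let current: Message[] = [];
  for (const msg of sorted) {
    const prev = current[current.length - 1];
    if (
      prev !== undefined &&
      msg.timestamp.getTime() - prev.timestamp.getTime() > GAP_MS
    ) {
      sessions.push(current);
      current = [];
    }
    current.push(msg);
  }
  if (current.length > 0) sessions.push(current);

  // 2. Slide a window of MAX_MESSAGES over each session, stepping by
  //    MAX_MESSAGES - OVERLAP so consecutive windows share OVERLAP messages.
  const step = MAX_MESSAGES - OVERLAP;
  const chunks: Chunk[] = [];
  for (const session of sessions) {
    for (let start = 0; start < session.length; start += step) {
      const slice = session.slice(start, start + MAX_MESSAGES);
      const first = slice[0];
      const last = slice[slice.length - 1];
      if (first === undefined || last === undefined) continue;
      chunks.push({
        messages: slice,
        text: formatChunk(slice),
        startTime: first.timestamp,
        endTime: last.timestamp,
        participants: Array.from(new Set(slice.map((m) => m.sender))),
      });
      // Stop once this window has reached the end of the session.
      if (start + MAX_MESSAGES >= session.length) break;
    }
  }
  return chunks;
}
```

Under these assumptions, a session of 60 closely spaced messages yields three chunks covering messages 0 to 29, 25 to 54, and 50 to 59. Whether the overlap should instead carry across a gap cut is a design choice this issue leaves open.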

Labels: needs-triage (Maintainer needs to evaluate this issue)
