Add SummaryMemory backend — rolling LLM-generated compression #3

@Neal006

What is SummaryMemory?

A memory backend that uses an LLM call to compress older conversation turns into a rolling summary at a fixed interval, then keeps only that summary plus the most recent messages.

Strategy:
```
Keep last K messages verbatim + one running summary of everything before
```

This is a widely used pattern in chat assistants, but this project has not yet benchmarked it against the RAG or Cascading backends.

Why benchmark it?

The hypothesis: a rolling LLM summary preserves semantic meaning better than extractive compression (what Cascading's Cold tier does today) but costs more per compression call. The benchmark will quantify the recall/cost trade-off.

Implementation guide

Create `memory/summary.py`:

```python
from typing import List, Dict
from .base import BaseMemory
from utils.llm import chat  # intended for the LLM call inside _compress


class SummaryMemory(BaseMemory):
    name = "summary"

    def __init__(self, window_size: int = 20, summarise_every: int = 10):
        self.window_size = window_size          # max recent messages returned verbatim
        self.summarise_every = summarise_every  # compress after this many add_message calls
        self.recent: List[Dict] = []
        self.summary: str = ""
        self.turn_count: int = 0                # counts add_message calls, not dialogue turns

    def add_message(self, role: str, content: str, turn: int) -> None:
        self.recent.append({"role": role, "content": content, "turn": turn})
        self.turn_count += 1
        if self.turn_count % self.summarise_every == 0:
            self._compress()

    def _compress(self) -> None:
        # Summarise oldest half of recent into self.summary
        ...

    def get_context(self, query: str, current_turn: int) -> List[Dict]:
        context = []
        if self.summary:
            # The running summary goes first, as a system message
            context.append({"role": "system", "content": f"[Summary] {self.summary}"})
        # Then the most recent messages, without the internal "turn" field
        context.extend({"role": m["role"], "content": m["content"]} for m in self.recent[-self.window_size:])
        return context

    def reset(self) -> None:
        self.recent = []
        self.summary = ""
        self.turn_count = 0
```
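The `_compress` step is left open in the stub above. One possible shape for it is sketched below as a standalone function, so it can be read and tested without the rest of the repo. The `summarise` callable is a hypothetical stand-in for `utils.llm.chat`, whose signature is not shown in this issue, and the prompt wording is only an assumption.

```python
from typing import Callable, Dict, List, Tuple


def compress_oldest_half(
    recent: List[Dict],
    summary: str,
    summarise: Callable[[str], str],  # hypothetical: takes a prompt, returns text
) -> Tuple[List[Dict], str]:
    """Fold the oldest half of `recent` into `summary`.

    Returns (remaining_messages, updated_summary).
    """
    cut = len(recent) // 2
    if cut == 0:
        # Nothing old enough to compress yet
        return recent, summary

    old, remaining = recent[:cut], recent[cut:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)

    # Assumed prompt: ask the model to merge the old summary with the new messages
    prompt = (
        "Update the running summary of a conversation.\n"
        f"Current summary:\n{summary or '(none)'}\n\n"
        f"New messages:\n{transcript}\n\n"
        "Return the updated summary only."
    )
    return remaining, summarise(prompt).strip()
```

Inside `SummaryMemory._compress`, this would amount to `self.recent, self.summary = compress_oldest_half(self.recent, self.summary, ...)`, with the last argument wrapping whatever call `utils.llm.chat` actually exposes. Passing the summariser in as a callable also makes it easy to swap in a non-LLM fallback.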

Requirements

  • Requires `GROQ_API_KEY` for compression calls (graceful degradation when not set)
  • Add a `--no-llm-compress` flag that uses extractive fallback instead
  • Register in `evaluation/benchmark.py` under the name `"summary"`
  • Add one test in `tests/test_pipeline.py`
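The first two requirements could be met along these lines. This is a minimal sketch, not the project's existing Cold-tier logic: `extractive_summary` and `use_llm_compression` are hypothetical helper names, and "first sentence of each message" is just one simple extractive strategy chosen for illustration.

```python
import os
import re
from typing import Dict, List, Mapping, Optional


def extractive_summary(messages: List[Dict], max_chars: int = 400) -> str:
    """LLM-free fallback: keep the first sentence of each message, then truncate."""
    parts = []
    for m in messages:
        # Split once after the first sentence-ending punctuation mark
        first = re.split(r"(?<=[.!?])\s+", m["content"].strip(), maxsplit=1)[0]
        parts.append(f"{m['role']}: {first}")
    return " | ".join(parts)[:max_chars]


def use_llm_compression(
    no_llm_compress: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Use the LLM only if the flag is off and GROQ_API_KEY is set and non-empty."""
    env = os.environ if env is None else env
    return not no_llm_compress and bool(env.get("GROQ_API_KEY"))
```

With this split, a missing key and an explicit `--no-llm-compress` both take the same extractive path, so the backend never raises just because no API key is configured.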

Acceptance criteria

  • `python main.py --backends summary naive rag` runs without errors
  • Benchmark produces recall/token numbers comparable to the table in the README
  • Works in `--no-llm-compress` mode without an API key
  • Test added and passing
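For the command-line side of these criteria, the sketch below shows how the two options could be parsed with `argparse`. The actual parser in `main.py` is not shown in this issue, so the defaults and help strings here are assumptions. Only the flag names come from the text above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Memory backend benchmark (sketch)")
    # nargs="+" accepts one or more space-separated backend names
    parser.add_argument(
        "--backends",
        nargs="+",
        default=["naive"],  # assumed default, not taken from the repo
        help="Backends to benchmark, e.g. summary naive rag",
    )
    # store_true: the flag is False unless passed on the command line
    parser.add_argument(
        "--no-llm-compress",
        action="store_true",
        help="Use the extractive fallback instead of LLM compression",
    )
    return parser


# Equivalent to: python main.py --backends summary naive rag --no-llm-compress
args = build_parser().parse_args(
    ["--backends", "summary", "naive", "rag", "--no-llm-compress"]
)
```

`argparse` converts the dashes in `--no-llm-compress` to underscores, so the value is read as `args.no_llm_compress`.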

Estimated effort

3–4 hours. Requires Groq API key for full testing.

Labels

  • good first issue: Perfect starting point for new contributors
  • new-backend: Proposal or implementation of a new memory backend