What is SummaryMemory?
A memory backend that compresses the conversation into a rolling summary every N turns using an LLM call, then keeps only the summary plus the most recent messages.
Strategy:
```
Keep last K messages verbatim + one running summary of everything before
```
This is reportedly similar to how ChatGPT's memory works under the hood. It is a common pattern, but it has not yet been benchmarked in this project against the RAG or Cascading approaches.
Why benchmark it?
The hypothesis: a rolling LLM summary preserves semantic meaning better than extractive compression (what Cascading's Cold tier does today) but costs more per compression call. The benchmark will quantify the recall/cost trade-off.
Implementation guide
Create `memory/summary.py`:
```python
from typing import List, Dict

from .base import BaseMemory
from utils.llm import chat


class SummaryMemory(BaseMemory):
    name = "summary"

    def __init__(self, window_size: int = 20, summarise_every: int = 10):
        self.window_size = window_size          # max recent messages returned verbatim
        self.summarise_every = summarise_every  # compress after this many added messages
        self.recent: List[Dict] = []
        self.summary: str = ""
        self.turn_count: int = 0

    def add_message(self, role: str, content: str, turn: int) -> None:
        self.recent.append({"role": role, "content": content, "turn": turn})
        self.turn_count += 1
        if self.turn_count % self.summarise_every == 0:
            self._compress()

    def _compress(self) -> None:
        # Summarise oldest half of recent into self.summary
        ...

    def get_context(self, query: str, current_turn: int) -> List[Dict]:
        context = []
        if self.summary:
            context.append({"role": "system", "content": f"[Summary] {self.summary}"})
        context.extend(
            {"role": m["role"], "content": m["content"]}
            for m in self.recent[-self.window_size:]
        )
        return context

    def reset(self) -> None:
        self.recent = []
        self.summary = ""
        self.turn_count = 0
```
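The `_compress` step is left open in the skeleton. One possible shape for it, as a minimal sketch: the helper name `compress_oldest_half`, the prompt wording, and the `summarise` callable are all assumptions for illustration. `summarise` stands in for `utils.llm.chat`, whose signature is not shown in this issue, so the sketch takes any function that maps a prompt string to a reply string.

```python
from typing import Callable, Dict, List, Tuple


def compress_oldest_half(
    recent: List[Dict],
    summary: str,
    summarise: Callable[[str], str],  # assumed stand-in for utils.llm.chat
) -> Tuple[List[Dict], str]:
    """Fold the oldest half of `recent` into `summary`.

    Returns (remaining_messages, new_summary).
    """
    cut = len(recent) // 2
    if cut == 0:
        # Fewer than two messages: nothing old enough to compress.
        return recent, summary
    old, remaining = recent[:cut], recent[cut:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    # Hypothetical prompt; the real wording is up to the implementer.
    prompt = (
        "Update the running summary of a conversation.\n"
        f"Current summary:\n{summary or '(empty)'}\n\n"
        f"New messages:\n{transcript}\n\n"
        "Return the updated summary only."
    )
    return remaining, summarise(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Offline stand-in for the LLM call, used only for this example."""
    return f"summary of {prompt.count('user:')} user message(s)"


msgs = [{"role": "user", "content": f"m{i}", "turn": i} for i in range(4)]
remaining, new_summary = compress_oldest_half(msgs, "", fake_llm)
```

Inside `SummaryMemory._compress`, the result would be assigned back to `self.recent` and `self.summary`. Passing the summariser in as a callable keeps the compression logic testable without a network call.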
Requirements
- Requires `GROQ_API_KEY` for compression calls (graceful degradation when not set)
- Add a `--no-llm-compress` flag that uses extractive fallback instead
- Register in `evaluation/benchmark.py` under the name `"summary"`
- Add one test in `tests/test_pipeline.py`
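The first two requirements could be wired together roughly as follows. This is a sketch under assumptions: `extractive_summary` and `use_llm_compression` are hypothetical helpers, and the first-sentence heuristic is only one simple form of extractive fallback, not necessarily what Cascading's Cold tier does. Only the `--no-llm-compress` flag and the `GROQ_API_KEY` variable come from the requirements.

```python
import argparse
import os
from typing import Dict, List, Mapping


def extractive_summary(messages: List[Dict], max_chars: int = 80) -> str:
    """Hypothetical fallback: keep the first sentence of each message."""
    parts = []
    for m in messages:
        first = m["content"].split(". ")[0].strip()
        parts.append(f"{m['role']}: {first[:max_chars]}")
    return " | ".join(parts)


def use_llm_compression(
    no_llm_compress: bool,
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Use the LLM only if the flag allows it and GROQ_API_KEY is set."""
    return not no_llm_compress and bool(env.get("GROQ_API_KEY"))


parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-llm-compress",
    action="store_true",
    help="use extractive fallback instead of an LLM call",
)
# Explicit argument list so the example does not depend on sys.argv.
args = parser.parse_args(["--no-llm-compress"])
```

With this split, a missing key and an explicit `--no-llm-compress` both lead to the same extractive path, which is one way to get the graceful degradation the requirements ask for.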
Acceptance criteria
- `python main.py --backends summary naive rag` runs without errors
- `--no-llm-compress` mode runs without an API key
Estimated effort
3–4 hours. Requires Groq API key for full testing.