Add SummaryMemory backend — rolling LLM-generated compression

## What is SummaryMemory?

A memory backend that keeps the most recent messages verbatim and, every `summarise_every` messages, uses an LLM call to fold everything older into a single rolling summary. Only the summary and the recent messages are sent as context.

**Strategy:**
```
Keep last K messages verbatim  +  one running summary of everything before
```

This is reportedly similar to how ChatGPT's memory works. It is a common pattern, but it has not yet been benchmarked against the RAG or cascading approaches in this project.

## Why benchmark it?

The hypothesis: a rolling LLM summary preserves *semantic meaning* better than extractive compression (what Cascading's Cold tier does today) but costs more per compression call. The benchmark will quantify the recall/cost trade-off.
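To make the trade-off concrete, here is a minimal sketch of how recall and context cost could be measured for one query. It is not the project's benchmark code: `fact_recall` and `approx_tokens` are hypothetical helpers, recall is approximated by substring matching, and tokens are approximated by whitespace-separated words.

```python
from typing import Dict, List


def fact_recall(context: List[Dict[str, str]], expected_facts: List[str]) -> float:
    """Fraction of expected fact strings found (case-insensitively) in the context."""
    if not expected_facts:
        return 1.0
    text = " ".join(m["content"] for m in context).lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in text)
    return hits / len(expected_facts)


def approx_tokens(context: List[Dict[str, str]]) -> int:
    """Rough cost proxy: word count. A real tokenizer would give different numbers."""
    return sum(len(m["content"].split()) for m in context)


context = [
    {"role": "system", "content": "[Summary] User lives in Oslo."},
    {"role": "user", "content": "My dog is called Rex"},
]
print(fact_recall(context, ["Oslo", "Rex", "Tuesday"]))  # two of three facts found
print(approx_tokens(context))  # → 10
```

A backend that summarises well should keep recall high while the token count stays roughly flat as the conversation grows.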

## Implementation guide

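Create `memory/summary.py`:

```python
from typing import Dict, List

from .base import BaseMemory
from utils.llm import chat


class SummaryMemory(BaseMemory):
    """Rolling summary of older messages plus a verbatim window of recent ones."""

    name = "summary"

    def __init__(self, window_size: int = 20, summarise_every: int = 10):
        self.window_size = window_size          # messages kept verbatim
        self.summarise_every = summarise_every  # compress after this many new messages
        self.recent: List[Dict] = []
        self.summary: str = ""
        self.turn_count: int = 0                # incremented once per message

    def add_message(self, role: str, content: str, turn: int) -> None:
        self.recent.append({"role": role, "content": content, "turn": turn})
        self.turn_count += 1
        if self.turn_count % self.summarise_every == 0:
            self._compress()

    def _compress(self) -> None:
        # Fold everything older than the verbatim window into self.summary,
        # so the last `window_size` messages are never summarised away.
        overflow = self.recent[:-self.window_size]
        if not overflow:
            return
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in overflow)
        prompt = (
            "Update the running summary of a conversation with the new messages. "
            "Keep names, numbers, and decisions.\n\n"
            f"Current summary:\n{self.summary or '(none)'}\n\n"
            f"New messages:\n{transcript}"
        )
        # Assumption: chat() accepts a list of role/content dicts and returns
        # the reply text. Adjust this call to the real utils.llm.chat signature.
        self.summary = chat([{"role": "user", "content": prompt}])
        self.recent = self.recent[-self.window_size:]

    def get_context(self, query: str, current_turn: int) -> List[Dict]:
        # `query` and `current_turn` are unused: this backend does no retrieval.
        context = []
        if self.summary:
            context.append({"role": "system", "content": f"[Summary] {self.summary}"})
        context.extend(
            {"role": m["role"], "content": m["content"]}
            for m in self.recent[-self.window_size:]
        )
        return context

    def reset(self) -> None:
        self.recent = []
        self.summary = ""
        self.turn_count = 0
```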
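The rolling behaviour can be checked without an API key by treating the summariser as a plain function argument. The sketch below is a standalone illustration, not part of the backend: `roll_summary`, `fake_summarise`, and `demo` are hypothetical names, and the fake summariser just concatenates message contents where a real one would call the LLM.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
Summariser = Callable[[str, List[Message]], str]


def roll_summary(
    summary: str, recent: List[Message], window_size: int, summarise: Summariser
) -> Tuple[str, List[Message]]:
    """Fold messages older than the window into the summary; keep the rest verbatim."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    overflow, kept = recent[:-window_size], recent[-window_size:]
    if not overflow:
        return summary, kept
    return summarise(summary, overflow), kept


def fake_summarise(previous: str, overflow: List[Message]) -> str:
    """Stand-in for the LLM call: append the overflow contents to the old summary."""
    parts = ([previous] if previous else []) + [m["content"] for m in overflow]
    return "; ".join(parts)


def demo() -> Tuple[str, List[str]]:
    summary: str = ""
    recent: List[Message] = []
    for i in range(1, 8):
        recent.append({"role": "user", "content": f"m{i}"})
        if i % 3 == 0:  # compress every 3 messages, keep a window of 2
            summary, recent = roll_summary(summary, recent, 2, fake_summarise)
    return summary, [m["content"] for m in recent]


print(demo())  # → ('m1; m2; m3; m4', ['m5', 'm6', 'm7'])
```

Between compressions the verbatim list can temporarily hold more than `window_size` messages, which is why the context is still sliced to the window when it is read.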

## Requirements

- Requires `GROQ_API_KEY` for compression calls (graceful degradation when not set)
- Add a `--no-llm-compress` flag that uses extractive fallback instead
- Register in `evaluation/benchmark.py` under the name `"summary"`
- Add one test in `tests/test_pipeline.py`
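One way the extractive fallback and the API-key check could look is sketched below. Both functions are hypothetical: `extractive_summary` keeps only the first sentence of each message (a deliberately naive choice), and `llm_available` assumes the flag arrives as a boolean named `no_llm_compress`.

```python
import os
from typing import Dict, List


def extractive_summary(messages: List[Dict[str, str]], max_chars: int = 80) -> str:
    """Naive extractive compression: first sentence of each message, truncated."""
    lines = []
    for m in messages:
        first_sentence = m["content"].strip().split(". ")[0]
        lines.append(f"{m['role']}: {first_sentence[:max_chars]}")
    return "\n".join(lines)


def llm_available(no_llm_compress: bool = False) -> bool:
    """Use the LLM only if the flag allows it and GROQ_API_KEY is set and non-empty."""
    return not no_llm_compress and bool(os.environ.get("GROQ_API_KEY"))


messages = [{"role": "user", "content": "I moved to Oslo. It is cold."}]
print(extractive_summary(messages))  # → user: I moved to Oslo
```

With a check like `llm_available`, the compression step can pick the LLM path or the extractive path at call time instead of raising when the key is missing.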

## Acceptance criteria

- [ ] `python main.py --backends summary naive rag` runs without errors
- [ ] Benchmark produces recall/token numbers comparable to the table in README
- [ ] Works in `--no-llm-compress` mode without an API key
- [ ] Test added and passing
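For the command-line side of these criteria, a minimal `argparse` sketch follows. How `main.py` actually builds its parser is not shown in this issue, so `build_parser` and the default for `--backends` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Memory backend benchmark (sketch)")
    # One or more backend names, e.g. --backends summary naive rag
    parser.add_argument("--backends", nargs="+", default=["summary"])
    # Stored as args.no_llm_compress; False unless the flag is given
    parser.add_argument(
        "--no-llm-compress",
        action="store_true",
        help="use the extractive fallback instead of LLM compression",
    )
    return parser


args = build_parser().parse_args(["--backends", "summary", "naive", "rag"])
print(args.backends, args.no_llm_compress)  # → ['summary', 'naive', 'rag'] False
```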

## Estimated effort

3–4 hours. Requires Groq API key for full testing.
