Measure how meaning decays across a chain of LLM agents and evaluate error-correction strategies that preserve fidelity.
```
FinalProject/
├── main.py                 # CLI entry point
├── config.py               # Configuration dataclasses
├── requirements.txt        # Python dependencies
├── Knowledge/
│   └── instructions.md     # Experiment design document
├── signal_relay/
│   ├── __init__.py
│   ├── schema.py           # Message, HopRecord data models
│   ├── prompts.py          # Relay prompt templates
│   ├── relay.py            # RelayChain engine (LLM calls)
│   ├── metrics.py          # Scoring & fidelity metrics
│   ├── experiment.py       # Experiment runner & batch matrix
│   ├── tasks.py            # Pre-built baseline task messages
│   └── visualize.py        # Decay curves & comparison plots
└── experiments/            # Auto-created output directory
    └── <run_id>/
        ├── original.yaml
        ├── hop_01.yaml … hop_NN.yaml
        ├── metrics.csv
        ├── decay_curve.png
        └── run_meta.json
```
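The per-hop YAML files record the message state after each relay step. A minimal sketch of what the data models in `signal_relay/schema.py` might look like (field names are assumptions inferred from the outputs listed above, not the actual API):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the schema.py data models; field names are
# assumptions, not the project's actual definitions.
@dataclass
class Message:
    text: str
    constraints: list[str] = field(default_factory=list)  # hard requirements to preserve
    keywords: list[str] = field(default_factory=list)     # checksum keywords

@dataclass
class HopRecord:
    hop: int          # 1-based hop index in the relay chain
    message: Message  # message as rewritten at this hop
    fidelity: float   # overall fidelity vs. the original message
```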
```
pip install -r requirements.txt
```

```
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
```

List the pre-built tasks:

```
python main.py tasks
```

```
# Baseline relay, 5 hops, solar flare task
python main.py run --task solar_flare --mode baseline --hops 5

# Error-corrected relay, 7 hops
python main.py run --task solar_flare --mode error_corrected --hops 7

# With periodic repair prompts every 3 hops
python main.py run --task recipe --mode error_corrected --hops 10 --repair
```

```
# All tasks × both modes × depths 3,5,7,10
python main.py matrix

# Custom subset
python main.py matrix --tasks solar_flare,recipe --modes baseline,error_corrected --depths 3,5,7
```

```
# Plot a single run
python main.py plot --run-dir experiments/<run_id>

# Compare multiple runs
python main.py compare --run-dirs experiments/run1,experiments/run2 --labels "Baseline,Error-Corrected"
```

| Flag | Default | Description |
|---|---|---|
| `--provider` | `openai` | LLM provider (`openai`, `anthropic`) |
| `--model` | `gpt-4o-mini` | Model name |
| `--temperature` | `0.0` | Sampling temperature (0 = deterministic) |
| `--seed` | `42` | Random seed for reproducibility |
| `--output-dir` | `experiments` | Base output directory |
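These flags likely map onto the dataclasses in `config.py`. A minimal sketch under that assumption (field names mirror the flags above; the actual class name and layout may differ):

```python
from dataclasses import dataclass

# Hypothetical configuration dataclass mirroring the CLI flags above;
# the real config.py may be structured differently.
@dataclass
class RelayConfig:
    provider: str = "openai"        # "openai" or "anthropic"
    model: str = "gpt-4o-mini"
    temperature: float = 0.0        # 0 = deterministic sampling
    seed: int = 42                  # random seed for reproducibility
    output_dir: str = "experiments" # base output directory
```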
Each hop is scored against the original message:
| Metric | Weight | Description |
|---|---|---|
| Constraint Fidelity | 0.4 | % of constraints preserved exactly |
| Keyword Retention | 0.3 | % of checksum keywords still present |
| Item Retention | 0.2 | % of content items retained (fuzzy match ≥ 0.7) |
| Order Preservation | 0.1 | Longest common subsequence of item IDs |
| Overall Fidelity | — | Weighted aggregate of the above |
Additional tracked metrics: edit distance ratio, hallucination count, number retention.
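A minimal sketch of how the weighted aggregate and the LCS-based order score could be computed, using the weights from the table above (function names are illustrative, not the actual `metrics.py` API):

```python
# Component weights from the metrics table above.
WEIGHTS = {
    "constraint_fidelity": 0.4,
    "keyword_retention": 0.3,
    "item_retention": 0.2,
    "order_preservation": 0.1,
}

def overall_fidelity(scores: dict) -> float:
    """Weighted aggregate of the four component scores (each in [0, 1])."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def order_preservation(original_ids: list, relayed_ids: list) -> float:
    """Longest common subsequence of item IDs, normalized by original length."""
    m, n = len(original_ids), len(relayed_ids)
    # Classic O(m*n) LCS dynamic program.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if original_ids[i] == relayed_ids[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, 1)
```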
- Baseline — Simple "rewrite for the next agent" instruction
- Error-Corrected — Strict fidelity rules + self-check verification
- Repair (optional) — Periodic drift-correction prompt every N hops
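To make the three modes concrete, here is an illustrative sketch of what the templates in `signal_relay/prompts.py` might look like; the wording is invented for illustration, not the project's actual prompts:

```python
# Illustrative prompt templates for the three relay modes.
# The actual templates in signal_relay/prompts.py may differ.
BASELINE_PROMPT = (
    "Rewrite the following message in your own words for the next agent:\n\n"
    "{message}"
)

ERROR_CORRECTED_PROMPT = (
    "Relay the following message to the next agent.\n"
    "Rules: preserve every constraint, number, and keyword exactly; "
    "do not add, drop, or reorder items.\n"
    "Before answering, self-check your draft against these rules and fix "
    "any deviation.\n\n"
    "{message}"
)

REPAIR_PROMPT = (
    "The message below may have drifted after several relays. Restore it "
    "to its most precise, self-consistent form without inventing new "
    "content:\n\n"
    "{message}"
)
```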