An eval testing an LLM's ability to mentally execute custom hash functions and hash chains, with and without chain-of-thought (CoT) reasoning.
8 custom hash functions (str → int [0,255]) with varying complexity: carry tracking, dual state, parity branching, index-dependent primes, squaring every 3rd step, bit rotation, min/max tracking, and previous-byte-dependent logic.
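The actual implementations live in `src/hash_functions.py` and are not reproduced here; as a flavor of the style, a carry-tracking variant might look like this minimal sketch (the function name, multiplier, and carry rule below are all illustrative, not the repo's actual `hash_carry`):

```python
def hash_carry_sketch(s: str) -> int:
    """Toy carry-tracking hash: str -> int in [0, 255] (illustrative only)."""
    state, carry = 0, 0
    for ch in s:
        total = state * 31 + ord(ch) + carry
        carry = total // 256   # overflow beyond one byte becomes the carry
        state = total % 256    # running state stays in one byte
    return (state + carry) % 256
```

The point of the carry is that the model must track two interacting quantities per character, which is exactly the kind of bookkeeping that answer-only (no-CoT) prompting makes hard.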
Without CoT, the eval forces answer-only output using an assistant prefill plus a low `max_tokens` limit. With CoT (extended thinking), the model can reason step by step before answering.
```
src/
  hash_functions.py             # 8 hash functions
  eval_llm.py                   # Main eval harness
  generate_inputs.py            # Generate random inputs + ground truth
  generate_human_estimates.py   # Generate human time estimates
  plots/
    utils.py                      # Shared plotting utilities
    accuracy_by_function.py       # Accuracy per hash function (bar chart)
    accuracy_by_chain_length.py   # Accuracy vs chain length (bar chart)
    accuracy_by_time_estimate.py  # Accuracy vs estimated human time (grouped bars)
results/
  human_time_estimates.json     # Human expert time estimates
  plots/                        # Generated plot images
  without_cot/{model}/          # Eval results without CoT
  with_cot/{model}/             # Eval results with CoT (extended thinking)
```
```shell
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your Anthropic API key
```

```shell
# Generate random test inputs
python src/generate_inputs.py

# Run eval without CoT
python src/eval_llm.py src/test_inputs_random.json --model claude-sonnet-4-20250514

# Run eval with CoT (extended thinking, 1024 token budget)
python src/eval_llm.py src/test_inputs_random.json --model claude-opus-4-20250514 --thinking 1024

# Increase parallelism
python src/eval_llm.py src/test_inputs_random.json --workers 32
```

Results are saved to `results/{with,without}_cot/{model}/`.
All plot scripts accept multiple result files and support `--filter` (all/cot/no-cot) and `--labels`.
```shell
# Accuracy by hash function
python src/plots/accuracy_by_function.py results/**/*results*.json --labels Haiku Sonnet Opus "Haiku CoT" "Sonnet CoT" "Opus CoT"

# Accuracy by chain length
python src/plots/accuracy_by_chain_length.py results/**/*results*.json

# Accuracy vs estimated human computation time
python src/plots/accuracy_by_time_estimate.py results/**/*results*.json

# Show only median time estimate (no lower/upper bars)
python src/plots/accuracy_by_time_estimate.py results/**/*results*.json --no-bounds

# Filter to only non-CoT series
python src/plots/accuracy_by_function.py results/**/*results*.json --filter no-cot
```

`generate_inputs.py` computes hash chain ground truths for chain lengths 2, 3, and 5. A chain applies the hash function repeatedly — `hash(str(hash(str(...))))`, n times total. The eval tests these when `chain_ground_truth` is present in the input file.
Hash chains are purely sequential — the output of one computation is the input to the next, so no step can be parallelized or skipped. Errors propagate unrecoverably: being off by even one digit in an intermediate result produces a completely different output for all subsequent steps.
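The chaining step can be sketched as follows, using a stand-in hash for illustration (the real functions live in `src/hash_functions.py`):

```python
def apply_chain(hash_fn, s: str, n: int) -> int:
    """Apply hash_fn n times total, feeding each result back as a decimal string."""
    value = hash_fn(s)
    for _ in range(n - 1):
        value = hash_fn(str(value))  # e.g. 151 -> "151", then hash again
    return value

# Stand-in hash for illustration only: sum of byte values mod 256.
toy_hash = lambda s: sum(s.encode()) % 256
```

Because each iteration consumes only the previous output, a single wrong intermediate value corrupts every later step — there is no way to recover partial credit mid-chain.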
Output JSON contains per-trial records:

```json
{
  "input": "hello",
  "function": "hash_carry",
  "chain_length": 1,
  "expected": 151,
  "got": 108,
  "raw_response": " 108",
  "correct": false,
  "thinking_enabled": false,
  "thinking_content": null,
  "thinking_tokens": null
}
```

A summary table with breakdowns by function and chain length is printed to stdout.
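Grading a record reduces to extracting an integer from `raw_response` and comparing it to `expected`; a minimal re-derivation might look like this (the harness's actual parsing may be stricter):

```python
import re

def grade(record: dict) -> bool:
    """Re-derive 'correct' from a trial record: parse the first integer
    in raw_response and compare it to expected (illustrative sketch)."""
    m = re.search(r"-?\d+", record["raw_response"])
    got = int(m.group()) if m else None
    return got == record["expected"]
```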
`human_time_estimates.json` provides estimated times (in minutes) for a skilled human expert to compute each hash function by hand, used to contextualize LLM performance.
Per-function timing parameters for an expert with pen, paper, and an ASCII table:
| Function | Setup (min) | Per-byte (min) | Final step (min) |
|---|---|---|---|
| hash_carry | 0.05 | 0.25 | 0 |
| hash_index_prime | 0.05 | 0.33 | 0 |
| hash_dual_state | 0.1 | 0.67 | 0.17 |
| hash_parity_branch | 0.05 | 0.42 | 0.08 |
| hash_prev_byte | 0.05 | 0.33 | 0 |
| hash_square_every3 | 0.05 | 0.37 | 0.05 |
| hash_rotate_prime | 0.05 | 0.58 | 0 |
| hash_minmax | 0.1 | 0.42 | 0.17 |
Median time = setup + per_byte × length + final_step. Bounds use log-scale (multiplicative): lower = median ÷ 1.7, upper = median × 2.5.
For chain variants (_2, _3, _5), each extra iteration hashes a 0–255 decimal string (~2.3 bytes on average), adding setup + per_byte × 2.3 + final_step per iteration.
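The estimate formula above can be written out directly (a sketch — `generate_human_estimates.py` may organize this differently):

```python
def estimate_minutes(setup: float, per_byte: float, final: float,
                     length: int, chain: int = 1) -> tuple[float, float, float]:
    """(lower, median, upper) hand-computation time in minutes.
    Each chain iteration beyond the first hashes a ~2.3-byte decimal string."""
    median = setup + per_byte * length + final
    median += (chain - 1) * (setup + per_byte * 2.3 + final)
    return (median / 1.7, median, median * 2.5)
```

For example, `hash_carry` on a 2-byte input gives a median of 0.05 + 0.25 × 2 = 0.55 minutes, with bounds of roughly 0.32 and 1.38 minutes.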
```json
{
  "input": "3z",
  "length": 2,
  "estimates": {
    "hash_carry": [lower_min, median_min, upper_min],
    "hash_carry_2": [lower_min, median_min, upper_min],
    ...
  }
}
```

Generate with:

```shell
python src/generate_human_estimates.py
```