An eval testing an LLM's ability to mentally execute custom hash functions and hash chains, with and without chain-of-thought (CoT) reasoning.
8 custom hash functions (str → int [0,255]) with varying complexity: carry tracking, dual state, parity branching, index-dependent primes, squaring every 3rd step, bit rotation, min/max tracking, and previous-byte-dependent logic.
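The actual implementations live in `src/hash_functions.py` and are not reproduced here; as a flavor of the style, a carry-tracking variant might look like this minimal sketch (the function name, multiplier, and carry rule below are all illustrative, not the repo's actual `hash_carry`):

```python
def hash_carry_sketch(s: str) -> int:
    """Toy carry-tracking hash: str -> int in [0, 255] (illustrative only)."""
    state, carry = 0, 0
    for ch in s:
        total = state * 31 + ord(ch) + carry
        carry = total // 256   # overflow beyond one byte becomes the carry
        state = total % 256    # running state stays in one byte
    return (state + carry) % 256
```

The point of the carry is that the model must track two interacting quantities per character, which is exactly the kind of bookkeeping that answer-only (no-CoT) prompting makes hard.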
Without CoT, the eval forces answer-only output using an assistant prefill plus a low `max_tokens` limit. With CoT (extended thinking), the model can reason step by step before answering.
```
src/
  hash_functions.py             # 8 hash functions
  eval_llm.py                   # Main eval harness
  generate_inputs.py            # Generate random inputs + ground truth
  generate_human_estimates.py   # Generate human time estimates
  plots/
    utils.py                      # Shared plotting utilities
    accuracy_by_function.py       # Accuracy per hash function (bar chart)
    accuracy_by_chain_length.py   # Accuracy vs chain length (bar chart)
    accuracy_by_time_estimate.py  # Accuracy vs estimated human time (grouped bars)
results/
  human_time_estimates.json     # Human expert time estimates
  plots/                        # Generated plot images
  without_cot/{model}/          # Eval results without CoT
  with_cot/{model}/             # Eval results with CoT (extended thinking)
```
```shell
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your Anthropic API key
```

```shell
# Generate random test inputs
python src/generate_inputs.py

# Run eval without CoT
python src/eval_llm.py src/test_inputs_random.json --model claude-sonnet-4-20250514

# Run eval with CoT (extended thinking, 1024 token budget)
python src/eval_llm.py src/test_inputs_random.json --model claude-opus-4-20250514 --thinking 1024

# Increase parallelism
python src/eval_llm.py src/test_inputs_random.json --workers 32
```

Results are saved to `results/{with,without}_cot/{model}/`.
All plot scripts accept multiple result files and support `--filter` (all/cot/no-cot) and `--labels`.
```shell
# Accuracy by hash function
python src/plots/accuracy_by_function.py results/**/*results*.json --labels Haiku Sonnet Opus "Haiku CoT" "Sonnet CoT" "Opus CoT"

# Accuracy by chain length
python src/plots/accuracy_by_chain_length.py results/**/*results*.json

# Accuracy vs estimated human computation time
python src/plots/accuracy_by_time_estimate.py results/**/*results*.json

# Show only median time estimate (no lower/upper bars)
python src/plots/accuracy_by_time_estimate.py results/**/*results*.json --no-bounds

# Filter to only non-CoT series
python src/plots/accuracy_by_function.py results/**/*results*.json --filter no-cot
```

`generate_inputs.py` computes hash chain ground truths for chain lengths 2, 3, and 5. A chain applies the hash function repeatedly — `hash(str(hash(str(...))))`, n times total. The eval tests these when `chain_ground_truth` is present in the input file.
Hash chains are purely sequential — the output of one computation is the input to the next, so no step can be parallelized or skipped. Errors propagate unrecoverably: being off by even one digit in an intermediate result produces a completely different output for all subsequent steps.
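The chaining step can be sketched as follows, using a stand-in hash for illustration (the real functions live in `src/hash_functions.py`):

```python
def apply_chain(hash_fn, s: str, n: int) -> int:
    """Apply hash_fn n times total, feeding each result back as a decimal string."""
    value = hash_fn(s)
    for _ in range(n - 1):
        value = hash_fn(str(value))  # e.g. 151 -> "151", then hash again
    return value

# Stand-in hash for illustration only: sum of byte values mod 256.
toy_hash = lambda s: sum(s.encode()) % 256
```

Because each iteration consumes only the previous output, a single wrong intermediate value corrupts every later step — there is no way to recover partial credit mid-chain.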
Output JSON contains per-trial records:

```json
{
  "input": "hello",
  "function": "hash_carry",
  "chain_length": 1,
  "expected": 151,
  "got": 108,
  "raw_response": " 108",
  "correct": false,
  "thinking_enabled": false,
  "thinking_content": null,
  "thinking_tokens": null
}
```

A summary table with breakdowns by function and chain length is printed to stdout.
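Grading a record reduces to extracting an integer from `raw_response` and comparing it to `expected`; a minimal re-derivation might look like this (the harness's actual parsing may be stricter):

```python
import re

def grade(record: dict) -> bool:
    """Re-derive 'correct' from a trial record: parse the first integer
    in raw_response and compare it to expected (illustrative sketch)."""
    m = re.search(r"-?\d+", record["raw_response"])
    got = int(m.group()) if m else None
    return got == record["expected"]
```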
`human_time_estimates.json` provides estimated times (in minutes) for a skilled human expert to compute each hash function by hand, used to contextualize LLM performance.
Per-function timing parameters for an expert with pen, paper, and an ASCII table:
| Function | Setup (min) | Per-byte (min) | Final step (min) |
|---|---|---|---|
| hash_carry | 0.05 | 0.25 | 0 |
| hash_index_prime | 0.05 | 0.33 | 0 |
| hash_dual_state | 0.1 | 0.67 | 0.17 |
| hash_parity_branch | 0.05 | 0.42 | 0.08 |
| hash_prev_byte | 0.05 | 0.33 | 0 |
| hash_square_every3 | 0.05 | 0.37 | 0.05 |
| hash_rotate_prime | 0.05 | 0.58 | 0 |
| hash_minmax | 0.1 | 0.42 | 0.17 |
Median time = setup + per_byte × length + final_step. Bounds use log-scale (multiplicative): lower = median ÷ 1.7, upper = median × 2.5.
For chain variants (_2, _3, _5), each extra iteration hashes a 0–255 decimal string (~2.3 bytes on average), adding setup + per_byte × 2.3 + final_step per iteration.
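The estimate formula above can be written out directly (a sketch — `generate_human_estimates.py` may organize this differently):

```python
def estimate_minutes(setup: float, per_byte: float, final: float,
                     length: int, chain: int = 1) -> tuple[float, float, float]:
    """(lower, median, upper) hand-computation time in minutes.
    Each chain iteration beyond the first hashes a ~2.3-byte decimal string."""
    median = setup + per_byte * length + final
    median += (chain - 1) * (setup + per_byte * 2.3 + final)
    return (median / 1.7, median, median * 2.5)
```

For example, `hash_carry` on a 2-byte input gives a median of 0.05 + 0.25 × 2 = 0.55 minutes, with bounds of roughly 0.32 and 1.38 minutes.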
```json
{
  "input": "3z",
  "length": 2,
  "estimates": {
    "hash_carry": [lower_min, median_min, upper_min],
    "hash_carry_2": [lower_min, median_min, upper_min],
    ...
  }
}
```

Generate with:

```shell
python src/generate_human_estimates.py
```