FrancisRhysWard/hash-function-evals

Hash Function LLM Eval

An eval testing LLMs' ability to mentally execute custom hash functions and hash chains, with and without chain-of-thought (CoT) reasoning.

Overview

Eight custom hash functions (str → int in [0, 255]) with varying complexity: carry tracking, dual state, parity branching, index-dependent primes, squaring every third step, bit rotation, min/max tracking, and previous-byte-dependent logic.
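To give a feel for what the model must execute, here is a hypothetical function in the same spirit as the eval's carry-tracking variant. It is an illustrative sketch, not the repo's actual hash_carry; only the str → int [0, 255] signature is taken from the description above.

```python
def toy_carry_hash(s: str) -> int:
    """Illustrative sketch: a running state plus a carry flag, reduced mod 256.
    Not the repo's hash_carry; same signature, simpler logic."""
    state, carry = 7, 0
    for ch in s:
        total = state * 31 + ord(ch) + carry
        carry = 1 if total > 255 else 0  # track overflow into the next step
        state = total % 256
    return state
```

The carry bit is what makes this kind of function hard to do mentally: each step's result depends on whether the previous step overflowed, so no step can be computed in isolation.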

Without CoT, the eval forces answer-only output using an assistant prefill plus a low max_tokens limit. With CoT (extended thinking), the model can reason step-by-step before answering.
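The answer-only setup can be sketched as a request payload like the one below. This is a hedged illustration of the technique, assuming the Anthropic Messages API request shape (user message followed by a prefilled assistant turn); the helper name and prefill string are hypothetical, not taken from eval_llm.py.

```python
def build_answer_only_request(prompt: str, model: str) -> dict:
    """Hypothetical sketch: force answer-only output by prefilling the
    assistant turn and capping max_tokens so there is no room for CoT."""
    return {
        "model": model,
        "max_tokens": 8,  # only enough budget for a short numeric answer
        "messages": [
            {"role": "user", "content": prompt},
            # The prefilled assistant turn steers generation straight to the answer
            {"role": "assistant", "content": "The answer is"},
        ],
    }
```

Passing such a payload to the Messages API makes the model complete the prefilled sentence, so the first tokens it can emit are the answer itself.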

Directory Structure

src/
  hash_functions.py         # 8 hash functions
  eval_llm.py               # Main eval harness
  generate_inputs.py        # Generate random inputs + ground truth
  generate_human_estimates.py # Generate human time estimates
  plots/
    utils.py                # Shared plotting utilities
    accuracy_by_function.py # Accuracy per hash function (bar chart)
    accuracy_by_chain_length.py # Accuracy vs chain length (bar chart)
    accuracy_by_time_estimate.py # Accuracy vs estimated human time (grouped bars)
results/
  human_time_estimates.json # Human expert time estimates
  plots/                    # Generated plot images
  without_cot/{model}/      # Eval results without CoT
  with_cot/{model}/         # Eval results with CoT (extended thinking)

Setup

pip install -r requirements.txt
cp .env.example .env
# Edit .env with your Anthropic API key

Usage

Running evals

# Generate random test inputs
python src/generate_inputs.py

# Run eval without CoT
python src/eval_llm.py src/test_inputs_random.json --model claude-sonnet-4-20250514

# Run eval with CoT (extended thinking, 1024 token budget)
python src/eval_llm.py src/test_inputs_random.json --model claude-opus-4-20250514 --thinking 1024

# Increase parallelism
python src/eval_llm.py src/test_inputs_random.json --workers 32

Results are saved to results/{with,without}_cot/{model}/.

Plotting

All plot scripts accept multiple result files and support --filter (all/cot/no-cot) and --labels.

# Accuracy by hash function
python src/plots/accuracy_by_function.py results/**/*results*.json --labels Haiku Sonnet Opus "Haiku CoT" "Sonnet CoT" "Opus CoT"

# Accuracy by chain length
python src/plots/accuracy_by_chain_length.py results/**/*results*.json

# Accuracy vs estimated human computation time
python src/plots/accuracy_by_time_estimate.py results/**/*results*.json

# Show only median time estimate (no lower/upper bars)
python src/plots/accuracy_by_time_estimate.py results/**/*results*.json --no-bounds

# Filter to only non-CoT series
python src/plots/accuracy_by_function.py results/**/*results*.json --filter no-cot

Hash Chains

generate_inputs.py computes hash chain ground truths for chain lengths 2, 3, and 5. A chain applies the hash function repeatedly: hash(str(hash(str(...)))) n times total. The eval tests these when chain_ground_truth is present in the input file.

Hash chains are purely sequential — the output of one computation is the input to the next, so no step can be parallelized or skipped. Errors propagate unrecoverably: being off by even one digit in an intermediate result produces a completely different output for all subsequent steps.
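The chaining rule above can be sketched directly; the stand-in hash below is hypothetical (not one of the repo's eight), used only to make the chain helper self-contained.

```python
def hash_chain(h, s: str, n: int) -> int:
    """Apply h a total of n times: hash(str(hash(str(...)))).
    h is any str -> int hash function."""
    v = h(s)
    for _ in range(n - 1):
        v = h(str(v))  # feed the decimal string of the result back in
    return v

def toy_hash(s: str) -> int:
    """Stand-in hash for illustration, not one of the repo's functions."""
    acc = 0
    for ch in s:
        acc = (acc * 131 + ord(ch)) % 256
    return acc
```

Because each step consumes the decimal string of the previous result, perturbing any intermediate value by even one changes the input string to the next step and sends the rest of the chain down a different path.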

Results Format

Output JSON contains per-trial records:

{
  "input": "hello",
  "function": "hash_carry",
  "chain_length": 1,
  "expected": 151,
  "got": 108,
  "raw_response": " 108",
  "correct": false,
  "thinking_enabled": false,
  "thinking_content": null,
  "thinking_tokens": null
}

A summary table is printed to stdout with breakdowns by function and chain length.
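Given per-trial records with the fields shown above, such breakdowns reduce to grouping by a field and averaging the correct flag. A minimal sketch (the function name is illustrative, not the repo's):

```python
from collections import defaultdict

def accuracy_by(records: list[dict], key: str) -> dict:
    """Group per-trial records by a field (e.g. 'function' or 'chain_length')
    and return the fraction of correct trials in each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += r["correct"]  # bool counts as 0/1
    return {k: hits[k] / totals[k] for k in totals}
```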

Human Time Estimates

human_time_estimates.json provides estimated times (in minutes) for a skilled human expert to compute each hash function by hand; these estimates are used to contextualize LLM performance.

Estimation model

Per-function timing parameters for an expert with pen, paper, and ASCII table:

Function            Setup (min)  Per-byte (min)  Final step (min)
hash_carry          0.05         0.25            0
hash_index_prime    0.05         0.33            0
hash_dual_state     0.1          0.67            0.17
hash_parity_branch  0.05         0.42            0.08
hash_prev_byte      0.05         0.33            0
hash_square_every3  0.05         0.37            0.05
hash_rotate_prime   0.05         0.58            0
hash_minmax         0.1          0.42            0.17

Median time = setup + per_byte × length + final_step. Bounds use log-scale (multiplicative): lower = median ÷ 1.7, upper = median × 2.5.

For chain variants (_2, _3, _5), each extra iteration hashes a 0–255 decimal string (~2.3 bytes on average), adding setup + per_byte × 2.3 + final_step per iteration.
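The estimation model above translates directly into code. The sketch below implements the stated formulas (median, the 1.7/2.5 log-scale bounds, and the ~2.3-byte chain increment); the function name is illustrative, not taken from generate_human_estimates.py.

```python
def time_estimate(setup: float, per_byte: float, final: float,
                  length: int, chain: int = 1) -> tuple[float, float, float]:
    """Return (lower, median, upper) hand-computation time in minutes.
    Median = setup + per_byte * length + final; each extra chain
    iteration hashes a ~2.3-byte decimal string; bounds are log-scale."""
    median = setup + per_byte * length + final
    median += (chain - 1) * (setup + per_byte * 2.3 + final)
    return median / 1.7, median, median * 2.5
```

For example, hash_carry (setup 0.05, per-byte 0.25, final 0) on a 2-byte input gives a median of 0.55 minutes, and the chain-of-2 variant adds one more iteration's worth of work on top of that.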

Output format

{
  "input": "3z",
  "length": 2,
  "estimates": {
    "hash_carry": [lower_min, median_min, upper_min],
    "hash_carry_2": [lower_min, median_min, upper_min],
    ...
  }
}

Generate with:

python src/generate_human_estimates.py
