Stress-Testing Tasks

Datasets and generation code for five stress-testing tasks that evaluate the behavior of reasoning models.

Tasks

#	Task	Description
1	Detecting the effect of a user preference	Measures whether a model changes its answer to a moral dilemma when the user expresses a preference (sycophancy detection).
2	Detecting the effect of a Stanford professor hint	Measures whether a model follows an authoritative hint ("A Stanford professor thinks the answer is X") rather than reasoning independently.
3	Identifying atypical answers	Classifies whether a model's response to a question is a majority or minority (atypical) answer across many rollouts.
4	Estimating the answer entropy	Tracks how the model's answer distribution evolves sentence-by-sentence through its chain of thought via logprob forcing.
5	Compressing reasoning traces	Evaluates how much of a chain of thought can be compressed while preserving the model's answer distribution.
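
As a rough illustration of the quantities behind Tasks 3 and 4, the sketch below computes an empirical answer distribution, its Shannon entropy, and a majority/minority label per rollout. It is not taken from the repository: the function names and the sample answers are hypothetical, and Task 4 itself derives its distribution from logprob forcing rather than from rollout counts.

```python
import math
from collections import Counter


def answer_distribution(answers):
    """Return the empirical probability of each distinct answer."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


def shannon_entropy(distribution):
    """Shannon entropy, in bits, of an {answer: probability} mapping."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def label_rollouts(answers):
    """Label each rollout 'majority' if it gave the most common answer, else 'minority'."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return ["majority" if a == majority_answer else "minority" for a in answers]


# Hypothetical final answers from eight rollouts of a single question.
rollout_answers = ["A", "A", "B", "A", "A", "C", "A", "A"]
dist = answer_distribution(rollout_answers)
print(dist)
print(shannon_entropy(dist))
print(label_rollouts(rollout_answers))
```

Low entropy means the rollouts mostly agree on one answer; the "minority" labels mark the atypical rollouts.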

Dataset Format

Tasks 1–4

Each task directory follows this structure:

datasets/{N}/
  prompts/
    train/        # Prompt/question metadata (one JSON per item)
    val/          # (where applicable)
    test/
    ood_test/     # Out-of-distribution test set (where applicable)
  qwen-3-32b/
    train/        # Model outputs (one JSON per rollout)
    val/
    test/
    ood_test/
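
A minimal sketch of reading one split from this layout, assuming only what the tree above states: each item is a separate JSON file inside a split directory. The helper name load_split and the choice to key records by file stem are illustrative, and the fields inside each JSON file are not assumed.

```python
import json
from pathlib import Path


def load_split(task_dir, source, split):
    """Load every JSON file under task_dir/source/split, keyed by file stem.

    Returns an empty dict when the split directory does not exist,
    since val/ and ood_test/ are only present where applicable.
    """
    split_dir = Path(task_dir) / source / split
    if not split_dir.is_dir():
        return {}
    records = {}
    for path in sorted(split_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records[path.stem] = json.load(f)
    return records


# Count prompt files and model-output files per split for Task 1.
task_dir = Path("datasets") / "1"
for split in ("train", "val", "test", "ood_test"):
    prompts = load_split(task_dir, "prompts", split)
    rollouts = load_split(task_dir, "qwen-3-32b", split)
    print(split, len(prompts), len(rollouts))
```

How a rollout file refers back to its prompt is not specified here, so any join between the two should be checked against the actual JSON contents.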

Task 5 (Compressing reasoning traces)

Task 5 has a different structure because it is organized by compression method rather than by train/test split:

datasets/5/
  prompts/              # Flat: one compression spec JSON per question+rollout
  qwen-3-32b/
    attention_selection/    # Attention-based sentence selection
    faithful_monitor/       # LLM monitor sentence selection
    last_n_baseline/        # Keep last N sentences baseline
    sliding_window_oracle/  # Sliding window oracle selection
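
A quick way to inspect this layout is to count the files under each method directory. The sketch below assumes the model outputs are stored as JSON files, which is stated for the other tasks but not explicitly for these directories; the helper name is hypothetical.

```python
from pathlib import Path

# Method directory names as listed in the layout above.
COMPRESSION_METHODS = (
    "attention_selection",
    "faithful_monitor",
    "last_n_baseline",
    "sliding_window_oracle",
)


def count_outputs_by_method(model_dir):
    """Count JSON files found anywhere under each compression-method directory."""
    model_dir = Path(model_dir)
    counts = {}
    for method in COMPRESSION_METHODS:
        method_dir = model_dir / method
        if method_dir.is_dir():
            counts[method] = sum(1 for _ in method_dir.rglob("*.json"))
        else:
            counts[method] = 0  # Directory missing: report zero rather than fail.
    return counts


print(count_outputs_by_method(Path("datasets") / "5" / "qwen-3-32b"))
```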

Generation Code

The src/ directory contains the data generation code for reference. This code is not intended to be run; the datasets are pre-generated and included in the datasets/ directory.

src/
  tasks/          # Per-task generation logic
  utils/          # Shared utilities
  runs/
    generate_dataset.py   # Unified entry point with flags -1 through -5

To see how datasets were generated, look at src/runs/generate_dataset.py and the corresponding src/tasks/{task}/task.py files.
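
For readers unfamiliar with numeric flags, the sketch below shows one way a selector with flags -1 through -5 could be parsed with argparse. It is a hypothetical illustration, not the contents of generate_dataset.py; the destination names task_1 through task_5 are made up for this example.

```python
import argparse


def build_parser():
    """Build a parser with one boolean flag per task, -1 through -5."""
    parser = argparse.ArgumentParser(description="Hypothetical task selector")
    for n in range(1, 6):
        # A digit is not a valid attribute name, so each flag needs an explicit dest.
        parser.add_argument(f"-{n}", dest=f"task_{n}", action="store_true",
                            help=f"select task {n}")
    return parser


def selected_tasks(argv):
    """Return the task numbers selected on the command line, in ascending order."""
    args = build_parser().parse_args(argv)
    return [n for n in range(1, 6) if getattr(args, f"task_{n}")]


print(selected_tasks(["-3"]))
print(selected_tasks(["-1", "-5"]))
```

Because the parser defines options that look like negative numbers, argparse treats -1 through -5 as flags rather than as numeric positional values.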

Models

All datasets were generated using Qwen/Qwen3-32B as the subject model. The qwen-3-32b/ directory under each task contains that model's outputs.
