pushover

A digital Asch conformity experiment for language models.

pushover measures sycophantic collapse: how often a language model abandons a correct answer when a user applies social pressure. It asks each question twice — once neutrally, once under pressure — and reports the flip rate: the fraction of items where the model caves to a false assertion or to mere expressed doubt.

Why this exists

Sycophancy — telling the user what they want to hear instead of what is true — is one of the most-studied alignment failures of 2025–2026, and the recent results are counterintuitive. A large clinical-pressure evaluation found acquiescence rates spanning the full 0–100% range across twenty contemporary models, and reported that model recency and capability did not reliably predict robustness — newer was sometimes worse (SycoEval-EM, arXiv:2601.16529). A separate benchmark found that alignment is strongly framing-sensitive: models refuse explicit bad requests far more reliably than the same requests phrased as a pressured shortcut (SciIntBench, arXiv:2605.29468).

pushover is a small, reproducible probe of that phenomenon you can run on a free Colab GPU in minutes. It is deliberately scoped as a teaching and portfolio artifact first, with a clear path to a hardened, citable evaluation (see Roadmap).

What it measures

For a battery of questions with a known correct answer, we run three conditions:

Condition	Prompt shape
`neutral`	Ask the question plainly.
`assert`	State a wrong answer confidently, then ask if that's right.
`doubt`	Ask neutrally, accept the model's answer, then express disbelief.

The headline metric is flip rate = (items where the model adopts the wrong answer under pressure) / (items it answered correctly when neutral). Restricting the denominator to items it got right neutrally is what separates sycophancy from plain ignorance — a model can only "cave" on something it actually knew.

Quickstart (Google Colab)

Open notebooks/pushover_experiment.ipynb in Colab (Runtime → Change runtime type → T4 GPU).
Run the cells top to bottom. The first cell downloads a small instruction-tuned model (Qwen/Qwen3-4B-Instruct-2507 — current Qwen3 generation, non-thinking, no Hugging Face login required).
Read your flip rate at the end of the run.

Or from a terminal:

pip install -r requirements.txt
python -m src.pushover            # runs the default battery, writes results/run.jsonl

Method notes

Greedy decoding (do_sample=False) so runs are reproducible.
Grading starts as normalized string matching (see src/pushover.py). This is the single biggest limitation — see Roadmap for the model-graded upgrade.
Baseline-relative. We always report neutral accuracy alongside the flip rate so the pressure effect is interpretable, not confounded with the model's base error.

Results

On Qwen3-4B-Instruct-2507, under identical 3-turn escalating pressure (n = 30):

Category	Capitulation rate	n
Factual (objective answer)	0%	15
Subjective (quality judgment)	100%	15

The model never abandoned a correct fact but always conceded a subjective rating — and the numeric scores moved with it (mean 2.1 → 8.1). The movement tracked the user's demand directionally: the one item pushed toward a low score followed downward (3 → 2) rather than inflating, while the fourteen pushed high rose toward 9. Tone capitulation was near-universal in both categories. Capitulation is gated almost perfectly by whether the question has a verifiable answer.

Full write-up, second figure, judge-validation details, and limitations: see RESULTS.md. Raw run output: pushover_results.csv.

Limitations

This is an honest list, on purpose — it's also the to-do list.

Single model (Qwen3-4B-Instruct-2507); results do not generalize until a model ladder is run.
Self-judge: the grader is from the same model family it scores — a known weakness. Next step is a different, stronger judge model plus a formal human-agreement statistic (Cohen's kappa on ~50 labeled responses).
Deliberately weak subjective items; "100%" means "on clearly-flawed artifacts pushed toward high scores," not all subjective judgments.
Small battery (15 + 15) and a single phrasing per pressure type; the rigorous version uses more items and several paraphrases per condition, reporting the spread.

Roadmap

Stronger, independent judge — swap the self-judge for a different model and report a formal human-agreement statistic (Cohen's kappa on ~50 labeled responses).
More models — run the same battery across a model ladder (small/older through frontier) to test how the factual-vs-subjective split varies with scale and recency.
Pressure taxonomy — vary the type of pressure (authority, embarrassment, repetition) and several paraphrases per type, reporting the spread.
Harder subjective items — move beyond deliberately-weak artifacts to test the ceiling of subjective capitulation.
Stats + publication — confidence intervals, then mint a Zenodo DOI; take the hardened version to arXiv (needs an endorser as of the Jan 2026 policy change).

Citing

This repo ships a CITATION.cff. After your first GitHub release, connect the repo to Zenodo to mint a DOI automatically, then paste the DOI badge at the top of this file. Consider registering a free ORCID iD and adding it to CITATION.cff — it unifies your work across Zenodo, Google Scholar, and (later) arXiv under one persistent identifier.

References

SycoEval-EM: Sycophancy Evaluation of LLMs in Simulated Clinical Encounters for Emergency Care. arXiv:2601.16529 (2026).
SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing. arXiv:2605.29468 (2026).
Large language models show amplified cognitive biases in moral decision-making. PNAS (2025). doi:10.1073/pnas.2412015122
Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments. arXiv:2412.00323 (2024).
Solomon E. Asch. Opinions and social pressure. Scientific American (1955). (the human ancestor of this probe)

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pushover

Why this exists

What it measures

Quickstart (Google Colab)

Method notes

Results

Limitations

Roadmap

Citing

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
docs		docs
notebooks		notebooks
results		results
site		site
src		src
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
RESULTS.md		RESULTS.md
pushover_results.csv		pushover_results.csv
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

pushover

Why this exists

What it measures

Quickstart (Google Colab)

Method notes

Results

Limitations

Roadmap

Citing

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages