A reasoning harness for mathematical problem-solving and proof-writing in natural language.
```
pip install -r requirements.txt
```

```
python solve_agent.py <problems_dir> [options]
```

`problems_dir`: Directory containing `.md` problem files
| Flag | Default | Description |
|---|---|---|
| `--submissions_dir` | `submissions/{problems_dir}-{timestamp}` | Output directory for final submissions |
| `--judge_prompt` | `prompts/score.md` | Judge prompt file |
| `--solve_prompt` | `None` | Solver system prompt |
| `--consolidation_prompt` | `prompts/consolidation.md` | Consolidation prompt |
| `--pairwise_prompt` | `prompts/pairwise.md` | Pairwise comparison prompt |
| `--time_limit_hours` | `3.0` | Total time limit in hours |
| `--max_concurrent` | `32` | Max parallel API requests |
| `--target_perfect_scores` | `4` | Number of 7/7 scores needed per problem |
| `--model` | `nomos-1` | Model for solving |
| `--judge_model` | `nomos-1` | Model for judging |
| `--base_url` | `http://localhost:30000/v1` | OpenAI-compatible API endpoint |
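For example, a run over a hypothetical `problems/putnam_2025_a` directory against a local OpenAI-compatible server might look like this (the directory name and flag values are illustrative):

```
python solve_agent.py problems/putnam_2025_a \
  --model nomos-1 \
  --judge_model nomos-1 \
  --base_url http://localhost:30000/v1 \
  --time_limit_hours 2.0 \
  --max_concurrent 16 \
  --target_perfect_scores 4
```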
Nomos keeps working on the problems you give it until its time limit runs out or it reaches a target number of self-critiqued perfect scores on every problem. Once either termination condition is reached, Nomos enters a finalization phase: it first discards a number of submissions, then judges the remainder pairwise, tournament-style, to select a final submission.
In the solving phase we launch `max_concurrent` parallel workers, where each worker (see the sketch after this list):
- Picks a problem based on priority plus round-robin:
  - Priority: problems with the fewest perfect scores
  - Round-robin among problems tied at the minimum
- Generates a submission.
- Scores the submission out of a maximum of 7 points.
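A minimal sketch of that selection rule, assuming illustrative `perfect_scores` and `last_picked` dicts rather than the harness's actual bookkeeping:

```python
def pick_problem(perfect_scores: dict[str, int], last_picked: dict[str, int]) -> str:
    """Priority: fewest perfect (7/7) scores; round-robin among ties.

    perfect_scores maps problem_id -> number of perfect scores so far;
    last_picked maps problem_id -> the step at which it was last chosen.
    """
    fewest = min(perfect_scores.values())
    tied = [p for p, n in perfect_scores.items() if n == fewest]
    # Round-robin: among tied problems, take the least recently chosen one.
    return min(tied, key=lambda p: last_picked.get(p, -1))
```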
Nomos keeps spawning workers until all problems have `target_perfect_scores` perfect scores or time runs out.
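A rough sketch of this outer loop, reusing `pick_problem` from above, with hypothetical `generate` and `judge` coroutines standing in for the real API requests:

```python
import asyncio
import time

async def generate(problem: str) -> str:
    """Hypothetical stand-in for the solver model call."""
    await asyncio.sleep(0.1)
    return f"submission for {problem}"

async def judge(problem: str, submission: str) -> int:
    """Hypothetical stand-in for the judge model call (scores 0-7)."""
    await asyncio.sleep(0.1)
    return 7

async def solve_loop(problems: list[str], max_concurrent: int = 32,
                     time_limit_hours: float = 3.0,
                     target_perfect_scores: int = 4) -> None:
    perfect = {p: 0 for p in problems}
    last_picked: dict[str, int] = {}
    deadline = time.monotonic() + time_limit_hours * 3600
    sem = asyncio.Semaphore(max_concurrent)  # caps in-flight API requests
    tasks: set[asyncio.Task] = set()
    step = 0

    async def worker(step: int) -> None:
        try:
            problem = pick_problem(perfect, last_picked)
            last_picked[problem] = step
            if await judge(problem, await generate(problem)) == 7:
                perfect[problem] += 1
        finally:
            sem.release()

    # Spawn workers until every problem has enough perfect scores
    # or the time budget is exhausted.
    while (time.monotonic() < deadline
           and min(perfect.values()) < target_perfect_scores):
        await sem.acquire()  # wait for a free worker slot
        task = asyncio.create_task(worker(step))
        tasks.add(task)
        task.add_done_callback(tasks.discard)
        step += 1
    await asyncio.gather(*tasks)  # let in-flight workers finish
```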
The finalization phase starts 15 minutes before the time limit (or at 50% of the time limit for short runs) and consists of two subphases, sketched below:
- Consolidation: groups submissions by their conclusion and keeps the group it judges correct (not necessarily the majority group).
- Pairwise tournament: a single-elimination bracket among the consolidated submissions, with ties resolved randomly.
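Both subphases can be sketched as follows, with hypothetical `final_answer`, `pick_correct_group`, and `judged_better` callables standing in for the harness's LLM judgments:

```python
import random
from collections import defaultdict

def consolidate(submissions, final_answer, pick_correct_group):
    """Group submissions by their concluding answer and keep one group.

    final_answer(s) extracts a submission's conclusion; pick_correct_group
    chooses the group judged correct, which need not be the largest one.
    """
    groups = defaultdict(list)
    for s in submissions:
        groups[final_answer(s)].append(s)
    return groups[pick_correct_group(groups)]

def pairwise_tournament(candidates, judged_better):
    """Single-elimination bracket; judged_better(a, b) returns the winner,
    flipping a coin when it cannot separate the two."""
    pool = list(candidates)
    random.shuffle(pool)  # random initial seeding
    while len(pool) > 1:
        winners = [judged_better(pool[i], pool[i + 1])
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:  # odd one out advances on a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```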
Each final submission is written to its own markdown file in the following format:
```
# problem_id

## Problem

[original problem text]

## Submission

[selected solution]
```
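As a concrete illustration, a minimal (hypothetical) helper that emits this format might look like:

```python
from pathlib import Path

def write_submission(out_dir: Path, problem_id: str,
                     problem: str, solution: str) -> None:
    """Write one final submission as '<problem_id>.md' in the format above."""
    out_dir.mkdir(parents=True, exist_ok=True)
    text = (f"# {problem_id}\n\n## Problem\n\n{problem}\n\n"
            f"## Submission\n\n{solution}\n")
    (out_dir / f"{problem_id}.md").write_text(text)
```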
To reproduce our Putnam 2025 runs:

```
./runbooks/run_putnam_2025_a_nomos-1.sh  # Putnam 2025 A problems
./runbooks/run_putnam_2025_b_nomos-1.sh  # Putnam 2025 B problems
```

When run on Putnam 2025 with the NousResearch/Nomos-1 model, this reasoning harness achieves a score of 87/120 as graded by a human expert. Below we show a problem-wise comparison with Qwen3, which scores 24/120 under the same conditions.
If you would like to cite our work, please use the following for now:
```bibtex
@misc{nomos2025,
  title        = {Nomos},
  author       = {Jin, Roger and Quesnelle, Jeffrey and Mahan, Dakota and Guang, Chen and Teknium, Ryan and Park, Jun and Ustelbay, Ibrakhim and Kim, Samuel and Yurkevich, Miron and Zauytkhan, Adilet and Amankos, Rinat and Andreyev, Alex and Nurlanov, Damir and Abuov, Abuzer and massiveaxe, Askar},
  year         = {2025},
  howpublished = {\url{https://github.com/NousResearch/nomos}},
  note         = {GitHub repository},
}
```

