High quality from small models #216

@jeremymanning

Most of the llmXive pipeline is driven by Qwen3.5 122B. It is a capable model and very convenient to use (free via chat.dartmouth.edu!), but it is not frontier-level:

Image

The primary alternatives are either to use a paid provider such as OpenAI (ChatGPT), Anthropic (Claude), or Google (Gemini), or to use even smaller HuggingFace models that could run locally inside the CI environment.

In our practical testing, local models are impractical: at the sizes that can run inside the CI environment, their performance is very poor. Paid models (we've looked at the latest ChatGPT, Claude Opus, Claude Sonnet, and Gemini Pro models) produce substantially higher-quality outputs than Qwen3.5 122B, but they cost much more. Any cost is more than "free", and running llmXive at scale will quickly balloon costs unless they are managed very carefully, which can be difficult during testing when many parts of this project are broken.

So the ideal solution would be to keep using Qwen3.5 122B via Dartmouth Chat, IF we can get it to produce high-quality outputs through careful prompting, guardrails, automated deterministic checks, and deterministic decomposition of all inputs and tasks into simpler, manageable, and verifiable units.

What I'm wondering is: can we subdivide any of the pipeline steps into a (potentially MUCH) larger number of tiny steps? Anything that can run without any LLM involvement would need to be turned into a deterministic script. Any remaining LLM calls would need to (a) carve up the inputs into smaller units that fit within the model's limited context, (b) carve up the task list into a large number of tiny, scope-limited tasks, and (c) stitch the results back together to complete the pipeline step, either deterministically or with further LLM involvement (where that is possible within the limited scope, context, and model capabilities).
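To make the (a)/(b)/(c) idea concrete, here is a minimal sketch of one decomposed pipeline step. It is an illustration, not a proposal for the actual llmXive code: the function names (`split_into_chunks`, `run_step`, `stitch`), the character-based size limit, and the `call_llm` and `validate` callables are all hypothetical placeholders. The splitting and stitching are fully deterministic; the model is only ever asked one tiny task about one small chunk, and every answer must pass a deterministic check before it is accepted.

```python
from typing import Callable, Dict, List, Optional


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """(a) Deterministically pack paragraphs into chunks of at most max_chars."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split any paragraph that is itself larger than the limit.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def run_step(
    text: str,
    tasks: List[str],
    call_llm: Callable[[str], str],   # hypothetical: wraps whatever model client is used
    validate: Callable[[str], bool],  # deterministic check on each answer
    max_chars: int = 2000,
    max_retries: int = 2,
) -> List[Dict[str, object]]:
    """(b) Run every tiny task on every chunk, retrying answers that fail validation."""
    results: List[Dict[str, object]] = []
    for chunk_id, chunk in enumerate(split_into_chunks(text, max_chars)):
        for task in tasks:
            prompt = f"{task}\n\n---\n{chunk}"
            answer: Optional[str] = None
            for _ in range(max_retries + 1):
                candidate = call_llm(prompt)
                if validate(candidate):
                    answer = candidate
                    break
            # answer stays None if every attempt failed; we never keep unchecked output.
            results.append({"chunk": chunk_id, "task": task, "answer": answer})
    return results


def stitch(results: List[Dict[str, object]]) -> Dict[str, List[Optional[str]]]:
    """(c) Deterministically regroup answers by task, in chunk order."""
    merged: Dict[str, List[Optional[str]]] = {}
    for row in sorted(results, key=lambda r: r["chunk"]):
        merged.setdefault(str(row["task"]), []).append(row["answer"])
    return merged
```

A failed unit shows up as `None` in the stitched output, so a downstream deterministic check can refuse to complete the step (or re-queue just that unit) instead of silently passing along a bad answer.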

A parallel set of questions concerns how we can reliably ensure that research steps actually run, rather than being hallucinated. As much as possible, we need deterministic guardrails. Long-running steps are particularly tricky because they may exceed wall time budgets for individual CI sessions.
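As a rough sketch of what such guardrails could look like, the block below shows two small pieces. The first runs a step as a real subprocess and writes a manifest (exit code plus SHA-256 hashes of the output files), so that "the step ran" is established by checking files on disk rather than by trusting a model's report. The second is a resumable loop that stops before a wall time budget is exhausted and picks up from a checkpoint in the next CI session. All names here (`run_and_record`, `verify_step`, `run_with_budget`, the manifest layout) are hypothetical and not part of the existing pipeline.

```python
import hashlib
import json
import subprocess
import sys
import time
from pathlib import Path


def sha256_file(path) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(cmd, artifacts, manifest_path, timeout_s=600):
    """Run cmd for real and write a manifest describing what it produced."""
    for p in artifacts:
        # Delete stale outputs so an old file cannot pass for a fresh result.
        Path(p).unlink(missing_ok=True)
    started = time.time()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    manifest = {
        "cmd": [str(c) for c in cmd],
        "returncode": proc.returncode,
        "elapsed_s": round(time.time() - started, 3),
        "stdout_sha256": hashlib.sha256(proc.stdout.encode()).hexdigest(),
        "artifacts": {str(p): sha256_file(p) for p in artifacts if Path(p).exists()},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_step(manifest_path, required_artifacts) -> bool:
    """Deterministic check: manifest exists, exit code 0, artifact hashes match."""
    path = Path(manifest_path)
    if not path.exists():
        return False
    manifest = json.loads(path.read_text())
    if manifest.get("returncode") != 0:
        return False
    for p in required_artifacts:
        recorded = manifest.get("artifacts", {}).get(str(p))
        if recorded is None or not Path(p).exists() or sha256_file(p) != recorded:
            return False
    return True


def run_with_budget(items, process, state_path, budget_s) -> bool:
    """Process items in order, checkpointing after each; stop when the budget is spent.

    Returns True only once every item has been processed (possibly across sessions).
    """
    state_file = Path(state_path)
    state = json.loads(state_file.read_text()) if state_file.exists() else {"done": 0}
    deadline = time.monotonic() + budget_s
    for i in range(state["done"], len(items)):
        if time.monotonic() >= deadline:
            break
        process(items[i])
        state["done"] = i + 1
        state_file.write_text(json.dumps(state))
    return state["done"] == len(items)
```

For example, a CI job could call `run_and_record([sys.executable, "analysis.py"], ["results.csv"], "manifest.json")` (file names are placeholders) and only let the pipeline advance when `verify_step("manifest.json", ["results.csv"])` returns `True`. The checkpoint file used by `run_with_budget` would need to persist between sessions, for instance as a committed file or a cached artifact; which mechanism fits best is left open here.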

Finally, of potential relevance, two recent sets of tools may be useful here:

  1. Antigravity skills for scientific discovery: https://antigravity.google/use-cases/science
  2. The "Robin" system for automated scientific discovery: https://www.nature.com/articles/s41586-026-10652-y (code is here: https://github.com/Future-House/robin)

We need to think carefully about the implementation here, and update or modify any existing open issues/plans for the pipeline, including documentation, agent prompts, and user-facing items like the website, as needed.
