High quality from small models

Most of the [llmXive pipeline](https://context-lab.com/llmXive/#about) is driven by [Qwen3.5 122B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B). This is a *capable* model and highly convenient to use (*free* via [chat.dartmouth.edu](https://chat.dartmouth.edu/)!), but it's not frontier-level:

<img width="17277" height="11457" alt="Image" src="https://github.com/user-attachments/assets/5d668a30-932c-4997-9522-d8eeda401f2e" />

The primary alternatives would be either to use a paid provider like OpenAI (ChatGPT), Anthropic (Claude), or Google (Gemini), or to use even *smaller* HuggingFace models that could run locally inside the CI environment.

In our practical testing, local models are impractical because the models small enough to run locally in the CI environment perform very poorly. Paid models (we've looked at the latest ChatGPT, Claude Opus, Claude Sonnet, and Gemini Pro models) produce substantially higher quality outputs than Qwen3.5 122B, but they cost much more. *Any* cost is more than "free", and running llmXive at scale will quickly balloon costs if not managed very effectively, which can be difficult during testing when many parts of this project are broken.

So the ideal solution would be to use Qwen3.5 122B via Dartmouth Chat, IF we can get it to produce high quality outputs through careful prompting, guardrails, automated deterministic checks, and by deterministically breaking down all inputs and tasks into simpler, manageable, and verifiable units.
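To make "automated deterministic checks" concrete, here is a minimal sketch of a validate-and-retry wrapper around a single model call. Everything in it is an assumption for illustration: the `llm` callable stands in for whatever client talks to Dartmouth Chat, and the required keys are a hypothetical schema, not llmXive's actual output format.

```python
import json
from typing import Callable

# Hypothetical schema: the keys a step's JSON output must contain.
REQUIRED_KEYS = {"title", "summary"}


def check_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    present = set(data)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - present)]
    for key in sorted(REQUIRED_KEYS & present):
        if not isinstance(data[key], str) or not data[key].strip():
            problems.append(f"{key} must be a non-empty string")
    return problems


def call_with_checks(
    llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> str:
    """Call the model, re-prompting with the check failures until it passes."""
    feedback = ""
    problems: list[str] = []
    for _ in range(max_attempts):
        raw = llm(prompt + feedback)
        problems = check_output(raw)
        if not problems:
            return raw
        # Feed the deterministic failure list back to the model.
        feedback = (
            "\n\nYour previous answer was rejected:\n- " + "\n- ".join(problems)
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {problems}")
```

The point of the design is that acceptance never depends on the model's own judgment: a step's output either passes the script or the step fails loudly.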

What I'm wondering is: can we sub-divide any of the pipeline steps into a (potentially MUCH) larger number of tiny steps? Anything that can run without *any* LLM involvement would need to be turned into a deterministic script. And any LLM calls would need to (a) carve up the inputs into smaller units that fit within the limited model context, (b) carve up the task list into a large number of tiny scope-limited tasks, and (c) deterministically and/or with further LLM involvement (if possible to do with limited scope/context, and with limited model capabilities) stitch the final result back together to complete the pipeline step.
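As a rough sketch of (a) through (c), the following splits an input into chunks, runs every small task against every chunk, and stitches the answers together deterministically. The function names, the character-based budget, and the `llm` callable are all assumptions for illustration; a real version would likely budget in tokens and use the pipeline's own client.

```python
from typing import Callable


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """(a) Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A paragraph longer than the budget is hard-split into pieces.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def run_step(
    text: str,
    tasks: list[str],
    llm: Callable[[str], str],
    max_chars: int = 2000,
) -> dict[str, list[str]]:
    """(b) Run each tiny task on each chunk, one narrow LLM call at a time."""
    chunks = split_into_chunks(text, max_chars)
    results: dict[str, list[str]] = {task: [] for task in tasks}
    for task in tasks:
        for chunk in chunks:
            results[task].append(llm(f"{task}\n\n---\n{chunk}"))
    return results


def stitch(results: dict[str, list[str]]) -> str:
    """(c) Deterministically merge the per-chunk answers into one report."""
    sections = []
    for task, answers in results.items():
        body = "\n".join(f"- {a.strip()}" for a in answers)
        sections.append(f"## {task}\n{body}")
    return "\n\n".join(sections)
```

Note that `max_chars` here bounds only the chunk, not the task text prepended to it, so the budget should leave headroom for the prompt. The stitching step shown is purely deterministic; an LLM-based merge could replace it where the per-chunk answers need reconciling.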

Another set of parallel questions is how we can *reliably* ensure that research steps actually run, rather than being hallucinated. As much as possible, we need deterministic guardrails. Long-running steps are particularly tricky because they may exceed wall time budgets for individual CI sessions.
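One possible shape for such a guardrail, sketched under stated assumptions: each unit of work must write a real output file, a checkpoint records that file's SHA-256 as a receipt, and the loop stops cleanly once a time budget is spent so that a later CI session can resume. The names (`run_resumable`, `verify_receipts`), the JSON checkpoint format, and the idea that the checkpoint file is persisted between sessions are all hypothetical, not part of the existing pipeline.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file's contents so a receipt can be re-checked later."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_resumable(
    items: list[str],
    work: Callable[[str], Path],
    checkpoint: Path,
    budget_s: float,
) -> bool:
    """Process items until done or out of time; True means all finished."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    start = time.monotonic()
    for item in items:
        if item in done:
            continue  # finished in an earlier session
        if time.monotonic() - start >= budget_s:
            return False  # stop cleanly; a later session picks up here
        out = work(item)
        done[item] = {"path": str(out), "sha256": sha256_of(out)}
        # Save after every item so an abrupt kill loses at most one unit.
        checkpoint.write_text(json.dumps(done, indent=2))
    return True


def verify_receipts(items: list[str], checkpoint: Path) -> list[str]:
    """Return items whose outputs are missing or changed; empty means verified."""
    if not checkpoint.exists():
        return list(items)
    done = json.loads(checkpoint.read_text())
    bad = []
    for item in items:
        rec = done.get(item)
        path = Path(rec["path"]) if rec else None
        if path is None or not path.is_file() or sha256_of(path) != rec["sha256"]:
            bad.append(item)
    return bad
```

The budget is only checked between items, so this helps only if each item is small relative to the session's wall time, which is another argument for breaking steps into tiny units. A downstream stage would call `verify_receipts` before trusting any claim that a step ran, rather than relying on the model's report.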

Finally, of potential relevance, two recent sets of tools may be useful here:

1. Antigravity skills for scientific discovery: https://antigravity.google/use-cases/science
2. The "Robin" system for automated scientific discovery: https://www.nature.com/articles/s41586-026-10652-y (code is here: https://github.com/Future-House/robin)

We need to think carefully about the implementation here, and update or modify any existing open issues/plans for the pipeline, including documentation, agent prompts, and user-facing items like the website, as needed.

High quality from small models #216

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

High quality from small models #216

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions