dataforge

Rust · License: MIT

High-performance SWE-bench dataset generator that mines real GitHub pull requests and produces evaluation-ready task instances.

Built on top of SweInfinite by @unconst, adapted and extended for automated, large-scale dataset generation with difficulty-based filtering, structured LLM outputs, and full pipeline parallelism.

What it does

dataforge connects to GH Archive to discover recently merged pull requests, enriches them via the GitHub API, classifies their difficulty using an LLM, generates test specifications, and exports SWE-bench-compatible task instances — all in a single pipeline.

GH Archive (millions of events)
  → Pre-filter (merged PRs, org repos, no bots)
  → GitHub API enrichment (title, body, diff, stars, language)
  → Local filter (language, stars, files changed)
  → LLM pre-classification (easy/medium/hard triage in ~0.5s per PR)
  → Patch extraction (git clone + diff)
  → LLM test generation (fail_to_pass + pass_to_pass test specs)
  → LLM quality scoring (difficulty classification + quality gate)
  → Export (workspace.yaml + prompt.md + checks.txt)
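
The pre-filter stage, for example, needs nothing beyond the raw GH Archive event JSON. A minimal sketch of that check, assuming the serde_json crate; the exact criteria dataforge applies may differ:

use serde_json::Value;

// Keep only merged pull requests from organization-owned repos, excluding bot authors.
// Sketch only: dataforge's actual pre-filter rules may be stricter or different.
fn passes_prefilter(event: &Value) -> bool {
    let is_pr_event = event["type"] == "PullRequestEvent";
    let is_closed = event["payload"]["action"] == "closed";
    let is_merged = event["payload"]["pull_request"]["merged"]
        .as_bool()
        .unwrap_or(false);
    let is_org_repo = event["org"].is_object();
    let author = event["actor"]["login"].as_str().unwrap_or("");
    let is_bot = author.ends_with("[bot]") || author.ends_with("-bot");
    is_pr_event && is_closed && is_merged && is_org_repo && !is_bot
}

Everything that survives this cheap check is then worth spending GitHub API calls on in the enrichment stage.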

Key features

  • Real GitHub data — mines GH Archive for recently merged PRs across all public repositories. No synthetic data, no stubs, no fallbacks.
  • Difficulty filtering — pre-classifies PRs as easy/medium/hard before expensive processing. Only spends LLM tokens on candidates matching your target difficulty.
  • Aggressive parallelism — GH Archive hourly dumps are fetched 8 at a time, enrichment runs 3-way with rate limiting, LLM pre-classification 10-way, and deep processing 3-way.
  • Structured LLM outputs — uses OpenRouter's json_schema and json_object response formats for reliable JSON parsing. No regex hacking.
  • Streaming chunks — processes candidates in batches of 30 to avoid burning GitHub API rate limits while maintaining throughput.
  • JSONL tracking — auto-appends processed PRs to a JSONL file so re-runs skip already-seen PRs.
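
The JSONL tracking is simple to reason about: each processed PR becomes one JSON line, and on startup the file is read back into a set so already-seen PRs are skipped. A sketch of that idea, assuming serde_json; the pr_id field name and record shape are illustrative, not dataforge's actual schema:

use std::collections::HashSet;
use std::fs::OpenOptions;
use std::io::{BufRead, BufReader, Write};

// Load previously processed PR identifiers (e.g. "owner/repo#1234") from a JSONL file.
// Sketch only: the `pr_id` field is an assumption, not dataforge's real record layout.
fn load_seen(path: &str) -> std::io::Result<HashSet<String>> {
    let mut seen = HashSet::new();
    if let Ok(file) = std::fs::File::open(path) {
        for line in BufReader::new(file).lines() {
            let line = line?;
            if let Ok(v) = serde_json::from_str::<serde_json::Value>(&line) {
                if let Some(id) = v["pr_id"].as_str() {
                    seen.insert(id.to_string());
                }
            }
        }
    }
    Ok(seen)
}

// Append one processed PR so future runs skip it.
fn mark_seen(path: &str, pr_id: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", serde_json::json!({ "pr_id": pr_id }))
}

Appending one line per PR keeps the tracking file human-readable and easy to merge or prune by hand, which is the usual reason to prefer JSONL over a database for this kind of dedup.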

Quick start

Prerequisites

  • Rust 1.70+
  • OpenRouter API key
  • GitHub Personal Access Token (PAT) with public repo read access

Build

git clone https://github.com/CortexLM/dataforge.git
cd dataforge
cargo build --release

Generate datasets

# Set credentials
export OPENROUTER_API_KEY="sk-or-v1-..."
export GITHUB_TOKEN="ghp_..."

# Mine 10 hard tasks
cargo run -- swe mine \
  --output ./hard-tasks \
  --pr-file ./processed.jsonl \
  --max-tasks 10 \
  --difficulty hard \
  --min-stars 10 \
  --once

# Mine 5 easy tasks (faster, more candidates match)
cargo run -- swe mine \
  --output ./easy-tasks \
  --max-tasks 5 \
  --difficulty easy \
  --once

# Mine without difficulty filter (accept all)
cargo run -- swe mine \
  --output ./all-tasks \
  --max-tasks 20 \
  --once

Output structure

Each task is exported as a directory:

hard-tasks/
  owner-repo-1234/
    workspace.yaml    # Full task metadata (SWE-bench compatible)
    prompt.md         # Task description for the agent
    checks.txt        # fail_to_pass + pass_to_pass test commands
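
workspace.yaml carries the SWE-bench-style task metadata. As a rough sketch of what the exported record could look like, here is a Rust struct modelled on common SWE-bench instance fields (the field names and the serde, serde_yaml, and anyhow crates are assumptions, not dataforge's confirmed schema):

use serde::Serialize;

// Illustrative task metadata, modelled on standard SWE-bench instance fields.
// dataforge's actual workspace.yaml keys may differ.
#[derive(Serialize)]
struct TaskInstance {
    instance_id: String,        // e.g. "owner__repo-1234"
    repo: String,               // "owner/repo"
    base_commit: String,        // commit the patch applies to
    patch: String,              // gold diff extracted from the PR
    problem_statement: String,  // also rendered into prompt.md
    fail_to_pass: Vec<String>,  // tests that must flip from failing to passing
    pass_to_pass: Vec<String>,  // tests that must keep passing
}

fn export(task: &TaskInstance, dir: &std::path::Path) -> anyhow::Result<()> {
    std::fs::create_dir_all(dir)?;
    std::fs::write(dir.join("workspace.yaml"), serde_yaml::to_string(task)?)?;
    std::fs::write(dir.join("prompt.md"), &task.problem_statement)?;
    let checks = [task.fail_to_pass.clone(), task.pass_to_pass.clone()].concat();
    std::fs::write(dir.join("checks.txt"), checks.join("\n"))?;
    Ok(())
}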

CLI reference

dataforge swe mine [OPTIONS]

Options:
  -o, --output <DIR>          Output directory [default: ./swe-datasets]
  -m, --model <MODEL>         OpenRouter model [default: openai/gpt-5.2-codex:nitro]
  -n, --max-tasks <N>         Number of tasks to generate [default: 1]
  -d, --difficulty <LEVEL>    Filter: easy, medium, hard [optional]
      --min-stars <N>         Minimum repo stars [default: 20]
      --languages <LIST>      Comma-separated language filter [optional]
      --pr-file <PATH>        JSONL file to track processed PRs [optional]
      --once                  Run once then exit (vs continuous)
      --api-key <KEY>         OpenRouter API key (or OPENROUTER_API_KEY env)

Difficulty classification

The LLM classifies each PR into three tiers based on the scope and complexity of changes:

Level    Score      Typical changes                                  Examples
Easy     0.1–0.35   Typo fixes, config changes, single-file edits    Fix import, update version, rename variable
Medium   0.4–0.65   Bug fixes, feature additions, API changes        Fix race condition, add endpoint, refactor module
Hard     0.7–1.0    Cross-cutting changes, architectural refactors   New subsystem, protocol change, major migration

Pre-classification uses only the PR title and body (~100 tokens, ~0.5s). Full classification uses the complete diff and test spec.
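
The score bands map naturally onto a small enum. A sketch of that mapping with thresholds taken from the table above (type and function names are illustrative; scores in the gaps between bands, e.g. 0.35–0.4, fall into the lower tier here, and how dataforge itself treats them is not specified):

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Difficulty {
    Easy,   // 0.1–0.35
    Medium, // 0.4–0.65
    Hard,   // 0.7–1.0
}

// Map an LLM-assigned score onto a tier using the bands from the table above.
fn tier(score: f64) -> Difficulty {
    match score {
        s if s < 0.4 => Difficulty::Easy,
        s if s < 0.7 => Difficulty::Medium,
        _ => Difficulty::Hard,
    }
}

A --difficulty hard run would then keep only candidates whose classification lands in the 0.7–1.0 band.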

Architecture

Pipeline stages

Stage               Parallelism     Rate limit      Description
GH Archive fetch    8 concurrent    None            Download hourly event dumps
Pre-filter          N/A             None            Exclude bots, non-org repos, invalid PRs
Enrichment          3 concurrent    GitHub 5000/h   Fetch PR metadata via GitHub API
Local filter        N/A             None            Language, stars, files changed
Pre-classification  10 concurrent   OpenRouter      Fast LLM triage on title + body
Patch extraction    3 concurrent    None            Git clone + diff extraction
Test generation     3 concurrent    OpenRouter      LLM generates test commands
Quality scoring     3 concurrent    OpenRouter      LLM classifies difficulty and applies the quality gate
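
The per-stage concurrency in the table is the standard bounded fan-out pattern. A self-contained sketch of how a 3-way stage such as enrichment could be driven, assuming the tokio and futures crates (this is the generic idiom, not dataforge's actual stage driver):

use futures::stream::{self, StreamExt};
use std::time::Duration;

// Stand-in for one enrichment call; a real stage would hit the GitHub API here.
async fn fake_enrich(pr_number: u64) -> u64 {
    tokio::time::sleep(Duration::from_millis(100)).await;
    pr_number
}

#[tokio::main]
async fn main() {
    let candidates: Vec<u64> = (1..=10).collect();
    let enriched: Vec<u64> = stream::iter(candidates)
        .map(fake_enrich)
        .buffer_unordered(3) // never more than 3 enrichment futures in flight
        .collect()
        .await;
    println!("enriched {} candidates", enriched.len());
}

Changing the limit per stage (8 for archive downloads, 10 for pre-classification) reuses the same pattern.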

Rate limit management

GitHub API allows 5000 requests/hour per token. The pipeline processes candidates in chunks of 30 (each needing ~2 API calls for enrichment). Only candidates that pass the GH Archive pre-filter (org repos, no bots, valid PRs) are enriched — typically ~500/hour out of ~3000 merged PRs.
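
The arithmetic works out comfortably: at roughly 2 API calls per candidate, a chunk of 30 costs about 60 requests, and even a full hour's worth of candidates (~500) stays near 1,000 requests, well inside the 5,000/hour budget. A sketch of a chunked loop that watches the remaining budget (standard-library Rust; the counter and pause logic are illustrative, with the real quota reported by GitHub's x-ratelimit-remaining response header):

// Illustrative chunked processing with a simple rate-limit budget check.
// In practice the remaining quota comes from GitHub's response headers;
// here it is modelled as a plain counter for the sketch.
fn process_in_chunks(candidates: &[String], mut remaining_budget: u32) {
    const CHUNK: usize = 30;
    const CALLS_PER_CANDIDATE: u32 = 2; // approximate enrichment cost per candidate

    for chunk in candidates.chunks(CHUNK) {
        let cost = chunk.len() as u32 * CALLS_PER_CANDIDATE;
        if remaining_budget < cost {
            // A real pipeline would sleep until the rate-limit window resets.
            eprintln!("rate-limit budget too low ({remaining_budget}); pausing");
            break;
        }
        for candidate in chunk {
            // enrich(candidate) would issue ~2 GitHub API calls here
            let _ = candidate;
        }
        remaining_budget -= cost;
    }
}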

To increase throughput, use multiple GitHub tokens. The pipeline will respect rate limits automatically.

Configuration

Environment variables

Variable             Required   Description
OPENROUTER_API_KEY   Yes        OpenRouter API key for LLM calls
GITHUB_TOKEN         Yes        GitHub PAT for PR enrichment
RUST_LOG             No         Log level: error, warn, info, debug, trace
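
Loading these at startup is plain std::env; a minimal sketch with the variable names from the table above (the error handling shown is illustrative):

use std::env;

// Required credentials fail fast with a clear message; RUST_LOG is optional.
fn load_env() -> Result<(String, String), String> {
    let openrouter_key = env::var("OPENROUTER_API_KEY")
        .map_err(|_| "OPENROUTER_API_KEY is not set".to_string())?;
    let github_token = env::var("GITHUB_TOKEN")
        .map_err(|_| "GITHUB_TOKEN is not set".to_string())?;
    if let Ok(level) = env::var("RUST_LOG") {
        eprintln!("log level: {level}");
    }
    Ok((openrouter_key, github_token))
}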

Model selection

The default model is openai/gpt-5.2-codex:nitro via OpenRouter. Any OpenRouter-compatible model that supports structured outputs (json_schema or json_object) can be used:

cargo run -- swe mine --model anthropic/claude-sonnet-4 --max-tasks 5
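
Structured outputs are requested through the response_format field of the chat completions request. A hedged sketch of such a request body built with serde_json (the test-spec schema shown is an example shape, not dataforge's actual prompt or schema):

use serde_json::json;

// Illustrative request body for OpenRouter's chat completions endpoint using a
// JSON Schema response format. The schema below is an example test-spec shape only.
fn build_request(model: &str, prompt: &str) -> serde_json::Value {
    json!({
        "model": model,
        "messages": [{ "role": "user", "content": prompt }],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "test_spec",
                "strict": true,
                "schema": {
                    "type": "object",
                    "properties": {
                        "fail_to_pass": { "type": "array", "items": { "type": "string" } },
                        "pass_to_pass": { "type": "array", "items": { "type": "string" } }
                    },
                    "required": ["fail_to_pass", "pass_to_pass"],
                    "additionalProperties": false
                }
            }
        }
    })
}

The body is then POSTed to https://openrouter.ai/api/v1/chat/completions with an Authorization: Bearer $OPENROUTER_API_KEY header.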

Development

# Build
cargo build

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run -- swe mine --max-tasks 1 --once

# Check for warnings
cargo clippy

Credits

This project is built on top of SweInfinite by @unconst. The original architecture for mining GitHub PRs and generating SWE-bench-style datasets was designed by the SweInfinite team. dataforge extends it with:

  • Difficulty-based pre-classification and filtering
  • Structured LLM outputs (JSON Schema) for reliable parsing
  • Full pipeline parallelism (GH Archive, enrichment, LLM calls)
  • Streaming chunk processing with rate limit management
  • JSONL-based PR tracking (replaces SQLite)
  • Relaxed test spec parsing for cross-model compatibility

Thank you @unconst for the foundational work.

License

MIT — see LICENSE.
