MiniByte-666/Dr.SCI

Dr. SCI: Improving Data and Reward Design for Scientific Reasoning in Large Language Models

[📜 Original Paper][🤗 Reproduced Dataset][💻 Reproduced Github]

Disclaimer: This is an unofficial implementation of the Dr. SCI framework introduced in
"Improving Data and Reward Design for Scientific Reasoning in Large Language Models" [arXiv].
It is not affiliated with or endorsed by the original authors. Please refer to the original paper for authoritative details.

This repository contains my implementation of Dr. SCI's full data processing pipeline and core parts of its post-training pipeline. The reproduced dataset will be released on HuggingFace (see link above).

I reproduced strong performance gains on Qwen3-4B-Base across scientific reasoning benchmarks, as shown in the table below. My reproduced results are slightly lower than those reported in the original paper, likely because I used Qwen3-235B-A22B to generate the SFT responses rather than the original authors' SFT data. Despite this minor gap, the results suggest that the original method is robust and reproducible.

| Model | GPQA-D | SuperGPQA | GPQA-G | HLE | MMLU-Pro |
|---|---|---|---|---|---|
| **Base Model** | | | | | |
| Qwen3-4B-Base | 36.7 | 28.5 | 5.62 | 0.92 | 50.6 |
| **Thinking Mode** | | | | | |
| o1-mini | 60.0 | 45.2 | 25.8 | 5.68 | 80.3 |
| Qwen3-4B thinking | 55.9 | 42.7 | 20.9 | 4.52 | 70.4 |
| Dr. SCI-4B-think (reported) | 63.2 | 45.7 | 32.4 | 6.12 | 75.6 |
| Dr. SCI-4B-Think (reproduced) | 62.7 | 44.8 | 31.2 | 5.86 | 74.8 |
| **Non-thinking (Instruct) Mode** | | | | | |
| gpt-4o | 50.0 | 44.4 | 22.4 | 3.48 | 74.6 |
| Qwen3-4B non-thinking | 41.7 | 32.0 | 9.74 | 4.44 | 58.0 |
| Dr. SCI-4B-instruct (reported) | 56.6 | 43.6 | 24.3 | 5.36 | 71.0 |
| Dr. SCI-4B-Instruct (reproduced) | 53.5 | 42.9 | 23.7 | 5.08 | 68.8 |

Overview

This repository follows the Dr. SCI data-processing and post-training pipeline as described in the original paper, translating the paper's instructions into runnable code. The code is organized around two core components:

| Component | Description |
|---|---|
| Data_Processing | Dr. SCI's data processing pipeline, including code for processing raw data, inferring difficulty, and generating rubrics, plus a script for selecting SFT data for Exploration-Expanding SFT. |
| verl | A modified version of verl (version 0.5.0.dev) with additional support for SciRubric-guided RL and Dynamic Difficulty Curriculum. |

Installation

After cloning this repository, follow the steps below to install the required dependencies.

Prerequisites: I use PyTorch 2.7.0 with CUDA 12.6 throughout the project. Other recent PyTorch versions may also work.

I start from a prebuilt verl 0.5.0 Docker image: app-verl0.5-vllm0.10.0-mcore0.13.0-te2.2.

Installation of required packages

# Core API clients (tested with openai==1.88.0, anthropic==0.54.0)
pip install openai anthropic

# Additional dependencies
pip install blobfile==3.0.0 tabulate==0.9.0
pip install --upgrade huggingface_hub

# SGLang serving framework and FlashAttention
pip install "sglang[all]==0.4.7.post1"
pip install flash-attn --no-build-isolation

# Transformers 
pip install "transformers==4.55.4"

# DeepSpeed and experiment tracking
pip install "deepspeed<=0.16.9"
pip install wandb
# Run `wandb login` if not authenticated

# other packages. nltk is used for tokenization in Exploration-Expanding SFT
pip install jsonlines
pip install nltk
python -c "import nltk;nltk.download('punkt')"
pip install math-verify

# install verl
cd verl
pip3 install --no-deps -e .
cd ..

Dr. SCI

Note: OpenAI's API is used throughout this codebase. Make sure to set environment variables like OPENAI_API_KEY correctly before running, or manually modify each API client to match your provider. Affected scripts include:

  • Data Processing scripts: Scripts used to process the raw datasets, including: Data_Processing/data_preprocess/megascience.py, Data_Processing/data_preprocess/naturalreasoning.py, Data_Processing/data_preprocess/RaR_science.py, Data_Processing/data_preprocess/webinstruct_verified.py. By default GPT-4o is used, but you may change the model according to your requirements (GPT-4o has since been retired; see the OpenAI announcement).
  • Rubric Generation Script: The script used to generate rubrics for open-ended questions in the Dr. SCI dataset: Data_Processing/data_preprocess/DR_SCI_rubrics_generation.py. By default OpenAI o3 is used to generate rubrics. You may change the model, but a stronger model generally produces better rubrics.
  • Create GPQA-General Dataset: The script used to convert GPQA-diamond into an open-ended scientific reasoning benchmark: Data_Processing/data_preprocess/gpqa_general_test.py. By default GPT-4o is used.
  • Evaluation during RL: Online evaluation of GPQA-General benchmark during RL. The script can be found at: verl/verl/utils/reward_score/test_data.py.
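Before running any of the scripts listed above, it can help to confirm that the credentials are actually visible to the process. The following is a minimal sketch using only the standard library. `check_api_env` is a hypothetical helper and is not part of this repository; `OPENAI_BASE_URL` is the variable the official `openai` Python client reads when pointing at a custom endpoint.

```python
import os


def check_api_env(env=None):
    """Return a list of problems found in the API-related environment variables.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    problems = []

    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        problems.append("OPENAI_API_KEY is not set")

    # Optional: only validated when the user points the client at another provider.
    base_url = env.get("OPENAI_BASE_URL", "").strip()
    if base_url and not base_url.startswith(("http://", "https://")):
        problems.append("OPENAI_BASE_URL should start with http:// or https://")

    return problems


if __name__ == "__main__":
    for problem in check_api_env():
        print("WARNING:", problem)
```

An empty list means the basic variables look usable; it does not verify that the key is valid or that the configured model is available.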

Dr. SCI Data Processing (Reproducing the Dr. SCI Dataset)

Step 1: Download raw datasets and process raw samples

This step follows the original paper's data curation recipe. After downloading the raw datasets (MegaScience, NaturalReasoning, WebInstruct-verified, RaR-Science), process each source separately with the scripts in Data_Processing/data_preprocess. The pipeline filters low-quality samples, classifies subjects, assigns verification methods, and formats the data, as described in the paper.

# Example command to process MegaScience dataset
# Use similar commands for natural_reasoning, WebInstruct-verified, and RaR-Science
python Data_Processing/data_preprocess/megascience.py --input_path MegaScience/MegaScience --output_dir processed_data --output_filename MegaScience-processed.parquet --resume

# Combine subsets.
python Data_Processing/data_preprocess/combine_subsets.py --input_files processed_data/MegaScience-processed.parquet processed_data/NaturalReasoning-processed.parquet processed_data/WebInstruct-verified-processed.parquet processed_data/RaR-Science-processed.parquet --output_dir processed_data --output_filename combined_samples.parquet

# Deduplication
python Data_Processing/data_preprocess/deduplication.py --input_path processed_data/combined_samples.parquet --output_dir processed_data

Step 2: Estimate difficulty for each sample in Dr. SCI dataset

Next, difficulty is estimated for each sample. Following the paper, a proxy model (Qwen3-32B) generates 8 answer attempts per question, and the average correctness rate over those attempts is used as the difficulty value. On an 8-GPU machine, the following commands will do:

# Run model inference on 8 GPUs. See the Python script for more arguments.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python Data_Processing/difficulty_inference/vllm_inference_drSCI.py \
  --data_path processed_data/deduplicated_combined_samples.parquet \
  --model_name_or_path Qwen/Qwen3-32B \
  --tp_size 8 \
  --dp_size 1 \
  --resume

# Evaluate answer correctness.
python Data_Processing/difficulty_inference/vllm_judge_drSCI.py \
  --input_file output/Qwen3-32B_seed42_vllm_genresult.jsonl \
  --output_file output/Qwen3-32B_seed42_vllm_genresult_judged.jsonl \
  --model_name_or_path Qwen/Qwen3-32B \
  --tp_size 8 \
  --dp_size 1 \
  --use_rule_based \
  --resume

# Process and convert back to parquet (written to the path that the following steps read from)
python Data_Processing/difficulty_inference/jsonl_to_parquet.py output/Qwen3-32B_seed42_vllm_genresult_judged.jsonl processed_data/deduplicated_combined_samples_diff.parquet

# Filter out overly easy samples (accuracy 8/8)
python Data_Processing/difficulty_inference/filter_samples.py \
  --input_path processed_data/deduplicated_combined_samples_diff.parquet \
  --output_path processed_data/deduplicated_combined_samples_diff.parquet \
  --threshold 1.0
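The difficulty estimate and the easy-sample filter above can be sketched as follows. This is an illustration only, not the code in `filter_samples.py`: the helper names are hypothetical, and it assumes that `--threshold 1.0` drops samples whose correctness rate reaches the threshold (that is, 8/8 correct).

```python
from typing import Dict, List

NUM_ATTEMPTS = 8  # attempts generated per question by the proxy model


def correctness_rate(judgements: List[bool]) -> float:
    """Average correctness over all attempts for one question."""
    if not judgements:
        raise ValueError("at least one judged attempt is required")
    return sum(judgements) / len(judgements)


def filter_too_easy(rates: Dict[str, float], threshold: float = 1.0) -> Dict[str, float]:
    """Keep only questions whose correctness rate is below the threshold.

    Assumption: a rate equal to the threshold counts as "too easy".
    """
    return {qid: rate for qid, rate in rates.items() if rate < threshold}


if __name__ == "__main__":
    judged = {
        "q1": [True] * NUM_ATTEMPTS,       # solved every time, filtered out
        "q2": [True] * 6 + [False] * 2,    # 6/8 correct
        "q3": [False] * NUM_ATTEMPTS,      # never solved
    }
    rates = {qid: correctness_rate(j) for qid, j in judged.items()}
    print(filter_too_easy(rates))
```

With 8 attempts, the rate can only take the nine values 0.0, 0.125, ..., 1.0, which matches the `difficulty` field in the dataset format.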

Step 3: Generate rubrics for open-ended questions

Fine-grained rubrics are generated for each open-ended question in the Dr. SCI dataset, which is one of the key innovations in the original paper. I first split the processed data by verification method into a verifiable split and an open-ended split. Then I remove the math samples from the open-ended split (as stated in the original paper). Finally, I generate rubrics for each remaining open-ended question.

# Split by verification
python Data_Processing/data_preprocess/split_by_verification.py --input_path processed_data/deduplicated_combined_samples_diff.parquet --save

# Remove math samples from open-ended split
python Data_Processing/data_preprocess/remove_certain_domain.py --input_path processed_data/deduplicated_combined_samples_diff_general.parquet --domain math

# Generate rubrics
#  - For failed samples (e.g. internal API failures), the rubrics can be regenerated
#    by running this script again with the `--resume` flag set.
#  - Carefully adjust the `NUM_PROCESSES` constant (line 69) to suit your API rate limit.
python Data_Processing/data_preprocess/DR_SCI_rubrics_generation.py --input_path processed_data/deduplicated_combined_samples_diff_general_no_math.parquet --resume

# Samples that consistently fail to generate rubrics (after many runs) can be filtered out with:
python Data_Processing/data_preprocess/check_rar_data.py --input_path processed_data/deduplicated_combined_samples_diff_general_no_math_with_rubrics.parquet --output_path processed_data/deduplicated_combined_samples_diff_general_no_math_with_rubrics_filtered.parquet

# Finally, combine the processed verifiable samples with the open-ended split to form the Dr. SCI dataset
python Data_Processing/data_preprocess/combine_subsets.py --input_files processed_data/deduplicated_combined_samples_diff_rule.parquet processed_data/deduplicated_combined_samples_diff_general_no_math_with_rubrics_filtered.parquet --output_dir processed_data --output_filename DR_SCI_combined.parquet

Reproduced Dataset

After all steps, the resulting (reproduced) Dr. SCI dataset is located at processed_data/DR_SCI_combined.parquet. In my case, this results in 414,746 verifiable questions and 475,759 open-ended questions suitable for RL (similar to original paper's 461K verifiable and 545K open-ended).

Each data sample has the following format:

{
    "data_source": "Dr. SCI",
    "prompt": [{"role": "user", "content": "<CONTENT>"}], # Question within an instruction template
    "reward_model": {"style": "rule", "ground_truth": ground_truth, "rubric": list_of_rubrics}, # "style" is only there for verl compatibility and carries no meaning; it is not the verification method of the question.
    "extra_info": extra_info, # extra_info for verification
}

with extra_info of the following format:

{
    "question": "<QUESTION>", # original question
    "reference_answer": "<REF_ANSWER>",
    "subject": '<SUBJECT>', # ['math', 'physics', 'chemistry', 'biology', 'cs', 'medicine', 'economics', 'science']
    "match_rule": match_rule, # [True, False]. True means verifiable, False means open-ended
    "from":"<ORI_DATASET>", # Original Source of dataset: MegaScience, NaturalReasoning etc.
    "difficulty": difficulty  # one of [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]: average correctness of the proxy model over 8 attempts, so higher values mean easier questions
}
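To make the two formats above concrete, here is a small sketch that assembles one sample and splits a list of samples by verification method. `make_sample` and `split_by_match_rule` are hypothetical helpers, not functions from this repository, and using the reference answer as `ground_truth` is an assumption made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def make_sample(question: str, reference_answer: str, subject: str, match_rule: bool,
                source: str, difficulty: float, rubrics: Optional[List[Any]] = None,
                template: str = "{question}") -> Dict[str, Any]:
    """Build one record in the documented layout (hypothetical helper)."""
    extra_info = {
        "question": question,
        "reference_answer": reference_answer,
        "subject": subject,
        "match_rule": match_rule,   # True: verifiable, False: open-ended
        "from": source,
        "difficulty": difficulty,
    }
    return {
        "data_source": "Dr. SCI",
        # str.replace avoids format() errors if the template contains other braces
        "prompt": [{"role": "user", "content": template.replace("{question}", question)}],
        # Assumption: ground_truth mirrors the reference answer
        "reward_model": {"style": "rule", "ground_truth": reference_answer,
                         "rubric": list(rubrics or [])},
        "extra_info": extra_info,
    }


def split_by_match_rule(samples: List[Dict[str, Any]]) -> Tuple[list, list]:
    """Return (verifiable, open_ended) according to extra_info['match_rule']."""
    verifiable = [s for s in samples if s["extra_info"]["match_rule"]]
    open_ended = [s for s in samples if not s["extra_info"]["match_rule"]]
    return verifiable, open_ended


if __name__ == "__main__":
    a = make_sample("What is 2 + 2?", "4", "math", True, "MegaScience", 0.875)
    b = make_sample("Explain osmosis.", "Water moves across a membrane.", "biology",
                    False, "NaturalReasoning", 0.5, rubrics=["mentions membrane"])
    verifiable, open_ended = split_by_match_rule([a, b])
    print(len(verifiable), len(open_ended))
```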

As mentioned above, I provide my reproduced data at HuggingFace Repo.

Dr. SCI Post-Training (Reproducing Dr.SCI-4B's performance)

The post-training pipeline below follows the original paper's design. All the algorithmic ideas — Exploration-Expanding SFT, SciRubric-Guided RL, Dynamic Difficulty Curriculum — are wired up here.

Exploration-Expanding SFT

Exploration-Expanding SFT (EESFT) is an idea from the original paper that enhances an LLM's exploration capability before RL by maximizing the number of unique 4-grams in a fixed-size SFT dataset. This is achieved through a greedy, iterative selection of distilled SFT data.

Given data samples in my reproduced Dr. SCI dataset, responses can be distilled using open-source LLMs. After distillation and filtering, a large SFT data pool in ShareGPT format can be assembled. (Distillation code is not included here as it is straightforward.)

{
    'messages':[
        {'role':'user','content':'<QUESTION>'},
        {'role':'assistant','content':'<RESPONSE>'}
    ]
}

After distillation, data for Exploration-Expanding SFT is selected by iteratively picking samples that contribute the largest number of new unique 4-grams to the already-selected set:

# Slightly increase the number of chunks if your machine runs out of memory
#  - This will process data selection in chunks.
#  - nltk is used for tokenization. Make sure you have run `nltk.download('punkt')` beforehand
python Data_Processing/n_gram_diversity_filtering_fast_early_stop.py \
  --role assistant \
  --target_size 1000000 \
  --n_gram 4 \
  --num_chunks 1 \
  --input_path path-to-distilled-dataset.jsonl
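The greedy selection described above can be sketched in a few lines. This is an unoptimized illustration, not the code in `n_gram_diversity_filtering_fast_early_stop.py`: it uses whitespace tokenization instead of nltk, rescans every remaining candidate on each step (quadratic time), and stops early once no candidate adds a new n-gram, which is my own assumption about what "early stop" means.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 4) -> Set[Tuple[str, ...]]:
    """Set of unique n-grams in a text (whitespace tokens, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def greedy_select(responses: List[str], target_size: int, n: int = 4) -> List[int]:
    """Return indices of responses, picked greedily by number of new n-grams."""
    grams = [ngrams(r, n) for r in responses]
    seen: Set[Tuple[str, ...]] = set()
    remaining = set(range(len(responses)))
    selected: List[int] = []

    while remaining and len(selected) < target_size:
        # Candidate that adds the most unseen n-grams; ties go to the lowest index.
        best = max(sorted(remaining), key=lambda i: len(grams[i] - seen))
        if not grams[best] - seen:
            break  # nothing left would add diversity
        selected.append(best)
        seen |= grams[best]
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    pool = ["a b c d e", "a b c d e", "f g h i j k", "a b c d"]
    print(greedy_select(pool, target_size=3))
```

For a pool on the scale of a million samples, a practical implementation would need chunking and incremental gain updates, which is presumably what the `--num_chunks` option in the real script addresses.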

SFT can then be run on the selected data using any standard framework such as LLaMA-Factory or SWIFT. (SFT training code is not included, as it follows standard practice.)

RL (SciRubric-Guided RL with Dynamic Difficulty Curriculum)

Apart from Exploration-Expanding SFT, two other core components of the Dr. SCI post-training pipeline are:

  1. SciRubric-Guided RL: The generated rubrics, together with a strict final-answer check, are used to compute the reward for open-ended questions during RL. Rubric weights are predefined as {"Essential": 1.0, "Important": 0.7, "Optional": 0.3, "Pitfall": 0.9}, while the final-answer verification carries a higher weight of 4.0. Qwen3-4B serves as the GenRM during experiments. This elegant reward design yields high-quality rewards for open-ended samples and stable RL training with significant performance gains.
  2. Dynamic Difficulty Curriculum: A difficulty tracker monitors each sample's average rollout accuracy within each epoch. Training begins with samples at 6/8 and 7/8 difficulty (relatively easy). At the end of each epoch, samples whose average accuracy exceeds a threshold (0.9 by default) are replaced with harder ones. This ingenious curriculum strategy keeps training-data difficulty aligned with the policy model's evolving capability boundary, enabling efficient and effective RL training.
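To illustrate how the rubric weights and the final-answer weight could combine into a single reward, here is a minimal sketch. The weights are the ones stated above, but the aggregation rule (earned weight divided by total weight) and the treatment of a "Pitfall" item as satisfied when the response avoids the pitfall are assumptions of this sketch, not necessarily what `rubric_verifier.py` does.

```python
from typing import List, Tuple

RUBRIC_WEIGHTS = {"Essential": 1.0, "Important": 0.7, "Optional": 0.3, "Pitfall": 0.9}
FINAL_ANSWER_WEIGHT = 4.0


def rubric_reward(rubric_results: List[Tuple[str, bool]], final_answer_correct: bool) -> float:
    """Weighted fraction of satisfied checks, in [0, 1].

    rubric_results: (category, satisfied) pairs as judged by the GenRM.
    Assumption: for "Pitfall" items, satisfied=True means the pitfall was avoided.
    """
    earned = FINAL_ANSWER_WEIGHT if final_answer_correct else 0.0
    total = FINAL_ANSWER_WEIGHT
    for category, satisfied in rubric_results:
        weight = RUBRIC_WEIGHTS[category]
        total += weight
        if satisfied:
            earned += weight
    return earned / total


if __name__ == "__main__":
    judged = [("Essential", True), ("Important", False), ("Pitfall", True)]
    print(round(rubric_reward(judged, final_answer_correct=True), 4))
```

Because the final-answer weight of 4.0 exceeds the sum of typical rubric weights, a response with a wrong final answer is capped well below one with a correct answer under this normalization, regardless of how many rubric items it satisfies.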

Both components are implemented within the verl framework in this repository. See verl/verl/utils/dataset/difficulty_tracker.py and verl/verl/utils/dataset/dynamic_difficulty_dataset.py for the Dynamic Difficulty Curriculum; verl/verl/workers/genrm/rubric_verifier.py for SciRubric-Guided RL; and verl/verl/trainer/ppo/ray_trainer.py, verl/verl/trainer/main_ppo.py for training-loop integration.
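The curriculum logic can be pictured with the following simplified sketch. It is not the code in `difficulty_tracker.py` or `dynamic_difficulty_dataset.py`; the class, the `refresh_active_set` function, and the assumption that the reserve pool is already ordered so that the next-harder samples come first are all hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


class DifficultyTracker:
    """Accumulates per-sample rollout accuracy within one epoch."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._correct = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, sample_id: str, num_correct: int, num_rollouts: int) -> None:
        self._correct[sample_id] += num_correct
        self._total[sample_id] += num_rollouts

    def accuracy(self, sample_id: str) -> float:
        total = self._total.get(sample_id, 0)
        return self._correct.get(sample_id, 0) / total if total else 0.0

    def mastered(self) -> List[str]:
        """Samples whose average accuracy exceeds the threshold."""
        return [s for s in self._total if self.accuracy(s) > self.threshold]

    def reset(self) -> None:
        self._correct.clear()
        self._total.clear()


def refresh_active_set(active: List[str], reserve: List[str],
                       tracker: DifficultyTracker) -> Tuple[List[str], List[str]]:
    """End-of-epoch step: swap mastered samples for harder ones from the reserve."""
    mastered = set(tracker.mastered())
    kept = [s for s in active if s not in mastered]
    n_new = min(len(active) - len(kept), len(reserve))
    tracker.reset()  # statistics are tracked per epoch
    return kept + reserve[:n_new], reserve[n_new:]


if __name__ == "__main__":
    tracker = DifficultyTracker(threshold=0.9)
    tracker.record("easy", 8, 8)
    tracker.record("medium", 4, 8)
    print(refresh_active_set(["easy", "medium"], ["hard1", "hard2"], tracker))
```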

To launch RL training:

# Some preparations. Run if necessary
wandb login

# RL commands
cd verl
export LLM_PATH=<path-to-base-policy> # init policy llm
export VERIFIER_PATH=Qwen/Qwen3-4B # verifier model
export REQUIRES_ANSWER_VERIFY=True # True for SciRubric-Guided RL
export BS=2 # batch size per gpu

bash examples/DR_SCI/final/qwen3_long_DrSCI_adaptive_difficulty_full.sh # for thinking mode
# bash examples/DR_SCI/final/qwen3_long_DrSCI_adaptive_difficulty_full.sh # for instruct mode

Acknowledgements

All credit goes to the original authors of Dr. SCI, whose paper is a model of clarity and reproducibility. This repository is nothing more than a faithful reproduction.

For any questions about the methodology, please contact the original authors directly, as they are the real experts. This is just an enthusiastic reproduction effort, and I have no special insight to offer beyond "I followed the paper and it worked."

Citation

If you find this work useful, please cite the original paper:

@article{chen2026improving,
  title={Improving Data and Reward Design for Scientific Reasoning in Large Language Models},
  author={Chen, Zijie and Lin, Zhenghao and Liu, Xiao and Lan, Zhenzhong and Gong, Yeyun and Cheng, Peng},
  journal={arXiv preprint arXiv:2602.08321},
  year={2026}
}
