Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
DeScore is a video reward model built on a decoupled "Think-then-Score" paradigm:
- An MLLM (Qwen3-VL-8B) first generates a Chain-of-Thought (CoT) for the input video
- A learnable <Reward> query token + regression head then predicts the final scalar reward independently from the CoT generation
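The readout step can be illustrated with a small dependency-free sketch. It assumes the query token sits at the end of the sequence and that the head is a single linear layer; the token id, the toy hidden states, and the name `read_reward` are hypothetical, and the actual model in cold_start/trainer_qwen3.py may differ.

```python
from typing import Sequence

REWARD_TOKEN_ID = 9999  # hypothetical id for the "<Reward>" special token


def read_reward(
    input_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    weight: Sequence[float],
    bias: float = 0.0,
    reward_token_id: int = REWARD_TOKEN_ID,
) -> float:
    """Apply a linear head to the hidden state at the last <Reward> token."""
    positions = [i for i, tok in enumerate(input_ids) if tok == reward_token_id]
    if not positions:
        raise ValueError("sequence contains no <Reward> query token")
    query_state = hidden_states[positions[-1]]
    # Linear regression head: r = w . h + b
    return sum(w * h for w, h in zip(weight, query_state)) + bias


# Toy example: 4 tokens, hidden size 2, query token assumed to be last.
ids = [11, 12, 13, REWARD_TOKEN_ID]
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [1.0, -2.0]]
reward = read_reward(ids, states, weight=[0.5, 0.25], bias=0.1)
```

Only the hidden state at the query position feeds the head, so the scalar score does not have to be decoded as text from the CoT.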
Training follows a two-stage framework:
- Stage 1, Discriminative Cold Start (cold_start/): LoRA fine-tuning with BT loss on pre-collected CoT data, with random CoT masking for robustness
- Stage 2, Dual-Objective RL (dual_rl/): GRPO to refine CoT quality + auxiliary BT loss to calibrate the reward head
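For reference, the pairwise BT (Bradley-Terry) loss used in both stages can be sketched as follows. The masking helper is an assumption about how random CoT masking might work (dropping the whole CoT with some probability); the names `bt_loss` and `maybe_mask_cot` are hypothetical and not taken from the repository.

```python
import math
import random


def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def maybe_mask_cot(cot: str, mask_prob: float, rng: random.Random) -> str:
    """Drop the CoT with probability mask_prob (assumed masking scheme)."""
    return "" if rng.random() < mask_prob else cot


rng = random.Random(0)
cot_for_training = maybe_mask_cot("The motion in video A is smoother.", 0.3, rng)
loss = bt_loss(reward_chosen=1.2, reward_rejected=0.4)
```

The loss shrinks as the preferred video's reward exceeds the other's by a larger margin, and training on a mix of masked and unmasked inputs is one way to keep the head usable when the CoT is missing or weak.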
As shown in the figure above:
- (b) Preference Accuracy: Incorporating CoT enables Generative RMs to outperform Discriminative RMs, highlighting the necessity of explicit thinking for generalization.
- (c) Training Stability: Coupling thinking and scoring in one chain forces reliance on GRPO, causing pronounced training fluctuations, whereas training with BT loss converges smoothly.
DeScore resolves this by decoupling reasoning from scoring — the scoring module receives a direct gradient via BT loss, completely bypassing GRPO's high-variance policy gradient. This achieves +5.4% accuracy on VideoGen-Bench over the best generative baseline while using 76% less training data.
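The contrast between the two training signals can be sketched as below. This assumes the common GRPO formulation with group-normalized advantages (population standard deviation is used here as a simplification) and the standard BT loss; the function names are hypothetical and the customized verl/ code may compute these differently.

```python
import math
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style signal: each sampled CoT is scored relative to its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def bt_grad_wrt_chosen(reward_chosen: float, reward_rejected: float) -> float:
    """Derivative of -log(sigmoid(margin)) w.r.t. r_chosen: -sigmoid(-margin)."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        e = math.exp(-margin)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(margin))


# Four sampled CoTs for one prompt, with toy rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
head_grad = bt_grad_wrt_chosen(0.0, 0.0)
```

The advantages depend on which CoTs happened to be sampled in the group, while the BT gradient is a deterministic function of the two predicted rewards, which is the intuition behind routing the scoring module through BT loss.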
DeScore/
├── cold_start/ # Stage 1: Discriminative Cold Start
│ ├── train_reward.py # Main training entry
│ ├── trainer_qwen3.py # Reward model + trainer
│ ├── data.py # Data loading & collator
│ ├── utils.py # Config dataclasses
│ ├── vision_process.py # Video preprocessing
│ ├── train.sh # Launch script
│ ├── env.yaml # Conda environment
│ ├── requirements.txt # pip dependencies
│ ├── ds_config/ # DeepSpeed ZeRO configs (zero0/2/3)
│ ├── infer_utils/ # Inference utilities
│ ├── datasets/
│ │ ├── train/ # Training data (CSV + videos)
│ │ └── eval/ # Eval benchmark
│ └── model/ # Place base model here
│
├── dual_rl/ # Stage 2: Dual-Objective RL
│ ├── inference.py # Inference entry point
│ ├── inference.sh # Inference launch script
│ ├── env.yaml # Conda environment
│ ├── requirements.txt # pip dependencies
│ ├── examples/
│ │ ├── config.yaml # Full training config
│ │ ├── train.sh # RL training launch script
│ │ ├── format_prompt/ # Jinja2 prompt templates
│ │ └── reward_function/
│ │   └── r1ta_subdim_head.py # Composite reward function
│ ├── verl/ # Customized RL framework (Ray + FSDP + vLLM)
│ └── data/ # Place train/test CSVs here
│
└── figure/ # Figures for README
We recommend using two separate environments for the two stages.
Stage 1, Cold Start:
conda env create -f cold_start/env.yaml
conda activate Descore_cs
pip install -r cold_start/requirements.txt

Stage 2, Dual-Objective RL:
conda env create -f dual_rl/env.yaml
conda activate qwen3
pip install -r dual_rl/requirements.txt

Download the base model into cold_start/model/:
cd cold_start/model
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
cd ../..

Organize your data under cold_start/datasets/train/:
cold_start/datasets/train/
├── your_data.csv
└── videos/
    ├── video_1_A.mp4
    ├── video_1_B.mp4
    └── ...
See cold_start/datasets/train/README.md and cold_start/datasets/train/example.csv for the full CSV schema.
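Before launching training, it can help to confirm that every video referenced in the CSV is present under videos/. The sketch below is a generic check, not part of the repository: the caller passes the names of the columns that hold video file names, since the actual schema is defined in example.csv and is not assumed here.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def find_missing_videos(csv_path: str, video_dir: str, path_columns: Iterable[str]) -> List[str]:
    """Return video file names referenced in the CSV that are absent from video_dir."""
    columns = list(path_columns)
    videos = Path(video_dir)
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for col in columns:
                # Compare by file name only, so absolute or relative prefixes are ignored.
                name = Path(row[col]).name
                if not (videos / name).is_file():
                    missing.append(name)
    return missing
```

For example, with hypothetical column names `path_A` and `path_B`, calling `find_missing_videos("cold_start/datasets/train/your_data.csv", "cold_start/datasets/train/videos", ["path_A", "path_B"])` would list any files that still need to be copied in.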
cd cold_start
# Edit train.sh: set --model_name_or_path, --meta_data, --output_dir
bash train.sh

Output: checkpoint-*/ (LoRA weights + rm_head.pth) under --output_dir.
See cold_start/README.md for all training arguments.
cd dual_rl
# Edit examples/train.sh: set MODEL_PATH and HEAD_PATH to the Stage 1 checkpoint
# Place train.csv and test.csv in dual_rl/data/
bash examples/train.sh

Output: results/{EXP_NAME}/global_step_{N}/actor/ (HuggingFace weights + rm_head.pth).
See dual_rl/README.md for all training arguments.
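To pick the most recent Stage 2 checkpoint for inference, a small helper along these lines may be convenient. It is a sketch that assumes only the results/{EXP_NAME}/global_step_{N}/actor/ layout described above; the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def latest_actor_dir(exp_dir: str) -> Optional[Path]:
    """Return actor/ under the highest-numbered global_step_N, or None if absent."""
    root = Path(exp_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"global_step_(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if match and (child / "actor").is_dir():
            candidates.append((int(match.group(1)), child / "actor"))
    if not candidates:
        return None
    # Compare step numbers as integers so that step 100 ranks above step 20.
    return max(candidates, key=lambda item: item[0])[1]
```

Given the returned directory `actor`, the inference arguments would then be `actor / "huggingface"` for --model_ckpt and `actor / "rm_head.pth"` for --rm_ckpt.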
cd dual_rl
# Edit inference.sh: set --model_ckpt and --rm_ckpt to the Stage 2 checkpoint
bash inference.sh

Or run directly:
python dual_rl/inference.py \
--data_path /path/to/eval.csv \
--model_ckpt /path/to/checkpoint/actor/huggingface \
--rm_ckpt /path/to/checkpoint/actor/rm_head.pth \
--output results/output.csv \
--batch_size 4 \
--bench_type tabench \
--special_token "<Reward>"

- VideoAlign: Cold-start training framework
- EasyR1: Distributed RL training framework
