DeScore

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

[📄 Paper] [🌐 Project Page]

Overview

DeScore is a video reward model built on a decoupled "Think-then-Score" paradigm:

An MLLM (Qwen3-VL-8B) first generates a Chain-of-Thought (CoT) for the input video.
A learnable <Reward> query token and a regression head then predict the final scalar reward, independently of the CoT generation.
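The scoring step can be pictured with a small sketch. It assumes the <Reward> token is placed after the generated CoT and that the head is a single linear layer over that token's hidden state; both points, and every name below, are illustrative assumptions rather than the repository's actual code (the real hidden states would come from Qwen3-VL).

```python
from typing import Sequence

REWARD_TOKEN = "<Reward>"  # special token named in this README


def reward_from_hidden(
    tokens: Sequence[str],
    hidden: Sequence[Sequence[float]],
    weight: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Apply a linear head to the hidden state at the last <Reward> position.

    Hypothetical helper: the head shape and token placement are assumptions.
    """
    if len(tokens) != len(hidden):
        raise ValueError("tokens and hidden states must have the same length")
    positions = [i for i, tok in enumerate(tokens) if tok == REWARD_TOKEN]
    if not positions:
        raise ValueError(f"{REWARD_TOKEN} not found in sequence")
    h = hidden[positions[-1]]
    return sum(w * x for w, x in zip(weight, h)) + bias


# Toy sequence: video placeholder, generated CoT, then the query token.
tokens = ["<video>", "The", "motion", "is", "smooth", REWARD_TOKEN]
hidden = [[0.0, 0.0]] * 5 + [[0.5, -1.0]]  # made-up 2-d hidden states
score = reward_from_hidden(tokens, hidden, weight=[2.0, 1.0], bias=0.25)
print(score)  # 0.25
```

Because the score is read from a hidden state rather than parsed from generated text, it stays differentiable with respect to the head parameters, which is what allows a pairwise loss to train it directly.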

Training follows a two-stage framework:

Stage 1: Discriminative Cold Start (cold_start/). LoRA fine-tuning with a Bradley-Terry (BT) loss on pre-collected CoT data, with random CoT masking for robustness.
Stage 2: Dual-Objective RL (dual_rl/). GRPO (Group Relative Policy Optimization) refines CoT quality, while an auxiliary BT loss calibrates the reward head.
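For reference, the BT loss used in both stages is the standard pairwise objective -log(sigmoid(r_chosen - r_rejected)). The sketch below is a plain-Python version for a single preference pair; the function name is hypothetical and the repository's trainer presumably computes this on batched tensors.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Equivalent to log(1 + exp(-margin)), split by sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the preferred video is scored further above the other.
print(round(bt_loss(0.0, 0.0), 4))  # 0.6931 (= log 2, no preference expressed)
print(bt_loss(2.0, 0.0) < bt_loss(0.0, 0.0) < bt_loss(0.0, 2.0))  # True
```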

Motivation

As shown in the figure above:

(b) Preference Accuracy: Incorporating CoT enables Generative RMs to outperform Discriminative RMs, highlighting the necessity of explicit thinking for generalization.
(c) Training Stability: Coupling thinking and scoring in one chain forces reliance on GRPO, causing pronounced training fluctuations, whereas training with BT loss converges smoothly.

DeScore resolves this by decoupling reasoning from scoring: the scoring module receives a direct gradient via BT loss, completely bypassing GRPO's high-variance policy gradient. This yields +5.4% accuracy on VideoGen-Bench over the best generative baseline while using 76% less training data.
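The contrast between the two training signals can be sketched in a few lines. The BT gradient with respect to the chosen reward has a closed form, while GRPO weights each sampled CoT by a group-normalized advantage that depends on the other samples drawn for the same input. The normalization shown here (population standard deviation plus a small eps) is a common formulation and an assumption on our part, not something taken from this repository.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def bt_grad_chosen(r_chosen: float, r_rejected: float) -> float:
    """Derivative of -log(sigmoid(margin)) w.r.t. r_chosen: -sigmoid(-margin)."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        z = math.exp(-margin)  # in (0, 1], never overflows
        return -z / (1.0 + z)
    return -1.0 / (1.0 + math.exp(margin))


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(bt_grad_chosen(0.0, 0.0))                # -0.5
print(group_advantages([0.5, 0.5, 0.5]))       # [0.0, 0.0, 0.0]
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # positive for 1.0, negative for 0.0
```

A group whose samples all receive the same reward yields zero advantages, so the policy-gradient signal depends on which CoTs happen to be sampled; the BT gradient has no such dependence.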

Repository Structure

DeScore/
├── cold_start/                    # Stage 1: Discriminative Cold Start
│   ├── train_reward.py            # Main training entry
│   ├── trainer_qwen3.py           # Reward model + trainer
│   ├── data.py                    # Data loading & collator
│   ├── utils.py                   # Config dataclasses
│   ├── vision_process.py          # Video preprocessing
│   ├── train.sh                   # Launch script
│   ├── env.yaml                   # Conda environment
│   ├── requirements.txt           # pip dependencies
│   ├── ds_config/                 # DeepSpeed ZeRO configs (zero0/2/3)
│   ├── infer_utils/               # Inference utilities
│   ├── datasets/
│   │   ├── train/                 # Training data (CSV + videos)
│   │   └── eval/                  # Eval benchmark
│   └── model/                     # Place base model here
│
├── dual_rl/                       # Stage 2: Dual-Objective RL
│   ├── inference.py               # Inference entry point
│   ├── inference.sh               # Inference launch script
│   ├── env.yaml                   # Conda environment
│   ├── requirements.txt           # pip dependencies
│   ├── examples/
│   │   ├── config.yaml            # Full training config
│   │   ├── train.sh               # RL training launch script
│   │   ├── format_prompt/         # Jinja2 prompt templates
│   │   └── reward_function/
│   │       └── r1ta_subdim_head.py  # Composite reward function
│   ├── verl/                      # Customized RL framework (Ray + FSDP + vLLM)
│   └── data/                      # Place train/test CSVs here
│
└── figure/                        # Figures for README

Quick Start

1. Install Dependencies

We recommend using two separate environments for the two stages.

Stage 1 — Cold Start:

conda env create -f cold_start/env.yaml
conda activate Descore_cs
pip install -r cold_start/requirements.txt

Stage 2 — Dual-Objective RL:

conda env create -f dual_rl/env.yaml
conda activate qwen3
pip install -r dual_rl/requirements.txt

2. Download Base Model

cd cold_start/model
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
cd ../..

3. Prepare Training Data

Organize your data under cold_start/datasets/train/:

cold_start/datasets/train/
├── your_data.csv
└── videos/
    ├── video_1_A.mp4
    ├── video_1_B.mp4
    └── ...

See cold_start/datasets/train/README.md and cold_start/datasets/train/example.csv for the full CSV schema.
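Before training, it can help to confirm that every video pair is complete. The sketch below assumes the `_A` / `_B` filename suffixes from the example listing mark the two sides of a preference pair; that convention is inferred from the listing, and the CSV schema remains the authoritative source for how videos are referenced.

```python
from pathlib import Path
from typing import Dict, List, Set


def find_unpaired_videos(video_dir: Path) -> List[str]:
    """Return filename prefixes that lack either the _A or the _B clip."""
    sides: Dict[str, Set[str]] = {}
    for path in sorted(video_dir.glob("*.mp4")):
        prefix, sep, side = path.stem.rpartition("_")
        if sep and side in {"A", "B"}:
            sides.setdefault(prefix, set()).add(side)
    return sorted(p for p, found in sides.items() if found != {"A", "B"})


video_dir = Path("cold_start/datasets/train/videos")
if video_dir.is_dir():
    print("unpaired:", find_unpaired_videos(video_dir) or "none")
```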

4. Stage 1 — Cold Start Training

cd cold_start
# Edit train.sh: set --model_name_or_path, --meta_data, --output_dir
bash train.sh

Output: checkpoint-*/ (LoRA weights + rm_head.pth) under --output_dir.
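Stage 2 needs the path of a Stage 1 checkpoint. The helper below is a minimal sketch for picking the most recent one; it assumes the `checkpoint-*` suffix is a numeric training step, which the README does not state explicitly, and both function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, if any."""
    best_step, best_dir = -1, None
    for path in output_dir.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best_step, best_dir = int(match.group(1)), path
    return best_dir


def has_reward_head(checkpoint: Path) -> bool:
    """Check that the reward head weights were saved next to the LoRA weights."""
    return (checkpoint / "rm_head.pth").is_file()
```

Comparing steps as integers avoids the pitfall of a string sort ranking `checkpoint-50` above `checkpoint-200`.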

See cold_start/README.md for all training arguments.

5. Stage 2 — Dual-Objective RL

cd dual_rl
# Edit examples/train.sh: set MODEL_PATH and HEAD_PATH to the Stage 1 checkpoint
# Place train.csv and test.csv in dual_rl/data/
bash examples/train.sh

Output: results/{EXP_NAME}/global_step_{N}/actor/ (HuggingFace weights + rm_head.pth).

See dual_rl/README.md for all training arguments.

6. Inference

cd dual_rl
# Edit inference.sh: set --model_ckpt and --rm_ckpt to the Stage 2 checkpoint
bash inference.sh

Or run directly from the repository root:

python dual_rl/inference.py \
    --data_path  /path/to/eval.csv \
    --model_ckpt /path/to/checkpoint/actor/huggingface \
    --rm_ckpt    /path/to/checkpoint/actor/rm_head.pth \
    --output     results/output.csv \
    --batch_size 4 \
    --bench_type tabench \
    --special_token "<Reward>"
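When scoring several checkpoints, the same command can be assembled programmatically. This sketch uses only the flags shown above and assumes the `actor/` directory contains a `huggingface/` subfolder and `rm_head.pth`, as the example paths suggest; the experiment name and step number are placeholders.

```python
import shlex
from pathlib import Path
from typing import List


def build_inference_cmd(
    actor_dir: Path,
    data_path: str,
    output: str,
    batch_size: int = 4,
    bench_type: str = "tabench",
) -> List[str]:
    """Assemble the inference.py argument list for one Stage 2 checkpoint."""
    return [
        "python", "dual_rl/inference.py",
        "--data_path", data_path,
        "--model_ckpt", str(actor_dir / "huggingface"),
        "--rm_ckpt", str(actor_dir / "rm_head.pth"),
        "--output", output,
        "--batch_size", str(batch_size),
        "--bench_type", bench_type,
        "--special_token", "<Reward>",
    ]


# "my_exp" and "global_step_100" are placeholder names, not real outputs.
cmd = build_inference_cmd(
    Path("results/my_exp/global_step_100/actor"),
    data_path="/path/to/eval.csv",
    output="results/output.csv",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; keeping the arguments as a list avoids shell-quoting problems with `<Reward>`.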

Acknowledgements

VideoAlign: Cold-start training framework
EasyR1: Distributed RL training framework
