KlingAIResearch/DeScore

DeScore

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

[📄 Paper]   [🌐 Project Page]


Overview

DeScore is a video reward model built on a decoupled "Think-then-Score" paradigm:

  1. An MLLM (Qwen3-VL-8B) first generates a Chain-of-Thought (CoT) for the input video
  2. A learnable <Reward> query token + regression head then predicts the final scalar reward independently from the CoT generation
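
The two steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `generate_cot`, `encode`, `weight`, and `bias` are hypothetical stand-ins for the MLLM and the trained regression head. The sketch also assumes that the <Reward> token is appended after the CoT and that the head reads the hidden state at that position.

```python
from typing import Callable, List, Sequence, Tuple

REWARD_TOKEN = "<Reward>"  # special token named in this README


def build_scoring_input(prompt_tokens: List[str], cot_tokens: List[str]) -> List[str]:
    """Assumed layout: prompt, then the generated CoT, then the reward query token."""
    return prompt_tokens + cot_tokens + [REWARD_TOKEN]


def linear_head(hidden: Sequence[float], weight: Sequence[float], bias: float) -> float:
    """Regression head sketched as one linear layer: hidden vector -> scalar."""
    return sum(h * w for h, w in zip(hidden, weight)) + bias


def think_then_score(
    prompt_tokens: List[str],
    generate_cot: Callable[[List[str]], List[str]],        # hypothetical: MLLM decoding
    encode: Callable[[List[str]], List[Sequence[float]]],  # hypothetical: one vector per token
    weight: Sequence[float],
    bias: float,
) -> Tuple[List[str], float]:
    cot_tokens = generate_cot(prompt_tokens)          # step 1: think
    tokens = build_scoring_input(prompt_tokens, cot_tokens)
    hidden_states = encode(tokens)
    reward_hidden = hidden_states[len(tokens) - 1]    # position of the <Reward> token
    return cot_tokens, linear_head(reward_hidden, weight, bias)  # step 2: score
```

Because the scalar comes from the head rather than from decoded text, the scoring path can be trained with a regression-style loss while the CoT is trained separately.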

Training follows a two-stage framework:

  • Stage 1 — Discriminative Cold Start (cold_start/): LoRA fine-tuning with BT loss on pre-collected CoT data, with random CoT masking for robustness
  • Stage 2 — Dual-Objective RL (dual_rl/): GRPO to refine CoT quality + auxiliary BT loss to calibrate the reward head
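
As a rough sketch of the objectives named above: BT here is the Bradley-Terry pairwise loss, and GRPO standardizes rewards within a group of sampled CoTs. The functions below use plain floats rather than tensors and are not taken from the repository. `maybe_mask_cot` assumes that masking drops the whole CoT with some probability; the actual masking scheme in cold_start/ may differ. Using the population standard deviation and `eps` is also an assumption.

```python
import math
import random
from statistics import mean, pstdev
from typing import List


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Equivalent to softplus(-margin), written to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def maybe_mask_cot(cot: str, mask_prob: float, rng: random.Random) -> str:
    """Assumed masking scheme: drop the entire CoT with probability mask_prob."""
    return "" if rng.random() < mask_prob else cot


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The BT term depends only on the two scalar rewards, so its gradient reaches the reward head directly. The group-relative advantages weight the sampled CoTs in the policy update.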

Motivation

[Figure: DeScore motivation]

As shown in the figure above:

  • (b) Preference Accuracy: Incorporating CoT enables Generative RMs to outperform Discriminative RMs, highlighting the necessity of explicit thinking for generalization.
  • (c) Training Stability: Coupling thinking and scoring in one chain forces reliance on GRPO, which causes pronounced training fluctuations. By contrast, training with Bradley-Terry (BT) loss converges smoothly.

DeScore resolves this by decoupling reasoning from scoring: the scoring module receives a direct gradient via BT loss, bypassing GRPO's high-variance policy gradient entirely. This achieves +5.4% accuracy on VideoGen-Bench over the best generative baseline while using 76% less training data.


Repository Structure

DeScore/
├── cold_start/                    # Stage 1: Discriminative Cold Start
│   ├── train_reward.py            # Main training entry
│   ├── trainer_qwen3.py           # Reward model + trainer
│   ├── data.py                    # Data loading & collator
│   ├── utils.py                   # Config dataclasses
│   ├── vision_process.py          # Video preprocessing
│   ├── train.sh                   # Launch script
│   ├── env.yaml                   # Conda environment
│   ├── requirements.txt           # pip dependencies
│   ├── ds_config/                 # DeepSpeed ZeRO configs (zero0/2/3)
│   ├── infer_utils/               # Inference utilities
│   ├── datasets/
│   │   ├── train/                 # Training data (CSV + videos)
│   │   └── eval/                  # Eval benchmark
│   └── model/                     # Place base model here
│
├── dual_rl/                       # Stage 2: Dual-Objective RL
│   ├── inference.py               # Inference entry point
│   ├── inference.sh               # Inference launch script
│   ├── env.yaml                   # Conda environment
│   ├── requirements.txt           # pip dependencies
│   ├── examples/
│   │   ├── config.yaml            # Full training config
│   │   ├── train.sh               # RL training launch script
│   │   ├── format_prompt/         # Jinja2 prompt templates
│   │   └── reward_function/
│   │       └── r1ta_subdim_head.py  # Composite reward function
│   ├── verl/                      # Customized RL framework (Ray + FSDP + vLLM)
│   └── data/                      # Place train/test CSVs here
│
└── figure/                        # Figures for README

Quick Start

1. Install Dependencies

We recommend using two separate environments for the two stages.

Stage 1 — Cold Start:

conda env create -f cold_start/env.yaml
conda activate Descore_cs
pip install -r cold_start/requirements.txt

Stage 2 — Dual-Objective RL:

conda env create -f dual_rl/env.yaml
conda activate qwen3
pip install -r dual_rl/requirements.txt

2. Download Base Model

cd cold_start/model
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
cd ../..

3. Prepare Training Data

Organize your data under cold_start/datasets/train/:

cold_start/datasets/train/
├── your_data.csv
└── videos/
    ├── video_1_A.mp4
    ├── video_1_B.mp4
    └── ...

See cold_start/datasets/train/README.md and cold_start/datasets/train/example.csv for the full CSV schema.
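
Before launching training, it can help to check that every video referenced in the CSV exists under videos/. The helper below is a hedged sketch. The column names are passed in by the caller because the real schema is defined in example.csv, not here. It also assumes that CSV entries resolve to files directly inside the videos/ directory.

```python
import csv
from pathlib import Path
from typing import List, Sequence, Tuple


def find_missing_videos(
    csv_path: Path,
    video_dir: Path,
    video_columns: Sequence[str],  # caller-supplied; take the real names from example.csv
) -> List[Tuple[int, str]]:
    """Return (row_number, entry) for every referenced video that is not on disk."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Row 1 is the header, so data rows start at 2.
        for row_number, row in enumerate(csv.DictReader(f), start=2):
            for column in video_columns:
                entry = (row.get(column) or "").strip()
                # Assumption: only the file name matters and videos live in video_dir.
                if entry and not (video_dir / Path(entry).name).is_file():
                    missing.append((row_number, entry))
    return missing
```

An empty return value means every non-empty entry in the listed columns was found.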

4. Stage 1 — Cold Start Training

cd cold_start
# Edit train.sh: set --model_name_or_path, --meta_data, --output_dir
bash train.sh

Output: checkpoint-*/ (LoRA weights + rm_head.pth) under --output_dir.

See cold_start/README.md for all training arguments.
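
To pick the checkpoint to pass on to Stage 2, a small helper like the one below can select the most recent checkpoint-*/ directory. It is a sketch that assumes the suffix is a numeric training step (for example, checkpoint-200) and that rm_head.pth sits directly inside the checkpoint directory. Both function names are illustrative.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the largest step, or None."""
    best_step, best_dir = -1, None
    for path in output_dir.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare steps numerically so that checkpoint-1000 ranks above checkpoint-200.
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_dir = int(match.group(1)), path
    return best_dir


def has_reward_head(checkpoint_dir: Path) -> bool:
    """Check that the reward head weights were saved alongside the LoRA weights."""
    return (checkpoint_dir / "rm_head.pth").is_file()
```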

5. Stage 2 — Dual-Objective RL

cd dual_rl
# Edit examples/train.sh: set MODEL_PATH and HEAD_PATH to the Stage 1 checkpoint
# Place train.csv and test.csv in dual_rl/data/
bash examples/train.sh

Output: results/{EXP_NAME}/global_step_{N}/actor/ (HuggingFace weights + rm_head.pth).

See dual_rl/README.md for all training arguments.

6. Inference

cd dual_rl
# Edit inference.sh: set --model_ckpt and --rm_ckpt to the Stage 2 checkpoint
bash inference.sh

Or run directly:

python dual_rl/inference.py \
    --data_path  /path/to/eval.csv \
    --model_ckpt /path/to/checkpoint/actor/huggingface \
    --rm_ckpt    /path/to/checkpoint/actor/rm_head.pth \
    --output     results/output.csv \
    --batch_size 4 \
    --bench_type tabench \
    --special_token "<Reward>"
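
The --model_ckpt and --rm_ckpt paths in the command above follow the Stage 2 output layout (results/{EXP_NAME}/global_step_{N}/actor/). As a sketch, the same command can be assembled in Python for a given experiment directory and step. The function name and its defaults are illustrative, and the flags are copied from the command above.

```python
from pathlib import Path
from typing import List


def inference_command(
    exp_dir: Path,            # e.g. results/{EXP_NAME}
    step: int,                # N in global_step_{N}
    data_path: str,
    output: str,
    batch_size: int = 4,
    bench_type: str = "tabench",
) -> List[str]:
    """Build the argument list for dual_rl/inference.py from a Stage 2 run."""
    actor = exp_dir / f"global_step_{step}" / "actor"
    return [
        "python", "dual_rl/inference.py",
        "--data_path", data_path,
        "--model_ckpt", str(actor / "huggingface"),
        "--rm_ckpt", str(actor / "rm_head.pth"),
        "--output", output,
        "--batch_size", str(batch_size),
        "--bench_type", bench_type,
        "--special_token", "<Reward>",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues around the <Reward> token.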

Acknowledgements

  • VideoAlign: Cold-start training framework
  • EasyR1: Distributed RL training framework
