Hi, my name is Phil.
First of all, thanks for the paper. I really like the idea of reducing variance using quantization!
It might be my misunderstanding, but I have a question about the KL divergence term in the paper, described below:
In the paper, the reward formula writes the KL penalty as
$$KL(p_{0} \lVert p_\theta)$$
Since $p_{0}$ is the initial policy (the pretrained LM) and $p_\theta$ is the current policy (the model we are training right now), shouldn't the order be
$$KL(p_\theta \lVert p_{0})$$
instead?
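To make my question concrete, here is how I understand the two directions (my own notation, with $y$ a continuation sampled for prompt $x$):

$$\mathrm{KL}(p_\theta \lVert p_0) = \mathbb{E}_{y \sim p_\theta}\!\left[\log \frac{p_\theta(y \mid x)}{p_0(y \mid x)}\right], \qquad \mathrm{KL}(p_0 \lVert p_\theta) = \mathbb{E}_{y \sim p_0}\!\left[\log \frac{p_0(y \mid x)}{p_\theta(y \mid x)}\right]$$

Since the penalty is estimated from samples generated by the current policy $p_\theta$, I would expect the first form, $KL(p_\theta \lVert p_0)$, to be the one that appears in the reward.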
I attached the reference below.
Reference: Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)
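In that reference, the per-sample penalized reward is $r(x,y) - \beta\,[\log p_\theta(y \mid x) - \log p_0(y \mid x)]$, which, averaged over samples from the current policy, estimates $\beta\,\mathrm{KL}(p_\theta \lVert p_0)$. A minimal sketch of how I understand that computation (hypothetical names; assumes the log-probabilities are summed over the tokens of the sampled continuation):

```python
def kl_penalized_reward(task_reward: float,
                        logp_current: float,
                        logp_initial: float,
                        beta: float) -> float:
    """Per-sample penalized reward as in Ziegler et al. (2019):

        R(x, y) = r(x, y) - beta * [log p_theta(y|x) - log p_0(y|x)]

    `logp_current` and `logp_initial` are the summed token log-probabilities
    of the sampled continuation y under the current and initial policies.
    Averaged over y ~ p_theta, the penalty term estimates
    beta * KL(p_theta || p_0), i.e. the KL from the current policy to the
    initial one, which is why I expected that order in the formula.
    """
    return task_reward - beta * (logp_current - logp_initial)
```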