
[KL-divergence formula] Paper #2

Open
philhoonoh opened this issue May 19, 2023 · 0 comments

Hi, my name is Phil.
First of all, thanks for the paper. I really like the idea of reducing variance using quantization!

It might be my misunderstanding, but I have a question regarding the KL divergence in the paper, described below.
In the paper, the reward formula is:
[image: reward formula from the paper]
Here it says

$$\mathrm{KL}(p_{0} \,\|\, p_\theta).$$

Since $p_{0}$ is the initial policy (the pretrained LM) and $p_\theta$ is the current policy (the model we are training right now), shouldn't the right order be

$$\mathrm{KL}(p_\theta \,\|\, p_{0})?$$
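To make the question concrete, here is the standard definition of the two orderings (my own writing, not a quote from the paper):

$$\mathrm{KL}(p_\theta \,\|\, p_{0}) = \mathbb{E}_{y \sim p_\theta}\left[\log \frac{p_\theta(y \mid x)}{p_{0}(y \mid x)}\right], \qquad \mathrm{KL}(p_{0} \,\|\, p_\theta) = \mathbb{E}_{y \sim p_{0}}\left[\log \frac{p_{0}(y \mid x)}{p_\theta(y \mid x)}\right]$$

Since the penalty is meant to keep the model being trained close to the pretrained LM, and the samples being scored come from $p_\theta$, the expectation should be taken under $p_\theta$, which corresponds to $\mathrm{KL}(p_\theta \,\|\, p_{0})$.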

I attached the reference example below.

Reference:
[image: reward formula with KL penalty term from the referenced paper]
Fine-Tuning Language Models from Human Preferences
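For reference, here is a minimal sketch of how this penalty is usually computed in practice (my own illustration with hypothetical names, not code from this repository): the per-sample penalty is the log-probability ratio under the current policy versus the frozen pretrained LM, and its average over samples drawn from $p_\theta$ is a Monte Carlo estimate of $\mathrm{KL}(p_\theta \,\|\, p_{0})$.

```python
import torch

def kl_penalized_reward(
    reward: torch.Tensor,         # r(x, y) from the reward model, shape (batch,)
    logprob_theta: torch.Tensor,  # log p_theta(y | x) under the policy being trained
    logprob_0: torch.Tensor,      # log p_0(y | x) under the frozen pretrained LM
    beta: float = 0.1,            # KL coefficient
) -> torch.Tensor:
    # Since y is sampled from p_theta, the batch mean of
    # (logprob_theta - logprob_0) estimates KL(p_theta || p_0).
    kl_per_sample = logprob_theta - logprob_0
    return reward - beta * kl_per_sample

# Example usage with dummy values.
r = torch.tensor([1.0, 0.5])
lp_theta = torch.tensor([-2.0, -3.0])
lp_0 = torch.tensor([-2.5, -2.8])
print(kl_penalized_reward(r, lp_theta, lp_0))
```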
