Hi, my name is Phil.
First of all, thanks for the paper. I really like the idea of reducing variance using quantization!
It might be my misunderstanding, but I have a question about the KL divergence term in the paper, described below:
In the paper, the reward formula writes the KL penalty as
$$KL(p_{0} \lVert p_\theta)$$
Since $p_{0}$ is the initial policy (the pretrained LM) and $p_\theta$ is the current policy (the model we are training right now), shouldn't the order be
$$KL(p_\theta \lVert p_{0})$$
instead?
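To make my question concrete, here is how I understand the two directions (my own notation, with $y$ a continuation sampled for prompt $x$):

$$\mathrm{KL}(p_\theta \lVert p_0) = \mathbb{E}_{y \sim p_\theta}\!\left[\log \frac{p_\theta(y \mid x)}{p_0(y \mid x)}\right], \qquad \mathrm{KL}(p_0 \lVert p_\theta) = \mathbb{E}_{y \sim p_0}\!\left[\log \frac{p_0(y \mid x)}{p_\theta(y \mid x)}\right]$$

Since the penalty is estimated from samples generated by the current policy $p_\theta$, I would expect the first form, $KL(p_\theta \lVert p_0)$, to be the one that appears in the reward.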
I attached the reference below.
Reference: Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)
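In that reference, the per-sample penalized reward is $r(x,y) - \beta\,[\log p_\theta(y \mid x) - \log p_0(y \mid x)]$, which, averaged over samples from the current policy, estimates $\beta\,\mathrm{KL}(p_\theta \lVert p_0)$. A minimal sketch of how I understand that computation (hypothetical names; assumes the log-probabilities are summed over the tokens of the sampled continuation):

```python
def kl_penalized_reward(task_reward: float,
                        logp_current: float,
                        logp_initial: float,
                        beta: float) -> float:
    """Per-sample penalized reward as in Ziegler et al. (2019):

        R(x, y) = r(x, y) - beta * [log p_theta(y|x) - log p_0(y|x)]

    `logp_current` and `logp_initial` are the summed token log-probabilities
    of the sampled continuation y under the current and initial policies.
    Averaged over y ~ p_theta, the penalty term estimates
    beta * KL(p_theta || p_0), i.e. the KL from the current policy to the
    initial one, which is why I expected that order in the formula.
    """
    return task_reward - beta * (logp_current - logp_initial)
```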