Hi, my name is Phil.
First of all, thanks for the paper. I really like the idea of reducing variance using quantization!
It might be my misunderstanding, but I have a question regarding the KL divergence in the paper, described below:
In the paper, the reward formula is

R(x, y) = r(x, y) - β · log( π(y|x) / ρ(y|x) )

Here it says that ρ is the initial policy (the pretrained LM) and π is the current policy (the model we are training right now), so in expectation the penalty term is β · KL(π, ρ).
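Just to make the penalty concrete, here is a minimal sketch of how the penalized reward for a single sample could be computed (the function name, variable names, and the beta = 0.1 value are my own placeholders, not from the paper):

```python
# Minimal sketch of the KL-penalized reward
#   R(x, y) = r(x, y) - beta * log( pi(y|x) / rho(y|x) )
# Names here are illustrative placeholders, not from the paper or any codebase.

import math

def penalized_reward(reward: float,
                     logprob_current: float,   # log pi(y|x), current policy being trained
                     logprob_initial: float,   # log rho(y|x), frozen pretrained LM
                     beta: float = 0.1) -> float:
    """Reward minus the KL penalty term beta * (log pi(y|x) - log rho(y|x))."""
    log_ratio = logprob_current - logprob_initial  # log( pi(y|x) / rho(y|x) ) for this sample
    return reward - beta * log_ratio

# Example: the penalty is positive when the current policy assigns the sampled
# output a higher probability than the initial policy does.
r = penalized_reward(reward=1.0,
                     logprob_current=math.log(0.4),
                     logprob_initial=math.log(0.2))
print(r)  # 1.0 - 0.1 * log(2) ≈ 0.93
```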
I attached the reference below.

Reference:
Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019), https://arxiv.org/abs/1909.08593