From ea45e48536c0d554c83b218e24d5072585353057 Mon Sep 17 00:00:00 2001
From: Jonas Maison
Date: Sun, 9 Jun 2024 09:23:14 +0200
Subject: [PATCH] Update notes

---
 .../neural-nets/transformers/transformers.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md b/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md
index 14d5069..06e6d17 100644
--- a/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md
+++ b/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md
@@ -725,13 +725,15 @@ The loss is usually the cross-entropy loss, but we don't want our model to be to
 
 ### RLHF, PPO, DPO, IPO, KTO
 
-After pre-training, we finetune the model to be "instruct" or "chat".
+RLHF (Reinforcement Learning from Human Feedback) is the broader framework that encompasses the techniques used to fine-tune a pre-trained model with human feedback, turning it into an "instruct" or "chat" model. PPO, DPO, IPO, and KTO are methods or variations within this framework. Here's how they fit:
 
-DPO: A type of training which removes the need for a reward model. It simplifies significantly the RLHF-pipeline.
+- RLHF (Reinforcement Learning from Human Feedback): A framework that trains models on human feedback to align their outputs with human preferences and values. It uses a reward model, trained on human preference data, to provide the reward signal to the agent.
+- PPO (Proximal Policy Optimization): A specific RL algorithm used within RLHF that optimizes the policy with a clipped objective, ensuring updates do not deviate too far from the current policy.
+- DPO (Direct Preference Optimization): A training technique that significantly simplifies the RLHF pipeline by removing the need for a separate reward model; the policy is trained directly on preference pairs.
+- IPO (Identity Preference Optimization): A change to the DPO objective that is simpler and less prone to overfitting.
+- KTO (Kahneman-Tversky Optimization): While PPO, DPO, and IPO require pairs of accepted vs. rejected generations, KTO only needs a binary label (accepted or rejected) per example, allowing it to scale to much more data.
 
-IPO: A change in the DPO objective which is simpler and less prone to overfitting.
-
-KTO: While PPO, DPO, and IPO require pairs of accepted vs rejected generations, KTO just needs a binary label (accepted or rejected), hence allowing to scale to much more data.
+
 
 ## Transformers in NLP
 
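For reference alongside the patch above: a minimal sketch of the DPO objective that the added bullet points describe, assuming per-sequence log-probabilities of the chosen and rejected completions have already been computed under the policy being trained and under a frozen reference model. The function name, tensor names, and toy numbers below are illustrative, not taken from the notes.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities
    (sum of token log-probs) under the trained policy or the frozen
    reference model. beta controls how far the policy may drift from
    the reference.
    """
    # Implicit reward of a completion: beta * (log pi(y|x) - log pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the reward margin between chosen and rejected completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of 3 pairs.
policy_chosen = torch.tensor([-12.0, -8.5, -20.1])
policy_rejected = torch.tensor([-14.2, -9.0, -19.8])
ref_chosen = torch.tensor([-12.5, -8.7, -20.0])
ref_rejected = torch.tensor([-13.9, -8.8, -20.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The `beta` coefficient plays the role of the KL penalty in PPO-based RLHF: larger values keep the trained policy closer to the reference model, smaller values let it move further toward the preferred completions.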