knowledgehacker/trlx-examples
Learning to summarize from Human Feedback using trlx

This example shows how to use trlx to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "Learning to Summarize from Human Feedback".

Before running anything, we need some extra packages not included in the trlx dependency list: HuggingFace's evaluate package and Google's re-implementation of ROUGE, rouge-score. To install them, run the following from this example's root directory:

pip install -r requirements.txt
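For intuition about what rouge-score computes, ROUGE-1 (used in the results below) is just unigram overlap between a generated summary and a reference. A minimal pure-Python sketch of the F-measure (illustrative only; the example itself uses the rouge-score package, which also applies stemming and handles ROUGE-2/ROUGE-L):

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can only be matched once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat sat"))  # ≈ 0.667
```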

Training Process

For an in-depth description of the example, please refer to our blog post. The following gives a quick overview of the fine-tuning process and which scripts to run.

  1. Train SFT:

    cd sft/ && deepspeed train_sft.py

    Checkpoint: SFT

  2. Train Reward Model:

    cd rm/ && deepspeed train_rm.py

    Download reward model checkpoint:

    mkdir rm/rm_checkpoint
    wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O rm/rm_checkpoint/pytorch_model.bin
  3. PPO training:

    accelerate launch --config_file configs/default_accelerate_config.yaml trlx_train.py

    Checkpoint: PPO

    ⚠️ Warning: This particular training configuration requires at least 55GB of VRAM and is set up to use two GPUs; decrease batch_size if you run out of memory.
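The three stages connect as follows: the SFT model initializes the policy, the reward model is trained on human preference pairs, and PPO optimizes the policy against the reward model while a KL penalty keeps it close to the SFT model. Two of the key quantities from Stiennon et al., sketched in plain Python (the function names and the beta value are illustrative, not taken from this repo's code):

```python
import math

def rm_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model ranking loss: -log sigmoid(r_chosen - r_rejected).
    Small when the human-preferred summary gets the higher score."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def ppo_reward(rm_score: float, logp_policy: float, logp_sft: float,
               beta: float = 0.02) -> float:
    """Per-sample PPO reward: RM score minus a KL-style penalty
    beta * (log pi(y|x) - log pi_SFT(y|x)) that discourages the
    policy from drifting far from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)

print(rm_pairwise_loss(3.0, 1.0))  # small loss: correct ranking
print(rm_pairwise_loss(1.0, 3.0))  # large loss: wrong ranking
print(ppo_reward(3.3, -40.0, -45.0))  # KL penalty reduces the RM score
```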

Results

The following tables compare ROUGE and reward scores of the SFT and PPO models on the test set of the TL;DR dataset.

  1. SFT vs PPO

    ROUGE scores

    | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Average |
    |-------|---------|---------|---------|---------|
    | SFT   | 0.334   | 0.125   | 0.261   | 0.240   |
    | PPO   | 0.323   | 0.109   | 0.238   | 0.223   |

    Reward scores

    | Model | Average Reward | Reward $\Delta$ |
    |-------|----------------|-----------------|
    | SFT   | 2.729          | -0.181          |
    | PPO   | 3.291          | +0.411          |
  2. Examples of generated summaries can be found here.

  3. Metric logs and other results can be found in our blog post.
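As a sanity check on the first table, the Average column is the plain mean of the three ROUGE variants:

```python
rouge = {
    "SFT": [0.334, 0.125, 0.261],  # ROUGE-1, ROUGE-2, ROUGE-L
    "PPO": [0.323, 0.109, 0.238],
}
for model, scores in rouge.items():
    print(f"{model} average: {sum(scores) / len(scores):.3f}")
# SFT average: 0.240
# PPO average: 0.223
```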

References

  1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "Learning to Summarize from Human Feedback", Advances in Neural Information Processing Systems (NeurIPS), 2020.
