knowledgehacker/trlx-examples
Learning to summarize from Human Feedback using trlx

This example shows how to use trlx to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "Learning to Summarize from Human Feedback".

Before running anything, we need some extra packages not included in the trlx dependency list: HuggingFace's evaluate package and Google's re-implementation of ROUGE, rouge-score. To install them, run the following from this example's root directory:

pip install -r requirements.txt
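For intuition about what rouge-score computes, ROUGE-1 (used in the results below) is just unigram overlap between a generated summary and a reference. A minimal pure-Python sketch of the F-measure (illustrative only; the example itself uses the rouge-score package, which also applies stemming and handles ROUGE-2/ROUGE-L):

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can only be matched once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat sat"))  # ≈ 0.667
```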

Training Process

For an in-depth description of the example, please refer to our blog post. The following gives a quick overview of the fine-tuning process and which scripts to run.

  1. Train SFT:

    cd sft/ && deepspeed train_sft.py

    Checkpoint: SFT

  2. Train Reward Model:

    cd rm/ && deepspeed train_rm.py

    Download reward model checkpoint:

    mkdir rm/rm_checkpoint
    wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O rm/rm_checkpoint/pytorch_model.bin
  3. PPO training:

    accelerate launch --config_file configs/default_accelerate_config.yaml trlx_train.py

    Checkpoint: PPO

    ⚠️ Warning: This particular training configuration requires at least 55GB of VRAM and is set up to use two GPUs; decrease batch_size if you run out of memory.
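The three stages connect as follows: the SFT model initializes the policy, the reward model is trained on human preference pairs, and PPO optimizes the policy against the reward model while a KL penalty keeps it close to the SFT model. Two of the key quantities from Stiennon et al., sketched in plain Python (the function names and the beta value are illustrative, not taken from this repo's code):

```python
import math

def rm_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model ranking loss: -log sigmoid(r_chosen - r_rejected).
    Small when the human-preferred summary gets the higher score."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def ppo_reward(rm_score: float, logp_policy: float, logp_sft: float,
               beta: float = 0.02) -> float:
    """Per-sample PPO reward: RM score minus a KL-style penalty
    beta * (log pi(y|x) - log pi_SFT(y|x)) that discourages the
    policy from drifting far from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)

print(rm_pairwise_loss(3.0, 1.0))  # small loss: correct ranking
print(rm_pairwise_loss(1.0, 3.0))  # large loss: wrong ranking
print(ppo_reward(3.3, -40.0, -45.0))  # KL penalty reduces the RM score
```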

Results

The following tables compare ROUGE and reward scores of the SFT and PPO models on the test set of the TL;DR dataset.

  1. SFT vs PPO

    ROUGE scores

    | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Average |
    |-------|---------|---------|---------|---------|
    | SFT   | 0.334   | 0.125   | 0.261   | 0.240   |
    | PPO   | 0.323   | 0.109   | 0.238   | 0.223   |

    Reward scores

    | Model | Average Reward | Reward $\Delta$ |
    |-------|----------------|-----------------|
    | SFT   | 2.729          | -0.181          |
    | PPO   | 3.291          | +0.411          |
  2. Examples of generated summaries can be found here.

  3. Metric logs and other results can be found in our blog post.
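As a sanity check on the first table, the Average column is the plain mean of the three ROUGE variants:

```python
rouge = {
    "SFT": [0.334, 0.125, 0.261],  # ROUGE-1, ROUGE-2, ROUGE-L
    "PPO": [0.323, 0.109, 0.238],
}
for model, scores in rouge.items():
    print(f"{model} average: {sum(scores) / len(scores):.3f}")
# SFT average: 0.240
# PPO average: 0.223
```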

References

  1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "Learning to Summarize from Human Feedback", Advances in Neural Information Processing Systems (NeurIPS), 2020.
