# Fine-tune an LLM with trlX

The primary objective is to demonstrate a technique for reducing bias when fine-tuning large language models (LLMs) with human feedback. We do this with a minimal set of tools: Hugging Face, GPT-2, Label Studio, and trlX. Our aim is to provide the simplest and most efficient way to build a pipeline that goes from raw data to a real-world reinforcement learning from human feedback (RLHF) system.

## Dataset collection
- Collect a dataset of human preferences over GPT-2 prompt completions using Label Studio (pairwise comparisons)

- Convert the annotations into Elo scores (`utils/elo.py`)

- Use the Elo scores as the reward signal for fine-tuning
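The Elo step can be sketched as follows. This is a minimal, hypothetical version of the conversion, not necessarily what `utils/elo.py` does. It assumes each annotation has been reduced to a `(winner_id, loser_id)` pair of completion ids and uses the common defaults of K = 32 and an initial rating of 1000.

```python
from collections import defaultdict


def elo_scores(comparisons, k=32.0, initial=1000.0):
    """Compute Elo ratings from pairwise preferences (hypothetical helper).

    comparisons: iterable of (winner_id, loser_id) pairs, processed in order.
    Returns a dict mapping each completion id to its final rating.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        r_w, r_l = ratings[winner], ratings[loser]
        # Expected probability that `winner` beats `loser` under the Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        # The winner scored 1 and the loser 0, so both move by the same
        # "surprise" amount in opposite directions (total rating is conserved).
        ratings[winner] = r_w + k * (1.0 - expected_w)
        ratings[loser] = r_l - k * (1.0 - expected_w)
    return dict(ratings)
```

Because Elo updates are sequential, the final scores depend on the order in which comparisons are processed; shuffling the comparisons and averaging over several passes is one way to reduce that effect.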

Alternatively:

- Convert the annotations into the pairwise (chosen/rejected) format used by
   - https://huggingface.co/datasets/Anthropic/hh-rlhf
   - https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons
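Converting to these formats mostly means deciding which completion was "chosen" and which was "rejected". A minimal sketch, assuming a hypothetical, already-flattened annotation record (not Label Studio's raw export format) with `prompt`, `answer1`, `answer2`, and a `selected` field of `"left"` or `"right"`. The `hh` style mirrors the single-turn `\n\nHuman: ... \n\nAssistant: ...` dialogue text found in the `chosen`/`rejected` columns of Anthropic/hh-rlhf; the `summarize` style assumes the separate `prompt`, `chosen`, `rejected` columns of CarperAI/openai_summarize_comparisons. Check both against the dataset cards before relying on them.

```python
def split_preference(record):
    """Return (chosen, rejected) from a hypothetical flattened annotation.

    Assumed layout:
    {"prompt": str, "answer1": str, "answer2": str, "selected": "left" | "right"}
    where "left" means answer1 was preferred.
    """
    if record["selected"] == "left":
        return record["answer1"], record["answer2"]
    if record["selected"] == "right":
        return record["answer2"], record["answer1"]
    raise ValueError(f"unexpected selection: {record['selected']!r}")


def convert(record, style="hh"):
    """Convert one annotation into a chosen/rejected training record."""
    chosen, rejected = split_preference(record)
    prompt = record["prompt"]
    if style == "hh":
        # hh-rlhf style: each field holds the whole (single-turn) dialogue text.
        return {
            "chosen": f"\n\nHuman: {prompt}\n\nAssistant: {chosen}",
            "rejected": f"\n\nHuman: {prompt}\n\nAssistant: {rejected}",
        }
    if style == "summarize":
        # openai_summarize_comparisons style: the prompt has its own column.
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    raise ValueError(f"unknown style: {style!r}")
```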

## Model fine-tuning
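The following call fine-tunes GPT-2 with the trlX library using the collected samples and their ratings. Passing `samples` and `rewards` (rather than a `reward_fn`) makes trlX train offline with ILQL: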

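```python
import trlx

# `samples`: full text of each rated completion (prompt + response).
# `ratings`: one scalar per sample (e.g. its Elo score), in the same order.
trainer = trlx.train(
    "gpt2",
    samples=samples,
    rewards=ratings,
)
```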
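Before that call, the Elo scores have to be turned into aligned `samples` and `ratings` lists. A minimal sketch, assuming a hypothetical `completions` dict mapping each completion id to its `(prompt, response)` pair and an `elo` dict mapping the same ids to ratings. Standardizing the ratings to zero mean and unit variance is our own assumption, not a trlX requirement; it only keeps the reward scale independent of the Elo baseline of roughly 1000.

```python
import statistics


def build_offline_dataset(completions, elo):
    """Return (samples, ratings) lists aligned index by index (hypothetical helper).

    completions: dict mapping completion id -> (prompt, response)
    elo: dict mapping completion id -> Elo rating
    """
    ids = sorted(elo)  # deterministic ordering
    # Concatenate prompt and response; assumes the response text already
    # carries any leading space or separator it needs.
    samples = [completions[i][0] + completions[i][1] for i in ids]
    raw = [elo[i] for i in ids]
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw) or 1.0  # avoid dividing by zero if all equal
    ratings = [(r - mean) / std for r in raw]
    return samples, ratings
```

The resulting lists can then be passed as `trlx.train("gpt2", samples=samples, rewards=ratings)`.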

## References

- [General overview of RLHF](https://huggingface.co/blog/rlhf)
- [Another end-to-end example with trlX](https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2)
- [Similar human-in-the-loop annotation framework](https://github.com/CarperAI/cheese/tree/main/examples)
- [Anthropic helpful and harmless RLHF paper](https://arxiv.org/pdf/2204.05862.pdf) and [blog post on the general principles of Constitutional AI (CAI)](https://lifearchitect.ai/anthropic/)