**README**

This document was prepared by **Kaggle**.


# Install Packages

In [9]:
!pip install -q trl

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.

In [8]:
from datasets import load_dataset
from huggingface_hub import login

# [TRL](https://pypi.org/project/trl/#description)

TRL - Transformer Reinforcement Learning

**Full stack library to fine-tune and align large language models.**

**What is it?**

The trl library is a full-stack tool for fine-tuning and aligning transformer language and diffusion models using methods such as `Supervised Fine-tuning` (SFT), `Reward Modeling` (RM), `Proximal Policy Optimization` (PPO), and `Direct Preference Optimization` (DPO).

The library is built on top of the `transformers` library, so any model architecture available there can be used.
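
To make one of the methods named above concrete, here is a minimal sketch of the DPO objective for a single preference pair, written in plain Python. It is the commonly cited DPO formula, not the code inside TRL's `DPOTrainer`. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probabilities.

    Hypothetical helper for illustration; beta=0.1 is an assumed default.
    """
    # How much more the policy favors each response than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)), computed without overflow.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Made-up log-probabilities: the policy prefers the chosen answer more than the reference does.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(loss)
```

The loss falls below `log(2)` when the policy ranks the chosen response above the rejected one by a wider margin than the reference model, and rises above it in the opposite case.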


## Highlights

* **Efficient and scalable**

  * `accelerate` is the backbone of trl; it lets you scale model training from a single GPU to a large multi-node cluster with methods such as `DDP` and `DeepSpeed`.
  * `PEFT` is fully integrated, which makes it possible to train even the largest models on modest hardware using quantisation and methods such as **LoRA** or **QLoRA**.
  * `unsloth` is also integrated and can significantly speed up training with dedicated kernels.

* `CLI`: With the CLI you can fine-tune and chat with LLMs without writing any code using a single command and a flexible config system.

* `Trainers`: The Trainer classes are an abstraction that makes it easy to apply many fine-tuning methods, including `SFTTrainer`, `DPOTrainer`, `RewardTrainer`, `PPOTrainer`, `CPOTrainer`, and `ORPOTrainer`.

* `AutoModels`: The `AutoModelForCausalLMWithValueHead` & `AutoModelForSeq2SeqLMWithValueHead` classes add a value head to the model, which allows it to be trained with **RL** algorithms such as *PPO*.

* `Examples`: Following the examples, you can train GPT2 to generate positive movie reviews with a BERT sentiment classifier, run full RLHF using adapters only, train GPT-J to be less toxic, reproduce the StackLlama example, and more.
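
The value head mentioned under `AutoModels` can be pictured as a small linear layer that turns each token's hidden vector into a single scalar value estimate. The sketch below shows that idea in plain Python. It is a conceptual illustration only, not TRL's implementation, and the function name `value_head`, the weights, and the hidden states are all made up.

```python
def value_head(hidden_states, weights, bias=0.0):
    """Map each token's hidden vector to one scalar (a linear layer with a single output).

    Hypothetical helper: hidden_states is a list of per-token vectors,
    weights is one vector of the same width.
    """
    return [
        sum(h * w for h, w in zip(token_vector, weights)) + bias
        for token_vector in hidden_states
    ]


# Two tokens with a hidden size of 3 (made-up numbers).
hidden_states = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
weights = [0.2, 0.1, -0.3]

values = value_head(hidden_states, weights, bias=0.05)
print(values)  # one value estimate per token
```

An RL algorithm such as PPO can use per-token estimates like these as a baseline when computing advantages, which is why the language model needs the extra head.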


## Command Line Interface (CLI)

You can use the TRL Command Line Interface (CLI) to get started quickly with Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), and to test your aligned model with the chat CLI:

1. **SFT - Supervised Fine Tuning**
```
!trl sft --model_name_or_path facebook/opt-125m --dataset_name imdb --output_dir opt-sft-imdb
```
2. **DPO - Direct Preference Optimization**
```
!trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style --output_dir opt-sft-hh-rlhf
```
3. **Chat**
```
!trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
```

These three commands let us train a model and chat with it through the `trl` command.
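
When the same command has to be launched with different settings, it can be convenient to assemble the argument list in Python. The following is a minimal sketch that only uses the flags shown in this section. The helper `build_trl_command` is hypothetical and not part of TRL, and the block only prints the command rather than running it.

```python
import shlex


def build_trl_command(subcommand, **options):
    """Assemble a `trl` command line as a list of arguments (hypothetical helper)."""
    args = ["trl", subcommand]
    for name, value in options.items():
        # Each keyword becomes a `--name value` pair, in the order given.
        args += [f"--{name}", str(value)]
    return args


sft_args = build_trl_command(
    "sft",
    model_name_or_path="facebook/opt-125m",
    dataset_name="imdb",
    output_dir="opt-sft-imdb",
)
print(shlex.join(sft_args))
```

To actually start training, the list could be passed to `subprocess.run(sft_args, check=True)`, assuming the `trl` executable is on the path.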

I will run the chat command below as a test.

In [None]:
# !trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat

[2K[32m⠧[0m [1;35mWelcome! Initializing the TRL CLI...[0m0m
[1A[2K2024-06-26 09:44:02.753251: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-26 09:44:02.753308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-26 09:44:02.754763: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2J[H[1;31m<[0m[1;31mroot[0m[1;31m>[0m[1;31m:[0m
