After cloning the GitHub repository below, you can edit its default_configs.py file to change the training parameters. I used ILQL here, but I also tried PPO, and PPO worked better for me 🥇.

**Clone the Repository**:

The command !git clone https://github.com/CarperAI/trlx.git clones the trlx repository from GitHub into a trlx folder in the current directory (here /content). This repository contains the code for training the model.

In [1]:
!git clone https://github.com/CarperAI/trlx.git
!git config --global --add safe.directory /content/trlx && cd /content/trlx && pip install -e .

Cloning into 'trlx'...
remote: Enumerating objects: 7589, done.
remote: Counting objects: 100% (2874/2874), done.
remote: Compressing objects: 100% (599/599), done.
remote: Total 7589 (delta 2486), reused 2460 (delta 2275), pack-reused 4715
Receiving objects: 100% (7589/7589), 46.76 MiB | 19.23 MiB/s, done.
Resolving deltas: 100% (5184/5184), done.
Obtaining file:///content/trlx
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting accelerate>=0.17.1 (from trlx==0.7.0)
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
     244.2/244.2 kB 12.5 MB/s eta 0:00:00
Collecting cattrs>=22.2.0 (from trlx==0.7.0)
  Downloading cattrs-2

**Setup for the Repository**:
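The command !git config --global --add safe.directory /content/trlx && cd /content/trlx && pip install -e . does three things: it marks /content/trlx as a safe directory for Git, changes into that directory, and installs the package and all its dependencies in editable mode (pip install -e .). Editable mode means that changes made to the code take effect without reinstalling the package. Note that the cd only applies inside that one shell command, which is why the working directory is set again with os.chdir further down.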
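
To confirm that an editable install resolves to the cloned folder rather than to site-packages, a small check like the one below can help. This is only a sketch: package_location is a hypothetical helper written for this note, not part of trlx or pip, and it assumes the install cell above finished without errors.

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# After `pip install -e .`, this is expected to point inside /content/trlx
# (assumption: the install succeeded; otherwise it prints None).
print(package_location("trlx"))
```

If the printed path is not inside the cloned folder, edits to the repository files will probably not be picked up by the notebook.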

In [2]:
# uninstall scikit_learn + jax to avoid numpy issues
!pip uninstall -y scikit_learn jax

Found existing installation: scikit-learn 1.2.2
Uninstalling scikit-learn-1.2.2:
  Successfully uninstalled scikit-learn-1.2.2
Found existing installation: jax 0.4.13
Uninstalling jax-0.4.13:
  Successfully uninstalled jax-0.4.13


**Change Directory**:

The call os.chdir('/content/trlx') changes the current working directory of the notebook process to the cloned trlx repository.

In [3]:
import os

# run within repo
os.chdir('/content/trlx')
print(os.getcwd())

/content/trlx


**Import Modules**:

Several Python modules and functions are imported, such as load_dataset from the datasets library, pipeline from the transformers library, and several from the trlx package that was just cloned.

In [4]:
import yaml
from datasets import load_dataset
from transformers import pipeline
import pathlib
from typing import Dict, List
import trlx
from trlx.data.default_configs import TRLConfig, default_ilql_config

[2023-07-15 16:51:35,235] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


If we had trained our own reward model on our data, we could have used it here as well. Since we have not, we only use the transfer-learned (fine-tuned) model together with an arbitrary, off-the-shelf sentiment model in place of a reward model.

**Configuration for Training**:

The default_ilql_config function returns the default configuration for training the model with Implicit Language Q-Learning (ILQL). Several parameters in this configuration are then changed: the tracker is disabled, the batch size is set to 16, the number of epochs is set to 10, and the model path is pointed at the AliChazz/GPT2_Fine_Tune_Requirement_Produce model.
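
The override pattern described above can be sketched with plain dictionaries, without installing trlx. Everything in this sketch is an assumption made for illustration: merge_overrides is a hypothetical helper (it is not how TRLConfig.update is implemented), and base_config uses placeholder values rather than the real trlx defaults.

```python
import copy
from typing import Any, Dict


def merge_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` applied recursively to nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge one level deeper instead of replacing.
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-in for default_ilql_config().to_dict(); placeholder values only.
base_config = {
    "train": {"tracker": "some-tracker", "batch_size": 32, "epochs": 1},
    "model": {"model_path": "placeholder-model", "num_layers_unfrozen": -1},
}

# The same four changes that the notebook cell makes key by key.
overrides = {
    "train": {"tracker": None, "batch_size": 16, "epochs": 10},
    "model": {"model_path": "AliChazz/GPT2_Fine_Tune_Requirement_Produce"},
}

merged = merge_overrides(base_config, overrides)
print(merged["train"], merged["model"]["model_path"])
```

Keeping all overrides in one dictionary makes it easier to see at a glance which settings differ from the defaults, while untouched keys such as num_layers_unfrozen carry over unchanged.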

In [5]:
default_config = default_ilql_config().to_dict()
default_config['train']['tracker'] = None
default_config['train']['batch_size'] = 16
default_config['train']['epochs'] = 10
default_config['model']['model_path']='AliChazz/GPT2_Fine_Tune_Requirement_Produce'
config = TRLConfig.update(default_config, {})
print(config)

{
    "method": {
        "name": "ilqlconfig",
        "tau": 0.7,
        "gamma": 0.99,
        "cql_scale": 0.1,
        "awac_scale": 1,
        "alpha": 0.001,
        "beta": 0,
        "steps_for_target_q_sync": 5,
        "two_qs": true,
        "gen_kwargs": {
            "max_new_tokens": 56,
            "top_k": 20,
            "beta": 1,
            "temperature": 1.0
        }
    },
    "model": {
        "model_path": "AliChazz/GPT2_Fine_Tune_Requirement_Produce",
        "model_arch_type": "causal",
        "num_layers_unfrozen": -1,
        "peft_config": null
    },
    "optimizer": {
        "name": "adamw",
        "kwargs": {
            "lr": 5e-05,
            "betas": [
                0.9,
                0.95
            ],
            "eps": 1e-08,
            "weight_decay": 1e-06
        }
    },
    "scheduler": {
        "name": "cosine_annealing",
        "kwargs": {
            "T_max": 1000000000000.0,
            "eta_min": 5e-05
        }
    },
   

Explanation of the code below:

This code sets up sentiment analysis with a pretrained DistilBERT model and defines the metric function that is evaluated during the training run.

Function - get_positive_score: This function takes the output of the sentiment analysis pipeline for one sample (a list of dictionaries) and returns the score for the "POSITIVE" sentiment. Here is how it works:

- It takes a list of dictionaries as input, where each dictionary holds a sentiment label (such as "POSITIVE" or "NEGATIVE") and its score.
- It maps each dictionary to a tuple containing the label and its score.
- It converts the resulting tuples into a dictionary keyed by label.
- It returns the score associated with the "POSITIVE" label.

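
The steps above can be traced on a hand-written example. The scores list below is made up to mimic the shape of one pipeline result with top_k=2; it is not real model output. get_positive_score_explicit is a hypothetical alternative added for comparison, not part of the notebook.

```python
from typing import Dict, List, Union

# Hand-written stand-in for one pipeline result (scores are invented).
scores: List[Dict[str, Union[str, float]]] = [
    {"label": "NEGATIVE", "score": 0.25},
    {"label": "POSITIVE", "score": 0.75},
]

# Step 1: each dict becomes a (label, score) tuple. This relies on the keys
# appearing in the order "label", then "score".
pairs = [tuple(entry.values()) for entry in scores]

# Step 2: the tuples are collected into a {label: score} dictionary.
by_label = dict(pairs)

# Step 3: look up the score stored under "POSITIVE".
positive = by_label["POSITIVE"]


def get_positive_score_explicit(scores):
    """Same lookup, but reading the keys by name instead of relying on their order."""
    return {entry["label"]: entry["score"] for entry in scores}["POSITIVE"]


print(pairs, positive, get_positive_score_explicit(scores))
```

The explicit variant is slightly longer but would keep working if the dictionaries ever listed their keys in a different order.
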

DistilBERT Sentiment Analysis Pipeline: This piece of code sets up a sentiment analysis pipeline using a pretrained DistilBERT model ("lvwerra/distilbert-imdb").

- "sentiment-analysis" is the task type for the pipeline.
- "lvwerra/distilbert-imdb" is the pretrained model, which is fine-tuned on the IMDB dataset.
- "top_k=2" makes the pipeline return scores for the top 2 labels, which for this two-label classifier means both "POSITIVE" and "NEGATIVE".
- "truncation=True" ensures that inputs longer than the model's maximum input length are truncated.
- "batch_size=256" sets the number of samples to process at a time.
- "device=0" runs the pipeline on the first GPU (CUDA device 0); use device=-1 to run on the CPU instead.


Function - metric_fn: This function takes a list of text samples as input and returns a dictionary with the sentiment scores for those samples.

- It applies the sentiment analysis pipeline to the samples to get the sentiment scores.
- It then applies the get_positive_score function to extract the "POSITIVE" sentiment score of each sample.
- Finally, it returns a dictionary where the key is "sentiments" and the value is the list of "POSITIVE" sentiment scores.

This function is passed to the trainer as metric_fn and is used to score the text generated for the evaluation prompts: the higher the positive sentiment score, the better the generation. In this ILQL setup the training rewards themselves come from the IMDB labels passed as rewards.
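
To see the shape of what metric_fn returns without downloading the model or needing a GPU, the pipeline can be swapped for a stub. This is a sketch under stated assumptions: fake_sentiment_fn is a hypothetical stand-in with an invented keyword rule, not the DistilBERT classifier, and its scores are made up.

```python
from typing import Dict, List


def fake_sentiment_fn(samples: List[str]) -> List[List[Dict[str, object]]]:
    """Hypothetical stand-in for the pipeline: texts containing 'good' count as positive."""
    results = []
    for text in samples:
        positive = 0.9 if "good" in text.lower() else 0.2
        results.append([
            {"label": "POSITIVE", "score": positive},
            {"label": "NEGATIVE", "score": 1.0 - positive},
        ])
    return results


def get_positive_score(scores):
    """Extract the value associated with the positive label from one result."""
    return dict(map(lambda x: tuple(x.values()), scores))["POSITIVE"]


def metric_fn(samples: List[str], **kwargs) -> Dict[str, List[float]]:
    # One positive score per input sample, in the same order as the samples.
    sentiments = list(map(get_positive_score, fake_sentiment_fn(samples)))
    return {"sentiments": sentiments}


result = metric_fn(["A good film", "Dull and slow"])
print(result)
```

The returned dictionary holds one list per metric name, with one entry per sample, which is the shape the real metric_fn in the next cell produces as well.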

In [6]:
def get_positive_score(scores):
    "Extract value associated with a positive sentiment from pipeline's output"
    return dict(map(lambda x: tuple(x.values()), scores))["POSITIVE"]

sentiment_fn = pipeline(
    "sentiment-analysis",
    "lvwerra/distilbert-imdb",
    top_k=2,
    truncation=True,
    batch_size=256,
    device=0,
)

def metric_fn(samples: List[str], **kwargs) -> Dict[str, List[float]]:
    sentiments = list(map(get_positive_score, sentiment_fn(samples)))
    return {"sentiments": sentiments}

imdb = load_dataset("imdb", split="train+test")


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.



Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...



Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


In [7]:
trainer = trlx.train(
    samples=imdb["text"],
    rewards=imdb["label"],
    eval_prompts=[
        "I don't know much about Hungarian underground",
        "What made this movie so distinctly",
        "Like the sandwich I just bought at the grocery store,",
        "I cannot believe how much this movie made me want to"
    ] * 20,
    metric_fn=metric_fn,
    config=config,
)

[RANK 0] Initializing model: AliChazz/GPT2_Fine_Tune_Requirement_Produce



Using pad_token, but it is not set yet.
[RANK 0] Collecting rollouts
Token indices sequence length is longer than the specified maximum sequence length for this model (1169 > 1024). Running this sequence through the model will result in indexing errors
[RANK 0] Logging sample example


[RANK 0] Logging experience string statistics


[RANK 0] Starting training
[RANK 0] Evaluating model


[generation sweep 0/1 | eval batch 0/5]:   0%|          | 0/5 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[RANK 0] Computing metrics
[RANK 0] Summarizing evaluation


  0%|          | 0/1000 [00:00<?, ?it/s]


[RANK 0] Saving intermediate checkpoint into ckpts/checkpoint_1000
[RANK 0] Evaluating model


[generation sweep 0/1 | eval batch 0/5]:   0%|          | 0/5 [00:00<?, ?it/s]

[RANK 0] Computing metrics
[RANK 0] Summarizing evaluation


You can now see that the model has changed: instead of outputting a requirement, it produces a positive-sentiment continuation.

**Generate Text**:

Finally, the trained model is used to generate a text continuation for the prompt 'Aeroplanes shall work with'. The generated text is then printed to the console.

In [8]:
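# Prompt to continue with the trained model
input_str = 'Aeroplanes shall work with'
# Tokenize the prompt and generate; [0] selects the single sequence in the batch
trainer_output = trainer.generate_eval(
    **trainer.tokenizer(input_str, return_tensors='pt'))[0]
# Decode the generated token IDs back into text
print(trainer.tokenizer.decode(trainer_output))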

Aeroplanes shall work with a group of people (at least in the U.S. in the late 1930's), and they are all like a group who are trying to get away from a town and to do so they are put into "bodies" and put into a car and driven to a
