<a href="https://colab.research.google.com/github/ChandlerU11/Hugging_ReST/blob/main/ReST_Method_Fine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging ReST
This notebook contains my implementation of the Reinforced Self-Training (ReST) for Language Modeling algorithm described in Google DeepMind's paper: https://arxiv.org/pdf/2308.08998.pdf. I highly recommend giving it a read!

**This is NOT an official implementation of their work.**

## The Implementation
The notebook uses the Hugging Face `transformers` library to fine-tune a language model (t5-small) with the reinforcement learning method ReST. The training script contains a `dummy_reward_model` that rewards the generator being fine-tuned based on how many times it generates the word "hugs". The desired number of "hugs" is passed to the model as part of the input prompt (for example, `3 hugs`) during training.

Because it uses the Hugging Face library for fine-tuning, this implementation is easily **transferable**: with just a few changes it can be used with many LLMs, provided they are hosted on Hugging Face. The `dummy_reward_model` in this implementation can also be swapped for any text-based classifier that produces rewards. The reward model used for ReST training can range from the simple `dummy_reward_model` here to another LLM. Human preference scores of generated texts are fair game too.
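
As a rough illustration of swapping the reward model, the sketch below wraps an arbitrary per-example scorer so that it matches the interface `dummy_reward_model` uses in this notebook: a DataFrame with `input` and `response` columns goes in, a list of rewards comes out. The names `make_reward_model` and `brevity_score` are hypothetical and not part of the notebook; the scorer is a toy stand-in for a classifier, another LLM, or human preference scores.

```python
import pandas as pd

def make_reward_model(score_fn):
    """Adapt a (prompt, response) -> number scorer to the df -> list-of-rewards interface."""
    def reward_model(df):
        # Score each prompt/response pair and return a plain list, like dummy_reward_model
        return [float(score_fn(p, r)) for p, r in zip(df["input"], df["response"])]
    return reward_model

# Hypothetical toy scorer (an assumption for illustration):
# penalize each character of distance from a 10-character response
def brevity_score(prompt, response):
    return -abs(len(response) - 10)

reward_model = make_reward_model(brevity_score)
demo = pd.DataFrame({"input": ["1 hugs", "2 hugs"], "response": ["hugs", "hugs hugs!"]})
rewards = reward_model(demo)
print(rewards)
```

Note that `ReST` and `test_ReST` in this notebook call `dummy_reward_model` by name, so using a different reward model means replacing those calls, and the thresholds in `I` would need to match the new reward scale.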

In [6]:
!pip install datasets
!pip install accelerate -U
!pip install transformers[torch]



In [7]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments, AutoModelForSeq2SeqLM, AutoTokenizer
from sklearn.model_selection import train_test_split
from datasets import Dataset
from tqdm import tqdm
import torch
import pandas as pd
import statistics
import random
import time

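# Use the first GPU if available; otherwise fall back to CPU so `device` is always defined
device = 0 if torch.cuda.is_available() else "cpu"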

def prepare_dataset(examples):
  input_ids = tokenizer(examples['input'], truncation=True, max_length=128)['input_ids']
  label_ids = tokenizer(examples["response"], truncation=True, max_length=128)['input_ids']
  return {"input_ids": input_ids, "labels": label_ids}

def dummy_reward_model(df):
    # Get input length
    input_len = [int(each.split()[0]) for each in df['input']]

    # Get word count of response
    gen_len = [len(each.split()) for each in df['response']]

    # Count number of "hugs" in response
    hug_count = [sum(1 for i in each.split() if i == 'hugs') for each in df['response']]

    # If every word in the response is "hugs", the reward is the negative distance
    # between the requested count and the generated count; any other response gets -3
    rewards = []
    for x,y,z in zip(input_len, hug_count, gen_len):
      if z == y:
        rewards.append(-abs(int(x) - y))
      else:
        rewards.append(-3)

    return rewards

def generate(df, gen_model, tokenizer, generation_kwargs, N):
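    # Tokenize once; the prompts do not change between the N sampling passes
    inputs = tokenizer(df['input'].tolist(), return_tensors="pt", padding=True, truncation=True)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    responses = []
    for _ in range(N):
        # Pass the attention mask so padded positions are ignored during generation
        response = gen_model.generate(input_ids, attention_mask=attention_mask, **generation_kwargs)
        response = [tokenizer.decode(each, skip_special_tokens=True) for each in response]
        responses.extend(response)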

    # Create preserved-order dataframe with prompt / response pairs
    gen_df = pd.concat([df.copy()] * N)
    gen_df['response'] = responses

    return gen_df

def ReST(D, Deval, G, I, N, model, tokenizer, generation_kwargs, training_args):
    for g in range(G):
        print('Grow Step ', g)

        # Generate Dg. N determines number of generations per sample.
        Dg = generate(D, model, tokenizer, generation_kwargs, N)

        # Annotate Dg with reward model.
        Dg['scores'] = dummy_reward_model(Dg)

        print(len(Dg[Dg['scores'] == 0]), "generations out of ", len(Dg), "are the correct length.")
        print("Example output from model:")
        print(Dg.head(25))
        time.sleep(10)

        steps = 0
        for tau_i in I:
            print('Improve Step: ', steps)
            print('Threshold: ', tau_i)

            # Filter for samples at or above threshold
            Dg_filt = Dg.loc[(Dg['scores'] >= tau_i)].copy()
            if len(Dg_filt) == 0:
                print("NO SAMPLES ABOVE THRESHOLD")
                break

            Dg_filt = Dataset.from_pandas(Dg_filt).map(prepare_dataset, batched=True)

            # Create trainer with newly filtered data
            trainer = Seq2SeqTrainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=Dg_filt, data_collator=data_collator)

            # First fine-tuning of improve step
            trainer.train()

            # Generate one response for every sample in eval set
            Dg_eval = generate(Deval, model, tokenizer, generation_kwargs, 1)
            Dg_eval['scores'] = dummy_reward_model(Dg_eval)

            # While generator improves reward model score on eval, continue to fine-tune using Dg_filt
            prev = -5
            improve = statistics.mean(Dg_eval['scores'])
            while prev < improve:
                trainer.train()
                Dg_eval = generate(Deval, model, tokenizer, generation_kwargs, 1)
                Dg_eval['scores'] = dummy_reward_model(Dg_eval)
                prev = improve
                improve = statistics.mean(Dg_eval['scores'])

            steps += 1

    print("Training Finished!!!")
    return model
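
To make the reward rule concrete, here is a small worked example on hand-written responses. It is a sketch only: `dummy_reward_sketch` is a condensed copy of `dummy_reward_model`, repeated so the snippet runs on its own, and the toy rows are made up for illustration.

```python
import pandas as pd

def dummy_reward_sketch(df):
    # Condensed copy of dummy_reward_model (same rule, restated for a standalone example)
    rewards = []
    for prompt, response in zip(df["input"], df["response"]):
        target = int(prompt.split()[0])                 # requested number of "hugs"
        words = response.split()
        hugs = sum(1 for w in words if w == "hugs")     # exact-match count
        # Only "hugs" words (or nothing at all): distance penalty; anything else: flat -3
        rewards.append(-abs(target - hugs) if len(words) == hugs else -3)
    return rewards

toy = pd.DataFrame({
    "input":    ["2 hugs", "3 hugs", "1 hugs", "2 hugs"],
    "response": ["hugs hugs", "hugs", "1 hugs", ""],
})
toy["scores"] = dummy_reward_sketch(toy)
print(toy)
```

The four rows score 0, -2, -3, and -2. The last case is worth noticing: an empty response contains zero words and zero "hugs", so it passes the "only hugs" check and is scored by distance rather than receiving -3.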

In [8]:
# Generate training data
rand_data = []
for i in range(1000):
     rand_data.append(str(random.randrange(1,5)) + ' hugs')

# Generate test data
rand_test_data = []
for i in range(100):
     rand_test_data.append(str(random.randrange(1,5)) + ' hugs')

train_df = pd.DataFrame()
train_df['input'] = rand_data

test_df = pd.DataFrame()
test_df['input'] = rand_test_data

generation_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
}

training_args = Seq2SeqTrainingArguments(
            do_train=True,
            do_eval=False,
            learning_rate = 3e-4,
            output_dir="./t5-small",
            num_train_epochs=1,
            per_device_train_batch_size = 64
            )

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small").to(device)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=-100)

D, Deval = train_test_split(train_df, test_size = .1, random_state = 42)
G = 3          # Number of Grow Steps
I = [-2,-1,0]  # Number of Improve Steps (length of list) with respective thresholds
N = 10         # Number of generations for each sample when creating Dg

fine_tuned_model = ReST(D, Deval, G, I, N, model, tokenizer, generation_kwargs, training_args)

Grow Step  0




4 generations out of  9000 are the correct length.
Example output from model:
      input response  scores
716  4 hugs     choc      -3
351  1 hugs    1 and      -3
936  3 hugs   6 Famé      -3
256  4 hugs     4 us      -3
635  2 hugs   2 hugs      -3
644  3 hugs    7 fro      -3
554  1 hugs      She      -3
959  2 hugs   2 hugs      -3
168  2 hugs   2 hugs      -3
917  1 hugs               -1
528  1 hugs      WOW      -3
823  2 hugs   2 biss      -3
985  3 hugs       0.      -3
816  2 hugs   2 hugs      -3
86   4 hugs  4 Cous.      -3
432  2 hugs    2hugs      -3
184  4 hugs  Kritiks      -3
978  2 hugs               -2
534  1 hugs      1 s      -3
294  3 hugs   3 hugs      -3
892  4 hugs   4 hugs      -3
425  2 hugs    2hugs      -3
713  4 hugs   4 hugs      -3
260  3 hugs   3 hugs      -3
237  4 hugs   4 hugs      -3
Improve Step:  0
Threshold:  -2


Map:   0%|          | 0/169 [00:00<?, ? examples/s]

Improve Step:  1
Threshold:  -1


Map:   0%|          | 0/141 [00:00<?, ? examples/s]

Improve Step:  2
Threshold:  0




Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Grow Step  1
2176 generations out of  9000 are the correct length.
Example output from model:
      input response  scores
716  4 hugs     hugs      -3
351  1 hugs     hugs       0
936  3 hugs     hugs      -2
256  4 hugs     hugs      -3
635  2 hugs     hugs      -1
644  3 hugs     hugs      -2
554  1 hugs     hugs       0
959  2 hugs     hugs      -1
168  2 hugs     hugs      -1
917  1 hugs     hugs       0
528  1 hugs    Verts      -3
823  2 hugs               -2
985  3 hugs     hugs      -2
816  2 hugs     hugs      -1
86   4 hugs     hugs      -3
432  2 hugs     hugs      -1
184  4 hugs     hugs      -3
978  2 hugs     hugs      -1
534  1 hugs     hugs       0
294  3 hugs     hugs      -2
892  4 hugs     hugs      -3
425  2 hugs     hugs      -1
713  4 hugs     hugs      -3
260  3 hugs     hugs      -2
237  4 hugs     hugs      -3
Improve Step:  0
Threshold:  -2


Map:   0%|          | 0/6408 [00:00<?, ? examples/s]

Improve Step:  1
Threshold:  -1


Map:   0%|          | 0/4196 [00:00<?, ? examples/s]

Improve Step:  2
Threshold:  0


Map:   0%|          | 0/2176 [00:00<?, ? examples/s]

Grow Step  2




5193 generations out of  9000 are the correct length.
Example output from model:
      input                                       response  scores
716  4 hugs             hugs hugs hugs hugs hugs hugs hugs      -3
351  1 hugs                                           hugs       0
936  3 hugs             hugs hugs hugs hugs hugs hugs hugs      -4
256  4 hugs                                 hugs hugs hugs      -1
635  2 hugs                                      hugs hugs       0
644  3 hugs                                      hugs hugs      -1
554  1 hugs                                           hugs       0
959  2 hugs                                      hugs hugs       0
168  2 hugs                                      hugs hugs       0
917  1 hugs                                           hugs       0
528  1 hugs                                           hugs       0
823  2 hugs                                      hugs hugs       0
985  3 hugs             hugs hugs hugs hugs hugs

Map:   0%|          | 0/7903 [00:00<?, ? examples/s]

Improve Step:  1
Threshold:  -1




Map:   0%|          | 0/7072 [00:00<?, ? examples/s]

Improve Step:  2
Threshold:  0




Map:   0%|          | 0/5193 [00:00<?, ? examples/s]

Training Finished!!!
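
Within each Grow step in the log above, the `Map` counts shrink from one Improve step to the next (169, 141, and 4 in Grow step 0) because the filter `Dg['scores'] >= tau_i` keeps a smaller, higher-reward subset as the threshold rises. The sketch below reproduces that effect on a toy score column; the score values are made up for illustration, and only the thresholds match the notebook.

```python
import pandas as pd

# Toy scores standing in for an annotated Dg (values are invented for this example)
Dg = pd.DataFrame({"scores": [-3, -3, -2, -2, -1, -1, -1, 0, 0, -4]})
I = [-2, -1, 0]  # same thresholds as the notebook

# Count how many samples survive the filter at each threshold
kept = {tau: int((Dg["scores"] >= tau).sum()) for tau in I}
print(kept)
```

Each later subset is contained in the earlier one, so the model is fine-tuned on progressively fewer but better-scoring generations before the next Grow step samples a fresh `Dg`.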




## Output Prior to RL Fine-tuning

In [9]:
def test_ReST(test_data, model, tokenizer, generation_kwargs):
      test_df = generate(test_data, model, tokenizer, generation_kwargs, 1)
      test_df['scores'] = dummy_reward_model(test_df)
      print("The model generates the correct number of hugs", len(test_df[test_df['scores'] == 0]) / len(test_df) * 100, "percent of the time!" )
      print(test_df.head(25))

no_fine_model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small").to(device)
test_ReST(test_df, no_fine_model, tokenizer, generation_kwargs)



The model generates the correct number of hugs 0.0 percent of the time!
     input       response  scores
0   1 hugs         1 hugs      -3
1   2 hugs              è      -3
2   1 hugs  1.5 abs short      -3
3   1 hugs         1 hugs      -3
4   3 hugs              1      -3
5   2 hugs         2 hugs      -3
6   2 hugs        2 bless      -3
7   2 hugs         2 hugs      -3
8   1 hugs          1 hug      -3
9   2 hugs          2hugs      -3
10  2 hugs         2 hugs      -3
11  2 hugs         2 hugs      -3
12  3 hugs             35      -3
13  2 hugs         2 viel      -3
14  1 hugs          1 hug      -3
15  3 hugs         3 hugs      -3
16  2 hugs       2 ofhugs      -3
17  2 hugs          2 hug      -3
18  3 hugs   3 Rab 3 hugs      -3
19  2 hugs          2hugs      -3
20  1 hugs                     -1
21  1 hugs          1 hug      -3
22  2 hugs         2 hugs      -3
23  1 hugs    Hus........      -3
24  3 hugs         3 hugs      -3


## Output Post RL Fine-tuning

In [10]:
test_ReST(test_df, fine_tuned_model, tokenizer, generation_kwargs)



The model generates the correct number of hugs 100.0 percent of the time!
     input        response  scores
0   1 hugs            hugs       0
1   2 hugs       hugs hugs       0
2   1 hugs            hugs       0
3   1 hugs            hugs       0
4   3 hugs  hugs hugs hugs       0
5   2 hugs       hugs hugs       0
6   2 hugs       hugs hugs       0
7   2 hugs       hugs hugs       0
8   1 hugs            hugs       0
9   2 hugs       hugs hugs       0
10  2 hugs       hugs hugs       0
11  2 hugs       hugs hugs       0
12  3 hugs  hugs hugs hugs       0
13  2 hugs       hugs hugs       0
14  1 hugs            hugs       0
15  3 hugs  hugs hugs hugs       0
16  2 hugs       hugs hugs       0
17  2 hugs       hugs hugs       0
18  3 hugs  hugs hugs hugs       0
19  2 hugs       hugs hugs       0
20  1 hugs            hugs       0
21  1 hugs            hugs       0
22  2 hugs       hugs hugs       0
23  1 hugs            hugs       0
24  3 hugs  hugs hugs hugs       0
