## Setup

Run the cells below to set up the environment and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last of which provides the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, which is required to load Falcon models. The same cells install `wandb` and `auto-gptq`.

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
!pip install -q auto-gptq==0.4.2


In [None]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive
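
Once the install cells have run, it can help to confirm which versions were actually picked up, since `peft` is installed from the main branch above. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption of this example and not part of any of the libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as they appear in the pip commands above.
required = ["trl", "transformers", "accelerate", "peft",
            "datasets", "bitsandbytes", "einops", "auto-gptq"]

for name, ver in installed_versions(required).items():
    print(f"{name:15s} {ver if ver is not None else 'NOT INSTALLED'}")
```

A `None` entry means the package is missing from the current runtime, which in Colab usually means the install cell needs to be rerun after a restart.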


## Dataset



In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

In [None]:
import pandas as pd

In [None]:
# dataset_train=load_dataset("csv", data_files="/content/drive/MyDrive/CB/LLM/Falcon-7b-MCQ-sample_dataset-model/training_dataset/CNN_dataset_mcq_train.csv",split="train")
# dataset_val=load_dataset("csv", data_files="/content/drive/MyDrive/CB/LLM/Falcon-7b-MCQ-sample_dataset-model/training_dataset/CNN_dataset_mcq_validation.csv",split="train")

In [None]:
dataset_train=load_dataset("csv", data_files="/content/drive/MyDrive/RH/RLHF/CNN_dataset_mcq_train.csv",split="train")
dataset_val=load_dataset("csv", data_files="/content/drive/MyDrive/RH/RLHF/CNN_dataset_mcq_validation.csv",split="train")


In [None]:
dataset_train,dataset_val

(Dataset({
     features: ['text'],
     num_rows: 999
 }),
 Dataset({
     features: ['text'],
     num_rows: 399
 }))

## Loading the model

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b) from a sharded bf16 copy of the checkpoint (`chintan4560/falcon-7b-sharded-bf16`), quantize it to 4-bit and attach LoRA adapters to it. Let's get started!

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "chintan4560/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
    )

from transformers import FalconForCausalLM

model = FalconForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map='auto'
)

model.config.use_cache = False

# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config=bnb_config,
#     trust_remote_code=True,
#     device_map='auto'
# )
# model.config.use_cache = False

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using a model of type RefinedWebModel to instantiate a model of type falcon. This is not supported for all configurations of models and can yield errors.

Let's also load the tokenizer below. Falcon's tokenizer has no padding token, so we reuse the end-of-sequence token for padding.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token


Below we will create the configuration for the LoRA model. According to the QLoRA paper, it is important to target all linear layers in the transformer block for maximum performance, which for Falcon means `dense`, `dense_h_to_4h` and `dense_4h_to_h` in addition to the fused `query_key_value` layer. The configuration used in this run targets only `query_key_value` and `dense`; the two MLP layers are left commented out. Printing the model first shows the module names that are available as targets.

In [None]:
print(model)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (rotary_emb): FalconRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
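
The printout gives enough information for a back-of-the-envelope estimate of how much memory the quantized weights take. The sketch below copies the `Linear4bit` shapes from the output above and assumes exactly 4 bits per weight; it ignores the quantization constants, the embeddings and the layer norms, which are not stored in 4-bit, so the real footprint is somewhat larger. The function names are illustrative only.

```python
# Linear4bit shapes (in_features, out_features) copied from the printout above.
LINEAR_SHAPES = {
    "query_key_value": (4544, 4672),
    "dense": (4544, 4544),
    "dense_h_to_4h": (4544, 18176),
    "dense_4h_to_h": (18176, 4544),
}
NUM_LAYERS = 32  # "(0-31): 32 x FalconDecoderLayer"


def quantized_linear_params(shapes=LINEAR_SHAPES, num_layers=NUM_LAYERS):
    """Total number of weights held in Linear4bit modules."""
    per_layer = sum(n_in * n_out for n_in, n_out in shapes.values())
    return per_layer * num_layers


def packed_weight_bytes(num_params, bits_per_weight=4):
    """Storage for the packed weights only (assumption: no overhead)."""
    return num_params * bits_per_weight // 8


params = quantized_linear_params()
print(f"weights in Linear4bit layers: {params:,}")
print(f"approx. packed size: {packed_weight_bytes(params) / 2**30:.2f} GiB")
```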


In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 8

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        # "dense_h_to_4h",
        # "dense_4h_to_h",
    ]
)
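
To get a feel for what the choice of `target_modules` costs, the number of trainable adapter weights can be estimated from the layer shapes printed earlier. LoRA adds two low-rank matrices per targeted layer, one of shape `r x in_features` and one of shape `out_features x r`, so each layer contributes `r * (in_features + out_features)` weights. The sketch below is plain arithmetic under that assumption and with `bias="none"`; `lora_params` is a hypothetical helper, not a PEFT function.

```python
LORA_R = 8        # lora_r in the config above
NUM_LAYERS = 32   # decoder layers in the printed model

# (in_features, out_features) of each candidate target, from print(model).
SHAPES = {
    "query_key_value": (4544, 4672),
    "dense": (4544, 4544),
    "dense_h_to_4h": (4544, 18176),
    "dense_4h_to_h": (18176, 4544),
}


def lora_params(targets, r=LORA_R, num_layers=NUM_LAYERS):
    """Adapter weights added by LoRA for the given target module names."""
    per_layer = sum(r * (SHAPES[name][0] + SHAPES[name][1]) for name in targets)
    return per_layer * num_layers


attention_only = lora_params(["query_key_value", "dense"])
all_linear = lora_params(list(SHAPES))
print(f"attention layers only: {attention_only:,}")
print(f"all linear layers:     {all_linear:,}")
```

Under these assumptions, enabling the two MLP layers roughly triples the number of trainable weights, which is the trade-off behind leaving them commented out.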

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [None]:
# from trl.trainer import ConstantLengthDataset

In [None]:
from transformers import TrainingArguments

In [None]:
# help(TrainingArguments)

In [None]:
training_arguments = TrainingArguments(
    output_dir="/content/drive/MyDrive/CB/LLM/Falcon-7b-MCQ-sample_dataset-model/finetuned_model/SFT_tuning_with_first_two_modules",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    evaluation_strategy='epoch',
    num_train_epochs=3,
    save_strategy='epoch',
    logging_steps=100,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    group_by_length=True,
    warmup_ratio = 0.03,
    lr_scheduler_type="constant",
)
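
With a per-device batch size of 1 and 4 gradient accumulation steps, the effective batch size is 4. The sketch below estimates the resulting number of optimizer steps for the 999-example training set. It assumes a single GPU and that a trailing partial accumulation window is not counted as a step; other library versions may round differently. Under these assumptions the result matches the `checkpoint-747` directory that is loaded later in the notebook. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Estimate how many optimizer updates a training run performs."""
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    # Assumption: an incomplete accumulation window at the end of an epoch
    # does not count as an extra update.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


steps = optimizer_steps(num_examples=999, per_device_batch_size=1,
                        grad_accum_steps=4, epochs=3)
print(f"estimated optimizer steps: {steps}")
```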

Finally, pass everything to the trainer.

In [None]:
from trl import SFTTrainer

max_seq_length = 128

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,

)



We can also pre-process the model by upcasting the layer norms to float32 for more stable training. The cell below is left commented out in this run.

In [None]:
# for name, module in trainer.model.named_modules():
#     if "norm" in name:
#         module = module.to(torch.float32)

## Train the model

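Now let's train the model! Simply call `trainer.train()`. In this run the call is commented out, and the LoRA adapter is instead loaded from a previously saved checkpoint (`checkpoint-747`).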

In [None]:
# trainer.train()

In [None]:
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

In [None]:
peft_model = PeftModel.from_pretrained(model,
                                       '/content/drive/MyDrive/CB/LLM/Falcon-7b-MCQ-sample_dataset-model/finetuned_model/SFT_tuning_with_first_two_modules/checkpoint-747',
                                       lora_config=peft_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)

# print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

In [None]:
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

In [None]:
# Falcon is a decoder-only (causal) model, so use the causal-LM value-head wrapper.
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(peft_model,
                                                              torch_dtype=torch.bfloat16,
                                                              is_trainable=True)

# print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)



ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=4544, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


In [None]:
from trl import create_reference_model

In [None]:
ref_model = create_reference_model(ppo_model)
# print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

# Reward model

In [None]:
import torch
from tqdm import tqdm
import pandas as pd
tqdm.pandas()
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead,RewardTrainer
from trl.core import LengthSampler
import random
from datasets import Dataset, load_dataset
from transformers import AutoModelForSequenceClassification,AutoTokenizer,TrainingArguments,pipeline

In [None]:
# config = PPOConfig(
#     model_name=model_name,
#     learning_rate=1.41e-5)

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5)

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Short-Answer-Feedback/saf_communication_networks_english")

# Access dataset splits (train, validation, test)
train_data = dataset["train"]
validation_data = dataset["validation"]
# test_data = dataset["test"]
# train_df = pd.DataFrame(df)

import pandas as pd

# Export the training split to CSV; the reward-model cells below read it back with pandas
train_data.to_csv("train_data.csv", index=False)


2335929

In [None]:
df = pd.read_csv('train_data.csv')
df = df[:100]

In [None]:
import random
import pandas as pd
from operator import itemgetter
import torch
import warnings
warnings.filterwarnings('ignore')
from datasets import Dataset, load_dataset
from transformers import AutoModelForSequenceClassification,AutoTokenizer,TrainingArguments
from trl import RewardTrainer

In [None]:
df['tup'] = list(zip(df['provided_answer'], df['score']))
df_g = df.groupby('question')['tup'].apply(list).reset_index()
df_g["sorted_tup"] = df_g["tup"].apply(lambda x :sorted(x,key=itemgetter(1)) )
df_g["chosen"] = df_g["sorted_tup"].apply(lambda x: x[-1][0])
df_g["chosen_score"] = df_g["sorted_tup"].apply(lambda x: x[-1][1])
df_g["rejected"] = df_g["sorted_tup"].apply(lambda x: x[0][0])
df_g["rejected_score"] = df_g["sorted_tup"].apply(lambda x: x[0][1])
df_g = df_g.dropna()
df_g = df_g[(df_g['chosen_score']>=0.5) & (df_g['rejected_score']<0.5)]
df_g.to_csv("feedback_comparison_dataset.csv")
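
The cell above builds one comparison pair per question: the highest-scored answer becomes `chosen`, the lowest-scored becomes `rejected`, and a question is kept only if the chosen score is at least 0.5 and the rejected score is below 0.5. The same logic can be sketched without pandas on a few invented records (the toy questions, answers and scores below are made up for illustration, and `build_pairs` is a hypothetical helper).

```python
from collections import defaultdict


def build_pairs(records, threshold=0.5):
    """records: iterable of (question, answer, score) tuples."""
    by_question = defaultdict(list)
    for question, answer, score in records:
        by_question[question].append((answer, score))

    pairs = []
    for question, answers in by_question.items():
        ranked = sorted(answers, key=lambda item: item[1])  # ascending score
        (rejected, low), (chosen, high) = ranked[0], ranked[-1]
        # Keep only questions with a clearly good and a clearly bad answer.
        if high >= threshold and low < threshold:
            pairs.append({
                "instruction": question,
                "chosen_response": chosen,
                "rejected_response": rejected,
            })
    return pairs


# Invented toy data: Q2 has no answer scored below the threshold.
toy = [
    ("Q1", "good answer", 1.0),
    ("Q1", "bad answer", 0.0),
    ("Q1", "partial answer", 0.5),
    ("Q2", "good answer A", 1.0),
    ("Q2", "good answer B", 0.75),
]
pairs = build_pairs(toy)
print(pairs)
```

Questions whose answers all fall on one side of the threshold are dropped, which helps explain why restricting the data to the first 100 rows leaves only a handful of pairs.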

In [None]:
rows = []
for record in df_g.itertuples(index=True, name='Pandas'):
    if record is None or len(record) == 0:
        continue
    rows.append({
        "instruction": record.question,
        "chosen_response": record.chosen,
        "rejected_response": record.rejected
    })

prepared_dataset = Dataset.from_list(rows)
prepared_dataset.to_pandas()

| | instruction | chosen_response | rejected_response |
|---|---|---|---|
| 0 | Consider the following topology from the exerc... | Hop 1:\n(A, B, forward) because A is source an... | Hop 1:\n(H,G, forward)\nHop 2:\n(G,F, forward)... |
| 1 | WHAT is the purpose of Reverse Path Forwarding... | The purpose of Reverse Path Forwarding (RPF) i... | 1. Purpose: help prevent IP address spoofing.... |
| 2 | WHICH PROPERTY of spanning trees makes them ap... | Spanning trees allows to reach all other nodes... | The property is that all IS know the multicast... |
| 3 | Write-down all addresses in Class A networks t... | 0.0.0.0/8 - Addresses in this block refer to s... | 0.0.0.0 , 127.255.255.255 |


In [None]:
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name).cuda()
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token



In [None]:
# Select the base model to fine-tune as the reward model.
model_name = "distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

def formatting_func(examples):
    kwargs = {"padding": "max_length", "truncation": True, "max_length": 128, "return_tensors": "pt"}

    prompt_plus_chosen_response = examples["instruction"] + "\n" + examples["chosen_response"]
    prompt_plus_rejected_response = examples["instruction"] + "\n" + examples["rejected_response"]
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)

    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0], "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0], "attention_mask_rejected": tokens_rejected["attention_mask"][0]
    }

formatted_dataset = prepared_dataset.map(formatting_func)
formatted_dataset = formatted_dataset.train_test_split()


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/RH/RLHF/reward_model",
    per_device_train_batch_size=8,
    evaluation_strategy="steps",
    logging_steps=1,
    num_train_epochs = 3,
    report_to="none",  # the string "none" disables logging integrations such as wandb
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
)

In [None]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6808 | 0.687487 | 1.0 |
| 2 | 0.683 | 0.685138 | 1.0 |
| 3 | 0.6659 | 0.684253 | 1.0 |


TrainOutput(global_step=3, training_loss=0.67656143506368, metrics={'train_runtime': 554.2213, 'train_samples_per_second': 0.016, 'train_steps_per_second': 0.005, 'total_flos': 0.0, 'train_loss': 0.67656143506368, 'epoch': 3.0})
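
The logged losses sit just below ln 2 ≈ 0.693, which is the value a pairwise logistic loss takes when the model scores the chosen and rejected answers equally. Assuming the reward trainer optimizes the usual `-log(sigmoid(r_chosen - r_rejected))` objective, the sketch below shows how the loss changes with the reward margin. With only four comparison pairs in total, an accuracy of 1.0 says little, and the loss suggests the model has barely started to separate the two classes.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(chosen - rejected)), in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)) but safe for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The reward margins below are arbitrary illustration values.
for margin in (0.0, 0.03, 1.0, 5.0):
    print(f"margin {margin:>4}: loss {pairwise_loss(margin, 0.0):.4f}")
```

A margin of zero gives exactly ln 2, and the loss only drops well below that once the chosen answer is scored clearly higher than the rejected one.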

In [None]:
trainer.save_model()

# RLHF Training

In [None]:
config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5)

In [None]:
def build_dataset(config, input_min_text_length=2, input_max_text_length=200):
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    df = pd.read_csv("/content/train_data.csv")
    ds = Dataset.from_pandas(df)
    ds = ds.rename_columns({"question": "review"})
    input_size = LengthSampler(input_min_text_length, input_max_text_length)
    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample
    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

In [None]:
dataset = build_dataset(config)
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
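
The `collator` above turns a list of per-sample dictionaries into a single dictionary of lists, without padding, so each query keeps its own length. A small standalone illustration with invented sample values:

```python
def collator(data):
    # Same definition as in the cell above: list of dicts -> dict of lists.
    return dict((key, [d[key] for d in data]) for key in data[0])


# Two made-up samples with token lists of different lengths.
batch = collator([
    {"input_ids": [1, 2, 3], "query": "first"},
    {"input_ids": [4, 5], "query": "second"},
])
print(batch)
```

Because the keys are taken from the first sample only, every sample in a batch is expected to carry the same fields.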


In [None]:
dataset[:10]

{'id': ['6a31b925382d4e31a417cc78399dbff2',
  'fa2712ae605143e1a277ad6df6c2d7b3',
  '76a8b716eb8f4bdf8ef918aa1ab467ad',
  '99a2b05b1e8c42f09feeae71be973b31',
  '435d0324512d46cf8c23f34fa71b950c',
  '0006018bf61042dbbb2aad64b52f05c1',
  '2f583d5638af4bb7951bb301bb581765',
  'af3bbaff4bfb4c20a633530f876e3aa0',
  '050459587d614dd5b3662284b3a75de8',
  'dffb7d34fed74b198a8d1aeb4212c593'],
 'review': ['What is "frame bursting"? Also, give 1 advantage and disadvantage compared to the carrier extension.',
  'Discuss 3 methods (each with at least one advantage and disadvantage) that address the problem of duplicate packets on the transport layer in a connection-oriented service.',
  'Consider a single server queueing system with a buffer of size 10. Let us assume that 9 packets arrive per second and 10 packets are served per second on an average. Assume you monitor the system for exactly one minute after the system reaches equilibrium. How many seconds would you expect the system to be in a sta

In [None]:
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name).cuda()
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)

In [None]:
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"

In [None]:
rm_model_trained = AutoModelForSequenceClassification.from_pretrained("/content/drive/MyDrive/RH/RLHF/reward_model")
rm_tokenizer_trained = AutoTokenizer.from_pretrained("/content/drive/MyDrive/RH/RLHF/reward_model")

if rm_tokenizer_trained.pad_token is None:
    rm_tokenizer_trained.pad_token = rm_tokenizer_trained.eos_token
    rm_model_trained.config.pad_token_id = rm_model_trained.config.eos_token_id

In [None]:
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

In [None]:
output_min_length = 2
output_max_length = 8
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

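    text = [q + r for q, r in zip(batch["query"], batch["response"])]
    # Score query + response with the reward model. Pad only to the longest
    # text in the batch, and skip gradient tracking since the reward model is frozen.
    encoding = rm_tokenizer_trained(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = rm_model_trained(**encoding)
    # One scalar reward tensor per sample.
    rewards = [logit.squeeze() for logit in outputs.logits]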

    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

0it [00:48, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 4.48 GiB. GPU 0 has a total capacity of 14.75 GiB of which 3.27 GiB is free. Process 5642 has 11.47 GiB memory in use. Of the allocated memory 10.84 GiB is allocated by PyTorch, and 512.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
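
When a language model runs out of memory during PPO, one tensor worth checking is the logits, whose size grows with batch size, sequence length and vocabulary size. The sketch below is only an estimator and does not diagnose the specific allocation that failed above. The batch sizes are hypothetical, since the notebook does not set one explicitly, and the sequence length of 208 is an upper bound taken from the 200-token query limit plus the 8-token response limit. The vocabulary size is that of GPT-2.

```python
GPT2_VOCAB_SIZE = 50257


def logits_bytes(batch_size, seq_len, vocab_size=GPT2_VOCAB_SIZE,
                 bytes_per_value=4):
    """Size of one float32 logits tensor of shape (batch, seq, vocab)."""
    return batch_size * seq_len * vocab_size * bytes_per_value


def to_gib(num_bytes):
    return num_bytes / 2**30


# Hypothetical batch sizes, for comparison only.
for batch_size in (128, 32, 8):
    size = logits_bytes(batch_size, seq_len=208)
    print(f"batch {batch_size:>3}: {to_gib(size):.2f} GiB per logits tensor")
```

If the estimate for the configured batch size is in the range of the failed allocation, lowering the batch or mini-batch size in `PPOConfig`, or shortening the queries, is a reasonable first thing to try.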