In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Install the required libraries

In [2]:
!pip3 install transformers
!pip3 install datasets
!pip3 install trl
!pip install -q -U bitsandbytes
!pip install accelerate
!pip install peft



## Import the libraries

In [2]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer,TrainingArguments, AutoModelForQuestionAnswering
from trl import DPOTrainer, SFTTrainer
from transformers import BitsAndBytesConfig
from torch import bfloat16
import torch
from peft import LoraConfig,PeftModel, PeftConfig, get_peft_model, AutoPeftModelForCausalLM
import warnings
import os
warnings.filterwarnings("ignore")

## Load the model and tokenizer

### To fine-tune an LLM using DPO:
* First, we perform supervised fine-tuning (SFT).
* Once supervised fine-tuning is done, we perform DPO training, which takes the SFT model as input.

### We are going to use quantization to load the model
* Most deep learning models store their numeric values as float32.
* The float32 representation has its own advantages and disadvantages.
* A 32-bit float has 1 sign bit, 8 exponent bits and 23 fraction bits, so each value is one of 2^32 possible bit patterns and occupies 4 bytes.
* With hundreds of millions of parameters this adds up quickly and can lead to out-of-memory errors.

* bitsandbytes provides a way to load the model in 4-bit (or 8-bit) precision instead of 32-bit or 16-bit half precision.

* The main question is whether loading the model in fewer bits affects its performance. To answer that, we need to understand how the particular quantization scheme works.
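As a quick check of the float32 layout and the memory cost described above, here is a small sketch using only the standard library. The parameter count is an assumption (roughly 350 million, taken from the name `opt-350m`), and the helper names are made up for illustration.

```python
import struct

def float32_fields(x):
    """Split a float32 into its sign, exponent and fraction bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction

def weight_memory_bytes(num_params, bits_per_param):
    """Memory needed to store the weights alone (no activations, no optimizer state)."""
    return num_params * bits_per_param / 8

NUM_PARAMS = 350_000_000  # assumption: rough size implied by the name "opt-350m"

print(float32_fields(1.0))       # sign, biased exponent, fraction
print(float32_fields(-0.15625))
for bits in (32, 16, 4):
    gigabytes = weight_memory_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: about {gigabytes:.3f} GB")
```

The 4-bit figure ignores the small overhead of the per-block scaling constants that a real quantized checkpoint also stores.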


> Let's say I load the model with 4-bit quantization using bitsandbytes, with `bnb_4bit_quant_type='nf4'`.
* First, the model weights are split into blocks, and each block is normalized into a fixed range ([-1, 1]) by dividing by its largest absolute value.
* After normalization, each weight is mapped to the nearest of 16 levels (4 bits). For NF4 these levels are not evenly spaced: they follow the quantiles of a normal distribution, so more levels sit near zero, where most weights lie.
* Although the weights are stored in 4-bit, they are dequantized to the compute dtype during computation. The main gain is a much smaller memory footprint.

> Quantization does have some effect on model quality, but it is usually small, and in exchange we greatly reduce the memory cost.
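To make the normalize, quantize and dequantize steps concrete, here is a toy sketch in plain Python. It is not the bitsandbytes implementation: the 16 levels are built from normal quantiles to mimic the idea behind NF4, but they are not the real NF4 table, and the function names and the sample block are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(num_levels=16):
    """Illustrative 4-bit codebook: normal quantiles rescaled to [-1, 1].
    NOT the actual NF4 table, only the same idea."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    max_abs = max(abs(v) for v in raw)
    return [v / max_abs for v in raw]

def quantize_block(weights, levels):
    """Normalize a block by its absolute maximum, then store the index of the nearest level."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [
        min(range(len(levels)), key=lambda k: abs(w / scale - levels[k]))
        for w in weights
    ]
    return codes, scale

def dequantize_block(codes, scale, levels):
    """Recover approximate weights from the 4-bit codes and the block scale."""
    return [levels[c] * scale for c in codes]

LEVELS = normal_quantile_levels()
block = [0.02, -0.11, 0.45, -0.9, 0.003, 0.27]  # hypothetical weights
codes, scale = quantize_block(block, LEVELS)
restored = dequantize_block(codes, scale, LEVELS)

for original, code, approx in zip(block, codes, restored):
    print(f"{original:+.3f} -> code {code:2d} -> {approx:+.3f}")
```

Each weight is stored as a code between 0 and 15 plus one shared scale per block, which is where the memory saving comes from. The printed reconstruction error is the price paid for it.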





In [4]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)
#Even though the weights are loaded in a 4-bit format,
#computations during forward and backward passes will be performed using bfloat16 data type,
#which is a 16-bit floating-point representation.
#bfloat16 strikes a balance between memory efficiency and computational precision.
model_name ="facebook/opt-350m"

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
base_model.config.use_cache = False

### Let's apply LoRA for fine-tuning.
> LoRA (Low-Rank Adaptation of Large Language Models) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share. These are also called adapters.
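A rough sketch of why LoRA trains so few weights: the frozen weight matrix `W` is left untouched and a low-rank update `B @ A`, scaled by `alpha / r`, is added on top. The sizes below assume the 1024x1024 attention projections and 24 decoder layers visible in the model printout in this notebook, and `r=16`, `alpha=32` as in the `LoraConfig` used later. The helper functions and the tiny 2x2 example are made up for illustration and are not the peft implementation.

```python
def lora_param_count(in_features, out_features, r):
    """Trainable weights of one adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Parameter arithmetic for one 1024x1024 projection with r=16.
full = 1024 * 1024
lora = lora_param_count(1024, 1024, r=16)
adapted_modules = 4 * 24  # q/k/v/out projections in each of 24 layers
print(full, lora, lora / full, lora * adapted_modules)

# Tiny forward pass with r=1 (hypothetical numbers).
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]
x = [1.0, 0.0]
B_zero = [[0.0], [0.0]]   # B starts at zero, so the adapter initially changes nothing
B_trained = [[1.0], [0.0]]
print(lora_forward(W, A, B_zero, x, alpha=2, r=1))
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))
```

With these sizes each adapted projection adds 32,768 trainable weights next to 1,048,576 frozen ones, about 3% of the layer.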

In [5]:
base_model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=1024, out_features=4096, bias=True)
          (

### Applying SFTTrainer (supervised fine-tuning for large language models)

* Supervised fine-tuning is not strictly required, but there are good reasons to do it. Let's understand this with an example.

<img src="https://drive.google.com/uc?id=1p0tirQiwwuw8B36fmJYIM6DpBbx_4jAJ">

* The image above shows the output we get if we don't fine-tune the model with SFT. The output is good, but it lacks the human touch: it does give an answer, but it sounds like machine-generated text.
* We want responses that sound like a human wrote them, not a machine.
* In such cases we fine-tune the model, which helps it learn the relationship between input and output better.
* As shown in the image below:


<img src="https://drive.google.com/uc?id=1-OnH8UfTdwzzdnenwJnFmTsHlI8aVL6f">


> Images taken from https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3



In [31]:
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [6]:
# Build one training string per example from a batch of 'text' and 'label' columns
def formatting_prompts_func(example):
  output_texts = []
  assert len(example["text"]) == len(example["label"])
  for i in range(len(example['text'])):
    text = f"### Input: ```{example['text'][i]}```\n ### Output: {example['label'][i]}"
    output_texts.append(text)
  return output_texts
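To see what the trainer will actually be fed, the formatting function can be run on a toy batch. The question and answer below are made-up placeholders, not rows from the dataset, and the three backticks are built with `TICKS` only so that they can be written inside this snippet; the resulting string is the same as the one produced by the function above.

```python
TICKS = "`" * 3  # three backticks, spelled this way only for this snippet

def formatting_prompts_func(example):
    """Same template as the notebook cell above, applied to a batch of columns."""
    output_texts = []
    assert len(example["text"]) == len(example["label"])
    for i in range(len(example["text"])):
        text = f"### Input: {TICKS}{example['text'][i]}{TICKS}\n ### Output: {example['label'][i]}"
        output_texts.append(text)
    return output_texts

# Hypothetical batch in the same column layout as the SFT dataset
toy_batch = {
    "text": ["What is a bond?"],
    "label": ["A bond is a loan made to an issuer."],
}
formatted = formatting_prompts_func(toy_batch)
print(formatted[0])
```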

In [7]:
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

In [8]:
sft_dataset = load_dataset("gbharti/finance-alpaca", split="train")
sft_datset_new = sft_dataset.select(range(10000))
sft_splitted_data = sft_datset_new.train_test_split(test_size=0.3,seed=42, shuffle=True)

sft_splitted_data["train"] = sft_splitted_data["train"].remove_columns(["input", "text"])
sft_splitted_data["test"] = sft_splitted_data["test"].remove_columns(["input", "text"])

sft_splitted_data["train"] = sft_splitted_data["train"].rename_column("output", "label")
sft_splitted_data["train"] = sft_splitted_data["train"].rename_column("instruction", "text")

sft_splitted_data["test"]  = sft_splitted_data["test"].rename_column("output", "label")
sft_splitted_data["test"]  = sft_splitted_data["test"].rename_column("instruction", "text")

In [9]:
sft_splitted_data

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 7000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 3000
    })
})

In [10]:
tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# training the supervised model
"""
 max_step = epoch*(total_example/batch_size)
 logging_step = total_example/batch_size
 loggers = logging_step/epoch
"""
output_path = "/content/drive/MyDrive/fine_tuning_llm_with_dop_peft/models/"
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["k_proj","v_proj","q_proj","out_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
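# Note: this formula divides by the per-device batch size (4) only. With
# gradient_accumulation_steps=2 each step consumes 8 examples, so the 3500 steps
# cover 4 epochs of the 7000 training examples rather than 2.
MAX_STEPS = int(2 * len(sft_splitted_data["train"]) / 4)
args = TrainingArguments(
        output_path,
        overwrite_output_dir=True, # Overwrite any existing contents of the output directory
        fp16=True,  # fp16 mixed-precision training to allow larger batch sizes to be used
        evaluation_strategy = "steps",
        save_strategy = "steps",
        learning_rate=5e-5,
        warmup_steps=500,  # Takes precedence over warmup_ratio when both are set
        warmup_ratio=0.1,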
        do_train=True,
        do_eval=True,
        logging_steps=500,
        eval_steps=500,
        save_total_limit=1,
        gradient_accumulation_steps=2,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        max_steps=MAX_STEPS,
        weight_decay=0.01,
        dataloader_num_workers=4,
        load_best_model_at_end=True
)

trainer = SFTTrainer(
    model=base_model,
    train_dataset=sft_splitted_data["train"],
    eval_dataset=sft_splitted_data["test"],
    tokenizer=tokenizer,
    peft_config=config,
    formatting_func=formatting_prompts_func,
    dataset_text_field="text",
    args=args,         # Trainer arguments
    max_seq_length=256,
)
trainer.train()

Step,Training Loss,Validation Loss
500,3.9444,3.648086
1000,3.6399,3.55551
1500,3.565,3.506459
2000,3.5012,3.471764
2500,3.4699,3.445186
3000,3.4301,3.432025
3500,3.4403,3.425004


TrainOutput(global_step=3500, training_loss=3.5701109793526786, metrics={'train_runtime': 2393.2822, 'train_samples_per_second': 11.699, 'train_steps_per_second': 1.462, 'total_flos': 1056753579786240.0, 'train_loss': 3.5701109793526786, 'epoch': 4.0})
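The `'epoch': 4.0` in the output above follows from the step arithmetic. A small sketch, assuming a single GPU (the function names are made up for illustration):

```python
def epochs_covered(max_steps, per_device_batch_size, grad_accum_steps,
                   num_examples, num_devices=1):
    """How many passes over the data a given number of optimizer steps makes."""
    examples_seen = max_steps * per_device_batch_size * grad_accum_steps * num_devices
    return examples_seen / num_examples

def steps_for_epochs(epochs, per_device_batch_size, grad_accum_steps,
                     num_examples, num_devices=1):
    """Optimizer steps needed for a target number of epochs."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return int(epochs * num_examples / effective_batch)

# SFT run above: 3500 steps, batch size 4, gradient accumulation 2, 7000 examples
sft_epochs = epochs_covered(3500, 4, 2, 7000)
two_epoch_steps = steps_for_epochs(2, 4, 2, 7000)
print(sft_epochs, two_epoch_steps)
```

So if two epochs were the goal, `max_steps` would need to account for gradient accumulation as well as the batch size.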

In [3]:
output_path = "/content/drive/MyDrive/fine_tuning_llm_with_dop_peft/models/"
adapter_path = os.path.join(output_path,"checkpoint-3500")

In [4]:
math_dataset = load_dataset("argilla/distilabel-math-preference-dpo", split="train")
def process_dataset(sample_data):
  # Map a batch of raw columns to the prompt/chosen/rejected format expected by DPOTrainer
  return {
      "prompt": [f"Question: {question}\n\nAnswer: "
      for question in sample_data["instruction"]
      ],
      "chosen": sample_data["chosen_response"],
      "rejected": sample_data["rejected_response"]
  }

original_cols = math_dataset.column_names
math_dataset = math_dataset.map(process_dataset,batched=True,remove_columns=original_cols)
# let's divide the dataset
math_dataset_splits = math_dataset.train_test_split(test_size=0.2,seed=42,shuffle=True)
math_dataset_splits

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 1934
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 484
    })
})
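To make the three columns concrete, here is the same mapping applied to a toy batch. The question and the two responses are made-up placeholders, not rows from `argilla/distilabel-math-preference-dpo`.

```python
def process_dataset(sample_data):
    """Same mapping as the notebook cell above: one prompt, one preferred and one rejected answer."""
    return {
        "prompt": [f"Question: {q}\n\nAnswer: " for q in sample_data["instruction"]],
        "chosen": sample_data["chosen_response"],
        "rejected": sample_data["rejected_response"],
    }

# Hypothetical batch using the raw column names of the preference dataset
toy_batch = {
    "instruction": ["What is 2 + 2?"],
    "chosen_response": ["2 + 2 = 4."],
    "rejected_response": ["2 + 2 = 5."],
}
processed = process_dataset(toy_batch)
print(repr(processed["prompt"][0]))
print(processed["chosen"][0], "|", processed["rejected"][0])
```

Each row therefore pairs one prompt with a preferred and a dispreferred completion, which is the input DPO training compares.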

In [5]:
# now we can train the DPO model based on the dataset we create
model = AutoPeftModelForCausalLM.from_pretrained(
 adapter_path,
 low_cpu_mem_usage=True,
 torch_dtype=torch.bfloat16,
 load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(adapter_path,trust_remote_code=True)

In [6]:
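# 5 * 1934 / 16 gives 604 steps. Each step consumes 2 * 2 = 4 examples
# (per-device batch size * gradient accumulation), so this is about 1.25 epochs.
MAX_STEPS = int(5 * len(math_dataset_splits["train"]) / 16)

output_path_dpo = os.path.join(output_path, "dpo_models")
args = TrainingArguments(
        output_path_dpo,
        overwrite_output_dir=True, # Overwrite any existing contents of the output directory
        fp16=True,  # fp16 mixed-precision training to allow larger batch sizes to be used
        evaluation_strategy = "steps",
        save_strategy = "steps",
        learning_rate=5e-5,
        warmup_steps=500,  # Takes precedence over warmup_ratio; 500 of the 604 steps are warmup
        warmup_ratio=0.1,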
        do_train=True,
        do_eval=True,
        logging_steps=500,
        eval_steps=500,
        save_total_limit=1,
        gradient_accumulation_steps=2,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        max_steps=MAX_STEPS,
        weight_decay=0.01,
        dataloader_num_workers=4,
        load_best_model_at_end=True
)


In [7]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["k_proj","v_proj","q_proj","out_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

dpo_trainer = DPOTrainer(
    model,
    args=args,
    beta=0.1,
    train_dataset=math_dataset_splits["train"],
    eval_dataset=math_dataset_splits["test"],
    tokenizer=tokenizer,
    peft_config=config,
    max_length=512,
)
dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/rejected,Logps/chosen,Logits/rejected,Logits/chosen
500,0.8526,0.69232,0.555589,0.50666,0.528926,0.04893,-596.032288,-616.445007,-1.932621,-2.044497


TrainOutput(global_step=604, training_loss=0.8429368758043706, metrics={'train_runtime': 904.2759, 'train_samples_per_second': 2.672, 'train_steps_per_second': 0.668, 'total_flos': 0.0, 'train_loss': 0.8429368758043706, 'epoch': 1.25})

## The DPO pipeline consists of three main steps:
 * Supervised fine-tuning (SFT) of the base model
 * Annotating data with preference labels (chosen vs. rejected responses)
 * DPO training on the preference data, starting from the SFT model
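For intuition about the numbers logged during DPO training, here is a minimal sketch of the standard DPO (sigmoid) loss for a single preference pair, written with the standard library only. It is not the TRL code; the function name and the log-probabilities are hypothetical, and `beta=0.1` matches the value passed to `DPOTrainer` above.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, plus the implicit rewards.

    Each reward is beta * (policy log-prob - reference log-prob);
    the loss is -log(sigmoid(chosen_reward - rejected_reward)).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, chosen_reward, rejected_reward

# Policy identical to the reference: zero margin, loss = ln 2
start_loss, _, _ = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Hypothetical pair where the policy has moved toward the chosen answer
better_loss, r_chosen, r_rejected = dpo_loss(-10.0, -15.0, -12.0, -12.0)

print(start_loss, better_loss, r_chosen - r_rejected)
```

A model that cannot tell chosen from rejected sits at ln 2, roughly 0.693. The validation loss of 0.69232 and reward accuracy of about 0.53 in the table above are close to that baseline, which suggests this short run learned only a weak preference signal.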



# Inference

In [None]:
!rm -rf /content/drive/MyDrive/fine_tuning_llm_with_dop_peft/models/dpo_models/checkpoint-11500

In [8]:
adapter_model = "/content/drive/MyDrive/fine_tuning_llm_with_dop_peft/models/dpo_models/checkpoint-500/"
# tokenizer = AutoTokenizer.from_pretrained(adapter_model)
# model = AutoModelForCausalLM.from_pretrained(base_model)
# model.resize_token_embeddings(len(tokenizer))
# peftmodel = PeftModel.from_pretrained(model, adapter_model)

model = AutoPeftModelForCausalLM.from_pretrained(adapter_model, torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(adapter_model)

In [13]:
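model.eval()
# Slicing a Dataset (test[:20]) returns a dict of columns, so its len() is the
# number of columns, not rows. Iterate over row indices instead.
num_examples = min(20, len(math_dataset_splits["test"]))
for i in range(num_examples):
  prompt = math_dataset_splits["test"][i]["prompt"]

  inputs = tokenizer(prompt, return_tensors="pt")
  with torch.no_grad():
    # Note: temperature, top_p and top_k only take effect when sampling is
    # enabled (do_sample=True); otherwise generate() decodes greedily.
    generate_kwargs = dict(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=100,
    repetition_penalty=1.3
    )
    outputs = model.generate(**generate_kwargs)
  print(tokenizer.decode(outputs[0]))
  print("**"*50)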


</s>Question: Explain the concept of conditional probability and provide an example..
 Take a deep breath, think step by step, and give an accurate response

Answer: ___________.
I am trying to understand the concept of conditional probabilities in a mathematical context (e.g.,
the mathematical model of a probability distribution). I have tried to explain it using
a mathematical model that is based on the following concepts:
(1) The probability distribution is a function of the number of possible values for a given
value;
and
(2) The probability distributions are a function of the number of possible values for any one
of these values.

****************************************************************************************************
</s>Question: Detail the steps involved in finding the solution to a linear equation with three variables..
 Take a deep breath, think step by step, and give an accurate response

Answer: _______________________________________________
A.
1)
2)
3)
4)
5)
6