## PEFT (Parameter-Efficient Fine-Tuning):
* A library for efficiently adapting large pretrained models to downstream applications without fine-tuning all of a model’s parameters, which is prohibitively costly.
* Adapters were one of the first parameter-efficient fine-tuning techniques. They showed that you can add extra layers to a pre-existing transformer architecture and fine-tune only those layers instead of the whole model.
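
The adapter idea can be illustrated with a minimal pure-Python sketch of a bottleneck adapter: a down-projection, a nonlinearity, an up-projection, and a residual connection. The sizes, the weight values, the GELU choice and the zero-initialised up-projection are assumptions made for illustration, not taken from a specific implementation.

```python
import math

def gelu(x):
    # exact GELU; any smooth nonlinearity would do for this sketch
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adapter_forward(hidden, w_down, w_up):
    """Bottleneck adapter: hidden + W_up @ act(W_down @ hidden)."""
    bottleneck = [gelu(z) for z in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]

def adapter_param_count(d_model, bottleneck):
    # two weight matrices plus their biases (biases are left out of adapter_forward for brevity)
    return 2 * d_model * bottleneck + d_model + bottleneck

d_model, bottleneck = 4, 2
w_down = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]  # bottleneck x d_model (made-up values)
w_up = [[0.0] * bottleneck for _ in range(d_model)]       # d_model x bottleneck, zero init
hidden = [1.0, 2.0, 3.0, 4.0]

print(adapter_forward(hidden, w_down, w_up))  # zero up-projection leaves the input unchanged
print(adapter_param_count(768, 64))           # trainable parameters for one adapter of this size
```

Only `w_down` and `w_up` would be trained; the surrounding transformer weights stay frozen.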


In [1]:
import sys
!{sys.executable} -m pip install -q peft ipywidgets




In [2]:
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType, PromptTuningConfig
import torch
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"


In [3]:
device = "mps" if torch.backends.mps.is_available() else "cpu"
device

'mps'

### Supported PEFT types:
    - PROMPT_TUNING
    - MULTITASK_PROMPT_TUNING
    - P_TUNING
    - PREFIX_TUNING
    - LORA (Default value)
    - ADALORA
    - ADAPTION_PROMPT
    - IA3
    - LOHA
    - LOKR
    - OFT

### Overview of the supported task types:
    - SEQ_CLS: Text classification.
    - SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
    - CAUSAL_LM: Causal language modeling.
    - TOKEN_CLS: Token classification.
    - QUESTION_ANS: Question answering.
    - FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings or features
      for downstream tasks.

In [7]:
"""r : the rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
    bias: specifies whether the bias parameters should be trained ('none', 'all' or 'lora_only').
"""

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, # sequence-to-sequence language modeling
    target_modules=None, # names of the modules to apply the adapter to; None lets PEFT pick defaults for the architecture
    inference_mode=False, # False because the model is going to be trained
    r=8,  # LoRA attention dimension (the "rank")
    lora_alpha=32, # alpha parameter for scaling the low-rank update
    lora_dropout=0.1 # dropout probability for LoRA layers
)
peft_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.SEQ_2_SEQ_LM: 'SEQ_2_SEQ_LM'>, inference_mode=False, r=8, target_modules=None, lora_alpha=32, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) ## 4.92GB

In [9]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 2,359,296 || all params: 1,231,940,608 || trainable%: 0.19151053100118282
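
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes that mt0-large has a hidden size of 1024 with 24 encoder and 24 decoder layers, and that PEFT's default targets for this architecture are the `q` and `v` projections of every attention block; these values are assumptions, not read from the model config here.

```python
def lora_params_per_module(r, d_in, d_out):
    # A is r x d_in, B is d_out x r
    return r * d_in + d_out * r

d_model = 1024        # assumed hidden size of mt0-large
encoder_layers = 24   # assumed
decoder_layers = 24   # assumed

# assumed targets: q and v in each attention block;
# every decoder layer has both self-attention and cross-attention
adapted_modules = encoder_layers * 2 + decoder_layers * 4

trainable = adapted_modules * lora_params_per_module(8, d_model, d_model)
total = 1_231_940_608  # "all params" as printed by print_trainable_parameters()

print(trainable, 100 * trainable / total)
```

Under these assumptions the result matches the printed 2,359,296 trainable parameters.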


In [10]:
## Saves only the incremental PEFT (adapter) weights that were trained.
## The resulting safetensors file is about 9MB.

model.save_pretrained("model") 
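
The size of the saved adapter follows from the trainable-parameter count. A quick check, assuming the weights are stored as 32-bit floats and ignoring file metadata:

```python
trainable_params = 2_359_296  # from print_trainable_parameters()
bytes_per_param = 4           # assumption: float32 storage

size_bytes = trainable_params * bytes_per_param
print(size_bytes / 2**20)     # size in MiB
```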

In [13]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig

peft_model_id = "model"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)

# Wrap the base model with the saved PEFT adapter weights (they are loaded alongside it, not merged into it)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

In [14]:
model = model.to(device)
model.eval()
inputs = tokenizer("Tweet text : @HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again. Label :", return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=10)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])

Honda CUST SERVICE


## Prompt Tuning:
* Prompt tuning adds task-specific prompts to the input. These prompt parameters are updated independently of the pretrained model parameters, which stay frozen.
* It is an effective mechanism for learning “soft prompts” that condition frozen language models to perform specific downstream tasks.

In [None]:
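from transformers import AutoModelForSeq2SeqLM
from peft import PromptEmbedding, PromptTuningConfig

# base model whose word embeddings are used to initialize the soft prompt
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")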

config = PromptTuningConfig(
    peft_type="PROMPT_TUNING",
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=20,
    token_dim=768, # The hidden embedding dimension of the base transformer model.
    num_transformer_submodules=1,
    num_attention_heads=12,
    num_layers=12,
    prompt_tuning_init="TEXT",
    prompt_tuning_init_text="Predict if sentiment of this review is positive, negative or neutral",
    tokenizer_name_or_path="t5-base",
)

# t5_model.shared is the word embeddings of the base model
prompt_embedding = PromptEmbedding(config, t5_model.shared)
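
Conceptually, prompt tuning prepends a small matrix of trainable vectors to the frozen token embeddings. The pure-Python sketch below uses the `num_virtual_tokens` and `token_dim` values from the config; the helper name, the placeholder values and the five-token input are made up for illustration.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Concatenate trainable prompt vectors in front of the frozen token embeddings."""
    return soft_prompt + token_embeddings

num_virtual_tokens, token_dim = 20, 768
soft_prompt = [[0.0] * token_dim for _ in range(num_virtual_tokens)]  # trainable
token_embeddings = [[1.0] * token_dim for _ in range(5)]             # frozen, 5 input tokens (made up)

sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
trainable = num_virtual_tokens * token_dim

print(len(sequence), trainable)  # 25 15360
```

Only the 20 x 768 prompt matrix would receive gradient updates.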

## Prefix tuning
* Prefix tuning prefixes a series of task-specific vectors to the input sequence that can be learned while keeping the pretrained model frozen. 
* The prefix parameters are inserted in all of the model layers.
* A lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). 
* Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”. 

In [None]:
from peft import PrefixEncoder, PrefixTuningConfig

config = PrefixTuningConfig(
    peft_type="PREFIX_TUNING",
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=20,
    token_dim=768,
    num_transformer_submodules=1,
    num_attention_heads=12,
    num_layers=12,
    encoder_hidden_size=768,
)

prefix_encoder = PrefixEncoder(config)
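
Because the prefix is inserted in every layer, it carries more trainable parameters than prompt tuning for the same number of virtual tokens. The rough count below assumes one key vector and one value vector per virtual token per layer, with no reparameterization MLP; the helper names are hypothetical.

```python
def prefix_param_count(num_virtual_tokens, num_layers, token_dim):
    # one key vector and one value vector per virtual token, per layer
    return num_virtual_tokens * num_layers * 2 * token_dim

def prompt_param_count(num_virtual_tokens, token_dim):
    # prompt tuning only adds vectors at the input layer
    return num_virtual_tokens * token_dim

prefix = prefix_param_count(num_virtual_tokens=20, num_layers=12, token_dim=768)
prompt = prompt_param_count(num_virtual_tokens=20, token_dim=768)

print(prefix, prompt, prefix // prompt)  # 368640 15360 24
```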

## P-tuning
* P-tuning adds trainable continuous prompt embeddings to the input. A prompt encoder optimizes them to find a better prompt, eliminating the need to design prompts manually.
* The prompt tokens can be added anywhere in the input sequence, and P-tuning also introduces anchor tokens to improve performance.


In [None]:
from peft import PromptEncoder, PromptEncoderConfig

config = PromptEncoderConfig(
    peft_type="P_TUNING",
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=20,
    token_dim=768,
    num_transformer_submodules=1,
    num_attention_heads=12,
    num_layers=12,
    encoder_reparameterization_type="MLP",
    encoder_hidden_size=768,
)

prompt_encoder = PromptEncoder(config)

## LoRA (Low-Rank Adaptation)
* By default, PEFT initializes LoRA weights with Kaiming-uniform for weight A and zeros for weight B, so the adapter starts as an identity transform (the initial update is zero).
* `init_lora_weights="gaussian"`: as the name suggests, this initializes weight A with a Gaussian distribution, while B is still zeros (this matches how Diffusers initializes LoRA weights).
* The LoRA architecture scales each adapter during every forward pass by a fixed scalar, which is set at initialization and depends on the rank `r`.
* **Rank-stabilized LoRA**: The scalar is given by `lora_alpha/r` in the original implementation, but rsLoRA uses `lora_alpha/math.sqrt(r)`, which stabilizes the adapters and increases the performance potential of a higher `r`.
* **DoRA**: Splits the weight update into magnitude (a separate learnable parameter) and direction (handled by LoRA). It should be merged before inference for speed.
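
The initialization and scaling rules can be checked on a toy example. The sketch below is pure Python with made-up 2x2 weights, not PEFT's implementation: it shows that a zero-initialized B leaves the base output unchanged, and compares the standard and rank-stabilized scaling factors.

```python
import math

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_scale(lora_alpha, r, use_rslora=False):
    # standard LoRA divides by r, rsLoRA divides by sqrt(r)
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

def lora_forward(x, w, a, b, scale):
    """y = W x + scale * B (A x); W is frozen, A and B are trainable."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]

w = [[1.0, 2.0], [3.0, 4.0]]  # frozen 2x2 weight (made up)
a = [[0.5, -0.5]]             # r x d_in with r = 1 (made up)
b = [[0.0], [0.0]]            # d_out x r, zero init
x = [1.0, 1.0]

print(lora_forward(x, w, a, b, lora_scale(32, 8)))  # identical to matvec(w, x)
for r in (8, 64):
    print(r, lora_scale(32, r), lora_scale(32, r, use_rslora=True))
```

With `lora_alpha=32`, raising `r` from 8 to 64 shrinks the standard factor from 4.0 to 0.5, while the rank-stabilized factor only drops to 4.0.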


In [None]:
from transformers import AutoModelForSeq2SeqLM
from peft import LoraModel, LoraConfig

config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    r=8,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.01,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
lora_model = LoraModel(model, config, "default")

In [None]:
import torch
import transformers
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

rank = ...
target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out", "wte"]
config = LoraConfig(
    r=4, lora_alpha=16, target_modules=target_modules, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM"
)
quantization_config = transformers.BitsAndBytesConfig(load_in_8bit=True)

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "kakaobrain/kogpt",
    revision="KoGPT6B-ryan1.5b-float16",  # or float32 version: revision=KoGPT6B-ryan1.5b
    bos_token="[BOS]",
    eos_token="[EOS]",
    unk_token="[UNK]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
model = transformers.GPTJForCausalLM.from_pretrained(
    "kakaobrain/kogpt",
    revision="KoGPT6B-ryan1.5b-float16",  # or float32 version: revision=KoGPT6B-ryan1.5b
    pad_token_id=tokenizer.eos_token_id,
    use_cache=False,
    device_map={"": rank},
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
model = prepare_model_for_kbit_training(model)
lora_model = get_peft_model(model, config)

## QLoRA
* Use LoftQ initialization when training a quantized model with LoRA.
* The idea is that the LoRA weights are initialized such that the quantization error is minimized.
* For LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied.
* This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results.

In [None]:
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(...)  # don't quantize here
loftq_config = LoftQConfig(loftq_bits=4, ...)           # set 4bit quantization
lora_config = LoraConfig(target_modules="all-linear", init_lora_weights="loftq", loftq_config=loftq_config)
peft_model = get_peft_model(base_model, lora_config)
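
The idea of initializing the low-rank update to absorb quantization error can be sketched in pure Python. This is not PEFT's LoftQ code: it assumes a simple uniform 4-bit quantizer as a stand-in for the real one, a rank-1 update found by power iteration, and a single step rather than an alternating procedure. All helper names are hypothetical.

```python
import math
import random

def quantize_uniform(weights, bits=4):
    """Round every entry to one of 2**bits evenly spaced levels."""
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / (2 ** bits - 1)
    return [[lo + round((w - lo) / step) * step for w in row] for row in weights]

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))

def rank_one_fit(residual, iters=50):
    """Power iteration: returns (u, v) so that u v^T approximates the residual."""
    n_rows, n_cols = len(residual), len(residual[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(r * x for r, x in zip(row, v)) for row in residual]                  # u = R v
        w = [sum(residual[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = R^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(r * x for r, x in zip(row, v)) for row in residual]
    return u, v

random.seed(0)
weights = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]

quantized = quantize_uniform(weights)
residual = subtract(weights, quantized)         # quantization error W - Q
u, v = rank_one_fit(residual)
low_rank = [[ui * vj for vj in v] for ui in u]  # plays the role of the LoRA update

error_before = frobenius(residual)
error_after = frobenius(subtract(residual, low_rank))
print(error_before, error_after)                # the second number is smaller
```

Starting training from `quantized + low_rank` instead of `quantized` alone means the adapted model begins closer to the original full-precision weights.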

## Multiple Adapters
* `add_weighted_adapter()` combines several loaded adapters into a new adapter, weighting each adapter's contribution.

In [None]:
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)

peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
model = PeftModel.from_pretrained(base_model, peft_model_id, adapter_name="sft")

model.load_adapter("alignment-handbook/zephyr-7b-dpo-lora", adapter_name="dpo")

In [None]:
weighted_adapter_name = "sft-dpo"

model.add_weighted_adapter(
    adapters=["sft", "dpo"],
    weights=[0.7, 0.3],
    adapter_name=weighted_adapter_name,
    combination_type="linear"
)
model.set_adapter(weighted_adapter_name)

In [None]:
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
model = PeftModel.from_pretrained(base_model, peft_model_id)

# load different adapter
model.load_adapter("alignment-handbook/zephyr-7b-dpo-lora", adapter_name="dpo")

# set adapter as active
model.set_adapter("dpo")

# unload adapter
model.unload()

# delete adapter
model.delete_adapter("dpo")