<a href="https://www.kaggle.com/code/aisuko/producing-adapter-with-vera?scriptVersionId=185192906" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

VeRA (Vector-based Random Matrix Adaptation) fine-tunes small vectors on top of random matrices. Like LoRA, VeRA uses two low-rank matrices, A and B, but they are frozen, randomly initialized, and shared across layers. The trainable parameters are two vectors, d and b, which scale the outputs of A and B, respectively. Unlike the matrices, d and b are not shared across layers. Because VeRA trains only these two vectors per adapted module, it has significantly fewer trainable parameters than LoRA at the same rank.
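The update described above can be written as `h = W0 @ x + b * (B @ (d * (A @ x)))`. The snippet below is a minimal NumPy sketch of that formula, not the PEFT implementation: the toy dimensions, the Gaussian initialization of `A` and `B`, and the starting values of `d` (0.1) and `b` (zeros) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r = 8, 6, 4  # toy sizes, chosen only for illustration

# Frozen pretrained weight (random stand-in values).
W0 = rng.standard_normal((out_features, in_features))

# Frozen random projections; in VeRA these are shared by every adapted layer.
A = rng.standard_normal((r, in_features))
B = rng.standard_normal((out_features, r))

# Per-layer trainable scaling vectors (assumed initial values).
d = np.full(r, 0.1)         # scales the output of A
b = np.zeros(out_features)  # scales the output of B; zeros leave the base model unchanged


def vera_forward(x, W0, A, B, d, b):
    """Return W0 @ x + b * (B @ (d * (A @ x)))."""
    return W0 @ x + b * (B @ (d * (A @ x)))


x = rng.standard_normal(in_features)
h = vera_forward(x, W0, A, B, d, b)

trainable = d.size + b.size      # what training would update in this layer
frozen_shared = A.size + B.size  # frozen and reused across layers
print(f"trainable per layer: {trainable}, frozen shared: {frozen_shared}")
```

With `b` at zero the adapter contributes nothing, so `h` equals `W0 @ x` until training moves `b` away from zero.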

VeRA is implemented in Hugging Face PEFT. The implementation has the following limitations (as of June 8th, 2024):
* It doesn't support VeRA over a quantized model; it can only target modules that use `nn.Linear`.
* The targeted modules must all have the same shape.
* VeRA produces a larger adapter than LoRA, because PEFT saves the random matrices in addition to the fine-tuned vectors. Saving them guarantees that the fine-tuned adapter is portable to other hardware/software configurations.
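The same-shape requirement is easy to check before training. The sketch below uses a hypothetical helper, `share_one_shape`, which is not part of PEFT. The shape table is copied by hand from the Llama 3 8B layer printout that appears later in this notebook.

```python
# (in_features, out_features) of the Linear modules in one Llama 3 8B decoder layer,
# copied from the model printout in this notebook.
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def share_one_shape(target_modules, shapes=LINEAR_SHAPES):
    """Hypothetical helper: True when every targeted module has the same shape."""
    return len({shapes[name] for name in target_modules}) == 1


print(share_one_shape(["gate_proj", "up_proj"]))  # the pair targeted in this notebook
print(share_one_shape(["q_proj", "k_proj"]))      # different output sizes
```

Under this check, `gate_proj` and `up_proj` form a valid target set, while mixing `q_proj` with `k_proj` would not.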

In [1]:
!pip install -U -q transformers==4.39.3
!pip install -U -q accelerate==0.28.0
!pip install -U -q datasets==2.18.0
!pip install -U -q peft==0.11.1
!pip install -U -q bitsandbytes==0.43.1
!pip install -U -q trl==0.9.4

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning Llama 3 8B with vera"
os.environ["WANDB_NAME"] = "ft-Llama3-8b-vera"
os.environ["MODEL_NAME"] = "meta-llama/Meta-Llama-3-8B"
os.environ["DATASET"] = "timdettmers/openassistant-guanaco"

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `meta-llama/Meta-Llama-3-8B` from `transformers`...
┌──────────────────────────────────────────────────────┐
│Memory Usage for loading `meta-llama/Meta-Llama-3-8B` │
├───────┬─────────────┬──────────┬─────────────────────┤
│ dtype │Largest Layer│Total Size│ Training using Adam │
├───────┼─────────────┼──────────┼─────────────────────┤
│float32│   1.96 GB   │ 28.21 GB │      112.83 GB      │
│float16│  1002.0 MB  │ 14.1 GB  │       56.42 GB      │
│  int8 │   501.0 MB  │ 7.05 GB  │       28.21 GB      │
│  int4 │   250.5 MB  │ 3.53 GB  │       14.1 GB       │
└───────┴─────────────┴──────────┴─────────────────────┘


In [4]:
import warnings

warnings.filterwarnings("ignore")

In [5]:
import torch

compute_dtype=torch.float16

if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    
print(compute_dtype)

torch.float16


In [6]:
from transformers import AutoTokenizer


tokenizer=AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
tokenizer.pad_token="<|eot_id|>"
tokenizer.pad_token_id=128009
tokenizer.padding_side="left"

# Loading data

In [8]:
from datasets import load_dataset

# load the dataset
ds=load_dataset(os.getenv("DATASET"))
ds

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

## Subsampling to fit a low-memory GPU

In [9]:
train_ds=ds["train"].shuffle(seed=42).select(range(300))
eval_ds=ds["test"].shuffle(seed=42).select(range(200))

print(train_ds)
print(eval_ds)

Dataset({
    features: ['text'],
    num_rows: 300
})
Dataset({
    features: ['text'],
    num_rows: 200
})


In [10]:
import multiprocessing

# add EOS token
def pre_process(x):
    x["text"]=x["text"]+"<|end_of_text|>"
    return x

# ds=ds.map(pre_process, num_proc=multiprocessing.cpu_count(), load_from_cache_file=False)
# ds

train_ds=train_ds.map(pre_process, num_proc=multiprocessing.cpu_count(), load_from_cache_file=False)
eval_ds=eval_ds.map(pre_process, num_proc=multiprocessing.cpu_count(), load_from_cache_file=False)

Map (num_proc=4):   0%|          | 0/300 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/200 [00:00<?, ? examples/s]

# Loading model

In [11]:
from transformers import AutoModelForCausalLM

model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"), 
    device_map={"":0},
    torch_dtype=compute_dtype
    # attn_implementation
)
model

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head)

In [12]:
model.device

device(type='cuda', index=0)

In [13]:
def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")

print_trainable_parameters(model)

trainable params: 8030261248 || all params: 8030261248 || trainable%: 100.00
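Before any adapter is attached, all 8B parameters are trainable. As a back-of-the-envelope check, the sketch below counts what remains trainable once VeRA is applied. It assumes the configuration used in the training cell of this notebook (`r=512`, targeting `gate_proj` and `up_proj`) and the layer shapes printed above. The LoRA figures are included only for comparison and assume LoRA on the same two modules.

```python
# Shapes from the printed architecture: gate_proj and up_proj map 4096 -> 14336.
in_features, out_features = 4096, 14336
num_layers = 32
targets_per_layer = 2  # gate_proj and up_proj
r = 512                # rank set in the VeraConfig of this notebook

# VeRA trains one vector d (length r) and one vector b (length out_features) per module.
vera_total = (r + out_features) * targets_per_layer * num_layers


def lora_trainable(rank):
    """LoRA trains A (rank x in_features) and B (out_features x rank) per module."""
    return rank * (in_features + out_features) * targets_per_layer * num_layers


print(f"VeRA trainable (r={r}): {vera_total:,}")
print(f"LoRA trainable (r={r}): {lora_trainable(r):,}")
print(f"LoRA trainable (r=16):  {lora_trainable(16):,}")
```

The VeRA total comes to 950,272, which matches the "Number of trainable parameters" line that the trainer logs in the training cell.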


In [14]:
model.gradient_checkpointing_enable()

In [18]:
from trl import SFTTrainer, SFTConfig
from peft import VeraConfig

peft_config=VeraConfig(
    vera_dropout=0.05,
    r=512,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["gate_proj","up_proj"]
)

training_arguments=SFTConfig(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=2,
    log_level="debug",
    save_strategy="epoch",
    logging_steps=100,
    learning_rate=1e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    eval_steps=100,
    num_train_epochs=1,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    report_to="wandb",
    run_name=os.getenv('WANDB_NAME')
)

trainer=SFTTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments
)

trainer.train()

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 2
***** Running training *****
  Num examples = 300
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 16
  Total optimization steps = 9
  Number of trainable parameters = 950,272
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


OutOfMemoryError: CUDA out of memory. Tried to allocate 186.00 MiB. GPU 0 has a total capacty of 15.89 GiB of which 180.12 MiB is free. Process 11429 has 15.72 GiB memory in use. Of the allocated memory 15.34 GiB is allocated by PyTorch, and 90.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
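A rough estimate suggests why this run goes out of memory. The sketch below takes the parameter count printed earlier and the GPU capacity from the error message, and computes the size of the half-precision weights alone. It deliberately ignores activations, gradients, optimizer state, and the CUDA context, so the real headroom is smaller than the figure it prints.

```python
num_params = 8_030_261_248  # printed by print_trainable_parameters
bytes_per_param = 2         # float16 or bfloat16
gpu_total_gib = 15.89       # total capacity reported in the error message

# Memory taken by the model weights alone, in GiB.
weights_gib = num_params * bytes_per_param / 2**30
headroom_gib = gpu_total_gib - weights_gib

print(f"weights: {weights_gib:.2f} GiB, headroom: {headroom_gib:.2f} GiB")
```

The weights take roughly 15 GiB of the 15.89 GiB available, which leaves less than 1 GiB for everything else. Since this PEFT version does not support VeRA over a quantized model, loading the base model in 4-bit is not an option here.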

In [None]:
kwargs={
    'model_name': os.getenv("WANDB_NAME"),
    'finetuned_from': os.getenv('MODEL_NAME'),
#     'tasks': 'Text-Generation',
#     'dataset_tags':'',
    'dataset': os.getenv("DATASET")
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

# Credit
* https://towardsdatascience.com/fine-tune-tiny-adapters-for-llama-3-with-vera-7c48f4391d84
* https://arxiv.org/abs/2310.11454
* https://www.kaggle.com/code/aisuko/fine-tune-llama3-with-orpo
* https://huggingface.co/docs/peft/v0.11.0/package_reference/vera