<a href="https://colab.research.google.com/github/Kishore8949/LLMsApplications/blob/main/LLMs_with_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Low-Rank Adaptation (LoRA)**

This notebook shows how to apply low-rank adaptation (LoRA) to a model of your choice using the Parameter-Efficient Fine-Tuning (PEFT) library developed by Hugging Face.


In [2]:
!pip install peft==0.4.0



In [3]:
!pip install datasets



In [4]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
foundation_model = AutoModelForCausalLM.from_pretrained(model_name)

data = load_dataset("Abirate/english_quotes", cache_dir="../working/cache/datasets")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
train_sample = data["train"].select(range(50))
display(train_sample)



Dataset({
    features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
    num_rows: 50
})

**Define LoRA configurations**

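With LoRA, the pretrained attention weights stay frozen. The weight update (Weight_delta) is represented as the product of two small low-rank matrices, W_a and W_b, and only these two matrices are updated during fine-tuning.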
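To make the idea concrete, here is a minimal pure-Python sketch of a linear layer with a LoRA update. It is not PEFT's implementation: the tiny shapes, the helper names `matvec` and `lora_forward`, and the choice of which factor starts at zero are assumptions for illustration. The `alpha / r` scaling and the zero initialization of one factor follow the convention described by Hu et al. (2021).

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, W_a, W_b, x, alpha=1.0):
    """Return W x + (alpha / r) * W_b (W_a x); W is frozen, W_a and W_b are trainable."""
    r = len(W_a)                             # rank = number of rows in W_a
    scale = alpha / r
    base = matvec(W, x)                      # frozen pretrained path
    low_rank = matvec(W_b, matvec(W_a, x))   # trainable low-rank path
    return [b + scale * d for b, d in zip(base, low_rank)]

random.seed(0)
d_in, d_out, r = 4, 3, 1                     # illustrative sizes only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
W_a = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
W_b = [[0.0] * r for _ in range(d_out)]      # assumed zero init: the update starts as a no-op
x = [1.0, 2.0, 3.0, 4.0]

y_base = matvec(W, x)
y_lora = lora_forward(W, W_a, W_b, x)
trainable = r * (d_in + d_out)               # entries in W_a plus entries in W_b
frozen = d_in * d_out                        # entries in W

print("same output before training:", y_lora == y_base)
print("trainable entries:", trainable, "| frozen entries:", frozen)
```

Because one factor starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to move the two small matrices.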

You can treat r (rank) as a hyperparameter. Recall from the lecture that LoRA can perform well with very small ranks, according to Hu et al. (2021): GPT-3's validation accuracies across tasks are quite similar for ranks from 1 to 64. From PyTorch Lightning's documentation:



> A smaller r leads to a simpler low-rank matrix, which results in fewer parameters to learn during adaptation. This can lead to faster training and potentially reduced computational requirements. However, with a smaller r, the capacity of the low-rank matrix to capture task-specific information decreases. This may result in lower adaptation quality, and the model might not perform as well on the new task compared to a higher r.
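A quick calculation illustrates the trade-off described in the quote. The sketch below counts LoRA parameters for a single weight matrix at several ranks. The 1024 x 3072 shape and the helper name `lora_param_count` are illustrative assumptions, not values taken from PEFT.

```python
def lora_param_count(d_in, d_out, r):
    """Entries in the two low-rank factors: (r x d_in) plus (d_out x r)."""
    return r * (d_in + d_out)

d_in, d_out = 1024, 3072      # assumed shape of one projection matrix
full = d_in * d_out           # entries a full update of that matrix would train

for r in (1, 2, 4, 8, 16, 64):
    n = lora_param_count(d_in, d_out, r)
    print(f"r={r:>2}: {n:>7,} trainable entries ({100 * n / full:.2f}% of the full matrix)")
```

The count grows linearly with r, so doubling the rank doubles the number of trainable entries while the frozen matrix stays the same size.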


Other arguments:

lora_dropout:

*   Dropout is a regularization method that reduces overfitting by randomly and temporarily removing nodes during training.
*   It works like this:
    *   It can be applied to most types of layers (e.g. fully connected, convolutional, recurrent) and to larger networks.
    *   During each training cycle, nodes and their connections are temporarily and randomly removed.
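The dropout behaviour described above can be sketched in a few lines of plain Python. This is a generic "inverted dropout" illustration, not PEFT's code: the function name `dropout` and its `rng` parameter are assumptions. Surviving values are rescaled by 1 / (1 - p) during training so the expected activation is unchanged, and nothing is dropped at inference time.

```python
import random

def dropout(values, p=0.05, training=True, rng=random):
    """Zero each value with probability p during training and rescale the survivors."""
    if not training or p == 0.0:
        return list(values)          # inference: pass values through unchanged
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# A seeded generator makes the random mask reproducible between runs.
print("training :", dropout(activations, p=0.5, rng=random.Random(0)))
print("inference:", dropout(activations, p=0.5, training=False))
```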

target_modules:

*   Specifies the names of the modules to apply LoRA to.
*   This depends on how the foundation model names its attention weight matrices.
*   Typically, this can be:
    *   query, q, q_proj
    *   key, k, k_proj
    *   value, v, v_proj
    *   query_key_value

The easiest way to inspect the module/layer names is to print the model, for example with print(foundation_model).
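As an alternative to reading the printed model by eye, the module names can be summarized programmatically. The sketch below is a hypothetical helper (`leaf_name_counts` is not part of PEFT or Transformers) that counts the leaf names found directly under a given parent module. The example names are a hand-written excerpt modelled on BLOOM-style naming, not output captured from this notebook.

```python
from collections import Counter

def leaf_name_counts(module_names, scope="self_attention"):
    """Count the last path component of every module located directly under `scope`."""
    counts = Counter()
    for name in module_names:
        parts = name.split(".")
        if len(parts) >= 2 and parts[-2] == scope:
            counts[parts[-1]] += 1
    return counts

# Hypothetical excerpt of dotted module names (assumed, for illustration only).
example_names = [
    "transformer.h.0.self_attention",
    "transformer.h.0.self_attention.query_key_value",
    "transformer.h.0.self_attention.dense",
    "transformer.h.0.mlp.dense_h_to_4h",
    "transformer.h.1.self_attention",
    "transformer.h.1.self_attention.query_key_value",
    "transformer.h.1.self_attention.dense",
    "transformer.h.1.mlp.dense_h_to_4h",
]

counts = leaf_name_counts(example_names)
print(counts)
```

With a loaded model, the real names can be collected with `[name for name, _ in foundation_model.named_modules()]` and passed to the same helper.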





Step 1

Fill in r=1 and target_modules.

*   Note: For r, any positive integer is valid. The smaller r is, the fewer parameters there are to update during fine-tuning.
*   Hint: For target_modules, what's the name of the first module within each BloomBlock's self_attention?






In [5]:
import peft
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=1,
    lora_alpha=1,  # scaling factor: the low-rank update is multiplied by lora_alpha / r (set to 1 here)
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",  # which bias parameters to train: "none", "all", or "lora_only"
    task_type="CAUSAL_LM"
)

Step 2

Add the adapter layers to the foundation model so that it can be trained.

In [6]:
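peft_model = get_peft_model(foundation_model, lora_config)
# print_trainable_parameters() prints its summary itself and returns None,
# so it should not be wrapped in print().
peft_model.print_trainable_parameters()

trainable params: 98,304 || all params: 559,312,896 || trainable%: 0.01757585078102687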
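The reported number of trainable parameters can be reproduced by hand. The sketch below assumes that bloomz-560m has a hidden size of 1024 and 24 transformer blocks, and that each block's query_key_value layer maps 1024 inputs to 3 x 1024 outputs. These architecture values are assumptions not stated in this notebook; the total parameter count is copied from the output above.

```python
hidden_size = 1024        # assumed hidden size of bloomz-560m
num_layers = 24           # assumed number of transformer blocks
r = 1                     # rank used in lora_config

d_in = hidden_size        # input width of query_key_value
d_out = 3 * hidden_size   # fused query, key and value projections

per_layer = r * (d_in + d_out)       # entries in the two low-rank factors
trainable = per_layer * num_layers   # one adapter per block
all_params = 559_312_896             # total reported by print_trainable_parameters()

print(f"per layer: {per_layer:,}")
print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / all_params}")
```

Under these assumptions the arithmetic matches the 98,304 trainable parameters reported by PEFT, which is about 0.0176% of the model.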


**Define Trainer class for fine-tuning**

Step 3

Fill out the Trainer class. Feel free to tweak the training_args provided, but remember that increasing the number of epochs increases training time, and a lower learning rate usually needs more epochs to reach the same loss. With the defaults we set below, fine-tuning could take ~15 mins depending on your hardware; the run recorded below finished in about 20 seconds.

In [7]:
import transformers
from transformers import TrainingArguments, Trainer
import os

output_directory = os.path.join("../cache/working", "peft_lab_outputs")
training_args = TrainingArguments(
    report_to="none",
    output_dir=output_directory,
    auto_find_batch_size=True,
    learning_rate=3e-2,  # Higher learning rate than full fine-tuning.
    num_train_epochs=5,
    no_cuda=False
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_sample,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()


TrainOutput(global_step=65, training_loss=6.5061697152944715, metrics={'train_runtime': 20.4424, 'train_samples_per_second': 12.229, 'train_steps_per_second': 3.18, 'total_flos': 41838064386048.0, 'train_loss': 6.5061697152944715, 'epoch': 5.0})
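The numbers in this TrainOutput can be cross-checked with a little arithmetic. The sketch below uses only values from the notebook (50 training samples, 5 epochs, 65 steps, runtime of 20.4424 s). The inferred batch size is an assumption: it presumes one device and no gradient accumulation, neither of which is stated here.

```python
import math

num_samples = 50          # rows selected for train_sample
num_epochs = 5            # num_train_epochs
global_step = 65          # from TrainOutput
train_runtime = 20.4424   # seconds, from TrainOutput

steps_per_epoch = global_step // num_epochs

# Batch sizes b for which ceil(num_samples / b) gives the observed steps per epoch.
candidate_batch_sizes = [
    b for b in range(1, num_samples + 1)
    if math.ceil(num_samples / b) == steps_per_epoch
]

samples_per_second = num_samples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime

print("steps per epoch:", steps_per_epoch)
print("batch sizes consistent with that:", candidate_batch_sizes)
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.2f}")
```

The throughput figures agree with the reported metrics, and the step count is consistent with the batch size chosen by auto_find_batch_size being smaller than the Trainer's default.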

**Load model**

Step 4

Save the trained adapter, then load the PEFT model from the saved LoRA configuration and the foundation model. We set is_trainable=False to avoid further training.

In [8]:
import time

time_now = time.time()

peft_model_path = os.path.join(output_directory, f"peft_model_{time_now}")

trainer.model.save_pretrained(peft_model_path)

In [9]:
from peft import PeftModel, PeftConfig

loaded_model = PeftModel.from_pretrained(foundation_model, peft_model_path,
                                        is_trainable=False)

**Inference**

Step 5

Generate output tokens for the same input we provided in the demo notebook before. How do the outputs compare?

In [11]:
inputs = tokenizer("Two things are infinite: ", return_tensors="pt")
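# Move the inputs to whichever device the model is on instead of hard-coding 'cuda'.
inputs = inputs.to(next(peft_model.parameters()).device)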
outputs = peft_model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=10,
    eos_token_id=tokenizer.eos_token_id
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['Two things are infinite: “ is””””””””']
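One way to approach the comparison question is to generate from the same model with the adapter enabled and then temporarily disabled. The helper below is a sketch, and its name `compare_adapter_outputs` is hypothetical. It assumes the model is a PEFT model exposing the `disable_adapter()` context manager, and that the tokenizer behaves like the Hugging Face tokenizer used above.

```python
def compare_adapter_outputs(model, tokenizer, prompt, max_new_tokens=10):
    """Generate from `prompt` twice: with the LoRA adapter active, then with it disabled."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Put the inputs on the same device as the model's weights.
    inputs = inputs.to(next(model.parameters()).device)

    generate_kwargs = dict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )

    with_adapter = model.generate(**generate_kwargs)
    # Inside this block the adapter is bypassed, so the frozen base weights are used.
    with model.disable_adapter():
        without_adapter = model.generate(**generate_kwargs)

    return {
        "with_adapter": tokenizer.batch_decode(with_adapter, skip_special_tokens=True),
        "without_adapter": tokenizer.batch_decode(without_adapter, skip_special_tokens=True),
    }
```

In this notebook it could be called as compare_adapter_outputs(peft_model, tokenizer, "Two things are infinite: "), which returns both decoded continuations for a side-by-side look.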
