## Install requirements
First, run the cells below to install the requirements:

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git

## Model loading
Let's load the bloomz-3B model!

We're also going to load the tokenizer from the same bigscience/bloomz-3b checkpoint; all of the BLOOM models share this tokenizer.

This step will take some time, as we have to download the model weights, which are >6GB.
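
As a rough sanity check on that download size, here is a back-of-the-envelope sketch of the weight footprint at different precisions. The figure of about 3 billion parameters is an assumption read off the model name, and the estimate covers weights only (no activations, gradients, or optimizer state).

```python
def weight_footprint_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 3 billion parameters, taken from the "3b" in the model name.
APPROX_PARAMS = 3_000_000_000

for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    size_gb = weight_footprint_gb(APPROX_PARAMS, bytes_per_param)
    print(f"{label}: ~{size_gb:.1f} GB of weights")
```

Under these assumptions, 16-bit weights land at about the download size quoted above, and 8-bit loading roughly halves what has to sit in GPU memory.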

In [None]:
import torch
torch.cuda.is_available()

In [None]:
import os
# expose only GPU 0 to this process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-3b", 
    torch_dtype=torch.float16,
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")

## Model Architecture
It's important to inspect the model's structure so you know which modules to apply LoRA to.

As per the paper, we're going to focus on the attention weights - so keep an eye out for modules like: q_proj, v_proj, query_key_value. The exact names are model dependent.

In [None]:
print(model)
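
The printed model can be long, so it may help to summarize which module names repeat across layers. The helper below is a hypothetical convenience (not part of peft or transformers) that counts the last component of each dotted module name; the example names are illustrative only.

```python
from collections import Counter
from typing import Iterable


def leaf_module_names(qualified_names: Iterable[str]) -> Counter:
    """Count how often each final name component appears (empty names are skipped)."""
    return Counter(name.rsplit(".", 1)[-1] for name in qualified_names if name)


# Illustrative names only; in the notebook you would pass the real ones instead:
#   leaf_module_names(name for name, _ in model.named_modules())
example_names = [
    "transformer.h.0.self_attention.query_key_value",
    "transformer.h.0.self_attention.dense",
    "transformer.h.1.self_attention.query_key_value",
    "transformer.h.1.self_attention.dense",
]

for leaf, count in leaf_module_names(example_names).most_common():
    print(f"{leaf}: {count}")
```

Names that appear once per layer and sit inside the attention block are the usual candidates for target_modules.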

## Post-processing on the model
Finally, we need to apply some post-processing to the 8-bit model to enable training. Let's freeze all our layers and cast the layer-norm parameters to float32 for stability. We also cast the output of the last layer to float32 for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  # wrap the LM head so its output is returned in float32
  def forward(self, x):
    return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

## Apply LoRA
Here comes the magic with peft! Let's wrap our model in a PeftModel and specify that we are going to use low-rank adapters (LoRA), using the get_peft_model utility function from peft.

## Helper Function to Print Parameter %age
This is just a helper function to print out just how much LoRA reduces the number of trainable parameters.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

## Initializing LoRA Config
There's a lot to unpack here - so let's talk about the main parameters:

`r:` is the "rank" of the two low-rank matrices we'll use to represent the update to our weight matrix. In practice, it is the inner dimension shared by those two matrices: a smaller r means fewer trainable parameters.

`target_modules:` As LoRA can be applied to any weight matrix - we need to configure which modules (weight matrices) it's being applied to. The paper suggests applying it to the attention weights, and so we're doing that here. Be mindful that, while BLOOMZ's attention weight modules are called query_key_value - other models will name them with different conventions. Please ensure you look at your model's architecture and select the appropriate module.

`task_type:` This follows from the kind of model you're using. If you're using a causal language model, this should be set to CAUSAL_LM. Please ensure this property is set based on your selected model.

Again, while this is the way we're leveraging LoRA in this notebook - it can be used in conjunction with many different models - and many different tasks. You can even use it for tasks like [token classification](https://huggingface.co/docs/peft/task_guides/token-classification-lora)!
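
To make the effect of r concrete, here is a small arithmetic sketch of how many parameters one LoRA adapter adds. The hidden size and layer count below are assumptions about bloomz-3b used for illustration; check them against model.config before relying on the numbers. The scaling line assumes the usual LoRA convention of multiplying the low-rank update by lora_alpha / r.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r


# Assumptions for illustration only (verify against model.config):
HIDDEN_SIZE = 2560   # assumed hidden size
NUM_LAYERS = 30      # assumed number of transformer blocks
R = 16               # same value as in the LoraConfig in this notebook
LORA_ALPHA = 32      # same value as in the LoraConfig in this notebook

# query_key_value projects the hidden state to Q, K and V at once: d_out = 3 * d_in.
per_layer = lora_param_count(HIDDEN_SIZE, 3 * HIDDEN_SIZE, R)
total = per_layer * NUM_LAYERS
frozen = HIDDEN_SIZE * 3 * HIDDEN_SIZE * NUM_LAYERS  # weights the adapters sit beside

print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params in total:  {total:,}")
print(f"Share of the targeted weights: {100 * total / frozen:.2f}%")
print(f"Update scaling (lora_alpha / r): {LORA_ALPHA / R}")
```

Halving r halves the adapter size, which is why reducing it is one of the levers for memory pressure.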

In [None]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

## Preprocessing
We can simply load our dataset from the 🤗 Hugging Face Hub with the load_dataset function!

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import transformers
from datasets import load_dataset

HUGGING_FACE_USER_NAME = "your-user-name"
dataset_name = "your-data-set"
# build the full repo id; the original referenced an undefined `user_name`
dataset_name = f"{HUGGING_FACE_USER_NAME}/{dataset_name}"
product_name = "product"
product_desc = "description"
product_ad = "automating_tech_blog"

In [None]:
dataset = load_dataset(dataset_name)
print(dataset)
print(dataset['train'][0])

Because we're using BLOOMZ (which is an instruction-tuned version of BLOOM), we should see better results when we provide an instruction - though that is not strictly necessary.

In [None]:
def generate_prompt(summary: str, post: str) -> str:
  prompt = f"### INSTRUCTION\nBelow is a summary of a blog post and its corresponding social media post. Please write a social media post for this blog.\n\n### Summary:\n{summary}\n### Post:\n{post}\n"
  return prompt

mapped_dataset = dataset.map(lambda samples: tokenizer(generate_prompt(samples['summary'], samples['post'])))
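
At inference time the post is what we want the model to write, so the prompt should stop right after the "### Post:" header. A minimal sketch of that idea follows; make_inference_prompt is a hypothetical helper, and demo_prompt is a shortened stand-in for the notebook's generate_prompt so the example runs on its own.

```python
from typing import Callable


def make_inference_prompt(build_prompt: Callable[[str, str], str], summary: str) -> str:
    """Build the training-style prompt with an empty post, ending right after the last header."""
    # With an empty post the template ends in extra newlines; keep exactly one.
    return build_prompt(summary, "").rstrip("\n") + "\n"


def demo_prompt(summary: str, post: str) -> str:
    # Shortened stand-in for generate_prompt, used only to keep this sketch self-contained.
    return f"### Summary:\n{summary}\n### Post:\n{post}\n"


inference_prompt = make_inference_prompt(demo_prompt, "LoRA makes fine-tuning cheaper.")
print(inference_prompt)
```

In the notebook you would pass generate_prompt itself as the first argument, so training and inference prompts share one template.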


The Trainer class contains all the usual suspects - these are the same hyper-parameters you know and love from traditional ML applications!

If you're running into CUDA memory issues - please lower per_device_train_batch_size and reduce r in your LoraConfig. You'll need to restart and re-run your notebook after doing so.
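
When lowering per_device_train_batch_size, one option is to raise gradient_accumulation_steps so the effective batch size stays the same. The sketch below just does that arithmetic; the reduced values (2 and 12) are illustrative assumptions, and a single GPU is assumed.

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of sequences that contribute to each optimizer step."""
    return per_device * grad_accum_steps * num_devices


MAX_STEPS = 100  # same value as max_steps in this notebook

original = effective_batch_size(per_device=6, grad_accum_steps=4)
reduced = effective_batch_size(per_device=2, grad_accum_steps=12)  # illustrative values

print(f"original effective batch size: {original}")
print(f"reduced-memory effective batch size: {reduced}")
print(f"sequences seen over {MAX_STEPS} steps: {original * MAX_STEPS}")
```

Smaller per-device batches with more accumulation steps trade wall-clock time for peak memory while leaving the optimization schedule unchanged.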

In [None]:
trainer = transformers.Trainer(
    model=model, 
    train_dataset=mapped_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=6, 
        gradient_accumulation_steps=4,
        warmup_steps=100,  # equal to max_steps below, so the learning rate is still warming up for the whole run
        max_steps=100, 
        learning_rate=1e-3, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

## Share adapters on the 🤗 Hub
Normally, we would push only the LoRA adapters to the Hub. This is a lightweight way to share the model, since the base model can be pulled down separately as part of the inference pipeline.

However, if you want to leverage the one-click-deploy features of Hugging Face, you'll need to first merge_and_unload() the model and push the resulting model to the hub. This process will merge the LoRA weights back into the base model.

Please note that if you leveraged bitsandbytes to load the model in 8-bit - you cannot merge the weights into the base model at this step.
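
One way around the 8-bit limitation is to reload the base model in half precision, attach the pushed adapters, and merge then. The function below is a sketch under that assumption; merge_adapter_into_base and the repo ids in the usage note are hypothetical, and the function downloads the full base model when called.

```python
def merge_adapter_into_base(base_model_id: str, adapter_repo_id: str, merged_repo_id: str) -> None:
    """Reload the base model without 8-bit quantization, merge the LoRA weights, and push."""
    # Imports are kept inside the function so that defining it stays cheap.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the base weights in fp16 (not 8-bit) so the adapter can be folded in.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)

    # Attach the trained adapters that were pushed to the Hub.
    peft_model = PeftModel.from_pretrained(base, adapter_repo_id)

    # Fold the low-rank updates into the base weights and drop the adapter wrappers.
    merged = peft_model.merge_and_unload()

    # Push the standalone merged model.
    merged.push_to_hub(merged_repo_id)
```

A call might look like merge_adapter_into_base("bigscience/bloomz-3b", "your-user-name/your-adapter", "your-user-name/your-merged-model"), run in a fresh session after the adapters have been pushed.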

In [None]:
model_name = "YOUR MODEL NAME HERE"

model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", use_auth_token=True)
