# Outline
- Fine-tune a lightweight LLM (OPT-125M) with LoRA and 8-bit quantization using Launch
- Checkpoint the LoRA adapter weights as artifacts
- Link the best checkpoint in Model Registry
- Linking triggers an automation in GitHub Actions or Modal to quantize the model for inference
  - The CI job should also profile the model and generate a W&B report
- Add the `production` alias to the registered model version
- The alias triggers an automation to deploy the model as a FastAPI inference server in Modal

Stretch: interact with the model and log inputs/outputs with W&B Prompts

## Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we cover how to fine-tune large language models using the `peft` library together with `bitsandbytes`, which loads large models in 8-bit.
The fine-tuning relies on a method called Low-Rank Adaptation (LoRA): instead of fine-tuning the entire model, you train only a small set of adapter weights and load them into the frozen base model.
After fine-tuning, you can also share your adapters on the 🤗 Hub and load them easily. Let's get started!
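
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-augmented linear layer. It is not how `peft` implements it; it only assumes the usual formulation, in which the frozen weight `W` stays untouched and a trainable low-rank product `B A`, scaled by `alpha / r`, is added to its output. The helper names (`matvec`, `lora_forward`) and the toy values are made up for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy example: d_in = d_out = 3, rank r = 1 (hypothetical values)
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0]]                # shape (r, d_in)
B_zero = [[0.0], [0.0], [0.0]]       # shape (d_out, r), zero-initialised
B_trained = [[0.5], [0.0], [-0.5]]   # stand-in for values after training
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_zero, x, alpha=2, r=1)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_start)  # identical to the frozen layer's output W x
print(y_tuned)  # shifted by the scaled low-rank update
```

With `B` at zero the adapter is a no-op, so training starts from the base model's behaviour, and only `A` and `B` need gradients.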

**TODO:** Turn this section of code into a launch job

### Install requirements

First, run the cells below to install the requirements:

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
!pip install -q wandb
!pip install -q ctranslate2


In [None]:
import os
os.environ['WANDB_BASE_URL'] = "https://staging-aws.wandb.io/"
os.environ['WANDB_API_KEY'] = ""

In [None]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mkenleewb[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Model loading

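Here we load the `opt-125m` model. Its weights in half-precision (float16) are about 250MB on the Hub, and loading them in 8-bit needs roughly half of that. The savings matter far more for its larger siblings: the `opt-6.7b` weights are about 13GB in float16, and loading them in 8-bit would require around 7GB of memory instead.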
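
The memory figures follow from simple arithmetic: number of parameters times bytes per parameter. The sketch below is a rough estimate only. It assumes round parameter counts taken from the model names and treats every weight as quantized, whereas 8-bit loading typically leaves some tensors (for example embeddings and layer norms) in higher precision. The function name is hypothetical.

```python
def weight_memory_bytes(num_params, bytes_per_param):
    """Rough size of the weights alone (no activations, no optimizer state)."""
    return num_params * bytes_per_param

# Approximate parameter counts, assumed from the model names
models = {"opt-125m": 125_000_000, "opt-6.7b": 6_700_000_000}

for name, num_params in models.items():
    fp16_gb = weight_memory_bytes(num_params, 2) / 1e9  # float16: 2 bytes each
    int8_gb = weight_memory_bytes(num_params, 1) / 1e9  # int8: 1 byte each
    print(f"{name}: float16 ~{fp16_gb:.2f} GB, int8 ~{int8_gb:.2f} GB")
```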

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

project_name = "model-registry-walkthrough" #@param
entity = "smle-machine" #@param

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")


### Post-processing on the model

Next, we post-process the 8-bit model to enable training. We freeze all layers and cast the small 1-D parameters (such as the layer-norm weights) to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

### Apply LoRA

Here comes the magic with `peft`! We wrap the model in a `PeftModel` using the `get_peft_model` utility function from `peft`, passing a `LoraConfig` that specifies low-rank adapters (LoRA).

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 589824 || all params: 125829120 || trainable%: 0.46875
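
The trainable-parameter count can be reproduced by hand. The sketch below assumes each adapted linear layer gains an `r x d_in` matrix and a `d_out x r` matrix with no bias; the hidden size (768) and layer count (12) are read from the model printout, and the function name is made up for illustration.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

hidden_size = 768                      # in_features/out_features of q_proj and v_proj
num_layers = 12                        # OPT-125M decoder layers
target_modules = ["q_proj", "v_proj"]  # from the LoraConfig above
r = 16

per_module = lora_param_count(hidden_size, hidden_size, r)
trainable = per_module * len(target_modules) * num_layers

all_params = 125_829_120               # "all params" as reported in the output above
print(per_module, trainable, 100 * trainable / all_params)
```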


In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 768, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-11): 12 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
                (v_proj): Linear8bitLt(
                  in_features=768, out_features=768, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=16, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Lin

# Log Checkpoints Automatically with Hugging Face

You can log your Hugging Face model to W&B Artifacts by setting the `WANDB_LOG_MODEL` environment variable:
- `WANDB_LOG_MODEL='end'` logs only the final model
- `WANDB_LOG_MODEL='checkpoint'` logs a model checkpoint every `save_steps` steps, as set in the `TrainingArguments`
- Optionally, use the W&B Artifacts API to implement your own checkpointing logic with Hugging Face callbacks

See more details on our Hugging Face integration [here](https://docs.wandb.ai/guides/integrations/huggingface).

In [None]:
import transformers
from datasets import load_dataset
import wandb

os.environ["WANDB_LOG_MODEL"] = "checkpoint"

wandb.init(project=project_name,
           entity=entity,
           job_type="training")

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        report_to="wandb",
        warmup_steps=5,
        max_steps=25,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        save_steps=5,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
wandb.finish()



| Step | Training Loss |
|------|---------------|
| 1 | 2.8079 |
| 2 | 2.7583 |
| 3 | 2.9862 |
| 4 | 2.9198 |
| 5 | 2.9172 |
| 6 | 3.1136 |
| 7 | 3.1315 |
| 8 | 2.9097 |
| 9 | 3.1321 |
| 10 | 2.6756 |


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-5)... Done. 0.0s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-10)... Done. 0.0s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-15)... Done. 0.0s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-20)... Done. 0.0s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-25)... Done. 0.0s


| Metric | History |
|--------|---------|
| train/epoch | ▁▁▁▂▂▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇▇████ |
| train/global_step | ▁▁▂▂▂▂▃▃▃▄▄▄▅▅▅▅▆▆▆▇▇▇▇███ |
| train/learning_rate | ▂▄▅▇██▇▇▇▆▆▆▅▅▅▄▄▃▃▃▂▂▂▁▁ |
| train/loss | ▂▂▄▃▃▅▅▃▅▁▁█▃▃▄▄▅▄▄▄▂▇▅▆█ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |
| train/train_runtime | ▁ |
| train/train_samples_per_second | ▁ |
| train/train_steps_per_second | ▁ |

| Metric | Final value |
|--------|-------------|
| train/epoch | 0.16 |
| train/global_step | 25.0 |
| train/learning_rate | 0.0 |
| train/loss | 3.433 |
| train/total_flos | 17751296065536.0 |
| train/train_loss | 3.0133 |
| train/train_runtime | 28.5895 |
| train/train_samples_per_second | 13.991 |
| train/train_steps_per_second | 0.874 |
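
The logged checkpoints and the throughput figures can be cross-checked against the `TrainingArguments`. This is a back-of-the-envelope sketch that assumes a single GPU (`num_devices = 1`, which the source does not state explicitly beyond `CUDA_VISIBLE_DEVICES="0"`); the variable names mirror the arguments used above.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 25
save_steps = 5
train_runtime = 28.5895   # seconds, from the run summary
num_devices = 1           # assumption: one visible GPU

# One optimizer step consumes batch_size * accumulation * devices samples
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
total_samples = samples_per_step * max_steps

# With WANDB_LOG_MODEL="checkpoint", one artifact version is logged per saved checkpoint
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))

samples_per_second = total_samples / train_runtime
steps_per_second = max_steps / train_runtime

print(checkpoint_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The five checkpoint steps match the `checkpoint-5` through `checkpoint-25` directories added to the artifact, and the rounded rates agree with `train/train_samples_per_second` and `train/train_steps_per_second`.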


### Adding Model Weights to Model Registry

In [None]:
last_run_id = "gox8phgt" #@param
wandb.init(project=project_name, entity=entity, job_type="registering_best_model")
# ':latest' resolves to the most recent checkpoint logged by that run,
# which is not necessarily the one with the lowest loss
best_model = wandb.use_artifact(f'{entity}/{project_name}/checkpoint-{last_run_id}:latest')
# Must match the registered model name consumed in the conversion step below
registered_model_name = "OPT-125M" #@param {type: "string"}
wandb.run.link_artifact(best_model, f'{entity}/model-registry/{registered_model_name}', aliases=['staging'])
wandb.finish()




## Consuming Model From Registry and Converting using ctranslate2
- **TODO:** Turn this section of code into an automated CI job
- **TODO:** Also run inference against a test set and generate a report with tables

In [None]:
wandb.init(project=project_name, entity=entity, job_type="ctranslate2")
best_model = wandb.use_artifact(f'{entity}/model-registry/OPT-125M:latest')
best_model.download(root='model-registry/OPT-125M:latest')
wandb.finish()

In [None]:
import subprocess
from peft import PeftModel, PeftConfig

def convert_qlora2ct2(adapter_path='model-registry/OPT-125M:latest',
                      full_model_path="opt125m-finetuned",
                      offload_path="opt125m-offload",
                      ct2_path="opt125m-finetuned-ct2",
                      quantization="int8"):
    """Merge the LoRA adapter into its base model and convert the result to CTranslate2."""
    # The adapter config records which base model the adapter was trained on
    peft_config = PeftConfig.from_pretrained(adapter_path)
    base_model_name = peft_config.base_model_name_or_path

    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        offload_folder=offload_path,
        device_map='auto',
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    model = PeftModel.from_pretrained(model, adapter_path)
    print("Peft model loaded")

    # Fold the LoRA weights into the base weights so the result is a plain transformers model
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained(full_model_path)
    tokenizer.save_pretrained(full_model_path)

    cmd = ["ct2-transformers-converter", "--model", full_model_path,
           "--output_dir", ct2_path, "--force"]
    if quantization:  # pass None or False to skip quantization
        cmd += ["--quantization", quantization]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the converter fails
    print("Converted successfully")

In [None]:
convert_qlora2ct2(adapter_path='model-registry/OPT-125M:latest')
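
For intuition about what `int8` quantization does to the weights, here is a generic symmetric-quantization sketch in pure Python. It is an illustration only: CTranslate2's actual scheme (for example, how it chooses scales) may differ, and the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric int8 quantization using one scale for the whole list."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map the stored integers back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # made-up sample weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)  # error is bounded by half a quantization step
```

Each weight is stored in one byte plus a shared scale, at the cost of a rounding error of at most half a step.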

In [None]:
!ls .

## Run Inference Using Quantized CTranslate2 Model
- **TODO:** Put this behind a FastAPI server in Modal as the "deployed" model

In [None]:
import ctranslate2

generator = ctranslate2.Generator("opt125m-finetuned-ct2")

prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([start_tokens], max_length=30)

output = tokenizer.decode(results[0].sequences_ids[0])
print(output)

In [None]:
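from transformers import pipeline

# Generate with the original (not fine-tuned) facebook/opt-125m via the transformers pipeline
baseline_generator = pipeline('text-generation', model="facebook/opt-125m")
baseline_generator("Hello, I am conscious and")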
