Ref: https://www.youtube.com/watch?v=DcBC4yGHV4Q

### Fine-tune large models using 🤗 [`peft`](https://github.com/huggingface/peft) adapters, [`transformers`](https://github.com/huggingface/transformers) & [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes)

I am running my project experiments on fine-tuning large language models using the recent `peft` library, with `bitsandbytes` for loading large models in reduced precision (**4-bit** in this notebook; **8-bit** is also possible).
The fine-tuning relies on a method called "Low-Rank Adaptation" ([LoRA](https://arxiv.org/pdf/2106.09685.pdf)): instead of fine-tuning the entire model, we only have to train small adapter matrices and load them properly inside the model.
After fine-tuning the model, I am going to share the model adapters on the 🤗 Hub, from where they can be loaded back very easily.
Let me start!
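
To make the LoRA idea concrete, here is a minimal pure-Python sketch of a single adapted linear layer, following the formulation in the LoRA paper: the frozen weight `W` is left untouched and a low-rank product `B @ A`, scaled by `alpha / r`, is added on top. The helper names `matvec` and `lora_forward` and the tiny matrices are made up for illustration; `peft` does the equivalent on GPU tensors.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) for one input vector x."""
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy example: 2x2 frozen weight, rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]        # shape (r, d_in)
B_zero = [[0.0], [0.0]] # shape (d_out, r); B starts at zero in LoRA
x = [1.0, 1.0]

# With B = 0 the adapted layer reproduces the base model exactly.
print(lora_forward(W, A, B_zero, x, alpha=2, r=1))
```

Because `B` starts at zero, training begins from the unmodified pre-trained model, and only `A` and `B` receive gradients.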

### Install requirements

First, run the cells below to install the requirements:

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib einops
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

## Checking GPU availability

In [None]:
!nvidia-smi

## Importing Packages

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
import torch.nn as nn
import bitsandbytes as bnb

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)

## Hugging Face Credentials + Google Drive Mounting (required later for dataset loading)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Initializing Parameters / Settings

In [None]:
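# Free memory on the current GPU, in whole GB
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
# Leave about 2 GB of headroom per GPU
max_memory_per_gpu = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
# Per-device limits keyed by GPU index (can be passed as `max_memory=` to `from_pretrained`)
max_memory = {i: max_memory_per_gpu for i in range(n_gpus)}
max_memory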

## Loading Pre-trained Model & Tokenizer from the repository

In [None]:
MODEL_NAME = "tiiuae/falcon-7b"   # original "tiiuae/falcon-7b"

In [None]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # use load_in_8bit=True instead for 8-bit loading
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type='nf4',              # 'nf4' or 'fp4'
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)
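
As a rough back-of-the-envelope check of why quantized loading matters, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes roughly 7 billion parameters for the model and ignores quantization constants, layers kept in higher precision, activations and optimizer state, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 7e9  # assumption: approximate parameter count of a 7B model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is what makes a 7B model fit on a single consumer or Colab GPU.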


In [None]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map='auto',
    trust_remote_code=True,
    quantization_config = bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

## Prepare model for training

Some pre-processing is needed before training a quantized (k-bit) model using `peft`, so I use the utility function `prepare_model_for_kbit_training`, which will:
- Cast all the non-quantized modules to full precision (`fp32`) for stability
- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states
- Enable gradient checkpointing for more memory-efficient training

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

### Apply LoRA

Here I use parameter-efficient fine-tuning (`peft`): the `get_peft_model` utility function from `peft` wraps the model in a `PeftModel` configured with low-rank adapters (LoRA).

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
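
For a sanity check on the adapter size, each adapted linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` trainable parameters (matrix `A` is `r x d_in`, matrix `B` is `d_out x r`). The sketch below applies this to the `query_key_value` projection. The layer dimensions (4544 inputs, 4672 outputs, 32 decoder layers) are my assumption about the Falcon-7B architecture and are not taken from this notebook; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r, n_layers):
    """Trainable LoRA parameters when one d_in x d_out layer per block is adapted."""
    return r * (d_in + d_out) * n_layers

# Assumed Falcon-7B dimensions (verify against the printed model architecture)
D_IN, D_OUT, N_LAYERS = 4544, 4672, 32

print(lora_param_count(D_IN, D_OUT, r=32, n_layers=N_LAYERS))
```

If the assumed dimensions are right, this number should match the trainable parameter count reported for the wrapped model, since all base weights are frozen.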

### Model Architecture

In [None]:
model

### Printing trainable parameters

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

### Prompt generation configuration

In [None]:
generation_config = model.generation_config

generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.do_sample = False  # greedy decoding; temperature, top_p and top_k only take effect when do_sample=True
generation_config.top_p = 0.7
generation_config.top_k = 20
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.use_cache = False

In [None]:
generation_config
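
To illustrate what `temperature`, `top_k` and `top_p` do once sampling is enabled, here is a simplified pure-Python sketch of the filtering applied to one step's logits. It is not the `transformers` implementation, only the general idea: scale the logits by the temperature, keep the `top_k` most likely tokens, then keep the smallest set of those whose cumulative probability reaches `top_p`. The function name `filter_probs` and the toy logits are made up.

```python
import math

def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.7):
    """Return the renormalized sampling distribution as {token_index: probability}."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most probable tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(toy_logits, top_p=0.9))
```

With greedy decoding the most likely token is always picked, so none of this filtering applies.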

### Inference Before Training
This only checks that the model is loaded properly; the actual inference is done in a separate module named 'My project -Inference'.

In [None]:
prompt = """
<human>: How can I wash my hand?
<bot>:
""".strip()
print(prompt)

In [None]:
%%script true

%%time
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode(True):
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Loading the dataset from the prepared JSON file

In [None]:
from datasets import load_dataset, Dataset
json_file_path = '/content/drive/MyDrive/ISP/data/final_qa.json'

# loading saved json data
data = load_dataset('json', data_files = json_file_path)

In [None]:
print('question: ',data['train'][50]['question'])
print('answer: ', data['train'][50]['answer'])

In [None]:
print(len(data['train']))

## Build a model-compatible Hugging Face dataset

In [None]:
# Builds one training prompt per data point, using the same role tags
# (<human> / <bot>) as the inference prompt

def generate_prompt(data_point):
    return f"""
<human>: {data_point["question"]}
<bot>: {data_point["answer"]}
""".strip()

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

In [None]:
# generate train data in the form of tokenized prompts
train_data_enc = data["train"].shuffle().map(generate_and_tokenize_prompt)
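
The tokenized examples contain only `input_ids` and `attention_mask`; the training labels are created later, at batching time, by `DataCollatorForLanguageModeling(tokenizer, mlm=False)`. The sketch below imitates that step in plain Python with made-up token ids: sequences are padded to the longest one in the batch, labels are a copy of the inputs, and every position holding the pad token id is set to `-100` so the loss ignores it. Right-side padding and the helper name `collate_causal_lm` are assumptions for illustration.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def collate_causal_lm(sequences, pad_token_id):
    """Pad a batch to equal length and build causal-LM labels from the inputs."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # Labels mirror the inputs; every pad-token position is masked out.
    labels = [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]
    return input_ids, labels

# Hypothetical token ids, with 0 standing in for the pad token
batch_ids, batch_labels = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch_ids)
print(batch_labels)
```

Since this notebook sets `tokenizer.pad_token = tokenizer.eos_token`, the mask is applied to every occurrence of the EOS id, not just to padding.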

## Training

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
OUTPUT_DIR = '/content/drive/MyDrive/ISP/gc1-Falcon/experiments'

%load_ext tensorboard
%tensorboard --logdir '/content/drive/MyDrive/ISP/gc1-Falcon/experiments'

In [None]:
#@title
import transformers

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=32,  # starting value; adjust to the GPU's VRAM
    auto_find_batch_size=True,       # retry with a smaller batch size on CUDA out-of-memory errors
    gradient_accumulation_steps=4,
    num_train_epochs=1,              # overridden by max_steps below
    max_steps=120,
    learning_rate=1e-4,
    fp16=True,
    save_strategy='epoch',
    optim="paged_adamw_8bit",
    lr_scheduler_type='cosine',
    warmup_ratio=0.05,
    output_dir=OUTPUT_DIR,
    logging_steps=1,
    report_to='tensorboard',
    save_total_limit=3,
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data_enc,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

# training loop
trainer.train()
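
To get a feel for what these settings imply, the sketch below works out the effective batch size and the learning-rate curve for a cosine schedule with linear warmup. It assumes a single GPU and that the batch size is not reduced by `auto_find_batch_size`; the schedule formula is the standard warmup-plus-cosine shape and `lr_at_step` is a hypothetical helper, not a `transformers` function.

```python
import math

PER_DEVICE_BATCH = 32
GRAD_ACCUM = 4
N_GPUS = 1  # assumption: single-GPU Colab runtime
MAX_STEPS = 120

# Samples consumed per optimizer step and over the whole run
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * N_GPUS
print(effective_batch, effective_batch * MAX_STEPS)

def lr_at_step(step, base_lr=1e-4, max_steps=MAX_STEPS, warmup_ratio=0.05):
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 3, 6, 60, 120):
    print(step, lr_at_step(step))
```

The learning rate ramps up over the first few steps, peaks at `learning_rate`, and decays towards zero by the last step.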

## Fine-tuned model adapter: save to disk and share on the 🤗 Hub

In [None]:
FINETUNED_MODEL_NAME = 'TariqJamil/falcon-7b-peft-qlora-my_finetuned_model-0706'

model.save_pretrained(FINETUNED_MODEL_NAME)

In [None]:
#model.push_to_hub('falcon-7b-instruct-peft-qlora-my_finetuned_model-0607', use_auth_token=True, create_pr=1)
model.push_to_hub(FINETUNED_MODEL_NAME, private=True)
tokenizer.push_to_hub(FINETUNED_MODEL_NAME)
model.config.to_json_file("config.json")
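
For completeness, here is a hedged sketch of how the shared adapter could be loaded back later, for example in the separate inference notebook. It reuses the same 4-bit settings as above and the `PeftConfig` / `PeftModel` classes imported at the top; the function name `load_finetuned_adapter` is made up, and the heavy imports are placed inside the function only so that defining it has no side effects.

```python
def load_finetuned_adapter(adapter_id):
    """Load the quantized base model and attach the saved LoRA adapter."""
    import torch
    from peft import PeftConfig, PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # The adapter config records which base model the adapter was trained on.
    peft_config = PeftConfig.from_pretrained(adapter_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        peft_config.base_model_name_or_path,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)
    tokenizer.pad_token = tokenizer.eos_token
    # Wrap the base model with the fine-tuned adapter weights.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    return model, tokenizer

# Usage (downloads the base model; needs a GPU and, for a private repo, a prior login):
# model, tokenizer = load_finetuned_adapter(FINETUNED_MODEL_NAME)
```

Only the small adapter weights are stored in the adapter repository; the base model is fetched from its own repository.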