## Load the necessary modules

source: https://www.datacamp.com/tutorial/fine-tuning-llama-2

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

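# Cap the size of blocks the CUDA caching allocator may split, to limit memory fragmentation.
# Set this before the first CUDA allocation so that it takes effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
# Release cached GPU memory left over from earlier cells
torch.cuda.empty_cache()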

In [None]:
# Model from Hugging Face hub
base_model = "tiiuae/falcon-7b"

# Fine-tuned model 
new_model = "falcon-7b-Property Classification"

## Loading dataset, model, and tokenizer

In [None]:
dataset = load_dataset('json', data_files='./train_dataset', split = 'train')
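
The trainer later in this notebook reads a column named `text` (`dataset_text_field="text"`), so every record in the training file needs that field. The sketch below shows one way to check this before training, assuming a JSON Lines file. The file name and the two sample records are hypothetical; the real training data is not shown in this notebook.

```python
import json
import os
import tempfile

# Hypothetical records: placeholders for the real training examples.
records = [
    {"text": "### Input: 3-bed house with garden\n### Label: residential"},
    {"text": "### Input: ground-floor retail unit\n### Label: commercial"},
]

# Write one JSON object per line (JSON Lines), a layout the 'json' loader accepts.
path = os.path.join(tempfile.mkdtemp(), "train_dataset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Sanity check: every line parses and carries a non-empty 'text' field.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

missing = [i for i, r in enumerate(loaded) if not r.get("text")]
print(f"{len(loaded)} records, {len(missing)} missing 'text'")
```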

## 4-bit quantization configuration

4-bit quantization via QLoRA makes it possible to fine-tune large language models on consumer hardware with little loss in quality, which puts fine-tuning within reach of much smaller compute budgets.

QLoRA quantizes a pretrained language model to 4 bits and freezes its parameters. A small number of trainable low-rank adapter (LoRA) layers are then added to the model.

During fine-tuning, gradients are backpropagated through the frozen 4-bit model into the adapter layers only, so the pretrained weights stay fixed at 4 bits while the adapters are updated. The QLoRA paper (Dettmers et al.) reports that this matches 16-bit fine-tuning performance on its benchmarks; results on other tasks may differ.
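
To make the memory saving concrete, here is a rough back-of-the-envelope estimate of the space taken by the weights alone. It assumes roughly 7 billion parameters (suggested by the "7b" in the model name) and ignores quantization constants, activations, adapter weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: about 7 billion parameters; the exact count is not stated in this notebook.
num_params = 7e9

fp16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```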

In [None]:
# bfloat16 compute matches bf16=True in the training arguments
compute_dtype = torch.bfloat16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

## Loading the Falcon-7B model

In [None]:
gpu_id = "cuda:0"
device = torch.device(gpu_id)
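model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map="auto",  # let Accelerate place the layers; pass `device` to pin to one GPU
)
model.config.use_cache = False  # the key/value cache is only useful at inference time
model.config.pretraining_tp = 1  # Llama-specific setting inherited from the source tutorial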

## Loading tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
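
The tokenizer settings above reuse the end-of-sequence token for padding and add the padding on the right. The following pure-Python sketch illustrates what right padding does to a batch. It is not the tokenizer's own implementation, and the token ids (including 11 as the pad id) are made up for the example.

```python
def pad_right(sequences, pad_id):
    """Pad token-id lists on the right to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token ids; 11 stands in for the EOS id reused as the pad id.
batch = [[5, 8, 2], [7, 3, 9, 4, 6]]
ids, mask = pad_right(batch, pad_id=11)
print(ids)
print(mask)
```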

## PEFT parameters

Traditional fine-tuning of pre-trained language models (PLMs) updates all of the model's parameters, which is computationally expensive and can require large amounts of data.

Parameter-Efficient Fine-Tuning (PEFT) updates only a small subset of parameters, or a small set of added ones, which makes it much cheaper. The official PEFT documentation describes the available parameters.
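
LoRA, the PEFT method configured in this notebook, keeps the pretrained weight matrix `W` frozen and learns two small matrices `A` and `B` whose product is added to the layer output, scaled by `alpha / r`. The toy sketch below uses plain Python lists and made-up sizes to show the idea. It is an illustration under those assumptions, not the PEFT library's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes (hypothetical): 3 inputs, 2 outputs, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen pretrained weight, 2 x 3
A = [[0.5, 0.5, 0.5]]                     # trainable, r x 3
B = [[0.0], [0.0]]                        # trainable, 2 x r, initialised to zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
print(lora_forward(W, A, B, x, alpha=2, r=1))
```

Because `B` starts at zero, the adapted model initially behaves exactly like the frozen base model, and training moves it away from that starting point only through `A` and `B`.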

In [None]:
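# Attention projection to adapt. Falcon fuses query, key, and value into a single
# 'query_key_value' layer; the 'q_proj', 'k_proj', 'v_proj' names belong to Llama-style
# models and are not found in Falcon.
target_modules = ['query_key_value']

# The QLoRA paper by Dettmers et al. suggests that targeting all linear layers results in better adaptation quality.
# For Falcon these would be 'query_key_value', 'dense', 'dense_h_to_4h', 'dense_4h_to_h'
# (confirm the names with print(model)).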
peft_params = LoraConfig(
    lora_alpha=256,
    lora_dropout=0.1,
    target_modules = target_modules,
    r=256,
    bias="none",
    task_type="CAUSAL_LM",
)
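
The rank `r` controls how many trainable parameters each adapted layer gains: a layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below works through that arithmetic. The layer shapes and layer count are assumed values for illustration, not measured from the model, so check them with `print(model)` before relying on the totals.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

# Assumed shapes for one adapted projection per transformer block (illustrative only).
d_in, d_out = 4544, 4672
n_layers = 32
r, lora_alpha = 256, 256  # values from the LoraConfig above

per_layer = lora_param_count(d_in, d_out, r)
total = per_layer * n_layers
scale = lora_alpha / r  # factor applied to the adapter output

print(f"per layer: {per_layer:,}  total: {total:,}  scale: {scale}")
```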

## Training parameters

Below is a list of hyperparameters that can be used to optimize the training process:

* output_dir: Directory where checkpoints and logs are written.
* num_train_epochs: Number of training epochs (1 here).
* fp16/bf16: Enable mixed-precision training in fp16 or bf16 (bf16 is enabled here, fp16 is not).
* per_device_train_batch_size: Batch size per GPU for training.
* per_device_eval_batch_size: Batch size per GPU for evaluation.
* gradient_accumulation_steps: Number of batches whose gradients are accumulated before each optimizer update.
* gradient_checkpointing: Recompute activations during the backward pass to save memory at the cost of extra compute.
* max_grad_norm: Maximum gradient norm used for gradient clipping.
* learning_rate: Initial learning rate.
* weight_decay: Weight decay applied to all layers except bias/LayerNorm weights.
* optim: Optimizer to use (paged 32-bit AdamW here).
* lr_scheduler_type: Learning rate schedule.
* max_steps: Total number of training steps; -1 means the count is derived from num_train_epochs.
* warmup_ratio: Fraction of training steps used for linear warmup, for schedules that include a warmup phase.
* group_by_length: Batch samples of similar length together, which reduces padding and can speed up training.
* save_steps: Save a checkpoint every N update steps (54486 here).
* logging_steps: Log every N update steps (5 here).
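
Several of these settings interact: the batch size, the accumulation steps, and the number of GPUs together determine how many optimizer updates one epoch takes, which in turn fixes how often logging and checkpointing fire. The sketch below works through that arithmetic. The dataset size and GPU count are hypothetical, and the step count is an approximation of what the trainer computes.

```python
import math

def training_plan(n_examples, per_device_batch, grad_accum, n_devices, epochs, warmup_ratio):
    """Derive approximate optimizer-step counts from the batch settings."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

# n_examples and n_devices are hypothetical; the other values mirror the cell below.
plan = training_plan(n_examples=10_000, per_device_batch=5, grad_accum=1,
                     n_devices=1, epochs=1, warmup_ratio=0.03)
effective_batch, steps_per_epoch, total_steps, warmup_steps = plan
print(plan)
```

If `save_steps` is larger than `total_steps`, no intermediate checkpoint is written during the run, so it is worth comparing the two before starting a long job.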

In [None]:
training_params = TrainingArguments(
    output_dir="./Trained_models",
    num_train_epochs=1,
    per_device_train_batch_size=5,
    auto_find_batch_size=False,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    logging_steps=5,
    save_steps=54486,
    learning_rate=3e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
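
The cell above combines `lr_scheduler_type="constant"` with a non-zero `warmup_ratio`. As a sketch of how the two schedule types differ, the function below mimics a plain constant schedule and a constant schedule with linear warmup. It is a simplified re-implementation for illustration, and the warmup length of 60 steps is a hypothetical value.

```python
def lr_at_step(step, base_lr, schedule, warmup_steps=0):
    """Learning rate at a given optimizer step for two simple schedules."""
    if schedule == "constant":
        # No warmup phase: the rate is the same at every step.
        return base_lr
    if schedule == "constant_with_warmup":
        # Ramp linearly from 0 to base_lr, then hold.
        if step < warmup_steps:
            return base_lr * step / max(1, warmup_steps)
        return base_lr
    raise ValueError(f"unsupported schedule: {schedule}")

base_lr = 3e-4
for step in (0, 30, 60, 1000):
    plain = lr_at_step(step, base_lr, "constant")
    warm = lr_at_step(step, base_lr, "constant_with_warmup", warmup_steps=60)
    print(step, plain, warm)
```

With the plain constant schedule the warmup setting plays no role in this sketch; if a warmup phase is wanted, `"constant_with_warmup"` is the schedule type to consider.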

## Model fine-tuning

Supervised fine-tuning (SFT) is a key step in reinforcement learning from human feedback (RLHF). The TRL library from Hugging Face provides an easy-to-use API for creating SFT models and training them on your dataset with just a few lines of code. It includes tools for training language models with reinforcement learning, starting with supervised fine-tuning, then reward modeling, and finally proximal policy optimization (PPO).

We pass `SFTTrainer` the model, dataset, LoRA configuration, tokenizer, and training parameters.

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)


In [None]:
trainer.train()

In [None]:
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)