> The notebook [Fine_Tuning_Mistral7B_with_QLoRA_Native_PyTorch_Training.ipynb](https://github.com/BitwiseBrains/RagOptimize/blob/main/fine_tuning/Fine_Tuning_Mistral7B_with_QLoRA_Native_PyTorch_Training.ipynb) shows how we can write our own PyTorch trainer to fine-tune the model using `QLoRA`. Hugging Face also provides a `Trainer` class that can fine-tune the model with very little code. This notebook contains the code for that approach.
>
> **Note:** In our case, we used the native PyTorch code for the training. This notebook is just for completeness.

> Since most of the code in this notebook is similar to the notebook mentioned above, I have skipped the explanation of some code blocks. Refer to the above notebook for more details.

# Preparation


We will start just like before, by installing some libraries and making the imports.

In [1]:
!pip install -U -q datasets bitsandbytes accelerate torch
!pip install -q peft==0.6.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.1 which is incompatible.
beatrix-jupyterlab 2023.128.151533 requires jupyterlab~=3.6.0, but you have jupyterlab 4.1.2 which is incompatible.
cudf 23.8.0 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-pyth

In [2]:
from datasets import load_dataset
import torch
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    MistralForCausalLM,
    TrainerCallback,
    TrainingArguments,
    Trainer,
)
from peft import (
    PeftModel,
    prepare_model_for_kbit_training,
    LoraConfig,
    get_peft_model,
)
import gc
import wandb
import yaml
import os

from huggingface_hub import HfApi, CommitOperationAdd, login, hf_hub_download

2024-03-17 03:09:54.837151: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-17 03:09:54.837304: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-17 03:09:54.975311: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Config and Logins


## Configuration

The config has the same keys as in the other notebook, with one addition: `eval_interval`, which specifies the number of batches to train on before running the evaluation loop. Some values have also changed.


In [3]:
config = """---
hf_repo_id: hari31416/Mistral_Base_Finance_Finetuning_Trainer
start_batch_number: 0
end_batch_number: 500
base_model_id: mistralai/Mistral-7B-v0.1
load_pretrained: False
head_file_name: mistral_head.pt
dataset_id: gbharti/finance-alpaca
quantization_config:
  load_in_4bit: True
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: True
  bnb_4bit_compute_dtype: bfloat16
lora_config:
  r: 16
  lora_alpha: 4
  lora_dropout: 0.05
  bias: none
  task_type: CAUSAL_LM
  target_modules:
    - o_proj
    - v_proj
    - k_proj
    - q_proj
num_warmup_steps: 0
epochs: 1
max_iter_per_epoch:
max_steps:
log_interval: 1
eval_interval: 100
wandb: True
project: RAGOptimize
wandb_name: fine_tune_trainer
notes: RAGOptimize Training With HF Trainer
lr: 0.0001
accumulation_steps: 1
batch_size: 4
max_length: 1024
model_save_root_dir: /kaggle/working/models
push_to_hub: True
push_to_hub_frequency: 100
max_hours: 11.7
"""
config = yaml.safe_load(config)
# Create the model save root directory so that we can save the model
os.makedirs(config["model_save_root_dir"], exist_ok=True)

## Login to W&B and HF

We will log in to `wandb` and `huggingface_hub` just like before, with a couple of changes:

- We will not create a run directly here using `wandb.init`. The HF `Trainer` will create it for us.
- We need to set an environment variable named `WANDB_PROJECT`, which decides the name of the project under which `Trainer` will create the run. The second code cell does this using the `%env` magic command.

In [4]:
from kaggle_secrets import UserSecretsClient


user_secrets = UserSecretsClient()

WANDB_API_KEY = user_secrets.get_secret("WANDB_API_KEY")

text = f"""machine api.wandb.ai
  login user
  password {WANDB_API_KEY}
"""
# wandb saves credentials at /root/.netrc
with open("/root/.netrc", "w") as f:
    f.write(text)

run_name = f"""{config["wandb_name"]}_{config["start_batch_number"]}_{config["end_batch_number"]}"""
if config["push_to_hub"]:
    HUGGING_FACE_API_KEY = user_secrets.get_secret("HUGGING_FACE_API_KEY")
    login(HUGGING_FACE_API_KEY)

api = HfApi()

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
project = config["project"]
%env WANDB_PROJECT=$project

env: WANDB_PROJECT=RAGOptimize
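
The `%env` magic only works inside IPython. In a plain Python script, a rough equivalent is to set the variable through `os.environ` before training starts. The sketch below assumes the project name shown in the output above; `set_wandb_project` is a hypothetical helper, not part of `wandb`.

```python
import os


def set_wandb_project(project: str) -> str:
    """Set WANDB_PROJECT for the current process and return the value that was set."""
    # Hypothetical helper: mirrors what `%env WANDB_PROJECT=$project` does in a notebook.
    os.environ["WANDB_PROJECT"] = project
    return os.environ["WANDB_PROJECT"]


project_name = set_wandb_project("RAGOptimize")
print(f"WANDB_PROJECT={project_name}")
```

The variable only affects the current process and its children, so it has to be set in the same session that runs the training.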


# Loading the Model and Tokenizer


The model will be loaded in a similar manner.

In [6]:
# use bfloat16 or float16 depending on the config
bnb_4bit_compute_dtype = (
    torch.bfloat16
    if config["quantization_config"]["bnb_4bit_compute_dtype"] == "bfloat16"
    else torch.float16
)
config["quantization_config"].update({"bnb_4bit_compute_dtype": bnb_4bit_compute_dtype})

# load the quantization and lora config
quantization_config = BitsAndBytesConfig(**config["quantization_config"])
lora_config = LoraConfig(**config["lora_config"])

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(config["base_model_id"])
tokenizer.pad_token_id = tokenizer.eos_token_id

# load the base model
model = MistralForCausalLM.from_pretrained(
    config["base_model_id"],
    quantization_config=quantization_config,
)
# save the original head weights to check if the head weights are updated
og_head = model.lm_head.state_dict()["weight"].to(
    "cpu"
)  # move to cpu to avoid getting the weights updated

# enable gradient checkpointing and prepare the model for kbit training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

if config["load_pretrained"]:
    # if loading pretrained, load the head weights
    print("Loading pretrained PEFT Model and the head")
    head_file_path = hf_hub_download(
        config["hf_repo_id"], config["head_file_name"], local_dir="."
    )
    lm_head_state_dict = torch.load(head_file_path)
    model.lm_head.load_state_dict(lm_head_state_dict)

    # load the adapter to the base model
    model = PeftModel.from_pretrained(model, config["hf_repo_id"], is_trainable=True)

    # if loading pretrained, make sure that the weights of the head are different
    new_head = model.lm_head.state_dict()["weight"].to("cpu")
    if torch.equal(new_head, og_head):
        raise ValueError("Head weights are the same!")
    print("Head weights are different.")
    # delete the head weights for memory
    del lm_head_state_dict, new_head
else:
    # If not loading pretrained, create the PEFT model using the LORA config
    print("Creating the PEFT model using LORA config")
    model = get_peft_model(model, lora_config)

# delete the original head weights for memory
del og_head
# make the head trainable
model.base_model.model.lm_head.weight.requires_grad_()

# print the trainable parameters
model.print_trainable_parameters()

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Creating the PEFT model using LORA config
trainable params: 144,703,488 || all params: 7,255,363,584 || trainable%: 1.9944346871755614


In [7]:
# make sure that the head weights are trainable
assert model.base_model.model.lm_head.weight.requires_grad
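
The trainable parameter count printed above can be checked by hand. The following is a back-of-the-envelope sketch, assuming the published Mistral-7B-v0.1 dimensions (hidden size 4096, key/value projection width 1024, 32 layers, vocabulary of 32000); these numbers are not stated in this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed Mistral-7B-v0.1 dimensions (taken from the model's published config,
# not from this notebook).
HIDDEN = 4096      # hidden size
KV_DIM = 1024      # 8 key/value heads * 128 head dimension
NUM_LAYERS = 32
VOCAB = 32000
R = 16             # LoRA rank from the config above

# (in_features, out_features) of each projection listed in `target_modules`
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    # LoRA adds two matrices per layer: A (r x in_features) and B (out_features x r).
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
adapter_params = per_layer * NUM_LAYERS
head_params = VOCAB * HIDDEN  # the lm_head weight made trainable above
total_trainable = adapter_params + head_params

print(f"adapter: {adapter_params:,}")
print(f"head:    {head_params:,}")
print(f"total:   {total_trainable:,}")
```

Under these assumptions the total matches the `144,703,488` reported by `print_trainable_parameters()`. Roughly 90% of it comes from the head rather than the LoRA adapters.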

# The Dataset


## Load the Dataset

Loading and preprocessing the text data is similar.

In [8]:
dataset = load_dataset(config["dataset_id"])
dataset = dataset["train"]
TOTAL_SAMPLES = len(dataset)
TOTAL_BATCHES = TOTAL_SAMPLES // config["batch_size"]
print(
    f"Total number of samples: {TOTAL_SAMPLES}\nTotal Number of Batches: {TOTAL_BATCHES}"
)


def format_input_text(text, verbose=False):
    """Formats the input text to the format required by the model.
    
    Parameters
    ----------
    text : dict
        The input text dictionary containing the input, instruction, text and output.
    verbose : bool, optional
        Whether to print the formatted message, by default False
    """
    input_ = text["input"]
    instruction = text["instruction"]
    text_ = text["text"]
    output = text["output"]
    user_content = ""
    if input_:
        user_content += f"{input_}\n"
    if instruction:
        user_content += f"{instruction}\n"
    if text_:
        user_content += f"{text_}\n"
    user_content = user_content.strip()
#     message = f"<INST>{user_content}</INST>{output}"
    message = f"<s><INST>{user_content}</INST>{output}</s>"
    if verbose:
        print(message)
    return {"message": message}

# use the format_input_text function to format the input text
dataset = dataset.map(format_input_text)
# remove the columns that are not required
dataset = dataset.remove_columns(["text", "instruction", "output", "input"])

Downloading readme:   0%|          | 0.00/709 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 42.9M/42.9M [00:01<00:00, 40.6MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

Total number of samples: 68912
Total Number of Batches: 17228


Map:   0%|          | 0/68912 [00:00<?, ? examples/s]

## Filter The Dataset

While filtering, we will use the first 10 batches as the evaluation dataset, since the `Trainer` expects an evaluation dataset.

In [9]:
eval_dataset_length = 10*config["batch_size"]
dataset_start_idx = config["start_batch_number"] * config["batch_size"] + eval_dataset_length
dataset_end_idx = config["end_batch_number"] * config["batch_size"] + eval_dataset_length
print(f"Splitting from {dataset_start_idx} to {dataset_end_idx}")
data = dataset.select(range(dataset_start_idx, dataset_end_idx))
eval_data = dataset.select(range(0, eval_dataset_length))
print(f"Number of samples to be trained on: {len(data['message'])}")

Splitting from 40 to 2040
Number of samples to be trained on: 2000
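
The index arithmetic above can be wrapped in a small function to make the offsets easier to check. This is a minimal sketch; `split_ranges` and its `eval_batches` parameter are hypothetical names introduced here for illustration.

```python
def split_ranges(start_batch: int, end_batch: int, batch_size: int, eval_batches: int = 10):
    """Return (eval_range, train_range) of sample indices.

    The first `eval_batches` batches are reserved for evaluation, so every
    training index is shifted by that many samples.
    """
    eval_len = eval_batches * batch_size
    train_start = start_batch * batch_size + eval_len
    train_end = end_batch * batch_size + eval_len
    return range(0, eval_len), range(train_start, train_end)


# Values from the config above: batches 0 to 500 with a batch size of 4.
eval_range, train_range = split_ranges(0, 500, 4)
print(f"eval:  {eval_range.start} to {eval_range.stop}")
print(f"train: {train_range.start} to {train_range.stop} ({len(train_range)} samples)")

# A later session that continues from batch 500 starts where this one stopped.
_, next_range = split_ranges(500, 1000, 4)
print(f"next:  {next_range.start} to {next_range.stop}")
```

Because the evaluation samples always occupy the first indices, the training range never overlaps them, whichever `start_batch_number` is used.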


## Tokenizing the Dataset

Note that we do not need to create a custom `Dataset`; `Trainer` will handle this. However, we do need to pass it the tokenized dataset. Batching and collating will be handled by `Trainer` with the help of `DataCollatorForLanguageModeling`.

In [10]:
# The data collator for the language modeling task. This will make sure that the data is in correct format for the model
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

def preprocess_function(examples):
    return tokenizer(examples["message"], truncation=True, return_tensors="pt",
            max_length=config["max_length"], padding=True)
train_tokenized_ds = data.map(preprocess_function, batched=True)
eval_tokenized_ds = eval_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]
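
To make the role of the collator with `mlm=False` more concrete, here is a plain-Python sketch of the behaviour: pad the batch, copy the input ids into the labels, and replace padding positions in the labels with `-100` so the loss skips them. It is an illustration only, not the library's implementation; `collate_causal_lm` is a hypothetical function, right-padding is assumed, and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores


def collate_causal_lm(batch, pad_token_id):
    """Pad a list of token-id lists and build labels for causal language modeling."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels are a copy of the inputs with every pad-token position masked out.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up ids: 1 plays the role of <s> and 2 the role of </s>.
# The pad id is set equal to the EOS id, as in `tokenizer.pad_token_id = tokenizer.eos_token_id`.
collated = collate_causal_lm([[1, 5, 6, 2], [1, 7, 2]], pad_token_id=2)
for key, value in collated.items():
    print(key, value)
```

One consequence worth noticing in this sketch: because the pad id equals the EOS id, the genuine end-of-sequence token is masked in the labels as well, so it does not contribute to the loss.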

# Training


## Custom Callback

Although we do not need to write the training loop and other training logic, we do need to write the logic that periodically pushes the model to the Hugging Face Hub. The `Trainer` does provide built-in functionality to push the model periodically; however, it cannot push the head of the model. For this, we need to create a custom callback class by inheriting from `TrainerCallback`. See the [documentation](https://huggingface.co/docs/transformers/en/main_classes/callback#transformers.TrainerCallback) for details.

The method `push_model_to_hub` is the same method that was used in the other notebook.

In [11]:
def push_model_to_hub(model):
    # Push the adapter to the hub
    start_step = config["start_batch_number"]
    end_step = config["end_batch_number"]
    # TODO: Change it so that `end_step` is the number of batches trained till now
    commit_message = f"Trained model from {start_step} to {end_step} steps"
    print(f"Pushing the model to hub with commit: {commit_message}")
    model.push_to_hub(config["hf_repo_id"], commit_message=commit_message)

    # save the dict containing the model head state to hub using the Hugging Face API
    head = model.lm_head
    file_name = config["head_file_name"]
    torch.save(head.state_dict(), file_name)
    operations = [
        CommitOperationAdd(path_in_repo=file_name, path_or_fileobj=file_name)
    ]
    commit_message = f"Adding head to model from {start_step} to {end_step} steps"
    print(f"Pushing the head to hub with commit: {commit_message}")
    api.create_commit(
        config["hf_repo_id"],
        operations=operations,
        commit_message=commit_message,
    )

class CustomCallback(TrainerCallback):
    # push the model to HF when the model is saved
    def on_save(self, args, state, control, model, logs=None, **kwargs):
        if not config["push_to_hub"]:
            return None
        push_model_to_hub(model)
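
The TODO in `push_model_to_hub` could be addressed by deriving the end step from the trainer state instead of the config. Below is a minimal sketch of that idea. `build_commit_message` is a hypothetical helper, and the sketch assumes one batch per optimizer step unless `accumulation_steps` says otherwise. Inside `on_save`, the step count would come from `state.global_step`.

```python
def build_commit_message(start_batch: int, global_step: int, accumulation_steps: int = 1) -> str:
    """Build a commit message that reflects how many batches have been trained so far."""
    # Each optimizer step consumes `accumulation_steps` batches.
    batches_done = global_step * accumulation_steps
    end_batch = start_batch + batches_done
    return f"Trained model from {start_batch} to {end_batch} steps"


# In the callback this would be called as, for example:
#     build_commit_message(config["start_batch_number"], state.global_step)
for step in (100, 200, 500):
    print(build_commit_message(0, step))
```

With this change, each periodic push would carry a distinct commit message rather than repeating the final batch number every time.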

## Training the Model

With the callback defined, we are ready to train the model. For this, we need to pass some arguments to the trainer using `TrainingArguments`.

In [12]:
gc.collect()
torch.cuda.empty_cache()

report_to = "wandb" if config["wandb"] else "none"

training_args = TrainingArguments(
    output_dir=config["model_save_root_dir"],
    learning_rate=config["lr"],  # 0.0001 in the config above
    per_device_train_batch_size=config["batch_size"],
    per_device_eval_batch_size=config["batch_size"],
    optim='paged_adamw_32bit',
    lr_scheduler_type="cosine",
    num_train_epochs=config["epochs"],
    weight_decay=0.01,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=False,
    hub_model_id = config["hf_repo_id"],
    hub_strategy = "every_save",
    eval_steps=config["eval_interval"],
    save_steps = config["push_to_hub_frequency"],
    logging_steps=config["log_interval"],
    report_to=report_to,
    run_name= run_name,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_ds,
    eval_dataset=eval_tokenized_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks = [CustomCallback]
)

trainer.train()

push_model_to_hub(model)

wandb.finish()
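
To get a feel for what `lr_scheduler_type="cosine"` means for this run, the sketch below computes a half-cosine decay by hand. It is an approximation written for illustration, not the library's scheduler code; `cosine_lr` is a hypothetical function, and the step count assumes 2000 training samples, a batch size of 4, one epoch, and no gradient accumulation.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int = 0) -> float:
    """Half-cosine decay from `base_lr` to 0, with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


num_samples, batch_size, epochs = 2000, 4, 1
total_steps = math.ceil(num_samples / batch_size) * epochs

print(f"total optimizer steps: {total_steps}")
for step in (0, 100, 250, 400, 500):
    print(f"step {step:3d}: lr = {cosine_lr(step, total_steps, 1e-4):.2e}")
```

Under these assumptions the run has 500 optimizer steps, so with `eval_steps` and `save_steps` both set to 100 there are five evaluations and five saves over the epoch.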

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
[34m[1mwandb[0m: Currently logged in as: [33mhari31416[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.4 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.3
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240317_031154-kfr9mcgw[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mfine_tune_trainer_base_0_500[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/hari31416/RAGOptimize[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/hari31416/RAGOptimize/runs/kfr9mcgw[0m
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 2.5623        | 2.495888        |
| 200  | 2.068         | 2.447163        |
| 300  | 2.0741        | 2.397187        |
| 400  | 2.5224        | 2.36111         |
| 500  | 2.1368        | 2.354057        |


Pushing the model to hub with commit: Trained model from 0 to 500 steps


adapter_model.safetensors:   0%|          | 0.00/54.6M [00:00<?, ?B/s]

Pushing the head to hub with commit: Adding head to model from 0 to 500 steps


mistral_head.pt:   0%|          | 0.00/524M [00:00<?, ?B/s]



Pushing the model to hub with commit: Trained model from 0 to 500 steps
Pushing the head to hub with commit: Adding head to model from 0 to 500 steps


[34m[1mwandb[0m:                                                                                
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m:                      eval/loss █▆▃▁▁
[34m[1mwandb[0m:                   eval/runtime ▁▅▅██
[34m[1mwandb[0m:        eval/samples_per_second ▁▁▁▁▁
[34m[1mwandb[0m:          eval/steps_per_second ▁▁▁▁▁
[34m[1mwandb[0m:                    train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
[34m[1mwandb[0m:              train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
[34m[1mwandb[0m:                train/grad_norm ██▇▇▄▃▃▄▁▂▂▄▃▂▃▂▃▂▃▂▁▂▂▁▁▂▂▂▂▁▂▄▂▃▂▄▂▂▂▂
[34m[1mwandb[0m:            train/learning_rate ███████▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
[34m[1mwandb[0m:                     train/loss ▇█▇▄▆▇▃▆▅▇▅▅▅▃▇▃▇▃▃▄▂▇▅▆▃▁▃▅▂▁▆▆▂█▃▂▅▆▄▅
[34m[1mwandb[0m:               train/total_flos ▁
[34m[1mwandb[0m:               train/train_loss ▁
[34m[1mwandb[0m:            train/train_runtime ▁