<a href="https://www.kaggle.com/code/aisuko/fine-tuning-mistral-with-qlora?scriptVersionId=165471314" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In this notebook, we fine-tune Mistral-7B-v0.1 with QLoRA. The hyperparameter values are taken from the GitHub issues and articles in the Reference List.

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3

In [2]:
import os
import torch
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning Mistral-7B-01"
os.environ["WANDB_NAME"] = "ft-mistral-7b-v01"
os.environ["MODEL_NAME"] = "mistralai/Mistral-7B-v0.1"
os.environ["DATASET"] = "OpenAssistant/oasst_top1_2023-08-25"

torch.backends.cudnn.deterministic=True

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `mistralai/Mistral-7B-v0.1` from `transformers`...
config.json: 100%|█████████████████████████████| 571/571 [00:00<00:00, 3.85MB/s]
┌────────────────────────────────────────────────────────┐
│  Memory Usage for loading `mistralai/Mistral-7B-v0.1`  │
├───────┬─────────────┬──────────┬───────────────────────┤
│ dtype │Largest Layer│Total Size│  Training using Adam  │
├───────┼─────────────┼──────────┼───────────────────────┤
│float32│  864.03 MB  │ 27.49 GB │       109.96 GB       │
│float16│  432.02 MB  │ 13.74 GB │        54.98 GB       │
│  int8 │  216.01 MB  │ 6.87 GB  │        27.49 GB       │
│  int4 │   108.0 MB  │ 3.44 GB  │        13.74 GB       │
└───────┴─────────────┴──────────┴───────────────────────┘
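
The table follows from simple arithmetic: each narrower dtype uses proportionally fewer bytes per parameter, and the "Training using Adam" column is four times the model size (weights, gradients, and Adam's two moment estimates). Below is a minimal sketch that reproduces the columns from the float32 figure above. The helper name `estimate_sizes` is hypothetical, and the 4x factor is the rule of thumb the table's numbers are consistent with, not a measurement.

```python
# Bits per parameter for each dtype row in the table above.
BITS = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}

def estimate_sizes(float32_gb: float) -> dict:
    """Derive (total size, Adam training size) in GB per dtype from the float32 total."""
    sizes = {}
    for dtype, bits in BITS.items():
        total = float32_gb * bits / 32   # scale by bytes per parameter
        sizes[dtype] = (total, 4 * total)  # rule of thumb: Adam training ~ 4x model size
    return sizes

sizes = estimate_sizes(27.49)
for dtype, (total, adam) in sizes.items():
    print(f"{dtype:>8}: total ~ {total:6.2f} GB, Adam training ~ {adam:6.2f} GB")
```

The Adam column describes full fine-tuning; QLoRA keeps the 4-bit base weights frozen, so only the trainable part carries gradients and optimizer state.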


# Loading Datasets

In [4]:
from datasets import load_dataset
dataset=load_dataset(os.getenv("DATASET"), split="train[:500]")
dataset=dataset.train_test_split(test_size=0.1)
print(dataset["train"][0]["text"])

dataset

<|im_start|>user
¿Qué distancia hay entre A Coruña y Oporto?<|im_end|>
<|im_start|>assistant
La distancia entre A Coruña, España, y Oporto, Portugal, es de aproximadamente 272 kilómetros si se viaja por carretera. El tiempo de viaje en automóvil puede variar dependiendo del tráfico y las condiciones de la carretera, pero generalmente toma alrededor de 3-4 horas. También existen opciones de transporte público, como autobuses y trenes, que pueden tomar un poco más de tiempo pero pueden ser una buena alternativa para aquellos que prefieren no conducir.<|im_end|>



DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 450
    })
    test: Dataset({
        features: ['text'],
        num_rows: 50
    })
})
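
Each sample in the `text` column is a conversation in the ChatML-style layout printed above: every turn is wrapped in `<|im_start|>role` and `<|im_end|>`. Here is a minimal sketch of how such a string could be built from a list of turns. The helper `to_chatml` and the sample messages are hypothetical and not part of the dataset tooling.

```python
def to_chatml(messages):
    """Wrap each (role, content) turn in <|im_start|> ... <|im_end|> markers."""
    turns = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    return "".join(turns)

sample = to_chatml([
    ("user", "Hello"),
    ("assistant", "Hi, how can I help?"),
])
print(sample)
```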

# Loading the Tokenizer

In [5]:
from transformers import AutoTokenizer

# fast tokenizer sometimes ignores added tokens
tokenizer=AutoTokenizer.from_pretrained(os.getenv('MODEL_NAME'), use_fast=False)
# add tokens <|im_start|> and <|im_end|>, latter is special eos token
tokenizer.pad_token="</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
print(len(tokenizer))
tokenizer

32002


LlamaTokenizer(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '<|im_end|>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32000: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	32001: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

# Preprocess Data

In [6]:
def preprocess_func(example):
    return tokenizer(example["text"], truncation=True, max_length=2048, add_special_tokens=False)

dataset_tokenized=dataset.map(preprocess_func, batched=True, num_proc=os.cpu_count(), remove_columns=["text"])

# Splitting Batch

In [7]:
def collate(example):
    """
    Transform a list of dictionaries [{input_ids:[123,...]}, {...}]
    to a single batch dictionary { input_ids: [...], labels: [...], attention_mask: [...]}
    """
    tokenlist=[e["input_ids"] for e in example]
    tokens_maxlen=max([len(t) for t in tokenlist])
    
    input_ids, labels, attention_masks=[],[],[]
    for tokens in tokenlist:
        pad_len=tokens_maxlen-len(tokens)
        # pad input_ids with pad_token, label with ignore_index (-100) and set attention_mask 1 where content, otherwise 0
        input_ids.append(tokens+[tokenizer.pad_token_id]*pad_len)
        labels.append(tokens+[-100]*pad_len)
        attention_masks.append([1]*len(tokens)+[0]*pad_len)
    batch={
        "input_ids":torch.tensor(input_ids),
        "labels":torch.tensor(labels),
        "attention_mask": torch.tensor(attention_masks)
    }
    return batch
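
To see what `collate` produces, here is a torch-free sketch of the same padding logic on a toy batch. The helper `pad_batch` is hypothetical, and the pad id of 2 is an assumption that matches the id of `</s>` in the tokenizer listing above.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def pad_batch(token_lists, pad_token_id=2):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(t) for t in token_lists)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for tokens in token_lists:
        pad_len = max_len - len(tokens)
        batch["input_ids"].append(tokens + [pad_token_id] * pad_len)
        batch["labels"].append(tokens + [IGNORE_INDEX] * pad_len)
        batch["attention_mask"].append([1] * len(tokens) + [0] * pad_len)
    return batch

batch = pad_batch([[5, 6, 7], [8]])
print(batch)
```

Padding only to the longest sequence in each batch, rather than to a fixed `max_length`, keeps short batches cheap.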

# Loading Model

In [8]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")

model.resize_token_embeddings(len(tokenizer))
model.config.eos_token_id=tokenizer.eos_token_id
model.gradient_checkpointing_enable()

print_trainable_parameters(model)

trainable params: 262426624 || all params: 3752087552 || trainable%: 6.99
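
The reported total of 3,752,087,552 is roughly half of the model's 7.24B parameters. This is consistent with bitsandbytes packing two 4-bit weights into each stored byte, so `numel()` on a quantized linear layer returns half of its logical size, while the embeddings, `lm_head`, and norm weights stay unquantized. Below is a worked check under that assumption, using the Mistral-7B layer shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers) and the resized vocabulary of 32002.

```python
hidden, kv_dim, mlp_dim, n_layers, vocab = 4096, 1024, 14336, 32, 32002

# Logical weight count of the quantized linear layers in one decoder block.
attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * mlp_dim                          # gate_proj, up_proj, down_proj
linear_total = n_layers * (attn + mlp)

# Unquantized parts: input embeddings, lm_head, two RMSNorms per block, final norm.
embeddings = 2 * vocab * hidden
norms = n_layers * 2 * hidden + hidden

trainable = embeddings + norms              # parameters that still require gradients
reported = trainable + linear_total // 2    # 4-bit weights counted as packed bytes

print(trainable, reported, f"{100 * trainable / reported:.2f}%")
```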


## Freeze Weights and add LoRA

In [9]:
from peft import prepare_model_for_kbit_training

prepared_model=prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)

print_trainable_parameters(prepared_model)
print(prepared_model)

trainable params: 0 || all params: 3752087552 || trainable%: 0.00
MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32002, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
    

In [10]:
from peft import LoraConfig, TaskType, get_peft_model

lora_config=LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj','gate_proj', 'o_proj', 'up_proj'],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["lm_head","embed_tokens"], # we added new tokens to the tokenizer, so this is necessary
    task_type=TaskType.CAUSAL_LM
)

lora_model=get_peft_model(prepared_model, lora_config)
lora_model.config.use_cache=False
print_trainable_parameters(lora_model)

trainable params: 429932544 || all params: 4182020096 || trainable%: 10.28
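
The trainable count above can be reproduced by hand. Each LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights, and `modules_to_save` adds full trainable copies of `embed_tokens` and `lm_head`. Below is a worked check, assuming the Mistral-7B layer shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers) and the resized vocabulary of 32002.

```python
r, lora_alpha = 64, 16
hidden, kv_dim, mlp_dim, n_layers, vocab = 4096, 1024, 14336, 32, 32002

# (in_features, out_features) of every targeted projection in one decoder block.
shapes = {
    "q_proj": (hidden, hidden), "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim), "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim), "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}

# LoRA adds matrix A (r x in) and matrix B (out x r) per targeted layer.
lora_params = n_layers * sum(r * (i + o) for i, o in shapes.values())
# modules_to_save keeps a trainable copy of embed_tokens and lm_head.
saved_modules = 2 * vocab * hidden

trainable = lora_params + saved_modules
print(lora_params, saved_modules, trainable)
print("LoRA scaling (lora_alpha / r):", lora_alpha / r)
```

With these settings the two full-size module copies account for about 61% of the trainable weights, so most of the adapter's size comes from supporting the new tokens rather than from the low-rank matrices.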


# Training

In [11]:
from transformers import TrainingArguments, Trainer

bs=2
ga_steps=4
epochs=3

steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)

args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,
    save_steps=steps_per_epoch,
    # increases effective batch size without consuming additional VRAM but makes training slower.
    # the effective batch size is batch_size* gradient_accumulation_steps
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    # using the default lr suggested by QLoRA: 0.0002 for 7B/13B models. For larger models, a lower lr is suggested,
    # for example, 0.0001 for models with 33B and 65B parameters.
    learning_rate=0.0002,
    group_by_length=True,
    fp16=True,
    ddp_find_unused_parameters=False, # needed for training with accelerate
    report_to='wandb',
    run_name=os.getenv('WANDB_NAME')
)

trainer=Trainer(
    model=lora_model,
    tokenizer=tokenizer,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    args=args
)

trainer.train()

2024-03-05 00:59:03.080205: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-05 00:59:03.080296: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-05 00:59:03.197692: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
wandb: Currently logged in as: urakiny (causal_language_trainer). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.3
wandb: Run data is saved locally in /kaggle/working/wandb/run-20240305_005914-c7c6sf8m
wandb: Run `wandb offline`

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 56   | 2.1567        | 1.792963        |
| 112  | 0.8783        | 2.077052        |
| 168  | 0.4812        | 2.194745        |




TrainOutput(global_step=168, training_loss=1.0743146112986974, metrics={'train_runtime': 3664.5794, 'train_samples_per_second': 0.368, 'train_steps_per_second': 0.046, 'total_flos': 2.842819675373568e+16, 'train_loss': 1.0743146112986974, 'epoch': 2.99})
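
The evaluation steps reported above (56, 112, 168) follow from the batch settings: one optimizer step consumes `bs * ga_steps` samples, so the 450 training rows yield 56 full steps per epoch. Here is a small sketch of that arithmetic, assuming a single training process.

```python
train_rows = 450   # size of the train split
bs, ga_steps, epochs = 2, 4, 3

effective_batch = bs * ga_steps                   # samples per optimizer step
steps_per_epoch = train_rows // effective_batch   # floor division drops the partial step
eval_points = [steps_per_epoch * e for e in range(1, epochs + 1)]

print(effective_batch, steps_per_epoch, eval_points)
```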

In [12]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': os.getenv('MODEL_NAME'),
#     'tasks': '',
#     'dataset_tags':'',
    'dataset': os.getenv("DATASET")
}
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

CommitInfo(commit_url='https://huggingface.co/aisuko/ft-mistral-7b-v01/commit/4d729d8908299eea771136fe9e66b82dd024044f', commit_message='End of training', commit_description='', oid='4d729d8908299eea771136fe9e66b82dd024044f', pr_url=None, pr_revision=None, pr_num=None)

# Reference List

* https://medium.com/@geronimo7/finetuning-llama2-mistral-945f9c200611
* https://github.com/geronimi73/qlora-minimal
* https://arxiv.org/pdf/2305.14314.pdf
* https://github.com/artidoro/qlora/blob/main/qlora.py
* https://github.com/artidoro/qlora/issues/152
* https://medium.com/@geronimo7/reproducing-guanaco-141a6a85a3f7