<a href="https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora?scriptVersionId=165066994" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

**Note: 1. This notebook has not been fully reviewed because of limited computing resources. It is still useful for showing fine-tuning with QLoRA and how to merge the adapter into the model. 2. All the pictures here are from the articles listed in the Credits section.**

**QLoRA (Quantized Low-Rank Adaptation)** is a method that quantizes a model to 4 bits and then trains it with LoRA. During fine-tuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the LoRA adapters. The LoRA layers are the only parameters updated during training.

**QLoRA has one storage data type (usually 4-bit NormalFloat) for the base model weights** and **a computation data type (16-bit BrainFloat) used to perform computations**. QLoRA dequantizes weights from the storage data type to the computation data type to perform the forward and backward passes, but only computes weight gradients for the LoRA parameters, which use 16-bit floats. The weights are decompressed only when they are needed, so memory usage stays low during training and inference.
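The storage/compute split described above can be illustrated with a toy example. The sketch below is not NF4 and not the bitsandbytes implementation: it assumes a plain uniform 4-bit grid with one absmax scale per block, and the helper names `quantize_block` and `dequantize_block` are made up for this illustration. It only shows the idea of storing small integer codes and expanding them back to floats when they are needed.

```python
# Toy sketch: uniform 4-bit absmax quantization of one block of weights.
# Assumption: real QLoRA uses the NF4 code book and CUDA kernels instead.

def quantize_block(weights, levels=16):
    """Store each weight as an integer code in [0, levels - 1] plus one scale."""
    absmax = max(abs(w) for w in weights) or 1.0
    half = (levels - 1) / 2  # 7.5 when there are 16 levels
    codes = [round((w / absmax) * half + half) for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    """Expand the stored codes back to floats for computation."""
    half = (levels - 1) / 2
    return [((c - half) / half) * absmax for c in codes]

block = [0.12, -0.40, 0.03, 0.33, -0.07, 0.25, -0.18, 0.40]
codes, scale = quantize_block(block)        # what would be kept in memory
restored = dequantize_block(codes, scale)   # what the matmul would see
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_error, 4))
```

The rounding error per weight is bounded by the block scale divided by 15 in this toy scheme, which is why a small block size (and therefore many scales) helps accuracy.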

**LoRA** is a technique that accelerates the fine-tuning of large models while consuming less memory. To make fine-tuning more efficient, LoRA represents the weight updates with two smaller matrices (called update matrices) obtained through low-rank decomposition. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, the original and the adapted weights are combined.
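As a rough sketch of the idea above, the forward pass of a LoRA layer adds a scaled low-rank term to the frozen output. The code uses plain Python lists and tiny made-up matrices; `lora_forward` and `matvec` are illustrative names, and initializing `B` to zeros follows the usual LoRA convention so that training starts from the unmodified base model.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, d_out=2, d_in=3
A = [[0.5, -0.5, 1.0]]                    # r x d_in with r=1
B_zero = [[0.0], [0.0]]                   # d_out x r, zero at initialization
x = [1.0, 2.0, 3.0]

out_init = lora_forward(W, A, B_zero, x, alpha=2, r=1)
B_trained = [[0.1], [-0.2]]               # pretend values after some training
out_adapted = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(out_init, out_adapted)
```

With `B` at zero the layer reproduces the base output exactly; once `B` has non-zero values the output shifts, while `W` itself is never modified.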

This approach has a number of advantages:

* LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
* The original pre-trained weights are kept frozen, which means we can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
* LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
* Performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.
* LoRA does not add any inference latency because adapter weights can be merged with the base model.

In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. However, for simplicity and further parameter efficiency, in Transformer models LoRA is typically applied to attention blocks only. The resulting number of trainable parameters in a LoRA model depends on the size of the low-rank update matrices, which is determined mainly by the rank r and the shape of the original weight matrix.
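To make the dependence on the rank concrete: for a weight matrix of shape `d_out x d_in`, the two update matrices have shapes `d_out x r` and `r x d_in`, so they hold `r * (d_in + d_out)` parameters. The helper name `lora_param_count` below is only for illustration, and the 4096 x 4096 shape is chosen as an example of an attention projection.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank update matrices for one weight matrix."""
    return r * (d_in + d_out)

d_in = d_out = 4096
full = d_in * d_out  # parameters in the original matrix

for r in (4, 8, 16, 64):
    added = lora_param_count(d_in, d_out, r)
    print(f"r={r:>2}: {added:>7} trainable params, {100 * added / full:.3f}% of the full matrix")
```

The count grows linearly with `r`, so doubling the rank doubles the number of trainable parameters for the same target matrices.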

![](https://files.mastodon.social/media_attachments/files/111/702/004/494/881/797/original/a26697e010f0096b.webp)

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install trl==0.7.7

In [2]:
import os, torch
from trl import SFTTrainer

from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning Llama2-with-alpaca-gpt4-lora"
os.environ["WANDB_NOTES"] = "Fine-tune Llama-2-7b with QLoRA on alpaca-gpt4"
os.environ["WANDB_NAME"] = "ft-Llama2-with-alpaca-gpt4-lora"



Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory meta-llama/Llama-2-7b-hf --library_name transformers

Loading pretrained config for `meta-llama/Llama-2-7b-hf` from `transformers`...
config.json: 100%|█████████████████████████████| 609/609 [00:00<00:00, 2.49MB/s]
┌──────────────────────────────────────────────────────────┐
│   Memory Usage for loading `meta-llama/Llama-2-7b-hf`    │
├───────┬─────────────┬──────────┬─────────────────────────┤
│ dtype │Largest Layer│Total Size│   Training using Adam   │
├───────┼─────────────┼──────────┼─────────────────────────┤
│float32│  776.03 MB  │ 24.74 GB │         98.96 GB        │
│float16│  388.02 MB  │ 12.37 GB │         49.48 GB        │
│  int8 │  194.01 MB  │ 6.18 GB  │         24.74 GB        │
│  int4 │   97.0 MB   │ 3.09 GB  │         12.37 GB        │
└───────┴─────────────┴──────────┴─────────────────────────┘
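The figures in this table appear to follow a simple rule of thumb: the model size scales with the bytes per parameter of each dtype, and the "Training using Adam" column is about four times the model size. The sketch below starts from the float32 size printed above and derives the other rows; it is a reading of the table, not the code that `accelerate` runs.

```python
# Starting point taken from the table above (float32 row, in GB).
fp32_size_gb = 24.74

# Bytes per parameter for each dtype; int4 packs two values per byte.
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

estimates = {}
for dtype, nbytes in bytes_per_param.items():
    size_gb = fp32_size_gb * nbytes / 4
    adam_gb = 4 * size_gb  # assumed: weights + gradients + two Adam states
    estimates[dtype] = (size_gb, adam_gb)
    print(f"{dtype:>7}: model ~{size_gb:.2f} GB, Adam training ~{adam_gb:.2f} GB")
```

These numbers describe full fine-tuning. With QLoRA, gradients and optimizer states exist only for the small adapter, which is why the 4-bit model can be trained on a single modest GPU.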


In [4]:
model_name="meta-llama/Llama-2-7b-hf"

dataset_name="vicgalle/alpaca-gpt4"

# Preparing the dataset

In [5]:
from datasets import load_dataset

dataset=load_dataset(dataset_name,split="train[:100]")
dataset["text"][0]

Downloading readme:   0%|          | 0.00/3.38k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/48.4M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'
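The `text` field shown above follows an Alpaca-style template: a fixed preamble, an `### Instruction:` section, and a `### Response:` section. A small helper like the one below (the name `format_alpaca` is hypothetical and not part of the dataset or any library) can rebuild the same layout, which is handy for constructing inference prompts that match the training data. It covers only examples without an extra input field, like the one printed above.

```python
PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)

def format_alpaca(instruction, response=""):
    """Build a prompt in the same layout as the dataset's `text` field."""
    return f"{PREAMBLE}### Instruction:\n{instruction}\n\n### Response:\n{response}"

example = format_alpaca("Give three tips for staying healthy.", "1. Eat a balanced diet.")
print(example)
```

Leaving `response` empty gives a prompt that ends right after `### Response:\n`, which is the point where the fine-tuned model is expected to continue.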

# Loading the model and tokenizer

We are going to load the model with 4-bit quantization and a 16-bit compute data type. The code below uses `torch.float16`; `torch.bfloat16` is an alternative on GPUs that support it.

## Quantize a model

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the [BitsAndBytesConfig](https://huggingface.co/docs/transformers/v4.36.0/en/main_classes/quantization#transformers.BitsAndBytesConfig) class.

* set `load_in_4bit=True` to quantize the model to 4-bits when we load it
* set `bnb_4bit_quant_type="nf4"` to use a special 4-bit data type for weights initialized from a normal distribution
* set `bnb_4bit_use_double_quant=True` to use a nested quantization scheme to quantize the already quantized weights
* set `bnb_4bit_compute_dtype` to a 16-bit type (`torch.bfloat16`, or `torch.float16` as in the code below) for faster computation

Matrix multiplication and training are faster with a 16-bit compute dtype. These parameters are set through the `BitsAndBytesConfig` class from transformers. The example below loads a model in 4-bit using NF4 quantization with double quantization and a 16-bit compute dtype for faster training.
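To give a feel for what the nested (double) quantization option saves, the sketch below computes the per-parameter overhead of the quantization constants. The block sizes are assumptions taken from the description in the QLoRA paper (one constant per 64 weights, and a second-level block of 256 constants), not values read from this notebook's configuration, and the function name is illustrative.

```python
def constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Extra bits per weight spent on quantization constants.

    Assumed scheme: one 32-bit constant per block without double quantization;
    with it, constants are stored in 8 bits plus one 32-bit constant per
    second-level block.
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block_size)

plain = constant_overhead_bits(double_quant=False)
nested = constant_overhead_bits(double_quant=True)
print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {nested:.3f} bits/param")
print(f"saved:                {plain - nested:.3f} bits/param")
```

Under these assumptions the saving is a fraction of a bit per parameter, which still adds up to a few hundred megabytes for a model with billions of weights.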

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, TextStreamer
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100* trainable_params/all_params:.2f}")

bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model=AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16
)

model.config.use_cache=False
model.config.pretraining_tp=1

print_trainable_parameters(model)

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

trainable params: 262410240 || all params: 3500412928 || trainable%: 7.50
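The totals printed above can look surprising for a 7B model. One plausible reading, assuming that bitsandbytes packs two 4-bit values into each stored byte so that `numel()` sees only half of the quantized weights, is sketched below. The layer sizes (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000) are those of the Llama-2-7b architecture used in this notebook; the variable names are illustrative.

```python
hidden, inter, vocab, layers = 4096, 11008, 32000, 32

# Linear layers that get quantized: q, k, v, o projections and the three MLP projections.
linear_per_layer = 4 * hidden * hidden + 3 * hidden * inter
quantized = layers * linear_per_layer

# Left unquantized: input embeddings, lm_head, and all RMSNorm weights.
unquantized = 2 * vocab * hidden + (2 * layers + 1) * hidden

# Assumption: packed 4-bit storage halves the element count that numel() reports.
reported_all = quantized // 2 + unquantized

print("unquantized (still trainable before freezing):", unquantized)
print("all params as counted on the packed model:", reported_all)
print("true parameter count:", quantized + unquantized)
```

Under this reading, the "trainable" figure is simply the set of unquantized tensors that still have `requires_grad=True` before the model is prepared for k-bit training.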


# Freeze Original Weights

In [7]:
from peft import prepare_model_for_kbit_training

#gradient checkpointing to save memory
model.gradient_checkpointing_enable()

prepared_model=prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)
prepared_model.get_memory_footprint()
print(prepared_model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRM

In [8]:
# Load LLaMA tokenizer
tokenizer=AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
tokenizer.pad_token=tokenizer.eos_token
tokenizer.add_eos_token=True
tokenizer.add_bos_token, tokenizer.add_eos_token

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

(True, True)

# LoRA config

* **task_type**: the task the adapter is trained for; causal language modeling (`CAUSAL_LM`) here
* **r**: the rank (dimension) of the low-rank matrices
* **lora_alpha**: scaling factor for the low-rank update
* **lora_dropout**: dropout probability of the LoRA layers
* **bias**: which bias parameters to train: `'none'`, `'lora_only'`, or `'all'`

The low-rank update is scaled by `lora_alpha/r`, so a higher `lora_alpha` value assigns more weight to the LoRA activations. For performance, try setting `bias` to `'none'` first, then `'lora_only'`, before trying `'all'`.

In [9]:
peft_config=LoraConfig(
    lora_alpha=2,
    lora_dropout=0.1,
    r=4,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj']
)

peft_model=get_peft_model(prepared_model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 8,060,928 || all params: 6,746,476,544 || trainable%: 0.1194835251768423
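The trainable-parameter count printed above can be reproduced by hand from the `LoraConfig` and the layer shapes in the model summary. The sketch below assumes each targeted linear layer gains `r * (d_in + d_out)` adapter parameters; the dictionary of shapes is written out manually for illustration.

```python
hidden, inter, layers, r = 4096, 11008, 32, 4

# (d_in, d_out) of every module listed in target_modules.
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable = layers * per_layer
all_params = 6_746_476_544  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Adding `down_proj` to `target_modules`, or raising `r`, would increase this number in the same linear fashion.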


In [10]:
peft_model.get_memory_footprint()

4387004416

# Training arguments

In [11]:
training_arguments=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    num_train_epochs=0.5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=6,
    optim="paged_adamw_8bit",
#     save_steps=1000,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.3,
    group_by_length=True,
    lr_scheduler_type="linear",
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)

trainer=SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False
)

trainer.train()



Map:   0%|          | 0/100 [00:00<?, ? examples/s]

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240301_235009-65n2t2my[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-Llama2-with-alpaca-gpt4-lora[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tuning%20Llama2-with-alpaca-gpt4-lora[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tuning%20Llama2-with-alpaca-gpt4-lora/runs/65n2t2my[0m
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster t

Step,Training Loss
1,1.5625
2,2.2536
3,1.2726
4,2.0534


TrainOutput(global_step=4, training_loss=1.7854987382888794, metrics={'train_runtime': 86.9963, 'train_samples_per_second': 0.575, 'train_steps_per_second': 0.046, 'total_flos': 396209808949248.0, 'train_loss': 1.7854987382888794, 'epoch': 0.48})
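The run stops after only 4 optimizer steps, which follows from the small dataset slice and the training arguments. The arithmetic below is a sketch under two assumptions: a single training process, and step counting that floors the number of batches per epoch by the accumulation steps and then rounds the fractional epoch up. The variable names are illustrative.

```python
import math

num_samples = 100        # train[:100]
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 6           # gradient_accumulation_steps
epochs = 0.5             # num_train_epochs
warmup_ratio = 0.3

batches_per_epoch = math.ceil(num_samples / per_device_batch)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = math.ceil(epochs * updates_per_epoch)
warmup_steps = math.ceil(total_steps * warmup_ratio)
epoch_reached = total_steps * grad_accum / batches_per_epoch

print(f"effective batch size: {per_device_batch * grad_accum}")
print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
print(f"epoch reached: {epoch_reached:.2f}")
```

With so few steps, half of them spent in warmup, the loss values above are too noisy to say much about convergence; a longer run on more data would be needed for a meaningful result.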

# Save the adapter and the model

In [12]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': model_name,
    'tasks': 'Text Conversation',
#     'dataset_tags':'',
    'dataset':dataset_name
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

training_args.bin:   0%|          | 0.00/4.35k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/32.3M [00:00<?, ?B/s]

'https://huggingface.co/aisuko/ft-Llama2-with-alpaca-gpt4-lora/tree/main/'

# Inferencing

In [13]:
peft_model.config.use_cache=True
peft_model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=4, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_

In [14]:
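system_prompt='Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
user_prompt='what is newtons 2nd law and its formula'
B_INST,E_INST="### Instruction:\n", "### Response:\n"

# Same Alpaca-style layout as the training examples
prompt=f'{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}'

# Tokenize and move the tensors to the device that holds the model
inputs=tokenizer([prompt],return_tensors="pt").to(peft_model.device)

# Print generated tokens as they are produced, without echoing the prompt
streamer=TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True,)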

In [15]:
peft_model.generate(**inputs,streamer=streamer, max_new_tokens=500)



сайт для магазинов

Проект для реализации на 3-х платформах:

- [Уралмаш](https://www.uralmash.com/ru/catalog/products/7380/pump-for-oil-and-gas-industry-120-mm-diameter-with-a-flange-on-the-inlet-side)
- [Газпром нефть](https://www.gazprom-neft.ru/products/oil-and-gas-equipment/pumps/pumps-for-oil-and-gas-industry/pump-for-oil-and-gas-industry-120-mm-diameter-with-a-flange-on-the-inlet-side)
- [Газпром нефть Канада](https://www.gazprom-neft.ca/products/oil-and-gas-equipment/pumps/pumps-for-oil-and-gas-industry/pump-for-oil-and-gas-industry-120-mm-diameter-with-a-flange-on-the-inlet-side)

### Задача

1. Сделать простой проект для магазинов, который будет работать в 3-х странах и будет запущен в 3-х странах.
2. Сделать простой проект, который будет работать в 3-х странах и будет запущен в 3-х странах.

### Технологии

- React
- TypeScript
- GraphQL
- Redux
- Apollo
- Sass
- PostgreSQL
- Docker
- Heroku

### Состав команды

- [Виталий Дудкин](https://github.com/vitdudkin) - (lead)
- [Ан

tensor([[    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889,
         14350,   263,  2933,   393,  7128,  2486,  1614,  2167,  1468,  2009,
         29889,    13,    13,  2277, 29937,  2799,  4080, 29901,    13,  5816,
           338,   716,  7453, 29871, 29906,  5499,  4307,   322,   967,  7063,
            13,    13,  2277, 29937, 13291, 29901,    13,     2,     1, 23784,
          3807,  2394,  1779,  1916,  2835,    13,    13, 30013,   576,  3506,
         29932,  3807,  1909, 12172,  1902,  3540,   665, 29871, 29941, 29899,
         29988,  8433, 29932, 12446, 29988, 29901,    13,    13, 29899,   518,
         30053, 12454,  1155, 30002,   850,   991,   597,  1636, 29889,  3631,
         29885,  1161, 29889,   510, 29914,   582, 29914, 28045, 29914, 14456,
         29914, 29955, 29941, 29947, 29900, 29914, 29886,  3427, 29899,  1454,
         29899, 29877,   309, 29899,   392, 29899, 25496, 29899, 20041,   719,
         29899, 29896, 29906, 29900, 29899,  4317, 2
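The generated text is unrelated to the prompt, and the token tensor gives a hint why. The prompt tokens end with id 2 followed by id 1 before the unrelated text begins. For the Llama 2 tokenizer these are the EOS and BOS ids, which suggests that `tokenizer.add_eos_token=True` (set earlier for training) appended an end-of-sequence token to the prompt, so the model started a new document instead of answering. The very short training run may also contribute. The helper below is a hypothetical guard, not part of the original notebook, showing how trailing EOS ids could be stripped before generation.

```python
BOS_ID, EOS_ID = 1, 2  # special token ids of the Llama 2 tokenizer

def strip_trailing_eos(input_ids, eos_id=EOS_ID):
    """Return a copy of input_ids without any EOS ids at the end."""
    ids = list(input_ids)
    while ids and ids[-1] == eos_id:
        ids.pop()
    return ids

# Tail of the prompt as it appears in the tensor above: "### Response:\n" then EOS.
prompt_tail = [2277, 29937, 13291, 29901, 13, 2]
cleaned = strip_trailing_eos(prompt_tail)
print(cleaned)
```

In the notebook itself, a simpler option would be to set `tokenizer.add_eos_token=False` before tokenizing the inference prompt, so that no EOS id is appended in the first place.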

# Merging the adapter with model

In [16]:
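# Drop every reference to the 4-bit model so its GPU memory can actually be freed
del model, prepared_model, peft_model, trainer
torch.cuda.empty_cache()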

In [17]:
base_model=AutoModelForCausalLM.from_pretrained(
    model_name, low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

model=PeftModel.from_pretrained(base_model, os.getenv("WANDB_NAME"))
model=model.merge_and_unload()

# Reload tokenizer
tokenizer=AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token=tokenizer.eos_token
tokenizer.padding_side="right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
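Conceptually, merging folds the scaled low-rank update into the base weight, so the merged layer gives the same output as the base layer plus the adapter path. The toy example below illustrates this with plain Python lists and made-up values; it is a sketch of the idea, not of how `merge_and_unload()` is implemented in PEFT, and the helper names are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, 2 x 3
A = [[0.5, -0.5, 1.0]]                    # 1 x 3 (r=1)
B = [[0.1], [-0.2]]                       # 2 x 1
x = [[1.0], [2.0], [3.0]]                 # column vector

merged = merge_lora(W, A, B, alpha=2, r=1)
out_merged = matmul(merged, x)

# Separate paths: base output plus the scaled adapter output.
base, adapter = matmul(W, x), matmul(B, matmul(A, x))
out_separate = [[b[0] + 2.0 * a[0]] for b, a in zip(base, adapter)]
print(out_merged, out_separate)
```

Because the merged matrix has the same shape as the original, inference after merging costs the same as running the base model alone.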

Only the adapter was uploaded earlier because it is much smaller than the merged model checkpoint, so the merged model is not uploaded here.

In [18]:
# tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
# model.push_to_hub(os.getenv("WANDB_NAME"))

# Credits
* https://gathnex.medium.com/fine-tuning-llama-2-llm-on-google-colab-a-step-by-step-guide-dd79a788ac16
* https://huggingface.co/docs/peft/conceptual_guides/lora
* https://huggingface.co/docs/peft/developer_guides/quantization
* https://huggingface.co/blog/4bit-transformers-bitsandbytes