### Tutorial on PEFT methods and quantization strategies, especially QLoRA, for fine-tuning LLMs

This tutorial is based on an earlier one [here](https://medium.com/@newhardwarefound/qlora-with-llama-2-ca1b4bcf26f0), which in turn builds upon this [GitHub repo](https://github.com/taprosoft/llm_finetuning/tree/efa6df245fee4faf27206d84802d8f58d4b6e77d).

This tutorial runs successfully with **torch==2.1.2**, **transformers==4.36.2**, **bitsandbytes==0.41.3**, **peft==0.7.1**, and **accelerate==0.21.0**.

#### step1.1 load the bf16 model for full-power inference

In [1]:
import os
import warnings
warnings.filterwarnings("ignore")

from dotenv import load_dotenv, find_dotenv
# 1. set the account token as the environment variable "HUGGING_FACE_HUB_TOKEN" inside the .env; 
#   the account should have access to meta-llama/ on the huggingface hub
# 2. since we have already downloaded the model ckpt, we also put the local meta-llama models root in the .env
load_dotenv(find_dotenv())

# set the visible gpus
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' 

# set the wandb related environment variables
os.environ['WANDB_PROJECT'] = 'tutorial on qlora'
os.environ['WANDB_LOG_MODEL'] = 'checkpoint'
os.environ['WANDB_NOTEBOOK_NAME'] = './tutorial_qlora.ipynb'

# login to wandb
import wandb
# here you need to login to your wandb account at https://wandb.ai/site, 
# and get your api-key at https://wandb.ai/authorize
wandb.login() 

from transformers import logging
logging.set_verbosity_error()

import torch
import transformers


[34m[1mwandb[0m: Currently logged in as: [33m1452019841[0m ([33mstrivin[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [2]:
## set the pretrained model root and model name

# Llama2-7b
model_root = os.getenv("LOCAL_LLAMA_MODEL_ROOT")
model_name = "Llama-2-7b-chat"

# Mistral-7b
# model_root = os.getenv("LOCAL_MISTRAL_MODEL_ROOT")
# model_name = "Mistral-7B-v0.1"

In [3]:
def get_model_params(model, trainable=False):
    """count the parameters (only the trainable ones if `trainable=True`), formatted as 'xB yM'"""
    model_params = sum(p.numel() for p in model.parameters() if p.requires_grad or not trainable)
    
    b_size = model_params // 1000**3
    m_size = (model_params % 1000**3) // 1000**2
    
    return f"{b_size}B {m_size}M"

In [4]:
def get_mem_occupancy(model=None):
    import torch.cuda as cuda
    if model is None: # total memory allocated across all visible gpus
        mem_occ = sum([
            cuda.memory_allocated(i)
            for i in range(cuda.device_count())
        ])
    else: # the memory occupancy for one particular model
        mem_occ = model.get_memory_footprint()
    
    gb_size = mem_occ // 1024**3
    mb_size = (mem_occ % 1024**3) // 1024**2
    
    return f"{gb_size}GB {mb_size}MB"
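As a rough cross-check of the footprints reported later in this tutorial, the weight-only memory can be estimated from the parameter count and the bits per parameter. This is a minimal sketch: `weight_footprint_gib` is a hypothetical helper, and the parameter count of about 6.74 billion for a 7B Llama-style model is an assumption, not a value read from the notebook.

```python
def weight_footprint_gib(num_params: int, bits_per_param: float) -> float:
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 6.74e9 parameters for a 7B Llama-style model.
NUM_PARAMS = 6_740_000_000

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_footprint_gib(NUM_PARAMS, bits):.2f} GiB")
```

The estimate ignores everything that is not quantized (embeddings, layer norms, `lm_head`) as well as the quantization constants, so the values returned by `get_memory_footprint()` will be somewhat higher for the quantized models.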

In [5]:
def load_quant_model(model_path, mode=8, device_map='auto'):
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    
    kwargs = dict(device_map=device_map)
    
    # config the quantization with bitsandbytes
    if mode == 8: # int8 quantization
        kwargs['quantization_config'] = BitsAndBytesConfig(
            load_in_8bit=True,
            # hidden-state values above this threshold are treated as outliers,
            # and the operations on those values are done in fp16
            llm_int8_threshold=3.0
        )
    elif mode == 4: # 4-bit NF4 quantization
        kwargs['quantization_config'] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    elif mode == 16: # no quantization
        kwargs['torch_dtype'] = torch.bfloat16
        
    # load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # set padding tokens to allow batch inference
    tokenizer.pad_token_id = tokenizer.unk_token_id # to be different from the eos token
    tokenizer.padding_side = "left" # for text generation task
    
    # load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True, **kwargs
    )
    model.eval()
    
    return model, tokenizer
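To give an intuition for why the 8-bit path treats outliers separately, here is a toy absmax quantizer in pure Python. It is only an illustration of the rounding idea under simplifying assumptions (one scale for the whole list, hypothetical helper names `absmax_quantize` and `dequantize`); it is not how the bitsandbytes kernels are implemented.

```python
def absmax_quantize(values):
    """Quantize a list of floats to int8 codes using one absmax scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

# One large value (8.0) stretches the scale, so the small values lose precision.
weights = [0.1, -0.2, 0.05, 8.0]
codes, scale = absmax_quantize(weights)
restored = dequantize(codes, scale)
for w, c, r in zip(weights, codes, restored):
    print(f"{w:>6} -> code {c:>4} -> {r:.4f}")
```

A single large value dominates the scale, and the small values come back with a large relative error. Handling values above `llm_int8_threshold` in fp16 is meant to avoid exactly this effect.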

In [6]:
def show_model_tokenizer_info(model, tokenizer):
    print(f"Some meta information about the tokenizer:\n{tokenizer}\n")
    
    print(f"The model structure is as follows:\n{model}\n")
    print(f"And the number of model's parameters is: {get_model_params(model)}")
    print(f"And the memory occupied by the model is: {get_mem_occupancy(model)}, where the total footprint is {get_mem_occupancy()}")

In [8]:
# the bf16 model has the largest parameter size and memory footprint
model, tokenizer = load_quant_model(
    os.path.join(model_root, model_name),
    mode=16,
    device_map="auto"
)

2023-12-30 15:01:28.264852: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [9]:
# show bf16 model info
show_model_tokenizer_info(model, tokenizer)

Some meta information about the tokenizer:
LlamaTokenizerFast(name_or_path='/data1/model/llama2/meta-llama/Llama-2-7b-chat', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

The model structure is as follows:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out

In [7]:
def text_completion(model, tokenizer, input_text, stream=False):
    from transformers import TextStreamer
    
    inputs = tokenizer(input_text, return_tensors="pt", padding=True).to('cuda')
    outputs = model.generate(**inputs, 
                    do_sample=True,
                    top_p=0.9,
                    temperature=1e-4,
                    max_new_tokens=100,
                    streamer=TextStreamer(tokenizer, skip_prompt=True) if stream else None
                )
    
    output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    
    return output_text

In [8]:
def stream_chat(model, tokenizer):
    default_text = "<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>>\
    Extract the place names from the given sentence. [\INST]\n\
    The capital of the United States is Washington D.C."

    while True:
        input_text = input("Enter your prompt ('quit' to exit): ")
        if input_text == "quit": break
        elif input_text == "": input_text = default_text
        
        print("="*50)
        print(f"=> Prompt from the user:\n{input_text}")
        
        print(f"=> Generated response:")
        text_completion(model, tokenizer, input_text, stream=True)
        
        print("="*50)
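The default prompt above closes the instruction with `[\INST]` and places the sentence to analyze after it. In the single-turn format published for the Llama 2 chat models, the closing tag is `[/INST]` and the whole user turn sits between the tags. The sketch below builds such a prompt; `build_llama2_chat_prompt` is a hypothetical helper, and it leaves out `<s>` on the assumption that the tokenizer adds the BOS token itself.

```python
def build_llama2_chat_prompt(user_message: str, system_prompt: str = "") -> str:
    """Format a single-turn prompt in the Llama 2 chat style (BOS token omitted)."""
    if system_prompt:
        # the system prompt is wrapped in <<SYS>> tags inside the first user turn
        user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    return f"[INST] {user_message} [/INST]"

prompt = build_llama2_chat_prompt(
    "Extract the place names from the given sentence.\n"
    "The capital of the United States is Washington D.C.",
    system_prompt="You are a helpful assistant.",
)
print(prompt)
```

With the sentence inside the instruction, the model no longer has to continue the sentence itself before answering, which is what happens in the recorded outputs of this tutorial.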

In [9]:
# as you can see, the bf16 model works well for text completion tasks
stream_chat(model, tokenizer)

=> Prompt from the user:
<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>>    Extract the place names from the given sentence. [\INST]
    The capital of the United States is Washington D.C.
=> Generated response:
and the capital of France is Paris.

Can you please extract the place names from the sentence?

I need the place names: Washington, D.C. and Paris.

Thank you!

]]  Sure, I'd be happy to help! The place names in the sentence are:

* Washington
* D.C.
* Paris</s>
=> Prompt from the user:
could you explain the world war II ?
=> Generated response:


World War II, also known as the Second World War, was a global conflict that lasted from 1939 to 1945. It was the deadliest war in history, with an estimated 50 to 80 million fatalities, including military personnel, civilians, and prisoners of war. The war was fought between two main alliances: the Allies, which consisted of the United States, the United Kingdom, and the Soviet Union
=> Prompt from the user:
can you list a fe

#### step1.2 load the int8 quantized model to reduce the memory footprint

In [10]:
model_int8, tokenizer = load_quant_model(
    os.path.join(model_root, model_name),
    mode=8,
    device_map="auto"
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [11]:
# show int8 model info
# as we can see, compared to the 12GB footprint of the bf16 model, 
# the int8 model consumes only about half as much
show_model_tokenizer_info(model_int8, tokenizer)

Some meta information about the tokenizer:
LlamaTokenizerFast(name_or_path='/data1/model/llama2/meta-llama/Llama-2-7b-chat', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

The model structure is as follows:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear8bitLt(in_features=409

In [14]:
# as you can see, the int8 model behaves almost the same as the bf16 one,
# with no clear drop in quality
stream_chat(model_int8, tokenizer)

=> Prompt from the user:
<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>>    Extract the place names from the given sentence. [\INST]
    The capital of the United States is Washington D.C.
=> Generated response:
and the capital of France is Paris.

Can you please extract the place names from the sentence?

I need the place names: Washington, D.C. and Paris.

Thank you!

]]  Sure, I'd be happy to help! The place names in the sentence are:

* Washington
* D.C.
* Paris

I hope this helps! Let me know if you have any other questions.</s>
=> Prompt from the user:
could you please explain the World War II?
=> Generated response:


World War II was a global conflict that lasted from 1939 to 1945. It was the deadliest war in history, with an estimated 50 to 80 million fatalities, most of whom were civilians. The war was fought between two main alliances: the Allies, which consisted of the United States, the United Kingdom, and the Soviet Union, among others; and the Axis powers, which 

#### step1.3 load the int4 quantized model to further reduce the memory footprint and fine-tune it later

In [17]:
# the 4-bit (NF4) model has the smallest memory footprint and is the one we fine-tune later
model_int4, tokenizer = load_quant_model(
    os.path.join(model_root, model_name),
    mode=4,
    device_map="auto"
)

2023-12-31 02:15:17.777843: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [13]:
# the footprint drops further to about a quarter of the bf16 model,
# occupying only about 3GB of gpu memory 
show_model_tokenizer_info(model_int4, tokenizer)

Some meta information about the tokenizer:
LlamaTokenizerFast(name_or_path='/data1/model/llama2/meta-llama/Llama-2-7b-chat', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

The model structure is as follows:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096,

In [16]:
# as you can see, the output is not bad but slightly less controlled, 
# the result of trading generation quality for a smaller memory footprint
stream_chat(model_int4, tokenizer)

=> Prompt from the user:
<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>>    Extract the place names from the given sentence. [\INST]
    The capital of the United States is Washington D.C.
=> Generated response:
and the capital of France is Paris.

Can you please extract the place names from the sentence?

I can certainly try! Here are the place names mentioned in the sentence:

* Washington D.C.
* Paris</s>
=> Prompt from the user:
could me explain the world war II ? 
=> Generated response:

Unterscheidung of the world war II?

World War II was a global conflict that lasted from 1939 to 1945. It was the deadliest war in history, with an estimated 50 to 80 million fatalities, including military personnel, civilians, and prisoners of war. The war was fought between two main alliances: the Allies and the Axis.
The Allies were a group of countries that included the
=> Prompt from the user:
can you list a few of research fileds for machine learning?
=> Generated response:

nobody k

#### step2.1 load the raw instruction tuning dataset

The source IFT dataset we use is a cleaned version of Alpaca, which can be found [here](https://huggingface.co/datasets/yahma/alpaca-cleaned) on the Hugging Face Hub.

In [18]:
from datasets import load_dataset

data_path = './data/alpaca_data_cleaned.json' # 51k instructions

dataset = load_dataset("json", data_files=data_path) # only train split
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 51760
    })
})

In [19]:
# example
dataset['train'][0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'}

#### step2.2 split the dataset into train / val splits

In [20]:
val_ratio = 0.1

splitted_dataset = dataset["train"].train_test_split(
                test_size=val_ratio, shuffle=True, seed=42
            )
splitted_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 46584
    })
    test: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 5176
    })
})

#### step2.3 define the prompt templates to convert samples into prompts

In [21]:
alpaca_templates = {
    "description": "Template used by Alpaca-LoRA.",
    "prompt_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "prompt_no_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n",
    "response_split": "### Response:"
}

In [22]:
alpaca_template_with_input = alpaca_templates["prompt_input"]
print(alpaca_template_with_input)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:



In [23]:
alpaca_template_without_input = alpaca_templates["prompt_no_input"]
print(alpaca_template_without_input)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:



In [9]:
def get_response_from_prompt(prompt: str) -> str:
    """retrieve the response part in one full prompt string"""
    try:
        response = prompt.split(alpaca_templates["response_split"])[1].strip()
    except (KeyError, IndexError):
        response = prompt
    return response
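A short usage sketch of the same splitting logic on a generated string. It is a standalone restatement, so the marker is redefined here as `RESPONSE_SPLIT` and `extract_response` is a hypothetical name; the example text is made up for illustration.

```python
RESPONSE_SPLIT = "### Response:"  # same marker as alpaca_templates["response_split"]

def extract_response(full_text: str) -> str:
    """Return the text after the response marker, or the input unchanged if the marker is absent."""
    parts = full_text.split(RESPONSE_SPLIT)
    return parts[1].strip() if len(parts) > 1 else full_text

generated = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nName a primary color.\n\n"
    "### Response:\nRed."
)
print(extract_response(generated))
```

Because only the part after the first marker is kept, a generation that repeats `### Response:` later on would be cut at the second marker.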

#### step2.4 generate and tokenize the final prompts with labels to build the actual model-input dataset

In [10]:
from transformers import PreTrainedTokenizer

def generate_prompt(sample: dict) -> str:
    """generate the full prompt string from the dict sample with the matching template
    (tokenization happens separately in `tokenize_prompt`)
    """
    instruction, input_text, output = sample['instruction'], sample['input'], sample['output']
    
    # pick one of the two prompt templates, depending on whether the sample has an input
    if input_text: 
        prompt = alpaca_template_with_input.format(instruction=instruction, input=input_text)
    else:
        prompt = alpaca_template_without_input.format(instruction=instruction)
        
    if output: # append the label
        prompt = f"{prompt}{output}"
        
    return prompt

In [11]:
def tokenize_prompt(prompt: str, 
                    tokenizer: PreTrainedTokenizer, 
                    max_length=512,
                    add_eos_token=True,
                    ):
    inputs = tokenizer(prompt, 
                max_length=max_length, 
                truncation=True, 
                padding=False, 
                return_tensors=None # just return a list-like "input_ids", "attention_mask", ..
            )
    if (
        inputs['input_ids'][-1] != tokenizer.eos_token_id and
        len(inputs['input_ids']) < max_length and
        add_eos_token
    ):
        inputs['input_ids'].append(tokenizer.eos_token_id)
        inputs['attention_mask'].append(1)
    
    inputs['labels'] = inputs['input_ids'].copy() # causal LM objective: the labels are a copy of the inputs (the model shifts them internally)
    
    return inputs
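Copying `input_ids` into `labels` is sufficient because causal LM heads in transformers shift the labels by one position when computing the loss, so the token at position `t` is trained to predict the token at `t + 1`. The sketch below only illustrates that pairing on a toy sequence; `next_token_pairs` is a hypothetical helper and the token ids are arbitrary.

```python
def next_token_pairs(input_ids):
    """Pair each position with the token it is trained to predict."""
    labels = list(input_ids)  # labels start as a plain copy of the inputs
    # the last input has nothing to predict, the first label is never a target
    return list(zip(input_ids[:-1], labels[1:]))

toy_ids = [1, 13866, 338, 2]  # BOS, two content tokens, EOS
for current, target in next_token_pairs(toy_ids):
    print(f"given ...{current} -> predict {target}")
```

Note that with this setup the loss covers the instruction part of the prompt as well as the response, since no label positions are masked.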

In [12]:
def gen_and_tokenize_prompt(sample: dict, tokenizer: PreTrainedTokenizer):
    prompt = generate_prompt(sample)
    inputs = tokenize_prompt(prompt, tokenizer)
    return inputs

In [24]:
from functools import partial

# map each sample to get the actual model-input dataset for both train / val splits
prompt_map_func = partial(gen_and_tokenize_prompt, tokenizer=tokenizer)

train_dataset = splitted_dataset['train'].shuffle(seed=42).map(prompt_map_func)
val_dataset = splitted_dataset['test'].map(prompt_map_func)

Map:   0%|          | 0/5176 [00:00<?, ? examples/s]

In [25]:
print(train_dataset)
print(val_dataset)

Dataset({
    features: ['instruction', 'input', 'output', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 46584
})
Dataset({
    features: ['instruction', 'input', 'output', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 5176
})


In [26]:
val_dataset[0]

{'instruction': 'Rearrange the following sentence to make the sentence more interesting.',
 'input': 'She left the party early',
 'output': 'Early, she left the party.',
 'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29892,
  3300, 2859, 411, 385, 1881, 393, 8128, 4340, 3030, 29889,
  14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009,
  29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 29934,
  799, 3881, 278, 1494, 10541, 304, 1207, 278, 10541, 901,
  8031, 29889, 13, 13, 2277, 29937, 10567, 29901, 13, 13468,
  2175, 278, 6263, 4688, 13, 13, 2277, 29937, 13291, 29901,
  13, 29923, 279, 368, 29892, 1183, 2175, 278, 6263, 29889,
  2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,


#### step2.5 define the data collator

In [27]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    pad_to_multiple_of=8,
    return_tensors="pt", 
    padding=True
)

#### step3.1 prepare some helper functions before training

The helper functions below are all based on the ones defined [here](https://github.com/artidoro/qlora/blob/main/qlora.py).

In [13]:
def print_trainable_parameters(model, use_4bit=True):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
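    if use_4bit:
        # 4-bit weights are packed two per byte, so `numel()` reports only half of the
        # quantized parameters; halving the trainable count as well (a heuristic inherited
        # from the qlora repo) keeps the reported percentage roughly comparable
        trainable_params /= 2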
        
        
    print(
        f"all params: {all_param // 1000**3:.1f}B {(all_param % 1000**3) // 1000**2:.1f}M\n" + 
        f"trainable params: {trainable_params // 1000**3:.1f}B {(trainable_params % 1000**3) // 1000**2:.1f}M\n" + 
        f"trainable percent: {100 * trainable_params / all_param:.2f}%"
    )

In [14]:
def find_all_linear_names(model):
    from bitsandbytes.nn import Linear4bit
    
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
        
    return list(lora_module_names)
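The function keeps only the last component of each qualified module name, which is enough because PEFT matches `target_modules` entries against the end of each module's name. A minimal sketch of that reduction on a few hand-written names; `leaf_names` is a hypothetical helper, and the names follow the layer layout printed in step1.

```python
def leaf_names(qualified_names):
    """Collapse fully qualified module names to their unique last components."""
    return sorted({name.split(".")[-1] for name in qualified_names})

example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.31.mlp.down_proj",
]
print(leaf_names(example_names))
```

The 32 decoder layers share the same projection names, so the hundreds of quantized linear layers collapse to a handful of targets.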

In [15]:
def preprocess_modules_dtype(model, bf16=True):
    from peft.tuners.lora import LoraLayer
    
    for name, module in model.named_modules():
        # cast the lora layers to bfloat16
        if isinstance(module, LoraLayer) and bf16:
            module = module.to(torch.bfloat16)
        # upcast the layer norms to float32
        if 'norm' in name:
            module = module.to(torch.float32)
        # cast lm_head and embed_tokens from float32 down to bfloat16
        if (
            ("lm_head" in name or "embed_tokens" in name) and
            hasattr(module, "weight") and
            bf16 and module.weight.dtype == torch.float32
        ): module = module.to(torch.bfloat16)
    
    return model

#### step3.2 create the qlora model with peft

In [16]:
def create_qlora_model(model, grad_ckpt=True, bf16=True):
    from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
    
    # prepare int4 model for training
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=grad_ckpt)
    if grad_ckpt: model.gradient_checkpointing_enable()
    
    # find lora target modules
    target_modules = find_all_linear_names(model)
    print(f"Found {len(target_modules)} modules to quantize:\n {target_modules}")
    
    # define lora config
    lora_config = LoraConfig(
        r=64, # rank
        lora_alpha=16,
        target_modules=target_modules,
        lora_dropout=0.1,
        bias='none',
        task_type=TaskType.CAUSAL_LM
    )
    
    # get the peft model from the lora config
    model = get_peft_model(model, lora_config)
    
    # pre-process with some of the modules' dtype
    model = preprocess_modules_dtype(model, bf16)
    
    return model
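As a reminder of what the `LoraConfig` values control, the standard LoRA formulation for one adapted linear layer can be written as below. The symbols $W_0$, $A$, $B$, $x$ and $h$ are notation introduced here rather than names from the code, and the scaling assumes PEFT's default of `lora_alpha / r`.

```latex
\[
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
B \in \mathbb{R}^{d_{\text{out}} \times r}
\]
\[
r = 64,\ \alpha = 16
\;\Rightarrow\;
\frac{\alpha}{r} = 0.25
\]
```

Here $W_0$ is the frozen 4-bit base weight, and only $A$ and $B$ receive gradients, which is why the trainable share of the model stays small.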

In [28]:
qlora_model = create_qlora_model(model_int4)

Found 7 modules to quantize:
 ['gate_proj', 'k_proj', 'v_proj', 'o_proj', 'down_proj', 'q_proj', 'up_proj']


In [29]:
# print the model's (trainable) parameters 
print_trainable_parameters(qlora_model)

all params: 3.0B 660.0M
trainable params: 0.0B 79.0M
trainable percent: 2.18%
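The trainable count above can be reproduced by hand. Each adapted linear layer adds a pair of matrices with `r * (d_in + d_out)` weights. The sketch below assumes an MLP width of 11008 for Llama-2-7b, which is not visible in the truncated model printout; the hidden size and layer count are taken from the printed structure.

```python
R = 64                # LoRA rank from the LoraConfig above
HIDDEN = 4096         # hidden size, visible in the printed model structure
INTERMEDIATE = 11008  # assumption: MLP width of Llama-2-7b
NUM_LAYERS = 32

def lora_params(d_in, d_out, r=R):
    """A is (r x d_in) and B is (d_out x r), so the pair adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

attention = 4 * lora_params(HIDDEN, HIDDEN)  # q_proj, k_proj, v_proj, o_proj
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE) + lora_params(INTERMEDIATE, HIDDEN)  # gate, up, down
total = NUM_LAYERS * (attention + mlp)

print(f"LoRA weights: {total:,}")
print(f"after halving: {total // 2:,}")
```

The total is about 159.9M LoRA weights, and half of that is about 79.9M, which is consistent with the `79.0M` printed above once `print_trainable_parameters` halves the trainable count and floors the result to whole millions.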


In [30]:
qlora_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4b

#### step3.3 define the training arguments

In [34]:
from transformers import TrainingArguments

# define fine-tuning arguments
output_dir = '.'

training_args = TrainingArguments(
    output_dir=output_dir,
    do_train=True,
    # optimizer
    learning_rate=2e-4,
    optim="paged_adamw_8bit", # use "adamw_torch" when the model is not quantized (mode is neither 4 nor 8)
    # the batch size is critical since it largely determines memory consumption:
    # a bigger per_device_train_batch_size finishes training faster,
    # a smaller one lowers the risk of an OOM error
    
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=100,
    
    num_train_epochs=1, # for fine-tuning
    eval_steps=100, # evaluate on the validation set every 100 steps
    save_steps=500, # save every 500 steps (default)
    # to further save memory footprint
    fp16=True,
    gradient_checkpointing=True,
    gradient_accumulation_steps=1, # default
    # logging
    logging_dir=os.path.join(output_dir, 'logs'),
    logging_strategy='steps', # choose "steps" or "epoch"
    logging_steps=10,
    remove_unused_columns=False,
    # others
    report_to='wandb',
    seed=42,
    load_best_model_at_end=True,
    save_total_limit=2,
    evaluation_strategy="steps",
)
training_args

TrainingArguments(
_n_gpu=4,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=True,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=100,
evaluation_strategy=steps,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=False,
group_by_leng
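To get a feel for what these arguments imply, the number of optimizer steps can be estimated from the dataset sizes in step2.2. This is a back-of-the-envelope sketch: `steps_per_epoch` is a hypothetical helper, and it assumes a single process driving the 4 visible GPUs (as `_n_gpu=4` suggests), so that the effective batch is the per-device batch times the GPU count.

```python
import math

def steps_per_epoch(num_samples, per_device_batch, num_gpus, grad_accum=1):
    """Approximate optimizer steps per epoch for a single-process multi-GPU run."""
    effective_batch = per_device_batch * max(1, num_gpus) * grad_accum
    return math.ceil(num_samples / effective_batch)

train_steps = steps_per_epoch(46584, per_device_batch=1, num_gpus=4)
eval_steps_per_pass = steps_per_epoch(5176, per_device_batch=1, num_gpus=4)

print(f"training steps for one epoch: {train_steps}")
print(f"evaluations (every 100 steps): {train_steps // 100}")
print(f"checkpoints (every 500 steps): {train_steps // 500}")
print(f"batches per evaluation pass: {eval_steps_per_pass}")
```

Under these assumptions one epoch is 11646 steps with 116 evaluation passes of 1294 batches each, so evaluation takes a noticeable share of the run; raising `eval_steps` is one way to reduce it.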

#### step3.4 instantiate the trainer

In [35]:
from transformers import Trainer

qlora_model.config.use_cache = False # before training, make sure to set `use_cache` to `False`

# in a multi-gpu environment, the model's attribute `is_model_parallel` has to be set to `True` 
# to avoid unexpected behavior such as device placement mismatches
trainer = Trainer(
    model=qlora_model,
    args=training_args,
    
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # data_collator=data_collator,
)
trainer

<transformers.trainer.Trainer at 0x7fe4710d23a0>

#### step3.5 train the model and save it when finished

In [34]:
# to save memory during training,
# we can delete the bf16 and int8 models
# and load them back once training has finished
# del model, model_int8

In [33]:
# start to train
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ValueError: expected sequence of length 426 at dim 1 (got 326)
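This `ValueError` is the kind of message PyTorch raises when a list of sequences with different lengths is converted into a single tensor. A likely cause here (an assumption, not verified against this run) is that the `Trainer` in step3.4 was created with `data_collator` commented out, so the variable-length `input_ids` were never padded to a common length. The sketch below shows in pure Python what a padding collator does; `pad_batch` is a hypothetical helper that uses pad id 0 (the `<unk>` id set as pad token in step1.1) and -100 as the label padding value ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100, multiple_of=8):
    """Left-pad every sequence in the batch to a shared length (a multiple of `multiple_of`)."""
    longest = max(len(f["input_ids"]) for f in features)
    target = -(-longest // multiple_of) * multiple_of  # round up to the next multiple
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = target - len(f["input_ids"])
        # left padding, matching tokenizer.padding_side = "left" in step1.1
        batch["input_ids"].append([pad_token_id] * gap + f["input_ids"])
        batch["attention_mask"].append([0] * gap + f["attention_mask"])
        batch["labels"].append([label_pad_id] * gap + f["labels"])
    return batch

features = [
    {"input_ids": [1, 5, 6], "attention_mask": [1, 1, 1], "labels": [1, 5, 6]},
    {"input_ids": [1, 7], "attention_mask": [1, 1], "labels": [1, 7]},
]
batch = pad_batch(features)
print([len(row) for row in batch["input_ids"]])
```

In the notebook itself, passing the `data_collator` from step2.5 to the `Trainer` should have the same effect. The raw string columns (`instruction`, `input`, `output`) would probably also need to be dropped from the datasets first, since a padding collator cannot turn them into tensors.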

In [None]:
# if the training process is accidentally interrupted,
# you can resume training from a checkpoint, 
# which is saved automatically as `{output_dir}/checkpoint-{save_steps*n}` 
# according to `save_steps` in the training arguments (default is 500)

# trainer.train(resume_from_checkpoint='./checkpoint-500')
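When it is unclear which checkpoint is the most recent, the directory can be scanned for the highest step number. The helper below is a hypothetical, pure-Python sketch that assumes the `checkpoint-<step>` folder naming used by the `Trainer`; transformers also provides `get_last_checkpoint` in `transformers.trainer_utils` for the same purpose.

```python
import os
import re

def latest_checkpoint(output_dir="."):
    """Return the path of the `checkpoint-<step>` folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for entry in os.listdir(output_dir):
        match = pattern.match(entry)
        if match and os.path.isdir(os.path.join(output_dir, entry)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")

print(latest_checkpoint("."))
```

With `save_total_limit=2` as set in step3.3, only the most recent checkpoints are kept on disk, so older step numbers may no longer exist.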

In [39]:
# save the final model ckpt
save_dir = "./model/llama2-7b-qlora-adapter-e27k"
trainer.save_model(save_dir)

#### step4.1 report the training process

#### step4.2 evaluate on the validation set with the baselines

If you deleted the bf16 and int8 models while fine-tuning the int4 model,

you should first load them back as in step1.1 and step1.2 before running step4.2.

#### step4.3 evaluate on the public benchmark