<a href="https://www.kaggle.com/code/aisuko/fine-tuning-a-llama2-for-code-generation?scriptVersionId=164058160" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Let's fine-tune Llama 2 with QLoRA on a dataset of Python code, where each example solves a given task.

In [1]:
%%capture --no-stderr
!pip install transformers==4.36.2
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install accelerate==0.25.0
!pip install trl==0.7.7
!pip install tqdm==4.66.1
# flash-attn is not supported in the Kaggle environment yet; it is installed here so the notebook is ready for future use.
!pip install flash-attn==2.4.2

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models-with-QLoRA"
os.environ["WANDB_NOTES"] = "Fine-tuning causal language models with QLoRA"
os.environ["WANDB_NAME"] = "fine-tuning-Llama2-with-pycode-instructions-with-QLoRA"
os.environ["MODEL_NAME"] = "meta-llama/Llama-2-7b-hf"
os.environ["DATASET_NAME"]="iamtarun/python_code_instructions_18k_alpaca"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `meta-llama/Llama-2-7b-hf` from `transformers`...
config.json: 100%|█████████████████████████████| 609/609 [00:00<00:00, 4.35MB/s]
┌──────────────────────────────────────────────────────────┐
│   Memory Usage for loading `meta-llama/Llama-2-7b-hf`    │
├───────┬─────────────┬──────────┬─────────────────────────┤
│ dtype │Largest Layer│Total Size│   Training using Adam   │
├───────┼─────────────┼──────────┼─────────────────────────┤
│float32│  776.03 MB  │ 24.74 GB │         98.96 GB        │
│float16│  388.02 MB  │ 12.37 GB │         49.48 GB        │
│  int8 │  194.01 MB  │ 6.18 GB  │         24.74 GB        │
│  int4 │   97.0 MB   │ 3.09 GB  │         12.37 GB        │
└───────┴─────────────┴──────────┴─────────────────────────┘
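
The columns in this table follow simple ratios, which makes it easy to sanity-check whether a given GPU can hold the model. The sketch below reproduces the rows from the float32 total alone. It assumes, as the table suggests, that size scales linearly with bits per weight and that the "Training using Adam" column is about four times the model size (weights, gradients, and two optimizer moments). The name `estimate_rows` is made up for illustration and is not part of `accelerate`.

```python
# Bits per weight for each dtype row in the table above.
BITS = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}

def estimate_rows(float32_total_gb):
    """Return {dtype: (total_size_gb, adam_training_gb)} scaled from the float32 total."""
    rows = {}
    for dtype, bits in BITS.items():
        total = float32_total_gb * bits / 32
        # Assumption: weights + gradients + two Adam moments ~= 4x the model size.
        rows[dtype] = (total, 4 * total)
    return rows

rows = estimate_rows(24.74)  # float32 "Total Size" reported above
for dtype, (total, adam) in rows.items():
    print(f"{dtype:>8}: total ~ {total:.2f} GB, Adam training ~ {adam:.2f} GB")
```

Even the int4 row only covers the weights and optimizer state, so activations still need headroom on top of these figures.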


In [4]:
from datasets import load_dataset

dataset_name=os.getenv("DATASET_NAME")

dataset=load_dataset(dataset_name, split="train[:1000]") # Use a smaller slice to fit GPUs with less memory
len(dataset)

Downloading readme:   0%|          | 0.00/905 [00:00<?, ?B/s]



Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/18612 [00:00<?, ? examples/s]

1000

In [5]:
def format_instruction(sample):
    # Keep the template flush-left so no indentation leaks into the prompt,
    # and use the same section headers as the inference prompt.
    return f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""

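The same template is needed again at inference time, but without the response filled in. A minimal sketch of one way to keep the two in sync is shown here. `PROMPT_TEMPLATE`, `build_prompt`, and the sample record are hypothetical; they only mirror the `instruction`, `input`, and `output` fields used above and the section headers of the prompt in the Inference section.

```python
# Single source of truth for the prompt layout (hypothetical helper, not from the notebook).
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Use the Task below and the Input given to write the Response, "
    "which is a programming code that can solve the Task.\n\n"
    "### Task:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(sample, with_response=True):
    """Render a training example, or an inference prompt when with_response is False."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=sample["instruction"], input=sample.get("input", "")
    )
    return prompt + sample["output"] if with_response else prompt

# Made-up record for illustration only.
sample = {
    "instruction": "Return the even numbers from 0 to 10.",
    "input": "",
    "output": "evens = list(range(0, 11, 2))",
}
train_text = build_prompt(sample)
infer_text = build_prompt(sample, with_response=False)
print(train_text)
```

Because the training text is the inference prompt plus the expected output, the model sees the same context in both phases.
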
# Load the model

In [6]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
import torch

bnb_config= BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16
)

model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=bnb_config,
    use_cache=False,
    device_map='auto',
    torch_dtype=torch.float16
)

model.config

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",

In [7]:
model.config.pretraining_tp=1
model.get_memory_footprint()

3829940224

In [8]:
from peft import PeftModel, get_peft_model, prepare_model_for_kbit_training

# to save memory
model.gradient_checkpointing_enable()
model.get_memory_footprint()

3829940224

In [9]:
# freeze the base model layers and cast layernorm in fp32
model=prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRM

In [14]:
from peft import LoraConfig, TaskType

peft_config=LoraConfig(
    # Alpha parameter for LoRA scaling
    lora_alpha=8,
    # Dropout probability for LoRA layers
    lora_dropout=0.1,
    # LoRA attention dimension
    r=4,
    bias="none",
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'o_proj',
        'gate_proj',
        'up_proj',
        'down_proj',
        'lm_head'
    ],
    task_type=TaskType.CAUSAL_LM
)

peft_model=get_peft_model(model,peft_config)
peft_model.get_memory_footprint()

4395315200
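
The jump from 3,829,940,224 to 4,395,315,200 bytes can be accounted for with a little arithmetic. The sketch below counts LoRA parameters from the layer shapes printed earlier: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. As assumptions rather than something the notebook states, it supposes that the adapters are stored in float32 and that `prepare_model_for_kbit_training` upcast the remaining float16 parameters (`embed_tokens`, `lm_head`, and the RMSNorm weights) to float32. The helper and constant names are made up for illustration.

```python
# Layer shapes taken from the model printout above: (in_features, out_features).
PER_LAYER = {
    "q_proj": (4096, 4096), "k_proj": (4096, 4096),
    "v_proj": (4096, 4096), "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008), "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
NUM_LAYERS, HIDDEN, VOCAB, RANK, ALPHA = 32, 4096, 32000, 4, 8

def lora_params(d_in, d_out, r=RANK):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen linear layer."""
    return r * (d_in + d_out)

adapter_params = NUM_LAYERS * sum(lora_params(*shape) for shape in PER_LAYER.values())
adapter_params += lora_params(HIDDEN, VOCAB)  # lm_head is also a target module

# Assumption: adapters are float32 (4 bytes per weight).
adapter_bytes = 4 * adapter_params
# Assumption: embed_tokens, lm_head and the 65 RMSNorm weights grow from 2 to 4 bytes each.
upcast_params = 2 * VOCAB * HIDDEN + (2 * NUM_LAYERS + 1) * HIDDEN
upcast_bytes = 2 * upcast_params

print(f"LoRA parameters: {adapter_params:,}")
print(f"LoRA scaling (lora_alpha / r): {ALPHA / RANK}")
print(f"Estimated footprint increase: {adapter_bytes + upcast_bytes:,} bytes")
print(f"Reported footprint increase:  {4395315200 - 3829940224:,} bytes")
```

Under these assumptions the estimate equals the reported increase, and most of it comes from the float32 upcast rather than from the roughly 10 million adapter weights.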

In [15]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(
    os.getenv("MODEL_NAME"),
    trust_remote_code=False,
    use_fast=True
)

tokenizer.pad_token=tokenizer.eos_token
tokenizer.padding_side="right"

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

# Training

In [17]:
from transformers import TrainingArguments
from trl import SFTTrainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    num_train_epochs=1,
    # Number of training steps (overrides num_train_epochs)
    max_steps=100,
    per_device_train_batch_size=16, # 16 hit the CUDA OOM recorded in the output below; try 6 with flash attention, else 4
    # Number of update steps to accumulate the gradients for
    gradient_accumulation_steps=4,
    # Enable gradient checkpointing
    gradient_checkpointing=True,
    # Optimizer to use
    optim='paged_adamw_8bit',
    # Log every X updates steps
    logging_steps=25,
    save_strategy="no",
    # Initial learning rate (AdamW optimizer)
    learning_rate=2e-4,
    # Weight decay to apply to all layers except bias/LayerNorm weights
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    # Maximum gradient norm (gradient clipping)
    max_grad_norm=0.3,
    # Ratio of steps for a linear warmup (from 0 to the learning rate)
    warmup_ratio=0.03,
    # Group sequences of similar length into the same batch
    # Saves memory and speeds up training considerably
    group_by_length=True,
    lr_scheduler_type='cosine',
    disable_tqdm=False,
    report_to="wandb",
    seed=42,
    run_name=os.getenv("WANDB_NAME")
)

sft_trainer=SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    # Maximum sequence length to use
    max_seq_length=2048,
    tokenizer=tokenizer,
    # Pack multiple short examples in the same input sequence to increase efficiency
    packing=True,
    formatting_func=format_instruction,
    args=training_args,
)

sft_trainer.train()

Generating train split: 0 examples [00:00, ? examples/s]

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 14.75 GiB total capacity; 8.56 GiB already allocated; 5.95 GiB free; 8.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
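
The failed 8.00 GiB allocation is consistent with a single float32 attention-score tensor of shape `(batch, heads, seq_len, seq_len)` at batch size 16, 32 heads, and 2048 tokens. That reading is an inference from the sizes, not something the error message confirms. The sketch below computes that size for a few batch sizes and shows one way to lower the per-device batch while keeping the same number of sequences per optimizer step. The function names are hypothetical.

```python
GIB = 2**30

def attention_scores_bytes(batch, heads=32, seq_len=2048, bytes_per_value=4):
    """Size of one (batch, heads, seq_len, seq_len) score tensor; float32 by default."""
    return batch * heads * seq_len * seq_len * bytes_per_value

def rebalance(per_device_batch, accumulation_steps, new_per_device_batch):
    """Keep per_device_batch * accumulation_steps constant with a smaller batch."""
    effective = per_device_batch * accumulation_steps
    if effective % new_per_device_batch:
        raise ValueError("new batch size must divide the effective batch size")
    return effective // new_per_device_batch

for batch in (16, 6, 4):
    print(f"batch={batch:>2}: {attention_scores_bytes(batch) / GIB:.2f} GiB per score tensor")

new_accumulation = rebalance(16, 4, new_per_device_batch=4)
tokens_per_step = 16 * 4 * 2048  # packed sequences per optimizer step x max_seq_length
print(f"gradient_accumulation_steps={new_accumulation}, tokens per optimizer step={tokens_per_step:,}")
```

With `per_device_train_batch_size=4` and `gradient_accumulation_steps=16`, each optimizer step would still see 64 packed sequences. Whether that fits in 14.75 GiB also depends on the other activations kept during the forward and backward passes, so a short trial run is a sensible first check.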

In [None]:
sft_trainer.push_to_hub(os.getenv("WANDB_NAME"))
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

# Inference

In [None]:
import gc

del peft_model, model, sft_trainer
gc.collect()
torch.cuda.empty_cache()

In [None]:
from peft import PeftConfig, PeftModel

peft_config=PeftConfig.from_pretrained("aisuko/"+os.getenv("WANDB_NAME"))
base_model=AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)
peft_model=PeftModel.from_pretrained(base_model,"aisuko/"+os.getenv("WANDB_NAME"))

In [None]:
instruction="Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2."
input_text=""

prompt=f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input_text}

### Response:
"""
# Move the prompt to the device the model lives on
input_ids=tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.to(peft_model.device)
# Generate with the fine-tuned PEFT model (`model` was deleted above)
outputs=peft_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.5)

tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
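
For a causal language model, `generate` returns the prompt tokens followed by the new tokens, so the decoded string starts with the prompt. A minimal sketch for keeping only the generated code is shown below. `extract_response` is a hypothetical helper, and it assumes the `### Response:` marker from the prompt above appears in the decoded text.

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the last response marker, or the whole text if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()

# Made-up decoded string standing in for tokenizer.batch_decode(...)[0].
decoded = (
    "### Task:\nList even numbers from 0 to 10.\n\n"
    "### Input:\n\n\n"
    "### Response:\nevens = list(range(0, 11, 2))\n"
)
print(extract_response(decoded))
```

Splitting on the last marker avoids cutting the output short if the task text itself happens to mention the marker.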

# Credit

* https://pub.towardsai.net/fine-tuning-a-llama-2-7b-model-for-python-code-generation-865453afdf73
* https://github.com/edumunozsala/llama-2-7B-4bit-python-coder/blob/main/Llama-2-finetune-qlora-python-coder.ipynb