# Efficient fine-tuning of Llama 2 using LoRA

## What is fine tuning?
Fine-tuning is the process of adjusting the weights of a pre-trained model on new data to optimize its performance on a specific task.

## Requirements
- Pretrained language model
- New dataset specific to the task at hand
- GPU

## Kaggle hardware specs

In [1]:
!nvidia-smi
!lscpu | head -n 15

Tue Jul 30 07:50:16 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   54C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

## Install required libraries

 - bitsandbytes - Provides 8-bit and 4-bit quantization CUDA functions for PyTorch
 - peft - Parameter efficient fine tuning for huggingface
 - trl - Transformer reinforcement learning
 - accelerate - Huggingface integration for multiple GPUs
 - torchkeras - Enhanced PyTorch functionality
 - langchain - Framework for LLM development

In [2]:
!pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu117


Looking in indexes: https://download.pytorch.org/whl/nightly/cu117
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu117/torch-2.1.0.dev20230621%2Bcu117-cp310-cp310-linux_x86_64.whl (1886.6 MB)
Collecting filelock (from torch)
  Downloading https://download.pytorch.org/whl/nightly/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting typing-extensions (from torch)
  Downloading https://download.pytorch.org/whl/nightly/typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting sympy (from torch)
  Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.1-py3-none-any.whl (6.2 MB)
Collecting networkx (from torch)
  Downloading https://download.pytorch.org/whl/nightly/networ

In [3]:
!pip install peft trl accelerate torchkeras langchain

Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.9.6-py3-none-any.whl.metadata (12 kB)
Collecting torchkeras
  Downloading torchkeras-3.9.9-py3-none-any.whl.metadata (8.8 kB)
Collecting langchain
  Downloading langchain-0.2.11-py3-none-any.whl.metadata (7.1 kB)
Collecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.8.5-py3-none-any.whl.metadata (8.2 kB)
Collecting langchain-core<0.3.0,>=0.2.23 (from langchain)
  Downloading langchain_core-0.2.24-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.93-py3-none-any.whl.metadata (13 kB)
Collecting packaging>=20.0 (from peft)
  Downloading packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downlo

In [4]:
!pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.2-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.43.2-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.2


In [5]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0


## Imports

In [6]:
import numpy as np
import pandas as pd 
from datasets import load_dataset
import torch
from torch import nn 
from torch.utils.data import DataLoader 

from time import time
import warnings 
warnings.filterwarnings('ignore')

# Import libraries for multi-GPU acceleration and parameter-efficient fine-tuning
import accelerate 
import peft 

# Import necessary modules from the transformers library
from transformers import AutoTokenizer, AutoConfig, AutoModel, BitsAndBytesConfig, AutoModelForCausalLM
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Set environment variables for PyTorch CUDA and XLA
# max_split_size_mb - prevents allocator from splitting blocks larger than this size
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64" 
os.environ['XLA_USE_BF16'] = "1"
os.environ['XLA_TENSOR_ALLOCATOR_MAXSIZE'] = '100000000'

# Import the torchkeras library for enhanced PyTorch functionality
import torchkeras

2024-07-30 07:53:14.681316: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-30 07:53:14.681422: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-30 07:53:14.805565: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Set model, dataset, output model


In [7]:
# Model
base_model = "/kaggle/input/llama-2/pytorch/7b-chat-hf/1/"

# New instruction dataset
hrcombo_dataset = "/kaggle/input/smallest/small_processed_data.jsonl"

# Fine-tuned model
new_model = "llama-2-7b-chat-hrcombo2"

## Load dataset

- Load the dataset and the tokenizer, and use the end-of-sequence token as the padding token

In [8]:
# Load and prepare data
import json
from datasets import Dataset
from datasets import load_dataset
from transformers import LlamaTokenizer

# Load the dataset
train_dataset = load_dataset('json', data_files={"train":[hrcombo_dataset]}, split='train')


def formatInputJSONData(datapoint):
    job_desc = " ".join([f"{k}: {v}" for k, v in datapoint['job'].items()])
    resume = " ".join([f"{k}: {v}" for k, v in datapoint['resume'].items()])
    indicators = " ".join([f"{k}: {v}" for k, v in datapoint['indicators'].items()])
    text = f"Job Description: {job_desc}\n\nResume: {resume}\n\nIndicators: {indicators}"
    return text

## Tokenize data

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False,
   trust_remote_code=True, add_eos_token=True)

# Llama 2 ships without a pad token, so reuse the end-of-sequence token for padding.
# Adding a new '[PAD]' token instead would grow the vocabulary without resizing the model embeddings.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

## Tokenize prompt

def tokenizePrompt(prompt):
    currPrompt = formatInputJSONData(prompt)
    return tokenizer(currPrompt)

tokenized_train_dataset = train_dataset.map(tokenizePrompt)
print(len(tokenized_train_dataset))


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

30
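
To make the training text concrete, here is a minimal sketch of what `formatInputJSONData` produces for one record. The record is hypothetical: only the three top-level keys (`job`, `resume`, `indicators`) come from the function above, and the nested field names and values are invented for illustration.

```python
# Hypothetical record: only the three top-level keys come from the notebook,
# the nested fields and values are invented for illustration.
datapoint = {
    "job": {"title": "Software Engineer", "location": "Remote"},
    "resume": {"summary": "4 years of Python experience"},
    "indicators": {"skillMatchScore": 8, "skillMatchScale": 10},
}


def format_input_json_data(datapoint):
    """Same logic as formatInputJSONData: flatten each section to 'key: value' pairs."""
    job_desc = " ".join(f"{k}: {v}" for k, v in datapoint["job"].items())
    resume = " ".join(f"{k}: {v}" for k, v in datapoint["resume"].items())
    indicators = " ".join(f"{k}: {v}" for k, v in datapoint["indicators"].items())
    return f"Job Description: {job_desc}\n\nResume: {resume}\n\nIndicators: {indicators}"


text = format_input_json_data(datapoint)
print(text)
```

Each training example is therefore a single string in which the job, the resume and the target indicators appear one after another, separated by blank lines.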


## **PEFT Fine-Tuning**

We now prepare the model for training using the PEFT (Parameter-Efficient Fine-Tuning) approach, which significantly reduces the memory and compute requirements.

## Load model, bnbconfig
- Set bitsandbytes config to load model with 4-bit quantization
- Load model with bnb config
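
As a rough back-of-the-envelope check of why 4-bit loading matters on a single 15 GiB T4, the sketch below estimates the size of the weights alone. The architecture constants are those of the published Llama-2-7B configuration. The 4-bit estimate rests on a simplifying assumption: only the decoder's linear layers are stored at 4 bits, everything else stays at 16 bits, and the quantization constants are ignored. Activations, gradients and optimizer state come on top of these figures.

```python
# Llama-2-7B architecture constants (published model configuration).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000

attn = 4 * HIDDEN * HIDDEN               # q_proj, k_proj, v_proj, o_proj
mlp = 3 * HIDDEN * INTERMEDIATE          # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                       # two RMSNorm weights per decoder layer

decoder_linear = LAYERS * (attn + mlp)   # weights assumed to be quantized
other = 2 * VOCAB * HIDDEN + LAYERS * norms + HIDDEN  # embeddings, lm_head, norms

n_params = decoder_linear + other
GIB = 1024 ** 3

fp16_gib = n_params * 2 / GIB
# Assumption: 0.5 byte per quantized weight, 2 bytes for everything else.
nf4_gib = (decoder_linear * 0.5 + other * 2) / GIB

print(f"parameters: {n_params:,}")
print(f"16-bit weights: {fp16_gib:.1f} GiB, 4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone take roughly 12.6 GiB, which leaves very little room on one T4, while the 4-bit weights take about 3.5 GiB.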

In [9]:
model_name_or_path = base_model
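bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
    llm_int8_has_fp16_weight=False          # note the `llm_` prefix; a misspelled name is ignored with an "Unused kwargs" warning
)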

model =  AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map = 'auto')  
model.config.use_cache = False     # Disable the key/value cache; it is not needed during training
model.config.pretraining_tp = 1    # Slower but more accurate computation of linear layers

Unused kwargs: ['lm_int8_has_fp16_weight']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`

In [None]:
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model

# gradient checkpointing to reduce memory usage for increased compute time
model.gradient_checkpointing_enable()

# prepare the quantized model for training: freeze the base weights and upcast the layer norms to fp32 for numerical stability
model = prepare_model_for_kbit_training(model)

## **Configure model with LoRA**

The code below uses LoRA (a PEFT method) to reduce the number of trainable parameters. LoRA freezes the pre-trained weight matrices and, for each targeted layer, learns a weight update expressed as the product of two much smaller low-rank matrices. Only these small matrices are trained, which drastically reduces the number of parameters that need to be fine-tuned. Here the update is applied to the attention and MLP projection layers.

In [None]:
config = LoraConfig(
    r=8,                # rank of the update matrices; a lower rank means fewer trainable parameters
    lora_alpha=64,      # scaling factor; the low-rank update is multiplied by lora_alpha / r
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "down_proj",
        "up_proj",
        "o_proj"
    ],                  # modules that receive LoRA update matrices (attention and MLP projections)
    bias="none",        # do not train any bias terms
    lora_dropout=0.05,  # dropout applied inside the LoRA layers as regularization
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
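
To put numbers on the parameter savings, the following sketch counts the LoRA parameters implied by the configuration above. For a frozen weight `W` of shape `out x in`, LoRA trains `A` (`r x in`) and `B` (`out x r`) and computes `W x + (lora_alpha / r) * B A x`, so each targeted layer adds `r * (in + out)` trainable parameters. The layer shapes are taken from the published Llama-2-7B configuration; this is an illustrative calculation, not output from the notebook run.

```python
# Llama-2-7B shapes (published configuration) and the rank/alpha used above.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32
R, ALPHA = 8, 64

# (in_features, out_features) of every targeted projection in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters in A (r x in_features) plus B (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
trainable = LAYERS * per_layer
frozen = LAYERS * sum(i * o for i, o in target_shapes.values())

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of the targeted weights: {trainable / frozen:.2%}")
print(f"update scaling lora_alpha / r: {ALPHA / R}")
```

With `r=8` on all seven projections this comes to about 20 million trainable parameters, roughly 0.3% of the weights they adapt. If the model loads successfully, `model.print_trainable_parameters()` from PEFT reports the actual count.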

## **Training the model**

We're now ready to train our Llama 2 model on our new data! We'll be using the Transformers library to create a Trainer object for training the model. The Trainer takes the pre-trained model, training datasets, training arguments, and data collator as input.

Training time depends on the size of the training data, number of epochs and the configuration of the GPU used. 

In [None]:
import transformers

trainer = transformers.Trainer(
    model=model,                             # llama-2-7b-chat model
    train_dataset=tokenized_train_dataset,   # training data that's tokenized
    args=transformers.TrainingArguments(
        output_dir="/kaggle/working/results",  # directory where checkpoints are saved
        per_device_train_batch_size=2,         # number of samples per forward/backward pass on each device
        gradient_accumulation_steps=2,         # [default = 1] number of passes whose gradients are accumulated before each optimizer update
        max_steps=100,                         # total number of optimizer updates
        learning_rate=1e-4,                    # small learning rate so the pretrained behaviour is not overwritten
        bf16=False,                            # do not use bfloat16 mixed precision (the T4 has no native bf16 support)
        optim="paged_adamw_8bit",              # 8-bit AdamW from bitsandbytes with paged optimizer state to limit GPU memory spikes
        logging_dir="/kaggle/working/logs",    # directory for training logs
        save_strategy="steps",                 # [default = "steps"] save a checkpoint every `save_steps` optimizer updates
        save_steps=10,                         # checkpoint interval in optimizer updates
        logging_steps=10,                      # log the training loss every 10 optimizer updates
        run_name="llama-2-hrcombo"
    ),

    # use to form a batch from a list of elements of train_dataset
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# use_cache stores past key/value states to speed up autoregressive decoding.
# It brings no benefit during training and is incompatible with gradient checkpointing, so keep it off.
model.config.use_cache = False

# train the model based on the above config
trainer.train()
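
The training arguments above imply a concrete schedule, which the short calculation below spells out. It assumes a single training process (labeled `num_processes`, a name introduced here for illustration); with several data-parallel processes the number of examples per update would scale accordingly. The dataset size of 30 is the value printed after tokenization.

```python
num_examples = 30                 # size of tokenized_train_dataset printed earlier
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_steps = 100
save_steps = 10
num_processes = 1                 # assumption: one training process

# Examples consumed per optimizer update and over the whole run.
examples_per_update = per_device_train_batch_size * gradient_accumulation_steps * num_processes
examples_seen = examples_per_update * max_steps
approx_epochs = examples_seen / num_examples

# The Trainer names checkpoint folders after the global step.
checkpoints = [f"checkpoint-{step}" for step in range(save_steps, max_steps + 1, save_steps)]

print(f"examples per optimizer update: {examples_per_update}")
print(f"approximate passes over the data: {approx_epochs:.1f}")
print(f"checkpoints in output_dir: {checkpoints}")
```

Under this assumption the model sees the 30 examples about 13 times, which is a lot for such a small dataset and worth keeping in mind when judging overfitting. The list also shows where `checkpoint-50`, loaded in the next section, comes from.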

In [None]:
trainer.save_model(new_model)

In [6]:
!zip -r file.zip /kaggle/working/llama-2-7b-chat-hrcombo2

  adding: kaggle/working/llama-2-7b-chat-hrcombo/ (stored 0%)
  adding: kaggle/working/llama-2-7b-chat-hrcombo/adapter_model.safetensors (deflated 7%)
  adding: kaggle/working/llama-2-7b-chat-hrcombo/training_args.bin (deflated 51%)
  adding: kaggle/working/llama-2-7b-chat-hrcombo/adapter_config.json (deflated 53%)
  adding: kaggle/working/llama-2-7b-chat-hrcombo/README.md (deflated 66%)


In [7]:
from IPython.display import FileLink
FileLink(r'file.zip')

## **Load Finetuned Model**

In [15]:
torch.cuda.empty_cache()

In [17]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig,LlamaTokenizer
from peft import PeftModel

time_1 = time()
model_name_or_path = base_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_has_fp16_weight=False,
)

# Load the tokenizer once for inference. add_eos_token is left at its default here:
# appending EOS to the prompt would tell the model that the sequence is already finished.
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False,
                                           trust_remote_code=True)

# Llama 2 ships without a pad token, so reuse the end-of-sequence token for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,  #same as before
    quantization_config=bnb_config,  #same quantization config as before
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True
)
model.config.use_cache = False
model.config.pretraining_tp = 1

modelFinetuned = PeftModel.from_pretrained(model,"/kaggle/working/results/checkpoint-50")

time_2 = time()
print(f"Time elapsed: {round(time_2-time_1, 3)} sec.")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Time elapsed: 7.109 sec.


### Save model

In [None]:
# modelFinetuned.save_pretrained('/kaggle/working/' + new_model)  # PeftModel exposes save_pretrained, not save_model

In [None]:
!zip -r file.zip /kaggle/working

In [None]:
from IPython.display import FileLink
FileLink(r'file.zip')

## **Llama 2 prompt format**
- \<s>, \</s> - mark the start and end of the conversation
- [INST], [/INST] - mark the start and end of the user prompt
- \<\<SYS>>, \<\</SYS>> - mark the start and end of the system prompt

In [6]:
### Function to create prompts formatted for llama 2, Includes default system prompt

DEFAULT_SYSTEM_PROMPT = """
You are a helpful, respectful and honest human resource assistant. Always answer as helpfully as possible, while being safe. 
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
""".strip()


def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
""".strip()

## Test inference of fine tuned model

In [7]:
    sample_job = """
    Software Engineer
Tech Solutions Inc. is a forward-thinking technology company focused on innovative software solutions. We strive to create an inclusive and collaborative environment where creativity and technical excellence thrive.
To provide cutting-edge software solutions that empower businesses to achieve their goals.
To be the leading provider of innovative software solutions worldwide.
Innovation Collaboration Excellence
Inclusive collaborative and innovative
We are seeking a highly skilled Software Engineer to join our dynamic team. The ideal candidate will have strong experience in software development a passion for technology and a drive to deliver high-quality software solutions.
Design develop and maintain software applications.
Collaborate with cross-functional teams to define and implement software requirements.
Write clean efficient and maintainable code.
Perform code reviews and provide constructive feedback.
Troubleshoot and debug software issues.
Bachelor's degree in Computer Science or a related field.
Java Python JavaScript SQL Software Development Lifecycle (SDLC)
3+ years of experience in software development Experience with cloud platforms (e.g. AWS Azure) Knowledge of Agile methodologies Experience with containerization (e.g. Docker)
Experience in the tech industry is preferred. Strong problem-solving skills Excellent communication skills Ability to work both independently and in a team Attention to detail

    """
    sample_resume = """
    Skilled Software Engineer with over 4 years of experience in developing robust software applications. Proficient in Java, Python, and JavaScript, with a strong understanding of cloud platforms and Agile methodologies. Seeking to leverage my expertise to contribute to the success of Tech Solutions Inc.
Software Engineer
Designed and implemented new features for a cloud-based application, improving user experience by 30%.
Collaborated with cross-functional teams to define project requirements and deliver high-quality software solutions.
Optimized database queries, reducing load times by 25%.
Mentored junior developers and conducted code reviews.
Junior Software Engineer
Developed and maintained web applications using JavaScript and Python.
Assisted in migrating legacy applications to cloud infrastructure (AWS).
Troubleshot and resolved software issues, improving application stability.
Bachelor of Science in Computer Science
Programming Languages: Java, Python, JavaScript,
Web Technologies: HTML, CSS, React,
Databases: MySQL, PostgreSQL,
Cloud Platforms: AWS, Azure,
Development Tools: Git, Docker,
Soft Skills: Problem-solving, Communication, Teamwork
AWS Certified Developer – Associate,
E-commerce Platform Development,
Lead Developer,
Developed a scalable e-commerce platform using Java and Spring Boot.
Implemented RESTful APIs to support front-end applications.
Integrated third-party payment gateways and shipping services.
Java, Spring Boot, MySQL
IEEE Computer Society,
Volunteer Developer,
Code for Good,
Developed and maintained software solutions for non-profit organizations.
Provided technical support and training to volunteers.
English,
Fluent
Spanish,
Intermediate
    """

In [None]:
torch.cuda.empty_cache()

In [None]:
time_1 = time()

request = '''
    {sample_job}
    {sample_resume}
    Do not say any other thing, just return the indicators as a json.
    '''

# Format the question
eval_prompt = generate_prompt(request.format(sample_job=sample_job, sample_resume=sample_resume))
promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

modelFinetuned.eval()
with torch.no_grad():
    print(tokenizer.decode(modelFinetuned.generate(**promptTokenized, max_new_tokens = 1024)[0], skip_special_tokens=True))
torch.cuda.empty_cache()

time_2 = time()
print(f"Time elapsed: {round(time_2-time_1, 3)} sec.")

In [None]:
torch.cuda.empty_cache()

In [20]:
time_1 = time()

request = '''
    Given a job description and a resume, analyse and compare them and return the results in the following format in indicators:
    The job description is: 
    {sample_job}
    
    The resume is: 
    {sample_resume}
    
    The indicators to be assigned in the following form:
    overallMatchPercentage: ""
    culturalFitScore: ""
    culturalFitScale: 5,
    culturalFitDescription: ""
    culturalFitReasoning: ""
    growthPotentialScore: ""
    growthPotentialScale: 10,
    growthPotentialDescription: ""
    growthPotentialReasoning: ""
    productivityIndicatorScore: ""
    productivityIndicatorScale: 5,
    productivityIndicatorDescription: ""
    productivityIndicatorReasoning: ""
    likelihoodOfJobOfferAcceptanceScore: ""
    likelihoodOfJobOfferAcceptanceScale: 10,
    likelihoodOfJobOfferAcceptanceDescription: ""
    likelihoodOfJobOfferAcceptanceReasoning: ""
    predictedCandidateSuccessScore: ""
    predictedCandidateSuccessScale: 10,
    predictedCandidateSuccessDescription: ""
    predictedCandidateSuccessReasoning: ""
    longTermEngagementAndRetentionScore: ""
    longTermEngagementAndRetentionScale: 5,
    longTermEngagementAndRetentionDescription: ""
    longTermEngagementAndRetentionReasoning: ""
    skillMatchScore: ""
    skillMatchScale: 10,
    skillMatchDescription: ""
    skillMatchReasoning: ""
    
    Do not say any other thing, just return the indicators as a json.
    '''

# Format the question
eval_prompt = generate_prompt(request.format(sample_job=sample_job, sample_resume=sample_resume))
promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

modelFinetuned.eval()
with torch.no_grad():
    print(tokenizer.decode(modelFinetuned.generate(**promptTokenized, max_new_tokens = 1024)[0], skip_special_tokens=True))
torch.cuda.empty_cache()

time_2 = time()
print(f"Time elapsed: {round(time_2-time_1, 3)} sec.")

[INST] <<SYS>>
You are a helpful, respectful and honest human resource assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>


    Given a job description and a resume, analyse and compare them and return the results in the following format in indicators:
    The job description is: 
    
Software Engineer
Tech Solutions Inc. is a forward-thinking technology company focused on innovative software solutions. We strive to create an inclusive and collaborative environment where creativity and technical excellence thrive.
To provide cutting-edge software solutions that empowe

In [11]:
import shutil
import os

# Paths to the original and fine-tuned model directories
original_model_dir = "/kaggle/input/llama-2/pytorch/7b-chat-hf/1"
finetuned_model_dir = "/kaggle/working/llama-2-7b-chat-hrcombo"

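# List of essential config and tokenizer files. Llama uses a SentencePiece `tokenizer.model`
# rather than GPT-2 style `vocab.json`/`merges.txt`. None of these files contain model weights.
essential_files = ["config.json", "tokenizer_config.json", "tokenizer.model", "special_tokens_map.json"]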

for file_name in essential_files:
    src_file = os.path.join(original_model_dir, file_name)
    dst_file = os.path.join(finetuned_model_dir, file_name)
    if not os.path.exists(dst_file) and os.path.exists(src_file):
        shutil.copy(src_file, dst_file)

print("Missing files copied successfully.")


Missing files copied successfully.


In [8]:
model_finetuned = AutoModelForCausalLM.from_pretrained(new_model)
tokenizer = AutoTokenizer.from_pretrained(new_model)
eval_prompt = generate_prompt(request.format(
    sample_job=sample_job, sample_resume=sample_resume))
prompt_tokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model_finetuned.eval()
with torch.no_grad():
    result = tokenizer.decode(model_finetuned.generate(
        **prompt_tokenized, max_new_tokens=1024)[0], skip_special_tokens=True)

print(result)
torch.cuda.empty_cache()

OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory llama-2-7b-chat-hrcombo.
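
The `OSError` above is consistent with the earlier `zip` listing: the saved directory contains `adapter_model.safetensors` and `adapter_config.json` but no full model weights, because saving a PEFT model writes only the LoRA adapter. The helper below is a small sketch for checking which kind of directory you have before choosing a loading path. `describe_model_dir` is a hypothetical name introduced here, and the demo directory is a temporary stand-in that mimics the file names from the listing.

```python
import os
import tempfile

# File names that indicate full model weights (single-file or sharded checkpoints).
FULL_MODEL_FILES = {
    "model.safetensors",
    "pytorch_model.bin",
    "model.safetensors.index.json",
    "pytorch_model.bin.index.json",
}


def describe_model_dir(path: str) -> str:
    """Classify a directory as 'full model', 'adapter only' or 'unknown'."""
    files = set(os.listdir(path))
    if files & FULL_MODEL_FILES:
        return "full model"
    if "adapter_config.json" in files:
        return "adapter only"
    return "unknown"


# Demo on a temporary directory with the same file names as the zip listing.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("adapter_model.safetensors", "adapter_config.json",
                 "training_args.bin", "README.md"):
        open(os.path.join(demo_dir, name), "w").close()
    kind = describe_model_dir(demo_dir)

print(kind)
```

For an adapter-only directory, the approach from the "Load Finetuned Model" section applies: load the base model first, then attach the adapter with `PeftModel.from_pretrained(model, adapter_dir)`. Copying `config.json` and tokenizer files into the adapter directory does not help, since the base weights are still missing.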