<a href="https://colab.research.google.com/github/2003Yash/LORA_finetuning/blob/main/LORA_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: https://www.youtube.com/watch?v=D3pXSkGceY0&list=PLZAGXXsIV3P3gCenOWRd56ZpeksdYFUKV

Step-1: Test the model before fine-tuning so its accuracy can be compared afterwards
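Step-1 has no code in this notebook. One minimal way to record a baseline is to hold out some question/answer pairs, wrap the model in a function that maps a question to an answer, and score the answers with a token-overlap F1. The metric is an illustrative choice, not the author's, and `fake_generate` is a hypothetical stand-in for a real model call:

```python
from collections import Counter
from typing import Callable, Dict, List


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty scores zero.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def baseline_score(pairs: List[Dict[str, str]], generate: Callable[[str], str]) -> float:
    """Average token F1 of `generate` over held-out {'question', 'answer'} pairs."""
    if not pairs:
        return 0.0
    scores = [token_f1(generate(p["question"]), p["answer"]) for p in pairs]
    return sum(scores) / len(scores)


def fake_generate(question: str) -> str:
    """Hypothetical stand-in for the real model call."""
    return "The sky is usually blue"


held_out = [{"question": "What color is the sky?", "answer": "The sky is blue"}]
print(round(baseline_score(held_out, fake_generate), 3))  # prints 0.889
```

Running the same function with the fine-tuned model after Step-6 gives a number that can be compared directly with this baseline.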

Step-2: Gather the resources to fine-tune on (usually text files or PDFs)

Step-3: Chunk the data (using docling's HybridChunker for semantic chunking), then use a much larger model to generate basic questions for each chunk and structure them as question-and-answer pairs

In [None]:
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from colorama import Fore

import json
from typing import List
from pydantic import BaseModel
from litellm import completion

prompt_template = """You are an expert data curator assisting a machine learning engineer in creating a high-quality instruction tuning dataset. Your task is to transform
    the provided data chunk into diverse question and answer (Q&A) pairs that will be used to fine-tune a language model.

    For each of the {num_records} entries, generate one or two well-structured questions that reflect different aspects of the information in the chunk.
    Ensure a mix of longer and shorter questions, with shorter ones typically containing 1-2 sentences and longer ones spanning up to 3-4 sentences. Each
    Q&A pair should be concise yet informative, capturing key insights from the data.

    Structure your output in JSON format, where each object contains 'question' and 'answer' fields. The JSON structure should look like this:

        {{
            "question": "Your question here...",
            "answer": "Your answer here..."
        }}

    Focus on creating clear, relevant, and varied questions that encourage the model to learn from diverse perspectives. Avoid any sensitive or biased
    content, ensuring answers are accurate and neutral.

    Example:

        {{
            "question": "What is the primary purpose of this dataset?",
            "answer": "This dataset serves as training data for fine-tuning a language model."
        }}

    By following these guidelines, you'll contribute to a robust and effective dataset that enhances the model's performance.

    Data
    {data}
    """

class Record(BaseModel):
    question: str
    answer: str

class Response(BaseModel):
    generated: List[Record]

# Ask the local model to turn one chunk of text into question-answer pairs
def llm_call(data: str, num_records: int = 5) -> dict:
    stream = completion(
        model="ollama_chat/qwen2.5:14b",
        messages=[
            {
                "role": "user",
                # prompt_template is a plain string, so fill its placeholders with str.format
                "content": prompt_template.format(data=data, num_records=num_records),
            }
        ],
        stream=True,
        options={"num_predict": 2000},  # Ollama option: maximum number of tokens to generate
        format=Response.model_json_schema(),  # constrain the output to the Response JSON schema
    )
    output = ""
    for part in stream:
        delta = part.choices[0].delta.content
        if delta is not None:
            print(Fore.LIGHTBLUE_EX + delta + Fore.RESET, end="")
            output += delta
    return json.loads(output)

# Chunk the PDF, then ask the large model to turn each chunk into question-answer pairs
if __name__ == "__main__":
    converter = DocumentConverter()
    doc = converter.convert("tm1_dg_dvlpr-10pages.pdf").document
    chunker = HybridChunker()
    chunks = chunker.chunk(dl_doc=doc)

    dataset = {}
    for i, chunk in enumerate(chunks):
        print(Fore.YELLOW + f"Raw Text:\n{chunk.text[:300]}…" + Fore.RESET)
        # contextualize() returns the chunk text enriched with metadata such as its section headings
        enriched_text = chunker.contextualize(chunk=chunk)
        print(Fore.LIGHTMAGENTA_EX + f"Contextualized Text:\n{enriched_text[:300]}…" + Fore.RESET)

        data = llm_call(enriched_text)
        dataset[i] = {"generated": data["generated"], "context": enriched_text}

    with open('tm1data.json', 'w') as f:
        json.dump(dataset, f)
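Two details in `llm_call` are easy to get wrong: `prompt_template` is a plain string, so it has to be filled with `str.format` (and any literal braces in it, such as a JSON example, must be written as `{{` and `}}`), and the streamed reply is only a JSON string until it is parsed. A minimal stdlib-only sketch of both steps, using a shortened hypothetical template `TEMPLATE` and a hand-written reply in place of a real model call:

```python
import json
from typing import Dict, List

# Hypothetical, shortened template; literal JSON braces are doubled for str.format.
TEMPLATE = (
    "Generate {num_records} Q&A pairs as JSON like "
    '{{"generated": [{{"question": "...", "answer": "..."}}]}}\n'
    "Data\n{data}"
)


def build_prompt(data: str, num_records: int = 5) -> str:
    """Fill the template; each {{ or }} becomes a single literal brace."""
    return TEMPLATE.format(data=data, num_records=num_records)


def parse_records(raw: str) -> List[Dict[str, str]]:
    """Parse the model reply and keep only well-formed question/answer pairs."""
    payload = json.loads(raw)
    records = payload.get("generated", []) if isinstance(payload, dict) else []
    if not isinstance(records, list):
        records = []
    return [
        {"question": r["question"], "answer": r["answer"]}
        for r in records
        if isinstance(r, dict)
        and isinstance(r.get("question"), str)
        and isinstance(r.get("answer"), str)
    ]


prompt = build_prompt("Chunk text goes here.", num_records=2)
print(prompt)

# Stand-in for the streamed reply; the second record has no answer and is dropped.
reply = '{"generated": [{"question": "Q1?", "answer": "A1."}, {"question": "Q2?"}]}'
print(parse_records(reply))
```

With pydantic installed, `Response.model_validate_json(raw)` from the cell above performs a similar check in one call, but it raises a validation error on a malformed record instead of dropping it.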


Step-4: After converting the data into question-and-answer format, reformat the pairs using the chunk context for more accurate fine-tuning

In [None]:
# Prepend each chunk to its questions as context, so the LLM learns the source knowledge along with the question
# i.e. convert {question, answer} into {question: context + question, answer}

import json
import os
from colorama import Fore

instructions = []
with open('tm1data.json', 'r') as f:
    data = json.load(f)
    for chunk in data.values():
        for pair in chunk['generated']:
            question, answer = pair['question'], pair['answer']
            context_pair = {
                'question': f"Context: {chunk['context']}\n\n{question}",
                'answer': answer,
            }
            instructions.append(context_pair)
        print(Fore.YELLOW + str(chunk) + Fore.RESET)
        print('\n~~~~~~~~~~~~~~~~~~~~~')

os.makedirs('data', exist_ok=True)  # the data/ folder must exist before writing into it
with open('data/instruction.json', 'w') as f:
    json.dump(instructions, f)

with open('data/instruction.json', 'r') as f:
    data = json.load(f)
    print(Fore.LIGHTMAGENTA_EX + str(data[:10]) + Fore.RESET)

**To run the fine-tuning on a large GPU: go to runpod.io, find a suitable GPU pod and click Deploy, then copy its SSH command. In your local VS Code, search the command palette at the top for "Remote-SSH: Connect to Host" and paste the SSH command given by the deployed RunPod pod. Whatever code you then run in that VS Code folder executes on the RunPod machine.**

Step-5: Create a chat template for the question-answer data (to convert it into the LLM's native training structure so the model can interpret it easily)

In [None]:
from datasets import load_dataset
from colorama import Fore

dataset = load_dataset("data", split='train')  # the data/ folder holds instruction.json with the question-answer pairs
print(Fore.YELLOW + str(dataset[2]) + Fore.RESET)  # print one training example

# CREATE A CHAT TEMPLATE: the LLM is trained on a specific token structure (system, user and assistant turns), so we restructure our data into that format.
# If we trained directly on raw question-answer text, the inputs would no longer match the format the model learned, so it may misread them and hallucinate more after fine-tuning.

def format_chat_template(batch, tokenizer):
    system_prompt = """You are a helpful, honest and harmless assistant designed to help engineers. Think through each question logically and provide an answer. Don't make things up, if you're unable to answer a question advise the user that you're unable to answer as it is outside of your scope."""

    # Llama 3 style chat template; set it once per batch instead of once per row
    tokenizer.chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

    samples = []

    # Separate questions and answers from the batch
    questions = batch["question"]
    answers = batch["answer"]

    for i in range(len(questions)):
        # Build this row as a list of role/content messages for the chat template
        row_json = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": questions[i]},
            {"role": "assistant", "content": answers[i]}
        ]

        # Render the conversation as one training string and collect it
        text = tokenizer.apply_chat_template(row_json, tokenize=False)
        samples.append(text)

    # Return a dictionary of lists, as expected for batched processing
    return {
        "instruction": questions,
        "response": answers,
        "text": samples  # the rendered chat-template text for each row
    }

# By default, SFTTrainer trains on the "text" column (the chat-template samples), not on the separate question and answer columns
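To see what one training sample looks like after `format_chat_template`, here is a plain-Python sketch that mirrors the Jinja template set on the tokenizer. It assumes the BOS token is `<|begin_of_text|>` (the Llama 3 value) and only reproduces this one template; it is not a substitute for `tokenizer.apply_chat_template`:

```python
from typing import Dict, List

BOS_TOKEN = "<|begin_of_text|>"  # assumption: the Llama 3 tokenizer's BOS token


def render_llama3_style(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Plain-Python mirror of the Jinja chat template used in format_chat_template."""
    parts = []
    for index, message in enumerate(messages):
        # Each turn: role header, blank line, trimmed content, end-of-turn token.
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        if index == 0:
            content = BOS_TOKEN + content  # BOS only before the first message
        parts.append(content)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


sample = render_llama3_style([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA? "},
    {"role": "assistant", "content": "A low-rank fine-tuning method."},
])
print(sample)
```

Every turn ends in `<|eot_id|>`, which is how the model learns where the assistant's answer stops.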

Step-6: Fine-tune the model

In [None]:
# Import dependencies
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, prepare_model_for_kbit_training
import torch

# A small language model (SLM) is usually preferred over a large one here: it is faster, simpler, and far easier to fine-tune for something very niche
base_model = "meta-llama/Llama-3.2-1B"

# The tokenizer is not the LLM; it converts text into the tokens the LLM consumes
tokenizer = AutoTokenizer.from_pretrained(
    base_model,
    trust_remote_code=True,  # allow custom code from the Hub repo to run (not strictly needed for Llama)
    token="HF_TOKEN_HERE",  # Hugging Face access token; the meta-llama repos are gated
)

# Build the training dataset from the question-answer dataset using format_chat_template from the previous cell.
# The tokenizer is passed in because it applies the model-specific chat-template tokens.
train_dataset = dataset.map(
    lambda x: format_chat_template(x, tokenizer),
    num_proc=8,  # number of parallel worker processes; helps mainly with very large datasets
    batched=True,
    batch_size=10,
)

# Quantization configuration (QLoRA style).
# 4-bit quantization stores each frozen base weight with far less precision than 16-bit floats
# (loosely, keeping something like 9.123 of 9.12345678902333341), cutting weight memory about 4x.
# The quantized base weights are not updated; only the small LoRA adapter matrices are trained.
# 2-bit and 1-bit quantization discard much more information and are prone to degrading output quality and increasing hallucinations.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants to save more memory
    bnb_4bit_quant_type="nf4",  # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
)

# Load the model from Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="cuda:0",  # device to load the model on; "auto" also works when a GPU is available
    quantization_config=quant_config,
    token="HF_TOKEN_HERE",
    cache_dir="./workspace",  # local folder where the downloaded model files are cached
)

model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory

# Prepare the quantized model for parameter-efficient (k-bit) training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
peft_config = LoraConfig(
    # A larger rank gives more accurate fine-tunes; too low a rank lowers accuracy and can make the fine-tuning effect negligible.
    r=256,  # rank of the update matrices. EXAMPLE: for an 8x8 weight and rank 2, instead of updating all 8x8 values we train an 8x2 and a 2x8 matrix whose product is the 8x8 update
    lora_alpha=512,  # scaling factor (the update is scaled by lora_alpha / r); commonly set to a multiple of r, larger than r
    lora_dropout=0.05,  # dropout applied during training to reduce overfitting
    target_modules="all-linear",  # apply LoRA to all linear layers (not the output head); also adapting the embedding layer can make inference a nightmare
    task_type="CAUSAL_LM",
)

# In the PEFT config, increasing rank and alpha tends to increase accuracy, at the cost of more trainable parameters

# SFT = supervised fine-tuning
trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="meta-llama/Llama-3.2-1B-SFT",  # the fine-tuned checkpoints are stored here
        num_train_epochs=50,  # 50 is a good start; larger or more complex training data may need more epochs
    ),
    peft_config=peft_config,
)

# Fine-tune the model
trainer.train()

# Save the checkpoint and the model (with PEFT, these contain the LoRA adapter weights)
trainer.save_model('complete_checkpoint')
trainer.model.save_pretrained("final_model")
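The rank, alpha and quantization comments above can be checked numerically. This pure-Python sketch freezes a quantized 8x8 weight and adds a rank-2 LoRA update scaled by `lora_alpha / r`, which is how PEFT scales LoRA by default. For the quantization it uses simple symmetric absmax 4-bit quantization with one scale for the whole matrix; that is not the NF4 scheme bitsandbytes uses (non-uniform levels, one scale per small block), only an illustration of the idea. All names are hypothetical:

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive product of an (n x k) and a (k x m) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def quantize_absmax_4bit(w: Matrix) -> Tuple[List[List[int]], float]:
    """Map each weight to an integer in -7..7 (fits in 4 bits) plus one shared scale."""
    scale = max(abs(x) for row in w for x in row) / 7 or 1.0
    codes = [[max(-7, min(7, round(x / scale))) for x in row] for row in w]
    return codes, scale


def dequantize(codes: List[List[int]], scale: float) -> Matrix:
    """Recover approximate weights from the 4-bit codes."""
    return [[c * scale for c in row] for row in codes]


def lora_param_counts(d_out: int, d_in: int, r: int) -> Tuple[int, int]:
    """Trainable values: full update (d_out x d_in) vs. LoRA factors B (d_out x r) and A (r x d_in)."""
    return d_out * d_in, d_out * r + r * d_in


random.seed(0)
d, r, alpha = 8, 2, 4  # toy sizes matching the 8x8 / rank-2 example; the notebook uses r=256, lora_alpha=512
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

codes, scale = quantize_absmax_4bit(W)  # frozen base weight, stored as 4-bit codes
W_frozen = dequantize(codes, scale)     # what the forward pass computes with

A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                  # d x r, trainable, starts at zero
scaling = alpha / r                                                 # LoRA scaling lora_alpha / r

delta = matmul(B, A)  # d x d update built from only d*r + r*d trainable numbers
W_effective = [[W_frozen[i][j] + scaling * delta[i][j] for j in range(d)] for i in range(d)]

print(lora_param_counts(d, d, r))          # (64, 32)
print(lora_param_counts(2048, 2048, 256))  # (4194304, 1048576)
```

Because `B` starts at zero, the model initially behaves exactly like the quantized base model; training then changes only `A` and `B`, never the 4-bit codes.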

**Step-7: Deploy locally using Ollama. After training, the output contains `adapter_config.json` and `adapter_model.safetensors`; put them in a folder called `checkpoints`, then load them in Ollama to run the model and check the fine-tuning.**
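A minimal sketch of that Ollama step, assuming Ollama's Modelfile `ADAPTER` instruction (which accepts a directory holding a Safetensors LoRA adapter for Llama-family models) and a `FROM` line pointing at the same base model the adapter was trained on. The paths `./Llama-3.2-1B` and `./checkpoints` and the model name `tm1-lora` are hypothetical placeholders:

```python
from pathlib import Path


def write_modelfile(base_model: str, adapter_dir: str, out_path: str = "Modelfile") -> str:
    """Write an Ollama Modelfile that applies a LoRA adapter to a base model."""
    # FROM must be the same base model the adapter was trained on (here the base Llama-3.2-1B).
    # ADAPTER points at the folder holding adapter_config.json and adapter_model.safetensors.
    text = f"FROM {base_model}\nADAPTER {adapter_dir}\n"
    Path(out_path).write_text(text)
    return text


print(write_modelfile("./Llama-3.2-1B", "./checkpoints"))
```

The model can then be built and tried from a terminal with `ollama create tm1-lora -f Modelfile` followed by `ollama run tm1-lora`.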

KEY TAKEAWAYS:

- Increasing rank and alpha in the PEFT config increases accuracy
- More data is not always better; data quality matters most
- Data should be targeted rather than generic
- More epochs are not always better; too many can give diminishing returns through overfitting
- High-quality data does not need as many epochs