# Finetune Llama-3 with LLaMA Factory

Please use a **free** Tesla T4 Colab GPU to run this!

Project homepage: https://github.com/hiyouga/LLaMA-Factory

## Install Dependencies

In [1]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 357, done.
remote: Counting objects: 100% (357/357), done.
remote: Compressing objects: 100% (274/274), done.
remote: Total 357 (delta 72), reused 307 (delta 68), pack-reused 0 (from 0)
Receiving objects: 100% (357/357), 9.74 MiB | 20.11 MiB/s, done.
Resolving deltas: 100% (72/72), done.
/content/LLaMA-Factory
assets/       evaluation/  MANIFEST.in     requirements.txt  tests/
CITATION.cff  examples/    pyproject.toml  scripts/
data/         LICENSE      README.md       setup.py
docker/       Makefile     README_zh.md    src/
Obtaining file:///content/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ...

### Check GPU environment

In [1]:
import torch
try:
  assert torch.cuda.is_available() is True
except AssertionError:
  print("Please set up a GPU before using LLaMA Factory: https://medium.com/mlearning-ai/training-yolov4-on-google-colab-316f8fff99c6")

## Using our Custom Dataset

In [2]:
import json

# Paths
INPUT_PATH  = "/content/LLaMA-Factory/data/dataset.json"
OUTPUT_PATH = "/content/LLaMA-Factory/data/dataset_new.json"

# Template with a placeholder for feedback_prompt
INSTRUCTION_TEMPLATE = """
You are reviewing a quiz designed to assess students' theoretical understanding and practical application of topics taught in class.

EVALUATE it STRICTLY based on the following criteria. Assign a score out of 10 for each and justify with 1–2 sentences.

For each criterion, do the following:
1. Give a score out of 10.
2. Justify the score with 1–2 sentences.

1. **Clarity and Relevance**:
  - Are the questions clearly worded and free from ambiguity?
  - Are they appropriate for the level of the course and relevant to topics taught?
  - Do they reflect the expected knowledge and skill level of students?

2. **Coverage of Concepts**:
  - Does the quiz cover a diverse and representative set of concepts taught?
  - Are both theoretical and practical aspects of the topic included?
  - Does it balance breadth and depth appropriately?

3. **Question Quality and Structure**:
  - Are MCQs structured well with plausible distractors?
  - Are True/False statements precise and unambiguous?
  - Are short/medium questions open-ended enough to assess understanding, but focused enough to guide students?

4. **Cognitive Depth and Usefulness**:
  - Do questions vary in difficulty and promote higher-order thinking (not just recall)?
  - Are there any case-based or real-world application questions?
  - Does it test understanding, analysis, and application?

5. **Task Redundancy / Overlap**:
  - Tasks should be distinct and may be divided into subtasks if complex.
  - Avoid repetition and ensure flow and progression in learning.

6. **Feedback Incorporation**:
  {feedback_prompt}

At the end, write the following in one line exactly:
[[[REVIEW_SCHEME]]] = {{ 'clarity': CLARITY_SCORE,
                           'coverage': COVERAGE_SCORE,
                           'structure': STRUCTURE_SCORE,
                           'overlap': OVERLAP_SCORE,
                           'depth': DEPTH_SCORE,
                           'feedback': FEEDBACK_SCORE }}
""".strip()

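def wrap_for_alpaca():
    """Convert the raw dataset into alpaca-style instruction/input/output records."""
    with open(INPUT_PATH, encoding="utf-8") as f:
        raw = json.load(f)["dataset"]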

    wrapped = []
    for item in raw:
        # Fill in the feedback prompt inside the instruction
        instruction = INSTRUCTION_TEMPLATE.format(
            feedback_prompt=item["input"]["feedback_prompt"].strip()
        )
        # Keep the quiz itself as the input
        input_text = item["input"]["quiz"].strip()

        wrapped.append({
            "instruction": instruction,
            "input": input_text,
            "output": item["output"].strip()
        })

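    # Write UTF-8 explicitly so the dashes in the template are stored as-is
    with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(wrapped, f, indent=2, ensure_ascii=False)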
    print(f"🎉 Written {len(wrapped)} examples to {OUTPUT_PATH}")


wrap_for_alpaca()


🎉 Written 68 examples to /content/LLaMA-Factory/data/dataset_new.json
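
The training configuration in the next section refers to this file by the dataset name `dataset_new`. LLaMA Factory looks dataset names up in `data/dataset_info.json`, so the new file presumably has to be registered there before training; the notebook does not show that step. A minimal sketch, assuming the default alpaca column names (`instruction`, `input`, `output`) so that no column mapping is needed; `register_dataset` is a hypothetical helper, not part of LLaMA Factory:

```python
import json
from pathlib import Path


def register_dataset(info_path, name, file_name):
    """Add (or overwrite) one entry in a dataset_info.json registry."""
    info_path = Path(info_path)
    # Start from the existing registry if there is one
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        info = {}
    # Assumption: alpaca-style files only need a file name, because the
    # columns default to instruction / input / output
    info[name] = {"file_name": file_name}
    info_path.write_text(
        json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return info[name]


# In the notebook this would be called as:
# register_dataset("/content/LLaMA-Factory/data/dataset_info.json",
#                  "dataset_new", "dataset_new.json")
```

The helper keeps every existing entry and only touches the one key, so it can be re-run safely after regenerating the dataset.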


## Fine-tuning model via Command Line

Training takes about 20 minutes.

In [4]:
import json

args = dict(
  stage="sft",                                               # do supervised fine-tuning
  do_train=True,
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  dataset="dataset_new",                                     # use the custom dataset prepared above
  template="llama3",                                         # use llama3 prompt template
  finetuning_type="lora",                                    # use LoRA adapters to save memory
  lora_target="all",                                         # attach LoRA adapters to all linear layers
  output_dir="llama3_lora",                                  # the path to save LoRA adapters
  per_device_train_batch_size=2,                             # the micro batch size
  gradient_accumulation_steps=4,                             # the gradient accumulation steps
  lr_scheduler_type="cosine",                                # use cosine learning rate scheduler
  logging_steps=5,                                           # log every 5 steps
  warmup_ratio=0.1,                                          # use warmup scheduler
  save_steps=1000,                                           # save checkpoint every 1000 steps
  learning_rate=5e-5,                                        # the learning rate
  num_train_epochs=3.0,                                      # the epochs of training
  max_samples=500,                                           # use at most 500 examples per dataset
  max_grad_norm=1.0,                                         # clip gradient norm to 1.0
  loraplus_lr_ratio=16.0,                                    # use LoRA+ algorithm with lambda=16.0
  fp16=True,                                                 # use float16 mixed precision training
  report_to="none",                                          # disable wandb logging
)

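# Write the config with a context manager so the file is flushed and closed before training starts
with open("train_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)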

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_llama3.json

/content/LLaMA-Factory
2025-05-03 11:39:25.843599: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746272366.125504    1507 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746272366.198684    1507 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-03 11:39:26.782831: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[INFO|2025-05-03 11:39:39] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, 
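
For a sense of scale, the length of the run follows from the values above: 68 examples, a micro batch size of 2, and 4 gradient accumulation steps on a single GPU. A back-of-the-envelope sketch (the helper name is hypothetical, and the trainer's own rounding may differ by a step or two):

```python
import math


def estimate_optimizer_steps(num_examples, micro_batch, grad_accum, epochs, num_devices=1):
    """Rough count of optimizer updates for a training run."""
    # Examples consumed per optimizer update
    effective_batch = micro_batch * grad_accum * num_devices
    # A partial final batch still counts as one update in this estimate
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, math.ceil(steps_per_epoch * epochs)


effective_batch, total_steps = estimate_optimizer_steps(68, 2, 4, 3.0)
print(effective_batch, total_steps)  # 8 27
```

With only about 27 updates, `save_steps=1000` is never reached during the run, so the adapter in `llama3_lora` would come from the save at the end of training rather than from an intermediate checkpoint.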

## Infer the fine-tuned model

In [5]:
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",                        # load the saved LoRA adapters
  template="llama3",                                         # same as the one used in training
  finetuning_type="lora",                                    # same as the one used in training
)
chat_model = ChatModel(args)

/content/LLaMA-Factory


[INFO|tokenization_utils_base.py:2060] 2025-05-03 11:52:47,159 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/tokenizer.json
[INFO|tokenization_utils_base.py:2060] 2025-05-03 11:52:47,160 >> loading file tokenizer.model from cache at None
[INFO|tokenization_utils_base.py:2060] 2025-05-03 11:52:47,161 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2060] 2025-05-03 11:52:47,161 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/special_tokens_map.json
[INFO|tokenization_utils_base.py:2060] 2025-05-03 11:52:47,162 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/tokenizer_con

[INFO|2025-05-03 11:52:48] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.


[INFO|configuration_utils.py:693] 2025-05-03 11:52:48,848 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/config.json
[INFO|configuration_utils.py:765] 2025-05-03 11:52:48,850 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128255,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
  

[INFO|2025-05-03 11:52:48] llamafactory.model.model_utils.quantization:143 >> Loading ?-bit BITSANDBYTES-quantized model.
[INFO|2025-05-03 11:52:48] llamafactory.model.model_utils.kv_cache:143 >> KV cache is enabled for faster generation.


[INFO|quantization_config.py:436] 2025-05-03 11:52:48,937 >> Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
[INFO|modeling_utils.py:1124] 2025-05-03 11:52:49,001 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/model.safetensors
[INFO|modeling_utils.py:2167] 2025-05-03 11:52:49,005 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1142] 2025-05-03 11:52:49,009 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "pad_token_id": 128255
}

[INFO|quantizer_bnb_4bit.py:124] 2025-05-03 11:52:49,192 >> target_dtype {target_dtype} is replaced by `CustomDtype.INT4` for 4-bit BnB quantization
[INFO|modeling_utils.py:4930] 2025-05-03 11:53:05,822 >> All mod

[INFO|2025-05-03 11:53:06] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-03 11:53:06] llamafactory.model.adapter:143 >> Loaded adapter(s): llama3_lora
[INFO|2025-05-03 11:53:06] llamafactory.model.loader:143 >> all params: 8,051,232,768


In [12]:
def query_chat_model_once(query: str, chat_model) -> str:
    """
    Query the LLaMA-Factory chat model once with a given input string and return the response.

    Args:
        query (str): The input query to send to the model.
        chat_model (ChatModel): An already-loaded ChatModel (base model plus LoRA adapters).

    Returns:
        str: The model's response to the query.
    """
    if not query.strip():
        return "Empty query provided."

    # Prepare the message
    messages = [{"role": "user", "content": query}]

    # Generate response using streaming
    response = ""
    for new_text in chat_model.stream_chat(messages):
        response += new_text

    # Clean up memory
    torch_gc()

    return response

quiz = '''
Quiz on Gradient-Based Optimization\n\n1. MCQ: Which technique helps escape shallow local minima?\nA) SGD with momentum\nB) Batch gradient descent\nC) L-BFGS\nD) Early stopping\nCorrect: A) SGD with momentum. Reason: Momentum carries past gradients.\n\n2. True/False: Adam optimizer adapts learning rates per parameter.\nTrue. Reason: Adam uses estimates of first and second moments.\n\n
'''
feedback_prompt = "This is the first draft so give full marks only for feedback incorporation (10/10). Other metrics CANNOT have more than 7 marks and need extensive criticism for improvement."
prompt = f'''
  You are a meticulous and brutally honest educator tasked with dissecting and evaluating the quality of a student-designed quiz meant to assess theoretical understanding and practical application of topics taught in class.
 Your job is not to merely review but to interrogate every choice made in the quiz—question wording, content selection, cognitive depth, and structure—with a fine-toothed comb.
 Challenge every assumption. Be hyper-critical and assume nothing is adequate unless proven through rigor, clarity, and flawless execution.
 Expose ambiguity, shallowness, redundancy, poor alignment with learning objectives, and any missed opportunities—no matter how subtle.
 Even if the quiz seems serviceable, your goal is to highlight weaknesses in coverage, depth, construction, and learning value.
 Never default to leniency. **Demand perfection, especially in the absence of prior feedback.**
 The quiz content is below:
        =====================
        {quiz}
        =====================

        EVALUATE it STRICTLY based on the following criteria. Assign a score out of 10 for each and justify with detail.

        For each criterion, do the following:
        1. Give a score out of 10.
        2. Justify the score in detail.

        1. **Clarity and Relevance**:
          - Are the questions clearly worded and free from ambiguity?
          - Are they appropriate for the level of the course and relevant to topics taught?
          - Do they reflect the expected knowledge and skill level of students?

        2. **Coverage of Concepts**:
          - Does the quiz cover a diverse and representative set of concepts taught?
          - Are both theoretical and practical aspects of the topic included?
          - Does it balance breadth and depth appropriately?

        3. **Question Quality and Structure**:
          - Does the quiz contain the required 3-5 MCQs, 2-3 True/False statements, and 4-5 short/medium-length questions?
          - Are MCQs structured well with plausible distractors?
          - Are True/False statements precise and unambiguous?
          - Are short/medium questions open-ended enough to assess understanding, but focused enough to guide students?

        4. **Cognitive Depth and Usefulness**:
          - Do questions vary in difficulty and promote higher-order thinking (not just recall)?
          - Are there any case-based or real-world application questions?
          - Does it test understanding, analysis, and application?

        5. **Task Redundancy / Overlap**:
          - Tasks should be distinct and may be divided into subtasks if complex.
          - Avoid repetition and ensure flow and progression in learning.

        6. **Feedback Incorporation**:
          - {feedback_prompt}
          - If this is the first draft (i.e. no previous feedback), be EXTREMELY CRITICAL in other fields. ONLY give scores above 7 for criteria that are exceptionally well-executed with zero flaws. Assume perfection is expected on first draft to drive improvement.

         At the end, write the following in one line ... [[[REVIEW_SCHEME]]] = {{ 'clarity': CLARITY_SCORE, 'coverage': COVERAGE_SCORE, 'structure': STRUCTURE_SCORE, 'overlap': OVERLAP_SCORE, 'depth': DEPTH_SCORE, 'feedback': FEEDBACK_SCORE }}
'''
query = prompt
response = query_chat_model_once(query, chat_model)
print(f"User: {query}")
print(f"Assistant: {response}")

User: 
  You are a meticulous and brutally honest educator tasked with dissecting and evaluating the quality of a student-designed quiz meant to assess theoretical understanding and practical application of topics taught in class.
 Your job is not to merely review but to interrogate every choice made in the quiz—question wording, content selection, cognitive depth, and structure—with a fine-toothed comb.
 Challenge every assumption. Be hyper-critical and assume nothing is adequate unless proven through rigor, clarity, and flawless execution.
 Expose ambiguity, shallowness, redundancy, poor alignment with learning objectives, and any missed opportunities—no matter how subtle.
 Even if the quiz seems serviceable, your goal is to highlight weaknesses in coverage, depth, construction, and learning value.
 Never default to leniency. **Demand perfection, especially in the absence of prior feedback.**
 The quiz content is below:
        
Quiz on Gradient-Based Optimization

1. MCQ: Which tech
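
The prompt asks the model to finish with a `[[[REVIEW_SCHEME]]] = {...}` line, so the scores can be pulled out of `response` mechanically. A minimal sketch, assuming the model emits the dictionary as a Python-style literal with numeric scores; `parse_review_scheme` is a hypothetical helper, not part of LLaMA Factory:

```python
import ast
import re

# Capture the flat {...} literal that follows the REVIEW_SCHEME marker
SCHEME_RE = re.compile(r"\[\[\[REVIEW_SCHEME\]\]\]\s*=\s*(\{.*?\})", re.DOTALL)


def parse_review_scheme(text):
    """Return the score dict from the last REVIEW_SCHEME line, or None if absent or malformed."""
    matches = SCHEME_RE.findall(text)
    if not matches:
        return None
    try:
        # literal_eval only accepts literals, so arbitrary code is never executed
        scores = ast.literal_eval(matches[-1])
    except (ValueError, SyntaxError):
        # e.g. the model echoed placeholder names instead of numbers
        return None
    return scores if isinstance(scores, dict) else None


sample = (
    "Some review text...\n"
    "[[[REVIEW_SCHEME]]] = { 'clarity': 6, 'coverage': 5, 'structure': 4, "
    "'overlap': 7, 'depth': 5, 'feedback': 10 }"
)
print(parse_review_scheme(sample))
```

Returning `None` instead of raising lets the caller decide whether to retry the query when the model does not follow the output format.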

After training, the adapter weights and config files were downloaded from the `llama3_lora` folder and uploaded to [Hugging Face Models](https://huggingface.co/mahmad1882/llama3-8b-instruct-verification-lora). For inference in the verification step of quiz generation, we use Together AI's serverless inference API, whose free tier supports custom adapters.
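
Before uploading, it can help to confirm that the output folder actually contains the adapter. A small sketch, assuming the usual PEFT file names (`adapter_config.json` plus `adapter_model.safetensors`, or the older `adapter_model.bin`); `missing_adapter_files` is a hypothetical helper:

```python
from pathlib import Path


def missing_adapter_files(folder):
    """List the expected LoRA adapter files that are absent from `folder`."""
    folder = Path(folder)
    missing = []
    if not (folder / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    # Assumption: weights are stored as safetensors, or as .bin in older PEFT versions
    weight_names = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((folder / name).is_file() for name in weight_names):
        missing.append("adapter_model.safetensors")
    return missing


# An empty list means the folder looks complete
print(missing_adapter_files("llama3_lora"))
```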