# Fine-tune Llama 3.2 to generate Markdown-friendly Python functions

In this notebook, we are going to fine-tune a Llama 3.2 1B model using QLoRA and the [Google Mostly Basic Python Problems](https://huggingface.co/datasets/google-research-datasets/mbpp) dataset.

## 🛠️ Supported Hardware

This notebook can run on a CPU or a GPU.

✅ AMD Instinct™ Accelerators  
✅ AMD Radeon™ RX/PRO Graphics Cards  

Suggested hardware: **AMD Instinct™ Accelerators**. This notebook may not run on a CPU if your system does not have enough memory.

## ⚡ Recommended Software Environment

::::{tab-set}

:::{tab-item} Linux
- [Install Docker container](https://amdresearch.github.io/aup-ai-tutorials//env/env-gpu.html)
- [Install PyTorch](https://amdresearch.github.io/aup-ai-tutorials//env/env-cpu.html)
:::

::::

## 🎯 Goals

- Specialize a model using fine-tuning
- Quantize the model using bitsandbytes
- Define QLoRA parameters
- Fine-tune using `SFTTrainer`

```{seealso}

- This notebook is partially based on the [FluidNumerics](https://www.fluidnumerics.com/) webinar.

- [Fine Tuning Llama 3 on AMD Radeon GPUs](https://webinar.amd.com/Fine-Tuning-Llama-3-on-AMD-Radeon-GPUs/en)

- [Fine-Tuning Llama-3 on AMD Radeon GPU](https://github.com/FluidNumerics/amd-ml-examples/blob/main/fine-tuning-llama-3/train-single-gpu.ipynb)

- [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index) is a Python wrapper library that offers fast and efficient 8-bit and 4-bit quantization of machine learning models.

- [Parameter-Efficient Fine-Tuning](https://huggingface.co/docs/peft/en/index)

```

## Get the Model and Tokenizer

Import the necessary packages.

In [1]:
import torch
from numpy import argmax

from transformers import AutoTokenizer, BitsAndBytesConfig, LlamaForCausalLM, pipeline, TrainingArguments
from peft import LoraConfig, get_peft_model
import evaluate
from trl import SFTConfig, SFTTrainer

Select the GPU if one is available. Note that a consumer GPU may not be able to fine-tune this model if it does not have enough VRAM.

:::{note}
Using a GPU with large memory is recommended.
:::

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Device: {device}")
if device == torch.device("cuda"):
    print(f'Device name: {torch.cuda.get_device_name(0)}')
    # mem_get_info() returns (free, total) in bytes; index 1 is the total device memory
    print(f'GPU available memory: {torch.cuda.mem_get_info()[1]/1024/1024//1024} GB')

Device: cuda
Device name: AMD Instinct MI210
GPU available memory: 63.0 GB


Define the model ID of the Llama 3.2 1B (1 billion parameter) model on Hugging Face. Get the [tokenizer](https://huggingface.co/docs/transformers/v4.46.0/en/model_doc/auto#transformers.AutoTokenizer), set its padding token to the `EOS` token, and set `padding_side` to `right`.

In [None]:
model_id = 'unsloth/Llama-3.2-1B'

my_tokenizer = AutoTokenizer.from_pretrained(model_id)
my_tokenizer.pad_token = my_tokenizer.eos_token
my_tokenizer.padding_side = 'right'

We will use [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) to quantize the model. First, we define the `BitsAndBytesConfig`: 4-bit quantization with the `fp4` data type, nested (double) quantization, and `float16` as the compute type.

In [None]:
fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
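
To get a feel for what this configuration buys, the arithmetic below is a rough sketch of the weight memory before and after 4-bit quantization. It assumes roughly 1.24 billion parameters, one 32-bit quantization constant per block of 64 weights, and a second-level block size of 256 for nested quantization. These values follow the QLoRA paper's description rather than anything measured in this notebook, and the helper names are hypothetical. In practice some layers, such as the embeddings, are typically left unquantized, so the real footprint will be somewhat larger.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits per weight spent on storing quantization constants."""
    if not double_quant:
        # One 32-bit constant is stored for every block of weights.
        return 32 / block_size
    # Nested quantization: the constants are themselves stored in 8 bits, plus
    # one 32-bit constant for every group of `second_block_size` constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 1.24e9  # assumption: approximate parameter count of a 1B-class model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
plain_4bit_gib = weight_memory_gib(NUM_PARAMS, 4 + constant_overhead_bits())
nested_4bit_gib = weight_memory_gib(
    NUM_PARAMS, 4 + constant_overhead_bits(double_quant=True)
)

print(f"float16 weights:             {fp16_gib:.2f} GiB")
print(f"4-bit weights:               {plain_4bit_gib:.2f} GiB")
print(f"4-bit weights, nested quant: {nested_4bit_gib:.2f} GiB")
```

Under these assumptions, nested quantization trims the per-weight overhead of the constants from 0.5 bits to about 0.127 bits, which is a small saving on a 1B model but grows with model size.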

Then we use `transformers.LlamaForCausalLM.from_pretrained` to load the model from Hugging Face and apply the `fp4_config` configuration. We also place the model on the device selected earlier.

In [None]:
quantized_model = LlamaForCausalLM.from_pretrained(
    model_id,
    quantization_config=fp4_config,
    device_map=device,
)

## Sample Prompt

Now, we will evaluate the model with a sample prompt. We define `transformers.pipeline` for `text-generation` using the quantized model.

In [None]:
sample_prompt = (
    r"write a python function to find duplicate numbers in a list"
)

quantized_pipeline = pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=my_tokenizer,
    torch_dtype=torch.float16,
    device_map=device,
)

Device set to use cuda:0

Then we can invoke the model to generate an answer to our prompt. We will also print the generated `sequences`.

:::{tip}
Explore different values of `top_k` and `temperature` and run the prompt twice. What happens if you increase the `temperature`?
:::

In [None]:
sequences = quantized_pipeline(
    text_inputs=sample_prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
    temperature=0.2,
)

for seq in sequences:
    print(f"\nResult:\n{seq['generated_text']}")


Result:
write a python function to find duplicate numbers in a list of integer values
import  from  collections

duplicate_numbers_list = [1,0,3,0,2,3,6]

print(dup_number = [i for a if i!= a[i]] for i in enumerate(a.values()) if  i == 1)

duplicate_numbers_list = [i  for  i in a.values()
                              for  i in enumerate(a)]
print(dup number  of numbers = duplicate_numbers)

print([i for a if i  for  i in enumerate(a)])
```
```
[1,0,3,0,2,3,0]

[0]

duplicate_number  of  number s = [i for  i in enumerate(a)]
```
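
To build some intuition for the `top_k` and `temperature` arguments used above, here is a minimal, library-free sketch of top-k filtering followed by a temperature-scaled softmax. The function name and the toy logits are hypothetical, and the actual implementation inside `transformers` may differ in its details.

```python
import math


def top_k_temperature_probs(logits, top_k, temperature):
    """Return {token_index: probability} after top-k filtering and temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [logits[i] / temperature for i in kept]
    # Numerically stable softmax over the kept logits.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {i: e / total for i, e in zip(kept, exps)}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
cold = top_k_temperature_probs(toy_logits, top_k=3, temperature=0.2)
hot = top_k_temperature_probs(toy_logits, top_k=3, temperature=2.0)

print("temperature=0.2:", {i: round(p, 3) for i, p in cold.items()})
print("temperature=2.0:", {i: round(p, 3) for i, p in hot.items()})
```

With a low temperature almost all of the probability mass sits on the highest-scoring token, so repeated runs tend to produce similar text. A higher temperature spreads the mass over the remaining `top_k` candidates, which makes the output more varied.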

## Define fine-tune parameters

Now, to fine-tune the model we will use the Low-Rank Adaptation (LoRA) technique. Instead of modifying the model weights themselves, LoRA adds a small number of extra parameters (low-rank adapter matrices) and updates only those during fine-tuning. For more information, check the [PEFT LoRA guide](https://huggingface.co/docs/peft/main/en/developer_guides/lora).

We can define the LoRA configuration with `peft.LoraConfig`:
- `r`: rank of the adaptation matrices
- `lora_alpha`: scaling factor that controls how strongly the adaptation layer affects the base model ([see Section 4.1](https://arxiv.org/abs/2106.09685))
- `lora_dropout`: dropout probability applied in the adapter layers
- `bias`: which bias parameters are trained (`"none"` trains no biases)
- `task_type`: task type, see [TaskType](https://huggingface.co/docs/peft/en/package_reference/peft_types#peft.TaskType)
- `target_modules`: which modules the adapter layers are applied to

In [None]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "up_proj",
        "down_proj",
        "gate_proj",
        "k_proj",
        "q_proj",
        "v_proj",
        "o_proj",
    ],
)
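
As a rough sketch of what this configuration adds, the snippet below counts the LoRA parameters for each targeted projection. Each adapter consists of two matrices, `A` with shape `r x in_features` and `B` with shape `out_features x r`. The layer dimensions (hidden size 2048, MLP size 8192, key/value projection size 512, 16 decoder layers) are assumptions about Llama 3.2 1B and are not read from the loaded model; the helper name is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed Llama 3.2 1B dimensions (not read from the model in this notebook).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 8192, 512, 16

# (in_features, out_features) of each targeted projection.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

r, lora_alpha = 16, 16
per_layer = sum(lora_param_count(i, o, r) for i, o in target_shapes.values())
total_trainable = per_layer * NUM_LAYERS
scaling = lora_alpha / r  # factor applied to the adapter output

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in total:          {total_trainable:,}")
print(f"Adapter scaling (lora_alpha / r):  {scaling}")
```

If the assumed dimensions hold, the adapters amount to roughly 1% of the base model's parameters. On the real model, `adapted_model.print_trainable_parameters()` from PEFT reports the actual figure.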

With this configuration, we can define our `adapted_model`, the model we will fine-tune, and our `adapted_pipeline`.

In [None]:
adapted_model = get_peft_model(quantized_model, lora_config)

adapted_pipeline = pipeline(
    "text-generation",
    model=adapted_model,
    tokenizer=my_tokenizer,
    device_map=device,
)

Device set to use cuda:0


Let's run the `sample_prompt` on the adapted model.

:::{tip}
Do you notice anything different from the original model?
:::

In [None]:
sequences = adapted_pipeline(
    text_inputs=sample_prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
    temperature=0.2
)

for seq in sequences:
    print(f"\nResult:\n{seq['generated_text']}")


Result:
write a python function to find duplicate numbers in a list

def find_duplicate_numbers(numbers):
  """Find duplicate numbers in a list.
  :param numbers: a list of numbers to search for duplicates.
  :returns: a list of numbers that are duplicates.
  """
  duplicates = []
  for i in range(len(numbers)):
    if numbers[i] == numbers[i+1]:
      duplicates.append(numbers[i])
  return duplicates


Result:
write a python function to find duplicate numbers in a list of integers
You can use the built-in function to find duplicates in Python. The function is named find_dublicates and it is declared inside the Python standard library.
The find_dublicates function takes a list of integers as its argument. It then uses a for loop to iterate over the list and checks if each integer is equal to any of the other integers in the list. If it is, then the function returns a boolean True, which means that there are duplicate numbers in the list. Otherwise, the function returns a boolean False

## Get Dataset to fine-tune model

We are going to use the [Google Mostly Basic Python Problems](https://huggingface.co/datasets/google-research-datasets/mbpp) dataset. Although large language models are already very good at Python, the idea of this example is to fine-tune the model to provide its output in a particular style. It may be possible to get similar results with prompt-engineering techniques; however, the goal of this notebook is to show an example of fine-tuning.

Load the dataset and print it.


:::{note}
By executing the next cell, you will download the dataset `google-research-datasets/mbpp`. By doing so, you agree to its license and to obtaining permission to use it from the dataset owner if needed.
:::

In [10]:
from datasets import load_dataset

google_python = load_dataset("google-research-datasets/mbpp", "sanitized")

print(google_python)

DatasetDict({
    train: Dataset({
        features: ['source_file', 'task_id', 'prompt', 'code', 'test_imports', 'test_list'],
        num_rows: 120
    })
    test: Dataset({
        features: ['source_file', 'task_id', 'prompt', 'code', 'test_imports', 'test_list'],
        num_rows: 257
    })
    validation: Dataset({
        features: ['source_file', 'task_id', 'prompt', 'code', 'test_imports', 'test_list'],
        num_rows: 43
    })
    prompt: Dataset({
        features: ['source_file', 'task_id', 'prompt', 'code', 'test_imports', 'test_list'],
        num_rows: 7
    })
})


We are now going to define the output format that we want to fine-tune the model on, using [chat templates](https://huggingface.co/blog/chat-templates). The task is to fine-tune the model so that its Python output is Markdown-friendly, i.e., the code is wrapped in fenced code snippets.

The function `instructify` receives the `qr_row` dictionary that contains the `prompt`, `code` and `test_list`. We define the `qr_json` template with the `user` and `assistant` roles. The `user` role contains the `prompt`, and the `assistant` role contains the Python `code` as a snippet followed by the test list. Finally, we call `apply_chat_template` on the list of role dictionaries, store the result under the `text` key, and return `qr_row`.

In [None]:
def instructify(qr_row):
    qr_json = [
        {
            "role": "user",
            "content": qr_row["prompt"],
        },
        {
            "role": "assistant",
            "content": f'''
```python
{qr_row["code"]}
```

Test List:

```python
test_list={qr_row["test_list"]}
```
''',
        },
    ]

    qr_row["text"] = my_tokenizer.apply_chat_template(qr_json, tokenize=False)
    return qr_row

We will define the chat template. Check the Llama 3 prompt formats [here](https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/). Concatenating the query and the response is sufficient for our use case.

In [None]:
my_tokenizer.chat_template = """{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = message['content'] | trim + '\n' %}{{ content }}{% endfor %}"""

print(my_tokenizer.chat_template)

{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = message['content'] | trim + '
' %}{{ content }}{% endfor %}
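
To make the effect of this template concrete, the sketch below reproduces it in plain Python: each message's content is stripped of surrounding whitespace, a newline is appended, and the pieces are concatenated while the roles are ignored. The `render_messages` helper and the toy messages are hypothetical; the tokenizer itself renders the template with Jinja.

```python
def render_messages(messages):
    """Plain-Python equivalent of the template: trim each content, add a newline."""
    return "".join(message["content"].strip() + "\n" for message in messages)


toy_messages = [
    {"role": "user", "content": "Write a python function to add two numbers."},
    {"role": "assistant", "content": "\ndef add(a, b):\n  return a + b\n"},
]

rendered = render_messages(toy_messages)
print(rendered)
```

Because no role markers or special tokens are emitted, the model is simply trained on the prompt followed directly by the formatted answer.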


We can now apply the chat template to our dataset.

In [None]:
formatted_dataset = google_python.map(instructify)

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

Map:   0%|          | 0/257 [00:00<?, ? examples/s]

Map:   0%|          | 0/43 [00:00<?, ? examples/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

Display one example. You can see that the dataset is now formatted to show code snippets (```).

In [14]:
print(formatted_dataset["train"][0]["text"])

Write a python function to find the first repeated character in a given string.
```python
def first_repeated_char(str1):
  for index,c in enumerate(str1):
    if str1[:index+1].count(c) > 1:
      return c
```

Test List:

```python
test_list=['assert first_repeated_char("abcabc") == "a"', 'assert first_repeated_char("abc") == None', 'assert first_repeated_char("123123") == "1"']
```



Display the same content using the `IPython.display.Markdown` visualization

In [None]:
from IPython.display import display, Markdown
Markdown(formatted_dataset["train"][0]["text"])

Write a python function to find the first repeated character in a given string.
```python
def first_repeated_char(str1):
  for index,c in enumerate(str1):
    if str1[:index+1].count(c) > 1:
      return c
```

Test List:

```python
test_list=['assert first_repeated_char("abcabc") == "a"', 'assert first_repeated_char("abc") == None', 'assert first_repeated_char("123123") == "1"']
```


Let's run an example prompt from the test split on the adapted model and observe the output. Although we see a code snippet, the test list is not there.

In [17]:
example_prompt = formatted_dataset["test"][0]["prompt"]

sequences = adapted_pipeline(
    text_inputs=example_prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
)

for seq in sequences:
    print(f"\nResult:\n{seq['generated_text']}")


Result:
Write a python function to remove first and last occurrence of a given character from the string. 
The function should return a new string with the given character removed from the string.

For example, if the string is 'hello', the function should return 'hello' and if the string is 'world', the function should return 'w'.

Hint: use the `count` method to count the number of occurrences of a character in the string and then remove the first and last occurrences of the character.
```python
string = 'hello'
char = 'o'
result = string.count(char)
print(f"String '{string}' has character '{char}' {result} times")
```


## 🚀 Fine-tune the Adapted Model

We now define the metric that will be used to [evaluate](https://huggingface.co/docs/evaluate/package_reference/loading_methods#evaluate.load) the fine-tuned model; we will use [accuracy](https://huggingface.co/docs/evaluate/v0.4.0/en/types_of_evaluations#metrics). The `compute_metrics` function converts the model logits into predictions and computes this metric.

In [None]:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
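
For a causal language model, the prediction at position `i` is normally compared with the label at position `i + 1`, and padded positions are commonly labeled `-100` so that they can be ignored. The sketch below shows that idea on plain Python lists. It is an illustration under those assumptions, with a hypothetical helper name, and is not the metric used by the trainer in this notebook.

```python
def token_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Fraction of next-token predictions that match the labels, skipping ignored positions."""
    correct = 0
    total = 0
    for preds, labels in zip(predicted_ids, label_ids):
        # The prediction at position i targets the label at position i + 1.
        for pred, label in zip(preds[:-1], labels[1:]):
            if label == ignore_index:
                continue  # padded or masked position
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Toy batch with one sequence of four positions; the last label is padding.
toy_labels = [[1, 5, 7, -100]]
perfect_predictions = [[5, 7, 9, 0]]
half_right_predictions = [[5, 8, 9, 0]]

print(token_accuracy(perfect_predictions, toy_labels))
print(token_accuracy(half_right_predictions, toy_labels))
```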

We also need to tokenize the dataset before it can be consumed during training.

In [None]:
def tokenize_dataset(dataset, tokenizer, text_field):
    def tokenize_function(examples):
        return tokenizer(examples[text_field], truncation=True, padding=True)

    return dataset.map(tokenize_function, batched=True)

tokenized_train_dataset = tokenize_dataset(formatted_dataset["train"], my_tokenizer, "text")
tokenized_eval_dataset = tokenize_dataset(formatted_dataset["test"], my_tokenizer, "text")

Let's define our training configuration. We do this with `trl.SFTConfig`; some of the most relevant arguments are listed below:
- `per_device_train_batch_size`: training batch size per device
- `per_device_eval_batch_size`: evaluation batch size per device
- `gradient_accumulation_steps`: number of batches whose gradients are accumulated before each optimizer update
- `optim`: optimizer type
- `num_train_epochs`: number of training epochs
- `eval_steps`: interval between evaluations; a float in the range [0, 1) is interpreted as a fraction of the total training steps
- `logging_steps`: how often the trainer logs progress
- `warmup_steps`: number of learning-rate warmup steps
- `learning_rate`: initial learning rate
- `fp16`: we use `fp16` mixed-precision training
- `group_by_length`: group samples of similar length into the same batch

In [None]:
sft_config = SFTConfig(
    output_dir="Llama-Python-Single-GPU",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    optim="paged_adamw_8bit",
    num_train_epochs=20,
    eval_steps=0.5,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=1e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
    max_seq_length=512,
)
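
As a quick sanity check on this configuration, the arithmetic below estimates how many optimizer steps the run will take on a single device. It uses the 120 training rows reported when the dataset was loaded, and the result agrees with the `global_step=300` that `trainer.train()` reports later in this notebook. The variable names are only illustrative.

```python
import math

num_train_rows = 120          # size of the MBPP "sanitized" train split
per_device_batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 20
warmup_steps = 10

# Batches per epoch, then optimizer updates per epoch after accumulation.
batches_per_epoch = math.ceil(num_train_rows / per_device_batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = updates_per_epoch * num_train_epochs

print(f"Optimizer steps per epoch: {updates_per_epoch}")
print(f"Total optimizer steps:     {total_steps}")
print(f"Warmup share of training:  {warmup_steps / total_steps:.1%}")
```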

With the configuration defined, we can finally create the `trl.SFTTrainer` that will help us with the fine-tuning.
We initialize it with the `adapted_model`, the tokenized train and eval datasets, the `SFTConfig` and the `lora_config`.

In [None]:
trainer = SFTTrainer(
    model=adapted_model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    args=sft_config,
    peft_config=lora_config,
)

Finally, we can call the `.train()` method to start fine tuning the model.

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 1 | 3.5695 |
| 2 | 2.8269 |
| 3 | 3.3778 |
| 4 | 3.8339 |
| 5 | 3.58 |
| 6 | 3.1383 |
| 7 | 1.1863 |
| 8 | 0.8321 |
| 9 | 0.8777 |
| 10 | 0.707 |


TrainOutput(global_step=300, training_loss=0.2689305164354543, metrics={'train_runtime': 192.0013, 'train_samples_per_second': 12.5, 'train_steps_per_second': 1.562, 'total_flos': 6322328115609600.0, 'train_loss': 0.2689305164354543})

You can decide to save the model.

In [None]:
save_model = False
if save_model:
    trainer.save_model()

## Evaluate Fine-tuned Model

After fine-tuning, we can evaluate whether we achieved the desired outcome. Let us define a different prompt and invoke the fine-tuned model.

In [None]:
example_prompt = r"write a python function that returns the least common denominator of all elements in a list."

sequences = adapted_pipeline(
    text_inputs=example_prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
    temperature=0.2
)

Display the generated text using the Markdown display.

In [28]:
Markdown(sequences[0]["generated_text"])

write a python function that returns the least common denominator of all elements in a list. https://www.geeksforgeeks.org/least-common-denominator/
```python
def lcm_of_elements(arr):
    (left, right) = (arr[0], arr[-1])
    for m in (le, rt):
        if (m == left or m == m * right / m):
            return m
        else:
            return m
```

Test List:

```python
test_list=['assert lcm_of_elements([2,2,1])->1', 'assert lcm_of_elements([1,5,7,1])->5', 'assert lcm_of_elements([12,45,67,12])->45']
```


## Summary

In this notebook you quantized a Llama model and then added LoRA adapters so that the model could be trained on a custom dataset. You also defined a chat template that shaped the training examples.

Now, you may be wondering how much bigger the adapted model is. Let's have a look.

In [None]:
from torchinfo import summary

In [None]:
# A language model expects integer token IDs of shape (batch, sequence length)
model_quant = summary(quantized_model, input_size=(1, 112), dtypes=[torch.long], col_names=["input_size", "output_size", "num_params", "mult_adds", "trainable"])
model_quant

In [None]:
adapt_model_quant = summary(adapted_model, input_size=(1, 112), dtypes=[torch.long], col_names=["input_size", "output_size", "num_params", "mult_adds", "trainable"])
adapt_model_quant

----------
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved. Portions of this file consist of AI-generated content.

SPDX-License-Identifier: MIT