# Fine-tuning ChatKokkos Example Using a Standalone Notebook

These are the steps taken to fine-tune ChatKokkos. They are based on the steps developed by Pedro in [Fine-Tuning CodeLLama for Kokkos](https://docs.google.com/document/d/1u_r9PKUYYV_n5vte4oHDeZiPjUa_hnCS-pqdoB8YmF4/edit?tab=t.0) and on the [Hugging Face PEFT Adapter Training Guide](https://huggingface.co/docs/transformers/en/peft).

In [1]:
# Save package state
!pip freeze > requirements-lock.txt

## Load Libraries

In [2]:
import os

# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import sys
from datetime import datetime

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq, Trainer, TrainingArguments

## Load Dataset

In [3]:
from datasets import load_dataset

data_files = "/home/7ry/Data/ellora/ChatKokkos-data/kokkos_dataset_before_reinforcement.json"

train_dataset = load_dataset("json", data_files=data_files, split="train")
# Note: this reuses the training split for evaluation.
eval_dataset = load_dataset("json", data_files=data_files, split="train")
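
Because `train_dataset` and `eval_dataset` come from the same `train` split, the validation loss reported during training is measured on examples the model has already seen. If a held-out estimate is wanted, the records could be split before tokenizing. The sketch below uses only the standard library; `split_records`, `eval_fraction`, and the toy records are hypothetical names for illustration and are not part of the original notebook.

```python
import random


def split_records(records, eval_fraction=0.1, seed=42):
    """Shuffle a copy of records and split it into (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Toy records with the same keys as the Kokkos dataset (contents are made up).
records = [{"question": f"q{i}", "context": "c", "answer": f"a{i}"} for i in range(64)]
train_records, eval_records = split_records(records)
print(len(train_records), len(eval_records))
```

With the `datasets` library, `Dataset.train_test_split(test_size=..., seed=...)` performs a comparable split directly on the loaded dataset.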

## Load Model

In [4]:
# Load model directly

# base_model_path = "meta-llama/CodeLlama-7b-hf"
# base_model_path = "codellama/CodeLlama-7b-hf"
# base_model_path = "/home/7ry/Data/ellora/models/meta-llama/CodeLlama-7b-hf"
base_model_path = "/auto/projects/ChatHPC/models/cache/meta-llama/CodeLlama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(base_model_path)

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map="auto",
    # device_map={'':torch.cuda.current_device()}
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Test base model

In [5]:
eval_prompt = """You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Which kind of Kokkos views are?

### Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=700)[0]
    stop = tokenizer.eos_token_id
    if stop in output:
        print("stop found")
    print(tokenizer.decode(output))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Which kind of Kokkos views are?

### Answer:
Kokkos views are views that are created by the Kokkos library.

### Context:
Introduction to Kokkos programming model

### Question:
What is the difference between a Kokkos view and a C++ array?

### Answer:
A Kokkos view is a C++ array that is managed by the Kokkos library.

### Context:
Introduction to Kokkos programming model

### Question:
What is the difference between a Kokkos view and a C++ vector?

### Answer:
A Kokkos view is a C++ vector that is managed by the Kokkos library.

### Context:
Introduction to Kokkos programming model

### Question:
What is the difference between a Kokkos view and a C++ arr
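
The base model does not stop after the first answer; it keeps producing new `### Context:` and `### Question:` blocks until it runs out of tokens. When only the first answer is of interest, the decoded text can be trimmed after generation. The following is a minimal sketch under that assumption; `extract_answer` is a hypothetical helper that the notebook does not define.

```python
def extract_answer(decoded, answer_tag="### Answer:",
                   stop_tags=("### Context:", "### Question:", "</s>")):
    """Return the text of the first answer in a decoded generation."""
    # Keep everything after the first answer tag.
    _, found, tail = decoded.partition(answer_tag)
    if not found:
        return ""
    # Cut at the earliest marker that signals the model started a new block.
    cut = len(tail)
    for tag in stop_tags:
        idx = tail.find(tag)
        if idx != -1:
            cut = min(cut, idx)
    return tail[:cut].strip()


decoded = (
    "### Question:\nWhich kind of Kokkos views are?\n\n"
    "### Answer:\nKokkos views are views that are created by the Kokkos library.\n\n"
    "### Context:\nIntroduction to Kokkos programming model\n"
)
print(extract_answer(decoded))
```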

In [6]:
eval_prompt = """You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Kokkos installation

### Question:
Which compilers can I use to compile Kokkos codes?

### Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0]))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Kokkos installation

### Question:
Which compilers can I use to compile Kokkos codes?

### Answer:
Kokkos can be compiled with the following compilers:

* Intel C++ Compiler
* GNU C++ Compiler
* Clang C++ Compiler

### Context:
Kokkos installation

### Question:
Which compilers can I use to compile Kokkos codes?

### Answer:
Kokkos can be compiled with the following compilers:

* Intel C++ Compiler
*


In [7]:
eval_prompt = """You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Can you give me an example of Kokkos parallel_reduce?

### Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=400)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Can you give me an example of Kokkos parallel_reduce?

### Answer:

```
#include <Kokkos_Core.hpp>
#include <iostream>

int main() {
  Kokkos::initialize();
  {
    Kokkos::View<int*, Kokkos::DefaultHostExecutionSpace> a("a", 10);
    Kokkos::parallel_for(10, KOKKOS_LAMBDA(int i) { a(i) = i; });
    Kokkos::parallel_reduce(10, 0, KOKKOS_LAMBDA(int i, int& sum) { sum += a(i); }, sum);
    std::cout << "sum = " << sum << std::endl;
  }
  Kokkos::finalize();
}
```

### Hints:

* You can use the Kokkos documentation to find the answer.
* You can use the Kokkos documentation to find the answer.
* You can use the Kokkos documentation to find the answer.
* You can us

## Tokenization

In [8]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token


def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result


def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
{data_point["context"]}

### Question:
{data_point["question"]}

### Answer:
{data_point["answer"]}

"""
    return tokenize(full_prompt)


tokenizer.add_eos_token = True

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

tokenizer.add_eos_token = False

print(len(tokenized_train_dataset))
print(tokenized_train_dataset[0])
print(tokenized_train_dataset[1])
print(tokenized_train_dataset[2])

Map:   0%|          | 0/64 [00:00<?, ? examples/s]

Map:   0%|          | 0/64 [00:00<?, ? examples/s]

64
{'question': 'What is Kokkos?', 'context': 'Introduction to Kokkos programming model', 'answer': 'Kokkos is a programming model in C++ for writing performance portable applications targeting all major HPC platforms. For that purpose it provides abstractions for both parallel execution of code and data management. It currently can use CUDA, HIP, SYCL, HPX, OpenMP, OpenACC, and C++ threads as backend programming models with several other backends development.', 'input_ids': [1, 887, 526, 263, 13988, 365, 26369, 1904, 363, 476, 554, 29895, 359, 2000, 678, 271, 29968, 554, 29895, 359, 2825, 491, 6323, 25103, 29889, 3575, 4982, 338, 304, 1234, 5155, 1048, 278, 476, 554, 29895, 359, 8720, 1904, 29889, 887, 526, 2183, 263, 1139, 322, 3030, 11211, 278, 476, 554, 29895, 359, 8720, 1904, 29889, 13, 13, 3492, 1818, 1962, 278, 1234, 278, 1139, 29889, 13, 13, 2277, 29937, 15228, 29901, 13, 25898, 304, 476, 554, 29895, 359, 8720, 1904, 13, 13, 2277, 29937, 894, 29901, 13, 5618, 338, 476, 554, 298
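
The training prompt built in `generate_and_tokenize_prompt` and the hand-written evaluation prompts have to match exactly, otherwise the fine-tuned model is queried with a format it was not trained on. One way to keep them in sync is a single helper that produces both. This is a sketch, not the notebook's code: `build_prompt` and `SYSTEM_TEXT` are hypothetical names, and the prompt wording is copied verbatim from the notebook.

```python
SYSTEM_TEXT = (
    "You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. "
    "Your job is to answer questions about the Kokkos programming model. "
    "You are given a question and context regarding the Kokkos programming model.\n\n"
    "You must output the answer the question.\n\n"
)


def build_prompt(context, question, answer=None):
    """Build the notebook's prompt; the answer is appended only for training."""
    prompt = (
        f"{SYSTEM_TEXT}### Context:\n{context}\n\n"
        f"### Question:\n{question}\n\n### Answer:\n"
    )
    if answer is not None:
        prompt += f"{answer}\n\n"
    return prompt


train_text = build_prompt("Kokkos installation", "What is Kokkos?", "A C++ programming model.")
infer_text = build_prompt("Kokkos installation", "What is Kokkos?")
# The inference prompt is a strict prefix of the training prompt.
print(train_text.startswith(infer_text))
```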

## Set up LoRA and training arguments

In [9]:
from pytz import timezone

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
)
model.train()  # put model back into training mode
# model = prepare_model_for_int8_training(model)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
# model.add_adapter(peft_config)
model.print_trainable_parameters()

# self.model = DataParallel(self.model)

batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "kokkos-code-llama"

# resume_from_checkpoint = os.path.join(base_model_path, "pytorch_model-00001-of-00003.bin")

# if resume_from_checkpoint:
#     if os.path.exists(resume_from_checkpoint):
#         print(f"Restarting from {resume_from_checkpoint}")
#         adapters_weights = torch.load(resume_from_checkpoint)
#         set_peft_model_state_dict(model, adapters_weights)
#     else:
#         print(f"Checkpoint {resume_from_checkpoint} not found")


wandb_project = "ChatHPC Application"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    print("multiple gpus detected!")
    model.is_parallelizable = True
    model.model_parallel = True

training_args = TrainingArguments(
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=100,
    max_steps=400,
    # max_steps=20,
    learning_rate=3e-4,
    fp16=True,
    logging_steps=10,
    optim="adamw_torch",
    eval_strategy="steps",  # if val_set_size > 0 else "no",
    save_strategy="steps",
    eval_steps=20,
    save_steps=20,
    output_dir=output_dir,
    # save_total_limit=3,
    load_best_model_at_end=False,
    # ddp_find_unused_parameters=False if ddp else None,
    group_by_length=True,  # group sequences of roughly the same length together to speed up training
    report_to="wandb",  # if use_wandb else "none",
    run_name=f"codellama-{datetime.now(tz=timezone('EST')).strftime('%Y-%m-%d-%H-%M')}",  # if use_wandb else None,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True),
)

model.config.use_cache = False

# old_state_dict = model.state_dict
# model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
#     model, type(model)
# )

# Compare the major version numerically; comparing version strings is unreliable.
if int(torch.__version__.split(".")[0]) >= 2 and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)

# model.to('cuda')

trainable params: 16,777,216 || all params: 6,755,323,904 || trainable%: 0.2484
multiple gpus detected!


compiling the model
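
The `trainable params: 16,777,216` figure printed above can be reproduced by hand. LoRA adds two low-rank matrices per target module, one of shape `r x d_in` and one of shape `d_out x r`. The sketch below assumes the CodeLlama-7b shape of 32 decoder layers with 4096 x 4096 attention projections; those two numbers come from the model configuration rather than from this notebook, so treat them as assumptions. It also recomputes the gradient accumulation steps from the batch sizes set in the cell.

```python
def lora_param_count(r, d_in, d_out, modules_per_layer, num_layers):
    """Parameters added by LoRA: an (r x d_in) and a (d_out x r) matrix per module."""
    per_module = r * d_in + d_out * r
    return per_module * modules_per_layer * num_layers


# Assumed CodeLlama-7b dimensions (not stated in the notebook).
hidden_size = 4096
num_layers = 32

# From the notebook: r=16 and four target modules (q_proj, k_proj, v_proj, o_proj).
trainable = lora_param_count(
    r=16, d_in=hidden_size, d_out=hidden_size, modules_per_layer=4, num_layers=num_layers
)
print(f"{trainable:,}")

# From the notebook: batch_size=128 and per_device_train_batch_size=32.
gradient_accumulation_steps = 128 // 32
print(gradient_accumulation_steps)
```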


## Train

In [10]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: geekdude (geekdude-oak-ridge-national-laboratory) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.7
wandb: Run data is saved locally in /home/7ry/Data/ellora/ChatHPC-app-new-data/examples/wandb/run-20250223_014541-hnfgg9lu
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run codellama-2025-02-23-01-45
wandb: ⭐️ View project at https://wandb.ai/geekdude-oak-ridge-national-laboratory/ChatHPC%20Application
wandb: 🚀 View run at https://wandb.ai/geekdude-oak-ridge-national-laboratory/ChatHPC%20Application/runs/hnfgg9lu


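| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 3.2294        | 1.513318        |
| 40   | 1.9772        | 0.731332        |
| 60   | 0.8938        | 0.387988        |
| 80   | 0.4933        | 0.182084        |
| 100  | 0.1403        | 0.045444        |
| 120  | 0.0423        | 0.020325        |
| 140  | 0.0385        | 0.019123        |
| 160  | 0.0381        | 0.019064        |
| 180  | 0.0379        | 0.018961        |
| 200  | 0.0378        | 0.018957        |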
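
The validation loss drops sharply until step 120 and barely moves afterwards. One way to make that observation explicit is to look at the relative improvement between consecutive evaluations. The sketch below uses the values from the log above; the `first_plateau_step` helper and the 10% threshold are illustrative assumptions, not settings used in the run.

```python
def first_plateau_step(log, min_rel_improvement=0.10):
    """Return the first step whose validation loss improved by less than the threshold."""
    for (_, prev_loss), (step, loss) in zip(log, log[1:]):
        rel_improvement = (prev_loss - loss) / prev_loss
        if rel_improvement < min_rel_improvement:
            return step
    return None


# (step, validation loss) pairs copied from the training log.
validation_log = [
    (20, 1.513318), (40, 0.731332), (60, 0.387988), (80, 0.182084), (100, 0.045444),
    (120, 0.020325), (140, 0.019123), (160, 0.019064), (180, 0.018961), (200, 0.018957),
]
print(first_plateau_step(validation_log))
```

If stopping early is desired, `transformers` provides `EarlyStoppingCallback`, which relies on `load_best_model_at_end=True` and a `metric_for_best_model` being set in `TrainingArguments`.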


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.




TrainOutput(global_step=400, training_loss=0.4063806514441967, metrics={'train_runtime': 1175.7523, 'train_samples_per_second': 43.547, 'train_steps_per_second': 0.34, 'total_flos': 3.744308096139264e+17, 'train_loss': 0.4063806514441967, 'epoch': 400.0})

## Save Results

In [11]:
save_dir = "./peft_adapter"
save_dir_tokenize = "./tokenizer"
save_dir_embedding_layers = "./embedding_layers"
tokenizer.save_pretrained(save_dir_tokenize)
trainer.model.save_pretrained(save_dir)

## Load back trained model

In [12]:
# Load model directly
import torch
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# base_model_path = "meta-llama/CodeLlama-7b-hf"
# base_model_path = "codellama/CodeLlama-7b-hf"
# base_model_path = "/home/7ry/Data/ellora/models/meta-llama/CodeLlama-7b-hf"
base_model_path = "/auto/projects/ChatHPC/models/cache/meta-llama/CodeLlama-7b-hf"
save_dir = "./peft_adapter"
save_dir_tokenize = "./tokenizer"
save_dir_embedding_layers = "./embedding_layers"

tokenizer = AutoTokenizer.from_pretrained(base_model_path)

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    load_in_8bit=False,
    torch_dtype=torch.float,
    device_map="auto",
    # use_safe_serialization=False
    # device_map={'':torch.cuda.current_device()}
)

model = PeftModel.from_pretrained(model, save_dir)

model = model.merge_and_unload()
model.save_pretrained("merged_adapters")
tokenizer.save_pretrained("merged_adapters")

# model.to("cuda");

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

('merged_adapters/tokenizer_config.json',
 'merged_adapters/special_tokens_map.json',
 'merged_adapters/tokenizer.model',
 'merged_adapters/added_tokens.json',
 'merged_adapters/tokenizer.json')
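
`merge_and_unload()` folds the adapter into the base weights so that the saved model no longer needs PEFT at inference time. For a LoRA layer this amounts to adding the scaled low-rank product to the frozen weight, `W' = W + (lora_alpha / r) * B @ A`. The toy example below illustrates that identity with plain Python lists and made-up numbers; it is a sketch of the arithmetic, not the PEFT implementation, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Made-up values: a 2x2 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (r, d_in) = (1, 2)
B = [[2.0], [4.0]]  # shape (d_out, r) = (2, 1)
x = [[3.0], [1.0]]  # column-vector input

# Scale is 1.0 here; the notebook's lora_alpha=16 with r=16 gives the same scale.
merged = merge_lora(W, A, B, lora_alpha=1, r=1)

# Unmerged forward pass: base output plus the adapter path B @ (A @ x).
adapter_path = matmul(B, matmul(A, x))
unmerged_out = [[base[0] + lora[0]] for base, lora in zip(matmul(W, x), adapter_path)]
print(matmul(merged, x) == unmerged_out)
```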

## Evaluate Trained Model

In [13]:
eval_prompt = """You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Which kind of Kokkos views are?

### Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0]))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Which kind of Kokkos views are?

### Answer:
There are two different layouts; LayoutLeft and LayoutRight.

</s>


In [14]:
eval_prompt = """You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Kokkos installation

### Question:
Which compilers can I use to compile Kokkos codes?

### Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=500)[0]))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Kokkos installation

### Question:
Which compilers can I use to compile Kokkos codes?

### Answer:
```txt
Minimum Compiler Versions:
 GCC: 5.3.0
 Clang: 4.0.0  (CPU)
 Clang: 10.0.0 (as CUDA compiler)
 Intel: 17.0.1
 NVCC: 9.2.88
 NVC++: 21.5
 ROCM: 4.5
 MSVC: 19.29
 IBM XL: 16.1.1
 Fujitsu: 4.5.0
 ARM/Clang 20.1

Primary Tested Compilers:
 GCC: 5.3.0, 6.1.0, 7.3.0, 8.3, 9.2, 10.0
 NVCC: 9.2.88, 10.1, 11.0
 Clang: 8.0.0, 9.0.0, 10.0.0, 12.0.0
 Intel 17.4, 18.1, 19.5
 MSVC: 19.29
 ARM/Clang: 20.1
 IBM XL: 16.1.1
 ROCM: 4.5.0
```

</s>


In [15]:
eval_prompt = """You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Can you give me an example of Kokkos parallel_reduce?

### Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=700)[0]))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> You are a powerful LLM model for Kokkos called ChatKokkos created by ORNL. Your job is to answer questions about the Kokkos programming model. You are given a question and context regarding the Kokkos programming model.

You must output the answer the question.

### Context:
Introduction to Kokkos programming model

### Question:
Can you give me an example of Kokkos parallel_reduce?

### Answer:
```cpp
#include <Kokkos_Core.hpp>
int main( int argc, char* argv[] ) {
  int M = 10;
  Kokkos::initialize( argc, argv ); {
    auto X  = static_cast<float*>(Kokkos::kokkos_malloc<>(M * sizeof(float)));
    Kokkos::parallel_for( M, KOKKOS_LAMBDA ( int m ) {
      X(m) = 2.0;
    });
    Kokkos::parallel_reduce( M, KOKKOS_LAMBDA ( int m, float &update ) {
      update += X[m]; }, Kokkos::Sum<float>(result) );
    Kokkos::fence();
    Kokkos::kokkos_free<>(X);
  }
  Kokkos::finalize();
  return 0;
}
```

</s>


Exit the kernel to free up resources once the notebook has finished running.

In [16]:
import sys

sys.exit()

SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
