These are the dependencies required on Google Colab. A local environment may need additional packages.

In [11]:
%%capture
%pip install -U accelerate
%pip install -U peft
%pip install -U trl
%pip install -U bitsandbytes
%pip install -U transformers

Log in to Hugging Face so that the model can be saved to your account.

In [1]:
from huggingface_hub import login

hf_token = "HUGGINGFACE_ACCESS_TOKEN"
login(hf_token)
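Hardcoding the token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token is stored in the `HF_TOKEN` environment variable; the helper name `get_hf_token` is hypothetical and not part of `huggingface_hub`:

```python
import os


def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the access token stored in `var_name`, or raise if it is missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


# Usage (not executed here, since it needs network access):
# from huggingface_hub import login
# login(get_hf_token())
```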

Download Qwen3-8B and load it with 4-bit (NF4) quantization

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_dir = "Qwen/Qwen3-8B"
# For a quick local test, point model_dir at a smaller local checkpoint instead:
# model_dir = "/home/sebastian/Documents/Uni/LlamaCPP/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

model.config.use_cache = False
model.config.pretraining_tp = 1
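To see why 4-bit loading matters for fitting the model on a single GPU, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count is an assumed round figure, and the estimate ignores quantization constants, activations, and optimizer state:

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 8e9  # assumption: round figure for an 8B-parameter model

for label, bits in [("bf16", 16), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the bf16 footprint.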



Combine the system, user, and assistant messages into a single training text

In [3]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Question:
{}

### Response:
<think>
</think>
{}"""

EOS_TOKEN = tokenizer.eos_token  # Appended to every response so the model learns where to stop

def formatting_prompts_func(examples):
    # Each record holds three messages in this order: system, user, assistant
    messages = examples["messages"]
    system_prompt = messages[0]["content"]
    user_prompt = messages[1]["content"]
    response = messages[2]["content"]
    # Avoid a duplicate EOS if the response already ends with one
    if not response.endswith(EOS_TOKEN):
        response += EOS_TOKEN
    text = train_prompt_style.format(system_prompt, user_prompt, response)
    return text, system_prompt, user_prompt, response
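The formatting step can be tried without loading a tokenizer. The sketch below uses a shortened template, a placeholder EOS string, and an invented sample record; all three are assumptions for illustration only:

```python
PROMPT_TEMPLATE = """### Instruction:
{}

### Question:
{}

### Response:
<think>
</think>
{}"""

EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def format_example(example, eos=EOS):
    """Fill the template and make sure the response ends with exactly one EOS."""
    system, user, assistant = (m["content"] for m in example["messages"])
    if not assistant.endswith(eos):
        assistant += eos
    return PROMPT_TEMPLATE.format(system, user, assistant)


# Invented sample record, shaped like the entries read from train.jsonl
sample = {
    "messages": [
        {"role": "system", "content": "You are driving a car."},
        {"role": "user", "content": "The lane ahead is free."},
        {"role": "assistant", "content": '{"best": "keep"}'},
    ]
}
text = format_example(sample)
print(text)
```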

Load the training set (the same one used, for example, to fine-tune GPT-4o) and convert it to the format defined above.

In [4]:
import json
import random

data = []
with open("train.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))

data = [formatting_prompts_func(x) for x in data]
raw_texts, instructions, questions, answers = zip(*data)
data = {
    "text": raw_texts,
    "system": instructions,
    "user": questions,
    "response": answers
}

data["text"][10]

'Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are driving a car and need to make a high-level driving decision.\nFirst, carefully observe the environment.\nThen, reason through your decision step by step and present it in natural language.\nFinally, return the top 3 advisable longitudinal–lateral action pairs, ranked from best to worst.\nFeasible longitudinal actions:\n  - accelerate\n  - decelerate\n  - keep\n  - stop\nFeasible lateral actions:\n  - follow_lane\n  - right\n\n\n### Question:\nHere is an overview of your environment:\nThere is no left-adjacent lane. There is a right-adjacent lane with the same direction. \nYour velocity is 40.3 m/s, orientation is -3.141 rad, steering angle is 0.000 rad, and a
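Because the formatting function indexes the messages by position, a malformed line in `train.jsonl` would silently produce a wrong prompt. A minimal validation sketch follows; the role names are an assumption (the usual chat fine-tuning layout) and the helper `validate_record` is hypothetical:

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]  # assumption, not confirmed by the source


def validate_record(line):
    """Parse one JSONL line and check that it has the expected three messages."""
    record = json.loads(line)
    messages = record["messages"]
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    if not all(isinstance(m.get("content"), str) for m in messages):
        raise ValueError("every message needs a string 'content' field")
    return record
```

It could be called in place of `json.loads(line)` while reading the file.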

Create a custom dataset from the prompts

In [5]:
from datasets import Dataset

dataset = Dataset.from_dict(data)

dataset["text"][10]

'Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are driving a car and need to make a high-level driving decision.\nFirst, carefully observe the environment.\nThen, reason through your decision step by step and present it in natural language.\nFinally, return the top 3 advisable longitudinal–lateral action pairs, ranked from best to worst.\nFeasible longitudinal actions:\n  - accelerate\n  - decelerate\n  - keep\n  - stop\nFeasible lateral actions:\n  - follow_lane\n  - right\n\n\n### Question:\nHere is an overview of your environment:\nThere is no left-adjacent lane. There is a right-adjacent lane with the same direction. \nYour velocity is 40.3 m/s, orientation is -3.141 rad, steering angle is 0.000 rad, and a

In [6]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
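With `mlm=False`, the collator prepares batches for causal language modeling: sequences are padded to a common length and the labels are a copy of the input IDs with padding excluded from the loss. The pure-Python sketch below illustrates the idea under simplifying assumptions (right padding, and masking by position, whereas the library collator masks every token equal to the pad token ID); it is not the library's implementation:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal(batch, pad_id):
    """Right-pad token-id lists and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels equal the inputs; padded positions are ignored by the loss
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


out = collate_causal([[5, 6, 7], [8]], pad_id=0)
print(out["labels"])
```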

Try out the dataset on the base model. This can take a while, so the step is optional.

In [36]:
inference_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
{}

### Question:
{}

### Response:
<think>

"""
instruction = dataset[10]['system']
question = dataset[10]['user']
inputs = tokenizer(
    [inference_prompt_style.format(instruction, question) + tokenizer.eos_token],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])


<think>



### Reasoning:
1. **Observation of the Environment:**
   - There are two adjacent lanes in the same direction: one on the left and one on the right.
   - The truck 10087 is driving on the right-adjacent lane in the same direction at a velocity of 22.5 m/s.
   - The truck 10092 is driving on the right-adjacent lane in the same direction at a velocity of 18.2 m/s.
   - The car 10107 is driving on the left-adjacent lane in the same direction at a velocity of 38.2 m/s.
   - The car 10108 is driving on the same lane at a velocity of 34.1 m/s.
   - The maximum speed limit is 33.3 m/s.
   - The current velocity of the vehicle is 32.7 m/s.
   - The orientation is 0.003 rad, the steering angle is 0.000 rad, and the acceleration is 0.4 m/s².
   - The time-to-collision for all obstacles is infinite, indicating no imminent collision.

2. **Evaluation of the Lateral Actions:**
   - **Follow Lane:** The vehicle is driving on the same lane as the car 10108, which is driving at 34.1 m/s. T

Initialize a LoRA version of the base model

In [7]:
from peft import LoraConfig, get_peft_model

# LoRA config
peft_config = LoraConfig(
    lora_alpha=16,                           # Scaling factor for LoRA
    lora_dropout=0.05,                       # Add slight dropout for regularization
    r=64,                                    # Rank of the LoRA update matrices
    bias="none",                             # No bias reparameterization
    task_type="CAUSAL_LM",                   # Task type: Causal Language Modeling
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Target modules for LoRA
)

model = get_peft_model(model, peft_config)
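To get a feel for how many parameters this LoRA configuration trains, the count can be estimated by hand: each adapted linear layer adds two matrices with `r * (in_features + out_features)` parameters in total. The layer shapes and depth below are assumptions based on the published Qwen3-8B configuration and should be checked against `model.config`:

```python
R = 64
LORA_ALPHA = 16
N_LAYERS = 36  # assumption: Qwen3-8B decoder depth

# (in_features, out_features) per targeted projection; assumed Qwen3-8B shapes
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 12288),
    "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}


def lora_param_count(d_in, d_out, r):
    # LoRA adds A with shape (r, d_in) and B with shape (d_out, r)
    return r * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total = per_layer * N_LAYERS
print(f"trainable LoRA parameters: {total:,}")
print(f"size at 4 bytes each: {total * 4 / 1e6:.0f} MB")
print(f"update scaling alpha/r: {LORA_ALPHA / R}")
```

Under these assumptions the adapter stored in 32-bit floats comes to roughly 698 MB, in line with the adapter file size reported when the model is pushed. On the real model, `model.print_trainable_parameters()` reports the exact figure.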

Initialize the SFT trainer. Choose the batch size according to your hardware.

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments


# Training Arguments
training_arguments = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    logging_steps=0.2,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="none"
)

# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    peft_config=peft_config,
    data_collator=data_collator,
)

Adding EOS to train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
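The training arguments determine the number of optimizer steps and the logging interval. A value of `logging_steps` below 1 is interpreted as a fraction of the total number of steps. The sketch below approximates that arithmetic for a single device; the helper `training_schedule` is hypothetical, and the trainer's exact rounding may differ when the dataset size is not a multiple of the effective batch size:

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, epochs,
                      logging_ratio, n_devices=1):
    """Approximate effective batch size, total optimizer steps, and log interval."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    log_every = math.ceil(total_steps * logging_ratio)
    return effective_batch, total_steps, log_every


# Values from this notebook: 1000 examples, batch size 2, accumulation 2, 1 epoch
print(training_schedule(1000, 2, 2, 1, 0.2))
```

For these settings the result is an effective batch size of 4, 250 optimizer steps, and a log entry every 50 steps, which agrees with the training log in this notebook.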


In [9]:
import gc, torch
gc.collect()
torch.cuda.empty_cache()
model.config.use_cache = False
trainer.train()

Step,Training Loss
50,0.4738
100,0.1324
150,0.123
200,0.1229
250,0.1217


TrainOutput(global_step=250, training_loss=0.19476576232910156, metrics={'train_runtime': 348.3227, 'train_samples_per_second': 2.871, 'train_steps_per_second': 0.718, 'total_flos': 3.2845406373052416e+16, 'train_loss': 0.19476576232910156})

Test the finetuned model.

In [12]:
inference_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
{}

### Question:
{}

### Response:
<think>

"""

instruction = dataset[10]['system']
question = dataset[10]['user']
inputs = tokenizer(
    [inference_prompt_style.format(instruction, question) + tokenizer.eos_token],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])


<think>


</think>
{
  "best_combination": {
    "lateral_action": "follow_lane",
    "longitudinal_action": "decelerate"
  }
  ,
  "second_best_combination": {
    "lateral_action": "follow_lane",
    "longitudinal_action": "accelerate"
  }
  ,
  "third_best_combination": {
    "lateral_action": "follow_lane",
    "longitudinal_action": "keep"
  }
}
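The fine-tuned model answers with a JSON object after the closing `</think>` tag, so the decision can be extracted programmatically. A minimal sketch, assuming nothing but whitespace surrounds the JSON; the helper `extract_decision` is hypothetical:

```python
import json


def extract_decision(response_text):
    """Return the JSON object that follows the closing </think> tag."""
    _, sep, answer = response_text.partition("</think>")
    if not sep:
        raise ValueError("no </think> tag found")
    return json.loads(answer)


# Shortened example in the shape of the output shown above
example_output = """
<think>

</think>
{
  "best_combination": {
    "lateral_action": "follow_lane",
    "longitudinal_action": "decelerate"
  }
}
"""
decision = extract_decision(example_output)
print(decision["best_combination"]["longitudinal_action"])
```

If the model emits extra text after the JSON object, `json.loads` raises an error, which is a useful signal that the output did not follow the trained format.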


Push the fine-tuned LoRA adapter and the tokenizer to the Hugging Face Hub

In [13]:
new_model_name = "Qwen-3-8B-HighD"
model.push_to_hub(new_model_name)
tokenizer.push_to_hub(new_model_name)

adapter_model.safetensors:   0%|          | 0.00/698M [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Bastilling/Qwen-3-8B-HighD/commit/5ff317601efe87e8404627108d8d79250363e103', commit_message='Upload tokenizer', commit_description='', oid='5ff317601efe87e8404627108d8d79250363e103', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Bastilling/Qwen-3-8B-HighD', endpoint='https://huggingface.co', repo_type='model', repo_id='Bastilling/Qwen-3-8B-HighD'), pr_revision=None, pr_num=None)

Download the base model and the LoRA adapter like this:

In [None]:
# Requires git-lfs (run "git lfs install" once) so that the weight files are fetched
!git clone https://huggingface.co/Qwen/Qwen3-8B

!git clone https://huggingface.co/Bastilling/Qwen-3-8B-HighD-2000

Then merge the two by running this:

In [None]:
import os

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_path = "/home/sebastian/Documents/Uni/LlamaCPP/Qwen3-8B"
adapter_path = "/home/sebastian/Documents/Uni/LlamaCPP/Qwen-3-8B-HighD-2000"

# Load tokenizer (optional)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, use_fast=True)
tokenizer.save_pretrained("qwen3-8b-merged-2000")

# Load full-precision base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.float32
    trust_remote_code=True,
)

# Load and merge adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()

# Save merged full model
model.save_pretrained("qwen3-8b-merged-2000")

Then clone the llama.cpp repo and build the code. Afterwards, run the conversion script from the root of that repo:

In [None]:
!python convert_hf_to_gguf.py /home/sebastian/Documents/Uni/LlamaCPP/qwen3-8b-merged-2000 --outfile /home/sebastian/Documents/Uni/LlamaCPP/qwen3-8b-highD-2000-gguf/qwen3-8b-highD-2000-f16.gguf --outtype f16

Next, create a model file in the directory that contains the GGUF file. It should be named "Modelfile" and contain:
```
FROM ./qwen3-8b-highD-2000-f16.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096
```
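If you prefer to create the file from Python rather than by hand, a small helper can write it. This is only a sketch; the function name `write_modelfile` is hypothetical and the content is copied from the Modelfile shown above:

```python
from pathlib import Path

MODELFILE = '''FROM ./qwen3-8b-highD-2000-f16.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096
'''


def write_modelfile(target_dir):
    """Write the Modelfile next to the GGUF file and return its path."""
    path = Path(target_dir) / "Modelfile"
    path.write_text(MODELFILE, encoding="utf-8")
    return path


# Usage: pass the directory that holds the .gguf file, for example
# write_modelfile("/home/sebastian/Documents/Uni/LlamaCPP/qwen3-8b-highD-2000-gguf")
```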

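Next, start the Ollama server and register the model from the directory that contains the Modelfile. The quantization option has not been tested yet but should work in principle; if it fails, remove the last argument.

In [None]:
# "ollama serve" blocks until it is stopped, so run it in a separate terminal:
#   ollama serve
!ollama create qwen3-8b-highD -f Modelfile --quantize q5_K_M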
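Once the model is registered, it can be queried through Ollama's HTTP API. The sketch below uses only the standard library and assumes the server listens on its default address; the helper names are hypothetical, and the system and user prompts would be the same instruction and question texts used during training:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default address


def build_request(model, system, prompt, url=OLLAMA_URL):
    """Build a non-streaming generate request for the Ollama server."""
    payload = {"model": model, "system": system, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, system, prompt):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, system, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the server running:
# print(generate("qwen3-8b-highD", instruction, question))
```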