# Environment Setup
- As with Standard Code 1, first set up the environment.

In [None]:
# =====================================================================
# 0) Pinning dependencies (working around Colab's environment drift)
# =====================================================================
# Colab (free tier) can change its pre-installed package versions without notice,
# which breaks training code that previously worked.
# This cell therefore reinstalls a set of versions whose compatibility has been confirmed,
# so that the run is reproducible.
#
# Common mistakes:
# - Mixing pinned and pre-installed packages: imports succeed, but the code crashes at runtime.
# - Version mismatches among transformers, trl, and unsloth cause hard-to-diagnose errors and slowdowns.
#
# * This cell is not a "magic spell"; it is a critical step for reproducibility.

# Optional: uncomment the next line to remove the pre-installed packages before reinstalling.
#!pip -q uninstall -y numpy pandas datasets trl transformers accelerate peft unsloth unsloth-zoo bitsandbytes xformers
!uv pip install "numpy==2.0.2" "pandas==2.2.2"

# Stay within the version ranges that unsloth-zoo requires.
# Pin transformers / trl / accelerate / peft / bitsandbytes to versions that are compatible with Unsloth.
# Pinning transformers is especially important, because its behavior can change even between minor versions.
!uv pip install \
  "datasets==4.3.0" \
  "trl==0.24.0" \
  "transformers==4.56.2" \
  "accelerate==1.4.0" \
  "peft==0.13.2" \
  "bitsandbytes==0.45.0"

# Install unsloth and unsloth-zoo from the same release series (matching unsloth-zoo's requirements).
# The two packages are meant to be used together; upgrading only one of them is likely to cause problems.
!uv pip install "unsloth-zoo==2025.12.7" "unsloth==2025.12.7"



Using Python 3.12.12 environment at: /usr
Audited 2 packages in 231ms
Using Python 3.12.12 environment at: /usr
Resolved 68 packages in 463ms
Prepared 7 packages in 1.70s
Uninstalled 5 packages in 393ms
Installed 7 packages in 71ms
 - accelerate==1.12.0
 + accelerate==1.4.0
 + bitsandbytes==0.45.0
 - datasets==4.0.0
 + datasets==4.3.0
 - peft==0.18.1
 + peft==0.13.2
 - pyarrow==18.1.0
 + pyarrow==23.0.0
 - transformers==4.57.6
 + transformers==4.56.2
 + trl==0.24.0
Using Python 3.12.12 environment at: /usr

# Loading the Model and Tokenizer (4-bit Quantization)
#### What It Does
- Loads the pre-trained model Qwen/Qwen3-4B-Instruct-2507.
- The tokenizer converts text into tokens (sequences of token IDs). It is required for both training and inference.
- Updating all of the model's weights during DPO training requires a large amount of computation and memory.
- Therefore, we use LoRA (Low-Rank Adaptation): small trainable low-rank matrices are added to selected layers (here, the attention and MLP projection layers), and only these added parameters are updated while the original weights stay frozen.
#### Key Points

- Setting load_in_4bit = True loads the model with 4-bit quantization, which reduces GPU memory use.
- This makes even a 4B (4-billion-parameter) model practical on relatively low-memory GPUs such as Colab's T4.
- max_seq_length = 2048 sets the maximum sequence length, in tokens, that the model handles at once.
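The memory saving from 4-bit loading can be approximated with simple arithmetic. The sketch below is a rough estimate that counts only the weights of a model rounded to 4 billion parameters; it ignores quantization metadata, any layers kept in higher precision, activations, and optimizer state, so actual usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 4_000_000_000  # "4B", rounded for illustration

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
# 16-bit: 7.45 GiB
# 4-bit: 1.86 GiB
```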

In [None]:
# -----------------------------
# HF login (once)
# -----------------------------
# Log in to "read datasets on HF Hub" and "upload trained models to HF."
#
from unsloth import FastLanguageModel
import numpy as np, pandas as pd
import datasets, trl, transformers, torch

from huggingface_hub import login, HfApi
login()  # Colab will prompt
api = HfApi()

In [None]:
import sys
import os
import warnings
from unsloth import FastLanguageModel
import torch

# 1. Load Model & Tokenizer
max_seq_length = 2048
dtype = None # Auto detection
load_in_4bit = True # Use 4bit quantization

print("Loading model (Qwen/Qwen3-4B-Instruct-2507)...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# 2. Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

print("Model loaded and LoRA adapters applied successfully.")
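The number of parameters that LoRA adds can be estimated from the rank and the shapes of the targeted projection layers: a linear layer with input size d_in and output size d_out gains r × (d_in + d_out) parameters. The sketch below is a rough check under assumed Qwen3-4B dimensions (hidden size 2560, 36 layers, 32 query heads and 8 key/value heads of size 128, MLP size 9728). These shapes are assumptions that are not stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Qwen3-4B dimensions (not taken from this notebook).
HIDDEN = 2560
Q_OUT = 32 * 128    # query heads x head size
KV_OUT = 8 * 128    # key/value heads x head size
INTERMEDIATE = 9728
NUM_LAYERS = 36

# (d_in, d_out) for each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes, num_layers):
    """Each adapted layer adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(rank=8, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS)
print(f"{total:,}")  # 16,515,072 under the assumed shapes
```

Under these assumptions the result equals the "Trainable parameters = 16,515,072" figure that Unsloth prints when training starts.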

# Loading and Formatting the DPO Dataset

#### What It Does
1. Setting the Chat Template (Unifying the Model Input Format)
- Chat models are trained on a fixed text format in which special tokens delimit each turn (for example, a user turn or an assistant turn).
- Specifying chat_template="qwen-2.5" gives the tokenizer the rules for formatting input into a Qwen-compatible chat format.
2. Loading the DPO Dataset
- Load the u-10bei/dpo-dataset-qwen-cot dataset from Hugging Face and obtain the train split.
- Provided DPO dataset: https://huggingface.co/datasets/u-10bei/dpo-dataset-qwen-cot

DPO data basically has the following three columns:
- prompt: Question or instruction (from the user)
- chosen: Desired answer (good example)
- rejected: Undesired answer (bad example)
3. Format prompt / chosen / rejected into a "chat-style string"

The formatting function (formatting_prompts_func) accepts a batch (multiple rows) and returns the three formatted columns, so that it can be applied with dataset.map(..., batched=True).

The key call is tokenizer.apply_chat_template(...), which formats:
- prompt as a user turn and, with add_generation_prompt=True, appends the special tokens that mark where the assistant's reply begins.
- chosen/rejected as assistant turns.

Note: Why is add_generation_prompt=True used only for prompt?

DPO learns that chosen is preferable to rejected for the same prompt. The prompt therefore needs to end at the point where the assistant starts answering, so that either response can follow it directly.
4. Applying a Formatting Function to the Dataset
- Use map to apply formatting to all data.
- Output dataset[0] and visually verify that the formatted prompt/chosen/rejected are in the expected chat format.
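To make the effect of add_generation_prompt concrete, the following is a simplified sketch of a ChatML-style layout, which is the `<|im_start|>` / `<|im_end|>` format visible in the sample output of this cell. `render_chatml` is a hypothetical stand-in written for illustration. It is not the tokenizer's actual template, which may also insert a default system message.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in a simplified ChatML-style layout."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model's answer continues from here.
        text += "<|im_start|>assistant\n"
    return text


prompt_text = render_chatml(
    [{"role": "user", "content": "Produce a TOML document."}],
    add_generation_prompt=True,
)
print(prompt_text)
```

The prompt string ends with an opened assistant turn, while a chosen or rejected string rendered with the assistant role is a complete, closed turn.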

In [None]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

# 1. Setup Chat Template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

# 2. Load Dataset
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split = "train")

# 3. Define Formatting Function
def formatting_prompts_func(examples):
    new_prompts = []
    new_chosens = []
    new_rejecteds = []

    # Iterate over the batch
    for prompt, chosen, rejected in zip(examples['prompt'], examples['chosen'], examples['rejected']):
        # Format prompt with user role and add generation prompt (e.g. <|im_start|>assistant)
        formatted_prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize = False,
            add_generation_prompt = True
        )

        # Format chosen/rejected responses with assistant role
        formatted_chosen = tokenizer.apply_chat_template(
            [{"role": "assistant", "content": chosen}],
            tokenize = False,
        )

        formatted_rejected = tokenizer.apply_chat_template(
            [{"role": "assistant", "content": rejected}],
            tokenize = False,
        )

        new_prompts.append(formatted_prompt)
        new_chosens.append(formatted_chosen)
        new_rejecteds.append(formatted_rejected)

    return {
        "prompt": new_prompts,
        "chosen": new_chosens,
        "rejected": new_rejecteds,
    }

# 4. Apply Formatting
dataset = dataset.map(formatting_prompts_func, batched = True)

# 5. Verify
print("Dataset columns:", dataset.column_names)
print("\n--- Sample Prompt ---")
print(dataset[0]["prompt"])
print("\n--- Sample Chosen ---")
print(dataset[0]["chosen"])
print("\n--- Sample Rejected ---")
print(dataset[0]["rejected"])

Dataset columns: ['prompt', 'chosen', 'rejected', 'strategy']

--- Sample Prompt ---
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
<|im_start|>system
You are a helpful assistant. Please format your response as follows:
Approach: <step-by-step reasoning>
Output: <final answer><|im_end|>
<|im_start|>user
Produce a TOML document for a api specification.<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>assistant


--- Sample Chosen ---
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>assistant
Approach:
1. Task: Create api specification in TOML
2. Complexity: complex - 8-10 fields, 3-4 levels
3. Format rules: key=value syntax, [sections], proper types
4. Structure: Organize data logically with appropriate nesting
5. Populate fields with realistic example data based on the schema.

Output:
openapi = "3.0.0"

[info]
title = "Face-to-face dedicated model API"
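In the sample prompt above, `<|im_start|>system` appears inside the user turn, which suggests that the raw prompt column may already contain chat markup before apply_chat_template is applied. The following is a minimal sketch for checking this on the raw dataset before formatting. `already_templated` and `count_templated` are hypothetical helpers, and the marker strings are taken from the sample output.

```python
CHAT_MARKERS = ("<|im_start|>", "<|im_end|>")


def already_templated(text: str) -> bool:
    """True if the text already contains chat-template special tokens."""
    return any(marker in text for marker in CHAT_MARKERS)


def count_templated(rows, column="prompt"):
    """Count rows whose given column already contains chat markup."""
    return sum(already_templated(row[column]) for row in rows)


# Toy rows for illustration; in the notebook, the rows of the raw
# (unformatted) dataset would be passed instead.
sample_rows = [
    {"prompt": "<|im_start|>user\nHello<|im_end|>\n"},
    {"prompt": "Plain question without markup"},
]
print(count_templated(sample_rows))  # 1
```

If most raw rows are already templated, wrapping them again nests one chat turn inside another, so it is worth deciding deliberately whether to format them a second time.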

# Setting DPOConfig (training hyperparameters) and initializing DPOTrainer

#### What it does
- Passes the training settings to the DPO implementation in the trl (Transformer Reinforcement Learning) library.
- Creates the Trainer object that runs DPO training, with the following arguments:
  - model: the model to be trained with LoRA
  - tokenizer: a tokenizer that can handle chat data
  - train_dataset: the formatted DPO data
  - args: the training settings

#### Initial hyperparameter settings

- learning_rate=1e-7: A very small learning rate, so each update changes the LoRA weights only slightly.
- per_device_train_batch_size=2 and gradient_accumulation_steps=4: The effective batch size is 2 × 4 = 8 on a single GPU. The per-device batch is kept small because GPU memory is limited.
- optim="adamw_8bit": An 8-bit AdamW optimizer, which saves memory.
- fp16 / bf16: Uses bf16 if the GPU supports it, and fp16 otherwise.
- beta=0.1: An important DPO parameter. It acts as a temperature-like coefficient that controls how strongly the preference difference between chosen and rejected is learned.
- max_prompt_length=512, max_length=1024: Truncation limits. The prompt can be at most 512 tokens, and the whole sequence (prompt + response) at most 1024 tokens. Longer data is truncated, so the truncated portion is not used for training.
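The role of beta is easiest to see in the per-example DPO loss in its standard sigmoid form: the loss is -log sigmoid(margin), where the margin is beta times the difference between how much the policy has raised the log-probability of chosen and of rejected relative to the reference model. The sketch below is an illustrative pure-Python version under that assumption. It is not the trl implementation, which has additional options, and `dpo_loss` with its argument names is hypothetical.

```python
import math


def dpo_loss(beta, policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp):
    """Per-example sigmoid DPO loss: -log(sigmoid(margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m); written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Before any training the policy equals the reference, so the margin is 0
# and the loss is log(2), about 0.693.
print(dpo_loss(0.1, -800.0, -270.0, -800.0, -270.0))

# A larger beta turns the same log-probability shift into a larger margin.
print(dpo_loss(0.1, -750.0, -280.0, -800.0, -270.0))
print(dpo_loss(0.5, -750.0, -280.0, -800.0, -270.0))
```

The starting value of log(2) is a useful sanity check: a training loss that begins far from about 0.69 usually means the policy and the reference model are not the same at step 0.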

In [None]:
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

# 1. Configure DPO Training Arguments
dpo_config = DPOConfig(
    learning_rate = 1e-7,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    num_train_epochs = 1,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 1,
    output_dir = "dpo_checkpoints",
    beta = 0.1,
    max_length = 1024,
    max_prompt_length = 512,
    seed = 42,
    report_to = "none",
)

# 2. Initialize DPOTrainer
trainer = DPOTrainer(
    model = model,
    ref_model = None, # No separate reference model: with a PEFT model, reference log-probs are computed with the adapter disabled
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = dpo_config,
)

print("DPOTrainer initialized.")

# Running DPO Training
#### What it does
- trainer.train() runs the actual DPO training.
- The trained LoRA adapter is saved to the dpo_lora_model directory.
- trainer_stats contains statistics such as loss and training steps.

In [None]:
# 1. Adjust logging steps to reduce output verbosity
if hasattr(trainer, 'args'):
    trainer.args.logging_steps = 50

print("Starting DPO training...")
# 2. Run DPO Training
trainer_stats = trainer.train()

# 3. Save the model and tokenizer
output_dir = "dpo_lora_model"
trainer.save_model(output_dir)

# 4. Print Training Stats
print(f"\nTraining completed. Model saved to '{output_dir}'.")
print("Training Statistics:", trainer_stats)

Starting DPO training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4,040 | Num Epochs = 1 | Total steps = 505
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 16,515,072 of 4,038,983,168 (0.41% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / chosen,logps / rejected,logits / chosen,logits / rejected,eval_logits / chosen,eval_logits / rejected,nll_loss
50,0.675,0.032653,-0.016905,0.5775,0.049558,-854.382629,-264.011444,-2.953451,-3.451738,0,0,0
100,0.4014,0.653065,-0.127522,0.96,0.780587,-802.003357,-266.913208,-3.102431,-3.48545,No Log,No Log,No Log
150,0.1077,2.189785,-0.394114,0.9975,2.583899,-753.982178,-262.196014,-2.953183,-3.468762,No Log,No Log,No Log
200,0.0208,4.304415,-0.759475,0.9975,5.063889,-762.066406,-283.616394,-3.050305,-3.500772,No Log,No Log,No Log
250,0.01,5.768434,-0.910353,0.9975,6.678787,-720.02655,-277.24353,-3.039755,-3.548172,No Log,No Log,No Log
300,0.0104,6.655694,-0.970324,0.9975,7.626018,-766.749939,-284.989868,-2.942729,-3.495684,No Log,No Log,No Log
350,0.0016,7.010244,-1.042516,1.0,8.05276,-729.103455,-272.656525,-2.936298,-3.532874,No Log,No Log,No Log
400,0.0085,7.12824,-1.065809,0.9975,8.194048,-736.437012,-289.967102,-2.938479,-3.496388,No Log,No Log,No Log
450,0.0077,7.206686,-1.034481,1.0,8.241167,-730.509399,-293.246002,-3.019848,-3.479598,No Log,No Log,No Log
500,0.0052,7.346134,-1.073525,1.0,8.41966,-759.381775,-278.812195,-2.903349,-3.48784,No Log,No Log,No Log



Training completed. Model saved to 'dpo_lora_model'.
Training Statistics: TrainOutput(global_step=505, training_loss=0.12364573818400945, metrics={'train_runtime': 5700.8135, 'train_samples_per_second': 0.709, 'train_steps_per_second': 0.089, 'total_flos': 0.0, 'train_loss': 0.12364573818400945, 'epoch': 1.0})
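The figures in the log above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers printed in the log (4,040 examples, batch size 2, 4 accumulation steps, 1 GPU, a runtime of 5700.8135 seconds); the variable names are chosen for illustration.

```python
import math

num_examples = 4040
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
train_runtime_s = 5700.8135

# One optimizer step consumes this many examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# 4040 divides evenly by 8, so rounding up does not change the result here.
total_steps = math.ceil(num_examples / effective_batch_size)

samples_per_second = num_examples / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(effective_batch_size)          # 8
print(total_steps)                   # 505
print(round(samples_per_second, 3))  # 0.709
print(round(steps_per_second, 3))    # 0.089
print(round(train_runtime_s / 60))   # 95 (minutes)
```

The same log also shows that rewards / margins is simply rewards / chosen minus rewards / rejected: at step 50, 0.032653 - (-0.016905) = 0.049558.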


# Uploading a Model to HuggingFace (merged_16bit)
#### What It Does
- Use the whoami API to get your Hugging Face username.
- Automatically generate a repository name in the format username/dpo-qwen-cot-merged and upload the model there.
- Unlike the standard SFT code, this code uploads a "merged model" rather than an adapter.
- Before uploading, create a README.md containing the necessary information and upload it.

#### Key Points
- Please review the contents of the README.md in advance to avoid any errors.
- [Assignment] Please write the model title in the README.md yourself.
- push_to_hub_merged is a convenience function provided by Unsloth that merges the LoRA weights into the base model, then saves and uploads the result.
- Since save_method="merged_16bit" is specified, the model is saved as a 16-bit merged model.
- When merged at 16-bit, the file size will be larger than LoRA alone.
- However, this is also an easier-to-use format because it eliminates the need to "load base + LoRA separately" during inference.
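Conceptually, merging folds the low-rank update into each adapted weight matrix: W_merged = W + (alpha / r) × B × A, using the standard LoRA scaling. The toy sketch below uses small pure-Python matrices to show that the merged weight gives the same output as applying the base weight and the adapter separately. It ignores quantization and dtype handling, and `matmul` and `merge_lora` are hypothetical helpers, not Unsloth functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: W is (out=2, in=3); rank r=1, so A is (1, 3) and B is (2, 1).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.0, 1.0]]
B = [[1.0], [2.0]]
alpha, r = 2, 1

W_merged = merge_lora(W, A, B, alpha, r)

x = [[1.0], [2.0], [3.0]]  # one input column vector

# Path 1: base output plus the scaled adapter output (unmerged inference).
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

# Path 2: a single matrix multiply with the merged weight.
merged_out = matmul(W_merged, x)

print(separate, merged_out)  # both paths give the same result
```

Because the merged matrix has the same shape as the original weight, the uploaded model needs no adapter files at inference time, at the cost of storing a full copy of the weights.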

In [None]:
import os
from google.colab import userdata
from huggingface_hub import login, HfApi

# 1. Hugging Face Login
# This assumes you have stored "HF_TOKEN" in Colab Secrets (the key icon).
HF_TOKEN = os.environ.get('HF_TOKEN')
if not HF_TOKEN:
    try:
        HF_TOKEN = userdata.get('HF_TOKEN')
    except Exception:
        print("Warning: HF_TOKEN not found in Secrets.")

if HF_TOKEN:
    login(token=HF_TOKEN, add_to_git_credential=True)
else:
    print("Error: Hugging Face Token is missing. Please check your Colab Secrets.")

# 2. Get username and set repository name
api = HfApi()
try:
    username = api.whoami(token=HF_TOKEN)["name"]
    repo_name = "dpo-qwen-cot-merged" # Any repository name
    repo_id = f"{username}/{repo_name}"
    print(f"Uploading to: {repo_id}")
except Exception as e:
    print(f"Failed to get username: {e}")


In [None]:

# 3. Automatic generation of README.md (model card)
import os

# Check the destination directory
readme_dir = "dpo_lora_model"
os.makedirs(readme_dir, exist_ok=True)

def _fmt_lr(x) -> str:
    try:
        return f"{float(x):.0e}"
    except Exception:
        return str(x)

# Collect the training parameters so that the README reflects the actual configuration
base_model_name = "Qwen/Qwen3-4B-Instruct-2507"
dataset_name = "u-10bei/dpo-dataset-qwen-cot"
lr_str = _fmt_lr(dpo_config.learning_rate)
epochs = dpo_config.num_train_epochs
beta = dpo_config.beta
max_len = dpo_config.max_length

# Title in README (fill this in yourself)
title_line = "<[Assignment] Please fill this in yourself>" # Example: qwen3-4b-dpo-qwen-cot-merged

readme_md = f"""---
base_model: {base_model_name}
datasets:
- {dataset_name}
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
---

# {title_line}

This model is a fine-tuned version of **{base_model_name}** using **Direct Preference Optimization (DPO)** via the **Unsloth** library.

This repository contains the **full-merged 16-bit weights**. No adapter loading is required.

## Training Objective
This model has been optimized using DPO to align its responses with preferred outputs, focusing on improving reasoning (Chain-of-Thought) and structured response quality based on the provided preference dataset.

## Training Configuration
- **Base model**: {base_model_name}
- **Method**: DPO (Direct Preference Optimization)
- **Epochs**: {epochs}
- **Learning rate**: {lr_str}
- **Beta**: {beta}
- **Max sequence length**: {max_len}
- **LoRA Config**: r=8, alpha=16 (merged into base)

## Usage
Since this is a merged model, you can use it directly with `transformers`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your_id/your-repo-name"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Test inference
prompt = "Your question here"
inputs = tokenizer.apply_chat_template([{{"role": "user", "content": prompt}}], tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

```

## Sources & License (IMPORTANT)

* **Training Data**: [{dataset_name}](https://huggingface.co/datasets/{dataset_name})
* **License**: MIT License. (As per dataset terms).
* **Compliance**: Users must follow the original base model's license terms.
"""

# Write file

readme_path = os.path.join(readme_dir, "README.md")
with open(readme_path, "w", encoding="utf-8") as f:
    f.write(readme_md)

print(f"✅ README.md created: {readme_path}")

import os
from huggingface_hub import HfApi


# 4-1. Upload the model itself (Merged 16-bit)
model.push_to_hub_merged(
    repo_id,
    tokenizer,
    save_method = "merged_16bit",
    token = HF_TOKEN
)

# 4-2. Upload the created README.md file
api.upload_file(
    path_or_fileobj=os.path.join("dpo_lora_model", "README.md"),
    path_in_repo="README.md",
    repo_id=repo_id,
    repo_type="model",
)

print(f"✅ All uploads completed!\nURL: https://huggingface.co/{repo_id}")