# Qwen3-14B Clarity COT Fine-tuning
Training Qwen3-14B on the Clarity Chain-of-Thought dataset for political discourse classification.

This notebook fine-tunes Qwen3-14B using:
- **Dataset**: `Saitejabojja/qevasion-mtl-conversations-v2` (political discourse classification; this is the dataset the Data Prep cell loads)
- **Model**: `unsloth/Qwen3-14B-unsloth-bnb-4bit`
- **Method**: LoRA with `train_on_responses_only`
- **Chat Template**: `qwen-3` (set with `get_chat_template` in Data Prep)

### Installation

In [None]:
%%capture
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1" # [NEW] Extra 30% context lengths!
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    !pip install --upgrade -qqq uv
    !uv pip install vllm==0.11.1 unsloth-zoo unsloth
    !uv pip install transformers==4.57.1
    !uv pip install --no-deps trl==0.22.2

### Load Model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Longer for COT responses
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B-unsloth-bnb-4bit",  # Qwen3 14B Instruct
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

We now add LoRA adapters, so only a small fraction of the model's parameters needs to be updated during training!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2026.1.4 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
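
To get a feel for how small the trainable part is, the adapter size can be estimated by hand. The sketch below is not part of the original notebook: the layer dimensions are assumed Qwen3-14B values (they are not printed anywhere above), and `lora_param_count` is a hypothetical helper. Each targeted linear layer gains two small matrices, `A` (rank x in) and `B` (out x rank).

```python
# Sketch: count LoRA adapter parameters for the configuration above.
# All dimensions below are ASSUMED Qwen3-14B values, not read from the notebook.
HIDDEN = 5120          # assumed hidden size
INTERMEDIATE = 17408   # assumed MLP intermediate size
Q_OUT = 40 * 128       # assumed 40 query heads x head_dim 128
KV_OUT = 8 * 128       # assumed 8 key/value heads x head_dim 128
NUM_LAYERS = 40        # matches the "patched 40 layers" log line

# (in_features, out_features) of each projection listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds rank * (in + out) parameters per targeted linear layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

print(f"Trainable LoRA parameters at r=16: {lora_param_count(16):,}")
```

Doubling `r` doubles the adapter size, which is the main memory trade-off behind the suggested ranks.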


<a name="Data"></a>
### Data Prep
Multi-Task Learning

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-3",  # Qwen chat template
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("Saitejabojja/qevasion-mtl-conversations-v2", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 3448 examples
Generating test split: 308 examples

In [None]:
from pprint import pprint
pprint(dataset[0]["conversations"])


[{'content': 'You are an expert political discourse analyst.',
  'role': 'system'},
 {'content': '### Question ###\n'
             'How would you respond to the accusation that the United States '
             'is containing China while pushing for diplomatic talks?\n'
             '\n'
             '### Answer ###\n'
             'Well, look, first of all, theI am sincere about getting the '
             'relationship right. And one of the things that is going on now '
             'is, China is beginning to change some of the rules of the game, '
             'in terms of trade and other issues.And so one of the things we '
             "talked about, for example, is that they're now talking about "
             'making sure that no Chineseno one in the Chinese Government can '
             'use a Western cell phone. Those kinds of things.And so, really, '
             'what this trip was aboutit was less about containing China. I '
             "don't want to contain China. I just w

In [None]:
print(dataset[0]["text"])


<|im_start|>system
You are an expert political discourse analyst.<|im_end|>
<|im_start|>user
### Question ###
How would you respond to the accusation that the United States is containing China while pushing for diplomatic talks?

### Answer ###
Well, look, first of all, theI am sincere about getting the relationship right. And one of the things that is going on now is, China is beginning to change some of the rules of the game, in terms of trade and other issues.And so one of the things we talked about, for example, is that they're now talking about making sure that no Chineseno one in the Chinese Government can use a Western cell phone. Those kinds of things.And so, really, what this trip was aboutit was less about containing China. I don't want to contain China. I just want to make sure that we have a relationship with China that is on the up and up, squared away, everybody knows what it's all about. And one of the ways you do that is, you make sure that we are talking about the same
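
The printed text above follows a ChatML-style layout: each turn is wrapped in `<|im_start|>role` and `<|im_end|>`. The sketch below mimics that layout by hand to make the structure explicit. `render_chatml` is a hypothetical helper and a simplification; the real `qwen-3` template applied by the tokenizer handles more cases (for example thinking blocks and tools).

```python
# Simplified ChatML rendering; NOT the tokenizer's actual template.
def render_chatml(messages, add_generation_prompt=False):
    """Wrap each message in <|im_start|>role ... <|im_end|> markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # An open assistant turn tells the model it is its turn to answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

convo = [
    {"role": "system", "content": "You are an expert political discourse analyst."},
    {"role": "user", "content": "Do you support the new tax bill?"},
]
print(render_chatml(convo, add_generation_prompt=True))
```

For training, `add_generation_prompt=False` is used because the assistant turn is already part of the conversation; for inference it is set to `True` so the model continues from an open assistant turn.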

<a name="Train"></a>
### Train the model
Train on the dataset prepared above for political discourse classification. The configuration below runs one full epoch (`num_train_epochs = 1`).

In [None]:
from trl import SFTConfig, SFTTrainer
from unsloth import train_on_responses_only

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,  # Full epoch
        warmup_ratio = 0.1,
        learning_rate = 2e-4,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb",  # Enable WandB logging
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
    ),
)

# Train only on assistant responses (not system/user prompts)
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)
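
Conceptually, training on responses only means the loss ignores every token outside an assistant turn. The sketch below illustrates that idea on a toy token list; it is not Unsloth's implementation, and `mask_non_response` and `find_all` are hypothetical helpers. The value `-100` is the label index that PyTorch's cross-entropy loss skips by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default

def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_non_response(tokens, instruction_marker, response_marker):
    """Keep labels only after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    instruction_starts = find_all(tokens, instruction_marker)
    for start in find_all(tokens, response_marker):
        begin = start + len(response_marker)
        # The response ends where the next user turn begins, or at the sequence end.
        end = next((i for i in instruction_starts if i >= begin), len(tokens))
        labels[begin:end] = tokens[begin:end]
    return labels

# Toy sequence with string "tokens" so the result is easy to read.
tokens = ["<user>", "Is", "it", "clear", "?", "<assistant>", "Clear", "Reply", "<end>"]
labels = mask_non_response(tokens, ["<user>"], ["<assistant>"])
print(labels)
```

Only the tokens after `<assistant>` keep their labels, so the system and user prompts contribute nothing to the gradient.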

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()
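
Before launching a run, it can help to estimate how many optimizer steps the settings imply. The sketch below is a rough estimate under stated assumptions: a single GPU, no packing, and the 3448 training examples reported when the dataset was loaded. `training_schedule` is a hypothetical helper, and the trainer's exact rounding may differ slightly.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum,
                      epochs=1, warmup_ratio=0.1, num_devices=1):
    """Estimate effective batch size, optimizer steps and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Values taken from the SFTConfig above; single GPU is an assumption.
schedule = training_schedule(num_examples=3448, per_device_batch=2, grad_accum=4)
print(schedule)
```

With `logging_steps = 10`, this also gives a rough idea of how many loss values to expect in the log.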

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's test the trained model on a sample clarity classification task!

In [None]:
FastLanguageModel.for_inference(model)

test_messages = [
    {
        "role": "user",
        "content": """<QUESTION>
Do you support the new tax bill?
</QUESTION>

<ANSWER>
Well, look, I think we need to have a comprehensive discussion about fiscal responsibility.
The American people deserve to know that their government is being good stewards of their money.
</ANSWER>

Respond using:
<EVASION> ... </EVASION>
<CLARITY> ... </CLARITY>
"""
    }
]


inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,     # labels are short
    do_sample=False,       # greedy decoding, so the output is deterministic
    use_cache=True,
)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
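
With `do_sample=False`, `generate` uses greedy decoding: it picks the highest-scoring token at every step, so the temperature value plays no role in the result. The toy sketch below contrasts the two ideas on a made-up list of logits. `softmax_with_temperature` and `greedy_pick` are hypothetical helpers for illustration, not library functions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; undefined at temperature 0 (division by zero)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the highest-scoring token, no randomness."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 0.5, 1.0]  # made-up scores for three candidate tokens
print("greedy choice:", greedy_pick(logits))
print("probabilities at T=0.1:", softmax_with_temperature(logits, 0.1))
```

A very low temperature concentrates almost all probability on the greedy choice, which is why greedy decoding is the usual way to get reproducible labels for evaluation.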


You can also use a `TextStreamer` to see the generation token by token, instead of waiting for the full output!

In [None]:
# Streaming inference example
FastLanguageModel.for_inference(model)

test_messages = [
    {
        "role": "user",
        "content": """<QUESTION>
Will you raise taxes on the middle class?
</QUESTION>

<ANSWER>
No, I will not raise taxes on the middle class. My plan specifically targets corporations and high-income earners.
</ANSWER>

Respond using:
<EVASION> ... </EVASION>
<CLARITY> ... </CLARITY>
"""
    }
]


inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=1024)

<a name="TestEval"></a>
### Test Set Evaluation
Evaluate the model on the full test split with metrics for Clarity classification.

In [None]:
import pandas as pd
import re
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score, classification_report, confusion_matrix
from datasets import load_dataset

def extract_clarity_label(response: str) -> str:
    """Extract CLARITY label from MTL response."""
    clarity_match = re.search(r'<CLARITY>\s*(.*?)\s*</CLARITY>', response, re.IGNORECASE | re.DOTALL)
    if clarity_match:
        return clarity_match.group(1).strip()
    return "PARSE_ERROR"

# Load test dataset
test_dataset = load_dataset("Saitejabojja/qevasion-mtl-conversations-v2", split="test")
print(f"Loaded {len(test_dataset)} test examples")
print(f"Columns: {test_dataset.column_names}")
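
Since the model is asked to answer with both `<EVASION>` and `<CLARITY>` tags, the same regex approach can be generalized to any tag. The sketch below is a variation on `extract_clarity_label`, not part of the original notebook; `extract_tag` is a hypothetical helper and the evasion label in the sample is made up for illustration.

```python
import re

def extract_tag(response, tag, default="PARSE_ERROR"):
    """Return the text inside <tag>...</tag>, or `default` if the tag is missing."""
    pattern = rf"<{re.escape(tag)}>\s*(.*?)\s*</{re.escape(tag)}>"
    match = re.search(pattern, response, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else default

# Made-up model output; the evasion label is illustrative only.
sample = "<EVASION> Example Label </EVASION>\n<CLARITY> Ambivalent </CLARITY>"
print(extract_tag(sample, "CLARITY"))
print(extract_tag(sample, "EVASION"))
print(extract_tag("no tags here", "CLARITY"))
```

Returning a sentinel such as `PARSE_ERROR` instead of raising keeps the evaluation loop running and makes malformed outputs easy to count afterwards.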

In [None]:
# Run inference on test set
FastLanguageModel.for_inference(model)

results = []
total = len(test_dataset)

print(f"\nRunning inference on {total} test examples...\n")

for idx, example in tqdm(enumerate(test_dataset), total=total, desc="Evaluating"):
    # Get messages without assistant response
    messages = example["conversations"][:-1]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    
    try:
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=64,
            do_sample=False,  # greedy decoding, so the output is deterministic
            use_cache=True,
        )
        
        # Decode only the generated part
        input_len = inputs.shape[-1]
        response = tokenizer.decode(outputs[0, input_len:], skip_special_tokens=True).strip()
        pred_clarity = extract_clarity_label(response)
        
        results.append({
            "prediction": pred_clarity,
            "clarity_label": example["clarity_label"],
            "raw_output": response
        })
        
    except Exception as e:
        results.append({
            "prediction": "ERROR",
            "clarity_label": example["clarity_label"],
            "raw_output": str(e)
        })
    
    # Clear cache periodically
    if (idx + 1) % 25 == 0:
        torch.cuda.empty_cache()

print(f"\nInference complete!")

In [None]:
# Create DataFrame with results
results_df = pd.DataFrame(results)

# Save predictions to CSV
results_df.to_csv("qwen3_mtl_test_predictions.csv", index=False)
print(f"✅ Saved predictions to qwen3_mtl_test_predictions.csv")
print(f"Results shape: {results_df.shape}")
print(results_df.head())

# Filter to valid predictions only
y_true = results_df["clarity_label"].str.strip()
y_pred = results_df["prediction"].str.strip()

VALID_LABELS = {"Clear Reply", "Clear Non-Reply", "Ambivalent"}
mask = y_true.isin(VALID_LABELS) & y_pred.isin(VALID_LABELS)

y_true_filtered = y_true[mask]
y_pred_filtered = y_pred[mask]

print(f"Valid predictions: {mask.sum()} / {len(mask)}")
print(f"Parse errors/skipped: {(~mask).sum()}")

# Calculate metrics
accuracy = accuracy_score(y_true_filtered, y_pred_filtered)
precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true_filtered, y_pred_filtered, average="macro"
)
f1_weighted = f1_score(y_true_filtered, y_pred_filtered, average="weighted")

print("\n" + "=" * 60)
print("CLARITY CLASSIFICATION METRICS (MTL Model on Test Set)")
print("=" * 60)
print(f"\nAccuracy        : {accuracy:.4f}")
print(f"Precision (Mac) : {precision_macro:.4f}")
print(f"Recall (Mac)    : {recall_macro:.4f}")
print(f"F1 (Macro)      : {f1_macro:.4f}")
print(f"F1 (Weighted)   : {f1_weighted:.4f}")
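
Note that the metrics above are computed only on examples where the prediction parsed into a valid label, which can flatter the model if many outputs fail to parse. A stricter view counts every unparseable prediction as wrong. The sketch below shows that idea in plain Python on made-up data; `strict_accuracy` and `coverage` are hypothetical helpers, not part of the original notebook.

```python
VALID_LABELS = {"Clear Reply", "Clear Non-Reply", "Ambivalent"}

def strict_accuracy(y_true, y_pred, valid_labels=VALID_LABELS):
    """Accuracy over ALL examples; invalid predictions count as wrong."""
    if not y_true:
        return 0.0
    correct = sum(1 for t, p in zip(y_true, y_pred) if p in valid_labels and t == p)
    return correct / len(y_true)

def coverage(y_pred, valid_labels=VALID_LABELS):
    """Share of predictions that parsed into a valid label."""
    if not y_pred:
        return 0.0
    return sum(1 for p in y_pred if p in valid_labels) / len(y_pred)

# Made-up example: one of four predictions failed to parse.
demo_true = ["Clear Reply", "Ambivalent", "Ambivalent", "Clear Non-Reply"]
demo_pred = ["Clear Reply", "PARSE_ERROR", "Ambivalent", "Ambivalent"]
print("strict accuracy:", strict_accuracy(demo_true, demo_pred))
print("coverage:", coverage(demo_pred))
```

In this toy case the filtered accuracy would be 2 of 3, while the strict accuracy is 2 of 4, so reporting both numbers alongside coverage gives a fuller picture.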

In [None]:
# Detailed classification report
print("\n" + "=" * 60)
print("DETAILED CLASSIFICATION REPORT")
print("=" * 60 + "\n")

print(classification_report(y_true_filtered, y_pred_filtered, digits=4))

# Prediction distribution
print("\nPrediction vs Ground Truth Distribution:")
print("\nGround Truth:")
for label in ["Clear Reply", "Clear Non-Reply", "Ambivalent"]:
    count = (y_true_filtered == label).sum()
    print(f"   {label}: {count}")

print("\nPredictions:")
for label in ["Clear Reply", "Clear Non-Reply", "Ambivalent"]:
    count = (y_pred_filtered == label).sum()
    print(f"   {label}: {count}")

# Confusion Matrix
labels = ["Clear Reply", "Clear Non-Reply", "Ambivalent"]
cm = confusion_matrix(y_true_filtered, y_pred_filtered, labels=labels)

print("\n" + "=" * 60)
print("CONFUSION MATRIX")
print("=" * 60)
print(f"\n{'':20} {'Pred CR':>12} {'Pred CNR':>12} {'Pred Amb':>12}")
print("-" * 60)
for i, true_label in enumerate(labels):
    row = f"{true_label:20}"
    for j in range(3):
        row += f" {cm[i][j]:>12}"
    print(row)
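
The confusion matrix contains everything needed to recompute per-class precision and recall, which is a handy cross-check against `classification_report`. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predictions) and uses made-up counts; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(cm, labels):
    """Compute (precision, recall, f1) per class from a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: times this class was predicted
        actual = sum(cm[i])                    # row sum: times this class was the truth
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

demo_labels = ["Clear Reply", "Clear Non-Reply", "Ambivalent"]
demo_cm = [  # made-up counts, for illustration only
    [8, 1, 1],
    [0, 5, 5],
    [2, 0, 8],
]
for name, (p, r, f1) in per_class_metrics(demo_cm, demo_labels).items():
    print(f"{name:20} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Macro F1 is then the unweighted mean of the per-class F1 values, which is why a weak minority class pulls it down more than it pulls down accuracy.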

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("clarity-qwen3-14b-mtl-lora")  # Local saving
tokenizer.save_pretrained("clarity-qwen3-14b-mtl-lora")
# model.push_to_hub("Saietjabojja/clarity-qwen3-14b-mtl", token = "...") # Online saving
# tokenizer.push_to_hub("Saietjabojja/clarity-qwen3-14b-mtl", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "clarity-qwen3-14b-mtl-lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Test loaded model
test_messages = [
    {"role": "system", "content": "You are an expert political discourse analyst. Analyze political interviews step by step and classify response clarity."},
    {"role": "user", "content": """Classify this response:

### Full Interview Question ###
Q. Do you support the infrastructure bill?

### Full Interview Answer ###
Yes, I fully support the infrastructure bill.

### Specific Question to Classify ###
Do you support the infrastructure bill?

Think step by step, then provide your classification."""}
]

inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=1024)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended; use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        "clarity-qwen3-14b-mtl-lora",  # the adapter directory saved above
        load_in_4bit=load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("clarity-qwen3-14b-mtl-lora")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("clarity-qwen3-14b-mtl", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("Saietjabojja/clarity-qwen3-14b-mtl", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("clarity-qwen3-14b-mtl-4bit", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("Saietjabojja/clarity-qwen3-14b-mtl-4bit", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("clarity-qwen3-14b-mtl-lora")
    tokenizer.save_pretrained("clarity-qwen3-14b-mtl-lora")
if False:
    model.push_to_hub("Saietjabojja/clarity-qwen3-14b-mtl-lora", token = "")
    tokenizer.push_to_hub("Saietjabojja/clarity-qwen3-14b-mtl-lora", token = "")
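
The difference between saving adapters and saving a merged model comes down to one formula: a merged weight is the base weight plus the scaled low-rank update, `W + (alpha / r) * B @ A`. The sketch below shows that arithmetic on tiny nested-list matrices. It is an illustration of standard LoRA scaling (with `use_rslora = False`), not Unsloth's merge code, which also has to dequantize the 4-bit base weights; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> out x in
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 layer with a rank-1 adapter; all numbers are made up.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # rank x in
B = [[0.5], [0.0]]      # out x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

In this notebook `lora_alpha` and `r` are both 16, so the scale factor is 1 and the update is added unscaled.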


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
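
To choose between formats it helps to have a rough size estimate. The sketch below covers only `f16` and `q8_0`, whose layouts are simple: `f16` stores 2 bytes per weight, and `q8_0` stores blocks of 32 int8 values plus one 16-bit scale (34 bytes per 32 weights). The parameter count is an assumption, file metadata is ignored, and `estimate_gguf_bytes` is a hypothetical helper, so treat the numbers as ballpark figures only.

```python
# Bytes per weight for two formats with simple, fixed layouts.
BYTES_PER_WEIGHT = {
    "f16": 2.0,       # one 16-bit float per weight
    "q8_0": 34 / 32,  # 32 int8 values + one 16-bit scale per block
}

def estimate_gguf_bytes(num_params, method):
    """Rough tensor payload size; ignores metadata and mixed-precision tensors."""
    return num_params * BYTES_PER_WEIGHT[method]

ASSUMED_PARAMS = 14.8e9  # ASSUMED parameter count for Qwen3-14B; adjust as needed

for method in ("f16", "q8_0"):
    size_gib = estimate_gguf_bytes(ASSUMED_PARAMS, method) / 1024**3
    print(f"{method}: ~{size_gib:.1f} GiB")
```

The k-quant methods such as `q4_k_m` mix several block types per tensor, so they are left out of this estimate on purpose.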

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("clarity-qwen3-14b-mtl-gguf", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
if False: model.push_to_hub_gguf("Saietjabojja/clarity-qwen3-14b-mtl-gguf", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("clarity-qwen3-14b-mtl-gguf", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("Saietjabojja/clarity-qwen3-14b-mtl-gguf", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("clarity-qwen3-14b-mtl-gguf", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("Saietjabojja/clarity-qwen3-14b-mtl-gguf", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "Saietjabojja/clarity-qwen3-14b-mtl-gguf",
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
