### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm==0.8.5.post1

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm==0.8.5.post1
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
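
To see what the `re.sub` filter in the cell above actually removes, here is a minimal sketch that applies the same pattern to a made-up requirements snippet. The sample bytes are hypothetical; the real file is downloaded from the vLLM repository at run time.

```python
import re

# Same pattern as in the install cell: drop any requirement line that
# mentions transformers, numpy, or xformers followed by a version spec.
PINNED = rb"(transformers|numpy|xformers)[^\n]{1,}\n"

# Hypothetical sample standing in for vLLM's requirements/common.txt
sample = (
    b"requests>=2.26.0\n"
    b"numpy<2.0.0\n"
    b"transformers>=4.51.1\n"
    b"tqdm\n"
    b"xformers==0.0.29\n"
)

filtered = re.sub(PINNED, b"", sample)
print(filtered.decode())

# A bare package name with no version spec is NOT removed, because the
# pattern requires at least one character between the name and the newline.
unpinned = re.sub(PINNED, b"", b"numpy\n")
print(unpinned)
```

Only `requests` and `tqdm` survive in the first case, which is why the later `pip install -r vllm_requirements.txt` leaves Colab's preinstalled numpy alone.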

### Unsloth

Goal: Fine-tune `Qwen3-4B-Base` with LoRA to summarize financial filings, using the `kritsadaK/EDGAR-CORPUS-Financial-Summarization` dataset.

The model is trained with supervised fine-tuning (SFT) only; no GRPO stage is run in this notebook.

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-10 14:48:41 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-10 14:48:41 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.7.2: Fast Qwen3 patching. Transformers: 4.53.1. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/Qwen3-4B-Base with actual GPU utilization = 69.34%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8192

INFO 07-10 14:49:10 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-10 14:49:10 [cuda.py:289] Using XFormers backend.
INFO 07-10 14:49:10 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 07-10 14:49:10 [model_runner.py:1108] Starting to load model unsloth/Qwen3-4B-Base...
INFO 07-10 14:49:11 [weight_utils.py:265] Using model weights format ['*.safetensors']


model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.08G [00:00<?, ?B/s]

INFO 07-10 14:52:08 [weight_utils.py:281] Time spent downloading weights for unsloth/Qwen3-4B-Base: 177.388059 seconds


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 07-10 14:52:47 [loader.py:458] Loading weights took 38.50 seconds
INFO 07-10 14:52:47 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 07-10 14:52:47 [model_runner.py:1140] Model loading took 7.6338 GiB and 216.642634 seconds
INFO 07-10 14:53:01 [worker.py:287] Memory profiling takes 12.78 seconds
INFO 07-10 14:53:01 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.69) = 10.22GiB
INFO 07-10 14:53:01 [worker.py:287] model weights take 7.63GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.91GiB; the rest of the memory reserved for KV Cache is 1.65GiB.
INFO 07-10 14:53:01 [executor_base.py:112] # cuda blocks: 750, # CPU blocks: 0
INFO 07-10 14:53:01 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 1.46x
INFO 07-10 14:53:01 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mo

Capturing CUDA graph shapes:   0%|          | 0/23 [00:00<?, ?it/s]

INFO 07-10 14:53:45 [model_runner.py:1592] Graph capturing finished in 44 secs, took 0.34 GiB
INFO 07-10 14:53:45 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 57.83 seconds
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'pre_feedforward_layernorm']
Unsloth 2025.7.2 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
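
The vLLM log above already contains enough numbers to check where the GPU memory goes. The sketch below redoes that arithmetic. All figures are copied from the log except the block size of 16 tokens, which is an assumption (it is vLLM's usual default and is not printed in the log).

```python
# Figures copied from the vLLM log (GiB)
budget = 10.22            # total_gpu_memory x gpu_memory_utilization
model_weights = 7.63
non_torch = 0.03
activation_peak = 0.91

# Whatever is left of the budget is reserved for the KV cache
kv_cache = budget - model_weights - non_torch - activation_peak
print(f"KV cache: {kv_cache:.2f} GiB")

# Assumption: 16 tokens per KV-cache block (not stated in the log)
block_size = 16
cuda_blocks = 750         # "# cuda blocks: 750"
max_seq_length = 8192

# How many full-length requests fit in the KV cache at once
concurrency = cuda_blocks * block_size / max_seq_length
print(f"Maximum concurrency: {concurrency:.2f}x")
```

Under that assumption both results match the log (1.65 GiB and 1.46x). On a T4 the KV cache is the tight spot, so lowering `max_seq_length` helps more than adjusting `gpu_memory_utilization` by a few points.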


**Initializing Chat Template**

In [None]:
qwen_chat_template = """{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{'<|im_start|>assistant\n'}}{% endif %}"""

tokenizer.chat_template = qwen_chat_template
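
For readers not used to Jinja, here is a plain-Python sketch of what the template above produces. `render_chatml` is a hypothetical helper written for illustration only; the notebook itself always goes through `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Plain-Python equivalent of the Jinja chat template (illustrative only)."""
    text = ""
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


demo = [
    {"role": "system", "content": "You are a professional financial assistant."},
    {"role": "user", "content": "Summarize the text below:\n\n(filing text)"},
]

train_style = render_chatml(demo)
eval_style = render_chatml(demo, add_generation_prompt=True)
print(eval_style)
```

With `add_generation_prompt=True` the text ends in an open assistant turn, which is the form used for evaluation prompts. Training examples instead include the full assistant message and its closing `<|im_end|>`.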

**Load Dataset**

In [None]:
from datasets import load_dataset

dataset = load_dataset("kritsadaK/EDGAR-CORPUS-Financial-Summarization")

README.md: 0.00B [00:00, ?B/s]

EDGAR-CORPUS-Financial-Summarization.csv:   0%|          | 0.00/794M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10610 [00:00<?, ? examples/s]

**Preprocess**

In [None]:
import re

def extract_year(example):
    # The \b word boundaries ensure we match a standalone four-digit number
    match = re.search(r'\b(199\d|20[0-2]\d)\b', example['input'])

    # If a year is found, store it as an integer; otherwise store 0.
    if match:
        example['year'] = int(match.group(0))
    else:
        example['year'] = 0
    return example

print("Adding 'year' column to the dataset...")
dataset_with_year = dataset.map(extract_year)
print("Finished adding the 'year' column.")
print("\nDataset structure after adding the 'year' column:")
print(dataset_with_year)

Adding 'year' column to the dataset...


Map:   0%|          | 0/10610 [00:00<?, ? examples/s]

Finished adding the 'year' column.

Dataset structure after adding the 'year' column:
DatasetDict({
    train: Dataset({
        features: ['input', 'summary', 'model', 'year'],
        num_rows: 10610
    })
})
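
A quick sketch of how the year regex behaves on a few strings. `first_year` is a hypothetical standalone version of `extract_year`, and the sample sentences are made up for illustration.

```python
import re

# Same pattern as extract_year: standalone years from 1990 to 2029
YEAR = re.compile(r"\b(199\d|20[0-2]\d)\b")


def first_year(text):
    """Return the first matching year in the text, or 0 if there is none."""
    match = YEAR.search(text)
    return int(match.group(0)) if match else 0


print(first_year("balance sheet as of December 31, 2018, and the related notes"))
print(first_year("criteria (2013 framework) applied as of December 31, 2018"))
print(first_year("No fiscal year is mentioned here."))
print(first_year("Account 120195 was closed."))
```

The second call returns 2013, not 2018: the function takes the first year that appears, so a filing that cites an older framework or a prior period before its own fiscal year gets the earlier label. The last call returns 0 because `\b` prevents a match inside a longer number.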


In [None]:
print("Filtering data for the years 2018-2020...")
target_years_data = dataset_with_year['train'].filter(
    lambda example: 2018 <= example['year'] <= 2020
)
print(f"Found {len(target_years_data)} rows in the 2018-2020 range.")

Filtering data for the years 2018-2020...


Filter:   0%|          | 0/10610 [00:00<?, ? examples/s]

Found 4356 rows in the 2018-2020 range.


In [None]:
# Take 500 samples from the dataset
num_samples = 500

if len(target_years_data) >= num_samples:
    final_dataset = target_years_data.shuffle(seed=42).select(range(num_samples))
    print(f"Created the final dataset with {len(final_dataset)} samples.")
else:
    print(f"Only {len(target_years_data)} rows available, fewer than {num_samples}.")
    final_dataset = target_years_data.shuffle(seed=42)

print("\n--- FINAL RESULT ---")
print("Final dataset structure:")
print(final_dataset)

Created the final dataset with 500 samples.

--- FINAL RESULT ---
Final dataset structure:
Dataset({
    features: ['input', 'summary', 'model', 'year'],
    num_rows: 500
})


In [None]:
print("Sample data:")
for i in range(5):
    print(f"Example {i+1}: Year = {final_dataset[i]['year']}, Input = \"{final_dataset[i]['input'][:70]}...\"")


Sample data:
Example 1: Year = 2019, Input = ". To the Board of Directors and Shareholders of Applied Industrial Tec..."
Example 2: Year = 2019, Input = "INDEX TO FINANCIAL INFORMATION BLOOMIN’ BRANDS, INC. Management’s Annu..."
Example 3: Year = 2019, Input = "To the Shareholders and Board of Directors of Good Gaming, Inc. Opinio..."
Example 4: Year = 2018, Input = "FINANCIAL STATEMENTS Index to Consolidated Financial Statements Report..."
Example 5: Year = 2018, Input = "MANAGEMENT’S REPORT ON INTERNAL CONTROL OVER FINANCIAL REPORTING Manag..."


In [None]:
system_prompt = """You are a professional financial assistant.
Your objective is to generate a clear, concise, and professional summary of the provided data.
Ensure the summary accurately reflects the key information and main conclusions of the original text.

###Instructions for answering:
No wordiness, 350 words limit
No further explanation, just summary"""


user_template = """Summarize the text below:

{}"""

assistant_template = """{}
"""


def apply_chat_template(data):
  messages = [{
      "role" : "system",
      "content" : system_prompt
  }, {
      "role" : "user",
      "content" : user_template.format(data["input"])
  }, {
      "role" : "assistant",
      "content" : assistant_template.format(data["summary"])
  }]
  text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt=False, enable_thinking=False)
  data["text"] = text
  return data


def apply_chat_template_eval(data):
  messages = [{
      "role" : "system",
      "content" : system_prompt
  }, {
      "role" : "user",
      "content" : user_template.format(data["input"])
  }]
  text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt=True, enable_thinking=False)
  data["text"] = text
  return data


In [None]:
split_dataset = final_dataset.train_test_split(test_size=0.2, seed=42)

train_data = split_dataset['train'].map(apply_chat_template)
test_data = split_dataset['test'].map(apply_chat_template_eval)

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Check to see if it worked:

In [None]:
print(train_data[0]["text"])

<|im_start|>system
You are a professional financial assistant.
Your objective is to generate a clear, concise, and professional summary of the provided data.
Ensure the summary accurately reflects the key information and main conclusions of the original text.

###Instructions for answering:
No wordiness, 350 words limit
No further explanation, just summary<|im_end|>
<|im_start|>user
Summarize the text below:

FIDELITY NATIONAL FINANCIAL, INC. AND SUBSIDIARIES INDEX TO FINANCIAL INFORMATION To the Shareholders and the Board of Directors of Fidelity National Financial, Inc. Opinion on Internal Control over Financial Reporting We have audited Fidelity National Financial, Inc. and subsidiaries’ internal control over financial reporting as of December 31, 2018, based on criteria established in Internal Control-Integrated Framework issued by the Committee of Sponsoring Organizations of the Treadway Commission (2013 framework) (the COSO criteria). In our opinion, Fidelity National Financial, 

The evaluation prompt is built the same way but omits the assistant turn, so the reference summary is never shown to the model:

In [None]:
print(test_data[0]["text"])

<|im_start|>system
You are a professional financial assistant.
Your objective is to generate a clear, concise, and professional summary of the provided data.
Ensure the summary accurately reflects the key information and main conclusions of the original text.

###Instructions for answering:
No wordiness, 350 words limit
No further explanation, just summary<|im_end|>
<|im_start|>user
Summarize the text below:

To the Board of Directors and Shareholders of Cruzani, Inc. Opinion on the Financial Statements We have audited the accompanying consolidated balance sheet of Cruzani, Inc. (“the Company”) as of December 31, 2018, and the related consolidated statements of operations, changes in stockholders’ deficit, and cash flows for the year then ended, and the related notes (collectively referred to as the financial statements). In our opinion, the financial statements present fairly, in all material respects, the financial position of the Company as of December 31, 2018, and the results of i

**Finetune model**

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

train_dataset = train_data

trainer_args = TrainingArguments(
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 16,

    num_train_epochs = 2,
    learning_rate = 2e-4,

    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",
    warmup_steps = 5,
    weight_decay = 0.01,

    output_dir = "outputs",
    logging_steps = 10,
    seed = 3407,
    report_to = "none",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = 8192,
    dataset_num_proc = 2,
    packing = False,
    args = trainer_args,
)


Unsloth: Tokenizing ["text"]:   0%|          | 0/400 [00:00<?, ? examples/s]
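
With these arguments the number of optimizer steps can be worked out before training starts. The sketch below uses the values from `TrainingArguments` and the 400-example training split. The single-GPU count is taken from the Unsloth banner, and since 400 divides evenly by the effective batch size, the rounding convention does not matter here.

```python
import math

num_examples = 400                 # size of the training split
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1
num_train_epochs = 2

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Optimizer steps per epoch and over the whole run
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, steps_per_epoch, total_steps)  # 16 25 50
```

This should agree with the `Total steps` value Unsloth prints when training starts. With `warmup_steps = 5`, the learning rate warms up over the first 10% of the run.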

In [None]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 400 | Num Epochs = 2 | Total steps = 50
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 66,060,288 of 4,088,528,384 (1.62% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,0.5627
20,0.3941
30,0.3565
40,0.3398
50,0.344


TrainOutput(global_step=50, training_loss=0.39939887046813966, metrics={'train_runtime': 1285.5118, 'train_samples_per_second': 0.622, 'train_steps_per_second': 0.039, 'total_flos': 1.817277230953267e+16, 'train_loss': 0.39939887046813966})
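
The `Trainable parameters = 66,060,288` figure in the banner can be reproduced by hand. Each LoRA-wrapped linear layer adds `r * (in_features + out_features)` parameters. The layer shapes below are assumptions about the Qwen3-4B architecture (they are not printed anywhere in this notebook); the rank and the seven target modules come from the `get_peft_model` call.

```python
# Assumed Qwen3-4B shapes (not stated in the notebook)
hidden = 2560
intermediate = 9728
q_out = 32 * 128     # 32 attention heads x head_dim 128
kv_out = 8 * 128     # 8 key/value heads x head_dim 128
num_layers = 36      # matches "patched 36 layers" in the log

r = 32               # lora_rank

# (in_features, out_features) for each target module
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per module
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 66,060,288

# Share of the 4,088,528,384 parameters reported by Unsloth
print(f"{100 * total / 4_088_528_384:.2f}%")  # 1.62%
```

Because the count scales linearly with `r`, halving `lora_rank` to 16 would halve the number of trainable parameters.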

With training finished, mount Google Drive so the LoRA adapter and tokenizer can be saved there:

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
model.save_pretrained("/content/drive/MyDrive/qwen_finetuned_2")
tokenizer.save_pretrained("/content/drive/MyDrive/qwen_finetuned_2")

('/content/drive/MyDrive/qwen_finetuned_2/tokenizer_config.json',
 '/content/drive/MyDrive/qwen_finetuned_2/special_tokens_map.json',
 '/content/drive/MyDrive/qwen_finetuned_2/vocab.json',
 '/content/drive/MyDrive/qwen_finetuned_2/merges.txt',
 '/content/drive/MyDrive/qwen_finetuned_2/added_tokens.json',
 '/content/drive/MyDrive/qwen_finetuned_2/tokenizer.json')

In [None]:
model.save_lora("/content/drive/MyDrive/grpo_saved_lora_2")

Reload the fine-tuned model from Google Drive to verify that the saved weights load correctly:

In [None]:
from unsloth import FastLanguageModel
import torch

model_directory = "/content/drive/MyDrive/qwen_finetuned_2"

max_seq_length = 2048
dtype = None
load_in_4bit = False

print(f"Loading model from folder: {model_directory}...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_directory,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("\nModel loaded successfully!")

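Now we test the reloaded model on a few held-out examples. Setting `CUDA_LAUNCH_BLOCKING=1` makes CUDA kernel launches synchronous, which makes errors easier to trace: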

In [None]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

In [None]:
import torch
import random

for i in random.sample(range(len(test_data)), 3):
    input_text = test_data[i]["input"]
    reference_summary = test_data[i]["summary"]

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_template.format(input_text)},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(input_ids=inputs, max_new_tokens=350, use_cache=True)
    generated_summary = tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens=True)[0]

    # --- Show the comparison ---
    print(f"===== EXAMPLE {i+1} =====")
    print("\n📜 INPUT TEXT (excerpt):")
    print(input_text[:500] + "...")
    print("\n✅ REFERENCE SUMMARY (from the dataset):")
    print(reference_summary)
    print("\n✨ MODEL-GENERATED SUMMARY:")
    print(generated_summary)
    print("\n" + "="*50 + "\n")
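
The system prompt asks for at most 350 words, while `max_new_tokens=350` caps tokens, and the two counts are not the same. A small check like the one below could be applied to each `generated_summary`. `word_count` and `within_word_limit` are hypothetical helpers, and the demo string is made up.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def within_word_limit(summary, limit=350):
    """True if the summary respects the word limit from the system prompt."""
    return word_count(summary) <= limit


# Hypothetical model output, for illustration only
demo_summary = "Revenue rose while operating costs fell."
print(word_count(demo_summary), within_word_limit(demo_summary))
```

Inside the loop above this would be called as `within_word_limit(generated_summary)`.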

In [13]:
from huggingface_hub import notebook_login

# A prompt appears asking for your Hugging Face token
notebook_login()


In [15]:
import glob
import os

hf_repo_name = "Jezzen/10K-summarization-based-qwen3"

# =================================================================
# 1. Save the LoRA adapters and tokenizer to Google Drive
# =================================================================
lora_path = "/content/drive/MyDrive/grpo_saved_lora_2"
model.save_pretrained(lora_path)
tokenizer.save_pretrained(lora_path)
print(f"LoRA adapters saved to: {lora_path}")

# =================================================================
# 2. Push the model and tokenizer to the Hugging Face Hub
# =================================================================
print(f"\n🚀 Pushing model and tokenizer to the Hugging Face Hub: {hf_repo_name}")
model.push_to_hub(hf_repo_name, token=True)
tokenizer.push_to_hub(hf_repo_name, token=True)
print("Model and tokenizer uploaded in their original format.")


# =================================================================
# 3. Save in GGUF format
# =================================================================
# Several quantized versions can be created; q4_k_m is the most common.
print("\n⚙️  Creating a GGUF version of the model...")

# save_pretrained_gguf() does not accept a `model_name` keyword (passing one
# raises "TypeError: unsloth_save_pretrained_gguf() got an unexpected keyword
# argument 'model_name'"), so only the output directory is given here.
gguf_dir = "/content/drive/MyDrive/model-qwen3_10k"
model.save_pretrained_gguf(
    gguf_dir,                      # Output directory on Google Drive
    tokenizer,
    quantization_method="q4_k_m",  # Other options: q8_0, f16
)
print("✅ Saved GGUF with Q4_K_M quantization.")


# =================================================================
# 4. Push the GGUF file to the Hugging Face Hub
# =================================================================
from huggingface_hub import HfApi, HfFolder

# Use the token stored by notebook_login()
hf_token = HfFolder.get_token()
api = HfApi()

print(f"\n🚀 Uploading GGUF file to repository: {hf_repo_name}")

# Unsloth chooses the GGUF file name, so look it up instead of hard-coding it.
# Assumption: the file is written inside gguf_dir; adjust the path if your
# Unsloth version writes it elsewhere.
gguf_files = glob.glob(os.path.join(gguf_dir, "*.gguf"))
if not gguf_files:
    raise FileNotFoundError(f"No .gguf file found in {gguf_dir}")

# Upload the 4-bit GGUF file
api.upload_file(
    path_or_fileobj=gguf_files[0],
    path_in_repo="model-qwen3_10k.gguf",
    repo_id=hf_repo_name,
    token=hf_token,
)

print("✅ Uploaded model-qwen3_10k.gguf")

print("\n🎉 All done!")

LoRA adapters saved to: /content/drive/MyDrive/grpo_saved_lora_2

🚀 Pushing model and tokenizer to the Hugging Face Hub: Jezzen/10K-summarization-based-qwen3


No files have been modified since last commit. Skipping to prevent empty commit.


Saved model to https://huggingface.co/Jezzen/10K-summarization-based-qwen3


No files have been modified since last commit. Skipping to prevent empty commit.


Model and tokenizer uploaded in their original format.

⚙️  Creating a GGUF version of the model...