# Installing libraries

In [1]:
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow

# Unsloth

In [2]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-18 01:35:20 __init__.py:190] Automatically detected platform cuda.


# Load Phi-3.5 model

In [3]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "gate_proj", "up_proj", "down_proj",
    ], # MLP only; QKVO (q_proj, k_proj, v_proj, o_proj) left out to save memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/phi-3.5-mini-instruct-bnb-4bit with actual GPU utilization = 59.59%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 6.2 GB. Also swap space = 2 GB.
INFO 02-18 01:35:59 config.py:542] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bi

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 02-18 01:36:06 model_runner.py:1115] Loading model weights took 2.1371 GB
INFO 02-18 01:36:06 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 01:36:10 worker.py:267] Memory profiling takes 3.24 seconds
INFO 02-18 01:36:10 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.60) = 8.78GiB
INFO 02-18 01:36:10 worker.py:267] model weights take 2.14GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.24GiB; the rest of the memory reserved for KV Cache is 6.38GiB.
INFO 02-18 01:36:11 executor_base.py:110] # CUDA blocks: 1088, # CPU blocks: 341
INFO 02-18 01:36:11 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 8.50x
INFO 02-18 01:36:12 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occur

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:40<00:00,  1.49s/it]

INFO 02-18 01:36:52 model_runner.py:1562] Graph capturing finished in 40 secs, took 0.73 GiB
INFO 02-18 01:36:52 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 46.10 seconds
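
The memory figures in the vLLM log above can be cross-checked with a little arithmetic. The following is a minimal sketch using numbers copied from the log; the 16-token KV-cache block size is an assumption (vLLM's usual default), since the log does not state it.

```python
# Numbers copied from the vLLM log output above.
total_gpu_gib = 14.74
actual_utilization = 0.5959      # "actual GPU utilization = 59.59%"
weights_gib = 2.14
non_torch_gib = 0.03
activation_peak_gib = 0.24

# Memory vLLM is allowed to use, and what is left for the KV cache.
budget_gib = total_gpu_gib * actual_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# Assumption: 16 tokens per KV-cache block (not stated in the log).
block_size_tokens = 16
cuda_blocks = 1088
max_seq_length = 2048
concurrency = cuda_blocks * block_size_tokens / max_seq_length

print(f"vLLM budget: {budget_gib:.2f} GiB")
print(f"KV cache:    {kv_cache_gib:.2f} GiB")
print(f"Max concurrency at {max_seq_length} tokens: {concurrency:.2f}x")
```

Under that block-size assumption the concurrency works out to the logged 8.50x, and the KV-cache estimate lands within rounding of the logged 6.38 GiB.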



Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.2.12 patched 32 layers with 0 QKV layers, 0 O layers and 32 MLP layers.
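
Because only the MLP projections carry LoRA adapters, the number of trainable parameters can be estimated by hand. This is a rough sketch: the hidden and intermediate sizes are taken from the tensor shapes in the GGUF export log at the end of this notebook, and `lora_params` is a helper name introduced here for illustration.

```python
lora_rank = 32
hidden_size = 3072         # e.g. attn_q shape {3072, 3072} in the GGUF log
intermediate_size = 8192   # e.g. ffn_gate shape {3072, 8192} in the GGUF log
num_layers = 32            # "patched 32 layers ... 32 MLP layers"

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

per_layer = (
    lora_params(hidden_size, intermediate_size, lora_rank)    # gate_proj
    + lora_params(hidden_size, intermediate_size, lora_rank)  # up_proj
    + lora_params(intermediate_size, hidden_size, lora_rank)  # down_proj
)
total = per_layer * num_layers
print(f"{total:,}")
```

The result agrees with the "Number of trainable parameters" figure that the trainer banner reports in the training section.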


# Data prep

In [4]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build chat-style prompts (system + user) and take the gold answer after "####"
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact line-by-line XML layout."""
    # Note: no re.DOTALL here, so `.` stops at newlines and multi-line
    # reasoning or answers do not match.
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that start with the two XML blocks, ignoring exact whitespace."""
    # Note: re.match anchors at the start of the string, and without re.DOTALL
    # `.` does not cross newlines, so only single-line blocks match.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]
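
To make the behaviour of the two extraction helpers concrete, here is a small standalone sketch. The helpers are copied from the cell above; the sample strings are made up for illustration and are not taken from GSM8K.

```python
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str):
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Made-up examples (not real dataset rows).
gold = "2 + 2 = 4\n#### 4"
well_formed = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"
cut_off = "<reasoning>\nFirst, we add 2 and"

print(repr(extract_hash_answer(gold)))        # text after "####"
print(repr(extract_xml_answer(well_formed)))  # text between the answer tags
print(repr(extract_xml_answer(cut_off)))      # no tags: the whole text comes back
```

Note the last case: when a completion is cut off before `<answer>`, `extract_xml_answer` falls back to returning the entire stripped text, so it can never equal the gold answer.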

# Training

In [5]:
import wandb
wandb.login()


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmaxwelljohn123123[0m ([33mmaxwelljohn123123-student[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [6]:
wandb.init(
    project="phi3.5_reasoning_traning",
    entity = "maxwelljohn123123-student"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [7]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We know expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6


In [8]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 6 | Gradient Accumulation steps = 1
\        /    Total batch size = 6 | Total steps = 250
 "-____-"     Number of trainable parameters = 34,603,008


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
First, we need to determine how many tickets Mr. Benson bought that exceeded the initial 10 tickets. He bought 12 tickets in total, so the number of tickets that exceed 10 is 12 - 10 = 2 tickets.

Next, we calculate the discount for these 2 tickets. The discount is 5% of the original price of each ticket, which is $40. So, the discount per ticket is 5/100 * $40 = $2.

Now, we apply this discount to the 2 tickets that he bought in excess. So, the discounted price for each of these tickets is $40 - $2 = $38.

Mr. Benson bought 2 discounted tickets and 10 full-priced tickets, so the total amount 
Extracted:
<reasoning>
First, we need to determine how many tickets Mr. Benson bought that exceeded the initial 10 tickets. He bought 12 tickets in total, so the number of

| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.072833 | 0.19552 | 200.0 | 0.0 | 0.072833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.125 | 0.0 | 200.0 | 0.0 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.125 | 0.0 | 200.0 | 1.8e-05 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.059667 | 0.160033 | 200.0 | 1.6e-05 | 0.059667 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | -0.039833 | 0.106929 | 187.0 | 2e-05 | -0.039833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.125 | 0.0 | 200.0 | 2.8e-05 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.125 | 0.0 | 200.0 | 2.4e-05 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -0.175167 | 0.259475 | 189.666672 | 2.9e-05 | -0.175167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.125 | 0.0 | 200.0 | 2.4e-05 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | 0.125 | 0.0 | 200.0 | 2.2e-05 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 |
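
The recurring 0.125 in the `xmlcount_reward_func` column, and the occasional negative values, follow from how `count_xml` scores text. The sketch below copies `count_xml` from the data-prep cell and runs it on made-up completions; the sample strings are illustrative only.

```python
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

# Made-up completions for illustration.
cut_off = "<reasoning>\nFirst, we add 2 and"   # only the opening tag is present
well_formed = "<reasoning>\nx\n</reasoning>\n<answer>\n4\n</answer>\n"
trailing = well_formed + "a" * 100             # extra text after </answer>
# Answer tag opened but never closed: the whole text counts as "trailing".
unclosed = "<reasoning>\n" + "x" * 500 + "\n</reasoning>\n<answer>\n4"

for name, text in [("cut_off", cut_off), ("well_formed", well_formed),
                   ("trailing", trailing), ("unclosed", unclosed)]:
    print(f"{name:12s} {count_xml(text):+.3f}")
```

A completion that hits the 200-token limit while still inside `<reasoning>` earns exactly 0.125. One way the score can go negative is when `\n<answer>\n` appears but the closing tag does not: the length penalty is then applied to the entire text.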


Streaming output truncated to the last 5000 lines.
1. The probability of winning both lawsuits is the product of the individual probabilities of winning each lawsuit: 0.30 * 0.50 = 0.15 or 15%.

2. Similarly, the probability of losing both lawsuits is the product of the individual probabilities of losing each lawsuit: 0.70 * 0 
Extracted:
<reasoning>
First, we need to calculate the probability of Andy winning and losing each lawsuit individually, and then the probability of all scenarios: winning both, losing both, winning one and losing one, and losing one and winning one.

For the first lawsuit, the probability of winning is 30%, so the probability of losing is 70% (100% - 30%). 

For the second lawsuit, the probability of winning is 50%, so the probability of losing is also 50%.

1. The probability of winning both lawsuits is the product of the individual probabilities of winning each lawsuit: 0.30 * 0.50 = 0.15 or 15%.

2. Similarly, the probability of losing both law

TrainOutput(global_step=250, training_loss=0.00013048033748623312, metrics={'train_runtime': 6855.2568, 'train_samples_per_second': 0.219, 'train_steps_per_second': 0.036, 'total_flos': 0.0, 'train_loss': 0.00013048033748623312})
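
Rows with `reward_std = 0.0` in the metrics table are steps where all six completions received the same reward. GRPO scores each completion relative to its group, so such steps carry no reward signal. The following is a simplified sketch of that group-relative advantage; the epsilon value and the use of the sample standard deviation are assumptions and may differ from what TRL does internally, and the `mixed` rewards are made up.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# All six completions only earned the opening-tag reward.
same = [0.125] * 6
# Hypothetical group in which one completion is also correct.
mixed = [2.0, 0.0, 0.0, 0.5, 0.0, 0.5]

print(group_advantages(same))
print([round(a, 3) for a in group_advantages(mixed)])
```

With identical rewards every advantage is zero, whereas a group with varied rewards pushes the policy toward the better completions and away from the worse ones.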

# Test the fine-tuned model

First, the base model (no LoRA adapter):

In [9]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Which is bigger? 9.11 or 9.9?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.20s/it, est. speed input: 5.64 toks/s, output: 29.13 toks/s]


' 9.9 is bigger than 9.11. When comparing decimal numbers, you look at the value of each digit in its place value. Both numbers have a "9" in the ones place, so you compare the digits in the tenths place. Since 9.9 has a "9" in the tenths place and 9.11 has a "1", 9.9 is the larger number.<|end|>'

In [12]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Ann is cutting fabric to make curtains. She cuts a 4 foot by 6 foot rectangle for the living room, and a 2 foot by 4 foot rectangle for the bedroom. If the bolt of fabric is 16 feet by 12 feet, how much fabric is left in square feet?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.81s/it, est. speed input: 12.24 toks/s, output: 31.55 toks/s]


' The area of the living room curtains is 4 feet x 6 feet = 24 square feet.\nThe area of the bedroom curtains is 2 feet x 4 feet = 8 square feet.\nThe total area of the curtains is 24 square feet + 8 square feet = 32 square feet.\nThe area of the bolt of fabric is 16 feet x 10 feet = 160 square feet.\nTo find the amount of fabric left, we subtract the area of the curtains from the area of the bolt of fabric: 160 square feet - 32 square feet = 128 square feet.\nTherefore, there are 128 square feet of fabric left.\n#### 128\nThe answer is: 128<|end|>'

Now the fine-tuned model, using the saved LoRA adapter:

In [13]:
model.save_lora("grpo_saved_lora")

In [14]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Ann is cutting fabric to make curtains. She cuts a 4 foot by 6 foot rectangle for the living room, and a 2 foot by 4 foot rectangle for the bedroom. If the bolt of fabric is 16 feet by 12 feet, how much fabric is left in square feet?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.95s/it, est. speed input: 11.51 toks/s, output: 31.29 toks/s]


' <reasoning>\nFirst, we need to calculate the total area of the fabric that Ann used for the curtains. \n\nFor the living room curtain, the area is 4 feet (length) times 6 feet (width), which equals 24 square feet.\n\nFor the bedroom curtain, the area is 2 feet (length) times 4 feet (width), which equals 8 square feet.\n\nSo, the total area of the fabric used by Ann is 24 square feet (living room) plus 8 square feet (bedroom), which equals 32 square feet.\n\nNext, we need to calculate the total area of the bolt of fabric. The area is 16 feet (length) times 12 feet (width), which equals 192 square feet.\n\nFinally, to find out how much fabric is left, we subtract the total area of the fabric used from the total area of the bolt of fabric. So, 192 square feet (total fabric) minus 32 square feet (used fabric) equals 160 square feet.\n\nTherefore, Ann has 160 square feet of fabric left.\n</reasoning>\n<answer>\n160 square feet\n</answer><|end|>'

In [15]:
print(output)

 <reasoning>
First, we need to calculate the total area of the fabric that Ann used for the curtains. 

For the living room curtain, the area is 4 feet (length) times 6 feet (width), which equals 24 square feet.

For the bedroom curtain, the area is 2 feet (length) times 4 feet (width), which equals 8 square feet.

So, the total area of the fabric used by Ann is 24 square feet (living room) plus 8 square feet (bedroom), which equals 32 square feet.

Next, we need to calculate the total area of the bolt of fabric. The area is 16 feet (length) times 12 feet (width), which equals 192 square feet.

Finally, to find out how much fabric is left, we subtract the total area of the fabric used from the total area of the bolt of fabric. So, 192 square feet (total fabric) minus 32 square feet (used fabric) equals 160 square feet.

Therefore, Ann has 160 square feet of fabric left.
</reasoning>
<answer>
160 square feet
</answer><|end|>
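
It is worth checking how the training reward functions would score an answer phrased like the one above. In this sketch `extract_xml_answer` is copied from the data-prep cell, the reasoning text is abbreviated, and the expected value is computed from the question rather than read from the dataset. `leading_number` is a hypothetical helper, not part of the notebook, showing one possible normalization.

```python
import re

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Same shape as the output above, with the reasoning abbreviated.
completion = ("<reasoning>\n(reasoning omitted)\n</reasoning>\n"
              "<answer>\n160 square feet\n</answer><|end|>")

extracted = extract_xml_answer(completion)
expected = str(16 * 12 - 4 * 6 - 2 * 4)   # bolt area minus both rectangles

# The same checks int_reward_func and correctness_reward_func apply.
int_reward = 0.5 if extracted.isdigit() else 0.0
correctness = 2.0 if extracted == expected else 0.0
print(repr(extracted), int_reward, correctness)

def leading_number(text: str):
    """Hypothetical helper: pull the first number out of an answer string."""
    m = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return m.group(0) if m else None

print(leading_number(extracted) == expected)
```

The answer is numerically right, but because it includes units it would earn neither the integer reward nor the correctness reward under exact string matching.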


In [16]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Which is bigger? 9.11 or 9.9?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.56s/it, est. speed input: 10.99 toks/s, output: 30.78 toks/s]


' <reasoning>\nTo determine which number is bigger between 9.11 and 9.9, we compare the numbers digit by digit from left to right. The first digit for both numbers is 9, so we move to the next digit after the decimal point. For 9.11, the second digit is 1, and for 9.9, the second digit is also 9. Since 9 is greater than 1, we can conclude that 9.9 is larger than 9.11.\n</reasoning>\n<answer>\n9.9 is bigger than 9.11.\n</answer><|end|>'

In [17]:
print(output)

 <reasoning>
To determine which number is bigger between 9.11 and 9.9, we compare the numbers digit by digit from left to right. The first digit for both numbers is 9, so we move to the next digit after the decimal point. For 9.11, the second digit is 1, and for 9.9, the second digit is also 9. Since 9 is greater than 1, we can conclude that 9.9 is larger than 9.11.
</reasoning>
<answer>
9.9 is bigger than 9.11.
</answer><|end|>
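
The two format rewards stayed at 0.0 in the logged training steps, and the regex behaviour is one plausible contributor. The sketch below uses the patterns from the data-prep cell on made-up strings to show the effect of `re.DOTALL` and of a leading space like the one in the outputs above. Whether training completions also begin with a space is an assumption.

```python
import re

strict = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
soft = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Made-up completions for illustration.
single_line = "<reasoning>one line</reasoning>\n<answer>4</answer>"
multi_line = ("<reasoning>\nline one\nline two\n</reasoning>\n"
              "<answer>\n4\n</answer>\n")
leading_space = " " + multi_line

def matched(pattern, text, flags=0):
    return re.match(pattern, text, flags) is not None

print("soft, single line:        ", matched(soft, single_line))
print("soft, multi line:         ", matched(soft, multi_line))
print("soft, multi line + DOTALL:", matched(soft, multi_line, re.DOTALL))
print("strict, multi line:       ", matched(strict, multi_line))
print("strict, multi + DOTALL:   ", matched(strict, multi_line, re.DOTALL))
print("soft, leading space:      ", matched(soft, leading_space, re.DOTALL))
```

Without `re.DOTALL`, `.` cannot cross a newline, so any reasoning that spans several lines fails both patterns. Separately, `re.match` only matches at the start of the string, so a single leading space is enough to defeat the soft pattern even with `re.DOTALL`.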


# Save model

Export to GGUF format in fp16:

In [21]:
if True: model.save_pretrained_gguf("gguf/model", tokenizer, quantization_method = "f16")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.69 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 50%|█████     | 16/32 [00:01<00:01, 13.91it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:25<00:00,  2.66s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving gguf/model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving gguf/model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at gguf/model into f16 GGUF format.
The output location will be /content/gguf/model/unsloth.F16.gguf
This might take 3 minutes...


Unsloth: Extending gguf/model/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = 