# Semi-manual export of a model for llama.cpp

## 1. Load Unsloth Model

In [1]:
from unsloth import FastLanguageModel

base_model_name = "llama-3.2-1b-instruct-lora_model-1epoch"
max_seq_length = 2048   # maximum sequence length used when loading the model
dtype = None            # None lets Unsloth auto-detect the dtype
load_in_4bit = True     # load the weights 4-bit quantized to reduce memory use

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "Taiwar/" + base_model_name, # or choose "unsloth/Llama-3.2-1B-Instruct"
    model_name = "../models/" + base_model_name, # Local model
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.9: Fast Llama patching. Transformers = 4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 2060 SUPER. Max memory: 8.0 GB. Platform = Windows.
O^O/ \_/ \    Pytorch: 2.4.0. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.11.9 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


## 2. Save model in merged 16bit format

In [None]:
model.save_pretrained_merged(
    "models/llama-3.2-1b-instruct-lora-1poch_merged16b",
    tokenizer,
    save_method="merged_16bit",
)

In [3]:
# Optionally upload the 16-bit version to the Hugging Face Hub for use with vLLM
with open(".hftoken") as f:
    hf_token = f.read().strip()
model.push_to_hub_merged(
    "Taiwar/llama-3.2-1b-instruct-lora-1poch_merged16b",
    tokenizer,
    save_method="merged_16bit",
    token=hf_token,
)

Unsloth: You are pushing to hub, but you passed your HF username = Taiwar.
We shall truncate Taiwar/llama-3.2-1b-instruct-lora-1poch_merged16b to llama-3.2-1b-instruct-lora-1poch_merged16b


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 7.3 out of 31.95 RAM for saving.


100%|██████████| 16/16 [00:00<00:00, 24.06it/s]


Unsloth: Saving tokenizer...

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

 Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


README.md:   0%|          | 0.00/617 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Taiwar/llama-3.2-1b-instruct-lora-1poch_merged16b


## 3. Run llama.cpp
Convert the merged 16-bit model to GGUF with llama.cpp. See https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf
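
The conversion step itself is not shown in this notebook. A minimal sketch of how it could be scripted follows. It assumes a llama.cpp checkout in a folder named `llama.cpp` (a hypothetical location) that contains `convert_hf_to_gguf.py`; older checkouts name the script `convert-hf-to-gguf.py`. The helper `build_convert_command` is illustrative and not part of Unsloth or llama.cpp. The input and output paths are the ones used elsewhere in this notebook.

```python
import sys
from pathlib import Path


def build_convert_command(llama_cpp_dir, merged_model_dir, gguf_path, outtype="f16"):
    """Build the command line for llama.cpp's HF-to-GGUF conversion script.

    Assumption: the script is called convert_hf_to_gguf.py and lives in the
    root of the llama.cpp checkout. Adjust the name for older checkouts.
    """
    script = Path(llama_cpp_dir) / "convert_hf_to_gguf.py"
    return [
        sys.executable,          # run the script with the current interpreter
        str(script),
        str(merged_model_dir),   # folder written by save_pretrained_merged
        "--outfile", str(gguf_path),
        "--outtype", outtype,    # keep 16-bit weights, no further quantization
    ]


cmd = build_convert_command(
    llama_cpp_dir="llama.cpp",  # hypothetical checkout location
    merged_model_dir="models/llama-3.2-1b-instruct-lora-1poch_merged16b",
    gguf_path="../models/llama-3.2-1b-instruct-lora_merged-1epoch-16b-gguf/"
              "llama-3.2-1b-instruct-lora_merged-1epoch-16b.gguf",
)
print(" ".join(cmd))

# To actually run the conversion (requires the llama.cpp checkout and its
# Python requirements to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

The command is only printed here so it can be reviewed first; uncommenting the last two lines runs it.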

## 4. Push GGUF model to HF

In [None]:
from huggingface_hub import HfApi
with open(".hftoken") as f:
    hf_token = f.read().strip()
api = HfApi(token=hf_token)

model_id = "Taiwar/llama-3.2-1b-instruct-lora_model-1epoch"
api.upload_file(
    path_or_fileobj="../models/llama-3.2-1b-instruct-lora_merged-1epoch-16b-gguf/llama-3.2-1b-instruct-lora_merged-1epoch-16b.gguf",
    path_in_repo="llama-3.2-1b-instruct-lora_merged-1epoch-16b.gguf",
    repo_id=model_id,
)
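
Before uploading, it can be worth checking that the conversion really produced a GGUF file. The sketch below relies on GGUF files beginning with the four ASCII bytes `GGUF`; the helper name `looks_like_gguf` is hypothetical and not part of `huggingface_hub` or llama.cpp.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if `path` is an existing file that starts with the GGUF magic."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


gguf_path = (
    "../models/llama-3.2-1b-instruct-lora_merged-1epoch-16b-gguf/"
    "llama-3.2-1b-instruct-lora_merged-1epoch-16b.gguf"
)
if not looks_like_gguf(gguf_path):
    print(f"Warning: {gguf_path} is missing or is not a GGUF file")
```

This only inspects the file header, so it catches a wrong path or a failed conversion but does not validate the tensors inside.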