This notebook converts our fine-tuned Phi-3 model to GGUF so it can run locally with Ollama.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

In [None]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install datasets

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="zeref713/gsm8k_lora_model_2_LORA_only",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/194 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/458 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/120M [00:00<?, ?B/s]

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
from unsloth.chat_templates import get_chat_template
max_seq_length = 2048

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
    # Each entry of the "dialog" column is one conversation (a list of messages).
    convos = examples["dialog"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("zeref713/gsm8k_phi3Form", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)




from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 18,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 60,
        num_train_epochs=1,  # Train for 1 epoch
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Downloading readme:   0%|          | 0.00/514 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.48M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/813k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Unsloth: Already have LoRA adapters! We shall skip this step.
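
The trainer above is configured with `per_device_train_batch_size = 18`, `gradient_accumulation_steps = 4` and one epoch, and the log reports 7473 training examples. As a rough worked example of what those settings imply, here is the arithmetic for the effective batch size and the number of optimizer steps per epoch. It assumes a single GPU and that the last partial batch is kept; the exact step count reported by the trainer may differ slightly between library versions.

```python
import math

num_examples = 7473            # size of the train split, from the log above
per_device_batch_size = 18     # per_device_train_batch_size
grad_accum_steps = 4           # gradient_accumulation_steps
num_devices = 1                # assumption: one GPU (the log shows a single Tesla T4)

# Examples that contribute to one optimizer update.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices

# Forward/backward passes per epoch, keeping the final partial batch.
micro_batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))

# Optimizer updates per epoch.
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)

print(effective_batch_size)        # 72
print(micro_batches_per_epoch)     # 416
print(optimizer_steps_per_epoch)   # 104
```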


Map (num_proc=2):   0%|          | 0/7473 [00:00<?, ? examples/s]

In [None]:
# Example inference
FastLanguageModel.for_inference(model)
messages = [
    {"from": "human", "value": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=512, use_cache=True)
print(tokenizer.batch_decode(outputs))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<s><|user|> Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?<|end|><|assistant|> Janet eats 3 + 4 = <<3+4=7>>7 eggs per day.\nShe has 16 - 7 = <<16-7=9>>9 eggs left to sell.\nShe makes 9 * $2 = $<<9*2=18>>18 per day at the farmers' market.\n#### 18<|end|>"]
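
The messages in the inference cell above use ShareGPT-style keys (`"from"` / `"value"`, with speakers `"human"` / `"gpt"`) because of the `mapping` passed to `get_chat_template`. A minimal pure-Python sketch of the renaming that mapping describes is shown below. The helper `sharegpt_to_role_content` is hypothetical and only illustrates the idea; it is not how Unsloth implements the mapping internally.

```python
# Same mapping dict as in the get_chat_template call.
MAPPING = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"}

def sharegpt_to_role_content(messages, mapping=MAPPING):
    """Rename ShareGPT-style keys and speaker names to role/content form."""
    role_key, content_key = mapping["role"], mapping["content"]
    # Invert the speaker entries: "human" -> "user", "gpt" -> "assistant".
    speakers = {v: k for k, v in mapping.items() if k not in ("role", "content")}
    return [
        {"role": speakers.get(m[role_key], m[role_key]), "content": m[content_key]}
        for m in messages
    ]

converted = sharegpt_to_role_content([
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "#### 4"},
])
print(converted)
```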


In [None]:
FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "Jan has three times the number of pets as Marcia. Marcia has two more pets than Cindy. If Cindy has four pets, how many total pets do the three have?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

# TextStreamer is already imported at the top of the notebook.
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512, use_cache = True)

<s><|user|> Jan has three times the number of pets as Marcia. Marcia has two more pets than Cindy. If Cindy has four pets, how many total pets do the three have?<|end|><|assistant|> Marcia has 4+2=<<4+2=6>>6 pets.
Jan has 6*3=<<6*3=18>>18 pets.
Together, they have 4+6+18=<<4+6+18=28>>28 pets.
#### 28<|end|>
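
The completions above end with a `#### <number>` line, the GSM8K answer convention. To compare model outputs against reference answers, the final number can be pulled out with a small regular expression. This is a sketch; the helper name `extract_final_answer` is an assumption for illustration, and it only handles plain numeric answers.

```python
import re

# Matches the number that follows a '####' marker, e.g. '#### 28' or '#### 1,000'.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")

def extract_final_answer(completion):
    """Return the last '####' answer as a string without commas, or None."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")

sample = "Together, they have 4+6+18=<<4+6+18=28>>28 pets.\n#### 28<|end|>"
print(extract_final_answer(sample))  # 28
```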


In [None]:
def generate_response(value):
    """Stream the model's answer to a single user question."""
    FastLanguageModel.for_inference(model)
    messages = [{"from": "human", "value": value}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Must add for generation
        return_tensors="pt",
    ).to("cuda")

    text_streamer = TextStreamer(tokenizer)
    _ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=512, use_cache=True)
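
The warning printed by the earlier inference cell asks for an explicit `attention_mask`. One way to provide it, sketched below, is to ask `apply_chat_template` for a dict (`return_dict=True`) and pass both tensors to `generate`. The function name `generate_with_mask` is hypothetical, and this assumes the installed `transformers` version supports `return_dict` in `apply_chat_template`.

```python
def generate_with_mask(model, tokenizer, question, max_new_tokens=512, device="cuda"):
    """Generate a reply while passing the attention mask explicitly."""
    messages = [{"from": "human", "value": question}]
    encoded = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Must add for generation
        return_tensors="pt",
        return_dict=True,            # return input_ids and attention_mask together
    ).to(device)
    return model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=max_new_tokens,
        use_cache=True,
    )

# Usage in this notebook (needs the loaded model and tokenizer):
# outputs = generate_with_mask(model, tokenizer, "What is 12 * 7?")
# print(tokenizer.batch_decode(outputs))
```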

In [None]:
# Example usage
generate_response("Emily has 4 kids named Amy, Jackson, Corey, and James. Amy is 5 years older than Jackson and 2 years younger than Corey. If James is 10 and is 1 year younger than Corey, how old is Jackson?")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<s><|user|> Emily has 4 kids named Amy, Jackson, Corey, and James. Amy is 5 years older than Jackson and 2 years younger than Corey. If James is 10 and is 1 year younger than Corey, how old is Jackson?<|end|><|assistant|> If James is 10 and is 1 year younger than Corey, then Corey is 10+1=<<10+1=11>>11 years old.
If Amy is 2 years younger than Corey, then Amy is 11-2=<<11-2=9>>9 years old.
If Amy is 5 years older than Jackson, then Jackson is 9-5=<<9-5=4>>4 years old.
#### 4<|end|>
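
The `<<expression=result>>` spans in these completions are GSM8K-style calculator annotations. A quick way to sanity-check a completion is to re-evaluate each annotation and compare it with the stated result. The sketch below does that with a small, restricted arithmetic evaluator; `check_calc_annotations` is a hypothetical helper, and it only supports `+`, `-`, `*`, `/` on numeric literals.

```python
import ast
import operator
import re

# Captures the expression and the claimed result inside <<...=...>>.
CALC_RE = re.compile(r"<<([^<>=]+)=([^<>]+)>>")
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")

def check_calc_annotations(text):
    """Return (expression, claimed result, matches?) for each annotation."""
    results = []
    for expr, claimed in CALC_RE.findall(text):
        value = _eval(ast.parse(expr.strip(), mode="eval").body)
        ok = abs(value - float(claimed)) < 1e-9
        results.append((expr.strip(), claimed.strip(), ok))
    return results

sample = ("Corey is 10+1=<<10+1=11>>11 years old.\n"
          "Amy is 11-2=<<11-2=9>>9 years old.\n"
          "Jackson is 9-5=<<9-5=4>>4 years old.")
for expr, claimed, ok in check_calc_annotations(sample):
    print(expr, "=", claimed, "ok" if ok else "MISMATCH")
```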


In [None]:
generate_response("Which one is greater: 7.9 or 7.11? How?")

<s><|user|> Which one is greater: 7.9 or 7.11? How?<|end|><|assistant|> 7.9 is greater than 7.11 because 9 is greater than 11.<|end|>
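
The model picks the right number here, but its justification is wrong: 9 is not greater than 11. The comparison works because the fractional parts must be padded to the same length first, so 7.9 is 7.90 and 90 hundredths is more than 11 hundredths. A short check:

```python
from decimal import Decimal

a, b = Decimal("7.9"), Decimal("7.11")
print(a > b)  # True

# Pad the fractional digits to equal width before comparing them as integers.
frac_a, frac_b = "7.9".split(".")[1], "7.11".split(".")[1]
width = max(len(frac_a), len(frac_b))
padded_a, padded_b = frac_a.ljust(width, "0"), frac_b.ljust(width, "0")
print(padded_a, padded_b)            # 90 11
print(int(padded_a) > int(padded_b)) # True
```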


In [None]:
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("zeref713/phi3_gsm8k_q4_k_m", tokenizer, quantization_method = "q4_k_m", token = HF_TOKEN)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.79 out of 12.67 RAM for saving.


100%|██████████| 32/32 [00:02<00:00, 13.66it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving zeref713/phi3_gsm8k_q4_k_m/pytorch_model-00001-of-00002.bin...
Unsloth: Saving zeref713/phi3_gsm8k_q4_k_m/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...


Unsloth: Extending zeref713/phi3_gsm8k_q4_k_m/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


Unsloth: [1] Converting model at zeref713/phi3_gsm8k_q4_k_m into f16 GGUF format.
The output location will be ./zeref713/phi3_gsm8k_q4_k_m/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: phi3_gsm8k_q4_k_m
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16

unsloth.F16.gguf:   0%|          | 0.00/7.64G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/zeref713/phi3_gsm8k_q4_k_m
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/zeref713/phi3_gsm8k_q4_k_m
Saved Ollama Modelfile to https://huggingface.co/zeref713/phi3_gsm8k_q4_k_m
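
The log above says Unsloth also uploaded an Ollama Modelfile. For running the GGUF locally, a hand-written Modelfile could look roughly like the one built below. This is a sketch under assumptions: the file name `unsloth.Q4_K_M.gguf` comes from the log, the local model name `phi3-gsm8k` is hypothetical, and the template is modeled on the `<|user|>`, `<|end|>` and `<|assistant|>` tokens visible in the decoded outputs, so the Modelfile Unsloth generated may differ.

```python
def build_modelfile(gguf_path):
    """Return the text of a minimal Ollama Modelfile for a Phi-3 style GGUF."""
    # Assumed prompt layout; check it against the Modelfile in the Hub repo.
    template = (
        "{{ if .Prompt }}<|user|>\n{{ .Prompt }}<|end|>\n{{ end }}"
        "<|assistant|>\n{{ .Response }}<|end|>\n"
    )
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        'PARAMETER stop "<|end|>"\n'
        'PARAMETER stop "<|user|>"\n'
        'PARAMETER stop "<|assistant|>"\n'
    )

modelfile_text = build_modelfile("./unsloth.Q4_K_M.gguf")
print(modelfile_text)

# Save the text as a file named "Modelfile" next to the GGUF, then in a shell
# on the machine where Ollama is installed:
#   ollama create phi3-gsm8k -f Modelfile
#   ollama run phi3-gsm8k "How many eggs are left if I use 7 of 16?"
```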


In [None]:
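# Save to 8bit Q8_0 (no quantization_method is passed, so Unsloth's default applies, which yields q8_0 in the log below)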
if False: model.save_pretrained_gguf("model", tokenizer,)
if True: model.push_to_hub_gguf("zeref713/phi3_gsm8k_Q8_0", tokenizer, token = HF_TOKEN)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 7.78 out of 12.67 RAM for saving.


100%|██████████| 32/32 [00:01<00:00, 22.81it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving zeref713/phi3_gsm8k_Q8_0/pytorch_model-00001-of-00002.bin...
Unsloth: Saving zeref713/phi3_gsm8k_Q8_0/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Extending zeref713/phi3_gsm8k_Q8_0/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at zeref713/phi3_gsm8k_Q8_0 into q8_0 GGUF format.
The output location will be ./zeref713/phi3_gsm8k_Q8_0/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: phi3_gsm8k_Q8_0
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> Q8_0, shape = {3072, 32064}
INFO:hf-to-gg

unsloth.Q8_0.gguf:   0%|          | 0.00/4.06G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/zeref713/phi3_gsm8k_Q8_0
Saved Ollama Modelfile to https://huggingface.co/zeref713/phi3_gsm8k_Q8_0
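
The upload progress bars report 7.64G for the F16 file, 4.06G for Q8_0 and 2.32G for Q4_K_M. As a back-of-the-envelope comparison, the ratio of each file to the F16 file gives an approximate number of bits per weight. This assumes the F16 file stores essentially every weight in 16 bits and ignores metadata, so the figures are estimates only.

```python
# File sizes as reported by the progress bars (same unit for all three).
sizes = {"F16": 7.64, "Q8_0": 4.06, "Q4_K_M": 2.32}

# Assumption: F16 is 16 bits per weight, so size ratios scale bits per weight.
bits_per_weight = {name: 16 * size / sizes["F16"] for name, size in sizes.items()}

for name, bpw in bits_per_weight.items():
    shrink = sizes["F16"] / sizes[name]
    print(f"{name}: ~{bpw:.2f} bits/weight, {shrink:.2f}x smaller than F16")
```

With these numbers Q8_0 comes out at roughly 8.5 bits per weight and Q4_K_M at roughly 4.9, which is why the 4-bit file is a little under a third of the F16 size.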


In [None]:
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("zeref713/phi3_gsm8k_16bit", tokenizer, quantization_method = "f16", token = HF_TOKEN)

In [None]:
# Push all three formats to one repo: q4_k_m, f16, and the default (q8_0 per the log below).
if True: model.push_to_hub_gguf("zeref713/phi3_gsm8k_GGUF", tokenizer, quantization_method = "q4_k_m", token = HF_TOKEN)
if True: model.push_to_hub_gguf("zeref713/phi3_gsm8k_GGUF", tokenizer, quantization_method = "f16", token = HF_TOKEN)
if True: model.push_to_hub_gguf("zeref713/phi3_gsm8k_GGUF", tokenizer, token = HF_TOKEN)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.81 out of 12.67 RAM for saving.


100%|██████████| 32/32 [00:02<00:00, 14.37it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving zeref713/phi3_gsm8k_GGUF/pytorch_model-00001-of-00002.bin...
Unsloth: Saving zeref713/phi3_gsm8k_GGUF/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...


Unsloth: Extending zeref713/phi3_gsm8k_GGUF/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


Unsloth: [1] Converting model at zeref713/phi3_gsm8k_GGUF into f16 GGUF format.
The output location will be ./zeref713/phi3_gsm8k_GGUF/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: phi3_gsm8k_GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {3072, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F

unsloth.F16.gguf:   0%|          | 0.00/7.64G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/zeref713/phi3_gsm8k_GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/zeref713/phi3_gsm8k_GGUF
Saved Ollama Modelfile to https://huggingface.co/zeref713/phi3_gsm8k_GGUF
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 7.02 out of 12.67 RAM for saving.


100%|██████████| 32/32 [00:01<00:00, 27.71it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving zeref713/phi3_gsm8k_GGUF/pytorch_model-00001-of-00002.bin...
Unsloth: Saving zeref713/phi3_gsm8k_GGUF/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Extending zeref713/phi3_gsm8k_GGUF/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at zeref713/phi3_gsm8k_GGUF into f16 GGUF format.
The output location will be ./zeref713/phi3_gsm8k_GGUF/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: phi3_gsm8k_GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {3072, 32064}
INFO:hf-to-gguf:b

100%|██████████| 32/32 [00:01<00:00, 26.90it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving zeref713/phi3_gsm8k_GGUF/pytorch_model-00001-of-00002.bin...
Unsloth: Saving zeref713/phi3_gsm8k_GGUF/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Extending zeref713/phi3_gsm8k_GGUF/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at zeref713/phi3_gsm8k_GGUF into q8_0 GGUF format.
The output location will be ./zeref713/phi3_gsm8k_GGUF/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: phi3_gsm8k_GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> Q8_0, shape = {3072, 32064}
INFO:hf-to-gg

unsloth.Q8_0.gguf:   0%|          | 0.00/4.06G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/zeref713/phi3_gsm8k_GGUF
Saved Ollama Modelfile to https://huggingface.co/zeref713/phi3_gsm8k_GGUF
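
The last cell calls `push_to_hub_gguf` three times, and the log shows the 4-bit and LoRA weights being merged and saved again for each call. The conversion banner prints the method as a list (`['q4_k_m']`), and Unsloth's example notebooks pass a list to `quantization_method`, so a single call may be able to cover all three formats. The wrapper below is a hypothetical sketch that assumes the installed Unsloth version accepts a list; if it does not, fall back to one call per method.

```python
def push_all_quants(model, tokenizer, repo_id, token,
                    methods=("q4_k_m", "q8_0", "f16")):
    """Push several GGUF quantizations with one push_to_hub_gguf call.

    Assumption: quantization_method accepts a list in the installed
    Unsloth version.
    """
    model.push_to_hub_gguf(
        repo_id,
        tokenizer,
        quantization_method=list(methods),
        token=token,
    )

# Usage in this notebook (not executed here):
# push_all_quants(model, tokenizer, "zeref713/phi3_gsm8k_GGUF", HF_TOKEN)
```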
