### **Fine-Tuning Meta-Llama-3.2-3B Using Unsloth for CPU and GPU Inference - GGUF**

On September 25, 2024, Meta introduced Llama 3.2, a collection of multilingual large language models (LLMs) in 1B and 3B sizes. These models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. Notably, the Llama 3.2 1B and 3B models support a context length of 128K tokens, making them suitable for extensive text processing tasks.

To access the Llama 3.2 3B Instruct model used in this notebook, you can download it from [Hugging Face](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct). The approval process is typically swift, often taking about 20 minutes.

### Table of Contents
1. Install dependencies
2. Download the model
3. Fine-tuning flow
4. Convert to GGUF format


## Step 1: Install All the Required Packages

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

## Step 2: Import Necessary Libraries and Load the Model and Tokenizer

In [2]:
# Import necessary libraries
from unsloth import FastLanguageModel
import torch

# Configuration settings
max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",  # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...",  # Use token if using gated models like meta-llama/Llama-2-7b-hf
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.6: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
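The `dtype = None` setting in the cell above leaves the precision choice to the library. As a rough sketch of the rule stated in that comment (float16 for Tesla T4 and V100, bfloat16 for Ampere and newer), the choice can be modelled as a check on the GPU's CUDA compute capability. The helper `pick_dtype` and the 8.0 threshold for Ampere are illustrative assumptions, not Unsloth's actual implementation.

```python
def pick_dtype(compute_capability):
    """Return a dtype name for a (major, minor) CUDA compute capability.

    Hypothetical helper: assumes bfloat16 is usable from compute
    capability 8.0 (Ampere) upward and float16 is used otherwise.
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# The banner above reports "CUDA: 7.5" and "Bfloat16 = FALSE" for the Tesla T4.
t4_dtype = pick_dtype((7, 5))
# An Ampere-class card (compute capability 8.0) would take the other branch.
ampere_dtype = pick_dtype((8, 0))
print(t4_dtype, ampere_dtype)
```

This is consistent with the `Bfloat16 = FALSE` line printed for the T4 in the loading banner.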

### We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
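To see where the number of trainable LoRA parameters comes from, here is a small worked calculation. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices holding `r * (in_features + out_features)` weights in total. The layer shapes below are assumptions about the Llama-3.2-3B architecture (they are not printed anywhere in this notebook); with them, the result agrees with the `Number of trainable parameters` line in the training log further down.

```python
# Assumed Llama-3.2-3B shapes (not stated in this notebook)
HIDDEN = 3072         # model width
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 8192   # MLP width
NUM_LAYERS = 28       # matches "patched 28 layers" above
RANK = 16             # r passed to get_peft_model

# (in_features, out_features) of every adapted projection in one layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,}")  # 24,313,856
```

Doubling `r` doubles this count, which is the main trade-off behind the suggested values 8, 16, 32, 64 and 128.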


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, but we convert it to Hugging Face's normal multi-turn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
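The rendering above can be reproduced with plain string formatting, which makes the role headers and end-of-turn tokens easier to see. This is a simplified sketch: `render_llama3` is a hypothetical helper, and the real `llama-3.1` chat template applied later in this notebook also inserts a system block with date information.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages in the basic Llama-3 chat layout."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example_messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(example_messages)
print(rendered)
```

Setting `add_generation_prompt=True` corresponds to what the inference cells do: the prompt ends with an open assistant header instead of a finished turn.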

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

### **Load the dataset**

In [4]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

# Attach the "llama-3.1" chat template to the tokenizer loaded in Step 2
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Function to format the prompts based on conversation examples
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

# Load the dataset and slice to get the first 500 records
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = dataset.select(range(500))  # Select only the first 500 records



README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [5]:
# Import the standardize_sharegpt function from the unsloth.chat_templates module
from unsloth.chat_templates import standardize_sharegpt

# Apply the standardize_sharegpt function to the dataset to standardize its format
dataset = standardize_sharegpt(dataset)

# Map the formatting_prompts_func to the dataset in batches to format the prompts
dataset = dataset.map(formatting_prompts_func, batched=True)


Standardizing format:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
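To make the format conversion concrete, the sketch below maps a ShareGPT-style conversation with `("from", "value")` keys to the `("role", "content")` layout that the chat template expects. It is only an illustration of the idea: `to_role_content` and `ROLE_MAP` are hypothetical names, the sample conversation is made up, and `standardize_sharegpt` may handle more cases than this.

```python
# Assumed mapping from ShareGPT speaker names to chat-template roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(conversation):
    """Convert one ShareGPT-style conversation to role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


# Made-up sample in ShareGPT style
sharegpt_conversation = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 equals 4."},
]
converted = to_role_content(sharegpt_conversation)
print(converted)
```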

## We look at how the conversations are structured for item 5

In [6]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` to the prompt, so do not be alarmed!

In [7]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [8]:
# Import necessary modules
from trl import SFTTrainer  # Supervised fine-tuning trainer from Hugging Face TRL
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported  # Helper that reports whether the GPU supports bfloat16

# Initialize the trainer object
trainer = SFTTrainer(
    model=model,  # Pass the model object
    tokenizer=tokenizer,  # Pass the tokenizer object
    train_dataset=dataset,  # Pass the training dataset
    dataset_text_field="text",  # Specify the field in dataset containing text data
    max_seq_length=max_seq_length,  # Maximum sequence length
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),  # Data collator for sequence-to-sequence tasks
    dataset_num_proc=2,  # Number of processes to use for dataset preprocessing
    packing=False,  # Disable packing (which can accelerate training for short sequences)
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Batch size per GPU/TPU core/CPU for training
        gradient_accumulation_steps=4,  # Number of batches whose gradients are accumulated before each optimizer update
        warmup_steps=5,  # Number of steps for the learning rate scheduler to warm up
        # num_train_epochs=1,  # Uncomment and set this for 1 full training run (alternative to max_steps)
        max_steps=60,  # Total number of training steps to perform
        learning_rate=2e-4,  # Learning rate for the optimizer
        fp16=not is_bfloat16_supported(),  # Use FP16 (half-precision) training when bfloat16 is not supported
        bf16=is_bfloat16_supported(),  # Use BFloat16 training when the GPU supports it
        logging_steps=1,  # Log every update step
        optim="adamw_8bit",  # Optimizer to use
        weight_decay=0.01,  # Weight decay to apply (if any)
        lr_scheduler_type="linear",  # Type of learning rate scheduler
        seed=3407,  # Random seed for reproducibility
        output_dir="outputs",  # Directory to save outputs like checkpoints
        report_to="none",  # Disable reporting to external services like WandB
    ),
)


Map (num_proc=2):   0%|          | 0/500 [00:00<?, ? examples/s]
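The combination of `warmup_steps=5`, `max_steps=60`, `learning_rate=2e-4` and `lr_scheduler_type="linear"` describes a learning rate that ramps up linearly for 5 steps and then decays linearly to zero at step 60. A minimal sketch of that curve, assuming the usual linear-warmup, linear-decay formula (the function `linear_lr` is a hypothetical helper, not part of TRL or Transformers):

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` completed optimizer updates."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 towards base_lr.
        return base_lr * step / warmup_steps
    # Decay: shrink linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, linear_lr(step))
```

With only 60 steps, the warmup covers a noticeable share of the run, so the peak learning rate is reached at step 5 and the model spends the remaining 55 steps on the decaying part of the curve.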

We also use Unsloth's `train_on_responses_only` function to only train on the assistant outputs and ignore the loss on the user's inputs.

In [9]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [10]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

In [11]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                \n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
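The masking shown above can be sketched on plain token-ID lists. Every label is set to `-100` (the value the decode cell above treats as masked) except for the tokens that follow a response marker, up to the next instruction marker. The function `mask_non_responses` and the toy token IDs are hypothetical, and this is not Unsloth's implementation of `train_on_responses_only`.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def find_sub(seq, sub, start):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens inside assistant responses."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)  # first token after the marker
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)  # last response runs to the end
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Toy IDs: [1, 2] stands for the user header, [3, 4] for the assistant header.
toy_ids = [0, 1, 2, 10, 11, 9, 3, 4, 20, 21, 9, 1, 2, 12, 9, 3, 4, 22, 9]
toy_labels = mask_non_responses(toy_ids, [1, 2], [3, 4])
print(toy_labels)
```

Only the two assistant spans keep their token IDs as labels, which mirrors the decoded output above where everything before the assistant's answer appears as blank space.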

In [12]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.543 GB of memory reserved.


In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,0.7786
2,0.8059
3,0.7266
4,0.9628
5,0.8903
6,0.5579
7,0.9552
8,0.806
9,0.7498
10,0.5429
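The banner above reports a total batch size of 8 and 60 total steps. A short calculation, using only values that appear in this notebook, shows how much of the 500-example subset such a run covers:

```python
import math

per_device_batch_size = 2    # per_device_train_batch_size
gradient_accumulation = 4    # gradient_accumulation_steps
num_gpus = 1                 # "Num GPUs = 1" in the banner
max_steps = 60
num_examples = 500

# Examples consumed per optimizer update
effective_batch = per_device_batch_size * gradient_accumulation * num_gpus
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Share of the dataset that is visited
epoch_fraction = examples_seen / num_examples
# Updates needed to visit every example once
steps_for_full_pass = math.ceil(num_examples / effective_batch)

print(effective_batch, examples_seen, epoch_fraction, steps_for_full_pass)
```

So the 60-step run stops slightly short of one full pass over the 500 examples, which is why switching to `num_train_epochs=1` changes the run length only a little for this subset.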


In [14]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

504.5215 seconds used for training.
8.41 minutes used for training.
Peak reserved memory = 3.855 GB.
Peak reserved memory for training = 1.312 GB.
Peak reserved memory % of max memory = 26.139 %.
Peak reserved memory for training % of max memory = 8.896 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
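To give a feel for how these two settings interact, here is a minimal pure-Python sketch of min-p filtering: logits are divided by the temperature, converted to probabilities, and every token whose probability is below `min_p` times the top probability is dropped before renormalising. The function `min_p_filter`, the toy logits, and the order of the two steps (temperature first) are assumptions for illustration, not the exact code path inside `model.generate`.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return sampling probabilities after temperature and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep tokens at least min_p times as likely as the best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    kept_total = sum(kept)
    return [p / kept_total for p in kept]


toy_logits = [2.0, 1.0, 0.0, -3.0]
filtered = min_p_filter(toy_logits)
print(filtered)
```

The high temperature flattens the distribution and encourages variety, while the min-p cut-off removes the unlikely tail that a high temperature would otherwise make reachable.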

In [17]:
# Import the function to get a pre-defined chat template
from unsloth.chat_templates import get_chat_template

# Retrieve a chat template specific to the "llama-3.1" version and apply it to the tokenizer
tokenizer = get_chat_template(
    tokenizer,  # Pass the existing tokenizer
    chat_template="llama-3.1",  # Specify the version of the chat template to use (e.g., "llama-3.1")
)

# Enable native 2x faster inference for the model, optimizing it for faster performance during inference
FastLanguageModel.for_inference(model)  # Speed up model inference with optimized methods

# Define a list of messages to simulate a conversation, with the user's role prompting the model to continue the Fibonacci sequence
messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},  # User asks the model to continue the sequence
]

# Prepare the input by applying the chat template to the message list
inputs = tokenizer.apply_chat_template(
    messages,  # Pass the list of messages to the tokenizer
    tokenize=True,  # Tokenize the message content to convert it into model-readable format
    add_generation_prompt=True,  # Add a generation prompt that is required for text generation tasks
    return_tensors="pt",  # Return the tokenized data as PyTorch tensors (so it can be used by the model)
).to("cuda")  # Move the tokenized data to the GPU (CUDA) for faster processing

# Generate output from the model using the provided inputs
outputs = model.generate(
    input_ids=inputs,  # Pass the prepared input data (tokenized)
    max_new_tokens=64,  # Limit the generation to a maximum of 64 new tokens
    use_cache=True,  # Reuse cached key/value states so earlier tokens are not recomputed at every step
    temperature=1.5,  # Set the temperature for sampling, controlling the randomness of the output (higher = more randomness)
    min_p=0.1,  # Drop tokens whose probability is below min_p times that of the most likely token
)

# Decode the generated token IDs back into human-readable text
tokenizer.batch_decode(outputs)  # Convert the model output (token IDs) into readable text

['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nContinuing the Fibonacci sequence with the provided numbers: 1, 1, 2, 3, 5, 8. The next numbers would be: 13, 21.<|eot_id|>']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [16]:
# Enable native 2x faster inference by using a specialized method for the model
FastLanguageModel.for_inference(model)  # Optimize model for inference (improves speed)

# Define a list of message dictionaries simulating user interaction (role-based content)
messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},  # User's message prompting the model
]

# Prepare the input data by applying a chat template to the messages
inputs = tokenizer.apply_chat_template(
    messages,  # Pass the user messages
    tokenize=True,  # Tokenize the message content
    add_generation_prompt=True,  # Add a prompt for text generation, required for generating output
    return_tensors="pt",  # Return the input as PyTorch tensors for compatibility with the model
).to("cuda")  # Move the input tensor to GPU (CUDA) for faster computation

# Importing TextStreamer from Hugging Face's transformers library to stream generated text
from transformers import TextStreamer

# Initialize a TextStreamer object to handle the streaming of the output text as it's generated
text_streamer = TextStreamer(tokenizer, skip_prompt=True)  # Set skip_prompt=True to avoid showing the prompt in the output

# Generate output using the model, passing in the inputs and the TextStreamer for real-time output handling
_ = model.generate(
    input_ids=inputs,  # Pass the prepared input tensor
    streamer=text_streamer,  # Use the text streamer to handle the output
    max_new_tokens=128,  # Limit the generation to a maximum of 128 new tokens
    use_cache=True,  # Use cached states to speed up inference (avoids recomputation)
    temperature=1.5,  # Set the temperature for sampling, influencing the randomness of generation
    min_p=0.1,  # Minimum probability threshold for token selection during generation
)


Let's continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597.<|eot_id|>


<a name="Save"></a>
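### Saving finetuned models
To save the final model as LoRA adapters, use `save_pretrained` for a local save. This writes only the LoRA adapters, not the full merged model.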

In [18]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [19]:
# Import FastLanguageModel from the unsloth library for loading and inference optimization
from unsloth import FastLanguageModel

# Load the pre-trained model and tokenizer with the specified settings
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # Path to the LoRA adapters saved in the previous cell
    max_seq_length=max_seq_length,  # Set the maximum sequence length for input processing (limits token length)
    dtype=dtype,  # Data type for model weights (None auto-detects; otherwise float16 or bfloat16)
    load_in_4bit=load_in_4bit,  # Load the model with 4-bit quantization to reduce memory usage
)

# Enable native optimizations for faster inference, improving the performance of the model
FastLanguageModel.for_inference(model)  # Apply native inference optimizations for speed

# Define a list of messages representing a conversation with the model
messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},  # User message requesting a description of a tower
]

# Prepare the input by applying the chat template to the message list to format it for the model
inputs = tokenizer.apply_chat_template(
    messages,  # Pass the list of user messages to the tokenizer
    tokenize=True,  # Tokenize the message content so it can be processed by the model
    add_generation_prompt=True,  # Add a special prompt required for the generation process
    return_tensors="pt",  # Return the tokenized data as PyTorch tensors, which is the format expected by the model
).to("cuda")  # Move the tokenized data to the GPU (CUDA) for faster processing

# Import the TextStreamer class from Hugging Face's transformers library to handle the generation output stream
from transformers import TextStreamer

# Initialize a TextStreamer object to stream the generated text output in real-time as it is produced
text_streamer = TextStreamer(tokenizer, skip_prompt=True)  # Set skip_prompt=True to exclude the prompt from output

# Generate output from the model based on the tokenized input data and stream the result
_ = model.generate(
    input_ids=inputs,  # Pass the tokenized input data to the model for text generation
    streamer=text_streamer,  # Use the text streamer to output the generated text as it is produced
    max_new_tokens=128,  # Limit the output to a maximum of 128 new tokens
    use_cache=True,  # Use the model cache to speed up inference by avoiding redundant computations
    temperature=1.5,  # Set the temperature to control the randomness of the output (higher means more randomness)
    min_p=0.1,  # Minimum probability for token selection, controlling which tokens are likely to be chosen
)


==((====))==  Unsloth 2025.1.6: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
There is not much information in the prompt to suggest a specific location. The tower can be situated in the capital city of France which has the highest number of world-famous buildings and iconic structures.

A famous tower in France is the Eiffel Tower, which is known for its impressive height, being the tallest tower in Paris and an iconic symbol of France. The Eiffel Tower stands at a height of 324 meters (1,063 feet). Its four pillars support a lattice structure, and the tower has two observation decks: one at the top and one half-way

In [None]:
# Push the trained model to the Hugging Face Model Hub using the GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF",  # Target repository on the Hugging Face Hub. Replace "SURESHBEEKHANI" with your own Hugging Face username.
    tokenizer,  # Pass the tokenizer associated with the model to ensure compatibility on the hub
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # Specify the quantization methods to apply for optimized model storage (e.g., q4_k_m, q8_0, q5_k_m)
    token="",  # Provide the Hugging Face token for authentication. Obtain a token at https://huggingface.co/settings/tokens
)


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.91 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 14.81it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF/pytorch_model-00001-of-00002.bin...
Unsloth: Saving SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF into f16 GGUF format.
The output location will be /content/SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Llama_3_2_3B_SFT_GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q5_K_M.gguf:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Llama_3_2_3B_SFT_GGUF
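The three upload sizes in the log above give a rough picture of what each quantization method costs per weight. The sketch below divides each file size by an assumed parameter count of about 3.21 billion for Llama-3.2-3B and reads the progress-bar unit `G` as 10^9 bytes. Both are assumptions, and GGUF files also contain metadata and tensors stored at other precisions, so treat the results as approximate.

```python
ASSUMED_PARAMS = 3.21e9  # assumed parameter count, not printed in this notebook

# File sizes copied from the upload progress bars above, in units of 1e9 bytes
gguf_sizes_gb = {
    "q4_k_m": 2.02,
    "q5_k_m": 2.32,
    "q8_0": 3.42,
}

bits_per_weight = {
    name: round(size_gb * 1e9 * 8 / ASSUMED_PARAMS, 2)
    for name, size_gb in gguf_sizes_gb.items()
}

for name, bpw in bits_per_weight.items():
    print(f"{name}: ~{bpw} bits per weight")
```

Comparing the entries shows the trade-off between the three files: `q4_k_m` is the smallest download for CPU inference, while `q8_0` stays closest to the 16-bit weights at a noticeably larger size.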
