<a href="https://colab.research.google.com/github/BryanTheLai/LLM-Stuff/blob/main/Fine_Tuning_for_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024  # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",       # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",      # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",      # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",            # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",              # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",            # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct", # or choose "unsloth/Llama-3.2-3B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit, # 4-bit quantization: much lower memory use than fp16/fp32, at some cost in accuracy
    # token = "hf_...", # use one if using gated models like meta-llama/llama-2-7b-hf
)

==((====))==  Unsloth 2025.6.3: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
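
To see why `load_in_4bit` helps on a T4 with under 15 GB of memory, here is a rough back-of-the-envelope sketch of weight storage at different precisions. The parameter count (about 1.24 billion for Llama-3.2-1B) is an assumption that is not stated in this notebook, and real usage is higher because some layers stay in 16-bit and activations, gradients, and optimizer state also need memory.

```python
# Rough weight-memory estimate; ASSUMED_PARAMS is an assumption, not read from the model.
ASSUMED_PARAMS = 1_240_000_000  # approximate parameter count of Llama-3.2-1B

def weight_gb(num_params: int, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (10**9 bytes) for a given precision."""
    return num_params * bits_per_param / 8 / 1e9

for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_gb(ASSUMED_PARAMS, bits):.2f} GB")
```

Under this assumption, 4-bit weights take roughly a quarter of the fp16 footprint.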


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    # r is the rank of the LoRA update matrices; a higher r means more trainable parameters.
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128

    # Which specific module to change
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    # Attention projections: "q_proj", "k_proj", "v_proj", "o_proj"
    # Feed-forward (MLP) projections: "gate_proj", "up_proj", "down_proj"

    lora_alpha = 16, # Scaling factor for the LoRA update (applied as lora_alpha / r). 16 is a good start
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.6.3 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
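
As a sanity check on what `r = 16` means in practice, the sketch below counts the LoRA parameters added across the seven target modules. The layer shapes are assumptions taken from the public Llama-3.2-1B configuration (hidden size 2048, key/value width 512, MLP width 8192, 16 layers); they are not printed anywhere in this notebook.

```python
# Count LoRA parameters for r = 16 over the seven target modules.
# The shapes below are assumed from the public Llama-3.2-1B config.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 2048, 512, 8192, 16
RANK = 16

# (in_features, out_features) for each adapted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(per_layer, total)
```

Under these assumed shapes the total comes to 11,272,192, which matches the "Trainable parameters" figure in the training log later in the notebook.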



Data Prep

We use the Llama-3.1 chat format for conversation-style finetunes. The dataset here is WikiSQL rather than a ShareGPT-style dataset such as Maxime Labonne's FineTome-100k, so each example is built directly in HuggingFace's normal multi-turn format ("role", "content") instead of ShareGPT's ("from", "value"). Llama-3 renders multi-turn conversations like below:
```
<|begin_of_text|>

<|start_header_id|>user<|end_header_id|>
Hello!
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
Hey there! How are you?
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
I'm great thanks!<|eot_id|>

```
We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3 and more.
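
To make the rendered format concrete, here is a minimal hand-written sketch of how role/content messages map onto the Llama-3 special tokens shown above. `render_llama3` is a hypothetical helper for illustration only; the real template is applied by the tokenizer and also inserts a default system message (visible in the outputs later in this notebook).

```python
# Simplified, hand-written renderer; the real template is applied by the tokenizer.
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages with Llama-3 style header and end-of-turn tokens."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

demo = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
]
print(render_llama3(demo))
```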


In [None]:
from datasets import load_dataset

# Load the WikiSQL dataset
dataset = load_dataset("wikisql")

print(dataset)

DatasetDict({
    test: Dataset({
        features: ['phase', 'question', 'table', 'sql'],
        num_rows: 15878
    })
    validation: Dataset({
        features: ['phase', 'question', 'table', 'sql'],
        num_rows: 8421
    })
    train: Dataset({
        features: ['phase', 'question', 'table', 'sql'],
        num_rows: 56355
    })
})


In [None]:
from pydantic import BaseModel
import json

# Define a Pydantic model to represent the structure of a dataset example
class WikiSQLExample(BaseModel):
    phase: int      # Ignorable, indicates dataset creation process
    question: str   # user prompt
    # Describes the table relevant to the question: header holds column names, rows holds the data. Other keys (page_title, caption) are additional context
    table: dict
    sql: dict       # Correct SQL query for question and table.

# Get the first example from the train split (slicing returns a dict of column lists)
examples = dataset['train'][:1]

# Convert the examples to a list of WikiSQLExample objects
formatted_examples = [WikiSQLExample(**{k: examples[k][i] for k in examples}) for i in range(len(examples['phase']))]

# Print the examples as JSON
for example in formatted_examples:
    print(example.model_dump_json(indent=2))

{
  "phase": 1,
  "question": "Tell me what the notes are for South Australia ",
  "table": {
    "header": [
      "State/territory",
      "Text/background colour",
      "Format",
      "Current slogan",
      "Current series",
      "Notes"
    ],
    "page_title": "",
    "page_id": "",
    "types": [
      "text",
      "text",
      "text",
      "text",
      "text",
      "text"
    ],
    "id": "1-1000181-1",
    "section_title": "",
    "caption": "",
    "rows": [
      [
        "Australian Capital Territory",
        "blue/white",
        "Yaa·nna",
        "ACT · CELEBRATION OF A CENTURY 2013",
        "YIL·00A",
        "Slogan screenprinted on plate"
      ],
      [
        "New South Wales",
        "black/yellow",
        "aa·nn·aa",
        "NEW SOUTH WALES",
        "BX·99·HI",
        "No slogan on current series"
      ],
      [
        "New South Wales",
        "black/white",
        "aaa·nna",
        "NSW",
        "CPX·12A",
        "Optional white slimlin

standardize_sharegpt converts ShareGPT style datasets into HuggingFace's generic format.

ShareGPT format:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
HuggingFace format:
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
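
The mapping itself is simple enough to sketch by hand. The function below is a simplified stand-in, not Unsloth's actual `standardize_sharegpt` implementation; `sharegpt_to_hf` and `ROLE_MAP` are hypothetical names introduced here for illustration.

```python
# Simplified stand-in for the conversion; not Unsloth's actual implementation.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_hf(turns):
    """Map ShareGPT {"from", "value"} turns to {"role", "content"} messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]

sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
print(sharegpt_to_hf(sharegpt))
```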

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1", # Or use a different chat template if needed
)

def formatting_prompts_func(examples):
    questions = examples["question"]
    tables = examples["table"]
    sqls = examples["sql"]

    # Construct the conversational turns for each example
    convos = []
    for question, table, sql in zip(questions, tables, sqls):
        # Format the table information to be included in the prompt
        table_info = f"Table header: {table['header']}\nTable rows (sample): {table['rows'][:3]}..." # Include relevant table info

        convo = [
            {"role": "user", "content": f"Convert the following question to SQL based on the table schema.\n\nQuestion: {question}\n{table_info}"},
            {"role": "assistant", "content": sql["human_readable"]},
        ]
        convos.append(convo)

    # Apply the chat template to format the conversations
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]

    return { "text" : texts, }

# The wikisql dataset does not need standardization
# dataset = standardize_sharegpt(dataset)

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched = True,)
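
The user prompt is assembled from the question, the table header, and the first three rows. The sketch below isolates that step in a hypothetical helper, `build_user_prompt`, which is not part of the notebook; sharing one helper between training and inference is one way to keep the two prompts identical. The toy table is made up for illustration.

```python
# Hypothetical helper (not in the notebook) that builds the user prompt in one place.
def build_user_prompt(question, table, sample_rows=3):
    """Combine the instruction, question, header, and a few sample rows."""
    table_info = (
        f"Table header: {table['header']}\n"
        f"Table rows (sample): {table['rows'][:sample_rows]}..."
    )
    return (
        "Convert the following question to SQL based on the table schema.\n\n"
        f"Question: {question}\n{table_info}"
    )

# Made-up table with the same keys the formatting function reads.
toy_table = {
    "header": ["Player", "Position"],
    "rows": [["Antonio Lang", "Guard-Forward"], ["Voshon Lenard", "Guard"]],
}
print(build_user_prompt("What position does Voshon Lenard play?", toy_table))
```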

In [None]:
# Display one formatted example (index 6) from the train split
# Columns in train: phase, question, table, sql, text
formatted_texts = dataset['train']["text"][6:7]

for i, text in enumerate(formatted_texts):
    print(f"--- Example {i+1} ---")
    print(text)
    print("-" * 20) # Print a separator line

--- Example 1 ---
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Convert the following question to SQL based on the table schema.

Question: who is the manufacturer for the order year 1998?
Table header: ['Order Year', 'Manufacturer', 'Model', 'Fleet Series (Quantity)', 'Powertrain (Engine/Transmission)', 'Fuel Propulsion']
Table rows (sample): [['1992-93', 'Gillig', 'Phantom (High Floor)', '444-464 (21)', 'DD S50EGR Allison WB-400R', 'Diesel'], ['1996', 'Gillig', 'Phantom (High Floor)', '465-467 (3)', 'DD S50 Allison WB-400R', 'Diesel'], ['1998', 'Gillig', 'Phantom (High Floor)', '468-473 (6)', 'DD S50 Allison WB-400R', 'Diesel']]...<|eot_id|><|start_header_id|>assistant<|end_header_id|>

SELECT Manufacturer FROM table WHERE Order Year = 1998<|eot_id|>
--------------------


In [None]:
# Check the number of examples in each split
print(f"Train dataset size: {dataset['train'].num_rows}")
print(f"Validation dataset size: {dataset['validation'].num_rows}")
print(f"Test dataset size: {dataset['test'].num_rows}")

Train dataset size: 56355
Validation dataset size: 8421
Test dataset size: 15878


In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Get an example from the validation set for inference
example = dataset['validation'][0] # First example from validation split of wikisql

# Format the table information for the prompt
table = example['table']
table_info = f"Table header: {table['header']}\nTable rows (sample): {table['rows'][:3]}..." # Include relevant table info

question = example['question']

# Construct the messages for the text-to-SQL task
messages = [
    {"role": "user", "content": f"Convert the following question to SQL based on the table schema.\n\nQuestion: {question}\n{table_info}"},
]

# Convert messages (logical structure) -> chat-template text with special tokens -> token IDs -> PyTorch tensor
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,              # Convert the rendered text to token IDs
    add_generation_prompt = True, # Append the assistant header so the model generates the response
    return_tensors = "pt",        # Return token IDs as a PyTorch tensor, which model.generate() expects
    return_attention_mask=True    # The mask marks real tokens (1) vs padding (0) so padding is ignored.
    # Note: unless return_dict = True is also passed, this call appears to return only input_ids,
    # which would explain the attention-mask warning in the output.
    # Example: S1 = "Hello", S2 = "Hello there"
    # In a batch, both sequences must have the same length,
    # so S1 is padded: S1 = Hello[pad]
    # Attention masks:
    # padded S1 = [1,0], S2 = [1,1]
    # The model attends only to positions marked 1.
).to("cuda")                      # Move the input tensor to the GPU for faster processing

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = False) # Stream generated text.

# Generate a response.
# Arguments: input tensor, streamer for live output, a cap of 512 new tokens, KV cache enabled.
# The KV (key-value) cache speeds up generation by reusing key/value vectors computed at earlier steps.
original_model_output = model.generate(inputs, streamer = text_streamer, max_new_tokens = 512, use_cache = True, temperature = 1.5, min_p = 0.1)
# temperature: controls the randomness of sampling (higher = more random)
# min_p:       drops tokens whose probability is below min_p times the top token's probability.

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Convert the following question to SQL based on the table schema.

Question: What position does the player who played for butler cc (ks) play?
Table header: ['Player', 'No.', 'Nationality', 'Position', 'Years in Toronto', 'School/Club Team']
Table rows (sample): [['Antonio Lang', '21', 'United States', 'Guard-Forward', '1999-2000', 'Duke'], ['Voshon Lenard', '2', 'United States', 'Guard', '2002-03', 'Minnesota'], ['Martin Lewis', '32, 44', 'United States', 'Guard-Forward', '1996-97', 'Butler CC (KS)']]...<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is the SQL query based on the provided table schema:

```sql
SELECT Player
FROM table_name
WHERE No. = 21;
```

Please make sure to replace `table_name` with your actual table name.

The above query will return the position of the player who played for
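
The padding example in the inference cell's comments (S1 = "Hello", S2 = "Hello there") can be reproduced with a toy sketch. The token IDs and `PAD_ID` are made up, and `pad_batch` is a hypothetical helper; a real tokenizer does this when it pads a batch.

```python
# Toy illustration with made-up token IDs; a real tokenizer does this when padding a batch.
PAD_ID = 0  # hypothetical padding token ID

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

s1 = [101]        # "Hello"
s2 = [101, 102]   # "Hello there"
ids, mask = pad_batch([s1, s2])
print(ids)   # [[101, 0], [101, 102]]
print(mask)  # [[1, 0], [1, 1]]
```

Generation with decoder-only models usually pads on the left instead; right padding is used here only to match the comments in the cell.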

Train the model with Huggingface TRL's SFTTrainer (more docs here: TRL SFT docs). This run uses 16 steps to speed things up; for a full run, set num_train_epochs=1 and turn off the step limit with max_steps=None.

We also support TRL's DPOTrainer!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'], # Specify the train split
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # Used for a full training run; ignored while max_steps is set.
        max_steps = 16,       # Short demo run; a positive max_steps overrides num_train_epochs.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
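
A quick arithmetic sketch shows how little of the dataset a 16-step run actually covers. The values are copied from the trainer arguments and the dataset printout; the steps-per-epoch figure is a rough estimate, since the trainer's exact rounding may differ.

```python
# Values copied from the trainer cell and the dataset printout.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 16
train_rows = 56_355

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
approx_steps_per_epoch = train_rows // effective_batch  # rough estimate only

print(effective_batch)  # 8
print(examples_seen)    # 128
print(f"{examples_seen / train_rows:.2%} of the train split")
print(approx_steps_per_epoch)
```

The effective batch size of 8 agrees with the "Total batch size (2 x 4 x 1) = 8" line in the training log.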


Use Unsloth's `train_on_responses_only` method to compute the loss only on the assistant outputs and ignore the loss on user inputs.



In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

In [None]:
# Verify masking is actually done:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nConvert the following question to SQL based on the table schema.\n\nQuestion: what is the fuel propulsion where the fleet series (quantity) is 310-329 (20)?\nTable header: ['Order Year', 'Manufacturer', 'Model', 'Fleet Series (Quantity)', 'Powertrain (Engine/Transmission)', 'Fuel Propulsion']\nTable rows (sample): [['1992-93', 'Gillig', 'Phantom (High Floor)', '444-464 (21)', 'DD S50EGR Allison WB-400R', 'Diesel'], ['1996', 'Gillig', 'Phantom (High Floor)', '465-467 (3)', 'DD S50 Allison WB-400R', 'Diesel'], ['1998', 'Gillig', 'Phantom (High Floor)', '468-473 (6)', 'DD S50 Allison WB-400R', 'Diesel']]...<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nSELECT Fuel Propulsion FROM table WHERE Fleet Series (Quantity) = 310-329 (20)<|eot_id|>"

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                                                                                SELECT Fuel Propulsion FROM table WHERE Fleet Series (Quantity) = 310-329 (20)<|eot_id|>'
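
The idea behind the masking can be sketched on a toy token list. This is a simplified, single-turn illustration and not Unsloth's implementation; `mask_before_response` and the placeholder tokens are hypothetical. The value -100 is the label PyTorch's cross-entropy loss ignores by default.

```python
# Simplified sketch of response-only masking on a toy token list.
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_before_response(tokens, response_marker):
    """Return labels where everything up to and including the marker is ignored."""
    n = len(response_marker)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == response_marker:
            cut = start + n
            return [IGNORE_INDEX] * cut + tokens[cut:]
    # No assistant turn found: ignore the whole example.
    return [IGNORE_INDEX] * len(tokens)

tokens = ["<user>", "How", "many", "rows", "?", "<assistant>", "SELECT", "COUNT", "(*)"]
labels = mask_before_response(tokens, ["<assistant>"])
print(labels)
```

Only the tokens after the assistant marker keep their labels, which mirrors the decoded output above where the prompt is blanked out and only the SQL remains.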

In [None]:
# Fix RuntimeError: PassManager::run failed error (https://github.com/unslothai/unsloth/issues/2482)

import os
# os.environ['TRITON_DISABLE_LINE_INFO'] = '1' # Optional, tested with and without
os.environ['TRITON_JIT_DISABLE_OPT'] = '1' # Likely the most critical change

# The two settings below are reminders only: assigning them in this cell has no effect.
# To apply them, pass them in the earlier calls and re-run those cells.
# Inside FastLanguageModel.get_peft_model(...): use_gradient_checkpointing = False
# Inside TrainingArguments(...):                fp16 = False

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 56,355 | Num Epochs = 1 | Total steps = 16
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Step,Training Loss
1,1.5955
2,1.5219
3,1.3223
4,1.2092
5,0.7076
6,0.5487
7,0.3254
8,0.3104
9,0.3714
10,0.2601


Unsloth: Will smartly offload gradients to save VRAM!
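
For reference, the learning rate during these 16 steps follows a linear warmup and then a linear decay (`warmup_steps = 5`, `lr_scheduler_type = "linear"`, `learning_rate = 2e-4`). The sketch below follows the usual definition of such a schedule; treat it as an approximation of what the trainer does, not its exact code. `linear_lr` is a hypothetical helper.

```python
# Sketch of a linear warmup + linear decay schedule with the notebook's settings.
BASE_LR, WARMUP_STEPS, TOTAL_STEPS = 2e-4, 5, 16

def linear_lr(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate applied at a 0-indexed optimizer step."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))

for step in range(TOTAL_STEPS):
    print(step, f"{linear_lr(step):.2e}")
```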


In [None]:
new_model_output = model.generate(inputs, streamer = text_streamer, max_new_tokens = 512, use_cache = True, temperature = 1.5, min_p = 0.1)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Convert the following question to SQL based on the table schema.

Question: What position does the player who played for butler cc (ks) play?
Table header: ['Player', 'No.', 'Nationality', 'Position', 'Years in Toronto', 'School/Club Team']
Table rows (sample): [['Antonio Lang', '21', 'United States', 'Guard-Forward', '1999-2000', 'Duke'], ['Voshon Lenard', '2', 'United States', 'Guard', '2002-03', 'Minnesota'], ['Martin Lewis', '32, 44', 'United States', 'Guard-Forward', '1996-97', 'Butler CC (KS)']]...<|eot_id|><|start_header_id|>assistant<|end_header_id|>

SELECT Position FROM table WHERE Player = martin lewis<|eot_id|>
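
Both generate calls pass `temperature = 1.5` and `min_p = 0.1`. The sketch below shows, on made-up logits, how temperature scaling followed by min-p filtering reshapes a token distribution. It is a plain-Python illustration of the technique under one common ordering (temperature first, then min-p), not the library's implementation; `min_p_filter` is a hypothetical helper.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Apply temperature, keep tokens with prob >= min_p * max prob, then renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

toy_logits = [4.0, 3.0, 0.0, -2.0]  # made-up scores for four candidate tokens
print([round(p, 3) for p in min_p_filter(toy_logits)])
```

With these toy values the two low-scoring tokens fall below the threshold and are removed, so sampling only chooses between the top two candidates.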


Saving, loading finetuned models

To save the final model as LoRA adapters, either use Huggingface's push_to_hub for an online save or save_pretrained for a local save.
[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!


In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

In [None]:
from google.colab import userdata
HF_TOKEN_KEY = userdata.get('HF_TOKEN_KEY')
model.push_to_hub("NotebookML/Llama-3.2-1B-SQL", token = HF_TOKEN_KEY) # Online saving
tokenizer.push_to_hub("NotebookML/Llama-3.2-1B-SQL", token = HF_TOKEN_KEY) # Online saving

No files have been modified since last commit. Skipping to prevent empty commit.


Saved model to https://huggingface.co/NotebookML/Llama-3.2-1B-SQL


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Now if you want to load the LoRA adapters we just saved for inference, set False to True:



In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "NotebookML/Llama-3.2-1B-SQL", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "system", "content": "Convert the following question to SQL based on the table schema."},
    # Keep chat special tokens out of the content; add_generation_prompt = True appends the assistant header.
    {"role": "user", "content": """
Question: What position does the player who played for butler cc (ks) play?
Table header: ['Player', 'No.', 'Nationality', 'Position', 'Years in Toronto', 'School/Club Team']
Table rows (sample): [['Antonio Lang', '21', 'United States', 'Guard-Forward', '1999-2000', 'Duke'], ['Voshon Lenard', '2', 'United States', 'Guard', '2002-03', 'Minnesota'], ['Martin Lewis', '32, 44', 'United States', 'Guard-Forward', '1996-97', 'Butler CC (KS)']]...
"""},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = False)
_ = model.generate(inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True, temperature = 1.5, min_p = 0.1)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

Convert the following question to SQL based on the table schema.<|eot_id|><|start_header_id|>user<|end_header_id|>


Question: What position does the player who played for butler cc (ks) play?
Table header: ['Player', 'No.', 'Nationality', 'Position', 'Years in Toronto', 'School/Club Team']
Table rows (sample): [['Antonio Lang', '21', 'United States', 'Guard-Forward', '1999-2000', 'Duke'], ['Voshon Lenard', '2', 'United States', 'Guard', '2002-03', 'Minnesota'], ['Martin Lewis', '32, 44', 'United States', 'Guard-Forward', '1996-97', 'Butler CC (KS)']]...<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

SELECT Position FROM table WHERE Player = Martin Lewis<|eot_id|>


In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = HF_TOKEN_KEY)

# Save to 16bit GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if True: model.push_to_hub_gguf("NotebookML/Llama-3.2-1B-SQL", tokenizer, quantization_method = "f16", token = HF_TOKEN_KEY)

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("NotebookML/Llama-3.2-1B-SQL", tokenizer, quantization_method = "q4_k_m", token = HF_TOKEN_KEY)

# Save to multiple GGUF options - much faster if you want multiple!
if True:
    model.push_to_hub_gguf(
        "NotebookML/Llama-3.2-1B-SQL", # Change to your own username/repo!
        tokenizer,
        quantization_method = ["q4_k_m",],
        token = HF_TOKEN_KEY, # Get a token at https://huggingface.co/settings/tokens
    )

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.62 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 20.03it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into f16 GGUF format.
The output location will be /content/model/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:hf-to-gguf:token_embd.weight,           t

100%|██████████| 16/16 [00:00<00:00, 45.95it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving NotebookML/Llama-3.2-1B-SQL/pytorch_model.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at NotebookML/Llama-3.2-1B-SQL into f16 GGUF format.
The output location will be /content/NotebookML/Llama-3.2-1B-SQL/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Llama-3.2-1B-SQL
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/2.48G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/NotebookML/Llama-3.2-1B-SQL


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/NotebookML/Llama-3.2-1B-SQL
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.61 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 42.50it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving NotebookML/Llama-3.2-1B-SQL/pytorch_model.bin...
