To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2

In [None]:
%%capture
# These are mamba kernels and we must have these for faster training
!pip install --no-build-isolation mamba_ssm==2.2.5
!pip install --no-build-isolation causal_conv1d==1.5.2

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/granite-4.0-micro",
    "unsloth/granite-4.0-h-micro",
    "unsloth/granite-4.0-h-tiny",
    "unsloth/granite-4.0-h-small",

    # Base pretrained Granite 4 models
    "unsloth/granite-4.0-micro-base",
    "unsloth/granite-4.0-h-micro-base",
    "unsloth/granite-4.0-h-tiny-base",
    "unsloth/granite-4.0-h-small-base",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-h-tiny-base",
    max_seq_length = 2048,   # Choose any for long context!
    load_in_4bit = False,    # 4 bit quantization to reduce memory
    load_in_8bit = False,    # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

==((====))==  Unsloth 2025.9.11: Fast Granitemoehybrid patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


The fast path for GraniteMoeHybrid will be used when running the model on a GPU


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "shared_mlp.input_linear", "shared_mlp.output_linear"],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
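To get a feel for why LoRA trains so few parameters, here is a minimal sketch that counts the weights one adapter adds to a single linear layer. The layer size is a made-up illustrative value, not Granite's actual dimensions, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 4096, 4096, 32
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} params, LoRA adapter: {lora:,} params "
      f"({100 * lora / full:.2f}% of the layer)")
```

Doubling `r` doubles the adapter size, which is why the rank is the main knob for trading capacity against memory.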


<a name="Data"></a>
### Data Prep
#### 📄 Using Google Sheets as Training Data
Our goal is to create a customer support bot that proactively helps and solves issues.

We're storing examples in a Google Sheet with two columns:

- **Snippet**: A short customer support interaction
- **Recommendation**: A suggestion for how the agent should respond

This keeps things simple and collaborative. Anyone can edit the sheet, no database setup required.  
<br>

---
<br>

#### 🔍 Why This Format?

This setup works well for tasks like:

- `Input snippet → Suggested reply`
- `Prompt → Rewrite`
- `Bug report → Diagnosis`
- `Text → Label or Category`

Just collect examples in a spreadsheet, and you've got usable training data.  
<br>

---
<br>

#### ✅ What You'll Learn

We'll show how to:

1. Load the Google Sheet into your notebook
2. Format it into a dataset
3. Use it to train or prompt an LLM
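As a rough sketch of step 1, a publicly shared Google Sheet can usually be read as CSV through its export URL. The helper name `sheet_csv_url` and the placeholder sheet ID are assumptions for illustration; the sheet must be shared so that anyone with the link can view it.

```python
def sheet_csv_url(sheet_id: str, gid: str = "0") -> str:
    """Build the CSV export URL for one tab of a publicly shared Google Sheet."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )

# Placeholder ID for illustration only; substitute your own sheet's ID.
url = sheet_csv_url("YOUR_SHEET_ID")
print(url)
# With pandas installed and the sheet shared publicly, loading would look like:
#   import pandas as pd
#   df = pd.read_csv(url)
```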


The chat template for granite-4 looks like this:
```
<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: June 24, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>

<|start_of_role|>user<|end_of_role|>How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|end_of_text|>

<|start_of_role|>assistant<|end_of_role|>Astronomers make use of the unique spectral fingerprints of elements found in stars...<|end_of_text|>
```
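To make the layout concrete, the sketch below renders a message list by hand. It only approximates the structure shown above, assuming a single newline between turns; `render_granite_chat` is a hypothetical helper, and the tokenizer's real template (applied through `tokenizer.apply_chat_template`) may differ, for example by adding a default system prompt.

```python
def render_granite_chat(messages, add_generation_prompt=False):
    """Approximate the Granite chat layout; not the tokenizer's real template."""
    parts = [
        f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|start_of_role|>assistant<|end_of_role|>")
    return "".join(parts)

example = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "It's 4."},
]
print(render_granite_chat(example))
```

For training you render the full conversation; for inference you pass only the user turn with `add_generation_prompt=True`.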

In [None]:
import pandas as pd
import numpy as np
import re
from datasets import Dataset

# Enhanced text cleaning function - extracts key features AND keeps full text
def clean_text_enhanced(text):
    if pd.isnull(text):
        return ""
    
    # Convert to string and clean basic issues
    text = str(text).strip()
    
    # Extract ALL structured information (not just top 3)
    item_name = re.search(r"Item Name:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    brand = re.search(r"Brand:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    color = re.search(r"Color:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    size = re.search(r"Size:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    material = re.search(r"Material:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    model = re.search(r"Model:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    
    # Extract bullet points (all of them)
    bp1 = re.search(r"Bullet Point\s*1:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    bp2 = re.search(r"Bullet Point\s*2:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    bp3 = re.search(r"Bullet Point\s*3:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    bp4 = re.search(r"Bullet Point\s*4:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    bp5 = re.search(r"Bullet Point\s*5:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    
    # Extract value and unit
    value = re.search(r"Value:\s*([\d.,]+)", text, re.IGNORECASE)
    unit = re.search(r"Unit:\s*([A-Za-z]+)", text, re.IGNORECASE)
    
    # Extract description if present
    description = re.search(r"Description:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
    
    # Build structured output with KEY features first, then append everything else
    structured_parts = []
    
    # Top priority features (Item Name, Value, Unit)
    if item_name:
        structured_parts.append(f"Item: {item_name.group(1).strip()}")
    if value and unit:
        structured_parts.append(f"Quantity: {value.group(1).strip()} {unit.group(1).strip()}")
    elif value:
        structured_parts.append(f"Value: {value.group(1).strip()}")
    
    # Additional important features
    if brand:
        structured_parts.append(f"Brand: {brand.group(1).strip()}")
    if color:
        structured_parts.append(f"Color: {color.group(1).strip()}")
    if size:
        structured_parts.append(f"Size: {size.group(1).strip()}")
    if material:
        structured_parts.append(f"Material: {material.group(1).strip()}")
    if model:
        structured_parts.append(f"Model: {model.group(1).strip()}")
    
    # All bullet points
    if bp1:
        structured_parts.append(f"Feature 1: {bp1.group(1).strip()}")
    if bp2:
        structured_parts.append(f"Feature 2: {bp2.group(1).strip()}")
    if bp3:
        structured_parts.append(f"Feature 3: {bp3.group(1).strip()}")
    if bp4:
        structured_parts.append(f"Feature 4: {bp4.group(1).strip()}")
    if bp5:
        structured_parts.append(f"Feature 5: {bp5.group(1).strip()}")
    
    if description:
        structured_parts.append(f"Description: {description.group(1).strip()}")
    
    # Join structured parts
    cleaned_text = ". ".join(structured_parts)
    
    # IMPORTANT: Append the FULL original text (cleaned) so nothing is lost
    # This ensures ALL information is available to the model
    full_text_cleaned = text.lower()
    full_text_cleaned = re.sub(r'[^\w\s.,:\-]', ' ', full_text_cleaned)
    full_text_cleaned = re.sub(r'\s+', ' ', full_text_cleaned)
    full_text_cleaned = full_text_cleaned.strip()
    
    # Combine: structured features first, then full text for additional context
    if cleaned_text and full_text_cleaned:
        final_text = f"{cleaned_text}. Full Details: {full_text_cleaned}"
    elif cleaned_text:
        final_text = cleaned_text
    else:
        final_text = full_text_cleaned
    
    return final_text

print("Loading training data from dataset/train.csv...")
train_df = pd.read_csv('dataset/train.csv', encoding='latin1')

print(f"Original data shape: {train_df.shape}")
print(f"Columns: {train_df.columns.tolist()}")

# Apply text cleaning
print("\nApplying enhanced text cleaning...")
train_df['catalog_content'] = train_df['catalog_content'].apply(clean_text_enhanced)

# Filter out empty or very short text
train_df['text_length'] = train_df['catalog_content'].str.len()
train_df = train_df[train_df['text_length'] > 10].copy()

print(f"Data shape after cleaning: {train_df.shape}")
print(f"\nPrice statistics:")
print(train_df['price'].describe())

# Convert to HuggingFace Dataset format
dataset = Dataset.from_pandas(train_df[['catalog_content', 'price']])

print(f"\n✅ Dataset loaded: {len(dataset)} samples")

Downloading data:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]
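The cleaning function above repeats one regex pattern per field. As a compact illustration of that same pattern, here is a loop-based sketch; `extract_fields` and the sample catalog entry are made up for this example and are not part of the notebook's pipeline.

```python
import re

def extract_fields(text, fields):
    """Return {field: value} for each 'Field: value' line found in text."""
    found = {}
    for field in fields:
        # Capture everything after "Field:" up to the end of that line.
        match = re.search(rf"{re.escape(field)}:\s*(.*?)(?=\n|$)", text, re.IGNORECASE)
        if match:
            found[field] = match.group(1).strip()
    return found

# Made-up catalog entry for illustration; not taken from the dataset.
sample = "Item Name: Green Tea Bags\nBrand: Acme\nValue: 100\nUnit: Count"
print(extract_fields(sample, ["Item Name", "Brand", "Color"]))
```

Fields that are absent (here `Color`) are simply left out, mirroring the `if brand:` style checks in the full function.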

We've just loaded the CSV data as a Dataset, but we still need to convert it to the conversational format below and then apply the chat template.

```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```

We'll use a helper function `formatting_prompts_func` to do both!

In [None]:
def formatting_prompts_func(examples):
    catalog_texts = examples['catalog_content']
    prices = examples['price']
    
    messages = [
        [{"role": "user", "content": f"Predict the price for this product: {catalog_text}"},
         {"role": "assistant", "content": f"The predicted price is ${price:.2f}"}] 
        for catalog_text, price in zip(catalog_texts, prices)
    ]
    
    # Render each conversation to a single string using the model's chat template
    texts = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) 
             for message in messages]
    
    return {"text": texts}

print("Formatting dataset with chat template...")
dataset = dataset.map(formatting_prompts_func, batched=True)
print(f"✅ Dataset formatted: {len(dataset)} samples")

Map:   0%|          | 0/504 [00:00<?, ? examples/s]
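With `batched=True`, the mapping function receives a dict of columns (each a list) rather than one row at a time. The sketch below shows that layout without any tokenizer; `to_messages` and the two rows are hypothetical and exist only to illustrate the shape of the data.

```python
def to_messages(batch):
    """Turn a dict of columns (as a batched map receives) into chat messages."""
    return [
        [
            {"role": "user", "content": f"Predict the price for this product: {text}"},
            {"role": "assistant", "content": f"The predicted price is ${price:.2f}"},
        ]
        for text, price in zip(batch["catalog_content"], batch["price"])
    ]

# Two made-up rows, laid out column-wise like a batched `examples` argument.
batch = {"catalog_content": ["Item: Tea", "Item: Mug"], "price": [4.5, 12.0]}
conversations = to_messages(batch)
print(conversations[0][1]["content"])
```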

We now look at the raw input data before formatting.

In [None]:
# Show raw catalog content before formatting
print("Sample catalog content:")
print(dataset[5]["catalog_content"][:500])  # Show first 500 chars

'User: I\'m getting an error when trying to log in. \nAgent: What error message are you seeing? \nUser: It says "Invalid credentials" even though I\'m sure my password is correct. \nAgent: Have you tried clearing your browser cache? \nUser: Yes, I cleared it already. \nAgent: Let me check your account status. \nUser: I\'ve been using this account for months without issues. \nAgent: I found no issues with your account. \nUser: Maybe there\'s a problem with the login server? \nAgent: Let\'s try resetting your password. \nUser: I just did that, and it\'s not working either. \nAgent: I\'ll need to escalate this to our engineering team. \nUser: Okay, what should I do in the meantime? \nAgent: Try using a different browser or device. \nUser: I\'ll try Chrome on my laptop. \nAgent: Let me know if that resolves the issue.'

In [None]:
# Show the corresponding price
print("Sample price:")
print(f"${dataset[5]['price']:.2f}")

'#### Analysis\nThe user is experiencing persistent login issues ("Invalid credentials", password reset failure) despite clearing cache and confirming correct credentials. No account anomalies were detected by the agent. The root cause remains unresolved and potentially related to server-side authentication or user-specific credential handling.\n\n#### Recommendation\n- Step 1: Confirm if using Chrome on the laptop resolved the login issue. (User action)\n- Step 2: If Step 1 was successful, no further immediate action needed. If not, proceed to escalate based on user feedback.\n- *Next best action for the agent*: Report back to the user whether using Chrome on their laptop confirmed or failed to resolve the issue.'

And we see how the chat template transformed these conversations.

In [None]:
dataset[5]["text"]

'<|start_of_role|>user<|end_of_role|>User: I\'m getting an error when trying to log in. \nAgent: What error message are you seeing? \nUser: It says "Invalid credentials" even though I\'m sure my password is correct. \nAgent: Have you tried clearing your browser cache? \nUser: Yes, I cleared it already. \nAgent: Let me check your account status. \nUser: I\'ve been using this account for months without issues. \nAgent: I found no issues with your account. \nUser: Maybe there\'s a problem with the login server? \nAgent: Let\'s try resetting your password. \nUser: I just did that, and it\'s not working either. \nAgent: I\'ll need to escalate this to our engineering team. \nUser: Okay, what should I do in the meantime? \nAgent: Try using a different browser or device. \nUser: I\'ll try Chrome on my laptop. \nAgent: Let me know if that resolves the issue.<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>#### Analysis\nThe user is experiencing persistent login issues ("Invalid credent

<a name="Train"></a>
### Train the model
Now let's train our model. For a quick test you can cap training with `max_steps = 60`; for a full run, set `num_train_epochs` and leave `max_steps` unset. The cell below uses `num_train_epochs = 2` with `max_steps` commented out.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Number of full passes over the training data
        # max_steps = 60, # Uncomment to cap training at 60 steps for a quick run
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/504 [00:00<?, ? examples/s]
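As a rough estimate of how long an epoch-based run is, the sketch below computes the number of optimizer steps from the batch settings above. It assumes the 504 examples reported in the tokenizing output and a single GPU; `optimizer_steps` is a hypothetical helper, and the exact rounding can differ between trainer versions.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps, assuming each epoch's last partial batch still counts."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs

# 504 examples, batch size 2 and accumulation 4 are the values used in this notebook.
print(optimizer_steps(504, 2, 4, epochs=1))
print(optimizer_steps(504, 2, 4, epochs=2))
```

With an effective batch size of 8, one epoch is about 63 steps, so two epochs run roughly twice as long as a 60-step quick test.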

We can also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes! The call is commented out in the cell below; uncomment it to enable the masking.

In [None]:
# from unsloth.chat_templates import train_on_responses_only
# trainer = train_on_responses_only(
#     trainer,
#     instruction_part = "<|start_of_role|>user<|end_of_role|>",
#     response_part = "<|start_of_role|>assistant<|end_of_role|>",
# )

Map (num_proc=2):   0%|          | 0/504 [00:00<?, ? examples/s]
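Conceptually, response-only training replaces the labels of every prompt token with `-100`, the value the loss ignores. The sketch below shows the idea on plain lists of made-up token ids for a single-turn example; `mask_prompt` is a hypothetical helper, not Unsloth's implementation, which also handles multi-turn conversations.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_prompt(input_ids, response_marker):
    """Copy input_ids to labels, hiding everything up to and including the marker."""
    labels = list(input_ids)
    n = len(response_marker)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_marker:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Marker not found: mask the whole sequence so it contributes no loss.
    return [IGNORE_INDEX] * len(input_ids)

# Made-up token ids: 7, 8 stand in for the assistant marker tokens.
ids = [1, 2, 3, 7, 8, 40, 41, 42]
print(mask_prompt(ids, [7, 8]))
```

Only the last three ids keep their labels, so the gradient comes from the assistant's answer alone.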

Let's verify that the instruction part is masked. First, print the full formatted text of row 100 (or the first row if the dataset is smaller):

In [None]:
# Verify the full formatted text (input_ids)
if len(trainer.train_dataset) > 100:
    print("Full formatted example:")
    print(tokenizer.decode(trainer.train_dataset[100]["input_ids"]))
else:
    print(f"Dataset only has {len(trainer.train_dataset)} samples. Showing first sample:")
    print(tokenizer.decode(trainer.train_dataset[0]["input_ids"]))

'<|start_of_role|>user<|end_of_role|>User: My account is locked. I tried to log in but got an error message saying "Too many failed attempts".\n\nAgent: Can you please try logging in again and enter the security code sent to your email? That should unlock your account temporarily.\n\nUser: I did that already. I received the code and entered it, but my account is still locked.\n\nAgent: I see. Have you tried resetting your password via the \'Forgot Password\' link?\n\nUser: Yes, I clicked on that. It sent me an email with a reset link, but when I tried to reset my password, I got an error message saying "Invalid request".\n\nAgent: Okay, I can\'t access your account to check directly. Could you please provide me with your account ID or email address associated with the account?\n\nUser: My email is user@example.com. Account ID is 123456789.\n\nAgent: Thank you. I\'m looking into this. It seems there might be an issue with the account lockout mechanism or the password reset process. I\'l

Now let's print the masked out example - you should see only the answer is present:

In [None]:
# Now let's print the masked out example - you should see only the assistant response
if len(trainer.train_dataset) > 100:
    sample_idx = 100
else:
    sample_idx = 0

if "labels" in trainer.train_dataset[sample_idx]:
    masked_labels = [tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[sample_idx]["labels"]]
    decoded = tokenizer.decode(masked_labels)
    if tokenizer.pad_token:
        decoded = decoded.replace(tokenizer.pad_token, " ")
    print("Masked output (only assistant response should be visible):")
    print(decoded)
else:
    print("Labels field not found. The masking will be applied during training.")

"                                                                                                                                                                                                                                                 #### Analysis\nThe user is experiencing account lockout due to multiple failed login attempts, and standard troubleshooting steps like password reset and security code entry are failing, indicating a potential issue with the account lockout mechanism or password recovery system.\n\n#### Recommendation\n- Step 1: Attempt to reset the password using the 'Forgot Password' link and provide the error details received.\n- Step 2: Contact support with the account ID/email and request manual account unlock and investigation.\n- *Next best action for the agent*: Instruct the user to contact support immediately, providing their account details for manual intervention and further investigation.<|end_of_text|>\n"

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.059 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

```
Note: you might have to wait ~10 minutes for the Mamba kernels to compile! Please be patient!
```

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 504 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 1,703,936 of 3,193,100,032 (0.05% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.8932
2,1.1937
3,1.3853
4,1.9493
5,1.9097
6,2.0669
7,1.9231
8,2.0533
9,1.5033
10,1.3257


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

954.0727 seconds used for training.
15.9 minutes used for training.
Peak reserved memory = 10.42 GB.
Peak reserved memory for training = 4.361 GB.
Peak reserved memory % of max memory = 70.687 %.
Peak reserved memory for training % of max memory = 29.584 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! We'll use some example snippets not contained in our training data to get a sense of what was learned.

### Fast Inference with vLLM

⚡ **IMPORTANT**: Instead of slow one-by-one predictions (30+ hours), we'll use vLLM for fast batched inference!

**Steps:**
1. ✅ Train the model (done above)
2. ✅ Save in vLLM-compatible format (merged 16-bit)
3. 🚀 Use the vLLM script below for fast batched predictions

With vLLM on A100, predictions should take **minutes instead of hours**!

In [None]:
# Create a fast vLLM inference script
vllm_script = '''
import pandas as pd
import numpy as np
import re
from vllm import LLM, SamplingParams
from tqdm import tqdm

# Same text cleaning function
def clean_text_enhanced(text):
    if pd.isnull(text):
        return ""
    
    text = str(text).strip()
    
    # Extract structured information
    item_name = re.search(r"Item Name:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    brand = re.search(r"Brand:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    color = re.search(r"Color:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    size = re.search(r"Size:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    material = re.search(r"Material:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    model = re.search(r"Model:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    
    bp1 = re.search(r"Bullet Point\\s*1:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    bp2 = re.search(r"Bullet Point\\s*2:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    bp3 = re.search(r"Bullet Point\\s*3:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    bp4 = re.search(r"Bullet Point\\s*4:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    bp5 = re.search(r"Bullet Point\\s*5:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    
    value = re.search(r"Value:\\s*([\\d.,]+)", text, re.IGNORECASE)
    unit = re.search(r"Unit:\\s*([A-Za-z]+)", text, re.IGNORECASE)
    description = re.search(r"Description:\\s*(.*?)(?=\\n|$)", text, re.IGNORECASE)
    
    structured_parts = []
    
    if item_name:
        structured_parts.append(f"Item: {item_name.group(1).strip()}")
    if value and unit:
        structured_parts.append(f"Quantity: {value.group(1).strip()} {unit.group(1).strip()}")
    elif value:
        structured_parts.append(f"Value: {value.group(1).strip()}")
    
    if brand:
        structured_parts.append(f"Brand: {brand.group(1).strip()}")
    if color:
        structured_parts.append(f"Color: {color.group(1).strip()}")
    if size:
        structured_parts.append(f"Size: {size.group(1).strip()}")
    if material:
        structured_parts.append(f"Material: {material.group(1).strip()}")
    if model:
        structured_parts.append(f"Model: {model.group(1).strip()}")
    
    if bp1:
        structured_parts.append(f"Feature 1: {bp1.group(1).strip()}")
    if bp2:
        structured_parts.append(f"Feature 2: {bp2.group(1).strip()}")
    if bp3:
        structured_parts.append(f"Feature 3: {bp3.group(1).strip()}")
    if bp4:
        structured_parts.append(f"Feature 4: {bp4.group(1).strip()}")
    if bp5:
        structured_parts.append(f"Feature 5: {bp5.group(1).strip()}")
    
    if description:
        structured_parts.append(f"Description: {description.group(1).strip()}")
    
    cleaned_text = ". ".join(structured_parts)
    
    full_text_cleaned = text.lower()
    full_text_cleaned = re.sub(r\'[^\\w\\s.,:\\-]\', \' \', full_text_cleaned)
    full_text_cleaned = re.sub(r\'\\s+\', \' \', full_text_cleaned)
    full_text_cleaned = full_text_cleaned.strip()
    
    if cleaned_text and full_text_cleaned:
        final_text = f"{cleaned_text}. Full Details: {full_text_cleaned}"
    elif cleaned_text:
        final_text = cleaned_text
    else:
        final_text = full_text_cleaned
    
    return final_text

print("🚀 Loading model with vLLM...")
llm = LLM(
    model="granite_price_predictor_vllm",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    max_model_len=2048,
    gpu_memory_utilization=0.9,
    trust_remote_code=True
)

print("📂 Loading test data...")
test_df = pd.read_csv(\'dataset/test.csv\', encoding=\'latin1\')
print(f"Test data shape: {test_df.shape}")

# Clean text
print("🧹 Cleaning text...")
test_df[\'catalog_content\'] = test_df[\'catalog_content\'].apply(clean_text_enhanced)

# Create prompts
print("📝 Creating prompts...")
prompts = [
    f"<|start_of_role|>user<|end_of_role|>Predict the price for this product: {text}<|end_of_text|>\\n<|start_of_role|>assistant<|end_of_role|>"
    for text in test_df[\'catalog_content\']
]

# Low-temperature sampling for near-deterministic output (temperature=0.0 gives greedy decoding)
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.95,
    max_tokens=64,
    stop=["<|end_of_text|>", "\\n\\n"]
)

print(f"\\n⚡ Generating predictions for {len(prompts)} samples with vLLM...")
print("This should be MUCH faster than one-by-one generation!\\n")

# Batch inference - THIS IS THE KEY!
outputs = llm.generate(prompts, sampling_params)

# Extract prices
print("💰 Extracting prices from predictions...")
all_predictions = []

for i, output in enumerate(tqdm(outputs, desc="Processing outputs")):
    predicted_text = output.outputs[0].text
    
    # Extract price from text
    price_match = re.search(r\'\\$(\\d+\\.?\\d*)|price is (\\d+\\.?\\d*)\', predicted_text, re.IGNORECASE)
    
    if price_match:
        price = float(price_match.group(1) or price_match.group(2))
    else:
        # Fallback
        price = 50.0
    
    all_predictions.append(price)

# Create submission
print("\\n💾 Creating submission file...")
submission = pd.DataFrame({
    \'sample_id\': test_df[\'sample_id\'],
    \'price\': all_predictions
})

submission.to_csv(\'submission_granite_vllm.csv\', index=False)

print(f"\\n✅ Submission saved to submission_granite_vllm.csv")
print(f"Shape: {submission.shape}")
print(f"\\nPrice statistics:")
print(submission[\'price\'].describe())
print(f"\\n🎉 Done! Predictions completed in minutes instead of hours!")
'''

# Save the script
with open('vllm_inference.py', 'w') as f:
    f.write(vllm_script)

print("✅ vLLM inference script saved to 'vllm_inference.py'")
print("\n📋 To run fast inference:")
print("1. First, complete training and model saving (cells above)")
print("2. Install vLLM: pip install vllm")
print("3. Run: python vllm_inference.py")
print("\n⚡ This will generate predictions in MINUTES instead of 30+ hours!")
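The price-parsing step in the script can be isolated into a small function that is easy to check on its own. This is a sketch under the assumption that the model answers in the trained format ("The predicted price is $XX.XX"); `extract_price` and `PRICE_PATTERN` are hypothetical names, and the pattern is a slight variant of the one in the script.

```python
import re

# Match "$12.99" or "price is 12.99"; whichever appears first in the text wins.
PRICE_PATTERN = re.compile(
    r"\$\s*(\d+(?:\.\d+)?)|price is\s*(\d+(?:\.\d+)?)", re.IGNORECASE
)

def extract_price(generated_text, fallback=None):
    """Return the first price found in the text as a float, else the fallback."""
    match = PRICE_PATTERN.search(generated_text)
    if match is None:
        return fallback
    return float(match.group(1) or match.group(2))

print(extract_price("The predicted price is $12.99"))
print(extract_price("no number here", fallback=50.0))
```

Returning `None` by default, rather than a fixed constant, makes it possible to count how many generations failed to parse before choosing a fallback value.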

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# Save LoRA adapters first (lightweight backup)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
print("✅ LoRA adapters saved to 'lora_model/'")

# IMPORTANT: Merge and save to 16-bit for vLLM inference
print("\n🔄 Merging LoRA weights and saving for vLLM...")
print("This may take a few minutes...")
model.save_pretrained_merged("granite_price_predictor_vllm", tokenizer, save_method="merged_16bit")
print("✅ Model saved in vLLM-compatible format to 'granite_price_predictor_vllm/'")
print("\n⚡ Ready for fast batched inference with vLLM!")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

## 🚀 Fast Inference Strategy

**Problem**: One-by-one predictions take 30+ hours for 75,000 samples ⏰

**Solution**: Use vLLM for batched inference - takes only **5-10 minutes** on A100! ⚡

### Plan:
1. ✅ Save model in vLLM-compatible format (merged 16-bit)
2. ⚡ Use vLLM to batch process ALL test samples at once
3. 💾 Generate submission in minutes instead of hours

The merged 16-bit model was already saved in the cell above, so the remaining steps are installing vLLM and running inference.

### Install vLLM

Now let's install vLLM for fast inference. This only needs to be done once.

In [None]:
# Install vLLM for fast batched inference
!pip install vllm -q
print("✅ vLLM installed successfully!")

### 🚀 Fast Batched Inference with vLLM

Now we'll use vLLM to process ALL 75,000 test samples in one go!

**Why vLLM is fast:**
- ⚡ Batched processing (not one-by-one)
- 🧠 PagedAttention for efficient memory
- 🔥 Optimized CUDA kernels
- 📦 Continuous batching

**Expected time on A100:** 5-10 minutes for full test set!
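To put the quoted figures side by side, here is a small back-of-the-envelope calculation. The 30-hour and 10-minute numbers are this notebook's own estimates, not measurements, and `throughput` is a hypothetical helper.

```python
def throughput(num_samples, seconds):
    """Samples processed per second."""
    return num_samples / seconds

num_samples = 75_000
sequential_seconds = 30 * 3600   # the notebook's ~30-hour one-by-one estimate
batched_seconds = 10 * 60        # upper end of the quoted 5-10 minute range

print(f"one-by-one: {sequential_seconds / num_samples:.2f} s per sample")
print(f"batched:    {throughput(num_samples, batched_seconds):.0f} samples/s")
print(f"speedup:    {sequential_seconds / batched_seconds:.0f}x")
```

Under those assumptions the one-by-one loop spends about 1.44 s per sample, while the batched run would need to sustain around 125 samples per second, a 180x speedup.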

In [None]:
from vllm import LLM, SamplingParams
import pandas as pd
import numpy as np
import re
from tqdm.auto import tqdm
import time

print("🚀 FAST BATCHED INFERENCE WITH vLLM")
print("="*60)

# Load model with vLLM
print("\n📦 Loading model with vLLM...")
print("This will take a minute to initialize...\n")

llm = LLM(
    model="granite_price_predictor_vllm",
    tensor_parallel_size=1,  # Use 1 GPU, increase if you have multiple
    max_model_len=2048,
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    trust_remote_code=True,
    dtype="float16"
)

print("✅ Model loaded successfully!\n")

In [None]:
# Load test data
print("📂 Loading test data...")
test_df = pd.read_csv('dataset/test.csv', encoding='latin1')
print(f"   Test samples: {len(test_df):,}")

# Apply same text cleaning
print("\n🧹 Cleaning text...")
test_df['catalog_content_cleaned'] = test_df['catalog_content'].apply(clean_text_enhanced)

# Create prompts in Granite format
print("\n📝 Creating prompts...")
prompts = [
    f"<|start_of_role|>user<|end_of_role|>Predict the price for this product: {text}<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>"
    for text in test_df['catalog_content_cleaned']
]

print(f"   Created {len(prompts):,} prompts")
print("\n✅ Data prepared for inference")
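
The inference prompt has to match the format the model saw during fine-tuning, so it can help to keep the template in one place instead of inside a list comprehension. The sketch below wraps the same template string used in the cell above in a small helper; `GRANITE_PROMPT_TEMPLATE` and `build_price_prompt` are hypothetical names introduced here for illustration, not part of Unsloth or vLLM.

```python
# Template copied from the prompt-building cell; the names are hypothetical.
GRANITE_PROMPT_TEMPLATE = (
    "<|start_of_role|>user<|end_of_role|>"
    "Predict the price for this product: {text}<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>"
)


def build_price_prompt(text: str) -> str:
    """Wrap one cleaned catalog entry in the Granite-style role markers."""
    return GRANITE_PROMPT_TEMPLATE.format(text=text)


example_prompt = build_price_prompt("Stainless steel water bottle, 750 ml")
print(example_prompt)
```

With this helper, the list of prompts could be built as `[build_price_prompt(t) for t in test_df['catalog_content_cleaned']]`, and the same function could be reused wherever training examples are formatted, assuming training used this exact template.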

In [None]:
# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.1,  # Low temperature for consistent outputs
    top_p=0.95,
    max_tokens=64,    # Enough for "The predicted price is $XX.XX"
    stop=["<|end_of_text|>", "\n\n"]  # Stop tokens
)

print("\n⚡ RUNNING BATCHED INFERENCE WITH vLLM")
print("="*60)
print(f"Processing {len(prompts):,} samples...\n")

start_time = time.time()

# THE KEY: Batched generation - processes ALL prompts efficiently!
outputs = llm.generate(prompts, sampling_params, use_tqdm=True)

end_time = time.time()
total_time = end_time - start_time

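print("\n✅ Inference complete!")
print(f"   Total time: {total_time/60:.1f} minutes")
print(f"   Speed: {len(prompts)/total_time:.1f} samples/second")
# total_time is in seconds, so the 30-hour one-by-one estimate must be converted to seconds too
print(f"\n🎉 That's roughly {30*3600/total_time:.0f}x faster than the 30-hour one-by-one estimate!")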
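
Passing every prompt to a single `llm.generate` call is the simplest option, but if the session is interrupted partway through, nothing is kept. One possible alternative is to feed the prompts in chunks so partial results can be saved between calls. The sketch below shows only the chunking logic: `iter_chunks`, `generate_in_chunks`, and the default chunk size are hypothetical, and `generate_fn` stands in for a call such as `lambda chunk: llm.generate(chunk, sampling_params)`.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(items: Sequence, chunk_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def generate_in_chunks(
    prompts: Sequence[str],
    generate_fn: Callable[[Sequence[str]], List],
    chunk_size: int = 5000,  # arbitrary illustrative value
) -> List:
    """Call `generate_fn` on each chunk and concatenate the results in order."""
    results: List = []
    for chunk in iter_chunks(prompts, chunk_size):
        results.extend(generate_fn(chunk))
        # A real run could write `results` to disk here as a checkpoint.
    return results


# Stand-in generator so the sketch runs without a GPU or vLLM installed.
demo_outputs = generate_in_chunks(
    ["a", "b", "c", "d", "e"],
    generate_fn=lambda chunk: [p.upper() for p in chunk],
    chunk_size=2,
)
print(demo_outputs)
```

Each chunk is still batched internally by vLLM, so the trade-off is a little scheduling overhead in exchange for being able to resume from the last completed chunk.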

In [None]:
# Extract prices from outputs
print("\n💰 Extracting prices from predictions...")
all_predictions = []

# Matches "$XX.XX" or "price is XX.XX" (which also covers "predicted price is XX.XX").
# Thousands separators such as "1,299.99" are accepted and stripped before conversion.
price_pattern = re.compile(r'\$\s*(\d[\d,]*\.?\d*)|price is\s*(\d[\d,]*\.?\d*)', re.IGNORECASE)

# Hard-coded default used when parsing fails; ideally replace it with the training-set median
FALLBACK_PRICE = 50.0

for output in tqdm(outputs, desc="Processing outputs"):
    predicted_text = output.outputs[0].text

    price_match = price_pattern.search(predicted_text)

    if price_match:
        # Get the first non-None group
        raw_price = next(g for g in price_match.groups() if g is not None)
        price = float(raw_price.replace(',', ''))
    else:
        price = FALLBACK_PRICE

    # Ensure reasonable price range
    price = float(np.clip(price, 0.01, 10000.0))
    all_predictions.append(price)

print(f"✅ Extracted {len(all_predictions):,} prices")
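
Because unparseable outputs silently receive the fallback price, it is useful to know how often that happens. The sketch below pulls the parsing step into a standalone function that returns `None` on failure, so the failure rate can be measured before any default is applied. The function name and the sample strings are made up for illustration, and accepting thousands separators is an assumption about what the model might emit.

```python
import re
from typing import Optional

# "$12.99" or "price is 12.99"; commas inside the number are tolerated.
PRICE_PATTERN = re.compile(
    r"\$\s*(\d[\d,]*\.?\d*)|price is\s*(\d[\d,]*\.?\d*)", re.IGNORECASE
)


def extract_price(text: str) -> Optional[float]:
    """Return the first price found in `text`, or None if nothing parses."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    raw = next(g for g in match.groups() if g is not None)
    return float(raw.replace(",", ""))


# Made-up model outputs, for illustration only.
samples = [
    "The predicted price is $12.99",
    "price is 7",
    "Roughly $1,299.50 for this item",
    "I cannot tell",
]
parsed = [extract_price(s) for s in samples]
fallback_rate = sum(p is None for p in parsed) / len(parsed)
print(parsed, f"fallback rate: {fallback_rate:.0%}")
```

A high fallback rate on the real outputs would suggest adjusting the prompt, the stop strings, or the pattern rather than relying on the default value.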

In [None]:
# Create submission DataFrame
print("\n📊 Creating submission DataFrame...")
submission = pd.DataFrame({
    'sample_id': test_df['sample_id'],
    'price': all_predictions
})

# Save submission
submission_file = 'submission_granite_vllm.csv'
submission.to_csv(submission_file, index=False)

print(f"\n✅ Submission saved to: {submission_file}")
print(f"   Shape: {submission.shape}")
print("\n📈 Price Statistics:")
print(submission['price'].describe())

print("\n" + "="*60)
print("🎉 FAST INFERENCE COMPLETE!")
print("="*60)
print(f"Generated {len(submission):,} predictions in {total_time/60:.1f} minutes")
print(f"Average: {total_time/len(submission):.3f} seconds per sample")
print("\n🚀 Ready for submission!")
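
Before uploading, a few structural checks on the submission can catch mistakes that the summary statistics would not show. The sketch below works on plain Python lists so it runs without pandas; `find_submission_problems` is a hypothetical helper, and the rule that prices must be finite and positive is an assumption based on the clipping range used above.

```python
import math
from typing import List, Sequence


def find_submission_problems(
    sample_ids: Sequence, prices: Sequence[float], expected_ids: Sequence
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if len(sample_ids) != len(prices):
        problems.append("sample_id and price columns differ in length")
    if len(set(sample_ids)) != len(sample_ids):
        problems.append("duplicate sample_id values")
    if set(sample_ids) != set(expected_ids):
        problems.append("sample_id values do not match the test set")
    if any(not math.isfinite(p) or p <= 0 for p in prices):
        problems.append("non-finite or non-positive prices")
    return problems


# Tiny made-up example: one NaN price should be reported.
print(find_submission_problems([101, 102, 103], [9.99, float("nan"), 20.0], [101, 102, 103]))
```

In the notebook this could be called as `find_submission_problems(submission['sample_id'].tolist(), submission['price'].tolist(), test_df['sample_id'].tolist())`.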

### 🔍 Optional: Inspect Sample Predictions

Let's look at a few predictions to verify they make sense:

In [None]:
# Show 5 random samples
import random

print("Sample Predictions:\n" + "="*80)

for i in random.sample(range(len(test_df)), min(5, len(test_df))):
    print(f"\nSample ID: {test_df.iloc[i]['sample_id']}")
    print(f"Catalog (first 150 chars): {test_df.iloc[i]['catalog_content'][:150]}...")
    print(f"Cleaned text (first 150 chars): {test_df.iloc[i]['catalog_content_cleaned'][:150]}...")
    print(f"Model output: {outputs[i].outputs[0].text}")
    print(f"Extracted price: ${all_predictions[i]:.2f}")
    print("-"*80)

---

## 🛑 STOP HERE FOR FAST INFERENCE! 🛑

✅ Your model is now saved and ready for fast inference!

### Next Steps:

**DO NOT run one-by-one predictions in this notebook** (it will take 30+ hours!)

**Instead:**

1. ✅ Close this notebook (training is complete)
2. 🚀 Run the fast vLLM inference script:
   ```bash
   python vllm_fast_inference.py
   ```
3. ⏱️ Get predictions in **5-10 minutes** instead of 30+ hours!

**Read the guide:** `VLLM_INFERENCE_GUIDE.md`

---

If you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other quantization methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

Likewise, if you want to push the GGUF files to your Hugging Face account instead, change `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
