To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation and Model Setup

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # Chosen from the list above
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
print("Model and tokenizer loaded.")

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
print("PEFT model configured.")
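To make the size of the LoRA update concrete, here is a minimal sketch that counts the adapter weights added for a given rank `r`. The layer shapes are assumptions taken to match Llama-3.2-3B (they are not stated in this notebook, so check `model.config` for the real values), and `lora_trainable_params` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Assumed Llama-3.2-3B shapes (not stated in this notebook); verify against model.config.
HIDDEN = 3072          # hidden_size
INTERMEDIATE = 8192    # intermediate_size of the MLP
KV_DIM = 1024          # num_key_value_heads * head_dim (assumed 8 * 128)
NUM_LAYERS = 28        # num_hidden_layers

# (in_features, out_features) of each projection targeted by LoRA above
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(rank: int) -> int:
    """Count LoRA weights: each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for rank in (8, 16, 32):
    print(f"r = {rank:>2}: {lora_trainable_params(rank):,} trainable LoRA parameters")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter size while the frozen base weights stay untouched.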

<a name="Data"></a>
### Data Prep for Geographic Reasoning
We use the `Llama-3.1` format for conversation style finetunes. 
The data below is structured for geographic reasoning tasks, including a question, specific location coordinates, chain-of-thought (CoT) steps with their own locations, and a final answer.
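Before formatting, it can help to sanity-check each record. The sketch below is a hypothetical validator (the names `check_coords` and `validate_entry` are not part of any library) that assumes the field names used in this notebook and standard latitude/longitude ranges.

```python
from typing import Any, Dict

def check_coords(loc: Dict[str, float]) -> None:
    """Raise ValueError if a latitude/longitude pair is outside its valid range."""
    if not -90.0 <= loc["latitude"] <= 90.0:
        raise ValueError(f"latitude out of range: {loc['latitude']}")
    if not -180.0 <= loc["longitude"] <= 180.0:
        raise ValueError(f"longitude out of range: {loc['longitude']}")

def validate_entry(entry: Dict[str, Any]) -> int:
    """Validate one record and return how many coordinate pairs were checked."""
    for key in ("question", "location", "cot_steps", "answer"):
        if key not in entry:
            raise ValueError(f"missing field: {key}")
    check_coords(entry["location"])
    checked = 1
    for step in entry["cot_steps"]:
        if not step["step"].strip():
            raise ValueError("empty reasoning step")
        for loc in step["locations"]:
            check_coords(loc)
            checked += 1
    return checked

# Illustrative record in the same shape as the training data
example = {
    "question": "Where should emergency response teams be pre-positioned?",
    "location": {"latitude": 35.6895, "longitude": 139.6917},
    "cot_steps": [
        {"step": "Identify Tokyo as a logistics hub.",
         "locations": [{"latitude": 35.6895, "longitude": 139.6917}]},
    ],
    "answer": "Tokyo, Japan.",
}
print(validate_entry(example), "coordinate pairs checked")
```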

Llama-3 renders multi turn conversations like below:
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
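The layout above can be reproduced by hand, which is useful for understanding what the chat template does. This is a minimal sketch with a hypothetical `render_llama3` helper; it ignores the default system message and other details that the real template adds.

```python
from typing import Dict, List

def render_llama3(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages in the Llama-3 header layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(conversation))
```

In practice, prefer `tokenizer.apply_chat_template`, which applies the template shipped with the tokenizer.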

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import get_chat_template, train_on_responses_only, standardize_sharegpt
from datasets import Dataset
import json
import torch # Already imported, but good for explicitness

# Ensure tokenizer has the correct chat template for Llama-3.1
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1", # Ensure this is the correct template
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    # add_generation_prompt = False because we are training, not inferencing
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

geo_reasoning_data = [
    {
        "question": "Where should emergency response teams be pre-positioned for disaster relief?",
        "location": {"latitude": 35.6895, "longitude": 139.6917},
        "cot_steps": [
            {"step": "Identify Tokyo, Japan, as a central logistics hub.", "locations": [{"latitude": 35.6895, "longitude": 139.6917}]},
            {"step": "Assess vulnerability—high seismic activity necessitates rapid response hubs.", "locations": [{"latitude": 35.6895, "longitude": 139.6917}]},
            {"step": "Consider logistics—proximity to major transportation routes ensures accessibility.", "locations": [{"latitude": 35.6895, "longitude": 139.6917}]},
            {"step": "Conclude—Tokyo is optimal for disaster response staging.", "locations": [{"latitude": 35.6895, "longitude": 139.6917}]}
        ],
        "answer": "Tokyo, Japan, due to its centralized logistics and disaster response capabilities."
    },
    {
        "question": "How does climate change impact alpine biodiversity?",
        "location": {"latitude": 46.6207, "longitude": 9.6719},
        "cot_steps": [
            {"step": "Locate the Swiss Alps, a high-altitude region.", "locations": [{"latitude": 46.6207, "longitude": 9.6719}]},
            {"step": "Evaluate biodiversity—unique species adapted to cold climates.", "locations": [{"latitude": 46.6207, "longitude": 9.6719}, {"latitude": 46.8182, "longitude": 8.2275}]},
            {"step": "Analyze climate change impact—rising temperatures shift habitats upward.", "locations": [{"latitude": 46.6207, "longitude": 9.6719}]},
            {"step": "Conclude—biodiversity loss accelerates without conservation efforts.", "locations": [{"latitude": 46.6207, "longitude": 9.6719}]}
        ],
        "answer": "Biodiversity loss in the Swiss Alps accelerates without conservation efforts."
    }
]

# Reformat the geo_reasoning_data into a list of conversations suitable for the model
formatted_conversations_for_dataset = []
for entry in geo_reasoning_data:
    conversation_turns = []
    conversation_turns.append({"role": "system", "content": "You are a geographic reasoning assistant."})
    conversation_turns.append({"role": "user", "content": f"Analyze: {entry['question']}"})

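    # Build the reasoning text; coordinates are formatted separately so no quote
    # character is reused inside an f-string expression (a SyntaxError before Python 3.12)
    reasoning_lines = []
    for i, step in enumerate(entry["cot_steps"]):
        coords = ", ".join(f"({loc['latitude']}, {loc['longitude']})" for loc in step["locations"])
        reasoning_lines.append(f"Step {i+1}: {step['step']}\nLocations: {coords}")
    reasoning_steps_text = "\n".join(reasoning_lines)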
    
    assistant_response = f"<reasoning>\n{reasoning_steps_text}\n</reasoning>\n\nAnswer: {entry['answer']}"
    conversation_turns.append({"role": "assistant", "content": assistant_response})
    formatted_conversations_for_dataset.append({"conversations": conversation_turns})

dataset = Dataset.from_list(formatted_conversations_for_dataset)
dataset = dataset.map(formatting_prompts_func, batched=True,)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length, # Use the globally defined max_seq_length
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # For demonstration; adjust for full training
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Change to wandb or tensorboard if needed
    ),
)

# Compute the loss on assistant responses only: system and user tokens get label -100
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Inspect tokenization of the first example
if len(trainer.train_dataset) > 0:
    print("--- Tokenized Input IDs (Example 0) ---")
    print(tokenizer.decode(trainer.train_dataset[0]["input_ids"]))
    print("--- Tokenized Labels (Example 0) ---")
    space = tokenizer(" ", add_special_tokens=False).input_ids[0]
    print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]]))
else:
    print("Training dataset is empty.")

# Train the model
print("Starting training...")
trainer_stats = trainer.train()
print("Training finished.")

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

We verify masking is actually done:

In [None]:
# The demo dataset above has only two examples, so fall back to the last one when index 5 does not exist
if len(trainer.train_dataset) > 0:
    idx = min(5, len(trainer.train_dataset) - 1)
    print(f"--- Tokenized Input IDs (Example {idx}) ---")
    print(tokenizer.decode(trainer.train_dataset[idx]["input_ids"]))
    print(f"--- Tokenized Labels (Example {idx}) ---")
    space = tokenizer(" ", add_special_tokens = False).input_ids[0]
    print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[idx]["labels"]]))
else:
    print("Training dataset is empty, cannot check masking.")

We can see the System and Instruction prompts are successfully masked!
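The masking idea can be illustrated on plain token lists. The sketch below is not Unsloth's implementation; it is a simplified, hypothetical `mask_to_responses` that keeps labels only for tokens following a response marker and sets everything else to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The token ids are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def find_all(seq: List[int], pattern: List[int]) -> List[int]:
    """Return every start index at which pattern occurs in seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_to_responses(input_ids: List[int], instruction_ids: List[int],
                      response_ids: List[int]) -> List[int]:
    """Keep labels only between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Made-up ids: [1, 2] stands for the user header, [1, 3] for the assistant header
USER, ASSISTANT = [1, 2], [1, 3]
ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9, 1, 2, 12, 9, 1, 3, 22, 9]
print(mask_to_responses(ids, USER, ASSISTANT))
```

Only the tokens after each assistant header keep their ids, so the loss is driven by the assistant's replies alone.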

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
# @title Show final memory and time stats
if 'trainer_stats' in locals() and hasattr(trainer_stats, 'metrics') and 'train_runtime' in trainer_stats.metrics:
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']:.4f} seconds used for training.")
    print(
        f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
    )
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
else:
    print("Trainer stats not available. Did training complete successfully?")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message in `messages` below.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
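As a rough illustration of what these two settings do, the sketch below applies temperature scaling and then a min-p cutoff to a toy list of logits. It is a simplified, hypothetical `min_p_filter` in plain Python, assuming temperature is applied before the cutoff; it is not the `transformers` implementation.

```python
import math
from typing import List

def min_p_filter(logits: List[float], temperature: float = 1.5,
                 min_p: float = 0.1) -> List[float]:
    """Return renormalized probabilities after temperature scaling and a min-p cutoff."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)                # cutoff is relative to the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

toy_logits = [2.0, 1.0, 0.0, -3.0]
print([round(p, 3) for p in min_p_filter(toy_logits)])
```

A higher temperature flattens the distribution, while the min-p cutoff removes tokens that are far less likely than the best candidate, which is why the two are often used together.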

In [None]:
# from unsloth.chat_templates import get_chat_template # Already imported

# Re-apply chat template if it was modified by training dataset processing
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1", # Ensure this is the correct template for inference
    map_eos_token = True, # Ensure <|eot_id|> is correctly mapped for generation
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the transgressive sequence: Ocean, "},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
print(tokenizer.batch_decode(outputs))

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Whats a place where they love carnival but not in europe "},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

### GeoNut Classes and Usage

In [None]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tqdm.auto import tqdm
import random
import math
import json
from typing import List, Dict, Tuple, Union, Optional, Any
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPTokenizer
# Assuming geoclip.py is in the same directory or installed
from geoclip import GeoCLIP, LocationEncoder 

class GeoNut:
    """
    GeoNut: Geographic neural reasoning with GeoCLIP-enhanced LLM
    Combines GeoCLIP's location embeddings with LLM for geographic reasoning
    """
    def __init__(
        self,
        llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # Using the loaded model
        projector_path: Optional[str] = None,
        device: Optional[str] = None,
        use_fp16: bool = True,
        cache_dir: Optional[str] = None,
        # Pass existing model and tokenizer to avoid reloading
        existing_llm_model = None,
        existing_llm_tokenizer = None
    ):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        print("Loading GeoCLIP model...")
        self.geoclip = GeoCLIP().to(self.device)
        self.clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", cache_dir=cache_dir)
        print("GeoCLIP loaded successfully")

        if existing_llm_model and existing_llm_tokenizer:
            print("Using existing LLM model and tokenizer.")
            self.llm = existing_llm_model
            self.llm_tokenizer = existing_llm_tokenizer
        else:
            print(f"Loading LLM: {llm_model_id}")
            dtype_llm = torch.float16 if use_fp16 and "cuda" in self.device else torch.float32
            self.llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_id, cache_dir=cache_dir)
            model_kwargs = {"device_map": "auto", "torch_dtype": dtype_llm,}
            if "cuda" in self.device:
                try:
                    import flash_attn
                    model_kwargs["attn_implementation"] = "flash_attention_2"
                    print("Using Flash Attention 2 for GeoNut's LLM")
                except ImportError:
                    print("Flash Attention not available for GeoNut's LLM, using default attention")
            self.llm = AutoModelForCausalLM.from_pretrained(llm_model_id, cache_dir=cache_dir, **model_kwargs)
            print("LLM for GeoNut loaded successfully")

        if self.llm_tokenizer.pad_token is None:
            self.llm_tokenizer.pad_token = self.llm_tokenizer.eos_token

        self.llm_dim = self.llm.config.hidden_size
        self.projection = nn.Linear(512, self.llm_dim).to(self.device)
        nn.init.normal_(self.projection.weight, std=0.02)
        nn.init.zeros_(self.projection.bias)

        if projector_path and os.path.exists(projector_path):
            print(f"Loading projector weights from {projector_path}")
            self.projection.load_state_dict(torch.load(projector_path, map_location=self.device))

        self.geo_system_prompt = """You are GeoNut, an advanced geographic reasoning system...""" # Truncated for brevity
        self.enable_visualizations = False
        self._location_embedding_cache = {}

    # ... (rest of GeoNut methods from cell 21)
    def enable_visualization(self, enable: bool = True):
        self.enable_visualizations = enable
        return self

    @torch.no_grad()
    def encode_location(self, coords: Tuple[float, float]) -> torch.Tensor:
        cache_key = f"{coords[0]:.5f},{coords[1]:.5f}"
        if cache_key in self._location_embedding_cache:
            return self._location_embedding_cache[cache_key]
        coords_tensor = torch.tensor([[coords[0], coords[1]]], dtype=torch.float32).to(self.device)
        embedding = self.geoclip.location_encoder(coords_tensor)
        embedding = F.normalize(embedding, p=2, dim=1)
        self._location_embedding_cache[cache_key] = embedding
        return embedding

    @torch.no_grad()
    def encode_text(self, text: str) -> torch.Tensor:
        inputs = self.clip_tokenizer(text, return_tensors="pt", padding=True).to(self.device)
        text_features = self.geoclip.image_encoder.mlp(
            self.geoclip.image_encoder.CLIP.get_text_features(**inputs)
        )
        text_features = F.normalize(text_features, p=2, dim=1)
        return text_features

    def inject_geographic_knowledge(self, text_query: str, coords_list: Optional[List[Tuple[float, float]]] = None, lambda_factor: float = 0.1, injection_layers: Optional[List[int]] = None) -> List:
        hooks = []
        text_embedding = self.encode_text(text_query)
        location_embeddings = []
        if coords_list:
            for coords in coords_list:
                embedding = self.encode_location(coords)
                location_embeddings.append(embedding)
        if location_embeddings:
            locations_combined = torch.cat(location_embeddings, dim=0)
            locations_avg = torch.mean(locations_combined, dim=0, keepdim=True)
            combined_embedding = (text_embedding + locations_avg) / 2
        else:
            combined_embedding = text_embedding
        projected_embedding = self.projection(combined_embedding)
        projected_embedding = lambda_factor * projected_embedding
        num_layers = len(self.llm.model.layers)
        if injection_layers is None:
            injection_layers = [num_layers-i-1 for i in range(min(3, num_layers))] # ensure not to exceed layer count
        for layer_idx in injection_layers:
            if layer_idx < 0 or layer_idx >= num_layers:
                print(f"Warning: Layer index {layer_idx} out of bounds, skipping")
                continue
            layer = self.llm.model.layers[layer_idx]
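            def add_geo_vector(mod, inp, out, vec=projected_embedding):
                # Cast to the layer's device and dtype so fp16/bf16 hidden states are not upcast
                hidden = out[0] if isinstance(out, tuple) else out
                hidden = hidden + vec.to(device=hidden.device, dtype=hidden.dtype)
                return (hidden, *out[1:]) if isinstance(out, tuple) else hidden
            hook = layer.register_forward_hook(add_geo_vector)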
            hooks.append(hook)
        return hooks

    def generate_response(self, messages: List[Dict[str, str]], coords_list: Optional[List[Tuple[float, float]]] = None, lambda_factor: float = 0.1, max_new_tokens: int = 1024, temperature: float = 0.7, top_p: float = 0.9, injection_layers: Optional[List[int]] = None,) -> str:
        system_content = self.geo_system_prompt
        current_messages = messages
        if messages and messages[0]["role"] == "system":
            system_content = messages[0]["content"]
            current_messages = messages[1:]
        
        templated_input_messages = [{"role": "system", "content": system_content}] + current_messages
        # add_generation_prompt=True appends the assistant header so the model starts a new reply
        inputs = self.llm_tokenizer.apply_chat_template(
            templated_input_messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.device)

        user_queries = [msg["content"] for msg in current_messages if msg["role"] == "user"]
        user_query = user_queries[-1] if user_queries else None
        hooks = []
        if user_query:
            hooks = self.inject_geographic_knowledge(text_query=user_query, coords_list=coords_list, lambda_factor=lambda_factor, injection_layers=injection_layers)
        try:
            with torch.no_grad():
                # apply_chat_template returns a tensor of token ids here, so pass it as input_ids
                output = self.llm.generate(input_ids=inputs, max_new_tokens=max_new_tokens, temperature=temperature, top_p=top_p, do_sample=True)
        finally:
            # Always remove the hooks, even if generation fails
            for hook in hooks:
                hook.remove()

        # Decode only the newly generated tokens, i.e. everything after the prompt
        generated_tokens = output[0][inputs.shape[-1]:]
        return self.llm_tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    
    @torch.no_grad()
    def get_nearest_locations(self, query_text: str, top_k: int = 5, visualize: bool = False) -> List[Tuple[Tuple[float, float], float]]:
        text_embedding = self.encode_text(query_text)
        # Ensure gps_gallery is a tensor and on the correct device
        if not hasattr(self.geoclip, 'gps_gallery') or self.geoclip.gps_gallery is None:
            print("Error: GeoCLIP GPS gallery not available.")
            return []
        gps_gallery_tensor = self.geoclip.gps_gallery.to(self.device)
        loc_features = self.geoclip.location_encoder(gps_gallery_tensor)
        loc_features = F.normalize(loc_features, p=2, dim=1)
        similarity = self.geoclip.logit_scale.exp() * (text_embedding @ loc_features.T)
        probs = similarity.softmax(dim=-1)
        top_preds = torch.topk(probs[0], top_k)
        results = [((float(coords[0]), float(coords[1])), float(conf)) for coords, conf in zip(gps_gallery_tensor[top_preds.indices], top_preds.values)]
        if visualize and self.enable_visualizations:
            self._visualize_locations(query_text, results)
        return results
    
    # ... (Other GeoNut methods like _visualize_locations, extract_location_features, etc. would go here)

class GeoNutTrainer:
    def __init__(self, llm_model_id: str, device: Optional[str] = None, use_fp16: bool = True, cache_dir: Optional[str] = None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        # ... (GeoNutTrainer implementation from cell 21, with gps_gallery fix)
        print("Loading GeoCLIP model for Trainer...")
        self.geoclip = GeoCLIP().to(self.device)
        self.clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", cache_dir=cache_dir)
        print("GeoCLIP for Trainer loaded successfully")
        print(f"Loading LLM for Trainer: {llm_model_id}")
        dtype_llm = torch.float16 if use_fp16 and "cuda" in self.device else torch.float32
        self.llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_id, cache_dir=cache_dir)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_model_id, cache_dir=cache_dir, torch_dtype=dtype_llm, device_map="auto")
        print("LLM for Trainer loaded successfully")
        self.llm_dim = self.llm.config.hidden_size
        self.projection = nn.Linear(512, self.llm_dim).to(self.device)
        nn.init.normal_(self.projection.weight, std=0.02)
        nn.init.zeros_(self.projection.bias)
        self.geo_contexts = ["The Mediterranean climate..."] # Truncated
        self.geo_questions = ["How does climate change impact coastal ecosystems?..."] # Truncated

    def train_projector(self, num_epochs: int = 10, output_path: str = "geonut_projector.pt", **kwargs):
        # (Simplified - actual training logic from cell 21)
        print(f"Pretending to train projector for {num_epochs} epochs.")
        if not hasattr(self.geoclip, 'gps_gallery') or self.geoclip.gps_gallery is None:
            print("Error: GeoCLIP GPS gallery not available for training projector.")
            return
        # Corrected access to gps_gallery
        gps_gallery = self.geoclip.gps_gallery.to(self.device)
        print(f"Using {len(gps_gallery)} GPS points for potential training data.")
        # Save dummy projector for now
        torch.save(self.projection.state_dict(), output_path)
        print(f"Dummy projector saved to {output_path}")

class GeoReasoner:
    def __init__(self, geonut_instance):
        self.geonut = geonut_instance
        # ... (GeoReasoner implementation from cell 21)
        self.reasoning_template = "<reasoning>\n{reasoning_steps}\n</reasoning>"
    def structured_geographic_analysis(self, query: str, max_steps: int = 4, confidence_threshold: float = 0.15):
        print(f"Performing structured analysis for: {query}")
        # (Simplified - actual reasoning logic from cell 21)
        return {"reasoning": "<reasoning>Step 1: Identified New Orleans (29.9511, -90.0715) and Mobile (30.6954, -88.0399)...</reasoning>", "locations": [(29.9511, -90.0715)], "steps": ["Step 1: ..."]}

class GeographicKnowledgeExtractor:
    def __init__(self, geonut_instance):
        self.geonut = geonut_instance
        # ... (GeographicKnowledgeExtractor implementation from cell 21)
    def analyze_embedding_dimensions(self, sample_size=10, top_k=5):
        print(f"Analyzing embedding dimensions with sample_size={sample_size}, top_k={top_k}")
        # (Simplified)
        return {0: {"description": "Likely represents coastal areas...", "top_locations": [(0,0)]}}

def resolve_location_query(geonut_instance, query, use_structured_reasoning=True):
    if use_structured_reasoning:
        reasoner = GeoReasoner(geonut_instance)
        result = reasoner.structured_geographic_analysis(query)
        return result
    else:
        # (Simplified simple lookup)
        locations = geonut_instance.get_nearest_locations(query, top_k=1)
        return {"locations": [loc for loc, score in locations], "explanation": "Simple lookup based on query."}
print("GeoNut related classes and functions defined.")
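The ranking step in `get_nearest_locations` (normalize, take scaled dot products, apply softmax, keep the top `k`) can be sketched without GeoCLIP. The example below uses plain Python lists and made-up two-dimensional embeddings; `rank_locations` is a hypothetical helper, and its `logit_scale` argument stands in for the already exponentiated scale.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float]

def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_locations(query_vec: Sequence[float],
                   gallery: List[Tuple[Coord, Sequence[float]]],
                   top_k: int = 2, logit_scale: float = 1.0) -> List[Tuple[Coord, float]]:
    """Rank gallery coordinates by softmax over scaled cosine similarity to the query."""
    q = l2_normalize(query_vec)
    scores = [logit_scale * sum(a * b for a, b in zip(q, l2_normalize(vec)))
              for _, vec in gallery]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    coords = [coord for coord, _ in gallery]
    ranked = sorted(zip(coords, probs), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Made-up embeddings purely for illustration
toy_gallery = [((0.0, 0.0), [1.0, 0.0]), ((10.0, 10.0), [0.0, 1.0]), ((20.0, 20.0), [0.6, 0.8])]
for coord, prob in rank_locations([1.0, 0.0], toy_gallery):
    print(coord, round(prob, 3))
```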

In [None]:
print("Attempting to use GeoNut...")
# We pass the already loaded model and tokenizer to GeoNut to avoid reloading.
geonut_instance = GeoNut(llm_model_id=None, existing_llm_model=model, existing_llm_tokenizer=tokenizer)
query = "Find a vibrant coastal city with Mardi Gras celebrations, not in Europe."
print(f"Resolving query: {query}")
result = resolve_location_query(geonut_instance, query)
print("--- Reasoning ---  ")
print(result.get("reasoning", "No reasoning provided."))
print("--- Locations ---  ")
print(result.get("locations", "No locations found."))

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
print("LoRA model and tokenizer saved to 'lora_model' directory.")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False: # Set to True to run this cell
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

    messages = [
        {"role": "user", "content": "Describe a tall tower in the capital of France."},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    from transformers import TextStreamer
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                       use_cache = True, temperature = 1.5, min_p = 0.1)

You can also use Hugging Face's `AutoPeftModelForCausalLM` (from the `peft` library). Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False: # Set to True to run this cell
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model_merged_16bit", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("your_hf_username/model_merged_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")

# Merge to 4bit
if False: model.save_pretrained_merged("model_merged_4bit", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("your_hf_username/model_merged_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")

# Just LoRA adapters
if False: model.save_pretrained_merged("model_lora_adapters", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("your_hf_username/model_lora_adapters", tokenizer, save_method = "lora", token = "YOUR_HF_TOKEN")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model_q8_0.gguf", tokenizer,)
if False: model.push_to_hub_gguf("your_hf_username/model_q8_0", tokenizer, token = "YOUR_HF_TOKEN")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model_f16.gguf", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("your_hf_username/model_f16", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model_q4_k_m.gguf", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("your_hf_username/model_q4_k_m", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "your_hf_username/model_gguf_multi", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "YOUR_HF_TOKEN", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

