# 🦙 Fine-tune LLaMA 3.1 with Unsloth

This notebook demonstrates how to fine-tune LLaMA 3.1 8B Instruct to generate structured JSON output, using QLoRA (LoRA adapters trained on top of a 4-bit quantized base model) and Unsloth.

## Prerequisites
- Google Colab with GPU runtime (T4 or better)
- HuggingFace account with access to LLaMA 3.1

In [None]:
# Install dependencies
!pip install -q unsloth
!pip install -q transformers datasets peft accelerate bitsandbytes

In [None]:
# Imports
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 1. Load Base Model with 4-bit Quantization

In [None]:
# Configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("Model loaded successfully!")

==((====))==  Unsloth 2026.1.4: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded successfully!


## 2. Add LoRA Adapters

In [None]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

print("LoRA adapters added!")
model.print_trainable_parameters()

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2026.1.4 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


LoRA adapters added!
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
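
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the publicly documented Llama 3.1 8B shapes (hidden size 4096, 8 key/value heads of dimension 128, MLP width 14336, 32 layers); these numbers are not stated in this notebook. Each LoRA-wrapped linear layer adds two matrices, `A` (rank x in) and `B` (out x rank), so it contributes `rank * (in + out)` weights.

```python
# Assumed Llama 3.1 8B shapes (from the public model config, not from this notebook).
HIDDEN = 4096          # hidden_size
KV_DIM = 1024          # 8 KV heads x 128 head_dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP intermediate_size
NUM_LAYERS = 32
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) of each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Sum rank * (in + out) over every targeted module, then over all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
all_params = 8_072_204_288  # "all params" as printed by print_trainable_parameters()
print(f"trainable params: {trainable:,} ({100 * trainable / all_params:.4f}%)")
```

Doubling `r` doubles the adapter size, while dropping the three MLP projections from `target_modules` would remove roughly two thirds of the trainable weights.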


## 3. Load and Prepare Dataset

In [None]:
# Define prompt template
PROMPT_TEMPLATE = """### Instruction:
You are an AI that converts natural language instructions into structured JSON action plans.
Given the following instruction, output a valid JSON with these fields:
- object: the object to manipulate
- initial_position: where the object currently is
- action: what to do (move, rotate, scale)
- target_position: the destination or target state

### Input:
{instruction}

### Response:
{output}"""

def format_prompts(examples):
    texts = []
    for instruction, output in zip(examples['instruction'], examples['output']):
        text = PROMPT_TEMPLATE.format(
            instruction=instruction,
            output=output
        )
        texts.append(text)
    return {"text": texts}

In [None]:
# Load dataset (from local file or HuggingFace)
# Option 1: Load from local JSONL
dataset = load_dataset('json', data_files='/content/data/train.jsonl')

# Option 2: Create sample dataset for demo
# sample_data = {
#     "instruction": [
#         "Move the red box to the blue platform",
#         "Rotate the green sphere 90 degrees",
#         "Scale the yellow cube to twice its size",
#     ],
#     "output": [
#         '{"object": "red box", "initial_position": "floor", "action": "move", "target_position": "top of blue platform"}',
#         '{"object": "green sphere", "initial_position": "center", "action": "rotate", "target_position": "90 degrees clockwise"}',
#         '{"object": "yellow cube", "initial_position": "origin", "action": "scale", "target_position": "2x original size"}',
#     ]
# }

# from datasets import Dataset, DatasetDict
# dataset = DatasetDict({"train": Dataset.from_dict(sample_data)})
dataset = dataset.map(format_prompts, batched=True)

print(f"Dataset size: {len(dataset['train'])}")
print(f"Sample:\n{dataset['train'][0]['text'][:500]}...")

Dataset size: 800
Sample:
### Instruction:
You are an AI that converts natural language instructions into structured JSON action plans.
Given the following instruction, output a valid JSON with these fields:
- object: the object to manipulate
- initial_position: where the object currently is
- action: what to do (move, rotate, scale)
- target_position: the destination or target state

### Input:
Place orange pyramid on top shelf

### Response:
{"object": "orange pyramid", "initial_position": "bottom shelf", "action": "mo...
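
Because the model learns to copy whatever sits in the `output` column, it can help to check those targets before training. The following is a minimal sketch, assuming each row has the `instruction` and `output` keys used by `format_prompts`; the helper name `find_bad_rows` and the closed set of allowed actions are illustrative assumptions taken from the prompt wording, not part of the original pipeline.

```python
import json

REQUIRED_FIELDS = ("object", "initial_position", "action", "target_position")
ALLOWED_ACTIONS = {"move", "rotate", "scale"}  # assumed from the prompt template


def find_bad_rows(rows):
    """Return (index, reason) for every row whose 'output' is not a usable target."""
    problems = []
    for i, row in enumerate(rows):
        try:
            plan = json.loads(row["output"])
        except json.JSONDecodeError as exc:
            problems.append((i, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(plan, dict):
            problems.append((i, "not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in plan]
        if missing:
            problems.append((i, f"missing fields: {missing}"))
        elif plan["action"] not in ALLOWED_ACTIONS:
            problems.append((i, f"unexpected action: {plan['action']!r}"))
    return problems


# Hand-written rows for illustration only (not taken from train.jsonl).
sample_rows = [
    {"instruction": "Move the red box to the blue platform",
     "output": '{"object": "red box", "initial_position": "floor", '
               '"action": "move", "target_position": "top of blue platform"}'},
    {"instruction": "Spin the cube", "output": '{"object": "cube", "action": "spin"}'},
    {"instruction": "Broken row", "output": '{"object": "cube"'},
]
bad_rows = find_bad_rows(sample_rows)
print(bad_rows)
```

On the real data, the same function could be called with `dataset['train']`, since iterating a Hugging Face dataset yields one dictionary per row.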


## 4. Training Configuration

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    max_steps=100,  # Increase for full training
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    save_steps=50,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
)
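
The batch and schedule settings above interact in a way that is easy to check with plain arithmetic. This sketch uses only the values from `TrainingArguments` and the dataset size of 800 reported earlier; the `linear_lr` function is a simplified stand-in for the linear warmup-then-decay schedule, written here for illustration rather than copied from the library.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
NUM_GPUS = 1
MAX_STEPS = 100
WARMUP_STEPS = 10
BASE_LR = 2e-4
DATASET_SIZE = 800  # size reported when the dataset was loaded

# One optimizer step consumes this many examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Number of passes over the data made by MAX_STEPS optimizer steps.
epochs = MAX_STEPS * effective_batch / DATASET_SIZE


def linear_lr(step):
    """Linear warmup to BASE_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)


print(f"effective batch size: {effective_batch}")
print(f"epochs covered by max_steps: {epochs}")
for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {linear_lr(step):.2e}")
```

With these settings, 100 steps correspond to exactly one pass over the 800 examples, so raising `max_steps` means training for multiple epochs unless the dataset grows as well.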

In [None]:
# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset['train'],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args,
)

## 5. Train the Model

In [None]:
# Start training
trainer_stats = trainer.train()

print("Training complete!")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 800 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

 3


wandb: You chose "Don't visualize my results"
wandb: Using W&B in offline mode.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


Step,Training Loss
10,1.3864
20,0.1801
30,0.1191
40,0.1103
50,0.1055
60,0.102
70,0.0988
80,0.0999
90,0.0978
100,0.099




Metric,History
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇██
train/grad_norm,█▂▁▁▁▁▁▁▁▁
train/learning_rate,██▇▆▅▅▄▃▂▁
train/loss,█▁▁▁▁▁▁▁▁▁

Metric,Value
total_flos,4346724114874368.0
train/epoch,1.0
train/global_step,100.0
train/grad_norm,0.1522
train/learning_rate,0.0
train/loss,0.099
train_loss,0.23988
train_runtime,441.5714
train_samples_per_second,1.812
train_steps_per_second,0.226


Training complete!
Training time: 441.57 seconds
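
The summary metrics above can be cross-checked against the step log. The sketch below copies the logged numbers from this run and assumes that each logged loss is the mean over the preceding 10 steps (`logging_steps=10`), so averaging the ten entries should approximate the overall `train_loss`.

```python
# Values copied from the training log above.
logged_losses = [1.3864, 0.1801, 0.1191, 0.1103, 0.1055,
                 0.1020, 0.0988, 0.0999, 0.0978, 0.0990]
train_runtime = 441.5714   # seconds
total_steps = 100
effective_batch = 8        # 2 per device x 4 accumulation steps x 1 GPU

mean_loss = sum(logged_losses) / len(logged_losses)
samples_per_second = total_steps * effective_batch / train_runtime
steps_per_second = total_steps / train_runtime

print(f"mean of logged losses: {mean_loss:.5f}")
print(f"samples per second:    {samples_per_second:.3f}")
print(f"steps per second:      {steps_per_second:.3f}")
```

The loss drops sharply within the first 20 steps and then flattens near 0.1, which is consistent with the model quickly learning the fixed prompt and JSON structure; a held-out set would be needed to tell whether the remaining steps improve the field values themselves.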


## 6. Save the Model

In [None]:
# Save LoRA adapters
model.save_pretrained("text-to-action-lora")
tokenizer.save_pretrained("text-to-action-lora")

print("Model saved to 'text-to-action-lora'")

Model saved to 'text-to-action-lora'


In [None]:
# Optional: Save merged model for easier inference
model.save_pretrained_merged("text-to-action-merged", tokenizer, save_method="merged_16bit")

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [00:00<00:00, 10305.42it/s]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [06:41<00:00, 100.49s/it]


Unsloth: Merge process complete. Saved to `/content/text-to-action-merged`


## 7. Quick Inference Test

In [None]:
# Test inference
FastLanguageModel.for_inference(model)

test_instruction = "Move the purple cylinder to the corner"

test_prompt = f"""### Instruction:
You are an AI that converts natural language instructions into structured JSON action plans.
Given the following instruction, output a valid JSON with these fields:
- object: the object to manipulate
- initial_position: where the object currently is
- action: what to do (move, rotate, scale)
- target_position: the destination or target state

### Input:
{test_instruction}

### Response:
"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.1,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:")[-1].strip())

{"object": "purple cylinder", "initial_position": "desk", "action": "move", "target_position": "corner"}
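
The inference prompt has to match the training prompt character for character, and retyping it by hand invites drift. One way to avoid that, sketched below, is to build the prompt from the same `PROMPT_TEMPLATE` with an empty `output`. The helper names `build_inference_prompt` and `extract_response` are illustrative, and the `decoded` string is a stand-in for what `tokenizer.decode` would return, so the example runs without a GPU.

```python
import json

PROMPT_TEMPLATE = """### Instruction:
You are an AI that converts natural language instructions into structured JSON action plans.
Given the following instruction, output a valid JSON with these fields:
- object: the object to manipulate
- initial_position: where the object currently is
- action: what to do (move, rotate, scale)
- target_position: the destination or target state

### Input:
{instruction}

### Response:
{output}"""


def build_inference_prompt(instruction):
    """Reuse the training template, leaving the response empty for the model to fill."""
    return PROMPT_TEMPLATE.format(instruction=instruction, output="")


def extract_response(decoded):
    """Return the text after the last '### Response:' marker."""
    return decoded.rsplit("### Response:", 1)[-1].strip()


prompt = build_inference_prompt("Move the purple cylinder to the corner")

# Stand-in for tokenizer.decode(...): the prompt followed by the model's answer.
decoded = prompt + ('{"object": "purple cylinder", "initial_position": "desk", '
                    '"action": "move", "target_position": "corner"}')
plan = json.loads(extract_response(decoded))
print(plan["action"], "->", plan["target_position"])
```

Parsing the extracted text with `json.loads` doubles as a quick validity check on the generated output.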


In [None]:
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  25%|██▌       | 1/4 [02:10<06:32, 130.86s/it]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  50%|█████     | 2/4 [03:57<03:53, 116.78s/it]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  75%|███████▌  | 3/4 [05:40<01:50, 110.30s/it]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [05:51<00:00, 87.84s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [05:10<00:00, 77.55s/it]


Unsloth: Merge process complete. Saved to `/content/gguf_model`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['meta-llama-3.1-8b-instruct.F16.gguf']
Un

{'save_directory': 'gguf_model',
 'gguf_files': ['meta-llama-3.1-8b-instruct.Q4_K_M.gguf'],
 'modelfile_location': '/content/Modelfile',
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}

In [None]:
from google.colab import files
import os

gguf_files = [f for f in os.listdir("gguf_model") if f.endswith(".gguf")]
if gguf_files:
    gguf_file = os.path.join("gguf_model", gguf_files[0])
    print(f"Downloading: {gguf_file}")
    files.download(gguf_file)

## Next Steps

1. **Expand dataset** - Add more diverse instruction-action pairs
2. **Hyperparameter tuning** - Experiment with LoRA rank, learning rate
3. **Evaluation** - Run on held-out test set
4. **Deploy** - Export for Ollama or serve with FastAPI
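
For the evaluation step, a few simple metrics already say a lot about a JSON-generating model. The sketch below is one possible scoring function, not an established benchmark: it assumes predictions and references are raw JSON strings with the four fields from the prompt, and reports the share of parseable outputs, exact matches, and per-field accuracy. The example strings are made up for illustration.

```python
import json

FIELDS = ("object", "initial_position", "action", "target_position")


def evaluate(predictions, references):
    """Score raw model outputs against reference JSON strings."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal length")
    valid = 0
    exact = 0
    field_hits = {f: 0 for f in FIELDS}
    for pred_text, ref_text in zip(predictions, references):
        ref = json.loads(ref_text)
        try:
            pred = json.loads(pred_text)
        except json.JSONDecodeError:
            continue  # unparseable output counts as wrong everywhere
        if not isinstance(pred, dict):
            continue
        valid += 1
        exact += int(pred == ref)
        for f in FIELDS:
            field_hits[f] += int(pred.get(f) == ref.get(f))
    n = len(references)
    return {
        "valid_json_rate": valid / n,
        "exact_match": exact / n,
        "field_accuracy": {f: hits / n for f, hits in field_hits.items()},
    }


# Made-up examples: one exact match, one wrong field, one truncated output.
references = [
    '{"object": "red box", "initial_position": "floor", "action": "move", "target_position": "platform"}',
    '{"object": "cube", "initial_position": "table", "action": "rotate", "target_position": "90 degrees"}',
    '{"object": "ball", "initial_position": "shelf", "action": "scale", "target_position": "2x"}',
]
predictions = [
    references[0],
    '{"object": "cube", "initial_position": "desk", "action": "rotate", "target_position": "90 degrees"}',
    '{"object": "ball", "initial_position"',
]
scores = evaluate(predictions, references)
print(scores)
```

Exact string equality is strict for free-text fields such as `initial_position`; a looser comparison (for example case-insensitive matching) may be more appropriate depending on how the action plans are consumed.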

In [None]:
# Install backend dependencies
!pip install -q fastapi uvicorn pyngrok nest_asyncio
!pip install -q pydantic python-multipart

# Remove existing repo and re-clone for a clean state
!rm -rf text-to-action-llm
!git clone https://github.com/Rockstatata/text-to-action-llm.git
%cd text-to-action-llm/backend

Cloning into 'text-to-action-llm'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 48 (delta 1), reused 48 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (48/48), 39.34 KiB | 9.83 MiB/s, done.
Resolving deltas: 100% (1/1), done.
/content/text-to-action-llm/backend


In [None]:
# Create necessary directories if they don't exist for model.py
!mkdir -p app/llm

In [None]:
%%writefile app/llm/model.py
import os
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
import torch
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s | %(levelname)s | %(name)s | %(message)s')
logger = logging.getLogger('text-to-action')

# Global variables to store the loaded model and tokenizer
_model = None
_tokenizer = None

def load_model():
    global _model, _tokenizer
    if _model is not None and _tokenizer is not None:
        logger.info("Model already loaded")
        return _model, _tokenizer

    backend = os.environ.get("LLM_BACKEND", "transformers")
    lora_path = os.environ.get("MODEL_PATH", "/content/text-to-action-lora")  # Directory with the saved LoRA adapters

    logger.info(f"Loading model with backend: {backend}")
    logger.info(f"Loading LoRA adapters from: {lora_path}")

    if backend == "transformers":
        # Point Unsloth at the adapter directory: it reads adapter_config.json,
        # loads the base model recorded there, and attaches the trained weights.
        # Calling get_peft_model here instead would create a fresh, untrained
        # adapter, and an adapter added later with load_adapter() is not made
        # active automatically.
        _model, _tokenizer = FastLanguageModel.from_pretrained(
            model_name=lora_path,
            max_seq_length=2048,
            dtype=None,  # Auto-detect
            load_in_4bit=True,  # Should match training setup
        )

        # Enable Unsloth's inference optimizations once the adapters are attached
        FastLanguageModel.for_inference(_model)

        logger.info("Transformers model with LoRA adapters loaded successfully")
        return _model, _tokenizer
    else:
        raise ValueError(f"Unknown LLM_BACKEND: {backend}")

def get_model():
    return _model

def get_tokenizer():
    return _tokenizer


Overwriting app/llm/model.py


In [None]:
%%writefile app/api/infer.py
import torch
from fastapi import APIRouter, HTTPException
import logging

from app.llm.model import get_model, get_tokenizer # Corrected import
from app.llm.schema import ActionPlan, InferenceRequest # Assuming these exist
from app.utils.json_validator import validate_and_parse_json # Assuming this exists

logger = logging.getLogger('text-to-action')
router = APIRouter()

# Define the PROMPT_TEMPLATE here, as it's specific to the model's expected input
PROMPT_TEMPLATE = """### Instruction:
You are an AI that converts natural language instructions into structured JSON action plans.
Given the following instruction, output a valid JSON with these fields:
- object: the object to manipulate
- initial_position: where the object currently is
- action: what to do (move, rotate, scale)
- target_position: the destination or target state

### Input:
{instruction}

### Response:
{output}"""

async def perform_inference(instruction: str) -> str:
    model = get_model()
    tokenizer = get_tokenizer()

    if model is None or tokenizer is None:
        logger.error("LLM model or tokenizer not loaded.")
        raise HTTPException(status_code=500, detail="LLM model or tokenizer not loaded.")

    test_prompt = PROMPT_TEMPLATE.format(
        instruction=instruction,
        output="" # Expecting model to fill this
    )

    inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.1,
        do_sample=True,
    )

    # Decode only the newly generated tokens, so the prompt (which itself
    # contains "### Response:") never has to be stripped back out.
    # skip_special_tokens=True already drops <|eot_id|> and <|end_of_text|>.
    prompt_length = inputs["input_ids"].shape[1]
    extracted_json = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()
    return extracted_json

@router.post("/infer", response_model=ActionPlan)
async def infer_action(request: InferenceRequest):
    logger.info(f"Received instruction: {request.instruction}")
    try:
        raw_json_output = await perform_inference(request.instruction)
        validated_json = validate_and_parse_json(raw_json_output)
        return validated_json  # FastAPI validates this against ActionPlan
    except HTTPException:
        raise  # Keep the status code and detail set by perform_inference
    except Exception as e:
        logger.error(f"Inference error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=f"Internal server error: {e}")


Overwriting app/api/infer.py
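
The endpoint relies on `validate_and_parse_json` from the cloned repository, whose implementation is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below parses the first JSON object in the model output and checks the four expected fields. This is a hypothetical stand-in written with the standard library only, and the repository's actual function may behave differently.

```python
import json

REQUIRED_FIELDS = ("object", "initial_position", "action", "target_position")


def validate_and_parse_json(raw_text):
    """Parse the first JSON object in raw_text and check the required string fields."""
    start = raw_text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    try:
        # raw_decode stops at the end of the first object and ignores trailing text.
        plan, _end = json.JSONDecoder().raw_decode(raw_text, start)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    missing = [f for f in REQUIRED_FIELDS if not isinstance(plan.get(f), str)]
    if missing:
        raise ValueError(f"missing or non-string fields: {missing}")
    return {f: plan[f] for f in REQUIRED_FIELDS}


noisy_output = ('{"object": "red box", "initial_position": "floor", '
                '"action": "move", "target_position": "platform"}\n### Input: ...')
parsed_plan = validate_and_parse_json(noisy_output)
print(parsed_plan)
```

Using `raw_decode` tolerates a model that keeps generating after the closing brace, which is a common failure mode when generation is not stopped by an end-of-sequence token.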


In [None]:
%%writefile app/api/health.py
from fastapi import APIRouter
from app.llm.model import get_model # Corrected import

router = APIRouter()

@router.get("/health")
async def health_check():
    # get_model() returns None until load_model() has run
    model_loaded = get_model() is not None
    return {"status": "ok" if model_loaded else "loading", "model_loaded": model_loaded}

Overwriting app/api/health.py


In [None]:
%%writefile app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.api.infer import router as infer_router
from app.api.health import router as health_router

app = FastAPI(
    title="Text-to-Action API",
    description="API for converting natural language instructions to structured JSON action plans.",
    version="0.1.0",
)

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allows all origins
    allow_credentials=True,
    allow_methods=["*"],  # Allows all methods
    allow_headers=["*"],  # Allows all headers
)

app.include_router(infer_router, prefix="/api")
app.include_router(health_router)


Overwriting app/main.py


In [None]:
# Ensure we are in the base content directory
%cd /content

# Change directory to the backend folder for API startup
%cd text-to-action-llm/backend

# Restart the model loading and API server

# Attempt to kill any existing process on port 8000 more robustly
import subprocess
try:
    # Find PIDs of processes listening on port 8000
    pids_output = subprocess.check_output(['lsof', '-ti', ':8000']).decode().strip()
    if pids_output:
        pids = pids_output.split('\n')
        print(f"Found processes on port 8000: {pids}. Attempting to kill them.")
        for pid in pids:
            subprocess.run(['kill', '-9', pid])
        import time
        time.sleep(3) # Give processes time to die
except (subprocess.CalledProcessError, FileNotFoundError):
    print("No processes found on port 8000 or lsof not available. Proceeding...")

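# Clear the notebook namespace. Note that %reset does not unload modules:
# anything already imported stays cached in sys.modules, so restart the
# runtime if edited files under app/ need to be re-imported.
%reset -f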

# Import ngrok AFTER kernel reset
from pyngrok import ngrok

# Now that ngrok is imported, we can kill any lingering tunnels
ngrok.kill()

# Load the model with the updated logic
import os
os.environ["LLM_BACKEND"] = "transformers"
os.environ["MODEL_PATH"] = "/content/text-to-action-lora"  # Adjust path if needed

from app.llm.model import load_model
load_model()

import nest_asyncio
import uvicorn
from threading import Thread

nest_asyncio.apply()

from app.main import app

def run():
    uvicorn.run(app, host="0.0.0.0", port=8000)

thread = Thread(target=run, daemon=True)
thread.start()

import time
time.sleep(5) # Give it more time to start up

public_url = ngrok.connect(8000)
print(f"🚀 API live at: {public_url}")

# Perform inference test here
import requests
response1 = requests.get(f"{public_url.public_url}/health")
print(response1.json())
response = requests.post(f"{public_url.public_url}/api/infer", json={"instruction": "Move red box to platform"})

if response.status_code == 200:
    print("\n--- Inference Test Result ---")
    print(response.json())
else:
    print("\n--- Inference Test Failed ---")
    print(f"Error: {response.status_code}")
    print(response.text)

/content
/content/text-to-action-llm/backend
No processes found on port 8000 or lsof not available. Proceeding...
==((====))==  Unsloth 2026.1.4: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

INFO:     Started server process [32712]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


🚀 API live at: NgrokTunnel: "https://anthroponomical-jodi-mercilessly.ngrok-free.dev" -> "http://localhost:8000"
INFO:     34.143.220.176:0 - "GET /health HTTP/1.1" 200 OK
{'status': 'ok', 'model_loaded': True}
2026-01-26 00:58:41 | INFO     | text-to-action | Received instruction: Move red box to platform


INFO:text-to-action:Received instruction: Move red box to platform


INFO:     34.143.220.176:0 - "POST /api/infer HTTP/1.1" 200 OK

--- Inference Test Result ---
{'object': 'red box', 'initial_position': 'current location', 'action': 'move', 'target_position': 'platform'}
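
The startup cell waits a fixed five seconds before opening the tunnel, which can be too short on a slow runtime and needlessly long on a fast one. A polling loop is one alternative. The sketch below is illustrative only: `wait_until_ready` and `health_probe` are hypothetical helpers, and the local URL assumes the server listens on port 8000 as configured above.

```python
import json
import time
import urllib.request


def wait_until_ready(probe, timeout=30.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or timed out: server is not up yet
        if clock() >= deadline:
            return False
        sleep(interval)


def health_probe(url="http://127.0.0.1:8000/health"):
    """Return True once /health reports that the model is loaded."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp).get("model_loaded") is True


# Demonstration with a fake probe that succeeds on the third attempt.
attempts = []


def fake_probe():
    attempts.append(1)
    return len(attempts) >= 3


ready = wait_until_ready(fake_probe, sleep=lambda seconds: None)
print(f"ready={ready} after {len(attempts)} attempts")
```

In the notebook, `wait_until_ready(health_probe)` could replace `time.sleep(5)`, with `ngrok.connect(8000)` called only once it returns `True`.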


In [None]:
import requests
print(f"🚀 API live at: {public_url}")

# Perform inference test here

response1 = requests.get(f"{public_url.public_url}/health")
print(response1.json())
response = requests.post(f"{public_url.public_url}/api/infer", json={"instruction": "Move red box to platform"})

if response.status_code == 200:
    print("\n--- Inference Test Result ---")
    print(response.json())
else:
    print("\n--- Inference Test Failed ---")
    print(f"Error: {response.status_code}")
    print(response.text)

🚀 API live at: NgrokTunnel: "https://anthroponomical-jodi-mercilessly.ngrok-free.dev" -> "http://localhost:8000"
INFO:     34.143.220.176:0 - "GET /health HTTP/1.1" 200 OK
{'status': 'ok', 'model_loaded': True}
2026-01-26 00:58:53 | INFO     | text-to-action | Received instruction: Move red box to platform


INFO:text-to-action:Received instruction: Move red box to platform


INFO:     34.143.220.176:0 - "POST /api/infer HTTP/1.1" 200 OK

--- Inference Test Result ---
{'object': 'red box', 'initial_position': 'current location', 'action': 'move', 'target_position': 'platform'}
