# FastVLM Training with Ouro-1.4B (LoopLM)

Train FastVLM with **ByteDance Ouro-1.4B**, a Looped Language Model (LoopLM), as the LLM backbone.

- **LLM Backbone**: Ouro-1.4B (LoopLM architecture)
- **Vision Encoder**: FastViTHD (MobileCLIP)
- **Dataset**: 5CD-AI/Viet-multimodal-open-r1-8k-verified

## Ouro Architecture
- **Hidden Size**: 2048
- **Layers**: 24 (x4 recurrent steps)
- **Reported performance**: comparable to ~4-12B standard models

## 1. Install Dependencies

In [None]:
# Install required packages
# IMPORTANT: Ouro requires transformers < 4.56.0
!pip install -q transformers==4.54.1
# Quote version specifiers so the shell does not treat '>' as output redirection
!pip install -q "torch>=2.1.0" "torchvision>=0.16.0"
!pip install -q "accelerate>=0.26.0" "peft>=0.10.0"
!pip install -q "bitsandbytes>=0.43.0"
!pip install -q datasets pillow einops "timm>=0.9.0"
!pip install -q sentencepiece safetensors
!pip install -q huggingface_hub

In [None]:
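# Verify transformers version
import transformers
from packaging import version

print(f"Transformers version: {transformers.__version__}")
# Compare parsed versions; plain string comparison is lexicographic and can misorder releases
assert version.parse(transformers.__version__) < version.parse("4.56.0"), "Ouro requires transformers < 4.56.0!"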
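
Version strings compare character by character, so a check such as `"4.100.0" < "4.56.0"` evaluates to true even though 4.100.0 is the newer release. The sketch below shows a numeric comparison using only the standard library. `version_tuple` is a hypothetical helper introduced here for illustration; `packaging.version.parse` is the more complete option when it is available.

```python
import re


def version_tuple(v: str) -> tuple:
    """Parse the leading numeric components, e.g. '4.54.1' -> (4, 54, 1)."""
    # Hypothetical helper: stops at the first non-numeric component ('dev0', 'rc1', ...)
    parts = []
    for piece in v.split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


LIMIT = "4.56.0"
for candidate in ["4.54.1", "4.56.0", "4.100.0"]:
    numeric_ok = version_tuple(candidate) < version_tuple(LIMIT)
    string_ok = candidate < LIMIT  # lexicographic, shown only for contrast
    print(f"{candidate}: numeric check = {numeric_ok}, string check = {string_ok}")
```

The two checks disagree for `4.100.0`, which is the case a string comparison gets wrong.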

## 2. Configuration

In [None]:
import os
import json
import torch
from pathlib import Path

# ============================================
# CONFIGURATION - OURO-1.4B
# ============================================
CONFIG = {
    # Model - OURO
    "llm_model": "ByteDance/Ouro-1.4B",
    "vision_tower": "mobileclip_l_384",
    "mm_hidden_size": 3072,       # MobileCLIP output
    "llm_hidden_size": 2048,      # Ouro hidden size
    
    # Dataset
    "dataset_name": "5CD-AI/Viet-multimodal-open-r1-8k-verified",
    "image_column": "image",
    "question_column": "vi_problem",
    "answer_column": "vi_solution",
    
    # Training
    "output_dir": "./outputs/fastvlm-ouro-1.4b-vietnamese",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,  # kept small to limit GPU memory use
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "bf16": True,
    "model_max_length": 2048,
    
    # LoRA
    "use_lora": True,
    "lora_r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    
    # Save
    "save_strategy": "epoch",
    "save_total_limit": 2,
    
    # HuggingFace
    "hf_repo": "beyoru/Belle-VLM-Ouro",
    "hf_token": os.environ.get("HF_TOKEN", ""),
}

os.makedirs(CONFIG["output_dir"], exist_ok=True)

print("Configuration (Ouro-1.4B):")
for k, v in CONFIG.items():
    if k != "hf_token":
        print(f"  {k}: {v}")
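
The batch size, gradient accumulation, epoch count, and warmup ratio above jointly determine how many optimizer steps the run takes. The sketch below works through that arithmetic for a single GPU. `num_samples = 8000` is an assumed round figure (the real count is printed when the dataset loads), `training_plan` is a hypothetical helper, and the step count is an approximation of what the trainer computes.

```python
import math


def training_plan(num_samples, per_device_batch, grad_accum, epochs,
                  warmup_ratio, num_devices=1):
    """Approximate optimizer-step counts for a fixed-size dataset."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumed dataset size; the other values mirror CONFIG above
plan = training_plan(num_samples=8000, per_device_batch=2, grad_accum=1,
                     epochs=3, warmup_ratio=0.03)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step memory, at the cost of fewer optimizer updates per epoch.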

In [None]:
# Login to HuggingFace
from huggingface_hub import login

if CONFIG["hf_token"]:
    login(token=CONFIG["hf_token"])
    print("Logged in to HuggingFace!")
else:
    print("HF_TOKEN not set.")

## 3. Load Ouro Model

In [None]:
# Check GPU
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, BitsAndBytesConfig

print(f"Loading Ouro model: {CONFIG['llm_model']}")

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["llm_model"],
    trust_remote_code=True,
    model_max_length=CONFIG["model_max_length"],
    padding_side="right",
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

# Load Ouro model
model = AutoModelForCausalLM.from_pretrained(
    CONFIG["llm_model"],
    trust_remote_code=True,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(f"Model loaded!")
print(f"Model type: {model.config.model_type}")
print(f"Hidden size: {model.config.hidden_size}")
print(f"Layers: {model.config.num_hidden_layers}")
print(f"Total UT steps: {getattr(model.config, 'total_ut_steps', 'N/A')}")
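
To see why 4-bit loading helps, the weight memory can be estimated from the parameter count and the bits stored per weight. This is a rough sketch only: it takes the nominal 1.4B parameter count from the model name as an assumption, and ignores quantization constants, activations, LoRA weights, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.4e9  # assumption: nominal size taken from the model name

for name, bits in [("fp32", 32), ("bf16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{name:>12}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```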

In [None]:
# Add vision modules to Ouro
import torch.nn as nn

# Build mm_projector: 3072 (MobileCLIP) -> 2048 (Ouro)
mm_projector = nn.Sequential(
    nn.Linear(CONFIG["mm_hidden_size"], CONFIG["llm_hidden_size"]),
    nn.GELU(),
    nn.Linear(CONFIG["llm_hidden_size"], CONFIG["llm_hidden_size"]),
).to(model.device, dtype=torch.bfloat16)

# Attach to model
model.mm_projector = mm_projector

print(f"Added mm_projector: {CONFIG['mm_hidden_size']} -> {CONFIG['llm_hidden_size']}")

# Count parameters
proj_params = sum(p.numel() for p in mm_projector.parameters())
print(f"mm_projector parameters: {proj_params / 1e6:.2f}M")
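
The projector's parameter count can be checked by hand: a linear layer with bias has `in_features * out_features + out_features` parameters, and the GELU in between has none. A small worked example using the sizes from `CONFIG` (`linear_params` is a hypothetical helper):

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


MM_HIDDEN = 3072   # CONFIG["mm_hidden_size"]
LLM_HIDDEN = 2048  # CONFIG["llm_hidden_size"]

first = linear_params(MM_HIDDEN, LLM_HIDDEN)    # vision features -> LLM width
second = linear_params(LLM_HIDDEN, LLM_HIDDEN)  # LLM width -> LLM width
total = first + second

print(f"first layer:  {first:,}")
print(f"second layer: {second:,}")
print(f"total:        {total:,} ({total / 1e6:.2f}M)")
```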

## 4. Setup LoRA

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# LoRA config for Ouro
lora_config = LoraConfig(
    r=CONFIG["lora_r"],
    lora_alpha=CONFIG["lora_alpha"],
    lora_dropout=CONFIG["lora_dropout"],
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

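model = get_peft_model(model, lora_config)

# prepare_model_for_kbit_training and get_peft_model freeze every non-LoRA parameter,
# including the attached mm_projector, so re-enable gradients for it explicitly.
for p in model.mm_projector.parameters():
    p.requires_grad = True

model.print_trainable_parameters()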
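
For intuition on the LoRA settings: each adapted linear layer of shape `d_out x d_in` gains two low-rank matrices, A (`r x d_in`) and B (`d_out x r`), and the learned update is scaled by `lora_alpha / r`. The sketch below counts the added parameters for one square 2048 x 2048 projection. That shape is illustrative only, since the real projection shapes depend on the Ouro configuration.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


R, ALPHA = 8, 16          # CONFIG["lora_r"], CONFIG["lora_alpha"]
D_IN = D_OUT = 2048       # illustrative shape, not read from the model

added = lora_added_params(D_IN, D_OUT, R)
full = D_IN * D_OUT
scaling = ALPHA / R

print(f"added per layer: {added:,} ({100 * added / full:.2f}% of the frozen weight)")
print(f"update scaling:  {scaling}")
```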

## 5. Load Vision Tower

In [None]:
import timm
from transformers import CLIPImageProcessor

# Load MobileCLIP vision tower
print("Loading MobileCLIP vision tower...")
vision_tower = timm.create_model(
    "fastvit_mci2.apple_mclip",
    pretrained=True,
    num_classes=0,
)
vision_tower.eval()
vision_tower.requires_grad_(False)  # frozen feature extractor; keep it out of the optimizer
vision_tower = vision_tower.to(model.device, dtype=torch.bfloat16)

# Image processor
image_processor = CLIPImageProcessor(
    size={"shortest_edge": 384},
    crop_size={"height": 384, "width": 384},
    do_center_crop=True,
    do_normalize=True,
    image_mean=[0.48145466, 0.4578275, 0.40821073],
    image_std=[0.26862954, 0.26130258, 0.27577711],
)

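# Test
dummy_img = torch.randn(1, 3, 384, 384).to(model.device, dtype=torch.bfloat16)
with torch.no_grad():
    features = vision_tower.forward_features(dummy_img)
print(f"Vision tower output: {features.shape}")

# The feature dimension must match the projector input size (channels for a B,C,H,W map)
feat_dim = features.shape[1] if features.dim() == 4 else features.shape[-1]
if feat_dim != CONFIG["mm_hidden_size"]:
    print(f"WARNING: vision feature dim {feat_dim} != mm_hidden_size {CONFIG['mm_hidden_size']}")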

# Attach to model
model.vision_tower = vision_tower
model.image_processor = image_processor

print("Vision tower ready!")
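
Each spatial location of the final feature map becomes one image token, so the token count follows from the input resolution and the encoder's total downsampling factor. The sketch below shows that arithmetic. `TOTAL_STRIDE = 32` is an assumption to be checked against the shape printed by the vision tower test above, and `num_image_tokens` is a hypothetical helper.

```python
def num_image_tokens(image_size: int, total_stride: int) -> int:
    """Tokens produced by flattening a square (H, W) feature map."""
    side = image_size // total_stride
    return side * side


IMAGE_SIZE = 384      # matches the image processor settings
TOTAL_STRIDE = 32     # assumption: verify against the printed feature shape
TEXT_MAX_LENGTH = 2048  # CONFIG["model_max_length"]

tokens = num_image_tokens(IMAGE_SIZE, TOTAL_STRIDE)
print(f"image tokens: {tokens}")
print(f"max sequence length with text: {tokens + TEXT_MAX_LENGTH}")
```

Because the image tokens are added on top of the text tokens, the sequence seen by the LLM can be longer than `model_max_length`.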

## 6. Prepare Dataset

In [None]:
from datasets import load_dataset
from PIL import Image
from tqdm import tqdm

# Load dataset
print(f"Loading dataset: {CONFIG['dataset_name']}")
dataset = load_dataset(CONFIG["dataset_name"], split="train")

print(f"Total samples: {len(dataset)}")
print(f"Columns: {dataset.column_names}")

In [None]:
# Create LLaVA format data
import json
import os

DATA_DIR = "./data"
IMAGE_FOLDER = os.path.join(DATA_DIR, "images")
os.makedirs(IMAGE_FOLDER, exist_ok=True)

llava_data = []

for idx, sample in enumerate(tqdm(dataset, desc="Converting")):
    # Save image
    image_filename = f"{idx:06d}.jpg"
    image_path = os.path.join(IMAGE_FOLDER, image_filename)
    
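    img = sample[CONFIG["image_column"]]
    if not isinstance(img, Image.Image):
        continue  # skip samples without a decodable image so no record points at a missing file
    if img.mode != 'RGB':
        img = img.convert('RGB')
    img.save(image_path, 'JPEG', quality=95)
    
    # Conversation
    question = sample[CONFIG["question_column"]].strip()
    answer = sample[CONFIG["answer_column"]].strip()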
    if len(answer) > 4096:
        answer = answer[:4096] + "..."
    
    llava_data.append({
        "id": str(idx),
        "image": image_filename,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer}
        ]
    })

# Save
json_path = os.path.join(DATA_DIR, "train_data.json")
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(llava_data, f, ensure_ascii=False, indent=2)

print(f"Dataset converted: {len(llava_data)} samples")
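
For reference, a single record in the LLaVA-style format written above looks like the output of the sketch below. The question and answer strings are made-up placeholders, not dataset content, and `to_llava_record` is a hypothetical helper that mirrors the loop's logic.

```python
import json


def to_llava_record(idx: int, question: str, answer: str,
                    max_answer_chars: int = 4096) -> dict:
    """Build one LLaVA-style training record for sample number idx."""
    answer = answer.strip()
    if len(answer) > max_answer_chars:
        # Same character-level truncation as the conversion loop
        answer = answer[:max_answer_chars] + "..."
    return {
        "id": str(idx),
        "image": f"{idx:06d}.jpg",
        "conversations": [
            {"from": "human", "value": f"<image>\n{question.strip()}"},
            {"from": "gpt", "value": answer},
        ],
    }


# Placeholder text for illustration only
record = to_llava_record(0, "What shape is shown in the figure?", "A triangle.")
print(json.dumps(record, ensure_ascii=False, indent=2))
```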

## 7. Training

In [None]:
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

IMAGE_TOKEN_INDEX = -200

class VLMDataset(Dataset):
    """Simple VLM dataset for training."""
    
    def __init__(self, data, image_folder, tokenizer, image_processor, max_length=2048):
        self.data = data
        self.image_folder = image_folder
        self.tokenizer = tokenizer
        self.image_processor = image_processor
        self.max_length = max_length
        
        self.transform = transforms.Compose([
            transforms.Resize((384, 384)),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.48145466, 0.4578275, 0.40821073],
                std=[0.26862954, 0.26130258, 0.27577711]
            ),
        ])
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        
        # Load image
        image_path = os.path.join(self.image_folder, item['image'])
        image = Image.open(image_path).convert('RGB')
        image_tensor = self.transform(image)
        
        # Build conversation
        conv = item['conversations']
        question = conv[0]['value'].replace('<image>', '').strip()
        answer = conv[1]['value']
        
        # Format prompt (Ouro uses standard format)
        prompt = f"User: <image>\n{question}\nAssistant: {answer}"
        
        # Tokenize
        tokens = self.tokenizer(
            prompt,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        
        input_ids = tokens['input_ids'].squeeze(0)
        attention_mask = tokens['attention_mask'].squeeze(0)
        
        # Labels: train only on the answer tokens
        labels = input_ids.clone()
        # Count the tokens in the prompt prefix (everything up to and including 'Assistant:')
        answer_start = prompt.find('Assistant:') + len('Assistant:')
        prompt_len = self.tokenizer(prompt[:answer_start], return_tensors='pt')['input_ids'].shape[1]
        labels[:prompt_len] = -100          # ignore prompt tokens
        labels[attention_mask == 0] = -100  # ignore padding tokens
        
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels,
            'images': image_tensor,
        }

# Create dataset
train_dataset = VLMDataset(
    llava_data, 
    IMAGE_FOLDER, 
    tokenizer, 
    image_processor,
    max_length=CONFIG["model_max_length"]
)

print(f"Training dataset: {len(train_dataset)} samples")
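
The label convention used here is that positions set to -100 are ignored by the loss. The plain-Python sketch below illustrates the masking with token ids invented for the example. It also masks padding positions, a common choice when sequences are padded to a fixed length. `build_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100


def build_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids, ignoring the prompt prefix and any padding positions."""
    labels = []
    for pos, (token, mask) in enumerate(zip(input_ids, attention_mask)):
        if pos < prompt_len or mask == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token)
    return labels


# Invented ids: three prompt tokens, two answer tokens, two padding tokens
input_ids = [11, 12, 13, 21, 22, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]

labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)
```

Only the two answer tokens keep their ids, so only they contribute to the loss.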

In [None]:
from transformers import Trainer, TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir=CONFIG["output_dir"],
    num_train_epochs=CONFIG["num_train_epochs"],
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
    gradient_accumulation_steps=CONFIG["gradient_accumulation_steps"],
    learning_rate=CONFIG["learning_rate"],
    warmup_ratio=CONFIG["warmup_ratio"],
    lr_scheduler_type=CONFIG["lr_scheduler_type"],
    bf16=CONFIG["bf16"],
    logging_steps=10,
    save_strategy=CONFIG["save_strategy"],
    save_total_limit=CONFIG["save_total_limit"],
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    report_to="none",
    remove_unused_columns=False,
)

# Custom collate function
def collate_fn(batch):
    return {
        'input_ids': torch.stack([x['input_ids'] for x in batch]),
        'attention_mask': torch.stack([x['attention_mask'] for x in batch]),
        'labels': torch.stack([x['labels'] for x in batch]),
        'images': torch.stack([x['images'] for x in batch]),
    }

print("Training arguments ready!")

In [None]:
# Custom Trainer for multimodal
class VLMTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        images = inputs.pop('images').to(model.device, dtype=torch.bfloat16)
        input_ids = inputs['input_ids']
        labels = inputs['labels']
        attention_mask = inputs['attention_mask']
        
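        # Encode images with the frozen vision tower. Trainer puts the model in
        # train mode, so force eval mode to keep normalization layers in inference mode.
        model.vision_tower.eval()
        with torch.no_grad():
            image_features = model.vision_tower.forward_features(images)
            if image_features.dim() == 4:
                # (B, C, H, W) -> (B, H*W, C): one token per spatial location
                image_features = image_features.flatten(2).transpose(1, 2)
        
        # Project
        image_features = model.mm_projector(image_features)
        
        # Get text embeddings
        text_embeds = model.get_input_embeddings()(input_ids)
        
        # Concatenate image + text (match dtypes first)
        inputs_embeds = torch.cat([image_features.to(text_embeds.dtype), text_embeds], dim=1)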
        
        # Adjust attention mask
        image_mask = torch.ones(
            images.size(0), image_features.size(1),
            device=attention_mask.device, dtype=attention_mask.dtype
        )
        attention_mask = torch.cat([image_mask, attention_mask], dim=1)
        
        # Adjust labels
        image_labels = torch.full(
            (images.size(0), image_features.size(1)),
            -100,
            device=labels.device, dtype=labels.dtype
        )
        labels = torch.cat([image_labels, labels], dim=1)
        
        # Forward
        outputs = model(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            labels=labels,
        )
        
        return (outputs.loss, outputs) if return_outputs else outputs.loss

# Create trainer
trainer = VLMTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate_fn,
)

print("Trainer ready!")
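
The trainer prepends the image tokens to the text tokens, so the attention mask and labels have to grow by the same number of positions: image positions are attended to (mask 1) but never predicted (label -100). The sketch below shows that bookkeeping for one sequence using plain lists. The values are invented, and `prepend_image_positions` is a hypothetical helper.

```python
IGNORE_INDEX = -100


def prepend_image_positions(attention_mask, labels, num_image_tokens):
    """Extend mask and labels to cover image tokens placed before the text."""
    new_mask = [1] * num_image_tokens + list(attention_mask)
    new_labels = [IGNORE_INDEX] * num_image_tokens + list(labels)
    return new_mask, new_labels


# Invented example: 2 image tokens in front of a 4-token text sequence
text_mask = [1, 1, 1, 0]
text_labels = [-100, 21, 22, -100]

full_mask, full_labels = prepend_image_positions(text_mask, text_labels, num_image_tokens=2)
print(full_mask)
print(full_labels)
```

All three tensors passed to the model (embeddings, mask, labels) must end up with the same sequence length.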

In [None]:
# Start training
print(f"Starting training for {CONFIG['num_train_epochs']} epochs...")
trainer.train()
print("Training completed!")

In [None]:
# Save model
print("Saving model...")
trainer.save_model(CONFIG["output_dir"])

# Save mm_projector separately
mm_projector_path = os.path.join(CONFIG["output_dir"], "mm_projector.bin")
torch.save(model.mm_projector.state_dict(), mm_projector_path)
print(f"Saved mm_projector to {mm_projector_path}")

## 8. Merge and Save

In [None]:
from peft import PeftModel

OUTPUT_DIR = CONFIG["output_dir"]
MERGED_DIR = os.path.join(OUTPUT_DIR, "merged")
os.makedirs(MERGED_DIR, exist_ok=True)

print("Loading base model for merging...")
base_model = AutoModelForCausalLM.from_pretrained(
    CONFIG["llm_model"],
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cpu",
)

print("Loading LoRA adapter...")
merged_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)

print("Merging weights...")
merged_model = merged_model.merge_and_unload()

# Load mm_projector
mm_projector_weights = torch.load(mm_projector_path, map_location="cpu")

# Get state dict and add mm_projector
state_dict = merged_model.state_dict()
for k, v in mm_projector_weights.items():
    state_dict[f"model.mm_projector.{k}"] = v.to(torch.float16)
    print(f"Added: model.mm_projector.{k}")

# Save
merged_model.save_pretrained(MERGED_DIR, state_dict=state_dict, safe_serialization=True)
tokenizer.save_pretrained(MERGED_DIR)

print(f"\nModel saved to: {MERGED_DIR}")
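
The projector weights are folded into the merged checkpoint by prefixing their keys. For an `nn.Sequential` of Linear, GELU, Linear, the parameter keys are `0.weight`, `0.bias`, `2.weight`, and `2.bias`, because index 1 (the GELU) has no parameters. The sketch below shows the renaming with strings standing in for tensors. `add_prefixed` and the example base key are hypothetical.

```python
def add_prefixed(state_dict: dict, extra: dict, prefix: str) -> dict:
    """Return a copy of state_dict with extra entries added under prefix."""
    merged = dict(state_dict)
    for key, value in extra.items():
        new_key = f"{prefix}{key}"
        if new_key in merged:
            raise KeyError(f"duplicate key: {new_key}")
        merged[new_key] = value
    return merged


# Strings stand in for tensors in this illustration
base = {"model.embed_tokens.weight": "E"}  # hypothetical base-model key
projector = {"0.weight": "W0", "0.bias": "b0", "2.weight": "W2", "2.bias": "b2"}

merged = add_prefixed(base, projector, prefix="model.mm_projector.")
for key in merged:
    print(key)
```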

In [None]:
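# Create config for LLaVA-Ouro
# Note: the auto_map entries below refer to configuration_llava_ouro.py and
# modeling_llava_ouro.py. This notebook does not create them, so they must be
# added to MERGED_DIR before the model can be loaded with trust_remote_code=True.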
import json

config_data = merged_model.config.to_dict()
config_data["model_type"] = "llava_ouro"
config_data["architectures"] = ["LlavaOuroForCausalLM"]
config_data["mm_vision_tower"] = CONFIG["vision_tower"]
config_data["mm_hidden_size"] = CONFIG["mm_hidden_size"]
config_data["mm_projector_type"] = "mlp2x_gelu"
config_data["auto_map"] = {
    "AutoConfig": "configuration_llava_ouro.LlavaOuroConfig",
    "AutoModelForCausalLM": "modeling_llava_ouro.LlavaOuroForCausalLM"
}

config_path = os.path.join(MERGED_DIR, "config.json")
with open(config_path, 'w') as f:
    json.dump(config_data, f, indent=2)

print("Config saved!")

In [None]:
# Verify
from safetensors import safe_open

safetensor_path = os.path.join(MERGED_DIR, "model.safetensors")
print(f"Model size: {os.path.getsize(safetensor_path) / 1024 / 1024:.2f} MB")

with safe_open(safetensor_path, framework="pt") as f:
    mm_keys = [k for k in f.keys() if 'mm_projector' in k]
    if mm_keys:
        print("\nmm_projector found:")
        for k in mm_keys:
            print(f"  {k}: {f.get_tensor(k).shape}")

## 9. Upload to HuggingFace

In [None]:
# Create model card
model_card = f"""---
license: apache-2.0
language:
- vi
- en
tags:
- vision-language-model
- vlm
- ouro
- looplm
- fastvlm
- vietnamese
base_model: {CONFIG['llm_model']}
datasets:
- {CONFIG['dataset_name']}
---

# Belle-VLM-Ouro: Vietnamese Vision Language Model

Built on **ByteDance Ouro-1.4B** (Looped Language Model).

## Architecture
- **LLM**: Ouro-1.4B (LoopLM, 4 recurrent steps)
- **Vision**: FastViTHD (MobileCLIP)
- **Projector**: MLP 3072 -> 2048

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "{CONFIG['hf_repo']}",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
```

## Training
- Dataset: {CONFIG['dataset_name']}
- Epochs: {CONFIG['num_train_epochs']}
- LoRA: r={CONFIG['lora_r']}, alpha={CONFIG['lora_alpha']}
"""

with open(os.path.join(MERGED_DIR, "README.md"), "w") as f:
    f.write(model_card)

print("Model card created!")

In [None]:
# Upload
from huggingface_hub import HfApi, create_repo

api = HfApi(token=CONFIG["hf_token"])

create_repo(CONFIG["hf_repo"], exist_ok=True, token=CONFIG["hf_token"])

api.upload_folder(
    folder_path=MERGED_DIR,
    repo_id=CONFIG["hf_repo"],
    commit_message="Upload Belle-VLM-Ouro",
)

print(f"\nUploaded to: https://huggingface.co/{CONFIG['hf_repo']}")

## Done!

Model trained with **Ouro-1.4B** and uploaded to HuggingFace.

According to the Ouro authors, the LoopLM architecture provides:
- 2-3x parameter efficiency
- Performance reported as comparable to 4-12B standard models at 1.4B parameters
- Adaptive computation via early exit