# 🧠 SigLIP Person Finder - Complete Google Colab Setup

This notebook provides a complete setup for the open-set person search system using natural language descriptions and a fine-tuned SigLIP model.

## What this notebook does:
1. ✅ Clone the repository
2. ✅ Install all dependencies
3. ✅ Download the dataset
4. ✅ Train the SigLIP model
5. ✅ Run inference on images
6. ✅ Visualize results

## Features:
- Text-based person search using natural language descriptions
- Multi-view ReID dataset with rich semantic attributes
- Real-time person detection with YOLOv8
- Fine-tuned SigLIP model for open-set retrieval

## 🚀 Setup and Installation

In [23]:
# Clone the repository
!git clone https://github.com/AdonaiVera/openset-reid-finetune
%cd openset-reid-finetune
print("✅ Repository cloned successfully!")

Cloning into 'openset-reid-finetune'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 78 (delta 32), reused 51 (delta 14), pack-reused 0 (from 0)
Receiving objects: 100% (78/78), 1.50 MiB | 29.49 MiB/s, done.
Resolving deltas: 100% (32/32), done.
/content/openset-reid-finetune/openset-reid-finetune
✅ Repository cloned successfully!


In [24]:
# Install all required dependencies
# Each requirement is quoted so the shell does not treat ">=" as an output redirect
!pip install -q "datasets>=3.6.0" "fiftyone>=1.5.2" "google-generativeai>=0.8.5" \
    "gradio>=5.33.0" "huggingface-hub>=0.32.3" "numpy>=2.2.6" "pillow>=11.2.1" \
    "python-dotenv>=1.1.0" "sentencepiece>=0.2.0" "spaces>=0.36.0" "torch>=2.7.0" \
    "torchvision>=0.22.0" "transformers>=4.52.4" "ultralytics>=8.3.148" "wandb>=0.19.11" \
    tqdm opencv-python

print("✅ All dependencies installed successfully!")

✅ All dependencies installed successfully!


In [25]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU detected. Training will be slower on CPU.")

CUDA available: True
GPU: Tesla T4
GPU Memory: 15.8 GB


## 📊 Dataset Setup

In [26]:
from huggingface_hub import login

# Paste your Hugging Face token here
login("your-huggingface-token")
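
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One alternative is sketched below, assuming the token is stored in the `HF_TOKEN` environment variable. `resolve_hf_token` is a hypothetical helper introduced here for illustration and is not part of the repository.

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    Hypothetical helper: raises instead of silently using an empty token.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


# Usage in the notebook (requires huggingface_hub):
#   from huggingface_hub import login
#   login(resolve_hf_token())
```

Failing early with a clear message is a design choice here; an interactive prompt would work equally well.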

In [27]:
# Download the dataset from Hugging Face
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

print("📥 Downloading dataset from Hugging Face...")
dataset = load_from_hub(
    repo_id="adonaivera/fiftyone-multiview-reid-attributes",
    dataset_name="fiftyone-multiview-reid2",
    overwrite=True
)

print(f"✅ Dataset loaded successfully!")
print(f"📊 Total samples: {len(dataset)}")
print(f"🏷️ Tags: {dataset.count_values('tags')}")

# Show a sample
sample = dataset.first()
print(f"\n📸 Sample image: {sample.filepath}")
print(f"👤 Person ID: {sample.person_id}")
print(f"📝 Description: {sample.description}")
if sample.attributes:
    print(f"🏷️ Attributes: {sample.attributes}")

📥 Downloading dataset from Hugging Face...
Downloading config file fiftyone.yml from adonaivera/fiftyone-multiview-reid-attributes
Loading dataset
Importing samples...
 100% |███████████████| 6455/6455 [668.7ms elapsed, 0s remaining, 9.8K samples/s]
✅ Dataset loaded successfully!
📊 Total samples: 6455
🏷️ Tags: {'train': 3181, 'query': 3269, 'gallery': 5}

📸 Sample image: /root/fiftyone/huggingface/hub/adonaivera/fiftyone-multiview-reid-attributes/data/00000002_01.jpg
👤 Person ID: 2
📝 Description: The person is a male adult wearing a red t-shirt, blue denim shorts, and dark sandals. He has short, dark hair and is walking.
🏷️ Attributes: {'gender': 'Male', 'age': 'Adult', 'ethnicity': 'Unknown', 'occupation': 'Unknown', 'appearance': {'hair': {'type': 'Short', 'color': 'Black', 'description': 'Short, dark hair.'}, 'beard': {'type': 'None', 'color': 'Unknown', 'description': 'No beard visible.'}, 'expression': {'type': 'Unknown', 'description': 'Facial expression is not clearly visible.'}}, 'posture': {'type': 'Walking', 'description': 'The person is walking.'}, 'actions': {'type': 'Walking', 'description': 'The person is walking.'}, 'clothing': {'upper': {'type': 'T-shirt', 'color': 'Red', 'description': 'A red t-shirt with a logo.'

In [None]:
# Visualize some samples from the dataset
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

def show_samples(dataset, num_samples=6):
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.ravel()

    samples = dataset.limit(num_samples)

    for i, sample in enumerate(samples):
        img = Image.open(sample.filepath)
        axes[i].imshow(img)
        axes[i].set_title(f"Person {sample.person_id}\n{sample.description[:50]}...")
        axes[i].axis('off')

    plt.tight_layout()
    plt.show()

# Show training samples
print("📸 Training samples:")
train_samples = dataset.match_tags("train")
show_samples(train_samples)

# Show query samples
print("🔍 Query samples:")
query_samples = dataset.match_tags("query")
show_samples(query_samples)

## 🧠 Model Training

In [None]:
# Import training modules
import sys
sys.path.append('.')

from utils.datasets import TextImageDataset, load_dataset
from utils.collators import TextImageCollator
from transformers import AutoProcessor, SiglipModel, get_scheduler
from torch.optim import AdamW
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tqdm import tqdm
import os

print("✅ Training modules imported successfully!")

In [None]:
# Training configuration
class TrainingConfig:
    def __init__(self):
        self.epochs = 5  # Reduced for Colab demo
        self.batch_size = 8  # Reduced for Colab memory
        self.lr = 1e-5
        self.save_dir = "models"
        self.patience = 3
        self.temperature = 0.07

config = TrainingConfig()
print(f"⚙️ Training config:")
print(f"   Epochs: {config.epochs}")
print(f"   Batch size: {config.batch_size}")
print(f"   Learning rate: {config.lr}")
print(f"   Save directory: {config.save_dir}")

In [None]:
# Prepare dataloaders
print("📊 Preparing dataloaders...")

# Load model and processor
model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create datasets
train_ds = TextImageDataset(dataset, split="train", processor=processor, augment=True)
val_ds = TextImageDataset(dataset, split="query", processor=processor, augment=False)

# Create dataloaders
train_dl = DataLoader(
    train_ds,
    batch_size=config.batch_size,
    shuffle=True,
    collate_fn=TextImageCollator(processor),
    num_workers=2,  # Reduced for Colab
)

val_dl = DataLoader(
    val_ds,
    batch_size=config.batch_size,
    shuffle=False,
    collate_fn=TextImageCollator(processor),
    num_workers=2,  # Reduced for Colab
)

print(f"✅ Dataloaders created!")
print(f"   Training batches: {len(train_dl)}")
print(f"   Validation batches: {len(val_dl)}")
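
The training cell that follows derives the scheduler length from the number of training batches. The arithmetic can be checked ahead of time with the sketch below, which assumes that `TextImageDataset` yields one item per sample tagged `train` (3,181 in the load output above); that assumption is not verified here.

```python
import math

train_samples = 3181   # "train" tag count reported when the dataset was loaded
batch_size = 8         # TrainingConfig.batch_size
epochs = 5             # TrainingConfig.epochs
warmup_steps = 50      # num_warmup_steps passed to get_scheduler

# DataLoader keeps the last partial batch by default (drop_last=False)
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
```

Under these assumptions the warmup covers roughly 2.5% of the schedule, so changing the batch size or epoch count shifts that fraction unless `num_warmup_steps` is adjusted too.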

In [None]:
# Training loop
print("🚀 Starting training...")

# Create save directory
os.makedirs(config.save_dir, exist_ok=True)

# Optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=config.lr, weight_decay=0.01)
num_training_steps = config.epochs * len(train_dl)
scheduler = get_scheduler("cosine", optimizer=optimizer, num_warmup_steps=50, num_training_steps=num_training_steps)

# Training history
train_losses = []
val_scores = []
best_loss = float("inf")
no_improve_epochs = 0

for epoch in range(config.epochs):
    print(f"\n📈 Epoch {epoch + 1}/{config.epochs}")

    # Training phase
    model.train()
    total_loss = 0

    for step, batch in enumerate(tqdm(train_dl, desc="Training")):
        batch = {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}

        image_features = model.get_image_features(pixel_values=batch['pixel_values'])
        text_features = model.get_text_features(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])

        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        logits_per_image = torch.matmul(image_features, text_features.T) / config.temperature
        labels = torch.arange(len(image_features), device=device)

        loss = (F.cross_entropy(logits_per_image, labels) + F.cross_entropy(logits_per_image.T, labels)) / 2
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        torch.cuda.empty_cache()

        total_loss += loss.item()

        if step % 10 == 0:
            print(f"   Step {step} | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_dl)
    train_losses.append(avg_loss)

    # Validation phase
    model.eval()
    all_image_feats = []
    all_text_feats = []

    with torch.no_grad():
        for batch in tqdm(val_dl, desc="Validation"):
            batch = {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}

            image_feats = model.get_image_features(batch['pixel_values'])
            text_feats = model.get_text_features(batch['input_ids'], batch['attention_mask'])

            image_feats = F.normalize(image_feats, dim=-1)
            text_feats = F.normalize(text_feats, dim=-1)

            all_image_feats.append(image_feats)
            all_text_feats.append(text_feats)

        image_feats = torch.cat(all_image_feats, dim=0)
        text_feats = torch.cat(all_text_feats, dim=0)

        similarity_matrix = image_feats @ text_feats.T
        target = torch.arange(similarity_matrix.size(0)).to(device)

        # Image-to-text retrieval
        top1 = similarity_matrix.topk(1, dim=1).indices.squeeze()
        recall_at_1 = (top1 == target).float().mean().item()

        top5 = similarity_matrix.topk(5, dim=1).indices
        recall_at_5 = (top5 == target.unsqueeze(1)).any(dim=1).float().mean().item()

        val_score = recall_at_5
        val_scores.append(val_score)

    print(f"📊 Epoch {epoch + 1} | Train Loss: {avg_loss:.4f} | Val Recall@1: {recall_at_1:.4f} | Val Recall@5: {recall_at_5:.4f}")

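    # Save the checkpoint with the lowest average training loss.
    # Note: checkpoint selection and early stopping track training loss, not validation recall.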
    if avg_loss < best_loss:
        best_loss = avg_loss
        no_improve_epochs = 0
        best_path = os.path.join(config.save_dir, f"best_model_epoch_{epoch + 1}_loss_{avg_loss:.4f}.pt")
        torch.save(model.state_dict(), best_path)
        print(f"✅ New best model saved to {best_path}")
    else:
        no_improve_epochs += 1
        print(f"⚠️ No improvement. Patience counter: {no_improve_epochs}/{config.patience}")

    if no_improve_epochs >= config.patience:
        print("🛑 Early stopping triggered.")
        break

print("🎉 Training completed!")
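
The loss in the loop above is a symmetric softmax cross-entropy over in-batch image-text pairs, scaled by a fixed temperature of 0.07. This is the CLIP-style contrastive objective, not the pairwise sigmoid loss that SigLIP was pretrained with. The sketch below reproduces the same computation in plain Python on a toy similarity matrix. The function name and the toy values are illustrative assumptions.

```python
import math


def symmetric_contrastive_loss(similarity, temperature=0.07):
    """Average of image-to-text and text-to-image softmax cross-entropy.

    `similarity[i][j]` is the cosine similarity between image i and text j;
    the matching pair for row i is assumed to sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            # log-sum-exp with the max subtracted for numerical stability
            peak = max(row)
            log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_norm - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return (cross_entropy(logits) + cross_entropy(transposed)) / 2


# Toy 3x3 batch: well-aligned pairs vs. completely uninformative similarities
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
uniform = [[0.5] * 3 for _ in range(3)]
print(symmetric_contrastive_loss(aligned))
print(symmetric_contrastive_loss(uniform))  # equals ln(3) for a 3x3 batch
```

A batch with no signal gives a loss of ln(batch size). With a batch size of 8, a training loss near ln(8) ≈ 2.08 would therefore suggest the model is not yet separating pairs.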

In [None]:
# Plot training progress
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot training loss
ax1.plot(train_losses, 'b-', label='Training Loss')
ax1.set_title('Training Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True)

# Plot validation score
ax2.plot(val_scores, 'r-', label='Validation Recall@5')
ax2.set_title('Validation Recall@5')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Recall@5')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

print(f"📈 Final training loss: {train_losses[-1]:.4f}")
print(f"📈 Final validation recall@5: {val_scores[-1]:.4f}")
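
The validation metric treats row `i` of the similarity matrix as correct only when text `i` appears among the top-k columns. The sketch below shows the same Recall@K logic on a toy matrix in plain Python. The values are made up for illustration.

```python
def recall_at_k(similarity, k):
    """Fraction of rows whose diagonal (matching) column is among the top-k scores."""
    hits = 0
    for i, row in enumerate(similarity):
        # Column indices sorted by descending similarity
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: image 0 ranks its own text first, image 1 second, image 2 third
toy = [
    [0.9, 0.2, 0.1],
    [0.7, 0.6, 0.1],
    [0.8, 0.5, 0.3],
]
print(recall_at_k(toy, 1))
print(recall_at_k(toy, 2))
```

Because only the diagonal counts as a hit, several images of the same person with near-identical descriptions may be scored as misses, so the reported recall may understate retrieval quality.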

## 🔍 Inference and Testing

In [None]:
# Load the trained model
print("🔧 Loading trained model...")

# Find the best model file
model_files = [f for f in os.listdir(config.save_dir) if f.endswith('.pt')]
if model_files:
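    # Pick the most recently written checkpoint; sorting names alphabetically would put epoch 10 before epoch 2
    best_model_file = max(model_files, key=lambda f: os.path.getmtime(os.path.join(config.save_dir, f)))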
    model_path = os.path.join(config.save_dir, best_model_file)
    print(f"📁 Loading model from: {model_path}")

    # Load model weights
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    print("✅ Trained model loaded successfully!")
else:
    print("⚠️ No trained model found. Using pretrained model.")
    model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")
    model.to(device)
    model.eval()
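
Checkpoint filenames written by the training loop embed the epoch and the loss, so the best checkpoint can also be chosen by parsing the name. The sketch below assumes the `best_model_epoch_{epoch}_loss_{loss}.pt` naming used above. `pick_best_checkpoint` is a hypothetical helper, not part of the repository.

```python
import re

# Matches the naming scheme used by the training loop:
# best_model_epoch_{epoch}_loss_{loss:.4f}.pt
CHECKPOINT_PATTERN = re.compile(r"best_model_epoch_(\d+)_loss_(\d+\.\d+)\.pt$")


def pick_best_checkpoint(filenames):
    """Return the filename with the lowest recorded loss, or None if none match."""
    parsed = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            epoch, loss = int(match.group(1)), float(match.group(2))
            parsed.append((loss, -epoch, name))
    if not parsed:
        return None
    # Lowest loss wins; ties go to the later epoch
    return min(parsed)[2]


names = [
    "best_model_epoch_2_loss_1.2034.pt",
    "best_model_epoch_10_loss_0.4312.pt",
    "best_model_epoch_9_loss_0.5100.pt",
    "notes.txt",
]
print(pick_best_checkpoint(names))
```

Parsing the loss avoids depending on alphabetical order, which places `epoch_10` before `epoch_2` once training runs past nine epochs.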

In [None]:
# Download YOLOv8 model for person detection
from ultralytics import YOLO

print("📥 Downloading YOLOv8 model...")
detector = YOLO("yolov8n.pt")
print("✅ YOLOv8 model loaded successfully!")

In [None]:
# Test inference on a sample from the dataset
import cv2
from PIL import Image
import numpy as np

def test_inference_on_sample(sample, text_prompt):
    """Test inference on a single sample from the dataset"""

    # Load image
    image = cv2.imread(sample.filepath)
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Detect persons with YOLO (Ultralytics expects BGR channel order for NumPy arrays,
    # so pass the image as loaded by OpenCV)
    results = detector(image)[0]
    boxes = results.boxes

    # Encode prompt text
    with torch.no_grad():
        text_inputs = processor(text=text_prompt, return_tensors="pt", padding=True).to(device)
        text_feat = model.get_text_features(**text_inputs)
        text_feat = torch.nn.functional.normalize(text_feat, dim=-1)

    # Process detections, tracking the best similarity across all person boxes.
    # Starting from 0.0 also covers images where no person is detected.
    max_sim = 0.0
    if boxes is not None:
        for box in boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
            cls_id = int(box.cls[0].item())

            if cls_id != 0:  # Only process persons
                continue

            # Draw default box (green)
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

            # Crop and encode with SigLIP
            crop = rgb_image[y1:y2, x1:x2]
            image_input = processor(images=crop, return_tensors="pt").to(device)

            with torch.no_grad():
                image_feat = model.get_image_features(**image_input)
                image_feat = torch.nn.functional.normalize(image_feat, dim=-1)
                sim = torch.matmul(image_feat, text_feat.T).item()

            max_sim = max(max_sim, sim)

            # Label the similarity
            cv2.putText(image, f"{sim:.2f}", (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

            # Highlight matched person
            if sim > 0.15:  # Threshold
                cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)
                cv2.putText(image, "MATCH!", (x1, y1 - 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)

    return image, max_sim

# Test on a few samples
test_samples = query_samples.limit(3)

for i, sample in enumerate(test_samples):
    print(f"\n🔍 Testing sample {i+1}:")
    print(f"📝 Original description: {sample.description}")

    # Use the original description as the query
    result_image, similarity = test_inference_on_sample(sample, sample.description)

    print(f"📊 Similarity score: {similarity:.4f}")

    # Display result
    plt.figure(figsize=(10, 8))
    plt.imshow(cv2.cvtColor(result_image, cv2.COLOR_BGR2RGB))
    plt.title(f"Query: {sample.description[:50]}...\nSimilarity: {similarity:.4f}")
    plt.axis('off')
    plt.show()

In [None]:
# Custom inference with your own text prompts
def custom_person_search(image_path, text_prompt, similarity_threshold=0.15):
    """Search for a person in an image using a text description"""

    # Load image
    image = cv2.imread(image_path)
    if image is None:
        print(f"❌ Could not load image: {image_path}")
        return None, 0.0

    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Detect persons with YOLO (Ultralytics expects BGR channel order for NumPy arrays,
    # so pass the image as loaded by OpenCV)
    results = detector(image)[0]
    boxes = results.boxes

    # Encode prompt text
    with torch.no_grad():
        text_inputs = processor(text=text_prompt, return_tensors="pt", padding=True).to(device)
        text_feat = model.get_text_features(**text_inputs)
        text_feat = torch.nn.functional.normalize(text_feat, dim=-1)

    max_similarity = 0.0

    # Process detections
    if boxes is not None:
        for box in boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].cpu().numpy())
            cls_id = int(box.cls[0].item())

            if cls_id != 0:  # Only process persons
                continue

            # Draw default box (green)
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

            # Crop and encode with SigLIP
            crop = rgb_image[y1:y2, x1:x2]
            image_input = processor(images=crop, return_tensors="pt").to(device)

            with torch.no_grad():
                image_feat = model.get_image_features(**image_input)
                image_feat = torch.nn.functional.normalize(image_feat, dim=-1)
                sim = torch.matmul(image_feat, text_feat.T).item()

            max_similarity = max(max_similarity, sim)

            # Label the similarity
            cv2.putText(image, f"{sim:.2f}", (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

            # Highlight matched person
            if sim > similarity_threshold:
                cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)
                cv2.putText(image, "MATCH!", (x1, y1 - 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)

    return image, max_similarity

# Test with a sample from the dataset
sample = query_samples.first()
print(f"🔍 Testing custom search on: {sample.filepath}")
print(f"📝 Query: {sample.description}")

result_image, similarity = custom_person_search(sample.filepath, sample.description)

if result_image is not None:
    plt.figure(figsize=(12, 8))
    plt.imshow(cv2.cvtColor(result_image, cv2.COLOR_BGR2RGB))
    plt.title(f"Query: {sample.description}\nMax Similarity: {similarity:.4f}")
    plt.axis('off')
    plt.show()

    print(f"✅ Search completed! Max similarity: {similarity:.4f}")
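
Detector boxes converted to integers can in principle extend past the image edge or collapse to zero width or height. Negative indices wrap around in NumPy slicing, and an empty crop cannot be encoded. The sketch below shows one way to guard the cropping step. `clamp_box` is a hypothetical helper, not part of the repository.

```python
def clamp_box(x1, y1, x2, y2, width, height):
    """Clip a box to the image bounds; return None if nothing is left to crop."""
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2


# A box hanging over the top-left and bottom edges of a 640x480 image
print(clamp_box(-5, 10, 50, 700, width=640, height=480))
# A degenerate box with zero width
print(clamp_box(100, 100, 100, 200, width=640, height=480))
```

Inside the search functions, `height, width = rgb_image.shape[:2]` would supply the bounds, and a `None` result would be a signal to skip that detection.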

## 📤 Upload Your Own Images (Optional)

In [None]:
# Upload your own image for testing
from google.colab import files
import os

print("📤 Upload an image to test the person search system...")
uploaded = files.upload()

# Get the uploaded file, stopping early if nothing was uploaded
if not uploaded:
    raise ValueError("No file was uploaded.")
uploaded_filename = next(iter(uploaded))
print(f"✅ Uploaded: {uploaded_filename}")

# Test with different text prompts
test_prompts = [
    "A person wearing casual clothes",
    "A man in a business suit",
    "A woman with a handbag",
    "Someone wearing jeans and a t-shirt",
    "A person carrying a backpack"
]

for prompt in test_prompts:
    print(f"\n🔍 Testing prompt: {prompt}")
    result_image, similarity = custom_person_search(uploaded_filename, prompt)

    if result_image is not None:
        plt.figure(figsize=(10, 8))
        plt.imshow(cv2.cvtColor(result_image, cv2.COLOR_BGR2RGB))
        plt.title(f"Query: {prompt}\nSimilarity: {similarity:.4f}")
        plt.axis('off')
        plt.show()

        print(f"📊 Similarity score: {similarity:.4f}")

## 🎉 Summary

Congratulations! You've successfully:

- ✅ **Cloned the repository** and set up the environment
- ✅ **Installed all dependencies** including PyTorch, Transformers, and YOLOv8
- ✅ **Downloaded the dataset** with rich semantic attributes
- ✅ **Trained a SigLIP model** for open-set person search
- ✅ **Performed inference** on images with text descriptions
- ✅ **Tested the system** with custom prompts and uploaded images

## 🔧 Key Features Implemented:

- **Text-based person search** using natural language descriptions
- **Multi-view ReID dataset** with 6,455 samples and rich attributes
- **YOLOv8 person detection** for automatic bounding box generation
- **Fine-tuned SigLIP model** for improved text-image similarity
- **Cosine similarity scoring** with configurable thresholds
- **Video inference** with optimized tracking, via the repository's `inference_video.py` script (not run in this notebook)

## 🚀 Next Steps:

1. **Experiment with different text prompts** to find the best descriptions
2. **Adjust similarity thresholds** based on your use case
3. **Try video inference** using the `inference_video.py` script
4. **Fine-tune the model** on your own dataset for better performance
5. **Deploy the model** for real-world applications
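
For step 2, one way to choose a threshold is to sweep candidate values over a small labeled set and compare precision and recall. The sketch below uses made-up scores and labels purely for illustration. `sweep_thresholds` is a hypothetical helper, not part of the repository.

```python
def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold.

    `scores` are similarity values; `labels` are True when the crop really
    matches the text query.
    """
    results = []
    positives = sum(labels)
    for t in thresholds:
        predicted = [s > t for s in scores]
        true_pos = sum(p and l for p, l in zip(predicted, labels))
        flagged = sum(predicted)
        # Convention: precision is 1.0 when nothing is flagged
        precision = true_pos / flagged if flagged else 1.0
        recall = true_pos / positives if positives else 0.0
        results.append((t, precision, recall))
    return results


# Hypothetical scores and ground-truth labels, for illustration only
scores = [0.05, 0.12, 0.18, 0.22, 0.30]
labels = [False, False, True, False, True]
for t, p, r in sweep_thresholds(scores, labels, [0.10, 0.15, 0.25]):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold generally trades recall for precision, so the right value depends on whether missed matches or false alarms cost more in your application.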

## 📚 Resources:

- [Original Repository](https://github.com/AdonaiVera/openset-reid-finetune)
- [Hugging Face Model](https://huggingface.co/adonaivera/siglip-person-search-openset)
- [Dataset](https://huggingface.co/datasets/adonaivera/fiftyone-multiview-reid-attributes)
- [Gradio Demo](https://huggingface.co/spaces/adonaivera/siglip-person-finder)

Happy person searching! 🕵️‍♀️