# VGGT Preprocessing for Nerfstudio

This notebook demonstrates how to use VGGT (Visual Geometry Grounded Transformer) for structure-from-motion preprocessing with nerfstudio.

**VGGT** is now integrated directly into this project's custom nerfstudio implementation!

## What is VGGT?

VGGT uses deep learning to estimate camera poses and depth maps directly from images, without traditional feature matching. This can be faster than COLMAP, and more robust in scenes where feature matching struggles, such as textureless or repetitive ones.

## How VGGT Integration Works

The nerfstudio integration (`nerfstudio/process_data/vggt_utils.py`) runs VGGT inference and converts the output to **COLMAP-compatible format**:
- `colmap/sparse/0/cameras.bin` - Camera intrinsics
- `colmap/sparse/0/images.bin` - Camera poses and 2D keypoints
- `colmap/sparse/0/points3D.bin` - 3D point cloud with tracks
- `transforms.json` - Nerfstudio format (generated from COLMAP output)

This means VGGT output can be used exactly like COLMAP output!

## Prerequisites

Make sure VGGT is installed:
```bash
pip install git+https://github.com/facebookresearch/vggt.git
```
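
To confirm the install before running the rest of the notebook, a quick import check can help. This is a minimal sketch: it assumes the package installs under the top-level module name `vggt`, and `is_installed` is a hypothetical helper, not part of nerfstudio or VGGT.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the VGGT repository installs a top-level module named "vggt".
if is_installed("vggt"):
    print("VGGT is available")
else:
    print("VGGT not found; install it with the pip command above")
```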

## Usage Modes

1. **Using Splatter wrapper** (recommended): `splatter.preprocess(sfm_tool='vggt')`
2. **Using ns-process-data CLI**: `ns-process-data video --sfm-tool vggt`
3. **Using Python API directly**: `vggt_utils.run_vggt(image_dir, colmap_dir, ...)`
4. **Legacy method**: `splatter.preprocess_vggt()` (deprecated, kept for compatibility)

In [1]:
# Verify conda environment
import sys
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check if we're in the nerfstudio environment
if 'nerfstudio' not in sys.executable:
    print("\n⚠️  WARNING: Not running in nerfstudio conda environment!")
    print("Please activate with: conda activate nerfstudio")
else:
    print("\n✓ Running in nerfstudio environment")

Python executable: /opt/conda/envs/nerfstudio/bin/python
Python version: 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:45:41) [GCC 13.3.0]

✓ Running in nerfstudio environment


In [2]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from collab_splats.wrapper import Splatter, SplatterConfig

print("✓ Imports complete")

Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.
✓ Imports complete


## Configuration

Set up your input video and output paths.

In [3]:
# Configuration
base_dir = Path("/workspace/fieldwork-data/")
# session_dir = base_dir / "rats/2024-07-11/SplatsSD" / "C0119.MP4"
session_dir = base_dir / "birds/2024-02-06/SplatsSD" / "C0043.MP4"

# Create splatter configuration
splatter_config = SplatterConfig(
    file_path=session_dir,
    method="rade-features",
    frame_proportion=0.01,
    min_frames=15,
)

# Initialize the Splatter class
splatter = Splatter(splatter_config)

print(f"Input video: {splatter.config['file_path']}")
print(f"Output path: {splatter.config['output_path']}")

Input video: /workspace/fieldwork-data/birds/2024-02-06/SplatsSD/C0043.MP4
Output path: /workspace/fieldwork-data/birds/2024-02-06/environment/C0043
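
The `frame_proportion` and `min_frames` settings together determine how many frames are sampled from the video. The sketch below shows one plausible rule: take a fixed proportion of the frames, but never fewer than the minimum. This is an assumption, not the wrapper's confirmed logic, and `frames_to_sample` is a hypothetical helper.

```python
def frames_to_sample(total_frames: int, frame_proportion: float, min_frames: int) -> int:
    """Sample a fixed proportion of frames, but never fewer than `min_frames`."""
    return max(int(total_frames * frame_proportion), min_frames)


# The video used in this notebook has 2388 frames.
print(frames_to_sample(2388, 0.01, 15))  # 23
# A short clip falls back to the minimum.
print(frames_to_sample(500, 0.01, 15))   # 15
```

For the 2388-frame video, this rule gives 23, which agrees with the "Number of frames to sample" line printed during preprocessing.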


## Method 1: Using Splatter Wrapper (Recommended)

The simplest way to use VGGT is to pass `sfm_tool='vggt'` to `preprocess()`.

In [4]:
# Run preprocessing with VGGT
# This will:
# 1. Extract frames from the video
# 2. Run VGGT to estimate camera poses and depth
# 3. Generate transforms.json for nerfstudio training

preproc_kwargs = {
    "refine_vggt_ba": "",  # Empty value is presumably passed as a bare flag (i.e. enabled); not verified
}
splatter.preprocess(
    sfm_tool='vggt', 
    overwrite=True,
    kwargs=preproc_kwargs
) 

print("\n✓ VGGT preprocessing complete!")
print(f"Output directory: {splatter.config['preproc_data_path']}")

Number of frames to sample:  23


Number of frames in video: 2388
Extracting 24 frames in evenly spaced intervals
[16:46:06] process_data_utils.py:227: 🎉 Done converting video to images.
Using device: cuda
Loading VGGT model: facebook/VGGT-1B
Found 24 images
Running VGGT inference...
  - Images shape after preprocessing: torch.Size([24, 3, 518, 518])

Converting tensors to numpy...
  - predictions keys: ['pose_enc', 'pose_enc_list', 'depth', 'depth_conf', 'world_points', 'world_points_conf', 'images']
  - pose_enc shape before squeeze: torch.Size([1, 24, 9])
  - depth shape before squeeze: torch.Size([1, 24, 518, 518, 1])
  - depth_conf shape before squeeze: torch.Size([1, 24, 518, 518])
  - world_points shape before squeeze: torch.Size([1, 24, 518, 518, 3])
  - world_points_conf shape before squeeze: torch.Size([1, 24, 518

Using cache found in /workspace/models/hub/facebookresearch_dinov2_main

For faster inference, consider disabling fine_tracking
Predicting tracks for query frame 0
[16:46:52] process_data.py:564: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvi

## Inspecting VGGT Output

After running VGGT, you can inspect the generated COLMAP files:

In [None]:
# Inspect the COLMAP output from VGGT
import struct
from pathlib import Path

def inspect_colmap_output(colmap_dir):
    """Inspect COLMAP binary files generated by VGGT."""
    sparse_dir = Path(colmap_dir) / "sparse" / "0"
    
    print(f"Checking COLMAP output in: {sparse_dir}\n")
    
    # Check for required files
    cameras_file = sparse_dir / "cameras.bin"
    images_file = sparse_dir / "images.bin"
    points3D_file = sparse_dir / "points3D.bin"
    
    if not sparse_dir.exists():
        print("❌ COLMAP sparse directory not found. Run VGGT first!")
        return
    
    # Read cameras.bin
    if cameras_file.exists():
        with open(cameras_file, "rb") as f:
            num_cameras = struct.unpack("<Q", f.read(8))[0]
        print(f"✓ cameras.bin: {num_cameras} cameras")
    else:
        print("❌ cameras.bin not found")
    
    # Read images.bin
    if images_file.exists():
        with open(images_file, "rb") as f:
            num_images = struct.unpack("<Q", f.read(8))[0]
        print(f"✓ images.bin: {num_images} camera poses")
    else:
        print("❌ images.bin not found")
    
    # Read points3D.bin
    if points3D_file.exists():
        with open(points3D_file, "rb") as f:
            num_points = struct.unpack("<Q", f.read(8))[0]
        print(f"✓ points3D.bin: {num_points:,} 3D points")
    else:
        print("❌ points3D.bin not found")
    
    print("\nThese files are in COLMAP binary format and can be:")
    print("  - Visualized with COLMAP GUI: colmap gui")
    print("  - Converted to nerfstudio format (transforms.json) automatically")
    print("  - Used with any tool that reads COLMAP format")

# Example usage (requires that VGGT preprocessing has already been run):
inspect_colmap_output(splatter.config['preproc_data_path'] / 'colmap')

In [None]:
import torch
import json
from nerfstudio.process_data.colmap_utils import create_ply_from_colmap

data_path = splatter.config["preproc_data_path"]

with open(data_path / "transforms.json") as f:
    transforms = json.load(f)

applied_transform = torch.tensor(transforms["applied_transform"])

ply_filename = "sparse_pc.ply"
create_ply_from_colmap(
    filename=ply_filename,
    recon_dir=data_path / "colmap" / "sparse" / "0",
    output_dir=data_path,
    applied_transform=applied_transform,
)
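ply_file_path = data_path / ply_filename

# Record the point cloud in transforms.json and write it back to disk;
# without this, the new key only exists in memory.
transforms["ply_file_path"] = ply_filename
with open(data_path / "transforms.json", "w", encoding="utf-8") as f:
    json.dump(transforms, f, indent=4)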

In [None]:
import pyvista as pv

from collab_splats.utils.visualization import (
    CAMERA_KWARGS,
    MESH_KWARGS,
    VIZ_KWARGS,
    visualize_splat,
)

# Load the sparse point cloud written by the previous cell.
# An archived copy is also available at:
# /workspace/fieldwork-data/birds/2024-02-06/environment/C0043/archive/sparse_pc.ply
splat = pv.PolyData(ply_file_path)

pcd_kwargs = MESH_KWARGS.copy()
pcd_kwargs.update(
    {
        "point_size": 2,
        "render_points_as_spheres": True,
        "ambient": 0.3,
        "diffuse": 0.8,
        "specular": 0.1,
    }
)

plotter = visualize_splat(
    mesh=splat,
    mesh_kwargs=pcd_kwargs,
    viz_kwargs=VIZ_KWARGS,
)

plotter.show()

## Training the Model

After preprocessing with VGGT, train your model as usual. The workflow is identical regardless of which SfM tool you used:

In [None]:
# Train the model (same for any SfM tool)
feature_kwargs = {
    "pipeline.model.output-depth-during-training": True,
    "pipeline.model.rasterize-mode": "antialiased",
    "pipeline.model.use-scale-regularization": True,
    "pipeline.model.random-scale": 1.0,
    # "pipeline.model.cull-alpha-thresh": 0.01,
    "pipeline.model.collider-params": "near_plane 0.1 far_plane 3.0",
}

splatter.extract_features(kwargs=feature_kwargs, overwrite=True)
print("\n✓ Training complete!")

## Method 2: Using the ns-process-data CLI

You can also use VGGT directly from the command line with `ns-process-data`:

```bash
# Process video with VGGT
ns-process-data video \
    --data /path/to/video.mp4 \
    --output-dir /path/to/output \
    --sfm-tool vggt

# Process images with VGGT
ns-process-data images \
    --data /path/to/images/ \
    --output-dir /path/to/output \
    --sfm-tool vggt
```
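
To launch the same commands from Python, for example in a batch script, the argument list can be assembled programmatically. This is a minimal sketch that uses only the flags shown above; `build_process_command` is a hypothetical helper, not part of nerfstudio.

```python
import shlex
from pathlib import Path


def build_process_command(kind: str, data: Path, output_dir: Path, sfm_tool: str = "vggt") -> list[str]:
    """Assemble the ns-process-data argument list for a video or an image folder."""
    if kind not in {"video", "images"}:
        raise ValueError(f"kind must be 'video' or 'images', got {kind!r}")
    return [
        "ns-process-data", kind,
        "--data", str(data),
        "--output-dir", str(output_dir),
        "--sfm-tool", sfm_tool,
    ]


cmd = build_process_command("video", Path("/path/to/video.mp4"), Path("/path/to/output"))
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems with paths that contain spaces.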

### Behind the Scenes

When you run `ns-process-data` with `--sfm-tool vggt`, the pipeline:
1. Extracts or copies the images to the output directory.
2. Calls `run_vggt()` from `nerfstudio/process_data/vggt_utils.py`.
3. Runs VGGT inference and writes the COLMAP binary files.
4. Converts COLMAP → `transforms.json` with `colmap_utils.colmap_to_json()`.

You can then train with `ns-train rade-features --data /path/to/output`.

The pipeline is the same whether you use the Splatter wrapper, the CLI, or the Python API.

## When to Use VGGT vs COLMAP

**Use VGGT when:**
- You have textureless or repetitive scenes (where COLMAP struggles)
- You want faster preprocessing
- You have good lighting and clear images

**Use COLMAP when:**
- You need maximum accuracy
- You have well-textured scenes
- You have challenging camera motions or occlusions

**Use hloc when:**
- You want modern deep features (SuperPoint + SuperGlue)
- You need a balance between speed and accuracy

## Key Integration Details

### VGGT → COLMAP Format Conversion

The nerfstudio integration (`nerfstudio/process_data/vggt_utils.py`) performs the following conversions:

**1. Camera Poses:**
- VGGT outputs: 4×4 extrinsic matrices (world-to-camera transform)
- Converted to: COLMAP quaternion + translation format
- Stored in: `images.bin` (COLMAP binary format)
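
The pose conversion can be sketched as follows. COLMAP stores each pose as a world-to-camera rotation quaternion in `(qw, qx, qy, qz)` order plus a translation vector, so a world-to-camera extrinsic matrix needs no inversion. This is an illustrative pure-Python version, not the code in `vggt_utils.py`; both function names are hypothetical.

```python
import math


def rotmat_to_qvec(R):
    """Convert a 3x3 rotation matrix (nested lists) to a unit quaternion (qw, qx, qy, qz)."""
    trace = R[0][0] + R[1][1] + R[2][2]
    # Branch on the largest diagonal term to keep the square root well conditioned.
    if trace > 0:
        s = 2.0 * math.sqrt(trace + 1.0)  # s = 4 * qw
        q = (0.25 * s, (R[2][1] - R[1][2]) / s, (R[0][2] - R[2][0]) / s, (R[1][0] - R[0][1]) / s)
    elif R[0][0] > R[1][1] and R[0][0] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2])  # s = 4 * qx
        q = ((R[2][1] - R[1][2]) / s, 0.25 * s, (R[0][1] + R[1][0]) / s, (R[0][2] + R[2][0]) / s)
    elif R[1][1] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2])  # s = 4 * qy
        q = ((R[0][2] - R[2][0]) / s, (R[0][1] + R[1][0]) / s, 0.25 * s, (R[1][2] + R[2][1]) / s)
    else:
        s = 2.0 * math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1])  # s = 4 * qz
        q = ((R[1][0] - R[0][1]) / s, (R[0][2] + R[2][0]) / s, (R[1][2] + R[2][1]) / s, 0.25 * s)
    # q and -q encode the same rotation; pick the one with qw >= 0.
    return q if q[0] >= 0 else tuple(-c for c in q)


def extrinsic_to_colmap(T):
    """Split a 4x4 world-to-camera matrix into COLMAP's (qvec, tvec)."""
    R = [row[:3] for row in T[:3]]
    tvec = tuple(row[3] for row in T[:3])
    return rotmat_to_qvec(R), tvec


# Example: a 90-degree rotation about the z-axis with translation (1, 2, 3).
T = [[0.0, -1.0, 0.0, 1.0],
     [1.0,  0.0, 0.0, 2.0],
     [0.0,  0.0, 1.0, 3.0],
     [0.0,  0.0, 0.0, 1.0]]
qvec, tvec = extrinsic_to_colmap(T)
print(qvec, tvec)
```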

**2. Camera Intrinsics:**
- VGGT estimates: Per-image intrinsic matrices (fx, fy, cx, cy)
- Stored in: `cameras.bin` with PINHOLE camera model
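
For reference, the `cameras.bin` layout for PINHOLE cameras can be sketched with `struct`. The layout follows COLMAP's binary format: a `uint64` camera count, then per camera an `int32` id, an `int32` model id (1 for PINHOLE), `uint64` width and height, and four `float64` parameters. The helper names and the focal length values are illustrative assumptions; only the 518×518 image size comes from the log above.

```python
import struct

PINHOLE_MODEL_ID = 1  # COLMAP's model id for PINHOLE (params: fx, fy, cx, cy)


def pack_pinhole_cameras(cameras):
    """Serialize {camera_id: (width, height, fx, fy, cx, cy)} in cameras.bin layout."""
    blob = struct.pack("<Q", len(cameras))
    for camera_id, (width, height, fx, fy, cx, cy) in cameras.items():
        blob += struct.pack("<iiQQ", camera_id, PINHOLE_MODEL_ID, width, height)
        blob += struct.pack("<4d", fx, fy, cx, cy)
    return blob


def unpack_pinhole_cameras(blob):
    """Parse bytes produced by pack_pinhole_cameras back into a dict."""
    (num_cameras,) = struct.unpack_from("<Q", blob, 0)
    offset = 8
    cameras = {}
    for _ in range(num_cameras):
        camera_id, model_id, width, height = struct.unpack_from("<iiQQ", blob, offset)
        offset += struct.calcsize("<iiQQ")
        if model_id != PINHOLE_MODEL_ID:
            raise ValueError(f"this sketch only handles PINHOLE, got model id {model_id}")
        params = struct.unpack_from("<4d", blob, offset)
        offset += struct.calcsize("<4d")
        cameras[camera_id] = (width, height, *params)
    return cameras


# Hypothetical intrinsics for one 518x518 image.
cams = {1: (518, 518, 500.0, 500.0, 259.0, 259.0)}
blob = pack_pinhole_cameras(cams)
print(len(blob), unpack_pinhole_cameras(blob))
```

The leading `uint64` is the same count that `inspect_colmap_output` reads from the start of each file.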

**3. 3D Points:**
- VGGT outputs: Dense depth maps for each image
- Unprojected to: 3D world coordinates using camera poses
- Filtered by: Confidence threshold (default: top 50%)
- Stored in: `points3D.bin` with RGB colors and track information
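
The unprojection and confidence filtering described above can be sketched with NumPy. This is a hedged illustration, not the code in `vggt_utils.py`: it assumes depth is measured along the optical axis, a pinhole intrinsic matrix `K`, and a world-to-camera extrinsic; `unproject_depth` and `filter_by_confidence` are hypothetical names.

```python
import numpy as np


def unproject_depth(depth, K, world_to_cam):
    """Back-project an (H, W) depth map to (H*W, 3) world-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grids, each (H, W)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    cam_points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    # Invert x_cam = R @ x_world + t, i.e. x_world = R^T @ (x_cam - t),
    # written for row vectors as (x_cam - t) @ R.
    return (cam_points - t) @ R


def filter_by_confidence(points, conf, keep_fraction=0.5):
    """Keep the `keep_fraction` most confident points (default: top 50%)."""
    threshold = np.quantile(conf, 1.0 - keep_fraction)
    return points[conf.reshape(-1) >= threshold]


# Tiny 2x2 example with made-up values.
depth = np.full((2, 2), 2.0)
K = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
world_to_cam = np.eye(4)
world_to_cam[:3, 3] = [0.0, 0.0, 1.0]
conf = np.array([[0.1, 0.9], [0.4, 0.8]])

points = unproject_depth(depth, K, world_to_cam)
kept = filter_by_confidence(points, conf)
print(points.shape, kept.shape)
```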

**4. Point Tracks:**
- VGGT depth → 3D points are matched across views by spatial proximity
- Tracks link 2D observations (in `images.bin`) to 3D points (in `points3D.bin`)

### Implementation Files

- **Main integration**: `nerfstudio/process_data/vggt_utils.py` (lines 35-400)
- **CLI hook**: `nerfstudio/process_data/colmap_converter_to_nerfstudio_dataset.py` (lines 242-251)
- **Splatter wrapper**: `collab_splats/wrapper/splatter.py` (lines 132-236)
- **Legacy (deprecated)**: `collab_splats/utils/vggt_utils.py`