# VGGT Preprocessing for Nerfstudio - Direct API Usage

This notebook demonstrates how to use VGGT (Visual Geometry Grounded Transformer) for structure-from-motion preprocessing by calling the internal `vggt_utils` functions **directly**, bypassing the `ns-process-data` CLI wrapper.

**VGGT** is integrated into this custom nerfstudio build through `nerfstudio/process_data/vggt_utils.py`.

## What is VGGT?

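VGGT is a feed-forward deep learning model that estimates camera poses and depth maps directly from images, without the feature extraction and matching stages of a traditional SfM pipeline. This can make it faster than COLMAP, and more robust in scenes where feature matching tends to fail, such as textureless or repetitive ones.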

## Direct API Approach (This Notebook)

This notebook uses the **internal Python API** for easier prototyping and debugging:
- `nerfstudio.process_data.vggt_utils.run_vggt_ba()` - VGGT with bundle adjustment
- `nerfstudio.process_data.vggt_utils.run_vggt()` - VGGT without bundle adjustment
- `nerfstudio.process_data.colmap_utils.colmap_to_json()` - Convert COLMAP to transforms.json
- `nerfstudio.process_data.process_data_utils.convert_video_to_images()` - Extract video frames

Benefits of direct API usage:
- Full control over parameters
- Easier to prototype and experiment
- Better error handling and debugging
- No subprocess overhead
- Can modify and test code changes immediately

## Output Format

The internal functions generate COLMAP-compatible format:
- `colmap/sparse/0/cameras.bin` - Camera intrinsics
- `colmap/sparse/0/images.bin` - Camera poses and 2D keypoints
- `colmap/sparse/0/points3D.bin` - 3D point cloud with tracks
- `transforms.json` - Nerfstudio format (generated from COLMAP output)
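
As a quick sanity check after a run, the layout above can be verified from Python. The sketch below is a minimal example, not part of nerfstudio: `missing_outputs` and `read_colmap_count` are hypothetical helper names, and the paths are assumed to be relative to the `preproc` directory used later in this notebook. It relies on the fact that each COLMAP binary model file begins with a little-endian `uint64` record count.

```python
import struct
from pathlib import Path

# Relative paths listed in the "Output Format" section
# (assumed to be relative to the preproc directory).
EXPECTED_OUTPUTS = (
    "colmap/sparse/0/cameras.bin",
    "colmap/sparse/0/images.bin",
    "colmap/sparse/0/points3D.bin",
    "transforms.json",
)


def missing_outputs(preproc_dir: Path) -> list[str]:
    """Return the expected output files that are absent under preproc_dir."""
    return [rel for rel in EXPECTED_OUTPUTS if not (preproc_dir / rel).is_file()]


def read_colmap_count(bin_path: Path) -> int:
    """Read the leading little-endian uint64 record count of a COLMAP .bin file."""
    with open(bin_path, "rb") as f:
        header = f.read(8)
    if len(header) != 8:
        raise ValueError(f"{bin_path} is too short to be a COLMAP binary file")
    return struct.unpack("<Q", header)[0]
```

For example, `read_colmap_count(preproc_dir / "colmap/sparse/0/images.bin")` gives the number of registered images, which is a cheap way to spot a reconstruction that registered only a few frames.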

## Prerequisites

Make sure VGGT is installed:
```bash
pip install git+https://github.com/facebookresearch/vggt.git
```
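
To confirm the installation from inside the notebook, a small check such as the following can help. It is a sketch that assumes the packages expose the top-level module names `vggt`, `nerfstudio`, and `pycolmap`; `is_installed` is a helper name introduced here for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Module names are assumptions about how the packages are exposed.
for name in ("vggt", "nerfstudio", "pycolmap"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```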

In [None]:
%load_ext autoreload
%autoreload 2

# Verify conda environment
import sys
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check if we're in the nerfstudio environment
if 'nerfstudio' not in sys.executable:
    print("\n⚠️  WARNING: Not running in nerfstudio conda environment!")
    print("Please activate with: conda activate nerfstudio")
else:
    print("\n✓ Running in nerfstudio environment")

from pathlib import Path
import cv2

# Import internal nerfstudio utilities
from nerfstudio.process_data import vggt_utils
from nerfstudio.process_data import colmap_utils
from nerfstudio.process_data.process_data_utils import (
    convert_video_to_images,
    CameraModel,
)

print("✓ Imports complete")

## Configuration

Set up your input video and output paths.

In [None]:
# Configuration
base_dir = Path("/workspace/fieldwork-data/")
video_path = base_dir / "birds/2024-02-06/SplatsSD" / "C0043.MP4"

# Output configuration
output_dir = base_dir / "birds/2024-02-06/environment/C0043"
preproc_dir = output_dir / "preproc"
image_dir = preproc_dir / "images"
colmap_dir = preproc_dir / "colmap"

# Sampling configuration
frame_proportion = 0.08
min_frames = 15

print(f"Input video: {video_path}")
print(f"Output path: {output_dir}")
print(f"Images will be saved to: {image_dir}")
print(f"COLMAP output will be saved to: {colmap_dir}")

## Step 1: Extract Video Frames

First, we'll extract frames from the video using nerfstudio's built-in utilities.
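
The number of frames to sample is the video's frame count scaled by `frame_proportion`, raised to at least `min_frames`, and capped at the total frame count. The sketch below restates that rule as a standalone function so it can be checked without opening a video; `compute_num_samples` is a name introduced here, not a nerfstudio function.

```python
def compute_num_samples(n_frames: int, frame_proportion: float, min_frames: int) -> int:
    """Clamp the proportional sample count to the range [min_frames, n_frames]."""
    n_samples = int(n_frames * frame_proportion)
    n_samples = max(n_samples, min_frames)
    return min(n_samples, n_frames)  # Can't sample more than the total frame count


# With frame_proportion=0.08 and min_frames=15:
print(compute_num_samples(2388, 0.08, 15))  # 191 (proportional count)
print(compute_num_samples(100, 0.08, 15))   # 15 (raised to min_frames)
print(compute_num_samples(10, 0.08, 15))    # 10 (capped at the frame count)
```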

In [None]:
# Calculate number of frames to sample
video_capture = cv2.VideoCapture(str(video_path))
n_frames = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
n_samples = int(n_frames * frame_proportion)
n_samples = max(n_samples, min_frames)
n_samples = min(n_samples, n_frames)  # Can't sample more than total frames
video_capture.release()

print(f"Total frames in video: {n_frames}")
print(f"Frames to sample: {n_samples}")

# Frame extraction is disabled here; uncomment the block below to extract frames.
# image_dir.mkdir(parents=True, exist_ok=True)
#
# # Check if frames already exist
# existing_images = list(image_dir.glob("frame_*.png"))
# if len(existing_images) > 0 and len(existing_images) == (n_samples + 1):
#     print(f"\n✓ {len(existing_images)} frames already extracted")
# else:
#     print(f"\nExtracting {n_samples} frames...")
#     summary_log, num_extracted_frames = convert_video_to_images(
#         video_path,
#         image_dir=image_dir,
#         num_frames_target=n_samples,
#         num_downscales=0,
#         crop_factor=(0.0, 0.0, 0.0, 0.0),
#         verbose=True,
#         image_prefix="frame_",
#         keep_image_dir=False,
#     )
#     print(f"✓ Extracted {num_extracted_frames} frames to {image_dir}")

## Step 2: Run VGGT with Bundle Adjustment

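Now we'll run VGGT inference directly through the internal API, which gives us full control over parameters. The first cell runs VGGT on its own (`run_vggt`); the second adds bundle adjustment (`run_vggt_ba`).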

In [None]:
# VGGT parameters
verbose = True

# Run VGGT without bundle adjustment
vggt_utils.run_vggt(
    image_dir=image_dir,
    colmap_dir=colmap_dir,
    camera_model="PINHOLE",
    scale_factor=2.5,
    verbose=verbose,
    use_global_alignment=True,
)

print("\n✓ VGGT reconstruction complete!")
print(f"COLMAP files saved to: {colmap_dir / 'sparse' / '0'}")

In [None]:
# VGGT parameters
camera_model = CameraModel.OPENCV
verbose = True

# Bundle adjustment parameters
ba_refine_focal_length = True
ba_refine_principal_point = False
ba_refine_extra_params = False

# Track prediction parameters (adjust these for memory constraints)
max_query_pts = 2048  # Lower to 1024 or 512 if OOM
max_points_num = 512
query_frame_num = 5   # Lower to 3 if OOM
fine_tracking = True  # Set to False if OOM (significant memory reduction)

# Reference call using the parameters defined above (disabled; the active call follows)
# vggt_utils.run_vggt_ba(
#     image_dir=image_dir,
#     colmap_dir=colmap_dir,
#     camera_model=camera_model,
#     verbose=verbose,
#     ba_refine_focal_length=ba_refine_focal_length,
#     ba_refine_principal_point=ba_refine_principal_point,
#     ba_refine_extra_params=ba_refine_extra_params,
#     max_query_pts=max_query_pts,
#     max_points_num=max_points_num,
#     query_frame_num=query_frame_num,
#     fine_tracking=fine_tracking,
# )

vggt_utils.run_vggt_ba(
    image_dir=image_dir,
    colmap_dir=colmap_dir,
    camera_model="OPENCV",
    use_global_alignment=True,  # Enable GA for better pose accuracy
    max_query_pts=4096,         # Number of feature points for matching
    shared_camera=False,
    verbose=True,
)

print("\n✓ VGGT reconstruction complete!")
print(f"COLMAP files saved to: {colmap_dir / 'sparse' / '0'}")

In [None]:
## Step 3: Convert COLMAP to Nerfstudio Format

# Convert COLMAP reconstruction to transforms.json
print("Converting COLMAP to transforms.json...")

# colmap_utils.colmap_to_json(
#     recon_dir=colmap_dir / "sparse" / "0",
#     output_dir=preproc_dir,
#     # camera_model=CameraModel.PINHOLE,
# )

# Create point cloud PLY file
import torch
import json

# transforms_path = preproc_dir / "transforms.json"
# with open(transforms_path) as f:
#     transforms = json.load(f)

# applied_transform = torch.tensor(transforms["applied_transform"])

ply_filename = "sparse_pc.ply"
# colmap_utils.create_ply_from_colmap(
#     filename=ply_filename,
#     recon_dir=colmap_dir / "sparse" / "0",
#     output_dir=preproc_dir,
#     applied_transform=applied_transform,
# )

# preproc_dir is already defined in the Configuration cell; `splatter` is not created until Step 5
ply_file_path = preproc_dir / ply_filename
# transforms["ply_file_path"] = ply_filename

# print(f"\n✓ Conversion complete!")
# print(f"  - transforms.json: {transforms_path}")
# print(f"  - Point cloud: {ply_file_path}")

In [None]:
## Step 4: Visualize the Sparse Point Cloud

import pyvista as pv
from collab_splats.utils.visualization import (
    CAMERA_KWARGS,
    MESH_KWARGS,
    VIZ_KWARGS,
    visualize_splat,
)

# Load the sparse point cloud
splat = pv.PolyData(str(ply_file_path))

pcd_kwargs = MESH_KWARGS.copy()
pcd_kwargs.update(
    {
        "point_size": 2,
        "render_points_as_spheres": True,
        "ambient": 0.3,
        "diffuse": 0.8,
        "specular": 0.1,
    }
)

plotter = visualize_splat(
    mesh=splat,
    mesh_kwargs=pcd_kwargs,
    viz_kwargs=VIZ_KWARGS,
)

plotter.show()

## Step 5: Training the Model (Optional)

After preprocessing with VGGT, train your model as usual using the Splatter wrapper or ns-train directly.

In [2]:
%load_ext autoreload
%autoreload 2

# Verify conda environment
import sys
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check if we're in the nerfstudio environment
if 'nerfstudio' not in sys.executable:
    print("\n⚠️  WARNING: Not running in nerfstudio conda environment!")
    print("Please activate with: conda activate nerfstudio")
else:
    print("\n✓ Running in nerfstudio environment")

from pathlib import Path
import cv2

# Import internal nerfstudio utilities
from nerfstudio.process_data import vggt_utils
from nerfstudio.process_data import colmap_utils
from nerfstudio.process_data.process_data_utils import (
    convert_video_to_images,
    CameraModel,
)

print("✓ Imports complete")

Python executable: /opt/conda/envs/nerfstudio/bin/python
Python version: 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:45:41) [GCC 13.3.0]

✓ Running in nerfstudio environment
✓ Imports complete


### Evaluate on the birds dataset (VGGT preprocessing)

In [5]:
# Option 1: Use Splatter wrapper for training
from collab_splats.wrapper import Splatter, SplatterConfig

# Configuration
config_dir = Path("/workspace/collab-splats/docs/splats/configs/")
# dataset_name = "bicycle"
dataset_name = "birds_date-02062024_video-C0043"

# Create splatter from config
splatter = Splatter.from_config_file(
    dataset=dataset_name,
    config_dir=config_dir,
    # overrides={
    #     "frame_proportion": 0.1,
    # }
)

# splatter.preprocess()

splatter.preprocess(
    sfm_tool='vggt',
    overwrite=False, 
    kwargs={
        "refine-vggt": "",  # Enable bundle adjustment
        "camera-type": "pinhole",
        "verbose": "",
        "num_downscales": 0,
        "vggt_conf_threshold": 35.0,
        # "skip_image_processing": "",
    },
)

✓ Valid video file with 2388 frames
transforms.json already exists at /workspace/fieldwork-data/birds/2024-02-06/environment/C0043/preproc/transforms.json
To rerun preprocessing, set overwrite=True


In [6]:
feature_kwargs = {
    # "pipeline.model.strategy": "mcmc",
    "pipeline.model.output-depth-during-training": True,
    "pipeline.model.rasterize-mode": "antialiased",
    "pipeline.model.use-scale-regularization": True,
    "pipeline.model.random-scale": 1.0,
    "pipeline.model.num-downscales": 1,
    # "pipeline.datamanager.dataparser.downscale-factor": 1,
    # "pipeline.model.collider-params": "near_plane 0.1 far_plane 3.0",
}

splatter.extract_features(
    kwargs=feature_kwargs, 
    overwrite=True
)
print("\n✓ Training complete!")

[Taichi] version 1.7.4, llvm 15.0.4, commit b4b956fd, linux, python 3.10.18
[04:35:31] Using --data alias for --data.pipeline.datamanager.data                          train.py:241
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
_TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('/workspace/fieldwork-data/birds/2024-02-06/environment/C0043'),
    method_name='rade-features',
    experiment_name=''

Using cache found in /workspace/models/hub/facebookresearch_dinov2_main
  return F.conv2d(input, weight, bias, self.stride,
Extracting dinov2 features: 100%|██████████| 552/552 [01:17<00:00,  7.14it/s]
Using cache found in /workspace/models/hub/RogerQi_MobileSAMV2_main


checkpoint_load_scucess


Extracting samclip features:   0%|          | 0/552 [00:00<?, ?it/s]
0: 1024x576 51 objects, 507.6ms
Speed: 5.0ms preprocess, 507.6ms inference, 42.1ms postprocess per image at shape (1, 3, 1024, 1024)
Extracting samclip features:   0%|          | 1/552 [00:05<48:33,  5.29s/it]
0: 1024x576 54 objects, 19.7ms
Speed: 1.5ms preprocess, 19.7ms inference, 1.4ms postprocess per image at shape (1, 3, 1024, 1024)
Extracting samclip features:   0%|          | 2/552 [00:06<25:21,  2.77s/it]
0: 1024x576 52 objects, 19.7ms
Speed: 1.9ms preprocess, 19.7ms inference, 1.4ms postprocess per image at shape (1, 3, 1024, 1024)
Extracting samclip features:   1%|          | 3/552 [00:07<18:10,  1.99s/it]
0: 1024x576 46 objects, 19.1ms
Speed: 1.4ms preprocess, 19.1ms inference, 1.3ms postprocess per image at shape (1, 3, 1024, 1024)
Extracting samclip features:   1%|          | 4/552 [00:08<13:58,  1.53s/it]
0: 1024x576 49 objects, 19.7ms
Speed: 1.5ms preprocess, 19.7ms inference, 1.4ms postprocess per imag

Saved samclip features to cache at 
/workspace/fieldwork-data/birds/2024-02-06/environment/C0043/preproc/feature-splatting_samclip-features.pt
[04:47:09] use color only optimization with sigmoid activation                              splatfacto.py:213
╭──────────────── viser ────────────────╮
│             ╷                         │
│   HTTP      │ http://localhost:7007   │
│   Websocket │ ws://localhost:7007     │
│             ╵                         │
╰───────────────────────────────────────╯
(viser) Passing ['initial_value'] as positional arguments to add_dropdown is 
deprecated. Please use keyword arguments instead: initial_value=not set
(viser

  torch.tensor(get_world2view_transform(R, T, trans, scale))
  torch.tensor(get_world2view_transform(R, T, trans, scale))


[04:49:26] Caching / undistorting train images                                  full_images_datamanager.py:239
Caching / undistorting train images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:15
[04:49:45] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line wrapping.    writer.py:449

Process ForkProcess-29:
Process ForkProcess-14:
Process ForkProcess-17:
Process ForkProcess-13:
Process ForkProcess-30:
Process ForkProcess-31:
Process ForkProcess-32:
Process ForkProcess-27:
Process ForkProcess-25:
Process ForkProcess-22:
Process ForkProcess-23:
Process ForkProcess-24:
Process ForkProcess-28:
Process ForkProcess-18:
Process ForkProcess-20:
Process ForkProcess-19:
Process ForkProcess-15:
Process ForkProcess-16:
Process ForkProcess-26:
Process ForkProcess-1:
Process ForkProcess-12:
Process ForkProcess-9:
Process ForkProcess-10:
Process ForkProcess-8:
Process ForkProcess-11:
Process ForkProcess-21:
Process ForkProcess-7:
Process ForkProcess-5:
Process ForkProcess-2:
Process ForkProcess-4:
Process ForkProcess-6:
Process ForkProcess-3:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/envs/nerfstudio/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/envs/nerfstudio/lib/python3.10/mul

KeyboardInterrupt: 

### Normal splatting approach


In [3]:
# Option 1: Use Splatter wrapper for training
from collab_splats.wrapper import Splatter, SplatterConfig

# Configuration
config_dir = Path("/workspace/collab-splats/docs/splats/configs/")
dataset_name = "birds_date-02062024_video-C0043"

# Create splatter from config
splatter = Splatter.from_config_file(
    dataset=dataset_name,
    config_dir=config_dir,
    overrides={
        # "frame_proportion": 0.1,
    }
)

splatter.preprocess()


# splatter.preprocess(
#     sfm_tool='vggt',
#     overwrite=True, 
#     kwargs={
#         "refine-vggt": "",
#         "camera-type": "pinhole",
#     }  # Enable bundle adjustment
# )

✓ Valid video file with 2388 frames
transforms.json already exists at /workspace/fieldwork-data/birds/2024-02-06/environment/C0043/preproc/transforms.json
To rerun preprocessing, set overwrite=True


In [None]:
data = splatter.inspect_data()

In [None]:


# splatter.config["preproc_data_path"] = splatter.config["output_path"] / "preproc"

feature_kwargs = {
    # "pipeline.model.strategy": "mcmc",
    "pipeline.model.output-depth-during-training": True,
    "pipeline.model.rasterize-mode": "antialiased",
    "pipeline.model.use-scale-regularization": True,
    "pipeline.model.random-scale": 1.0,
    "pipeline.model.num-downscales": 1,
    # "pipeline.datamanager.dataparser.downscale-factor": 1,
    # "pipeline.model.collider-params": "near_plane 0.1 far_plane 3.0",
}

splatter.extract_features(
    kwargs=feature_kwargs, 
    overwrite=True
)
print("\n✓ Training complete!")

## Alternative Approaches

### 1. Using CLI (ns-process-data)

You can also use VGGT via the command line wrapper:

```bash
# Process video with VGGT
ns-process-data video \
    --data /path/to/video.mp4 \
    --output-dir /path/to/output \
    --sfm-tool vggt

# Process images with VGGT
ns-process-data images \
    --data /path/to/images/ \
    --output-dir /path/to/output \
    --sfm-tool vggt
```

### 2. Using Splatter Wrapper

```python
splatter.preprocess(sfm_tool='vggt', overwrite=True, kwargs={
    "refine-vggt-ba": "",  # Enable bundle adjustment
})
```
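
The kwargs passed to `preprocess` in this notebook use empty strings for switch-style options and real values for the rest, which suggests they end up as command-line flags. How the wrapper actually does this is not shown here, so the following is only a hypothetical illustration of one plausible mapping; `kwargs_to_cli_args` is an invented helper, not part of `collab_splats` or nerfstudio.

```python
def kwargs_to_cli_args(kwargs: dict) -> list[str]:
    """Hypothetical mapping: {"flag": ""} -> ["--flag"], {"key": v} -> ["--key", "v"]."""
    args: list[str] = []
    for key, value in kwargs.items():
        args.append(f"--{key}")
        if value != "":  # An empty string marks a bare flag with no value
            args.append(str(value))
    return args


print(kwargs_to_cli_args({"refine-vggt": "", "camera-type": "pinhole"}))
# ['--refine-vggt', '--camera-type', 'pinhole']
```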

### 3. Direct API (This Notebook) - **Recommended for Prototyping**

```python
# Extract frames
convert_video_to_images(video_path, image_dir, num_frames_target=n_samples, num_downscales=0)

# Run VGGT with BA
vggt_utils.run_vggt_ba(image_dir, colmap_dir, camera_model, verbose=True)

# Convert to transforms.json
colmap_utils.colmap_to_json(colmap_dir / "sparse" / "0", preproc_dir, camera_model)
```

**Why use the direct API?**
- Full parameter control (max_query_pts, fine_tracking, etc.)
- No subprocess overhead
- Better error messages and debugging
- Easier to modify and test code changes
- Can experiment with different VGGT settings quickly

## When to Use VGGT vs COLMAP vs hloc

**Use VGGT when:**
- You have textureless or repetitive scenes (where COLMAP struggles)
- You want faster preprocessing
- You have good lighting and clear images
- You need dense depth estimates

**Use COLMAP when:**
- You need maximum accuracy
- You have well-textured scenes
- You have challenging camera motions or occlusions
- Traditional feature matching works well

**Use hloc when:**
- You want modern deep features (SuperPoint + SuperGlue)
- You need a balance between speed and accuracy
- You want robust matching in challenging conditions

## Direct API Implementation Details

### Key Functions Used in This Notebook

**1. Frame Extraction:**
```python
from nerfstudio.process_data.process_data_utils import convert_video_to_images

convert_video_to_images(
    video_path,
    image_dir=image_dir,
    num_frames_target=n_samples,
    num_downscales=0,  # Keep full resolution only, as in Step 1
    verbose=True,
)
```

**2. VGGT with Bundle Adjustment:**
```python
from nerfstudio.process_data import vggt_utils

vggt_utils.run_vggt_ba(
    image_dir=image_dir,
    colmap_dir=colmap_dir,
    camera_model=camera_model,
    verbose=True,
    ba_refine_focal_length=True,
    ba_refine_principal_point=False,
    max_query_pts=2048,        # Adjust for memory
    query_frame_num=5,          # Adjust for memory
    fine_tracking=True,         # Set False to reduce memory
)
```

**3. COLMAP to Nerfstudio Conversion:**
```python
from nerfstudio.process_data import colmap_utils

colmap_utils.colmap_to_json(
    recon_dir=colmap_dir / "sparse" / "0",
    output_dir=preproc_dir,
    camera_model=camera_model,
)
```
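
After conversion, it can be useful to inspect the resulting `transforms.json` before training. The sketch below assumes the usual nerfstudio layout (a top-level `frames` list plus optional `camera_model`, `applied_transform`, and `ply_file_path` keys); `summarize_transforms` is a helper name introduced here for illustration.

```python
import json
from pathlib import Path


def summarize_transforms(transforms_path: Path) -> dict:
    """Summarize the key fields of a nerfstudio-style transforms.json file."""
    with open(transforms_path) as f:
        transforms = json.load(f)
    return {
        "num_frames": len(transforms.get("frames", [])),
        "camera_model": transforms.get("camera_model"),
        "has_applied_transform": "applied_transform" in transforms,
        "ply_file_path": transforms.get("ply_file_path"),
    }
```

Calling `summarize_transforms(preproc_dir / "transforms.json")` and comparing `num_frames` with the number of extracted images shows how many frames the reconstruction actually registered.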

### VGGT Processing Pipeline

The `run_vggt_ba()` function internally:

1. **Loads VGGT Model** (`facebook/VGGT-1B`)
2. **Runs Inference**: Predicts camera poses and depth maps
3. **Feature Tracking**: Uses VGGSfM's `predict_tracks()` for robust tracks
4. **Builds Reconstruction**: Creates pycolmap Reconstruction from tracks
5. **Bundle Adjustment**: Refines poses and intrinsics
6. **Writes COLMAP Files**: Saves cameras.bin, images.bin, points3D.bin

### Memory Optimization

For memory-constrained GPUs (e.g., RTX 4090 24GB):
- Set `fine_tracking=False` (significant reduction)
- Reduce `max_query_pts` to 1024 or 512
- Reduce `query_frame_num` to 3
- Use `track_resolution=518` (default, already optimized)
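
One way to apply these settings is to start with the full-quality parameters and fall back to cheaper ones only when the GPU runs out of memory. The sketch below is a generic retry helper under stated assumptions: `run_with_fallbacks` and `MEMORY_PRESETS` are names introduced here, the preset values mirror the suggestions above, and out-of-memory errors are detected by matching the text "out of memory" in a `RuntimeError` message (PyTorch's CUDA OOM error is a `RuntimeError` subclass).

```python
from typing import Any, Callable, Sequence

# Presets ordered from most to least memory-hungry (values taken from the list above).
MEMORY_PRESETS = [
    {"max_query_pts": 2048, "query_frame_num": 5, "fine_tracking": True},
    {"max_query_pts": 1024, "query_frame_num": 3, "fine_tracking": True},
    {"max_query_pts": 512, "query_frame_num": 3, "fine_tracking": False},
]


def run_with_fallbacks(run: Callable[..., Any], presets: Sequence[dict], **common: Any) -> dict:
    """Call run(**common, **preset) for each preset; return the first that succeeds."""
    last_error = None
    for preset in presets:
        try:
            run(**common, **preset)
            return preset
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err).lower():
                raise
            last_error = err
    raise RuntimeError("All presets ran out of memory") from last_error


# Intended usage (not executed here):
# run_with_fallbacks(vggt_utils.run_vggt_ba, MEMORY_PRESETS,
#                    image_dir=image_dir, colmap_dir=colmap_dir)
```

In practice it may also be worth freeing cached GPU memory between attempts; that step is left out to keep the sketch free of a PyTorch dependency.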

### Implementation Files

- **VGGT utilities**: [nerfstudio/process_data/vggt_utils.py](nerfstudio/process_data/vggt_utils.py)
- **COLMAP utilities**: [nerfstudio/process_data/colmap_utils.py](nerfstudio/process_data/colmap_utils.py)
- **Processing utilities**: [nerfstudio/process_data/process_data_utils.py](nerfstudio/process_data/process_data_utils.py)