# 3D Object Reconstruction using NVBundleSDF 

## Overview 

This guide demonstrates an end-to-end real2sim workflow for reconstructing 3D objects from stereo video input using state-of-the-art computer vision and neural rendering techniques. The pipeline combines:

- **[FoundationStereo](https://arxiv.org/abs/2501.09898)** for depth estimation from stereo image pairs
- **[SAM2](https://arxiv.org/abs/2408.00714)** for object segmentation and tracking across the entire video
- **[BundleSDF](https://arxiv.org/abs/2303.14158)** for real-world-scale textured mesh generation

<div align="center">
    <img src="../data/docs/pipeline_overview.png" alt="Pipeline Overview" title="3D Object Reconstruction Workflow">
</div>

## Learning Objectives

By the end of this notebook, you'll learn how to perform 3D object reconstruction by:

- **Data Preparation**: Importing stereo input data and pre-processing it for optimal reconstruction
- **Depth Estimation**: Using FoundationStereo to generate accurate depth maps from stereo pairs
- **Object Segmentation**: Employing SAM2 to segment and track objects across all frames
- **3D Object Reconstruction**: Leveraging BundleSDF for pose estimation, SDF training, mesh extraction, and texture baking
- **Asset Creation**: Creating textured 3D assets ready for downstream applications such as digital content creation, dataset simulation, and object pose estimation


## Prerequisites

### System Requirements
- **GPU**: NVIDIA GPU with CUDA support (minimum: Compute Capability 7.0 and at least 24GB VRAM)
- **Memory**: 32GB+ RAM recommended
- **Storage**: 100GB+ free space recommended
- **OS**: Ubuntu 22.04+
- **Software**: 
  - [Docker](https://docs.docker.com/engine/install/) with [nvidia-container-runtime](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) enabled
  - [Docker Compose](https://docs.docker.com/compose/install/)
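
Before starting, it can help to confirm that the GPU meets the minimums listed above. The sketch below parses one line of `nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader,nounits` output. The helper names and the 24,000 MiB threshold (a tolerance for cards sold as 24GB that report slightly less) are assumptions for illustration, and the `compute_cap` query field is only available on reasonably recent drivers.

```python
import shutil
import subprocess

MIN_COMPUTE_CAPABILITY = (7, 0)  # from the system requirements above
MIN_VRAM_MIB = 24_000            # assumption: tolerance for cards sold as "24GB"


def meets_requirements(csv_line: str) -> bool:
    """Check one 'name, compute_cap, memory.total' line of nvidia-smi CSV output."""
    try:
        # Split from the right so a comma inside the GPU name cannot shift the fields
        _name, compute_cap, memory_mib = [f.strip() for f in csv_line.rsplit(",", 2)]
        capability = tuple(int(part) for part in compute_cap.split("."))
        return capability >= MIN_COMPUTE_CAPABILITY and float(memory_mib) >= MIN_VRAM_MIB
    except ValueError:
        return False  # unparseable line, for example a field reported as "[N/A]"


def query_gpus() -> list:
    """Return one CSV line per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


for line in query_gpus():
    print(line, "->", "OK" if meets_requirements(line) else "below minimum")
```

If no GPU is listed, the loop prints nothing; in that case check the driver and container runtime installation first.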

### Initial Setup
In the cell below, we install additional dependencies (video codecs and GStreamer plugins) and build the SAM2 extension that we'll use later in the pipeline.

In [1]:
# Install additional dependencies for video processing and build SAM2 extension
print("Installing additional dependencies...")
import subprocess, os, pathlib, sys
DEEPSTREAM_SCRIPT = pathlib.Path("/opt/nvidia/deepstream/deepstream/user_additional_install.sh")
if DEEPSTREAM_SCRIPT.exists():
    subprocess.check_call(["bash", str(DEEPSTREAM_SCRIPT)])
SAM2_ROOT = pathlib.Path(os.getenv("SAM2_ROOT", "/sam2"))
if SAM2_ROOT.exists():
    subprocess.check_call([sys.executable, "setup.py", "build_ext", "--inplace"], cwd=SAM2_ROOT)
else:
    print("⚠️  SAM2 root not found – skipping C++ extension build.")

print("Setup complete!")

Installing additional dependencies...
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1840 kB]
Get:3 https://librealsense.intel.com/Debian/apt-repo jammy InRelease [3249 B]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:6 https://librealsense.intel.com/Debian/apt-repo jammy/main amd64 Packages [15.3 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [3148 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1267 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-securit

W: https://librealsense.intel.com/Debian/apt-repo/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.



Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  gstreamer1.0-libav
0 upgraded, 1 newly installed, 0 to remove and 186 not upgraded.
Need to get 103 kB of archives.
After this operation, 280 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 gstreamer1.0-libav amd64 1.20.3-0ubuntu1 [103 kB]
Fetched 103 kB in 4s (25.7 kB/s)
Selecting previously unselected package gstreamer1.0-libav:amd64.
(Reading database ... 89644 files and directories currently installed.)
Preparing to unpack .../gstreamer1.0-libav_1.20.3-0ubuntu1_amd64.deb ...
Unpacking gstreamer1.0-libav:amd64 (1.20.3-0ubuntu1) ...
Setting up gstreamer1.0-libav:amd64 (1.20.3-0ubuntu1) ...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libgstreamer-plugins-good1.0-0
Recommended packages:
  gstreamer1.0-x
The following packages wil

If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Emitting ninja build file /sam2/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


[1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /sam2/build/temp.linux-x86_64-cpython-310/sam2/csrc/connected_components.o.d -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /sam2/sam2/csrc/connected_components.cu -o /sam2/build/temp.linux-x86_64-cpython-310/sam2/csrc/connected_components.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11

## Step 1: Data Preparation and Guidelines

The notebook comes with a sample dataset and a base configuration file that allows users to optimize their results. This section covers suitable object types and dataset capturing instructions.

### Recommended Object Types

For optimal reconstruction results, choose objects with the following characteristics:

- **Rigid, Non-Deformable Objects**: The workflow performs best with objects that maintain a fixed shape across frames
- **Rich Surface Texture**: High texture variance enables reliable feature detection and matching, which is critical for accurate reconstruction
- **Asymmetrical Objects**: Distinct content on different faces helps avoid ambiguity during feature matching
- **Opaque Materials**: Avoid transparent or translucent materials (glass, clear plastic) as they interfere with depth and feature consistency

### Dataset Capturing Guidelines

To achieve optimal 3D reconstruction quality, follow these steps when capturing object images. The goal is to ensure complete coverage of the object's geometry while maintaining consistency in framing and orientation.

#### 1. Position the Object
- **Centering**: Place the object in the center of the camera frame
- **Size**: The object should occupy roughly 45-65% of the image area - large enough to capture details while providing context
- **Lighting**: 
  - Use even, diffused lighting to minimize harsh shadows and reflections
  - Avoid backlighting or direct overhead lights that create glare or overexposure
  - Ensure the object is well-lit from multiple directions to reveal surface details
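
One quick way to sanity-check the size guideline is to compare the area of a bounding box around the object with the full image area. The sketch below is only an approximation, since a bounding box overestimates the area of a non-rectangular object; the function names and the example box are hypothetical.

```python
def object_area_fraction(bbox_w, bbox_h, image_w, image_h):
    """Fraction of the image covered by the object's bounding box."""
    return (bbox_w * bbox_h) / (image_w * image_h)


def within_guideline(fraction, low=0.45, high=0.65):
    """True if the fraction falls in the 45-65% range suggested above."""
    return low <= fraction <= high


# Hypothetical 2800x2400 px box in a 4000x3000 px frame
fraction = object_area_fraction(2800, 2400, 4000, 3000)
print(f"Object covers {fraction:.0%} of the frame; within guideline: {within_guideline(fraction)}")
```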

#### 2. Capture the First Set (Primary Faces)
- Begin image capture while slowly rotating the object horizontally in one direction (clockwise or counterclockwise)
- Cover approximately 360 degrees of rotation
- This step should expose four visible faces of the object: front, back, and both sides
- Capture multiple overlapping frames to ensure robust feature matching across angles

#### 3. Capture the Second Set (Remaining Faces)
- Flip the object to reveal previously hidden faces (typically top and bottom)
- Continue capturing images while rotating slightly, following the same pattern as the previous step
- Ensure full coverage of the remaining two faces with overlapping views for consistent alignment

The animation below demonstrates the full object rotation and flip process, showing how to cover all six faces in a consistent, controlled manner.

<div align="center">
    <img src="../data/docs/adv.gif" alt="Capture Example" title="Capture Example" width="600">
</div>

### Manual Capture Guidelines (Alternative Method)

If a turntable is not available, you can capture the data manually by following these guidelines:

#### 1. Scan-like Movement
- Hold the object in your hands and rotate it manually in front of the camera
- Treat the process like scanning the object: gradually expose all surfaces to the camera
- Rotate the object slowly and smoothly, ensuring sufficient visual overlap between consecutive frames

#### 2. Maximize Visible Surface Area
- Ensure the camera can see a large portion of the object's surface in each frame
- Avoid fast or jerky movements - slower rotations help the system track features accurately
- Verify that all six faces of the object (front, back, sides, top, and bottom) are captured clearly

#### 3. Maintain Consistent Distance
- Keep the object at a consistent distance from the camera throughout the capture process
- Avoid moving the object significantly closer or farther during capture

The animation below demonstrates effective manual object capture technique.

<div align="center">
    <img src="../data/docs/input_dino.gif" alt="Manual Capture Example" title="Manual Capture Example" width="600">
</div>
  
## Comprehensive Camera Comparison

**Note:** The following specifications are for reference only. Check your own camera's calibration parameters before running 3D object reconstruction.

| **Specification** | **ZED 2i Camera (Stereolabs)** | **QooCam EGO 3D Camera (Kandao)** | **Hawk Stereo Camera (Leopard Imaging)** | **ZED Mini Camera (Stereolabs)** |
|-------------------|--------------------------------|-----------------------------------|------------------------------------------|----------------------------------|
| **Manufacturer** | Stereolabs | Kandao | Leopard Imaging | Stereolabs |
| **Resolution** | 2K stereo | 4K image / 2K video | Industrial grade | HD stereo |
| **Form Factor** | Desktop setup | Lightweight, portable | Compact industrial | Ultra-compact, lightweight |
| **Field of View** | Wide | Standard | Standard | Standard |
| **Target Audience** | Developers, robotics | Consumer/prosumer | Industrial, NVIDIA ecosystem | Mixed-reality, robotics |
| **Best Use Case** | Desktop capture, robotics | Quick field captures, handheld | Industrial applications, Isaac ROS | Mixed-reality, compact robotics |
| **Technical Specifications** | | | | |
| **Focal Length (fx)** | 1070.800 | 3079.6 | 958.35 | 522.38 |
| **Focal Length (fy)** | 1070.700 | 3075.1 | 956.18 | 522.38 |
| **Principal Point (cx)** | 1098.950 | 2000.0 | 959.36 | 644.88 |
| **Principal Point (cy)** | 649.044 | 1500.0 | 590.95 | 356.03 |
| **Baseline (m)** | 0.1198 | 0.0658 | 0.1495 | 0.12 |
| **Eye Separation** | Standard | Standard | Standard | 6.5cm |
| | | | | |
| **Strengths** | • Medium-resolution stereo capture (2K)<br>• Wide field of view<br>• Robust SDK<br>• Good for objects without detailed text | • Lightweight and portable<br>• User-friendly design<br>• Built-in display for instant review<br>• 4K image capture capability<br>• Ideal for handheld workflows | • Compact industrial-grade system<br>• Accurate calibration<br>• Widely used in Isaac ROS<br>• Developed by NVIDIA camera team<br>• Professional-grade reliability | • Ultra-compact design<br>• Built-in 6DoF IMU<br>• HD depth sensing with Ultra mode<br>• Visual-inertial technology<br>• Aluminum frame for robustness<br>• USB Type-C connectivity |
| **Weaknesses** | • Capture resolution not very high<br>• Cannot capture detailed texture<br>• Motion blur in video capturing<br>• Requires desktop for capturing | • No SDK for developers<br>• More consumer-focused<br>• Limited customization options | • Capture resolution not very high<br>• May not capture detailed texture<br>• Requires additional setup/integration<br>• Needs Jetson as additional device | • Smaller baseline (6.5cm)<br>• HD resolution (lower than 2K/4K)<br>• Requires powerful GPU for AR applications<br>• Limited depth range vs larger cameras |
| | | | | |
| **Setup Requirements** | | | | |
| **Additional Hardware** | Desktop/laptop | None (standalone) | Jetson device | Desktop/laptop |
| **Software Requirements** | ZED SDK | Companion mobile app | Custom integration, Isaac ROS | ZED SDK |
| **Minimum System (SDK)** | Standard desktop | N/A | Jetson platform | Dual-core 2.3GHz, 4GB RAM, USB 3.0 |
| **Recommended For** | • Desktop Development<br>• Robotics Integration<br>• SDK-based prototyping | • Field Data Collection<br>• Portable workflows<br>• Quick capture sessions | • Industrial Applications<br>• Isaac ROS projects<br>• Professional deployments | • Mixed-Reality Applications<br>• Compact Robotics<br>• Motion Tracking<br>• Space-constrained setups |
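
The focal lengths and principal point in the table are the entries of the standard 3×3 pinhole intrinsic matrix. This notebook's configuration stores that matrix as a flat, row-major list `[fx, 0, cx, 0, fy, cy, 0, 0, 1]`. The sketch below builds that list from the ZED Mini reference values in the table; the helper names are hypothetical, and you should substitute your own calibration values.

```python
def flatten_intrinsics(fx, fy, cx, cy):
    """Return the row-major flattening of the 3x3 pinhole intrinsic matrix."""
    return [fx, 0, cx,
            0, fy, cy,
            0, 0, 1]


def as_rows(flat):
    """Reshape a 9-element row-major list back into three rows."""
    return [flat[0:3], flat[3:6], flat[6:9]]


# Reference values for the ZED Mini column of the table above
zed_mini = flatten_intrinsics(fx=522.38, fy=522.38, cx=644.88, cy=356.03)
for row in as_rows(zed_mini):
    print(row)
```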

## Running Time Reference

Below are estimated running times for each stage of the 3D object reconstruction pipeline when using an NVIDIA RTX A6000 GPU with a dataset of 36 stereo frames at 4K (3000x4000) resolution:

| Pipeline Stage | Estimated Time | Key Factors Affecting Performance |
|----------------|----------------|----------------------------------|
| **Initial Setup** | 1-2 minutes | Package installation and extension compilation |
| **FoundationStereo Depth Estimation** | 1-2 minutes | Frame count, resolution |
| **SAM2 Object Segmentation** | 25 seconds | Frame count, resolution |
| **Object Pose Tracking** | 3-4 minutes | Frame count, resolution |
| **SDF Training** | 3-4 minutes | Training iterations, resolution, number of keyframes |
| **Texture Baking** | 22-23 minutes | Texture resolution, mesh complexity, image resolution |
| **Total Pipeline** | **31-32 minutes** | End-to-end processing time |

**Notes:**
- Performance scales approximately linearly with frame count
- Higher resolution inputs increase processing time, particularly for Neural SDF training and texture generation
- Complex objects with intricate geometry or challenging textures may require longer processing times
- Using lower downscaling factors for higher quality outputs will increase processing time

These estimates are based on benchmarks using the RTX A6000 GPU. Performance may vary based on system configuration, input data characteristics, and specific parameter settings.
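
As a rough planning aid, the linear-scaling note above can be turned into a back-of-the-envelope estimate. The sketch below scales a single stage's time from the 36-frame reference to another frame count. It is only a first-order extrapolation under that linearity assumption (stages such as SDF training also depend on iteration count), and the function name is hypothetical.

```python
REFERENCE_FRAMES = 36  # frame count used for the benchmark table above


def scale_stage_minutes(minutes_at_reference, frame_count, reference_frames=REFERENCE_FRAMES):
    """Linearly extrapolate a stage's running time to a different frame count."""
    if frame_count <= 0:
        raise ValueError("frame_count must be positive")
    return minutes_at_reference * frame_count / reference_frames


# Texture baking: midpoint of the 22-23 minute range, extrapolated to 72 frames
texture_baking = scale_stage_minutes(22.5, frame_count=72)
print(f"Estimated texture baking time for 72 frames: {texture_baking:.1f} minutes")
```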


### Configuration and Data Setup

Now let's set up our experiment configuration and load the sample dataset.


In [2]:
# Import required libraries
import os
import uuid
import yaml 
import ipywidgets
import shutil
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import JSON, display
from PIL import Image 
import trimesh 
from pathlib import Path
import json

# Import custom modules for 3D reconstruction pipeline
from nvidia.objectreconstruction.utils.visualization import (
    create_stereo_viewer, create_depth_viewer, create_mask_viewer, 
    create_bbox_widget, create_3d_viewer
)
from nvidia.objectreconstruction.networks.foundationstereo import run_depth_estimation
from nvidia.objectreconstruction.networks.sam2infer import run_mask_extraction
from nvidia.objectreconstruction.dataloader import ReconstructionDataLoader
from nvidia.objectreconstruction.networks import NVBundleSDF
from nvidia.objectreconstruction.networks import ModelRendererOffscreen, vis_camera_poses

def pretty_print_config(obj, title=None):
    """Pretty print configuration with custom object handling"""
    
    def json_serializer(obj):
        """Custom JSON serializer for non-serializable objects"""
        if isinstance(obj, Path):
            return str(obj)
        elif hasattr(obj, '__dict__'):
            return obj.__dict__
        elif hasattr(obj, '_asdict'):  # namedtuples
            return obj._asdict()
        else:
            return str(obj)
    
    if title:
        print(f"{title}")
        print("=" * (len(title) + 4))
    
    # Convert to JSON-serializable format
    json_str = json.dumps(obj, default=json_serializer, indent=2, sort_keys=True)
    json_obj = json.loads(json_str)
    
    # Display using IPython's JSON widget with enhanced styling
    return JSON(json_obj, expanded=True)

print("All libraries imported successfully!")

Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.
using device: cuda


Cannot import nvdiffrast
No OpenGL_accelerate module loaded: No module named 'OpenGL_accelerate'


Warp 1.8.0 initialized:
   CUDA Toolkit 12.8, Driver 12.6
   Devices:
     "cpu"      : "x86_64"
     "cuda:0"   : "NVIDIA H100 PCIe" (79 GiB, sm_90, mempool enabled)
   Kernel cache:
     /tmp/warpcache
All libraries imported successfully!


In [3]:
# Load the experiment configuration file
config_file_path = '/workspace/3d-object-reconstruction/data/configs/base.yaml'

with open(config_file_path, 'r') as f:
    config = yaml.safe_load(f)

# Load the input dataset 
input_data_path = '/workspace/3d-object-reconstruction/data/samples/retail_item/'

# Setup the experiment directory
output_data_path = Path('/workspace/3d-object-reconstruction/data/output/retail_item/')

# Check if output directory exists and ask user for action
if output_data_path.exists():
    clear_existing = input(f"Output directory '{output_data_path}' already exists.\nClear existing contents? (y/n): ").lower().strip()
    
    if clear_existing in ['y', 'yes']:
        print("🗑️  Clearing existing output directory...")
        shutil.rmtree(output_data_path)
        print("✅ Existing contents cleared!")
    elif clear_existing in ['n', 'no']:
        print("📁 Keeping existing contents...")
    else:
        print("⚠️  Invalid input. Keeping existing contents by default...")

# Create output directory and copy input frames
output_data_path.mkdir(parents=True, exist_ok=True)
shutil.copytree(input_data_path, output_data_path, dirs_exist_ok=True)

# Update configuration to point to experiment directory
config['workdir'] = str(output_data_path)
config['bundletrack']['debug_dir'] = str(output_data_path)
config['nerf']['save_dir'] = str(output_data_path)

# Configure camera intrinsics and baseline for the sample dataset
# The example dataset uses QooCam with the following specifications:
# Intrinsic matrix format: [fx, 0, cx, 0, fy, cy, 0, 0, 1]
# Baseline: distance between stereo camera lenses in meters
config['camera_config']['intrinsic'] = [3079.6, 0, 2000.0, 0, 3075.1, 1500.01, 0, 0, 1]
config['foundation_stereo']['intrinsic'] = config['camera_config']['intrinsic']
config['foundation_stereo']['baseline'] = 0.0657696127  # 65.77mm baseline

print("Configuration loaded successfully!")
print(f"Input data path: {input_data_path}")
print(f"Output data path: {output_data_path}")
print("Camera intrinsics configured for QooCam")

Configuration loaded successfully!
Input data path: /workspace/3d-object-reconstruction/data/samples/retail_item/
Output data path: /workspace/3d-object-reconstruction/data/output/retail_item
Camera intrinsics configured for QooCam


### Visualize Input Data

Let's examine the sample stereo dataset to understand the input format and quality.


In [4]:
# Create an interactive stereo viewer to examine the input data
stereo_viewer = create_stereo_viewer(output_data_path)
display(stereo_viewer)

VBox(children=(IntSlider(value=0, continuous_update=False, description='Frame:', layout=Layout(width='500px'),…

## Step 2: Depth Estimation using FoundationStereo

Now we'll extract depth information from our stereo image pairs using FoundationStereo. This step is crucial as it provides the 3D geometric foundation for our reconstruction pipeline.

### About FoundationStereo

FoundationStereo is a state-of-the-art neural network architecture designed for stereo depth estimation. It leverages:
- **Transformer-based feature extraction** for robust matching across stereo pairs
- **Multi-scale processing** to handle objects at different distances
- **Uncertainty estimation** to identify reliable depth predictions

The network takes left and right stereo images as input and produces dense depth maps with sub-pixel accuracy.
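
Stereo matching networks generally predict a disparity map, which is converted to metric depth with the pinhole stereo relation $Z = f_x \cdot B / d$, where $f_x$ is the focal length in pixels, $B$ the baseline in meters, and $d$ the disparity in pixels. This is why the intrinsics and baseline must be set correctly in the configuration. The sketch below applies that relation to the QooCam values used in this guide. It is an illustration of the geometry, not FoundationStereo's internal code; the function names, the 0.3 scale factor, and the 200-pixel disparity are assumed example values.

```python
def scale_focal_length(fx_px, image_scale):
    """Focal length in pixels scales with the image when it is resized."""
    return fx_px * image_scale


def disparity_to_depth(disparity_px, fx_px, baseline_m):
    """Pinhole stereo relation: depth = fx * baseline / disparity."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return fx_px * baseline_m / disparity_px


fx = 3079.6              # QooCam focal length (pixels, full resolution)
baseline = 0.0657696127  # QooCam baseline (meters)

# Assumed example: images resized by 0.3 before inference, disparity of 200 px
fx_scaled = scale_focal_length(fx, image_scale=0.3)
depth = disparity_to_depth(200.0, fx_scaled, baseline)
print(f"Depth at 200 px disparity: {depth:.3f} m")
```

Because depth is inversely proportional to disparity, an error in the baseline or focal length rescales the whole reconstruction, so the resulting mesh would no longer be at real-world scale.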

### Configuration Review

Let's examine the FoundationStereo configuration before running inference: 


In [5]:
foundationstereo_config = config['foundation_stereo']
display(pretty_print_config(foundationstereo_config, "FoundationStereo Configuration"))

FoundationStereo Configuration


<IPython.core.display.JSON object>

### Run Depth Estimation

Now let's run FoundationStereo inference on our stereo pairs:


In [6]:
print("Starting FoundationStereo depth estimation...")
print("This may take several minutes depending on the number of frames and GPU performance.")

response = run_depth_estimation(
    config=foundationstereo_config, 
    exp_path=output_data_path, 
    rgb_path=output_data_path / 'left',
    depth_path=output_data_path / 'depth'
)

if response:
    print("✓ FoundationStereo depth estimation completed successfully!")
else:
    print("✗ Errors encountered during FoundationStereo inference.")
    print("Please check the configuration and input data before proceeding.")
    

2025-07-18 21:01:40.356 | INFO     | nvidia.objectreconstruction.networks.foundationstereo:run_depth_estimation:460 - Depth estimation directory: /workspace/3d-object-reconstruction/data/output/retail_item/depth
2025-07-18 21:01:40.358 | INFO     | nvidia.objectreconstruction.networks.foundationstereo:run_depth_estimation:475 - Running depth estimation...


Starting FoundationStereo depth estimation...
This may take several minutes depending on the number of frames and GPU performance.


model.safetensors:   0%|          | 0.00/22.4M [00:00<?, ?B/s]

Downloading: "https://github.com/facebookresearch/dinov2/zipball/main" to /root/.cache/torch/hub/main.zip
using MLP layer as FFN
2025-07-18 21:01:49.236 | INFO     | nvidia.objectreconstruction.networks.foundationstereo:load_weights:166 - Loaded weights from /workspace/3d-object-reconstruction/data/weights/foundationstereo/model_best_bp2.pth
2025-07-18 21:01:49.551 | INFO     | nvidia.objectreconstruction.networks.foundationstereo:_discover_images:272 - Found 37 left images
2025-07-18 21:01:49.553 | INFO     | nvidia.objectreconstruction.networks.foundationstereo:_setup_camera_params:281 - Camera baseline: 0.0657696127
2025-07-18 21:01:49.554 | INFO     | nvidia.objectreconstruction.networks.foundationstereo:_setup_camera_params:282 - Image scale factor: 0.3
  with autocast(enabled

✓ FoundationStereo depth estimation completed successfully!


### Visualize Depth Results

Let's examine the generated depth maps to verify the quality of the depth estimation:


In [7]:
# Create an interactive depth viewer
depth_viewer = create_depth_viewer(output_data_path)
display(depth_viewer)

VBox(children=(IntSlider(value=0, continuous_update=False, description='Frame:', layout=Layout(width='500px'),…

## Step 3: Object Segmentation using SAM2

Next, we'll segment our target object across all frames using SAM2 (Segment Anything Model 2). This step is essential for isolating the object of interest from the background.

### About SAM2

SAM2 is Meta's advanced segmentation model that excels at:
- **Video object tracking**: Maintaining consistent segmentation across frames
- **Prompt-based segmentation**: Using minimal user input (like a bounding box) to identify objects
- **Temporal consistency**: Leveraging motion and appearance cues for robust tracking

The model requires only a single frame annotation (bounding box) and automatically propagates the segmentation to all other frames.

### Interactive Bounding Box Selection

Use the interactive widget below to draw a bounding box around your target object in the first frame: 

In [8]:
%matplotlib ipympl

print("Instructions:")
print("1. Use your mouse to draw a bounding box around the target object")
print("2. Make sure the box tightly encompasses the entire object")
print("3. Click 'Finalize & Close' when satisfied with the bounding box")

bbox_widget = create_bbox_widget(output_data_path)
display(bbox_widget.display())

Instructions:
1. Use your mouse to draw a bounding box around the target object
2. Make sure the box tightly encompasses the entire object
3. Click 'Finalize & Close' when satisfied with the bounding box


VBox(children=(HTML(value='<b>Instructions:</b> Click and drag to draw a bounding box on the image below.'), H…

### Run SAM2 Segmentation

Now we'll use the selected bounding box to run SAM2 segmentation across all frames:


In [9]:
# Extract bounding box coordinates
x, y, w, h = bbox_widget.get_bbox()
print(f"Selected bounding box: x={x}, y={y}, width={w}, height={h}")

# Update SAM2 configuration with the bounding box coordinates
sam2_config = config['sam2']
sam2_config['bbox'] = [x, y, x+w, y+h] 

print("Starting SAM2 object segmentation...")
print("This process will track the object across all frames.")

# Run object segmentation using SAM2
response = run_mask_extraction(
    config=sam2_config,
    exp_path=output_data_path,
    rgb_path=output_data_path / "left",
    mask_path=output_data_path / "masks"
)

if response:
    print("✓ SAM2 segmentation completed successfully!")
else:
    print("✗ Errors encountered during SAM2 inference.")
    print("Please check the bounding box selection and try again.")

assert response, 'SAM2 inference failed. Please resolve issues before proceeding.'


Mask extraction directory: /workspace/3d-object-reconstruction/data/output/retail_item/masks
Running mask extraction...


Selected bounding box: x=1067, y=576, width=1164, height=1670
Starting SAM2 object segmentation...
This process will track the object across all frames.
{'checkpoint_path': '/workspace/3d-object-reconstruction/data/weights/sam2/sam2.1_hiera_large.pt', 'model_config': '//workspace/3d-object-reconstruction/data/weights/sam2/sam2.1_hiera_l.yaml', 'bbox': [1067, 576, 2231, 2246], 'device': 'cuda'}


Loaded checkpoint sucessfully
Processing 37 frames for mask extraction...
Loading frames: 100% 37/37 [00:13<00:00,  2.67it/s]
propagate in video: 100% 37/37 [00:01<00:00, 23.99it/s]
Mask extraction completed. Masks saved to /workspace/3d-object-reconstruction/data/output/retail_item/masks
Mask extraction completed successfully


✓ SAM2 segmentation completed successfully!
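
The cell above converts the widget's `(x, y, width, height)` box into the corner format `[x_min, y_min, x_max, y_max]` before storing it in `sam2_config['bbox']`. The sketch below isolates that conversion and adds optional clamping to the image bounds. The clamping step and the function name are additions for illustration, not part of the notebook's pipeline.

```python
def xywh_to_xyxy(x, y, w, h, image_w=None, image_h=None):
    """Convert a top-left/size box to corner format, optionally clamped to the image."""
    x_min, y_min, x_max, y_max = x, y, x + w, y + h
    if image_w is not None:
        x_min, x_max = max(0, x_min), min(image_w, x_max)
    if image_h is not None:
        y_min, y_max = max(0, y_min), min(image_h, y_max)
    return [x_min, y_min, x_max, y_max]


# Box selected in the run above
print(xywh_to_xyxy(1067, 576, 1164, 1670))
```

For the box selected in this run, the result matches the `bbox` value printed in the SAM2 configuration log.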


Let's inspect the masks extracted by SAM2:

In [10]:
mask_viewer = create_mask_viewer(output_data_path)
display(mask_viewer)

VBox(children=(IntSlider(value=0, continuous_update=False, description='Frame:', layout=Layout(width='500px'),…

## Step 4: 3D Reconstruction and Neural Rendering using NVBundleSDF

This final step combines multiple state-of-the-art techniques to create a complete 3D reconstruction of your object. The pipeline integrates feature matching, pose estimation, and neural rendering to generate high-quality textured 3D assets.

## Pipeline Overview

The 3D reconstruction process follows a sophisticated multi-stage approach:

1. **Pose Estimation** → Estimate and optimize camera poses
2. **Neural Reconstruction with Neural SDF** → Train a neural object field for 3D geometry
3. **Texture Baking** → Generate production-ready textured meshes

<div align="center">
    <img src="../data/docs/bundlesdf_pipeline.png" alt="BundleSDF Pipeline" title="3D Reconstruction Pipeline" width="800">
</div>

## Technical Components

### BundleSDF
[**BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects**](https://arxiv.org/abs/2303.14158) combines:
- **Volume Rendering**: Learns 3D geometry through differentiable ray casting
- **Appearance Modeling**: Captures view-dependent effects and material properties
- **SDF Representation**: Uses signed distance functions for clean mesh extraction
- **Bundle Adjustment**: Performs global optimization across all frames for geometric consistency
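
To make the SDF idea concrete: a signed distance function returns the distance from a query point to the nearest surface, with the sign indicating which side of the surface the point is on. The sketch below uses an analytic sphere purely to illustrate the common convention (negative inside, zero on the surface, positive outside). BundleSDF learns this function with a neural network rather than a closed-form expression, and the mesh is extracted from the zero level set, typically with an algorithm such as marching cubes.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance from a 3D point to a sphere's surface."""
    return math.dist(point, center) - radius


center, radius = (0.0, 0.0, 0.0), 1.0
for point in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
    print(point, "->", sphere_sdf(point, center, radius))
```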

### FoundationStereo
[**FoundationStereo: Zero-Shot Stereo Matching**](https://arxiv.org/abs/2501.09898) delivers robust depth estimation:
- **Vision Foundation Model**: Leverages pre-trained vision transformers for rich feature extraction
- **Zero-Shot Generalization**: Performs well across diverse environments without domain-specific fine-tuning
- **Multi-Scale Processing**: Handles objects at different distances through hierarchical feature analysis
- **Sub-Pixel Accuracy**: Achieves precise depth measurements with transformer-based stereo matching

### RoMa Feature Matching
[**RoMa: Robust Dense Feature Matching**](https://arxiv.org/abs/2305.15404) provides reliable feature correspondences between frames:
- **Dense Matching**: Establishes pixel-to-pixel correspondences across viewpoints
- **Robust Descriptors**: Uses transformer-based features for challenging lighting and viewpoint changes
- **Uncertainty Estimation**: Provides confidence scores for each match to filter unreliable correspondences
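
Confidence-based filtering can be pictured as a simple threshold over the match list. In the sketch below, each match is assumed to be a row `(x_a, y_a, x_b, y_b)` of pixel coordinates in two frames, which is consistent with the `(N, 4)` correspondence shape reported in the tracking log but is still an assumption about the layout. The function name, threshold, and sample values are hypothetical, and this is not RoMa's API.

```python
def filter_matches(matches, confidences, min_confidence=0.5):
    """Keep only the matches whose confidence reaches the threshold."""
    if len(matches) != len(confidences):
        raise ValueError("matches and confidences must have the same length")
    return [m for m, c in zip(matches, confidences) if c >= min_confidence]


# Hypothetical matches: (x_a, y_a, x_b, y_b) in pixels
matches = [
    (120.0, 80.0, 131.5, 82.0),
    (300.0, 210.0, 512.0, 40.0),   # implausibly large jump, low confidence
    (64.0, 400.0, 70.2, 398.5),
]
confidences = [0.93, 0.12, 0.71]

kept = filter_matches(matches, confidences)
print(f"Kept {len(kept)} of {len(matches)} matches")
```

The surviving matches would then go to a robust estimator such as RANSAC, which the tracking log shows running after RoMa.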

### SAM2
[**SAM2: Segment Anything in Images and Videos**](https://arxiv.org/abs/2408.00714) extends segmentation to video:
- **Transformer Architecture**: Uses hierarchical vision transformer with streaming memory
- **Temporal Consistency**: Maintains object tracking across frames via memory mechanisms
- **Prompt Flexibility**: Accepts points, boxes, and masks for interactive segmentation
- **Inference Speed**: Segments images about 6× faster than the original SAM, according to the SAM2 paper

## Configuration Review

Let's examine the configurations for each component before running the reconstruction:

In [11]:
roma_config = config['roma']
display(pretty_print_config(roma_config, "ROMA Feature Matching Configuration"))

ROMA Feature Matching Configuration


<IPython.core.display.JSON object>

In [12]:
config['bundletrack']['debug_dir'] = output_data_path / "bundletrack"
bundletrack_config = config['bundletrack']
display(pretty_print_config(bundletrack_config, "Pose Estimation Configuration"))

Pose Estimation Configuration


<IPython.core.display.JSON object>

In [13]:
config['nerf']['save_dir'] = output_data_path  # SDF config
nerf_config = config['nerf']
display(pretty_print_config(nerf_config, "SDF Training Configuration"))

SDF Training Configuration


<IPython.core.display.JSON object>

In [14]:
# Texture Baking
texturebake_config = config['texture_bake']
display(pretty_print_config(texturebake_config, "Texture Baking Configuration"))

Texture Baking Configuration


<IPython.core.display.JSON object>

In [15]:
# Setup dataloaders
track_dataset = ReconstructionDataLoader(
    str(output_data_path), 
    config, 
    downscale=bundletrack_config['downscale'],
    min_resolution=bundletrack_config['min_resolution']
)
nerf_dataset = ReconstructionDataLoader(
    str(output_data_path), 
    config, 
    downscale=nerf_config['downscale'],
    min_resolution=nerf_config['min_resolution']
)
texture_dataset = ReconstructionDataLoader(
    str(output_data_path), 
    config, 
    downscale=texturebake_config['downscale'],
    min_resolution=texturebake_config['min_resolution']
)

# Setup NVBundleSDF instance 
tracker = NVBundleSDF(nerf_config, bundletrack_config, roma_config, texturebake_config)



  _C._set_default_tensor_type(t)
Set PyTorch to use full precision (float32) for NVBundleSDF
PyTorch model loaded successfully. Maximum batch size: 1


Using coarse resolution (560, 560), and upsample res [864, 864]
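
Each dataloader above reads `downscale` and `min_resolution` from its stage's config section, so a missing key only surfaces as a `KeyError` while the loader is being built. The following is a minimal sketch of an up-front check, assuming `config` behaves like a nested dictionary. The helper name `missing_loader_keys` and the sample values are illustrative assumptions and are not part of NVBundleSDF.

```python
# Hypothetical helper, not part of NVBundleSDF. It only checks for the two
# keys that the dataloader cell reads from each stage's config section.
REQUIRED_KEYS = ("downscale", "min_resolution")
STAGES = ("bundletrack", "nerf", "texture_bake")


def missing_loader_keys(config, stages=STAGES, required=REQUIRED_KEYS):
    """Return {stage: [missing keys]} for every stage lacking a required key."""
    missing = {}
    for stage in stages:
        section = config.get(stage) or {}
        absent = [key for key in required if key not in section]
        if absent:
            missing[stage] = absent
    return missing


# Sample values are placeholders chosen for illustration only.
sample_config = {
    "bundletrack": {"downscale": 1.0, "min_resolution": 300},
    "nerf": {"downscale": 0.5},
    "texture_bake": {"downscale": 1.0, "min_resolution": 300},
}
problems = missing_loader_keys(sample_config)
print(problems)
```

An empty result means all three loaders can find the keys they need. Anything else names the stage and the key to add before constructing the loaders.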


Let us now continue with feature matching and pose estimation using BundleSDF.
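
The tracking cell prints a 4×4 pose matrix before and after each update. One way to read those matrices is to turn each Before/After pair into a rotation angle and a translation shift. The sketch below does this with the standard library only, assuming the upper-left 3×3 block is a rotation and the last column is a translation. The function names are illustrative, and the sample values are copied from the log for frame `left000001`.

```python
import math


def rotation_change_deg(pose_before, pose_after):
    """Angle (degrees) of the relative rotation R_after @ R_before^T."""
    # trace(R_after @ R_before^T) equals the element-wise product sum
    # of the two 3x3 rotation blocks.
    trace = sum(
        pose_after[i][k] * pose_before[i][k]
        for i in range(3)
        for k in range(3)
    )
    # Clamp to guard against rounding in the printed 3-decimal values.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_change(pose_before, pose_after):
    """Euclidean distance between the two translation columns."""
    return math.sqrt(
        sum((pose_after[i][3] - pose_before[i][3]) ** 2 for i in range(3))
    )


# Values copied from the "Before" / "After" printout for frame left000001.
before = [
    [1.0, 0.0, 0.0, 0.028],
    [0.0, 1.0, 0.0, 0.007],
    [0.0, 0.0, 1.0, -0.268],
    [0.0, 0.0, 0.0, 1.0],
]
after = [
    [0.904, -0.049, 0.424, -0.106],
    [0.049, 0.999, 0.012, 0.005],
    [-0.424, 0.009, 0.905, -0.253],
    [0.0, 0.0, 0.0, 1.0],
]

angle_deg = rotation_change_deg(before, after)
shift = translation_change(before, after)
print(f"rotation change: {angle_deg:.1f} deg, translation change: {shift:.3f}")
```

For this pair the rotation change is roughly 25 degrees. A sudden jump much larger than the neighbouring frames can be a hint that a frame deserves a closer look, although what counts as too large depends on the capture.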

In [16]:
# Run bundle track for feature matching and pose estimation 
tracker.run_track(track_dataset)

# keyframes.yml is written to the BundleTrack debug directory when tracking succeeds
keyframes_path = os.path.join(config['bundletrack']['debug_dir'], 'keyframes.yml')
if not os.path.exists(keyframes_path):
    print('Feature Matching and Pose Estimation failed. Please check the logs and resolve the error before proceeding.')
else:
    print('Feature Matching and Pose Estimation successful.')

Initializing BundleTrack...
Set PyTorch to use full precision (float32) for NVBundleSDF
Processing 37 frames for tracking...




Applying depth preprocessing...
Depth threshold: 0.2742, remaining valid points: 4470144
Processing frame left000000 (ID: 0)...
Processing frame left000000
Foreground points: 1590940
Valid points: 1517293
Frame left000000 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000000




Applying depth preprocessing...
Depth threshold: 0.3063, remaining valid points: 4987243
Processing frame left000001 (ID: 1)...
Processing frame left000001
Foreground points: 1729978
Valid points: 1646451
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  1.99it/s]
Found correspondences: 1 pairs, shape: (4924, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000001
Before:
[[ 1.     0.     0.     0.028]
 [ 0.     1.     0.     0.007]
 [ 0.     0.     1.    -0.268]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.904 -0.049  0.424 -0.106]
 [ 0.049  0.999  0.012  0.005]
 [-0.424  0.009  0.905 -0.253]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 0 frame pairs
No valid query pairs found
Frame left000001 processing complete


#optimizeGPU frames=2, #keyframes=1, #_frames=2
left000000 left000001 
global_corres=4530
maxNumResiduals / maxNumberOfImages = 192030 / 2 = 96015
m_maxNumberOfImages*m_maxCorrPerImage = 2 x 4530 = 9060
m_solver->solve Time difference = 25.097[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000001
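
Each `#optimizeGPU` block in this output reports the number of frames in the optimization, the keyframe count, the total correspondences, and the solver time. To follow how these grow over a run, the log text can be summarized with a few regular expressions. This is a minimal sketch that assumes the line formats shown in this output. The function name is illustrative, and fields are paired by order of occurrence because other messages can be interleaved between them.

```python
import re

# Patterns mirror the line formats printed in the tracking log.
OPT_RE = re.compile(r"#optimizeGPU frames=(\d+), #keyframes=(\d+), #_frames=(\d+)")
CORRES_RE = re.compile(r"global_corres=(\d+)")
SOLVE_RE = re.compile(r"m_solver->solve Time difference = ([0-9.]+)\[ms\]")


def summarize_optimizer_log(text):
    """Return one dict per #optimizeGPU block found in the log text."""
    headers = OPT_RE.findall(text)
    corres = CORRES_RE.findall(text)
    solves = SOLVE_RE.findall(text)
    rows = []
    # zip stops at the shortest list, so a truncated final block is skipped.
    for (frames, keyframes, _total), n_corres, ms in zip(headers, corres, solves):
        rows.append(
            {
                "frames": int(frames),
                "keyframes": int(keyframes),
                "global_corres": int(n_corres),
                "solve_ms": float(ms),
            }
        )
    return rows


# Sample lines copied from the first two optimization blocks in the log.
sample_log = """
#optimizeGPU frames=2, #keyframes=1, #_frames=2
global_corres=4530
m_solver->solve Time difference = 25.097[ms]
#optimizeGPU frames=3, #keyframes=2, #_frames=3
global_corres=13055
m_solver->solve Time difference = 70.379[ms]
"""
rows = summarize_optimizer_log(sample_log)
for row in rows:
    print(row)
```

Applied to a saved copy of the full cell output, the same function gives a per-block table that is easier to scan than the raw log.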




Applying depth preprocessing...
Depth threshold: 0.3110, remaining valid points: 5174659
Processing frame left000002 (ID: 2)...
Processing frame left000002
Foreground points: 1864695
Valid points: 1780227
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.25it/s]
Found correspondences: 1 pairs, shape: (4954, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000002
Before:
[[ 0.904 -0.049  0.424 -0.106]
 [ 0.049  0.999  0.012  0.005]
 [-0.424  0.009  0.905 -0.253]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.735 -0.079  0.673 -0.189]
 [ 0.076  0.997  0.034 -0.001]
 [-0.674  0.026  0.739 -0.211]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 1 frame pairs
Running RoMa feature matching...




100% 1/1 [00:00<00:00,  5.25it/s]
Found correspondences: 1 pairs, shape: (4919, 4)
Running RANSAC for robust correspondence estimation...
Frame left000002 processing complete


#optimizeGPU frames=3, #keyframes=2, #_frames=3
left000000 left000001 left000002 
global_corres=13055
maxNumResiduals / maxNumberOfImages = 575555 / 3 = 191851
m_maxNumberOfImages*m_maxCorrPerImage = 3 x 8867 = 26601
m_solver->solve Time difference = 70.379[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000002




Applying depth preprocessing...
Depth threshold: 0.3121, remaining valid points: 5217669
Processing frame left000003 (ID: 3)...
Processing frame left000003
Foreground points: 1889508
Valid points: 1800136
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  4.96it/s]
Found correspondences: 1 pairs, shape: (4961, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000003
Before:
[[ 0.732 -0.079  0.677 -0.19 ]
 [ 0.076  0.997  0.033 -0.001]
 [-0.677  0.027  0.735 -0.21 ]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.563 -0.095  0.821 -0.24 ]
 [ 0.094  0.994  0.051 -0.005]
 [-0.821  0.048  0.568 -0.164]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 2 frame pairs
Running RoMa feature matching...




100% 2/2 [00:00<00:00,  5.25it/s]
Found correspondences: 2 pairs, shape: (4876, 4)
Running RANSAC for robust correspondence estimation...
Frame left000003 processing complete


#optimizeGPU frames=4, #keyframes=3, #_frames=4
left000000 left000001 left000002 left000003 
global_corres=24458
maxNumResiduals / maxNumberOfImages = 1149458 / 4 = 287364
m_maxNumberOfImages*m_maxCorrPerImage = 4 x 12756 = 51024
m_solver->solve Time difference = 117.752[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000003




Applying depth preprocessing...
Depth threshold: 0.3108, remaining valid points: 5151983
Processing frame left000004 (ID: 4)...
Processing frame left000004
Foreground points: 1842754
Valid points: 1755693
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.11it/s]
Found correspondences: 1 pairs, shape: (4981, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000004
Before:
[[ 0.561 -0.095  0.822 -0.24 ]
 [ 0.093  0.994  0.051 -0.005]
 [-0.823  0.048  0.567 -0.164]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.33  -0.107  0.938 -0.284]
 [ 0.107  0.991  0.076 -0.013]
 [-0.938  0.075  0.338 -0.098]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 3 frame pairs
Running RoMa feature matching...




100% 3/3 [00:00<00:00,  5.23it/s]
Found correspondences: 3 pairs, shape: (4828, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=5, #keyframes=4, #_frames=5
left000000 left000001 left000002 left000003 left000004 
global_corres=39635
maxNumResiduals / maxNumberOfImages = 1914635 / 5 = 382927
m_maxNumberOfImages*m_maxCorrPerImage = 5 x 16847 = 84235
m_solver->solve Time difference = 171.969[ms]


Frame left000004 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000004




Applying depth preprocessing...
Depth threshold: 0.2767, remaining valid points: 4631828
Processing frame left000005 (ID: 5)...
Processing frame left000005
Foreground points: 1702157
Valid points: 1617247
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.38it/s]
Found correspondences: 1 pairs, shape: (4975, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000005
Before:
[[ 0.328 -0.107  0.939 -0.284]
 [ 0.106  0.991  0.076 -0.012]
 [-0.939  0.075  0.337 -0.097]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.074 -0.111  0.991 -0.31 ]
 [ 0.113  0.988  0.103 -0.02 ]
 [-0.991  0.104  0.085 -0.022]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 2 frame pairs
Running RoMa feature matching...




100% 2/2 [00:00<00:00,  5.28it/s]
Found correspondences: 2 pairs, shape: (4971, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=6, #keyframes=5, #_frames=6
left000000 left000001 left000002 left000003 left000004 left000005 
global_corres=53116


Frame left000005 processing complete


maxNumResiduals / maxNumberOfImages = 2865616 / 6 = 477602
m_maxNumberOfImages*m_maxCorrPerImage = 6 x 21012 = 126072
m_solver->solve Time difference = 218.632[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000005




Applying depth preprocessing...
Depth threshold: 0.2790, remaining valid points: 4649650
Processing frame left000006 (ID: 6)...
Processing frame left000006
Foreground points: 1700689
Valid points: 1619214
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.27it/s]
Found correspondences: 1 pairs, shape: (4979, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000006
Before:
[[ 0.074 -0.111  0.991 -0.31 ]
 [ 0.113  0.988  0.103 -0.021]
 [-0.991  0.104  0.085 -0.022]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.194 -0.111  0.975 -0.314]
 [ 0.112  0.984  0.135 -0.03 ]
 [-0.975  0.136 -0.178  0.06 ]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 3 frame pairs
Running RoMa feature matching...




100% 3/3 [00:00<00:00,  5.30it/s]
Found correspondences: 3 pairs, shape: (4970, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=7, #keyframes=6, #_frames=7
left000000 left000001 left000002 left000003 left000004 left000005 left000006 
global_corres=70866


Frame left000006 processing complete


maxNumResiduals / maxNumberOfImages = 4008366 / 7 = 572623
m_maxNumberOfImages*m_maxCorrPerImage = 7 x 25317 = 177219
m_solver->solve Time difference = 287.48[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000006




Applying depth preprocessing...
Depth threshold: 0.2928, remaining valid points: 4967532
Processing frame left000007 (ID: 7)...
Processing frame left000007
Foreground points: 1881931
Valid points: 1795503
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.35it/s]
Found correspondences: 1 pairs, shape: (4953, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000007
Before:
[[-0.193 -0.111  0.975 -0.314]
 [ 0.112  0.985  0.134 -0.03 ]
 [-0.975  0.135 -0.178  0.06 ]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.465 -0.103  0.879 -0.294]
 [ 0.101  0.981  0.168 -0.041]
 [-0.88   0.166 -0.446  0.145]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 4 frame pairs
Running RoMa feature matching...




100% 4/4 [00:00<00:00,  5.24it/s]
Found correspondences: 4 pairs, shape: (4925, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=8, #keyframes=7, #_frames=8
left000000 left000001 left000002 left000003 left000004 left000005 left000006 left000007 
global_corres=92907


Frame left000007 processing complete


maxNumResiduals / maxNumberOfImages = 5342907 / 8 = 667863
m_maxNumberOfImages*m_maxCorrPerImage = 8 x 29701 = 237608
m_solver->solve Time difference = 347.294[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000007




Applying depth preprocessing...
Depth threshold: 0.2968, remaining valid points: 5149789
Processing frame left000008 (ID: 8)...
Processing frame left000008
Foreground points: 2050174
Valid points: 1956795
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.17it/s]
Found correspondences: 1 pairs, shape: (4984, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000008
Before:
[[-0.465 -0.102  0.88  -0.294]
 [ 0.101  0.981  0.167 -0.041]
 [-0.88   0.166 -0.445  0.145]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.706 -0.082  0.703 -0.248]
 [ 0.079  0.978  0.193 -0.049]
 [-0.703  0.192 -0.684  0.224]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 5 frame pairs




Running RoMa feature matching...
100% 5/5 [00:00<00:00,  5.30it/s]
Found correspondences: 5 pairs, shape: (4874, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=9, #keyframes=8, #_frames=9
left000000 left000001 left000002 left000003 left000004 left000005 left000006 left000007 left000008 
global_corres=117377


Frame left000008 processing complete


maxNumResiduals / maxNumberOfImages = 6867377 / 9 = 763041
m_maxNumberOfImages*m_maxCorrPerImage = 9 x 33887 = 304983
m_solver->solve Time difference = 406.478[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000008




Applying depth preprocessing...
Depth threshold: 0.3002, remaining valid points: 5204647
Processing frame left000009 (ID: 9)...
Processing frame left000009
Foreground points: 2097362
Valid points: 1998611
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.42it/s]
Found correspondences: 1 pairs, shape: (4986, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000009
Before:
[[-0.704 -0.083  0.705 -0.249]
 [ 0.079  0.978  0.194 -0.05 ]
 [-0.706  0.192 -0.682  0.223]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.846 -0.062  0.529 -0.2  ]
 [ 0.058  0.976  0.208 -0.055]
 [-0.53   0.207 -0.823  0.273]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 6 frame pairs
Running RoMa feature matching...




100% 6/6 [00:01<00:00,  5.28it/s]
Found correspondences: 6 pairs, shape: (4856, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=9, #_frames=10
left000000 left000001 left000002 left000003 left000004 left000005 left000006 left000007 left000008 left000009 
global_corres=143690


Frame left000009 processing complete


maxNumResiduals / maxNumberOfImages = 8581190 / 10 = 858119
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 37374 = 373740
m_solver->solve Time difference = 459.322[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000009




Applying depth preprocessing...
Depth threshold: 0.2925, remaining valid points: 5023312
Processing frame left000010 (ID: 10)...
Processing frame left000010
Foreground points: 2015842
Valid points: 1916923
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.34it/s]
Found correspondences: 1 pairs, shape: (4987, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000010
Before:
[[-0.847 -0.062  0.528 -0.2  ]
 [ 0.058  0.977  0.208 -0.055]
 [-0.529  0.206 -0.823  0.273]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.96  -0.033  0.277 -0.126]
 [ 0.03   0.975  0.22  -0.059]
 [-0.277  0.219 -0.935  0.316]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 7 frame pairs




Running RoMa feature matching...
100% 7/7 [00:01<00:00,  5.29it/s]
Found correspondences: 7 pairs, shape: (4616, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=10, #_frames=11
left000001 left000002 left000003 left000004 left000005 left000006 left000007 left000008 left000009 left000010 
global_corres=152656
maxNumResiduals / maxNumberOfImages = 8590156 / 10 = 859015
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 36423 = 364230
m_solver->solve Time difference = 480.763[ms]


Frame left000010 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000010




Applying depth preprocessing...
Depth threshold: 0.2548, remaining valid points: 4382905
Processing frame left000011 (ID: 11)...
Processing frame left000011
Foreground points: 1847199
Valid points: 1760533
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.19it/s]
Found correspondences: 1 pairs, shape: (4994, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000011
Before:
[[-0.961 -0.034  0.275 -0.126]
 [ 0.029  0.975  0.221 -0.06 ]
 [-0.275  0.22  -0.936  0.317]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.998 -0.009  0.058 -0.06 ]
 [ 0.004  0.975  0.224 -0.061]
 [-0.058  0.224 -0.973  0.336]
 [ 0.     0.     0.     1.   ]]
Finding correspondences between 2 frame pairs




Running RoMa feature matching...
100% 2/2 [00:00<00:00,  5.32it/s]
Found correspondences: 2 pairs, shape: (4994, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=11, #_frames=12
left000002 left000003 left000004 left000005 left000006 left000007 left000008 left000009 left000010 left000011 
global_corres=155450
maxNumResiduals / maxNumberOfImages = 8592950 / 10 = 859295
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 38002 = 380020
m_solver->solve Time difference = 475.774[ms]


Frame left000011 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000011




Applying depth preprocessing...
Depth threshold: 0.2580, remaining valid points: 4449988
Processing frame left000012 (ID: 12)...
Processing frame left000012
Foreground points: 1863276
Valid points: 1774278
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.38it/s]
Found correspondences: 1 pairs, shape: (4991, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000012
Before:
[[-0.998 -0.009  0.058 -0.06 ]
 [ 0.005  0.975  0.223 -0.061]
 [-0.059  0.223 -0.973  0.336]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.988  0.016 -0.153  0.005]
 [-0.019  0.975  0.221 -0.061]
 [ 0.153  0.221 -0.963  0.34 ]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 3 frame pairs
Running RoMa feature matching...




100% 3/3 [00:00<00:00,  5.28it/s]
Found correspondences: 3 pairs, shape: (4976, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=12, #_frames=13
left000000 left000001 left000005 left000006 left000007 left000008 left000009 left000010 left000011 left000012 
global_corres=97246


Frame left000012 processing complete


maxNumResiduals / maxNumberOfImages = 8534746 / 10 = 853474
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 30786 = 307860
m_solver->solve Time difference = 357.488[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000012




Applying depth preprocessing...
Depth threshold: 0.2709, remaining valid points: 4729669
Processing frame left000013 (ID: 13)...
Processing frame left000013
Foreground points: 1978771
Valid points: 1791590
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.39it/s]
Found correspondences: 1 pairs, shape: (4991, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000013
Before:
[[-0.988  0.016 -0.153  0.005]
 [-0.019  0.975  0.221 -0.061]
 [ 0.152  0.221 -0.963  0.34 ]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.945  0.036 -0.325  0.059]
 [-0.037  0.975  0.217 -0.061]
 [ 0.325  0.217 -0.921  0.333]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 4 frame pairs
Running RoMa feature matching...




100% 4/4 [00:00<00:00,  5.33it/s]
Found correspondences: 4 pairs, shape: (4963, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=13, #_frames=14
left000000 left000005 left000006 left000007 left000008 left000009 left000010 left000011 left000012 left000013 
global_corres=116579
maxNumResiduals / maxNumberOfImages = 8554079 / 10 = 855407
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 35383 = 353830
m_solver->solve Time difference = 414.335[ms]


Frame left000013 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000013




Applying depth preprocessing...
Depth threshold: 0.2840, remaining valid points: 5052082
Processing frame left000014 (ID: 14)...
Processing frame left000014
Foreground points: 2146088
Valid points: 2044071
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.34it/s]
Found correspondences: 1 pairs, shape: (4978, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000014
Before:
[[-0.945  0.036 -0.325  0.059]
 [-0.038  0.975  0.217 -0.061]
 [ 0.324  0.217 -0.921  0.333]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.852  0.059 -0.521  0.123]
 [-0.059  0.976  0.208 -0.059]
 [ 0.521  0.208 -0.828  0.311]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 5 frame pairs
Running RoMa feature matching...




100% 5/5 [00:00<00:00,  5.31it/s]
Found correspondences: 5 pairs, shape: (4914, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=14, #_frames=15
left000000 left000006 left000007 left000008 left000009 left000010 left000011 left000012 left000013 left000014 
global_corres=124478
maxNumResiduals / maxNumberOfImages = 8561978 / 10 = 856197
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 36423 = 364230
m_solver->solve Time difference = 441.075[ms]


Frame left000014 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000014




Applying depth preprocessing...
Depth threshold: 0.2855, remaining valid points: 5150804
Processing frame left000015 (ID: 15)...
Processing frame left000015
Foreground points: 2216227
Valid points: 2115224
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.42it/s]
Found correspondences: 1 pairs, shape: (4985, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000015
Before:
[[-0.852  0.059 -0.52   0.123]
 [-0.059  0.976  0.208 -0.059]
 [ 0.52   0.208 -0.828  0.311]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.705  0.081 -0.704  0.184]
 [-0.08   0.978  0.193 -0.055]
 [ 0.704  0.192 -0.683  0.273]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 6 frame pairs
Running RoMa feature matching...




100% 6/6 [00:01<00:00,  5.30it/s]
Found correspondences: 6 pairs, shape: (4813, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=15, #_frames=16
left000006 left000007 left000008 left000009 left000010 left000011 left000012 left000013 left000014 left000015 
global_corres=153336
maxNumResiduals / maxNumberOfImages = 8590836 / 10 = 859083
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 40897 = 408970
m_solver->solve Time difference = 518.368[ms]


Frame left000015 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000015




Applying depth preprocessing...
Depth threshold: 0.2883, remaining valid points: 5157618
Processing frame left000016 (ID: 16)...
Processing frame left000016
Foreground points: 2182317
Valid points: 2079343
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.32it/s]
Found correspondences: 1 pairs, shape: (4984, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000016
Before:
[[-0.705  0.081 -0.705  0.184]
 [-0.08   0.978  0.193 -0.055]
 [ 0.705  0.193 -0.683  0.273]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.525  0.096 -0.845  0.234]
 [-0.097  0.98   0.172 -0.049]
 [ 0.845  0.172 -0.506  0.223]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 7 frame pairs




Running RoMa feature matching...
100% 7/7 [00:01<00:00,  5.28it/s]
Found correspondences: 7 pairs, shape: (4883, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=16, #_frames=17
left000007 left000008 left000009 left000010 left000011 left000012 left000013 left000014 left000015 left000016 
global_corres=167785
maxNumResiduals / maxNumberOfImages = 8605285 / 10 = 860528
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 40718 = 407180
m_solver->solve Time difference = 525.794[ms]


Frame left000016 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000016




Applying depth preprocessing...
Depth threshold: 0.2777, remaining valid points: 4904225
Processing frame left000017 (ID: 17)...
Processing frame left000017
Foreground points: 2032977
Valid points: 1934111
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.40it/s]
Found correspondences: 1 pairs, shape: (4980, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000017
Before:
[[-0.524  0.096 -0.846  0.234]
 [-0.097  0.98   0.171 -0.049]
 [ 0.846  0.172 -0.504  0.223]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.317  0.106 -0.942  0.271]
 [-0.108  0.983  0.148 -0.042]
 [ 0.942  0.149 -0.301  0.164]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 2 frame pairs
Running RoMa feature matching...




100% 2/2 [00:00<00:00,  5.33it/s]
Found correspondences: 2 pairs, shape: (4971, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=17, #_frames=18
left000008 left000009 left000010 left000011 left000012 left000013 left000014 left000015 left000016 left000017 
global_corres=168316
maxNumResiduals / maxNumberOfImages = 8605816 / 10 = 860581
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 39112 = 391120
m_solver->solve Time difference = 487.171[ms]


Frame left000017 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000017




Applying depth preprocessing...
Depth threshold: 0.2606, remaining valid points: 4507002
Processing frame left000018 (ID: 18)...
Processing frame left000018
Foreground points: 1857625
Valid points: 1769348
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.35it/s]
Found correspondences: 1 pairs, shape: (4979, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000018
Before:
[[-0.318  0.106 -0.942  0.271]
 [-0.108  0.983  0.148 -0.042]
 [ 0.942  0.149 -0.301  0.164]
 [ 0.     0.     0.     1.   ]]
After:
[[-0.032  0.111 -0.993  0.297]
 [-0.113  0.987  0.114 -0.032]
 [ 0.993  0.116 -0.019  0.079]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 3 frame pairs
Running RoMa feature matching...




100% 3/3 [00:00<00:00,  5.25it/s]
Found correspondences: 3 pairs, shape: (4970, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=18, #_frames=19
left000000 left000010 left000011 left000012 left000013 left000014 left000015 left000016 left000017 left000018 
global_corres=119548
maxNumResiduals / maxNumberOfImages = 8557048 / 10 = 855704
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 33947 = 339470
m_solver->solve Time difference = 426.313[ms]


Frame left000018 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000018




Applying depth preprocessing...
Depth threshold: 0.2682, remaining valid points: 4569635
Processing frame left000019 (ID: 19)...
Processing frame left000019
Foreground points: 1791695
Valid points: 1705087
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.28it/s]
Found correspondences: 1 pairs, shape: (4981, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000019
Before:
[[-0.032  0.111 -0.993  0.297]
 [-0.113  0.987  0.114 -0.032]
 [ 0.993  0.116 -0.019  0.079]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.147  0.109 -0.983  0.3  ]
 [-0.111  0.989  0.093 -0.025]
 [ 0.983  0.096  0.157  0.024]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 4 frame pairs




Running RoMa feature matching...
100% 4/4 [00:00<00:00,  5.20it/s]
Found correspondences: 4 pairs, shape: (4977, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=19, #_frames=20
left000000 left000001 left000002 left000013 left000014 left000015 left000016 left000017 left000018 left000019 
global_corres=87044
maxNumResiduals / maxNumberOfImages = 8524544 / 10 = 852454
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 25221 = 252210
m_solver->solve Time difference = 327.967[ms]


Frame left000019 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000019




Applying depth preprocessing...
Depth threshold: 0.2904, remaining valid points: 4980916
Processing frame left000020 (ID: 20)...
Processing frame left000020
Foreground points: 1925967
Valid points: 1833358
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.41it/s]
Found correspondences: 1 pairs, shape: (4977, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000020
Before:
[[ 0.147  0.109 -0.983  0.3  ]
 [-0.112  0.989  0.093 -0.025]
 [ 0.983  0.096  0.157  0.024]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.366  0.101 -0.925  0.29 ]
 [-0.104  0.992  0.068 -0.017]
 [ 0.925  0.071  0.373 -0.044]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.23it/s]
Found correspondences: 8 pairs, shape: (4629, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=20, #_frames=21
left000000 left000001 left000002 left000014 left000015 left000016 left000017 left000018 left000019 left000020 
global_corres=110573
maxNumResiduals / maxNumberOfImages = 8548073 / 10 = 854807
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 33567 = 335670
m_solver->solve Time difference = 383.782[ms]


Frame left000020 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000020




Applying depth preprocessing...
Depth threshold: 0.2991, remaining valid points: 5119996
Processing frame left000021 (ID: 21)...
Processing frame left000021
Foreground points: 1983202
Valid points: 1891585
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.33it/s]
Found correspondences: 1 pairs, shape: (4960, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000021
Before:
[[ 0.378  0.095 -0.921  0.288]
 [-0.099  0.993  0.062 -0.016]
 [ 0.92   0.068  0.385 -0.048]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.637  0.081 -0.766  0.249]
 [-0.08   0.996  0.039 -0.008]
 [ 0.766  0.037  0.641 -0.131]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.28it/s]
Found correspondences: 8 pairs, shape: (4878, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=21, #_frames=22
left000000 left000001 left000002 left000015 left000016 left000017 left000018 left000019 left000020 left000021 
global_corres=126872
maxNumResiduals / maxNumberOfImages = 8564372 / 10 = 856437
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 38323 = 383230
m_solver->solve Time difference = 426.719[ms]


Frame left000021 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000021




Applying depth preprocessing...
Depth threshold: 0.3077, remaining valid points: 5152195
Processing frame left000022 (ID: 22)...
Processing frame left000022
Foreground points: 1908507
Valid points: 1817522
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.41it/s]
Found correspondences: 1 pairs, shape: (4965, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000022
Before:
[[ 0.649  0.08  -0.757  0.246]
 [-0.08   0.996  0.037 -0.007]
 [ 0.757  0.037  0.653 -0.134]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.845  0.055 -0.532  0.184]
 [-0.054  0.998  0.017 -0.   ]
 [ 0.533  0.015  0.846 -0.201]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 10 frame pairs




Running RoMa feature matching...
100% 10/10 [00:01<00:00,  5.20it/s]
Found correspondences: 10 pairs, shape: (4894, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=22, #_frames=23
left000000 left000001 left000002 left000003 left000017 left000018 left000019 left000020 left000021 left000022 
global_corres=130881
maxNumResiduals / maxNumberOfImages = 8568381 / 10 = 856838
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 37469 = 374690
m_solver->solve Time difference = 428.68[ms]


Frame left000022 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000022




Applying depth preprocessing...
Depth threshold: 0.2803, remaining valid points: 4641393
Processing frame left000023 (ID: 23)...
Processing frame left000023
Foreground points: 1689878
Valid points: 1575782
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.28it/s]
Found correspondences: 1 pairs, shape: (4956, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000023
Before:
[[ 0.854  0.054 -0.518  0.18 ]
 [-0.054  0.998  0.016  0.   ]
 [ 0.518  0.014  0.855 -0.205]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.974  0.022 -0.225  0.094]
 [-0.021  1.     0.004  0.005]
 [ 0.225  0.001  0.974 -0.251]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 6 frame pairs
Running RoMa feature matching...




100% 6/6 [00:01<00:00,  5.28it/s]
Found correspondences: 6 pairs, shape: (4928, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=23, #_frames=24
left000000 left000001 left000002 left000003 left000004 left000019 left000020 left000021 left000022 left000023 
global_corres=135100
maxNumResiduals / maxNumberOfImages = 8572600 / 10 = 857260
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 33432 = 334320
m_solver->solve Time difference = 398.257[ms]


Frame left000023 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000023




Applying depth preprocessing...
Depth threshold: 0.2754, remaining valid points: 4497487
Processing frame left000024 (ID: 24)...
Processing frame left000024
Foreground points: 1594770
Valid points: 1519155
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.38it/s]
Found correspondences: 1 pairs, shape: (4958, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000024
Before:
[[ 0.981  0.022 -0.195  0.086]
 [-0.021  1.     0.004  0.005]
 [ 0.195  0.     0.981 -0.254]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.998 -0.009  0.066  0.006]
 [ 0.009  1.     0.002  0.007]
 [-0.066 -0.001  0.998 -0.269]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 7 frame pairs
Running RoMa feature matching...




100% 7/7 [00:01<00:00,  5.24it/s]
Found correspondences: 7 pairs, shape: (4917, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=24, #_frames=25
left000000 left000001 left000002 left000003 left000004 left000020 left000021 left000022 left000023 left000024 
global_corres=156149


Frame left000024 processing complete


maxNumResiduals / maxNumberOfImages = 8593649 / 10 = 859364
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 36521 = 365210
m_solver->solve Time difference = 486.394[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000024




Applying depth preprocessing...
Depth threshold: 0.2826, remaining valid points: 4544117
Processing frame left000025 (ID: 25)...
Processing frame left000025
Foreground points: 1558923
Valid points: 1480129
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.37it/s]
Found correspondences: 1 pairs, shape: (4782, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000025
Before:
[[ 0.997 -0.009  0.079  0.002]
 [ 0.009  1.     0.001  0.007]
 [-0.079 -0.     0.997 -0.269]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.01  -0.991 -0.132  0.056]
 [ 1.     0.008  0.014  0.019]
 [-0.013 -0.132  0.991 -0.267]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.27it/s]
Found correspondences: 8 pairs, shape: (4925, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=24, #_frames=26
left000000 left000001 left000002 left000003 left000004 left000020 left000021 left000022 left000023 left000025 
global_corres=155646


Frame left000025 processing complete


maxNumResiduals / maxNumberOfImages = 8593146 / 10 = 859314
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 36541 = 365410
m_solver->solve Time difference = 456.12[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000025




Applying depth preprocessing...
Depth threshold: 0.3178, remaining valid points: 4886980
Processing frame left000026 (ID: 26)...
Processing frame left000026
Foreground points: 1572579
Valid points: 1472417
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.34it/s]
Found correspondences: 1 pairs, shape: (461, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000026
Before:
[[ 0.997 -0.009  0.079  0.002]
 [ 0.009  1.     0.001  0.007]
 [-0.079 -0.     0.997 -0.269]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.161 -0.987 -0.007  0.025]
 [ 0.818  0.13   0.561 -0.163]
 [-0.553 -0.096  0.828 -0.237]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.23it/s]
Found correspondences: 8 pairs, shape: (4882, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=24, #_frames=26
left000000 left000001 left000002 left000003 left000004 left000020 left000021 left000022 left000023 left000026 
global_corres=152033
maxNumResiduals / maxNumberOfImages = 8589533 / 10 = 858953
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 35938 = 359380
m_solver->solve Time difference = 436.895[ms]


Frame left000026 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000026




Applying depth preprocessing...
Depth threshold: 0.3283, remaining valid points: 4933823
Processing frame left000027 (ID: 27)...
Processing frame left000027
Foreground points: 1567740
Valid points: 1463558
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.33it/s]
Found correspondences: 1 pairs, shape: (4928, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000027
Before:
[[ 0.016 -0.993 -0.118  0.052]
 [ 0.834 -0.052  0.549 -0.15 ]
 [-0.551 -0.107  0.828 -0.236]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.015 -0.993 -0.119  0.052]
 [ 0.684 -0.076  0.725 -0.209]
 [-0.729 -0.092  0.678 -0.197]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.25it/s]
Found correspondences: 8 pairs, shape: (4862, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=25, #_frames=27
left000000 left000001 left000002 left000003 left000004 left000021 left000022 left000023 left000026 left000027 
global_corres=165908
maxNumResiduals / maxNumberOfImages = 8603408 / 10 = 860340
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 37205 = 372050
m_solver->solve Time difference = 454.728[ms]


Frame left000027 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000027




Applying depth preprocessing...
Depth threshold: 0.3308, remaining valid points: 4877902
Processing frame left000028 (ID: 28)...
Processing frame left000028
Foreground points: 1509367
Valid points: 1401546
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.38it/s]
Found correspondences: 1 pairs, shape: (4916, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000028
Before:
[[ 0.011 -0.993 -0.115  0.051]
 [ 0.68  -0.077  0.729 -0.21 ]
 [-0.733 -0.086  0.674 -0.196]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.009 -0.993 -0.114  0.05 ]
 [ 0.491 -0.095  0.866 -0.259]
 [-0.871 -0.064  0.487 -0.143]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs




Running RoMa feature matching...
100% 8/8 [00:01<00:00,  5.17it/s]
Found correspondences: 8 pairs, shape: (4782, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=26, #_frames=28
left000000 left000001 left000002 left000003 left000021 left000022 left000023 left000026 left000027 left000028 
global_corres=166652
maxNumResiduals / maxNumberOfImages = 8604152 / 10 = 860415
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 36299 = 362990
m_solver->solve Time difference = 451.152[ms]


Frame left000028 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000028




Applying depth preprocessing...
Depth threshold: 0.2554, remaining valid points: 3864516
Processing frame left000029 (ID: 29)...
Processing frame left000029
Foreground points: 1299505
Valid points: 1232317
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.31it/s]
Found correspondences: 1 pairs, shape: (4869, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000029
Before:
[[ 0.006 -0.993 -0.115  0.05 ]
 [ 0.487 -0.098  0.868 -0.259]
 [-0.873 -0.061  0.484 -0.142]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.007 -0.993 -0.114  0.05 ]
 [ 0.164 -0.111  0.98  -0.305]
 [-0.986 -0.025  0.162 -0.047]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 2 frame pairs




Running RoMa feature matching...
100% 2/2 [00:00<00:00,  5.23it/s]
Found correspondences: 2 pairs, shape: (4820, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=27, #_frames=29
left000016 left000017 left000018 left000019 left000020 left000021 left000026 left000027 left000028 left000029 
global_corres=102139
maxNumResiduals / maxNumberOfImages = 8539639 / 10 = 853963
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 31930 = 319300
m_solver->solve Time difference = 348.57[ms]


Frame left000029 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000029




Applying depth preprocessing...
Depth threshold: 0.3310, remaining valid points: 4920750
Processing frame left000030 (ID: 30)...
Processing frame left000030
Foreground points: 1542630
Valid points: 1437734
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.37it/s]
Found correspondences: 1 pairs, shape: (4769, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000030
Before:
[[ 0.014 -0.993 -0.114  0.05 ]
 [ 0.161 -0.11   0.981 -0.305]
 [-0.987 -0.032  0.159 -0.045]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.013 -0.993 -0.119  0.052]
 [ 0.571 -0.09   0.816 -0.24 ]
 [-0.821 -0.078  0.566 -0.165]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.15it/s]
Found correspondences: 8 pairs, shape: (4771, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=28, #_frames=30
left000000 left000001 left000002 left000022 left000023 left000026 left000027 left000028 left000029 left000030 
global_corres=149924
maxNumResiduals / maxNumberOfImages = 8587424 / 10 = 858742
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 34900 = 349000
m_solver->solve Time difference = 395.113[ms]


Frame left000030 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000030




Applying depth preprocessing...
Depth threshold: 0.3187, remaining valid points: 4891015
Processing frame left000031 (ID: 31)...
Processing frame left000031
Foreground points: 1573884
Valid points: 1472593
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.25it/s]
Found correspondences: 1 pairs, shape: (4920, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000031
Before:
[[ 0.009 -0.993 -0.118  0.051]
 [ 0.576 -0.091  0.812 -0.239]
 [-0.817 -0.075  0.571 -0.167]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.011 -0.993 -0.12   0.052]
 [ 0.822 -0.06   0.567 -0.156]
 [-0.57  -0.104  0.815 -0.233]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs




Running RoMa feature matching...
100% 8/8 [00:01<00:00,  5.15it/s]
Found correspondences: 8 pairs, shape: (4882, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=29, #_frames=31
left000000 left000001 left000002 left000022 left000023 left000026 left000027 left000028 left000030 left000031 
global_corres=170705


Frame left000031 processing complete


maxNumResiduals / maxNumberOfImages = 8608205 / 10 = 860820
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 36638 = 366380
m_solver->solve Time difference = 437.628[ms]


Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000031




Applying depth preprocessing...
Depth threshold: 0.2910, remaining valid points: 4594085
Processing frame left000032 (ID: 32)...
Processing frame left000032
Foreground points: 1514187
Valid points: 1436583
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.27it/s]
Found correspondences: 1 pairs, shape: (4927, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000032
Before:
[[ 0.007 -0.993 -0.116  0.051]
 [ 0.823 -0.06   0.565 -0.155]
 [-0.568 -0.099  0.817 -0.233]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.006 -0.993 -0.119  0.052]
 [ 0.989 -0.012  0.15  -0.023]
 [-0.151 -0.119  0.981 -0.269]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 9 frame pairs
Running RoMa feature matching...




100% 9/9 [00:01<00:00,  5.15it/s]
Found correspondences: 9 pairs, shape: (4904, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=29, #_frames=32
left000000 left000001 left000002 left000003 left000021 left000022 left000023 left000026 left000027 left000032 
global_corres=180967
maxNumResiduals / maxNumberOfImages = 8618467 / 10 = 861846
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 38439 = 384390
m_solver->solve Time difference = 469.339[ms]


Frame left000032 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000032




Applying depth preprocessing...
Depth threshold: 0.2893, remaining valid points: 4634540
Processing frame left000033 (ID: 33)...
Processing frame left000033
Foreground points: 1600464
Valid points: 1472058
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.23it/s]
Found correspondences: 1 pairs, shape: (4948, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000033
Before:
[[ 0.006 -0.993 -0.119  0.052]
 [ 0.989 -0.012  0.149 -0.023]
 [-0.15  -0.119  0.982 -0.269]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.005 -0.992 -0.124  0.053]
 [ 0.934  0.048 -0.353  0.129]
 [ 0.356 -0.114  0.927 -0.235]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.11it/s]
Found correspondences: 8 pairs, shape: (4934, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=30, #_frames=33
left000000 left000001 left000002 left000003 left000021 left000022 left000023 left000026 left000032 left000033 
global_corres=189157
maxNumResiduals / maxNumberOfImages = 8626657 / 10 = 862665
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 40105 = 401050
m_solver->solve Time difference = 492.27[ms]


Frame left000033 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000033




Applying depth preprocessing...
Depth threshold: 0.3055, remaining valid points: 4817288
Processing frame left000034 (ID: 34)...
Processing frame left000034
Foreground points: 1691650
Valid points: 1580794
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.29it/s]
Found correspondences: 1 pairs, shape: (4923, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000034
Before:
[[ 0.006 -0.992 -0.122  0.053]
 [ 0.935  0.049 -0.352  0.129]
 [ 0.356 -0.112  0.928 -0.235]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.003 -0.992 -0.123  0.053]
 [ 0.69   0.092 -0.718  0.233]
 [ 0.723 -0.083  0.685 -0.147]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.15it/s]
Found correspondences: 8 pairs, shape: (4936, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=31, #_frames=34
left000000 left000001 left000002 left000021 left000022 left000023 left000026 left000032 left000033 left000034 
global_corres=193976
maxNumResiduals / maxNumberOfImages = 8631476 / 10 = 863147
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 40810 = 408100
m_solver->solve Time difference = 474.697[ms]


Frame left000034 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000034




Applying depth preprocessing...
Depth threshold: 0.3093, remaining valid points: 4760526
Processing frame left000035 (ID: 35)...
Processing frame left000035
Foreground points: 1647044
Valid points: 1530121
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.21it/s]
Found correspondences: 1 pairs, shape: (4893, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000035
Before:
[[ 0.004 -0.992 -0.123  0.053]
 [ 0.691  0.092 -0.717  0.233]
 [ 0.723 -0.082  0.686 -0.148]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.003 -0.993 -0.121  0.053]
 [ 0.489  0.107 -0.866  0.271]
 [ 0.872 -0.057  0.485 -0.081]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.13it/s]
Found correspondences: 8 pairs, shape: (4646, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=32, #_frames=35
left000000 left000001 left000002 left000022 left000023 left000026 left000032 left000033 left000034 left000035 
global_corres=188456
maxNumResiduals / maxNumberOfImages = 8625956 / 10 = 862595
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 40823 = 408230
m_solver->solve Time difference = 478.167[ms]


Frame left000035 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000035




Applying depth preprocessing...
Depth threshold: 0.3067, remaining valid points: 4679417
Processing frame left000036 (ID: 36)...
Processing frame left000036
Foreground points: 1605657
Valid points: 1492363
Finding correspondences between 1 frame pairs
Running RoMa feature matching...
100% 1/1 [00:00<00:00,  5.31it/s]
Found correspondences: 1 pairs, shape: (4870, 4)
Running RANSAC for robust correspondence estimation...
Updating pose for frame left000036
Before:
[[ 0.002 -0.993 -0.121  0.053]
 [ 0.487  0.107 -0.867  0.271]
 [ 0.874 -0.058  0.483 -0.08 ]
 [ 0.     0.     0.     1.   ]]
After:
[[ 0.002 -0.993 -0.12   0.052]
 [ 0.397  0.111 -0.911  0.282]
 [ 0.918 -0.046  0.395 -0.052]
 [ 0.     0.     0.     1.   ]]




Finding correspondences between 8 frame pairs
Running RoMa feature matching...




100% 8/8 [00:01<00:00,  5.23it/s]
Found correspondences: 8 pairs, shape: (4359, 4)
Running RANSAC for robust correspondence estimation...


#optimizeGPU frames=10, #keyframes=33, #_frames=36
left000000 left000001 left000002 left000022 left000023 left000032 left000033 left000034 left000035 left000036 
global_corres=184909
maxNumResiduals / maxNumberOfImages = 8622409 / 10 = 862240
m_maxNumberOfImages*m_maxCorrPerImage = 10 x 40311 = 403110
m_solver->solve Time difference = 480.076[ms]


Frame left000036 processing complete




Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/bundletrack/keyframes.yml for frame left000036
Copied keyframes to /workspace/3d-object-reconstruction/data/output/retail_item/keyframes.yml


Feature Matching and Pose Estimation successful.
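
The `Before` and `After` matrices printed in the log above are 4x4 homogeneous transforms: a 3x3 rotation block plus a translation column. A quick way to read them is to compute how much each update rotated and shifted the pose. The sketch below does this for frame `left000025`, using values copied from the log. The helper name `pose_delta` is hypothetical and is not part of the pipeline.

```python
import math

def pose_delta(before, after):
    """Return (rotation change in degrees, translation change) between two 4x4 poses."""
    # trace(R_after @ R_before^T) equals the element-wise product sum of the two 3x3 blocks.
    trace = sum(after[i][j] * before[i][j] for i in range(3) for j in range(3))
    # The logged values are rounded to 3 decimals, so clamp before calling acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    angle_deg = math.degrees(math.acos(cos_angle))
    # Euclidean distance between the two translation columns.
    shift = math.dist([row[3] for row in before[:3]], [row[3] for row in after[:3]])
    return angle_deg, shift

# Matrices copied from the log entry for frame left000025.
before = [
    [0.997, -0.009, 0.079, 0.002],
    [0.009, 1.0, 0.001, 0.007],
    [-0.079, -0.0, 0.997, -0.269],
    [0.0, 0.0, 0.0, 1.0],
]
after = [
    [0.01, -0.991, -0.132, 0.056],
    [1.0, 0.008, 0.014, 0.019],
    [-0.013, -0.132, 0.991, -0.267],
    [0.0, 0.0, 0.0, 1.0],
]

angle_deg, shift = pose_delta(before, after)
print(f"rotation change: {angle_deg:.1f} deg, translation change: {shift:.3f}")
```

For this frame the update rotates the pose by roughly 90 degrees. A change this large may be worth checking against the pose visualization in the next step.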


Let's visualize the estimated camera poses.
+ The `scale` parameter controls the size of the camera frustums in the visualization; smaller values such as 0.01 make the cameras appear smaller.
+ The `eps` parameter controls point cloud downsampling during visualization; smaller values such as 0.01 preserve more detail but may render more slowly.
+ The visualization shows camera positions and orientations as colored coordinate axes (red = X, right; green = Y, up; blue = Z, look-at), with yellow lines marking the camera viewing frustums.
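
To make the two parameters above more concrete, here is a minimal sketch of what a pose visualizer typically does with them. It is an illustration under stated assumptions, not the implementation of `vis_camera_poses`. It assumes each pose maps camera coordinates into the scene frame, that `scale` is the drawn axis length, and that `eps` is a voxel edge length. Both function names are hypothetical.

```python
import math

def camera_axis_segments(pose, scale):
    """Return {'x'|'y'|'z': (origin, tip)} for the three axes of a 4x4 pose."""
    origin = [pose[i][3] for i in range(3)]
    segments = {}
    for axis, name in enumerate("xyz"):
        # Column `axis` of the rotation block is that camera axis in the scene frame.
        tip = [origin[i] + scale * pose[i][axis] for i in range(3)]
        segments[name] = (origin, tip)
    return segments

def voxel_downsample(points, eps):
    """Keep the first point that falls into each cubic cell of edge length eps."""
    kept = {}
    for point in points:
        cell = tuple(math.floor(c / eps) for c in point)
        kept.setdefault(cell, point)
    return list(kept.values())

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(camera_axis_segments(identity, scale=0.03)["z"])

cloud = [(0.0, 0.0, 0.0), (0.001, 0.002, 0.0), (0.5, 0.0, 0.0)]
print(len(voxel_downsample(cloud, eps=0.01)))
```

Under these assumptions, a larger `eps` merges more neighbouring points into one cell, which is why smaller values keep more detail at a higher rendering cost.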



In [17]:
scene = vis_camera_poses(
    os.path.join(config['bundletrack']['debug_dir'], 'keyframes.yml'),
    track_dataset,
    scale=0.03,
    eps=0.01,
)
scene.show()

Selected 34 keyframes


Now that we have extracted pose information and keyframes, we can train our SDF model. 

In [18]:
# Run SDF training 
tracker.run_global_sdf(nerf_dataset)

Running global SDF optimization...
Found 34 keyframes for SDF optimization
Frame IDs for SDF: ['left000000', 'left000001', 'left000002', 'left000003', 'left000004', 'left000005', 'left000006', 'left000007', 'left000008', 'left000009', 'left000010', 'left000011', 'left000012', 'left000013', 'left000014', 'left000015', 'left000016', 'left000017', 'left000018', 'left000019', 'left000020', 'left000021', 'left000022', 'left000023', 'left000026', 'left000027', 'left000028', 'left000029', 'left000030', 'left000032', 'left000033', 'left000034', 'left000035', 'left000036']
Computing scene bounds...
Scene normalization: scale=6.67771424131479, translation=[-0.04418149  0.00097321 -0.05062355]
Initializing SDF model...
Octree voxel dilate_radius:1


6.67771424131479 [-0.04418149  0.00097321 -0.05062355] 584


Not optimizing poses
Training SDF model...


level 0, resolution: 16
level 1, resolution: 20
level 2, resolution: 24
level 3, resolution: 28
level 4, resolution: 34
level 5, resolution: 41
level 6, resolution: 49
level 7, resolution: 59
level 8, resolution: 71
level 9, resolution: 85
level 10, resolution: 102
level 11, resolution: 123
level 12, resolution: 148
level 13, resolution: 177
level 14, resolution: 213
level 15, resolution: 256
optimize poses
sc_factor 6.67771424131479
translation [-0.04418149  0.00097321 -0.05062355]
Using mask


denoise cloud
Denoising rays based on octree cloud
bad_mask#=159
Iter: 0, valid_samples: 655326/655360, valid_rays: 2048/2048, loss: 22.2348900, rgb_loss: 1.2094061, rgb0_loss: 0.0000000, fs_rgb_loss: 0.0000000, depth_loss: 0.0000000, depth_loss0: 0.0000000, fs_loss: 20.5530891, point_cloud_loss: 0.0000000, point_cloud_normal_loss: 0.0000000, sdf_loss: 0.3953925, eikonal_loss: 0.0000000, variation_loss: 0.0000000, truncation(meter): 0.0040000, pose_reg: 0.0000000, reg_features: 0.0770013, 

train progress 0/3001


rays torch.Size([3532643, 12]) cuda:0


train progress 300/3001
train progress 600/3001
train progress 900/3001
train progress 1200/3001
train progress 1500/3001


Loading next batch images
Using mask


denoise cloud
Denoising rays based on octree cloud
bad_mask#=193


rays torch.Size([3546583, 12]) cuda:0


train progress 1800/3001
train progress 2100/3001
train progress 2400/3001
train progress 2700/3001
train progress 3000/3001
Extracting mesh from SDF model...
query_pts:torch.Size([3375000, 3]), valid:995427
Running Marching Cubes
done V:(32374, 3), F:(64736, 3)


Saved checkpoints at /workspace/3d-object-reconstruction/data/output/retail_item/model_latest.pth


Saved cleaned mesh to /workspace/3d-object-reconstruction/data/output/retail_item/mesh_cleaned.obj
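
Two parts of the training log above can be reproduced with a few lines of arithmetic. The printed grid resolutions (16 up to 256 over 16 levels) are consistent with geometric growth rounded up to the next integer. The `Scene normalization` line reports a scale and translation used to map the scene into a normalized volume. The sketch below is inferred from the printed numbers only. The convention `normalized = (point + translation) * scale` is an assumption, and all function names are hypothetical.

```python
import math

def grid_resolutions(base=16, finest=256, num_levels=16):
    """Geometric resolution schedule, rounded up per level."""
    growth = (finest / base) ** (1.0 / (num_levels - 1))
    # round() guards against float noise pushing an exact value just past an integer.
    return [math.ceil(round(base * growth ** level, 6)) for level in range(num_levels)]

def normalize(point, scale, translation):
    """Assumed convention: shift by the translation, then multiply by the scale."""
    return [(p + t) * scale for p, t in zip(point, translation)]

def denormalize(point, scale, translation):
    """Inverse of normalize(): back to the original scene coordinates."""
    return [p / scale - t for p, t in zip(point, translation)]

print(grid_resolutions())

# Values copied from the "Scene normalization" log line.
scale = 6.67771424131479
translation = [-0.04418149, 0.00097321, -0.05062355]

point = [0.05, 0.0, 0.25]  # arbitrary example point, not taken from the data
restored = denormalize(normalize(point, scale, translation), scale, translation)
print(restored)
```

With a scale of about 6.68, one unit in the normalized volume corresponds to roughly 0.15 units in the original scene coordinates.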


With the SDF model trained, we can proceed to texture baking. If needed (for example, for 4K input images), the texture-baking scale factor can be adjusted so that the images are used at their original resolution.

In [None]:
tracker.run_texture_bake(texture_dataset)

Running texture baking...
Found 34 keyframes for texture baking
Frame IDs for texture baking: ['left000000', 'left000001', 'left000002', 'left000003', 'left000004', 'left000005', 'left000006', 'left000007', 'left000008', 'left000009', 'left000010', 'left000011', 'left000012', 'left000013', 'left000014', 'left000015', 'left000016', 'left000017', 'left000018', 'left000019', 'left000020', 'left000021', 'left000022', 'left000023', 'left000026', 'left000027', 'left000028', 'left000029', 'left000030', 'left000032', 'left000033', 'left000034', 'left000035', 'left000036']
Computing scene bounds for texture baking...
Scene normalization: scale=6.67771424131479, translation=[-0.04418149  0.00097321 -0.05062355]
Initializing SDF model for texture baking...
Octree voxel dilate_radius:1
Not optimizing poses
Loading pre-trained SDF weights from /workspace/3d-object-reconstruction/data/output/retail_item/model_latest.pth


level 0, resolution: 16
level 1, resolution: 20
level 2, resolution: 24
level 3, resolution: 28
level 4, resolution: 34
level 5, resolution: 41
level 6, resolution: 49
level 7, resolution: 59
level 8, resolution: 71
level 9, resolution: 85
level 10, resolution: 102
level 11, resolution: 123
level 12, resolution: 148
level 13, resolution: 177
level 14, resolution: 213
level 15, resolution: 256
optimize poses
sc_factor 6.67771424131479
translation [-0.04418149  0.00097321 -0.05062355]
Reloading from /workspace/3d-object-reconstruction/data/output/retail_item/model_latest.pth


Using optimized poses from SDF training
Loading mesh from /workspace/3d-object-reconstruction/data/output/retail_item/mesh_cleaned.obj


ckpt keys:  dict_keys(['global_step', 'model', 'optimizer', 'embed_fn', 'embeddirs_fn', 'pose_array', 'feature_array', 'octree'])


Applied mesh smoothing with 2 iterations
Saved smoothed mesh to /workspace/3d-object-reconstruction/data/output/retail_item/mesh_smoothed.obj
Baking texture with resolution 2048


Texture: Initial LOOP FOR storing angles for each face


  3% 1/34 [00:27<15:06, 27.47s/it]

We now have our textured 3D asset. Let's take a look at it!

In [None]:
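# Load the baked texture image and the textured mesh.
# process=False asks trimesh not to post-process the mesh on load.
im = Image.open(f'{output_data_path}/material_0.png')
mesh = trimesh.load(f'{output_data_path}/textured_mesh.obj', process=False)
# Attach the texture image to the mesh.
tex = trimesh.visual.TextureVisuals(image=im)
mesh.visual.texture = tex
view_mesh = mesh
# Set the material's diffuse color to opaque white.
material = mesh.visual.material
material.diffuse = [255, 255, 255, 255]
mesh.show()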

In [None]:
# Let us also view the mesh rendered with its reconstructed lighting.
K,H,W = nerf_dataset.K, nerf_dataset.H, nerf_dataset.W
tRes = 800
scale = tRes/max(H,W)
H,W = int(H*scale), int(W*scale)
cam_K = K[:2]*scale
try:
    renderer = ModelRendererOffscreen([],cam_K,H,W)
    renderer.add_mesh(mesh)
    colors,depths = renderer.render_fixed_cameras()
except Exception:
    # Retry once with a fresh renderer if the first attempt raises.
    renderer = ModelRendererOffscreen([],cam_K,H,W)
    renderer.add_mesh(mesh)
    colors,depths = renderer.render_fixed_cameras()

plt.figure()
for i in range(8):
    plt.subplot(2,4,i+1)
    plt.imshow(colors[i])
    plt.axis('off')
plt.tight_layout()
plt.show()
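
The cell above resizes the render so that its longer side is `tRes` pixels and scales the first two rows of `K` by the same factor. The sketch below restates that step as a standalone function, since the same adjustment applies whenever images are resized. It keeps the full 3x3 matrix and mirrors the notebook's `int()` truncation. The function name and the example intrinsics are hypothetical.

```python
def scale_intrinsics(K, height, width, target_res):
    """Resize so the longer image side equals target_res and scale K to match."""
    scale = target_res / max(height, width)
    # fx, fy, cx and cy live in the first two rows; the last row [0, 0, 1] is unchanged.
    scaled_K = [
        [value * scale for value in K[0]],
        [value * scale for value in K[1]],
        list(K[2]),
    ]
    return scaled_K, int(height * scale), int(width * scale), scale

# Made-up intrinsics for a 1600x1200 image, for illustration only.
K = [[1200.0, 0.0, 800.0], [0.0, 1200.0, 600.0], [0.0, 0.0, 1.0]]
scaled_K, new_h, new_w, factor = scale_intrinsics(K, height=1200, width=1600, target_res=800)
print(factor, new_h, new_w)
print(scaled_K)
```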

## Summary and Next Steps

### **Workflow Summary**

Congratulations! You have successfully completed the end-to-end 3D object reconstruction pipeline. Here's what we accomplished:

#### **Pipeline Achievements**
1. **✅ Depth Estimation**: Generated accurate depth maps from stereo pairs using FoundationStereo's transformer-based architecture
2. **✅ Object Segmentation**: Created consistent object masks across all frames using SAM2's video tracking capabilities
3. **✅ Pose Estimation**: Estimated and optimized camera poses for use in the reconstruction step
4. **✅ Neural Reconstruction**: Trained a neural field (SDF) to capture the object's 3D geometry
5. **✅ Texture Baking**: Generated high-resolution texture maps and exported production-ready 3D assets

#### **Generated Assets**
Your reconstruction pipeline has produced the following outputs in the experiment directory:
- **`textured_mesh.obj`**: Complete 3D mesh with UV mapping
- **`material_0.png`**: High-resolution texture map
- **`keyframes.yml`**: Optimized camera poses for each keyframe
- **`depth/`**: Dense depth maps for all input frames
- **`masks/`**: Object segmentation masks for background removal
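
Before moving on, it can help to confirm that the main outputs exist. The sketch below checks an output directory for the files named in the logs above (`textured_mesh.obj`, `material_0.png`, `keyframes.yml`). The helper name is hypothetical, and the locations of the `depth/` and `masks/` folders are not assumed here; add them to the list if they sit in the same directory in your setup.

```python
from pathlib import Path

# File names taken from the notebook cells and logs above.
EXPECTED_OUTPUTS = ["textured_mesh.obj", "material_0.png", "keyframes.yml"]

def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).exists()]

# Output directory as printed in the logs; adjust to your own experiment path.
missing = missing_outputs("/workspace/3d-object-reconstruction/data/output/retail_item")
print("missing:", missing)
```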

#### **Export to USD**
- To load these assets directly into Omniverse, we provide a set of converters that convert various file types to USD.
- [USD Converter using isaaclab.sim.converters](https://isaac-sim.github.io/IsaacLab/main/source/api/lab/isaaclab.sim.converters.html)

### **Integration with other workflows**

The generated 3D assets are immediately ready for integration into various platforms and workflows:

**Applications:**
- **Robotic Manipulation**: Use reconstructed objects for grasping and manipulation training
- **Sim2Real Transfer**: Bridge the gap between simulation and real-world deployment
- **Digital Twins**: Create accurate digital replicas of real-world objects
- **Computer Vision Training**: Generate labeled datasets with your reconstructed objects
- **Domain Adaptation**: Create variations of real objects for robust model training
- **Rare Object Simulation**: Generate synthetic data for objects that are difficult to collect

**Further Reading**
- [Object Detection Synthetic Data Generation using isaacsim.replicator.object](https://docs.isaacsim.omniverse.nvidia.com/4.5.0/replicator_tutorials/tutorial_replicator_object.html)
  
### **Advanced Customization Options**

#### **Quality Optimization**
- **Higher Resolution**: Modify `texture_bake.downscale` to `1` for full-resolution texture baking
- **Extended Training**: Increase the number of SDF training iterations (the run above trained for about 3,000) for improved reconstruction quality
- **Custom Camera Intrinsics**: Adapt the pipeline for different camera setups
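
Since the notebook reads settings from a nested `config` dictionary (for example `config['bundletrack']['debug_dir']`), an option written as `texture_bake.downscale` can be set with a small helper. This is a sketch only. The helper name is hypothetical and the starting value of `downscale` below is a placeholder, not the pipeline's default.

```python
import copy

def with_override(config, dotted_key, value):
    """Return a copy of config with the dotted key (e.g. 'a.b') set to value."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate sections when they are missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated

# Placeholder config; the real one is loaded earlier in the notebook.
example_config = {"texture_bake": {"downscale": 4}}
full_res_config = with_override(example_config, "texture_bake.downscale", 1)
print(full_res_config)
```

Returning a copy leaves the original settings untouched, which makes it easier to compare runs with different options.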

#### **Experiment with Your Own Data**
1. **Capture Guidelines**: Follow the data collection best practices demonstrated in Step 1
2. **Camera Calibration**: Ensure accurate intrinsic parameters for your stereo setup
3. **Lighting Conditions**: Experiment with different lighting setups for optimal results

### **Conclusion**

You have now walked through the full 3D object reconstruction pipeline and gained hands-on experience with state-of-the-art computer vision techniques. The generated assets can be integrated into your robotics, gaming, or AI workflows.

The combination of FoundationStereo, SAM2, and BundleSDF provides a robust foundation for creating high-quality 3D content from real-world objects, bridging the gap between physical and digital worlds.

