# 3D Object Reconstruction using NVBundleSDF 

## Overview 

This guide demonstrates an end-to-end real2sim workflow for reconstructing 3D objects from stereo video input using state-of-the-art computer vision and neural rendering techniques. The pipeline combines:

- **[FoundationStereo](https://arxiv.org/abs/2501.09898)** for depth estimation from stereo image pairs
- **[SAM2](https://arxiv.org/abs/2408.00714)** for object segmentation across the entire video
- **[BundleSDF](https://arxiv.org/abs/2303.14158)** for real-world-scale textured mesh generation

<div align="center">
    <img src="../data/docs/pipeline_overview.png" alt="Pipeline Overview" title="3D Object Reconstruction Workflow">
</div>

## Learning Objectives

By the end of this notebook, you'll learn how to perform 3D object reconstruction by:

- **Data Preparation**: Importing stereo input data and pre-processing it for optimal reconstruction
- **Depth Estimation**: Using FoundationStereo to generate accurate depth maps from stereo pairs
- **Object Segmentation**: Employing SAM2 to segment and track objects across all frames
- **3D Object Reconstruction**: Leveraging BundleSDF for pose estimation, SDF training, mesh extraction, and texture baking
- **Asset Creation**: Creating textured 3D assets ready for downstream applications such as digital content creation, dataset simulation and object pose estimation.


## Prerequisites

### System Requirements
- **GPU**: NVIDIA GPU with CUDA support (minimum requirements: Compute Capability 7.0 with at least 24GB VRAM)
- **Memory**: 32GB+ RAM recommended
- **Storage**: 100GB+ free space recommended
- **OS**: Ubuntu 22.04+
- **Software**: 
  - [Docker](https://docs.docker.com/engine/install/) with [nvidia-container-runtime](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) enabled
  - [Docker Compose](https://docs.docker.com/compose/install/)
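
The GPU requirements above can be checked before launching the container. The sketch below is one possible approach, not part of the workflow itself: it parses the CSV output of `nvidia-smi` and compares each GPU against the stated minimums. It assumes a driver recent enough to support the `compute_cap` query field, and it uses a slightly tolerant memory threshold (24,000 MiB) because drivers typically report a little less than the nominal capacity. The helper names and the sample report string are hypothetical.

```python
import subprocess

MIN_COMPUTE_CAPABILITY = 7.0  # minimum stated in the requirements above
MIN_VRAM_MIB = 24_000         # tolerant threshold for a nominal 24 GB card (assumption)


def parse_gpu_report(csv_text):
    """Parse `name, compute_cap, memory.total` rows (CSV, no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, compute_cap, memory_mib = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "compute_cap": float(compute_cap),
            "memory_mib": int(memory_mib),
        })
    return gpus


def meets_requirements(gpu):
    """Return True when a parsed GPU entry satisfies both minimums."""
    return (gpu["compute_cap"] >= MIN_COMPUTE_CAPABILITY
            and gpu["memory_mib"] >= MIN_VRAM_MIB)


def query_gpus():
    """Query the local GPUs through nvidia-smi (requires an NVIDIA driver)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,compute_cap,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_report(output)


# Hypothetical report used for illustration; call query_gpus() on a real system.
sample_report = "NVIDIA RTX A6000, 8.6, 49140\nTesla T4, 7.5, 15360"
for gpu in parse_gpu_report(sample_report):
    status = "OK" if meets_requirements(gpu) else "below minimum"
    print(f"{gpu['name']}: {status}")
```

On a machine with an NVIDIA driver installed, replacing `parse_gpu_report(sample_report)` with `query_gpus()` checks the actual hardware.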

### Initial Setup
In the cell below, we install additional dependencies for video codecs and GStreamer plugins, and build the SAM2 extension that we'll use later in the pipeline.

In [None]:
# Install additional dependencies for video processing and build SAM2 extension
print("Installing additional dependencies...")
import subprocess, os, pathlib, sys
DEEPSTREAM_SCRIPT = pathlib.Path("/opt/nvidia/deepstream/deepstream/user_additional_install.sh")
if DEEPSTREAM_SCRIPT.exists():
    subprocess.check_call(["bash", str(DEEPSTREAM_SCRIPT)])
SAM2_ROOT = pathlib.Path(os.getenv("SAM2_ROOT", "/sam2"))
if SAM2_ROOT.exists():
    subprocess.check_call([sys.executable, "setup.py", "build_ext", "--inplace"], cwd=SAM2_ROOT)
else:
    print("⚠️  SAM2 root not found – skipping C++ extension build.")

print("Setup complete!")

## Step 1: Data Preparation and Guidelines

The notebook comes with a sample dataset and a base configuration file that allows users to optimize their results. This section covers suitable object types and dataset capturing instructions.

### Recommended Object Types

For optimal reconstruction results, choose objects with the following characteristics:

- **Rigid, Non-Deformable Objects**: The workflow performs best with objects that maintain a fixed shape across frames
- **Rich Surface Texture**: High texture variance enables reliable feature detection and matching, which is critical for accurate reconstruction
- **Asymmetrical Objects**: Distinct content on different faces helps avoid ambiguity during feature matching
- **Opaque Materials**: Avoid transparent or translucent materials (glass, clear plastic) as they interfere with depth and feature consistency

### Dataset Capturing Guidelines

To achieve optimal 3D reconstruction quality, follow these steps when capturing object images. The goal is to ensure complete coverage of the object's geometry while maintaining consistency in framing and orientation.

#### 1. Position the Object
- **Centering**: Place the object in the center of the camera frame
- **Size**: The object should occupy roughly 45-65% of the image area - large enough to capture details while providing context
- **Lighting**: 
  - Use even, diffused lighting to minimize harsh shadows and reflections
  - Avoid backlighting or direct overhead lights that create glare or overexposure
  - Ensure the object is well-lit from multiple directions to reveal surface details
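
As a rough way to check the size guideline, the fraction of the frame covered by the object's bounding box can be compared against the 45-65% range. This is only a sketch under a simplifying assumption: a bounding box overestimates the true object area for non-rectangular shapes, so treat the result as an upper bound. The function names are illustrative.

```python
def bbox_area_fraction(bbox_w, bbox_h, image_w, image_h):
    """Fraction of the image area covered by an axis-aligned bounding box."""
    return (bbox_w * bbox_h) / (image_w * image_h)


def framing_hint(fraction, low=0.45, high=0.65):
    """Compare a coverage fraction against the suggested 45-65% range."""
    if fraction < low:
        return "too small: move the object closer to the camera"
    if fraction > high:
        return "too large: move the object farther from the camera"
    return "within the suggested range"


# Example: a 2600x2500 px box in a 4000x3000 px frame.
fraction = bbox_area_fraction(2600, 2500, 4000, 3000)
print(f"coverage: {fraction:.1%} ({framing_hint(fraction)})")
```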

#### 2. Capture the First Set (Primary Faces)
- Begin image capture while slowly rotating the object horizontally in one direction (clockwise or counterclockwise)
- Cover approximately 360 degrees of rotation
- This step should expose four visible faces of the object: front, back, and both sides
- Capture multiple overlapping frames to ensure robust feature matching across angles

#### 3. Capture the Second Set (Remaining Faces)
- Flip the object to reveal previously hidden faces (typically top and bottom)
- Continue capturing images while rotating slightly, following the same pattern as the previous step
- Ensure full coverage of the remaining two faces with overlapping views for consistent alignment

The animation below demonstrates the full object rotation and flip process, showing how to cover all six faces in a consistent, controlled manner.

<div align="center">
    <img src="../data/docs/adv.gif" alt="Capture Example" title="Capture Example" width="600">
</div>

### Manual Capture Guidelines (Alternative Method)

If a turntable is not available, you can capture the data manually by following these guidelines:

#### 1. Scan-like Movement
- Hold the object in your hands and rotate it manually in front of the camera
- Treat the process like scanning the object: gradually expose all surfaces to the camera
- Rotate the object slowly and smoothly, ensuring sufficient visual overlap between consecutive frames

#### 2. Maximize Visible Surface Area
- Ensure the camera can see a large portion of the object's surface in each frame
- Avoid fast or jerky movements - slower rotations help the system track features accurately
- Verify that all six faces of the object (front, back, sides, top, and bottom) are captured clearly

#### 3. Maintain Consistent Distance
- Keep the object at a consistent distance from the camera throughout the capture process
- Avoid moving the object significantly closer or farther during capture

The animation below demonstrates effective manual object capture technique.

<div align="center">
    <img src="../data/docs/input_dino.gif" alt="Manual Capture Example" title="Manual Capture Example" width="600">
</div>
  
## Comprehensive Camera Comparison

* Note that the following information is for reference only. Please check your camera's parameters for accurate 3D object reconstruction.

| **Specification** | **ZED 2i Camera (Stereolabs)** | **QooCam EGO 3D Camera (Kandao)** | **Hawk Stereo Camera (Leopard Imaging)** | **ZED Mini Camera (Stereolabs)** |
|-------------------|--------------------------------|-----------------------------------|------------------------------------------|----------------------------------|
| **Manufacturer** | Stereolabs | Kandao | Leopard Imaging | Stereolabs |
| **Resolution** | 2K stereo | 4K image / 2K video | Industrial grade | HD stereo |
| **Form Factor** | Desktop setup | Lightweight, portable | Compact industrial | Ultra-compact, lightweight |
| **Field of View** | Wide | Standard | Standard | Standard |
| **Target Audience** | Developers, robotics | Consumer/prosumer | Industrial, NVIDIA ecosystem | Mixed-reality, robotics |
| **Best Use Case** | Desktop capture, robotics | Quick field captures, handheld | Industrial applications, Isaac ROS | Mixed-reality, compact robotics |
| **Technical Specifications** | | | | |
| **Focal Length (fx)** | 1070.800 | 3079.6 | 958.35 | 522.38 |
| **Focal Length (fy)** | 1070.700 | 3075.1 | 956.18 | 522.38 |
| **Principal Point (cx)** | 1098.950 | 2000.0 | 959.36 | 644.88 |
| **Principal Point (cy)** | 649.044 | 1500.0 | 590.95 | 356.03 |
| **Baseline (m)** | 0.1198 | 0.0658 | 0.1495 | 0.12 |
| **Eye Separation** | Standard | Standard | Standard | 6.5cm |
| | | | | |
| **Strengths** | • Medium-resolution stereo capture (2K)<br>• Wide field of view<br>• Robust SDK<br>• Good for objects without detailed text | • Lightweight and portable<br>• User-friendly design<br>• Built-in display for instant review<br>• 4K image capture capability<br>• Ideal for handheld workflows | • Compact industrial-grade system<br>• Accurate calibration<br>• Widely used in Isaac ROS<br>• Developed by NVIDIA camera team<br>• Professional-grade reliability | • Ultra-compact design<br>• Built-in 6DoF IMU<br>• HD depth sensing with Ultra mode<br>• Visual-inertial technology<br>• Aluminum frame for robustness<br>• USB Type-C connectivity |
| **Weaknesses** | • Capture resolution not very high<br>• Cannot capture detailed texture<br>• Motion blur in video capturing<br>• Requires desktop for capturing | • No SDK for developers<br>• More consumer-focused<br>• Limited customization options | • Capture resolution not very high<br>• May not capture detailed texture<br>• Requires additional setup/integration<br>• Needs Jetson as additional device | • Smaller baseline (6.5cm)<br>• HD resolution (lower than 2K/4K)<br>• Requires powerful GPU for AR applications<br>• Limited depth range vs larger cameras |
| | | | | |
| **Setup Requirements** | | | | |
| **Additional Hardware** | Desktop/laptop | None (standalone) | Jetson device | Desktop/laptop |
| **Software Requirements** | ZED SDK | Companion mobile app | Custom integration, Isaac ROS | ZED SDK |
| **Minimum System (SDK)** | Standard desktop | N/A | Jetson platform | Dual-core 2.3GHz, 4GB RAM, USB 3.0 |
| **Recommended For** | • Desktop Development<br>• Robotics Integration<br>• SDK-based prototyping | • Field Data Collection<br>• Portable workflows<br>• Quick capture sessions | • Industrial Applications<br>• Isaac ROS projects<br>• Professional deployments | • Mixed-Reality Applications<br>• Compact Robotics<br>• Motion Tracking<br>• Space-constrained setups |
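
The `fx`, `fy`, `cx`, and `cy` values in the table are the entries of the pinhole intrinsic matrix. The configuration cell later in this notebook passes them as a flat, row-major list `[fx, 0, cx, 0, fy, cy, 0, 0, 1]`. The sketch below builds that list from the four parameters and shows how the matrix projects a 3D point in the camera frame to pixel coordinates. The helper names are illustrative, and lens distortion is ignored.

```python
def intrinsic_matrix(fx, fy, cx, cy):
    """Build the 3x3 pinhole intrinsic matrix as nested lists."""
    return [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]


def flatten_row_major(matrix):
    """Flatten a 3x3 matrix to [fx, 0, cx, 0, fy, cy, 0, 0, 1] order."""
    return [value for row in matrix for value in row]


def project_point(K, point_xyz):
    """Project a camera-frame point (X, Y, Z) with Z > 0 to pixel (u, v)."""
    X, Y, Z = point_xyz
    u = K[0][0] * X / Z + K[0][2]
    v = K[1][1] * Y / Z + K[1][2]
    return u, v


# QooCam EGO values from the table above.
K = intrinsic_matrix(fx=3079.6, fy=3075.1, cx=2000.0, cy=1500.0)
print(flatten_row_major(K))

# A point on the optical axis projects to the principal point (cx, cy).
print(project_point(K, (0.0, 0.0, 0.5)))
```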

## Running Time Reference

Below are estimated running times for each stage of the 3D object reconstruction pipeline when using an NVIDIA RTX A6000 GPU with a dataset of 36 stereo frames at 4K (4000x3000) resolution:

| Pipeline Stage | Estimated Time | Key Factors Affecting Performance |
|----------------|----------------|----------------------------------|
| **Initial Setup** | 1-2 minutes | Package installation and extension compilation |
| **FoundationStereo Depth Estimation** | 1-2 minutes | Frame count, resolution |
| **SAM2 Object Segmentation** | 25 seconds | Frame count, resolution |
| **Object Pose Tracking** | 3-4 minutes | Frame count, resolution |
| **SDF Training** | 3-4 minutes | Training iterations, resolution, number of keyframes |
| **Texture Baking** | 22-23 minutes | Texture resolution, mesh complexity, image resolution |
| **Total Pipeline** | **31-32 minutes** | End-to-end processing time |

**Notes:**
- Performance scales approximately linearly with frame count
- Higher resolution inputs increase processing time, particularly for Neural SDF training and texture generation
- Complex objects with intricate geometry or challenging textures may require longer processing times
- Using lower downscaling factors for higher quality outputs will increase processing time

These estimates are based on benchmarks using the RTX A6000 GPU. Performance may vary based on system configuration, input data characteristics, and specific parameter settings.
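
For planning purposes, the table can be turned into a back-of-the-envelope estimate for other frame counts. The sketch below takes the midpoint of each range, keeps the setup time fixed, and scales every other stage linearly with frame count, following the first note above. Treating all stages as linear is a simplifying assumption: SDF training and texture baking also depend on iteration count and resolution, so the result is a rough guide only.

```python
REFERENCE_FRAMES = 36  # frame count used for the benchmark table

# Midpoints of the ranges in the table, in minutes.
REFERENCE_MINUTES = {
    "Initial Setup": 1.5,
    "FoundationStereo Depth Estimation": 1.5,
    "SAM2 Object Segmentation": 25 / 60,
    "Object Pose Tracking": 3.5,
    "SDF Training": 3.5,
    "Texture Baking": 22.5,
}

# Stages assumed not to depend on the number of frames.
FIXED_STAGES = {"Initial Setup"}


def estimate_minutes(frame_count):
    """Scale the reference timings linearly with frame count."""
    factor = frame_count / REFERENCE_FRAMES
    return {
        stage: minutes if stage in FIXED_STAGES else minutes * factor
        for stage, minutes in REFERENCE_MINUTES.items()
    }


estimate = estimate_minutes(72)
for stage, minutes in estimate.items():
    print(f"{stage:<36} {minutes:6.1f} min")
print(f"{'Total':<36} {sum(estimate.values()):6.1f} min")
```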


### Configuration and Data Setup

Now let's set up our experiment configuration and load the sample dataset.


In [None]:
# Import required libraries
import os
import uuid
import yaml 
import ipywidgets
import shutil
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import JSON, display
from PIL import Image 
import trimesh 
from pathlib import Path
import json

# Import custom modules for 3D reconstruction pipeline
from nvidia.objectreconstruction.utils.visualization import (
    create_stereo_viewer, create_depth_viewer, create_mask_viewer, 
    create_bbox_widget, create_3d_viewer
)
from nvidia.objectreconstruction.networks.foundationstereo import run_depth_estimation
from nvidia.objectreconstruction.networks.sam2infer import run_mask_extraction
from nvidia.objectreconstruction.dataloader import ReconstructionDataLoader
from nvidia.objectreconstruction.networks import NVBundleSDF
from nvidia.objectreconstruction.networks import ModelRendererOffscreen, vis_camera_poses

def pretty_print_config(obj, title=None):
    """Pretty print configuration with custom object handling"""
    
    def json_serializer(obj):
        """Custom JSON serializer for non-serializable objects"""
        if isinstance(obj, Path):
            return str(obj)
        elif hasattr(obj, '__dict__'):
            return obj.__dict__
        elif hasattr(obj, '_asdict'):  # namedtuples
            return obj._asdict()
        else:
            return str(obj)
    
    if title:
        print(f"{title}")
        print("=" * (len(title) + 4))
    
    # Convert to JSON-serializable format
    json_str = json.dumps(obj, default=json_serializer, indent=2, sort_keys=True)
    json_obj = json.loads(json_str)
    
    # Display using IPython's JSON widget with enhanced styling
    return JSON(json_obj, expanded=True)

print("All libraries imported successfully!")

In [None]:
# Load the experiment configuration file
config_file_path = '/workspace/3d-object-reconstruction/data/configs/base.yaml'

with open(config_file_path, 'r') as f:
    config = yaml.safe_load(f)

# Load the input dataset 
input_data_path = '/workspace/3d-object-reconstruction/data/samples/retail_item/'

# Setup the experiment directory
output_data_path = Path('/workspace/3d-object-reconstruction/data/output/retail_item/')

# Check if output directory exists and ask user for action
if output_data_path.exists():
    clear_existing = input(f"Output directory '{output_data_path}' already exists.\nClear existing contents? (y/n): ").lower().strip()
    
    if clear_existing in ['y', 'yes']:
        print("🗑️  Clearing existing output directory...")
        shutil.rmtree(output_data_path)
        print("✅ Existing contents cleared!")
    elif clear_existing in ['n', 'no']:
        print("📁 Keeping existing contents...")
    else:
        print("⚠️  Invalid input. Keeping existing contents by default...")

# Create output directory and copy input frames
output_data_path.mkdir(parents=True, exist_ok=True)
shutil.copytree(input_data_path, output_data_path, dirs_exist_ok=True)

# Update configuration to point to experiment directory
config['workdir'] = str(output_data_path)
config['bundletrack']['debug_dir'] = str(output_data_path)
config['nerf']['save_dir'] = str(output_data_path)

# Configure camera intrinsics and baseline for the sample dataset
# The example dataset uses QooCam with the following specifications:
# Intrinsic matrix format: [fx, 0, cx, 0, fy, cy, 0, 0, 1]
# Baseline: distance between stereo camera lenses in meters
config['camera_config']['intrinsic'] = [3079.6, 0, 2000.0, 0, 3075.1, 1500.01, 0, 0, 1]
# The example dataset uses a step size of 1, which means we process all frames.
config['camera_config']['step'] = 1 
config['foundation_stereo']['intrinsic'] = config['camera_config']['intrinsic']
config['foundation_stereo']['baseline'] = 0.0657696127  # 65.77mm baseline

print(f"Configuration loaded successfully!")
print(f"Input data path: {input_data_path}")
print(f"Output data path: {output_data_path}")
print(f"Camera intrinsics configured for QooCam")

### Visualize Input Data

Let's examine the sample stereo dataset to understand the input format and quality.


In [None]:
# Create an interactive stereo viewer to examine the input data
stereo_viewer = create_stereo_viewer(output_data_path)
display(stereo_viewer)

## Step 2: Depth Estimation using FoundationStereo

Now we'll extract depth information from our stereo image pairs using FoundationStereo. This step is crucial as it provides the 3D geometric foundation for our reconstruction pipeline.

### About FoundationStereo

FoundationStereo is a state-of-the-art neural network architecture designed for stereo depth estimation. It leverages:
- **Transformer-based feature extraction** for robust matching across stereo pairs
- **Multi-scale processing** to handle objects at different distances
- **Uncertainty estimation** to identify reliable depth predictions

The network takes left and right stereo images as input and produces dense depth maps with sub-pixel accuracy.
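
Stereo networks typically predict disparity (the horizontal pixel shift between the left and right images), which is converted to metric depth with the standard rectified-stereo relation `depth = fx * baseline / disparity`. This is why the configuration needs both the intrinsics and the baseline. The sketch below illustrates the relation with the QooCam values from this notebook; it is not the FoundationStereo wrapper's implementation, and the function name is illustrative.

```python
def disparity_to_depth(disparity_px, fx, baseline_m):
    """Return depth in meters, or None when the disparity is not positive."""
    if disparity_px <= 0:
        return None  # no valid match, or a point at infinity
    return fx * baseline_m / disparity_px


fx = 3079.6          # focal length in pixels (QooCam, from this notebook)
baseline_m = 0.0658  # distance between the two lenses in meters

# Larger disparities correspond to closer points.
for disparity in (200.0, 400.0, 800.0):
    depth = disparity_to_depth(disparity, fx, baseline_m)
    print(f"disparity {disparity:6.1f} px -> depth {depth:.3f} m")
```

Because depth is inversely proportional to disparity, an error in `fx` or the baseline scales the whole reconstruction, which is why accurate calibration matters for real-world-scale meshes.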

### Configuration Review

Let's examine the FoundationStereo configuration before running inference: 


In [None]:
foundationstereo_config = config['foundation_stereo']
display(pretty_print_config(foundationstereo_config, "FoundationStereo Configuration"))

### Run Depth Estimation

Now let's run FoundationStereo inference on our stereo pairs:


In [None]:
print("Starting FoundationStereo depth estimation...")
print("This may take several minutes depending on the number of frames and GPU performance.")

response = run_depth_estimation(
    config=foundationstereo_config, 
    exp_path=output_data_path, 
    rgb_path=output_data_path / 'left',
    depth_path=output_data_path / 'depth'
)

if response:
    print("✓ FoundationStereo depth estimation completed successfully!")
else:
    print("✗ Errors encountered during FoundationStereo inference.")
    print("Please check the configuration and input data before proceeding.")
    

### Visualize Depth Results

Let's examine the generated depth maps to verify the quality of the depth estimation:


In [None]:
# Create an interactive depth viewer
depth_viewer = create_depth_viewer(output_data_path)
display(depth_viewer)

## Step 3: Object Segmentation using SAM2

Next, we'll segment our target object across all frames using SAM2 (Segment Anything Model 2). This step is essential for isolating the object of interest from the background.

### About SAM2

SAM2 is Meta's advanced segmentation model that excels at:
- **Video object tracking**: Maintaining consistent segmentation across frames
- **Prompt-based segmentation**: Using minimal user input (like a bounding box) to identify objects
- **Temporal consistency**: Leveraging motion and appearance cues for robust tracking

The model requires only a single frame annotation (bounding box) and automatically propagates the segmentation to all other frames.
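
Later in this notebook, the drawing widget returns the box as `(x, y, width, height)`, while the SAM2 configuration is given corner coordinates `[x_min, y_min, x_max, y_max]`. The sketch below shows that conversion, with optional clamping to the image bounds. The clamping step and the function name are additions for illustration and are not part of the notebook's code.

```python
def xywh_to_xyxy(x, y, w, h, image_w=None, image_h=None):
    """Convert (x, y, width, height) to [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = x, y, x + w, y + h
    if image_w is not None:
        # Keep the box inside the horizontal image extent.
        x_min = max(0, min(x_min, image_w))
        x_max = max(0, min(x_max, image_w))
    if image_h is not None:
        # Keep the box inside the vertical image extent.
        y_min = max(0, min(y_min, image_h))
        y_max = max(0, min(y_max, image_h))
    return [x_min, y_min, x_max, y_max]


print(xywh_to_xyxy(1200, 800, 1500, 1400))
print(xywh_to_xyxy(3900, 100, 200, 50, image_w=4000, image_h=3000))
```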

### Interactive Bounding Box Selection

Use the interactive widget below to draw a bounding box around your target object in the first frame: 

In [None]:
%matplotlib ipympl

print("Instructions:")
print("1. Use your mouse to draw a bounding box around the target object")
print("2. Make sure the box tightly encompasses the entire object")
print("3. Click 'Finalize & Close' when satisfied with the bounding box")

bbox_widget = create_bbox_widget(output_data_path)
display(bbox_widget.display())

### Run SAM2 Segmentation

Now we'll use the selected bounding box to run SAM2 segmentation across all frames:


In [None]:
# Extract bounding box coordinates
x, y, w, h = bbox_widget.get_bbox()
print(f"Selected bounding box: x={x}, y={y}, width={w}, height={h}")

# Update SAM2 configuration with the bounding box coordinates
sam2_config = config['sam2']
sam2_config['bbox'] = [x, y, x+w, y+h] 

print("Starting SAM2 object segmentation...")
print("This process will track the object across all frames.")

# Run object segmentation using SAM2
response = run_mask_extraction(
    config=sam2_config,
    exp_path=output_data_path,
    rgb_path=output_data_path / "left",
    mask_path=output_data_path / "masks"
)

if response:
    print("✓ SAM2 segmentation completed successfully!")
else:
    print("✗ Errors encountered during SAM2 inference.")
    print("Please check the bounding box selection and try again.")

assert response, 'SAM2 inference failed. Please resolve issues before proceeding.'


Let us inspect the extracted masks from SAM2. 

In [None]:
mask_viewer = create_mask_viewer(output_data_path)
display(mask_viewer)

## Step 4: 3D Reconstruction and Neural Rendering using NVBundleSDF

This final step combines multiple state-of-the-art techniques to create a complete 3D reconstruction of your object. The pipeline integrates feature matching, pose estimation, and neural rendering to generate high-quality textured 3D assets.

## Pipeline Overview

The 3D reconstruction process follows a sophisticated multi-stage approach:

1. **Pose Estimation** → Estimate and optimize camera poses
2. **Neural Reconstruction with Neural SDF** → Train a neural object field for 3D geometry
3. **Texture Baking** → Generate production-ready textured meshes

<div align="center">
    <img src="../data/docs/bundlesdf_pipeline.png" alt="BundleSDF Pipeline" title="3D Reconstruction Pipeline" width="800">
</div>

## Technical Components

### BundleSDF
[**BundleSDF: Neural 6-DOF Tracking and 3D Reconstruction**](https://arxiv.org/abs/2303.14158) combines:
- **Volume Rendering**: Learns 3D geometry through differentiable ray casting
- **Appearance Modeling**: Captures view-dependent effects and material properties
- **SDF Representation**: Uses signed distance functions for clean mesh extraction
- **Bundle Adjustment**: Performs global optimization across all frames for geometric consistency
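
To make the SDF idea concrete: a signed distance function returns the distance from a point to the nearest surface, with the sign indicating which side of the surface the point is on, so the surface itself is the zero level set. The sketch below uses an analytic sphere in place of a learned network and locates the surface along a ray by bisection. The sign convention (negative inside, positive outside) is a common one and is assumed here; BundleSDF's own convention and mesh extraction method may differ.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(point, center) - radius


def find_surface(sdf, origin, direction, t_near, t_far, iterations=40):
    """Bisect along a ray for the zero crossing of the SDF.

    Assumes the SDF has opposite signs at t_near and t_far.
    """
    def point_at(t):
        return tuple(o + t * d for o, d in zip(origin, direction))

    lo, hi = t_near, t_far
    lo_is_inside = sdf(point_at(lo)) < 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (sdf(point_at(mid)) < 0) == lo_is_inside:
            lo = mid  # same side as the near end: move the near bound
        else:
            hi = mid  # crossed the surface: move the far bound
    return 0.5 * (lo + hi)


print(sphere_sdf((0.0, 0.0, 0.0)))  # inside the sphere
print(sphere_sdf((2.0, 0.0, 0.0)))  # outside the sphere

# Cast a ray from the center along +X and locate the surface.
t_hit = find_surface(sphere_sdf, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, 3.0)
print(f"surface hit at t = {t_hit:.6f}")
```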

### FoundationStereo
[**FoundationStereo: Zero-Shot Stereo Matching**](https://arxiv.org/abs/2501.09898) delivers robust depth estimation:
- **Vision Foundation Model**: Leverages pre-trained vision transformers for rich feature extraction
- **Zero-Shot Generalization**: Performs well across diverse environments without domain-specific fine-tuning
- **Multi-Scale Processing**: Handles objects at different distances through hierarchical feature analysis
- **Sub-Pixel Accuracy**: Achieves precise depth measurements with transformer-based stereo matching

### RoMa Feature Matching
[**RoMa: Robust Dense Feature Matching**](https://arxiv.org/abs/2305.15404) provides reliable feature correspondences between frames:
- **Dense Matching**: Establishes pixel-to-pixel correspondences across viewpoints
- **Robust Descriptors**: Uses transformer-based features for challenging lighting and viewpoint changes
- **Uncertainty Estimation**: Provides confidence scores for each match to filter unreliable correspondences
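
The confidence-based filtering described above can be pictured as a simple threshold on per-match scores. The sketch below uses a hypothetical match representation (dictionaries with `src`, `dst`, and `confidence` keys) and an arbitrary threshold; it does not reflect RoMa's actual API or the thresholds used by this pipeline.

```python
def filter_matches(matches, min_confidence=0.5):
    """Keep only correspondences whose confidence reaches the threshold."""
    return [m for m in matches if m["confidence"] >= min_confidence]


# Hypothetical correspondences: pixel in frame A, pixel in frame B, confidence.
matches = [
    {"src": (120.0, 340.5), "dst": (131.2, 338.9), "confidence": 0.97},
    {"src": (812.4, 95.0), "dst": (640.1, 410.7), "confidence": 0.12},
    {"src": (455.3, 601.8), "dst": (467.0, 598.2), "confidence": 0.81},
]

reliable = filter_matches(matches, min_confidence=0.5)
print(f"kept {len(reliable)} of {len(matches)} matches")
```

Discarding low-confidence matches before pose estimation reduces the influence of outliers, at the cost of fewer constraints on weakly textured surfaces.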

### SAM2
[**SAM2: Segment Anything in Images and Videos**](https://arxiv.org/abs/2408.00714) extends segmentation to video:
- **Transformer Architecture**: Uses hierarchical vision transformer with streaming memory
- **Temporal Consistency**: Maintains object tracking across frames via memory mechanisms
- **Prompt Flexibility**: Accepts points, boxes, and masks for interactive segmentation
- **Fast Inference**: For image segmentation, the SAM 2 paper reports roughly 6× faster inference than the original SAM

## Configuration Review

Let's examine the configurations for each component before running the reconstruction:

In [None]:
roma_config = config['roma']
display(pretty_print_config(roma_config, "RoMa Feature Matching Configuration"))

In [None]:
config['bundletrack']['debug_dir'] = output_data_path / "bundletrack"
bundletrack_config = config['bundletrack']
display(pretty_print_config(bundletrack_config, "Pose Estimation Configuration"))

In [None]:
config['nerf']['save_dir'] = output_data_path #sdf config
nerf_config = config['nerf']
display(pretty_print_config(nerf_config, "SDF Training Configuration"))

In [None]:
# Texture Baking
texturebake_config = config['texture_bake']
display(pretty_print_config(texturebake_config, "Texture Baking Configuration"))

In [None]:
# Setup dataloaders
track_dataset = ReconstructionDataLoader(
    str(output_data_path), 
    config, 
    downscale=bundletrack_config['downscale'],
    min_resolution=bundletrack_config['min_resolution']
)
nerf_dataset = ReconstructionDataLoader(
    str(output_data_path), 
    config, 
    downscale=nerf_config['downscale'],
    min_resolution=nerf_config['min_resolution']
)
texture_dataset = ReconstructionDataLoader(
    str(output_data_path), 
    config, 
    downscale=texturebake_config['downscale'],
    min_resolution=texturebake_config['min_resolution']
)

# Setup NVBundleSDF instance 
tracker = NVBundleSDF(nerf_config, bundletrack_config, roma_config, texturebake_config)



Let us now continue with feature matching and pose estimation using BundleSDF.

In [None]:
# Run bundle track for feature matching and pose estimation 
tracker.run_track(track_dataset)

if not os.path.exists(os.path.join(config['bundletrack']['debug_dir'], 'keyframes.yml')):
    print('Feature matching and pose estimation failed. Check the logs and resolve the error before proceeding.')
else:
    print('Feature matching and pose estimation succeeded.')

Let's visualize the estimated camera poses.

- The `scale` parameter controls the size of the camera frustums in the visualization; smaller values such as 0.01 make the cameras appear smaller.
- The `eps` parameter controls point cloud downsampling during visualization; smaller values such as 0.01 preserve more detail but may render more slowly.
- The visualization shows camera positions and orientations as colored coordinate axes (red = X, right; green = Y, up; blue = Z, look-at), with yellow lines indicating the camera viewing frustums.



In [None]:
scene = vis_camera_poses(os.path.join(config['bundletrack']['debug_dir'], 'keyframes.yml'), track_dataset, scale=0.03, eps=0.01)
scene.show()

Now that we have extracted pose information and keyframes, we can train our SDF model. 

In [None]:
# Run SDF training 
tracker.run_global_sdf(nerf_dataset)

With the SDF model trained, we can proceed to texture baking. For 4K input, you can adjust the texture-baking downscale factor (`texture_bake.downscale`) to bake at the original image resolution if needed.

In [None]:
tracker.run_texture_bake(texture_dataset)

We now have our textured 3D asset. Let's take a look at the result.

In [None]:
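# Load the baked texture image and the reconstructed mesh
# (process=False skips trimesh's vertex merging and cleanup)
im = Image.open(f'{output_data_path}/material_0.png')
mesh = trimesh.load(f'{output_data_path}/textured_mesh.obj', process=False)
# Wrap the texture image in a TextureVisuals object
tex = trimesh.visual.TextureVisuals(image=im)
mesh.visual.texture = tex
view_mesh = mesh
# Set the material's diffuse color to opaque white
material = mesh.visual.material
material.diffuse = [255, 255, 255, 255]
mesh.show()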

In [None]:
# Let us also observe the mesh under its reconstructed lighting below.
K,H,W = nerf_dataset.K, nerf_dataset.H, nerf_dataset.W
tRes = 800
scale = tRes/max(H,W)
H,W = int(H*scale), int(W*scale)
cam_K = K[:2]*scale
try:
    renderer = ModelRendererOffscreen([], cam_K, H, W)
    renderer.add_mesh(mesh)
    colors, depths = renderer.render_fixed_cameras()
except Exception:
    # Retry once with a fresh renderer if the first attempt raises
    renderer = ModelRendererOffscreen([], cam_K, H, W)
    renderer.add_mesh(mesh)
    colors, depths = renderer.render_fixed_cameras()

plt.figure()
for i in range(8):
    plt.subplot(2,4,i+1)
    plt.imshow(colors[i])
    plt.axis('off')
plt.tight_layout()
plt.show()

## Summary and Next Steps

### **Workflow Summary**

Congratulations! You have successfully completed the end-to-end 3D object reconstruction pipeline. Here's what we accomplished:

#### **Pipeline Achievements**
1. **✅ Depth Estimation**: Generated accurate depth maps from stereo pairs using FoundationStereo's transformer-based architecture
2. **✅ Object Segmentation**: Created consistent object masks across all frames using SAM2's video tracking capabilities
3. **✅ Pose Estimation**: Estimated and optimized camera poses for the subsequent reconstruction step
4. **✅ Neural Reconstruction**: Trained a neural SDF to capture the object's 3D geometry
5. **✅ Texture Baking**: Generated high-resolution texture maps and exported production-ready 3D assets

#### **Generated Assets**
Your reconstruction pipeline has produced the following outputs in the experiment directory:
- **`textured_mesh.obj`**: Complete 3D mesh with UV mapping
- **`material_0.png`**: High-resolution texture map
- **`bundletrack/keyframes.yml`**: Optimized camera poses for each frame
- **`depth/`**: Dense depth maps for all input frames
- **`masks/`**: Object segmentation masks for background removal
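
Before moving on to downstream tools, it can be useful to confirm that these outputs were actually written. The sketch below checks for each expected file or folder relative to the experiment directory. The path `bundletrack/keyframes.yml` follows the pose estimation cell earlier in this notebook; the helper name is illustrative.

```python
from pathlib import Path

# Expected outputs, relative to the experiment directory.
EXPECTED_OUTPUTS = [
    "textured_mesh.obj",
    "material_0.png",
    "bundletrack/keyframes.yml",
    "depth",
    "masks",
]


def missing_outputs(experiment_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected files or folders that do not exist yet."""
    root = Path(experiment_dir)
    return [relative for relative in expected if not (root / relative).exists()]


missing = missing_outputs("/workspace/3d-object-reconstruction/data/output/retail_item/")
if missing:
    print("Missing outputs:", ", ".join(missing))
else:
    print("All expected outputs are present.")
```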

#### **Export to USD**
- To support loading various file types directly into Omniverse, we provide a set of converters that convert a file to USD.
- [USD Converter using isaaclab.sim.converters](https://isaac-sim.github.io/IsaacLab/main/source/api/lab/isaaclab.sim.converters.html)

### **Integration with other workflows**

The generated 3D assets are immediately ready for integration into various platforms and workflows:

**Applications:**
- **Robotic Manipulation**: Use reconstructed objects for grasping and manipulation training
- **Sim2Real Transfer**: Bridge the gap between simulation and real-world deployment
- **Digital Twins**: Create accurate digital replicas of real-world objects
- **Computer Vision Training**: Generate labeled datasets with your reconstructed objects
- **Domain Adaptation**: Create variations of real objects for robust model training
- **Rare Object Simulation**: Generate synthetic data for objects that are difficult to collect

**Further Reading**
- [Object Detection Synthetic Data Generation using isaacsim.replicator.object](https://docs.isaacsim.omniverse.nvidia.com/4.5.0/replicator_tutorials/tutorial_replicator_object.html)
  
### **Advanced Customization Options**

#### **Quality Optimization**
- **Higher Resolution**: Modify `texture_bake.downscale` to `1` for full-resolution texture baking
- **Extended Training**: Increase NeRF training iterations for improved reconstruction quality
- **Custom Camera Intrinsics**: Adapt the pipeline for different camera setups
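
When images are resized, the intrinsics must be scaled by the same factor; the rendering cell earlier in this notebook does this with `K[:2]*scale`. The sketch below shows the idea with plain Python lists: the first two rows of the matrix (`fx`, `cx`, `fy`, `cy`) are multiplied by the scale and the last row is left unchanged. Whether the data loader applies this automatically for a given `downscale` setting is not stated here, so treat this as an illustration of the relationship rather than a required step.

```python
def scale_intrinsics(K, scale):
    """Scale the first two rows of a 3x3 intrinsic matrix by `scale`."""
    return [
        [value * scale for value in K[0]],
        [value * scale for value in K[1]],
        list(K[2]),  # the homogeneous row [0, 0, 1] is unchanged
    ]


def resize_to_max_side(width, height, max_side):
    """Return the new (width, height) and scale so the longer side is max_side."""
    scale = max_side / max(width, height)
    return int(width * scale), int(height * scale), scale


# QooCam intrinsics from this notebook at 4000x3000 resolution.
K = [
    [3079.6, 0.0, 2000.0],
    [0.0, 3075.1, 1500.0],
    [0.0, 0.0, 1.0],
]

new_w, new_h, scale = resize_to_max_side(4000, 3000, max_side=2000)
K_scaled = scale_intrinsics(K, scale)
print(new_w, new_h, scale)
print(K_scaled)
```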

#### **Experiment with Your Own Data**
1. **Capture Guidelines**: Follow the data collection best practices demonstrated in Step 1
2. **Camera Calibration**: Ensure accurate intrinsic parameters for your stereo setup
3. **Lighting Conditions**: Experiment with different lighting setups for optimal results

### **Conclusion**

You have now walked through the complete 3D object reconstruction pipeline and gained hands-on experience with state-of-the-art computer vision techniques. The generated assets can be integrated into your robotics, gaming, or AI workflows.

The combination of FoundationStereo, SAM2, and BundleSDF provides a robust foundation for creating high-quality 3D content from real-world objects, bridging the gap between physical and digital worlds.

