Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy depends heavily on neural networks that are resource-intensive to train and render. Although recent 3D Gaussians offer efficient, high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically reduces the memory requirement, and a novel embedding procedure that yields smoother yet highly accurate queries, countering the multi-view feature inconsistencies and the high-frequency inductive bias of point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy among current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
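To make the memory argument concrete, the following is a minimal sketch of generic codebook (vector) quantization, in which each point stores a small integer index into a shared table of feature vectors instead of a full feature vector. It is not the quantization scheme proposed in the paper, whose details the abstract does not give. The function names, the feature dimension, the codebook size, and the point count are all hypothetical values chosen for illustration.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(features, codebook):
    """Replace every feature vector by the index of its nearest codebook entry."""
    return [nearest_code(feature, codebook) for feature in features]


def raw_bytes(num_points, dim, bytes_per_value=4):
    """Memory needed to store one full feature vector per point."""
    return num_points * dim * bytes_per_value


def quantized_bytes(num_points, dim, codebook_size,
                    bytes_per_value=4, bytes_per_index=4):
    """Memory needed to store one index per point plus the shared codebook."""
    return num_points * bytes_per_index + codebook_size * dim * bytes_per_value


if __name__ == "__main__":
    # Toy 2D example: two codebook entries, two features (hypothetical data).
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    features = [[0.1, -0.1], [0.9, 1.2]]
    print(quantize(features, codebook))

    # Hypothetical sizes: 1M points, 512-dim float32 features, 256 codes.
    print(raw_bytes(1_000_000, 512))
    print(quantized_bytes(1_000_000, 512, 256))
```

Under these assumed sizes, the per-point cost drops from a full feature vector to a single index, and the codebook is stored only once for the whole scene. The price is that every point is limited to one of the codebook entries, which is why the choice of codebook matters for query accuracy.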
