Rui Chen1,2,
Jianfeng Zhang2*,
Yixun Liang1,
Guan Luo2,3,
Weiyu Li1,
Jiarui Liu1,
Xiu Li2,
Xiaoxiao Long1,
Jiashi Feng2,
Ping Tan1*,
*Corresponding authors
1 HKUST
2 Bytedance Seed
3 THU
Note: We recently found that Dora-VAE can use a latent code of any length at inference time, even lengths never seen during training, and reconstruction quality improves as more tokens are used. When comparing reconstruction performance against Dora-VAE, please also report the latent code length used, so that the comparison is fair. Thank you in advance!
- Release sharp edge sampling (2025.2.11)
- Release Dora-bench(256) (2025.2.12) (https://huggingface.co/datasets/aruichen/Dora-bench-256/tree/main)
- Release Dora-VAE v1.1, including inference and training code with model weights (2025.2.24)
- Release Dora-bench(512).
- Release Dora-VAE v1.2.
| Version | Training token lengths -> sampling probabilities | eps | Input points (uniform + salient) | Output |
|---|---|---|---|---|
| v1.0 | [256, 1280] -> [0.5, 0.5] | 2/256 | 16384 + 16384 | occupancy |
| v1.1 | [256, 512, 768, 1024, 1280, 2048, 4096] -> [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2] | 2/256 | 32768 + 32768 | TSDF |
| v1.2 | [256, 512, 768, 1024, 1280, 2048, 4096] -> [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2] | 2/512 | 32768 + 32768 | TSDF |
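During training, each batch's latent length is drawn from the distribution in the table above. A minimal sketch of that sampling step (the names below are illustrative, not the actual training code; values are from the v1.1/v1.2 row):

```python
import random

# Candidate latent lengths and their sampling probabilities (v1.1 / v1.2 row above).
TOKEN_LENGTHS = [256, 512, 768, 1024, 1280, 2048, 4096]
PROBS = [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2]

def sample_token_length() -> int:
    """Draw the latent code length for the current training batch."""
    return random.choices(TOKEN_LENGTHS, weights=PROBS, k=1)[0]
```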
- Progressive training is crucial for faster convergence of the diffusion model. Warming up with a token length of 256 and then gradually increasing the length during training significantly accelerates convergence compared with training directly at a large token length (see the sketch after this list).
- During training, avoid adding positional encoding to the latent space: it harms convergence, since the VAE's latent codes come from unordered point-query inputs.
- During training, bf16-mixed precision is more stable than fp16-mixed.
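A minimal sketch of such a warm-up schedule, assuming a simple step-based curriculum (the step thresholds and names below are illustrative, not the exact training recipe):

```python
# Illustrative step-based curriculum: warm up at token length 256, then
# progressively widen the pool of sampled lengths as training proceeds.
CURRICULUM = [
    (10_000, [256]),                                           # warm-up phase
    (30_000, [256, 512, 768]),                                 # intermediate phase
    (float("inf"), [256, 512, 768, 1024, 1280, 2048, 4096]),  # full range
]

def token_lengths_for_step(step: int) -> list[int]:
    """Return the token lengths allowed at a given training step."""
    for max_step, lengths in CURRICULUM:
        if step < max_step:
            return lengths
    return CURRICULUM[-1][1]
```

The returned pool can then be fed to a `sample_token_length`-style sampler as sketched above. If training runs under PyTorch Lightning (as in the CraftsMan-based codebase), the precision tip corresponds to passing `precision="bf16-mixed"` rather than `precision="16-mixed"` to the `Trainer`.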
Q1: Why is a compact (smaller) latent space important?
A: The reconstruction quality of XCube-VAE is quite good! However, it requires a much larger latent space. We note that a compact latent space is crucial for faster convergence of diffusion training, which lowers training difficulty and reduces computational requirements. In a more careful evaluation, we found that XCube-VAE produces latent vectors with an average of 64,821 dimensions on our training data. Our VAE allows a batch size of 128 on an A100 GPU, whereas XCube-VAE only achieves a batch size of 2 on the same GPU.
Q2: Can SNE or normals be used to further enhance the reconstruction quality?
A: In an earlier experiment, we designed a new, efficient algorithm within the vecset-based architecture to render normal maps, in order to compute a mean squared error (MSE) loss or a GAN loss between predicted and ground-truth (GT) normals. Unfortunately, the experiment failed, and the results were even worse. For the MSE loss, here are possible reasons for the failure. First, to render normals we must initially predict an occupancy field, then apply a differentiable marching cubes algorithm to extract a mesh, and finally use a differentiable renderer such as nvdiffrast to render the normals. Both the mesh extraction and the rendering introduce errors. Moreover, during backpropagation the gradient chain has to pass through the occupancy field, so as an optimization problem, supervising with normals is essentially equivalent to supervising with occupancy. Since we already have ground-truth occupancy, it is questionable whether normals, which involve a much longer propagation chain, are truly needed for supervision.
In addition to the above reasons, for the GAN loss: the normals rendered from meshes reconstructed by a 3D VAE are essentially perfectly clean. They are free of noise, have no background, show no high-frequency texture variation, and conform to physical constraints, so a discriminator gains little useful signal. This differs from the RGB images reconstructed by a 2D VAE, which contain noise, cluttered backgrounds, and significant high-frequency variation. Note that the failure might also be due to incorrect code or bugs; the analysis above is merely a post-hoc attempt to understand the outcome and does not guarantee a correct explanation.
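For concreteness, a rough sketch of the attempted pipeline described above (the `differentiable_marching_cubes` helper is hypothetical; the nvdiffrast calls follow its public API, but this is not the exact experimental code):

```python
import torch
import torch.nn.functional as F
import nvdiffrast.torch as dr

def normal_supervision_loss(occupancy_grid, mvp, gt_normal_img, glctx):
    """Sketch of normal-map supervision through occupancy -> mesh -> render.

    occupancy_grid: (D, D, D) predicted occupancy, kept in the autograd graph.
    mvp:            (4, 4) model-view-projection matrix.
    gt_normal_img:  (512, 512, 3) ground-truth normal map.
    glctx:          an nvdiffrast context, e.g. dr.RasterizeCudaContext().
    """
    # 1) Hypothetical differentiable marching cubes: gradients must flow back
    #    from vertex positions into the occupancy field.
    verts, faces = differentiable_marching_cubes(occupancy_grid, level=0.5)

    # 2) Per-vertex normals via area-weighted accumulation of face normals.
    face_normals = torch.cross(
        verts[faces[:, 1]] - verts[faces[:, 0]],
        verts[faces[:, 2]] - verts[faces[:, 0]], dim=-1)
    vert_normals = torch.zeros_like(verts).index_add_(
        0, faces.reshape(-1), face_normals.repeat_interleave(3, dim=0))
    vert_normals = F.normalize(vert_normals, dim=-1)

    # 3) Differentiable rasterization and attribute interpolation (nvdiffrast).
    verts_h = torch.cat([verts, torch.ones_like(verts[:, :1])], dim=-1)
    pos_clip = (verts_h @ mvp.t()).unsqueeze(0)            # (1, V, 4)
    rast, _ = dr.rasterize(glctx, pos_clip, faces.int(), resolution=[512, 512])
    normal_img, _ = dr.interpolate(vert_normals.unsqueeze(0), rast, faces.int())

    # 4) MSE against the GT normal map; every step above must stay
    #    differentiable for gradients to reach the occupancy field.
    return F.mse_loss(normal_img[0], gt_normal_img)
```

The long chain in steps 1-3 is exactly the "longer propagation chain" the answer refers to.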
Q3: What's the difference between a point query and a learnable query as the input of the VAE encoder?
A: According to the experiments in 3DShape2VecSet, point queries perform better than learnable queries, and a model that takes point queries as input generalizes better. The number of point queries equals the length of the latent code, and point queries have a great property: they are more flexible than learnable queries. During inference, the model can easily switch to lengths that were never seen during training without introducing additional parameters. In contrast, a model trained with learnable queries cannot use unseen lengths at test time, which limits its flexibility.
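A minimal sketch of the two query types (the module and function names are illustrative, not Dora's actual code):

```python
import torch
import torch.nn as nn

class LearnableQuery(nn.Module):
    """Fixed-length queries baked into the weights: the length cannot change at test time."""
    def __init__(self, num_latents: int, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_latents, dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # (B, num_latents, dim), independent of the input point cloud.
        return self.query.expand(points.shape[0], -1, -1)

def point_query(points: torch.Tensor, num_latents: int) -> torch.Tensor:
    """Point queries: subsample the input cloud itself, so num_latents can take
    any value at inference time. 3DShape2VecSet uses farthest point sampling;
    random sampling here just keeps the sketch dependency-free."""
    idx = torch.randperm(points.shape[1], device=points.device)[:num_latents]
    return points[:, idx]  # (B, num_latents, 3), embedded before cross-attention
```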
Q4: Any interesting findings?
A: We visualized the encoder's cross-attention map and found that the query points (colored green) tend to attend more to the point clouds in their surrounding areas (where redder indicates more attention and blacker indicates less attention).
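A rough sketch of how such a map can be turned into point colors, assuming the softmaxed cross-attention weights have been captured (e.g. with a forward hook; the function name is illustrative):

```python
import torch

@torch.no_grad()
def cross_attention_colors(attn: torch.Tensor, query_idx: int) -> torch.Tensor:
    """Map one query's attention over the input points to red-black colors.

    attn: (num_queries, num_points) softmaxed cross-attention weights,
          e.g. averaged over heads, captured with a forward hook.
    Returns (num_points, 3) RGB colors: redder = more attention.
    """
    w = attn[query_idx]
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)  # normalize to [0, 1]
    colors = torch.zeros(w.shape[0], 3)
    colors[:, 0] = w  # red channel encodes attention; black = low attention
    return colors
```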
Q5: Can Dora-VAE handle thin-shell data?
A: Yes. Dora-VAE reconstructs thin-shell data with high quality. The two examples in the figure above show slices of thin-shell data reconstructed by Dora-VAE, where white represents the interior and black the exterior.
Q6: 2D VAEs typically require billions of images for training. However, due to the shortage of 3D data, 3D VAEs are usually trained on fewer than one million shapes. Do they still generalize well?
A: At first, we had the same question. But after improving and training a 3D VAE based on the vecset-based architecture proposed by 3DShape2VecSet, we were pleasantly surprised to find that it is really powerful, as CLAY has also verified. In fact, only about 400K shapes are needed for good generalization. We hypothesize that the distribution of 3D geometry is simpler than that of RGB images: unlike RGB images, 3D geometry has neither many high-frequency texture variations nor cluttered backgrounds.
- 3DShape2VecSet is the foundation of our work; it proposes the vecset-based representation and the use of point queries as the VAE's encoder input.
- CLAY verifies the scalability of the vecset-based representation and proposes a controllable generative model for creating high-quality 3D assets.
- CraftsMan provides a PyTorch Lightning framework similar to threestudio that facilitates native 3D training. Our code is largely based on the CraftsMan repository.
- Michelangelo: we follow its design of an 8-layer encoder and a 16-layer decoder.
@article{chen2024dora,
title={Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders},
author={Rui Chen and Jianfeng Zhang and Yixun Liang and Guan Luo and Weiyu Li and Jiarui Liu and Xiu Li and Xiaoxiao Long and Jiashi Feng and Ping Tan},
journal={arXiv preprint arXiv:2412.17808},
year={2024},
}