
Some problems about this architecture #33

Open
FlowEternal opened this issue Nov 23, 2023 · 0 comments

@FlowEternal
Hi, thanks for your work. I tried your model architecture on my custom data, and here are some problems I ran into, along with my insights.

1. There are two key advantages to this model:
(1) It is self-supervised and needs only continuous video sequences with the relative [R|T] between adjacent frames. This reduces human labeling effort enormously, and we can use a high-precision localization algorithm or device to obtain the camera poses automatically.
(2) It is not computationally intensive if we do not want to render depth but only want the 3D occupancy grid. For example, with x range [-8, 8], y range [-0.4, 2.2], z range [1, 21] and a voxel resolution of 0.2, there are about 100,000 sample points, and a single MLP inference with input shape [1, 100000, 103] is enough (see the sketch after this list). The grid-sample op can also be implemented easily as a CUDA kernel, and there are no 3D conv ops.
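
For reference, here is a minimal sketch (PyTorch, not the repo's actual code) of that grid-extraction path. The ranges and voxel size are the ones quoted above; the split of the 103 feature channels is my assumption, only the total comes from the shape above:

```python
import torch

# Voxel-center coordinates for the ranges quoted above (0.2 m resolution).
xs = torch.arange(-8.0, 8.0, 0.2)   # 80 steps
ys = torch.arange(-0.4, 2.2, 0.2)   # 13 steps
zs = torch.arange(1.0, 21.0, 0.2)   # 100 steps
grid = torch.stack(torch.meshgrid(xs, ys, zs, indexing="ij"), dim=-1)
pts = grid.reshape(1, -1, 3)        # [1, 104000, 3] -- "about 100,000 points"

# Hypothetical density MLP: a single forward pass over all points suffices.
# 103 = e.g. 64 image-feature dims + 36 positional-embedding dims + 3 coords
# (an assumed split; only the total of 103 is from the issue text).
mlp = torch.nn.Sequential(
    torch.nn.Linear(103, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
feats = torch.randn(1, pts.shape[1], 103)   # placeholder per-point features
density = mlp(feats)                        # [1, 104000, 1] occupancy logits
```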

2. But to be honest, there are some unavoidable problems associated with the model:
(1) The training signal depends on both the image quality itself and the [R|T] precision:
-- First, if the image has reflections on the ground, such as in an underground parking lot, then no matter how precise the [R|T] matrix is, the training signal will be vague and weak in those regions, because there is no way to tune the predicted density along the camera ray so that it finds the correct stereo-matching point on the epipolar line of the rendered image.

-- Second, if the [R|T] matrix is not precise, the training signal will also not be clear enough to produce notable results. For example, if the video sequence is monocular, the [R|T] matrix may be imprecise because there is always some road-surface vibration while driving, which can lead to sub-optimal training. When trained on a stereo camera, however, the [R|T] between the left and right cameras is very precise and almost never changes while driving, so the training signal is much clearer and the result is much better than in the mono case. So to get very good results, it seems the data-gathering car needs to be equipped with a stereo camera with a larger baseline when applied to outdoor driving scenarios (a sketch of why pose error corrupts the signal follows below).
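
For intuition, here is a minimal monodepth-style sketch of a photometric reprojection signal. The repo's actual loss renders images volumetrically rather than warping with a single depth map, but the sensitivity to [R|T] is the same: any pose error shifts every reprojected pixel and blurs the loss. All names here are illustrative:

```python
import torch
import torch.nn.functional as F

def reprojection_loss(tgt, src, depth, K, K_inv, T_src_tgt):
    """tgt/src: [B,3,H,W] images; depth: [B,1,H,W]; K,K_inv: [B,3,3]; T: [B,4,4]."""
    B, _, H, W = tgt.shape
    # Back-project every target pixel to 3D with the predicted depth.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).view(1, 3, -1).expand(B, -1, -1)
    cam = depth.view(B, 1, -1) * (K_inv @ pix)                  # [B,3,HW]
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)  # homogeneous
    # Move the points into the source frame with the (possibly noisy) [R|T].
    proj = K @ (T_src_tgt @ cam)[:, :3]
    u = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    v = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    # Sample the source image where the points land; pose error shifts (u, v).
    uv = torch.stack([u / (W - 1) * 2 - 1, v / (H - 1) * 2 - 1], dim=-1)
    warped = F.grid_sample(src, uv.view(B, H, W, 2), align_corners=True)
    return (tgt - warped).abs().mean()
```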

(2) The generalization ability is weak.
Even when trained with a stereo camera or with very precise [R|T] between adjacent frames, and with very good image quality free of reflections and artifacts, the generalization ability is not impressive. For example, a model trained on KITTI-raw or KITTI-360 performs very badly on a custom dataset without finetuning (zero-shot). The depth map is far from precise, especially in the road-surface region. When finetuned on a custom mono dataset, the model performs better but is still far from precise, especially in the road-surface region, and texture-copy artifacts appear in the rendered depth map.

The problem cannot be solved even with a larger dataset, and I think it is really an intrinsic limitation of the model. The key problem may lie in how the point feature in 3D space is constructed. In your paper, the 3D point feature consists of three components: 1. the image feature sampled from the projected pixel using interpolation; 2. a positional embedding [sin(fu), sin(fv), sin(fz), cos(fu), cos(fv), cos(fz), sin(2fu), sin(2fv), sin(2fz), cos(2fu), cos(2fv), cos(2fz), ...]; 3. the normalized position itself, (u, v, z).
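
Concretely, that construction reads like the sketch below (PyTorch; the 64-dim image feature and 6 embedding frequencies are my assumptions, chosen so that 64 + 6·2·3 + 3 = 103 matches the input shape quoted earlier):

```python
import torch
import torch.nn.functional as F

def point_features(feat_map, uvz, num_freqs=6):
    """feat_map: [1,64,Hf,Wf] image feature; uvz: [1,N,3] coords normalized to [0,1]."""
    # 1. Image feature bilinearly interpolated at each point's projected pixel.
    grid = uvz[..., :2].unsqueeze(2) * 2 - 1            # [1,N,1,2] in [-1,1]
    img_feat = F.grid_sample(feat_map, grid, align_corners=True)
    img_feat = img_feat.squeeze(-1).permute(0, 2, 1)    # [1,N,64]
    # 2. Fourier positional embedding: sin/cos of (u,v,z) at doubling
    #    frequencies f, 2f, ..., per the [sin(fu) ... cos(2fz) ...] pattern.
    freqs = torch.pi * 2.0 ** torch.arange(num_freqs)   # assumed base frequency
    ang = uvz.unsqueeze(-1) * freqs                     # [1,N,3,num_freqs]
    pe = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(2)  # [1,N,36]
    # 3. The normalized position itself.
    return torch.cat([img_feat, pe, uvz], dim=-1)       # [1,N,103]
```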

So I think this may be the key issue that weakens generalization: the image-feature part of the point feature vector may be the dominant factor in determining the decoded density at that point, and image features can differ greatly between dataset domains. So when the model is applied zero-shot to a completely new dataset, it performs so badly that it has to be finetuned to adapt to the new image-feature space, and if that custom dataset has only a mono camera, the training result will not be good. This really hinders its practical use in autonomous driving.

I wonder whether I have misunderstood something about this model, and I would also like to know how to enhance the model's generalization and get good results on my custom mono video sequences with moderately precise [R|T].
