Questions regarding pose evaluation and depth scale ambiguity #104

Closed
scott89 opened this issue Aug 8, 2020 · 2 comments

scott89 commented Aug 8, 2020

Hi Clément,

Really great work! I have the following two questions.

  1. In my understanding, the ground-truth poses of KITTI Odometry are expressed in the camera_00 coordinate frame, while PoseNet is evaluated on the image sequences captured by camera_02. Shouldn't the predicted poses first be transformed into the camera_00 frame before computing the metrics?
  2. The scale ambiguity of self-supervised learning is mentioned in Inverse warp with scaled depth #45 (i.e., a scaled inverse depth combined with a correspondingly scaled pose still leads to correct image warping; a minimal numeric check of this is sketched below). While the translation may be scaled, the rotation matrix is converted from Euler angles predicted by PoseNet and cannot be scaled. I am wondering whether this Euler-angle parameterization can partially resolve the scale ambiguity.

Thanks!
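
To make the ambiguity in question 2 concrete, here is a minimal numeric check (an illustration only, not code from this repo; the intrinsics are approximate KITTI values and the pixel, depth, and pose values are made up): scaling the predicted depth and the predicted translation by the same factor leaves the reprojected pixel unchanged, while the rotation is never rescaled.

```python
import numpy as np

# Approximate KITTI intrinsics (illustrative values only)
K = np.array([[718.856,   0.0,   607.19],
              [  0.0,   718.856, 185.22],
              [  0.0,     0.0,     1.0]])

R = np.eye(3)                          # rotation: never rescaled
t = np.array([0.02, 0.0, 0.8])         # predicted translation (scale-ambiguous)
pixel = np.array([400.0, 200.0, 1.0])  # homogeneous pixel in the source image
depth = 15.0                           # predicted depth at that pixel

def reproject(scale):
    """Back-project with a scaled depth, re-project with an equally scaled translation."""
    point = (scale * depth) * (np.linalg.inv(K) @ pixel)
    proj = K @ (R @ point + scale * t)
    return proj[:2] / proj[2]

# Identical pixel for any scale: the photometric warp alone cannot resolve it.
assert np.allclose(reproject(1.0), reproject(3.7))
```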

ClementPinard commented Aug 12, 2020

Hi, thanks for your interest in this repo.

  1. Yes, you are right, good catch! However, this would only be significant with large rotation values, and since the snippets are only 5 frames long, the results we get are still valid IMO. It's also important to remember that KITTI itself says the pose ground truth is not very accurate over such short snippets. But in principle it's completely true that we should convert the output of PoseNet to the camera_00 coordinate frame (see the pose-conversion sketch after this list).

  2. Not sure what you mean by that. Rotation has no ambiguity, contrary to translation, so even if both translation and rotation come from the same network, they have to be treated differently when it comes to deducing the right depth. The only way I found to address this problem with PoseNet is to measure the actual translation magnitude from the KITTI OXTS data, using either GPS or wheel speed. More generally, to solve the translational ambiguity you will at some point need an anchor to reality, because everything behaves the same way for a camera whether you are filming from a real car or, e.g., from a toy car in a scale model of a city. The original authors chose the LiDAR, using the median of the ground-truth depth map, and this repo proposes either that (which is not ideal for a system that tries to estimate depth without supervision) or the car's speed (ubiquitous and far more reasonable). Both scaling options are sketched below.
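
For point 1, here is a minimal sketch of the conversion (not code from this repo; `T_cam0_from_cam2` is a hypothetical name for the 4x4 rigid transform between the two cameras that you would assemble from the KITTI calibration files): a relative pose predicted in camera_02 coordinates is re-expressed in camera_00 coordinates by conjugating it with that fixed transform.

```python
import numpy as np

def pose_cam2_to_cam0(pose_cam2, T_cam0_from_cam2):
    """Re-express a relative pose T_{t -> t+1}, predicted in camera_02
    coordinates, in the camera_00 frame used by the KITTI ground truth.

    pose_cam2        : (4, 4) relative pose in camera_02 coordinates
    T_cam0_from_cam2 : (4, 4) rigid transform mapping camera_02 points to
                       camera_00 points (hypothetical name, built from the
                       KITTI calibration files)
    """
    return T_cam0_from_cam2 @ pose_cam2 @ np.linalg.inv(T_cam0_from_cam2)
```

Since the two cameras differ essentially by a translation along the rig, the conjugation only changes the result noticeably when the rotation is large, which is why the 5-frame snippets are barely affected.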

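And for point 2, a sketch of the two scaling options (hypothetical helper names, not this repo's API): one anchors the scale to the LiDAR via the median of the ground-truth depth map, the other to the car's displacement computed from its speed (KITTI is recorded at roughly 10 Hz, hence the default frame interval).

```python
import numpy as np

def scale_from_gt_depth(pred_depth, gt_depth):
    """LiDAR anchor: ratio of median ground-truth depth to median predicted depth."""
    valid = gt_depth > 0
    return np.median(gt_depth[valid]) / np.median(pred_depth[valid])

def scale_from_speed(pred_translation, speed_m_per_s, frame_interval_s=0.1):
    """Speed anchor: the true displacement between consecutive frames is
    speed * dt, so rescale the predicted translation to match that magnitude."""
    return (speed_m_per_s * frame_interval_s) / np.linalg.norm(pred_translation)
```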

scott89 commented Sep 1, 2020

Thanks for your reply. That answers my questions!

scott89 closed this as completed Sep 1, 2020