Really great work! I have the following two questions.
In my understanding, the ground-truth poses of KITTI Odometry are expressed in the camera_00 coordinate frame, while PoseNet is evaluated on image sequences captured by camera_02. Shouldn't the predicted poses first be transformed to the camera_00 frame before computing the metrics?
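The transform in question could be sketched as below. This is an illustrative example, not the repo's code: the rigid transform `T_00_02` between the two cameras would come from KITTI's `calib_cam_to_cam.txt` in a real pipeline; here it is a made-up pure translation, and the baseline value is a placeholder.

```python
import numpy as np

# Hypothetical camera_02 -> camera_00 rigid transform (placeholder values;
# in practice read it from KITTI's calib_cam_to_cam.txt).
T_00_02 = np.eye(4)
T_00_02[0, 3] = -0.06  # illustrative baseline, not the real KITTI number

def to_cam00(T_02):
    """Express a relative pose predicted in the camera_02 frame in camera_00."""
    return T_00_02 @ T_02 @ np.linalg.inv(T_00_02)

# A toy predicted camera_02 pose: small rotation about y plus forward motion.
theta = 0.1
T_02 = np.eye(4)
T_02[:3, :3] = [[np.cos(theta), 0, np.sin(theta)],
                [0, 1, 0],
                [-np.sin(theta), 0, np.cos(theta)]]
T_02[2, 3] = 1.0

T_00 = to_cam00(T_02)
# The rotation block is unchanged by a pure-translation frame change,
# but the translation picks up a rotation-dependent offset.
```

Note that with a pure-translation calibration offset the rotation is identical in both frames; only the translation changes, which is why the effect grows with the rotation magnitude.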
The scale ambiguity of self-supervised learning is mentioned in Inverse warp with scaled depth #45 (i.e., a scaled inverse depth combined with a correspondingly scaled pose still leads to correct image warping). While the translation may be scaled, the rotation matrix is converted from Euler angles predicted by PoseNet and cannot be scaled. I am wondering whether this Euler-angle parameterization can partially resolve the scale ambiguity.
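The ambiguity referenced in #45 can be checked numerically: scaling the depth and the translation by the same factor leaves the reprojected pixel unchanged, while the rotation plays no part in the ambiguity. A minimal sketch (the intrinsics values are illustrative, not taken from the repo):

```python
import numpy as np

K = np.array([[718.856, 0.0, 607.19],   # intrinsics in the style of KITTI,
              [0.0, 718.856, 185.22],   # values are illustrative only
              [0.0, 0.0, 1.0]])

def reproject(px, depth, R, t):
    """Back-project pixel px at the given depth, apply (R, t), project again."""
    p_h = np.array([px[0], px[1], 1.0])
    X = depth * (np.linalg.inv(K) @ p_h)  # 3-D point in the source frame
    X2 = R @ X + t                         # rigid motion
    p2 = K @ X2
    return p2[:2] / p2[2]

theta = 0.05
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.1, 0.0, 1.0])

px = (500.0, 200.0)
s = 3.7  # arbitrary unknown scale factor
a = reproject(px, 10.0, R, t)
b = reproject(px, s * 10.0, R, s * t)  # scale depth AND translation together
# a and b coincide: the photometric loss cannot tell the two apart.
```

Since `R @ (s*X) + s*t = s * (R @ X + t)` and projection divides out the depth, the pixel coordinates are identical, so the Euler-angle parameterization of rotation does not remove the depth/translation scale ambiguity.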
Thanks!
Yes, you are right, good catch! However, this would only be significant for large rotations, and since the snippets are only 5 frames long, the results we get are still valid IMO. It's also important to remember that KITTI itself states that the pose measurements are not very meaningful over small snippets. But in principle it's completely true that we should convert the output of PoseNet to the camera_00 coordinate frame.
Not sure what you mean by that. Rotation has no scale ambiguity, contrary to translation, so even if both come from the same network, they must be treated differently when deducing the right depth. The only way I found to address this problem with PoseNet is to measure the actual translation magnitude from KITTI OXTS, using either GPS or wheel speed. More generally, to resolve the translational ambiguity you will need an anchor to reality at some point, because everything looks the same to a camera whether you are driving a real car or, say, a toy car through a scale model of a city. The original authors chose the LiDAR, via the median of the ground-truth depth map; this repo proposes either that (which is not ideal for a system that tries to estimate depth without supervision) or the car's speed (ubiquitous and far more reasonable).
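The two anchors described above can be sketched as follows. This is a hedged illustration with hypothetical function names, not the repo's actual evaluation code:

```python
import numpy as np

def scale_from_gt_depth(pred_depth, gt_depth):
    """Per-image scale from the ratio of medians (the original authors' choice).
    Only pixels with a valid (nonzero) sparse LiDAR return are used."""
    valid = gt_depth > 0
    return np.median(gt_depth[valid]) / np.median(pred_depth[valid])

def scale_from_speed(t_pred, speed_mps, dt):
    """Scale from vehicle speed (KITTI OXTS): true distance traveled
    divided by the norm of the predicted translation."""
    return speed_mps * dt / np.linalg.norm(t_pred)

# Toy example: predictions that are 1/5 of metric scale.
gt = np.array([5.0, 10.0, 0.0, 20.0])  # 0 marks a missing LiDAR return
pred = gt / 5.0 + (gt == 0)            # keep the invalid slot nonzero
s1 = scale_from_gt_depth(pred, gt)     # -> 5.0

t_pred = np.array([0.0, 0.0, 0.2])     # predicted 0.2 "units" forward
s2 = scale_from_speed(t_pred, speed_mps=10.0, dt=0.1)  # car moved 1 m -> 5.0
```

Multiplying both the predicted depth map and the predicted translation by the recovered scale restores metric units, per the warping argument above.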