How to use outputs of layout / angles from a pretrained model? #55

Open
garrickbrazil opened this issue Jul 31, 2022 · 5 comments

@garrickbrazil

I'm playing with the SUN RGB-D model (v3 | mAP@0.15: 43.7, which uses 20211007_105247.pth and imvoxelnet_total_sunrgbd_fast.py).

For each image I'm testing, I have only the RGB and a 3x3 intrinsic matrix that maps from camera space to screen space.

I've been able to follow the demo code in general so far! Perhaps I'm missing it, but the flow and pipeline for the available demos appear not to use the outputs for layout / angles. However, the visualized images elsewhere seem to have layout or room-tilt predictions applied along with the per-object yaw angles.

I want to make sure that I'm using the SUN RGB-D model correctly. Are there any examples I can follow to make sure I apply the room tilts to the objects correctly? E.g., say my end goal is an 8-vertex mesh per object in camera coordinates.

For instance, show_result and _write_oriented_bbox seem to use only the yaw angle, and those appear to be the two main visualization functions (unless I'm missing some code).

To be clear, the predictions are definitely being made as expected. It's only the exact steps for applying them that are ambiguous to me.
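
Just to illustrate what I mean by an 8-vertex mesh, here is a minimal numpy sketch under my own assumed corner ordering and axis conventions (not necessarily the repo's):

```python
import numpy as np

def box_corners(center, dims, yaw):
    """Return the 8 corners of a yaw-parametrized 3D box.

    center: (x, y, z) box center in meters
    dims:   (w, l, h) box size in meters
    yaw:    rotation around the vertical axis in radians

    The corner ordering and the choice of z as "up" are my assumptions.
    """
    w, l, h = dims
    # corners of an axis-aligned box centered at the origin
    x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (w / 2)
    y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * (l / 2)
    z = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * (h / 2)
    corners = np.stack([x, y, z], axis=0)        # (3, 8)

    # rotate around the vertical axis by yaw, then translate to the center
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0],
                  [s,  c, 0],
                  [0,  0, 1]])
    return (R @ corners).T + np.asarray(center)  # (8, 3)
```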

@garrickbrazil
Author

Would it be possible to provide an example of how to use the SUN RGB-D model (v3 | mAP@0.15: 43.7, which uses 20211007_105247.pth and imvoxelnet_total_sunrgbd_fast.py) with an arbitrary in-the-wild image? Specifically, it's a bit confusing at the moment what assumptions to make regarding the "lidar2img" dict entries:

  1. intrinsic: This seems self-explanatory enough if it is just the original focal length and principal point in pixels. No explanation needed!

  2. extrinsic: It seems that this is not needed for the model I'm using, but it's still required when I use it with the demo. Is it okay to just use an identity matrix? Since the extrinsics are essentially predicted by this model, I'm afraid that the demo pipeline may not be respecting those predictions if they are used to help place the 3D boxes.

  3. origin: What should this be set to?

Any help would be greatly appreciated!
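
For reference, here is roughly what I'm constructing right now for a single in-the-wild image (just a sketch of my current guess; the key names and expected shapes are what I inferred from the demo, so please correct me if any of it is wrong):

```python
import numpy as np

# fx, fy, cx, cy: my (estimated or known) pinhole intrinsics in pixels
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

lidar2img = dict(
    # camera space -> pixel space
    intrinsic=np.array([[fx, 0.0, cx],
                        [0.0, fy, cy],
                        [0.0, 0.0, 1.0]]),
    # identity, since (if I understand correctly) this model predicts the
    # extrinsics itself and ignores this entry
    extrinsic=np.eye(4),
    # not sure what this should be set to -- see question 3 above
    origin=np.array([0.0, 0.0, 0.0]),
)
```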

@filaPro
Contributor

filaPro commented Jul 31, 2022

Hi Garrick,

Thanks for your interest in our research. I will try to answer some of your questions. First, I've never tried to run this code on images in the wild, only on the 4 datasets from our paper, but it should be possible somehow.

We support KITTI, nuScenes, ScanNet and 3 benchmarks for SUN RGB-D. As I remember, for all 6 benchmarks we predict boxes in the world coordinate system, so we use both the extrinsics and intrinsics provided in the datasets. All these projection matrices are used only in this function. However, this function has a special case for the SUN RGB-D dataset and the Total3dUnderstanding benchmark. As I remember, we follow their idea of parametrizing the extrinsics matrix with 2 angles and predict them at inference time. So, regarding your question about extrinsics, the answer is: we don't use this matrix for inference of the model you are interested in.

The part about origin is tricky. For the Total3dUnderstanding benchmark we use [0, 3, -1]. These 3 values, one per axis of the DepthInstance3DBoxes coordinate system, should be in meters. Following these 2 lines, this means that we consider only the space in front of the camera (the camera and world coordinate systems share the same zero point), with -3.2 < x < 3.2, -0.2 < y < 6.2, -2.28 < z < 0.28. These assumptions can be kept for your in-the-wild cases, or the origin can be shifted for a particular case.
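
As a quick sanity check, the bounds above come directly from the origin and the size of the voxel grid (the n_voxels and voxel_size values here are what I remember from the config, so please double-check them):

```python
import numpy as np

origin = np.array([0.0, 3.0, -1.0])        # meters, DepthInstance3DBoxes axes
n_voxels = np.array([40, 40, 16])          # grid resolution (from the config)
voxel_size = np.array([0.16, 0.16, 0.16])  # meters per voxel (from the config)

half_extent = n_voxels * voxel_size / 2    # [3.2, 3.2, 1.28]
print(origin - half_extent)                # [-3.2, -0.2, -2.28]
print(origin + half_extent)                # [ 3.2,  6.2,  0.28]
```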

Hope these answers help; feel free to ask any new questions.

@garrickbrazil
Author

Hi Danila, thank you for the fast response! Your answers clear up a lot of my confusion about using the imvoxelnet_total_sunrgbd_fast model. I really appreciate you taking the time here.

To recap: if we infer with the above model on either SUN RGB-D data (where only the intrinsics are assumed) or in-the-wild data (where we may estimate or sometimes know the intrinsics), then the origin should be set to [0, 3, -1], and the extrinsics can be set to identity since they won't be used. By the way, as a sanity check I did verify that the input extrinsics array appears to have no effect on the model output, unlike the origin 3x1 array, which matches my understanding now.

My last question, for confirmation: for the above model, should we manually apply the predicted extrinsics (2 angles, for pitch and roll) to each estimated box, following the logic of get_extrinsics?

@filaPro
Contributor

filaPro commented Jul 31, 2022

I probably don't quite understand your last question.

In my understanding we need extrinsics for monocular 3D detection here because the ground-truth box is parametrized by only one rotation angle (not 3), around the z axis. So we somehow need to estimate the floor plane of the room, and this is done by predicting these 2 angles (or by using the ground-truth extrinsics for the other benchmarks). get_extrinsics is used during our model's inference, so the predicted boxes are in the SUN RGB-D world coordinate system, where they are parallel to the floor plane.
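
Conceptually it is something like the sketch below (just the idea, not the exact implementation of get_extrinsics; the axis order and sign conventions may differ from the repo):

```python
import numpy as np

def extrinsics_from_angles(pitch, roll):
    """Rotation that levels the predicted boxes with the floor plane,
    built from the 2 predicted angles. Illustrative only -- see
    get_extrinsics in the repo for the actual conventions."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0],          # rotation around the x axis (pitch)
                   [0, cp, -sp],
                   [0, sp, cp]])
    Ry = np.array([[cr, 0, sr],        # rotation around the y axis (roll)
                   [0, 1, 0],
                   [-sr, 0, cr]])
    return Rx @ Ry
```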

Btw, I saw your omni3d paper last week, and taking this opportunity I want to say that it is really great :)

@garrickbrazil
Author

I just wanted to make sure that the get_extrinsics function was the logic I should be using, and it seems like it is from your comment! My understanding of the high-level flow for visualizing the model outputs is: intrinsics @ get_extrinsics(angles) @ vertices_3D_with_yaw_applied. I didn't notice the get_extrinsics function until recently; that plus the origin clarification you gave earlier were the missing puzzle pieces. Feel free to close this issue at your convenience. Thanks again for the fast turnaround.
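
In code, my understanding of that flow is roughly the sketch below (my own assumptions about shapes and about which direction the predicted extrinsics map; project_corners is just a name I made up):

```python
import numpy as np

def project_corners(K, R_extr, corners_world):
    """Project 8 yaw-rotated box corners into pixels: intrinsics @ extrinsics @ vertices.

    K:             3x3 intrinsic matrix
    R_extr:        3x3 rotation from the predicted angles (whether it maps
                   world -> camera or the inverse is something I'd double-check)
    corners_world: (8, 3) box corners in meters
    """
    corners_cam = corners_world @ R_extr.T   # apply the room tilt
    uvw = corners_cam @ K.T                  # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]          # (8, 2) pixel coordinates
```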

> Btw, I saw your omni3d paper last week, and taking this opportunity I want to say that it is really great :)

Thank you! I would like to train ImVoxelNet on the Omni3D dataset in the future if possible. It's a really impressive baseline. I attached a few quick COCO examples from this model (we are not using these anywhere, I am just curious what happens with no additional training). In my opinion, given that this model is only trained on SUN RGB-D and we are merely guessing at the intrinsics, these images do pretty well. I'm using a pretty low threshold for visualizing and set origin=[0, 3, -1] as suggested. I'm very curious what the generalization power would be when it's trained on 234k images.

[Attached images: 4634546881_8203dd8f94_z, 3480322600_bc542ae19b_z, 4586421859_517c65c02b_z]
