Question about the Voxel Features #77

taylover-pei · 2021-09-18T04:13:49Z

Congratulations on your great work!

I have read your paper and have several questions that bother me:

As I understand it, your method works as follows:

First, the voxel grid is generated.
Second, the grid_to_lidar, lidar_to_cam, and cam_to_img transformations are used to find the correspondence between grid coordinates and image coordinates.
Third, grid_sample is used to sample features from the frustum into the voxel grid.
Finally, the voxel features are collapsed to BEV features.

In my opinion, the BEV features represent world coordinates. So my question is: why not use the BEV features to generate a 'BEV grid' that directly represents real-world (lidar) coordinates, so that the grid_to_lidar step can be omitted? Am I right?

I am also still confused about the 'Voxel Features'. What are they used for?

Thank you very much, looking forward to your reply!

codyreading · 2021-09-20T13:57:54Z

Hi and thanks for the interest!

To answer your second question first, voxel_features refers to the 3D voxel feature grid, which is referred to in the paper as V. We generate this as an intermediate 3D representation before collapsing it to a BEV feature grid, bev_features.

For both voxel_features and bev_features, the coordinates aren't real-world coordinates but rather what I refer to as grid coordinates, where each coordinate is a grid cell index. That is, coordinates range from 0 to R, where R is the number of cells along a given axis. Real-world coordinates are in metres, over the range shown here. You need the grid_to_lidar transformation to convert from grid indices to real-world coordinates in metres.
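To make the distinction concrete, here is a minimal sketch of a grid-to-metric mapping. It is not the repository's code: the function name mirrors grid_to_lidar only for readability, and the point cloud minimum and voxel size are assumed, illustrative values. The idea is just a per-axis scale by the voxel size plus an offset by the minimum of the range.

```python
# Hypothetical helper, not the repository's implementation.
def grid_to_lidar(index, pc_min, voxel_size):
    """Map a (possibly fractional) grid index to metric LiDAR coordinates."""
    return tuple(lo + i * s for i, lo, s in zip(index, pc_min, voxel_size))


# Assumed, illustrative values (metres): range minimum and cell size per axis.
PC_MIN = (2.0, -30.08, -3.0)
VOXEL_SIZE = (0.16, 0.16, 0.16)

# Grid index (0, 0, 0) is the corner of the first cell.
corner = grid_to_lidar((0, 0, 0), PC_MIN, VOXEL_SIZE)

# Adding 0.5 to each index gives the centre of that cell.
center = grid_to_lidar((0.5, 0.5, 0.5), PC_MIN, VOXEL_SIZE)
```

Without a mapping like this, a grid index says nothing about where the cell sits in the scene, so it cannot be projected into the camera.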

taylover-pei · 2021-09-22T06:27:05Z

Thanks for your reply!

I have another question:

Is it possible to directly transform the Frustum Features to BEV features without using the Voxel Features?

Thank you very much, looking forward to your reply!

codyreading · 2021-09-22T13:37:47Z

Yes, it would be possible if you use the same strategy as PointPillars. Essentially, you construct your voxel grid such that it has only one height layer (voxel_size_z = 4 for KITTI). This makes voxel_features equivalent to bev_features, so you can use it directly in the 3D object detection stage. An issue I foresee with this is that you have only one sampling point for each "pillar" (the center of the pillar in CaDDN), whereas the pillar feature should include information from all points within the pillar. This is why we construct the voxel grid first and then collapse it to BEV, so that the BEV feature includes information from all points within the pillar.
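The difference between the two cases can be sketched with NumPy. This is an illustration under stated assumptions, not the repository's code: collapse_to_bev is a hypothetical helper, the collapse is assumed to fold the height axis into the channel axis, and any channel reduction that would follow is omitted. The tensor sizes are made up.

```python
import numpy as np


# Hypothetical helper: assumes the collapse concatenates height layers
# along the channel axis, (C, Z, Y, X) -> (C * Z, Y, X).
def collapse_to_bev(voxel_features):
    c, z, y, x = voxel_features.shape
    return voxel_features.reshape(c * z, y, x)


rng = np.random.default_rng(0)

# Several height layers: each BEV cell keeps features from every
# sampling point stacked along the pillar.
multi = rng.standard_normal((8, 25, 4, 5))
bev_multi = collapse_to_bev(multi)

# One height layer: the collapse changes nothing but the shape, so the
# voxel grid already is the BEV grid, with one sampling point per pillar.
single = rng.standard_normal((8, 1, 4, 5))
bev_single = collapse_to_bev(single)
```

In the single-layer case each BEV cell carries the feature of exactly one sampled location, which is the limitation described above.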

taylover-pei · 2021-09-23T06:44:23Z

Thank you very much. I have got it! It really helps me a lot.

taylover-pei closed this as completed Sep 23, 2021
