# Pseudo-LiDAR

# Overview
An approach to creating LiDAR-like point clouds from a single input image, without stereo images or a depth camera (which makes it applicable to camera-based systems such as Tesla's).

## Important Pipeline:

1. Get the image.
2. Predict the depth of each pixel.
3. Cast out the pixels into 3D (simulating LiDAR).
4. Get the predictions and project them onto the cameras.

The depth can be learned with self-supervised techniques:
- Predict a depth for each pixel.
- Once you have a 3D image, cast out the pixels.
- Re-project them into different cameras, or into future frames of the same camera.
- Compare the re-projected pixels with the actual image; the difference is the photometric loss.
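
The re-projection and photometric loss steps above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the training code of any particular method: the names `reproject_pixels` and `photometric_l1`, the 4x4 pose matrix `T_src_from_tgt`, and the use of nearest-neighbour sampling with an L1 difference are all illustrative choices. Real self-supervised pipelines typically use differentiable bilinear sampling inside an autodiff framework.

```python
import numpy as np

def reproject_pixels(depth, K, T_src_from_tgt):
    """Map every target pixel into a source camera using the predicted depth.

    depth:          (H, W) positive depth per target pixel.
    K:              (3, 3) intrinsics, assumed shared by both cameras.
    T_src_from_tgt: (4, 4) rigid transform from target to source camera frame.
    Returns (H, W, 2) pixel coordinates (u, v) in the source image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Cast out: back-project each pixel to a 3D point in the target camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    # Move the points into the source camera frame.
    R, t = T_src_from_tgt[:3, :3], T_src_from_tgt[:3, 3:4]
    pts_src = R @ pts + t
    # Project back to pixel coordinates in the source image.
    proj = K @ pts_src
    return (proj[:2] / proj[2:3]).T.reshape(h, w, 2)

def photometric_l1(target, source, uv):
    """Mean absolute intensity difference between target and warped source."""
    h, w = target.shape[:2]
    u = np.rint(uv[..., 0]).astype(int)
    v = np.rint(uv[..., 1]).astype(int)
    # Only compare pixels that land inside the source image.
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped = source[v[valid], u[valid]]
    return float(np.abs(target[valid] - warped).mean())
```

With an identity pose, every pixel maps back onto itself, so comparing an image with itself gives a loss of zero. A wrong depth or pose moves the re-projected pixels and raises the loss, which is what provides the training signal.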


## First Step:
- Perform monocular depth estimation and generate pseudo-LiDAR for the entire scene by **lifting every pixel** within the image into its 3D coordinate.
- Then train a LiDAR-based 3D detection network on the pseudo-LiDAR.
- With **Frustum PointNets** as the LiDAR-based 3D detector, we detect 2D object proposals in the input image and extract a **point cloud frustum** from the pseudo-LiDAR for each 2D proposal. An oriented 3D bounding box is then detected for each frustum.
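
Extracting a point cloud frustum for one 2D proposal can be sketched as below. This is a minimal sketch, assuming the points are already in the camera frame (z pointing forward) and that the proposal is an axis-aligned box in pixels; `extract_frustum` is a hypothetical helper name, not a function from the Frustum PointNets code.

```python
import numpy as np

def extract_frustum(points, K, box2d):
    """Keep the 3D points whose image projection falls inside a 2D box.

    points: (N, 3) array in the camera frame (z is the viewing direction).
    K:      (3, 3) camera intrinsic matrix.
    box2d:  (u_min, v_min, u_max, v_max) in pixels.
    """
    u_min, v_min, u_max, v_max = box2d
    in_front = points[:, 2] > 0                   # drop points behind the camera
    proj = (K @ points.T).T                       # (N, 3) homogeneous pixels
    z = np.where(in_front, proj[:, 2], 1.0)       # avoid dividing by zero or negatives
    u, v = proj[:, 0] / z, proj[:, 1] / z
    inside = (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return points[in_front & inside]
```

Each 2D proposal yields its own frustum, so the function would be called once per detected box.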

## Problems:
1. Depth estimation from a monocular image is inaccurate and leads to local misalignment of the point cloud, especially for objects that are far away.
2. The extracted point cloud always has a long tail because it is hard to estimate depth near the edge/periphery of an object. As a result, there are always extra points that appear to belong to the object when they actually do not.

## Solutions:
1. To solve local misalignment, we use a 2D-3D bounding box consistency constraint: when the 3D bounding box is projected onto the image, it should overlap with the 2D detected proposal. During training, we formulate the constraint as a bounding box consistency loss (BBCL) to supervise learning.
    - During testing, a bounding box consistency optimization (BBCO) is solved subject to this constraint using a global optimization method to further improve the prediction results.

2. To deal with the long tail of points proposed as belonging to the object, we propose using mask segmentation instead of 2D bounding boxes around the object, because a mask defines the object pixel by pixel.


# Other Approaches
Models that use a 2D-3D bounding box consistency constraint are also used to predict 3D bounding boxes with 2D processing only. For example, one proposal is to use 2D CNNs to predict a subset of the box parameters, such as the object orientation and size. During testing, these estimates are combined with the constraint to compute the remaining parameters, such as the object center location.

# Pseudo-LiDAR Approach:

Goal: Use one RGB image to estimate the 3D bounding box of objects.

Parameters for the 3D bounding box (total 7):
- Object center: (x, y, z)
- Object size: (h, w, l)
- Heading angle: (theta)



![Pseudo LiDAR](PseudoLiDAR.png)

## Approach
The input image is passed into two modules simultaneously:
- Pseudo-LiDAR generator
- 2D instance mask proposal detection (a proposal loss is used to train this part of the net)

The outputs of both are combined and passed into Frustum PointNet, which performs 3D point cloud segmentation and is trained with a 3D segmentation loss. The segmented point cloud is then passed into the 3D box estimation module and the 3D box correction module.

The 3D box estimation module outputs the 7 box parameters. These are added to the 7 residuals output by the correction module to produce the final estimate, which we then project onto the image.

# Monocular Depth Estimation:

The DORN network comes with pre-trained weights and is used to estimate depth from a single RGB image. We do not update the weights of the network, so it can be thought of as an offline module.

# Pseudo-LiDAR Generation:
Once we have the depth from the DORN model, we can combine the depth estimate with the camera matrix (intrinsics) to calculate the 3D location of every pixel, and therefore of the object, in the camera frame. We can also transform that location into the world frame because we have the camera extrinsic matrix C = [R | t].
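
The lifting step can be sketched with the standard pinhole camera model. This is a minimal sketch, assuming a (3, 3) intrinsic matrix `K` with focal lengths on the diagonal and the principal point in the last column, and assuming the extrinsics map world to camera coordinates as `X_cam = R @ X_world + t`. The function names are hypothetical, and the extrinsic convention should be checked against the dataset in use.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Lift every pixel (u, v) with depth z to a 3D point in the camera frame.

    depth: (H, W) depth map; K: (3, 3) intrinsics. Returns (H*W, 3) points.
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def camera_to_world(points, R, t):
    """Invert X_cam = R @ X_world + t for an (N, 3) array of camera points."""
    # For row vectors, (X_cam - t) @ R equals R.T @ (X_cam - t) per point.
    return (points - t) @ R
```

The output has one point per pixel, which is why a pseudo-LiDAR point cloud is much denser than a real LiDAR scan.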

## LiDAR Vs. Pseudo-LiDAR
- The point clouds made by the pseudo-LiDAR approach tend to have a long tail, i.e. extra points, because it is hard to estimate depth around the edge of the object.
- For a far-away object, the extracted point cloud frustum might be largely off, i.e. there is a local misalignment with respect to the LiDAR point cloud.

Another factor contributing to the difference between the two approaches is density: the LiDAR point cloud tends to be sparse, whereas the pseudo-LiDAR point cloud tends to be dense.

# Instance Mask Proposal Detection

One way to deal with the long-tail issue is to use a mask for the object instance. That way, we only keep the points in the 3D point cloud whose pixels fall inside the mask and ignore the rest.
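
Because every pseudo-LiDAR point comes from exactly one pixel, the mask filter can be applied before lifting. The sketch below assumes a boolean instance mask with the same height and width as the depth map and a pinhole intrinsic matrix `K`; `masked_point_cloud` is a hypothetical name used only for illustration.

```python
import numpy as np

def masked_point_cloud(depth, K, mask):
    """Back-project only the pixels covered by an instance mask.

    depth: (H, W) depth map; K: (3, 3) intrinsics; mask: (H, W) boolean array.
    Returns an (M, 3) array, one point per mask pixel, in the camera frame.
    """
    v, u = np.nonzero(mask)               # row (v) and column (u) indices
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)
```

Compared with a 2D box, the mask excludes background pixels in the box corners, which is where many of the long-tail points come from.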

# Modeling:

After getting the point cloud and the 2D masks, we can extract a set of point cloud frustums, which are used to train a two-stage LiDAR-based 3D detection algorithm for 3D bounding box prediction. The paper used Frustum PointNets (to be looked at in detail). Essentially, after segmenting the point cloud frustum, we sample a fixed number of points and use that small set of points to estimate the center, size and heading angle.
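
Sampling a fixed number of points is needed because the network expects inputs of a constant size, while each segmented frustum contains a different number of points. A minimal sketch, assuming uniform random sampling and falling back to sampling with replacement when the frustum is too small (the exact sampling strategy used in the paper is not covered in these notes):

```python
import numpy as np

def sample_fixed_size(points, num_points, rng=None):
    """Return exactly `num_points` rows sampled from an (N, 3) point array."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    if n == 0:
        raise ValueError("cannot sample from an empty point cloud")
    # Sample without replacement when possible, otherwise repeat points.
    idx = rng.choice(n, size=num_points, replace=n < num_points)
    return points[idx]
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the sampling reproducible.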




## 2D-3D Bounding Box Consistency Optimization (BBCO)

To refine the bounding box estimate, we use geometry: if the 3D output of the model suffers from local misalignment, its projection onto the image will also be poorly aligned with the 2D proposal.
To compare the two, we first convert the 7 parameters (x, y, z, height, width, length and theta) into the 8 corners of the box. Then we project those corners onto the image. Finally, we calculate the minimum bounding rectangle, i.e. the smallest axis-aligned 2D bounding box that encloses the projected 2D point set. We obtain the 2D box of the mask in the same way.
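
The conversion from 7 parameters to a 2D box can be sketched as follows. The geometric conventions here are assumptions made for illustration: camera coordinates with y pointing down and z forward, (x, y, z) as the geometric center of the box, length along the x axis when theta is zero, and theta as a rotation about the y axis. Datasets differ on these details (KITTI, for instance, places the location at the bottom face of the box), so the corner layout should be adapted accordingly.

```python
import numpy as np

def box3d_corners(x, y, z, h, w, l, theta):
    """Return the (8, 3) corners of a 3D box from its 7 parameters."""
    dx, dy, dz = l / 2, h / 2, w / 2
    corners = np.array([[sx * dx, sy * dy, sz * dz]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # rotation about y
    return corners @ R.T + np.array([x, y, z])

def project_to_box2d(corners, K):
    """Project corners and return their minimum bounding rectangle.

    Assumes all corners lie in front of the camera (positive z).
    Returns (u_min, v_min, u_max, v_max).
    """
    proj = (K @ corners.T).T
    uv = proj[:, :2] / proj[:, 2:3]
    u_min, v_min = uv.min(axis=0)
    u_max, v_max = uv.max(axis=0)
    return np.array([u_min, v_min, u_max, v_max])
```

The resulting rectangle can be compared directly with the 2D proposal, since both are axis-aligned boxes in pixel coordinates.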

## Bounding Box Consistency Loss: 
Box correction module: it takes the segmented point cloud and the features extracted by the 3D box estimation module as input, and outputs a correction to the 3D bounding box parameters (i.e. residuals). The final estimate is then computed as the sum of the initial estimate and the residual. Since this approach is differentiable, the model can be trained end-to-end with BBCL.
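
The residual update and a consistency loss can be sketched as below. The exact form of BBCL is not given in these notes, so the smooth L1 penalty on the four box coordinates is an assumption, and `final_estimate` and `box_consistency_loss` are hypothetical names. NumPy is used only to show the arithmetic; for end-to-end training the same expressions would be written in an autodiff framework.

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def final_estimate(initial, residual):
    """Add the correction module's residuals to the initial 7 parameters."""
    return np.asarray(initial, dtype=float) + np.asarray(residual, dtype=float)

def box_consistency_loss(projected_box2d, proposal_box2d):
    """Assumed loss: mean smooth L1 between two (u_min, v_min, u_max, v_max) boxes."""
    diff = (np.asarray(projected_box2d, dtype=float)
            - np.asarray(proposal_box2d, dtype=float))
    return float(smooth_l1(diff).mean())
```

Here `projected_box2d` would be the minimum bounding rectangle of the projected final 3D box and `proposal_box2d` the 2D box of the instance mask.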

### Post Processing:
Bounding Box Consistency Optimization: using a global search optimization method, we can refine the final estimate under the BBC constraint as a post-processing step.
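
These notes do not record which global search method the paper uses, so the sketch below substitutes a plain random search as a stand-in: it perturbs the 7 parameters at random and keeps a candidate only when the cost decreases. The function name, the step size and the iteration count are illustrative assumptions, and `cost_fn` is expected to measure the disagreement between the projected 3D box and the 2D proposal.

```python
import numpy as np

def refine_by_random_search(params, cost_fn, step=0.1, iters=200, seed=0):
    """Refine box parameters by keeping random perturbations that lower the cost.

    params:  initial parameter vector, e.g. (x, y, z, h, w, l, theta).
    cost_fn: callable mapping a parameter vector to a scalar cost.
    Returns the best parameters found and their cost.
    """
    rng = np.random.default_rng(seed)
    best = np.asarray(params, dtype=float)
    best_cost = cost_fn(best)
    for _ in range(iters):
        candidate = best + rng.normal(scale=step, size=best.shape)
        cost = cost_fn(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost
```

Because a candidate is only accepted when it improves the cost, the refined estimate is never worse than the network output under the chosen cost.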






# PointNets
PointNet takes a point cloud as input and outputs either a class label for the entire input or a segment/part label for each point of the input. Each point is processed independently and is represented by just its three coordinates (x, y, z).

## Use of Max Pooling
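Max pooling is applied across all points, separately for each feature dimension, to aggregate the per-point features into a single global feature. The points that supply the maximum values are effectively the most important points.
The final FC layers use the max-pooled global feature to perform the classification; for segmentation, the global feature is combined with the per-point features.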
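
The effect of max pooling can be shown with a single shared layer. This is a minimal sketch with random, untrained weights, not the PointNet architecture itself: `W` and `b` are hypothetical parameters of one layer that is applied identically to every point.

```python
import numpy as np

def global_feature(points, W, b):
    """Shared per-point layer followed by max pooling over the points.

    points: (N, 3) coordinates; W: (3, F) weights; b: (F,) bias.
    Returns an (F,) global feature that does not depend on the point order.
    """
    per_point = np.maximum(points @ W + b, 0.0)   # same layer for every point, ReLU
    return per_point.max(axis=0)                  # max over points, per feature

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))
W, b = rng.normal(size=(3, 16)), rng.normal(size=16)
shuffled = pts[rng.permutation(len(pts))]
# Reordering the points leaves the global feature unchanged.
print(np.allclose(global_feature(pts, W, b), global_feature(shuffled, W, b)))
```

Max is a symmetric function, which is what lets the network consume an unordered set of points.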
