I will begin by summarizing the following paper: [An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds](https://arxiv.org/abs/2007.12392). I will also try to inject my own thoughts about what certain details in this paper might mean for my own project. I will then focus on a few key aspects of this paper that will be vital in moving forward with my own project independent of its exact goals.

# Summary

Light Detection and Ranging (LiDAR) is a type of three-dimensional sensor that has become increasingly common in recent years. The authors point out that this type of sensor is a boon for autonomous and robotic systems. The computer vision goal addressed here is 3D object detection in a scene, or frame, given that most LiDAR sensors run continuously. Noting that most methods for this task use only a single frame as input, the authors propose a recurrent network that incorporates past frames into the detection process. This is further motivated by the recent availability of large multi-frame datasets. Namely, the nuScenes and Waymo Open datasets each contain a thousand sequences of LiDAR data; the latter of the two is used in the paper. I will note that, for my project, it may be a good idea to investigate the merits of both datasets.

There has been a host of work in the field of 3D object detection using deep neural networks. Some approach this task with 2D image processing techniques, while others work directly on the 3D data. Notable for this work are [VoxelNet](https://arxiv.org/abs/1711.06396) and [SparseConv](https://arxiv.org/abs/1706.01307). The former voxelizes 3D point cloud data to turn it into a more usable regular grid. The latter restricts 3D convolutions to regions of activity, significantly increasing efficiency because 3D point cloud data is highly sparse. A few of the previous works mentioned try to include the temporal dimension, but the authors note that their approach is novel and distinct in its recurrent nature.

The model architecture the authors propose consists of three main components: feature extractor, recurrent module, and object identifier. In the feature extractor a frame of 3D point cloud data is voxelized and then passed to a sparse CNN. The original 3D data is in the form of many xyz points, and the voxelization is sparse. The sparse CNN is a U-Net style network that uses sparse 3D convolutions and 3D max pooling. As in the 2D version, there are skip connections between the downsampling and upsampling sides of the network. Before moving to the recurrent module, the extracted features are de-voxelized back into point cloud form. The authors cite [DOPS](https://arxiv.org/abs/2004.01170) as their reference for the feature extraction.

Once the output of this initial section has been obtained (which we note is of the same shape as the input), it is passed to an LSTM module for fusion with the stored state. This module is mostly the same as a standard LSTM, but with two key differences. First, the fully connected layer inside the module is replaced with another (smaller) sparse U-Net style CNN. This allows the features from the input and previous timesteps to be easily fused and processed. Second, the full hidden state and cell memory aren't kept intact, but instead are sampled for the points with high "semantic scores". The authors note that this is "obtained from the pre-trained single frame detection model", but that seems to be an unsatisfactory description. Furthermore, the stored state is transformed from the coordinate frame of the last input to the coordinate frame of the current input. The authors refer to this as an "Ego Motion Transformation". The LSTM input, hidden state, and cell memory are concatenated along the feature dimension and then jointly voxelized to account for the spatial changes due to the motion of the scene. After passing through the internals of the LSTM module the stored state (but not the output) is then de-voxelized. I note here that this seems to be a lot of back and forth between the voxelized form and the point cloud form. It doesn't seem to hold the researchers back from achieving fast inference, but I wonder if there is a simpler way to deal with this. One final process is used on the stored state to further account for the scene motion.
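To make the state sampling step concrete for myself, here is a minimal NumPy sketch of keeping only the highest-scoring points together with their hidden state and cell memory. The function name `sample_state`, the `keep` parameter, and the use of a simple top-k selection are my own assumptions; the paper may well use a threshold or some other rule.

```python
import numpy as np

def sample_state(points, hidden, cell, scores, keep):
    """Keep the `keep` points with the highest semantic scores, with their LSTM state.

    Hypothetical helper: top-k selection is an assumption, not the paper's exact rule.
    """
    # Indices of the points sorted from highest to lowest score, truncated to `keep`.
    top = np.argsort(scores)[::-1][:keep]
    return points[top], hidden[top], cell[top]

rng = np.random.default_rng(0)
points = rng.normal(size=(4, 3))   # xyz location of each point
hidden = rng.normal(size=(4, 8))   # per-point hidden state
cell = rng.normal(size=(4, 8))     # per-point cell memory
scores = np.array([0.1, 0.9, 0.5, 0.7])

kept_points, kept_hidden, kept_cell = sample_state(points, hidden, cell, scores, keep=2)
```

With these toy scores the second and fourth points are the ones carried forward to the next timestep, which is how the stored state stays small.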

The last step is to perform the actual object detection on the feature data from the LSTM module. Within this there are actually three sub-processes. First, per-voxel bounding boxes are generated using three layers of sparse convolutions for each of center, rotation, height, length, and width. These are then de-voxelized, with each point inside a voxel assuming that voxel's bounding box. Second, a graph is constructed from groups of points that share similar predicted object centers, and a graph-based convolution is performed. Third, at inference time, non-maximum suppression (NMS) is applied to obtain the final 3D object detection results.
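As a reminder of how non-maximum suppression works, the sketch below implements the greedy version in NumPy. To keep it short it uses axis-aligned 2D boxes (a bird's-eye view) rather than the rotated 3D boxes in the paper, and the 0.5 threshold is an arbitrary value I picked for illustration.

```python
import numpy as np

def iou_2d(box, boxes):
    """IoU between one axis-aligned box and many, each as (x_min, y_min, x_max, y_max)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so non-overlapping boxes get an intersection of 0.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_2d(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0.0, 0.0, 2.0, 2.0],
                  [0.1, 0.0, 2.1, 2.0],   # near-duplicate of the first box
                  [5.0, 5.0, 7.0, 7.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Here the near-duplicate box is suppressed and the first and third boxes survive.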

During training the network is initially trained on single-frame input. The loss for the output is a hybrid of regression and classification. Given that each bounding box is actually described with five different parameters, the authors opt for an "integrated box corner loss" (on the regression side) which can update all the attributes at once. For the classification loss the authors use a 70% Intersection over Union (IoU) threshold with the ground truth: predictions above it count as positive and the rest as negative. The integrated box corner loss uses all of the attributes described above to compute the eight predicted corner locations. A per-point regression loss is then applied to the corners, which automatically propagates back to the original attributes.
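The corner loss idea can be sketched in a few lines of NumPy. I am assuming here that the rotation is a single yaw angle about the z axis and that the per-corner regression loss is a Huber loss; both are simplifications of mine rather than details confirmed above. In a framework with automatic differentiation, the gradient of this one loss would flow back to center, rotation, and size together.

```python
import numpy as np

def box_corners(center, yaw, length, width, height):
    """Return the (8, 3) corners of a box rotated by `yaw` about the z axis."""
    # All sign combinations (+/-1, +/-1, +/-1) give the eight corners of a unit cube.
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    local = signs * (np.array([length, width, height]) / 2.0)
    c, s = np.cos(yaw), np.sin(yaw)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rotation.T + np.asarray(center)

def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear for large residuals."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * abs_r**2, delta * (abs_r - 0.5 * delta))

def box_corner_loss(pred, target):
    """Mean Huber loss over the 24 corner coordinates of two boxes."""
    return huber(box_corners(**pred) - box_corners(**target)).mean()

target = dict(center=[0.0, 0.0, 0.0], yaw=0.0, length=4.0, width=2.0, height=1.5)
pred = dict(center=[0.5, 0.0, 0.0], yaw=0.0, length=4.0, width=2.0, height=1.5)
loss = box_corner_loss(pred, target)
```

In this toy case only the center is off, by 0.5 m along x, so every corner has the same small residual in its x coordinate and zero residual elsewhere.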

In testing, four LiDAR frames are used as input, and the performance metric is mean average precision (mAP). The authors find that their model performs very well. Specifically, they achieve 63.6% mAP, which is a 7.5% increase over a similar network using one frame without the LSTM and a 1.2% increase over a network that uses four-frame concatenation and no LSTM. Qualitatively, the network produces better bounding boxes and has fewer false positives. The authors also show that adding more frames to the input increases the accuracy of the model, with a single-frame version still achieving 58.7% mAP. Beyond dataset performance the authors succeed in two other important areas: memory and computational efficiency. By reducing the number of points being operated on, and because the LSTM adds relatively little computational cost, the model is smaller and can run inference in 19 ms, well under the 100 ms between frames of a 10 Hz LiDAR.

This paper was well written, easy to follow, and concise in its contributions and achievements. However, there were two important aspects that I believe could have been explained further. First, the authors mention that they sub-sampled the LiDAR point cloud data but are quite vague and inconsistent on this. They claim that the backbone (the input U-Net) features are sub-sampled, but earlier they had claimed that only the LSTM stored state is sub-sampled from the previous computation. Second, the authors discuss using multiple frames as input, but imply that their method does not use concatenation. I am confused, then, about how four frames are used more efficiently. Perhaps they are using four single-frame passes to detect objects in one future scene, but that seems to defeat the point of using a recurrent layer. More investigation is needed for me to fully understand these two aspects, but I feel that my learning in class and on the reports has prepared me well and allowed me to understand the core of this paper. There are a few topics that I am uncertain about due to a lack of exposure, which will be discussed in the Key Components section below.

In terms of how this paper informs my own project I see a number of areas that could be improved or explored:
- **Object Detection Head:** I see plenty of opportunity for exploration in the last portion of the network. The method they are using seems (at least to me currently) somewhat specific, and I wonder if more thought can be put into a deliberate approach. I have also read about some problems with NMS, so there might be other options there.


- **Voxelization:** Within the network there is a decent amount of voxelization (point cloud to voxel grid) and de-voxelization (vice versa). I wonder if there is a way to maintain the voxel encoding throughout and only convert back to the point cloud at the very end.


- **Recurrent Layer:** The authors have chosen a standard LSTM (with the sparse CNN change) as their recurrent layer. Perhaps a GRU would allow for quicker training and a smaller model, but maybe it would cost too much accuracy. There might be some other variant that deserves investigation, and I think this is a good place for alteration.


- **Backbone Network:** A sparse U-Net style network is being used as the backbone for processing the 3D data. We have discussed other such networks in class that may perform better, such as DeepLabv3+.


- **Scene Flow:** Somewhat independent of the model specifics, predicting scene flow (i.e. the movement of objects) is also desirable and is mentioned by the authors as future work. Furthermore, this may allow for better computation and transformation of the recurrent state.


- **Model Reduction:** As it stands the model is quite complicated, with a fair number of layers and technical components. I believe some of this could be pared down either with existing reduction techniques (e.g. pruning) or by changing the model architecture. The recurrent layer changes and voxelization discussed above also fit into this idea.

# Key Components

Each of the following topics is a key aspect of building the model described in the paper and achieving their results.

## Waymo Open Dataset

This dataset from the Waymo self-driving company has 1000 diverse sequences. There are about 200 frames in each sequence, captured at 10 Hz (one frame every 100 ms). In addition to LiDAR data there is also 2D camera data. Within each LiDAR scene there are labels and bounding boxes for vehicles, pedestrians, cyclists, and signs. Each scene also has a global coordinate frame and a vehicle coordinate frame.

The entire dataset is 2TB which is ... scary to say the least. It is broken into 32 training chunks and 8 validation chunks of about 25GB. I have downloaded one of the chunks (which can be done [here](https://waymo.com/open/download/)), and then placed one of the files in the *waymo-single* folder.

Below we have loaded the data and examined some of its details. We note that we did not visualize the 3D point cloud as Waymo does not have a publicly available system for this, and I did not have time to figure something else out.

In [1]:
import os
import tensorflow.compat.v1 as tf
import math
import numpy as np
import itertools

tf.enable_eager_execution()  # needed on TensorFlow 1.x so the records can be iterated eagerly below

from waymo_open_dataset.utils import range_image_utils
from waymo_open_dataset.utils import transform_utils
from waymo_open_dataset.utils import frame_utils
from waymo_open_dataset import dataset_pb2 as open_dataset

Waymo provides a GitHub repository that contains a number of useful tools for working with their data and using it in deep learning. This package has been imported above.

In [4]:
path = 'waymo-single/segment-1005081002024129653_5313_150_5333_150_with_camera_labels.tfrecord'

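dataset = tf.data.TFRecordDataset(path, compression_type='')
for data in dataset:  # each record is one serialized Frame
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))  # after the loop, `frame` holds the last frame of the segment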

In [5]:
(range_images, camera_projections,
 range_image_top_pose) = frame_utils.parse_range_image_and_camera_projection(
    frame)

In [6]:
print(frame.context)

name: "1005081002024129653_5313_150_5333_150"
camera_calibrations {
  name: FRONT
  intrinsic: 2083.091212133254
  intrinsic: 2083.091212133254
  intrinsic: 957.2938286685071
  intrinsic: 650.5697927719348
  intrinsic: 0.04067236637270731
  intrinsic: -0.3374271466716414
  intrinsic: 0.0016273829099200004
  intrinsic: -0.0007879327563938157
  intrinsic: 0.0
  extrinsic {
    transform: 0.9999151800844592
    transform: -0.008280529275085654
    transform: -0.010053132426658727
    transform: 1.5444145042510942
    transform: 0.008380895965622895
    transform: 0.9999150476776223
    transform: 0.009982885888937929
    transform: -0.022877347388980857
    transform: 0.009969614810858722
    transform: -0.010066293398396434
    transform: 0.9998996332221252
    transform: 2.115953541712884
    transform: 0.0
    transform: 0.0
    transform: 0.0
    transform: 1.0
  }
  width: 1920
  height: 1280
  rolling_shutter_direction: RIGHT_TO_LEFT
}
camera_calibrations {
  name: FRONT_LEFT
  intr

Using the *.context* attribute we have printed a range of data associated with each frame. Much of it consists of sensor calibration information, but there is also information on the likes of weather conditions and object detections.

Unfortunately, I was not able to write up my investigations into the remaining topics as much as I would have liked due to time constraints, but I have tried to give them a quick reflection.

## Voxelization

Voxels are essentially the 3D equivalent of pixels, and the process of voxelization takes a 3D point cloud and constructs a regular grid of voxels that may or may not contain points from the cloud. It appears there are some third party tools to complete this process, and at the very least there is research on this technique specifically for deep learning.
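A minimal NumPy sketch of sparse voxelization and de-voxelization follows. The function names, the choice to average the points inside each voxel, and the 0.5 m voxel size are my own assumptions for illustration; real pipelines typically attach learned features rather than mean coordinates.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Group xyz points into a sparse voxel grid.

    Returns the integer coordinates of the occupied voxels, the mean of the
    points inside each occupied voxel, and, for every point, the index of the
    voxel it fell into.
    """
    # Integer voxel coordinate of each point (floor handles negative values).
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Unique occupied voxels; `inverse` maps each point to its voxel row.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    # Average the points that share a voxel.
    sums = np.zeros((len(voxels), points.shape[1]))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, sums / counts[:, None], inverse

def devoxelize(voxel_features, inverse):
    """Copy each voxel's feature back to every point inside that voxel."""
    return voxel_features[inverse]

points = np.array([[0.1, 0.2, 0.0],
                   [0.3, 0.1, 0.1],
                   [1.2, 0.0, 0.0],
                   [-0.4, 0.0, 0.0]])
voxels, voxel_means, inverse = voxelize(points, voxel_size=0.5)
point_features = devoxelize(voxel_means, inverse)
```

Only the occupied voxels are stored, which is what makes the representation sparse: the four points above land in three voxels, and empty cells of the grid never appear at all.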

## 3D Sparse Convolutions

Given something like a voxel grid where many of the voxels are empty, and thus the grid is very sparse, it is very inefficient to perform regular 3D convolutions. Sparse 3D convolutions instead only examine active sites and are very efficient when dealing with sparse data. A PyTorch implementation from Facebook Research (SparseConvNet) is available.
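To see why skipping empty voxels saves work, here is a deliberately naive sketch of a 3x3x3 sparse convolution that stores only the active sites in a dictionary and produces outputs only at those sites. This is my own toy illustration of the idea, not the API or the implementation of any sparse convolution library.

```python
import itertools
import numpy as np

def sparse_conv3d(active, weights):
    """Naive 3x3x3 sparse convolution evaluated only at active sites.

    active:  dict mapping (x, y, z) voxel coordinates to feature vectors of length c_in
    weights: array of shape (3, 3, 3, c_in, c_out)
    """
    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    c_out = weights.shape[-1]
    out = {}
    for (x, y, z) in active:
        acc = np.zeros(c_out)
        for dx, dy, dz in offsets:
            neighbour = active.get((x + dx, y + dy, z + dz))
            if neighbour is not None:  # empty voxels contribute nothing
                acc += neighbour @ weights[dx + 1, dy + 1, dz + 1]
        out[(x, y, z)] = acc
    return out

active = {
    (0, 0, 0): np.array([1.0]),
    (1, 0, 0): np.array([2.0]),
    (5, 5, 5): np.array([3.0]),   # isolated voxel far from the others
}
weights = np.ones((3, 3, 3, 1, 1))
out = sparse_conv3d(active, weights)
```

The cost scales with the number of active voxels (three here) rather than with the full grid volume, and the set of active sites is unchanged by the layer.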

## Ego Motion

This term is essentially equivalent to odometry, which is the process of determining a change in position over time. Because the LiDAR sensor is moving, its frame of reference moves too, so this needs to be accounted for when using multiple frames. It all boils down to a coordinate transformation.
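The coordinate transformation can be sketched with 4x4 homogeneous matrices. I am assuming each frame comes with a vehicle-to-global pose, and the helper names and the yaw-only rotation are hypothetical simplifications of mine.

```python
import numpy as np

def pose_matrix(yaw, translation):
    """4x4 vehicle-to-global transform for a rotation about z plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def transform_to_current(points_prev, pose_prev, pose_curr):
    """Express points from the previous vehicle frame in the current vehicle frame."""
    # previous vehicle -> global -> current vehicle
    prev_to_curr = np.linalg.inv(pose_curr) @ pose_prev
    homogeneous = np.hstack([points_prev, np.ones((len(points_prev), 1))])
    return (homogeneous @ prev_to_curr.T)[:, :3]

pose_prev = pose_matrix(0.0, [0.0, 0.0, 0.0])
pose_curr = pose_matrix(0.0, [2.0, 0.0, 0.0])   # vehicle drove 2 m forward
static_point = np.array([[10.0, 0.0, 0.0]])     # seen 10 m ahead in the previous frame
moved = transform_to_current(static_point, pose_prev, pose_curr)
```

A static object that was 10 m ahead ends up 8 m ahead after the vehicle advances 2 m, so features stored at that point line up with the new frame's measurements of the same object.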

## Graph Convolutions

Graph convolutions are a type of convolution on arbitrary graph data structures that use the relations between nodes to propagate features. More investigation is needed on my part to fully understand this.
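As a starting point for that investigation, below is a minimal NumPy sketch of one generic graph convolution layer: each node averages its own features with those of its neighbours, then applies a shared linear map and a ReLU. This mean-aggregation form is a common textbook variant that I am using for illustration; it is not necessarily the graph convolution used in the paper.

```python
import numpy as np

def graph_conv(features, edges, weight):
    """One mean-aggregation graph convolution layer with a ReLU.

    features: (n_nodes, c_in) array
    edges:    list of undirected (i, j) node index pairs
    weight:   (c_in, c_out) array shared by all nodes
    """
    n = len(features)
    adjacency = np.eye(n)  # self-loops so each node keeps its own feature
    for i, j in edges:
        adjacency[i, j] = adjacency[j, i] = 1.0
    degree = adjacency.sum(axis=1, keepdims=True)
    aggregated = (adjacency @ features) / degree  # average over the neighbourhood
    return np.maximum(aggregated @ weight, 0.0)

features = np.array([[1.0], [3.0], [10.0]])
edges = [(0, 1)]            # nodes 0 and 1 are connected, node 2 is isolated
out = graph_conv(features, edges, np.eye(1))
```

The two connected nodes end up sharing the average of their features while the isolated node is unchanged, which is the sense in which features propagate along edges.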