# <img src="https://img.icons8.com/bubbles/100/000000/3d-glasses.png" style="height:50px;display:inline"> EE 046746 - Technion - Computer Vision

#### Dahlia Urbach

## Tutorial 09 - Introduction to 3D Deep Learning
---
<img src=".\assets\tut_09_teaser0.gif" style="width:800px">



<a href="https://towardsdatascience.com/the-future-of-3d-point-clouds-a-new-perspective-125b35b558b9">Image source
</a> 

## <img src="https://img.icons8.com/bubbles/50/000000/checklist.png" style="height:50px;display:inline"> Agenda
---
* [Depth Cameras - Quick overview](#)
    * Stereo Cameras - Next Week
    * [Time of Flight](#-Time-of-Flight-Cameras)
* [3D Deep Learning](#-Deep-Learning-on-Point-Clouds)
    * [Voxels](#-Voxalization)
    * [Multi-View](#-Multi-View-Approach)
    * [Point Clouds](#-Apply-Deep-Learning-Directly-on-3D-Point-Clouds)
* [3D Applications](#-3D-Deep-Learning-Applications)
* [Recommended Tools](#-Recommended-Tools)
* [Recommended Videos](#-Recommended-Videos)
* [Credits](#-Credits)
  

<img src=".\assets\tut_09_teaser.JPG" style="width:800px">


In [8]:
# imports for the tutorial
import numpy as np
import matplotlib.pyplot as plt
import time

# pytorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, ConcatDataset
# import torchvision



<a href="https://www.youtube.com/watch?v=R-ZXdAEGbiw&feature=youtu.be">LiDAR point cloud support | Feature Highlight | Unreal Engine
</a> 


### <img src="https://img.icons8.com/office/80/000000/depth.png" style="height:50px;display:inline"> Depth Cameras
---
* Stereo Cameras - Next week
* Time of flight

#### <img src="https://img.icons8.com/android/48/000000/time.png" style="height:50px;display:inline"> Time of Flight Cameras
---
* Light travels at an approximately constant speed of $c \approx 3\times 10^8$ meters per second.
* By measuring the time it takes light to travel over a distance, one can infer that distance.
* Can be categorized into two types:
    1. Direct TOF - switch laser on and off rapidly.
    2. Indirect TOF - send out modulated light, then measure phase difference to infer depth.
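
As a worked example of the timing principle, consider a single light pulse. The pulse travels to the surface and back, so the distance is half the round-trip path. The symbol $\Delta t$ (measured round-trip time) is introduced here for illustration and is not part of the original slides:

$$ d = \frac{c\,\Delta t}{2} $$

For instance, $\Delta t = 10\,\text{ns}$ gives $d = \frac{3\times 10^8 \cdot 10^{-8}}{2} = 1.5\,\text{m}$. Conversely, a depth resolution of $1\,\text{cm}$ needs a timing resolution of about $\frac{2 \cdot 0.01}{3\times 10^8} \approx 67\,\text{ps}$.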

##### 1. Direct - TOF
* **Li**ght **D**etection **A**nd **R**anging (LiDAR) is probably the best-known example in computer vision and robotics.
* High-energy light pulses limit the influence of background illumination.
* However, it is difficult to generate short light pulses with fast rise and fall times.
* High-accuracy time measurement is required.
* Prone to motion blur.
* Measurements become sparser as objects get farther away.
<img src=".\assets\tut_09_LiDAR.GIF" style="width:150px">
<a href="https://en.wikipedia.org/wiki/Lidar">Gif source - Wikipedia</a> 




<img src=".\assets\tut_09_sydney.png" style="width:400px">
<a href="http://www.acfr.usyd.edu.au/papers/SydneyUrbanObjectsDataset.shtml">Sydney Dataset</a> 
 

###### Autonomous Car - LiDAR 
<img src=".\assets\tut_09_cam1.JPG" style="width:400px">



##### SLAM + LiDAR - Zebedee
<img src=".\assets\tut_09_cam2.JPG" style="width:400px">
<a href="https://research.csiro.au/robotics/zebedee/">zebedee</a> 


##### 2. Indirect - TOF
* Uses continuous light waves instead of short light pulses.
* The emitted light is modulated as a sinusoidal wave of known frequency.
* The wave detected after reflection has a shifted phase.
* The phase shift is proportional to the distance from the reflecting surface.
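
The proportionality between phase shift and distance can be sketched numerically. The light covers the distance twice, so a phase shift $\varphi$ at modulation frequency $f$ corresponds to $d = \frac{c\,\varphi}{4\pi f}$. The function names and the 20 MHz modulation frequency below are assumptions chosen for illustration, not values from the tutorial.

```python
import numpy as np

C = 3e8  # approximate speed of light in m/s

def phase_to_distance(phase_shift, mod_freq):
    """Distance (m) for a phase shift (rad) at modulation frequency mod_freq (Hz)."""
    # Round trip of length 2d delays the wave by 2d/c seconds, i.e. 2*pi*f*2d/c radians
    return C * phase_shift / (4 * np.pi * mod_freq)

def unambiguous_range(mod_freq):
    """Largest distance that can be measured before the phase wraps around 2*pi."""
    return C / (2 * mod_freq)

f_mod = 20e6                      # hypothetical modulation frequency (20 MHz)
d = phase_to_distance(np.pi / 2, f_mod)
d_max = unambiguous_range(f_mod)
print(d, d_max)
```

Because the phase is only known modulo $2\pi$, distances beyond `d_max` alias back into the measurable range; a lower modulation frequency extends the range at the cost of depth resolution.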

<img src=".\assets\tut_09_cam3.JPG" style="width:800px">

<img src=".\assets\tut_09_cam4.JPG" style="width:800px">

### <img src="https://img.icons8.com/nolan/64/cloud.png" style="height:50px;display:inline"> Deep Learning on Point Clouds
---
<img src=".\assets\tut_o9_pn1.JPG" style="width:800px">

* Classification
* Semantic segmentation
* Part segmentation
    * Each point belongs to a specific part of the object
* ...


*Qi, Charles R., et al. "Pointnet: Deep learning on point sets for 3d classification and segmentation." CVPR. 2017.*

##### <img src="https://img.icons8.com/bubbles/50/000000/question-mark.png" style="height:50px;display:inline"> Questions
* What are the differences between a 2D image and a point cloud?
* Why might it be hard to feed a point cloud as neural network input?
* What are the benefits of using a point cloud?

#### <img src="https://img.icons8.com/plasticine/100/000000/not-applicable.png" style="height:50px;display:inline"> Point Clouds Problems
---
* Point Clouds Vary in Size (not constant)
* Unordered Input
    * Data is unstructured (no grid)
    * Data is invariant to point ordering (permutations)

#### Other Point Clouds Challenges
---
* Missing data
* Noise 
* Rotations

##### Problem - Point Clouds Vary in Size
---
<img src=".\assets\tut_09_pn2.JPG" style="width:800px">

<img src=".\assets\tut_09_pn3.JPG" style="width:800px">

* Different point clouds represent the same object

#### Problem  - Unordered Input
---
Point cloud: $N$ **orderless** points, each represented by a $D$-dimensional vector
<img src=".\assets\tut_09_pn4.JPG" style="width:800px">
How many equivalent representations does the same point cloud have?

**Model needs to be invariant to $N!$ permutations**

####  <img src="https://img.icons8.com/carbon-copy/100/000000/switch-camera.png" style="height:50px;display:inline">Alternate 3D Representations
---
Solution:
* Convert the raw point clouds into Voxels or multiple 2D RGB(D) images

<img src=".\assets\tut_09_pn5.JPG" style="width:800px">



Another 3D representation (not in this course):
<img src=".\assets\tut_09_other.JPG" style="width:800px">


### <img src="https://img.icons8.com/carbon-copy/100/000000/sugar-cubes.png" style="height:50px;display:inline"> Voxalization
---
Idea: generalize 2D convolutions to regular 3D grids

* The straightforward approach: transform the point clouds into a voxel grid by rasterizing and use 3D CNNs
<img src=".\assets\tut_09_vox1.jpg" style="width:300px">

A voxel grid is a 3D grid of equal-size volumes (voxels). Each voxel can store:
* Binary 0/1 - Is there any point within the voxel?
* Weighted - The number of points located within the voxel

Usually we use binary occupancy
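
A minimal sketch of the rasterization step, assuming the cloud is scaled to fit a cubic grid with a single shared scale factor. The function name `voxelize` and the grid size of 32 are illustrative choices, not part of the tutorial code.

```python
import numpy as np

def voxelize(points, grid_size=32):
    """Rasterize an (N, 3) point cloud into binary and count occupancy grids."""
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    extent = extent if extent > 0 else 1.0      # guard against a degenerate cloud
    # One shared scale for all axes keeps the object's aspect ratio
    idx = np.floor((points - mins) / extent * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)        # points on the far face go to the last voxel
    counts = np.zeros((grid_size,) * 3, dtype=np.int64)
    # np.add.at accumulates correctly when several points fall into the same voxel
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    binary = (counts > 0).astype(np.float32)
    return binary, counts

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(2048, 3))
binary, counts = voxelize(cloud, grid_size=32)
print(binary.shape, int(binary.sum()), int(counts.sum()))
```

The `counts` grid corresponds to the weighted variant and `binary` to the 0/1 variant; either can be fed to a 3D CNN after adding batch and channel dimensions.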

    


3D convolution uses 4D kernels
<img src=".\assets\tut_09_vox2.JPG" style="width:600px">
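
To make the kernel shape concrete, here is a small sketch using PyTorch's `nn.Conv3d`. The channel counts, kernel size, and stride are hypothetical values picked for illustration and are not claimed to match any specific published network.

```python
import torch
import torch.nn as nn

# Each output channel owns one 4D kernel of shape (C_in, k, k, k)
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=5, stride=2)

voxels = torch.zeros(4, 1, 32, 32, 32)   # (batch, channels, depth, height, width)
out = conv3d(voxels)

print(conv3d.weight.shape)  # all kernels stacked: (C_out, C_in, k, k, k)
print(out.shape)
```

The kernel slides along three spatial axes instead of two, which is the only change compared with a 2D convolution over an image.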


<img src=".\assets\tut_09_voxnet.png" style="width:500px">
    


Maturana, Daniel, and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. IROS, 2015.

#### <img src="https://img.icons8.com/plasticine/100/000000/not-applicable.png" style="height:50px;display:inline"> Voxelization Problems
---
* Large memory cost
* Slow processing time
* Limited spatial resolution
* Quantization artifacts
<img src=".\assets\tut_09_vox3.JPG" style="width:600px">

### <img src="https://img.icons8.com/dusk/64/000000/multiple-cameras.png" style="height:50px;display:inline"> Multi-View Approach
---
Idea: Transform the problem into a well-known domain (3D$\rightarrow$2D)

* The multi-view approach: project multiple views to 2D and use a CNN to process them
    * How many views do we need? (Another hyperparameter)
    
<img src=".\assets\tut_09_pn7.png" style="width:600px">

CNN$_1$ - We can use a pre-trained network to extract features, followed by fine-tuned layers
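
The multi-view pipeline can be sketched as a shared per-view CNN, a pooling step across views, and a classifier. This is a toy sketch under stated assumptions: the class name `MultiViewClassifier`, the tiny stand-in backbone, and the choice of 12 views are all hypothetical, and element-wise max is used for pooling across views.

```python
import torch
import torch.nn as nn

class MultiViewClassifier(nn.Module):
    """Toy multi-view model: shared CNN per view, max over views, linear classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Small stand-in for CNN_1; a pre-trained backbone would normally go here
        self.view_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, views):
        # views: (batch, num_views, 3, H, W)
        b, v = views.shape[:2]
        feats = self.view_cnn(views.flatten(0, 1))        # (b*v, 16), same weights for every view
        feats = feats.view(b, v, -1).max(dim=1).values    # element-wise max across views
        return self.classifier(feats)

model = MultiViewClassifier(num_classes=10).eval()
views = torch.rand(2, 12, 3, 64, 64)
with torch.no_grad():
    logits = model(views)
    logits_shuffled = model(views[:, torch.randperm(12)])
print(logits.shape)
```

Because the pooling is a max over the view axis, the prediction does not depend on the order in which the rendered views are supplied.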


H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. ICCV, 2015.

## <img src="https://img.icons8.com/doodle/48/000000/direction-sign.png" style="height:50px;display:inline"> Apply Deep Learning Directly on 3D Point Clouds
---
Idea: Most raw 3D data comes as point clouds, so solve the point cloud problems directly instead of converting the data!

Note: Point Clouds Problems:
* Point Clouds Vary in Size (not constant)
* Unordered Input
    * Data is unstructured (no grid)
    * Data is invariant to point ordering (permutations)


### <img src="https://img.icons8.com/cute-clipart/64/000000/machine-learning.png" style="height:50px;display:inline"> PointNet 
---
#### Permutation Invariance: Symmetric Function
$$ f(x_1,x_2,\dots,x_n) \equiv f(x_{\pi_1},x_{\pi_2},\dots,x_{\pi_n}), x_i \in R^D $$

where $\pi$ is any permutation of the indices $1,\dots,n$

Example:
$$ f(x_1,x_2,\dots,x_n) = \max\{x_1,x_2,\dots,x_n\}$$
$$ f(x_1,x_2,\dots,x_n) = x_1+x_2+\dots+x_n$$


* How can we construct a family of symmetric functions by neural networks?

Observe:
$$ f(x_1,x_2,\dots,x_n) = \gamma \circ g(h(x_1),\dots,h(x_n))$$ 
is symmetric if $g$ is symmetric
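
The observation can be checked numerically. In this sketch, $h$ is a toy per-point function (one random linear layer with ReLU, an assumption made for illustration), and $g$ is either max, sum, or plain concatenation; only the first two are symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))            # weights of a toy per-point function h

def h(points):
    # The same linear map + ReLU is applied to every point independently
    return np.maximum(points @ W, 0.0)  # (N, 16)

points = rng.normal(size=(100, 3))
shuffled = points[rng.permutation(100)]  # same set of points, different order

# g = max and g = sum ignore the ordering of the per-point features
same_max = np.allclose(h(points).max(axis=0), h(shuffled).max(axis=0))
same_sum = np.allclose(h(points).sum(axis=0), h(shuffled).sum(axis=0))
# g = concatenation keeps the ordering, so it is not symmetric
same_concat = np.allclose(h(points).ravel(), h(shuffled).ravel())

print(same_max, same_sum, same_concat)
```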

<img src=".\assets\tut_09_pn8.JPG" style="width:800px">

#### <img src="https://img.icons8.com/wired/64/000000/diversity.png" style="height:50px;display:inline"> Basic PointNet Architecture
---
Empirically, we use a multi-layer perceptron (MLP) for $h$ and max pooling for $g$:

<img src=".\assets\tut_09_pn9.JPG" style="width:500px">

Input MLP:
$$h(x_i): R^{3} \rightarrow{} R^{D}$$
Pooling layer (max over the $N$ points, so the output keeps dimension $D$):
$$g(h(x_1),\dots,h(x_n)): R^{N\times D} \rightarrow{} R^{D}$$
Classification MLP:
$$\gamma: R^{D} \rightarrow{} R^{\text{Num Classes}}$$


The shared MLP implementation "trick":

Treat the feature dimension $D$ as convolution channels $C$ and apply the same weights to every point:
* Use conv layers with kernel size 1: $C_{out}$ filters, each of size $1 \times C_{in}$, followed by an activation layer. Each filter acts like a fully connected layer applied to a single point.
* Input: $R^{N\times {C_{in}}}$
* Output: $R^{N\times {C_{out}}}$




MLP:
$$h: R^{C_{in}}\rightarrow{} R^{C_1}\rightarrow{} \dots \rightarrow{} R^{C_{out}}$$
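
A short sketch to check the shared-MLP "trick": a `nn.Conv1d` with kernel size 1 should give the same result as a `nn.Linear` applied to each point, once both layers hold the same weights. The sizes below are arbitrary illustration values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
c_in, c_out, n = 3, 64, 1024
conv = nn.Conv1d(c_in, c_out, kernel_size=1)
fc = nn.Linear(c_in, c_out)

# Copy the conv weights into the linear layer so both compute the same function
with torch.no_grad():
    fc.weight.copy_(conv.weight.squeeze(-1))   # (c_out, c_in, 1) -> (c_out, c_in)
    fc.bias.copy_(conv.bias)

points = torch.rand(8, n, c_in)                            # (batch, N, C_in)
out_conv = conv(points.transpose(1, 2)).transpose(1, 2)    # Conv1d expects (batch, C_in, N)
out_fc = fc(points)                                        # Linear acts on the last dimension

print(out_conv.shape, torch.allclose(out_conv, out_fc, atol=1e-5))
```

Note that PyTorch's `Conv1d` expects channels first, which is why the implementation below works on tensors of shape `(batch, C, N)`.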

In [7]:
class Tnet(nn.Module):
   def __init__(self, k=3):
      super().__init__()
      self.k=k
      self.conv1 = nn.Conv1d(k,64,1)
      self.conv2 = nn.Conv1d(64,128,1)
      self.conv3 = nn.Conv1d(128,1024,1)
      self.fc1 = nn.Linear(1024,512)
      self.fc2 = nn.Linear(512,256)
      self.fc3 = nn.Linear(256,k*k)

      self.bn1 = nn.BatchNorm1d(64)
      self.bn2 = nn.BatchNorm1d(128)
      self.bn3 = nn.BatchNorm1d(1024)
      self.bn4 = nn.BatchNorm1d(512)
      self.bn5 = nn.BatchNorm1d(256)
       

   def forward(self, input):
      # input.shape == (bs, k, n): channels first, as nn.Conv1d expects
      bs = input.size(0)
      xb = F.relu(self.bn1(self.conv1(input)))
      xb = F.relu(self.bn2(self.conv2(xb)))
      xb = F.relu(self.bn3(self.conv3(xb)))
      # global max pooling over the n points
      pool = nn.MaxPool1d(xb.size(-1))(xb)
      flat = nn.Flatten(1)(pool)
      xb = F.relu(self.bn4(self.fc1(flat)))
      xb = F.relu(self.bn5(self.fc2(xb)))
      
      # add the identity so the predicted transform starts close to identity;
      # create it directly on the same device as the activations
      init = torch.eye(self.k, device=xb.device).repeat(bs,1,1)
      matrix = self.fc3(xb).view(-1,self.k,self.k) + init
      return matrix


class Transform(nn.Module):
   def __init__(self):
        super().__init__()
        self.input_transform = Tnet(k=3)
        self.feature_transform = Tnet(k=64)
        self.conv1 = nn.Conv1d(3,64,1)

        self.conv2 = nn.Conv1d(64,128,1)
        self.conv3 = nn.Conv1d(128,1024,1)
       

        self.bn1 = nn.BatchNorm1d(64)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(1024)
       
   def forward(self, input):
        matrix3x3 = self.input_transform(input)
        # batch matrix multiplication
        xb = torch.bmm(torch.transpose(input,1,2), matrix3x3).transpose(1,2)

        xb = F.relu(self.bn1(self.conv1(xb)))

        matrix64x64 = self.feature_transform(xb)
        xb = torch.bmm(torch.transpose(xb,1,2), matrix64x64).transpose(1,2)

        xb = F.relu(self.bn2(self.conv2(xb)))
        xb = self.bn3(self.conv3(xb))
        xb = nn.MaxPool1d(xb.size(-1))(xb)
        output = nn.Flatten(1)(xb)
        return output, matrix3x3, matrix64x64

class PointNet(nn.Module):
    def __init__(self, classes = 10):
        super().__init__()
        self.transform = Transform()
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, classes)
        

        self.bn1 = nn.BatchNorm1d(512)
        self.bn2 = nn.BatchNorm1d(256)
        self.dropout = nn.Dropout(p=0.3)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, input):
        xb, matrix3x3, matrix64x64 = self.transform(input)
        xb = F.relu(self.bn1(self.fc1(xb)))
        xb = F.relu(self.bn2(self.dropout(self.fc2(xb))))
        output = self.fc3(xb)
        return self.logsoftmax(output), matrix3x3, matrix64x64

<a href="https://github.com/nikitakaraevv/pointnet/blob/master/nbs/PointNetClass.ipynb">Code source</a> - Full implementation of PointNet Classification (can be opened in Colab)

<a href="https://towardsdatascience.com/deep-learning-on-point-clouds-implementing-pointnet-in-google-colab-1fd65cd3a263">Deep Learning on Point clouds: Implementing PointNet in Google Colab</a> - Nikita Karaev



#### PointNet Classification Network

<img src=".\assets\tut_09_pn10.JPG" style="width:800px">



<img src=".\assets\tut_09_pn20.gif" style="width:800px">
 
<a href="https://towardsdatascience.com/deep-learning-on-point-clouds-implementing-pointnet-in-google-colab-1fd65cd3a263">Image source</a> 

##### Transformation Invariance
<img src=".\assets\tut_09_pn21.JPG" style="width:200px">
Learn a transformation matrix to improve task performance. We want our network to be invariant to rigid transformations of the object.

* In practice, stronger augmentation of the training dataset also achieves transformation invariance


<a href="https://medium.com/@luis_gonzales/an-in-depth-look-at-pointnet-111d7efdaa1a">Image source</a> 

Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.
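
The augmentation route to transformation invariance can be sketched as follows. This is a minimal example assuming rotation about the vertical (z) axis plus small Gaussian jitter; the function names and the noise level `sigma=0.02` are illustrative assumptions.

```python
import numpy as np

def random_rotation_z(points, rng):
    """Rotate an (N, 3) cloud by a random angle around the z axis."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

def jitter(points, rng, sigma=0.02):
    """Add small Gaussian noise to every coordinate."""
    return points + rng.normal(0.0, sigma, size=points.shape)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
rotated = random_rotation_z(cloud, rng)
augmented = jitter(rotated, rng)
print(augmented.shape)
```

Applying a fresh random rotation each time a sample is drawn exposes the network to many poses of the same object during training.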

##### Results - Classification
<img src=".\assets\tut_09_pn12.JPG" style="width:800px">

Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.

#### PointNet Segmentation Network
<img src=".\assets\tut_09_pn11.JPG" style="width:800px">


* Extract local features - Describe each point separately
* Extract global feature - Describes the entire point cloud
* Concatenate the local and global features and feed them into a shared MLP - The MLP learns to process each point feature conditioned on the global feature vector.
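
The concatenation step can be sketched with random tensors standing in for the real features. The 64-dimensional local and 1024-dimensional global features match the layer widths in the code above; the segmentation head widths and `num_parts` are hypothetical.

```python
import torch
import torch.nn as nn

bs, n, num_parts = 4, 1024, 6              # hypothetical sizes
local_feat = torch.rand(bs, 64, n)         # per-point features
global_feat = torch.rand(bs, 1024)         # one vector per cloud (after max pooling)

# Repeat the global vector for every point and stack it with the local features
global_tiled = global_feat.unsqueeze(-1).expand(-1, -1, n)    # (bs, 1024, n)
combined = torch.cat([local_feat, global_tiled], dim=1)       # (bs, 1088, n)

seg_head = nn.Sequential(                  # shared MLP written as kernel-size-1 convolutions
    nn.Conv1d(1088, 256, 1), nn.ReLU(),
    nn.Conv1d(256, num_parts, 1),
)
point_logits = seg_head(combined)          # one score vector per point
print(combined.shape, point_logits.shape)
```

Every point sees the same global vector, so the shared MLP can decide a point's label from both its own feature and the overall shape.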


<a href="https://github.com/nikitakaraevv/pointnet/blob/master/nbs/PointNetSeg.ipynb">Code</a> - Full implementation of PointNet Segmentation (can be opened in Colab)

Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.

##### Semantic Scene Parsing
<img src=".\assets\tut_09_pn13.JPG" style="width:800px">

Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.

##### Results - Robustness to Missing Data (Classification example)
* Why is PointNet so robust to missing data?
<img src=".\assets\tut_09_pn14.JPG" style="width:500px">

Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.

##### Visualizing Global  Point Cloud Features
<img src=".\assets\tut_09_pn16.JPG" style="width:800px">


##### Visualize What is Learned by Reconstruction
* Salient points are discovered!
<img src=".\assets\tut_09_pn15.png" style="width:800px">
The "critical points" are the points that contribute to the global feature vector, i.e., the ones selected by the max-pooling layer. They preserve the overall geometric structure of the object.
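
A minimal sketch of how critical points could be extracted, assuming a toy per-point function (one random linear layer with ReLU) in place of the trained network. The critical set is the set of points that attain the maximum in at least one feature dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))
W = rng.normal(size=(3, 64))                     # toy stand-in for the learned per-point MLP h
feats = np.maximum(points @ W, 0.0)              # (N, D) per-point features

global_feat = feats.max(axis=0)                  # max pooling over points
critical_idx = np.unique(feats.argmax(axis=0))   # points that win at least one dimension
critical_points = points[critical_idx]

# Pooling over the critical points alone reproduces the same global feature
global_from_critical = feats[critical_idx].max(axis=0)
print(len(critical_idx), np.array_equal(global_feat, global_from_critical))
```

At most $D$ points can be critical, and removing any non-critical point leaves the global feature unchanged, which is one way to see why max pooling tolerates missing data.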

Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.

##### Point function visualization
For each per-point function $h$ (mlp), calculate the values of $h(p)$ for all the points $p$ in the cube.

15 random functions out of the 1024 learned functions:
<img src=".\assets\tut_09_pn19.JPG" style="width:800px">

* Roughly analogous to filter responses in CNNs


Qi, Charles R., et al. "PointNet: Deep learning on point sets for 3d classification and segmentation." CVPR 2017.

#### <img src="https://img.icons8.com/plasticine/100/000000/not-applicable.png" style="height:50px;display:inline"> Limitations of PointNet
<img src=".\assets\tut_09_pn17.JPG" style="width:800px">

* No local context for each point
* The global feature depends on **absolute** coordinates, which makes it hard to generalize to unseen scene configurations


#### Points in Metric Space
* Learn “kernels” in 3D space and conduct convolution
* Kernels have compact spatial support
* For convolution, we need to find neighboring points
* Possible strategies for range query
    * Ball query (results in more stable features)
    * k-NN query (faster)
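
The two query strategies can be sketched with brute-force distances. This is a simple illustration (dense pairwise distances, hypothetical function names and parameters), not an efficient implementation.

```python
import numpy as np

def pairwise_dist(centers, points):
    """Euclidean distances between (M, 3) centers and (N, 3) points, shape (M, N)."""
    diff = centers[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def knn_query(centers, points, k):
    """Indices of the k nearest points for every center, shape (M, k)."""
    return np.argsort(pairwise_dist(centers, points), axis=1)[:, :k]

def ball_query(centers, points, radius):
    """For every center, indices of all points within the given radius."""
    d = pairwise_dist(centers, points)
    return [np.flatnonzero(row <= radius) for row in d]

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, size=(500, 3))
centers = points[:4]
knn_idx = knn_query(centers, points, k=16)
ball_idx = ball_query(centers, points, radius=0.2)
print(knn_idx.shape, [len(b) for b in ball_idx])
```

k-NN always returns the same number of neighbors but their spatial extent varies with density, while a ball query fixes the spatial extent and lets the neighbor count vary.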

### <img src="https://img.icons8.com/plasticine/100/000000/approve-and-update.png" style="height:50px;display:inline"> PointNet++ (PointNet v2.0): Multi-Scale PointNet

<img src=".\assets\tut_09_pn18.png" style="width:800px">

Repeated layers:
* Sample anchor points
* Find neighborhood of anchor points
* Apply PointNet in each neighborhood to mimic convolution
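
For the anchor sampling step, one common choice (used in the PointNet++ paper cited below) is farthest point sampling, which picks points that cover the cloud evenly. The sketch below is a plain NumPy version with an illustrative function name and sizes.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick indices so that each new point is farthest from those already chosen."""
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start
    # Distance from every point to its nearest selected point so far
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, num_samples):
        selected[i] = np.argmax(min_dist)
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

rng = np.random.default_rng(0)
cloud = rng.normal(size=(2048, 3))
anchor_idx = farthest_point_sampling(cloud, num_samples=128)
anchors = cloud[anchor_idx]
print(anchors.shape)
```

Each anchor then defines a neighborhood (for example via a ball query), and a small PointNet summarizes that neighborhood into one feature vector.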



Qi, Charles Ruizhongtai, et al. "Pointnet++: Deep hierarchical feature learning on point sets in a metric space." Advances in neural information processing systems. 2017.

### More Point Clouds DL solutions:
* 3DmFV
* Dynamic Graph CNN
* PCNN
* PointCNN
* KPConv

## <img src="https://img.icons8.com/bubbles/50/000000/list.png" style="height:50px;display:inline"> 3D Deep Learning Applications
---

* Classification (V)
* Semantic segmentation (V)
* Part segmentation
* Object detection (Upcoming)
* Reconstruction 
* Generation (Upcoming)
<!--     * AutoEncoders
    * GANs
    * Implicit Functions -->
* Registration (Upcoming)
* Sampling - Downsampling, Upsampling
* SLAM
* Normal Estimation
* and many more...

#### Registration:
Problem statement: Find the rigid transformation (rotation and translation) that aligns two point clouds

<img src=".\assets\tut_09_pnlk1.JPG" style="width:800px">

* PointNetLK (blue) - Deep Learning, based on Lucas–Kanade method (Tracking lecture)
    * Comparing 2 point clouds using PointNet features
* ICP (orange) - Classic registration method
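
To make the registration problem concrete: when point correspondences are known, the best rotation and translation have a closed-form SVD solution. Classic ICP alternates between guessing correspondences (nearest neighbors) and this alignment step. The sketch below covers only the alignment step; the function name and test data are illustrative assumptions.

```python
import numpy as np

def best_rigid_transform(source, target):
    """Least-squares R, t such that R @ source[i] + t ~ target[i] for known correspondences."""
    src_c, tgt_c = source.mean(axis=0), target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)       # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # flip one axis if needed to avoid a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
source = rng.normal(size=(200, 3))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
target = source @ R_true.T + t_true

R_est, t_est = best_rigid_transform(source, target)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```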



<img src=".\assets\tut_09_pnlk2.JPG" style="width:800px">

Both inputs (target and source) are processed by the PointNet architecture

Aoki, Yasuhiro, et al. "PointNetLK: Robust & efficient point cloud registration using PointNet." CVPR 2019.

#### Generation
Conditional generation
<img src=".\assets\tut_09_pngen1.JPG" style="width:800px">
Free generation
<img src=".\assets\tut_09_pngen2.JPG" style="width:800px">

##### How would you build a point cloud GAN?

##### Learning Representations and Generative Models for 3D Point Clouds (Achlioptas et al.)
* FC layer as generator
* PointNet as discriminator
<img src=".\assets\tut_09_pngen4.png" style="width:300px">


<img src=".\assets\tut_09_pngen3.png" style="width:600px">

Achlioptas et al., “Learning Representations and Generative Models for 3D Point Clouds”, ICML 2018
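
The two ingredients listed above (a fully connected generator and a PointNet-style discriminator) can be sketched as follows. All layer widths, the noise dimension, and the class names are hypothetical and are not taken from the paper; the training loop is omitted.

```python
import torch
import torch.nn as nn

class PointGenerator(nn.Module):
    """Fully connected generator: noise vector -> (num_points, 3) point cloud."""
    def __init__(self, noise_dim=128, num_points=1024):
        super().__init__()
        self.num_points = num_points
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, num_points * 3),
        )

    def forward(self, z):
        return self.net(z).view(-1, self.num_points, 3)

class PointDiscriminator(nn.Module):
    """PointNet-style discriminator: shared MLP, max pooling, one real/fake logit."""
    def __init__(self):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, cloud):
        feats = self.shared_mlp(cloud.transpose(1, 2))   # (batch, 256, N)
        return self.head(feats.max(dim=2).values)        # symmetric pooling over points

gen, disc = PointGenerator(), PointDiscriminator()
with torch.no_grad():
    fake = gen(torch.randn(4, 128))
    score = disc(fake)
    score_shuffled = disc(fake[:, torch.randperm(fake.size(1))])
print(fake.shape, score.shape)
```

The discriminator's max pooling makes its score independent of the order in which the generator emits the points, so the generator is not penalized for point ordering.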

More generation methods:
* AtlasNet
* FoldingNet
* PointFlow
* OccupancyNetworks
* DeepSDF
* ...

#### Detection:
* Generate object proposals from a view (e.g., using SSD)
* Recognize using PointNet
<img src=".\assets\tut_09_pndet1.JPG" style="width:800px">


Qi et al., “Frustum PointNets for 3D Object Detection from RGB-D Data”, CVPR 2018

##### <img src="https://img.icons8.com/bubbles/50/000000/question-mark.png" style="height:50px;display:inline"> Questions
* What are the differences between a 2D image and a point cloud?
    * Unstructured
    * Varying number of points
    * Unordered
* Why might it be hard to feed a point cloud as neural network (NN) input?
    * It does not lie on a grid
    * It does not have a fixed size
    * Different permutations represent the same point cloud
All three differences directly affect the ability to use a NN!
* What are the benefits of using a point cloud?
    * Most sensors' raw outputs are point clouds (e.g., LiDAR)
    * Very efficient representation of 3D data (no empty voxels)
    * Preserves geometric details (no quantization)

### <img src="https://img.icons8.com/clouds/100/000000/hand-tools.png" style="height:50px;display:inline"> Recommended Tools
---
Python:
* Open3D
* trimesh
* Ipyvolume - Visualization for Notebooks

Deep Learning:
* PyTorch3D
* Kaolin (PyTorch)
* TensorFlow Graphics

Visualization Tools (drop and view):
* CloudCompare
* MeshLab

For more 3D deep learning frameworks and datasets:
* <a href="https://github.com/Yochengliu/awesome-point-cloud-analysis">awesome-point-cloud-analysis</a>
* <a href="https://github.com/timzhang642/3D-Machine-Learning#datasets">3D-Machine-Learning#datasets</a> 

Datasets:
* ModelNet
* ShapeNet
* PartNet
* Sydney Urban Objects Dataset
* Stanford 3D
* KITTI
* ...

### <img src="https://img.icons8.com/bubbles/50/000000/video-playlist.png" style="height:50px;display:inline"> Recommended Videos
---
#### <img src="https://img.icons8.com/cute-clipart/64/000000/warning-shield.png" style="height:30px;display:inline"> Warning!
* These videos do not replace the lectures and tutorials.
* Please use these to get a better understanding of the material, and not as an alternative to the written material.

#### Video By Subject
* 3D Deep Learning
    * General (Both highly recommended):
        *  <a href="https://www.youtube.com/watch?time_continue=6&v=vfL6uJYFrp4&feature=emb_logo">3D Deep Learning Tutorial from SU lab at UCSD</a> - Hao Su
        *  <a href="https://www.youtube.com/watch?v=wLU4YsC_4NY">Geometric deep learning</a> - Michael Bronstein
    * <a href="https://www.youtube.com/watch?v=Cge-hot0Oc0&t=24s">PointNet</a> 
    * <a href="https://www.youtube.com/watch?v=HIUGOKSLTcE">3DmFV</a>
    

## <img src="https://img.icons8.com/dusk/64/000000/prize.png" style="height:50px;display:inline"> Credits
----
* Slides - <a href="http://www.itzikbs.com/category/research-blog">Yizhak (Itzik) Ben-Shabat</a>,  <a href="https://ci2cv.net/people/simon-lucey/">Simon Lucey (CMU)</a>, <a href="https://cseweb.ucsd.edu/~haosu/">Hao Su, Jiayuan Gu and Minghua Liu(UCSanDiego) </a>
* Multiple View Geometry in Computer Vision - Hartley and Zisserman - Sections 9,10
* <a href="https://www.springer.com/gp/book/9781848829343">Computer Vision: Algorithms and Applications</a> - Richard Szeliski - Sections 11,12

* Icons from <a href="https://icons8.com/">Icons8.com</a> - https://icons8.com
