# <img src="https://img.icons8.com/bubbles/100/000000/3d-glasses.png" style="height:50px;display:inline"> EE 046746 - Technion - Computer Vision
---
#### <a href="https://taldatech.github.io/"></a> 


## Tutorial 10 - Neural Radiance Fields
---

<img src="https://uploads-ssl.webflow.com/51e0d73d83d06baa7a00000f/5e700ef6067b43821ed52768_pipeline_website-01.png" style="height:300px">

* <a href="https://www.matthewtancik.com/nerf">Image Source</a>

### <img src="https://img.icons8.com/bubbles/50/000000/checklist.png" style="height:50px;display:inline"> Agenda
---

* [What is NeRF](#-What-is-NeRF)
* [Components to NeRF](#-TODO)
    * [What is Implicit Neural Representations](#-What-is-Implicit-Neural-Representations)
    * [Camera Localization](#-TODO)
    * [Positional Encoding](#-TODO)
    * [Ray Casting](#-TODO)
    * [Coarse to Fine Representation](#-TODO)
    * [Volume Rendering](#-TODO)
    * [Model Architecture](#-TODO)
* [Extensions](#-TODO)
    * [Performance](#-TODO?)
    * [Temporal NeRF](#-TODO)
    * [Others](#-TODO)
* [Recommended Videos](#-Recommended-Videos)
* [Credits](#-Credits)

## <img src="https://img.icons8.com/color/96/null/view-details.png" style="height:50px;display:inline"> What is NeRF
---
- NeRF (Neural Radiance Fields) is a technique that uses deep neural networks to generate 3D reconstructions and render novel views of a scene.
- NeRF learns a continuous representation of the scene's appearance and geometry directly from 2D images, and their estimated camera locations, without relying on sparse feature matches or point correspondences.
- It represents the scene as a volumetric function that estimates volume density and view-dependent radiance (color) at any 3D point within the scene. This representation is known as an Implicit Neural Representation.
- NeRF captures fine surface details, complex lighting effects, reflections, and refractions, producing highly realistic renderings.
- To train NeRF, a set of images taken from different viewpoints is used as input to the neural network.
- During training, NeRF learns to predict the volumetric representation by minimizing the discrepancy between predicted and ground truth images.
- Once trained, NeRF can generate novel views of the scene from any desired camera viewpoint, even if it was not present in the training data.
![NeRF-Drums.png](assets/NeRF-Drums.png)

### How does it work?

* Essentially, we "launch" rays from each 3D Camera location, and attempt to "predict" the density and RGB color along the ray
![nerf](assets/NeRF_scheme.png)
* The density and color prediction is carried out by an "Implicit Neural Representation"
* We'll now see how each part integrates into NeRF

### <img src="https://img.icons8.com/dusk/64/000000/plus.png" style="height:50px;display:inline"> What is Implicit Neural Representations
---
- As mentioned above, an implicit neural representation encodes data as the weights of a deep neural network that maps coordinates to values. This works for many kinds of data (images, NeRFs and even other neural networks!).
- In 3D, we distinguish between explicit and implicit representations.
- Explicit representations store geometric information directly, such as vertices and voxels, while implicit representations model the surface or volume as a function.
- When representing 3D data, implicit representations can describe a wide range of shapes, including objects with holes, handles, and self-intersections.
- They have gained popularity due to their ability to handle complex and diverse shapes and their potential for end-to-end learning.
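
To make the explicit/implicit distinction concrete, here is a minimal sketch. It uses a hand-written signed distance function of a sphere as a stand-in for a learned network; the names `sphere_sdf` and `voxelize` are illustrative and do not come from any NeRF codebase.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Implicit representation: signed distance to a sphere centered at the origin.

    Negative inside, zero on the surface, positive outside. It can be queried
    at any continuous 3D location.
    """
    return np.linalg.norm(points, axis=-1) - radius

def voxelize(fn, resolution=16, extent=1.5):
    """Explicit representation: evaluate the implicit function on a fixed voxel grid."""
    axis = np.linspace(-extent, extent, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return fn(grid) <= 0.0  # boolean occupancy grid, detail limited by `resolution`

query = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
sdf_values = sphere_sdf(query)      # continuous queries
occupancy = voxelize(sphere_sdf)    # fixed-resolution grid
```

The implicit function stores no geometry at all and has unlimited resolution, while the voxel grid needs memory that grows cubically with its resolution. In NeRF, the hand-written function is replaced by an MLP.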

<img src="./assets/ImplicitRep.png"/>

### <img src="https://img.icons8.com/clouds/100/000000/google-maps.png" style="height:50px;display:inline"> Camera Localization
---
- In order to launch rays from the cameras, we must first know where they are relative to the scene.
- Accurate camera poses are necessary for NeRF to correctly align the images and estimate the scene's geometry and appearance.
- One common approach for camera localization is using a structure-from-motion (SfM) pipeline such as COLMAP. As you know, SfM algorithms analyze image correspondences and geometric constraints to estimate camera poses and sparse 3D points.
- COLMAP, for instance, can automatically detect feature correspondences across images, perform bundle adjustment to refine camera poses and 3D point locations, and provide accurate camera localization results.

- These techniques struggle with scenes that have smooth or repetitive textures, where feature matching is unreliable.
- Synthetic datasets, created through computer graphics techniques, can provide known camera locations. These datasets are often used to train and evaluate NeRF models.


### Positional Encoding
---
- The same idea as the positional encoding used in transformers, but applied to continuous coordinates rather than token indices.
- In the context of NeRF, positional encoding maps the 3D spatial coordinates of points in the scene (and the viewing direction) to a higher-dimensional space of sines and cosines at increasing frequencies.
- With positional encoding, the network can learn a higher-frequency representation of the scene and therefore capture greater detail.
- This helps NeRF capture complex geometric structures and relationships, enabling accurate reconstruction and rendering.
- The positional encoding in NeRF works in conjunction with other network layers that capture appearance and radiance information.
![Positional_Encoding.png](assets/Positional_Encoding.png)
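
The encoding can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the common form that applies sin and cos at frequencies $2^0\pi, \dots, 2^{L-1}\pi$ to every coordinate; the function name and the `include_input` flag are choices made for this example.

```python
import numpy as np

def positional_encoding(x, num_freqs=10, include_input=True):
    """Encode each coordinate with sin/cos at frequencies 2^0*pi ... 2^(L-1)*pi.

    x: array of shape (..., D). Returns shape (..., D * 2 * num_freqs [+ D]).
    Features are grouped per coordinate (all sines, then all cosines), which is
    a permutation of the interleaved ordering often written in formulas.
    """
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi                    # (L,)
    angles = x[..., None] * freqs                                    # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    enc = enc.reshape(*x.shape[:-1], -1)                             # (..., D * 2L)
    if include_input:
        enc = np.concatenate([x, enc], axis=-1)
    return enc

points = np.random.default_rng(0).uniform(-1, 1, size=(4, 3))
encoded = positional_encoding(points, num_freqs=10)
print(encoded.shape)
```

With 3D inputs and 10 frequencies, each point becomes a 63-dimensional vector (3 raw coordinates plus 3 x 2 x 10 sinusoids), which is what the MLP receives instead of the raw coordinates.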

### Ray Casting
---
- Similarly to triangulation, we launch virtual rays from each camera location.
- Each ray is defined by its origin (the camera viewpoint) and a direction (pointing towards a pixel on the image plane).
- Along each ray, the radiance values are estimated by querying the neural network representing the volumetric function.
- This estimation involves sampling points along the ray and evaluating the neural network at each of them to obtain density and radiance values, which is computationally expensive!
- The quality of the rendered views depends on hyperparameters such as the image resolution, the number of rays per pixel, the number of samples per ray, and the accuracy of the radiance estimation.

![Pinhole camera](https://www.researchgate.net/profile/Willy-Azarcoya-Cabiedes/publication/317498100/figure/fig10/AS:610418494013440@1522546518034/Pin-hole-camera-model-terminology-The-optical-center-pinhole-is-placed-at-the-origin.png)
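
A minimal sketch of ray generation for a pinhole camera follows. It assumes a camera that looks down its local -z axis (x right, y up), a single focal length in pixels, and a 4x4 camera-to-world matrix `c2w`; these conventions differ between codebases, so treat them as assumptions of this example.

```python
import numpy as np

def get_rays(height, width, focal, c2w):
    """Return per-pixel ray origins and unit directions in world coordinates."""
    # Pixel grid: i indexes columns (x), j indexes rows (y); both have shape (H, W).
    i, j = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64), indexing="xy")
    # Directions in the camera frame, with the principal point at the image center.
    dirs = np.stack([(i - 0.5 * width) / focal,
                     -(j - 0.5 * height) / focal,
                     -np.ones_like(i)], axis=-1)                # (H, W, 3)
    rays_d = dirs @ c2w[:3, :3].T                               # rotate into the world frame
    rays_d /= np.linalg.norm(rays_d, axis=-1, keepdims=True)    # normalize (a choice of this sketch)
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape).copy()   # every ray starts at the camera center
    return rays_o, rays_d

rays_o, rays_d = get_rays(height=4, width=6, focal=5.0, c2w=np.eye(4))
```

Each pixel now has an origin `o` and a direction `d`, so any point on its ray is `o + t * d`.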


#### Stratified Sampling

Now that we understand how to cast rays, we need a way to "discretize" the density function along each ray.
To do so, we sample points along the ray we've cast.
The stratified sampling approach splits the ray into evenly-spaced bins and draws one random sample within each bin. The perturb setting determines whether to sample a point uniformly at random within each bin or to simply use the bin center as the point. In most cases we perturb the samples, as this encourages the network to learn over a continuously sampled space.

![stratified_sampling.png](assets/stratified_sampling.png)
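
The sampling step can be sketched as follows. The function name and the `near`/`far` bounds are illustrative; `perturb=False` falls back to the bin centers, as described above.

```python
import numpy as np

def stratified_samples(near, far, num_samples, perturb=True, rng=None):
    """Return one depth value per evenly-spaced bin between `near` and `far`."""
    edges = np.linspace(near, far, num_samples + 1)   # bin boundaries
    lower, upper = edges[:-1], edges[1:]
    if perturb:
        rng = np.random.default_rng() if rng is None else rng
        u = rng.uniform(size=num_samples)             # random position inside each bin
    else:
        u = np.full(num_samples, 0.5)                 # deterministic bin centers
    return lower + (upper - lower) * u

t_vals = stratified_samples(2.0, 6.0, 8, perturb=True, rng=np.random.default_rng(0))
t_mid = stratified_samples(2.0, 6.0, 8, perturb=False)
```

Because every bin contributes exactly one sample, the samples stay sorted and cover the whole ray, while their exact positions change from one training iteration to the next.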

#### Hierarchical Volume Sampling

In practice, most of the 3D space is either empty or occluded, so most sampled points contribute little to the rendered image. It is therefore more beneficial to oversample regions that are likely to contribute to the integral.

This is done by using the weights that the coarse network assigns to the first, *coarse* set of samples to create a PDF along the ray, then sampling from that PDF to generate a finer set of points, which is forwarded (together with the coarse samples) through another, *fine* version of the NeRF.
![hierarchical_sampling.png](assets/hierarchical_sampling.png)
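
Sampling from the coarse weights is an instance of inverse transform sampling. Below is a minimal single-ray sketch, assuming a piecewise-constant PDF over the coarse bins; the function name `sample_pdf` and the small constant added to the weights are choices made for this example.

```python
import numpy as np

def sample_pdf(bin_edges, weights, num_samples, rng=None):
    """Draw samples from a piecewise-constant PDF given by `weights` over `bin_edges`."""
    rng = np.random.default_rng() if rng is None else rng
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)    # small constant avoids empty bins
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])      # (num_bins + 1,)
    u = rng.uniform(size=num_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1    # bin that contains each u
    idx = np.clip(idx, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])  # relative position inside the bin
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

edges = np.linspace(2.0, 6.0, 9)                       # 8 coarse bins
weights = np.array([0.0, 0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0])
t_fine = np.sort(sample_pdf(edges, weights, 64, rng=np.random.default_rng(0)))
```

In this toy example almost all fine samples land between 3.5 and 5.0, where the coarse weights are concentrated, which is exactly the oversampling behavior described above.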

### Volume Rendering

![nerf](assets/NeRF_scheme.png)

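Putting it all together, the color of a rendered pixel is given by the following formula:

$$C(r)=\int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t),d)\,dt \quad \text{where} \quad T(t)=\exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right) \quad \text{and} \quad r(t)=o+td $$

Here $\sigma$ is the volume density, $c$ is the color, and $T(t)$ is the accumulated transmittance, i.e. the probability that the ray travels from $t_n$ to $t$ without hitting anything.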

In practice, we approximate the integral with a quadrature rule over the $N$ samples along the ray:

$$
\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp\left(-\sigma_i \delta_i\right)\right) c_i, \quad \text{where} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
$$

and $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples.

Therefore, for each image pixel we "launch" a ray that originates at the camera center and passes through that pixel, and compute the transmittance, color, and density at each point sampled along the ray.

As specified before, the image can be rendered both through the "coarse" NeRF and through the "fine" one; the fine rendering is usually used as the final novel view.
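
The discrete rendering sum for a single ray can be sketched as below. It assumes the densities and colors have already been predicted at the sampled depths; padding the last interval with a very large `far_delta` is a convention borrowed from common implementations and is an assumption of this example.

```python
import numpy as np

def render_ray(sigma, color, t_vals, far_delta=1e10):
    """Composite N samples along one ray.

    sigma: (N,) densities, color: (N, 3) RGB values, t_vals: (N,) sorted depths.
    """
    deltas = np.append(np.diff(t_vals), far_delta)       # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * deltas)                # opacity of each segment
    # T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j)
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = transmittance * alpha                      # contribution of each sample
    rgb = np.sum(weights[:, None] * color, axis=0)
    return rgb, weights

t_vals = np.linspace(2.0, 6.0, 5)
sigma = np.array([0.0, 0.0, 1e3, 0.0, 0.0])              # an opaque surface at the third sample
color = np.tile([0.0, 0.0, 1.0], (5, 1))                 # blue everywhere...
color[2] = [1.0, 0.0, 0.0]                               # ...except a red surface
rgb, weights = render_ray(sigma, color, t_vals)
```

The rendered pixel is red: empty space in front of the surface has zero opacity, and everything behind it is hidden because the transmittance has dropped to zero. The same `weights` are what hierarchical sampling turns into a PDF.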

#### Finishing up
---

Now that we can render an entirely new view, we come to the obvious question:

"How do you even train this?!"

Good question!

All it takes is a simple L2 loss on the reconstruction of a known, given image!

To force both the coarse model and the fine model to match the expected image, the loss is:

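$$
\mathcal{L} = \sum_{r \in R} \left( \lVert \hat{C}_{c}(r) - C(r)\rVert_2^2+ \lVert \hat{C}_{f}(r) - C(r)\rVert_2^2 \right)
$$

where $R$ is the set of rays in a batch, $C(r)$ is the ground-truth pixel color, and $\hat{C}_{c}(r)$, $\hat{C}_{f}(r)$ are the colors rendered by the coarse and fine models.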
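
As a minimal sketch, the loss over a batch of rays can be computed as below. The function name is illustrative, and the inputs are assumed to be arrays of shape `(num_rays, 3)`; many implementations average over rays instead of summing, which only rescales the gradient.

```python
import numpy as np

def nerf_loss(rgb_coarse, rgb_fine, rgb_target):
    """Sum of squared errors of the coarse and fine renderings over a batch of rays."""
    coarse_term = np.sum((rgb_coarse - rgb_target) ** 2, axis=-1)  # squared L2 norm per ray
    fine_term = np.sum((rgb_fine - rgb_target) ** 2, axis=-1)
    return np.sum(coarse_term + fine_term)

target = np.zeros((2, 3))
coarse = np.full((2, 3), 0.1)   # coarse rendering is slightly off
fine = target.copy()            # fine rendering is perfect
loss = nerf_loss(coarse, fine, target)
```

Keeping the coarse term in the loss matters even though only the fine rendering is shown at the end, because the coarse weights decide where the fine samples are placed.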


## What came after

NeRF opened the door to many follow-up works that build on its technique. We'll show three, each of which addresses a different issue that the original paper did not completely solve.
![nerf-scholar.png](assets/nerf-scholar.png)

### Training/Rendering Speed
#### Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

- As mentioned above, implicit representations are very good at representing dense, high-resolution information (gigapixel images, videos, and NeRFs), but they are very, very slow to train and evaluate.
- This paper introduces a "hybrid" representation that combines the best of both worlds: the rendering speed of explicit representations and the high-quality reconstruction of implicit ones.

- Specifically, it uses a multiresolution hierarchy of hash tables, which provides adaptivity and efficiency:
  - Adaptivity: The method maps a cascade of grids to corresponding fixed-size arrays of feature vectors.
      - At coarse resolutions, there is a 1:1 mapping from grid points to array entries.
      - At fine resolutions, the array is treated as a hash table and indexed using a spatial hash function, where multiple grid points alias each array entry. The hash tables automatically prioritize the sparse areas with the most important fine-scale detail.
  - Efficiency: The hash table lookups are efficient, and the method has been validated on four representative tasks:
      - Learning the mapping from 2D coordinates to RGB colors of a high-resolution image.
      - Learning the mapping from 3D coordinates to the distance to a surface.
      - Learning the 5D light field of a given scene from a Monte Carlo path tracer.
      - Learning the 3D density and 5D light field of a given scene from image observations and corresponding perspective transforms.


- The paper combines the ideas of parametric encoding and spatial hashing to reduce waste. The trainable feature vectors are stored in a compact spatial hash table, whose size is a hyperparameter. The method uses multiple separate hash tables indexed at different resolutions, whose interpolated outputs are concatenated before being passed through the MLP. The neural network learns to disambiguate hash collisions itself, avoiding control flow divergence, reducing implementation complexity, and improving performance.
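
The lookup described above can be sketched in NumPy. This is a simplified illustration, not the paper's CUDA implementation: it hashes every level (including coarse ones that could be indexed 1:1), uses untrained random tables, and the values of `base_res`, `growth`, and the table size are placeholders chosen for the example.

```python
import numpy as np

# Large primes for the XOR-based spatial hash (the first is 1 for cache coherence).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def spatial_hash(coords, table_size):
    """XOR of (coordinate * prime) over the three dimensions, modulo the table size."""
    c = coords.astype(np.uint64) * PRIMES
    h = c[..., 0] ^ c[..., 1] ^ c[..., 2]
    return (h % np.uint64(table_size)).astype(np.int64)

def hash_encode(x, tables, base_res=16, growth=1.5):
    """Encode points x in [0, 1)^3 by concatenating interpolated features of all levels."""
    features = []
    for level, table in enumerate(tables):
        res = int(np.floor(base_res * growth ** level))  # grid resolution of this level
        pos = x * res
        lo = np.floor(pos).astype(np.int64)              # lower corner of the enclosing voxel
        frac = pos - lo                                  # position inside the voxel
        level_feat = np.zeros((x.shape[0], table.shape[1]))
        for corner in np.ndindex(2, 2, 2):               # the 8 voxel corners
            offset = np.array(corner)
            # Trilinear weight of this corner; the 8 weights sum to 1.
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=-1)
            idx = spatial_hash(lo + offset, table.shape[0])
            level_feat += w[:, None] * table[idx]
        features.append(level_feat)
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
num_levels, table_size, feat_dim = 4, 2 ** 14, 2
tables = [rng.normal(scale=1e-4, size=(table_size, feat_dim)) for _ in range(num_levels)]
points = rng.uniform(0.0, 1.0, size=(5, 3))
encoded = hash_encode(points, tables)                    # fed to a small MLP in the real method
```

Note that nothing in the lookup resolves hash collisions. The table entries are trainable, and the gradients of the points that matter most dominate the colliding entries, which is how the network "learns to disambiguate" them.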

<video height="600" width="900" controls src="https://nvlabs.github.io/instant-ngp/assets/nerf_grid_lq.mp4" />


### D-NeRF
D-NeRF introduces the following:

1. **Dynamic Scene Rendering**: D-NeRF represents the first end-to-end neural rendering system that is applicable to dynamic scenes, encompassing both static and moving/deforming objects. This is a significant innovation, as it requires only a single camera, does not necessitate pre-computed 3D reconstruction, and can be trained end-to-end.

2. **Two-stage Learning Process**: The paper introduces a two-stage learning process. The first stage encodes the scene into a canonical space, and the second maps this canonical representation into the deformed scene at a particular time. Both mappings are learned simultaneously using fully-connected networks. After training, D-NeRF can render novel images while controlling both the camera view and the time variable, and thereby the object movement.

3. **Time Component in 6D Function**: The authors propose a continuous 6D function to represent the input of the system. This function considers 3D location, camera view, and importantly, a time component. By incorporating time as an essential variable, the model can handle dynamic scene changes effectively.

4. **3D Mesh Production**: An interesting side product of the D-NeRF approach is its ability to generate complete 3D meshes capturing the time-varying geometry of scenes. This capability is remarkable because these meshes are produced by observing the scene under a specific deformation from only a single viewpoint.

This is the first paper pertaining to dynamic NeRFs (very cool!)
![d_NeRF.png](assets/d_NeRF.png)
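
The two mappings can be sketched as two small networks chained together. This is only a structural illustration under several assumptions: the MLPs are tiny and untrained (random weights), positional encoding is omitted, and the convention that the canonical configuration is the scene at `t = 0` (zero displacement) is hard-coded. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden_dim, out_dim):
    """Random two-layer MLP weights, standing in for learned parameters."""
    return (rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, out_dim)))

def mlp(params, inputs):
    w1, w2 = params
    return np.maximum(inputs @ w1, 0.0) @ w2       # one ReLU hidden layer

deformation_net = make_mlp(4, 32, 3)   # (x, y, z, t) -> displacement to canonical space
canonical_net = make_mlp(6, 32, 4)     # (canonical point, view direction) -> (r, g, b, sigma)

def d_nerf_query(x, d, t):
    """Return color and density of points `x`, seen from directions `d`, at time `t`."""
    t_col = np.full((x.shape[0], 1), t)
    delta_x = mlp(deformation_net, np.concatenate([x, t_col], axis=-1))
    if t == 0.0:
        delta_x = np.zeros_like(delta_x)           # assumed: t = 0 is the canonical scene
    out = mlp(canonical_net, np.concatenate([x + delta_x, d], axis=-1))
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))        # sigmoid keeps colors in [0, 1]
    sigma = np.maximum(out[:, 3], 0.0)             # density is non-negative
    return rgb, sigma

x = rng.uniform(-1, 1, size=(5, 3))
d = np.tile([0.0, 0.0, -1.0], (5, 1))
rgb_t0, sigma_t0 = d_nerf_query(x, d, t=0.0)
rgb_t1, sigma_t1 = d_nerf_query(x, d, t=0.5)
```

The outputs are rendered with the same volume rendering as a static NeRF. Only the query changes: each point is first moved to the canonical space according to the requested time.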

<video height="600" width="900" controls src="https://www.albertpumarola.com/images/2021/D-NeRF/standup.mp4" />

### DyNeRF

1. **Dynamic Neural Radiance Field:**
    - The authors extend neural radiance fields to the space-time domain.
    - Instead of directly using time as input, they parameterize scene motion and appearance changes by a set of compact latent codes.
    - These learned latent codes show more expressive power, allowing for recording the vivid details of moving geometry and texture.
    - They also allow for smooth interpolation in time, which enables visual effects such as slow motion or ‘bullet time’.
    
![DyNeRF.png](assets/DyNeRF.png)



2. **Novel Importance Sampling Strategies:**
    - Captured dynamic video often exhibits only a small amount of pixel change between frames, which provides an opportunity to significantly speed up training by selecting the pixels that are most important for training.
    - In the time dimension, they schedule training with coarse-to-fine hierarchical sampling of the frames. In the ray/pixel dimension, their design tends to sample pixels that are more time-variant than others.
    - The resulting representation is very small compared to the amount of information stored (28 MB for a 10 second, 30 fps, 18 camera setup).
![temporal_hierarchical_sampling.png](assets/temporal_hierarchical_sampling.png)
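
The idea of favoring time-variant pixels can be illustrated with a simplified sketch. It is not the paper's exact weighting scheme: here the sampling probability of a pixel is simply proportional to its absolute difference from a neighboring frame, plus a small `floor` so that static pixels are still visited. The function and parameter names are hypothetical.

```python
import numpy as np

def sample_time_variant_pixels(frames, frame_idx, num_rays, floor=1e-3, rng=None):
    """Sample pixel coordinates of one frame, favoring pixels that change over time.

    frames: float array of shape (T, H, W, 3), a video from a single camera.
    Returns an integer array of shape (num_rays, 2) holding (row, col) pairs.
    """
    rng = np.random.default_rng() if rng is None else rng
    neighbor = frame_idx - 1 if frame_idx > 0 else frame_idx + 1
    diff = np.abs(frames[frame_idx] - frames[neighbor]).mean(axis=-1)  # (H, W)
    weights = diff + floor                         # floor keeps static pixels reachable
    probs = (weights / weights.sum()).ravel()
    flat = rng.choice(probs.size, size=num_rays, p=probs)
    return np.stack(np.unravel_index(flat, diff.shape), axis=-1)

frames = np.zeros((3, 8, 8, 3))
frames[1, 2:4, 2:4] = 1.0                          # a small patch that appears in frame 1
pixels = sample_time_variant_pixels(frames, frame_idx=1, num_rays=1000,
                                    rng=np.random.default_rng(0))
```

In this toy video nearly all sampled rays fall inside the 2x2 patch that changes, so training effort is spent where the scene is actually dynamic.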

#### And many more

This was just a partial overview of the papers that expand on NeRF.

A more complete list can be found at https://github.com/awesome-NeRF/awesome-NeRF

### Trying it yourself at home

All of the code necessary to train NeRF is widely available, either at https://github.com/yenchenlin/nerf-pytorch or through the Nerfstudio suite.




![nerfstudio.png](assets/nerfstudio.png)

Nerfstudio is a Python library that provides a simplified end-to-end process for creating, training, and visualizing Neural Radiance Fields (NeRFs). The library aims to make NeRFs more interpretable by modularizing each component. Additionally, it provides learning resources to help users understand and keep up-to-date with NeRF technology. Nerfstudio encourages contributions from its users, whether it's a feature request, a new NeRF model, or a new dataset.

### Supported Methods

Included Methods:

- Nerfacto: Recommended method, integrates multiple methods into one.
- Instant-NGP: Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
- NeRF: OG Neural Radiance Fields
- Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
- TensoRF: Tensorial Radiance Fields

Third-party Methods:

- Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions
- K-Planes: Unified 3D and 4D Radiance Fields
- LERF: Language Embedded Radiance Fields

### <img src="https://img.icons8.com/bubbles/50/000000/video-playlist.png" style="height:50px;display:inline"> Recommended Videos
---
#### <img src="https://img.icons8.com/cute-clipart/64/000000/warning-shield.png" style="height:30px;display:inline"> Warning!
* These videos do not replace the lectures and tutorials.
* Please use these to get a better understanding of the material, and not as an alternative to the written material.

#### Video By Subject
* NeRF Studio - <a href="https://www.youtube.com/watch?v=nSFsugarWzk">Getting Started</a>

## <img src="https://img.icons8.com/dusk/64/000000/prize.png" style="height:50px;display:inline"> Credits
---
* EE 046746 Spring 21 - <a href="https://taldatech.github.io/">Tal Daniel</a> 
* Slides by David Dov and Yael Amiay.
* Icons from <a href="https://icons8.com/">Icon8.com</a> - https://icons8.com