Survey-DLCPS

Deep Learning Methods for Calibrated Photometric Stereo and Beyond

Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. Paper: https://arxiv.org/pdf/2212.08414.pdf

Please consider citing our paper if you find it useful in your research.

```bibtex
@article{ju2024deep,
  title={Deep Learning Methods for Calibrated Photometric Stereo and Beyond},
  author={Ju, Yakun and Lam, Kin-Man and Xie, Wuyuan and Zhou, Huiyu and Dong, Junyu and Shi, Boxin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}
```

Introduction

Photometric stereo methods recover detailed shape reconstructions from multiple images captured under different illuminations. Because real surfaces generally exhibit non-Lambertian reflectance, many methods have been proposed to handle non-Lambertian photometric stereo. This paper focuses on deep learning-based calibrated photometric stereo methods and provides a comprehensive review.

The purpose of this project is to continuously collect and update deep learning-based calibrated photometric stereo methods.

Categorization Based on Input Processing

The first deep learning method, DPSN [paper, code], uses a seven-layer fully-connected network and therefore requires the order of illuminations and the number of input images to be fixed between training and testing. To handle a varying number of inputs, many methods were subsequently proposed. In this section, we classify these methods by how they process the inputs.

Per-pixel methods

The per-pixel strategy was first implemented via the observation map in CNN-PS [paper, code].

The observation map aggregates, for each pixel, its intensities across all input images into a fixed-size map, indexed by the 2D coordinates of the normalized lighting directions projected along the z-axis.
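To make the idea concrete, below is a minimal NumPy sketch of building one pixel's observation map. The map size, the simple max-normalization of intensities, and all names are illustrative assumptions, not the exact CNN-PS implementation.

```python
import numpy as np

def observation_map(intensities, light_dirs, size=32):
    """Build one pixel's observation map (illustrative sketch).

    intensities: (m,) intensities of a single pixel under m lights
    light_dirs:  (m, 3) unit lighting directions (lx, ly, lz), z toward viewer
    size:        resolution of the square observation map (assumed value)
    """
    obs = np.zeros((size, size), dtype=np.float32)
    # Project each unit lighting direction onto the x-y plane (drop the z
    # component) and map its [-1, 1] coordinates to integer grid indices.
    u = np.clip(((light_dirs[:, 0] + 1) * 0.5 * (size - 1)).round().astype(int), 0, size - 1)
    v = np.clip(((light_dirs[:, 1] + 1) * 0.5 * (size - 1)).round().astype(int), 0, size - 1)
    # Normalize intensities so the map is invariant to a global scale (albedo).
    obs[v, u] = intensities / (intensities.max() + 1e-8)
    return obs
```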

Problem of sparse input

Several works address the problem of sparse input images, such as SPLINE-Net [paper, code] and LMPS [paper, code]. These two methods adopt opposite strategies: SPLINE-Net proposes a lighting interpolation network to generate dense observation maps when the input is sparse, while LMPS reduces the required number of images by learning only the critical illumination conditions via a connection table.

Problem of global information

Original per-pixel methods, designed to represent a single pixel, do not explicitly incorporate information from the surrounding pixel neighborhood. PX-Net [paper, code] proposed an observation-map-based method that accounts for global illumination effects, such as self-reflections, surface discontinuities, and ambient light, enabling global information to be embedded in the per-pixel generation process.

All-pixel methods

All-pixel methods keep all the pixels together and can therefore exploit intra-image intensity variations across an entire input image. The first all-pixel method, PS-FCN [paper, code], fuses features from an arbitrary number of inputs via an element-wise max-pooling layer applied across the input images.
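The core of this fusion step can be sketched in a few lines of PyTorch; taking an element-wise max over the image axis is order-agnostic and works for any number of inputs (a minimal sketch, not PS-FCN's full network):

```python
import torch

def fuse_max(per_image_feats):
    """Order-agnostic feature fusion by element-wise max-pooling,
    in the spirit of PS-FCN (illustrative sketch).

    per_image_feats: list of m tensors, each (C, H, W), one per input image.
    Returns one (C, H, W) tensor, independent of m and of input order.
    """
    stacked = torch.stack(per_image_feats, dim=0)  # (m, C, H, W)
    fused, _ = stacked.max(dim=0)                  # max over the image axis
    return fused
```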

Problem of spatially varying BRDF

Since all-pixel methods use convolutional networks to process the input in a patch-based manner, they may struggle with steep color changes caused by spatially varying materials. PS-FCN (Norm.) [paper, code] proposed an observation normalization method to eliminate the impact of changing albedo, and NormAttention-PSN [paper, code] further improved normalization on strongly non-Lambertian surfaces.
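The intuition behind observation normalization can be sketched as follows: under a Lambertian model, dividing each pixel's m observations by their L2 norm across the images cancels the spatially varying albedo. This is a minimal sketch of the idea only; the details in PS-FCN (Norm.) differ.

```python
import torch

def normalize_observations(images, eps=1e-8):
    """Per-pixel observation normalization (sketch of the idea).

    images: (m, C, H, W) stack of images captured under m lightings.
    Under I_j = albedo * max(n . l_j, 0), dividing by the per-pixel L2 norm
    across the m images removes the albedo factor.
    """
    norm = images.pow(2).sum(dim=0, keepdim=True).sqrt()  # (1, C, H, W)
    return images / (norm + eps)
```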

Problem of blurry details

All-pixel methods may produce blurred reconstructions in complex-structured regions, mainly because the widely used Euclidean-based loss functions can hardly constrain high-frequency (i.e., complex-structured) representations, due to the "regression-to-the-mean" problem. Attention-PSN [paper, code] and NormAttention-PSN [paper, code] proposed an attention-weighted loss that learns an adaptive per-pixel weight for a detail-preserving gradient loss in high-frequency regions, producing more detailed reconstructions.
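A hypothetical sketch of such an attention-weighted objective is shown below. The exact weighting and gradient terms in Attention-PSN differ; `attn` is assumed to be a learned per-pixel map in [0, 1].

```python
import torch

def attention_weighted_loss(pred_n, gt_n, attn):
    """Blend a cosine (normal) loss with a detail-preserving gradient loss,
    weighted per pixel by a learned attention map (illustrative sketch).

    pred_n, gt_n: (B, 3, H, W) unit normal maps; attn: (B, 1, H, W) in [0, 1].
    """
    cos = 1.0 - (pred_n * gt_n).sum(dim=1, keepdim=True)        # (B,1,H,W)
    # Image-space gradients, cropped to a common (H-1, W-1) region.
    dx = lambda x: x[..., 1:, 1:] - x[..., 1:, :-1]
    dy = lambda x: x[..., 1:, 1:] - x[..., :-1, 1:]
    grad = (dx(pred_n) - dx(gt_n)).abs().sum(1, keepdim=True) \
         + (dy(pred_n) - dy(gt_n)).abs().sum(1, keepdim=True)   # (B,1,H-1,W-1)
    a = attn[..., 1:, 1:]
    # High attention emphasizes the gradient (detail) term in complex regions.
    return (a * grad + (1.0 - a) * cos[..., 1:, 1:]).mean()
```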

Problem of fusion efficiency

The fusion mechanism of all-pixel methods, i.e., max-pooling, discards a large number of features from the input, reducing the utilization of information and affecting the estimation accuracy. MF-PSN [paper, code] introduces a multi-feature fusion network that applies max-pooling at different feature levels, in both shallow and deep layers, to capture richer information. CHR-PSN [paper, code] extends max-pooling to various scales with different receptive fields, rather than different depths. HPS-Net [paper, code] introduces a bilateral extraction module that outputs positive and negative information before aggregation to better preserve useful data.
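The multi-level idea can be sketched as applying the max-pooling aggregation at more than one depth. The layer sizes and the way the fused feature is fed back are assumptions for illustration, not MF-PSN's actual architecture.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Max-pool over the image axis at a shallow AND a deep feature level
    (illustrative sketch of the multi-level idea; channel sizes assumed)."""

    def __init__(self, c_in=3, c_shallow=64, c_deep=128):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(c_in, c_shallow, 3, padding=1), nn.ReLU())
        self.deep = nn.Sequential(nn.Conv2d(c_shallow, c_deep, 3, padding=1), nn.ReLU())

    def forward(self, images):                # images: (m, C, H, W)
        f1 = self.shallow(images)             # per-image shallow features
        g1, _ = f1.max(dim=0, keepdim=True)   # shallow fused feature
        f2 = self.deep(f1 + g1)               # fused context informs deep features
        g2, _ = f2.max(dim=0)                 # deep fused feature (c_deep, H, W)
        return g2
```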

Hybrid methods

Hybrid approaches that combine these strategies may inherit the benefits of both per-pixel and all-pixel techniques. MT-PS-CNN [paper, code] proposes a two-stage photometric stereo model that constructs inter-frame (per-pixel) and intra-frame (all-pixel) representations. HT21 [paper, code] builds on observation maps but incorporates spatial information through 2D and 4D separable convolutions to better capture global effects. Similarly, PSMF-PSN [paper, code] extracts per-pixel and all-pixel features in tandem. GPS-Net [paper, code] introduced a structure-aware graph convolutional network to establish connections among an arbitrary number of observations per pixel.

Categorization Based on Network Architectures

Most deep learning-based calibrated photometric stereo networks are built on convolutional architectures, often using advanced backbones such as ResNet, DenseNet, and HR-Net. In recent years, Transformers with self-attention modules have also been employed for photometric stereo. SPS-Net [paper, code] was the first to propose a self-attention photometric stereo network, which aggregates photometric information through a self-attention mechanism. PS-Transformer [paper, code] then designed a dual-branch architecture to explore pixel- and image-wise features for sparse photometric stereo images.

The photometric stereo task can leverage the self-attention module effectively. In theory, the surface normal at a point depends only on that point's own observations, not on its relationship with distant points. However, due to shadows and inter-reflections, capturing long-range context becomes essential for accurate feature extraction. Therefore, Transformer-based photometric stereo models benefit both from the non-local information acquired through self-attention and from the local context embedded by traditional convolutional layers.
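As an illustration, the sketch below aggregates per-image features with self-attention along the lighting axis, treating each pixel's m observations as a token sequence. This is only a schematic of the mechanism; the dimensions and final pooling are assumptions, not the SPS-Net or PS-Transformer architecture.

```python
import torch
import torch.nn as nn

class LightAxisAttention(nn.Module):
    """Self-attention across the m input images at every pixel
    (schematic sketch; dimensions and pooling are assumptions)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                 # feats: (m, C, H, W)
        m, c, h, w = feats.shape
        # One sequence of m tokens per pixel: batch = H*W pixels.
        tokens = feats.permute(2, 3, 0, 1).reshape(h * w, m, c)
        out, _ = self.attn(tokens, tokens, tokens)     # attend across images
        out = out.max(dim=1).values                    # order-agnostic pooling
        return out.reshape(h, w, c).permute(2, 0, 1)   # (C, H, W)
```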

Categorization Based on Supervision

We summarize the differences among supervised, self-supervised, and multi-supervised photometric stereo networks below.

Supervised photometric stereo methods

Many deep photometric stereo networks have been proposed within the supervised framework. A few methods (DPSN [paper, code], CNN-PS [paper, code], LMPS [paper, code]) used the L2 (mean squared error) loss for training, while the others applied the cosine loss.
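For reference, the widely used cosine loss penalizes the angular deviation between predicted and ground-truth normals; a minimal sketch:

```python
import torch

def cosine_loss(pred_n, gt_n, eps=1e-8):
    """1 - cos(angle) between predicted and ground-truth normals.

    pred_n, gt_n: (B, 3, H, W) normal maps (normalized here for safety).
    """
    pred = pred_n / (pred_n.norm(dim=1, keepdim=True) + eps)
    gt = gt_n / (gt_n.norm(dim=1, keepdim=True) + eps)
    return (1.0 - (pred * gt).sum(dim=1)).mean()
```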

Self-supervised photometric stereo methods

Measuring the surface normals of real objects is difficult and expensive: it requires high-precision 3D scanners to reconstruct the ground-truth shape, and considerable manpower to align the viewpoints between surface-normal maps and the captured images (pixel alignment). Synthetic training data is an alternative, but it still takes effort to render photo-realistic images. A self-supervised learning strategy can overcome these shortcomings to some extent. IRPS [paper, code] first proposed a self-supervised convolutional network that takes the whole set of images as input and generates surface normals directly by minimizing the reconstruction loss between the input images and images re-rendered via the rendering equation. KS21 [paper, code] further extended this approach to handle inter-reflections by explicitly modeling the concave and convex parts of a complex surface. Recently, LL22a [paper, code] proposed a coordinate-based deep network to parameterize the unknown surface normal and the unknown reflectance at every surface point; it learns a series of neural specular basis functions to fit the observed specularities and explicitly parameterizes shadowed regions by tracing the estimated depth map.
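The self-supervised objective can be sketched with a simple Lambertian image-formation model. IRPS and its successors use much richer reflectance and shadow modeling, so everything below is an illustrative simplification.

```python
import torch

def reconstruction_loss(pred_n, albedo, light_dirs, images):
    """Re-render the inputs with a Lambertian model and compare to the
    observations (illustrative simplification of the self-supervised idea).

    pred_n:     (3, H, W) predicted unit normals
    albedo:     (1, H, W) predicted albedo
    light_dirs: (m, 3) unit lighting directions
    images:     (m, 1, H, W) observed grayscale images
    """
    # Lambertian shading: max(n . l, 0) per pixel and per light.
    shading = torch.einsum("mk,khw->mhw", light_dirs, pred_n).clamp(min=0.0)
    rendered = albedo * shading.unsqueeze(1)       # (m, 1, H, W)
    return (rendered - images).abs().mean()
```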

Multi-supervised photometric stereo methods

IS22 [paper, code] proposed a network that deconstructs the observation map into physically interpretable components, such as surface normal, surface roughness, and surface base color, which are then recombined through a physical image-formation model; the training loss therefore combines a normal reconstruction loss with an inverse rendering loss. DR-PSN [paper, code] introduced a dual regression network for calibrated photometric stereo, combining the surface-normal constraint with a constraint on the reconstructed re-lit image. Additionally, GR-PSN [paper, code] utilized a parallel framework that simultaneously learns two arbitrary materials for an object, with an additional material-transform loss. These methods employ an inverse subnetwork that re-renders images from the predicted surface normals using a CNN, rather than following the rendering equation.
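Schematically, such a multi-supervised objective combines the two terms. The balancing weight `lam` and the provenance of `rerendered` (the output of a learned re-rendering subnetwork) are assumptions for illustration.

```python
import torch

def multi_supervised_loss(pred_n, gt_n, rerendered, images, lam=0.5):
    """Supervised normal (cosine) term plus an image-reconstruction term
    from a learned re-rendering subnetwork (illustrative sketch).

    pred_n, gt_n: (B, 3, H, W); rerendered, images: (B, C, H, W).
    """
    pred = pred_n / (pred_n.norm(dim=1, keepdim=True) + 1e-8)
    gt = gt_n / (gt_n.norm(dim=1, keepdim=True) + 1e-8)
    normal_term = (1.0 - (pred * gt).sum(dim=1)).mean()
    recon_term = (rerendered - images).abs().mean()
    return normal_term + lam * recon_term
```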

Data sets for photometric stereo

Training data sets

Blobby and Sculpture data sets, rendered with MERL BRDFs [download]

CyclesPS data set, rendered with Disney's principled BSDFs [download]

Testing data sets

Gourd&Apple data set [paper, download]

Light Stage Data Gallery [paper, download]

DiLiGenT data set [paper, download]

DiLiGenT-10^2 data set [paper, download]

DiLiGenT-Pi data set [paper, download]

Benchmark Evaluation

Performance on the DiLiGenT benchmark with 96 images, measured by MAE in degrees. The compared methods are ranked by the MAE averaged over the ten objects.

Performance on the DiLiGenT benchmark with 10 images, measured by MAE in degrees. The compared methods are ranked by the MAE averaged over the ten objects.

Visual Comparisons

Quantitative results for object BALL, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object BEAR, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object BUDDHA, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object CAT, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object COW, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object GOBLET, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object HARVEST, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object POT1, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object POT2, tested with 96 input images, in terms of MAE in degrees.

Quantitative results for object READING, tested with 96 input images, in terms of MAE in degrees.
