Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/pdf/2212.08414.pdf
@article{ju2024deep,
title={Deep Learning Methods for Calibrated Photometric Stereo and Beyond},
author={Ju, Yakun and Lam, Kin-Man and Xie, Wuyuan and Zhou, Huiyu and Dong, Junyu and Shi, Boxin},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
The purpose of this project is to continuously collect and update deep learning-based calibrated photometric stereo methods.
The first deep learning method, DPSN [paper, Code], uses a seven-layer fully-connected network and requires the order of illuminations and the number of input images to stay fixed between training and testing. Many methods were subsequently proposed to handle a varying number of inputs. In this section, we classify these methods based on how they process the inputs.
The per-pixel strategy was first implemented using the observation map in CNN-PS [paper, code].
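To make the observation-map idea concrete, here is a minimal NumPy sketch: each pixel's M intensities are scattered onto a w x w grid indexed by the (x, y) components of the corresponding light directions. The grid size, nearest-integer rounding, and overwrite-on-collision behavior are simplifying assumptions for illustration, not CNN-PS's exact implementation.

```python
import numpy as np

def observation_map(intensities, light_dirs, size=32):
    """Build a per-pixel observation map (CNN-PS-style sketch).

    intensities: (M,) observed values at one pixel under M lights
    light_dirs:  (M, 3) unit lighting directions (x, y, z), z > 0
    size:        resolution w of the w x w map
    """
    obs = np.zeros((size, size), dtype=np.float32)
    # Normalize intensities so the map is invariant to overall brightness.
    scale = intensities.max()
    if scale > 0:
        intensities = intensities / scale
    # Project (x, y) in [-1, 1] onto integer grid coordinates.
    u = np.clip(((light_dirs[:, 0] + 1) / 2 * (size - 1)).round().astype(int), 0, size - 1)
    v = np.clip(((light_dirs[:, 1] + 1) / 2 * (size - 1)).round().astype(int), 0, size - 1)
    obs[v, u] = intensities  # later lights overwrite grid collisions
    return obs
```

A CNN then regresses the surface normal from this fixed-size map, which is what makes the approach independent of the number and order of input images.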
Several works address the sparse-input problem, such as SPLINE-Net [paper, code] and LMPS [paper, code]. These two methods adopt opposite strategies: SPLINE-Net proposes a lighting interpolation network that generates dense lighting observation maps from sparse inputs, while LMPS reduces the demand on the number of images by learning only the critical illumination conditions via a connection table.
Original per-pixel methods, designed to represent a single pixel, do not explicitly incorporate information from the surrounding pixel neighborhood. PX-Net [paper, code] proposed an observation-map-based method that accounts for global illumination effects, such as self-reflections, surface discontinuities, and ambient light, which enables global information to be embedded in the per-pixel generation process.
All-pixel methods process all pixels together, which has the advantage of exploiting intra-image intensity variations across an entire input image. The first all-pixel method, PS-FCN [paper, code], fuses features from an arbitrary number of inputs with a max-pooling layer that operates in the channel dimension.
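The max-pooling fusion is easy to sketch: stacking the per-image feature maps and taking an element-wise maximum yields a fixed-size result regardless of how many images were given, and the result is invariant to their order. A minimal NumPy sketch (the real PS-FCN pools learned CNN features, not raw arrays):

```python
import numpy as np

def fuse_features(feature_list):
    """Order-agnostic fusion of per-image feature maps (PS-FCN-style sketch).

    feature_list: list of (C, H, W) arrays, one per input image; the list
    length may vary between calls, which is exactly what max-pooling handles.
    """
    stacked = np.stack(feature_list, axis=0)  # (N, C, H, W)
    return stacked.max(axis=0)                # (C, H, W), independent of N
```

Because the maximum is both permutation-invariant and defined for any N, the same trained network can accept 10 or 96 input images without modification.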
Since all-pixel methods leverage convolutional networks to process the input in a patch-based manner, they may have difficulty dealing with steep color changes caused by surfaces with spatially varying materials. PS-FCN (Norm.) [paper, code] proposed an observation normalization method to eliminate the impact of changing albedo, and NormAttention-PSN [paper, code] further addressed the normalization problem on strongly non-Lambertian surfaces.
All-pixel methods may produce blurred reconstructions in complex-structured regions, mainly because the widely used Euclidean-based loss functions can hardly constrain high-frequency (i.e., complex-structured) representations, due to the "regression-to-the-mean" problem. Attention-PSN [paper, code] and NormAttention-PSN [paper, code] proposed an attention-weighted loss that learns an adaptive per-pixel weight for a detail-preserving gradient loss in high-frequency regions, producing more detailed reconstructions.
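A rough sketch of such an attention-weighted loss is given below. The per-pixel attention map is taken as an input here for simplicity; in the papers it is predicted by the network itself, and the exact gradient term differs. The sketch blends a gradient (detail) term with a cosine (angular) term using the attention weights:

```python
import numpy as np

def attention_weighted_loss(pred, gt, attention):
    """Attention-weighted loss (sketch of the Attention-PSN idea).

    pred, gt:  (H, W, 3) surface normal maps
    attention: (H, W) weights in [0, 1], expected to be large in
               high-frequency (complex-structured) regions
    """
    def fwd_grads(n):
        gx = np.zeros_like(n)
        gy = np.zeros_like(n)
        gx[:, :-1] = n[:, 1:] - n[:, :-1]   # horizontal forward difference
        gy[:-1, :] = n[1:, :] - n[:-1, :]   # vertical forward difference
        return gx, gy

    cos_err = 1.0 - (pred * gt).sum(axis=2)                          # (H, W) angular term
    pgx, pgy = fwd_grads(pred)
    ggx, ggy = fwd_grads(gt)
    grad_err = (np.abs(pgx - ggx) + np.abs(pgy - ggy)).sum(axis=2)   # (H, W) detail term
    # High-attention pixels are trained mostly against the detail-preserving
    # gradient term; the rest mostly against the plain angular term.
    return float(np.mean(attention * grad_err + (1.0 - attention) * cos_err))
```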
The fusion mechanism of all-pixel methods, i.e., max-pooling, discards a large number of features from the inputs, reducing information utilization and affecting estimation accuracy. MF-PSN [paper, code] introduces a multi-feature fusion network that applies max-pooling at different feature levels, in both shallow and deep layers, to capture richer information. CHR-PSN [paper, code] extends max-pooling to various scales with different receptive fields, rather than different depths. HPS-Net [paper, code] introduces a bilateral extraction module that outputs positive and negative information before aggregation, to better preserve useful data.
Hybrid approaches combining these strategies may enjoy the benefits of both per-pixel and all-pixel techniques. MT-PS-CNN [paper, code] proposes a two-stage photometric stereo model that constructs inter-frame (per-pixel) and intra-frame (all-pixel) representations. HT21 [paper, code] builds upon observation maps but incorporates spatial information using 2D and 4D separable convolutions to better capture global effects. Similarly, PSMF-PSN [paper, code] introduces a tandem pipeline for per-pixel and all-pixel feature extraction. GPS-Net [paper, code] introduces a structure-aware graph convolutional network to establish connections between an arbitrary number of observations per pixel.
Most deep learning-based calibrated photometric stereo networks are built on convolutional architectures; advanced backbones such as ResNet, DenseNet, and HR-Net are widely used. In recent years, Transformers with self-attention modules have also been employed for photometric stereo. SPS-Net [paper, code] was the first to propose a self-attention photometric stereo network, which aggregates photometric information through a self-attention mechanism. PS-Transformer [paper, code] then designed a dual-branch network to explore pixel-wise and image-wise features for sparse photometric stereo images.
The photometric stereo task can leverage the self-attention module effectively. In theory, the surface normal at a point depends only on that point, not on its relationship with distant points. In practice, however, shadows and inter-reflections make capturing long-range context essential for accurate feature extraction. Therefore, Transformer-based photometric stereo models can benefit both from the non-local information acquired through self-attention and from the local context embedded by traditional convolutional layers.
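The aggregation step can be sketched as plain scaled dot-product self-attention over one pixel's M per-light features. For brevity the query, key, and value are all the raw tokens; a real SPS-Net-style model learns separate projection matrices for each:

```python
import numpy as np

def self_attention(tokens):
    """Scaled dot-product self-attention over one pixel's M observations.

    tokens: (M, d) feature vectors, one per light. Returns (M, d) features
    in which each observation has mixed in information from all others.
    """
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)       # (M, M) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
    return weights @ tokens                       # (M, d) attended features
```

Because the softmax-weighted sum is defined for any M, this aggregation, like max-pooling, accepts an arbitrary number of observations; unlike max-pooling, it lets every observation attend to every other one.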
We summarize the differences among supervised, self-supervised, and multi-supervised photometric stereo networks. Plenty of deep photometric stereo networks have been proposed within the supervised framework. A few methods (DPSN [paper, Code], CNN-PS [paper, code], LMPS [paper, code]) utilized the L2 loss (mean squared error loss) to optimize the training, while the others applied the cosine loss.
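The two loss choices can be written down directly. The cosine loss penalizes the angular deviation between predicted and ground-truth normals and is invariant to vector length, while the L2 loss penalizes per-component differences:

```python
import numpy as np

def cosine_loss(pred, gt):
    """Mean (1 - cos) loss between predicted and ground-truth unit normals.

    pred, gt: (P, 3) unit normal vectors; 0 when they coincide,
    2 when they point in opposite directions.
    """
    cos = (pred * gt).sum(axis=1)
    return float(np.mean(1.0 - cos))

def l2_loss(pred, gt):
    """Mean squared error loss, as used by DPSN / CNN-PS / LMPS."""
    return float(np.mean((pred - gt) ** 2))
```

For unit vectors the two are monotonically related (||a - b||^2 = 2(1 - cos)), so both reward the same optimum; the cosine form maps more directly onto the angular-error metric used for evaluation.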
Measuring the surface normals of real objects is difficult and expensive: it requires high-precision 3D scanners to reconstruct the ground-truth shape, and considerable manpower to align the viewpoints between surface normal maps and the multiple images (pixel alignment). Synthetic training data is a possible alternative, but still requires effort to generate photo-realistic images. A self-supervised learning strategy can overcome these shortcomings to some extent. IRPS [paper, Code] first proposed a self-supervised convolutional network that takes the whole set of images as input. The model directly generates surface normals by minimizing the reconstruction loss between the input images and images re-rendered via the rendering equation. KS21 [paper, Code] further extended this to handle inter-reflections by explicitly modeling the concave and convex parts of a complex surface. Recently, LL22a [paper, Code] proposed a coordinate-based deep network that parameterizes the unknown surface normal and reflectance at every surface point. The method learns a series of neural specular basis functions to fit the observed specularities and explicitly parameterizes shadowed regions by tracing the estimated depth map.
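The self-supervision signal can be sketched under a simple Lambertian image formation model; the actual methods above handle general reflectance, shadows, and inter-reflections, so this is only the skeleton of the idea:

```python
import numpy as np

def render_lambertian(normals, albedo, light_dirs):
    """Re-render images with the Lambertian rendering equation (sketch).

    normals:    (P, 3) unit surface normals for P pixels
    albedo:     (P,)   per-pixel albedo
    light_dirs: (M, 3) unit lighting directions
    Returns (M, P) re-rendered intensities.
    """
    shading = np.maximum(light_dirs @ normals.T, 0.0)  # (M, P), attached shadows clamped
    return shading * albedo[None, :]

def reconstruction_loss(pred_normals, albedo, light_dirs, observed):
    """Self-supervised loss: discrepancy between re-rendered and input images."""
    rendered = render_lambertian(pred_normals, albedo, light_dirs)
    return float(np.mean((rendered - observed) ** 2))
```

Minimizing this loss needs no ground-truth normals at all: the network's normal (and albedo) estimates are judged purely by how well they explain the captured images, which is what makes the strategy self-supervised.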
IS22 [paper, Code] proposed a network that deconstructs the observation map into physically interpretable components, such as surface normal, surface roughness, and surface base color. These components are then integrated via the physical image formation model, so the training loss consists of a normal reconstruction loss and an inverse rendering loss. DR-PSN [paper, Code] introduced a dual regression network for calibrated photometric stereo, combining the surface-normal constraint with a constraint on the reconstructed re-lit image. Additionally, GR-PSN [paper, Code] utilized a parallel framework to simultaneously learn two arbitrary materials for an object, with an additional material transform loss. These methods employ an inverse subnetwork that re-renders images from the predicted surface normals, using CNNs rather than the rendering equation.
Blobby and Sculpture data set, rendered by MERL BRDFs. Download Web
CyclesPS data set, rendered by Disney’s principled BSDFs. Download Web
Gourd&Apple data set [paper, Download Web]
Light Stage Data Gallery [paper, Download Web]
DiLiGenT data set [paper, Download Web]
DiLiGenT-10^2 data set [paper, Download Web]
DiLiGenT-Pi data set [paper, Download Web]
Performance on the DiLiGenT benchmark, measured in terms of MAE in degrees; the compared methods are ranked by the average MAE of the ten objects:
Overall results with 96 input images.
Overall results with 10 input images.
Per-object results with 96 input images for BALL, BEAR, BUDDHA, CAT, COW, GOBLET, HARVEST, POT1, POT2, and READING.