# Summary of papers

by Peter Trost  
Rottenburg (Germany), 10/2018  
peter_trost93@yahoo.de  
or peter.trost@student.uni-tuebingen.de (until ~08/2019)

convert to slides: `jupyter nbconvert *.ipynb --to slides --post serve`

multi view stereo (MVS)

### A Naturalistic Open Source Movie for Optical Flow Evaluation  
  
*by Butler et al.*

### Overview

- dataset for optical flow evaluation derived from 3D animated short film *Sintel*
- contains long sequences, large motions, specular reflections, motion blur, defocus blur, atmospheric effects and more
- scenes of open source graphics data rendered in varying complexity
- can be used to improve optical flow methods

### Render passes

- Albedo Pass: Flat, unshaded, surfaces exhibit constant albedo over time
- Clean Pass: Illumination including smooth shading and specular reflections adds realism
- Final Pass: Full rendering with all effects including blur due to camera depth of field and motion, and atmospheric effects.

### Main aspects

*compared to the Middlebury flow benchmark*

- Difficulty: Sintel dataset contains varying and more challenging (for existing methods) scenes
- Sequence Length: 50 frames (2 to 8 in Middlebury) with 49 ground truth flow fields (only for one pair per sequence in Middlebury)
- Amount of Data: 1628 frames of ground truth flow (100 times Middlebury), with 564 for test and 1064 for training (good for machine learning)
- Image Resolution: 1024 x 436 px (45% to 100% more than Middlebury frames)
- Large Motions: well over 100 pixels per frame (Middlebury max 35 ppf)
- Blur: Middlebury doesn't contain motion or defocus blur, Sintel contains renders with and without them
- Motion Boundaries and Occluded Regions: new definition of motion boundaries, new error measure (function of distance from boundaries)
- Real-World Challenges: lighting variations, shadows, complex materials and reflections and more contained in Sintel
- Transparency: no transparency included
- Ranking: ranked according to average endpoint error (EPE) in different challenge categories

- Blender's internal motion blur pipeline was modified to give accurate motion vectors at each pixel which provide ground truth optical flow maps
- clips were selected so that the optical flow is realistic; one still has to be cautious when training and evaluating algorithms that rely strongly on real-world laws of physics
- images are saved as 8-bit PNG files; the frame rate is 24 fps

### Summary
The MPI-Sintel Flow Data Set is publicly available at http://sintel.is.tue.mpg.de and includes:
- image sequences for Albedo, Clean and Final passes.
- for training set: forward flow fields (floating point and color image visualizations), occlusion boundary masks, unmatched pixel masks and invalid pixel maps
- software to compute various error statistics on the training data

### Image statistics

Sintel dataset is compared to datasets of lookalike clips with natural scenes

- images are converted to grey-scale $I(x,y) \in [0,255]$ to compare luminance statistics. The Kullback-Leibler divergence from Sintel to the lookalikes is 0.058, smaller than that from Middlebury to the lookalikes (0.176)
- (?) Spatial power spectra (p. 10 last paragraph)
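
The luminance comparison above can be sketched as a Kullback-Leibler divergence between two grey-value histograms. This is a minimal illustration, not the paper's exact procedure: the bin count, the natural logarithm, the `eps` guard for empty bins and the direction of the divergence are all assumptions, and the function names are made up for this example.

```python
import math

def luminance_histogram(pixels, bins=256):
    """Normalized histogram of integer grey values in [0, bins - 1]."""
    counts = [0] * bins
    for value in pixels:
        counts[value] += 1
    total = len(pixels)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats; eps guards against empty bins in q (an assumption)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy "images": flat lists of grey values standing in for real frames.
p = luminance_histogram([0, 0, 255, 255])
q = luminance_histogram([0, 255, 255, 255])
d_pq = kl_divergence(p, q)
print(d_pq)
```

A smaller value means the two luminance distributions are more alike, which is how the 0.058 vs. 0.176 comparison is read.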

### Analysis

flow was computed on the Sintel dataset with the following publicly available methods:
- Classic+NL (good modern method)
- Classic++ (standard robust method)
- LDOF (designed to deal with large displacement optical flow)
- Horn and Schunck (HS) (classic method)
- Anisotropic Huber-L1 Flow (GPU-based optical flow method)

Results:
average endpoint error (EPE) for these methods on Middlebury is $\leq$ 0.5. For Sintel it's between 7.43 and 12.67.

![](./images/Sintel_analysis_table.png)
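
For reference, the average endpoint error is the mean Euclidean distance between estimated and ground-truth flow vectors. The sketch below assumes flow fields given as flat lists of `(u, v)` tuples, which is a simplification chosen for this example; the benchmark's own evaluation software may handle masks and invalid pixels differently.

```python
import math

def average_endpoint_error(flow_est, flow_gt):
    """Mean Euclidean distance between estimated and ground-truth flow vectors.

    Both arguments are flat sequences of (u, v) tuples, one per pixel.
    """
    if len(flow_est) != len(flow_gt):
        raise ValueError("flow fields must have the same number of pixels")
    errors = [math.hypot(ue - ug, ve - vg)
              for (ue, ve), (ug, vg) in zip(flow_est, flow_gt)]
    return sum(errors) / len(errors)

# Toy example: ground truth moves every pixel by (3, 4); the estimate predicts no motion.
flow_gt = [(3.0, 4.0)] * 6
flow_est = [(0.0, 0.0)] * 6
epe = average_endpoint_error(flow_est, flow_gt)
print(epe)  # 5.0
```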

### Playing for Data: Ground Truth from Computer Games  
  
*by Richter et al.*

### Overview

- uses detouring (inject wrapper between game and OS) to record, modify and reproduce rendering commands from Grand Theft Auto 5 (GTA5)
- hashes distinct rendering resources (geometry, textures, shaders) to create object signatures; these yield pixel-accurate object labels and allow labels to be propagated across time and across instances that share distinctive resources
- contains 25,000 images from GTA5 with pixel-level semantic segmentation ground-truth
- labeling took 49 hours (3 orders of magnitude faster than other semantic segmentation datasets with similar annotation density)
- because object labels propagate across images, the annotation time per image drops sharply as more images are annotated (in earlier labeling interfaces it stays constant, so total effort grows linearly with the size of the dataset)
- uses labeling compatible with other datasets for urban scene understanding
- models trained with game data and $\frac{1}{3}$ of CamVid dataset outperform models trained with full CamVid dataset

### Extracting information from the rendering pipeline

- Games communicate with the hardware through APIs (e.g. OpenGL, Direct3D, Vulkan)
- Authors implemented a wrapper for the DirectX 9 API and used RenderDoc to wrap Direct3D 11
- enables them to monitor creation, modification and deletion of resources used to specify the scene and synthesize an image
- recorded every 40th frame during GTA5 gameplay
- RenderDoc was modified to record data in a format suitable for annotation
- data collected during gameplay session is processed in batch afterwards (30 seconds per frame)

### Challenges:
- identifying relevant function calls:  
group calls by the G-buffers that were assigned as render targets
- identifying resources:  
hash (128-bit key, non-cryptographic) memory content describing a mesh to recognize same meshes between different gaming sessions. Volatile resource IDs are then mapped to persistent hash keys
- Formatting annotation:  
use two render passes. The first is the conventional render pass; the second encodes IDs into pixels so that each pixel contains the resource IDs for the mesh, texture and shader of the scene element imaged at that pixel. Four render targets with three 8-bit color channels each are used, giving 96 bits per pixel, enough for the three 32-bit resource IDs. These IDs are then mapped to the 128-bit hash keys
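
The bit budget in the last item (4 targets x 3 channels x 8 bits = 96 bits = 3 x 32-bit IDs) can be checked with a small round-trip sketch. The byte order and the assignment of bytes to channels are assumptions made for illustration, and BLAKE2b is only a stand-in for the 128-bit key: the notes say the real hash is non-cryptographic but do not name it.

```python
import hashlib

def persistent_key(resource_bytes):
    """128-bit content hash as hex. BLAKE2b is a stand-in (assumption)."""
    return hashlib.blake2b(resource_bytes, digest_size=16).hexdigest()

def encode_ids(mesh_id, texture_id, shader_id):
    """Pack three 32-bit IDs into four RGB pixels (4 targets x 3 channels x 8 bits)."""
    payload = b"".join(i.to_bytes(4, "big") for i in (mesh_id, texture_id, shader_id))
    return [tuple(payload[k:k + 3]) for k in range(0, 12, 3)]

def decode_ids(pixels):
    """Recover the (mesh, texture, shader) IDs from the four RGB pixels."""
    payload = bytes(channel for pixel in pixels for channel in pixel)
    return tuple(int.from_bytes(payload[k:k + 4], "big") for k in range(0, 12, 4))

# Volatile per-session IDs (hypothetical values) mapped to persistent content hashes.
volatile_to_key = {17: persistent_key(b"mesh vertex data")}

ids = (17, 305419896, 2**32 - 1)
pixels = encode_ids(*ids)
decoded = decode_ids(pixels)
print(pixels, decoded)
```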

### Semantic Labeling

- patch decomposition:  
images are decomposed into patches of pixels that share the same mesh, texture and shader (MTS). Objects generally consist of multiple patches; a patch is usually contained within a single object, is associated with an underlying surface in the scene and is linked to other patches that depict the same surface. Patch boundaries coincide with semantic class boundaries, so grouping patches creates pixel-accurate label maps
![](./images/PlayingForData_patching_illustration.PNG)

- association rule mining:
uses statistical regularities in the associations between resources and semantic labels to label other patches that use the same resource as well (e.g. car meshes are highly likely to be used only on cars)
- annotation process:
uses an annotation tool: the annotator clicks the button of a semantic class, then the patch to be labeled in the image. The tool automatically propagates the annotation to other images when the conditions above are met. Annotation time per image decreases during the process (the tool presents only images that have more than 3% of their area unlabeled). Small areas that are unrecognizable in one image are also likely to be labeled by propagation once later images containing them are labeled.
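
The three steps above can be sketched on toy data. The example assumes each pixel already carries its MTS triple and treats a patch as the set of pixels in one image that share a triple. It also assumes a label propagates whenever the full triple matches, which is a deliberate simplification of the statistical association rules; all names and ID values are hypothetical.

```python
from collections import defaultdict

def decompose_into_patches(mts_image):
    """Map each MTS triple to the pixel coordinates that show it."""
    patches = defaultdict(list)
    for y, row in enumerate(mts_image):
        for x, mts in enumerate(row):
            patches[mts].append((y, x))
    return dict(patches)

def propagate_labels(mts_image, mts_to_label):
    """Label every pixel whose MTS triple has a known class; others stay None."""
    return [[mts_to_label.get(mts) for mts in row] for row in mts_image]

def unlabeled_fraction(label_map):
    """Share of pixels that still have no label."""
    flat = [label for row in label_map for label in row]
    return sum(label is None for label in flat) / len(flat)

SKY, CAR, ROAD = (9, 9, 9), (2, 5, 3), (1, 1, 1)  # hypothetical MTS triples
image_a = [[SKY, SKY], [CAR, ROAD]]
image_b = [[SKY, CAR], [ROAD, ROAD]]

patches_a = decompose_into_patches(image_a)
# The annotator labels the car and road patches in image A only.
mts_to_label = {CAR: "car", ROAD: "road"}
labels_b = propagate_labels(image_b, mts_to_label)
needs_review = unlabeled_fraction(labels_b) > 0.03  # 3% threshold from the notes
print(labels_b, needs_review)
```

Image B is never clicked on, yet three of its four pixels arrive pre-labeled; only the sky patch keeps it above the 3% threshold.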

### Dataset and analysis
- image resolution 1914 x 1052
- 49 hours to label 98.3% of pixel area with corresponding classes
- per-image labeling time is 2-3 orders of magnitude lower than for the KITTI and CamVid datasets
- in 98.7% of the images, more than 90% of the area is already pre-annotated by the time the annotator reaches them
- high variability: each MTS combination appears in a median of 4 images, and 26.5% of MTS combinations occur in only one image each

### Semantic Segmentation

- training: the first stage uses real and synthetic data, with mini-batch stochastic gradient descent and mini-batches of 8 images (4 real, 4 synthetic): 50k iterations, learning rate $10^{-4}$, momentum 0.99, crop size 628x628, receptive field 373x373 pixels. The second stage fine-tunes for 4k iterations on real data only, with the same parameters
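
A fixed real/synthetic split per mini-batch can be sketched as a small generator. This only illustrates the batch composition (4 real + 4 synthetic); sampling without replacement within a batch, the shuffling, and all names are assumptions, and no actual training loop is shown.

```python
import random

def mixed_batches(real, synthetic, n_real=4, n_synth=4, iterations=3, seed=0):
    """Yield mini-batches with a fixed number of real and synthetic samples."""
    rng = random.Random(seed)
    for _ in range(iterations):
        batch = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic order
        yield batch

# Hypothetical file identifiers standing in for image crops.
real_images = [f"real_{i}" for i in range(10)]
synthetic_images = [f"synth_{i}" for i in range(10)]

batches = list(mixed_batches(real_images, synthetic_images))
for batch in batches:
    print(batch)
```

For the fine-tuning stage on real data only, the same generator could be called with `n_real=8, n_synth=0`.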


### SyB3R: A Realistic Synthetic Benchmark for 3D Reconstruction from Images  
  
*by Ley et al.*

source no. 8 provides dataset containing architectural objects

### Overview

- framework
- uses realistically rendered images
- Blender with path tracing (to simulate more complex light-surface interactions)
- camera parameters and 3D structure of the scene are known
- provides:  
  - automatic synthesis framework with full control over the scene and the image formation  
  - several datasets with a variety of challenging characteristics  
  - results of a few example experiments illustrating the benefits of synthetic but realistic benchmarks  
  - flexible and open-source C++ implementation of the proposed framework (http://andreas-ley.com/projects/SyB3R/)
- image generation is split into rendering the scene into a 2D image and a post-processing stage that implements the remaining effects in image space
- all steps implemented in modular manner (allows exchanging, reordering and turning modules off/on)

### Image Rendering
- uses "Cycles" in Blender (Monte-Carlo path tracer, accurate propagation of light through scene)
- produced images are stored as HDR (retains full floating-point precision of all intensity values)
- Cycles handles: scene properties (lighting, surface texture, ...), object motion, large camera motion, camera properties (focal length, principal point, resolution, depth of field (DoF), field of view)

### Image post-processing
  
chain of individual modules:
![chain of individual modules](./images/SyB3R_postProcessingChain.PNG)
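
The modular design (modules that can be exchanged, reordered and switched on or off) can be illustrated with a toy chain. This is not SyB3R's C++ implementation: the three modules below (exposure gain, Gaussian noise, gamma plus 8-bit quantization) and their parameters are hypothetical stand-ins, and the "image" is just a flat list of HDR intensities.

```python
import random

def exposure(gain):
    """Scale linear HDR intensities by a constant gain."""
    def module(image):
        return [value * gain for value in image]
    return module

def gaussian_noise(sigma, seed=0):
    """Add zero-mean Gaussian noise to every intensity."""
    rng = random.Random(seed)
    def module(image):
        return [value + rng.gauss(0.0, sigma) for value in image]
    return module

def quantize_8bit(gamma=2.2):
    """Clamp to [0, 1], apply gamma encoding, round to 8-bit integers."""
    def module(image):
        return [round(255 * min(max(v, 0.0), 1.0) ** (1.0 / gamma)) for v in image]
    return module

def run_chain(image, modules):
    """Apply the enabled modules in list order."""
    for enabled, module in modules:
        if enabled:
            image = module(image)
    return image

hdr = [0.0, 0.25, 1.0, 4.0]
chain = [
    (True, exposure(0.5)),
    (False, gaussian_noise(0.01)),  # switched off for this run
    (True, quantize_8bit()),
]
ldr = run_chain(hdr, chain)
print(ldr)
```

Reordering the list or flipping a flag changes the simulated camera without touching the rendered HDR input, which is the point of separating rendering from post-processing.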


### The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes  
  
*by Ros et al.*

### Overview

- synthetic dataset of urban scenes
- generated to aid semantic segmentation in context of autonomous driving
- consists of photo-realistic frames
- pixel-level semantic annotations for 13 classes
- frames acquired from multiple view-points
- frames contain associated depth-map
- generated by rendering virtual city created with Unity development platform
- includes four different seasons with drastic change of appearance (lighting, weather conditions,...)
- variety of illumination conditions

### SYNTHIA-Rand
- consists of 13,400 frames of the city
- from camera randomly moving around city

### SYNTHIA-Seq
- four video sequences
- approx. 50,000 frames each
- one per season
- simulating a car moving through city (includes interaction with objects, slowing down, speeding up)
- omnidirectional view on demand (cameras in all 4 directions)

### Results
- DCNNs trained solely on SYNTHIA-Rand frames (resized to 180x120 pixels) are almost as accurate as, or more accurate than, DCNNs trained on real data
- DCNNs trained on batches of 6 real and 4 synthetic images achieve better per-class accuracy than those trained solely on real data (up to 18.3 percentage points higher)
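
To make the comparison in percentage points concrete, per-class accuracy can be computed as below. The sketch assumes per-class accuracy means the fraction of ground-truth pixels of a class that are predicted as that class, which is a common definition but is not spelled out in these notes; the labels and predictions are invented toy data.

```python
def per_class_accuracy(predicted, ground_truth, classes):
    """Fraction of ground-truth pixels of each class that were predicted correctly."""
    accuracy = {}
    for cls in classes:
        total = sum(1 for gt in ground_truth if gt == cls)
        correct = sum(1 for p, gt in zip(predicted, ground_truth)
                      if gt == cls and p == cls)
        accuracy[cls] = correct / total if total else None
    return accuracy

ground_truth = ["road", "road", "road", "road", "car", "car"]
pred_real_only = ["road", "road", "road", "road", "road", "road"]
pred_mixed = ["road", "road", "road", "road", "road", "car"]

acc_real = per_class_accuracy(pred_real_only, ground_truth, ["road", "car"])
acc_mixed = per_class_accuracy(pred_mixed, ground_truth, ["road", "car"])
gain_car = 100 * (acc_mixed["car"] - acc_real["car"])  # gain in percentage points
print(acc_real, acc_mixed, gain_car)
```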