by Petar Veličković and Emma Rocheteau
Neural networks make your smartphone videos better... after you've filmed them
Smartphones have gradually revolutionised the way in which we record our most important events---with camera setups that are now, at a glance, capable of rivalling bespoke cameras. This has made cameras one of the centrepiece features of smartphones.
However, phone cameras are also bound to have deficiencies---arising from the constraints of squeezing them onto a mobile device. For example, they may not handle a certain lighting setup properly, resulting in over- or under-exposed photos and videos. These deficiencies generally become more pronounced towards the lower end of the price spectrum.
While it may seem as if all is lost when a photo/video is taken under poor conditions, neural networks may be leveraged to remedy the situation---sometimes substantially. To demonstrate this, we present N-hance (pls), (to the best of our knowledge) the first enhancing system for smartphone videos.
Here we make use of the latest advances in enhancing smartphone photos through deep learning, and build upon them to create a viable smartphone video enhancer. By doing so, we demonstrate:
- Seamless generalisation of the photo enhancement model into the video domain, requiring negligible post-processing;
- Favourable enhancement results outside of the input distribution the model was trained on---namely, on night-time recordings;
- Applicability of fine-tuning: a model pre-trained for one device can be adapted to another.
All of these features make our prototype a solid indicator of a potentially important future application area, providing a cost-effective way to obtain enhanced video recordings.
Recent work by Ignatov et al., published at ICCV 2017 and CVPR 2018, demonstrates that convolutional neural networks can be used as enhancers for smartphone camera photos, compensating for the specific shortcomings of the camera and making its output approach DSLR quality (at times, human assessors cannot tell the result apart from a DSLR photo).
The techniques rely on a medley of several topical trends in deep learning for computer vision: all-convolutional networks, neural style transfer and adversarial training. The enhancer network is:
- an all-convolutional network, implying that it can process input images of arbitrary dimensions;
- forced to preserve content of its input, by way of a content-based loss based on deep activations of a pre-trained object recognition network (as used in style transfer);
- driven to mimic the characteristics (such as colour and texture) of high-quality DSLR images through employing several discriminator networks.
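The first property above, that an all-convolutional network accepts inputs of arbitrary dimensions, can be illustrated with a minimal pure-Python sketch. This is not the enhancer architecture itself: `conv2d_same`, `relu` and `tiny_fully_conv_net` are hypothetical names, the images are single-channel lists of lists, and the kernels are hand-picked rather than learned. It only shows that a stack of zero-padded convolutions has no layer tied to a fixed input size.

```python
def conv2d_same(image, kernel):
    """Slide an odd-sized kernel over a 2-D image with zero padding.

    As in most deep learning frameworks, this is cross-correlation
    (the kernel is not flipped). The output has the same height and
    width as the input.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    # Pixels outside the image count as zero (padding).
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def relu(feature_map):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def tiny_fully_conv_net(image, kernels):
    """Apply conv + ReLU layers in sequence; there are no dense layers,
    so nothing in the network depends on the input resolution."""
    x = image
    for kernel in kernels:
        x = relu(conv2d_same(x, kernel))
    return x


if __name__ == "__main__":
    # Illustrative 3x3 averaging kernel (an assumption, not a learned filter).
    blur = [[1 / 9] * 3 for _ in range(3)]
    for height, width in [(4, 6), (7, 3)]:
        img = [[float(r + c) for c in range(width)] for r in range(height)]
        out = tiny_fully_conv_net(img, [blur, blur])
        print((height, width), "->", (len(out), len(out[0])))
```

The same two-layer network is applied to a 4x6 and a 7x3 input without any change. The same reasoning is why a network trained on image patches can later be applied to full-resolution frames.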
With minimal fine-tuning (in TensorFlow) of a pre-trained model for iPhone 3GS photos, we have been able to obtain a viable enhancer for the iPhone 5 (deliberately selected here to emphasise the potential of enhancement). From here, creating a video enhancer boiled down to extracting the frames from the input (using scikit-video) and processing them individually with the photo enhancer; only minimal post-processing was necessary to create a coherent and useful output. We further extended the original enhancer network to support batched inference, speeding up the video processing by a factor of four.
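The frame-by-frame pipeline with batched inference can be sketched roughly as follows. This is a minimal illustration, not our actual code: `batches`, `enhance_video` and `brighten_batch` are hypothetical names, frames are plain lists of integer pixel rows, the default batch size is arbitrary, and the placeholder enhancer stands in for a forward pass of the network.

```python
from typing import Callable, Iterator, List, Sequence

Frame = List[List[int]]  # hypothetical stand-in for one H x W video frame


def batches(frames: Sequence[Frame], batch_size: int) -> Iterator[Sequence[Frame]]:
    """Yield consecutive groups of at most batch_size frames, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(frames), batch_size):
        yield frames[start:start + batch_size]


def enhance_video(
    frames: Sequence[Frame],
    enhance_batch: Callable[[Sequence[Frame]], List[Frame]],
    batch_size: int = 8,  # arbitrary default, chosen only for illustration
) -> List[Frame]:
    """Run the enhancer once per batch instead of once per frame."""
    enhanced: List[Frame] = []
    for batch in batches(frames, batch_size):
        result = enhance_batch(batch)
        if len(result) != len(batch):
            raise ValueError("enhancer must return one output per input frame")
        enhanced.extend(result)  # frame order is preserved
    return enhanced


def brighten_batch(batch: Sequence[Frame]) -> List[Frame]:
    """Placeholder enhancer (an assumption): brighten 8-bit pixels by 40.

    A real enhancer would run the whole batch through the network
    in a single forward pass.
    """
    return [[[min(255, p + 40) for p in row] for row in frame] for frame in batch]


if __name__ == "__main__":
    video = [[[10 * t, 250]] for t in range(10)]  # ten tiny 1x2 frames
    result = enhance_video(video, brighten_batch, batch_size=4)
    print(len(result), "frames enhanced")
```

With scikit-video, the frames would typically come from `skvideo.io.vread`, which returns an array of shape (frames, height, width, channels), and the output could be written back with `skvideo.io.vwrite`. Grouping frames this way amortises the per-call overhead of the network, which is where the speed-up from batching comes from.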
Most surprisingly, we have found that the model fares well even when faced with inputs that drastically differ from the ones it was trained on---namely, the model provides a good level of enhancement on most night-time videos, even though the DPED dataset used to train it consists solely of daytime photos.
By leveraging a Flask server and an Azure instance, we have exposed a clear interface for submitting new videos to be processed, as well as analysing and downloading the results.
Having demonstrated that a photo enhancer can be feasibly generalised to videos---even ones taken outside of the training distribution---we believe that this exposes clear potential for a robust mobile application; one that we hope to explore in the coming months.