The field of Computer Vision (CV), which extracts information from images and videos, has made incredible strides over the past few years thanks to the Deep Learning revolution. The first large-scale CV task tackled with Deep Learning was classifying pictures that contain a single object (e.g. is this a picture of a cat or a car?). With improvements in network theory and design, the scope of tasks has grown to include multiple-object segmentation, image captioning, video summarization and highlighting, image generation, and denoising, to name a few. As the name Deep Learning suggests, networks with a large number of layers have been crucial for tackling these more complex tasks. In recent years researchers have experimented with different architectures and tradeoffs, but the principal trend has been increasing network depth:

AlexNet -> VGG -> GoogLeNet

As the networks grew, so did their number of learnable parameters and their overall training complexity. Optimizing increasingly deep networks is a challenge: gradients either vanish or explode over many layers, or the network strongly overfits its training data. There has been a wealth of work addressing these problems, and the most successful techniques are now standard in any Deep Learning project:
1. Good initializations (Xavier/Glorot, He)
2. Activation functions with well-behaved gradients (ReLU and its leaky variants, ELU)
3. Advanced gradient-based optimizers (Adam, AdaGrad, RMSProp)
4. Norm-based penalties for regularization (L1 and L2)
5. Dataset augmentation and perturbation during training (random crops, flips, and distortions of an image)
6. Training techniques to prevent overfitting (batch normalization, Dropout, learning rate schedules)
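
As a rough illustration of why item 1 matters, the sketch below pushes a random batch through a stack of ReLU layers and compares He initialization with a small fixed weight scale. It assumes plain NumPy and fully-connected layers rather than convolutions, and the width, depth, and the 0.01 scale are arbitrary choices made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def second_moment_after(depth, width, weight_std, rng):
    """Push a random batch through `depth` ReLU layers and return mean(x**2)."""
    x = rng.standard_normal((256, width))          # input with mean(x**2) close to 1
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * weight_std
        x = relu(x @ w)
    return float(np.mean(x ** 2))

rng = np.random.default_rng(0)
width, depth = 512, 20                             # illustrative sizes, not from any paper

he_std = np.sqrt(2.0 / width)                      # He initialization: Var(w) = 2 / fan_in
he_signal = second_moment_after(depth, width, he_std, rng)
small_signal = second_moment_after(depth, width, 0.01, rng)

print("He init:    ", he_signal)                   # stays on the order of 1
print("small init: ", small_signal)                # shrinks toward 0 layer after layer
```

With the He scale the signal strength stays roughly constant from layer to layer, while with the small scale it collapses long before the last layer, which is the forward-pass counterpart of a vanishing gradient.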

The original creators of Residual Networks (ResNets) suggest that the techniques in this list go a long way toward solving the gradient and overfitting problems. They argue instead that the training complexity of very deep networks is now the main challenge [1]. They illustrate this with what they call the "degradation problem", where a deeper network struggles to learn what should be a simple task:

1. Start with a shallow network N that performs well on task X.
2. Build a deeper version of N with D new layers: N + D = M.
3. Initialize the early layers of M with the weights from the shallower N.
4. In the worst case, network M should learn an identity mapping for the remaining D layers. The same signals would propagate forward and M would perform at least as well as N on task X. Given the success of deep networks, we might implicitly expect M to perform even better.
5. However, this does not happen in practice. In fact, network M often performs worse than N unless it is carefully guided during training.

This suggests that a Deep Neural Network, despite all of its expressive power, still struggles to learn a simple identity mapping. This counterintuitive problem is the driving force behind ResNets.
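
The construction in steps 1 through 4 can be sketched in a few lines. This is a minimal NumPy illustration that assumes fully-connected ReLU layers and sets the extra layers to identity matrices by hand; the layer counts and width are arbitrary. It only shows that a solution where M matches N exists, not that an optimizer will find it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights):
    """Plain feed-forward network with a ReLU after every weight layer."""
    for w in weights:
        x = relu(x @ w)
    return x

rng = np.random.default_rng(0)
width = 8

# Steps 1-3: shallow network N (3 layers); the deeper network M reuses its weights.
weights_n = [rng.standard_normal((width, width)) for _ in range(3)]

# Step 4: the D extra layers are identity mappings, written down by hand here.
extra_layers = [np.eye(width) for _ in range(4)]
weights_m = weights_n + extra_layers

x = rng.standard_normal((5, width))
out_n = forward(x, weights_n)
out_m = forward(x, weights_m)

# N's output is already non-negative, so identity weights followed by ReLU leave it unchanged.
print(np.allclose(out_n, out_m))
```

The degradation problem is that gradient-based training of M from a generic starting point tends not to land on this hand-built solution.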

If a network can learn a complicated function H(x) of its input, then it is reasonable to think it can also learn a residual mapping: H(x) - x. This residual mapping could be simpler to optimize and directly tackles the degradation problem. 
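
Written out, the reformulation looks as follows. The symbol F for the residual mapping is introduced here for illustration; the text above only names H(x).

```latex
F(x) := H(x) - x
\quad\Longrightarrow\quad
H(x) = F(x) + x ,
\qquad
H(x) = x \iff F(x) = 0 .
```

The weight layers are asked to learn F instead of H, so an identity mapping corresponds to driving F to zero rather than fitting the identity through a stack of nonlinear layers.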

A basic ResNet module is shown below. After the input goes through two weight layers, a skip-connection adds the original input x to the result element by element. If an identity mapping is ideal, all the optimizer has to do is push the weights of those layers toward 0. While it is very unlikely that a pure identity mapping is ideal, the authors show that it is a good starting point. They build very deep networks that converge faster and perform better than their non-residual counterparts [1].

<img src='./imgs/resnet.png' style="width: auto; height: auto">
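
A minimal NumPy sketch of such a module is given below. It assumes dense weight matrices in place of convolutions, leaves out batch normalization, and places the final ReLU after the addition; the function and variable names are made up for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Basic residual module: two weight layers plus an identity skip-connection."""
    residual = relu(x @ w1) @ w2       # weight layer -> ReLU -> weight layer
    return relu(residual + x)          # element-wise add of the input, then the final ReLU

rng = np.random.default_rng(0)
width = 8

# Non-negative input, as it would be when it comes out of an earlier ReLU.
x = relu(rng.standard_normal((4, width)))

# With all weights at zero the residual branch vanishes and the block passes x through.
w_zero = np.zeros((width, width))
out = residual_block(x, w_zero, w_zero)
print(np.array_equal(out, x))
```

Note that the pass-through is exact only because the input here is non-negative; the ReLU after the addition would clip any negative entries.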

The follow-up to this basic ResNet came from the same authors and takes the idea of identity mappings one step further. The authors now argue for identity mappings in two places in the module: the skip-connection and the output path after the addition. They show that if both are identity mappings, signals and gradients can flow directly between any two modules during the forward and backward passes. Activation functions are usually applied after a weight layer, so achieving an identity mapping at the output means flipping this notion on its head and thinking in terms of pre-activations instead. This boils down to moving each activation function to the other, earlier side of its weight layer. Once again the authors show that this helps the networks train faster and perform better [2].

<img src='./imgs/pre_act_resnet.png' style="width: auto; height: auto">
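
The sketch below illustrates the consequence of keeping both paths as identities: the output of the last module equals the input of the first plus the sum of every residual branch in between. It is a simplified NumPy version that assumes dense layers and omits the batch normalization that normally accompanies each pre-activation; the names and sizes are invented for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pre_act_residual(x, w1, w2):
    """Residual branch with the activation applied before each weight layer."""
    return relu(relu(x) @ w1) @ w2

rng = np.random.default_rng(0)
width, n_blocks = 8, 5
x0 = rng.standard_normal((4, width))

x = x0
residuals = []
for _ in range(n_blocks):
    w1 = rng.standard_normal((width, width)) * 0.1
    w2 = rng.standard_normal((width, width)) * 0.1
    f = pre_act_residual(x, w1, w2)
    residuals.append(f)
    x = x + f          # nothing is applied after the addition, so the output path is an identity

# The final output is the first input plus the sum of all residual branches.
print(np.allclose(x, x0 + sum(residuals)))
```

Because the first input appears as a plain additive term in the final output, the gradient with respect to it always contains a direct, unattenuated component regardless of the number of modules in between.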

The most recent ResNet improvement comes from rethinking the importance of depth, and more specifically how depth is deployed and managed to increase a network's performance. The authors experimented with increasing the width of layers while decreasing the overall number of layers. They achieve strong improvements and faster convergence than very deep but narrow ResNets [3]. Wide ResNets also take greater advantage of the parallel processing power of GPUs than deeper, more sequential networks.

<img src='./imgs/wide_resnet.png' style="width: auto; height: auto">
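
To get a feel for the width versus depth tradeoff, the back-of-the-envelope sketch below counts the weights in a stack of 3x3 convolutions. The layer counts and channel sizes are hypothetical numbers chosen for the arithmetic and are not configurations taken from [3].

```python
def conv3x3_params(c_in, c_out):
    """Number of weights in one 3x3 convolution (biases ignored)."""
    return 3 * 3 * c_in * c_out

def stack_params(n_layers, channels):
    """Weights in a stack of 3x3 convolutions with `channels` in and out of every layer."""
    return n_layers * conv3x3_params(channels, channels)

base_channels = 16

# Doubling the width quadruples the weights per layer, so a 4x shallower
# stack ends up with the same parameter budget.
deep_narrow = stack_params(n_layers=40, channels=base_channels)
shallow_wide = stack_params(n_layers=10, channels=base_channels * 2)

print(deep_narrow, shallow_wide)   # 92160 92160
```

Both stacks hold the same number of weights, but the wide one has only 10 sequential steps instead of 40, and each step is a larger operation that can be spread across GPU cores.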

In the `python/practical(?)` post, we explore each of the ResNet variants for the novel tasks of RF Spectrum Mining and modulation detection. We build the models from the ground up with the `layers` library in TensorFlow, using some of the best-practice flavors that are starting to emerge.

References:

[1] ResNet: https://arxiv.org/pdf/1512.03385.pdf

[2] Pre-Act. ResNet: https://arxiv.org/pdf/1603.05027.pdf

[3] Wide ResNet: https://arxiv.org/pdf/1605.07146v1.pdf