# Wedge Dropout
## Abstract
[ need logo of wedge ]

Wedge Dropout is one of a recent wave of Dropout algorithms customized for Convolutional Neural Networks (CNNs). Wedge Dropout has a very simple goal: it attempts to de-correlate feature maps in a CNN by examining pairs of feature maps and zeroing them both out if they are "too similar". At heart, the idea is that a CNN has a set of engines for creating feature maps from an image, and these engines can become too similar. Wedge Dropout discovers that two of these engines are too close together, and "drives a wedge" between them to push them apart. Wedge Dropout is suitable for Convolutional Networks using 1D, 2D, 3D, and LSTM2D layers. It operates by critiquing the values created by the CNN pipeline, and is best used at the end of a pipeline of CNN layers.
[ need logo of mallet driving wedge between two lines that are almost touching]
Preliminary testing has shown that Wedge Dropout can give a very slight improvement to model accuracy, generally 0.05% to 0.1%, in many different 1D, 2D and 3D CNN-based models. There is usually no hyperparameter tuning required to achieve this improvement. Wedge Dropout has no overhead at prediction time; however, it can increase training time by up to 10%, depending on how many feature maps the model uses in its final stage.




## Introduction
Wedge Dropout is a Dropout algorithm tuned for convolutional networks. The Wedge Dropout algorithm is a variation of Spatial Dropout.
Where Dropout zeros out individual values in a feature map, Spatial Dropout randomly zeros out entire feature maps. Since data within a feature map is strongly correlated, simple Dropout is not effective because the remaining data is still strongly correlated. 

(It tends to be most useful for improving a neural network when used after the first convolutional layer.)
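To make the contrast between the two concrete, here is a minimal NumPy sketch. The function names and the channels-last layout `(height, width, channels)` are assumptions for illustration, and the usual rescaling of surviving values by `1 / (1 - rate)` is left out for brevity.

```python
import numpy as np

def element_dropout(maps, rate, rng):
    """Standard Dropout: zero individual cells independently."""
    keep = rng.random(maps.shape) >= rate
    return maps * keep

def spatial_dropout(maps, rate, rng):
    """Spatial Dropout: zero whole feature maps (channels) at once."""
    # One keep/drop decision per channel, broadcast over height and width.
    keep = rng.random(maps.shape[-1]) >= rate
    return maps * keep

rng = np.random.default_rng(0)
maps = np.ones((4, 4, 8))          # 8 feature maps of 4x4 cells
dropped_cells = element_dropout(maps, 0.5, rng)
dropped_maps = spatial_dropout(maps, 0.5, rng)
```

After `spatial_dropout`, every feature map is either untouched or entirely zero, whereas `element_dropout` leaves a scattering of zeros inside each map.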
[ Assumptions: understand neural networks, CNNs, and the role Dropout plays ]
A CNN works by generating overlapping pyramids based on an input image. These pyramids are called "feature maps". Given a picture of a cat, one feature map outlines the head, another the eyes, a third the ears. Collectively they describe aspects of the cat. Feature maps should be independent: if they are correlated, or "too similar", the CNN describes fewer unique aspects than it could.

The Wedge Dropout algorithm works by analyzing the final output of a convolutional neural network (CNN). Where Spatial Dropout zeroes out randomly chosen feature maps, Wedge Dropout analyzes randomly chosen pairs of feature maps and drops both when they are "too similar".
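The pair-and-drop idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the reference implementation: `random_pairs`, `wedge_dropout`, and `toy_similarity` are hypothetical names, the layout is channels-last, and `toy_similarity` is only a placeholder for a real similarity score.

```python
import numpy as np

def random_pairs(n_maps, rng):
    """Shuffle the map indexes and pair them off two at a time."""
    order = rng.permutation(n_maps)
    return list(zip(order[0::2], order[1::2]))

def wedge_dropout(maps, pairs, similarity, threshold):
    """maps: array of shape (height, width, n_maps).
    Zero both maps of every pair whose similarity exceeds the threshold."""
    out = maps.astype(float)  # astype returns a copy
    for i, j in pairs:
        if similarity(maps[..., i], maps[..., j]) > threshold:
            out[..., i] = 0.0
            out[..., j] = 0.0
    return out

def toy_similarity(a, b):
    """Placeholder score: fraction of cells where the two maps are equal."""
    return np.mean(a == b)

maps = np.stack([
    np.array([[0, 9], [0, 0]]),   # map 0
    np.array([[0, 9], [0, 0]]),   # map 1: identical to map 0
    np.array([[7, 0], [0, 0]]),   # map 2
    np.array([[0, 0], [0, 5]]),   # map 3
], axis=-1)

# Fixed pairs keep the example deterministic; random_pairs() would
# normally supply them during training.
result = wedge_dropout(maps, pairs=[(0, 1), (2, 3)],
                       similarity=toy_similarity, threshold=0.75)
```

Maps 0 and 1 are identical, so both are zeroed. Maps 2 and 3 agree on only half of their cells and are left alone.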


## Similar Work
Wedge Dropout was directly inspired by CamDrop. CamDrop analyzes all of the feature maps in a set and chooses a rectangle that will be zeroed out in some of the feature maps. A related algorithm is DropBlock, which zeroes out randomly chosen rectangles in feature maps. In a sense, Wedge Dropout is to Spatial Dropout as CamDrop is to DropBlock: it replaces randomness with analysis.

## Algorithms
There are three algorithms investigated for Wedge Dropout's analysis phase. 
### Rectification
Rectification scores a pair by the difference (delta) between the two feature maps. A raw difference would reward feature maps that are simply inverses of each other, so each map is first normalized around zero and then rectified (its absolute value is taken). Rectification nullifies the drive toward simply inverting the image. This method is appropriate for comparing embeddings that randomly populate a vector space.
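Written out, the score computed by the `similarity_rectify` listing in this section is as follows. The symbols $\hat{X}$, $r$, and $s$ are notation introduced here for illustration; $A$ and $B$ are two feature maps with $N$ cells each.

$$
\hat{X} = \frac{X - \min X}{\max X - \min X}, \qquad
r(X) = \left| \hat{X} - \tfrac{1}{2} \right|, \qquad
s(A, B) = 1 - \left| \frac{1}{N} \sum_{k=1}^{N} \frac{r(A)_k - r(B)_k}{2} \right|
$$

Because the signed differences are averaged before the outer absolute value is taken, positive and negative differences can cancel, so a score near 1 does not by itself mean the two maps match cell by cell.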
### Direct Comparison 
This algorithm isolates the cells in each feature map which are above its mean value, and then compares individual values across both feature maps cell by cell in a simple Boolean AND operation. This counts the number of cells in both feature maps that tend to find the same feature. If this count is above a certain percentage of the total number of cells in the feature map, both feature maps are zeroed out.
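Read literally, with a strict Boolean AND, the description above could be sketched as follows. The function name `direct_comparison_and` is hypothetical. A strict AND counts only cells that are above the mean in both maps, so shared low-valued background does not add to the score.

```python
import numpy as np

def direct_comparison_and(map_a, map_b):
    """Fraction of cells that exceed the mean in both maps (strict AND)."""
    above_a = map_a > map_a.mean()
    above_b = map_b > map_b.mean()
    both = np.logical_and(above_a, above_b)
    return both, np.count_nonzero(both) / both.size

map_a = np.array([[1, 2], [3, 4]])
map_b = np.array([[5, 6], [7, 3]])
both, fraction = direct_comparison_and(map_a, map_b)
# both     -> [[False, False], [True, False]]
# fraction -> 0.25
```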
### Normalize and Multiply
This algorithm normalizes both feature maps to a range from 0.0 to 1.0, multiplies the two feature maps cellwise (the Hadamard product), and counts the number of values above the mean of the resulting output. Again, if more than a certain percentage are above the mean (0.5), both feature maps are zeroed out.
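One way to reach the 0.0 to 1.0 range is min-max scaling. The sketch below assumes that choice; `min_max` and `multiply_similarity` are hypothetical names. Dividing by a vector norm instead would rescale the maps without shifting them.

```python
import numpy as np

def min_max(feature_map):
    """Rescale a feature map to the range 0.0 to 1.0."""
    low, high = feature_map.min(), feature_map.max()
    if high == low:
        return np.zeros_like(feature_map, dtype=float)  # constant map
    return (feature_map - low) / (high - low)

def multiply_similarity(map_a, map_b):
    """Fraction of cells whose cellwise product exceeds the mean product."""
    product = min_max(map_a) * min_max(map_b)  # Hadamard (cellwise) product
    above = product > product.mean()
    return above, np.count_nonzero(above) / above.size

map_a = np.array([[1, 2], [3, 4]])
map_b = np.array([[5, 6], [7, 3]])
above, fraction = multiply_similarity(map_a, map_b)
# above    -> [[False, True], [True, False]]
# fraction -> 0.5
```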

CNN feature maps tend to be dominated by a low-valued background with one or a few small high-valued regions. Direct Comparison and Normalize and Multiply both work by ignoring the low-valued background and only comparing the high-valued regions. It is not clear that either algorithm is better. Normalize and Multiply offers different normalization algorithms (from tf.linalg.norm()) for experimentation.

Following are implementations of each algorithm in Python.

In [3]:
import numpy as np

In [5]:
def similarity_rectify(img1, img2):
    # Min-max normalize both images to the range 0 to 1, shift them to
    # -0.5 to 0.5, then rectify (take the absolute value).
    # somehow, there is no numpy function for this
    base1 = np.max(img1) - np.min(img1)
    base2 = np.max(img2) - np.min(img2)
    # avoid dividing by zero when a map is constant
    if base1 == 0:
        base1 = 0.0001
    if base2 == 0:
        base2 = 0.0001
    rect1 = np.abs(((img1 - np.min(img1)) / base1) - 0.5)
    rect2 = np.abs(((img2 - np.min(img2)) / base2) - 0.5)
    print(rect1)
    print(rect2)
    # rectified maps range from 0 to 0.5
    delta = (rect1 - rect2) / 2
    mean = np.abs(np.mean(delta))
    return 1 - mean

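def similarity_direct_comparison(img1, img2):
    # The per-map mean is the same value a GlobalAveragePooling2D layer computes
    mean1 = np.mean(img1)
    mean2 = np.mean(img2)

    # cells above the mean are the "visible" part of each feature map
    visible1 = img1 > mean1
    visible2 = img2 > mean2

    # Note: == marks every cell where the two masks agree, including cells
    # that are below the mean in both maps; a strict Boolean AND would be
    # visible1 & visible2.
    correlated = visible1 == visible2

    percentage = np.count_nonzero(correlated) / correlated.size

    return correlated, percentage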

def similarity_multiply(img1, img2):
    norm1 = np.linalg.norm(img1)
    norm2 = np.linalg.norm(img2)
    # avoid dividing by zero when a map is all zeros
    if norm1 == 0:
        norm1 = 0.0001
    if norm2 == 0:
        norm2 = 0.0001
    rect1 = img1 / norm1
    rect2 = img2 / norm2
    # for non-negative inputs, the scaled maps range from 0 to 1
    mult = rect1 * rect2
    print('mult:', mult)
    avg = np.mean(mult)
    print('avg:', avg)
    correlated = mult > avg

    percentage = np.count_nonzero(correlated) / correlated.size
    return correlated, percentage

img1 = np.asarray([[1,2],[3,4]])
img2 = np.asarray([[5,6],[7,3]])
img3 = np.asarray([[3,1],[4, 9]])


print('Feature Map #1')
print(img1)
print('Feature Map #2')
print(img2)
print('Feature Map #3')
print(img3)
print()

percentage1 = similarity_rectify(img1, img3)
print('  similarity score:', percentage1)
correlation1, percentage1 = similarity_direct_comparison(img1, img2)
print('Correlation (count > mean): ')
print(correlation1)
print('  similarity score:', percentage1)
correlation2, percentage2 = similarity_multiply(img1, img2)
print('Correlation (normalized and multiplied): ')
print(correlation2)
print('  similarity score:', percentage2)



Feature Map #1
[[1 2]
 [3 4]]
Feature Map #2
[[5 6]
 [7 3]]
Feature Map #3
[[3 1]
 [4 9]]

[[0.5        0.16666667]
 [0.16666667 0.5       ]]
[[0.25  0.5  ]
 [0.125 0.5  ]]
  similarity score: 0.9947916666666666
Correlation (count > mean): 
[[ True False]
 [ True False]]
  similarity score: 0.5
mult: [[0.08368274 0.20083858]
 [0.35146751 0.20083858]]
avg: 0.20920685218893076
Correlation (normalized and multiplied): 
[[False False]
 [ True False]]
  similarity score: 0.25


We can see that the three algorithms provide different interpretations of 'correlated'. 

# Usage
Wedge Dropout operates by critiquing the values created by the CNN pipeline, and is best used at the end of a pipeline of CNN layers. As with Spatial Dropout and other CNN-specific Dropout algorithms, do not place a Batch Normalization layer after a Wedge Dropout layer. Wedge Dropout does work well after a Batch Normalization layer. All successful tests have placed the Wedge Dropout layer right before the final summarization layer, usually a Dense or GlobalAveragePooling layer.

[ Lift a standard CNN diagram and poke in a Wedge Dropout layer right before the final Dense/GlobalAveragePooling layer. ]

Wedge Dropout only has one hyperparameter, the similarity coefficient. Preliminary testing on many different 1D, 2D and 3D CNN networks has shown that one value is optimal for almost all applications: 0.5 for the Direct Comparison algorithm, and 0.65 for the Normalize and Multiply algorithm.

# Batch-wise Operation
It has proven very powerful to apply Wedge Dropout to all of the feature maps in a batch for one pair of random indexes, and then run a simple vote on the results. For example, with a batch size of 32, if 16 of the pairs of feature maps are "too similar", the feature maps at those indexes are zeroed out in all 32 samples. If only 15 pairs are too similar, none of the feature maps are zeroed out.
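The voting step might look like the following NumPy sketch. The names `batch_vote_drop`, `overlap`, and `min_votes` are hypothetical, the layout is assumed to be `(batch, height, width, n_maps)`, and `overlap` is a stand-in similarity score with an arbitrary threshold of 0.2.

```python
import numpy as np

def batch_vote_drop(batch, i, j, similarity, threshold, min_votes):
    """Score maps i and j in every sample; if at least min_votes samples
    exceed the threshold, zero both maps across the whole batch."""
    scores = np.array([similarity(s[..., i], s[..., j]) for s in batch])
    votes = int(np.count_nonzero(scores > threshold))
    out = batch.astype(float)  # astype returns a copy
    if votes >= min_votes:
        out[..., i] = 0.0
        out[..., j] = 0.0
    return out, votes

def overlap(a, b):
    """Stand-in score: fraction of cells above the mean in both maps."""
    return np.mean((a > a.mean()) & (b > b.mean()))

peak_a = np.array([[0.0, 9.0], [0.0, 0.0]])
peak_b = np.array([[0.0, 0.0], [9.0, 0.0]])
similar = np.stack([peak_a, peak_a], axis=-1)     # maps 0 and 1 agree
different = np.stack([peak_a, peak_b], axis=-1)   # maps 0 and 1 disagree

batch_16 = np.stack([similar] * 16 + [different] * 16)
batch_15 = np.stack([similar] * 15 + [different] * 17)

out_16, votes_16 = batch_vote_drop(batch_16, 0, 1, overlap, 0.2, min_votes=16)
out_15, votes_15 = batch_vote_drop(batch_15, 0, 1, overlap, 0.2, min_votes=16)
```

With 16 votes the pair is zeroed in all 32 samples. With 15 votes the batch passes through unchanged.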

When operating per sample, Wedge Dropout critiques a pair of feature maps. In batch-wise operation, Wedge Dropout critiques the engine, or causal chain, that created the feature maps. 

Batch Normalization is a proven method of improving a CNN model, and is used in most reference architectures except image generators. Batch Normalization's performance improves as the batch size increases. Wedge Dropout's performance also increases as batch size increases, so it is a good match for existing CNN architectures.
Wedge Dropout works well after a Batch Normalization layer.

# Concluding Remarks
The Wedge Dropout algorithm can noticeably increase both tensor graph compilation time and training time per epoch. For example, the final phase of EfficientNet "Zero" generates 1024 feature maps, and then applies GlobalAveragePooling to those. Wedge Dropout was most effective when inserted between the final convolution layer and the GlobalAveragePooling layer. Adding Wedge Dropout increased the tensor graph compilation phase by 20%, and added 10%-15% to the running time for each epoch. Wedge Dropout is not active during prediction, and does not contribute any values that need to be loaded for a model.

## Utility
Wedge Dropout is not needed for any image architecture. However, given the lack of complex hyperparameters and simplicity of application, it will probably improve most production uses of CNNs.
It has been tested with many example networks in the Keras documentation, and worked in all Conv1D, Conv2D and Conv3D-based applications. It did not improve a multi-layer, fully trained network (EfficientNet Level 0), but did improve an application that uses pre-trained ImageNet layers. It also did not help with LSTM2D.

Since Wedge Dropout works by critiquing the output of a CNN, it is possible that its feedback only works in a simple causal chain. It is possible that the causal graph of feature map creation is turbid in deep networks and 2D LSTMs. 

## Further Investigation
"Slice Dropout" [ SliceDropout ], a variant of Spatial Dropout, zeroes out only one half of a feature map instead of the entire feature map. A more complex version of Wedge Dropout's feature map comparison could decide that all of the offending cells are concentrated on one side of the feature map, or even just one quadrant. It would choose to zero out only that area.
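As a rough illustration of how such a decision might be made, the sketch below checks whether the cells that are above the mean in both maps are concentrated in the left or right half. Everything here is an assumption made for the example: the function name `offending_half`, the left/right split, and the `dominance` cutoff of 0.75.

```python
import numpy as np

def offending_half(map_a, map_b, dominance=0.75):
    """Return 'left', 'right', or None depending on where the cells that
    are above the mean in both maps are concentrated."""
    both = (map_a > map_a.mean()) & (map_b > map_b.mean())
    total = np.count_nonzero(both)
    if total == 0:
        return None  # no shared high-valued cells at all
    mid = both.shape[1] // 2
    left = np.count_nonzero(both[:, :mid])
    if left / total >= dominance:
        return "left"
    if (total - left) / total >= dominance:
        return "right"
    return None  # overlap is spread across both halves

map_a = np.zeros((4, 4))
map_a[0:2, 0:2] = 5.0   # high-valued region in the upper left
map_b = np.zeros((4, 4))
map_b[1:3, 0:2] = 5.0   # overlapping region, also on the left
side = offending_half(map_a, map_b)
# side -> "left"
```

A layer built on this idea would then zero only the returned half of both feature maps rather than the whole maps.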

It is possible that zeroing out values is not the only way to affect training. There may be ways to do a random fill which do not disrupt the operation of Batch Normalization.

### Feature Maps and Attention Heads
The multi-head attention [ cite ] architecture creates several "answer" vectors of the same size, and combines them with addition. This set of vectors looks suspiciously like a set of feature maps: the information within the vector is strongly correlated, but information across vectors is not correlated. In fact, if the information is correlated across attention heads, the attention heads are learning the same answer set. It is possible that Wedge Dropout, or even Spatial Dropout, will improve the function of a multi-head attention model.

# Citations
[ CamDrop ]

[ DropBlock ]

[ SpatialDropout paper ]

[ SpatialDropout Keras man page ]