<center><h1>EfficientMixNet</h1></center>

This project includes a complete neural network called **EfficientMixNet**, based on recent advances in computer vision (EfficientNet, MixNet...) and in optimization (RAdam, Lookahead...). Logs and results are stored under the `/outputs` directory.


## Efficient Nets and Compound Coefficient Scaling 

The core idea of EfficientNets [[1]](#refs) is compound scaling: three interdependent hyperparameters of the model (the resolution of the input, the depth of the network and the width of the network) are scaled jointly with a fixed set of coefficients.

<p align="center">
<img src="../img/depth_width_res.png" title="depth_width_res" height=25% width=25%/>
</p>

`EfficientNetB0` is the base configuration. With the compound coefficient $\phi$ fixed to 1, a grid search on this configuration finds the coefficients $\alpha$, $\beta$ and $\gamma$ that optimize the following objective under the constraint:

<p align="center">
<img src="../img/obj.png" title="objective function" height=25% width=25%/>
</p>

Once the coefficients $\alpha$, $\beta$ and $\gamma$ are found, simply scale the compound coefficient $\phi$ by different amounts to get a family of models with more capacity and possibly better performance.
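
As a rough illustration of the scaling rule, the sketch below computes the depth, width and resolution multipliers $\alpha^\phi$, $\beta^\phi$ and $\gamma^\phi$ for a few values of $\phi$. The values $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ and the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ are the ones reported in [[1]](#refs); the function names are hypothetical and not part of this project.

```python
# Coefficients reported in the EfficientNet paper [1] for the B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scaling(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


def flops_factor(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Approximate FLOPS growth, (alpha * beta^2 * gamma^2) ** phi."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi


for phi in (1, 2, 3):
    depth, width, resolution = compound_scaling(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, FLOPS x{flops_factor(phi):.2f}")
```

Because $\alpha \cdot \beta^2 \cdot \gamma^2$ is close to 2, each increment of $\phi$ roughly doubles the FLOPS.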

-----

In doing so, and using Neural Architecture Search to obtain the base configuration together with good values for the coefficients above, the [paper](https://arxiv.org/abs/1905.11946) generates EfficientNets, which outperform much larger and much deeper models while using fewer resources during both training and evaluation.


<img src="../img/params.png" width=45%>
<img src="../img/flops.png"  width=45%>



## Mixed Depthwise Convolutional Kernel


One improvement comes from the paper [MixConv: Mixed Depthwise Convolutional Kernels](https://arxiv.org/abs/1907.09595) [[2]](#refs): replace the standard `DepthwiseConv2D` layers with custom `MixedDepthwiseConv2D` layers. A mixed depthwise convolution is a group of depthwise convolutions with varying kernel sizes, each applied to its own subset of the channels. The paper suggests that [3x3, 5x5, 7x7] can be used safely without any loss in performance (and with a possible gain), while 9x9 or 11x11 kernels may degrade performance if used without proper architecture search.

<img src="../img/MixedConv.png" height=100% width=100%>
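
To make the grouping concrete, here is a minimal sketch of how channels could be distributed across kernel sizes. It assumes an equal partition where the remainder goes to the first group, as in the reference MixNet code; the helper names are hypothetical and the `MixedDepthwiseConv2D` layer of this project may split channels differently.

```python
def split_channels(total_channels, num_groups):
    """Split channels into near-equal groups; the first group takes the remainder."""
    split = [total_channels // num_groups] * num_groups
    split[0] += total_channels - sum(split)
    return split


def mixconv_plan(total_channels, kernel_sizes):
    """Pair each kernel size with the number of channels it would convolve."""
    groups = split_channels(total_channels, len(kernel_sizes))
    return list(zip(kernel_sizes, groups))


print(mixconv_plan(32, [3, 5, 7]))  # [(3, 12), (5, 10), (7, 10)]
```

Each group is then processed by its own depthwise convolution and the outputs are concatenated along the channel axis.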


## Attention-based Batch Normalization

Batch Normalization [[4]](#refs) is a widely used technique in deep learning, initially aimed at accelerating training by [reducing internal covariate shift](https://arxiv.org/abs/1502.03167). However, later work argues that batch normalization [does not reduce internal covariate shift](https://blog.paperspace.com/busting-the-myths-about-batch-normalization/) [[5]](#refs) but [instead smooths the objective function, which improves performance](https://arxiv.org/abs/1805.11604) [[6]](#refs). The benefits of batch normalization are well established empirically, although the theoretical justification is still open to discussion.

It works in two stages: first normalize each channel of the input w.r.t. statistics computed on the input batch, then rescale and shift the normalized tensor to restore its representation power.

Consider a tensor $X$ of standard shape (B, H, W, C). We denote by $X_{bc}$ channel $c$ of image $b$, i.e. $X_{bc} = X[b, :, :, c]$.

1) First compute statistics from the current batch **for each channel c**

Mean of channel $c$ across batch $B$ : $$\mu^{B}_c = \frac{1}{B * H * W} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} X_{bhwc}$$


Standard deviation of channel $c$ across batch $B$ ($\epsilon$ is added to prevent division by $0$): $$\sigma^{B}_c = \sqrt{ \frac{1}{B * H * W} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} (X_{bhwc} - \mu^{B}_c)^2 + \epsilon } $$

2) Normalize input tensors 

$$\hat{X}_{bc} = \frac{X_{bc} - \mu^{B}_c}{\sigma^{B}_c}$$

3) Rescale normalized tensors with learnable weights and bias to restore representation power

$$\tilde{X}_{bc} = \gamma_c * \hat{X}_{bc} + \beta_c$$
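
The three steps above can be sketched in NumPy as follows. This is a training-mode illustration only (no moving averages for inference), and `batch_norm` is a hypothetical helper, not the Keras `BatchNormalization` layer.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization for a tensor of shape (B, H, W, C)."""
    # 1) Per-channel statistics over the batch and spatial axes.
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    sigma = np.sqrt(x.var(axis=(0, 1, 2), keepdims=True) + eps)
    # 2) Normalize.
    x_hat = (x - mu) / sigma
    # 3) Rescale and shift with the learnable per-channel parameters.
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8, 5))
out = batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
print(out.mean(axis=(0, 1, 2)))  # close to 0 for every channel
print(out.std(axis=(0, 1, 2)))   # close to 1 for every channel
```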

**The idea now is to change how the normalized tensor is rescaled, which can yield better representation power.**


### Attentive Normalization

Attentive Normalization (AN) [[7]](#refs) is an attention-based version of BN which recalibrates the channel information of BN. AN absorbs the [Squeeze-and-Excitation (SE) mechanism](https://arxiv.org/abs/1709.01507) [[3]](#refs) into the affine transformation of BN. AN learns a small number of scale and offset parameters per channel (i.e., different affine transformations). Their weighted sums (i.e., a mixture) are used in the final affine transformation. The weights are instance-specific and learned in a way that takes channel-wise attention into account, similar in spirit to the squeeze module in the SE unit. AN can be used as a drop-in replacement for the standard BatchNormalization layer.

<p align="center">
  <img src="../img/AN.PNG">
</p>
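
A minimal NumPy sketch of this mixture of affine transformations is given below. It is an illustration under assumptions, not the layer shipped with this project: the parameter names `w`, `b`, `gammas` and `betas` are hypothetical, the attention is a single dense layer, and the sigmoid gating is one possible choice of attention activation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attentive_norm(x, w, b, gammas, betas, eps=1e-3):
    """Attentive Normalization sketch for x of shape (B, H, W, C).

    w: (C, K) and b: (K,) form the attention layer (hypothetical names).
    gammas, betas: (K, C), a mixture of K affine transformations.
    """
    # Standard BN normalization with batch statistics.
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Squeeze: global average pooling gives one descriptor per instance.
    squeezed = x.mean(axis=(1, 2))            # (B, C)
    # Instance-specific mixture weights (sigmoid gating is an assumption).
    lam = sigmoid(squeezed @ w + b)           # (B, K)
    # Weighted sums of the K affine transformations.
    gamma = lam @ gammas                      # (B, C)
    beta = lam @ betas                        # (B, C)
    return gamma[:, None, None, :] * x_hat + beta[:, None, None, :]


rng = np.random.default_rng(0)
B, H, W, C, K = 2, 4, 4, 32, 5
x = rng.normal(size=(B, H, W, C))
out = attentive_norm(x, rng.normal(size=(C, K)), np.zeros(K),
                     np.ones((K, C)), np.zeros((K, C)))
print(out.shape)  # (2, 4, 4, 32)
```

Here `K` plays the role of the `n_mixture` parameter: each instance gets its own scale and offset, built from the same `K` shared components.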


### Instance Enhancement Batch Normalization 

Instance Enhancement Batch Normalization (IEBN) [[8]](#refs) is an attention-based version of BN which recalibrates the channel information of BN with a simple linear transformation. It can be used as a drop-in replacement for the standard BatchNormalization layer.


<p align="center">
  <img src="../img/iebn.jpg" width="400" height="300">
</p>
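
A minimal NumPy sketch of the idea, assuming the formulation described in [[8]](#refs): a per-instance gate, computed from the global average of each channel through a per-channel linear transformation and a sigmoid, rescales the usual BN scale. The names `gamma_hat` and `beta_hat` are hypothetical, and the layer in this project may differ in its details.

```python
import numpy as np


def iebn(x, gamma, beta, gamma_hat, beta_hat, eps=1e-3):
    """IEBN sketch for x of shape (B, H, W, C); all parameters have shape (C,)."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Per-instance, per-channel descriptor from global average pooling.
    pooled = x.mean(axis=(1, 2), keepdims=True)   # (B, 1, 1, C)
    # Linear transformation followed by a sigmoid gate.
    delta = 1.0 / (1.0 + np.exp(-(gamma_hat * pooled + beta_hat)))
    # The gate rescales the usual BN scale gamma for each instance.
    return (gamma * delta) * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 8))
out = iebn(x, np.ones(8), np.zeros(8), np.zeros(8), np.zeros(8))
print(out.shape)  # (2, 4, 4, 8)
```

Compared with the mixture used by Attentive Normalization, this only adds two extra parameters per channel.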


## Optimization

### Strategy

### RAdam optimizer

### Lookahead meta-optimizer






# Requirements
- Tensorflow 1.13+
- Keras 2.2.4+



# Implementations :

- (official) efficient nets https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/efficientnet_model.py
- (Keras port) efficient nets https://github.com/titu1994/keras-efficientnets
- (official) mix nets https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet/mixnet
- (Keras port) mix nets https://github.com/titu1994/keras_mixnets
- (official) IEBN : https://github.com/gbup-group/IEBN
- (official) RAdam : https://github.com/LiyuanLucasLiu/RAdam
- (Keras port) RAdam : https://github.com/titu1994/keras_rectified_adam
- (Keras port) LookAhead : https://github.com/bojone/keras_lookahead
...


# References <a id='refs'></a>

[1] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019. Arxiv link: https://arxiv.org/abs/1905.11946

[2] Mingxing Tan and Quoc V. Le. MixConv: Mixed Depthwise Convolutional Kernels. BMVC 2019. Arxiv link: https://arxiv.org/abs/1907.09595v2

[3] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu. Squeeze-and-Excitation Networks. CVPR 2018. Arxiv link: https://arxiv.org/abs/1709.01507

[4] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Arxiv link: https://arxiv.org/abs/1502.03167

[5] Ayoosh Kathuria. Intro to optimization in deep learning: Busting the myth about batch normalization. Link: https://blog.paperspace.com/busting-the-myths-about-batch-normalization/

[6] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry. How Does Batch Normalization Help Optimization? Arxiv link: https://arxiv.org/abs/1805.11604

[7] Xilai Li, Wei Sun and Tianfu Wu. Attentive Normalization. Arxiv link: https://arxiv.org/abs/1908.01259

[8] Senwei Liang, Zhongzhan Huang, Mingfu Liang, Haizhao Yang. Instance Enhancement Batch Normalization (IEBN): an Adaptive Regulator of Batch Noise. Arxiv link: https://arxiv.org/abs/1908.04008

[9] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han. On the Variance of the Adaptive Learning Rate and Beyond (RAdam). Arxiv link: https://arxiv.org/abs/1908.03265

[10] Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba. Lookahead Optimizer: k steps forward, 1 step back. Arxiv link: https://arxiv.org/abs/1907.08610


## TO DO

- complete the documentation with Mish / Swish and the optimizers
- fix the training / evaluation scripts

In [1]:
import sys
sys.path.insert(1, '../')

import json
import warnings
import numpy as np
from sklearn.metrics import recall_score, precision_score, fbeta_score

warnings.filterwarnings("ignore")
np.random.seed(1997)

# dummy training set
X_train = np.array([np.random.randn(299, 299, 5)])
y_train = np.array([[0., 1.]])

## Usage

In [2]:
from keras_efficientmixnets import *

Using TensorFlow backend.


Call the EfficientNetBuilder with desired parameters : 

In [3]:
# Equivalent to the standard EfficientNet architecture from the original paper
EfficientNet = EfficientNetBuilder(input_shape=(299, 299, 5),
                                   include_top=True,
                                   weights=None,
                                   input_tensor=None,
                                   pooling='avg',
                                   classes=2,
                                   drop_connect_rate=0.,
                                   data_format=None,
                                   mixed=False,
                                   activation="swish",
                                   typeBN="bn",
                                   batch_norm_momentum=0.99,
                                   batch_norm_epsilon=0.001,
                                   n_mixture=None,
                                   depth_divisor=8,
                                   min_depth=None)

Parameters : 

- input_shape : Default depends on the architecture (B0->B7 or custom) but it can also be set arbitrarily
- include_top : Whether to include global average pooling (+ dropout) + dense with softmax activation or not
- weights : path to pretrained weights
- input_tensor : one can give the network a custom tensor as input (like the output of another model)
- pooling : choice of pooling layer at the end of the conv tower just before predictions, either 'avg' (`GlobalAveragePooling2D`), 'max' (`GlobalMaxPooling2D`) or 'None' (`Flatten`)
- classes : useful if include_top=True, dimensionality of the output layer
- drop_connect_rate : drop connect rate (0 disables it), works only with identity_skip=True when building block args
- data_format : default `None` which is equivalent to `channels_last`
- mixed : whether to use mixed depthwise convolution layer instead of the regular `DepthwiseConv2D`. 
- activation : activation function to use throughout the network
- typeBN : Type of Batch Normalization layer to use ["bn", "an", "iebn"]
- n_mixture : Only relevant if using typeBN = "an"
- batch_norm_momentum : momentum to apply for `BatchNormalization` layer
- batch_norm_epsilon : epsilon for `BatchNormalization` layer
- depth_divisor : used when rounding off the coefficient scaled channels and depth of the layers
- min_depth : minimum depth value in order to avoid blocks with 0 layers
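
To illustrate how `depth_divisor` and `min_depth` could come into play, the sketch below follows the rounding scheme of the reference TensorFlow EfficientNet implementation. It is an assumption that this project rounds the same way; in particular, `min_depth` is treated here as a lower bound on the channel count, and the helper names are hypothetical.

```python
import math


def round_filters(filters, width_coefficient, depth_divisor=8, min_depth=None):
    """Scale a channel count and round it to a multiple of depth_divisor."""
    filters *= width_coefficient
    min_depth = min_depth or depth_divisor
    new_filters = max(
        min_depth,
        int(filters + depth_divisor / 2) // depth_divisor * depth_divisor)
    # Never round down by more than 10%.
    if new_filters < 0.9 * filters:
        new_filters += depth_divisor
    return int(new_filters)


def round_repeats(repeats, depth_coefficient):
    """Scale the number of repeats of a block, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))


# Width 1.1 and depth 1.2 are the multipliers published for EfficientNet-B2.
print([round_filters(f, 1.1) for f in (32, 16, 24, 40)])  # [32, 16, 24, 48]
print([round_repeats(r, 1.2) for r in (1, 2, 3, 4)])      # [2, 3, 4, 5]
```

Rounding to a multiple of 8 keeps channel counts hardware-friendly, and rounding the repeats up guarantees that every block keeps at least one layer.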

**NB** : In the original paper, the authors use `DepthwiseConv2D`, `BatchNormalization` and the `Swish` activation.
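
For reference, the two activations that appear as `activation` values in this document, `"swish"` and `"mish"`, can be written in a few lines. This is a scalar sketch of the usual definitions, not the `Swish` and `Mish` layers of this project.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swish(x):
    """swish(x) = x * sigmoid(x), with sigmoid written through tanh for stability."""
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))


def mish(x):
    """mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-2.0, 0.0, 1.0):
    print(f"x={value:+.1f}  swish={swish(value):+.4f}  mish={mish(value):+.4f}")
```

Both functions are smooth, non-monotonic, and close to the identity for large positive inputs.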

Now call `default_model` with the desired architecture (`"B2"` for instance) to generate the architecture described in the original paper:

In [4]:
EfficientNetB2 = EfficientNet.default_model("B2")

Instructions for updating:
Shapes are always computed; don't use the compute_shapes as it has no effect.


In [5]:
EfficientNetB2.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 299, 299, 5)  0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 150, 150, 32) 1440        input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 150, 150, 32) 128         conv2d_1[0][0]                   
__________________________________________________________________________________________________
swish_1 (Swish)                 (None, 150, 150, 32) 0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
depthwise_

In [6]:
EfficientNetB2.compile("Adam", loss="binary_crossentropy")

In [7]:
hist = EfficientNetB2.fit(X_train, y_train, epochs=1)

Epoch 1/1


In [8]:
print("y_pred : {} - y_true : {}".format(hist.model.predict(X_train)[0], y_train[0]))

y_pred : [1.3657064e-12 1.0000000e+00] - y_true : [0. 1.]


Alternatively, call `custom_model` to generate a custom architecture from a custom block list and width / depth / resolution coefficients. `custom_model` requires a list of `BlockArgs` as input to define the structure of each block in the model.

Nb : Default sets of `BlockArgs` are provided in `keras_efficientmixnets.config` and are the ones used in the original paper.

This list can be built from encoded strings or from BlockArgs objects.

Encoding Schema:
    "rX_kX_sXX_eX_iX_oX_aX_bX_nX{\_se0.XX}{\_noskip}"
     - X is replaced by any number (digits 0-9), by a list for the prefix k, or by a string for the prefixes a and b
     - {} encapsulates optional arguments
        'r' : num_repeat,
        'k' : kernel_size(s),
        's' : strides,
        'e' : expand_ratio,
        'i' : input_filters,
        'o' : output_filters,
        'a' : activation,
        'b' : type of Batch Norm,
        'n' : n_mixture (only relevant for Attentive Normalization)
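
To show how such a string maps to block arguments, here is a small, purely illustrative decoder. It is not the project's `BlockArgs.from_block_string`: `decode_block_string` is a hypothetical helper that returns a plain dict, and it assumes tokens are separated by `_`, with `se` followed by a ratio and `noskip` disabling the identity skip.

```python
import ast
import re


def decode_block_string(block_string):
    """Decode an encoded block string into a plain dict (illustrative only)."""
    names = {"r": "num_repeat", "k": "kernel_size", "s": "strides",
             "e": "expand_ratio", "i": "input_filters", "o": "output_filters",
             "a": "activation", "b": "typeBN", "n": "n_mixture"}
    args = {"identity_skip": True, "se_ratio": None}
    for token in block_string.split("_"):
        if token == "noskip":
            args["identity_skip"] = False
        elif re.fullmatch(r"se\d*\.?\d+", token):
            args["se_ratio"] = float(token[2:])
        else:
            key, value = token[0], token[1:]
            if key == "s":
                value = [int(c) for c in value]   # "22" -> [2, 2]
            elif key == "k":
                value = ast.literal_eval(value)   # "3" or "[3, 5, 7]"
            elif key in "reion":
                value = int(value)
            args[names[key]] = value
    return args


print(decode_block_string("r1_k[3, 5, 7]_s22_e1_i16_o24_se0.25_aswish_bbn"))
```

The `se` token is tested before the single-letter prefixes so that `se0.25` is not mistaken for a `strides` entry.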


Parameters : 

- block_list : list of BlockArgs object to instantiate the blocks
- width_coefficient: this is the $\beta$ coefficient as explained in the introduction (=1 for EfficientNetB0), determines the number of channels available per layer.
- depth_coefficient: this is the $\alpha$ coefficient as explained in the introduction (=1 for EfficientNetB0), determines the number of layers available to the model.
- dropout_rate : useful if include_top=True, dropout rate for predictions layers
- default_size : default image size, increases when using larger models (linked to the $\gamma$ coefficient explained in the intro)

In [9]:
# from encoded strings
blocks_args_mixed_encoded = [
            'r1_k3_s11_e1_i32_o16_se0.25_aswish_bbn',
            'r1_k[3, 5, 7]_s22_e1_i16_o24_se0.25_aswish_bbn',
            'r1_k3_s11_e1_i24_o32_se0.25_aswish_bbn']
  
BLOCK_LIST_MIXED_ENCODED = [BlockArgs.from_block_string(s) for s in blocks_args_mixed_encoded]


# from BlockArgs object
BLOCK_LIST_MIXED = [
    BlockArgs(32, 16, kernel_size=3, strides=[1, 1], num_repeat=1, se_ratio=0.25, expand_ratio=1, activation="swish", typeBN="bn"),
    BlockArgs(16, 24, kernel_size=[3, 5, 7], strides=[2, 2], num_repeat=1, se_ratio=0.25, expand_ratio=1, activation="swish", typeBN="bn"),
    BlockArgs(24, 32, kernel_size=3, strides=[1, 1], num_repeat=1, se_ratio=0.25, expand_ratio=1, activation="swish", typeBN="bn"),
]


# Blocks are strictly identical
for b1, b2 in zip(BLOCK_LIST_MIXED_ENCODED, BLOCK_LIST_MIXED):
    assert(b1.input_filters == b2.input_filters)
    assert(b1.output_filters == b2.output_filters)
    assert(b1.kernel_size == b2.kernel_size)
    assert(b1.strides == b2.strides)
    assert(b1.num_repeat == b2.num_repeat)
    assert(b1.se_ratio == b2.se_ratio)
    assert(b1.expand_ratio == b2.expand_ratio)
    assert(b1.identity_skip == b2.identity_skip)
    assert(b1.typeBN == b2.typeBN)
    assert(b1.n_mixture == b2.n_mixture)

In [10]:
EfficientNetCustom = EfficientNet.custom_model(block_list=BLOCK_LIST_MIXED, width_coefficient=1., depth_coefficient=1., dropout_rate=0.3)

In [11]:
EfficientNetCustom.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 299, 299, 5)  0                                            
__________________________________________________________________________________________________
conv2d_93 (Conv2D)              (None, 150, 150, 32) 1440        input_2[0][0]                    
__________________________________________________________________________________________________
batch_normalization_47 (BatchNo (None, 150, 150, 32) 128         conv2d_93[0][0]                  
__________________________________________________________________________________________________
swish_70 (Swish)                (None, 150, 150, 32) 0           batch_normalization_47[0][0]     
__________________________________________________________________________________________________
depthwise_

In [12]:
EfficientNetCustom.compile("Adam", "binary_crossentropy")

In [13]:
hist = EfficientNetCustom.fit(X_train, y_train, epochs=1)

Epoch 1/1


In [14]:
print("y_pred : {} - y_true : {}".format(hist.model.predict(X_train)[0], y_train[0]))

y_pred : [0.06242109 0.937579  ] - y_true : [0. 1.]


We can also tweak the default architecture and use default block settings : 

In [15]:
# Tweak the original architecture by using Mixed depthwise conv, Mish activation and attentive normalization
EfficientNetTweaked = EfficientNetBuilder(input_shape=(299, 299, 5),
                                   include_top=True,
                                   weights=None,
                                   input_tensor=None,
                                   pooling='avg',
                                   classes=2,
                                   drop_connect_rate=0.,
                                   data_format=None,
                                   mixed=True,
                                   activation="mish",
                                   typeBN="an",
                                   batch_norm_momentum=0.99,
                                   batch_norm_epsilon=0.001,
                                   n_mixture=5,
                                   depth_divisor=8,
                                   min_depth=None)

In [16]:
# Load the same blocks as in the original paper, but built with the customized layers
EfficientNetB2 = EfficientNetTweaked.default_model("B2")

In [17]:
EfficientNetB2.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 299, 299, 5)  0                                            
__________________________________________________________________________________________________
conv2d_104 (Conv2D)             (None, 150, 150, 32) 1440        input_3[0][0]                    
__________________________________________________________________________________________________
attentive_normalization_1 (Atte (None, 150, 150, 32) 549         conv2d_104[0][0]                 
__________________________________________________________________________________________________
mish_1 (Mish)                   (None, 150, 150, 32) 0           attentive_normalization_1[0][0]  
__________________________________________________________________________________________________
depthwise_

In [18]:
EfficientNetB2.compile("Adam", loss="binary_crossentropy")

In [19]:
hist = EfficientNetB2.fit(X_train, y_train, epochs=1)

Epoch 1/1


In [20]:
print("y_pred : {} - y_true : {}".format(hist.model.predict(X_train)[0], y_train[0]))

y_pred : [0. 1.] - y_true : [0. 1.]
