#Sparsenet 

It stands for *Sparsely Aggregated Convolutional Networks*. It is an architectural change that can be applied to DenseNets and ResNets to make training faster by significantly reducing the number of parameters. These improvements are achieved while maintaining similar levels of accuracy.

Check out the original publication at [Sparsely Connected Convolutional Networks](https://arxiv.org/abs/1801.05895).
The original repository is hosted at [SparseNet](https://github.com/Lyken17/SparseNet).

##Background

Convolutional neural networks have become central to computer vision. They are used in a wide variety of tasks such as image classification, detection and segmentation. At present, the most popular architectures are AlexNet, VGG, Inception, ResNet and DenseNet. 
 
* __DenseNet__ - A network architecture with the following components: an initial convolution layer, multiple dense blocks each followed by a transition block, and a final output block. 
![DenseNet basic](https://cloud.githubusercontent.com/assets/8370623/17981496/fa648b32-6ad1-11e6-9625-02fdd72fdcd3.jpg)The above picture shows an example of a DenseNet used for classification. The transition block consists of a convolution and a pooling layer. The output block consists of a pooling layer and a linear (fully connected) layer. <br><br>
The actual genius of the DenseNet lies within the dense blocks. Each dense block consists of multiple dense layers. Each dense layer consists of batch normalization, ReLU and a convolution layer. For now, it is easier to treat each dense layer as a single unit. <br><br>Instead of being connected purely sequentially, the dense layers are connected in a feed-forward fashion: for each layer, the outputs of all preceding layers are treated as inputs, and its own feature maps are then passed on as inputs to all subsequent layers. <br><br>
![DenseNet block](https://cloud.githubusercontent.com/assets/8370623/17981494/f838717a-6ad1-11e6-9391-f0906c80bc1d.jpg) <br><br>In the above picture, H1, H2, H3 and H4 are single dense layers. X1, X2, X3 and X4 are the outputs of H1, H2, H3 and H4 respectively. Note that H1 has a single input, whereas H2 has 2, H3 has 3 and so on. Since these Keras layers take a single input tensor, we need a mechanism to combine multiple inputs (when applicable). DenseNet uses **concatenation** for this purpose. 
<br>
###Concatenation example<br>
a = np.array([[1, 2], [3, 4]])<br>
b = np.array([[5, 6]])<br><br>
np.concatenate((a, b), axis=0) --> Concatenate along row<br>
**answer** : ([[1, 2], [3, 4], [5, 6]])<br><br> 
np.concatenate((a, b.T), axis=1) --> Concatenate along column<br>
**answer** : ([[1, 2, 5], [3, 4, 6]])<br><br>
You can find more information [here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html).<br>
###L and k<br>
DenseNets are characterized by the values of two constants, L and k. L is the number of layers and k is the growth rate.<br><br>
L is the total number of convolution and fully connected layers throughout the whole DenseNet. <br>For example, if there are 3 dense blocks and each dense block has 12 dense layers, L = 1 (initial conv) + 12 (first dense block) + 1 (1st transition) + 12 (second dense block) + 1 (2nd transition) + 12 (third dense block) + 1 (output block) = 40.<br><br>
k is called the growth rate. It is the number of output filters of each dense layer inside a dense block. On careful examination, it is also the rate at which the tensor grows inside a dense block. For example, assume the input to a dense block has 24 filters and k = 12. Then, <br><br> Input to 1st layer = 24 filters. <br> Input to 2nd layer = Output of 1st layer + Original input = 12 + 24 = 36 filters.<br>Input to 3rd layer = Output of 2nd layer + Output of 1st layer + Original input = 12 + 12 + 24 = 48 filters.<br> and so on. <br><br>
In conclusion, DenseNet-40-12 denotes a DenseNet with 40 layers and growth rate 12.<br><br>

* __ResNet__ - It has an architecture similar to DenseNet. However, the operation used to combine multiple inputs is **summation followed by a ReLU layer**.
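
As a rough illustration of the two aggregation schemes, the NumPy sketch below tracks tensor shapes only. The array names and the 8x8 spatial size are assumptions made for this example, and random arrays stand in for real layer outputs.

```python
import numpy as np

# Hypothetical feature maps in channels-last layout: (height, width, channels).
block_input = np.random.rand(8, 8, 24)   # input to the dense block, 24 channels
k = 12                                   # growth rate

# DenseNet style: each layer sees the concatenation of everything before it.
features = [block_input]
dense_input_channels = []
for _ in range(3):
    layer_input = np.concatenate(features, axis=-1)
    dense_input_channels.append(layer_input.shape[-1])
    # Stand-in for a dense layer: it always emits k new channels.
    features.append(np.random.rand(8, 8, k))

print(dense_input_channels)  # [24, 36, 48]

# ResNet style: inputs are summed, so the channel count never changes
# and the individual summands can no longer be told apart.
a = np.random.rand(8, 8, 24)
b = np.random.rand(8, 8, 24)
residual_sum = np.maximum(a + b, 0.0)    # summation followed by ReLU
print(residual_sum.shape)    # (8, 8, 24)
```

The channel counts match the 24, 36, 48 progression worked out for k = 12, while the summed tensor keeps the width of its inputs.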

##Problems with dense architecture
Dense feature aggregation, described above, comes with several potential drawbacks.

* In DenseNets, the use of concatenation means that the number of
skip connections and parameters grows at the asymptotic rate of O(N^2),
where N is the network depth. This asymptotically quadratic growth means
that a significant portion of the network is devoted to processing previously seen
feature representations. Each layer contributes only a few new outputs. Hence, efficiency is low.

* In ResNets, after the summation operation, it is impossible to separate the individual components of a set of
features. As the depth of a residual network grows, the
number of feature maps aggregated grows linearly. Later features may corrupt
or wash out the information carried by earlier feature maps.

##Sparsenet architecture

Sparsenet tries to solve the problems faced by dense aggregation with an architectural change. Within a dense block, instead of using the outputs of all earlier layers as input, only a selected subset of layer outputs is used. <br><br>For layer(x), the inputs are: layer_(x - a^(0)),  layer_(x - a^(1)),  layer_(x - a^(2)) and so on, where a is an integer, most commonly 2.<br><br>

![Differences](https://github.com/Lyken17/SparseNet/raw/master/images/dense_and_sparse.png)<br>
For example, in a Sparsenet dense block, layer(5) has inputs from: <br>
layer(5 -2^0), layer(5 - 2^1) and  layer(5-2^2) => layer(4), layer(3) and layer(1).<br>

For a network of total depth N, ResNet and
DenseNet have O(N) incoming links per layer, for a total of O(N^2) connections. In contrast, sparse aggregation has only
O(log(N)) incoming links per layer, for a total of O(N log(N)) connections.
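
The difference in growth can be checked with a short counting sketch. The helper names below are made up for this illustration, and layer 0 is assumed to denote the block input.

```python
def dense_sources(x):
    """Layers feeding layer x under dense aggregation: every earlier layer."""
    return list(range(x - 1, -1, -1))

def sparse_sources(x, a=2):
    """Layers feeding layer x under sparse aggregation: x - a^0, x - a^1, ..."""
    sources = []
    offset = 1
    while offset <= x:
        sources.append(x - offset)
        offset *= a
    return sources

print(sparse_sources(5))  # [4, 3, 1]

def total_links(source_fn, depth):
    """Total number of incoming links over layers 1..depth."""
    return sum(len(source_fn(x)) for x in range(1, depth + 1))

# Dense totals grow quadratically, sparse totals roughly as N log N.
for depth in (8, 64, 512):
    print(depth, total_links(dense_sources, depth), total_links(sparse_sources, depth))
```

For a depth of 8 this already gives 36 dense links against 21 sparse ones, and the gap widens quickly with depth.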

##Advantages
Sparsenet offers the following advantages over DenseNets and ResNets, which together lead to faster training and more efficient parameter utilization.
1. __Fewer parameters__ - Sparsenet offers results similar to DenseNet and ResNet while using significantly fewer parameters.<br><br>
![Results](https://i.imgur.com/neA484j.png)<br><br>

For example, DenseNet-40-12 needs 1.1 million parameters to reach an error rate of roughly 24% on CIFAR-100. Sparsenet-40-12 reaches a comparable error rate with only 0.76 million parameters.<br>

2. __Efficient skip connection utilization__ - 
![Parameter weights](https://github.com/Lyken17/SparseNet/raw/master/images/cropped_two-weights-int.jpg)<br><br>The above figure shows the average absolute filter weights of convolutional layers in a trained DenseNet and SparseNet. <br><br>For a target layer x and source layer y, a bluish color indicates that, even though a connection exists from the output of y to the input of x, this connection is almost unused. This means that the network is performing extra work while getting back very little value. A reddish brown color, however, shows that the skip connection is used well and adds value.<br><br>
In the figure, the dense block has a large bluish region. This suggests that many skip connections are redundant. In Sparsenets, though, almost no bluish region can be found, which indicates that hardly any skip connections are redundant. Each existing skip connection plays a significant role in achieving the result. <br><br>
3. __Efficiency/FLOPs__ -
FLOPs is a common metric for the work performed at each layer of the network. It stands for the number of __FL__oating point __OP__eration__S__. Let's understand it with an example. 
<br>![](https://i.imgur.com/9Q4fl9D.png)<br>Assume a densely connected layer l2 with n2 neurons. Let the number of neurons in the earlier layer l1 be n1. Each neuron in l2 is connected to every neuron in l1. Hence, the weight matrix at layer l2 has dimension n2 x n1. Since every neuron has a single output, the output of l1 has size n1 x 1.  <br><br> The densely connected layer multiplies the weight matrix by the l1 output, i.e. <br> [n2 x n1]  \*  [n1 x 1]. Each of the n2 rows of the weight matrix is multiplied with the single column of the l1 output. Hence, total number of operations = n2 * (multiplications per row + additions per row). <br><br>Each row multiplies n1 pairs of elements and sums the products, which takes n1 multiplications and n1 - 1 additions. So, total number of operations = n2 * (2n1 - 1).<br><br>
A ReLU unit is then applied, which is counted here as a comparison and a multiply. Since there are n2 neurons, total operations = 2 * n2.<br><br>
Adding these up, n2 * (2n1 - 1) + n2 * 2 = n2 (2n1 - 1 + 2) = n2 (2n1 + 1) is the total number of operations at layer l2.<br><br>
![](https://i.imgur.com/nyfWh5E.png) <br><br>
The above figure compares the FLOPs of Sparsenet, DenseNet and ResNet. It can be inferred that Sparsenet uses fewer FLOPs than the other two. For instance, DenseNet-121-32 needs 5.7G operations, whereas SparseNet-121-32 needs only 3.46G.
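
The counting argument for a fully connected layer, with n1 multiplications and n1 - 1 additions per output neuron plus two operations per neuron for ReLU, can be written as a small helper. The function name is made up for this sketch, and bias additions are ignored, as in the text.

```python
def dense_layer_flops(n1, n2):
    """Operation count for a fully connected layer followed by ReLU.

    n1: neurons in the previous layer, n2: neurons in this layer.
    """
    multiplications = n1          # per output neuron
    additions = n1 - 1            # per output neuron
    matmul_ops = n2 * (multiplications + additions)
    relu_ops = 2 * n2             # one comparison and one multiply per neuron
    return matmul_ops + relu_ops

print(dense_layer_flops(n1=4, n2=3))  # 27
```

With n1 = 4 and n2 = 3, the matrix product costs 3 * 7 = 21 operations and the ReLU another 6.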





In [0]:
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division

import numpy as np

import keras
from keras.models import Model
# Only the layers and utilities actually used in this notebook are imported.
from keras.layers import Input, Dense, Dropout, Activation, Conv2D
from keras.layers import AveragePooling2D, GlobalAveragePooling2D
from keras.layers import BatchNormalization, concatenate
from keras.regularizers import l2
import keras.backend as K

In [0]:
# A single dense layer. 
def _dense_layer(ip, nb_filter, bottleneck=False, dropout_rate=None, weight_decay=1e-4):
    ''' Args:
        ip: Input keras tensor
        nb_filter: number of filters/ growth rate
        bottleneck: whether to add a bottleneck block
        dropout_rate: dropout rate
        weight_decay: weight decay factor  '''
    
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    with K.name_scope('conv_block'):
        x = BatchNormalization(axis=concat_axis, momentum=0.1, epsilon=1e-5)(ip)
        x = Activation('relu')(x)

        # If we need a convolution block with a bottleneck 
        if bottleneck:
            inter_channel = nb_filter * 4  

            x = Conv2D(inter_channel, (1, 1), kernel_initializer='he_normal', padding='same', use_bias=False,
                       kernel_regularizer=l2(weight_decay))(x)
            x = BatchNormalization(axis=concat_axis, epsilon=1e-5, momentum=0.1)(x)
            x = Activation('relu')(x)

        x = Conv2D(nb_filter, (3, 3), kernel_initializer='he_normal', padding='same', use_bias=False)(x)
        if dropout_rate:
            x = Dropout(dropout_rate)(x)

    return x

In [0]:
# A utility function that takes a list of all previous inputs (0,1, ..., x -1).
# Returns list containing only selective inputs(x-1, x-2, x-4...) as defined by sparsenet architecture.
def _exponential_fetch(x_list):
    count = len(x_list)
    i = 1
    inputs = []
    while i <= count:
        inputs.append(x_list[count - i])
        i *= 2
    return inputs
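
To see which entries the helper selects, it can be run on plain integers instead of tensors. The sketch below repeats the same selection rule under a different name so that it runs on its own; 0 stands for the block input and 1..n for the layer outputs.

```python
def exponential_fetch(x_list):
    # Same selection rule as _exponential_fetch, repeated so the snippet runs alone.
    count = len(x_list)
    i = 1
    inputs = []
    while i <= count:
        inputs.append(x_list[count - i])
        i *= 2
    return inputs

# Inputs of layer 5: the list holds the block input (0) and outputs of layers 1..4.
print(exponential_fetch(list(range(5))))  # [4, 3, 1]

# How the selection changes as more layers are appended.
for n in range(1, 9):
    print(n, exponential_fetch(list(range(n + 1))))
```

The first call reproduces the layer(4), layer(3), layer(1) example from the architecture section.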

In [0]:
# A single denseblock. Each layer output is used as input for a select number of future layers.
def _dense_block(x, nb_layers, growth_rate, bottleneck=False, dropout_rate=None, weight_decay=1e-4):
    ''' Args:
        x: input tensor
        nb_layers: the number of dense layers inside this dense block
        growth_rate: growth rate
        bottleneck: whether to add a bottleneck block
        dropout_rate: dropout rate
        weight_decay: weight decay factor
    Returns: keras tensor with nb_layers of conv_block appended
    '''
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    x_list = [x]

    for i in range(nb_layers):
        x = _dense_layer(x, growth_rate, bottleneck, dropout_rate, weight_decay)
        x_list.append(x)

        fetch_outputs = _exponential_fetch(x_list)
        x = concatenate(fetch_outputs, axis=concat_axis)

    return x
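
A rough way to see the effect of sparse aggregation on tensor width is to replay the block loop with channel counts in place of tensors. This is only a sketch: both function names are hypothetical, and the dense variant assumes that every earlier output is concatenated.

```python
def sparse_block_widths(input_channels, nb_layers, growth_rate):
    """Channel count of the concatenated tensor after each dense layer.

    Mirrors the loop in _dense_block, with integers in place of tensors.
    """
    channel_list = [input_channels]
    widths = []
    for _ in range(nb_layers):
        channel_list.append(growth_rate)      # each dense layer emits growth_rate channels
        # Exponential fetch: take entries at offsets 1, 2, 4, ... from the end.
        count = len(channel_list)
        fetched, offset = [], 1
        while offset <= count:
            fetched.append(channel_list[count - offset])
            offset *= 2
        widths.append(sum(fetched))
    return widths

def dense_block_widths(input_channels, nb_layers, growth_rate):
    """Same quantity for a fully dense block, where everything is concatenated."""
    return [input_channels + growth_rate * (i + 1) for i in range(nb_layers)]

print(sparse_block_widths(24, 12, 12))
print(dense_block_widths(24, 12, 12))
```

For 24 input channels, 12 layers and a growth rate of 12, the sparse block ends with 48 channels while the dense block ends with 168, which is where the parameter savings in the following convolutions come from.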


In [0]:
# A single transition block
def _transition_block(ip, nb_filter, compression=1.0, weight_decay=1e-4):
    ''' Args:
        ip: keras tensor
        nb_filter: number of filters
        compression: calculated as 1 - reduction. Reduces the number of feature maps
                    in the transition block.
        weight_decay: weight decay factor '''
    
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    with K.name_scope('transition_block'):
        x = BatchNormalization(axis=concat_axis, epsilon=1e-5, momentum=0.1)(ip)
        x = Activation('relu')(x)
        x = Conv2D(int(nb_filter * compression), (1, 1), kernel_initializer='he_normal', padding='same', use_bias=False,
                   kernel_regularizer=l2(weight_decay))(x)
        x = AveragePooling2D((2, 2), strides= 2)(x)

    return x

In [0]:
# A Sparsenet
def _create_SparseNet(img_height, img_width, channel, depth=40, nb_dense_block=3, growth_rate=12,
              bottleneck=False, reduction=0.0, dropout_rate=0.0, weight_decay=1e-4,
                      nb_classes=10, activation='softmax'):
    ''' Arguments
            img_height = Input height
            img_width = Input width
            channel = Number of channels in input
            depth: Total number of layers in the network (the L in DenseNet-L-k).
            nb_dense_block: number of dense blocks (generally = 3)
            growth_rate: number of filters to add per dense layer. Can be
                a single integer. Use a list of numbers if a different growth rate
                has to be used for each dense block.
            bottleneck: flag to add bottleneck blocks in dense layer
            reduction: reduction factor of transition blocks.
                Note : compression is computed as 1 - reduction.
            dropout_rate: dropout rate
            weight_decay: weight decay rate
            nb_classes: Number of classes to classify images into.
            activation: Type of activation at the top layer. Can be one of 'softmax' or 'sigmoid'.
                Note that if sigmoid is used, nb_classes must be 1.
    # Returns  A Keras model instance. '''
    
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    if activation not in ['softmax', 'sigmoid']:
        raise ValueError('activation must be one of "softmax" or "sigmoid"')

    # `nb_classes` is the parameter name; the original check referenced an undefined `classes`.
    if activation == 'sigmoid' and nb_classes != 1:
        raise ValueError('sigmoid activation can only be used when nb_classes = 1')

    if reduction != 0.0:
        assert reduction <= 1.0 and reduction > 0.0, 'reduction value must lie between 0.0 and 1.0'
    
  
    # Create a list of items. Each item i represents the number of dense layers in ith dense block.   
    count = int((depth - 4) / 3)
    # Since a dense layer with bottleneck has twice as many convolution layers, divide the number by 2 when using bottlenecks.
    if bottleneck:
        count = count // 2
    nb_layers = [count for _ in range(nb_dense_block)]
            
    
    # Create a list of items. Each item i represents growth rate in ith dense block. 
    if type(growth_rate) is list or type(growth_rate) is tuple:
        growth_rate = list(growth_rate)
        assert len(growth_rate) == len(nb_layers)
    else:
        growth_rate = [growth_rate for _ in range(len(nb_layers))]
   

    # compute compression factor
    compression = 1.0 - reduction

    input = Input(shape=(img_height, img_width, channel,))
    # Initial convolution
    x = Conv2D(2*growth_rate[0], (3, 3), kernel_initializer='he_normal', padding='same',
               strides=(1, 1), use_bias=False, kernel_regularizer=l2(weight_decay))(input)


    # Add dense blocks. Handle the last one separately since it does not have a transition block
    for block_idx in range(nb_dense_block - 1):
        x = _dense_block(x, nb_layers[block_idx], growth_rate[block_idx], bottleneck=bottleneck,
                                    dropout_rate=dropout_rate, weight_decay=weight_decay)
        # add transition_block; the channel count is read from the static shape as a plain int
        x = _transition_block(x, K.int_shape(x)[concat_axis], compression=compression, weight_decay=weight_decay)

    # The last dense_block does not have a transition_block
    x = _dense_block(x, nb_layers[-1], growth_rate[-1], bottleneck=bottleneck,
                                dropout_rate=dropout_rate, weight_decay=weight_decay)

    # Output block
    x = BatchNormalization(axis=concat_axis, epsilon=1e-5, momentum=0.1)(x)
    x = Activation('relu')(x)
    x = GlobalAveragePooling2D()(x)

    output = Dense(nb_classes, activation=activation)(x)

        
    # Create model.
    model = Model(inputs=[input], outputs=[output], name='densenet')
  
    return model
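
The way the depth is split across dense blocks can be checked in isolation. The helper below is a hypothetical extraction of the `count` computation, assuming three dense blocks as in the default configuration.

```python
def layers_per_block(depth, bottleneck=False):
    """Number of dense layers in each of the three dense blocks."""
    # Four layers sit outside the dense blocks: the initial convolution,
    # two transition convolutions and the final classifier.
    count = int((depth - 4) / 3)
    # A bottleneck layer contains two convolutions (1x1 then 3x3),
    # so only half as many dense layers fit into the same depth.
    if bottleneck:
        count = count // 2
    return count

print(layers_per_block(40))                    # 12
print(layers_per_block(100, bottleneck=True))  # 16
```

A depth of 40 therefore yields 12 dense layers per block, which matches the L = 40 breakdown given in the background section.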

In [0]:
#Hyperparameters
depth = 40
nb_dense_block = 3
growth_rate = 24
num_classes = 10

In [0]:
from keras.datasets import cifar10

# Load CIFAR10 Data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
  
img_height, img_width, channel = x_train.shape[1],x_train.shape[2],x_train.shape[3]

# convert labels to one-hot encoding
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)


In [0]:
# Data Augmentation - Horizontal flip
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(horizontal_flip=True)
# fit parameters from data
datagen.fit(x_train)

In [0]:
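# Data Augmentation - Random Shift. Note : ImageDataGenerator has no built-in random crop.
# This reassigns `datagen`, so the horizontal-flip generator from the previous cell is replaced;
# pass horizontal_flip=True here as well if both augmentations are wanted.
shift = 0.2
datagen = ImageDataGenerator(width_shift_range=shift, height_shift_range=shift)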
# fit parameters from data
datagen.fit(x_train)
  

#Data Normalization
Data normalization is a preprocessing step that standardizes the range of the independent variables, or features, of the data.<br><br>
For example, in an image classification task, consider two images of the same object, one of which was shot in low light. The network may fail to recognise them as the same object, since the pixel values of the darker image would be significantly lower. Subtracting the mean circumvents this issue and reduces the network's dependence on shooting conditions. <br><br>

Similarly, dividing by the standard deviation compensates for variations in the spread of the data about the mean, so that the two images have similar means and standard deviations. <br><br> Without similar ranges, a given step in the weights would affect each input differently.
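
The effect can be seen on synthetic data. In the sketch below the arrays are random stand-ins for image batches, not CIFAR images, and each batch is standardized with its own statistics purely for illustration; the notebook itself applies the training-set statistics to both the training and the test data.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical stand-in for an image batch: (samples, height, width, channels).
bright = rng.randint(100, 256, size=(64, 32, 32, 3)).astype("float32")
dim = bright * 0.25            # the same scenes under low light

def standardize(images, eps=1e-8):
    """Subtract the per-channel mean and divide by the per-channel std."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / (std + eps)

bright_n = standardize(bright)
dim_n = standardize(dim)

print(bright.mean(), dim.mean())        # very different raw brightness
print(np.abs(bright_n - dim_n).max())   # close to zero after standardization
```

After standardization the bright and dim versions are numerically almost identical, which is the invariance the paragraph above describes.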



In [0]:
# Data normalization
cifar_mean = x_train.mean(axis=(0, 1, 2), keepdims=True)
cifar_std = x_train.std(axis=(0, 1, 2), keepdims=True)

x_train = (x_train - cifar_mean) / (cifar_std + 1e-8)
x_test = (x_test - cifar_mean) / (cifar_std + 1e-8)


In [0]:
model = _create_SparseNet(img_height, img_width, channel, depth=depth, nb_dense_block=nb_dense_block,
                            growth_rate=growth_rate, nb_classes=num_classes)
print("Model created")
model.summary()

In [0]:
# Did not get better results with random crop. Hence, unused.
def random_crop(images, paddings):
    # Zero-pad height and width by `paddings` = (before, after); batch and channel axes are untouched.
    npad = ((0, 0), paddings, paddings, (0, 0))
    padded_images = np.pad(images, pad_width=npad, mode='constant', constant_values=0)

    # Crop back to the original size at one random offset shared by the whole batch.
    height, width = padded_images.shape[1:3]
    dy, dx = images.shape[1:3]
    x = np.random.randint(0, width - dx + 1)
    y = np.random.randint(0, height - dy + 1)
    return padded_images[:, y:(y+dy), x:(x+dx)]

In [0]:
from keras.optimizers import SGD
from keras.callbacks import ModelCheckpoint

sgd = SGD(lr=0.1, decay=0.0001, momentum=0.9, nesterov=True)

# determine Loss function and Optimizer
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

filepath="models-improvement.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

model.fit(x_train, y_train,
                    batch_size=100,
                    epochs=100,
                    verbose=1,
                    validation_data=(x_test, y_test), callbacks=callbacks_list)

from keras.models import load_model
model2 = load_model(filepath)

# Save the full model (architecture, weights and optimizer state) in .h5 format
model.save('SPNet-100epoch.h5')

# files.download is only available inside Google Colab
from google.colab import files
files.download('SPNet-100epoch.h5')
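
The `decay` argument passed to `SGD` above shrinks the learning rate after every weight update. Assuming the legacy Keras 2 behaviour, where the rate is `lr / (1 + decay * iterations)`, the schedule for this run can be sketched as follows; the helper name is made up for this example.

```python
def sgd_time_based_lr(initial_lr, decay, iteration):
    """Learning rate after `iteration` weight updates under time-based decay."""
    return initial_lr / (1.0 + decay * iteration)

samples, batch_size = 50000, 100           # CIFAR-10 training set, batch size used above
steps_per_epoch = samples // batch_size    # 500 updates per epoch

for epoch in (0, 10, 50, 100):
    lr = sgd_time_based_lr(0.1, 0.0001, epoch * steps_per_epoch)
    print(epoch, round(lr, 5))
```

Under that assumption the rate falls from 0.1 to about 0.0167 over the 100 epochs, before the manual reduction in the next cell.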

In [0]:
from keras.models import load_model
model2 = load_model(filepath)

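# Reduce the learning rate by a factor of 10, updating the optimizer's variable in place
K.set_value(model2.optimizer.lr, K.get_value(model2.optimizer.lr) / 10)
print(K.eval(model2.optimizer.lr))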

callbacks_list = [checkpoint]

model2.fit(x_train, y_train,
                    batch_size=100,
                    epochs=50,
                    verbose=1,
                    validation_data=(x_test, y_test), callbacks=callbacks_list)

In [0]:
keras.backend.eval(model2.optimizer.lr)

0.001

In [0]:
from keras.models import load_model
best_model_total = load_model(filepath)

# Test the model
score = best_model_total.evaluate(x_test, y_test, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Save the trained weights and the full model in .h5 format
best_model_total.save_weights("SPW.h5")
best_model_total.save("SPM.h5")

#print("Saved model to disk")

# files.download is only available inside Google Colab
from google.colab import files
files.download('SPW.h5')
files.download('SPM.h5')

Test loss: 0.5409464004039765
Test accuracy: 0.8902
