# Deep Residual Network

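This paper is about how to increase the depth of a network without harming its performance.
The authors note that vanishing and exploding gradients, long the main obstacle to training deep networks, are largely handled by normalized initialization and normalization layers. What remains is a degradation problem: as depth grows, accuracy saturates and then degrades, and this is the problem they set out to solve.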

They consider a shallow network and a deeper counterpart that adds more layers to it. For the deeper model there is a solution by construction: the added layers are identity mappings, and the remaining layers are copied from the shallow model.

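Given this construction, the deeper model should produce no higher training error than the shallow one. In practice, plain deep networks fail to reach such solutions, while the residual networks in the paper train well and give better results as depth increases.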

So how did they manage that?

In [214]:
# imports
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D, Add, ReLU

### How the model works
The formulation F(x) + x can be realized by feedforward neural networks with “shortcut connections”.
What are shortcut connections?
Shortcut connections are those that skip one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity.
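
To make F(x) + x concrete, here is a minimal pure-Python sketch rather than the paper's implementation. The names `residual_forward`, `zero_branch` and `scale_branch` are hypothetical, the vectors are plain lists, and the ReLU applied after the addition is left out. It shows that when the stacked layers output zero, the block reduces to the identity mapping.

```python
# Minimal sketch of a residual connection: y = F(x) + x.
# All names here are hypothetical and only serve the illustration.

def residual_forward(x, branch):
    """Apply the stacked layers `branch` to x and add the identity shortcut."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """Stacked layers whose weights are all zero, so F(x) = 0."""
    return [0.0 for _ in x]


def scale_branch(x):
    """A toy residual function F(x) = 0.5 * x."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_forward(x, zero_branch)  # F(x) = 0, so the block returns x
scaled_out = residual_forward(x, scale_branch)   # F(x) + x = 1.5 * x
print(identity_out, scaled_out)
```

The shortcut itself has no weights, which is why it adds no parameters: only `branch` would carry trainable values.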

### Model description

The paper describes several versions of the network. We will implement two of them: one with 34 layers and one with 152 layers.
We will start with the simpler one.

### The architecture
The authors were inspired by the VGG architectures. The network consists of blocks, each containing two conv layers with a (3,3) kernel size, and it follows a simple rule: when the feature map is halved (stride = 2), the number of filters is doubled.
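
The paper motivates this rule by keeping the cost per layer roughly constant. A rough sketch of the arithmetic, assuming illustrative sizes (a 56x56 map with 64 channels, then 28x28 with 128 channels) and a hypothetical helper `conv_multiply_adds` that ignores biases:

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel_size=3):
    """Multiply-adds of one conv layer that produces a height x width output map."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size


# Assumed example sizes: one stage, then the next stage with half the
# spatial size and twice the number of filters.
cost_stage_a = conv_multiply_adds(56, 56, 64, 64)
cost_stage_b = conv_multiply_adds(28, 28, 128, 128)
print(cost_stage_a, cost_stage_b)
```

Halving height and width divides the cost by four, and doubling both input and output channels multiplies it by four, so the two stages cost the same.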

### Problem
One small problem: to add two tensors they must have the same dimensions. This works fine at first, but when we set strides = 2 and double the number of filters, the shortcut and the stacked layers no longer match and the addition fails. The authors propose two solutions:
- the first is to use zero padding
- the second is to use a conv layer with a (1,1) kernel in the shortcut to match the shape of the stacked layers; this is the solution used in this notebook
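
To see why a (1,1) conv with the same stride lines up with the stacked layers, the spatial sizes can be checked by hand. This is a sketch under the usual output-size formulas for 'same' and 'valid' padding; `conv_output_size` is a hypothetical helper, not a Keras function.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a conv layer along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


matches = {}
for size in (56, 55, 28):
    # main branch: 3x3 conv, stride 2, 'same' padding
    main = conv_output_size(size, kernel=3, stride=2, padding="same")
    # shortcut branch: 1x1 conv, stride 2, 'valid' padding
    shortcut = conv_output_size(size, kernel=1, stride=2, padding="valid")
    matches[size] = (main, shortcut)
print(matches)
```

Both branches give the same height and width for even and odd inputs, and using the same number of filters in the shortcut conv makes the channel dimension match as well.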


In [195]:
# residual block with two conv layers and a projection shortcut
class residual_block(tf.keras.Model):
    def __init__(self, n_filters, kernel_size, strides):
        super(residual_block, self).__init__()
        # define the two conv layers; only the first one may downsample
        self.conv_res_1 = Conv2D(n_filters, (kernel_size, kernel_size), strides=(strides, strides), padding='same', activation='relu')
        self.conv_res_2 = Conv2D(n_filters, (kernel_size, kernel_size), strides=(1, 1), padding='same')
        # define the shortcut conv with a (1,1) kernel to match the shape of the stacked layers
        # (used in every block here for simplicity; the paper keeps identity shortcuts when shapes already match)
        self.short_cut_conv = Conv2D(n_filters, (1, 1), strides=(strides, strides), padding='valid')
        self.add = Add()
        self.relu = ReLU()

    def call(self, inputs):
        # keep the input for the shortcut
        X_short_cut = inputs
        # pass the input through the conv layers
        X = self.conv_res_1(inputs)
        X = self.conv_res_2(X)
        # project the shortcut to the shape of the stacked layers
        connection = self.short_cut_conv(X_short_cut)
        # add the two branches, then apply ReLU
        addition = self.add([X, connection])
        return self.relu(addition)
        
        

### The model
- they start the model with a conv layer
- followed by a max pooling layer
- they use 16 residual blocks
- then average pooling and a dense layer with one unit per output class
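
The layout above can be written out as a per-block plan. This is only a sketch: `block_plan` is a hypothetical helper, and it assumes the feature map is halved at blocks 4, 8 and 14, the positions used in this notebook's code.

```python
from collections import Counter


def block_plan(num_blocks=16, downsample_at=(4, 8, 14), base_filters=64):
    """Return (block index, filters, stride) for every residual block."""
    plan = []
    filters = base_filters
    for i in range(1, num_blocks + 1):
        stride = 1
        if i in downsample_at:
            filters *= 2  # double the filters when the feature map is halved
            stride = 2
        plan.append((i, filters, stride))
    return plan


plan = block_plan()
blocks_per_stage = Counter(filters for _, filters, _ in plan)
# first conv + two convs per block + final dense layer
weighted_layers = 1 + 2 * len(plan) + 1
print(dict(blocks_per_stage), weighted_layers)
```

Counting the first conv, two convs per block and the dense layer gives the 34 layers of the name; the (1,1) shortcut convs are not part of that count.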

So we define the 16 blocks with this little loop:

In [196]:
class res_model(tf.keras.Model):
    def __init__(self, num_classes):
        super(res_model, self).__init__()
        self.n_filter = 64
        self.first_layer = Conv2D(64, 7, strides=(2, 2), padding='same', activation='relu')
        self.max_pool = tf.keras.layers.MaxPooling2D((3, 3), strides=(2, 2))

        # build the 16 blocks and store them in a list attribute so Keras tracks their weights
        # (writing into vars(self) bypasses the attribute tracking Keras relies on)
        blocks = []
        for i in range(1, 17):
            # the feature map is halved and the number of filters doubled three times, at blocks 4, 8 and 14
            if i in [4, 8, 14]:
                # double the number of filters
                self.n_filter *= 2
                # halve the feature map by setting strides = 2
                blocks.append(residual_block(self.n_filter, 3, strides=2))
            else:
                blocks.append(residual_block(self.n_filter, 3, strides=1))
        self.blocks = blocks

        # 2x2 average pooling by default (the paper uses global average pooling)
        self.avg_pool = tf.keras.layers.AveragePooling2D()
        self.flat = tf.keras.layers.Flatten()
        self.out = Dense(num_classes, activation='softmax')

    def call(self, inputs):
        # pass the input through the first layer
        X = self.first_layer(inputs)
        X = self.max_pool(X)
        # pass the result through the blocks in order
        for block in self.blocks:
            X = block(X)
        X = self.avg_pool(X)
        X = self.flat(X)
        return self.out(X)
                
                
                

In [197]:
model = res_model(num_classes=10)  # 10 is a placeholder; use the number of classes in your dataset

### Model training

- the SGD optimizer, as in the paper
- a ReduceLROnPlateau callback, since the paper divides the learning rate by 10 when the error plateaus
- categorical_crossentropy as the loss
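
To get a feel for what reducing the learning rate on a plateau does, here is a simplified pure-Python simulation. It is a sketch, not the Keras implementation: `reduce_on_plateau` is a hypothetical function, it ignores the callback's `min_delta`, `cooldown` and `min_lr` options, the starting rate of 0.1 is an assumed value, and the validation losses are made up.

```python
def reduce_on_plateau(val_losses, initial_lr=0.1, factor=0.1, patience=5):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # multiply the learning rate by `factor`
                wait = 0
        history.append(lr)
    return history


# Made-up losses: three improving epochs, five flat epochs, then one improvement.
losses = [1.0, 0.8, 0.7] + [0.7] * 5 + [0.6]
lrs = reduce_on_plateau(losses)
print(lrs)
```

The rate stays at its initial value until the fifth epoch without improvement, then drops by a factor of ten and stays there.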

In [198]:
plateau_callback =  tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.1, patience=5
)

In [199]:
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

## Second version
This version is really deep, so they made some changes to the architecture.
This time each block consists of three conv layers:
- the first and last layers have a (1,1) kernel and control the number of channels: the first reduces it and the last restores it
- the second layer, with a (3,3) kernel, acts as the bottleneck and works on the reduced number of channels
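
A quick weight count shows why the bottleneck design is attractive for a very deep network. The channel numbers are illustrative assumptions (256 channels in and out, reduced to 64 inside the block), `conv_params` is a hypothetical helper, and biases are ignored.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a conv layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


# Two (3,3) convs working directly on 256 channels.
plain_block = 2 * conv_params(256, 256, 3)

# Bottleneck: (1,1) reduce to 64, (3,3) on 64 channels, (1,1) restore to 256.
bottleneck_block = (
    conv_params(256, 64, 1)
    + conv_params(64, 64, 3)
    + conv_params(64, 256, 1)
)
print(plain_block, bottleneck_block)
```

With these assumed sizes the bottleneck block needs roughly 17 times fewer weights than two full (3,3) convs on the wide input.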

In [209]:
class deep_residual_block(tf.keras.Model):
    def __init__(self, n_filters, kernel_size, strides):
        super(deep_residual_block, self).__init__()
        # define the conv layers; this time n_filters is a list because the three layers don't have to use the same number
        self.conv_res_1 = Conv2D(n_filters[0], (1, 1), strides=(strides, strides), padding='valid', activation='relu')
        self.conv_res_2 = Conv2D(n_filters[1], (kernel_size, kernel_size), strides=(1, 1), padding='same', activation='relu')
        self.conv_res_3 = Conv2D(n_filters[2], (1, 1), strides=(1, 1), padding='valid')
        # conv layer that matches the shortcut to the shape of the stacked layers; it has the same filters as the third conv layer
        self.short_cut_conv = Conv2D(n_filters[2], (1, 1), strides=(strides, strides), padding='valid')
        self.add = Add()
        self.relu = ReLU()

    def call(self, inputs):
        # keep the input for the shortcut
        X_short_cut = inputs
        # pass the input through the three conv layers
        X = self.conv_res_1(inputs)
        X = self.conv_res_2(X)
        X = self.conv_res_3(X)
        # project the shortcut, add the two branches, then apply ReLU
        connection = self.short_cut_conv(X_short_cut)
        addition = self.add([X, connection])
        return self.relu(addition)

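### The model

The model stays almost the same; we just define the number of filters as a list.
The paper's architecture table splits the 50 blocks of the 152-layer network into stages of 3, 8, 36 and 3 blocks. To keep things simple, this notebook instead halves the feature map whenever the block index i is divisible by 20.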
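
One detail is worth checking when the filter counts live in a list: in Python, multiplying a list by 2 repeats its elements instead of doubling them. The sketch below uses a hypothetical helper `deep_block_plan` to show the elementwise doubling and the schedule that the divisible-by-20 rule produces for 50 blocks.

```python
def deep_block_plan(num_blocks=50, every=20, base_filters=(64, 64, 128)):
    """Return (block index, filters, stride) for every bottleneck block."""
    plan = []
    filters = list(base_filters)
    for i in range(1, num_blocks + 1):
        stride = 1
        if i % every == 0:
            filters = [2 * f for f in filters]  # double each entry
            stride = 2
        plan.append((i, tuple(filters), stride))
    return plan


repeated = [64, 64, 128] * 2  # list repetition, not doubling
plan = deep_block_plan()
downsampling_blocks = [i for i, _, stride in plan if stride == 2]
# first conv + three convs per block + final dense layer
weighted_layers = 1 + 3 * len(plan) + 1
print(repeated, downsampling_blocks, plan[-1], weighted_layers)
```

With this rule the feature map is halved twice, at blocks 20 and 40, and the layer count comes out at 152.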

In [212]:
class deep_res_model(tf.keras.Model):
    def __init__(self, num_classes):
        super(deep_res_model, self).__init__()
        # filters of the three conv layers in a block (the paper's bottleneck uses a 4x expansion, e.g. [64, 64, 256])
        self.n_filter = [64, 64, 128]
        self.first_layer = Conv2D(64, 7, strides=(2, 2), padding='same', activation='relu')
        self.max_pool = tf.keras.layers.MaxPooling2D((3, 3), strides=(2, 2))
        # build the 50 blocks and store them in a list attribute so Keras tracks their weights
        blocks = []
        for i in range(1, 51):
            if i % 20 == 0:
                # double every filter count; `*= 2` on a list would repeat the list instead
                self.n_filter = [2 * f for f in self.n_filter]
                blocks.append(deep_residual_block(self.n_filter, 3, strides=2))
            else:
                blocks.append(deep_residual_block(self.n_filter, 3, strides=1))
        self.blocks = blocks
        self.avg_pool = tf.keras.layers.AveragePooling2D()
        self.flat = tf.keras.layers.Flatten()
        self.out = Dense(num_classes, activation='softmax')

    def call(self, inputs):
        X = self.first_layer(inputs)
        X = self.max_pool(X)
        # pass the result through all 50 blocks, not only the first few
        for block in self.blocks:
            X = block(X)
        X = self.avg_pool(X)
        X = self.flat(X)
        return self.out(X)
                

## Conclusion

That is all about residual networks. The paper is actually easy to read, and you should read it; you can also search for images that illustrate the architecture of the model.

You can try this model, but make sure the height and width of the images are big enough to fit the model to avoid errors, or decrease the number of blocks.
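
To check whether an image size is big enough before building the model, the spatial size can be traced by hand. This sketch assumes the layer settings of the 34-layer model in this notebook: a stride-2 'same' conv, a (3,3) stride-2 'valid' max pool, three stride-2 blocks and the default (2,2) average pooling. `trace_spatial_size` is a hypothetical helper.

```python
import math


def trace_spatial_size(input_size, num_stride2_blocks=3):
    """Return the height (or width) after each size-changing layer."""
    sizes = []
    size = math.ceil(input_size / 2)  # (7,7) conv, stride 2, 'same'
    sizes.append(size)
    size = (size - 3) // 2 + 1        # (3,3) max pool, stride 2, 'valid'
    sizes.append(size)
    for _ in range(num_stride2_blocks):
        size = math.ceil(size / 2)    # residual block with stride 2
        sizes.append(size)
    size = (size - 2) // 2 + 1        # (2,2) average pool, 'valid'
    sizes.append(size)
    return sizes


trace_224 = trace_spatial_size(224)
trace_64 = trace_spatial_size(64)
trace_32 = trace_spatial_size(32)
print(trace_224, trace_64, trace_32)
```

A final size of 0 means the average pooling layer has nothing left to pool, which is the kind of error a too-small image causes. Under these assumptions a 32x32 input is too small, while 64x64 still works.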