# Applied Deep Learning
> Applied Deep Learning [Course](https://github.com/maziarraissi/Applied-Deep-Learning) 

- toc: true 
- badges: true
- comments: true
- categories: [jupyter,deeplearning,python]


## Deep Learning overview
* We can look at deep learning as an algorithm that writes algorithms, like a compiler:
  - the source code is the data (examples/experiences)
  - the executable code is the deployable model

* Deep: function composition $f_l \circ f_{l-1} \circ \dots \circ f_1$
* Learning: loss, back-propagation, and gradient descent

* $L(\theta) \approx J(\theta)$: the mini-batch loss $L$ is a noisy estimate of the full objective $J$, which is why we call the method stochastic gradient descent.
* Why use the first-order derivative (the gradient) and not the second-order one (the Hessian)? For N parameters the gradient has N entries but the Hessian has N×N, so computing and storing it is expensive and slow.
### Optimizers
* To make gradient descent faster, we can add momentum to it.
* Another way is to use Nesterov accelerated gradient: the idea is to look ahead while computing the gradient, and combine that with the momentum.




* RMSprop: a mini-batch version of the rprop method. The original rprop doesn't work with mini-batches because it ignores the magnitude of the gradient and uses only its sign: each weight's step size is multiplied by a fixed factor, depending on whether the sign stayed the same or flipped. RMSprop instead divides the gradient by a running average of its recent magnitudes.

  ![](assets/Applied_deep_learning/rprop.png)


* Nesterov accelerated gradient: we know we are going to update the weights using both our average velocity so far and the gradient. With a large velocity going downhill, this can make us overshoot. So we first move the weights along the velocity to see where that gets us (the look-ahead term), and then update the weights using the gradient computed at that point.
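To make the difference between the two updates concrete, here is a minimal sketch on a toy objective. The objective $f(w) = \frac{1}{2} w^2$, the learning rate, and the momentum coefficient are assumptions chosen for illustration, not values from the course.

```python
def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (an assumption for illustration).
    return w


def momentum_step(w, v, lr=0.1, mu=0.9):
    # Classical momentum: the gradient is evaluated at the current weights.
    v = mu * v - lr * grad(w)
    return w + v, v


def nesterov_step(w, v, lr=0.1, mu=0.9):
    # Nesterov: first move along the velocity, then evaluate the gradient there.
    lookahead = w + mu * v
    v = mu * v - lr * grad(lookahead)
    return w + v, v


w_m, v_m = 5.0, 0.0
w_n, v_n = 5.0, 0.0
for _ in range(50):
    w_m, v_m = momentum_step(w_m, v_m)
    w_n, v_n = nesterov_step(w_n, v_n)

print(f"momentum: {w_m:.4f}, nesterov: {w_n:.4f}")
```

The only difference between the two functions is where `grad` is evaluated: at `w` for classical momentum, at `w + mu * v` for the look-ahead version.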

* Adam:
    - takes a different step size for each parameter (adaptive steps; borrows concepts from Adadelta)
    - also keeps momentum for every parameter, which can lead to faster convergence
* Nadam: just like Adam, but with Nesterov look-ahead added so we can slow down as we get near the goal
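A minimal sketch of one Adam update, assuming the commonly used default hyper-parameters (`lr`, `beta1`, `beta2`, `eps` are not given in these notes). It shows the two ingredients listed above: a momentum term `m` and a per-parameter scale `v`.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2    # per-parameter scale (second moment)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


theta = np.array([1.0, -2.0])
grad = np.array([10.0, -0.001])    # two very different gradient magnitudes
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta_new, m, v = adam_step(theta, grad, m, v, t=1)
print(theta_new - theta)
```

On the first step both parameters move by roughly `lr`, even though their gradients differ by four orders of magnitude. That is the adaptive step size in action.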

## Dropout
* A simple method to prevent the NN from overfitting: randomly drop units during training.
* CNNs are less prone to overfitting because of weight sharing: the same set of filters is used across the entire image.
* You can look at dropout as a smart way of ensembling, as it efficiently combines exponentially many different network architectures.
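A minimal sketch of dropout on a batch of activations. It uses the "inverted" variant (scaling at training time), which is one common convention and an assumption here. The function name and arguments are illustrative.

```python
import numpy as np


def dropout(x, p_drop, rng, training=True):
    # Inverted dropout: scale the kept units at training time so that
    # nothing has to change at inference time.
    if not training or p_drop == 0.0:
        return x
    keep = 1.0 - p_drop
    mask = rng.random(x.shape) < keep    # True for the units we keep
    return x * mask / keep


rng = np.random.default_rng(0)
x = np.ones((100, 100))
y_train = dropout(x, p_drop=0.5, rng=rng)
y_eval = dropout(x, p_drop=0.5, rng=rng, training=False)
print(y_train.mean(), y_eval.mean())
```

Every call with `training=True` samples a different mask, i.e. a different sub-network, which is where the ensembling view comes from.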


# Computer Vision


## Image Classification 


### Large Networks

#### Network In Network
* The main idea is to put a network inside another network.

* They introduced the multilayer perceptron conv layer (mlpconv): a conv layer followed by a few FC layers applied at every spatial position.
* This is basically a 1×1 convolution.
* They introduced global average pooling: instead of adding a bunch of FC layers at the end of the conv architecture, we average each channel of the last conv layer over its spatial positions, and the resulting vector forms the output layer.

* A 1×1 convolution is a normal convolution with a filter size of 1 by 1.

* In a conv net we want the network to be invariant both locally and globally: we should still predict that the photo shows a dog if the dog is shifted by a few pixels (local invariance), and also if the dog moves from the upper corner of the picture to the lower one (global invariance).
* We can achieve local invariance with pooling, and deal with global invariance with data augmentation.
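The 1×1 convolution and global average pooling from the Network In Network notes above can be sketched in a few lines of NumPy. The shapes and the choice of 10 output channels are assumptions for illustration.

```python
import numpy as np


def conv1x1(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)
    # The same small linear layer is applied at every pixel.
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]


def global_average_pool(x):
    # x: (C, H, W) -> (C,), one number per channel.
    return x.mean(axis=(1, 2))


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))      # 8 input channels, 5x5 feature map
w = rng.standard_normal((10, 8))        # project to 10 channels (e.g. 10 classes)
b = np.zeros(10)

features = np.maximum(0, conv1x1(x, w, b))    # 1x1 conv followed by ReLU
scores = global_average_pool(features)        # shape (10,)
print(scores.shape)
```

At any pixel `(i, j)` the 1×1 convolution computes `w @ x[:, i, j] + b`, which is why it behaves like a fully connected layer shared across positions.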


#### VGG Net

##### Local Response Normalization:
* The idea is to normalize a pixel across neighboring channels.

* After comparing nets with LRN and nets without it, they didn't find a big difference, so they stopped using it.

##### Data Augmentation
* Image translations (random crops) and horizontal reflections
* Altering the intensities of the RGB channels
* Scale jittering


#### GoogLeNet

* You stack multiple inception modules on top of each other.
* The idea is that you don't have to choose which filter size to use, so why not use them all?
* To make the network more efficient, they first project the input with a 1×1 convolution and then apply the main filters.
* You concatenate the outputs of the different filters along the channel dimension.

#### Batch Normalization 

* The main goal of batch normalization is to reduce `internal covariate shift`.

* Normalizing just the network inputs works fine for the first layer.
* The problem is that for each following layer, the statistics of its output depend on its weights, which keep changing during training.
* So we also need to normalize the inputs of the hidden layers.
* Here the gradient also flows through the mean and variance operations, so it gets a sense of what is going to happen.

* At inference we can't have batch-dependent mean and variance, so we use the average mean and variance over the whole dataset.

##### conv layers
* For conv layers we normalize per channel, computing the statistics over every pixel of every image in the batch.
* The effective batch size is therefore m·p·q, where m is the number of images in the batch and p, q are the height and width of the feature map.
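A minimal sketch of the training-time forward pass for a conv feature map, assuming an `(m, C, p, q)` layout and learnable per-channel `gamma` and `beta`. Running statistics for inference are left out to keep it short.

```python
import numpy as np


def batchnorm_conv(x, gamma, beta, eps=1e-5):
    # x: (m, C, p, q). Statistics are taken over the batch and both spatial
    # axes, so each channel is normalized using m * p * q values.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]


rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((16, 4, 8, 8)) + 2.0    # m=16, C=4, p=q=8
y = batchnorm_conv(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=(0, 2, 3)), y.var(axis=(0, 2, 3)))
```

With `gamma = 1` and `beta = 0`, every channel of the output has mean close to 0 and variance close to 1.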

##### Benefits of batch norm:
* you can use a higher learning rate, as training is more stable
* less sensitive to initialization
* less sensitive to the activation function
* it has a regularization effect, because each example is normalized with a different random mini-batch every time
* it possibly preserves gradient magnitude, because the Jacobian doesn't scale when we scale the weights

#### Parametric ReLU:

$f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$
* if $a_i = 0$ --> ReLU
* if $a_i = 0.01$ --> Leaky ReLU
* otherwise $a_i$ is a learned parameter

* The initialization of weights and biases depends on the type of activation function.

#### Kaiming Initialization (I didn't fully understand the heavy math in this lecture, as I'm still weak in statistics and variance calculations):

- The professor went into deep mathematical detail on how to choose the initial values for the weights.
* The main idea is to investigate the variance of the responses in each layer. Starting from the variance of one layer's output, we end up with a product of per-layer terms that depend on the weight variances. To prevent that product from vanishing or exploding, we want each term to be close to 1; for ReLU layers this gives zero-mean weights with variance $2/n$, where $n$ is the number of inputs to the layer.
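A small numerical check of the variance argument, as a sketch. It pushes random inputs through a stack of ReLU layers and compares the scaling proposed for ReLU layers in the Kaiming initialization paper (zero-mean weights with standard deviation $\sqrt{2/n}$) against an arbitrary small standard deviation. The depth, width, and the value `0.01` are assumptions for illustration.

```python
import numpy as np


def mean_square_after(depth, n, std, rng):
    # Push random inputs through `depth` ReLU layers of width n and return
    # the mean squared activation of the last layer.
    h = rng.standard_normal((200, n))
    for _ in range(depth):
        w = rng.standard_normal((n, n)) * std
        h = np.maximum(0, h @ w)
    return float((h ** 2).mean())


rng = np.random.default_rng(0)
n = 512
kaiming = mean_square_after(depth=10, n=n, std=np.sqrt(2.0 / n), rng=rng)
small = mean_square_after(depth=10, n=n, std=0.01, rng=rng)
print(kaiming, small)
```

With the $\sqrt{2/n}$ scaling the activations stay on the order of 1 after ten layers, while with the small fixed standard deviation they shrink towards zero.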

#### Label smoothing regularization
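- The idea is to regularize the network by softening the labels: with a small probability $\epsilon$ the true label is replaced by a random one, which in expectation moves a little probability mass from the true class to all the classes in the target.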
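In practice label smoothing is usually applied directly to the target distribution rather than by sampling wrong labels. A minimal sketch, assuming a smoothing value `eps = 0.1` and a uniform distribution over the classes:

```python
import numpy as np


def smooth_labels(labels, num_classes, eps=0.1):
    # Mix the one-hot target with a uniform distribution over all classes.
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes


targets = smooth_labels(np.array([2, 0]), num_classes=4)
print(targets)
```

Each row still sums to 1. The true class gets `1 - eps + eps / num_classes` and every other class gets `eps / num_classes`.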

#### ResNet 

* The main idea is to make the NN deeper so that it becomes better. In practice a plain network gets worse when you make it deeper, and we can fix that by adding residual (skip) connections.

##### Identity mapping in resnets 

* The idea is to apply no non-linear operations on the main branch (identity mapping), so that we keep a clean flow of data through the whole depth in both the forward and backward paths.

#### Wide Residual Networks 

* an attempt to make resnets wider and study if that would make them better 

#### ResNeXt
* Just like ResNets, but they replaced the bottleneck blocks with grouped-convolution blocks.

#### Squeeze-and-Excitation Networks

##### Squeeze: just a global average pooling step
##### Excitation: just a small fully connected network
##### Scaling: multiply every channel by its corresponding excitation value
* Scaling means paying different amounts of attention to different channels, as in attention models.
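The three steps above fit in a few lines. This is a sketch under assumptions: a two-layer excitation network with ReLU then sigmoid, a reduction ratio `r`, and random weights standing in for trained ones.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_block(x, w1, w2):
    # x: (C, H, W); w1: (C // r, C); w2: (C, C // r)
    s = x.mean(axis=(1, 2))                     # squeeze: one number per channel
    e = sigmoid(w2 @ np.maximum(0, w1 @ s))     # excitation: small FC network
    return x * e[:, None, None], e              # scaling: reweight each channel


rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.standard_normal((C, 6, 6))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y, gates = se_block(x, w1, w2)
print(gates)
```

The gates lie between 0 and 1, so each channel is attenuated by an amount that depends on the global content of the feature map.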

#### Spatial Transformer Network 
* The main idea is to separate out the main object in the image, like putting a box around it that can then be resized, shifted, and rotated. In the end we have a focused image that contains only the object, which makes it easier for the convolutions that follow.

* The idea is to first find good transformation parameters theta, which you can do with a NN.
* Then for every position in the output image, you do bilinear sampling from the input image.


#### Dynamic Routing Between Capsules

* The idea is to make the output vector of a capsule have a norm that equals the probability that an object is present.

### Small Networks


#### Knowledge Distillation

* The main idea is to train on artificial data coming from the giant model: we take the normal training dataset together with the giant model's outputs on it, smoothed by a softmax with temperature `T`. We then train the distilled model on this dataset, using the same temperature `T` that we used to smooth the data. In production we set the temperature back to 1 and use the distilled model for inference.
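A minimal sketch of the temperature softmax and a soft-target loss. The loss is written as a cross-entropy between the teacher's and the student's softened outputs, which is one common formulation and an assumption here. The logits are made up for illustration.

```python
import numpy as np


def softmax_t(logits, T=1.0):
    # Softmax with temperature T; larger T gives a smoother distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)    # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, T):
    # Cross-entropy between the teacher's soft targets and the student's
    # softened predictions, both computed with the same temperature.
    p_teacher = softmax_t(teacher_logits, T)
    log_p_student = np.log(softmax_t(student_logits, T))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


teacher = np.array([[5.0, 1.0, 0.0]])
print(softmax_t(teacher, T=1.0), softmax_t(teacher, T=4.0))

good_student = np.array([[5.0, 1.0, 0.0]])
bad_student = np.array([[0.0, 1.0, 5.0]])
loss_good = distillation_loss(good_student, teacher, T=4.0)
loss_bad = distillation_loss(bad_student, teacher, T=4.0)
print(loss_good, loss_bad)
```

Raising `T` exposes the relative scores of the wrong classes, which is the extra information the small model learns from.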

#### Network Pruning:
* All connections with weights below a threshold (in absolute value) are removed from the network.
* The weights are now sparse.
* Then we can represent them using fewer bits.
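Magnitude pruning as described above is a one-liner. The threshold and the weight values are assumptions for illustration, and retraining after pruning is not shown.

```python
import numpy as np


def prune_by_magnitude(w, threshold):
    # Keep only the connections whose absolute weight reaches the threshold.
    mask = np.abs(w) >= threshold
    return w * mask, mask


w = np.array([[0.5, -0.01], [0.02, -0.8]])
w_pruned, mask = prune_by_magnitude(w, threshold=0.1)
print(w_pruned)
print("kept", int(mask.sum()), "of", mask.size, "weights")
```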

#### Quantization
* We basically cluster our weights around a set of centroids and store only the centroid index for each weight.
* Conv layers get more centroids than FC layers. Why:
    - conv layers already have few weights (the filters are shared), so we need a higher level of precision in them
    - FC layers are so dense that they can tolerate fewer quantization levels
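A sketch of weight clustering with a tiny 1-D k-means. The linear initialization of the centroids, the number of iterations, and the example weights are assumptions for illustration.

```python
import numpy as np


def kmeans_quantize(w, k, iters=20):
    # Cluster the weights into k shared values; return the centroids and,
    # for every weight, the index of the centroid that replaces it.
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = flat[idx == j].mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx.reshape(w.shape)


w = np.array([[-1.0, -1.1, 0.0], [1.0, 0.9, 0.1]])
centroids, codes = kmeans_quantize(w, k=3)
w_quantized = centroids[codes]
print(centroids)
print(codes)
```

With `k` centroids each weight needs only `log2(k)` bits for its index, plus one small table of centroid values per layer.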

#### Huffman Coding

* Store the more common symbols with fewer bits, and the rarer ones with more bits.
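A minimal sketch that computes Huffman code lengths with the standard library. The symbol stream is a made-up example standing in for quantized weight indices.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    # Return {symbol: code length in bits} for a Huffman code.
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, tie-breaker id, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


symbols = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(symbols)
total_bits = sum(lengths[s] for s in symbols)
print(lengths, total_bits)
```

The most frequent symbol gets a 1-bit code, so the 16 symbols take 28 bits instead of the 32 bits a fixed 2-bit code would need.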

#### SqueezeNet
* The idea is to squeeze the network with 1×1 convolutions, which cut the number of channels, and then expand again (with a mix of 1×1 and 3×3 filters) to make up for the squeeze.
* The main idea is to use 1×1 convolutions to reduce the dimensionality.

#### XNOR-Net

* The idea is to convert the weights and inputs to binary values, which saves a lot of memory and computation.
* Starting from pre-trained weights, we binarize them by approximating $W \approx \alpha B$, where $\alpha$ is a positive 32-bit constant and $B$ is a binary matrix.
* This means we choose $\alpha$ and $B$ to minimize the squared error between the original weights $W$ and $\alpha B$.

* I still can't fully understand how to binarize the input.
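For the weight side only, the squared-error problem has a simple closed-form solution: $B = \mathrm{sign}(W)$ and $\alpha$ equal to the mean absolute weight. A minimal sketch, with made-up weights and without the input binarization:

```python
import numpy as np


def binarize_weights(w):
    # Minimizer of ||W - alpha * B||^2 over alpha > 0 and B in {-1, +1}.
    b = np.where(w >= 0, 1.0, -1.0)
    alpha = float(np.abs(w).mean())
    return alpha, b


def approx_error(w, alpha, b):
    return float(((w - alpha * b) ** 2).sum())


w = np.array([[0.5, -1.5], [2.0, -1.0]])
alpha, b = binarize_weights(w)
print(alpha, b, approx_error(w, alpha, b))
```

Any other scale with the same `b` gives a larger error, for example `approx_error(w, 1.0, b)`.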

#### Mobile Nets

* The idea is to reduce computational cost by convolving each channel separately (depthwise convolution) instead of across channels.
* So the number of filters equals the number of input channels.
* But then the output has as many channels as the input, so we still need a 1×1 convolution to produce the desired number of output channels.
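The saving can be checked with simple arithmetic on the number of multiply-adds. The layer sizes below are assumptions for illustration.

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    # Every output channel looks at every input channel with a k x k filter.
    return k * k * c_in * c_out * h * w


def depthwise_separable_cost(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w     # one k x k filter per input channel
    pointwise = c_in * c_out * h * w     # 1x1 convolution to mix the channels
    return depthwise + pointwise


std = standard_conv_cost(3, 32, 64, 56, 56)
sep = depthwise_separable_cost(3, 32, 64, 56, 56)
ratio = sep / std
print(std, sep, ratio)
```

The ratio works out to `1 / c_out + 1 / k**2`, so with 3×3 filters the separable version costs roughly eight to nine times less.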

#### Xception
* Unify the filter sizes of the Inception module, apply them to each channel separately, then do a 1×1 convolution to fix the output size.

#### MobileNetV2
* The same as MobileNet, but with residual connections.

#### ShuffleNet

* The idea is to shuffle the channels after doing a group convolution.
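A channel shuffle can be written as a reshape, a transpose, and a reshape back. A minimal sketch, assuming a `(C, H, W)` layout:

```python
import numpy as np


def channel_shuffle(x, groups):
    # x: (C, H, W). Interleave the channels of the different groups so the
    # next group convolution sees channels coming from every group.
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)


x = np.arange(6).reshape(6, 1, 1)    # channel i holds the value i
shuffled = channel_shuffle(x, groups=2)
print(shuffled[:, 0, 0].tolist())
```

With two groups `[0, 1, 2]` and `[3, 4, 5]`, the shuffled order is `[0, 3, 1, 4, 2, 5]`.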

### Auto ML

* The question is: can we automate architecture engineering, the way DL automated feature engineering?
* We can use an RNN that outputs probabilities from which we sample an architecture, train a network with that architecture, and give its evaluation accuracy back to the RNN as feedback.



#### Regularized Evolution
* It's basically random search + selection.
* At first you randomly choose some architectures, train and evaluate each one, and push it to the population.
* Then you sample some architectures from the population.
* Then you select the most accurate model in your sample and mutate it (i.e. change some part of its architecture), train and evaluate the result, and add it to the population.
* Then you remove the oldest architecture from the population.
* You keep repeating this cycle until you have evolved for C cycles (the history size reaches the limit), then report the best architecture.
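The loop above can be sketched with a toy search space. Everything problem-specific here is hypothetical: an "architecture" is a list of 8 bits, `evaluate` counts the ones as a stand-in for training and measuring validation accuracy, and `mutate` flips one bit.

```python
import random
from collections import deque


def regularized_evolution(evaluate, random_arch, mutate, population_size=20,
                          sample_size=5, cycles=200, seed=0):
    rng = random.Random(seed)
    population = deque()    # oldest architecture sits on the left
    history = []
    while len(population) < population_size:
        arch = random_arch(rng)
        entry = (evaluate(arch), arch)
        population.append(entry)
        history.append(entry)
    while len(history) < cycles:
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda e: e[0])    # best of the sample
        child = mutate(parent[1], rng)
        entry = (evaluate(child), child)
        population.append(entry)
        history.append(entry)
        population.popleft()                        # remove the oldest
    return max(history, key=lambda e: e[0])


def random_arch(rng):
    return [rng.randint(0, 1) for _ in range(8)]


def mutate(arch, rng):
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def evaluate(arch):
    return sum(arch)


best_score, best_arch = regularized_evolution(evaluate, random_arch, mutate)
print(best_score, best_arch)
```

Removing the oldest member rather than the worst one is what makes the evolution "regularized": an architecture survives only if its descendants keep doing well.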

#### EfficientNet 

* The idea is to do a grid search on a small network to find the best depth scaling coefficient `d`, width scaling coefficient `w`, and resolution scaling coefficient `r`. Then we pick the compound scaling parameter $\phi$, which scales all three together, to get the best accuracy while keeping the `flops` under the limit.
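A sketch of the compound scaling arithmetic. The default values `d=1.2`, `w=1.1`, `r=1.15` are the coefficients reported in the EfficientNet paper for its baseline network, and are an assumption as far as these notes go. FLOPs are taken to grow linearly with depth and quadratically with width and resolution.

```python
def compound_scale(phi, d=1.2, w=1.1, r=1.15):
    # Scale depth, width and resolution together with one exponent phi.
    depth_mult = d ** phi
    width_mult = w ** phi
    res_mult = r ** phi
    flops_mult = (d * w ** 2 * r ** 2) ** phi
    return depth_mult, width_mult, res_mult, flops_mult


for phi in range(4):
    depth_mult, width_mult, res_mult, flops_mult = compound_scale(phi)
    print(phi, round(depth_mult, 2), round(width_mult, 2),
          round(res_mult, 2), round(flops_mult, 2))
```

Because `d * w**2 * r**2` is close to 2 for these coefficients, each increment of `phi` roughly doubles the FLOPs, which makes it easy to pick the largest `phi` that fits a budget.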


### Robustness 

* The main goal is to make your network robust against adversarial attacks.

#### Intriguing properties of neural networks
* There's nothing special about individual units and the individual features the network learns: a random direction in activation space can be interpreted just as well. So the entire space matters.
* Neural networks have blind spots: you can add small perturbations to an image that are not noticeable to the human eye but make the network misclassify it.
* Adversarial examples tend to stay hard even for models trained with different hyper-parameters, or even on different training datasets.

* You can train your network to defend against attacks, but that's expensive: first you have to train your network, then optimize again to find some adversarial examples, then add those examples to the training set, and finally train for a third time.

* A small perturbation to the image leads to a huge perturbation in the activations, due to the high dimensionality.
#### Untargeted adversarial examples
* Fast gradient sign: take the sign of the gradient of the loss with respect to the input image, scale it by a small step, and add it to the original image to generate an adversarial example.
* Then you can train on a weighted sum of two losses, one for the original example and one for the adversarial one, so that the network becomes more robust to adversarial examples.
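A minimal sketch of the fast gradient sign step on a logistic-regression "network", so that the input gradient can be written by hand. The weights, the input, and the step size `eps` are made-up values for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_input_grad(x, y, w, b):
    # Binary cross-entropy of a logistic model and its gradient w.r.t. x.
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(loss), (p - y) * w


def fgsm(x, y, w, b, eps):
    # Move every input dimension by eps in the direction that raises the loss.
    _, g = loss_and_input_grad(x, y, w, b)
    return x + eps * np.sign(g)


w = np.array([2.0, -3.0, 1.0])
b = 0.5
x = np.array([1.0, 0.5, -1.0])
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.1)
loss_clean, _ = loss_and_input_grad(x, y, w, b)
loss_adv, _ = loss_and_input_grad(x_adv, y, w, b)
print(loss_clean, loss_adv)
```

No input dimension changes by more than `eps`, yet the loss goes up because all the small changes push in the same harmful direction.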

#### Towards Evaluating the Robustness of Neural Networks
* Another way to generate targeted adversarial examples is to optimize an objective function that pushes the logit of the target class to be the largest, so that this class is selected.

![](assets/Applied_deep_learning/adversiral-attack-algo.png)


### Visualizing & Understanding
* Now we want to debug our network, to understand how it works.
* So we want to do a backward pass that inverts our forward pass, but then we have a problem with pooling layers, because they subsample their input.
* So we store the locations of the max pixels chosen in each pooling operation, so that we can upsample again in the backward pass.
* We call these max locations switches.
* The main idea is that visualizing the feature maps helps you modify the network.
* You can have two models that give the same output for the same input, but which one do you trust more?
    - To answer that, you need to see which features each of them focuses on; the model that focuses on features that really matter for the classification is the more trustworthy one.
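The switch mechanism for pooling layers described above can be sketched for a single 2-D feature map. The function names and the non-overlapping `k × k` windows are assumptions for illustration.

```python
import numpy as np


def max_pool_with_switches(x, k=2):
    # x: (H, W) with H and W divisible by k. Returns the pooled map and,
    # for each window, the flat index of the max inside it (the switch).
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            switches[i, j] = window.argmax()
            pooled[i, j] = window.max()
    return pooled, switches


def unpool(pooled, switches, k=2):
    # Put every pooled value back where its max came from; the rest is zero.
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(int(switches[i, j]), k)
            out[i * k + di, j * k + dj] = pooled[i, j]
    return out


x = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 5.0],
              [0.0, 0.0, 9.0, 0.0],
              [7.0, 0.0, 0.0, 0.0]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
print(pooled)
print(restored)
```

The restored map keeps the four maxima at their original positions and is zero everywhere else, which is the approximate inverse used in the backward visualization pass.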
#### LIME: Local Interpretable Model-agnostic Explanations
* You want to be able to trust the model, meaning you want to make sure it prioritized the important features.
* But you can't directly interpret non-linear models, so the idea is to fit a locally linear model that matches the original model's output around your input example, then use this linear model to read off the features that the model prioritized.
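A simplified sketch of the local linear surrogate idea. It is not the LIME library: it perturbs a numeric input with Gaussian noise, weights the samples by proximity, and fits a weighted least-squares line. The black-box `model`, the kernel width, and the sample count are all assumptions for illustration.

```python
import numpy as np


def local_linear_explanation(predict, x0, num_samples=500, sigma=0.1, seed=0):
    # Fit a linear model to `predict` in a small neighbourhood of x0 and
    # return one coefficient per feature.
    rng = np.random.default_rng(seed)
    samples = x0 + sigma * rng.standard_normal((num_samples, x0.size))
    targets = predict(samples)
    dist2 = ((samples - x0) ** 2).sum(axis=1)
    weights = np.exp(-dist2 / (2 * sigma ** 2))    # closer samples count more
    design = np.hstack([samples - x0, np.ones((num_samples, 1))])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, targets * sw[:, 0], rcond=None)
    return coef[:-1]                               # drop the intercept


def model(X):
    # Hypothetical black box: depends strongly on feature 0, weakly on
    # feature 2, and not at all on feature 1.
    return np.tanh(3.0 * X[:, 0] + 0.1 * X[:, 2])


coef = local_linear_explanation(model, x0=np.zeros(3))
print(coef)
```

The largest coefficient belongs to feature 0, so the surrogate correctly reports it as the feature the model relies on near this input.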

### Transfer Learning

## Image Transformation


## Object Detection