# Neural Networks for Images

Examples of scientific papers:
- [Theoretical](https://arxiv.org/abs/1711.10925)
- [Practical](https://arxiv.org/abs/1506.01497)

You can have more than one test set, but at least one of them must always be sampled at random from the data.

[Covariate shift](https://www.seldon.io/what-is-covariate-shift) is a situation in which the distribution of the model's input features in production differs from what the model saw during training and validation.

Something as subtle as a change in lighting can shift the distribution of data points and thereby lower the accuracy of the model. In facial recognition, the training data may lack specific ethnicities or age groups. When the model is deployed in a live environment, subjects who are not represented in the training data may have a feature distribution the model does not recognise.

Covariate shift can cause serious issues for speech recognition models because of the diversity of voices, dialects and accents in spoken language. For example, a model may be trained on English speakers from a specific area with a specific accent. Although the model may achieve high accuracy on the training data, it is likely to become less accurate when processing speech in a live environment, because speech with new dialects or accents follows a different input distribution from the training data.

We can mitigate this by collecting more real-world data and retraining the model on it.
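As a rough illustration of the lighting example, the sketch below simulates a training set and a darker "production" set and compares one summary feature (mean brightness per image). The data, the 0.6 darkening factor, and the standardized-difference score are all assumptions made for this illustration, not a standard drift detector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 grayscale 8x8 "training" images with values in [0, 1].
train_images = rng.uniform(0.3, 1.0, size=(200, 8, 8))

# Simulated production images: same kind of scenes under darker lighting
# (every pixel scaled by an assumed factor of 0.6).
prod_images = 0.6 * rng.uniform(0.3, 1.0, size=(200, 8, 8))

def mean_brightness(images):
    """Return one summary feature per image: its mean pixel intensity."""
    return images.mean(axis=(1, 2))

train_feat = mean_brightness(train_images)
prod_feat = mean_brightness(prod_images)

# Standardized difference between the two feature distributions.
pooled_std = np.sqrt((train_feat.var() + prod_feat.var()) / 2)
shift_score = abs(train_feat.mean() - prod_feat.mean()) / pooled_std

print(f"train mean brightness: {train_feat.mean():.3f}")
print(f"prod  mean brightness: {prod_feat.mean():.3f}")
print(f"standardized shift:    {shift_score:.1f}")
```

A large score suggests that the production inputs no longer look like the training inputs, which is the signal to collect new data and retrain.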

Interpreting NNs:
- [ZFNet](https://paperswithcode.com/method/zfnet) - ZFNet is a classic convolutional neural network. The design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to AlexNet, both the filter sizes and the stride of the convolutions are reduced.
- [Activation atlases](https://openai.com/research/introducing-activation-atlases) - Activation atlases are a way to see some of what goes on inside the "black box" of a neural network.
- [Saliency map](https://analyticsindiamag.com/what-are-saliency-maps-in-deep-learning/) - This [method](https://arxiv.org/pdf/1312.6034.pdf) is derived from the concept of saliency in images. Saliency refers to the distinctive features (pixels, resolution, etc.) of an image in the context of visual processing. These features mark the most visually prominent locations in an image, and a saliency map is a topographical representation of them.
![image.png](attachment:3d0abcf1-9422-4087-aee7-2369d0974222.png)

Deep learning is not limited to regression and classification, as classical ML largely is; it can be applied to a much wider range of problems.

In principle we can tackle any task. All we need is a **differentiable function** that represents what we want to achieve, because training relies on gradient descent (GD).
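To make the role of differentiability concrete, here is a minimal sketch of gradient descent on a one-parameter objective. The objective `(w - 3)^2`, the starting point, and the learning rate are arbitrary choices for illustration.

```python
def loss(w):
    """A differentiable objective with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # assumed step size

for _ in range(100):
    # Step against the gradient; this is only possible because
    # the loss has a derivative at every point.
    w -= learning_rate * grad(w)

print(w, loss(w))
```

The same loop works for any objective as long as its gradient can be computed, which is why the objective must be differentiable.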

## Convolutional Neural Networks

If you are working with images, you almost always use convolutional layers (a convolutional NN).

#### Convolution

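Convolution is related to the [Fourier transform](https://en.wikipedia.org/wiki/Fourier_transform): convolution in the spatial domain corresponds to pointwise multiplication in the frequency domain.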

We apply a convolution like a filter (from the back to the front of the image): 

![image.png](attachment:5f3c6ef8-be95-4ddf-a140-c58c2c665fbc.png)

Strictly speaking, the operation is [cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation) (the filter is not flipped), but in deep learning we call it **convolution**.

Cross-correlation is also the basis of [template matching](https://medium.com/mlearning-ai/image-template-matching-using-cross-correlation-2f2b8e59f254).
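The difference between the two operations can be sketched in plain NumPy. The helper names `cross_correlate2d` and `convolve2d` are defined here for illustration only (they are not library calls), and the loops are written for clarity rather than speed.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """True convolution: flip the kernel along both axes, then cross-correlate."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

# A 5x5 ramp image whose values grow by 1 from left to right.
image = np.arange(25, dtype=float).reshape(5, 5)

# An asymmetric kernel, so flipping it changes the result.
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])

corr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)
print(corr[0, 0], conv[0, 0])  # -8.0 8.0
```

For a kernel that is symmetric under a 180-degree rotation the two results coincide. Since a network learns its filter weights, the flip makes no practical difference, which is presumably why the simpler operation is used.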

Convolutional volume: an image has 3 dimensions (height, width, number of channels).

![image.png](attachment:f6f8e2d8-35ce-48eb-9c4c-29346fea176f.png)

We treat the channels of an image (for example R, G and B) as independent, so a different filter can be applied to each channel.

Neighbouring pixels tend to have similar characteristics, but we do not assume the same across channels: we treat them as completely independent. This allows us to perform many convolutions at once.
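A minimal sketch of this idea, assuming a random 6x6 RGB image and a random 3x3 filter with one slice per channel. Each channel is filtered with its own slice; in a standard convolutional layer the per-channel results are then summed into a single feature map, which the last lines reproduce. `correlate_valid` is a helper written for this example.

```python
import numpy as np

def correlate_valid(channel, kernel):
    """Cross-correlate one 2D channel with one 2D kernel (no padding)."""
    kh, kw = kernel.shape
    out_h = channel.shape[0] - kh + 1
    out_w = channel.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(channel[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))          # (height, width, channels)
filt = rng.standard_normal((3, 3, 3))  # one 3x3 slice per input channel

# Each channel gets its own 2D filter slice.
per_channel = [correlate_valid(image[:, :, c], filt[:, :, c]) for c in range(3)]

# Summing the per-channel responses gives one output feature map.
feature_map = np.sum(per_channel, axis=0)
print(image.shape, feature_map.shape)  # (6, 6, 3) (4, 4)
```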

In summary:

We pick a convolution (filter) that is well known to us, for example one that highlights vertical or horizontal edges ([Sobel edge detection](https://homepages.inf.ed.ac.uk/rbf/HIPR2/sobel.htm)), and apply it to an image to extract information from it.
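As a small sketch of this, the horizontal-gradient Sobel kernel below is applied to a synthetic image with a single vertical edge. The 6x6 test image and the `correlate_valid` helper are assumptions made for the example.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Cross-correlate a 2D image with a 2D kernel (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernel that responds to vertical edges (left-to-right intensity changes).
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Synthetic image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges = correlate_valid(image, sobel_x)
print(edges[0])  # [0. 4. 4. 0.]
```

The response is zero in the flat regions and large where the window straddles the edge.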

We can do this with many filters because images support arithmetic: we can easily add, subtract and otherwise combine them. This makes working with these filters and combining their results straightforward.

![image.png](attachment:dbc32ee4-16e6-4656-92b8-413174fbdf9e.png)
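One way to combine filter results with image arithmetic is to merge the horizontal and vertical Sobel responses into a gradient magnitude. This is a sketch on a synthetic image (a bright square on a dark background); the helper `correlate_valid` is defined here for illustration.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Cross-correlate a 2D image with a 2D kernel (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
sobel_y = sobel_x.T  # responds to horizontal edges

# Synthetic image: a bright 3x3 square in the middle of a dark 7x7 image.
image = np.zeros((7, 7))
image[2:5, 2:5] = 1.0

gx = correlate_valid(image, sobel_x)
gy = correlate_valid(image, sobel_y)

# Element-wise arithmetic on the two filtered images.
magnitude = np.sqrt(gx ** 2 + gy ** 2)
print(magnitude.round(1))
```

The magnitude is zero where the window sits entirely inside the square and nonzero around its border, regardless of the edge orientation.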

The filters are usually small and odd in size (3, 5, 7, etc.), so that they have a central pixel.

![image.png](attachment:8ba5a05e-ae55-466c-937d-7ef143764bfe.png)

Applying these filters removes a "frame" of outer pixels from the image. The bigger the filter, the more pixels are removed.
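The size of the removed frame follows directly from the filter size. A small sketch, assuming a 28-pixel-wide input and a stride of 1 (both chosen for illustration):

```python
def valid_output_size(n, k):
    """Output length along one axis for an n-pixel input and a k-pixel filter."""
    return n - k + 1

n = 28  # assumed input width
for k in (3, 5, 7):
    out = valid_output_size(n, k)
    lost_per_side = (k - 1) // 2  # width of the removed frame
    print(f"filter {k}x{k}: {n} -> {out} pixels, {lost_per_side} lost on each side")
```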

We deal with this issue through **Padding**: adding pixels around the image so that the frame is included in the result. There are 2 types:
 - Valid convolution - no padding
   - If we apply many valid convolutions one after the other, at some point the image volume gets so small that it is unusable.
 - Same convolution - pad so that the output size remains unchanged. The added pixels are most commonly filled with 0 (zero padding); for an image normalized to [0, 1], 0.5 (the middle of the range) is an alternative.
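The two padding modes can be sketched with `np.pad`. The 5x5 image, the 3x3 averaging kernel, and the fill value are assumptions for illustration; `fill_value` can be changed to try a different padding constant.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Cross-correlate a 2D image with a 2D kernel (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((5, 5))
kernel = np.ones((3, 3)) / 9.0  # simple averaging filter

# Valid convolution: no padding, the output shrinks.
valid = correlate_valid(image, kernel)

# Same convolution: pad (k - 1) / 2 pixels on every side first.
pad = (kernel.shape[0] - 1) // 2
fill_value = 0.0  # assumed fill; another constant such as 0.5 also works here
padded = np.pad(image, pad, mode="constant", constant_values=fill_value)
same = correlate_valid(padded, kernel)

print(valid.shape, same.shape)  # (3, 3) (5, 5)

# Stacking valid 3x3 convolutions on a 32-pixel-wide input:
n, k, layers = 32, 3, 0
while n >= k:
    n = n - k + 1
    layers += 1
print(layers, n)  # 15 2
```

With zero fill, the border values of `same` are lower than the interior because the padded pixels pull the average down; this is the usual trade-off of constant padding.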

The sliding window step (stride) can be different from 1. In this case it is 2, which leads to fewer output pixels:

![image.png](attachment:30b142de-84bb-437c-8c00-f1b47f413a32.png)
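The effect of the stride on the output size can be sketched as follows. The formula `floor((n + 2p - k) / s) + 1` is the usual one for an `n`-pixel input, `k`-pixel filter, padding `p` and stride `s`; the helper names and the 7x7 test image are assumptions for this example.

```python
import numpy as np

def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

def correlate_strided(image, kernel, stride):
    """Cross-correlation (no padding) that moves the window `stride` pixels at a time."""
    kh, kw = kernel.shape
    out_h = conv_output_size(image.shape[0], kh, stride)
    out_w = conv_output_size(image.shape[1], kw, stride)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the window
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

out_s1 = correlate_strided(image, kernel, stride=1)
out_s2 = correlate_strided(image, kernel, stride=2)
print(out_s1.shape, out_s2.shape)  # (5, 5) (3, 3)
```

A stride of 2 keeps every second position of the stride-1 result in each direction, so `out_s2` equals `out_s1[::2, ::2]`.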

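Dilated convolution also exists: the filter elements are spaced apart, with gaps between them, so the same number of weights covers a larger area of the image, as the animation shows.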

![SegmentLocal](images/0_3cTXIemm0k3Sbask.gif "segment")
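One way to picture dilation is as an ordinary filter with zeros inserted between its elements. The sketch below builds such a filter; `dilate_kernel` is a helper written for this illustration and assumes a square kernel.

```python
import numpy as np

def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between neighbouring elements of a square kernel."""
    k = kernel.shape[0]
    size = k + (k - 1) * (dilation - 1)  # effective filter size
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::dilation, ::dilation] = kernel
    return dilated

kernel = np.ones((3, 3))
dilated = dilate_kernel(kernel, dilation=2)

print(dilated.shape)              # (5, 5)
print(np.count_nonzero(dilated))  # 9
print(dilated)
```

With a dilation of 2, the 3x3 filter covers a 5x5 area while still using only 9 weights; a dilation of 1 gives back the ordinary filter.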
