<a href="https://colab.research.google.com/github/Rsych/Machine-learning-with-tensorflow/blob/main/3_Tensorflow_CNN/1_tf_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning with TensorFlow - Fundamentals of Convolutional Neural Networks

A convolutional neural network (CNN) is well suited to image classification because it exploits the spatial structure of its input. The architecture was inspired by the receptive fields found in the visual cortex.
![cnn](img/cnn.png)
A CNN is built on three main ideas:

* Local receptive field
* Shared weights and biases
* Pooling

## Local receptive field

If you want to preserve the spatial information in images or other grid-shaped data, it is convenient to represent the data as a matrix. To encode local structure, each hidden neuron in the next layer is connected to a small submatrix of neighboring input neurons. That submatrix is the hidden neuron's local receptive field. The operation is called convolution, and a neural network built from such layers is called a convolutional neural network. The submatrices may overlap so that more information is encoded. The sliding-window view in the "Mathematical explanation" section shows how this operates.
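To make the overlap concrete, here is a minimal sketch in plain Python. It assumes 3x3 receptive fields that move one pixel at a time with no padding; the helper name `receptive_field` is hypothetical and only serves this illustration.

```python
def receptive_field(row, col, size=3):
    """Return the input coordinates seen by the hidden neuron at (row, col).

    Assumes stride 1 and no padding, so the neuron looks at the size x size
    patch whose top-left corner is input pixel (row, col).
    """
    return {(row + dr, col + dc) for dr in range(size) for dc in range(size)}


# Two horizontally adjacent hidden neurons.
left = receptive_field(0, 0)
right = receptive_field(0, 1)

# Each neuron sees 9 inputs; the neighbors share the 6 inputs in columns 1 and 2.
shared = left & right
print(len(left), len(right), len(shared))  # 9 9 6
```

Adjacent hidden neurons therefore see mostly the same pixels, which is how the layer keeps track of local structure.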

## Shared weights and biases
Because the weights and biases are shared, every hidden neuron in a feature map uses the same weights and the same bias, so the layer learns to detect one feature at different parts of the image. In other words, all the neurons in the map detect exactly the same feature, just at different locations in the input. The shared weights that define a feature map are also called a kernel or filter. Image recognition requires more than one feature map, so a convolutional layer consists of several different feature maps.
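A quick way to see what sharing buys is to count parameters. The sketch below assumes illustrative numbers (a 28x28 single-channel input, 32 filters of size 3x3, a 26x26 output) and compares a shared-weight layer with a hypothetical layer that has the same receptive fields but separate weights at every output position. Both helper names are made up for this example.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Parameters of a convolution layer with shared weights."""
    # Each filter has one weight per kernel cell per input channel, plus one bias.
    return filters * (kernel_h * kernel_w * in_channels + 1)


def unshared_params(out_h, out_w, kernel_h, kernel_w, in_channels, filters):
    """Parameters if every output position had its own weights (hypothetical)."""
    return out_h * out_w * conv_params(kernel_h, kernel_w, in_channels, filters)


shared = conv_params(3, 3, 1, 32)
unshared = unshared_params(26, 26, 3, 3, 1, 32)
print(shared)    # 320
print(unshared)  # 216320
```

Under these assumptions, sharing cuts the parameter count by a factor of 676, the number of output positions.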

## Mathematical explanation

The best way to understand convolution is to see how a sliding-window function is applied to a matrix. In the next image, an input matrix I and a kernel K produce the convolved output. The 3x3 kernel K (also called a filter or feature detector) is placed over a 3x3 window of the input and multiplied with it element by element, and the products are summed to give one output cell.

![sliding](img/slidingMatrix.png)

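Here the window stops sliding when it hits the edge of I, so the output is 3x3. Alternatively, we can pad the input with zeros, in which case the output is 5x5, the same size as the input. Which of the two happens is determined by the padding setting (in Keras, `padding='valid'` and `padding='same'` respectively). The kernel depth is always the same as the input depth (the number of channels).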
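The sliding window and the two padding choices can be sketched in plain Python. This is a minimal illustration that assumes a single-channel input, a square kernel, and a stride of 1. The values of `I` and `K` are made up and are not taken from the figure. Like CNN libraries, the function does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, pad=0):
    """Slide `kernel` over `image` with stride 1 and `pad` rings of zeros."""
    k = len(kernel)
    n_rows, n_cols = len(image), len(image[0])
    # Surround the input with `pad` rows and columns of zeros.
    padded = [[0] * (n_cols + 2 * pad) for _ in range(n_rows + 2 * pad)]
    for r in range(n_rows):
        for c in range(n_cols):
            padded[r + pad][c + pad] = image[r][c]
    output = []
    for r in range(len(padded) - k + 1):
        row = []
        for c in range(len(padded[0]) - k + 1):
            # Multiply the kernel with the window element by element, then sum.
            row.append(sum(padded[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        output.append(row)
    return output


# Hypothetical 5x5 input and 3x3 kernel.
I = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
K = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

valid = conv2d(I, K)        # no padding: 3x3 output
same = conv2d(I, K, pad=1)  # one ring of zeros: 5x5 output
print(valid)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(len(same), len(same[0]))  # 5 5
```

The central 3x3 block of the padded result equals the unpadded result; the padding only adds the border cells.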

There is also a parameter called the stride, which sets how far the window moves at each step. A larger stride gives a smaller output; a smaller stride gives a larger output that retains more information.
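The effect of kernel size, padding, and stride on the output size along one dimension follows the standard formula `(n + 2p - k) // s + 1`. The helper below is a small sketch of that formula; the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of window positions along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(5, 3))             # 3: no padding
print(conv_output_size(5, 3, padding=1))  # 5: zero padding keeps the size
print(conv_output_size(5, 3, stride=2))   # 2: larger stride, smaller output
print(conv_output_size(28, 3))            # 26
```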

Filter size, stride, and padding are hyperparameters that can be tuned to optimize the neural network.

## How to start with CNN

To define a convolution layer with 32 parallel feature maps and a 3x3 filter size in TensorFlow 2, we start from this:

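```python
import tensorflow as tf
from keras import Sequential
from keras.layers import Conv2D

model = Sequential()
# 32 filters of size 3x3 applied to a 28x28 single-channel input.
model.add(Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)))

# Print the layers, their output shapes, and their parameter counts.
model.summary()
```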

```text
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 26, 26, 32)        320       
Total params: 320
Trainable params: 320
Non-trainable params: 0
_________________________________________________________________
```


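A single-channel 28x28 input image passed through a 3x3 convolution with 32 filters gives a 26x26 output with 32 channels. The layer has 320 parameters because each of the 32 filters has 3x3 weights plus one bias. This corresponds to the convolutions layer in Fig. 1.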

## Pooling

A pooling layer reduces the dimensions of the feature maps, which lowers the number of parameters and the amount of computation in the network. Pooling operates on each feature map independently. Two types of pooling are commonly used:

* Max pooling
* Average pooling

Max pooling is the most common choice: it simply outputs the maximum activation in each pooling window (illustrated below). In Keras, a 2x2 max pooling layer is added like this:
![maxpooling](img/maxpooling.png)

```python
from keras.layers import MaxPooling2D
model.add(MaxPooling2D((2,2)))
```

Average pooling outputs the average activation in the given area. Keras provides several other pooling layers, which are listed on the [Keras official page](https://keras.io/api/layers/pooling_layers/).
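The two pooling types can be sketched in plain Python. The example assumes non-overlapping 2x2 windows on a single feature map; the function `pool2d` and the values in `fmap` are hypothetical and only meant to show the arithmetic.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: the window and the stride are both `size`."""
    reduce = max if mode == "max" else (lambda w: sum(w) / len(w))
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the activations inside the current window.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(max_pooled)  # [[6, 5], [7, 9]]
print(avg_pooled)  # [[3.5, 2.0], [2.5, 6.0]]
```

Both variants halve the height and width of the map; they differ only in how each window is summarized.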

## Summary

We've looked at the basic concepts of a CNN. Convolution and pooling are applied differently depending on the dimensionality of the input:
* Audio and text data (`time`) - 1 dimension
* Image (`height*width`) - 2 dimensions
* Video (`height*width*time`) - 3 dimensions
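As a rough sketch of how these three cases map onto Keras, the snippet below applies `Conv1D`, `Conv2D`, and `Conv3D`, each followed by the matching max pooling layer, to zero-filled dummy inputs. The input sizes, the 8 filters, and the channels-last layout with a single channel are assumptions chosen for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Dummy inputs: batch of 1, channels last, one channel each.
audio = tf.zeros((1, 100, 1))         # (batch, time, channels)
image = tf.zeros((1, 28, 28, 1))      # (batch, height, width, channels)
video = tf.zeros((1, 28, 28, 10, 1))  # (batch, height, width, time, channels)

# Each convolution uses 8 filters of size 3 along every spatial/time axis,
# followed by pooling with a window of 2 along the same axes.
y1 = layers.MaxPooling1D(2)(layers.Conv1D(8, 3)(audio))
y2 = layers.MaxPooling2D((2, 2))(layers.Conv2D(8, (3, 3))(image))
y3 = layers.MaxPooling3D((2, 2, 2))(layers.Conv3D(8, (3, 3, 3))(video))

print(y1.shape)  # (1, 49, 8)
print(y2.shape)  # (1, 13, 13, 8)
print(y3.shape)  # (1, 13, 13, 4, 8)
```

In each case the convolution shrinks every convolved axis by 2 (kernel size 3, no padding) and the pooling then halves it.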