# Task proposal


More in-depth content of concepts and calculations applied in convolutional neural networks and how they learn.


Course: [Deep Learning: Convolutional Neural Networks in Python](https://www.udemy.com/course/deep-learning-convolutional-neural-networks-theano-tensorflow/?couponCode=ST7MT41824)

Sections: Convolutional Neural Networks; Natural Language Processing (NLP); In-Depth Convolution; Convolutional Neural Network Description; Practical Tips; In-Depth: Loss Functions; In-Depth: Gradient Descent. (4 hours and 48 minutes)

# Convolutional Neural Networks


## What is convolution?


The convolution operation is the act of transforming an input image into a different image, combining it with a filter (kernel). In other words, there is a feature transformation of the image based on a filter.
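
As a minimal sketch of the operation, the following plain-Python function slides a filter over a small image with no padding and a stride of 1. The function name, the 4x4 image, and the 2x2 vertical-edge filter are illustrative assumptions, not taken from the course. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the window element-wise with the kernel and sum the result.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```

The feature map is largest in the column where the image changes from dark to bright, which is the pattern this filter responds to.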


## Filters


Within the context of convolution, one can think of each filter serving the purpose of finding a pattern. Therefore, using multiple filters, different patterns can be found in an image.


## Architecture of a CNN


A CNN typically has two stages, the first with convolutional layers, and the second with dense layers.


### Convolutional layer


In a way, this layer reduces the resolution of the original image while transforming it. The kernel is first applied to the image, and pooling is then applied to the result. In max pooling, only the largest value in each small window is kept and the rest are discarded.
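
The pooling step can be sketched as follows, assuming max pooling over non-overlapping 2x2 windows. The function name and the sample values are illustrative assumptions.

```python
def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 0, 0],
    [2, 9, 0, 1],
    [0, 0, 5, 4],
    [0, 7, 6, 8],
]
pooled = max_pool2d(feature_map)
print(pooled)
```

A 4x4 map becomes a 2x2 map, so every later layer has a quarter of the values to process.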


This transformation reduces the amount of processing required, since the image becomes smaller. It works in computer vision because neighboring pixels are strongly correlated, so a downsampled image still carries most of the relevant information.


Since the image size keeps decreasing, after the final pooling layer only a small image with a few values remains. These values are present because a feature was detected and kept at each stage; in a way the feature "survived" until the end, signaling that it occurred in the original image.


Therefore, the same object can be anywhere in the image: once it passes through the convolution and pooling stages, its characteristic response is still preserved at the end, because that is what the network learned to do.


In this way, the neural network gains translational invariance, where regardless of the position of the object, the network will be able to identify it.
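
A small 1D sketch makes the idea concrete. It assumes a hypothetical filter `[1, 2, 1]` and two signals that contain the same pattern at different positions. The strongest response is identical in both cases; only its location changes.

```python
def correlate1d(signal, kernel):
    """Apply the kernel at every position of the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + a] * kernel[a] for a in range(k))
        for i in range(len(signal) - k + 1)
    ]


pattern_filter = [1, 2, 1]
pattern_on_left = [1, 2, 1, 0, 0, 0, 0, 0]
pattern_on_right = [0, 0, 0, 0, 0, 1, 2, 1]

left_response = correlate1d(pattern_on_left, pattern_filter)
right_response = correlate1d(pattern_on_right, pattern_filter)

# Taking the maximum over all positions discards where the pattern was found.
print(max(left_response), max(right_response))
```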


"The network doesn't care where in the image a feature happened, but rather whether it happened or not"


Although the image decreases in size, the filters stay the same size. In the later stages each filter therefore covers a larger portion of the original image and looks for larger patterns. That is why a CNN learns the features of the original image hierarchically.


"First learns small patterns compared to the total size of the image, and then moves on to learning larger patterns"


Although part of the spatial information is lost as the image shrinks, the number of feature maps increases. Once again, the concern is whether a feature is detected, not where it occurs.


#### Stride


Filters are applied to an image from left to right and from top to bottom. Stride is the number of pixels the filter moves between applications: with a stride greater than 1, the filter "skips" positions, which produces a smaller output and speeds up the process.
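
The effect of stride on one axis can be sketched by listing the positions at which the filter is applied. The helper name and the sizes (input of 7, filter of 3) are illustrative assumptions. Without padding, the number of positions is `(N - K) // S + 1` for input size `N`, filter size `K`, and stride `S`.

```python
def filter_positions(input_size, kernel_size, stride):
    """Start indices where the filter is applied along one axis (no padding)."""
    return list(range(0, input_size - kernel_size + 1, stride))


stride_1 = filter_positions(input_size=7, kernel_size=3, stride=1)
stride_2 = filter_positions(input_size=7, kernel_size=3, stride=2)
print(stride_1)
print(stride_2)
```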


## Code


When using convolutional layers, it is important to note that there are 1D, 2D and 3D convolutions. For example, 1D convolution is used for audio processing, 2D for analyzing an image, and 3D for analyzing a video.


## Keras


When using Keras, each layer is created as an object that is then called like a function on the output of the previous layer. This makes it possible to reuse the same variable for the output of every layer, so machine learning code commonly uses a single variable to keep the code readable.
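
The pattern can be illustrated without Keras itself. The `Scale` and `Shift` classes below are hypothetical stand-ins for layers (they are not Keras classes): each is constructed with its settings and then called on the previous result, which is the same shape as a Keras line such as `x = Dense(10)(x)`.

```python
class Scale:
    """Hypothetical stand-in for a layer: configured first, then called on data."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, values):
        return [v * self.factor for v in values]


class Shift:
    """Second hypothetical stand-in layer."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, values):
        return [v + self.offset for v in values]


inputs = [1, 2, 3]

# The same variable `x` carries the output of each layer into the next one.
x = inputs
x = Scale(2)(x)
x = Shift(1)(x)
x = Scale(10)(x)
print(x)
```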


## Data augmentation


Although convolution and pooling already give the network some translational invariance, another way to improve the model in this respect is to create new data from existing data.


This data is created randomly during training: each augmented sample is generated on the fly and discarded once the next one is needed. This saves storage space, and the randomness also helps the model generalize.
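
A minimal sketch of on-the-fly augmentation, assuming only two transformations (a random horizontal flip and a random one-pixel horizontal shift). The function name and the tiny 2x3 image are illustrative assumptions.

```python
import random


def random_augment(image, rng):
    """Return a randomly flipped and shifted copy; the original stays untouched."""
    augmented = [row[:] for row in image]
    if rng.random() < 0.5:
        augmented = [row[::-1] for row in augmented]  # horizontal flip
    shift = rng.choice([-1, 0, 1])
    if shift == 1:
        augmented = [[0] + row[:-1] for row in augmented]  # shift right, fill with 0
    elif shift == -1:
        augmented = [row[1:] + [0] for row in augmented]  # shift left, fill with 0
    return augmented


rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6]]

# Each training step asks for a fresh variant; nothing is stored on disk.
variants = [random_augment(image, rng) for _ in range(4)]
for variant in variants:
    print(variant)
```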


## Normalization


As arithmetic operations are performed through the layers of the neural network, the intermediate values may become too large. A normalization layer called batch norm addresses this by normalizing the values with the mean and standard deviation of each batch. It also has a regularizing effect on the network, which helps reduce overfitting.
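
The core of the normalization can be sketched for a single feature across one batch. This is a simplified assumption: a real batch norm layer also learns a scale and a shift that are applied afterwards, which are omitted here. The function name and values are illustrative.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize one feature across a batch to zero mean and roughly unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    # eps avoids a division by zero when all values in the batch are equal.
    return [(v - mean) / math.sqrt(variance + eps) for v in batch]


activations = [10.0, 20.0, 30.0, 40.0]
normalized = batch_normalize(activations)
print(normalized)
```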


# Natural Language Processing (NLP)


When dealing with text, each word can be seen as a category, so the input to an NLP model is categorical data. However, given words in the input data, how can the computer extract information from them?


One-hot encoding appears to be a possible solution, but because of the number of words in a vocabulary, the resulting vectors become too large and sparse for this approach to be practical.


A better solution is the embedding layer. Each word is first mapped to an integer, and that integer selects a row of a weight matrix (equivalent to multiplying the word's one-hot vector by the matrix). The resulting vector represents the word in a multi-dimensional space.
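
The relationship between the matrix multiplication and a simple row lookup can be sketched as follows. The 4-word vocabulary, the 2-dimensional embedding, and the matrix values are made up for illustration; in a real model the matrix is learned.

```python
# Hypothetical embedding matrix: one row per word id, two dimensions per word.
embedding_matrix = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.5, 0.9],
    [0.4, 0.4],
]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def row_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix."""
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(len(matrix[0]))
    ]


word_id = 2
via_multiplication = row_times_matrix(one_hot(word_id, 4), embedding_matrix)
via_lookup = embedding_matrix[word_id]
print(via_multiplication, via_lookup)
```

Both routes give the same vector, which is why an embedding layer can index the matrix directly instead of building large one-hot vectors.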


The process of transforming a sentence into a list of integers is called tokenization.
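
A minimal sketch of this step, assuming words are separated by whitespace and numbered in order of first appearance, starting at 1 so that 0 stays free. The function names and sentences are illustrative assumptions, and a library tokenizer may assign the integers differently.

```python
def build_vocabulary(sentences):
    """Give each distinct word an integer id, starting at 1."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def tokenize(sentence, vocabulary):
    """Turn a sentence into a list of integer ids (the words must be in the vocabulary)."""
    return [vocabulary[word] for word in sentence.lower().split()]


sentences = [
    "I like eggs and ham",
    "I love chocolate and bunnies",
    "I hate onions",
]
vocabulary = build_vocabulary(sentences)
sequences = [tokenize(s, vocabulary) for s in sentences]
print(sequences)
```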


Therefore, neural networks can now interpret meaning in words. In addition, the matrix used in the multiplication must itself be learned; in code, however, this happens automatically during training.


However, different sentences contain different numbers of words, so they cannot be passed to TensorFlow directly as a single input of fixed size. Therefore, the number zero is reserved for filling sentences that do not reach the maximum number of words, so that all sentences have the same length. This process is called padding.


If very long sentences need to be shortened, this can be done by cutting off the beginning or the end of the vector, a process known as truncating.


Padding can likewise be applied at either end of the vector, depending on the problem.
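
Padding and truncating can be sketched together in one helper. The function below is a plain-Python illustration, not the library implementation; its `padding` and `truncating` options are named after the arguments of the Keras `pad_sequences` utility, where `"pre"` acts on the beginning and `"post"` on the end.

```python
def pad_sequence(sequence, max_len, padding="pre", truncating="pre"):
    """Return a copy of `sequence` with exactly `max_len` items (max_len >= 1)."""
    sequence = list(sequence)
    if len(sequence) > max_len:
        # Drop items from the beginning ("pre") or from the end ("post").
        sequence = sequence[-max_len:] if truncating == "pre" else sequence[:max_len]
    zeros = [0] * (max_len - len(sequence))
    return zeros + sequence if padding == "pre" else sequence + zeros


short = [1, 9, 10]
long = [1, 2, 3, 4, 5, 6]

print(pad_sequence(short, 5))                    # zeros added at the beginning
print(pad_sequence(short, 5, padding="post"))    # zeros added at the end
print(pad_sequence(long, 4))                     # beginning cut off
print(pad_sequence(long, 4, truncating="post"))  # end cut off
```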


## CNNs for texts


Just as in images, where the information in a pixel is related to the pixels around it, a word in a text is related to the previous and following words. This suggests that a CNN is a good model for this case: after the text preprocessing described above, 1D convolution and pooling are applied.
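
As a rough sketch, assume a sentence of T = 4 words that has already been embedded into D = 2 dimensions, and a filter that spans K = 2 consecutive words. The filter slides along the word axis only and covers the full embedding dimension at each step. All names and values are illustrative assumptions.

```python
def conv1d_over_words(embedded, kernel):
    """Slide a K x D kernel along a T x D embedded sentence (no padding, stride 1)."""
    k, d = len(kernel), len(kernel[0])
    return [
        sum(embedded[t + a][j] * kernel[a][j] for a in range(k) for j in range(d))
        for t in range(len(embedded) - k + 1)
    ]


embedded_sentence = [  # T = 4 words, D = 2 dimensions each
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
]
kernel = [  # K = 2 words wide
    [1, 0],
    [0, 1],
]
responses = conv1d_over_words(embedded_sentence, kernel)
strongest = max(responses)  # global max pooling over the sentence
print(responses, strongest)
```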


# In-Depth Convolution


Convolution is not a process applied only in machine learning models. In fact, it is found everywhere in real life; a few examples:


1. Music: effects such as echo, reverb and others.
2. Image: Gaussian blur, edge detection, among others.


Then the course explains convolution again in a visual way.


# Convolutional Neural Network Description


A grayscale 2D image has dimensions H x W x 1, with H = height, W = width, and 1 = a single color channel (a black and white photo, for example). For images with three color channels, the input goes from being 2D to 3D.


The convolution in 3D is carried out in the same way as in 2D: using a filter, pixels are combined and a new image is generated.


The filters are formatted as C x K x K, where K = size of filter rows and columns and C = color channels.


Both filters and generated images can be stored by stacking them, thus bringing additional dimensionality.
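
As a sketch of the shapes involved, assume a stride of 1 and no padding, and write the input as $x$ and one filter as $w$, both indexed channel first to match the C x K x K filter layout (the symbol names are chosen here for illustration). Each output pixel sums over the filter window and over all channels:

$$y_{i,j} = \sum_{c=1}^{C} \sum_{a=1}^{K} \sum_{b=1}^{K} w_{c,a,b} \, x_{c,\,i+a-1,\,j+b-1}$$

One filter therefore yields a single 2D map of size $(H-K+1) \times (W-K+1)$. Stacking $F$ filters gives a weight tensor of shape $F \times C \times K \times K$ and an output with $F$ feature maps.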


## Convolution modes


### Valid


The filter will never go beyond the input.


### Full


Padding is added around the input so that the filter is applied at every position where it overlaps the input + padding by at least one value. The output is therefore larger than the input.


### Same


Padding is chosen so that the output is the same size as the input.
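
The three modes can be compared by their output size along one axis. The sketch below assumes a stride of 1, an input of size N and a filter of size K; the function name is an illustrative assumption.

```python
def output_size(input_size, kernel_size, mode):
    """Output length along one axis for a stride-1 convolution."""
    if mode == "valid":
        return input_size - kernel_size + 1  # filter stays inside the input
    if mode == "same":
        return input_size                    # padded to keep the size
    if mode == "full":
        return input_size + kernel_size - 1  # every partial overlap counts
    raise ValueError(f"unknown mode: {mode}")


for mode in ("valid", "same", "full"):
    print(mode, output_size(input_size=5, kernel_size=3, mode=mode))
```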


# Practical Tips


Well-known architectures are built from the basic concepts behind a CNN. For example, by combining convolutional, pooling, and dense layers in a certain way, it is possible to reproduce existing standardized models, such as LeNet.


Since each problem requires its own solution, choosing a CNN architecture is not an exact science; it requires experience and consulting existing work when building one for a specific objective.


# In-Depth: Loss Functions


## Mean Squared Error


When it comes to errors, it does not make sense to keep errors represented by negative numbers: when calculating the total error, positive and negative errors cancel each other out, so the total appears smaller than it really is, or even smaller than zero.


Squaring the errors not only removes the sign, but also emphasizes larger discrepancies, since higher values contribute disproportionately to the final result.


The next step is to calculate the average of these squared errors.
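
The steps above can be sketched with two made-up predictions whose signed errors cancel exactly, while the mean squared error does not. The function name and the values are illustrative assumptions.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared) / len(squared)


targets = [3.0, 5.0]
predictions = [1.0, 7.0]  # one prediction too low by 2, one too high by 2

signed_total = sum(t - p for t, p in zip(targets, predictions))
mse = mean_squared_error(targets, predictions)
print(signed_total, mse)
```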


## Maximum Likelihood Estimation (MLE)


The objective of MLE is to find the parameter values that maximize the likelihood function. This is equivalent to finding the parameters that make the observed data most likely under the model.
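
Written out under the common assumption that the $N$ observations $x_1, \dots, x_N$ are independent and identically distributed (the symbols are introduced here for illustration), the estimate is

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{N} p(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

The logarithm is monotonic, so it does not move the maximum, but it turns the product into a sum that is easier to differentiate. Minimizing the negative of this sum is what the cross-entropy losses do, and with a Gaussian model of fixed variance it reduces to minimizing the squared error.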


## Binary Cross-Entropy


Binary cross-entropy is effective for training binary classification models, especially when a probabilistic interpretation of the model outputs is valuable. The goal during training is to minimize the average of the individual losses over a data set by adjusting the model parameters so that predictions become more accurate.
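
A minimal sketch of the loss, assuming targets of 0 or 1 and predicted probabilities for class 1. The clipping constant `eps` is an added assumption that keeps the logarithm finite; the function name and sample values are illustrative.

```python
import math


def binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)


targets = [1, 0]
good_loss = binary_cross_entropy(targets, [0.9, 0.1])  # confident and right
bad_loss = binary_cross_entropy(targets, [0.1, 0.9])   # confident and wrong
print(good_loss, bad_loss)
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded.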


## Categorical Cross-Entropy


Categorical cross-entropy is used in multiclass classification problems. It seeks to minimize the disparity between the model's predicted probabilities and the real labels, encouraging convergence towards the ideal probability distribution.
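
A minimal sketch, assuming one-hot targets and predicted probability distributions over three classes. With a one-hot target, only the probability assigned to the true class contributes to the loss. The function name, `eps`, and the sample values are illustrative assumptions.

```python
import math


def categorical_cross_entropy(one_hot_targets, predictions, eps=1e-12):
    """Average of -sum_k y_k * log(p_k) over all samples."""
    total = 0.0
    for target, predicted in zip(one_hot_targets, predictions):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))
    return total / len(one_hot_targets)


targets = [[0, 1, 0]]
close_loss = categorical_cross_entropy(targets, [[0.2, 0.7, 0.1]])
far_loss = categorical_cross_entropy(targets, [[0.7, 0.2, 0.1]])
print(close_loss, far_loss)
```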


# In-Depth: Gradient Descent


## Gradient descent


Using a cost function, we seek to find its minimum point in order to have the smallest possible error. This is possible through calculus concepts and derivatives.


Since in a machine learning problem the minimum of the cost function usually cannot be found with a closed-form formula, it is approximated by gradient descent. The gradient gives the slope of the cost function with respect to each parameter (weight), and the model improves by repeatedly adjusting the parameters in the direction in which the cost function descends.
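
The update rule can be sketched on a toy cost function with a known minimum. The cost J(w) = (w - 3)^2, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


def cost_gradient(w):
    # Derivative of the toy cost J(w) = (w - 3)**2, whose minimum is at w = 3.
    return 2 * (w - 3)


w_min = gradient_descent(cost_gradient, start=0.0)
print(w_min)
```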


## Stochastic gradient descent


Instead of computing the cost over the entire data set before each parameter update, the average cost of a batch of N predictions is calculated and used to change the model. This makes each step apparently less "precise", but the updates are much cheaper, so the model converges to the minimum faster.
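
A minimal sketch of the idea, assuming a one-parameter model y = w * x, a squared-error cost, and noise-free toy data generated with w = 2. The gradient is averaged over a batch of N = 2 samples before each update. All names and hyperparameters are illustrative assumptions.

```python
import random


def sgd_fit(xs, ys, batch_size=2, learning_rate=0.01, epochs=200, seed=0):
    """Fit y = w * x with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x - y)**2 over the batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = sgd_fit(xs, ys)
print(w)
```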


## Momentum


Unlike standard Gradient Descent, which simply adjusts parameters in the opposite direction to the instantaneous gradient, Momentum takes the previous direction into account by adding a fraction of the previous update vector to the current vector. This makes model learning faster.
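
The update can be sketched on a toy cost J(w) = (w - 3)^2. The variable name `velocity` for the previous update vector, the fraction `mu = 0.9`, and the other hyperparameters are assumptions chosen for illustration.

```python
def momentum_descent(gradient, start, learning_rate=0.1, mu=0.9, steps=300):
    """Gradient descent where each update keeps a fraction `mu` of the previous one."""
    w, velocity = start, 0.0
    for _ in range(steps):
        velocity = mu * velocity - learning_rate * gradient(w)
        w = w + velocity
    return w


def cost_gradient(w):
    # Derivative of the toy cost J(w) = (w - 3)**2.
    return 2 * (w - 3)


w_min = momentum_descent(cost_gradient, start=0.0)
print(w_min)
```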


## Variable learning rate


The model's learning rate changes based on the situation. For example, the closer the model is to the minimum, the smaller the gradient step will be.


### AdaGrad


Another example is AdaGrad, which keeps a value called the cache for each parameter. The cache accumulates the squared gradients seen so far for that parameter, and the current update is scaled down according to it, so parameters that have already received large updates take smaller steps.
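
A single AdaGrad step can be sketched as follows, assuming a toy cost J(w) = 100 * w0^2 + w1^2 whose two gradients differ by a factor of 100. The function names and hyperparameters are illustrative assumptions.

```python
import math


def adagrad_step(params, grads, cache, learning_rate=0.5, eps=1e-8):
    """Update params in place; cache accumulates squared gradients per parameter."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= learning_rate * g / (math.sqrt(cache[i]) + eps)


def cost_gradient(params):
    # Gradient of the toy cost J(w) = 100 * w0**2 + w1**2.
    return [200 * params[0], 2 * params[1]]


params = [1.0, 1.0]
cache = [0.0, 0.0]
adagrad_step(params, cost_gradient(params), cache)
print(params, cache)
```

Although the first gradient is 100 times larger than the second, both parameters move by about the same amount, because each step is divided by that parameter's own accumulated gradient magnitude.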


### RMSProp


The RMSProp algorithm introduces a decay term so that the accumulated squared gradients do not keep growing without bound as they do in AdaGrad, where the effective learning rate eventually shrinks towards zero.
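
The difference between the two caches can be sketched by feeding the same gradient 1000 times. The step function, the decay of 0.9, and the constant gradient of 1.0 are assumptions chosen for illustration.

```python
import math


def rmsprop_step(params, grads, cache, learning_rate=0.01, decay=0.9, eps=1e-8):
    """Update params in place; cache is a decaying average of squared gradients."""
    for i, g in enumerate(grads):
        cache[i] = decay * cache[i] + (1 - decay) * g * g
        params[i] -= learning_rate * g / (math.sqrt(cache[i]) + eps)


adagrad_cache = 0.0
params, rmsprop_cache = [0.0], [0.0]
for _ in range(1000):
    gradient = 1.0
    adagrad_cache += gradient * gradient        # AdaGrad: plain running sum
    rmsprop_step(params, [gradient], rmsprop_cache)

print(adagrad_cache, rmsprop_cache[0])
```

The AdaGrad sum keeps growing with every step, while the RMSProp cache settles near the squared gradient itself, so the step size does not vanish.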


### Adam


Adam combines RMSProp's adaptive per-parameter learning rate with momentum, keeping running estimates of both the first moment (mean) and the second moment (squared magnitude) of the gradient for each parameter. This makes it widely used and efficient in a variety of scenarios.
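
A minimal single-parameter sketch of the update, using the commonly cited default decay rates (0.9 and 0.999) and including the bias correction of both moment estimates. The toy cost J(w) = (w - 3)^2, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import math


def adam(gradient, start, learning_rate=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=500):
    """Minimize a one-parameter cost with the Adam update rule."""
    w, m, v = start, 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSProp-like)
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


def cost_gradient(w):
    # Derivative of the toy cost J(w) = (w - 3)**2.
    return 2 * (w - 3)


w_min = adam(cost_gradient, start=0.0)
print(w_min)
```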