## What is Deep Learning?

Deep Learning is a subfield of Machine Learning, which is itself a subfield of AI. It's called "deep" because it uses Neural Networks with multiple "hidden layers", which give a certain "depth" to the model.

![image](https://t44dz3y7fq02vi4w64ej6i5t-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/deep-learning.png)

## What are common components of a Neural Network?

**Activation functions**

Examples of activation functions:

*Sigmoid*: `1/(1+e^(-x))`, output between 0 and 1. Often used for transforming scores into probabilities.

![image](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png)



*ReLU*: `max(0, x)`, output between 0 and +inf. Used almost always between hidden layers (not as the output activation).

*Tanh*: output between -1 and 1. Good because it's zero-centered. Problem: like sigmoid, it saturates quickly for inputs far from zero. Commonly used in LSTMs.

![image](https://www.oreilly.com/library/view/machine-learning-with/9781789346565/assets/c9014c8e-7d06-4a12-9390-4d17f9379eb9.png)

An activation function defines how much a neuron's input contributes to the next layer, acting like a conditional threshold.

It's what allows the neural network to approximate non-linear functions. In fact, an activation function is also called a non-linearity.
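As a quick illustration, the three activations listed above can be evaluated on a small tensor. This is only a sketch: the input values are arbitrary and chosen to show the output ranges.

```python
import torch

# Arbitrary sample inputs: one negative, zero, one positive
x = torch.tensor([-2.0, 0.0, 2.0])

sigmoid_out = torch.sigmoid(x)  # 1 / (1 + e^(-x)), values in (0, 1)
relu_out = torch.relu(x)        # max(0, x), values in [0, +inf)
tanh_out = torch.tanh(x)        # values in (-1, 1), zero-centered

print(sigmoid_out, relu_out, tanh_out)
```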

**Softmax**

A function that transforms an array of scores (the length of the array is the number of classes you want to predict) into probabilities.

If our `criterion` (loss function) is `nn.NLLLoss()`, we can use `nn.LogSoftmax()` as the output layer. It computes the logarithm of the probability of each class.

With `nn.LogSoftmax()` the network outputs log-probabilities (`logps`); to see the probability of each class, just compute `torch.exp(logps)`.
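A minimal sketch of that pipeline, assuming a made-up batch of 4 samples and 3 classes (the scores and targets below are hypothetical, not from a real model):

```python
import torch
from torch import nn

torch.manual_seed(0)
scores = torch.randn(4, 3)            # hypothetical raw scores: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # hypothetical class labels

log_softmax = nn.LogSoftmax(dim=1)    # normalize over the class dimension
criterion = nn.NLLLoss()

logps = log_softmax(scores)           # log-probabilities
loss = criterion(logps, targets)      # NLLLoss expects log-probabilities
ps = torch.exp(logps)                 # probabilities; each row sums to 1
```

Note the `dim=1` argument: the softmax has to be taken across classes, not across the batch.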


**Dropout**

Used as a regularization method to prevent overfitting. It randomly turns off some neurons during training and turns them all back on during validation and inference. Dropping neurons means, in principle, less computation (fewer weights to multiply), although standard implementations such as `nn.Dropout` only mask activations, so a training speed-up is not guaranteed.
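The difference between training and inference can be seen directly on `nn.Dropout`. A small sketch, with an arbitrary drop probability of 0.5 and a tensor of ones as input:

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)  # p is the probability of zeroing an element
x = torch.ones(1000)

dropout.train()              # training mode: elements are randomly zeroed
train_out = dropout(x)       # survivors are scaled by 1 / (1 - p) = 2

dropout.eval()               # evaluation mode: dropout is a no-op
eval_out = dropout(x)
```

This is why calling `model.train()` and `model.eval()` at the right moments matters when a model contains dropout layers.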

**Batch Normalization**

Instead of just normalizing the inputs to the network, we normalize the inputs to layers within the network.

>It's called batch normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current batch.

It makes the training faster and more stable through normalization of the layers' inputs by re-centering and re-scaling.

Why it works is still debated.

At initialization, it can make the gradients "explode" in deep networks; *skip connections* (present in ResNet, for example) help to mitigate this.

In code:

```python
nn.BatchNorm1d(hidden_dim)
```

`hidden_dim` must equal the number of features coming out of the previous layer (which is also the input size of the next layer). So if it comes right after a `Linear(input_dim, output_dim)`, then `hidden_dim=output_dim`.
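A sketch of how the dimensions line up, assuming hypothetical sizes `input_dim=10` and `output_dim=32` and a random batch of 64 samples:

```python
import torch
from torch import nn

input_dim, output_dim = 10, 32       # hypothetical layer sizes
block = nn.Sequential(
    nn.Linear(input_dim, output_dim),
    nn.BatchNorm1d(output_dim),      # num_features matches the Linear output
    nn.ReLU(),
)

torch.manual_seed(0)
x = torch.randn(64, input_dim)       # batch of 64 samples
block.train()                        # use the statistics of the current batch
normalized = block[1](block[0](x))   # output of BatchNorm, before the ReLU

# In training mode each feature is re-centered and re-scaled over the batch
per_feature_mean = normalized.mean(dim=0)
per_feature_var = normalized.var(dim=0, unbiased=False)
```

With the default initialization of the layer, `per_feature_mean` is close to 0 and `per_feature_var` close to 1.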




**Loss Function**

The Loss function computes the error of our prediction.

- Mean-Squared Error --> Used in Regression, `nn.MSELoss()`
- Negative Log-Likelihood Loss --> Classification, in conjunction with `nn.LogSoftmax()`, in PyTorch `nn.NLLLoss()`
- Cross Entropy Loss --> Classification. In PyTorch there's no need to add a softmax as output of the NN, because `nn.CrossEntropyLoss()` applies the log-softmax internally and expects raw scores.
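The relation between the two classification losses can be checked numerically. The sketch below uses made-up scores and targets, and also evaluates `nn.MSELoss()` on three hypothetical regression predictions:

```python
import torch
from torch import nn

torch.manual_seed(0)
scores = torch.randn(5, 3)               # hypothetical raw scores, 3 classes
targets = torch.tensor([0, 1, 2, 1, 0])  # hypothetical labels

# CrossEntropyLoss on raw scores ...
ce = nn.CrossEntropyLoss()(scores, targets)
# ... gives the same value as LogSoftmax followed by NLLLoss
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), targets)

# Regression: mean of the squared differences
preds = torch.tensor([2.5, 0.0, 1.0])
truth = torch.tensor([3.0, 0.0, 2.0])
mse = nn.MSELoss()(preds, truth)         # (0.25 + 0 + 1) / 3
```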

**Optimizer**

It's an algorithm used to update the weights of our model in order to reduce the error given by the loss function. Some examples are:

- Stochastic Gradient Descent (SGD): updates the weights after every batch. In PyTorch: `optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)`

- Adam: adapts the step size of each parameter and uses momentum in the updates.
In PyTorch: `optimizer = optim.Adam(model.parameters(), lr=0.0001)`.

Others are listed in the `torch.optim` PyTorch documentation.
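To make the update concrete, here is a sketch of a single optimizer step on a toy linear model. Momentum is left out on purpose (an assumption made for this example) so that the update reduces to `weight - lr * gradient` and can be checked by hand; the data is random.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Linear(3, 1)                    # toy model
criterion = nn.MSELoss()
lr = 0.01
optimizer = optim.SGD(model.parameters(), lr=lr)  # plain SGD, no momentum

x = torch.randn(8, 3)                      # random batch of inputs
y = torch.randn(8, 1)                      # random targets

w_before = model.weight.detach().clone()

optimizer.zero_grad()                      # clear gradients from previous steps
loss = criterion(model(x), y)
loss.backward()                            # compute gradients
grad = model.weight.grad.clone()
optimizer.step()                           # update the weights

expected = w_before - lr * grad            # what plain SGD should produce
```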

**Learning Rate**

A hyperparameter, specified in the optimizer, that defines how much we update the weights at each iteration.

An alternative definition: it is the step size of the gradient descent process.
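The effect of the step size can be seen on a toy problem. The sketch below minimizes `f(w) = w^2` (chosen only for illustration, its minimum is at `w = 0`) with three arbitrary learning rates:

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # step against the gradient, scaled by lr
    return w

too_small = gradient_descent(lr=0.01)  # moves towards 0, but slowly
reasonable = gradient_descent(lr=0.3)  # ends very close to 0
too_large = gradient_descent(lr=1.1)   # overshoots at every step and diverges
```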

**Some other definitions**:

- *Logits*: the raw, unnormalized output of a layer (typically the last one) before applying the activation function (a sigmoid or a softmax)

- *Output probabilities*: the probability of each class in the prediction of the NN. For binary classification problems, it's either a vector of two components (one-hot encoding style, the output of a 2-way softmax) or the output of a sigmoid (a single value). For multiclass classification problems with N classes, it is the output of an N-way softmax.

- *Backpropagation*: the algorithm used by neural networks to automatically compute the gradient of the loss function with respect to the weights
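Two of these definitions can be illustrated with a few lines of PyTorch. This is only a sketch with hypothetical numbers: the first part shows that a sigmoid on one logit matches a 2-way softmax over `[0, logit]`, the second compares the gradient from backpropagation with one derived by hand.

```python
import torch

# Binary classification: one logit with a sigmoid, or two logits with a softmax
z = torch.tensor(0.7)                      # hypothetical logit
p_sigmoid = torch.sigmoid(z)
two_logits = torch.stack([torch.tensor(0.0), z])
p_softmax = torch.softmax(two_logits, dim=0)[1]

# Backpropagation: autograd computes d(loss)/d(w) for us
w = torch.tensor(2.0, requires_grad=True)  # a single weight
x, y = torch.tensor(3.0), torch.tensor(5.0)
loss = (w * x - y) ** 2
loss.backward()                            # fills w.grad

# Derivative by hand: 2 * (w*x - y) * x = 2 * (6 - 5) * 3 = 6
manual_grad = 2 * (w.detach() * x - y) * x
```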



**How a Neural Network learns**

- A neural network learns from the data, which is the input of the network.
- It computes a prediction as output (often a probability, or a continuous value for regression).
- The model consists of weights that are used to make the prediction.
- In a supervised learning approach, a "loss function" is used to compute the error of each prediction.
- To minimize the loss function, its gradient with respect to the weights of the model is computed using backpropagation.
- The weights are updated in the direction opposite to the gradient (the direction of steepest descent), with a step size given by the learning rate.
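The steps above can be sketched as a minimal training loop. The data is synthetic (points on the line `y = 2x + 1`, an assumption made so that the learned weights can be checked) and the model is a single linear layer:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Synthetic data: the network should recover weight 2 and bias 1
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)                  # the weights used for the prediction
criterion = nn.MSELoss()                 # loss function (regression)
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    pred = model(x)                      # compute a prediction
    loss = criterion(pred, y)            # error of the prediction
    optimizer.zero_grad()
    loss.backward()                      # gradient via backpropagation
    optimizer.step()                     # move the weights against the gradient
    losses.append(loss.item())

learned_weight = model.weight.item()
learned_bias = model.bias.item()
```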



## What are the main categories of Deep Learning architectures?

### Multilayer Perceptron or Fully Connected layers

![image](https://www.researchgate.net/profile/Michael-Frish/publication/241347660/figure/fig3/AS:298690993508361@1448224890429/The-structure-of-a-multilayer-perceptron-neural-network.png)

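The basic component is the Perceptron, i.e. a single Neuron: it computes a weighted sum of its inputs plus a bias and passes the result through an activation function.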
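A possible PyTorch sketch of a small MLP. The layer sizes (784 inputs, two hidden layers, 10 outputs) are arbitrary choices for illustration:

```python
import torch
from torch import nn

# Fully connected layers with a non-linearity between them
mlp = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),  # one raw score per class
)

x = torch.randn(16, 784)  # random batch of 16 flattened inputs
out = mlp(x)

# Every Linear layer holds in_features * out_features weights plus out_features biases
n_params = sum(p.numel() for p in mlp.parameters())
```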

### Convolutional Neural Networks

![image](https://miro.medium.com/max/3288/1*uAeANQIOQPqWZnnuH-VEyw.jpeg)



### When do we use Convolutional Neural Networks?

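With *grid-like* data, such as images and videos.

Mainly when the absolute position of a feature is not relevant for the target prediction.

**Why?** Because the convolution operation is *translation equivariant* (a shifted input gives a correspondingly shifted feature map), and combined with pooling this makes the network approximately *translation invariant*.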
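How a convolution treats shifts can be checked numerically. In the sketch below, shifting the input and then convolving gives the same result as convolving and then shifting (translation equivariance), and a global max pooling on top no longer depends on the position at all. Circular padding and `torch.roll` are assumptions made here so that the shift is exact at the borders; with the usual zero padding the property only holds away from the edges.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)
x = torch.randn(1, 1, 8, 8)  # random single-channel 8x8 "image"
shift = 2                    # shift by 2 pixels along the width

with torch.no_grad():
    shift_then_conv = conv(torch.roll(x, shifts=shift, dims=3))
    conv_then_shift = torch.roll(conv(x), shifts=shift, dims=3)

    # Global max pooling keeps the strongest response, wherever it is
    pooled_original = conv(x).amax(dim=(2, 3))
    pooled_shifted = shift_then_conv.amax(dim=(2, 3))
```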

**What are other properties of CNNs?**

*Parameter sharing*: the same filters (whose weights are learned during training) are applied at every position of the input.
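To get a feeling for what sharing saves, one can count parameters. The sketch assumes a hypothetical 3x32x32 input and a convolution with 16 filters of size 3x3, and compares it with a fully connected layer producing an output of the same size (computed arithmetically, without building the layer):

```python
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# 16 filters of shape 3x3x3 plus 16 biases, regardless of the image size
conv_params = sum(p.numel() for p in conv.parameters())

# A dense layer mapping the 3x32x32 input to the same 16x30x30 output
# needs one weight per (input, output) pair plus one bias per output
in_features = 3 * 32 * 32
out_features = 16 * 30 * 30
dense_params = in_features * out_features + out_features
```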


**What are the things that a CNN learns?** The filters.

**How is a common CNN architecture defined?** For the "feature extraction" part we have Convolutional layers, then Activation functions (often *ReLU*) and Pooling layers (no weights learned, often *Max Pooling*). At the end, we have a "classifier", which is usually an MLP or just a fully connected layer with a softmax as output.
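One way this layout could look in PyTorch. The class name `SmallCNN`, the channel counts and the 32x32 RGB input size are all illustrative assumptions, not a reference architecture:

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    """Hypothetical CNN for 3x32x32 inputs: feature extractor + classifier."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extraction: convolution -> activation -> pooling, twice
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        # Classifier: a single fully connected layer
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)  # keep the batch dimension
        return self.classifier(x)          # raw class scores

model = SmallCNN()
scores = model(torch.randn(4, 3, 32, 32))  # random batch of 4 images
probs = torch.softmax(scores, dim=1)       # softmax as output
```

When training with `nn.CrossEntropyLoss()`, the raw `scores` would be passed to the loss and the softmax applied only when probabilities are needed.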

### Recurrent Neural Networks


General rules for a good cheat sheet:

- Definition
- Applications
- Code in PyTorch