# Gradient Descent

Gradient descent is an algorithm that helps "climb down a mountain" until it reaches a bottom. The mountain is the loss function, and the bottom of the valley is a minimum (local or global).

Gradient: the vector of the partial derivatives of the loss function with respect to the model's parameters. The gradient points in the direction of steepest increase of the loss function, which is why stepping in the opposite direction decreases the loss.

Gradient descent algorithm:
* Initialisation: give an initial value (random or not) to the model's parameters
* Iteration: $$\beta^{t+1} = \beta^{t} - \gamma \nabla_{\beta} C$$
    * $\gamma$ is the learning rate (step size) and $C$ is the loss function, with the gradient evaluated at $\beta^{t}$
* Stopping: 
    * if the gradient norm falls under a certain value (convergence)
    * if the maximum number of iterations is reached
    * if the validation loss starts increasing (overfitting limit)
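The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the name `gradient_descent`, the toy loss, and the default values of `gamma`, `tol` and `max_iter` are assumptions chosen for the example, and the validation-loss stopping rule is left out.

```python
def gradient_descent(grad, beta0, gamma=0.1, tol=1e-6, max_iter=1000):
    """Minimise a loss from its gradient function `grad`, starting at `beta0`."""
    beta = list(beta0)                     # initialisation
    n_updates = 0
    for _ in range(max_iter):              # stopping rule: number of iterations
        g = grad(beta)
        norm = sum(gi * gi for gi in g) ** 0.5
        if norm < tol:                     # stopping rule: small gradient norm
            break
        # iteration: beta <- beta - gamma * gradient
        beta = [b - gamma * gi for b, gi in zip(beta, g)]
        n_updates += 1
    return beta, n_updates


# Toy loss (an assumption for the example):
# C(beta) = (beta_0 - 3)^2 + (beta_1 + 1)^2, whose minimum is at (3, -1)
def grad_c(beta):
    return [2 * (beta[0] - 3), 2 * (beta[1] + 1)]


beta_hat, n_updates = gradient_descent(grad_c, [0.0, 0.0])
print(beta_hat, n_updates)  # beta_hat should be close to [3, -1]
```

A larger `gamma` takes bigger steps but can overshoot the minimum; a smaller one is safer but needs more iterations.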

## +
* Computation is easy and efficient (chain rule)
* Each step moves the parameters in a direction that decreases the loss on the train set (provided the learning rate is small enough)

## -
* Can get stuck in a local minimum (depends on the initialisation)
* Exploding gradients
* Vanishing gradients
* Dying ReLU
* Final model is very sensitive to the initial values of the parameters

## Batch and Stochastic gradient descents
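* stochastic: compute the gradient on a random subset of observations (down to a single one) at each step
* batch (mini-batch): split the data into random batches and make one update per batch -> 1 epoch is one sweep through all the batches

The advantage is that each update only uses part of the data, so we make more updates of the parameters for the same amount of computation.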
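As a rough sketch of the batch idea, the loop below fits a straight line with mini-batches in plain Python. The function name `minibatch_sgd`, the toy data and the hyperparameter values are all assumptions made for the illustration.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, gamma=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                    # 1 epoch = 1 sweep through all batches
        rng.shuffle(indices)                   # new random batches at every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error computed on this batch only
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= gamma * grad_w                # one parameter update per batch
            b -= gamma * grad_b
    return w, b


# Toy data (assumption): points lying exactly on the line y = 2x + 1
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)  # should be close to 2 and 1
```

With 20 observations and a batch size of 4, each epoch makes 5 updates instead of the single update a full-data gradient step would give.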

# Fully connected (dense) neural networks

A dense neuron :
* w (one parameter per input value)
* b a bias
* activation function
    * ReLU:
        * \+ fast computation
        * \- dying ReLU (the derivative is 0 for negative inputs, so a neuron stuck there stops learning)
    * Leaky ReLU:
        * \+ same as ReLU without the dying part (small non-zero slope for negative inputs)
    * Sigmoid:
        * \+ gives you a probability (good for binary classification or multi-label classification)
        * \+ smooth gradients
        * \- vanishing gradients (because the function saturates for large negative and positive values)
        * \- computation time
    * Softmax:
        * \+ categorical classification probabilities (the outputs sum to 1)
    * Tanh:
        * \+ nice properties for the second derivative -> used in RNN
        * \- same drawbacks as sigmoid
    * Swish:
        * \+ proposed by Google and works well in practice (smooth ReLU): $x \times \sigma(x)$
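For reference, the activation functions listed above can be written in plain Python as follows. This is a sketch for scalar inputs (a list for softmax); the `alpha=0.01` slope of the leaky ReLU is a common default used here as an assumption.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # Small slope for negative inputs, so the derivative is never exactly 0
    return x if x > 0 else alpha * x


def sigmoid(x):
    # Two branches to avoid overflow in exp() for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    return math.tanh(x)


def swish(x):
    return x * sigmoid(x)


def softmax(xs):
    m = max(xs)                                # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([1.0, 2.0, 3.0]))  # three probabilities that sum to 1
```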

A fully connected (dense) neuron is connected to every element of its input, with one parameter (weight) per input value.
On the first layer it reads the entire input (creating new features from all the data at step one, like base ingredients).
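A dense neuron's output is the activation applied to a weighted sum of all its inputs plus the bias. The sketch below uses illustrative names (`dense_neuron`, `dense_layer`) and made-up weights; a layer is just several neurons reading the same input.

```python
def relu(x):
    return max(0.0, x)


def dense_neuron(inputs, weights, bias, activation):
    """One weight per input value, plus a bias, then the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def dense_layer(inputs, weight_rows, biases, activation):
    """Every neuron of the layer reads the entire input."""
    return [dense_neuron(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Two neurons on a 2-value input (weights and biases are arbitrary)
outputs = dense_layer([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.5], relu)
print(outputs)
```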

## Architecture guidelines
* Start with lots of neurons then decrease (lots of base ingredients to make lots of recipes)
* Generally up to three layers in total (empirically, more layers bring no better results but more overfitting)
* Last layer should be compatible with the target:
    * Regression, y quantitative -> 1 neuron with linear activation
    * Binary classification -> 1 neuron with sigmoid activation
    * Categorical classification -> 1 neuron per category with softmax activation
    * Multi-label classification -> 1 neuron per category with sigmoid activation
* First layer -> specify the input dimension

Adding more neurons to a layer explores more possible features at one level of complexity/non-linearity. Adding more layers increases the attainable level of non-linearity.
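To get a feeling for what width and depth cost, one can count the trainable parameters of a stack of dense layers: each neuron has one weight per input plus one bias. The helper name `count_dense_parameters` and the layer sizes below are assumptions for the example.

```python
def count_dense_parameters(input_dim, layer_sizes):
    """Number of trainable parameters of a stack of dense layers."""
    total, n_in = 0, input_dim
    for n_out in layer_sizes:
        total += (n_in + 1) * n_out    # n_in weights + 1 bias for each neuron
        n_in = n_out                   # this layer's outputs feed the next layer
    return total


wide = count_dense_parameters(10, [64, 1])        # one wide hidden layer
deep = count_dense_parameters(10, [32, 16, 1])    # two narrower hidden layers
print(wide, deep)
```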

## Prepare for training
* compile :
    * choose the loss function:
        * MSE, MAE, MAPE -> Regression
        * (Sparse)CategoricalCrossEntropy -> Classification
        * BinaryCrossEntropy -> Binary Classification
    * Optimizer (choice of algorithm) -> Adam (adapts the learning rate to the gradient values: when the gradients get larger it slows down, when they get smaller it accelerates, like a skier)
    * Metrics (more interpretable than the loss)
        * MSE MAE MAPE
        * Accuracy, Precision, Recall, F1
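The losses named above are short formulas. Here is a plain-Python sketch of them, assuming predictions are already probabilities for the cross-entropies; the `eps` clipping value is an assumption added to avoid `log(0)`.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    # In percent; undefined when a true value is 0
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)          # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def sparse_categorical_cross_entropy(labels, prob_rows, eps=1e-12):
    # labels are integer class indices, prob_rows are softmax outputs
    return sum(-math.log(max(row[k], eps))
               for k, row in zip(labels, prob_rows)) / len(labels)


print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

The "sparse" variant takes integer class indices; the non-sparse one expects one-hot encoded targets but computes the same quantity.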

# Convolutional Neural Networks

## Image processing
* normalize pixel values to be in the range 0->1 (dividing by 255)
* images need to have shape (batch_size, height, width, channels), with channels in (1, 3, 4): 1 for B&W, 3 for RGB, 4 for RGBA (A is transparency)
* Data Augmentation: any transformation of your data that lets you artificially increase the size of the dataset without altering the target (it must be reasonable with regard to the problem)
* use ImageDataGenerator from tensorflow (easy data augmentation & no need to load all data in memory)
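To make the two preprocessing ideas concrete without any library, the sketch below treats an image as nested lists (rows of pixels, each pixel a list of channel values). The function names are illustrative, and the horizontal flip is only one example of an augmentation, reasonable when left/right orientation does not change the target.

```python
def normalize(image):
    """Scale 8-bit pixel values from the range 0..255 to the range 0..1."""
    return [[[value / 255 for value in pixel] for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left/right; the channels of each pixel are untouched."""
    return [row[::-1] for row in image]


def augment(images):
    """Return the original images plus their mirrored copies."""
    return list(images) + [horizontal_flip(img) for img in images]


# A tiny 1 x 2 RGB image (made-up pixel values)
image = [[[0, 255, 51], [255, 0, 0]]]
batch = augment([normalize(image)])
print(len(batch))  # the dataset size has doubled
```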

## Convolutional neuron
* Kernel size: the window that travels over the input -> it sets the size of the patterns you'll be detecting on the input, and it also defines the number of parameters
* Padding: whether to add zeros around the edges of the input -> the goal is to be able to look at the corners and borders of the input and to maintain the size of the original input
* Strides: how the kernel travels over the input: does it stop at every element, every 2 elements, etc.
* activation function (ReLU in most cases)
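The effect of kernel size, padding and strides on the output size, and the sliding-window computation itself, can be sketched in plain Python. The sketch handles a single-channel 2D input without padding; the function names and the `"valid"`/`"same"` padding labels (as used by Keras) are assumptions for the example.

```python
def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one dimension of size n."""
    if padding == "same":
        return -(-n // stride)             # ceil(n / stride), zeros added on the edges
    return (n - kernel) // stride + 1      # "valid": no padding


def conv2d(image, kernel, stride=1):
    """Slide the kernel over a 2D input (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]                         # 2 x 2 kernel -> 4 parameters
feature_map = conv2d(image, kernel)
print(feature_map)
```

A bias and an activation function (ReLU in most cases) would then be applied to each value of the feature map.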

## Architecture guidelines
* Start with a few neurons then increase the number of neurons (first layers capture small patterns -> subsequent layers capture bigger and bigger patterns that allow for more and more variety, so we need more neurons to catch them all)
* Deeper = Better (limited by the input shape, it's not worth capturing patterns that are bigger than the input)
* Alternate between convolutional layers and pooling layers to gradually reduce the size of the input
* Before prediction, flatten the output of the last convolution so it can be fed to dense layers
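A pooling layer and the final flatten step are both simple reshaping operations. A minimal plain-Python sketch on a single 2D feature map, with illustrative function names and non-overlapping max pooling as an assumption:

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


def flatten(feature_map):
    """Turn a 2D feature map into a flat list that a dense layer can read."""
    return [value for row in feature_map for value in row]


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
pooled = max_pool2d(feature_map)     # 4 x 4 -> 2 x 2
print(flatten(pooled))
```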

# Transfer Learning

## General Principle
* Pretrained model trained on a huge dataset that is similar to yours
* \+ save time and resources
* \+ achieve better results with minimum effort and cost
* \+ take advantage of the pretrained model's performance to train on small datasets
* \- data must be similar to pretraining data (medical images not compatible with daily life images)

## Architecture guidelines
* Remove the last layer to replace it with a layer adapted to your problem
* If you want to work with different levels of pattern complexity:
    * ResNet (conv net with additive bypassing)
    * DenseNet (conv net with concatenate bypassing)
    * Inception (using different convolution layers in parallel)
* Fine-tuning: letting more of the top layers train to further adapt the model to our problem
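The difference between the three bypass styles is easier to see on plain lists of features. In the toy sketch below, `layer` and `branches` stand for arbitrary layers (here simple made-up functions, not real convolutions), and the block names are illustrative.

```python
def residual_block(x, layer):
    """ResNet style: add the input to the layer output (sizes must match)."""
    return [xi + yi for xi, yi in zip(x, layer(x))]


def dense_block(x, layer):
    """DenseNet style: concatenate the input with the layer output."""
    return x + layer(x)


def inception_block(x, branches):
    """Inception style: run several layers in parallel and concatenate them."""
    out = []
    for branch in branches:
        out.extend(branch(x))
    return out


def double(x):                      # stand-in for a real layer
    return [2 * v for v in x]


def negate(x):                      # stand-in for another layer
    return [-v for v in x]


print(residual_block([1, 2], double))
print(dense_block([1, 2], double))
print(inception_block([1, 2], [double, negate]))
```

In both bypass variants the untouched input stays available to the next layer, which is what lets the network combine simple and complex patterns.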

## Where to find them?
* TensorFlow API (ImageNet)
* TensorFlow Hub
* GitHub (papers with code)
* Google Scholar (find the scientific paper and manually reproduce the architecture)
* Hugging Face (pretrained models and preset architectures)

# GANs

## General intuition
* mafia (Generator) vs. cops (Discriminator)
* Generator produces fake data from noise or input data
* Discriminator separates fake data from real data
* Unsupervised (do not need labels, just need data examples)
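The mafia vs. cops game can be written as two losses computed from the discriminator's outputs (probabilities that a sample is real). This sketch assumes the common binary cross-entropy formulation with the "non-saturating" generator loss; the function names and the `eps` constant are illustrative.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """The cops: real samples should get label 1, fake samples label 0."""
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """The mafia: the generator wants its fakes to be labelled as real (1)."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)


# d_real / d_fake: discriminator outputs on real and generated samples (made up)
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))   # good discriminator -> low loss
print(generator_loss([0.1, 0.2]))                   # fakes get caught -> high loss
```

No labels from the dataset are needed: the "real" and "fake" targets come for free from where each sample originated.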

## Use cases
* Generate anonymous data
* Complete missing data, missing values
* Produce realistic fakes
* Anomaly detection
* Domain adaptation (super resolution)
* Image translation (Google Maps into landscape)

## Architecture guidelines
* Conv transpose layers (they reverse the shape change of a convolution, so they are used for upsampling)
* a stride of 2 is better than pooling or inverse pooling
* Leaky ReLU
* Batch normalization
* generator: more units at the beginning than at the end
* penalty on the discriminator
* noisy labels
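To illustrate how a transposed convolution with a stride of 2 upsamples its input, here is a minimal 1D version in plain Python (no padding). The function name and the toy kernel are assumptions for the example.

```python
def conv_transpose1d(x, kernel, stride=2):
    """Each input value 'stamps' a scaled copy of the kernel onto the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for k, wk in enumerate(kernel):
            out[i * stride + k] += xi * wk
    return out


upsampled = conv_transpose1d([1, 2, 3], [1, 1], stride=2)
print(upsampled)   # the output is longer than the 3-value input
```

With a stride of 2 the output is roughly twice as long as the input, and the upsampling weights are learned rather than fixed as in inverse pooling.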

# Embedding

## General principle
* represent text with numbers
* keep the sequential aspect of data
* represent each word by a vector
* embedding vectors represent a summarized meaning of the word (either connected to the target or to similar words in the language -> word2vec)
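In practice, an embedding is a lookup table: words are mapped to integer ids, and each id selects one row (vector) of the table. The sketch below uses illustrative function names, reserves id 0 for unknown words, and initialises the vectors randomly; in a real model these vectors are parameters that get trained.

```python
import random


def build_vocab(sentences):
    """Give each distinct word an integer id; 0 is reserved for unknown words."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(sentence, vocab):
    """Text -> sequence of ids (the word order is kept)."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]


def init_embedding(vocab_size, dim, seed=0):
    """One trainable vector of size `dim` per id, randomly initialised."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(token_ids, table):
    return [table[i] for i in token_ids]


vocab = build_vocab(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)              # "bird" is unknown -> id 0
table = init_embedding(len(vocab) + 1, dim=4)
vectors = embed(ids, table)                      # one 4-value vector per word
print(ids, len(vectors), len(vectors[0]))
```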

## Training the embedding

### Supervised training
* Use the embedding to predict a target variable, so the embedding parameters will be trained to represent the words according to their connection to the target (example: cat and airplane could both be connected to good movie reviews, although they are not similar words)

### Unsupervised training
Word2Vec
* Continuous bag of words (CBOW): take a group of words and try to predict the word in the middle
* Skip-gram: take one word and associate it positively with the words that are close to it and negatively with random words (positive vs. negative skip-grams)
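The two Word2Vec variants differ mainly in how training examples are built from a sequence of word ids. A rough sketch, with illustrative function names and a uniform random choice of negative words as a simplifying assumption:

```python
import random


def cbow_examples(tokens, window=1):
    """(surrounding words, middle word) pairs: predict the word in the middle."""
    examples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        examples.append((context, tokens[i]))
    return examples


def skipgram_pairs(tokens, vocab_size, window=1, negatives=1, seed=0):
    """(word, other word, label) triples: 1 for close words, 0 for random words."""
    rng = random.Random(seed)
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((target, tokens[j], 1))                 # positive skip-gram
            for _ in range(negatives):
                # Negative skip-gram: a random word (it may be a true neighbour by chance)
                pairs.append((target, rng.randrange(vocab_size), 0))
    return pairs


tokens = [0, 1, 2, 3, 4]            # word ids of a 5-word sentence
print(cbow_examples(tokens))
print(skipgram_pairs(tokens, vocab_size=5)[:4])
```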


# RNN

## General Principle
* Read through a sequence of inputs while keeping a memory of the previous words in the sequence
* Simple RNN works poorly on long sequences (vanishing gradient -> the gradient has no choice but to go through the tanh activation at every step)
* memory persists in the form of a hidden state vector
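A simple RNN reads the sequence one element at a time and updates its hidden state with $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. The plain-Python sketch below uses illustrative names and made-up weights.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """New hidden state from the current input and the previous hidden state."""
    return [math.tanh(sum(w * xi for w, xi in zip(W_x[k], x))
                      + sum(w * hi for w, hi in zip(W_h[k], h_prev))
                      + b[k])
            for k in range(len(b))]


def run_rnn(sequence, W_x, W_h, b):
    h = [0.0] * len(b)                 # empty memory before reading anything
    for x in sequence:                 # read the sequence in order
        h = rnn_step(x, h, W_x, W_h, b)
    return h                           # memory after the last element


# 1 input feature, 1 hidden unit (weights are arbitrary)
h_final = run_rnn([[1.0], [0.0]], W_x=[[1.0]], W_h=[[0.5]], b=[0.0])
print(h_final)
```

Every step passes the previous state through tanh again, which is why information (and gradient) from early elements fades quickly.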

## GRU (gated recurrent unit)
* reset gate: decides how much of the previous output (hidden state) is used when creating the new candidate output
* update gate: 
    * a new candidate output is produced (through the tanh activation)
    * the gate decides whether to replace the old output with the new one or to keep the old one
* possibility to bypass the tanh activation (when the old output is kept) -> less vanishing gradient
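A GRU step for scalar input and state can be sketched as follows. This assumes the common formulation where the new state is `(1 - z) * h_prev + z * candidate` (some libraries swap the roles of `z` and `1 - z`); the parameter layout `p[name] = (w, u, b)` and the values are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU step; p maps a gate name to (input weight, state weight, bias)."""
    wz, uz, bz = p["update"]
    wr, ur, br = p["reset"]
    wh, uh, bh = p["candidate"]
    z = sigmoid(wz * x + uz * h_prev + bz)                 # update gate
    r = sigmoid(wr * x + ur * h_prev + br)                 # reset gate
    h_cand = math.tanh(wh * x + uh * (r * h_prev) + bh)    # candidate output
    # z close to 0: keep the old state untouched (it bypasses the tanh)
    # z close to 1: replace it with the candidate
    return (1 - z) * h_prev + z * h_cand


params = {"update": (0.5, 0.5, -20.0),      # very negative bias -> gate closed
          "reset": (0.5, 0.5, 0.0),
          "candidate": (0.5, 0.5, 0.0)}
print(gru_step(1.0, 0.7, params))   # stays very close to the old state 0.7
```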

## LSTM (long short term memory)
* forget gate: do we erase the previous memory or not?
* input gate:
    * sigmoid: are we going to use the new information in the memory?
    * tanh: what is the new information?
    * adds the selected portion of the new information to the memory
* output gate: choosing what portion of the memory to use as the output (hidden state)
* two states are carried from step to step: the hidden state (output) and the cell state (memory)
* possibility to bypass the tanh -> less vanishing gradient
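The gates above translate into a few lines for a scalar LSTM step. This is a sketch of the standard formulation; the parameter layout `p[name] = (w, u, b)` and the values are assumptions for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps a gate name to (input weight, state weight, bias)."""
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b

    f = sigmoid(pre("forget"))         # forget gate: keep or erase the old memory
    i = sigmoid(pre("input"))          # input gate: use the new information or not
    g = math.tanh(pre("candidate"))    # what the new information is
    o = sigmoid(pre("output"))         # output gate: portion of memory to output
    c = f * c_prev + i * g             # cell state: old memory goes through no tanh
    h = o * math.tanh(c)               # hidden state
    return h, c


params = {"forget": (0.5, 0.5, 20.0),      # large bias -> remember everything
          "input": (0.5, 0.5, -20.0),      # very negative bias -> ignore the input
          "candidate": (0.5, 0.5, 0.0),
          "output": (0.5, 0.5, 0.0)}
h, c = lstm_step(1.0, 0.1, 0.8, params)
print(h, c)   # c stays very close to the previous memory 0.8
```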

## Architecture guidelines
* avoid stacking several RNN layers one after another
* same guidelines as for dense networks

# Encoder Decoder

## General principle
* allows dealing with outputs of arbitrary length
* Teacher forcing (during training, feed the correct previous answers to the decoder at each step)

## Attention
* Instead of having the decoder look at the encoder output only once, at each step we assign importance weights to the encoder outputs in order to prioritize the information to be used
* partly solves the error propagation problem of the encoder decoder
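One decoding step of attention can be sketched as: score each encoder output against the current decoder state, turn the scores into weights with a softmax, and take the weighted average. The dot-product scoring used here is one common choice, taken as an assumption since the scoring function is not specified above; the names and values are illustrative.

```python
import math


def attention(decoder_state, encoder_outputs):
    """Return the importance weights and the weighted sum of encoder outputs."""
    # Score: dot product between the decoder state and each encoder output
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    # Softmax: scores -> importance weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context: encoder outputs averaged according to their importance
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context


encoder_outputs = [[10.0, 0.0], [0.0, 10.0]]     # one vector per input position
weights, context = attention([1.0, 0.0], encoder_outputs)
print(weights, context)   # almost all the weight goes to the first position
```

The weights are recomputed at every decoding step, so each output word can focus on a different part of the input.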