# Deep learning - Data science

Notes from the book *Understanding Deep Learning*:
https://udlbook.github.io/udlbook/

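# Definitions
### Hidden state or state
A network is a function $f: X \to Y$ from inputs to outputs. The hidden layers are the layers between the input and the output, and the values they compute form the hidden state.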
### Loss or training error
We quantify the total mismatch, training error, or loss as the sum of the squares of the deviations between prediction and target for all $n$ training pairs:
$$ L[\phi] = \sum_{i=1}^n (f[x_i,\phi] - y_i)^2 $$
Since the best parameters minimize this expression, we call this a least-squares loss. The squaring operation means that the direction of the deviation (i.e., whether the line is above or below the data) is unimportant.
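
A minimal sketch of the least-squares loss in plain Python. The 1D linear model `f` and the three data points are assumptions made for illustration; they are not taken from the book.

```python
def f(x, phi):
    """Hypothetical 1D linear model: phi = (intercept, slope)."""
    return phi[0] + phi[1] * x


def least_squares_loss(phi, xs, ys):
    """Sum of squared deviations over all training pairs."""
    return sum((f(x, phi) - y) ** 2 for x, y in zip(xs, ys))


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]  # generated by y = 1 + 2x

perfect = least_squares_loss((1.0, 2.0), xs, ys)  # line passes through every point
shifted = least_squares_loss((0.0, 2.0), xs, ys)  # line lies 1 below every point
print(perfect, shifted)
```

Shifting the line up by 1 instead of down would give the same loss, which is the "direction is unimportant" remark in code.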
### Loss function or cost function
The loss $L[\phi]$ is a function of the parameters $\phi$. The best parameters $\hat{\phi}$ are the ones that minimize it:
$$ \hat{\phi} = \underset{\phi}{\mathrm{argmin}} \left[ L[\phi] \right] $$
### Training
The process of finding parameters that minimize the loss is termed model fitting, training,
or learning.
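
As an illustration of what "finding parameters that minimize the loss" can look like, here is a sketch of plain gradient descent on a least-squares loss. The linear model, the data, the learning rate and the number of steps are all assumptions chosen so that this toy problem converges.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = phi0 + phi1 * x by gradient descent on the sum of squared errors."""
    phi0, phi1 = 0.0, 0.0
    for _ in range(steps):
        # Gradients of sum_i (phi0 + phi1 * x_i - y_i)^2 with respect to phi0 and phi1
        g0 = sum(2 * (phi0 + phi1 * x - y) for x, y in zip(xs, ys))
        g1 = sum(2 * (phi0 + phi1 * x - y) * x for x, y in zip(xs, ys))
        phi0 -= lr * g0
        phi1 -= lr * g1
    return phi0, phi1


phi0, phi1 = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])  # data from y = 1 + 2x
print(phi0, phi1)
```

In practice an optimizer from a library does this step for you; the loop only shows the idea.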
### Generative vs. discriminative models
Models of the form y = f[x,ϕ] are discriminative models. These make an output prediction y from real-world measurements x. Another approach is to build a generative model x = g[y,ϕ], in which the real-world measurements x are computed as a function of the output y.
### Rectified linear unit or ReLU:
$$ a[z] = \mathrm{ReLU}[z] = \begin{cases} 0 & z < 0 \\ z & z \geq 0 \end{cases} $$
This returns the input when it is positive and zero otherwise.
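
The same definition as a small plain-Python function (a sketch for scalars; libraries apply it element-wise to tensors):

```python
def relu(z):
    """Return z when it is positive and zero otherwise."""
    return z if z > 0 else 0.0


outputs = [relu(z) for z in (-2.0, 0.0, 1.5)]
print(outputs)
```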
### Universal approximation theorem
The universal approximation theorem proves that for any continuous function, there exists a shallow network that can approximate this function to any specified precision.
### Width, depth, capacity of a network
The number of **hidden units in each layer** is referred to as the **width** of the network, and the number of **hidden layers** as the **depth**. The total number of hidden units is a measure of the network’s capacity.
### Maximum likelihood
We now consider the model as computing a conditional probability distribution $Pr(y|x)$ over possible outputs $y$ given input $x$. The loss encourages each training output $y_i$ to have a high probability under the distribution $Pr(y_i|x_i)$ computed from the corresponding input $x_i$.

First, we choose a parametric distribution $Pr(y|\theta)$ defined on the output domain $y$. Then we use the network to compute one or more of the parameters $\theta$ of this distribution. For example, suppose the prediction domain is the set of real numbers, so $y \in \mathbb{R}$. Here, we might choose the univariate normal distribution, which is defined on $\mathbb{R}$. This distribution is defined by the mean $\mu$ and variance $\sigma^2$, so $\theta = \{\mu, \sigma^2\}$. The machine learning model might predict the mean $\mu$, and the variance $\sigma^2$ could be treated as an unknown constant.

The model now computes different distribution parameters $\theta_i = f[x_i,\phi]$ for each training input $x_i$. We choose the parameters $\phi$ that maximize the combined probability of all $n$ training outputs:
$$ \hat{\phi} = \underset{\phi}{\mathrm{argmax}} \left[ \prod_{i=1}^n Pr(y_i \mid f[x_i,\phi]) \right] $$
The combined probability term is the likelihood of the parameters, and hence this expression (equation 5.1 in the book) is known as the maximum likelihood criterion.
### Negative log-likelihood
The maximum likelihood criterion is a product of many probabilities, which can become too small to represent with finite-precision arithmetic, so we maximize its logarithm instead. This log-likelihood criterion is equivalent because the logarithm is a monotonically increasing function: if z > z′, then log[z] > log[z′] and vice versa (figure 5.2 in the book). It follows that when we change the model parameters ϕ to improve the log-likelihood criterion, we also improve the original maximum likelihood criterion. It also follows that the overall maxima of the two criteria must be in the same place, so the best model parameters $\hat{\phi}$ are the same in both cases.

Finally, we note that, by convention, model fitting problems are framed in terms of minimizing a loss. To convert the maximum log-likelihood criterion to a minimization problem, we multiply by minus one, which gives us the negative log-likelihood criterion:
$$ \hat{\phi} = \underset{\phi}{\mathrm{argmin}} \left[ -\sum_{i=1}^n \log Pr(y_i \mid f[x_i,\phi]) \right] $$
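
A small sketch of the negative log-likelihood for a normal distribution with a fixed variance, assuming the model predicts the mean. The predicted means `good` and `bad` are hypothetical values. With the variance fixed, the negative log-likelihood is a constant plus a scaled sum of squared errors, so it ranks parameters the same way as the least-squares loss.

```python
import math


def normal_nll(mus, ys, sigma=1.0):
    """Negative log-likelihood of ys under Normal(mu_i, sigma^2), summed over i."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (y - mu) ** 2 / (2 * sigma ** 2) for mu, y in zip(mus, ys))


def sum_of_squares(mus, ys):
    """Least-squares loss for the same predictions."""
    return sum((y - mu) ** 2 for mu, y in zip(mus, ys))


ys = [1.0, 3.0, 5.0]
good = [1.1, 2.9, 5.2]  # hypothetical predicted means, close to ys
bad = [0.0, 0.0, 0.0]   # hypothetical predicted means, far from ys

print(normal_nll(good, ys), normal_nll(bad, ys))
```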
### Inference
The network no longer directly predicts the outputs y but instead determines a probability distribution over y. When we perform inference, we often want a point estimate rather than a distribution, so we return the maximum of the distribution:
$$ \hat{y} = \underset{y}{\mathrm{argmax}} \left[ Pr(y \mid f[x,\hat{\phi}]) \right] $$


## Regularization techniques
### L2 regularization
For neural networks, L2 regularization is usually applied to the weights but not the biases and is hence referred to as a weight decay term. The effect is to encourage smaller weights, so the output function is smoother. To see this, consider that the output prediction is a weighted sum of the activations at the last hidden layer. If the weights have a smaller magnitude, the output will vary less. The same logic applies to the computation of the pre-activations at the last hidden layer and so on, progressing backward through the network. In the limit, if we forced all the weights to be zero, the network would produce a constant output determined by the final bias parameter.
If the network is overfitting, then adding the regularization term means that the network must trade off slavish adherence to the data against the desire to be smooth. One way to think about this is that the error due to variance reduces (the model no longer needs to pass through every data point) at the cost of increased bias (the model can only describe smooth functions).
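
A sketch of how an L2 term could be added to a data loss. The function names, the weights and the regularization strength `lam` are assumptions for illustration; in PyTorch this is usually configured through the optimizer's `weight_decay` argument rather than written by hand.

```python
def l2_penalty(weights, lam):
    """lam times the sum of squared weights (biases are deliberately left out)."""
    return lam * sum(w ** 2 for w in weights)


def regularized_loss(data_loss, weights, lam=0.01):
    """Data loss plus the L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)


weights = [3.0, -4.0]  # hypothetical weights
total = regularized_loss(1.0, weights, lam=0.1)
print(total)
```

Larger weights are penalized quadratically, so the optimizer has to trade data fit against weight size.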

### Implicit regularization in gradient descent

### Implicit regularization in stochastic gradient descent
SGD implicitly favors places where the gradients are stable (where all the batches agree on the slope). Once more, this modifies the trajectory of the optimization process (figure 9.4 in the book) but does not necessarily change the position of the global minimum.

## Loss and activation functions
### MSE (regression models)
The mean absolute error (MAE) typically does not work that well as a loss function. The default for regression is a variation on this, the mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2$$

This is the mean $\frac{1}{n}\sum_{i=1}^n$ of the squared error $(Y_i - \hat{Y}_i)^2$.
PyTorch has already implemented this for us in an optimized way:
````
import torch
loss = torch.nn.MSELoss()
loss(yhat, y)  # yhat, y: float tensors of the same shape
````
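### Negative log likelihood
The function is:

$$NLL = - \log(\hat{y}[c])$$

Or: take the probabilities $\hat{y}$, and pick the probability of the correct class $c$ from the list of probabilities with $\hat{y}[c]$. Now take the log of that and flip the sign.

The log has the effect that a prediction close to 0 when it should have been 1 is punished extra hard. Note that `torch.nn.NLLLoss` expects log-probabilities as input (for example the output of a LogSoftmax layer); it does not take the log itself.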
````
import torch
loss = torch.nn.NLLLoss()
loss(yhat, y)  # yhat: log-probabilities, y: class indices
````
### Softmax
Scales all values to between 0 and 1 and makes them sum to 1.
````
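import torch
m = torch.nn.Softmax(dim=-1)  # normalize over the last (class) dimension
yhat = m(input)  # input: tensor of raw scores (logits)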
````
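
To see what the softmax does, here is a plain-Python sketch for a single list of scores. Subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result; the input values are made up.

```python
import math


def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```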
### Sigmoid
Scales every value to between 0 and 1, but without making the values sum to 1.
````
import torch
m = torch.nn.Sigmoid()
yhat = m(input)  # input: tensor of raw scores (logits)
````
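### Tanh
Scales everything to between -1 and 1: $y = \tanh(x)$.
An advantage over ReLU is that the output is bounded and the derivative is at most 1, whereas ReLU outputs are unbounded and can keep growing (for example when the same layer is applied repeatedly, as in a recurrent network).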
### Categorical Cross-Entropy Loss (multiclass)
The cross-entropy loss (CEL) adds the LogSoftmax to the loss. **This means you don't need to add a LogSoftmax layer to your model.**
This is used when the model needs to classify an input into one of many classes. It compares the predicted probability distribution across all classes with the actual one-hot encoded labels (in PyTorch the targets are usually passed as class indices).
In PyTorch, you can use `torch.nn.CrossEntropyLoss`, which expects raw logits (not passed through softmax) and applies the log-softmax internally.
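
The combination "log-softmax, then pick the correct class, then flip the sign" can be sketched in plain Python for a single example. This is an illustration of the idea, not PyTorch's implementation; the function names are my own.

```python
import math


def log_softmax(logits):
    """Log of the softmax, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, c):
    """Negative log-probability of the correct class index c, from raw logits."""
    return -log_softmax(logits)[c]


uniform = cross_entropy([0.0, 0.0], 0)          # two equally likely classes
right = cross_entropy([5.0, 0.0, 0.0], 0)       # confident and correct
wrong = cross_entropy([5.0, 0.0, 0.0], 1)       # confident and wrong
print(uniform, right, wrong)
```

A confident wrong prediction gets a much larger loss than a confident correct one.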

### Binary Cross-Entropy Loss (Log Loss)
This loss function is used when the output consists of a single probability value representing the likelihood that the input belongs to class 1. The binary cross-entropy loss is also known as log loss.

$$ BCE = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \cdot \log(\hat{y}_i) + (1-y_i) \cdot \log(1-\hat{y}_i) \right] $$

Or, in plain language: 
- assume that $y$ is a binary label (0 or 1)
- predict the probability $\hat{y}$
- if the label is 1, take the log of the probability: $y_i \cdot \log(\hat{y}_i)$
- if the label is 0, take the log of $1-\hat{y}_i$
- take the mean $\frac{1}{n}\sum_{i=1}^n$ of that and flip the sign, so that the loss is positive

````
import torch
loss = torch.nn.BCELoss()
loss(yhat, y)  # yhat: probabilities in [0, 1], y: float labels (0.0 or 1.0)
````
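
The steps in the list above can be sketched by hand in plain Python. The sketch includes the leading minus sign of the usual definition, so the loss is positive, and it assumes the predicted probabilities lie strictly between 0 and 1 (otherwise the log is undefined).

```python
import math


def bce(yhat, y):
    """Mean binary cross-entropy for probabilities yhat and 0/1 labels y."""
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(yhat, y)]
    return -sum(terms) / len(terms)


unsure = bce([0.5, 0.5], [1, 0])  # coin-flip predictions
good = bce([0.9, 0.1], [1, 0])    # confident and correct
bad = bce([0.1, 0.9], [1, 0])     # confident and wrong
print(unsure, good, bad)
```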
### Binary Cross Entropy with logits
If your predictions are not values between 0 and 1, you can use the WithLogits variation. You can then skip the sigmoid layer.
In PyTorch, you can use `torch.nn.BCEWithLogitsLoss` for binary classification tasks. This loss function expects raw output logits from the model (not passed through a sigmoid function), and it applies the sigmoid internally.

````
import torch
loss = torch.nn.BCEWithLogitsLoss()
loss(input, y)  # input: raw logits, y: float labels (0.0 or 1.0)
````
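
Working on logits is not only convenient, it is also numerically safer. The sketch below, for a single logit `z` and label `t`, uses a commonly used stable formulation and checks it against "sigmoid first, then binary cross-entropy". It illustrates the idea and is not PyTorch's own code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z, t):
    """Stable form of -[t * log(sigmoid(z)) + (1 - t) * log(1 - sigmoid(z))]."""
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


def bce_from_probability(p, t):
    """Binary cross-entropy for one probability p and label t."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))


for z in (-3.0, 0.5, 4.0):
    for t in (0.0, 1.0):
        print(z, t, bce_with_logits(z, t), bce_from_probability(sigmoid(z), t))
```

For a very large logit the sigmoid rounds to exactly 1 and the probability-based version would hit `log(0)`, while the logit-based version still returns a finite loss.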
# Wrapup

Losses are very important: they tell your model what is "right" and "wrong" and determine what the model will learn!

- For regression models, typically use MSE
- For binary classification, use binary cross-entropy (note: this might be implemented differently in other libraries like TensorFlow!)
- For multiclass, use CrossEntropyLoss

There are other, more complex losses for more complex use cases, but these three will cover 80% of your needs.

### ResNets
Learn residuals: each block learns the difference between its input state and its output state, and adds that difference back to its input.

### RNN Recurrent Neural Networks
- RNNs are designed for time series
- Activations: tanh (-1 to 1)
- 1982: discovered by John Hopfield
- 1995: LSTM architecture
- 2013: LSTM outperforms other models on natural language recognition
- You can also use a CNN for time series by treating the series as a matrix
- CNN: tensor (B, D), where the dimensions D are height x width x channels
- Time series: 3D tensor (B, sequence, D), i.e. (batch size, sequence length, D = 8 measurements, for example)
- In time series the context is important: context changes the meaning
- Time series: the order in time is important. You choose how much of the past is needed to make a prediction (the window) and how far into the future you want to predict (the horizon), without leaking data (normalizing the data with a mean that includes future values is data leakage)
- A plain RNN has no explicit mechanism to forget or retain memory
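
The window and horizon idea can be sketched as a small slicing function in plain Python. The function name, the toy series and the train/test split point are assumptions for illustration.

```python
def make_windows(series, window, horizon):
    """Split a series into (past window, future horizon) pairs, in time order."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        past = series[start:start + window]
        future = series[start + window:start + window + horizon]
        pairs.append((past, future))
    return pairs


series = [10, 11, 12, 13, 14, 15]
pairs = make_windows(series, window=3, horizon=1)
print(pairs)

# To avoid leakage, compute normalization statistics on the training part only.
train = series[:4]
train_mean = sum(train) / len(train)
```

Computing `train_mean` over the whole series instead would let information from the future leak into the training data.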

### GRU Gated Recurrent Unit
Gamma = gate. With a gate the network can:
- remember the past and completely ignore the new state (negate the new information)
- forget the past and focus on the present
- or do something in between

To create a gate, use a sigmoid activation (values between 0 and 1):
- Remember: multiply by 1
- Forget: multiply by 0

The gate is a matrix of the same size as the hidden state, passed through a sigmoid (values between 0 and 1).

Candidate hidden state = tanh of a matrix.

gate * hidden state = hidden state with reset or update, where * is the Hadamard (pointwise) multiplication of the matrices.

The full GRU has two gates: an update gate and a reset gate.
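
A sketch of the gating idea for one update step, in plain Python on lists. It assumes the convention from these notes (gate = 1 keeps the old state, gate = 0 takes the candidate); libraries may define the gate the other way around, and a full GRU also has a reset gate, which is left out here. All values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_update(h_old, candidate, gate_logits):
    """Blend old state and candidate element by element (pointwise product).

    gate = 1 keeps the old value (remember), gate = 0 takes the candidate (forget).
    """
    gates = [sigmoid(g) for g in gate_logits]
    return [g * h + (1.0 - g) * c for g, h, c in zip(gates, h_old, candidate)]


h_old = [0.5, -0.2]
candidate = [math.tanh(2.0), math.tanh(-1.0)]  # candidate values lie in (-1, 1)
# Large positive logit -> gate near 1; large negative logit -> gate near 0.
h_new = gated_update(h_old, candidate, gate_logits=[20.0, -20.0])
print(h_new)
```

The first element stays (almost exactly) at its old value, the second is (almost exactly) replaced by the candidate.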

### LSTM
Three gates: to compute the new state, the LSTM uses two separate gates, a forget gate and an update (input) gate, instead of the single update gate of the GRU. The third gate is the output gate.

Input: the number of units of the hidden state.

- One to many: generate a name with one letter as the initial
- Many to one: review --> sentiment
- Many to many: translation, or "give me all verbs in the sentence"

Design choice: smaller units versus more layers, for example 128 units and 3 layers.

### Naive model
A very simple model that serves as a baseline.
First make predictions with the naive model to get a baseline for your error. Any model with an error above this baseline is not good enough.
For example: the weather tomorrow is the same as the weather today.
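
A sketch of such a baseline in plain Python: predict that the next value equals the current one, then measure the error. The temperatures are hypothetical.

```python
def naive_forecast(series):
    """Predict each next value as the current value (tomorrow equals today)."""
    return series[:-1]


def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


temps = [15.0, 16.0, 18.0, 17.0]  # hypothetical daily temperatures
baseline = mse(naive_forecast(temps), temps[1:])
print(baseline)
```

A trained model should reach an error below `baseline` on the same data before it is worth keeping.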

### Deep learning in practice

1. Which type of NN model am I going to use?
2. Number of units and layers
3. Activations and optimizers
4. Learning rate, scheduler, dropout
5. Batch and regularization

The most important question is whether you need memory for your time series. If so, an LSTM or GRU is the best choice (for example for the gesture dataset, language/translation, or audio).

### Input tensors
- 2D tensor: Linear
- 3D tensor: RNN, GRU, LSTM, Conv1d (time series)
- 4D tensor: Conv2d, MaxPool2d, BatchNorm2d


### TMUX 
Use this if you want to keep the VM running, for example overnight:
ctrl + b
ctrl loss +b
tmux a -t 0 # attach to session 0

```rye install mlflow```

Start the MLflow server (stop it with ctrl + c):
```
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns --host 127.0.0.1 --port 5000
```

### Curse of dimensionality
The search space is the product of the number of options for every hyperparameter, so it can be huge. To shrink the search space, choose sensible defaults for the parameters that do not have much influence.
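
A quick count shows how fast the product grows and how much fixing a few parameters helps. The hyperparameters and their options below are hypothetical.

```python
import math

search_space = {  # hypothetical options, for illustration only
    "hidden_size": [32, 64, 128, 256],
    "num_layers": [1, 2, 3, 4],
    "dropout": [0.0, 0.1, 0.3, 0.5],
    "lr": [1e-1, 1e-2, 1e-3, 1e-4],
}


def grid_size(space):
    """Number of combinations: the product of the number of options per parameter."""
    return math.prod(len(options) for options in space.values())


full = grid_size(search_space)
# Fix the two parameters assumed to matter least to a single default value.
reduced = grid_size({**search_space, "dropout": [0.1], "lr": [1e-3]})
print(full, reduced)
```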

1. Architecture choice: think about the architecture and use your intuition
2. Ray is framework-agnostic: it works with TensorFlow and PyTorch, but it works less well on Windows


- Bayesian search: searches for the best solution by trying configurations and using the results so far to pick the next one; requires a distribution for each of your parameters
- Bayesian Hyperband: stops a trial early if it is not good enough after x epochs, with the risk that you miss the configuration that would have ended up best
- Tree-structured Parzen estimator: for strange search spaces, works with a tree structure

Mostly start with Bayesian search or Hyperband, and move to tree-structured Parzen estimators later.

### Learning rate schedulers
- It is handy to start with a high learning rate (0.1) and end with a very small one (0.001-4); these learning rates are set in the scheduler
- Reduce on plateau: reduce the learning rate (divide by 10) when learning stops (the error reaches a plateau and stays constant); the patience determines when the learning rate is decreased
- Cosine warm-up: for very big datasets; starts with a low learning rate as a warm-up, to prevent the parameters from being fixed too early
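
The reduce-on-plateau rule can be sketched as a small class in plain Python. This is a simplified illustration, not the PyTorch scheduler itself; the class name, the default patience of 2 and the floor `min_lr` are assumptions.

```python
class ReduceOnPlateau:
    """Divide the learning rate by 10 once more than `patience` epochs pass without improvement."""

    def __init__(self, lr=0.1, factor=0.1, patience=2, min_lr=1e-4):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Call once per epoch with the validation loss; returns the current learning rate."""
        if loss < self.best:
            self.best, self.bad_epochs = loss, 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


scheduler = ReduceOnPlateau()
# Hypothetical validation losses: improvement, then a plateau, then improvement again.
history = [scheduler.step(loss) for loss in (1.0, 0.8, 0.8, 0.8, 0.8, 0.7)]
print(history)
```

The learning rate stays at 0.1 until the loss has failed to improve for three epochs in a row, then drops by a factor of 10.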




### Ray Hypertuning
Ray -> Key concepts -> config (types of options)

Hyperband: not intelligent, random hypertuning.

Pick two parameters to compare against each other, for example:
- layers vs hidden size
- optimizer (Adam, SGD, AdamW) vs learning rate
- units vs dropout
- units vs layers: increases depth or width
- GRU/LSTM vs layers: the difference between GRU and LSTM
- GRU/LSTM vs window: a longer window means more memory, and the LSTM has a more complex memory; possibly try this with different datasets

Run `hypertune.py`. The results can be analysed in the `03_ray` notebook.

## Transfer learning
`param.requires_grad = False`: freezes the model parameters, so that no gradients are computed for them and they are not trained along.

Features layer = linear layer: replace it with your own layer (or with two linear layers with dropout).

Normalize: use the mean and standard deviation of the dataset the ResNet was trained on.
