In [None]:
"""In deep learning, L2 regularization is the most common choice: the penalty (ʎ/2n)*∑Wi^2
is added to the loss function. Caution: only the weights are penalized, not the biases.

To use it, pass the kernel_regularizer argument to the hidden layers, e.g.
kernel_regularizer = tensorflow.keras.regularizers.l2(0.01)   (L2 penalty)
kernel_regularizer = tensorflow.keras.regularizers.l1(0.01)   (L1 penalty, as an alternative)
"""

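As a rough illustration of the penalty term in the note above, the following sketch computes (ʎ/2n)*∑Wi^2 for a handful of weights. The function name l2_penalty and the sample numbers are assumptions made up for this example; they are not part of Keras.

```python
def l2_penalty(weights, lam, n):
    """Return (lam / (2 * n)) * sum of squared weights; biases are left out."""
    return (lam / (2 * n)) * sum(w * w for w in weights)

# Illustrative values only: three weights, lam = 0.01, n = 10 training samples
weights = [0.5, -1.0, 2.0]
penalty = l2_penalty(weights, lam=0.01, n=10)
print(penalty)
```

This value would be added to the data loss before back-propagation; larger weights give a larger penalty.
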
In [None]:
"""Activation Functions: Also called transfer functions. An activation function is applied to the weighted sum of inputs plus bias of a perceptron; depending on the layer, the result serves either as input to the next layer or as the final output.
Without an activation function the network cannot solve non-linear problems, because then g(z) = z and the whole network collapses to a linear model.
Ideal Activation Function:
1. Non-linear
2. Differentiable (ReLU is not differentiable at 0)
3. Computationally inexpensive
4. Zero-centered (zero-centered, standardized inputs empirically give better results)
5. Non-saturating (a saturating function squeezes its output into a bounded range, e.g. sigmoid, which causes the vanishing gradient problem)

Sigmoid Activation Function: σ(z) = 1/[1+e^(-z)].
Pros:
1. Output lies in (0,1), so it is useful for binary classification.
2. Non-linear.
3. Differentiable: σ'(z) = [1-σ(z)][σ(z)]

Cons:
1. Saturation: the derivative is close to 0 outside roughly -6 to 6, so it should be used in the output layer only.
2. Output is not zero-centered.
3. Because the outputs are always positive, the gradients of all weights feeding a neuron in the next layer share the same sign (all positive or all negative); this restricts the update directions and slows convergence.
4. Computationally expensive (needs an exponential).

tanh Activation Function: f(x) = tanh(x) = [e^x - e^(-x)] / [e^x + e^(-x)]
f'(x) = 1 - [tanh(x)]^2
Pros:
1. Non-linear
2. Differentiable
3. Zero-centered output //Fixes this drawback of sigmoid

Cons:
1. Saturating function
2. Computationally expensive

ReLU Activation Function: max(0,x)
Pros:
1. Non-linear (considered over its whole domain, not just the f(x) = x part)
2. Non-saturating for positive inputs
3. Computationally inexpensive
4. Faster convergence

Cons:
1. Not differentiable at 0; by convention its derivative is taken as 0 for negative inputs and 1 for inputs greater than or equal to zero.
2. Output is not zero-centered; batch normalization is used to overcome this.
3. Dying ReLU Problem.
"""

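The formulas for sigmoid, tanh and ReLU given above can be checked numerically with a small sketch. It uses only the standard library; the helper names are chosen for this example, and the ReLU derivative follows the convention stated in the notes (1 at z = 0).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # sigma'(z) = sigma(z) * (1 - sigma(z))

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2    # f'(x) = 1 - tanh(x)^2

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention from the notes: 0 for z < 0, 1 for z >= 0
    return 1.0 if z >= 0 else 0.0

# Gradients of sigmoid and tanh shrink towards 0 at |z| = 6 (saturation),
# while the ReLU gradient stays at 1 for positive inputs.
for z in (-6.0, 0.0, 6.0):
    print(z, sigmoid_grad(z), tanh_grad(z), relu_grad(z))
```
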
In [None]:
"""Dying ReLU Problem:
Neurons may become dead (always output zero), and once dead they practically stay dead: their weights change only negligibly, if at all.
Cause: When the input to ReLU is negative, the output is zero and the derivative is also zero. In the chain rule one factor becomes zero, so the whole gradient becomes zero and the weights are no longer updated.
Reasons:
	1. High learning rate
	2. Large negative bias (from bias initialization or updates)

Solution:
	1. Set a low learning rate.
	2. Initialize the bias to a small positive value, usually 0.01.
	3. Don't use ReLU; use one of its variants instead.
Linear ReLU variants:
1. Leaky ReLU
2. Parametric ReLU: max(ɑz, z)

Non-Linear ReLU:
1. ELU (Exponential Linear Unit)
2. SELU (Scaled exponential Linear Unit)

A]. Leaky ReLU: max(0.01z, z), i.e. 0.01z for z<0 and z for z>=0
B]. Parametric ReLU: max(ɑz, z), i.e. ɑz for z<0 and z for z>=0; ɑ is a learnable parameter (trained along with the weights).
Pros:
	1. Non-saturated, unbounded
	2. Easily computed
	3. No dying ReLU Problem
	4. Close to 0 centred (Both +ve and -ve value)

C] ELU(x) = x for x>=0 ; ɑ(e^x - 1) for x<0
Pros:
	1. Non-saturating for positive inputs, unbounded above
	2. Continuous, and differentiable everywhere if ɑ=1.
	3. No dying ReLU Problem
	4. Close to 0 centred (Both +ve and -ve value)

Cons:
	1. Computationally expensive

D] SELU(x) = ʎx for x>=0 ; ʎɑ(e^x - 1) for x<0
ʎ ≈ 1.05, ɑ ≈ 1.6732632423543772848170429916717
Pros:
	1. Non-saturating for positive inputs, unbounded above
	2. Continuous (but not differentiable at x=0, because ɑ is fixed here and is not equal to 1).
	3. No dying ReLU Problem
	4. Close to 0 centred (Both +ve and -ve value)
	5. Self-normalizing //The best advantage

Cons:
	1. Computationally expensive.
	2. Relatively newer, less adopted.
"""

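The ReLU variants listed above can be written down in a few lines. This is a minimal sketch using the standard library; the constant names are made up for the example, and the SELU constants are rounded approximations (the notes round ʎ to 1.05).

```python
import math

SELU_LAMBDA = 1.0507   # approximate scale factor
SELU_ALPHA = 1.6733    # approximate alpha

def leaky_relu(z, slope=0.01):
    """slope * z for z < 0, z otherwise; Parametric ReLU is the same with a learned slope."""
    return z if z >= 0 else slope * z

def elu(x, alpha=1.0):
    """x for x >= 0, alpha * (e^x - 1) for x < 0."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def selu(x):
    """Scaled ELU with the fixed constants above."""
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

for x in (-2.0, 0.0, 2.0):
    print(x, leaky_relu(x), elu(x), selu(x))
```

Unlike plain ReLU, all three return a non-zero value for negative inputs, so the gradient does not vanish completely there.
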
In [None]:
"""Weight Initialization:
Problems caused by poor initialization:
	1. Vanishing Gradient
	2. Exploding Gradient
	3. Slow Convergence

Why should weights not be zero initially?
ReLU, tanh:
1. Training will not take place, as the first outputs will be zero and so will the derivatives.
Sigmoid:
1. All neurons in a layer behave like the same neuron, so the layer acts as a single perceptron.

Why should weights not all be the same non-zero value?
1. All neurons of a layer will act like the same neuron (they receive identical updates), so non-linear relations will not be captured.

Problems in random initialization with small or large weights?
Small weights, e.g. np.random.randn(m,n)*0.01, lead to the vanishing gradient problem.
Intensity: tanh (extreme problem) > sigmoid (still a significant problem) > ReLU (very slow convergence)
Large weights cause the saturation problem in tanh and sigmoid.

Heuristics (jugaad, i.e. practical workarounds)
Practical Solution:
Xavier/Glorot Initialization: Normal, Uniform //Preferred for tanh
He Initialization: Normal, Uniform //Preferred for ReLU

A] Xavier Normal:
	np.random.randn(m,n) * ɑ
	The variance should be 1/n, where n is the number of inputs to a particular node.
	ɑ = standard deviation = √(1/n)
	In technical terms, np.random.randn(m,n) * √(1/fan_in),
	or, as in the original Glorot formulation, np.random.randn(m,n) * √[2/(fan_in+fan_out)]

B] He Normal:
	Here, basically the factor becomes, √(2/fan_in)

C] Xavier Uniform Distribution:
Draw the weights uniformly from
[-limit, limit], where limit = √[6/(fan_in+fan_out)]

D] He Uniform Distribution:
Draw the weights uniformly from
[-limit, limit], where limit = √[6/(fan_in)]
"""

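The four initialization schemes above differ only in the scale they use. The sketch below draws a fan_in x fan_out weight matrix with the standard library instead of NumPy; the function name init_weights and the scheme strings are assumptions for this example and are not Keras identifiers. For Xavier normal it uses the √(1/fan_in) form from the notes.

```python
import math
import random

def init_weights(fan_in, fan_out, scheme, rng):
    """Return a fan_in x fan_out matrix (list of lists) drawn with the given scheme."""
    normal_std = {
        "xavier_normal": math.sqrt(1.0 / fan_in),
        "he_normal": math.sqrt(2.0 / fan_in),
    }
    uniform_limit = {
        "xavier_uniform": math.sqrt(6.0 / (fan_in + fan_out)),
        "he_uniform": math.sqrt(6.0 / fan_in),
    }
    if scheme in normal_std:
        std = normal_std[scheme]
        return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
    if scheme in uniform_limit:
        limit = uniform_limit[scheme]
        return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]
    raise ValueError(f"unknown scheme: {scheme}")

rng = random.Random(0)   # fixed seed so the run is repeatable
w = init_weights(fan_in=200, fan_out=50, scheme="he_normal", rng=rng)
flat = [x for row in w for x in row]
sample_std = math.sqrt(sum(x * x for x in flat) / len(flat))
# The sample standard deviation should be near sqrt(2/200) = 0.1
print(len(w), len(w[0]), round(sample_std, 3))
```
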
In [None]:
"""Pass the initializer as an argument to the layer, e.g. Dense(units, kernel_initializer = 'he_normal')
The default is 'glorot_uniform'.
As a rule of thumb, uniform is better for shallow and wider networks,
while normal is better for deep networks.
"""

In [None]:
"""Batch Normalization: An algorithmic method that makes the training of deep neural networks faster and more stable.
It consists of normalizing activation vectors from hidden layers using the mean and variance of the current batch. This normalization step is applied right before (or right after) the non-linear function.
Covariate Shift:
The model is trained on data whose input distribution differs from what it sees later: the training data is skewed towards particular features and does not cover the general case, which leads to bad results on test data.
For example, training to identify roses when X_train contains only red roses, but X_test contains yellow, white and other coloured roses.

//Strictly speaking, internal covariate shift is not directly related to covariate shift; the authors of batch normalization named it by analogy: there the input distribution changes across datasets, here it changes inside the network.
Internal Covariate Shift:
Because the distribution of each layer's inputs keeps changing during training, it is difficult for deeper layers to learn a stable mapping. It is like aiming for AIR-1 (All India Rank 1, i.e. the top rank) without knowing which exam you are going to take.

So, we apply batch normalization to ensure that at least some properties (mean and spread) of each layer's inputs stay the same.
The most popular choice is to normalize before applying the activation function.
It is done by
z_norm = [z-μ]/[σ + ∈], where ∈ is a negligibly small constant that keeps the denominator from becoming zero when σ=0.
µ is the mean and σ is the standard deviation of z over the current batch.

After this normalization, z_out = Y*z_norm + B, where Y and B are learnable parameters; in Keras, initially Y = 1 and B = 0.
Y and B are specific to each neuron.
This may seem contradictory: if Y = σ+∈ and B = µ, the normalization is undone and is ultimately of no use.
But this gives the network the flexibility to learn whatever scale and shift suit it best.

During training, we calculate the mean and std of the activation-function inputs over each batch.
In addition, an EWMA (exponentially weighted moving average) of these batch means and stds is maintained, so that they are retained for use at prediction time.
It is roughly an average of the means and stds of all batches seen so far.
So, in batch normalization, each neuron has 2 learnable parameters and 2 non-learnable parameters.
Non-learnable in the sense that they are not updated by gradients in back-propagation; instead they are calculated on the go.

Advantages:
1. Stable: hyperparameter tuning becomes easier, as a wider range of values works.
2. Faster: a higher learning rate becomes acceptable, because vanishing and exploding gradients are much less of an issue when the data is normalized at each step.
3. Regularization: the mean and std are estimated per batch and so vary from batch to batch; this adds noise, which itself acts as a regularizer.
4. Reduces weight initialization problems.

To apply it, add a BatchNormalization layer between hidden layers:
model.add(Dense())
model.add(BatchNormalization())
model.add(Dense())
model.add(BatchNormalization())
model.add(Dense())

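μnew= ([1−α] * μold) + (α * μbatch)
σnew= ([1−α] * σold) + (α * σbatch)

where α is the momentum parameter (the weight given to the current batch; note that the Keras `momentum` argument is instead the weight on the old value, i.e. 1−α).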
"""

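To make the batch normalization steps above concrete, here is a minimal sketch for a single neuron over one batch, written with the standard library. It follows the (z − μ)/(σ + ∈) form used in the notes; the function names, the eps value and the sample batch are assumptions for this example, and this is not how Keras implements the layer internally.

```python
import math

def batch_norm_train(z_batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one neuron's pre-activations over a batch, then scale and shift."""
    mu = sum(z_batch) / len(z_batch)
    var = sum((z - mu) ** 2 for z in z_batch) / len(z_batch)
    sigma = math.sqrt(var)
    out = [gamma * (z - mu) / (sigma + eps) + beta for z in z_batch]
    return out, mu, sigma

def update_moving(old, batch_value, alpha=0.01):
    """EWMA of a batch statistic; alpha is the weight on the current batch."""
    return (1 - alpha) * old + alpha * batch_value

# gamma and beta start at 1 and 0 (the learnable pair);
# the moving mean and std are the non-learnable pair.
out, mu, sigma = batch_norm_train([1.0, 2.0, 3.0, 4.0])
moving_mean = update_moving(0.0, mu, alpha=0.1)
moving_std = update_moving(1.0, sigma, alpha=0.1)
print(out, mu, sigma, moving_mean, moving_std)
```
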
In [None]:
"""Optimizers:
Training a deep learning model is an optimisation problem: we minimize the loss between y_actual and y_pred.
For this we use the gradient descent optimizer with its 3 variants (batch, stochastic and mini-batch).
But choosing the learning rate is hard. Some people use a learning rate scheduler, but the schedule depends heavily on the data, so it performs poorly on test or new data.
Also, the same learning rate is used for every parameter (weight), which means we move with the same speed in every direction, which is not fair.
Local Minimum: also a problem, as training sometimes gets stuck in a minimum that is not the global (best) one.
Saddle Point Problem: here the gradients become very small, leading to slow convergence or a stall in training.
"""

In [None]:
"""Common Optimizers:
1. Momentum
2. Adagrad
3. NAG
4. RMSProp
5. Adam
"""

In [None]:
"""Optimizers:
Exponentially Weighted Moving Average (EWMA):
Used in time series forecasting, financial forecasting, signal processing and deep learning optimizers.
Vnew= ([1−α] * θt) + (α * Vold), where θt is the current data point.
Here, α = 0.9 generally, and initially V1 = V0 or simply 0, but V0 is more correct.
Here, 1/[1−α] is roughly the number of days over which the EWMA acts as an average.
Therefore, the smaller the α, the more closely it sticks to the data.
Substituting Vold repeatedly into Vnew multiplies older terms by more and more factors of α, so older points get lower and lower weight.
It is built into pandas: df['meantemp'].ewm(alpha=0.9).mean() (note that pandas' alpha is the weight on the current data point, i.e. 1−α in the notation above).

SGD with Momentum:

Non-Convex Optimization: Problems are, 
High Curvature (Unstable)
Consistent Gradient (Saddle)
Noisy (Local minima)
"""

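The EWMA idea from the notes above can be sketched in a few lines. This example assumes the recurrence V_t = α·V_(t−1) + (1−α)·θ_t, where θ_t is the current data point and the starting value is 0; the function name ewma and the sample series are made up for illustration.

```python
def ewma(values, alpha=0.9, v0=0.0):
    """Return the running exponentially weighted moving average of values."""
    v = v0
    out = []
    for x in values:
        v = alpha * v + (1 - alpha) * x   # old average keeps weight alpha
        out.append(v)
    return out

series = [10.0] * 50                      # constant series, for illustration
smooth = ewma(series, alpha=0.9)
# Starting from 0, the average climbs towards 10; with alpha = 0.9 it behaves
# roughly like an average over the last 1 / (1 - 0.9) = 10 points.
print(smooth[0], smooth[9], smooth[-1])
```

A smaller alpha makes the curve follow the data more closely, as stated in the notes.
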
In [None]:
"""Wt+1 = Wt - ŋ∇Wt //Vanilla Gradient Descent
Wt+1 = Wt - Vt //SGD with Momentum
Vt =  ß*Vt-1 + ŋ∇Wt //Definition of Vt using EWMA
0<ß<1, here ß is generally 0.9
SGD with momentum takes longer steps when successive gradients agree.
As an analogy: if 4 people on your path all point you in the same direction, you move ahead faster than if 3 people point one way and one person points another.

Also, 1/(1-ß) gives roughly the number of past steps whose average you are considering.
If ß=1, it reaches a dynamic equilibrium and keeps oscillating forever.
For ß=0, it is simply plain SGD.

It escapes local minima because its velocity produces large updates.
But that same velocity makes it oscillate even near the global minimum.
"""

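The momentum update above can be tried on a toy problem. The sketch below minimizes f(w) = w^2 (gradient 2w), which is an assumption chosen only because its minimum at w = 0 is known; the function name and step counts are likewise illustrative.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=100):
    """Run Vt = beta*Vt-1 + lr*grad(Wt); Wt+1 = Wt - Vt, and return the final weight."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + lr * grad_fn(w)   # velocity accumulates past gradients
        w = w - v
    return w

def grad(w):
    return 2.0 * w                        # gradient of f(w) = w^2

print(sgd_momentum(grad, w0=5.0, steps=300))            # ends close to 0
print(sgd_momentum(grad, w0=5.0, beta=0.0, steps=1))    # beta = 0 is plain gradient descent
```

On this problem the iterate overshoots zero and oscillates with shrinking amplitude, which is the oscillation near the minimum mentioned in the notes.
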

In [None]:
"""Nesterov Accelerated Gradient(NAG):
Wla = Wt - ß*Vt-1
Vt = ß*Vt-1 + ŋ∇Wla
Wt+1 = Wt - Vt
la: look ahead

Here, we first move using the previous velocity, then calculate the gradient at that look-ahead point, and then update accordingly.
It dampens the oscillations and thus reduces the number of epochs needed, but it may again get stuck in a local minimum, which was the fundamental problem that momentum solved.

Syntax (default values shown; for NAG set momentum to a value such as 0.9 and nesterov=True):
personalopt = tf.keras.optimizers.SGD(
	learning_rate=0.01, momentum = 0.0, nesterov=False, name="SGD"
)
Pass this object as the optimizer argument of model.compile, i.e. optimizer=personalopt (without quotes).
"""

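The three NAG equations above translate almost line by line into code. As a hedged illustration, this sketch again uses the toy function f(w) = w^2 (an assumption for the example, not from the notes) so that the result can be checked against the known minimum at w = 0.

```python
def nag(grad_fn, w0, lr=0.1, beta=0.9, steps=100):
    """Nesterov accelerated gradient on a single weight; returns the final weight."""
    w, v = w0, 0.0
    for _ in range(steps):
        w_la = w - beta * v                  # look-ahead point
        v = beta * v + lr * grad_fn(w_la)    # gradient taken at the look-ahead point
        w = w - v
    return w

def grad(w):
    return 2.0 * w                            # gradient of f(w) = w^2

print(nag(grad, w0=5.0, steps=100))           # ends close to 0
```

With the same learning rate and ß, the oscillations on this toy problem die out faster than with plain momentum.
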
In [None]:
"""Adaptive Gradient (AdaGrad):
Useful when:
1. Input features have different scales.
2. Features are sparse, meaning most values are zero.
Sparse data creates the elongated bowl problem.

In such a situation,
picture a cylinder cut in half along its length (a long trough), with the minimum in a pit midway along the length.
In vanilla GD and momentum, if one dimension has a steep slope, the optimizer descends along it first and only then starts moving along the other parameter's slope. This gives a longer path: starting from a corner of the cross-section, it first reaches the bottom of the trough and then moves slowly towards the middle.

In AdaGrad we keep a different learning rate for each parameter, such that a parameter with small gradients gets a bigger learning rate.
Wt+1 = Wt - (ŋ∇Wt)/[√(Vt+∈)],  ∈ is just a small number so that denominator never becomes zero.
Vt = Vt-1 + (∇Wt)^2; the gradient is squared so that the accumulated value is never negative and stays differentiable.
Similarly for biases.

AdaGrad is not meant for complex deep neural networks (non-convex optimization): Vt keeps growing, so the effective learning rate shrinks towards zero and training stalls before reaching the global minimum.
It can be used for simple convex problems such as linear regression.
"""

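A single AdaGrad update, following the two equations above, might look like the sketch below. The function name, default learning rate and sample gradients are assumptions for this example.

```python
import math

def adagrad_step(w, v, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update for a single parameter; returns the new (w, v)."""
    v = v + grad ** 2                          # accumulated squared gradients
    w = w - lr * grad / math.sqrt(v + eps)     # per-parameter scaled step
    return w, v

# Two parameters with very different gradient scales take the same first step,
# which is how AdaGrad evens out the elongated bowl.
w_small, v_small = adagrad_step(1.0, 0.0, grad=2.0)
w_large, v_large = adagrad_step(1.0, 0.0, grad=200.0)
print(w_small, w_large)

# Because v only grows, later steps get smaller even for the same gradient.
w_next, v_next = adagrad_step(w_small, v_small, grad=2.0)
print(w_small - w_next)
```
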
In [None]:
"""RMSProp: Root Mean Square Propagation
Wt+1 = Wt - (ŋ∇Wt)/[√(Vt+∈)],  ∈ is just a small number so that denominator never becomes zero.
Vt = ßVt-1 + (1-ß)[(∇Wt)^2],   ß is generally 0.95.
So, here old gradients no longer get that much weight, and Vt does not keep growing as it does in AdaGrad.
Disadvantage: Generally no major disadvantages; it is a competitor to Adam.
"""

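For comparison with AdaGrad, a single RMSProp update following the equations above could be sketched as follows. The function name and the learning rate of 0.01 are assumptions for this example; ß = 0.95 is the value given in the notes.

```python
import math

def rmsprop_step(w, v, grad, lr=0.01, beta=0.95, eps=1e-8):
    """One RMSProp update for a single parameter; returns the new (w, v)."""
    v = beta * v + (1 - beta) * grad ** 2      # EWMA of squared gradients
    w = w - lr * grad / math.sqrt(v + eps)
    return w, v

w, v = rmsprop_step(1.0, 0.0, grad=2.0)
print(w, v)

# With a constant gradient, v settles near grad^2 instead of growing without bound.
for _ in range(500):
    w, v = rmsprop_step(w, v, grad=2.0)
print(v)
```
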
In [None]:
"""Adam: Adaptive Moment Estimation
The most popular optimizer, used extensively in RNNs, CNNs and ANNs.
Ideas Used:
1. Momentum
2. Learning rate decay (adaptive per-parameter learning rate)

mt and Vt are called moments.
Wt+1 = Wt - (ŋ * mt^)/[√(Vt^+∈)],  ∈ is just a small number so that denominator never becomes zero.
Where, mt = ß1mt-1 + (1-ß1)∇Wt : Momentum
Vt = ß2Vt-1+ (1-ß2)(∇Wt)^2     : Learning rate scaling

Bias Correction: t is the step number; it appears as an exponent in the denominator, not in the numerator.
mt^ = mt/(1-ß1^t),
Vt^ = Vt/(1-ß2^t)
ß1 = 0.9, ß2 = 0.999 by default; both are configurable as hyperparameters in Keras.
Bias correction eliminates the effect of the first few steps, where mt and Vt are biased towards their initial value of zero; it does so by rescaling mt and Vt at every step according to the expressions above.
"""

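Putting momentum, the squared-gradient average and bias correction together, one Adam update could be sketched as below. The function name is made up for this example, and the default values (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) mirror the Keras defaults and are assumptions of this sketch.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: momentum
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / math.sqrt(v_hat + eps)
    return w, m, v

# At t = 1 the raw moments are tiny (0.2 and 0.004 for grad = 2), but the
# corrected ones are 2 and 4, so the first step has size of about lr.
w, m, v = adam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(w, m, v)
```

Without the bias correction, the first step here would be scaled by 0.2/√0.004 ≈ 3.2 instead of 1.
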
In [None]:
"""For Hyperparameter tuning, use libraries like Keras Tuner"""