# ML broad categories
    - supervised: output is given (X->Y)
        - Classification: trying to predict a label/category.
            - Categorical variables: (gender, degree, city, etc)
                - one-hot encoding gives good interpretability because each weight tells us how much that category
                   value affects the output
                   if a categorical variable has K different values, X will have K columns
                   one-hot encoding of Degree: bachelors=[1,0,0] masters=[0,1,0] PhD=[0,0,1]
                   one-hot encoding in a Salary model: y = 50000 - 5000x_1 + 5000x_2 (female: x_1 = 1, male: x_2 = 1)
                              or, with K-1 encoding: y = 45000 + 10000x (male: x = 1, female: x = 0)
                - K-1 encoding: we can save 1 column out of K columns by letting all 0s represent one category value
                   this is less interpretable because the effect of that category is absorbed into the bias term
                - Both salary models predict E(y|female)=45000 and E(y|male)=55000
        - Regression: predicting a real-valued number or vector.
            - Continuous values: (age, years of experience, GPA)
    - unsupervised: no output - just trying to learn the structure of the data (X)
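
The one-hot and K-1 encodings described above can be sketched in a few lines of NumPy. This is a minimal illustration; the helper name `one_hot` and the sample data are assumptions made for the example, not part of the original notes.

```python
import numpy as np

def one_hot(values, categories):
    """Return an (N, K) 0/1 matrix with one column per category (hypothetical helper)."""
    index = {c: k for k, c in enumerate(categories)}
    X = np.zeros((len(values), len(categories)))
    for n, v in enumerate(values):
        X[n, index[v]] = 1.0
    return X

categories = ["bachelors", "masters", "PhD"]   # K = 3 category values
degrees = ["bachelors", "PhD", "masters"]      # three example observations

X_full = one_hot(degrees, categories)  # K columns: one weight per category
X_drop = X_full[:, 1:]                 # K-1 encoding: all zeros now means "bachelors"
```

With `X_drop`, the effect of the dropped category is carried by the bias term, which is why the K-1 form is harder to read off directly.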

* Gradient Descent: Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.

* Learning rate: The size of these steps is called the learning rate. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.

* Cost function: A loss function tells us “how good” our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradients. The slope of this curve tells us how to update our parameters to make the model more accurate.
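
To make the effect of the learning rate concrete, here is a small sketch that minimizes the toy function f(x) = x^2 (gradient 2x, minimum at x = 0). The function name `gradient_descent_1d` and the chosen rates are illustrative assumptions.

```python
def gradient_descent_1d(grad, x0, learning_rate, steps):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

def grad(x):
    return 2.0 * x  # derivative of f(x) = x**2

x_small = gradient_descent_1d(grad, x0=5.0, learning_rate=0.01, steps=50)  # slow: still far from 0
x_good = gradient_descent_1d(grad, x0=5.0, learning_rate=0.1, steps=50)    # converges close to 0
x_large = gradient_descent_1d(grad, x0=5.0, learning_rate=1.1, steps=50)   # overshoots and diverges
```

For this particular function each step multiplies x by (1 - 2 * learning_rate), so any rate above 1.0 makes the iterate grow instead of shrink.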

* Tools: Before you start using deep learning libraries, you need prior knowledge of the following:

    * `Numpy`: the core of the numerical stack in Python. Numpy is a library for linear algebra and probability with the Numpy array as its central object
    * `Scipy`: adds functionality for statistics, signal processing, and computer vision: e.g. PDF, CDF, standard normal, convolution ...
    * `Matplotlib`: line charts, scatter plots, histograms, plotting images, etc
    * `Pandas`: useful for data that is structured like a table (CSV, Excel, etc)

## Linear Regression
Linear regression finds the line of best fit. It is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It’s used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog). There are two main types:

* `Simple regression`: Simple linear regression uses traditional slope-intercept form, where $m$ and $b$ are the variables our algorithm will try to “learn” to produce the most accurate predictions. $x$ represents our input data and $y$ represents our prediction. $y = mx + b$
    * Step-by-step: to run gradient descent we need a cost function. There are two parameters in our cost function we can control: $m$ (weight) and $b$ (bias). Since we need to consider the impact each one has on the final prediction, we need to use partial derivatives. We calculate the partial derivatives of the cost function with respect to each parameter and store the results in a gradient.
    $f(m,b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (m x_i + b))^2$

* `Multivariable regression`: A more complex, multi-variable linear equation might look like this, where $w$ represents the coefficients, or weights, our model will try to learn. $f(x,y,z) = w_1 x + w_2 y + w_3 z$
    * The variables $x, y, z$ represent the attributes, or distinct pieces of information, we have about each observation. For sales predictions, these attributes might include a company’s advertising spend on radio, TV, and newspapers. $Sales = w_1 Radio + w_2 TV + w_3 News$

* `Cost function`: We need a cost function so we can start optimizing our weights.
    * Let’s use MSE (L2) as our cost function. MSE measures the average squared difference between an observation’s actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.
    * Given our simple linear equation $y = mx + b$, we can calculate MSE as: $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - (m x_i + b))^2$
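
The MSE above and its two partial derivatives (with respect to m and b) can be written directly in NumPy. This is a sketch under the assumption of 1-D arrays `x` and `y`; the function names and the toy data are chosen for illustration only.

```python
import numpy as np

def mse(x, y, m, b):
    """Mean squared error of the line y = m*x + b."""
    residual = y - (m * x + b)
    return np.mean(residual ** 2)

def mse_gradients(x, y, m, b):
    """Partial derivatives of the MSE with respect to m and b."""
    residual = y - (m * x + b)
    dm = np.mean(-2.0 * x * residual)  # d(MSE)/dm
    db = np.mean(-2.0 * residual)      # d(MSE)/db
    return dm, db

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # toy data generated by y = 2x

cost_at_start = mse(x, y, m=0.0, b=0.0)
grad_at_start = mse_gradients(x, y, m=0.0, b=0.0)
grad_at_optimum = mse_gradients(x, y, m=2.0, b=0.0)  # both partials vanish here
```

Both partial derivatives are negative at the starting point, so a gradient descent step increases m and b, moving the line toward the data.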

* `Normalization`: As the number of features grows, the gradient takes longer to compute, and features on very different scales slow convergence further.
We can speed this up by “normalizing” our input data so that all values are within the same range.
This is especially important for datasets with high standard deviations or large differences in the ranges of the attributes.
Our goal now will be to normalize our features so they are all in the range -1 to 1.
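
One way to bring every feature into the range -1 to 1 is min-max scaling per column. The notes do not say which scaling they intend, so treat the formula below (and the helper name `scale_to_unit_range`) as an assumption.

```python
import numpy as np

def scale_to_unit_range(X):
    """Linearly rescale each column of X to the interval [-1, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against constant columns (zero range); they map to -1.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return 2.0 * (X - col_min) / span - 1.0

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
X_scaled = scale_to_unit_range(X)
```

After scaling, both columns span the same interval even though the raw ranges differ by two orders of magnitude.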

* Addressing `over-fitting` using `L1 (LASSO Regression)` and `L2 Regularization (RIDGE Regression)` and `Elasticnet (L1 and L2)` :
    * These methods help prevent over-fitting, by not fitting to noise and outliers
              
      J = sum(y_n-y_hat_n)^2
      J_LASSO = J + lambda.|w|
      J_RIDGE = J + lambda.|w|^2
      J_ELASTIC = J + lambda.|w| + lambda.|w|^2
    
      - L1 regularization: accomplishes this by choosing the most important features by forcing some weights to be zero
            It encourages a sparse solution, forces weights to be zero (few non-zero w)
      - L2 regularization: accomplishes this by making the assertion that none of the weights are extremely large
            It encourages small weights ( all w values close to 0 not exactly 0) 
      - L2 penalty is `quadratic` and L1 is an `absolute` function
        - In the quadratic case the derivative -> 0 as w -> 0: if w is already small, further gradient descent won't change it much
        - In the absolute case the derivative is always +/-1 (0 at w=0), so no matter where w is, it falls at a
                       constant rate, and when it reaches 0 it stays there
      - Elasticnet regularization: It's possible to include both L1 and L2 simultaneously (`Elasticnet`)
      - regularization reduces over-fitting in neural networks because as lambda gets bigger the weights get smaller, so z_l = (w_l . a_l-1) + b_l stays close to zero,
      - and activation functions such as g(z)=tanh(z) or sigmoid(z) are almost linear around zero; if every layer acts linearly then the whole network acts like a linear model
      - so even a deep network behaves like a linear function and is unable to fit highly non-linear decision boundaries, which limits over-fitting
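
The four cost functions and the two penalty gradients discussed above can be sketched as follows. The function names are hypothetical, and a single `lam` is used for both elastic-net terms to mirror the formulas in these notes.

```python
import numpy as np

def regularized_costs(X, Y, w, lam):
    """Return the plain, LASSO, ridge, and elastic-net costs for weights w."""
    residual = Y - X.dot(w)
    J = residual.dot(residual)   # sum of squared errors
    l1 = np.sum(np.abs(w))       # |w|   (L1 norm)
    l2 = w.dot(w)                # |w|^2 (squared L2 norm)
    return {
        "plain": J,
        "lasso": J + lam * l1,
        "ridge": J + lam * l2,
        "elastic": J + lam * l1 + lam * l2,
    }

def penalty_gradients(w, lam):
    """Gradients of the L1 and L2 penalty terms with respect to w."""
    grad_l1 = lam * np.sign(w)   # constant magnitude, 0 exactly at w = 0
    grad_l2 = 2.0 * lam * w      # shrinks toward 0 as w shrinks
    return grad_l1, grad_l2

X = np.eye(2)
Y = np.array([1.0, 2.0])
w = np.array([1.0, -2.0])
costs = regularized_costs(X, Y, w, lam=0.5)
grad_l1, grad_l2 = penalty_gradients(np.array([0.001, -3.0]), lam=0.5)
```

Note how the L1 gradient has the same magnitude for the tiny weight and the large one, while the L2 gradient is almost zero for the tiny weight. That difference is what drives L1 toward exact zeros.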

* L1 and L2 as a loss function:
        - L1-norm loss function (least absolute error) is robust, but gives an unstable solution and possibly multiple solutions
            - S = sum(|y_i - y_i_hat|) for i in [1:n]
            - L1 is more robust and resistant to outliers in data compared to L2. It lets outliers be largely ignored, unlike the L2 loss, which adjusts the model for outliers at the expense of many common examples: since L2 squares the error, the model sees a much larger error than with L1, so it is much more sensitive to outliers
        - L2-norm loss function (least squares error) is not very robust, but gives a stable solution, and always one solution
            - S = sum((y_i - y_i_hat)^2) for i in [1:n]
            - Unlike L1, the L2 loss is stable, because for any small adjustment of a data point the regression line moves only slightly
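
A quick way to see the outlier sensitivity is to fit the simplest possible model, a single constant prediction, under each loss. The data and the grid search below are illustrative assumptions: the L1 loss is minimized by the median and the L2 loss by the mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one large outlier
candidates = np.linspace(0.0, 100.0, 10001)  # candidate constant predictions, step 0.01

# Total loss of each candidate prediction under the two loss functions.
errors = y[:, None] - candidates[None, :]
l1_loss = np.abs(errors).sum(axis=0)
l2_loss = (errors ** 2).sum(axis=0)

best_l1 = candidates[np.argmin(l1_loss)]  # lands on the median, 3
best_l2 = candidates[np.argmin(l2_loss)]  # lands on the mean, 22
```

The single outlier drags the L2 solution far from the bulk of the data, while the L1 solution stays with the majority.
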
* `weight initialization (why divide w by sqrt(D))`:
            * Poor weight initialization can lead to poor convergence of your loss per iteration,
              your loss might even explode to infinity as a result of weights and their variance getting too big
              we might be able to avoid the exploding weights by making the variance smaller by
              initializing weights w to have mean 0 and variance 1/D as: `w = np.random.randn(D) / np.sqrt(D)`
              since for independent zero-mean w and x we know `var(y) = var(w_1)var(x_1)+ ...+var(w_D)var(x_D) = D.var(w)` as `var(x)=1`
              therefore if we want `var(y)=1`, we must have `var(w) = 1/D` and hence `sd(w)=1/sqrt(D)`

             * This isn't the only way to initialize the weights, but the same general theme applies
                - e.g. He Normal, Glorot Normal, or multiply by a small number like 0.01
        - Below, `D` is the input dimensionality and `randn()`
          draws samples from the standard normal N(0,1) distribution
        - w = np.random.randn(D) / np.sqrt(D)
        - so it means we want w to have mean 0 and variance 1/D
          because randn()*C == sample from N(0,C^2)
          if we don't divide by np.sqrt(D) the loss can explode because the weights are too large
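
The variance argument can be checked empirically. The sketch below assumes standardized inputs (each feature drawn from N(0,1)) and uses a seeded generator so the run is repeatable; the sizes `N` and `D` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 500
X = rng.standard_normal((N, D))                 # inputs with var(x) = 1

w_scaled = rng.standard_normal(D) / np.sqrt(D)  # var(w) = 1/D
w_raw = rng.standard_normal(D)                  # var(w) = 1

var_scaled = np.var(X.dot(w_scaled))  # stays close to 1
var_raw = np.var(X.dot(w_raw))        # grows to roughly D
```

With unscaled weights the output variance grows with the input dimensionality, which is what feeds the exploding loss described above.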

* `Sample standardization (or Normalization)`: make all inputs 0 mean, 1 standard deviation
    - subtract the mean and divide by the standard deviation.
    - To standardize z = (x - mu)/sigma  `here z has mean 0, sd 1`
    - Inverse transform x = z.sigma + mu `here x has mean mu, sd sigma`
    - Normalizing the training set leads to faster training.
    - if you use a particular mean and standard deviation to normalize the training set, use the same values to normalize the test/dev set
    - if you don't normalize your data the cost function looks very elongated, because parameters that correspond to features with different ranges are on proportionally different scales
    - whereas if you do normalize the features the loss function looks more symmetric, so gradient descent needs fewer steps to find the minimum and you can take bigger steps
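
A minimal sketch of standardization that reuses the training statistics on the test set, as recommended above. The helper names and the toy arrays are assumptions for illustration.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training set only."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X, mu, sigma):
    return (X - mu) / sigma

def inverse_standardize(Z, mu, sigma):
    return Z * sigma + mu

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[2.0, 20.0]])

mu, sigma = fit_standardizer(X_train)
Z_train = standardize(X_train, mu, sigma)
Z_test = standardize(X_test, mu, sigma)          # same mu and sigma as for training
X_back = inverse_standardize(Z_train, mu, sigma)  # recovers the original values
```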

* `cross-validation`: this is another method to measure your generalization error
   we split the entire dataset into k parts and for k iterations
   we treat the ith part as the test set, for i in [1,k], and the rest as the training set
   we can then find the mean and standard deviation of the k different errors as a measure of
   how accurate our model is and how confident we can be in that measure
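
The k-fold procedure can be sketched with plain NumPy. This example assumes a linear model fitted with `np.linalg.lstsq` and mean squared error as the metric; the function name `k_fold_errors` and the synthetic data are hypothetical.

```python
import numpy as np

def k_fold_errors(X, Y, k, seed=0):
    """Return the test MSE of a least-squares fit for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # shuffle, then split into k parts
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.lstsq(X[train_idx], Y[train_idx], rcond=None)[0]
        residual = Y[test_idx] - X[test_idx].dot(w)
        errors.append(np.mean(residual ** 2))
    return np.array(errors)

rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.standard_normal((30, 2))])
Y = X.dot(np.array([1.0, 2.0, 3.0]))  # noiseless linear target

errors = k_fold_errors(X, Y, k=5)
mean_error, std_error = errors.mean(), errors.std()
```

Because the toy target is exactly linear, every fold error is essentially zero; on real data the mean and standard deviation summarize accuracy and its stability.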

* `Dummy Variable Trap`: It happens when we need to solve (X_T X)^-1 when dealing with one-hot encoding for categorical variables
    (X_T X) is not invertible because the columns of X are linearly dependent:
    X has a column of 1s (for the bias) and the one-hot-encoded columns of each row also sum to 1,
    meaning that one column is a linear combination of other columns.

* `Multicollinearity`: when one column is a linear combination of other columns it is also called Multicollinearity.
   IN GENERAL YOU CAN'T GUARANTEE THAT YOUR DATA ISN'T CORRELATED. Image data has strong correlations.
   Therefore gradient descent is the preferred, most general solution, and it is used throughout deep learning
   

* Ways to deal with the `Dummy Variable Trap`:
    * Statisticians suggest using the `K-1 approach` instead (less interpretable, as one category is absorbed into the bias)
    * `L2 regularization`: X_T X is singular but (lambda.I + X_T X) is not, so we can invert it!
            * inverting a singular matrix is the matrix equivalent of dividing by 0, hence
               adding lambda.I is the equivalent of adding a small number lambda to the denominator
    * `Remove` the column of all 1s, so there is `no bias`
    * `Gradient descent`: (the most general method, used throughout deep learning, since linear regression
    is one of the few settings where we can find a closed form solution for the weights)
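
The trap and two of the fixes can be verified numerically by checking matrix ranks. The small design matrix below is made up for illustration.

```python
import numpy as np

ones = np.ones((4, 1))                 # bias column
one_hot = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])  # K = 3 one-hot columns

# Bias + full one-hot: the one-hot columns sum to the bias column.
X = np.hstack([ones, one_hot])
rank_trap = np.linalg.matrix_rank(X.T.dot(X))      # 3, but the matrix is 4x4 -> singular

# Fix 1: K-1 encoding (drop one one-hot column).
X_k1 = np.hstack([ones, one_hot[:, 1:]])
rank_k1 = np.linalg.matrix_rank(X_k1.T.dot(X_k1))  # 3 for a 3x3 matrix -> invertible

# Fix 2: L2 regularization adds lambda on the diagonal.
lam = 0.1
rank_ridge = np.linalg.matrix_rank(lam * np.eye(4) + X.T.dot(X))  # 4 -> invertible
```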
    
You can find a high level review of the use cases in this directory.

* Notations:
    * X is an NxD input matrix
    * X(i,j) is X_ij, the entry in the i_th row and j_th column
    * X(n,d) is X_nd, the entry in the n_th row and d_th column
    * N is number of samples
    * D is number of features
    * Y is an N-length Target/Output vector
    * Y_hat is an N-length prediction vector
    * Linear regression: y = ax + b 
    * Polynomial regression: add polynomial terms to linear regression 
    * Multiple Linear Regression y = wx + b
        * w: every time x increases by one y increases by w
        * b: when x is 0, y is b
        * x_i,w_i: every time x_i increases by one, and all other x's don't change, then y increases by w_i
    * Multiple Linear Regression in implementation: y = w_Tx (where w_T: transpose of matrix w)
    * J is the objective function
    * E is the Error or Cost function. The goal is to minimize the errors.
        * Squared Error: E = sum(y_i - y_hat_i)^2 for i in [1, N]
        * linear regression is the maximum likelihood solution to line of best fit
        * find mu, where all x_i are drawn from the same Gaussian distribution X ~ N(mu,sig^2), with mean mu and variance sig^2
        * prob of any single point x_i: p(x_i) = pdf of the gaussian = (1/sqrt(2.pi.sig^2)) exp(-0.5 (x_i - mu)^2/sig^2)
        * we can write the joint likelihood of all x_i's by multiplying them, because they are IID
        * IID: Independent and Identically distributed, so p(x_1,x_2,...x_N) = p(x_1)p(x_2)...p(x_N)
        * Maximum likelihood: p(X|mu) = p(x_1,...x_N), we want to find a mu so that the likelihood is maximized
        * we solve it by maximizing the log-likelihood l: set d(l)/d(mu) = d(log(p))/d(mu) = 0
        * up to constants, l = - sum(x_i - mu)^2 for i in [1, N], so maximizing l is equivalent to the minimization in linear regression E = sum(y_i - y_hat_i)^2
        * y ~ N(w_T x, sig^2) is equivalent to y = w_T x + epsilon with epsilon ~ N(0, sig^2)
    * Minimizing squared error is the same as maximizing log-likelihood and also maximizing the likelihood
    * l is log-likelihood
    * L is likelihood
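
For reference, the maximum likelihood steps listed above can be written out as a short derivation. It assumes the IID Gaussian model X ~ N(mu, sig^2) stated in the notes and treats sig^2 as a known constant.

```latex
\begin{aligned}
L(\mu) &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
          \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \\
l(\mu) &= \log L(\mu)
        = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 \\
\frac{\partial l}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i
\end{aligned}
```

The only term of l that depends on mu is the negative sum of squares, which is why maximizing the log-likelihood and minimizing the squared error give the same answer.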

* A regression model that uses L2 regularization technique is called Ridge Regression.
* A regression model that uses L1 regularization technique is called LASSO Regression.

* `L1 regularization`: In general we want the X matrix to be skinny, meaning D<<N (# features << # samples)
    * Goal: select a small number of important features that predict the trend and remove the features that are noise
    * We use `L1 regularization` mainly when we have a fat data matrix X and 
      want to create `sparsity` so most of the weights become zero
      * This also puts a prior on w, so it's also a `MAP estimation` of w
          
    - `L1 regularization` uses L1 norm for penalty term
        - J_LASSO = sum(y_n-y_hat_n)^2 + lambda.|w|
        - J = (Y - X.w)_T.(Y - X.w) + lambda.|w|
        - J = Y_T.Y - 2.Y_T.X.w + w_T.X_T.X.w + lambda.|w|
        - dJ/dw = - 2.X_T.Y + 2.X_T.X.w + lambda.sign(w) = 0 [sign(x)=1 if x>0; sign(x)=-1 if x<0; sign(x)=0 if x=0]

           * The prior is a Laplace distribution, meaning exp of the negative of an absolute value.
             Unlike in L2, in L1 we don't have a gaussian prior because we don't have exp of a negative square anymore
           * Prior (Laplace distribution) : p(w) = (lambda/2)exp(-lambda.|w|)
                   
    - `L2 regularization` uses L2 norm for penalty term
        - J_RIDGE = sum(y_n-y_hat_n)^2 + lambda.|w|^2
        - J = (Y - X.w)_T.(Y - X.w) + lambda.w_T.w
        - J = Y_T.Y - 2.Y_T.X.w + w_T.X_T.X.w + lambda.w_T.w
        - dJ/dw = - 2.X_T.Y + 2.X_T.X.w + 2.lambda.w = 0
                
          * Both Likelihood and Prior are Gaussian because they contain exp of the negative of a square
          * Likelihood (Gaussian distribution) : P(Y|X,w) = mult((1/sqrt(2.pi.sig^2)) exp(-0.5 (y_n - w_T x_n)^2/sig^2)) for n in [1, N]
          * Prior (Gaussian distribution) : p(w) = sqrt(lambda/2.pi) exp(- 0.5 lambda w_T w)

* `Maximum likelihood (ML)` solution: `Minimizing squared error` is the same as `maximizing log-likelihood` 
    - J = sum(y_n-y_hat_n)^2
    * dJ/dw = - 2.X_T.Y + 2.X_T.X.w 
    * w_ml = (X_T.X)^-1.X_T.Y
    In Python:
    * w_ml = np.linalg.solve(X.T.dot(X), X.T.dot(Y))
    * Yhat_ml = X.dot(w_ml)
    
* `Maximum A Posteriori (MAP)` solution through `L2 regularization (Ridge Regression)`.
    * `L2 regularization (Ridge Regression)`: it helps reduce the model complexity and prevents us from over-fitting to outliers.
        Data may have outliers that pull the line away from the main trend in order to minimize the squared error.
        Hence we don't want very large weights because that might lead to fitting to outliers to minimize the squared error.
        As a result we add a penalty for large weights
        * L2 regularization penalty: lambda|w|^2: we add lambda multiplied by the squared norm of the weights:
         - J_RIDGE = sum(y_n-y_hat_n)^2 + lambda|w|^2
         - J = Y_T.Y - 2.Y_T.X.w + w_T.X_T.X.w + lambda.w_T.w
         - dJ/dw = - 2.X_T.Y + 2.X_T.X.w + 2.lambda.w = 0
        * w_MAP = (lambda.I + X_T.X)^-1.X_T.Y
        In Python (`l2` is lambda and `D` is the number of columns of X):
        * w_map = np.linalg.solve(l2*np.eye(D) + X.T.dot(X), X.T.dot(Y))
        * Yhat_map = X.dot(w_map)
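
The ML and MAP one-liners above can be run end to end on synthetic data. The data-generating weights, noise level, and the value of `l2` below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, D - 1))])  # first column is the bias
w_true = np.array([1.0, 2.0, -3.0])
Y = X.dot(w_true) + 0.1 * rng.standard_normal(N)                   # small Gaussian noise

# Maximum likelihood: solve (X_T X) w = X_T Y
w_ml = np.linalg.solve(X.T.dot(X), X.T.dot(Y))
Yhat_ml = X.dot(w_ml)

# MAP with L2 regularization: solve (lambda I + X_T X) w = X_T Y
l2 = 10.0
w_map = np.linalg.solve(l2 * np.eye(D) + X.T.dot(X), X.T.dot(Y))
Yhat_map = X.dot(w_map)
```

`np.linalg.solve` is used instead of forming the inverse explicitly, which is the usual numerically safer choice. The ridge weights come out with a norm no larger than the ML weights, which is the shrinkage effect of the penalty.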
         
* `Gradient Descent` solution: you have a function you want to minimize, J(w)=cost/error
  So you iteratively update w in the direction opposite to the gradient dJ(w)/d(w), in small steps
  By moving slowly against the gradient of a function we get closer to the minimum of that function
  Gradient descent solution to Linear Regression:
   -   J = cost/error = sum(y_n-y_hat_n)^2 = (Y - X.w)_T(Y - X.w)
   -   dJ/dw = - 2.X_T.Y + 2.X_T.X.w = 2.X_T.(Y_hat - Y) (can drop the 2 as it's a constant)
   -   instead of setting dJ/dw to 0 and solving for w, we take small steps in the direction of the negative gradient
   -   initial w is w_0
   -   w <- w - eta.dJ(w)/d(w)
   - Gradient descent for linear regression:
     w = draw a sample from N(0,1/D)
       - for t = [1,T]:
       -  w = w - learning_rate * X_T.(Y_hat - Y)
      python:
       -  Yhat = X.dot(w)
       -  delta = Yhat - Y
       -  w = w - learning_rate*X.T.dot(delta)
       -  mse = delta.dot(delta) / N
       -  costs.append(mse)
      
     > can quit after a number of steps or when the change in w is smaller than a predetermined threshold
     - If learning rate is too big: won't converge (bounces back and forth across the optimum)
     - If learning rate is too small: gradient descent will be too slow
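
Putting the fragments above together gives a complete loop. This sketch assumes noiseless synthetic data, a fixed number of steps `T`, and a hand-picked `learning_rate`; none of these values come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.standard_normal((N, D))
w_true = np.array([1.0, -2.0, 0.5])
Y = X.dot(w_true)                        # noiseless target for a clean check

w = rng.standard_normal(D) / np.sqrt(D)  # initial weights drawn from N(0, 1/D)
learning_rate = 0.001
T = 1000
costs = []

for t in range(T):
    Yhat = X.dot(w)
    delta = Yhat - Y
    w = w - learning_rate * X.T.dot(delta)  # gradient step (constant factor 2 dropped)
    costs.append(delta.dot(delta) / N)      # track the MSE per iteration
```

Plotting `costs` against the iteration number is a simple way to check that the learning rate is neither too large (the curve blows up) nor too small (the curve barely moves).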

                               
# Logistic Regression
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

* Comparison to linear regression: Given data on time spent studying and exam scores. Linear Regression and logistic regression can predict different things:
    * Linear Regression could help us predict the student’s test score on a scale of 0 - 100. Linear regression predictions are continuous (numbers in a range).
    * Logistic Regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view probability scores underlying the model’s classifications.

* Types of logistic regression
    * `Binary` (Pass/Fail): In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value into another value between 0 and 1. In machine learning, we use sigmoid to map predictions to probabilities.
        * $S(z) = \frac{1}{1 + e^{-z}}$
        * $S(z)$ = output between 0 and 1 (probability estimate)
        * $z$ = input to the function (your algorithm’s prediction e.g. mx + b)
        * Decision boundary: Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (true/false, cat/dog), we select a threshold value or tipping point above which we will classify values into class 1 and below which we classify values into class 2
        * For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.

    * `Multi` (Cats, Dogs, Sheep): Instead of $y = 0,1$ we will expand our definition so that $y = 0,1,...,n$. One approach (one-vs-rest) re-runs binary classification multiple times, once for each class; another is to use a softmax output.
        * Softmax activation: it is a function that takes as input a vector of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval $(0, 1)$, and the components will add up to 1, so that they can be interpreted as probabilities. The standard (unit) softmax function is defined by the formula
            * $\sigma(z)_i = softmax(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
            * In words: we apply the standard exponential function to each element $z_i$ of the input vector $𝑧$ and normalize these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector $σ(𝑧)$ is 1.

    * `Ordinal` (Low, Medium, High)
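
The sigmoid with a decision threshold and the softmax function described above can be sketched as follows. The function names and the 0.5 default threshold are illustrative; the max-subtraction in `softmax` is a standard numerical-stability trick that is not mentioned in the notes and does not change the result.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(X, w, b, threshold=0.5):
    """Binary prediction: class 1 if the probability reaches the threshold, else class 0."""
    p = sigmoid(X.dot(w) + b)
    return (p >= threshold).astype(int), p

def softmax(z):
    """Turn a vector of K scores into K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

X = np.array([[2.0], [-2.0], [0.5]])
classes, probs = predict_class(X, w=np.array([1.0]), b=0.0)

scores = np.array([1.0, 2.0, 3.0])  # e.g. scores for cat, dog, sheep
class_probs = softmax(scores)
predicted = int(np.argmax(class_probs))  # pick the most probable class
```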

# Neural Networks

* Input Layer: Holds the data your model will train on. Each neuron in the input layer represents a unique attribute in your dataset (e.g. height, hair color, etc.).
* Hidden Layer: Sits between the input and output layers and applies an activation function before passing on the results. There are often multiple hidden layers in a network. In traditional networks, hidden layers are typically fully-connected layers — each neuron receives input from all the previous layer’s neurons and sends its output to every neuron in the next layer. This contrasts with how convolutional layers work where the neurons send their output to only some of the neurons in the next layer.
* Output Layer: The final layer in a network. It receives input from the previous hidden layer, optionally applies an activation function, and returns an output representing your model’s prediction.
* Weighted Input: A neuron’s input equals the sum of weighted outputs from all neurons in the previous layer. Each input is multiplied by the weight associated with the synapse connecting the input to the current neuron. If there are 3 inputs or neurons in the previous layer, each neuron in the current layer will have 3 distinct weights — one for each synapse.
* Single Input: $Z = Input \cdot Weight = XW$
* Multiple Inputs: $Z = \sum_{i=1}^{n} x_i w_i = x_1 w_1 + x_2 w_2 + x_3 w_3$
* Notice, it’s exactly the same equation we use with linear regression! In fact, a neural network with a single neuron is the same as linear regression! The only difference is the neural network post-processes the weighted input with an activation function.

* `training the network`: we split the data into train, test, and dev sets to set parameters and hyperparameters and to measure the generalizability of the model to unseen data. Test and dev sets should come from the same distribution.
    * train/test/dev for medium size data: 60/20/20
    * train/test/dev for big data: 98/1/1 or 99.5/0.4/0.1

* `Bias and Variance trade-off`: From the train and dev set errors we can detect:
    * `Bias`: Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. It leads to high error on both training and test data.
    * `Variance`: Variance is the variability of model prediction for a given data point, a value which tells us the spread of our predictions. A model with high variance pays a lot of attention to training data and does not generalize to data which it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.
    * `Underfitting`: happens when a model is unable to capture the underlying pattern of the data
    * `Overfitting`: happens when our model captures the noise along with the underlying pattern in data.
    * `Underfitting with high bias`: when the model gives poor accuracy on both train and dev data (e.g. Error 15% and 16% for train and dev sets)
    * `Overfitting with high variance`: when the model overfits the training data and doesn't generalize to the unseen dev set, so you see high accuracy for the train set and poor accuracy for the dev set (e.g. Error 1% and 16% for train and dev sets)
    * `high bias and high variance`: when the accuracy for training set is poor and the accuracy for dev set is even worse (e.g. Error 15% and 30% for train and dev sets)
    * `optimal balance`: A model with an optimal balance of bias and variance neither overfits nor underfits: low bias and low variance (e.g. Error 0.5% and 1% for train and dev sets)

* `Solving Bias and Variance issues`:
    * `High bias`: Bigger network, more complex NN architecture, deeper network
    * `High variance`: More data, regularization, search for a different NN architecture

* `Parameter initialization`: Weight and bias initialization is a very important step. Weights should be initialized with small random values and biases can be initialized with zeros, e.g. `W = np.random.randn(2,2)*0.01` and `b = np.zeros((2,1))`

* `Forwardpropagation`: Forwardpropagation is how neural networks make predictions. Input data is “forward propagated” through the network layer by layer to the final layer which outputs a prediction.
    * Step 1: Calculate the weighted input to the hidden layer by multiplying the input 𝑋 by the hidden weight
    * Step 2: Apply the activation function and pass the result to the final layer
    * Step 3: Repeat steps 1 and 2 for the final layer, except this time 𝑋 is replaced by the hidden layer’s output, 𝐻
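
The three steps above can be sketched for a network with one hidden layer. The ReLU hidden activation, the linear output layer, and the names `Wh`, `Bh`, `Wo`, `Bo` are assumptions made for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def feed_forward(X, Wh, Bh, Wo, Bo):
    """Propagate X through one hidden layer and a linear output layer."""
    Zh = X.dot(Wh) + Bh  # step 1: weighted input to the hidden layer
    H = relu(Zh)         # step 2: hidden activation
    Zo = H.dot(Wo) + Bo  # step 3: same computation with X replaced by H
    return Zo, H

X = np.array([[1.0, 2.0]])                   # one sample, two features
Wh = np.array([[1.0, -1.0],
               [1.0, -1.0]])                 # 2 inputs -> 2 hidden units
Bh = np.zeros((1, 2))
Wo = np.array([[2.0],
               [5.0]])                       # 2 hidden units -> 1 output
Bo = np.array([[1.0]])

prediction, hidden = feed_forward(X, Wh, Bh, Wo, Bo)
```

Here the second hidden unit receives a negative weighted input, so ReLU zeroes it out and only the first unit contributes to the prediction.
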
* `Weight Initialization`: we initialize each array with numpy’s `np.random.randn(rows, cols)` method, which returns a matrix of random numbers drawn from a normal distribution with mean 0 and variance 1.
    * If we are using the ReLU activation function we initialize the weights of layer l with `w_l = np.random.randn(n_prev, n_l) * np.sqrt(2/n_prev)`, where `n_prev` is the number of units in layer l-1 and `n_l` the number of units in layer l
    * Bias Terms: they allow us to shift our neuron’s activation outputs left and right. This helps us model datasets that do not necessarily pass through the origin. Using the numpy method `np.full()` below, we create two bias arrays of shape (1, layer size) filled with the default value 0.1. The first argument to np.full is a tuple of array dimensions. The second is the default value for cells in the array.
```
import numpy as np

# Placeholder layer sizes for illustration; set these for your own network.
INPUT_LAYER_SIZE, HIDDEN_LAYER_SIZE, OUTPUT_LAYER_SIZE = 3, 4, 1

def init_weights_relu():
    # Scale N(0, 1) samples by sqrt(2 / fan_in), suited to ReLU layers.
    # General pattern for any layer l:
    #   randn(PREV_LAYER_SIZE, CURRENT_LAYER_SIZE) * sqrt(2.0 / PREV_LAYER_SIZE)
    Wh = np.random.randn(INPUT_LAYER_SIZE, HIDDEN_LAYER_SIZE) * \
        np.sqrt(2.0 / INPUT_LAYER_SIZE)
    Wo = np.random.randn(HIDDEN_LAYER_SIZE, OUTPUT_LAYER_SIZE) * \
        np.sqrt(2.0 / HIDDEN_LAYER_SIZE)
    return Wh, Wo

def init_weights_tanh():
    # Same pattern with sqrt(1 / fan_in), suited to tanh layers.
    Wh = np.random.randn(INPUT_LAYER_SIZE, HIDDEN_LAYER_SIZE) * \
        np.sqrt(1.0 / INPUT_LAYER_SIZE)
    Wo = np.random.randn(HIDDEN_LAYER_SIZE, OUTPUT_LAYER_SIZE) * \
        np.sqrt(1.0 / HIDDEN_LAYER_SIZE)
    return Wh, Wo

def init_bias():
    # One row vector of constant biases per layer.
    Bh = np.full((1, HIDDEN_LAYER_SIZE), 0.1)
    Bo = np.full((1, OUTPUT_LAYER_SIZE), 0.1)
    return Bh, Bo
```
* `Backpropagation`: The goals of backpropagation are straightforward: adjust each weight in the network in proportion to how much it contributes to overall error. If we iteratively reduce each weight’s error, eventually we’ll have a series of weights that produce good predictions.
We use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule will help us identify how much each weight contributes to our overall error and the direction to update each weight to reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost.

* Saving work with `Memoization`: Memoization is a computer science term which simply means: don’t recompute the same thing over and over. In memoization we store previously computed results to avoid recalculating the same function. It’s handy for speeding up recursive functions of which backpropagation is one.
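
As a rough sketch of backpropagation with cached intermediate values, the code below handles one ReLU hidden layer and a linear output, with the cost taken as half the mean squared error. The architecture, the cost, and all names are assumptions for illustration; biases are omitted to keep it short.

```python
import numpy as np

def forward(X, Wh, Wo):
    """Forward pass that returns every intermediate value for reuse."""
    Zh = X.dot(Wh)
    H = np.maximum(0.0, Zh)  # ReLU hidden activation
    Yhat = H.dot(Wo)         # linear output
    return Zh, H, Yhat

def cost(X, Y, Wh, Wo):
    _, _, Yhat = forward(X, Wh, Wo)
    return np.sum((Yhat - Y) ** 2) / (2.0 * X.shape[0])

def backward(X, Y, Wh, Wo):
    """Chain rule from the output back to each weight matrix."""
    Zh, H, Yhat = forward(X, Wh, Wo)  # computed once, reused below
    dYhat = (Yhat - Y) / X.shape[0]   # dC/dYhat
    dWo = H.T.dot(dYhat)              # dC/dWo
    dH = dYhat.dot(Wo.T)              # error pushed back to the hidden layer
    dZh = dH * (Zh > 0)               # ReLU derivative: 1 where Zh > 0, else 0
    dWh = X.T.dot(dZh)                # dC/dWh
    return dWh, dWo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 1))
Wh = rng.standard_normal((3, 4))
Wo = rng.standard_normal((4, 1))
dWh, dWo = backward(X, Y, Wh, Wo)
```

The error signal `dYhat` is computed once and reused for both weight gradients, which is the "don't recompute" idea applied to the chain rule.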

* `Vanishing/exploding gradients`:
    * if the weight matrices are slightly smaller than the identity I (W < I), the activations decrease exponentially as the network gets deeper, so activations and gradients vanish and this makes training very difficult (training would take forever as gradient descent takes tiny steps for each parameter update).
    * if the weight matrices are slightly larger than the identity I (W > I), the activations increase exponentially as the network gets deeper, so activations and gradients explode and this also makes training very difficult
    * Weight initialization for NNs is a partial solution to this problem
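
The exponential effect of depth is easy to reproduce with a toy network. The sketch assumes linear activations, no biases, and every layer's weight matrix equal to a scaled identity; the function name is hypothetical.

```python
import numpy as np

def activation_norm_after(depth, scale, width=4):
    """Norm of the activations after `depth` layers with W = scale * I."""
    W = scale * np.eye(width)
    a = np.ones(width)
    for _ in range(depth):
        a = W.dot(a)  # linear layer, no bias, no non-linearity
    return np.linalg.norm(a)

start = np.linalg.norm(np.ones(4))            # 2.0 before any layer
vanished = activation_norm_after(50, 0.9)     # shrinks like 0.9 ** 50
exploded = activation_norm_after(50, 1.1)     # grows like 1.1 ** 50
```

A deviation of only 10% from the identity changes the activation norm by more than two orders of magnitude over 50 layers, and the gradients scale the same way on the backward pass.
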
* `Activation Functions`: They live inside neural network layers and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power — allowing them to model complex non-linear relationships. By modifying inputs with non-linear functions neural networks can model highly complex relationships between features. Popular activation functions include relu and sigmoid.
    * Activation functions typically have the following properties:
        * `Non-linear` - In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in our dataset were non-linear? (e.g. $x^2$, sin, log). To model these relationships we need a non-linear prediction equation. Activation functions provide this non-linearity.
        * `Continuously differentiable` — To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1 (perceptron), we wouldn’t know in which direction to update our weights to reduce our error.
        * `Fixed Range` — Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.
    * Activation functions:
        * `ELU`: The Exponential Linear Unit, widely known as ELU, is a function that tends to converge the cost to zero faster and produce more accurate results. Unlike other activation functions, ELU has an extra constant alpha, which should be a positive number.
            * ELU is very similar to ReLU except for negative inputs. Both are the identity function for non-negative inputs. For negative inputs, ELU decays smoothly until its output approaches -α, whereas ReLU cuts off sharply to 0.
            * ELU is a strong alternative to ReLU. Unlike ReLU, ELU can produce negative outputs.

        * `ReLU`: Stands for Rectified Linear Unit. The formula is deceptively simple: $max(0,z)$. Despite its name and appearance, it’s not linear and provides the same benefits as Sigmoid but with better performance.
            * It helps mitigate the vanishing gradient problem, since its gradient is 1 for all positive inputs.
            * ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
            * One of its limitations is that it should only be used within hidden layers of a neural network model.
            * Some gradients can be fragile during training and can die: a large weight update can leave a neuron in a state where it never activates on any data point again. Simply put, ReLU can result in dead neurons.
            * In other words, for activations in the region (x<0) of ReLU, the gradient is 0, so the weights are not adjusted during descent. Neurons that go into that state stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.
            * The range of ReLU is [0, inf). This means it can blow up the activation.

        * `LeakyReLU`: A variant of ReLU. Instead of being 0 when $z<0$, a leaky ReLU has a small, non-zero, constant slope $\alpha$ for negative inputs (normally $\alpha=0.01$). However, the consistency of the benefit across tasks is presently unclear.
            * Leaky ReLUs are one attempt to fix the “dying ReLU” problem by keeping a small slope (of 0.01, or so) for negative inputs.
            * Because it is linear on each side of zero, it may lag behind Sigmoid and Tanh in some use cases, such as certain complex classification tasks.
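The three ReLU-family activations described above can be sketched in plain Python. This is only an illustration: the function names and default `alpha` values are assumptions, and the ELU expression for negative inputs, `alpha * (exp(z) - 1)`, is the standard definition rather than something spelled out in these notes.

```python
import math

def relu(z):
    """max(0, z): identity for positive inputs, 0 otherwise."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 0 for negative inputs, which is what lets neurons die."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Identity for non-negative inputs; saturates smoothly toward -alpha."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), round(elu(z), 4))
```

For very negative inputs, `relu` returns 0 with zero gradient, `leaky_relu` keeps a small non-zero gradient, and `elu` approaches `-alpha`.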
        * `Sigmoid`: It takes a real value as input and outputs a value between 0 and 1. It’s easy to work with and has all the nice properties of activation functions: it’s non-linear, continuously differentiable, monotonic, and has a fixed output range.
            * It is nonlinear in nature. Combinations of this function are also nonlinear!
            * It gives an analog (graded) activation, unlike a step function.
            * It has a smooth gradient too.
            * It’s good for a classifier.
            * The output is always in the range (0, 1), compared to (-inf, inf) for a linear function, so the activations are bounded and won’t blow up.
            * Towards either end of the sigmoid function, the Y values respond very little to changes in X.
            * This gives rise to the problem of “vanishing gradients”.
            * Its output isn’t zero-centered (0 < output < 1). This can make the gradient updates zig-zag, which makes optimization harder.
            * Sigmoids saturate and kill gradients.
            * As a result, the network stops learning or learns drastically slowly (depending on the use case, and until the gradient computation hits floating-point limits).

        * `Tanh`: It squashes a real-valued number to the range $(-1, 1)$. It’s non-linear. But unlike Sigmoid, its output is zero-centered. Therefore, in practice the tanh non-linearity is usually preferred to the sigmoid non-linearity.
            * The gradient is stronger for tanh than for sigmoid (its derivatives are steeper).
            * Tanh also has the vanishing gradient problem.
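To make the saturation and "steeper derivative" remarks above concrete, here is a small sketch that evaluates the derivatives of sigmoid and tanh. The helper names are illustrative; the derivative formulas `s * (1 - s)` and `1 - tanh(z)^2` are the standard ones.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2; its maximum is 1.0 at z = 0."""
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_grad(z), tanh_grad(z))

# Backprop multiplies one such factor per layer. Even in the best case
# for sigmoid (0.25 per layer), the product shrinks quickly with depth.
best_case_after_10_layers = 0.25 ** 10
```

Both derivatives collapse toward 0 for large |z|, which is the saturation behind the vanishing gradient problem; near 0, tanh's derivative is four times larger than sigmoid's.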

        * `Softmax`: It turns a vector of ‘n’ real-valued scores into a probability distribution over ‘n’ classes. In other words, it calculates the probability of each target class over all possible target classes; the outputs are positive and sum to 1. The calculated probabilities are then used to determine the target class for the given inputs.
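A minimal sketch of softmax in plain Python follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition here, not something stated in the notes; the example scores are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])          # hypothetical class scores
predicted_class = probs.index(max(probs))  # class with the highest probability
```

The predicted class is simply the index of the largest probability, which is also the index of the largest raw score.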

* `Layers`:
    *  `BatchNorm` helps to train much deeper networks and makes your learning algorithm run much more quickly.
        * `BatchNorm during training`: applies normalization to input/hidden layers. It normalizes the values of each hidden layer before applying the activation function. It makes hyperparameter search much easier.
            * It helps train very deep networks, because it makes weights deeper in the network (say, the weights in layer 10) more robust to changes in the weights of earlier layers (say, layer 1).
            * It accelerates convergence by reducing internal covariate shift. If the individual observations in the batch are widely different, the gradient updates will be choppy and take longer to converge.
                * `covariate shift` is the idea that the data distribution changes: if you have learned some X to Y mapping and the distribution of X changes, you might need to retrain your learning algorithm. This is true even if the ground truth function mapping X to Y remains unchanged. The need to retrain becomes even more acute if the ground truth function shifts as well.
                * BatchNorm and `covariate shift`: BatchNorm limits how much updating the parameters in the earlier layers can affect the distribution of values that the later layers see and have to learn on. The input values to each layer become more stable, so the later layers have firmer ground to stand on. Even though the input distribution still changes a bit as the earlier layers keep learning, it changes less, so the later layers are forced to adapt less. This weakens the coupling between what the early layers' parameters do and what the later layers' parameters do, which lets each layer learn a little more independently of the others and speeds up learning in the whole network.
                * Regularization is an unintended side effect of BatchNorm: in each mini-batch $X_t$, the values $Z^{[l]}$ are scaled by the mean and variance computed on just that one mini-batch, which adds a little noise to each hidden layer's activations.
        * `BatchNorm` during test: during training, BatchNorm processes your data one mini-batch at a time, but at test time you might need to process the examples one at a time.
            * During `training` the batch norm layer normalizes the incoming activations and outputs a new batch where the mean equals 0 and the standard deviation equals 1. It subtracts the mean and divides by the standard deviation of the batch.
                * Incoming value `Z` to hidden layer `l`: $Z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$. If you are using BatchNorm you can eliminate $b^{[l]}$ and use $Z^{[l]} = W^{[l]} a^{[l-1]}$, because BatchNorm zeros out the mean.
                * Calculate the mean for each unit in hidden layer l: $\mu = \frac{1}{m}\sum_{i=1}^{m} z_i$, where $z_i$ is the unit's value for example i and m is the number of examples in each mini-batch.
                * Calculate the variance: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu)^2$.
                * Normalize every component of z to get z_norm by subtracting the mean and dividing by the standard deviation, with epsilon added for numerical stability: $z_{i,norm} = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
                * In practice Batch Norm is usually applied to mini-batches.
                * You might not want your hidden unit values to be forced to have mean 0 and variance 1. For instance, with a sigmoid activation function you might want a bigger variance or a mean bigger than 0 to leverage the non-linearity of the activation function rather than just using the linear region.
                    * $\tilde{z}_i = \gamma z_{i,norm} + \beta$: with gamma and beta you can make sure your $\tilde{z}_i$ values are in the range that you want. Your learning algorithm (such as gradient descent, Adam, RMSProp) sets gamma and beta.
                    * For every mini-batch, compute forward propagation to make the output $\hat{y}$ close to the ground truth $y$. In each hidden layer, use BatchNorm to normalize $z^{[l]}$; then use backprop to compute the gradients $dw^{[l]}$, $d\beta^{[l]}$, and $d\gamma^{[l]}$; then update $w^{[l]} = w^{[l]} - \alpha\, dw^{[l]}$, $\beta^{[l]} = \beta^{[l]} - \alpha\, d\beta^{[l]}$, and $\gamma^{[l]} = \gamma^{[l]} - \alpha\, d\gamma^{[l]}$.
                    * $\tilde{z}$ is what gets passed on to the activation function: it is $z_{norm}$ rescaled by gamma and beta.
            * At `test time` you might not have a mini-batch of thousands of examples to process at the same time, and if you have just one example, taking the mean and variance of that one example doesn't make sense. To apply your neural network at test time, you need to come up with some separate estimate of mu and sigma squared.
                * You could in theory run your whole training set through your final network to get mu and sigma squared for test time. But in practice, what people usually do is keep an exponentially weighted average (also sometimes called the running average) of the mu and sigma squared values seen during training to get a rough estimate, and then use those values of $\mu$ and $\sigma^2$ at test time to do the scaling for $z_{norm}$ and $\tilde{z}$.
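The training-time and test-time behaviour described above can be sketched for a single hidden unit as follows. This is a simplified illustration, not a library implementation: the class name, the `momentum` value of 0.9 for the running averages, and the initial running statistics (0 and 1) are all assumptions made for the example.

```python
import math

class BatchNormUnit:
    """Batch norm for ONE hidden unit (hypothetical helper for illustration)."""

    def __init__(self, gamma=1.0, beta=0.0, momentum=0.9, eps=1e-8):
        self.gamma, self.beta = gamma, beta      # learnable scale and shift
        self.momentum, self.eps = momentum, eps
        self.running_mu, self.running_var = 0.0, 1.0  # assumed initial values

    def forward_train(self, z):
        """z: the unit's pre-activation values across one mini-batch."""
        m = len(z)
        mu = sum(z) / m
        var = sum((zi - mu) ** 2 for zi in z) / m
        # Exponentially weighted averages, kept for use at test time.
        self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        z_norm = [(zi - mu) / math.sqrt(var + self.eps) for zi in z]
        return [self.gamma * zn + self.beta for zn in z_norm]

    def forward_test(self, z_single):
        """A single example: use the running estimates, not batch statistics."""
        z_norm = (z_single - self.running_mu) / math.sqrt(self.running_var + self.eps)
        return self.gamma * z_norm + self.beta

bn = BatchNormUnit()
out = bn.forward_train([1.0, 2.0, 3.0, 4.0])  # mean 0, variance close to 1
single = bn.forward_test(2.5)                 # works for one example
```

With `gamma=1` and `beta=0` the training output is just `z_norm`; other values let the network choose a different mean and variance for the unit.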
    * `Convolution layer`: In a CNN, a convolution is a linear operation that multiplies a set of weights (the kernel/filter) with the input, and it does most of the heavy lifting.
        * A convolution layer has 2 major components: 1. Kernel (Filter) 2. Stride
        * Kernel (Filter): A convolution layer can have more than one filter. The filter should be smaller than the input. This is intentional, as it allows the filter to be applied multiple times at different positions on the input. Filters help identify important features in the input, and applying several different filters to the same input extracts different features. Multiplying a filter with the input across all positions gives a two-dimensional array, which is called a “feature map”.
        * Stride: This property controls the movement of the filter over the input. When the value is set to 1, the filter moves 1 column at a time over the input. When the value is set to 2, the filter jumps 2 columns at a time as it moves over the input.
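As a rough sketch of how a kernel and a stride produce a feature map, the function below slides a filter over a small 2-D input using plain Python lists. The input values and the 2x2 filter are made up for illustration. No padding is applied, and, as in most deep learning libraries, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (lists of lists) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            # Element-wise multiply the window with the kernel and sum.
            row.append(sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        feature_map.append(row)
    return feature_map

image = [
    [1, 2, 0, 1],
    [3, 4, 2, 0],
    [0, 2, 5, 3],
    [1, 0, 2, 6],
]
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 filter

fm_stride1 = conv2d(image, kernel, stride=1)  # 3x3 feature map
fm_stride2 = conv2d(image, kernel, stride=2)  # 2x2 feature map
```

A larger stride visits fewer positions, so the feature map gets smaller: 3x3 with stride 1 versus 2x2 with stride 2 for this 4x4 input.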
    * `Pooling`: Pooling layers often take convolution layers as input. A complicated dataset with many objects will require a large number of filters, each responsible for finding a pattern in an image, so the dimensionality of the convolutional layer can get large. This increases the number of parameters, which can lead to over-fitting. Pooling layers are a method for reducing this high dimensionality. Just like the convolution layer, a pooling layer has a kernel size and a stride. The kernel is smaller than the feature map; in most cases the kernel is 2x2 with a stride of 2. There are two main types of pooling layers.
        * The first type is the max pooling layer. It takes a stack of feature maps (the output of a convolution layer) as input. The value of each node in the max pooling layer is the maximum of the pixels contained in the window.
        * The other type is the average pooling layer, which calculates the average of the pixels contained in the window. It's not used often, but you may see it in applications for which smoothing an image is preferable.
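The two pooling types can be sketched on a single feature map like this, using the 2x2 window and stride of 2 mentioned above. The function name and the example values are illustrative only.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max or average pooling."""
    combine = max if mode == "max" else (lambda w: sum(w) / len(w))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size)
            ]
            row.append(combine(window))
        pooled.append(row)
    return pooled

fmap = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 4, 6, 2],
]
max_pooled = pool2d(fmap, mode="max")      # [[7, 2], [4, 8]]
avg_pooled = pool2d(fmap, mode="average")  # [[4.0, 1.0], [2.0, 5.0]]
```

Either way the 4x4 map shrinks to 2x2, which is the dimensionality reduction pooling is used for.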
    * `LSTM`
    * `Dropout`
    * `Linear`


* `Regularization`:
    * `Data Augmentation`: Having more data (dataset / samples) is the best way to get better, more consistent estimators (ML models). In the real world, getting a large volume of useful data for training a model is cumbersome, and labelling is an extremely tedious task.
        * Labelling often requires manual annotation. For example, to create a better image classifier we can use MTurk and involve more manpower to generate a dataset, or run surveys on social media and ask people to participate. These processes can yield a good dataset, but they are difficult to carry out and expensive. Having a small dataset leads to the well-known overfitting problem.
        * Data augmentation is a regularization technique that addresses the above problem. The concept is very simple: it generates new training data from the given original dataset, which provides a cheap and easy way to increase the amount of your training data.
        * This technique can be used for both NLP and CV.
        * In computer vision we can use techniques like jitter, PCA and flipping. Similarly, in NLP we can use techniques like synonym replacement, random insertion, random deletion and word embeddings.
    * `Dropout regularization`: A dropout layer takes the output of the previous layer’s activations and randomly sets a certain fraction (the dropout rate) of the activations to 0, cancelling or ‘dropping’ them out.
        * It is a common regularization technique used to prevent overfitting in neural networks.
        * Deactivating neurons leads to a smaller network, which has a regularizing effect.
        * Similar to L2 regularization, dropout forces the network weights to shrink: because the network can't rely on any one feature alone, it spreads out the weights.
        * Dropout is like an adaptive form of L2 regularization, because the L2 penalty on different weights differs depending on the size of the activation supplied to that weight.
        * Keep-prob: the chance of keeping a unit in each layer. It is possible to vary keep-prob for each layer.
            * Keep-prob is lower for layers we are worried about overfitting (e.g. layers with more parameters). It acts like adjusting lambda in L2 regularization per layer.
            * When applying dropout in neural networks, one needs to compensate for the fact that at training time a portion of the neurons were deactivated. To do so, there exist two common strategies:
                * scaling the activations at test time
                * inverting the dropout during the training phase
            * When you want to check that the loss goes down on every iteration, you should turn off dropout, because the randomly dropped units make the cost noisy from one iteration to the next.
        * An advantage of inverted dropout (besides not having to change the code at test time) is that during training one can get fancy and change the dropout rate dynamically. This has been termed "annealed" dropout. Essentially the logic is that adding dropout "noise" towards the beginning of training helps to keep the optimization from getting stuck in a local minimum, while letting it decay to zero in the end results in a finer-tuned network with better performance.
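Here is a minimal sketch of inverted dropout on a flat list of activations. The function name, the `training` flag, and the seeded `random.Random` generator are assumptions made so the example is reproducible; real layers operate on tensors.

```python
import random

def inverted_dropout(activations, keep_prob, training=True, rng=random):
    """Zero each activation with probability (1 - keep_prob) during training.

    Kept activations are divided by keep_prob, so the expected value is
    unchanged and nothing needs to be rescaled at test time.
    """
    if not training or keep_prob >= 1.0:
        return list(activations)
    out = []
    for a in activations:
        keep = rng.random() < keep_prob
        out.append(a / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)  # fixed seed, only for reproducibility
acts = [1.0] * 10000
train_out = inverted_dropout(acts, keep_prob=0.8, rng=rng)
test_out = inverted_dropout(acts, keep_prob=0.8, training=False)

mean_train = sum(train_out) / len(train_out)  # close to 1.0 on average
```

Because of the division by `keep_prob`, the mean activation during training stays close to the test-time mean, which is the point of the "inverted" variant.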
    * `Early Stopping`: One of the biggest problems in training a neural network is deciding how long to train the model.
        * During training you want to optimize your cost function but also avoid overfitting. Early stopping keeps the system from overfitting by stopping the training before the cost is fully optimized.
        * Training too little leads to underfitting on both the train and test sets. Training too much leads to overfitting on the training set and poor results on the test set.
        * The challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not so long that it overfits the training data.
        * One possible solution is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that results in the best accuracy on the train or a holdout test dataset. The problem is that this requires multiple models to be trained and discarded.
        * In a plot of training and validation error against epochs, there is typically an epoch ‘t’ after which the model starts overfitting: from that point on, the gap between the train and the validation error keeps increasing.
        * An alternative technique to prevent overfitting is to use the validation error to decide when to stop. This approach is called early stopping.
        * While the model is being trained, it is evaluated on the holdout validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.
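The stopping rule described above can be sketched as a function over per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is a common refinement and is an assumption here, not part of these notes; the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # never triggered: ran to the end

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # hypothetical values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
```

In a real training loop one would typically also keep a copy of the weights from `best_epoch` and restore them after stopping.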
    * `Ensembling`: Ensemble methods combine several machine learning techniques into one predictive model. There are a few different methods for ensembling, but the two most common are:
        * `Bagging`: Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates.
            * It trains a large number of “strong” learners in parallel.
            * A strong learner is a model that’s relatively unconstrained.
            * Bagging then combines all the strong learners together in order to “smooth out” their predictions.
        * `Boosting`: Boosting refers to a family of algorithms that are able to convert weak learners to strong learners.
            * Each one in the sequence focuses on learning from the mistakes of the one before it.
            * Boosting then combines all the weak learners into a single strong learner.
            * Bagging uses complex base models and tries to “smooth out” their predictions, while boosting uses simple base models and tries to “boost” their aggregate complexity.
    * `Injecting Noise`: Noise is often introduced to the inputs as a dataset augmentation strategy. When we have a small dataset the network may effectively memorize the training dataset. Instead of learning a general mapping from inputs to outputs, the model may learn the specific input examples and their associated outputs. One approach for improving generalization error and improving the structure of the mapping problem is to add random noise.
        * Adding noise means that the network is less able to memorize training samples because they are changing all of the time, resulting in smaller network weights and a more robust network that has lower generalization error.
        * Random noise can be added to other parts of the network during training. Noise is only added during training; no noise is added during the evaluation of the model or when the model is used to make predictions on new data. Some examples include:
            * Noise Injection on Weights: Noise added to weights can be interpreted as a more traditional form of regularization. In other words, it pushes the model to be relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions.
            * Noise Injection on Outputs: In a real-world dataset, we can expect some amount of mistakes in the output labels. One way to remedy this is to explicitly model the noise on the labels. An example of noise injection on outputs is label smoothing.
    * `L1 Regularization`: A regression model that uses the L1 regularization technique is called Lasso Regression. It adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.
    * `L2 Regularization`: A regression model that uses the L2 regularization technique is called Ridge Regression. The main difference between L1 and L2 regularization is that L2 regularization uses the “squared magnitude” of the coefficients as the penalty term added to the loss function.
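A small sketch of the two penalty terms and their gradients follows. The strength parameter `lam` (often written lambda) and the example weights are illustrative assumptions; the L1 gradient shown is a subgradient, with 0 chosen at `w == 0`.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam * sum of absolute values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lam * sum of squares."""
    return lam * sum(w * w for w in weights)

def l1_grad(weights, lam):
    """Constant-size pull toward 0, regardless of how large the weight is."""
    return [lam * ((w > 0) - (w < 0)) for w in weights]

def l2_grad(weights, lam):
    """Pull toward 0 that is proportional to the weight itself."""
    return [2 * lam * w for w in weights]

weights = [0.5, -2.0, 0.0]  # hypothetical coefficients
lam = 0.1
print(l1_penalty(weights, lam), l2_penalty(weights, lam))
print(l1_grad(weights, lam), l2_grad(weights, lam))
```

The constant-size L1 gradient is why Lasso tends to push small coefficients all the way to zero, while the proportional L2 gradient shrinks large coefficients most.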


* `Loss Functions`: A loss function, or cost function, is a wrapper around our model’s predict function that tells us “how good” the model is at making predictions for a given set of parameters. The loss function has its own curve and its own derivatives. The slope of this curve tells us how to change our parameters to make the model more accurate! We use the model to make predictions. We use the cost function to update our parameters. Our cost function can take a variety of forms as there are many different cost functions available. Popular loss functions include: MSE (L2) and Cross-entropy Loss.
    * Cross-Entropy: Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
        * When the true label is 1: as the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!
        * Cross-entropy and log loss are slightly different depending on context, but in machine learning, when the predictions are probabilities between 0 and 1, they resolve to the same thing.
    * Hinge: the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).
    * Huber: In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used.
    * Kullback-Leibler
    * MAE (L1): Mean Absolute Error, or L1 loss.
    * MSE (L2): Mean Squared Error, or L2 loss
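The losses listed above can be sketched with their usual textbook definitions. These formulas are not written out in the notes, so treat the details as assumptions: the binary form of cross-entropy, labels in {-1, +1} for hinge, the Huber threshold `delta=1.0`, and the clipping constant `eps`.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for one example; y_true is 0 or 1, p_pred is P(label = 1)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def hinge(y_true, score):
    """Max-margin loss; y_true is -1 or +1, score is the raw classifier output."""
    return max(0.0, 1.0 - y_true * score)

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for large ones (less outlier-sensitive)."""
    err = abs(y_true - y_pred)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

def mae(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

confident_wrong = binary_cross_entropy(1, 0.012)  # high loss
confident_right = binary_cross_entropy(1, 0.99)   # low loss
```

The `.012` example from the cross-entropy item gives a loss above 4, while a confident correct prediction gives a loss near 0.01.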

* `Gradient Accumulation`: Gradient accumulation is a mechanism to split the batch of samples—used for training a neural network—into several mini-batches of samples that will be run sequentially.
This is used to enable using large batch sizes that require more GPU memory than available. Gradient accumulation helps in doing so by using mini-batches that require an amount of GPU memory that can be satisfied.
Gradient accumulation means running all mini-batches sequentially (generally on the same GPU) while accumulating their calculated gradients and not updating the model variables - the weights and biases of the model. The model variables must not be updated during the accumulation in order to ensure all mini-batches use the same model variable values to calculate their gradients. Only after accumulating the gradients of all those mini-batches will we generate and apply the updates for the model variables.
This results in the same updates for the model parameters as if we were to use the global batch.
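The equivalence claimed above can be checked on a toy problem. The sketch below fits the 1-D model `y = w * x` with mean squared error; the data, learning rate, and helper name are made up, and the mini-batches are assumed to be of equal size (otherwise each gradient would need to be weighted by its mini-batch size).

```python
def grad_mse(w, batch):
    """Gradient of mean squared error w.r.t. w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # hypothetical samples
w, lr = 0.0, 0.01

# One update computed from the full (global) batch.
w_full = w - lr * grad_mse(w, data)

# The same update via accumulation over 2 equally sized mini-batches.
mini_batches = [data[:2], data[2:]]
accumulated = 0.0
for mb in mini_batches:
    # w is NOT updated inside this loop; gradients are only accumulated.
    accumulated += grad_mse(w, mb) / len(mini_batches)
w_accum = w - lr * accumulated
```

Both paths produce the same new value of `w` (up to floating-point rounding), but the accumulated version only ever needs one mini-batch in memory at a time.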
    * Gradient descent variants:
        * `Batch` gradient descent: in every iteration you have to process the entire training set before you are able to take one little step of gradient descent. The cost function goes down in every iteration (given a suitable learning rate), but each iteration takes too long!
            * `Vectorization` allows you to efficiently compute on `m` training examples `X=[x_1,...,x_m]` and labels `Y=[y_1,...,y_m]`. It allows you to process the whole training set without an explicit for loop.
            * If you have a small training set (e.g. fewer than 2000 examples), use `Batch Gradient Descent`.
        * `Stochastic` gradient descent: if the mini-batch size is 1, it is called stochastic GD; every example is its own mini-batch. SGD is very noisy but on average takes you in the right direction. It never fully converges: it oscillates around the region of the minimum rather than reaching the minimum and staying there.
            * You lose all the speedup you gain through vectorization by processing one example at a time.
        * `Mini-batch` gradient descent: split the training data into small batches called mini-batches (e.g. 1000 examples each). In each iteration you process one mini-batch and apply gradient descent (GD) to it.
            * The cost function trends down but is noisier, because in every mini-batch iteration it's like you are training on a new training set. The noisiness can be reduced by choosing a smaller learning rate (smaller steps).
            * In practice you use a mini-batch size that is not too small or too big to achieve the fastest learning (it benefits from vectorization and makes progress without needing to wait for the whole training set).
            * It does not always converge exactly and may oscillate around a small region. You can reduce the learning rate as you get closer to the minimum.
            * If you have a big training set you can take a mini-batch size such as 64, 128, 512, or 1024. Make sure your mini-batch fits into memory.
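The three variants differ only in how many examples feed each update, which the following sketch makes explicit through a `batch_size` argument. It fits the toy model `y = w * x` on noise-free synthetic data; the data, learning rate, epoch count, and function name are assumptions for illustration, and plain Python loops stand in for vectorized code.

```python
import random

def train(data, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y = w * x with gradient descent; batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # new random order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad  # one update per (mini-)batch
    return w

# Synthetic data on the line y = 3x (no noise).
data = [(x / 10, 3.0 * x / 10) for x in range(1, 21)]

w_batch = train(data, batch_size=len(data))  # 1 update per epoch
w_mini = train(data, batch_size=5)           # 4 updates per epoch
w_sgd = train(data, batch_size=1)            # 20 updates per epoch
```

Because this toy data has no noise, all three variants approach `w = 3`; on real data the smaller batch sizes would show the noisier paths described above.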

* `Optimizers`: During training we need to adjust the weights of the model so that its predictions become as accurate as possible. But how exactly do you change the parameters of your model? Optimizers tie together the loss function and the model parameters by updating the model in response to the output of the loss function. Common gradient descent optimization algorithms:
    * `Adagrad`: Adagrad (short for adaptive gradient) adaptively sets a separate learning rate for each parameter.
        * Parameters that have higher gradients or frequent updates should have a slower learning rate so that we do not overshoot the minimum value.
        * Parameters that have low gradients or infrequent updates should have a faster learning rate so that they get trained quickly.
        * It divides the learning rate by the square root of the sum of squares of all previous gradients of the parameter.
        * When the sum of the squared past gradients has a high value, the learning rate is divided by a high value, so the learning rate becomes smaller.
        * Similarly, if the sum of the squared past gradients has a low value, the learning rate is divided by a lower value, so the learning rate becomes higher.
        * This implies that the learning rate is inversely proportional to the square root of the sum of the squares of all the previous gradients of the parameter.
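A minimal single-parameter sketch of the Adagrad update described above, applied to the toy cost `f(w) = w^2`. The names `cache` and `eps`, the base learning rate, and the toy cost are assumptions for the example; `eps` is a small constant commonly added to avoid division by zero.

```python
import math

def adagrad_step(w, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad update; `cache` is the running sum of squared gradients."""
    cache += grad ** 2
    w -= lr * grad / (math.sqrt(cache) + eps)
    return w, cache

w, cache = 5.0, 0.0
effective_lrs = []
for _ in range(5):
    grad = 2 * w  # gradient of the toy cost f(w) = w^2
    w, cache = adagrad_step(w, grad, cache)
    effective_lrs.append(0.1 / (math.sqrt(cache) + 1e-8))
```

The values in `effective_lrs` shrink at every step because `cache` can only grow, which is exactly the behaviour the list above describes.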
    * `Adadelta`: Adadelta belongs to the family of stochastic gradient descent algorithms that adapt the learning rate during training. Adadelta is probably short for ‘adaptive delta’, where delta here refers to the difference between the current weight and the newly updated weight.
        * The main disadvantage of Adagrad is its accumulation of the squared gradients. During the training process, the accumulated sum keeps growing. As the accumulated sum increases, the learning rate shrinks and eventually becomes infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
        * Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done.
        * With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
    * `Adam`: [RMSProp+Momentum] Adaptive Moment Estimation (Adam) combines ideas from both RMSProp and Momentum. It computes adaptive learning rates for each parameter and works as follows.
        * `initialization` $V_{dW} = 0, S_{dW} = 0, V_{db} = 0, S_{db} = 0$
        * On iteration `t` compute $dW$ and $db$ using the current mini-batch
            * `Momentum` uses $\beta_1$: $V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\,dW$ and $V_{db} = \beta_1 V_{db} + (1-\beta_1)\,db$, with bias correction $V_{dW}^{corrected} = V_{dW}/(1-\beta_1^t)$ and $V_{db}^{corrected} = V_{db}/(1-\beta_1^t)$
            * `RMSprop` uses $\beta_2$: $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\,dW^2$ and $S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2$, with bias correction $S_{dW}^{corrected} = S_{dW}/(1-\beta_2^t)$ and $S_{db}^{corrected} = S_{db}/(1-\beta_2^t)$
            * parameter update $W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \epsilon}$
            * parameter update $b = b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \epsilon}$
            * Hyper-parameters:
                * learning rate `alpha` has to be tuned
                * `beta_1` controls the exponentially weighted average of the derivatives $dW$ (the first moment); `beta_1` is usually 0.9
                * `beta_2` controls the exponentially weighted average of the squares of the derivatives $dW^2$ (the second moment); `beta_2` is usually 0.999
                * epsilon should be small, on the order of `10^-8`, and doesn't impact performance much
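The update rules above translate almost line for line into code. This sketch handles a single scalar parameter `w` and uses the toy cost `f(w) = w^2`; the function name, the toy cost, and the larger-than-usual `alpha=0.1` used in the loop are assumptions for the example.

```python
import math

def adam_step(w, grad, v, s, t, alpha=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration."""
    v = beta_1 * v + (1 - beta_1) * grad       # momentum term (first moment)
    s = beta_2 * s + (1 - beta_2) * grad ** 2  # RMSprop term (second moment)
    v_corrected = v / (1 - beta_1 ** t)        # bias correction
    s_corrected = s / (1 - beta_2 ** t)
    w = w - alpha * v_corrected / (math.sqrt(s_corrected) + eps)
    return w, v, s

w, v, s = 5.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * w  # gradient of the toy cost f(w) = w^2
    w, v, s = adam_step(w, grad, v, s, t, alpha=0.1)
```

On the very first iteration the bias-corrected ratio is `grad / |grad|`, so the step size is approximately `alpha` regardless of the gradient's scale.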

    * `Momentum`: Speeds up GD and damps the oscillation in GD.
        * It is used in conjunction with Stochastic Gradient Descent (SGD) or Mini-Batch Gradient Descent. Momentum takes into account past gradients to smooth out the update. This is seen in the variable $v$, which is an exponentially weighted average of the gradients on previous steps. This results in smaller oscillations and `faster convergence`.
        * When you are doing gradient descent, the oscillation at every step slows down gradient descent and stops you from using a larger learning rate. In fact, if you use a larger learning rate you might overshoot.
        * If you use GD with momentum, in every iteration you compute the derivatives on the current mini-batch/batch and you compute a moving average of the derivatives to update your weight/bias parameters.
        * It computes an exponentially weighted average of your gradients and uses that average to update your weights.
        * GD with momentum: Averaging out the gradients helps smooth out the steps of gradient descent; the oscillation in the vertical direction tends to average out to something close to 0, whereas in the horizontal direction all the derivatives point in the right direction, so the average is still big. Hence the algorithm takes a straighter path to the minimum.
            * Compute $dW$ and $db$ on every mini-batch; $v_{dW}$ and $v_{db}$ are initialized to 0
            * Beta of 0.9 is good in practice
            * Alpha is the learning rate
            * $v_{dW}$: exponentially weighted average of the derivatives $dW$ of the current batch or mini-batch
            * $v_{dW} = \beta v_{dW} + (1-\beta)\,dW$ and parameter update $W = W - \alpha v_{dW}$
            * $v_{db} = \beta v_{db} + (1-\beta)\,db$ and parameter update $b = b - \alpha v_{db}$
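The averaging argument above can be checked numerically. The sketch below implements the momentum update for a scalar parameter and then feeds the same exponentially weighted average two made-up gradient sequences: one that flips sign every step (the "vertical" direction) and one that always points the same way (the "horizontal" direction). Function names and gradient values are illustrative.

```python
def momentum_step(w, grad, v, alpha=0.1, beta=0.9):
    """One gradient-descent-with-momentum update for a scalar parameter."""
    v = beta * v + (1 - beta) * grad
    w = w - alpha * v
    return w, v

def averaged_gradient(grads, beta=0.9):
    """Exponentially weighted average v after seeing all grads (v starts at 0)."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

v_oscillating = averaged_gradient([1.0, -1.0] * 25)  # sign flips every step
v_consistent = averaged_gradient([1.0] * 50)         # always the same direction
```

The oscillating gradients largely cancel (the average stays close to 0), while the consistent gradients keep an average close to 1, so the update moves mostly along the consistent direction.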
    * `RMSProp`: Speeds up GD and damps the oscillation in GD. Another adaptive learning rate optimization algorithm, Root Mean Square Prop (RMSProp) works by keeping an exponentially weighted average of the squares of past gradients. RMSProp then divides the learning rate by the square root of this average to speed up convergence.
            * Compute $dW$ and $db$ on every mini-batch; $S_{dW}$ and $S_{db}$ are initialized to 0
            * Beta of 0.9 is good in practice
            * Alpha is the learning rate
            * $S_{dW}$: exponentially weighted average of the squared derivatives $dW^2$
            * $S_{dW} = \beta S_{dW} + (1-\beta)\,dW^2$ and parameter update $W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}$
            * $S_{db} = \beta S_{db} + (1-\beta)\,db^2$ and parameter update $b = b - \alpha \frac{db}{\sqrt{S_{db}}}$
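A scalar sketch of the RMSProp update above. The small `eps` in the denominator is not in the formulas above; it is a common numerical-stability addition and is included here as an assumption, as are the function name and the two example gradients.

```python
import math

def rmsprop_step(w, grad, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter."""
    s = beta * s + (1 - beta) * grad ** 2
    w = w - alpha * grad / (math.sqrt(s) + eps)  # eps avoids division by zero
    return w, s

# Two parameters whose gradients differ by four orders of magnitude.
w_steep, s_steep = rmsprop_step(0.0, 100.0, 0.0)
w_flat, s_flat = rmsprop_step(0.0, 0.01, 0.0)
```

Dividing by the root of the squared-gradient average makes both updates nearly the same size, which damps steps in steep directions and speeds them up in flat ones.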
    * `SGD`: SGD stands for Stochastic Gradient Descent. In Stochastic Gradient Descent, a few samples (in the strict form, a single sample) are selected randomly instead of the whole data set for each iteration. In Gradient Descent, the term “batch” denotes the number of samples from a dataset that are used for calculating the gradient in each iteration.
        * In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset. Using the whole dataset is really useful for getting to the minima in a less noisy or less random manner, but a problem arises when our datasets get really huge.
        * This problem is solved by Stochastic Gradient Descent, which uses only a single sample to perform each iteration. The data is randomly shuffled and a sample is selected for performing the iteration.
        * Since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minima is usually noisier than with typical Gradient Descent. But that doesn’t matter all that much, because the path taken does not matter as long as we reach the minima, and we do so with significantly shorter training time.
    * `Learning rate decay`: it speeds up learning algorithm by reducing the learning rate over time
        * `epoch` : 1 epoch is one pass through the data
        * $\alpha_0$: initial learning rate
        * learning rate: $\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_number}} \alpha_0$
        * The learning rate decreases as a function of the epoch number. For example, with $\alpha_0 = 0.2$ and decay_rate = 1, epochs 1, 2, 3 give $\alpha$ = 0.1, 0.067, 0.05
        * `Exponential decay`: another method for learning rate decay. $\alpha = 0.95^{\text{epoch\_number}} \alpha_0$
        * other formulas: $\alpha = \frac{k}{\sqrt{\text{epoch\_number}}} \alpha_0$, where $k$ is a constant
        * `manual decay`: choose alpha by hand as training progresses
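
        The three decay formulas above can be written as small functions. The function names and the default values for `base` and `k` are illustrative choices, not fixed conventions.

        ```python
        import math

        def inverse_decay(alpha_0, decay_rate, epoch_number):
            """alpha = alpha_0 / (1 + decay_rate * epoch_number)"""
            return alpha_0 / (1 + decay_rate * epoch_number)

        def exponential_decay(alpha_0, epoch_number, base=0.95):
            """alpha = base ** epoch_number * alpha_0"""
            return (base ** epoch_number) * alpha_0

        def sqrt_decay(alpha_0, epoch_number, k=1.0):
            """alpha = k / sqrt(epoch_number) * alpha_0, for epoch_number >= 1"""
            return k / math.sqrt(epoch_number) * alpha_0

        # Compare the schedules for the first few epochs with alpha_0 = 0.2.
        for epoch in range(1, 5):
            print(
                epoch,
                round(inverse_decay(0.2, 1, epoch), 4),
                round(exponential_decay(0.2, epoch), 4),
                round(sqrt_decay(0.2, epoch), 4),
            )
        ```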

    * Additional strategies for optimizing SGD
        * Shuffling and Curriculum Learning
        * Batch normalization
        * Early Stopping
        * Gradient noise
    * `saddle points`: When you are training a NN, most points with zero gradient are not local optima; they are saddle points.
        * In a very high-dimensional space, at a point where the gradient is zero, the function can look either convex-like (curving up) or concave-like (curving down) in each direction.
        * If you are in, say, a 20,000-dimensional space, then for the point to be a local minimum, the function must curve upward in all 20,000 directions.
        * The chance of that happening is very small, maybe $2^{-20000}$. You are much more likely to get some directions where the curve bends up and some where it bends down.
        * That is why in very high-dimensional spaces you are much more likely to run into a saddle point than a local optimum.
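
        A two-dimensional illustration of a saddle point, using the function f(x, y) = x^2 - y^2 as an assumed example (it does not come from these notes). The gradient is zero at the origin, but the Hessian has one positive and one negative eigenvalue, so the surface bends up in one direction and down in the other.

        ```python
        import numpy as np

        def grad(x, y):
            """Gradient of f(x, y) = x**2 - y**2."""
            return np.array([2.0 * x, -2.0 * y])

        # The Hessian of f is constant: curvature +2 along x and -2 along y.
        hessian = np.array([[2.0, 0.0],
                            [0.0, -2.0]])
        eigenvalues = np.linalg.eigvalsh(hessian)  # sorted in ascending order

        is_critical = np.allclose(grad(0.0, 0.0), 0.0)
        is_saddle = bool(is_critical and eigenvalues[0] < 0 < eigenvalues[-1])
        print(eigenvalues, is_saddle)

        # Rough heuristic from the notes: if each of n directions bends up with
        # probability 1/2, all n bend up with probability 2**-n.
        # For n = 20,000 that is about 10**-6021.
        log10_prob = -20000 * np.log10(2.0)
        print(log10_prob)
        ```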
    * `Problem of plateau`: If local optima aren't a problem, then what is? It turns out that plateaus can really slow down learning. A plateau is a region where the derivative is close to zero for a long time.
        * On a plateau the surface is quite flat and the gradient is zero or near zero, so gradient descent moves very slowly and can take a very long time to cross it. Eventually, perhaps because of a random perturbation to the left or right, the algorithm finds its way off the plateau.
        * First takeaway: you are pretty unlikely to get stuck in a bad local optimum as long as you are training a reasonably large neural network, with a lot of parameters, and the cost function J is defined over a relatively high-dimensional space.
        * Second takeaway: plateaus are a problem and can make learning pretty slow. This is where more sophisticated optimization algorithms such as momentum, RMSprop, or Adam can help, by speeding up the rate at which you move down the plateau and get off it.
        * Because a network solves optimization problems over such high-dimensional spaces, no one has great intuitions about what these spaces really look like, and our understanding of them is still evolving.

* `Hyper-parameter tuning`:
    * In order of importance: learning rate (alpha: most important), momentum (Beta: usually 0.9), Adam optimizer (Beta_1, Beta_2, epsilon: usually 0.9, 0.999, 10^-8), number of layers, number of hidden units, learning rate decay, mini-batch size
    * Define the search ranges on a log scale rather than a linear scale, then choose points at random within them instead of stepping through a regular grid.
        * As an example, say hyperparameter one is alpha, the learning rate, and, to take an extreme case, hyperparameter two is epsilon, the value in the denominator of the Adam update.
        * Your choice of alpha matters a lot and your choice of epsilon hardly matters. If you sample on a 5 x 5 grid, you have really tried only five values of alpha, and all the different values of epsilon may give essentially the same answer.
            * With a systematic grid search you train 25 models but try only five values of the learning rate alpha, which is the hyperparameter that matters.
            * In contrast, if you sample at random, you try 25 distinct values of alpha and are therefore more likely to find a value that works really well.
        * With, say, three hyperparameters, you search over a cube instead of a square, where the third dimension is hyperparameter three. Sampling within this three-dimensional cube lets you try many more values of each of the three hyperparameters.
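
The grid-versus-random comparison above can be sketched with NumPy. The search ranges (alpha between 1e-4 and 1, epsilon between 1e-10 and 1e-6) are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 25

# Grid search: 5 x 5 = 25 models, but only 5 distinct learning rates.
grid_alphas = np.logspace(-4, 0, 5)
grid_epsilons = np.logspace(-10, -6, 5)
grid = [(a, e) for a in grid_alphas for e in grid_epsilons]

# Random search on a log scale: sample the exponent uniformly, then
# exponentiate, so that every decade is equally likely.
alphas = 10.0 ** rng.uniform(-4.0, 0.0, size=n_trials)
epsilons = 10.0 ** rng.uniform(-10.0, -6.0, size=n_trials)
random_trials = list(zip(alphas, epsilons))

print(len(grid), len({a for a, _ in grid}))        # models trained, distinct alphas
print(len(random_trials), len(np.unique(alphas)))  # models trained, distinct alphas
```

Both searches train 25 models, but the random one tries 25 different learning rates where the grid tries 5.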

* `Single Number Evaluation Metric`:
    * During hyperparameter tuning, or when trying out different ML algorithms, it is often recommended that you set up a single real-number evaluation metric for the problem.
    * One reasonable way to evaluate a classifier is its F1 score, the harmonic mean of precision P and recall R: $F_1 = \frac{2}{1/P + 1/R} = \frac{2PR}{P + R}$.
    * A single metric speeds up the iterative process of improving your machine learning algorithm, because candidate models can be ranked directly.
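
A short sketch of ranking classifiers by F1 score. The two classifiers and their precision and recall values are hypothetical numbers used only to show the calculation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if either is zero."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2.0 / (1.0 / precision + 1.0 / recall)

# Hypothetical classifiers: name -> (precision, recall)
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 4))
print("best:", best)
```

Classifier B has the higher precision and A the higher recall, so neither number alone settles the choice; the single F1 value does.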

* `Evaluation Metric: Optimizing (Accuracy) and Satisficing (Running Time) Metrics`:
    * It's not always easy to combine all the things you care about into a single-number evaluation metric.
    * Defining one optimizing metric and one or more satisficing metrics gives you a clear way to pick the "best" classifier.
        * Let's say you're building a system to detect wake words, also called trigger words. This refers to voice-controlled devices like the Amazon Echo, which you wake up by saying "Alexa", Google devices, which you wake up by saying "Okay Google", Apple devices, which you wake up by saying "Hey Siri", or Baidu devices, which you wake up by saying "ni hao Baidu".
        * In this case, one reasonable way of combining the two evaluation metrics is to maximize accuracy (when someone says a trigger word, maximize the chance that the device wakes up), subject to having at most one false positive every 24 hours of operation. Accuracy is the optimizing metric and the false-positive rate is the satisficing metric; running time can be handled the same way, as a limit that only needs to be met.
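
The selection rule for the wake-word example can be sketched as a filter followed by a maximum. The candidate models and their numbers are hypothetical, and `pick_best` is a name introduced here for illustration.

```python
# (name, accuracy, false positives per 24 hours) -- hypothetical numbers
candidates = [
    ("A", 0.90, 0.5),
    ("B", 0.95, 3.0),
    ("C", 0.92, 1.0),
]

def pick_best(models, max_false_positives_per_day=1.0):
    """Keep models that meet the satisficing limit, then maximize accuracy."""
    feasible = [m for m in models if m[2] <= max_false_positives_per_day]
    if not feasible:
        return None  # no model satisfies the constraint
    return max(feasible, key=lambda m: m[1])

best = pick_best(candidates)
print(best)
```

Model B has the highest accuracy but breaks the false-positive limit, so it is excluded before accuracy is compared.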

* Set up `training, dev (development), test sets`
    * The dev and test sets should come from the same distribution, so that doing well on the dev set is a good indication that you will do well on the test set too.
    * Once you have established the dev set and the metric, the team can innovate very quickly: try different ideas, run experiments, and use the dev set and the metric to evaluate classifiers and pick the best one.
    * This setup can have a huge impact on how rapidly you or your team can make progress on building an ML application.
    * In ML applications with smaller data, we often split the data into 70% train and 30% test, or alternatively 60% train, 20% dev, and 20% test sets.
    * In the modern ML era (deep learning), where we work with much larger data, it is quite reasonable to use much less than 20% or 30% of your data for a dev set or a test set.
        * Say you have a million examples: it might be quite reasonable to set up your data so that you have 98% in the training set, 1% dev, and 1% test.
        * For some applications, you may not need high confidence in the overall performance of your final system. In that case a train and dev set may be all you need, and not having a test set might be okay.
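
A minimal sketch of the 98% / 1% / 1% split for a million examples. The helper `split_indices` is a name made up for this example; it shuffles the indices first so that the dev and test sets are drawn from the same distribution.

```python
import numpy as np

def split_indices(n_examples, dev_fraction=0.01, test_fraction=0.01, seed=0):
    """Shuffle example indices and split them into train, dev, and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_examples)
    n_dev = round(n_examples * dev_fraction)
    n_test = round(n_examples * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))
```

The returned arrays hold indices, so they can be used to slice the feature and label arrays in the same way.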