## Neural Networks Basics

Networks come in all shapes and sizes.

- We can add more features (nodes) in the input layer.
- We can add more nodes in the hidden layer. Also, we can simply add more hidden layers. This is what turns a neural network into a "deep" neural network (hence, deep learning).
- We can have several nodes in the output layer.

And there is one more thing that makes deep learning extremely powerful: unlike many other statistical and machine learning techniques, deep learning can deal extremely well with **unstructured data**.

Types of neural networks:
- Standard neural networks
- Convolutional neural networks (input = images, video)
- Recurrent neural networks (input = audio files, text, time series data)
- Generative adversarial networks

#### Logistic regression as a neural network (no hidden layers)
To make a prediction $\hat y$ from an input $x \in \mathbb{R}^n$, we need an expression that maps $x$ to an output.
The parameters here are $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. A first candidate is the linear expression $\hat y = w^T x + b$. The problem, however, is that this expression does not ensure that $\hat y$ lies between zero and one: it can be much bigger than one, or even negative. Passing the linear output through an activation function such as the sigmoid fixes this.

#### activation functions
**sigmoid function**
Recall that the mathematical expression of the sigmoid is $ a=\dfrac{1}{1+ \exp(-z)}$, and this outputs activation values somewhere between 0 and 1.
    
    def sigmoid(x, derivative=False):
        f = 1 / (1 + np.exp(-x))
        if derivative:
            return f * (1 - f)
        return f

**tanh**
The hyperbolic tangent (or tanh) function ranges between -1 and +1 and is in fact a shifted and rescaled version of the sigmoid function, $\tanh(z) = 2\sigma(2z) - 1$, with formula $ a=\dfrac{\exp(z)- \exp(-z)}{\exp(z)+ \exp(-z)}$. For intermediate layers, the tanh function generally performs well because, with values between -1 and +1, the mean of the activations coming out is closer to zero.

    def tanh(x, derivative=False):
        f = np.tanh(x)
        if (derivative == True):
            return (1 - (f ** 2))
        return np.tanh(x)
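
The link between tanh and the sigmoid can be checked numerically. The snippet below is a minimal sketch that assumes only NumPy; the local `sigmoid` helper is redefined here so that the check stands on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# tanh is the sigmoid rescaled to (-1, 1), with its input stretched by a factor of 2
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0
identity_holds = np.allclose(np.tanh(z), tanh_from_sigmoid)
```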
        
**ReLU** (Rectified Linear Unit function)
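This is probably the most popular activation function, along with the tanh! Its formula is $a=\max(0,z)$. The fact that the activation is exactly 0 when $z <0$ is slightly cumbersome when taking derivatives though: the derivative is then exactly 0 as well, so a unit that only receives negative inputs stops updating.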

    def relu(x, derivative=False):
        x = np.asarray(x, dtype=float)
        if derivative:
            # slope is 1 where x > 0 and 0 elsewhere
            return (x > 0).astype(float)
        return np.maximum(0, x)

**leaky ReLU**

The leaky ReLU solves the derivative issue by allowing the activation to be slightly negative when $z <0$: $a=\max(\epsilon z, z)$ for a small leakage factor $\epsilon$ such as 0.001 (the code below defaults to 0.05).

    def leaky_relu(x, leakage=0.05, derivative=False):
        x = np.asarray(x, dtype=float)
        if derivative:
            # slope is 1 where x > 0 and `leakage` elsewhere
            return np.where(x > 0, 1.0, leakage)
        return np.where(x > 0, x, x * leakage)
        
**arctan**
The inverse tangent (arctan) function has a lot of the same qualities that tanh has, but its range goes from $-\pi/2$ to $\pi/2$ (roughly -1.57 to 1.57), and the slope is more gentle than the one we saw using the tanh function.

    def arctan(x, derivative=False):
        if (derivative == True):
            return 1/(1+np.square(x))
        return np.arctan(x)

    z = np.arange(-10,10,0.2)
    
#### Loss and Cost Function
The **loss function** is used to measure the inconsistency between the predicted value $(\hat y)$ and the actual label $y$. In logistic regression the loss function is defined as
$\mathcal{L}(\hat y, y) = - ( y \log (\hat y) + (1-y) \log(1-\hat y))$. The advantage of this loss function expression is that the optimization space here is convex, which makes optimizing using gradient descent easier. The loss function is defined over 1 particular training sample. 

The **cost function** takes the average loss over all the samples: $J(w,b) = \displaystyle\frac{1}{l}\displaystyle\sum^l_{i=1}\mathcal{L}(\hat y^{(i)}, y^{(i)})$
When you train your logistic regression model, the purpose is to find parameters $w$ and $b$ such that your cost function is minimized!

#### Forward propagation
Initialize $J= 0$, $dw_1= 0$, $dw_2= 0$, $db= 0$. 

For each training sample $1,...,l$ you'll need to compute:

$ z^{(i)} = w^T x^ {(i)} +b $

$\hat y^{(i)} = \sigma (z^{(i)})$

$dz^{(i)} = \hat y^{(i)}- y^{(i)}$

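#### Backward propagation (after updating values with forward propagation)
Still inside the loop over the samples, accumulate the loss and the gradients:

$J \mathrel{+}= - [y^{(i)} \log (\hat y^{(i)}) + (1-y^{(i)}) \log(1-\hat y^{(i)})]$

$dw_1 \mathrel{+}= x_1^{(i)} * dz^{(i)}$

$dw_2 \mathrel{+}= x_2^{(i)} * dz^{(i)}$

$db \mathrel{+}= dz^{(i)}$

After the loop, average each accumulated quantity over the $l$ samples:

$J := \dfrac{J}{l}$, $dw_1 := \dfrac{dw_1}{l}$, $dw_2 := \dfrac{dw_2}{l}$, $db := \dfrac{db}{l}$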

#### Update weights
$w_1 := w_1 - \alpha dw_1$

$w_2 := w_2 - \alpha dw_2$

$b := b - \alpha db$

repeat until convergence!
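
The forward pass, gradient accumulation and weight update above can be written in vectorized form. This is a minimal sketch, not a reference implementation: the function name `train_logistic_regression`, the toy data and the default values of `alpha` and `n_iter` are assumptions made for illustration.

```python
import numpy as np

def train_logistic_regression(X, y, alpha=0.1, n_iter=1000):
    """X has shape (n_features, l); y has shape (l,) with 0/1 labels."""
    n, l = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        z = w @ X + b                      # forward: z = w^T x + b for every sample
        y_hat = 1.0 / (1.0 + np.exp(-z))   # forward: sigmoid activation
        dz = y_hat - y                     # derivative of the loss with respect to z
        dw = X @ dz / l                    # gradient for w, averaged over the samples
        db = dz.mean()                     # gradient for b, averaged over the samples
        w -= alpha * dw                    # update step
        b -= alpha * db
    return w, b

# Toy data (made up for illustration): the label is 1 when x1 + x2 > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
y = (X[0] + X[1] > 0).astype(float)

w, b = train_logistic_regression(X, y)
accuracy = np.mean(((w @ X + b) > 0) == (y == 1))
```

Summing over samples with a matrix product replaces the explicit `for` loop over $i$, which is usually much faster in NumPy.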

## Deep Neural Networks Process Overview

To summarize the process once more, we begin by defining a model architecture, which includes the number of hidden layers, the number of units in each layer, and the activation function (such as sigmoid or ReLU) for each layer.

We then initialize parameters for each of these layers (typically randomly). After the initial parameters are set, forward propagation evaluates the model, giving a prediction, which is then used to evaluate a cost function. Forward propagation involves evaluating each layer and then piping its output into the next layer.

Each layer consists of a linear transformation and an activation function.  The parameters for the linear transformation in **each** layer include $W^l$ and $b^l$. The output of this linear transformation is represented by $Z^l$. This is then fed through the activation function (again, for each layer) giving us an output $A^l$ which is the input for the next layer of the model.  

After forward propagation is completed and the cost function is evaluated, backpropagation is used to calculate the gradients of this cost function with respect to the parameters. Finally, these gradients are used in an optimization algorithm, such as gradient descent, to make small adjustments to the parameters, and the entire process of forward propagation, backpropagation and parameter adjustment is repeated until the modeller is satisfied with the results.

## Creating a Deep Neural Network from Scratch
   
    import numpy as np
    import h5py
    import matplotlib.pyplot as plt

    %matplotlib inline
    plt.rcParams['figure.figsize'] = (5.0, 5.0) 
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    %load_ext autoreload
    %autoreload 2

    np.random.seed(123)
    
#### Initialization in an L-layer Neural Network
    def initialize_parameters(n_0, n_1, n_2):
        np.random.seed(123)
        W1 = np.random.randn(n_1, n_0) * 0.05
        b1 = np.zeros((n_1, 1))
        W2 = np.random.randn(n_2, n_1) * 0.05
        b2 = np.zeros((n_2, 1))

        parameters = {"W1": W1,
                      "b1": b1,
                      "W2": W2,
                      "b2": b2}

        return parameters
    
#### create a dictionary of parameters for W and b given a list of layer dimensions.
    #Simply randomly initialize values in accordance to the shape each parameter should have.
    #Use random seed 123 (as provided)
    def initialize_parameters_deep(list_layer_dimensions):

        np.random.seed(123)
        parameters = {}
        L = len(list_layer_dimensions)           

        for l in range(1, L):
            parameters['W' + str(l)] = np.random.randn(list_layer_dimensions[l], list_layer_dimensions[l-1])*0.05
            parameters['b' + str(l)] = np.zeros((list_layer_dimensions[l], 1))

        return parameters
    
#### forward propagation through linear activation
    def linear_activation_forward(A_prev, W, b, activation):
        Z = np.dot(W, A_prev) + b  # linear transformation for this layer
        linear_cache = (A_prev, W, b)
        activation_cache = Z

        # Here we define two possible activation functions
        if activation == "sigmoid":
            A = 1/(1+np.exp(-Z))
        elif activation == "relu":
            A = np.maximum(0,Z)
        else:
            raise ValueError("activation must be 'sigmoid' or 'relu'")
        assert (A.shape == (W.shape[0], A_prev.shape[1]))
        cache = (linear_cache, activation_cache)

        return A, cache
        
#### continue forward propagation through L layer
    def L_model_forward(X, parameters):
        # Initialize a cache list to keep track of the caches
        caches = []
        A = X
        L = len(parameters) // 2 # number of layers in the neural network

        # Implement the RELU activation L-1 times. Add "cache" to the "caches" list.
        for l in range(1, L):
            A_prev = A
            A, cache = linear_activation_forward(A_prev, parameters['W'+ str(l)], parameters['b' + str(l)], activation = "relu")        
            caches.append(cache)
        #Implement the sigmoid function for the last layer. Add "cache" to the "caches" list.
        AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "sigmoid")
        caches.append(cache)

        assert(AL.shape == (1,X.shape[1]))

        return AL, caches
        
#### create cost function
    def compute_cost(AL, Y):

        m = Y.shape[1]

        cost = -(1/m)* np.sum((Y*np.log(AL))+ (1-Y)*np.log(1-AL))
        cost = np.squeeze(cost)      #turn [[17]] into 17

        return cost
    
#### start backward propagation with a linear backward function  
    def linear_backward(dZ, cache):
        A_prev, W, b = cache #Unpacking our complex object
        m = A_prev.shape[1]

        dW = (1/m) * np.dot(dZ,A_prev.T)
        db = (1/m) * np.sum(dZ, axis =1, keepdims = True)
        dA_prev = np.dot(W.T, dZ)  # gradient passed back to the previous layer

        return dA_prev, dW, db
        
#### combine activation to linear backward function
    def linear_activation_backward(dA, cache, activation):
        linear_cache, activation_cache = cache
        Z= activation_cache

        if activation == "sigmoid": 
            s = 1/(1+np.exp(-Z))
            dZ = dA * s * (1-s)
            dA_prev, dW, db = linear_backward(dZ, linear_cache)

        elif activation == "relu":
            dZ = np.array(dA, copy=True) # just converting dz to a correct object.
            dZ[Z <= 0] = 0
            dA_prev, dW, db = linear_backward(dZ, linear_cache)

        return dA_prev, dW, db
        
#### commence backward propagation
    def L_model_backward(AL, Y, caches):
        grads = {}
        L = len(caches) # the number of layers
        m = AL.shape[1]
        Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

        # Initializing the backpropagation
        dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

        # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
        current_cache = caches[L-1]
        grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")

        # Loop from l=L-2 to l=0
        for l in reversed(range(L-1)):
            # (RELU -> LINEAR) gradients
            # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
            current_cache = caches[l]
            dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l+1)], current_cache, activation = "relu")
            grads["dA" + str(l)] = dA_prev_temp
            grads["dW" + str(l + 1)] = dW_temp
            grads["db" + str(l + 1)] = db_temp

        return grads 
    
#### update parameters with updated weights
    def update_parameters(parameters, grads, learning_rate):

        L = len(parameters) // 2 # number of layers in the neural network


        for l in range(L):
            parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
            parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
        return parameters

## Image Recognition using Deep Learning from Scratch
    
#### import images
    import matplotlib.image as mpimg
    filename = 'data/validation/santa/00000448.jpg'
    img=mpimg.imread(filename)
    plt.imshow(img)
    print(img.shape)
    plt.show()
    
#### keras preprocessing through image downgrade
    import time
    import matplotlib.pyplot as plt
    import scipy
    from PIL import Image
    from scipy import ndimage
    from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    np.random.seed(1)
    
#### set directory path
    train_data_dir = 'data/train'
    test_data_dir = 'data/validation'

#### get all the data in the directory data/validation (132 images), and reshape them
    test_generator = ImageDataGenerator().flow_from_directory(
            test_data_dir, 
            target_size=(64, 64), batch_size=132) 

#### get all the data in the directory data/train (790 images), and reshape them
    train_generator = ImageDataGenerator().flow_from_directory(
            train_data_dir, 
            target_size=(64, 64), batch_size=790)

#### create the data sets
    train_images, train_labels = next(train_generator)
    test_images, test_labels = next(test_generator)
    
#### Explore your dataset again
    m_train = train_images.shape[0]
    num_px = train_images.shape[1]
    m_test = test_images.shape[0]

    print ("Number of training examples: " + str(m_train))
    print ("Number of testing examples: " + str(m_test))
    print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")
    print ("train_images shape: " + str(train_images.shape))
    print ("train_labels shape: " + str(train_labels.shape))
    print ("test_images_orig shape: " + str(test_images.shape))
    print ("test_labels shape: " + str(test_labels.shape))
    
#### Reshape the training and test examples 
    train_img = train_images.reshape(train_images.shape[0], -1).T   # The "-1" makes reshape flatten the remaining dimensions
    test_img = test_images.reshape(test_images.shape[0], -1).T

    # Scale pixel values from [0, 255] to [0, 1].
    train_x = train_img/255.
    test_x = test_img/255.

    print ("train_img's shape: " + str(train_img.shape))
    print ("test_img's shape: " + str(test_img.shape))

#### Reshape the labels
    train_labels_final = train_labels.T[[1]]
    test_labels_final = test_labels.T[[1]]

    print ("train_labels_final's shape: " + str(train_labels_final.shape))
    print ("test_labels_final's shape: " + str(test_labels_final.shape))
    
#### putting the functions together
    def L_layer_model(X, Y, layers_dims, learning_rate=0.005, num_iterations=3000, print_cost=False):
        np.random.seed(1)
        costs = []

        # Parameters initialization
        parameters = initialize_parameters_deep(layers_dims)

        # Loop (gradient descent)
        for i in range(0, num_iterations):

            # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID
            AL, caches = L_model_forward(X, parameters)

            # Compute cost
            cost = compute_cost(AL, Y)

            # Backward propagation
            grads = L_model_backward(AL, Y, caches)

            # Update parameters
            parameters = update_parameters(parameters, grads, learning_rate)

            # Print and record the cost every 100 iterations
            if print_cost and i % 100 == 0:
                print("Cost after iteration %i: %f" % (i, cost))
                costs.append(cost)

        # Plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('iterations (per hundreds)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        return parameters
        
#### run function    
    # Assumed example architecture: the hidden layer sizes are illustrative, not taken from the original notes
    layers_dims = [train_x.shape[0], 20, 7, 5, 1]
    # Train on the scaled pixel values (train_x) rather than the raw 0-255 values
    parameters = L_layer_model(train_x, train_labels_final, layers_dims, num_iterations = 1000, print_cost = True)

    # make predictions
    def predict(X, parameters, y=None):

        m = X.shape[1]

        # Forward propagation
        probs, caches = L_model_forward(X, parameters)

        # Convert probabilities to 0/1 predictions
        predictions = (probs > 0.50).astype(float)

        if y is not None:
            print("Accuracy: " + str(np.sum(predictions == y) / m))

        return predictions

#### print mislabeled images
    def print_mislabeled_images(classes, X, y, p):
        a = p + y
        mislabeled_indices = np.asarray(np.where(a == 1))
        plt.rcParams['figure.figsize'] = (90.0, 90.0) # set default size of plots
        num_images = len(mislabeled_indices[0])
        for i in range(num_images):
            index = mislabeled_indices[1][i]

            plt.subplot(2, num_images, i + 1)
            plt.imshow(X[:,index].reshape(64,64,3), interpolation='nearest')
            plt.axis('off')

    pred_test = predict(test_x, parameters, y=test_labels_final)
    print_mislabeled_images(list(train_generator.class_indices), test_x, test_labels_final, pred_test)

## Keras Basics

    #packages to import
    from keras import models
    from keras import layers
    from keras import optimizers

#### Deciding on the network architecture
    model = models.Sequential()

#### Adding layers
Once we have initialized a network object as shown above, we can add layers to the network, specifying for each one the number of units and the activation function to use. For example, when coding from scratch, we previously used the sigmoid and ReLU activation functions.

The `Dense` layer class indicates that this layer will be fully connected. There are other layer architectures that we will discuss further in upcoming labs and lessons.

Finally, the `input_shape` parameter is only needed for the first layer. In successive layers, Keras infers the required input shape from the output shape of the previous layer.

    model.add(layers.Dense(units, activation=activation, input_shape=input_shape))
    
#### Compiling the model
Once we have defined the network architecture and added layers to that network, we compile the model before training it on our data.

    model.compile(optimizer=optimizers.RMSprop(lr=0.001),
                  loss='mse',
                  metrics=['accuracy'])

#### Training the model
* **Sample**: one element of a dataset.  
    * *Example*: one image is a sample in a convolutional network  
    * *Example*: one audio file is a sample for a speech recognition model  
    
* **Batch**: a set of N samples. The samples in a batch are processed independently, in parallel. If training, a batch results in only one update to the model.  
* A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches will usually result in faster evaluation/prediction).
* **Epoch**: an arbitrary cutoff, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation.
* When using validation_data or validation_split with the fit method of Keras models, evaluation will be run at the end of every epoch.
* Within Keras, there is the ability to add callbacks specifically designed to be run at the end of an epoch. Examples of these are learning rate changes and model checkpointing (saving).

        history = model.fit(x_train,
                            y_train,
                            epochs=20,
                            batch_size=512,
                            validation_data=(x_val, y_val))
                            
#### Plotting 
When we fit the model as shown above, we not only update the model object itself, we are also returned a history associated with the model. (Hence our variable name.) With this, we can retrieve further information regarding how the model training progressed from epoch to epoch. To do this, you can access the history attribute of the returned object. Given our variable naming above, we would thus have:

```history.history```

This will return a dictionary of the metrics we indicated when compiling the model. By default, the loss will always be included as well. So in our example, this dictionary will have one key for the loss and one for the accuracy, plus a `val_`-prefixed counterpart of each because we passed `validation_data`. If you wish to plot learning curves for the loss or accuracy versus the epochs, you can then simply retrieve these lists. For example:

    history.history['loss']

would return a list of the loss at each epoch.

#### Making Predictions

As with scikit-learn and other prebuilt packages, making predictions from a trained model is relatively straightforward. To do this, you can simply use the `predict` method built into the model object. For example:

    y_hat = model.predict(x)
    
#### Evaluating the Model

Once the model has been trained, `predict` applies it to new data. Similarly, we can use the `evaluate` method to compute the loss and other specified metrics for our trained model.

For example,   

```model.evaluate(X_train, X_train_labels)``` will return the final loss associated with the model for the training data as well as any other metrics that were specified during compilation.

Similarly, 

```model.evaluate(X_test, X_test_labels)``` will return the final loss associated with the model for the test data as well as any other specified metrics.


## Tuning NN

#### Hyperparameters
- number of hidden units
- number of layers
- learning rate alpha
- activation function

**Hyperparameter level of performance**
Most important:
- $\alpha$

Important next:
- $\beta$ (momentum)
- Number of hidden units
- mini-batch-size

Finally:
- Number of layers
- Learning rate decay

Almost never tuned:
- $\beta_1$, $\beta_2$, $\epsilon$ (Adam)

Things to do:

- Don't search on a grid: it is hard to say in advance which hyperparameters will be important, so sampling combinations at random explores more distinct values of each one.
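
As an alternative to a grid, hyperparameter combinations can be drawn at random. The sketch below is one possible way to do this; the function name `sample_hyperparameters`, the search ranges and the choice of log-scale sampling for $\alpha$ and $\beta$ are all assumptions for illustration, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(123)

def sample_hyperparameters(n_trials):
    """Draw random configurations; the ranges are illustrative assumptions."""
    trials = []
    for _ in range(n_trials):
        trials.append({
            # learning rate: uniform on a log scale between 1e-4 and 1e-1
            "alpha": 10 ** rng.uniform(-4, -1),
            # momentum: 1 - beta is log-uniform between 0.001 and 0.1
            "beta": 1 - 10 ** rng.uniform(-3, -1),
            # hidden units: integer between 16 and 256 (inclusive)
            "hidden_units": int(rng.integers(16, 257)),
        })
    return trials

trials = sample_hyperparameters(5)
```

Each trial would then be trained and scored on the validation set, keeping the configuration with the best validation performance.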

#### Training, Test and Validation Sets
In short, we'll use 3 sets when running, selecting and validating a model:
- You train algorithms on the training set
- You'll use a validation set to decide which one will be your final model after parameter tuning
- After having chosen the final model (and having evaluated long enough), you'll use the test set to get an unbiased estimate of the classification performance (or whatever your evaluation metric will be).

With big data, your validation (dev, or hold-out) and test sets don't necessarily need to be 20-30% of all the data. You can choose validation and test sets that are each 1-5% of the data, e.g. 96% train, 2% validation, 2% test.

If your dataset is small, you can perform K-fold cross-validation on the training set. 
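
K-fold cross-validation splits the training indices into K parts and uses each part once as the validation fold. A minimal NumPy sketch follows; the helper name `k_fold_indices` and its defaults are hypothetical, and in practice a library utility would typically be used instead.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=123):
    """Yield (train_idx, val_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # every other fold is used for training
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```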

In this example, let's take the first 1000 cases out of the training set to become the validation set. You should do this for both `X_train` and `y_train`.

    # Slicing is deterministic, so no random seed is needed here
    val = X_train[:1000]
    train_final = X_train[1000:]
    label_val = y_train[:1000]
    label_train_final = y_train[1000:]
    
#### Normalization
- Normalized inputs speed up training
- Putting all features on the same scale helps to prevent exploding or vanishing gradients
- Done by:
    1. Subtracting the mean
    2. dividing by the standard deviation
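
The two steps above can be sketched as follows. This example assumes samples are stored as rows (shape `(n_samples, n_features)`), and the helper names `fit_standardizer` and `standardize` are made up for illustration. The mean and standard deviation are computed on the training set only and then reused for the validation set.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Return the per-feature mean and std computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps   # eps guards against constant features
    return mu, sigma

def standardize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(500, 3))  # made-up data
X_val = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

mu, sigma = fit_standardizer(X_train)
X_train_std = standardize(X_train, mu, sigma)
X_val_std = standardize(X_val, mu, sigma)   # reuse the training statistics
```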

#### Rules of Thumb Regarding Bias / Variance
| High bias? (training performance) | High variance? (validation performance) |
|---------------|-------------|
| Use a bigger network | More data |
| Train longer | Regularization |
| Look for other existing NN architectures | Look for other existing NN architectures |

#### Regularization
**use regularization when overfitting occurs**

- **L1-regularization** adds the following term to the cost function:

$$ \dfrac{\lambda}{m}||w||_1$$ (the denominator could also be $2m$)

- **L2-regularization** is the most common type of regularization. It adds the term $\dfrac{\lambda}{2m}||w||_2^2$ to the cost function.
- L2-regularization is also called weight decay, because each update first shrinks the weights by the factor

$\bigl(1- \dfrac{\alpha\lambda}{m}\bigr)$.

Intuition for regularization: the weight matrices are penalized for being too large, which effectively forces the network to be simpler.

Also, with an activation such as tanh, a small $w$ keeps the activation function operating mostly in its linear region, so it does not "explode" as easily.
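
To see where the weight decay factor comes from, the sketch below compares a gradient step on the L2-regularized cost with the equivalent "shrink, then step" form. It assumes the penalty $\frac{\lambda}{2m}||W||_2^2$, whose gradient is $\frac{\lambda}{m}W$; the function name and the numbers are illustrative.

```python
import numpy as np

def l2_regularized_step(W, dW, alpha, lam, m):
    """One gradient step with the L2 penalty (lam / (2 * m)) * ||W||^2 added to the cost."""
    return W - alpha * (dW + (lam / m) * W)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
dW = rng.normal(size=(3, 4))     # stand-in for the gradient of the unregularized cost
alpha, lam, m = 0.1, 0.7, 50

W_new = l2_regularized_step(W, dW, alpha, lam, m)
# Equivalent weight decay form: shrink W first, then apply the plain gradient step
W_decay = (1 - alpha * lam / m) * W - alpha * dW
```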

- **Dropout regularization**: for each node, flip a coin to decide whether to drop it out (you can also set the dropout probability to something other than 0.5).
- In different iterations through the training set, different nodes will be zeroed out!
- **When making predictions, don't do dropout!**

- **Early stopping**: overfitting happens when the model is trained for too long. Early stopping limits the number of epochs, which reduces overfitting.
- To pick the stopping point, plot loss and accuracy on the validation set and find the epoch where the loss is lowest and the accuracy is highest.
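
The dropout and early stopping bullets above can be sketched in a few lines. Both helpers are hypothetical illustrations: `dropout_forward` uses the common "inverted dropout" variant (rescaling by `keep_prob` during training, which is an assumption beyond these notes), and `best_epoch` applies a simple patience rule to a list of validation losses.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng, training=True):
    """Inverted dropout on activations A; returns (output, mask)."""
    if not training:
        return A, None                       # no dropout at prediction time
    mask = rng.random(A.shape) < keep_prob   # True = keep the node
    return A * mask / keep_prob, mask        # rescale so the expected value is unchanged

def best_epoch(val_losses, patience=3):
    """Index of the best epoch seen before `patience` epochs pass without improvement."""
    best, best_i = float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i = loss, i
        elif i - best_i >= patience:
            break                            # stop training here
    return best_i

rng = np.random.default_rng(123)
A = np.ones((4, 1000))
A_train, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
A_test, _ = dropout_forward(A, keep_prob=0.8, rng=rng, training=False)

stop_at = best_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.5])
```

In the early stopping example, training halts after three epochs without improvement, so the later value 0.5 is never reached.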

#### Optimization
In addition, we could use an alternative optimization algorithm instead of plain gradient descent. One issue with gradient descent is that it can oscillate to a fairly big extent when the derivative is much bigger in one direction (say, the vertical one) than in the other. There are other algorithms that converge faster:

**Gradient Descent with Momentum**
Compute an exponentially weighted average of the gradients and use that average instead of the raw gradient. The intuitive interpretation is that this will successively dampen oscillations, improving convergence.

Momentum:
compute $dW$ and $db$ on the current mini-batch.

Compute $V_{dw} = \beta V_{dw} + (1-\beta)dW$ and

Compute $V_{db} = \beta V_{db} + (1-\beta)db$

--> moving average for the derivatives of $W$ and $b$

$W:= W- \alpha V_{dw}$

$b:= b- \alpha V_{db}$

This averages out gradient descent, and will "dampen" oscillations.
Generally, $\beta=0.9$ is a good hyperparameter value.
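
The momentum update can be sketched in NumPy as below. The function name `momentum_step` and the toy quadratic loss are assumptions chosen for illustration; the loss is much steeper in one direction than the other, which is the situation where plain gradient descent tends to oscillate.

```python
import numpy as np

def momentum_step(W, dW, V_dW, alpha=0.01, beta=0.9):
    """One update of gradient descent with momentum; returns (W, V_dW)."""
    V_dW = beta * V_dW + (1 - beta) * dW   # exponentially weighted average of the gradients
    W = W - alpha * V_dW
    return W, V_dW

# Toy loss f(W) = 0.5 * (100 * W[0]**2 + W[1]**2), with gradient (100 * W[0], W[1])
W = np.array([1.0, 1.0])
V = np.zeros_like(W)
for _ in range(500):
    dW = np.array([100.0, 1.0]) * W
    W, V = momentum_step(W, dW, V, alpha=0.01, beta=0.9)

final_loss = 0.5 * (100 * W[0] ** 2 + W[1] ** 2)
```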

**RMSprop**
RMSprop: "root mean square" prop.

    model.compile(optimizer= "rmsprop" ,loss='mse',metrics=['mse'])

Slow down learning in one direction and speed it up in another.

On each iteration, use an exponentially weighted average again, this time of the squares of the derivatives:

$S_{dw} = \beta S_{dw} + (1-\beta)dW^2$

$S_{db} = \beta S_{db} + (1-\beta)db^2$

$W:= W- \alpha \dfrac{dW}{\sqrt{S_{dw}}}$ and

$b:= b- \alpha \dfrac{db}{\sqrt{S_{db}}}$

In the direction where we want to learn fast, the corresponding $S$ will be small, so we divide by a small number and the update is larger. On the other hand, in the direction where we want to learn slowly, the corresponding $S$ will be relatively large, and updates will be smaller.

Often, a small $\epsilon$ is added in the denominator to make sure that you don't end up dividing by 0.
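
A single RMSprop update might look like the sketch below. The function name `rmsprop_step` and the default values are assumptions; the example gradient has very different scales in its two components to show how dividing by $\sqrt{S}$ evens out the step sizes.

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; returns (W, S_dW)."""
    S_dW = beta * S_dW + (1 - beta) * dW ** 2      # running average of squared gradients
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)     # eps avoids division by zero
    return W, S_dW

W = np.array([1.0, 1.0])
S = np.zeros_like(W)
dW = np.array([100.0, 0.01])    # gradient components on very different scales
W_new, S = rmsprop_step(W, dW, S)
step = W - W_new
```

After this first update, both components move by roughly the same amount even though their gradients differ by four orders of magnitude.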

**Adam Optimization Algorithm**
"Adaptive Moment Estimation", basically using the first and second moment estimations.

    model.compile(optimizer= "Adam" ,loss='mse',metrics=['mse'])

Works very well in many situations!

Taking momentum and RMSprop and putting it together!

Initialize:

$V_{dw}=0, S_{dw}=0, V_{db}=0, S_{db}=0$.

each iteration:
Compute $dW, db$ using the current mini-batch

$V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dW$, $V_{db} = \beta_1 V_{db} + (1-\beta_1)db$ 

$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dW^2$, $S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$ 

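This is like momentum followed by RMSprop. Because $V$ and $S$ start at 0, they are biased towards 0 during the first iterations, so we need to perform a bias correction, where $t$ is the iteration number. This is sometimes also done in RMSprop, but definitely here.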


$V^{corr}_{dw}= \dfrac{V_{dw}}{1-\beta_1^t}$, $V^{corr}_{db}= \dfrac{V_{db}}{1-\beta_1^t}$

$S^{corr}_{dw}= \dfrac{S_{dw}}{1-\beta_2^t}$, $S^{corr}_{db}= \dfrac{S_{db}}{1-\beta_2^t}$

$W:= W- \alpha \dfrac{V^{corr}_{dw}}{\sqrt{S^{corr}_{dw}+\epsilon}}$ and

$b:= b- \alpha \dfrac{V^{corr}_{db}}{\sqrt{S^{corr}_{db}+\epsilon}}$ 
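
Putting the Adam equations together gives the sketch below. It follows the formulas above, with $\epsilon$ inside the square root; the function name `adam_step`, the default hyperparameters and the toy loss are assumptions for illustration, and library implementations may differ in details.

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (counting from t = 1); returns (W, V, S)."""
    V = beta1 * V + (1 - beta1) * dW           # first moment (momentum term)
    S = beta2 * S + (1 - beta2) * dW ** 2      # second moment (RMSprop term)
    V_corr = V / (1 - beta1 ** t)              # bias correction
    S_corr = S / (1 - beta2 ** t)
    W = W - alpha * V_corr / np.sqrt(S_corr + eps)
    return W, V, S

W = np.array([1.0, -2.0])
V = np.zeros_like(W)
S = np.zeros_like(W)
for t in range(1, 2001):
    dW = 2 * W                                  # gradient of the toy loss sum(W**2)
    W, V, S = adam_step(W, dW, V, S, t, alpha=0.01)
```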

**Learning rate decay**
Learning rate decreases across epochs.

    sgd = optimizers.SGD(lr=0.03, decay=0.0001, momentum=0.9)

$\alpha = \dfrac{1}{1+\text{decay_rate * epoch_nb}}* \alpha_0$

other methods:

$\alpha = 0.97 ^{\text{epoch_nb}}* \alpha_0$ (exponential decay)

or

$\alpha = \dfrac{k}{\sqrt{\text{epoch_nb}}}* \alpha_0$

or

Manual decay!
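
The decay formulas above translate directly into small functions. The function names and the example values are hypothetical; `epoch` corresponds to `epoch_nb` in the formulas.

```python
import math

def inverse_time_decay(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.97):
    return alpha0 * base ** epoch

def inverse_sqrt_decay(alpha0, epoch, k=1.0):
    # only defined for epoch >= 1, since the formula divides by sqrt(epoch)
    return alpha0 * k / math.sqrt(epoch)

alpha0 = 0.03
schedule = [inverse_time_decay(alpha0, decay_rate=1.0, epoch=e) for e in range(4)]
```

With `decay_rate=1.0`, the learning rate is divided by 1, 2, 3 and 4 over the first four epochs.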


## Convolutional Neural Networks (CNN)

CNNs are well suited to identifying patterns in images because of the "convolution operation". Compare how the two kinds of layers learn:

- Dense layers learn global patterns in their input feature space

- Convolution layers learn local patterns, and this leads to the following interesting features:
    - Unlike densely connected networks, once a convolutional neural network has learned to recognize a pattern in, say, the upper-right corner of a picture, it can recognize it anywhere else in the picture.
    - Deeper convolutional neural networks can learn spatial hierarchies. A first layer will learn small local patterns, a second layer will learn larger patterns built from the features of the first layer, etc.
     
Because of these properties, CNNs are great for tasks like:
- Image classification
- Object detection in images
- Picture neural style transfer

**In Keras, the layer for the convolution step is `Conv2D`. The convolution operation applies a filter (typically 3x3 or 5x5) to each possible 3x3 or 5x5 region of the original image.**
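
To make the operation concrete, here is a naive NumPy sketch of a single-channel convolution with no padding and a stride of 1. It is not how `Conv2D` is implemented; the function name `conv2d_valid` and the example filter are assumptions. As is conventional in deep learning, the filter is not flipped (strictly speaking, this is a cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over every position of a 2-D `image` (no padding, stride 1)."""
    H, W = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((H - f_h + 1, W - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the filter with the current window, then sum
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # pixel value = 5 * row + column
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)
feature_map = conv2d_valid(image, vertical_edge)
```

Because this toy image increases by 1 per column everywhere, every window gives the same response of -6, and the 5 x 5 input produces a 3 x 3 feature map.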

### Padding

There are some issues with using filters on images including: 

- The image shrinks with each convolution layer: you're throwing away information in each layer! For example:
    - Starting from a 5 x 5 matrix, and using a 3 x 3 filter, you end up with a 3 x 3 image.
    - Starting from a 10 x 10 matrix, and using a 3 x 3 filter, you end up with an 8 x 8 image.
    - etc.
- The pixels around the edges are used much less in the outputs due to the filter. 

Fortunately, padding solves both of these problems! Just one layer of pixels around the edges preserves the image size when having a 3 x 3 filter. We can also use bigger filters, but generally the dimensions are odd!

Some further terminology regarding padding that you should be aware of includes:

- "Valid" - no padding
- "Same" - padding such that output is same as the input size

By padding our 5x5 image with a border of one pixel on every side (which makes it a 7x7 image), each pixel of our original 5x5 image can be the center of a 3x3 convolution window.

### Strided convolutions

The stride is how the convolution filter is moved over the original image. In our above example, we moved the filter one pixel to the right starting from the upper left hand corner, and then began to do this again after moving the filter one pixel down. Alternatively, by changing the stride, we could move our filter by 2 pixels each time, resulting in a smaller number of possible locations for the filter.

Many simple models keep a stride of 1 and rely on pooling to downsample, but strided convolutions are a good feature to be aware of because some architectures use them instead.
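The output size along one dimension follows the standard formula $\lfloor (n + 2p - f) / s \rfloor + 1$, where $n$ is the input size, $f$ the filter size, $p$ the padding and $s$ the stride. A small sketch (the helper name `conv_output_size` is made up for illustration):

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Number of filter positions along one dimension of the input."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(5, 3))             # 3
print(conv_output_size(10, 3))            # 8
print(conv_output_size(5, 3, padding=1))  # 5
print(conv_output_size(7, 3, stride=2))   # 3
```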

### RGB in CNN

Instead of a 5 x 5 grayscale image, let's imagine a 7 x 7 RGB image, which boils down to having a 7 x 7 x 3 tensor. (The image itself is composed of a 7 by 7 matrix of pixels, each with 3 numerical values for the RGB values.) From there, you will need to use a filter that has the third dimension equal to 3 as well, let's say, 3 x 3 x 3 (a 3D "cube").

This allows us to detect, e.g., only horizontal edges in the blue channel (by setting the filter values for the red and green channels to 0).

Then, in each layer, you can convolve with several 3D filters. Then, you stack every output result together, and that way you end up having a 5 x 5 x (number of filters) shape.
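The shape bookkeeping can be checked with a short NumPy sketch. The function `conv_layer`, the random image, and the particular edge filter are illustrative assumptions; the first filter only looks at the blue channel, as described above.

```python
import numpy as np

def conv_layer(image, filters):
    """image: (H, W, C); filters: (K, f, f, C). No padding, stride 1."""
    h, w, _ = image.shape
    k, f, _, _ = filters.shape
    out = np.zeros((h - f + 1, w - f + 1, k))
    for n in range(k):
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                # each 3D filter covers all channels of the window at once
                out[i, j, n] = np.sum(image[i:i + f, j:j + f, :] * filters[n])
    return out

rng = np.random.default_rng(0)
image = rng.random((7, 7, 3))             # 7 x 7 RGB image

# horizontal edge detector that is zero on the red and green channels
edge = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)
blue_only = np.zeros((3, 3, 3))
blue_only[:, :, 2] = edge

# stack two 3 x 3 x 3 filters: the blue edge detector and a random one
filters = np.stack([blue_only, rng.standard_normal((3, 3, 3))])
feature_maps = conv_layer(image, filters)
print(feature_maps.shape)  # (5, 5, 2)
```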

### Pooling Layer

The last element in a CNN architecture (before the fully connected layers that we have previously discussed in other neural networks) is the pooling layer. This layer is meant to substantially downsample the output of the previous convolutional layers. The idea is that the convolutional layers find patterns such as edges or other basic shapes present in the pictures, and the pooling layer then takes a summary of those convolutions over a larger section. In practice, max pooling (taking the maximum of all convolutions from a larger area of the original image) is the most common choice and tends to work better than average pooling, as we are typically looking to detect whether a feature is present in that region. Downsampling is essential in order to produce viable execution times during model training.

Max pooling has some important hyperparameters:
- f (filter size)
- s (stride)

Common choices are f=2, s=2 and f=3, s=2; both shrink the size of the representations.
If a feature is detected anywhere in a pooling window (for example, one quadrant of the input), a high number will appear there, so max pooling preserves this feature.
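A minimal NumPy sketch of max pooling with f=2, s=2 follows. The function name `max_pool2d` and the example values are made up for illustration.

```python
import numpy as np

def max_pool2d(feature_map, f=2, s=2):
    """Take the maximum of every f x f window, moving s pixels at a time."""
    h, w = feature_map.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * s:i * s + f, j * s:j * s + f]
            pooled[i, j] = window.max()
    return pooled

feature_map = np.array([[1, 3, 2, 1],
                        [2, 9, 1, 1],
                        [1, 3, 2, 3],
                        [5, 6, 1, 2]], dtype=float)
print(max_pool2d(feature_map))
# [[9. 2.]
#  [6. 3.]]
```

The strong response (9) in the top-left quadrant survives the downsampling from 4 x 4 to 2 x 2.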

### Fully Connected Layers in Your CNN

Once you have added a number of convolutional and pooling layers, add fully connected (dense) layers as we did in previous neural network models. This allows the network to learn a final decision function based on the transformed, informative inputs generated by the convolutional and pooling layers.

### Pretrained CNN

A pretrained network is a network that was previously trained on a large, general data set, and saved. The advantage is that the hierarchical features learned by this network can act as a generic model, and can be used for a wide variety of computer vision tasks, even if your new problem involves completely different classes of images.

Keras has several pretrained models available. Here is a list of pretrained image classification models. All these models are available in `keras.applications` and were pretrained on the ImageNet dataset, a data set with 1.4 million labeled images and 1,000 different classes.

- DenseNet
- InceptionResNetV2
- InceptionV3
- MobileNet
- NASNet
- ResNet50
- VGG16
- VGG19
- Xception

You can find an overview here too: https://keras.io/applications/

You can simply import the desired pretrained model and call it as a function with 2 arguments: **weights and include_top**. Pass "imagenet" to weights in order to use the weights that were obtained when training on the ImageNet data set. Through the argument include_top, you can choose whether or not to include the top of the model (the fully connected layer at the top of the network).

#### Feature Extraction (pretrained CNN)
Feature extraction with convolutional neural networks means that you take the convolutional base of a pretrained network, run new data through it, and train a new classifier on top of the output (a new densely connected classifier). Why use the convolutional base but a *new* dense classifier? Generally, patterns learned by the convolutional layers are more generalizable.

Note that, if your dataset differs a lot from the dataset used when pretraining, it might even be worth only using part of the convolutional base (see "fine-tuning")

Also, with feature extraction, there are two ways of running the model:
- You can run the convolutional base over your dataset, save its output, then use this data as input to a standalone, densely connected network. This solution is pretty fast to run, because you only need to run the convolutional base once for every input image. The problem here is, however, that you can't use data augmentation as we've seen it before.
- You can extend the convolutional base by adding dense layers on top, and run everything end to end on the input data. This way, you can use data augmentation, but as every input image goes through the convolutional base every time, this technique is much more time-consuming. It's almost impossible to do this without a GPU.

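        from keras import layers, models
        from keras.applications import VGG16

        #load a pretrained convolutional base
        #(assumption: VGG16 and a 150x150 RGB input are example choices; downloads the ImageNet weights)
        cnn_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))

        #add the pretrained CNN as the first layer (the base) of the model
        model = models.Sequential()
        model.add(cnn_base)
        #add dense layers for a new binary classifier
        model.add(layers.Flatten())
        model.add(layers.Dense(132, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        #freeze the convolutional base (do this before compiling the model)
        cnn_base.trainable = False
        #perform a sanity check:
        #you can check whether a layer is trainable (or alter its setting) through the layer.trainable attribute
        for layer in model.layers:
            print(layer.name, layer.trainable)

        #similarly, we can check how many trainable weight tensors are in the model
        print(len(model.trainable_weights))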

#### Fine-tuning  (pretrained CNN)

Fine-tuning is similar to feature extraction in that you reuse the convolutional base and train a new dense, fully connected classifier on top of it to output a new prediction. In addition, fine-tuning retrains some of the previously frozen weights of the convolutional base (typically those of the last few layers). This allows these weights to be tweaked for the current scenario, hence the name. To do this, you keep part of the model frozen while tuning specific layers.

**You must train the new classifier with feature extraction (frozen base) before fine-tuning; otherwise the large errors from the untrained classifier would disrupt the pretrained weights.**
    
    #unfreeze the convolutional base so that individual layers can be configured
    cnn_base.trainable = True
    #keep every layer before 'block5_conv1' frozen;
    #unfreeze (fine-tune) that layer and all layers after it
    set_trainable = False
    for layer in cnn_base.layers:
        if layer.name == 'block5_conv1':
            set_trainable = True
        layer.trainable = set_trainable


## RNN - Sequence Models

A Sequence Model is a general term for a special class of Deep Neural Networks that work with a time series of data as an input. The series of data is any set of data where we want the model to consider the data one point at a time, in order. This means that they are great for problems where the order of the data matters--for instance, stock price data or text. In both cases, the data only makes sense in order.

#### Use Cases
**_Text Classification_**
One of the most common applications of RNNs is plain old text classification. Recall that all the models that we've used so far for text classification have been incapable of focusing on the order of the words, which means that they're likely to miss out on more advanced pieces of information such as connotation, context, sarcasm, etc. However, since RNNs examine the words one at a time and remember what they've seen at each time step, they're able to capture this information quite effectively in most cases!

**_Sequence Generation_**
Sequence generation is probably some of the most fun that you can have with Neural Networks, because they excel at coming up with wacky, almost-human sounding names for things when fed the right data. 

**_Sequence to Sequence Model_**
If you've ever used Google Translate before, then you've already interacted with a **_Sequence to Sequence Model_**. These models learn to map an input sequence to an output sequence, usually through an **_Encoder-Decoder_** architecture.

#### RNN Architecture
A basic Recurrent Neural Network is just a neural network that passes its output from a given time step back into itself as input for the next time step.
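This recurrence can be sketched in a few lines of NumPy. The weight names (`W_x`, `W_h`, `b`), the tanh activation, and the random inputs are assumptions chosen for illustration, not a specific library's implementation.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """inputs: (T, input_dim). Returns the hidden state at every time step."""
    h = np.zeros(W_h.shape[0])          # initial hidden state
    states = []
    for x_t in inputs:                  # process the sequence in order
        # the previous output h is fed back in together with the new input
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 6, 4, 3
inputs = rng.standard_normal((T, input_dim))
W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (6, 3)
```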

##### Back Propagation
One interesting aspect of working with RNNs is that they use a modified form of back propagation called **_Back Propagation Through Time (BPTT)_**. Because the model is trained on sequence data, it has the potential to be right or wrong at every point in that sequence. This means that we need to adjust the model's weights at each time point to effectively learn from sequence data. Because the model starts at the most recent output, and then works backwards to calculate the loss and update the weights at each time step, the model is said to be going "back in time" to learn. Since we have to update every single weight at every single time step, BPTT is much more computationally expensive than traditional back propagation. For instance, if a single data point is a sequence with 1000 time steps, then our model will perform a full round of back propagation for each of the 1000 points in that single sequence. 

##### Truncated Back Prop Through Time
This algorithm speeds up training by breaking a big sequence of 1000 time steps into 50 sequences of 20 time steps each. This significantly improves training time over regular BPTT, but is still significantly slower than vanilla back propagation. 
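The data-preparation side of this idea can be sketched as follows. Only the splitting is shown; the helper name `split_into_chunks` is made up, and the gradient computation that would then run per chunk is not included.

```python
import numpy as np

def split_into_chunks(sequence, chunk_len):
    """Cut one long sequence into equal chunks (any leftover steps are dropped)."""
    n_chunks = len(sequence) // chunk_len
    trimmed = sequence[:n_chunks * chunk_len]
    return trimmed.reshape(n_chunks, chunk_len, *sequence.shape[1:])

sequence = np.arange(1000)             # one data point with 1000 time steps
chunks = split_into_chunks(sequence, chunk_len=20)
print(chunks.shape)  # (50, 20)
```

Back propagation then only has to reach back at most 20 time steps within each chunk instead of up to 1000.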

#### Vanishing and Exploding Gradients

One of the biggest problems with standard Recurrent Neural Networks is that they get saturated. They use a sigmoid or tanh activation function, and there are large areas of each function where the derivative is very, very close to 0. When the gradients become so close to 0 that the weights barely change, this is called **Vanishing Gradient**. Similarly, networks can also get to the point where the gradients are much, much too large, resulting in massive weight updates that cause the model to thrash between one extremely wrong answer and another. When this happens, it is called **Exploding Gradient**.

In practice, we can easily solve Exploding Gradients by just "clipping" the weight updates, bounding them at a maximum value. However, there's no equally simple fix for Vanishing Gradients! This is where the modern versions of RNNs come in. In practice, when building models for sequence data, people rarely use traditional RNN architectures anymore. Instead they make use of LSTMs and GRUs. Both of these models can be thought of as special types of neurons that can be used in an RNN. Although they work a little differently, they have the same strength--the ability to forget information! By constantly updating their internal state, they can learn what is important to remember, and when it is okay to forget it.

### Gated Recurrent Units (GRUs)

**_Gated Recurrent Units_**, or **_GRUs_**, are a special type of cell that passes along its internal state at each time step. However, not every part of the internal state is passed along--only the important stuff! GRUs make use of two "gate" functions: a **_Reset Gate_**, which determines what should be removed from the cell's internal state before passing itself along to the next time step, and an **_Update Gate_**, which determines how much of the state from the previous time step should be used in the current time step.
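Both effects can be illustrated numerically. The sketch below assumes clipping by the L2 norm of the gradient, which is one common variant; the helper name `clip_by_norm` and the example values are made up.

```python
import numpy as np

# Vanishing: the sigmoid derivative is at most 0.25, so a product of
# many such factors (one per time step) shrinks towards 0 very quickly.
max_sigmoid_grad = 0.25
vanishing_factor = max_sigmoid_grad ** 50
print(vanishing_factor)

# Exploding: bound the size of the update by rescaling large gradients.
def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploding_grad = np.array([30.0, 40.0])    # L2 norm of 50
clipped = clip_by_norm(exploding_grad, max_norm=5.0)
print(np.linalg.norm(clipped))             # approximately 5
```

Clipping keeps the direction of the gradient and only limits its length, which is why it works for exploding gradients but cannot help when the gradient is already close to 0.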
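A single GRU time step can be sketched in NumPy as below. The parameter names and the convention that the update gate `z` weights the previous state follow the description above; they are assumptions for illustration, and libraries differ in such details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(input_dim, hidden_dim, rng):
    """Small random weights for the update (z), reset (r) and candidate (h) parts."""
    params = {}
    for name in ("z", "r", "h"):
        params["W_" + name] = rng.standard_normal((hidden_dim, input_dim)) * 0.1
        params["U_" + name] = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        params["b_" + name] = np.zeros(hidden_dim)
    return params

def gru_step(x_t, h_prev, p):
    # update gate: how much of the previous state to carry over
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    # reset gate: how much of the previous state feeds the new candidate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    # candidate state built from the input and the reset-scaled previous state
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
params = init_gru_params(input_dim=4, hidden_dim=3, rng=rng)
h = np.zeros(3)
for x_t in rng.standard_normal((5, 4)):   # a sequence of 5 time steps
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```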

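    #sample GRU model
    from keras.models import Sequential
    from keras.layers import Embedding, GRU, GlobalMaxPool1D, Dropout, Dense

    gru_model = Sequential()
    #embed a vocabulary of 20,000 words into 128-dimensional vectors
    gru_model.add(Embedding(20000, 128))
    gru_model.add(GRU(50, return_sequences=True))
    gru_model.add(GlobalMaxPool1D())
    gru_model.add(Dropout(0.5))
    gru_model.add(Dense(50, activation='relu'))
    gru_model.add(Dropout(0.5))
    #20 output classes
    gru_model.add(Dense(20, activation='softmax'))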

### Long Short Term Memory Cells (LSTMs)

**_Long Short Term Memory Cells_**, or **_LSTMs_**, are another sort of specialized neurons for use in RNNs that are able to effectively learn what to remember and what to forget in sequence models. 

LSTMs are generally like GRUs, except that they use 3 gates instead of 2. LSTMs have: 

* an **_Input Gate_**, which determines how much new information from the current time step should be written to the cell state
* a **_Forget Gate_**, which determines how much of the cell state that was passed along should be forgotten
* an **_Output Gate_**, which determines how much of the current state should be exposed to the next layers in the network

**Building models with both cell types (GRU and LSTM) and comparing them is good practice, as one does not necessarily outperform the other.**
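The three gates can be seen at work in a NumPy sketch of a single LSTM time step. The parameter names and the random initialisation are assumptions for illustration; the equations follow the commonly used LSTM formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm_params(input_dim, hidden_dim, rng):
    """Small random weights for the forget (f), input (i), output (o) gates and candidate (g)."""
    params = {}
    for name in ("f", "i", "o", "g"):
        params["W_" + name] = rng.standard_normal((hidden_dim, input_dim)) * 0.1
        params["U_" + name] = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        params["b_" + name] = np.zeros(hidden_dim)
    return params

def lstm_step(x_t, h_prev, c_prev, p):
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    g = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate values
    c = f * c_prev + i * g        # keep part of the old cell state, add new information
    h = o * np.tanh(c)            # expose part of the cell state as the output
    return h, c

rng = np.random.default_rng(0)
params = init_lstm_params(input_dim=4, hidden_dim=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.standard_normal((5, 4)):   # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (3,) (3,)
```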

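    #sample LSTM model
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, GlobalMaxPool1D, Dropout, Dense

    lstm_model = Sequential()
    #embed a vocabulary of 20,000 words into 128-dimensional vectors
    lstm_model.add(Embedding(20000, 128))
    lstm_model.add(LSTM(50, return_sequences=True))
    lstm_model.add(GlobalMaxPool1D())
    lstm_model.add(Dropout(0.5))
    lstm_model.add(Dense(50, activation='relu'))
    lstm_model.add(Dropout(0.5))
    #20 output classes
    lstm_model.add(Dense(20, activation='softmax'))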

### Bidirectional Layers

A Bidirectional RNN is just like a regular RNN, but with a twist--half of the neurons start at the beginning of the data and work towards the end one step at a time, while the other half start at the end of the data and work towards the beginning at the same pace!

Bidirectional RNNs excel at things like speech recognition and other NLP tasks. Typically, Bidirectional RNN layers combined with LSTM cells are a great first place to start when tackling NLP tasks. However, _they do come with the drawback of increased complexity and computational requirements_, since each bidirectional layer is essentially double the size: an equal number of neurons is needed for each direction. This means that if we create a bidirectional layer of 50 LSTM neurons, then our model actually has 100 LSTM cells for that layer--50 for front-to-back, and 50 for back-to-front. This size increase can definitely slow down training, because LSTM cells are already quite time intensive on their own. However, *_when it comes to performance with things like human speech, bidirectional models are often best-in-class_*!
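A minimal NumPy sketch of the idea, using plain tanh RNN cells instead of LSTMs to keep it short. All names and the random weights are illustrative assumptions.

```python
import numpy as np

def rnn_states(inputs, W_x, W_h):
    """Hidden state at every time step of a simple tanh RNN."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_states(inputs, fwd, bwd):
    forward = rnn_states(inputs, *fwd)
    # run over the reversed sequence, then flip back so time steps line up
    backward = rnn_states(inputs[::-1], *bwd)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 10, 4
inputs = rng.standard_normal((T, input_dim))
# separate weights for each direction
fwd = (rng.standard_normal((hidden_dim, input_dim)) * 0.1,
       rng.standard_normal((hidden_dim, hidden_dim)) * 0.1)
bwd = (rng.standard_normal((hidden_dim, input_dim)) * 0.1,
       rng.standard_normal((hidden_dim, hidden_dim)) * 0.1)

outputs = bidirectional_states(inputs, fwd, bwd)
print(outputs.shape)  # (5, 8)
```

Each time step now carries 8 values: 4 from the front-to-back pass and 4 from the back-to-front pass.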

    #sample LSTM model with Bidirectional layers added
    from keras.layers import LSTM, Dense, Bidirectional, Activation
    from keras.models import Sequential

    model = Sequential()
    model.add(Bidirectional(LSTM(10, return_sequences=True),
                            input_shape=(5, 10)))
    model.add(Bidirectional(LSTM(10)))
    model.add(Dense(5))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
