## Neural Networks Basics

Networks come in all shapes and sizes.

- We can add more features (nodes) in the input layer.
- We can add more nodes in the hidden layer. We can also simply add more hidden layers. This is what turns a neural network into a "deep" neural network (hence, deep learning).
- We can have several nodes in the output layer.

And there is one more thing that makes deep learning extremely powerful: unlike many other statistical and machine learning techniques, deep learning can deal extremely well with **unstructured data**.

Types of neural networks:
- Standard neural networks
- Convolutional neural networks (input = images, video)
- Recurrent neural networks (input = audio files, text, time series data)
- Generative adversarial networks

#### Logistic regression as a neural network (no hidden layers)
To make a prediction $\hat y$ from an input $x \in \mathbb{R}^n$, we need some expression in terms of the parameters $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. A first candidate is the linear expression $\hat y = w^T x + b$. The problem is that this expression does not ensure that the outcome $\hat y$ lies between zero and one: it can be much bigger than one or even negative, which is why an activation function is applied to it.
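
As a minimal sketch of this problem, the snippet below computes $w^T x + b$ for a few samples and then passes the result through a sigmoid. The values of `w`, `b` and `X` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical parameters and inputs, chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])
b = 0.5
X = np.array([[3.0, 0.0, 1.0],     # one sample per row
              [-2.0, 4.0, 0.0],
              [0.1, 0.2, 0.3]])

z = X @ w + b                      # raw linear output w^T x + b for each sample
y_hat = 1 / (1 + np.exp(-z))       # the sigmoid squashes z into (0, 1)

print("linear output:", z)         # contains values above 1 and below 0
print("after sigmoid:", y_hat)     # every value lies strictly between 0 and 1
```

The linear output cannot be read as a probability, while the squashed values can.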

#### activation functions
**sigmoid function**
Recall that the mathematical expression of the sigmoid is $ a=\dfrac{1}{1+ \exp(-z)}$, and this outputs activation values somewhere between 0 and 1.
    
    def sigmoid(x, derivative=False):
        f = 1 / (1 + np.exp(-x))
        if (derivative == True):
            return f * (1 - f)
        return f

**tanh**
The hyperbolic tangent (or tanh) function ranges between -1 and +1 and is in fact a rescaled and shifted version of the sigmoid function, $\tanh(z) = 2\sigma(2z) - 1$, with formula $ a=\dfrac{\exp(z)- \exp(-z)}{\exp(z)+ \exp(-z)}$. For intermediate layers, the tanh function generally performs well because, with values between -1 and +1, the mean of the activations coming out of a layer is closer to zero.

    def tanh(x, derivative=False):
        f = np.tanh(x)
        if (derivative == True):
            return (1 - (f ** 2))
        return np.tanh(x)
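
The relationship between tanh and the sigmoid can be checked numerically. The sketch below compares `np.tanh(z)` with $2\sigma(2z) - 1$ on a grid of inputs; the helper `sigmoid` is redefined here only to keep the snippet self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.arange(-10, 10, 0.2)

lhs = np.tanh(z)
rhs = 2 * sigmoid(2 * z) - 1          # sigmoid rescaled to (-1, 1)

# Largest absolute difference between the two expressions on the grid
max_gap = np.max(np.abs(lhs - rhs))
print(max_gap)
```

The gap should be at the level of floating-point rounding error.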
        
**ReLU** (Rectified Linear Unit function)
This is probably the most popular activation function, along with tanh: $a=\max(0,z)$. Its derivative is 1 when $z > 0$ and exactly 0 when $z < 0$ (it is undefined at $z = 0$), so units with negative inputs pass no gradient back, which is slightly cumbersome during training.

    def relu(x, derivative=False):
        # vectorized, so it works for arrays of any shape
        x = np.asarray(x, dtype=float)
        if derivative:
            # slope is 1 for positive inputs and 0 otherwise
            return (x > 0).astype(float)
        # element-wise max(0, x)
        return np.maximum(0, x)

**leaky ReLU**

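The leaky ReLU solves the derivative issue by allowing the activation to be slightly negative when $z <0$: $a=\max(\epsilon z,z)$ for a small slope $\epsilon$ such as 0.001. In the code below this slope is the `leakage` argument, which defaults to 0.05.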

    def leaky_relu(x, leakage = 0.05, derivative=False):
        # vectorized, so it works for arrays of any shape
        x = np.asarray(x, dtype=float)
        if derivative:
            # slope is 1 for positive inputs and `leakage` otherwise
            return np.where(x > 0, 1.0, leakage)
        # positive inputs pass through; the rest are scaled by `leakage`
        return np.where(x > 0, x, x * leakage)
        
**arctan**
The inverse tangent (arctan) function has a lot of the same qualities that tanh has, but its range goes from roughly -1.57 to 1.57 (that is, $-\pi/2$ to $\pi/2$), and it saturates more gradually than the tanh function.

    def arctan(x, derivative=False):
        if (derivative == True):
            return 1/(1+np.square(x))
        return np.arctan(x)

    z = np.arange(-10,10,0.2)
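
The derivative formulas used in the `derivative=True` branches above can be sanity-checked against central finite differences. This is only a sketch: the step size `h` is an assumed value, and `sigmoid` is redefined to keep the snippet self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

z = np.arange(-10, 10, 0.2)
h = 1e-5  # finite-difference step (assumed value)

# Central differences approximate the true derivative at each point of z.
num_sigmoid = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
num_tanh = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)
num_arctan = (np.arctan(z + h) - np.arctan(z - h)) / (2 * h)

# Compare with the closed-form derivatives from the functions above.
err_sigmoid = np.max(np.abs(num_sigmoid - sigmoid(z) * (1 - sigmoid(z))))
err_tanh = np.max(np.abs(num_tanh - (1 - np.tanh(z) ** 2)))
err_arctan = np.max(np.abs(num_arctan - 1 / (1 + np.square(z))))

print(err_sigmoid, err_tanh, err_arctan)
```

All three errors should be tiny, which confirms the closed-form derivatives.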
    
#### Loss and Cost Function
The **loss function** is used to measure the inconsistency between the predicted value $(\hat y)$ and the actual label $y$. In logistic regression the loss function is defined as
$\mathcal{L}(\hat y, y) = - ( y \log (\hat y) + (1-y) \log(1-\hat y))$. The advantage of this loss function expression is that the optimization space here is convex, which makes optimizing using gradient descent easier. The loss function is defined over 1 particular training sample. 

The **cost function** takes the average loss over all the samples: $J(w,b) = \displaystyle\frac{1}{l}\displaystyle\sum^l_{i=1}\mathcal{L}(\hat y^{(i)}, y^{(i)})$
When you train your logistic regression model, the purpose is to find parameters $w$ and $b$ such that your cost function is minimized!
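
A small sketch of both functions in NumPy follows. The labels and predictions are made-up values, and the clipping constant `eps` is an assumption added here to avoid taking $\log(0)$; it is not part of the formulas above.

```python
import numpy as np

def loss(y_hat, y, eps=1e-12):
    # Per-sample cross-entropy loss; eps (an assumption) guards against log(0).
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    # Cost = average of the per-sample losses.
    return np.mean(loss(y_hat, y))

# Hypothetical labels and predictions for four samples.
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.1, 0.2, 0.8])

per_sample = loss(y_hat, y)
total = cost(y_hat, y)
print(per_sample)   # confident, correct predictions get a small loss
print(total)
```

The first two predictions are close to their labels and incur a small loss; the last two are far off and are penalized much more heavily.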

#### Forward propagation
Initialize $J= 0$, $dw_1= 0$, $dw_2= 0$, $db= 0$. 

For each training sample $1,...,l$ you'll need to compute:

$ z^{(i)} = w^T x^ {(i)} +b $

$\hat y^{(i)} = \sigma (z^{(i)})$

$dz^{(i)} = \hat y^{(i)}- y^{(i)}$

#### Backward propagation (after updating values with forward propagation)
For each sample, add its contribution to the running totals:

$J := J - [y^{(i)} \log (\hat y^{(i)}) + (1-y^{(i)}) \log(1-\hat y^{(i)})]$

$dw_1 := dw_1 + x_1^{(i)} * dz^{(i)}$

$dw_2 := dw_2 + x_2^{(i)} * dz^{(i)}$

$db := db + dz^{(i)}$

After looping over all $l$ samples, average the totals: $\dfrac{J}{l}$, $\dfrac{dw_1}{l}$, $\dfrac{dw_2}{l}$, $\dfrac{db}{l}$

#### Update weights
$w_1 := w_1 - \alpha dw_1$

$w_2 := w_2 - \alpha dw_2$

$b := b - \alpha db$

repeat until convergence!
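
The full loop (forward propagation, backward propagation, update) can be sketched in vectorized form as below. The toy data, the learning rate `alpha`, the iteration count and the clipping constant are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: l samples with two features each.
l = 200
X = rng.normal(size=(2, l))              # shape (features, samples)
y = (X[0] + X[1] > 0).astype(float)      # labels from a simple linear rule

w = np.zeros(2)
b = 0.0
alpha = 0.1                              # learning rate (assumed value)
costs = []

for _ in range(500):
    # Forward propagation
    z = w @ X + b
    y_hat = 1 / (1 + np.exp(-z))
    y_safe = np.clip(y_hat, 1e-12, 1 - 1e-12)   # avoid log(0)
    J = -np.mean(y * np.log(y_safe) + (1 - y) * np.log(1 - y_safe))
    costs.append(J)

    # Backward propagation: gradients averaged over the l samples
    dz = y_hat - y
    dw = X @ dz / l
    db = dz.mean()

    # Update weights
    w = w - alpha * dw
    b = b - alpha * db

accuracy = np.mean((y_hat > 0.5) == (y == 1))
print(costs[0], costs[-1], accuracy)
```

The vectorized `dw = X @ dz / l` computes $dw_1$ and $dw_2$ in one step instead of accumulating them sample by sample.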

## Deep Neural Networks Process Overview

To summarize the process once more, we begin by defining a model architecture: the number of hidden layers, the number of units in each layer, and the activation function (sigmoid or relu) each layer uses.

We then initialize parameters for each of these layers (typically randomly). After the initial parameters are set, forward propagation evaluates the model, giving a prediction, which is then used to evaluate a cost function. Forward propagation involves evaluating each layer and then piping its output into the next layer.

Each layer consists of a linear transformation and an activation function.  The parameters for the linear transformation in **each** layer include $W^l$ and $b^l$. The output of this linear transformation is represented by $Z^l$. This is then fed through the activation function (again, for each layer) giving us an output $A^l$ which is the input for the next layer of the model.  

After forward propagation is completed and the cost function is evaluated, backpropagation is used to calculate the gradients of this cost function with respect to the parameters. These gradients are then used in an optimization algorithm, such as gradient descent, to make small adjustments to the parameters. The entire cycle of forward propagation, backpropagation and parameter adjustment is repeated until the modeller is satisfied with the results.
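
To make the layer-by-layer flow concrete, here is a minimal sketch that traces array shapes through one forward pass. The layer sizes, the number of samples and the use of relu in hidden layers with a sigmoid output are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_dims = [4, 3, 1]          # hypothetical: 4 inputs, 3 hidden units, 1 output
X = rng.normal(size=(4, 5))     # 5 samples stored as columns

A = X
shapes = []
for l in range(1, len(layer_dims)):
    W = rng.normal(size=(layer_dims[l], layer_dims[l - 1])) * 0.05
    b = np.zeros((layer_dims[l], 1))

    Z = W @ A + b               # linear transformation of the previous output
    is_last = (l == len(layer_dims) - 1)
    # activation: sigmoid on the output layer, relu on hidden layers
    A = 1 / (1 + np.exp(-Z)) if is_last else np.maximum(0, Z)

    shapes.append((W.shape, Z.shape, A.shape))
    print(f"layer {l}: W{W.shape} Z{Z.shape} A{A.shape}")
```

Each weight matrix has one row per unit in its own layer and one column per unit in the previous layer, so the output of one layer always fits the next.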

## Creating a Deep Neural Network from Scratch
   
    import numpy as np
    import h5py
    import matplotlib.pyplot as plt

    %matplotlib inline
    plt.rcParams['figure.figsize'] = (5.0, 5.0) 
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    %load_ext autoreload
    %autoreload 2

    np.random.seed(123)
    
#### Initialization in an L-layer Neural Network
    def initialize_parameters(n_0, n_1, n_2):
        np.random.seed(123) 
        W1 = np.random.randn(n_1, n_0) * 0.05 
        b1 = np.zeros((n_1, 1))
        W2 =  np.random.randn(n_2, n_1) * 0.05 
        b2 = np.zeros((n_2, 1))

        parameters = {"W1": W1,
                      "b1": b1,
                      "W2": W2,
                      "b2": b2}

        return parameters
    
#### create a dictionary of parameters for W and b given a list of layer dimensions.
    #Simply randomly initialize values in accordance to the shape each parameter should have.
    #Use random seed 123 (as provided)
    def initialize_parameters_deep(list_layer_dimensions):

        np.random.seed(123)
        parameters = {}
        L = len(list_layer_dimensions)           

        for l in range(1, L):
            parameters['W' + str(l)] = np.random.randn(list_layer_dimensions[l], list_layer_dimensions[l-1])*0.05
            parameters['b' + str(l)] = np.zeros((list_layer_dimensions[l], 1))

        return parameters
    
#### forward propagation through linear activation
    def linear_activation_forward(A_prev, W, b, activation):
        # Linear step: Z = W . A_prev + b
        Z = np.dot(W, A_prev) + b
        linear_cache = (A_prev, W, b)
        activation_cache = Z

        # Two activation functions are supported
        if activation == "sigmoid":
            A = 1/(1+np.exp(-Z))
        elif activation == "relu":
            A = np.maximum(0,Z)
        else:
            raise ValueError("activation must be 'sigmoid' or 'relu'")
        assert (A.shape == (W.shape[0], A_prev.shape[1]))
        cache = (linear_cache, activation_cache)

        return A, cache
        
#### continue forward propagation through L layer
    def L_model_forward(X, parameters):
        #Initialize a cache list to keep track of the caches
        caches = []
        A = X
        L = len(parameters) // 2 # number of layers in the neural network

        # Implement the RELU activation L-1 times. Add "cache" to the "caches" list.
        for l in range(1, L):
            A_prev = A
            A, cache = linear_activation_forward(A_prev, parameters['W'+ str(l)], parameters['b' + str(l)], activation = "relu")        
            caches.append(cache)
        #Implement the sigmoid function for the last layer. Add "cache" to the "caches" list.
        AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "sigmoid")
        caches.append(cache)

        assert(AL.shape == (1,X.shape[1]))

        return AL, caches
        
#### create cost function
    def compute_cost(AL, Y):

        m = Y.shape[1]

        cost = -(1/m)* np.sum((Y*np.log(AL))+ (1-Y)*np.log(1-AL))
        cost = np.squeeze(cost)      #turn [[17]] into 17

        return cost
    
#### start backward propagation with a linear backward function  
    def linear_backward(dZ, cache):
        A_prev, W, b = cache #Unpacking our complex object
        m = A_prev.shape[1]

        dW = (1/m) * np.dot(dZ,A_prev.T)
        db = (1/m) * np.sum(dZ, axis =1, keepdims = True)
        dA_prev = np.dot(W.T , dZ) # gradient passed back to the previous layer

        return dA_prev, dW, db
        
#### combine activation to linear backward function
    def linear_activation_backward(dA, cache, activation):
        linear_cache, activation_cache = cache
        Z= activation_cache

        if activation == "sigmoid": 
            s = 1/(1+np.exp(-Z))
            dZ = dA * s * (1-s)
            dA_prev, dW, db = linear_backward(dZ, linear_cache)

        elif activation == "relu":
            dZ = np.array(dA, copy=True) # just converting dz to a correct object.
            dZ[Z <= 0] = 0
            dA_prev, dW, db = linear_backward(dZ, linear_cache)

        return dA_prev, dW, db
        
#### commence backward propagation
    def L_model_backward(AL, Y, caches):
        grads = {}
        L = len(caches) # the number of layers
        m = AL.shape[1]
        Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

        # Initializing the backpropagation
        dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

        # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
        current_cache = caches[L-1]
        grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")

        # Loop from l=L-2 to l=0
        for l in reversed(range(L-1)):
            # (RELU -> LINEAR) gradients
            # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
            current_cache = caches[l]
            dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l+1)], current_cache, activation = "relu")
            grads["dA" + str(l)] = dA_prev_temp
            grads["dW" + str(l + 1)] = dW_temp
            grads["db" + str(l + 1)] = db_temp

        return grads 
    
#### update parameters with updated weights
    def update_parameters(parameters, grads, learning_rate):

        L = len(parameters) // 2 # number of layers in the neural network


        for l in range(L):
            parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
            parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
        return parameters

## Image Recognition using Deep Learning from Scratch
    
#### import images
    import matplotlib.image as mpimg
    filename = 'data/validation/santa/00000448.jpg'
    img=mpimg.imread(filename)
    plt.imshow(img)
    print(img.shape)
    plt.show()
    
#### keras preprocessing: load the images and downscale them
    import time
    import matplotlib.pyplot as plt
    import scipy
    from PIL import Image
    from scipy import ndimage
    from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    np.random.seed(1)
    
#### set directory path
    train_data_dir = 'data/train'
    test_data_dir = 'data/validation'

#### get all the data in the directory data/validation (132 images), and reshape them
    test_generator = ImageDataGenerator().flow_from_directory(
            test_data_dir, 
            target_size=(64, 64), batch_size=132) 

#### get all the data in the directory data/train (790 images), and reshape them
    train_generator = ImageDataGenerator().flow_from_directory(
            train_data_dir, 
            target_size=(64, 64), batch_size=790)

#### create the data sets
    train_images, train_labels = next(train_generator)
    test_images, test_labels = next(test_generator)
    
#### Explore your dataset again
    m_train = train_images.shape[0]
    num_px = train_images.shape[1]
    m_test = test_images.shape[0]

    print ("Number of training examples: " + str(m_train))
    print ("Number of testing examples: " + str(m_test))
    print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")
    print ("train_images shape: " + str(train_images.shape))
    print ("train_labels shape: " + str(train_labels.shape))
    print ("test_images_orig shape: " + str(test_images.shape))
    print ("test_labels shape: " + str(test_labels.shape))
    
#### Reshape the training and test examples 
    train_img = train_images.reshape(train_images.shape[0], -1).T   # The "-1" makes reshape flatten the remaining dimensions
    test_img = test_images.reshape(test_images.shape[0], -1).T

    # Scale pixel values from [0, 255] to [0, 1].
    # The result is stored back in train_img/test_img because those are the names used for training below.
    train_img = train_img/255.
    test_img = test_img/255.

    print ("train_img's shape: " + str(train_img.shape))
    print ("test_img's shape: " + str(test_img.shape))
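
The reshape step can be hard to picture, so here is a small sketch on a made-up batch. The array `images` is a hypothetical stand-in for two tiny 4x4 RGB images, not the real dataset.

```python
import numpy as np

# Hypothetical stand-in for a batch of 2 RGB images of 4x4 pixels.
images = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

# Flatten every image into one row, then transpose so each image is a column.
flat = images.reshape(images.shape[0], -1).T
print(flat.shape)   # (4*4*3, 2): one column per image

# The flattening is reversible: column 0 reshapes back into image 0.
restored = flat[:, 0].reshape(4, 4, 3)
print(np.array_equal(restored, images[0]))
```

The same reversal is what allows a flattened column to be reshaped to `(64, 64, 3)` for plotting later on.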

#### Reshape the labels
    train_labels_final = train_labels.T[[1]]
    test_labels_final = test_labels.T[[1]]

    print ("train_labels_final's shape: " + str(train_labels_final.shape))
    print ("test_labels_final's shape: " + str(test_labels_final.shape))
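
The indexing `labels.T[[1]]` is compact, so the sketch below shows what it does on a made-up label array. It assumes the generator returns one-hot labels with one column per class, which is the default `class_mode='categorical'` behaviour of `flow_from_directory`.

```python
import numpy as np

# Hypothetical one-hot labels for 4 samples and 2 classes (columns: class 0, class 1).
labels = np.array([[1., 0.],
                   [0., 1.],
                   [0., 1.],
                   [1., 0.]])

# Transpose to (classes, samples), then keep only the class-1 row.
# Indexing with the list [1] keeps the result two-dimensional.
labels_final = labels.T[[1]]
print(labels_final)
print(labels_final.shape)

# Indexing with a plain integer would drop the leading dimension instead.
print(labels.T[1].shape)
```

Keeping the shape `(1, m)` matters because the cost and backward functions index `Y.shape[1]` and compare `Y` against `AL` of the same shape.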
    
#### putting the functions together
    def L_layer_model(X, Y, layers_dims, learning_rate = 0.005, num_iterations = 3000, print_cost=False):#lr was 0.009
        np.random.seed(1)
        costs = []                         

        # Parameters initialization.
        parameters = initialize_parameters_deep(layers_dims)

        # Loop (gradient descent)
        for i in range(0, num_iterations):

            # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
            AL, caches = L_model_forward(X, parameters)

            # Compute cost.
            cost = compute_cost(AL, Y)

            # Backward propagation.
            grads = L_model_backward(AL, Y, caches)

            # Update parameters.
            parameters = update_parameters(parameters, grads, learning_rate)

            # Print and record the cost every 100 iterations
            if print_cost and i % 100 == 0:
                print ("Cost after iteration %i: %f" %(i, cost))
                costs.append(cost)

        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('iterations (per hundreds)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        return parameters
        
#### run function    
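    # layers_dims must be defined before this call: a list of layer sizes that starts with
    # the input size (64*64*3 = 12288) and ends with 1 output unit, with the hidden sizes in between.
    parameters = L_layer_model(train_img, train_labels_final, layers_dims, num_iterations = 1000, print_cost = True)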
    
    #make predictions
    def predict(X, parameters, y=None):

        m = X.shape[1]
        n = len(parameters) // 2

        # Forward propagation
        probs, caches = L_model_forward(X, parameters)

        # convert probs to 0/1 predictions
        for i in range(0, probs.shape[1]):
            if probs[0,i] > 0.50:
                probs[0,i] = 1
            else:
                probs[0,i] = 0

        #print ("predictions: " + str(probs)); print ("true labels: " + str(y))
        if y is not None:
            print("Accuracy: "  + str(np.sum((probs == y)/m)))

        return probs
        
#### print mislabeled images
    def print_mislabeled_images(classes, X, y, p):
        a = p + y
        mislabeled_indices = np.asarray(np.where(a == 1))
        plt.rcParams['figure.figsize'] = (90.0, 90.0) # set default size of plots
        num_images = len(mislabeled_indices[0])
        for i in range(num_images):
            index = mislabeled_indices[1][i]

            plt.subplot(2, num_images, i + 1)
            plt.imshow(X[:,index].reshape(64,64,3), interpolation='nearest')
            plt.axis('off')
            
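    # pred_test is not defined earlier in these notes; here it is assumed to be the test-set predictions.
    pred_test = predict(test_img, parameters, y=test_labels_final)
    print_mislabeled_images(list(train_generator.class_indices), test_img, test_labels_final, pred_test)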