In [2]:
import copy  # used by update_parameters() below

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

## 01 Logistic regression

```python
clf = LogisticRegressionCV()
clf.fit(X.T, Y.T.ravel())
LR_predictions = clf.predict(X.T)
accuracy = np.mean(LR_predictions == Y.T.ravel()) * 100

print(f'Accuracy of logistic regression: {accuracy:.2f}% (percentage of correctly labelled datapoints)')
```

## 02 Two-layer neural network

General methodology for building a neural network:
1. Define the neural network structure (number of input units, number of hidden units, etc.).
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)

**REMARKS**:
- Build helper functions that implement steps 1-3.
- Merge them into a single function called `nn_model()`.
- Once `nn_model()` has been trained and has learned the parameters, you can use them to make predictions on new data.

For one example $x^{(i)}$:
$$z^{[1] (i)} =  W^{[1]} x^{(i)} + b^{[1]}\tag{1}$$ 
$$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]}\tag{3}$$
$$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}$$
$$y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}$$


Given the predictions on all the examples, the cost $J$ is: 
$$J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \right) \tag{6}$$
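
As a quick sanity check of the cost formula and the thresholding rule, here is a minimal sketch with three made-up activations and labels (the values are illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical activations a^{[2](i)} and labels y^{(i)} for m = 3 examples
A2 = np.array([[0.9, 0.2, 0.6]])
Y = np.array([[1, 0, 1]])
m = Y.shape[1]

# Per-example cross-entropy terms, then the mean with a leading minus sign
terms = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
J = -np.sum(terms) / m

# Thresholding at 0.5, as in equation (5)
y_prediction = (A2 > 0.5).astype(int)

print(J)             # roughly 0.28
print(y_prediction)  # [[1 0 1]]
```

Here the cost reduces to $-(\log 0.9 + \log 0.8 + \log 0.6)/3$, since each example contributes only the term matching its label.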

### 02.01 Define the structure

Define three variables:
- n_x: the size of the input layer
- n_h: the size of the hidden layer
- n_y: the size of the output layer

In [3]:
def layer_sizes(X, Y):
    """
    Arguments:
    X: input dataset of shape (input size, number of examples)
    Y: labels of shape (output size, number of examples)
    
    Returns:
    n_x: the size of the input layer
    n_h: the size of the hidden layer
    n_y: the size of the output layer
    """
    
    n_x = X.shape[0]
    n_h = 4
    n_y = Y.shape[0]
    
    return (n_x, n_h, n_y)
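
A short usage sketch for `layer_sizes()`, assuming a made-up dataset with 2 features and 400 examples (`X_demo` and `Y_demo` are hypothetical names; the function is repeated in compact form so the snippet runs on its own):

```python
import numpy as np

def layer_sizes(X, Y):
    # Same logic as above, repeated so the snippet is self-contained
    return (X.shape[0], 4, Y.shape[0])

# Hypothetical data: 2 features, 1 label row, 400 examples
X_demo = np.random.randn(2, 400)
Y_demo = (np.random.rand(1, 400) > 0.5).astype(int)

n_x, n_h, n_y = layer_sizes(X_demo, Y_demo)
print(n_x, n_h, n_y)  # 2 4 1
```

Note that `nn_model()` further down takes `n_h` as its own argument and only reads `n_x` and `n_y` from this function.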

### 02.02 Initialize parameters

- Initialize the weight matrices with small random values via `np.random.randn(a,b) * 0.01`.
- Initialize the bias vectors as zeros via `np.zeros((a,b))`.

In [8]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Arguments:
    n_x: size of the input layer
    n_h: size of the hidden layer
    n_y: size of the output layer
    
    Returns:
    parameters: dictionary containing parameters:
                    W1: weight matrix of shape (n_h, n_x)
                    b1: bias vector of shape (n_h, 1)
                    W2: weight matrix of shape (n_y, n_h)
                    b2: bias vector of shape (n_y, 1)
    """    
    
    W1 = np.random.randn(n_h, n_x) * 0.01 # shape of (n_h, n_x)
    b1 = np.zeros((n_h, 1)) # shape of (n_h, 1)
    W2 = np.random.randn(n_y, n_h) * 0.01 # shape of (n_y, n_h)
    b2 = np.zeros((n_y, 1)) # shape of (n_y, 1)
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
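
To confirm the shapes listed in the docstring, here is a minimal check assuming layer sizes of 2, 4 and 1 (the sizes and the seed are arbitrary choices for the demo; the function is repeated in compact form so the snippet runs on its own):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    # Compact copy of the function above so the snippet is self-contained
    return {"W1": np.random.randn(n_h, n_x) * 0.01,
            "b1": np.zeros((n_h, 1)),
            "W2": np.random.randn(n_y, n_h) * 0.01,
            "b2": np.zeros((n_y, 1))}

np.random.seed(0)  # arbitrary seed, only to make the demo reproducible
params = initialize_parameters(n_x=2, n_h=4, n_y=1)
shapes = {name: value.shape for name, value in params.items()}
print(shapes)
# {'W1': (4, 2), 'b1': (4, 1), 'W2': (1, 4), 'b2': (1, 1)}
```

The weights are random rather than zero so that the hidden units do not all compute the same function; the biases can start at zero because the random weights already break that symmetry.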

### 02.03 Forward propagation

Implementation of `forward_propagation()` using the following equations:
$$Z^{[1]} =  W^{[1]} X + b^{[1]}\tag{1}$$ 
$$A^{[1]} = \tanh(Z^{[1]})\tag{2}$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\tag{3}$$
$$\hat{Y} = A^{[2]} = \sigma(Z^{[2]})\tag{4}$$


In [9]:
def forward_propagation(X, parameters):
    """
    Arguments:
    X: input data of size (n_x, m)
    parameters: dict containing parameters (output of initialization function)
    
    Returns:
    A2: sigmoid output of second activation
    cache: dict containing  "Z1", "A1", "Z2" and "A2"
    """

    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache
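
`forward_propagation()` calls `sigmoid()`, which is not defined in this section and presumably comes from a helper module. A minimal stand-in, assuming it is the standard element-wise logistic function:

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise.

    Assumed stand-in for the helper used by forward_propagation().
    """
    return 1.0 / (1.0 + np.exp(-z))

values = sigmoid(np.array([[-2.0, 0.0, 2.0]]))
print(values)  # middle entry is exactly 0.5; the outer two sum to 1
```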

### 02.04 Compute cost

Compute the cost function as follows:

$$J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \right) \tag{6}$$


One way to implement the first term of the equation, $- \sum\limits_{i=1}^{m}  y^{(i)}\log(a^{[2](i)})$, without a for loop:
```python
logprobs = np.multiply(np.log(A2),Y)
cost = - np.sum(logprobs)
```

- Use `np.multiply` followed by `np.sum`, or use `np.dot` directly.
- `np.sum` over the result of `np.multiply` returns a NumPy scalar, while `np.dot` on two 2D arrays returns a 2D array.
- Use `np.squeeze()` or `float()` to reduce the result to a scalar if needed.
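
The difference between the two approaches can be seen in a small sketch with made-up values (the arrays below are illustrative only):

```python
import numpy as np

A2 = np.array([[0.9, 0.2, 0.6]])  # shape (1, 3), illustrative activations
Y = np.array([[1, 0, 1]])         # shape (1, 3), illustrative labels

via_sum = -np.sum(np.multiply(np.log(A2), Y))  # NumPy scalar
via_dot = -np.dot(Y, np.log(A2).T)             # 2D array of shape (1, 1)

print(np.ndim(via_sum), via_dot.shape)  # 0 (1, 1)
print(float(np.squeeze(via_dot)))       # plain Python float, same value as via_sum
```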

In [10]:
def compute_cost(A2, Y):
    """
    Computes the cross-entropy cost.
    
    Arguments:
    A2: The sigmoid output of the second activation, of shape (1, number of examples)
    Y: "true" labels vector of shape (1, number of examples)

    Returns:
    cost: cross-entropy cost given by the equation above
    """
    
    m = Y.shape[1] # number of examples

    logprobs = np.multiply(Y, np.log(A2)) + np.multiply((1-Y), np.log(1-A2))
    cost = -(1/m)*np.sum(logprobs)
        
    cost = float(np.squeeze(cost))  # makes sure cost is the dimension we expect. 
                                    # E.g., turns [[17]] into 17 
    
    return cost

### 02.05 Backpropagation

Using the cache computed during forward propagation, you can now implement backward propagation.
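
The gradients follow from applying the chain rule to equations (1)-(4) and the cost (6). Written in the same notation, as a summary that mirrors the code in the next cell, with $\odot$ denoting element-wise multiplication and using $\tanh'(z) = 1 - \tanh^2(z)$:

$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$$
$$db^{[2]} = \frac{1}{m} \sum\limits_{i = 1}^{m} dZ^{[2](i)}$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \odot \left(1 - A^{[1]} \odot A^{[1]}\right)$$
$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}$$
$$db^{[1]} = \frac{1}{m} \sum\limits_{i = 1}^{m} dZ^{[1](i)}$$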

In [12]:
def backward_propagation(parameters, cache, X, Y):
    """
    
    Arguments:
    parameters: dict containing our parameters 
    cache: a dict containing "Z1", "A1", "Z2" and "A2".
    X: input data of shape (n_x, number of examples)
    Y: "true" labels vector of shape (1, number of examples)
    
    Returns:
    grads: dict containing grads with respect to different parameters
    """
    m = X.shape[1]
    
    W1 = parameters['W1']
    W2 = parameters['W2']

    A1 = cache['A1']
    A2 = cache['A2']
    
    dZ2 = A2 - Y
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads
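
One common way to gain confidence in backpropagation formulas is a finite-difference gradient check. The sketch below is not part of the original notebook: it re-implements the forward pass, cost and gradients in compact form (`cost_fn`, `analytic_grads` and `numeric_grads` are hypothetical helper names) and compares the analytic gradients with central differences on made-up toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(params, X, Y):
    # Forward pass and cross-entropy cost, mirroring the functions above
    A1 = np.tanh(params["W1"] @ X + params["b1"])
    A2 = sigmoid(params["W2"] @ A1 + params["b2"])
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
    return cost, A1, A2

def analytic_grads(params, X, Y):
    # Same formulas as backward_propagation() above
    _, A1, A2 = cost_fn(params, X, Y)
    m = X.shape[1]
    dZ2 = A2 - Y
    dZ1 = (params["W2"].T @ dZ2) * (1 - A1 ** 2)
    return {"W1": dZ1 @ X.T / m,
            "b1": np.sum(dZ1, axis=1, keepdims=True) / m,
            "W2": dZ2 @ A1.T / m,
            "b2": np.sum(dZ2, axis=1, keepdims=True) / m}

def numeric_grads(params, X, Y, eps=1e-6):
    # Central finite differences, perturbing one parameter entry at a time
    grads = {}
    for name, value in params.items():
        grad = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            plus, _, _ = cost_fn(params, X, Y)
            value[idx] = old - eps
            minus, _, _ = cost_fn(params, X, Y)
            value[idx] = old  # restore the original entry
            grad[idx] = (plus - minus) / (2 * eps)
        grads[name] = grad
    return grads

rng = np.random.default_rng(0)  # toy data and parameters, for illustration only
X = rng.normal(size=(2, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
params = {"W1": rng.normal(size=(4, 2)), "b1": np.zeros((4, 1)),
          "W2": rng.normal(size=(1, 4)), "b2": np.zeros((1, 1))}

ga = analytic_grads(params, X, Y)
gn = numeric_grads(params, X, Y)
max_diff = max(np.max(np.abs(ga[k] - gn[k])) for k in params)
print(max_diff)  # expected to be tiny if the formulas are consistent
```

A large discrepancy would point to a mismatch between the forward pass and the gradient formulas, for example a wrong activation derivative.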

### 02.06 Update parameters

Implement gradient descent to update (W1, b1, W2, b2) using their gradients (dW1, db1, dW2, db2).

General gradient descent rule: $\theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$ where $\alpha$ is the learning rate and $\theta$ represents a parameter.
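
To make the rule concrete, here is a small sketch on a made-up one-dimensional cost $J(\theta) = \theta^2$ (the starting value and learning rate are arbitrary illustration choices, not the values used for the network):

```python
theta = 1.0   # hypothetical starting value
alpha = 0.1   # hypothetical learning rate

for step in range(3):
    grad = 2 * theta              # dJ/dtheta for J(theta) = theta**2
    theta = theta - alpha * grad  # the update rule above
    print(step, theta)
```

Each step multiplies $\theta$ by $1 - 2\alpha = 0.8$, so three steps bring it from 1.0 to about 0.512, moving toward the minimum at 0.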

In [13]:
def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above
    
    Arguments:
    parameters: python dictionary containing your parameters 
    grads: python dictionary containing your gradients 
    learning_rate: step size (alpha) used in the gradient descent update
    
    Returns:
    parameters: python dictionary containing your updated parameters 
    """
    W1 = copy.deepcopy(parameters['W1'])
    b1 = copy.deepcopy(parameters['b1'])
    W2 = copy.deepcopy(parameters['W2'])
    b2 = copy.deepcopy(parameters['b2'])
    
    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']
    
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

### 02.07 Integration

The neural network model has to use the previous functions in the right order.

In [17]:
def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    """
    Arguments:
    X: dataset of shape (n_x, number of examples)
    Y: labels of shape (1, number of examples)
    n_h: size of the hidden layer
    num_iterations: num of iterations in gradient descent loop
    print_cost: if True, print the cost every 1000 iterations
    
    Returns:
    parameters: parameters learnt by the model. They can then be used to predict.
    """
    
    np.random.seed(3)
    # n_h returned by layer_sizes is ignored; the n_h argument is used instead
    n_x, _, n_y = layer_sizes(X, Y)
    
    parameters = initialize_parameters(n_x, n_h, n_y)

    for i in range(0, num_iterations):
        A2, cache = forward_propagation(X, parameters)
        cost = compute_cost(A2, Y)
        grads = backward_propagation(parameters, cache, X, Y)
        parameters = update_parameters(parameters, grads)

        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print(f"Cost after iteration {i}: {cost:f}")

    return parameters

### 02.08 Predict

Use forward propagation to predict results.

predictions = $y_{prediction} = \mathbb{1}\{\text{activation} > 0.5\} = \begin{cases}
      1 & \text{if}\ \text{activation} > 0.5 \\
      0 & \text{otherwise}
    \end{cases}$  

In [18]:
def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters: python dictionary containing your parameters 
    X: input data of size (n_x, m)
    
    Returns:
    predictions: vector of predictions of our model (red: 0 / blue: 1)
    """

    A2, cache = forward_propagation(X, parameters)
    predictions = A2 > 0.5
    
    return predictions
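
The thresholding step and an accuracy computation in the style of the logistic regression baseline can be sketched as follows, using made-up activations and labels in place of a trained model's output:

```python
import numpy as np

A2 = np.array([[0.91, 0.08, 0.55, 0.49]])  # hypothetical activations
Y = np.array([[1, 0, 0, 0]])               # hypothetical labels

predictions = A2 > 0.5                      # boolean array, as returned by predict()
accuracy = np.mean(predictions == Y) * 100  # True compares equal to 1, False to 0
print(predictions.astype(int), accuracy)    # [[1 0 1 0]] 75.0
```

Because `predict()` returns booleans, cast with `.astype(int)` if 0/1 labels are needed downstream.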