## Logistic regression with a neural network mindset

In this workshop, you will code a neural network that recognizes cats in images. You will learn the mindset behind how a neural network works and, more generally, get an idea of what deep learning really is.

**Instructions:**
- Don't use for or while loops unless explicitly asked.

**You will learn to:**
- Create the general architecture of a learning model including:
  - The initialization of parameters
  - The computation of the cost function and its gradient
  - The usage of an optimization algorithm
- Gather the three functions above into a single model function, in the right order

## 1 - Packages ##

Let's begin by importing the required packages:
- [numpy](http://www.numpy.org) is the fundamental package for scientific computing in Python
- [h5py](http://www.h5py.org) is a package for interacting with a dataset stored in an H5 file
- [matplotlib](http://matplotlib.org) is a popular library for plotting graphs in Python
- [PIL](http://www.pythonware.com/products/pil/) and [scipy](https://www.scipy.org/) are used here to test the model with your own photos at the end

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset

%matplotlib inline

## 2 - Overview of the problem ##

**Problem**:
We gave you a dataset ("data.h5") containing:
     - a training set of m_train images, each labeled as being a cat (y=1) or non-cat (y=0)
     - a test set of m_test images, each labeled as cat or non-cat
     - each image is of the form (num_px, num_px, 3) where 3 corresponds to the three RGB channels. Each image is therefore square with side num_px.

You will build a simple image recognition algorithm that can correctly classify new images as cat or non-cat.

Let's explore our dataset first. Let's start by importing it.

In [None]:
# Data loading
x_train_orig, y_train, x_test_orig, y_test, classes = load_dataset()

We added "_orig" at the end of the image datasets (train and test) because we will preprocess them later. A dataset you are given is rarely ready to use as is: it usually needs a cleanup stage called preprocessing. After this step you will end up with x_train and x_test (y_train and y_test don't need preprocessing).

Each row of x_train_orig and x_test_orig is an array representing an image. You can view one by running the following code. You can also change the value of `index` if you want to see other images.

In [None]:
# Example of an image
index = 30
plt.imshow(x_train_orig[index])
if classes[np.squeeze(y_train[:, index])] == b'cat':
    print("y = " + str(y_train[:, index]) + ", this is a cat photo")
else:
    print("y = " + str(y_train[:, index]) + ", this is not a cat photo")

Many bugs in deep learning come from matrix/vector dimensions that don't match. Keeping track of the shapes of your arrays will help you avoid most of them.
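
To make this concrete, here is a small sketch of a classic shape pitfall, using toy arrays rather than the workshop dataset. A rank-1 array of shape `(n,)` broadcasts against a column vector of shape `(n, 1)` and silently produces an `(n, n)` matrix. The variable names are illustrative only.

```python
import numpy as np

a = np.arange(3)           # rank-1 array, shape (3,)
col = a.reshape(3, 1)      # explicit column vector, shape (3, 1)

# Broadcasting a (3,) array against a (3, 1) array gives a (3, 3) matrix,
# which is rarely what was intended.
summed = a + col
print(summed.shape)        # (3, 3)

# Transposing a rank-1 array changes nothing, unlike a true column vector.
print(a.T.shape)           # (3,)
print(col.T.shape)         # (1, 3)

# A cheap guard: assert the shape you expect before going further.
assert col.shape == (3, 1)
```

Reshaping vectors to an explicit `(n, 1)` or `(1, n)` shape, and asserting on it, tends to surface this kind of mistake early.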


**Exercise:** Find the correct values of:
     - m_train (number of train images)
     - m_test (number of test images)
     - num_px (length/width of an image)
    
Remember that x_train_orig is a numpy array of the form (m_train, num_px, num_px, 3). For example, you can access `m_train` by writing `x_train_orig.shape[0]`.

In [None]:
### Start of code ### (≈ 3 lines of code)
m_train = None
m_test = None
num_px = None
### End of code ###

print ("Train size: m_train = " + str(m_train))
print ("Test size: m_test = " + str(m_test))
print ("Height/Width of each image: num_px = " + str(num_px))
print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")
print ("x_train shape: " + str(x_train_orig.shape))
print ("y_train shape: " + str(y_train.shape))
print ("x_test shape: " + str(x_test_orig.shape))
print ("y_test shape: " + str(y_test.shape))

**Expected result for m_train, m_test and num_px**: 

<table style="width:15%">
  <tr>
    <td>m_train</td>
    <td> 209 </td> 
  </tr>
  
  <tr>
    <td>m_test</td>
    <td> 50 </td> 
  </tr>
  
  <tr>
    <td>num_px</td>
    <td> 64 </td> 
  </tr>
  
</table>


For convenience, you should transform your images of the shape (num_px, num_px, 3) into a numpy array of the shape (num_px $*$ num_px $*$ 3, 1). After that, our datasets (train and test) will be numpy arrays in which each column represents a flattened image. There should be m_train and m_test columns.


**Exercise:**
Transform the train and test datasets so that images of the shape (num_px, num_px, 3) are flattened into simple vectors of the shape (num\_px $*$ num\_px $*$ 3, 1).

Small trick: when you want to flatten an X matrix of the form (a,b,c,d) into an X_flatten matrix of the form (b$*$c$*$d, a):
```python
X_flatten = X.reshape(X.shape[0], -1).T # X.T is the transpose of X
```
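
As a quick check of the trick on a toy array: the shape `(2, 4, 4, 3)` below is made up for illustration and stands for 2 images of 4 x 4 pixels.

```python
import numpy as np

# Two fake "images" of 4 x 4 pixels with 3 channels.
X = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

# Flatten each image into a column: (2, 4, 4, 3) -> (2, 48) -> (48, 2).
X_flatten = X.reshape(X.shape[0], -1).T

print(X_flatten.shape)  # (48, 2)

# Column j holds exactly the pixels of image j, in row-major order.
assert np.array_equal(X_flatten[:, 1], X[1].ravel())
```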

In [None]:
### Start of code ### (≈ 2 lines of code)
x_train_flatten = None
x_test_flatten = None
### End of code ###

print ("x_train_flatten shape: " + str(x_train_flatten.shape))
print ("y_train shape: " + str(y_train.shape))
print ("x_test_flatten shape: " + str(x_test_flatten.shape))
print ("y_test shape: " + str(y_test.shape))
print ("check 1 random after reshaping: " + str(x_train_flatten[5:10,1]))
print ("check 2 random after reshaping: " + str(x_train_flatten[17:22,34]))

**Expected result**: 

    x_train_flatten shape: (12288, 209)
    y_train shape: (1, 209)
    x_test_flatten shape: (12288, 50)
    y_test shape: (1, 50)
    check 1 random after reshaping: [182 188 179 174 213]
    check 2 random after reshaping: [20 16  3 22 15]

To represent color images, red, green and blue (RGB) channels must be specified for each pixel, and the value of each pixel is actually a vector of 3 numbers between 0 and 255.

A common preprocessing step in machine learning is to center and standardize your dataset: you subtract the mean of the whole numpy array from each example, then divide each example by the standard deviation of the whole numpy array. For image datasets, it is simpler and more convenient to just divide every value by 255 (the maximum value of a color channel). You then end up with a numpy array of numbers between 0 and 1.
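
The two options described above can be compared on a small stand-in array. This is only a sketch: the data is random rather than the real dataset, and the names `X_toy`, `X_scaled` and `X_standardized` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for flattened uint8 images: 12 features, 5 examples.
X_toy = rng.integers(0, 256, size=(12, 5)).astype(np.uint8)

# Option used in this workshop: rescale pixel values to [0, 1].
X_scaled = X_toy / 255.

# General alternative: subtract the mean, then divide by the standard deviation.
X_standardized = (X_toy - X_toy.mean()) / X_toy.std()

print(X_scaled.min() >= 0.0, X_scaled.max() <= 1.0)
print(X_standardized.mean(), X_standardized.std())
```

After standardization the array has mean close to 0 and standard deviation close to 1, while division by 255 only guarantees values in the interval [0, 1].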

Let's normalize our dataset:

In [None]:
x_train = x_train_flatten/255.
x_test = x_test_flatten/255.

<font color='blue'>
    
**What you need to remember:**

Common preprocessing steps:
- Analyze data by displaying dataset dimensions and shapes (m_train, m_test, num_px, etc.)
- Reshape the datasets so that each example becomes a vector of shape (num_px \* num_px \* 3, 1)
- Normalize data

## 3 - General architecture of the learning algorithm ##

It's time to build a simple algorithm that distinguishes cat images from non-cat images.

You will build a logistic regression while following the mindset of a neural network. The following figure explains why **logistic regression** is actually a very simple neural network!

<img src="images/LogReg_kiank.png" style="width:650px;height:400px;">

**Mathematical expression of the algorithm**:

For each example $x^{(i)}$:
$$z^{(i)} = w^T x^{(i)} + b \tag{1}$$
$$\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})\tag{2}$$
$$ \mathcal{L}(a^{(i)}, y^{(i)}) = - y^{(i)} \log(a^{(i)}) - (1-y^ {(i)} ) \log(1-a^{(i)})\tag{3}$$

The cost is then computed by averaging the **losses** over all training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{6}$$
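
To see equations (1) to (3) at work, here is a small worked example on a single made-up input with two features. The numbers and variable names are arbitrary, and plain Python scalars are used on purpose so that each step stays visible. The vectorized numpy version is left for the exercises.

```python
import math

# Hypothetical parameters and one example with two features.
w = [0.5, -0.25]
b = 0.1
x = [2.0, 4.0]
y = 1

# Equation (1): linear part z = w^T x + b.
z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

# Equation (2): activation a = sigmoid(z).
a = 1.0 / (1.0 + math.exp(-z))

# Equation (3): cross-entropy loss for this example.
loss = -y * math.log(a) - (1 - y) * math.log(1 - a)

print(z, a, loss)
```

Here z works out to 0.1, which gives an activation of about 0.525 and a loss of about 0.644. With m examples, the cost J is simply the mean of these per-example losses.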

**Key steps**:
In this exercise, you will cover the following steps:
    - Initialize model parameters
    - Learn the parameters to the model while minimizing the cost
    - Use learned parameters to make predictions (on the test dataset)
    - Analyze the results and conclude

Feel free to call one of the assistants if you need help.

## 4 - Build the different parts of the algorithm ##

The main steps to build a neural network are:
1. Define the structure of the model (such as the number of input features)
2. Initialize model parameters
3. Loop:
     - Calculate the current loss (forward propagation)
     - Calculate the current gradient (backward propagation)
     - Update parameters (gradient descent)
    
### 4.1 - Useful functions

**Exercise**:
Using your code from the last workshop (on Python and numpy), implement the sigmoid function. As you saw in the previous figure, you need to calculate $sigmoid( w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$ to make predictions. Use np.exp().

In [None]:
def sigmoid(z):
    """
    Arguments:
    z -- A scalar or a numpy array
    
    Return:
    s -- sigmoid(z)
    """

    ### Start of code ### (≈ 1 line of code)
    s = None
    ### End of code ###
    
    return s

In [None]:
print ("sigmoid([-1, 0, 0.5, 2, 3]) = " + str(sigmoid(np.array([-1, 0, 0.5, 2, 3]))))

**Expected result**: 

sigmoid([-1, 0, 0.5, 2, 3]) = [0.26894142 0.5        0.62245933 0.88079708 0.95257413]

### 4.2 - Parameter initialization

**Exercise:**
Implement parameter initialization in the next cell. You must initialize w as a vector containing only 0s. If you don't know which numpy function to use, check out np.zeros() in the numpy docs.

In [None]:
def initialize(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.

    Argument:
    dim -- size of the vector w we want (or the number of parameters in this case)

    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar number corresponding to the bias
    """
    
    ### Start of code ### (≈ 2 lines of code)
    w = None
    b = None
    ### End of code ###

    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))
    
    return w, b

In [None]:
dim = 5
w, b = initialize(dim)
print ("w = " + str(w))
print ("b = " + str(b))

**Expected result**: 

w = [[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
 
b = 0

For images, w must be of shape (num_px $\times$ num_px $\times$ 3, 1).

### 4.3 - Forward and backward propagation

Now that your parameters are initialized, you can write forward and backward propagation steps to learn the parameters.


**Exercise:** Implement the `propagation()` function, which computes the cost function and its gradient.

**Hints**:

Forward propagation:
- We give you X
- You calculate $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)})$
- You calculate the cost function: $J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+ (1-y^{(i)})\log(1-a^{(i)})\right]$

For backward propagation, here are the two formulas you will use:

$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{7}$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{8}$$
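
For readers who want to see where these two formulas come from, here is a brief chain-rule sketch for a single example. It uses the notation of equations (1) to (3) and assumes that $\log$ is the natural logarithm.

$$ \frac{\partial \mathcal{L}}{\partial a^{(i)}} = -\frac{y^{(i)}}{a^{(i)}} + \frac{1-y^{(i)}}{1-a^{(i)}}, \qquad \frac{\partial a^{(i)}}{\partial z^{(i)}} = a^{(i)}\,(1-a^{(i)}) $$

$$ \frac{\partial \mathcal{L}}{\partial z^{(i)}} = \frac{\partial \mathcal{L}}{\partial a^{(i)}} \, \frac{\partial a^{(i)}}{\partial z^{(i)}} = -y^{(i)}(1-a^{(i)}) + (1-y^{(i)})\,a^{(i)} = a^{(i)} - y^{(i)} $$

Since $z^{(i)} = w^T x^{(i)} + b$, we have $\frac{\partial z^{(i)}}{\partial w} = x^{(i)}$ and $\frac{\partial z^{(i)}}{\partial b} = 1$, so

$$ \frac{\partial \mathcal{L}}{\partial w} = x^{(i)}\,(a^{(i)} - y^{(i)}), \qquad \frac{\partial \mathcal{L}}{\partial b} = a^{(i)} - y^{(i)} $$

Averaging these per-example gradients over the $m$ examples, and stacking the $x^{(i)}$ as the columns of $X$, gives formulas (7) and (8).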

In [None]:
def propagation(w, b, X, Y):
    """
    Arguments:
    w -- the weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- the bias, a scalar number
    X -- matrix of size (num_px * num_px * 3, number of examples)
    Y -- the vector corresponding to labels (0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    cost -- value of the cost function J
    dw -- gradient of the cost with respect to w, same shape as w
    db -- gradient of the cost with respect to b, same shape as b

    Tips:
    - Write your code step by step for the propagation. np.log() and np.dot() will be useful.
    """
    
    m = X.shape[1]

    # Forward Propagation (from X to cost)
    ### Start of code ### (≈ 2 lines of code)
    # compute activation
    A = None
    # compute cost
    cost = None
    ### End of code ###
    
    # Backward Propagation (to find gradients)
    ### Start of code ### (≈ 2 lines of code)
    dw = None
    db = None
    ### End of code ###

    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost

In [None]:
w, b, X, Y = np.array([[4.],[2.]]), 2., np.array([[1.,2.,-1.],[3.,4.,-3.2]]), np.array([[1,0,1]])
grads, cost = propagation(w, b, X, Y)
print ("dw = " + str(grads["dw"]))
print ("db = " + str(grads["db"]))
print ("cost = " + str(cost))

**Expected result**:

<table style="width:50%">
    <tr>
        <td>  dw   </td>
      <td> [[0.999923  ]
         [2.39975403]]</td>
    </tr>
    <tr>
        <td>  db  </td>
        <td> 7.288578855054369e-05 </td>
    </tr>
    <tr>
        <td>  cost  </td>
        <td> 8.800077000774023 </td>
    </tr>

</table>

### 4.4 - Optimization

- You have initialized your parameters.
- You also know how to calculate a cost function and its gradient.
- Now you want to update the parameters using gradient descent.


**Exercise:** Write the optimization function. The goal is to learn $w$ and $b$ by minimizing the cost function $J$. For a given parameter $\theta$, the update formula is $ \theta = \theta - \alpha \text{ } d\theta$, where $\alpha$ corresponds to the **learning rate**.
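
As an illustration of the update rule in isolation, the sketch below runs gradient descent on a one-dimensional toy cost $J(\theta) = (\theta - 3)^2$, whose gradient is $d\theta = 2(\theta - 3)$. The cost, the starting point and the learning rate are made up for the example and are unrelated to the cat classifier.

```python
# Toy cost J(theta) = (theta - 3)^2 with gradient dtheta = 2 * (theta - 3).
def toy_gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0          # arbitrary starting point
alpha = 0.1          # learning rate (assumed value)

for i in range(100):
    dtheta = toy_gradient(theta)
    theta = theta - alpha * dtheta   # the update rule from the text

print(theta)  # close to the minimizer 3.0
```

Each step moves $\theta$ against the gradient, so the distance to the minimizer shrinks by a constant factor (here 0.8) per iteration. A learning rate that is too large would make the iterates diverge instead.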

In [None]:
def optimization(w, b, X, Y, n_iterations, learning_rate, print_cost=False):
    """
    Arguments:
    w -- the weights, numpy array of size (num_px * num_px * 3, 1)
    b -- the bias, a scalar
    X -- matrix of size (num_px * num_px * 3, number of examples)
    Y -- the vector corresponding to labels (0 if non-cat, 1 if cat) of size (1, number of examples)
    n_iterations -- number of iterations in the optimization loop
    learning_rate -- the learning rate
    print_cost -- True to print the cost every 100 iterations
    
    Returns:
    params -- python dictionary containing weights w and bias b
    grads -- dictionary containing the gradients of the cost function with respect to w and b
    costs -- list of the costs recorded during optimization, used to plot the learning curve

    Tips:
    You will need to write 2 steps and iterate through them:
        1) Calculate the cost and the gradient for the current parameters. Use propagation()
        2) Update the parameters using gradient descent for w and b
    
    """
    
    costs = []
    
    for i in range(n_iterations):
        
        
        # Computing cost and gradients (≈ 1-4 lines of code)
        ### Start of code ### 
        grads, cost = None
        ### End of code ###
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # update (≈ 2 lines of code)
        ### Start of code ###
        w = None
        b = None
        ### End of code ###
        
        # Store the costs
        if i % 100 == 0:
            costs.append(cost)
        
        # Display cost every 100 iterations
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

In [None]:
params, grads, costs = optimization(w, b, X, Y, n_iterations= 100, learning_rate = 0.009, print_cost = False)

print ("w = " + str(params["w"]))
print ("b = " + str(params["b"]))
print ("dw = " + str(grads["dw"]))
print ("db = " + str(grads["db"]))

**Expected result**: 

w = [[ 3.11567021]
 [-0.10995646]]

b = 1.9850788219795317

dw = [[0.89699851]
 [2.07122651]]

db = 0.09736680112994214

**Exercise:** The previous function displayed the learned weights w and bias b. They can be used to predict the labels of an X dataset. Implement the `prediction()` function. There are 2 steps to calculate the predictions:

1. Calculate $\hat{Y} = A = \sigma(w^T X + b)$

2. Convert the entries of A to 0 (if activation <= 0.5) or 1 (if activation > 0.5) and store them in a `Y_prediction` vector.
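
The thresholding in step 2 can also be written without a loop. The sketch below applies it to a made-up activation row `A_toy`. It is one possible vectorized alternative, not the loop-based version that the function skeleton asks for.

```python
import numpy as np

# Made-up activations for 4 examples, shape (1, 4).
A_toy = np.array([[0.9, 0.5, 0.2, 0.51]])

# Entries strictly greater than 0.5 become 1, all others become 0.
Y_toy = (A_toy > 0.5).astype(float)

print(Y_toy)  # [[1. 0. 0. 1.]]
```

Note that an activation of exactly 0.5 maps to 0, in line with the rule stated in step 2.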

In [None]:
def prediction(w, b, X):
    '''
    Arguments:
    w -- the weights, numpy array of size (num_px * num_px * 3, 1)
    b -- the bias, a scalar
    X -- matrix of size (num_px * num_px * 3, number of examples)

    Returns:
    Y_prediction -- a numpy array (vector) containing all the predictions (0/1) for the examples in X
    '''
    
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute the vector A of predicted probabilities that a cat is present in each image
    ### Start of code ### (≈ 1 line of code)
    A = None
    ### End of code ###
    
    for i in range(A.shape[1]):
        
        # Convert the probabilities A[0,i] to actual predictions Y_prediction[0,i]
        ### Start of code ### (≈ 4 lines of code)

        ### End of code ###
    
    assert(Y_prediction.shape == (1, m))
    
    return Y_prediction

In [None]:
w = np.array([[0.1124579],[0.23106775]])
b = -0.3
X = np.array([[1.,-1.2,-2.4],[1.4,2.5,0.6]])
print ("predictions = " + str(prediction(w, b, X)))

**Expected result**: 

<table style="width:30%">
    <tr>
         <td>
             predictions
         </td>
          <td>
            [[ 1.  1.  0.]]
         </td>  
   </tr>

</table>


<font color='blue'>
    
**What you need to remember:**

You have implemented several functions that:
- initialize (w, b)
- optimize the cost iteratively to learn the parameters (w, b):
     - compute the cost and its gradient
     - update the parameters using gradient descent
- use the learned (w, b) to predict the labels for a given set of examples

## 5 - Merge all functions into a single model ##

You will now see how the overall model is structured by integrating all the blocks (the functions you have implemented) together, in the correct order.

**Exercise**: Implement the `model()` function. Use the following notation:
     - Y_pred_test for your predictions on the test dataset
     - Y_pred_train for your predictions on the train dataset
     - parameters, grads, costs for the outputs of optimization()

In [None]:
def model(X_train, Y_train, X_test, Y_test, n_iterations = 2000, learning_rate = 0.5, print_cost=False):
    """ 
     Arguments:
     X_train -- train dataset represented by a numpy array of shape (num_px * num_px * 3, m_train)
     Y_train -- train data labels represented by a numpy array (vector) of shape (1, m_train)
     X_test -- test dataset represented by a numpy array of shape (num_px * num_px * 3, m_test)
     Y_test -- test data labels represented by a numpy array (vector) of shape (1, m_test)
     n_iterations -- number of iterations in the optimization loop
     learning_rate -- the learning rate
     print_cost -- True to print the cost every 100 iterations


     Returns:
     d -- python dictionary containing all model information.
    """
    
    ### Start of code ###

    # Initialize parameters with 0s (≈ 1 line of code)
    w, b = None

    # Gradient descent (≈ 1 line of code)
    parameters, grads, costs = None
    
    # Retrieve parameters
    w = parameters["w"]
    b = parameters["b"]
    
    # Predict both train and test examples (≈ 2 lines of code)
    Y_pred_test = None
    Y_pred_train =  None

    ### End of code ###

    # Display accuracies
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_pred_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_pred_test - Y_test)) * 100))

    
    d = {"costs": costs,
         "Y_pred_test": Y_pred_test, 
         "Y_pred_train" : Y_pred_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": n_iterations}
    
    return d

Run the following cell to train your model.

In [None]:
d = model(x_train, y_train, x_test, y_test, n_iterations = 2000, learning_rate = 0.006, print_cost = True)

**Expected result**: 

Cost after iteration 0: 0.693147

Cost after iteration 100: 0.649811

Cost after iteration 200: 0.538312

Cost after iteration 300: 0.439262

Cost after iteration 400: 0.349825

Cost after iteration 500: 0.278498

Cost after iteration 600: 0.249764

Cost after iteration 700: 0.231178

Cost after iteration 800: 0.215229

Cost after iteration 900: 0.201339

Cost after iteration 1000: 0.189110

Cost after iteration 1100: 0.178249

Cost after iteration 1200: 0.168533

Cost after iteration 1300: 0.159788

Cost after iteration 1400: 0.151873

Cost after iteration 1500: 0.144677

Cost after iteration 1600: 0.138104

Cost after iteration 1700: 0.132079

Cost after iteration 1800: 0.126537

Cost after iteration 1900: 0.121421

train accuracy: 99.52153110047847 %

test accuracy: 68.0 %




**Comment**: Your train accuracy is very close to 100%. That's great: your model works and recognizes cats in the train dataset very well. However, the test accuracy is only 68%. That is not too bad for such a simple model and such a small dataset, and we will build a better model in a future workshop.

You may also have noticed that the model overfits the train data. We will see methods to reduce overfitting later (regularization, for example). Use the code below to plot the learning curve, that is, the cost as a function of the number of iterations.

In [None]:
# Plot learning curve (with costs)
costs = np.squeeze(d['costs'])
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate =" + str(d["learning_rate"]))
plt.show()

**Interpretation**:
You can see the cost decreasing, which shows that the parameters are being learned. However, the model is also fitting the train dataset too closely. Try increasing the number of iterations in the cell above and re-running the cells. You should see the accuracy on the train dataset go up while the accuracy on the test dataset goes down. This is called overfitting.

Some useful sources:
- http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
- https://stats.stackexchange.com/questions/211436/why-do-we-normalize-images-by-subtracting-the-datasets-image-mean-and-not-the-c