# Gradient Descent

### Using squared errors instead of log loss

We want to find the weights for our neural networks. Let's start by thinking about the goal. The network needs to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are, the **error**. A common metric is the sum of the squared errors (**SSE**):

$$ E = \frac{1}{2} \sum_{\mu} \sum_{j} [y_{j}^{\mu}-\hat{y}_{j}^{\mu}]^{2} $$

where $\hat{y}$ is the prediction and $y$ is the true value, and you take the sum over all output units $j$ and another sum over all data points $\mu$.

First, the inside sum over $j$. 
> This variable $j$ represents the output units of the network. So this inside sum is saying "for each output unit, find the difference between the true value $y$ and the predicted value $\hat{y}$, then square the difference, then sum all those squares."
    
Then, the other sum over $\mu$ is a sum over all the data points.
> For each data point you calculate the inner sum of the squared differences for each output unit. Then, you sum those squared differences for each data point. That gives you the overall error for all the output predictions for all the data points.
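The double sum can be sketched in NumPy as follows. The arrays `y_true` and `y_pred` are made-up example values (rows are data points $\mu$, columns are output units $j$), and the helper name `sse` is an assumption for illustration only.

```python
import numpy as np

def sse(y_true, y_pred):
    """Half the sum of squared errors over all data points and output units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 0.5 * np.sum((y_true - y_pred) ** 2)

# Hypothetical data: 3 data points (rows, index mu), 2 output units (columns, index j)
y_true = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
y_pred = np.array([[0.8, 0.1],
                   [0.3, 0.7],
                   [0.9, 0.6]])

# Explicit double sum, mirroring the formula term by term
total = 0.0
for mu in range(y_true.shape[0]):        # outer sum over data points
    for j in range(y_true.shape[1]):     # inner sum over output units
        total += (y_true[mu, j] - y_pred[mu, j]) ** 2
explicit = 0.5 * total

# The loop and the vectorized version should agree
print(explicit, sse(y_true, y_pred))
```

The vectorized form is usually preferable in practice, since `np.sum` over the whole array performs both sums at once.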

The **SSE** is a good choice for a few reasons:
> - The square ensures the error is never negative, and larger errors are penalized more than smaller errors.
> - It makes the math nice (always a good thing): differentiating the square cancels the factor of $\frac{1}{2}$.

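The output of the neural network depends on the weights:
$$ \hat{y}_{j}^{\mu} = f (\sum_{i}w_{ij}x_{i}^{\mu} )$$

where $f$ is the activation function, $x_{i}^{\mu}$ is input $i$ of data point $\mu$, and $w_{ij}$ is the weight from input $i$ to output unit $j$. Accordingly, the error depends on the weights: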

$$ E = \frac{1}{2} \sum_{\mu} \sum_{j}[ y_{j}^{\mu}-f(\sum_{i}w_{ij}x_{i}^{\mu})]^{2}$$
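To make the dependence on the weights concrete, here is a minimal sketch that treats the error as a function of a weight matrix. The sigmoid activation, the function names `predict` and `error`, and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(h):
    # Assumed activation function f; any differentiable f would work here
    return 1 / (1 + np.exp(-h))

def predict(X, W, f=sigmoid):
    # X: (n_points, n_inputs), W: (n_inputs, n_outputs)
    # Each row of X @ W holds sum_i w_ij * x_i for one data point
    return f(X @ W)

def error(W, X, Y, f=sigmoid):
    # E = 1/2 * sum over data points and output units of (y - y_hat)^2
    return 0.5 * np.sum((Y - predict(X, W, f)) ** 2)

# Hypothetical data: 2 data points, 2 inputs, 1 output unit
X = np.array([[0.1, 0.3],
              [0.5, -0.2]])
Y = np.array([[0.2],
              [0.7]])

# Two different weight settings give two different errors on the same data
W_a = np.array([[-0.8], [0.5]])
W_b = np.array([[0.0], [0.0]])

print(error(W_a, X, Y), error(W_b, X, Y))
```

With the data held fixed, `error` depends only on `W`, which is why the weights are the quantities that gradient descent adjusts.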

We want the network's prediction error to be as small as possible and the weights are the knobs we can use to make that happen. Our goal is to find weights $w_{ij}$ that minimize the squared error $E$. To do this with a neural network, typically you'd use **gradient descent**.

**A mountain as a gradient descent analogy:**
Since the fastest way down a mountain is along the steepest slope, each step should be taken in the direction that reduces the error the most. We can find this direction by calculating the *gradient* of the squared error: the gradient points in the direction of steepest increase, so we step in the opposite direction.

<img src='MultiNNGrad.png'>
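One way to see the "steepest direction" idea in action is to estimate the gradient numerically and take a single step against it. This is only a sketch: it assumes a single sigmoid output unit, and the helper `numerical_gradient`, the step size `eps`, and the example values are all illustrative choices.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def error(w, x, y):
    # Squared error for one data point and one output unit
    return 0.5 * (y - sigmoid(np.dot(x, w))) ** 2

def numerical_gradient(w, x, y, eps=1e-6):
    # Central finite differences: nudge each weight up and down by eps
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (error(w + step, x, y) - error(w - step, x, y)) / (2 * eps)
    return grad

# Illustrative values
x = np.array([0.1, 0.3])
y = 0.2
w = np.array([-0.8, 0.5])
learnrate = 0.5

# The gradient points uphill, so step in the opposite direction
grad = numerical_gradient(w, x, y)
w_new = w - learnrate * grad

# The error after the step should be smaller than before
print(error(w, x, y), error(w_new, x, y))
```

Finite differences are slow for large networks, so they are mostly useful as a sanity check against an analytic gradient.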

One weight update can be calculated as:
$$\Delta w_{i} = \eta \delta x_{i} $$

with the error term $\delta$ as
$$\delta = (y-\hat{y})f'(\sum_{i} w_{i}x_{i}) $$

Remember, in the above equation, $(y-\hat{y})$ is the output error and $f'(h)$ is the derivative of the activation function $f(h)$, evaluated at $h = \sum_{i} w_{i}x_{i}$. We'll call that derivative the output gradient.
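For readers who want the intermediate step, here is one way to see where the update rule comes from. This is a sketch for a single data point and a single output unit, writing $h = \sum_{i} w_{i}x_{i}$ and $\hat{y} = f(h)$:

$$ E = \frac{1}{2}(y-\hat{y})^{2} $$

$$ \frac{\partial E}{\partial w_{i}} = -(y-\hat{y})\,f'(h)\,\frac{\partial h}{\partial w_{i}} = -(y-\hat{y})\,f'(h)\,x_{i} = -\delta x_{i} $$

$$ \Delta w_{i} = -\eta \frac{\partial E}{\partial w_{i}} = \eta \delta x_{i} $$

The minus sign in the last line is what makes the step go downhill: the weights move against the gradient, scaled by the learning rate $\eta$.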

Now, we'll write this out in code for the case of only one output unit. We'll also be using the sigmoid as the activation function $f(h)$.

```python
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta, in the weight step equation
learnrate = 0.5

# The linear combination performed by the node (h in f(h) and f'(h))
h = x[0]*weights[0] + x[1]*weights[1]
# or h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# Output error (y - y-hat)
error = y - nn_output

# output gradient (f'(h))
output_grad = sigmoid_prime(h)

# Error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step
del_w = [learnrate * error_term * x[0],
         learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
```

```python
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

# Calculate one gradient descent step for each weight

# The node's linear combination of inputs and weights
h = x[0]*w[0] + x[1]*w[1] + x[2]*w[2] + x[3]*w[3]
# or h = np.dot(x, w)

# Output of the neural network (y-hat)
nn_output = sigmoid(h)

# Output error (y - y-hat)
error = y - nn_output

# Error term: the output error times the output gradient f'(h)
output_grad = sigmoid_prime(h)
error_term = error * output_grad

# Change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
```

```
Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]
```
