## Sigmoid Example
Use backpropagation to compute the gradients of the following function with respect to \\(x\\) and \\(w\\):
$$f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}}$$

### Sigmoid Function \\(\sigma(x)\\)
The sigmoid function and its derivative are:

$$
\sigma(x) = \frac{1}{1+e^{-x}} \\\\
\rightarrow \hspace{0.2in} \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) \left( \frac{1}{1+e^{-x}} \right) 
= \left( 1 - \sigma(x) \right) \sigma(x)
$$
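
As a quick sanity check on the identity above, the sketch below compares \\((1 - \sigma(x))\sigma(x)\\) with a central finite difference. The helper names `sigmoid` and `sigmoid_grad`, the step size `h`, and the sample points are assumptions made for this illustration only.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # analytic derivative: (1 - sigma(x)) * sigma(x)
    s = sigmoid(x)
    return (1.0 - s) * s

xs = np.linspace(-5.0, 5.0, 11)   # a few sample points (arbitrary choice)
h = 1e-5                          # finite-difference step (arbitrary choice)

# central difference approximation of d(sigma)/dx
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)
analytic = sigmoid_grad(xs)

max_abs_error = np.max(np.abs(numeric - analytic))
print(max_abs_error)
```

The two should agree up to the finite-difference error, which is tiny at this step size.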

### Compute Gradients for \\(f(w,x)\\)
Let \\(dot(w,x) = w_{0} x_{0} + w_{1} x_{1} + w_{2}\\), so that \\(f(w,x) = \sigma(dot(w,x))\\). We can now compute the gradients using the chain rule.
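
Written out, the chain rule multiplies the local gradient of the sigmoid by the local gradient of \\(dot\\) with respect to each input. Here \\(\sigma'\\) is shorthand (not used elsewhere in this notebook) for the derivative \\((1 - \sigma)\sigma\\) derived above, evaluated at \\(dot(w,x)\\):

$$
\frac{\partial f}{\partial w_i} = \sigma'(dot)\, x_i \quad (i = 0, 1), \hspace{0.2in}
\frac{\partial f}{\partial w_2} = \sigma'(dot), \hspace{0.2in}
\frac{\partial f}{\partial x_i} = \sigma'(dot)\, w_i \quad (i = 0, 1)
$$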

In [None]:
import numpy as np

w = [2, -3, -3]
x = [-1, -2]

dot = w[0]*x[0] + w[1]*x[1] + w[2]
sigma = 1.0 / (1 + np.exp(-dot))

dsigma = (1 - sigma) * sigma                         # local gradient of the sigmoid at dot
dfdx = [w[0] * dsigma, w[1] * dsigma]                # d(dot)/dx_i = w_i
dfdw = [x[0] * dsigma, x[1] * dsigma, 1.0 * dsigma]  # d(dot)/dw_i = x_i, d(dot)/dw_2 = 1

dfdx, dfdw

## Staged Computation

Suppose that we have a function of the form:

$$f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2}$$

To be clear, this function is completely useless and it's not clear why you would ever want to compute its gradient, except for the fact that it is a good example of backpropagation in practice.

If you were to differentiate this expression directly with respect to either x or y, you would end up with very large and complex expressions. However, doing so is completely unnecessary, because we don't need an explicit function written down that evaluates the gradient. We only have to know how to compute it.

In [None]:
import numpy as np

x = 3
y = -4

# **forward pass**

# We have structured the code in such a way that it contains multiple
# intermediate variables, each of which is a simple expression for which we
# already know the local gradient. By the end of the expression we have
# computed the forward pass.
sigy = 1.0 / (1 + np.exp(-y))  # sigmoid in numerator
num = x + sigy                 # numerator
sigx = 1.0 / (1 + np.exp(-x))  # sigmoid in denominator
xpy = x + y                    # plus
xpysqr = xpy**2
den = sigx + xpysqr            # denominator
invden = 1.0 / den             # inversion
f = num * invden

# **backward pass**

# We will go backwards, and for every variable along the way in the forward
# pass we will have the same variable, but one that begins with a **d**, which
# will hold the gradient of the output of the circuit with respect to that variable.

# Additionally, note that every single piece in our backprop will involve computing
# the local gradient of that expression, and chaining it with the gradient on that
# expression with a multiplication.

dnum = invden                        # f = num * invden
dinvden = num
dden = (-1.0 / (den**2)) * dinvden   # invden = 1.0 / den 
dsigx = (1) * dden                   # den = sigx + xpysqr
dxpysqr = (1) * dden                 
dxpy = (2 * xpy) * dxpysqr           # xpysqr = xpy**2
dx = (1) * dxpy
dy = (1) * dxpy
dx += ((1 - sigx) * sigx) * dsigx    # sigx = 1.0 / (1 + np.exp(-x))
dx += (1) * dnum                     # num = x + sigy
dsigy = (1) * dnum
dy += ((1 - sigy) * sigy) * dsigy    # sigy = 1.0 / (1 + np.exp(-y))

dx, dy
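
One way to gain confidence in a hand-written backward pass like the one above is to compare it against a numerical gradient. The sketch below is a minimal illustration, not part of the original derivation: the function names `f`, `staged_grad` and `numeric_grad` and the step size `h` are assumptions, and `staged_grad` is a condensed version of the staged backward pass.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x, y):
    # the function from this section, written as a single expression
    return (x + sigmoid(y)) / (sigmoid(x) + (x + y) ** 2)

def staged_grad(x, y):
    # forward pass (same intermediate variables as above)
    sigy = sigmoid(y)
    sigx = sigmoid(x)
    num = x + sigy
    xpy = x + y
    den = sigx + xpy ** 2
    # backward pass, condensed
    dnum = 1.0 / den                 # f = num / den
    dden = -num / den ** 2
    dxpy = 2 * xpy * dden            # den = sigx + xpy**2
    dx = dnum + (1 - sigx) * sigx * dden + dxpy
    dy = (1 - sigy) * sigy * dnum + dxpy
    return dx, dy

def numeric_grad(x, y, h=1e-6):
    # central differences in each argument
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy

analytic = staged_grad(3.0, -4.0)
numeric = numeric_grad(3.0, -4.0)
print(analytic, numeric)
```

If the backward pass is correct, the two pairs should match to several decimal places.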

### Gradients add up at forks

The forward expression involves the variables x and y multiple times, so when we perform backpropagation we must be careful to use **+=** instead of **=** to accumulate the gradient on these variables (otherwise we would overwrite it). This follows the multivariable chain rule in calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it will add.
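
To make the fork concrete, here is a small sketch using a made-up function \\(f(x,y) = xy + x\\) (chosen only for illustration), in which x feeds two branches. The analytic gradient is \\(\partial f / \partial x = y + 1\\), which is recovered only if the two contributions are accumulated.

```python
x, y = 3.0, -4.0

# forward pass: x is used twice, once in q and once directly in f
q = x * y
f = q + x

# backward pass
dq = 1.0            # f = q + x
dx = 1.0            # direct branch: f = q + x
dx += y * dq        # second branch: q = x * y (accumulate with +=)
dy = x * dq

# what happens if the second contribution overwrites the first
dx_overwritten = 1.0
dx_overwritten = y * dq

print(dx, dy, dx_overwritten)
```

With these values the accumulated `dx` equals `y + 1`, while the overwritten version silently drops the direct branch.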

## Patterns in backward flow

It is interesting to note that in many cases the backward-flowing gradient can be interpreted on an intuitive level. For example, the three most commonly used gates in neural networks (add, mul, max) all have very simple interpretations in terms of how they act during backpropagation.

* add gate: always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass.
* multiply gate: its local gradients are the input values (except switched), and each is multiplied by the gradient on its output during the chain rule. As a consequence, if one input is very small and the other very large, it assigns a relatively large gradient to the small input and a tiny gradient to the large input.
* max gate: routes the gradient. Unlike the add gate, it passes the gradient (unchanged) to exactly one of its inputs, the one that had the highest value during the forward pass.
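
The three patterns above can be sketched for two scalar inputs as follows. The input values and the upstream gradient `dout` are arbitrary numbers chosen for illustration, and sending the gradient to `x` on a tie in the max gate is an assumption of this sketch.

```python
x, y = 3.0, -4.0
dout = 2.0  # hypothetical gradient flowing in from above

# add gate: out = x + y, local gradients are both 1
dx_add = 1.0 * dout
dy_add = 1.0 * dout

# multiply gate: out = x * y, local gradients are the switched inputs
dx_mul = y * dout
dy_mul = x * dout

# max gate: out = max(x, y), gradient goes only to the larger input
dx_max = dout if x >= y else 0.0
dy_max = dout if y > x else 0.0

print(dx_add, dy_add, dx_mul, dy_mul, dx_max, dy_max)
```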


## Gradients for vectorized operations

The above sections were concerned with single variables, but all concepts extend in a straightforward manner to matrix and vector operations. However, one must pay closer attention to dimensions and transpose operations.

In [None]:
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T)               # .T gives the transpose of the matrix
dX = W.T.dot(dD)
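
A useful habit when deriving such expressions is to check dimensions: the gradient on a variable must have the same shape as the variable itself. The sketch below repeats the computation above with a seeded generator (the seed and the use of `np.random.default_rng` are choices made here, not in the original cell) and also compares the matrix products with the explicit index sums \\(dW_{ij} = \sum_k dD_{ik} X_{jk}\\) and \\(dX_{jk} = \sum_i W_{ij} dD_{ik}\\).

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded for reproducibility
W = rng.standard_normal((5, 10))
X = rng.standard_normal((10, 3))
D = W.dot(X)                        # shape (5, 3)
dD = rng.standard_normal(D.shape)   # stand-in for the upstream gradient

dW = dD.dot(X.T)                    # (5, 3) x (3, 10) -> (5, 10)
dX = W.T.dot(dD)                    # (10, 5) x (5, 3) -> (10, 3)

# each gradient must match the shape of the variable it belongs to
assert dW.shape == W.shape
assert dX.shape == X.shape

# the same gradients written as explicit sums over the shared index
dW_sum = np.einsum('ik,jk->ij', dD, X)
dX_sum = np.einsum('ij,ik->jk', W, dD)

print(np.allclose(dW, dW_sum), np.allclose(dX, dX_sum))
```

Because `dD` is 5x3 and `X` is 10x3, `dD.dot(X.T)` is the only product of the two that yields the 5x10 shape of `W`, which is often enough to recover the right expression.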