In [1]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

# [Perceptron](http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons)
A perceptron takes several binary inputs, x1,x2,…, and produces a single binary output:
![image.png](http://neuralnetworksanddeeplearning.com/images/tikz0.png)
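
As a minimal sketch, a perceptron can be written as a thresholded weighted sum. The rule used here (output 1 when w⋅x+b > 0, otherwise 0) is the one that appears in the exercises further down these notes. The function name `perceptron` and the example weights are assumptions chosen for illustration; with these values the perceptron happens to behave like a NAND gate.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum w.x + b is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

# Example weights and bias (assumed for illustration only).
w = np.array([-2, -2])
b = 3

# Evaluate the perceptron on all four binary input pairs.
outputs = [perceptron(np.array([x1, x2]), w, b) for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)  # prints [1, 1, 1, 0]
```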


# [Sigmoid neurons](http://neuralnetworksanddeeplearning.com/chap1.html#sigmoid_neurons)
Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.
![image.png](http://neuralnetworksanddeeplearning.com/images/tikz9.png)


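The output of a sigmoid neuron is σ(w⋅x+b), where σ(z)≡1/(1+e^(−z)) is the sigmoid function. Written out explicitly in terms of the inputs, weights and bias:
\begin{eqnarray} 
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\end{eqnarray}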

To understand the similarity to the perceptron model, suppose z≡w⋅x+b is a large positive number. Then e^(−z)≈0 and so σ(z)≈1. In other words, when z=w⋅x+b is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that z=w⋅x+b is very negative. Then e^(−z)→∞, and σ(z)≈0. So when z=w⋅x+b is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when w⋅x+b is of modest size that there's much deviation from the perceptron model.
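
The saturation described above is easy to check numerically. The sketch below is only an illustration; the helper name `sigmoid` and the sample values of z are assumptions, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z gives an output near 0, large positive z an output near 1,
# and values of modest size fall in between.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  sigma(z) = {sigmoid(z):.5f}")
```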

If it's the shape of σ which really matters, and not its exact form, then why use the particular form used for σ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is f(w⋅x+b) for some other activation function f(⋅).
## [Exercises](http://neuralnetworksanddeeplearning.com/chap1.html#exercises_191892)
* **Sigmoid neurons simulating perceptrons, part I**
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, c>0. Show that the behaviour of the network doesn't change.

\begin{eqnarray}
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } c\cdot w\cdot x + c\cdot b \leq 0 \\
      1 & \mbox{if } c\cdot w\cdot x + c\cdot b > 0
    \end{array}
  \right.
\end{eqnarray}
\begin{eqnarray}
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } c ( w\cdot x + b) \leq 0 \\
      1 & \mbox{if } c ( w\cdot x + b) > 0
    \end{array}
  \right.
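\end{eqnarray}
Because c>0, dividing both conditions by c does not change the direction of the inequalities, so the output is the same as for the original perceptron:
\begin{eqnarray}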
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\end{eqnarray}
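
The argument can also be spot-checked numerically. This is a rough sketch under assumed conditions: random weights, biases and binary inputs, with a random positive constant c. It tests a single perceptron rather than a whole network, which is enough because every perceptron in the network is scaled in the same way.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum w.x + b is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

rng = np.random.default_rng(0)
same = True
for _ in range(1000):
    w = rng.normal(size=3)          # random weights
    b = rng.normal()                # random bias
    x = rng.integers(0, 2, size=3)  # random binary input
    c = rng.uniform(0.1, 100.0)     # random positive scale factor
    # Scaling weights and bias by c > 0 should leave the output unchanged.
    same = same and perceptron(x, w, b) == perceptron(x, c * w, c * b)
print(same)
```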

* **Sigmoid neurons simulating perceptrons, part II**
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that w⋅x+b≠0 for the input x to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant c>0. Show that in the limit as c→∞ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when w⋅x+b=0 for one of the perceptrons?

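After scaling the weights and bias by c, the output of each sigmoid neuron is
\begin{eqnarray} 
  \frac{1}{1+\exp(-c(\sum_j w_j x_j+b))}.
\end{eqnarray}
In the limit c→∞, if
\begin{array}{ll} 
  w\cdot x + b < 0 \\
\end{array}
then
\begin{eqnarray} 
  \frac{1}{1+\exp(-c(\sum_j w_j x_j+b))} \rightarrow 0.
\end{eqnarray}
and if
\begin{array}{ll} 
  w\cdot x + b > 0 \\
\end{array}
then
\begin{eqnarray} 
  \frac{1}{1+\exp(-c(\sum_j w_j x_j+b))} \rightarrow 1.
\end{eqnarray}
Both cases match the perceptron. When w⋅x+b=0 the exponent is 0 for every c, so the sigmoid neuron outputs 1/2 no matter how large c is, while the perceptron outputs 0.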
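
The limit can be illustrated numerically. In this sketch the three values of w⋅x+b are hypothetical and chosen to cover the negative, zero and positive cases; the helper name `sigmoid` is also an assumption.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values of w.x + b: negative, exactly zero, positive.
z_values = np.array([-0.5, 0.0, 0.5])

# As c grows, the outputs approach 0 and 1, except at z = 0 where they stay at 0.5.
for c in (1, 10, 100):
    print(c, sigmoid(c * z_values))

limit = sigmoid(100 * z_values)
```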

# [The architecture of neural networks](http://neuralnetworksanddeeplearning.com/chap1.html#the_architecture_of_neural_networks)
![image.png](http://neuralnetworksanddeeplearning.com/images/tikz11.png)
* The leftmost layer in this network is called the **input layer**, and the neurons within the layer are called **input neurons**. 
* The rightmost or **output layer** contains the **output neurons**, or, as in this case, a single output neuron. 
* The middle layer is called a **hidden layer**, since the neurons in this layer are neither inputs nor outputs.

Such multiple layer networks are sometimes called multilayer perceptrons or **MLPs**, despite being made up of sigmoid neurons, not perceptrons.

We've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called **feedforward neural networks**. This means there are no loops in the network - information is always fed forward, never fed back.
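
A feedforward pass can be sketched as a loop over layers, where each layer's output becomes the next layer's input. The layer sizes, the random weights, and the function names below are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Propagate the activation vector a through each layer in turn."""
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 1]  # hypothetical network: 3 inputs, 4 hidden neurons, 1 output

# One weight matrix and one bias vector per non-input layer.
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]

output = feedforward(np.array([1.0, 0.0, 1.0]), weights, biases)
print(output.shape)  # prints (1,)
```

There is no loop back to an earlier layer anywhere in `feedforward`, which is what makes the network feedforward rather than recurrent.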

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called **recurrent neural networks**. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent.

Recurrent neural networks are much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks.

# [A simple network to classify handwritten digits](http://neuralnetworksanddeeplearning.com/chap1.html#a_simple_network_to_classify_handwritten_digits)

## [Exercise](http://neuralnetworksanddeeplearning.com/chap1.html#exercise_513527)
* There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

![image.png](http://neuralnetworksanddeeplearning.com/images/tikz13.png)

\begin{eqnarray} 
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}
\end{eqnarray}

|   |x_0001|x_0010|x_0100|x_1000|
|---|---|---|---|---|
|b  |0  |0  |0  |0  |
|w_0|-1000|-1000|-1000|-1000|
|w_1|**1000**|-1000|-1000|-1000|
|w_2|-1000|**1000**|-1000|-1000|
|w_3|**1000**|**1000**|-1000|-1000|
|w_4|-1000|-1000|**1000**|-1000|
|w_5|**1000**|-1000|**1000**|-1000|
|w_6|-1000|**1000**|**1000**|-1000|
|w_7|**1000**|**1000**|**1000**|-1000|
|w_8|-1000|-1000|-1000|**1000**|
|w_9|**1000**|-1000|-1000|**1000**|
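
One way to sanity-check a set of weights like this is to build them from the binary representation of each digit and run all ten cases. The sketch below assumes a weight of +1000 where the digit has that bit set and −1000 otherwise, with zero biases. The incorrect activations are set to 0.009, an assumed value just under the 0.01 bound, and the input activation to 0.99.

```python
import numpy as np

def sigmoid(z):
    # Clip the argument so the very large weighted sums do not overflow exp().
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# weights[d, k]: weight from old output neuron d to the new neuron for bit k
# (k = 0 is the least significant bit).
weights = np.array([[1000 if (d >> k) & 1 else -1000 for k in range(4)]
                    for d in range(10)])
bias = np.zeros(4)

def to_bits(activations):
    """Round the new output layer's activations to 0/1 bits."""
    return np.round(sigmoid(activations @ weights + bias)).astype(int)

results = []
for digit in range(10):
    a = np.full(10, 0.009)  # incorrect outputs: activation below 0.01
    a[digit] = 0.99         # correct output: activation at least 0.99
    results.append(to_bits(a).tolist())
    print(digit, results[-1])
```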


# [Learning with gradient descent](http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent)
We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. http://yann.lecun.com/exdb/mnist/

We'll use the notation x to denote a training input. It'll be convenient to regard each training input x as a 28×28=784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by y=y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x)=(0,0,0,0,0,0,1,0,0,0)^T is the desired output from the network.
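
As a small illustration of this encoding, the sketch below flattens a placeholder 28×28 image into a 784-dimensional vector and builds the desired output for the digit 6. The helper name `vectorized_result` and the all-zero placeholder image are assumptions; no real MNIST data is loaded here.

```python
import numpy as np

def vectorized_result(digit):
    """Return a 10-dimensional vector with 1.0 at index `digit` and 0.0 elsewhere."""
    y = np.zeros(10)
    y[digit] = 1.0
    return y

image = np.zeros((28, 28))  # placeholder for a 28x28 greyscale image
x = image.reshape(784)      # flatten to a 784-dimensional input vector
y = vectorized_result(6)    # desired output for an image of a 6
print(x.shape, y)
```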

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we're achieving this goal we define a **cost function** (sometimes referred to as a **loss** or **objective function**):

\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2
\end{eqnarray}

Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x.
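
The quadratic cost can be computed directly from this definition. In the sketch below, each row of `outputs` is the network output a for one training input and the matching row of `targets` is y(x). The function name and the tiny two-example data set are assumptions for illustration.

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(targets)
    return np.sum((targets - outputs) ** 2) / (2 * n)

# Two made-up training examples with 2-dimensional outputs.
targets = np.array([[0.0, 1.0], [1.0, 0.0]])
outputs = np.array([[0.0, 1.0], [0.5, 0.5]])

cost = quadratic_cost(outputs, targets)
print(cost)  # prints 0.125
```

The first example is matched exactly and contributes nothing. The second contributes 0.5² + 0.5² = 0.5, which divided by 2n = 4 gives 0.125.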

The aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.

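To see how it works, treat C as a function of two variables, v1 and v2. Small changes Δv1 and Δv2 change C by approximately:
\begin{eqnarray} 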
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2
\end{eqnarray}

\begin{eqnarray} 
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, 
  \frac{\partial C}{\partial v_2} \right)^T
\end{eqnarray}

\begin{eqnarray} 
  \Delta C \approx \nabla C \cdot \Delta v
\end{eqnarray}

Suppose we choose:
\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\end{eqnarray}

where η is a small, positive parameter (known as the learning rate).
\begin{eqnarray}
  \Delta C \approx -\eta
  \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2
\end{eqnarray}

Because
\begin{eqnarray}
\| \nabla C
\|^2 \geq 0
\end{eqnarray}

this guarantees that 
\begin{eqnarray}
\Delta C \leq 0
\end{eqnarray}

\begin{eqnarray}
  v \rightarrow v' = v -\eta \nabla C.
\tag{11}\end{eqnarray}

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient ∇C, and then to move in the opposite direction, "falling down" the slope of the valley.
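
The update rule v → v − η∇C can be sketched for a function of two variables. The cost C(v) = v1² + 2·v2², the starting point, and the learning rate are all assumptions chosen so that the gradient is easy to write by hand.

```python
import numpy as np

def cost(v):
    """Example cost C(v) = v1^2 + 2*v2^2, with its minimum at the origin."""
    return v[0] ** 2 + 2 * v[1] ** 2

def grad(v):
    """Gradient of the example cost, computed analytically."""
    return np.array([2 * v[0], 4 * v[1]])

eta = 0.1                    # learning rate
v = np.array([3.0, -2.0])    # arbitrary starting point
costs = [cost(v)]
for _ in range(50):
    v = v - eta * grad(v)    # step in the direction opposite to the gradient
    costs.append(cost(v))

print(v, costs[-1])
```

With this small learning rate the recorded costs never increase, in line with ΔC ≈ −η‖∇C‖² ≤ 0.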

## [Exercises](http://neuralnetworksanddeeplearning.com/chap1.html#exercises_647181)
* I explained gradient descent when C is a function of two variables, and when it's a function of more than two variables. What happens when C is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?
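
In one dimension the gradient reduces to the ordinary derivative C'(x), so the update becomes x → x − ηC'(x). Geometrically, the slope of the tangent line decides the direction of the step: a negative slope moves x to the right, a positive slope moves it to the left, and the step shrinks as the tangent flattens near the minimum. The sketch below assumes the parabola 3x² + 2x + 1, a starting point of x = −5, and a learning rate of 0.1.

```python
def C(x):
    """Example one-variable cost: the parabola 3x^2 + 2x + 1."""
    return 3 * x**2 + 2 * x + 1

def dC(x):
    """Derivative of C, i.e. the slope of the tangent line at x."""
    return 6 * x + 2

eta = 0.1    # learning rate
x = -5.0     # starting point
path = [x]
for _ in range(30):
    x = x - eta * dC(x)   # move against the sign of the slope
    path.append(x)

print(x)  # approaches the minimum at x = -1/3
```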

In [24]:
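class Parabool:
    """Parabola y = c_2*x**2 + c_1*x + c_0 with a tangent line at one point."""

    def __init__(self, c_0, c_1, c_2):
        self.c_0 = c_0
        self.c_1 = c_1
        self.c_2 = c_2

    def set_x(self, X):
        self.X = X

    def evaluate(self):
        self.Y = self.c_2*self.X**2 + self.c_1*self.X + self.c_0

    def evaluate_derivative(self, x):
        # Slope of the parabola at the point x.
        self.C = 2*self.c_2*x + self.c_1
        # Tangent line through (x, y(x)), evaluated over the whole range X.
        y = self.c_2*x**2 + self.c_1*x + self.c_0
        self.Y_dot = self.C*(self.X - x) + y

    def plot_parabool(self):
        plt.plot(self.X, self.Y)

    def plot_tangent(self):
        plt.plot(self.X, self.Y_dot)


X = np.arange(-10, 10)

parabool = Parabool(1, 2, 3)

parabool.set_x(X)
parabool.evaluate()
parabool.evaluate_derivative(-5)  # tangent at x = -5
parabool.plot_parabool()
parabool.plot_tangent()

plt.show()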

# a = 1
# b = 2
# c = 0
# X = np.arange(-10, 10)
# Y = a*X*X + b*X + c 


# x_1 = -5
# y_1 = a*x_1*x_1 + b*x_1 + c

# c_1 = 2*a*x_1 + b
# b_1 = y_1 - c_1*x_1

# Y_1 = c_1*X + b_1


# mu = .1

# x_new = x_1 - mu * c_1

# plt.plot(X, Y, X, Y_1, linewidth=4, label='bla')
# plt.show()


