In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Neural Network ##

A neural network is a complex, non-linear function with a large number of free parameters that can be tuned to make the network act as a "universal function interpolator".  It takes a vector of inputs, $\vec{x}$, and produces a vector of outputs, $\vec{y}$.

In the simplest form of network there is an input layer, an output layer and one "hidden" layer.  Each layer is an array of (artificial) "neurons", which are simply functions that take several inputs and compute a non-linear function of them to produce a single output.  In the prototypical example a neuron would have weights $w_i$ and compute
$$
  \mathrm{Output} = \tanh\left( \sum_{i=1}^N w_i \mathrm{Input}_i + w_0 \right)
$$
where $\tanh$ is the non-linear "activation function" and $w_0$ is a "bias" which can be adjusted along with the "weights" $w_i$.  The outputs of these neurons then feed into other neurons and so on. (In practice people often use the closely related "sigmoid" function, or some other non-linear function such as the rectified linear unit, rather than a tanh, but a tanh works fine for the purposes of this notebook.)
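
As a minimal sketch of the formula above, here is a single neuron evaluated with NumPy.  The function name `neuron` and the numerical values are made up purely for illustration.  The second half shows a common trick, also used by the class further down: appending a constant input of 1 lets the bias $w_0$ be treated as just another weight.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Output = tanh( sum_i w_i * Input_i + w_0 )."""
    return np.tanh(np.dot(weights, inputs) + bias)

# Illustrative values only.
inputs  = np.array([0.5, -1.0, 2.0])
weights = np.array([0.1,  0.4, -0.2])
bias    = 0.3
out = neuron(inputs, weights, bias)

# Equivalent form: fold the bias into the weights by adding a constant input.
inputs_aug  = np.append(inputs, 1.0)
weights_aug = np.append(weights, bias)
out_aug = np.tanh(np.dot(weights_aug, inputs_aug))

print(out, out_aug)
```

Because the output always lies in $(-1,1)$, any targets outside that range have to be rescaled before they can be compared with the output of such a neuron.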

The network is "trained" by taking a set of known inputs and outputs and minimizing a loss function (or maximizing a score) which is typically something like
$$
  \mathrm{Loss} = \sum \left[ \vec{y}_a - \vec{O}(\vec{x}_a)\right]^2
$$
where $\vec{O}(\vec{x})$ is the output of the neural network and the sum runs over the training examples $a$.  The standard way to find the weights ($w_i$ for each neuron) which minimize the loss function is to use gradient descent. Computing the gradient of the loss function is just a long exercise in the chain rule, which reduces to a set of matrix multiplications.  At each step, rather than apply a full gradient step, one usually moves only part way along the gradient.  The scaling factor is known as the "learning rate".  Ideally one would use something like Newton's method, so that the scaling is estimated from the Hessian matrix, but this is typically not done because of the cost of computing the Hessian.  Instead various heuristics are often employed (going by names like Nesterov acceleration, AdaGrad, RMSProp, Adam, ...).
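
To make the "chain rule becomes matrix multiplications" statement concrete, the sketch below works out the gradient for a tiny network with one tanh hidden layer and a tanh output, using the loss $\frac{1}{2}\sum(y-O)^2$ for a single training example, and checks it against a finite-difference estimate.  The layer sizes, the random values and the helper names (`loss`, `numerical_grad`) are assumptions chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = np.append(rng.uniform(size=2), 1.0)     # two inputs plus a constant bias input
y  = rng.uniform(size=1)                     # a target value in [0,1)
wi = rng.uniform(-0.1, 0.1, size=(3, 4))     # input  -> hidden weights
wo = rng.uniform(-1.0, 1.0, size=(4, 1))     # hidden -> output weights

def loss(wi, wo):
    """0.5 * squared error for the single example (x, y)."""
    ah = np.tanh(x @ wi)
    ao = np.tanh(ah @ wo)
    return 0.5 * np.sum((y - ao) ** 2)

# Analytic gradient from the chain rule, using d tanh(u)/du = 1 - tanh(u)^2.
ah = np.tanh(x @ wi)
ao = np.tanh(ah @ wo)
output_deltas = (1 - ao**2) * (y - ao)
hidden_deltas = (1 - ah**2) * (wo @ output_deltas)
grad_wo = -np.outer(ah, output_deltas)       # dLoss/dwo
grad_wi = -np.outer(x, hidden_deltas)        # dLoss/dwi

def numerical_grad(f, w, h=1e-6):
    """Central finite-difference estimate of df/dw, one element at a time."""
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        wp, wm = w.copy(), w.copy()
        wp[idx] += h
        wm[idx] -= h
        g[idx] = (f(wp) - f(wm)) / (2 * h)
    return g

num_wo = numerical_grad(lambda w: loss(wi, w), wo)
num_wi = numerical_grad(lambda w: loss(w, wo), wi)
print(np.max(np.abs(grad_wo - num_wo)), np.max(np.abs(grad_wi - num_wi)))

# One plain gradient descent step with learning rate eps (no momentum).
eps = 0.5
wo_new = wo - eps * grad_wo
wi_new = wi - eps * grad_wi
```

Since the gradient carries a minus sign, the descent step `wo - eps*grad_wo` is the same as adding `eps*np.outer(ah, output_deltas)` to the weights.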

It's probably easiest to just look at the code below to see how it works:

In [2]:
class NeuralNetwork:
    """
    A Python class to implement a very basic neural network with one
    hidden layer.  The network uses a tanh activation and is trained
    using back-propagation.
    Example usage for an XOR function:
      nn = NeuralNetwork(2,2,1)
      tset = [ [[0,0],[0]] , [[0,1],[1]], [[1,0],[1]], [[1,1],[0]] ]
      nn.train(tset)
      nn.test(tset)
    """
    __author__ = "Martin White"
    __version__ = "1.0"
    __email__  = "mwhite@berkeley.edu"

    def interp(self,inputs):
        """Evaluate the neural network at the inputs.
        The inputs are provided unscaled, they are scaled internally."""
        if len(inputs)!= self.ni-1:
            raise ValueError("Inputs wrong length in interp.")
        self.ai[:-1] = (np.array(inputs)-self.xmin)/(self.xmax-self.xmin)
        self.ah = np.tanh( np.dot(self.ai,self.wi) )
        self.ao = np.tanh( np.dot(self.ah,self.wo) )
        return((self.zmax-self.zmin)*self.ao[:]+self.zmin)

    def back_propagate(self,targets,eps,ups):
        """Does a single backwards propagation step."""
        if len(targets)!=self.no:
            raise ValueError("Wrong number of target values.")
        val = (targets-self.zmin)/(self.zmax-self.zmin)
        output_deltas = (1-self.ao**2)*(val-self.ao)
        hidden_deltas = (1-self.ah**2)*np.dot(self.wo,output_deltas)
        self.wo += eps*np.outer(self.ah,output_deltas)+ups*self.co
        self.co  = np.outer(self.ah,output_deltas)
        self.wi += eps*np.outer(self.ai,hidden_deltas)+ups*self.ci
        self.ci  = np.outer(self.ai,hidden_deltas)
        error    = 0.5*np.sum( (val-self.ao)**2 )
        return(error)

    def train(self,tset,iter=1000,eps=0.5,ups=0.1):
        """Train the network using backward propagation.
        The training set is a list, each element is itself a list
        the first element of which is the input vector and the second
        is the output vector.
        The first step is to work out the ranges used to scale everything
        to lie in [0,1].  The scaling itself is applied in interp and
        back_propagate.
        eps is the "learning rate" and ups the "momentum factor"."""
        print("Training neural network ...")
        self.xmin = np.zeros(len(tset[0][0])) + 1e30
        self.xmax = np.zeros(len(tset[0][0])) - 1e30
        self.zmin = np.zeros(len(tset[0][1])) + 1e30
        self.zmax = np.zeros(len(tset[0][1])) - 1e30
        for t in tset:
            inp,val = t[0],t[1]
            for j in range(len(inp)):
                if self.xmin[j]>inp[j]:
                    self.xmin[j]=inp[j]
                if self.xmax[j]<inp[j]:
                    self.xmax[j]=inp[j]
            for k in range(len(val)):
                if self.zmin[k]>val[k]:
                    self.zmin[k]=val[k]
                if self.zmax[k]<val[k]:
                    self.zmax[k]=val[k]
        print("Scalings: ",self.xmin,self.xmax,self.zmin,self.zmax)
        for i in range(iter):
            error=0.0
            for t in tset:
                self.interp(t[0])
                error += self.back_propagate(t[1],eps,ups)
            if i%100==0:
                print("Iteration {:6d} error {:.5f}".format(i,error))

    def test(self,tset,verbose=True):
        """Test the network on a test set "tset"."""
        print("Testing neural network...")
        # Start below any possible error so that maxpnt is always set,
        # even if the network reproduces every target exactly.
        maxerr,maxpnt = -1.0,None
        for t in tset:
            out = self.interp(t[0])
            if verbose:
                print(t[0],"-> {:10.6f} c.f. {:10.6f}".format(out[0],t[1][0]))
            err = np.max( np.abs(out-t[1]) )
            if err>maxerr:
                maxerr,maxpnt = err,t[0]
        print("Maxerr=",maxerr,", at ",maxpnt," with value ",self.interp(maxpnt))

    def __init__(self,ni,nh,no):
        """Initialize the class (but do not yet train it).
        The arguments are the number of input, hidden and output nodes
        (since we have only a single "hidden" layer).
        """
        # Set up the number of nodes, include a "bias" node in "ni".
        self.ni,self.nh,self.no = ni+1,nh,no
        # Initialize the activations to unity and create random weights.
        self.ai= np.zeros(self.ni) + 1.0
        self.ah= np.zeros(self.nh) + 1.0
        self.ao= np.zeros(self.no) + 1.0
        self.wi= np.random.uniform(low=-0.1,high=0.1,size=self.ni*self.nh)
        self.wi.shape = (self.ni,self.nh)
        self.wo= np.random.uniform(low=-1.0,high=1.0,size=self.nh*self.no)
        self.wo.shape = (self.nh,self.no)
        # Store some weights for the momenta.
        self.ci = np.zeros((self.ni,self.nh),dtype='float')
        self.co = np.zeros((self.nh,self.no),dtype='float')
        #

As an example let's consider a network with 2 inputs, a hidden layer of 2 neurons and a single output neuron.  This isn't going to be a particularly powerful interpolator, but we'll see if we can teach it the XOR function...

In [3]:
nn = NeuralNetwork(2,2,1) # 2 inputs, 2 hidden neurons and 1 output neuron.
# The training set is the XOR function on two inputs:
tset = [ [[0,0],[0]] , [[0,1],[1]], [[1,0],[1]], [[1,1],[0]] ]
nn.train(tset)
# Now ideally we'd test this network on a different set of data than we used
# to train it, but for the XOR function this is kind of difficult since we've
# used all of the values in the training ... so we'll just see how well it does:
nn.test(tset)

Training neural network ...
Scalings:  [0. 0.] [1. 1.] [0.] [1.]
Iteration      0 error 0.95109
Iteration    100 error 0.13764
Iteration    200 error 0.00435
Iteration    300 error 0.00183
Iteration    400 error 0.00114
Iteration    500 error 0.00082
Iteration    600 error 0.00064
Iteration    700 error 0.00170
Iteration    800 error 0.00046
Iteration    900 error 0.00039
Testing neural network...
[0, 0] ->   0.001810 c.f.   0.000000
[0, 1] ->   0.981729 c.f.   1.000000
[1, 0] ->   0.981585 c.f.   1.000000
[1, 1] ->  -0.002021 c.f.   0.000000
Maxerr= 0.018415163367067278 , at  [1, 0]  with value  [0.98158484]


Since the weights are initialized randomly, and we take only a finite number of steps in our "training" to find the minimum of the loss function, we will get subtly different behavior if we just run it again ... the differences will give us a very rough indication of how well converged the network is.

In [4]:
nn = NeuralNetwork(2,2,1)
nn.train(tset)
nn.test(tset)

Training neural network ...
Scalings:  [0. 0.] [1. 1.] [0.] [1.]
Iteration      0 error 1.03353
Iteration    100 error 0.13263
Iteration    200 error 0.00439
Iteration    300 error 0.00184
Iteration    400 error 0.00114
Iteration    500 error 0.00083
Iteration    600 error 0.00064
Iteration    700 error 0.00053
Iteration    800 error 0.00049
Iteration    900 error 0.00039
Testing neural network...
[0, 0] ->   0.001912 c.f.   0.000000
[0, 1] ->   0.981705 c.f.   1.000000
[1, 0] ->   0.981563 c.f.   1.000000
[1, 1] ->  -0.002028 c.f.   0.000000
Maxerr= 0.01843725889012493 , at  [1, 0]  with value  [0.98156274]
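
The run-to-run differences seen above come from the random starting weights, which the class draws with `np.random.uniform`, i.e. from NumPy's global random number generator.  A minimal sketch of how seeding that generator makes the draws repeatable is given here.  The helper `draw_weights` and the seed values are hypothetical, chosen only to mimic the shape of the input-to-hidden weights of a 2-2-1 network.

```python
import numpy as np

def draw_weights(seed, ni=3, nh=2):
    """Seed the global generator, then draw weights as the class does."""
    np.random.seed(seed)
    w = np.random.uniform(low=-0.1, high=0.1, size=ni*nh)
    w.shape = (ni, nh)
    return w

w_a = draw_weights(42)
w_b = draw_weights(42)   # same seed: identical weights
w_c = draw_weights(43)   # different seed: different weights

print(np.array_equal(w_a, w_b), np.array_equal(w_a, w_c))
```

Calling `np.random.seed` with a fixed value just before constructing the network should therefore make a training run repeatable, while varying the seed gives a simple way to probe how well converged the result is.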


What if we wanted to fit a more interesting function?  Let's choose a Gaussian just because.  For such a fascinating function we're going to need many more hidden layer neurons.  In fact what we would probably do is have more than a single hidden layer.  But for now let's just increase the number of hidden layer neurons ...

In [8]:
tset = []
for x1 in np.linspace(-1,1,15):
    for x2 in np.linspace(-1,1,15):
        ff = np.exp( -0.5*(x1**2+x2**2) )
        tset.append( [ [x1,x2], [ff] ] )
#
nn = NeuralNetwork(2,50,1)
# There are automated ways of setting "eps" and "ups", but for
# now let's just set them to small numbers and up the number
# of iterations ...
nn.train(tset,iter=8000,eps=0.003,ups=0.003)
#

Training neural network ...
Scalings:  [-1. -1.] [1. 1.] [0.36787944] [1.]
Iteration      0 error 4.89470
Iteration    100 error 6.06672
Iteration    200 error 6.10429
Iteration    300 error 6.06479
Iteration    400 error 5.42991
Iteration    500 error 3.32821
Iteration    600 error 2.49596
Iteration    700 error 2.46944
Iteration    800 error 2.47539
Iteration    900 error 2.48885
Iteration   1000 error 2.50366
Iteration   1100 error 2.51699
Iteration   1200 error 2.52645
Iteration   1300 error 2.52829
Iteration   1400 error 2.51489
Iteration   1500 error 2.46899
Iteration   1600 error 2.34883
Iteration   1700 error 2.05446
Iteration   1800 error 1.44130
Iteration   1900 error 0.79872
Iteration   2000 error 0.60264
Iteration   2100 error 0.58620
Iteration   2200 error 0.57843
Iteration   2300 error 0.57066
Iteration   2400 error 0.56600
Iteration   2500 error 0.56336
Iteration   2600 error 0.56153
Iteration   2700 error 0.55994
Iteration   2800 error 0.55832
Iteration   2900 error 0.5

In [9]:
# Let's pick a few random values to test:
tset = []
for x1 in sorted(np.random.uniform(low=-1.,high=1.0,size=3)):
    for x2 in sorted(np.random.uniform(low=-1.,high=1.0,size=3)):
        ff = np.exp( -0.5*(x1**2+x2**2) )
        tset.append( [ [x1,x2], [ff] ] )
#
nn.test(tset)

Testing neural network...
[-0.9251164712385394, -0.2653328025302264] ->   0.610451 c.f.   0.629316
[-0.9251164712385394, 0.2695312278399009] ->   0.644263 c.f.   0.628610
[-0.9251164712385394, 0.851178146918123] ->   0.460166 c.f.   0.453767
[-0.10613332212495363, -0.5029047831595017] ->   0.870571 c.f.   0.876263
[-0.10613332212495363, -0.057667308801595984] ->   0.916156 c.f.   0.992732
[-0.10613332212495363, 0.007333263214478691] ->   0.915140 c.f.   0.994357
[0.3730609074939666, -0.8362086775982078] ->   0.667955 c.f.   0.657566
[0.3730609074939666, 0.24493947429179808] ->   0.875094 c.f.   0.905213
[0.3730609074939666, 0.9241131217954823] ->   0.607272 c.f.   0.608609
Maxerr= 0.07921709648093689 , at  [-0.10613332212495363, 0.007333263214478691]  with value  [0.91513986]


So it doesn't do too badly, though it's not exactly high precision as we've coded it up.  Given more training data and a bigger network we could make it perform better.