# Discussion: neural networks and gradient descent

The high-level view for today:
- Neural networks are a learning model, i.e., a function parametrized in a certain way
- The parameters of the neural network are the weights between nodes, the $w_{ji}^{(\ell)}$
- Things like the number of layers and the nonlinearities used to compute the output of the layers (e.g., tanh or relu) are *hyperparameters*
- Gradient descent is an iterative mechanism to learn the $w_{ji}^{(\ell)}$ from data
- Backpropagation is an algorithm used to compute the gradients of the *error function* with respect to the weights
- *Mini-batch gradient descent* and *stochastic gradient descent* are variants that compute each update from only part of the training data, which makes individual updates to the weights cheaper


### Neural networks
- A neural network is a function $h : \mathcal{X} \to \mathcal{Y}$ that we can represent using an acyclic (for now!) directed graph
- Suppose the input space is $d$-dimensional. Then $\mathcal{X} = \mathbb{R}^d$
- The inputs, which we label $x^{(0)} \in \mathbb{R}^d$, have no incoming edges. At the zeroth layer, we have $d^{(0)} = d$ nodes
- The nodes with edges from the inputs are said to belong to the first layer
- The nodes with edges from the $(\ell -1)$-th layer are said to belong to the $\ell$-th layer
- The edge from the $i$-th node of the $(\ell -1)$-th layer to the $j$-th node of the $\ell$-th layer is annotated with a weight $w_{ji}^{(\ell)}$
- Suppose at the $\ell$-th layer we have $d^{(\ell)}$ nodes

- Here we have an example of a neural network with two inputs and two layers: one with three nodes, and the output layer with one node
<img src="nnSample.png" />


- The output $x_j^{(\ell)}$ of the $j$-th node at the $\ell$-th layer is given by

$x_j^{(\ell)} = g\left(s_j^{(\ell)}\right)$, where $s_j^{(\ell)} = \sum_{i = 1}^{d^{(\ell - 1)}} w_{ji}^{(\ell)} x_i^{(\ell - 1)}$,

where $g$ is a nonlinear function called the *activation function*, usually tanh or relu

- If we think of the $w_{ji}^{(\ell)}$ as the elements of a matrix $w^{(\ell)}$ of size $d^{(\ell)} \times d^{(\ell - 1)}$, we can write

$s^{(\ell)} = w^{(\ell)} x^{(\ell - 1)}$

- Observe that to compute the output of a node at layer $\ell$ we need to compute the outputs of all previous layers $1, \ldots, \ell - 1$. Thus evaluation of the nodes in the neural network propagates forward

- **How many weights do we have to train in a neural network with $6$ inputs and $3$ layers with $d^{(1)} = 3$, $d^{(2)} = 4$, and $d^{(3)} = 2$?**
- **What is the computational cost of evaluating a function implemented as a neural network with $L$ layers and $d^{(\ell)}$ nodes in layer $\ell$?**
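
The layer-by-layer evaluation described in this section can be sketched in a few lines of NumPy. This is only an illustration: the function name `forward`, the random weights, and the choice of applying the same activation $g$ at every layer (including the last one) are assumptions made here for brevity.

```python
import numpy as np

def forward(weights, x0, g=np.tanh):
    """Evaluate a feed-forward network layer by layer.

    weights: list [w1, ..., wL], where w_l has shape (d_l, d_{l-1}).
    x0:      input vector of shape (d_0,).
    Returns the list of layer outputs [x0, x1, ..., xL].
    Assumption: g is applied at every layer, including the output layer.
    """
    xs = [x0]
    for w in weights:
        s = w @ xs[-1]       # s^(l) = w^(l) x^(l-1)
        xs.append(g(s))      # x^(l) = g(s^(l))
    return xs

rng = np.random.default_rng(0)
dims = [2, 3, 1]  # two inputs, three hidden nodes, one output node
weights = [rng.standard_normal((dims[l], dims[l - 1])) for l in range(1, len(dims))]
xs = forward(weights, np.array([0.5, -1.0]))
print([x.shape for x in xs])  # [(2,), (3,), (1,)]
```

Note that each layer's output is needed before the next layer can be computed, which is the forward propagation order discussed above.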

### The error loss
- We need a function to quantify how well we are doing. This is the error loss
- The error loss $e$ is used to adjust the weights through gradient descent. We update the weights according to the rule

$w_{ji}^{(\ell)} \gets w_{ji}^{(\ell)} - \eta \frac{\partial e}{\partial w_{ji}^{(\ell)}}$,

where $\eta$ is a hyperparameter called the *learning rate* or *step size*. Note that this rule moves the weights against the gradient, i.e., in the direction that locally decreases the error
- We train the neural network using $N$ samples $(x_i, y_i)$, where $x_i \in \mathbb{R}^{d^{(0)}}$ and $y_i \in \mathbb{R}^{d^{(L)}}$
- Suppose $e_p$ allows us to evaluate the contribution to the total error loss from a single sample $(x_i, y_i)$
- We have a choice. We could update the weights using the error averaged over all training samples:

$e = \frac{1}{N} \sum_{i = 1}^N e_p \left(h(x_i), y_i\right)$

This is called *batch gradient descent*

A variant of this method is to choose random subsets of the samples and make an update:

$e = \frac{1}{|S|} \sum_{i \in S} e_p \left(h(x_i), y_i\right)$,

where $S \subseteq \{1, \ldots, N\}$. This is called *mini-batch gradient descent*. We draw a new $S$ at every step so that, over many steps, we make use of all our data

Another option is to just pick a random sample $(x_r, y_r)$, compute the gradient there, and update the weights. In that case, the error would be

$e = e_p \left(h(x_r), y_r\right)$

This goes by the name *stochastic gradient descent*

- **Based on what you understand now, what advantages and disadvantages do you imagine the three methods above have?**
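
To get a feel for the update rule $w \gets w - \eta \, \partial e / \partial w$ from the start of this section, here is a minimal sketch on a one-dimensional error $e(w) = (w - 3)^2$. The function `gradient_descent`, the error function, and the two learning rates are illustrative choices, not part of the notebook's code.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeatedly apply w <- w - eta * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy error e(w) = (w - 3)^2 with derivative e'(w) = 2 (w - 3); the minimum is at w = 3
grad = lambda w: 2.0 * (w - 3.0)

w_small = gradient_descent(grad, w0=0.0, eta=0.1, steps=100)  # approaches 3
w_large = gradient_descent(grad, w0=0.0, eta=1.1, steps=100)  # step too large: moves away from 3
print(w_small, w_large)
```

For this particular error, each step multiplies the distance to the minimum by $|1 - 2\eta|$, so the iteration converges only when $0 < \eta < 1$. This is one reason the learning rate needs to be chosen with care.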

### Updating the weights
- The algorithm used to compute $\frac{\partial e}{\partial w_{ji}^{(\ell)}}$ is backpropagation
- We define

$\delta_j^{(\ell)} = \frac{\partial e}{\partial s_j^{(\ell)}}$

Remember that $s^{(\ell)} = w^{(\ell)} x^{(\ell - 1)}$

- This quantity is directly related to the gradient we want. The relation is

$\frac{\partial e}{\partial w_{ji}^{(\ell)}} = \delta_j^{(\ell)} x_i^{(\ell - 1)}$

- We already have all the $x_i^{(\ell - 1)}$ as a result of forward propagation (evaluation of the neural network)
- Now, observe that we can directly compute $\delta^{(L)}$. To compute $\delta$ at other layers, we use the expression

$\delta_i^{(\ell)} = g'\left(s_i^{(\ell)}\right) \sum_{j = 1}^{d^{(\ell + 1)}} w_{ji}^{(\ell + 1)} \delta_j^{(\ell + 1)}$

Thus, to compute $\delta^{(\ell)}$, we need to compute $\delta^{(k)}$ for $\ell + 1 \le k \le L$
- In matrix notation, we have

$\delta^{(\ell)} = g'\left(s^{(\ell)}\right) \circ \left( w^{(\ell + 1)} \right)^{\top} \delta^{(\ell + 1)}$,

where $\circ$ stands for the Hadamard (i.e., element-wise) product
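
A common way to convince oneself that these expressions are right is to compare them against finite differences. The sketch below does this for a tiny network, assuming a tanh hidden layer, an identity output layer, the squared error $e = \frac{1}{2}\Vert x^{(2)} - y \Vert^2$, and a single sample; all variable names and sizes are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 2, 3, 1
w1 = rng.standard_normal((d1, d0))
w2 = rng.standard_normal((d2, d1))
x0 = rng.standard_normal(d0)
y = rng.standard_normal(d2)

def error(w1, w2):
    x1 = np.tanh(w1 @ x0)   # hidden layer with g = tanh
    x2 = w2 @ x1            # output layer with identity activation (assumption)
    return 0.5 * np.sum((x2 - y) ** 2)

# Forward propagation
s1 = w1 @ x0
x1 = np.tanh(s1)
s2 = w2 @ x1
x2 = s2

# Backpropagation
delta2 = x2 - y                                       # delta^(L): derivative of e w.r.t. s^(2)
delta1 = (1.0 - np.tanh(s1) ** 2) * (w2.T @ delta2)   # g'(s^(1)) o (w^(2))^T delta^(2)
grad_w2 = np.outer(delta2, x1)                        # de/dw_ji = delta_j * x_i
grad_w1 = np.outer(delta1, x0)

def numeric_grad(w, f, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        w_plus = w.copy();  w_plus[idx] += eps
        w_minus = w.copy(); w_minus[idx] -= eps
        g[idx] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

num_w1 = numeric_grad(w1, lambda w: error(w, w2))
num_w2 = numeric_grad(w2, lambda w: error(w1, w))
print(np.max(np.abs(grad_w1 - num_w1)), np.max(np.abs(grad_w2 - num_w2)))
```

Both printed differences should be tiny compared with the gradients themselves. A check like this is much slower than backpropagation (two network evaluations per weight), so it is typically used only for debugging.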

- The following class defines a neural network implementing the equations we just discussed. Observe that we are coding the neural network from scratch (i.e., we are not using ML libraries)


In [None]:
import numpy as np

class neuralNetwork:

    # nodesPerLayer lists the number of nodes in each layer, starting with the input layer.
    # Example: [2,3] gives a 1-layer NN with 2 input nodes and 3 nodes in the first layer.
    def __init__(self, nodesPerLayer, localFunc='relu'):
        L = len(nodesPerLayer) - 1
        self.nodesPerLayer = nodesPerLayer
        self.L = L
        self.x = {i:{} for i in range(0,L+1)}
        self.s = {i:{} for i in range(1,L+1)}
        self.W = {i:1.0/(nodesPerLayer[i-1])*np.random.randn(nodesPerLayer[i], nodesPerLayer[i-1]) for i in range(1,L+1)}
        self.W_grad = {i:np.zeros((nodesPerLayer[i], nodesPerLayer[i-1])) for i in range(1,L+1)}
        self.delta = {i:{} for i in range(1,L+1)}
        # Activation function g
        if localFunc == 'relu':
            # The following two lines are wrong! They just implement the identity function
            self.innerLayerActivation = lambda a: a
            self.innerLayerActivationDer = lambda a : 1.0
            # Add code to correctly compute relu and its derivative
            ### start relu ###

            ### end relu ###
        elif localFunc == 'tanh':
            self.innerLayerActivation = np.tanh
            self.innerLayerActivationDer = lambda a : 1.0 - np.tanh(a)**2
        else:
            assert False
        # activation function for output layers. Notice this is just the identity
        self.outLayerActivation = lambda a : a
        self.outLayerActivationDer = lambda a : 1.0
        # Pointwise error e_p
        self.errorFunc = lambda a, b: 0.5*(a - b)**2
        self.errorFuncDer = lambda a, b: (a - b)

    # Here we go from input to output in order to evaluate the function h
    def forwardPropagate(self, xSamples):
        (d,n) = xSamples.shape
        assert d == self.nodesPerLayer[0]
        # This is the zero-th layer. It consists of the inputs
        self.x[0] = xSamples
        L = self.L
        for k in range(1, L):
            # This is the k-th layer
            self.s[k] = self.W[k] @ self.x[k-1]
            self.x[k] = self.innerLayerActivation(self.s[k])
        # output layer activation
        self.s[L] = self.W[L] @ self.x[L-1]
        self.x[L] = self.outLayerActivation(self.s[L])

    # Here we go from output to input in order to compute gradients
    def backPropagate(self, ySamples):
        (d,n) = ySamples.shape
        # base step: set the deltas at the last layer
        self.delta[self.L]  = self.outLayerActivationDer(self.s[self.L]) * self.errorFuncDer(self.x[self.L], ySamples)
        self.W_grad[self.L] = 1.0/n * (self.delta[self.L] @ self.x[self.L-1].transpose())
        # now we work our way backwards
        for k in reversed(range(1,self.L)):
            self.delta[k] = self.innerLayerActivationDer(self.s[k]) * (self.W[k+1].transpose() @ self.delta[k+1])
            self.W_grad[k] = 1.0/n * (self.delta[k] @ self.x[k-1].transpose())

    # Given a set of n xSamples and ySamples, adjust the weights through backprop
    def adjustWeights(self, xSamples, ySamples, learningRate):
        self.forwardPropagate(xSamples)
        self.backPropagate(ySamples)
        for k in range(1,self.L+1):
            self.W[k] -= learningRate*self.W_grad[k]

    def evaluate(self, xSamples):
        self.forwardPropagate(xSamples)
        return self.x[self.L]

    def getError(self, xSamples, ySamples):
        predictions = self.evaluate(xSamples)
        return self.errorFunc(predictions, ySamples).mean()

    def getGradients(self):
        return self.W_grad


- **Looking at the code above, how do you think we can implement batch gradient descent vs. stochastic gradient descent vs. mini batch gradient descent? In other words, how should we call `adjustWeights`?**
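
Regarding the `relu` placeholder in the class above: as the comments there point out, the two provided lambdas implement the identity, which makes the whole network a linear function of its inputs. One possible way to fill in the placeholder is sketched below. The standalone names `relu` and `relu_der` are chosen here for illustration; inside the class they would be assigned to `self.innerLayerActivation` and `self.innerLayerActivationDer`. The value of the derivative at exactly $0$ is a convention (here $0$).

```python
import numpy as np

def relu(a):
    # relu(a) = max(a, 0), applied element-wise
    return np.maximum(a, 0.0)

def relu_der(a):
    # Derivative: 1 where a > 0, 0 where a < 0; we pick 0 at a == 0 by convention
    return (a > 0).astype(float)

a = np.array([-2.0, 0.0, 3.0])
print(relu(a))      # [0. 0. 3.]
print(relu_der(a))  # [0. 0. 1.]
```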


# Setup for example

- Throughout this discussion, we will learn a 1D function $f(x) = x^2 + \text{noise}$
- The following cell sets up our training data


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import animation, rc
from IPython.display import HTML
from copy import copy

seed = 100
np.random.seed(seed)
xSamples = np.reshape(np.linspace(-10, 10, 100), (1,100))
ySamples = xSamples ** 2 + 0.1*np.random.randn(xSamples.shape[0], xSamples.shape[1])

fig, ax = plt.subplots(1,1)
ax.set_title('Data we will approximate')
ax.set_xlabel(r'$x$'); ax.set_ylabel(r'$y$')
ax.plot(xSamples[0,:],ySamples[0,:], '.')


- We will train a neural network with 2 inputs, 4 hidden nodes in a single layer, and 1 output. One of the 2 inputs is fixed to the constant 1, so that its weights act as bias terms for the hidden nodes; the other input is $x$
- Our objective is to closely observe the neural network as it is trained
- The following cell sets up the plots that we will visualize
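
The role of the constant-1 input can be seen in a small standalone sketch: stacking a row of ones on top of the samples turns the first column of a weight matrix into a bias. The weight values and array sizes below are hypothetical and chosen only for illustration.

```python
import numpy as np

x = np.linspace(-10, 10, 5).reshape(1, 5)               # samples as columns, shape (1, n)
x_aug = np.concatenate([np.ones(x.shape), x], axis=0)   # add a row of ones, shape (2, n)

# Hypothetical first-layer weights for two hidden nodes
W = np.array([[0.5, 2.0],
              [-1.0, 0.25]])

s = W @ x_aug                              # s_j = W[j,0] * 1 + W[j,1] * x
s_explicit = W[:, [0]] + W[:, [1]] * x     # the same thing written as bias + slope * x
print(np.allclose(s, s_explicit))          # True
```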


In [None]:
# PLOTS
fig, ax = plt.subplots(3,2,figsize=(12,16));

# first plot: original function and approximation
ax[0,0].set_xlim(( -10, 10))
ax[0,0].set_ylim((0, 100))
origData, = ax[0,0].plot([], [], '.')
nnApprox, = ax[0,0].plot([], [], lw=2)
ax[0,0].legend(['Original data', 'Neural network'])
ax[0,0].set_xlabel(r'$x$')
ax[0,0].set_ylabel(r'$y$')
# second plot: error vs. iterations
ax[0,1].set_xlim(( -100, 10000))
ax[0,1].set_ylim((0.5, 3.5))
errPlot, = ax[0,1].plot([], [], lw=2)
ax[0,1].set_title('Approximation error')
ax[0,1].set_xlabel('Iterations')
fig.canvas.draw()
labels = [r"$10^{"+i.get_text()+"}$" for i in ax[0,1].get_yticklabels()]
ax[0,1].set_yticklabels(labels)
ax[0,1].set_ylabel('Training error')
# third plot: gradients in first layer (first argument)
ax[1,0].set_xlim(( -100, 10000))
ax[1,0].set_ylim((-20, 20))
lineL1G1 = [ax[1,0].plot([],[],lw=1)[0] for i in range(5)]
ax[1,0].set_title('Gradients in first layer')
ax[1,0].set_xlabel('Iterations')
ax[1,0].set_ylabel('Gradient values')
legend = [r'$\frac{\partial e}{\partial w_{'+str(i)+'1}^{(1)}}$' for i in range(1,5)]
legend += [r'$\Vert \frac{\partial e}{\partial w_{*1}^{(1)}} \Vert$']
ax[1,0].legend(legend, prop={'size': 16})
# fourth plot: gradients in first layer (second argument)
ax[1,1].set_xlim(( -100, 10000))
ax[1,1].set_ylim((-20, 20))
lineL1G2 = [ax[1,1].plot([],[],lw=1)[0] for i in range(5)]
ax[1,1].set_title('Gradients in first layer')
ax[1,1].set_xlabel('Iterations')
ax[1,1].set_ylabel('Gradient values')
legend = [r'$\frac{\partial e}{\partial w_{'+str(i)+'2}^{(1)}}$' for i in range(1,5)]
legend += [r'$\Vert \frac{\partial e}{\partial w_{*2}^{(1)}} \Vert$']
ax[1,1].legend(legend, prop={'size': 16},loc='upper right')
# fifth plot: gradients in second layer
ax[2,0].set_xlim(( -100, 10000))
ax[2,0].set_ylim((-20, 20))
lineL2G = [ax[2,0].plot([],[],lw=1)[0] for i in range(5)]
ax[2,0].set_title('Gradients in second layer')
ax[2,0].set_xlabel('Iterations')
ax[2,0].set_ylabel('Gradient values')
legend = [r'$\frac{\partial e}{\partial w_{1'+str(i)+'}^{(2)}}$' for i in range(1,5)]
legend += [r'$\Vert \frac{\partial e}{\partial w_{1*}^{(2)}} \Vert$']
ax[2,0].legend(legend, prop={'size': 16},loc='upper right')
# sixth plot: weights in second layer
ax[2,1].set_xlim(( -100, 10000))
ax[2,1].set_ylim((-1, 12))
lineL2W = [ax[2,1].plot([],[],lw=1)[0] for i in range(4)]
ax[2,1].set_title('Weights in second layer')
ax[2,1].set_xlabel('Iterations')
ax[2,1].set_ylabel('Weight values')
ax[2,1].legend([r'$w_{1'+str(i)+'}^{(2)}$' for i in range(1,5)],
               prop={'size': 10},loc="upper left")


def init():
    nnApprox.set_data([], [])
    origData.set_data([], [])
    errPlot.set_data([], [])
    for thisLine in lineL1G1: thisLine.set_data([], [])
    for thisLine in lineL1G2: thisLine.set_data([], [])
    for thisLine in lineL2G:   thisLine.set_data([], [])
    for thisLine in lineL2W:   thisLine.set_data([], [])
    return [nnApprox,origData,errPlot] + lineL1G1 + lineL1G2 + lineL2G + lineL2W

def animate(i):
    m = 1 if i <10 else 100
    for j in range(m):
        error.append(np.log10(nNwk.getError(newX, ySamples)))
        if descent == 'batch':
            # use the entire training set to update the weights
            nNwk.adjustWeights(newX, ySamples, eta)
        elif descent == 'sgd':
            # use one training sample to update the weights
            (d,n) = newX.shape
            sampleIndex = np.random.randint(n)
            nNwk.adjustWeights(newX[:,[sampleIndex]], ySamples[:,[sampleIndex]], eta)
        elif descent == 'minibatch':
            # use several training samples to update the weights
            (d,n) = newX.shape
            sampleIndex = np.random.choice(n, miniBatchSize)
            nNwk.adjustWeights(newX[:,sampleIndex], ySamples[:,sampleIndex], eta)
        else:
            assert False
        gradW1 = nNwk.getGradients()[1]
        gradL1.append(gradW1)
        gradW2 = nNwk.getGradients()[2]
        gradL2.append(gradW2)
        W2.append(copy(nNwk.W[2]))
    outputs = nNwk.evaluate(newX)
    #plt.plot(np.transpose(xSamples), np.transpose(outputs))

    x = xSamples[0,:]
    y = outputs[0,:]
    nnApprox.set_data(x, y)
    origData.set_data(xSamples, ySamples)
    errPlot.set_data([k for k in range(1,len(error)+1)], error)
    # first layer gradients
    grads = np.array(gradL1)
    for j, thisLine in enumerate(lineL1G1):
        if j < len(lineL1G1) - 1:
            yVals = grads[:,j,0]
        else:
            yVals = np.linalg.norm(grads[:,:,[0]],axis=1)[:,0]
        thisLine.set_data([k for k in range(1,len(yVals)+1)], yVals)

    for j, thisLine in enumerate(lineL1G2):
        if j < len(lineL1G2) - 1:
            yVals = grads[:,j,1]
        else:
            yVals = np.linalg.norm(grads[:,:,[1]],axis=1)[:,0]
        thisLine.set_data([k for k in range(1,len(yVals)+1)], yVals)

    # second layer gradients
    grads = np.array(gradL2)
    for j, thisLine in enumerate(lineL2G):
        if j < len(lineL2G) - 1:
            yVals = grads[:,0,j]
        else:
            yVals = np.linalg.norm(grads[:,[0],:],axis=2)[:,0]
        thisLine.set_data([k for k in range(1,len(yVals)+1)], yVals)

    # second layer weights
    weights = np.array(W2)
    for j, thisLine in enumerate(lineL2W):
        yVals = weights[:,0,j]
        thisLine.set_data([k for k in range(1,len(yVals)+1)], yVals)

    return [nnApprox,origData,errPlot] + lineL1G1 + lineL1G2 + lineL2G + lineL2W


rc('animation', html='html5')


# Batch gradient descent

- Now we will observe how a neural network learns a 1D function $f(x) = x^2 + \text{noise}$ using batch gradient descent


In [None]:
# build a neural network with 2 input nodes, one hidden layer of 4 nodes, and 1 output node
seed = 10
np.random.seed(seed)
nNwk = neuralNetwork([2,4,1])
newX = np.concatenate([np.ones(xSamples.shape), xSamples], axis=0)
error = []
gradL1 = []; gradL2 = []
W2 = []
descent = 'batch'
eta = 0.001
ax[1,0].set_ylim((-20, 20))
ax[1,1].set_ylim((-20, 20))
ax[2,0].set_ylim((-20, 20))

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=50, blit=True)

anim


In [None]:
fig.savefig('gradientDescDisc-GD-a.png')


- Considering the animations above, we can identify three regions in the plot of the training error. **What is the significance of those regions?**
- With batch gradient descent, the only source of randomness in training is the initialization of the weights of the neural network. **Change the random seed using the following cell, execute the code, and discuss what happens**


In [None]:
# Now using a different seed
newSeed = 10
### start seed ###

### end seed ###
np.random.seed(newSeed)
nNwk = neuralNetwork([2,4,1])
newX = np.concatenate([np.ones(xSamples.shape), xSamples], axis=0)
error = []
gradL1 = []; gradL2 = []
W2 = []
descent = 'batch'
eta = 0.001
ax[1,0].set_ylim((-20, 20))
ax[1,1].set_ylim((-20, 20))
ax[2,0].set_ylim((-20, 20))

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=50, blit=True)

anim


In [None]:
fig.savefig('gradientDescDisc-GD-b.png')


# Stochastic gradient descent

- We will learn the same function as before, but this time we will use stochastic gradient descent
- This means that each update of the weights uses the gradient computed from a single sample chosen at random
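
One way to see why single-sample updates can still make progress: the batch gradient is the average of the per-sample gradients, so a gradient computed at a uniformly chosen sample is a noisy estimate of it. The sketch below checks this identity for a simple linear model $h(x) = w^\top x$ with pointwise error $e_p = \frac{1}{2}(h(x) - y)^2$. The linear model, the random data, and the helper `grad_on` are assumptions made here to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 2
X = rng.standard_normal((d, N))   # columns are samples
y = rng.standard_normal(N)
w = rng.standard_normal(d)

def grad_on(indices):
    """Gradient of the error averaged over the samples in `indices`."""
    Xs, ys = X[:, indices], y[indices]
    residual = w @ Xs - ys                   # h(x_i) - y_i for each selected sample
    return (Xs * residual).mean(axis=1)      # average of (h(x_i) - y_i) * x_i

batch_grad = grad_on(np.arange(N))
single_grads = np.stack([grad_on(np.array([i])) for i in range(N)])

print(np.allclose(single_grads.mean(axis=0), batch_grad))  # True
print(single_grads.std(axis=0))  # spread of the single-sample gradients around their mean
```

The spread printed on the last line gives a sense of how noisy a single-sample gradient is compared with the batch gradient.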


In [None]:
seed = 10
np.random.seed(seed)
nNwk = neuralNetwork([2,4,1])
descent = 'sgd'
eta = 0.001
error = []
gradL1 = []; gradL2 = []
W2 = []
ax[1,0].set_ylim((-1000, 1000))
ax[1,1].set_ylim((-1000, 1000))
ax[2,0].set_ylim((-1000, 1000))

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=50, blit=True)

anim


In [None]:
fig.savefig('gradientDescDisc-SGD-a.png')


- **The following cell fits a neural network using the seed you chose for batch gradient descent. Generate the visualization and discuss what happens**
- **How does the behavior of stochastic gradient descent compare with that of batch gradient descent?**


In [None]:
# Reuse the seed chosen for the second batch gradient descent run (newSeed)
np.random.seed(newSeed)
nNwk = neuralNetwork([2,4,1])
descent = 'sgd'
eta = 0.001
error = []
gradL1 = []; gradL2 = []
W2 = []
ax[1,0].set_ylim((-1000, 1000))
ax[1,1].set_ylim((-1000, 1000))
ax[2,0].set_ylim((-1000, 1000))

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=50, blit=True)

anim


In [None]:
fig.savefig('gradientDescDisc-SGD-b.png')


# Mini-batch gradient descent

- Now we will use mini-batch gradient descent to learn the same data as before
- This means that each update of the weights uses the gradient averaged over several samples chosen at random
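
The cells in this notebook pick each mini-batch with `np.random.choice(n, miniBatchSize)`, which by default samples with replacement, so a mini-batch may contain repeated samples. A common alternative, sketched below, is to shuffle the indices once per pass over the data and cut the permutation into consecutive mini-batches, so every sample is used exactly once per pass. The helper name `minibatch_indices` is hypothetical.

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Split a random permutation of range(n) into consecutive mini-batches."""
    perm = rng.permutation(n)
    return [perm[start:start + batch_size] for start in range(0, n, batch_size)]

rng = np.random.default_rng(0)
batches = minibatch_indices(100, 20, rng)
print(len(batches), [len(b) for b in batches])  # 5 [20, 20, 20, 20, 20]
```

Each array in `batches` could then be used to select columns of the inputs and targets for one weight update.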


In [None]:
seed = 10
np.random.seed(seed)
nNwk = neuralNetwork([2,4,1])
descent = 'minibatch'
miniBatchSize = 20

eta = 0.001
error = []
gradL1 = []; gradL2 = []
W2 = []
ax[1,0].set_ylim((-50, 50))
ax[1,1].set_ylim((-500, 500))
ax[2,0].set_ylim((-500, 500))

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=50, blit=True)

anim


In [None]:
fig.savefig('gradientDescDisc-minibatch-a.png')


- Now you have seen the three descent strategies. **Would you be surprised if someone told you that mini-batch gradient descent is the most popular technique? Why?**
- **Finally, add more hidden nodes or layers to the neural network by redefining `nNwk` between the `moreLayers` markers in the code below, and see what happens**


In [None]:
# Set up the plots
fig, ax = plt.subplots(1,2,figsize=(12,4));

# first plot: original function and approximation
ax[0].set_xlim(( -10, 10))
ax[0].set_ylim((0, 100))
origData, = ax[0].plot([], [], '.')
nnApprox, = ax[0].plot([], [], lw=2)
ax[0].legend(['Original data', 'Neural network'])
ax[0].set_xlabel(r'$x$')
ax[0].set_ylabel(r'$y$')
# second plot: error vs. iterations
ax[1].set_xlim(( -100, 10000))
ax[1].set_ylim((-1.0, 3.5))
errPlot, = ax[1].plot([], [], lw=2)
ax[1].set_title('Approximation error')
ax[1].set_xlabel('Iterations')
fig.canvas.draw()
labels = [r"$10^{"+i.get_text()+"}$" for i in ax[1].get_yticklabels()]
ax[1].set_yticklabels(labels)
ax[1].set_ylabel('Training error')


def init():
    nnApprox.set_data([], [])
    origData.set_data([], [])
    errPlot.set_data([], [])
    return [nnApprox,origData,errPlot]

def animate(i):
    m = 1 if i <10 else 100
    for j in range(m):
        error.append(np.log10(nNwk.getError(newX, ySamples)))
        if descent == 'batch':
            # use the entire training set to update the weights
            nNwk.adjustWeights(newX, ySamples, eta)
        elif descent == 'sgd':
            # use one training sample to update the weights
            (d,n) = newX.shape
            sampleIndex = np.random.randint(n)
            nNwk.adjustWeights(newX[:,[sampleIndex]], ySamples[:,[sampleIndex]], eta)
        elif descent == 'minibatch':
            # use several training samples to update the weights
            (d,n) = newX.shape
            sampleIndex = np.random.choice(n, miniBatchSize)
            nNwk.adjustWeights(newX[:,sampleIndex], ySamples[:,sampleIndex], eta)
        else:
            assert False
        gradW1 = nNwk.getGradients()[1]
        gradL1.append(gradW1)
        gradW2 = nNwk.getGradients()[2]
        gradL2.append(gradW2)
        W2.append(copy(nNwk.W[2]))
    outputs = nNwk.evaluate(newX)
    #plt.plot(np.transpose(xSamples), np.transpose(outputs))

    x = xSamples[0,:]
    y = outputs[0,:]
    nnApprox.set_data(x, y)
    origData.set_data(xSamples, ySamples)
    errPlot.set_data([k for k in range(1,len(error)+1)], error)

    return [nnApprox,origData,errPlot]


In [None]:
# now carry out the approximation. Change the number of layers below
seed = 10
np.random.seed(seed)

nNwk = neuralNetwork([2,4,1])
### start moreLayers ###

### end moreLayers ###
descent = 'minibatch'
miniBatchSize = 20

eta = 0.001
error = []

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=50, blit=True)

anim


In [None]:
fig.savefig('gradientDescDisc-minibatch-moreLayers.png')
