<img src="img/vs265header.svg"/>


<h1 align="center">Lab 2 - Supervised Learning <font color="red"> [SOLUTIONS] </font> </h1> 

<h2 align="center">1. Linear Neuron with sigmoidal output nonlinearity </h2> 

Derive the modified learning rule for a linear neuron with sigmoidal output nonlinearity:

$$y= \sigma(u) = \frac{1}{1 + e^{-u}}$$
with $u = w^T x = \sum_i w_i x_i$

<font color="red">Solution: </font>The objective function for a McCulloch-Pitts neuron is given by:
$$ E = \frac{1}{2} \sum_k [T^{(k)} - \sigma(u^{(k)})]^2 $$

where $T$ is the teacher signal and $k$ indexes the samples in our data.

The partial derivative of $\sigma$ with respect to $w_i$ using the chain rule is:

\begin{eqnarray*}
\frac{\partial \sigma}{\partial w_i} &=& \frac{\partial \sigma}{\partial u} \frac{\partial u}{\partial w_i} \\
                                     &=& \frac{1}{1 + e^{-u}} \frac{e^{-u}}{1 + e^{-u}} x_i \\
                                     &=& \sigma(u) (1 - \sigma(u)) x_i
\end{eqnarray*}

The partial derivative of the Energy function with respect to $w_i$ is:
$$ \frac{\partial E}{\partial w_i} = - \sum_k (T^{(k)} - \sigma(u^{(k)})) \frac{\partial \sigma}{\partial w_i} $$

Taking the negative gradient with a learning rate $\eta$ we get our update rule: 
$$ \Delta w_i = - \eta \frac{\partial E}{\partial w_i} = \eta \sum_k (T^{(k)} - \sigma(u^{(k)})) \sigma(u^{(k)}) (1 - \sigma(u^{(k)})) x^{(k)}_i$$
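
As a sanity check on the derivation, the sketch below implements the batch update in vectorized NumPy and compares the analytic gradient with a central finite-difference estimate. The data is random and purely illustrative, and the names `energy`, `energy_grad`, and `delta_w` are assumptions made for this example; they are not part of the lab code.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def energy(w, X, T):
    """E = 1/2 * sum_k (T_k - sigma(w^T x_k))^2, with samples in the columns of X."""
    return 0.5 * np.sum((T - sigmoid(w @ X)) ** 2)

def energy_grad(w, X, T):
    """Analytic dE/dw from the derivation above."""
    y = sigmoid(w @ X)
    return -X @ ((T - y) * y * (1.0 - y))

def delta_w(w, X, T, eta):
    """Update rule: step along the negative gradient."""
    return -eta * energy_grad(w, X, T)

# Synthetic data (assumed): 3 inputs, 20 samples, binary teacher signal
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))
T = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

# Central finite-difference estimate of dE/dw
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (energy(w + step, X, T) - energy(w - step, X, T)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - energy_grad(w, X, T)))
energy_before = energy(w, X, T)
energy_after = energy(w + delta_w(w, X, T, eta=1e-3), X, T)
print(max_abs_diff, energy_before, energy_after)
```

If the derivation is right, the two gradients should agree to within numerical precision, and a small step along $\Delta w$ should lower $E$.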


<h2 align="center">2. Single layer network </h2> 

Train a single neuron to discriminate between the apples and oranges data in apples.npy and oranges.npy. Try this for both a linear neuron and one with a sigmoidal output nonlinearity. (Use $+1/-1$ as the category assignments in the linear case, and $1/0$ in the non-linear case.) Use the code below to visualize the convergence of the solution during learning. You must fill in the code for simulating the network itself and for learning the weights. Comment on the differences you observe between the sigmoid and linear cases.

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from utils.lab2_utils import HyperPlanePlotter
import pdb

In [2]:
# Load the Apples and Oranges data
apples  = np.load('data/apples.npy')
oranges = np.load('data/oranges.npy')

# initialize data array
data = np.hstack((apples,oranges))
dimensions, numSamples = data.shape

In [3]:
# initialize teachers
halfNumSamples = int(numSamples/2)
teacherLinear = np.ones(numSamples)
teacherLinear[halfNumSamples:] *= -1
teacherSigmoid = np.ones(numSamples)
teacherSigmoid[halfNumSamples:] *= 0

# number of trials - ## Modify these so your learning converges
numTrials = 200

# learning rates - ## Modify these so your learning converges by the end
etaLinear  = 1e-2
etaSigmoid = 0.2

# initialize plotter
plotter = HyperPlanePlotter(data, apples, oranges, numTrials)
plotEvery = numTrials // 10

In [4]:
def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoidDeriv(u):
    return sigmoid(u) * (1 - sigmoid(u))

def identity(u):
    return u

def identityDeriv(u):
    return 1

def get_parameters(name):
    if name == "Linear":
        return identity, identityDeriv, teacherLinear, etaLinear
    return sigmoid, sigmoidDeriv, teacherSigmoid, etaSigmoid

In [5]:
def optimizeSingle(name):
    func, funcDeriv, teacher, eta = get_parameters(name)
    
    # initialize weights and bias
    weights = np.random.randn(2,1)
    bias    = np.random.randn(1)
    
    # initialize plots
    plotter.setupPlotProb2(name, weights, bias)

    # loop over trials
    for t in range(numTrials):
        # initialize weight derivatives and error for this trial
        errorT = 0
        weightsDeriv = np.zeros((2,1))
        biasDeriv = 0
        
        # loop over training set
        for i in range(numSamples):
            # compute neuron output
            u = np.dot(weights.T, data[:,i]) + bias
            predict = func(u[0])

            # compute error
            error = (teacher[i] - predict)
            errorGrad = error * funcDeriv(u)

            # accumulate weight derivative
            weightsDeriv += errorGrad * data[:,i:i+1]

            # accumulate bias derivative
            biasDeriv += errorGrad

            # accumulate error for this trial
            errorT += 1/2 * error ** 2

        # update weights and bias
        weights  += eta * weightsDeriv
        bias += eta * biasDeriv

        # update display of separating hyperplane every plotEvery iterations
        if t % plotEvery == 0:
            plotter.updatePlotProb2(weights, bias)
        plotter.plotErrorProb2(name, t, errorT)

In [None]:
plotter.initPlotProb2()
optimizeSingle("Linear")
optimizeSingle("Sigmoid")


<font color="red">Solution: </font> Both the linear and sigmoid networks are able to discriminate the patterns. The error is reduced faster for the sigmoid network.
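
The single-neuron experiment can also be tried without the plotting utilities. The following is a minimal, vectorized sketch under stated assumptions: the two Gaussian blobs are synthetic stand-ins for the apples and oranges data, and `train`, `accuracy`, the starting weights of zero, and the learning rates are illustrative choices rather than the lab's values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # samples per class (assumed)

# Synthetic stand-ins for the two classes, NOT the lab data
class_a = rng.normal(loc=2.0, scale=0.5, size=(2, n))
class_b = rng.normal(loc=-2.0, scale=0.5, size=(2, n))
data = np.hstack((class_a, class_b))  # shape (2, 2n), samples in columns

teacher_linear = np.concatenate((np.ones(n), -np.ones(n)))   # +1 / -1
teacher_sigmoid = np.concatenate((np.ones(n), np.zeros(n)))  # 1 / 0

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_deriv(u):
    s = sigmoid(u)
    return s * (1.0 - s)

def train(teacher, func, func_deriv, eta, num_trials=200):
    """Batch gradient descent on E = 1/2 * sum_k (T_k - f(u_k))^2."""
    weights = np.zeros(2)
    bias = 0.0
    for _ in range(num_trials):
        u = weights @ data + bias
        error_grad = (teacher - func(u)) * func_deriv(u)
        weights += eta * (data @ error_grad)
        bias += eta * error_grad.sum()
    return weights, bias

def accuracy(weights, bias, func, threshold, positive):
    """Fraction of samples whose thresholded output matches the class label."""
    predicted_positive = func(weights @ data + bias) > threshold
    return np.mean(predicted_positive == positive)

is_class_a = np.arange(2 * n) < n

w_lin, b_lin = train(teacher_linear, lambda u: u, lambda u: np.ones_like(u), eta=1e-3)
w_sig, b_sig = train(teacher_sigmoid, sigmoid, sigmoid_deriv, eta=1e-2)

acc_linear = accuracy(w_lin, b_lin, lambda u: u, 0.0, is_class_a)
acc_sigmoid = accuracy(w_sig, b_sig, sigmoid, 0.5, is_class_a)
print(acc_linear, acc_sigmoid)
```

The linear neuron is thresholded at 0 and the sigmoid neuron at 0.5, matching the $+1/-1$ and $1/0$ teacher conventions.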

<h2 align="center">3. Multilayer network </h2> 

Augment the data from question 2 with the additional datasets apples2.npy and oranges2.npy. As you can see from plotting out the combined data, the problem of discriminating the apples from the oranges is no longer linearly separable, so we must use a multilayer network for this problem. Start by deriving the learning rules for a two layer network. Then, train a two-layer network (using backprop) to learn to discriminate between apples and oranges. Use the code below to get started. Experiment with adding a momentum term to see if it helps with convergence.

To make sure your solution works, we have provided you with a good initialization of the weights (goodInit=True). After you get this solution working you should experiment with random initializations (goodInit=False). In the description of your solution you should comment on the following:

a) From your learned solution, describe in words how the two layers work together to discriminate between apples and oranges. <br/>
b) The effect momentum has on the learning <br/>
c) The solutions learned when goodInit=False and why they happen <br/>

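<font color="red">Solution: </font> The forward pass of the two-layer network (omitting biases) is:
\begin{eqnarray*}
u_y &=& W x \\
y &=& \sigma (u_y) \\
u_z &=& V y \\
z &=& \sigma(u_z)
\end{eqnarray*}

Gradient descent on $E = \frac{1}{2}(T - z)^2$ with learning rate $\eta$ gives the learning rules for a single sample, where $\odot$ denotes the elementwise product:
\begin{eqnarray*}
\Delta V &=& \eta \, (T - z) \, \sigma'(u_z) \, y^T \\
\Delta W &=& \eta \, (T - z) \, \sigma'(u_z) \, \left[ V^T \odot \sigma'(u_y) \right] x^T
\end{eqnarray*}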
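
The learning rules can be checked numerically before they are used for training. The sketch below is one way to do that for a single sample, assuming no biases (as in the equations), a 1-D array for $V$, and random illustrative values. The helper names `forward`, `backprop_updates`, and `numeric_neg_grad` are made up for this example.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_deriv(u):
    s = sigmoid(u)
    return s * (1.0 - s)

def forward(W, V, x):
    u_y = W @ x
    y = sigmoid(u_y)
    u_z = V @ y
    z = sigmoid(u_z)
    return u_y, y, u_z, z

def energy(W, V, x, T):
    return 0.5 * (T - forward(W, V, x)[3]) ** 2

def backprop_updates(W, V, x, T):
    """Return (delta_V, delta_W) without the learning rate, i.e. -dE/dV and -dE/dW."""
    u_y, y, u_z, z = forward(W, V, x)
    delta_z = (T - z) * sigmoid_deriv(u_z)       # scalar output error term
    delta_V = delta_z * y                        # same shape as V
    delta_y = delta_z * V * sigmoid_deriv(u_y)   # error propagated back to the hidden units
    delta_W = np.outer(delta_y, x)               # same shape as W
    return delta_V, delta_W

# Random illustrative values: 2 inputs, 2 hidden units, 1 output
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2))
V = rng.normal(size=2)
x = rng.normal(size=2)
T = 1.0

def numeric_neg_grad(param, eps=1e-6):
    """Central-difference estimate of -dE/dparam; perturbs param in place and restores it."""
    g = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        e_plus = energy(W, V, x, T)
        param[idx] = old - eps
        e_minus = energy(W, V, x, T)
        param[idx] = old
        g[idx] = -(e_plus - e_minus) / (2 * eps)
    return g

delta_V, delta_W = backprop_updates(W, V, x, T)
err_V = np.max(np.abs(delta_V - numeric_neg_grad(V)))
err_W = np.max(np.abs(delta_W - numeric_neg_grad(W)))
print(err_V, err_W)
```

Both discrepancies should be tiny if the backpropagated terms are correct. A large value usually points to a wrong argument to $\sigma'$ or a transposed factor.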

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from utils.lab2_utils import HyperPlanePlotter
import pdb

In [None]:
# load data
apples = np.load('data/apples.npy')                                                                                                                                        
oranges = np.load('data/oranges.npy')                                                                                                                                      
apples2 = np.load('data/apples2.npy')                                                                                                                                      
oranges2 = np.load('data/oranges2.npy')

# initialize data array
apples = np.hstack((apples, apples2))
oranges = np.hstack((oranges, oranges2))
data = np.hstack((apples, oranges))
dimensions,numSamples = data.shape
halfNumSamples = int(numSamples/2)

In [None]:
# initialize teacher
teacher = np.ones(numSamples)
teacher[halfNumSamples:] *= 0

# learning rate
eta=1e-1

# number of trials - you may want to make this smaller or larger
numTrials = 4000

# plotting
plotter = HyperPlanePlotter(data, apples, oranges, numTrials, halfNumSamples)
plotEvery  = numTrials // 50
plotErrorEvery = numTrials // 100

In [None]:
def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoidDeriv(u):
    return sigmoid(u) * (1 - sigmoid(u))

In [None]:
def optimizeMulti(goodInit, momentum=False):
    # initialize weights and biases
    weightsOne = np.load('init/weightsOne.npy') if goodInit else np.random.randn(2,2) # first layer weights                                                                                                                                        
    biasOne    = np.load('init/biasOne.npy') if goodInit else np.random.randn(2,1)                                                                                                                                                                 
    weightsTwo = np.load('init/weightsTwo.npy') if goodInit else np.random.randn(2,1) # second layer weights                                                                                                                                       
    biasTwo    = np.load('init/biasTwo.npy') if goodInit else np.random.randn(1)                                                                                                                                                                   
    
    gamma = 0.2 if momentum else 0
    plotter.setupPlotProb3(weightsOne, biasOne, weightsTwo, biasTwo)

    weightsOneDerivLast = 0; biasOneDerivLast = 0; weightsTwoDerivLast = 0; biasTwoDerivLast = 0
    # loop over trials
    for t in range(numTrials):
        # initialize derivative of weights, biases, and error array for each trial                                                                                                                                                                    
        weightsOneDeriv = np.zeros((2,2))                                                                                                                                                             
        biasOneDeriv    = np.zeros((2,1))                                                                                                                                                            
        weightsTwoDeriv = np.zeros((2,1))                                                                                                                                                             
        biasTwoDeriv    = np.zeros(1)

        errorT = 0  
        # loop over training set
        for i in range(numSamples):
            # forward pass
            uy = weightsOne @ data[:,i:i+1] + biasOne
            y = sigmoid(uy)
            uz = (weightsTwo.T @ y + biasTwo)[0]
            z  = sigmoid(uz)

            # compute error
            error = teacher[i] - z
            
            # second layer derivatives
            # (eta is applied here and again in the update below, so the effective step size is eta**2)
            biasTwoDeriv[0] += eta * error * sigmoidDeriv(uz)
            weightsTwoDeriv += eta * error * sigmoidDeriv(uz) * y
            
            # first layer derivatives (sigmoidDeriv takes the pre-activation uy, not the activation y)
            biasOneDeriv += eta * error * sigmoidDeriv(uz) * weightsTwo * sigmoidDeriv(uy)
            weightsOneDeriv += eta * error * sigmoidDeriv(uz) *  (weightsTwo * sigmoidDeriv(uy)) @ data[:,i:i+1].T
                           
            # accumulate error
            errorT += 1/2 * error ** 2
                
        # update weights and bias
        weightsOne += eta * weightsOneDeriv + gamma*weightsOneDerivLast
        biasOne    += eta * biasOneDeriv + gamma*biasOneDerivLast
        weightsTwo += eta * weightsTwoDeriv + gamma*weightsTwoDerivLast
        biasTwo    += eta * biasTwoDeriv + gamma*biasTwoDerivLast

        weightsOneDerivLast = weightsOneDeriv
        biasOneDerivLast = biasOneDeriv
        weightsTwoDerivLast = weightsTwoDeriv
        biasTwoDerivLast = biasTwoDeriv

        # update display of separating hyperplane every plotEvery iterations
        if t % plotEvery == 0:
            plotter.updatePlotProb3(weightsOne, biasOne, weightsTwo, biasTwo)
        if t % plotErrorEvery == 0:
            plotter.plotErrorProb3(t, errorT)

In [None]:
optimizeMulti(goodInit=True, momentum=False)

In [None]:
optimizeMulti(goodInit=True, momentum=True)

In [None]:
optimizeMulti(goodInit=False, momentum=True)

<font color="red">Solution: </font> <br/>
a) The first layer of the network projects the apples and oranges into a space where they are linearly separable, as can be seen in the "Hidden representation" plot.<br/>
b) Momentum causes the error to decrease substantially faster.<br/>
c) Because our energy function is non-convex, the optimization can converge to a local minimum. Depending on the initialization, this local minimum may be good or bad, which becomes apparent with random initializations.<br/>
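
To illustrate point b) in isolation, here is a small sketch of classical (heavy-ball) momentum on a toy quadratic energy. This is an assumed example, not the lab network. It also differs slightly from the notebook code, which adds `gamma` times the previous accumulated gradient rather than keeping a running velocity. The matrix `A`, the step size, and the function `descend` are illustrative.

```python
import numpy as np

# Toy quadratic bowl E(w) = 1/2 * w^T A w with very different curvatures (assumed example)
A = np.diag([1.0, 25.0])

def energy(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def descend(eta, gamma, num_trials, w0):
    """Gradient descent with a velocity term; gamma=0 recovers plain gradient descent."""
    w = w0.copy()
    velocity = np.zeros_like(w)
    for _ in range(num_trials):
        velocity = gamma * velocity - eta * grad(w)
        w = w + velocity
    return energy(w)

w0 = np.array([1.0, 1.0])
error_plain = descend(eta=0.03, gamma=0.0, num_trials=100, w0=w0)
error_momentum = descend(eta=0.03, gamma=0.5, num_trials=100, w0=w0)
print(error_plain, error_momentum)
```

The step size is limited by the steep direction, so plain gradient descent crawls along the shallow one. The velocity term accumulates along that shallow direction, which is why the error ends up lower after the same number of trials.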

<h2 align="center">4. Pattern Discrimination Task </h2> 

Consider the following pattern discrimination task:

![title](img/lab2.4.png)

Describe how you think you are able to discriminate between these two patterns. How would you expect a network to discriminate between them? Try training a two-layer neural network to discriminate these patterns. How many hidden units are needed? What representation is learned by the hidden units in order to solve this problem? Are they what you expected?

<font color="red">Solution: </font> Intuitively, I see a "T" as two 3-length bars perpendicular to each other. I see an "S" as a bar with an extra dot on opposite corners. I would expect my network to learn features that correspond to bars and opposite corners.

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from utils.lab2_utils import FilterPlotter
import pdb

In [None]:
# initialize data array
S = np.load('data/S.npy')
T = np.load('data/T.npy')
data = np.hstack((S,T))
numInputUnits,numSamples = data.shape
halfNumSamples = int(numSamples/2)

In [None]:
# initialize teacher
teacher = np.ones(numSamples)
teacher[halfNumSamples:] *= 0

# learning rate
eta=4e-1

# number of trials - you may want to make this smaller or larger
numTrials = 2000

# plotting
plotter = FilterPlotter(numTrials)
plotHiddenUnitsEvery  = numTrials // 20
plotErrorEvery = numTrials // 50

In [None]:
def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoidDeriv(u):
    return sigmoid(u) * (1 - sigmoid(u))

In [None]:
def optimize(numHiddenUnits, momentum=False):
    # initialize weights and biases
    weightsOne = np.random.randn(numHiddenUnits, numInputUnits) # first layer weights                                                                                                                                        
    biasOne    = np.random.randn(numHiddenUnits,1)                                                                                                                                                                 
    weightsTwo = np.random.randn(numHiddenUnits,1) # second layer weights                                                                                                                                       
    biasTwo    = np.random.randn(1)                                                                                                                                                                
    
    gamma = 0.2 if momentum else 0

    plotter.setupPlots(weightsOne, numHiddenUnits)

    weightsOneDerivLast = 0; biasOneDerivLast = 0; weightsTwoDerivLast = 0; biasTwoDerivLast = 0
    # loop over trials
    for t in range(numTrials):
        # initialize derivative of weights, biases, and error array for each trial                                                                                                                                                                    
        weightsOneDeriv = np.zeros((numHiddenUnits, numInputUnits))                                                                                                                                                             
        biasOneDeriv    = np.zeros((numHiddenUnits,1))                                                                                                                                                            
        weightsTwoDeriv = np.zeros((numHiddenUnits,1))                                                                                                                                                             
        biasTwoDeriv    = np.zeros(1)

        errorT = 0  
        # loop over training set
        for i in range(numSamples):
            # forward pass
            uy = weightsOne @ data[:,i:i+1] + biasOne
            y = sigmoid(uy)
            uz = (weightsTwo.T @ y + biasTwo)[0]
            z  = sigmoid(uz)

            # compute error
            error = teacher[i] - z
            
            # second layer derivatives
            # (eta is applied here and again in the update below, so the effective step size is eta**2)
            biasTwoDeriv[0] += eta * error * sigmoidDeriv(uz)
            weightsTwoDeriv += eta * error * sigmoidDeriv(uz) * y
            
            # first layer derivatives (sigmoidDeriv takes the pre-activation uy, not the activation y)
            biasOneDeriv += eta * error * sigmoidDeriv(uz) * weightsTwo * sigmoidDeriv(uy)
            weightsOneDeriv += eta * error * sigmoidDeriv(uz) * (weightsTwo * sigmoidDeriv(uy)) @ data[:,i:i+1].T
                           
            # accumulate error
            errorT += abs(error)
                
        # update weights and bias
        weightsOne += eta * weightsOneDeriv + gamma*weightsOneDerivLast
        biasOne    += eta * biasOneDeriv + gamma*biasOneDerivLast
        weightsTwo += eta * weightsTwoDeriv + gamma*weightsTwoDerivLast
        biasTwo    += eta * biasTwoDeriv + gamma*biasTwoDerivLast

        # track previous weight derivatives to use momentum
        weightsOneDerivLast = weightsOneDeriv
        biasOneDerivLast = biasOneDeriv
        weightsTwoDerivLast = weightsTwoDeriv
        biasTwoDerivLast = biasTwoDeriv

        if t % plotHiddenUnitsEvery == 0:
            plotter.updatePlots(weightsOne)
        if t % plotErrorEvery == 0:
            plotter.plotError(t, errorT)
    print ("Final Error: %.2f" % errorT)

In [None]:
optimize(numHiddenUnits=2, momentum=True)

In [None]:
optimize(numHiddenUnits=4, momentum=True)

Four units seem to reliably solve this problem. However, the features learned by these units are not easily interpretable. They often seem to be looking for corners, but I don't see units looking for bars.