# Assignment 1 for ELEC576 - Deep Machine Learning
## Author:  R. Tyler McLaughlin
## Department: SSPB
## Date:  10/02/17

![](make_moons.png)

## Derivative of tanh

$$\tanh x = { \frac{\sinh x}{\cosh x} }
 = { \frac {e^{x}-e^{-x}}{e^{x} +e^{-x}} } $$


$$\frac{d}{dx}{\sinh x} = \frac{1}{2}*(\frac{d}{dx}e^{x}-\frac{d}{dx}e^{-x}) = \frac{1}{2}*(e^{x} + e^{-x}) = \cosh x $$
$$\frac{d}{dx}{\cosh x} = \frac{1}{2}*(\frac{d}{dx}e^{x}+\frac{d}{dx}e^{-x}) = \frac{1}{2}*(e^{x} - e^{-x}) = \sinh x. $$
<p style="text-align: center;">  
Applying the quotient rule:
</p>

$$ \frac{d}{dx}{\tanh x} = \frac{\cosh x \frac{d}{dx} \sinh x - \sinh x \frac{d}{dx} \cosh x } {{\cosh}^{2} x} $$

$$ = \frac{\cosh x \cosh x - \sinh x \sinh x}{\cosh^{2} x}$$

$$ = 1 - \frac{\sinh ^{2}x}{\cosh^{2}x} = 1 - \tanh^{2} x $$

$$ \blacksquare . $$
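As a quick sanity check on the result above, the analytic derivative can be compared against a central finite difference. This is only a sketch: the helper names `tanh_prime` and `central_difference` and the step size `h` are made up for illustration.

```python
import math


def tanh_prime(x):
    """Analytic derivative derived above: 1 - tanh^2(x)."""
    return 1.0 - math.tanh(x) ** 2


def central_difference(f, x, h=1e-6):
    """Numerical derivative via the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Compare the two at a handful of points; they should agree to many digits.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    analytic = tanh_prime(x)
    numeric = central_difference(math.tanh, x)
    print(f"x={x:+.1f}  analytic={analytic:.8f}  numeric={numeric:.8f}")
```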




## Derivative of logistic sigmoid

$$ \sigma (x) =  \frac{1}{1 + e^{-x}}$$

<p style="text-align: center;">  
Applying the chain rule:
</p>
$$ \frac{d}{dx} \sigma (x) = - \frac{1}{(1 + e^{-x})^{2}} * -e^{-x} $$

$$ = \sigma (x) \frac{e^{-x}}{1 + e^{-x}} $$

$$ = \sigma (x) (1 - \sigma (x)) $$

$$ \blacksquare . $$
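The identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be checked numerically in the same spirit. The function names and the finite-difference step below are assumptions chosen for this sketch.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written exactly as in the definition above."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-6):
    """Numerical derivative via the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, sigmoid_prime(x), central_difference(sigmoid, x))
```

The derivative peaks at $x = 0$, where $\sigma(0) = 1/2$ and so $\sigma'(0) = 1/4$.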



## Derivative of ReLU

The Rectified Linear Unit is defined as $ f(x)=\text{max}(0,x) $.  
We can rewrite this using two cases:

\begin{equation} 
f(x)=
    \begin{cases}
      x, & \text{if}\ x>0 \\
      0, & \text{otherwise}
    \end{cases}
\end{equation}

Upon simple differentiation of the two cases, we get

\begin{equation} 
f'(x)=
    \begin{cases}
      1, & \text{if}\ x>0 \\
      0, & \text{otherwise}
    \end{cases}
\end{equation} 

The ReLU is continuous at $x = 0$ but not differentiable there, because the left-hand slope (0) and the right-hand slope (1) disagree;
therefore its derivative at $x = 0$ is technically not defined.
However, we are explicitly setting $ f'(0) = 0 $ in the statement above, so we have defined a derivative $\forall x \in \mathbb{R}$.
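The two piecewise definitions translate directly into code. The sketch below (function names are my own) also evaluates the one-sided difference quotients at $x = 0$, which shows why a convention for $f'(0)$ is needed: the slope from the left and the slope from the right are different.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def relu_prime(x):
    """Derivative of ReLU, using the convention f'(0) = 0 from the text."""
    return 1.0 if x > 0 else 0.0


h = 1e-6
right_slope = (relu(0.0 + h) - relu(0.0)) / h  # slope approaching from x > 0
left_slope = (relu(0.0) - relu(0.0 - h)) / h   # slope approaching from x < 0
print("right-hand slope:", right_slope)
print("left-hand slope:", left_slope)
print("convention used:", relu_prime(0.0))
```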



# Backward Pass - Backpropagation

## Derivative 1:

$ \frac{\partial L}{\partial W^{(2)}}$.

By the chain rule:

$ = \frac{\partial L}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}} $

However, there is a mathematical trick where we can speed up backpropagation computations if we **compose** the softmax and cross-entropy.  This means we'll be looking at factors #1 and #2 together, $\frac{\partial L}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} = \frac{\partial L}{\partial z^{(2)}}$.  We will later add factor #3 $\frac{\partial z^{(2)}}{\partial W^{(2)}}$ 

This trick was taken from the "Deep Learning" textbook by Goodfellow, Bengio, and Courville, page 199.

### factors #1 and #2 together

We can ignore N datapoints for now, and consider only the sum over classes $k \in C$.

Let's compute the following:
$\frac{\partial L}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} $

Expressing the output layer's activations element-wise:

$$\frac{\partial L}{\partial z_{i}} = - \sum\limits_{k \in C} y_{k} \frac{\partial \log a_{k}}{\partial z_{i}} = - \sum\limits_{k \in C} y_{k} \frac{1}{a_{k}} \frac{\partial  a_{k}}{\partial z_{i}}$$


Let's split the derivative into two cases:  

#### case 1:  $ i = j$.

$$ \frac{\partial  a_{i}}{\partial z_{i}}  = \frac{\partial}{\partial z_{i}} \frac{exp(z_{i})}{\sum\limits_{k \in C}exp(z_{k})}$$

By the quotient rule, we get:

$$ \frac{\big[\sum\limits_{k}exp(z_{k})\big]exp(z_{i}) - exp(z_{i})exp(z_{i}) }{ \big[{\sum\limits_{k}exp(z_{k})}\big]^{2} } $$.

$$ =  \frac{\big[\sum\limits_{k}exp(z_{k})\big] - exp(z_{i}) }{\sum\limits_{k}exp(z_{k})}  \frac{exp(z_{i})}{\sum\limits_{k}exp(z_{k})} $$.

This allows us to express the derivative of the softmax in terms of softmax function itself.

$$ \frac{\partial  a_{i}}{\partial z_{i}} = [1 - softmax(z_{i})] softmax(z_{i})  = [1 - a_{i}]a_{i} $$

Reincorporating into the full derivative above, we get:

$$ - \sum\limits_{k = i} y_{k} \frac{1}{a_{i}} \frac{\partial  a_{i}}{\partial z_{i}}$$

$$ =  -  y_{i} \frac{1}{a_{i}} \frac{\partial  a_{i}}{\partial z_{i}}$$

$$ =  -  y_{i} \frac{1}{a_{i}} [1 - a_{i}]a_{i}$$

$$ = - y_{i} (1 - a_{i}) $$

for i = j.






#### case 2: $ i \neq j$.

$$ \frac{\partial  a_{i}}{\partial z_{j}}  = \frac{\partial}{\partial z_{j}} \frac{exp(z_{i})}{\sum\limits_{k \in C}exp(z_{k})}$$

Using the quotient rule again:

$$ \frac{0 \cdot exp(z_{i}) - exp(z_{i})exp(z_{j}) }{ \big[{\sum\limits_{k}exp(z_{k})}\big]^{2} }  = -\frac{exp(z_{i})}{\sum\limits_{k}exp(z_{k})} \frac{exp(z_{j})}{\sum\limits_{k}exp(z_{k})} = - softmax(z_{i})softmax(z_{j}) = -a_{i}a_{j}$$.

Incorporating this into the derivative above,

$$ - \sum\limits_{i \neq j \in C} y_{j} \frac{1}{a_{j}} \frac{\partial  a_{j}}{\partial z_{i}} =  -\sum\limits_{i \neq j \in C} y_{j} \frac{1}{a_{j}} (-a_{j} a_{i}) =  \sum\limits_{i \neq j \in C} y_{j} a_{i} $$.

#### combining two cases:

$$ \frac{\partial L}{\partial z_{i}} = - y_{i} (1 - a_{i}) + \sum\limits_{j \neq i \in C} y_{j} a_{i} $$
$$ =  - y_{i} +  y_{i}a_{i} + \sum\limits_{j \neq i \in C} y_{j} a_{i} $$

$$ = -y_{i} + \sum\limits_{j \in C} y_{j}a_{i} $$

Rearranging and then finally realizing that since y is a one-hot vector, it sums to 1, we get 
$$ = a_{i}\sum\limits_{j \in C}y_{j} - y_{i}  $$

$$ = a_{i} \cdot 1 - y_{i}   = a_{i}^{(2)} - y_{i}$$.

So the **gradient** of the loss function with respect to a particular $z_{i}$ is quite simple mathematically and conceptually.  It is the difference between the current activation $a_{i}$ at the $i$th output neuron and the true class label $y_{i}$.
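The result $\partial L / \partial z_{i} = a_{i} - y_{i}$ is easy to verify numerically for a single data point. The following is a minimal sketch in plain Python; the logits `z`, the one-hot label `y`, and all function names are made-up values for illustration.

```python
import math


def softmax(z):
    """Softmax over a list of logits (shifted by the max for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, y):
    """L = -sum_k y_k * log(a_k), with a = softmax(z)."""
    a = softmax(z)
    return -sum(y_k * math.log(a_k) for y_k, a_k in zip(y, a))


def grad_analytic(z, y):
    """Gradient derived above: a_i - y_i for every output i."""
    return [a_i - y_i for a_i, y_i in zip(softmax(z), y)]


def grad_numeric(z, y, h=1e-6):
    """Central finite difference of the loss with respect to each z_i."""
    grads = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grads.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grads


z = [0.3, -1.2, 2.0]  # made-up logits for three classes
y = [0.0, 1.0, 0.0]   # made-up one-hot label
print(grad_analytic(z, y))
print(grad_numeric(z, y))
```

Because both $\mathbf{a}$ and $\mathbf{y}$ sum to 1, the entries of the gradient sum to zero.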

### factor #3

We have $\frac{\partial L}{\partial z_{i}^{(2)}}$ and we want the matrix $ \frac{\partial L}{\partial W^{(2)}} $.  The chain rule tells us that $ \frac{\partial L}{\partial W_{w}^{(2)}} = \sum\limits_{i}\frac{\partial L}{\partial z_{i}^{(2)}}\frac{\partial z_{i}^{(2)}}{\partial W_{w}^{(2)}},$ where $w$ represents a tuple of two indices (flattened matrix/tensor notation). 

Thus, to finish this part, we need $\frac{\partial z_{i}^{(2)}}{\partial W_{w}}$.

The connection from the hidden layer to the input of the output layer is given in terms of the activations of the hidden neurons, the weights, and the biases:

$ \mathbf{z}^{(2)} = \mathbf{a}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.$

$ \mathbf{a}^{(1)} $  is an $N \times H$ matrix and $\mathbf{W}^{(2)}$  is an $H \times C$ matrix, where N is the number of data points, H is the number of hidden nodes, and C is the number of categories (output neurons).

$ \mathbf{z}^{(2)} = \begin{bmatrix} a_{0,0} & a_{0,1} & \dots & a_{0,H} \\ \vdots  & \ddots & & \vdots \\ a_{N,0} & \dots & & a_{N,H} \end{bmatrix} \begin{bmatrix} W_{0,0} & W_{0,1} & \dots & W_{0,C} \\ W_{1,0} & W_{1,1} & & W_{1,C} \\  \vdots & & \ddots & \vdots \\ W_{H,0}  & \dots  & & W_{H,C}   \end{bmatrix}  + \mathbf{b}^{(2)} $ 

$ = \begin{bmatrix} a_{0,0}W_{0,0} + a_{0,1}W_{1,0} + a_{0,2}W_{2,0} + ... + a_{0,H}W_{H,0} & \dots & a_{0,0}W_{0,C} + a_{0,1}W_{1,C} + a_{0,2}W_{2,C} + ... + a_{0,H}W_{H,C} \\ \vdots & \ddots & \vdots \\ 
a_{N,0}W_{0,0} + a_{N,1}W_{1,0} + a_{N,2}W_{2,0} + ... + a_{N,H}W_{H,0} & \dots & a_{N,0}W_{0,C} + a_{N,1}W_{1,C} + a_{N,2}W_{2,C} + ... + a_{N,H}W_{H,C}\end{bmatrix} + \mathbf{b}^{(2)} $

We only care about $z_{i}^{(2)}$, not the full matrix $\mathbf{z}^{(2)}$ so we can consider the ith column of this matrix:

$\begin{bmatrix} a_{0,0}W_{0,i} + a_{0,1}W_{1,i} + a_{0,2}W_{2,i} + ... + a_{0,H}W_{H,i}  \\ \vdots \\ a_{N,0}W_{0,i} + a_{N,1}W_{1,i} + a_{N,2}W_{2,i} + ... + a_{N,H}W_{H,i}
\end{bmatrix} $

We can calculate $\frac{\partial z_{i}^{(2)}}{\partial W_{w}}$ by considering two cases, depending on the tuple of indices $w$.

#### case 1:  $w = ( h, j), h \in H, j = i $

Taking the derivative causes every term to vanish in each element of the column vector $z_{i}^{(2)}$ *except for a single term*.  This term is equal to $a_{n,h}$.



#### case 2:  $w = ( h, j), h \in H, j \neq i$

In this case, the derivative is simply equal to zero.

#### full matrix of partial derivatives of $z_{i}$ with respect to the weights
\begin{equation} 
\frac{\partial z_{i}^{(2)}}{\partial W_{h,j}}=
    \begin{cases}
      a_{h}^{(1)}, & \text{if}\ j = i \\
      0, & \text{otherwise}
    \end{cases}
\end{equation}


### Finishing the chain rule for $\frac {\partial L }{\partial \mathbf{W}^{(2)}} $

Recall, we are trying to find the derivative of the loss function with respect to each of the weights.

$$\frac{\partial L}{\partial W_{i,j}} = \sum\limits_{k}\frac{\partial L}{\partial z^{(2)}_{k}}\frac{\partial z^{(2)}_{k}}{\partial W_{i,j}} $$

We may write this compactly as $ \nabla_\mathbf{W} L.$

Let's run through a few example entries in this matrix to get a feel for the pattern.

#### example entry:  $W_{0,0}$

$$\frac{\partial L}{\partial W_{0,0}} = \sum\limits_{k}\frac{\partial L}{\partial z^{(2)}_{k}}\frac{\partial z^{(2)}_{k}}{\partial W_{0,0}}$$ 

Let's expand the sum over k.  Our multi-layer neural network has only two possible values for k (two classes), so the sum is simple:

$$ = \frac{\partial L}{\partial z^{(2)}_{0}}\frac{\partial z^{(2)}_{0}}{\partial W_{0,0}} + \frac{\partial L}{\partial z^{(2)}_{1}}\frac{\partial z^{(2)}_{1}}{\partial W_{0,0}}$$ 

Let's next substitute what we derived above analytically for the first partial derivative:

$$ = (a_{0}^{(2)} - y_{0})\frac{\partial z^{(2)}_{0}}{\partial W_{0,0}}  + (a_{1}^{(2)} - y_{1})\frac{\partial z^{(2)}_{1}}{\partial W_{0,0}}$$  

and then substitute what we derived for the second partial derivative:

$$ = (a_{0}^{(2)} - y_{0}) \cdot a_{0}^{(1)}  + (a_{1}^{(2)} - y_{1}) \cdot 0$$  

$$ = (a_{0}^{(2)} - y_{0}) \cdot a_{0}^{(1)}  .$$

OK great.  Let's take a look at another element of $ \nabla_\mathbf{W} L.$

#### example entry:  $W_{0,1}$

$$ \frac{\partial L}{\partial W_{0,1}} = \frac{\partial L}{\partial z^{(2)}_{0}}\frac{\partial z^{(2)}_{0}}{\partial W_{0,1}} + \frac{\partial L}{\partial z^{(2)}_{1}}\frac{\partial z^{(2)}_{1}}{\partial W_{0,1}}$$ 

Substitution is straightforward.  Observe how the cancellation is different:

$$ = (a_{0}^{(2)} - y_{0})\cdot 0  + (a_{1}^{(2)} - y_{1}) \cdot a_{0}^{(1)} $$  

$$ = (a_{1}^{(2)} - y_{1})\cdot a_{0}^{(1)} $$  

We can do one more example before generalizing and constructing the full matrix:

#### example entry:  $W_{1,0}$

$$ \frac{\partial L}{\partial W_{1,0}} = \frac{\partial L}{\partial z^{(2)}_{0}}\frac{\partial z^{(2)}_{0}}{\partial W_{1,0}} + \frac{\partial L}{\partial z^{(2)}_{1}}\frac{\partial z^{(2)}_{1}}{\partial W_{1,0}}$$ 

$$ = (a_{0}^{(2)} - y_{0})\frac{\partial z^{(2)}_{0}}{\partial W_{1,0}}  + (a_{1}^{(2)} - y_{1})\frac{\partial z^{(2)}_{1}}{\partial W_{1,0}}$$  

$$ = (a_{0}^{(2)} - y_{0})\cdot a_{1}^{(1)}  + (a_{1}^{(2)} - y_{1})\cdot 0 $$  

$$ = (a_{0}^{(2)} - y_{0}) \cdot a_{1}^{(1)} $$

#### generalizing:

$$ (\nabla_\mathbf{W} L)_{i,j}  =  \frac{\partial L}{\partial W_{i,j}^{(2)} } = (a_{j}^{(2)} - y_{j}) \cdot a_{i}^{(1)},$$

where $i$ indexes the $H$ hidden nodes and $j$ indexes the $C$ output nodes.

This can be succinctly written in matrix form!

$$ (\nabla_\mathbf{W} L)  =  \frac{\partial L}{\partial \mathbf{W^{(2)}} } = \mathbf{a^{(1)}}^\top \cdot (\mathbf{a^{(2)}} - \mathbf{y}) $$

$$ \blacksquare . $$
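The matrix form can be checked against finite differences with a few lines of NumPy. This is a sketch under stated assumptions: the loss is summed (not averaged) over the $N$ data points, and the shapes, random inputs, and helper names (`softmax_rows`, `loss`) are invented for the check rather than taken from the assignment code.

```python
import numpy as np


def softmax_rows(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss(a1, W2, b2, y):
    """Cross-entropy summed over all data points (assumed, no 1/N factor)."""
    a2 = softmax_rows(a1 @ W2 + b2)
    return -np.sum(y * np.log(a2))


rng = np.random.default_rng(0)
N, H, C = 5, 3, 2                          # made-up sizes
a1 = rng.normal(size=(N, H))               # hidden activations
W2 = rng.normal(size=(H, C))
b2 = rng.normal(size=(1, C))
y = np.eye(C)[rng.integers(0, C, size=N)]  # random one-hot labels

# Analytic gradient from the derivation: a1^T (a2 - y)
a2 = softmax_rows(a1 @ W2 + b2)
dW2 = a1.T @ (a2 - y)

# Numerical gradient, one weight at a time
h = 1e-6
dW2_num = np.zeros_like(W2)
for i in range(H):
    for j in range(C):
        W_plus, W_minus = W2.copy(), W2.copy()
        W_plus[i, j] += h
        W_minus[i, j] -= h
        dW2_num[i, j] = (loss(a1, W_plus, b2, y) - loss(a1, W_minus, b2, y)) / (2 * h)

print("max abs difference:", np.max(np.abs(dW2 - dW2_num)))
```

Note that the result has the same $H \times C$ shape as $\mathbf{W}^{(2)}$, with rows indexed by hidden node and columns by output node.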



## Derivative 2:

Calculate $ \frac{\partial L}{\partial \mathbf{b}^{(2)}},$  where $\mathbf{b}^{(2)}$ is a $ 1 \times C$ bias matrix.

### chain rule

$$\frac{\partial L}{\partial b_{i}^{(2)}} = \sum\limits_{k \in C}\frac{\partial L}{\partial z^{(2)}_{k}}\frac{\partial z^{(2)}_{k}}{\partial b_{i}^{(2)}} = (a_{0}^{(2)} - y_{0})\frac{\partial z^{(2)}_{0}}{\partial b_{i}^{(2)}}  + (a_{1}^{(2)} - y_{1})\frac{\partial z^{(2)}_{1}}{\partial b_{i}^{(2)}}$$

Recall that $\mathbf{z}^{(2)} = \mathbf{a}^{(1)} \cdot \mathbf{W}^{(2)} + \mathbf{b}^{(2)} $.

If we let $ \mathbf{q} = \mathbf{a}^{(1)} \cdot \mathbf{W}^{(2)} $, then  $\mathbf{z}^{(2)} = \begin{bmatrix} q_{0,0} & \dots & q_{0,C} \\
\vdots  & \ddots &  \vdots \\
q_{N,0} & \dots & q_{N,C}
\end{bmatrix} + \begin{bmatrix} b_{0}^{(2)} & \dots & b_{C}^{(2)}\end{bmatrix} $ 

Note:  strictly, $\mathbf{b}^{(2)}$ would need to be an $ N \times C$ matrix for the sum to be defined, but NumPy broadcasting repeats the $1 \times C$ row along the $N$ rows, yielding:

$\mathbf{z}^{(2)} = \begin{bmatrix} q_{0,0} + b_{0}^{(2)} & \dots & q_{0,C} + b_{C}^{(2)} \\
\vdots  & \ddots &  \vdots \\
q_{N,0} + b_{0}^{(2)} & \dots & q_{N,C} + b_{C}^{(2)}
\end{bmatrix} .$  

So each $ z_{i}^{(2)} $ is a column of this matrix.   We note that $z_{i}^{(2)}$ does not depend on $b_{j}$ for $i \neq j$ and thus $ \frac{\partial z^{(2)}_{i}}{\partial b_{j}^{(2)}} = 0.$  Otherwise, if $ i = j$, then $\frac{\partial z^{(2)}_{i}}{\partial b_{i}^{(2)}} = 1.$ 

### substituting into the chain rule:

Just as we did before for $\frac{\partial L}{\partial W_{i,j}}$, let's look at a few specific cases before generalizing.

$$\frac{\partial L}{\partial b_{i}^{(2)}} = \sum\limits_{k \in C}\frac{\partial L}{\partial z^{(2)}_{k}}\frac{\partial z^{(2)}_{k}}{\partial b_{i}^{(2)}} = (a_{0}^{(2)} - y_{0})\frac{\partial z^{(2)}_{0}}{\partial b_{i}^{(2)}}  + (a_{1}^{(2)} - y_{1})\frac{\partial z^{(2)}_{1}}{\partial b_{i}^{(2)}} + \dots + (a_{C}^{(2)} - y_{C})\frac{\partial z^{(2)}_{C}}{\partial b_{i}^{(2)}}$$

Every term vanishes except the one where $ k = i$, which is equal to $(a_{i}^{(2)} - y_{i})$.  

Thus, $\frac{\partial L}{\partial b_{i}^{(2)}} = (a_{i}^{(2)} - y_{i})$.

$$ \blacksquare . $$
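A short NumPy sketch of the two points made in this section: the broadcasting of the $1 \times C$ bias row, and the bias gradient. The per-sample gradient is $a_{i}^{(2)} - y_{i}$ as derived above; summing it over the $N$ rows for a batch is an assumption (it corresponds to a loss summed over data points). All numbers are made up.

```python
import numpy as np

N, C = 4, 3
q = np.arange(N * C, dtype=float).reshape(N, C) / 10.0  # stand-in for a1 @ W2
b2 = np.array([[0.1, 0.2, 0.3]])                        # bias row, shape (1, C)

# Broadcasting: the single bias row is added to every row of q.
z2 = q + b2
z2_explicit = q + np.repeat(b2, N, axis=0)              # the same sum written out

# Softmax activations and made-up one-hot labels.
e = np.exp(z2 - z2.max(axis=1, keepdims=True))
a2 = e / e.sum(axis=1, keepdims=True)
y = np.eye(C)[[0, 1, 2, 0]]

# Per-sample bias gradient is (a2 - y); summing over rows gives a (1, C) result
# that matches the shape of b2 (assumes the loss is summed over data points).
delta = a2 - y
db2 = delta.sum(axis=0, keepdims=True)
print(z2.shape, db2.shape)
```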



## Derivative 3:

Here we calculate $ \frac{\partial L}{\partial \mathbf{W}^{(1)}}$, where $\mathbf{W}^{(1)}$ is an $I \times H$ matrix with $H$ representing the number of hidden nodes and $I$ representing the number of input nodes.  This derivative is also a matrix.

Let $a$ be the element-wise activation function.  Let $a'$ be its derivative.

First, we can work on the gradient of $ z_{i}^{(2)} $ (the input to arbitrary node in layer 2) with respect to  $ z_{j}^{(1)}$ (the input to arbitrary node in layer 1).

$$ \frac{\partial z_{i}^{(2)}}{\partial z_{j}^{(1)}} = \sum\limits_{p}\frac{\partial z_{i}^{(2)}}{\partial a_{p} } \frac{\partial a_{p}}{\partial z_{j}^{(1)}}  $$

where $p$ is a tuple of indices and $a$ is the $N \times H$ matrix of activations.

Recall the equation for the input to the second layer: 
$$\mathbf{z}^{(2)} = \mathbf{a}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} $$

If we take a single element of $\mathbf{z}^{(2)}$ (for one data point), we get the following:

$$ z_{i}^{(2)} = \sum\limits_{h} a_{h}^{(1)} W_{h,i}^{(2)} + b_{i}^{(2)} $$



# Training our Three Layer Neural Network

## Experimenting with Activation Functions

### Hyperbolic Tangent Activation Function
![](1e_figures/Figure_1.E.1.tanh.png)

### Sigmoid Activation Function

![](1e_figures/Figure_1.E.1.sigmoid.png)

This, to my eyes, looks **identical** to the result with the tanh activation!

Note:  I tried re-implementing the sigmoid function calculation with a numerically stable implementation (SciPy's built-in `scipy.special.expit`) to avoid overflow:

```python
import numpy as np
import scipy.special

z = -1e3
# OLD CALCULATION
activation = 1. / (1 + np.exp(-z))  # np.exp(1000) overflows to inf (RuntimeWarning)
activation
```

```
0.0
```

```python
# NEW CALCULATION using scipy.special.expit
activation = scipy.special.expit(z)  # does not overflow
activation
```

```
0.0
```
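For intuition about how a stable sigmoid can avoid the overflow, here is one common approach written in plain Python: pick the algebraically equivalent form whose exponent is never positive. This is an illustrative sketch with a made-up function name, not a description of how `scipy.special.expit` is implemented internally.

```python
import math


def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates a non-positive number."""
    if z >= 0:
        # exp(-z) <= 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) < 1.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-1e3, -1.0, 0.0, 1.0, 1e3):
    print(z, stable_sigmoid(z))
```

With this arrangement very negative inputs underflow quietly to 0 and very positive inputs saturate at 1, without any overflow warning or error.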

### ReLU Activation Function

![](1e_figures/Figure_1.E.1.relu.png)

## Varying the number of hidden nodes

### 1 Node
![](1e_figures/Figure_1.E.1.tanh.1-nodes.png)

1 node does about as well as a simple linear model (e.g. logistic regression): the decision boundary is a single straight line.

### 2 Nodes
![](1e_figures/Figure_1.E.1.tanh.2-nodes.png)

2 nodes are able to fit the data slightly better than a single linear decision split.

### 4 Nodes
![](1e_figures/Figure_1.E.1.tanh.4-nodes.png)

At 4 nodes, we see a proper fit to the crescent shape of the data.

### 6 Nodes
![](1e_figures/Figure_1.E.1.tanh.6-nodes.png)

6 nodes gives a lower loss than 4 nodes, classifying the points more accurately, although at this point we appear to be overfitting to the random jitter.

### 9 Nodes
![](1e_figures/Figure_1.E.1.tanh.9-nodes.png)

Adding more than 6 hidden nodes does not appear to improve the quality of the fit.

# Part 1 f:  Building a Deep Neural Network


## Implementation Notes
The code "n_layer_neural_network-final.py" accepts input that specifies the number of nodes per layer in the form:

```python
layer_sizes = [X.shape[1], 3, 3, 3, 3]
```

This example prepares for the instantiation of a multilayer perceptron with 4 hidden layers with 3 nodes each.
The next example builds a network with 10 nodes in the first hidden layer, 9 nodes in the second, and so on, decreasing until 2 nodes are reached.



```python
layer_sizes = [X.shape[1], 10, 9, 8, 7, 6, 5, 4, 3, 2]
```


Keeping a dictionary of activation functions outside of the classes really helped with compartmentalization of the code.  So too did using separate classes for the general MLP layer and the final output layer.  Initially, I had tried to use `if`/`else` control statements to do different things if the flag `last_layer` was `True`.

I had to implement feedforward AND backprop for every layer AND also in `class DeepNeuralNetwork`.  I inherited from `class NeuralNetwork`, but by the end of implementing everything, I only really inherited the `visualize_decision_boundary` function.  

`class OutputLayer` inherits from `class Layer` and, if I recall correctly, only the methods `actFun`, `diff_ActFun`, and `backprop` were overridden. 


I used `Layer.feedforward` to implement `DeepNeuralNetwork.feedforward`.

Analogously, I used `Layer.backprop` to implement `DeepNeuralNetwork.backprop`.  
The `reversed` function came in handy for iterating over layers in reverse order!

```python
    def backprop(self, y):
        # Start from the targets and pass the error term backwards,
        # from the last layer to the first.
        delta_term = y
        for layer in reversed(self.layers):
            delta_term = layer.backprop(delta_term)
```



I decided not to add $L^{2}$-norm regularization terms to the final loss, although I did use regularization during backprop.  To be more consistent and more resistant to overfitting, an additional regularization term in the loss would have been helpful in theory, but my fits looked pretty good already without it.  Plus, I believe I'd have to iterate over all weights from all layers, so this would slow down my computation a bit.

Creating the MLP involves looping over `layer_sizes`, creating the regular fully-connected layers, then appending an output layer (`class OutputLayer`), which has the softmax activation function.
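The construction loop described above could look roughly like the sketch below. The class names `Layer` and `OutputLayer` come from this report, but the constructor arguments, the `build_layers` helper, and the `n_classes` parameter are hypothetical stand-ins; the real code in "n_layer_neural_network-final.py" may be organized differently.

```python
class Layer:
    """Hypothetical stand-in for a fully connected hidden layer."""

    def __init__(self, n_in, n_out, act_fun_type="tanh"):
        self.n_in = n_in
        self.n_out = n_out
        self.act_fun_type = act_fun_type


class OutputLayer(Layer):
    """Hypothetical stand-in for the final layer, which uses softmax."""

    def __init__(self, n_in, n_out):
        super().__init__(n_in, n_out, act_fun_type="softmax")


def build_layers(layer_sizes, n_classes, act_fun_type="tanh"):
    """Pair up consecutive sizes for the hidden layers, then add the output layer."""
    layers = [
        Layer(n_in, n_out, act_fun_type)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]
    layers.append(OutputLayer(layer_sizes[-1], n_classes))
    return layers


# Two input features, four hidden layers of 3 nodes, two output classes.
layers = build_layers([2, 3, 3, 3, 3], n_classes=2)
print([(type(layer).__name__, layer.n_in, layer.n_out) for layer in layers])
```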

## Performance with respect to depth

Let's try a deeper network, with two hidden layers of 3 nodes each (6 hidden nodes total).

```python
layer_sizes = [X.shape[1], 3, 3]
```

![](1f_figures/mlp_depth-3_3.png)
This just looks a bit like a **smoother version** of the shallow network.  What if we increase the number of nodes while keeping the layers fixed?

```python
layer_sizes = [X.shape[1], 6, 6]
```

![](1f_figures/mlp_depth-6_6.png)

This network with two hidden layers, and 6 nodes per layer is brilliant!

```python
layer_sizes = [X.shape[1], 10, 8, 6, 4, 3]
```

![](1f_figures/mlp_depth-10-8-6-4-3.png)

And this one is quite bad.  The extra depth doesn't seem to help at 20,000 iterations.  What if instead of shrinking deeper layers, we fan out, increasing the size of the hidden layers near the output layer?


```python
layer_sizes = [X.shape[1], 3, 4, 6, 8, 10]
```

![](1f_figures/mlp_depth-3-4-6-8-10.png)

Well this is much better, but I wonder how it fares with more than 20,000 iterations.  I'll try increasing to 80,000.


![](1f_figures/mlp_depth-3-4-6-8-10_80k_runtime.png)

OK, cool, we can see a bit of the underlying crescent shape!

### varying the nonlinearity

![](1f_figures/mlp_depth-6_6_sigmoid.png)

This looks **basically identical** to the tanh.

## Different Data Sets

First I tried fitting to the wine data set (`load_wine`) from scikit-learn.
![](1f_figures/plot_wine_data.png)

The data can be plotted, but I was unable to visualize the decision boundary because there are too many features (13).

The same goes for the breast cancer dataset, which has 30 features.   For reference, the Make Moons dataset has two features.

What I wanted, however, was a dataset with more than two classes, because I was curious about how the deep neural net would perform at distinguishing three or four classes.

Let's use the **Make Blobs** dataset because it gives flexibility over the number of "centers" of the blobs.

For all the following images, I ran with the same hyperparameters:  80,000 iterations and a learning rate $\epsilon$ of 0.01.



```python
X, y = generate_data_blobs(centers=3)
layer_sizes = [X.shape[1], 16, 16, 16, 16]
```

![](1f_figures/blobs_3_fiveLayers)

This looks like it's not converging, OR maybe it's overfitting.   Either way, let's try reducing the number of layers.  

```python
X, y = generate_data_blobs(centers=3)
layer_sizes = [X.shape[1], 16, 16]
```


![](1f_figures/blobs_3_threeLayers)

Better fit with the same number of iterations.  Let's see what happens when we increase the number of blob centers aka classes to 4 classes.

![](1f_figures/blobs_4_threeLayers)

Increasing to 6 classes:

![](1f_figures/blobs_6_threeLayers)

And lastly, 12 classes:

![](1f_figures/blobs_12_threeLayers)


This neural network seems to have strong capacity to fit a large number of classes.  The only "problem area" is in the center of the plot, where the points become congested.

To look at one final neural architecture, let's try changing the activation function to sigmoid while maintaining the same depth and number of nodes.



```python
layer_sizes = [X.shape[1], 16, 16]
actFun_type = 'sigmoid'
```

![](1f_figures/blobs_12_threeLayers_sigmoid)

Wow.  Sigmoid actually fits the data better.



# Part 2: Training a Simple Deep Convolutional Network on MNIST

## Part a) Build and Train a 4-layer DCN

### conv1(5-5-1-32) - ReLU - maxpool(2-2) - conv2(5-5-32-64) - ReLU - maxpool(2-2) - fc(1024) - ReLU - DropOut(0.5) - Softmax(10)

All my code for this section is in the file dcn_mnist_part2a.py
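To make the layer string above concrete, here is a rough sketch of the tensor shapes and parameter counts it implies. It assumes 28×28×1 MNIST inputs, stride-1 convolutions with "SAME" padding, and 2×2 max pooling with stride 2; the padding and stride settings are not stated in this report, and the helper names are invented for illustration.

```python
def conv_same(shape, kernel, out_channels):
    """Stride-1 'SAME' convolution (assumed): height and width are unchanged."""
    h, w, in_channels = shape
    params = kernel * kernel * in_channels * out_channels + out_channels  # weights + biases
    return (h, w, out_channels), params


def max_pool_2x2(shape):
    """2x2 max pooling with stride 2 (assumed): halves height and width."""
    h, w, c = shape
    return (h // 2, w // 2, c)


def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


shape = (28, 28, 1)                        # one grayscale MNIST image
shape, p_conv1 = conv_same(shape, 5, 32)   # conv1(5-5-1-32)
shape = max_pool_2x2(shape)
shape, p_conv2 = conv_same(shape, 5, 64)   # conv2(5-5-32-64)
shape = max_pool_2x2(shape)

flat = shape[0] * shape[1] * shape[2]      # size after flattening for fc(1024)
p_fc1 = dense_params(flat, 1024)
p_out = dense_params(1024, 10)             # Softmax(10)
total_params = p_conv1 + p_conv2 + p_fc1 + p_out

print("shape before flattening:", shape)
print("flattened size:", flat)
print("total trainable parameters:", total_params)
```

Under these assumptions, almost all of the parameters sit in the first fully connected layer rather than in the convolutions.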

Last few lines of the terminal output:

```
step 4900, training accuracy 0.98
step 5000, training accuracy 0.96
step 5100, training accuracy 1
step 5200, training accuracy 1
step 5300, training accuracy 0.96
step 5400, training accuracy 1
test accuracy 0.9869
The training takes 1014.218075 second to finish
```

About 17 minutes to run!  Not bad on my MacBook Pro.

After running the file I moved my results directory to results_part_2a/

#### visualized results
This shows the training loss value as a function of iteration number.
![](2a_figures/scalars.png)

#### Computational Graph
![](2a_figures/graphs.png)

## Part b) More on Visualizing Your Training

### Statistics
Because we don't want this report to get too gigantic, let's show some **select figures**.  If you're interested, all the figures are on my Github page https://github.com/TylerMclaughlin/rice-deep-machine-learning-class/tree/master/assignment1/2b_figures/

#### Statistics for the weights for convolutional layer 1
![](2b_figures/W_conv1.png)
#### Statistics for the biases for convolutional layer 1
![](2b_figures/b_conv1.png)

#### Statistics for the net inputs to convolutional layer 1
![](2b_figures/input_1.png)

#### Statistics for the ReLU activations of convolutional layer 1
![](2b_figures/h_conv1.png)

#### Statistics for the max-pooled activations of convolutional layer 1
![](2b_figures/h_pool1.png)

#### Example histogram for the weights for convolutional layer 1
![](2b_figures/W_conv1_histogram.png)


When the minimum of a parameter or activation stays at zero for the entire session, it suggests that the corresponding neuron or synapse is dead.  :|

### Validation and Test Set Accuracy

Summaries for test and validation set prediction accuracy were output every epoch, so there are only 5 data points:
#### Test Accuracy
![](2b_figures/test_accuracy.png)
#### Validation Accuracy
![](2b_figures/val_accuracy.png)

The two plots look identical, but when I hovered over specific data points in TensorBoard, I saw there was a slight difference (about a tenth of a percent).

# Part 2c:  Time for More Fun!!!

## Playing with nonlinearities

### Using Tanh as an activation function

#### Note:  Swapping all ReLUs with tanhs!

![](2c_figures/tanh/h_conv1.png)

What's nice about tanh, as this plot shows, is that you don't get dead neurons!  The minimum activation after the first convolutional layer is far from zero, unlike with ReLU, where it is stuck at ~1e-16 from the beginning.  The nonzero minimum activation for tanh implies that all neurons are alive.

While the test and validation accuracies (>95%) were quite high after the first epoch, diverse, not-quite-convergent behavior in the weights and biases is observed with tanh.

![](2c_figures/tanh/W_fc1.png)
![](2c_figures/tanh/b_conv1.png)



### Using Sigmoid as an Activation Function

![](2c_figures/tanh/test_accuracy.png)
Only 82.7 percent after the first epoch!

However, weights and activations look pretty smooth.  Frequent non-monotonic behavior.
![](2c_figures/sigmoid/h_conv1.png)
![](2c_figures/sigmoid/W_fc1.png)


### Using Leaky ReLU (L-ReLU) as an Activation Function

 $\mathbf{\alpha} $ **is set to -0.01**


![](2c_figures/LReLu/test_accuracy.png)

Test accuracy climbs very quickly, reaching 96.32% after the first epoch.

![](2c_figures/LReLu/h_conv.png)
Summary statistics are much noisier, which is probably a good thing for exploration of the landscape.  No dead neurons.


![](2c_figures/LReLu/input1_hist.png)
This histogram is sharply peaked with wide tails!  what does this mean?



### Using ELU as an Activation Function

Exponential Linear Units (ELUs) are known to fit very fast.

Test accuracy looks the same as the Leaky ReLU.

![](2c_figures/ELU/h_conv1.png)
Activation summaries mean, std, max, min, all appear to be dropping... not sure why they would all do this unless the neurons are all becoming way less active.




## Playing with gradient descent optimizers



### adagrad

![](2c_figures/adagrad_loss.png)

### vanilla SGD

![](2c_figures/vanilla_loss.png)


### Momentum

![](2c_figures/adagrad_loss.png)

It looks like the loss is worse with every optimizer examined other than Adam.
(Adam was used for the ReLU runs above.)

Momentum is better than vanilla SGD, which, unsurprisingly, appears to be the worst of the bunch.
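For reference, the update rules behind the three optimizers compared in this section can be sketched on a toy one-dimensional loss. This is plain Python written for illustration, not the library optimizers that produced the plots; the toy loss $f(w) = \frac{1}{2}w^{2}$, the learning rate, the momentum coefficient `beta`, and the function names are all assumptions, and the toy problem is not expected to reproduce the ranking seen on MNIST.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * w**2."""
    return w


def run_sgd(w, lr=0.1, steps=50):
    """Vanilla gradient descent: step against the gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_momentum(w, lr=0.1, beta=0.9, steps=50):
    """Momentum: keep a decaying running velocity of past gradient steps."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w += v
    return w


def run_adagrad(w, lr=0.1, eps=1e-8, steps=50):
    """Adagrad: scale each step by the root of the summed squared gradients."""
    g2_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_sum += g * g
        w -= lr * g / (math.sqrt(g2_sum) + eps)
    return w


w0 = 5.0
results = {
    "sgd": run_sgd(w0),
    "momentum": run_momentum(w0),
    "adagrad": run_adagrad(w0),
}
for name, w_final in results.items():
    print(f"{name:>8}: final w = {w_final:.4f}")
```

One design difference worth noting: Adagrad's accumulated sum only grows, so its effective step size shrinks over time, whereas momentum can overshoot and oscillate around the minimum.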

## Playing with Xavier Initialization


### Using Xavier Initialization type:  Uniform distribution
![](2c_figures/xav_uniform.png)

This gives an extremely high accuracy (99.04% for test, 99.06% for validation) in a very short amount of time (fast convergence).
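As a reminder of what the uniform variant does, here is a minimal plain-Python sketch of the standard Xavier (Glorot) uniform rule: weights are drawn from $U(-l, l)$ with $l = \sqrt{6 / (n_{in} + n_{out})}$, which gives a variance of $2 / (n_{in} + n_{out})$. The function name, the example layer size, and the seed are assumptions; the report does not say which library call was used or to which layers it was applied.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
fan_in, fan_out = 1024, 10             # e.g. the size of a final dense layer
W = xavier_uniform(fan_in, fan_out, rng)

limit = math.sqrt(6.0 / (fan_in + fan_out))
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)

print("limit:", limit)
print("sample variance:", variance)
print("target variance:", 2.0 / (fan_in + fan_out))
```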

