# Deep learning from scratch: homework 3

### General instructions

Complete the exercises listed below in this Jupyter notebook - leaving all of your code in Python cells in the notebook itself.  Feel free to add any necessary cells.  

Included with the notebook are 

- a custom utilities file called `custom_plotter.py` that provides various plotting functionalities (for unit tests to help you debug) as well as some other processing code.


- datasets for exercises: `noisy_sin_sample.csv` and `2_eggs.csv`.

be sure you have these files located in the same directory where you put this notebook to work!

### When submitting this homework:
    
**Make sure all output is present in your notebook prior to submission**

#### <span style="color:#a50e3e;">Exercise 4. </span>  Read the notes below

Yes, thats it, **there is nothing to turn in for this exercise** (yes, you will get these points for free!).  Just read the notes below!  You will find useful concepts - and blocks of code - for further exercises there.

In [1]:
# import autograd functionality
import autograd.numpy as np
from autograd.util import flatten_func
from autograd import grad as compute_grad   

# import custom utilities and plotter
import custom_plotter as plotter

# import various other libraries
import copy
import matplotlib.pyplot as plt

# this is needed to compensate for %matplotl+ib notebook's tendancy to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%matplotlib notebook
%load_ext autoreload
%autoreload 2

Feel free to use the following ``gradient_descent`` function below for this exercise.

In [2]:
# gradient descent function
def gradient_descent(g,w,alpha,max_its,beta,version):    
    # flatten the input function, create gradient based on flat function
    g_flat, unflatten, w = flatten_func(g, w)
    grad = compute_grad(g_flat)

    # record history
    w_hist = []
    w_hist.append(unflatten(w))

    # start gradient descent loop
    z = np.zeros((np.shape(w)))      # momentum term
    
    # over the line
    for k in range(max_its):   
        # plug in value into func and derivative
        grad_eval = grad(w)
        grad_eval.shape = np.shape(w)

        ### normalized or unnormalized descent step? ###
        if version == 'normalized':
            grad_norm = np.linalg.norm(grad_eval)
            if grad_norm == 0:
                grad_norm += 10**-6*np.sign(2*np.random.rand(1) - 1)
            grad_eval /= grad_norm
            
        # take descent step with momentum
        z = beta*z + grad_eval
        w = w - alpha*z

        # record weight update
        w_hist.append(unflatten(w))

    return w_hist

# 3.  (Optimization tricks part 2, continued) Nonlinear features, normalization, and gradient descent performance

As we have seen, whenever we want to perform some sort of nonlinear regression or classification, we transform the input of a dataset using a linear combination of $U$ *nonlinear* feature transformations $f_1\,,f_2\,..,f_U$, having a prediction function that (for single dimensional input) looks like 

\begin{equation}
\text{predict}\left({x},\omega\right) = w_0 + w_1\,f_1\left({x}\right) + \cdots + w_U\,f_U\left({x}\right)
\end{equation}

*In theory* the more feature transformations - often called just *features* for short - we use, the more nonlinear our prediction becomes.  Indeed if we use too many we risk overfitting any dataset.  However *in practice* when optimizing (whether with nonlinear regression or classification) via *gradient descent* with a fixed number of maximum iterations we can often find that it gets harder to fit more nonlinear models to data.  In fact in some instances we can find that as we add nonlinear features our final fit becomes *less* nonlinear because our cost function has become harder for gradient descent to minimize properly. 

This confusing empirical fact is due to precisely the sort of input normalization issue described in the previous exercises.  That is, often nonlinear feature transformations *create* highly imbalanced input distributions, which in turn create the sort of long narrow valleys we try to avoid when using (either normalized or unnormalized) gradient descent as an optimizer. 

## 3.1  Fixed kernel feature transformations

To apply fixed features we take a set of *ordered basis functions* s $f_1,f_2,...,f_U$ (with no internal parameters) and transform the input.  In the next ``Python`` cell we give a compact feature transformation function called ``compute_features`` that does just this for *polynomial features*: we plug in our entire set of input and the first $U$ monomial features of it are returned.  In other words this returns a sequence of $P\times 1$ *transformed* inputs $\mathbf{f}_u$

\begin{equation}
\mathbf{f}_u = \begin{bmatrix}
f_u\left(x_1\right) \\
f_u\left(x_2\right) \\
\vdots \\
f_u\left(x_P\right) \\
\end{bmatrix}
\end{equation}

which contains each of our input points raised to the $u^{th}$ power.

In [3]:
# compact functionality for creating polynomial feature transformations - degree polynomials between 1 and U
def compute_features(x,U):
    return np.asarray([x**deg for deg in range(1,U+1)])[:,:,0].T

Remember: when employing such a fixed feature transformation $\mathbf{f}_u$ becomes our $u^{th}$ input dimension, and its distribution touches the weight $w_u$ (you can see this by examining the form of our ``predict`` function in equation (1)).  Thus having previously seen the effect of unnormalized versus normalized input feature distributions first hand in the case of *linear* regression / classification, we can expect something quite similar to arise in the nonlinear case based on the distribution of our features $\mathbf{f}_u$ (notice we never apply any 'normalization' to the bias since only one value touches it - the number $1$).  In other words, if the distributions of our transformed input dimensions are not by design similar, we can expect that normalizing them will substantially improve the performance of gradient descent since this will ameliorate (at least to some extent) the problem of long narrow valleys.

> If the distributions of our transformed input dimensions are not by design similar, we can expect that normalizing them will substantially improve the performance of gradient descent since this will ameliorate (at least to some extent) the problem of long narrow valleys.

#### <span style="color:#a50e3e;">Example 1. </span>  The distribution of polynomial features for a toy dataset

Lets begin by examining a the distribution of polynomial features for the toy sinusoidal dataset loaded in and shown below.

In [4]:
# load data
csvname = 'noisy_sin_sample.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:,:-1]
y = data[:,-1:]
# plot everything
plotter_demo = plotter.Visualizer()
plotter_demo.plot_regression_data(x,y)

<IPython.core.display.Javascript object>

To get a sense of the distribution of transformed features in our present case we can compute and plot each of the first $U = 20$ of them for our toy dataset.  We can examine then the distribution of each monomial feature using the ``feature_distributions`` plotting function introduced in the previous exercise.  A copy of this has been also in our ``custom_utilities`` backend file, and we can call it as shown below (no need to repeat the code block defining this function).

In [5]:
# make degree 20 polynomial features
U = 20
f = compute_features(x,U)

In [6]:
# show a plot of the distribution of each feature
title = 'distribution of first 20 polynomial features'
labels = [r'$f_{' + str(n+1) + '}$' for n in range(U)]
plotter_demo.feature_distributions(f,title,labels=labels)

<IPython.core.display.Javascript object>

These feature distributions are really collapsing rapidly and are - all together - not very similar across the feature space.  Why is this happening?  Because the input points in $\mathbf{x}$ here all lie in the range $[0,1]$, and raising any value in that range to a power makes it exponentially *smaller*.

Therefore we can expect gradient descent to be slow to converge if these unnormalized input features are used (an example below using our toy dataset will confirm this suspicion).  Indeed if left unnormalized this sort of feature distribution imbalance can lead to substantially increased difficulty in properly tuning nonlinear models (using gradient descent) as we increase the number of feature transformations $U$.  And we have done it to ourselves - in applying nonlinear feature transformations (here polynomials, but this is more commonly true as well) we *create* this problem.

> If left unnormalized fixed feature distribution imbalance can lead to substantially increased difficulty in properly tuning nonlinear models (using gradient descent) as we increase the number of fixed feature transformations $U$. 

At the extremes this can even lead to confusing scenarios where adding more nonlinear features does not improve the nonlinearity of a tuned model as one expects, because fine tuning via gradient descent has been made increasingly difficult with each added nonlinear feature dimension.  For example with polynomials - as indicated in our example here - the problem typically gets *worse* as we add more monomial terms since monomials exponentially *shrink* values in the interval $(-1,1)$ and exponentially *grow* essentially all other values (i.e., those with absolute value greater than 1).

---

Now when we take polynomial features we have some values that will grow exponentially *larger*.  So not even our standard normalization of the input can save us from ourselves in this case.  We need to normalize the nonlinear features themselves.  We normalize each by subtracting off their mean and dividing by their standard deviation

\begin{equation}
f_u \left(x_p \right) \longleftarrow \frac{f_u \left(x_p \right) - \mu_{f_u}}{\sigma_{f_u}}
\end{equation}

where

\begin{array}
\
\mu_{f_u} = \frac{1}{P}\sum_{p=1}^{P}f_u\left(x_p \right) \\
\sigma_{f_u} = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f_u\left(x_p \right) - \mu_{f_u} \right)^2}.
\end{array}

Normalizing both the original input and the nonlinear features we would likewise replace each feature with

\begin{equation}
f_u \left(x_p \right) \longleftarrow \frac{f_u \left(\frac{x_p - \mu_{x}}{\sigma_{x}} \right) - \mu_{f_u}}{\sigma_{f_u}}.
\end{equation}

With this input/feature normalization we can write our normalized ``predict_normalized`` function using this notation as

\begin{equation}
\text{predict_normalized}\left(x,\omega\right) = w_0 + w_1\left(\frac{f_1\left(\frac{x - \mu_x}{\sigma_x} \right) - \mu_{f_1}}{\sigma_{f_1}}\right) + w_2\left(\frac{f_2\left(\frac{x - \mu_x}{\sigma_x} \right) - \mu_{f_2}}{\sigma_{f_2}}\right) + \cdots + w_U\left(\frac{f_U\left(\frac{x - \mu_x}{\sigma_x} \right) - \mu_{f_U}}{\sigma_{f_U}}\right).
\end{equation} 

Now every weight in our model (with the exception of the bias of course) is attached to a normalized distribution of data.  Note in the instance our input $\mathbf{x}$ is in general $N$ dimensional we normalize along each coordinate, giving the general normalized update

\begin{equation}
\text{predict_normalized}\left(\mathbf{x},\omega\right) = w_0 + w_1\left(\frac{f_1\left(\frac{x_1 - \mu_{x_1}}{\sigma_{x_1}},\frac{x_2 - \mu_{x_2}}{\sigma_{x_2}},...,\frac{x_N - \mu_{x_N}}{\sigma_{x_N}} \right) - \mu_{f_1}}{\sigma_{f_1}}\right) + \cdots + w_U\left(\frac{f_U\left(\frac{x_1 - \mu_{x_1}}{\sigma_{x_1}},\frac{x_2 - \mu_{x_2}}{\sigma_{x_2}},...,\frac{x_N - \mu_{x_N}}{\sigma_{x_N}}  \right) - \mu_{f_U}}{\sigma_{f_U}}\right).
\end{equation} 

where $\mu_{x_n}$ and $\sigma_{x_n}$ are the mean and standard deviation of the data along the $n^{th}$ dimension of the input.

#### <span style="color:#a50e3e;">Example 2. </span>  The distribution of normalized polynomial features for a toy dataset

Here we normalize the input data and then corresponding polynomial featurese for the noisy sinusoidal dataset in Example 1.  Following we visualize the distribution of the first $U = 20$ normalized polynoimal features, mirroring the plot shown in the previous example.

In [7]:
# our normalization function
def normalize(data,data_mean,data_std):
    normalized_data = (data - data_mean)/data_std
    return normalized_data

In [8]:
# compute the mean and standard deviation of our input
x_means = np.mean(x,axis = 0)
x_stds = np.std(x,axis = 0)

# normalize data using the function above
x_normed = normalize(x,x_means,x_stds)

In [9]:
# make degree 5 polynomial features
U = 20
f = compute_features(x_normed,U)

# normalize polynomial features
f_means = np.mean(f,axis = 0)
f_stds = np.std(f,axis = 0)

# normalize features using the function above
f_normed = normalize(f,f_means,f_stds)

In [10]:
# show a plot of the distribution of each feature
title = 'distribution of first 20 normalized polynomial features'
labels = [r'$f_{' + str(n+1) + '}$' for n in range(U)]
plotter_demo.feature_distributions(f_normed,title,labels=labels)

<IPython.core.display.Javascript object>

Clearly normalizing does not completely fix the problem of uneven feature distributions here, as the *exponential* shrinking / enlarging property of polynomial terms is too powerful to be completely counteracted by mere normalization (this is indeed one of their major practical flaws).  However even so, normalization aids the convergence of gradient descent considerably - as we will see in the next example. 

#### <span style="color:#a50e3e;">Example 3. </span>  Unnormalized versus normalized polynomial features and gradient descent convergence

In this example we use a degree $U = 5$ polynomial model to fit our toy dataset and compare how rapidly the model can be properly tuned when we use the raw nonlinear features themselves versus when we normalize both the input data and features.  Starting with the unnormalized case, in the next cell we have our ``predict`` and ``least_squares`` implementations.  

In [11]:
# number of polynomial features to use
U = 5

# our predict function 
def predict(x,w):        
    # compute our current set of features
    f = compute_features(x,U)  
    print(f)
    # compute linear model
    vals = w[0] + np.dot(f,w[1:])
    return vals

# least squares
least_squares = lambda w: np.sum((predict(x,w) - y)**2)

In the next cell we run gradient descent to fit our polynomial model using unnormalized gradient descent with a steplength parameter $\alpha$ of the form $10^{-\gamma}$ where $\gamma$ is the smallest positive integer that produces convergence with an initial point $\mathbf{w}^0 = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$ at the origin (all zeros).

In [12]:
# parameters of gradient descent
alpha = 10**(-2); max_its = 1000; beta = 0; w_init = np.zeros((U+1,1));
w_init = 0.1*np.random.randn(U+1,1)

# run gradient descent, create cost history (for cost function plot comparison) associated with output weight history
weight_history_1 = gradient_descent(least_squares,w_init,alpha,max_its,beta,version = 'unnormalized')
cost_history_1 = [least_squares(v) for v in weight_history_1]

[[  6.86080000e-01   4.70705766e-01   3.22941812e-01   2.21563919e-01
    1.52010573e-01]
 [  5.06790000e-01   2.56836104e-01   1.30161969e-01   6.59647844e-02
    3.34302931e-02]
 [  3.45450000e-01   1.19335702e-01   4.12245184e-02   1.42410099e-02
    4.91955687e-03]
 [  7.58100000e-01   5.74715610e-01   4.35691904e-01   3.30298032e-01
    2.50398938e-01]
 [  1.90300000e-01   3.62140900e-02   6.89154133e-03   1.31146031e-03
    2.49570898e-04]
 [  8.13360000e-01   6.61554490e-01   5.38081960e-01   4.37654343e-01
    3.55970536e-01]
 [  9.53990000e-01   9.10096920e-01   8.68223361e-01   8.28276404e-01
    7.90167407e-01]
 [  1.73580000e-01   3.01300164e-02   5.22996825e-03   9.07817888e-04
    1.57579029e-04]
 [  1.45150000e-01   2.10685225e-02   3.05809604e-03   4.43882640e-04
    6.44295652e-05]
 [  3.55310000e-01   1.26245196e-01   4.48561806e-02   1.59378495e-02
    5.66287732e-03]
 [  5.97090000e-01   3.56516468e-01   2.12872418e-01   1.27103992e-01
    7.58925226e-02]
 [  4.1464

[[  6.86080000e-01   4.70705766e-01   3.22941812e-01   2.21563919e-01
    1.52010573e-01]
 [  5.06790000e-01   2.56836104e-01   1.30161969e-01   6.59647844e-02
    3.34302931e-02]
 [  3.45450000e-01   1.19335702e-01   4.12245184e-02   1.42410099e-02
    4.91955687e-03]
 [  7.58100000e-01   5.74715610e-01   4.35691904e-01   3.30298032e-01
    2.50398938e-01]
 [  1.90300000e-01   3.62140900e-02   6.89154133e-03   1.31146031e-03
    2.49570898e-04]
 [  8.13360000e-01   6.61554490e-01   5.38081960e-01   4.37654343e-01
    3.55970536e-01]
 [  9.53990000e-01   9.10096920e-01   8.68223361e-01   8.28276404e-01
    7.90167407e-01]
 [  1.73580000e-01   3.01300164e-02   5.22996825e-03   9.07817888e-04
    1.57579029e-04]
 [  1.45150000e-01   2.10685225e-02   3.05809604e-03   4.43882640e-04
    6.44295652e-05]
 [  3.55310000e-01   1.26245196e-01   4.48561806e-02   1.59378495e-02
    5.66287732e-03]
 [  5.97090000e-01   3.56516468e-01   2.12872418e-01   1.27103992e-01
    7.58925226e-02]
 [  4.1464

[[  6.86080000e-01   4.70705766e-01   3.22941812e-01   2.21563919e-01
    1.52010573e-01]
 [  5.06790000e-01   2.56836104e-01   1.30161969e-01   6.59647844e-02
    3.34302931e-02]
 [  3.45450000e-01   1.19335702e-01   4.12245184e-02   1.42410099e-02
    4.91955687e-03]
 [  7.58100000e-01   5.74715610e-01   4.35691904e-01   3.30298032e-01
    2.50398938e-01]
 [  1.90300000e-01   3.62140900e-02   6.89154133e-03   1.31146031e-03
    2.49570898e-04]
 [  8.13360000e-01   6.61554490e-01   5.38081960e-01   4.37654343e-01
    3.55970536e-01]
 [  9.53990000e-01   9.10096920e-01   8.68223361e-01   8.28276404e-01
    7.90167407e-01]
 [  1.73580000e-01   3.01300164e-02   5.22996825e-03   9.07817888e-04
    1.57579029e-04]
 [  1.45150000e-01   2.10685225e-02   3.05809604e-03   4.43882640e-04
    6.44295652e-05]
 [  3.55310000e-01   1.26245196e-01   4.48561806e-02   1.59378495e-02
    5.66287732e-03]
 [  5.97090000e-01   3.56516468e-01   2.12872418e-01   1.27103992e-01
    7.58925226e-02]
 [  4.1464

[[  6.86080000e-01   4.70705766e-01   3.22941812e-01   2.21563919e-01
    1.52010573e-01]
 [  5.06790000e-01   2.56836104e-01   1.30161969e-01   6.59647844e-02
    3.34302931e-02]
 [  3.45450000e-01   1.19335702e-01   4.12245184e-02   1.42410099e-02
    4.91955687e-03]
 [  7.58100000e-01   5.74715610e-01   4.35691904e-01   3.30298032e-01
    2.50398938e-01]
 [  1.90300000e-01   3.62140900e-02   6.89154133e-03   1.31146031e-03
    2.49570898e-04]
 [  8.13360000e-01   6.61554490e-01   5.38081960e-01   4.37654343e-01
    3.55970536e-01]
 [  9.53990000e-01   9.10096920e-01   8.68223361e-01   8.28276404e-01
    7.90167407e-01]
 [  1.73580000e-01   3.01300164e-02   5.22996825e-03   9.07817888e-04
    1.57579029e-04]
 [  1.45150000e-01   2.10685225e-02   3.05809604e-03   4.43882640e-04
    6.44295652e-05]
 [  3.55310000e-01   1.26245196e-01   4.48561806e-02   1.59378495e-02
    5.66287732e-03]
 [  5.97090000e-01   3.56516468e-01   2.12872418e-01   1.27103992e-01
    7.58925226e-02]
 [  4.1464

[[  6.86080000e-01   4.70705766e-01   3.22941812e-01   2.21563919e-01
    1.52010573e-01]
 [  5.06790000e-01   2.56836104e-01   1.30161969e-01   6.59647844e-02
    3.34302931e-02]
 [  3.45450000e-01   1.19335702e-01   4.12245184e-02   1.42410099e-02
    4.91955687e-03]
 [  7.58100000e-01   5.74715610e-01   4.35691904e-01   3.30298032e-01
    2.50398938e-01]
 [  1.90300000e-01   3.62140900e-02   6.89154133e-03   1.31146031e-03
    2.49570898e-04]
 [  8.13360000e-01   6.61554490e-01   5.38081960e-01   4.37654343e-01
    3.55970536e-01]
 [  9.53990000e-01   9.10096920e-01   8.68223361e-01   8.28276404e-01
    7.90167407e-01]
 [  1.73580000e-01   3.01300164e-02   5.22996825e-03   9.07817888e-04
    1.57579029e-04]
 [  1.45150000e-01   2.10685225e-02   3.05809604e-03   4.43882640e-04
    6.44295652e-05]
 [  3.55310000e-01   1.26245196e-01   4.48561806e-02   1.59378495e-02
    5.66287732e-03]
 [  5.97090000e-01   3.56516468e-01   2.12872418e-01   1.27103992e-01
    7.58925226e-02]
 [  4.1464

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


To evaluate new testing input using our trained model we simply plug them into our ``predict`` function (we will do this after running the normalized version of this experiment). 

Now we repeat the same experiment only we normalize both our input and polynomial features.  Here will see that far fewer gradient descent steps are needed to produce an even better fit.  In what follows we will implement the input and feature normalized prediction function precisely as stated in equation (5) above.  First, however, we need to compute the input and feature statistics (their mean and standard deviations). 

In [13]:
# compute the mean and standard deviation of the input
x_means = np.mean(x,axis = 0)
x_stds = np.std(x,axis = 0)

# normalize input using the statistics above
x_normed = normalize(x,x_means,x_stds)

# compute features of our normalized input
f = compute_features(x_normed,U)  

# compute the mean and standard deviation of our features
f_means = np.mean(f,axis = 0)
f_stds = np.std(f,axis = 0)

With these statistics in hand, we can then implement our input/feature normalized ``predict`` function.  Again this implementation - while a direct translation of the ``predict_normalized`` formula in equation (5) - is inefficient: since polynomials have no internal parameters we can pre-compute the normalized features on the training input beforehand (instead of re-computing these whenever ``predict_normalized`` is called), but for our small toy dataset this overhead will not be prohibitive.

In [14]:
# our predict function 
def predict_normalized(x,w):   
    # normalize input
    x_normed = normalize(x,x_means,x_stds)
    
    # compute features of normalized input
    f = compute_features(x_normed,U)  

    # normalize the training data features 
    f_normed = normalize(f,f_means,f_stds)
    
    # compute linear model
    vals = w[0] + np.dot(f_normed,w[1:])
    return vals

# least squares
least_squares = lambda w: np.sum((predict_normalized(x,w) - y)**2)

Now we run gradient descent using the same parameters as previously, though note a smaller steplength parameter $\alpha$ (of our generic form $10^{-\gamma}$ had to be used.

In [15]:
# parameters of gradient descent
alpha = 10**(-3); max_its = 1000; beta = 0; w_init = w_init;

# run gradient descent, create cost history (for cost function plot comparison) associated with output weight history
weight_history_2 = gradient_descent(least_squares,w_init,alpha,max_its,beta,version = 'unnormalized')
cost_history_2 = [least_squares(v) for v in weight_history_2]

With both experiments now complete we can compare the cost function history of each gradient descent run.  We do this in the cell below.  In the normalized case we converge much more rapidly than the unnormalized case.

In [16]:
# plot the cost function history for our current run of gradient descent
histories = [cost_history_1, cost_history_2]
labels = ['run on unnormalized features','run on normalized features']
plotter_demo.compare_regression_histories(histories,start=100,labels=labels)

<IPython.core.display.Javascript object>

Now we can evaluate either version of the degree 5 model with new testing data.  Doing this for a fine sampling of input points along the interval where our toy data is defined we can visualize our nonlinear fit to the data.  Clearly (from the cost history plot above) the normalized version will produce the better fit - and indeed it does.

In [17]:
# compare final fits using unnormalized and normalized predictors
plotter_demo.compare_regression_fits(x,y,predict,predict_normalized,weight_history_1[-1],weight_history_2[-1],title1 = 'fit using unnormalized predictor',title2 = 'fit using normalized predictor')

<IPython.core.display.Javascript object>

[[ -6.95182000e-02   4.83278013e-03  -3.35966176e-04   2.33557638e-05
   -1.62365066e-06]]
[[ -6.57839043e-02   4.32752207e-03  -2.84681298e-04   1.87274473e-05
   -1.23196460e-06]]
[[ -6.20496087e-02   3.85015394e-03  -2.38900545e-04   1.48236854e-05
   -9.19803876e-07]]
[[ -5.83153130e-02   3.40067574e-03  -1.98311470e-04   1.15645955e-05
   -6.74393004e-07]]
[[ -5.45810174e-02   2.97908746e-03  -1.62601624e-04   8.87496209e-06
   -4.84404460e-07]]
[[ -5.08467217e-02   2.58538911e-03  -1.31458561e-04   6.68423686e-06
   -3.39871532e-07]]
[[ -4.71124261e-02   2.21958069e-03  -1.04569831e-04   4.92653845e-06
   -2.32101178e-07]]
[[ -4.33781304e-02   1.88166220e-03  -8.16229883e-05   3.54065263e-06
   -1.53586892e-07]]
[[ -3.96438348e-02   1.57163364e-03  -6.23055842e-05   2.47003229e-06
   -9.79215519e-08]]
[[ -3.59095391e-02   1.28949500e-03  -4.63051712e-05   1.66279736e-06
   -5.97102867e-08]]
[[ -3.21752435e-02   1.03524629e-03  -3.33093015e-05   1.07173489e-06
   -3.44833309e-08]]

## 3.2  Single hidden layer perceptron / feedforward network feature transformations

Thus far we have seen how normalizing each input dimension of a dataset, as well each dimension of a fixed kernel feature transformation (e.g., each polynomial degree of a single input dataset), significantly aids gradient descent in terms of convergence to optimal parameter values.  In other words, thus far we have seen how normalizing the distribution of every single feature dimension that touches a weight makes gradient descent run faster (because it alleviates the long narrow valley problem).  

>  Thus far we have seen how normalizing every distribution that touches a weight makes gradient descent run faster (because it alleviates the long narrow valley problem).  This same intuition carries over to multilayer perceptrons - only now we also have *internal* weights that touch distributions of *activation outputs*.  

This intuition carries over completely from fixed features to parameterized networks - why should it not?  The only wrinkle is that since network features are recursive and parameterized that we will have to normalize not just the input data and final network features, but every dimension of a network that touches a single weight.  Because of the way a network is put together this consists of normalizing the output of each and every activation unit, since each of these touches an individual weight.  We will quickly review why this is the case below.

When employing network features, our ``predict`` function takes the form

\begin{equation}
\text{predict}\left(\mathbf{x},\omega\right) = w_0 + w_1\,f_1\left(\mathbf{x},\omega_1\right) + \cdots + w_U\,f_U\left(\mathbf{x},\omega_U\right)
\end{equation}

where each nonlinear feature $f_u$ is a recursively defined parameterized function (here $\omega_u$ is the set of $f_u$'s internal weights).  In the case of a single hidden layer these features take the form

\begin{equation}
f_u\left(x_1,x_2,\ldots,x_N, \omega_u\right)=a\left(w_{0,u}+\underset{n=1}{\overset{N}{\sum}}{w_{n,u}\,x_n}\right)
\end{equation}

where $a\left(\cdot\right)$ is called an *activation function* and is typically chosen from a list of elementary mathematical functions like *tanh*, the *rectified linear unit*, the *maxout* function, etc.,

#### <span style="color:#a50e3e;">Example 4. </span>  Unnormalized single layer feature distributions

Using a *tanh* activation function and $N = 1$, the single layer network feature above can be written as

\begin{equation}
f_u\left(x,\omega_u\right) = \text{tanh}\left(w_{0,u} + w_{1,u}x\right).
\end{equation}

As we did in the previous case with polynomials (in example 1 above) here we use the sinusoidal dataset and plot $U$ of these units $f_1,f_2,...,f_U$, randomly setting each's internal weights.  In the next ``Python`` cell we have a compact feature transformation function called ``compute_features`` that does just this: we plug in our entire set of input into these units and the $U$ activation outputs are returned as the $P\times 1$ *transformed* input vector $\mathbf{f}_u$

\begin{equation}
\mathbf{f}_u = \begin{bmatrix}
f_u\left(x_1,\omega_{f_u}\right) \\
f_u\left(x_2,\omega_{f_u}\right) \\
\vdots \\
f_u\left(x_P,\omega_{f_u}\right). \\
\end{bmatrix}
\end{equation}

In [18]:
def activation(t):
    # tanh activation
    f = np.tanh(t)
    return f

# functionality for creating single hidden layer features
def compute_features(x,w):
    # append one to each datapoint
    o = np.ones((np.shape(x)))
    xt = np.concatenate((o,x),axis = 1)
    
    # take inner product
    inp = np.dot(xt,w)
    
    # now take nonlinear activation
    a = activation(inp)
    return a

Now we can examine a distribution of $U = 20$ units (where each unit has randomized weights).  You can refresh this cell to see different results.

Predictably, we can see that these distributions will be d (with regards to the speed of gradient descent) because they are so dissimilar. 

In [19]:
# make degree 5 polynomial features
U = 20
w = np.random.randn(2,U)
f = compute_features(x,w)

# show a plot of the distribution of each feature
title = 'distribution of 20 single unit features'
labels = [r'$f_{' + str(n+1) + '}$' for n in range(U)]
plotter_demo.feature_distributions(f,title,labels=labels)

<IPython.core.display.Javascript object>

---

Studying the formula for the single hidden layer unit and the ``predict`` functions above, we can see that the weight $w_u$ acts on the output of the activation function $a\left(\cdot\right)$ over the entire set of inputs.  Therefore normalizing each of these feature / activation output distributions will aid in the gradient descent optimization (we never need to normalize a quantity with respect to either the outer bias $w_0$ or any internal bias weight since only one value ever touches these weights - the number $1$).  This is the complete analog of normalizing polynomial features.

Swapping out a normalized version of the single hidden layer feature / activation output in our ``predict`` function we would make the substitution 

\begin{equation}
f_u\left(x_1,x_2,\ldots,x_N, \omega_u\right)  \longleftarrow \frac{f_u\left(\frac{x_1 - \mu_{x_1}}{\sigma_{x_1}},\frac{x_2 - \mu_{x_2}}{\sigma_{x_2}},...,\frac{x_N - \mu_{x_N}}{\sigma_{x_N}},\omega_u\right) - \mu_{f_u}}{\sigma_{f_u}}
\end{equation}

where $\mu_{x_n}$ / $\sigma_{x_n}$ and $\mu_{f_u}$ / $\sigma_{f_u}$ are the mean and standard deviation of the $n^{th}$ dimension of the input and $u^{th}$ feature / activation output.  

However note importantly: *unlike* the case with fixed kernel functions, here our nonlinearities contain internal parameters.  Thus even if we normalize the activation outputs one time their distribution will change *every time the internal weights in $\omega_u$ do*, i.e., at every gradient descent step!  Another way to think about it: technically our feature statistics are function's of their unit's internal parameters too!  i.e., $\mu_{f_u} \longleftarrow \mu_{f_u}\left(\omega_u\right)$ and $\sigma{f_u} \longleftarrow \sigma{f_u}\left(\omega_u\right)$, thus they need to be re-computed with each change in $\omega_u$.  Because of this,it is very is convenient to simply add a normalization step directly into the feature computation module of a network architecture.

> Activation output distributions must be normalized every time the internal weights of the network change (i.e., at each gradient descent step).  Therefore it is convenient to add a normalization step directly into the feature computation module of a network architecture.

#### <span style="color:#a50e3e;">Example 5. </span>  Unnormalized single layer feature distributions

Here we normalize each feature / activation output distribution shown in the previous example.  Using the same random weights, this plot consists of $U = 20$ single layer tanh units.  In the ``Python`` cells below we i) normalize the input ii) define a new ``compuate_normalized_features`` function that adds an activation output normalization step to the previous feature computation module and iii) plot the normalized feature distributions.

In [20]:
# compute the mean and standard deviation of our input, then normalize the input
x_means = np.mean(x,axis = 0)
x_stds = np.std(x,axis = 0)

# normalize the input data
x_normed = normalize(x,x_means,x_stds)

In [21]:
# functionality for creating single hidden layer features
def compute_features_normalized(x,w):    
    # take linear combination of input
    inp = w[0] + np.dot(x,w[1:])
    
    # now take nonlinear activation
    a = activation(inp)

    # normalize activation output - first compute the mean 
    # and standard deviation of our input
    a_means = np.mean(a,axis = 0)
    a_stds = np.std(a,axis = 0)

    # now normalize
    a_normed = normalize(a,a_means,a_stds)
     
    return a_normed

In [22]:
# compute features / units on normalized input
f_normed = compute_features_normalized(x_normed,w)

# show a plot of the distribution of each feature
title = 'distribution of 20 single layer normalized unit features'
labels = [r'$f_{' + str(n+1) + '}$' for n in range(U)]
plotter_demo.feature_distributions(f_normed,title,labels=labels)

<IPython.core.display.Javascript object>

These distributions look much better than the unnormalized versions from the previous example - in that they are now much more similar to each other.  

---

Deeper network features are defined by recursing on the simple theme of i) taking a linear combination of the activation output and ii) applying the same activation to this linear combination ([see our notes on multilayer perceptrons for further details](https://jermwatt.github.io/mlrefined/blog_posts/Nonlinear_Supervised_Learning/Part_4_multi_layer_perceptrons.html)). 
).  For example, following this recipe a three layer network feature can be written generically as  

\begin{equation}
f_u^{\left(3\right)}\left(x_1,x_2,\ldots,x_N,\omega_u\right)=a^{(3)}_{\,}\left(w^{\left(3\right)}_{0,u}+\underset{i=1}{\overset{U_2}{\sum}}{w^{\left(3\right)}_{i,u}}\,a^{(2)}_{\,}\left(w^{\left(2\right)}_{0,i}+\underset{k=1}{\overset{U_1}{\sum}}{w^{\left(2\right)}_{k,u}}\,a^{(1)}_{\,}\left(w^{\left(1\right)}_{0,k}+\underset{n=1}{\overset{N}{\sum}}w^{\left(1\right)}_{n,k}\,x_n\right)\right)\right)
\end{equation}

In analogy to what we have seen so far - what quantities should be normalized here to avoid potential long narrow valleys in particular dimensions?  Answer: any distribution touching a weight.  Here this includes 

- the third and final activation output distribution $f_u^{(3)} = a_{\,}^{(3)}$ over all $P$ inputs, since this distribution touches the weight $w_u$


- the second layer activation output distribution $a_{\,}^{(2)}$ for each fixed value of $i$, since each such distribution touches the third layer weight $w^{\left(3\right)}_{i,u}$


- the first layer activation output distribution $a_{\,}^{(1)}$ for each fixed value of $k$, since each such distribution touches the second layer weight $w^{\left(2\right)}_{k,u}$


- the input distribution of the data $x_n$ along the $n^{th}$ dimension for each fixed value of $n$, since each such distribution touches the first layer weight $w^{\left(1\right)}_{n,k}$

This same pattern idea for deeper networks as well - if we want our cost function to suffer less from long narrow valleys when using a set of deep network features we want to make sure each and every weight-touching distribution is normalized.  This set of distributions is comprise of each dimension of the input and every activation output. 

> If we want our cost function to suffer less from long narrow valleys when using a set of deep network features we want to make sure each and every weight-touching distribution is normalized.  This set of distributions is comprise of each dimension of the input and every activation output.

**Note:** Given how helpful we have seen the concept of weight-touching distribution normalization in aiding the convergence of gradient descent for linear regression / classification, as well as with fixed kernel features, it seems quite intuitive that the same concept should be equally beneficial for the case of multilayer perceptrons.   While versions of this idea have surely been used for years by practitioners as a 'hack' for speeding up the training of networks, this notion (in the context of multilayer perceptrons) was only recently published formally in a journal (and it received enormous fan-fare, since the concept is so simple yet so very helpful) where it was given the name [Batch Normalization](https://arxiv.org/abs/1502.03167).

#### <span style="color:#a50e3e;">Example 6. </span>  Comparing unnormalized and normalized deep network activation output distributions

In this example we study every weight-touching distribution of a standard unnormalized and normalized 3 layer network - like the one shown in the equation above - using our sinusoidal dataset first introduced in example 1.  This network will have 7 units in the first layer, 10 in the second layer, and 5 in the third layer (these numbers were chosen at random) and we will use the tanh activation function.   To design / compute this network we will use the ``Python`` based architecture design / computation tools described in [Section 1.4 of our notes on the multilayer perceptron](https://jermwatt.github.io/mlrefined/blog_posts/Nonlinear_Supervised_Learning/Part_4_multi_layer_perceptrons.html).  

In the next cells we define our network architecture, and use the function ``initialize_network_weights`` discussed in the notes to create an a random set of weights for our network.

In [23]:
# A 3 layer network architecture
N = np.shape(x)[1]
M = np.shape(y)[1]
U_1 = 7                # number of units in layer 1
U_2 = 10               # number of units in layer 2
U_3 = 5                # number of units in layer 3

# the list defines our network architecture
layer_sizes = [N, U_1,U_2,U_3,M]

In [24]:
# create initial weights for arbitrary feedforward network
def initialize_network_weights(layer_sizes,scale):
    # container for entire weight tensor
    weights = []
    
    # loop over desired layer sizes and create appropriately sized initial 
    # weight matrix for each layer
    for k in range(len(layer_sizes)-1):
        # get layer sizes for current weight matrix
        U_k = layer_sizes[k]
        U_k_plus_1 = layer_sizes[k+1]

        # make weight matrix
        weight = scale*np.random.randn(U_k + 1,U_k_plus_1)
        weights.append(weight)

    # re-express weights so that w_init[0] = omega_inner contains all 
    # internal weight matrices, and w_init[1] = w contains weights of 
    # final linear combination in predict function
    w_init = [weights[:-1],weights[-1]]
    
    return w_init

In [25]:
# generate initial weights for our network
w_init = initialize_network_weights(layer_sizes,scale = 0.1)

Next we have our ``compute_features`` function, which contains our standard unnormalized network architecture. 

In [26]:
# fully evaluate our network features using the tensor of weights in omega_inner
def compute_features(x, omega_inner):
    # pad data with ones to deal with bias
    o = np.ones((np.shape(x)[0],1))
    a_padded = np.concatenate((o,x),axis = 1)
    
    # loop through weights and update each layer of the network
    for W in omega_inner:
        # output of layer activation
        a = activation(np.dot(a_padded,W))
                
        #  pad with ones (to compactly take care of bias) for next layer computation
        o = np.ones((np.shape(a)[0],1))
        a_padded = np.concatenate((o,a),axis = 1)
        
    return a_padded

A version of this function has been placed in the utilities backend file, where we collect the activation output distribution at each iteration of the loop.

With our initial weights defined we can then plot our weight-touching distributions, which we do in the next ``Python`` cell.  In the top panel the distribution of the input data is shown, below this in the second panel is the distribution of each unit in the first layer of the network - the distribution of the $k^{th}$ unit here is denoted $a_{k}^{(1)}$.  In the two panels that follow are the corresponding activation distributions for the second and third layers of the network - here the $i^{th}$ distribution of the second layer is denoted $a_{i}^{(2)}$, and the $u^{th}$ third layer unit distribution $a_{u}^{(3)}$.

In [53]:
# plot each weight-touching distribution
plotter_demo.activation_distributions(x,w_init)

<IPython.core.display.Javascript object>

These clearly look uneven.  By normalizing the input and adding in a normalization step in our computation of the features we can ameliorate this issue.  We do this in the next two ``Python`` cells.  We call our normalized network update function ``compute_features_normalized`` to keep in line with our previous functionality for simpler networks / features.

In [27]:
# compute the mean and standard deviation of our input, then normalize the input
x_means = np.mean(x,axis = 0)
x_stds = np.std(x,axis = 0)

# normalize the input data
x_normed = normalize(x,x_means,x_stds)

In [28]:
def compute_features_normalized(x, inner_weights):
    # pad data with ones to deal with bias
    o = np.ones((np.shape(x)[0],1))
    a_padded = np.concatenate((o,x),axis = 1)
        
    # loop through weights and update each layer of the network
    for W in inner_weights:
        # output of layer activation
        a = activation(np.dot(a_padded,W))
                
        ### normalize output of activation
        # compute the mean and standard deviation of the activation output distributions
        a_means = np.mean(a,axis = 0)
        a_stds = np.std(a,axis = 0)
        
        # normalize the activation outputs
        a_normed = normalize(a,a_means,a_stds)
            
        # pad with ones for bias
        o = np.ones((np.shape(a_normed)[0],1))
        a_padded = np.concatenate((o,a_normed),axis = 1)
    
    return a_padded

A version of this normalized architecture computer has been placed in the backend utilities file for this notebook, where we collect the normalized input and activation distributions at each iteration of the update loop.  These distributions are then plotted via the command below.   All of the formatting in this plot mirrors the previous unnormalized plot detailed above.

In [29]:
# plot the normalized input and activation output distributions for our network
plotter_demo.activation_distributions(x,w_init,kind = 'normalized')

<IPython.core.display.Javascript object>

These look far better in general than the unnormalized case, and so we can expect that gradient descent will have a much easier time optimizing the normalized network.

#### <span style="color:#a50e3e;">Example 7. </span>  Comparing unnormalized and normalized deep network activation output distributions for regression

In this example we compare the speed at which gradient descent can tune a 4 layer architecture, and the same architecture with its input and each activation output normalized, to fit our noisy sinusoidal dataset.  The network uses the `tanh` activation, has 10 units in each layer, and is defined / initialized in the next cell using the ``initialize_network_weights`` detailed in the previous example as well as the notes on multilayer perceptrons / feedforward networks.

In [30]:
# load data
csvname = 'noisy_sin_sample.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:,:-1]
y = data[:,-1:]

In [31]:
# A 3 layer network architecture
N = np.shape(x)[1]
M = np.shape(y)[1]
U_1 = 10                # number of units in layer 1
U_2 = 10               # number of units in layer 2
U_3 = 10                # number of units in layer 3

# the list defines our network architecture
layer_sizes = [N, U_1,U_2,U_3,M]

# generate initial weights for our network
w_init = initialize_network_weights(layer_sizes,scale = 0.1)

First we tune the raw network.  Below we have our standard ``predict`` and ``least_squares`` functions. 

In [32]:
# our predict function 
def predict(x,w):     
    # feature trasnsformations
    f = compute_features(x,w[0])
    
    # compute linear model
    vals = np.dot(f,w[1])
    print(vals)
    return vals

# least squares
least_squares = lambda w: np.sum((predict(x,w) - y)**2)

In the next cell we run gradient descent to fit our (unnormalized) network model using 2000 steps of (unnormalized) gradient descent with a steplength parameter $\alpha$ of the form $10^{-\gamma}$ where $\gamma$ is the smallest positive integer that produces convergence with our random initial point.

In [33]:
# parameters of gradient descent
alpha = 10**(-3); max_its = 2000; beta = 0; 

# run gradient descent, create cost history (for cost function plot comparison) associated with output weight history
weight_history_1 = gradient_descent(least_squares,w_init,alpha,max_its,beta,version = 'unnormalized')
cost_history_1 = [least_squares(v) for v in weight_history_1]

Autograd ArrayNode with value [[-0.26835513]
 [-0.26800579]
 [-0.26768857]
 [-0.26849447]
 [-0.26738114]
 [-0.26860099]
 [-0.26887046]
 [-0.26734788]
 [-0.26729127]
 [-0.26770803]
 [-0.26818217]
 [-0.26782493]
 [-0.26806683]
 [-0.26798334]
 [-0.26785081]
 [-0.26743024]
 [-0.26775494]
 [-0.26867703]
 [-0.26704828]
 [-0.26804663]
 [-0.26804074]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.2503408 ]
 [-0.24999885]
 [-0.24968838]
 [-0.25047719]
 [-0.24938751]
 [-0.25058146]
 [-0.25084526]
 [-0.24935496]
 [-0.24929955]
 [-0.24970743]
 [-0.25017149]
 [-0.24982184]
 [-0.2500586 ]
 [-0.24997689]
 [-0.24984717]
 [-0.24943555]
 [-0.24975334]
 [-0.2506559 ]
 [-0.24906176]
 [-0.25003883]
 [-0.25003307]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.23315609]
 [-0.23282096]
 [-0.23251669]
 [-0.23328977]
 [-0.23222186]
 [-0.23339198]
 [-0.23365056]
 [-0.23218996]
 [-0.23213567]
 [-0.23253536]
 [-0.23299015]
 [-0.23264748]
 [-0.23287951]
 [-0.23279943]
 [-0.2326723 ]
 [-0.232268

 [ 0.10009085]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.1002926 ]
 [ 0.10067806]
 [ 0.10102717]
 [ 0.10013855]
 [ 0.10136472]
 [ 0.10002067]
 [ 0.09972199]
 [ 0.1014012 ]
 [ 0.10146327]
 [ 0.10100578]
 [ 0.10048358]
 [ 0.10087721]
 [ 0.10061079]
 [ 0.10070279]
 [ 0.10084872]
 [ 0.10131087]
 [ 0.10095419]
 [ 0.09993646]
 [ 0.10172939]
 [ 0.10063305]
 [ 0.10063954]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10081085]
 [ 0.10119947]
 [ 0.10155145]
 [ 0.10065554]
 [ 0.10189177]
 [ 0.10053669]
 [ 0.10023556]
 [ 0.10192855]
 [ 0.10199113]
 [ 0.10152988]
 [ 0.1010034 ]
 [ 0.10140025]
 [ 0.10113165]
 [ 0.1012244 ]
 [ 0.10137154]
 [ 0.10183748]
 [ 0.10147787]
 [ 0.10045179]
 [ 0.10225943]
 [ 0.10115409]
 [ 0.10116064]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10130287]
 [ 0.10169467]
 [ 0.10204953]
 [ 0.10114629]
 [ 0.10239264]
 [ 0.10102648]
 [ 0.10072289]
 [ 0.10242972]
 [ 0.10249281]
 [ 0.10202778]
 [ 0.10149699]
 [ 0.1018971 ]
 [ 0.10162629]
 [ 0

Autograd ArrayNode with value [[ 0.11019205]
 [ 0.11086403]
 [ 0.11147289]
 [ 0.10992361]
 [ 0.1120617 ]
 [ 0.10971825]
 [ 0.10919813]
 [ 0.11212533]
 [ 0.1122336 ]
 [ 0.11143558]
 [ 0.11052495]
 [ 0.11121133]
 [ 0.11074673]
 [ 0.11090716]
 [ 0.11116166]
 [ 0.11196776]
 [ 0.1113456 ]
 [ 0.10957156]
 [ 0.11269778]
 [ 0.11078555]
 [ 0.11079687]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.11019545]
 [ 0.11087118]
 [ 0.11148343]
 [ 0.10992552]
 [ 0.11207552]
 [ 0.10971901]
 [ 0.109196  ]
 [ 0.1121395 ]
 [ 0.11224837]
 [ 0.11144591]
 [ 0.1105302 ]
 [ 0.11122041]
 [ 0.11075322]
 [ 0.11091454]
 [ 0.11117046]
 [ 0.11198105]
 [ 0.11135543]
 [ 0.10957151]
 [ 0.11271514]
 [ 0.11079226]
 [ 0.11080364]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.11019847]
 [ 0.11087794]
 [ 0.11149359]
 [ 0.10992703]
 [ 0.11208896]
 [ 0.10971939]
 [ 0.10919348]
 [ 0.11215331]
 [ 0.11226278]
 [ 0.11145586]
 [ 0.11053507]
 [ 0.11122911]
 [ 0.11075933]
 [ 0.11092154]
 [ 0.11117888]
 [ 0.111993

 [ 0.11087154]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10997125]
 [ 0.11097045]
 [ 0.11187617]
 [ 0.10957225]
 [ 0.1127523 ]
 [ 0.10926707]
 [ 0.10849446]
 [ 0.11284699]
 [ 0.11300812]
 [ 0.11182065]
 [ 0.11046618]
 [ 0.11148704]
 [ 0.11079599]
 [ 0.11103458]
 [ 0.11141315]
 [ 0.11261251]
 [ 0.11168679]
 [ 0.10904914]
 [ 0.11369894]
 [ 0.11085373]
 [ 0.11087055]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10996629]
 [ 0.11096989]
 [ 0.11187961]
 [ 0.10956552]
 [ 0.11275962]
 [ 0.109259  ]
 [ 0.10848299]
 [ 0.11285473]
 [ 0.11301657]
 [ 0.11182385]
 [ 0.1104634 ]
 [ 0.11148876]
 [ 0.11079466]
 [ 0.11103431]
 [ 0.11141454]
 [ 0.11261921]
 [ 0.1116894 ]
 [ 0.10904011]
 [ 0.11371044]
 [ 0.11085265]
 [ 0.11086955]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.1099613 ]
 [ 0.11096932]
 [ 0.11188305]
 [ 0.10955877]
 [ 0.11276694]
 [ 0.10925091]
 [ 0.10847149]
 [ 0.11286247]
 [ 0.11302502]
 [ 0.11182704]
 [ 0.1104606 ]
 [ 0.11149048]
 [ 0.11079332]
 [ 0

 [ 0.11074713]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10944742]
 [ 0.11088954]
 [ 0.11219743]
 [ 0.1088718 ]
 [ 0.11346305]
 [ 0.10843166]
 [ 0.10731785]
 [ 0.11359986]
 [ 0.11383265]
 [ 0.11211725]
 [ 0.11016164]
 [ 0.11163545]
 [ 0.11063769]
 [ 0.11098214]
 [ 0.11152875]
 [ 0.11326109]
 [ 0.11192392]
 [ 0.10811741]
 [ 0.11483083]
 [ 0.11072104]
 [ 0.11074533]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.1094405 ]
 [ 0.1108883 ]
 [ 0.11220134]
 [ 0.10886262]
 [ 0.11347195]
 [ 0.10842076]
 [ 0.10730258]
 [ 0.1136093 ]
 [ 0.11384301]
 [ 0.11212084]
 [ 0.11015753]
 [ 0.11163715]
 [ 0.11063546]
 [ 0.11098126]
 [ 0.11153002]
 [ 0.11326919]
 [ 0.11192675]
 [ 0.10810528]
 [ 0.11484514]
 [ 0.11071913]
 [ 0.11074352]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10943356]
 [ 0.11088705]
 [ 0.11220526]
 [ 0.10885341]
 [ 0.11348088]
 [ 0.10840981]
 [ 0.10728726]
 [ 0.11361877]
 [ 0.1138534 ]
 [ 0.11212445]
 [ 0.1101534 ]
 [ 0.11163884]
 [ 0.11063321]
 [ 0

Autograd ArrayNode with value [[ 0.10895348]
 [ 0.11079372]
 [ 0.11246336]
 [ 0.10821924]
 [ 0.11407952]
 [ 0.10765793]
 [ 0.10623799]
 [ 0.11425424]
 [ 0.11455156]
 [ 0.11236099]
 [ 0.10986475]
 [ 0.11174587]
 [ 0.11047227]
 [ 0.11091191]
 [ 0.11160965]
 [ 0.1138216 ]
 [ 0.11211415]
 [ 0.10725724]
 [ 0.1158265 ]
 [ 0.11057865]
 [ 0.11060965]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10894457]
 [ 0.11079188]
 [ 0.11246795]
 [ 0.10820751]
 [ 0.11409034]
 [ 0.10764405]
 [ 0.10621867]
 [ 0.11426574]
 [ 0.1145642 ]
 [ 0.11236518]
 [ 0.10985934]
 [ 0.1117477 ]
 [ 0.1104692 ]
 [ 0.11091052]
 [ 0.11161095]
 [ 0.11383142]
 [ 0.11211739]
 [ 0.10724181]
 [ 0.11584406]
 [ 0.11057598]
 [ 0.1106071 ]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10893561]
 [ 0.11079002]
 [ 0.11247255]
 [ 0.10819573]
 [ 0.1141012 ]
 [ 0.1076301 ]
 [ 0.10619927]
 [ 0.11427727]
 [ 0.11457689]
 [ 0.11236939]
 [ 0.1098539 ]
 [ 0.11174952]
 [ 0.1104661 ]
 [ 0.11090912]
 [ 0.11161225]
 [ 0.113841

 [ 0.11036187]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.1081117 ]
 [ 0.11060767]
 [ 0.11287371]
 [ 0.10711639]
 [ 0.11506813]
 [ 0.10635575]
 [ 0.10443259]
 [ 0.1153054 ]
 [ 0.11570919]
 [ 0.11273473]
 [ 0.10934745]
 [ 0.11189979]
 [ 0.11017156]
 [ 0.11076804]
 [ 0.1117149 ]
 [ 0.11471787]
 [ 0.11239967]
 [ 0.10581289]
 [ 0.11744086]
 [ 0.11031587]
 [ 0.11035794]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10809895]
 [ 0.1106047 ]
 [ 0.11287964]
 [ 0.10709975]
 [ 0.11508268]
 [ 0.10633613]
 [ 0.10440548]
 [ 0.11532089]
 [ 0.11572627]
 [ 0.11274011]
 [ 0.10933953]
 [ 0.11190189]
 [ 0.11016687]
 [ 0.11076569]
 [ 0.11171628]
 [ 0.11473105]
 [ 0.11240373]
 [ 0.10579116]
 [ 0.11746476]
 [ 0.11031175]
 [ 0.11035398]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10808614]
 [ 0.11060171]
 [ 0.11288558]
 [ 0.10708302]
 [ 0.1150973 ]
 [ 0.10631643]
 [ 0.10437825]
 [ 0.11533645]
 [ 0.11574342]
 [ 0.11274551]
 [ 0.10933158]
 [ 0.11190399]
 [ 0.11016216]
 [ 0

 [ 0.10989134]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10664133]
 [ 0.1102454 ]
 [ 0.11352058]
 [ 0.10520539]
 [ 0.1166943 ]
 [ 0.10410853]
 [ 0.10133761]
 [ 0.11703756]
 [ 0.11762173]
 [ 0.11331964]
 [ 0.10842518]
 [ 0.11211263]
 [ 0.10961538]
 [ 0.11047709]
 [ 0.1118454 ]
 [ 0.11618763]
 [ 0.11283522]
 [ 0.10332603]
 [ 0.12012735]
 [ 0.10982385]
 [ 0.10988461]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.1066209 ]
 [ 0.11024015]
 [ 0.11352917]
 [ 0.10517892]
 [ 0.11671632]
 [ 0.10407747]
 [ 0.10129496]
 [ 0.11706104]
 [ 0.11764768]
 [ 0.11332738]
 [ 0.10841226]
 [ 0.11211526]
 [ 0.10960747]
 [ 0.11047282]
 [ 0.1118469 ]
 [ 0.1162075 ]
 [ 0.11284091]
 [ 0.10329169]
 [ 0.12016392]
 [ 0.10981682]
 [ 0.10987784]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10660036]
 [ 0.11023486]
 [ 0.11353779]
 [ 0.10515231]
 [ 0.11673845]
 [ 0.10404623]
 [ 0.10125207]
 [ 0.11708463]
 [ 0.11767376]
 [ 0.11333515]
 [ 0.10839926]
 [ 0.1121179 ]
 [ 0.10959952]
 [ 0

 [ 0.10914522]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10443281]
 [ 0.1096569 ]
 [ 0.11441006]
 [ 0.10235371]
 [ 0.1190197 ]
 [ 0.10076659]
 [ 0.09676144]
 [ 0.11951842]
 [ 0.12036721]
 [ 0.11411832]
 [ 0.10701753]
 [ 0.11236618]
 [ 0.10874315]
 [ 0.10999299]
 [ 0.11197834]
 [ 0.1182836 ]
 [ 0.11341505]
 [ 0.0996349 ]
 [ 0.12400841]
 [ 0.10904547]
 [ 0.10913361]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10439908]
 [ 0.10964765]
 [ 0.11442318]
 [ 0.10231027]
 [ 0.11905455]
 [ 0.10071575]
 [ 0.09669199]
 [ 0.11955563]
 [ 0.12040843]
 [ 0.11413006]
 [ 0.10699591]
 [ 0.11236967]
 [ 0.10872962]
 [ 0.10998532]
 [ 0.11198001]
 [ 0.11831498]
 [ 0.11342348]
 [ 0.0995788 ]
 [ 0.12406682]
 [ 0.10903336]
 [ 0.10912191]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10436514]
 [ 0.10963835]
 [ 0.11443637]
 [ 0.10226656]
 [ 0.11908962]
 [ 0.10066459]
 [ 0.09662211]
 [ 0.11959306]
 [ 0.12044989]
 [ 0.11414187]
 [ 0.10697414]
 [ 0.11237318]
 [ 0.10871599]
 [ 0

 [ 0.10781255]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10063341]
 [ 0.10858874]
 [ 0.11583978]
 [ 0.09747248]
 [ 0.12287987]
 [ 0.09506179]
 [ 0.08898806]
 [ 0.12364187]
 [ 0.12493884]
 [ 0.11539444]
 [ 0.10456733]
 [ 0.11272056]
 [ 0.10719608]
 [ 0.1091011 ]
 [ 0.11212886]
 [ 0.12175528]
 [ 0.11432106]
 [ 0.09334415]
 [ 0.13050368]
 [ 0.10765681]
 [ 0.10779113]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10057312]
 [ 0.10857148]
 [ 0.11586193]
 [ 0.09739517]
 [ 0.12294039]
 [ 0.09497154]
 [ 0.08886533]
 [ 0.12370655]
 [ 0.1250106 ]
 [ 0.11541416]
 [ 0.10452829]
 [ 0.11272573]
 [ 0.10717126]
 [ 0.10908662]
 [ 0.11213082]
 [ 0.12180967]
 [ 0.11433494]
 [ 0.09324469]
 [ 0.13060582]
 [ 0.10763449]
 [ 0.10776954]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.10051238]
 [ 0.10855408]
 [ 0.11588422]
 [ 0.09731728]
 [ 0.12300135]
 [ 0.09488061]
 [ 0.0887417 ]
 [ 0.12377169]
 [ 0.12508288]
 [ 0.11543402]
 [ 0.10448895]
 [ 0.11273094]
 [ 0.10714626]
 [ 0

Autograd ArrayNode with value [[ 0.09381679]
 [ 0.10661269]
 [ 0.11830679]
 [ 0.0887452 ]
 [ 0.12967917]
 [ 0.08488297]
 [ 0.07517608]
 [ 0.13091077]
 [ 0.13300724]
 [ 0.11758792]
 [ 0.10013916]
 [ 0.11327332]
 [ 0.10436971]
 [ 0.10743815]
 [ 0.11231898]
 [ 0.12786171]
 [ 0.11585554]
 [ 0.08213427]
 [ 0.14200401]
 [ 0.10511163]
 [ 0.10532794]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.09370045]
 [ 0.10657878]
 [ 0.1183487 ]
 [ 0.0885964 ]
 [ 0.1297951 ]
 [ 0.08470953]
 [ 0.0749411 ]
 [ 0.13103472]
 [ 0.13314486]
 [ 0.11762516]
 [ 0.10006346]
 [ 0.11328255]
 [ 0.10432131]
 [ 0.10740958]
 [ 0.11232202]
 [ 0.12796579]
 [ 0.11588153]
 [ 0.08194334]
 [ 0.14220029]
 [ 0.10506801]
 [ 0.10528573]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.09358308]
 [ 0.10654457]
 [ 0.11839099]
 [ 0.08844627]
 [ 0.12991207]
 [ 0.08453455]
 [ 0.07470406]
 [ 0.13115979]
 [ 0.13328372]
 [ 0.11766273]
 [ 0.0999871 ]
 [ 0.11329186]
 [ 0.10427247]
 [ 0.10738075]
 [ 0.11232509]
 [ 0.128070

Autograd ArrayNode with value [[ 0.08051387]
 [ 0.10277025]
 [ 0.12319754]
 [ 0.07172917]
 [ 0.14311087]
 [ 0.06505555]
 [ 0.04835281]
 [ 0.14526898]
 [ 0.14894291]
 [ 0.12194004]
 [ 0.09149564]
 [ 0.11439703]
 [ 0.09886064]
 [ 0.10420984]
 [ 0.11272972]
 [ 0.13992654]
 [ 0.11891047]
 [ 0.06031528]
 [ 0.1647096 ]
 [ 0.10015348]
 [ 0.10053049]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.08026774]
 [ 0.1027003 ]
 [ 0.12329078]
 [ 0.07141414]
 [ 0.14336396]
 [ 0.06468846]
 [ 0.04785663]
 [ 0.14553941]
 [ 0.14924288]
 [ 0.1220232 ]
 [ 0.09133619]
 [ 0.11441984]
 [ 0.0987596 ]
 [ 0.10415135]
 [ 0.1127392 ]
 [ 0.14015405]
 [ 0.11896938]
 [ 0.05991137]
 [ 0.16513627]
 [ 0.10006271]
 [ 0.10044272]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.08001903]
 [ 0.10262967]
 [ 0.12338513]
 [ 0.07109578]
 [ 0.14361992]
 [ 0.06431748]
 [ 0.04735523]
 [ 0.14581291]
 [ 0.14954623]
 [ 0.12210736]
 [ 0.09117508]
 [ 0.11444297]
 [ 0.09865754]
 [ 0.10409231]
 [ 0.11274886]
 [ 0.140384

Autograd ArrayNode with value [[ 0.05131261]
 [ 0.09489669]
 [ 0.13520223]
 [ 0.03424245]
 [ 0.17463251]
 [ 0.02133474]
 [-0.01071157]
 [ 0.17890821]
 [ 0.18618663]
 [ 0.1327154 ]
 [ 0.07276415]
 [ 0.11781173]
 [ 0.08721088]
 [ 0.09772945]
 [ 0.11452094]
 [ 0.16832369]
 [ 0.1267266 ]
 [ 0.01220073]
 [ 0.21739576]
 [ 0.08975121]
 [ 0.09049225]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.05072902]
 [ 0.09474845]
 [ 0.13546209]
 [ 0.03349082]
 [ 0.17529373]
 [ 0.0204572 ]
 [-0.01189677]
 [ 0.17961296]
 [ 0.18696543]
 [ 0.13294999]
 [ 0.07239383]
 [ 0.11789507]
 [ 0.08698532]
 [ 0.09760976]
 [ 0.11457094]
 [ 0.16892067]
 [ 0.12690036]
 [ 0.01123474]
 [ 0.21849128]
 [ 0.08955119]
 [ 0.09029968]] and 1 progenitors(s)
Autograd ArrayNode with value [[ 0.05013821]
 [ 0.09459875]
 [ 0.13572595]
 [ 0.03272981]
 [ 0.17596434]
 [ 0.01956867]
 [-0.01309681]
 [ 0.18032767]
 [ 0.1877552 ]
 [ 0.13318823]
 [ 0.0720191 ]
 [ 0.11798004]
 [ 0.08675726]
 [ 0.09748898]
 [ 0.11462212]
 [ 0.169526

 [ 0.06885609]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.02161974]
 [ 0.07864688]
 [ 0.17267222]
 [-0.06025355]
 [ 0.26492956]
 [-0.08916865]
 [-0.15967574]
 [ 0.27491118]
 [ 0.29188197]
 [ 0.166853  ]
 [ 0.02747847]
 [ 0.13201305]
 [ 0.06082817]
 [ 0.08522595]
 [ 0.12433133]
 [ 0.25018811]
 [ 0.15284483]
 [-0.10945958]
 [ 0.36423212]
 [ 0.06671231]
 [ 0.06842978]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.02315465]
 [ 0.07834302]
 [ 0.17354604]
 [-0.06225033]
 [ 0.26695866]
 [-0.09150511]
 [-0.16281458]
 [ 0.27706435]
 [ 0.29424545]
 [ 0.16765372]
 [ 0.02654154]
 [ 0.13237635]
 [ 0.06030293]
 [ 0.08500404]
 [ 0.12459831]
 [ 0.25203351]
 [ 0.1534696 ]
 [-0.11203095]
 [ 0.36748   ]
 [ 0.06626008]
 [ 0.06799887]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.02470963]
 [ 0.07803633]
 [ 0.17443394]
 [-0.0642734 ]
 [ 0.26901818]
 [-0.09387228]
 [-0.16599403]
 [ 0.27924967]
 [ 0.29664395]
 [ 0.16846746]
 [ 0.02559281]
 [ 0.13274638]
 [ 0.0597717 ]
 [ 0

Autograd ArrayNode with value [[-0.1933607 ]
 [ 0.04792306]
 [ 0.27889901]
 [-0.28333366]
 [ 0.50315278]
 [-0.34921592]
 [-0.50375902]
 [ 0.52700365]
 [ 0.56730219]
 [ 0.26460516]
 [-0.07634576]
 [ 0.17891725]
 [ 0.00444262]
 [ 0.06402005]
 [ 0.16002116]
 [ 0.46774523]
 [ 0.23016695]
 [-0.39461972]
 [ 0.7348213 ]
 [ 0.01878025]
 [ 0.02296914]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.19643782]
 [ 0.0473919 ]
 [ 0.28087114]
 [-0.28730966]
 [ 0.50748067]
 [-0.3538248 ]
 [-0.509748  ]
 [ 0.53157238]
 [ 0.57227222]
 [ 0.26642323]
 [-0.07820698]
 [ 0.17980725]
 [ 0.00344289]
 [ 0.06366297]
 [ 0.16070591]
 [ 0.47171139]
 [ 0.2316128 ]
 [-0.39965033]
 [ 0.74136878]
 [ 0.01793473]
 [ 0.02216873]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.19952216]
 [ 0.04685945]
 [ 0.28284867]
 [-0.29129398]
 [ 0.51181782]
 [-0.35844229]
 [-0.51574419]
 [ 0.5361505 ]
 [ 0.5772517 ]
 [ 0.26824633]
 [-0.08007293]
 [ 0.18069989]
 [ 0.0024406 ]
 [ 0.06330507]
 [ 0.16139275]
 [ 0.475686

 [-0.02408292]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.37590788]
 [ 0.01658159]
 [ 0.39705412]
 [-0.51768333]
 [ 0.75671374]
 [-0.61921279]
 [-0.84830373]
 [ 0.79388849]
 [ 0.85611314]
 [ 0.37366244]
 [-0.187237  ]
 [ 0.23278371]
 [-0.05500001]
 [ 0.04312762]
 [ 0.20161344]
 [ 0.7010806 ]
 [ 0.31716012]
 [-0.68790817]
 [ 1.10587761]
 [-0.03141981]
 [-0.02452581]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.37761762]
 [ 0.01630109]
 [ 0.39817963]
 [-0.5198697 ]
 [ 0.75904222]
 [-0.62171963]
 [-0.85145153]
 [ 0.79632904]
 [ 0.85873485]
 [ 0.37470397]
 [-0.18827345]
 [ 0.23331023]
 [-0.05554787]
 [ 0.04294649]
 [ 0.20202438]
 [ 0.70323639]
 [ 0.31799681]
 [-0.69062012]
 [ 1.10913418]
 [-0.03187973]
 [-0.02495999]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.37929622]
 [ 0.01602652]
 [ 0.39928546]
 [-0.52201641]
 [ 0.76132696]
 [-0.62418096]
 [-0.85454164]
 [ 0.79872342]
 [ 0.86130639]
 [ 0.37572739]
 [-0.18929064]
 [ 0.23382819]
 [-0.05608508]
 [ 0

Autograd ArrayNode with value [[-0.44128032]
 [ 0.00763872]
 [ 0.44158105]
 [-0.60199839]
 [ 0.84362084]
 [-0.71622005]
 [-0.9703839 ]
 [ 0.8844907 ]
 [ 0.95256365]
 [ 0.41508896]
 [-0.22588729]
 [ 0.25490414]
 [-0.07439762]
 [ 0.03805258]
 [ 0.21933921]
 [ 0.78218976]
 [ 0.35096554]
 [-0.79299811]
 [ 1.22116648]
 [-0.04737153]
 [-0.03947022]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.44158018]
 [ 0.00762684]
 [ 0.44180776]
 [-0.60239923]
 [ 0.84398817]
 [-0.71668897]
 [-0.97098633]
 [ 0.88486656]
 [ 0.95295107]
 [ 0.41530324]
 [-0.22604766]
 [ 0.2550366 ]
 [-0.07446078]
 [ 0.03805938]
 [ 0.2194521 ]
 [ 0.78254209]
 [ 0.35114839]
 [-0.79351036]
 [ 1.22156493]
 [-0.04741766]
 [-0.03951139]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.44187299]
 [ 0.00761629]
 [ 0.4420299 ]
 [-0.60279116]
 [ 0.84434575]
 [-0.71714776]
 [-0.97157622]
 [ 0.88523218]
 [ 0.95332745]
 [ 0.41551332]
 [-0.22620365]
 [ 0.25516703]
 [-0.07452153]
 [ 0.03806708]
 [ 0.21956345]
 [ 0.782885

Autograd ArrayNode with value [[-0.45324397]
 [ 0.00922964]
 [ 0.45212199]
 [-0.61902594]
 [ 0.85594581]
 [-0.73669771]
 [-0.99758014]
 [ 0.89656837]
 [ 0.96404029]
 [ 0.42526676]
 [-0.23106791]
 [ 0.26233172]
 [-0.07507771]
 [ 0.04044861]
 [ 0.22603647]
 [ 0.79472943]
 [ 0.36015334]
 [-0.81567253]
 [ 1.227887  ]
 [-0.04728843]
 [-0.03916667]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.45330199]
 [ 0.0092615 ]
 [ 0.45219029]
 [-0.61912036]
 [ 0.85597689]
 [-0.73681707]
 [-0.99774676]
 [ 0.89659122]
 [ 0.96404752]
 [ 0.42533491]
 [-0.23107886]
 [ 0.26239273]
 [-0.07505952]
 [ 0.04048507]
 [ 0.22609455]
 [ 0.79477128]
 [ 0.36021986]
 [-0.81580766]
 [ 1.22780917]
 [-0.04726555]
 [-0.03914246]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.45335939]
 [ 0.00929345]
 [ 0.45225818]
 [-0.61921397]
 [ 0.85600707]
 [-0.7369355 ]
 [-0.99791216]
 [ 0.89661313]
 [ 0.96405375]
 [ 0.42540266]
 [-0.23108943]
 [ 0.26245354]
 [-0.07504113]
 [ 0.0405216 ]
 [ 0.22615247]
 [ 0.794812

Autograd ArrayNode with value [[-0.45714526]
 [ 0.01217203]
 [ 0.4572595 ]
 [-0.62572035]
 [ 0.85690499]
 [-0.74527784]
 [-1.00956303]
 [ 0.89671874]
 [ 0.96268069]
 [ 0.43045152]
 [-0.23134577]
 [ 0.26726322]
 [-0.07314634]
 [ 0.04372618]
 [ 0.2307934 ]
 [ 0.79676724]
 [ 0.36534529]
 [-0.82542253]
 [ 1.21858185]
 [-0.04500764]
 [-0.03678667]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.45718487]
 [ 0.01220645]
 [ 0.45731454]
 [-0.62578981]
 [ 0.85690667]
 [-0.74536702]
 [-1.00968549]
 [ 0.89671056]
 [ 0.96265422]
 [ 0.43050742]
 [-0.23134612]
 [ 0.26731802]
 [-0.07312277]
 [ 0.04376416]
 [ 0.23084656]
 [ 0.79678218]
 [ 0.36540201]
 [-0.82552362]
 [ 1.21846054]
 [-0.04498031]
 [-0.03675827]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.45722447]
 [ 0.01224086]
 [ 0.45736955]
 [-0.62585924]
 [ 0.85690828]
 [-0.74545612]
 [-1.00980779]
 [ 0.8967023 ]
 [ 0.96262765]
 [ 0.43056329]
 [-0.23134647]
 [ 0.2673728 ]
 [-0.07309919]
 [ 0.04380213]
 [ 0.23089972]
 [ 0.796797

Autograd ArrayNode with value [[-0.46011253]
 [ 0.01469687]
 [ 0.46132105]
 [-0.63084835]
 [ 0.85692397]
 [-0.75179856]
 [-1.01828418]
 [ 0.89599328]
 [ 0.96057595]
 [ 0.43457927]
 [-0.2314165 ]
 [ 0.27131467]
 [-0.07143298]
 [ 0.04651787]
 [ 0.23472327]
 [ 0.79778725]
 [ 0.36953853]
 [-0.83276342]
 [ 1.2093565 ]
 [-0.04301303]
 [-0.03471232]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.46015384]
 [ 0.01473058]
 [ 0.46137624]
 [-0.63091835]
 [ 0.85692359]
 [-0.75188654]
 [-1.0183983 ]
 [ 0.89598267]
 [ 0.9605464 ]
 [ 0.43463537]
 [-0.23141852]
 [ 0.27136961]
 [-0.07141049]
 [ 0.04655527]
 [ 0.2347765 ]
 [ 0.79780063]
 [ 0.36959549]
 [-0.83286167]
 [ 1.20922942]
 [-0.04298666]
 [-0.03468485]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.4601952 ]
 [ 0.01476426]
 [ 0.46143146]
 [-0.63098838]
 [ 0.85692321]
 [-0.75197454]
 [-1.01851235]
 [ 0.89597205]
 [ 0.96051682]
 [ 0.43469148]
 [-0.23142059]
 [ 0.27142455]
 [-0.07138803]
 [ 0.04659265]
 [ 0.23482974]
 [ 0.797814

Autograd ArrayNode with value [[-0.46317511]
 [ 0.01705358]
 [ 0.46528955]
 [-0.63592001]
 [ 0.85690018]
 [-0.75809299]
 [-1.02618627]
 [ 0.89523243]
 [ 0.95845094]
 [ 0.43861017]
 [-0.2316628 ]
 [ 0.27524245]
 [-0.06989721]
 [ 0.04914607]
 [ 0.23852283]
 [ 0.79875516]
 [ 0.37362494]
 [-0.8397311 ]
 [ 1.20022507]
 [-0.04119348]
 [-0.03281218]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.46322032]
 [ 0.01708621]
 [ 0.46534627]
 [-0.63599312]
 [ 0.85690001]
 [-0.75818251]
 [-1.02629468]
 [ 0.89522172]
 [ 0.95842074]
 [ 0.43866774]
 [-0.2316679 ]
 [ 0.27529822]
 [-0.06987654]
 [ 0.04918266]
 [ 0.23857668]
 [ 0.79876917]
 [ 0.37368319]
 [-0.83982925]
 [ 1.20009497]
 [-0.04116868]
 [-0.03278622]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.46326559]
 [ 0.01711882]
 [ 0.46540301]
 [-0.63606627]
 [ 0.85689985]
 [-0.75827204]
 [-1.02640301]
 [ 0.89521102]
 [ 0.95839054]
 [ 0.43872534]
 [-0.23167305]
 [ 0.27535401]
 [-0.0698559 ]
 [ 0.04921924]
 [ 0.23863054]
 [ 0.798783

Autograd ArrayNode with value [[-0.46670208]
 [ 0.01941769]
 [ 0.46955986]
 [-0.64147793]
 [ 0.85691189]
 [-0.76479634]
 [-1.03397863]
 [ 0.89445194]
 [ 0.95620523]
 [ 0.44294104]
 [-0.23218291]
 [ 0.27940907]
 [-0.06845253]
 [ 0.05181651]
 [ 0.24253687]
 [ 0.79983284]
 [ 0.37799689]
 [-0.84700406]
 [ 1.19048964]
 [-0.03943206]
 [-0.03096061]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.46675243]
 [ 0.01944888]
 [ 0.46961866]
 [-0.64155522]
 [ 0.85691244]
 [-0.7648881 ]
 [-1.03408057]
 [ 0.89444161]
 [ 0.95617478]
 [ 0.44300061]
 [-0.23219206]
 [ 0.27946595]
 [-0.06843428]
 [ 0.05185202]
 [ 0.24259153]
 [ 0.79984804]
 [ 0.37805687]
 [-0.84710248]
 [ 1.1903566 ]
 [-0.03940936]
 [-0.03093664]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.46680287]
 [ 0.01948005]
 [ 0.46967749]
 [-0.64163258]
 [ 0.85691301]
 [-0.76497988]
 [-1.0341824 ]
 [ 0.89443129]
 [ 0.95614432]
 [ 0.44306021]
 [-0.23220128]
 [ 0.27952285]
 [-0.06841607]
 [ 0.05188752]
 [ 0.24264621]
 [ 0.799863

 [-0.02938406]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47042245]
 [ 0.02154284]
 [ 0.47375438]
 [-0.64704422]
 [ 0.85698452]
 [-0.77129917]
 [-1.04086775]
 [ 0.89375059]
 [ 0.95407334]
 [ 0.44718698]
 [-0.23298131]
 [ 0.2834312 ]
 [-0.06727038]
 [ 0.05425836]
 [ 0.24639236]
 [ 0.80094696]
 [ 0.38226096]
 [-0.85389798]
 [ 1.18108379]
 [-0.03792625]
 [-0.02936245]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47047864]
 [ 0.02157227]
 [ 0.47381553]
 [-0.64712613]
 [ 0.85698612]
 [-0.77139326]
 [-1.04096229]
 [ 0.89374095]
 [ 0.95404294]
 [ 0.44724881]
 [-0.23299519]
 [ 0.28348927]
 [-0.06725501]
 [ 0.05429253]
 [ 0.24644787]
 [ 0.80096369]
 [ 0.38232288]
 [-0.85399648]
 [ 1.18094808]
 [-0.03790606]
 [-0.02934089]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47053492]
 [ 0.02160167]
 [ 0.47387671]
 [-0.64720811]
 [ 0.85698774]
 [-0.77148739]
 [-1.04105672]
 [ 0.89373132]
 [ 0.95401254]
 [ 0.44731068]
 [-0.23300914]
 [ 0.28354736]
 [-0.06723968]
 [ 0

 [-0.02791668]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47477593]
 [ 0.02360917]
 [ 0.47831958]
 [-0.65321571]
 [ 0.85715527]
 [-0.77825666]
 [-1.04743771]
 [ 0.89308668]
 [ 0.95186793]
 [ 0.45179767]
 [-0.23420359]
 [ 0.28771873]
 [-0.06627858]
 [ 0.05668871]
 [ 0.25047752]
 [ 0.80224079]
 [ 0.38686278]
 [-0.86107724]
 [ 1.17108532]
 [-0.03656714]
 [-0.02789826]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47483931]
 [ 0.02363622]
 [ 0.47838363]
 [-0.65330305]
 [ 0.85715846]
 [-0.77835319]
 [-1.04752266]
 [ 0.89307824]
 [ 0.95183801]
 [ 0.45186227]
 [-0.23422349]
 [ 0.28777817]
 [-0.06626698]
 [ 0.05672101]
 [ 0.25053395]
 [ 0.80225964]
 [ 0.38692703]
 [-0.86117531]
 [ 1.17094712]
 [-0.03655027]
 [-0.02787988]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.4749028 ]
 [ 0.02366322]
 [ 0.47844772]
 [-0.65339048]
 [ 0.85716168]
 [-0.77844977]
 [-1.04760747]
 [ 0.89306982]
 [ 0.95180809]
 [ 0.45192692]
 [-0.23424348]
 [ 0.28783763]
 [-0.06625544]
 [ 0

Autograd ArrayNode with value [[-0.47984247]
 [ 0.02552518]
 [ 0.48325165]
 [-0.6599913 ]
 [ 0.85747574]
 [-0.78558304]
 [-1.05336716]
 [ 0.89251835]
 [ 0.9496594 ]
 [ 0.4567649 ]
 [-0.23596682]
 [ 0.29223298]
 [-0.06558266]
 [ 0.05902207]
 [ 0.25474609]
 [ 0.80375706]
 [ 0.39178273]
 [-0.86838925]
 [ 1.1606388 ]
 [-0.03545607]
 [-0.02666806]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47991454]
 [ 0.02554906]
 [ 0.48331928]
 [-0.66008481]
 [ 0.85748125]
 [-0.78568184]
 [-1.05343967]
 [ 0.89251179]
 [ 0.94963056]
 [ 0.45683291]
 [-0.2359943 ]
 [ 0.29229397]
 [-0.06557596]
 [ 0.05905185]
 [ 0.25480351]
 [ 0.80377882]
 [ 0.39184979]
 [-0.86848596]
 [ 1.16049848]
 [-0.03544353]
 [-0.02665385]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.47998674]
 [ 0.02557288]
 [ 0.48338697]
 [-0.6601784 ]
 [ 0.8574868 ]
 [-0.78578066]
 [-1.05351199]
 [ 0.89250525]
 [ 0.94960174]
 [ 0.45690096]
 [-0.23602189]
 [ 0.29235499]
 [-0.06556932]
 [ 0.05908158]
 [ 0.25486095]
 [ 0.803800

Autograd ArrayNode with value [[-0.48561312]
 [ 0.02717222]
 [ 0.48847945]
 [-0.66724706]
 [ 0.85800003]
 [-0.79305917]
 [-1.05824658]
 [ 0.89211955]
 [ 0.94755863]
 [ 0.46201242]
 [-0.2383582 ]
 [ 0.29687225]
 [-0.06529727]
 [ 0.06113982]
 [ 0.25909203]
 [ 0.80552302]
 [ 0.39693328]
 [-0.87555037]
 [ 1.15005225]
 [-0.03470973]
 [-0.02578899]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.48569531]
 [ 0.02719206]
 [ 0.48855143]
 [-0.66734719]
 [ 0.85800869]
 [-0.79315963]
 [-1.0583034 ]
 [ 0.89211565]
 [ 0.9475316 ]
 [ 0.46208454]
 [-0.2383949 ]
 [ 0.29693504]
 [-0.06529666]
 [ 0.06116635]
 [ 0.25915054]
 [ 0.80554857]
 [ 0.39700372]
 [-0.87564432]
 [ 1.14991033]
 [-0.03470263]
 [-0.02578002]] and 1 progenitors(s)
Autograd ArrayNode with value [[-0.48577764]
 [ 0.02721184]
 [ 0.48862348]
 [-0.6674474 ]
 [ 0.8580174 ]
 [-0.79326011]
 [-1.05835998]
 [ 0.8921118 ]
 [ 0.94750459]
 [ 0.46215672]
 [-0.23843174]
 [ 0.29699786]
 [-0.06529615]
 [ 0.06119283]
 [ 0.25920906]
 [ 0.805574

 [-0.05286023]]
[[-0.04511507]
 [-0.04483249]
 [-0.04457624]
 [-0.04522791]
 [-0.04432818]
 [-0.04531422]
 [-0.04553279]
 [-0.04430135]
 [-0.0442557 ]
 [-0.04459195]
 [-0.04497511]
 [-0.04468635]
 [-0.04488183]
 [-0.04481435]
 [-0.04470726]
 [-0.04436777]
 [-0.04462983]
 [-0.04537587]
 [-0.04405985]
 [-0.0448655 ]
 [-0.04486074]]
[[-0.03749488]
 [-0.03721308]
 [-0.03695755]
 [-0.03760742]
 [-0.0367102 ]
 [-0.0376935 ]
 [-0.03791149]
 [-0.03668345]
 [-0.03663793]
 [-0.03697321]
 [-0.0373553 ]
 [-0.03706735]
 [-0.03726228]
 [-0.03719498]
 [-0.03708819]
 [-0.03674968]
 [-0.03701099]
 [-0.03775498]
 [-0.03644265]
 [-0.037246  ]
 [-0.03724125]]
[[-0.03023764]
 [-0.0299564 ]
 [-0.0297014 ]
 [-0.03034995]
 [-0.02945458]
 [-0.03043586]
 [-0.03065344]
 [-0.02942789]
 [-0.02938247]
 [-0.02971703]
 [-0.03009833]
 [-0.02981097]
 [-0.0300055 ]
 [-0.02993834]
 [-0.02983177]
 [-0.02949397]
 [-0.02975473]
 [-0.03049722]
 [-0.02918762]
 [-0.02998925]
 [-0.02998451]]
[[-0.02332681]
 [-0.02304594]
 [-0.0

 [ 0.11088784]]
[[ 0.11018754]
 [ 0.11096675]
 [ 0.11167287]
 [ 0.1098763 ]
 [ 0.1123558 ]
 [ 0.10963822]
 [ 0.10903531]
 [ 0.1124296 ]
 [ 0.11255519]
 [ 0.1116296 ]
 [ 0.11057354]
 [ 0.11136952]
 [ 0.11083072]
 [ 0.11101676]
 [ 0.11131191]
 [ 0.11224684]
 [ 0.11152524]
 [ 0.10946818]
 [ 0.11309359]
 [ 0.11087574]
 [ 0.11088886]]
[[ 0.11018493]
 [ 0.11096807]
 [ 0.11167776]
 [ 0.10987212]
 [ 0.11236414]
 [ 0.10963284]
 [ 0.1090269 ]
 [ 0.11243831]
 [ 0.11256453]
 [ 0.11163426]
 [ 0.11057288]
 [ 0.11137287]
 [ 0.11083136]
 [ 0.11101833]
 [ 0.11131497]
 [ 0.11225463]
 [ 0.11152938]
 [ 0.10946194]
 [ 0.11310566]
 [ 0.1108766 ]
 [ 0.11088979]]
[[ 0.11018222]
 [ 0.1109693 ]
 [ 0.11168256]
 [ 0.10986784]
 [ 0.11237239]
 [ 0.10962735]
 [ 0.10901838]
 [ 0.11244695]
 [ 0.1125738 ]
 [ 0.11163884]
 [ 0.11057212]
 [ 0.11137614]
 [ 0.1108319 ]
 [ 0.11101981]
 [ 0.11131795]
 [ 0.11226233]
 [ 0.11153344]
 [ 0.1094556 ]
 [ 0.11311766]
 [ 0.11087737]
 [ 0.11089062]]
[[ 0.11017941]
 [ 0.11097044]
 [ 0.1

[[ 0.09731759]
 [ 0.10763174]
 [ 0.11704549]
 [ 0.0932246 ]
 [ 0.12619305]
 [ 0.09010537]
 [ 0.08225623]
 [ 0.12718345]
 [ 0.12886927]
 [ 0.11646706]
 [ 0.10241582]
 [ 0.11299471]
 [ 0.10582495]
 [ 0.10829657]
 [ 0.1122265 ]
 [ 0.12473146]
 [ 0.115073  ]
 [ 0.08788417]
 [ 0.13610322]
 [ 0.10642263]
 [ 0.10659689]]
[[ 0.09723119]
 [ 0.10760665]
 [ 0.11707669]
 [ 0.09311399]
 [ 0.12627913]
 [ 0.08997637]
 [ 0.08208121]
 [ 0.12727548]
 [ 0.12897142]
 [ 0.11649479]
 [ 0.10235967]
 [ 0.11300165]
 [ 0.10578909]
 [ 0.10827545]
 [ 0.11222885]
 [ 0.12480876]
 [ 0.11509238]
 [ 0.08774211]
 [ 0.13624885]
 [ 0.10639034]
 [ 0.10656563]]
[[ 0.09714407]
 [ 0.10758135]
 [ 0.11710813]
 [ 0.09300247]
 [ 0.12636591]
 [ 0.08984631]
 [ 0.08190476]
 [ 0.12736825]
 [ 0.12907441]
 [ 0.11652274]
 [ 0.10230306]
 [ 0.11300865]
 [ 0.10575293]
 [ 0.10825415]
 [ 0.11223122]
 [ 0.12488669]
 [ 0.11511191]
 [ 0.08759889]
 [ 0.13639568]
 [ 0.10635777]
 [ 0.10653411]]
[[ 0.09705623]
 [ 0.10755584]
 [ 0.11713983]
 [ 0.09

Next, we perform the same experiment using the normalized form of the exact network.  Here the input, as well as each activation output, has been normalized according to its respective distribution.  Note here the normalized network is computed by the function ``compute_features_normalized``, which was defined in the previous example.  Note: here we input the normalized data into our prediction function.  To distinguish our ``least_squares`` function that takes in this normalized predictor, we call it ``least_squares_normalized``.

In [34]:
# compute the mean and standard deviation of our input, then normalize the input
x_means = np.mean(x,axis = 0)
x_stds = np.std(x,axis = 0)

# normalize the input data
x_normed = normalize(x,x_means,x_stds)

In [35]:
# our predict function 
def predict_normalized(x,w):     
    # feature trasnsformations
    f = compute_features_normalized(x,w[0])
    
    # compute linear model
    vals = np.dot(f,w[1])
    return vals

# least squares
least_squares_normalized = lambda w: np.sum((predict_normalized(x_normed,w) - y)**2)

Now we perform the same run of gradient descent as done previously, using the same initialization.

In [36]:
# parameters of gradient descent
alpha = 10**(-3); max_its =2000; beta = 0; 

# run gradient descent, create cost history (for cost function plot comparison) associated with output weight history
weight_history_2 = gradient_descent(least_squares_normalized,w_init,alpha,max_its,beta,version = 'unnormalized')
cost_history_2 = [least_squares_normalized(v) for v in weight_history_2]

With both runs complete we can now examine their cost histories to visualize just how much faster the gradient descent works on the normalized architecture.  The difference in convergence behavior is enormous.

In [37]:
# plot the cost function history for our current run of gradient descent
histories = [cost_history_1, cost_history_2]
labels = ['run on unnormalized architecture','run on normalized architecture']
plotter_demo.compare_regression_histories(histories,start=100,labels=labels)

<IPython.core.display.Javascript object>

Now we want to plot the fit provided by both runs.  But remember, as we saw in when discussing the normalization of fixed features, we need to be careful when doing this with our normalized architecture.  This is because the normalized version of the network *is normalized with respect to the training data* - in this case our noisy sinusoidal dataset.  Thus in order to properly evaluate test points - like a fine range of input values we can use to illustrate the fit of the normalized architecture - these test points must be normalized with respect to the same network statistics (i.e., the same input and activation output distribution normalizations) used to evaluate the training data.  

> In order to properly evaluate test points with our normalized architecture they must be normalized with respect to the same network statistics (i.e., the same input and activation output distribution normalizations) used for the training data.

So - practically speaking - to evaluate test points for a given set of weights (like the final weights learned by gradient descent) we must first pass our training data and this set of weights back through the network and collect all of the input / activation output means and standard deviations used in the normalization process.   Then when we evaluate the new test points normalization is performed with *these training statistics*.  

The function ``compute_features_normalized_testing`` function below performs both of these tasks.  Passing the training data and a set of weights through it we can collect the training statistics in the container called ``stats``.  Then we can use the same function to evaluate new test points by passing the test data, the same set of weights, and these statistics to the function.

In [39]:
def compute_features_normalized_testing(x, inner_weights,stats):
    '''
    An adjusted normalized architecture compute function that collects network statistics as the training data
    passes through each layer, and applies them to properly normalize test data.
    '''
    # are you using this to compute stats on training data (stats empty) or to normalize testing data (stats not empty)
    switch =  'testing'
    if np.size(stats) == 0:
        switch = 'training'
        
    # if no stats given collect directly from input
    x_means = 0
    y_means = 0
    if switch == 'training':
        x_means = np.mean(x,axis = 0)
        x_stds = np.std(x,axis = 0)
        stats.append([x_means,x_stds])
    elif switch == 'testing':
        x_means = stats[0][0]
        x_stds = stats[0][1]

    # normalize input
    x_normed = normalize(x,x_means,x_stds)
    
    # pad data with ones to deal with bias
    o = np.ones((np.shape(x_normed)[0],1))
    a_padded = np.concatenate((o,x_normed),axis = 1)
        
    # loop through weights and update each layer of the network
    c = 1
    for W in inner_weights:
        # output of layer activation
        a = activation(np.dot(a_padded,W))
                
        ### normalize output of activation
        a_means = 0
        a_stds = 0
        if switch == 'training':
            # compute the mean and standard deviation of the activation output distributions
            a_means = np.mean(a,axis = 0)
            a_stds = np.std(a,axis = 0)
            stats.append([a_means,a_stds])
        elif switch == 'testing':
            a_means = stats[c][0]
            a_stds = stats[c][1]
            
        # normalize the activation outputs
        a_normed = normalize(a,a_means,a_stds)
            
        # pad with ones for bias
        o = np.ones((np.shape(a_normed)[0],1))
        a_padded = np.concatenate((o,a_normed),axis = 1)
        c+=1
    
    return a_padded,stats

In the next cell we use the function above to compute the training statistics with respect to the best weights of the gradient descent run on our normalized architecture.

In [40]:
# get best weights from first run
best_ind = np.argmin(cost_history_1)
w1 = weight_history_1[best_ind]

# collect training normalization statistics for the final set of weights learned by gradient descent
best_ind = np.argmin(cost_history_2)
w2 = weight_history_2[best_ind]
a_padded,training_stats = compute_features_normalized_testing(x,w2[0],[])

Now to use our ``compute_features_normalized_testing`` function on new test data we create a testing prediction function that calls it, which we call ``predict_testing``.

In [41]:
# our predict function 
def predict_testing(x,w):     
    # feature trasnsformations
    f,stats = compute_features_normalized_testing(x,w[0],training_stats)
    
    # compute linear model
    vals = np.dot(f,w[1])
    return vals

Finally with this adjustment made to evaluate test points with our normalized architecture we can evaluate and compare both the original and normalized architectures - using their final weights learned from their respective gradient descent runs - to compare them on new test points.  Below we evaluate both architectures over a fine range of testing input to illustrate the final fit of each architecture.

As we saw earlier - gradient descent could minimize the normalized architecture substantially more than the original.  This fact is reflected in the quality of the nonlinear fit each model provides for the data - with the normalized version being significantly better.

In [42]:
# compare final fits using unnormalized and normalized predictors
plotter_demo.compare_regression_fits(x,y,predict,predict_testing,w1,w2,title1 = 'fit using unnormalized architecture',title2 = 'fit using normalized architecture')

<IPython.core.display.Javascript object>

With such a higly parameterized architecture one might suspect that we should be able overfit significantly to this dataset.  This will certainly occur if we continue to run (unnormalized) gradient descent to minimize e.g., the normalized least squares cost.  However using other first order tricks one can further speed up gradient descent - like e.g., the normalized gradient descent step or momentum - and overfit rather quickly here.

#### <span style="color:#a50e3e;">Example 8. </span>  Comparing unnormalized and normalized deep network activation output distributions for classification

In this example we compare the speed at which gradient descent converges when using a standard and normalized architecture to perform nonlinear classification, using the dataset shown below.

In [43]:
# load data
csvname = '2_eggs.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:,:-1]
y = data[:,-1:]

# plot everything
plotter_demo = plotter.Visualizer()
plotter_demo.plot_classification_data(x,y)

<IPython.core.display.Javascript object>

Here we will use a 3 layer network, with 10 units in each layer.  This network is defined and initialized in the cell below.

In [77]:
# A 3 layer network architecture
N = np.shape(x)[1]
M = np.shape(y)[1]
U_1 = 10                # number of units in layer 1
U_2 = 10                # number of units in layer 2
U_3 = 10                # number of units in layer 3

# the list defines our network architecture
layer_sizes = [N, U_1,U_2,U_3,M]

# generate initial weights for our network
w_init = initialize_network_weights(layer_sizes,scale = 0.8)

Below we define our ``softmax`` cost function, as well as the counting cost (to count the number of misclassifications at each step of gradient descent) called ``count``.

In [78]:
# softmax cost function
softmax = lambda w: np.sum(np.log(1 + np.exp(-y*predict(x,w))))
count = lambda w: 0.25*np.sum((np.sign(predict(x,w)) - y)**2)

Now we run (unnormalized) gradient descent for 200 iterations to tune our original architecture.  Here we use our standard steplength value $\alpha$ of the form $10^{-\gamma}$ for the smallest positive integer $\gamma$ that provides convergence.

In [79]:
# parameters of gradient descent
alpha = 10**(-3); max_its = 200; beta = 0; version = 'unnormalized'
cost = softmax

# run gradient descent, create cost history (for cost function plot comparison) associated with output weight history
weight_history_1 = gradient_descent(cost,w_init,alpha,max_its,beta,version = version)
count_history_1 = [count(v) for v in weight_history_1]
cost_history_1 = [softmax(v) for v in weight_history_1]

Next we do the same for our unnormalized version of the network.  In the next cell we normalize the input data, and define softmax and counting cost functions ``softmax_normalized`` and ``count_normalized`` that take in this normalized data.

In [80]:
# compute the mean and standard deviation of our input, then normalize the input
x_means = np.mean(x,axis = 0)
x_stds = np.std(x,axis = 0)

# normalize the input data
x_normed = normalize(x,x_means,x_stds)

# softmax cost function
softmax_normalized = lambda w: np.sum(np.log(1 + np.exp(-y*predict_normalized(x_normed,w))))
count_normalized = lambda w: 0.25*np.sum((np.sign(predict_normalized(x_normed,w)) - y)**2)

Now we make the analagous run of gradient descent to tune the parameters of our normalized architecture.

In [81]:
# parameters of gradient descent
alpha = 10**(-3); max_its = 200; beta = 0; version = 'unnormalized'
cost = softmax_normalized

# run gradient descent, create cost history (for cost function plot comparison) associated with output weight history
weight_history_2 = gradient_descent(cost,w_init,alpha,max_its,beta,version = version)
count_history_2 = [count_normalized(v) for v in weight_history_2]
cost_history_2 = [softmax_normalized(v) for v in weight_history_2]

With both gradient descent runs complete we can compare the cost function value and number of misclassifications at each step of the runs.  This is plotted below.  As with the previous regression example, the difference in terms of convergence is enormous, with gradient descent converging much more rapidly with the normalized architecture.

In [82]:
# plot cost function history
start = 0   # at which iteration to begin plotting the cost function history
count_histories = [count_history_1,count_history_2]
cost_histories = [cost_history_1,cost_history_2]

labels = ['unnormalized network ','normalized network']
plotter_demo.compare_classification_histories(count_histories,cost_histories,start,labels = labels)

<IPython.core.display.Javascript object>

Since the unnormalized fit is clearly quite poor after only 200 iterations of gradient descent, we just examine the fit of the tuned normalized architecture.  As discussed in the previous example, in order to visualize this fit we must extract the normalization statistics from our network when passing the training data (and desired set of weights) through it.  We do this in the next cell, employing the same functionality for doing so introduced in the previous example.

In [83]:
# extract normalization statistics from the network over our training data
best_ind = np.argmin(cost_history_2)
w2 = weight_history_2[best_ind]
a_padded,training_stats = compute_features_normalized_testing(x,w2[0],[])

To evaluate new test points using this set of training statistics we define a ``predict_testing`` function below.

In [84]:
# our predict function 
def predict_testing(x,w):     
    # feature trasnsformations
    f,stats = compute_features_normalized_testing(x,w[0],training_stats)
    
    # compute linear model
    vals = np.dot(f,w[1])
    return vals

With our statistics computed and testing predictor constructed we can now properly evaluate test points with our normalized architecture.  In particular we can evaluate a fine set of test points over the input range of our dataset to produce a visualization of the normalized architecture's nonlinear fit - which we do below.  In the left panel the fit is shown 'from above', while in the right panel the same fit is shown from the regression point of view 'from the side'.  The fit is a good one, producing zero misclassifications.

In [85]:
# plot the dataset along with classification boundary (in the left panel) and corresponding surface fit (in the right panel)
plotter_demo.plot_classification_data(x,y,predict = predict_testing,weights = weight_history_2[-1])

<IPython.core.display.Javascript object>