The steepest-descent direction is optimal only for infinitesimally small steps. With finite steps it can turn into an ascent direction, or cause oscillation in high-curvature regions of the loss function.

In this section, we will discuss several clever learning strategies that work well in these ill-conditioned settings.

# Learning Rate Decay

A constant learning rate is not desirable because:
- A low learning rate used early on causes the algorithm to take too long to come even close to an optimal solution.
- A large initial learning rate allows the algorithm to come reasonably close to a good solution at first; however, the algorithm will then oscillate around that point for a very long time, or diverge in an unstable way.

Allowing the learning rate to decay over time naturally achieves the desired learning-rate adjustment and avoids both problems.

There are several common decay functions, listed below. In these functions, the learning rate $a_t$ is expressed in terms of the initial learning rate $a_0$, the epoch $t$, and the decay parameter $k$.
- Exponential decay
$$a_t = a_0 \exp(-k\cdot t)$$
- Inverse decay
$$a_t = \frac{a_0}{1+k\cdot t}$$
- Step decay (multiply the learning rate by a factor $k\in(0,1)$ every $T$ epochs)
$$a_t = a_0\cdot k^{\lfloor t/T \rfloor}$$
- Loss-based decay: track the loss on a held-out portion of the training data set, and reduce the learning rate whenever this loss stops improving.
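
The three closed-form schedules above can be sketched in a few lines of Python. The function names and the argument order are assumptions made for this illustration; only the formulas come from the text.

```python
import math


def exponential_rate(a0, k, t):
    """Exponential decay: a_t = a0 * exp(-k * t)."""
    return a0 * math.exp(-k * t)


def inverse_rate(a0, k, t):
    """Inverse decay: a_t = a0 / (1 + k * t)."""
    return a0 / (1.0 + k * t)


def step_rate(a0, k, t, T):
    """Step decay: a_t = a0 * k ** floor(t / T), with k in (0, 1)."""
    return a0 * k ** (t // T)
```

For example, with $a_0 = 0.01$ and $k = 0.001$, `inverse_rate(0.01, 0.001, 1000)` has halved the rate to $0.005$, while `step_rate(0.01, 0.5, 400, 200)` has halved it twice.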

--------------------

# Momentum-Based Learning

Momentum-based techniques recognize that zigzagging is a result of highly contradictory steps that cancel one another out and reduce the effective size of the steps in the correct direction. From this point of view, it makes a lot more sense to move in an "averaged" direction of the last few steps, so that the zigzagging is smoothed out.

$$V\Leftarrow \beta V-\alpha\frac{\partial L}{\partial W};\qquad W\Leftarrow W+V$$

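where $V$ denotes the corrected direction (the velocity): the current gradient step $-\alpha\frac{\partial L}{\partial W}$ is corrected by adding the previous direction weighted by the momentum parameter $\beta\in(0,1)$.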

Advantages of momentum-based learning

- **Accelerate learning**. From the expression, we see that if the loss surface is a narrow valley with one consistent downhill direction, plain gradient descent easily oscillates across the valley, whereas the momentum-based algorithm damps the oscillations and takes larger steps in the consistent direction.
- **Help avoid local minima**. The smoothing term $\beta V$, which makes use of the previous update vector, is also an overshoot term. With this term, the learning process may jump over shallow local minima.
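
A minimal NumPy sketch of the momentum update, compared with plain gradient descent. The ill-conditioned quadratic toy loss, the function names, and the values of $\alpha$, $\beta$ and the step count are all assumptions chosen for illustration.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
# The second coordinate has 25 times the curvature of the first.
CURVATURE = np.array([1.0, 25.0])


def loss(w):
    return 0.5 * np.sum(CURVATURE * np.asarray(w) ** 2)


def grad(w):
    return CURVATURE * w


def gradient_descent(w0, alpha=0.03, steps=100):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


def momentum_descent(w0, alpha=0.03, beta=0.8, steps=100):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - alpha * grad(w)  # V <= beta*V - alpha*dL/dW
        w = w + v                       # W <= W + V
    return w
```

On this toy problem the learning rate is limited by the high-curvature coordinate, so plain gradient descent crawls along the flat one; with the same $\alpha$, the momentum variant ends much closer to the minimum after the same number of steps.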



## Nesterov Momentum

The traditional momentum method adds the smoothing term after computing the partial derivatives at the current weights. Nesterov momentum instead adds the smoothing term first and then computes the partial derivatives at the modified (look-ahead) weights. Note that the only difference from the standard momentum method is where the gradient is computed.

Traditional momentum method

$$W_t\xrightarrow[\text{derivatives on }W_t]{-\alpha\frac{\partial L(W)}{\partial W}}W_t'\xrightarrow[\text{previous vector (smoothing term)}]{+\beta V}W_{t+1}$$

Nesterov Momentum method

$$W_t\xrightarrow[\text{previous vector (smoothing term)}]{+\beta V}W_t'\xrightarrow[\text{derivatives on }W_{t}' = W_t+\beta V]{-\alpha\frac{\partial L(W_{t}')}{\partial W}}W_{t+1}$$

The update may be computed as follows:

$$V\Leftarrow \beta V-\alpha\frac{\partial L(W+\beta V)}{\partial W};\qquad W \Leftarrow W+V$$

The Nesterov momentum method **converges faster when the weights approach the optimum**, because the gradient $\frac{\partial L(W+\beta V)}{\partial W}$, computed at the look-ahead point, gives a more precise direction toward the optimal point.
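
The look-ahead update can be sketched as follows. The quadratic toy loss, the function names, and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
CURVATURE = np.array([1.0, 25.0])


def loss(w):
    return 0.5 * np.sum(CURVATURE * np.asarray(w) ** 2)


def grad(w):
    return CURVATURE * w


def nesterov_descent(w0, alpha=0.03, beta=0.8, steps=100):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + beta * v                # apply the smoothing term first
        v = beta * v - alpha * grad(lookahead)  # gradient at W + beta*V
        w = w + v                               # W <= W + V
    return w
```

Note that the weights themselves move by $V$ only once per iteration; the look-ahead point is used solely to evaluate the gradient.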

---------------

# Parameter-Specific Learning Rates

The basic idea in the momentum methods of the previous section is to leverage the consistency in the gradient direction of certain parameters in order to speed up the updates. This goal can be achieved more explicitly by having different learning rates for different parameters. The idea is that <font color='red'>**parameters with large partial derivatives are often oscillating and zigzagging, whereas parameters with small partial derivatives tend to be more consistent and keep moving in the same direction**</font>.


## AdaGrad

Let $A_i$ be the accumulated squared magnitude of the partial derivative $\frac{\partial L}{\partial w_i}$, updated at every iteration as follows.

$$A_i \Leftarrow A_i+\left(\frac{\partial L}{\partial w_i}\right)^2\quad \forall i$$

Large partial derivatives indicate oscillation and zigzagging, so the square root of the accumulated magnitude is placed in the denominator in order to give the corresponding weight a smaller update.

$$w_i \Leftarrow w_i - \frac{\alpha}{\sqrt{A_i}}\left(\frac{\partial L}{\partial w_i}\right)\quad \forall i$$

*If desired, one can use $\sqrt{A_i+\epsilon}$ in the denominator instead of $\sqrt{A_i}$ to avoid ill-conditioning. Here, $\epsilon$ is a small positive value such as $10^{-8}$.*

Drawbacks of AdaGrad

- The weight updates tend to slow down over time, because $A_i$ aggregates the entire history of partial derivatives and can only grow.
- The aggregate scaling factor depends on ancient history, which can eventually become stale. The use of stale scaling factors can increase inaccuracy.
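
A minimal sketch of the AdaGrad update, using $\sqrt{A_i+\epsilon}$ in the denominator. The quadratic toy loss, the function names, and the value $\alpha = 0.5$ are assumptions for illustration.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
CURVATURE = np.array([1.0, 25.0])


def loss(w):
    return 0.5 * np.sum(CURVATURE * np.asarray(w) ** 2)


def grad(w):
    return CURVATURE * w


def adagrad(w0, alpha=0.5, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)  # accumulated squared partial derivatives
    for _ in range(steps):
        g = grad(w)
        A = A + g ** 2                         # A_i <= A_i + (dL/dw_i)^2
        w = w - alpha / np.sqrt(A + eps) * g   # per-parameter step size
    return w, A
```

Because each coordinate is divided by its own accumulated gradient magnitude, the high-curvature and low-curvature coordinates of this toy loss follow almost the same trajectory. The returned `A` never decreases, which is the source of the slow-down described above.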

## RMSProp

RMSProp addresses the slow-down and stale-history problems of AdaGrad. The basic idea is to use a decay factor $\rho\in (0,1)$ and to weight the squared partial derivatives occurring $t$ updates ago by $\rho^{t}$. The quantity $A_i$ is therefore an exponentially averaged magnitude, and it is updated as follows.

$$A_i \Leftarrow \rho A_i + (1-\rho)\left(\frac{\partial L}{\partial w_i}\right)^2\quad \forall i$$

The weight update takes the same form as in AdaGrad.

$$w_i\Leftarrow w_i - \frac{\alpha}{\sqrt{A_i}}\left(\frac{\partial L}{\partial w_i}\right)\quad \forall i$$

Note that $A_i$ is initialized to $0$. This causes some bias in early iterations, which disappears over the longer term.
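
A minimal sketch of RMSProp under the same conventions. The quadratic toy loss, the function names, the $\epsilon$ term, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
CURVATURE = np.array([1.0, 25.0])


def loss(w):
    return 0.5 * np.sum(CURVATURE * np.asarray(w) ** 2)


def grad(w):
    return CURVATURE * w


def rmsprop(w0, alpha=0.01, rho=0.9, eps=1e-8, steps=300):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)  # exponentially averaged squared gradients, starts at 0
    for _ in range(steps):
        g = grad(w)
        A = rho * A + (1 - rho) * g ** 2
        w = w - alpha / np.sqrt(A + eps) * g
    return w
```

The zero initialization shows up directly in the first iteration: $A_i = (1-\rho)\,g_i^2$, so the first step has magnitude $\alpha/\sqrt{1-\rho}$, roughly three times $\alpha$ for $\rho = 0.9$. This is the early bias mentioned above.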

## RMSProp with Nesterov Momentum

The weights are updated using the Nesterov momentum method, with each gradient step scaled by the exponentially averaged magnitude.

$$v_i\Leftarrow \beta v_i-\frac{\alpha}{\sqrt{A_i}}\left(\frac{\partial L(W+\beta V)}{\partial w_i}\right);\quad w_i\Leftarrow w_i+v_i\quad \forall i$$

The exponentially averaged magnitude is updated as in RMSProp, using the gradient at the look-ahead point.

$$A_i \Leftarrow \rho A_i + (1-\rho)\left(\frac{\partial L(W+\beta V)}{\partial w_i}\right)^2\quad \forall i$$
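
The two updates can be combined as in the sketch below. It assumes that $A_i$ is refreshed with the look-ahead gradient before the velocity is updated, so that the first step does not divide by the zero initialization. The toy loss, names, and hyperparameters are likewise assumptions.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
CURVATURE = np.array([1.0, 25.0])


def grad(w):
    return CURVATURE * w


def rmsprop_nesterov(w0, alpha=0.01, beta=0.8, rho=0.9, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    A = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w + beta * v)                       # gradient at the look-ahead point
        A = rho * A + (1 - rho) * g ** 2             # RMSProp average (assumed to come first)
        v = beta * v - alpha / np.sqrt(A + eps) * g  # scaled Nesterov velocity
        w = w + v
    return w
```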

## AdaDelta

The AdaDelta algorithm uses a similar update as RMSProp, except that it **eliminates the need for a global learning parameter $\alpha$** by computing it as a function of the incremental updates in previous iterations.

Consider the update of RMSProp, which is repeated below:

$$w_i\Leftarrow w_i-\underbrace{\frac{\alpha}{\sqrt{A_i}}\left(\frac{\partial L}{\partial w_i}\right)}_{\Delta w_i};\quad \forall i$$

As with the exponentially smoothed squared gradients $A_i$, we keep an exponentially smoothed value $\delta_i$ of the squared values of $\Delta w_i$ in previous iterations, with the same decay parameter $\rho$:

$$\delta_i \Leftarrow \rho \delta_i + (1-\rho)(\Delta w_i)^2\quad\forall i$$

For a given iteration, the value of $\delta_i$ can be computed using only the iterations before it because the value of $\Delta w_i$ is not yet available. On the other hand, $A_i$ can be computed using the partial derivative in the current iteration as well. This is a subtle difference between how $A_i$ and $\delta_i$ are computed. This results in the following AdaDelta update:

$$w_i\Leftarrow w_i - \underbrace{\sqrt{\frac{\delta_i}{A_i}}\left(\frac{\partial L}{\partial w_i}\right)}_{\Delta w_i}\quad \forall i$$

where the learning-rate parameter $\alpha$ no longer appears.
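
A minimal sketch of the AdaDelta update. It assumes a small constant $\epsilon$ added to both $\delta_i$ and $A_i$ under the square root: with $\delta_i$ initialized to $0$ and no such term, the first update would be exactly zero and the weights would never move. The toy loss, names, and values are also assumptions.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
CURVATURE = np.array([1.0, 25.0])


def loss(w):
    return 0.5 * np.sum(CURVATURE * np.asarray(w) ** 2)


def grad(w):
    return CURVATURE * w


def adadelta(w0, rho=0.9, eps=1e-6, steps=100):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)      # smoothed squared gradients
    delta = np.zeros_like(w)  # smoothed squared updates
    for _ in range(steps):
        g = grad(w)
        A = rho * A + (1 - rho) * g ** 2              # uses the current gradient
        dw = np.sqrt((delta + eps) / (A + eps)) * g   # delta from earlier iterations only
        w = w - dw
        delta = rho * delta + (1 - rho) * dw ** 2     # now include the current update
    return w
```

The order of the statements mirrors the subtle difference described above: `A` is refreshed before the step is computed, while `delta` is refreshed only afterwards.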

## Adam

The Adam algorithm uses a similar "signal-to-noise" normalization as AdaGrad and RMSProp. The exponentially averaged value $A_i$ is updated in the same way as in RMSProp.

$$A_i\Leftarrow \rho A_i + (1-\rho)\left(\frac{\partial L}{\partial w_i}\right)^2\quad\forall i$$

However, there are two key differences between Adam and RMSProp.

1. Adam exponentially smooths the first-order gradient in order to incorporate momentum into the update. Whereas the momentum method above folds past steps into the velocity $V$, Adam keeps an explicit exponentially weighted average $F_i$ of all previous partial derivatives, with its own decay parameter $\rho_f\in(0,1)$.
$$F_i \Leftarrow \rho_f F_i + (1-\rho_f)\left(\frac{\partial L}{\partial w_i}\right)\quad\forall i$$
2. The learning rate $\alpha_t$ depends on the iteration index $t$, and is defined as follows.
$$\alpha_t = \alpha\underbrace{\left(\frac{\sqrt{1-\rho^t}}{1-\rho_f^t}\right)}_{\text{Adjust Bias}}$$
Technically, the adjustment to the learning rate is actually a bias correction factor that is applied to account for the unrealistic initialization of the two exponential smoothing mechanisms, and it is particularly important in early iterations. Both $F_i$ and $A_i$ are initialized to $0$, which causes bias in early iterations. The two quantities are affected differently by the bias, which accounts for the ratio in the $\alpha_t$ equation. For large $t$, the initialization bias correction factor converges to $1$, and $\alpha_t$ converges to $\alpha$.

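The weights are then updated with the smoothed gradient $F_i$ in place of the raw partial derivative:

$$w_i\Leftarrow w_i - \frac{\alpha_t}{\sqrt{A_i}}F_i\quad\forall i$$

The suggested default values of $\rho_f$ and $\rho$ are $0.9$ and $0.999$, respectively.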

*The Adam algorithm is extremely popular because it incorporates most of the advantages of other algorithms, and often performs competitively with respect to the best of the other methods*.
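
A minimal sketch of Adam with the iteration-dependent learning rate $\alpha_t$, assuming the weight update $w_i \Leftarrow w_i - \alpha_t F_i/\sqrt{A_i}$ with a small $\epsilon$ added to the denominator. The toy loss, the function names, and $\alpha = 0.05$ are assumptions for illustration.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
CURVATURE = np.array([1.0, 25.0])


def grad(w):
    return CURVATURE * w


def bias_correction(t, rho_f=0.9, rho=0.999):
    """Factor sqrt(1 - rho^t) / (1 - rho_f^t) that multiplies alpha at iteration t."""
    return np.sqrt(1 - rho ** t) / (1 - rho_f ** t)


def adam(w0, alpha=0.05, rho_f=0.9, rho=0.999, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    F = np.zeros_like(w)  # smoothed gradient (first moment)
    A = np.zeros_like(w)  # smoothed squared gradient (second moment)
    for t in range(1, steps + 1):  # t starts at 1 so the correction is defined
        g = grad(w)
        A = rho * A + (1 - rho) * g ** 2
        F = rho_f * F + (1 - rho_f) * g
        alpha_t = alpha * bias_correction(t, rho_f, rho)
        w = w - alpha_t * F / (np.sqrt(A) + eps)
    return w
```

With the correction in place, the very first step has magnitude $\alpha$ in every coordinate regardless of the gradient scale, and `bias_correction(t)` approaches $1$ for large $t$.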

------------




In [1]:

# for showing iteratively
%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np
import warnings

# convert warnings to error
warnings.filterwarnings("error")

# learning rate decay
epoch_t = 0

# conjugate gradient algorithm
q0 = None


TRAIN_FINISHED = 100000

LT = 5000
LR = 0.1
X = None
Y = None
x = None
y = None
cont = None
levels = None

Gaussian = lambda t, mu, sigma: 1.0/(sigma*np.sqrt(2*np.pi))*np.exp(-(t-mu)**2/(2*sigma**2))

class EXP_TRAIN_FINISHED(Exception):
    pass

def exp_decay(derivatives):
    global epoch_t
    a0 = 0.01
    k = 0.001
    at = a0 * np.exp(-k*epoch_t)
    epoch_t += 1
    return -at*derivatives
    
def inverse_decay(derivatives):
    global epoch_t
    a0 = 0.01
    k = 0.001
    at = a0 / (1+k*epoch_t)
    epoch_t += 1
    return -at*derivatives

def step_decay(derivatives):
    global epoch_t
    a0 = 0.01
    k = 0.5
    T = 200
    at = a0 * k**(epoch_t//T)
    epoch_t += 1
    return -at*derivatives

V = None
def momentum_learning(derivatives):
    global V
    a0 = 0.01
    b = 0.8
    if V is None:
        V = -a0 * derivatives
    else:
        V = b*V - a0 * derivatives
    return V

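update_v = 0
def nesterov_momentum(derivatives):
    global V, update_v
    a0 = 0.01
    b = 0.8
    if V is None:
        V = -a0 * derivatives
    elif update_v == 0:
        # look-ahead move W -> W + bV, so the next gradient is dL(W+bV)/dW
        update_v = 1
        return b * V
    else:
        # V = bV - a0 * dL(W+bV)/dW; the weights already sit at W + bV,
        # so only the gradient part of the step is still to be applied
        V = b*V - a0 * derivatives
        update_v = 0
        return -a0 * derivatives
    return V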

A = None
def AdaGrad(derivatives):
    a0 = 0.1
    global A
    if A is None:
        A = derivatives**2
    else:
        A += derivatives**2
    return -a0/(np.sqrt(A)+1e-8)*derivatives

def RMSProp(derivatives):
    a0 = 0.1
    r = 0.9
    global A
    if A is None:
        A = np.zeros(derivatives.shape)
    A = r*A + (1-r)*derivatives**2
    return -a0/(np.sqrt(A)+1e-8)*derivatives

def RMSProp_Nesterov(derivatives):
    global V, update_v, A
    a0 = 0.02
    b = 0.8
    r = 0.9
    if V is None:
        V = -a0 * derivatives
        A = np.zeros(derivatives.shape)
    elif update_v == 0:
        # look-ahead move W -> W + bV, so the next gradient is dL(W+bV)/dW
        update_v = 1
        return b * V
    else:
        A = r*A + (1-r)*derivatives**2
        # V = bV - a0/sqrt(A) * dL(W+bV)/dW; the weights already sit at W + bV,
        # so only the scaled gradient step is still to be applied
        step = -a0/(np.sqrt(A)+1e-8) * derivatives
        V = b*V + step
        update_v = 0
        return step
    return V

sum_delta = None
last_delta = None
def AdaDelta(derivatives):
    a0 = 0.1
    r = 0.9
    global A, sum_delta, last_delta
    if A is None:
        A = derivatives**2
        sum_delta = np.full(A.shape, a0**2)
    else:
        A = r*A + (1-r)*derivatives**2
        sum_delta = r*sum_delta + (1-r)*(last_delta)**2
    last_delta = np.sqrt(sum_delta/(A+1e-8))*derivatives
    return -last_delta

F = None
adam_t = 0
def Adam(derivatives):
    a0 = 0.1
    rf = 0.9
    r = 0.999
    global A, F, adam_t
    if A is None:
        A = np.zeros(derivatives.shape)
        F = np.zeros(derivatives.shape)
    adam_t += 1
    A = r*A + (1-r)*derivatives**2
    F = rf*F + (1-rf)*derivatives
    # bias-corrected learning rate: a_t = a0 * sqrt(1 - r^t) / (1 - rf^t)
    at = a0 * (np.sqrt(1-r**adam_t)/(1-rf**adam_t))
    return -at/(np.sqrt(A)+1e-8)*F

def tanh(a):
    y = np.tanh(a)
    return y

def dtanh(o):
    y = 1-o**2
    return y

def sigmoid(a):
    y = 1.0/(1.0+np.exp(-a))
    return y

def dsigmoid(o):
    y = o*(1-o)
    return y

def compute_weights_error(derivatives, variance):
    ret = LR * variance * derivatives
    return ret

class Layer:
    network = None
    unit_number = 0
    outputs = None
    errors = None
    weights = None
    derivatives = None
    prev_layer = None
    next_layer = None
    variance = 0.0
#     bias = None
#     bias_derivatives = None
    
    w_start = 0
    b_start = 0
    act_errors = None
    
    act_func = tanh
    link_func = dtanh
    
    def __init__(self, K, afunc=tanh, lfunc=dtanh):
        self.unit_number = K
        self.outputs = np.zeros(K)
        self.errors = np.zeros(K)
        self.act_errors = np.zeros(K)
#         self.bias = np.zeros(K)
#         self.bias_derivatives = np.zeros(K)
        self.act_func = afunc
        self.link_func = lfunc
        
    def build_connection(self, network):
        self.network = network
        j = self.prev_layer.unit_number
        self.variance = 1.0/(j * self.unit_number)
#         self.weights = np.random.normal(0,\
#                                         np.sqrt(self.variance),\
#                                         (self.unit_number, j))
#         self.derivatives = np.zeros((self.unit_number, j))
        
        w_size = self.unit_number * j
        self.w_start = self.network.w_size
        self.network.weights = np.hstack((self.network.weights, np.random.normal(0, np.sqrt(self.variance), w_size)))
        self.network.w_size += w_size
        
        b_size = self.unit_number
        self.b_start = self.network.w_size
        self.network.weights = np.hstack((self.network.weights, np.zeros(b_size)))
        self.network.w_size += b_size
        
        self.network.derivatives = np.zeros(self.network.w_size)
        self.network.hessian = np.zeros((self.network.w_size, self.network.w_size))
        self.network.b = np.zeros(self.network.w_size)
    
#     def forward_propagation(self):
#         for k in range(self.unit_number):
#             self.outputs[k] = self.act_func(np.sum(self.prev_layer.outputs * self.weights[k])+self.bias[k])

    def forward_propagation(self):
        for k in range(self.unit_number):
            s = 0.0
            for j in range(self.prev_layer.unit_number):
                s += self.prev_layer.outputs[j] * self.network.weights[self.w_start+k*self.prev_layer.unit_number+j]
            s += self.network.weights[self.b_start + k]
            self.outputs[k] = self.act_func(s)
            
#     def backward_propagation(self, update_weight):
#         for k in range(self.unit_number):
#             self.errors[k] = self.link_func(self.outputs[k]) * np.sum(self.next_layer.weights[:,k] * self.next_layer.errors)
#             self.derivatives[k] += self.errors[k] * self.prev_layer.outputs
#         self.bias_derivatives += self.errors
#         if update_weight == 1:
#             self.weights -= compute_weights_error(self.derivatives, self.variance)
#             self.derivatives *= 0.0
#             self.bias -= compute_weights_error(self.bias_derivatives, self.variance)
#             self.bias_derivatives *= 0.0

    def backward_propagation(self, update_weight):
        for k in range(self.unit_number):
            s = 0.0
            a = 0.0
            for l in range(self.next_layer.unit_number):
                s += self.next_layer.errors[l] * self.network.weights[self.next_layer.w_start+l*self.unit_number+k]
                a += self.next_layer.act_errors[l] * self.network.weights[self.next_layer.w_start+l*self.unit_number+k]
            self.errors[k] = self.link_func(self.outputs[k]) * s
            self.act_errors[k] = self.link_func(self.outputs[k]) * a
            for j in range(self.prev_layer.unit_number):
                self.network.derivatives[self.w_start+k*self.prev_layer.unit_number+j] += self.errors[k] * self.prev_layer.outputs[j]
                self.network.b[self.w_start+k*self.prev_layer.unit_number+j] = self.act_errors[k] * self.prev_layer.outputs[j]
            self.network.derivatives[self.b_start+k] += self.errors[k]
            self.network.b[self.b_start+k] = self.act_errors[k]
            
    def dump(self):
        print("outputs {}".format(self.outputs))
        print("errors {}".format(self.errors))
        print("weights {}".format(self.weights))
        print("derivatives {}".format(self.derivatives))

# Output layer (a single sigmoid unit trained with cross-entropy error)
class OutputLayer(Layer):
    targets = None
#     def backward_propagation(self, update_weight):
#         self.errors = self.outputs-self.targets
#         for k in range(self.unit_number):
#             self.derivatives[k] += self.errors[k] * self.prev_layer.outputs
#         self.bias_derivatives += self.errors
#         if update_weight == 1:
#             self.weights -= compute_weights_error(self.derivatives, self.variance)
#             self.derivatives *= 0.0
#             self.bias -= compute_weights_error(self.bias_derivatives, self.variance)
#             self.bias_derivatives *= 0.0
            
    def backward_propagation(self, update_weight):
        self.errors = self.outputs-self.targets
        self.act_errors = np.ones(self.act_errors.shape)
        for k in range(self.unit_number):
            for j in range(self.prev_layer.unit_number):
                self.network.derivatives[self.w_start+k*self.prev_layer.unit_number+j] += self.errors[k] * self.prev_layer.outputs[j]
                self.network.b[self.w_start+k*self.prev_layer.unit_number+j] = self.act_errors[k] * self.prev_layer.outputs[j]
            self.network.derivatives[self.b_start+k] += self.errors[k]
            self.network.b[self.b_start+k] = self.act_errors[k]
            
    def update_target(self, t):
        self.targets = t

class InputLayer(Layer):
    outputs = None
    unit_number = 0
    prev_layer = None
    next_layer = None
    
    def __init__(self, K):
        self.unit_number = K
        
    def build_connection(self, network):
        self.network = network
        return
        
    def update_input(self, inputs):
        self.outputs = inputs

class Network:
    input_layer = None
    output_layer = None
    inputs = None
    targets = None
    hessian = None
    weights = None
    derivatives = None
    b = None
    w_size = 0
    
    def __init__(self):
        self.weights = np.array([])
        return
    
    def add_layer(self, layer):
        if self.input_layer == None:
            self.input_layer = layer
        else:
            layer.prev_layer = self.output_layer
            self.output_layer.next_layer = layer
            
        self.output_layer = layer
        layer.build_connection(self)
        
    def update_hessian(self):
        b = self.b.reshape(-1, 1)
        y = self.output_layer.outputs[0]
        self.hessian += y * (1-y) * (b @ b.T)
        
    def update(self, update_weight):
        self.update_hessian()
        if update_weight == 1:
            if np.allclose(self.derivatives, np.zeros(self.w_size)):
                raise EXP_TRAIN_FINISHED("Train finished")
            #print(self.weights)
            #print(self.derivatives)
            #print(self.hessian)
            #print(np.linalg.inv(self.hessian))
            
            # exponential decay
            #ret = exp_decay(self.derivatives)
            
            # inverse decay
            #ret = inverse_decay(self.derivatives)
            
            # step decay
            #ret = step_decay(self.derivatives)
            
            # momentum-based learning
            #ret = momentum_learning(self.derivatives)
            
            # nesterov momentum
            #ret = nesterov_momentum(self.derivatives)
            
            # AdaGrad
            #ret = AdaGrad(self.derivatives)
            
            # RMSProp
            #ret = RMSProp(self.derivatives)
            
            # RMSProp and Nesterov momentum combination
            #ret = RMSProp_Nesterov(self.derivatives)
            
            # AdaDelta
            #ret = AdaDelta(self.derivatives)
            
            # Adam
            ret = Adam(self.derivatives)
            
            # Conjugate Gradients algorithm
            #ret = conjugate_gradient(self.derivatives, self.hessian)
            
            self.weights += ret
            self.derivatives *= 0
            self.hessian *= 0
        return 1
        
    def train(self, observations, targets):
        N = len(observations)
        for n in range(N):
            self.input_layer.update_input(observations[n])
            self.output_layer.update_target(targets[n])
            
            layer = self.input_layer.next_layer
            while layer != None:
                layer.forward_propagation()
                layer = layer.next_layer
            
            # gradient descent/stochastic gradient descent
            if n == N-1:
                update_weight = 1
            else:
                update_weight = 0
            layer = self.output_layer
            while layer != self.input_layer:
                layer.backward_propagation(update_weight)
                layer = layer.prev_layer
            self.update(update_weight)

    def Get_Error(self, observations, targets, show):
        N = len(observations)
        es = 0.0
        for n in range(N):
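            # feed the sample through the input layer, as in train()
            self.input_layer.update_input(observations[n])
            self.targets = np.atleast_1d(targets[n])
            
            layer = self.input_layer.next_layer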
            while layer != None:
                layer.forward_propagation()
                layer = layer.next_layer
            #print("target {}".format(self.targets))
            #print("output {}".format(self.output_layer.outputs))
            #e = np.sum(self.targets * np.log(self.output_layer.outputs))
            # for sigmoid only
            e = self.targets[0] * np.log(self.output_layer.outputs[0]) \
                + (1 - self.targets[0]) * np.log(1 - self.output_layer.outputs[0])
            es += e
            if show == 1:
                print(self.targets, self.output_layer.outputs)
        return -es
                
    def test(self, new_input):
        self.input_layer.update_input(new_input)
        
        layer = self.input_layer.next_layer
        while layer != None:
            layer.forward_propagation()
            layer = layer.next_layer
        return self.output_layer.outputs
    
    def dump(self):
        layer = self.input_layer.next_layer
        while layer != None:
            layer.dump()
            layer = layer.next_layer
        return

def gen_training_data(ax):
    mean1 = [0.2, 0.2]
    cov1 = [[0.03, -0.01], 
           [-0.01, 0.02]]
    X1 = np.random.multivariate_normal(mean1, cov1, 20)

    mean12 = [0.7, 0.7]
    cov12 = [[0.01, -0.01], 
           [-0.01, 0.02]]
    X12 = np.random.multivariate_normal(mean12, cov12, 20)
    X1 = np.vstack((X1, X12))
    T1 = np.zeros(len(X1))
    
    mean2 = [0.4, 0.6]
    cov2 = [[0.03, -0.02], 
           [-0.02, 0.04]]
    X2 = np.random.multivariate_normal(mean2, cov2, 40)
    T2 = np.ones(len(X2))
    
    ax.scatter(X1.T[0], X1.T[1], s=50,  facecolors='none', edgecolors='blue')
    ax.scatter(X2.T[0], X2.T[1], s=50,  facecolors='red', edgecolors='none', marker='x')
    
    X = np.vstack((X1, X2))
    T = np.hstack((T1, T2))
    return X, T

def show_pic(network, ax, fig, init=0):
    global x, y, X, Y, Z, cont, levels
    if init == 1:
        x = np.linspace(0,1,20)
        y = np.linspace(0,1,20)
        X, Y = np.meshgrid(x, y)
        levels = np.arange(0.1, 1.1, 0.2)
    
    Z = np.zeros(X.shape)
    for i in range(len(y)):
        for j in range(len(x)):
            Z[i][j] = network.test([x[j], y[i]])[0]
    if init != 1:
        for coll in cont.collections: 
            coll.remove()
    cont = ax.contour(X, Y, Z, levels=levels) 
    fig.canvas.draw()
    
def training_show_process(network, X, T, fig, ax):
    show_pic(network, ax, fig, 1)

    for i in range(LT):
        network.train(X, T)
        if i % 10 == 9:
            show_pic(network, ax, fig)
    return

def build_network():
    network = Network()
    ilayer = InputLayer(2)
    network.add_layer(ilayer)
    layer1 = Layer(10)
    network.add_layer(layer1)
    # layer2 = Layer(5)
    # network.add_layer(layer2)
    olayer = OutputLayer(1, sigmoid, dsigmoid)
    network.add_layer(olayer)
    return network

def main():
    fig = plt.figure(figsize=(11,5), dpi=60)
    ax1 = fig.add_subplot(1,2,1)
    ax2 = fig.add_subplot(1,2,2)
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1)
    
    X, T = gen_training_data(ax1)
    network = build_network()
    try:
        training_show_process(network, X, T, fig, ax1)
    except EXP_TRAIN_FINISHED:
        show_pic(network, ax1, fig)
        raise

if __name__=="__main__":
    main()
