# Letting a Neural Network Learn

Traditional machine learning:
* Extract feature quantities that capture the essential character of the original pictures.
* Use machine learning techniques to learn the patterns in those features.

A neural network, by contrast, learns the characteristics of a picture directly from the data. Nobody has to decide by hand which features should represent the picture.

One of the main advantages of a neural network is that it handles every kind of problem with the same procedure.

Training data is used to train the network; test data, which is never used during training, is used to evaluate it. The final goal of training is generalization: the network should also work on input data it has never seen.

Overfitting: a state in which a network works extremely well on one specific data set but poorly on other data sets.
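As a rough illustration of overfitting, the sketch below fits polynomials instead of a neural network. The toy problem (noisy samples of a sine curve), the seed, and the helper names `make_test_data` and `mse` are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# 10 evenly spaced training points with noise (a made-up toy problem)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)

def make_test_data(n: int):
    """Fresh noisy samples that are never used for fitting."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

x_test, t_test = make_test_data(100)

def mse(coef, x, t) -> float:
    """Mean squared error of the polynomial with coefficients coef."""
    return float(np.mean((np.polyval(coef, x) - t) ** 2))

errors = {}
for degree in (3, 9):
    coef = np.polyfit(x_train, t_train, degree)
    errors[degree] = (mse(coef, x_train, t_train), mse(coef, x_test, t_test))
    print(degree, errors[degree])
```

The degree-9 polynomial has enough freedom to pass through all 10 training points, so its training error is close to zero, yet that says nothing about how well it does on the test points. Comparing the two numbers in each pair is the basic way to notice overfitting.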

## Loss Function
The loss function is a metric of how bad the network is. More specifically, it measures how far the network's predictions are from the training data.

### Mean Squared Error
$$
E = \frac{1}{2}\sum\limits_k(y_k - t_k)^2
$$
* $y_k$ is the output of the network
* $t_k$ is the supervised (teacher) data, i.e. the correct label from the training data set
* $k$ indexes the dimensions of the data

Example:

In [20]:
import numpy as np

y1 = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
y2 = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

def mean_squared_error(y: np.ndarray, t: np.ndarray) -> np.float64:
    diff: np.ndarray = y - t
    diff *= diff
    return np.sum(diff) / 2

result: np.float64 = mean_squared_error(y1, t)
print("high accuracy:")
print(type(result))
print(result)

result: np.float64 = mean_squared_error(y2, t)
print("low accuracy:")
print(type(result))
print(result)

high accuracy:
<class 'numpy.float64'>
0.09750000000000003
low accuracy:
<class 'numpy.float64'>
0.5975


### Cross Entropy Error
$$
E = -\sum\limits_k t_k\log y_k
$$

Because $t$ is one-hot, exactly one $t_i$ is 1 and all the others are 0. Therefore $E$ reduces to the negative natural $\log$ of the single $y_k$ that corresponds to the correct label.

In [21]:
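def cross_entrophy_error(y: np.ndarray, t: np.ndarray) -> np.float64:
    delta = 1e-7  # small offset so that log(0) never occurs
    # y + delta builds a new array, so the caller's y is left unchanged
    total: np.float64 = np.sum(t * np.log(y + delta))
    return -total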

result: np.float64 = cross_entrophy_error(y1, t)
print("high accuracy:")
print(type(result))
print(result)

result: np.float64 = cross_entrophy_error(y2, t)
print("low accuracy:")
print(type(result))
print(result)

high accuracy:
<class 'numpy.float64'>
0.510825457099338
low accuracy:
<class 'numpy.float64'>
2.302584092994546
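The claim that only one term of the sum survives for a one-hot label can be checked directly. This is a minimal sketch that reuses the values of `y1` and `t` from above; the names `full_sum` and `single_term` are only illustrative.

```python
import numpy as np

y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
delta = 1e-7  # same small offset as in the cell above

# the full sum over all k
full_sum = -np.sum(t * np.log(y + delta))
# only the term at the index of the correct label
single_term = -np.log(y[np.argmax(t)] + delta)

print(np.isclose(full_sum, single_term))  # True
```

All the other terms are multiplied by $t_k = 0$, so the loss depends only on the probability the network assigns to the correct class.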


### Mini-Batch Training
Computing the loss for every sample, summing, and normalizing is not always practical: if the data set is very large, a single pass over it can take a long time.

Instead, training picks a small random subset of the data, a mini-batch, and the network learns from one mini-batch at a time.

In [22]:
# mnist mini-batch
import sys, os
sys.path.append("..")
import numpy as np
from my_mnist import load_mnist
import my_func as func

x_train: np.ndarray
t_train: np.ndarray
x_test: np.ndarray
t_test: np.ndarray
(x_train, t_train), (x_test, t_test) = load_mnist(one_hot_label=True)

# draw a mini-batch of size 10
batch_size: np.int32 = np.int32(10)
mask: np.ndarray = np.random.choice(len(x_train), batch_size)
x_batch: np.ndarray = x_train[mask]
t_batch: np.ndarray = t_train[mask]
# mask holds integer indices (not booleans): indexing with it selects the rows at those indices
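The difference between indexing with an integer array and indexing with a boolean mask can be seen on a small array. This sketch uses a made-up `data` array in place of MNIST and NumPy's `Generator.choice` with `replace=False`, which is an assumption of this example; `np.random.choice` as used above samples with replacement by default, so a batch may contain the same row twice.

```python
import numpy as np

data = np.arange(50).reshape(10, 5)  # 10 hypothetical samples, 5 features each

rng = np.random.default_rng(0)
idx = rng.choice(len(data), size=3, replace=False)  # 3 distinct row indices
batch = data[idx]  # integer-array indexing: rows in the order given by idx
print(idx.shape, batch.shape)

# the equivalent boolean mask has one True/False entry per row
bool_mask = np.zeros(len(data), dtype=bool)
bool_mask[idx] = True
same_rows = data[bool_mask]  # boolean indexing: rows in their original order
```

Both forms pick the same rows here, but only the integer form preserves the sampled order and allows repeated rows.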

In [23]:
# x_train, t_train, x_test, t_test have already been loaded previously
def mini_batch(size: int):
    mask: np.ndarray = np.random.choice(len(x_train), size)
    x_batch: np.ndarray = x_train[mask]
    t_batch: np.ndarray = t_train[mask]
    return x_batch, t_batch

# if t_train is one-hot encoded, the cross entropy is easy to compute
def batch_cross_entrophy_error(y: np.ndarray, t: np.ndarray) -> np.float64:
    batch_size = len(y)
    print(f"batch_size = {batch_size}")
    # y and t have the same shape, so no reshape is needed and the
    # element-wise product can be used directly
    logy: np.ndarray = np.log(y + 1e-7)  # small offset avoids log(0)
    total = np.sum(t * logy)
    return -total / batch_size  # average over the batch

network = func.init_network()

(x, t) = mini_batch(10)

y = func.predict(network, x)
# the loop below normalizes each output row so that it sums to 1
for row in y:
    row /= np.sum(row)

result = batch_cross_entrophy_error(y, t)
print(result)

# for labels that are not one-hot encoded, see the book

batch_size = 10
0.06804710952565074
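When the labels are plain class indices rather than one-hot vectors, one possible way to compute the same loss is to pick the predicted probability of the correct class for each sample. This is a sketch under that assumption, not the version from the book; the function name `cross_entropy_label_index` and the small arrays are made up for illustration.

```python
import numpy as np

def cross_entropy_label_index(y: np.ndarray, t: np.ndarray) -> float:
    """y: (batch, classes) probabilities, t: (batch,) integer class labels."""
    batch_size = y.shape[0]
    # for row i, pick column t[i]: the probability of the correct class
    picked = y[np.arange(batch_size), t]
    return float(-np.sum(np.log(picked + 1e-7)) / batch_size)

y = np.array([[0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]])
t = np.array([1, 2])  # correct classes as indices

loss_index = cross_entropy_label_index(y, t)

# the one-hot formulation gives the same value
one_hot = np.eye(3)[t]
loss_one_hot = float(-np.sum(one_hot * np.log(y + 1e-7)) / len(y))
print(loss_index, loss_one_hot)
```

The two values agree because multiplying by a one-hot row and summing is just another way of selecting a single entry.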


## Numerical Differentiation
Use the gradient to decide how to change the parameters so that the loss decreases.

### Implement Differentiation
Use the central difference:
$$
f'(x) = \lim\limits_{h\to 0}\frac{f(x + h) - f(x - h)}{2h}
$$

rather than the forward difference:
$$
f'(x) = \lim\limits_{h\to 0}\frac{f(x+h) - f(x)}{h}
$$

In a numerical computation $h$ cannot actually be infinitesimal, and for a small finite $h$ the central difference is closer to $f'(x)$ than the forward difference.
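The accuracy gap between the two formulas can be measured on a function whose derivative is known exactly. This sketch uses $\sin$ (derivative $\cos$) as an arbitrary test function; the helper names `forward_diff` and `central_diff` are illustrative only.

```python
import math

def forward_diff(f, x, h=1e-4):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
exact = math.cos(x)  # exact derivative of sin at x

err_forward = abs(forward_diff(math.sin, x) - exact)
err_central = abs(central_diff(math.sin, x) - exact)
print(err_forward, err_central)
print(err_forward > err_central)  # True
```

The forward difference has an error proportional to $h$, while the error of the central difference is proportional to $h^2$, which is why the latter is several orders of magnitude smaller here.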

In [24]:
def numerical_diff(f, x):
    h = 1e-4
    # parenthesize the denominator: "/ 2 * h" would divide by 2 and then multiply by h
    return (f(x + h) - f(x - h)) / (2 * h)

Example:
$$
y = 0.01x^2 + 0.1x
$$

In [25]:
def function(x):
    return 0.01 * x * x + 0.1 * x

# the analytic derivative is 0.02x + 0.1, i.e. 0.2 at x = 5 and 0.3 at x = 10
print(f"{numerical_diff(function, 5):.6f}")
print(f"{numerical_diff(function, 10):.6f}")

0.200000
0.300000


### Partial Derivative
Consider the function:
$$
f(x_0, x_1) = x_0^2 + x_1^2
$$

The partial derivative with respect to $x_1$ shows how fast the value changes in the direction of $x_1$ while $x_0$ is held fixed.

We now want to calculate $\frac{\partial f}{\partial x_0}$ and $\frac{\partial f}{\partial x_1}$.

To calculate $\frac{\partial f}{\partial x_0}$ at $(3, 4)$, we can create a new function of one variable with $x_1$ fixed at 4:
$$
g(x_0) = x_0^2 + 4^2
$$
and pass it to the `numerical_diff` function.

In [26]:
def function1(x0, x1):
    return x0 * x0 + x1 * x1

def function1_mod(x0):
    # x1 is fixed at 4, so only x0 varies
    return x0 * x0 + 4 * 4

print(f"f'(3, 4) for x0 is {numerical_diff(function1_mod, 3):.6f}")

f'(3, 4) for x0 is 6.000000
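Writing a separate one-variable function for every coordinate gets tedious. One possible generalization, sketched here with a hypothetical helper named `partial_diff`, builds that one-variable function on the fly by freezing all arguments except one. The central-difference helper is repeated so that the block runs on its own.

```python
def numerical_diff(f, x, h=1e-4):
    # central difference
    return (f(x + h) - f(x - h)) / (2 * h)

def partial_diff(f, point, idx):
    """Partial derivative of f at point with respect to argument number idx."""
    def f_one_var(v):
        args = list(point)
        args[idx] = v  # vary only this argument, keep the others fixed
        return f(*args)
    return numerical_diff(f_one_var, point[idx])

def function1(x0, x1):
    return x0 * x0 + x1 * x1

d0 = partial_diff(function1, (3.0, 4.0), 0)
d1 = partial_diff(function1, (3.0, 4.0), 1)
print(f"{d0:.6f} {d1:.6f}")
```

The analytic partial derivatives are $2x_0 = 6$ and $2x_1 = 8$ at $(3, 4)$, so the two printed values should be very close to 6 and 8.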


## Gradient

The function below is straightforward:
* create an array `grad` with the same shape as `x`
* for each $x_i$, calculate `grad[i]` with the central difference
* return `grad`

In [27]:
def numerical_grad(f, x):
    h = 1e-4
    grad = np.zeros_like(x)
    
    for idx in range(len(x)):
        temp_val = x[idx]

        x[idx] += h
        f_plus_h = f(x)
        x[idx] = temp_val - h
        f_sub_h = f(x)

        grad[idx] = (f_plus_h - f_sub_h) / (2 * h)
        x[idx] = temp_val
    return grad

def function2(x):
    return x[0] * x[0] + x[1] * x[1]

print(f"function2 grad at (3, 4) is {numerical_grad(function2, np.array([3.0, 4.0]))}")

function2 grad at (3, 4) is [6. 8.]
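A numerical gradient is mostly useful as a check against an analytic one. The sketch below is a variant of the function above, under a hypothetical name `numerical_grad_checked`, that works on a float copy of its input. That copy is an assumption added here: with an integer array, writing `x[idx] + h` back into `x` would silently truncate the value. It assumes a 1-D input.

```python
import numpy as np

def numerical_grad_checked(f, x, h=1e-4):
    # float copy: protects the caller's array and avoids integer truncation
    x = np.asarray(x, dtype=float).copy()
    grad = np.zeros_like(x)
    for idx in range(x.size):
        original = x[idx]
        x[idx] = original + h
        f_plus = f(x)
        x[idx] = original - h
        f_minus = f(x)
        grad[idx] = (f_plus - f_minus) / (2 * h)
        x[idx] = original  # restore before moving to the next coordinate
    return grad

def function2(x):
    return x[0] * x[0] + x[1] * x[1]

point = np.array([3, 4])  # integer dtype on purpose
numeric = numerical_grad_checked(function2, point)
analytic = 2.0 * point    # gradient of x0^2 + x1^2 is (2*x0, 2*x1)
print(np.allclose(numeric, analytic))  # True
```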


The gradient vector points in the direction in which the function value increases fastest, so the negative gradient points in the direction in which it decreases fastest.

Therefore, if we can compute the gradient of the loss function, we know in which direction to change the parameters to reduce the loss.
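The sign convention is easy to verify numerically. This sketch takes one small step with and against the gradient of $x_0^2 + x_1^2$, using the analytic gradient $2x$; the step size 0.01 is an arbitrary choice.

```python
import numpy as np

def f(x):
    return float(np.sum(x ** 2))

def grad_f(x):
    return 2.0 * x  # analytic gradient of the sum of squares

x = np.array([3.0, 4.0])
step = 0.01

downhill = f(x - step * grad_f(x))  # move against the gradient
uphill = f(x + step * grad_f(x))    # move along the gradient
print(downhill < f(x) < uphill)     # True
```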

### Gradient Method
Move the parameters a small step against the gradient, and repeat. The function value keeps decreasing until the parameters arrive at a stationary point, where the gradient is zero:
$$
\begin{aligned}
x_0 &= x_0 - \eta\frac{\partial f}{\partial x_0}\\
x_1 &= x_1 - \eta\frac{\partial f}{\partial x_1}
\end{aligned}
$$

Note the term "stationary point": the gradient method stops wherever the gradient becomes zero, which may be a local minimum or a saddle point. It is not guaranteed to arrive at the global minimum.

$\eta$ is called the *learning rate*; it decides how far the parameters move along the negative gradient in each step.
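The effect of the learning rate can be seen by running the update rule with three different values of $\eta$ on $x_0^2 + x_1^2$. This is a sketch that uses the analytic gradient $2x$ to keep the numbers clean; the helper name `descend`, the starting point, and the three rates are arbitrary choices for illustration.

```python
import numpy as np

def grad_f(x):
    return 2.0 * x  # analytic gradient of x0^2 + x1^2

def descend(init_x, learn_rate, step_num=100):
    x = init_x.copy()
    for _ in range(step_num):
        x -= learn_rate * grad_f(x)
    return x

init_x = np.array([-3.0, 4.0])
x_good = descend(init_x, learn_rate=0.1)     # converges towards (0, 0)
x_large = descend(init_x, learn_rate=10.0)   # overshoots further on every step
x_small = descend(init_x, learn_rate=1e-10)  # barely moves at all
print(x_good, x_large, x_small)
```

With a rate that is too large each step jumps past the minimum and lands further away than before; with a rate that is too small the parameters are still essentially at the starting point after all the steps.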

In [28]:
def grad_descent(f, init_x, learn_rate=0.01, step_num=1000):
    x = init_x.copy()  # work on a copy so the caller's init_x is not modified
    for _ in range(step_num):
        grad = numerical_grad(f, x)
        x -= learn_rate * grad
    return x

# use this method to find the minimum of function2
init_x = np.array([3.0, 4.0])
final_x = grad_descent(function2, init_x)
print(f'final value of function2 is {function2(final_x)}')
print('the right result should be very close to 0')

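That the gradient method can stop at a point that is not a minimum can be shown with the function $x_0^2 - x_1^2$, which has a saddle at the origin. This is a sketch with the analytic gradient; the function, the helper name `descend`, and the starting points are made up for illustration.

```python
import numpy as np

def saddle(x):
    return x[0] ** 2 - x[1] ** 2

def saddle_grad(x):
    return np.array([2.0 * x[0], -2.0 * x[1]])

def descend(init_x, learn_rate=0.01, step_num=1000):
    x = init_x.copy()
    for _ in range(step_num):
        x -= learn_rate * saddle_grad(x)
    return x

# starting exactly on the x0 axis, the x1 component of the gradient is always 0
stuck = descend(np.array([3.0, 0.0]))
# a tiny offset in x1 is enough to slide away from the saddle
escaped = descend(np.array([3.0, 1e-3]))
print(stuck, saddle(stuck))
print(escaped, saddle(escaped))
```

The first run ends next to $(0, 0)$, where the gradient is zero even though the function takes lower values nearby, as the second run shows.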

### Implementation
The full implementation is in the Python script in the current directory.