<a href="https://colab.research.google.com/github/LucyMariel/Lucy/blob/master/NeuralNetwork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A neural network is a mathematical model that imitates the information transmission of the brain (nervous system).
In nerve cells (neurons), when the input stimulus exceeds the threshold value, it fires (activates), generates an action potential, and transmits information to other cells.
This firing occurs in a chain to convey information and realize calculations. The change in the strength of neuron connections is called learning.

Neural networks, likewise, map an input vector to another vector space by means of weights (connection weights) and activation functions.
This is done in a chain of steps to extract information. In this process, adjusting the weights is called learning.

**MNIST dataset**

Here, we explain neural networks using MNIST, a dataset of handwritten digit images.

There are several kinds of neural network. An ordinary neural network (for example, the three-layer shallow neural network shown in the figure) calculates a prediction from one row of data, while a convolutional neural network (CNN) calculates a prediction from two-dimensional data such as images.

The images given as input are originally two-dimensional, but by flattening each image into a single row and treating it as one-dimensional input data, we can build an ordinary neural network.

In this text, you will learn shallow neural networks (NN), deep neural networks (DNN), and convolutional neural networks (CNN) in that order.
In addition, we will explain all neural networks using the MNIST dataset as an example.

CNN-based methods are normally used for image data, but for ease of understanding, this text implements NN, DNN, and CNN uniformly on the MNIST dataset.

**Preparing the dataset**


The handwritten digit dataset (MNIST) is input to a neural network to perform multi-class classification of the digits 0 to 9.

Idea (y_train: the correct answer labels):
For two-class classification: the correct answer label for one sample is used for training as a scalar (int type) such as 0 or 1.
For multi-class classification: the correct answer label for one sample is converted into a one-hot vector whose length is the number of classes, and that vector is used for training.
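The one-hot conversion described above can be sketched as follows. This is a minimal illustration that assumes the labels are integers from 0 to 9 held in a NumPy array; the helper name `to_one_hot` and the sample labels are hypothetical and do not come from the original text.

```python
import numpy as np

def to_one_hot(labels, n_classes=10):
    """Convert integer labels (n_samples,) to one-hot rows (n_samples, n_classes)."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes)[labels]

y_sample = np.array([5, 0, 3])            # hypothetical labels
y_sample_one_hot = to_one_hot(y_sample)
print(y_sample_one_hot.shape)             # (3, 10)
print(y_sample_one_hot[0])                # 1.0 at index 5, 0.0 elsewhere
```

Each row contains a single 1 at the index of the correct class, which is the form the cross-entropy loss used later expects.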

PUBLICATION

Left: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", p. 10, Fig. 4: Size-normalized examples from the MNIST database. Yann LeCun's Publications, November 1998. http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf (accessed 2022-05-20)

Summary
A neural network imitates the function of the human brain
There are various types of neural networks (NN, DNN, CNN)

**Neural Network Structure**

Get an overview of neural networks
Understand the scratch code template of neural networks.
Neural networks, as the name implies, are algorithms that mimic the human nervous system.

In the overall structure, as shown in the image below, a large number of neurons are connected together. The leftmost neurons receive the input (MNIST data in this case), the values pass through the neurons in between, and the value output from the rightmost neurons is the predicted value.

Each "0" in the above image is called a node, and nodes constitute a layer. The layers consist of one input layer and one output layer, and any number of intermediate layers can be added (the image shows one intermediate layer).

In addition, each node is connected to all the nodes in the next layer. That's why we use the word "Dense" in other explanations and tutorials.

In the calculation flow, the MNIST pixel values (28 pixels x 28 pixels = 784), flattened as shown in the above image, are given to the input layer, pass through the calculations of the intermediate layer, and the predicted values are output at the output layer.
Because MNIST is classified into 10 classes (handwritten digits from 0 to 9), the output layer has 10 nodes. Each of the 10 nodes outputs a value between 0 and 1, and the node with the maximum value is adopted as the predicted class.

For example, if each output of a node in the output layer is as follows, 0.87 is the maximum value, meaning that the predicted result is that the input image is the character "5". (In this case, the handwritten character in the above figure is 3, so we conclude that the prediction result is different.)

Therefore, only the node with the highest value is considered.
On the implementation side, the largest node can be found with np.argmax (a NumPy function), which returns the index of the maximum value along an axis.
LINK: NumPy API reference --numpy.argmax
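As a small illustration of picking the predicted class with np.argmax, the sketch below uses made-up output values in which 0.87 sits at index 5, matching the example in the text. The arrays are hypothetical, not real network outputs.

```python
import numpy as np

# Hypothetical output-layer values for one image (one value per class 0-9).
output = np.array([0.01, 0.02, 0.01, 0.05, 0.01, 0.87, 0.01, 0.01, 0.005, 0.005])
predicted_class = np.argmax(output)
print(predicted_class)  # 5

# For a batch of samples, axis=1 picks the largest node in each row.
batch_output = np.array([[0.1, 0.7, 0.2],
                         [0.8, 0.1, 0.1]])
batch_pred = np.argmax(batch_output, axis=1)
print(batch_pred)       # [1 0]
```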

**Inference and learning**

Inference processing is called forward propagation.

In forward propagation, as the word forward means, input data travels forward in the network (here from left to right).
After that, each hidden layer (intermediate layer) receives the input data, processes it according to the activation function, and passes it to the next layer.

At this time, the forward propagation sequentially calculates and saves the intermediate variables in the calculation graph defined by the neural network, and proceeds from the input layer to the output layer.

Next, the learning process is called backpropagation (backpropagation method).

Backpropagation is a method of calculating the gradients of neural network parameters.
Specifically, it is a method of tracing the network from the output layer to the input layer in reverse according to the chain rule of calculus.
This algorithm stores the intermediate variables (partial derivatives) needed to calculate the gradient with respect to the parameter.

Backpropagation sequentially calculates and saves the gradients of intermediate variables and parameters in the neural network in reverse order.

The detailed processing of each will be introduced in the following texts.
Here, let's get an overview only.

**Data to use**

This time, we will use the MNIST dataset, which stores image data of handwritten digits.
Since the images are originally 2D data, we will build an ordinary neural network on the flattened (converted from 2D to 1D) data.

CNN algorithms are normally used for image data, but to make things easier to understand, we will implement NN / DNN / CNN uniformly using the MNIST dataset.

As we saw above, each image in the MNIST dataset is 28 pixels x 28 pixels = 784 pixels and shows a handwritten digit from 0 to 9.
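The flattening step can be sketched as below. To keep the example self-contained it uses random arrays as a stand-in for MNIST images instead of loading the real dataset, and the scaling to the range 0 to 1 is a common preprocessing choice assumed here, not something stated in the text.

```python
import numpy as np

# Stand-in for MNIST: 3 random "images" of 28 x 28 pixels with values 0-255.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28))

# Flatten each 2D image into one row of 784 values.
X = images.reshape(-1, 784)

# Scale pixel values to the range 0-1 (assumed preprocessing step).
X = X.astype(np.float64) / 255.0
print(X.shape)  # (3, 784)
```

The first 28 values of each row are the first pixel row of the image, the next 28 values the second row, and so on.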

**Scratch code overview**

class ScratchSimpleNeuralNetrowkClassifier():
    def __init__(self,batch_size = 20,n_features = 784,n_nodes1 = 400,n_nodes2 = 200,n_output = 10,sigma = 0.02,lr = 0.01,epoch = 10, verbose=True):
        self.verbose = verbose
        self.batch_size = batch_size
        self.n_features = n_features
        self.n_nodes1 = n_nodes1
        self.n_nodes2 = n_nodes2
        self.n_output = n_output
        self.sigma = sigma
        self.lr = lr
        self.epoch = epoch
        self.loss_train = []
        self.loss_val = []
   
    def fit(self, X, y, X_val=None, y_val=None):
        pass
    
    def forward(self, X):
        pass
    
    def backward(self, X, y):
        pass
            
    def tanh_function(self, A):
        pass
    
    def softmax(self, A):
        pass

    def cross_entropy_error(self, y, Z):
        pass
        
    def predict(self, X):
        pass

Line 1: Class definition
Line 2: Constructor definition. The arguments are batch_size (number of samples trained at one time), n_features (number of features of the explanatory variables), n_nodes1 (number of nodes in the first layer), n_nodes2 (number of nodes in the second layer), n_output (number of nodes in the output layer), sigma (standard deviation of the normal distribution used to initialize the weights), lr (learning rate), epoch (number of training epochs), and verbose (whether or not to output the learning process)
Lines 3 to 11: Storing the arguments as member variables
Lines 12 to 13: Variable definitions for storing the loss
Lines 15-16: Learning function. Since this is a template, the function body is only pass
Lines 18-19: Forward Propagation Function
Lines 21-22: Backpropagation function
Lines 24 to 25: Activation function (tanh)
Lines 27-28: Activation function (softmax)
Lines 30-31: Loss function Cross entropy error function
Lines 33-34: Prediction function

Summary
The structure of a neural network includes an input layer, one or more intermediate layers, and an output layer.

**Introduction to Neural Network Computation**

Understand the calculation outline of neural networks
Understand that the calculated output of each node will be the input of the next layer

Calculation at each node
Now that you understand the general structure of the neural network, what kind of processing is being performed on each of the large number of nodes?

As shown in the image above, the input layer holds the 784 MNIST pixel values as they are.

Each node in the middle layer receives as its input the total value (A) obtained by multiplying each of the 784 input values by a weight and summing the results. The value (f(A)) obtained by passing the total value (A) through the function (f) is the output value of the node. The same is true if there are multiple intermediate layers: the output from the previous intermediate layer is received by the next intermediate layer as input.

The output layer is treated in the same way as the middle layer.

**Bias**

Each middle layer also needs a node called a bias, which plays the same role as the constant term in linear regression.
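The calculation at a single node, including the bias, might look like the sketch below. The input values, the weights, the bias value, and the choice of tanh as the function f are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                   # one flattened input image (hypothetical values)
w = 0.01 * rng.standard_normal(784)   # weights from the 784 inputs to one node
b = 0.1                               # bias of that node

A = np.sum(x * w) + b   # weighted sum of the inputs plus the bias
Z = np.tanh(A)          # output of the node after the activation function f

# The same calculation for a whole layer of 400 nodes is one matrix product.
W = 0.01 * rng.standard_normal((784, 400))
B = np.zeros(400)
A_layer = x @ W + B     # shape (400,)
```

Writing the layer as a matrix product is what lets the later `forward` function process a whole mini-batch at once.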

 Summary
The calculation is done at each node, and the output of one layer becomes the input of the next layer.

**Neural network activation function**

Learn the theory and implementation of various types of activation functions

What is the activation function?

The activation function is the function f introduced earlier.

The purpose of the activation function is to introduce non-linearity into the network. In fact, without the activation function, only linearly separable problems can be solved.

There are various types of activation functions. The typical ones are as follows.

Identity function
Mathematical formulas and Python programs
$$ f(x) = x $$

In [None]:
def identity_function(self,X):
    return X

step function
Mathematical formulas and Python programs
$$
f(x) =
\begin{cases}
0 & (x < 0) \\
1 & (x \geq 0)
\end{cases}
$$


In [None]:
def step_function(self,X):
    result = np.array(X >= 0, dtype=int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent
    return result

relu function
Mathematical formulas and Python programs
$$ \mathrm{relu}(x) = \max(0, x) $$

In [None]:
def relu(self,X):
    result = np.maximum(0, X)  # element-wise maximum of 0 and X
    return result

softmax function (softmax)
Mathematical formulas and Python programs
$$ f(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} $$


In [None]:
def softmax(self,X):
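    # Subtracting the row-wise maximum avoids overflow in np.exp and does not change the result.
    exp_X = np.exp(X - np.max(X, axis=1, keepdims=True))
    result = exp_X / np.sum(exp_X, axis=1, keepdims=True)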
    return result

sigmoid function (sigmoid)
Mathematical formulas and Python programs
$$ \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} $$


In [None]:
def sigmoid(self,X):
    result = 1 / (1 + np.exp(-X))
    return result

tanh function (hyperbolic tangent)
Mathematical formulas and Python programs
$$ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$


In [None]:
def tanh(self,X):
    result = (np.exp(X)-np.exp(-X))/(np.exp(X)+np.exp(-X))
    # or
    #  result = np.tanh(X)
    return result
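To get a feel for the output ranges, the activation functions can be evaluated on a few sample values. The sketch below writes them as plain expressions (without `self`) purely for illustration; the input array is made up.

```python
import numpy as np

x = np.array([-2.0, 0.0, 2.0])

step = (x >= 0).astype(int)       # [0, 1, 1]
relu = np.maximum(0, x)           # [0., 0., 2.]
sigmoid = 1 / (1 + np.exp(-x))    # values in (0, 1); 0.5 at x = 0
tanh = np.tanh(x)                 # values in (-1, 1); 0 at x = 0

# The hand-written tanh formula agrees with np.tanh for moderate inputs.
tanh_manual = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
print(np.allclose(tanh, tanh_manual))  # True
```

For very large inputs the hand-written formula can overflow in `np.exp`, which is one reason to prefer `np.tanh` in practice.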

Summary
The activation function helps to introduce non-linearity into the model
There are many activation functions

**Neural network calculation graph**

Understand the computational graphs of different processes.
Computational graph: a visualisation of a computational process represented by multiple nodes and edges.

Chain rule: the derivative of a composite function is represented by the derivative of each of the functions that make up that composite function.


calculation graphs
First, let's represent a simple formula using a calculation graph.

Given the following two equations.

$$ z = t^2 $$

$$ t = x + y $$

These can be expressed using a calculation graph as follows.
Following the graph from the left gives the following flow.

x,y comes in as input.
Addition of x,y is performed at the addition node.
The result of the addition is output as t
t is squared at the multiplication node
The squared value is output as z
Now let us assume that we want to find the derivative of z with respect to x. The formula for the derivative is given below.
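By the chain rule, the derivative for this graph is $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial t} \cdot \frac{\partial t}{\partial x} = 2t \cdot 1 = 2(x + y)$. The sketch below walks the graph forward and then backward; the function names and the input values are illustrative assumptions.

```python
def forward(x, y):
    t = x + y        # addition node
    z = t ** 2       # squaring node
    return t, z

def backward(t):
    dz_dt = 2 * t    # local derivative of the squaring node
    dt_dx = 1.0      # local derivative of the addition node with respect to x
    return dz_dt * dt_dx   # chain rule: dz/dx = dz/dt * dt/dx

x, y = 1.0, 2.0      # hypothetical input values
t, z = forward(x, y)
dz_dx = backward(t)
print(z, dz_dx)      # 9.0 6.0
```

The backward step reuses the intermediate value `t` saved during the forward step, which is the same pattern back propagation follows in a neural network.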

Summary
This section gave an overview of computational graphs and how they are used to calculate partial derivatives.
Forward and back propagation of a neural network are computed using the concept of computational graphs.

**Neural network weights**
Understand and implement initialization of neural network weights
"Problem 1": Creating code to determine the initial values of the weights
Add the following program to the end of the constructor

In [None]:
self.W1 = self.sigma * np.random.randn(self.n_features, self.n_nodes1)
self.W2 = self.sigma * np.random.randn(self.n_nodes1, self.n_nodes2)
self.W3 = self.sigma * np.random.randn(self.n_nodes2, self.n_output)
self.B1 = self.sigma * np.random.randn(1, self.n_nodes1)
self.B2 = self.sigma * np.random.randn(1, self.n_nodes2)
self.B3 = self.sigma * np.random.randn(1, self.n_output)

Line 1: Initialization of the weights W1. Data of shape (batch_size (number of samples trained at once), n_features (number of explanatory variables, 28 * 28)) comes in as the input of the neural network, and the output after the calculation with W1 should have shape (batch_size, n_nodes1). The shape of the weights is therefore (n_features, n_nodes1 (number of nodes in the first layer)).

(batch_size,n_features)×(n_features,n_nodes1)=(batch_size,n_nodes1)

Line 2: Initialization of the weights W2. The output of the previous layer, of shape (batch_size,n_nodes1), is received as input and the shape of the weights is (n_nodes1,n_nodes2), so the shape after the calculation (output shape) is (batch_size,n_nodes2)

(batch_size,n_nodes1)×(n_nodes1,n_nodes2)=(batch_size,n_nodes2)

Line 3: Initialization of the weights W3. The output of the previous layer, of shape (batch_size,n_nodes2), is received as input and the shape of the weights is (n_nodes2,n_output), so the shape after the calculation (output shape) is (batch_size,n_output)

(batch_size,n_nodes2)×(n_nodes2,n_output)=(batch_size,n_output)

Line 4: Initialize the bias, where shape is (1,n_nodes1). The role of the bias is that of a constant term in linear regression. It is added to the output of the first layer during the calculation.

Line 5: Initialize bias. shape is (1,n_nodes2). Add to the output of the second layer when calculating

Line 6: Initialize bias. shape is (1,n_output). Add to the output of the third layer in the calculation.
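The shape bookkeeping above can be checked with a short sketch. It uses the default sizes from the constructor and a random array as a hypothetical mini-batch; activation functions and the remaining biases are left out because only the shapes are of interest here.

```python
import numpy as np

batch_size, n_features, n_nodes1, n_nodes2, n_output = 20, 784, 400, 200, 10
sigma = 0.02

W1 = sigma * np.random.randn(n_features, n_nodes1)
W2 = sigma * np.random.randn(n_nodes1, n_nodes2)
W3 = sigma * np.random.randn(n_nodes2, n_output)
B1 = sigma * np.random.randn(1, n_nodes1)

X = np.random.rand(batch_size, n_features)   # hypothetical mini-batch
A1 = X @ W1 + B1        # B1 of shape (1, n_nodes1) is broadcast over the batch
out = A1 @ W2 @ W3      # activations and remaining biases omitted: shapes only
print(A1.shape, out.shape)   # (20, 400) (20, 10)
```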

Summary
Weights and biases can be initialized in a variety of ways
Here we use a Gaussian distribution

**Learning Neural Networks**

purpose
Understanding Forward Propagation
Know the cross-entropy error
Understanding Backpropagation

What is forward propagation?

In forward propagation, as the word "forward" implies, the input data travels through the network in the forward direction (here from left to right).
Each hidden layer (intermediate layer) then receives the input data, processes it according to its activation function, and passes it to the next layer.

Thus, forward propagation sequentially computes and stores intermediate variables in the computational graph defined by the neural network. It proceeds from the input layer to the output layer.


What is cross-entropy error (cross-entropy loss)?

When working with machine learning and deep learning problems, loss functions and cost functions are used to optimize the model being trained.
The goal is almost always to minimize the loss function. The smaller the loss, the better the model. Cross-entropy error is an important cost function. It is used to optimize classification models.
Several similarly named functions have appeared; a simple way to picture their relationship is: objective function ⊃ cost function, error function, loss function.
Although strictly different, all of these functions are motivated by the desire to minimize the error between the estimates and the labels.

The purpose of cross-entropy is to obtain the output probability (P) and measure its distance from the value of the correct answer label.

As an example, suppose the desired output of a class and the output of a model are as follows

The goal is to bring the output of the model as close as possible to the desired output (the value of the correct label). During model training, the weights of the model are iteratively adjusted with the goal of minimizing cross-entropy error.
The process of adjusting the weights is what model learning is all about, and when a model continues to learn and losses are minimized, we say that the model is learning.

What is back-propagation?

Back propagation, or error back propagation, is a method of calculating the gradient of the parameters of a neural network. In other words, it is a method that follows the chain rule of calculus, tracing the network backwards from the output layer to the input layer.
This algorithm preserves the intermediate variables (partial derivatives) needed to compute the gradients for the parameters.

Backpropagation sequentially calculates and saves the gradients of intermediate variables and parameters in the neural network in reverse order.

 Summary
Forward propagation, cross-entropy error, and back propagation are key points in the learning process of neural networks

**Forward propagation of neural networks (forward propagation process)**

Understand the flow of the forward propagation process of a neural network
Implementing a neural network forward propagation process using Python

Like linear regression and the steepest descent methods that have been implemented in other texts, neural nets also have mechanisms to increase estimation accuracy as they are trained.

This mechanism is called the error back propagation method. While inference propagates values from left to right, the error back propagation method, as the name implies, propagates from right to left, updating the weights w along the way as in the steepest descent method.
The left-to-right pass is called forward propagation and the right-to-left pass is called back propagation.



Implementation
The forward propagation process we have seen above can be implemented using Python as follows.

In [None]:
def forward(self, X):
    self.A1 = X @ self.W1 + self.B1
    self.Z1 = self.tanh_function(self.A1)
    self.A2 = self.Z1 @ self.W2 + self.B2
    self.Z2 = self.tanh_function(self.A2)
    self.A3 = self.Z2 @ self.W3 + self.B3
    self.Z3 = self.softmax(self.A3)

Line 1: Function definition. It takes the explanatory variable X as an argument.

Line 2: Add bias B1 to the matrix product of the explanatory variable X and the weights W1 to compute A1 of the first layer

Line 3: The result A1 from line 2 is passed through the function (f) (here the activation function tanh) to obtain the output Z1 of the first layer

Line 4: The result Z1 from line 3 is the input of the second layer; bias B2 is added to the matrix product of Z1 and the weights W2 to compute A2 of the second layer

Line 5: The result A2 from line 4 is passed through the function (f) (here the activation function tanh) to obtain the output Z2 of the second layer

Line 6: The result Z2 from line 5 is the input of the third layer; bias B3 is added to the matrix product of Z2 and the weights W3 to compute A3 of the third layer

Line 7: The result A3 from line 6 is passed through the function (f) (softmax, the activation function of the output layer) to obtain the output Z3.
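The same forward pass can be tried outside the class. The following is a standalone sketch, not the class method itself: the function names, the `params` list, and the random input are assumptions made for illustration, and the softmax subtracts the row maximum for numerical stability.

```python
import numpy as np

def softmax(A):
    exp_A = np.exp(A - np.max(A, axis=1, keepdims=True))  # shift for numerical stability
    return exp_A / np.sum(exp_A, axis=1, keepdims=True)

def forward(X, params):
    W1, B1, W2, B2, W3, B3 = params
    Z1 = np.tanh(X @ W1 + B1)      # first layer
    Z2 = np.tanh(Z1 @ W2 + B2)     # second layer
    return softmax(Z2 @ W3 + B3)   # output layer

rng = np.random.default_rng(0)
sizes = [784, 400, 200, 10]
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [0.02 * rng.standard_normal((n_in, n_out)),   # weights
               0.02 * rng.standard_normal((1, n_out))]      # bias

X = rng.random((5, 784))           # 5 hypothetical flattened images
Z3 = forward(X, params)
print(Z3.shape)                    # (5, 10)
print(Z3.sum(axis=1))              # each row sums to 1 (up to rounding)
```

Because the last layer is a softmax, every row of the output can be read as a probability distribution over the 10 classes.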

 Summary
Forward propagation is essential in the learning process of neural networks
This process is described in the forward function above

**Implementation of cross-entropy error in neural networks**

When working with machine learning and deep learning problems, loss functions and cost functions are used to optimize the model being trained. The goal is almost always to minimize the loss function.
The smaller the loss, the better the model. Cross-entropy error is an important cost function. It is used to optimize classification models.

The purpose of cross-entropy is to obtain the output probability (P) and measure its distance from the value of the correct label.

As an example, suppose we have the desired output for a class and the output of a model. The goal is to bring the output of the model as close as possible to the desired output (the value of the correct label).
During model training, the weights of the model are iteratively adjusted with the goal of minimizing cross-entropy loss. The process of adjusting the weights is what model learning is all about, and as the model continues to learn and losses are minimized, we say that the model is learning.


Implementation of "Problem 3" Cross-Entropy Error
Let's implement it based on the cross-entropy error function described above

In [None]:
def cross_entropy_error(self, y, Z):
    L = - np.sum(y * np.log(Z+1e-7)) / len(y)
    return L

Line 1: Definition of the cross-entropy error function. The arguments are y, the correct labels, and Z, the estimated values (the values output from the neural network)
Line 2: Implement the formula for the cross-entropy error. 1e-7 is added inside np.log so that an output of exactly 0 does not produce log(0), which is negative infinity.
Line 3: Return the cross-entropy error.
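To see how the loss behaves, it can be evaluated on a prediction that is close to the correct label and on one that is far from it. The three-class arrays below are made-up values, and the function is written without `self` only so that the sketch runs on its own.

```python
import numpy as np

def cross_entropy_error(y, Z):
    return -np.sum(y * np.log(Z + 1e-7)) / len(y)

y = np.array([[0, 0, 1]])                 # correct class is index 2 (hypothetical)
good = np.array([[0.05, 0.05, 0.90]])     # confident and correct
bad = np.array([[0.70, 0.20, 0.10]])      # confident and wrong

loss_good = cross_entropy_error(y, good)  # about 0.105
loss_bad = cross_entropy_error(y, bad)    # about 2.303
print(loss_good < loss_bad)               # True
```

Only the probability assigned to the correct class contributes, because the one-hot label is 0 everywhere else.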

Summary
Cross-entropy error is a loss function used in class classification.
When calculating the cross-entropy error, 1e-7 is added inside np.log to avoid taking the logarithm of 0

**Neural network backpropagation (error backpropagation)**
Understand the process of error back propagation
Implement the process of error back propagation using Python

What is back propagation (error back propagation)?
As with the steepest descent method, the error is determined and each weight is updated by multiplying the gradient calculated from the error by the learning rate.

In the steepest descent method, each value has been updated so that the loss function is minimized. The error back propagation method similarly uses the cross-entropy error as the loss function when classifying classes.
Weight update
As introduced above, the weights are updated to minimize the cross-entropy error function: the slope obtained by differentiation is multiplied by the learning rate and used to update the weights. This is called the gradient descent method, and there are two types of gradient descent methods.

The "normal" gradient descent method
As you will see in the formulas that follow, the weights are updated using all of the data (the whole batch) at once.

The "stochastic" gradient descent method
Unlike the "normal" gradient descent method, the weights are updated on a subset (mini-batch) of the data.

This stochastic gradient descent method is used to update the weights of the neural network.

Expressed in mathematical form, the stochastic gradient descent method is as follows
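The update rule has the form $W \leftarrow W - \alpha \frac{\partial L}{\partial W}$, where $\alpha$ is the learning rate and the gradient is computed on one mini-batch. The sketch below applies that rule to a toy linear model rather than to the neural network, so that the whole loop fits in a few lines; the data, the model, and the hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                  # toy explanatory variables
true_w = np.array([1.0, -2.0, 0.5])       # weights that generated the targets
y = X @ true_w

w = np.zeros(3)
lr = 0.5
batch_size = 20
for epoch in range(200):
    order = rng.permutation(len(X))       # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]
        grad = X[idx].T @ error / len(idx)   # gradient of half the mean squared error
        w -= lr * grad                       # W <- W - alpha * dL/dW
print(w)  # close to true_w
```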
First, here is some basic knowledge about the differential and partial derivatives that are used in these calculations.

Differentiation and Partial Differentiation
Differentiation is the determination of the slope of an arbitrary function at an arbitrary point.

Differentiation

Consider the case of differentiating a function f(x). The derivative formula is as follows.

$$ \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$

Based on the above formula, the function for differentiation is implemented as follows.

In [None]:
def numerical_diff(f, x):
    h = 1e-4 # 0.0001
    return (f(x+h) - f(x)) / (h)

Now, let's actually differentiate the following function using the above function.

Formula

$$ f(x) = 0.01x^2 + 0.1x $$

Derivative

$$ \frac{df}{dx} = 0.02x + 0.1 $$

The following is a Python description.

In [None]:
def function_1(x):
    return 0.01*x**2 + 0.1*x

Now let's find the derivative of this function at x=5,10.
Let's compare the results of the actual manual differentiation with those carried out with the function for differentiation.

In [None]:
numerical_diff(function_1,5) # 0.20000099999917254
numerical_diff(function_1,10) # 0.3000009999976072
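The small excess over the exact values 0.2 and 0.3 comes from using the forward difference $(f(x+h) - f(x))/h$. A common alternative, shown here as an option rather than as part of the original text, is the central difference, which evaluates the function on both sides of x. The function name `numerical_diff_central` is hypothetical.

```python
def numerical_diff_central(f, x, h=1e-4):
    # Central difference: the error shrinks with h**2 instead of h.
    return (f(x + h) - f(x - h)) / (2 * h)

def function_1(x):
    return 0.01 * x**2 + 0.1 * x

d5 = numerical_diff_central(function_1, 5)    # analytic value: 0.2
d10 = numerical_diff_central(function_1, 10)  # analytic value: 0.3
print(d5, d10)
```

For a quadratic function such as `function_1`, the central difference matches the analytic derivative up to floating-point rounding.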

Partial differentiation

Next, let us review the definition of partial differentiation.

Partial differentiation is the differentiation of a function of multiple variables with respect to one of those variables only, treating the others as constants. Suppose we have the following function.

Formula

$$ f(x_0, x_1) = x_0^2 + x_1^2 $$

Let us find the partial derivatives of this function at $x_0 = 3.0$, $x_1 = 4.0$.
Due to the specification of the function for differentiation, it is written as follows

In [None]:
def function_tmp1(x0):
    return x0*x0 + 4.0 ** 2.
def function_tmp2(x1):
    return 3.0**2 + x1*x1

Let's compare the results of the actual manual differentiation with those carried out with the function for differentiation.

In [None]:
numerical_diff(function_tmp1,3.0)# 6.000099999994291
numerical_diff(function_tmp2,4.0)# 8.00009999998963

The numerical_diff defined so far required the partial derivatives with respect to $x_0$ and $x_1$ to be computed separately. Next, let us define a partial differentiation function that can compute them simultaneously.
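One possible shape for such a function is sketched below. The name `numerical_gradient`, the use of a central difference, and the restriction to one-dimensional input arrays are assumptions made for this illustration; the original text does not show its own version.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Approximate all partial derivatives of f at the 1-D point x."""
    x = np.asarray(x, dtype=float).copy()
    grad = np.zeros_like(x)
    for i in range(x.size):
        original = x[i]
        x[i] = original + h
        f_plus = f(x)                 # f with only x[i] shifted up
        x[i] = original - h
        f_minus = f(x)                # f with only x[i] shifted down
        grad[i] = (f_plus - f_minus) / (2 * h)
        x[i] = original               # restore before moving to the next variable
    return grad

def function_2(x):
    return x[0]**2 + x[1]**2

grad = numerical_gradient(function_2, [3.0, 4.0])  # analytic gradient: [6, 8]
print(grad)
```

Each loop iteration changes one variable and holds the others fixed, which is exactly the definition of a partial derivative.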

Formula for Error Back Propagation Method
As the name implies, the error back propagation method computes from the output layer in reverse order of the forward propagation. Therefore, the calculation starts from the third layer.

"Third layer"

$$ \frac{\partial L}{\partial A_3} = \frac{1}{n_b}(Z_{3} - Y) $$

$$ \frac{\partial L}{\partial B_3} = \sum_{j}^{n_b}\frac{\partial L}{\partial A_{3_j}} $$

$$ \frac{\partial L}{\partial W_3} = Z_{2}^{T} \cdot \frac{\partial L}{\partial A_3} $$

$$ \frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial A_3} \cdot W_3^T $$

$\frac{\partial L}{\partial A_3}$ : gradient of loss $L$ with respect to $A_3$ (batch_size, n_output)

$\frac{\partial L}{\partial A_{3_j}}$ : gradient of loss $L$ with respect to $A_3$ for the j-th sample (n_output,)

$\frac{\partial L}{\partial B_3}$ : gradient of loss $L$ with respect to $B_3$ (n_output,)

$\frac{\partial L}{\partial W_3}$ : gradient of loss $L$ with respect to $W_3$ (n_nodes2, n_output)

$\frac{\partial L}{\partial Z_2}$ : gradient of loss $L$ with respect to $Z_2$ (batch_size, n_nodes2)

$Z_{3}$ : output of the softmax function (batch_size, n_output)

$Y$ : correct label (batch_size, n_output)

$Z_{2}$ : output of the second layer activation function (batch_size, n_nodes2)

$W_3$ : third layer weights (n_nodes2, n_output)

$n_{b}$ : batch size, batch_size

"Second layer"

$$ \frac{\partial L}{\partial A_2} = \frac{\partial L}{\partial Z_2} \odot \{1 - \tanh^2(A_{2})\} $$

$$ \frac{\partial L}{\partial B_2} = \sum_{j}^{n_b}\frac{\partial L}{\partial A_{2_j}} $$

$$ \frac{\partial L}{\partial W_2} = Z_{1}^{T} \cdot \frac{\partial L}{\partial A_2} $$

$$ \frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_2} \cdot W_2^T $$

$\frac{\partial L}{\partial A_2}$ : gradient of loss $L$ with respect to $A_2$ (batch_size, n_nodes2)

$\frac{\partial L}{\partial A_{2_j}}$ : gradient of loss $L$ with respect to $A_2$ for the j-th sample (n_nodes2,)

$\frac{\partial L}{\partial B_2}$ : gradient of loss $L$ with respect to $B_2$ (n_nodes2,)

$\frac{\partial L}{\partial W_2}$ : gradient of loss $L$ with respect to $W_2$ (n_nodes1, n_nodes2)

$\frac{\partial L}{\partial Z_2}$ : gradient of loss $L$ with respect to $Z_2$ (batch_size, n_nodes2)

$A_2$ : second layer output (batch_size, n_nodes2)

$Z_{1}$ : output of the first layer activation function (batch_size, n_nodes1)

$W_2$ : second layer weights (n_nodes1, n_nodes2)

The symbol $\odot$ means the "Hadamard product".
The Hadamard product is the element-wise product of two matrices of the same size. The output has the same size as the inputs.

"First layer"

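$$ \frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial Z_1} \odot \{1 - \tanh^2(A_{1})\} $$

$$ \frac{\partial L}{\partial B_1} = \sum_{j}^{n_b}\frac{\partial L}{\partial A_{1_j}} $$

$$ \frac{\partial L}{\partial W_1} = X^{T} \cdot \frac{\partial L}{\partial A_1} $$

Implementation
Following the formulas given in ディープロ, the back propagation process can be implemented as a program as follows.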

In [None]:
def backward(self, X, y):
    dA3 = (self.Z3 - y)/self.batch_size
    dW3 = self.Z2.T @ dA3
    dB3 = np.sum(dA3, axis=0)
    dZ2 = dA3 @ self.W3.T
    dA2 = dZ2 * (1 - self.tanh_function(self.A2)**2)
    dW2 = self.Z1.T @ dA2
    dB2 = np.sum(dA2, axis=0)
    dZ1 = dA2 @ self.W2.T
    dA1 = dZ1 * (1 - self.tanh_function(self.A1)**2)
    dW1 = X.T @ dA1
    dB1 = np.sum(dA1, axis=0)
    self.W3 -= self.lr * dW3
    self.B3 -= self.lr * dB3
    self.W2 -= self.lr * dW2
    self.B2 -= self.lr * dB2
    self.W1 -= self.lr * dW1
    self.B1 -= self.lr * dB1

Line 1: Definition of the function. It takes as arguments the explanatory variable X and the objective variable y.
Line 2: Third layer, back propagation through the activation function softmax (combined with the cross-entropy error): $\frac{\partial L}{\partial A_3} = \frac{1}{n_b}(Z_{3} - Y)$
Line 3: Third layer, gradient of the weights: $\frac{\partial L}{\partial W_3} = Z_{2}^{T}\cdot \frac{\partial L}{\partial A_3}$
Line 4: Third layer, gradient of the bias: $\frac{\partial L}{\partial B_3} = \sum_{j}^{n_b}\frac{\partial L}{\partial A_{3_j}}$
Line 5: Second layer, gradient of the output: $\frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial A_3} \cdot W_3^T$
Line 6: Second layer, back propagation through the activation function tanh: $\frac{\partial L}{\partial A_2} = \frac{\partial L}{\partial Z_2} \odot \{1-\tanh^2(A_{2})\}$
Line 7: Second layer, gradient of the weights: $\frac{\partial L}{\partial W_2} = Z_{1}^T \cdot \frac{\partial L}{\partial A_2}$
Line 8: Second layer, gradient of the bias: $\frac{\partial L}{\partial B_2} = \sum_{j}^{n_b}\frac{\partial L}{\partial A_{2_j}}$
Line 9: First layer, gradient of the output: $\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_2} \cdot W_2^T$
Line 10: First layer, back propagation through the activation function tanh: $\frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial Z_1} \odot \{1-\tanh^2(A_{1})\}$
Line 11: First layer, gradient of the weights: $\frac{\partial L}{\partial W_1} = X^T \cdot \frac{\partial L}{\partial A_1}$
Line 12: First layer, gradient of the bias: $\frac{\partial L}{\partial B_1} = \sum_{j}^{n_b}\frac{\partial L}{\partial A_{1_j}}$
Lines 13 to 18: Update the weights and biases by subtracting each calculated gradient multiplied by the learning rate
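A useful sanity check for formulas like these is to compare an analytic gradient with a numerical one. The sketch below does this only for the first step, $\frac{\partial L}{\partial A_3} = \frac{1}{n_b}(Z_3 - Y)$, on a tiny random example; the helper names, the sizes, and the random data are assumptions made for illustration.

```python
import numpy as np

def softmax(A):
    exp_A = np.exp(A - np.max(A, axis=1, keepdims=True))
    return exp_A / np.sum(exp_A, axis=1, keepdims=True)

def loss_from_A3(A3, y):
    # Cross-entropy error of softmax(A3) against one-hot labels y.
    return -np.sum(y * np.log(softmax(A3) + 1e-7)) / len(y)

rng = np.random.default_rng(0)
n_b, n_output = 4, 3
A3 = rng.standard_normal((n_b, n_output))
y = np.eye(n_output)[rng.integers(0, n_output, size=n_b)]   # random one-hot labels

analytic = (softmax(A3) - y) / n_b        # formula used in line 2 of backward

numeric = np.zeros_like(A3)
h = 1e-5
for i in range(n_b):
    for j in range(n_output):
        A_plus, A_minus = A3.copy(), A3.copy()
        A_plus[i, j] += h
        A_minus[i, j] -= h
        numeric[i, j] = (loss_from_A3(A_plus, y) - loss_from_A3(A_minus, y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)   # a very small number if the formula is right
```

The same kind of check can be extended to the weights and biases, although it becomes slow for networks of realistic size.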

Summary
Backpropagation is an integral part of the neural network training process, updating weights and biases
This process is described in the backward function above

**Neural network estimation**

purpose
 Understanding the process flow of neural network estimation
 Programmatically implement the estimation process

 Estimation

 Let's add a predict function to the class ScratchSimpleNeuralNetrowkClassifier that we have created so far.

In [None]:
def predict(self, X):
    self.forward(X)
    return np.argmax(self.Z3, axis=1)

Line 1: Definition of a function.
Line 2: The function of forward propagation is executed by passing the variable X.
Line 3: The output layer has 10 nodes, and the index of the maximum value is determined as the classified class, so the index of the maximum value is obtained by np.argmax.

Regarding line 3: for classification, the process is as described above (adopting the index of the maximum value).
For continuous value prediction (regression), however, the output layer has a single node, and it is common to use the identity function as the activation function of the output layer and the mean squared error (MSE) as the loss function.

Summary
Estimation, performed by the predict function.

**Neural network learning and estimation**
Implement a function to perform learning using the neural network class you created.
Perform learning functions
Visualize recorded accuracy and loss



"Problem 6" Learning and Estimation
Let's implement the fit function in the neural network class we have created. After implementing the fit function, perform training and calculate the accuracy.

Implementation of learning functions
First, the fit function is implemented using Python as follows.

In [None]:
def fit(self, X, y, X_val=None, y_val=None):
    for _ in range(self.epoch):
        get_mini_batch = GetMiniBatch(X, y, batch_size=self.batch_size)
        for mini_X_train, mini_y_train in get_mini_batch:
            self.forward(mini_X_train)
            self.backward(mini_X_train, mini_y_train)
        self.forward(X)
        self.loss_train.append(self.cross_entropy_error(y, self.Z3))
        if X_val is not None:
            self.forward(X_val)
            self.loss_val.append(self.cross_entropy_error(y_val, self.Z3))
    if self.verbose:
        if X_val is None:
            print(self.loss_train)
        else:
            print(self.loss_train,self.loss_val)

Line 1: Function definition. It receives as arguments the explanatory variable X of the training data, the objective variable y of the training data, the explanatory variable X_val of the evaluation data, and the objective variable y_val of the evaluation data.
Line 2: Loop over the number of epochs.
Line 3: Pass X and y to the GetMiniBatch iterator class and create the iterator get_mini_batch for mini-batches
Line 4: A for statement is executed over the get_mini_batch iterator. The iterator yields the explanatory variables and objective variables of one mini-batch at a time, which are received as mini_X_train, mini_y_train.
Line 5: Forward propagation process
Line 6: Back propagation process
Line 7: Once all mini-batches have been trained, the entire set of explanatory variables is passed through the forward propagation process.
Line 8: The output layer output self.Z3 calculated in line 7, along with the correct answer data y, is passed to the cross-entropy error function to obtain the loss. The loss is stored in loss_train. This loss_train will be used later to visualize the learning process.
Line 9: Determine whether or not X_val, the evaluation data, has been given.
Line 10: Forward propagation of the evaluation data
Line 11: Pass the output layer output self.Z3 calculated in line 10, together with the correct answer data y_val, to the cross-entropy error function to obtain the loss. The loss is stored in loss_val. This loss_val will be used later to visualize the learning process.
Line 12: Determine whether or not to output the learning process.
Lines 13 to 14: If X_val has not been given, print only loss_train
Lines 15 to 16: If X_val has been given, print both loss_train and loss_val
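The fit function relies on a GetMiniBatch class that is not shown in this text. The following is a hypothetical sketch of what such an iterator could look like, written only from how fit uses it (constructed from X, y, and batch_size, then iterated to obtain pairs of mini-batches); the shuffling, the `seed` parameter, and the internal names are assumptions.

```python
import numpy as np

class GetMiniBatch:
    """Iterator that yields shuffled mini-batches (hypothetical sketch)."""

    def __init__(self, X, y, batch_size=20, seed=None):
        self.batch_size = batch_size
        rng = np.random.default_rng(seed)
        shuffle_index = rng.permutation(X.shape[0])
        # Shuffle X and y with the same index so the pairs stay aligned.
        self._X = X[shuffle_index]
        self._y = y[shuffle_index]
        self._stop = int(np.ceil(X.shape[0] / batch_size))

    def __len__(self):
        return self._stop

    def __iter__(self):
        for i in range(self._stop):
            start = i * self.batch_size
            end = start + self.batch_size
            yield self._X[start:end], self._y[start:end]

X_demo = np.arange(50).reshape(25, 2)   # row k is [2k, 2k + 1]
y_demo = np.arange(25)                  # label k belongs to row k
batches = list(GetMiniBatch(X_demo, y_demo, batch_size=10))
print([len(bx) for bx, _ in batches])   # [10, 10, 5]
```

Note that the last mini-batch can be smaller than batch_size when the number of samples is not a multiple of it.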

Running the training
Train the model using the scratch class you have created so far.

In [None]:
nn = ScratchSimpleNeuralNetrowkClassifier(epoch=10)
nn.fit(X_train,y_train_one_hot, X_val, y_val_one_hot)

Line 1: Instantiation of the scratch class as nn
Line 2: Execution of the fit function of the instance nn

Calculation of Accuracy
Calculate the accuracy using the trained model.

In [None]:
from sklearn.metrics import accuracy_score
pred_train = nn.predict(X_train)
accuracy = accuracy_score(y_train, pred_train)

Line 1: Imports the function for calculating accuracy
Line 2: Executes the predict function of the trained instance nn and receives the return value (predicted values) as pred_train
Line 3: Passes the correct answer data (y_train) and the predicted values (pred_train) to accuracy_score and calculates the accuracy.
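Accuracy is simply the fraction of samples whose predicted class equals the correct class, so it can also be computed directly with NumPy. The label arrays below are made-up values for illustration. Note that the comparison needs integer class labels, not one-hot vectors.

```python
import numpy as np

y_true = np.array([3, 5, 0, 9, 1])   # hypothetical correct labels
y_pred = np.array([3, 5, 0, 4, 1])   # hypothetical predictions
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.8
```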

Plotting the "Problem 7" learning curve

The loss at each epoch is recorded in loss_train and loss_val by the fit function, and can be visualized by referring to them through the instance.

In [None]:
import matplotlib.pyplot as plt
plt.plot(range(nn.epoch), nn.loss_train)

Line 1: Import matplotlib's pyplot module, which is used for visualization.
Line 2: The x and y values to be drawn are passed to the matplotlib plotting function. Here, the x-axis is the number of epochs, and the y-axis is loss_train

Summary
Visualization of recorded accuracies and losses can help you know if the learning process went smoothly