<a href="https://colab.research.google.com/github/LucyMariel/Lucy/blob/master/ScratchDeepNeuralNetwork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will extend the neural network implementation that we built from scratch.
Last time we created a three-layer neural network; this time we will rewrite it so that it can easily be extended to an arbitrary number of layers. We will then add more advanced features: new activation functions, weight initialization methods, and optimization methods.

By doing this from scratch, we aim to give you an idea of the inner workings of the various frameworks that we will be using.

Name the new class ScratchDeepNeuralNetrowkClassifier.

Turning layers and other components into classes
By implementing each component as a class, we make the configuration easy to change.

Places to modify

Number of layers
Layer type (other types of layers such as convolutional layers will appear in the future)
Types of activation functions
Weight and bias initialization method
Optimization method

To do this, we create a class for the fully connected layer, for each activation function, for each weight and bias initialization method, and for each optimization method.

You can implement it freely, but here is a simple example. Create instances of the fully connected layers and activation functions as in sample code 1, and use them as in sample codes 2 and 3. Each class will be explained later.

<<Sample Code 1>>

In the fit method of ScratchDeepNeuralNetrowkClassifier

In [None]:

# self.sigma: Standard deviation of Gaussian distribution
# self.lr : Learning rate
# self.n_nodes1: Number of nodes in the first layer
# self.n_nodes2: Number of nodes in the second layer
# self.n_output: Number of nodes in the output layer

optimizer = SGD(self.lr)
self.FC1 = FC(self.n_features, self.n_nodes1, SimpleInitializer(self.sigma), optimizer)
self.activation1 = Tanh()
self.FC2 = FC(self.n_nodes1, self.n_nodes2, SimpleInitializer(self.sigma), optimizer)
self.activation2 = Tanh()
self.FC3 = FC(self.n_nodes2, self.n_output, SimpleInitializer(self.sigma), optimizer)
self.activation3 = Softmax()


《Sample code 2》

Forward for each iteration

In [None]:
A1 = self.FC1.forward(X)
Z1 = self.activation1.forward(A1)
A2 = self.FC2.forward(Z1)
Z2 = self.activation2.forward(A2)
A3 = self.FC3.forward(Z2)
Z3 = self.activation3.forward(A3)

《Sample code 3》

Backward for each iteration

In [None]:
dA3 = self.activation3.backward(Z3, Y) # The softmax backward pass is combined with the cross-entropy error
dZ2 = self.FC3.backward(dA3)
dA2 = self.activation2.backward(dZ2)
dZ1 = self.FC2.backward(dA2)
dA1 = self.activation1.backward(dZ1)
dZ0 = self.FC1.backward(dA1) # dZ0 is not used

[Problem 1] Turning the fully connected layer into a class
Please implement the fully connected layer as a class.

Below is a template. Initialize the weights and bias in the constructor, and prepare forward and backward methods. By holding the weight W, the bias B, and the forward input X as instance variables, the methods need no complicated inputs or outputs.

An instance can also be passed as an argument. If the constructor receives an initializer instance, it can use that instance to initialize the weights and bias, and you can change the initialization method simply by passing a different instance.

You can also pass self as an argument. This lets the layer update its own weights with a call such as self.optimizer.update(self). The update needs several values, but all of them can be instance variables of the fully connected layer.

The initialization and optimization classes are described later.

Model

The class below is the template for a fully connected (FC) layer. It defines the layer's initialization, forward propagation, and backward propagation; the docstrings and comments describe the purpose of each part.

In [None]:
class FC:
    """
    Fully connected layer from n_nodes1 nodes to n_nodes2 nodes
    Parameters
    ----------
    n_nodes1 : int
      Number of nodes in the previous layer
    n_nodes2 : int
      Number of nodes in the later layer
    initializer: instance of initialization method
    optimizer: instance of optimization method
    """
    def __init__(self, n_nodes1, n_nodes2, initializer, optimizer):
        self.optimizer = optimizer
        # Initialize
        # Use the initializer to set self.W and self.B for the given numbers of nodes.
        pass
    def forward(self, X):
        """
        Forward: compute the layer output by applying the weights and bias to the input.
        Parameters
        ----------
        X : ndarray, shape (batch_size, n_nodes1)
            Input
        Returns
        ----------
        A : ndarray, shape (batch_size, n_nodes2)
            Output
        """
        pass
        return A
    def backward(self, dA):
        """
        Backward: compute the gradient of the loss with respect to the layer input (dZ),
        which is propagated back to the previous layer.
        Parameters
        ----------
        dA : ndarray, shape (batch_size, n_nodes2)
            Gradient flowing in from the later layer
        Returns
        ----------
        dZ : ndarray, shape (batch_size, n_nodes1)
            Gradient passed on to the previous layer
        """
        pass
        # Update: the optimizer modifies this layer in place, so rebinding self is unnecessary
        self.optimizer.update(self)
        return dZ
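
One possible way to fill in the template is sketched below. It assumes the initializer exposes W(n_nodes1, n_nodes2) and B(n_nodes2) as in the Problem 2 template, and that the optimizer reads the gradients from layer.dW and layer.dB (these two attribute names are an assumption of this sketch). The classes _ConstInitializer and _NoOpOptimizer are hypothetical stand-ins that exist only so the example runs on its own.

```python
import numpy as np


class FC:
    """Fully connected layer from n_nodes1 to n_nodes2 (illustrative sketch)."""

    def __init__(self, n_nodes1, n_nodes2, initializer, optimizer):
        self.optimizer = optimizer
        # Assumed initializer interface: W(n_nodes1, n_nodes2) and B(n_nodes2)
        self.W = initializer.W(n_nodes1, n_nodes2)
        self.B = initializer.B(n_nodes2)

    def forward(self, X):
        self.X = X  # keep the input for the backward pass
        return X @ self.W + self.B

    def backward(self, dA):
        # Store the gradients on the layer so the optimizer can read them
        self.dW = self.X.T @ dA
        self.dB = dA.sum(axis=0)
        dZ = dA @ self.W.T  # computed before W is updated
        self.optimizer.update(self)
        return dZ


# Hypothetical stand-ins, used only to exercise the layer
class _ConstInitializer:
    def W(self, n_nodes1, n_nodes2):
        return np.full((n_nodes1, n_nodes2), 0.1)

    def B(self, n_nodes2):
        return np.zeros(n_nodes2)


class _NoOpOptimizer:
    def update(self, layer):
        pass  # leaves the layer unchanged


layer = FC(3, 2, _ConstInitializer(), _NoOpOptimizer())
A = layer.forward(np.ones((4, 3)))     # shape (4, 2)
dZ = layer.backward(np.ones((4, 2)))   # shape (4, 3)
```

Computing dZ before calling the optimizer matters: the gradient for the previous layer should use the weights that produced the forward output, not the updated ones.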


[Problem 2] Turning the initialization method into a class
Implement the initialization code as a class.

As mentioned above, an instance of the initialization method is passed to the constructor of the fully connected layer. Please add the necessary code to the following template. Because the initializer receives the standard deviation (sigma) in its own constructor, the fully connected layer class does not need to handle this value.

The initialization method we have used so far will be named the SimpleInitializer class.

《Template: SimpleInitializer》

In [None]:
class SimpleInitializer:
    """
    Simple initialization with Gaussian distribution
    Parameters
    ----------
    sigma : float
      Standard deviation of Gaussian distribution
    """
    def __init__(self, sigma):
        self.sigma = sigma
    def W(self, n_nodes1, n_nodes2):
        """
        Weight initialization
        Parameters
        ----------
        n_nodes1 : int
          Number of nodes in the previous layer
        n_nodes2 : int
          Number of nodes in the later layer

        Returns
        ----------
        W :
        """
        pass
        return W
    def B(self, n_nodes2):
        """
        Bias initialization
        Parameters
        ----------
        n_nodes2 : int
          Number of nodes in the later layer

        Returns
        ----------
        B :
        """
        pass
        return B
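
A minimal sketch of the completed initializer could look like this. It assumes that both the weights and the bias are drawn from a zero-mean Gaussian with standard deviation sigma, and that the bias is stored as a 1-D array of length n_nodes2; the shape of B is a choice made for this sketch.

```python
import numpy as np


class SimpleInitializer:
    """Gaussian initialization with a fixed standard deviation (sketch)."""

    def __init__(self, sigma):
        self.sigma = sigma

    def W(self, n_nodes1, n_nodes2):
        # Weight matrix of shape (n_nodes1, n_nodes2)
        return self.sigma * np.random.randn(n_nodes1, n_nodes2)

    def B(self, n_nodes2):
        # Bias vector of shape (n_nodes2,), broadcast over the batch in forward
        return self.sigma * np.random.randn(n_nodes2)


initializer = SimpleInitializer(sigma=0.01)
W = initializer.W(784, 400)
B = initializer.B(400)
```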

[Problem 3] Turning the optimization method into a class
Please implement the optimization method as a class.

Like the initialization method, the optimization method is passed to the fully connected layer as an instance. During the backward pass, the layer can update itself with self.optimizer.update(self). Please add the necessary code to the following template.

The optimization method we have used so far is implemented as the SGD class (stochastic gradient descent).

Prototype

In [None]:
class SGD:
    """
    Stochastic gradient descent
    Parameters
    ----------
    lr : Learning rate
    """
    def __init__(self, lr):
        self.lr = lr
    def update(self, layer):
        """
        Update weights and biases for a layer
        Parameters
        ----------
        layer : Instance of the layer before update
        """

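The update method can be sketched as follows, assuming the layer stores its parameters in W and B and its gradients in dW and dB (the gradient attribute names are an assumption, not something the template fixes). A SimpleNamespace stands in for a real layer so the example runs on its own.

```python
import numpy as np
from types import SimpleNamespace


class SGD:
    """Stochastic gradient descent (sketch)."""

    def __init__(self, lr):
        self.lr = lr

    def update(self, layer):
        # Assumes the layer exposes W, B and the gradients dW, dB
        layer.W = layer.W - self.lr * layer.dW
        layer.B = layer.B - self.lr * layer.dB


# Hypothetical stand-in for a fully connected layer
layer = SimpleNamespace(
    W=np.ones((2, 2)), B=np.zeros(2),
    dW=np.full((2, 2), 0.5), dB=np.ones(2),
)
SGD(lr=0.1).update(layer)
```

Because update modifies the layer it receives, one SGD instance can be shared by every layer, as in sample code 1.
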
[Problem 4] Turning the activation functions into classes
Please implement each activation function as a class.

Backpropagation through the softmax function becomes simple if it is implemented together with the cross-entropy error calculation.
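
As a sketch under assumptions, the Tanh and Softmax classes used in sample codes 1 to 3 might look like the following. Softmax.backward takes the forward output Z and the one-hot labels Y and returns the combined softmax and cross-entropy gradient; dividing by the batch size is a convention chosen here, not something the text prescribes.

```python
import numpy as np


class Tanh:
    def forward(self, A):
        self.Z = np.tanh(A)  # keep the output for the backward pass
        return self.Z

    def backward(self, dZ):
        # d tanh(a) / da = 1 - tanh(a)^2
        return dZ * (1.0 - self.Z ** 2)


class Softmax:
    def forward(self, A):
        # Subtract the row maximum for numerical stability
        exp = np.exp(A - A.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

    def backward(self, Z, Y):
        # Gradient of the mean cross-entropy error w.r.t. the softmax input
        return (Z - Y) / Y.shape[0]


A = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
Y = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])  # one-hot labels
softmax = Softmax()
Z = softmax.forward(A)
dA = softmax.backward(Z, Y)

tanh = Tanh()
tanh_out = tanh.forward(np.zeros((1, 2)))
tanh_grad = tanh.backward(np.ones((1, 2)))
```

Combining the two steps avoids forming the full softmax Jacobian: the gradient collapses to the difference between the predicted probabilities and the labels.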

Advanced topics
We will now implement activation functions, weight initialization methods, and optimization methods that we have not covered so far.

[Problem 5] ReLU class creation
Please implement ReLU (Rectified Linear Unit), a commonly used activation function, as the ReLU class.

ReLU is defined by the following formula.

$$f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$$
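
ReLU passes positive inputs through unchanged and sets everything else to zero, so its derivative is 1 for positive inputs and 0 otherwise. A minimal sketch of a ReLU class with the same forward/backward interface as the other activations (the mask attribute is a name chosen for this example):

```python
import numpy as np


class ReLU:
    def forward(self, A):
        self.mask = A > 0  # remember where the input was positive
        return np.where(self.mask, A, 0.0)

    def backward(self, dZ):
        # The gradient passes through only where the input was positive
        return dZ * self.mask


relu = ReLU()
out = relu.forward(np.array([[-1.0, 0.0, 2.0]]))
grad = relu.backward(np.ones((1, 3)))
```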

[Problem 6] Initial value of weight
So far, the initial values of the weights and biases have simply been drawn from a Gaussian distribution, with the standard deviation treated as a hyperparameter. However, good values for the standard deviation are known. The Xavier initial value (also called the Glorot initial value) is used for the sigmoid and hyperbolic tangent functions, and the He initial value is used for ReLU.

Create the XavierInitializer class and the HeInitializer class.

Initial value of Xavier
The standard deviation $\sigma$ at the initial value of Xavier can be obtained by the following equation

$$\sigma = \frac{1}{\sqrt{n}}$$

$n$ : number of nodes in the previous layer

"paper"

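Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS).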

Initial value of He
The standard deviation $\sigma$ at He's initial value can be obtained by the following equation

$$\sigma = \sqrt{\frac{2}{n}}$$

$n$ : number of nodes in the previous layer

"paper"

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
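
Both initializers can reuse the W/B interface of SimpleInitializer; only the standard deviation changes, and it is computed from the number of nodes in the previous layer instead of being passed in. In the sketch below, initializing the bias to zeros is an assumption of this example, since the formulas above describe only $\sigma$.

```python
import numpy as np


class XavierInitializer:
    """sigma = 1 / sqrt(n), where n is the number of nodes in the previous layer."""

    def W(self, n_nodes1, n_nodes2):
        sigma = 1.0 / np.sqrt(n_nodes1)
        return sigma * np.random.randn(n_nodes1, n_nodes2)

    def B(self, n_nodes2):
        return np.zeros(n_nodes2)  # assumption: zero bias


class HeInitializer:
    """sigma = sqrt(2 / n), where n is the number of nodes in the previous layer."""

    def W(self, n_nodes1, n_nodes2):
        sigma = np.sqrt(2.0 / n_nodes1)
        return sigma * np.random.randn(n_nodes1, n_nodes2)

    def B(self, n_nodes2):
        return np.zeros(n_nodes2)  # assumption: zero bias


W_xavier = XavierInitializer().W(400, 200)  # expected std about 0.05
W_he = HeInitializer().W(400, 200)          # expected std about 0.0707
```

Neither class needs a constructor argument, so switching from SimpleInitializer(sigma) only requires passing a different instance to the layer.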

[Problem 7] Optimization method
A common approach is to vary the learning rate during training. Please create a class for AdaGrad, which is a basic method of this kind.

First, review the SGD update that we have been using so far.

$$W_i^{\prime} = W_i - \alpha \frac{\partial L}{\partial W_i}$$

$$B_i^{\prime} = B_i - \alpha \frac{\partial L}{\partial B_i}$$

$\alpha$ : learning rate (can be changed from layer to layer, but basically assumed to be the same for all)

$\frac{\partial L}{\partial W_i}$ : slope of loss $L$ with respect to $W_i$.

$\frac{\partial L}{\partial B_i}$ : slope of loss $L$ with respect to $B_i$.

Next is AdaGrad. The formula for the bias is omitted, but the bias is updated in the same way as the weights.

AdaGrad gradually reduces the learning rate of each weight as that weight is updated. It accumulates the sum of the squared gradients $H$ over the iterations and scales the learning rate down by the square root of that sum.

As a result, the effective learning rate is different for each weight.

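$$H_i^{\prime} = H_i + \frac{\partial L}{\partial W_i} \odot \frac{\partial L}{\partial W_i}$$

$$W_i^{\prime} = W_i - \alpha \left( \frac{1}{\sqrt{H_i^{\prime}}} \odot \frac{\partial L}{\partial W_i} \right)$$
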
$H_i$ : sum of squares of gradients up to the previous iteration for the i-th layer (initial value is 0)

$H_i^{\prime}$ : updated $H_i$.

"paper"

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research (Vol. 12).
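
The AdaGrad update can be sketched as below. The accumulators are stored on the layer itself (as HW and HB, names chosen for this sketch), so a single optimizer instance can be shared between layers. The small constant eps is an addition not present in the formula; it avoids division by zero while $H$ is still 0.

```python
import numpy as np
from types import SimpleNamespace


class AdaGrad:
    """AdaGrad: per-parameter learning rates (sketch)."""

    def __init__(self, lr, eps=1e-7):
        self.lr = lr
        self.eps = eps  # assumption: guards against division by zero

    def update(self, layer):
        # Create the accumulators on first use (initial value 0)
        if not hasattr(layer, "HW"):
            layer.HW = np.zeros_like(layer.W)
            layer.HB = np.zeros_like(layer.B)
        # Accumulate the element-wise squared gradients
        layer.HW = layer.HW + layer.dW * layer.dW
        layer.HB = layer.HB + layer.dB * layer.dB
        # Scale each step by the inverse square root of the accumulated sum
        layer.W = layer.W - self.lr * layer.dW / (np.sqrt(layer.HW) + self.eps)
        layer.B = layer.B - self.lr * layer.dB / (np.sqrt(layer.HB) + self.eps)


# Hypothetical stand-in for a layer with constant gradients
layer = SimpleNamespace(
    W=np.array([1.0, 1.0]), B=np.array([0.0]),
    dW=np.array([2.0, 0.5]), dB=np.array([1.0]),
)
optimizer = AdaGrad(lr=0.1)
optimizer.update(layer)
W_after_first = layer.W.copy()
optimizer.update(layer)
W_after_second = layer.W.copy()
```

On the first update every weight with a non-zero gradient moves by almost exactly lr, regardless of the gradient's magnitude; with a constant gradient the second step shrinks by a factor of $1/\sqrt{2}$.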

[Problem 8] Class completion
Complete the ScratchDeepNeuralNetrowkClassifier class so that it can be trained and used for prediction with any configuration.
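
One way to support an arbitrary number of layers is to keep the layers in a list and loop over it, instead of naming FC1, FC2, FC3 individually. The sketch below shows only that looping structure. LayerStack, _Scale, and _IdentityOutput are hypothetical names for illustration; in the real classifier the list would hold FC and activation instances and the output activation would be Softmax.

```python
import numpy as np


class LayerStack:
    """Runs forward and backward over any list of layers (sketch)."""

    def __init__(self, layers, output_activation):
        self.layers = layers                      # e.g. [FC, Tanh, FC, Tanh, FC]
        self.output_activation = output_activation

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return self.output_activation.forward(X)

    def backward(self, Z, Y):
        # The output activation is combined with the loss, so it takes (Z, Y)
        grad = self.output_activation.backward(Z, Y)
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad


# Hypothetical toy layers, used only to exercise the loops
class _Scale:
    def __init__(self, k):
        self.k = k

    def forward(self, X):
        return X * self.k

    def backward(self, grad):
        return grad * self.k


class _IdentityOutput:
    def forward(self, A):
        return A

    def backward(self, Z, Y):
        return Z - Y


stack = LayerStack([_Scale(2.0), _Scale(3.0)], _IdentityOutput())
Z = stack.forward(np.array([[1.0]]))           # 1 * 2 * 3
grad = stack.backward(Z, np.array([[5.0]]))    # (6 - 5) * 3 * 2
```

With this structure, changing the depth or swapping an activation function only changes the list passed to the constructor.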

[Problem 9] Learning and estimation
Create several networks with different numbers of layers and activation functions. Then train each one on the MNIST data, predict, and calculate the accuracy.
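
Two small helpers are useful for this step: one-hot encoding the integer labels for the softmax backward pass, and computing the accuracy from the predicted class probabilities. The function names one_hot and accuracy are chosen for this sketch, and the probabilities below are made-up numbers used only to illustrate the calculation.

```python
import numpy as np


def one_hot(y, n_classes):
    """Convert integer labels of shape (n,) to one-hot rows of shape (n, n_classes)."""
    return np.eye(n_classes)[y]


def accuracy(probabilities, y_true):
    """Fraction of samples whose most probable class matches the label."""
    y_pred = np.argmax(probabilities, axis=1)
    return float(np.mean(y_pred == y_true))


y = np.array([0, 2, 1])
Y = one_hot(y, n_classes=3)

# Illustrative probabilities only: the third sample is misclassified as class 0
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
])
acc = accuracy(probabilities, y)
```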

Two-layer model with random data

In [1]:
import numpy as np

class SimpleInitializer:
    def initialize(self, shape):
        return np.random.randn(*shape) * 0.01

class SGD:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update(self, W, B, dW, dB):
        W -= self.learning_rate * dW
        B -= self.learning_rate * dB
        return W, B

Implement the FC Layer:
Use the code from the previous example for the FC layer.

Build a Simple Neural Network:
Create a simple neural network that includes the FC layer.

In [2]:
class SimpleNN:
    def __init__(self, input_size, hidden_size, output_size):
        initializer = SimpleInitializer()
        optimizer = SGD(learning_rate=0.01)
        self.fc1 = FC(input_size, hidden_size, initializer, optimizer)
        self.fc2 = FC(hidden_size, output_size, initializer, optimizer)

    def forward(self, X):
        A1 = self.fc1.forward(X)
        A2 = self.fc2.forward(A1)
        return A2

    def backward(self, dA):
        dA1 = self.fc2.backward(dA)
        dZ = self.fc1.backward(dA1)
        return dZ

Prepare the Data:
Assume we have some training data X_train and corresponding labels y_train.

In [3]:
# Example data
X_train = np.random.randn(10, 4)  # 10 samples, 4 features
y_train = np.random.randn(10, 3)  # 10 samples, 3 continuous targets (regression)

Train the Model:
Perform forward and backward passes to train the model.

In [None]:
# Initialize the model
input_size = 4
hidden_size = 5
output_size = 3
model = SimpleNN(input_size, hidden_size, output_size)

# Training loop
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    output = model.forward(X_train)

    # Compute loss (simple mean squared error)
    loss = np.mean((output - y_train) ** 2)

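    # Gradient w.r.t. output, averaged over samples only; this is output_size times
    # the exact gradient of the mean loss above, which only rescales the step size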
    dA = 2 * (output - y_train) / y_train.shape[0]

    # Backward pass
    model.backward(dA)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

This simple implementation demonstrates how to integrate the FC layer into a neural network and feed it data. It includes:

An initializer for weights and biases.
An optimizer for updating weights and biases.
A simple neural network with one hidden layer.
Training loop with forward and backward passes.

This model and training loop can be extended with activation functions, deeper architectures, other loss functions, and more elaborate training procedures as needed.

2 layers

In [7]:
import numpy as np

class SimpleInitializer:
    def initialize(self, shape):
        return np.random.randn(*shape) * 0.01

class SGD:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update(self, W, B, dW, dB):
        W -= self.learning_rate * dW
        B -= self.learning_rate * dB
        return W, B

class FC:
    def __init__(self, n_nodes1, n_nodes2, initializer, optimizer):
        self.W = initializer.initialize((n_nodes1, n_nodes2))
        self.B = initializer.initialize((1, n_nodes2))
        self.optimizer = optimizer

    def forward(self, X):
        self.X = X
        A = X @ self.W + self.B
        return A

    def backward(self, dA):
        dW = self.X.T @ dA
        dB = np.sum(dA, axis=0, keepdims=True)
        dZ = dA @ self.W.T

        self.W, self.B = self.optimizer.update(self.W, self.B, dW, dB)

        return dZ

class SimpleNN:
    def __init__(self, input_size, hidden_size, output_size):
        initializer = SimpleInitializer()
        optimizer = SGD(learning_rate=0.01)
        self.fc1 = FC(input_size, hidden_size, initializer, optimizer)
        self.fc2 = FC(hidden_size, output_size, initializer, optimizer)

    def forward(self, X):
        A1 = self.fc1.forward(X)
        A2 = self.fc2.forward(A1)
        return A2

    def backward(self, dA):
        dA1 = self.fc2.backward(dA)
        dZ = self.fc1.backward(dA1)
        return dZ

# Example data
X_train = np.random.randn(10, 4)  # 10 samples, 4 features
y_train = np.random.randn(10, 3)  # 10 samples, 3 continuous targets (regression)

# Initialize the model
input_size = 4
hidden_size = 5
output_size = 3
model = SimpleNN(input_size, hidden_size, output_size)

# Training loop
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    output = model.forward(X_train)

    # Compute loss (simple mean squared error)
    loss = np.mean((output - y_train) ** 2)

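    # Gradient w.r.t. output, averaged over samples only; this is output_size times
    # the exact gradient of the mean loss above, which only rescales the step size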
    dA = 2 * (output - y_train) / y_train.shape[0]

    # Backward pass
    model.backward(dA)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")


Epoch 0, Loss: 1.374524418408315
Epoch 100, Loss: 1.285417688962274
Epoch 200, Loss: 1.172939970615394
Epoch 300, Loss: 0.7052619064188007
Epoch 400, Loss: 0.6693141496497926
Epoch 500, Loss: 0.6532647447823573
Epoch 600, Loss: 0.6228268621464818
Epoch 700, Loss: 0.5947804694322335
Epoch 800, Loss: 0.5834218987744719
Epoch 900, Loss: 0.5790782155746281


In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predictions
predictions = model.forward(X_train)

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_train, predictions)
print(f"Mean Absolute Error: {mae}")

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_train, predictions))
print(f"Root Mean Squared Error: {rmse}")

# R-squared (R²)
r2 = r2_score(y_train, predictions)
print(f"R-squared: {r2}")

Mean Absolute Error: 0.5686681653243103
Root Mean Squared Error: 0.7595997706195532
R-squared: 0.5732025125234386


1 layer

In [5]:
import numpy as np

class SimpleInitializer:
    def initialize(self, shape):
        return np.random.randn(*shape) * 0.01

class SGD:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update(self, W, B, dW, dB):
        W -= self.learning_rate * dW
        B -= self.learning_rate * dB
        return W, B

class FC:
    def __init__(self, n_nodes1, n_nodes2, initializer, optimizer):
        self.W = initializer.initialize((n_nodes1, n_nodes2))
        self.B = initializer.initialize((1, n_nodes2))
        self.optimizer = optimizer

    def forward(self, X):
        self.X = X
        A = X @ self.W + self.B
        return A

    def backward(self, dA):
        dW = self.X.T @ dA
        dB = np.sum(dA, axis=0, keepdims=True)
        dZ = dA @ self.W.T

        self.W, self.B = self.optimizer.update(self.W, self.B, dW, dB)

        return dZ

class SimpleNN:
    def __init__(self, input_size, output_size):
        initializer = SimpleInitializer()
        optimizer = SGD(learning_rate=0.01)
        self.fc = FC(input_size, output_size, initializer, optimizer)

    def forward(self, X):
        A = self.fc.forward(X)
        return A

    def backward(self, dA):
        dZ = self.fc.backward(dA)
        return dZ

# Example data
X_train = np.random.randn(10, 4)  # 10 samples, 4 features
y_train = np.random.randn(10, 3)  # 10 samples, 3 continuous targets (regression)

# Initialize the model
input_size = 4
output_size = 3
model = SimpleNN(input_size, output_size)

# Training loop
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    output = model.forward(X_train)

    # Compute loss (simple mean squared error)
    loss = np.mean((output - y_train) ** 2)

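    # Gradient w.r.t. output, averaged over samples only; this is output_size times
    # the exact gradient of the mean loss above, which only rescales the step size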
    dA = 2 * (output - y_train) / y_train.shape[0]

    # Backward pass
    model.backward(dA)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")


Epoch 0, Loss: 0.8917309973762257
Epoch 100, Loss: 0.5127562498993703
Epoch 200, Loss: 0.4876356393312072
Epoch 300, Loss: 0.48018429432849047
Epoch 400, Loss: 0.47739942243023137
Epoch 500, Loss: 0.4762835816401125
Epoch 600, Loss: 0.47582704205677956
Epoch 700, Loss: 0.47563909398821774
Epoch 800, Loss: 0.47556157928122433
Epoch 900, Loss: 0.47552959324784255


Model Capacity:
Two-layer Model: Stacking layers adds capacity only when a non-linear activation function sits between them. The SimpleNN above applies fc2 directly to the output of fc1, so the composition (X W1 + B1) W2 + B2 is still a single affine map and can represent exactly the same functions as the one-layer model. Inserting an activation such as Tanh or ReLU between the layers is what lets each layer learn a different representation and capture non-linear relationships.
One-layer Model: A single fully connected layer can only capture linear (affine) relationships between the input features and the output.

Learning Dynamics:
Two-layer Model: In the run above, the loss falls slowly for the first 200 epochs (from about 1.37 to 1.17) and then drops sharply. A likely reason is the small initialization (scale 0.01): the gradient of each layer's weights depends on the size of the other layer's parameters, so both gradients start out small.
One-layer Model: The loss falls quickly and then plateaus near 0.4755. This is ordinary linear least squares, a convex problem, so gradient descent approaches the best linear fit directly.
Note that each cell draws fresh random data without a fixed seed, so the absolute loss values of the two runs cannot be compared with each other.

Overfitting vs. Underfitting:
Two-layer Model: Deeper models with activation functions can overfit when they are too complex relative to the amount of training data. With proper regularization and enough data, they can generalize well.
One-layer Model: A single-layer model may underfit when the true relationship is non-linear, which shows up as a training loss that plateaus at a high value.
In this experiment y_train is random noise generated independently of X_train, and every metric is computed on the training data, so the results say nothing about generalization.
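
A quick numerical check makes the capacity point concrete: two fully connected layers with no activation between them, as in the SimpleNN used here, compute the same function as one fully connected layer whose weight is W1 W2 and whose bias is B1 W2 + B2. The random values and the fixed seed below are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))

# Parameters of two stacked FC layers (4 -> 5 -> 3), no activation in between
W1, B1 = rng.standard_normal((4, 5)), rng.standard_normal((1, 5))
W2, B2 = rng.standard_normal((5, 3)), rng.standard_normal((1, 3))
two_layer = (X @ W1 + B1) @ W2 + B2

# Equivalent single FC layer (4 -> 3)
W = W1 @ W2
B = B1 @ W2 + B2
one_layer = X @ W + B

same = np.allclose(two_layer, one_layer)
print(same)
```

Placing a non-linear activation such as Tanh or ReLU between the two layers breaks this equivalence, which is where the extra capacity of deeper networks comes from.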

In [6]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predictions
predictions = model.forward(X_train)

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_train, predictions)
print(f"Mean Absolute Error: {mae}")

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_train, predictions))
print(f"Root Mean Squared Error: {rmse}")

# R-squared (R²)
r2 = r2_score(y_train, predictions)
print(f"R-squared: {r2}")

Mean Absolute Error: 0.5509931714687599
Root Mean Squared Error: 0.6895769662181007
R-squared: 0.42419082441458206


To see the predictions of the SimpleNN model and its current weights (W) and biases (B), you can add some code after your training loop to make predictions and display the model parameters. Here's how you can modify your code:

In [10]:
import numpy as np

class SimpleInitializer:
    def initialize(self, shape):
        return np.random.randn(*shape) * 0.01

class SGD:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update(self, W, B, dW, dB):
        W -= self.learning_rate * dW
        B -= self.learning_rate * dB
        return W, B

class FC:
    def __init__(self, n_nodes1, n_nodes2, initializer, optimizer):
        self.W = initializer.initialize((n_nodes1, n_nodes2))
        self.B = initializer.initialize((1, n_nodes2))
        self.optimizer = optimizer

    def forward(self, X):
        self.X = X
        A = X @ self.W + self.B
        return A

    def backward(self, dA):
        dW = self.X.T @ dA
        dB = np.sum(dA, axis=0, keepdims=True)
        dZ = dA @ self.W.T

        self.W, self.B = self.optimizer.update(self.W, self.B, dW, dB)
        return dZ

class SimpleNN:
    def __init__(self, input_size, output_size):
        initializer = SimpleInitializer()
        optimizer = SGD(learning_rate=0.01)
        self.fc = FC(input_size, output_size, initializer, optimizer)

    def forward(self, X):
        A = self.fc.forward(X)
        return A

    def backward(self, dA):
        dZ = self.fc.backward(dA)
        return dZ

# Example data
X_train = np.random.randn(10, 4)  # 10 samples, 4 features
y_train = np.random.randn(10, 3)  # 10 samples, 3 continuous targets (regression)

# Initialize the model
input_size = 4
output_size = 3
model = SimpleNN(input_size, output_size)

# Training loop
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    output = model.forward(X_train)

    # Compute loss (simple mean squared error)
    loss = np.mean((output - y_train) ** 2)

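    # Gradient w.r.t. output, averaged over samples only; this is output_size times
    # the exact gradient of the mean loss above, which only rescales the step size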
    dA = 2 * (output - y_train) / y_train.shape[0]

    # Backward pass
    model.backward(dA)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# After training
# Forward pass to get predictions
predictions = model.forward(X_train)
print("\nExample predictions:")
print(predictions)

# Display model parameters
print("\nModel parameters:")
print("Weight matrix (W):")
print(model.fc.W)
print("\nBias matrix (B):")
print(model.fc.B)


Epoch 0, Loss: 1.1788306692217547
Epoch 100, Loss: 0.7723893737307727
Epoch 200, Loss: 0.7426058965066623
Epoch 300, Loss: 0.7369098394697801
Epoch 400, Loss: 0.7351943797278327
Epoch 500, Loss: 0.7345603094634776
Epoch 600, Loss: 0.7343067125676473
Epoch 700, Loss: 0.734202105652827
Epoch 800, Loss: 0.7341584137971006
Epoch 900, Loss: 0.7341400709452637

Example predictions:
[[-0.89296643 -0.1649297  -0.01049177]
 [-0.39552486 -1.15703503 -0.66937901]
 [-1.20981986  0.37867071  0.88885449]
 [-0.89978599  0.30695112  0.6673922 ]
 [-1.14924691 -0.21980929  0.34633305]
 [ 0.26059571 -0.11019482 -1.0431927 ]
 [-0.64469452 -0.25163366 -0.3310182 ]
 [-0.60337224 -0.14781291  0.38773812]
 [-0.53201344  0.01978117 -0.0813861 ]
 [-1.32445332  0.80690994  0.33707705]]

Model parameters:
Weight matrix (W):
[[ 0.38747398 -0.02134242 -0.52878613]
 [-0.46397796  0.08630813 -0.01203217]
 [ 0.00190301  0.08857254  0.06947919]
 [-0.1957224   0.59083288  0.43926975]]

Bias matrix (B):
[[-0.78623841 -0.