<a href="https://colab.research.google.com/github/MLcmore2023/MLcmore2023/blob/main/convolutional_neural_network_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Network
A convolutional neural network (CNN) is a supervised machine learning model widely used for image classification, object detection, and other computer vision tasks. Inspired by the human visual system, a CNN comprises interconnected layers of specialized neurons that automatically learn and detect features from raw image data. The network's core components include convolutional layers, pooling layers, and fully connected layers. During training, the network adjusts its learnable parameters, such as filters and biases, to minimize the difference between predicted and actual labels. CNNs are good at capturing local patterns and hierarchical representations in images, thanks to the convolutional operations that extract features and the pooling layers that downsample the data.

### Import libraries and initialize random generator

In [25]:
import time # for measuring training time
import numpy as np # for linear algebra
from keras.datasets import mnist # for loading the dataset

np.random.seed(0)
np.set_printoptions(threshold=7) # printing format

### load the MNIST image data using `Keras` library
Load the MNIST data as a tuple with two entries: the training data and the test data.

In [26]:
mnist_data = mnist.load_data()

In [27]:
training_data = mnist_data[0]
test_data = mnist_data[1]

The ``training_data`` is returned as a tuple with two entries.
The first entry contains the actual training images.

In [28]:
training_inputs, training_results = training_data
# same as:
# training_inputs = training_data[0]
# training_results = training_data[1]

This is a numpy ndarray with 60,000 entries. Each entry is, in turn, a 28 x 28 numpy ndarray holding the 28 * 28 = 784 pixels of a single MNIST image.

The full array and its shape:

In [29]:
display(training_inputs)
display(training_inputs.shape)

array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 

(60000, 28, 28)

The second entry in the ``training_data`` tuple is a numpy ndarray
containing 60,000 entries.  Those entries are just the digit
values (0...9) for the corresponding images contained in the first
entry of the tuple.

In [30]:
display(training_results)
display(training_results.shape)

array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

(60000,)

The ``test_data`` is the same, except
it contains only 10,000 images.

In [31]:
print(test_data[0].shape,test_data[1].shape)

(10000, 28, 28) (10000,)


In [32]:
test_inputs, test_results = test_data

### Preparing the dataset
We will reshape and normalize the dataset before passing it into our CNN model. We will also convert the categorical labels (such as digits 0-9) into a numerical vector format (one-hot encoding). Later, our model will not directly predict the digit of a handwritten image. Rather, it gives a probability distribution over what digit the image might be. For example, with ten classes the label of an image of a 2 is [0%, 0%, 100%, 0%, 0%, 0%, 0%, 0%, 0%, 0%], meaning it has a 100% probability of being a 2 and a 0% probability of being any other digit. To keep training fast, `preprocess_data` below keeps only the digits 0 and 1, so each label becomes a vector with two entries.
By vectorizing the training data, we make later calculations more efficient.
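
The label conversion described above can be sketched in plain numpy. This is only an illustration: `one_hot` is a hypothetical helper written for this example, not a function used elsewhere in the notebook.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a 1 at each label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # row k gets a 1 in column labels[k]; every other entry stays 0
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# three example labels from a three-class problem
encoded = one_hot([2, 0, 1], num_classes=3)
print(encoded)
```

Taking `np.argmax` along each row recovers the original labels, which is how predictions are compared with labels at test time.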

In [33]:
from keras.utils import to_categorical # np_utils is not available in recent Keras releases

def preprocess_data(x, y, limit):
    # keep at most `limit` images of the digit 0 and `limit` images of the digit 1
    zero_index = np.where(y == 0)[0][:limit]
    one_index = np.where(y == 1)[0][:limit]
    all_indices = np.hstack((zero_index, one_index))
    all_indices = np.random.permutation(all_indices) # shuffle the two classes together
    x, y = x[all_indices], y[all_indices]
    x = x.reshape(len(x), 1, 28, 28) # (samples, channels, height, width)
    x = x.astype("float32") / 255 # scale pixel values to [0, 1]
    y = to_categorical(y) # one-hot encode the labels
    y = y.reshape(len(y), 2, 1) # one column vector per label
    return x, y
# 500 ones, 500 zeros
x_train, y_train = preprocess_data(training_inputs, training_results, 500) # 500 images from each class = 1000
x_test, y_test = preprocess_data(test_inputs, test_results, 100) # 100 images from each class = 200
print(x_test.shape)
print(y_test.shape)


(200, 1, 28, 28)
(200, 2, 1)


## Training a CNN
A CNN is trained with gradient descent, in the same way as the multilayer perceptron (MLP) models seen before.

The rough idea is:
1. Randomly initialize the parameters (weights and biases).
2. Compute the gradient of the cost function with respect to the weights and biases over EVERY image (i.e. compute how we should change the weights and biases so that the network is less wrong on EVERY image).
3. Now that we know how to change the weights and biases so that the network is less wrong, use this gradient to update them (subtract the gradient times a small number called the learning rate from the weights and biases).
4. Repeat as many times as time and computational resources permit.

The above is called gradient descent.
However, computing the gradient over EVERY image is often too slow. Therefore, each update uses only a subset of the dataset, whose size is called the mini_batch_size. This is called stochastic gradient descent.


Gradient descent is like a smooth ball rolling down the hill exactly along the steepest direction. Stochastic gradient descent is like a die tumbling down the hill, sometimes rolling sideways, sometimes rolling up, but in general still going down.
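
To make the update rule concrete, here is a minimal sketch of mini-batch stochastic gradient descent on a toy problem (fitting a line, not a CNN). The data, the model `w * x + b`, and all variable names are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): y = 3x + 1 plus a little noise
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.05, size=200)

# Step 1: initialize the parameters
w, b = 0.0, 0.0
learning_rate = 0.1
mini_batch_size = 20

for epoch in range(200):
    order = rng.permutation(len(x))  # visit the samples in a new random order
    for start in range(0, len(x), mini_batch_size):
        batch = order[start:start + mini_batch_size]
        # Step 2: gradient of the mean squared error on this mini-batch only
        error = (w * x[batch] + b) - y[batch]
        grad_w = 2 * np.mean(error * x[batch])
        grad_b = 2 * np.mean(error)
        # Step 3: move the parameters against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Each update sees only 20 of the 200 samples, so individual steps are noisy, but the parameters still settle close to the slope and intercept used to generate the data.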

## CNN network structure
Just like the multilayer perceptron (MLP) neural network seen before, a CNN is a network of layers, each with its own weights and biases, that takes in inputs and gives outputs.


$$y = \text{network}(x, \text{weights}, \text{biases})$$


However, the main difference is that a CNN has many different types of layers, whereas the network in the previous tutorial only had sigmoid neurons. Layer types include convolutional, dense, activation, and dropout layers.

```
network = [
    [ Convolutional, Convolutional.initialize((1, 28, 28), 3, 32) ],
    [ Tanh, Tanh.initialize() ],
    [ Convolutional, Convolutional.initialize((32, 26, 26), 3, 64) ],
    [ Tanh, Tanh.initialize() ],
    [ Reshape, Reshape.initialize( (64, 24, 24), (64 * 24 * 24, 1) ) ],
    [ Dense, Dense.initialize( 64 * 24 * 24, 128) ],
    [ Tanh, Tanh.initialize() ],
    [ Dense, Dense.initialize( 128, 64) ],
    [ Tanh, Tanh.initialize() ],
    [ Dense, Dense.initialize( 64, 10) ],
    [ Softmax, Softmax.initialize() ],
]
```
__sequential network__
We will implement a sequential network in this tutorial. This means that each layer in the network passes its output to the next layer until the final output is generated (Y1 = X2, Y2 = X3, ...).

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn2.jpg" width="50%">


In a sequential network, the information flows in one direction, from the input layer through the hidden layers to the output layer, without any feedback loops or connections between nodes in the same layer. This simplicity and linearity make it easy to build and understand, which is why it is commonly used for various tasks.

__Notations__
- X: inputs
- Y: outputs
- W: parameters (weights and biases)
- E: error (difference between the model's output and actual labels)

__forward propagation__

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn1.JPG" width="50%">

No matter which kind of layer it is, every layer is a function that takes in inputs and gives outputs according to some weights and biases. To make a prediction, we pass the input (a handwritten image) to the first layer. The first layer's output goes into the second layer, whose output goes into the third layer, and so on until the last layer. This is known as feed forward, or forward propagation. Every layer will have a `.forward()` function that takes in X and gives Y.

In [34]:
def predict(network, input):
    for layer, parameters in network:
        output = layer.forward(input, parameters)

        # the output of this layer becomes the input of the next layer (in the next iteration of the loop)
        input = output

    return output

__backward propagation__

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn1.JPG" width="50%">


Each layer must also be able to update its weights and biases according to the error derivative, so that it can reduce the error and therefore improve accuracy. Each layer takes in $\frac{\partial E}{\partial Y}$ (we will explain how to get this later) and needs to calculate the gradient $\frac{\partial E}{\partial W}$ so it knows how to update its $W$ to minimize the error. It also needs to calculate $\frac{\partial E}{\partial X}$, which is handed to the layer before it. Every layer's $\frac{\partial E}{\partial Y}$ is equal to the next layer's $\frac{\partial E}{\partial X}$.

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn3.jpg" width="50%">

The process of 1) every layer calculating the error gradient, 2) updating its parameters, and 3) passing $\frac{\partial E}{\partial X}$ to the previous layer is known as back propagation. This is how we train a model. Later, we will explain where the last layer gets its error from.

## Layers
As seen above, every layer must support forward propagation and backward propagation. We will call these the `forward` and `backward` functions and organize them into classes. All layers will follow this template:

In [35]:
class Generic_layer:
  def forward(input, parameters):
    output = None # calculate output
    return output

  def backward(input, output_gradient, learning_rate, parameters):
    # calculate gradients with respect to the parameters
    parameter_gradient = None

    # update the parameters according to the parameter gradient
    # (real layers update their arrays in place, e.g. with `-=`, so the change persists)
    parameters = parameters - parameter_gradient * learning_rate

    # calculate gradients with respect to the inputs and return them
    input_gradient = None
    return input_gradient


## Dense layer
A dense layer, also known as a fully connected layer, is a type of layer where each neuron (or node) in the layer is connected to every neuron in the previous layer. It takes a 1D array as input and uses a weight matrix and a bias vector to compute the output (also a 1D array). The dense layer plays a crucial role in transforming the input data into higher-level representations, enabling the neural network to learn complex patterns and make predictions across various tasks, such as image recognition, natural language processing, and more.

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn4.jpg" width="50%">

This can be written in matrix form. Assume an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_N]$ of size $N$ and an output vector $\mathbf{y} = [y_1, y_2, \ldots, y_M]$ of size $M$. The output is computed as follows:

$$
\mathbf{y} = \mathbf{W} \cdot \mathbf{x} + \mathbf{b}
$$

where $\mathbf{W}$ is an $M \times N$ matrix and $\mathbf{b}$ is a vector of size $M$.

The gradient equations for backward propagation are:

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn5.jpg" width="30%">
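
In case the image above does not render, the same gradients can be written out in LaTeX. This is a transcription of what the `Dense.backward` code computes, using the symbols already defined ($\mathbf{W}$, $\mathbf{x}$, $\mathbf{b}$, $\mathbf{y}$, and the error $E$); it is assumed, not verified, that the image shows the same three equations.

$$
\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial \mathbf{y}} \cdot \mathbf{x}^T
\qquad
\frac{\partial E}{\partial \mathbf{b}} = \frac{\partial E}{\partial \mathbf{y}}
\qquad
\frac{\partial E}{\partial \mathbf{x}} = \mathbf{W}^T \cdot \frac{\partial E}{\partial \mathbf{y}}
$$

Here $\frac{\partial E}{\partial \mathbf{y}}$ is an $M \times 1$ column vector, so $\frac{\partial E}{\partial \mathbf{W}}$ has the same $M \times N$ shape as $\mathbf{W}$ and $\frac{\partial E}{\partial \mathbf{x}}$ has the same $N \times 1$ shape as $\mathbf{x}$.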





In [36]:
class Dense():
    def initialize(input_size, output_size):
        weights = np.random.randn(output_size, input_size)
        bias = np.random.randn(output_size, 1)
        parameters = [weights, bias]
        return parameters

    def forward(input, parameters):
        # unpack the parameters
        weights, bias = parameters

        return np.dot(weights, input) + bias

    def backward(input, output_gradient, learning_rate, parameters):
        # unpack the parameters
        weights, bias = parameters

        # calculate gradients
        weights_gradient = np.dot(output_gradient, input.T)
        input_gradient = np.dot(weights.T, output_gradient)

        # update the parameters according to the parameter gradient
        weights -= learning_rate * weights_gradient
        bias -= learning_rate * output_gradient

        return input_gradient
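
One way to gain confidence in gradient formulas like these is a finite-difference check. The sketch below is standalone: it repeats the two formulas from `Dense.backward` on small random arrays and compares them with numerical estimates. The sizes, the helper `error`, and the tolerance are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))          # M = 3 outputs, N = 4 inputs
bias = rng.normal(size=(3, 1))
x = rng.normal(size=(4, 1))
output_gradient = rng.normal(size=(3, 1))  # stands in for dE/dY from the next layer

def error(w, b, inp):
    # a scalar E whose gradient with respect to Y is exactly output_gradient
    return float(np.sum(output_gradient * (np.dot(w, inp) + b)))

# analytic gradients, same formulas as in Dense.backward
weights_gradient = np.dot(output_gradient, x.T)
input_gradient = np.dot(weights.T, output_gradient)

# finite-difference estimates, one entry at a time
eps = 1e-6
base = error(weights, bias, x)
numeric_weights = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        bumped = weights.copy()
        bumped[i, j] += eps
        numeric_weights[i, j] = (error(bumped, bias, x) - base) / eps

numeric_input = np.zeros_like(x)
for k in range(x.shape[0]):
    bumped = x.copy()
    bumped[k, 0] += eps
    numeric_input[k, 0] = (error(weights, bias, bumped) - base) / eps

weights_match = np.allclose(numeric_weights, weights_gradient, atol=1e-5)
input_match = np.allclose(numeric_input, input_gradient, atol=1e-5)
print(weights_match, input_match)
```

The same pattern (nudge one number, watch how the error changes) can be applied to any other layer in this notebook.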

## Activation layer
The activation layer takes in some input neurons and simply passes them through an activation function. Thus, the output of that layer has the same shape as the input. It also has no parameters, so the backpropagation function does not update anything.

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn7.jpg" width="30%">

The backpropagation equations are as follows:

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn6.jpg" width="20%">
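
Written in LaTeX, and matching the commented-out line in the template below (it is assumed that the image shows the same rule), the backward pass of an activation layer with activation function $f$ is:

$$
\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} \odot f'(X)
$$

where $\odot$ denotes element-wise multiplication. Since $Y = f(X)$ is applied to each element independently, each input only affects its own output, and the chain rule reduces to one multiplication per element.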


In [37]:
class Generic_activation_layer():
    def initialize():
        # no need to do anything because there are no parameters
        return None

    def forward(input, parameters):
        return # activation_func(input)

    def backward(input, output_gradient, learning_rate, parameters): #learning rate is not used
        return # np.multiply(output_gradient, activation_func_prime(input))


## Tanh activation layer
The hyperbolic tangent (tanh) is an activation function commonly used in neural networks. It maps input values to the range (-1, 1), so its outputs are bounded and zero-centered for data with both negative and positive values. As an activation function, tanh is symmetric about the origin (it is an odd function), making it suitable for tasks where positive and negative values need to be treated similarly.

$$ \text{tanh}(x) = \frac{{e^x - e^{-x}}}{{e^x + e^{-x}}} $$

$$ \frac{{d}}{{dx}} \text{{tanh}}(x) = 1 - \text{{tanh}}^2(x) $$


<img src="https://mathworld.wolfram.com/images/interactive/TanhReal.gif" width="30%">


In [38]:
def tanh(x):
    return np.tanh(x)
def tanh_prime(x):
    return 1 - np.tanh(x) ** 2
class Tanh():
    def initialize():
        # no need to do anything because there are no parameters
        return None

    def forward(input, parameters):
        return tanh(input)

    def backward(input, output_gradient, learning_rate, parameters): #learning rate is not used
        return np.multiply(output_gradient, tanh_prime(input))
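
As a quick sanity check of the derivative formula $1 - \tanh^2(x)$, the following standalone sketch compares it with a central finite difference at a few points. The sample points and step size are arbitrary choices for this illustration.

```python
import numpy as np

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 13)
eps = 1e-6

# central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
max_difference = np.max(np.abs(numeric - tanh_prime(x)))
print(max_difference)
```

The largest difference is tiny, which is what we expect if the analytic derivative is correct.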


## Rectified Linear Unit (ReLU) activation layer
ReLU is defined as ReLU(x) = max(0, x), which means it returns the input value if it is positive and zero otherwise. ReLU introduces non-linearity to the network, enabling it to learn complex patterns and representations. The derivative of ReLU is 1 for positive inputs and 0 for negative inputs, making it computationally efficient to calculate gradients during the backpropagation process.

$$\text{ReLU}(x) = \max(0, x)$$

$$\frac{d}{dx} \text{ReLU}(x) =
\begin{cases}
      1 & \text{if } x > 0 \\
      0 & \text{if } x \leq 0
\end{cases}
$$

<img src="https://www.nomidl.com/wp-content/uploads/2022/04/image-10.png" width="30%">



In [39]:
class ReLU:
    def initialize():
        # no need to do anything because there are no parameters
        return None

    def forward(input, parameters):
        return np.maximum(0, input)

    def backward(input, output_gradient, learning_rate, parameters): # learning rate is not used
        relu_gradient = input > 0
        return np.multiply(output_gradient, relu_gradient)
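
A small worked example may help here. The sketch below applies the same two expressions used in `ReLU.forward` and `ReLU.backward` to a hand-picked column vector; the numbers are made up for illustration.

```python
import numpy as np

x = np.array([[-2.0], [0.0], [3.0]])
output_gradient = np.array([[10.0], [20.0], [30.0]])  # pretend dE/dY from the next layer

# forward: negative entries are clamped to zero
forward = np.maximum(0, x)

# backward: the gradient passes through only where the input was positive
backward = np.multiply(output_gradient, x > 0)

print(forward.ravel())
print(backward.ravel())
```

Only the last entry has a positive input, so it is the only one whose value and gradient pass through unchanged. The entry at exactly 0 is treated like a negative input, following the convention in the formula above.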


## Sigmoid activation layer
The sigmoid function was a popular activation function in early neural network models for several reasons.
1. It is continuous and differentiable, enabling the calculation of gradients.
2. It is non-linear, which means it can solve problems that are not linearly separable.
3. Its output is between 0 and 1, which stabilizes the training process by preventing large, unbounded values from propagating through the network. It also allows for a natural interpretation of the output as a probability.

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
$$
\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
$$

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png" width="30%">


In [40]:
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

class Sigmoid():
    def initialize():
        # no need to do anything because there are no parameters
        return None

    def forward(input, parameters):
        return sigmoid(input)

    def backward(input, output_gradient, learning_rate, parameters): #learning rate and parameters is not used
        return np.multiply(output_gradient, sigmoid_prime(input))


## Softmax activation layer
The softmax activation function is commonly used in the output layer of neural networks for multiclass classification tasks. It converts a vector of real numbers into a probability distribution, where the sum of all the elements in the output vector becomes 1. The softmax function is defined as follows:

$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$
$$ \frac{\partial}{\partial x_i} \text{Softmax}(x_i) = \frac{e^{x_i} \left( \sum_{j=1}^{N} e^{x_j} \right) - e^{x_i} e^{x_i}}{\left( \sum_{j=1}^{N} e^{x_j} \right)^2} = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \cdot \left( 1 - \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \right)
$$
The softmax function ensures that the output values are positive and normalized, allowing them to represent the probabilities of different classes. This makes softmax ideal for multi-class classification problems, as it enables the model to predict the class with the highest probability.
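
The derivative above covers only the case where the output and the input share the same index. Because every output depends on every input through the sum in the denominator, the backward pass needs the full Jacobian. Writing $s_i = \text{Softmax}(x_i)$, the standard result is:

$$
\frac{\partial s_i}{\partial x_j} = s_i \left( \delta_{ij} - s_j \right),
\qquad
\delta_{ij} =
\begin{cases}
      1 & \text{if } i = j \\
      0 & \text{if } i \neq j
\end{cases}
$$

$$
\frac{\partial E}{\partial x_j} = \sum_{i=1}^{N} \frac{\partial E}{\partial s_i} \, s_i \left( \delta_{ij} - s_j \right)
$$

For $i = j$ this reduces to $s_i (1 - s_i)$, the expression given above. The matrix `(np.identity(n) - output.T) * output` in the code below is this Jacobian.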




In [41]:
class Softmax():
    def initialize():
        # no need to do anything because there are no parameters
        return None

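    def forward(input, parameters):
        # subtract the maximum before exponentiating: the result is unchanged, but np.exp cannot overflow
        tmp = np.exp(input - np.max(input))
        output = tmp / np.sum(tmp)
        return output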

    def backward(input, output_gradient, learning_rate, parameters):
        output = Softmax.forward(input, parameters)
        n = np.size(output)
        return np.dot((np.identity(n) - output.T) * output, output_gradient)
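
The Jacobian-based backward pass can be checked numerically. The sketch below is standalone: `softmax` and `softmax_backward` are local helper names for this example that mirror the layer's logic (with the maximum subtracted before exponentiating, an optional change that avoids overflow), and the input values are arbitrary.

```python
import numpy as np

def softmax(x):
    shifted = np.exp(x - np.max(x))  # shifting by the max does not change the result
    return shifted / np.sum(shifted)

def softmax_backward(x, output_gradient):
    s = softmax(x)
    n = np.size(s)
    jacobian = (np.identity(n) - s.T) * s  # entry [i, j] is s_i * (delta_ij - s_j)
    return np.dot(jacobian, output_gradient)

x = np.array([[1.0], [2.0], [0.5]])
g = np.array([[0.3], [-1.2], [0.7]])  # pretend dE/dY from the loss

# central finite differences of E(x) = sum(g * softmax(x))
eps = 1e-6
numeric = np.zeros_like(x)
for k in range(x.size):
    step = np.zeros_like(x)
    step[k, 0] = eps
    numeric[k, 0] = (np.sum(g * softmax(x + step)) - np.sum(g * softmax(x - step))) / (2 * eps)

analytic = softmax_backward(x, g)
gradients_match = np.allclose(numeric, analytic, atol=1e-6)
print(gradients_match)
```

The entries of `analytic` sum to (numerically) zero: adding the same constant to every input leaves the softmax output unchanged, so the gradient has no component in that direction.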


## Reshape layer
The reshape layer changes the dimensions of its input while preserving the total number of elements. It transforms the data into the shape required by subsequent layers in the network, for example flattening the 3D output of a convolutional layer into the column vector expected by a dense layer.




In [42]:
class Reshape():
    def initialize(input_shape, output_shape):
        parameters =  input_shape, output_shape
        return parameters

    def forward(input, parameters):
        input_shape, output_shape = parameters
        return np.reshape(input, output_shape)

    def backward(input, output_gradient, learning_rate, parameters):
        input_shape, output_shape = parameters
        return np.reshape(output_gradient, input_shape)


## Convolution and cross-correlation

<img src="https://miro.medium.com/v2/resize:fit:640/0*e-SMFTzO8r7skkpc" width="60%">

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn9.jpg" width="50%">

In [43]:
def correlate2d(input_image, kernel, mode='valid'):
    # Get dimensions of the kernel
    kernel_height, kernel_width = kernel.shape

    # Zero-pad the input so that a plain sliding window produces the requested mode.
    # (Without padding, 'same' and 'full' would leave the border entries at zero.)
    if mode == 'valid':
        # kernel must lie fully inside the image: no padding
        pad_top = pad_bottom = pad_left = pad_right = 0
    elif mode == 'same':
        # output has the same size as the input (centered for odd kernel sizes)
        pad_top, pad_left = (kernel_height - 1) // 2, (kernel_width - 1) // 2
        pad_bottom, pad_right = kernel_height - 1 - pad_top, kernel_width - 1 - pad_left
    else:  # mode == 'full'
        # every position where the kernel overlaps the image by at least one pixel
        pad_top = pad_bottom = kernel_height - 1
        pad_left = pad_right = kernel_width - 1
    padded = np.pad(input_image, ((pad_top, pad_bottom), (pad_left, pad_right)), mode='constant')

    # Output size of a 'valid' correlation on the padded image
    output_height = padded.shape[0] - kernel_height + 1
    output_width = padded.shape[1] - kernel_width + 1
    output = np.zeros((output_height, output_width))

    # Perform 2D correlation
    for i in range(output_height):
        for j in range(output_width):
            # Extract the region of interest (ROI) under the kernel
            roi = padded[i:i + kernel_height, j:j + kernel_width]
            output[i, j] = np.sum(roi * kernel)

    return output

Convolution flips the kernel both horizontally and vertically (a 180-degree rotation) before sliding it across the input image to compute the output, while cross-correlation slides the kernel across the image without flipping it. Therefore, we can write the `convolve2d` function using `correlate2d`:

In [44]:
def convolve2d(input_image, kernel, mode='valid'):
    # Flip the kernel horizontally and vertically
    flipped_kernel = np.flipud(np.fliplr(kernel))

    # Call the correlate2d function with the flipped kernel
    conv_result = correlate2d(input_image, flipped_kernel, mode=mode)

    return conv_result
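
To see the difference on actual numbers, here is a tiny standalone example. `valid_correlate` is a stripped-down helper written only for this illustration (valid mode, no padding); the image and kernel values are made up.

```python
import numpy as np

def valid_correlate(image, kernel):
    # slide the kernel over every position where it fits entirely inside the image
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(9, dtype=float).reshape(3, 3)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

# cross-correlation: use the kernel as it is
correlated = valid_correlate(image, kernel)

# convolution: flip the kernel in both directions first
convolved = valid_correlate(image, np.flipud(np.fliplr(kernel)))

print(correlated)  # every entry is -4 (top-left pixel minus bottom-right pixel)
print(convolved)   # every entry is 4 (the flipped kernel reverses the sign)
```

For a kernel that is unchanged by a 180-degree rotation, the two operations give the same result.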


## Convolutional Layer
The convolutional layer is a fundamental component of CNNs used in image recognition tasks. It applies a set of learnable filters, called kernels, to the input image in order to extract local features. The filters slide over the image, computing element-wise multiplications and summations to produce feature maps that highlight relevant patterns. This process enables the network to learn hierarchical representations, capturing low-level features in early layers and complex patterns in deeper layers, facilitating more efficient and accurate feature extraction for subsequent tasks like classification or object detection.

The convolutional layer has 3 main hyperparameters: the input shape, the kernel size, and the depth (the number of kernels, and therefore the number of output channels).
<img src="https://i0.wp.com/developersbreach.com/wp-content/uploads/2020/08/cnn_banner.png?fit=1200%2C564&ssl=1" width="80%">

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn11.jpg" width="50%">

<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn12.jpg" width="50%">
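
The relationship between these three settings and the sizes of the layer's arrays can be sketched as follows. The two helper functions are hypothetical names for this example; the formulas follow the shapes used in the `Convolutional` class below, which stores one bias per output element.

```python
def conv_output_shape(input_shape, kernel_size, depth):
    # a 'valid' correlation shrinks each spatial dimension by kernel_size - 1
    input_depth, input_height, input_width = input_shape
    return (depth, input_height - kernel_size + 1, input_width - kernel_size + 1)

def conv_parameter_count(input_shape, kernel_size, depth):
    input_depth = input_shape[0]
    out_depth, out_height, out_width = conv_output_shape(input_shape, kernel_size, depth)
    kernel_count = depth * input_depth * kernel_size * kernel_size  # one kernel per (output, input) channel pair
    bias_count = out_depth * out_height * out_width                 # one bias per output element
    return kernel_count + bias_count

# a 1x28x28 image with five 3x3 kernels
shape = conv_output_shape((1, 28, 28), 3, 5)
count = conv_parameter_count((1, 28, 28), 3, 5)
print(shape)  # (5, 26, 26)
print(count)  # 45 kernel weights + 3380 biases = 3425
```

Chaining the function reproduces the shapes that the `Reshape` layer needs: the output shape of one convolutional layer is the input shape of the next.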



In [45]:
class Convolutional():
    def initialize(input_shape, kernel_size, depth): # e.g. 1x28x28 image, 3x3 kernels
        input_depth, input_height, input_width = input_shape
        output_shape = (depth, input_height - kernel_size + 1, input_width - kernel_size + 1)
        kernels_shape = (depth, input_depth, kernel_size, kernel_size)
        kernels = np.random.randn(*kernels_shape)
        biases = np.random.randn(*output_shape)
        parameters = depth, input_shape, output_shape, kernels_shape, kernels, biases
        return parameters

    def forward(input, parameters):
        depth, input_shape, output_shape, kernels_shape, kernels, biases = parameters
        input_depth, input_height, input_width = input_shape

        output = np.copy(biases)
        for i in range(depth):
            for j in range(input_depth):
                output[i] += correlate2d(input[j], kernels[i, j], "valid")
        return output

    def backward(input, output_gradient, learning_rate, parameters):
        depth, input_shape, output_shape, kernels_shape, kernels, biases = parameters
        input_depth, input_height, input_width = input_shape

        kernels_gradient = np.zeros(kernels_shape)
        input_gradient = np.zeros(input_shape)

        for i in range(depth):
            for j in range(input_depth):
                kernels_gradient[i, j] = correlate2d(input[j], output_gradient[i], "valid")
                input_gradient[j] += convolve2d(output_gradient[i], kernels[i, j], "full")

        kernels -= learning_rate * kernels_gradient
        biases -= learning_rate * output_gradient
        return input_gradient
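
For reference, the loops above can be summarized as equations. This is a transcription of the code, with $X_j$ the $j$-th input channel, $Y_i$ the $i$-th output channel, $K_{ij}$ the kernel connecting them, and $B_i$ the bias of output channel $i$; $\star$ denotes valid cross-correlation and $\underset{\text{full}}{*}$ denotes full convolution.

$$
Y_i = B_i + \sum_{j} X_j \star K_{ij}
$$

$$
\frac{\partial E}{\partial K_{ij}} = X_j \star \frac{\partial E}{\partial Y_i}
\qquad
\frac{\partial E}{\partial B_i} = \frac{\partial E}{\partial Y_i}
\qquad
\frac{\partial E}{\partial X_j} = \sum_{i} \frac{\partial E}{\partial Y_i} \underset{\text{full}}{*} K_{ij}
$$

The full convolution in the last equation is what restores the input size: a $26 \times 26$ output gradient combined with a $3 \times 3$ kernel gives a $28 \times 28$ input gradient.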


## Categorical Cross-Entropy Error functions
Remember that during backpropagation, every layer gets the output error gradient from the next layer. However, the very last layer does not have another layer after it. The output error gradient of the last layer is the gradient of the error of the entire network with respect to its output. Therefore, we calculate it using the derivative of the error function, which compares the output of the CNN with the real labels.
<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/.old/Section%200%20ML%20models/images/cnn8.jpg" width="50%">

Categorical Cross-Entropy is used for multi-class classification problems, where there are more than two classes. It measures the dissimilarity between the predicted probability distribution and the true probability distribution of the target classes.

$$L_{\text{CCE}} = -\sum_{i=1}^{K} y_i \cdot \log(p_i)
$$

$$\frac{{\partial L_{\text{CCE}}}}{{\partial p_i}} = -\frac{{y_i}}{{p_i}}
$$

In [46]:
def binary_cross_entropy(y_true, y_pred):
    return np.mean(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_prime(y_true, y_pred):
    return ((1 - y_true) / (1 - y_pred) - y_true / y_pred) / np.size(y_true)
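
The cell above implements binary cross-entropy, which is what the two-class network in this notebook is trained with. For more than two classes, the categorical formulas given earlier could be implemented along these lines. This is a sketch only: the function names and the clipping constant `eps` (which guards against `log(0)` and division by zero) are assumptions, not part of the original code.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # L = -sum_i y_i * log(p_i); clipping keeps log() finite when p_i is 0
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

def categorical_cross_entropy_prime(y_true, y_pred, eps=1e-12):
    # dL/dp_i = -y_i / p_i
    y_pred = np.clip(y_pred, eps, 1.0)
    return -y_true / y_pred

# one-hot label for class 1, and a prediction that gives it 80% probability
y_true = np.array([[0.0], [1.0]])
y_pred = np.array([[0.2], [0.8]])

loss = categorical_cross_entropy(y_true, y_pred)
grad = categorical_cross_entropy_prime(y_true, y_pred)
print(loss)  # -log(0.8), roughly 0.223
print(grad)
```

Only the entry of the true class contributes to the loss and to the gradient, because every other `y_true` entry is zero.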

In [47]:
network = [
    [ Convolutional, Convolutional.initialize((1, 28, 28), 3, 5) ],
    [ Tanh, Tanh.initialize() ],
    [ Reshape, Reshape.initialize( (5, 26, 26), (5*26*26, 1) ) ],
    [ Dense, Dense.initialize( 5* 26 * 26, 100) ],
    [ Tanh, Tanh.initialize() ],
    [ Dense, Dense.initialize( 100, 2) ],
    [ Softmax, Softmax.initialize() ],
]

def train(network, loss, loss_prime, x_train, y_train, epochs, learning_rate):
    for e in range(epochs):
        time1 = time.time()

        error = 0
        for x, y in zip(x_train, y_train):
            # forward pass, remembering the input that each layer received
            layer_inputs = []
            output = x
            for layer, parameters in network:
                layer_inputs.append(output)
                output = layer.forward(output, parameters)

            # error
            error += loss(y, output)

            # backward pass: every layer gets its own input, not the network input x
            grad = loss_prime(y, output)
            for (layer, parameters), layer_input in zip(reversed(network), reversed(layer_inputs)):
                grad = layer.backward(layer_input, grad, learning_rate, parameters)

        time2 = time.time()
        print(f"{e + 1}/{epochs}, error={error}, time = {time2-time1}")
# train
train(
    network,
    binary_cross_entropy,
    binary_cross_entropy_prime,
    x_train,
    y_train,
    epochs=20,
    learning_rate=0.1
)

In [None]:
# test accuracy
count_of_corrects = 0
for x, y in zip(x_test, y_test):
    output = predict(network, x)
    if np.argmax(output)==np.argmax(y):
        count_of_corrects+=1
print(count_of_corrects)


## Todo
- Double-check the wrong-inputs bug: every layer's `backward` must receive that layer's own input, not the network input.
- Add low-level implementations of dropout and pooling.
- Use the same dataset for both.
- In Keras, show the behavior of different models: add more layers, change the sizes of kernels or the number of neurons.

### New models
- Prioritize: RNN (1 hour + 15 min, try to cover everything), Q-learning.
- Maybe skip, but try: DQN (Atari games), graph convolutional network (high level), policy gradient.
- Transformer.

### Exercises
1. Change the size of the hidden `Dense` layer of the network (100 neurons in the network trained above), for example by halving or doubling it, and observe the result.


### References
- https://medium.com/@bdhuma/6-basic-things-to-know-about-convolution-daef5e1bc411
- https://towardsdatascience.com/building-a-convolutional-neural-network-from-scratch-using-numpy-a22808a00a40
- https://github.com/TheIndependentCode/Neural-Network
- https://github.com/andreoniriccardo/CNN-from-scratch
- https://www.youtube.com/watch?v=Lakz2MoHy6o
- http://neuralnetworksanddeeplearning.com/
- https://www.youtube.com/watch?v=pauPCy_s0Ok
