## 🚀 Neural Network Layer

Now that we understand what the `Perceptron` is, let's delve into creating a more complex structure than just a single perceptron or a perceptron chain. Our goal is to construct a perceptron layer, envisioning it as a column of perceptrons.

The data flow in this structure follows these steps:

- All the data is fed into every perceptron in the layer.
- Each perceptron produces a single output value.
- The outputs of all perceptrons are combined into a single column vector output.
- The output vector from one layer can serve as the input vector for the next layer.

### Theory

To build a layer of perceptrons, recall that each perceptron needs one weight per input, one bias, and an activation function. Let's assume that all perceptrons in the layer share the same activation function and differ only in their weights and biases. For instance, in a layer with three perceptrons and three inputs, each perceptron has three weights (one per input) and one bias. In total, the layer has nine weights (3 perceptrons × 3 inputs) and three biases.
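As a quick sanity check of this count, here is a minimal sketch. The helper name `count_parameters` is hypothetical and is not part of the layer implemented later; it only restates the arithmetic above.

```python
def count_parameters(m_inputs, n_perceptrons):
    """Return (number of weights, number of biases) for one layer."""
    n_weights = n_perceptrons * m_inputs  # one weight per (perceptron, input) pair
    n_biases = n_perceptrons              # one bias per perceptron
    return n_weights, n_biases


weights, biases = count_parameters(m_inputs=3, n_perceptrons=3)
print(weights, biases)  # 9 3
```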

To manage these weights and biases efficiently, we associate two indices with each weight: the first identifies the perceptron and the second identifies the input it belongs to. For example, the weight of the first perceptron for its second input is $w_{12}$. These indices follow matrix notation (row, then column), which lets us collect all the weights of the layer in a weights matrix $\mathbb{W}$.

$$
\mathbb{W} = \begin{bmatrix}
    w_{11} & w_{12} & \dots & w_{1m} \\
    w_{21} & w_{22} & \dots & w_{2m} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{n1} & w_{n2} & \dots & w_{nm} \\
\end{bmatrix},
$$

where $m$ is the number of inputs to the layer, and $n$ is the number of perceptrons in the layer.
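To connect this notation to code, the sketch below builds a small $n \times m$ array and reads out $w_{12}$. The sizes and values are made up for illustration. Note that NumPy indices start at 0, so $w_{ij}$ lives at `W[i - 1, j - 1]`.

```python
import numpy as np

n, m = 2, 3  # 2 perceptrons, 3 inputs (illustrative sizes)

# Row i holds the weights of perceptron i; column j corresponds to input j.
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])

w_12 = W[0, 1]  # first perceptron, second input
print(W.shape)  # (2, 3)
print(w_12)     # 0.2
```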

Now, let's consider biases and organize them into a column vector $\vec{b}$:

$$
\vec{b} = \begin{bmatrix}
    b_{1} \\
    b_{2} \\
    \vdots \\
    b_{n} \\
\end{bmatrix},
$$

where $n$ is the number of perceptrons in the layer. Thus, we can describe the layer using three mathematical terms: $\mathbb{W}, \vec{b}, \sigma$, where $\sigma$ denotes the activation function. Now, let's proceed to create a layer.

### Implementation

It's time to implement the theory above. First, we need to initialize the weights and biases. For simplicity, we'll draw the weights at random and start with all biases at zero.

In [11]:
import numpy as np

class Layer:
    def __init__(self, m_inputs: int, n_perceptrons: int):
        """
        Initialize a layer with random weights and zero biases.

        Parameters:
        - m_inputs: Number of input features.
        - n_perceptrons: Number of perceptrons (neurons) in the layer.
        """
        self.m = m_inputs
        self.n = n_perceptrons
        # Stored with shape (m, n): one column of weights per perceptron
        self.weights_matrix = np.random.rand(self.m, self.n)
        # One bias per perceptron, initialized to zero
        self.biases_vector = np.zeros(self.n)

In [12]:
my_first_layer = Layer(m_inputs=3, n_perceptrons=3)
print(f"Dimensions of the weights matrix are {my_first_layer.weights_matrix.shape}")
print(f"Weight matrix of the layer: \n{my_first_layer.weights_matrix}")

print(f"\nDimensions of the biases vector are {my_first_layer.biases_vector.shape}")
print(f"Biases vector of the layer: \n{my_first_layer.biases_vector}")

Dimensions of the weights matrix are (3, 3)
Weight matrix of the layer: 
[[0.93052064 0.19921122 0.72346368]
 [0.43208035 0.21682592 0.76378149]
 [0.89198978 0.57299042 0.89197121]]

Dimensions of the biases vector are (3,)
Biases vector of the layer: 
[0. 0. 0.]


Now that we have the weight matrix and biases vector for the layer, let's focus on the activation function and a method that allows us to pass inputs through the layer, obtaining all the activated results. Let's tackle these steps one by one, starting with the activation function.

#### Activation

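The activation function, denoted as $\sigma$, determines the output of each perceptron in the layer. Common choices include the sigmoid function, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU). The layer below selects one of them by name and also supports softmax, which, unlike the others, normalizes across the whole output vector rather than acting on each component independently.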
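To get a feel for how these functions behave, the following sketch evaluates standalone versions of them on a small sample vector. The free functions and the sample values are illustrative assumptions, not the layer's own methods.

```python
import numpy as np


def relu(x):
    # Negative components become 0, positive ones pass through unchanged
    return np.maximum(0, x)


def sigmoid(x):
    # Squashes each component into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))


def softmax(x):
    # Subtracting the maximum avoids overflow; the outputs sum to 1
    exp_values = np.exp(x - np.max(x))
    return exp_values / exp_values.sum()


a = np.array([-2.0, 0.0, 2.0])  # made-up pre-activation values

print(relu(a))      # [0. 0. 2.]
print(sigmoid(a))   # each value lies between 0 and 1
print(np.tanh(a))   # each value lies between -1 and 1
print(softmax(a))   # non-negative values that sum to 1
```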

In [17]:
import numpy as np

class Layer:
    def __init__(self, m_inputs: int, n_perceptrons: int, activation: str = 'relu'):
        self.m = m_inputs
        self.n = n_perceptrons                           
        self.weights_matrix = np.random.rand(self.m, self.n)
        self.biases_vector = np.zeros(self.n)
        self._set_activation(activation)

    
    def _set_activation(self, activation: str):
        """
        Set the activation function based on the input string.

        Parameters:
        - activation: String representing the activation function.
        """
        if activation == 'relu':
            self.activation = self.relu
        elif activation == 'sigmoid':
            self.activation = self.sigmoid
        elif activation == 'softmax':
            self.activation = self.softmax
        elif activation == 'tanh':
            self.activation = self.tanh
        else:
            raise ValueError(f"Unsupported activation function: {activation}")

    def relu(self, x):
        """
        Rectified Linear Unit (ReLU) activation function.

        Parameters:
        - x: Input.

        Returns:
        - Output after applying ReLU.
        """
        return np.maximum(0, x)

    def sigmoid(self, x):
        """
        Sigmoid activation function.

        Parameters:
        - x: Input.

        Returns:
        - Output after applying sigmoid.
        """
        return 1 / (1 + np.exp(-x))

    def softmax(self, x):
        """
        Softmax activation function.

        Parameters:
        - x: Input.

        Returns:
        - Output after applying softmax.
        """
        exp_values = np.exp(x - np.max(x))
        return exp_values / exp_values.sum(axis=0, keepdims=True)

    def tanh(self, x):
        """
        Hyperbolic Tangent (tanh) activation function.

        Parameters:
        - x: Input.

        Returns:
        - Output after applying tanh.
        """
        return np.tanh(x)


#### Matrix Computations

Now that we have some activation functions in the layer, let's create a method that can propagate the inputs through the layer. We will call this method `forward`, but don't worry about the name for now; it will become clear later why we chose this name.

So, how should we compute the output of this layer? We already know how to compute the output of one perceptron: take the dot product of the inputs and the weights, add the bias, and apply the activation function. What changes with more perceptrons? Nothing in principle: we could repeat this computation for every perceptron and collect the results, and that would be entirely correct. Writing these computations out, however, reveals a pattern that lets us compute the whole output vector at once.

$$
\begin{align*}
y_1 &= \sigma(\sum_{i=1}^m w_{1i}x_i + b_1)\\
y_2 &= \sigma(\sum_{i=1}^m w_{2i}x_i + b_2)\\
& \vdots \\
y_n &= \sigma(\sum_{i=1}^m w_{ni}x_i + b_n)
\end{align*}
$$

where $y_1, y_2, \dots, y_n$ are the outputs of the individual perceptrons. We could stop here, but let's take another short dive into matrix computation. Take the matrix $\mathbb{W}$ and the input vector $\vec{x}$, and see what their product looks like.

$$
\mathbb{W} \cdot \vec{x}^T = \begin{bmatrix}
    w_{11} & w_{12} & \dots & w_{1m} \\
    w_{21} & w_{22} & \dots & w_{2m} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{n1} & w_{n2} & \dots & w_{nm} \\
\end{bmatrix} \cdot \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{m} \\
\end{bmatrix} = \begin{bmatrix}
    \sum_{i=1}^{m} w_{1i}x_i \\
    \sum_{i=1}^{m} w_{2i}x_i \\
    \vdots \\
    \sum_{i=1}^{m} w_{ni}x_i \\
\end{bmatrix}
$$

This result looks quite similar to the result that we got from individual perceptrons. If we just added the biases vector $\vec{b}$ to the matrix dot product, we would be even closer to the previous result.

$$
\mathbb{W} \cdot \vec{x}^T + \vec{b} = \begin{bmatrix}
    w_{11} & w_{12} & \dots & w_{1m} \\
    w_{21} & w_{22} & \dots & w_{2m} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{n1} & w_{n2} & \dots & w_{nm} \\
\end{bmatrix} \cdot \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{m} \\
\end{bmatrix} + \begin{bmatrix}
    b_1 \\
    b_2 \\
    \vdots \\
    b_n \\
\end{bmatrix} = \begin{bmatrix}
    \sum_{i=1}^{m} w_{1i}x_i + b_1 \\
    \sum_{i=1}^{m} w_{2i}x_i + b_2 \\
    \vdots \\
    \sum_{i=1}^{m} w_{ni}x_i + b_n \\
\end{bmatrix} = \vec{a}
$$

We are almost there. Now we just need to activate every component of the result. We do this by letting $\sigma$ act on a vector: it applies the activation to each component and returns a new, activated vector.

$$
\sigma(\vec{a}) = \sigma(\mathbb{W}\cdot\vec{x}^T + \vec{b}) = \begin{bmatrix}
    \sigma(\sum_{i=1}^{m} w_{1i}x_i + b_1)\\
    \sigma(\sum_{i=1}^{m} w_{2i}x_i + b_2)\\
    \vdots \\
    \sigma(\sum_{i=1}^{m} w_{ni}x_i + b_n)\\
\end{bmatrix} = \vec{y}
$$

Now we can see that instead of looping over the individual perceptrons, we can just define a matrix consisting of all the weights from the layer and compute the dot product with the inputs, add a biases vector, and then apply the activation to all components of the resulting vector. Let's implement this now.
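Before implementing it, the equivalence between the per-perceptron loop and the matrix form can be checked numerically. The sketch below is a minimal, self-contained check under stated assumptions: `W` has the theory's shape $(n, m)$, the sizes and the seed are arbitrary, and ReLU stands in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n, m = 4, 3                     # 4 perceptrons, 3 inputs (arbitrary sizes)

W = rng.random((n, m))  # row i holds the weights of perceptron i
b = rng.random(n)       # one bias per perceptron
x = rng.random(m)       # one input vector


def sigma(a):
    # ReLU used as an example activation
    return np.maximum(0, a)


# Per-perceptron computation: one dot product, bias, and activation at a time
y_loop = np.array([sigma(np.dot(W[i], x) + b[i]) for i in range(n)])

# Matrix form: all perceptrons at once
y_matrix = sigma(W @ x + b)

print(np.allclose(y_loop, y_matrix))  # True
```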

In [40]:
import numpy as np

class Layer:
    def __init__(self, m_inputs: int, n_perceptrons: int, activation: str = 'relu'):
        self.m = m_inputs
        self.n = n_perceptrons                           
        self.weights_matrix = np.random.rand(self.m, self.n)
        self.biases_vector = np.zeros(self.n)
        self._set_activation(activation)

    
    def _set_activation(self, activation: str):
        if activation == 'relu':
            self.activation = self.relu
        elif activation == 'sigmoid':
            self.activation = self.sigmoid
        elif activation == 'softmax':
            self.activation = self.softmax
        elif activation == 'tanh':
            self.activation = self.tanh
        else:
            raise ValueError(f"Unsupported activation function: {activation}")

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def softmax(self, x):
        exp_values = np.exp(x - np.max(x))
        return exp_values / exp_values.sum(axis=0, keepdims=True)

    def tanh(self, x):
        return np.tanh(x)
        
    def forward(self, inputs):
        """
        Perform forward pass through the layer.

        Parameters:
        - inputs: 1D sequence (list or NumPy array) of m input values.

        Returns:
        - activation: 1D array of n outputs after applying the activation function.
        """
        # weights_matrix has shape (m, n), so its transpose (n, m) times the
        # m inputs gives n weighted sums; the biases are added element-wise
        argument = np.dot(self.weights_matrix.T, inputs) + self.biases_vector

        # Apply the activation function
        activation = self.activation(argument)
        return activation


In [27]:
x = np.random.rand(3)

layer = Layer(m_inputs=3, n_perceptrons=3, activation="relu")
y = layer.forward(x)

print(f"x = {x}\n y = {y}")

x = [0.70028256 0.18720134 0.80807865]
 y = [1.2176225  1.41055608 0.34167211]


#### Building a Real Neural Network

By chaining layers, passing the output of the first layer as input to the second, and so on, we can now build a real neural network. However, its weights are still just random values, so the network cannot make any meaningful predictions in this state. We need a way to adjust the weights and biases of the individual layers based on training data. This process will be discussed in the next chapter.

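**NOTE:** In the implementation, `weights_matrix` is stored with shape $(m, n)$, which is the transpose of the $n \times m$ matrix $\mathbb{W}$ from the theory, so `weights_matrix.T` plays the role of $\mathbb{W}$. NumPy's 1D arrays have no row or column orientation (transposing one returns it unchanged), so `inputs` is passed as-is. The result is a 1D array holding the components of $\vec{y}$, which for our purposes is exactly the same as $\vec{y}$.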
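The behaviour of 1D arrays mentioned in the note can be checked directly. In this sketch, `stored` is a hypothetical stand-in for `weights_matrix` with shape $(m, n)$, filled with made-up values.

```python
import numpy as np

m, n = 3, 2  # 3 inputs, 2 perceptrons (illustrative sizes)

# Same layout as weights_matrix in the Layer class: shape (m, n)
stored = np.arange(6, dtype=float).reshape(m, n)
x = np.array([1.0, 2.0, 3.0])

# Transposing a 1D array is a no-op: the shape stays (3,)
print(x.T.shape)  # (3,)

# (n, m) matrix times m inputs gives n outputs as a 1D array
y = np.dot(stored.T, x)
print(y.shape)  # (2,)

# Multiplying from the left by x gives the same 1D result
print(np.array_equal(y, np.dot(x, stored)))  # True
```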

In [47]:
x = np.random.rand(8)

layer_1 = Layer(m_inputs=8, n_perceptrons=8, activation="relu")
layer_2 = Layer(m_inputs=8, n_perceptrons=16, activation="relu")
layer_3 = Layer(m_inputs=16, n_perceptrons=16, activation="relu")
layer_4 = Layer(m_inputs=16, n_perceptrons=16, activation="relu")
layer_5 = Layer(m_inputs=16, n_perceptrons=8, activation="relu")
layer_6 = Layer(m_inputs=8, n_perceptrons=3, activation="tanh")

y_1 = layer_1.forward(x)
y_2 = layer_2.forward(y_1)
y_3 = layer_3.forward(y_2)
y_4 = layer_4.forward(y_3)
y_5 = layer_5.forward(y_4)
result = layer_6.forward(y_5)

print(f"x = {x}\n")
print(f"y_1 = {y_1}\n")
print(f"y_2 = {y_2}\n")
print(f"y_3 = {y_3}\n")
print(f"y_4 = {y_4}\n")
print(f"y_5 = {y_5}\n")
print(f"Result = {result}")

x = [0.62581241 0.15579656 0.74573661 0.72584999 0.25279547 0.42913702
 0.16370503 0.64355034]

y_1 = [2.02617686 1.81485014 1.24673764 1.89551142 1.97027812 1.42067571
 1.41836614 1.28796219]

y_2 = [8.82083583 7.3789929  7.22495954 8.28830517 6.14173435 7.00702324
 6.08710616 6.9519832  5.74782273 8.22040577 4.79499482 7.06348432
 5.83752702 5.99333522 6.13574526 7.12622832]

y_3 = [47.91753259 57.17937117 58.0718805  48.9766543  53.53069753 48.96906559
 65.40277461 49.64824271 54.0331496  43.84550337 62.73920104 73.28308745
 73.73312402 56.51938494 61.6087385  46.76370827]

y_4 = [564.19262031 405.32983071 578.77282214 462.94881905 456.94097017
 528.97578901 343.99363559 547.55514324 467.81780315 429.01843353
 263.65203715 423.55076047 478.94673745 490.72058938 467.204378
 513.79553742]

y_5 = [4165.99325259 4022.00690757 4730.65211406 3183.88019892 3883.338997
 2916.77604951 3895.66718585 4258.0535298 ]

Result = [1. 1. 1.]
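
The final `Result = [1. 1. 1.]` is worth a second look. All weights drawn by `np.random.rand` are positive and ReLU passes positive values through, so the activations grow from layer to layer, as the printed values show. By the last layer the inputs to `tanh` are so large that its output is indistinguishable from 1 in floating point. A minimal check, using made-up inputs of increasing magnitude:

```python
import numpy as np

values = np.array([0.5, 5.0, 4000.0])  # illustrative pre-activation values
out = np.tanh(values)

print(out)
print(out[-1] == 1.0)  # True: tanh saturates for large inputs
```

This is one more reason the network is not useful in its current state: every input of this kind would produce the same output.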
