## **Matrix operations for a full Neural Network**

---


We make the math explicit for a neural network with three layers: an input layer, a hidden layer, and an output layer.

Architecture:
1. **Layer 0** : Contains **$k_0$ neurons**.
2. **Layer 1** : Contains **$k_1$ neurons**.
3. **Layer 2** : Contains **1 neuron**.

#### General Setup

Let the number of training samples be **$m$**, and the input feature size be **$n$**.

---

### **Layer 0: Input layer**

1. **Inputs**: The input matrix $ \mathbf{X} \in \mathbb{R}^{m \times n} $, where each row corresponds to a training sample:
   $$
   \mathbf{X} = \begin{bmatrix}
   x_{1,1} & x_{1,2} & \dots & x_{1,n} \\
   x_{2,1} & x_{2,2} & \dots & x_{2,n} \\
   \vdots & \vdots & \ddots & \vdots \\
   x_{m,1} & x_{m,2} & \dots & x_{m,n}
   \end{bmatrix}
   $$

2. **Weights**: A weight matrix $ \mathbf{W}^{[0]} \in \mathbb{R}^{k_0 \times n} $:
   $$
   \mathbf{W}^{[0]} = \begin{bmatrix}
   w^{[0]}_{1,1} & w^{[0]}_{1,2} & \dots & w^{[0]}_{1,n} \\
   w^{[0]}_{2,1} & w^{[0]}_{2,2} & \dots & w^{[0]}_{2,n} \\
   \vdots & \vdots & \ddots & \vdots \\
   w^{[0]}_{k_0,1} & w^{[0]}_{k_0,2} & \dots & w^{[0]}_{k_0,n}
   \end{bmatrix}
   $$

3. **Biases**: A bias matrix $ \mathbf{B}^{[0]} \in \mathbb{R}^{m \times k_0} $, where each row is the same:
   $$
   \mathbf{B}^{[0]} = \begin{bmatrix}
   b^{[0]}_1 & b^{[0]}_2 & \dots & b^{[0]}_{k_0} \\
   b^{[0]}_1 & b^{[0]}_2 & \dots & b^{[0]}_{k_0} \\
   \vdots & \vdots & \ddots & \vdots \\
   b^{[0]}_1 & b^{[0]}_2 & \dots & b^{[0]}_{k_0}
   \end{bmatrix}
   $$

4. **Linear Combination**:
   $$
   \mathbf{Z}^{[0]} = \mathbf{X} \mathbf{W}^{[0]\top} + \mathbf{B}^{[0]}
   $$

5. **Activation**: Apply activation function $ \sigma $ element-wise:
   $$
   \mathbf{A}^{[0]} = \sigma(\mathbf{Z}^{[0]})
   $$
   Here, $ \mathbf{A}^{[0]} \in \mathbb{R}^{m \times k_0} $ represents the activations of the input layer: one row per training sample, one column per neuron.
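
The Layer 0 steps above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the sizes, the random values, and the choice of the logistic sigmoid for $ \sigma $ are all assumptions made for the example. It also checks that the explicit bias matrix $ \mathbf{B}^{[0]} $ (the same row repeated $m$ times) gives the same result as NumPy broadcasting a single bias row.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (an assumed choice for sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, k0 = 4, 3, 5               # illustrative sizes, not from the text

X = rng.normal(size=(m, n))      # one training sample per row
W0 = rng.normal(size=(k0, n))    # one neuron per row
b0 = rng.normal(size=(k0,))      # one bias per neuron

# Explicit bias matrix B^[0]: the same row repeated m times.
B0 = np.tile(b0, (m, 1))         # shape (m, k0)

Z0 = X @ W0.T + B0               # linear combination, shape (m, k0)
A0 = sigmoid(Z0)                 # activations, shape (m, k0)

# Broadcasting adds b0 to every row without materializing B^[0].
Z0_broadcast = X @ W0.T + b0
```

In practice the broadcast form is usually preferred, since it avoids storing $m$ copies of the same bias row.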

---

### **Layer 1: Hidden Layer**

1. **Inputs**: The activations from Layer 0:
   $$
   \mathbf{A}^{[0]} \in \mathbb{R}^{m \times k_0}
   $$

2. **Weights**: A weight matrix $ \mathbf{W}^{[1]} \in \mathbb{R}^{k_1 \times k_0} $:
   $$
   \mathbf{W}^{[1]} = \begin{bmatrix}
   w^{[1]}_{1,1} & w^{[1]}_{1,2} & \dots & w^{[1]}_{1,k_0} \\
   w^{[1]}_{2,1} & w^{[1]}_{2,2} & \dots & w^{[1]}_{2,k_0} \\
   \vdots & \vdots & \ddots & \vdots \\
   w^{[1]}_{k_1,1} & w^{[1]}_{k_1,2} & \dots & w^{[1]}_{k_1,k_0}
   \end{bmatrix}
   $$

3. **Biases**: A bias matrix $ \mathbf{B}^{[1]} \in \mathbb{R}^{m \times k_1} $, where each row is the same:
   $$
   \mathbf{B}^{[1]} = \begin{bmatrix}
   b^{[1]}_1 & b^{[1]}_2 & \dots & b^{[1]}_{k_1} \\
   b^{[1]}_1 & b^{[1]}_2 & \dots & b^{[1]}_{k_1} \\
   \vdots & \vdots & \ddots & \vdots \\
   b^{[1]}_1 & b^{[1]}_2 & \dots & b^{[1]}_{k_1}
   \end{bmatrix}
   $$

4. **Linear Combination**:
   $$
   \mathbf{Z}^{[1]} = \mathbf{A}^{[0]} \mathbf{W}^{[1]\top} + \mathbf{B}^{[1]}
   $$

5. **Activation**: Apply activation function $ \sigma $ element-wise:
   $$
   \mathbf{A}^{[1]} = \sigma(\mathbf{Z}^{[1]})
   $$
   Here, $ \mathbf{A}^{[1]} \in \mathbb{R}^{m \times k_1} $ represents the activations of the hidden layer.
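
As a shape check for the hidden layer, the sketch below computes $ \mathbf{Z}^{[1]} $ in matrix form and compares it with a per-sample loop, where row $i$ of $ \mathbf{Z}^{[1]} $ is $ \mathbf{W}^{[1]} \mathbf{a}_i + \mathbf{b}^{[1]} $. The sizes, random values, and the sigmoid activation are assumptions for illustration, and `A0` is a random stand-in for the Layer 0 activations.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (an assumed choice for sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, k0, k1 = 4, 5, 2                 # illustrative sizes

A0 = rng.uniform(size=(m, k0))      # stand-in for Layer 0 activations
W1 = rng.normal(size=(k1, k0))      # one hidden neuron per row
b1 = rng.normal(size=(k1,))         # one bias per hidden neuron
B1 = np.tile(b1, (m, 1))            # bias matrix with identical rows

Z1 = A0 @ W1.T + B1                 # shape (m, k1)
A1 = sigmoid(Z1)                    # shape (m, k1)

# Per-sample view: process each row of A0 separately, then stack the results.
Z1_loop = np.stack([W1 @ A0[i] + b1 for i in range(m)])
```

The matrix form and the loop agree, which is why the transpose $ \mathbf{W}^{[1]\top} $ appears when samples are stored as rows.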

---


### **Layer 2: Output Layer**

1. **Inputs**: The activations from Layer 1:
   $$
   \mathbf{A}^{[1]} \in \mathbb{R}^{m \times k_1}
   $$

2. **Weights**: A weight vector $ \mathbf{W}^{[2]} \in \mathbb{R}^{1 \times k_1} $:
   $$
   \mathbf{W}^{[2]} = \begin{bmatrix}
   w^{[2]}_{1} & w^{[2]}_{2} & \dots & w^{[2]}_{k_1}
   \end{bmatrix}
   $$

3. **Bias**: A bias vector $ \mathbf{B}^{[2]} \in \mathbb{R}^{m \times 1} $, where each row is the same:
   $$
   \mathbf{B}^{[2]} = \begin{bmatrix}
   b^{[2]} \\
   b^{[2]} \\
   \vdots \\
   b^{[2]}
   \end{bmatrix}
   $$

4. **Linear Combination**:
   $$
   \mathbf{Z}^{[2]} = \mathbf{A}^{[1]} \mathbf{W}^{[2]\top} + \mathbf{B}^{[2]}
   $$

5. **Activation (Optional)**: If the output activation $ \sigma_{\text{out}} $ is applied (e.g., sigmoid for binary classification):
   $$
   \mathbf{A}^{[2]} = \sigma_{\text{out}}(\mathbf{Z}^{[2]})
   $$
   Here, $ \mathbf{A}^{[2]} \in \mathbb{R}^{m \times 1} $ contains the final predictions.
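
A small NumPy sketch of the output layer follows. It assumes a sigmoid for $ \sigma_{\text{out}} $, as in the binary classification example mentioned above. The sizes and random values are illustrative, and `A1` is a random stand-in for the hidden-layer activations.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed choice for sigma_out)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
m, k1 = 4, 2                      # illustrative sizes

A1 = rng.uniform(size=(m, k1))    # stand-in for hidden-layer activations
W2 = rng.normal(size=(1, k1))     # a single output neuron: one row of weights
b2 = rng.normal()                 # a single scalar bias
B2 = np.full((m, 1), b2)          # bias column: the same value in every row

Z2 = A1 @ W2.T + B2               # shape (m, 1)
A2 = sigmoid(Z2)                  # shape (m, 1), each entry in (0, 1)
```

If no output activation is applied, `Z2` itself would serve as the prediction.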


---

### Summary

The outputs for a three-layer neural network are computed as:
1. Layer 0:
   $$
   \mathbf{Z}^{[0]} = \mathbf{X} \mathbf{W}^{[0]\top} + \mathbf{B}^{[0]}, \quad \mathbf{A}^{[0]} = \sigma(\mathbf{Z}^{[0]})
   $$
2. Layer 1:
   $$
   \mathbf{Z}^{[1]} = \mathbf{A}^{[0]} \mathbf{W}^{[1]\top} + \mathbf{B}^{[1]}, \quad \mathbf{A}^{[1]} = \sigma(\mathbf{Z}^{[1]})
   $$
3. Layer 2:
   $$
   \mathbf{Z}^{[2]} = \mathbf{A}^{[1]} \mathbf{W}^{[2]\top} + \mathbf{B}^{[2]}, \quad \mathbf{A}^{[2]} = \sigma_{\text{out}}(\mathbf{Z}^{[2]})
   $$

This process demonstrates the forward propagation for a three-layer neural network.
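
The three steps can be chained into a single function. The sketch below is one possible arrangement, not a definitive implementation: the function name `forward`, the `params` tuple layout, and the use of a sigmoid for both $ \sigma $ and $ \sigma_{\text{out}} $ are assumptions made for the example. Biases are stored as vectors and broadcast across rows instead of being expanded into the bias matrices $ \mathbf{B}^{[l]} $.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed activation)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params, sigma=sigmoid, sigma_out=sigmoid):
    """Forward pass through Layers 0, 1 and 2 (hypothetical helper).

    X has shape (m, n); params is (W0, b0, W1, b1, W2, b2).
    Returns the predictions A2 with shape (m, 1).
    """
    W0, b0, W1, b1, W2, b2 = params
    A0 = sigma(X @ W0.T + b0)         # (m, k0)
    A1 = sigma(A0 @ W1.T + b1)        # (m, k1)
    A2 = sigma_out(A1 @ W2.T + b2)    # (m, 1)
    return A2

rng = np.random.default_rng(3)
m, n, k0, k1 = 6, 3, 5, 2             # illustrative sizes
params = (
    rng.normal(size=(k0, n)),  rng.normal(size=(k0,)),   # W0, b0
    rng.normal(size=(k1, k0)), rng.normal(size=(k1,)),   # W1, b1
    rng.normal(size=(1, k1)),  rng.normal(size=(1,)),    # W2, b2
)
X = rng.normal(size=(m, n))
Y_hat = forward(X, params)            # one prediction per training sample
```

Because each sample occupies its own row, passing a single sample as a `(1, n)` array yields the same prediction as the corresponding row of the batched result.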

---