# [`nn.modules.linear.Linear`](https://github.com/JamorMoussa/NanoTorch/blob/4092b0fe7cd19cca1db2a3c99b6f5b77af9dc8a6/nanotorch/nn/modules/linear.py#L9)

In this document, we're going to cover the **Linear** layer, also called the **Dense** layer, from the theory to an efficient implementation.

First, we'll look at how a single neuron works, then extend this concept to build a **fully connected layer**.

## 01. Theory - Build Fully Connected Layer from scratch

The fully connected layer is a fundamental building block of neural networks. It performs a linear transformation on the input, where each output node is connected to every node in the previous layer.

<figure markdown="span">
    <center>
        <img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/3_fully-connected-layer_0.jpg" width="500" />
    </center>
</figure>

To simplify this concept, we'll first explore how a single neuron works. Once we understand this, we can extend the idea to build a fully connected layer.

### 1.1 Neural Nets - Artificial Neuron

The **Artificial Neuron** is the basic unit used to build more complex neural networks. In this section, we'll delve into the mathematical workings of this neuron.

<figure markdown="span">
    <center>
        <img src="https://raw.githubusercontent.com/JamorMoussa/NanoTorch/dev/docs/images/docs/linear/neuron.png" width="400" />
        <figcaption> <b>Artificial Neuron</b> </figcaption>
    </center>
</figure>

The **Artificial Neuron** is a processing unit that takes some given input and produces an output.

Mathematically, it can be described as a **function** that accepts an **input vector** $x \in \mathbb{R}^n$ and returns a **weighted sum** of that input with a **weight vector** $w \in \mathbb{R}^n$, which has the same dimension as the input $x$, and then adds a **bias** $b \in \mathbb{R}$, finally returning a scalar output $y \in \mathbb{R}$.

Formally,

$$
\begin{align*}
y = w_1 x_1 + w_2 x_2 + \hspace{0.2cm} \dots \hspace{0.2cm} + w_n x_n + b &= \sum_{i = 1}^{n} w_i x_i + b 
\end{align*}
$$

The weight $w_i$ describes the importance of the corresponding feature $x_i$, indicating how much it contributes to computing the output.

The weight vector $w$ and bias $b$ are called learnable parameters, meaning they are learned during the training process.
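As a minimal sketch of the weighted sum above, the snippet below evaluates one neuron with NumPy. The function name `neuron` and the example values are assumptions chosen for illustration; they are not taken from NanoTorch.

```python
import numpy as np


def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Return the weighted sum of the inputs plus the bias: sum_i w_i * x_i + b."""
    return float(np.sum(w * x) + b)


x = np.array([1.0, 2.0, 3.0])    # input vector, n = 3
w = np.array([0.5, -1.0, 2.0])   # one weight per input feature
b = 0.25                         # bias

y = neuron(x, w, b)
print(y)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```

Here the third feature contributes the most to the output because it has the largest weight.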

Actually, we can fold the bias $b$ into the weighted sum by setting $w_0 = b$ and $x_0 = 1$.

So,

$$
\begin{align*}
y &= \sum_{i = 1}^{n} w_i x_i + b = \sum_{i = 1}^{n} w_i x_i + w_0 x_0 = \sum_{i = 0}^{n} w_i x_i 
\end{align*}
$$


Equivalently, the output $y$ is the dot product of the augmented input vector $x = (1, x_{org}) \in \mathbb{R}^{n+1}$ and the augmented weight vector $w = (b, w_{org}) \in \mathbb{R}^{n+1}$:

$$
y = w^T x
$$
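The short check below illustrates this bias trick: prepending $1$ to the input and $b$ to the weights gives the same result as the explicit weighted sum plus bias. The variable names (`x_aug`, `w_aug`) and values are illustrative assumptions.

```python
import numpy as np

x_org = np.array([1.0, 2.0, 3.0])   # original input, n = 3
w_org = np.array([0.5, -1.0, 2.0])  # original weights
b = 0.25                            # bias

# Explicit form: weighted sum plus a separate bias term.
y_explicit = w_org @ x_org + b

# Augmented form: x_0 = 1 and w_0 = b, so the bias lives inside the dot product.
x_aug = np.concatenate(([1.0], x_org))  # shape (n + 1,)
w_aug = np.concatenate(([b], w_org))    # shape (n + 1,)
y_dot = w_aug @ x_aug

print(np.isclose(y_explicit, y_dot))
```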

### 1.2 Fully Connected Layer

In the previous section, we saw how a single artificial neuron operates. Now, we can map the same input vector $x \in \mathbb{R}^{n}$ to multiple neurons and perform the same operation as before. This creates a structure called a **Fully Connected Layer**, where all output nodes are fully connected to the input nodes.

<figure markdown="span">
    <center>
        <img src="https://raw.githubusercontent.com/JamorMoussa/NanoTorch/dev/docs/images/docs/linear/fully-connected-layer.png" width="400" />
        <figcaption> <b>Fully Connected Layer</b> </figcaption>
    </center>
</figure>

<div class="admonition note" markdown="">
<p class="admonition-title">Note</p>
<p> To keep the equations consistent, we denote the weight connecting input node $i$ to output node $j$ as $w_{ji}$: the first index is the output node and the second is the input node. The bias of output node $j$ is written $w_{j0}$. </p>
</div>



Let's start with the first output, considering it as a single neuron performing the same computation as before.

<figure markdown="span">
    <center>
        <img src="https://raw.githubusercontent.com/JamorMoussa/NanoTorch/dev/docs/images/docs/linear/fully-connected-layer-1.png" width="400" />
    </center>
</figure>

$$
    y_{1} = w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + \hspace{0.2cm} \dots \hspace{0.2cm} + w_{1n}x_n + w_{10} = \sum_{i=1}^{n}w_{1i}x_i + w_{10}
$$

<figure markdown="span">
    <center>
        <img src="https://raw.githubusercontent.com/JamorMoussa/NanoTorch/dev/docs/images/docs/linear/fully-connected-layer-2.png" width="400" />
    </center>
</figure>

$$
    y_{2} = w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + \hspace{0.2cm} \dots \hspace{0.2cm} + w_{2n}x_n + w_{20} = \sum_{i=1}^{n}w_{2i}x_i + w_{20}
$$

<figure markdown="span">
    <center>
        <img src="https://raw.githubusercontent.com/JamorMoussa/NanoTorch/dev/docs/images/docs/linear/fully-connected-layer-3.png" width="400" />
    </center>
</figure>

$$
    y_{m} = w_{m1}x_1 + w_{m2}x_2 + w_{m3}x_3 + \hspace{0.2cm} \dots \hspace{0.2cm} + w_{mn}x_n + w_{m0} = \sum_{i=1}^{n}w_{mi}x_i + w_{m0}
$$

Beautiful. Let's stack all the equations into a single system of linear equations.

$$
\begin{equation*}
\begin{cases}
     &y_{1} = w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + \dots + w_{1n}x_n + w_{10} \\
     &y_{2} = w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + \dots + w_{2n}x_n + w_{20} \\
     &\vdots \\
     &y_{m} = w_{m1}x_1 + w_{m2}x_2 + w_{m3}x_3 + \dots + w_{mn}x_n + w_{m0} \\
\end{cases}
\end{equation*}
$$

Does this pattern look familiar?

Let's rewrite this system of linear equations as a single matrix multiplication.

$$
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m
\end{pmatrix}
=
\begin{pmatrix}
w_{10} & w_{11} & w_{12} & \dots & w_{1n} \\
w_{20} & w_{21} & w_{22} & \dots & w_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{m0} & w_{m1} & w_{m2} & \dots & w_{mn}
\end{pmatrix}
\begin{pmatrix}
1 \\
x_1 \\
x_2 \\
\vdots \\
x_n
\end{pmatrix}
$$

Thus, we can use the matrix formula to describe the computation of a fully connected layer as:

$$
\begin{align*}
\mathbf{y} &= W \mathbf{x}
\end{align*}
$$

Where

$$
W  = \begin{pmatrix}
w_{10} & w_{11} & w_{12} & \dots & w_{1n} \\
w_{20} & w_{21} & w_{22} & \dots & w_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{m0} & w_{m1} & w_{m2} & \dots & w_{mn}
\end{pmatrix} \in \mathbb{R}^{m \times (n+1)}
$$

Here, $\mathbf{x} \in \mathbb{R}^{(n+1)}$ and $\mathbf{y} \in \mathbb{R}^{m}$ denote the input and output vectors of the fully connected layer, respectively.
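To make the matrix form concrete, the sketch below compares a neuron-by-neuron computation with the single product $W \mathbf{x}$. The sizes `n = 4` and `m = 3`, the random values, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3  # n input features, m output neurons

# W has shape (m, n + 1); column 0 holds the biases w_{j0}.
W = rng.normal(size=(m, n + 1))

x_org = rng.normal(size=n)
x = np.concatenate(([1.0], x_org))  # augmented input with x_0 = 1, shape (n + 1,)

# Neuron by neuron: one row of the system of equations per output.
y_loop = np.array([sum(W[j, i] * x[i] for i in range(n + 1)) for j in range(m)])

# All neurons at once: a single matrix-vector product.
y = W @ x

print(y.shape)                 # (3,)
print(np.allclose(y, y_loop))  # True
```

The product also equals `W[:, 1:] @ x_org + W[:, 0]`, which is the same computation with the bias kept as a separate vector.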

### 1.4 Forward Propagation

In the previous section, we demonstrated that we could construct a fully connected layer with any number of inputs, denoted as `in_features`, and produce any number of outputs, denoted as `out_features`, by constructing a learnable matrix $W$ with dimensions *out\_features* $\times$ (*in\_features* $+ 1$), where the extra column holds the biases.

<figure markdown="span">
    <center>
        <img src="https://raw.githubusercontent.com/JamorMoussa/NanoTorch/dev/docs/images/docs/linear/fully-connect-layer-forward-pass.png" width="300" />
    </center>
</figure>

The forward pass is performed when we compute the output, given an input vector $\mathbf{x} \in \mathbb{R}^{(\text{in\_features} + 1)}$:

$$
    \mathbf{y} = W \mathbf{x}
$$
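The forward pass can be sketched as a small class. This is a hypothetical stand-in for illustration, not NanoTorch's actual `Linear` implementation: the class name `TinyLinear`, the random initialization, and the choice to prepend the constant $1$ inside `forward` are all assumptions.

```python
import numpy as np


class TinyLinear:
    """Hypothetical fully connected layer; not the NanoTorch Linear class."""

    def __init__(self, in_features: int, out_features: int, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # W has shape (out_features, in_features + 1); column 0 is the bias.
        self.weight = rng.normal(scale=0.1, size=(out_features, in_features + 1))

    def forward(self, x_org: np.ndarray) -> np.ndarray:
        # Build the augmented input x = (1, x_org), then compute y = W x.
        x = np.concatenate(([1.0], x_org))
        return self.weight @ x


layer = TinyLinear(in_features=4, out_features=2)
y = layer.forward(np.array([1.0, 2.0, 3.0, 4.0]))
print(y.shape)  # (2,)
```

Prepending the $1$ inside `forward` lets callers pass the original `in_features`-dimensional vector. A zero input returns the bias column unchanged, which is a quick sanity check for the layout of $W$.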

### 1.5 Back-Propagation

This is the most exciting part.

The whole point of machine learning is to train models. Training involves evaluating a loss function (which depends on the specific task), computing the gradients of this loss with respect to the model's parameters $W$, and then using an optimization method, such as **Adam**, to update those parameters.

Let's denote the loss function used to evaluate the model's performance as $L$.

The following figure shows that the fully connected layer receives the gradient flowing back from the subsequent layer, denoted as $\frac{\partial L}{\partial \mathbf{y}}$. This quantity is used to compute the gradient of the loss with respect to the current layer's parameters, $\frac{\partial L}{\partial W}$. The layer then passes the gradient with respect to its input, $\frac{\partial L}{\partial \mathbf{x}}$, to the previous layer, following the chain rule in backpropagation.