# Simple Feedforward Networks

> Visual Studio Code is highly recommended for opening this notebook. The equations are written in Markdown with KaTeX and might not render well in other viewers.

A PyTorch implementation of the simple feedforward network shown in Figure 21.3 of `Russell S. J. & Norvig P. (2020). Artificial intelligence : a modern approach (4th ed.). Pearson.`

![Figure 21.3](images/fig_21_3.png)

We can write an expression for the output of that network as follows (taken from Equation 21.2 of the book):
$$
\begin{equation}
\begin{split}
\^{y} &= g_5(in_5) \\
&= g_5(w_{0,5} + w_{3,5}a_3 + w_{4,5}a_4) \\
&= g_5(w_{0,5} + w_{3,5}g_3(in_3) + w_{4,5}g_4(in_4)) \\
&= g_5(w_{0,5} + w_{3,5}g_3(w_{0,3} + w_{1,3}x_1 + w_{2,3}x_2)
+ w_{4,5}g_4(w_{0,4} + w_{1,4}x_1 + w_{2,4}x_2))
\end{split}
\end{equation}
$$

## Create the network

Let the activation functions $g_3$ and $g_4$ be ReLU functions, and let $g_5$ be a linear (identity) function.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
class SimpleFeedForward(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(2, 2)
        self.fc2 = nn.Linear(2, 1)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = F.relu(self.fc1(x))
        z = self.fc2(v)
        return z

In [3]:
net = SimpleFeedForward()
net.train()

SimpleFeedForward(
  (fc1): Linear(in_features=2, out_features=2, bias=True)
  (fc2): Linear(in_features=2, out_features=1, bias=True)
)

Usually, the initial weights are kept random.
But to keep our study simple, let's initialize the weights with easy numbers.

In [4]:
with torch.no_grad():
    last_v = 0  # next integer to assign
    for name, param in net.named_parameters():
        p_numel = param.numel()
        # Fill each parameter in place with consecutive integers: 0, 1, 2, ...
        param.copy_(
            torch.arange(last_v, last_v + p_numel, dtype=torch.float32).reshape(param.shape))
        print(name, param, "\n")
        last_v += p_numel

fc1.weight Parameter containing:
tensor([[0., 1.],
        [2., 3.]], requires_grad=True) 

fc1.bias Parameter containing:
tensor([4., 5.], requires_grad=True) 

fc2.weight Parameter containing:
tensor([[6., 7.]], requires_grad=True) 

fc2.bias Parameter containing:
tensor([8.], requires_grad=True) 



## Forward-pass

We use the single training example below.

In [5]:
x = torch.tensor([[-2., 1.]])  # input (x_1, x_2), shape (1, 2)
y = torch.tensor([[64.]])      # target, shape (1, 1)

Let's make a prediction for our training example. This process is usually called the forward pass or forward propagation.

In [6]:
y_hat = net(x)
print(y_hat)

tensor([[66.]], grad_fn=<AddmmBackward0>)


The output is $66$.

### Manual calculation

We can also manually calculate the output of our network.

![Figure 21.3b](images/fig_21_3b.png)

![Forward-pass of Figure 21.3b](images/fig_21_3b_forward.png)

The result is the same: $66$.
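The same arithmetic can also be spelled out in plain Python, as a minimal sketch. The mapping from PyTorch parameters to the book's weight names is an assumption read off the printed parameters above: row 0 of `fc1.weight` holds $w_{1,3}, w_{2,3}$, row 1 holds $w_{1,4}, w_{2,4}$, `fc1.bias` holds $w_{0,3}, w_{0,4}$, `fc2.weight` holds $w_{3,5}, w_{4,5}$, and `fc2.bias` holds $w_{0,5}$. The helper name `relu` is illustrative only.

```python
def relu(v: float) -> float:
    """Rectified linear unit: max(0, v)."""
    return max(0.0, v)

# Inputs of the training example
x1, x2 = -2.0, 1.0

# Weights as initialized above (assumed mapping to the book's notation)
w03, w13, w23 = 4.0, 0.0, 1.0   # unit 3: bias, weight from x1, weight from x2
w04, w14, w24 = 5.0, 2.0, 3.0   # unit 4: bias, weight from x1, weight from x2
w05, w35, w45 = 8.0, 6.0, 7.0   # unit 5: bias, weight from a3, weight from a4

in3 = w03 + w13 * x1 + w23 * x2   # 4 + 0 + 1 = 5
a3 = relu(in3)
in4 = w04 + w14 * x1 + w24 * x2   # 5 - 4 + 3 = 4
a4 = relu(in4)
in5 = w05 + w35 * a3 + w45 * a4   # 8 + 30 + 28 = 66
y_hat = in5                       # g5 is the identity function

print(in3, in4, y_hat)  # prints: 5.0 4.0 66.0
```

The intermediate values $a_3 = 5$ and $a_4 = 4$ are reused in the backward-pass calculations.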

## Backward-pass

We will calculate the gradient for the
network with respect to our previous single training example $(\mathbf{x},y)$. (For multiple
examples, the gradient is just the sum of the gradients for the individual examples.)

We use the squared loss function $L_2$:

$$ L_2 = (y-\^{y})^2 $$

### Manual calculation

We can manually compute the gradient of the loss with respect to (w.r.t.) the weights using the chain rule 
$$ {dy \over dx} = {dy \over du}{du \over dx}. $$

So, the gradient of our $L_2$ loss w.r.t. $w_{3,5}$ should be
$$
{\partial L_2 \over \partial w_{3,5}} = {\partial L_2 \over \partial \^{y}}{\partial \^{y} \over \partial in_5}{\partial in_5 \over \partial w_{3,5}}
$$

where
$$
{\partial L_2 \over \partial \^{y}} = {\partial \over \partial \^{y}}(y-\^{y})^2 = 2(y − \^{y})(-1) = −2(y − \^{y}),
$$

and
$$
{\partial \^{y} \over \partial in_5} = {\partial \over \partial in_5}(g_5(in_5)) = g_{5}'(in_5).
$$

Since neither $w_{0,5}$ nor $w_{4,5}a_4$ depends on $w_{3,5}$, and $a_3$ does not depend on $w_{3,5}$ either,
$$ {\partial in_5 \over \partial w_{3,5}} = {\partial \over \partial w_{3,5}}(w_{0,5} + w_{3,5}a_3 + w_{4,5}a_4) = a_3.
$$

Finally, we have 
$$
{\partial L_2 \over \partial w_{3,5}} = −2(y − \^{y}) g_{5}'(in_5) a_3.
$$

Let's try to compute the gradient of our $L_2$ loss w.r.t. $w_{3,5}$ using that equation for our previous training example. 

Since $g_5$ is just a linear function $g_5(in_5)=in_5$, then $g_{5}'(in_5)=1$, so
$$
{\partial L_2 \over \partial w_{3,5}} = −2(64 − 66) \cdot 1 \cdot 5 = 20.
$$

Then, we can update our $w_{3,5}$ (with learning rate $\alpha=1.0$)
$$
w_{3,5} \colonequals w_{3,5} - \alpha {\partial L_2 \over \partial w_{3,5}} = 6 - 1 \cdot 20 = -14.
$$
The updated weight of $w_{3,5}$ is $-14$.
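
As a quick check of the arithmetic, the gradient and the update can be evaluated in a few lines of plain Python. This is only a sketch: the variable names are illustrative, and the values of $\^{y}$ and $a_3$ are taken from the forward pass of the training example.

```python
# Values from the forward pass of the training example
y, y_hat = 64.0, 66.0
a3 = 5.0          # activation of unit 3
w35 = 6.0         # current value of w_{3,5}
alpha = 1.0       # learning rate

g5_prime = 1.0    # g5 is the identity, so its derivative is 1

# dL2/dw35 = -2 (y - y_hat) * g5'(in5) * a3
grad_w35 = -2.0 * (y - y_hat) * g5_prime * a3
w35_new = w35 - alpha * grad_w35

print(grad_w35, w35_new)  # prints: 20.0 -14.0
```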

Now, let's try a slightly more difficult case: the gradient of our $L_2$ loss w.r.t. $w_{1,3}$,
$$
{\partial L_2 \over \partial w_{1,3}} = {\partial L_2 \over \partial \^{y}}{\partial \^{y} \over \partial in_5}{\partial in_5 \over \partial in_3}{\partial in_3 \over \partial w_{1,3}}.
$$

As we can see, the first two factors are identical to the previous case, so we can reuse the expressions we already derived:

$$
\begin{align*}
{\partial L_2 \over \partial w_{1,3}} &= −2(y − \^{y}) g_{5}'(in_5){\partial \over \partial in_3}(w_{3,5}a_3){\partial in_3 \over \partial w_{1,3}} \\
&= −2(y − \^{y}) g_{5}'(in_5)w_{3,5}{\partial \over \partial in_3}g_3(in_3){\partial \over \partial w_{1,3}}(w_{0,3}+w_{1,3}x_1+w_{2,3}x_2) \\
&= −2(y − \^{y}) g_{5}'(in_5)w_{3,5}g_{3}'(in_3)x_1
\end{align*}
$$

The simplification in the last line holds because $w_{0,3}$ and $w_{2,3}x_2$ do not depend on $w_{1,3}$, and $x_1$, being an input, does not depend on any weight.

Let's try to compute the gradient of our $L_2$ loss w.r.t. $w_{1,3}$ using that equation for our previous training example. 

Here $g_3$ is the rectified linear (ReLU) function
$$
g_3(in_3)=\begin{cases}
   in_3 &\text{if } in_3 \geq 0 \\
   0 &\text{if } in_3 < 0
\end{cases}
\\\enspace\\
g_{3}'(in_3)=\begin{cases}
   1 &\text{if } in_3 \geq 0 \\
   0 &\text{if } in_3 < 0.
\end{cases}
$$

So,
$$
{\partial L_2 \over \partial w_{1,3}} = −2(64 − 66) \cdot 1 \cdot 6 \cdot 1 \cdot (-2) = -48.
$$

Then, we can update our $w_{1,3}$ (with learning rate $\alpha=1.0$)
$$
w_{1,3} \colonequals w_{1,3} - \alpha {\partial L_2 \over \partial w_{1,3}} = 0 - 1 \cdot (-48) = 48.
$$
The updated weight of $w_{1,3}$ is $48$.
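
One way to gain confidence in a hand-derived gradient is to compare it with a numerical estimate. The sketch below uses a central finite difference, which is a different technique from the chain-rule derivation and is shown only as a sanity check. The function `loss` and the step size `h` are illustrative choices, and the other weights are hard-coded to the values initialized earlier.

```python
def relu(v: float) -> float:
    """Rectified linear unit: max(0, v)."""
    return max(0.0, v)

def loss(w13: float) -> float:
    """Squared loss for the training example as a function of w_{1,3} only."""
    x1, x2, y = -2.0, 1.0, 64.0
    a3 = relu(4.0 + w13 * x1 + 1.0 * x2)   # unit 3, with w_{1,3} left free
    a4 = relu(5.0 + 2.0 * x1 + 3.0 * x2)   # unit 4, fixed weights
    y_hat = 8.0 + 6.0 * a3 + 7.0 * a4      # unit 5, identity activation
    return (y - y_hat) ** 2

# Analytic gradient: -2 (y - y_hat) * g5'(in5) * w35 * g3'(in3) * x1
analytic = -2.0 * (64.0 - 66.0) * 1.0 * 6.0 * 1.0 * (-2.0)

# Central finite difference around the current value w_{1,3} = 0
h = 1e-3
numeric = (loss(0.0 + h) - loss(0.0 - h)) / (2 * h)

# Gradient-descent update with learning rate 1.0
w13_new = 0.0 - 1.0 * analytic

print(analytic, numeric, w13_new)
```

The two gradient values should agree up to floating-point error, because the loss is quadratic in $w_{1,3}$ near the current weights.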

### Automatic Differentiation

It was... pretty tedious, huh?

No worries! We can compute such gradients with **automatic differentiation**, which applies the rules of calculus in a systematic way.

In our study, we will continue using PyTorch. Let's do it!

We use the mean squared error loss function. The "mean" does not matter in our case, since we use only a single training example.

In [7]:
loss_fn = nn.MSELoss()

Compute the loss.

In [8]:
loss = loss_fn(y_hat, y)
print(loss)

tensor(4., grad_fn=<MseLossBackward0>)


Let's do backward-pass/back-propagation to compute the gradients.

In [9]:
loss.backward()

We use stochastic gradient descent (SGD) to update our network parameters (weights). The "stochastic" part does not matter in our case, since we use only a single training example.

In [10]:
optimizer = torch.optim.SGD(net.parameters(), lr=1.)

Then, we adjust/update the weights of our network.

In [11]:
optimizer.step()

Let's check our updated weights.

In [12]:
with torch.no_grad():
    for name, param in net.named_parameters():
        print(name, param, "\n")

fc1.weight Parameter containing:
tensor([[ 48., -23.],
        [ 58., -25.]], requires_grad=True) 

fc1.bias Parameter containing:
tensor([-20., -23.], requires_grad=True) 

fc2.weight Parameter containing:
tensor([[-14.,  -9.]], requires_grad=True) 

fc2.bias Parameter containing:
tensor([4.], requires_grad=True) 



As we can see, the updated weights match our manual calculation: $w_{3,5}$ is $-14$ and $w_{1,3}$ is $48$.
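
To cross-check every printed entry, not only $w_{3,5}$ and $w_{1,3}$, the same chain-rule computation can be sketched for all nine parameters in plain Python. This is a hand-written illustration, not how PyTorch computes gradients internally. The list layout mirrors `fc1.weight`, `fc1.bias`, `fc2.weight`, and `fc2.bias`, and names such as `delta5` and `delta_h` are hypothetical.

```python
def relu(v: float) -> float:
    return max(0.0, v)

def relu_prime(v: float) -> float:
    # Convention used in the text: derivative is 1 for v >= 0, else 0
    return 1.0 if v >= 0 else 0.0

x, y, alpha = [-2.0, 1.0], 64.0, 1.0
W1, b1 = [[0.0, 1.0], [2.0, 3.0]], [4.0, 5.0]   # hidden units 3 and 4
W2, b2 = [6.0, 7.0], 8.0                        # output unit 5

# Forward pass
ins = [b1[j] + sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
a = [relu(v) for v in ins]
y_hat = b2 + sum(W2[j] * a[j] for j in range(2))

# Backward pass: dL2/d(in_5), then dL2/d(in_3) and dL2/d(in_4)
delta5 = -2.0 * (y - y_hat)          # g5 is the identity, so g5' = 1
delta_h = [delta5 * W2[j] * relu_prime(ins[j]) for j in range(2)]

# Gradient-descent update for every parameter
W2_new = [W2[j] - alpha * delta5 * a[j] for j in range(2)]
b2_new = b2 - alpha * delta5
W1_new = [[W1[j][i] - alpha * delta_h[j] * x[i] for i in range(2)]
          for j in range(2)]
b1_new = [b1[j] - alpha * delta_h[j] for j in range(2)]

print(W1_new, b1_new, W2_new, b2_new)
# prints: [[48.0, -23.0], [58.0, -25.0]] [-20.0, -23.0] [-14.0, -9.0] 4.0
```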

Now we can freely experiment with different network structures, activation functions, loss functions, and forms of composition without having to do lots of calculus to derive a new learning algorithm for each experiment.