# Simple Feedforward Networks

> Visual Studio Code is highly recommended for opening this notebook. The equations are written in Markdown with KaTeX and might not render well in other viewers.

A TensorFlow implementation of the simple feedforward network shown in Figure 21.3 of `Russell S. J. & Norvig P. (2020). Artificial intelligence : a modern approach (4th ed.). Pearson.`

![Figure 21.3](images/fig_21_3.png)

We can write the output of that network as follows (Equation 21.2 of the book):
$$
\begin{equation}
\begin{split}
\hat{y} &= g_5(in_5) \\
&= g_5(w_{0,5} + w_{3,5}a_3 + w_{4,5}a_4) \\
&= g_5(w_{0,5} + w_{3,5}g_3(in_3) + w_{4,5}g_4(in_4)) \\
&= g_5(w_{0,5} + w_{3,5}g_3(w_{0,3} + w_{1,3}x_1 + w_{2,3}x_2)
+ w_{4,5}g_4(w_{0,4} + w_{1,4}x_1 + w_{2,4}x_2))
\end{split}
\end{equation}
$$

## Create the network

Let the activation functions $g_3$ and $g_4$ be ReLU functions, and let $g_5$ be a linear (identity) function.

In [1]:
import numpy as np
import tensorflow as tf



In [2]:
class SimpleFeedForward(tf.keras.Model):
    def __init__(self) -> None:
        super().__init__()
        initializer = tf.keras.initializers.Zeros()
        self.fc1 = tf.keras.layers.Dense(2, activation="relu", name="fc1", kernel_initializer=initializer)
        self.fc2 = tf.keras.layers.Dense(1, name="fc2", kernel_initializer=initializer)
    
    def call(self, x):
        v = self.fc1(x)
        z = self.fc2(v)
        return z

In [3]:
net = SimpleFeedForward()
# Call the model once on a dummy input so that Keras builds the layers and creates their weights.
_ = net(tf.constant([[0, 0]]))



Usually, the initial weights are kept random.
But for the sake of simplicity in our study, let's initialize the weights with easy numbers.

In [4]:
last_v = 0
for layer in net.layers:
    new_weights = []
    for w in layer.get_weights():
        p_numel = w.size
        # Fill with consecutive integers. order="F" writes the elements in
        # Fortran-like index order, with the first index changing fastest and
        # the last index changing slowest, so a kernel is filled column by column.
        nw = np.arange(last_v, last_v + p_numel, dtype=np.float32).reshape(w.shape, order="F")
        new_weights.append(nw)
        last_v += p_numel
    layer.set_weights(new_weights)
    print(layer.name, layer.get_weights(), "\n")

fc1 [array([[0., 2.],
       [1., 3.]], dtype=float32), array([4., 5.], dtype=float32)] 

fc2 [array([[6.],
       [7.]], dtype=float32), array([8.], dtype=float32)] 
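
The effect of `order="F"` in the cell above can be seen in isolation. The snippet below is a small sketch with illustrative names (`values`, `c_order`, `f_order`); it assumes the usual Keras kernel layout of `(inputs, units)`, so that each column holds the incoming weights of one unit.

```python
import numpy as np

# Four consecutive values, shaped like fc1's kernel: (inputs, units) = (2, 2).
values = np.arange(4, dtype=np.float32)

c_order = values.reshape((2, 2))             # default C order: filled row by row
f_order = values.reshape((2, 2), order="F")  # Fortran order: filled column by column

print(c_order)  # [[0. 1.] [2. 3.]]
print(f_order)  # [[0. 2.] [1. 3.]]
```

With the column-major fill, the first hidden unit receives the weights 0 and 1 and the second receives 2 and 3, which matches the `fc1` kernel printed above.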



## Forward-pass

We use the following training example.

In [5]:
x = tf.Variable([[-2, 1]], dtype=tf.float32)
y = tf.Variable([[64]], dtype=tf.float32)

Let's make a prediction for our training example. This process is usually called the forward pass or forward propagation.

In [6]:
y_hat = net(x)
print(y_hat)

tf.Tensor([[66.]], shape=(1, 1), dtype=float32)


The output is $66$.

### Manual calculation

We can also manually calculate the output of our network.

![Figure 21.3b](images/fig_21_3b.png)

![Forward-pass of Figure 21.3b](images/fig_21_3b_forward.png)

The output is the same, $66$.
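
The same arithmetic can be written out in plain Python. This is a minimal sketch: the dictionary `w` and the helper `relu` are illustrative names, and the mapping from the Keras arrays to the book's $w_{i,j}$ is an assumption, namely that column $j$ of a kernel holds the incoming weights of unit $j$ and that the bias plays the role of $w_{0,j}$.

```python
def relu(z):
    """Rectified linear unit, used for g_3 and g_4."""
    return max(0.0, z)

# w[i, j] is the weight on the link from unit i to unit j; i = 0 is the bias.
w = {
    (0, 3): 4.0, (1, 3): 0.0, (2, 3): 1.0,
    (0, 4): 5.0, (1, 4): 2.0, (2, 4): 3.0,
    (0, 5): 8.0, (3, 5): 6.0, (4, 5): 7.0,
}
x1, x2 = -2.0, 1.0

in_3 = w[0, 3] + w[1, 3] * x1 + w[2, 3] * x2    # 4 + 0*(-2) + 1*1 = 5
a_3 = relu(in_3)
in_4 = w[0, 4] + w[1, 4] * x1 + w[2, 4] * x2    # 5 + 2*(-2) + 3*1 = 4
a_4 = relu(in_4)
in_5 = w[0, 5] + w[3, 5] * a_3 + w[4, 5] * a_4  # 8 + 6*5 + 7*4 = 66
y_hat = in_5                                    # g_5 is the identity

print(a_3, a_4, y_hat)  # 5.0 4.0 66.0
```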

## Backward-pass

We will calculate the gradient for the
network with respect to our previous single training example $(\mathbf{x},y)$. (For multiple
examples, the gradient is just the sum of the gradients for the individual examples.)

We use the squared loss function $L_2$:

$$ L_2 = (y-\hat{y})^2 $$

### Manual calculation

We can manually compute the gradient of the loss with respect to (w.r.t.) the weights using the chain rule 
$$ {dy \over dx} = {dy \over du}{du \over dx}. $$

So, the gradient of our $L_2$ loss w.r.t. $w_{3,5}$ should be
$$
{\partial L_2 \over \partial w_{3,5}} = {\partial L_2 \over \partial \hat{y}}{\partial \hat{y} \over \partial in_5}{\partial in_5 \over \partial w_{3,5}}
$$

where
$$
{\partial L_2 \over \partial \hat{y}} = {\partial \over \partial \hat{y}}(y-\hat{y})^2 = 2(y - \hat{y})(-1) = -2(y - \hat{y}),
$$

and
$$
{\partial \hat{y} \over \partial in_5} = {\partial \over \partial in_5}(g_5(in_5)) = g_{5}'(in_5).
$$

Since neither $w_{0,5}$ nor $w_{4,5}a_4$ depends on $w_{3,5}$, and $a_3$ does not depend on $w_{3,5}$ either,
$$ {\partial in_5 \over \partial w_{3,5}} = {\partial \over \partial w_{3,5}}(w_{0,5} + w_{3,5}a_3 + w_{4,5}a_4) = a_3.
$$

Finally, we have 
$$
{\partial L_2 \over \partial w_{3,5}} = -2(y - \hat{y}) g_{5}'(in_5) a_3.
$$

Let's try to compute the gradient of our $L_2$ loss w.r.t. $w_{3,5}$ using that equation for our previous training example. 

Since $g_5$ is the identity function, $g_5(in_5)=in_5$, we have $g_{5}'(in_5)=1$. With $a_3 = 5$ from the forward pass,
$$
{\partial L_2 \over \partial w_{3,5}} = -2(64 - 66) \cdot 1 \cdot 5 = 20.
$$

Then, we can update our $w_{3,5}$ (with learning rate $\alpha=1.0$)
$$
w_{3,5} \colonequals w_{3,5} - \alpha {\partial L_2 \over \partial w_{3,5}} = 6 - 1 \cdot 20 = -14.
$$
The updated weight of $w_{3,5}$ is $-14$.
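
As a sanity check on the hand derivation, the gradient can also be approximated numerically with a central finite difference. This is only a sketch under stated assumptions: the function name `loss_given_w35` is illustrative, and the fixed values $a_3 = 5$, $a_4 = 4$, $w_{0,5} = 8$ and $w_{4,5} = 7$ are taken from the forward pass and the weights printed earlier.

```python
def loss_given_w35(w35, a_3=5.0, a_4=4.0, w05=8.0, w45=7.0, y=64.0):
    """Squared loss as a function of w_{3,5} only; everything else is held fixed."""
    y_hat = w05 + w35 * a_3 + w45 * a_4  # g_5 is the identity
    return (y - y_hat) ** 2

w35 = 6.0
eps = 1e-3

# Central difference: (L(w + eps) - L(w - eps)) / (2 * eps).
numeric = (loss_given_w35(w35 + eps) - loss_given_w35(w35 - eps)) / (2 * eps)

# Closed form derived by hand: -2 (y - y_hat) g_5'(in_5) a_3.
analytic = -2.0 * (64.0 - 66.0) * 1.0 * 5.0

print(numeric, analytic)  # both approximately 20
```

Because the loss is quadratic in $w_{3,5}$, the central difference agrees with the analytic value up to floating-point rounding.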

Now, let's try a slightly more difficult case, the gradient of our $L_2$ loss w.r.t. $w_{1,3}$,
$$
{\partial L_2 \over \partial w_{1,3}} = {\partial L_2 \over \partial \^{y}}{\partial \^{y} \over \partial in_5}{\partial in_5 \over \partial in_3}{\partial in_3 \over \partial w_{1,3}}.
$$

As we can see, the first two factors are identical to the previous case, so we can reuse the expressions we derived earlier:

$$
\begin{align*}
{\partial L_2 \over \partial w_{1,3}} &= -2(y - \hat{y}) g_{5}'(in_5){\partial \over \partial in_3}(w_{3,5}a_3){\partial in_3 \over \partial w_{1,3}} \\
&= -2(y - \hat{y}) g_{5}'(in_5)w_{3,5}{\partial \over \partial in_3}g_3(in_3){\partial \over \partial w_{1,3}}(w_{0,3}+w_{1,3}x_1+w_{2,3}x_2) \\
&= -2(y - \hat{y}) g_{5}'(in_5)w_{3,5}g_{3}'(in_3)x_1
\end{align*}
$$

The last line simplifies because $w_{0,3}$ and $w_{2,3}x_2$ do not depend on $w_{1,3}$, so their derivatives vanish, and $x_1$ is a constant input, so the derivative of $w_{1,3}x_1$ w.r.t. $w_{1,3}$ is $x_1$.

Let's try to compute the gradient of our $L_2$ loss w.r.t. $w_{1,3}$ using that equation for our previous training example. 

The function $g_3$ is a rectified linear function:
$$
g_3(in_3)=\begin{cases}
   in_3 &\text{if } in_3 \geq 0 \\
   0 &\text{if } in_3 < 0
\end{cases}
\\\enspace\\
g_{3}'(in_3)=\begin{cases}
   1 &\text{if } in_3 \geq 0 \\
   0 &\text{if } in_3 < 0.
\end{cases}
$$

For our example, $in_3 = 5 \geq 0$, so $g_{3}'(in_3) = 1$. With $w_{3,5} = 6$ and $x_1 = -2$,
$$
{\partial L_2 \over \partial w_{1,3}} = -2(64 - 66) \cdot 1 \cdot 6 \cdot 1 \cdot (-2) = -48.
$$

Then, we can update our $w_{1,3}$ (with learning rate $\alpha=1.0$)
$$
w_{1,3} \colonequals w_{1,3} - \alpha {\partial L_2 \over \partial w_{1,3}} = 0 - 1 \cdot (-48) = 48.
$$
The updated weight of $w_{1,3}$ is $48$.
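
The two hand derivations follow the same pattern for every weight, so they can be written once in matrix form. What follows is a minimal NumPy sketch, not a definitive implementation: the names `W1`, `b1`, `W2`, `b2`, `d_y_hat` and `d_hidden` are illustrative, and the arrays assume the Keras layout in which `kernel[i, j]` is the weight from input `i` to unit `j`.

```python
import numpy as np

# Parameters, as printed after the weights were set.
W1 = np.array([[0.0, 2.0], [1.0, 3.0]])  # rows: x_1, x_2; columns: units 3, 4
b1 = np.array([4.0, 5.0])                # w_{0,3}, w_{0,4}
W2 = np.array([[6.0], [7.0]])            # w_{3,5}, w_{4,5}
b2 = np.array([8.0])                     # w_{0,5}
x = np.array([[-2.0, 1.0]])
y = np.array([[64.0]])

# Forward pass, keeping the intermediate values that the chain rule needs.
in_hidden = x @ W1 + b1                  # [[5., 4.]]
a_hidden = np.maximum(in_hidden, 0.0)    # ReLU
y_hat = a_hidden @ W2 + b2               # [[66.]]

# Backward pass.
d_y_hat = -2.0 * (y - y_hat)             # dL/dy_hat; g_5' = 1, so this is also dL/din_5
grad_W2 = a_hidden.T @ d_y_hat           # dL/dw_{3,5}, dL/dw_{4,5}
grad_b2 = d_y_hat.sum(axis=0)            # dL/dw_{0,5}
# ReLU derivative: 1 where in >= 0, following the definition used in this notebook.
d_hidden = (d_y_hat @ W2.T) * (in_hidden >= 0)
grad_W1 = x.T @ d_hidden                 # dL/dw_{i,j} for the hidden layer
grad_b1 = d_hidden.sum(axis=0)           # dL/dw_{0,3}, dL/dw_{0,4}

print(grad_W2[0, 0], grad_W1[0, 0])      # 20.0 -48.0
```

The two printed entries are the gradients w.r.t. $w_{3,5}$ and $w_{1,3}$ computed by hand.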

### Automatic Differentiation

It was... pretty tedious, huh?

No worries! We can compute such gradients with **automatic differentiation**, a method that applies the rules of calculus in a systematic way.

In our study, we will continue using TensorFlow. Let's do it!

We use the mean squared error loss function. The "mean" does not matter in our case, since we use only a single training example.

In [7]:
loss_fn = tf.keras.losses.MeanSquaredError()

Do the forward propagation and compute the loss inside a `tf.GradientTape` context, which records the operations so that gradients can be computed afterwards (operations executed outside a tape are not recorded). See the [Automatic differentiation TensorFlow guide](https://www.tensorflow.org/guide/autodiff).

In [8]:
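with tf.GradientTape() as tape:
    # Trainable variables (the network weights and `x`) are watched automatically;
    # the explicit watch on `x` only makes the tracking visible.
    tape.watch(x)
    y_hat = net(x, training=True)
    # Keras losses take (y_true, y_pred), in that order.
    loss = loss_fn(y, y_hat)
print(loss)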

tf.Tensor(4.0, shape=(), dtype=float32)


Let's do backward-pass/back-propagation to compute the gradients.

In [9]:
gradients = tape.gradient(loss, net.trainable_variables)

In [10]:
print(gradients)

[<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-48., -56.],
       [ 24.,  28.]], dtype=float32)>, <tf.Tensor: shape=(2,), dtype=float32, numpy=array([24., 28.], dtype=float32)>, <tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[20.],
       [16.]], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([4.], dtype=float32)>]


We use stochastic gradient descent (SGD) to update our network parameters (weights). The "stochastic" part does not matter in our case, since we use only a single training example.

In [11]:
optimizer = tf.keras.optimizers.SGD(learning_rate=1.)

Then, we adjust/update the weights of our network.

In [12]:
optimizer.apply_gradients(zip(gradients, net.trainable_variables))

<tf.Variable 'UnreadVariable' shape=() dtype=int64, numpy=1>
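
For plain SGD without momentum, each update is $w \leftarrow w - \alpha \, \partial L_2 / \partial w$. The NumPy sketch below reproduces that rule by hand, using the initial weights and the gradients printed earlier; the names `weights`, `grads` and `updated` are illustrative, and the snippet does not call into the optimizer itself.

```python
import numpy as np

learning_rate = 1.0

# Initial parameters in the order fc1 kernel, fc1 bias, fc2 kernel, fc2 bias.
weights = [
    np.array([[0.0, 2.0], [1.0, 3.0]]),
    np.array([4.0, 5.0]),
    np.array([[6.0], [7.0]]),
    np.array([8.0]),
]
# Gradients as printed by tape.gradient, in the same order.
grads = [
    np.array([[-48.0, -56.0], [24.0, 28.0]]),
    np.array([24.0, 28.0]),
    np.array([[20.0], [16.0]]),
    np.array([4.0]),
]

# One gradient-descent step for every parameter array.
updated = [w - learning_rate * g for w, g in zip(weights, grads)]

for name, value in zip(["fc1 kernel", "fc1 bias", "fc2 kernel", "fc2 bias"], updated):
    print(name, value.tolist())
```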

Let's check our updated weights.

In [13]:
for layer in net.layers:
    print(layer.name, layer.get_weights(), "\n")

fc1 [array([[ 48.,  58.],
       [-23., -25.]], dtype=float32), array([-20., -23.], dtype=float32)] 

fc2 [array([[-14.],
       [ -9.]], dtype=float32), array([4.], dtype=float32)] 



As we can see, the updated weights are the same as in our manual calculation, where $w_{3,5}$ is $-14$ and $w_{1,3}$ is $48$.

Now we can freely experiment with different network structures, activation functions, loss functions, and forms of composition without having to do lots of calculus to derive a new learning algorithm for each experiment.
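
As one illustration of such an experiment, the sketch below swaps the hidden activation to tanh and the loss to mean absolute error, and obtains the gradients with the same `tf.GradientTape` pattern. The model `variant`, the choice of tanh and absolute error, and the names `inputs` and `target` are all hypothetical choices made for this example; the weights keep their default random initialization, so the printed values differ from run to run.

```python
import tensorflow as tf

# Hypothetical variant of the network: tanh hidden units, absolute-error loss.
variant = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation="tanh"),
    tf.keras.layers.Dense(1),
])
variant_loss_fn = tf.keras.losses.MeanAbsoluteError()

inputs = tf.constant([[-2.0, 1.0]])
target = tf.constant([[64.0]])

with tf.GradientTape() as tape:
    prediction = variant(inputs, training=True)
    loss = variant_loss_fn(target, prediction)  # (y_true, y_pred)

# One gradient tensor per trainable variable: two kernels and two biases.
gradients = tape.gradient(loss, variant.trainable_variables)
for variable, gradient in zip(variant.trainable_variables, gradients):
    print(variable.name, gradient.numpy())
```

No derivative was worked out by hand here; only the model and loss definitions changed.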