We define the input
$$
\mathbf{X} = 
\begin{bmatrix}
0 & 0\\
1 & 0\\
0 & 1\\
1 & 1
\end{bmatrix}
$$
and outputs
$$
\mathbf{y}_\mathrm{OR} = \begin{bmatrix}0 \\ 1 \\ 1 \\ 1\end{bmatrix}, \mathbf{y}_\mathrm{XOR} = \begin{bmatrix}0 \\ 1 \\ 1 \\ 0\end{bmatrix}, \mathbf{y}_\mathrm{AND} = \begin{bmatrix}0 \\ 0 \\ 0 \\ 1\end{bmatrix}
$$
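As a small sketch, the truth tables above can be written as NumPy arrays with one row of `X` per input pair. The variable names are illustrative choices, and the targets are recomputed from `X` with NumPy's logical functions as a consistency check.

```python
import numpy as np

# Design matrix: one row per input pair (x1, x2), same row order as X above.
X = np.array([[0, 0],
              [1, 0],
              [0, 1],
              [1, 1]])

# Targets for the three logic gates, in the same row order as X.
y_or = np.array([0, 1, 1, 1])
y_xor = np.array([0, 1, 1, 0])
y_and = np.array([0, 0, 0, 1])

# Consistency check: recompute each target directly from the columns of X.
x1, x2 = X[:, 0], X[:, 1]
assert np.array_equal(y_or, np.logical_or(x1, x2).astype(int))
assert np.array_equal(y_xor, np.logical_xor(x1, x2).astype(int))
assert np.array_equal(y_and, np.logical_and(x1, x2).astype(int))
```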

# A simple feed-forward neural network (FFNN)
We use the sigmoid function $\sigma(x)$ as the activation function:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}.
$$
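A property of the sigmoid that is useful when differentiating the network is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The following sketch checks this identity numerically against a central finite difference. The function names and the step size `h` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5  # finite-difference step (assumed value)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

# Largest deviation between the analytic and the numerical derivative.
max_error = np.max(np.abs(numeric - sigmoid_derivative(x)))
print(max_error)
```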
We define outputs $a_1^{(1)},a_2^{(1)}$ for the hidden layer with 2 nodes:
$$
a_1^{(1)} = \sigma(z_1^{(1)}),\quad a_2^{(1)} = \sigma(z_2^{(1)}),
$$
where $z_1^{(1)}, z_2^{(1)}$ are activations:
$$
z_1^{(1)} = w_{11}^{(1)}x_1 + w_{21}^{(1)}x_2 + b_1^{(1)},\quad z_2^{(1)} = w_{12}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)},
$$
where $x_1, x_2$ are inputs, $w_{ij}^{(1)}$ with $i,j\in \{1,2\}$ are weights for the hidden layer, and $b_1^{(1)}, b_2^{(1)}$ are biases for the hidden layer. Alternatively, expressed with linear algebra:
$$
W^{(1)} =
\begin{bmatrix} 
w_{11}^{(1)} & w_{12}^{(1)} \\ 
w_{21}^{(1)} & w_{22}^{(1)}
\end{bmatrix},\quad
\mathbf{b}^{(1)} = 
\begin{bmatrix}
b_1^{(1)}\\
b_2^{(1)}
\end{bmatrix},\quad
\mathbf{x} = 
\begin{bmatrix}
x_1\\
x_2
\end{bmatrix},\quad
\mathbf{z}^{(1)} =
\begin{bmatrix}
z_1^{(1)}\\
z_2^{(1)}
\end{bmatrix}
= W^{(1)^T} \mathbf{x} + \mathbf{b}^{(1)},\quad
\mathbf{a}^{(1)}(\mathbf{z}^{(1)}) = 
\begin{bmatrix}
a_1^{(1)}\\
a_2^{(1)}
\end{bmatrix}.
$$
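To see that the matrix form agrees with the component-wise expressions, the sketch below evaluates both for a single input. The numerical values of the weights, biases, and input are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary illustrative values; W1[i, j] holds the weight w_{(i+1)(j+1)}^{(1)}.
W1 = np.array([[0.5, -1.0],
               [2.0, 0.25]])
b1 = np.array([0.1, -0.3])
x = np.array([1.0, 0.0])

# Component-wise form of z_1^{(1)} and z_2^{(1)}.
z1 = W1[0, 0] * x[0] + W1[1, 0] * x[1] + b1[0]
z2 = W1[0, 1] * x[0] + W1[1, 1] * x[1] + b1[1]

# Matrix form: z^{(1)} = W^{(1)T} x + b^{(1)}.
z = W1.T @ x + b1

a = sigmoid(z)  # hidden-layer outputs a^{(1)}
print(z, a)
```

The transpose is what makes the first index of $w_{ij}^{(1)}$ refer to the input and the second to the hidden node.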
We define the activation $z^{(2)}$ and output $y$ for the output layer:
$$
z^{(2)} =  a_1^{(1)} w_1^{(2)} + a_2^{(1)} w_2^{(2)} + b^{(2)},\quad
y = \sigma(z^{(2)})
$$
where $w_1^{(2)}, w_2^{(2)}$ are weights for the output layer and $b^{(2)}$ is the bias for the output layer. Alternatively:
$$
\mathbf{w}^{(2)} = 
\begin{bmatrix}
w_1^{(2)}\\
w_2^{(2)}
\end{bmatrix},\quad
z^{(2)} = \mathbf{w}^{(2)^T} \mathbf{a}^{(1)} + b^{(2)}
$$
As such, the model produced by our FFNN is:
$$
y = \sigma(\mathbf{w}^{(2)^T}\sigma(W^{(1)^T}\mathbf{x} + \mathbf{b}^{(1)}) + b^{(2)})
$$
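As an illustration of the composed model, the sketch below evaluates it for all four inputs at once with hand-picked parameters. These parameter values are an assumption made for this example: they are not trained and do not come from the text, but they happen to reproduce the OR gate after thresholding at 0.5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, w2, b2):
    """Evaluate y = sigma(w2^T sigma(W1^T x + b1) + b2) for each column of x."""
    a1 = sigmoid(W1.T @ x + b1)      # hidden layer, shape (2, n_samples)
    return sigmoid(w2.T @ a1 + b2)   # output layer, shape (1, n_samples)

# Inputs as columns: (0,0), (1,0), (0,1), (1,1).
x = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# Hand-picked (not trained) parameters, assumed for illustration only.
W1 = np.array([[10.0, 0.0],
               [0.0, 10.0]])
b1 = np.array([[-5.0], [-5.0]])
w2 = np.array([[10.0], [10.0]])
b2 = -5.0

y = forward(x, W1, b1, w2, b2)
print((y > 0.5).astype(int))  # prints [[0 1 1 1]]
```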

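We evaluate the model output $y$ against the target value $t$ with the loss function $C$, for which we use the mean squared error (MSE). For $n$ samples with targets $t_i$ and model outputs $y_i$:
$$
C(t, y) = \frac{1}{n}\sum_{i=1}^{n}(t_i - y_i)^2
$$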
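As a small worked example of the MSE, the sketch below compares the OR targets `t` with a set of made-up model outputs `y`. The output values are assumptions chosen for easy arithmetic.

```python
import numpy as np

def mse(target, output):
    """Mean squared error between target values and model outputs."""
    return np.mean((target - output) ** 2)

t = np.array([0.0, 1.0, 1.0, 1.0])   # OR targets
y = np.array([0.5, 0.5, 1.0, 1.0])   # made-up model outputs

# Squared errors are 0.25, 0.25, 0 and 0, so the mean is 0.125.
print(mse(t, y))
```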

In [1]:
"""
Setting up a simple feed forward neural network
"""
import numpy as np

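def activation_function(activation):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-activation))


# One row per input pair. The order of the two middle rows differs from X
# above, which does not matter for the symmetric OR, XOR and AND gates.
features = np.array([[0, 0],
                     [0, 1],
                     [1, 0],
                     [1, 1]])
# Transpose so that each column is one sample.
features = features.T
print(features.shape)
# The suffix _1 or _2 denotes the layer (1: hidden, 2: output).
weights_1 = np.ones((2, 2), dtype=float)
biases_1 = np.random.randn(2, 1)
weights_2 = np.ones((2, 1))
bias_2 = np.random.normal()
# Forward pass for all four samples at once.
activation_1 = weights_1.T @ features + biases_1
activation_2 = weights_2.T @ activation_function(activation_1) + bias_2
# This is the model output (a prediction), not a target.
model_output = activation_function(activation_2)
print(model_output > 0.5)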

(2, 4)
[[ True  True  True  True]]


An untrained network like this is unlikely to give useful predictions, so we update the weights and biases to improve the model. To do so, we compare the network output $y$ with the target value $t$ in the loss function. We collect all weights and biases in one vector
$$\boldsymbol{\theta} = 
\begin{bmatrix}
w_{11}^{(1)}\\
w_{12}^{(1)}\\
w_{21}^{(1)}\\
w_{22}^{(1)}\\
w_1^{(2)}\\
w_2^{(2)}\\
b_1^{(1)}\\
b_2^{(1)}\\
b^{(2)}
\end{bmatrix}
$$
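and minimize $C(t, y)$ with respect to $\boldsymbol{\theta}$. We may do this with any gradient-descent-like algorithm, all of which require an expression for the gradient of $C$ with respect to $\boldsymbol{\theta}$. By the chain rule, for the hidden-layer parameters $\boldsymbol{\theta}^{(1)} = (W^{(1)}, \mathbf{b}^{(1)})$ and the output-layer parameters $\boldsymbol{\theta}^{(2)} = (\mathbf{w}^{(2)}, b^{(2)})$:
$$
\frac{\partial C}{\partial\boldsymbol{\theta}^{(1)}} = \frac{\partial C}{\partial y}\frac{\partial y}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial \mathbf{a}^{(1)}}\frac{\partial \mathbf{a}^{(1)}}{\partial\mathbf{z}^{(1)}}\frac{\partial \mathbf{z}^{(1)}}{\partial \boldsymbol{\theta}^{(1)}},\quad
\frac{\partial C}{\partial\boldsymbol{\theta}^{(2)}} = \frac{\partial C}{\partial y}\frac{\partial y}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial \boldsymbol{\theta}^{(2)}}
$$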
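The chain rule above can be made concrete for this small network. The sketch below packs the parameters into one vector in the same order as $\boldsymbol{\theta}$, computes the MSE gradient by hand using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and compares it with a central finite-difference estimate. The helper names (`unpack`, `loss`, `gradient`), the random seed, and the step size `h` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unpack(theta):
    """Split theta = [w11, w12, w21, w22, w1, w2, b1, b2, b] into arrays."""
    W1 = theta[0:4].reshape(2, 2)   # row-major: [[w11, w12], [w21, w22]]
    w2 = theta[4:6].reshape(2, 1)
    b1 = theta[6:8].reshape(2, 1)
    b2 = theta[8]
    return W1, w2, b1, b2

def loss(theta, x, t):
    """MSE of the network output over all samples (columns of x)."""
    W1, w2, b1, b2 = unpack(theta)
    y = sigmoid(w2.T @ sigmoid(W1.T @ x + b1) + b2)
    return np.mean((t - y) ** 2)

def gradient(theta, x, t):
    """Gradient of the MSE w.r.t. theta via the chain rule."""
    W1, w2, b1, b2 = unpack(theta)
    n = x.shape[1]
    # Forward pass, keeping the intermediate values.
    a1 = sigmoid(W1.T @ x + b1)                # shape (2, n)
    y = sigmoid(w2.T @ a1 + b2)                # shape (1, n)
    # Backward pass.
    dC_dy = 2.0 * (y - t) / n                  # shape (1, n)
    delta2 = dC_dy * y * (1.0 - y)             # dC/dz2, shape (1, n)
    grad_w2 = a1 @ delta2.T                    # shape (2, 1)
    grad_b2 = np.sum(delta2)
    delta1 = (w2 @ delta2) * a1 * (1.0 - a1)   # dC/dz1, shape (2, n)
    grad_W1 = x @ delta1.T                     # entry [i, j] is dC/dw_ij
    grad_b1 = np.sum(delta1, axis=1)           # shape (2,)
    return np.concatenate([grad_W1.ravel(), grad_w2.ravel(), grad_b1, [grad_b2]])

rng = np.random.default_rng(0)   # fixed seed so the check is reproducible
theta = rng.normal(size=9)
x = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])   # OR targets

# Central finite differences, one parameter at a time.
h = 1e-6
numeric = np.zeros_like(theta)
for k in range(theta.size):
    step = np.zeros_like(theta)
    step[k] = h
    numeric[k] = (loss(theta + step, x, t) - loss(theta - step, x, t)) / (2.0 * h)

max_difference = np.max(np.abs(numeric - gradient(theta, x, t)))
print(max_difference)
```

A check of this kind is a common way to validate a hand-written gradient before relying on it. Automatic differentiation, as used in the next cell, removes the need to derive the backward pass by hand.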


In [22]:
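# Importing jax.numpy under the name np means that activation_function from
# the first cell now calls jax.numpy.exp and can be differentiated by JAX.
import jax.numpy as np
from jax import grad
# flatten comes from a local helper module. Judging from its use below, it
# returns the parameters as one flat array together with a function that
# restores the original arrays.
from flatten_n_unflatten import flatten

parameters = (weights_1, weights_2, biases_1, bias_2)
# Despite its name, parameters_tuple is used as a flat array below.
parameters_tuple, reshape_func = flatten(parameters)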



def simple_ffnn(parameters_tuple, features):
  """
  Feed-forward neural network with one hidden layer.
  """
  # Restore the individual weight and bias arrays from the flat vector.
  weights_1, weights_2, biases_1, bias_2 = reshape_func(parameters_tuple)
  activation_1 = weights_1.T @ features + biases_1
  activation_2 = weights_2.T @ activation_function(activation_1) + bias_2
  model_output = activation_function(activation_2)
  return model_output




a = simple_ffnn(parameters_tuple, features)


def loss(target, output):
  """
  MSE
  """
  return np.mean((target - output)**2)



def meta_loss(parameters_tuple, target, features):
  """
  Compute MSE for FFNN
  """
  model_output = simple_ffnn(parameters_tuple, features)
  mse = loss(target, model_output)
  return mse


def grad_meta_loss(parameters_tuple, target, features):
  """
  compute gradient of meta loss w.r.t parameters
  """
  return np.array(grad(meta_loss)(parameters_tuple, target, features))


y_or = np.array([0,1,1,1])
learning_rate = 0.01
iteration = 0
max_iter = 1000
tolerance = 1e-5
# Plain gradient descent: stop after max_iter steps or when the Euclidean
# norm of the parameter update falls below the tolerance.
while True:
    new_parameters_tuple = parameters_tuple - learning_rate*grad_meta_loss(parameters_tuple, y_or, features)
    change = new_parameters_tuple - parameters_tuple
    parameters_tuple = new_parameters_tuple
    iteration += 1
    if max_iter <= iteration or np.sqrt(np.sum(change**2)) < tolerance:
        break
output = simple_ffnn(parameters_tuple, features)
print(output)


[[0.65044266 0.7250797  0.7250797  0.79442495]]
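
To read the printed outputs as class predictions, they can be thresholded at 0.5 in the same way as in the first cell. The sketch below uses the values printed above. With these values all four outputs lie above 0.5, so the input (0, 0) is still misclassified for the OR gate. The variable names are illustrative.

```python
import numpy as np

# Model outputs copied from the printout above, and the OR targets.
output = np.array([[0.65044266, 0.7250797, 0.7250797, 0.79442495]])
y_or = np.array([0, 1, 1, 1])

predictions = (output > 0.5).astype(int).ravel()
accuracy = np.mean(predictions == y_or)
mse = np.mean((y_or - output) ** 2)

print(predictions)  # [1 1 1 1]
print(accuracy)     # 0.75
print(mse)
```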
