# Exercises week 43 and 44
### The OR, AND, and XOR gates

We have two input values $x_1$ and $x_2$ which determine the output of the three gates. Since each input value can be either 0 or 1, we can write the inputs as a design matrix $X$, where the first and second columns represent $x_1$ and $x_2$ respectively:
$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$

The outputs $y$ for the different gates can be written as the vectors $y^T=[0, 1, 1, 1]$ for the OR gate, $y^T=[0, 0, 0, 1]$ for the AND gate, and $y^T=[0, 1, 1, 0]$ for the XOR gate. We set up this matrix and these vectors:


In [309]:
import numpy as np
import jax.numpy as jnp
from jax import grad

# Set up design matrix and output vectors
X = np.asarray([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# Gate target arrays
yOR = np.asarray([0, 1, 1, 1])
yAND = np.asarray([0, 0, 0, 1])
yXOR = np.asarray([0, 1, 1, 0])
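
As a quick sanity check, the three target vectors can be reproduced from the columns of $X$ with NumPy's bitwise operators. This is only a minimal sketch; the names `x1`, `x2`, and `gates` are illustrative and not part of the exercise code.

```python
import numpy as np

# Same design matrix as above: one row per input pair (x1, x2)
X = np.asarray([[0, 0], [0, 1], [1, 0], [1, 1]])
x1, x2 = X[:, 0], X[:, 1]

# Truth tables computed directly from the inputs (illustrative names)
gates = {
    "OR": x1 | x2,
    "AND": x1 & x2,
    "XOR": x1 ^ x2,
}

for name, y in gates.items():
    print(f"y{name} =", y)
```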

We create our neural network (NN) architecture with the sigmoid function $\sigma$ as the activation function,
\begin{equation}
    \sigma(x) = \frac{1}{1+e^{-x}}
\end{equation}
and set up the parameters as follows:

In [310]:
# Parameters
n_hidden_nodes = 2  # hidden nodes per layer
n_categories = 2  # output value categories, for gates we only find 0 or 1
n_inputs, n_features = X.shape  # 4 inputs (rows), 2 features (columns)
target_gate = "OR"  # choose which target gate to train on

# Activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Initialize random number generator with seed
rng = np.random.default_rng(2023)

# Weights and bias in the hidden layer
hidden_weights = rng.standard_normal((n_features, n_hidden_nodes))  # weights normally distributed
hidden_bias = np.zeros(n_hidden_nodes) + 0.01

# Weights and bias in the output layer
output_weights = rng.standard_normal((n_hidden_nodes, n_categories))  # weights normally distributed
output_bias = np.zeros(n_categories) + 0.01  # one bias per output category

print(hidden_weights)
print(output_weights)

[[ 0.60172129  1.15161897]
 [-1.35946236  0.22205533]]
[[-0.77586755  0.8087058 ]
 [-0.19862826 -1.57869386]]


### Feed forward
Then we set up the feed forward algorithm and compare one pass with the target vectors $y^T$:

In [311]:
def feed_forward(X):
    """Feed forward algorithm."""
    # weighted sum of inputs to the hidden layer
    z_h = X @ hidden_weights + hidden_bias

    # activation in the hidden layer
    a_h = sigmoid(z_h)

    # weighted sum of inputs to the output layer
    z_o = a_h @ output_weights + output_bias

    # axis 0 holds each input and axis 1 the probabilities of each category
    probabilities = sigmoid(z_o)
    return probabilities


def predict(X):
    """Get neural network prediction using the feed forward algorithm."""
    probabilities = feed_forward(X)
    return np.argmax(probabilities, axis=1).astype(float)  # also convert to float to not upset Jax


# Make prediction and compare with gate target y_vectors
predictions = predict(X)

print("Targets:")
print("yOR =", yOR)
print("yAND =", yAND)
print("yXOR =", yXOR)

print("\nPrediction:")
print(predictions)

Targets:
yOR = [0 1 1 1]
yAND = [0 0 0 1]
yXOR = [0 1 1 0]

Prediction:
[1. 0. 0. 0.]


We see that this prediction does not match any of the targets. This is because we only did one pass, and that was with random starting weights. Now we set up the cost function and the back propagation algorithm.
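
Before moving on, the shapes in the forward pass above can be made explicit. The following self-contained sketch repeats the same computation and records the shape at each step. It assumes the same 2-2-2 layout as above; the names `W_h`, `b_h`, `W_o`, `b_o`, and `forward_with_shapes` are hypothetical, and the weights are drawn with an arbitrary seed, so the prediction need not match the output above.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


X = np.asarray([[0, 0], [0, 1], [1, 0], [1, 1]])
n_inputs, n_features = X.shape  # 4 input rows, 2 features
n_hidden_nodes, n_categories = 2, 2

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
W_h = rng.standard_normal((n_features, n_hidden_nodes))
b_h = np.full(n_hidden_nodes, 0.01)
W_o = rng.standard_normal((n_hidden_nodes, n_categories))
b_o = np.full(n_categories, 0.01)


def forward_with_shapes(X):
    """Forward pass that also returns the shape of every intermediate array."""
    z_h = X @ W_h + b_h    # (n_inputs, n_hidden_nodes)
    a_h = sigmoid(z_h)     # same shape as z_h
    z_o = a_h @ W_o + b_o  # (n_inputs, n_categories)
    a_o = sigmoid(z_o)     # same shape as z_o
    shapes = {"z_h": z_h.shape, "a_h": a_h.shape, "z_o": z_o.shape, "a_o": a_o.shape}
    return a_o, shapes


probabilities, shapes = forward_with_shapes(X)
print(shapes)
print(np.argmax(probabilities, axis=1))  # one predicted category per input row
```

Each row of the final activation belongs to one input pair, which is why the prediction takes the `argmax` along axis 1.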

<!--- 
For the cost function we use the cross entropy for binary classification given as
$$C(\boldsymbol{\theta}) = -\sum_{i=1}^n \left( y_i \ln [p(y_i | x_i, \boldsymbol{\theta})] + (1-y_i)\ln[1-p(y_i | x_i, \boldsymbol{\theta})] \right)$$

where the probabilities $p$ we have from the sigmoid function 
$$p(y_i=1|x_i, \boldsymbol{\theta}) = \frac{\exp(\theta_0 + \theta_1 x_i)}{1 + \exp(\theta_0 + \theta_1 x_i)}$$
$$p(y_i=0|x_i, \boldsymbol{\theta}) = 1 - p(y_i=1|x_i, \boldsymbol{\theta})$$ 
-->

For the cost function we use the cross entropy for binary classification given as
$$C(\boldsymbol{W}) = -\sum_{i=1}^n \left( t_i \log a_i^L + (1-t_i)\log(1-a_i^L) \right)$$
where $t$ is the target and $a^L$ is the final activation from the final/output layer.
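
As a worked example of this formula, the sketch below evaluates the full binary cross entropy for a hand-picked vector of output activations. The function name `binary_cross_entropy`, the clipping constant `eps`, and the example activations are assumptions made for illustration only.

```python
import numpy as np


def binary_cross_entropy(a, t, eps=1e-9):
    """Sum over samples of -(t*log(a) + (1-t)*log(1-a))."""
    a = np.clip(a, eps, 1 - eps)  # keep log() finite at a = 0 or a = 1
    return -np.sum(t * np.log(a) + (1 - t) * np.log(1 - a))


t = np.asarray([0, 1, 1, 1])               # OR gate targets
a_good = np.asarray([0.1, 0.9, 0.8, 0.7])  # activations close to the targets
a_bad = 1 - a_good                         # activations far from the targets

print(binary_cross_entropy(a_good, t))
print(binary_cross_entropy(a_bad, t))
```

The cost is small when the activations agree with the targets and grows as they move apart, which is the property the training relies on.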

In [312]:
if target_gate == "OR":
    target = np.asarray([0, 1, 1, 1])
elif target_gate == "AND":
    target = np.asarray([0, 0, 0, 1])
elif target_gate == "XOR":
    target = np.asarray([0, 1, 1, 0])

In [313]:
def cost_cross_entropy(target):
    """Return a cross-entropy (log loss) cost function for a given target vector."""
    d = 1e-9  # small value to avoid log(0)

    def func(X):
        # Only the t * log(a) term is active here, averaged over the targets
        return -(1 / target.size) * jnp.sum(target * jnp.log(X + d))

    # Full binary cross entropy, matching the formula above (not used here):
    # def func(x):
    #     return -jnp.sum(target * jnp.log(x + d) + (1 - target) * jnp.log(1 - x + d))
    return func


cost_func = cost_cross_entropy(target)
print(cost_func(predictions))

15.54245


### Calculating the analytical gradients for back propagation
We differentiate the cost function with respect to (w.r.t.) the activation of the output layer $a_i^L$ and get:
\begin{align*}
    \frac{\partial C}{\partial a_i^L} &= -\frac{\partial}{\partial a_i^L}(t_i \ln(a_i^L) + (1-t_i)\ln(1-a_i^L))
    \\ &= -(\frac{t_i}{a_i^L} + \frac{1-t_i}{1-a_i^L}(-1))
    \\ &= \frac{1-t_i}{1-a_i^L} - \frac{t_i}{a_i^L}
    \\ &= \frac{a_i^L(1-t_i)}{a_i^L(1-a_i^L)} - \frac{t_i(1-a_i^L)}{a_i^L(1-a_i^L)}
    \\ &= \underline{\frac{a_i^L-t_i}{a_i^L(1-a_i^L)}}
\end{align*}
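
The result can be checked numerically. The sketch below compares the underlined expression with a central finite-difference approximation of the full cost; finite differences are used here as an independent check and are not the method used elsewhere in this exercise. The function names and the test activations are hypothetical.

```python
import numpy as np


def cost(a, t):
    """Full binary cross entropy, summed over samples."""
    return -np.sum(t * np.log(a) + (1 - t) * np.log(1 - a))


def dcost_analytic(a, t):
    """Derivative from the derivation above: (a - t) / (a * (1 - a))."""
    return (a - t) / (a * (1 - a))


def dcost_numeric(a, t, h=1e-6):
    """Central finite differences, one component of a at a time."""
    grad = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = h
        grad[i] = (cost(a + step, t) - cost(a - step, t)) / (2 * h)
    return grad


t = np.asarray([0.0, 1.0, 1.0, 1.0])  # OR gate targets
a = np.asarray([0.2, 0.6, 0.7, 0.9])  # arbitrary activations strictly inside (0, 1)

print(dcost_analytic(a, t))
print(dcost_numeric(a, t))
```

The activations are kept away from 0 and 1 because the derivative diverges there.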

The expression for the output error $\delta^L$ is 

\begin{equation}
    \delta_i^L = \sigma'(z_i^L)\frac{\partial C}{\partial a_i^L}
\end{equation}

where $\sigma'$ is the derivative of our sigmoid function, which we can write as

\begin{equation}
    \sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2}
\end{equation}
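
Assuming $a_i^L = \sigma(z_i^L)$, as in the feed forward code above, these expressions combine into a much simpler form. A short derivation, rewriting $\sigma'$ in terms of $\sigma$ first:

\begin{align*}
    \sigma'(x) &= \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x))
    \\ \delta_i^L &= \sigma'(z_i^L)\frac{\partial C}{\partial a_i^L} = a_i^L(1-a_i^L) \, \frac{a_i^L-t_i}{a_i^L(1-a_i^L)} = a_i^L - t_i
\end{align*}

so for the full cross entropy with a sigmoid output the output error is simply the difference between the activation and the target.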

In [314]:
def analytic_gradient(target):
    d = 1e-9  # small value to avoid division by zero
    def func(x):
        return np.exp(-x)/(1 + np.exp(-x))**2 * (x - target)/(x * (1 - x) + d)
    return func

analy_grad = analytic_gradient(target)

### Using automatic differentiation for the gradient

In [315]:
auto_grad = grad(cost_func)

# Compare analytical vs automatic with yOR
print(analy_grad(predictions))
print(auto_grad(predictions))

[ 1.96611933e+08 -2.50000000e+08 -2.50000000e+08 -2.50000000e+08]
[-0.0e+00 -2.5e+08 -2.5e+08 -2.5e+08]


### Back propagation