# The $XOR$ Problem

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

The XOR problem is a classic example in machine learning that highlights the limitations of linear models in solving non-linear problems. The problem involves a binary classification task where the output is 1 if exactly one of the two input values is 1, and 0 if both inputs are 0 or both inputs are 1.

On the surface, XOR appears to be a very simple problem; however, Minsky and Papert ([1969](https://leon.bottou.org/publications/pdf/perceptrons-2017.pdf)) showed that this was a big problem for neural network architectures of the 1960s, known as perceptrons.

The XOR problem is not linearly separable, so it cannot be solved by linear models. This led to the development of more advanced techniques, such as multi-layer neural networks, which can handle non-linear problems by introducing non-linearities through their activation functions. The XOR problem is a useful example to illustrate the need for more complex models in machine learning when dealing with non-linear problems.
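To make the limitation concrete, here is a minimal sketch that tries many single linear units on the XOR truth table and reports the best accuracy found. The helper names and the weight grid are assumptions made for this illustration, and a grid search is only a demonstration, not a proof.

```python
import itertools

# XOR truth table: (inputs, expected label).
XOR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def linear_threshold(w1, w2, b, x1, x2):
    """Single linear unit: outputs 1 when w1*x1 + w2*x2 + b > 0, else 0."""
    return int(w1 * x1 + w2 * x2 + b > 0)


def best_linear_accuracy(grid):
    """Highest XOR accuracy reached by any (w1, w2, b) taken from the grid."""
    best = 0.0
    for w1, w2, b in itertools.product(grid, repeat=3):
        correct = sum(linear_threshold(w1, w2, b, *x) == label
                      for x, label in XOR_TABLE)
        best = max(best, correct / len(XOR_TABLE))
    return best


# Hypothetical search grid: values from -2.0 to 2.0 in steps of 0.25.
grid = [i / 4 for i in range(-8, 9)]
print(best_linear_accuracy(grid))  # 0.75: at most three of the four points
```

No combination on the grid classifies all four points; the best a single line can do is three out of four.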

![image](https://res.cloudinary.com/practicaldev/image/fetch/s--6OpbLFPq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ALYlt6CZJHOJkNRHJ.)

In this notebook, we will implement an ML solution to the XOR problem, exposing the limitations of linear models (like the `perceptron`).

### The `Perceptron`

To quickly summarise, a perceptron is essentially a method of separating a space with a _hyperplane_. In two dimensions this is just drawing a straight line; in general, the hyperplane splits an $n$-dimensional space into two regions: True or False.
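As a small sketch of this idea, the function below tests on which side of a hyperplane a point falls. The function name and the hand-picked weights (which happen to implement a logical AND) are assumptions chosen for illustration.

```python
def perceptron(x, w, b):
    """Return 1 if x lies on the positive side of the hyperplane w.x + b = 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0


# Hypothetical parameters: the line x1 + x2 - 1.5 = 0 isolates the point (1, 1).
w_and, b_and = (1.0, 1.0), -1.5

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, perceptron(point, w_and, b_and))
```

Only `(1, 1)` lands on the True side, so this single line is enough for AND.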

[![perceptron2](https://miro.medium.com/max/639/1*_Epn1FopggsgvwgyDA4o8w.png)](https://miro.medium.com/max/639/1*_Epn1FopggsgvwgyDA4o8w.png)

In [1]:
import numpy as np
import plotly.graph_objects as go

rs = np.random.RandomState(42)

x = rs.randn(100)*5  # draw from the seeded generator so the plot is reproducible
x_0 = np.linspace(x.min()-.1, x.max()+.1, 500)  # evenly spaced test points


fig = go.Figure(data=go.Scatter(x=x_0, y=x_0, name='Linear Function'))
fig.update_layout(template='plotly_dark',
                  title='Linear Function')
fig.show()

A neural network is essentially a series of hyperplanes (planes in $N$ dimensions) that group and separate regions of the input space.

Let's generate some fake data to visualize this:

- $X$ is the input features.
- $Y$ is the class label for each $x$.

We can then fit the data with an SVM and separate the samples with a hyperplane.

Note: The separating plane is the set of all $x$ in $R^{3}$ such that:

$$(SVM\;Coefficients \cdot x) + b = 0$$
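To draw this plane as a surface, it helps to solve the equation for the third coordinate. Writing the SVM coefficients as $(w_1, w_2, w_3)$ and a point as $(x, y, z)$ (notation assumed here for illustration), and assuming $w_3 \neq 0$:

$$w_1 x + w_2 y + w_3 z + b = 0 \;\Longrightarrow\; z = \frac{-b - w_1 x - w_2 y}{w_3}$$

This is the expression the helper `z(x, y)` evaluates in the next cell, with `svm.coef_[0]` holding the coefficients and `svm.intercept_[0]` the bias.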

In [2]:
from sklearn.svm import SVC

n_samples = 100

X = np.zeros((n_samples, 3))

X[:n_samples // 2] = rs.multivariate_normal(
    np.ones(3), np.eye(3), size=n_samples // 2)
X[n_samples // 2:] = rs.multivariate_normal(
    -np.ones(3), np.eye(3), size=n_samples // 2)

Y = np.zeros(n_samples)

Y[n_samples//2:] = 1

svm = SVC(kernel='linear')

svm.fit(X, Y)

def z(x, y): 
    return (-svm.intercept_[0]-svm.coef_[0][0]*x-svm.coef_[0][1]*y) / svm.coef_[0][2]

am, aM = X[:, 0].min(), X[:, 0].max()
bm, bM = X[:, 1].min(), X[:, 1].max()
a = np.linspace(am, aM, 10)
b = np.linspace(bm, bM, 10)
a, b = np.meshgrid(a, b)


fig = go.Figure()

fig.add_surface(x=a, y=b, z=z(a, b), showscale=False, opacity=0.9)
fig.add_scatter3d(x=X[Y == 0, 0], y=X[Y == 0, 1], z=X[Y == 0, 2],
                  mode='markers', marker={'color': 'blue'}, name='Class_0')

fig.add_scatter3d(x=X[Y == 1, 0], y=X[Y == 1, 1], z=X[Y == 1, 2],
                  mode='markers', marker={'color': 'red'}, name='Class_1')

fig.update_layout(template='plotly_dark',
                  title='Hyperplane Separation',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')

fig.show()

Now, to the XOR problem. The XOR function is one of the simplest non-linear functions. It is impossible to separate True results from False using a single line. Try it yourself...
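A short argument shows why no single line works. Suppose, for contradiction, that a perceptron with weights $w_1, w_2$ and bias $b$ (symbols introduced for this sketch) outputs True exactly when $w_1 x_1 + w_2 x_2 + b > 0$. The four XOR points then require:

$$
\begin{aligned}
(0,0) \mapsto \text{False}: &\quad b \le 0 \\
(1,1) \mapsto \text{False}: &\quad w_1 + w_2 + b \le 0 \\
(0,1) \mapsto \text{True}: &\quad w_2 + b > 0 \\
(1,0) \mapsto \text{True}: &\quad w_1 + b > 0
\end{aligned}
$$

Adding the first two inequalities gives $w_1 + w_2 + 2b \le 0$, while adding the last two gives $w_1 + w_2 + 2b > 0$. Both cannot hold, so no such line exists.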

In [3]:
def xor(x1, x2):
    """
    Return the XOR (exclusive or) of two boolean values.

    Parameters:
        x1 (bool): First boolean value
        x2 (bool): Second boolean value

    Returns:
        bool: XOR of x1 and x2, which is True if only one of them is True, False otherwise.
    """
    return bool(x1) != bool(x2)


x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([xor(*i) for i in x])

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=[x[0][0]],
    y=[x[0][1]],
    marker=dict(color="crimson", size=12),
    mode="markers",
    name="False",
))

fig.add_trace(go.Scatter(
    x=[x[3][0]],
    y=[x[3][1]],
    marker=dict(color="crimson", size=12),
    mode="markers",
    name="False",
))

fig.add_trace(go.Scatter(
    x=[x[1][0]],
    y=[x[1][1]],
    marker=dict(color="blue", size=12),
    mode="markers",
    name="True",
))

fig.add_trace(go.Scatter(
    x=[x[2][0]],
    y=[x[2][1]],
    marker=dict(color="blue", size=12),
    mode="markers",
    name="True",
))

fig.update_layout(template='plotly_dark',
                  title='XOR function',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()


To solve this problem, we will need some "non-linearity", and a way to introduce this is by using `activation functions`.

### Activation functions

The way our brains work is like a sort of step function. A neuron fires (a 1) if there is enough build-up of voltage; otherwise, it doesn't fire (a 0). We aim, via the perceptron, to recreate this behavior.

The problem with the step function is that it is discontinuous at zero, and its derivative is zero everywhere else, so it gives gradient descent no signal to follow. Thus we tend to use a smooth function such as the sigmoid, which is infinitely differentiable, or the ReLU (_whose derivative is better behaved_), allowing us to easily do calculus (and gradient descent) with our model.
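The difference shows up when the slopes are estimated numerically. This is a minimal sketch using only the standard library; the scalar helpers and the finite-difference step `h` are assumptions for illustration.

```python
import math


def step_scalar(x):
    """Step activation for a single number."""
    return 1.0 if x > 0 else 0.0


def sigmoid_scalar(x):
    """Sigmoid activation for a single number."""
    return 1.0 / (1.0 + math.exp(-x))


def numerical_derivative(f, x, h=1e-5):
    """Central finite difference, used here only as an illustration."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x,
          numerical_derivative(step_scalar, x),
          round(numerical_derivative(sigmoid_scalar, x), 4))
```

Away from zero the step function's slope is exactly 0, while the sigmoid always returns a small but non-zero slope that gradient descent can use.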


In [4]:
x = np.random.randn(100)*5
x_0 = np.linspace(x.min()-.1, x.max()+.1, 500)

def step(array):
    """
    Applies the step function to an input array.

    Args:
        array (numpy.ndarray): Input array to apply the step function to.

    Returns:
        numpy.ndarray: Output array containing the result of applying the step 
        function to the input array. Each element in the output array is set 
        to 1 if the corresponding element in the input array is greater than 0, 
        otherwise it is set to 0.
    """
    # 1 where the input is positive, 0 elsewhere (vectorized, no Python loop)
    return np.where(array > 0, 1.0, 0.0)

def ReLU(Z):
    """
    Applies the rectified linear unit (ReLU) activation function 
    element-wise to an input array.

    Parameters:
    -----------
    Z : numpy.ndarray
        Input array of shape (n_samples, n_features), where 
        n_samples is the number of samples and n_features is 
        the number of features.

    Returns:
    --------
    numpy.ndarray
        Output array of the same shape as Z, where the ReLU 
        function is applied element-wise to Z. Specifically, the 
        function returns the maximum of each element in Z and 0.
    """
    return np.maximum(Z, 0)  # element-wise maximum of Z and 0

def sigmoid(x):
    """
    Calculates the sigmoid function of the input x.

    Parameters:
    -----------
    x: array-like or float
        Input to calculate the sigmoid function.

    Returns:
    --------
    float or array-like
        The value of the sigmoid function for the input x.
    """
    return 1/(1 + np.exp(-x))

fig = go.Figure(data=go.Scatter(x=x_0, y=step(x_0), name='Step Function'))
fig.update_layout(template='plotly_dark',
                  title='Step Function',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()


fig = go.Figure(data=go.Scatter(x=x_0, y=ReLU(x_0), name='ReLU Function'))
fig.update_layout(template='plotly_dark',
                  title='ReLU Function',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()

fig = go.Figure(data=go.Scatter(x=x_0, y=sigmoid(x_0), name='Sigmoid Function'))
fig.update_layout(template='plotly_dark',
                  title='Sigmoid Function',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()


The way to solve the XOR problem involves realizing that we can just stack two layers of perceptrons.

Two stacked layers can solve the XOR problem because the first (hidden) layer re-maps the inputs into a new feature space in which the two classes can be separated by a linear boundary.

Each perceptron in the hidden layer draws its own line through the input space and passes the result through a non-linear activation function such as the sigmoid or ReLU function, which introduces non-linearity into the model.

The output of the hidden layer is then used as input to the second layer, which can now learn a linear boundary to separate the transformed inputs. By stacking two layers of perceptrons with a non-linear activation function in between, we can create a non-linear decision boundary that can separate the two classes in the XOR problem.

This is an example of how combining multiple perceptrons with non-linear activation functions can increase the complexity and flexibility of the model to solve non-linear problems.
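The stacking idea can be checked by hand before any training. The sketch below wires three step-activated perceptrons together; the weights are hand-picked assumptions (an OR unit and a NAND unit feeding an AND unit), not values learned by the network trained later in this notebook.

```python
def step_unit(z):
    """Step activation: 1 for positive input, 0 otherwise."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """One perceptron: weighted sum plus bias, passed through the step."""
    return step_unit(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    # Hidden layer: two perceptrons, each drawing one line in the input plane.
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)     # 0 only when both inputs are 0
    h_nand = neuron((x1, x2), (-1.0, -1.0), 1.5)  # 0 only when both inputs are 1
    # Output layer: a single line in (h_or, h_nand) space, here an AND.
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))
```

In the hidden space the four inputs become `(0, 1)`, `(1, 1)`, `(1, 1)` and `(1, 0)`, and a single line now separates the True point `(1, 1)` from the two False ones.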

In [5]:

x_line = np.linspace(0, 1, 500)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([xor(*i) for i in x])

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=[x[0][0]],
    y=[x[0][1]],
    marker=dict(color="crimson", size=12),
    mode="markers",
    name="False",
))

fig.add_trace(go.Scatter(
    x=[x[3][0]],
    y=[x[3][1]],
    marker=dict(color="crimson", size=12),
    mode="markers",
    name="False",
))

fig.add_trace(go.Scatter(
    x=[x[1][0]],
    y=[x[1][1]],
    marker=dict(color="blue", size=12),
    mode="markers",
    name="True",
))

fig.add_trace(go.Scatter(
    x=[x[2][0]],
    y=[x[2][1]],
    marker=dict(color="blue", size=12),
    mode="markers",
    name="True",
))

fig.add_trace(go.Scatter(x=x_line - 0.1, y=x_line[::-1] - 0.1, name='Boundary 1'))

fig.add_trace(go.Scatter(x=x_line + 0.1, y=x_line[::-1] + 0.1, name='Boundary 2'))

fig.update_layout(template='plotly_dark',
                  title='Idealized decision boundary (with two lines)',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()


The "_knowledge_" of a neural network is all contained in its learned parameters: the weights and biases. Each weight multiplies the signal arriving from its respective perceptron, and the bias is added:

$$y(x) = w \cdot x + b$$ 

Where: 

- $w$ is the weight.
- $b$ is the bias.

The `backpropagation` algorithm is the key method by which we sequentially adjust the weights by backpropagating the errors from the final output neuron.

To calculate the adjustment of each weight, we define the error (loss/cost function) as a quantity that decreases as the prediction approaches the target. Let $y$ be the target, $y'$ the network's prediction, and $E$ the error function given by:

$$E = \frac{(y - y')^{2}}{2}$$
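The gradients used by backpropagation follow from this error by the chain rule. As a sketch, take $y$ as the target and $y'$ as the prediction (matching the argument order of `error(target, prediction)` in the implementation further down), and assume a sigmoid output unit $y' = \sigma(z)$ with $z = w \cdot h + b$, where $h$ is the hidden-layer output and $\alpha$ is the learning rate (symbols introduced here for illustration):

$$
\begin{aligned}
\frac{\partial E}{\partial y'} &= y' - y \\
\frac{\partial E}{\partial z} &= (y' - y)\,\sigma(z)\,\bigl(1 - \sigma(z)\bigr) = (y' - y)\,y'\,(1 - y') \\
\frac{\partial E}{\partial w} &= \frac{\partial E}{\partial z}\,h, \qquad \frac{\partial E}{\partial b} = \frac{\partial E}{\partial z} \\
w &\leftarrow w - \alpha\,\frac{\partial E}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial E}{\partial b}
\end{aligned}
$$

The first line corresponds to `error_derivative` and the factor $y'(1 - y')$ to `sigmoid_derivative` in the training code.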

The learning algorithm consists of the following steps:

- Randomly initialize bias and weights.
- Iterate the training data.
- `Forward propagate`: Calculate the neural net output.
- Compute the loss function ("_size of the error_")
- `Backwards propagate`: Calculate the gradients with respect to the weights and bias.
- Adjust weights and bias by gradient descent.
- Exit when the error has reached a certain threshold.

Here's a pseudo-code implementation of the steps involved in training a neural network using gradient descent:

```python

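import random

# NOTE: initialize_network, compute_loss, compute_accuracy, and training_data
# are placeholders; this block is pseudo-code and will not run as-is.

# initialize the neural network with random weights and bias
network = initialize_network()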

# set the learning rate and the maximum number of iterations
learning_rate = 0.01
max_iterations = 1000

# iterate through the training data for a fixed number of epochs
for i in range(max_iterations):
    # randomly shuffle the training data
    random.shuffle(training_data)
    
    # iterate through each training example
    for example in training_data:
        # forward propagate through the network
        output = network.forward_propagate(example.features)
        
        # compute the loss function
        loss = compute_loss(output, example.label)
        
        # backward propagate through the network
        gradients = network.backward_propagate(loss)
        
        # update the weights and bias using stochastic gradient descent
        network.update_weights(gradients, learning_rate)
        
    # compute the training accuracy and print it
    accuracy = compute_accuracy(network, training_data)
    print("Epoch {}: training accuracy = {}".format(i, accuracy))
    
    # check if the stopping criterion has been met
    if accuracy >= 0.99:
        break

```

Now, let us try to implement this with some real code.

In [None]:
import itertools
import pandas as pd

alpha = 0.02
np.random.seed(42)

def xor(x1, x2):
    return bool(x1) != bool(x2)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sigmoid_result):
    return sigmoid_result * (1 - sigmoid_result)

def error(target, prediction):
    return .5 * (target - prediction)**2

def error_derivative(target, prediction):
    return - target + prediction

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[xor(*i)] for i in x], dtype=int)

n_neurons_input, n_neurons_hidden, n_neurons_output, bias_per_neuron = 2, 2, 1, 1

w_hidden = np.random.random(size=(n_neurons_input, n_neurons_hidden))
b_hidden = np.random.random(size=(bias_per_neuron, n_neurons_hidden))

w_output = np.random.random(size=(n_neurons_hidden, n_neurons_output))
b_output = np.random.random(size=(bias_per_neuron, n_neurons_output))

errors = []
params = []
grads = []
epoch = 1

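while True:
    if epoch == 1:
        print('Training...\nEpoch 1')

    # Forward pass: hidden layer, then output layer (sigmoid on both).
    y_hidden = sigmoid(np.dot(x, w_hidden) + b_hidden)
    y_output = sigmoid(np.dot(y_hidden, w_output) + b_output)

    # Half squared error, averaged over the four XOR samples.
    e = error(y, y_output).mean()

    # Stop once the loss falls below the threshold.
    if e < 1e-4:
        print(f'Epoch {epoch}.')
        print(f'Training terminated. Loss score: {e}.')
        break

    # Backward pass: chain rule through the output layer, then the hidden layer.
    grad_output = error_derivative(y, y_output) * sigmoid_derivative(y_output)
    grad_hidden = grad_output.dot(w_output.T) * sigmoid_derivative(y_hidden)

    # Gradient-descent step on the weights.
    w_output -= alpha * y_hidden.T.dot(grad_output)
    w_hidden -= alpha * x.T.dot(grad_hidden)

    # Gradient-descent step on the biases. Note: np.sum without an axis also
    # adds across the two hidden units, so both hidden biases receive the same
    # update; np.sum(grad_hidden, axis=0) would keep them separate.
    b_output -= alpha * np.sum(grad_output)
    b_hidden -= alpha * np.sum(grad_hidden)

    # Record the loss, gradients, and parameters for the plots below.
    errors.append(e)
    grads.append(np.concatenate((grad_output.ravel(), grad_hidden.ravel())))
    params.append(np.concatenate((w_output.ravel(), b_output.ravel(),
                                  w_hidden.ravel(), b_hidden.ravel())))
    epoch += 1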


def predict(x, y):
    y_hidden = sigmoid(np.dot(x, w_hidden) + b_hidden)
    result = sigmoid(np.dot(y_hidden, w_output) + b_output)
    df = pd.DataFrame(x, columns=['x1', 'x2'])
    df['Prediction'] = result
    df['Ground Truth'] = y
    return df


predict(x, y)


We trained for over 300,000 epochs and reached a very small loss, giving a network that can solve the XOR problem. Below we plot how the loss decreased over time, the size of the gradient each parameter received over time, and how each parameter changed during training. Notice that some gradients are consistently smaller than others (especially those further from the output layer); this is the `vanishing gradient` problem.

In [None]:
epochs = list(range(1, len(errors) + 1))  # one x-value per recorded epoch

fig = go.Figure(data=go.Scatter(x=epochs, y=errors, name='Multi-Perceptron Loss'))

fig.update_layout(template='plotly_dark',
                  title='Multi-Perceptron Loss',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()

grads_df = pd.DataFrame(grads)
params_df = pd.DataFrame(params)

fig = go.Figure()

for i in range(len(grads_df.columns)):
    fig.add_trace(go.Scatter(x=epochs, y=grads_df[i].abs(), name=f'gradient_{i}'))
    
fig.update_layout(template='plotly_dark',
                  title='Gradients (0-3: output layer) (4-11: the hidden layer)',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()

fig = go.Figure()
for i in range(len(params_df.columns)):
    fig.add_trace(go.Scatter(x=epochs, y=params_df[i], name=f'Parameter_{i}'))

fig.update_layout(template='plotly_dark',
                  title='Weights (0-1, 3-6) and Biases (2, 7-8)',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()


The vanishing gradient problem is a phenomenon in neural networks where the gradients of the error with respect to the parameters of the early layers (closest to the input) become very small. This happens because, during backpropagation, the gradient is multiplied by the derivative of each layer's activation function. If that derivative is smaller than 1 (the sigmoid's derivative never exceeds 0.25), the gradient shrinks exponentially as it is backpropagated to earlier layers, making it difficult for those layers to "_learn_."
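The exponential shrinkage is easy to see numerically. This sketch multiplies sigmoid derivatives along a chain of layers; it assumes every weight equals 1 and every pre-activation equals 0 (where the sigmoid derivative is at its maximum), so real networks can shrink even faster. The helper names are hypothetical.

```python
import math


def sigmoid_slope(z):
    """Derivative of the sigmoid at pre-activation z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backprop_scale(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights set to 1)."""
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_slope(z)
    return scale


for depth in (1, 2, 5, 10):
    print(depth, backprop_scale([0.0] * depth))
```

Even in this best case the factor is `0.25 ** depth`, which is already below one millionth after ten layers.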

To mitigate the vanishing gradient problem, various techniques have been developed. One common approach is to use better optimization algorithms such as adaptive gradient descent methods (e.g., `Adam` or `RMSprop`) that adjust the learning rate based on the gradient.

Another approach is to use activation functions with a derivative that is not close to zero, such as `ReLU` or variants like `leaky ReLU`. 
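As a rough comparison, the same kind of chain product can be computed with ReLU-style derivatives. The helper names and the `negative_slope` value of 0.01 are assumptions for this sketch.

```python
def relu_slope(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_slope(z, negative_slope=0.01):
    """Derivative of leaky ReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if z > 0 else negative_slope


def chain_scale(slope, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    scale = 1.0
    for z in pre_activations:
        scale *= slope(z)
    return scale


print(chain_scale(relu_slope, [1.0] * 10))               # 1.0: no shrinkage
print(chain_scale(relu_slope, [1.0, -1.0, 1.0]))         # 0.0: one inactive unit blocks the gradient
print(chain_scale(leaky_relu_slope, [1.0, -1.0, 1.0]))   # 0.01: the gradient still gets through
```

Through active units the gradient passes unchanged, and the leaky variant keeps a small gradient alive even when a unit is inactive.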

Also, the use of skip connections, such as in `ResNet`, helps to mitigate this problem by providing a direct path for the gradient to flow through the network.

We can also use `batch normalization` to help stabilize the gradient by normalizing the inputs to each layer. These techniques have made it possible to train much deeper neural networks with fewer issues related to the vanishing gradient problem.
