In this notebook, we're going to create a simple perceptron using pure Python (no TensorFlow or Keras yet). To keep things simple, we're not even going to use NumPy here.

The only standard library module we're going to bring in is `math`, which we need to define the sigmoid activation function.

In [None]:
import math

First, we define our data set. `X` defines a set of two-dimensional data points, and `y` defines our vector of outcomes (targets).

Note: typically, we'd want to normalize, standardize, or min-max scale our instances first to make the network train better. Given that both features here lie in a small, similar range (0 to 5), we can safely skip this step.

In [None]:
X = [[0, 1], [1, 0], [2, 2], [3, 4], [4, 2], [5, 2], [4, 1], [5, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
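
For reference, the scaling step mentioned in the note could look roughly like the min-max sketch below. The names `minmax_scale` and `X_scaled` are hypothetical and not part of this notebook; the remaining cells keep using the unscaled `X`.

```python
X = [[0, 1], [1, 0], [2, 2], [3, 4], [4, 2], [5, 2], [4, 1], [5, 0]]

def minmax_scale(rows):
  # Hypothetical helper: rescale every feature (column) to the range [0, 1]
  columns = list(zip(*rows))
  lows = [min(col) for col in columns]
  highs = [max(col) for col in columns]
  scaled = []
  for row in rows:
    scaled.append([
      # Guard against constant columns, where hi == lo
      (value - lo) / (hi - lo) if hi > lo else 0.0
      for value, lo, hi in zip(row, lows, highs)
    ])
  return scaled

X_scaled = minmax_scale(X)
print(X_scaled)
```

After scaling, each feature has a minimum of 0 and a maximum of 1, so no single input dominates the weighted sum purely because of its units.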

Next up, we define our activation function, and can immediately define a function to predict an outcome given an instance. Since we use a sigmoid activation function, the output of the perceptron will be bounded between 0 and 1 and can be directly interpreted as a probability.

In [None]:
def sigmoid(x):
  return 1 / (1 + math.exp(-x))

def predict(instance, weights):
  # We need a weight for each input, plus a bias weight
  assert len(weights) == len(instance) + 1
  # Assume that the first weight given is our bias
  output = weights[0]
  for i in range(len(weights)-1):
    output += weights[i+1] * instance[i]
  return sigmoid(output)
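
One caveat with this `sigmoid`: `math.exp(-x)` raises an `OverflowError` for strongly negative inputs (for example `x = -1000`). That cannot happen with the small weights and inputs used in this notebook, but a more defensive variant could be sketched as follows. The name `stable_sigmoid` is hypothetical and is not used elsewhere in the notebook.

```python
import math

def stable_sigmoid(x):
  # Hypothetical alternative to sigmoid(): only ever exponentiate a
  # non-positive number, so math.exp cannot overflow
  if x >= 0:
    return 1 / (1 + math.exp(-x))
  z = math.exp(x)
  return z / (1 + z)

for value in [-1000, -2, 0, 2, 1000]:
  print(value, '->', stable_sigmoid(value))
```

Both branches compute the same function; they only differ in which form is evaluated, depending on the sign of `x`.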

We can now already see what happens if we let an untrained perceptron make some predictions, e.g. by setting all the weights to 0. With every weight at 0, the weighted sum is 0 for each instance, so every prediction is $\sigma(0) = 0.5$.

Setting these initial values for the weights is typically called "initialization", and is an important topic in its own right, which we'll discuss later on.

In [None]:
weights = [0, 0, 0]

for i in range(len(X)):
  prediction = predict(X[i], weights)
  print(X[i], y[i], '->', prediction)

[0, 1] 0 -> 0.5
[1, 0] 0 -> 0.5
[2, 2] 0 -> 0.5
[3, 4] 0 -> 0.5
[4, 2] 1 -> 0.5
[5, 2] 1 -> 0.5
[4, 1] 1 -> 0.5
[5, 0] 1 -> 0.5


Next, we define a function to train our perceptron using one instance. This is done by calculating the error or "loss", which is then used to adjust the weights.

To do so, we need to calculate the gradient of our loss function with respect to the weights: $\frac{\partial L}{\partial w}$.

Let us define our loss function as $L = (y - \hat{y})^2$. Remember that our prediction is given by $\hat{y} = \sigma(o)$ with $\sigma$ the sigmoid function and $o = w_1 x_1 + w_2 x_2 + b$, where the bias $b$ is stored as `weights[0]` in the code.

We now need to calculate $\frac{\partial L}{\partial w} = \frac{\partial (y - \hat{y})^2}{\partial w}$. To expand this further, we utilize the chain rule as follows:

$$\frac{\partial (y - \hat{y})^2}{\partial w} = \frac{\partial (y - \hat{y})^2}{\partial (y - \hat{y})} \frac{\partial (y - \hat{y})}{\partial w} = 2(y - \hat{y})\frac{\partial (y - \hat{y})}{\partial w} = 2(y - \hat{y})(-1)\frac{\partial \hat{y}}{\partial w}$$

We know that $\hat{y}$ is given by $\sigma(o)$, so we can utilize the chain rule again:

$$\frac{\partial \hat{y}}{\partial w} = \frac{\partial \sigma(o)}{\partial w} = \frac{\partial \sigma(o)}{\partial o}\frac{\partial o}{\partial w}$$

The first partial derivative (the derivative of the sigmoid function) is equal to $\sigma(o)(1-\sigma(o))$. The second partial derivative is equal to $x_i$ for $w = w_i$ (so $x_1$ for $w_1$ and $x_2$ for $w_2$) and equal to 1 for $w=b$ (the bias weight).

We can then use gradient descent to update the weights given a learning rate:

$$w_{i,t+1} = w_{i,t} - \eta \frac{\partial L}{\partial w_i}$$

Or, expanded based on our results above (taking $x_i = 1$ for the bias weight):

$$w_{i,t+1} = w_{i,t} - \eta \left( 2(y - \hat{y})(-1) \sigma(o)(1-\sigma(o)) x_i\right)$$

Which is equal to:

$$w_{i,t+1} = w_{i,t} + \eta \cdot 2(y - \hat{y})\sigma(o)(1-\sigma(o)) x_i$$

If you want a refresher on the matrix calculus used here, this link is a good starting point: https://explained.ai/matrix-calculus/index.html
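
The derivation above can be sanity-checked numerically by comparing the analytic gradient against a central finite-difference approximation of the loss. The sketch below is self-contained; `loss`, `analytic_gradient`, `numeric_gradient`, the step size `eps`, and the example weights are all hypothetical choices made for this check and are not part of the notebook.

```python
import math

def sigmoid(x):
  return 1 / (1 + math.exp(-x))

def predict(instance, weights):
  # weights[0] is the bias, as in the notebook
  output = weights[0]
  for i in range(len(instance)):
    output += weights[i+1] * instance[i]
  return sigmoid(output)

def loss(instance, weights, y_true):
  # Squared error L = (y - y_hat)^2
  return (y_true - predict(instance, weights)) ** 2

def analytic_gradient(instance, weights, y_true):
  # dL/dw_i = -2 (y - y_hat) sigma(o) (1 - sigma(o)) x_i, with x_i = 1 for the bias
  y_hat = predict(instance, weights)
  common = -2 * (y_true - y_hat) * y_hat * (1 - y_hat)
  return [common] + [common * x for x in instance]

def numeric_gradient(instance, weights, y_true, eps=1e-6):
  # Central difference: (L(w + eps) - L(w - eps)) / (2 eps), one weight at a time
  grads = []
  for i in range(len(weights)):
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    grads.append((loss(instance, up, y_true) - loss(instance, down, y_true)) / (2 * eps))
  return grads

instance, y_true = [3, 4], 0
weights = [0.1, -0.2, 0.3]
analytic = analytic_gradient(instance, weights, y_true)
numeric = numeric_gradient(instance, weights, y_true)
for a, n in zip(analytic, numeric):
  print(a, n)
```

If the derivation is right, the two columns should agree to several decimal places for every weight, including the bias.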

In [None]:
def train(instance, weights, y_true, l_rate):
  prediction = predict(instance, weights)
  # Signed error (y - y_hat), not an absolute value
  error = y_true - prediction
  # Shared factor of the update: eta * 2 * (y - y_hat) * sigma(o) * (1 - sigma(o))
  delta = l_rate * 2 * error * prediction * (1-prediction)
  # Bias update: its input is implicitly 1
  weights[0] = weights[0] + delta
  for i in range(len(weights)-1):
    weights[i+1] = weights[i+1] + delta * instance[i]
  return weights

Next, we can set our learning rate, and train for one pass over our instances.

In [None]:
l_rate = 0.01

for i in range(len(X)):
  weights = train(X[i], weights, y[i], l_rate)

print(weights)

for i in range(len(X)):
  prediction = predict(X[i], weights)
  print(X[i], y[i], '->', prediction)

[0.00013804714817943972, 0.030374561900359753, -0.0043131276125743965]
[0, 1] 0 -> 0.49895623140008755
[1, 0] 0 -> 0.5076275604874748
[2, 2] 0 -> 0.5130622560931664
[3, 4] 0 -> 0.5184938648995636
[4, 2] 1 -> 0.5282224798656285
[5, 2] 1 -> 0.5357848625086622
[4, 1] 1 -> 0.5292971938385711
[5, 0] 1 -> 0.5379297045186441


It looks like not much has happened: the predictions have barely moved away from 0.5. So let's train for many more "epochs" (passes over the training data) and see what happens:

In [None]:
l_rate = 0.01
epochs = 2000

for n_epoch in range(epochs):
  for i in range(len(X)):
    weights = train(X[i], weights, y[i], l_rate)

print(weights)

for i in range(len(X)):
  prediction = predict(X[i], weights)
  print(X[i], y[i], '->', prediction)

[-3.1220278679378843, 1.8204208827300656, -1.2193192699453246]
[0, 1] 0 -> 0.012851662514460058
[1, 0] 0 -> 0.21389468820743307
[2, 2] 0 -> 0.12788112224834713
[3, 4] 0 -> 0.0732339330104518
[4, 2] 1 -> 0.8482598018341195
[5, 2] 1 -> 0.9718440871193595
[4, 1] 1 -> 0.9498047670006002
[5, 0] 1 -> 0.9974777451273938
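
To get a feel for how training progresses, one option is to record the mean squared loss after every epoch and, at the end, threshold the predictions at 0.5 to obtain class labels. The sketch below is self-contained and restarts from all-zero weights; `mean_loss`, `history`, `labels`, and the 0.5 threshold are hypothetical additions, not part of the cells above.

```python
import math

X = [[0, 1], [1, 0], [2, 2], [3, 4], [4, 2], [5, 2], [4, 1], [5, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(x):
  return 1 / (1 + math.exp(-x))

def predict(instance, weights):
  output = weights[0]
  for i in range(len(instance)):
    output += weights[i+1] * instance[i]
  return sigmoid(output)

def train(instance, weights, y_true, l_rate):
  # Same update rule as derived in the notebook
  prediction = predict(instance, weights)
  delta = l_rate * 2 * (y_true - prediction) * prediction * (1-prediction)
  weights[0] = weights[0] + delta
  for i in range(len(instance)):
    weights[i+1] = weights[i+1] + delta * instance[i]
  return weights

def mean_loss(X, y, weights):
  # Average of (y - y_hat)^2 over all training instances
  total = 0
  for i in range(len(X)):
    total += (y[i] - predict(X[i], weights)) ** 2
  return total / len(X)

weights = [0, 0, 0]
l_rate = 0.01
epochs = 2000
history = []

for n_epoch in range(epochs):
  for i in range(len(X)):
    weights = train(X[i], weights, y[i], l_rate)
  history.append(mean_loss(X, y, weights))
  if n_epoch % 500 == 0:
    print('epoch', n_epoch, 'mean loss', history[-1])

# Turn probabilities into class labels using an assumed 0.5 threshold
labels = [1 if predict(instance, weights) >= 0.5 else 0 for instance in X]
print('final mean loss', history[-1])
print('labels ', labels)
print('targets', y)
```

Plotting `history` (or just printing a few entries) makes it easier to judge whether more epochs or a different learning rate would help.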


Things to try:

- Try playing around with the training instances (`X`) -- what happens if the values are very dispersed? Can you implement a normalization preprocessing step?
- Try changing the initialization of the weights. Can you find a configuration which requires fewer training epochs?
- Try changing the learning rate. Can you implement an "adaptive scheme" where you change the learning rate over the epochs?
- Try implementing an alternative, simpler weight update function: $w_{i,t+1} = w_{i,t} + \eta \cdot (y - \hat{y}) x_i$. Why does this work as well?
- Can you find a bunch of training instances for which the perceptron clearly fails?