### Let's say we have the following problem:

`Print the numbers 1 to 100, except that if the number is divisible by 3, print "fizz"; if the number is divisible by 5, print "buzz"; and if the number is divisible by 15, print "fizzbuzz".`

In [2]:
from ml.algebra import Vector, dot

def fizz_buzz_encode(x: int) -> Vector:
    if x % 15 == 0:
        return [0, 0, 0, 1] # When 'x' is divisible by 15 (fizzbuzz)
    elif x % 5 == 0:
        return [0, 0, 1, 0] # When 'x' is divisible by 5 (buzz)
    elif x % 3 == 0:
        return [0, 1, 0, 0] # When 'x' is divisible by 3 (fizz)
    else:
        return [1, 0, 0, 0] # When 'x' is divisible by neither 3 nor 5 (print the number itself)

#### We'll use this to generate our target vectors. The input vectors are less obvious. You don't want to just use a one-dimensional vector containing the input number, for a couple of reasons. A single input captures an "intensity", but the fact that 2 is twice as much as 1, and that 4 is twice as much again, doesn't feel relevant to this problem. Additionally, with just one input the hidden layer wouldn't be able to compute very interesting features, which means it probably wouldn't be able to solve the problem. It turns out that one thing that works reasonably well is to convert each number to its *binary* representation of 1s and 0s.

Floor division (`//`): divides the operands and rounds the quotient down to the nearest integer. For positive operands this simply drops the digits after the decimal point. When exactly one operand is negative, the result is still floored, i.e., rounded toward negative infinity rather than toward zero, so `-7 // 2` is `-4`.

Modulus (`%`): returns the remainder when the first operand is divided by the second.
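
As a quick illustration of how these two operators behave, including with a negative operand, the short sketch below checks the identity `a == b * (a // b) + a % b`, which Python guarantees for integers. The sample values are arbitrary.

```python
# Floor division rounds toward negative infinity;
# the remainder takes the sign of the divisor.
results = []
for a, b in [(7, 2), (-7, 2), (7, -2)]:
    q, r = a // b, a % b
    assert a == b * q + r          # quotient and remainder always recombine to a
    results.append((a, b, q, r))
    print(f"{a} // {b} = {q}, {a} % {b} = {r}")

# 7 // 2 = 3, 7 % 2 = 1
# -7 // 2 = -4, -7 % 2 = 1
# 7 // -2 = -4, 7 % -2 = -1
```

Since `binary_encode` only ever sees non-negative numbers, it relies on the first case alone.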

In [3]:
from typing import List

def binary_encode(x: int) -> Vector:
    binary: List[float] = []

    for i in range(10):
        binary.append(x % 2)
        x = x // 2

    return binary

print(binary_encode(999))

[1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
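
To make the bit order concrete, here is a small round-trip sketch. `binary_decode` is a hypothetical helper introduced only for this check, and `binary_encode` is repeated (with a plain `List[int]` return type) so the snippet runs on its own. The encoding is least-significant bit first, so bit `i` contributes `2 ** i`.

```python
from typing import List

def binary_encode(x: int) -> List[int]:
    """Same 10-bit, least-significant-bit-first encoding as above."""
    bits: List[int] = []
    for _ in range(10):
        bits.append(x % 2)
        x = x // 2
    return bits

def binary_decode(bits: List[int]) -> int:
    """Hypothetical inverse: bit i contributes bits[i] * 2**i."""
    return sum(bit * 2 ** i for i, bit in enumerate(bits))

assert binary_decode(binary_encode(999)) == 999
assert binary_decode(binary_encode(1023)) == 1023   # all ten bits set

# 1024 needs an eleventh bit, which the encoding silently drops
assert binary_decode(binary_encode(1024)) == 0
```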


#### As the goal is to construct the outputs for the numbers 1 to 100, it would be cheating to train on those numbers. Therefore, we'll train on the numbers 101 to 1,023 (which is the largest number we can represent with 10 binary digits).

In [4]:
xs = [binary_encode(n) for n in range(101, 1024)]
ys = [fizz_buzz_encode(n) for n in range(101, 1024)]
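
It can help to know how the four classes are distributed over this training range before training. The sketch below counts them; `fizz_buzz_label` is a hypothetical helper that mirrors the branch order of `fizz_buzz_encode` but returns a class name instead of a one-hot vector.

```python
from collections import Counter

def fizz_buzz_label(n: int) -> str:
    """Hypothetical helper: names the class that fizz_buzz_encode one-hot encodes."""
    if n % 15 == 0:
        return "fizzbuzz"
    elif n % 5 == 0:
        return "buzz"
    elif n % 3 == 0:
        return "fizz"
    else:
        return "number"

counts = Counter(fizz_buzz_label(n) for n in range(101, 1024))

print(sum(counts.values()))   # 923 training examples
for label in ["number", "fizz", "buzz", "fizzbuzz"]:
    print(label, counts[label])
# number 493
# fizz 246
# buzz 122
# fizzbuzz 62
```

The classes are unbalanced: always predicting the number itself would already be right for 493 of the 923 training examples (about 53%), which is a useful baseline to keep in mind when judging the network's accuracy.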

#### Our neural network will have 10 input neurons (since we're representing our inputs as 10-dimensional vectors) and 4 output neurons (since we're representing our targets as 4-dimensional vectors). We'll give it 25 hidden units, but we'll use a variable for that so it's easy to change:

In [22]:
import random
import tqdm

NUM_HIDDEN = 25

network = [
    # hidden layer: 10 inputs -> NUM_HIDDEN outputs
    [[random.random() for _ in range(10 + 1)] for _ in range(NUM_HIDDEN)], # random weights

    # output_layer: NUM_HIDDEN inputs -> 4 outputs
    [[random.random() for _ in range(NUM_HIDDEN + 1)] for _ in range(4)]
]
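
A quick way to confirm the layer sizes is to inspect the nested lists directly. The sketch below rebuilds a network with the same construction and counts its weights; the names `shapes` and `num_weights` are introduced here only for illustration.

```python
import random

NUM_HIDDEN = 25

network = [
    # hidden layer: 10 inputs (+1 bias weight) for each of NUM_HIDDEN neurons
    [[random.random() for _ in range(10 + 1)] for _ in range(NUM_HIDDEN)],
    # output layer: NUM_HIDDEN inputs (+1 bias weight) for each of 4 neurons
    [[random.random() for _ in range(NUM_HIDDEN + 1)] for _ in range(4)],
]

# (number of neurons, weights per neuron) for each layer
shapes = [(len(layer), len(layer[0])) for layer in network]
num_weights = sum(neurons * weights for neurons, weights in shapes)

print(shapes)        # [(25, 11), (4, 26)]
print(num_weights)   # 379
```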

#### First, let's redefine the functions we wrote earlier:

In [23]:
import math

def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))

In [24]:
def neuron_output(weights: Vector, inputs: Vector) -> float:
    # weights includes the bias term, inputs includes a 1
    return sigmoid(dot(weights, inputs))

In [25]:
from typing import List

def feed_forward(neural_network: List[List[Vector]], input_vector: Vector) -> List[Vector]:
    """
    Feeds the input vector through the neural network.
    Returns the outputs of all layers (not just the last one).
    """
    outputs: List[Vector] = []

    for layer in neural_network:
        input_with_bias = input_vector + [1]
        output = [neuron_output(neuron, input_with_bias) for neuron in layer]
        outputs.append(output)

        # Then the input to the next layer is the output of this one
        input_vector = output
    return outputs
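
To see `feed_forward` in action without any training, the sketch below runs it on a tiny hand-weighted network that computes XOR. The weights are picked by hand purely for illustration (one hidden neuron behaves like AND, the other like OR), and plain-list versions of `dot`, `sigmoid`, and `feed_forward` are repeated so the snippet is self-contained.

```python
import math
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))

def feed_forward(neural_network: List[List[Vector]], input_vector: Vector) -> List[Vector]:
    outputs: List[Vector] = []
    for layer in neural_network:
        input_with_bias = input_vector + [1]
        output = [sigmoid(dot(neuron, input_with_bias)) for neuron in layer]
        outputs.append(output)
        input_vector = output
    return outputs

xor_network = [
    # hidden layer: an "and" neuron and an "or" neuron (last weight is the bias)
    [[20.0, 20.0, -30.0],
     [20.0, 20.0, -10.0]],
    # output layer: "or but not and"
    [[-60.0, 60.0, -30.0]],
]

xor_outputs = {(a, b): feed_forward(xor_network, [a, b])[-1][0]
               for a in (0, 1) for b in (0, 1)}

for inputs, value in xor_outputs.items():
    print(inputs, round(value, 3))
```

The outputs are within 0.001 of 0 for `(0, 0)` and `(1, 1)`, and within 0.001 of 1 for the two mixed inputs.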

In [26]:
def sqerror_gradients(network: List[List[Vector]],
                      input_vector: Vector,
                      target_vector: Vector) -> List[List[Vector]]:
    """
    Given a neural network, an input vector, and a target vector,
    make a prediction and compute the gradient of the squared error
    loss with respect to the neuron weights (up to a constant factor
    of 2, which the learning rate absorbs).
    """
    # forward pass
    hidden_outputs, outputs = feed_forward(network, input_vector)

    # gradients with respect to output neuron pre-activation outputs
    output_deltas = [output * (1 - output) * (output - target)
                     for output, target in zip(outputs, target_vector)]

    # gradients with respect to output neuron weights
    output_grads = [[output_deltas[i] * hidden_output
                     for hidden_output in hidden_outputs + [1]]
                    for i, output_neuron in enumerate(network[-1])]

    # gradients with respect to hidden neuron pre-activation outputs
    hidden_deltas = [hidden_output * (1 - hidden_output) *
                         dot(output_deltas, [n[i] for n in network[-1]])
                     for i, hidden_output in enumerate(hidden_outputs)]

    # gradients with respect to hidden neuron weights
    hidden_grads = [[hidden_deltas[i] * input for input in input_vector + [1]]
                    for i, hidden_neuron in enumerate(network[0])]

    return [hidden_grads, output_grads]
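
Backpropagation code is easy to get subtly wrong, so a finite-difference check is a common sanity test. The sketch below compares the analytic gradients with central-difference estimates on a small random network. The network size, seed, step `h`, and the helper `half_squared_error` are all assumptions made for this check. Because `output_deltas` uses `(output - target)` without a factor of 2, the analytic gradients correspond to *half* the squared error, so that is the loss used here.

```python
import math
import random
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))

def feed_forward(network: List[List[Vector]], input_vector: Vector) -> List[Vector]:
    outputs: List[Vector] = []
    for layer in network:
        with_bias = input_vector + [1]
        output = [sigmoid(dot(neuron, with_bias)) for neuron in layer]
        outputs.append(output)
        input_vector = output
    return outputs

def sqerror_gradients(network, input_vector, target_vector):
    # Same computation as the function above, repeated so the check runs alone
    hidden_outputs, outputs = feed_forward(network, input_vector)
    output_deltas = [o * (1 - o) * (o - t) for o, t in zip(outputs, target_vector)]
    output_grads = [[output_deltas[i] * h_out for h_out in hidden_outputs + [1]]
                    for i, _ in enumerate(network[-1])]
    hidden_deltas = [h_out * (1 - h_out) * dot(output_deltas, [n[i] for n in network[-1]])
                     for i, h_out in enumerate(hidden_outputs)]
    hidden_grads = [[hidden_deltas[i] * x_i for x_i in input_vector + [1]]
                    for i, _ in enumerate(network[0])]
    return [hidden_grads, output_grads]

def half_squared_error(network, x: Vector, y: Vector) -> float:
    """Hypothetical helper: 0.5 * squared distance between prediction and target."""
    predicted = feed_forward(network, x)[-1]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predicted, y))

random.seed(0)
network = [
    [[random.uniform(-1, 1) for _ in range(3 + 1)] for _ in range(2)],  # 3 inputs -> 2 hidden
    [[random.uniform(-1, 1) for _ in range(2 + 1)] for _ in range(2)],  # 2 hidden -> 2 outputs
]
x, y = [1.0, 0.0, 1.0], [0.0, 1.0]

analytic = sqerror_gradients(network, x, y)
h = 1e-6
max_diff = 0.0
for l, layer in enumerate(network):
    for n, neuron in enumerate(layer):
        for w in range(len(neuron)):
            original = neuron[w]
            neuron[w] = original + h
            loss_plus = half_squared_error(network, x, y)
            neuron[w] = original - h
            loss_minus = half_squared_error(network, x, y)
            neuron[w] = original                      # restore the weight
            numeric = (loss_plus - loss_minus) / (2 * h)
            max_diff = max(max_diff, abs(numeric - analytic[l][n][w]))

print(f"largest gradient mismatch: {max_diff:.2e}")
```

A mismatch many orders of magnitude below the gradients themselves suggests the backward pass is consistent with the forward pass. The missing factor of 2 only rescales every gradient, so it has the same effect as halving the learning rate.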

In [32]:
from ml.algebra import squared_distance
from ml.gradient_descent import gradient_step

learning_rate = 1.0

with tqdm.trange(20000) as t:
    for epoch in t:
        epoch_loss = 0.0

        for x, y in zip(xs, ys):
            predicted = feed_forward(network, x)[-1]
            epoch_loss += squared_distance(predicted, y)
            gradients = sqerror_gradients(network, x, y)

            # Take a gradient step for each neuron in each layer
            network = [[gradient_step(neuron, grad, -learning_rate)
                        for neuron, grad in zip(layer, layer_grad)]
                       for layer, layer_grad in zip(network, gradients)]
        t.set_description(f"fizz buzz (loss: {epoch_loss:.2f})")

fizz buzz (loss: 179.85): 100%|██████████| 20000/20000 [1:46:18<00:00,  3.14it/s]     
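
The training loop relies on `gradient_step` from `ml.gradient_descent`, whose source is not shown here. A minimal sketch of what it presumably does, assuming it returns `v + step_size * gradient` element by element (this definition is an assumption, not the module's actual code):

```python
from typing import List

Vector = List[float]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Assumed behavior: move step_size along the gradient, starting from v."""
    assert len(v) == len(gradient), "vectors must be the same length"
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

weights = [0.5, -0.25, 1.0]
grad = [1.0, -1.0, 0.0]

# A negative step size moves against the gradient, i.e. downhill
stepped = gradient_step(weights, grad, -0.5)
print(stepped)   # [0.0, 0.25, 1.0]
```

Under that assumption, passing `-learning_rate` in the loop above is what turns the step into gradient *descent*: each weight moves in the direction that reduces the loss.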


#### Now we have one remaining issue. Our network will produce a 4-dimensional vector of numbers, but we want a single prediction. We'll get that by taking the `argmax`, which is the index of the largest value:

In [33]:
def argmax(xs: list) -> int:
    """ Returns the index of the largest value """
    return max(range(len(xs)), key=lambda i: xs[i])
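
A few quick checks show how `argmax` behaves, and how its result maps onto a label. The sample output vector and the number `7` below are made up for illustration.

```python
def argmax(xs: list) -> int:
    """Returns the index of the largest value"""
    return max(range(len(xs)), key=lambda i: xs[i])

assert argmax([0, -1]) == 0                # items[0] is largest
assert argmax([-1, 0]) == 1                # items[1] is largest
assert argmax([-1, 10, 5, 20, -3]) == 3    # items[3] is largest
assert argmax([0.5, 0.5]) == 0             # ties go to the first index

# Turning a (made-up) network output for n = 7 into a printable label
n = 7
sample_output = [0.81, 0.12, 0.05, 0.02]
labels = [str(n), "fizz", "buzz", "fizzbuzz"]
prediction = labels[argmax(sample_output)]
print(prediction)   # 7
```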

In [34]:
num_correct = 0

for n in range(1, 101):
    x = binary_encode(n)
    predicted = argmax(feed_forward(network, x)[-1])
    actual = argmax(fizz_buzz_encode(n))
    labels = [str(n), "fizz", "buzz", "fizzbuzz"]
    print(n, labels[predicted], labels[actual])

    if predicted == actual:
        num_correct += 1
    print(num_correct, "/", 100)

1 1 1
1 / 100
2 2 2
2 / 100
3 fizz fizz
3 / 100
4 4 4
4 / 100
5 buzz buzz
5 / 100
6 6 fizz
5 / 100
7 7 7
6 / 100
8 8 8
7 / 100
9 fizz fizz
8 / 100
10 buzz buzz
9 / 100
11 fizz 11
9 / 100
12 12 fizz
9 / 100
13 13 13
10 / 100
14 14 14
11 / 100
15 fizzbuzz fizzbuzz
12 / 100
16 16 16
13 / 100
17 17 17
14 / 100
18 fizz fizz
15 / 100
19 19 19
16 / 100
20 buzz buzz
17 / 100
21 21 fizz
17 / 100
22 22 22
18 / 100
23 23 23
19 / 100
24 fizz fizz
20 / 100
25 buzz buzz
21 / 100
26 fizz 26
21 / 100
27 fizz fizz
22 / 100
28 28 28
23 / 100
29 29 29
24 / 100
30 fizzbuzz fizzbuzz
25 / 100
31 31 31
26 / 100
32 32 32
27 / 100
33 fizz fizz
28 / 100
34 34 34
29 / 100
35 35 buzz
29 / 100
36 fizz fizz
30 / 100
37 37 37
31 / 100
38 38 38
32 / 100
39 fizz fizz
33 / 100
40 buzz buzz
34 / 100
41 fizz 41
34 / 100
42 fizz fizz
35 / 100
43 43 43
36 / 100
44 44 44
37 / 100
45 fizz fizzbuzz
37 / 100
46 46 46
38 / 100
47 47 47
39 / 100
48 fizz fizz
40 / 100
49 49 49
41 / 100
50 50 buzz
41 / 100
51 fizz fizz
42 / 100
52