# Multi-Layer Perceptron

The next logical goal in discrete optimization is to extend it to multiple layers of perceptrons, as in conventional deep learning. We reuse the concepts from the perceptron chapter but turn them on their head: instead of attributing blame to the weights, we attribute blame to the input.

We do this by constructing a binarized "ideal input" vector $\tilde{x}$, working backwards from the weights themselves and the desired output state of the network given by the ground truth $z$. The construction relies on a peculiar reinterpretation of the perceptron equation:

Suppose that the ground truth $z$ is positive in the $i^{\text{th}}$ entry and consider the weight matrix column $W_i$. To maximize the dot product between $W_i$ and $\tilde{x}$ we would let $\tilde{x} = W_i$. However, $\hat{z}_i$ is positive as long as $W_i\cdot \tilde{x} > b_i$, and for $\pm 1$ vectors $W_i\cdot \tilde{x} = W_i\cdot W_i - 2d$, where $d$ is the Hamming distance between $W_i$ and $\tilde{x}$. We can therefore choose any $\tilde{x}$ whose Hamming distance from $W_i$ is less than $(W_i\cdot W_i - b_i)/2$.

We have to craft ideal inputs *for all* weight columns, so we end up with a family of ideal strings $\tilde{x}_i$, which are $W_i$ if $z_i = +1$ and $-W_i$ if $z_i = -1$, and their bounding Hamming distances $r_i$, which are $(W_i\cdot W_i - b_i)/2$ if $z_i=+1$ and $(W_i\cdot W_i + b_i)/2$ if $z_i=-1$. This family defines a set of balls $B(\tilde{x}_i,r_i)$ in Hamming space, which may or may not intersect.
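
As a sanity check on the geometry, here is a minimal NumPy sketch of the identity $W_i\cdot\tilde{x} = N - 2d$ for $\pm 1$ vectors of length $N$ at Hamming distance $d$. The helper names `hamming` and `max_flips` are hypothetical and are not used elsewhere in this chapter, and the sketch assumes $|b| < N$.

```python
import numpy as np

def hamming(a, b):
    """Number of positions where two +/-1 vectors differ."""
    return int(np.sum(a != b))

def max_flips(w, b):
    """Largest d such that flipping d entries of w keeps w . x > b.

    Uses w . x = n - 2d, so the condition is d < (n - b) / 2.
    Assumes |b| < n (an assumption of this sketch).
    """
    n = len(w)
    return int(np.ceil((n - b) / 2)) - 1

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=8)   # one weight column
bias = 2

# Flipping d entries of w always lowers the dot product by exactly 2d
for d in range(len(w) + 1):
    x = w.copy()
    x[:d] *= -1
    assert hamming(w, x) == d
    assert w @ x == len(w) - 2 * d

d_max = max_flips(w, bias)
print("flips allowed while staying positive:", d_max)
```

With $N = 8$ and $b = 2$ the output stays positive for up to two flips, since the third flip brings the dot product down to exactly $b$.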

For the sake of argument, we then assume that the largest intersection (i.e. the intersection of the most balls) contains the "ideal input", which we can use as ground truth for the layer feeding into the current layer.
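
To make "the intersection of the most balls" concrete, the sketch below scores a candidate input by the number of Hamming balls that contain it, and finds the best score by brute force. The function name `balls_containing`, the toy centres and the radii are all made up for illustration. The exhaustive search is exponential in the input length, so it is only usable on tiny examples like this one.

```python
import itertools
import numpy as np

def balls_containing(x, centers, radii):
    """Count how many Hamming balls B(centers[i], radii[i]) contain x."""
    dists = np.sum(centers != x, axis=1)   # Hamming distance to every centre
    return int(np.sum(dists <= radii))

# Three toy ideal strings (rows) with radius 1 each
centers = np.array([[+1, +1, +1, +1],
                    [+1, +1, -1, -1],
                    [-1, -1, -1, -1]])
radii = np.array([1, 1, 1])

x_a = np.array([+1, +1, +1, -1])
x_b = np.array([-1, -1, -1, +1])
count_a = balls_containing(x_a, centers, radii)
count_b = balls_containing(x_b, centers, radii)

# Exhaustive search over all 2^4 candidates
best = max(balls_containing(np.array(c), centers, radii)
           for c in itertools.product([-1, 1], repeat=4))
print(count_a, count_b, best)
```

Here no point lies in all three balls, because the first and third centres are at distance 4 from each other, so the best any candidate can do is two.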

**Note** Unfortunately, the arbitrary intersection of Hamming balls appears to have been studied only in a combinatorial sense. The closest research in this field concerns "binary means" and the Closest String/Substring problems.

Whilst the pure mathematical formulation above is what should ideally be researched to find the optimal solution, I do have a quick and dirty hack that minimizes the collective distance to the ideal strings: a simple majority vote.

### Binary Majority Vote

Given a set of $n$ ideal input binary strings:

```
x_tilde_1 = +1 -1 +1 -1 +1 -1 +1 ... 
x_tilde_2 = +1 -1 -1 +1 -1 +1 -1 ...
    .
    .
    .
x_tilde_n = -1 +1 +1 -1 +1 +1 +1 ...
```

We simply count the number of $+1$'s that appear in each entry and threshold at $n/2$:

```
x_tilde_count = 9 12 72 8 48 9 ...
x_tilde_thres = -1 -1 +1 -1 +1 -1 ...
```
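
A minimal sketch of this counting step, together with a brute-force check that the vote minimizes the total Hamming distance to the ideal strings on a small example. The names `bitwise_majority` and `total_hamming` are hypothetical, and sending ties to $+1$ is an assumption of this sketch.

```python
import itertools
import numpy as np

def bitwise_majority(strings):
    """Per-entry majority over the rows of a +/-1 matrix (ties go to +1)."""
    n = strings.shape[0]
    plus_count = np.sum(strings > 0, axis=0)      # number of +1's in each entry
    return np.where(2 * plus_count >= n, 1, -1)   # threshold at n/2

def total_hamming(x, strings):
    """Sum of Hamming distances from x to every row of strings."""
    return int(np.sum(strings != x))

strings = np.array([[+1, -1, +1, -1],
                    [+1, -1, -1, +1],
                    [-1, +1, +1, -1]])

vote = bitwise_majority(strings)
vote_cost = total_hamming(vote, strings)

# Compare against every possible +/-1 string of the same length
best_cost = min(total_hamming(np.array(c), strings)
                for c in itertools.product([-1, 1], repeat=strings.shape[1]))
print(vote, vote_cost, best_cost)
```

The vote is optimal because the total distance decomposes entry by entry, and in each entry the majority value disagrees with the fewest strings.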

`x_tilde_thres` is taken to be the ground truth $z$ for the preceding layer, and we repeat the process up the chain to get deep learning. We repeat the perceptron boilerplate below and then implement the majority vote scheme.

In [1]:
# In this block, we repeat much of the boilerplate from the previous chapter

import tensorflow as tf
import numpy as np

# Load MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

THRESHOLD = 128

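# Full data conversion: binarize pixels and one-hot labels to +1/-1
x_train = ((x_train > THRESHOLD)*2 - 1).reshape(x_train.shape[0],28*28)
x_test = ((x_test > THRESHOLD)*2 - 1).reshape(x_test.shape[0],28*28)
# Convert the one-hot tensors to NumPy so per-sample indexing stays cheap
y_train = tf.one_hot(y_train,depth=10).numpy()*2 - 1
y_test = tf.one_hot(y_test,depth=10).numpy()*2 - 1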


# Set seed for reproducibility
SEED = 1337
np.random.seed(SEED)

# Define the requisite thresholds
W_THRES = 2
b_THRES = 2

# Counters and counter logic
counter = 0
counter_RESET = 10

# Feedforward one sample
def forward(W,b,x):
    out = np.sign(x@W-b)
    out[out == 0] = 1
    return out

# Discover which columns/bias terms are to blame
def blame_columns(z,zhat,b_blame,bias,param):

    # Binarize inputs
    z = z > 0
    zhat = zhat > 0

    # Compute where there are false positives and false negatives
    false_pos = np.logical_and(zhat,np.logical_not(z))
    false_neg = np.logical_and(np.logical_not(zhat),z)

    # Increment bias blame for false positives (output too high, bias too small)
    for idx, i in enumerate(false_pos):
        if i:
            b_blame[idx] += 1

    # Decrement bias blame for false negatives (output too low, bias too big)
    for idx, i in enumerate(false_neg):
        if i:
            b_blame[idx] -= 1

    # If bias threshold is crossed, increment/decrement bias and reset blame
    for idx, i in enumerate(np.abs(b_blame)>param):
        if np.sign(b_blame[idx]) > 0 and i:
            bias[idx] += 1
            b_blame[idx] = 0
        elif np.sign(b_blame[idx]) < 0 and i:
            bias[idx] -= 1
            b_blame[idx] = 0

    return false_pos, false_neg, b_blame, bias

def blame_weights(x,false_pos,false_neg,W_blame,Weight,param):

    # Binarize inputs
    x = x > 0

    # For a false positive, blame every weight whose sign agrees with the input
    for idx, i in enumerate(false_pos):
        if i:
            for jdx, j in enumerate(np.logical_not(np.logical_xor(Weight[:,idx]>0,x))):
                if j:
                    W_blame[jdx,idx] += 1

    # For a false negative, blame every weight whose sign disagrees with the input
    for idx, i in enumerate(false_neg):
        if i:
            for jdx, j in enumerate(np.logical_xor(Weight[:,idx]>0,x)):
                if j:
                    W_blame[jdx,idx] += 1

    # Find where weights exceed the blame threshold
    rows,cols = np.where(W_blame >= param)

    # Reset blame counter and flip corresponding weight
    for i,j in zip(rows,cols):
        W_blame[i,j] = 0
        Weight[i,j] = Weight[i,j] * -1
    
    return W_blame, Weight

# Not much changes here
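
The nested loops in `blame_weights` can also be written with array operations. The sketch below is one possible vectorized restatement of the per-sample blame increment, under the assumption that `false_pos` and `false_neg` are boolean masks over the output columns. The name `weight_blame_increment` is hypothetical and the training loop in this chapter does not use it.

```python
import numpy as np

def weight_blame_increment(W, x, false_pos, false_neg):
    """Return the 0/1 matrix that a single sample would add to W_blame."""
    # agree[j, i] is True when weight W[j, i] has the same sign as input x[j]
    agree = (W > 0) == (x > 0)[:, None]
    inc = np.zeros(W.shape, dtype=int)
    inc[:, false_pos] = agree[:, false_pos]     # agreeing weights pushed the output up
    inc[:, false_neg] = ~agree[:, false_neg]    # disagreeing weights pushed it down
    return inc

# Toy layer with 3 inputs and 2 outputs
W = np.array([[+1, -1],
              [-1, +1],
              [+1, +1]])
x = np.array([+1, -1, +1])

# x @ W = [3, -1]; suppose output 0 should be negative and output 1 positive
false_pos = np.array([True, False])
false_neg = np.array([False, True])

inc = weight_blame_increment(W, x, false_pos, false_neg)
print(inc)
```

In the first column every weight agrees with the input, so all three are blamed for the false positive. In the second column only the first two weights disagree with the input, so only they are blamed for the false negative.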


We now implement the majority-vote ideal input construction routine to build out the deep learning system.

In [2]:
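def majority_vote(z,W,fp,fn):
    # z: current activations of the layer feeding into W (+1/-1)
    # fp, fn: boolean masks of false positive / false negative outputs

    out = np.zeros(W.shape[0])

    # No errors downstream: the current activations are already ideal
    if not fp.any() and not fn.any():
        return z

    # A false positive output votes for -W_i (push its dot product down)
    for idx, i in enumerate(fp):
        if i:
            out -= W[:,idx]
    
    # A false negative output votes for +W_i (push its dot product up)
    for idx, i in enumerate(fn):
        if i:
            out += W[:,idx]
            
    # Majority vote per entry; a tie yields 0, which blame_columns treats as -1
    return np.sign(out)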
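
A toy run may make the glue step concrete. The sketch below restates the vote with array operations under the hypothetical name `ideal_input` and applies it to a made-up $4\times 3$ weight matrix. Three outputs are in error, so no entry can tie.

```python
import numpy as np

def ideal_input(z, W, fp, fn):
    """Add the columns of false negatives, subtract those of false positives, take the sign."""
    if not fp.any() and not fn.any():
        return z
    votes = W[:, fn].sum(axis=1) - W[:, fp].sum(axis=1)
    return np.sign(votes)

# 4 hidden units feeding 3 outputs
W = np.array([[+1, -1, +1],
              [-1, -1, +1],
              [+1, +1, -1],
              [+1, -1, -1]])
z = np.array([+1, +1, -1, +1])           # current hidden activations

fp = np.array([True, False, False])      # output 0 fired but should not have
fn = np.array([False, True, True])       # outputs 1 and 2 should have fired

target = ideal_input(z, W, fp, fn)
print(target)

# With no errors, the hidden activations are returned unchanged
no_err = np.zeros(3, dtype=bool)
unchanged = ideal_input(z, W, no_err, no_err)
```

The resulting target differs from `z` in the first and last entries, so those are the hidden units that `blame_columns` would flag when training the first layer.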

We can now implement the binarized multi-layer perceptron with its caches and begin training on MNIST using this procedure. 

In [3]:
from tqdm import tqdm

# Initialize weights and biases
W1 = (np.random.uniform(0,1,(28*28,256)) > 0.5) * 2 - 1
W2 = (np.random.uniform(0,1,(256,10)) > 0.5) * 2 - 1
b1 = np.zeros(256)
b2 = np.zeros(10)

# Initialize blame counters
W1_blame = np.zeros((28*28,256))
W2_blame = np.zeros((256,10))
b1_blame = np.zeros(256)
b2_blame = np.zeros(10)

# Counter Parameters
counter = 0
acc = 0
acc_count = 0
REPORT = 10000
epochs = 3

for e in range(epochs):
    print("EPOCH "+str(e+1))
    indices = np.arange(x_train.shape[0])
    np.random.shuffle(indices)
    for i in tqdm(indices):

        x = x_train[i]
        y = y_train[i]

        # Forward pass
        z1 = forward(W1,b1,x)
        z2 = forward(W2,b2,z1)

        # Backward pass, Layer 2
        fp, fn, b2_blame, b2 = blame_columns(y,z2,b2_blame,b2,4)
        W2_blame, W2 = blame_weights(z1,fp,fn,W2_blame,W2,16)

        # Majority vote inter-layer glue
        y2 = majority_vote(z1,W2,fp,fn)

        # Train Layer 1
        fp, fn, b1_blame, b1 = blame_columns(y2,z1,b1_blame,b1,16)
        W1_blame, W1 = blame_weights(x,fp,fn,W1_blame,W1,256)

        # Forgiveness counter
        counter += 1 - np.sum(np.logical_and(z2 > 0, y > 0))
        if counter >= counter_RESET:
            W1_blame -= 1
            W1_blame[W1_blame < 0] = 0
            b1_blame = np.sign(b1_blame) * (np.abs(b1_blame) - 1)
            W2_blame -= 1
            W2_blame[W2_blame < 0 ] = 0
            b2_blame = np.sign(b2_blame) * (np.abs(b2_blame) - 1)
            counter = 0

        # Accuracy metric counter
        acc += np.sum(np.logical_and(z2 > 0, y > 0))
        acc_count += 1
        if acc_count >= REPORT:
            print("Current Accuracy: " + str(acc / REPORT))
            acc_count = 0
            acc = 0


Here, I'd like to emphasize that this improves performance significantly over a single-layer perceptron, which is a good indicator that the method works. Although this approach makes the extracted features less intuitive to interpret than in the single-layer case, we can still verify on the test set that it generalizes as expected.

In [None]:
# Reset the accuracy counters before evaluating on the test set
acc = 0
acc_count = 0

# Pair each test image with its label
for x,y in zip(x_test,y_test):

    # Forward pass
    z1 = forward(W1,b1,x)
    z2 = forward(W2,b2,z1)

    # Accuracy metric counter
    acc += np.sum(np.logical_and(z2 > 0, y > 0))
    acc_count += 1
    if acc_count >= REPORT:
        print("Current Accuracy: " + str(acc / REPORT))
        acc_count = 0
        acc = 0
