# Perceptron

Here we explain the fundamental mechanism of discrete backpropagation on an introductory machine learning model: the binarized perceptron, or nonlinear regression. Before we begin, we need to introduce some fundamental concepts:

**Definition 1** - A matrix is said to be binarized if it only takes entries in $\{-1,+1\}$.

**Definition 2** - The Hamming distance $d$ between two binarized matrices $A$ and $B$ of equal size is given by: $$d(A,B)=\dfrac12\sum_{i} \left|A_i - B_i\right|$$

or, in plain English, "*the number of entries that differ between the two matrices*".
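As a minimal sketch of the plain-English description, the function below counts the differing entries of two binarized matrices stored as nested Python lists. The name `hamming_distance` and the list-of-lists representation are assumptions made for illustration only.

```python
def hamming_distance(A, B):
    """Count the entries that differ between two equally sized {-1, +1} matrices."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("A and B must have the same shape")
    # For entries in {-1, +1}, |a - b| is 0 when they agree and 2 when they
    # differ, so half of the summed absolute differences is the count.
    total = sum(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    return total // 2


# Hypothetical example matrices (illustrative values only).
A = [[+1, -1, +1],
     [-1, -1, +1]]
B = [[+1, +1, +1],
     [-1, +1, -1]]

d = hamming_distance(A, B)
print(d)
```

The two matrices disagree in three positions, so the distance is 3; a matrix compared with itself gives 0.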

**Definition 3** - We typically denote a sample-wise ground truth by $z$ and its input by $x$, and a batched ground truth by $Z$ and its batched input by $X$.

**Definition 4** - Given a binarized ground truth $Z$ and an input $X$, a binarized perceptron aims to find a binarized matrix $W$ and an integer vector $b$ to compute $\hat{Z}$ via:

$$\hat{Z} = \text{sign}(XW + b)$$

where $\text{sign}$ is given by:

$$\text{sign}(x)=\begin{cases}-1\qquad \text{if }x<0\\ +1\qquad \text{if }x\geq 0 \end{cases}$$
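The forward computation $\hat{Z} = \text{sign}(XW + b)$ might be sketched as follows, using plain integer arithmetic. The helper names `sign` and `forward`, the example values, and the choice of a binarized input $X$ are assumptions for illustration; $b$ is broadcast across the rows of $XW$.

```python
def sign(v):
    """Binarizing sign: -1 for negative values, +1 otherwise (including 0)."""
    return -1 if v < 0 else 1


def forward(X, W, b):
    """Compute Z_hat = sign(X W + b) with ordinary integer matrix multiplication."""
    Z_hat = []
    for row in X:
        out = []
        for j in range(len(b)):
            # Integer dot product of one input row with column j of W, plus bias.
            acc = sum(row[k] * W[k][j] for k in range(len(row))) + b[j]
            out.append(sign(acc))
        Z_hat.append(out)
    return Z_hat


# Hypothetical batch of 2 samples with 3 features and 2 outputs.
X = [[+1, -1, +1],
     [+1, +1, -1]]
W = [[+1, -1],
     [+1, +1],
     [-1, +1]]
b = [0, 1]

Z_hat = forward(X, W, b)
print(Z_hat)
```

In this example the second output column has a pre-activation of exactly 0 for both samples, which the definition of $\text{sign}$ maps to $+1$.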

**Note 1** The matrix multiplication is ordinary matrix multiplication, as one would perform on integer matrices. If we map $-1\to 0$ and $+1\to 1$ in typical binary fashion, the multiplication inside each dot product is replaced by a bitwise XNOR, and the addition is replaced by a population count (*Hamming weight*) of the bit-string left after the XNOR. For vectors of length $n$, the integer dot product is then recovered as $2\cdot\text{popcount} - n$.
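To illustrate the XNOR and population-count formulation, here is a small sketch that packs $\{-1,+1\}$ vectors into Python integers and compares the bitwise result against the ordinary integer dot product. The helpers `pack`, `popcount` and `xnor_dot` are hypothetical names, and the bit order (entry $i$ goes to bit $i$) is an arbitrary choice.

```python
def pack(v):
    """Pack a {-1, +1} vector into an int, mapping -1 -> 0 and +1 -> 1."""
    bits = 0
    for i, s in enumerate(v):
        if s == 1:
            bits |= 1 << i
    return bits


def popcount(x):
    """Hamming weight: number of set bits in a non-negative integer."""
    return bin(x).count("1")


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1                 # keep only the n valid bit positions
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the entries agree
    # Each agreement contributes +1 and each disagreement -1 to the dot product.
    return 2 * popcount(agree) - n


a = [+1, -1, +1, +1, -1]
b = [+1, +1, -1, +1, -1]

reference = sum(x * y for x, y in zip(a, b))   # ordinary integer dot product
bitwise = xnor_dot(pack(a), pack(b), len(a))
print(reference, bitwise)
```

Both routes give the same value because every agreeing pair multiplies to $+1$ and every disagreeing pair to $-1$.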

**Note 2** In practice, *XOR* gates are simpler to implement transistor-for-transistor, and XOR is what is actually used in the experimental binary warp-level matrix multiply-and-accumulate instructions available on post-Turing NVIDIA GPUs, and similarly on FPGAs. However, the underlying mathematics of the perceptron changes little, and we can still train successfully in this regime.
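One way to see why the mathematics changes little: the population count of an XOR is the number of disagreeing entries, so the integer dot product of two length-$n$ vectors is $n - 2\cdot\text{popcount}(a \oplus b)$. The sketch below checks this identity in plain Python; it only illustrates the arithmetic and is not a model of any GPU or FPGA instruction. The helper names are hypothetical.

```python
def pack(v):
    """Pack a {-1, +1} vector into an int, mapping -1 -> 0 and +1 -> 1."""
    return sum(1 << i for i, s in enumerate(v) if s == 1)


def xor_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n via XOR."""
    # XOR sets a bit wherever the entries disagree; assuming both inputs
    # use only the low n bits, no masking is needed.
    disagree = bin(a_bits ^ b_bits).count("1")
    return n - 2 * disagree


a = [+1, -1, +1, +1, -1]
b = [+1, +1, -1, +1, -1]

reference = sum(x * y for x, y in zip(a, b))   # ordinary integer dot product
via_xor = xor_dot(pack(a), pack(b), len(a))
print(reference, via_xor)
```

The XOR population count here is also the Hamming distance between the two vectors, which ties this formulation back to Definition 2.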