# Recurrent Neural Networks

Discrete Optimization can similarly be extended to Recurrent Neural Networks - since we have the steps for a multi-layer discrete optimization - we can apply these to the exact formula of a binarized Elman network and begin learning on a task that is more typical of a language model.

We begin with boilerplate methods:

In [15]:
import numpy as np

# Set seed for reproducability
SEED = 1337
np.random.seed(SEED)

# Feedforward Elman Network
def elman_forward(Wh,Uh,Wy,bh,by,x,h):
    out = np.sign(x@Wh+h@Uh-bh)
    out[out == 0] = 1
    out2 = np.sign(out@Wy-by)
    out[out == 0] = 1
    return out, out2

def compute_fpfn(z,zhat):
       # Binarize inputs
    z = z > 0
    zhat = zhat > 0

    # Compute where there are false positives and false negatives
    false_pos = np.logical_and(zhat,np.logical_not(z))
    false_neg = np.logical_and(np.logical_not(zhat),z)

    false_pos = false_pos * 2 - 1
    false_neg = false_neg * 2 - 1

    return false_pos, false_neg

# Discover which columns/bias terms are to blame
def blame_columns(z,zhat,b_blame,bias,param,depth=0):

    # Binarize inputs
    z = z > 0
    zhat = zhat > 0

    # Compute where there are false positives and false negatives
    false_pos = np.logical_and(zhat,np.logical_not(z))
    false_neg = np.logical_and(np.logical_not(zhat),z)

    # Increment bias blame for false positives (too big!)
    for idx, i in enumerate(false_pos):
        if i:
            b_blame[idx] += 2**-depth

    # Decrement bias blame for false negatives (too small!)
    for idx, i in enumerate(false_neg):
        if i:
            b_blame[idx] -= 2**-depth

    # If bias threshold is crossed, reset blame and increment/decrement bias
    for idx, i in enumerate(np.abs(b_blame)>param):
        if np.sign(b_blame[idx]) > 0 and i:
            bias[idx] += 1
        elif np.sign(b_blame[idx]) < 0 and i:
            bias[idx] -= 1

    return false_pos, false_neg, b_blame, bias

def blame_weights(x,false_pos,false_neg,W_blame,Weight,param,depth=0):

    # Binarize inputs
    x = x > 0

    # If a weight is found to be blame for a false positive attribute blame
    for idx, i in enumerate(false_pos):
        if i:
            for jdx, j in enumerate(np.logical_not(np.logical_xor(Weight[:,idx]>0,x))):
                if j:
                    W_blame[jdx,idx] += 2**-depth

    # If a weight is found to be blame for a false negative attribute blame
    for idx, i in enumerate(false_neg):
        if i:
            for jdx, j in enumerate(np.logical_xor(Weight[:,idx]>0,x)):
                if j:
                    W_blame[jdx,idx] += 2**-depth

    # Find where weights exceed the blame threshold
    rows,cols = np.where(W_blame >= param)

    # Reset blame counter and flip corresponding weight
    for i,j in zip(rows,cols):
        W_blame[i,j] = 0
        Weight[i,j] = Weight[i,j] * -1
    
    return W_blame, Weight

# Majority vote inter-layer glue
def majority_vote(z,W,fp,fn):

    out = np.zeros(W.shape[0])

    if not fp.any() and not fn.any():
        return z

    for idx, i in enumerate(fp):
        if i:
            out -= W[:,idx]
    
    for idx, i in enumerate(fn):
        if i:
            out += W[:,idx]
            
    return np.sign(out)

### Data

We'd like to load a more temporally sensitive dataset to prove that we are effectively training a recurrent neural net, of sorts. To do this, we'll be using the `reuters` dataset, which tokenizes 30945 words into 8982 articles into 45 topics - we will load and one hot encode this for training.

### Training

Training of an Elman Network is a little more complex, since we must do "Discrete Optimization Through Time" - the concept itself isn't too difficult to realize since all we have to do is introduce a depth limit like typical RNNs. We'll consider the weight matrices and biases to be static (even though they really aren't) and hope that our tuning signal isn't too noisy for some sufficiently small depth of optimization.

In order to explain Elman Network training, we should probably define an Elman Network. An Elman Network is defined by the following equations:

$$
\begin{align*}
h_t &= \sigma_h(W_hx_t+U_h h_{t-1} +b_h)\\
y_t &= \sigma_y(W_y h_t+b_y)
\end{align*}
$$

Note: $W_h\neq W_y$. So our training will work as usual for the $y_t$ step, utilizing the above boilerplate - blame attribution for the recursive step is little more tricky. In that case the various $x_t$ must be cached, and so to the labels (ie. ground truth) of the $h_t$, to blame the two weight matrices, we just pass these independently through the `blame_weights` function and keep track of appropriate labels.

To prevent exponential blow up (and admittedly in violation of the philosophy of discrete optimization), we'll also require allocated blame at $t$ time-steps back in history to be scaled down by $2^{-t}$ - we do this by adding an optional "depth" keyword argument to `blame_columns` and `blame_weights`

In [16]:
import tensorflow as tf
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
from tqdm import tqdm

# Load MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data()

def binarize(input: np.array):
    return (input < 0.5) * 2 - 1

INPUT_SIZE = 30945 # Number of one-hot words in reuters dataset
HIDDEN_SIZE = 2048 # Number of hidden units that we want
OUTPUT_SIZE = 45 # How many output nodes should there be

# Initialize Elman Network weights
hidden_weight = binarize(np.random.uniform(0,1,(INPUT_SIZE,HIDDEN_SIZE))) # Denoted W_h in above formulae
interi_weight = binarize(np.random.uniform(0,1,(HIDDEN_SIZE,HIDDEN_SIZE))) # Denoted U_h ...
output_weight = binarize(np.random.uniform(0,1,(HIDDEN_SIZE,OUTPUT_SIZE))) # Denoted W_y ...

# Initialize Elman Network biases
hidden_bias = np.random.randint(0,INPUT_SIZE,HIDDEN_SIZE)
output_bias = np.random.randint(0,HIDDEN_SIZE,OUTPUT_SIZE)

# Initialize cache and cache parameters
input_cache = []
hidden_cache = [np.zeros(HIDDEN_SIZE) - 1]
label_cache = []
CACHE_DEPTH = 2

# Blame counters
hidden_weight_blame = np.zeros((INPUT_SIZE,HIDDEN_SIZE))
interi_weight_blame = np.zeros((HIDDEN_SIZE,HIDDEN_SIZE))
output_weight_blame = np.zeros((HIDDEN_SIZE,OUTPUT_SIZE))
hidden_bias_blame = np.zeros(HIDDEN_SIZE)
output_bias_blame = np.zeros(OUTPUT_SIZE)

# Program counters
epochs = 3
counter = 0
COUNT_RESET = 20
acc = 0
acc_count = 0

# Iterate over epochs
for e in range(epochs):

    print("EPOCH "+str(e+1))
    indices = np.arange(x_train.shape[0])
    np.random.shuffle(indices)

    input_cache = []
    hidden_cache = [np.zeros(HIDDEN_SIZE) - 1]
    label_cache = []
    CACHE_DEPTH = 2

    # Iterate over samples
    for i in tqdm(indices):
        
        # Load training samples
        x = (tf.one_hot(x_train[i],INPUT_SIZE)) * 2 - 1
        y = (tf.one_hot(y_train[i],OUTPUT_SIZE)) * 2 - 1

        
        # Iterate over words in samples
        for x1 in x:

            # Cache input activations
            input_cache.insert(0,x1)
            input_cache = input_cache[:CACHE_DEPTH]
            
            # Forward step
            prev_h = hidden_cache[0]
            curr_h, yhat = elman_forward(hidden_weight,interi_weight,output_weight,
                                         hidden_bias,output_bias,x1,prev_h)

            # Cache new hidden activation
            hidden_cache.insert(0,curr_h)
            hidden_cache = hidden_cache[:CACHE_DEPTH]

            # Begin backward pass of second Elman equation
            fp, fn, output_bias_blame, output_bias = blame_columns(y,yhat,output_bias_blame,output_bias,16)
            output_weight_blame, output_weight = blame_weights(curr_h, fp, fn, output_weight_blame, output_weight, 16)
            h = majority_vote(curr_h ,output_weight,fp,fn)

            # Cache hidden ground truth
            label_cache.insert(0,h)
            label_cache = label_cache[:CACHE_DEPTH]

            # Recursive backward pass of second Elman equation
            for i in range(CACHE_DEPTH-1):
                fp, fn, hidden_bias_blame, hidden_bias = blame_columns(label_cache[i],hidden_cache[i],hidden_bias_blame,hidden_bias,16,depth=i)
                hidden_weight_blame, hidden_weight = blame_weights(input_cache[i],fp,fn,hidden_weight_blame,hidden_weight,16,depth=i)
                interi_weight_blame, interi_weight = blame_weights(hidden_cache[i+1],fp,fn,interi_weight_blame,interi_weight,16,depth=i)

            # Forgiveness counter
            counter += 1 - np.sum(np.logical_and(yhat > 0, y > 0))
            if counter >= COUNT_RESET:

                hidden_weight_blame -= 1
                hidden_weight_blame[hidden_weight_blame < 0] = 0
                interi_weight_blame -= 1
                interi_weight_blame[interi_weight_blame < 0] = 0
                output_weight_blame -= 1
                output_weight_blame[output_weight_blame < 0] = 0
                
                hidden_bias_blame = np.sign(hidden_bias_blame) * (np.abs(hidden_bias_blame) - 1)
                output_bias_blame = np.sign(output_bias_blame) * (np.abs(output_bias_blame) - 1)
                counter = 0

            # Accuracy metric counter
            acc += np.sum(np.logical_and(yhat > 0, y > 0))
            acc_count += 1
        
        print("Sample Accuracy: ",acc/acc_count)
        acc, acc_count = 0,0

EPOCH 1


  0%|          | 1/8982 [02:38<395:49:34, 158.67s/it]

Sample Accuracy:  0.8279569892473119


  0%|          | 2/8982 [07:55<627:55:17, 251.73s/it]

Sample Accuracy:  0.8172043010752689


  0%|          | 3/8982 [12:16<637:59:18, 255.79s/it]

Sample Accuracy:  0.8152173913043478


  0%|          | 4/8982 [13:17<446:38:10, 179.09s/it]

Sample Accuracy:  1.0


  0%|          | 5/8982 [14:15<337:54:36, 135.51s/it]

Sample Accuracy:  1.0


In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data()



3