# addition prediction problem & encoder-decoder lstm

in their 2015 [paper](https://arxiv.org/pdf/1410.4615.pdf "https://arxiv.org/pdf/1410.4615.pdf"), Wojciech Zaremba and Ilya Sutskever showed that LSTM encoder-decoder models were capable of calculating the output of small programs--adding together two numbers of up to nine digits in length.

even more impressive, the model reads only the character representations of the symbols and digits. there is nothing programmatic to tell the model which operations are being represented--it learns this during training:

>*It is important to emphasize that the LSTM reads the entire input one character at a time and produces the output one character at a time. The characters are initially meaningless from the model’s perspective; for instance, the model does not know that “+” means addition or that 6 is followed by 7. In fact, scrambling the input characters (e.g., replacing “a” with “q”, “b” with “w”, etc.,) has no effect on the model’s ability to solve this problem*
>
> Wojciech Zaremba and Ilya Sutskever, Learning to Execute

as an example of how this works, the model might take in the sequence representing

12 + 5 = 17

split into the following input and output character sequences:

`['1','2','+','5']`

`['1','7']`

the digits and symbols are just characters; they have no functional meaning. the model learns the relationships during training.
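
to make the "just characters" point concrete, here is a minimal sketch in plain python. the `scramble` mapping is a hypothetical relabeling chosen arbitrarily for illustration (it is not from the paper); it only shows that a consistent one-to-one renaming keeps the structure of the problem intact.

```python
# split an addition problem into the character sequences a model would see
problem = '12+5'
answer = str(12 + 5)

input_chars = list(problem)
output_chars = list(answer)

print(input_chars)   # ['1', '2', '+', '5']
print(output_chars)  # ['1', '7']

# relabel every symbol consistently (hypothetical mapping, chosen arbitrarily)
scramble = str.maketrans('0123456789+', 'qwertyuiopa')

scrambled_input = list(problem.translate(scramble))
scrambled_output = list(answer.translate(scramble))

print(scrambled_input)   # ['w', 'e', 'a', 'y']
print(scrambled_output)  # ['w', 'i']
```

because the relabeling is one-to-one, every input still maps to exactly one output, which is all the model has to go on.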


## sequence-to-sequence (seq2seq) model

there are a couple of key characteristics to note in this problem. first, the order matters--shuffling the order of the characters would make any relationship impossible to deduce.

second, the lengths of the input and output can vary, making this problem more challenging than a one-to-one or many-to-one sequence prediction problem.

because the data is an ordered sequence of variable input and output length, this problem requires a many-to-many modeling approach, otherwise known as [__sequence to sequence__](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf "https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf").

### padding

because sequence lengths can vary, we need to do some __padding__. padding consists of adding characters at the beginning or end (or both) of a sequence so that all the sequences in a training set are the same length. any character can be used, but it should be sufficiently different from the training data that the machine can figure out it isn't pertinent to the problem at hand--in other words, the padding shouldn't add noise.

there are different methods for choosing what characters to use for padding and how. some are intuitive, and some are statistical. 
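
as a minimal sketch of left-padding, assuming a space as the padding character: the helper name `pad_left` is made up for this example, and it relies only on python's built-in `str.rjust`.

```python
def pad_left(strings, pad_char=' '):
    # pad every string on the left so all of them match the longest one
    width = max(len(s) for s in strings)
    return [s.rjust(width, pad_char) for s in strings]

examples = ['3 + 10', '10 + 10', '7 + 2']
padded = pad_left(examples)

for p in padded:
    print(repr(p), len(p))
```

here the width comes from the longest string in the batch; for a real training set it may be safer to compute the width from the largest values the data can contain, so every batch ends up the same shape.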

*for more information about libraries for padding:*

__tensorflow__

*tool:* `tf.pad`

*documentation:* https://www.tensorflow.org/api_docs/python/tf/pad

__keras__

*tool:* `pad_sequences`

*documentation:* https://keras.io/preprocessing/sequence/

__numpy__

*tool:* `numpy.pad`

*documentation:* https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.pad.html

### one hot encoding

in order to be machine readable, these characters need to somehow be encoded into numerical data.

in this case, [__one hot encoding__](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f "https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f") provides the answer.

we can treat each character as a category, and represent each input to the machine as a set of vectors, one per character. it's interesting to note here that in a sense the sparse matrices produced by one hot encoding mostly serve to tell a machine what categories a particular example *isn't* in; these matrices are mostly zeros.

for automatic one hot encoding, there are a number of libraries and packages available. these mostly work well when every potential category is represented in the data.
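
a small hand-rolled sketch shows the idea, and also why a fixed alphabet helps when not every category appears in a given sample. it assumes a twelve-character alphabet of the ten digits, `+` and a space; the helper name `one_hot` is hypothetical.

```python
# every character the data can contain, whether or not a sample uses it
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']
char_to_int = {c: i for i, c in enumerate(alphabet)}

def one_hot(text):
    # one vector per character, with a single 1 at that character's index
    vectors = []
    for char in text:
        vector = [0] * len(alphabet)
        vector[char_to_int[char]] = 1
        vectors.append(vector)
    return vectors

encoded = one_hot('1+2')

for char, vector in zip('1+2', encoded):
    print(repr(char), vector)
```

the string `'1+2'` only uses three of the twelve categories, but every vector still has twelve slots, so all samples share one consistent encoding.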


### generating data

the code below generates data perfectly prepared for this problem. when executed, it will

* __generate__ random pairs of numbers with their sums
* convert the __integers__ to __strings__
* __pad__ the strings on the left, using the space `' '` character
* __integer encode__ the sequences

and, finally

* __one hot encode__ the sequences
* __assign__ sequences to data structures (lists) so they're ready for a model

because this code does all these things manually, it's easy to see how many of these parameters can be changed to experiment with model performance!

the code to generate this data below is taken (with only a few modifications) from [jason brownlee's excellent course on LSTMs](https://machinelearningmastery.com/lstms-with-python/ "https://machinelearningmastery.com/lstms-with-python/") available at [www.machinelearningmastery.com](https://machinelearningmastery.com/lstms-with-python/)

In [5]:
from random import seed
from random import randint
from math import ceil
from math import log10

# generate lists of random integers and their sum

def random_sum_pairs(n_examples, n_numbers, largest):
    
    X, y = list(), list()
    
    for _ in range(n_examples):
        
        in_pattern = [randint(1,largest) for _ in range(n_numbers)]
        
        out_pattern = sum(in_pattern)
        
        X.append(in_pattern)
        
        y.append(out_pattern)
    
    return X, y

# convert data to strings

def to_string(X, y, n_numbers, largest):
    
    # longest possible input string: the digits of every number
    # plus one three-character ' + ' separator between each pair of numbers
    
    max_length = int(n_numbers * ceil(log10(largest+1)) + 3 * (n_numbers - 1))
    
    Xstr = list()
    
    for pattern in X:
        
        strp = ' + '.join([str(n) for n in pattern])
        
        strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
        
        Xstr.append(strp)
    
    # longest possible output string: the digits of the largest possible sum
    
    max_length = int(ceil(log10(n_numbers * (largest+1))))
    
    ystr = list()
    
    for pattern in y:
        
        strp = str(pattern)
        
        strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
        
        ystr.append(strp)
    
    return Xstr, ystr

# integer encode strings
# i've changed variable names here to prevent scope issues with multiple notebook runs
# and make it easier for me to read

def integer_encode(X, y, alphabet):
    
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    
    X_int = list()
    
    for pattern in X:
        
        integer_encoded = [char_to_int[char] for char in pattern]
        
        X_int.append(integer_encoded)
    
    y_int = list()
    
    for pattern in y:
        
        integer_encoded = [char_to_int[char] for char in pattern]
        
        y_int.append(integer_encoded)
    
    return X_int, y_int

# one hot encode
# some names changed for scope & readability

def one_hot_encode(X, y, max_int):
    
    X_encoded = list()
    
    for seq in X:
        
        pattern = list()
        
        for index in seq:
            
            vector = [0 for _ in range(max_int)]
            
            vector[index] = 1
            
            pattern.append(vector)
        
        X_encoded.append(pattern)
    
    y_encoded = list()
    
    for seq in y:
        
        pattern = list()
        
        for index in seq:
            
            vector = [0 for _ in range(max_int)]
            
            vector[index] = 1
            
            pattern.append(vector)
        
        y_encoded.append(pattern)
    
    return X_encoded, y_encoded

# let's test it out
# to make it easy to see how pieces fit i've numbered the X & y transforms
# X_1, X_2, X_3, etc...to X_final, y_final

seed(1)

n_samples = 1

n_numbers = 2

largest_number = 10

# make pairs

X_1, y_1 = random_sum_pairs(n_samples, n_numbers, largest_number)

print('step 1: pairs with sums \n')
print(X_1, y_1)

# convert to strings

X_2, y_2 = to_string(X_1, y_1, n_numbers, largest_number)

print('\n step 2: transform to strings \n')
print(X_2, y_2)

# integer encode
# include every character we're using
# even the spaces for padding!

alphabet = ['0','1','2','3','4','5','6','7','8','9','+',' ']

X_3, y_3 = integer_encode(X_2, y_2, alphabet)

print('\n step 3: integer encoding \n')
print(X_3, y_3)

# one hot encode

X_final, y_final = one_hot_encode(X_3, y_3, len(alphabet))

print('\n final step: one hot encoding \n')
print(X_final, y_final)


step 1: pairs with sums 

[[3, 10]] [13]

 step 2: transform to strings 

[' 3 + 10'] ['13']

 step 3: integer encoding 

[[11, 3, 11, 10, 11, 1, 0]] [[1, 3]]

 final step: one hot encoding 

[[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]] [[[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]]]


sweet, it works! obvs we're not going to want to do all that for every sample, so let's make a function:

In [6]:
from numpy import array

def make_samples(n_samples, n_numbers, largest_number, alphabet):
    
    # get pairs
    
    X_pairs, y_pairs = random_sum_pairs(n_samples, n_numbers, largest_number)
    
    # pairs to strings
    
    X_strings, y_strings = to_string(X_pairs, y_pairs, n_numbers, largest_number)
    
    # integer encode
    
    X_int, y_int = integer_encode(X_strings, y_strings, alphabet)
    
    # one hot encode
    
    X_encoded, y_encoded = one_hot_encode(X_int, y_int, len(alphabet))
    
    # return as numpy arrays
    
    X, y = array(X_encoded), array(y_encoded)
    
    return X, y
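
turning the samples into numpy arrays only works cleanly if every padded string has the same length, so it's worth checking the width arithmetic. the sketch below is a hedged, standalone check: `padded_widths` is a hypothetical helper, and it assumes the three-character `' + '` separator and the same `n_numbers` and `largest_number` values used in the test above.

```python
from itertools import product
from math import ceil, log10

def padded_widths(n_numbers, largest, separator=' + '):
    # digits in the largest number, e.g. 2 for largest = 10
    digits = ceil(log10(largest + 1))
    # every number at full width plus a separator between each pair
    in_width = n_numbers * digits + len(separator) * (n_numbers - 1)
    # digits in the largest possible sum
    out_width = ceil(log10(n_numbers * (largest + 1)))
    return in_width, out_width

n_numbers, largest = 2, 10
in_width, out_width = padded_widths(n_numbers, largest)

# brute-force every possible example and measure the longest strings
combos = list(product(range(1, largest + 1), repeat=n_numbers))
longest_in = max(len(' + '.join(str(n) for n in combo)) for combo in combos)
longest_out = max(len(str(sum(combo))) for combo in combos)

print(in_width, longest_in)    # 7 7
print(out_width, longest_out)  # 2 2
```

if a computed width were ever smaller than the longest real string, some samples would go unpadded and the arrays would come out ragged.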

### decoding sequences

in order to easily read the results, we'll need to decode them. with one hot encoding, it's easy to use numpy's `argmax()` for this: in a vector of (nearly) all zeros, the highest value will be the `1` denoting a character's category, so `argmax()` returns the index of the category that character belongs to. that index maps straight back to a character in the alphabet.

In [8]:
from numpy import argmax

def decode_results(sequence, alphabet):
    
    int_2_char = dict((i, c) for i, c in enumerate(alphabet))
    
    strings = list()
    
    # each item in the sequence is one one-hot (or probability) vector
    
    for vector in sequence:
        
        string_version = int_2_char[argmax(vector)]
        
        strings.append(string_version)
    
    return ''.join(strings)
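
as a quick sanity check, here is a standalone round trip written in plain python, so it runs without numpy. the helpers `encode`, `decode` and `soft_vector` are hypothetical names for this sketch, and `soft_vector` only imitates the kind of probability-like output a trained model might produce.

```python
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']

def encode(text):
    # one hot encode a string, one vector per character
    vectors = []
    for char in text:
        vector = [0] * len(alphabet)
        vector[alphabet.index(char)] = 1
        vectors.append(vector)
    return vectors

def decode(vectors):
    # pure-python argmax: the index of the largest value in each vector
    chars = []
    for vector in vectors:
        index = max(range(len(vector)), key=vector.__getitem__)
        chars.append(alphabet[index])
    return ''.join(chars)

def soft_vector(index, peak=0.9):
    # imitate a model prediction: most of the weight on one category
    rest = (1 - peak) / (len(alphabet) - 1)
    vector = [rest] * len(alphabet)
    vector[index] = peak
    return vector

round_trip = decode(encode(' 3 + 10'))
prediction = decode([soft_vector(1), soft_vector(3)])

print(repr(round_trip))  # ' 3 + 10'
print(repr(prediction))  # '13'
```

the same argmax step works for exact one hot vectors and for softer predictions, because only the position of the largest value matters.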
    