# simple LSTM

so basic it runs on pumpkin spice

## echo sequence prediction problem

### generating data

our echo sequence prediction problem needs data: specifically, sequences of random integers. we'll define our problem space as the integers from 0 to 99.

we'll use the ```randint()``` function from the python 3 ```random``` [module](https://docs.python.org/3/library/random.html "python 3 random module docs") to generate random integers within the range we specify (in this case, 0 to 99). note that ```randint()``` includes both endpoints, so 0 and 99 are both possible results.

we can use the ```randint()``` function within a function of our own to generate sequences of random integers--this will be the data for our problem.

In [3]:
# randint() is inside the python random module

import random

In [4]:
# use randint() to generate a random integer between 0 and 99

rand_int = random.randint(0, 99)

rand_int

24

we need a _lot_ more than one of these. which means it's time to build a function to automate this for us:

In [5]:
def make_seq(seq_length, n_features):
    
    '''
    generate sequences of a given length
    and given number of features
    '''
    return [random.randint(0, n_features - 1) for _ in range(seq_length)]

__demo:__ let's make a sequence of 10 values drawn from 50 possible features (the integers 0 to 49)

In [6]:
make_seq(10, 50)

[35, 29, 27, 2, 8, 35, 33, 5, 26, 38]
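
a quick sanity check on the range, as a minimal sketch: because ```randint()``` includes both endpoints, calling it with ```n_features - 1``` as the upper bound should keep every value between 0 and 49. the seed is an assumption added only to make the check repeatable; it isn't part of the original notebook.

```python
import random

def make_seq(seq_length, n_features):
    # same definition as above, repeated so this cell runs on its own
    return [random.randint(0, n_features - 1) for _ in range(seq_length)]

random.seed(0)  # assumption: seeded only so the check is repeatable

seq = make_seq(1000, 50)

# every value should fall in 0..49, because randint() includes both endpoints
in_range = all(0 <= val <= 49 for val in seq)

print(len(seq), min(seq), max(seq), in_range)
```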

### one hot encoding

before we can train the model, we have to encode the data into a format that an LSTM can use. the way we encode data matters; choices made here can significantly affect model performance.

to frame this data properly, let's revisit the original problem:

we're trying to predict a number. a _specific_ number.

if we wanted to _approximate_ the number, we could frame this as a __regression__ problem, and train our model to output a close (but not exact) approximation of the number.

but because we want the _exact_ integer (and _not_ an approximation, which is what a regression model outputs) we need to frame this as a __classification__ problem.

__classification__ means handling categorical data, which machines can do handily using __one hot encoding__.

### automatic vs manual one hot encoding

```scikit-learn``` has a super neat ```OneHotEncoder()``` [transformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html "sklearn OneHotEncoder doc") that can automate one hot encoding, but because it fits the data, it can only encode the values that it sees represented. 

we need all possible values--from 0 to 99--represented. but because we're generating our integer sequences pseudo-randomly using ```random.randint()```, we can't guarantee that all values will be represented.
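
to make the coverage problem concrete, here's a small sketch. the sequence length of 50 and the seed are assumptions chosen for illustration: a sequence of 50 draws can contain at most 50 distinct values, so at least 50 of the 100 possible values never appear, and an encoder fitted on that sequence alone would never learn them.

```python
import random

random.seed(1)  # assumption: fixed seed so the numbers are repeatable

n_features = 100
seq = [random.randint(0, n_features - 1) for _ in range(50)]

seen = set(seq)
missing = set(range(n_features)) - seen

# 50 draws can cover at most 50 of the 100 possible values
print(len(seen), 'values seen,', len(missing), 'never seen')
```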

it's possible to feed in the categories to ```OneHotEncoder()``` manually. here, however, we're going to simply make our own transformer.

we'll convert the results to a ```numpy``` ```array``` in order to make them easier to decode later.

### decoding

later on we'll need a way to interpret the model's results. to do so we'll need to decode the one hot scheme.

we can easily do this using the ```numpy``` ```argmax()``` function.

```numpy.argmax()``` returns the index of the maximum value in a vector. because each vector in the binary one hot encoding will be a lot of zeroes with a single high value--a ```1```--we can easily use ```argmax()``` to grab the index of the ```1``` value and return it. that's our output.
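
a tiny illustration of the idea, using made-up vectors. the second vector is a hypothetical stand-in for what a trained model might output: scores rather than clean zeroes and ones. ```argmax()``` handles both the same way.

```python
from numpy import array, argmax

# a one hot vector for the value 3 in a problem with 5 features
one_hot = array([0, 0, 0, 1, 0])
print(argmax(one_hot))   # index of the 1 -> 3

# hypothetical model output: scores instead of clean 0s and 1s
scores = array([0.05, 0.10, 0.05, 0.70, 0.10])
print(argmax(scores))    # the largest score is also at index 3
```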

In [7]:
from numpy import array
from numpy import argmax

# encoder function

def one_hot_encoder(seq, n_features):
    
    '''
    creates one binary vector per value in seq.
    each vector has n_features slots: all zeroes,
    except for a 1 at the index equal to the value.
    '''
    
    encoding = list()
    
    for val in seq:
        
        vector = [0 for _ in range(n_features)]
        vector[val] = 1
        encoding.append(vector)
        
    return array(encoding)  

# decoder function

def one_hot_decoder(seq_encoded):
    '''
    decodes results by returning the index of
    the point in the vector with the largest value,
    i.e. 1 
    '''
    
    return [argmax(vector) for vector in seq_encoded]
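
for comparison, here's a vectorized sketch of the same encoding. this is not the version used in the rest of this notebook; ```one_hot_encoder_eye``` is a hypothetical name. it relies on the fact that row ```i``` of an identity matrix is exactly the one hot vector for the value ```i```.

```python
import numpy as np

def one_hot_encoder_eye(seq, n_features):
    '''
    hypothetical alternative to one_hot_encoder():
    row i of the identity matrix is the one hot vector for value i,
    so indexing the identity matrix by seq encodes the whole sequence.
    '''
    return np.eye(n_features, dtype=int)[seq]

encoded = one_hot_encoder_eye([3, 0, 2], 4)

print(encoded)
# [[0 0 0 1]
#  [1 0 0 0]
#  [0 0 1 0]]

# decoding with argmax along each row recovers the original values
print(encoded.argmax(axis=1).tolist())  # [3, 0, 2]
```

the loop version above is easier to read line by line; the identity matrix version is shorter and avoids building python lists first.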

In [9]:
seq = make_seq(50, 100)

seq_encoded = one_hot_encoder(seq, 100)

seq_decoded = one_hot_decoder(seq_encoded)

print(seq, '\n')
print(seq_encoded, '\n')
print(seq_decoded)

[14, 92, 39, 16, 55, 57, 69, 37, 50, 89, 4, 91, 39, 28, 49, 58, 60, 86, 67, 83, 63, 64, 37, 37, 4, 87, 12, 22, 75, 76, 63, 63, 87, 45, 32, 89, 95, 86, 63, 26, 9, 41, 12, 10, 57, 92, 57, 93, 58, 34] 

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]] 

[14, 92, 39, 16, 55, 57, 69, 37, 50, 89, 4, 91, 39, 28, 49, 58, 60, 86, 67, 83, 63, 64, 37, 37, 4, 87, 12, 22, 75, 76, 63, 63, 87, 45, 32, 89, 95, 86, 63, 26, 9, 41, 12, 10, 57, 92, 57, 93, 58, 34]


### reshape to 3d matrix

LSTMs require input in the form of a 3d matrix (strictly speaking, a 3d array).

the three dimensions LSTMs need, in order, are: __samples, time steps, & features__.

the sequence we generated above, ```seq```, once one hot encoded, is 

* one __sample__,
* fifty __time steps__, 
* one hundred __features__.

for ```seq_encoded```, the encoded version of the sequence we just generated, it's easy to set the shape to three dimensions using the ```numpy``` ```reshape()``` method:

In [10]:
X = seq_encoded.reshape(1, 50, 100)

print(X)
print(X.shape)

[[[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]]
(1, 50, 100)


a more generalizable version might look like this:

```X = seq_encoded.reshape(n_samples, length, n_features)```
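
as a sketch of that generalized call, here's a reshape with more than one sample. the sizes and the all-zero stand-in data are assumptions for illustration only; the point is how the rows get regrouped, not the values.

```python
import numpy as np

n_samples, length, n_features = 3, 5, 10

# assumption: stand-in data, a flat 2d array holding 3 encoded sequences
# back to back (15 rows of 10 features each)
flat = np.zeros((n_samples * length, n_features), dtype=int)

X = flat.reshape(n_samples, length, n_features)
print(X.shape)  # (3, 5, 10)

# reshape() never changes the number of elements, only how they are grouped
print(flat.size == X.size)  # True
```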

### generating samples

following each of the steps above, in order, will generate 1 sample for our LSTM model.

it makes sense to automate these tasks:

In [11]:
def make_sample(length, n_features, output_index):
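    '''
    creates a single sample that is LSTM-ready.
    X is the encoded sequence, shaped (1, length, n_features);
    y is the encoded value at output_index, shaped (1, n_features).
    '''
    # create sequence of pseudo-random integers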
    seq = make_seq(length, n_features)
    
    # one hot encoding
    seq_encoded = one_hot_encoder(seq, n_features)
    
    # reshape to 3d matrix suitable for LSTM
    X = seq_encoded.reshape(1, length, n_features)
    
    # the output is the encoded value at output_index: the value to echo
    y = seq_encoded[output_index].reshape(1, n_features)
    
    return X, y
    

let's test ```make_sample``` to make sure it works:

In [12]:
X, y = make_sample(50, 100, 17)

print(X, '\n')
print(X.shape, '\n')
print(y, '\n')
print(y.shape)

[[[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]] 

(1, 50, 100) 

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]] 

(1, 100)
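
the printed arrays are hard to check by eye, so here's one more sanity check as a sketch: decode both ```X``` and ```y``` and confirm that the target really is the input value at time step 17. the functions are repeated from above (without docstrings) so the cell runs on its own.

```python
import random
from numpy import array, argmax

def make_seq(seq_length, n_features):
    return [random.randint(0, n_features - 1) for _ in range(seq_length)]

def one_hot_encoder(seq, n_features):
    encoding = list()
    for val in seq:
        vector = [0 for _ in range(n_features)]
        vector[val] = 1
        encoding.append(vector)
    return array(encoding)

def one_hot_decoder(seq_encoded):
    return [argmax(vector) for vector in seq_encoded]

def make_sample(length, n_features, output_index):
    seq = make_seq(length, n_features)
    seq_encoded = one_hot_encoder(seq, n_features)
    X = seq_encoded.reshape(1, length, n_features)
    y = seq_encoded[output_index].reshape(1, n_features)
    return X, y

X, y = make_sample(50, 100, 17)

# X[0] is the single sample: 50 time steps of 100 features
inputs = one_hot_decoder(X[0])
target = one_hot_decoder(y)[0]

# the target should echo the input value at time step 17
print(inputs[17], target, inputs[17] == target)
```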


## more information

##### python 3 random module documentation:

https://docs.python.org/3/library/random.html

##### sklearn preprocessing documentation:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing