<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Lesson 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)
## _aka_ PREDICTING THE FUTURE!

<img src="https://media.giphy.com/media/l2JJu8U8SoHhQEnoQ/giphy.gif" width=480 height=356>
<br></br>
<br></br>

> "Yesterday's just a memory - tomorrow is never what it's supposed to be." -- Bob Dylan

Wish you could save [Time In A Bottle](https://www.youtube.com/watch?v=AnWWj6xOleY)? With statistics you can do the next best thing - understand how data varies over time (or any sequential order), and use the order/time dimension predictively.

A sequence is just any enumerated collection - order counts, and repetition is allowed. Python lists are a good elemental example - `[1, 2, 2, -1]` is a valid list, and is different from `[1, 2, -1, 2]`. The data structures we tend to use (e.g. NumPy arrays) are often built on this fundamental structure.

A time series is data where you have not just the order but some actual continuous marker for where they lie "in time" - this could be a date, a timestamp, [Unix time](https://en.wikipedia.org/wiki/Unix_time), or something else. All time series are also sequences, and for some techniques you may just consider their order and not "how far apart" the entries are (if you have particularly consistent data collected at regular intervals it may not matter).

## Recurrent Neural Networks

There's plenty more to "traditional" time series, but the latest and greatest technique for sequence data is recurrent neural networks. A recurrence relation in math is an equation that uses recursion to define a sequence - a famous example is the Fibonacci numbers:

$F_n = F_{n-1} + F_{n-2}$

For formal math you also need a base case $F_0=1, F_1=1$, and then the rest builds from there. But for neural networks what we're really talking about are loops:

![Recurrent neural network](https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg)

The hidden layers have edges (output) going back to their own input - this loop means that for any time `t` the training is at least partly based on the output from time `t-1`. The entire network is being represented on the left, and you can unfold the network explicitly to see how it behaves at any given `t`.

Different units can have this "loop", but a particularly successful one is the long short-term memory unit (LSTM):

![Long short-term memory unit](https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Long_Short-Term_Memory.svg/1024px-Long_Short-Term_Memory.svg.png)

There's a lot going on here - in a nutshell, the calculus still works out and backpropagation can still be implemented. The advantage (ane namesake) of LSTM is that it can generally put more weight on recent (short-term) events while not completely losing older (long-term) information.

After enough iterations, a typical neural network will start calculating prior gradients that are so small they effectively become zero - this is the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem), and is what RNN with LSTM addresses. Pay special attention to the $c_t$ parameters and how they pass through the unit to get an intuition for how this problem is solved.

So why are these cool? One particularly compelling application is actually not time series but language modeling - language is inherently ordered data (letters/words go one after another, and the order *matters*). [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) is a famous and worth reading blog post on this topic.

For our purposes, let's use TensorFlow and Keras to train RNNs with natural language. Resources:

- https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py
- https://keras.io/layers/recurrent/#lstm
- http://adventuresinmachinelearning.com/keras-lstm-tutorial/

Note that `tensorflow.contrib` [also has an implementation of RNN/LSTM](https://www.tensorflow.org/tutorials/sequences/recurrent).

### RNN/LSTM Sentiment Classification with Keras

In [1]:
'''
#Trains an LSTM model on the IMDB sentiment classification task.
The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF + LogReg.
**Notes**
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''
from __future__ import print_function

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

max_features = 20000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80 # usually use the mean length of sequences in x_train, 
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences


In [2]:
len(x_train)

25000

In [3]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)


In [4]:
x_train[0]

array([   15,   256,     4,     2,     7,  3766,     5,   723,    36,
          71,    43,   530,   476,    26,   400,   317,    46,     7,
           4, 12118,  1029,    13,   104,    88,     4,   381,    15,
         297,    98,    32,  2071,    56,    26,   141,     6,   194,
        7486,    18,     4,   226,    22,    21,   134,   476,    26,
         480,     5,   144,    30,  5535,    18,    51,    36,    28,
         224,    92,    25,   104,     4,   226,    65,    16,    38,
        1334,    88,    12,    16,   283,     5,    16,  4472,   113,
         103,    32,    15,    16,  5345,    19,   178,    32],
      dtype=int32)

In [5]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128)) #Takes max features(25k) and embed as dense layer with 128 values(neurons)
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) # 128 LSTM units(neurons), with dropout
model.add(Dense(1, activation='sigmoid')) # Returns one predicted value

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2, # change to a lower value to spped up training for learning purposes
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model...
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Train...
Train on 25000 samples, validate on 25000 samples
Instructions for updating:
Use tf.cast instead.
Epoch 1/2
Epoch 2/2
Test score: 0.40488060349464416
Test accuracy: 0.82284


### LSTM Text generation with Keras

What else can we do with LSTMs? Since we're analyzing the *sequence*, we can do more than classify - we can *generate* text. I'ved pulled some news stories using [newspaper](https://github.com/codelucas/newspaper/).

This example is drawn from the Keras [documentation](https://keras.io/examples/lstm_text_generation/).

In [None]:
from tensorflow.keras.callbacks import LambdaCallback 
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import numpy as np
import random
import sys
import os

In [None]:
data_files = os.listdir('./articles')

In [None]:
# Read in Data
# Read in as one big blob
""" 
-Text data starts out as a whole bunch of articles stored in a individual text files
-We append all the data into one giant text string containing ~900k chars
"""


# text = ''

# for filename in data_files:
#     if filename[-3:] == 'txt':
#         path = f"./articles/{filename}"
#         with open(path, "r") as data:
#             content = data.read()
#             text = text + " " + content
            
# print("corpus length:", len(text))

data = []

for file in data_files:
    if file[-3:] == 'txt':
        with open(f"./articles/{file}", 'r') as f:
            data.append(f.read())
            

In [None]:
# We append all the data into one giant text string
text[:50]

[78,
 104,
 2,
 58,
 22,
 104,
 2,
 76,
 76,
 78,
 89,
 59,
 22,
 58,
 8,
 0,
 98,
 78,
 44,
 59,
 76,
 76,
 59,
 32,
 78,
 2,
 8,
 118,
 104,
 59,
 89,
 118,
 0,
 58,
 78,
 104,
 76,
 117,
 22,
 61]

In [72]:
# Encode Data as Chars

"""
1. Create one giant string of articles
2. Get a unique list of chars
3. Create lookup dictionary (mapping) "char_int" and "int_char"
"""


"""
Read through all 900k chars and create a dictionary only containing the 
unique chars the exist in the 900k text blob
"""

# Current lecture method
giant_string = " ".join(data)
chars = list(set(giant_string))
char_int = {c:i for i,c in enumerate (chars)}
int_char = {i:c for i,c in enumerate(chars)}

# Previous lecture method
# chars = sorted(list(set(text))) # Sets dont repeat characters so this list will only diplay the characters used and no duplicates
# char_indicies = dict((c,i) for i, c in enumerate(chars))
# indicies_char = dict((i, c) for i,c in enumerate(chars))

In [73]:
indicies_char = int_char
char_indicies = char_int

In [10]:
indicies_char

{0: '\n',
 1: ' ',
 2: '!',
 3: '"',
 4: '#',
 5: '$',
 6: '%',
 7: '&',
 8: "'",
 9: '(',
 10: ')',
 11: '*',
 12: '+',
 13: ',',
 14: '-',
 15: '.',
 16: '/',
 17: '0',
 18: '1',
 19: '2',
 20: '3',
 21: '4',
 22: '5',
 23: '6',
 24: '7',
 25: '8',
 26: '9',
 27: ':',
 28: ';',
 29: '?',
 30: '@',
 31: 'A',
 32: 'B',
 33: 'C',
 34: 'D',
 35: 'E',
 36: 'F',
 37: 'G',
 38: 'H',
 39: 'I',
 40: 'J',
 41: 'K',
 42: 'L',
 43: 'M',
 44: 'N',
 45: 'O',
 46: 'P',
 47: 'Q',
 48: 'R',
 49: 'S',
 50: 'T',
 51: 'U',
 52: 'V',
 53: 'W',
 54: 'X',
 55: 'Y',
 56: 'Z',
 57: '[',
 58: ']',
 59: '_',
 60: 'a',
 61: 'b',
 62: 'c',
 63: 'd',
 64: 'e',
 65: 'f',
 66: 'g',
 67: 'h',
 68: 'i',
 69: 'j',
 70: 'k',
 71: 'l',
 72: 'm',
 73: 'n',
 74: 'o',
 75: 'p',
 76: 'q',
 77: 'r',
 78: 's',
 79: 't',
 80: 'u',
 81: 'v',
 82: 'w',
 83: 'x',
 84: 'y',
 85: 'z',
 86: '{',
 87: '|',
 88: '©',
 89: '\xad',
 90: '·',
 91: '½',
 92: '×',
 93: 'á',
 94: 'ã',
 95: 'è',
 96: 'é',
 97: 'ê',
 98: 'í',
 99: 'ñ',
 100: 

In [None]:
""" The 900k text string only contained 121 unique characters """
len(chars)

121

In [75]:
# Create the Sequence Data
maxlen = 40
step = 5

encoded = [char_int[c] for c in giant_string]

sequences = [] # 40 characters
next_chars = [] # 1 character

for i in range(0, len(encoded) - maxlen, step):
    sequences.append(encoded[i: i + maxlen])
    next_chars.append(encoded[i + maxlen])
    
print('sequeces: ', len(sequences))

sequeces:  178374


In [76]:
# Previous lecture method

# maxlen = 40
# step = 3

# """
# Using the dictionary of 121 unique char dictionary, we want to iterate through the raw text data and produce
# sequences of 40 character that are encoded using one hot encoding. We are dealing with character and not words.

# """

# #Sentences will be X and next_chars will be y because next_char is what we are predicting
# sentences = [] # X
# next_chars = [] # y

# for i in range(0, len(text) - maxlen, step):
#     sentences.append(text[i: i + maxlen])
#     next_chars.append(text[i + maxlen])
    
# print("sequences:", len(sentences))

### Specify X and y

In [77]:
# Whether or not a particular char in a particular sequence exists
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)

for i,sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i, t, char] = 1 
    y[i, next_chars[i]] = 1

MemoryError: 

In [78]:
print(x.shape) # x represents al sequences of 40
x[0]

(178374, 40, 121)


array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [79]:
# Previous lecture

# # Binary encode x and y
# # Text data will be one hot encoded
# #  We will measure our loss as categorical cross entropy
# # Will be using accuracy as our loss metric to update the weights
# x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
# y = np.zeros((len(sentences), len(chars)), dtype=np.bool)


"""
Loop over each sentence in our sentences and preserve an ID number,
for each of the characters in the sentence, we will preserve the location(integer) of that charecter in the sequence
In the input data we will append the sentences ID, character location ID, and the actual charecter number
"""
# for i, sentences in enumerate(sentences): #loop over sentences in sentences
#     for t, char in enumerate(sentences): # preserve location integer of character in sequence
#         x[i, t, char_indicies[char]] = 1 #append sentence ID, character location ID, and actual charecter number
#     y[i, char_indicies[next_chars[i]]] = 1 # for y take the sentences ID and pass next_chars i for readability

'\nLoop over each sentence in our sentences and preserve an ID number,\nfor each of the characters in the sentence, we will preserve the location(integer) of that charecter in the sequence\nIn the input data we will append the sentences ID, character location ID, and the actual charecter number\n'

In [80]:
"""
The first character in our character lookup is not represented in the sentence
"""
x[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [1]:
# build the model: a single LSTM

model = Sequential()
model.add(LSTM(128,input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

NameError: name 'Sequential' is not defined

In [82]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [83]:
print(x.shape)
print(y.shape)
text = sequence

(178374, 40, 121)
(297291, 121)


In [84]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [85]:
model.fit(x, y,
          batch_size=128,
          epochs=5,
          callbacks=[print_callback])

ValueError: Input arrays should have the same number of samples as target arrays. Found 178374 input samples and 297291 target samples.