### Natural Language Processing

Natural Language Processing (NLP) is a discipline in computing that deals with the interaction between natural (human) languages and computers: how computers can understand and process human language. Common examples of NLP are spellcheck and autocomplete.

#### Recurrent Neural Networks

A recurrent neural network (RNN) is a kind of neural network that is much more capable of processing sequential data such as text or characters. We will use it for the following tasks:

    Sentiment Analysis
    Character Generation

### Sequence Data

Sequence data such as long chains of text, weather patterns, videos and really anything where the notion of a step or time is relevant needs to be processed and handled in a special way. When working with previous data, the notion of time or step was irrelevant.

Textual data contains words that follow in a very specific meaningful order and we need to be able to keep track of each word and when it occurs in the data. Simply encoding an entire paragraph of text into one data point wouldn't provide a meaningful picture of the data and would be very difficult to do anything with. This is why we treat text as a sequence and process one word at a time. We will keep track of where each of these words appears and use that information to try to understand the meaning of pieces of text.

### Encoding Text

ML models and NNs don't accept raw text data as an input. This means we must somehow encode our textual data to numeric values that our models can understand. There are many different ways of doing this and we will look at a few examples below.

Consider the following two movie reviews:  

    I thought the movie was going to be bad, but it was actually amazing!

    I thought the movie was going to be amazing, but it was actually bad!

Although these two sentences are very similar, we know that they have very different meanings. This is because of the ordering of words, a very important property of textual data.

Now keep that in mind while we consider some different ways of encoding our textual data.  

#### Bag of Words

The easiest way to encode text data is to use a "bag of words". This technique encodes each word in a sentence as an integer and throws it into a collection that does not maintain the order of the words but does keep track of how often each one occurs.

In [1]:
# The Python function below encodes a string of text into a bag of words.
vocab = {} # Maps word to integer representation
word_encoding = 1

def bag_of_words(text):
    global word_encoding
    
    words = text.lower().split(" ") # Creates a list of all the words in the text, we'll assume there is no grammar in our text for this example
    bag = {} # Stores all of the encodings and their frequency
    
    for word in words:
        
        if word in vocab:
            encoding = vocab[word] # Get encoding from vocab
        else:
            vocab[word] = word_encoding
            encoding = word_encoding
            word_encoding += 1
            
        if encoding in bag:
            bag[encoding] += 1
        else:
            bag[encoding] = 1
            
    return bag

text = "this is a test to see if this test will work is is test a a"
bag = bag_of_words(text)
print(bag)
print(vocab)

{1: 2, 2: 3, 3: 3, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


In [2]:
# Look at how this encoding works for the two sentences we showed above
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"

pos_bag = bag_of_words(positive_review)
neg_bag = bag_of_words(negative_review)

print("Positive:", pos_bag)
print("Negative:", neg_bag, '\n')

print("We can see that even though these sentences have a very different meaning they are encoded exactly the same way")

Positive: {10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 1, 5: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1}
Negative: {10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 1, 5: 1, 16: 1, 21: 1, 18: 1, 19: 1, 20: 1, 17: 1} 

We can see that even though these sentences have a very different meaning they are encoded exactly the same way
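
As a compact cross-check of the same idea, here is a minimal sketch using `collections.Counter` from the standard library. The helper name `bag_of_words_counter` is hypothetical, and unlike the function above it keys the bag on the words themselves rather than on integer ids. It only illustrates that a bag keeps frequencies and discards order.

```python
from collections import Counter

def bag_of_words_counter(text):
    """Return a word -> frequency mapping; word order is discarded."""
    return Counter(text.lower().split())

pos = bag_of_words_counter("I thought the movie was going to be bad but it was actually amazing")
neg = bag_of_words_counter("I thought the movie was going to be amazing but it was actually bad")

print(pos == neg)   # the two bags compare equal even though the meanings differ
print(pos["was"])   # frequency of the word "was"
```

Because both reviews contain the same words the same number of times, the two bags are indistinguishable.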


### Integer Encoding

This next technique, integer encoding, involves representing each word or character in a sentence as a unique integer and maintaining the order of these words. This should fix the earlier problem where we lost the order of words.

In [3]:
vocab = {}  # Maps each word to its integer representation
word_encoding = 1

def integer_encoding(text):
    
  global word_encoding

  words = text.lower().split(" ") 
  encoding = []  # Integer codes, in the same order as the words

  for word in words:
    if word in vocab:
      code = vocab[word]  
      encoding.append(code) 
    else:
      vocab[word] = word_encoding
      encoding.append(word_encoding)
      word_encoding += 1
  
  return encoding

text = "this is a test to see if this test will work is is test a a"
encoding = integer_encoding(text)
print(encoding)
print(vocab)

[1, 2, 3, 4, 5, 6, 7, 1, 4, 8, 9, 2, 2, 4, 3, 3]
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


In [4]:
# Let's look at integer encoding on our movie reviews.
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"

pos_encode = integer_encoding(positive_review)
neg_encode = integer_encoding(negative_review)

print("Positive:", pos_encode)
print("Negative:", neg_encode, '\n')

Positive: [10, 11, 12, 13, 14, 15, 5, 16, 17, 18, 19, 14, 20, 21]
Negative: [10, 11, 12, 13, 14, 15, 5, 16, 21, 18, 19, 14, 20, 17] 



Now we are keeping track of the order of words and we can tell where each occurs. Ideally, we would like similar words to have similar labels and different words to have very different labels. For example, the words happy and joyful should probably have very similar labels so we can determine that they are similar, while words like horrible and amazing should probably have very different labels. The method we looked at above can't do this, because labels are assigned in the order words are first seen and say nothing about meaning. As a result, the model may have a very difficult time determining whether two words are similar, which could hurt performance considerably.
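
To see why these integer labels carry no notion of similarity, consider the small sketch below. The four-word vocabulary and the helper `label_distance` are made up for illustration; ids are assigned in order of first appearance, as in the encoding function above.

```python
# Hypothetical vocabulary: ids are assigned in order of first appearance.
vocab_ids = {}
for word in "happy horrible joyful amazing".split():
    vocab_ids.setdefault(word, len(vocab_ids) + 1)

def label_distance(a, b):
    """Absolute difference between the integer labels of two words."""
    return abs(vocab_ids[a] - vocab_ids[b])

print(vocab_ids)
print(label_distance("happy", "joyful"))    # similar words, labels 2 apart
print(label_distance("happy", "horrible"))  # opposite words, labels only 1 apart
```

Here "happy" ends up numerically closer to "horrible" than to "joyful", purely because of the order in which the words happened to be seen.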

### Word Embeddings

A third, far superior method is word embeddings. This method keeps the order of words intact and also encodes similar words with similar representations. It attempts to encode not only the frequency and order of words but also their meaning. Each word is encoded as a dense vector, and words that are used in similar contexts end up with vectors that are close together.

Unlike the previous techniques, word embeddings are learned by looking at many different training examples. You can add what's called an embedding layer to the beginning of your model, and while your model trains, the embedding layer will learn the correct embeddings for words. You can also use pretrained embedding layers.
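
The sketch below illustrates what "similar words get similar vectors" means in practice. The 3-dimensional vectors are hand-picked, hypothetical values (a real embedding layer would learn them during training), and cosine similarity is used here as one common way to compare vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# A real embedding layer would learn these values during training.
embeddings = {
    "happy":    np.array([0.9, 0.8, 0.1]),
    "joyful":   np.array([0.8, 0.9, 0.2]),
    "horrible": np.array([-0.9, -0.7, 0.1]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, -1 opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_close = cosine_similarity(embeddings["happy"], embeddings["joyful"])
sim_far = cosine_similarity(embeddings["happy"], embeddings["horrible"])
print(sim_close, sim_far)
```

With these made-up vectors, "happy" and "joyful" score close to 1, while "happy" and "horrible" score negative.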

### Recurrent Neural Networks (RNNs)

We have been using feed-forward neural networks. This means that all our data is fed forward (all at once) from left to right through the network. This won't work well for processing text. A recurrent neural network is a network that contains a loop. An RNN will process one word at a time while maintaining an internal memory of what it has already seen. This will allow it to treat words differently based on their order in a sentence and to slowly build an understanding of the entire input, one word at a time.

This is why we are treating our text data as a sequence! So that we can pass one word at a time to the RNN.

Let's have a look at what a recurrent layer might look like:

![alt text](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)
*Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/*

Variable definitions:

    h_t: output at time t

    x_t: input at time t

    A: recurrent layer (loop)

What this diagram is trying to illustrate is that a recurrent layer processes words or input one at a time in combination with the output from the previous iteration. So, as we progress further in the input sequence, we build a more complex understanding of the text as a whole.

A simple recurrent layer like this can be effective at processing shorter sequences of text for simple problems, but it has drawbacks. One of them is that as text sequences get longer, it becomes increasingly difficult for the network to retain information from early in the sequence and therefore to understand the text properly.
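
A minimal NumPy sketch of such a recurrent step is shown below. It assumes the common formulation in which the output at time t is `tanh(W_x x_t + W_h h_prev + b)`. The sizes, the random weights, and the names `W_x`, `W_h`, `b` and `rnn_step` are illustrative assumptions, not what Keras uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 4, 3                            # hypothetical sizes
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1    # weights applied to the current input
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1   # weights applied to the previous output
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrent step: combine the current input with the previous output."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence_inputs = rng.normal(size=(5, input_dim))  # 5 timesteps of made-up input vectors
h = np.zeros(hidden_dim)                           # initial state before any input is seen
outputs = []
for x_t in sequence_inputs:
    h = rnn_step(x_t, h)   # the new output depends on x_t and on the previous output
    outputs.append(h)

print(np.array(outputs).shape)
```

The same weights are reused at every timestep; only `h` changes as the loop moves through the sequence.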

### LSTM

An LSTM (Long Short-Term Memory) layer works similarly to the simple RNN layer but adds a mechanism for carrying information forward across many timesteps. In our simple RNN layer, the influence of inputs from earlier timesteps gradually faded as we moved further through the input. An LSTM maintains a separate long-term memory (the cell state) alongside its output, and uses learned gates to decide what to keep, what to overwrite, and what to expose at each step. Rather than literally storing every previous input, it learns which information is worth remembering, so relevant details from much earlier in the sequence can still affect the current output. This adds to the complexity of our network and allows it to discover more useful relationships between inputs and when they appear.
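
For intuition, here is a rough NumPy sketch of a single LSTM step under one standard formulation (forget, input and output gates plus a cell state). The sizes, random weights, and the names `W`, `b` and `lstm_step` are assumptions made for illustration; the Keras layer used later has its own internal layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, units = 4, 3  # hypothetical sizes

# One weight matrix and bias per gate (forget, input, output) plus the candidate values.
# Each matrix acts on the concatenation [h_prev, x_t].
names = ("f", "i", "o", "c")
W = {n: rng.normal(size=(units, units + input_dim)) * 0.1 for n in names}
b = {n: np.zeros(units) for n in names}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate: how much of the old cell state to keep
    i = sigmoid(W["i"] @ z + b["i"])         # input gate: how much new information to write
    o = sigmoid(W["o"] @ z + b["o"])         # output gate: how much of the cell state to expose
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate values to write
    c_t = f * c_prev + i * c_tilde           # updated long-term cell state
    h_t = o * np.tanh(c_t)                   # output at this timestep
    return h_t, c_t

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):  # 6 timesteps of made-up input
    h, c = lstm_step(x_t, h, c)

print(h.shape, c.shape)
```

The cell state `c` is only ever scaled and added to, which is what lets information survive over many steps when the forget gate stays close to 1.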

### Sentiment Analysis

Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

### Movie Review Dataset

We'll classify movie reviews using the IMDB movie review dataset from Keras, which contains 50,000 reviews split evenly into 25,000 for training and 25,000 for testing. Each review is already preprocessed and labeled as either positive or negative. Each review is encoded as a sequence of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.

In [5]:
from keras.datasets import imdb
import keras
# An alternative import of `sequence` apparently doesn't work here, so the standalone keras_preprocessing package is used
from keras_preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

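VOCAB_SIZE = 88584  # Number of distinct words kept when loading the dataset

MAXLEN = 250     # Every review is padded or trimmed to this many words
BATCH_SIZE = 64  # Defined here but not passed to model.fit below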

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

In [6]:
len(train_data[1])

189

### Preprocessing

The reviews are different lengths. We cannot pass different length data into our neural network. Therefore, we must make each review the same length. To do this we will follow the procedure below:

    if the review is greater than 250 words then trim off the extra words
    if the review is less than 250 words add the necessary amount of 0's to make it equal to 250

Keras has a function that can do this for us:

In [7]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
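
To make the padding behaviour concrete, here is a small NumPy-only sketch. The helper `pad_or_trim` is hypothetical and is not the Keras implementation. It assumes the Keras defaults, as I understand them, of padding and truncating at the front of the sequence, which matches the leading zeros visible in the encoded review printed later in this notebook.

```python
import numpy as np

def pad_or_trim(seq, maxlen, pad_value=0):
    """Left-pad with pad_value, or keep only the last maxlen items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                        # trim from the front if too long
    return np.array([pad_value] * (maxlen - len(seq)) + seq)

print(pad_or_trim([5, 8, 2], 5))            # shorter than maxlen: zeros are added in front
print(pad_or_trim([1, 2, 3, 4, 5, 6], 4))   # longer than maxlen: the earliest items are dropped
```

Padding at the front keeps the real words next to the end of the sequence, which is the last thing the recurrent layer sees.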

### Creating the Model

We'll use a word embedding layer as the first layer in our model and add an LSTM layer afterwards that feeds into a single dense neuron to get our predicted sentiment.

32 stands for the output dimension of the vectors generated by the embedding layer. We can change this value if we'd like!

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

In [9]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
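
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four weight sets (three gates plus the candidate values), each with input weights, recurrent weights, and a bias. The constant names `EMBED_DIM` and `LSTM_UNITS` are introduced here for illustration.

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32    # output dimension of the embedding layer
LSTM_UNITS = 32   # number of units in the LSTM layer

# One EMBED_DIM-sized vector per word in the vocabulary.
embedding_params = VOCAB_SIZE * EMBED_DIM

# Four weight sets, each with (input + recurrent) weights and a bias per unit.
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)

# One weight per LSTM output plus a single bias.
dense_params = LSTM_UNITS * 1 + 1

print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```

These values match the `Param #` column above; almost all of the parameters sit in the embedding layer.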


### Training

Now we compile and train the model.

In [10]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["acc"])
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [11]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.5047945976257324, 0.8448399901390076]


### Making Predictions

We can now use the network to make predictions on our own reviews.

Since our reviews are encoded, we need to convert any review that we write into that form so the network can understand it. To do that, we'll load the encodings from the dataset and use them to encode our own data.

In [12]:
word_index = imdb.get_word_index()

def encode_text(text):
    
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    return sequence.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [13]:
# Decode function
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    
    PAD = 0
    text = ""
    for num in integers:
        
        if num != PAD:
            text += reverse_word_index[num] + " "
            
    return text[:-1]

print(decode_integers(encoded))

that movie was just amazing so amazing


In [14]:
# Time to make prediction
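def predict(text):
    encoded_text = encode_text(text)
    pred = np.zeros((1, MAXLEN))  # Batch of one review, padded to MAXLEN
    pred[0] = encoded_text
    result = model.predict(pred)
    print(result[0])  # Sigmoid output: closer to 1 is positive, closer to 0 is negative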
    
positive_review = "That movie was so awesome! I really loved it and would watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)

[0.89384913]
[0.11018962]
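
The model outputs a score between 0 and 1 rather than a label. One simple way to turn that score into a label is to threshold it, sketched below. The helper `sentiment_label` and the 0.5 cutoff are assumptions for illustration, not part of the model.

```python
def sentiment_label(score, threshold=0.5):
    """Map a sigmoid score to a label; the 0.5 threshold is an assumed, adjustable cutoff."""
    return "positive" if score >= threshold else "negative"

# Scores taken from the two predictions printed above.
print(sentiment_label(0.89384913))
print(sentiment_label(0.11018962))
```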


## RNN Play Generator

In [1]:
import keras
from keras_preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

In [2]:
# Download the dataset, an extract from a Shakespeare play
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [3]:
# Read and decode for py2 compatibility
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# Length of text is # of characters in it
print("Length of text: {} characters".format(len(text)))

Length of text: 1115394 characters


In [4]:
# First 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



### Encoding  


In [5]:
vocab = sorted(set(text))
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
    return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [6]:
# Taking a look at how text is encoded
print("Text:", text[:13])
print("Encoded:", text_to_int(text[:13]))

Text: First Citizen
Encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [7]:
# Function that converts numeric values to text
def int_to_text(ints):
    
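    try:
        ints = ints.numpy()  # Convert a tensor to a NumPy array if needed
    except AttributeError:
        pass  # Already a NumPy array or plain sequence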
    
    return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


### Creating Training Examples
Our task is to feed the model a sequence and have it return the next character. Therefore, we need to split our text data above into many shorter sequences that we can pass to the model as training examples.

The training examples we prepare will use a *seq_length* sequence as input and a *seq_length* sequence as the output, where the output is the input sequence shifted one character to the right. For example:

```input: Hell | output: ello```
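
The shifting described above can be sketched in plain Python. The helper names `split_input_target` and `make_examples` are illustrative assumptions. The sketch cuts the text into non-overlapping chunks of `seq_length + 1` characters and drops any leftover characters at the end, which is one reasonable choice among several.

```python
def split_input_target(chunk):
    """Input is everything but the last character; target is everything but the first."""
    return chunk[:-1], chunk[1:]

def make_examples(text, seq_length):
    """Cut text into non-overlapping chunks of seq_length + 1 characters, dropping any remainder."""
    chunk_size = seq_length + 1
    chunks = [text[i:i + chunk_size]
              for i in range(0, len(text) - chunk_size + 1, chunk_size)]
    return [split_input_target(chunk) for chunk in chunks]

print(split_input_target("Hello"))
print(make_examples("First Citizen", seq_length=4))
```

Each chunk needs one more character than `seq_length` so that both the input and the shifted target have exactly `seq_length` characters.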

Our first step will be to create a stream of characters from our text data.