## Natural Language Processing

Natural Language Processing (NLP) is a discipline in computing that deals with communication between natural (human) languages and computer languages. Common examples of NLP are spellcheck and autocomplete.

## Recurrent Neural Networks

In this tutorial we will introduce a new kind of neural network that is much more capable of processing sequential data such as text or characters, called a recurrent neural network (RNN).

We will learn how to use a recurrent neural network to do the following:

* Sentiment Analysis
* Character Generation

RNNs are fairly complex and come in many different forms, so in this tutorial we will focus on how they work and the kinds of problems they are best suited for.

## Bag of Words

In [1]:
vocab = {}  # maps word to integer representing it
word_encoding = 1
def bag_of_words(text):
  global word_encoding

  # create a list of all of the words in the text; we'll assume the text has no punctuation for this example
  words = text.lower().split(" ")
  bag = {}  # stores all of the encodings and their frequency

  for word in words:
    if word in vocab:
      encoding = vocab[word]  # get encoding from vocab
    else:
      vocab[word] = word_encoding
      encoding = word_encoding
      word_encoding += 1
    
    if encoding in bag:
      bag[encoding] += 1
    else:
      bag[encoding] = 1
  
  return bag

text = "this is a test to see if this test will work is is test a a"
bag = bag_of_words(text)
print(bag)
print(vocab)

{1: 2, 2: 3, 3: 3, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


This isn't really how we would do this in practice, but it should give you an idea of how bag of words works. Notice that we've lost the order in which the words appear: the bag records only which words occur and how often.
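
To make the loss of word order concrete, here is a minimal sketch using two made-up sentences that contain the same words in a different order. The helper `simple_bag` is a hypothetical shortcut built on `collections.Counter`; it counts words directly instead of mapping them to integers first, but the idea is the same as in `bag_of_words` above.

```python
from collections import Counter

def simple_bag(text):
    # Count how often each lowercase word appears; word order is discarded
    return Counter(text.lower().split())

bag_a = simple_bag("the dog bit the man")
bag_b = simple_bag("the man bit the dog")

print(bag_a)
print(bag_a == bag_b)  # the two sentences mean different things but share one bag
```

Because both sentences produce the same bag, a model that only sees the bag cannot tell them apart.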

## Sentiment Analysis

## Movie Review Dataset

We'll start by loading the IMDB movie review dataset from Keras. This dataset contains 25,000 reviews from IMDB, each already preprocessed and labeled as either positive or negative. Each review is encoded as a list of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.
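
The frequency-rank encoding described above can be sketched on a toy corpus. The three reviews below are invented for illustration, and ties are broken alphabetically just to keep the result deterministic. The real dataset may also reserve some low integers for special markers such as padding, which this sketch ignores.

```python
from collections import Counter

# Toy corpus (made up for this example)
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]

counts = Counter(word for review in reviews for word in review.split())

# Rank 1 is the most common word; ties are broken alphabetically
ranked_words = sorted(counts, key=lambda w: (-counts[w], w))
rank_of = {word: rank for rank, word in enumerate(ranked_words, start=1)}

# Replace every word with its frequency rank
encoded = [[rank_of[w] for w in review.split()] for review in reviews]
print(rank_of)
print(encoded)
```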

In [2]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_label) = imdb.load_data(num_words = VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [3]:
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

## Preprocessing

In [5]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
train_data[1]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     1,   194,
        1153,   194,  8255,    78,   228,     5,     6,  1463,  4369,
        5012,   134,    26,     4,   715,     8,   118,  1634,    14,
         394,    20,    13,   119,   954,   189,   102,     5,   207,
         110,  3103,    21,    14,    69,   188,     8,    30,    23,
           7,     4,   249,   126,    93,     4,   114,     9,  2300,
        1523,     5,   647,     4,   116,     9,    35,  8163,     4,
         229,     9,   340,  1322,     4,   118,     9,     4,   130,
        4901,    19,
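
A minimal sketch of what the padding step does, assuming the default settings of `pad_sequences` (zeros are added at the front, and reviews longer than `MAXLEN` lose their earliest words). The helper `pad_or_truncate` is a hypothetical NumPy stand-in written for illustration, not part of Keras.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, pad_value=0):
    # Keep only the last `maxlen` items, then left-pad with `pad_value`
    seq = list(seq)[-maxlen:]
    padded = np.full(maxlen, pad_value, dtype=np.int32)
    if seq:
        padded[-len(seq):] = seq
    return padded

short = pad_or_truncate([5, 6, 7], 5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [0 0 5 6 7]
print(long)   # [3 4 5 6 7]
```

Every review ends up with the same length, which is what lets the reviews be stacked into one array for training.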

## Create the Model

In [7]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 32)          2834688   
                                                                 
 lstm_1 (LSTM)               (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2843041 (10.85 MB)
Trainable params: 2843041 (10.85 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
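
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for these layers (an LSTM has four gates, each with input weights, recurrent weights and a bias); the helper names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per word in the vocabulary
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # A weight per input per unit, plus one bias per unit
    return input_dim * units + units

emb = embedding_params(88584, 32)
lstm = lstm_params(32, 32)
dense = dense_params(32, 1)
total = emb + lstm + dense
print(emb, lstm, dense, total)  # 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding layer, because the vocabulary is so large.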


## Training

In [9]:
model.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop', metrics = ['acc'])

history = model.fit(train_data, train_labels, epochs = 10, validation_split = 0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We'll evaluate the model on our test data to see how well it performs.

In [10]:
result = model.evaluate(test_data, test_label)
result



[0.4805392920970917, 0.8593999743461609]

## Making predictions

In [12]:
word_index = imdb.get_word_index()

def encode_text(text):
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    return sequence.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encode = encode_text(text)
print(encode)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [13]:
# while we're at it, let's make a decode function

reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    PAD = 0
    text = ""
    for num in integers:
        if num != PAD:
            text += reverse_word_index[num] + " "

    return text[:-1]

print(decode_integers(encode))

that movie was just amazing so amazing


In [14]:
# now time to make a prediction

def predict(text):
    encoded_text = encode_text(text)
    pred = np.zeros((1, MAXLEN))  # a batch holding one padded review
    pred[0] = encoded_text
    result = model.predict(pred) 
    print(result[0])

positive_review = "That movie was! really loved it and would great watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie really sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)


[0.8773501]
[0.31494194]
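
Because the final layer uses a sigmoid, each prediction is a single number between 0 and 1. A simple way to read it, assuming a cutoff of 0.5 (a common convention, not something fixed by the model), is sketched below. The helper `label_from_score` is hypothetical; the two scores are the ones printed above.

```python
def label_from_score(score, threshold=0.5):
    # Scores at or above the threshold are read as positive reviews
    return "positive" if score >= threshold else "negative"

print(label_from_score(0.8773501))   # positive
print(label_from_score(0.31494194))  # negative
```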


## RNN Play Generator

In [15]:
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

In [16]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [18]:
text = open(path_to_file, 'rb').read().decode(encoding = 'utf-8')
print('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [19]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



## Encoding

Since this text isn't encoded yet, we'll need to do that ourselves. We are going to encode each unique character as a different integer.

In [22]:
vocab = sorted(set(text))

char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
    return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [23]:
print("Text:", text[:13])
print("Encoded:", text_to_int(text[:13]))

Text: First Citizen
Encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [25]:
def int_to_text(ints):
    # Tensors expose .numpy(); plain NumPy arrays do not, so those are used as-is
    try:
        ints = ints.numpy()
    except AttributeError:
        pass
    return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


In [26]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [27]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [28]:
def split_input_target(chunk):
    input_text = chunk[:-1]   # every character except the last
    target_text = chunk[1:]   # the next character for each input position
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [29]:
for x, y in dataset.take(2):
    print("\n\nEXAMPLE\n")
    print("INPUT")
    print(int_to_text(x))
    print("\nOUTPUT")
    print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k
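
The same chunk-and-shift idea can be sketched in plain Python on a short string. The function `make_examples` and the sample text are made up for illustration; the real pipeline above does this with `tf.data` on integer-encoded characters.

```python
def make_examples(text, seq_length):
    # Cut the text into chunks of seq_length + 1 characters, dropping any remainder
    chunk_size = seq_length + 1
    examples = []
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        # Input is the chunk without its last character;
        # target is the chunk without its first character
        examples.append((chunk[:-1], chunk[1:]))
    return examples

examples = make_examples("Hello world!", seq_length=5)
for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
```

At every position the target holds the character that follows the corresponding input character, which is exactly what the model is trained to predict.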


In [30]:
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)
EMBEDDING_DIM = 256
RNN_UNITS = 1024

BUFFER_SIZE = 1000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder = True)

In [31]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape = [batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences = True,
                             stateful = True,
                             recurrent_initializer = 'glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (64, None, 256)           16640     
                                                                 
 lstm_2 (LSTM)               (64, None, 1024)          5246976   
                                                                 
 dense_1 (Dense)             (64, None, 65)            66625     
                                                                 
Total params: 5330241 (20.33 MB)
Trainable params: 5330241 (20.33 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [33]:
for input_example_batch, target_example_batch in data.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "--> (batch_size, sequence_length, vocab_size)")

(64, 100, 65) --> (batch_size, sequence_length, vocab_size)


In [34]:
print(len(example_batch_predictions))
print(example_batch_predictions)

64
tf.Tensor(
[[[-9.17716883e-04  2.05973512e-03 -1.29466504e-02 ...  7.24846916e-03
    1.18075637e-04  8.22192989e-03]
  [ 4.43683332e-03 -1.94420828e-03 -1.09162945e-02 ...  9.60852113e-03
   -3.02824820e-03  9.73279122e-03]
  [ 4.28345427e-03  1.85428909e-03 -1.50364274e-02 ...  9.72239207e-03
   -1.18152273e-03  3.63048282e-03]
  ...
  [ 8.22923426e-03  4.73024836e-03 -1.93125277e-03 ...  1.02472249e-02
   -2.51482613e-03 -9.06765461e-03]
  [ 1.14468699e-02  1.34245143e-03 -2.17978796e-03 ...  1.20051727e-02
   -4.81724739e-03 -3.78987845e-03]
  [ 1.49077782e-02  7.57269852e-04 -8.42376985e-03 ...  1.40316170e-02
   -7.97890592e-03 -5.31198271e-03]]

 [[ 9.64803435e-03  7.92764314e-03 -4.81153792e-03 ...  2.21281569e-03
    8.79527011e-04 -1.13362800e-02]
  [ 4.95439442e-03  5.53762168e-03  2.03680852e-03 ...  7.15896999e-03
    3.23626585e-03 -1.42284790e-02]
  [ 6.30202796e-03  4.41909628e-03  9.34002222e-04 ...  9.25146695e-03
    4.11982927e-03 -7.07802968e-03]
  ...
  [ 2.245

In [35]:
pred = example_batch_predictions[0]
print(len(pred))
print(pred)

100
tf.Tensor(
[[-0.00091772  0.00205974 -0.01294665 ...  0.00724847  0.00011808
   0.00822193]
 [ 0.00443683 -0.00194421 -0.01091629 ...  0.00960852 -0.00302825
   0.00973279]
 [ 0.00428345  0.00185429 -0.01503643 ...  0.00972239 -0.00118152
   0.00363048]
 ...
 [ 0.00822923  0.00473025 -0.00193125 ...  0.01024722 -0.00251483
  -0.00906765]
 [ 0.01144687  0.00134245 -0.00217979 ...  0.01200517 -0.00481725
  -0.00378988]
 [ 0.01490778  0.00075727 -0.00842377 ...  0.01403162 -0.00797891
  -0.00531198]], shape=(100, 65), dtype=float32)


In [36]:
time_pred = pred[0]
print(len(time_pred))
print(time_pred)

65
tf.Tensor(
[-0.00091772  0.00205974 -0.01294665 -0.01102437  0.00885246  0.00350611
  0.00248048 -0.00198216 -0.00029814 -0.00306895 -0.0025068  -0.00679063
 -0.00581012 -0.00015235 -0.00383247  0.00264474 -0.00647019 -0.00157837
 -0.00361783 -0.00522902 -0.00991746  0.00111962  0.00993568  0.00844753
  0.00169271 -0.0045408   0.00195838  0.00245238 -0.00359592  0.0071556
  0.00710109 -0.00416773  0.00733918 -0.01170573 -0.00541775 -0.00611222
 -0.00576066  0.00645136  0.00341756  0.0075292  -0.00585935  0.01876229
  0.01149772  0.00828716  0.00092977  0.00139835 -0.01508622  0.00383708
 -0.00044344  0.00767208  0.00052797  0.00396218  0.00406228  0.01087895
 -0.0040208  -0.00421738 -0.00291978 -0.01043626 -0.00176217 -0.00527899
  0.00749619  0.00477153  0.00724847  0.00011808  0.00822193], shape=(65,), dtype=float32)


In [37]:
sampled_indices = tf.random.categorical(pred, num_samples=1)

sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars

"eiqVhT$TIpY,jrvkETC-sPoVoolHfHyVlAWk;w-OVxJutrmuQoUUY3''k.Fi\nGpjSsaCDEdEjdNUM?!.\n.iZYOIVLJYEQlaZwjpg"
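
The sampling step draws the next character at random in proportion to the model's scores instead of always taking the highest one. A rough NumPy sketch of that idea follows, assuming the scores are logits that get turned into probabilities with a softmax. The function name, the three example logits and the seed are all invented for illustration.

```python
import numpy as np

def sample_from_logits(logits, rng):
    # Softmax (shifted by the max for numerical stability) turns logits into probabilities
    shifted = logits - np.max(logits)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # Draw one index at random, weighted by those probabilities
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])
samples = [sample_from_logits(logits, rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=3)
print(counts)  # index 0 is drawn most often, but the others still appear
```

With an untrained model the logits are all close to zero, so every character is roughly equally likely and the sampled text looks like the random string above.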

In [38]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits = True)
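
For intuition, here is a hand-rolled NumPy sketch of sparse categorical cross-entropy on logits for a single character. It is an illustrative stand-in under the usual definition (negative log of the softmax probability assigned to the true class), not the Keras implementation.

```python
import numpy as np

def sparse_ce_from_logits(label, logits):
    # Log-softmax, shifted by the max for numerical stability
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    # The loss is the negative log-probability of the correct class
    return -log_probs[label]

# An untrained model scores all 65 characters about equally
uniform_logits = np.zeros(65)
untrained_loss = sparse_ce_from_logits(3, uniform_logits)
print(untrained_loss)  # ln(65), roughly 4.17
```

A loss near 4.17 at the start of training therefore means the model is still guessing uniformly over the 65 characters.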

In [39]:
model.compile(optimizer = "adam", loss = loss)

In [40]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [42]:
history = model.fit(data, epochs=2, callbacks=[checkpoint_callback])

Epoch 1/2
Epoch 2/2


In [45]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
# checkpoint_num = 2
# model.load_weights(tf.train.load_checkpoint("./training_checkpoints/ckpt_" + str(checkpoint_num)))
# model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 800

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
    
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
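
The effect of `temperature` can be seen by dividing a small set of logits by different values before applying a softmax. This is a sketch with invented logits and a hypothetical helper name, meant only to show the trend.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Dividing by the temperature rescales the logits before the softmax
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
```

A temperature below 1 concentrates probability on the top choice, giving more predictable text; a temperature above 1 flattens the distribution, giving more surprising text.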

In [46]:
inp = input("Type a starting string: ")
print(generate_text(model, inp))

romeon with have brieg a
Shase gite fart is
Alled with thay to will'o the cors.
That her canlive te to be to sinm in take apoct, the lead on you more; we thank you forturd stull maze andows.

KIANENLA:
Hervollmbow
In Buttoling termt ag then, will repust shall by thus.

GONTELSO:
Hay is his were was that cuttread calf that dient'd you he? Mary corme thene;
Matioms you.
Thy ford wood, lost be doue bean, and this dedvit to have the ne
&rows to her sens wo will baind to withan
A diuth off goods.

GROZIE:
thou thy boand!
Houths, that un reasunh; same, than the send as ad. Thy that?

TRANEO:
What, nor thar we with a knowl: no heach two you. Vecaisca, thy father-mact:
My not speim fort y os Guchter, gons
he hade anverarvest ere sheaw these selil: come
Of te me; so we sto bekn nece-
Om thin straws dome
