# Natural Language Processing Intro

Natural Language Processing (NLP) is a large field of AI with many related, but somewhat distinct sub-disciplines. We won't have time to look at all of these, but you are most likely somewhat familiar with many of the applications.

Some NLP Tasks:
* Summary generation, information extraction
* Translation (language to language)
* Transcription (speech to text)
* Auto-completion
* Sentiment Analysis
* Intent Detection
* Chat, automated writing, dialog generation, question answering
* Voice assistants
* Document retrieval

## Ambiguity in language

NLP is not a simple task, and until the advent of deep learning its abilities were quite limited, although, as with most fields in AI, it has a long history, [dating back to the 1950s](https://en.wikipedia.org/wiki/Natural_language_processing). Part of the challenge is that human language tends to be ambiguous, and recognizing words is only the first step toward inferring meaning. Take, for example, this sentence:
   > The boy saw a man with a telescope
   
   * Who had the telescope?
   
More context is needed to answer this. Yet context is not always available, and even when it is, making use of it can be a challenge for NLP methods.

## Tokenization

One of the primary challenges of NLP is representing language as numbers--remember that computers, and the ML/AI systems we have, primarily deal with numbers. For computer vision problems, this was relatively easy in that we took the pixel intensities of an image and fed those in. But what do we do with speech, words, and text?

The process of converting text to a numerical representation is called **tokenization**. There are many methods of tokenization, but the idea is to break text into itemizable components (tokens), each of which can then be mapped to a number.

Tokens can be words, letters, word fragments, or even sentences.
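
As a rough illustration of the idea, the sketch below tokenizes the ambiguous sentence from earlier at the word level and at the character level, then maps each distinct word to an integer. It uses only the Python standard library. The regular expression and the names `word_tokens`, `char_tokens`, and `word_to_index` are assumptions made for this example, not part of the Keras tooling used later.

```python
import re

sentence = "The boy saw a man with a telescope"

# Word-level tokens: lower-case the text and pull out runs of letters.
word_tokens = re.findall(r"[a-z']+", sentence.lower())

# Character-level tokens: every character, including spaces, is a token.
char_tokens = list(sentence.lower())

# Assign each distinct word an integer in order of first appearance.
# Index 0 is left unused here; it is commonly reserved for padding.
word_to_index = {}
for token in word_tokens:
    if token not in word_to_index:
        word_to_index[token] = len(word_to_index) + 1

indexed = [word_to_index[token] for token in word_tokens]

print(word_tokens)
print(indexed)
print(len(char_tokens), 'character tokens')
```

Note that the repeated word `a` receives the same index both times it appears, which is what lets a model treat it as the same input.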



In [None]:
# Some of the examples here are adapted from https://github.com/NVDLI/LDL
# Which supports the book Learning Deep Learning (LDL) by Magnus Ekman (ISBN: 9780137470358).

"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text \
    import text_to_word_sequence
import tensorflow as tf
import logging
tf.get_logger().setLevel(logging.ERROR)

EPOCHS = 10  # originally 32, reduced for time.
BATCH_SIZE = 256
INPUT_FILE_NAME = 'data/frankenstein.txt'
WINDOW_LENGTH = 40
WINDOW_STEP = 3
PREDICT_LENGTH = 3
MAX_WORDS = 10000
EMBEDDING_WIDTH = 100

## Cleaning, tokenization and creation of training set

In the following block, we read in the [*Frankenstein*](https://www.gutenberg.org/ebooks/84) text file and use the `text_to_word_sequence` function to convert the text to a list of individual words. This also removes punctuation and converts everything to lower case, so this one call accomplishes both our cleaning and tokenization steps.

The next step is to create the training set of `fragments` and corresponding `targets`. The hyperparameters set above are used here to build the training set by sliding a window of `WINDOW_LENGTH=40` words over the text.

The word immediately after each window becomes the `target`, and the window then shifts forward by `WINDOW_STEP=3` words to make the next fragment/target pair.
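
To make the windowing concrete, here is a small sketch of the same loop applied to a ten-word toy sentence, with a window of 4 words and a step of 3. The helper name `make_windows` and the toy values are assumptions chosen for illustration; the notebook itself uses `WINDOW_LENGTH=40` and `WINDOW_STEP=3` on the full text.

```python
def make_windows(words, window_length, window_step):
    """Return (fragments, targets) built by sliding a window over `words`."""
    fragments = []
    targets = []
    # Stop early enough that every window still has a following word.
    for start in range(0, len(words) - window_length, window_step):
        fragments.append(words[start:start + window_length])
        targets.append(words[start + window_length])
    return fragments, targets


toy_words = "i feel a cold northern breeze play upon my cheeks".split()
toy_fragments, toy_targets = make_windows(
    toy_words, window_length=4, window_step=3)

for fragment, target in zip(toy_fragments, toy_targets):
    print(fragment, '->', target)
```

Because the step is smaller than the window length, consecutive fragments overlap, so most words appear in several training examples.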

In [None]:
# Open and read file (the with-block closes the file automatically).
with open(INPUT_FILE_NAME, 'r', encoding='utf-8-sig') as file:
    text = file.read()

# Make lower case and split into individual words.
text = text_to_word_sequence(text)

# Create training examples.
fragments = []
targets = []
for i in range(0, len(text) - WINDOW_LENGTH, WINDOW_STEP):
    fragments.append(text[i: i + WINDOW_LENGTH])
    targets.append(text[i + WINDOW_LENGTH])

### Look at one fragment/target pair

The following text from the book happens to be *fragment*/**target** pair 94:
 > *I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand* **this**
feeling?

In [None]:
print(f'Fragment: {fragments[94]}\nTarget: {targets[94]}')

print(f'\nTotal fragments: {len(fragments)}')

So, now we have the whole text split into 26,238 fragments, each with the following word as the target. This is what our model will be trained on.

But we have another step before we can begin training.

## Embedding

Somehow we need to convert the words (both in the fragments and the targets) to numbers. Embedding goes a step further, representing words as vectors in an n-dimensional space. This vector space has far fewer dimensions than the number of words in the vocabulary (unlike one-hot encoding, which needs one dimension per word), and after training, words with similar meanings tend to lie closer together in that space than unrelated words.

Here's an example figure showing some embeddings (taken from Renu Khandelwal's article [*Word Embeddings for NLP*](https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4)):


![Image of word embeddings from https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4 ](images/word_embeddings_Renu_Khandelwal.png)
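
The difference between one-hot vectors and embeddings can be sketched with a three-word vocabulary. The 2-dimensional embedding values below are hand-picked, hypothetical numbers chosen only to illustrate the geometry; they are not taken from any trained model.

```python
import math

vocabulary = ['saw', 'see', 'monster']

# One-hot encoding: one dimension per vocabulary word, a single 1 per vector.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

# Hypothetical 2-dimensional embeddings, hand-picked for illustration only.
# A trained embedding layer would learn its values from data.
embedding = {
    'saw': [0.9, 0.1],
    'see': [0.8, 0.2],
    'monster': [-0.7, 0.9],
}


def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Every pair of distinct one-hot vectors is equally far apart.
print(euclidean(one_hot['saw'], one_hot['see']))
print(euclidean(one_hot['saw'], one_hot['monster']))

# Embeddings can place related words close together.
print(euclidean(embedding['saw'], embedding['see']))
print(euclidean(embedding['saw'], embedding['monster']))
```

With one-hot vectors, distance carries no information about meaning, whereas an embedding space has room for related words to sit near each other.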

The embedding itself is learned during training. In this step, we only assign an integer index to each word. Above, we set `MAX_WORDS=10000`, so the tokenizer keeps only the most frequent words, those with indices below 10,000; any rarer word is mapped to the `UNK` (unknown) token.
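
The effect of capping the vocabulary can be sketched in plain Python. This is a simplified stand-in written for illustration, not the Keras `Tokenizer` implementation. The helper names `build_vocabulary` and `to_indices` are hypothetical, and reserving index 0 for padding and index 1 for the unknown token is an assumption of this sketch.

```python
from collections import Counter


def build_vocabulary(words, max_words, oov_token='UNK'):
    """Index the most frequent words; 0 is padding, 1 is the unknown token."""
    counts = Counter(words)
    word_to_index = {oov_token: 1}
    # Two indices are reserved, so at most max_words - 2 real words are kept.
    for word, _ in counts.most_common(max_words - 2):
        word_to_index[word] = len(word_to_index) + 1
    return word_to_index


def to_indices(words, word_to_index, oov_token='UNK'):
    """Replace each word by its index, falling back to the unknown token."""
    unknown = word_to_index[oov_token]
    return [word_to_index.get(word, unknown) for word in words]


toy_words = 'the man saw the man and the man saw the monster'.split()
toy_vocab = build_vocabulary(toy_words, max_words=5)
toy_indices = to_indices(toy_words, toy_vocab)

print(toy_vocab)
print(toy_indices)
```

Here the two rarest words, `and` and `monster`, both collapse to the unknown index, so the model can no longer tell them apart.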

In [None]:
# Convert to indices.
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token='UNK')
tokenizer.fit_on_texts(text)
fragments_indexed = tokenizer.texts_to_sequences(fragments)
targets_indexed = tokenizer.texts_to_sequences(targets)

# Convert to appropriate input and output formats.
X = np.array(fragments_indexed, dtype=np.int64)
y = np.zeros((len(targets_indexed), MAX_WORDS))
for i, target_index in enumerate(targets_indexed):
    y[i, target_index] = 1

In [None]:
print(f'Fragment as numbers: {X[94]} \nTarget is one-hot encoded: {y[94]}')

## Build and fit the model

Our first layer is the embedding layer, which learns to map each word index in the vocabulary to its embedding vector. It is followed by two LSTM layers and a dense layer with a ReLU activation. The final dense layer has one neuron per word in the vocabulary, with a softmax activation so that the outputs can be read as the probability of each word being the next word.
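
To show what the final softmax does, here is a minimal sketch that turns raw scores for a four-word vocabulary into probabilities and then picks the most probable word. The scores and the names `toy_vocabulary` and `toy_scores` are hypothetical values chosen for illustration, not output from the model.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow without changing the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores, one per word in a four-word vocabulary.
toy_vocabulary = ['UNK', 'the', 'monster', 'saw']
toy_scores = [0.1, 2.0, 1.0, -1.0]

probabilities = softmax(toy_scores)

# Greedy choice: take the index with the highest probability.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)

print(probabilities)
print('Most probable next word:', toy_vocabulary[best_index])
```

Always taking the highest-probability word is the greedy choice; it is simple, but it can produce repetitive text because lower-ranked words are never explored.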

In [None]:
# Build and train model.
training_model = Sequential()
training_model.add(Embedding(
    output_dim=EMBEDDING_WIDTH, input_dim=MAX_WORDS,
    mask_zero=True, input_length=None))
training_model.add(LSTM(128, return_sequences=True,
                        dropout=0.2, recurrent_dropout=0.2))
training_model.add(LSTM(128, dropout=0.2,
                        recurrent_dropout=0.2))
training_model.add(Dense(128, activation='relu'))
training_model.add(Dense(MAX_WORDS, activation='softmax'))
training_model.compile(loss='categorical_crossentropy',
                       optimizer='adam')
training_model.summary()
history = training_model.fit(X, y, validation_split=0.05,
                             batch_size=BATCH_SIZE, 
                             epochs=EPOCHS, verbose=2, 
                             shuffle=True)

In [None]:
# Build stateful model used for prediction.
inference_model = Sequential()
inference_model.add(Embedding(
    output_dim=EMBEDDING_WIDTH, input_dim=MAX_WORDS,
    mask_zero=True, batch_input_shape=(1, 1)))
inference_model.add(LSTM(128, return_sequences=True,
                         dropout=0.2, recurrent_dropout=0.2,
                         stateful=True))
inference_model.add(LSTM(128, dropout=0.2,
                         recurrent_dropout=0.2, stateful=True))
inference_model.add(Dense(128, activation='relu'))
inference_model.add(Dense(MAX_WORDS, activation='softmax'))
weights = training_model.get_weights()
inference_model.set_weights(weights)

In [None]:
# Provide beginning of sentence and
# predict next words in a greedy manner
first_words = ['i', 'saw']
first_words_indexed = tokenizer.texts_to_sequences(
    first_words)
inference_model.reset_states()
predicted_string = ''

# Feed initial words to the model.
for i, word_index in enumerate(first_words_indexed):
    x = np.zeros((1, 1), dtype=np.int64)
    x[0][0] = word_index[0]
    predicted_string += first_words[i]
    predicted_string += ' '
    y_predict = inference_model.predict(x, verbose=0)[0]

# Predict PREDICT_LENGTH words.
for i in range(PREDICT_LENGTH):
    new_word_index = np.argmax(y_predict)
    word = tokenizer.sequences_to_texts(
        [[new_word_index]])
    x[0][0] = new_word_index
    predicted_string += word[0]
    predicted_string += ' '
    y_predict = inference_model.predict(x, verbose=0)[0]
print(predicted_string)

In [None]:
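# Let's predict more by changing the predict length to 10.
# This continues from the state left by the previous cell, so the
# new words are appended to the existing predicted_string.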

PREDICT_LENGTH = 10

# Predict PREDICT_LENGTH words.
for i in range(PREDICT_LENGTH):
    new_word_index = np.argmax(y_predict)
    word = tokenizer.sequences_to_texts(
        [[new_word_index]])
    x[0][0] = new_word_index
    predicted_string += word[0]
    predicted_string += ' '
    y_predict = inference_model.predict(x, verbose=0)[0]
print(predicted_string)

In [None]:
# Explore embedding similarities.
embeddings = training_model.layers[0].get_weights()[0]
lookup_words = ['the', 'saw', 'see', 'of', 'and',
                'monster', 'frankenstein', 'read', 'eat']
for lookup_word in lookup_words:
    lookup_word_indexed = tokenizer.texts_to_sequences(
        [lookup_word])
    print('words close to:', lookup_word)
    lookup_embedding = embeddings[lookup_word_indexed[0]]
    word_indices = {}
    # Calculate distances.
    for i, embedding in enumerate(embeddings):
        distance = np.linalg.norm(
            embedding - lookup_embedding)
        word_indices[distance] = i
    # Print sorted by distance.
    for distance in sorted(word_indices.keys())[:5]:
        word_index = word_indices[distance]
        word = tokenizer.sequences_to_texts([[word_index]])[0]
        print(word + ': ', distance)
    print('')