<a href="https://colab.research.google.com/github/KevinTheRainmaker/ML_DL_Basics/blob/master/Udacity%3A%20Intro%20to%20TensorFlow%20for%20DL/UTFD_L9C7_Preparing_Text_to_Use_with_TensorFlow_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing text to use with TensorFlow models

The high-level steps to prepare text for use in a machine learning model are:

1. Tokenize the words to get numerical values
2. Create numerical sequences from the sentences
3. Adjust the sequences so they are all the same length

## Import the classes

In [1]:
# Import Tokenizer and pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Sentences to Tokenize

In [2]:
sentences = [
             'I saw someone I knew wherever I went during the fair',
             'Guests can help themselves to refreshments whenever they wish',
             'She will do her best to answer them as thoroughly as possible',
             'She and I go to the fair as a guests',
             'She will read and edit the articles for the next fair'
]

In [3]:
print(sentences)

['I saw someone I knew wherever I went during the fair', 'Guests can help themselves to refreshments whenever they wish', 'She will do her best to answer them as thoroughly as possible', 'She and I go to the fair as a guests', 'She will read and edit the articles for the next fair']


## Create the Tokenizer and define an OOV token

When creating the Tokenizer, we can specify the maximum number of words to keep (`num_words`, based on word frequency) and a token to represent words that are out of vocabulary (OOV).

This OOV token is used when we create sequences for sentences that contain words that are not in the word index.

In [4]:
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

## Tokenize the words

In [5]:
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'i': 2, 'the': 3, 'fair': 4, 'to': 5, 'she': 6, 'as': 7, 'guests': 8, 'will': 9, 'and': 10, 'saw': 11, 'someone': 12, 'knew': 13, 'wherever': 14, 'went': 15, 'during': 16, 'can': 17, 'help': 18, 'themselves': 19, 'refreshments': 20, 'whenever': 21, 'they': 22, 'wish': 23, 'do': 24, 'her': 25, 'best': 26, 'answer': 27, 'them': 28, 'thoroughly': 29, 'possible': 30, 'go': 31, 'a': 32, 'read': 33, 'edit': 34, 'articles': 35, 'for': 36, 'next': 37}
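
To make the numbering above less mysterious, here is a minimal pure-Python sketch that reproduces it for these five sentences. It is not the Keras implementation: `build_word_index` is a hypothetical helper, and it assumes lowercasing, whitespace splitting, most-frequent-first ranking, and first-seen order for ties. The real `Tokenizer` also filters punctuation, which this sketch skips.

```python
from collections import Counter

sentences = [
    'I saw someone I knew wherever I went during the fair',
    'Guests can help themselves to refreshments whenever they wish',
    'She will do her best to answer them as thoroughly as possible',
    'She and I go to the fair as a guests',
    'She will read and edit the articles for the next fair',
]

def build_word_index(texts, oov_token='<OOV>'):
    """Hypothetical helper: map each word to an integer, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    index = {oov_token: 1}  # 0 is never assigned; it is used for padding later
    for word, _ in ranked:
        index[word] = len(index) + 1
    return index

sketch_index = build_word_index(sentences)
print(sketch_index)
```

For this input the sketch assigns the same numbers as the word index printed above, for example `'i': 2`, `'the': 3`, and `'next': 37`.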


## Create sequences for the sentences

After tokenizing the words, the word index contains a unique number for each word.

However, the word index on its own says nothing about word order, whereas the order of the words in a sentence matters.

So the next step is to generate a sequence for each sentence: the list of word-index numbers in the order the words appear.

In [6]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[2, 11, 12, 2, 13, 14, 2, 15, 16, 3, 4], [8, 17, 18, 19, 5, 20, 21, 22, 23], [6, 9, 24, 25, 26, 5, 27, 28, 7, 29, 7, 30], [6, 10, 2, 31, 5, 3, 4, 7, 32, 8], [6, 9, 33, 10, 34, 3, 35, 36, 3, 37, 4]]
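
Conceptually, each sequence is a word-by-word lookup in the word index. The sketch below illustrates that idea in plain Python. `text_to_sequence` is a hypothetical helper, not part of Keras, and the handling of `num_words` (indices at or above the limit fall back to the OOV index) reflects my understanding of the Tokenizer's behavior rather than something shown in this notebook.

```python
# A few entries copied from the word index printed earlier.
word_index = {'<OOV>': 1, 'i': 2, 'the': 3, 'fair': 4, 'to': 5, 'she': 6,
              'as': 7, 'guests': 8, 'and': 10, 'go': 31, 'a': 32}

def text_to_sequence(text, word_index, num_words=100, oov_token='<OOV>'):
    """Hypothetical helper: look up each word, falling back to the OOV index."""
    oov = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # Assumption: unknown words, and words ranked at or beyond num_words,
        # are replaced by the OOV index.
        if idx is None or idx >= num_words:
            idx = oov
        sequence.append(idx)
    return sequence

known = text_to_sequence('She and I go to the fair as a guests', word_index)
unknown = text_to_sequence('She and I go to the zoo', word_index)
print(known)    # [6, 10, 2, 31, 5, 3, 4, 7, 32, 8]
print(unknown)  # [6, 10, 2, 31, 5, 3, 1]
```

The first result matches the fourth sequence in the output above; in the second, `zoo` is not in the word index, so it becomes `1`.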


## Make the sequences all the same length

Later, when you feed the sequences into a neural network to train a model, the sequences all need to be uniform in size.

Currently the sequences have varied lengths, so the next step is to make them all the same size, either by padding or truncating.

<br>

### Padding & Truncating
Use `pad_sequences()` to pad or truncate the sequences so that they all have the same length. By default, both padding and truncating happen at the start of each sequence (`padding='pre'`, `truncating='pre'`), but you can switch either one to the end with `'post'`.

If you don't provide a max length (`maxlen`), the sequences are padded to match the length of the longest sentence, and nothing is truncated.

<br>

[All the options for padding and truncating](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [7]:
padded = pad_sequences(sequences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
print("Padded Sequences: ", padded)

Word Index:  {'<OOV>': 1, 'i': 2, 'the': 3, 'fair': 4, 'to': 5, 'she': 6, 'as': 7, 'guests': 8, 'will': 9, 'and': 10, 'saw': 11, 'someone': 12, 'knew': 13, 'wherever': 14, 'went': 15, 'during': 16, 'can': 17, 'help': 18, 'themselves': 19, 'refreshments': 20, 'whenever': 21, 'they': 22, 'wish': 23, 'do': 24, 'her': 25, 'best': 26, 'answer': 27, 'them': 28, 'thoroughly': 29, 'possible': 30, 'go': 31, 'a': 32, 'read': 33, 'edit': 34, 'articles': 35, 'for': 36, 'next': 37}
Sequences:  [[2, 11, 12, 2, 13, 14, 2, 15, 16, 3, 4], [8, 17, 18, 19, 5, 20, 21, 22, 23], [6, 9, 24, 25, 26, 5, 27, 28, 7, 29, 7, 30], [6, 10, 2, 31, 5, 3, 4, 7, 32, 8], [6, 9, 33, 10, 34, 3, 35, 36, 3, 37, 4]]
Padded Sequences:  [[ 0  2 11 12  2 13 14  2 15 16  3  4]
 [ 0  0  0  8 17 18 19  5 20 21 22 23]
 [ 6  9 24 25 26  5 27 28  7 29  7 30]
 [ 0  0  6 10  2 31  5  3  4  7 32  8]
 [ 0  6  9 33 10 34  3 35 36  3 37  4]]


\* Sentence 3 is the longest (12 tokens), so the other sentences are padded with leading zeros to match its length.

In [8]:
# Specify a max length for the padded sequences
padded = pad_sequences(sequences, maxlen=15)
print(padded)

[[ 0  0  0  0  2 11 12  2 13 14  2 15 16  3  4]
 [ 0  0  0  0  0  0  8 17 18 19  5 20 21 22 23]
 [ 0  0  0  6  9 24 25 26  5 27 28  7 29  7 30]
 [ 0  0  0  0  0  6 10  2 31  5  3  4  7 32  8]
 [ 0  0  0  0  6  9 33 10 34  3 35 36  3 37  4]]


With `maxlen=15`, even sentence 3 is padded, because it has only 12 tokens.

In [9]:
# Put the padding at the end of the sequences
padded = pad_sequences(sequences, maxlen=15, padding='post')
print(padded)

[[ 2 11 12  2 13 14  2 15 16  3  4  0  0  0  0]
 [ 8 17 18 19  5 20 21 22 23  0  0  0  0  0  0]
 [ 6  9 24 25 26  5 27 28  7 29  7 30  0  0  0]
 [ 6 10  2 31  5  3  4  7 32  8  0  0  0  0  0]
 [ 6  9 33 10 34  3 35 36  3 37  4  0  0  0  0]]


The padding (additional zeros) is now added to the end of each sequence.

In [10]:
# Limit the length of the sequences; sequences longer than maxlen get truncated
padded = pad_sequences(sequences, maxlen=6)
print(padded)

[[14  2 15 16  3  4]
 [19  5 20 21 22 23]
 [27 28  7 29  7 30]
 [ 5  3  4  7 32  8]
 [ 3 35 36  3 37  4]]
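
Note that the truncated output keeps the *last* six tokens of each sentence, because truncation happens at the start by default. The sketch below imitates the padding and truncating rules in plain Python so the two directions can be compared side by side. `pad` is a hypothetical stand-in written for illustration: it borrows the parameter names `maxlen`, `padding`, and `truncating` from `pad_sequences()`, but it returns plain lists instead of a NumPy array.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical pure-Python stand-in for the behavior shown above."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        # Truncate first: 'pre' drops tokens from the front, 'post' from the back.
        if len(seq) > maxlen:
            if truncating == 'pre':
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: 'pre' puts the fill values in front, 'post' puts them after.
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result

first = [2, 11, 12, 2, 13, 14, 2, 15, 16, 3, 4]  # sentence 1 from above

pre_truncated = pad([first], maxlen=6)
post_truncated = pad([first], maxlen=6, truncating='post')
post_padded = pad([first], maxlen=15, padding='post')

print(pre_truncated)   # [[14, 2, 15, 16, 3, 4]]
print(post_truncated)  # [[2, 11, 12, 2, 13, 14]]
print(post_padded)     # [[2, 11, 12, 2, 13, 14, 2, 15, 16, 3, 4, 0, 0, 0, 0]]
```

Which end to truncate depends on where the useful information sits in your text; keeping the start of each sentence (`truncating='post'`) is a common alternative to the default.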


## OOV Examples


In [11]:
# Try turning sentences that contain words that 
# aren't in the word index into sequences.
# Add your own sentences to the test_data
test_data = [
    "my best friend's favorite ice cream flavor is strawberry",
    "my dog's best friend is a manatee"
]
print(test_data)

# Remind ourselves which number corresponds to the
# out of vocabulary token in the word index
print("<OOV> has the number", word_index['<OOV>'], "in the word index.")

# Convert the test sentences to sequences
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Pad the new sequences
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")

# Notice that "1" appears in the sequence wherever there's a word 
# that's not in the word index
print(padded)

["my best friend's favorite ice cream flavor is strawberry", "my dog's best friend is a manatee"]
<OOV> has the number 1 in the word index.

Test Sequence =  [[1, 26, 1, 1, 1, 1, 1, 1, 1], [1, 1, 26, 1, 1, 32, 1]]

Padded Test Sequence: 
[[ 0  1 26  1  1  1  1  1  1  1]
 [ 0  0  0  1  1 26  1  1 32  1]]
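
Almost every token in these test sentences maps to `1`, which shows how little a five-sentence vocabulary covers. One rough way to quantify this is the share of OOV tokens per sequence, sketched below. `oov_rate` is a hypothetical helper, not a Keras function, and it should be applied to the unpadded sequences, since padding zeros would dilute the ratio.

```python
def oov_rate(sequence, oov_index=1):
    """Hypothetical helper: fraction of tokens mapped to the OOV index."""
    if not sequence:
        return 0.0
    return sum(1 for token in sequence if token == oov_index) / len(sequence)

# The unpadded test sequences printed above.
test_seq = [[1, 26, 1, 1, 1, 1, 1, 1, 1], [1, 1, 26, 1, 1, 32, 1]]

rates = [oov_rate(seq) for seq in test_seq]
for rate in rates:
    print(f"{rate:.0%} of tokens are OOV")
```

Here 8 of 9 tokens in the first sentence and 5 of 7 in the second are out of vocabulary. A high rate like this usually means the tokenizer should be fitted on a larger or more representative corpus.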
