##### Copyright 2020 The TensorFlow Authors.

# Tokenizing text and creating sequences for sentences

This colab shows you how to tokenize text and create sequences for sentences as the first stage of preparing text for use with TensorFlow models.

## Import the Tokenizer

In [1]:
# Import the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer


## Write some sentences

Feel free to change and add sentences as you like.

In [20]:
sentences = [
    'My favorite food is ice cream',
    'do you like ice cream too?',
    'My dog likes ice cream!',
    "your favorite flavor of icecream is chocolate",
    "chocolate isn't good for dogs",
    "your dog, your cat, and your parrot prefer broccoli",
    "Amir",
    "Nasir",
    "Asif",
    "Kashif",
    "Amir is a goodiest man"
]

## Tokenize the words

The first step in preparing text for a machine learning model is to tokenize it, that is, to assign a number to each word.

In [21]:
# Optionally set the maximum number of words to keep (num_words).
# The out-of-vocabulary (OOV) token represents words that are not in the word index.
# Call fit_on_texts() on the tokenizer to generate a unique number for each word.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")  # Step 1: cap the vocabulary with num_words and set the OOV token
tokenizer.fit_on_texts(sentences)                        # Step 2: build the word index, assigning a number to each word
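
Before numbers are assigned, each sentence is lowercased, stripped of punctuation, and split into words. The following is a minimal pure-Python sketch of that normalization step, not the library's implementation. The names `normalize` and `FILTERS` are hypothetical, and the filter string is assumed to mirror the Tokenizer's default `filters` argument, which leaves the apostrophe in place.

```python
# Assumption: this character set mirrors the Tokenizer's default `filters`.
# Note that the apostrophe is not included, so "isn't" stays one word.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text, filters=FILTERS):
    """Lowercase the text, replace filtered characters with spaces, and split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


print(normalize("My dog likes ice cream!"))
print(normalize("chocolate isn't good for dogs"))
print(normalize("your dog, your cat, and your parrot prefer broccoli"))
```

This is why `'My'` and `'my'` share one entry in the word index, and why `"isn't"` appears there as a single word.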


## View the word index
After you tokenize the text, the tokenizer has a word index that contains key-value pairs for all the words and their numbers.

The word is the key, and the number is the value.

Notice that the OOV token is the first entry.


In [22]:
# Examine the word index
word_index = tokenizer.word_index               # Dictionary mapping each word to its number
print(word_index)

{'<OOV>': 1, 'your': 2, 'is': 3, 'ice': 4, 'cream': 5, 'my': 6, 'favorite': 7, 'dog': 8, 'chocolate': 9, 'amir': 10, 'food': 11, 'do': 12, 'you': 13, 'like': 14, 'too': 15, 'likes': 16, 'flavor': 17, 'of': 18, 'icecream': 19, "isn't": 20, 'good': 21, 'for': 22, 'dogs': 23, 'cat': 24, 'and': 25, 'parrot': 26, 'prefer': 27, 'broccoli': 28, 'nasir': 29, 'asif': 30, 'kashif': 31, 'a': 32, 'goodiest': 33, 'man': 34}
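
The numbers in the output above follow word frequency: `'your'` appears most often and gets 2, right after the OOV token at 1. A minimal sketch of how such an index could be built is shown here. The helper `build_word_index` is hypothetical, and it assumes ties are broken by first appearance, which is consistent with the output above.

```python
from collections import Counter


def build_word_index(tokenized_sentences, oov_token="<OOV>"):
    """Assign 1 to the OOV token, then 2, 3, ... to words by descending frequency."""
    counts = Counter()
    for words in tokenized_sentences:
        counts.update(words)
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda w: counts[w], reverse=True)
    word_index = {oov_token: 1}
    for number, word in enumerate(ordered, start=2):
        word_index[word] = number
    return word_index


# A tiny, already-normalized corpus (lowercase, no punctuation).
corpus = [
    ["my", "favorite", "food", "is", "ice", "cream"],
    ["my", "dog", "likes", "ice", "cream"],
]
print(build_word_index(corpus))
```

Here `'my'`, `'ice'`, and `'cream'` each occur twice, so they receive the lowest numbers after the OOV token.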


In [23]:
# Get the number for a given word
print(word_index['favorite'])                   # This is like select id from words_num_list where word='favorite'

7


In [24]:
print(word_index['kashif'])

31


## Create sequences for the sentences

After you tokenize the words, the word index contains a unique number for each word. However, the word index only maps words to numbers; it does not record the order in which words appear in a sentence. Because word order carries meaning, the next step is to convert each sentence into a sequence of those numbers.

In [25]:
sequences = tokenizer.texts_to_sequences(sentences)      # Step 3: encode every sentence as a list of word numbers
print(sequences)

[[6, 7, 11, 3, 4, 5], [12, 13, 14, 4, 5, 15], [6, 8, 16, 4, 5], [2, 7, 17, 18, 19, 3, 9], [9, 20, 21, 22, 23], [2, 8, 2, 24, 25, 2, 26, 27, 28], [10], [29], [30], [31], [10, 3, 32, 33, 34]]
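
Each sequence keeps the words in sentence order, so it can be mapped back to text by inverting the word index. The sketch below uses a hand-copied subset of the word index printed earlier; `encode` and `decode` are hypothetical helpers for illustration, not Tokenizer methods.

```python
# Subset of the word index printed earlier in this notebook.
word_index = {"<OOV>": 1, "is": 3, "ice": 4, "cream": 5,
              "my": 6, "favorite": 7, "food": 11}

# Invert the mapping so numbers can be turned back into words.
index_word = {number: word for word, number in word_index.items()}


def encode(words):
    """Replace each word with its number, preserving order."""
    return [word_index[w] for w in words]


def decode(sequence):
    """Replace each number with its word, preserving order."""
    return " ".join(index_word[n] for n in sequence)


seq = encode("my favorite food is ice cream".split())
print(seq)          # [6, 7, 11, 3, 4, 5]
print(decode(seq))  # my favorite food is ice cream
```

The encoded list matches the first sequence in the output above.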


## Sequence sentences that contain words that are not in the word index

Let's take a look at what happens if the sentence being sequenced contains words that are not in the word index.

The out-of-vocabulary (OOV) token is the first entry in the word index, with the number 1. It appears in the sequences in place of any word that is not in the word index.

In [26]:
sentences2 = ["I like hot chocolate", "My dogs and my hedgehog like kibble but my squirrel prefers grapes and my chickens like ice cream, preferably vanilla"]

sequences2 = tokenizer.texts_to_sequences(sentences2)
print(sequences2)               # Words not in the word index are replaced by <OOV>, which is number 1

[[1, 14, 1, 9], [6, 23, 25, 6, 1, 14, 1, 1, 6, 1, 1, 1, 25, 6, 1, 14, 4, 5, 1, 1]]
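
The substitution can be sketched in plain Python as a dictionary lookup with a fallback. The helper `to_sequence` is hypothetical. Its `num_words` handling reflects an assumption about the Tokenizer's behavior, namely that words whose number is at or above `num_words` are also replaced by the OOV number, so verify this against your installed version.

```python
# Subset of the word index printed earlier in this notebook.
word_index = {"<OOV>": 1, "ice": 4, "cream": 5, "my": 6,
              "chocolate": 9, "like": 14}


def to_sequence(words, word_index, num_words=None, oov_token="<OOV>"):
    """Encode words, substituting the OOV number for unknown or capped words."""
    oov_number = word_index[oov_token]
    sequence = []
    for word in words:
        number = word_index.get(word)
        # Assumption: numbers at or above num_words are treated as OOV.
        if number is None or (num_words is not None and number >= num_words):
            sequence.append(oov_number)
        else:
            sequence.append(number)
    return sequence


words = "i like hot chocolate".split()
print(to_sequence(words, word_index))                # [1, 14, 1, 9]
print(to_sequence(words, word_index, num_words=10))  # [1, 1, 1, 9]
```

The first result matches the first sequence in the output above. With the notebook's `num_words = 100` the cap has no visible effect, because the word index holds only 34 entries.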
