
# Tokenizing text and creating sequences for sentences


## Import the Tokenizer

In [1]:
# Import the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

## Sentences to Tokenize

In [9]:
sentences = [
             'I saw someone I knew wherever I went during the fair',
             'Guests can help themselves to refreshments whenever they wish',
             'She will do her best to answer them as thoroughly as possible',
             'She and I go to the fair as a guests',
             'She will read and edit the articles for the next fair'
]

## Tokenize the words

The first step in preparing text for use in a machine learning model is to tokenize it.

In other words, generate a number for each word.

In [10]:
# Optionally set the maximum number of words to tokenize.
# The out-of-vocabulary (OOV) token represents words that are not in the index.
# Call `fit_on_texts()` on the tokenizer to generate a unique number for each word.

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
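
To make "generate numbers for the words" more concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index could be built. It is an illustration, not the Keras implementation: the helper name `build_word_index` is hypothetical, and the sketch assumes plain lowercasing and whitespace splitting with no punctuation filtering, which is enough for the sentences used here.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Hypothetical helper: rank words by frequency, reserving index 1 for OOV."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace only.
        counts.update(text.lower().split())
    # most_common() sorts by count, highest first; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

sentences = [
    'I saw someone I knew wherever I went during the fair',
    'Guests can help themselves to refreshments whenever they wish',
    'She will do her best to answer them as thoroughly as possible',
    'She and I go to the fair as a guests',
    'She will read and edit the articles for the next fair',
]

index = build_word_index(sentences)
print(index['<OOV>'], index['i'], index['answer'])  # 1 2 27
```

The most frequent words receive the smallest numbers, and the OOV token is placed ahead of all of them.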

## View the word index

After tokenizing the text, the tokenizer has a word index that contains key-value pairs for all the words and their numbers.

The word is the key, and the number is the value.

Notice that the OOV token is the first entry.

In [11]:
# Examine the word index
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'i': 2, 'the': 3, 'fair': 4, 'to': 5, 'she': 6, 'as': 7, 'guests': 8, 'will': 9, 'and': 10, 'saw': 11, 'someone': 12, 'knew': 13, 'wherever': 14, 'went': 15, 'during': 16, 'can': 17, 'help': 18, 'themselves': 19, 'refreshments': 20, 'whenever': 21, 'they': 22, 'wish': 23, 'do': 24, 'her': 25, 'best': 26, 'answer': 27, 'them': 28, 'thoroughly': 29, 'possible': 30, 'go': 31, 'a': 32, 'read': 33, 'edit': 34, 'articles': 35, 'for': 36, 'next': 37}


In [13]:
# Get the number for a given word
print(word_index['answer'])

27
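
Going the other way, from a number back to its word, only requires inverting the dictionary. The sketch below uses a hand-copied subset of the word index printed above so that it runs on its own. The Keras tokenizer also keeps an `index_word` attribute that serves the same purpose.

```python
# A subset of the word index shown above, copied by hand for illustration.
word_index = {'<OOV>': 1, 'i': 2, 'the': 3, 'fair': 4, 'to': 5, 'she': 6}

# Swap keys and values to look up a word by its number.
index_word = {number: word for word, number in word_index.items()}

print(index_word[4])  # fair
print(index_word[1])  # <OOV>
```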


## Create sequences for the sentences

After tokenizing the words, the word index contains a unique number for each word.

However, the word index numbers words by how often they occur, so it says nothing about the order in which words appear in a sentence.

So after tokenizing the words, the next step is to generate sequences for the sentences.

In [14]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[2, 11, 12, 2, 13, 14, 2, 15, 16, 3, 4], [8, 17, 18, 19, 5, 20, 21, 22, 23], [6, 9, 24, 25, 26, 5, 27, 28, 7, 29, 7, 30], [6, 10, 2, 31, 5, 3, 4, 7, 32, 8], [6, 9, 33, 10, 34, 3, 35, 36, 3, 37, 4]]
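
Because each sequence keeps the words in their original order, it can be decoded back into text. The following sketch decodes the fourth sequence from the output above; the `index_word` mapping is a hand-copied subset of the word index, and the helper name `decode` is hypothetical.

```python
# Number-to-word entries needed for the fourth sentence, copied from the word index.
index_word = {2: 'i', 3: 'the', 4: 'fair', 5: 'to', 6: 'she',
              7: 'as', 8: 'guests', 10: 'and', 31: 'go', 32: 'a'}

def decode(sequence, index_word):
    """Hypothetical helper: map each number back to its word, in order."""
    return ' '.join(index_word[number] for number in sequence)

decoded = decode([6, 10, 2, 31, 5, 3, 4, 7, 32, 8], index_word)
print(decoded)  # she and i go to the fair as a guests
```

The decoded text is lowercase because the index stores every word in lowercase, so the original capitalization is not recoverable.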


## Sequence sentences that contain words that are not in the word index

What about sentences that contain words that are not in the word index?

Hint: The OOV token is the first entry (1) in the word index.

In [15]:
sentences2 = [
              'He will be honored for his research at the fair',
              'The power cable provided with the computer is compatible with other device'
              ]
sequences2 = tokenizer.texts_to_sequences(sentences2)
print(sequences2)

[[1, 9, 1, 1, 36, 1, 1, 1, 3, 4], [3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1]]


Every word that is not in the word index is replaced by the OOV token, which appears as 1 in the sequences.
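
The fallback to the OOV token can be sketched in plain Python with a dictionary lookup that has a default value. This is an illustration under simplifying assumptions (lowercasing and whitespace splitting only), not the Keras implementation. The helper name `encode` is hypothetical, and `word_index` holds only the entries needed for the first sentence above.

```python
def encode(sentence, word_index, oov_token='<OOV>'):
    """Hypothetical helper: map words to numbers, using the OOV number for unknowns."""
    oov_number = word_index[oov_token]
    return [word_index.get(word, oov_number) for word in sentence.lower().split()]

# Subset of the word index, copied by hand for illustration.
word_index = {'<OOV>': 1, 'the': 3, 'fair': 4, 'will': 9, 'for': 36}

encoded = encode('He will be honored for his research at the fair', word_index)
print(encoded)  # [1, 9, 1, 1, 36, 1, 1, 1, 3, 4]
```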