<a href="https://colab.research.google.com/github/RLWH/tensorflow-certification-labs/blob/main/C3_W1_Lab_2_sequences_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Generating Sequences and Padding
In this lab, you will look at converting your input sentences into a sequence of tokens. Similar to images in the previous course, you need to prepare text data with uniform size before feeding it to your model. You will see how to do this in the next sections.

# Text to sequence
In the previous lab, we saw how to build a `word_index` dictionary that maps each word in our corpus to an integer token.

We can then use the result to convert each of the input sentences into a sequence of tokens. This is done using the `texts_to_sequences()` method as shown below. 

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define our input texts
sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?"
]

# Initialise the Tokenizer class
tokenizer = Tokenizer(num_words=100, oov_token="")

# Tokenise the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index

# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)


Word Index =  {'': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
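
To make the mapping concrete, here is a minimal pure-Python sketch of the same idea: rank words by frequency, then replace each word with its index. It is not how Keras implements `Tokenizer`. `simple_tokenize` is a hypothetical helper that only approximates the default filtering (lowercasing and treating ASCII punctuation as spaces), and numbering starts at 2 on the assumption that index 1 is reserved for the OOV token, as in the output above.

```python
from collections import Counter
import string

sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?",
]

# Hypothetical helper: lowercase, turn ASCII punctuation into spaces, split.
# This is a simplification of the Tokenizer's default filters.
PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})

def simple_tokenize(text):
    return text.lower().translate(PUNCT_TO_SPACE).split()

# Count every word across the corpus.
counts = Counter()
for sentence in sentences:
    counts.update(simple_tokenize(sentence))

# Most frequent word gets the lowest index; ties keep first-seen order.
# Index 1 is left free for the OOV token (assumption stated above).
word_index = {
    word: index
    for index, (word, _) in enumerate(counts.most_common(), start=2)
}

# Replace each word with its index.
sequences = [[word_index[w] for w in simple_tokenize(s)] for s in sentences]

print(word_index)
print(sequences)
```

For these four sentences the sketch reproduces the sequences printed by the cell above.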


# Padding
We will usually need to pad the sequences into a uniform length because that is what our model expects. 

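We can use the `pad_sequences` function for that. By default, it pads to the length of the longest sequence. We can override this with the `maxlen` argument to set a specific length. Padding zeros are added at the front by default, and sequences longer than `maxlen` are also truncated from the front, which is why the last sequence in the output below loses its first two tokens.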

In [3]:
# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)

# Print the result
print("\nPadded Sequences: \n", padded)


Padded Sequences: 
 [[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]
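
The output above reflects front padding and front truncation. As a rough illustration, the sketch below reimplements that behaviour in plain Python and contrasts it with padding and truncating at the end. The `pad` function is a hypothetical stand-in written for this example, not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad(seqs, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: pad or truncate each list to exactly maxlen items."""
    result = []
    for seq in seqs:
        # Truncate first: keep the last maxlen items ("pre") or the first ("post").
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        filler = [value] * (maxlen - len(kept))
        # Then pad: filler goes in front ("pre") or at the end ("post").
        result.append(filler + kept if padding == "pre" else kept + filler)
    return result

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

pre_padded = pad(sequences, maxlen=5)
post_padded = pad(sequences, maxlen=5, padding="post", truncating="post")

print(pre_padded)   # matches the padded output above
print(post_padded)  # zeros at the end; the long sequence keeps its first 5 tokens
```

With `"post"` for both options, the last sequence keeps `[8, 6, 9, 2, 4]` instead of `[9, 2, 4, 10, 11]`. Which end to cut is a design choice that depends on where the informative words tend to sit in your text.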


# Out-of-vocabulary tokens
Notice that we defined an `oov_token` when the `Tokenizer` was initialised earlier. This token is used for input words that are not found in the `word_index` dictionary.
For example, we may collect more text after our initial training and choose not to regenerate the `word_index`. We will see this in action in the cell below.

Notice that the token `1` is inserted for words that are not found in the dictionary.

In [4]:
# Try with words that the tokenizer was not fitted on
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Generate the sequences
test_seq = tokenizer.texts_to_sequences(test_data)

# Print the word index dictionary
print("\nWord Index = ", word_index)

# Print the sequences with OOV
print("\nTest Sequences = ", test_seq)


Word Index =  {'': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Test Sequences =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
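
The fallback to token `1` can be pictured as a dictionary lookup with a default value. The sketch below is a simplified illustration under that assumption, not the Keras code path: `encode_with_oov` is a hypothetical helper, the `word_index` is copied from the output above (minus the OOV entry itself), and the text is split on whitespace only.

```python
# Index reserved for out-of-vocabulary words, as seen in the output above.
OOV_INDEX = 1

# Known words, copied from the printed word index.
word_index = {
    'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
    'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11,
}

test_data = [
    'i really love my dog',
    'my dog loves my manatee',
]

def encode_with_oov(text):
    """Hypothetical helper: unknown words fall back to OOV_INDEX."""
    return [word_index.get(word, OOV_INDEX) for word in text.lower().split()]

test_seq = [encode_with_oov(text) for text in test_data]

# Collect the words that triggered the fallback.
unseen = sorted(
    {word for text in test_data for word in text.lower().split()
     if word not in word_index}
)

print(test_seq)
print(unseen)
```

Note that `loves` is treated as unknown even though `love` is in the dictionary, because the lookup is an exact match on the word.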


In [5]:
# Print the padded result
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]
