# **TEXT TO SEQUENCES**

+ Here we create sequences from sentences.
+ Steps to prepare the data:
  + Tokenize the words.
  + Create sequences from the sentences.
  + Make all the sequences the same length.
+ Use padding and truncating to make all the sequences the same length.
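
The first two steps can be sketched in plain Python to show what the tokenizer is doing. This is only an illustration, not the Keras implementation: `split_words` and `build_word_index` are hypothetical helpers, and `FILTERS` is assumed to match the Tokenizer's default `filters` argument. Words are ranked by frequency, ties keep first-appearance order, and index 1 is reserved for the OOV token.

```python
from collections import Counter

# Characters replaced by spaces before splitting. Assumed to match the
# Tokenizer's default filters; the apostrophe is kept, so "isn't" is one word.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase the text, blank out filtered characters, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: number words by descending frequency, OOV first."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so equally frequent words keep first-appearance order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


sentences = [
    'My favorite food is ice cream',
    'do you like ice cream too?',
    'My dog likes ice cream!',
    "your favorite flavor of icecream is chocolate",
    "chocolate isn't good for dogs",
    "your dog, your cat, and your parrot prefer broccoli",
]

word_index = build_word_index(sentences)
oov_id = word_index["<OOV>"]
# Replace each word by its number in the word index.
sequences = [[word_index.get(w, oov_id) for w in split_words(s)] for s in sentences]

print(word_index)
print(sequences)
```

For these six sentences the sketch gives the same word index and sequences as the Tokenizer output shown further down.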

**PADDING AND TRUNCATING**

+ By default, `pad_sequences` pads to the length of the longest sequence.
+ Specify `maxlen` to set the length of every sequence.
+ By default, sequences are padded and truncated at the beginning of the sequence.
+ Specify `padding = "post"` to pad at the end of the sequence.
+ Specify `truncating = "post"` to truncate at the end of the sequence.
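
The rules above can be mimicked in a few lines of plain Python. This is a minimal sketch of the semantics, not the Keras implementation: `pad_sketch` is a hypothetical helper that returns nested lists rather than a NumPy array, and its parameter names are chosen to mirror `maxlen`, `padding`, and `truncating`.

```python
def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]   # drop items from the beginning
            else:
                seq = seq[:maxlen]              # drop items from the end
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


# Two of the sequences produced later in this notebook.
seqs = [[5, 8, 15, 3, 4], [2, 8, 2, 23, 24, 2, 25, 26, 27]]

print(pad_sketch(seqs))                                   # pad to the longest
print(pad_sketch(seqs, maxlen=3))                         # truncate at the start
print(pad_sketch(seqs, maxlen=3, truncating="post"))      # truncate at the end
print(pad_sketch(seqs, maxlen=7, padding="post", truncating="post"))
```

The last call shows padding and truncating together: the short sequence gets zeros appended, while the long one loses its final two items.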

**IMPORTS**

In [1]:
# Import Tokenizer and pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [12]:
## define some sentences
sentences = [
    'My favorite food is ice cream',
    'do you like ice cream too?',
    'My dog likes ice cream!',
    "your favorite flavor of icecream is chocolate",
    "chocolate isn't good for dogs",
    "your dog, your cat, and your parrot prefer broccoli"
]
print(sentences)

## When creating the Tokenizer, you can cap the vocabulary with num_words:
## only the most frequent words are kept when sequences are generated.
## You can also specify a token to represent out-of-vocabulary (OOV) words,
## that is, words that are not in the word index.
## The OOV token is used when you create sequences for sentences
## that contain words that are not in the word index.
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
## tokenize the words using fit_on_texts
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

## The next step in representing text so that machine learning programs can use it
## is to create numerical sequences that represent the sentences.
## Each sentence is converted into a sequence in which each word is replaced by its number in the word index.
sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences :-",sequences)

## Pad the sequences. This call only pads; nothing is truncated,
## because the default length is that of the longest sequence.
## pad_sequences can pad only, truncate only, or do both at once.
padded = pad_sequences(sequences)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)

# Specify a max length for the padded sequences
padded = pad_sequences(sequences, maxlen=15)
print("\nPadded Sequence with length 15 :--",padded)

# Put the padding at the end of the sequences
padded = pad_sequences(sequences, maxlen=15, padding="post")
print("\nPadding at the End of the Sequence with length 15 :-",padded)

# Limit the length of the sequences; some sequences get truncated.
## By default, truncation happens at the beginning of each sequence.
padded = pad_sequences(sequences, maxlen=3)
print("\nTruncated Sequences to the Length 3 :-",padded)

## Truncate at the end of the sequences instead.
padded = pad_sequences(sequences, maxlen=3, truncating="post")
print("\nTruncated Sequences to the end by Length 3 :-",padded)

['My favorite food is ice cream', 'do you like ice cream too?', 'My dog likes ice cream!', 'your favorite flavor of icecream is chocolate', "chocolate isn't good for dogs", 'your dog, your cat, and your parrot prefer broccoli']
{'<OOV>': 1, 'your': 2, 'ice': 3, 'cream': 4, 'my': 5, 'favorite': 6, 'is': 7, 'dog': 8, 'chocolate': 9, 'food': 10, 'do': 11, 'you': 12, 'like': 13, 'too': 14, 'likes': 15, 'flavor': 16, 'of': 17, 'icecream': 18, "isn't": 19, 'good': 20, 'for': 21, 'dogs': 22, 'cat': 23, 'and': 24, 'parrot': 25, 'prefer': 26, 'broccoli': 27}

Sequences :- [[5, 6, 10, 7, 3, 4], [11, 12, 13, 3, 4, 14], [5, 8, 15, 3, 4], [2, 6, 16, 17, 18, 7, 9], [9, 19, 20, 21, 22], [2, 8, 2, 23, 24, 2, 25, 26, 27]]

Word Index =  {'<OOV>': 1, 'your': 2, 'ice': 3, 'cream': 4, 'my': 5, 'favorite': 6, 'is': 7, 'dog': 8, 'chocolate': 9, 'food': 10, 'do': 11, 'you': 12, 'like': 13, 'too': 14, 'likes': 15, 'flavor': 16, 'of': 17, 'icecream': 18, "isn't": 19, 'good': 20, 'for': 21, 'dogs': 22, 'cat': 2

What happens if some of the sentences contain words that are not in the word index?

Try generating sequences for some sentences that have words that are not in the word index.

In [13]:
# Try turning sentences that contain words that 
# aren't in the word index into sequences.
# Add your own sentences to the test_data
test_data = [
    "my best friend's favorite ice cream flavor is strawberry",
    "my dog's best friend is a manatee"
]
print(test_data)

# Remind ourselves which number corresponds to the
# out of vocabulary token in the word index
print("<OOV> has the number", word_index['<OOV>'], "in the word index.")

# Convert the test sentences to sequences
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Pad the new sequences
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")

# Notice that "1" appears in the sequence wherever there's a word 
# that's not in the word index
print(padded)

["my best friend's favorite ice cream flavor is strawberry", "my dog's best friend is a manatee"]
<OOV> has the number 1 in the word index.

Test Sequence =  [[5, 1, 1, 6, 3, 4, 16, 7, 1], [5, 1, 1, 1, 7, 1, 1]]

Padded Test Sequence: 
[[ 0  5  1  1  6  3  4 16  7  1]
 [ 0  0  0  5  1  1  1  7  1  1]]
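
The OOV fallback seen in the output above can be pictured as a dictionary lookup with a default value. The sketch below is plain Python and only an illustration under stated assumptions: `to_sequence` is a hypothetical helper, the word index is a hand-copied subset of the one printed earlier, and splitting on whitespace is a simplification that happens to be enough for these two sentences.

```python
# Hand-copied subset of the word index printed earlier in this notebook.
word_index = {"<OOV>": 1, "ice": 3, "cream": 4, "my": 5, "favorite": 6,
              "is": 7, "dog": 8, "flavor": 16}


def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Hypothetical helper: map each word to its index, or to the OOV index."""
    oov_id = word_index[oov_token]
    # Simplification: lowercase and split on whitespace only.
    return [word_index.get(word, oov_id) for word in sentence.lower().split()]


test_data = [
    "my best friend's favorite ice cream flavor is strawberry",
    "my dog's best friend is a manatee",
]
test_seq = [to_sequence(s, word_index) for s in test_data]
print(test_seq)
# [[5, 1, 1, 6, 3, 4, 16, 7, 1], [5, 1, 1, 1, 7, 1, 1]]
```

Note that "dog's" maps to 1 even though "dog" is in the word index, because the apostrophe keeps it a different word.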


***