The high-level steps for preparing text for use in a machine learning model are:

1.   Tokenize the words to get numerical values for them.
2.   Create numerical sequences of the sentences.
3.   Adjust the sequences to all be the same length.



## Import the classes you need

In [1]:
# Import Tokenizer and pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Write some sentences


In [2]:
sentences = [
    'I love eating pizza on weekends',
    'Do you enjoy playing soccer?',
    'My cat naps all day',
    'Your garden looks beautiful with roses',
    'Fresh apples are my go-to snack',
    'The movie was thrilling and suspenseful',
    'Hiking in the mountains is refreshing',
    'She reads a book every week',
    'The sunset over the ocean is breathtaking',
    'Baking cookies is a fun activity',
    'I often ride my bike to work',
    'Your painting is quite impressive',
    'Traveling to new countries excites me',
    'He practices the piano daily',
    'Our team won the championship game',
    'Coffee in the morning wakes me up',
    'The new restaurant downtown is fantastic',
    'Learning new languages is beneficial',
    'The thunderstorm scared my cat',
    'Gardening is a peaceful hobby',
    'They enjoy kayaking on the lake',
    'His artwork is displayed in the gallery',
    'I plan to visit the museum tomorrow',
    'Your car is really fast',
    'The concert last night was amazing',
    'I need to buy groceries today',
    'She volunteers at the animal shelter',
    'The library is a quiet place to study',
    'Their wedding was beautiful and touching',
    'Yoga helps me relax and unwind',
    'My favorite food is ice cream',
    'do you like ice cream too?',
    'My dog likes ice cream!',
    "your favorite flavor of icecream is chocolate",
    "chocolate isn't good for dogs",
    "your dog, your cat, and your parrot prefer broccoli"
]
print(sentences)


['I love eating pizza on weekends', 'Do you enjoy playing soccer?', 'My cat naps all day', 'Your garden looks beautiful with roses', 'Fresh apples are my go-to snack', 'The movie was thrilling and suspenseful', 'Hiking in the mountains is refreshing', 'She reads a book every week', 'The sunset over the ocean is breathtaking', 'Baking cookies is a fun activity', 'I often ride my bike to work', 'Your painting is quite impressive', 'Traveling to new countries excites me', 'He practices the piano daily', 'Our team won the championship game', 'Coffee in the morning wakes me up', 'The new restaurant downtown is fantastic', 'Learning new languages is beneficial', 'The thunderstorm scared my cat', 'Gardening is a peaceful hobby', 'They enjoy kayaking on the lake', 'His artwork is displayed in the gallery', 'I plan to visit the museum tomorrow', 'Your car is really fast', 'The concert last night was amazing', 'I need to buy groceries today', 'She volunteers at the animal shelter', 'The library 

## Create the Tokenizer and define an out of vocabulary token
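When creating the Tokenizer, you can specify the maximum number of words to use with num_words. Note that the word index still records every word seen during fitting; the limit is applied when sequences are generated, where only words whose index is below num_words keep their own number. You can also specify a token to represent out-of-vocabulary (OOV) words. This OOV token is used when you create sequences for sentences containing words that are not in the word index, or whose index falls outside the num_words limit.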
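To make the idea of a word index concrete, here is a minimal pure-Python sketch of a frequency-ranked index with the first slot reserved for the OOV token. The function name build_word_index and the regex-based word splitting are assumptions made for illustration; this is a simplification, not the Keras Tokenizer's actual implementation.

```python
import re
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization (an assumption): lowercase the text and
        # keep runs of letters and apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


toy_index = build_word_index(["My cat naps", "my dog naps", "My parrot sings"])
print(toy_index)
```

More frequent words receive smaller numbers, which is why common words such as "the" and "is" sit near the top of the word index printed in this notebook.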

In [3]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

## Tokenize the words

In [4]:
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'the': 2, 'is': 3, 'your': 4, 'my': 5, 'to': 6, 'i': 7, 'and': 8, 'a': 9, 'cat': 10, 'was': 11, 'in': 12, 'new': 13, 'me': 14, 'ice': 15, 'cream': 16, 'on': 17, 'do': 18, 'you': 19, 'enjoy': 20, 'beautiful': 21, 'she': 22, 'favorite': 23, 'dog': 24, 'chocolate': 25, 'love': 26, 'eating': 27, 'pizza': 28, 'weekends': 29, 'playing': 30, 'soccer': 31, 'naps': 32, 'all': 33, 'day': 34, 'garden': 35, 'looks': 36, 'with': 37, 'roses': 38, 'fresh': 39, 'apples': 40, 'are': 41, 'go': 42, 'snack': 43, 'movie': 44, 'thrilling': 45, 'suspenseful': 46, 'hiking': 47, 'mountains': 48, 'refreshing': 49, 'reads': 50, 'book': 51, 'every': 52, 'week': 53, 'sunset': 54, 'over': 55, 'ocean': 56, 'breathtaking': 57, 'baking': 58, 'cookies': 59, 'fun': 60, 'activity': 61, 'often': 62, 'ride': 63, 'bike': 64, 'work': 65, 'painting': 66, 'quite': 67, 'impressive': 68, 'traveling': 69, 'countries': 70, 'excites': 71, 'he': 72, 'practices': 73, 'piano': 74, 'daily': 75, 'our': 76, 'team': 77, 'won'

## Turn sentences into sequences

Each word now has a unique number in the word index.  However, words in a sentence are in a specific order. You can't just randomly mix up words and have the outcome be a sentence.

For example, although "chocolate isn't good for dogs" is a perfectly fine sentence, "dogs isn't for chocolate good" does not make sense as a sentence.

So the next step in representing text in a form that machine learning programs can use meaningfully is to create numerical sequences that represent the sentences in the text.

Each sentence will be converted into a sequence where each word is replaced by its number in the word index.
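As a rough illustration of why order matters, the sketch below looks up each word of a sentence in a small hand-written index. The helper to_sequence is hypothetical, and toy_index is a hand-picked subset whose numbers are read off the sequences printed in this notebook; it splits on whitespace only and assumes every word is present in the index.

```python
def to_sequence(sentence, word_index):
    """Replace each word with its number, preserving word order."""
    return [word_index[word] for word in sentence.lower().split()]


# Hand-picked subset of the word index, for illustration only.
toy_index = {"the": 2, "my": 5, "cat": 10, "thunderstorm": 91, "scared": 92}

original = to_sequence("The thunderstorm scared my cat", toy_index)
shuffled = to_sequence("My cat scared the thunderstorm", toy_index)
print(original)  # [2, 91, 92, 5, 10]
print(shuffled)  # [5, 10, 92, 2, 91]
```

Both sentences use exactly the same words, yet they produce different sequences, so the order information is kept.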

In [5]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[7, 26, 27, 28, 17, 29], [18, 19, 20, 30, 31], [5, 10, 32, 33, 34], [4, 35, 36, 21, 37, 38], [39, 40, 41, 5, 42, 6, 43], [2, 44, 11, 45, 8, 46], [47, 12, 2, 48, 3, 49], [22, 50, 9, 51, 52, 53], [2, 54, 55, 2, 56, 3, 57], [58, 59, 3, 9, 60, 61], [7, 62, 63, 5, 64, 6, 65], [4, 66, 3, 67, 68], [69, 6, 13, 70, 71, 14], [72, 73, 2, 74, 75], [76, 77, 78, 2, 79, 80], [81, 12, 2, 82, 83, 14, 84], [2, 13, 85, 86, 3, 87], [88, 13, 89, 3, 90], [2, 91, 92, 5, 10], [93, 3, 9, 94, 95], [96, 20, 97, 17, 2, 98], [99, 1, 3, 1, 12, 2, 1], [7, 1, 6, 1, 2, 1, 1], [4, 1, 3, 1, 1], [2, 1, 1, 1, 11, 1], [7, 1, 6, 1, 1, 1], [22, 1, 1, 2, 1, 1], [2, 1, 3, 9, 1, 1, 6, 1], [1, 1, 11, 21, 8, 1], [1, 1, 14, 1, 8, 1], [5, 23, 1, 3, 15, 16], [18, 19, 1, 15, 16, 1], [5, 24, 1, 15, 16], [4, 23, 1, 1, 1, 3, 25], [25, 1, 1, 1, 1], [4, 24, 4, 10, 8, 4, 1, 1, 1]]


## Make the sequences all the same length

Later, when you feed the sequences into a neural network to train a model, the sequences all need to be the same size. Currently the sequences have varied lengths, so the next step is to make them uniform by padding them with zeros, truncating them, or both.

Use tf.keras.preprocessing.sequence.pad_sequences to add zeros to the sequences so that they all have the same length. By default, the padding goes at the start of the sequences, but you can specify to pad at the end.

You can optionally specify the maximum length to pad the sequences to. Sequences that are longer than the specified max length will be truncated. By default, sequences are truncated from the beginning of the sequence, but you can specify to truncate from the end.

If you don't provide the max length, then the sequences are padded to match the length of the longest sequence.
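The padding and truncation rules described above can be sketched for a single sequence in plain Python. The function pad_one is a hypothetical helper written for this illustration, not part of Keras; its padding and truncating arguments mirror the "pre" and "post" options of pad_sequences.

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one list of ints to exactly maxlen items."""
    if len(seq) > maxlen:
        if truncating == "pre":
            # Drop items from the front, keeping the last maxlen items.
            seq = seq[len(seq) - maxlen:]
        else:
            # Drop items from the back, keeping the first maxlen items.
            seq = seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


seq = [7, 26, 27, 28, 17, 29]  # "I love eating pizza on weekends"
print(pad_one(seq, 8))                     # [0, 0, 7, 26, 27, 28, 17, 29]
print(pad_one(seq, 8, padding="post"))     # [7, 26, 27, 28, 17, 29, 0, 0]
print(pad_one(seq, 3))                     # [28, 17, 29]
print(pad_one(seq, 3, truncating="post"))  # [7, 26, 27]
```

Whether to pad and truncate at the start or the end depends on the model; for example, truncating from the start keeps the end of each sentence, which may or may not be the part that matters for a given task.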

For all the options when padding and truncating sequences, see https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences



In [6]:
padded = pad_sequences(sequences)
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)



Word Index =  {'<OOV>': 1, 'the': 2, 'is': 3, 'your': 4, 'my': 5, 'to': 6, 'i': 7, 'and': 8, 'a': 9, 'cat': 10, 'was': 11, 'in': 12, 'new': 13, 'me': 14, 'ice': 15, 'cream': 16, 'on': 17, 'do': 18, 'you': 19, 'enjoy': 20, 'beautiful': 21, 'she': 22, 'favorite': 23, 'dog': 24, 'chocolate': 25, 'love': 26, 'eating': 27, 'pizza': 28, 'weekends': 29, 'playing': 30, 'soccer': 31, 'naps': 32, 'all': 33, 'day': 34, 'garden': 35, 'looks': 36, 'with': 37, 'roses': 38, 'fresh': 39, 'apples': 40, 'are': 41, 'go': 42, 'snack': 43, 'movie': 44, 'thrilling': 45, 'suspenseful': 46, 'hiking': 47, 'mountains': 48, 'refreshing': 49, 'reads': 50, 'book': 51, 'every': 52, 'week': 53, 'sunset': 54, 'over': 55, 'ocean': 56, 'breathtaking': 57, 'baking': 58, 'cookies': 59, 'fun': 60, 'activity': 61, 'often': 62, 'ride': 63, 'bike': 64, 'work': 65, 'painting': 66, 'quite': 67, 'impressive': 68, 'traveling': 69, 'countries': 70, 'excites': 71, 'he': 72, 'practices': 73, 'piano': 74, 'daily': 75, 'our': 76, 't

In [7]:
# Specify a max length for the padded sequences
padded = pad_sequences(sequences, maxlen=15)
print(padded)

[[ 0  0  0  0  0  0  0  0  0  7 26 27 28 17 29]
 [ 0  0  0  0  0  0  0  0  0  0 18 19 20 30 31]
 [ 0  0  0  0  0  0  0  0  0  0  5 10 32 33 34]
 [ 0  0  0  0  0  0  0  0  0  4 35 36 21 37 38]
 [ 0  0  0  0  0  0  0  0 39 40 41  5 42  6 43]
 [ 0  0  0  0  0  0  0  0  0  2 44 11 45  8 46]
 [ 0  0  0  0  0  0  0  0  0 47 12  2 48  3 49]
 [ 0  0  0  0  0  0  0  0  0 22 50  9 51 52 53]
 [ 0  0  0  0  0  0  0  0  2 54 55  2 56  3 57]
 [ 0  0  0  0  0  0  0  0  0 58 59  3  9 60 61]
 [ 0  0  0  0  0  0  0  0  7 62 63  5 64  6 65]
 [ 0  0  0  0  0  0  0  0  0  0  4 66  3 67 68]
 [ 0  0  0  0  0  0  0  0  0 69  6 13 70 71 14]
 [ 0  0  0  0  0  0  0  0  0  0 72 73  2 74 75]
 [ 0  0  0  0  0  0  0  0  0 76 77 78  2 79 80]
 [ 0  0  0  0  0  0  0  0 81 12  2 82 83 14 84]
 [ 0  0  0  0  0  0  0  0  0  2 13 85 86  3 87]
 [ 0  0  0  0  0  0  0  0  0  0 88 13 89  3 90]
 [ 0  0  0  0  0  0  0  0  0  0  2 91 92  5 10]
 [ 0  0  0  0  0  0  0  0  0  0 93  3  9 94 95]
 [ 0  0  0  0  0  0  0  0  0 96 20 97 17

In [8]:
# Put the padding at the end of the sequences
padded = pad_sequences(sequences, maxlen=15, padding="post")
print(padded)

[[ 7 26 27 28 17 29  0  0  0  0  0  0  0  0  0]
 [18 19 20 30 31  0  0  0  0  0  0  0  0  0  0]
 [ 5 10 32 33 34  0  0  0  0  0  0  0  0  0  0]
 [ 4 35 36 21 37 38  0  0  0  0  0  0  0  0  0]
 [39 40 41  5 42  6 43  0  0  0  0  0  0  0  0]
 [ 2 44 11 45  8 46  0  0  0  0  0  0  0  0  0]
 [47 12  2 48  3 49  0  0  0  0  0  0  0  0  0]
 [22 50  9 51 52 53  0  0  0  0  0  0  0  0  0]
 [ 2 54 55  2 56  3 57  0  0  0  0  0  0  0  0]
 [58 59  3  9 60 61  0  0  0  0  0  0  0  0  0]
 [ 7 62 63  5 64  6 65  0  0  0  0  0  0  0  0]
 [ 4 66  3 67 68  0  0  0  0  0  0  0  0  0  0]
 [69  6 13 70 71 14  0  0  0  0  0  0  0  0  0]
 [72 73  2 74 75  0  0  0  0  0  0  0  0  0  0]
 [76 77 78  2 79 80  0  0  0  0  0  0  0  0  0]
 [81 12  2 82 83 14 84  0  0  0  0  0  0  0  0]
 [ 2 13 85 86  3 87  0  0  0  0  0  0  0  0  0]
 [88 13 89  3 90  0  0  0  0  0  0  0  0  0  0]
 [ 2 91 92  5 10  0  0  0  0  0  0  0  0  0  0]
 [93  3  9 94 95  0  0  0  0  0  0  0  0  0  0]
 [96 20 97 17  2 98  0  0  0  0  0  0  0

In [9]:
# Limit the length of the sequences; sequences longer than maxlen are truncated (from the start by default)
padded = pad_sequences(sequences, maxlen=3)
print(padded)

[[28 17 29]
 [20 30 31]
 [32 33 34]
 [21 37 38]
 [42  6 43]
 [45  8 46]
 [48  3 49]
 [51 52 53]
 [56  3 57]
 [ 9 60 61]
 [64  6 65]
 [ 3 67 68]
 [70 71 14]
 [ 2 74 75]
 [ 2 79 80]
 [83 14 84]
 [86  3 87]
 [89  3 90]
 [92  5 10]
 [ 9 94 95]
 [17  2 98]
 [12  2  1]
 [ 2  1  1]
 [ 3  1  1]
 [ 1 11  1]
 [ 1  1  1]
 [ 2  1  1]
 [ 1  6  1]
 [21  8  1]
 [ 1  8  1]
 [ 3 15 16]
 [15 16  1]
 [ 1 15 16]
 [ 1  3 25]
 [ 1  1  1]
 [ 1  1  1]]


## What happens if some of the sentences contain words that are not in the word index?

Here's where the "out of vocabulary" token is used. Try generating sequences for some sentences that have words that are not in the word index.
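Before running the real Tokenizer on new sentences, it may help to see the lookup logic in isolation. The sketch below is a simplified stand-in, not the Keras implementation: encode is a hypothetical helper that splits on whitespace, and toy_index is a hand-picked subset of the word index. It shows three cases: unknown words mapped to the OOV number, unknown words dropped when no OOV token is defined, and known words replaced because their index is not below num_words.

```python
def encode(sentence, word_index, num_words=None, oov_token=None):
    """Map words to numbers, handling unknown and out-of-limit words."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    seq = []
    for word in sentence.lower().split():
        idx = word_index.get(word)
        known = idx is not None and (num_words is None or idx < num_words)
        if known:
            seq.append(idx)
        elif oov_index is not None:
            seq.append(oov_index)
        # Otherwise the word is silently dropped from the sequence.
    return seq


# Hand-picked subset of the word index, for illustration only.
toy_index = {"<OOV>": 1, "is": 3, "my": 5, "a": 9}
sentence = "my dog's best friend is a manatee"

with_oov = encode(sentence, toy_index, oov_token="<OOV>")
without_oov = encode(sentence, toy_index)
capped = encode(sentence, toy_index, num_words=6, oov_token="<OOV>")

print(with_oov)     # [5, 1, 1, 1, 3, 9, 1]
print(without_oov)  # [5, 3, 9]
print(capped)       # [5, 1, 1, 1, 3, 1, 1]
```

With an OOV token the sequence keeps one number per word, so sentence length and word positions are preserved; without one, the sequence shrinks and the model cannot tell that words were missing.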

In [10]:
# Try turning sentences that contain words that aren't in the word index into sequences.
test_data = [
    "my best friend's favorite ice cream flavor is strawberry",
    "my dog's best friend is a manatee",
    "the sun sets beautifully over the horizon",
    "we love to travel to exotic destinations",
    "she enjoys painting landscapes in her free time",
    "his favorite book is a classic novel",
    "the garden is blooming with colorful flowers",
    "they went on an adventure through the forest",
    "I like to relax by the lake on weekends",
    "our family enjoys hosting BBQ parties in the backyard",
    "my cat's favorite toy is a feather on a string",
    "reading under a tree is very calming",
    "the best way to start the day is with a jog",
    "she bakes the most delicious cakes",
    "he plays guitar in a local band",
    "we often have game nights with friends",
    "the view from the mountain top is spectacular",
    "cooking new recipes is a fun challenge",
    "he enjoys building model airplanes",
    "our neighbors have the cutest puppy",
    "visiting the zoo is always exciting",
    "my sister loves collecting vintage postcards",
    "watching the stars on a clear night is magical",
    "she practices yoga every morning",
    "they have a beautiful fish tank in their living room",
    "he's passionate about restoring old cars",
    "going to the beach is my favorite summer activity",
    "she often writes poetry in her journal",
    "the farmer's market has fresh produce every week",
    "their home is filled with unique artwork",
    "she loves hiking trails in the national park",
    "the city's skyline looks amazing at night",
    "he enjoys photography as a hobby",
    "our family reunions are always filled with laughter"
]
print(test_data)

# Remind ourselves which number corresponds to the out of vocabulary token in the word index
print("<OOV> has the number", word_index['<OOV>'], "in the word index.")

# Convert the test sentences to sequences
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Pad the new sequences
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")

# Notice that "1" appears in the sequence wherever there's a word
# that's not in the word index or that falls outside the num_words limit
print(padded)

["my best friend's favorite ice cream flavor is strawberry", "my dog's best friend is a manatee", 'the sun sets beautifully over the horizon', 'we love to travel to exotic destinations', 'she enjoys painting landscapes in her free time', 'his favorite book is a classic novel', 'the garden is blooming with colorful flowers', 'they went on an adventure through the forest', 'I like to relax by the lake on weekends', 'our family enjoys hosting BBQ parties in the backyard', "my cat's favorite toy is a feather on a string", 'reading under a tree is very calming', 'the best way to start the day is with a jog', 'she bakes the most delicious cakes', 'he plays guitar in a local band', 'we often have game nights with friends', 'the view from the mountain top is spectacular', 'cooking new recipes is a fun challenge', 'he enjoys building model airplanes', 'our neighbors have the cutest puppy', 'visiting the zoo is always exciting', 'my sister loves collecting vintage postcards', 'watching the stars