
# Tokenizing text, creating sequences for sentences, and padding

## Import the Tokenizer

In [1]:
# Import the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


## Write some sentences

Feel free to change the sentences or add your own.

In [2]:
sentences = [
    'CSS stands for Cascading Style Sheets',
    'CSS describes how HTML elements are to be displayed on screen, paper, or in other media',
    'CSS saves a lot of work. It can control the layout of multiple web pages all at once',
    'External stylesheets are stored in CSS files',
    'In a programming language, variables are used to store data values.',
    'JavaScript uses the var keyword to declare variables.',
    'An equal sign is used to assign values to variables',
]

## Tokenize the words

The first step in preparing text for a machine learning model is to tokenize it, in other words, to assign a number to each word.

In [3]:
# Optionally set the maximum number of words to keep, ranked by frequency.
# The out-of-vocabulary (OOV) token represents words that are not in the index.
# Call fit_on_texts() on the tokenizer to generate a unique number for each word.
tokenizer = Tokenizer(num_words=150, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)


## View the word index
After you tokenize the text, the tokenizer has a word index that contains key-value pairs for all the words and their numbers.

The word is the key, and the number is the value.

Notice that the OOV token is the first entry.


In [4]:
# Examine the word index
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'to': 2, 'css': 3, 'are': 4, 'in': 5, 'variables': 6, 'a': 7, 'of': 8, 'the': 9, 'used': 10, 'values': 11, 'stands': 12, 'for': 13, 'cascading': 14, 'style': 15, 'sheets': 16, 'describes': 17, 'how': 18, 'html': 19, 'elements': 20, 'be': 21, 'displayed': 22, 'on': 23, 'screen': 24, 'paper': 25, 'or': 26, 'other': 27, 'media': 28, 'saves': 29, 'lot': 30, 'work': 31, 'it': 32, 'can': 33, 'control': 34, 'layout': 35, 'multiple': 36, 'web': 37, 'pages': 38, 'all': 39, 'at': 40, 'once': 41, 'external': 42, 'stylesheets': 43, 'stored': 44, 'files': 45, 'programming': 46, 'language': 47, 'store': 48, 'data': 49, 'javascript': 50, 'uses': 51, 'var': 52, 'keyword': 53, 'declare': 54, 'an': 55, 'equal': 56, 'sign': 57, 'is': 58, 'assign': 59}


In [5]:
# Get the number for a given word
print(word_index['stylesheets'])

43
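
To make the numbering less mysterious, here is a minimal pure-Python sketch of how a frequency-ranked word index like the one above could be built. It is not the Keras implementation: `build_word_index` and the `FILTERS` string are hypothetical names chosen for illustration, and the punctuation handling is a simplification.

```python
from collections import Counter

# Characters treated as separators (an assumption made for this sketch).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: rank words by frequency, reserving 1 for OOV."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        # Lowercase, turn punctuation into spaces, then split on whitespace.
        counts.update(text.lower().translate(table).split())
    index = {oov_token: 1}
    # most_common() sorts by descending count; ties keep first-seen order.
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


sample_sentences = [
    "CSS stands for Cascading Style Sheets",
    "External stylesheets are stored in CSS files",
]
sample_index = build_word_index(sample_sentences)
print(sample_index)
```

More frequent words get smaller numbers, which is why `css` lands right after the OOV token in this two-sentence sample. The number 0 is never assigned to a word, which leaves it free to act as the padding value later on.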


## Create sequences for the sentences

After you tokenize the words, the word index contains a unique number for each word. However, the word index says nothing about the order in which words appear in a sentence, and that order matters. So the next step is to convert each sentence into a sequence of numbers that preserves its word order.

In [5]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[3, 12, 13, 14, 15, 16], [3, 17, 18, 19, 20, 4, 2, 21, 22, 23, 24, 25, 26, 5, 27, 28], [3, 29, 7, 30, 8, 31, 32, 33, 34, 9, 35, 8, 36, 37, 38, 39, 40, 41], [42, 43, 4, 44, 5, 3, 45], [5, 7, 46, 47, 6, 4, 10, 2, 48, 49, 11], [50, 51, 9, 52, 53, 2, 54, 6], [55, 56, 57, 58, 10, 2, 59, 11, 2, 6]]
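
A sequence is just the sentence's words replaced by their numbers, in the same order. The sketch below shows that round trip in plain Python using a handful of entries copied from the word index above; `mini_index`, `encode`, and `decode` are hypothetical names, and the whitespace split is a simplification of what the tokenizer does.

```python
# A few entries copied from the word index printed earlier.
mini_index = {"css": 3, "stands": 12, "for": 13,
              "cascading": 14, "style": 15, "sheets": 16}
# Reverse lookup: number -> word.
mini_reverse = {number: word for word, number in mini_index.items()}


def encode(sentence, index):
    """Replace each word with its number, keeping word order."""
    return [index[word] for word in sentence.lower().split()]


def decode(sequence, reverse_index):
    """Map numbers back to words, again keeping the order."""
    return " ".join(reverse_index[number] for number in sequence)


encoded = encode("CSS stands for Cascading Style Sheets", mini_index)
print(encoded)                        # [3, 12, 13, 14, 15, 16]
print(decode(encoded, mini_reverse))  # css stands for cascading style sheets
```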


## Sequence sentences that contain words that are not in the word index

Let's take a look at what happens if the sentence being sequenced contains words that are not in the word index.

The out-of-vocabulary (OOV) token is the first entry in the word index. It shows up in the sequences in place of any word that is not in the word index.

In [6]:
sentences2 = ["I like hot chocolate", "My dogs and my hedgehog like kibble but my squirrel prefers grapes and my chickens like ice cream, preferably vanilla"]

sequences2 = tokenizer.texts_to_sequences(sentences2)
print(sequences2)

[[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
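
The fallback to the OOV number can be pictured as a dictionary lookup with a default. The sketch below is plain Python, not Keras code, and `to_sequence` is a hypothetical helper. It also imitates the `num_words` cap under the assumption that numbers at or above `num_words` are replaced by the OOV number during conversion; that is believed to match the Keras tokenizer, but verify it against your installed version.

```python
def to_sequence(text, word_index, oov_index=1, num_words=None):
    """Hypothetical helper: look up each word, falling back to the OOV number."""
    sequence = []
    for word in text.lower().split():
        number = word_index.get(word, oov_index)
        # Assumed rule: numbers at or above num_words are treated as OOV.
        if num_words is not None and number >= num_words:
            number = oov_index
        sequence.append(number)
    return sequence


small_index = {"<OOV>": 1, "to": 2, "css": 3, "are": 4, "in": 5}
text = "CSS files are stored in folders"
print(to_sequence(text, small_index))               # [3, 1, 4, 1, 5, 1]
print(to_sequence(text, small_index, num_words=4))  # [3, 1, 1, 1, 1, 1]
```

With only 59 words in the word index above, a cap of 150 never comes into play, which is why no known word is replaced in the earlier output.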


## Padding the Sequences

Later, when you feed the sequences into a neural network to train a model, they all need to be the same length. Currently the sequences have varied lengths, so the next step is to make them uniform by padding them with zeros, truncating them, or both.

Use tf.keras.preprocessing.sequence.pad_sequences to add zeros to the sequences so that they are all the same length. By default, the padding goes at the start of each sequence, but you can specify padding at the end instead.

You can optionally specify the maximum length to pad the sequences to. Sequences that are longer than the specified max length will be truncated. By default, sequences are truncated from the beginning of the sequence, but you can specify to truncate from the end.

If you don't provide the max length, the sequences are padded to match the length of the longest sequence.

In [7]:
padded = pad_sequences(sequences)
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)



Word Index =  {'<OOV>': 1, 'to': 2, 'css': 3, 'are': 4, 'in': 5, 'variables': 6, 'a': 7, 'of': 8, 'the': 9, 'used': 10, 'values': 11, 'stands': 12, 'for': 13, 'cascading': 14, 'style': 15, 'sheets': 16, 'describes': 17, 'how': 18, 'html': 19, 'elements': 20, 'be': 21, 'displayed': 22, 'on': 23, 'screen': 24, 'paper': 25, 'or': 26, 'other': 27, 'media': 28, 'saves': 29, 'lot': 30, 'work': 31, 'it': 32, 'can': 33, 'control': 34, 'layout': 35, 'multiple': 36, 'web': 37, 'pages': 38, 'all': 39, 'at': 40, 'once': 41, 'external': 42, 'stylesheets': 43, 'stored': 44, 'files': 45, 'programming': 46, 'language': 47, 'store': 48, 'data': 49, 'javascript': 50, 'uses': 51, 'var': 52, 'keyword': 53, 'declare': 54, 'an': 55, 'equal': 56, 'sign': 57, 'is': 58, 'assign': 59}

Sequences =  [[3, 12, 13, 14, 15, 16], [3, 17, 18, 19, 20, 4, 2, 21, 22, 23, 24, 25, 26, 5, 27, 28], [3, 29, 7, 30, 8, 31, 32, 33, 34, 9, 35, 8, 36, 37, 38, 39, 40, 41], [42, 43, 4, 44, 5, 3, 45], [5, 7, 46, 47, 6, 4, 10, 2, 48,

In [8]:
# Specify a max length for the padded sequences
padded = pad_sequences(sequences, maxlen=20)
print(padded)

[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  3 12 13 14 15 16]
 [ 0  0  0  0  3 17 18 19 20  4  2 21 22 23 24 25 26  5 27 28]
 [ 0  0  3 29  7 30  8 31 32 33 34  9 35  8 36 37 38 39 40 41]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0 42 43  4 44  5  3 45]
 [ 0  0  0  0  0  0  0  0  0  5  7 46 47  6  4 10  2 48 49 11]
 [ 0  0  0  0  0  0  0  0  0  0  0  0 50 51  9 52 53  2 54  6]
 [ 0  0  0  0  0  0  0  0  0  0 55 56 57 58 10  2 59 11  2  6]]


In [9]:
# Put the padding at the end of the sequences
padded = pad_sequences(sequences, maxlen=20, padding="post")
print(padded)

[[ 3 12 13 14 15 16  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3 17 18 19 20  4  2 21 22 23 24 25 26  5 27 28  0  0  0  0]
 [ 3 29  7 30  8 31 32 33 34  9 35  8 36 37 38 39 40 41  0  0]
 [42 43  4 44  5  3 45  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 5  7 46 47  6  4 10  2 48 49 11  0  0  0  0  0  0  0  0  0]
 [50 51  9 52 53  2 54  6  0  0  0  0  0  0  0  0  0  0  0  0]
 [55 56 57 58 10  2 59 11  2  6  0  0  0  0  0  0  0  0  0  0]]


In [10]:
# Limit the length of the sequences. Sequences longer than maxlen get truncated.
padded = pad_sequences(sequences, maxlen=9)
print(padded)

[[ 0  0  0  3 12 13 14 15 16]
 [21 22 23 24 25 26  5 27 28]
 [ 9 35  8 36 37 38 39 40 41]
 [ 0  0 42 43  4 44  5  3 45]
 [46 47  6  4 10  2 48 49 11]
 [ 0 50 51  9 52 53  2 54  6]
 [56 57 58 10  2 59 11  2  6]]
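
The padding and truncation rules described above fit in a few lines of plain Python. The sketch below is a stand-in written for illustration, not the Keras implementation: `pad` is a hypothetical helper that returns nested lists rather than a NumPy array, and it borrows the `maxlen`, `padding`, and `truncating` argument names only to make the comparison easier.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical pure-Python stand-in for padding and truncating sequences."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        if len(seq) > maxlen:
            # "pre" drops items from the start, "post" drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the zeros in front, "post" puts them at the end.
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


# The first two sequences from the output above.
demo = [
    [3, 12, 13, 14, 15, 16],
    [3, 17, 18, 19, 20, 4, 2, 21, 22, 23, 24, 25, 26, 5, 27, 28],
]
print(pad(demo, maxlen=9))
print(pad(demo, maxlen=9, padding="post", truncating="post"))
```

The first call mirrors the defaults used in the cell above: zeros go in front, and the long sequence loses its first words. The second call keeps the start of each sentence instead, which corresponds to the option of truncating from the end that the text mentions.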


## What happens if some of the sentences contain words that are not in the word index?

Here's where the "out of vocabulary" token is used. Try generating sequences for some sentences that have words that are not in the word index.

In [11]:
# Try turning sentences that contain words that 
# aren't in the word index into sequences.
# Add your own sentences to the test_data
test_data = [
    "my best friend's favorite ice cream flavor is strawberry",
    "my dog's best friend is a manatee"
]
print(test_data)

# Remind ourselves which number corresponds to the
# out of vocabulary token in the word index
print("<OOV> has the number", word_index['<OOV>'], "in the word index.")

# Convert the test sentences to sequences
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Pad the new sequences
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")

# Notice that "1" appears in the sequence wherever there's a word 
# that's not in the word index
print(padded)

["my best friend's favorite ice cream flavor is strawberry", "my dog's best friend is a manatee"]
<OOV> has the number 1 in the word index.

Test Sequence =  [[1, 1, 1, 1, 1, 1, 1, 58, 1], [1, 1, 1, 1, 58, 7, 1]]

Padded Test Sequence: 
[[ 0  1  1  1  1  1  1  1 58  1]
 [ 0  0  0  1  1  1  1 58  7  1]]
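
When most of a sequence is the OOV number, the model sees very little of the original sentence. One quick way to quantify that is the share of OOV tokens per sequence. The sketch below works on the test sequences printed above; `oov_rate` is a hypothetical helper, and it is applied before padding so that the zeros do not dilute the ratio.

```python
# Test sequences copied from the output above (before padding).
test_sequences = [[1, 1, 1, 1, 1, 1, 1, 58, 1], [1, 1, 1, 1, 58, 7, 1]]
OOV_INDEX = 1  # the number assigned to "<OOV>" in the word index


def oov_rate(sequence, oov_index=OOV_INDEX):
    """Hypothetical helper: fraction of tokens that are out of vocabulary."""
    if not sequence:
        return 0.0
    return sum(1 for token in sequence if token == oov_index) / len(sequence)


rates = [oov_rate(seq) for seq in test_sequences]
for seq, rate in zip(test_sequences, rates):
    print(f"{len(seq)} tokens, {rate:.0%} out of vocabulary")
```

A high rate like the ones here suggests the tokenizer was fit on text that is too small or too different from the text being sequenced.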
