# Tokenizing text and creating sequences for sentences

## Import the Tokenizer

In [1]:
# Import the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer


## Write some sentences

Feel free to change these sentences or add your own.

In [2]:
sentences = [
    'CSS stands for Cascading Style Sheets',
    'CSS describes how HTML elements are to be displayed on screen, paper, or in other media',
    'CSS saves a lot of work. It can control the layout of multiple web pages all at once',
    'External stylesheets are stored in CSS files',
    'In a programming language, variables are used to store data values.',
    'JavaScript uses the var keyword to declare variables.',
    'An equal sign is used to assign values to variables',
]

## Tokenize the words

The first step in preparing text for a machine learning model is to tokenize it, in other words, to assign a number to each word.

In [3]:
# Optionally set the maximum number of words to keep, based on word frequency.
# The out-of-vocabulary (OOV) token represents words that are not in the index.
# Call fit_on_texts() on the tokenizer to generate a unique number for each word.
tokenizer = Tokenizer(num_words=150, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)


## View the word index
After you tokenize the text, the tokenizer has a word index that contains key-value pairs for all the words and their numbers.

The word is the key, and the number is the value.

Notice that the OOV token is the first entry, with the number 1, and that the remaining words are ordered from most to least frequent.


In [4]:
# Examine the word index
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'to': 2, 'css': 3, 'are': 4, 'in': 5, 'variables': 6, 'a': 7, 'of': 8, 'the': 9, 'used': 10, 'values': 11, 'stands': 12, 'for': 13, 'cascading': 14, 'style': 15, 'sheets': 16, 'describes': 17, 'how': 18, 'html': 19, 'elements': 20, 'be': 21, 'displayed': 22, 'on': 23, 'screen': 24, 'paper': 25, 'or': 26, 'other': 27, 'media': 28, 'saves': 29, 'lot': 30, 'work': 31, 'it': 32, 'can': 33, 'control': 34, 'layout': 35, 'multiple': 36, 'web': 37, 'pages': 38, 'all': 39, 'at': 40, 'once': 41, 'external': 42, 'stylesheets': 43, 'stored': 44, 'files': 45, 'programming': 46, 'language': 47, 'store': 48, 'data': 49, 'javascript': 50, 'uses': 51, 'var': 52, 'keyword': 53, 'declare': 54, 'an': 55, 'equal': 56, 'sign': 57, 'is': 58, 'assign': 59}


In [5]:
# Get the number for a given word
print(word_index['stylesheets'])

43
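To make the numbering less mysterious, here is a minimal pure-Python sketch of how a frequency-ordered word index like the one above can be built. It is an illustration, not the Keras implementation: `split_words`, `build_word_index`, and `demo_sentences` are hypothetical names, and the `FILTERS` string is assumed to match the tokenizer's default punctuation filter.

```python
from collections import Counter

# Characters replaced by spaces before splitting.
# Assumption: this mirrors the Keras Tokenizer's default filter string.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase the text, blank out filtered characters, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token="<OOV>"):
    """Give the OOV token number 1, then number words by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts, key=lambda word: counts[word], reverse=True)
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


demo_sentences = [
    "CSS stands for Cascading Style Sheets",
    "External stylesheets are stored in CSS files",
]
demo_index = build_word_index(demo_sentences)
print(demo_index)
```

In this small demo, `css` appears twice, so it is numbered right after the OOV token; every other word appears once and is numbered in order of first appearance.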


## Create sequences for the sentences

After you tokenize the words, the word index contains a unique number for each word. However, the word index says nothing about the order in which words appear in a sentence, and that order carries meaning. So the next step is to generate a sequence for each sentence: the list of its words' numbers, in the order the words appear.

In [6]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[3, 12, 13, 14, 15, 16], [3, 17, 18, 19, 20, 4, 2, 21, 22, 23, 24, 25, 26, 5, 27, 28], [3, 29, 7, 30, 8, 31, 32, 33, 34, 9, 35, 8, 36, 37, 38, 39, 40, 41], [42, 43, 4, 44, 5, 3, 45], [5, 7, 46, 47, 6, 4, 10, 2, 48, 49, 11], [50, 51, 9, 52, 53, 2, 54, 6], [55, 56, 57, 58, 10, 2, 59, 11, 2, 6]]
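One way to check a sequence is to map its numbers back to words. The sketch below does this in plain Python with an inverted copy of the word index; `word_index_excerpt`, `index_to_word`, and `decode` are hypothetical names, and the excerpt holds only the entries from the printed word index that the first sentence needs.

```python
# Excerpt of the word index printed earlier (only the entries needed here).
word_index_excerpt = {
    "<OOV>": 1, "css": 3, "stands": 12, "for": 13,
    "cascading": 14, "style": 15, "sheets": 16,
}

# Invert the mapping so that numbers become keys and words become values.
index_to_word = {number: word for word, number in word_index_excerpt.items()}


def decode(sequence):
    """Turn a list of word numbers back into a lowercase sentence."""
    # Numbers missing from the excerpt are shown as "?".
    return " ".join(index_to_word.get(number, "?") for number in sequence)


first_sequence = [3, 12, 13, 14, 15, 16]
print(decode(first_sequence))  # css stands for cascading style sheets
```

The decoded sentence is all lowercase because the tokenizer lowercased the text before numbering the words; the original capitalization and punctuation cannot be recovered from the sequence.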


## Sequence sentences that contain words that are not in the word index

Let's take a look at what happens if the sentence being sequenced contains words that are not in the word index.

The out-of-vocabulary (OOV) token is the first entry in the word index. It shows up in the sequences in place of any word that is not in the word index.

In [7]:
sentences2 = ["I like hot chocolate", "My dogs and my hedgehog like kibble but my squirrel prefers grapes and my chickens like ice cream, preferably vanilla"]

sequences2 = tokenizer.texts_to_sequences(sentences2)
print(sequences2)

[[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
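Because none of the words in those two sentences were seen during fitting, every position becomes 1. A sentence that mixes known and unknown words makes the substitution easier to see. The following is a simplified pure-Python sketch of that lookup, not the tokenizer's own code: `to_sequence` and `word_index_excerpt` are hypothetical names, the excerpt copies a few entries from the word index above, and punctuation handling is left out.

```python
# A few entries copied from the word index printed earlier.
word_index_excerpt = {"<OOV>": 1, "to": 2, "css": 3, "are": 4, "in": 5, "variables": 6}


def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Look up each lowercased word; unknown words fall back to the OOV number."""
    oov_number = word_index[oov_token]
    return [word_index.get(word, oov_number) for word in sentence.lower().split()]


mixed = to_sequence("CSS variables are fun", word_index_excerpt)
print(mixed)  # [3, 6, 4, 1]
```

Only `fun` is missing from the index, so only the last position is replaced by the OOV number.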
