## Description: Text Preprocessing with Keras Tokenizer and Padding
This script demonstrates essential text preprocessing steps for Natural Language Processing using TensorFlow's Keras Tokenizer and pad_sequences.

It begins by defining a set of training sentences and initializing a `Tokenizer` with `num_words` (which caps how many of the most frequent words are kept when text is converted to sequences) and `oov_token` (a placeholder for any word outside that vocabulary). The tokenizer then learns the vocabulary from the training sentences and converts them into integer sequences in which each number is a word's unique ID.
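To make the indexing step concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. It is an illustration under stated assumptions, not the Keras implementation: `simple_tokenize` and `build_word_index` are hypothetical helpers, and `FILTERS` is assumed to approximate the tokenizer's default punctuation filter.

```python
from collections import Counter

# Assumed to approximate the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: OOV token gets index 1, then words by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    vocab = [oov_token] + [word for word, _ in ranked]
    # Index 0 is left unused so it can serve as the padding value.
    return {word: index for index, word in enumerate(vocab, start=1)}


sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
word_index = build_word_index(sentences)
print(word_index)
```

On these four sentences, the most frequent word (`my`) receives the lowest index after the OOV token, and words that occur only once land at the end of the mapping.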

A key part of the process is `pad_sequences`, which ensures all text sequences have a uniform length, a common requirement for input to neural networks. Shorter sequences are padded with zeros and longer ones are truncated; by default, both the padding and the truncation happen at the beginning of the sequence.
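The padding and truncation rules can be sketched in a few lines of plain Python. This is a simplified stand-in for illustration only: `pad_or_truncate` is a hypothetical helper that returns nested lists rather than a NumPy array, and its `padding` and `truncating` options are assumed to mirror the `"pre"` and `"post"` choices offered by `pad_sequences`.

```python
def pad_or_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: force every sequence to exactly `maxlen` items."""
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                # Drop items from the front, keeping the last `maxlen`.
                seq = seq[len(seq) - maxlen:]
            else:
                # Drop items from the end, keeping the first `maxlen`.
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

pre = pad_or_truncate(sequences, maxlen=5)
post = pad_or_truncate(sequences, maxlen=5, padding="post", truncating="post")

print(pre)   # zeros added in front, long sequence loses its first items
print(post)  # zeros added at the end, long sequence loses its last items
```

Which side to pad or truncate is a modeling choice: trimming from the front keeps the end of a sentence, while trimming from the end keeps its beginning.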

Finally, the script shows how the fitted tokenizer handles new, unseen data: any word that was not part of its original vocabulary is mapped to the `<OOV>` token, so out-of-vocabulary terms keep their position in the sequence instead of being silently dropped.
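The out-of-vocabulary lookup can be illustrated with a small pure-Python sketch. It assumes the word index printed in the output at the end of this notebook and uses a hypothetical helper, `to_sequence`, that splits on whitespace only (no punctuation filtering), so it is a simplification rather than the tokenizer's actual code.

```python
# Word index copied from the output at the end of this notebook.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def to_sequence(text, word_index, num_words=None, oov_token="<OOV>"):
    """Hypothetical helper: map each word to its index, or to the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        # Unknown words, and words ranked outside the num_words limit,
        # both fall back to the OOV index.
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence


seen = to_sequence('i really love my dog', word_index)
unseen = to_sequence('my dog loves my manatee', word_index)
limited = to_sequence('i love my cat', word_index, num_words=5)

print(seen)
print(unseen)
print(limited)
```

Note that `loves` is treated as unknown even though `love` is in the vocabulary, because the lookup is an exact string match with no stemming.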

In [1]:
import tensorflow as tf
from tensorflow import keras

# Import necessary modules from Keras for text preprocessing.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define a list of sentences to be processed.
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer.
# num_words=100 limits texts_to_sequences to words whose index is below 100,
# i.e. the most frequent words (the OOV token itself occupies index 1).
# word_index still lists every word seen during fitting.
# oov_token="<OOV>" defines a special token for "Out-Of-Vocabulary" words.
# Any word outside the kept vocabulary will be replaced by this token's index.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# Fit the tokenizer on the provided sentences.
# This step analyzes the text, counts word frequencies, and assigns each word
# a unique integer index (more frequent words get lower indices).
# By default, punctuation is stripped and words are converted to lowercase.
tokenizer.fit_on_texts(sentences)

# Retrieve the word_index dictionary, which maps each word to its corresponding integer index.
word_index = tokenizer.word_index

# Convert the sentences into sequences of integers, where each integer represents a word's index.
sequences = tokenizer.texts_to_sequences(sentences)

# Pad the sequences to a uniform length.
# maxlen=5 ensures all sequences have a length of 5.
# If a sequence is shorter, it will be padded with zeros (by default at the beginning).
# If a sequence is longer, it will be truncated (by default from the beginning).
padded = pad_sequences(sequences, maxlen=5)

# Print the generated word index, original sequences, and padded sequences.
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)

# Try sentences containing words the tokenizer was not fitted on.
# These new sentences demonstrate how the tokenizer handles "Out-Of-Vocabulary" words.
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Convert the test data sentences into sequences using the already fitted tokenizer.
# Words like "really", "loves", and "manatee" were not in the original 'sentences' list,
# so they will be replaced by the OOV token's index (1).
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Pad the test sequences.
# maxlen=10 ensures all test sequences have a length of 10.
# A separate variable keeps the padded training sequences from being overwritten.
test_padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(test_padded)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded Sequences:
[[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]

Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]
