In [None]:
!pip install tensorflow

This notebook demonstrates a fundamental step in Natural Language Processing: word tokenization with the Keras Tokenizer that ships with TensorFlow. It takes a list of sentences and builds a vocabulary from them. By default, the Tokenizer lowercases the text and strips punctuation before splitting it into words.

The num_words parameter caps the vocabulary that is used when text is later converted to sequences or matrices: only the num_words - 1 most frequent words are kept, which helps manage memory and keeps the focus on common terms. It does not truncate word_index. After fitting, word_index maps every word seen in the training text to a unique integer ID, starting at 1 for the most frequent word. This mapping is what turns text into the numerical input that most machine learning models require.
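The frequency-ranked mapping described above can be sketched in plain Python. This is not how Keras implements it, only an illustration of the idea; the helper name build_word_index is hypothetical, and the sketch strips every character in string.punctuation, which is an assumption and slightly broader than the Tokenizer's default filter.

```python
import string
from collections import Counter

def build_word_index(texts):
    """Return a {word: rank} dict where rank 1 is the most frequent word."""
    # Assumption: treat every ASCII punctuation character as a separator.
    strip_punct = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for text in texts:
        # Lowercase, replace punctuation with spaces, split on whitespace.
        counts.update(text.lower().translate(strip_punct).split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # IDs start at 1, not 0.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
word_index = build_word_index(sentences)
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Starting the IDs at 1 leaves 0 free, which is commonly used as a padding value when sequences are later brought to equal length.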

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define a list of sentences to be processed.
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the Tokenizer.
# num_words=100 means that only the 99 (num_words - 1) most frequent words are kept
# when text is later converted to sequences; word_index itself is not truncated.
tokenizer = Tokenizer(num_words=100)

# Fit the tokenizer on the provided sentences.
# This step scans the text, counts each unique word,
# and assigns integer indices by frequency (1 = most frequent).
# By default, punctuation is stripped and words are lowercased.
tokenizer.fit_on_texts(sentences)

# Retrieve the word_index dictionary.
# This dictionary maps each word to its corresponding integer index.
word_index = tokenizer.word_index

# Print the word_index.
# The output will show each unique word from the sentences and its assigned integer ID.
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
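To see how the num_words cap affects the step from words to numbers, here is a minimal plain-Python sketch that encodes the sentences with the word index printed above. The Keras Tokenizer provides texts_to_sequences for this purpose; the encode helper below is a hypothetical stand-in that mimics the rule of keeping only words whose index is below num_words, so it runs without TensorFlow. Dropping unknown words silently is also an assumption of this sketch.

```python
import string

# Word index as printed above.
word_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

def encode(texts, word_index, num_words=None):
    """Map each text to a list of word IDs, keeping only IDs below num_words."""
    strip_punct = str.maketrans({ch: " " for ch in string.punctuation})
    encoded = []
    for text in texts:
        ids = []
        for word in text.lower().translate(strip_punct).split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen during fitting: dropped in this sketch
            if num_words is not None and idx >= num_words:
                continue  # word is too rare for the requested vocabulary size
            ids.append(idx)
        encoded.append(ids)
    return encoded

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
full = encode(sentences, word_index, num_words=100)
capped = encode(sentences, word_index, num_words=4)
print(full)    # [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
print(capped)  # [[3, 1, 2], [3, 1, 2], [1, 2]]
```

With num_words=100 nothing is filtered because the vocabulary has only six words. With num_words=4, only the three most frequent words survive, so 'dog', 'cat', and 'you' disappear from the encoded sentences.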
