<a href="https://colab.research.google.com/github/HelenLit/tokenizer_basics/blob/main/basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generating the vocabulary

In this notebook, I will first look at how to build a lookup dictionary for the words in a corpus. The code below takes a list of sentences and assigns an integer to each distinct word in them. This is done with the [fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) method, and the result is available through the `word_index` property. More frequent words receive lower indices.



In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Defining input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
    ]

# Initializing the Tokenizer class
tokenizer = Tokenizer(num_words = 100)

# Generating indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Getting the indices and printing them
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
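To make the frequency-based ranking concrete, here is a minimal pure-Python sketch of how such an index could be built. It is not the Keras implementation: `build_word_index` is a hypothetical helper, and its text cleaning is simplified to lowercasing and removing commas, which is enough for the two example sentences.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter()
    for text in texts:
        # Simplified cleaning (an assumption): lowercase, treat commas as spaces
        counts.update(text.lower().replace(',', ' ').split())
    # most_common() sorts by descending count; words with equal counts
    # keep the order in which they were first seen
    ranked = counts.most_common()
    # Indices start at 1, as in the output above
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sentences = ['i love my dog', 'I, love my cat']
word_index = build_word_index(sentences)
print(word_index)
```

Here `i`, `love` and `my` each occur twice, so they take the three lowest indices, while `dog` and `cat` occur once and follow in order of first appearance.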


The `num_words` parameter passed to the initializer caps the vocabulary used when generating sequences: only the `num_words - 1` most frequent words are kept. It does not affect how the `word_index` dictionary is generated.

By default, punctuation is stripped and words are converted to lower case. We can override these behaviors by changing the `filters` and `lower` arguments of the `Tokenizer` class, as described in the [`Tokenizer` arguments documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#arguments).
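The effect of these two arguments can be approximated without TensorFlow. The sketch below is an illustration, not the library code: `split_words` is a hypothetical helper, and `DEFAULT_FILTERS` is assumed to match the documented default of the `filters` argument.

```python
# Assumed to match the documented default of Tokenizer's `filters` argument
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=DEFAULT_FILTERS, lower=True):
    """Hypothetical helper: clean a sentence and split it into words."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split on whitespace
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.translate(table).split()


default_tokens = split_words('I, love my cat')
raw_tokens = split_words('I, love my cat', filters='', lower=False)
print(default_tokens)  # ['i', 'love', 'my', 'cat']
print(raw_tokens)      # ['I,', 'love', 'my', 'cat']
```

With the defaults, `I,` and `i` collapse into the same token. With an empty `filters` string and `lower=False`, `I,` stays distinct and would get its own entry in the vocabulary.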

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Defining input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
    ]

# Initializing the Tokenizer class
tokenizer = Tokenizer(num_words = 1)

# Generating indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Getting the indices and printing them
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


After passing `1` instead of the `100` used in the previous cell, the `word_index` is unchanged.
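Where `num_words` does make a difference is when text is converted to integer sequences, which in Keras is the job of `texts_to_sequences`. The sketch below imitates that filtering step in plain Python so the effect can be seen in isolation. `to_sequences` is a hypothetical helper, and the rule it applies (keep a word only if its index is below `num_words`) is my reading of the "`num_words` minus one" behavior described earlier, not code taken from the library.

```python
def to_sequences(texts, word_index, num_words=None):
    """Hypothetical helper: map each sentence to a list of word indices."""
    sequences = []
    for text in texts:
        # Simplified cleaning (an assumption): lowercase, treat commas as spaces
        words = text.lower().replace(',', ' ').split()
        ids = [word_index[w] for w in words if w in word_index]
        if num_words:
            # Keep only the num_words - 1 most frequent words
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
sentences = ['i love my dog', 'I, love my cat']

all_words = to_sequences(sentences, word_index, num_words=100)
top_two = to_sequences(sentences, word_index, num_words=3)
none_kept = to_sequences(sentences, word_index, num_words=1)

print(all_words)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
print(top_two)    # [[1, 2], [1, 2]]
print(none_kept)  # [[], []]
```

Under this rule, `num_words=1` keeps zero words, so the same `word_index` would yield empty sequences, while `num_words=100` keeps every word in this small corpus.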
