<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W1/ungraded_labs/C3_W1_Lab_1_tokenize_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: TextVectorization Layer

## Generating the vocabulary

The code below takes a list of sentences, splits them into words, and assigns each word an integer index. This is done with the [adapt()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) method, and you can inspect the resulting vocabulary by calling the `get_vocabulary()` method. More frequent words get a lower index. Index 0 is reserved for the padding token (`''`) and index 1 for out-of-vocabulary words (`'[UNK]'`).

In [2]:
# from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import TextVectorization

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
    ]

# Initialize the TextVectorization layer
tokenizer = TextVectorization(max_tokens=100, output_mode='int', output_sequence_length=5)

# Generate indices for each word in the corpus
tokenizer.adapt(sentences)

# Get the vocabulary
word_index = tokenizer.get_vocabulary()
print(word_index)

['', '[UNK]', 'my', 'love', 'i', 'dog', 'cat']
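To make the ordering above less mysterious, here is a minimal plain-Python sketch of how a frequency-ordered vocabulary of this shape could be built. It is not the layer's actual implementation: `build_vocabulary` is a hypothetical helper, `string.punctuation` only approximates the characters the layer strips, and the reverse-alphabetical tie-break is an assumption chosen only because it reproduces the printed output.

```python
import re
import string
from collections import Counter

def build_vocabulary(sentences):
    """Return a vocabulary list shaped like the one printed above (illustrative only)."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and drop punctuation, then split on whitespace.
        cleaned = re.sub(f"[{re.escape(string.punctuation)}]", "", sentence.lower())
        counts.update(cleaned.split())
    # Most frequent words first. Ties are broken in reverse alphabetical order
    # purely to match the output above (an assumption, not documented behavior).
    ordered = sorted(sorted(counts, reverse=True), key=lambda w: -counts[w])
    # Index 0 is the padding token and index 1 the out-of-vocabulary token.
    return ["", "[UNK]"] + ordered

vocab = build_vocabulary(["i love my dog", "I, love my cat"])
print(vocab)
```

The two reserved entries come first, so the most frequent real word always starts at index 2.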


In [3]:
# Get the integer encoding of each sentence
sequences = tokenizer(sentences)
print(sequences.numpy())

[[4 3 2 5 0]
 [4 3 2 6 0]]


- Here, `output_sequence_length` is set to 5, so every output vector has a length of 5. Both sentences contain only four words, so each one is padded with a trailing 0.
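The padding behavior described above can be sketched in plain Python. This is a rough illustration rather than the layer's code: `encode` is a hypothetical helper, the vocabulary is copied from the earlier output, and the sketch assumes that inputs longer than `output_sequence_length` are cut off at the end.

```python
def encode(sentence, vocabulary, output_sequence_length):
    """Map words to vocabulary indices, then pad or truncate to a fixed length."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    # Unknown words fall back to index 1, the '[UNK]' slot.
    ids = [index_of.get(word, 1) for word in sentence.lower().split()]
    ids = ids[:output_sequence_length]                      # truncate long inputs
    return ids + [0] * (output_sequence_length - len(ids))  # pad short inputs with 0

vocabulary = ['', '[UNK]', 'my', 'love', 'i', 'dog', 'cat']
print(encode("i love my dog", vocabulary, 5))                    # four words, padded
print(encode("i really love my dog and my cat", vocabulary, 5))  # eight words, truncated
```

In the second call, `really` is not in the vocabulary, so it maps to 1, and only the first five words survive.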

In [4]:
# Get the vocabulary
vocabulary = tokenizer.get_vocabulary()

# Iterate over the vocabulary and print the integer sequence produced for each token
for word in vocabulary:
    print(f"{word}: {tokenizer(word).numpy()}")

: []
[UNK]: [1 0 0 0 0]
my: [2 0 0 0 0]
love: [3 0 0 0 0]
i: [4 0 0 0 0]
dog: [5 0 0 0 0]
cat: [6 0 0 0 0]


The `max_tokens` parameter passed to the initializer sets the maximum size of the vocabulary, keeping the most frequent words. This count includes the out-of-vocabulary token `[UNK]` and, when `output_mode` is `'int'`, the padding token `''`, so the number of actual words kept in this lab is `max_tokens - 2`. You will see this in a later exercise. For now, the important thing to note is that it affects how the vocabulary is generated. You can try passing `3` instead of `100`, as shown in the next cell, and you will see that the vocabulary contains only three entries, just one of which is an actual word.

Also notice that by default, all punctuation is stripped and words are converted to lower case. You can override this behavior with the `standardize` argument of the `TextVectorization` class (for example `'lower'`, `'strip_punctuation'`, or `None`), as described [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). You can try modifying this in the cell below and compare the output to the one generated above.
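Separately from the lab cells, the following plain-Python sketch approximates what each standardization mode does to a sentence. The mode names mirror the layer's `standardize` options, but the `standardize` function here is a hypothetical stand-in, and using `string.punctuation` for the stripped characters is an assumption that may not match the layer exactly.

```python
import re
import string

# Character class covering ASCII punctuation (an approximation of what the layer strips).
PUNCTUATION = f"[{re.escape(string.punctuation)}]"

def standardize(text, mode="lower_and_strip_punctuation"):
    """Apply a rough equivalent of the named standardization mode."""
    if mode is None:
        return text                      # leave the text untouched
    if "lower" in mode:
        text = text.lower()
    if "strip_punctuation" in mode:
        text = re.sub(PUNCTUATION, "", text)
    return text

for mode in ("lower_and_strip_punctuation", "lower", "strip_punctuation", None):
    print(f"{mode}: {standardize('You love my dog!', mode)}")
```

With anything other than the default mode, `dog!` and `dog` or `You` and `you` would count as different tokens, which changes the vocabulary.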

In [5]:
# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the TextVectorization layer
tokenizer = TextVectorization(max_tokens=3, output_mode='int')

# Generate indices for each word in the corpus
tokenizer.adapt(sentences)

# Get the vocabulary and print it
word_index = tokenizer.get_vocabulary()
print(word_index)

['', '[UNK]', 'my']


- Here, `max_tokens` is set to 3, so the vocabulary contains only 3 entries: the padding token `''`, the out-of-vocabulary token `[UNK]`, and the single most frequent word, `my`.
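The arithmetic behind that result can be written out as a small helper. This is only a sketch of the counting rule: `word_slots` is a hypothetical function, not part of the Keras API.

```python
def word_slots(max_tokens, output_mode="int"):
    """Number of vocabulary entries left over for real words."""
    reserved = 1                 # the '[UNK]' out-of-vocabulary token
    if output_mode == "int":
        reserved += 1            # the '' padding token at index 0
    return max(max_tokens - reserved, 0)

print(word_slots(3))     # 1
print(word_slots(100))   # 98
```

With `max_tokens=3` in `'int'` mode there is room for exactly one real word, which matches the vocabulary printed above.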

That concludes this short exercise on tokenizing input texts!