# <font color='#154360'> <b> <center> Text Vectorization Layer </center> </b> </font>

A preprocessing layer which maps text features to integer sequences.

<b> tf.keras.layers.TextVectorization </b>(
    
    max_tokens=None,
    
    standardize='lower_and_strip_punctuation',
    
    split='whitespace',
    
    ngrams=None,
    
    output_mode='int',
    
    output_sequence_length=None,
    
    pad_to_max_tokens=False,
    
    vocabulary=None,
    
    idf_weights=None,
    
    sparse=False,
    
    ragged=False,
    
    encoding='utf-8',
    
    name=None,
    
    **kwargs
)

It transforms a batch of strings (one example = one string) into either:

- A list of token indices (one example = 1D tensor of integer token indices) 
- Or a dense representation (one example = 1D tensor of float values representing data about the example's tokens). 
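
To make the two output forms concrete, here is a minimal pure-Python sketch. It is not the layer's implementation: the vocabulary `DEMO_VOCAB`, the helper names `to_indices` and `to_counts`, and the simplified index layout (index 0 for unknown tokens, no padding slot) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical fixed vocabulary for illustration only.
# In this simplified layout, index 0 stands for any unknown token.
DEMO_VOCAB = ["[UNK]", "baz", "bar", "foo"]
DEMO_TOKEN_TO_ID = {token: idx for idx, token in enumerate(DEMO_VOCAB)}


def to_indices(text):
    """One example -> a variable-length list of integer token indices."""
    return [DEMO_TOKEN_TO_ID.get(tok, 0) for tok in text.lower().split()]


def to_counts(text):
    """One example -> a fixed-length float vector with one count per vocabulary entry."""
    counts = Counter(to_indices(text))
    return [float(counts.get(idx, 0)) for idx in range(len(DEMO_VOCAB))]


batch = ["foo bar foo", "qux baz"]
print([to_indices(s) for s in batch])  # [[3, 2, 3], [0, 1]]
print([to_counts(s) for s in batch])   # [[0.0, 0.0, 1.0, 2.0], [1.0, 1.0, 0.0, 0.0]]
```

The index form keeps token order and its length varies with the input, while the dense form always has one slot per vocabulary entry and discards order.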


The vocabulary for the layer must be either:

- Supplied on construction (via the `vocabulary` argument)
- Or learned via `adapt()`.

When this layer is adapted, it:

- Analyzes the dataset.
- Determines the frequency of individual string values.
- Creates a vocabulary from them.

This vocabulary can have unlimited size or be capped, depending on the layer's configuration (the `max_tokens` argument). If the input contains more unique values than the maximum vocabulary size, only the most frequent terms are used to create the vocabulary.
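
The adaptation steps above can be sketched in plain Python. This is an approximation, not TensorFlow's code: `standardize` and `build_vocabulary` are hypothetical helpers, the punctuation set comes from `string.punctuation`, and the tie-breaking rule for equally frequent tokens is an assumption chosen to match the vocabulary printed further down in this notebook. The two reserved entries (`''` for padding and `'[UNK]'` for unknown tokens) count toward the cap, as they do for the layer's `'int'` output mode.

```python
import string
from collections import Counter

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def standardize(text):
    # Rough stand-in for standardize='lower_and_strip_punctuation'.
    return text.lower().translate(_STRIP_PUNCT)


def build_vocabulary(texts, max_tokens=None):
    # Count how often each whitespace-separated token appears in the dataset.
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())

    # Most frequent first. Breaking ties in reverse alphabetical order is an
    # assumption, not documented behavior of the layer.
    ranked = sorted(counts, key=lambda tok: (counts[tok], tok), reverse=True)

    # Reserved entries: '' (padding) and '[UNK]' (out-of-vocabulary).
    reserved = ["", "[UNK]"]
    if max_tokens is not None:
        ranked = ranked[: max(max_tokens - len(reserved), 0)]
    return reserved + ranked


corpus = ["foo bar", "bar baz", "baz bada boom"]
print(build_vocabulary(corpus))
# ['', '[UNK]', 'baz', 'bar', 'foo', 'boom', 'bada']
print(build_vocabulary(corpus, max_tokens=4))
# ['', '[UNK]', 'baz', 'bar']
```

With `max_tokens=4`, only the two most frequent terms survive once the reserved entries are accounted for.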

In [14]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

In [8]:
max_tokens = 5000  # Maximum vocab size.
max_len = 5  # Sequence length to pad the outputs to.

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode='int',
    output_sequence_length=max_len)

In [9]:
# Now that the vocab layer has been created, call `adapt` on the
# list of strings to create the vocabulary.
vectorize_layer.adapt(["foo bar", "bar baz", "baz bada boom"])

In [12]:
# Get the config of our text vectorizer
vectorize_layer.get_config()

{'name': 'text_vectorization_1',
 'trainable': True,
 'dtype': 'string',
 'batch_input_shape': (None,),
 'max_tokens': 5000,
 'standardize': 'lower_and_strip_punctuation',
 'split': 'whitespace',
 'ngrams': None,
 'output_mode': 'int',
 'output_sequence_length': 5,
 'pad_to_max_tokens': False,
 'sparse': False,
 'ragged': False,
 'vocabulary': None,
 'idf_weights': None,
 'encoding': 'utf-8',
 'vocabulary_size': 7}

In [18]:
# Inspect the learned vocabulary (position in the array = token id)
vocab = np.array(vectorize_layer.get_vocabulary())
vocab

array(['', '[UNK]', 'baz', 'bar', 'foo', 'boom', 'bada'], dtype='<U5')

In [17]:
# Build lookup dicts in both directions: id -> token and token -> id
id_to_token = {idx: word for idx, word in enumerate(vocab)}
token_to_id = {word: idx for idx, word in enumerate(vocab)}

print("ID to Token:", id_to_token)
print("Token to ID:", token_to_id)

ID to Token: {0: '', 1: '[UNK]', 2: 'baz', 3: 'bar', 4: 'foo', 5: 'boom', 6: 'bada'}
Token to ID: {'': 0, '[UNK]': 1, 'baz': 2, 'bar': 3, 'foo': 4, 'boom': 5, 'bada': 6}


In [26]:
# Print one id/token pair per line
for idx, word in enumerate(vocab):
    print(idx, word)

0 
1 [UNK]
2 baz
3 bar
4 foo
5 boom
6 bada


In [27]:
# Now, the layer can map strings to integers -- you can use an
# embedding layer to map these integers to learned embeddings.
input_data = [["foo qux bar"], ["qux baz"]]
vectorize_layer(input_data)

<tf.Tensor: shape=(2, 5), dtype=int64, numpy=
array([[4, 1, 3, 0, 0],
       [1, 2, 0, 0, 0]])>
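
The output above can be read as three steps: look up each token, map unknown tokens (here `qux`) to index 1, then pad with 0 or truncate to `output_sequence_length`. The following pure-Python sketch reproduces that result under the assumption that the vocabulary is exactly the one printed earlier; `vectorize_sketch` is a hypothetical helper, not part of the Keras API.

```python
# Vocabulary copied from the get_vocabulary() output shown earlier.
VOCAB = ["", "[UNK]", "baz", "bar", "foo", "boom", "bada"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}
PAD_ID, OOV_ID = 0, 1


def vectorize_sketch(text, sequence_length):
    # Look up each token; anything outside the vocabulary becomes OOV_ID.
    ids = [TOKEN_TO_ID.get(tok, OOV_ID) for tok in text.lower().split()]
    # Truncate long sequences, then right-pad short ones with PAD_ID.
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


print([vectorize_sketch(s, 5) for s in ["foo qux bar", "qux baz"]])
# [[4, 1, 3, 0, 0], [1, 2, 0, 0, 0]]
print(vectorize_sketch("foo bar baz bada boom foo", 5))
# [4, 3, 2, 6, 5]
```

The second call shows the truncation case: the sixth token is dropped because the sequence length is fixed at 5.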

### References

[TextVectorization Layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)