In [1]:
import tensorflow as tf

## 1] Tokenizer

**Usage:** Splits text into tokens (words by default, or characters with `char_level=True`) and maps each token to an integer index.

After fit_on_texts, the Tokenizer assigns each unique token an integer index ranked by frequency, starting at 1 (index 0 is reserved and never assigned to a word).

The more frequent a word, the lower its index.

In [3]:
texts = ['Hello, how are you?', 'I am doing great, thank you!']

In [2]:
tokenizer=tf.keras.preprocessing.text.Tokenizer()

In [4]:
tokenizer.fit_on_texts(texts)

In [6]:
seq=tokenizer.texts_to_sequences(texts)

In [7]:
seq

[[2, 3, 4, 1], [5, 6, 7, 8, 9, 1]]
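To see where these indices come from, the frequency ranking can be reproduced in plain Python. This is only a rough sketch of the idea, not the Keras implementation: `split_words` and `FILTERS` are illustrative names, and the sketch assumes lowercasing, punctuation filtering, and ties broken by first appearance.

```python
from collections import Counter

# Punctuation and whitespace characters treated as separators (assumed default filter set)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase the text, replace filtered characters with spaces, and split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

texts = ['Hello, how are you?', 'I am doing great, thank you!']

# Count every word across all texts
counts = Counter(word for text in texts for word in split_words(text))

# Rank by descending count; the sort is stable, so ties keep first-seen order
ranked = sorted(counts.items(), key=lambda item: -item[1])

# Indices start at 1 because 0 is reserved
word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sequences = [[word_index[w] for w in split_words(t)] for t in texts]
print(word_index)
print(sequences)  # [[2, 3, 4, 1], [5, 6, 7, 8, 9, 1]]
```

'you' is the only word that appears twice, so it receives index 1; the remaining words are numbered in order of first appearance.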

## 2] Hashing trick

**Use hashing_trick when:**
    
 You need a simple, **memory-efficient** way to convert text to integers.
    
 You do not require an explicit vocabulary.
    
 You are **dealing with very large text corpora**.

In [11]:
from tensorflow.keras.preprocessing.text import hashing_trick

In [12]:
texts = ['Hello, how are you?', 'I am doing great, thank you!']

In [13]:
num_words=10

In [14]:
# Each element of texts is a full sentence; hashing_trick splits it into words internally
hash_trick=[hashing_trick(text,num_words) for text in texts]

In [15]:
hash_trick

[[7, 7, 7, 9], [7, 1, 6, 7, 5, 9]]
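Two things are worth noting about this output. First, 'hello', 'how' and 'are' all map to 7: with a hashing space of 10, the nine distinct words share only nine usable indices (1 to 9), so collisions are likely. Second, hashing_trick uses Python's built-in `hash` by default, which is salted per process for strings, so the exact numbers may differ between runs; passing `hash_function='md5'` gives stable results. The sketch below illustrates the idea in plain Python. It is an approximation, not the library code: `stable_hashing_trick` and `split_words` are hypothetical helper names.

```python
import hashlib

# Punctuation and whitespace characters treated as separators (assumed default filter set)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase the text, replace filtered characters with spaces, and split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def stable_hashing_trick(text, n):
    """Map each word to an index in [1, n - 1] using an MD5 digest."""
    indices = []
    for word in split_words(text):
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # Index 0 is kept free, so only n - 1 buckets are available
        indices.append(digest % (n - 1) + 1)
    return indices

texts = ['Hello, how are you?', 'I am doing great, thank you!']
hashed = [stable_hashing_trick(text, 10) for text in texts]
print(hashed)
```

The same word always lands in the same bucket, so 'you' gets an identical index in both sentences, and the result does not change between runs. Increasing `n` lowers the collision rate at the cost of a larger index space.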

## 3] One hot

Despite its name, the one_hot function from TensorFlow's Keras module does not return one-hot vectors. It hashes each word to an integer in the range [1, n - 1], where n is the specified vocabulary size, and returns the list of those integers. Hashing does not guarantee uniqueness, so different words can share the same index.

**Use one_hot when:**

You want a quick and easy way to convert text to hashed indices.

You are prototyping and need to generate integer sequences quickly.

You are working with models that can accept these integer sequences directly.

In [19]:
from tensorflow.keras.preprocessing.text import one_hot

In [16]:
texts = ['Hello, how are you?', 'I am doing great, thank you!']

In [17]:
# Size of the hashing space (not a learned vocabulary); indices fall in [1, vocab_size - 1]
vocab_size=10

In [20]:
encoded=[one_hot(text,vocab_size) for text in texts]

In [21]:
encoded

[[7, 7, 7, 9], [7, 1, 6, 7, 5, 9]]
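The result is identical to the hashing_trick output because one_hot applies the same hashing step, and what comes back is a list of integer indices rather than vectors. If actual one-hot vectors are needed, the indices can be expanded afterwards. A minimal sketch in plain Python, where `to_one_hot` is a hypothetical helper and the index values are copied from the output above:

```python
def to_one_hot(indices, size):
    """Expand each integer index into a vector of length `size` with a single 1."""
    rows = []
    for index in indices:
        row = [0] * size
        row[index] = 1
        rows.append(row)
    return rows

# Hashed indices as printed above
encoded = [[7, 7, 7, 9], [7, 1, 6, 7, 5, 9]]
vocab_size = 10

vectors = [to_one_hot(seq, vocab_size) for seq in encoded]
print(vectors[0][0])  # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```

Each sentence becomes a list of vectors, one per word, with a 1 at the position given by the hashed index.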

## 4] Text to word sequence

**Usage:** Lowercases the text, strips punctuation, and splits it into a list of words (tokens).

In [22]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence

In [23]:
text = 'Hello, how are you?'

In [24]:
words=text_to_word_sequence(text)
words

['hello', 'how', 'are', 'you']
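The three steps behind this output (lowercase, filter punctuation, split on spaces) can be sketched in plain Python. This is an illustrative approximation rather than the library source: `simple_word_sequence` is a hypothetical name, and the filter string is assumed to match the default.

```python
# Punctuation and whitespace characters treated as separators (assumed default filter set)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the split character, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings left behind by consecutive separators
    return [token for token in text.translate(table).split(split) if token]

print(simple_word_sequence('Hello, how are you?'))  # ['hello', 'how', 'are', 'you']
```

Apostrophes are not in the filter set, so a contraction such as "don't" stays a single token.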