We can use the text_to_word_sequence() function from the previous section to split the document into words and then use a set to represent only the unique words in the document. The size of this set can be used to estimate the size of the vocabulary for one document.

In [3]:
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

8
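To make the tokenization step concrete, here is a minimal pure-Python sketch of roughly what text_to_word_sequence() does by default: lowercase the text, drop punctuation, and split on whitespace. The helper name `simple_word_sequence` is hypothetical, and using `string.punctuation` as the filter is an assumption; Keras uses its own filter list, so results may differ on edge cases such as apostrophes.

```python
import string

def simple_word_sequence(text):
    # Replace every punctuation character with a space (assumed filter set),
    # lowercase the text, then split on whitespace.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

text = 'The quick brown fox jumped over the lazy dog.'
tokens = simple_word_sequence(text)
vocab = set(tokens)

# 9 tokens in the sentence, 8 of them unique ('the' appears twice)
print(len(tokens), len(vocab))
```

The set removes the repeated 'the', which is why the estimate is 8 rather than 9.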


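We can put this together with the one_hot() function to encode the words in the document. Despite its name, one_hot() returns a list of integers, one per word, produced by hashing each word into a fixed range. The complete example is listed below.

The vocabulary size is increased by about 30% (8 becomes round(8*1.3) = 10) to reduce collisions when hashing words.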

In [8]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
print(words)
vocab_size = len(words)
# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

{'over', 'dog', 'brown', 'quick', 'jumped', 'lazy', 'the', 'fox'}
[3, 8, 1, 1, 5, 5, 3, 4, 1]
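Even with the enlarged hash space, some different words share an integer in the output above. The short sketch below pairs each word with its integer to list the collisions. The integers are copied from the printed result, and the token list assumes the words come out in sentence order.

```python
from collections import defaultdict

tokens = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = [3, 8, 1, 1, 5, 5, 3, 4, 1]  # copied from the output above

# Group the distinct words that were assigned to each integer.
words_by_code = defaultdict(list)
for word, code in zip(tokens, codes):
    if word not in words_by_code[code]:
        words_by_code[code].append(word)

# A collision is any integer shared by more than one distinct word.
collisions = {code: words for code, words in words_by_code.items() if len(words) > 1}
print(collisions)
# {1: ['brown', 'fox', 'dog'], 5: ['jumped', 'over']}
```

Here 'the' maps to 3 both times, which is expected and not a collision; 'brown', 'fox', and 'dog' sharing 1 is.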


### Hash Encoding with hashing_trick

Keras provides the hashing_trick() function, which tokenizes and then integer encodes the document, just like the one_hot() function. It provides more flexibility, allowing you to specify the hash function as either 'hash' (the default), the built-in 'md5' option, or your own function.

Below is an example of integer encoding a document using the md5 hash function.

In [7]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

8
[6, 4, 1, 2, 7, 5, 6, 2, 6]


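We can see that using a different hash function yields different integers for the words than the one_hot() function produced in the previous section. The md5 integers are also consistent from run to run, whereas Python's built-in hash() can return different values for the same string in different interpreter sessions.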
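The idea behind md5-based hashing can be sketched in a few lines of plain Python. This is not the Keras source: `md5_hashing_trick` is a hypothetical helper, and mapping the digest into the range 1 to n-1 (so that 0 stays unused) is an assumption about how the library behaves, so the integers are not guaranteed to match the output above.

```python
from hashlib import md5

def md5_hashing_trick(tokens, n):
    """Map each token to an integer in [1, n - 1] using md5 (0 stays unused)."""
    codes = []
    for tok in tokens:
        # Interpret the hex digest as one large integer.
        digest = int(md5(tok.encode("utf-8")).hexdigest(), 16)
        # Fold it into the hash space; this mapping is an assumption.
        codes.append(digest % (n - 1) + 1)
    return codes

tokens = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = md5_hashing_trick(tokens, 10)
print(codes)
```

Because md5 depends only on the bytes of the word, both occurrences of 'the' receive the same integer, and rerunning the script gives the same list.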

### Tokenizer API

Keras provides the Tokenizer class for preparing text documents for deep learning. The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.

In [9]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!','Good work','Great effort','nice work','Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

Once fit, the Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:

 - **word_counts**: A dictionary of words and their counts.
 - **word_docs**: A dictionary of words and how many documents each appeared in.
 - **word_index**: A dictionary of words and their uniquely assigned integers.
 - **document_count**: An integer count of the total number of documents that were used to fit the Tokenizer.
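To see what these four attributes contain, the sketch below computes equivalent values in plain Python for the same five documents. The regular-expression tokenizer is a simplification, and ordering word_index by descending count with ties kept in order of first appearance is an assumption about how the integers are assigned.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

# Simplified tokenization: lowercase and keep runs of letters.
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# word_counts: total occurrences of each word across all documents.
word_counts = Counter(tok for doc in tokenized for tok in doc)

# word_docs: number of documents in which each word appears at least once.
word_docs = Counter(tok for doc in tokenized for tok in set(doc))

# word_index: most frequent word gets 1; the sort is stable, so ties keep
# their order of first appearance. Index 0 is never assigned.
ordered = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: i for i, w in enumerate(ordered, start=1)}

# document_count: number of documents seen.
document_count = len(docs)

print(word_index)
print(document_count)
```

The word 'work' appears in two documents, so it has the highest count and receives index 1.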

In [21]:
# summarize what was learned
print(t.word_counts)
print(t.word_docs)
#print(t.word_index)
#print(t.document_count)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
defaultdict(<class 'int'>, {'done': 1, 'well': 1, 'good': 1, 'work': 2, 'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1})


Once the Tokenizer has been fit on training data, it can be used to encode documents in the train or test datasets.

The texts_to_matrix() function on the Tokenizer can be used to create one vector per input document. The length of each vector is the vocabulary size plus one, because word indices start at 1 and position 0 is never assigned to a word.

This function provides a suite of standard bag-of-words model text encoding schemes that can be provided via a mode argument to the function.

The modes available include:

 * **'binary'**: Whether or not each word is present in the document. This is the default.
 * **'count'**: The count of each word in the document.
 * **'tfidf'**: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
 * **'freq'**: The frequency of each word as a proportion of the words in each document.
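The difference between the simpler modes is easiest to see on a single document. The sketch below is a plain-Python illustration rather than the Keras implementation: `encode_document` and the toy vocabulary are hypothetical, out-of-vocabulary words are assumed to be skipped, and 'freq' is assumed to divide by the number of in-vocabulary words. The 'tfidf' mode is left out because its exact weighting is library-specific.

```python
def encode_document(tokens, word_index, mode="count"):
    """Encode one tokenized document as a bag-of-words vector."""
    counts = {}
    for tok in tokens:
        idx = word_index.get(tok)
        if idx is not None:  # words outside the vocabulary are skipped
            counts[idx] = counts.get(idx, 0) + 1
    known = sum(counts.values())  # number of in-vocabulary tokens

    row = [0.0] * (len(word_index) + 1)  # position 0 is never used
    for idx, c in counts.items():
        if mode == "binary":
            row[idx] = 1.0
        elif mode == "count":
            row[idx] = float(c)
        elif mode == "freq":
            row[idx] = c / known
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return row

word_index = {"work": 1, "nice": 2, "good": 3}  # hypothetical toy vocabulary
tokens = ["nice", "work", "nice", "job"]        # "job" is out of vocabulary

for mode in ("binary", "count", "freq"):
    print(mode, encode_document(tokens, word_index, mode))
```

With this input, 'binary' marks positions 1 and 2, 'count' records 1 and 2 occurrences, and 'freq' records one-third and two-thirds.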
 
We can put all of this together with a worked example.

In [19]:
# encode each document as a bag-of-words vector of word counts
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

[[ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.]]


Running the example fits the Tokenizer on 5 small documents. The details of the fitted Tokenizer are printed. Then the 5 documents are encoded using a word count.

Each document is encoded as a 9-element vector: one position for each of the 8 words in the vocabulary, plus an unused position 0, since word indices start at 1. Each word position holds the value for the chosen encoding scheme. In this case, a simple word count mode is used.
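The matrix above can be reproduced without Keras, which makes the layout of the columns explicit. This is a sketch under two assumptions: a simplified letters-only tokenizer, and word indices assigned by descending count with ties in order of first appearance.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Assign indices starting at 1, most frequent word first (assumed ordering).
counts = Counter(tok for doc in tokenized for tok in doc)
ordered = sorted(counts, key=lambda w: -counts[w])
word_index = {w: i for i, w in enumerate(ordered, start=1)}

# One row per document; column 0 stays empty because no word has index 0.
matrix = []
for doc in tokenized:
    row = [0.0] * (len(word_index) + 1)
    for tok in doc:
        row[word_index[tok]] += 1.0
    matrix.append(row)

for row in matrix:
    print(row)
```

Column 1 corresponds to 'work', the only word that occurs twice, which is why it is set in both the second and fourth rows.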