# Tokenize into words

Keras provides the **text_to_word_sequence()** function that you can use to split text into a list of words.

By default, this function does three things:

- Splits words on spaces (split=" ").
- Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n').
- Converts text to lowercase (lower=True).

In [1]:
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document (these arguments are the defaults, spelled out for clarity)
result = text_to_word_sequence(text,
                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                               lower=True,
                               split=" ")
print(result)

Using TensorFlow backend.


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
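
The three default steps can be reproduced in plain Python to see what each one contributes. This is only a sketch: `simple_word_sequence` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes the same default arguments as listed above.

```python
def simple_word_sequence(text,
                         filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                         lower=True,
                         split=" "):
    """Hypothetical stand-in that mimics the three default steps."""
    # Step 1: lowercase the text.
    if lower:
        text = text.lower()
    # Step 2: replace every filtered character with the split string.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Step 3: split, dropping empty strings left by repeated separators.
    return [token for token in text.split(split) if token]


tokens = simple_word_sequence('The quick brown fox jumped over the lazy dog.')
print(tokens)
```

Replacing filtered characters with the separator, rather than deleting them, keeps words such as `fox,jumped` from being glued together.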


# Build vocabulary - Encoding with one_hot

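A document can be represented as a sequence of integers, one per word.
Keras provides the **one_hot()** function for this. Despite its name, it does not return one-hot vectors: it hashes each word into a fixed range, so two different words can receive the same integer. The example below sizes the hash space at 1.3 times the estimated vocabulary to make such collisions less likely.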

In [1]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)



8
[3, 8, 4, 9, 5, 5, 3, 4, 6]
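
In the output above, the repeated word "the" gets the same integer both times, while some different words also share an integer. A minimal sketch of hash-based encoding shows why. It assumes each word is mapped into the range 1 to n - 1 by taking a hash modulo n - 1, and it swaps in MD5 as the hash so that results are reproducible between runs. `hashed_encode` is a hypothetical name, and the integers it prints will not match the Keras output.

```python
import hashlib


def hashed_encode(words, n):
    """Map each word to an integer in [1, n - 1] with a stable hash (illustrative only)."""
    codes = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left unused, so it stays free for padding.
        codes.append(int(digest, 16) % (n - 1) + 1)
    return codes


words = "the quick brown fox jumped over the lazy dog".split()
vocab_size = len(set(words))        # 8 distinct words
n = round(vocab_size * 1.3)         # 10, the same sizing rule as above
encoded = hashed_encode(words, n)
print(encoded)
# Fewer distinct codes than distinct words means at least one collision.
print(len(set(encoded)), "distinct codes for", vocab_size, "distinct words")
```

A larger `n` lowers the chance of collisions but cannot rule them out. When every word needs its own integer, an explicit word-to-index dictionary is the safer choice.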


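# Padding

Most models expect inputs of equal length, so shorter sequences have to be padded. Keras provides the **pad_sequences()** function for this; with maxlen=None it pads every sequence to the length of the longest one.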

In [38]:
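from keras.preprocessing.sequence import pad_sequences
# two integer-encoded documents of different lengths
test = [[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [2, 9, 12, 8]]
# padding='pre' inserts the fill value before each short sequence
pad_sequences(test, maxlen=None, dtype='int32', padding='pre', value=0)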


array([[ 2,  3,  4,  5,  6,  7,  2,  8,  9, 10, 11],
       [ 0,  0,  0,  0,  0,  0,  0,  2,  9, 12,  8]], dtype=int32)
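
The padding rule itself is short enough to write out by hand. The sketch below is a hypothetical pure-Python helper (`pad_to_length` is not a Keras function) that returns nested lists rather than a NumPy array. It assumes that sequences longer than `maxlen` lose items from the front, which is the default truncation behaviour of `pad_sequences`.

```python
def pad_to_length(sequences, maxlen=None, padding="pre", value=0):
    """Hypothetical helper: pad (or truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep only the last maxlen items when a sequence is too long.
        trimmed = list(seq[max(0, len(seq) - maxlen):])
        fill = [value] * (maxlen - len(trimmed))
        # 'pre' puts the fill values in front, anything else puts them behind.
        padded.append(fill + trimmed if padding == "pre" else trimmed + fill)
    return padded


test = [[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [2, 9, 12, 8]]
pre = pad_to_length(test)
post = pad_to_length(test, padding="post")
short = pad_to_length(test, maxlen=5)
print(pre)
print(post)
print(short)
```

Padding with 0 works well together with encodings that never assign 0 to a real word, since the model can then treat 0 as "no token".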