# Working with text data
Like all other neural networks, deep-learning models don’t take raw text as input:
they only work with numeric tensors. Vectorizing text is the process of transforming text
into numeric tensors. This can be done in multiple ways:
1. Segment text into words, and transform each word into a vector.
2. Segment text into characters, and transform each character into a vector.
3. Extract n-grams of words or characters, and transform each n-gram into a vector.
N-grams are overlapping groups of multiple consecutive words or characters.
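
To make the first two options concrete, here is a minimal sketch that segments one sentence into words and into characters, then maps each distinct token to an integer. The helper `build_index` is a hypothetical name introduced for illustration, and the whitespace split is a simplifying assumption (punctuation stays attached to words). Turning each integer into a vector is the subsequent step.

```python
sentence = "The cat sat on the mat."

# Option 1: word-level tokens (naive whitespace split; punctuation stays attached).
word_tokens = sentence.split()

# Option 2: character-level tokens (every character, including spaces).
char_tokens = list(sentence)


def build_index(tokens):
    """Map each distinct token to an integer, in order of first appearance.

    Hypothetical helper for illustration; indices start at 1.
    """
    index = {}
    for token in tokens:
        index.setdefault(token, len(index) + 1)
    return index


word_index = build_index(word_tokens)
char_index = build_index(char_tokens)

# Each token is now addressable by an integer.
word_ids = [word_index[token] for token in word_tokens]
char_ids = [char_index[token] for token in char_tokens]

print(word_tokens)
print(word_ids)
print(len(char_tokens), "characters,", len(char_index), "distinct")
```

Word-level segmentation gives short sequences over a large vocabulary, while character-level segmentation gives long sequences over a small one.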

### Understanding n-grams and bag-of-words
Word n-grams are groups of N (or fewer) consecutive words that you can extract from
a sentence. The same concept may also be applied to characters instead of words.
Here’s a simple example. Consider the sentence “The cat sat on the mat.” It may be
decomposed into the following set of 2-grams: <br/>
{"The", "The cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the", "the mat", "mat"}<br/><br/>
It may also be decomposed into the following set of 3-grams:<br/>
{"The", "The cat", "cat", "cat sat", "The cat sat",
"sat", "sat on", "on", "cat sat on", "on the", "the",
"sat on the", "the mat", "mat", "on the mat"}<br/><br/>
Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term bag
here refers to the fact that you’re dealing with a set of tokens rather than a list or
sequence: the tokens have no specific order. This family of tokenization methods is
called bag-of-words.<br/>
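
The decomposition above can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: `bag_of_ngrams` is a hypothetical helper, and it assumes that punctuation is stripped and that words are separated by whitespace.

```python
import string


def bag_of_ngrams(sentence, n):
    """Return the set of word n-grams of size 1 to n found in `sentence`."""
    # Simplifying assumption: drop ASCII punctuation, then split on whitespace.
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    bag = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words across the sentence.
        for start in range(len(words) - size + 1):
            bag.add(" ".join(words[start:start + size]))
    return bag


bag2 = bag_of_ngrams("The cat sat on the mat.", 2)
bag3 = bag_of_ngrams("The cat sat on the mat.", 3)

print(sorted(bag2))
print(len(bag2), len(bag3))  # 11 15
```

Because the result is a Python `set`, the order in which the n-grams occurred is discarded, which is what "bag" means here.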
Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), it tends to be used in shallow language-processing models rather than
in deep-learning models. Extracting n-grams is a form of feature engineering, and
deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning. One-dimensional convnets and recurrent neural networks,
introduced later in this chapter, are capable of learning representations for groups of
words and characters without being explicitly told about the existence of such groups,
by looking at continuous word or character sequences. For this reason, we won’t
cover n-grams any further in this book. But do keep in mind that they’re a powerful,
unavoidable feature-engineering tool when using lightweight, shallow text-processing
models such as logistic regression and random forests.

# Word-level one-hot encoding

In [2]:
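import numpy as np

# Initial data: one entry per sample
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenize by splitting on whitespace; punctuation stays attached ('mat.')
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word; index 0 is left unused
            token_index[word] = len(token_index) + 1

print(token_index)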

{'ate': 8, 'my': 9, 'mat.': 6, 'the': 5, 'The': 1, 'cat': 2, 'dog': 7, 'homework.': 10, 'sat': 3, 'on': 4}


In [3]:
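# Only the first max_length words of each sample are encoded
max_length = 10

# One-hot tensor of shape (samples, max_length, highest word index + 1);
# the extra slot accounts for the unused index 0
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        # Mark word j of sample i with a 1 at its word index
        results[i, j, index] = 1.

print(results)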

[[[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

 [[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]]
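
One way to read the printed tensor is to decode it back into words. The sketch below rebuilds the same index and tensor so that it runs on its own, then inverts the mapping. The names `reverse_index` and `decode` are hypothetical and introduced only for illustration.

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Rebuild the word index and the one-hot tensor as in the cells above.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.

# Invert the index so that integers map back to words.
reverse_index = {index: word for word, index in token_index.items()}


def decode(one_hot_sample):
    """Recover the words of one sample; all-zero rows are padding and are skipped."""
    words = []
    for row in one_hot_sample:
        if row.any():
            words.append(reverse_index[int(row.argmax())])
    return words


decoded = [decode(sample) for sample in results]
print(results.shape)  # (2, 10, 11)
print(decoded[0])     # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
print(decoded[1])     # ['The', 'dog', 'ate', 'my', 'homework.']
```

Each row holds a single 1 at the word's index, and the trailing all-zero rows are the unused positions up to `max_length`. Unlike a bag-of-words, this representation keeps word order.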


# Using Keras for word-level one-hot encoding

In [4]:
from keras.preprocessing.text import Tokenizer

In [5]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer that keeps only the most frequent words, capped by num_words=1000
tokenizer = Tokenizer(num_words=1000)

# Build the index for words
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

print(sequences)

[[1, 4, 7, 8, 1, 9], [1, 6, 5, 2, 3]]


In [6]:
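# Get the binary one-hot representation of each sample directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# Recover the word index that was computed by fit_on_texts
word_index = tokenizer.word_index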

print(word_index)
print(one_hot_results)

{'sat': 7, 'homework': 3, 'cat': 4, 'ate': 5, 'the': 1, 'dog': 6, 'my': 2, 'on': 8, 'mat': 9}
[[ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  1. ...,  0.  0.  0.]]
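
To see what the truncated matrix contains, the following NumPy-only sketch rebuilds a binary matrix from the integer sequences printed earlier. It is not Keras's implementation: `sequences_to_binary_matrix` is a hypothetical helper, and dropping indices at or above `num_words` is an assumption about how the cap is applied.

```python
import numpy as np

# Integer sequences as printed above by texts_to_sequences.
sequences = [[1, 4, 7, 8, 1, 9], [1, 6, 5, 2, 3]]
num_words = 1000  # same value that was passed to Tokenizer(num_words=...)


def sequences_to_binary_matrix(sequences, num_words):
    """Return a (samples, num_words) matrix with a 1 wherever a word index occurs."""
    matrix = np.zeros((len(sequences), num_words))
    for i, sequence in enumerate(sequences):
        # Assumption: indices outside the num_words cap are ignored.
        kept = [index for index in sequence if index < num_words]
        matrix[i, kept] = 1.
    return matrix


matrix = sequences_to_binary_matrix(sequences, num_words)
print(matrix.shape)         # (2, 1000)
print(matrix[:, :10])       # the only non-zero columns are among the first ten
print(matrix.sum(axis=1))   # [5. 5.]
```

Each sample becomes a single vector with no position axis, so word order and repetition are lost: "the" occurs twice in the first sentence but contributes a single 1. In that sense the binary matrix is a bag-of-words representation.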
