# Chapter 6 - Deep learning for text and sequences

## 6.1 Working with text data

Keep in mind throughout this chapter that none of these deep-learning models truly understands text in a human sense; rather, they map the statistical structure of written language. Deep learning for natural language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.

Like all other neural networks, deep-learning models don't take raw text as input: they only work with numeric tensors. **Vectorizing** text is the process of transforming text into numeric tensors. This can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector. **N-grams** are overlapping groups of multiple consecutive words or characters.
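To make the n-gram idea concrete, here is a minimal sketch that extracts overlapping word and character n-grams from a string. The helper names `word_ngrams` and `char_ngrams` are hypothetical, chosen for illustration only, and the lowercasing and whitespace splitting are simplifying assumptions rather than a prescribed tokenization scheme.

```py
def word_ngrams(text, n):
    """Return the overlapping word n-grams of `text` (split on whitespace)."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Word 2-grams: each pair of consecutive words, overlapping by one word.
bigrams = word_ngrams('The cat sat on the mat', 2)
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']

# Character 3-grams: each run of three consecutive characters.
trigrams = char_ngrams('cat sat', 3)
# ['cat', 'at ', 't s', ' sa', 'sat']
```

Each distinct n-gram can then be treated as a token and assigned a vector, exactly like a single word or character.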

Collectively, the different units into which you can break down text are called **tokens**, and breaking text into such tokens is called **tokenization**. There are two major ways to associate a vector with a token: **one-hot encoding** of tokens, and **token embedding** (typically used exclusively for words, and called word embedding).

### 6.1.1 One-hot encoding of words and characters

One-hot encoding consists of associating a unique integer index with every word and then turning this index $i$ into a binary vector of size $N$ (the size of the vocabulary): the vector is all zeros except for the $i$th entry, which is 1.

Word-level one-hot encoding:

```py
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

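# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # Tokenize by splitting on whitespace; punctuation stays attached to words.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word; index 0 is left unused.
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are considered.
max_length = 10

# Shape: (number of samples, max_length, highest index + 1).
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1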
```

Character-level one-hot encoding:

```py
import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# All printable ASCII characters.
characters = string.printable
# Map each character to a unique index; index 0 is left unused.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

# Only the first max_length characters of each sample are considered.
max_length = 50
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1
```

Note that Keras has built-in utilities for doing one-hot encoding of text at the word level or character level. They take care of a number of important features, such as stripping special characters and only taking into account the $N$ most common words (a common restriction, to avoid dealing with very large input vector spaces).

Using Keras for word-level one-hot encoding:

```py
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

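# Create a tokenizer that only takes into account the 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)
# Build the word index.
tokenizer.fit_on_texts(samples)

# Turn strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# Get the one-hot binary representations directly.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recover the word index that was computed.
word_index = tokenizer.word_index
print('Found {:d} unique tokens.'.format(len(word_index)))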
```

A variant of one-hot encoding is the so-called **one-hot hashing trick**, which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, you can hash words into vectors of fixed size. This is typically done with a very lightweight hashing function.

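The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you've seen all of the available data). The one drawback of this approach is that it's susceptible to hash collisions: two different words may end up with the same hash, and any model looking at these hashes won't be able to tell the two words apart. The likelihood of collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.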
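The hashing trick can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the hash function (`zlib.crc32`), the helper name `hashed_index`, and the values of `dimensionality` and `max_length` are all choices made for this example. CRC32 is used here because it gives the same result across runs, whereas Python's built-in `hash()` is randomized per process for strings.

```py
import zlib

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Size of the hashing space (an assumed value for this sketch). With close
# to 1,000 unique words or more, collisions would become frequent.
dimensionality = 1000
# Only the first max_length words of each sample are considered.
max_length = 10


def hashed_index(word):
    """Hash a word into an integer index between 0 and dimensionality - 1."""
    return zlib.crc32(word.encode('utf-8')) % dimensionality


results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # No dictionary lookup: the index is computed from the word itself.
        results[i, j, hashed_index(word)] = 1
```

No word index is ever stored, so a new sample can be encoded without having seen the rest of the data. Two distinct words that hash to the same index become indistinguishable, which is the collision risk described above.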