# Deep Learning with Python
# 6.1 - One-Hot Encoding

- One-hot encoding is a basic way to turn the tokens extracted from text-based data into vectors.
- Tokenization is the process of breaking text-based data into individual units such as words or characters (called tokens) that can then be encoded as vectors.
- Tokenization allows us to convert text-based data into numeric tensors which can then be passed to a Deep Learning model.
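
To make the idea of a token concrete, here is a minimal sketch that splits one sentence at the two granularities used in the rest of this section. The variable names are illustrative only.

```python
# Two common tokenization granularities for the same text.
text = "The cat sat on the mat"

word_tokens = text.split()   # split on whitespace: one token per word
char_tokens = list(text)     # one token per character, spaces included

print(word_tokens)
print(len(word_tokens), len(char_tokens))
```

Word-level tokens give short sequences over a large vocabulary, while character-level tokens give long sequences over a small one.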

## Word-Level One-Hot Encoding
Each **word** in a sample is treated as an individual token.

In [3]:
import numpy as np

`samples` holds the data that we will tokenize and then vectorize through one-hot encoding. In this case, each sample is a sentence, but a sample can also be an entire document.

In [4]:
samples = ['The cat sat on the mat', 'The dog ate my homework']

### Tokenizing Samples
We will create a dictionary called `token_index` that maps each unique word across all samples to a unique index. The key is the word itself and the value is its index number.

We do this by iterating over the `samples` list and splitting each sample into individual words. In practice, we would also remove punctuation marks and special characters from the samples.

The result of `split()` on a single sample is a list of words. For each word, we check whether it is already present in the `token_index` dictionary. If it is not, we add the word as a key and set its value to the number of words already in the dictionary plus 1. We add 1 because index `0` is not assigned to any word; it is typically reserved, for example for padding.

In [5]:
# token_index is a dictionary mapping words to index numbers
token_index = {}

# For every sample in the corpus/collection of samples/documents
for sample in samples:
    for word in sample.split(): 
        if word not in token_index:
            token_index[word] = len(token_index) + 1
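
The punctuation removal mentioned above could look like the following. This is a minimal sketch, not part of the original notebook: `tokenize` is a hypothetical helper, and lowercasing is an extra assumption that makes `The` and `the` share one index.

```python
import string

def tokenize(sample):
    """Lowercase, strip ASCII punctuation, and split on whitespace (hypothetical helper)."""
    cleaned = sample.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

# Same sentences as above, with punctuation added to show the cleaning step
noisy_samples = ['The cat sat on the mat.', 'The dog ate my homework!']

clean_token_index = {}
for sample in noisy_samples:
    for word in tokenize(sample):
        if word not in clean_token_index:
            # Index 0 stays unassigned, as in the cell above
            clean_token_index[word] = len(clean_token_index) + 1

print(clean_token_index)
```

With this normalization the vocabulary has 9 entries instead of 10, because the two spellings of "the" collapse into one.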

When encoding a sample, we limit the program to a fixed number of words: only the first 10 words in each sample are encoded and the remaining words are ignored. This gives every sample the same length along the word axis and keeps the size of the `results` array manageable.

In [6]:
max_length = 10

We create a new three-dimensional array (a rank-3 tensor) to store the one-hot encoded samples.
- Axis 0 - The batch or sample axis: one entry per sample to be encoded.
- Axis 1 - `max_length`: the maximum number of words that will be encoded per sample.
- Axis 2 - `max(token_index.values()) + 1`: each word is represented by a vector of length `n + 1`, where `n` is the number of unique words in the dictionary. The extra entry accounts for index `0`, which is not assigned to any word.

In [7]:
results = np.zeros(shape=(len(samples), 
                         max_length, 
                         max(token_index.values()) + 1))

Each sample in the `samples` list is first split into a list of individual words. Then `enumerate` pairs each word with its position `j` in the sample (up to the maximum number of words to be encoded). The `get` method looks up the word in `token_index` and returns its `index`. Finally, the entry at `results[i, j, index]` is set to `1.`, a floating-point value that matches the default `float64` dtype of the array created by `np.zeros`.

In [8]:
# For each sample in the collection of documents
for i, sample in enumerate(samples):
    # For every word in each sample up to the defined number of words
    for j, word in enumerate(sample.split()[:max_length]):
        index = token_index.get(word)
        results[i, j, index] = 1.

In [9]:
# This is the dictionary of unique words in all samples
# Since we aren't converting words to lowercase, `The`, and `the` are different
token_index

{'The': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'the': 5,
 'mat': 6,
 'dog': 7,
 'ate': 8,
 'my': 9,
 'homework': 10}

The shape of the `results` array is `(2, 10, 11)`: there are 2 samples, 10 word positions per sample (`max_length`), and 11 entries per one-hot vector, one for each of the 10 unique words in the dictionary (case-sensitive, so `The` and `the` count separately) plus one for the unassigned index 0.

In [10]:
results.shape

(2, 10, 11)

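Print the one-hot encoded version of each sample to check the results by hand: row `j` has a 1 in the column given by the index of the `j`-th word, and rows past the end of the sample stay all zeros.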

In [11]:
results[0]

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [12]:
results[1]

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
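
Checking rows by eye does not scale beyond a few short sentences. One way to verify the encoding programmatically is to decode each one-hot row back into its word. The following is a minimal sketch that rebuilds the same `token_index` and `results` so it can run on its own; `index_to_token` and `decode` are hypothetical names introduced for illustration.

```python
import numpy as np

samples = ['The cat sat on the mat', 'The dog ate my homework']
max_length = 10

# Rebuild the word index and the one-hot tensor as in the cells above
token_index = {}
for sample in samples:
    for word in sample.split():
        token_index.setdefault(word, len(token_index) + 1)

results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.

# Invert the mapping so that an index can be turned back into a word
index_to_token = {index: word for word, index in token_index.items()}

def decode(encoded_sample):
    """Turn one (max_length, vocabulary size) one-hot matrix back into a list of words."""
    words = []
    for row in encoded_sample:
        if not row.any():  # all-zero row: position past the end of the sample
            continue
        words.append(index_to_token[int(row.argmax())])
    return words

decoded = [' '.join(decode(encoded)) for encoded in results]
print(decoded)
```

If the encoding is correct and no sample is longer than `max_length` words, `decoded` matches `samples` exactly.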

## Character-Level One-Hot Encoding

The `string` module is imported for its `printable` attribute, a string containing all printable ASCII characters: digits, letters, punctuation, and whitespace (100 characters in total). Characters outside this set, such as non-ASCII characters, have no entry in the character index.

In [13]:
import string

Samples are still the same as the ones used in word-based encoding.

In [14]:
samples = ['The cat sat on the mat', 'The dog ate my homework']


Make a string that contains all printable ASCII characters. Each character in it will be assigned an index.

In [15]:
characters = string.printable # All printable ASCII characters

Create a dictionary where each key is a character and each value is the index that encodes it. The characters must be the keys so that `token_index.get(character)` returns an integer index during encoding.

The `zip` function combines two sequences of the same length into pairs made of the elements at the same position in each sequence. Here it pairs each character with an index starting at 1, so the dictionary receives tuples of the form `(character, index)`.

In [16]:
token_index = dict(zip(characters, range(1, len(characters) + 1)))

We will limit the tokenizer to the first 50 characters in each sample.

In [17]:
max_length = 50

Create a `numpy` array of all zeros with the following dimensions:
- Axis 0 is still the batch axis and has one entry per sample to be encoded.
- Axis 1 has one entry per character position in the sample, limited to 50 characters.
- Axis 2 has one entry per character in the dictionary, plus an additional entry for the unassigned index 0. As in the word-level case, the size comes from the `values` of the dictionary because they store the indices.

In [18]:
results = np.zeros((len(samples), max_length,
                    max(token_index.values()) + 1))

The logic is slightly different here. Instead of splitting each sample into words at the space character, we iterate over the sample directly with `enumerate`, which yields each character together with its position `j`. The character's index from the dictionary then gives the location in the one-hot vector at position `j` that should be set to 1.

In [19]:
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

NumPy abbreviates large arrays when printing and shows only the first and last three entries along each axis. None of the characters in our samples map to those columns, so the abbreviated output below shows only zeros.

In [20]:
results

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
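
Because the printed array is abbreviated, it says little about whether the encoding is right. A quick sanity check for a character-level encoding is that every encoded position holds exactly one 1 and that the text can be recovered from the positions of those 1s. The following is a minimal, self-contained sketch; it assumes a dictionary that maps each character to its index, and `char_to_index` and `index_to_char` are hypothetical names introduced for illustration.

```python
import string
import numpy as np

samples = ['The cat sat on the mat', 'The dog ate my homework']
characters = string.printable
max_length = 50

# Assumption: the dictionary maps character -> index, with index 0 unassigned
char_to_index = dict(zip(characters, range(1, len(characters) + 1)))

results = np.zeros((len(samples), max_length, len(characters) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        results[i, j, char_to_index[character]] = 1.

# Each encoded position should contain exactly one 1; unused positions contain none
ones_per_position = results.sum(axis=2)
print(ones_per_position[0, :len(samples[0])])

# Recover the text from the location of the 1 in each non-empty row
index_to_char = {index: char for char, index in char_to_index.items()}
decoded = [
    ''.join(index_to_char[int(row.argmax())] for row in encoded if row.any())
    for encoded in results
]
print(decoded)
```

A row that contained more than one 1 would show up as a value greater than 1 in `ones_per_position`, which makes indexing mistakes easy to spot.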

## `keras` One-Hot Encoding
- Don't reinvent the wheel.
- Use `keras`'s built-in one-hot encoding functionality.
- These utilities strip special characters from strings, convert them to lowercase by default, and take into account only the `N` most common words.
- Limiting the vocabulary to the most common words helps us avoid dealing with very large input vector spaces.

In [21]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [22]:
samples = ['The cat sat on the mat', 'The dog ate my homework']

In [23]:
# Create tokenizer configured to take into account only the 1000 most common words
tokenizer = Tokenizer(num_words=1000)

In [24]:
# Build the word index from the samples
tokenizer.fit_on_texts(samples)

In [25]:
# Turn strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

In [26]:
# Could also directly get 1-hot binary representations
# Other vectorization modes than 1-Hot also supported
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

In [27]:
print(tokenizer.word_index)

{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}


In [28]:
# Recovering the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.'%len(word_index))

Found 9 unique tokens.
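
To see roughly what the tokenizer does with these samples, the steps can be imitated in plain Python. This is a rough sketch of the behaviour described above (lowercasing, punctuation stripping, frequency-ranked indices, a vocabulary cap), not the Keras implementation; `tokenize`, `counts`, and `matrix` are hypothetical names introduced for illustration.

```python
from collections import Counter
import string

samples = ['The cat sat on the mat', 'The dog ate my homework']
num_words = 1000

def tokenize(text):
    # Lowercase and drop ASCII punctuation before splitting on whitespace
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

# Rank words by frequency: index 1 goes to the most frequent word,
# and ties keep the order in which the words were first seen
counts = Counter(word for sample in samples for word in tokenize(sample))
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Integer sequences, keeping only indices below num_words
sequences = [[word_index[w] for w in tokenize(s) if word_index[w] < num_words]
             for s in samples]

# Binary rows: one row per sample, one column per index (column 0 unused)
matrix = [[0.0] * num_words for _ in samples]
for row, sequence in zip(matrix, sequences):
    for index in sequence:
        row[index] = 1.0

print(word_index)
print(sequences)
```

The sketch reproduces the word index printed above. Note that a binary row of this kind records which words occur in a sample but not their order, unlike the per-position one-hot arrays built by hand earlier.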


## One-Hot Hashing

When there are too many unique words to assign an explicit index to each one, we can use a variant of one-hot encoding called **one-hot hashing**. Instead of looking a word up in a dictionary, a hash function maps the word directly to an index in a vector of fixed size.

The major advantage of this method is that it allows for online encoding: a word can be encoded as soon as it is seen, without first building an index over all of the data (as conventional, index-based one-hot encoding requires).

Also, this scheme does away with maintaining an explicit word index, which saves memory. 

The drawback of this approach is hash collisions: the hash function may assign the same index to two different words, and the model then cannot tell those words apart. Collisions become more likely as the number of unique words approaches the dimensionality of the vectors, and they are unavoidable once it exceeds it.
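
The effect of dimensionality on collisions can be illustrated with a small experiment. This is a minimal sketch using Python's built-in `hash`; `count_collisions` is a hypothetical helper introduced for illustration.

```python
words = ['the', 'cat', 'sat', 'on', 'mat']

def count_collisions(words, dimensionality):
    """Count the words that land on an index already taken by an earlier word."""
    used = set()
    collisions = 0
    for word in words:
        index = abs(hash(word)) % dimensionality
        if index in used:
            collisions += 1
        used.add(index)
    return collisions

# With 3 slots for 5 distinct words, at least 2 collisions are unavoidable
print(count_collisions(words, dimensionality=3))

# With far more slots than words, collisions are unlikely but still possible
print(count_collisions(words, dimensionality=1000))
```

The exact counts can vary between runs because the hash values themselves vary, but the lower bound for the small vector always holds.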

In [29]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

In [30]:
# Stores words as vectors of size 1000. If we have close to 1000 words
# (or more) we are more likely to see hash collisions, which will
# decrease the accuracy of the encoding method.
dimensionality = 1000
max_length = 10

In [31]:
results = np.zeros((len(samples), 
                    max_length, 
                    dimensionality))

In [32]:
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
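        # Hash the word to an index between 0 and dimensionality - 1.
        # Python salts str hashes per process, so the indices differ
        # between runs unless PYTHONHASHSEED is fixed.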
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
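
If the encoded vectors need to be identical across runs, for example when a trained model is saved and reloaded in another process, a deterministic hash can replace the built-in one. This is a minimal sketch, assuming SHA-256 from `hashlib` as the hash function; that choice and the helper name `stable_index` are illustrative and not taken from the original notebook.

```python
import hashlib

def stable_index(word, dimensionality):
    """Map a word to an index in [0, dimensionality) that is the same in every process."""
    digest = hashlib.sha256(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

dimensionality = 1000
indices = {word: stable_index(word, dimensionality)
           for word in 'The cat sat on the mat.'.split()}
print(indices)
```

A cryptographic hash is slower than the built-in `hash`, so this trades some speed for reproducibility. Collisions remain possible in exactly the same way.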