In [3]:
import torch

# Preparing Names Data

Here we'll review the steps for converting names to tensors and embeddings.  We'll also introduce the packed and padded sequences that make dealing with batches of sequences of different lengths possible.

### 1. Preparing the Data

Let's assume you have a dataset of names and their corresponding languages. Each name is a sequence of characters, and the names vary in length.

In [17]:
names = ["Chen", "Smith", "Dubois", "O'Neill", "Kawasaki", ]
languages = ["Chinese", "English", "French", "Irish", "Japanese"]

## 2. Tokenizing and Encoding the Names
Convert each name into a tensor of character indices. You'll need a character vocabulary for this.  

* Tokenizing is breaking up text into smaller units like characters or words.
* We can create our vocabulary using all the unique tokens in our corpus or a predefined vocabulary, such as ASCII characters.  

In [18]:
# Set up the vocabulary

import string

all_letters = string.ascii_letters + " .,;'"
alphabet_to_index = { char:i for i,char in enumerate(all_letters) }  # this is the vocabulary
alphabet_to_index

{'a': 0,
 'b': 1,
 'c': 2,
 'd': 3,
 'e': 4,
 'f': 5,
 'g': 6,
 'h': 7,
 'i': 8,
 'j': 9,
 'k': 10,
 'l': 11,
 'm': 12,
 'n': 13,
 'o': 14,
 'p': 15,
 'q': 16,
 'r': 17,
 's': 18,
 't': 19,
 'u': 20,
 'v': 21,
 'w': 22,
 'x': 23,
 'y': 24,
 'z': 25,
 'A': 26,
 'B': 27,
 'C': 28,
 'D': 29,
 'E': 30,
 'F': 31,
 'G': 32,
 'H': 33,
 'I': 34,
 'J': 35,
 'K': 36,
 'L': 37,
 'M': 38,
 'N': 39,
 'O': 40,
 'P': 41,
 'Q': 42,
 'R': 43,
 'S': 44,
 'T': 45,
 'U': 46,
 'V': 47,
 'W': 48,
 'X': 49,
 'Y': 50,
 'Z': 51,
 ' ': 52,
 '.': 53,
 ',': 54,
 ';': 55,
 "'": 56}

It will be helpful to have a function to convert each name to a tensor of its indices in the vocabulary.

In [19]:
# Encode names as tensors of character indices
def string_to_tensor(string, vocabulary):
    indices = [vocabulary[char] for char in string]
    return torch.tensor(indices, dtype=torch.long)

encoded_names = [string_to_tensor(name, char_to_index) for name in names]
encoded_names

[tensor([12, 11, 14, 15]),
 tensor([16,  6, 10,  8, 11]),
 tensor([ 4, 17, 21,  7, 10,  9]),
 tensor([ 2,  1, 20, 14, 10,  5,  5]),
 tensor([ 3, 18, 13, 18,  9, 18, 19, 10])]

The output is a list of tensors of different lengths, but if we want to process a batch of sequences we'll need all the tensors to have the same length.

### 3. Padding the sequences

Use `torch.nn.utils.rnn.pad_sequence` to pad the encoded names, making them all the same length.

In [20]:
from torch.nn.utils.rnn import pad_sequence

# Pad the encoded names
padded_names = pad_sequence(encoded_names, batch_first=True)
padded_names

tensor([[12, 11, 14, 15,  0,  0,  0,  0],
        [16,  6, 10,  8, 11,  0,  0,  0],
        [ 4, 17, 21,  7, 10,  9,  0,  0],
        [ 2,  1, 20, 14, 10,  5,  5,  0],
        [ 3, 18, 13, 18,  9, 18, 19, 10]])

In the book we're using the author also packs the sequences using when they're loaded by the dataloader, but then has to unpack them before creating the embeddings.  After the embeddings are created, they're repacked.  

We'll show how packing works using our set of names, but in practice, we'll apply packing after the embedding step.

**Note:**  If you search for information about packing and padding on the internet you'll find many resources saying the sequences should be sorted in order of descending length before packing and padding.  That's no longer necessary in PyTorch.  I've sorted the names in order of increasing length to make this easier to understand.

#### Packing the sequences

The best way to understand packing is to do it and view the results.  Note that we need the lengths of the sequences before packing them:

In [24]:
# Create a packed sequence
from torch.nn.utils.rnn import pack_padded_sequence

lengths = torch.tensor([len(name) for name in encoded_names])
print(lengths)

packed_names = pack_padded_sequence(padded_names, lengths=lengths, batch_first=True, enforce_sorted=False)
print(packed_names)
print(padded_names) # for comparison

tensor([4, 5, 6, 7, 8])
PackedSequence(data=tensor([ 3,  2,  4, 16, 12, 18,  1, 17,  6, 11, 13, 20, 21, 10, 14, 18, 14,  7,
         8, 15,  9, 10, 10, 11, 18,  5,  9, 19,  5, 10]), batch_sizes=tensor([5, 5, 5, 5, 4, 3, 2, 1]), sorted_indices=tensor([4, 3, 2, 1, 0]), unsorted_indices=tensor([4, 3, 2, 1, 0]))
tensor([[12, 11, 14, 15,  0,  0,  0,  0],
        [16,  6, 10,  8, 11,  0,  0,  0],
        [ 4, 17, 21,  7, 10,  9,  0,  0],
        [ 2,  1, 20, 14, 10,  5,  5,  0],
        [ 3, 18, 13, 18,  9, 18, 19, 10]])


For reference, PyTorch has a function for unpacking as well:

In [32]:
from torch.nn.utils.rnn import pad_packed_sequence

unpacked_packed_names = pad_packed_sequence( packed_names )
print(unpacked_packed_names)

(tensor([[12, 16,  4,  2,  3],
        [11,  6, 17,  1, 18],
        [14, 10, 21, 20, 13],
        [15,  8,  7, 14, 18],
        [ 0, 11, 10, 10,  9],
        [ 0,  0,  9,  5, 18],
        [ 0,  0,  0,  5, 19],
        [ 0,  0,  0,  0, 10]]), tensor([4, 5, 6, 7, 8]))


Whoa, that isn't quite what I was expecting since the output is transposed (this is the format that RNN layers like), but we can fix that by specifying `batch_first = True`:

In [33]:
unpacked_packed_names = pad_packed_sequence( packed_names , batch_first = True)
print(unpacked_packed_names)

(tensor([[12, 11, 14, 15,  0,  0,  0,  0],
        [16,  6, 10,  8, 11,  0,  0,  0],
        [ 4, 17, 21,  7, 10,  9,  0,  0],
        [ 2,  1, 20, 14, 10,  5,  5,  0],
        [ 3, 18, 13, 18,  9, 18, 19, 10]]), tensor([4, 5, 6, 7, 8]))


### Embed the sequences

Above we showed how to pack the padded sequences.  However, in practice we'll need sequences that are padded and not packed to create embeddings.   

In [35]:
vocab_size = len(alphabet_to_index)
embedding_dim = 4 #for illustration, usually 64 to 256 or higher

embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

embedded_names = embedding_layer(padded_names)

print(embedded_names.shape)
print(embedded_names)

torch.Size([5, 8, 4])
tensor([[[-0.6822, -0.9926,  1.7998, -1.4366],
         [ 0.8385,  2.9592, -0.5322,  1.3943],
         [-0.9660,  1.1546, -0.0527,  0.2459],
         [ 2.2587,  0.3257, -0.9071,  0.0982],
         [ 0.6579, -1.4150, -0.9433, -0.8691],
         [ 0.6579, -1.4150, -0.9433, -0.8691],
         [ 0.6579, -1.4150, -0.9433, -0.8691],
         [ 0.6579, -1.4150, -0.9433, -0.8691]],

        [[ 0.8636, -1.5479,  0.9165, -0.9403],
         [-1.2439,  0.0983,  0.3206, -0.8861],
         [ 0.4721, -0.5125,  0.7859, -0.2357],
         [ 0.7767, -0.4849,  1.3134,  1.0431],
         [ 0.8385,  2.9592, -0.5322,  1.3943],
         [ 0.6579, -1.4150, -0.9433, -0.8691],
         [ 0.6579, -1.4150, -0.9433, -0.8691],
         [ 0.6579, -1.4150, -0.9433, -0.8691]],

        [[-0.0987, -1.1727,  1.6411, -0.8086],
         [ 0.3613,  0.4326, -2.0263, -1.6359],
         [-0.0186,  0.0044,  1.9132, -1.0677],
         [ 0.5504,  0.8048, -0.1656, -0.9748],
         [ 0.4721, -0.5125,  0.785

### 5.  Pack the embedded sequences before the RNN layer

Everybody seems to do this, but I don't understand if it's actually helpful.  Let's try it to see if we can learn anything.

In [37]:
from torch.nn.utils.rnn import pack_padded_sequence

packed_embedded_names = pack_padded_sequence(embedded_names, lengths, batch_first=True, enforce_sorted=False)

packed_embedded_names

PackedSequence(data=tensor([[-0.8737,  1.1373,  0.0240,  1.0491],
        [-0.3728, -0.2573,  0.1431, -1.7390],
        [-0.0987, -1.1727,  1.6411, -0.8086],
        [ 0.8636, -1.5479,  0.9165, -0.9403],
        [-0.6822, -0.9926,  1.7998, -1.4366],
        [-0.2080,  0.9390, -0.4297, -0.3330],
        [-1.1336,  0.7870,  0.1734,  1.0852],
        [ 0.3613,  0.4326, -2.0263, -1.6359],
        [-1.2439,  0.0983,  0.3206, -0.8861],
        [ 0.8385,  2.9592, -0.5322,  1.3943],
        [-1.7522, -2.1198,  0.1935,  0.1146],
        [-0.7598,  1.7562, -0.5556, -0.1444],
        [-0.0186,  0.0044,  1.9132, -1.0677],
        [ 0.4721, -0.5125,  0.7859, -0.2357],
        [-0.9660,  1.1546, -0.0527,  0.2459],
        [-0.2080,  0.9390, -0.4297, -0.3330],
        [-0.9660,  1.1546, -0.0527,  0.2459],
        [ 0.5504,  0.8048, -0.1656, -0.9748],
        [ 0.7767, -0.4849,  1.3134,  1.0431],
        [ 2.2587,  0.3257, -0.9071,  0.0982],
        [-1.2631,  1.4951, -0.5700,  1.0441],
        [ 0.47