# Using the DataLoader

In [1]:
# Add the parent directory to the import path so that tweet_data can be found
import sys, os
sys.path.append(os.path.join(os.getcwd(), os.pardir))
from tweet_data import TweetsBaseDataset

Let's load the dev dataset for simplicity.

In [2]:
dataset = TweetsBaseDataset('../data/dev', 'us_trial')

Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...
Reading files in directory ../data/dev/us_trial
Read file with 50000 tweets
Building vocabulary
Loading labels


To get batches of variable-length sequences, we have to pass a custom `collate_fn` to the `DataLoader`. This function is defined in `TweetsBaseDataset.collate_fn()`.

In [3]:
from torch.utils.data import DataLoader
data_loader = DataLoader(dataset, collate_fn=TweetsBaseDataset.collate_fn, batch_size=4, shuffle=True)
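The implementation of `TweetsBaseDataset.collate_fn()` is not shown in this notebook. As a rough illustration of what such a function might do, here is a minimal sketch. It assumes that each dataset item is a `(token_ids, label)` pair; the name `sketch_collate_fn`, its `pad_index` parameter, and the toy data are hypothetical and not part of `tweet_data`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def sketch_collate_fn(batch, pad_index=0):
    """Hypothetical collate function: sort by length, then pad (sequence-first)."""
    # Sort the (token_ids, label) pairs from longest to shortest sequence
    batch = sorted(batch, key=lambda item: len(item[0]), reverse=True)
    sequences = [torch.tensor(tokens, dtype=torch.long) for tokens, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    lengths = torch.tensor([len(seq) for seq in sequences], dtype=torch.long)
    # pad_sequence defaults to batch_first=False: shape (max_len, batch_size)
    padded = pad_sequence(sequences, padding_value=pad_index)
    return padded, labels, lengths

# Toy batch of three "tweets" with made-up token ids and labels
toy_batch = [([5, 6], 1), ([7, 8, 9, 10], 0), ([11], 2)]
toy_padded, toy_labels, toy_lengths = sketch_collate_fn(toy_batch)
print(toy_padded)
print(toy_labels, toy_lengths)
```

Each column of `toy_padded` holds one sequence, and the labels are reordered together with the sequences so that they stay aligned.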

We can now use the data loader to get batches of the data. Each batch contains the padded sequences, the labels, and the length of each sequence. The padded tensor has shape (maximum sequence length, batch size), so each column holds one tweet. Sequences are padded with zeros (consistent with `dataset.vocabulary['<PAD>']`, which maps to 0) and are sorted from longest to shortest:

In [4]:
data, labels, lengths = next(iter(data_loader))
print('Padded sequences:\n', data)
print('Labels:\n', labels)
print('Sequence lengths:\n', lengths)

Padded sequences:
 tensor([[ 331,  109, 1370,    2],
        [  57,  146, 2190, 2484],
        [ 207,  807,  221,  961],
        [ 548,  419,    6,    3],
        [   2,  461, 4590,    2],
        [2132,    8,  418,  388],
        [   3,   39,    1,    3],
        [   2,  807,   24,    2],
        [ 182,   14,  674,   87],
        [  14,  593,   19,    3],
        [   3, 8994, 4590,    4],
        [   2,  100,  476, 4454],
        [2150,    7,    9,   65],
        [3138,   17,   22,    0],
        [2150,   19,    6,    0],
        [ 907,   14,   23,    0],
        [  14, 2258,    5,    0],
        [   3,   20,    0,    0],
        [   2,    1,    0,    0],
        [  13,    2,    0,    0],
        [4085, 8995,    0,    0],
        [ 792,    3,    0,    0],
        [   1,    5,    0,    0],
        [   3,    0,    0,    0],
        [   2,    0,    0,    0],
        [ 299,    0,    0,    0],
        [  18,    0,    0,    0],
        [ 142,    0,    0,    0],
        [  52,    0,    0,   

The sequence lengths can be used to create a `PackedSequence`, which avoids calculating the output of recurrent models for padding tokens. A `PackedSequence` is created using `pack_padded_sequence()`:

In [5]:
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Model definition
embedding_dim = 100
embeddings = torch.nn.Embedding(len(dataset.vocabulary), embedding_dim)
rnn = torch.nn.RNN(embedding_dim, embedding_dim)
linear = torch.nn.Linear(embedding_dim, 6)

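# Forward pass with a padded, sequence-first batch of data
def example_forward(data, lengths):
    x = embeddings(data)                  # (seq_len, batch, embedding_dim)
    x = pack_padded_sequence(x, lengths)  # lets the RNN skip padding positions
    _, x = rnn(x)                         # final hidden state: (1, batch, embedding_dim)
    x = linear(x)                         # (1, batch, 6)

    return x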

print(example_forward(data, lengths))

tensor([[[-0.0838,  0.2720,  0.0836, -0.1564, -0.3074,  0.1873],
         [-0.0749,  0.1578,  0.1244, -0.3130, -0.2459,  0.1515],
         [-0.0096,  0.2513,  0.1120, -0.1580, -0.0136, -0.0796],
         [ 0.2482,  0.2602, -0.2492,  0.0496,  0.0717,  0.1613]]],
       grad_fn=<ThAddBackward>)
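To see what packing does to a batch, the following sketch packs a small toy tensor and then unpacks it again. The toy values and variable names are made up for illustration; only `pack_padded_sequence()` and `pad_packed_sequence()` come from PyTorch.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy sequence-first batch: 3 time steps, 2 sequences of lengths 3 and 1
toy = torch.tensor([[1., 4.],
                    [2., 0.],
                    [3., 0.]])
toy_lengths = torch.tensor([3, 1])

packed = pack_padded_sequence(toy, toy_lengths)
# packed.data keeps only the non-padding entries, time step by time step;
# packed.batch_sizes records how many sequences are still active at each step
print(packed.data)
print(packed.batch_sizes)

# pad_packed_sequence restores the padded layout and the original lengths
unpacked, unpacked_lengths = pad_packed_sequence(packed)
print(unpacked)
print(unpacked_lengths)
```

The two padding entries never appear in `packed.data`, which is why a recurrent model fed with the packed batch does no work for them.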


Note that throughout these examples we have used PyTorch's default layout for recurrent models (`batch_first=False`), in which the first axis indexes positions in the sequence and the second axis indexes the examples in the batch.
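If a batch-first layout is preferred, the same computation can be expressed by transposing the input and passing `batch_first=True` to both the RNN and `pack_padded_sequence()`. The sketch below uses random toy data and hypothetical variable names, and checks that both layouts yield the same final hidden state when the two RNNs share weights.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
toy_seq_first = torch.randn(5, 3, 8)      # (sequence, batch, features)
toy_lengths = torch.tensor([5, 3, 2])     # sorted from longest to shortest

rnn_seq_first = torch.nn.RNN(8, 16)
rnn_batch_first = torch.nn.RNN(8, 16, batch_first=True)
# Copy the weights so that both models compute the same function
rnn_batch_first.load_state_dict(rnn_seq_first.state_dict())

# Default layout: the first axis is the sequence
_, h_seq = rnn_seq_first(pack_padded_sequence(toy_seq_first, toy_lengths))

# Batch-first layout: swap the first two axes and set batch_first=True
toy_batch_first = toy_seq_first.transpose(0, 1)   # (batch, sequence, features)
_, h_batch = rnn_batch_first(
    pack_padded_sequence(toy_batch_first, toy_lengths, batch_first=True))

# The final hidden state has shape (num_layers, batch, hidden) in both cases
print(h_seq.shape, h_batch.shape)
print(torch.allclose(h_seq, h_batch, atol=1e-6))
```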