# Using the DataLoader

In [1]:
# Import code from a module in the parent directory
import sys, os
sys.path.append(os.getcwd() + os.sep + os.pardir)
from tweet_data import TweetsBaseDataset



Let's load the dev dataset for simplicity.

In [2]:
dataset = TweetsBaseDataset('../data/dev', 'us_trial')

Reading file
Read file with 50000 tweets, 77053 unique tokens
Building vocabulary
Loading labels


To get batches of variable-length sequences, we have to pass a custom `collate_fn` to the `DataLoader`. This function is defined in `TweetsBaseDataset.collate_fn()`.

In [3]:
from torch.utils.data import DataLoader
data_loader = DataLoader(dataset, collate_fn=TweetsBaseDataset.collate_fn, batch_size=4, shuffle=True)
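The implementation of `TweetsBaseDataset.collate_fn()` lives in `tweet_data` and is not shown in this notebook. As a rough illustration of what such a function might do, here is a minimal sketch. It assumes each dataset item is a `(token_ids, label)` pair and that index 0 is the padding token; the name `collate_sketch` and the toy batch are hypothetical and not part of the project code.

```python
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_sketch(batch):
    """Hypothetical collate function; batch is a list of (token_ids, label) pairs."""
    # Sort by decreasing length, which pack_padded_sequence expects by default
    batch = sorted(batch, key=lambda item: len(item[0]), reverse=True)
    sequences = [torch.as_tensor(tokens, dtype=torch.long) for tokens, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    lengths = np.array([len(seq) for seq in sequences])
    # Pad with 0 (assumed padding index) to shape (max_len, batch_size)
    padded = pad_sequence(sequences, padding_value=0)
    return padded, labels, lengths

# Toy batch of three "tweets" with made-up token ids and labels
toy_batch = [([1, 2], 3), ([1, 77, 2, 10], 0), ([5, 6, 7], 4)]
toy_data, toy_labels, toy_lengths = collate_sketch(toy_batch)
print(toy_data.shape, toy_labels, toy_lengths)
```

Sorting inside the collate function keeps the three outputs aligned with each other, so the labels still match their sequences after reordering.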

We can now use the data loader to get batches of the data. Each batch contains the padded sequences, the labels, and the length of each sequence.

In [9]:
data, labels, lengths = next(iter(data_loader))
print('Padded sequences:\n', data)
print('Labels:\n', labels)
print('Sequence lengths:\n', lengths)

Padded sequences:
 tensor([[   1,  572,    2,    7],
        [   1,   77,    1, 1056],
        [   2,   10,    2, 1116],
        [   1,   88,    1,  149],
        [   2,   70,    2,   14],
        [   1,   17, 5504,    7],
        [   2,   11,    2,  276],
        [9456, 1858,  177,    3],
        [   2,   15,    1,  702],
        [   1,  545,    3,  450],
        [   2,   20,  596,  123],
        [   1,   83, 1306,    0],
        [   2, 8948,    0,    0],
        [   1,    0,    0,    0]])
Labels:
 tensor([3, 0, 4, 8])
Sequence lengths:
 [14 13 12 11]


The sequence lengths can be used to create a `PackedSequence`, which avoids calculating the output of recurrent models for padding tokens. A `PackedSequence` is created using `pack_padded_sequence()`:

In [22]:
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Model definition
embedding_dim = 100
embeddings = torch.nn.Embedding(len(dataset.vocabulary), embedding_dim)
rnn = torch.nn.RNN(embedding_dim, embedding_dim)
linear = torch.nn.Linear(embedding_dim, 6)

# Forward pass with padded batch of data
def example_forward(data, lengths):
    x = embeddings(data)
    x = pack_padded_sequence(x, lengths)
    _, x = rnn(x)
    x = linear(x)
    
    return x

print(example_forward(data, lengths))

tensor([[[ 0.3833,  0.4755,  0.2564,  0.4347, -0.4255,  0.2481],
         [ 0.5882,  0.0470, -0.0261, -0.1068, -0.7731,  0.0276],
         [-0.1024,  0.0152, -0.0420,  0.0380,  0.0566, -0.1652],
         [ 0.0483,  0.2355,  0.0023, -0.1281, -0.2174, -0.3228]]],
       grad_fn=<ThAddBackward>)
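To make the effect of packing more concrete, the following small sketch packs a toy batch and inspects the result. The tensor values are made up for illustration; the layout is `(seq_len, batch)` with 0 as the assumed padding index, and the lengths are sorted in decreasing order as in the batch above.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy batch in (seq_len, batch) layout; the second sequence has one padding token
padded = torch.tensor([[1, 4],
                       [2, 5],
                       [3, 0]])
toy_lengths = [3, 2]  # true lengths, in decreasing order

packed = pack_padded_sequence(padded, toy_lengths)

# Tokens are stored time step by time step, with the padding entry dropped
print(packed.data)         # tensor([1, 4, 2, 5, 3])
# Number of sequences still active at each time step
print(packed.batch_sizes)  # tensor([2, 2, 1])
```

Because the padding entry never appears in `packed.data`, a recurrent layer that receives the packed input does no work for it, and the final hidden state of each sequence is taken at its true last token.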


Note that throughout these examples we have been using the default setting in PyTorch where the first axis corresponds to the sequence, and the second 