<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/15_3_dataset_for_pretraining_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install d2l==1.0.0-alpha1.post0 --quiet


## 15.3 The Dataset for Pretraining Word Embeddings

#### Custom Collate Function
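With the default collation, every sample in a batch must have the same shape so that the samples can be stacked into a single tensor. Tokenized text in an NLP task breaks this assumption, because sentences usually differ in length.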

In [1]:
import torch
from torch.utils.data import DataLoader
import numpy as np

In [4]:
nlp_data = [
    {'tokenized_input': [1, 4, 5, 9, 3, 2],
     'label': 0},
    {'tokenized_input': [1, 7, 3, 14, 48, 7, 23, 154, 2],
     'label': 0},
    {'tokenized_input': [1, 30, 67, 117, 21, 15, 2],
     'label': 0},
    {'tokenized_input': [1, 17, 2],
     'label': 0}]

loader = DataLoader(nlp_data, batch_size=2, shuffle=False)
next(iter(loader))

RuntimeError: ignored
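
The collapsed traceback above hides the actual message. As a minimal sketch, the exception can be caught and printed. The exact wording depends on the installed PyTorch version, so no output is shown here. The names `ragged_data` and `error_message` are introduced only for this illustration.

```python
from torch.utils.data import DataLoader

# Two samples whose token lists differ in length (hypothetical toy data).
ragged_data = [
    {'tokenized_input': [1, 4, 5, 9, 3, 2], 'label': 0},
    {'tokenized_input': [1, 17, 2], 'label': 0},
]

error_message = None
try:
    # The default collate function tries to batch the two lists together.
    next(iter(DataLoader(ragged_data, batch_size=2, shuffle=False)))
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```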

The error is raised because sequences of different lengths cannot be combined into a rectangular tensor. There are two main solutions to this problem.

* Pad the whole dataset to the length of the longest example. Although this seems straightforward, it wastes memory and computation on the GPU, and the extra padding does not improve the result.
* Pad dynamically during batch creation. Once the samples for a batch are selected, we pad only those samples to the length of the longest one among them.
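
To make the difference concrete, the sketch below counts how many padding entries each option creates on the same four toy sequences, assuming a batch size of 2 and no shuffling. The variable names (`sequences`, `global_pad_count`, `dynamic_pad_count`) are illustrative and not part of the original notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# The same toy token sequences used in this section.
sequences = [
    [1, 4, 5, 9, 3, 2],
    [1, 7, 3, 14, 48, 7, 23, 154, 2],
    [1, 30, 67, 117, 21, 15, 2],
    [1, 17, 2],
]
tensors = [torch.tensor(s) for s in sequences]
n_tokens = sum(len(s) for s in sequences)  # real (non-padding) tokens

# Option 1: pad every example to the longest sequence in the dataset.
global_padded = pad_sequence(tensors, batch_first=True)
global_pad_count = global_padded.numel() - n_tokens

# Option 2: pad each batch only to the longest sequence inside that batch.
batch_size = 2  # assumed batch size
dynamic_total = 0
for start in range(0, len(tensors), batch_size):
    batch = pad_sequence(tensors[start:start + batch_size], batch_first=True)
    dynamic_total += batch.numel()
dynamic_pad_count = dynamic_total - n_tokens

print(tuple(global_padded.shape))           # (4, 9)
print(global_pad_count, dynamic_pad_count)  # 11 7
```

Even on this tiny dataset, dynamic padding creates fewer padding entries. The gap typically grows when a few examples are much longer than the rest.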

We can implement the second method by creating a custom `collate_fn` function.

In [10]:
from torch.nn.utils.rnn import pad_sequence

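def custom_collate(data):
  # `data` is the list of samples that the DataLoader selected for one batch.
  inputs = [torch.tensor(d['tokenized_input']) for d in data]
  labels = [d['label'] for d in data]

  # Pad every sequence with zeros up to the longest sequence in this batch.
  inputs = pad_sequence(inputs, batch_first=True)
  labels = torch.tensor(labels)

  return {'tokenized_input': inputs,
          'label': labels}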

loader = DataLoader(nlp_data, batch_size=2, shuffle=False,
                    collate_fn=custom_collate)
iter_loader = iter(loader)
batch1 = next(iter_loader)
print(batch1)
batch2 = next(iter_loader)
print(batch2)

{'tokenized_input': tensor([[  1,   4,   5,   9,   3,   2,   0,   0,   0],
        [  1,   7,   3,  14,  48,   7,  23, 154,   2]]), 'label': tensor([0, 0])}
{'tokenized_input': tensor([[  1,  30,  67, 117,  21,  15,   2],
        [  1,  17,   2,   0,   0,   0,   0]]), 'label': tensor([0, 0])}
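
In the padded output above, the zeros cannot be told apart from real tokens unless the original lengths are kept. One possible extension, sketched here under the assumption that id 0 is reserved for padding (no token in the toy data uses it), is to return the lengths and a boolean mask as well. The names `PAD_ID`, `collate_with_lengths`, `length`, and `mask` are hypothetical and not part of the original notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = 0  # assumption: id 0 is reserved for padding

def collate_with_lengths(data):
  inputs = [torch.tensor(d['tokenized_input']) for d in data]
  lengths = torch.tensor([len(x) for x in inputs])
  padded = pad_sequence(inputs, batch_first=True, padding_value=PAD_ID)
  # mask[i, j] is True for a real token and False for a padded position.
  positions = torch.arange(padded.size(1)).unsqueeze(0)
  mask = positions < lengths.unsqueeze(1)
  labels = torch.tensor([d['label'] for d in data])
  return {'tokenized_input': padded, 'length': lengths,
          'mask': mask, 'label': labels}

# Toy data with the same structure as nlp_data.
toy_data = [
    {'tokenized_input': [1, 4, 5, 9, 3, 2], 'label': 0},
    {'tokenized_input': [1, 7, 3, 14, 48, 7, 23, 154, 2], 'label': 0},
]
loader = DataLoader(toy_data, batch_size=2, shuffle=False,
                    collate_fn=collate_with_lengths)
batch = next(iter(loader))
print(batch['length'])  # tensor([6, 9])
print(batch['mask'][0])
```

A downstream model or loss function could then use `mask` to ignore the padded positions.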
