# **Data Loader**

A data loader in **PyTorch** loads and batches data from a data set efficiently. It abstracts away the work of iterating over the data set, shuffling it, and dividing it into batches for training. In NLP applications, the data loader can also process and transform the text data itself, rather than just serve samples from the data set.

Data loaders have several key parameters, including the data set to load from, batch size (determining how many samples per batch), shuffle (whether to shuffle the data for each epoch), and more. Data loaders also provide an iterator interface, making it easy to iterate over batches of data during training.

An iterator is an object that yields its elements one at a time. It implements two methods: `__iter__()`, which returns the iterator itself, and `__next__()`, which returns the next element. When no elements remain, `__next__()` raises a **`StopIteration`** exception.
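
The iterator protocol described above can be sketched in plain Python. The `Countdown` class is a hypothetical illustration, not part of PyTorch; it only shows how `__iter__()`, `__next__()`, and `StopIteration` fit together.

```python
class Countdown:
    """Hypothetical iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):
        # Signal exhaustion once there is nothing left to yield
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


# A for loop (or list()) calls __next__() until StopIteration is raised
countdown_values = list(Countdown(3))
print(countdown_values)
```

Because each value is computed only when `__next__()` is called, the full sequence never has to exist in memory at once.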

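Iterators are commonly used to traverse large data sets without loading all elements into memory simultaneously, which makes the process more memory-efficient. In PyTorch, not every data set is an iterator (map-style data sets are accessed by index), but every data loader can be iterated over: calling `iter()` on it returns an iterator that yields batches.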
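
As a minimal sketch of this difference, the snippet below indexes a map-style data set directly and then steps through a data loader by hand with `iter()` and `next()`. The toy data and the names `toy_dataset` and `toy_loader` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: the integers 0..5, served two at a time
toy_dataset = TensorDataset(torch.arange(6))
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)

# A map-style data set supports indexing and len()
first_sample = toy_dataset[0]  # a 1-tuple holding tensor(0)

# The loader hands out batches one at a time through an iterator
loader_iter = iter(toy_loader)
(first_batch,) = next(loader_iter)
(second_batch,) = next(loader_iter)
remaining = list(loader_iter)  # whatever batches are left

print(first_batch, second_batch, len(remaining))
```

Each call to `next()` produces one batch; once the batches run out, a further `next()` call raises `StopIteration`, which is what ends a `for` loop over the loader.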

In PyTorch, the data loader processes data in batches, loading only one batch into memory at a time. The batch size, which you specify when creating the data loader, determines how many samples are processed together in each batch. The data loader's purpose is to convert input data and labels into batches of tensors with the same shape that a deep learning model can consume.
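
To make the effect of the batch size concrete, here is a small sketch with hypothetical features and labels (five samples, two features each). The variable names are illustrative assumptions, not part of the notebook's pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 5 samples with 2 features each, plus one label per sample
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)
labels = torch.tensor([0, 1, 0, 1, 0])

toy_dataset = TensorDataset(features, labels)
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)

# Collect the shape of every batch the loader produces
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
for x_shape, y_shape in batch_shapes:
    print("features:", x_shape, "labels:", y_shape)
```

With five samples and `batch_size=2`, the last batch holds a single sample. Passing `drop_last=True` to `DataLoader` would discard that incomplete batch instead.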

Finally, a data loader can handle tasks such as tokenizing, converting tokens to index sequences, padding your samples to the same size, and transforming your data into tensors that your model can understand, typically through a custom `collate_fn`.


In [3]:
# @title Install Necessary Libraries

print("--- Installing Libraries ---")

# Uninstall conflicting packages
!pip uninstall -y torch torchtext torchdata numpy

# Reinstall compatible versions
!pip install torch==2.2.2 torchtext==0.17.2 torchdata==0.7.1 numpy==1.24.4

print("\n--- Installation and Downloads Complete ---")

--- Installing Libraries ---
Found existing installation: torch 2.2.2
Uninstalling torch-2.2.2:
  Successfully uninstalled torch-2.2.2
Found existing installation: torchtext 0.17.2
Uninstalling torchtext-0.17.2:
  Successfully uninstalled torchtext-0.17.2
Found existing installation: torchdata 0.11.0
Uninstalling torchdata-0.11.0:
  Successfully uninstalled torchdata-0.11.0
Found existing installation: numpy 1.24.4
Uninstalling numpy-1.24.4:
  Successfully uninstalled numpy-1.24.4
Collecting torch==2.2.2
  Using cached torch-2.2.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Collecting torchtext==0.17.2
  Using cached torchtext-0.17.2-cp311-cp311-manylinux1_x86_64.whl.metadata (7.9 kB)
Collecting torchdata==0.7.1
  Downloading torchdata-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting numpy==1.24.4
  Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached torch-2.2.2-cp311-cp31


--- Installation and Downloads Complete ---


In [1]:
# @title Import Necessary Libraries

print("--- Importing Libraries ---")

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

import torchtext
print(torch.__version__)
print(torchtext.__version__)

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from torchdata.datapipes.iter import IterableWrapper, Mapper

import numpy as np
import pandas as pd
import random

print("\nAll necessary libraries imported.")

--- Importing Libraries ---
2.2.2+cu121
0.17.2+cpu

All necessary libraries imported.


In [2]:
sentences = [
    "There's a lady who's sure all that glitters is gold",
    "And she's buying a stairway to Heaven",
    "When she gets there she knows, if the stores are all closed",
    "With a word she can get what she came for",
    "Ooh, ooh, and she's buying a stairway to Heaven"
]

In [6]:
# @title Define a custom data set
class CustomDataset(Dataset):
    def __init__(self, sentences, tokenizer, vocab):
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.vocab = vocab

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.tokenizer(self.sentences[idx])
        # Convert tokens to vocabulary indices and return them as a LongTensor.
        # Build the tensor once: wrapping an existing tensor in torch.tensor()
        # again triggers a UserWarning.
        return torch.tensor([self.vocab[token] for token in tokens], dtype=torch.long)

# Tokenizer
tokenizer = get_tokenizer("basic_english")

# Build vocabulary
vocab = build_vocab_from_iterator(map(tokenizer, sentences), specials=["<PAD>", "<UNK>"])
vocab.set_default_index(vocab["<UNK>"])  # Handle unknown tokens

# Create an instance of your custom data set
custom_dataset = CustomDataset(sentences, tokenizer, vocab)

print("Custom Dataset Length:", len(custom_dataset))
print("Sample Items:")
for i in range(len(custom_dataset)):
    sample_item = custom_dataset[i]
    print(f"Item {i + 1}: {sample_item}")

Custom Dataset Length: 5
Sample Items:
Item 1: tensor([13,  3,  5,  4, 27, 34,  3,  5, 29,  7, 30, 22, 25, 23])
Item 2: tensor([ 8,  2,  3,  5,  9,  4, 12, 14, 10])
Item 3: tensor([33,  2, 21, 13,  2, 26,  6, 24, 31, 28, 15,  7, 18])
Item 4: tensor([35,  4, 36,  2, 17, 20, 32,  2, 16, 19])
Item 5: tensor([11,  6, 11,  6,  8,  2,  3,  5,  9,  4, 12, 14, 10])



In [7]:
# @title Define Collate Function with Sorting and Padding

def collate_batch(batch):
    # Sort by length (descending) without mutating the incoming list
    batch = sorted(batch, key=len, reverse=True)

    # Pad every sequence to the length of the longest one in the batch
    padded_batch = pad_sequence(batch, batch_first=True, padding_value=vocab["<PAD>"])

    # Record the original lengths (useful for packing or for building attention masks)
    lengths = torch.tensor([len(seq) for seq in batch])

    return padded_batch, lengths
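
The collate function returns sequence lengths rather than a mask. If a model needs an explicit padding mask, one can be derived from those lengths. The sketch below is one possible approach, using the lengths of the first batch as sample input; `attention_mask` is an illustrative name, not something defined elsewhere in this notebook.

```python
import torch

# Lengths as returned by a collate function (values taken from the first batch)
lengths = torch.tensor([14, 13, 9])
max_len = int(lengths.max())

# Position indices 0..max_len-1, shaped (1, max_len) for broadcasting
positions = torch.arange(max_len).unsqueeze(0)

# True where a position holds a real token, False where it holds padding
attention_mask = positions < lengths.unsqueeze(1)  # shape: (batch, max_len)

print(attention_mask.int())
```

Note that some PyTorch modules expect the opposite convention (True marks padding), so the mask may need to be inverted with `~attention_mask` depending on the consumer.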

In [10]:
# @title Create DataLoader with Batch Size 3

# Instantiate dataset
custom_dataset = CustomDataset(sentences, tokenizer, vocab)

# Create DataLoader
data_loader = DataLoader(
    custom_dataset,
    batch_size=3,
    shuffle=False,
    collate_fn=collate_batch
)

In [11]:
# @title Iterate and Print Batches

print("Batches:")
for i, (batch, lengths) in enumerate(data_loader):
    print(f"\nBatch {i + 1}:")
    print("Padded Sequences:\n", batch)
    print("Sequence Lengths:\n", lengths)

Batches:

Batch 1:
Padded Sequences:
 tensor([[13,  3,  5,  4, 27, 34,  3,  5, 29,  7, 30, 22, 25, 23],
        [33,  2, 21, 13,  2, 26,  6, 24, 31, 28, 15,  7, 18,  0],
        [ 8,  2,  3,  5,  9,  4, 12, 14, 10,  0,  0,  0,  0,  0]])
Sequence Lengths:
 tensor([14, 13,  9])

Batch 2:
Padded Sequences:
 tensor([[11,  6, 11,  6,  8,  2,  3,  5,  9,  4, 12, 14, 10],
        [35,  4, 36,  2, 17, 20, 32,  2, 16, 19,  0,  0,  0]])
Sequence Lengths:
 tensor([13, 10])
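
One common reason to sort a batch by descending length and keep the lengths is to pack the padded batch for a recurrent layer. The following is a minimal sketch with a small hypothetical batch (not the batches printed above), assuming `pack_padded_sequence` is the intended consumer, which the notebook does not state.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical padded batch: two sequences of lengths 3 and 2, padded with 0
padded = torch.tensor([[11, 6, 11, 0],
                       [35, 4, 0, 0]])
lengths = torch.tensor([3, 2])  # already sorted in descending order

# enforce_sorted=True (the default) requires descending lengths
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)

# packed.data holds only the real tokens, ordered time step by time step
print(packed.data)
# packed.batch_sizes says how many sequences are still active at each step
print(packed.batch_sizes)
```

In practice the padded indices would usually pass through an embedding layer before packing; raw indices are packed here only to keep the sketch short.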


