In [18]:
'''
1) Dataset in PyTorch

A Dataset is an abstraction that represents your data (features + labels).

It tells PyTorch how to access and return individual samples.

There are two main types of datasets:

Built-in datasets – available in torchvision.datasets (like MNIST, CIFAR-10).

Custom datasets – you create your own by subclassing torch.utils.data.Dataset.

2) DataLoader in PyTorch

The DataLoader wraps around a Dataset and provides:

Batching (splitting data into mini-batches).

Shuffling (randomizing order of samples).

Parallel loading (using multiple workers to speed up loading).



In short: a Dataset defines how to index your data and how to turn one sample into tensors.
A DataLoader is the engine that turns those samples into batches efficiently (shuffling, parallel I/O, prefetching, collation).

Concrete technical benefits & features

Batching (batch_size): stacks batch_size samples into one tensor per field.

Shuffling (shuffle=True): draws a new random order every epoch (pass a Sampler instead for custom behavior).

Parallelism (num_workers): starts worker processes that load and transform samples in parallel, so CPU-bound transforms are not serialized by the Python GIL.

Pinned memory (pin_memory=True): places batches in page-locked host memory, which speeds up host→GPU copies and lets tensor.to(device, non_blocking=True) overlap them with compute. Prefetching is configured separately through prefetch_factor (only when num_workers > 0).

Collate function (collate_fn): custom logic for assembling a list of samples into a batch (useful for padding sequences).

Samplers: exact control over which indices are drawn and in what order (e.g., weighted sampling, sharding for distributed training).

IterableDataset: for streaming data sources (logs, sockets, huge corpora).

persistent_workers: keeps worker processes alive across epochs to reduce startup overhead.

generator + worker_init_fn: deterministic shuffling and reproducible worker RNGs.


3) Other Important Parameters of DataLoader

Here’s a cheat sheet with their use cases 🚀:


| Parameter                | What it does                                 | Why/When to use                                                                     |
| ------------------------ | -------------------------------------------- | ----------------------------------------------------------------------------------- |
| **`batch_size`**         | Number of samples per batch                  | Larger → more throughput, more memory. Smaller → less memory, noisier gradients.    |
| **`shuffle`**            | Randomizes order every epoch                 | Important for SGD convergence. Use `False` for evaluation/test.                     |
| **`num_workers`**        | Number of parallel worker processes          | More workers → faster data loading (tune based on CPU cores).                       |
| **`pin_memory`**         | Copies data into pinned (page-locked) memory | Makes GPU transfer (`.to(device, non_blocking=True)`) faster. Useful for CUDA.      |
| **`drop_last`**          | Drops last incomplete batch                  | Useful when batchnorm needs consistent batch size.                                  |
| **`sampler`**            | Custom sampling strategy                     | Use for imbalance (WeightedRandomSampler), distributed training, or custom subsets. |
| **`collate_fn`**         | Merges list of samples into batch            | Needed for variable-length data (padding, masking, dicts).                          |
| **`timeout`**            | Seconds to wait for a worker to fetch data   | Fails with an error instead of hanging when I/O is slow or unstable.                |
| **`worker_init_fn`**     | Function to initialize workers               | Useful for seeding RNG in each worker for reproducibility.                          |
| **`generator`**          | Controls randomness in DataLoader            | Useful for reproducibility (deterministic shuffle).                                 |
| **`persistent_workers`** | Keeps workers alive between epochs           | Saves worker startup overhead in long training jobs.                                |


'''

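The notes above mention `collate_fn` for variable-length data. Below is a minimal sketch of that idea, assuming a toy dataset of integer sequences. `ToySequenceDataset` and `pad_collate` are hypothetical names introduced only for this illustration; `pad_sequence` is the existing utility from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToySequenceDataset(Dataset):
    """Hypothetical dataset whose samples have different lengths."""

    def __init__(self):
        self.sequences = [
            torch.tensor([1, 2, 3]),
            torch.tensor([4, 5]),
            torch.tensor([6]),
            torch.tensor([7, 8, 9, 10]),
        ]
        self.labels = torch.tensor([0, 1, 0, 1])

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index], self.labels[index]


def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # batch_first=True gives shape (batch, max_len); 0 is the padding value
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.stack(labels)


loader = DataLoader(ToySequenceDataset(), batch_size=2, shuffle=False,
                    collate_fn=pad_collate)
batches = list(loader)

for padded, lengths, labels in batches:
    print(padded, lengths, labels)
```

Returning the original lengths alongside the padded tensor is one possible design; it lets downstream code build a mask without guessing which zeros are padding.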

In [19]:
'''
batch_size=2
means each batch fetched from the DataLoader contains two samples (rows)
'''

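To make the effect of the batch size concrete, here is a small sketch on a toy dataset of 10 samples (the same count used later in this notebook). The variable names are illustrative, and `TensorDataset` is PyTorch's built-in wrapper for in-memory tensors. It shows how many batches each setting produces and what `drop_last=True` does to an incomplete final batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples with 2 features each (values are arbitrary)
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
toy = TensorDataset(features, labels)

batch_sizes = {}
for batch_size in (2, 3):
    loader = DataLoader(toy, batch_size=batch_size)
    # Record how many samples each batch actually contains
    batch_sizes[batch_size] = [len(batch_labels) for _, batch_labels in loader]
    print(batch_size, len(loader), batch_sizes[batch_size])

# drop_last=True discards the final batch when it is smaller than batch_size
dropped = DataLoader(toy, batch_size=3, drop_last=True)
dropped_sizes = [len(batch_labels) for _, batch_labels in dropped]
print(len(dropped), dropped_sizes)
```

With 10 samples, `batch_size=2` divides evenly, while `batch_size=3` leaves a final batch of one sample unless `drop_last=True` is set.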

In [20]:
from sklearn.datasets import make_classification
import torch

In [21]:
# Step 1: Create a synthetic classification dataset using sklearn
X, y = make_classification(
    n_samples=10,       # Number of samples
    n_features=2,       # Number of features
    n_informative=2,    # Number of informative features
    n_redundant=0,      # Number of redundant features
    n_classes=2,        # Number of classes
    random_state=42     # For reproducibility
)

In [22]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [23]:
X.shape

(10, 2)

In [24]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [25]:
y.shape

(10,)

In [26]:
# Convert the data to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)
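The explicit `dtype` arguments in the conversion above matter because NumPy arrays of floats are typically `float64`, while PyTorch layers use `float32` parameters by default. A brief sketch with a hand-made array (not the notebook's data) illustrates the difference:

```python
import numpy as np
import torch

array = np.array([[1.0, 2.0], [3.0, 4.0]])  # NumPy defaults to float64

inferred = torch.tensor(array)                        # keeps float64
explicit = torch.tensor(array, dtype=torch.float32)   # matches layer weights

layer = torch.nn.Linear(2, 1)   # parameters are float32 by default
output = layer(explicit)        # works because the dtypes agree

print(inferred.dtype, explicit.dtype, layer.weight.dtype)
```

Class labels are converted to `torch.long` for a similar reason: losses such as `torch.nn.CrossEntropyLoss` expect integer class indices of that type.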

In [27]:
X

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [28]:
y

tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [29]:
from torch.utils.data import Dataset, DataLoader

In [30]:
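class CustomDataset(Dataset):
    """Map-style dataset over in-memory feature and label tensors."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples = number of rows in the feature tensor
        return self.features.shape[0]

    def __getitem__(self, index):
        # Return one (features, label) pair
        return self.features[index], self.labels[index]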
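For tensors that already sit in memory, PyTorch's built-in `torch.utils.data.TensorDataset` behaves like the class above, so a custom subclass is mainly worth writing when samples need per-item loading or transforms. The sketch below repeats the class so it runs on its own and uses random toy tensors rather than the notebook's data to check that both return the same samples.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class CustomDataset(Dataset):
    """Same structure as the class defined in this notebook."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, index):
        return self.features[index], self.labels[index]


# Toy tensors for the comparison (values are arbitrary)
features = torch.randn(10, 2)
labels = torch.randint(0, 2, (10,))

custom = CustomDataset(features, labels)
builtin = TensorDataset(features, labels)

# Both datasets should yield identical (features, label) pairs at every index
same = all(
    torch.equal(custom[i][0], builtin[i][0])
    and torch.equal(custom[i][1], builtin[i][1])
    for i in range(len(custom))
)
print(len(custom), len(builtin), same)
```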

In [31]:
dataset = CustomDataset(X, y)

In [32]:
len(dataset)

10

In [33]:
dataset[2]

(tensor([-2.8954,  1.9769]), tensor(0))

In [34]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
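With `shuffle=True`, the batch order changes from run to run. If a repeatable order is needed, one option is to pass a seeded `torch.Generator` through the `generator` argument. The sketch below assumes a toy dataset of the integers 0 to 9; `epoch_order` is a hypothetical helper written for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy = TensorDataset(torch.arange(10))


def epoch_order(seed):
    """Return the batches of one shuffled epoch for a given seed."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    loader = DataLoader(toy, batch_size=2, shuffle=True, generator=generator)
    # Each batch is a list with one tensor because the dataset has one field
    return [batch[0].tolist() for batch in loader]


first = epoch_order(0)
second = epoch_order(0)

print(first)
print(first == second)  # same seed, same order
```

When `num_workers > 0` and the transforms themselves use randomness, seeding each worker through `worker_init_fn` is usually needed as well.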

In [35]:
for batch_features, batch_labels in dataloader:
    print(batch_features)
    print(batch_labels)
    print("-" * 50)

tensor([[ 1.0683, -0.9701],
        [-0.7206, -0.9606]])
tensor([1, 0])
--------------------------------------------------
tensor([[-2.8954,  1.9769],
        [-1.9629, -0.9923]])
tensor([0, 0])
--------------------------------------------------
tensor([[ 1.8997,  0.8344],
        [-0.9382, -0.5430]])
tensor([1, 1])
--------------------------------------------------
tensor([[ 1.7273, -1.1858],
        [ 1.7774,  1.5116]])
tensor([1, 1])
--------------------------------------------------
tensor([[-0.5872, -1.9717],
        [-1.1402, -0.8388]])
tensor([0, 0])
--------------------------------------------------
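The loop above draws every sample once per epoch, in uniformly random order. For imbalanced labels, the parameter table mentions `WeightedRandomSampler`. The following is a minimal sketch on made-up data with a 90/10 class split, in which each sample is weighted by the inverse of its class frequency. The dataset, seed, and sample count are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Made-up imbalanced data: 90 samples of class 0, 10 samples of class 1
labels = torch.tensor([0] * 90 + [1] * 10)
features = torch.randn(100, 2)
toy = TensorDataset(features, labels)

# Weight each sample by 1 / (size of its class) so both classes are
# drawn about equally often
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(sample_weights, num_samples=1000,
                                replacement=True, generator=generator)

# sampler and shuffle are mutually exclusive, so shuffle is left unset
loader = DataLoader(toy, batch_size=50, sampler=sampler)

drawn = torch.cat([batch_labels for _, batch_labels in loader])
minority_fraction = (drawn == 1).float().mean().item()
print(len(drawn), round(minority_fraction, 2))
```

Sampling with replacement means minority samples are repeated within an epoch, which is the intended trade-off of this approach.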
