
# Dataset and DataLoader in PyTorch

In this notebook, we will explore the `Dataset` and `DataLoader` classes in PyTorch. These are essential tools for efficiently handling and processing data during model training and evaluation.


In [1]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math


## 1. Dataset

The `Dataset` class is an abstract class that represents your data. You can create a custom dataset by subclassing `torch.utils.data.Dataset` and implementing the following methods:

- `__len__`: Returns the size of the dataset.
- `__getitem__`: Retrieves a data sample and its corresponding label (for supervised learning tasks) based on an index.
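
Before loading real data, the two-method protocol can be sketched with a toy dataset. `SquaresDataset` is a hypothetical name invented for this illustration, and its data is generated in memory rather than read from a file.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n_samples):
        self.x = torch.arange(n_samples, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        # Called by len(dataset)
        return self.x.shape[0]

    def __getitem__(self, index):
        # Called by dataset[index]
        return self.x[index], self.y[index]


toy = SquaresDataset(5)
print(len(toy))   # 5
x3, y3 = toy[3]
print(x3, y3)     # tensor(3.) tensor(9.)
```

Implementing these two methods is what lets a `DataLoader` query the dataset's size and fetch samples by index.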


In [12]:
class WineDataset(Dataset):
    def __init__(self):
        # Load the CSV, skipping the header row; column 0 holds the class label
        xy = np.loadtxt('https://raw.githubusercontent.com/patrickloeber/pytorchTutorial/master/data/wine/wine.csv',
                        delimiter=',',
                        dtype=np.float32,
                        skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])   # features, shape (n_samples, 13)
        self.y = torch.from_numpy(xy[:, [0]])  # labels, shape (n_samples, 1)
        self.n_samples = xy.shape[0]

    def __getitem__(self, index):
        # Enables indexing: dataset[index] -> (features, label)
        return self.x[index], self.y[index]

    def __len__(self):
        # Enables len(dataset)
        return self.n_samples


In [16]:
dataset = WineDataset()
sample = dataset[100]  # (features, label) pair stored at index 100
features, labels = sample
print(f'Features: {features}, Labels: {labels}')

Features: tensor([1.2080e+01, 2.0800e+00, 1.7000e+00, 1.7500e+01, 9.7000e+01, 2.2300e+00,
        2.1700e+00, 2.6000e-01, 1.4000e+00, 3.3000e+00, 1.2700e+00, 2.9600e+00,
        7.1000e+02]), Labels: tensor([2.])


## 2. DataLoader

The `DataLoader` class provides an efficient way to load data in batches and optionally shuffle and parallelize the data loading process. It wraps around a `Dataset` object.

### Key Parameters:
- **`dataset`**: The dataset to load data from.
- **`batch_size`**: Number of samples per batch (default: 1).
- **`shuffle`**: Whether to reshuffle the data at the start of every epoch (default: `False`).
- **`num_workers`**: Number of subprocesses to use for data loading (default: `0`, meaning data is loaded in the main process).
- **`drop_last`**: Whether to drop the last incomplete batch (default: `False`).
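
The effect of `batch_size` and `drop_last` is easiest to see on a dataset whose size is not a multiple of the batch size. The following is a minimal sketch on made-up data: the ten-sample `TensorDataset` and all variable names are assumptions for illustration, not part of the wine example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten synthetic samples with three features each, plus integer labels (made-up data)
toy_features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
toy_labels = torch.arange(10)
toy_dataset = TensorDataset(toy_features, toy_labels)

# Same dataset and batch size; only drop_last differs
keep_last = DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=False)
drop_last = DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=True)

keep_sizes = [len(labels) for _, labels in keep_last]
drop_sizes = [len(labels) for _, labels in drop_last]
print(keep_sizes)  # [4, 4, 2]
print(drop_sizes)  # [4, 4]
```

With `drop_last=False` the final batch holds the two leftover samples. With `drop_last=True` those samples are skipped for that epoch.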


In [21]:
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=2)
dataiter = iter(dataloader)
data = next(dataiter)
features, labels = data
print(f'Features: {features}, Labels: {labels}')


Features: tensor([[1.2070e+01, 2.1600e+00, 2.1700e+00, 2.1000e+01, 8.5000e+01, 2.6000e+00,
         2.6500e+00, 3.7000e-01, 1.3500e+00, 2.7600e+00, 8.6000e-01, 3.2800e+00,
         3.7800e+02],
        [1.4370e+01, 1.9500e+00, 2.5000e+00, 1.6800e+01, 1.1300e+02, 3.8500e+00,
         3.4900e+00, 2.4000e-01, 2.1800e+00, 7.8000e+00, 8.6000e-01, 3.4500e+00,
         1.4800e+03],
        [1.1620e+01, 1.9900e+00, 2.2800e+00, 1.8000e+01, 9.8000e+01, 3.0200e+00,
         2.2600e+00, 1.7000e-01, 1.3500e+00, 3.2500e+00, 1.1600e+00, 2.9600e+00,
         3.4500e+02],
        [1.3160e+01, 3.5700e+00, 2.1500e+00, 2.1000e+01, 1.0200e+02, 1.5000e+00,
         5.5000e-01, 4.3000e-01, 1.3000e+00, 4.0000e+00, 6.0000e-01, 1.6800e+00,
         8.3000e+02]]), Labels: tensor([[2.],
        [1.],
        [2.],
        [3.]])


In [24]:
# Training loop
num_epochs = 2
total_samples = len(dataset)
n_iterations = math.ceil(total_samples / 4)  # batches per epoch with batch_size=4
print(total_samples, n_iterations)

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        # Forward pass, backward pass, and parameter update would go here
        pass

178 45
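
The loop above leaves the forward pass, backward pass, and update as a placeholder. One way to fill in those steps is sketched below under several assumptions: the data is a synthetic line `y = 2x + 1` rather than the wine data, and the model (`nn.Linear`), loss (`nn.MSELoss`), optimizer (`SGD`), and learning rate are illustrative choices, not ones taken from this notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data (an assumption, not the wine data): y = 2x + 1
x = torch.linspace(-1, 1, 10).unsqueeze(1)
y = 2 * x + 1
loader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_epochs = 2
steps = 0
for epoch in range(num_epochs):
    for inputs, targets in loader:
        predictions = model(inputs)             # forward pass
        loss = criterion(predictions, targets)  # compute the loss
        optimizer.zero_grad()                   # clear gradients from the previous step
        loss.backward()                         # backward pass
        optimizer.step()                        # parameter update
        steps += 1

print(steps)  # 6: ceil(10 / 4) = 3 batches per epoch, times 2 epochs
```

The step count follows the same arithmetic as `n_iterations`: the number of batches per epoch is the dataset size divided by the batch size, rounded up, which is also what `len(loader)` returns when `drop_last=False`.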
