## Dataset, dataloader and data transform classes 

So far we have loaded datasets using pandas (`read_csv` / `read_excel`) or from `sklearn.datasets`. Here we explore the PyTorch alternatives.

- Running the forward pass on the entire dataset at once is also time-consuming, so we will divide the data into batches using the `Dataset` and `DataLoader` classes.

#### Definitions:

__Epoch__: one forward and backward pass over ALL training samples.

__Batch size__: the number of samples processed in one forward/backward pass, i.e. before each weight update.

__Number of iterations__: the number of update steps per epoch, where each iteration is one forward + backward pass on one batch.

E.g. 100 samples, batch_size = 20 --> 100/20 = 5 iterations per epoch.

```
for epoch in range(5):                   # 5 passes over the data
    for batch in data_loader:            # 5 batches per epoch (100 samples / batch_size 20)
        # Forward pass
        # Compute loss
        # Backward pass
        # Optimizer step
```
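To make the arithmetic concrete, here is a minimal runnable sketch of the same loop structure. The data is synthetic and the names `toy_dataset` and `toy_loader` are made up for illustration; `TensorDataset` is only used as a quick way to wrap tensors and is not part of the notebook's own code.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data (assumption): 100 samples with 3 features each.
features = torch.randn(100, 3)
targets = torch.randint(0, 2, (100,))
toy_dataset = TensorDataset(features, targets)

batch_size = 20
toy_loader = DataLoader(toy_dataset, batch_size=batch_size, shuffle=True)

# Iterations per epoch = ceil(number of samples / batch size).
iterations_per_epoch = math.ceil(len(toy_dataset) / batch_size)

num_epochs = 5
update_steps = 0
for epoch in range(num_epochs):
    for batch_features, batch_targets in toy_loader:
        # A real loop would run forward pass, loss, backward pass and optimizer step here.
        update_steps += 1

print(iterations_per_epoch, len(toy_loader), update_steps)  # 5 5 25
```

`len(toy_loader)` reports the number of batches per epoch, so it should agree with the hand-computed value.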

In [4]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np 
import math

## Importing torch datasets from numpy 

In [6]:
class WineDataset(Dataset):

    def __init__(self):
        xy = np.loadtxt('dataset/wine.csv', delimiter=',', skiprows=1, dtype=np.float32)
        self.X = torch.from_numpy(xy[:, 1:])
        self.y = torch.from_numpy(xy[:, 0])
        self.n_samples = xy.shape[0]
    
    def __getitem__(self, index):
        # to enable indexing and extract rows 
        return self.X[index], self.y[index]
    
    def __len__(self):
        # to check the size of dataset
        return self.n_samples

dataset = WineDataset()       

In [9]:
print(len(dataset))

first_datapoint = dataset[0]
first_datapoint

178


(tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
         3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
         1.0650e+03]),
 tensor(1.))

In [11]:
# capture features and label for first datapoint
features_0, label_0 = dataset[0]
print('features_0 = ', features_0, 'label = ', label_0)

features_0 =  tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
        3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
        1.0650e+03]) label =  tensor(1.)


In [14]:
# loading the data for further processing into the model

dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=0)

In [None]:
# Convert to an iterator and look at one random batch. iter() and next() are
# rarely needed in practice, except to check what a batch looks like (a bit like df.head()).
dataiter = iter(dataloader)
data = next(dataiter)
features, labels = data
print(features, labels)

tensor([[1.1810e+01, 2.1200e+00, 2.7400e+00, 2.1500e+01, 1.3400e+02, 1.6000e+00,
         9.9000e-01, 1.4000e-01, 1.5600e+00, 2.5000e+00, 9.5000e-01, 2.2600e+00,
         6.2500e+02],
        [1.1870e+01, 4.3100e+00, 2.3900e+00, 2.1000e+01, 8.2000e+01, 2.8600e+00,
         3.0300e+00, 2.1000e-01, 2.9100e+00, 2.8000e+00, 7.5000e-01, 3.6400e+00,
         3.8000e+02],
        [1.1030e+01, 1.5100e+00, 2.2000e+00, 2.1500e+01, 8.5000e+01, 2.4600e+00,
         2.1700e+00, 5.2000e-01, 2.0100e+00, 1.9000e+00, 1.7100e+00, 2.8700e+00,
         4.0700e+02],
        [1.2250e+01, 4.7200e+00, 2.5400e+00, 2.1000e+01, 8.9000e+01, 1.3800e+00,
         4.7000e-01, 5.3000e-01, 8.0000e-01, 3.8500e+00, 7.5000e-01, 1.2700e+00,
         7.2000e+02]]) tensor([2., 2., 2., 3.])


In [None]:
# Dummy Training loop
num_epochs = 2
total_samples = len(dataset)
n_iterations = math.ceil(total_samples/4)
print(total_samples, n_iterations)

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        
        # here: 178 samples, batch_size = 4, n_iters=178/4=44.5 -> 45 iterations
        # Run your training process
        
        if (i+1) % 5 == 0:
            print(f'Epoch: {epoch+1}/{num_epochs}, Step {i+1}/{n_iterations}| Inputs {inputs.shape} | Labels {labels.shape}')

178 45
Epoch: 1/2, Step 5/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 10/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 15/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 20/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 25/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 30/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 35/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 40/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 1/2, Step 45/45| Inputs torch.Size([2, 13]) | Labels torch.Size([2])
Epoch: 2/2, Step 5/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 2/2, Step 10/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 2/2, Step 15/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 2/2, Step 20/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4])
Epoch: 

Input size = [4, 13]: 4 is the batch size (set above) and 13 is the number of features of each datapoint. The last batch of each epoch has only 2 samples, because 178 is not divisible by 4.
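The output above shows a smaller batch at step 45. A short sketch of why, using random tensors as a stand-in for the wine data (an assumption, so it runs without the CSV file). It also shows `drop_last=True`, a `DataLoader` argument that discards the incomplete final batch if equal-sized batches are needed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the wine data (assumption): 178 samples, 13 features.
fake_X = torch.randn(178, 13)
fake_y = torch.randint(1, 4, (178,)).float()
fake_dataset = TensorDataset(fake_X, fake_y)

# Default behaviour: the final batch holds the 178 - 44*4 = 2 leftover samples.
full_loader = DataLoader(fake_dataset, batch_size=4, shuffle=True)
batch_sizes = [inputs.shape[0] for inputs, _ in full_loader]

# drop_last=True discards the incomplete final batch.
trimmed_loader = DataLoader(fake_dataset, batch_size=4, shuffle=True, drop_last=True)
trimmed_sizes = [inputs.shape[0] for inputs, _ in trimmed_loader]

print(len(batch_sizes), batch_sizes[-1])      # 45 2
print(len(trimmed_sizes), trimmed_sizes[-1])  # 44 4
```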

## Built-in datasets in torchvision

In [None]:
?torchvision.datasets

Type:        module
String form: <module 'torchvision.datasets' from 'C:\\Users\\AN80050181\\AppData\\Roaming\\Python\\Python313\\site-packages\\torchvision\\datasets\\__init__.py'>
File:        c:\users\an80050181\appdata\roaming\python\python313\site-packages\torchvision\datasets\__init__.py
Docstring:   <no docstring>

In [None]:
torchvision.datasets.MNIST(root='')  # plug in the arguments you need
# Fashion-MNIST, CIFAR, COCO, etc. are also available.

### Transformation in pytorch:

PyTorch doesn’t have built-in transformers like SimpleImputer, StandardScaler, or MinMaxScaler the way scikit-learn does.

But you don’t have to write a new custom class every time either. There are a few strategies:

Option 1: Use sklearn for preprocessing. This is the most common approach and totally acceptable. Then convert the result into tensors just before training.
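A minimal sketch of Option 1, assuming scikit-learn is installed. The feature matrix is random data made up for illustration, and `StandardScaler` is just one possible preprocessing step.

```python
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler

# Synthetic tabular features (assumption): 20 rows, 3 columns, far from zero mean.
rng = np.random.default_rng(0)
raw_features = rng.normal(loc=50.0, scale=10.0, size=(20, 3)).astype(np.float32)

# Preprocess with sklearn: scale each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(raw_features)

# Convert to a float32 tensor just before training.
feature_tensor = torch.from_numpy(scaled.astype(np.float32))
print(feature_tensor.shape, feature_tensor.dtype)
```

In a real project the scaler would normally be fitted on the training split only and then applied to the validation and test splits with `scaler.transform`.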

Option 2: Use PyTorch transforms (but mostly for images) <br>
```from torchvision import transforms```
This gives you tools like:

- transforms.ToTensor()

- transforms.Normalize(mean, std)

- transforms.Resize()

- transforms.RandomCrop()

- transforms.Compose([...])

__BUT: These are image-focused.__

__If you’re working with tabular or text data → sklearn is better for preprocessing.__

Option 3: Write your own preprocessing and use Dataset<br>
You can write a custom PyTorch Dataset class and add your own logic inside `__getitem__()`: