# How to create and load a dataset in PyTorch

In [2]:
import torch

In [3]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    
    def __init__(self, X, y):
        self.X = X # features tensor
        self.y = y # target tensor
    
    def __len__(self):
        '''
        returns the total number of samples available in our dataset
        '''
        return len(self.X)
    
    def __getitem__(self, idx):
        '''
        returns the sample (features, target) stored at index idx
        '''
        return self.X[idx], self.y[idx]

In [4]:
from sklearn.datasets import make_classification

In [5]:
data, target = make_classification(n_samples=1000, n_features=5)

In [6]:
custom_dataset = CustomDataset(X=data, y=target)

We can confirm the length of `custom_dataset`:

In [7]:
len(custom_dataset)

1000

Indexing `custom_dataset` at position 0 returns a tuple holding the five features and the target of the first sample:

In [8]:
custom_dataset[0]

(array([-1.90390898,  1.72680179, -2.38823174, -0.31144763,  1.24389687]), 0)

To confirm that these features match the first row of `data`:

In [9]:
data[0]

array([-1.90390898,  1.72680179, -2.38823174, -0.31144763,  1.24389687])

`data` holds only the features; the matching label is stored separately, in `target[0]`.

We can check that there are five features by taking the first element of `custom_dataset[0]` and reading its `shape`:

In [10]:
custom_dataset[0][0].shape

(5,)

This setup suits binary classification, where each sample has a single 0/1 label.
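`make_classification` returns NumPy arrays, which is why indexing the dataset returns `array(...)` rather than tensors. A minimal sketch of converting the arrays before wrapping them follows. The random arrays are stand-ins with the same shapes as the data above, and the dtypes (`float32` features, `int64` labels) are an assumption about what a downstream model would expect, not something the notebook specifies.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Stand-in arrays shaped like the make_classification output (1000 x 5, 1000)
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
target = rng.integers(0, 2, size=1000)

# Assumed dtypes: float32 features and int64 class labels
X_tensor = torch.as_tensor(data, dtype=torch.float32)
y_tensor = torch.as_tensor(target, dtype=torch.int64)

tensor_dataset = CustomDataset(X=X_tensor, y=y_tensor)
features, label = tensor_dataset[0]
print(type(features), tuple(features.shape), label.dtype)
```

With this conversion, each indexed sample is already a pair of tensors, so no further conversion is needed when the samples are consumed.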

# Multilabel classification

In [11]:
from sklearn.datasets import make_multilabel_classification

In [12]:
data, target = make_multilabel_classification(n_samples=1000, n_features=5, n_classes=3)

In [13]:
custom_dataset_mlb = CustomDataset(X=data, y=target)

In [14]:
custom_dataset_mlb[0]

(array([15.,  4.,  6., 12.,  9.]), array([0, 1, 1]))

In [15]:
custom_dataset_mlb[1]

(array([15.,  7.,  6.,  8.,  7.]), array([0, 1, 1]))
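The same class handles this case unchanged because `__getitem__` simply indexes both arrays. As a rough illustration, the sketch below uses random stand-in arrays (not the output of `make_multilabel_classification`) with the same shapes, and checks that each sample carries a length-3 label vector.

```python
import numpy as np
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Stand-in data: 1000 samples, 5 features, 3 binary labels per sample
rng = np.random.default_rng(0)
data = rng.integers(0, 20, size=(1000, 5)).astype(float)
target = rng.integers(0, 2, size=(1000, 3))

custom_dataset_mlb = CustomDataset(X=data, y=target)
features, labels = custom_dataset_mlb[0]
print(features.shape, labels.shape)  # (5,) (3,)
```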

PyTorch Datasets are objects that have one job: to return a single datapoint on request.<br>
This lets us build a dataset that is consistent with PyTorch's conventions.

On the other hand, retrieving every element of a dataset object means looping over its indices one by one.<br>
This is inefficient, especially with large datasets.
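To make the cost concrete, here is a hedged sketch of assembling one batch by hand from a dataset. The tensors are random stand-ins, and the batch size of 8 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Stand-in tensors: 1000 samples with 5 features and a binary label
X = torch.randn(1000, 5)
y = torch.randint(0, 2, (1000,))
dataset = CustomDataset(X, y)

# Manual batching: fetch samples one index at a time, then stack them
batch_size = 8
samples = [dataset[i] for i in range(batch_size)]
features_batch = torch.stack([features for features, _ in samples])
labels_batch = torch.stack([label for _, label in samples])
print(features_batch.shape, labels_batch.shape)
```

Every batch needs its own loop and stacking step, and shuffling would have to be written separately as well.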

## PyTorch DataLoaders

PyTorch DataLoaders allow us to load datasets efficiently.<br>
We need to prepare the data for training the model.

The Dataset class retrieves the dataset's features and labels one sample at a time.<br>
It is more efficient to load samples in batches.<br>
Batches are what mini-batch gradient descent operates on.<br>
Reshuffling the data at every epoch helps reduce overfitting.<br>
Python's multiprocessing can speed up data retrieval.
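These three ideas map onto the `batch_size`, `shuffle`, and `num_workers` arguments of `DataLoader`. The sketch below uses random stand-in tensors and keeps `num_workers=0` (loading in the main process); values above 0 start worker processes and, on platforms that spawn processes, usually require the script to be guarded by `if __name__ == "__main__":`.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Stand-in tensors: 1000 samples with 5 features and a binary label
X = torch.randn(1000, 5)
y = torch.randint(0, 2, (1000,))
dataset = CustomDataset(X, y)

loader = DataLoader(
    dataset,
    batch_size=8,    # samples per batch
    shuffle=True,    # reshuffle the sample order at every epoch
    num_workers=0,   # 0 = load in the main process
)

n_batches = len(loader)  # 1000 samples / 8 per batch = 125 batches
first_features, first_targets = next(iter(loader))
print(n_batches, first_features.shape, first_targets.shape)
```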

In [16]:
from torch.utils.data import DataLoader
?DataLoader

Init signature:
DataLoader(
    dataset: torch.utils.data.dataset.Dataset[+T_co],
    batch_size: Optional[int] = 1,
    shuffle: bool = False,
    sampler: Optional[torch.utils.data.sampler.Sampler] = None,
    batch_sampler: Optional[torch.utils.data.sampler.Sampler[Sequence]]

In [19]:
data_loader = DataLoader(dataset=custom_dataset, batch_size=8, shuffle=True)

Indexing `custom_dataset` at position 0 returns the first element of our data, a tuple of features and target:

In [20]:
custom_dataset[0]

(array([-1.90390898,  1.72680179, -2.38823174, -0.31144763,  1.24389687]), 0)

Accessing the data by index like this is inefficient with large datasets.

A better method is to iterate through the dataset using the DataLoader:

In [21]:
data_iter = iter(data_loader)

This is an iterator of type `_SingleProcessDataLoaderIter`:

In [22]:
data_iter

<torch.utils.data.dataloader._SingleProcessDataLoaderIter at 0x1eaae17d5b0>

We will get eight samples at a time, instead of one as we did when indexing the dataset.

In [26]:
data_02 = next(data_iter)  # built-in next(); the iterator's .next() method is not available in recent PyTorch releases

In [30]:
features_02, target_02 = data_02

In [29]:
features_02

tensor([[-1.7441e-01,  2.9884e-01, -3.3672e-01, -3.6284e-01, -4.9011e-01],
        [-2.0852e-01,  4.1347e-02, -1.3764e-01,  3.1712e-01,  1.2664e+00],
        [ 2.6825e-01,  5.8138e-02,  8.3714e-02, -6.7259e-01, -2.1019e-01],
        [-6.1177e-01,  6.2545e-04, -3.0263e-01,  1.2173e+00,  4.0929e-01],
        [-2.2918e-01,  1.0278e-01, -1.9936e-01,  2.1226e-01,  9.1174e-01],
        [-2.3062e+00,  1.7453e+00, -2.6024e+00,  4.4593e-01,  7.8741e-01],
        [ 5.2418e-02,  5.4306e-02, -1.9654e-02, -2.3350e-01, -3.0124e-01],
        [ 6.5835e-01, -2.8117e-01,  5.6088e-01, -6.4325e-01, -1.0710e+00]],
       dtype=torch.float64)

DataLoaders take care of batching and shuffling, which are tedious to do by hand, and they should be used to speed up the model training phase in PyTorch.
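In practice the loader is usually consumed with a plain `for` loop rather than by calling `next` by hand. The sketch below is one way to do that, using random NumPy stand-ins for the data. It also shows why the batch printed above is a `float64` tensor even though the dataset holds NumPy arrays: the default collate step converts each batch to tensors and keeps the NumPy dtype.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Stand-in NumPy arrays shaped like the make_classification output
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))        # float64 by default
target = rng.integers(0, 2, size=1000)

data_loader = DataLoader(CustomDataset(data, target), batch_size=8, shuffle=True)

n_batches = 0
n_samples = 0
for features, targets in data_loader:    # one pass = one epoch
    # A training step (forward pass, loss, backward pass) would go here
    n_batches += 1
    n_samples += features.shape[0]

print(n_batches, n_samples, features.dtype)
```

One pass over the loader visits every sample exactly once, in a shuffled order, in batches of eight.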