This notebook is adapted from the official PyTorch tutorial on [Datasets & DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

# Datasets & DataLoaders


Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code
to be decoupled from our model training code for better readability and modularity.

PyTorch provides two data primitives, ``torch.utils.data.DataLoader`` and ``torch.utils.data.Dataset``,
that let you use pre-loaded datasets as well as your own data.

``Dataset`` stores the samples and their corresponding labels, and ``DataLoader`` wraps an iterable around
the ``Dataset`` to enable easy access to the samples.




PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that
subclass ``torch.utils.data.Dataset`` and implement functions specific to the particular data.
They can be used to prototype and benchmark your model. You can find them
here: [Image Datasets](https://pytorch.org/vision/stable/datasets.html),
[Text Datasets](https://pytorch.org/text/stable/datasets.html), and
[Audio Datasets](https://pytorch.org/audio/stable/datasets.html).

In this tutorial, though, we will learn how to load our own data, which is not part of an official PyTorch dataset.

We will use the classic [California Housing](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) dataset as a running example. 

In [1]:
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

In [2]:
X.head()

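|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |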


In [3]:
y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

We are now in the deep-learning realm, so let's cast everything into PyTorch tensors!

In [4]:
import torch

torch.set_printoptions(sci_mode=False, linewidth=300)

In [5]:
X = torch.from_numpy(X.to_numpy()).float()
y = torch.from_numpy(y.to_numpy()).float()

In [6]:
X.size(), y.size()

(torch.Size([20640, 8]), torch.Size([20640]))

A note on `dtype`: NumPy's default floating-point `dtype` is `np.float64` (also known as the `double` type), whereas PyTorch's default is `torch.float32` (also known as the `float` type). Double-precision computations are often much slower than single-precision ones and use twice the memory, even on a GPU. Moreover, the additional precision offered by `double` usually doesn't matter in practice (i.e., it rarely affects the evaluation metrics). That's why we cast `X` and `y` to `torch.float32` using `.float()`.
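To make the `dtype` note concrete, here is a small sketch that uses a synthetic NumPy array as a stand-in for the housing features (the array and variable names are assumptions for illustration, not part of the notebook). It shows that `torch.from_numpy` keeps NumPy's `float64`, while `.float()` produces a `float32` copy that uses half the bytes per element.

```python
import numpy as np
import torch

# Synthetic stand-in for the housing features (values are made up).
features_np = np.random.default_rng(0).normal(size=(1000, 8))
print(features_np.dtype)  # NumPy's default floating-point dtype

# from_numpy keeps the dtype and shares memory with the NumPy array.
features_f64 = torch.from_numpy(features_np)
# .float() returns a new tensor cast to float32.
features_f32 = features_f64.float()

print(features_f64.dtype, features_f32.dtype)
# Bytes per element: 8 for float64, 4 for float32.
print(features_f64.element_size(), features_f32.element_size())

# The cast introduces only a small rounding error for values of this scale.
max_abs_err = (features_f64 - features_f32.double()).abs().max().item()
print(max_abs_err)
```

For features of ordinary magnitude, the rounding error is far below anything a typical evaluation metric would register.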

## Creating a Custom Dataset

When your data is already stored as tensors, [TensorDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) comes in handy.

In [7]:
from torch.utils.data import TensorDataset

cal_housing = TensorDataset(X, y) # simply pass your tensors here

OK, but why do we need a `Dataset`? Because it lets us index into our tensors and retrieve (features, label) pairs.

In [8]:
cal_housing[0] # a tuple of (features, label)

(tensor([   8.3252,   41.0000,    6.9841,    1.0238,  322.0000,    2.5556,   37.8800, -122.2300]),
 tensor(4.5260))

In [9]:
cal_housing[:5] # a tuple of (the first 5 feature rows, the first 5 labels)

(tensor([[     8.3252,     41.0000,      6.9841,      1.0238,    322.0000,      2.5556,     37.8800,   -122.2300],
         [     8.3014,     21.0000,      6.2381,      0.9719,   2401.0000,      2.1098,     37.8600,   -122.2200],
         [     7.2574,     52.0000,      8.2881,      1.0734,    496.0000,      2.8023,     37.8500,   -122.2400],
         [     5.6431,     52.0000,      5.8174,      1.0731,    558.0000,      2.5479,     37.8500,   -122.2500],
         [     3.8462,     52.0000,      6.2819,      1.0811,    565.0000,      2.1815,     37.8500,   -122.2500]]),
 tensor([4.5260, 3.5850, 3.5210, 3.4130, 3.4220]))

In [10]:
len(cal_housing) # the length of our dataset

20640
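When data does not fit neatly into `TensorDataset` (for example, when samples need per-item preprocessing), the usual route is to subclass `Dataset` and implement `__len__` and `__getitem__`. The sketch below is a minimal illustration of that pattern; the class name `PairDataset` and the toy tensors are hypothetical, and the class simply mimics the indexing behaviour shown above rather than reproducing PyTorch's own implementation.

```python
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical minimal map-style dataset over (features, labels) tensors."""

    def __init__(self, features, labels):
        if features.size(0) != labels.size(0):
            raise ValueError("features and labels must have the same number of rows")
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset.
        return self.features.size(0)

    def __getitem__(self, index):
        # Works for integer indices and slices, like the examples above.
        return self.features[index], self.labels[index]


# Toy data with the same layout as the housing tensors (8 features per row).
toy_X = torch.arange(40, dtype=torch.float32).reshape(5, 8)
toy_y = torch.arange(5, dtype=torch.float32)

toy_dataset = PairDataset(toy_X, toy_y)
first_features, first_label = toy_dataset[0]
print(len(toy_dataset), first_features.shape, first_label)
```

Any per-sample work, such as reading a file or applying a transform, would go inside `__getitem__`.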

We can split our dataset into training and testing sets using PyTorch's `random_split` function. 

You may also split the data beforehand using your favourite package and create a separate `Dataset` for each split. But, when in Rome...

In [12]:
import math
from torch.utils.data import random_split

train_frac = 0.7
train_size = math.floor(train_frac * len(cal_housing))
test_size = len(cal_housing) - train_size

cal_housing_train, cal_housing_test = random_split(cal_housing, (train_size, test_size))

In [13]:
len(cal_housing_train), len(cal_housing_test)

(14447, 6193)
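Two details of the split are worth a closer look. First, `0.7` cannot be represented exactly in binary floating point, so `0.7 * 20640` evaluates to a value just below 14448 and `math.floor` returns 14447, which explains the sizes printed above. Second, `random_split` draws a random permutation, so the split changes between runs unless a seeded `generator` is passed. The sketch below illustrates both points on a synthetic stand-in dataset; the helper `seeded_split` and the use of `round` are assumptions made for illustration, not part of the original notebook.

```python
import math

import torch
from torch.utils.data import TensorDataset, random_split

n_samples = 20640  # number of rows in the California Housing dataset

# 0.7 has no exact binary representation, so the product lands just below 14448.
print(0.7 * n_samples)
print(math.floor(0.7 * n_samples))  # 14447
print(round(0.7 * n_samples))       # 14448

# Synthetic stand-in for the housing tensors (values are made up).
toy = TensorDataset(torch.randn(n_samples, 8), torch.randn(n_samples))
train_size = round(0.7 * len(toy))
test_size = len(toy) - train_size


def seeded_split(seed):
    """Hypothetical helper: split `toy` with a fixed random seed."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy, [train_size, test_size], generator=generator)


train_a, test_a = seeded_split(0)
train_b, test_b = seeded_split(0)
print(len(train_a), len(test_a))
# The same seed yields the same assignment of rows to splits.
print(list(train_a.indices) == list(train_b.indices))
```

Whether 14447 or 14448 training rows is used rarely matters, but a fixed seed makes results comparable across runs.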

## Preparing your data for training with DataLoaders
The ``Dataset`` retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to
pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's ``multiprocessing`` to
speed up data retrieval.

[DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) is an iterable that hides this complexity behind a simple API.



In [14]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(cal_housing_train, batch_size=64, shuffle=True)
test_dataloader = DataLoader(cal_housing_test, batch_size=64, shuffle=False)
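A quick way to sanity-check a `DataLoader` is to look at how many batches it yields. When the dataset size is not a multiple of `batch_size`, the last batch is smaller unless `drop_last=True` is set. The sketch below uses a synthetic dataset with 6193 rows, the size of the test split above; the tensors themselves are made up for illustration.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in with the same number of rows as the test split above.
n_test = 6193
toy_test = TensorDataset(torch.randn(n_test, 8), torch.randn(n_test))

loader = DataLoader(toy_test, batch_size=64, shuffle=False)
loader_drop = DataLoader(toy_test, batch_size=64, shuffle=False, drop_last=True)

# len() of a DataLoader is the number of batches, not the number of samples.
print(len(loader), math.ceil(n_test / 64))   # includes the short final batch
print(len(loader_drop), n_test // 64)        # full batches only

batch_sizes = [features.size(0) for features, _ in loader]
print(batch_sizes[0], batch_sizes[-1])       # a full batch, then the remainder
```

Dropping the last batch is mostly useful during training when a model or loss is sensitive to very small batches; for evaluation, keeping every sample is usually what you want.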

## Iterate through the DataLoader

We have loaded the dataset into the ``DataLoader`` and can iterate through it as needed.
Each iteration below returns a batch of ``train_features`` and ``train_labels`` (containing ``batch_size=64`` features and labels respectively).
Because we specified ``shuffle=True``, the data is reshuffled every time we start a new pass over the ``DataLoader``, i.e. at every epoch (for finer-grained control over
the data loading order, take a look at [Samplers](https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler)).
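The effect of ``shuffle=True`` can be checked directly by recording the order in which samples arrive over two passes. In the sketch below, each sample is simply its own index, so the visiting order is easy to read off; the toy dataset, the seeded `generator`, and the helper `epoch_order` are assumptions introduced for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Each sample carries its own index so the visiting order is easy to inspect.
ids = torch.arange(1000)
toy = TensorDataset(ids)

loader = DataLoader(
    toy,
    batch_size=100,
    shuffle=True,
    generator=torch.Generator().manual_seed(0),  # seeded for repeatable runs
)


def epoch_order(dataloader):
    """Hypothetical helper: sample ids in the order one full pass yields them."""
    return torch.cat([batch for (batch,) in dataloader]).tolist()


first_epoch = epoch_order(loader)
second_epoch = epoch_order(loader)

print(first_epoch[:5], second_epoch[:5])
print(first_epoch != second_epoch)                  # the order differs between passes
print(sorted(first_epoch) == sorted(second_epoch))  # but every sample is seen exactly once
```

Each pass visits every sample exactly once; only the order changes.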



In [15]:
train_features, train_labels = next(iter(train_dataloader))

print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")

Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])


`next` + `iter` is useful when you want to examine a batch of data (e.g., to check if you have implemented your data-loading functions correctly). 

But a more common pattern is to use your `DataLoader` in a `for` loop, so that you step through the batches one after another.
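To show where this `for` loop typically ends up, here is a minimal sketch of a training loop driven by a `DataLoader`. Everything in it is an assumption for illustration: the data is synthetic, and the linear model, loss, optimizer, and learning rate are placeholders rather than choices made in this notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data standing in for the housing tensors (values are made up).
toy_X = torch.randn(512, 8)
true_weights = torch.randn(8)
toy_y = toy_X @ true_weights + 0.1 * torch.randn(512)

loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=64, shuffle=True)

model = nn.Linear(8, 1)  # placeholder model, not part of the tutorial
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(20):
    running_loss = 0.0
    for features, labels in loader:  # one minibatch per step
        predictions = model(features).squeeze(-1)  # (batch, 1) -> (batch,)
        loss = loss_fn(predictions, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight by batch size so the epoch average is per sample.
        running_loss += loss.item() * features.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))

print(f"first epoch loss: {epoch_losses[0]:.4f}, last epoch loss: {epoch_losses[-1]:.4f}")
```

The inner loop has the same shape as the loop over `test_dataloader`; only the body changes.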

In [16]:
for features, labels in test_dataloader:
    print(f"Feature batch shape: {features.size()}")
    print(f"Labels batch shape: {labels.size()}")

Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: tor