# PyTorch Datasets and Data Loaders
PyTorch makes it easy to feed data to your model. Its `Dataset` and `DataLoader` tools control how samples are found, grouped, and ordered, so the model receives data in a form it can learn from efficiently.

## Technical Terms
**PyTorch Dataset class:** This is like a recipe that tells your computer how to get the data it needs to learn from, including where to find it and how to parse it, if necessary.

**PyTorch Data Loader:** Think of this as a delivery truck that brings the data to your AI in small, manageable loads called batches; this makes it easier for the AI to process and learn from the data.

**Batches:** Batches are small groups of samples, usually all the same size (the last one may be smaller), that the AI looks at and learns from at each step.

**Shuffle:** Shuffling means mixing up the order of the data so that it is not presented in the same order on every pass, which helps the AI learn better.
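
To make the batch arithmetic concrete, here is a minimal sketch in plain Python. The helper `num_batches` is hypothetical (it is not part of PyTorch); it assumes the usual rule that a final, smaller batch is kept unless it is explicitly dropped. `DataLoader` exposes a similar choice through its `drop_last` argument.

```python
import math

def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Hypothetical helper: how many batches n_samples are split into."""
    if drop_last:
        # Discard the final incomplete batch, if any
        return n_samples // batch_size
    # Keep the final batch even if it is smaller than batch_size
    return math.ceil(n_samples / batch_size)

# 11 samples in batches of 3: three full batches plus one batch of 2
print(num_batches(11, 3))                  # 4
print(num_batches(11, 3, drop_last=True))  # 3
```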

## Create a dataset
[PyTorch Reference](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

In [1]:
from torch.utils.data import Dataset

In [2]:
# Create a toy dataset
class NumberProductDataset(Dataset):
    """
    Defines a dataset that contains pairs of consecutive numbers and their products
    """
    def __init__(self, data_range: tuple = (1, 10)):
        # range() excludes the upper bound, so (0, 11) yields the numbers 0 through 10
        self.numbers = list(range(data_range[0], data_range[1]))


    def __getitem__(self, index):
        """
        Returns the pair (n, n + 1) at the given index, along with its product
        """
        number0 = self.numbers[index]
        number1 = number0 + 1
        return (number0, number1), number0 * number1


    def __len__(self):
        """
        Total number of pairs in the dataset
        """
        return len(self.numbers)

In [3]:
# Instantiate the dataset
dataset = NumberProductDataset(
    data_range=(0,11))

In [4]:
# Access a data sample
data_sample = dataset[3]
print(data_sample)

((3, 4), 12)


In [5]:
len(dataset)

11

In [6]:
for i in range(len(dataset)):
    print(dataset[i])

((0, 1), 0)
((1, 2), 2)
((2, 3), 6)
((3, 4), 12)
((4, 5), 20)
((5, 6), 30)
((6, 7), 42)
((7, 8), 56)
((8, 9), 72)
((9, 10), 90)
((10, 11), 110)
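
For data that already fits in tensors, a custom class is not always needed. The following is a minimal sketch, not a replacement for the class above, that builds a comparable dataset with PyTorch's built-in `TensorDataset`. The variable names (`numbers`, `pairs`, `products`, `tensor_dataset`) are chosen here for illustration, and each sample comes back as tensors rather than plain Python integers.

```python
import torch
from torch.utils.data import TensorDataset

numbers = torch.arange(0, 11)                       # 0 through 10
pairs = torch.stack([numbers, numbers + 1], dim=1)  # shape (11, 2)
products = numbers * (numbers + 1)                  # shape (11,)

# TensorDataset indexes every tensor along its first dimension
tensor_dataset = TensorDataset(pairs, products)

print(len(tensor_dataset))  # 11
pair, product = tensor_dataset[3]
print(pair, product)        # tensor([3, 4]) tensor(12)
```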


## Load a dataset
[PyTorch Reference](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

In [7]:
from torch.utils.data import DataLoader

In [8]:
# Instantiate the dataset
dataset = NumberProductDataset(
    data_range=(0,11))
# Create a DataLoader instance
dataloader = DataLoader(dataset=dataset, batch_size=3, shuffle=False)

In [9]:
# Iterate over batches
print(f"There are {len(dataloader)} batches in the dataloader object")
print()
for (x, y) in dataloader:
    print(f"\t{x}, {y}")
print()
print("""Each line shows one batch: a list of two tensors followed by a single tensor.
The first tensor in the list holds the first number of each pair in the batch.
The second tensor in the list holds the second number of each pair in the batch.
The tensor outside the list holds the product of each pair.

So, if we call the list X and the final tensor Y:
    X[0]*X[1]=Y""")

There are 4 batches in the dataloader object

	[tensor([0, 1, 2]), tensor([1, 2, 3])], tensor([0, 2, 6])
	[tensor([3, 4, 5]), tensor([4, 5, 6])], tensor([12, 20, 30])
	[tensor([6, 7, 8]), tensor([7, 8, 9])], tensor([42, 56, 72])
	[tensor([ 9, 10]), tensor([10, 11])], tensor([ 90, 110])

Each line shows one batch: a list of two tensors followed by a single tensor.
The first tensor in the list holds the first number of each pair in the batch.
The second tensor in the list holds the second number of each pair in the batch.
The tensor outside the list holds the product of each pair.

So, if we call the list X and the final tensor Y:
    X[0]*X[1]=Y
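
The relationship `X[0]*X[1]=Y` can also be checked in code rather than by eye. This is a minimal sketch that assumes a plain list of samples as a stand-in for the dataset (a `DataLoader` accepts any object that supports indexing and `len`); the names `samples`, `loader`, and `batch_sizes` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader

# Stand-in for the dataset: a plain list of ((n, n + 1), product) samples
samples = [((n, n + 1), n * (n + 1)) for n in range(11)]
loader = DataLoader(samples, batch_size=3, shuffle=False)

batch_sizes = []
for (x0, x1), y in loader:
    # The element-wise product of the two input tensors should equal the target tensor
    assert torch.equal(x0 * x1, y)
    batch_sizes.append(len(y))

print(batch_sizes)  # [3, 3, 3, 2]
```

The final batch holds only two samples because 11 is not a multiple of 3.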


In [10]:
print("Each line above is one batch, which acts like a small dataset. We can also shuffle the data so the batches come out in random order:")
dataloader = DataLoader(dataset=dataset, batch_size=3, shuffle=True)
for (x, y) in dataloader:
    print(f"\t{x}, {y}")
# To fetch a single batch without looping, use next(iter(dataloader))

Each line above is one batch, which acts like a small dataset. We can also shuffle the data so the batches come out in random order:
	[tensor([2, 3, 7]), tensor([3, 4, 8])], tensor([ 6, 12, 56])
	[tensor([6, 0, 5]), tensor([7, 1, 6])], tensor([42,  0, 30])
	[tensor([10,  4,  8]), tensor([11,  5,  9])], tensor([110,  20,  72])
	[tensor([9, 1]), tensor([10,  2])], tensor([90,  2])
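
Shuffled output changes from run to run, which can make debugging harder. One way to get a repeatable order is to pass a seeded `torch.Generator` to the `DataLoader` through its `generator` argument. The sketch below assumes a plain list of numbers as a stand-in dataset, and `shuffled_order` is a hypothetical helper written only for this illustration.

```python
import torch
from torch.utils.data import DataLoader

samples = list(range(11))  # stand-in dataset: the numbers 0 through 10

def shuffled_order(seed: int) -> list:
    """Hypothetical helper: the order in which a seeded loader yields samples."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(samples, batch_size=3, shuffle=True, generator=generator)
    # Flatten the batches back into a single list of Python ints
    return [int(n) for batch in loader for n in batch]

first = shuffled_order(0)
second = shuffled_order(0)
print(first == second)                   # True
print(sorted(first) == list(range(11)))  # True
```

The same seed gives the same order, and every sample still appears exactly once per pass.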
