<img style="max-width:20em; height:auto;" src="../graphics/A-Little-Book-on-Adversarial-AI-Cover.png"/>

Author: Nik Alleyne   
Author Blog: https://www.securitynik.com   
Author GitHub: github.com/securitynik   

Author Other Books: [   

            "https://www.amazon.ca/Learning-Practicing-Leveraging-Practical-Detection/dp/1731254458/",   
            
            "https://www.amazon.ca/Learning-Practicing-Mastering-Network-Forensics/dp/1775383024/"   
        ]   


This notebook ***(dataset_dataloader.ipynb)*** is part of the series of notebooks from ***A Little Book on Adversarial AI***, a free ebook released by Nik Alleyne.

### Custom DataSet and DataLoader
When building machine learning models, we often work with very large datasets. In many cases, those datasets cannot be loaded directly into memory. If we try to load these large datasets into memory, we get an **out of memory error**.   
  
To address this concern, datasets are used in conjunction with data loaders. Let's build a simple dataset and use it with a dataloader to process this information.   

Think of the dataset as a container for our data that has a length and is indexable.   
References:   https://www.intodeeplearning.com/how-to-use-pytorch-dataloader/    
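
To make the idea of "has a length and is indexable" concrete, here is a minimal sketch in plain Python, with no PyTorch involved. The class name `TinyContainer` and its sample values are hypothetical and used only for illustration.

```python
class TinyContainer:
    """Hypothetical container: it only needs a length and integer indexing."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        # Number of samples held by the container
        return len(self.rows)

    def __getitem__(self, index):
        # Return the sample stored at the given position
        return self.rows[index]


container = TinyContainer([[937, 405], [932, 567], [432, 705]])
print(len(container))   # number of samples
print(container[1])     # sample at index 1
```

A PyTorch dataset follows the same two-method contract, which is what lets a dataloader ask it for samples one index at a time.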


### Lab Objectives:   
- Learn how to create a dataset 
- Learn the key components of a dataset
- Learn how to use a dataset
- Leverage a dataloader  
- Understand the role the dataloader plays when working with a dataset


### Step 1:  

In [1]:
# We start this process off by importing the DataLoader and Dataset classes
from torch.utils.data import DataLoader, Dataset
import torch

In [2]:
### Version of key libraries used  
print(f'Torch version used:  {torch.__version__}')

Torch version used:  2.7.1+cu128


In [3]:
# Set up the device to work with
# This ensures that if an accelerator is available, such as CUDA or the Apple MPS backend,
# we are able to take advantage of it.

if torch.cuda.is_available():
    print('Setting the device to cuda')
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    print('Setting the device to Apple mps')
    device = torch.device('mps')
else:
    print('Setting the device to CPU')
    device = torch.device('cpu')

Setting the device to cuda
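
The device selected here is not used again in this lab. As a hedged sketch of where it would normally come in, the snippet below moves one batch of data onto the chosen device. The `batch` tensor is a stand-in with assumed values, not the lab's data.

```python
import torch

# Pick a device the same way the cell above does, falling back to the CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

# A small stand-in batch (assumed values, not the lab's data)
batch = torch.zeros(5, 2)

# .to() returns the tensor on the target device
# (the same tensor is returned if it already lives there)
batch_on_device = batch.to(device)
print(batch_on_device.device)
```

Moving one batch at a time, rather than the full dataset, keeps accelerator memory use tied to the batch size.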


Let's now create a sample dataset by *subclassing* the Dataset class. When subclassing the Dataset class, we need to ensure our class has two key methods. These are:   
- **__len__**: Returns the number of samples in the dataset
- **__getitem__**: Loads and returns a sample from the dataset at a given index position   
Reference: https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/


In [4]:
# Create MyFirstDataset by subclassing Dataset
class MyFirstDataset(Dataset):
    def __init__(self, X):
        # Define a variable called X
        self.X = X

    # Define here the index of the item to be returned
    def __getitem__(self, index):
        # Take the input from X and index into the rows
        return index, self.X[index]
    
    def __len__(self):
        # Get the total number of samples in the dataset
        return self.X.shape[0]    

In [5]:
# Create some sample data
torch.manual_seed(10)

# Modify this as you wish
# Create more rows or columns by adjusting the size=(10,2)
# To for example size=(20,4), etc.
X = torch.randint(low=0, high=1000, dtype=torch.float, size=(10,2))

# Take a peek at the data 
# At the same time return the size/shape
X, X.size()

(tensor([[937., 405.],
         [932., 567.],
         [432., 705.],
         [187.,  92.],
         [321., 805.],
         [256., 773.],
         [521., 210.],
         [536., 833.],
         [504., 610.],
         [616.,  22.]]),
 torch.Size([10, 2]))

Instantiate the class  

### Step 2:  

In [6]:
# Instantiate the class
# Play around with different sizes for X to get a better understanding of the output

# Fix the random number generator
torch.manual_seed(10)
my_first_dataset = MyFirstDataset(X=X)

# Access a sample from the dataset
print(f'Length of the dataset is: {len(my_first_dataset)}')
print(f'The sample at index 1 is: {my_first_dataset[1]}')

Length of the dataset is: 10
The sample at index 1 is: (1, tensor([932., 567.]))


Let us now introduce the dataloader.   

When we specify a batch size of 5, our 10 samples are split into 2 batches. Notice the two lists of *tensors* in the output below. Go ahead and change the batch size to see how the output changes. Try setting it to 3, for example.
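
As a quick sketch of how the batch size determines the number of batches, the snippet below compares `len()` of a dataloader with the ceiling of samples divided by batch size. It uses `TensorDataset` and a stand-in `data` tensor (assumed values) so that it runs on its own.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten stand-in samples with two features each (assumed values)
data = torch.arange(20, dtype=torch.float).reshape(10, 2)
dataset = TensorDataset(data)

for batch_size in (5, 3):
    loader = DataLoader(dataset, batch_size=batch_size)
    # Number of samples in each batch produced by this loader
    sizes = [batch[0].shape[0] for batch in loader]
    # len(loader) matches ceil(number_of_samples / batch_size)
    print(batch_size, len(loader), math.ceil(len(dataset) / batch_size), sizes)

# drop_last=True discards the final, smaller batch
short_loader = DataLoader(dataset, batch_size=3, drop_last=True)
print(len(short_loader))
```

With a batch size of 3, the last batch holds only the one leftover sample unless `drop_last=True` is set.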

In [7]:
# Testing the dataloader
my_first_dataloader = DataLoader(dataset=my_first_dataset, batch_size=5, num_workers=2, shuffle=False)
my_first_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7f261ee8cb60>

In [8]:
# Print out the batches
for batch in my_first_dataloader:
    print(batch)


[tensor([0, 1, 2, 3, 4]), tensor([[937., 405.],
        [932., 567.],
        [432., 705.],
        [187.,  92.],
        [321., 805.]])]
[tensor([5, 6, 7, 8, 9]), tensor([[256., 773.],
        [521., 210.],
        [536., 833.],
        [504., 610.],
        [616.,  22.]])]


The first tensor in each batch contains the index positions. This is why we see [0, 1, 2, 3, 4] in the first tensor. The second tensor holds the actual values in X for that batch. Notice also that if you were to re-run the for loop above, you would get the same results every time. This is because we set shuffle=False when we created the dataloader. 
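
To see where the two tensors per batch come from, the sketch below calls `default_collate`, the function a dataloader uses when no `collate_fn` is supplied, on two hand-written samples. The samples mimic the `(index, row)` tuples returned by `MyFirstDataset.__getitem__`.

```python
import torch
from torch.utils.data import default_collate

# Two samples shaped like MyFirstDataset.__getitem__ returns: (index, row)
samples = [
    (0, torch.tensor([937., 405.])),
    (1, torch.tensor([932., 567.])),
]

# default_collate gathers the indices into one tensor and stacks the rows into another
indices, rows = default_collate(samples)
print(indices)
print(rows)
```

Each position in the returned tuple becomes its own batched tensor, which is why every batch shows an index tensor followed by a value tensor.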

We could simply set shuffle=True on the dataloader and work with the default randomizer. Let us instead try setting up the samplers on our own. With shuffle=False, the dataloader uses a *SequentialSampler*, as shown below.  

### Step 3:  

In [9]:
# Inspect the default sampler 
type(my_first_dataloader.sampler)

torch.utils.data.sampler.SequentialSampler
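
A sampler does not return data. It yields index positions, which the dataloader then uses to look up samples in the dataset. The sketch below illustrates this with a stand-in `data` tensor (assumed values), indexing it by hand in the way a dataloader would for its first batch.

```python
import torch
from torch.utils.data import SequentialSampler

# Ten stand-in samples with two features each (assumed values)
data = torch.arange(20, dtype=torch.float).reshape(10, 2)

# Iterating over the sampler produces index positions, in order
sampler = SequentialSampler(data)
indices = list(sampler)
print(indices)

# The first five indices select the rows that would form the first batch of size 5
first_batch = data[indices[:5]]
print(first_batch)
```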

In [10]:
# This is the same as using the SequentialSampler directly
from torch.utils.data import SequentialSampler

In [11]:
# We see below we get the same output as before with the SequentialSampler
seq_sampler = SequentialSampler(data_source=X)
seq_sampler

<torch.utils.data.sampler.SequentialSampler at 0x7f2744a051f0>

In [12]:
# Set up the dataloader again. This time use the sequential sampler
X_loader = DataLoader(dataset=my_first_dataset, batch_size=5, sampler=seq_sampler)

# Enumerate the batches
# Similar to before, play around with the batch size to see how your output varies
# This output is sequential, just as the one above
for batch_idx, item in enumerate(X_loader):
    print(f'Batch number: {batch_idx} has items: \n{item}')
    print('*'*40)

Batch number: 0 has items: 
[tensor([0, 1, 2, 3, 4]), tensor([[937., 405.],
        [932., 567.],
        [432., 705.],
        [187.,  92.],
        [321., 805.]])]
****************************************
Batch number: 1 has items: 
[tensor([5, 6, 7, 8, 9]), tensor([[256., 773.],
        [521., 210.],
        [536., 833.],
        [504., 610.],
        [616.,  22.]])]
****************************************


The results look much the same as we saw earlier.   

Let us move to setting shuffle = True

### Step 4:  

In [13]:
# Let us instantiate the DataLoader class once more
# This time, notice the *shuffle=True*

X_loader = DataLoader(dataset=my_first_dataset, batch_size=5, shuffle=True)

# If we look at the sampler type again,
# we see we are now using the RandomSampler
type(X_loader.sampler)

torch.utils.data.sampler.RandomSampler

In [14]:
# Enumerate the batches.
# Similar to before, play around with the batch size to see your output
# Also run this a few times
# The order of the output should change every time you run this

for batch_idx, batch in enumerate(X_loader):
    print(f'Batch number: {batch_idx} has items: \n{batch}')
    print('*'*40)

Batch number: 0 has items: 
[tensor([5, 0, 8, 6, 2]), tensor([[256., 773.],
        [937., 405.],
        [504., 610.],
        [521., 210.],
        [432., 705.]])]
****************************************
Batch number: 1 has items: 
[tensor([9, 7, 4, 3, 1]), tensor([[616.,  22.],
        [536., 833.],
        [321., 805.],
        [187.,  92.],
        [932., 567.]])]
****************************************
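
If a repeatable shuffle is needed, one option is to hand the dataloader its own seeded `torch.Generator`. The sketch below is a minimal illustration under that assumption. The helper `shuffled_order` is hypothetical, and `TensorDataset` stands in for the lab's dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten stand-in samples: the values 0..9
dataset = TensorDataset(torch.arange(10))


def shuffled_order(seed):
    """Hypothetical helper: return the order in which samples are visited."""
    # A dedicated generator keeps the shuffle independent of the global seed
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=generator)
    return [int(value) for (batch,) in loader for value in batch]


first_run = shuffled_order(seed=10)
second_run = shuffled_order(seed=10)
print(first_run)
print(second_run)
```

Both runs visit the samples in the same shuffled order because each starts from a generator seeded with the same value.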


In [15]:
# Using shuffle=True is the same as using the RandomSampler
from torch.utils.data import RandomSampler

In [16]:
# Setup the random sampler on X
X_random_sampler = RandomSampler(X)
X_dataset = MyFirstDataset(X)
X

tensor([[937., 405.],
        [932., 567.],
        [432., 705.],
        [187.,  92.],
        [321., 805.],
        [256., 773.],
        [521., 210.],
        [536., 833.],
        [504., 610.],
        [616.,  22.]])

In [17]:
# Set up another dataloader, using the random sampler 
X_loader = DataLoader(dataset=X_dataset, batch_size=5, sampler=X_random_sampler)
for batch_number, (sample_indices, samples) in enumerate(X_loader):
    print(f'Index number: {batch_number} batch index: {sample_indices} has item: \n{samples}')

Index number: 0 batch index: tensor([3, 5, 0, 4, 7]) has item: 
tensor([[187.,  92.],
        [256., 773.],
        [937., 405.],
        [321., 805.],
        [536., 833.]])
Index number: 1 batch index: tensor([6, 9, 8, 2, 1]) has item: 
tensor([[521., 210.],
        [616.,  22.],
        [504., 610.],
        [432., 705.],
        [932., 567.]])


We should now have a solid understanding of the role datasets and dataloaders play in ensuring we do not run out of memory.
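
In this lab the full tensor X already sits in memory, so the memory benefit is not directly visible. As a hedged sketch of the idea, the hypothetical `LazyRowsDataset` below builds each row only when it is requested. Building a row here stands in for reading one record from disk.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class LazyRowsDataset(Dataset):
    """Hypothetical dataset that builds each row on demand instead of storing them all."""

    def __init__(self, num_rows):
        self.num_rows = num_rows

    def __len__(self):
        return self.num_rows

    def __getitem__(self, index):
        # Stand-in for reading one record from disk: only this row is created
        return torch.tensor([float(index), float(index) * 2])


# One million rows are described, but none exist until a batch asks for them
lazy_loader = DataLoader(LazyRowsDataset(num_rows=1_000_000), batch_size=4)
first_batch = next(iter(lazy_loader))
print(first_batch)
```

Only the four rows of the requested batch are materialized, so memory use depends on the batch size and not on the length of the dataset.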

### Lab Takeaways:  
- As can be seen from above, we use the dataset to customize how our data is accessed and the dataloader to serve that data in batches.  
- If you play around with the *batch_size*, you will notice the number of batches changes. For example, with 10 samples and a batch size of 5, we saw two batches above. 
- Notice the *tensor* appearing twice in each batch. The first holds the index positions and the second holds the corresponding rows of X. 
- If you change the *batch_size* to 10 or more, you will see everything returned in a single batch because we have only 10 items in our dataset.
- So the dataloader takes our full data and returns it in batches of the size we choose, basically splitting it up into batch_size groups.

With that understanding, we are ready to use the dataset and dataloader to handle the main part of our data.

