# Pipeline building
This example shows how to build a data pipeline with Cascade, using MNIST classification as the task.

In [1]:
import cascade.data as cdd

import torch
import torchvision
import torchvision.transforms.functional as F

First, let's load MNIST as a torchvision dataset.

In [2]:
MNIST_ROOT = 'data'

train_ds = torchvision.datasets.MNIST(root=MNIST_ROOT,
                                     train=True, 
                                     transform=F.to_tensor,
                                     download=True)
test_ds = torchvision.datasets.MNIST(root=MNIST_ROOT, 
                                    train=False, 
                                    transform=F.to_tensor)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data\MNIST\raw\train-images-idx3-ubyte.gz


9913344it [01:35, 103385.17it/s]                              


Extracting data\MNIST\raw\train-images-idx3-ubyte.gz to data\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data\MNIST\raw\train-labels-idx1-ubyte.gz


29696it [00:00, 14870350.00it/s]         

Extracting data\MNIST\raw\train-labels-idx1-ubyte.gz to data\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data\MNIST\raw\t10k-images-idx3-ubyte.gz


1649664it [00:03, 436587.50it/s]                              


Extracting data\MNIST\raw\t10k-images-idx3-ubyte.gz to data\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data\MNIST\raw\t10k-labels-idx1-ubyte.gz


5120it [00:00, 3402762.87it/s]          


Extracting data\MNIST\raw\t10k-labels-idx1-ubyte.gz to data\MNIST\raw



## Adding metadata
A loaded dataset is not enough on its own: we also need a container that can store metadata about it. So the next step is to link these datasets to Cascade's objects. The simplest way is to use `cascade.data.Wrapper`.  
  
Suppose we also want to record a description of the data, so that we later know which data a model was trained on. We can do this with the `meta_prefix` keyword, which the constructor of any dataset accepts.  
It takes a Python dictionary of serializable objects. Here we pass a short description.

In [3]:
train_ds = cdd.Wrapper(train_ds, 
    meta_prefix={
        'desc': 'This is MNIST dataset of handwritten images'
    })
test_ds = cdd.Wrapper(test_ds)
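
Since `meta_prefix` expects serializable values, it can help to check a dictionary before passing it in. The sketch below assumes JSON is the target format, which this example does not confirm; `is_json_serializable` is a hypothetical helper, not part of Cascade.

```python
import json


def is_json_serializable(meta):
    """Hypothetical helper: True if `meta` can be dumped to JSON."""
    try:
        json.dumps(meta)
        return True
    except (TypeError, ValueError):
        return False


# Plain strings and numbers serialize without trouble.
good_meta = {'desc': 'This is MNIST dataset of handwritten images', 'classes': 10}
# An arbitrary Python object has no JSON representation.
bad_meta = {'transform': object()}

good_ok = is_json_serializable(good_meta)
bad_ok = is_json_serializable(bad_meta)
```

If the check fails, one option is to store a string description of the offending value instead of the object itself.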

## Applying noise
Let's say we want to add noise to each image.  
*The magnitude is hardcoded to keep the example simple.*  
To do this we write a Modifier. A Modifier wraps another dataset and applies a function to its elements lazily, that is, only when an element is requested.

In [4]:
class NoiseModifier(cdd.Modifier):
    def __getitem__(self, index):
        img, label = self._dataset[index]  # get the item from the wrapped dataset (here, the Wrapper)
        img = img + torch.rand_like(img) * 0.1  # add uniform noise with a fixed magnitude
        img = torch.clip(img, 0, 1)  # to_tensor scales pixels to [0, 1], so clip to that range
        return img, label

In [5]:
# Let's apply the noise to the images!
train_ds = NoiseModifier(train_ds)
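
To make the lazy behaviour concrete, here is a library-free sketch of the same wrapping pattern. `ListDataset` and `LazyNoise` are hypothetical stand-ins written for illustration; they are not Cascade classes and do not reproduce how `cdd.Modifier` is implemented.

```python
import random


class ListDataset:
    """Hypothetical stand-in for a wrapped dataset of (value, label) pairs."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


class LazyNoise:
    """Hypothetical modifier: adds noise only when an item is requested."""

    def __init__(self, dataset, magnitude=0.1, seed=0):
        self._dataset = dataset
        self._magnitude = magnitude
        self._rng = random.Random(seed)
        self.calls = 0  # how many items have actually been processed

    def __getitem__(self, index):
        value, label = self._dataset[index]
        self.calls += 1
        noisy = value + self._rng.random() * self._magnitude
        # clamp to [0, 1], the value range assumed in this sketch
        return min(max(noisy, 0.0), 1.0), label

    def __len__(self):
        return len(self._dataset)


base = ListDataset([(0.0, 3), (0.5, 7), (1.0, 1)])
noisy = LazyNoise(base)
calls_before = noisy.calls   # wrapping alone processes nothing
value, label = noisy[1]      # the noise is computed here, for one item only
calls_after = noisy.calls
```

Wrapping the dataset costs nothing up front, and the underlying data is never modified: each access produces a fresh noisy copy.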

## Viewing metadata
To view the final metadata of the pipeline, call the `get_meta` method.  
It returns a list of dicts, one per block: the first is for NoiseModifier and the second is for the Wrapper around the MNIST dataset, carrying our custom description.  
With the `meta_prefix` keyword and the `update_meta` method we can add any information we want to an object's metadata.

In [6]:
train_ds.get_meta()

[{'name': '__main__.NoiseModifier', 'type': 'dataset', 'len': 60000},
 {'name': 'cascade.data.dataset.Wrapper',
  'desc': 'This is MNIST dataset of handwritten images',
  'type': 'dataset',
  'len': 60000,
  'obj_type': torchvision.datasets.mnist.MNIST}]
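
The outermost-first ordering of that list can be illustrated with a small library-free sketch. `Block`, `Source` and `Noise` are hypothetical classes, and the `update_meta` signature used here (a plain dict) is an assumption for illustration; none of this is Cascade's actual implementation.

```python
class Block:
    """Hypothetical pipeline block that carries its own metadata dict."""

    def __init__(self, source=None, meta_prefix=None):
        self._source = source  # the wrapped block, or None for the innermost one
        self._meta = dict(meta_prefix or {})

    def update_meta(self, extra):
        # merge additional fields into this block's own metadata
        self._meta.update(extra)

    def get_meta(self):
        own = {'name': type(self).__name__, **self._meta}
        if self._source is None:
            return [own]
        # this block comes first, followed by everything it wraps
        return [own] + self._source.get_meta()


class Source(Block):
    pass


class Noise(Block):
    pass


pipeline = Noise(Source(meta_prefix={'desc': 'toy data'}))
pipeline.update_meta({'magnitude': 0.1})
meta = pipeline.get_meta()
```

Each block contributes exactly one dict, so the length of the list equals the depth of the pipeline.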

## Ready to train model
Now we can set the batch size and pass our pipeline to the DataLoaders.

In [7]:
BATCH_SIZE = 10

In [8]:
trainldr = torch.utils.data.DataLoader(dataset=train_ds, 
                                       batch_size=BATCH_SIZE,
                                       shuffle=True)
testldr = torch.utils.data.DataLoader(dataset=test_ds,
                                      batch_size=BATCH_SIZE,
                                      shuffle=False)
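
As a quick sanity check on the batch size, the number of batches per epoch can be worked out by hand. The sketch below assumes the default `drop_last=False` behaviour of `DataLoader`, where a final partial batch is kept; `num_batches` is a hypothetical helper, and the test set size of 10000 is the standard MNIST figure rather than something shown in the output above.

```python
import math


def num_batches(dataset_len, batch_size, drop_last=False):
    """Hypothetical helper: batches per epoch for a given dataset length."""
    if drop_last:
        # an incomplete final batch is discarded
        return dataset_len // batch_size
    # an incomplete final batch is kept
    return math.ceil(dataset_len / batch_size)


BATCH_SIZE = 10
train_batches = num_batches(60000, BATCH_SIZE)  # 60000 training images
test_batches = num_batches(10000, BATCH_SIZE)   # 10000 test images
```

With `BATCH_SIZE = 10` both sizes divide evenly, so `drop_last` makes no difference here.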

## See also:
[Train models](model_training.html)  