# Pipeline building
This example shows how to build a data pipeline with Cascade, using MNIST classification as the task.

In [1]:
#!pip3 install torchvision

In [2]:
import cascade.data as cdd
import cascade.meta as cme

import torch
import torchvision
import torchvision.transforms.functional as F

In [3]:
import cascade
cascade.__version__

'0.13.0'

Let's load the MNIST dataset from torchvision.

In [4]:
MNIST_ROOT = 'data'

train_ds = torchvision.datasets.MNIST(root=MNIST_ROOT,
                                     train=True, 
                                     transform=F.to_tensor,
                                     download=True)
test_ds = torchvision.datasets.MNIST(root=MNIST_ROOT, 
                                    train=False, 
                                    transform=F.to_tensor)

## Adding metadata
A loaded dataset alone is not enough: we also need a container that can store metadata about it. The next step is therefore to wrap these datasets in Cascade's objects.

Suppose we also want to record a description of the data, so that we know later which data our model was trained on.

In [5]:
train_ds = cdd.Wrapper(train_ds)
train_ds.describe("This is MNIST dataset of handwritten images")
test_ds = cdd.Wrapper(test_ds)

## Applying noise
Let's say we want to add noise to each image.  
*The magnitude is hardcoded to keep the example simple.*  
To do this we write a Modifier. A Modifier wraps another dataset and applies a function to its elements lazily, that is, only when an element is accessed.

In [6]:
class NoiseModifier(cdd.Modifier):
    def __getitem__(self, index):
        # Get the item from the Wrapper, which is _dataset for this Modifier
        img, label = self._dataset[index]
        # Add uniform random noise with a fixed magnitude (no in-place mutation)
        img = img + torch.rand_like(img) * 0.1
        # F.to_tensor scales pixel values to [0, 1], so clip to that range
        img = torch.clip(img, 0, 1)
        return img, label

In [7]:
train_ds = NoiseModifier(train_ds)
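
To make the idea of lazy application concrete, here is a minimal pure-Python sketch. It does not use Cascade: `ListDataset` and `LazyNoise` are hypothetical stand-ins written only for illustration, and scalar values take the place of image tensors. The point is that the modifier keeps a reference to its source and does the work inside `__getitem__`.

```python
import random


class ListDataset:
    """Hypothetical stand-in for a wrapped dataset of (value, label) pairs."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


class LazyNoise:
    """Hypothetical lazy modifier: nothing is computed until an item is read."""

    def __init__(self, dataset, magnitude=0.1, seed=0):
        self._dataset = dataset          # only a reference is stored
        self._magnitude = magnitude
        self._rng = random.Random(seed)  # seeded for reproducibility

    def __getitem__(self, index):
        # The noise is drawn here, at access time, not in __init__
        value, label = self._dataset[index]
        noisy = value + self._rng.random() * self._magnitude
        # Keep the result inside the valid [0, 1] range
        return min(max(noisy, 0.0), 1.0), label

    def __len__(self):
        return len(self._dataset)


source = ListDataset([(0.2, 3), (0.5, 7), (0.95, 1)])
noisy = LazyNoise(source)

first = noisy[0]   # noise is drawn on this access
second = noisy[0]  # and drawn again on this one
```

Because the noise is generated on every access, `first` and `second` differ, while `source[0]` stays untouched. The same holds for the `NoiseModifier` above: each epoch sees freshly perturbed images.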

## Validation
Data validation is one of the most important parts of any ML pipeline. Testing assumptions about the data is most useful when the pipeline is complex; here we add a validation stage just for demonstration.

In [8]:
cme.PredicateValidator(train_ds, lambda x: torch.all(x[0] < 256))

OK!

cascade.meta.validator.PredicateValidator
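
Conceptually, a predicate check of this kind walks over the dataset and tests every item. The sketch below is an assumption about the general technique, not Cascade's implementation: `validate` and `in_range` are hypothetical names, and plain lists stand in for image tensors.

```python
def validate(dataset, predicate):
    """Hypothetical helper: return the indices of items that fail the predicate."""
    return [i for i in range(len(dataset)) if not predicate(dataset[i])]


def in_range(item):
    # item is a (pixels, label) pair; every pixel must be below 256
    pixels, _label = item
    return all(value < 256 for value in pixels)


# Toy data: the third item contains an out-of-range value on purpose
items = [
    ([0.0, 0.4, 1.0], 5),
    ([0.1, 0.9, 0.3], 2),
    ([0.2, 300.0, 0.0], 8),
]

bad = validate(items, in_range)
print("OK!" if not bad else f"Failed at indices: {bad}")
```

Collecting the failing indices, rather than stopping at the first failure, makes it easier to see how widespread a problem is.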

## Versioning
To track changes to the pipeline across experiments, we can version it. `cdd.version` takes the dataset and the path to a version log and returns the current version string.

In [9]:
cdd.version(train_ds, 'train_ds_version_log.yml')

'0.0'

## Viewing metadata
To view the pipeline's metadata, call `get_meta()`. It returns a list with one entry per pipeline stage:  
the first is the NoiseModifier and the second is the Wrapper around the MNIST dataset, with our custom description.  
Using `update_meta()` we can add any information we want to an object's metadata.

In [10]:
train_ds.get_meta()

[{'name': '__main__.NoiseModifier',
  'description': None,
  'tags': [],
  'comments': [],
  'links': [],
  'type': 'dataset',
  'len': 60000},
 {'name': 'cascade.data.dataset.Wrapper',
  'description': 'This is MNIST dataset of handwritten images',
  'tags': [],
  'comments': [],
  'links': [],
  'type': 'dataset',
  'len': 60000,
  'obj_type': "<class 'torchvision.datasets.mnist.MNIST'>"}]

## Ready to train model
Now we can set the batch size and pass our pipeline to the DataLoaders.

In [11]:
BATCH_SIZE = 10

In [12]:
trainldr = torch.utils.data.DataLoader(dataset=train_ds, 
                                       batch_size=BATCH_SIZE,
                                       shuffle=True)
testldr = torch.utils.data.DataLoader(dataset=test_ds,
                                      batch_size=BATCH_SIZE,
                                      shuffle=False)
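
As a quick sanity check of what a loader yields, the sketch below iterates over a small stand-in dataset instead of MNIST. The `TensorDataset` of 25 random 1x28x28 tensors is an assumption made to keep the example self-contained and download-free; the batch size matches the one above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 25 random "images" shaped like MNIST tensors, with random labels
images = torch.rand(25, 1, 28, 28)
labels = torch.randint(0, 10, (25,))
stand_in = TensorDataset(images, labels)

loader = DataLoader(stand_in, batch_size=10, shuffle=False)

# Collect the shape of every image batch the loader produces
shapes = [tuple(batch_images.shape) for batch_images, _ in loader]
print(shapes)
```

With 25 items and a batch size of 10, the loader produces three batches and the last one holds only 5 items. Pass `drop_last=True` to `DataLoader` if the model requires batches of equal size.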

## See also:
[Train models](model_training.html)  