# Dataset Zoo
Cascade by itself is a **DIY ML-engineering solution**. This means that it provides the basics on top of which you can easily build your own ML workflow.  
Cascade has plenty of components: basic ones that are part of the core and more specific ones that live in the special `utils` module. And if you don't find a suitable component, you can write it yourself.  
Some of them are presented here. These are the `Dataset`s, the building blocks of Cascade's pipelines, with descriptions and short examples of how to use them in your workflow.

In [2]:
import cascade
print(cascade.__version__)

0.8.0-alpha


In [3]:
from cascade import data as cdd
from cascade import utils as cdu

## Wrappers
If your solution has a data source that is already accessible in Python code and you need to plug it into Cascade's workflow, a `Wrapper` may be all you need. `Wrapper` gives the items from the source one by one, adding some info about the underlying data to its metadata.

In [19]:
ds = cdd.Wrapper([0, 1, 2, 3, 4]) # Here for simplicity the list of numbers is a data source

for item in ds:
    print(item, end=' ')

0 1 2 3 4 

## Iterators
If a data source doesn't have a length, you cannot use `Wrapper`s, but that is not a problem: you can use `Iterator`s instead! It is basically the same dataset, but with a different interface.

In [5]:
def gen():
    for number in range(5):
        yield number

ds = cdd.Iterator(gen())

for item in ds:
    print(item, end=' ')

0 1 2 3 4 
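
To see why a generator cannot go into a length-based wrapper, the plain-Python sketch below checks that a generator object has no `len()` but can still be consumed by iteration. It uses no Cascade code; `has_length` and `items` are names made up for this illustration.

```python
def gen():
    for number in range(5):
        yield number


source = gen()

# A generator object does not implement __len__, so len() raises TypeError
try:
    len(source)
    has_length = True
except TypeError:
    has_length = False

# Iteration is the only way to get the items out
items = list(source)

print(has_length)  # False
print(items)       # [0, 1, 2, 3, 4]
```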

## ApplyModifier
Pipelines frequently apply Python functions to the items in datasets. In Cascade this is done with `ApplyModifier`.

In [20]:
# The function that will be applied
def square(x):
    return x ** 2

ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.ApplyModifier(ds, square) # ds is now a pipeline of two stages

for item in ds:
    print(item, end=' ')

0 1 4 9 16 
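
One way to picture a stage that applies a function is a thin object that calls the function whenever an item is requested. The sketch below is a hypothetical stand-in written in plain Python, not Cascade's `ApplyModifier`; applying the function lazily on access is an assumption made for the illustration.

```python
class LazyApply:
    """Hypothetical stand-in for a function-applying stage (not Cascade code)."""

    def __init__(self, source, func):
        self._source = source
        self._func = func

    def __getitem__(self, index):
        # Assumption: the function runs only when an item is requested
        return self._func(self._source[index])

    def __len__(self):
        return len(self._source)


def square(x):
    return x ** 2


squares = LazyApply([0, 1, 2, 3, 4], square)
result = [squares[i] for i in range(len(squares))]
print(result)  # [0, 1, 4, 9, 16]
```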

## Concatenator

In [7]:
ds_1 = cdd.Wrapper([0, 1, 2, 3, 4])
ds_2 = cdd.Wrapper([5, 6, 7, 8, 9])

ds = cdd.Concatenator((ds_1, ds_2))

for item in ds:
    print(item, end=' ')

0 1 2 3 4 5 6 7 8 9 

## CyclicSampler

In [8]:
ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.CyclicSampler(ds, 11)

for item in ds:
    print(item, end=' ')

0 1 2 3 4 0 1 2 3 4 0 
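
The output above is what you get when indices wrap around the end of the source. A minimal plain-Python sketch of that idea follows; `cyclic_indices` is a hypothetical helper, and modulo indexing is an assumption about how such a sampler could work, not a description of Cascade's internals.

```python
def cyclic_indices(source_length, num_samples):
    # Wrap each position back into the valid index range of the source
    return [i % source_length for i in range(num_samples)]


data = [0, 1, 2, 3, 4]
sampled = [data[i] for i in cyclic_indices(len(data), 11)]
print(sampled)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0]
```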

## RandomSampler

In [9]:
import numpy as np
np.random.seed(0)

ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.RandomSampler(ds, 11)

for item in ds:
    print(item, end=' ')

4 0 3 3 3 1 3 2 4 0 0 

## RangeSampler

In [10]:
ds = cdd.Wrapper([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ds = cdd.RangeSampler(ds, 1, 10, 2)

for item in ds:
    print(item, end=' ')

1 3 5 7 9 
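
The three numbers in the example behave like the `start`, `stop` and `step` of Python's built-in `range`, at least for this input. The sketch below reproduces the same selection in plain Python; reading the arguments this way is an inference from the output, not a statement about `RangeSampler`'s signature.

```python
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Assumed meaning of the arguments: start=1, stop=10 (exclusive), step=2
start, stop, step = 1, 10, 2
selected = [data[i] for i in range(start, stop, step)]
print(selected)  # [1, 3, 5, 7, 9]
```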

## BruteforceCacher

In [11]:
import time


class LongLoadingDataSource(cdd.Dataset):
    def __init__(self, length, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._length = length

    def __getitem__(self, index):
        time.sleep(1)
        return index
    
    def __len__(self):
        return self._length

ds = LongLoadingDataSource(10)
ds = cdd.BruteforceCacher(ds)

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]
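
The progress bar appears as soon as the cacher is constructed, which suggests that every item is loaded up front. The sketch below illustrates that eager strategy in plain Python. `CountingSource` and `EagerCache` are hypothetical classes for this illustration, not part of Cascade.

```python
class CountingSource:
    """Hypothetical data source that counts how often it is read."""

    def __init__(self, length):
        self._length = length
        self.reads = 0

    def __getitem__(self, index):
        self.reads += 1
        return index

    def __len__(self):
        return self._length


class EagerCache:
    """Hypothetical eager cache: loads every item once, at construction."""

    def __init__(self, source):
        self._items = [source[i] for i in range(len(source))]

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


source = CountingSource(10)
cache = EagerCache(source)
reads_after_build = source.reads

# Reading the cache twice does not touch the source again
first_pass = [cache[i] for i in range(len(cache))]
second_pass = [cache[i] for i in range(len(cache))]
reads_after_access = source.reads

print(reads_after_build, reads_after_access)  # 10 10
```

The trade-off is memory: the whole dataset has to fit in RAM for this approach to work.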


## Pickler

In [12]:
ds = LongLoadingDataSource(10)
ds = cdd.BruteforceCacher(ds)
ds = cdd.Pickler('ds.pkl', ds)

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]


In [13]:
from tqdm import tqdm

ds = cdd.Pickler('ds.pkl')

for item in tqdm(ds):
    print(item, end=' ')

100%|██████████| 10/10 [00:00<00:00, 5006.93it/s]

0 1 2 3 4 5 6 7 8 9 




## SequentialCacher

In [14]:
from cascade import data as cdd


class AlertOnLoad(cdd.Dataset):
    def __init__(self, length, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._length = length

    def __getitem__(self, index):
        print(f'loaded {index}')
        return index
    
    def __len__(self):
        return self._length


ds = AlertOnLoad(100)
ds = cdd.SequentialCacher(ds, 10)

for i in range(11):
    print(f'Step {i}')
    ds[i] # Load the element

Step 0
loaded 0
loaded 1
loaded 2
loaded 3
loaded 4
loaded 5
loaded 6
loaded 7
loaded 8
loaded 9
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9
Step 10
loaded 10
loaded 11
loaded 12
loaded 13
loaded 14
loaded 15
loaded 16
loaded 17
loaded 18
loaded 19
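
The log above shows items being loaded ten at a time, and only when a requested index is not in the current batch. A plain-Python sketch of such a batch cache follows. `BatchCache` is a hypothetical class; starting each batch at the requested index is an assumption (the log is equally consistent with batches aligned to multiples of the batch size).

```python
class BatchCache:
    """Hypothetical batch cache (not Cascade code)."""

    def __init__(self, source, batch_size):
        self._source = source
        self._batch_size = batch_size
        self._batch = {}
        self.loads = []  # records which indices were actually read

    def __getitem__(self, index):
        if index not in self._batch:
            # Cache miss: drop the old batch and load the next one
            end = min(index + self._batch_size, len(self._source))
            self._batch = {}
            for i in range(index, end):
                self.loads.append(i)
                self._batch[i] = self._source[i]
        return self._batch[index]


cache = BatchCache(list(range(100)), 10)
values = [cache[i] for i in range(11)]

# Eleven sequential reads trigger exactly two batch loads: 0..9 and 10..19
print(cache.loads == list(range(20)))  # True
```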


## VersionAssigner

In [15]:
from cascade import data as cdd

ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.VersionAssigner(ds, 'ds_version_log.yml')
print(ds.version)

0.0


In [16]:
ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.Modifier(ds)
ds = cdd.VersionAssigner(ds, 'ds_version_log.yml')
print(ds.version)

1.0


In [17]:
ds = cdd.Wrapper([0, 1, 2, 3, 4, 5, 6, 7])
ds = cdd.Modifier(ds)
ds = cdd.VersionAssigner(ds, 'ds_version_log.yml')
print(ds.version)

1.1


## split

In [18]:
from cascade import data as cdd


ds = cdd.Wrapper([0, 1, 2, 3, 4, 5, 6, 7])
train_ds, test_ds = cdd.split(ds, 0.8)

print(len(train_ds), len(test_ds), sep=', ')

6, 2
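
With 8 items and a fraction of 0.8 the exact cut would be at 6.4, and the output shows 6 items in the first part. The sketch below reproduces that with truncation. `split_at` is a hypothetical helper, and rounding down is an assumption inferred from this one output.

```python
def split_at(items, frac):
    # Assumption: the cut position is truncated toward zero
    cut = int(len(items) * frac)
    return items[:cut], items[cut:]


train, test = split_at([0, 1, 2, 3, 4, 5, 6, 7], 0.8)
print(len(train), len(test), sep=', ')  # 6, 2
```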


## OverSampler