# Dataset Zoo
Cascade by itself is a **DIY ML-engineering solution**. This means that it provides the basics on top of which you can easily build your own ML workflow.  
Cascade offers plenty of components: basic ones that are part of the core and more specific ones that live in the special `utils` module. And if you can't find a suitable component, you can write it yourself.  
Some of the ready-made components are presented here. These are the `Dataset`s, the building blocks of Cascade's pipelines, along with their descriptions and short examples of how to use them in your workflow.

In [2]:
import cascade
print(cascade.__version__)

0.8.0-alpha


## Wrappers
If your solution has a data source that is already accessible from Python code, but you need to plug it into Cascade's workflow, a `Wrapper` may be all you need. `Wrapper` yields the items from the source one by one and adds some info about the underlying data to its metadata.

In [2]:
from cascade import data as cdd


ds = cdd.Wrapper([0, 1, 2, 3, 4]) # Here for simplicity the list of numbers is a data source

for item in ds:
    print(item, end=' ')

0 1 2 3 4 

In [3]:
ds.get_meta()

[{'name': '<cascade.data.dataset.Wrapper',
  'type': 'dataset',
  'len': 5,
  'obj_type': "<class 'list'>"}]
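
To make the wrapping idea concrete, here is a minimal pure-Python sketch of a length-aware wrapper. `SimpleWrapper` is a hypothetical class whose metadata keys only mirror the output above; it is not Cascade's actual implementation.

```python
class SimpleWrapper:
    """Hypothetical stand-in for a wrapper: gives items by index and reports metadata."""

    def __init__(self, source):
        self._source = source

    def __getitem__(self, index):
        # Delegate indexing to the wrapped source
        return self._source[index]

    def __len__(self):
        return len(self._source)

    def get_meta(self):
        # Mirrors the keys shown above; real metadata may contain other fields
        return [{
            'name': type(self).__name__,
            'type': 'dataset',
            'len': len(self),
            'obj_type': str(type(self._source)),
        }]


ds = SimpleWrapper([0, 1, 2, 3, 4])
items = [item for item in ds]  # iteration falls back to __getitem__
meta = ds.get_meta()
```

The point of the sketch is that the wrapped object only needs indexing and a length; everything else is added by the wrapper.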

## Iterators
If a data source doesn't have a length, you cannot use `Wrapper`s, but that is not a problem: you can use `Iterator`s instead! It is basically the same dataset, but with a different interface.

In [4]:
from cascade import data as cdd


def gen():
    for number in range(5):
        yield number

ds = cdd.Iterator(gen())

for item in ds:
    print(item, end=' ')

0 1 2 3 4 
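
The difference in interface can be illustrated without Cascade at all. The sketch below shows that a generator has no length and wraps it in a hypothetical `SimpleIterator` class that supports iteration only; the class name and structure are assumptions for illustration.

```python
def gen():
    for number in range(5):
        yield number


class SimpleIterator:
    """Hypothetical stand-in: supports iteration only, no len() and no indexing."""

    def __init__(self, source):
        self._source = source

    def __iter__(self):
        return iter(self._source)


# A generator object has no __len__, so len() raises TypeError
try:
    len(gen())
    has_length = True
except TypeError:
    has_length = False

collected = list(SimpleIterator(gen()))
```

Note that a generator can be consumed only once, so a second pass over the same wrapped generator would give nothing.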

## ApplyModifier
Pipelines frequently apply Python functions to the items in datasets. In Cascade this is done using `ApplyModifier`.

In [5]:
from cascade import data as cdd


# The function that will be applied
def square(x):
    return x ** 2

ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.ApplyModifier(ds, square) # ds is now a pipeline of two stages

for item in ds:
    print(item, end=' ')

0 1 4 9 16 
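
One way to picture such a stage is a thin object that calls the function when an item is requested. The sketch below assumes lazy, per-item application; `SimpleApply` is a hypothetical class, and whether Cascade applies functions lazily in exactly this way is an assumption.

```python
class SimpleApply:
    """Hypothetical stand-in: applies func to an item when it is accessed."""

    def __init__(self, dataset, func):
        self._dataset = dataset
        self._func = func

    def __getitem__(self, index):
        return self._func(self._dataset[index])

    def __len__(self):
        return len(self._dataset)


calls = []  # records which inputs the function has seen


def square(x):
    calls.append(x)
    return x ** 2


ds = SimpleApply([0, 1, 2, 3, 4], square)
calls_before_access = list(calls)  # nothing has been computed yet
third = ds[3]
squares = [ds[i] for i in range(len(ds))]
```

With this design, building the pipeline is cheap and the cost of the function is paid only for the items that are actually read.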

## Concatenator
Concatenation is another frequent operation, used to unify several datasets into one. In Cascade it is done easily using `Concatenator`.

In [6]:
from cascade import data as cdd


ds_1 = cdd.Wrapper([0, 1, 2, 3, 4])
ds_2 = cdd.Wrapper([5, 6, 7, 8, 9])

ds = cdd.Concatenator((ds_1, ds_2))

for item in ds:
    print(item, end=' ')

0 1 2 3 4 5 6 7 8 9 

In addition, it stores the metadata of all its datasets.

In [7]:
ds.get_meta()

[{'name': '<cascade.data.concatenator.Concatenator of\n<cascade.data.dataset.Wrapper\n<cascade.data.dataset.Wrapper',
  'type': 'dataset',
  'data': [[{'name': '<cascade.data.dataset.Wrapper',
     'type': 'dataset',
     'len': 5,
     'obj_type': "<class 'list'>"}],
   [{'name': '<cascade.data.dataset.Wrapper',
     'type': 'dataset',
     'len': 5,
     'obj_type': "<class 'list'>"}]]}]
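
The indexing behind concatenation can be sketched in plain Python: a global index is shifted by the length of each dataset until it falls inside one of them. `SimpleConcat` is a hypothetical class written for illustration, not the library's code.

```python
class SimpleConcat:
    """Hypothetical stand-in: presents several indexable datasets as one."""

    def __init__(self, datasets):
        self._datasets = list(datasets)

    def __len__(self):
        return sum(len(ds) for ds in self._datasets)

    def __getitem__(self, index):
        if index < 0 or index >= len(self):
            raise IndexError(index)
        # Walk through the datasets, shifting the index by each length
        for ds in self._datasets:
            if index < len(ds):
                return ds[index]
            index -= len(ds)


joined = SimpleConcat(([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]))
all_items = list(joined)
seventh = joined[7]  # falls into the second dataset at local index 2
```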

## CyclicSampler

In [8]:
from cascade import data as cdd


ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.CyclicSampler(ds, 11)

for item in ds:
    print(item, end=' ')

0 1 2 3 4 0 1 2 3 4 0 
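
The output above is consistent with taking indices modulo the length of the source. A minimal pure-Python sketch of that idea follows; it is an illustration of the observed behavior, not necessarily how `CyclicSampler` is implemented.

```python
source = [0, 1, 2, 3, 4]
num_samples = 11

# Wrap around to the start whenever the index runs past the end
cyclic = [source[i % len(source)] for i in range(num_samples)]
```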

## RandomSampler

In [9]:
from cascade import data as cdd
import numpy as np
np.random.seed(0)


ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.RandomSampler(ds, 11)

for item in ds:
    print(item, end=' ')

4 0 3 3 3 1 3 2 4 0 0 

## RangeSampler

In [10]:
from cascade import data as cdd


ds = cdd.Wrapper([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ds = cdd.RangeSampler(ds, 1, 10, 2)

for item in ds:
    print(item, end=' ')

1 3 5 7 9 
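
Judging by the output above, the three numbers behave like the `start`, `stop` and `step` arguments of Python's built-in `range`. The sketch below reproduces the same selection in plain Python under that assumption.

```python
source = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Pick every second index starting at 1 and stopping before 10
selected = [source[i] for i in range(1, 10, 2)]

# Slicing with the same three numbers gives the same items
sliced = source[1:10:2]
```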

## BruteforceCacher

In [11]:
import time


class LongLoadingDataSource(cdd.Dataset):
    def __init__(self, length, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._length = length

    def __getitem__(self, index):
        time.sleep(1)
        return index
    
    def __len__(self):
        return self._length

ds = LongLoadingDataSource(10)
ds = cdd.BruteforceCacher(ds)

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]
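
The progress bar above shows all ten items being loaded as soon as the cacher is created. A rough sketch of such eager caching is given below; `EagerCache` and `CountingSource` are hypothetical helpers, and the sketch only assumes that every item is read once up front and served from memory afterwards.

```python
class CountingSource:
    """Hypothetical data source that records every load."""

    def __init__(self, length):
        self._length = length
        self.loaded = []

    def __getitem__(self, index):
        self.loaded.append(index)
        return index

    def __len__(self):
        return self._length


class EagerCache:
    """Hypothetical stand-in: loads the whole dataset at construction time."""

    def __init__(self, dataset):
        self._data = [dataset[i] for i in range(len(dataset))]

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self):
        return len(self._data)


source = CountingSource(10)
cached = EagerCache(source)
loads_after_init = len(source.loaded)

# Repeated reads hit the in-memory copy and do not touch the source
values = [cached[3], cached[3], cached[7]]
loads_after_reads = len(source.loaded)
```

The trade-off is the usual one: the slow loading is paid once, at the price of keeping the whole dataset in memory.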


## Pickler

In [12]:
ds = LongLoadingDataSource(10)
ds = cdd.BruteforceCacher(ds)
ds = cdd.Pickler('ds.pkl', ds)

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]


In [13]:
from tqdm import tqdm

ds = cdd.Pickler('ds.pkl')

for item in tqdm(ds):
    print(item, end=' ')

100%|██████████| 10/10 [00:00<00:00, 5006.93it/s]

0 1 2 3 4 5 6 7 8 9 




## SequentialCacher

In [14]:
from cascade import data as cdd


class AlertOnLoad(cdd.Dataset):
    def __init__(self, length, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._length = length

    def __getitem__(self, index):
        print(f'loaded {index}')
        return index
    
    def __len__(self):
        return self._length


ds = AlertOnLoad(100)
ds = cdd.SequentialCacher(ds, 10)

for i in range(11):
    print(f'Step {i}')
    ds[i] # Load the element

Step 0
loaded 0
loaded 1
loaded 2
loaded 3
loaded 4
loaded 5
loaded 6
loaded 7
loaded 8
loaded 9
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9
Step 10
loaded 10
loaded 11
loaded 12
loaded 13
loaded 14
loaded 15
loaded 16
loaded 17
loaded 18
loaded 19
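
The log above suggests that items are loaded in batches of ten and that a new batch is fetched only when an index falls outside the current one. A minimal sketch of this behavior follows; `BatchCache` and `CountingSource` are hypothetical, and aligning batches to multiples of the batch size is an assumption.

```python
class CountingSource:
    """Hypothetical data source that records every load."""

    def __init__(self, length):
        self._length = length
        self.loaded = []

    def __getitem__(self, index):
        self.loaded.append(index)
        return index

    def __len__(self):
        return self._length


class BatchCache:
    """Hypothetical stand-in: keeps one batch of consecutive items in memory."""

    def __init__(self, dataset, batch_size):
        self._dataset = dataset
        self._batch_size = batch_size
        self._start = None
        self._batch = []

    def __getitem__(self, index):
        in_cache = (
            self._start is not None
            and self._start <= index < self._start + len(self._batch)
        )
        if not in_cache:
            # Load the whole batch that contains the requested index
            self._start = (index // self._batch_size) * self._batch_size
            stop = min(self._start + self._batch_size, len(self._dataset))
            self._batch = [self._dataset[i] for i in range(self._start, stop)]
        return self._batch[index - self._start]

    def __len__(self):
        return len(self._dataset)


source = CountingSource(100)
cached = BatchCache(source, 10)
results = [cached[i] for i in range(11)]
```

Reading indices 0 to 10 triggers exactly two batch loads here, which matches the pattern in the log.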


## VersionAssigner

In [15]:
from cascade import data as cdd

ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.VersionAssigner(ds, 'ds_version_log.yml')
print(ds.version)

0.0


In [16]:
ds = cdd.Wrapper([0, 1, 2, 3, 4])
ds = cdd.Modifier(ds)
ds = cdd.VersionAssigner(ds, 'ds_version_log.yml')
print(ds.version)

1.0


In [17]:
ds = cdd.Wrapper([0, 1, 2, 3, 4, 5, 6, 7])
ds = cdd.Modifier(ds)
ds = cdd.VersionAssigner(ds, 'ds_version_log.yml')
print(ds.version)

1.1


## split

In [18]:
from cascade import data as cdd


ds = cdd.Wrapper([0, 1, 2, 3, 4, 5, 6, 7])
train_ds, test_ds = cdd.split(ds, 0.8)

print(len(train_ds), len(test_ds), sep=', ')

6, 2
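
Note that 80% of 8 items is 6.4, so the split point has to be rounded somehow. The lengths above are consistent with truncating it to an integer, which the following plain-Python sketch assumes; the exact rounding rule used by `split` is not confirmed here.

```python
items = [0, 1, 2, 3, 4, 5, 6, 7]
frac = 0.8

# Truncate the fractional split point: int(6.4) == 6
split_index = int(len(items) * frac)
train, test = items[:split_index], items[split_index:]
```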


## OverSampler and UnderSampler

In [1]:
from cascade import utils as cdu
from cascade import data as cdd


ds = cdd.Wrapper([
    ('a', 0),
    ('b', 1),
    ('c', 1),
    ('d', 1),
])

ds = cdu.OverSampler(ds)
[item for item in ds]

100%|██████████| 4/4 [00:00<00:00, 4017.53it/s]

Original length was 4 and new is 6





[('a', 0), ('b', 1), ('c', 1), ('d', 1), ('a', 0), ('a', 0)]

In [2]:
ds = cdd.Wrapper([
    ('a', 0),
    ('b', 1),
    ('c', 1),
    ('d', 1),
])

ds = cdu.UnderSampler(ds)
[item for item in ds]

100%|██████████| 4/4 [00:00<00:00, 4018.49it/s]

Original length was 4 and new is 2





[('a', 0), ('b', 1)]
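
Both results above can be reproduced by grouping items by label and then either repeating items of the smaller classes or trimming the larger ones. The sketch below is deterministic and uses hypothetical helper names; the actual samplers may pick items differently, for example at random.

```python
from collections import defaultdict


def group_by_label(items):
    """Group (value, label) pairs by their label, keeping the original order."""
    groups = defaultdict(list)
    for item in items:
        groups[item[1]].append(item)
    return groups


def oversample(items):
    """Repeat items of smaller classes until every class matches the largest one."""
    groups = group_by_label(items)
    target = max(len(group) for group in groups.values())
    result = list(items)
    for group in groups.values():
        for k in range(target - len(group)):
            result.append(group[k % len(group)])
    return result


def undersample(items):
    """Keep only as many items per class as the smallest class has."""
    groups = group_by_label(items)
    target = min(len(group) for group in groups.values())
    result = []
    for group in groups.values():
        result.extend(group[:target])
    return result


data = [('a', 0), ('b', 1), ('c', 1), ('d', 1)]
over = oversample(data)
under = undersample(data)
```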

## Specific datasets

### TimeSeriesDataset

In [10]:
import datetime
from cascade import utils as cdu


ds = cdu.TimeSeriesDataset(time=[
    datetime.datetime(2022, 11, 5),
    datetime.datetime(2022, 11, 6),
    datetime.datetime(2022, 11, 7),
], data=[0, 1, 2])

ds.to_pandas()

Unnamed: 0,0
2022-11-05,0
2022-11-06,1
2022-11-07,2


In [12]:
ds[1:].to_pandas()

Unnamed: 0,0
2022-11-06,1
2022-11-07,2


In [14]:
ds[datetime.datetime(2022, 11, 6):].to_pandas()

Unnamed: 0,0
2022-11-06,1
2022-11-07,2


In [26]:
import numpy as np

ds = cdu.TimeSeriesDataset(time=[
    datetime.datetime(2022, 11, 5),
    datetime.datetime(2022, 11, 6),
    datetime.datetime(2022, 11, 7),
], data=[0, np.nan, 2])

ds.to_pandas()

Unnamed: 0,0
2022-11-05,0.0
2022-11-06,
2022-11-07,2.0


In [28]:
import pendulum


ds = cdu.TimeSeriesDataset(time=[
    pendulum.datetime(2022, 11, 5),
    pendulum.datetime(2022, 11, 6),
    pendulum.datetime(2022, 11, 7),
    pendulum.datetime(2022, 11, 8),
], data=[0, 1, 2, 3])

In [30]:
cdu.Average(ds, 'days', 2).to_pandas()

Unnamed: 0,0
2022-11-05 00:00:00+00:00,0.5
2022-11-07 00:00:00+00:00,2.5


In [35]:
cdu.Align(ds, [pendulum.datetime(2022, 11, 8)]).to_pandas()

Unnamed: 0,0
2022-11-08 00:00:00+00:00,3.0


In [37]:
ds.get_data()

(array([DateTime(2022, 11, 5, 0, 0, 0, tzinfo=Timezone('UTC')),
        DateTime(2022, 11, 6, 0, 0, 0, tzinfo=Timezone('UTC')),
        DateTime(2022, 11, 7, 0, 0, 0, tzinfo=Timezone('UTC')),
        DateTime(2022, 11, 8, 0, 0, 0, tzinfo=Timezone('UTC'))],
       dtype=object),
 array([0, 1, 2, 3]))

In [38]:
ds.to_numpy()

array([0, 1, 2, 3])

### TableDataset

In [42]:
import pandas as pd

df = cdu.TableDataset(t=pd.DataFrame(data=[[0, 0, 0], [1, 0, 0]]))
df

<cascade.utils.table_dataset.TableDataset
    0  1  2
0  0  0  0
1  0  0  0

In [43]:
df.get_meta()

[{'name': '<cascade.utils.table_dataset.TableDataset\n    0  1  2\n0  0  0  0\n1  0  0  0',
  'type': 'dataset',
  'columns': [0, 1, 2],
  'len': 2,
  'info': {0: {'count': 2.0,
    'mean': 0.0,
    'std': 0.0,
    'min': 0.0,
    '25%': 0.0,
    '50%': 0.0,
    '75%': 0.0,
    'max': 0.0},
   1: {'count': 2.0,
    'mean': 0.0,
    'std': 0.0,
    'min': 0.0,
    '25%': 0.0,
    '50%': 0.0,
    '75%': 0.0,
    'max': 0.0},
   2: {'count': 2.0,
    'mean': 0.0,
    'std': 0.0,
    'min': 0.0,
    '25%': 0.0,
    '50%': 0.0,
    '75%': 0.0,
    'max': 0.0}}}]

In [48]:
cdu.TableFilter(ds, ds._table[0] == 1)

Length before filtering: 4, length after: 1


<cascade.utils.table_dataset.TableFilter
                            0
0                           
2022-11-06 00:00:00+00:00  1