# Kosh Transformers

This notebook introduces Kosh's *transformers*. *Transformers* allow data to be further post-processed after extraction from its original URI.

Transformers cover everything from simple transformations, such as sub-sampling, to more complex operations, such as data augmentation or detecting where data is valid.

Transformers can be chained, and the result of each step can be cached. The default cache directory is stored in `kosh.core.kosh_cache_dir` and points to: `os.path.join(os.environ["HOME"], ".cache", "kosh")`.
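To make the idea of chained, individually cached steps concrete, here is a minimal pure-Python sketch. It is not Kosh's implementation: the `run_chain` helper, the pickle-file layout, and the hash-based cache key are all assumptions made for illustration.

```python
import hashlib
import os
import pickle
import tempfile


def run_chain(data, steps, cache_dir):
    """Apply each (name, func, use_cache) step in order.

    Hypothetical helper: results of cached steps are pickled in cache_dir,
    keyed by the step name and a hash of the step's input.
    """
    for name, func, use_cache in steps:
        key = hashlib.sha256(name.encode() + pickle.dumps(data)).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if use_cache and os.path.exists(path):
            with open(path, "rb") as f:
                data = pickle.load(f)  # cache hit: skip the computation
            continue
        data = func(data)
        if use_cache:
            with open(path, "wb") as f:
                pickle.dump(data, f)  # store the result for next time
    return data


calls = []  # records which steps actually ran


def add_one(values):
    calls.append("add_one")
    return [v + 1 for v in values]


def keep_even(values):
    calls.append("keep_even")
    return [v for v in values if v % 2 == 0]


cache_dir = tempfile.mkdtemp()
steps = [("add_one", add_one, True), ("keep_even", keep_even, False)]
first = run_chain([1, 2, 3], steps, cache_dir)
second = run_chain([1, 2, 3], steps, cache_dir)
print(first, second, calls)
```

On the second run only the uncached `keep_even` step executes again; the cached `add_one` result is read back from disk.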

## Basic Example: Converting from list to numpy

This first example shows how to use transformers to convert between formats. We create a simple loader that returns a list of numbers as floats. In practice, this could be a loader for a very complex format.

Here two things could happen:

1. The data is not a great format for us.
2. The loader is slow (but uses proprietary libraries we cannot re-implement)

A transformer can help with both of these.

1. The transformer will convert data to a desired format (numpy arrays here)
2. The result will be cached so that we can quickly reload the data many times in the script.



In [1]:
# import necessary modules
import kosh
import numpy
import time
import os

In [2]:
# Create a file to load in.
with open("kosh_transformers_chaining_example.ascii", "w") as f:
    f.write("1 2. 3 4 5 6 7 8 9")

Now we need to create our proprietary loader

In [3]:
# A very basic loader

# this loader can read the *ascii* mime_type and return *numlist* as one of its output types
class StringsLoader(kosh.loaders.KoshLoader):
    types = {"ascii": ["numlist", "a_format", "another_format"]}  # mime_types and corresponding output formats
    def extract(self):
        """The extract function
        return a list of floats"""
        time.sleep(2) # fake slow operation
        with open(self.obj.uri) as f:
            return [float(x) for x in f.read().split()]
    def list_features(self):
        # The only feature is "numbers"
        return ["numbers",]

Now let's create a transformer to convert this list of floats to a numpy array on the fly. (We understand this is a one-liner in Python; it stands in for a more expensive conversion.)

All we need to do is inherit the basic kosh transformer and implement the `transform` call.

`transform` takes the `inputs` and a `format` as arguments.

Our transformer accepts a *numlist* as input and can return it in the *numpy* format.

In [4]:
import sys
import time
print(sys.prefix)
print(kosh.__version__)
class Ints2Np(kosh.transformers.KoshTransformer):
    types =  {"numlist": ["numpy"]}  # Known inputs type and matching possible output formats
    def transform(self, inputs, format):
        time.sleep(2)  # Artificial slowdown
        return numpy.array(inputs, dtype=numpy.float32)

/g/g19/cdoutrix/miniconda3/envs/kosh
1.2.111.g3aa6ca


Now let's create a store and a dataset, and associate the data with the dataset.

In [5]:
store = kosh.connect("transformers_example.sql", delete_all_contents=True)
dataset = store.create(name="test_transformer")
dataset.associate("kosh_transformers_chaining_example.ascii", mime_type="ascii")
# let's add our loader to the store
store.add_loader(StringsLoader)
# and print the features associated with this dataset
dataset.list_features()

['numbers']

A simple feature retrieval (or a call to `get`) will return our list

In [6]:
feature = dataset["numbers"]
feature[:]

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

But we want a *numpy* array and our loader cannot do that!

In [7]:
try:
    feature(format="numpy")
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("Failed as expected")

Failed as expected
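The failure follows from the declared `types`: the loader only advertises *numlist*, *a_format* and *another_format* for the *ascii* mime type, while *numpy* only appears as an output of `Ints2Np`. The sketch below mimics that matching with plain dictionaries. `reachable_formats` is a hypothetical helper, not part of Kosh, and Kosh's own resolution logic may differ.

```python
# Type declarations copied from StringsLoader and Ints2Np above
loader_types = {"ascii": ["numlist", "a_format", "another_format"]}
ints2np_types = {"numlist": ["numpy"]}


def reachable_formats(mime_type, loader_types, transformer_types):
    """Return the formats available after the loader and each transformer, in order."""
    formats = set(loader_types.get(mime_type, []))
    for types in transformer_types:
        next_formats = set()
        for fmt in formats:
            # keep only what this transformer can produce from the current formats
            next_formats.update(types.get(fmt, []))
        formats = next_formats
    return formats


loader_only = reachable_formats("ascii", loader_types, [])
with_transformer = reachable_formats("ascii", loader_types, [ints2np_types])
print("numpy" in loader_only, "numpy" in with_transformer)
```

With the loader alone, *numpy* is not reachable; adding the transformer to the chain makes it so.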


We need to use our transformer.

Let's inform Kosh about it

In [8]:
feature = dataset.get_execution_graph("numbers", transformers=[Ints2Np(),])
data = feature(format='numpy')
print(data)

[1. 2. 3. 4. 5. 6. 7. 8. 9.]


It works, but it is still just as slow if we call it again:

In [9]:
%time feature(format="numpy")

CPU times: user 2.87 ms, sys: 1.87 ms, total: 4.73 ms
Wall time: 4.01 s


array([1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=float32)

We need to cache the result.

In [10]:
transform_to_npy = Ints2Np(cache=True, cache_dir=os.getcwd())
feature = dataset.get_execution_graph("numbers", transformers=[transform_to_npy,])
print("First time (caching)")
%time dataset.get("numbers", format="numpy", transformers=[transform_to_npy,])
print("Second time (cached)")
%time dataset.get("numbers", format="numpy", transformers=[transform_to_npy,])

First time (caching)
CPU times: user 5.32 ms, sys: 2.3 ms, total: 7.62 ms
Wall time: 4.03 s
Second time (cached)
CPU times: user 6.49 ms, sys: 1.42 ms, total: 7.91 ms
Wall time: 2.02 s


array([1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=float32)

## Chaining Transformers

While this was neat, now that our data is in a format we like, we might want to process it further with other transformers. Fortunately, transformers can be chained, and caching can be controlled for each step.

Let's create an `Even` transformer that keeps only the even numbers, and a `SlowDowner` transformer that fakes a *slow* operation: it does nothing except pause for a specified amount of time.



In [11]:
class Even(kosh.transformers.KoshTransformer):
    types = {"numpy": ["numpy"]}
    def transform(self, input, format):
        return numpy.take(input, numpy.argwhere(numpy.mod(input, 2)==0))[:,0]
    
class SlowDowner(kosh.transformers.KoshTransformer):
    types = {"numpy": ["numpy"]}
    def __init__(self, sleep_time=3, cache_dir="kosh_cache", cache=False):
        super(SlowDowner, self).__init__(cache_dir=cache_dir, cache=cache)
        self.sleep_time = sleep_time
    def transform(self, input, format):
        # Fakes a slow operation
        time.sleep(self.sleep_time)
        return input

Let's chain these together

In [12]:
%time dataset.get("numbers", format="numpy", transformers=[transform_to_npy, SlowDowner(3), Even(), SlowDowner(4)])

CPU times: user 4.85 ms, sys: 3.83 ms, total: 8.68 ms
Wall time: 9.03 s


array([2., 4., 6., 8.], dtype=float32)

Let's cache the last step

In [13]:
%time dataset.get("numbers", format="numpy", transformers=[transform_to_npy, SlowDowner(3), Even(), SlowDowner(4, cache_dir="kosh_cache", cache=True)])

CPU times: user 9.06 ms, sys: 2.94 ms, total: 12 ms
Wall time: 9.04 s


array([2., 4., 6., 8.], dtype=float32)

Let's run it again. We should shave off the last 4 seconds. Let's also cache the 3-second step so it is available next time.

In [14]:
%time dataset.get("numbers", format="numpy", transformers=[transform_to_npy, SlowDowner(3, cache_dir="kosh_cache", cache=True), Even(), SlowDowner(4, cache_dir="kosh_cache", cache=True)])

CPU times: user 8.39 ms, sys: 4.39 ms, total: 12.8 ms
Wall time: 5.03 s


array([2., 4., 6., 8.], dtype=float32)

Let's run it again, now with every slow transformer cached.

In [15]:
%time dataset.get("numbers", format="numpy", transformers=[transform_to_npy, SlowDowner(3, cache_dir="kosh_cache", cache=True), Even(), SlowDowner(4, cache_dir="kosh_cache", cache=True)])

CPU times: user 7.6 ms, sys: 4.51 ms, total: 12.1 ms
Wall time: 2.03 s


array([2., 4., 6., 8.], dtype=float32)
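The wall times above can be checked with a back-of-the-envelope model: add up the sleep time of every step that is not served from a cache. This is only a sketch of the arithmetic. It assumes the run time is dominated by the `time.sleep` calls, that a cache hit costs nothing, and that the loader's own 2-second sleep is paid on every call (consistent with the 2 s that remain once everything else is cached). The `expected_wall_time` helper is hypothetical.

```python
def expected_wall_time(steps):
    """Sum the sleep time of every (label, seconds, cache_hit) step that misses the cache."""
    return sum(seconds for _, seconds, cache_hit in steps if not cache_hit)


LOADER = ("StringsLoader.extract", 2, False)  # the loader itself is not cached
TO_NPY = ("Ints2Np", 2, True)                 # cached since In [10]
EVEN = ("Even", 0, False)                     # no sleep, so it adds nothing

runs = {
    # nothing cached in the SlowDowner steps
    "In [12]": [LOADER, TO_NPY, ("SlowDowner(3)", 3, False), EVEN, ("SlowDowner(4)", 4, False)],
    # caching requested for the last step, but this first call still has to compute it
    "In [13]": [LOADER, TO_NPY, ("SlowDowner(3)", 3, False), EVEN, ("SlowDowner(4)", 4, False)],
    # last step now read from cache; the 3-second step is computed and stored
    "In [14]": [LOADER, TO_NPY, ("SlowDowner(3)", 3, False), EVEN, ("SlowDowner(4)", 4, True)],
    # both slow steps read from cache
    "In [15]": [LOADER, TO_NPY, ("SlowDowner(3)", 3, True), EVEN, ("SlowDowner(4)", 4, True)],
}

estimates = {cell: expected_wall_time(steps) for cell, steps in runs.items()}
print(estimates)
```

The estimates of 9, 9, 5 and 2 seconds line up with the measured wall times of 9.03 s, 9.04 s, 5.03 s and 2.03 s.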

## Some examples of transformers included in Kosh

Kosh comes with a few transformers.

### Numpy-related transformers

* `KoshSimpleNpCache(cache_dir=kosh_cache_dir, cache=True)` does nothing but cache the passed arrays using `numpy.savez` rather than the default (pickled objects).
* `Take(cache_dir=kosh_cache_dir, cache=True, indices=[], axis=0, verbose=False)` runs `numpy.take`. It will use MPI to split the indices over the available ranks and gather the result on rank 0.
* `Delta(cache_dir=kosh_cache_dir, cache=True, axis=0, pad=None, pad_value=0, verbose=False)` computes the difference over an axis between consecutive strides, possibly padding at the start or end.
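For orientation, the numpy operations behind two of these transformers can be tried directly. The sketch below uses only numpy, not the Kosh classes. Treating `Delta` as similar to `numpy.diff`, and padding as prepending a row of `pad_value`, are assumptions about its behavior rather than a description of its implementation.

```python
import numpy

data = numpy.array([[1, 4, 9], [16, 25, 36], [49, 64, 81]])

# The operation Take wraps: pick rows 0 and 2 along axis 0
taken = numpy.take(data, [0, 2], axis=0)

# Difference between consecutive entries along axis 0
delta = numpy.diff(data, axis=0)

# One possible reading of "padding at the start" (assumption): prepend a row of pad_value
pad_value = 0
pad_row = numpy.full((1, data.shape[1]), pad_value)
padded = numpy.concatenate([pad_row, delta], axis=0)

print(taken)
print(padded)
```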

### Scikit-learn related transformers

See the [next notebook](Example_05b_Transformers-SKL.ipynb).


