# Pyveda PR-55: An Overview

This is an overview of the changes introduced in pyveda PR #55, "Flat arrays pave the way."

## Background/motivation

The original motivation for this work is the general frustration that grew out of the way the pyveda api provides multiple data access modes as first-class objects, each a specifically designed object structure meant to address some particular ML-problem-based workflow or use case (or a part thereof). The source of this frustration is singular, but it manifests in two places: at the user level and at the developer level.

At the public level, all but the most trivial use cases require a user to interface with multiple data access objects, each parameterized by seemingly arbitrary, overlapping input parameters along with its own specific input arguments (often technical in nature), without a paved path in between. Besides the technical onus this places on any user working on even the shell of a real-world problem in the veda system/api, this design enforces a conceptual construct that ascribes a kind of fundamentality to those objects, implying that each has an inherent and unique purpose. Concretely, a user is required to think about their workflow in terms of an interface to Veda, a network-based io object, and a disk-based io object, all of which provide similar methods (eg, iteration interfaces) and potentially require similar data parameters, but with hardly any straightforward mapping between them.

Internal development is similarly affected by this emergent component-based conceptualization. In the virtuous pursuit of providing consistent, self-similar extension functions or classes, developers inevitably end up engineering multiple versions of components because there are multiple base access structures, and this frequently requires the manual bushwhacking that goes along with passing one structure to the initialization/parameterization of another, all under the table. Not only does this make progress inefficient and introduce lots of transitory parameterization code, it also introduces hard dependencies between the underlying objects themselves, often static ones, which make core development fault-prone and can constrain further development of any object that extends or incorporates them. All the while, we perpetuate throughout our codebase the conceptual framework of these apparently unique components as atomic, self-contained things.

Why is it like this? The basic reason is that early development was done in parallel: one effort sought to provide the basic methods required to interface with Veda, while another built the technically oriented, io-optimized components, which were meant to serve data from veda but were designed for agnostic data at the same time. Parameterization introduced nuances that made it difficult to provide instantiation protocols that were both general and flexible. Moreover, meaningful parameter scopes were changing often, with new ones introduced as necessary as we learned from our engineering and from the enormous problem scope, which still represents one of the biggest challenges facing our engineering team as the project matures and new use cases are discovered.


## Concepts/approach
The first step in mitigating the current framework is to build wide-open paths between the accessor objects that already exist, and provide the methods that are the building blocks of our api. One way to do that is to separate, programmatically and conceptually, the data parameters that characterize our datasets from the fundamental operations and apis that the accessors provide, while still providing protocols to incorporate the nuances those data parameters introduce on the objects themselves. In practice, that means making a very general object from which everything else can derive, while leaving room for specialization down the line.

`BaseDataSet` provides an interface for any object that exposes the usual `train`, `test`, `validate` groups, and prescribes the methods that must be specified for that. `BaseSampleArray` provides an analogous protocol and is a close proxy to `BaseDataSet`, inheriting prescribed methods on look-up from its parent dataset. Notice that `BaseDataSet` defines no data parameter properties in its code body; instead, these are defined as class properties and populated from instantiation arguments according to the `_vprops` specification. The `_vprops` specification is a dictionary mapping relevant property names to values, all of which implement the _descriptor_ protocol. Subclasses can inherit these properties, define their own, add new ones, or exclude some using the `register_vprops` wrapper. One consequence is that we can delegate all data parameter initialization in any child class down to `BaseDataSet`, so the initialization signature of each accessor object is specific to the object itself, which makes the objects more flexible to use and makes it clearer what they do. Take a look, for instance, at the initialization signature for `H5DataBase`:

    class H5DataBase(BaseDataSet):
        """
        An interface for consuming and reading local data intended to be used with
        machine learning training
        """
        _sample_class = H5SampleArray
        _variable_class = H5VariableArray

        def __init__(self, fname, title="SBWM", overwrite=False, mode="a", **kwargs):
        
Our object requires a filename, specifically, with some object-specific keyword arguments, and passes `**kwargs` down to `BaseDataSet`, which handles them appropriately. We don't even need to spec them to get a basic H5 object:

    from pyveda.vedaset import VedaBase
    import os

    fname = os.getcwd() + "/temp.h5"
    vb = VedaBase(fname, overwrite=True)

Of course, most interface methods that depend on data parameters will fail at this point, but we could still use this object to, eg, simply write out dataset ids. We can also assign data parameters dynamically, so that our `VedaBase` can be naively passed to another object and specified or customized as needed at a later time; or maybe we want to build a database from non-veda data. `VedaStream` can be instantiated just as naively.
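The pattern of declaring data parameters as class-level descriptors and filling them from `**kwargs` can be sketched in a few lines. This is a minimal illustration and not pyveda's implementation: `Prop`, `register_props`, `SketchDataSet`, and `SketchH5` are hypothetical names, and the real `_vprops` / `register_vprops` machinery may behave differently.

```python
class Prop:
    """Minimal data descriptor that stores its value in the instance __dict__."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        try:
            return instance.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name) from None

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


def register_props(*names):
    """Hypothetical class decorator: attach one Prop per name and record the names."""
    def wrap(cls):
        inherited = getattr(cls, "_props", ())
        cls._props = tuple(inherited) + tuple(n for n in names if n not in inherited)
        for n in names:
            setattr(cls, n, Prop(n))
        return cls
    return wrap


@register_props("count", "partition", "classes")
class SketchDataSet:
    def __init__(self, **kwargs):
        # Only declared data parameters are accepted; anything else is an error.
        for key, value in kwargs.items():
            if key not in self._props:
                raise TypeError("unexpected data parameter: {}".format(key))
            setattr(self, key, value)


class SketchH5(SketchDataSet):
    def __init__(self, fname, overwrite=False, **kwargs):
        # Object-specific arguments stay here ...
        self.fname = fname
        self.overwrite = overwrite
        # ... and all data parameters are delegated to the base class.
        super().__init__(**kwargs)


bare = SketchH5("temp.h5")        # no data parameters yet
bare.partition = [70, 20, 10]     # assigned later, dynamically
full = SketchH5("temp.h5", count=100, classes=["boat"])
```

The point of the design is that `SketchH5.__init__` only names what is specific to the file-backed object, while anything describing the data flows through to the base class untouched.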


### Data descriptors and local actions (flat arrays: paving ways)

Some of the properties we commonly pass around to describe datasets shouldn't need to be static. For instance, we may want to add classes to our dataset when concatenating outside data, or remove classes based on some representative metric. But what if that property is used by the data accessor in one or more operation-critical ways? The accessor object needs to be able to take the appropriate actions whenever the property changes.

Suppose we have some spatio-temporal vedabase dataset where N is small, but new data becomes available each day. In general, model accuracy improves quickly with additional samples in the low-N regime. So after getting crappy results with a [70, 20, 10] data partition, we decide to load up the training set and test our results on new data as it comes in. What happens when we write `vb.partition = [90, 0, 10]`?

Currently, nothing. During the write process, data is physically partitioned into three different arrays according to the `.partition` distribution set at that time. That might be some default value like [70, 20, 10] that we didn't even specify. There is no support for arbitrary iteration over virtually grouped arrays of arbitrary length; that would be ridiculous. So literally nothing changes: the size of your training data is the same as it was before, and there's not even a warning (which is messed up). Dynamic group partitioning is a simple but critical kind of feature that is central to the value-prop domain of pyveda. To support it, `H5DataBase` structures have been re-engineered to write data to single, flat arrays, and some new array interfaces have been introduced to handle the delicacies of calculating and maintaining virtual indexes instead. This turned out to have immediate potential engineering gains across the codebase, for instance:
* partition-based h5 batch-writes off of the client stream became way simpler
* as a consequence, this opened the door to consolidating the VedaStream and VedaBase io clients into a single boss client
* the Stream and Base array wrapper classes are effectively identical, which consolidates the iteration interfaces and opens the door for accessor-agnostic extension classes and plugins for any implementer of a virtual-index-based access pattern
* a less complicated file structure; H5DataBase core functionality is effectively taken care of by BaseDataSet, bridging the functionality gap between stream and base

Those are exciting prospects, especially in the context of the approach outlined above. However, the `partition` data descriptor needs some way to schedule a call back to the object so that the relevant actions are taken, according to the object it lives on, which is a nontrivial design challenge.
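To make the flat-array idea concrete, here is a minimal sketch of a single array with virtual `train`/`test`/`validate` ranges that are recomputed whenever `partition` is assigned. It is an illustration under stated assumptions, not the `H5DataBase` implementation: `FlatGroups` is a hypothetical class, a Python list stands in for the on-disk array, ranges are half-open, and group sizes are rounded down.

```python
class FlatGroups:
    """One flat sample array plus virtual train/test/validate ranges."""

    GROUPS = ("train", "test", "validate")

    def __init__(self, samples, partition=(70, 20, 10)):
        self._samples = list(samples)   # stand-in for a single on-disk array
        self.partition = partition      # the setter computes the ranges

    @property
    def partition(self):
        return self._partition

    @partition.setter
    def partition(self, value):
        if sum(value) != 100:
            raise ValueError("Probability distribution must sum to 100")
        self._partition = list(value)
        self._bounds = self._compute_bounds()

    def _compute_bounds(self):
        # Half-open [start, stop) ranges, as in Python slices (an assumption).
        # Sizes are floored, so a few trailing samples can be left unallocated.
        n = len(self._samples)
        bounds, start = {}, 0
        for name, pct in zip(self.GROUPS, self._partition):
            stop = start + n * pct // 100
            bounds[name] = (start, stop)
            start = stop
        return bounds

    def group(self, name):
        start, stop = self._bounds[name]
        return self._samples[start:stop]


data = FlatGroups(range(10), partition=[70, 20, 10])
sizes_before = [len(data.group(g)) for g in FlatGroups.GROUPS]

data.partition = [90, 0, 10]   # no data is rewritten, only the ranges change
sizes_after = [len(data.group(g)) for g in FlatGroups.GROUPS]
print(sizes_before, sizes_after)
```

Because the samples are never moved, repartitioning costs only a recomputation of three index pairs. Here that recomputation is hard-wired into a property setter; a general descriptor has no such knowledge of the object it lives on.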


### Descriptor callbacks
`BaseDataSet` has a private attribute, `._prc`. This special attribute can be used to _register custom callbacks_ on properties, depending on how a subclass down the line might need to respond when a property is assigned a certain value or changes in a certain way. This works through a special descriptor mixin that checks whether `._prc` exists on the object it is called on and, if it does, attempts to make any callbacks registered under the descriptor's name. The base descriptor object is pretty straightforward and implements the classic descriptor pattern:

    class BaseDescriptor(object):
        __vname__ = NotImplemented

        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
            if not getattr(self, "name", None):
                self.name = type(self).__vname__

        def __get__(self, instance, klass):
            if instance is None:
                return self
            try:
                return instance.__dict__[self.name]
            except KeyError as ke:
                raise AttributeError("'{}' object has no attribute '{}'"
                                     .format(type(instance).__name__, self.name)) from None

        def __set__(self, instance, value):
            instance.__dict__[self.name] = value
            
         
As a python property, this looks something like:

    @property
    def namedattr(self):
        return self._namedattr

    @namedattr.setter
    def namedattr(self, value):
        self._namedattr = value


Since our descriptors are meant to be object-agnostic for general applicability, we can define a mixin that, instead of holding some callback state locally, looks to the object it's describing and checks for the special `._prc` attribute, which holds a catalog of callbacks indexed by property name:

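    import inspect

    class PropCallbackExecutor(BaseDescriptor):
        registry_target = "_prc"

        def __set__(self, instance, value):
            # Store the value first, then notify any registered callbacks.
            super().__set__(instance, value)
            registry = getattr(instance, self.registry_target, None)
            if registry:
                for cb in registry[self.name]:
                    if inspect.ismethod(cb):
                        # Bound methods already carry their instance.
                        cb(self.name, value)
                    else:
                        # Plain functions receive the instance explicitly.
                        cb(self.name, value, instance)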
                    
If the object has a property callback register accessible via `._prc`, this descriptor mixin first sets the value on the object and then executes any callbacks the object has registered under the descriptor's name. Each callback must support the call signature that matches its type: bound methods are called with the property name and value, while plain functions additionally receive the instance. Delegating the callback state to the object gives the object dynamic control over its callbacks, and exposes that control programmatically at the interface level.
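The mechanism can be exercised end to end with a small self-contained sketch. The names `CallbackProp`, `Stream`, and `audit` are hypothetical, and the registry here is a plain `defaultdict` of lists standing in for whatever catalog object `._prc` actually holds in pyveda.

```python
import inspect
from collections import defaultdict


class CallbackProp:
    """Descriptor that stores a value, then runs callbacks found on the instance."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value
        registry = getattr(instance, "_prc", None)
        if not registry:
            return
        for cb in registry.get(self.name, []):
            if inspect.ismethod(cb):
                cb(self.name, value)            # bound method: instance is implicit
            else:
                cb(self.name, value, instance)  # plain function: pass it explicitly


class Stream:
    count = CallbackProp("count")

    def __init__(self, count):
        self.log = []
        # A dict of lists keyed by property name (an assumption for this sketch).
        self._prc = defaultdict(list)
        self._prc["count"].append(self._on_count)
        self.count = count

    def _on_count(self, name, value):
        self.log.append((name, value))


def audit(name, value, instance):
    instance.log.append(("audit", name, value))


s = Stream(count=100)
s._prc["count"].append(audit)   # callbacks can be added after construction
s.count = 200
print(s.log)
```

The descriptor itself holds no callback state, so the same `CallbackProp` class can sit on unrelated objects that each react differently to the same assignment.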


This turns out to be a useful construct; different objects may require various property callbacks. For VedaStream and VedaBase, updating either `count` or `partition` should change group allocation, and in the context of VedaBase, update the virtual index cache `._vidx`. Let's see how this works with example objects.

    import pyveda as pv
    pv.config.set_dev()

    from pyveda.vedaset import VedaStream, VedaBase
    vc = pv.from_id("94493508-dd19-4d30-b207-2466ecfc0d2f")

    source = vc.gen_sample_ids(count=100)
    vs = VedaStream(source, write_h5=False, write_index=False,
                    mltype=vc.mltype, classes=vc.classes, image_shape=vc.imshape, image_dtype=vc.dtype,
                    count=100, partition=[70,20,10])

    print(vs.train.allocated, vs.test.allocated, vs.validate.allocated)

Output:

    70 20 10


The group counts are allocated as expected. Changing the partition or the count should change these numbers, provided the new value is valid:

    vs.partition = [90, 10, 10]

Output:

    ValueError: Probability distribution must sum to 100

The assignment is rejected because the values sum to 110. Checks like this are part of the descriptor: we can customize our data descriptors by setting expected types, checksums, size checks, etc. Check out the `pyveda.vedaset.props` module for all available type descriptions and how they can be subclassed and mixed in.

    vs.partition = [90, 5, 5]
    print(vs.train.allocated, vs.test.allocated, vs.validate.allocated)

Output:

    90 5 5
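A check like the one that rejected `[90, 10, 10]` can live directly in a descriptor's `__set__`. The sketch below is an assumption about how such a check might be written; `PartitionProp` and `Demo` are hypothetical names and are not classes from `pyveda.vedaset.props`.

```python
class PartitionProp:
    """Data descriptor that accepts three non-negative integers summing to 100."""

    def __set_name__(self, owner, name):
        # Python calls this at class creation with the attribute name.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        value = list(value)
        if len(value) != 3 or any(not isinstance(v, int) or v < 0 for v in value):
            raise ValueError("Partition must be three non-negative integers")
        if sum(value) != 100:
            raise ValueError("Probability distribution must sum to 100")
        # Only a validated value ever reaches the instance.
        instance.__dict__[self.name] = value


class Demo:
    partition = PartitionProp()


d = Demo()
d.partition = [90, 5, 5]
try:
    d.partition = [90, 10, 10]
except ValueError as err:
    rejected = str(err)
```

Since validation runs before the value is stored, a rejected assignment leaves the previous partition in place.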


VedaBase requires actual index values, which must update on `count` or `partition` assignments:

    vb = VedaBase(fname, overwrite=True, **vs._unpack())

    vb.partition = [80, 0, 20]
    vb.count = 200
    print(vb.train.allocated, vb.test.allocated, vb.validate.allocated)
    print([vb.train.images._start, vb.train.images._stop],
          [vb.test.images._start, vb.test.images._stop],
          [vb.validate.images._start, vb.validate.images._stop])

Output:

    160 0 40
    [0, 159] [160, 159] [160, 199]


Registering descriptor callbacks is accomplished via the CatalogRegister api:

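    # Plain functions are also passed the instance by PropCallbackExecutor,
    # so the callback accepts it as an optional third argument.
    def classes_callback(name, value, instance=None):
        print("Hello from classes descriptor!")
        print(value)

    vb._prc.classes.register(classes_callback)
    classes = vb.classes
    classes.append("Horse")

    vb.classes = classes

Output:

    Hello from classes descriptor!
    ['boat', 'Horse']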


In general, property callbacks should be registered before passing initialization `**kwargs` to `BaseDataSet` so that they are ready for action when the properties are set. 
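The ordering matters because the base initializer assigns the properties, and a callback that is not yet registered simply misses that first assignment. Here is a minimal sketch of the difference, using hypothetical `Notifying`, `Base`, `Early`, and `Late` classes and a plain dict-of-lists registry in place of the real `CatalogRegister`:

```python
from collections import defaultdict


class Notifying:
    """Descriptor that calls any callbacks the instance registered under its name."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value
        # A missing registry is treated as "no callbacks".
        for cb in getattr(instance, "_prc", {}).get(self.name, []):
            cb(self.name, value)


class Base:
    count = Notifying()

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


class Early(Base):
    def __init__(self, **kwargs):
        self.seen = []
        self._prc = defaultdict(list)
        self._prc["count"].append(self._on_count)   # register first
        super().__init__(**kwargs)                   # then set properties

    def _on_count(self, name, value):
        self.seen.append(value)


class Late(Early):
    def __init__(self, **kwargs):
        self.seen = []
        Base.__init__(self, **kwargs)                # properties are set first
        self._prc = defaultdict(list)
        self._prc["count"].append(self._on_count)    # too late for the initial value


early = Early(count=100)
late = Late(count=100)
print(early.seen, late.seen)
```

`Late` still reacts to later assignments; it only misses the values set during initialization, which for something like a virtual index cache would leave the object in an inconsistent starting state.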