# Accessing, Working with Veda Datasets via VedaCollection + VedaBase

10/30/2018
RJP

Pyveda is a python library for interacting with machine-learning datasets on the veda cloud. This notebook is meant to serve as a brief tutorial to the various modules and their apis in pyveda, and how they interact. This notebook should be updated as the [development](https://github.com/DigitalGlobe/pyveda/tree/development) branch of pyveda is updated with new functionality.

Currently, there are two main veda-data interfaces in pyveda: `VedaCollection` and `VedaBase`. `VedaCollection` is a client-side representation of a Veda Dataset, with an api that supports all current client-server communication (REST), so that editing, deleting, and adding datapoints as well as Datasets are handled on the class instance. `VedaBase` is a high-level wrapper around HDF5 (via pytables) that supports fast read/write of contiguous label, image pairs, custom ML-optimized iteration patterns, and a custom ML-based filetree structure with fine-grained access to each data node and leaf. Also in development are a query-based api around datapoint metadata and builtin input hooks for plugging into whatever DL framework you like. Some optimization/compression io work would probably be good too.

The best way to acquaint yourself with these apis and how they work, besides this notebook, is to look at the code. Here's where [VedaCollection lives](https://github.com/DigitalGlobe/pyveda/blob/development/pyveda/training.py), and here's where [VedaBase lives](https://github.com/DigitalGlobe/pyveda/blob/development/pyveda/db/vedabase.py).

Since the codebase is in development, new dependencies may well be added at any time, so it might be the case that you need to update your env/install whatever is missing if you're working locally.

We're going to point to veda-dev for this, so you will need gbdx access.

### pyveda.training.VedaCollection

In [1]:
import os 
os.environ["SANDMAN_API"] = "https://veda-api-development.geobigdata.io"

By the way, if you're trying to point to a veda uri other than the official one ("https://veda-api.geobigdata.io"), and your `SANDMAN_API` env var isn't already set to your target (development, in this case), you need to set that variable before importing anything from pyveda, since HOST is set globally there on import.
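
Here's a tiny sketch of why the ordering matters. It does not touch pyveda at all: `read_host_at_import` is a hypothetical stand-in for a module-level `HOST = ...` assignment, and the assumption that the fallback is the official uri is mine, not something checked against the pyveda source.

```python
import os

# Assumption: the fallback is the official uri mentioned above.
DEFAULT_HOST = "https://veda-api.geobigdata.io"
DEV_HOST = "https://veda-api-development.geobigdata.io"


def read_host_at_import():
    """Hypothetical stand-in for a module-level HOST assignment run once on import."""
    return os.environ.get("SANDMAN_API", DEFAULT_HOST)


# Case 1: the "import" runs before the variable is set, so the default wins.
os.environ.pop("SANDMAN_API", None)
host_before = read_host_at_import()

# Case 2: the variable is set first, so the "import" picks up the dev uri.
os.environ["SANDMAN_API"] = DEV_HOST
host_after = read_host_at_import()

print(host_before)
print(host_after)
```

Since a module-level value is computed once, setting the variable after the import has no effect on it.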

Below, we'll import VedaCollection and instantiate an instance via the `from_id` classmethod. The argument is a Veda Dataset Id. Each hosted dataset is associated with a particular guid, and the two I've listed below are currently (at time of writing) available on dev, and (with some minor exceptions) they're up-to-date structure-wise, so we're gonna just stick with those. 

In [3]:
from pyveda import VedaCollection

seg = "411656d7-ec2d-4726-a0f2-35d2bd48e165"
objd = "e18c91fc-d4b7-4cca-97dc-db7d2fb27d6c"
vc = VedaCollection.from_id(objd)

There are other patterns for instantiating a VedaCollection, such as `__init__`, and who knows how many others? VedaCollection can do a lot of things, but it effectively represents a veda dataset, while providing patterns for generating and posting new, user-created datasets into veda. Here's a quick look at what VedaCollection can tell us:

In [18]:
print("vc.meta:\n{}\n".format(vc.meta))
print("vc.count: {}\n".format(vc.count))
print("vc.bounds: {}\n".format(vc.bounds))
print("vc.classes:\n{}\n".format(vc.classes))
print("vc.imshape: {}\n".format(vc.imshape))
print("vc.mlType: {}".format(vc.mlType))

vc.meta:
{'name': 'XView', 'mlType': 'object_detection', 'public': 'true', 'partition': [100, 0, 0], 'image_refs': [], 'classes': ['shipping container lot', 'bulldozer', 'truck tractor no trailer', 'facility', 'personal aircraft', 'shed', 'aircraft hangar', 'shipping container', 'passenger vehicle', 'storage tank', 'construction site', 'truck tractor with flatbed trailer', 'helicopter', 'bus', 'fixed wing aircraft', 'cargo truck', 'passenger aircraft', 'truck tractor with liquid tank', 'container crane', 'damaged building', 'building', 'car', 'tractor trailer', 'utility truck', 'vehicle lot', 'trailer no truck', 'truck', 'dump truck', 'utility vehicle', 'helipad'], 'bounds': [-1.526822244582169, 12.335128140830934, -1.499049247586651, 12.37132678462982], 'user_id': 'auth0|5ac52c560043574440853f22'}

vc.count: 1399

vc.bounds: [-1.526822244582169, 12.335128140830934, -1.499049247586651, 12.37132678462982]

vc.classes:
['shipping container lot', 'bulldozer', 'truck tractor no trailer', '

Pretty useful dataset attributes. They're crucial for knowing how to handle the real, hard data. There's more to be said here about the VedaCollection api specifically, but that's for a later update.
TODO: say more here about the VedaCollection api.

### VedaCollection -> VedaBase

Besides the method itself, it's definitely worth looking at some of the particulars of how a VedaBase is (currently) returned from `VedaCollection().store`. At the moment, `.store` is the only VedaCollection method which builds and returns a VedaBase out of the data in the cloud. [Here](https://github.com/DigitalGlobe/pyveda/blob/development/pyveda/training.py#L490) it is on github, and I've also put it right below:


```python
    def store(self, size, fname, partition=[70, 20, 10], **kwargs):
        """ Build an hdf5 database from this collection and return it as a DataSet instance """
        namepath, ext = os.path.splitext(fname)
        if ext != ".h5":
            fname = namepath + ".h5"

        pgen = self.ids(size)
        vb = VedaBase(fname, self.mtype, self.meta['classes'], self.imshape, image_dtype=self.dtype, **kwargs)

        build_vedabase(vb, pgen, partition, size, gbdx.gbdx_connection.access_token,
                       label_threads=1, image_threads=10)
        vb.flush()
        return vb
```

What does this do? It takes some arguments, instantiates a VedaBase with those args and a variety of `self` attributes, and calls a function `build_vedabase` that gets passed the VedaBase `vb` as well as some other things. Importantly, it returns the vedabase, which should have snacks inside of it now.
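
One small detail worth seeing in isolation is what the first three lines of `store` do to `fname`. The sketch below just lifts that logic into a standalone function; the name `ensure_h5` is made up for illustration and is not part of pyveda.

```python
import os


def ensure_h5(fname):
    """Mirror the extension handling at the top of store(): force a .h5 suffix."""
    namepath, ext = os.path.splitext(fname)
    if ext != ".h5":
        fname = namepath + ".h5"
    return fname


print(ensure_h5("/tmp/coolparty"))       # no extension: .h5 is appended
print(ensure_h5("/tmp/coolparty.hdf5"))  # other extension: replaced with .h5
print(ensure_h5("/tmp/coolparty.h5"))    # already .h5: unchanged
print(ensure_h5("/tmp/coolparty.v2"))    # careful: ".v2" counts as an extension
```

The last case is the one to watch: anything after the final dot is treated as the extension, so a dotted filename loses its last segment.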

The important things to note about this codeblock are the input arguments. I'll go over the main ones here and then include a brief overview of what the `build_vedabase` function does.


* `fname`: a "local" filepath which will become your data-filled hdf5 file. I recommend passing in a full system path for the time being.
* `size`: the TOTAL number of datapoints you'd like to dump into your vedabase from the cloud. None means all of them (vc.count). 
* `partition`: the percentages of `size` that should be allocated to training, testing, and validation, in that order. The default should probably change to something like 80, 20, 0.
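
To make the `size`/`partition` arithmetic concrete, here's a minimal sketch of how percentages could be turned into per-group counts. `partition_counts` is a hypothetical helper, not what `build_vedabase` actually does; in particular, handing the rounding remainder to the validation group is just one possible choice.

```python
def partition_counts(size, partition=(70, 20, 10)):
    """Split `size` datapoints into (train, test, validate) counts by percentage."""
    if len(partition) != 3 or sum(partition) != 100:
        raise ValueError("partition must be three percentages summing to 100")
    train_pct, test_pct, _ = partition
    n_train = size * train_pct // 100
    n_test = size * test_pct // 100
    # Whatever is left after integer division goes to validation,
    # so the three counts always add up to `size`.
    n_validate = size - n_train - n_test
    return n_train, n_test, n_validate


print(partition_counts(500))               # the default partition
print(partition_counts(500, (80, 20, 0)))  # the suggested alternative default
print(partition_counts(1399))              # a size that doesn't divide evenly
```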

This is all you have to provide to `store` in order to get a VedaBase object back. But you should know more. `store` takes `pgen`, which is a generator of (label, image) urls (`self.ids` is a *generator function*), and passes it to `build_vedabase`, which feeds the generator `pgen` and the vedabase `vb` to a [VedaBaseFetcher](https://github.com/DigitalGlobe/pyveda/blob/development/pyveda/fetch/aiohttp/client.py#L253). Briefly, the fetcher consumes the generator while fetching the results, running threadpool-based callbacks and writing to the vedabase. This class, and the others in `pyveda.fetch.aiohttp.client`, are highly parameterizable, and hence can be useful tools for concurrent data piping and process/thread pool based processing callbacks across the veda framework. The push/pull patterns used there might also serve as dataflow templates that are useful, or at least informative, during the future development of VedaStream. You can see how it's instantiated and run in [pyveda.fetch.compat.fetchpy3](https://github.com/DigitalGlobe/pyveda/blob/development/pyveda/fetch/compat/fetchpy3.py), which is just an awful module name and it should get changed.
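
If the generator-to-fetcher-to-store flow is hard to picture, here's a deliberately simplified stand-in. It uses a plain thread pool instead of aiohttp, fakes the network call, and writes to a list instead of a vedabase. Every name in it (`url_pairs`, `fake_fetch`, the `example.invalid` urls) is made up for illustration and none of it is pyveda code.

```python
from concurrent.futures import ThreadPoolExecutor


def url_pairs(n):
    """Stand-in for a generator of (label_url, image_url) pairs."""
    for i in range(n):
        yield ("https://example.invalid/labels/{}".format(i),
               "https://example.invalid/images/{}".format(i))


def fake_fetch(pair):
    """Pretend to download a pair; here we only parse the ids out of the urls."""
    label_url, image_url = pair
    return {"label": label_url.rsplit("/", 1)[-1],
            "image": image_url.rsplit("/", 1)[-1]}


store = []  # stand-in for the vedabase being filled
with ThreadPoolExecutor(max_workers=4) as pool:
    # The workers run fake_fetch concurrently; results come back in input order
    # and are written from a single place, the main thread.
    for record in pool.map(fake_fetch, url_pairs(8)):
        store.append(record)

print(len(store))
print(store[0])
```

Keeping the writes in one place while the fetching fans out is the part of the pattern that carries over; the real fetcher does its fetching on an asyncio loop rather than in worker threads.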

An aside:
*You may notice that it is run in a context manager, `ThreadedAsyncioRunner`, which runs the loop in a real python thread. Why is this necessary? Jupyter notebooks run a jupyter session process that manages communication with the python kernel, and it happens to be asyncio based. In asyncio, each scheduler (loop) owns its own thread, so any other asyncio-based process that's slightly heavy will hijack the session loop, and the kernel will die.*  
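
For a feel of what "running the loop in a real python thread" looks like, here's a minimal sketch using only the standard library. `LoopInThread` is a hypothetical class written for this notebook; it is not pyveda's `ThreadedAsyncioRunner` and makes no claim to match how that one is built.

```python
import asyncio
import threading


class LoopInThread:
    """Context manager that runs a private asyncio loop on its own thread."""

    def __enter__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()
        return self

    def run(self, coro):
        # Hand the coroutine to the background loop and block until it finishes.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    def __exit__(self, *exc_info):
        # Ask the loop to stop from its own thread, then clean up.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


async def double_all(values):
    await asyncio.sleep(0.01)  # pretend to wait on the network
    return [2 * v for v in values]


with LoopInThread() as runner:
    result = runner.run(double_all([1, 2, 3]))

print(result)
```

Because the work runs on a separate loop in a separate thread, whatever loop the calling session is already running is left alone.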

TODO: notebook documenting how to utilize the various classes in `pyveda.fetch.aiohttp.client.py`, along with brief overview of mechanics. gettttt STOKED!!

So that's a basic overview of how and where `VedaCollection` and `VedaBase` interface, along with a brief high-level view of the plumbing that feeds `vb`. This access pattern may well change in the future; `VedaCollection` and `VedaBase` don't necessarily need to be all that different at the end of the day. Let's get a VedaBase and go through the things about it. There's a bunch of data (`vc.count`) in this dataset, but I'm late for a cool party probably so I'm going to just grab a 500 count.

In [24]:
%%time
fname = '/Users/jamiepolackwich1/projects/veda-nbs/coolparty.h5'
vb = vc.store(500, fname, title="vc.name", mode="a")



CPU times: user 13 s, sys: 3.28 s, total: 16.3 s
Wall time: 1min 20s


Oh yeah, I need to figure out how to turn off that dumb pytables "NaturalNaming" warning. crap.

### pyveda.db.VedaBase


Now we have a VedaBase instance. First thing to do is press tab of course:

In [28]:
print("vb.image_shape:{}\n".format(vb.image_shape))
print("vb.klasses: {}\n".format(vb.klasses))
print("vb.mltype: {}\n".format(vb.mltype))
print("vb.train: {}\n".format(vb.train))
print("vb.test: {}\n".format(vb.test))
print("vb.validate: {}".format(vb.validate))

vb.image_shape:(3, 256, 256)

vb.klasses: ['shipping container lot', 'bulldozer', 'truck tractor no trailer', 'facility', 'personal aircraft', 'shed', 'aircraft hangar', 'shipping container', 'passenger vehicle', 'storage tank', 'construction site', 'truck tractor with flatbed trailer', 'helicopter', 'bus', 'fixed wing aircraft', 'cargo truck', 'passenger aircraft', 'truck tractor with liquid tank', 'container crane', 'damaged building', 'building', 'car', 'tractor trailer', 'utility truck', 'vehicle lot', 'trailer no truck', 'truck', 'dump truck', 'utility vehicle', 'helipad']

vb.mltype: object_detection

vb.train: <pyveda.db.vedabase.WrappedDataNode object at 0x1c24fcc780>

vb.test: <pyveda.db.vedabase.WrappedDataNode object at 0x1c264f5d30>

vb.validate: <pyveda.db.vedabase.WrappedDataNode object at 0x1c24fcc780>


Nothing particularly interesting here. You'll notice that there are discrepancies in naming between `VedaCollection` and `VedaBase`, eg `imshape` vs. `image_shape`, etc. Those will go away in the future, and those kinds of metadata attributes will be consistent across pyveda structures.

More interesting are the `train`, `test` and `validate` properties. These return instances of, what's that you say? `WrappedDataNode` objects. This is a custom class that wraps the concept of a "node" or "group" in h5 lingo. Nodes are kind of like "data folders" which themselves can have attributes and store varieties of data structures, like arrays and tables. `WrappedDataNode` defines useful custom iteration patterns, but first let's press tab again on, say, `vb.train` for example:

In [29]:
print("vb.train.images: {}\n".format(vb.train.images))
print("vb.train.labels: {}".format(vb.train.labels))

vb.train.images: <pyveda.db.arrays.ImageArray object at 0x1c264f5710>

vb.train.labels: <pyveda.db.arrays.ObjDetectionArray object at 0x1c264f5198>


Like the custom `WrappedDataNode` class, the objects in [pyveda.db.arrays](https://github.com/DigitalGlobe/pyveda/blob/development/pyveda/db/arrays.py) act as hdf5 `array` objects, that is, contiguous, row-accessed chunked data. These data structures are extremely useful in dealing with extraordinarily large amounts of contiguous data, because they provide i/o access *without reading the entire array into memory*. That means we can index and iterate over these objects without worrying about loading everything into memory. Our custom data array classes are smart about things like data serialization and deserialization, batch writes, geo_transforms and more, but also act as low-level wrappers around the core hdf5 array structures.
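
The "without reading the entire array into memory" part is the key idea, so here's a toy analog of it. This is not HDF5 and not pytables: `LazyRows` is a made-up class that keeps fixed-width rows in a flat binary file and seeks to the one row you ask for.

```python
import os
import struct
import tempfile


class LazyRows:
    """Rows of float64 values stored on disk; each access reads a single row."""

    def __init__(self, path, row_len):
        self.path = path
        self.row_len = row_len
        self.row_bytes = row_len * 8  # 8 bytes per float64

    def __len__(self):
        return os.path.getsize(self.path) // self.row_bytes

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        with open(self.path, "rb") as f:
            f.seek(index * self.row_bytes)  # jump straight to the row
            raw = f.read(self.row_bytes)
        return struct.unpack("<{}d".format(self.row_len), raw)

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


# Write 1000 rows of 4 values each, then read a few back individually.
path = os.path.join(tempfile.mkdtemp(), "rows.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(struct.pack("<4d", i, i + 0.25, i + 0.5, i + 0.75))

rows = LazyRows(path, row_len=4)
print(len(rows))
print(rows[100])
print(rows[-1])
```

Only the requested row ever lives in memory, which is the same bargain the hdf5 arrays offer, with chunking and caching handled for you there.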

First, let's check that our `build_vedabase` function did the job:

In [32]:
print("Number of image, label pairs in vb.train: {}".format(len(vb.train)))
print("Number of image, label pairs in vb.test: {}".format(len(vb.test)))
print("Number of image, label pairs in vb.validate: {}".format(len(vb.validate)))

Number of image, label pairs in vb.train: 350
Number of image, label pairs in vb.test: 95
Number of image, label pairs in vb.validate: 45


Well that doesn't add up to 500 (it's 490), but it's close, and it's because of a bug in my partition write. I'll update that here in a second by the time you're here. But you can see that the distribution of data roughly follows the partition of [70, 20, 10] that was passed in by default when we called `vc.store()`: train hit its 350, while test and validate each came up 5 short.

We can check the total size as well:

In [38]:
print("Total number of datapoints in vb: {}".format(len(vb)))

Total number of datapoints in vb: 490


Of course we can call `len` on the individual arrays as well:

In [40]:
print("Calling __len__ on individual array vb.train.labels: {}".format(len(vb.train.labels)))

Calling __len__ on individual array vb.train.labels: 350


Let's look at the array level data access patterns, and then the node level. `__getitem__` and `__iter__` are defined at each level, and the iteration methods are based on generator access provided by the pytables arrays.

In [46]:
# Array level indexing
train_label_0 = vb.train.labels[0]
train_label_100 = vb.train.labels[100]
train_label_last = vb.train.labels[-1]  # same as vb.train.labels[349]


train_image_0 = vb.train.images[0]
train_image_100 = vb.train.images[100]
train_image_last = vb.train.images[-1]  # same as vb.train.images[349]

In [60]:
print("training image, label pair 100\n\n: {}\n\n{}".format(train_image_100, train_label_100))

training image, label pair 100

: [[[ 1787.  1454.  1513. ...,  1236.  1345.  1490.]
  [ 1904.  1470.  1259. ...,  1350.  1378.  1443.]
  [ 2206.  1626.  1321. ...,  1317.  1318.  1559.]
  ..., 
  [ 1842.  1757.  1801. ...,  1284.  1344.  1254.]
  [ 1903.  1871.  1753. ...,  1383.  1369.  1378.]
  [ 1847.  1974.  1879. ...,  1364.  1284.  1434.]]

 [[ 1189.   990.  1050. ...,   890.   945.  1025.]
  [ 1283.  1008.   881. ...,   971.   970.   994.]
  [ 1501.  1123.   928. ...,   948.   927.  1072.]
  ..., 
  [ 1142.  1093.  1125. ...,   913.   962.   907.]
  [ 1181.  1169.  1101. ...,   971.   968.   982.]
  [ 1150.  1233.  1183. ...,   951.   902.  1012.]]

 [[  946.   791.   847. ...,   630.   672.   730.]
  [ 1034.   815.   715. ...,   692.   692.   711.]
  [ 1218.   913.   758. ...,   674.   663.   770.]
  ..., 
  [  858.   822.   848. ...,   614.   646.   601.]
  [  882.   874.   818. ...,   656.   653.   655.]
  [  855.   917.   878. ...,   642.   606.   676.]]]

[[], [], [], [], 

We've got a list of lists indicating bounding boxes (should they be decimal numbers?) and an image to pair with it. Of course, a `preview` method is on the roster, which would make this demo cooler. But it means that we can readily access individual labels and images on their relevant groups, and they come out the way they should (or, they will if they don't already).
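
One guess at how to read such a label: the outer list has one entry per class (30 lists for 30 classes in this dataset), so each inner list presumably holds the boxes for the class at the same position in `vb.klasses`. That pairing is an assumption on my part, and `flatten_label` below is a hypothetical helper, with made-up boxes, that shows what the structure would look like flattened.

```python
def flatten_label(label, classes):
    """Turn a per-class list of boxes into a flat list of (class_name, box) tuples."""
    if len(label) != len(classes):
        raise ValueError("expected one list of boxes per class")
    return [(name, box) for name, boxes in zip(classes, label) for box in boxes]


# Made-up example: three classes, boxes for the first and third only.
classes = ["car", "bus", "truck"]
label = [[[0, 0, 10, 10]], [], [[5, 5, 20, 20], [1, 2, 3, 4]]]

for name, box in flatten_label(label, classes):
    print(name, box)
```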

Iteration is supported on each array as expected:

In [54]:
import numpy as np
for image in vb.validate.images:
    print("shape is {}, max is {}".format(image.shape, image.max()))

shape is (3, 256, 256), max is 3952.0
shape is (3, 256, 256), max is 9115.0
shape is (3, 256, 256), max is 3251.0
shape is (3, 256, 256), max is 9138.0
shape is (3, 256, 256), max is 9513.0
shape is (3, 256, 256), max is 5691.0
shape is (3, 256, 256), max is 9377.0
shape is (3, 256, 256), max is 9134.0
shape is (3, 256, 256), max is 9337.0
shape is (3, 256, 256), max is 9550.0
shape is (3, 256, 256), max is 3090.0
shape is (3, 256, 256), max is 3109.0
shape is (3, 256, 256), max is 9207.0
shape is (3, 256, 256), max is 8108.0
shape is (3, 256, 256), max is 6530.0
shape is (3, 256, 256), max is 3644.0
shape is (3, 256, 256), max is 2515.0
shape is (3, 256, 256), max is 6434.0
shape is (3, 256, 256), max is 5990.0
shape is (3, 256, 256), max is 6465.0
shape is (3, 256, 256), max is 9000.0
shape is (3, 256, 256), max is 3411.0
shape is (3, 256, 256), max is 9306.0
shape is (3, 256, 256), max is 5029.0
shape is (3, 256, 256), max is 9002.0
shape is (3, 256, 256), max is 5323.0
shape is (3,

We were able to call functions on each image as we iterated, without loading all of them into memory at once, because of the chunked io access built into hdf5.

We also have builtin access to data *pairs* at the group level, which is especially useful for ML frameworks that consume (image, label) inputs:

In [58]:
train_image_50, train_label_50 = vb.train[50]

In [61]:
print("training label, image pair 50\n\n: {}\n\n{}".format(train_label_50, train_image_50))

training label, image pair 50

: [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

[[[ 2823.  2847.  2871. ...,  2320.  2183.  1780.]
  [ 2867.  2868.  2852. ...,  2318.  2023.  1909.]
  [ 2875.  2888.  2895. ...,  2227.  2210.  2208.]
  ..., 
  [ 3417.  3413.  3426. ...,  1393.  1282.  1250.]
  [ 3455.  3446.  3366. ...,  1322.  1262.  1215.]
  [ 3469.  3507.  3310. ...,  1328.  1313.  1187.]]

 [[ 1404.  1417.  1430. ...,  1327.  1258.  1033.]
  [ 1430.  1432.  1424. ...,  1315.  1160.  1104.]
  [ 1438.  1445.  1451. ...,  1256.  1256.  1269.]
  ..., 
  [ 2070.  2063.  2061. ...,   875.   814.   804.]
  [ 2089.  2079.  2018. ...,   839.   809.   789.]
  [ 2096.  2109.  1981. ...,   846.   848.   777.]]

 [[  961.   971.   980. ...,   928.   881.   723.]
  [  976.   977.   974. ...,   920.   811.   771.]
  [  978.   985.   987. ...,   878.   879.   885.]
  ..., 
  [ 1428.  1420.  1417. ...,   630.   589.   586.]
 

We can get slices as well (same for arrays):

In [64]:
first10_train_image_label_pairs = vb.train[:10]

In [65]:
print(len(first10_train_image_label_pairs))

10


In [69]:
img0, lbl0 = first10_train_image_label_pairs[0]
print(np.array_equiv(img0, train_image_0))
print(lbl0 == train_label_0)

True
True


And we can also iterate over image, label pairs on nodes:

In [71]:
for im_lbl_pair in vb.train:
    # im_lbl_pair is an (image, label) pair
    pass
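
To tie the node-level indexing, slicing and iteration together, here's a minimal sketch of a class that pairs two sequences the same way. `PairNode` is hypothetical and works on plain lists; it only imitates the behavior shown above and is not how `WrappedDataNode` is implemented.

```python
class PairNode:
    """Pair up two equal-length sequences and expose (image, label) access."""

    def __init__(self, images, labels):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # A slice returns a list of (image, label) pairs.
            return list(zip(self.images[key], self.labels[key]))
        return (self.images[key], self.labels[key])

    def __iter__(self):
        return zip(self.images, self.labels)


node = PairNode(["img0", "img1", "img2"], [["box0"], [], ["box2"]])

print(len(node))
print(node[1])
print(node[:2])
for image, label in node:
    print(image, label)
```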
    

This is pretty much the gist of data access patterns in `VedaBase`. On that (data-access) front, coming soon are:
* ordering options for node access (eg, (label, image) instead of (image, label))
* batch iteration (eg, for dps in vb.test[0:100:10])

One word of **caution** when indexing on arrays: as soon as you call `__getitem__` with a `slice` object, eg `images_100 = vb.train.images[100:200]`, you have loaded those data into **memory**. Therefore, for very large arrays, `images = vb.train.images[:]` could very well overload your process ram and cause problems, just like any large in-memory data.
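
Until batch iteration lands, one way to keep memory bounded is to walk the array in fixed-size slices yourself. `iter_batches` below is a hypothetical helper that only assumes its argument supports `len()` and slicing; it's demonstrated on a plain list.

```python
def iter_batches(array, batch_size):
    """Yield consecutive slices of `array`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(array), batch_size):
        # Only this slice is materialized at any one time.
        yield array[start:start + batch_size]


data = list(range(10))
for batch in iter_batches(data, 4):
    print(batch)
```

At most `batch_size` items are held per step, so peak memory depends on the batch size rather than on the length of the whole array.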

When you're done using your vedabase, it's important that you close the file handle. If you have added or changed data at all, the proper cleanup is:

In [74]:
vb.flush()
vb.close()
del vb
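
If you'd rather not rely on remembering the cleanup, the standard library's `contextlib.closing` will call `.close()` for you, even when something blows up mid-write. The sketch uses a made-up `FakeBase` class with `flush` and `close` methods as a stand-in; whether a real vedabase behaves well under this pattern is untested here.

```python
from contextlib import closing


class FakeBase:
    """Stand-in object exposing flush() and close()."""

    def __init__(self):
        self.flushed = False
        self.closed = False

    def flush(self):
        self.flushed = True

    def close(self):
        self.closed = True


# Normal path: flush explicitly, close happens on exit from the block.
with closing(FakeBase()) as fb:
    fb.flush()
print(fb.flushed, fb.closed)

# Failure path: close still runs even though the block raised.
try:
    with closing(FakeBase()) as failing:
        raise RuntimeError("simulated failure mid-write")
except RuntimeError:
    pass
print(failing.closed)
```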

If you want to reload your vedabase at another time, instantiate a VedaBase object with the path to your file:

In [77]:
from pyveda.db import VedaBase
vb = VedaBase(fname, mode="a")

In [78]:
print("Number of image, label pairs in vb.train: {}".format(len(vb.train)))
print("Number of image, label pairs in vb.test: {}".format(len(vb.test)))
print("Number of image, label pairs in vb.validate: {}".format(len(vb.validate)))

Number of image, label pairs in vb.train: 350
Number of image, label pairs in vb.test: 95
Number of image, label pairs in vb.validate: 45


We get the expected results.

`VedaBase` also supports inputting custom loader functions, which can be any transform you please, but can be especially useful for implementing the specific structures/formats that various ML frameworks require on consumption or training. Future work will supply a module of plugins for well-known frameworks like TensorFlow, PyTorch etc, so that they can be loaded behind the scenes with a keyword argument (or something like that). That's it for now. Check back after the next big push to development, hit me up (@rjpolackwich on github) with questions, and don't hesitate to submit an issue (and make a PR for it?).

In [79]:
vb.close()
del vb