# Data Quality API
This notebook demonstrates how to use the Data Quality API to access a BTrDB `Stream`'s distillates and identify data quality issues.

In [1]:
import btrdb
from btrdbextras.dq import DQStream, DQStreamSet

In [2]:
# this is a temporary test cluster
db = btrdb.connect(profile="d2")
db.info()

{'majorVersion': 5, 'build': '5.11.133', 'proxy': {'proxyEndpoints': []}}

First, we will see how to work with a single BTrDB `Stream`. `DQStream` subclasses `Stream` and can be instantiated by simply passing in a BTrDB `Stream` object.

In [3]:
stream = db.stream_from_uuid("9464f51f-e05a-5db1-a965-3c339f748081")
dq = DQStream(stream)
print(dq)

DQStream collection=distiltest/a574-5a32-518b-b4e9-c9c86144107a, name=rand


`DQStream` has all of the same attributes and methods as `Stream`, plus an additional attribute, `distillates`, which holds a list of all of the distillate streams that have been derived from this source stream. We can use these distillates to identify specific data quality concerns.

## Detecting Events
We can use the `DQStream` to check whether the original `Stream` contains data quality issues. There are two methods for this: `contains_any_event()`, which reports whether there are _any_ data quality concerns within the stream, and `contains_event()`, which reports whether a specific data quality issue is present, based on a user-provided flag. Use the `list_distillates()` method to see which distillates are available.

In [4]:
# see which distillates are available
dq.list_distillates(notebook=True)

uuid,collection,name,repeats,duplicate-times,zeros
9464f51f...,distiltest/a574-5a32-518b-b4e9-c9c86144107a,rand,x,✓,✓


According to the table, this stream has `zeros` and `duplicate-times` distillates available, so let's see if we can find any events:

In [5]:
dq.contains_any_event()

True

In [6]:
dq.contains_event("zeros")

True

It looks like this stream contains some zero values.

We can also narrow our search to a specific time range. The following example shows that this stream does not contain any zero values on July 16th, 2021:

In [7]:
start = "2021-07-16 00:00:00.00"
end = "2021-07-17 00:00:00.00"
dq.contains_event("zeros", start=start, end=end)

False
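To find out which days do contain an event, the same time-bounded check can be repeated over consecutive windows. The helper below is a minimal sketch, not part of the Data Quality API: `daily_windows` is a hypothetical name, and it only builds `start`/`end` strings in the same format used in the cell above. It assumes nothing about how `contains_event()` treats the boundaries of the range.

```python
from datetime import datetime, timedelta


def daily_windows(first_day, n_days):
    """Return one (start, end) string pair per calendar day.

    Hypothetical helper: strings follow the "YYYY-MM-DD HH:MM:SS.ff"
    format used in the notebook cell above.
    """
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    day0 = datetime.strptime(first_day, "%Y-%m-%d")
    windows = []
    for i in range(n_days):
        start = day0 + timedelta(days=i)
        end = start + timedelta(days=1)
        # %f yields six digits; drop the last four to keep two, as above
        windows.append((start.strftime(fmt)[:-4], end.strftime(fmt)[:-4]))
    return windows


windows = daily_windows("2021-07-15", 3)

# With a live connection, each window could be checked like this:
# for start, end in windows:
#     print(start, dq.contains_event("zeros", start=start, end=end))
```

Scanning day by day is a simple way to localize an event before pulling raw data for closer inspection.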

## Working with StreamSets
The API can also work with `StreamSets`. Similar to `StreamSet`, `DQStreamSet` is just a lightweight wrapper around a list of `DQStreams`. It can be instantiated by providing a list of `Streams`:

In [4]:
stream1 = db.stream_from_uuid("9464f51f-e05a-5db1-a965-3c339f748081")
stream2 = db.stream_from_uuid("077d6745-e3ae-5795-b22d-1eb067abb360")
dqss = DQStreamSet([stream1, stream2])
dqss



<DQStreamSet (2 streams)>

`DQStreamSet` has a `describe()` method that allows us to see metadata of the underlying `Streams`:

In [5]:
dqss.describe(notebook=True)

Collection,Name,Unit,UUID,Version,Available Data Quality Info
distiltest/a574-5a32-518b-b4e9-c9c86144107a,rand,rand,9464f51f...,1072,"zeros, duplicate-times"
distiltest/f0bd-d75b-582b-b8e9-ac9cc9401755,rand,rand,077d6745...,1072,


We can also peek at the underlying `Streams` and see which distillates they have available. Note that the second `Stream` does not have any distillates.

In [6]:
dqss.list_distillates(notebook=True)

uuid,collection,name,repeats,duplicate-times,zeros
9464f51f...,distiltest/a574-5a32-518b-b4e9-c9c86144107a,rand,x,✓,✓
077d6745...,distiltest/f0bd-d75b-582b-b8e9-ac9cc9401755,rand,x,x,x


We can also see if the `Streams` within a `DQStreamSet` contain any events. The output of these methods is a `dict` with each `Stream` UUID as the key and a bool value indicating whether an event was found. Note that `None` will be returned as the value if the `DQStream` does not contain a certain distillate.

In [7]:
dqss.contains_any_event()

{'9464f51f-e05a-5db1-a965-3c339f748081': True,
 '077d6745-e3ae-5795-b22d-1eb067abb360': None}

In [9]:
dqss.contains_event("duplicate-times")

{'9464f51f-e05a-5db1-a965-3c339f748081': True,
 '077d6745-e3ae-5795-b22d-1eb067abb360': None}

In [10]:
dqss.contains_event("zeros")

{'9464f51f-e05a-5db1-a965-3c339f748081': True,
 '077d6745-e3ae-5795-b22d-1eb067abb360': None}

## Notes
* Distillates will need to follow a certain naming convention for this to work properly.
* Distillers will need to put a `source_uuid` annotation on their output streams so that the distillates can be correctly matched with their source streams.
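
To illustrate the second note, here is a rough sketch of how distillates could be grouped by a `source_uuid` annotation. This is an assumption about the general idea, not the library's actual matching code: the distillates are represented as plain dicts, `group_by_source` is a hypothetical helper, and the `name` values are placeholders that do not reflect the real naming convention.

```python
from collections import defaultdict


def group_by_source(distillates):
    """Map each source UUID to the names of distillates that point at it."""
    grouped = defaultdict(list)
    for d in distillates:
        source = d.get("annotations", {}).get("source_uuid")
        if source is None:
            # without the annotation the distillate cannot be matched
            continue
        grouped[source].append(d["name"])
    return dict(grouped)


# Placeholder records standing in for distillate streams
distillates = [
    {"name": "zeros",
     "annotations": {"source_uuid": "9464f51f-e05a-5db1-a965-3c339f748081"}},
    {"name": "duplicate-times",
     "annotations": {"source_uuid": "9464f51f-e05a-5db1-a965-3c339f748081"}},
    {"name": "orphan", "annotations": {}},
]
matched = group_by_source(distillates)
```

In this sketch a distillate without the annotation is silently skipped, which shows why the annotation is needed for a distillate to be found at all.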