# Data Quality API
This notebook demonstrates a proof-of-concept (POC) Data Quality API that accesses the distillate streams derived from a BTrDB stream and uses them to identify data quality events.

In [1]:
import btrdb
from btrdbextras.dq import DQStream, DQStreamSet

In [2]:
db = btrdb.connect(profile="ni4ai")
db.info()

{'majorVersion': 5, 'build': '5.11.137', 'proxy': {'proxyEndpoints': []}}

First, we will see how to work with a single BTrDB `Stream`. `DQStream` inherits from `Stream` and can be instantiated by passing in a BTrDB `Stream` object.

In [4]:
stream = db.stream_from_uuid("4f8a9730-80e6-4378-948d-cb64a277f8f3")
dq = DQStream(stream)
print(dq)

DQStream collection=monitoring/generator1, name=PhA_Cang


`DQStream` has all of the same attributes and methods as `Stream`, plus one additional attribute, `distillates`. `distillates` is a list of the distillate streams that have been derived from the source stream. Each distillate is a sparse stream that holds a value of 1 at every timestamp where a specific data quality issue occurred. We can use these distillates to identify specific data quality concerns for a given `Stream`.
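To make the idea of a sparse distillate concrete, the sketch below derives a "zeros" distillate from a few raw points. It is illustrative only: `zeros_distillate` and the sample data are hypothetical and are not part of `btrdbextras`, and the real distillates are produced outside this notebook.

```python
def zeros_distillate(points):
    """Return a sparse list of (time, 1) pairs, one per zero-valued point.

    `points` is a list of (time_ns, value) tuples. Hypothetical helper,
    not part of btrdbextras.
    """
    return [(t, 1) for t, v in points if v == 0]


# Made-up raw points: (timestamp, value)
raw = [(1, 59.9), (2, 0.0), (3, 60.1), (4, 0.0)]

distillate = zeros_distillate(raw)
print(distillate)  # only the timestamps of zero-valued points remain
```

Because only the flagged timestamps are stored, checking for an issue reduces to asking whether the distillate contains any points at all.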

## Detecting Events
We can use the `DQStream` to check whether the original `Stream` contains data quality issues. There are two methods for this: `contains_any_issue()`, which reports whether there are _any_ data quality concerns within the stream, and `contains_issue()`, which reports whether a specific data quality issue is present, based on a user-provided flag. The `list_distillates()` method shows which distillates are available:

In [5]:
# see which distillates are available
dq.list_distillates()

{'uuid': '4f8a9730-80e6-4378-948d-cb64a277f8f3',
 'collection': 'monitoring/generator1',
 'name': 'PhA_Cang',
 'repeats': True,
 'duplicate-times': True,
 'zeros': True}

According to the output, this stream has `zeros`, `repeats`, and `duplicate-times` distillates available, so let's see if we can find any events:

In [6]:
dq.contains_any_issue()

True

In [6]:
dq.contains_issue("zeros")

True

In [7]:
dq.contains_issue("repeats")

False

In [8]:
dq.contains_issue("duplicate-times")

False

It looks like this stream contains zero values, but does not contain repeats or duplicate timestamps.

We can also narrow our search to a specific time range. The following example shows that this stream does not contain any zero values on July 16, 2021:

In [9]:
start = "2021-07-16 00:00:00.00"
end = "2021-07-17 00:00:00.00"
dq.contains_issue("zeros", start=start, end=end)

False
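One way to think about a time-bounded check is as a search for distillate points inside the requested window. The sketch below is a plain-Python illustration under stated assumptions, not the `btrdbextras` implementation: `to_ns` and `has_issue_in_window` are hypothetical helpers, the timestamps are assumed to be UTC, the window is assumed to be start-inclusive and end-exclusive, and the flagged times are made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(ts):
    """Convert 'YYYY-MM-DD HH:MM:SS.ff' (assumed UTC) to nanoseconds since the epoch."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=timezone.utc)
    # Integer arithmetic avoids floating-point rounding of large timestamps.
    return (dt - EPOCH) // timedelta(microseconds=1) * 1000


def has_issue_in_window(distillate_times, start, end):
    """True if any distillate timestamp (ns) falls in [start, end)."""
    start_ns, end_ns = to_ns(start), to_ns(end)
    return any(start_ns <= t < end_ns for t in distillate_times)


# Made-up timestamps at which a "zeros" distillate holds a point
flagged_times = [to_ns("2021-07-15 12:00:00.00"), to_ns("2021-07-18 03:30:00.00")]

on_july_16 = has_issue_in_window(
    flagged_times, "2021-07-16 00:00:00.00", "2021-07-17 00:00:00.00"
)
on_july_15 = has_issue_in_window(
    flagged_times, "2021-07-15 00:00:00.00", "2021-07-16 00:00:00.00"
)
print(on_july_16, on_july_15)
```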

## Working with StreamSets
The API also works with `StreamSets`. Like `StreamSet`, `DQStreamSet` is a lightweight wrapper around a list of `DQStreams`. It can be instantiated by providing a list of `Streams`:

In [10]:
stream1 = db.stream_from_uuid("4f8a9730-80e6-4378-948d-cb64a277f8f3")
stream2 = db.stream_from_uuid("53daa83c-f5fb-4520-b2fe-82c144e3dbef")
dqss = DQStreamSet([stream1, stream2])
dqss

<DQStreamSet (2 streams)>

`DQStreamSet` has a `describe()` method that allows us to see metadata of the underlying `Streams`:

In [12]:
dqss.describe(notebook=True)

| Collection | Name | Unit | UUID | Version | Available Data Quality Info |
|---|---|---|---|---|---|
| monitoring/generator1 | PhA_Cang | Degrees | 4f8a9730... | 1357 | repeats, zeros, duplicate-times |
| monitoring/generator2 | PhA_Cang | Degrees | 53daa83c... | 1366 | zeros, repeats, duplicate-times |


We can also peek at the underlying `Streams` and see which distillates they have available. In this example, both `Streams` have all three distillates.

In [13]:
dqss.list_distillates()

[{'uuid': '4f8a9730-80e6-4378-948d-cb64a277f8f3',
  'collection': 'monitoring/generator1',
  'name': 'PhA_Cang',
  'repeats': True,
  'duplicate-times': True,
  'zeros': True},
 {'uuid': '53daa83c-f5fb-4520-b2fe-82c144e3dbef',
  'collection': 'monitoring/generator2',
  'name': 'PhA_Cang',
  'repeats': True,
  'duplicate-times': True,
  'zeros': True}]

We can also check whether the `Streams` within a `DQStreamSet` contain any events. These methods return a `dict` that maps each `Stream` UUID to a bool indicating whether an event was found. If a `DQStream` does not have the requested distillate, its value is `None` instead.

In [14]:
dqss.contains_any_issue()

{'4f8a9730-80e6-4378-948d-cb64a277f8f3': True,
 '53daa83c-f5fb-4520-b2fe-82c144e3dbef': True}

In [15]:
dqss.contains_issue("duplicate-times")

{'4f8a9730-80e6-4378-948d-cb64a277f8f3': False,
 '53daa83c-f5fb-4520-b2fe-82c144e3dbef': False}

In [16]:
dqss.contains_issue("zeros")

{'4f8a9730-80e6-4378-948d-cb64a277f8f3': True,
 '53daa83c-f5fb-4520-b2fe-82c144e3dbef': True}
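Since a result value can be `True`, `False`, or `None`, a plain truthiness test would lump "no issue found" together with "distillate not available". The sketch below shows one way to keep the three cases apart. `partition_results` is a hypothetical helper, and the keys in the sample `dict` are placeholders rather than real stream UUIDs.

```python
def partition_results(results):
    """Split a {uuid: True | False | None} dict into three lists of UUIDs."""
    flagged = [u for u, v in results.items() if v is True]
    clean = [u for u, v in results.items() if v is False]
    unknown = [u for u, v in results.items() if v is None]  # distillate missing
    return flagged, clean, unknown


# Placeholder keys standing in for stream UUIDs
sample = {"uuid-a": True, "uuid-b": False, "uuid-c": None}

flagged, clean, unknown = partition_results(sample)
print("flagged:", flagged)
print("clean:", clean)
print("unknown:", unknown)
```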

## Notes
* Distillates must follow a certain naming convention for this to work properly.
* Distillers must add a `source_uuid` annotation to their output streams so that the distillates can be correctly matched with their source streams.
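The matching described in these notes could look roughly like the sketch below. It is a guess at the general shape, not the `btrdbextras` implementation: the record layout (`uuid`, `name`, `annotations`) is hypothetical, the sample UUIDs are placeholders, and using the distillate's name as the issue flag is an assumption, since the naming convention is not spelled out here.

```python
from collections import defaultdict


def group_by_source(distillates):
    """Map source UUID -> {flag name: distillate UUID}.

    Each record is a dict with hypothetical keys 'uuid', 'name', and
    'annotations'. Records without a `source_uuid` annotation are skipped
    because they cannot be matched to a source stream.
    """
    grouped = defaultdict(dict)
    for d in distillates:
        source = d["annotations"].get("source_uuid")
        if source is None:
            continue
        # Assumption: the distillate's name doubles as the issue flag.
        grouped[source][d["name"]] = d["uuid"]
    return dict(grouped)


# Placeholder records, not real streams
records = [
    {"uuid": "dist-1", "name": "zeros", "annotations": {"source_uuid": "src-1"}},
    {"uuid": "dist-2", "name": "repeats", "annotations": {"source_uuid": "src-1"}},
    {"uuid": "dist-3", "name": "zeros", "annotations": {}},  # no source, skipped
]

matched = group_by_source(records)
print(matched)
```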