# Windows, aligned windows, and values

> The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods.

This tutorial offers a guide on using non-traditional methods in PredictiveGrid to work with big time series data sets.

For high-resolution streams, a simple query such as "Give me all of the data in this stream" can return enough data to overload any computing environment. This tutorial describes query options that make it practical to work with very large volumes of data.

We'll describe three methods for querying data in PredictiveGrid. None of them is better or worse in general; there is a time and a place for each. This tutorial explores when each is appropriate to use.

### Functions used
- `stream.values()`
- `stream.windows()`
- `stream.aligned_windows()`

In [None]:
import btrdb
import pandas as pd
import numpy as np
from btrdb.utils.timez import *
from datetime import datetime, timedelta

from matplotlib import pyplot as plt

In [None]:
db = btrdb.connect()

### Querying data

To illustrate what's meant by "big data", let's investigate the very simple task of querying data from a single stream.

If you ask for all of the data in a stream, what will that yield?

In [None]:
streams = db.streams_in_collection('sunshine/PMU1', tags={'name': 'L1MAG'})
stream = streams[0]
print('collection:\t', stream.collection)
print('stream name:\t', stream.name)

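# How many points is that? (count once and reuse the result)
n_points = stream.count()
print('size:\t\t', round(n_points/1e9, 2), 'billion points')
# Each point is a 64-bit timestamp plus a 64-bit value: 64*2 bits / 8 = 16 bytes
print('volume:\t\t', round(n_points*64*2/8/1e9, 2), 'gigabytes')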

### That's a lot of data!
Querying that much data would likely overload your computing environment, and it would take a long time for the data to come back.

***Is there a better way?***

# Windows Queries

Windows queries provide *statistical aggregates* or "summary statistics" of raw data points in a given time interval. A windows query will return a time series of `StatPoint` objects, which can be used to explore summary statistics of raw values over time.
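To make "statistical aggregates" concrete, here is a minimal local sketch of the kind of summary a windows query produces. It runs on synthetic data with pandas and is not how the database computes its results; the function name `summarize_windows` is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

def summarize_windows(times_ns, values, start_ns, window_ns):
    """Group raw (time, value) samples into fixed-width windows that begin at
    start_ns, and compute the kinds of aggregates a StatPoint carries.
    Assumes every sample time is >= start_ns."""
    times_ns = np.asarray(times_ns, dtype=np.int64)
    values = np.asarray(values, dtype=float)

    # Index of the window that each sample falls into
    window_index = (times_ns - start_ns) // window_ns

    grouped = pd.Series(values).groupby(window_index)
    summary = grouped.agg(["min", "mean", "max", "count"])
    # Population standard deviation (ddof=0) is an assumption of this sketch
    summary["stddev"] = grouped.std(ddof=0)
    # Start time of each window, in nanoseconds
    summary["time"] = start_ns + summary.index.to_numpy() * window_ns
    return summary.reset_index(drop=True)

# Ten synthetic samples, one per second, summarized in 5-second windows
second = 1_000_000_000
times = np.arange(10) * second
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 10.0, 10.0, 10.0])

summary = summarize_windows(times, values, start_ns=0, window_ns=5 * second)
print(summary)
```

Ten raw points collapse into two rows. The same reduction applied to billions of points is what keeps a windows result small enough to work with.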

New to `StatPoints`? Start with the tutorial below. 

https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/tutorials/5%20-%20Working%20with%20StatPoints.ipynb

In [None]:
t0 = currently_as_ns()

start, _ = stream.earliest()
start = ns_to_datetime(start.time)

end, _ = stream.latest()
end = ns_to_datetime(end.time)

window = ns_delta(days=5)

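# Begin at the first midnight after the earliest point. Adding a timedelta
# (rather than using start.day+1) stays valid at month boundaries.
start_time = (start + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)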
statpoints = stream.windows(start_time, end, window)
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

In [None]:
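def statpoints_to_dataframe(statpoints, datetime_index=True):
    """Convert the (StatPoint, version) pairs returned by a windows query into a DataFrame."""
    attributes = ['min', 'mean', 'max', 'stddev', 'count', 'time']

    # One row per StatPoint; the stream version in each pair is not needed here
    df = pd.DataFrame([[getattr(p, attr) for attr in attributes] for p, _ in statpoints],
                      columns=attributes)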

    if datetime_index:
        df['datetime'] = [ns_to_datetime(t) for t in df['time']]
        return df.set_index('datetime')
    else:
        return df

df = statpoints_to_dataframe(statpoints)
df.head()

### What just happened?

The query `stream.windows()` scanned through 18 months (!) of data and returned a tuple of `(StatPoint, version)` pairs, one per window.

Those 18 months are partitioned into 5-day time increments (as specified by the `window` parameter). Each StatPoint reports summary statistics of values observed during that time frame.

Note that pulling all 9+ billion raw point values for the same interval would have taken MUCH longer and would have returned more data than fits in RAM. StatPoint objects make it feasible to mine long time intervals for trends or event signatures that warrant more detailed analysis.

# What happens if we zoom in?

In [None]:
t0 = currently_as_ns()

window = ns_delta(days=1)
statpoints = stream.windows(start, end, window)
print('Data Duration:',(end-start))
print('Aggregation window:', timedelta(seconds=int(window/1e9)))
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

df = statpoints_to_dataframe(statpoints)
df.head()

# Aligned windows

Aligned windows return results that look very much like windows queries. The only difference is that timestamps are adjusted to align with the time windows stored inherently in the database. Where `windows` queries may need to re-compute statistical aggregates over the requested time window, `aligned_windows` queries can leverage pre-computed values.


Let's look at the difference in performance.

In [None]:
window = ns_delta(days=1)
pw = int(np.log2(window))  # pointwidth is an integer exponent: the window spans 2**pw nanoseconds

t0 = currently_as_ns()
statpoints = stream.aligned_windows(start, end, pointwidth=pw)
print('Data Duration:',(end-start))
print('Aggregation window:', timedelta(seconds=int(window/1e9)))
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

df = statpoints_to_dataframe(statpoints)
df.head()

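That's much faster! The one thing to note is that an `aligned_windows` query does not use the requested window verbatim. The window width is always a power of two in nanoseconds (2^pointwidth), so the 1-day window requested here, with its pointwidth truncated to 46, becomes 2^46 ns, or about 19.5 hours. The start and end times are likewise snapped to the window boundaries that match the inherent structure of the database. This means your start time, end time, and window may all differ from what you specified.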
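The adjustment can be sketched with plain integer arithmetic. This is a minimal illustration that assumes, as we understand the btrdb-python documentation, that the lowest `pointwidth` bits of the start and end times are cleared; the helper names `pointwidth_for` and `align_down` and the sample timestamp are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000
NS_PER_HOUR = 3600 * NS_PER_SECOND

def pointwidth_for(window_ns: int) -> int:
    """Largest exponent p such that 2**p nanoseconds fits within window_ns."""
    return window_ns.bit_length() - 1  # exact floor(log2(window_ns)) for positive ints

def align_down(time_ns: int, pointwidth: int) -> int:
    """Clear the lowest `pointwidth` bits, snapping back to a window boundary."""
    return (time_ns >> pointwidth) << pointwidth

one_day_ns = 24 * NS_PER_HOUR
pw = pointwidth_for(one_day_ns)
aligned_width_ns = 2 ** pw

# Hypothetical start time, in nanoseconds since the epoch
requested_start_ns = 1_500_000_000 * NS_PER_SECOND + 123_456_789
aligned_start_ns = align_down(requested_start_ns, pw)

print("pointwidth:", pw)
print("requested window (hours):", one_day_ns / NS_PER_HOUR)
print("aligned window (hours):", round(aligned_width_ns / NS_PER_HOUR, 2))
print("start moved earlier by (hours):",
      round((requested_start_ns - aligned_start_ns) / NS_PER_HOUR, 2))
```

Using `int.bit_length()` keeps the calculation in exact integer arithmetic, which avoids floating-point surprises when a window happens to be an exact power of two.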

In [None]:
print('WINDOW DURATION')
print('\tAs specified:', timedelta(seconds=int(window/1e9)))
print('\tAs returned:', btrdb.utils.general.pointwidth(pw))


print('\n\nSTART TIME')
print('\tAs specified:', start)
print('\tAs returned:', df.index.min())


print('\n\nEND TIME')
print('\tAs specified:', end)
# The last window ends one aligned window width (2**pw ns) after it starts
print('\tAs returned:', df.index.max()+timedelta(seconds=2**int(pw)/1e9))

# Getting more granular with aligned_windows

`aligned_windows` queries are much faster, and they let you query data at finer resolutions than is practical using `windows`.

In [None]:
window = ns_delta(hours=6)
pw = int(np.log2(window))  # pointwidth is an integer exponent

t0 = currently_as_ns()
statpoints = stream.aligned_windows(start, end, pointwidth=pw)

print('Data Duration:',(end-start))
print('Aggregation window:', btrdb.utils.general.pointwidth(pw))
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

In [None]:
window = ns_delta(minutes=30)
pw = int(np.log2(window))  # pointwidth is an integer exponent

t0 = currently_as_ns()
statpoints = stream.aligned_windows(start, end, pointwidth=pw)

print('Data Duration:',(end-start))
print('Aggregation window:', btrdb.utils.general.pointwidth(pw))
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

In [None]:
window = ns_delta(minutes=1)
pw = int(np.log2(window))  # pointwidth is an integer exponent

t0 = currently_as_ns()
statpoints = stream.aligned_windows(start, end, pointwidth=pw)

print('Data Duration:',(end-start))
print('Aggregation window:', btrdb.utils.general.pointwidth(pw))
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

That last query took a while! Let's make note that querying 1.5 years of data at 1-minute resolution is starting to push the limits of what our environment (or patience!) can handle.

It is possible to speed that up, however, by using a larger computing environment.
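One likely contributor to the growing runtime is simply the number of StatPoints returned, which doubles each time the pointwidth drops by one. The sketch below estimates that count for the window sizes requested above. The 548-day duration is an assumed stand-in for "18 months", and `aligned_window_count` is a hypothetical helper, not part of the `btrdb` API.

```python
NS_PER_SECOND = 1_000_000_000
NS_PER_DAY = 24 * 60 * 60 * NS_PER_SECOND

def aligned_window_count(duration_ns: int, requested_window_ns: int) -> int:
    """Approximate number of aligned windows covering duration_ns, after the
    requested window is truncated down to a power of two in nanoseconds."""
    pointwidth = requested_window_ns.bit_length() - 1  # floor(log2(window))
    return duration_ns // (2 ** pointwidth)

# Assumption: roughly 18 months of data
duration_ns = 548 * NS_PER_DAY

requested = {
    "1 day": NS_PER_DAY,
    "6 hours": 6 * 3600 * NS_PER_SECOND,
    "30 minutes": 30 * 60 * NS_PER_SECOND,
    "1 minute": 60 * NS_PER_SECOND,
}

counts = {label: aligned_window_count(duration_ns, w) for label, w in requested.items()}
for label, n in counts.items():
    print(f"{label:>10}: about {n:,} StatPoints")
```

At the 1-minute setting the result already exceeds a million StatPoints, each of which has to be transferred and turned into a Python object.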

# When to use `values`

Many analytics can be done using StatPoints, either to summarize steady-state characteristics of the data at the time scale of interest or to identify intervals that contain an "event". Raw values are then needed only for the intervals that warrant full-resolution analysis.

Here, we'll simply explore the point at which values queries become intractable to perform.

In [None]:
window = ns_delta(minutes=1)
start_time = datetime_to_ns(start)
end_time = start_time + window

t0 = currently_as_ns()
raw_points = stream.values(start_time, end_time)  # values() returns raw points, not StatPoints
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

In [None]:
window = ns_delta(minutes=10)
start_time = datetime_to_ns(start)
end_time = start_time + window

t0 = currently_as_ns()
raw_points = stream.values(start_time, end_time)  # values() returns raw points, not StatPoints
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

In [None]:
window = ns_delta(hours=1)
start_time = datetime_to_ns(start)
end_time = start_time + window

t0 = currently_as_ns()
raw_points = stream.values(start_time, end_time)  # values() returns raw points, not StatPoints
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

In [None]:
window = ns_delta(hours=6)
start_time = datetime_to_ns(start)
end_time = start_time + window

t0 = currently_as_ns()
raw_points = stream.values(start_time, end_time)  # values() returns raw points, not StatPoints
print('Runtime: %.2f seconds'%((currently_as_ns()-t0)/1e9))

### Final note...

When running values queries, be sure to check how much working memory is available in your JupyterHub instance. Bringing large amounts of data into memory can easily crash your environment! You may need to shut down and move to a larger instance.
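One way to stay within memory is to split a long interval into smaller chunks and process them one at a time, keeping only a reduced result from each. The generator below is a minimal sketch of that pattern; `iter_chunks` is a hypothetical helper, and the comment marks where a values query would go in a live notebook.

```python
from datetime import timedelta

NS_PER_SECOND = 1_000_000_000

def iter_chunks(start_ns: int, end_ns: int, chunk_ns: int):
    """Yield consecutive (chunk_start, chunk_end) pairs covering [start_ns, end_ns)."""
    if chunk_ns <= 0:
        raise ValueError("chunk_ns must be positive")
    chunk_start = start_ns
    while chunk_start < end_ns:
        chunk_end = min(chunk_start + chunk_ns, end_ns)
        yield chunk_start, chunk_end
        chunk_start = chunk_end

# Hypothetical 25-minute interval, processed in 10-minute chunks
start_ns = 0
end_ns = 25 * 60 * NS_PER_SECOND
chunks = list(iter_chunks(start_ns, end_ns, 10 * 60 * NS_PER_SECOND))

for chunk_start, chunk_end in chunks:
    # In a live notebook, this is where stream.values(chunk_start, chunk_end)
    # would be called, reduced to what is needed, and the raw points discarded.
    print("chunk length:", timedelta(seconds=(chunk_end - chunk_start) / NS_PER_SECOND))
```

The final chunk is shortened so the chunks never run past the end of the interval, and each chunk begins exactly where the previous one ended.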

### `aligned_windows` queries in action 
Here are some examples where we use StatPoints to home in on time intervals that are known (or likely) to be of interest for a given analytic:
- Voltage sags: https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/demo/Voltage%20Sag%20Exploration.ipynb
- Tap changes: https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/demo/Voltage%20Change%20Detection.ipynb
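The general idea of screening windows with StatPoints can be sketched on a DataFrame shaped like the output of `statpoints_to_dataframe`. This is an illustration on synthetic data, not the method used in the notebooks linked above; the function name `flag_sag_candidates` and the 0.9 threshold are hypothetical.

```python
import pandas as pd

def flag_sag_candidates(df, depth=0.9):
    """Return the windows whose minimum fell below depth * mean."""
    return df[df["min"] < depth * df["mean"]]

# Synthetic daily summary statistics; values are per-unit voltage magnitudes
df = pd.DataFrame(
    {
        "min": [0.99, 0.98, 0.70, 0.99],
        "mean": [1.00, 1.00, 0.97, 1.00],
        "max": [1.01, 1.01, 1.01, 1.01],
    },
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

candidates = flag_sag_candidates(df)
print(candidates)
```

Only the flagged windows would then need a finer `aligned_windows` query or a full-resolution `values` query.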

### `values` queries in action

Here are examples where we use values queries to examine events that warrant full-resolution queries:
- Spectral analysis: https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/demo/PV_spectrogram.ipynb
- Phase angle differencing: https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/demo/Phase%20Angle%20Monitoring.ipynb