# Getting Started with PingThings Data Analytics

Welcome to NI4AI! This notebook helps you learn to use the PredictiveGrid platform more effectively through a set of common code challenges. They cover typical day-to-day problems that data scientists working in power systems may face. We encourage you to work through this notebook; feel free to reach out to info@ni4ai.org with any questions.

This notebook is broken up into three sections:

1. Working with the BTrDB Database
2. Performing time series analyses
3. Performing power engineering computations

Where possible, I've provided some reading material that may help your understanding. We can update this notebook as needed to further refine it based on your questions! 

## Setting up your Environment

Before you get started, you will need the following things:

1. Access to an _allocation_ (e.g. argon, ni4ai, husky, or dominion)
2. An API key (and a username and password for the plotter and JupyterHub) 
3. Access to the [chained-library](https://github.com/pingThingsIO/chained-library) GitHub repository (optional)

**Copy this notebook into a user directory before working on it. Please don't modify the base notebook!**

If you have access to JupyterHub, I recommend cloning this repository there first and working in the cluster rather than locally. 

If you don't have access to JupyterHub, make sure you install all dependencies using:

```
$ pip install -r requirements.txt
```

Also, be sure to set up your environment variables correctly, either using `pgops env` or by adding the following environment variables to your bash profile:

1. `$BTRDB_ENDPOINTS`
2. `$BTRDB_API_KEY`
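
Before connecting, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is made up for illustration and is not part of any PingThings tooling.

```python
import os

# The two variables named in the list above.
REQUIRED_VARS = ("BTRDB_ENDPOINTS", "BTRDB_API_KEY")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("BTrDB environment variables are set.")
```

If a variable is reported missing inside JupyterHub, restarting the kernel after updating your profile is usually needed for the change to be picked up.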

## Working with BTrDB

### Task 0: Connect to the Database

```python
import btrdb

db = btrdb.connect()
db.info()
```

### Task 1: Finding Streams

Perform the following tasks in your allocation:

1. Print a table that lists collections and number of streams per collection
2. Print a table that lists the streams in a collection, along with name, unit, and other metadata
3. Use `db.query` to refactor your code from steps 1 and 2 (does it improve performance?)
4. Print a table that reports, for each stream in a collection, the earliest and latest timestamps and the number of points
5. Print a table of collections with the number of streams, the earliest and latest points, and the number of points per collection

Check out the [tabulate](https://pypi.org/project/tabulate/) Python package for printing nice tables; you can also use IPython's HTML display.
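
As a starting point for step 1, here is one possible sketch of counting streams per collection. It assumes that the `db` object returned by `btrdb.connect()` provides `list_collections` and `streams_in_collection` (with an `is_collection_prefix` keyword), as the btrdb Python bindings do in the versions I have seen; check this against your installed version. The function name `count_streams_per_collection` is made up for illustration.

```python
def count_streams_per_collection(db, prefix=""):
    """Map each collection name starting with `prefix` to its stream count.

    Assumes `db` exposes list_collections() and streams_in_collection()
    as in the btrdb Python bindings (an assumption; verify locally).
    """
    counts = {}
    for collection in db.list_collections(prefix):
        # is_collection_prefix=False requests an exact collection match,
        # so streams in nested collections are not counted twice.
        streams = db.streams_in_collection(collection, is_collection_prefix=False)
        # Iterate instead of calling len() so this works whether the
        # result is a generator or a list-like object.
        counts[collection] = sum(1 for _ in streams)
    return counts


# Usage, once connected:
#   counts = count_streams_per_collection(db, prefix="test")
#   for name, n in sorted(counts.items()):
#       print(f"{name:<40} {n:>6}")
```

This issues one request per collection, which is the baseline that the `db.query` refactoring in step 3 can be compared against.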

### Task 2: CRUD Operations 

If you have permission to do so, perform the following tasks in a `test/yourname` collection:

1. Create a stream
2. Add some annotations to the stream 
3. Generate some [random walk data](https://machinelearningmastery.com/gentle-introduction-random-walk-times-series-forecasting-python/) and insert it into your stream 
4. Update your stream's annotations with the parameters of your random walk 
5. Delete data from the middle of the stream
6. Read the original data, as it was before the deletion in step 5, using a version number
7. Insert data into the middle of the stream, overwriting points 
8. Create a diff of the version created in step 7 vs. the version before step 5
9. Obliterate your stream

_Please only perform these operations in a development allocation_
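
For step 3, a random walk can be generated with the standard library alone. The sketch below returns a list of `(time_ns, value)` tuples with nanosecond timestamps, which, as far as I know, is the shape `Stream.insert` expects; treat that as an assumption and confirm it in the btrdb documentation. The function name and its parameters (`rate_hz`, `step_std`, `seed`) are illustrative choices, and they are also the kind of values you could record as annotations in step 4.

```python
import random

NS_PER_SECOND = 1_000_000_000


def random_walk(n_points, start_ns, rate_hz=30, start_value=0.0,
                step_std=1.0, seed=None):
    """Return (time_ns, value) tuples following a Gaussian random walk."""
    rng = random.Random(seed)  # a seeded generator makes the walk reproducible
    interval_ns = NS_PER_SECOND // rate_hz
    points = []
    value = start_value
    for i in range(n_points):
        points.append((start_ns + i * interval_ns, value))
        # Each step adds zero-mean Gaussian noise to the previous value.
        value += rng.gauss(0.0, step_std)
    return points


points = random_walk(n_points=5, start_ns=1_600_000_000 * NS_PER_SECOND, seed=42)
for time_ns, value in points:
    print(time_ns, round(value, 3))
```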

### Task 3: Frequency and Gap Detection 

This task requires you to create a stream with the following characteristics (building on the random walk data generator from Task 2):

1. This stream must be in the `test/yourname` collection
2. It should contain data spanning at least 3 hours
3. It should start at 30 Hz, step up to 60 Hz, and then drop to 15 Hz, with each change occurring at a discrete point in time
4. It should contain missing data in each frequency phase, including several 200 ms to 800 ms gaps, 1 s to 5 s gaps, 45 s to 5 min gaps, and 10 min to 30 min gaps

You can create different streams with different characteristics as above if that's easier. With these streams, write algorithms that perform the following analyses (without information from your data generator):

1. For a specified pointwidth, determine when the sample rates change (e.g. when there was a configuration change)
2. Detect the start and end of the gaps in the data
3. For a specified width, detect windows that have fewer than the expected number of points

Make sure you obliterate your stream when you are done!

_Please only perform these operations in a development allocation_
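
One simple way to approach the gap analysis is to compare the spacing between consecutive timestamps against the expected sample interval. The sketch below works on a plain sorted list of nanosecond timestamps and assumes the expected interval is already known; because the sample rate changes in this task, you would need to estimate that interval per phase first. The function name `find_gaps` and the `tolerance` parameter are illustrative, and scanning raw points like this will be slow on large streams.

```python
def find_gaps(times_ns, expected_interval_ns, tolerance=1.5):
    """Return (gap_start_ns, gap_end_ns) pairs from sorted timestamps.

    A gap is reported when two consecutive points are more than
    tolerance * expected_interval_ns apart. The pair holds the last
    point before the gap and the first point after it.
    """
    threshold = tolerance * expected_interval_ns
    gaps = []
    for previous, current in zip(times_ns, times_ns[1:]):
        if current - previous > threshold:
            gaps.append((previous, current))
    return gaps


# 10 Hz example (100 ms spacing) with points missing between 300 ms and 800 ms.
MS = 1_000_000
times = [0, 100 * MS, 200 * MS, 300 * MS, 800 * MS, 900 * MS]
print(find_gaps(times, expected_interval_ns=100 * MS))
# -> [(300000000, 800000000)]
```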

## Time Series Analyses

### Task 1: Working with Pandas

Select stream(s) from the allocation you're using; voltage or current magnitudes are probably your best bet.

1. Write a function that queries raw values and returns a pandas Series whose index consists of `pd.Timestamp` values
2. Update the above function to accept possibly multiple streams as input and return a DataFrame 
3. Use interpolation to fill in `np.nan` values in the above DataFrame
4. Use timestamp truncation to fill in the `np.nan` values in the above DataFrame 
5. Write a new function that returns a DataFrame of stat points for a stream using an `aligned_windows` or `windows` query
6. Update the above function to return only the means (or another aggregate) for multiple streams, building one DataFrame from a `windows` query

### Task 2: Visualization

Select stream(s) from the allocation you're using; voltage or current magnitudes are probably your best bet.

1. Using matplotlib and a `windows` or `aligned_windows` query, plot the mean series with a lighter fill area between the minimum and the maximum (similar to the plotter)
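
The plotting part of this task can be sketched independently of the query. The example below uses synthetic values in place of real stat points and assumes you have already extracted parallel sequences of times, minimums, means, and maximums from your query results. The function name `plot_stat_band` is illustrative.

```python
import matplotlib.pyplot as plt


def plot_stat_band(times, mins, means, maxs, label="mean"):
    """Plot the mean as a line over a lighter min/max band."""
    fig, ax = plt.subplots(figsize=(9, 3))
    # A low alpha keeps the band lighter than the mean line drawn on top.
    ax.fill_between(times, mins, maxs, alpha=0.3, label="min/max range")
    ax.plot(times, means, label=label)
    ax.set_xlabel("time")
    ax.legend()
    return fig, ax


# Synthetic stand-in for windowed stat points.
times = list(range(5))
means = [1.0, 1.2, 0.9, 1.1, 1.0]
mins = [m - 0.2 for m in means]
maxs = [m + 0.2 for m in means]

fig, ax = plot_stat_band(times, mins, means, maxs)
```

Passing `pd.Timestamp` values as `times` should also work, since matplotlib handles datetime-like x values.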

## Power Engineering