# PRIZM Data Wrangling Tutorial


## PRIZM Metadatabase

This section demonstrates the two basic functions of the PRIZM metadatabase: retrieving metadata and loading data. We begin by importing the `metadatabase` module under the `mdb` alias.

In [4]:
import metadatabase as mdb

### Retrieving Metadata

SQLite queries can be executed against the metadatabase using the `execute` function. For instance, the following call retrieves the model number and description of every hardware component listed in the PRIZM metadatabase.

In [None]:
mdb.execute("SELECT component_model, component_description FROM HardwareComponents")
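For readers without access to the metadatabase file, the query above can be tried against a throwaway in-memory SQLite database. The sketch below uses Python's standard `sqlite3` module directly rather than `mdb.execute`, whose return format is not shown here. The table and column names are taken from the query, while the two rows are made-up placeholders.

```python
import sqlite3

# In-memory stand-in for the metadatabase; the rows below are hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE HardwareComponents ("
    "component_model TEXT, component_description TEXT)"
)
connection.executemany(
    "INSERT INTO HardwareComponents VALUES (?, ?)",
    [
        ("MODEL-A", "hypothetical amplifier"),
        ("MODEL-B", "hypothetical filter"),
    ],
)

# Same SELECT statement as in the cell above.
rows = connection.execute(
    "SELECT component_model, component_description FROM HardwareComponents"
).fetchall()

# Each row comes back as a tuple with one entry per selected column.
for model, description in rows:
    print(model, "-", description)
```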

As an example of a more complex query, the directory addresses and file names associated with the east-west polarization data gathered by the 100MHz PRIZM antenna during the first half of 2018 can be retrieved in chronological order as follows.

In [None]:
mdb.execute(("SELECT DataDirectories.directory_address, DataTypes.file_name "
             "FROM   DataDirectories "
             "JOIN   DataCategories "
             "ON     DataDirectories.data_category = DataCategories.data_category "
             "AND    DataCategories.category_name = 'Antenna' "
             "JOIN   DataFiles "
             "ON     DataDirectories.data_directory = DataFiles.data_directory "
             "AND    DataDirectories.time_start <= strftime('%s','2018-07-01 00:00:00') "
             "AND    DataDirectories.time_stop >= strftime('%s','2018-01-01 00:00:00') "
             "JOIN   DataTypes "
             "ON     DataFiles.data_file = DataTypes.data_file "
             "JOIN   ChannelGroups "
             "ON     DataFiles.channel_group = ChannelGroups.channel_group "
             "JOIN   ChannelOrientations "
             "ON     ChannelOrientations.channel_orientation = ChannelGroups.channel_orientation "
             "AND    ChannelOrientations.orientation_name = 'EW' "
             "JOIN   ArrayElements "
             "ON     ArrayElements.array_element = ChannelGroups.array_element "
             "AND    ArrayElements.element_name = '100MHz' "
             "ORDER  BY DataDirectories.time_start "))
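The two `strftime('%s', ...)` conditions in the query select every directory whose time span overlaps the first half of 2018. The sketch below reproduces that logic with the standard `sqlite3` module, showing the Unix timestamps that SQLite computes for the two dates. The `overlaps` helper is a hypothetical name introduced only for illustration.

```python
import sqlite3
from datetime import datetime, timezone

connection = sqlite3.connect(":memory:")

# SQLite's strftime('%s', ...) returns seconds since 1970-01-01 UTC as text.
window_start, window_stop = connection.execute(
    "SELECT strftime('%s', '2018-01-01 00:00:00'), "
    "       strftime('%s', '2018-07-01 00:00:00')"
).fetchone()
window_start, window_stop = int(window_start), int(window_stop)

# Cross-check the first bound against Python's own UTC conversion.
python_start = int(datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp())


def overlaps(time_start, time_stop, lower=window_start, upper=window_stop):
    """Mirror the query's test: the span [time_start, time_stop] meets the window."""
    return time_start <= upper and time_stop >= lower


print(window_start, window_stop)
print(overlaps(1524400000, 1524500000))  # a span in April 2018
```

A directory is kept if it starts before the window ends and stops after the window begins, so directories that only partially fall within the first half of 2018 are also returned.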

### Loading Data

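PRIZM data can be loaded through the metadatabase using the `load` function. This function takes lists as arguments and returns a dictionary containing the data that match every combination of the elements of these lists. This is illustrated below, where the antenna and switch data associated with the east-west polarization channel of the 100MHz instrument, and collected within a roughly 28-hour interval spanning April 22–23, 2018, are loaded subject to the listed `quality`, `integrity`, and `completeness` flags.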

In [None]:
mdb.load(categories=['Antenna', 'Switch'],
         instruments=['100MHz'],
         channels=['EW'],
         intervals=[(1524400000.0,1524500000.0),],
         quality=[1],
         integrity=[1],
         completeness=[1])
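The interval bounds passed above are easier to read as calendar dates. The sketch below assumes they are Unix timestamps in seconds (UTC), which is what the `strftime('%s', ...)` comparisons in the earlier query suggest. The `utc_interval` helper is a hypothetical convenience and is not part of the `metadatabase` module.

```python
from datetime import datetime, timezone


def utc_interval(start, stop):
    """Convert two naive datetimes, interpreted as UTC, into a (start, stop)
    tuple of Unix timestamps in seconds."""
    return (
        start.replace(tzinfo=timezone.utc).timestamp(),
        stop.replace(tzinfo=timezone.utc).timestamp(),
    )


# The calendar bounds corresponding to the interval used in the cell above.
interval = utc_interval(
    datetime(2018, 4, 22, 12, 26, 40),
    datetime(2018, 4, 23, 16, 13, 20),
)
print(interval)
```

Under that assumption, a tuple built this way could be placed in the list passed as `intervals`.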

Alternatively, curated data selections suitable for specific analyses can be loaded through the metadatabase by referencing certain pickle files, such as those available under this repository's `../selections` subdirectory. As demonstrated below, the pickle file `../selections/2018_100MHz_EW.p` can be referenced to load all the good-quality east-west polarization data gathered by the 100MHz antenna in 2018.

In [35]:
data = mdb.load(selection=mdb._path + '/selections/2018_100MHz_EW.p')

The data is returned in the form of NumPy arrays and organized in a nested dictionary structure. The key hierarchy in the resulting dictionary is shown below for the curated data selection loaded above – a similar hierarchy results when data associated with different categories, instruments, or channels are loaded.

```python
{
    '100MHz':
    {
        'EW':
        {
            'time_sys_start': numpy.ndarray,
            'time_sys_stop': numpy.ndarray,
            'pol': numpy.ndarray,
        },
    },

    'Switch':
    {
        'antenna': numpy.ndarray,
        'res100': numpy.ndarray,
        'res50': numpy.ndarray,
        'short': numpy.ndarray,
        'noise': numpy.ndarray,
    },
}
```

In general, data common to both polarization channels are listed directly under their respective category or instrument key, as illustrated by the `antenna`, `res100`, `res50`, `short`, and `noise` entries under the `Switch` key. In contrast, data specific to a particular polarization channel are nested one level deeper, under the appropriate channel key, as exemplified by the `time_sys_start`, `time_sys_stop`, and `pol` entries found under the `EW` key within `100MHz`.
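A quick way to inspect such a nested dictionary is to walk it and list the key path leading to each entry. The sketch below does this on a mock dictionary with the same key hierarchy. The `list_entries` helper is a hypothetical name, and plain lists with made-up values stand in for the NumPy arrays so that the example runs without dependencies.

```python
def list_entries(tree, path=()):
    """Yield (key_path, value) for every non-dictionary entry of a nested dict."""
    for key, value in tree.items():
        if isinstance(value, dict):
            # Descend into sub-dictionaries, extending the key path.
            yield from list_entries(value, path + (key,))
        else:
            yield path + (key,), value


# Mock data mirroring the key hierarchy shown above; the values are placeholders.
mock = {
    '100MHz': {
        'EW': {
            'time_sys_start': [0.0, 1.0],
            'time_sys_stop': [1.0, 2.0],
            'pol': [[1.0, 2.0], [3.0, 4.0]],
        },
    },
    'Switch': {
        'antenna': [],
        'res100': [],
        'res50': [],
        'short': [],
        'noise': [],
    },
}

paths = [path for path, _ in list_entries(mock)]
for path, value in list_entries(mock):
    print(' -> '.join(path), type(value).__name__, len(value))
```

Because the helper only relies on `dict` checks and `len`, it should apply equally to a dictionary holding actual arrays.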

## PRIZM Data Manipulation

...

In [31]:
import data as d
import matplotlib.pyplot as plt