## Interacting with CerebralCortex Data

Cerebral Cortex is MD2K's big data cloud tool designed to support population-scale data analysis, visualization, model development, and intervention design for mobile-sensor data. It provides the ability to do machine learning model development on population scale datasets and provides interoperable interfaces for aggregation of diverse data sources.

This page gives an overview of the core Cerebral Cortex operations to familiarize you with how to discover and interact with the different sources of data that the system may contain.

_Note:_ Although some of these examples show generated data, they are designed to work on real-world mCerebrum data. The signal generators were built so that people who cannot access the original datasets, or who do not wish to collect data first, can still test and evaluate the Cerebral Cortex platform.

### Download Sample Dataset
We use the [WESAD](https://archive.ics.uci.edu/ml/datasets/WESAD+%28Wearable+Stress+and+Affect+Detection%29) dataset to demonstrate Cerebral Cortex Kernel capabilities. WESAD is a publicly available dataset for wearable stress and affect detection. This multimodal dataset contains physiological and motion data, recorded from both a wrist-worn and a chest-worn device, for 15 subjects during a lab study. The following sensor modalities are included: blood volume pulse, electrocardiogram, electrodermal activity, electromyogram, respiration, body temperature, and three-axis acceleration. The dataset also bridges the gap between previous lab studies on stress and on emotions by covering three affective states (neutral, stress, amusement). In addition, it contains the subjects' self-reports, which were obtained using several established questionnaires.

## Setting Up Environment

Colab does not include the runtime environments needed to run Cerebral Cortex. The following commands download and install the required tools, frameworks, and datasets.

In [None]:
import importlib, sys, os
from os.path import expanduser
sys.path.insert(0, os.path.abspath('..'))

DOWNLOAD_USER_DATA=True
ALL_USERS=False #this will only  work if DOWNLOAD_USER_DATA=True
IN_COLAB = 'google.colab' in sys.modules
MD2K_JUPYTER_NOTEBOOK = "MD2K_JUPYTER_NOTEBOOK" in os.environ
# Always define the flag so the later check cannot raise a NameError outside a ZMQ-based notebook.
IN_JUPYTER_NOTEBOOK = get_ipython().__class__.__name__ == "ZMQInteractiveShell"
JAVA_HOME_DEFINED = "JAVA_HOME" in os.environ
SPARK_HOME_DEFINED = "SPARK_HOME" in os.environ
PYSPARK_PYTHON_DEFINED = "PYSPARK_PYTHON" in os.environ
PYSPARK_DRIVER_PYTHON_DEFINED = "PYSPARK_DRIVER_PYTHON" in os.environ
HAVE_CEREBRALCORTEX_KERNEL = importlib.util.find_spec("cerebralcortex") is not None
SPARK_VERSION = "3.1.2"
SPARK_URL = "https://archive.apache.org/dist/spark/spark-"+SPARK_VERSION+"/spark-"+SPARK_VERSION+"-bin-hadoop2.7.tgz"
SPARK_FILE_NAME = "spark-"+SPARK_VERSION+"-bin-hadoop2.7.tgz"
CEREBRALCORTEX_KERNEL_VERSION = "3.3.14"

DATA_PATH = expanduser("~")
if not DATA_PATH.endswith("/"):  # append a trailing slash only when it is missing
    DATA_PATH+="/"
USER_DATA_PATH = DATA_PATH+"cc_data/"

if MD2K_JUPYTER_NOTEBOOK:
    print("Java, Spark, and CerebralCortex-Kernel are installed and paths are already setup.")
else:

    SPARK_PATH = DATA_PATH+"spark-"+SPARK_VERSION+"-bin-hadoop2.7/"
    

    if(not HAVE_CEREBRALCORTEX_KERNEL):
        print("Installing CerebralCortex-Kernel")
        !pip -q install cerebralcortex-kernel==$CEREBRALCORTEX_KERNEL_VERSION
    else:
        print("CerebralCortex-Kernel is already installed.")

    if not JAVA_HOME_DEFINED:
        if not os.path.exists("/usr/lib/jvm/java-8-openjdk-amd64/") and not os.path.exists("/usr/lib/jvm/java-11-openjdk-amd64/"):
            print("\nInstalling/Configuring Java")
            !sudo apt update
            !sudo apt-get install -y openjdk-8-jdk-headless
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
        elif os.path.exists("/usr/lib/jvm/java-8-openjdk-amd64/"):
            print("\nSetting up Java path")
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
        elif  os.path.exists("/usr/lib/jvm/java-11-openjdk-amd64/"):
            print("\nSetting up Java path")
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
    else:
        print("JAVA is already installed.")

    if (IN_COLAB or IN_JUPYTER_NOTEBOOK) and not MD2K_JUPYTER_NOTEBOOK:
        if SPARK_HOME_DEFINED:
            print("SPARK is already installed.")
        elif not os.path.exists(SPARK_PATH):
            print("\nSetting up Apache Spark ", SPARK_VERSION)
            !pip -q install findspark
            import pyspark
            spark_installation_path = os.path.dirname(pyspark.__file__)
            import findspark
            findspark.init(spark_installation_path)
            if not os.getenv("PYSPARK_PYTHON"):
                os.environ["PYSPARK_PYTHON"] = os.popen('which python3').read().replace("\n","")
            if not os.getenv("PYSPARK_DRIVER_PYTHON"):
                os.environ["PYSPARK_DRIVER_PYTHON"] = os.popen('which python3').read().replace("\n","")
        else:
            print("SPARK is already installed.")
    else:
        raise SystemExit("Please check your environment configuration at: https://github.com/MD2Korg/CerebralCortex-Kernel/")

if DOWNLOAD_USER_DATA:
    if not os.path.exists(USER_DATA_PATH):
        if ALL_USERS:
            print("\nDownloading all users' data.")
            !rm -rf $USER_DATA_PATH
            !wget -q http://mhealth.md2k.org/images/datasets/cc_data.tar.bz2 && tar -xf cc_data.tar.bz2 -C $DATA_PATH && rm cc_data.tar.bz2
        else:
            print("\nDownloading a user's data.")
            !rm -rf $USER_DATA_PATH
            !wget -q http://mhealth.md2k.org/images/datasets/s2_data.tar.bz2 && tar -xf s2_data.tar.bz2 -C $DATA_PATH && rm s2_data.tar.bz2
    else:
        print("Data already exist. Please remove folder", USER_DATA_PATH, "if you want to download the data again")

CerebralCortex-Kernel is already installed.
JAVA is already installed.
SPARK is already installed.
Data already exist. Please remove folder /root/cc_data/ if you want to download the data again


## Import packages

In [None]:
from cerebralcortex.util.helper_methods import get_study_names
from cerebralcortex.kernel import Kernel

## List all the available studies

Studies are stored on disk as a SQLite database `cc_kernel_database.db` for the metadata which is coupled with a directory structure `study=wesad` beginning with the study name.  Typically, a user calls the `get_study_names` method to list out all the possible studies that Cerebral Cortex currently has access to.

In [None]:
get_study_names()

['wesad']
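
The on-disk layout described above can be illustrated with a minimal sketch. This is not how `get_study_names` is implemented; it only assumes that each study lives in a directory named `study=<name>` next to the metadata database. The helper `list_study_dirs` and the second study name `demo` are hypothetical.

```python
import tempfile
from pathlib import Path


def list_study_dirs(data_root):
    """Return study names found as `study=<name>` directories under data_root."""
    prefix = "study="
    names = []
    for entry in sorted(Path(data_root).iterdir()):
        if entry.is_dir() and entry.name.startswith(prefix):
            names.append(entry.name[len(prefix):])
    return names


# Build a throwaway directory tree that mimics the layout described above.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "study=wesad").mkdir()
    (Path(root) / "study=demo").mkdir()              # hypothetical second study
    (Path(root) / "cc_kernel_database.db").touch()   # metadata file, ignored by the scan
    found = list_study_dirs(root)

print(found)
```

In practice `get_study_names` should be preferred, since it knows where the configured storage actually lives.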

## Create CerebralCortex object
The __Kernel__ object is the main entry point to the Cerebral Cortex system. It is necessary to pass a configuration directory that tells it all the different parameters it needs to communicate with its other components.  You can examine the details of these configurations for this server by looking at the files contained in the `cc_conf` folder.

In [None]:
CC = Kernel(cc_configs="default", study_name="wesad")

## Getting help
These are the typical ways to learn more about the code and objects within Cerebral Cortex.
1. Intelligent context help by typing the object or class into a cell followed by the period, `.`, then when you press `<tab>` a popup will appear showing additional information about the object or method. Uncomment the first line to try it out.
2. Appending a question mark to a command retrieves its documentation string and, where available, examples: `CC.list_streams?`
3. Reading the documentation on our site: https://cerebralcortex-kernel.readthedocs.io/en/latest/

In [None]:
CC.list_streams?

## List available streams in CC
One of the first things a researcher typically wants to know is what data is available to explore.  The kernel offers a couple of methods to facilitate this. The first, `list_streams`, is shown below and exposes all the available streams within the system. 

In [None]:
CC.list_streams()

['wesad.chest.ecg',
 'wesad.chest.resp',
 'wesad.chest.temp',
 'wesad.chest.eda',
 'wesad.wrist.acc',
 'wesad.wrist.temp',
 'wesad.quest',
 'wesad.wrist.bvp',
 'wesad.label',
 'wesad.wrist.eda',
 'wesad.chest.emg',
 'wesad.chest.acc']

### Search streams by name
For larger deployments, the list of all streams may be too long to easily sort through, or you may be interested in a specific type of information. In this case, the second method `search_stream` would be more applicable. This search returns streams that have a substring match of the search parameter.

In [None]:
results = CC.search_stream("acc")
for result in results:
    print(result)

wesad.wrist.acc
wesad.chest.acc
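
The substring matching described above can be sketched in plain Python. The function `search_names` is a hypothetical stand-in, not the kernel's `search_stream` implementation; the stream names are copied from the `list_streams` output.

```python
def search_names(stream_names, needle):
    """Return every stream name that contains `needle` as a substring."""
    return [name for name in stream_names if needle in name]


# Stream names copied from the `list_streams` output above.
streams = [
    "wesad.chest.ecg", "wesad.chest.resp", "wesad.chest.temp", "wesad.chest.eda",
    "wesad.wrist.acc", "wesad.wrist.temp", "wesad.quest", "wesad.wrist.bvp",
    "wesad.label", "wesad.wrist.eda", "wesad.chest.emg", "wesad.chest.acc",
]

acc_streams = search_names(streams, "acc")
wrist_streams = search_names(streams, "wrist")
print(acc_streams)
print(wrist_streams)
```

Searching for `"wrist"` instead of `"acc"` shows how the same call can narrow the list by device rather than by sensor type.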


## Get stream data
Once a stream is identified by name, it needs to be loaded into a `DataStream` object by calling `get_stream`.  This pulls into a single object all the metadata associated with the stream as well as a reference to the data so that it can be accessed as needed.

In [None]:
wrist_accel = CC.get_stream("wesad.wrist.acc")

## Print stream statistics
The summary method displays some basic statistics about the datastream such as the number of points as well as max, mean, stdev, and min values.  These statistics are shown for each column of data in the stream.

In [None]:
wrist_accel.summary()

+-------+-----------------+-------------------+------------------+------+-------+
|summary|x                |y                  |z                 |user  |version|
+-------+-----------------+-------------------+------------------+------+-------+
|count  |194528           |194528             |194528            |194528|194528 |
|mean   |39.70145171903273|-0.4073809425892417|18.990525785491034|null  |1.0    |
|stddev |19.52089787101009|33.20667050112892  |25.003502048895655|null  |0.0    |
|min    |-128.0           |-128.0             |-128.0            |s2    |1      |
|25%    |21.0             |-12.0              |9.0               |null  |1      |
|50%    |44.0             |5.0                |17.0              |null  |1      |
|75%    |57.0             |23.0               |37.0              |null  |1      |
|max    |127.0            |104.0              |127.0             |s2    |1      |
+-------+-----------------+-------------------+------------------+------+-------+
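
To make the table easier to read, here is a small sketch of how such per-column statistics could be computed. It is an illustration only, not the `summary` implementation, and it assumes that `stddev` means the sample standard deviation. The hypothetical `summarize` helper also shows why a text column such as `user` has no mean or standard deviation while its min and max are still defined.

```python
import statistics


def summarize(values):
    """Return count/min/max for any column, plus mean/stddev when values are numeric."""
    numeric = all(isinstance(v, (int, float)) for v in values)
    return {
        "count": len(values),
        "mean": statistics.mean(values) if numeric else None,
        "stddev": statistics.stdev(values) if numeric else None,  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# A handful of illustrative x values and the matching user column.
x = [62.0, 66.0, 41.0, 52.0, 54.0]
user = ["s2"] * 5

x_summary = summarize(x)
user_summary = summarize(user)
print(x_summary)
print(user_summary)
```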



## Print stream data
Any datastream can be printed or visualized on screen; however, it is important to limit the number of rows shown (3 in this case). Streams can contain millions to billions of samples depending on the size of the system, and even a single individual wearing a motion-capture band can produce more than 30,000,000 samples in a short two-week study. Cerebral Cortex defaults to settings that avoid loading all the data unless it is needed.

This example prints the first 3 rows of the loaded wrist accelerometer stream, which contains 7 columns.
- __timestamp__: The time in UTC at which the sample was recorded
- __localtime__: The time in the local timezone at which the sample was recorded
- __x__, __y__, __z__: The acceleration values along the three axes
- __user__: The identifier of the user that owns this data point
- __version__: The Cerebral Cortex version code assigned to this stream

In [None]:
wrist_accel.show(3, truncate=False) 

+-------------------------+-------------------------+----+-----+-----+----+-------+
|timestamp                |localtime                |x   |y    |z    |user|version|
+-------------------------+-------------------------+----+-----+-----+----+-------+
|2017-05-22 02:15:25      |2017-05-22 02:15:25      |62.0|-21.0|107.0|s2  |1      |
|2017-05-22 02:15:25.03125|2017-05-22 02:15:25.03125|66.0|13.0 |53.0 |s2  |1      |
|2017-05-22 02:15:25.0625 |2017-05-22 02:15:25.0625 |41.0|9.0  |15.0 |s2  |1      |
+-------------------------+-------------------------+----+-----+-----+----+-------+
only showing top 3 rows
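
The sample counts mentioned above follow from simple arithmetic. The sketch below infers the sampling rate from the first two timestamps in the output and estimates how many samples two weeks of recording would produce. It assumes continuous wear and a constant sampling rate, which real deployments rarely achieve.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# First two timestamps from the output above (the first one written with its fraction).
t0 = datetime.strptime("2017-05-22 02:15:25.00000", FMT)
t1 = datetime.strptime("2017-05-22 02:15:25.03125", FMT)

sample_period_s = (t1 - t0).total_seconds()
sampling_rate_hz = 1.0 / sample_period_s

# Assumption: the device records continuously for 14 days at this rate.
seconds_two_weeks = 14 * 24 * 60 * 60
samples_two_weeks = int(sampling_rate_hz * seconds_two_weeks)

print(sampling_rate_hz, samples_two_weeks)
```

Under these assumptions, a single three-axis stream already exceeds the 30,000,000-sample figure, which is why limiting the rows passed to `show` matters.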



## Print stream metadata
Each stream's metadata contains the following fields:

- __name__: The complete string name of this stream
- __description__: A text description of this stream
- __data_descriptor__: A list of objects that describe the data components of the stream (e.g. battery_level)
    - ...
    - __name__: data descriptor name
    - __type__: the object type (e.g. integer, float, string, ...)
    - __optional_fields__: any number of arbitrary fields can be added when creating a stream and will appear here 
    - ...
- __annotations__: Currently unused but designed to link streams together such as a **data quality** and the corresponding **raw** stream
- __input_streams__: Currently unused but designed to specify which streams were utilized to generate this stream
- __modules__: Metadata about the algorithm/code module that generated this data
  - __name__: The name of the code module
  - __version__: The version of the code module
  - __attributes__: Arbitrary attributes specified by _key-value_ pairs
  - __authors__: A set of author names and emails
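
The field layout above can be pictured as a plain dictionary. The sketch below is illustrative only: every value is made up, the exact representation of `authors` is an assumption, and this is not the kernel's metadata class. It mainly shows how the data descriptors can be turned into a column-to-type lookup.

```python
import json

# Hypothetical metadata for an accelerometer stream; all values are illustrative.
metadata = {
    "name": "example.wrist.acc",
    "description": "three-axis wrist accelerometer (illustrative)",
    "data_descriptor": [
        {"name": "x", "type": "float", "attributes": {"description": "x axis"}},
        {"name": "y", "type": "float", "attributes": {"description": "y axis"}},
        {"name": "z", "type": "float", "attributes": {"description": "z axis"}},
    ],
    "annotations": [],
    "input_streams": [],
    "modules": [
        {
            "name": "example_module",
            "version": "0.0.1",
            "attributes": {"key": "value"},
            # Assumed shape for the author entries.
            "authors": [{"name": "Jane Doe", "email": "jane@example.org"}],
        }
    ],
}

# Map each data column to its declared type.
column_types = {d["name"]: d["type"] for d in metadata["data_descriptor"]}

print(json.dumps(metadata, indent=4, sort_keys=True))
print(column_types)
```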


In [None]:
metadata = wrist_accel.get_metadata()
print(metadata)

{
    "annotations": [],
    "data_descriptor": [
        {
            "attributes": {
                "description": "utc timestamp"
            },
            "name": "timestamp",
            "type": "datetime"
        },
        {
            "attributes": {
                "description": "local timestamp"
            },
            "name": "localtime",
            "type": "datetime"
        },
        {
            "attributes": {
                "description": "stream version"
            },
            "name": "version",
            "type": "int"
        },
        {
            "attributes": {
                "description": "user id"
            },
            "name": "user",
            "type": "string"
        },
        {
            "attributes": {
                "description": ""
            },
            "name": "x",
            "type": "float"
        },
        {
            "attributes": {
                "description": ""
            },
            "name": "y",
    

## Filter Data

Cerebral Cortex returns all data associated with a stream name, which is useful for performing operations and for initial exploration. It also allows these streams to be filtered to isolate data that meets certain criteria, such as value ranges, specific columns, or specific users.

### Filter data by data column
The first major filtering capability allows logical operations to be applied to named columns. The `filter` method is called on the data stream object; in the example below it receives a single condition string made up of three parts.
- column name: (e.g. x)
- operation: (e.g. >, <, ==, >=, ...)
- criteria: (e.g. 62)



In [None]:
filtered_data = wrist_accel.filter("x>62")
filtered_data.show(3,truncate=False)

+-------------------------+-------------------------+----+----+-----+----+-------+
|timestamp                |localtime                |x   |y   |z    |user|version|
+-------------------------+-------------------------+----+----+-----+----+-------+
|2017-05-22 02:15:25.03125|2017-05-22 02:15:25.03125|66.0|13.0|53.0 |s2  |1      |
|2017-05-22 02:15:25.34375|2017-05-22 02:15:25.34375|63.0|26.0|1.0  |s2  |1      |
|2017-05-22 02:15:25.40625|2017-05-22 02:15:25.40625|78.0|25.0|-36.0|s2  |1      |
+-------------------------+-------------------------+----+----+-----+----+-------+
only showing top 3 rows
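
The column, operation, and criteria pieces described above can be sketched on plain Python rows. `filter_rows` is a hypothetical helper used only to show the logic; the datastream `filter` method evaluates the condition on the underlying distributed data instead.

```python
import operator

# Map the textual operation onto the matching comparison function.
OPERATORS = {
    ">": operator.gt, "<": operator.lt, "==": operator.eq,
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
}


def filter_rows(rows, column, op, value):
    """Keep the rows for which `row[column] <op> value` holds."""
    compare = OPERATORS[op]
    return [row for row in rows if compare(row[column], value)]


# Illustrative accelerometer rows.
rows = [
    {"x": 62.0, "y": -21.0, "z": 107.0},
    {"x": 66.0, "y": 13.0, "z": 53.0},
    {"x": 41.0, "y": 9.0, "z": 15.0},
    {"x": 52.0, "y": 16.0, "z": 24.0},
    {"x": 54.0, "y": 15.0, "z": 34.0},
]

kept = filter_rows(rows, "x", ">", 62)
kept_inclusive = filter_rows(rows, "x", ">=", 62)
print(kept)
```

Note that the strict `>` drops the row whose `x` equals 62.0, which matches the filtered output shown above.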



### Filter data by user
User filtering is a special case because of the way Cerebral Cortex stores data, so a dedicated method, `filter_user`, is provided. It accepts a single `USER_ID` as input. This example filters by the user id `s2` that appears in the outputs above.

In [None]:
filtered_user_data = wrist_accel.filter_user("s2")
filtered_user_data.show(3,truncate=False)

+-------------------------+-------------------------+----+-----+-----+----+-------+
|timestamp                |localtime                |x   |y    |z    |user|version|
+-------------------------+-------------------------+----+-----+-----+----+-------+
|2017-05-22 02:15:25      |2017-05-22 02:15:25      |62.0|-21.0|107.0|s2  |1      |
|2017-05-22 02:15:25.03125|2017-05-22 02:15:25.03125|66.0|13.0 |53.0 |s2  |1      |
|2017-05-22 02:15:25.0625 |2017-05-22 02:15:25.0625 |41.0|9.0  |15.0 |s2  |1      |
+-------------------------+-------------------------+----+-----+-----+----+-------+
only showing top 3 rows



### Filter data by version
Version filtering is a special case due to the way Cerebral Cortex stores data. A dedicated method, `filter_version`, is provided which accepts a single version as input. 

In [None]:
filtered_version_data = wrist_accel.filter_version(1)
filtered_version_data.show(3,truncate=False)

+-------------------------+-------------------------+----+-----+-----+----+-------+
|timestamp                |localtime                |x   |y    |z    |user|version|
+-------------------------+-------------------------+----+-----+-----+----+-------+
|2017-05-22 02:15:25      |2017-05-22 02:15:25      |62.0|-21.0|107.0|s2  |1      |
|2017-05-22 02:15:25.03125|2017-05-22 02:15:25.03125|66.0|13.0 |53.0 |s2  |1      |
|2017-05-22 02:15:25.0625 |2017-05-22 02:15:25.0625 |41.0|9.0  |15.0 |s2  |1      |
+-------------------------+-------------------------+----+-----+-----+----+-------+
only showing top 3 rows



## Convert datastream object into Pandas dataframe
The data representations and visualizations shown so far are useful for basic data inspection; however, they are not directly suitable for more complex interactions or analysis. Cerebral Cortex provides a `toPandas` method to transform the datastream data into a [Pandas](https://pandas.pydata.org/) dataframe object. From this point, anything that Pandas can do is supported.


In [None]:
pdf = wrist_accel.toPandas()
pdf

Unnamed: 0,timestamp,localtime,x,y,z,user,version
0,2017-05-22 02:15:25.000000,2017-05-22 02:15:25.000000,62.0,-21.0,107.0,s2,1
1,2017-05-22 02:15:25.031250,2017-05-22 02:15:25.031250,66.0,13.0,53.0,s2,1
2,2017-05-22 02:15:25.062500,2017-05-22 02:15:25.062500,41.0,9.0,15.0,s2,1
3,2017-05-22 02:15:25.093750,2017-05-22 02:15:25.093750,52.0,16.0,24.0,s2,1
4,2017-05-22 02:15:25.125000,2017-05-22 02:15:25.125000,54.0,15.0,34.0,s2,1
...,...,...,...,...,...,...,...
194523,2017-05-22 03:56:43.843750,2017-05-22 03:56:43.843750,87.0,27.0,23.0,s2,1
194524,2017-05-22 03:56:43.875000,2017-05-22 03:56:43.875000,67.0,32.0,29.0,s2,1
194525,2017-05-22 03:56:43.906250,2017-05-22 03:56:43.906250,41.0,25.0,11.0,s2,1
194526,2017-05-22 03:56:43.937500,2017-05-22 03:56:43.937500,39.0,27.0,22.0,s2,1


## Perform windowing operation on data
Many times it is preferable to group the data into windows before applying an algorithm or computation to the data.  The basic windowing function groups data into non-overlapping chunks and returns a data stream with each cell containing all the data associated with that particular window.

In [None]:
windowed_data = wrist_accel.window(windowDuration=60)
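
As a rough illustration of non-overlapping windowing, the sketch below groups timestamped values into 60-second buckets in plain Python. It assumes windows are aligned to the Unix epoch, which may differ from what `window` does internally; `tumbling_windows` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_windows(samples, window_seconds):
    """Group (timestamp, value) pairs into non-overlapping, epoch-aligned windows."""
    windows = defaultdict(list)
    for ts, value in samples:
        elapsed = (ts - EPOCH).total_seconds()
        index = int(elapsed // window_seconds)  # which window the sample falls in
        start = EPOCH + timedelta(seconds=index * window_seconds)
        windows[start].append(value)
    return dict(sorted(windows.items()))


# Illustrative samples: one value every 20 seconds starting at 02:15:25.
first = datetime(2017, 5, 22, 2, 15, 25)
samples = [(first + timedelta(seconds=20 * i), float(i)) for i in range(9)]

windows = tumbling_windows(samples, window_seconds=60)
window_sizes = [len(values) for values in windows.values()]
for start, values in windows.items():
    print(start, values)
```

Each sample lands in exactly one window, so the window sizes add up to the number of samples.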

### Sliding windows
Another common windowing technique is the sliding window. Adding a `slideDuration` parameter to the parameter list causes the windows to advance by a fraction of the window size instead of by the whole window, so consecutive windows overlap.

In [None]:
windowed_data = wrist_accel.window(windowDuration=60, slideDuration=5)
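
With a 60-second window that slides every 5 seconds, each sample belongs to several overlapping windows. The sketch below enumerates the windows that cover one timestamp. It assumes epoch-aligned, half-open windows `[start, start + window)` with whole-second durations, and `sliding_window_starts` is a hypothetical helper rather than part of the kernel.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_window_starts(ts, window_seconds, slide_seconds):
    """Return the start times of every sliding window that contains `ts`."""
    seconds = int((ts - EPOCH).total_seconds())
    start = seconds - (seconds % slide_seconds)  # latest window start at or before ts
    starts = []
    while start > seconds - window_seconds:      # window must still cover ts
        starts.append(EPOCH + timedelta(seconds=start))
        start -= slide_seconds
    return sorted(starts)


starts = sliding_window_starts(datetime(2017, 5, 22, 2, 15, 25), 60, 5)
print(len(starts), starts[0], starts[-1])
```

The number of windows per sample is the window duration divided by the slide duration, so smaller slides multiply the amount of windowed data accordingly.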

## Compute some basic stats of windowed data
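Cerebral Cortex provides computationally efficient helper functions for generating basic statistics over the datastream. The `statistical_features` helper used below produces, for each data column in every window, the _mean, median, stddev, variance, max, min, skew, kurt,_ and _sqr_ columns shown in its output.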

In [None]:
from cerebralcortex.algorithms.stats.features import statistical_features

stats_features = statistical_features(windowed_data)
stats_features.show(4, False)



+-------------------+-------------------+----+-------+-------------------+-------------------------+---------+--------+---------+----------+-----+-----+-----------+----------+---------+---------+--------+---------+----------+-----+-----+-----------+----------+---------+----------+--------+---------+----------+-----+-----+----------+-----------+---------+
|timestamp          |localtime          |user|version|start_time         |end_time                 |x_mean   |x_median|x_stddev |x_variance|x_max|x_min|x_skew     |x_kurt    |x_sqr    |y_mean   |y_median|y_stddev |y_variance|y_max|y_min|y_skew     |y_kurt    |y_sqr    |z_mean    |z_median|z_stddev |z_variance|z_max|z_min|z_skew    |z_kurt     |z_sqr    |
+-------------------+-------------------+----+-------+-------------------+-------------------------+---------+--------+---------+----------+-----+-----+-----------+----------+---------+---------+--------+---------+----------+-----+-----+-----------+----------+---------+----------+-----
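
The column naming in the output above (`x_mean`, `y_median`, and so on) can be reproduced for a single window with the standard library. This is a sketch of the idea, not the `statistical_features` implementation: it omits the skew, kurtosis, and `sqr` columns, and it assumes sample (not population) variance and standard deviation.

```python
import statistics


def window_features(window_rows, columns=("x", "y", "z")):
    """Compute per-column statistics for the rows of one window."""
    features = {}
    for col in columns:
        values = [row[col] for row in window_rows]
        features[f"{col}_mean"] = statistics.mean(values)
        features[f"{col}_median"] = statistics.median(values)
        features[f"{col}_stddev"] = statistics.stdev(values)      # sample standard deviation
        features[f"{col}_variance"] = statistics.variance(values)  # sample variance
        features[f"{col}_max"] = max(values)
        features[f"{col}_min"] = min(values)
    return features


# Illustrative rows standing in for the contents of one window.
window_rows = [
    {"x": 62.0, "y": -21.0, "z": 107.0},
    {"x": 66.0, "y": 13.0, "z": 53.0},
    {"x": 41.0, "y": 9.0, "z": 15.0},
    {"x": 52.0, "y": 16.0, "z": 24.0},
    {"x": 54.0, "y": 15.0, "z": 34.0},
]

features = window_features(window_rows)
for name, value in features.items():
    print(name, value)
```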

## Basic Plot examples
Visualization is a key part of understanding the data and performing data analysis. Cerebral Cortex contains a set of basic plotting operations that can be used for timeseries-based DataStream objects. You may pass either a Cerebral Cortex `DataStream` object or a Pandas `DataFrame` object to plot the data.

These plots are interactive; try using your mouse to explore the data.

In [None]:
from cerebralcortex.plotting.basic.plots import plot_timeseries, plot_histogram, plot_box

### Timeseries Line Plot

In [None]:
plot_timeseries(pdf)


### Histogram Plot

In [None]:
plot_histogram(pdf)


### Box Plot

In [None]:
plot_box(pdf)
