## Interacting with CerebralCortex Data



Cerebral Cortex is MD2K's big data cloud tool designed to support population-scale data analysis, visualization, model development, and intervention design for mobile-sensor data. It supports machine learning model development on population-scale datasets and provides interoperable interfaces for aggregating diverse data sources.

This page provides an overview of the core Cerebral Cortex operations to familiarize you with how to discover and interact with the different sources of data that may be contained in the system.

_Note:_ While some of these examples use an open dataset, they are designed to function on real-world mCerebrum data. The signal generators were built so that people who cannot access the original datasets, or who do not wish to collect data first, can still test and evaluate the Cerebral Cortex platform.

## Setting Up Environment



This notebook does not include the runtime environments needed to run Cerebral Cortex. The following commands download and install the required tools, frameworks, and datasets.

In [None]:
import importlib, sys, os
from os.path import expanduser
sys.path.insert(0, os.path.abspath('..'))

DOWNLOAD_USER_DATA=False
ALL_USERS=False  # only takes effect if DOWNLOAD_USER_DATA=True
IN_COLAB = 'google.colab' in sys.modules
MD2K_JUPYTER_NOTEBOOK = "MD2K_JUPYTER_NOTEBOOK" in os.environ
# Always define the flag so the later check cannot raise a NameError outside Jupyter
IN_JUPYTER_NOTEBOOK = get_ipython().__class__.__name__ == "ZMQInteractiveShell"
JAVA_HOME_DEFINED = "JAVA_HOME" in os.environ
SPARK_HOME_DEFINED = "SPARK_HOME" in os.environ
PYSPARK_PYTHON_DEFINED = "PYSPARK_PYTHON" in os.environ
PYSPARK_DRIVER_PYTHON_DEFINED = "PYSPARK_DRIVER_PYTHON" in os.environ
HAVE_CEREBRALCORTEX_KERNEL = importlib.util.find_spec("cerebralcortex") is not None
SPARK_VERSION = "3.1.2"
SPARK_URL = "https://archive.apache.org/dist/spark/spark-"+SPARK_VERSION+"/spark-"+SPARK_VERSION+"-bin-hadoop2.7.tgz"
SPARK_FILE_NAME = "spark-"+SPARK_VERSION+"-bin-hadoop2.7.tgz"
CEREBRALCORTEX_KERNEL_VERSION = "3.3.14"

DATA_PATH = expanduser("~")
# Append a trailing slash only when the home directory path lacks one
if not DATA_PATH.endswith("/"):
    DATA_PATH+="/"
USER_DATA_PATH = DATA_PATH+"cc_data/"

if MD2K_JUPYTER_NOTEBOOK:
    print("Java, Spark, and CerebralCortex-Kernel are installed and paths are already setup.")
else:

    SPARK_PATH = DATA_PATH+"spark-"+SPARK_VERSION+"-bin-hadoop2.7/"
    

    if(not HAVE_CEREBRALCORTEX_KERNEL):
        print("Installing CerebralCortex-Kernel")
        !pip -q install cerebralcortex-kernel==$CEREBRALCORTEX_KERNEL_VERSION
    else:
        print("CerebralCortex-Kernel is already installed.")

    if not JAVA_HOME_DEFINED:
        if not os.path.exists("/usr/lib/jvm/java-8-openjdk-amd64/") and not os.path.exists("/usr/lib/jvm/java-11-openjdk-amd64/"):
            print("\nInstalling/Configuring Java")
            !sudo apt update
            !sudo apt-get install -y openjdk-8-jdk-headless
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
        elif os.path.exists("/usr/lib/jvm/java-8-openjdk-amd64/"):
            print("\nSetting up Java path")
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
        elif  os.path.exists("/usr/lib/jvm/java-11-openjdk-amd64/"):
            print("\nSetting up Java path")
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
    else:
        print("JAVA is already installed.")

    if (IN_COLAB or IN_JUPYTER_NOTEBOOK) and not MD2K_JUPYTER_NOTEBOOK:
        if SPARK_HOME_DEFINED:
            print("SPARK is already installed.")
        elif not os.path.exists(SPARK_PATH):
            print("\nSetting up Apache Spark ", SPARK_VERSION)
            !pip -q install findspark
            import pyspark
            spark_installation_path = os.path.dirname(pyspark.__file__)
            import findspark
            findspark.init(spark_installation_path)
            if not os.getenv("PYSPARK_PYTHON"):
                os.environ["PYSPARK_PYTHON"] = os.popen('which python3').read().replace("\n","")
            if not os.getenv("PYSPARK_DRIVER_PYTHON"):
                os.environ["PYSPARK_DRIVER_PYTHON"] = os.popen('which python3').read().replace("\n","")
        else:
            print("SPARK is already installed.")
    else:
        raise SystemExit("Please check your environment configuration at: https://github.com/MD2Korg/CerebralCortex-Kernel/")

if DOWNLOAD_USER_DATA:
    if not os.path.exists(USER_DATA_PATH):
        if ALL_USERS:
            print("\nDownloading all users' data.")
            !rm -rf $USER_DATA_PATH
            !wget -q http://mhealth.md2k.org/images/datasets/cc_data.tar.bz2 && tar -xf cc_data.tar.bz2 -C $DATA_PATH && rm cc_data.tar.bz2
        else:
            print("\nDownloading a user's data.")
            !rm -rf $USER_DATA_PATH
            !wget -q http://mhealth.md2k.org/images/datasets/s2_data.tar.bz2 && tar -xf s2_data.tar.bz2 -C $DATA_PATH && rm s2_data.tar.bz2
    else:
        print("Data already exists. Please remove folder", USER_DATA_PATH, "if you want to download the data again")

Installing CerebralCortex-Kernel
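The setup cell above decides what to install by checking environment variables and whether packages can be imported. As a minimal sketch of that idea, the helper below reports which prerequisites are present. The function name `check_prerequisites` and its parameters are hypothetical and are not part of CerebralCortex-Kernel; only the standard library is used.

```python
import importlib.util
import os


def check_prerequisites(env=None, modules=("cerebralcortex", "pyspark")):
    """Report which environment variables and Python packages are present.

    Hypothetical helper for illustration; not part of CerebralCortex-Kernel.
    """
    env = os.environ if env is None else env
    variables = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")
    # True when the variable is defined, regardless of its value
    report = {name: name in env for name in variables}
    for module in modules:
        # find_spec returns None when a top-level package cannot be found
        report[module] = importlib.util.find_spec(module) is not None
    return report


# Print a short summary for the current environment
for name, present in check_prerequisites().items():
    print(f"{name:24s} {'found' if present else 'missing'}")
```

Running this before and after the setup cell gives a quick view of what the installation changed.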

# Cerebral Cortex Data Analysis Algorithms
Cerebral Cortex contains a library of algorithms that are useful for processing data and converting it into features or biomarkers.  This page demonstrates a simple GPS clustering algorithm.  For more details about the algorithms that are available, please see our [documentation](https://cerebralcortex-kernel.readthedocs.io/en/latest/).  These algorithms are constantly being developed and improved through our own work and the work of other researchers.

## Initialize the system

In [None]:
from cerebralcortex.kernel import Kernel
CC = Kernel(cc_configs="default", study_name="default", new_study=True)



## Generate some sample location data

This example uses a data generator, which protects the privacy of real participants and lets anyone explore the system without requiring institutional review board approval. This is disabled for this demonstration to avoid creating too much data at once.

In [None]:
!wget -q https://raw.githubusercontent.com/MD2Korg/CerebralCortex/master/jupyter_demo/util/data_helper.py

In [None]:
from data_helper import gen_location_datastream

gps_stream = gen_location_datastream(user_id="00000000-afb8-476e-9872-6472b4e66b68", stream_name="gps--org.md2k.phonesensor--phone")
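To give a rough sense of what a location data generator does, here is a minimal sketch that produces rows with the same column names as the demo stream. It is not the code in `data_helper.py`: the function `generate_gps_rows`, the random-walk model, the one-minute sampling interval, and all value ranges are assumptions chosen for illustration, and the rows are plain Python dictionaries rather than a Cerebral Cortex datastream.

```python
import random
from datetime import datetime, timedelta


def generate_gps_rows(user_id, start, count, center=(35.15, -89.98), step_deg=0.001, seed=0):
    """Produce synthetic GPS rows as a random walk around `center`.

    Hypothetical stand-in for illustration; not gen_location_datastream.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    lat, lon = center
    rows = []
    for i in range(count):
        # Move at most step_deg degrees in each direction per sample
        lat += rng.uniform(-step_deg, step_deg)
        lon += rng.uniform(-step_deg, step_deg)
        timestamp = start + timedelta(minutes=i)
        rows.append({
            "timestamp": timestamp,
            "localtime": timestamp,  # no time-zone offset applied in this sketch
            "user": user_id,
            "version": 1,
            "latitude": lat,
            "longitude": lon,
            "altitude": rng.randint(80, 100),
            "speed": rng.uniform(0.0, 5.0),
            "bearing": rng.uniform(0.0, 360.0),
            "accuracy": rng.uniform(10.0, 20.0),
        })
    return rows


rows = generate_gps_rows("00000000-afb8-476e-9872-6472b4e66b68", datetime(2019, 9, 1, 11, 35, 59), 3)
for row in rows:
    print(row["timestamp"], round(row["latitude"], 5), round(row["longitude"], 5))
```

Because no real participant is involved, data of this kind can be shared and inspected freely.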

### Print generated demo data

In [None]:
gps_stream.show(3)

+-------------------+-------------------+--------------------+-------+------------------+------------------+--------+--------+----------+---------+
|          timestamp|          localtime|                user|version|          latitude|         longitude|altitude|   speed|   bearing| accuracy|
+-------------------+-------------------+--------------------+-------+------------------+------------------+--------+--------+----------+---------+
|2019-09-01 11:35:59|2019-09-01 16:35:59|00000000-afb8-476...|      1|35.151988006142915|-89.97794751301328|      83|3.622973|107.737823|18.615086|
|2019-09-01 11:36:59|2019-09-01 16:36:59|00000000-afb8-476...|      1| 35.15040997132227|-89.97766721728678|      93|1.897132|  85.19345|18.608745|
|2019-09-01 11:37:59|2019-09-01 16:37:59|00000000-afb8-476...|      1| 35.15093130084504|-89.97707565797207|      97|0.121889| 40.547028|18.260327|
+-------------------+-------------------+--------------------+-------+------------------+------------------+--------+--------+----------+---------+

### Print schema of demo data

In [None]:
gps_stream.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- localtime: timestamp (nullable = true)
 |-- user: string (nullable = true)
 |-- version: long (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- altitude: long (nullable = true)
 |-- speed: double (nullable = true)
 |-- bearing: double (nullable = true)
 |-- accuracy: double (nullable = true)



## Cluster the location data
Cerebral Cortex makes it easy to apply built-in algorithms to data streams.  In this case, `cluster_gps` is imported from the algorithm library and then run on the windowed `gps_stream` to generate a set of centroids. This is the general pattern for applying an algorithm to a datastream, and it lets researchers apply validated and tested algorithms to their own data without needing to become experts in the particular set of transformations involved.

_Note:_ the `compute` method engages the parallel computation capabilities of Cerebral Cortex: all of the data is read from the data storage layer and processed on every computational core available to the system.  This lets the computation run as quickly as possible and take advantage of powerful clusters through a relatively simple interface.  This capability is critical when working with mobile sensor big data, where data sizes can exceed hundreds of gigabytes per datastream in larger studies.
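To make the idea of turning GPS points into centroids more concrete, the sketch below groups points that lie within a fixed distance of a cluster's first point and then averages each group. This is a deliberately simple greedy stand-in, not the algorithm used by `cluster_gps`. The function names and the 100-meter radius are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def greedy_clusters(points, radius_m=100.0):
    """Assign each point to the first cluster whose first member is within radius_m."""
    clusters = []  # each cluster is a list of (lat, lon) tuples
    for lat, lon in points:
        for members in clusters:
            first_lat, first_lon = members[0]
            if haversine_m(lat, lon, first_lat, first_lon) <= radius_m:
                members.append((lat, lon))
                break
        else:
            # No existing cluster was close enough, so start a new one
            clusters.append([(lat, lon)])
    return clusters


def centroid(members):
    """Arithmetic mean of latitudes and longitudes (adequate for small areas)."""
    lat = sum(p[0] for p in members) / len(members)
    lon = sum(p[1] for p in members) / len(members)
    return lat, lon


points = [(35.0, -90.0), (35.0001, -90.0), (36.0, -90.0)]
for cluster_id, members in enumerate(greedy_clusters(points)):
    print(cluster_id, len(members), centroid(members))
```

A production algorithm would typically also handle noise points and run per user, which is the kind of detail the built-in library is meant to take care of.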

In [None]:
from cerebralcortex.algorithms.gps.clustering import cluster_gps


### Window GPS Data

In [None]:
windowed_gps = gps_stream.window()
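
Windowing groups samples into consecutive, non-overlapping time intervals so that an algorithm can work on one interval at a time. The sketch below illustrates the concept on plain Python rows. It is not how `window()` is implemented, and the function name `tumbling_windows` and the 60-second window length are assumptions for illustration rather than confirmed library defaults.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # reference point for aligning window boundaries


def tumbling_windows(rows, window_seconds=60):
    """Group rows into fixed-length, non-overlapping windows keyed by window start."""
    buckets = defaultdict(list)
    for row in rows:
        elapsed = int((row["timestamp"] - EPOCH).total_seconds())
        # Round the timestamp down to the nearest window boundary
        window_start = EPOCH + timedelta(seconds=elapsed - elapsed % window_seconds)
        buckets[window_start].append(row)
    return dict(sorted(buckets.items()))


rows = [
    {"timestamp": datetime(2019, 9, 1, 11, 35, 59), "latitude": 35.1520},
    {"timestamp": datetime(2019, 9, 1, 11, 36, 10), "latitude": 35.1504},
    {"timestamp": datetime(2019, 9, 1, 11, 36, 59), "latitude": 35.1509},
]
for start, members in tumbling_windows(rows).items():
    print(start, len(members))
```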

### Cluster windowed GPS data

In [None]:
clusters = cluster_gps(windowed_gps)
clusters.show(3, truncate=False)



+-------------------+-------------------+------------------------------------+-------+------------------+------------------+--------+--------+----------+---------+------------------+------------------+-----------+------------------+
|timestamp          |localtime          |user                                |version|latitude          |longitude         |altitude|speed   |bearing   |accuracy |centroid_longitude|centroid_latitude |centroid_id|centroid_area     |
+-------------------+-------------------+------------------------------------+-------+------------------+------------------+--------+--------+----------+---------+------------------+------------------+-----------+------------------+
|2019-09-01 11:35:59|2019-09-01 16:35:59|00000000-afb8-476e-9872-6472b4e66b68|1      |35.151988006142915|-89.97794751301328|83      |3.622973|107.737823|18.615086|-89.97804462431381|35.15212444831243 |0          |151.38709838496067|
|2019-09-01 11:36:59|2019-09-01 16:36:59|00000000-afb8-476e-9872-647

## Visualize GPS Data

### GPS Stream Plot
GPS visualization requires dedicated plotting capabilities. Cerebral Cortex includes a library for interactive exploration.  In this plot, drag the map with your mouse and zoom in to explore specific data points.

In [None]:
from cerebralcortex.plotting.gps.plots import plot_gps_clusters

In [None]:
plot_gps_clusters(clusters)