# Import Your Own Data
mCerebrum is not the only way to collect and load data into *Cerebral Cortex*.  It is possible to import your own structured datasets into the platform. This example demonstrates how to load existing data and then read it back from Cerebral Cortex through the same mechanisms you have been using.  It also demonstrates how to write a custom data transformation function that manipulates this data and produces a smoothed result, which is then visualized.

## Initialize the system

In [None]:
%reload_ext autoreload
from util.dependencies import *
from settings import USER_ID

## Import Data
Cerebral Cortex provides a set of predefined data import routines that fit typical use cases.  The most common is the CSV data parser, `from cerebralcortex.data_importer.data_parsers import csv_data_parser`.  These parsers are easy to write and can be extended to support most types of data.  Additionally, the data importer, `import_dir`, needs to be brought into this notebook so that we can start the data import process.

The `import_dir` method requires several parameters that are discussed below.
- `cc_config`: The path to the configuration files for Cerebral Cortex.  This is the same folder that you would use for the `Kernel` initialization.
- `input_data_dir`: The path to where the data to be imported is located.  In this example, `sample_data` is available in the file/folder browser on the left and you should explore the files located inside of it.
- `user_id`: The UUID of the user who owns the data to be imported into the system.
- `data_file_extension`: The file extensions to be considered for import, given as a list such as `[".csv"]`.
- `data_parser`: The parser imported above, or another one, that defines how to interpret the data samples on a per-line basis.
- `gen_report`: A True/False value that controls whether a report is printed to the screen when the import completes.

In [None]:
from cerebralcortex.data_importer.data_parsers import csv_data_parser
from cerebralcortex.data_importer import import_dir

import_dir(
    cc_config="/home/md2k/cc_conf/",
    input_data_dir="sample_data/",
    user_id=USER_ID,
    data_file_extension=[".csv"],
    data_parser=csv_data_parser,
    gen_report=True
)
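
The `data_parser` argument interprets the data samples one line at a time. As a rough illustration of what such a per-line parser might do, the sketch below turns a single CSV line into a timestamp followed by float values. The name `example_line_parser`, the column layout (a Unix timestamp in milliseconds first, then numeric columns), and the returned list are all assumptions made for illustration; consult `csv_data_parser` in your installed version for the interface that `import_dir` actually expects.

```python
from datetime import datetime, timezone


def example_line_parser(line):
    """Hypothetical per-line parser: 'epoch_ms,v1,v2,...' -> [datetime, float, ...]."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 2:
        raise ValueError("expected a timestamp and at least one value: %r" % line)
    # Assumed layout: the first column is a Unix timestamp in milliseconds.
    timestamp = datetime.fromtimestamp(int(fields[0]) / 1000.0, tz=timezone.utc)
    # Assumed layout: every remaining column is numeric.
    values = [float(f) for f in fields[1:]]
    return [timestamp] + values


print(example_line_parser("1573000000000,0.25,0.75\n"))
```

Keeping the parser a pure function of one line makes it easy to test on a few sample lines before running a full import.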

## Create CerebralCortex object

In [None]:
CC = Kernel("/home/md2k/cc_conf/")

## View Imported Data

In [None]:
iot_stream = CC.get_stream("iot-data-stream")

# Data
iot_stream.show(truncate=False)

# Metadata
iot_stream.metadata

## How to write an algorithm
This section provides an example of how to write a simple smoothing algorithm and apply it to the data that was just imported.

### Import the necessary modules

In [None]:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructField, StructType, StringType, FloatType, TimestampType, IntegerType
from pyspark.sql.functions import minute, second, mean, window
from pyspark.sql import functions as F
import numpy as np

### Define Schema
This schema defines what the computation module will return to the execution context.

In [None]:
schema = StructType([
    StructField("smooth_vals",  FloatType())
])

### Write a user defined function
The user defined function (UDF) is one of two mechanisms available for distributed data processing within the Apache Spark framework.  

The `F.udf` Python decorator sets the `schema` defined above as the return type of the decorated function.  The function, `smooth_algo`, accepts a list of values, `vals`, and any Python-based operations can be run over this data window to produce the data defined in the schema.  In this case, we are computing a simple windowed average.

In [None]:
@F.udf(schema)
def smooth_algo(vals):
    return [sum(vals)/len(vals)]
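
Because the body of `smooth_algo` is ordinary Python, the same logic can be checked on a plain list before it is handed to Spark. The sketch below is a standalone variant, not part of the notebook: the name `window_mean` is hypothetical, and the handling of empty windows and missing samples is an added assumption that the original function does not include.

```python
def window_mean(vals):
    """Average of one window, returned as a one-element list (one entry per schema field)."""
    # Added assumption: drop missing samples so a single None does not break the sum.
    clean = [float(v) for v in vals if v is not None]
    if not clean:
        # Added assumption: an empty window yields a null instead of a division by zero.
        return [None]
    return [sum(clean) / len(clean)]


print(window_mean([1.0, 2.0, 3.0, 6.0]))  # [3.0]
print(window_mean([]))                    # [None]
```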

## Run smoothing algorithm on imported data
The smoothing algorithm is applied to the datastream by calling the `run_algorithm` method and passing the function as a parameter, along with the columns (here `some_vals`) that should be sent to the algorithm.  Finally, the `windowDuration` parameter specifies the size of the time windows into which the data is segmented before the algorithm is applied.  Notice that when the next cell is run, the operation completes nearly instantaneously.  This is due to the lazy evaluation of the Spark framework.  When you run the following cell to show the data, the algorithm is applied to the whole dataset before the results are shown on the screen.

In [None]:
smooth_stream = iot_stream.run_algorithm(smooth_algo, columnNames=["some_vals"], windowDuration=10)

In [None]:
smooth_stream.show(truncate=False)
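
To make the windowing step concrete, the following plain-Python sketch groups timestamped samples into fixed windows and averages each one. It is only an approximation of what `run_algorithm` does: it assumes that `windowDuration` is expressed in seconds and that windows are non-overlapping and aligned to multiples of the duration, and the name `smooth_by_window` and the sample values are invented for illustration.

```python
from collections import defaultdict


def smooth_by_window(samples, window_seconds=10):
    """Group (epoch_seconds, value) pairs into fixed windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # The window start is the timestamp rounded down to a multiple of the window size.
        start = int(ts // window_seconds) * window_seconds
        buckets[start].append(value)
    # One (window_start, mean) pair per window, in time order.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


samples = [(0, 1.0), (4, 3.0), (9, 5.0), (10, 10.0), (15, 20.0), (21, 7.0)]
print(smooth_by_window(samples))
# [(0, 3.0), (10, 15.0), (20, 7.0)]
```

Unlike the Spark version, this sketch runs eagerly, so the result is computed as soon as the function is called.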

## Visualize data
These two plots show the original data and the smoothed data, so you can visually check how the algorithm transforms the data.

In [None]:
iot_stream.plot()

In [None]:
smooth_stream.plot()