# Introduction to greenflow

**greenflow** is a set of open-source examples for Quantitative Analysis tasks:
- Data preparation & feat. engineering
- Alpha seeking modeling
- Technical indicators
- Backtesting

It is GPU-accelerated by leveraging [**RAPIDS.ai**](https://rapids.ai) technology, and has Multi-GPU and Multi-Node support.

greenflow computing components are oriented around its plugins and task graph.

## Download example datasets

Before getting started, let's download the example datasets if not present.

In [1]:
! ((test ! -f './data/stock_price_hist.csv.gz' ||  test ! -f './data/security_master.csv.gz') && \
  cd .. && bash download_data.sh) || echo "Dataset is already present. No need to re-download it."

Dataset is already present. No need to re-download it.


## About this notebook

In this tutorial, we are going to use greenflow to do a simple quant job. The job tasks are listed below:
    1. load csv stock data.
    2. filter out the stocks that has average volume smaller than 50.
    3. sort the stock symbols and datetime.
    4. add rate of return as a feature into the table.
    5. in two branches, computethe mean volume and mean return.
    6. read the file containing the stock symbol names, and join the computed dataframes.
    7. output the result in csv files.
    
## TaskGraph playground

Run the following greenflow code to start a empty TaskGraph where computation graph can be created. You can follow the steps as listed below.

In [2]:
import sys; sys.path.insert(0, '..')
from greenflow.dataframe_flow import TaskGraph
task_graph = TaskGraph()
task_graph.draw()

GreenflowWidget(sub=HBox())

## Step by Step to build your first task graph

### Create Task node to load the included stock csv file 
<img src="images/loader_csv.gif" align="center">

### Explore the data and visualize it
<img src='images/explore_data.gif' align='center'>

### Clean up the Task nodes for next steps
<img src='images/clean.gif' align='center'>

### Filter the data and compute the rate of return feature
<img src='images/get_return_feature.gif' align='center'>

### Save current TaskGraph for a composite Task node
<img src='images/add_composite_node.gif' align='center'>

### Clean up the redudant feature computation Task nodes
<img src='images/clean_up_feature.gif' align='center'>

### Compute the averge volume and returns 
<img src='images/average.gif' align='center'>

### Dump the dataframe to csv files
<img src='images/csv_out.gif' align='center'>

Just in case you cannnot follow along, here you can load the tutorial taskgraph from the file. First one is the graph to calculate the return feature. 

In [3]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/get_return_feature.gq.yaml')
task_graph.draw()

GreenflowWidget(sub=HBox(), value=[OrderedDict([('id', 'stock_data'), ('type', 'CsvStockLoader'), ('conf', {'f…

Load the full graph and click on the `run` button to see the result

In [4]:
task_graph = TaskGraph.load_taskgraph('../taskgraphs/tutorial_intro.gq.yaml')
task_graph.draw()

GreenflowWidget(sub=HBox(), value=[OrderedDict([('id', 'stock_data'), ('type', 'CsvStockLoader'), ('conf', {'f…

## About Task graphs, nodes and plugins

Quant processing operators are defined as nodes that operates on **cuDF**/**dask_cuDF** dataframes.

A **task graph** is a list of tasks composed of greenflow nodes.

The cell below contains the task graph described before.

In [5]:
import warnings; warnings.simplefilter("ignore")
csv_average_return = 'average_return.csv'
csv_average_volume = 'average_volume.csv'
csv_file_path = './data/stock_price_hist.csv.gz'
csv_name_file_path = './data/security_master.csv.gz'
from greenflow.dataframe_flow import TaskSpecSchema 

# load csv stock data
task_csvdata = {
    TaskSpecSchema.task_id: 'stock_data',
    TaskSpecSchema.node_type: 'CsvStockLoader',
    TaskSpecSchema.conf: {'file': csv_file_path},
    TaskSpecSchema.inputs: {},
    TaskSpecSchema.module: "greenflow_gquant_plugin.dataloader"
}

# filter out the stocks that has average volume smaller than 50
task_minVolume = {
    TaskSpecSchema.task_id: 'volume_filter',
    TaskSpecSchema.node_type: 'ValueFilterNode',
    TaskSpecSchema.conf: [{'min': 50.0, 'column': 'volume'}],
    TaskSpecSchema.inputs: {'in': 'stock_data.cudf_out'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

# sort the stock symbols and datetime
task_sort = {
    TaskSpecSchema.task_id: 'sort_node',
    TaskSpecSchema.node_type: 'SortNode',
    TaskSpecSchema.conf: {'keys': ['asset', 'datetime']},
    TaskSpecSchema.inputs: {'in': 'volume_filter.out'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

# add rate of return as a feature into the table
task_addReturn = {
    TaskSpecSchema.task_id: 'add_return_feature',
    TaskSpecSchema.node_type: 'ReturnFeatureNode',
    TaskSpecSchema.conf: {},
    TaskSpecSchema.inputs: {'stock_in': 'sort_node.out'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

# read the stock symbol name file and join the computed dataframes
task_stockSymbol = {
    TaskSpecSchema.task_id: 'stock_name',
    TaskSpecSchema.node_type: 'StockNameLoader',
    TaskSpecSchema.conf: {'file': csv_name_file_path },
    TaskSpecSchema.inputs: {},
    TaskSpecSchema.module: "greenflow_gquant_plugin.dataloader"
}

# In two branches, compute the mean volume and mean return seperately
task_volumeMean = {
    TaskSpecSchema.task_id: 'average_volume',
    TaskSpecSchema.node_type: 'AverageNode',
    TaskSpecSchema.conf: {'column': 'volume'},
    TaskSpecSchema.inputs: {'stock_in': 'add_return_feature.stock_out'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

task_returnMean = {
    TaskSpecSchema.task_id: 'average_return',
    TaskSpecSchema.node_type: 'AverageNode',
    TaskSpecSchema.conf: {'column': 'returns'},
    TaskSpecSchema.inputs: {'stock_in': 'add_return_feature.stock_out'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

task_leftMerge1 = {
    TaskSpecSchema.task_id: 'left_merge1',
    TaskSpecSchema.node_type: 'LeftMergeNode',
    TaskSpecSchema.conf: {'column': 'asset'},
    TaskSpecSchema.inputs: {'left': 'average_return.stock_out', 
                            'right': 'stock_name.stock_name'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

task_leftMerge2 = {
    TaskSpecSchema.task_id: 'left_merge2',
    TaskSpecSchema.node_type: 'LeftMergeNode',
    TaskSpecSchema.conf: {'column': 'asset'},
    TaskSpecSchema.inputs: {'left': 'average_volume.stock_out', 
                            'right': 'stock_name.stock_name'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.transform"
}

# output the result in csv files

task_outputCsv1 = {
    TaskSpecSchema.task_id: 'output_csv1',
    TaskSpecSchema.node_type: 'OutCsvNode',
    TaskSpecSchema.conf: {'path': csv_average_return},
    TaskSpecSchema.inputs: {'df_in': 'left_merge1.merged'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.analysis"
}

task_outputCsv2 = {
    TaskSpecSchema.task_id: 'output_csv2',
    TaskSpecSchema.node_type: 'OutCsvNode',
    TaskSpecSchema.conf: {'path': csv_average_volume },
    TaskSpecSchema.inputs: {'df_in': 'left_merge2.merged'},
    TaskSpecSchema.module: "greenflow_gquant_plugin.analysis"
}

In Python, a greenflow task-spec is defined as a dictionary with the following fields:
- `id`
- `type`
- `conf`
- `inputs`
- `filepath`
- `module`

As a best practice, we recommend using the `TaskSpecSchema` class for these fields, instead of strings.

The `id` for a given task must be unique within a task graph. To use the result(s) of other task(s) as input(s) of a different task, we use the id(s) of the former task(s) in the `inputs` field of the next task.

The `type` field contains the node type to use for the compute task. greenflow includes a collection of node classes. These can be found in `greenflow.plugin_nodes`. Click [here](#node_class_example) to see a greenflow node class example.

The `conf` field is used to parameterise a task. It lets you access user-set parameters within a plugin (such as `self.conf['min']` in the example above). Each node defines the `conf` json schema. The greenflow UI can use this schema to generate the proper form UI for the inputs. It is recommended to use the UI to configure the `conf`. 

The `filepath` field is used to specify a python module where a custom plugin is defined. It is optional if the plugin is in `plugin_nodes` directory, and mandatory when the plugin is somewhere else. In a different tutorial, we will learn how to create custom plugins.

The `module` is optional to tell greenflow the name of module that the node type is from. If it is not specified, greenflow will search for it among all the customized modules. 

A custom node schema will look something like this:
```
custom_task = {
    TaskSpecSchema.task_id: 'custom_calc',
    TaskSpecSchema.node_type: 'CustomNode',
    TaskSpecSchema.conf: {},
    TaskSpecSchema.inputs: ['some_other_node'],
    TaskSpecSchema.filepath: 'custom_nodes.py'
}
```

Below, we compose our task graph and visualize it as a graph.

In [6]:
from greenflow.dataframe_flow import TaskGraph

# list of nodes composing the task graph
task_list = [
    task_csvdata, task_minVolume, task_sort, task_addReturn,
    task_stockSymbol, task_volumeMean, task_returnMean,
    task_leftMerge1, task_leftMerge2,
    task_outputCsv1, task_outputCsv2]

task_graph = TaskGraph(task_list)
task_graph.draw()

GreenflowWidget(sub=HBox(), value=[OrderedDict([('id', 'stock_data'), ('type', 'CsvStockLoader'), ('conf', {'f…

We will use `save_taskgraph` method to save the task graph to a **yaml file**.

That will allow us to re-use it in the future.

In [7]:
task_graph_file_name = '01_tutorial_task_graph.gq.yaml'

task_graph.save_taskgraph(task_graph_file_name)

Here is a snippet of the content in the resulting yaml file:

In [8]:
%%bash -s "$task_graph_file_name"
head -n 19 $1

- id: stock_data
  type: CsvStockLoader
  conf:
    file: ./data/stock_price_hist.csv.gz
  inputs: {}
  module: greenflow_gquant_plugin.dataloader
- id: volume_filter
  type: ValueFilterNode
  conf:
  - column: volume
    min: 50.0
  inputs:
    in: stock_data.cudf_out
  module: greenflow_gquant_plugin.transform
- id: sort_node
  type: SortNode
  conf:
    keys:
    - asset


The yaml file describes the computation tasks.  We can load it and visualize it as a graph.

In [9]:
task_graph = TaskGraph.load_taskgraph(task_graph_file_name)
task_graph.draw()

GreenflowWidget(sub=HBox(), value=[OrderedDict([('id', 'stock_data'), ('type', 'CsvStockLoader'), ('conf', {'f…

## Building a task graph

Running the task graph is the next logical step. Nevertheless, it can optionally be built before running it.

By calling `build` method, the graph is traversed without running the dataframe computations. This could be useful to inspect the column names and types, validate that the plugins can be instantiated, and check for errors.

The output of `build` are instances of each task in a dictionary.

In the example below, we inspect the column names and types for the inputs and outputs of the `left_merge1` task:

In [10]:
from pprint import pprint

task_graph.build()

print('Output of build task graph are instances of each task in a dictionary:\n')
print(str(task_graph))

Output of build task graph are instances of each task in a dictionary:

stock_data: <NodeInTaskGraph greenflow_gquant_plugin.dataloader.csvStockLoader.CsvStockLoader object at 0x7faf857b7d00>
volume_filter: <NodeInTaskGraph greenflow_gquant_plugin.transform.valueFilterNode.ValueFilterNode object at 0x7faf857b7d30>
sort_node: <NodeInTaskGraph greenflow_gquant_plugin.transform.sortNode.SortNode object at 0x7faf8575d6a0>
add_return_feature: <NodeInTaskGraph greenflow_gquant_plugin.transform.returnFeatureNode.ReturnFeatureNode object at 0x7faf8575d730>
stock_name: <NodeInTaskGraph greenflow_gquant_plugin.dataloader.stockNameLoader.StockNameLoader object at 0x7faf8575d9d0>
average_volume: <NodeInTaskGraph greenflow_gquant_plugin.transform.averageNode.AverageNode object at 0x7faf857b7ee0>
average_return: <NodeInTaskGraph greenflow_gquant_plugin.transform.averageNode.AverageNode object at 0x7fb0d953c730>
left_merge1: <NodeInTaskGraph greenflow_gquant_plugin.transform.leftMergeNode.LeftMergeNo

In [11]:
# output meta in 'left_merge_1' node

print('output meta in outgoing dataframe:\n')
pprint(task_graph['left_merge1'].meta_setup())

output meta in outgoing dataframe:

MetaData(inports={'left': {}, 'right': {}}, outports={'merged': {'asset': 'int64', 'returns': 'float64', 'asset_name': 'object'}})


## Running a task graph

To execute the graph computations, we will use the `run` method. If the `Output_Collector` task node is not added to the graph, a output list can be feeded to the run method. The result can be displayed in a rich mode if the `formated` argument is turned on.

`run` can also takes an optional `replace` argument which is used and explained later on

In [12]:
outputs = ['stock_data.cudf_out', 'output_csv1.df_out', 'output_csv2.df_out']
task_graph.run(outputs=outputs, formated=True)

Tab(children=(Output(), Output(), Output(), Output(layout=Layout(border='1px solid black'), outputs=({'output_…

The result can be used as a tuple or dictionary.

In [13]:
result = task_graph.run(outputs=outputs)
csv_data_df, csv_1_df, csv_2_df = result
result['output_csv2.df_out']

Unnamed: 0,asset,volume,asset_name
0,24568,203.002419,ORN
1,2557,429.953169,EMMS
2,4142,487.567188,RIGL
3,869369,172.961884,IBP
4,705684,107.933333,USMD
...,...,...,...
4995,4725,1473.216105,UTSI
4996,4136,81.733333,RGCO
4997,795091,73.178571,FRBA
4998,3811,230.635886,ORIT


We can profile each of the computation node running time by turning on the profiler.

In [14]:
outputs =  ['stock_data.cudf_out', 'output_csv1.df_out', 'output_csv2.df_out']
csv_data_df, csv_1_df, csv_2_df = task_graph.run(outputs=outputs, profile=True)

id:stock_data process time:3.168s
id:volume_filter process time:0.021s
id:sort_node process time:0.102s
id:add_return_feature process time:0.058s
id:average_volume process time:0.013s
id:average_return process time:0.014s
id:stock_name process time:0.008s
id:left_merge1 process time:0.002s
id:output_csv1 process time:0.015s
id:left_merge2 process time:0.002s
id:output_csv2 process time:0.014s


Where most of the time is spent on the csv file processing. This is because we have to convert the time string to the proper format via CPU. Let's inspect the content of `csv_1_df` and `csv_2_df`.

In [15]:
print('csv_1_df content:')
print(csv_1_df)

print('\ncsv_2_df content:')
print(csv_2_df)   

csv_1_df content:
       asset   returns asset_name
0     869301 -0.005854      VNRBP
1       3159  0.000315       ISBC
2       8044  0.000516        SGU
3       2123  0.000801       CGNX
4      22873 -0.001068       RENN
...      ...       ...        ...
4995    3518  0.001136       MPWR
4996  707774 -0.000417       MODN
4997    4856  0.000979       WIRE
4998   22461 -0.000243         MY
4999    1973 -0.002916       BOCH

[5000 rows x 3 columns]

csv_2_df content:
       asset       volume asset_name
0      24568   203.002419        ORN
1       2557   429.953169       EMMS
2       4142   487.567188       RIGL
3     869369   172.961884        IBP
4     705684   107.933333       USMD
...      ...          ...        ...
4995  869374   279.946042       WATT
4996  701990   302.973772       FRGI
4997   24636   136.807107       SVVC
4998    6190  2069.864690        FNF
4999   24153   887.397596        DAR

[5000 rows x 3 columns]


Also, please notice that two resulting csv files has been created:
- average_return.csv
- average_volume.csv

In [16]:
print('\ncsv files created:')
!find . -iname "*symbol*" 


csv files created:


## Subgraphs

A nice feature of task graphs is that we can evaluate any **subgraph**. For instance, if you are only interested in the `average volume` result, you can run only the tasks which are relevant for that computation.

If we would not want to re-run tasks, we could also use the `replace` argument of the `run` function with a `load` option.

The `replace` argument needs to be a dictionary where each key is the task/node id. The values are a replacement task-spec dictionary (i.e. each key is a spec overload, and its value is what to overload with).

In the example below, instead of re-running the `stock_data` node to load a csv file into a `cudf` dataframe, we will use its dataframe output to load from it.

In [17]:
replace = {
    'stock_data': {
        'load': {
            'cudf_out': csv_data_df
        },
        'save': True
    }
}

(volume_mean_df, ) = task_graph.run(outputs=['average_volume.stock_out'],
                                    replace=replace)

print(volume_mean_df)

       asset       volume
0      22705    67.929114
1     869315   151.844770
2       2526    88.337888
3       3939    91.674194
4     705893  8616.574853
...      ...          ...
4995  869571   639.127042
4996    7842   709.077851
4997  701570   110.977778
4998  701705   970.310847
4999    4859   143.615344

[5000 rows x 2 columns]


As a convenience, we can save on disk the checkpoints for any of the nodes, and re-load them if needed. It is only needed to set the save option to `True`. This step will take a while depends on the disk IO speed.

In the example above, the `replace` spec directs `run` to save on disk for the `stock_data`. If `load` was boolean then the data would be loaded from disk presuming the data was saved to disk in a prior run.

The default directory for saving is `<current_workdir>/.cache/<node_id>.hdf5`.

`replace` is also used to override parameters in the tasks. For instance, if we wanted to use the value `40.0` instead `50.0` in the task `volume_filter`, we would do something similar to:
```
replace_spec = {
    'volume_filter': {
        'conf': {
            'min': 40.0
        }
    },
    'some_task': etc...
}
```

In [18]:
replace = {'stock_data': {'load': True},
           'average_return': {'save': True}}


(return_mean_df, ) = task_graph.run(outputs=['average_return.stock_out'], replace=replace)

print('Return mean Dataframe:\n')
print(return_mean_df)

Return mean Dataframe:

       asset   returns
0      22705  0.001691
1     869315  0.000701
2       2526  0.002374
3       3939  0.052447
4     705893  0.000790
...      ...       ...
4995  869571 -0.002908
4996    7842  0.000698
4997  701570 -0.004115
4998  701705  0.002157
4999    4859  0.008666

[5000 rows x 2 columns]


Now, we might want to load the `return_mean_df` from the saved file and evaluate only tasks that we are interested in.

In the cells below, we compare different load approaches:
- in-memory,
- from disk, 
- and not loading at all.

When working interactively, or in situations requiring iterative and explorative task graphs, a significant amount of time is saved by just re-loading the data that do not require to be recalculated.

In [19]:
%%time
print('Using in-memory dataframes for load:')

replace = {'stock_data': {'load':  {
            'cudf_out': csv_data_df
            }},
           'average return': {'load': 
                              {'stock_out': return_mean_df}}
          }

_ = task_graph.run(outputs=['output_csv2.df_out'], replace=replace)

Using in-memory dataframes for load:
CPU times: user 156 ms, sys: 76.8 ms, total: 233 ms
Wall time: 263 ms


In [20]:
%%time
print('Using cached dataframes on disk for load:')

replace = {'stock_data': {'load': True},
           'average return': {'load': True}}

_ = task_graph.run(outputs=['output_csv2.df_out'], replace=replace)

Using cached dataframes on disk for load:
CPU times: user 2.52 s, sys: 488 ms, total: 3.01 s
Wall time: 3.02 s


In [21]:
%%time
print('Re-running dataframes calculations instead of using load:')

replace = {'stock_data': {'load': True}}

_ = task_graph.run(outputs=['output_csv2.df_out'], replace=replace)

Re-running dataframes calculations instead of using load:
CPU times: user 2.49 s, sys: 556 ms, total: 3.04 s
Wall time: 3.05 s


An idiomatic way to save data, if not on disk, or load data, if present on disk, is demonstrated below.

In [22]:
%%time
import os

loadsave_csv_data    = 'load' if os.path.isfile('./.cache/stock_data.hdf5') else 'save'
loadsave_return_mean = 'load' if os.path.isfile('./.cache/average_return.hdf5') else 'save'

replace = {'stock_data': {loadsave_csv_data: True},
           'average_return': {loadsave_return_mean: True}}

_ = task_graph.run(outputs=['output_csv2.df_out'], replace=replace)

CPU times: user 2.52 s, sys: 459 ms, total: 2.98 s
Wall time: 3.01 s


## Delete temporary files

A few cells above, we generated a .yaml file containing the example task graph, and also a couple of CSV files.

Let's keep our directory clean, and delete them.

In [23]:
%%bash -s "$task_graph_file_name"  "$csv_average_return" "$csv_average_volume" 
rm -f $1 $2 $3

<a id='node_class_example'></a>

---

## Node class example

Implementing custom nodes in greenflow is very straighforward.

Data scientists only need to override five methods in the parent class `Node`:
- `init`
- `meta_setup`
- `ports_setup`
- `conf_schema`
- `process`

`init` method is usually used to define the required column names

`ports_setup` defines the input and output ports for the node

`meta_setup` method is used to calculate the output meta name and types.

`conf_schema` method is used to define the JSON schema for the node conf so the client can generate the proper UI for it.

`process` method takes input dataframes and computes the output dataframe. 

In this way, dataframes are strongly typed, and errors can be detected early before the time-consuming computation happens.

Below, it can be observed `ValueFilterNode` implementation details:

In [24]:
import inspect
from greenflow_gquant_plugin.transform import ValueFilterNode

print(inspect.getsource(ValueFilterNode))

class ValueFilterNode(_PortTypesMixin, Node):

    def init(self):
        _PortTypesMixin.init(self)
        self.INPUT_PORT_NAME = 'in'
        self.OUTPUT_PORT_NAME = 'out'
        port_type = PortsSpecSchema.port_type
        self.port_inports = {
            self.INPUT_PORT_NAME: {
                port_type: [
                    "pandas.DataFrame", "cudf.DataFrame",
                    "dask_cudf.DataFrame", "dask.dataframe.DataFrame"
                ]
            },
        }
        self.port_outports = {
            self.OUTPUT_PORT_NAME: {
                port_type: "${port:in}"
            }
        }
        cols_required = {"asset": "int64"}
        addition = {}
        self.meta_inports = {
            self.INPUT_PORT_NAME: cols_required
        }
        self.meta_outports = {
            self.OUTPUT_PORT_NAME: {
                self.META_OP: self.META_OP_ADDITION,
                self.META_REF_INPUT: self.INPUT_PORT_NAME,
                self.META_DATA: addition
      

In [25]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}