
[REVIEW] release HRP greenflow plugin #143

Merged: 6 commits merged into NVIDIA:develop on Sep 15, 2021

Conversation

@yidong72 (Collaborator) commented Sep 3, 2021

Release the code for the HRP project.

@yidong72 changed the base branch from main to develop on September 3, 2021 13:33

@avolkov1 (Contributor) left a comment

One of the unit tests failed.

test_max_drawdown (unit.test_max_drawdown.TestMaxDrawdown) ... FAIL

Minor flake8 warnings:

$ flake8 greenflow_hrp_plugin/
greenflow_hrp_plugin/__init__.py:41:79: W291 trailing whitespace
greenflow_hrp_plugin/kernels.py:592:80: E501 line too long (81 > 79 characters)
greenflow_hrp_plugin/loadCsvNode.py:109:13: F841 local variable 'client' is assigned to but never used
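
As a side note on the F841 warning, a minimal, hypothetical sketch of the fix (the actual loadCsvNode.py code differs; names below are illustrative): if the local variable holds a dask Client that is never read afterwards, the unused binding can simply be dropped, or the line removed entirely if the client is created elsewhere.

# Hypothetical sketch of the F841 fix only; not the plugin's actual code.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1)

# Before (flagged by flake8 F841): client = Client(cluster)
# After: keep the side effect of creating/registering the client,
# without the unused local binding.
Client(cluster)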

I'm not done reviewing yet; some comments are just remarks. For now, the failing unit test above should be fixed.

Review threads on gQuant/plugins/hrp_plugin/README.md and gQuant/plugins/hrp_plugin/docker/Dockerfile (outdated, resolved).
@yidong72 (Collaborator, Author) commented Sep 9, 2021

Addressed all the comments. RAPIDS 21.08 breaks the dask_cudf DataFrame type; RAPIDS 21.06 is fine. The from_delayed call needs to be fixed in the future so that it produces the proper type.
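
A hedged sketch of the kind of fix meant here, assuming the issue is that from_delayed needs an explicit cudf meta so the resulting collection keeps the intended cudf-backed (dask_cudf) type; file names and columns below are illustrative, not the plugin's actual code:

# Minimal sketch: pass an explicit cudf meta to from_delayed so the resulting
# dask collection is built from cudf partitions. Illustrative only.
import cudf
import dask.dataframe as dd
from dask import delayed

@delayed
def load_partition(path):
    # Each delayed task returns one cudf.DataFrame partition.
    return cudf.read_csv(path)

paths = ["prices_part0.csv", "prices_part1.csv"]  # hypothetical input files
meta = cudf.DataFrame({"asset": cudf.Series([], dtype="int64"),
                       "price": cudf.Series([], dtype="float64")})
ddf = dd.from_delayed([load_partition(p) for p in paths], meta=meta)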

@avolkov1 (Contributor) left a comment

There's a slight nuisance in the notebooks. Some of the workflows produce a warning that a port is missing in the incoming metadata:

Task "max_drawdown_feature" Node Type "[<class 'greenflow_hrp_plugin.featureNode.FeatureNode'>]" missing required port "feature_df" in incoming meta data. Should the port be connected?
Task "all_max_drawdown_feature" Node Type "[<class 'greenflow_hrp_plugin.featureNode.FeatureNode'>]" missing required port "feature_df" in incoming meta data. Should the port be connected?

The reason is that the node greenflow_hrp_plugin.featureNode.FeatureNode has a port feature_df that is optional, i.e. it is not required to be connected (but if it is connected, its data is used).
For such a scenario there is a port spec flag called "optional" (PortsSpecSchema.optional: True). I tweaked the code to suppress the warning in this case. These are the changes I made:

diff --git a/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py b/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
index 8a8da3f..ba6b6f1 100644
--- a/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
+++ b/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
@@ -40,7 +40,8 @@ class FeatureNode(TemplateNodeMixin, Node):
                 port_type: [
                     "pandas.DataFrame", "cudf.DataFrame",
                     "dask_cudf.DataFrame", "dask.dataframe.DataFrame"
-                ]
+                ],
+                PortsSpecSchema.optional: True
             }
         }
         port_outports = {
diff --git a/greenflow/greenflow/dataframe_flow/_node_flow.py b/greenflow/greenflow/dataframe_flow/_node_flow.py
index 6fda95f..749c64f 100644
--- a/greenflow/greenflow/dataframe_flow/_node_flow.py
+++ b/greenflow/greenflow/dataframe_flow/_node_flow.py
@@ -858,6 +858,10 @@ class NodeTaskGraphMixin(NodeTaskGraphExtensionMixin):
                 dy = PortsSpecSchema.dynamic
                 if inports[iport].get(dy, False):
                     continue
+
+                if inports[iport].get(PortsSpecSchema.optional, False):
+                    continue
+
                 # Is it possible that iport not connected? If so iport should
                 # not be in required. Should raise an exception here.
                 warn_msg = \
diff --git a/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py b/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
index 0363df0..9be1fc4 100644
--- a/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
+++ b/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
@@ -1,3 +1,4 @@
+from copy import deepcopy
 from .portsSpecSchema import (PortsSpecSchema, NodePorts)
 from .metaSpec import (MetaDataSchema, MetaData)
 
@@ -56,12 +57,10 @@ class NodeTaskGraphExtensionMixin:
         dynamic = None
         inports = {}
         for input_port in port_inports:
+            inports[input_port] = deepcopy(port_inports[input_port])
             if input_port in input_connections:
                 determined_type = input_connections[input_port]
-                inports[input_port] = {port_type: determined_type}
-            else:
-                types = port_inports[input_port][port_type]
-                inports[input_port] = {port_type: types}
+                inports[input_port].update({port_type: determined_type})
 
             if dy in port_inports[input_port]:
                 inports[input_port][dy] = True

I set that port as optional in featureNode.py. Then I had to tweak the logic a bit in _node_flow.py:_validate_connected_metadata, and _node_taskgraph_extension_mixin.py:ports_setup_ext to account for the PortsSpecSchema.optional case.

I did not test this thoroughly, but the notebooks seem to run fine and the warning is now correctly handled.

What do you think of making these updates?

@avolkov1 (Contributor) left a comment

I finished reviewing. There's a bug when drawing the taskgraph with ports in ipynb mode.

taskGraph.draw(show='ipynb', show_ports=True)

The draw method is not called with the show_ports option in the notebooks, but it can nevertheless be fixed via:

diff --git a/greenflow/greenflow/dataframe_flow/taskGraph.py b/greenflow/greenflow/dataframe_flow/taskGraph.py
index 70149b1..66fc0ff 100644
--- a/greenflow/greenflow/dataframe_flow/taskGraph.py
+++ b/greenflow/greenflow/dataframe_flow/taskGraph.py
@@ -353,7 +353,7 @@ class TaskGraph(object):
 
                 if (to_type == OUTPUT_TYPE):
                     continue
-                task_node = get_node_obj(itask)
+                task_node = get_node_obj(itask, tgraph_mixin=True)

I had trouble running through the 17assets.ipynb notebook. Even after reducing the bootstrap samples I would run out of memory on the GPU. I had to restart the notebook kernel and run each section individually; that way I was able to run through the whole notebook. I suspect a bug in cudf.merge, as that method seems to cause a memory leak on the GPUs.
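
A minimal sketch of how the suspected leak can be tracked between notebook sections, assuming the pynvml package is available in the container (this is an illustration, not code from this PR):

# Print current GPU memory usage; run between sections to watch for growth.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory used: {info.used / 1024**2:.0f} MiB "
      f"of {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()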

My suggestion would be to split the "17assets.ipynb" notebook into several notebooks covering one or two sections each. Name them something like "17assets_1of5.ipynb", "17assets_2of5.ipynb", and so on, or whatever makes sense. That would make it easier to run one notebook, clean up or shut down its Python kernel, run the next notebook, and so on.

Let me know whether you'd like to make these changes. I'll then resolve this comment and approve/merge the PR.

Thanks

@yidong72 (Collaborator, Author), replying to the review comment above:

I have done all the changes. I added the following lines to restart the kernel, which seems to fix the memory issue:

import IPython

# Shut down and restart the current notebook kernel to release GPU memory.
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Could you give it a try?

@yidong72 (Collaborator, Author) commented

ready for review

@avolkov1 (Contributor) left a comment

Thank you. This looks good. I was able to run the notebook. I had to be a little patient after running the kernel shutdown cell. Otherwise this worked.

Did you re-run the 17assets.ipynb notebook completely? In the version posted on GitHub (https://github.com/yidong72/gQuant/blob/branch-hrp/gQuant/plugins/hrp_plugin/notebooks/17assets.ipynb), the sections after "Hyperparameter Optimization" still show the warning that was fixed. Running it myself, none of those warnings appeared anymore.

@yidong72 (Collaborator, Author) commented

I added your 17assets notebook and the synthetic 10-stock dataset.

@avolkov1 (Contributor) left a comment

Looks great. Thanks!

@avolkov1 merged commit 717affe into NVIDIA:develop on Sep 15, 2021
@yidong72 deleted the branch-hrp branch on September 15, 2021 20:46