
[REVIEW] release HRP greenflow plugin #143

Merged: 6 commits merged into NVIDIA:develop on Sep 15, 2021

Conversation

@yidong72 (Collaborator) commented Sep 3, 2021

Release the code for the HRP project.

@yidong72 changed the base branch from main to develop on September 3, 2021 13:33

@avolkov1 (Contributor) left a comment

One of the unit tests failed.

test_max_drawdown (unit.test_max_drawdown.TestMaxDrawdown) ... FAIL

Minor flake8 warnings:

$ flake8 greenflow_hrp_plugin/
greenflow_hrp_plugin/__init__.py:41:79: W291 trailing whitespace
greenflow_hrp_plugin/kernels.py:592:80: E501 line too long (81 > 79 characters)
greenflow_hrp_plugin/loadCsvNode.py:109:13: F841 local variable 'client' is assigned to but never used
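
As a side note on the F841 warning, a minimal, hypothetical sketch of the fix (the actual loadCsvNode.py code differs; names below are illustrative): if the local variable holds a dask Client that is never read afterwards, the unused binding can simply be dropped, or the line removed entirely if the client is created elsewhere.

# Hypothetical sketch of the F841 fix only; not the plugin's actual code.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1)

# Before (flagged by flake8 F841): client = Client(cluster)
# After: keep the side effect of creating/registering the client,
# without the unused local binding.
Client(cluster)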

I'm not done reviewing yet; some comments are just remarks. For now, the failing unit test above should be fixed.

Review threads on gQuant/plugins/hrp_plugin/README.md and gQuant/plugins/hrp_plugin/docker/Dockerfile (outdated, resolved).
@yidong72 (Collaborator, Author) commented Sep 9, 2021

Addressed all the comments. RAPIDS 21.08 breaks the dask_cudf DataFrame type; RAPIDS 21.06 is fine. The from_delayed call needs to be fixed in the future so that it produces the proper type.
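
A hedged sketch of the kind of fix meant here, assuming the issue is that from_delayed needs an explicit cudf meta so the resulting collection keeps the intended cudf-backed (dask_cudf) type; file names and columns below are illustrative, not the plugin's actual code:

# Minimal sketch: pass an explicit cudf meta to from_delayed so the resulting
# dask collection is built from cudf partitions. Illustrative only.
import cudf
import dask.dataframe as dd
from dask import delayed

@delayed
def load_partition(path):
    # Each delayed task returns one cudf.DataFrame partition.
    return cudf.read_csv(path)

paths = ["prices_part0.csv", "prices_part1.csv"]  # hypothetical input files
meta = cudf.DataFrame({"asset": cudf.Series([], dtype="int64"),
                       "price": cudf.Series([], dtype="float64")})
ddf = dd.from_delayed([load_partition(p) for p in paths], meta=meta)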

@avolkov1 (Contributor) left a comment

There's a slight nuisance in the notebooks. Some of the workflows produce a warning that a port is missing in the incoming metadata:

Task "max_drawdown_feature" Node Type "[<class 'greenflow_hrp_plugin.featureNode.FeatureNode'>]" missing required port "feature_df" in incoming meta data. Should the port be connected?
Task "all_max_drawdown_feature" Node Type "[<class 'greenflow_hrp_plugin.featureNode.FeatureNode'>]" missing required port "feature_df" in incoming meta data. Should the port be connected?

The reason is that the node greenflow_hrp_plugin.featureNode.FeatureNode has a port feature_df that is optional, i.e. it is not required to be connected (but if it is connected, its data is used).
For such a scenario there is a port spec flag called "optional" (PortsSpecSchema.optional: True). I tweaked the code to suppress the warning in this case. These are the changes I made:

diff --git a/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py b/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
index 8a8da3f..ba6b6f1 100644
--- a/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
+++ b/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
@@ -40,7 +40,8 @@ class FeatureNode(TemplateNodeMixin, Node):
                 port_type: [
                     "pandas.DataFrame", "cudf.DataFrame",
                     "dask_cudf.DataFrame", "dask.dataframe.DataFrame"
-                ]
+                ],
+                PortsSpecSchema.optional: True
             }
         }
         port_outports = {
diff --git a/greenflow/greenflow/dataframe_flow/_node_flow.py b/greenflow/greenflow/dataframe_flow/_node_flow.py
index 6fda95f..749c64f 100644
--- a/greenflow/greenflow/dataframe_flow/_node_flow.py
+++ b/greenflow/greenflow/dataframe_flow/_node_flow.py
@@ -858,6 +858,10 @@ class NodeTaskGraphMixin(NodeTaskGraphExtensionMixin):
                 dy = PortsSpecSchema.dynamic
                 if inports[iport].get(dy, False):
                     continue
+
+                if inports[iport].get(PortsSpecSchema.optional, False):
+                    continue
+
                 # Is it possible that iport not connected? If so iport should
                 # not be in required. Should raise an exception here.
                 warn_msg = \
diff --git a/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py b/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
index 0363df0..9be1fc4 100644
--- a/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
+++ b/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
@@ -1,3 +1,4 @@
+from copy import deepcopy
 from .portsSpecSchema import (PortsSpecSchema, NodePorts)
 from .metaSpec import (MetaDataSchema, MetaData)
 
@@ -56,12 +57,10 @@ class NodeTaskGraphExtensionMixin:
         dynamic = None
         inports = {}
         for input_port in port_inports:
+            inports[input_port] = deepcopy(port_inports[input_port])
             if input_port in input_connections:
                 determined_type = input_connections[input_port]
-                inports[input_port] = {port_type: determined_type}
-            else:
-                types = port_inports[input_port][port_type]
-                inports[input_port] = {port_type: types}
+                inports[input_port].update({port_type: determined_type})
 
             if dy in port_inports[input_port]:
                 inports[input_port][dy] = True

I set that port as optional in featureNode.py. Then I had to tweak the logic a bit in _node_flow.py:_validate_connected_metadata, and _node_taskgraph_extension_mixin.py:ports_setup_ext to account for the PortsSpecSchema.optional case.

I did not test this thoroughly, but the notebooks seem to run fine and the warning is now correctly handled.

What do you think of making these updates?

@avolkov1 (Contributor) left a comment

I finished reviewing. There's a bug when drawing the taskgraph with ports in ipynb mode.

taskGraph.draw(show='ipynb', show_ports=True)

The draw method is not called with the show_ports option in the notebooks, but it can nevertheless be fixed via:

diff --git a/greenflow/greenflow/dataframe_flow/taskGraph.py b/greenflow/greenflow/dataframe_flow/taskGraph.py
index 70149b1..66fc0ff 100644
--- a/greenflow/greenflow/dataframe_flow/taskGraph.py
+++ b/greenflow/greenflow/dataframe_flow/taskGraph.py
@@ -353,7 +353,7 @@ class TaskGraph(object):
 
                 if (to_type == OUTPUT_TYPE):
                     continue
-                task_node = get_node_obj(itask)
+                task_node = get_node_obj(itask, tgraph_mixin=True)

I had trouble running through the 17assets.ipynb notebook. Even after reducing the bootstrap samples I would run out of memory on the GPU. I had to restart the notebook kernel and run each section individually; that way I was able to run through the whole notebook. I suspect a bug in cudf.merge, as that method seems to cause a memory leak on the GPUs.
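
A minimal sketch of how the suspected leak can be tracked between notebook sections, assuming the pynvml package is available in the container (this is an illustration, not code from this PR):

# Print current GPU memory usage; run between sections to watch for growth.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory used: {info.used / 1024**2:.0f} MiB "
      f"of {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()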

My suggestion would be to split the "17assets.ipynb" notebook into several notebooks covering one or two sections each. Name them something like "17assets_1of5.ipynb", "17assets_2of5.ipynb", and so on, or whatever makes sense. That would make it easier to run one notebook, clean up or shut down its Python kernel, run the next notebook, and so on.

Let me know whether you'd like to make these changes. I'll then resolve this comment and approve/merge the PR.

Thanks

@yidong72 (Collaborator, Author), replying to the review comment above:

I have done all the changes. I added the following lines to restart the kernel, which seems to fix the memory issue:

import IPython

# Shut down and restart the current notebook kernel to release GPU memory.
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Could you give it a try?

@yidong72 (Collaborator, Author) commented

ready for review

@avolkov1 (Contributor) left a comment

Thank you. This looks good. I was able to run the notebook. I had to be a little patient after running the kernel shutdown cell. Otherwise this worked.

Did you re-run the 17assets.ipynb notebook completely? In the version posted on GitHub (https://github.com/yidong72/gQuant/blob/branch-hrp/gQuant/plugins/hrp_plugin/notebooks/17assets.ipynb), the sections after "Hyperparameter Optimization" still show the warning that was fixed. Running it myself, none of those warnings appeared anymore.

@yidong72 (Collaborator, Author) commented

I added your 17assets notebook and the synthetic 10-stock dataset.

@avolkov1 (Contributor) left a comment

Looks great. Thanks!

@avolkov1 merged commit 717affe into NVIDIA:develop on Sep 15, 2021
@yidong72 deleted the branch-hrp branch on September 15, 2021 20:46