Updates to PGT cause deploy graph failure for node_list with only one node.
#255
Comments
There is an error somewhere else, and it may have been lost during one of the earlier merges. The changes I committed should never result in a list with just a single node, and every node in the list should also have a port. The reason is that we need to be able to specify the ports of the DIM and the NM. With the previous list that was impossible: it used the first n_island nodes in the list twice, once as a DIM and once as an NM (if the co-locate NM flag was set to True), and attached the default ports in order to access the managers. The updated list thus needs at least two entries in the minimal case, one for the DIM and one for the NM, like ['node1:8001', 'node1:8000'], and then this index error would never happen.
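To make the expected list layout concrete, here is a minimal sketch of how entries such as 'node1:8001' could be read, assuming the first num_islands entries are DIMs and the rest are NMs. The helper name parse_node and the variable names are hypothetical and are not taken from the DALiuGE code base.

```python
from typing import List, Tuple


def parse_node(entry: str) -> Tuple[str, int]:
    """Split a 'host:port' entry into (host, port). Hypothetical helper."""
    host, sep, port = entry.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {entry!r}")
    return host, int(port)


# Minimal case from the comment above: one DIM entry, then one NM entry.
node_list: List[str] = ["node1:8001", "node1:8000"]
num_islands = 1

# Assumption: DIM entries come first, NM entries follow.
dim_endpoints = [parse_node(n) for n in node_list[:num_islands]]
nm_endpoints = [parse_node(n) for n in node_list[num_islands:]]

print(dim_endpoints)  # [('node1', 8001)]
print(nm_endpoints)   # [('node1', 8000)]
```

An entry without a port, such as 'node1', raises ValueError here instead of silently falling back to a default port.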
Thanks for clarifying that @awicenec; if that's the case, we probably want to introduce error handling to ensure that the node_list has at least two entries, as that requirement is not explicit in the code. I'll have a poke around to see if I can find why our …
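The kind of error handling suggested in this comment might look like the sketch below. It is only an illustration: require_min_nodes is a hypothetical name, and the rule "num_islands DIM entries plus at least one NM entry" is an assumption drawn from the discussion in this thread, not from the actual code.

```python
from typing import Sequence


def require_min_nodes(node_list: Sequence[str], num_islands: int = 1) -> None:
    """Fail early if node_list cannot supply every DIM plus at least one NM."""
    if num_islands < 1:
        raise ValueError("num_islands must be at least 1")
    needed = num_islands + 1  # assumption: one entry per DIM, plus one NM
    if len(node_list) < needed:
        raise ValueError(
            f"node_list needs at least {needed} entries for "
            f"{num_islands} island(s), got {len(node_list)}: {list(node_list)!r}"
        )


require_min_nodes(["node1:8001", "node1:8000"])  # passes silently

try:
    require_min_nodes(["node1"])
except ValueError as exc:
    print("rejected:", exc)
```

Raising at this point would surface the missing requirement directly, rather than as a later partitioning failure.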
@awicenec to add to our in-person conversation earlier, I am currently running the following commands locally to set up the DIM, NM, and translator:
$ dlg nm -H 0.0.0.0 -vvv --dlm-enable-replication
$ dlg dim -H 0.0.0.0 -N localhost:8000 -vvv
$ dlg lgweb -d /tmp/ -t /tmp/ -vv
This appears to be creating 1 data island manager and 1 host:
This only leads to 1 DropManager in the …
After playing with some settings today, we identified that this issue is localised to running a Server deployment, as opposed to a Browser Direct deployment. The Browser Direct option gives us the expected set of nodes when running a basic, locally deployed version of DALiuGE. I have identified why we get the correct list for the Browser Direct method and have a preliminary fix in #256.
Closing out as this has been fixed in #256.
Environment
EAGLE: eagle.icrar.org
DALiuGE: Translator, Node Manager, Data Island Manager (all local).
Issue
Changes introduced in 6a7bf50 lead to the following error when attempting to translate and deploy a graph locally:
"Failed to deploy physical graph: Invalid new_num_parts 0"
Specifically, it looks like the issue starts in the PGT class prior to partitioning, with how we initialise the nm_list.
I believe this is an off-by-one error, with Python's list slicing partly to blame; an IndexError is not thrown if we try to slice with an index that is out of range, it just returns an empty list.
If num_islands=1, and we have node_list = ['node1', 'node2'], we will end up with:
is_list == ['node1']
nm_list == ['node2']
However, if we have only node_list = ['node1']:
is_list == ['node1'] # index [0]
nm_list == [] # index [1:], which is out of bounds
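The slicing behaviour described above can be reproduced in isolation. The sketch below uses a hypothetical split_nodes function as a stand-in for the split performed before partitioning; it is not the actual PGT code.

```python
from typing import List, Tuple


def split_nodes(node_list: List[str], num_islands: int) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in: first num_islands entries are islands, the rest NMs."""
    is_list = node_list[:num_islands]
    nm_list = node_list[num_islands:]  # a slice past the end yields [], no error
    return is_list, nm_list


two = split_nodes(["node1", "node2"], 1)
one = split_nodes(["node1"], 1)
print(two)  # (['node1'], ['node2'])
print(one)  # (['node1'], [])

# Indexing, unlike slicing, does raise when out of range.
try:
    ["node1"][1]
except IndexError as exc:
    print("IndexError:", exc)
```

With a single node the split succeeds and quietly produces an empty nm_list, so the problem only shows up later.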
Solution
I have implemented a workaround with the following changes:
However, I may be missing some nuance in the construction of the island manager/node manager list, so am interested to hear more on the preferred solution.