Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to PGT cause deploy graph failure for node_list with only one node. #255

Closed
myxie opened this issue May 21, 2024 · 5 comments
Closed

Comments

@myxie
Copy link
Collaborator

myxie commented May 21, 2024

Environment

EAGLE: eagle.icrar.org
DALiuGE: Translator, Node Manager, Data Island Manager (all local).

Issue

Changes introduced in 6a7bf50 lead to the following error when attempting translate and deploy a graph locally:

"Failed to deploy physical graph: Invalid new_num_parts 0"

Specifically, it looks like the issue starts in the PGT class prior to partitioning, with how we initialise the nm_list.

    is_list = node_list[0:num_islands]
    nm_list = node_list[num_islands:]

I believe this is an off by one error, with Python's list splicing partly to blame; an IndexError is not thrown if we try and splice an index that is out of range, it just returns an empty list.

If num_islands=1, and we have node_list = ['node1', 'node2'], we will end up with:

  • is_list == ['node1']
  • nm_list == ['node2']

However, if we have only node_list= [node1]:

  • is_list ['node1'] # index[0]
  • nm_list -> [] # index[1:], which is out of bounds

Solution

I have implemented a workaround with the following changes:

    is_list = node_list[0:num_islands]
    nm_list = node_list[num_islands-1:] 

However, I may be missing some nuance in the construction of the island manager/node manager list, so am interested to hear more on the preferred solution.

@myxie myxie changed the title Updates to PGT cause deploy graph failure for 1 Node Manager Updates to PGT cause deploy graph failure for one Node Manager May 21, 2024
@myxie myxie changed the title Updates to PGT cause deploy graph failure for one Node Manager Updates to PGT cause deploy graph failure for node_list with only one node. May 21, 2024
@awicenec
Copy link
Contributor

There is an error somewhere else and maybe that got lost during one of the merges before. The changes I have committed should actually never result in a list with just a single node and in addition all the nodes in the list should have a port as well. The reason behind that is that we need to be able to specify the port of the DIM and the NM, and with the previous list that was impossible, since it used the first n_island nodes in the list twice, once as a DIM and once as a NM (if the co-locate NM flag was set to True) and attached the default ports in order to access the managers. The updated list thus needs to have at least two entries in the minimal case, one for the DIM and one for the NM, like ['node1:8001', 'node1:8000'] and then this index error would never happen.

@myxie
Copy link
Collaborator Author

myxie commented May 22, 2024

Thanks for clarifying that @wicenec; if that's the case we probably want to introduce error handling to ensure that the node_list is at least two entries, as that requirement is not explicit in the code.

I'll have a poke around to see if I can find why our node_list is only one element long.

@myxie
Copy link
Collaborator Author

myxie commented May 22, 2024

@awicenec to add to our conversation in-person earlier, I am currently running the following commands locally to set up the DIM, NM, and translator:

$ dlg nm -H 0.0.0.0 -vvv --dlm-enable-replication
$ dlg dim -H 0.0.0.0 -N localhost:8000 -vvv
$ dlg lgweb -d /tmp/ -t /tmp/ -vv

This appears to be creating 1 data island manager and 1 host:

2024-05-22 07:47:55,773 [ INFO] [     MainThread] dlg.manager.cmdline#launchServer:74 DALiuGE version 4.0.1 running at /home/00087932/dlg/workspace
2024-05-22 07:47:55,774 [ INFO] [     MainThread] dlg.manager.cmdline#launchServer:77 Creating DataIslandManager
2024-05-22 07:48:12,138 [DEBUG] [DMChecker Threa] dlg.manager.composite_manager#check_dm:253 Checking DM presence at localhost:8000
2024-05-22 07:48:19,997 [ INFO] [     MainThread] dlg.manager.composite_manager#__init__:592 Created DataIslandManager for hosts: ['localhost:8000']
2024-05-22 07:48:27,064 [ INFO] [     MainThread] dlg.restserver#start:45 Starting REST server on 0.0.0.0:8001
2024-05-22 07:49:09,906 [DEBUG] [      Thread-49] dlg.manager.rest#fwrapper:81 CORS request comming from: http://localhost:8084
2024-05-22 07:49:09,908 [DEBUG] [      Thread-49] dlg.manager.rest#fwrapper:97 CORS headers set to allow from: http://localhost:8084
2024-05-22 07:49:09,908 [DEBUG] [      Thread-49] dlg.manager.rest#fwrapper:99 Bottle sending back result: {"methods": ["BROWSER", "SERVER"]}
2024-05-22 07:49:09,927 [DEBUG] [      Thread-51] dlg.manager.rest#fwrapper:81 CORS request comming from: None
2024-05-22 07:49:09,928 [DEBUG] [      Thread-51] dlg.manager.rest#fwrapper:97 CORS headers set to allow from: None
2024-05-22 07:49:09,928 [DEBUG] [      Thread-51] dlg.manager.rest#fwrapper:99 Bottle sending back result: {"methods": ["BROWSER", "SERVER"]}
2024-05-22 07:49:12,139 [DEBUG] [DMChecker Threa] dlg.manager.composite_manager#check_dm:253 Checking DM presence at localhost:8000
2024-05-22 07:49:50,850 [DEBUG] [      Thread-87] dlg.manager.rest#fwrapper:81 CORS request comming from: None
2024-05-22 07:49:50,850 [DEBUG] [      Thread-87] dlg.manager.rest#fwrapper:97 CORS headers set to allow from: None
2024-05-22 07:49:50,851 [DEBUG] [      Thread-87] dlg.manager.rest#fwrapper:99 Bottle sending back result: ["localhost:8000"]
2024-05-22 07:50:12,140 [DEBUG] [DMChecker Threa] dlg.manager.composite_manager#check_dm:253 Checking DM presence at localhost:8000
2024-05-22 07:51:12,141 [DEBUG] [DMChecker Threa] dlg.manager.composite_manager#check_dm:253 Checking DM presence at localhost:8000
2024-05-22 07:52:12,142 [DEBUG] [DMChecker Threa] dlg.manager.composite_manager#check_dm:253 Checking DM presence at localhost:8000

This only leads to 1 DropManager in the node_list variable when it comes to deploying the PGT. Are you able to provide the command flags you run locally, as I think the DIM I run does not correctly pick up on the Node Manager.

@myxie
Copy link
Collaborator Author

myxie commented May 23, 2024

After playing with some settings today, we identified that this issue is localised to running a Server deployment, as opposed to a Browser Direct deployment. The Browser Direct option gives us the expected set of nodes when running a basic, locally deployed version of DALiuGE: ['localhost:8001', localhost:8000'], but we only see [localhost:8000] if we use the Server option.

I have identified why we get the correct list for the Browser Direct method and have a preliminary fix in #256.

@myxie
Copy link
Collaborator Author

myxie commented Jun 4, 2024

Closing out as this has been fixed in #256.

@myxie myxie closed this as completed Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants