colmena version issues in example #2

Open
vsoch opened this issue Mar 16, 2023 · 20 comments

Comments

@vsoch

vsoch commented Mar 16, 2023

Hiya! I'm trying to reproduce the 1_ notebook (with colmena) and none of the versions from the one provided up until the current work.

  • 0.4.2 - 4.0.4: ImportError: cannot import name 'make_queue_pairs' from 'colmena.queue.redis'
  • 0.4.1: ModuleNotFoundError: No module named 'colmena.task_server'
  • 0.3.2-0.4.0: AttributeError: module 'proxystore' has no attribute 'proxy'

Possibly I'm missing something, or the script needs to be updated? Also, the import was colmena.redis.queue and it should be colmena.queue.redis, so I'm questioning whether this was run to completion.
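When comparing notes on version-dependent import errors like the ones listed above, it can help to print exactly which versions are installed in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and the two package names are taken from the error messages in this comment.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the errors reported above
    for name in ("colmena", "proxystore"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Pasting this output alongside a traceback makes it easier to tell an environment mismatch apart from a bug in the example script.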

@WardLT
Collaborator

WardLT commented Mar 16, 2023

I made a few version updates. Could you check if it works for you now? You'll need to rebuild the environment, as I changed some things besides Colmena.

@vsoch
Author

vsoch commented Mar 16, 2023

Rebuilding! So just to clarify - redis is no longer being used? Note in the issue linked above the developer said we could do:

queues = RedisQueues(hostname=args.hostname, topics=['simulate', 'train', 'infer'], serialization_method='pickle')

Should the two be equivalent (aside from using pipes vs. redis)?

@vsoch
Author

vsoch commented Mar 16, 2023

Lol! Sorry just realized that "the developer" is you! So you probably know this is the better approach. 😆 (sorry for my faux pas). Will redis not work then?

@vsoch
Author

vsoch commented Mar 16, 2023

Ack - so the changes to the environment broke the example 0. :( I'm going to revert to see if I can get it working again.

@WardLT
Collaborator

WardLT commented Mar 16, 2023

Yes, Redis and Pipes should be equivalent (Redis is better for larger data). I switched to pipes so that users don't have to remember to start redis for the demo.
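To illustrate why a pipe-backed queue needs no separate server process, here is a small standard-library sketch. It is not Colmena's implementation: the class `TopicPipes`, its methods, and the message layout are hypothetical, and only the topic names come from the `RedisQueues` call quoted earlier in the thread.

```python
from multiprocessing import Pipe


class TopicPipes:
    """Hypothetical stand-in for a topic-keyed queue built on OS pipes."""

    def __init__(self, topics):
        # One one-way pipe per topic: (receiving end, sending end)
        self._pipes = {topic: Pipe(duplex=False) for topic in topics}

    def send(self, topic, message):
        # Connection.send pickles the object before writing it to the pipe
        self._pipes[topic][1].send(message)

    def get(self, topic, timeout=1.0):
        receiver = self._pipes[topic][0]
        # poll() returns False if nothing arrives within the timeout
        if not receiver.poll(timeout):
            return None
        return receiver.recv()


queues = TopicPipes(["simulate", "train", "infer"])
queues.send("simulate", {"inputs": "C", "method": "compute_vertical"})
print(queues.get("simulate"))            # the message sent above
print(queues.get("train", timeout=0.1))  # nothing was sent on this topic
```

The pipes exist only for the lifetime of the Python processes that hold them, which is why nothing has to be started beforehand, whereas a Redis-backed queue depends on a running Redis server.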

@vsoch
Author

vsoch commented Mar 16, 2023

Let me know when there is an updated colmena to try and I'll try the second example again (it's still hanging with pipes).

@vsoch
Author

vsoch commented Mar 17, 2023

In case you need it, here is the current state of my script for the second simulation (still hanging!) https://github.com/rse-ops/flux-hpc/blob/add/molecular-design-parsl/molecular-design-parsl/scripts/1_interleaving-simulation-and-steering.py

Thanks for your help and have a good evening!

@WardLT
Collaborator

WardLT commented Mar 17, 2023

Colmena is updated. You should find a v0.4.5.

Could you try updating "chemfunctions.py" to the latest version from this git repo? It switches from qcengine to ASE for the chemistry computations, and that fixed some issues I had. I'm not sure if yours is the same, but at least it'll put our repos on the same basis.

Also, do you see the second task going out in colmena.log?

@vsoch
Author

vsoch commented Mar 17, 2023

Hiya! I've updated the chemistry script, and also fixed what I consider a bug in the FluxExecutor: it was running flux start instead of flux submit for every job. The first example (0_.py) continues to work, but the second one (1_.py) still hangs:

queues.send_inputs("C", method="compute_vertical")
result = queues.get_result()  # hangs on this line
queues.send_inputs("C", method="compute_vertical", topic="simulate")

For the log, I'm not sure what I'm looking at, but it looks like this!

image

@vsoch
Author

vsoch commented Mar 17, 2023

Not sure if this is expected, but I don't see anything in "submit_scripts".

image

Let me know what else you'd like to see or what else you'd like me to try!

@WardLT
Collaborator

WardLT commented Mar 17, 2023

That is good to know about no submit scripts showing up. My guess so far is that the ParslTaskServer is failing to start or crashing when it receives a task.

Do you see anything at the end of that log file? Does it stop writing messages after a certain point?

Another thing you could do is add task_server.join() just before the line that hangs. If the task server is running, the join will hang. If it died, this should propagate the error message.
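A variation on the join suggestion is to join with a timeout, so the debugging session does not block indefinitely when the server is healthy. The sketch below assumes the task server behaves like a `multiprocessing.Process` (it offers `join(timeout)`, `is_alive()`, and an `exitcode`); that is an assumption rather than something stated in this thread, and `check_worker` is a made-up helper name. A plain thread stands in for the server so the example runs on its own.

```python
import threading


def check_worker(worker, timeout=5.0):
    """Wait briefly on a process-like object and report whether it is alive.

    `worker` is assumed to expose join(timeout) and is_alive(); an
    `exitcode` attribute is reported when present (processes have one,
    threads do not).
    """
    worker.join(timeout)
    if worker.is_alive():
        return "running"
    exitcode = getattr(worker, "exitcode", None)
    return f"exited (exitcode={exitcode})"


# Stand-in for a long-running server: a thread blocked on an event
stop = threading.Event()
worker = threading.Thread(target=stop.wait, daemon=True)
worker.start()

print(check_worker(worker, timeout=0.2))  # still blocked on the event
stop.set()
print(check_worker(worker, timeout=1.0))  # the thread has finished
```

A "running" result only shows that the process has not exited; it does not show that it is receiving or dispatching tasks.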

@WardLT
Collaborator

WardLT commented Mar 17, 2023

Overall, it sounds like an issue between Colmena and the FluxExecutor.

Could you open an issue about it on Colmena's GitHub? It would be good to test Colmena+FluxExecutor on the simple test cases we use with Colmena if we get stumped here.

@vsoch
Author

vsoch commented Mar 17, 2023

Another thing you could do is add task_server.join() just before the line that hangs. If the task server is running, the join will hang. If it died, this should propagate the error message.

The join hangs so we know it's running!

@vsoch
Author

vsoch commented Mar 17, 2023

Okay, I'm doing some debugging. Even when I manually set the launch_cmd, it seems to be overridden somewhere, because it's flux start again:

image

Also note that when I run the submit script outside of the thread, that's what hangs. Let's dig into that next. When I update to flux submit instead of start, at least I can see the job being submitted (this is an error you only see when it's running as root):

image

And I can see the job submitted (even when I press Ctrl-C):

image

There isn't much meaningful content in the log (when I attach to the job)

image

No obvious errors in the logs

image

And specifically, here is the hang: the submission queue is always empty.

        while not stop_event.is_set() or not submission_queue.empty():
            try:
                jobinfo = submission_queue.get(timeout=0.05)
            except queue.Empty:
                # We continually hit this statement
                pass
            else:
                _submit_single_job(flux_executor, working_dir, jobinfo)

Is that related to parsl or colmena or something else?
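The loop shape above can be reproduced in isolation to see what "continually hitting `queue.Empty`" looks like. This is a standard-library sketch, not the FluxExecutor code: `drain`, `run`, and the job names are hypothetical, and a thread plays the role of the submission loop.

```python
import queue
import threading
import time


def drain(submission_queue, stop_event, handled, stats):
    """Poll until told to stop and the queue is empty (same loop shape as above)."""
    while not stop_event.is_set() or not submission_queue.empty():
        try:
            item = submission_queue.get(timeout=0.05)
        except queue.Empty:
            stats["empty_polls"] += 1  # nothing was enqueued during this poll
        else:
            handled.append(item)


def run(producer_items, duration=0.3):
    """Run the consumer for `duration` seconds while feeding it `producer_items`."""
    submission_queue, stop_event = queue.Queue(), threading.Event()
    handled, stats = [], {"empty_polls": 0}
    consumer = threading.Thread(
        target=drain, args=(submission_queue, stop_event, handled, stats)
    )
    consumer.start()
    for item in producer_items:
        submission_queue.put(item)
    time.sleep(duration)
    stop_event.set()
    consumer.join()
    return handled, stats["empty_polls"]


print(run([]))                  # no producer: nothing handled, only empty polls
print(run(["job-1", "job-2"]))  # both items are handled before shutdown
```

In this sketch the consumer is working correctly in both runs; it sees only empty polls when nothing is ever put on the queue. By analogy, an always-empty submission queue suggests looking at whatever is supposed to enqueue jobs rather than at the loop itself.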

@vsoch
Author

vsoch commented Mar 21, 2023

heyo! Do you have any other ideas for what I could try? Are there examples that don't require colmena? That seems to be what might be introducing something buggy.

@WardLT
Collaborator

WardLT commented Mar 22, 2023

Sorry for the slow reply, but I have been thinking about this.

Are there examples that don't require colmena?

Just the first notebook, which does a very similar workflow without Colmena's interference.

That seems to be what might be introducing something buggy.

There are two routes for bugs I'm curious about:

  1. Are you running on a Mac? I've yet to work out Mac support with Colmena (I think it has to do with Pipes)
  2. Does the workflow work with HTEx rather than Flux? Colmena spawns Parsl as a subprocess, and there's a chance doing so leads to some interference.

@vsoch
Author

vsoch commented Mar 22, 2023

Heyo!

Are you running on a Mac? I've yet to work out Mac support with Colmena (I think it has to do with Pipes)

I'm not running on a Mac - I use Linux (I have a Mac but I'm allergic. It's great for email though!)

Does the workflow work with HTEx rather than Flux? Colmena spawns Parsl as a subprocess, and there's a chance doing so leads to some interference.

I'm not sure I know what HTEx is! Is that another workflow tool? Since this is for the Flux Operator, we are primarily interested in running with Flux. Is there something I should check to see if this HTEx is involved?

@WardLT
Collaborator

WardLT commented Mar 22, 2023

Sorry for the jargon: HTEx is the "High Throughput Executor." It's the part you replaced with Flux.

Put better, did the colmena demo work without flux?

@vsoch
Author

vsoch commented Mar 22, 2023

Put better, did the colmena demo work without flux?

Oh, I didn't test that use case - I'm only interested in running with Flux. If you report the demo works for you, I'd assume it's an issue with how it's integrated into flux.

@WardLT
Collaborator

WardLT commented Mar 23, 2023

Yea, flux not working is definitely something to dig into. I'm hoping to at least rule out Colmena not working on your system.
