Add override mechanism for algorithm containers, transfer explicit .sif images with HTCondor#464
Add override mechanism for algorithm containers, transfer explicit .sif images with HTCondor#464jhiemstrawisc wants to merge 12 commits into
.sif images with HTCondor#464Conversation
agitter
left a comment
There was a problem hiding this comment.
These changes make sense to me overall.
tristan-f-r
left a comment
There was a problem hiding this comment.
This seems fine from a big picture: I am of course wary of:
- The dynamic algorithm import for the new
ALGORITHM_REGISTRY, but the dependency diamond there seems unavoidable. - The strange conflation for having both
image_overrideandimagesin the same structure, but I have no clean ideas for addressing that at the moment [and I don't want to block this PR on waiting for a better refactor to come along].
I'll add some smaller comments 👍
98aa98d to
390e197
Compare
agitter
left a comment
There was a problem hiding this comment.
The updates addressed my major comments. I only have one minor new one.
However, my local tests fail. Some are unrelated to these changes, some are related:
FAILED test/RWR/test_RWR.py::TestRWR::test_rwr - AssertionError: Output file does not match expected output file
FAILED test/ResponseNet/test_rn.py::TestResponseNet::test_responsenet_required - AssertionError: assert False
FAILED test/ResponseNet/test_rn.py::TestResponseNet::test_responsenet_all_optional - AssertionError: assert False
FAILED test/ST_RWR/test_STRWR.py::TestSTRWR::test_strwr - AssertionError: Output file does not match expected output file
FAILED test/analysis/test_cytoscape.py::TestCytoscape::test_cytoscape - spras.containers.ContainerError: (Command formatted as list: `['python', '/py4cytoscape/cytoscape_util.py', '--output', '/spras/T32FPZG/cytoscape.cys', '--pathway', '/spras/5M63HVH/pathway.txt|test\\analysis\\...
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[example] - AssertionError: assert False
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[egfr] - AssertionError: assert False
FAILED test/evaluate/test_evaluate.py::TestEvaluate::test_node_precision_recall_pca_chosen_pathway - assert False
FAILED test/generate-inputs/test_generate_inputs.py::TestGenerateInputs::test_prepare_inputs_networks - AssertionError: assert False
FAILED test/parse-outputs/test_parse_outputs.py::TestParseOutputs::test_parse_outputs - AssertionError: assert False
FAILED test/parse-outputs/test_parse_outputs.py::TestParseOutputs::test_duplicate_edges - AssertionError: assert False
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_local_sif_with_unpack_builds_sandbox - AttributeError: module 'spython' has no attribute 'main'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_registry_with_unpack_pulls_and_builds - AttributeError: module 'spython' has no attribute 'main'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_local_sif_no_unpack_returns_sif_path - ModuleNotFoundError: No module named 'pwd'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_no_override_no_unpack_returns_docker_uri - ModuleNotFoundError: No module named 'pwd'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_unpack_skips_build_if_sandbox_exists - AttributeError: module 'spython' has no attribute 'main'
@tristan-f-r do you want to leave more detailed comments still?
|
I suspect the root cause of my spython and pwd errors is that pwd is Unix only: https://docs.python.org/3/library/pwd.html. If that's true, we could skip those tests or mock pwd. |
|
Sounds good. I didn't find anything in my code passthrough, unfortunately. |
…th HTCondor This accomplishes two main things: 1. Users can explicitly state what container they want a given PRM to run in via the configuration file, using the PRM name (as defined in the config file) as the key. 2. When users specify an override `.sif` image, that image is added to an HTCondor transfer list such that Condor moves the file to the EP for execution (to avoid pulling during the job). Explicitly moving required input files is a "best practice" in HTCondor, because failure to resolve inputs at runtime squanders capacity. When no override is provided or the HTCondor Snakemake executor isn't available, the new Snakefile resource rule should be a no-op. In addition to adding the features, I tried to split up some other functions in and around container resolution to make them more testable.
This adds a more robust way to check whether the container framekwork is apptainer/singularity, which for the purposes of our codebase should be treated as synonyms. I decided to do this after noticing an issue in test logs where the container framework was set to apptainer, and `unpack_singularity` was true -- the unpacking behavior happened correctly despite a logged warning claiming it wouldn't happen because the warning only checked for singularity. I believe this diff makes that type of mistake a little harder.
As I started writing the PR message, I realized things weren't quite the way I wanted them to be w.r.t. this hierarchy. Thisshould fix it.
Introduce ALGORITHM_REGISTRY in spras/config/util.py as the single source of truth for all supported algorithms, mapping algorithm names to (module_path, class_name) tuples. Add an AlgorithmName enum auto-generated from the registry keys (case-insensitive via CaseInsensitiveEnum) and a get_valid_algorithm_names() helper. Replace the 11 manual algorithm imports and hardcoded dict in spras/runner.py with a _load_algorithms() function that uses importlib to load classes from the registry at import time. get_algorithm() now validates via AlgorithmName instead of a raw .lower() lookup. To register a new algorithm, add one entry to ALGORITHM_REGISTRY -- no changes to runner.py are needed.
After building ProcessedContainerSettings, check that every key in containers.images is a recognized algorithm name by passing it through AlgorithmName. Raise ValueError with a clear message and the list of valid names if a typo or unknown name is found. Add test_config_container_images_invalid_algorithm to verify the validation rejects unrecognized keys.
Update the Step 4 walkthrough and checklist item 5 to reflect the new algorithm registration mechanism. Adding a new algorithm now requires a single entry in ALGORITHM_REGISTRY in spras/config/util.py instead of manually importing the class and editing the dict in runner.py.
Replace the two-layer design (_resolve_container_override + _resolve_singularity_image both reading image_override) with a single entry point resolve_container_image() that returns a frozen ResolvedImage dataclass carrying the image string and an is_local_sif flag. Key changes in containers.py: - Add ResolvedImage(image, is_local_sif) frozen dataclass - resolve_container_image() is the single place image_override is read; .sif + singularity returns ResolvedImage(sif_path, True) directly - Rename _resolve_singularity_image to _prepare_singularity_image; it receives ResolvedImage and never re-reads image_override. Its job is to prepare (e.g. unpack) the apptainer image - run_container_singularity() takes ResolvedImage instead of str - Add warnings for suspicious overrides (excessive path depth, bare hostname) with "Attempting to use as-is" messaging
Expand the containers.images comment to mention that shared-fs-usage is configured via spras_profile, making the context clearer for users unfamiliar with the HTCondor integration.
Adds a space to an error message, and stops some apptainer tests from running on non-Linux systems, where `spython` imports cause problems.
beabea9 to
f09c88f
Compare
|
[After tinkering with #457,] there does seem to be a better solution to this that avoids the dynamic algorithm loading entirely: We can localize and we can consider this as a kind of "run setting." I'll see if I can lift this run setting configuration out of #457 for general usage. |
Tony and I discussed this approach and ultimately decided against it because the chosen approach keeps container logic in the yaml's container configuration. |
| # AlgorithmName("PathLinker") resolves to AlgorithmName.pathlinker. | ||
| AlgorithmName = CaseInsensitiveEnum("AlgorithmName", {k: k for k in ALGORITHM_REGISTRY}) | ||
|
|
||
| def get_valid_algorithm_names() -> set[str]: |
There was a problem hiding this comment.
Should we have a way to override the Cytoscape docker image as well?
tristan-f-r
left a comment
There was a problem hiding this comment.
Small cytoscape note: I don't have an immediate way to address that problem without making cytoscape a psuedo-algorithm, so I'm wondering if we even need to worry about it.
In line with this, noting that (for this line in -container_suffix = "pathlinker:v2"
+container_suffix = container_settings.resolve_image("pathlinker")where then, in This means that we do have to maintain two separate lists of docker containers and algorithms, but then that also means we can differentiate between our list of algorithms versus other containers (e.g. Cytoscape), while getting free test-time guarantees that all algorithms are specified, since we explicitly would depend on the default values specified in such a container registry when calling [If you agree with this change but also don't want to implement it, I can try my hand at making a PR over your fork of SPRAS to make these changes.] |
|
@tristan-f-r in your proposed redesign, what would the new Where would the two lists of docker containers and algorithms need to be maintained? We would still have the It was a good catch that we can't override Cytoscape currently. Our existing Cytoscape support is weak, so if we exclude it in this pull request and document that, I am okay deferring a more general solution for a later time when someone needs that functionality. |
|
Effectively, I would like to decouple container logic from the rest of SPRAS more than the currently implemented |
This PR aims to accomplish a few things:
foo, I want you to use container imagebar". In the general case, I'm not sure how much this is needed, but @agitter and I have floated the concept in the past, and it helps me solve 2. This implements a 4-tier hierarchy over how much of the image name is overridden (with special logic for.sifextensions), e.g.:would result in container URIs of
omicsintegrator1 --> docker.io/reedcompbio/oi1:latestomicsintegrator2 --> docker.io/jhiemstra/oi2:latest1mincostflow --> hub.opensciencegrid.org/jhiemstra/mincostflow:latestallpairs --> images/allpairs.sif (local file)The caveat here is that I don't believe all container registries follow this kind of
<base_url>/<owner>/<container>hierarchy, but it's at least true for docker, ghcr and hub.opensciencegrid.org. If a user finds themselves somewhere outside those, they can always declare the entire URI explicitly..sifextension, the file is added to the reconstruct rule'shtcondor_transfer_input_filesresource key. This triggers HTCondor to transfer the sif image as part of the job's input sandbox..sifoverride is present and apptainer/singularity is the configured container framework, SPRAS uses the local file instead of pulling/building from a remote registry. This is combined with 2) are the key fixes for Provide guidance on working around docker rate limiting for large CHTC runs #462singularityandapptainercontainer frameworks differently in some cases. I added a helper function that makes it easier to treat these as synonyms.Note that I haven't yet documented this guidance, as #462 requests -- I'd rather get this over the finish line first, then add the documentation to #459 (so I don't create conflicts for myself)