Skip to content

Add override mechanism for algorithm containers, transfer explicit .sif images with HTCondor#464

Open
jhiemstrawisc wants to merge 12 commits into
Reed-CompBio:mainfrom
jhiemstrawisc:explicit-sif-transfer
Open

Add override mechanism for algorithm containers, transfer explicit .sif images with HTCondor#464
jhiemstrawisc wants to merge 12 commits into
Reed-CompBio:mainfrom
jhiemstrawisc:explicit-sif-transfer

Conversation

@jhiemstrawisc
Copy link
Copy Markdown
Collaborator

This PR aims to accomplish a few things:

  1. It adds a new valid keys to config.yaml that let a user say "For reconstruction algorithm foo, I want you to use container image bar". In the general case, I'm not sure how much this is needed, but @agitter and I have floated the concept in the past, and it helps me solve 2. This implements a 4-tier hierarchy over how much of the image name is overridden (with special logic for .sif extensions), e.g.:
containers:
  registry:
    base_url: docker.io
    owner: reedcompbio
    
    images:
      omicsintegrator1: "oi1:latest"
      omicsintegrator2: "jhiemstra/oi2:latest
      mincostflow: "hub.opensciencegrid.org/jhiemstra/mincostflow:latest
      allpairs: "images/allpairs.sif"

would result in container URIs of

  • omicsintegrator1 --> docker.io/reedcompbio/oi1:latest
  • omicsintegrator2 --> docker.io/jhiemstra/oi2:latest1
  • mincostflow --> hub.opensciencegrid.org/jhiemstra/mincostflow:latest
  • allpairs --> images/allpairs.sif (local file)

The caveat here is that I don't believe all container registries follow this kind of <base_url>/<owner>/<container> hierarchy, but it's at least true for docker, ghcr and hub.opensciencegrid.org. If a user finds themselves somewhere outside those, they can always declare the entire URI explicitly.

  1. When a container override contains the .sif extension, the file is added to the reconstruct rule's htcondor_transfer_input_files resource key. This triggers HTCondor to transfer the sif image as part of the job's input sandbox.
  2. When a .sif override is present and apptainer/singularity is the configured container framework, SPRAS uses the local file instead of pulling/building from a remote registry. This is combined with 2) are the key fixes for Provide guidance on working around docker rate limiting for large CHTC runs #462
  3. Some of the container resolution logic started to get sufficiently nested with if/else logic that I wanted to split it out into a helper function I could more easily test. This is where most of the new test code comes from.
  4. Finally, I fixed a small-but-inconsequential bug I noticed in the way we treated singularity and apptainer container frameworks differently in some cases. I added a helper function that makes it easier to treat these as synonyms.

Note that I haven't yet documented this guidance, as #462 requests -- I'd rather get this over the finish line first, then add the documentation to #459 (so I don't create conflicts for myself)

Copy link
Copy Markdown
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes make sense to me overall.

Comment thread config/config.yaml Outdated
Comment thread config/config.yaml
Comment thread spras/config/container_schema.py
Comment thread spras/containers.py
Comment thread test/test_container_image_resolution.py Outdated
@tristan-f-r tristan-f-r added the enhancement New feature or request label Mar 18, 2026
@github-actions github-actions Bot added the merge-conflict This PR has merge conflicts. label Apr 7, 2026
@read-the-docs-community
Copy link
Copy Markdown

read-the-docs-community Bot commented Apr 7, 2026

Documentation build overview

📚 spras | 🛠️ Build #32474638 | 📁 Comparing 2599ec8 against latest (87b314c)

  🔍 Preview build  

5 files changed · ± 5 modified

± Modified

Copy link
Copy Markdown
Collaborator

@tristan-f-r tristan-f-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine from a big picture: I am of course wary of:

  • The dynamic algorithm import for the new ALGORITHM_REGISTRY, but the dependency diamond there seems unavoidable.
  • The strange conflation for having both image_override and images in the same structure, but I have no clean ideas for addressing that at the moment [and I don't want to block this PR on waiting for a better refactor to come along].

I'll add some smaller comments 👍

@jhiemstrawisc jhiemstrawisc force-pushed the explicit-sif-transfer branch from 98aa98d to 390e197 Compare April 8, 2026 14:04
@github-actions github-actions Bot removed the merge-conflict This PR has merge conflicts. label Apr 8, 2026
@github-actions github-actions Bot added the merge-conflict This PR has merge conflicts. label Apr 17, 2026
Copy link
Copy Markdown
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updates addressed my major comments. I only have one minor new one.

However, my local tests fail. Some are unrelated to these changes, some are related:

FAILED test/RWR/test_RWR.py::TestRWR::test_rwr - AssertionError: Output file does not match expected output file
FAILED test/ResponseNet/test_rn.py::TestResponseNet::test_responsenet_required - AssertionError: assert False
FAILED test/ResponseNet/test_rn.py::TestResponseNet::test_responsenet_all_optional - AssertionError: assert False
FAILED test/ST_RWR/test_STRWR.py::TestSTRWR::test_strwr - AssertionError: Output file does not match expected output file
FAILED test/analysis/test_cytoscape.py::TestCytoscape::test_cytoscape - spras.containers.ContainerError: (Command formatted as list: `['python', '/py4cytoscape/cytoscape_util.py', '--output', '/spras/T32FPZG/cytoscape.cys', '--pathway', '/spras/5M63HVH/pathway.txt|test\\analysis\\...
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[example] - AssertionError: assert False
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[egfr] - AssertionError: assert False
FAILED test/evaluate/test_evaluate.py::TestEvaluate::test_node_precision_recall_pca_chosen_pathway - assert False
FAILED test/generate-inputs/test_generate_inputs.py::TestGenerateInputs::test_prepare_inputs_networks - AssertionError: assert False
FAILED test/parse-outputs/test_parse_outputs.py::TestParseOutputs::test_parse_outputs - AssertionError: assert False
FAILED test/parse-outputs/test_parse_outputs.py::TestParseOutputs::test_duplicate_edges - AssertionError: assert False
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_local_sif_with_unpack_builds_sandbox - AttributeError: module 'spython' has no attribute 'main'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_registry_with_unpack_pulls_and_builds - AttributeError: module 'spython' has no attribute 'main'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_local_sif_no_unpack_returns_sif_path - ModuleNotFoundError: No module named 'pwd'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_no_override_no_unpack_returns_docker_uri - ModuleNotFoundError: No module named 'pwd'
FAILED test/test_container_image_resolution.py::TestPrepareSingularityImage::test_unpack_skips_build_if_sandbox_exists - AttributeError: module 'spython' has no attribute 'main'

@tristan-f-r do you want to leave more detailed comments still?

Comment thread spras/config/config.py Outdated
@agitter
Copy link
Copy Markdown
Collaborator

agitter commented Apr 17, 2026

I suspect the root cause of my spython and pwd errors is that pwd is Unix only: https://docs.python.org/3/library/pwd.html. If that's true, we could skip those tests or mock pwd.

@tristan-f-r
Copy link
Copy Markdown
Collaborator

tristan-f-r commented Apr 17, 2026

Sounds good. I didn't find anything in my code passthrough, unfortunately.

…th HTCondor

This accomplishes two main things:
1. Users can explicitly state what container they want a given PRM
to run in via the configuration file, using the PRM name (as defined
in the config file) as the key.
2. When users specify an override `.sif` image, that image is added
to an HTCondor transfer list such that Condor moves the file to the
EP for execution (to avoid pulling during the job). Explicitly moving
required input files is a "best practice" in HTCondor, because failure
to resolve inputs at runtime squanders capacity.

When no override is provided or the HTCondor Snakemake executor isn't
available, the new Snakefile resource rule should be a no-op.

In addition to adding the features, I tried to split up some other
functions in and around container resolution to make them more testable.
This adds a more robust way to check whether the container framekwork
is apptainer/singularity, which for the purposes of our codebase should
be treated as synonyms.

I decided to do this after noticing an issue in test logs where the container
framework was set to apptainer, and `unpack_singularity` was true -- the
unpacking behavior happened correctly despite a logged warning claiming it
wouldn't happen because the warning only checked for singularity.

I believe this diff makes that type of mistake a little harder.
As I started writing the PR message, I realized things weren't quite the way I wanted them to be w.r.t. this hierarchy. Thisshould fix it.
Introduce ALGORITHM_REGISTRY in spras/config/util.py as the single
source of truth for all supported algorithms, mapping algorithm names
to (module_path, class_name) tuples. Add an AlgorithmName enum
auto-generated from the registry keys (case-insensitive via
CaseInsensitiveEnum) and a get_valid_algorithm_names() helper.

Replace the 11 manual algorithm imports and hardcoded dict in
spras/runner.py with a _load_algorithms() function that uses importlib
to load classes from the registry at import time.  get_algorithm() now
validates via AlgorithmName instead of a raw .lower() lookup.

To register a new algorithm, add one entry to ALGORITHM_REGISTRY --
no changes to runner.py are needed.
After building ProcessedContainerSettings, check that every key in
containers.images is a recognized algorithm name by passing it through
AlgorithmName.  Raise ValueError with a clear message and the list of
valid names if a typo or unknown name is found.

Add test_config_container_images_invalid_algorithm to verify the
validation rejects unrecognized keys.
Update the Step 4 walkthrough and checklist item 5 to reflect the new
algorithm registration mechanism.  Adding a new algorithm now requires
a single entry in ALGORITHM_REGISTRY in spras/config/util.py instead
of manually importing the class and editing the dict in runner.py.
Replace the two-layer design (_resolve_container_override +
_resolve_singularity_image both reading image_override) with a single
entry point resolve_container_image() that returns a frozen
ResolvedImage dataclass carrying the image string and an is_local_sif
flag.

Key changes in containers.py:
- Add ResolvedImage(image, is_local_sif) frozen dataclass
- resolve_container_image() is the single place image_override is read;
  .sif + singularity returns ResolvedImage(sif_path, True) directly
- Rename _resolve_singularity_image to _prepare_singularity_image;
  it receives ResolvedImage and never re-reads image_override. Its job
  is to prepare (e.g. unpack) the apptainer image
- run_container_singularity() takes ResolvedImage instead of str
- Add warnings for suspicious overrides (excessive path depth, bare
  hostname) with "Attempting to use as-is" messaging
Expand the containers.images comment to mention that shared-fs-usage
is configured via spras_profile, making the context clearer for users
unfamiliar with the HTCondor integration.
Adds a space to an error message, and stops some apptainer tests
from running on non-Linux systems, where `spython` imports cause
problems.
@jhiemstrawisc jhiemstrawisc force-pushed the explicit-sif-transfer branch from beabea9 to f09c88f Compare April 29, 2026 20:20
@github-actions github-actions Bot removed the merge-conflict This PR has merge conflicts. label Apr 29, 2026
@jhiemstrawisc jhiemstrawisc requested a review from agitter April 29, 2026 20:59
@tristan-f-r
Copy link
Copy Markdown
Collaborator

[After tinkering with #457,] there does seem to be a better solution to this that avoids the dynamic algorithm loading entirely:

We can localize images to specific algorithms, e.g.:

allpairs:
  image_override: "..."
  runs: ...

and we can consider this as a kind of "run setting." I'll see if I can lift this run setting configuration out of #457 for general usage.

@jhiemstrawisc
Copy link
Copy Markdown
Collaborator Author

[After tinkering with #457,] there does seem to be a better solution to this that avoids the dynamic algorithm loading entirely:

We can localize images to specific algorithms, e.g.:

allpairs:
  image_override: "..."
  runs: ...

Tony and I discussed this approach and ultimately decided against it because the chosen approach keeps container logic in the yaml's container configuration.

@ntalluri ntalluri mentioned this pull request May 6, 2026
Comment thread spras/config/util.py
# AlgorithmName("PathLinker") resolves to AlgorithmName.pathlinker.
AlgorithmName = CaseInsensitiveEnum("AlgorithmName", {k: k for k in ALGORITHM_REGISTRY})

def get_valid_algorithm_names() -> set[str]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a way to override the Cytoscape docker image as well?

Copy link
Copy Markdown
Collaborator

@tristan-f-r tristan-f-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small cytoscape note: I don't have an immediate way to address that problem without making cytoscape a psuedo-algorithm, so I'm wondering if we even need to worry about it.

@tristan-f-r
Copy link
Copy Markdown
Collaborator

tristan-f-r commented May 7, 2026

Should we have a way to override the Cytoscape docker image as well?

In line with this, noting that resolved_container_image effectively (with one warning-based exception) ignores container_suffix when the respective image_override is specified, we can instead have it so that instead of each algorithm specifying their own hard-coded container_suffix that is overriden in containers.py, we can instead have each algorithm poll ProcessedContainerSettings directly. As a more concrete sketch of this with e.g. PathLinker, we can have the following:

(for this line in pathlinker.py:)

-container_suffix = "pathlinker:v2"
+container_suffix = container_settings.resolve_image("pathlinker")

where then, in config.yaml, containers.registry.images would in-code have all of the defaults set to the appropriate container versions, which we can do quite easily with pydantic. For the example above, containers.registry.images.pathlinker would have the default value pathlinker:v2, and we would use your existing "container string -> resolved image" pipeline for ProcessedContainerSettings#resolve_image.

This means that we do have to maintain two separate lists of docker containers and algorithms, but then that also means we can differentiate between our list of algorithms versus other containers (e.g. Cytoscape), while getting free test-time guarantees that all algorithms are specified, since we explicitly would depend on the default values specified in such a container registry when calling resolve_image. (This also generalizes to, say, cases where algorithms have run several docker containers with different tooling: we already see behavior approaching this with DOMINOs usage of two resource-heavy tools, though they are luckily present in the same container with little difference in dependencies.)

[If you agree with this change but also don't want to implement it, I can try my hand at making a PR over your fork of SPRAS to make these changes.]

@agitter
Copy link
Copy Markdown
Collaborator

agitter commented May 8, 2026

@tristan-f-r in your proposed redesign, what would the new config.yaml syntax look like? Would that change?

Where would the two lists of docker containers and algorithms need to be maintained? We would still have the ALGORITHM_REGISTRY in config.util and now we would also have a list of default images in containers.registry.images? I'm usually reluctant to have two similar lists to manually maintain because it creates opportunities for errors as new contributors add methods or things diverge over time.

It was a good catch that we can't override Cytoscape currently. Our existing Cytoscape support is weak, so if we exclude it in this pull request and document that, I am okay deferring a more general solution for a later time when someone needs that functionality.

@tristan-f-r
Copy link
Copy Markdown
Collaborator

tristan-f-r commented May 8, 2026

config.yaml would look exactly the same: it's the internals I'm worried about. Specifically, we can drop the use of dynamic module loading in ALGORITHM_REGISTRY with this (instead continuing to use the old algorithms dictionary currently in spras), since the containers configuration would no longer require knowing algorithm names to inject image_override into their respective container settings.

Effectively, I would like to decouple container logic from the rest of SPRAS more than the currently implemented ALGORITHM_REGISTRY solution allows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants