Define GPU matchmaking expression between job and machine and rename job GPU parameters #11942

Open
amaltaro opened this issue Mar 21, 2024 · 0 comments

amaltaro commented Mar 21, 2024

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
At the moment, no matchmaking takes GPU properties into account. The only logic implemented in SI so far is to request GPU pilots whenever jobs request GPUs; those jobs then match against the GPU pilots, which advertise GPUs.

We need to define a matchmaking expression that takes into consideration the 3 mandatory GPU parameters, currently named (and likely to be renamed) GPUMemoryMB, CUDACapability and CUDARuntime.

Describe the solution you'd like
This ticket requests two main changes:

  1. rename the job GPU parameters to something like:
    GPUMemoryMB --> DESIRED_GPUMemoryMB
    CUDACapability --> DESIRED_GPUCapability
    CUDARuntime --> DESIRED_GPURuntime

  2. define a requirement expression that takes into consideration the 3 job classads above, in addition to RequestGPUs, which says how many GPUs the job is looking for.
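As a rough illustration of the renaming in item 1, the mapping could be applied to a dictionary of job parameters as below. This is only a sketch: the helper name `rename_gpu_params` and the plain-dict representation of the job parameters are assumptions made for the example, not existing WMAgent code, and the target names are the tentative ones proposed above.

```python
# Tentative old-name -> new-name mapping, taken from the proposal above.
GPU_PARAM_RENAMES = {
    "GPUMemoryMB": "DESIRED_GPUMemoryMB",
    "CUDACapability": "DESIRED_GPUCapability",
    "CUDARuntime": "DESIRED_GPURuntime",
}


def rename_gpu_params(job_params):
    """Return a copy of job_params with the GPU keys renamed.

    Hypothetical helper: keys that are not GPU parameters (for example
    RequestGPUs) are passed through unchanged.
    """
    return {GPU_PARAM_RENAMES.get(name, name): value
            for name, value in job_params.items()}
```

Keeping the mapping in one table would make it easy to adjust if the final classad names differ from the ones proposed here.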

Describe alternatives you've considered
The job matchmaking expression can be defined in two places:
a) WMAgent
b) SI Frontend

I would love to hear the pros and cons of each. Marco M. is fine either way.

Additional context
Further context is provided in this JIRA ticket: https://its.cern.ch/jira/browse/CMSSI-79
which talks about condor job classad renaming and heterogeneous StepChain requirements.

In addition, here is the current job requirements expression:

$ condor_q -l 23332.0 -bet
    (stringListMember(TARGET.Arch,REQUIRED_ARCH)) && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) &&
    (TARGET.HasFileTransfer)

and this is the machine expression in place:

$ condor_q -l 23332.0 -bet -reverse -machine slot1_1@glidein_31901_10637214@lnxfarm315.colorado.edu
    START && (WithinResourceLimits)
  START is
    (true &&
            (SiteWMS_WN_Draining is false)) && ((true) && (true) &&
            (ifthenelse(GLIDEIN_REQUIRED_OS is "any",(HAS_SINGULARITY is true &&
                        GLIDEIN_PS_HAS_SINGULARITY isnt false),(isUndefined(REQUIRED_OS) ||
                        REQUIRED_OS is "any" ||
            REQUIRED_OS is GLIDEIN_REQUIRED_OS)) &&
                ifthenelse(MaxWallTimeMins isnt undefined,(MaxWallTimeMins * 60) < (GLIDEIN_ToDie - MyCurrentTime),(16 * 3600) < (GLIDEIN_ToDie - MyCurrentTime)) &&
                ((DynamicSlot isnt true) || (RequestCpus is Cpus)) &&
                ifthenelse(SlotType is "Static",RequestCpus <= Cpus,true)) &&
            ((ifthenelse(DESIRED_Sites isnt undefined,stringListMember(GLIDEIN_CMSSite,DESIRED_Sites),undefined) ||
                    ifthenelse(DESIRED_Gatekeepers isnt undefined,stringListMember(GLIDEIN_Gatekeeper,DESIRED_Gatekeepers),undefined)) &&
                (isUndefined(RequestGPUs) || RequestGPUs is 0))) &&
        (((GLIDEIN_ToRetire is undefined) ||
        (CurrentTime < GLIDEIN_ToRetire))) &&
        (true)
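The START expression above only admits jobs that request no GPUs (`isUndefined(RequestGPUs) || RequestGPUs is 0`). To help discuss what a GPU-aware clause would need to check, here is a minimal Python model of one possible match semantics. It is not a ClassAd expression and not a proposal for the final logic. Everything about the semantics is an assumption: the machine-side attribute names (`GPUs`, `GPUMemoryMB`, `GPUCapability`, `GPURuntime`) are hypothetical placeholders, the job capability is assumed to be a comma-separated list of acceptable values, and the job runtime is assumed to be a minimum version.

```python
def parse_version(text):
    """Turn a dotted version string such as "11.2" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def gpu_match(job, machine):
    """Model a possible GPU match between a job ad and a machine ad.

    Both ads are plain dicts. Machine attribute names are hypothetical
    placeholders, not classads that pilots are known to advertise.
    """
    requested = job.get("RequestGPUs")
    if not requested:
        # Undefined or 0: no GPU constraint, as in the current START clause.
        return True
    if machine.get("GPUs", 0) < requested:
        return False
    # Assumption: the 3 GPU parameters are mandatory, so a GPU job that
    # lacks any of them never matches.
    try:
        wanted_memory = job["DESIRED_GPUMemoryMB"]
        wanted_caps = [c.strip() for c in job["DESIRED_GPUCapability"].split(",")]
        wanted_runtime = job["DESIRED_GPURuntime"]
    except KeyError:
        return False
    if machine.get("GPUMemoryMB", 0) < wanted_memory:
        return False
    # Assumption: the machine advertises a single capability string.
    if machine.get("GPUCapability") not in wanted_caps:
        return False
    runtime = machine.get("GPURuntime")
    if runtime is None:
        return False
    # Assumption: the job value is the minimum acceptable runtime version.
    return parse_version(runtime) >= parse_version(wanted_runtime)
```

A model like this can be exercised with a handful of job and machine dictionaries to agree on the intended behaviour, before the clause is written as a ClassAd expression in either WMAgent or the SI Frontend.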