Define GPU matchmaking expression between job and machine and rename job GPU parameters #11942

amaltaro · 2024-03-21T08:34:00Z

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
At the moment, no matchmaking takes GPU properties into account. The only logic implemented in SI so far is to request GPU pilots whenever jobs request GPUs; those jobs then match against the GPU pilots, which advertise GPUs.

We need to define a matchmaking expression that takes into consideration the 3 mandatory GPU parameters, currently named (and likely to be renamed) GPUMemoryMB, CUDACapability and CUDARuntime.

Describe the solution you'd like
This ticket requests two main changes:

1. Rename the job GPU parameters to something like:
   GPUMemoryMB --> DESIRED_GPUMemoryMB
   CUDACapability --> DESIRED_GPUCapability
   CUDARuntime --> DESIRED_GPURuntime
2. Define a requirements expression that takes into consideration the 3 job classads above, in addition to RequestGPUs, which says how many GPUs the job is looking for.
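
To make the two requested changes concrete, here is a minimal Python sketch. It is only an illustration, not the WMAgent or SI Frontend implementation. The machine-side attribute names in HYPOTHETICAL_MACHINE_ATTRS are placeholders (this ticket does not say what the GPU pilots advertise), and the comparison chosen for each parameter (>= for memory, list membership for capability, equality for runtime) is an assumption that still has to be agreed on.

```python
# Renaming proposed in this ticket: current job parameter -> new job classad.
GPU_PARAM_RENAMES = {
    "GPUMemoryMB": "DESIRED_GPUMemoryMB",
    "CUDACapability": "DESIRED_GPUCapability",
    "CUDARuntime": "DESIRED_GPURuntime",
}

# ASSUMPTION: placeholder names for what a GPU pilot might advertise.
# They are hypothetical and must be replaced with the real machine classads.
HYPOTHETICAL_MACHINE_ATTRS = {
    "count": "GPUs",
    "memory": "MACHINE_GPUMemoryMB",
    "capability": "MACHINE_GPUCapability",
    "runtime": "MACHINE_GPURuntime",
}


def rename_gpu_params(params):
    """Return a copy of params with the three GPU keys renamed as proposed."""
    return {GPU_PARAM_RENAMES.get(key, key): value for key, value in params.items()}


def build_gpu_requirements(machine_attrs=HYPOTHETICAL_MACHINE_ATTRS):
    """Build a job-side GPU requirements clause as a ClassAd expression string."""
    clauses = [
        # Enough GPUs on the slot.
        f"(TARGET.{machine_attrs['count']} >= RequestGPUs)",
        # ASSUMPTION: the slot must offer at least the desired GPU memory.
        f"(TARGET.{machine_attrs['memory']} >= DESIRED_GPUMemoryMB)",
        # ASSUMPTION: DESIRED_GPUCapability is a comma-separated list of
        # acceptable capabilities and the machine advertises one as a string.
        f"stringListMember(TARGET.{machine_attrs['capability']}, DESIRED_GPUCapability)",
        # ASSUMPTION: exact runtime match; a version-range rule may be needed.
        f"(TARGET.{machine_attrs['runtime']} == DESIRED_GPURuntime)",
    ]
    gpu_expr = " && ".join(clauses)
    # Jobs that request no GPU are left unconstrained by this clause.
    return f"(isUndefined(RequestGPUs) || RequestGPUs == 0 || ({gpu_expr}))"


if __name__ == "__main__":
    # Illustrative values only, not taken from a real job.
    job_params = {"RequestGPUs": 1, "GPUMemoryMB": 8000,
                  "CUDACapability": "7.5,8.0", "CUDARuntime": "11.2"}
    print(rename_gpu_params(job_params))
    print(build_gpu_requirements())
```

The resulting string could be appended with && to the existing job requirements, whichever component (WMAgent or SI Frontend) ends up owning the expression.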

Describe alternatives you've considered
The job matchmaking expression can be defined in two places:
a) WMAgent
b) SI Frontend

I would love to hear the pros and cons of each. Marco M. is fine either way.

Additional context
Further context is provided in this JIRA ticket: https://its.cern.ch/jira/browse/CMSSI-79
which talks about condor job classad renaming and heterogeneous StepChain requirements.

In addition, here is the current job requirements expression:

$ condor_q -l 23332.0 -bet
    (stringListMember(TARGET.Arch,REQUIRED_ARCH)) && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) &&
    (TARGET.HasFileTransfer)

and this is the machine expression in place:

$ condor_q -l 23332.0 -bet -reverse -machine slot1_1@glidein_31901_10637214@lnxfarm315.colorado.edu
    START && (WithinResourceLimits)
  START is
    (true &&
            (SiteWMS_WN_Draining is false)) && ((true) && (true) &&
            (ifthenelse(GLIDEIN_REQUIRED_OS is "any",(HAS_SINGULARITY is true &&
                        GLIDEIN_PS_HAS_SINGULARITY isnt false),(isUndefined(REQUIRED_OS) ||
                        REQUIRED_OS is "any" ||
            REQUIRED_OS is GLIDEIN_REQUIRED_OS)) &&
                ifthenelse(MaxWallTimeMins isnt undefined,(MaxWallTimeMins * 60) < (GLIDEIN_ToDie - MyCurrentTime),(16 * 3600) < (GLIDEIN_ToDie - MyCurrentTime)) &&
                ((DynamicSlot isnt true) || (RequestCpus is Cpus)) &&
                ifthenelse(SlotType is "Static",RequestCpus <= Cpus,true)) &&
            ((ifthenelse(DESIRED_Sites isnt undefined,stringListMember(GLIDEIN_CMSSite,DESIRED_Sites),undefined) ||
                    ifthenelse(DESIRED_Gatekeepers isnt undefined,stringListMember(GLIDEIN_Gatekeeper,DESIRED_Gatekeepers),undefined)) &&
                (isUndefined(RequestGPUs) || RequestGPUs is 0))) &&
        (((GLIDEIN_ToRetire is undefined) ||
        (CurrentTime < GLIDEIN_ToRetire))) &&
        (true)
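
Note that the only GPU-related clause in the START expression above is (isUndefined(RequestGPUs) || RequestGPUs is 0), so this particular slot only accepts jobs that request no GPU. A minimal Python sketch of how that clause reads, assuming the job ad is modeled as a plain dict (a simplification for illustration, not the HTCondor ClassAd API):

```python
def slot_accepts_gpu_request(job_ad):
    """Mirror (isUndefined(RequestGPUs) || RequestGPUs is 0) for a dict-based job ad."""
    # A missing key stands in for a ClassAd "undefined" attribute.
    request_gpus = job_ad.get("RequestGPUs")
    if request_gpus is None:
        return True
    # This sketch treats only the integer 0 as a match; how ClassAds compare
    # other value types with "is" is not modeled here.
    return type(request_gpus) is int and request_gpus == 0


if __name__ == "__main__":
    for ad in ({}, {"RequestGPUs": 0}, {"RequestGPUs": 1}):
        print(ad, slot_accepts_gpu_request(ad))
```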


amaltaro added New Feature SimpleCondorPlugIn GPU WMAgent labels Mar 21, 2024
