Introduce TritonService and workflows #32576

Merged
merged 31 commits into from Feb 4, 2021


@kpedro88 kpedro88 commented Dec 23, 2020

PR description:

This PR introduces a new edm::Service, TritonService, as a central point for the management of Triton inference servers. The service does the following:

  1. The TritonService parameter set includes a list of known servers. For each server, a name, address, and port should be provided.
  2. At initialization, the service lists the models hosted by each server, and creates maps of server:model and model:server.
  3. During construction of SonicTriton modules, the service is informed of the model requested by each module. Any requested model not provided by a known server is kept in a separate list.
  4. Before beginJob, if the "fallback" option is enabled and the list of unserved models is non-empty, a local server is launched using Singularity. This fallback server can use a local GPU if one is available (currently detected via nvidia-smi); otherwise, it uses the local CPU.
  5. TritonClient construction is moved to occur during beginJob; at this time, the service provides a server address that serves the model for the client. (The client can request a specific server by name.) If the CPU fallback server is being used, the client is forced into Sync mode to prevent contention.
  6. At destruction time, the service shuts down the local fallback server, if one was started.
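
The bookkeeping in steps 2-5 can be sketched in plain Python. This is only an illustration of the idea, not the C++ implementation in this PR: the class name `ServerRegistry`, its method names, the fallback address, and the server and model names in the usage lines are all hypothetical. In particular, how the real service reacts when a preferred server does not host the requested model is not shown here; this sketch simply picks another server.

```python
from collections import defaultdict


class ServerRegistry:
    """Toy stand-in for the service's bookkeeping (hypothetical, not the real API)."""

    FALLBACK = "fallback"  # assumed name for the local fallback server

    def __init__(self, servers):
        # servers: {name: {"address": str, "port": int, "models": [str, ...]}}
        self.servers = {name: dict(info) for name, info in servers.items()}
        self.server_to_models = defaultdict(set)
        self.model_to_servers = defaultdict(set)
        self.unserved = set()
        for name, info in servers.items():
            for model in info["models"]:
                self.server_to_models[name].add(model)
                self.model_to_servers[model].add(name)

    def add_model(self, model):
        """Called during module construction; remember models no known server hosts."""
        if model not in self.model_to_servers:
            self.unserved.add(model)

    def start_fallback(self, address="127.0.0.1", port=8001, use_gpu=False):
        """Pretend to launch a local server hosting every unserved model."""
        if not self.unserved:
            return False
        self.servers[self.FALLBACK] = {"address": address, "port": port, "gpu": use_gpu}
        for model in self.unserved:
            self.server_to_models[self.FALLBACK].add(model)
            self.model_to_servers[model].add(self.FALLBACK)
        self.unserved.clear()
        return True

    def server_for(self, model, preferred=""):
        """Return (address, port, force_sync) for a client created at beginJob."""
        candidates = self.model_to_servers.get(model, set())
        if not candidates:
            raise LookupError(f"no server provides model {model!r}")
        # Simplification: ignore the preference if that server lacks the model.
        name = preferred if preferred in candidates else sorted(candidates)[0]
        info = self.servers[name]
        # Only the CPU fallback server forces the client into Sync mode.
        force_sync = name == self.FALLBACK and not info.get("gpu", False)
        return info["address"], info["port"], force_sync


# Usage with made-up servers and models
registry = ServerRegistry(
    {"remote": {"address": "triton.example.org", "port": 8001, "models": ["modelA"]}}
)
registry.add_model("modelA")  # served by "remote"
registry.add_model("modelB")  # not served anywhere, so it lands in the unserved list
registry.start_fallback(use_gpu=False)
print(registry.server_for("modelB"))
```

Keeping both maps makes the two lookups (which models a server hosts, which servers host a model) cheap, at the cost of updating them together.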

This behavior is now automatically tested in a unit test. The actual unit test only runs if the necessary ingredients (AVX instructions, model file, and Singularity runtime) are available; otherwise it trivially returns true. The "graph" test modules are chosen for the unit test because they a) use more features (multiple named inputs, variable dimensions) and b) use a smaller model file.
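
As a rough sketch of the "run only if the ingredients are available" logic, the check could look like the following in Python. The function names and the way the inputs are passed in are assumptions made for illustration; the actual unit test in this PR may perform these checks differently.

```python
import os
import shutil


def has_avx(cpuinfo_text):
    """True if a 'flags' line in /proc/cpuinfo-style text lists the avx flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[-1].split()
            if "avx" in flags:
                return True
    return False


def can_run_test(cpuinfo_text, model_path, runtime="singularity"):
    """All three ingredients must be present, otherwise the test is skipped."""
    return (
        has_avx(cpuinfo_text)
        and os.path.isfile(model_path)
        and shutil.which(runtime) is not None
    )


# Usage: on Linux the CPU flags would come from /proc/cpuinfo; a literal string is used here.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2\n"
print(has_avx(sample))
```

A wrapper would then run the real test when `can_run_test(...)` is true and report success immediately otherwise.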

The creation of a cms-data area for HeterogeneousCore-SonicTriton is requested, in order to host the model files necessary for the unit test. (Once this is provided, those model files will be removed from the fetch_model.sh script.)

Other miscellaneous changes in this PR:

  • A special upgrade workflow and corresponding ProcessModifiers for SonicTriton are introduced. This workflow will enable more testing of SonicTriton modules.
    • enableSonicTriton adds the TritonService to the process and enables the local fallback server.
    • allSonicTriton is intended to aggregate any other modifiers needed to enable SonicTriton modules.
  • There are now dedicated SonicTriton module base classes, following the types defined in SonicCore: TritonEDProducer, TritonEDFilter, TritonOneEDAnalyzer. These types are all tested in the new unit test.
    • TritonGraphProducer is refactored to create instances of all module types with minimal duplication, with the amount of artificially-generated input data reduced for the case of the unit test.
    • There are some minor implications for the use of GlobalCache with SonicTriton modules, which are discussed in the documentation.
  • The shell script triton to start and stop the local servers is moved to HeterogeneousCore/SonicTriton/scripts, in order to make it available in the path. A number of new options are added and described in the documentation, notably:
    • usage of multiple model repositories
    • retry starting the container (Singularity can time out the first time)
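
The retry behavior can be pictured with a small Python sketch. The shell script in this PR implements its own version; the helper name `start_with_retry` and its parameters below are hypothetical, and the launcher is a stand-in for whatever command starts the container.

```python
import time


def start_with_retry(launch, retries=3, wait_seconds=0.0):
    """Call launch() until it returns exit code 0; return the attempt that succeeded."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        if launch() == 0:
            return attempt
        if attempt < retries:
            time.sleep(wait_seconds)  # give the runtime a moment before retrying
    raise RuntimeError(f"launch failed after {retries} attempts")


# Usage: a fake launcher that fails once (e.g. a timeout), then succeeds.
outcomes = iter([1, 0])
attempt = start_with_retry(lambda: next(outcomes))
print(f"server started on attempt {attempt}")
```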

The TritonClient API is modified:

  • removed fields: batchSize, address, port
  • added fields: modelConfigPath, preferredServer
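
For anyone updating an existing configuration, the field changes amount to the following, shown here on a plain Python dict rather than a real parameter set. The helper name, the other keys in the example, and all values are made up for illustration.

```python
REMOVED_FIELDS = ("batchSize", "address", "port")
ADDED_FIELDS = ("modelConfigPath", "preferredServer")


def migrate_client_params(old, model_config_path, preferred_server=""):
    """Drop the removed fields and fill in the added ones (hypothetical helper)."""
    new = {key: value for key, value in old.items() if key not in REMOVED_FIELDS}
    new["modelConfigPath"] = model_config_path
    new["preferredServer"] = preferred_server
    return new


# Usage with made-up values
old_params = {
    "modelName": "modelA",
    "batchSize": 1,
    "address": "triton.example.org",
    "port": 8001,
}
print(migrate_client_params(old_params, "path/to/config.pbtxt"))
```

The address and port no longer appear on the client side because the service supplies the server for each model.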

Caveats and future work:

  • The unit test, which launches a fallback server using Singularity, can be run inside another Singularity container if the following conditions are met:
    • The Singularity runtime should be available inside the top-level container (can be accomplished by building it into the container image, or mounting necessary directories when starting the container)
    • setuid should be disabled and unprivileged user namespaces should be enabled (as far as I understand, this is the current recommended best practice for CMS sites)
  • The server:model and model:server maps in TritonService could be more efficiently implemented using boost::bimap. However, the necessary libraries to use some of the support classes (such as boost::bimaps::unordered_multiset_of) are not distributed with CMSSW. If the boost external is updated, this option could be pursued again in a followup PR.
  • The triton shell script could be reimplemented in Python to be more maintainable and robust. However, Python's built-in shell interaction capabilities are limited and cumbersome. As a future development item, the sh Python library could be added as a CMS external and then used in this script.

The noted development items above are intended for future PRs because they don't impact functionality or APIs, and therefore further delaying this PR (on which several other PRs and working branches will need to be rebased) is not warranted.

The initial presentation of this idea in the core software meeting can be found here: https://indico.cern.ch/event/983377/#16-design-of-the-tritonservice.

PR validation:

Ran the unit test successfully (many times), using both CPU and GPU for the fallback server.

For reference, Singularity-within-Singularity instructions (tested on cmslpc at FNAL and lxplus at CERN) are provided.

Setup:

cmsrel CMSSW_11_3_0_pre1
cd CMSSW_11_3_0_pre1/src
cmsenv
git cms-checkout-topic -u kpedro88:TritonService113X
scram b
cd HeterogeneousCore/SonicTriton/test
./fetch_model.sh

Running test in Singularity†‡:

cmssw-cc7 -B $(readlink -f $HOME) -B $(readlink -f $CMSSW_BASE) -B $(mktemp -d):/eos
cmsenv
cmsRun tritonTest_cfg.py maxEvents=1 modules=TritonGraphProducer unittest=1

† The fake eos directory is needed because cmssw-cc7 binds /eos and the Triton server's nvidia_entrypoint.sh does a find /, which can take a very long time, especially on CERN EOS.

‡ In the course of this PR, the cmssw-cc7 Singularity container was updated to include the Singularity runtime natively. For future reference, if it did not contain the Singularity runtime, the following bind arguments should be added to the invocation, in order to include the runtime + dependencies from the host:

-B /usr/bin/singularity -B /lib64/libseccomp.so.2 -B /etc/singularity -B /usr/libexec/singularity -B /var/singularity
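
If the list of host paths needs to be adjusted for a given site, the bind arguments can be generated rather than typed by hand. This is a small Python sketch; the helper name `bind_args` is hypothetical, and the paths are just the ones listed above.

```python
HOST_RUNTIME_PATHS = [
    "/usr/bin/singularity",
    "/lib64/libseccomp.so.2",
    "/etc/singularity",
    "/usr/libexec/singularity",
    "/var/singularity",
]


def bind_args(paths):
    """Turn a list of host paths into a flat list of '-B <path>' arguments."""
    args = []
    for path in paths:
        args.extend(["-B", path])
    return args


# Usage: print the arguments in the form expected on the command line.
print(" ".join(bind_args(HOST_RUNTIME_PATHS)))
```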

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32576/20562

@cmsbuild
Contributor

A new Pull Request was created by @kpedro88 (Kevin Pedro) for master.

It involves the following packages:

Configuration/ProcessModifiers
Configuration/PyReleaseValidation
Configuration/StandardSequences
HeterogeneousCore/SonicCore
HeterogeneousCore/SonicTriton

@jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @makortel, @franzoni, @silviodonato, @fwyzard, @qliphy, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks.
@fabiocos, @makortel, @felicepantaleo, @riga, @GiacomoSguazzoni, @rovere, @VinInn, @Martin-Grunewald, @lecriste, @mtosi, @dgulhan, @slomeo this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@kpedro88
Contributor Author

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b0561a/11844/summary.html
CMSSW: CMSSW_11_3_X_2020-12-22-2300/slc7_amd64_gcc900

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 36
  • DQMHistoTests: Total histograms compared: 2716967
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2716944
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 35 files compared)
  • Checked 153 log files, 37 edm output root files, 36 DQM output files

def _addTritonService(process):
    process.load("HeterogeneousCore.SonicTriton.TritonService_cff")
from Configuration.ProcessModifiers.enableSonicTriton_cff import enableSonicTriton
modifyConfigurationStandardSequencesServicesAddTritonService_ = enableSonicTriton.makeProcessModifier(_addTritonService)

In principle, loading TritonService_cff could be done in the customize function, but to me doing it here with a Modifier does not make a big difference (and it now demonstrates what "adding more Services like these" would look like).

(this is my last comment)

Contributor Author


I'd prefer to keep it here if that's acceptable.

@kpedro88
Contributor Author

kpedro88 commented Feb 1, 2021

@makortel any further comments?

@makortel
Contributor

makortel commented Feb 1, 2021

+heterogeneous

@kpedro88
Contributor Author

kpedro88 commented Feb 1, 2021

@jordan-martins, @chayanit, @wajidalikhan, @srimanob, @silviodonato, @qliphy please take a look

@kpedro88
Contributor Author

kpedro88 commented Feb 2, 2021

@kpedro88
Contributor Author

kpedro88 commented Feb 3, 2021

@cms-sw/pdmv-l2 @cms-sw/upgrade-l2 the only relevant change for your signing categories is the addition of a placeholder special workflow. Please review and sign ASAP (the PR has already been open for more than a month).

@chayanit

chayanit commented Feb 3, 2021

+1

@srimanob
Contributor

srimanob commented Feb 4, 2021

+Upgrade

@kpedro88
Contributor Author

kpedro88 commented Feb 4, 2021

@silviodonato @qliphy @smuzaffar this should be merged along with cms-data/HeterogeneousCore-SonicTriton#1 (and any associated cmsdist update)

@smuzaffar
Contributor

cms-data/HeterogeneousCore-SonicTriton#1 has been merged and should be available in next IB.

@qliphy
Contributor

qliphy commented Feb 4, 2021

+1

@cmsbuild
Contributor

cmsbuild commented Feb 4, 2021

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged.

10 participants