Store "architecture" in event provenance #30044

Open · makortel opened this issue May 29, 2020 · 23 comments
@makortel (Contributor)

Now that scram gained the ability to build (and load) individual libraries and externals for different "vector architectures" (the exact meaning of "architecture" will likely evolve), we should find a way to track that "architecture" in the event provenance.

After we have figured out the storage model, we need to decide how to communicate the "architecture" from scram to cmsRun.
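To illustrate one conceivable channel (purely an assumption; nothing here is decided): scram could export the selected flavor in an environment variable in the spirit of the existing SCRAM_ARCH, which today conveys the base platform, and cmsRun could read it at startup. A finer-grained "vector architecture" would need an analogous variable.

```cpp
// Illustrative sketch only: reads the base platform string that scram
// already exports; the vector-architecture channel is an open question.
#include <cstdlib>
#include <string>

std::string currentArchitecture() {
  char const* arch = std::getenv("SCRAM_ARCH");  // e.g. "slc7_amd64_gcc820"
  return arch ? std::string(arch) : std::string("unknown");
}
```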

@makortel (Contributor, Author)

assign core

@cmsbuild (Contributor)

New categories assigned: core

@Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild (Contributor)

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@wddgit (Contributor) commented Dec 13, 2021

@makortel

Is the decision to use a GPU to produce products with a certain branch name one that applies to an entire process, or can the decision differ between Events in the same process? Can the solution we are designing assume the decision is always process-wide?

Is there a simple toy test configuration that will produce a product on a GPU? I have not actually worked with these GPU issues; I don't know much about them yet, although it is interesting. Is there any documentation about the current state of things and how the decision is made and recorded? Is this in the SwitchProducer related code?

@makortel (Contributor, Author)

Just to summarize our discussion, at this point the plan is to store only process-level quantities:

  1. Generic device type ("GPU", "FPGA"): as a vector of enumeration values. Intended e.g. for the DQM use case (Automatic switching between CPU and GPU for DQMEDAnalyzer #35879), to know if a(n earlier) process had offloading enabled.
  2. Exact CPU and GPU models: as a vector of strings of the CPU/GPU/etc model names. Intended for debugging (the original topic of this issue).

For point 1, the main information source would be the configuration (e.g. SwitchProducer makes its decision in python and communicates that to the C++ code via the configuration), so this connects to #31760.

For point 2, the same information goes (or should eventually go, for GPUs) into the framework job report, and we should think about how to deliver the information to both in a unified way (likely from CUDAService in the case of NVIDIA GPUs).
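A minimal sketch of what such a process-level product could look like; the type and field names below are illustrative assumptions, not existing CMSSW classes:

```cpp
#include <string>
#include <vector>

// Point 1: generic device classes enabled in the process
enum class AcceleratorType : short { CPU = 0, GPU = 1, FPGA = 2 };

// Hypothetical process-level data product combining points 1 and 2
struct AcceleratorProvenance {
  std::vector<AcceleratorType> deviceTypes;  // e.g. {CPU, GPU}
  std::vector<std::string> hardwareModels;   // exact CPU/GPU/etc model names
};
```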

@wddgit (Contributor) commented Jan 13, 2022

@makortel

I made a first pass at this. I know it is not done and there are issues that need to be resolved before we even think about merging this. But my working branch is here if you want to take a look.

https://github.com/wddgit/cmssw/tree/acceleratorProvenance

One thing we had discussed was making the module that produces this similar to TriggerResults, in that it is not scheduled in the configuration and just always runs. The code for that is in the branch above, but I noticed a number of problems with this when I tried to run the existing unit tests. Possibly this idea is a mistake?

  1. It makes all processes produce something. That affects process history IDs in the case of merge jobs, which seems very bad. At the very least, we should probably only run this if we produce trigger results, to remove an undesirable side effect on the ProcessHistory.

  2. It also breaks a lot of unit tests because something is there that wasn't there before. This isn't a show stopper, but it will take a bit of work to fix all the unit tests. For example, all module IDs are different because there is an extra module; lots of reference files and expected values in many tests change.

We could just make this a standard producer that you have to configure if you want this product. I'm starting to think that might be better. Doing it like TriggerResults also introduces special code in the middle of the Framework (tolerable, but not very nice for maintainability).

Another alternative would be a configuration parameter that allows one to turn it off.

I noticed the PR you submitted. Can I get the configured accelerator from this variable in the Options parameter set?

process.options.accelerators = cms.untracked.vstring('auto')

Should one of the enumeration values be assigned to the "auto" case?

Can I use "availableAccelerators" for the list of hardware accelerator chips? Or does that contain different information?

@makortel (Contributor, Author)

> 1. It makes all processes produce something. That affects process history IDs in the case of merge jobs, which seems very bad. At the very least, we should probably only run this if we produce trigger results, to remove an undesirable side effect on the ProcessHistory.

I agree that including this special producer in e.g. merge jobs would be bad. But I'm not sure the condition should be the same as for the TriggerResults producer. If I read the code correctly, TriggerResults is produced if the process has any (trigger) Paths, did I get that right?

For this case a somewhat natural condition would be "does the process have any modules that produce data products" (i.e. any EDProducers or EDFilters). I'd think a case where all EDProducers are in Tasks and an OutputModule in an EndPath drives which of the EDProducers get run would then not lead to TriggerResults being produced, but would benefit from "AcceleratorProvenance".

> We could just make this a standard producer that you have to configure if you want this product. I'm starting to think that might be better. Doing it like TriggerResults also introduces special code in the middle of the Framework (tolerable, but not very nice for maintainability).

A standard producer approach would work nicely with #36699 (to leading order, at least), but this infrastructure should also cover the CPU vector architecture (for the strings storing the details of the architecture; the enum wouldn't necessarily need a CPU element, but it might help the consistency of the vectors?). Ideally the information would be available from all processes that produced any data products, which would make the "standard producer" approach somewhat cumbersome from the user's point of view.

Maybe we should discuss more about this point (also with @Dr15Jones).

> Can I get the configured accelerator from this variable in the Options parameter set?

Yes, that was my idea.

> Should one of the enumeration values be assigned to the "auto" case?

That is not needed, because the auto value is expanded to the actually available accelerators already in python, and what gets passed to C++ contains the expanded list. So it is sufficient to interpret process.options.accelerators as it is. The @available_accelerators is mostly for throwing a specific exception from the C++ code in case some of the requested accelerators are missing from the system.

One question I'm thinking about is whether we want to interpret the "high level class" of the accelerator for the enum (e.g. GPU) from the process.options.accelerators strings (e.g. gpu-nvidia), or whether there should be a separate parameter for that. I'm leaning towards the former; fortunately, this should be relatively easy to change if it turns out not to work well.
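A minimal sketch of the former option, with hypothetical enum and function names; the only real input assumed is the process.options.accelerators strings such as "cpu" and "gpu-nvidia":

```cpp
#include <string>
#include <string_view>
#include <vector>

enum class DeviceClass { CPU, GPU, Other };

// Derive the high-level device class from an accelerator string
// (e.g. "gpu-nvidia" -> GPU); the mapping here is an assumption.
DeviceClass classify(std::string_view accelerator) {
  if (accelerator.substr(0, 4) == "gpu-")
    return DeviceClass::GPU;
  if (accelerator == "cpu")
    return DeviceClass::CPU;
  return DeviceClass::Other;
}

std::vector<DeviceClass> classifyAll(std::vector<std::string> const& accelerators) {
  std::vector<DeviceClass> result;
  result.reserve(accelerators.size());
  for (auto const& a : accelerators)
    result.push_back(classify(a));
  return result;
}
```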

@wddgit (Contributor) commented Jan 14, 2022

I agree. I will try to modify it to only run when any other product is produced. That is a better criterion than associating it with TriggerResults or the existence of a Path.

@wddgit (Contributor) commented Jan 14, 2022

Is there anything in CMSSW that already gets the hardware chip names, or should I start from scratch and figure out how to get this from the OS or environment?

Maybe we can discuss this again next week.

@makortel (Contributor, Author)

> Is there anything in CMSSW that already gets the hardware chip names, or should I start from scratch and figure out how to get this from the OS or environment?

We should discuss this aspect. There are already services (the CPU service and CUDAService) that dig up hardware information. The CPU service itself fills the JobReport:

void CPU::postEndJob() {
  Service<JobReport> reportSvc;
  // ...
}

and is being used by CondorStatusService:

edm::Service<edm::CPUServiceBase> cpusvc;
std::string models;
double avgSpeed;
if (cpusvc.isAvailable() && cpusvc->cpuInfo(models, avgSpeed)) {
  updateChirpQuoted("CPUModels", models);
  updateChirp("CPUSpeed", avgSpeed);
}

The information in CUDAService is not currently propagated to any monitoring (something to be addressed). I suspect the current approach (with the CPU service) won't really scale to many accelerator architectures, and am wondering whether we should somehow harmonize the information delivery (e.g. the CPU, CUDA, etc. services fill some central piece of information that the other Services/whatnot can ask for).

To leading order I'd expect this aspect to be decoupled from the actual storage and access part, as long as there is a sensible hook/place to fill the actual information once we figure out a feasible way to do that.
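A sketch of the harmonization idea, with invented names (none of this exists in CMSSW): each service deposits its hardware description in one central, thread-safe place that other services can query.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical central registry; e.g. the CPU service would call
// addModel(cpuModelName) at construction time, and CUDAService would
// do the same for each GPU it finds.
class HardwareInfoRegistry {
public:
  void addModel(std::string model) {
    std::lock_guard<std::mutex> lock(mutex_);
    models_.push_back(std::move(model));
  }
  std::vector<std::string> models() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return models_;
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::string> models_;
};
```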

@makortel (Contributor, Author)

@fwyzard What do you think, would it be useful to have the architecture/accelerator information stored by the framework (similar to provenance) for the HLT?

@smorovic @emeschi (motivated by the question above, for considering implementation options) What are the constraints for evolving the streamer file format (e.g. for adding support for ProcessBlock data products)? Is this something that could be done e.g. in a YETS during Run 3, something better left for LS3, or something more relaxed or more constrained?

@fwyzard (Contributor) commented Jan 20, 2022

> @fwyzard What do you think, would it be useful to have the architecture/accelerator information stored by the framework (similar to provenance) for the HLT?

I guess it could be useful, yes.
We have this information in aggregated form in the HLT DQM (e.g. the FastTimerService and ThroughputService use per-CPU-model subfolders), but nothing that is stored and propagated downstream.

I assume this information should be equivalent to untracked parameters, to avoid affecting the process hash?

> @smorovic @emeschi (motivated by the question above, for considering implementation options) What are the constraints for evolving the streamer file format (e.g. for adding support for ProcessBlock data products)? Is this something that could be done e.g. in a YETS during Run 3, something better left for LS3, or something more relaxed or more constrained?

I'm not very familiar with the streamer format, but one point to keep in mind is how this would affect (hierarchically) merging streamer files with different information.

@smorovic (Contributor)

@makortel, streamer is used as a temporary format to ship data at P5 and to Tier-0, and there are no constraints in terms of backwards compatibility (i.e. all clients should use the same major_minor version of CMSSW).

A streamer file consists of a header (the serialized registry) and event payloads (concatenated, one per event), but the variant of the format we use until the last step is missing the header, so we can merge files and prepend the header as the final step, all done for each lumisection.

Do you intend to have only one block globally (constant information per run), one (constant) block per process, or a unique one per event?

For the first case, it is like the current header: only one copy is taken and the others are discarded, so we would need to expand the header format to add another block. This kind of change could be done even during the year (hypothetical example: moving from 12_1 to 12_2).

The second option is that it is part of the file header, but in DAQ we would need to merge the blocks instead of discarding them, adding all blocks together as part of the merging process (including growing the set when processes crash/restart etc.). I would put that as a task for an EYETS, since it is significant for the HLT and merging system.
An alternative to this is to ship it from each process each lumisection, as a block within the file containing the event payloads, thus avoiding changes to the merging logic (assuming a reasonably small size compared to an event). Probably doable during the year.

If it's shipped with each event (assuming also that it's small), it could accompany the event payload, either as a separate block or as an event product like TriggerResults. For the latter, probably no changes are needed in the format. This can also be done during the year.

@emeschi (Contributor) commented Jan 20, 2022 via email

@fwyzard (Contributor) commented Jan 20, 2022

Yes, the idea would be to have this effectively per-event, so we can check offline if there are problems related to accelerator-specific implementations.
For example, re-running offline on a well-defined CPU-only architecture and comparing the results with the online ones.

So, storing the information in a per-Event collection would seem the easiest implementation, but it's not clear if it may cause problems downstream.
On the other hand, it would also be more future-proof, if we assume we may move to per-event decisions on the accelerator to use down the road.

Since I'm also not familiar with the ProcessBlock that Matti mentioned, let me add another comment :-)
If I understand correctly, the ProcessBlock can store products on a per-job basis, in a way that can be propagated through subsequent merge jobs.
This could be an efficient way of storing information that may change across different jobs, but not from event to event within a given job.
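For concreteness, a sketch of what producing such a per-process product into a ProcessBlock could look like, modeled on the ProcessBlock-producing test modules in FWCore; treat the exact module ability and method signatures as assumptions rather than a reference:

```cpp
#include "FWCore/Framework/interface/one/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/ProcessBlock.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/EDPutToken.h"

#include <string>
#include <vector>

class HardwareInfoProducer : public edm::one::EDProducer<edm::BeginProcessBlockProducer> {
public:
  explicit HardwareInfoProducer(edm::ParameterSet const&)
      : token_(produces<std::vector<std::string>, edm::Transition::BeginProcessBlock>()) {}

  void beginProcessBlockProduce(edm::ProcessBlock& pb) override {
    // One entry per hardware model seen by this process (placeholder values)
    pb.emplace(token_, std::vector<std::string>{"<CPU model>", "<GPU model>"});
  }

  void produce(edm::Event&, edm::EventSetup const&) override {}

private:
  edm::EDPutTokenT<std::vector<std::string>> token_;
};
```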

@makortel (Contributor, Author)

@fwyzard @smorovic @emeschi Thanks for your input!

At this point we're still considering various implementation options. The idea would be to store more information than just "was a GPU used or not"; right now this would include the CPU and GPU model information (note that the starting point of this issue was the CPU vector architecture). The main use cases would be monitoring and debugging after the fact.

For now our idea was to specifically store only process-wide information, since that's what we have, and in principle storing it in some per-process way should be the most efficient approach. If it turns out that we need to start deciding accelerator usage event-by-event, it should be straightforward to extend the model with an event product that stores only the information that can vary.

I agree that just storing the information in the event would be easiest to do, and in the end even the space needed for the CPU and GPU model names should be rather small. We'll digest the details you gave and think about this further.

@fwyzard (Contributor) commented Jan 21, 2022

By the way, if the streamer format eventually gains the possibility of storing ProcessBlock data products, we could also use it to store more per-job HLT strings and other information.

@wddgit (Contributor) commented May 5, 2022

FYI, a step forward in addressing this issue: #37831, just submitted and waiting for review comments from anyone interested. The information collected by the new service is the information that we propose be saved persistently. That PR does not address persistence, although that is coming in a future PR.

@wddgit (Contributor) commented Jun 1, 2022

I spent yesterday afternoon and this morning looking at this issue, and I am not sure how to proceed with it and the related PR. But here are some comments related to the streamer format, based on the comments received so far.

In HLT, several processes run on different machines (possibly different hardware/chips). Each produces a streamer file without a file header, just events; different events are processed on different machines. Later, in a separate process, these streamer files get merged by simple concatenation and a single header is added. There is currently no provision for complex logic here to do anything other than simple concatenation of events appended to a header.

Unless we dramatically change how this works, the only way to store information about the machine that processed an individual event is in the event itself; I see no way to store this in a ProcessBlock. Unless there is special code to merge the event information, I cannot even see how to create a ProcessBlock out of it, and such special code would have to run as part of the code that writes the file header. I am not sure how to do that: it is far outside the context of the current ProcessBlock code, outside my expertise, and outside the code Core usually controls. I'm not saying it cannot be done, or that I cannot do it. I suppose the process that creates the file header could read the events, merge the relevant information in some way, and create a block in the header. Or the process that converts from streamer format to edm format could do the same thing and then really create a ProcessBlock object (although I think currently that is a simple merge process with no producers).

Earlier comments:

From Emilio back in January: "Then the only thing that makes sense is to have this information as an event..." #30044 (comment)

From Andrea back in January: "Yes, the idea would be to have this effectively per-event, so we can check offline if there are problems related to accelerator-specific implementations." #30044 (comment)

So I am lost in trying to understand how to make a Streamer ProcessBlock for this and what information we would put in there.

Probably after this week we should put this on hold while Matti is away until July, unless we can resolve something this week that I can implement. I also need to focus on finishing the unrelated run concurrency PR, which is nearing completion; some uninterrupted time on that would help.

@makortel (Contributor, Author) commented Jun 2, 2022

I was wondering how the ParameterSetRegistry is handled (since we were thinking of exploiting it for the storage). I guess this

void StreamerOutputModuleBase::beginRun(RunForOutput const& iRun) {
  start();
  auto psetMapHandle = iRun.getHandle(psetToken_);
  std::unique_ptr<InitMsgBuilder> init_message =
      serializeRegistry(*getSerializerBuffer(),
                        *branchIDLists(),
                        *thinnedAssociationsHelper(),
                        OutputModule::processName(),
                        description().moduleLabel(),
                        moduleDescription().mainParameterSetID(),
                        psetMapHandle.isValid() ? psetMapHandle.product() : nullptr);
  doOutputHeader(*init_message);
  serializerBuffer_->clearHeaderBuffer();
}

means it is stored as part of the INI files. If I understood #37831 (comment) correctly, only one of the INI files is used in the merging, and the rest are discarded, as all of them are (assumed to be) identical.

To me the assumption that "all HLT jobs have identical architecture provenance" sounds like something I would not bet on. Even if that is the situation today, and cases where some node(s) are unable to use the GPU or have a different driver version are probably very unlikely, the assumption sounds rather restrictive (with respect to the HLT farm evolution and the architecture information we'd include here).

So probably we need to re-think the storage approach.

@smorovic (Contributor) commented Jun 2, 2022

Hi @makortel, correct. We only keep one copy (from one process) and discard the others, which isn't sufficient for a heterogeneous farm.
In addition, process lifetime is 'dynamic' over an HLT run, since a crashed process can be restarted, and in some cases a machine can even be added during the lifetime of a run.

The first option, already discussed, is to store it in the event payload. If the ProcessBlock is tiny, this wouldn't add much size overhead and is the simplest option.

A second option is storing this at the file level (which is the lumisection level in DAQ), as the output module in the HLT process opens a new streamer output file for each LS (without the INI part). This would require a third serialized object type for the streamer format (see, for example, https://github.com/cms-sw/cmssw/blob/master/IOPool/Streamer/interface/EventMessage.h).
After DAQ merging is completed, each fully merged file, checked out at the end of each lumisection, would contain all process blocks from any processes that wrote output events.
Tier0 repacking or some other analyzer could merge more than one LS dat file in one repacking job, so that process could also see multiple copies of the same ProcessBlock.

@wddgit (Contributor) commented Oct 11, 2022

The last time we discussed this issue, one question that came up was whether our design for storing the parentage of products could be useful for storing the information we are discussing here. I agreed to review and summarize that design.

In memory, the Parentage objects are in a ParentageRegistry. That is basically a map keyed by the ParentageID (a large hash). ProductProvenance objects in memory hold one of those ParentageIDs along with a BranchID (32 bits) to identify the product.

For persistent storage, we have a separate Parentage TTree with a single branch whose elements are Parentage objects. All unique Parentage objects that we save are written once into the TTree, and the order is important: StoredProductProvenance objects do not contain the ParentageID. Instead they contain an unsigned int that is an index into the Parentage TTree, namely the entry number of the Parentage object in the TTree. The Parentage TTree is written once per file.

A StoredProductProvenance object contains the index (a 32-bit unsigned int) and a single 32-bit BranchID to identify the product. It is a per-event object written into a branch in the Events TTree. The indexes have small values and compress very well. The ParentageID hashes are not saved persistently, which is good because they are large.

Also note that the output module only saves Parentage objects for products that are kept, their ancestors, and the original product in the case of an EDAlias. The rest are dropped and don't take space in the output file. A job can be configured to drop some or all of them.
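A simplified sketch of the two representations just described (the real CMSSW classes carry more than this; the sketch only illustrates the hash-versus-index difference):

```cpp
#include <array>
#include <cstdint>

struct ParentageID {  // in memory: a large hash
  std::array<std::uint8_t, 16> bytes;
};

struct ProductProvenance {  // in-memory, per product
  ParentageID parentageID;  // keys into the ParentageRegistry
  std::uint32_t branchID;   // identifies the product
};

struct StoredProductProvenance {  // persistent, per event
  std::uint32_t parentageIndex;   // entry number in the Parentage TTree
  std::uint32_t branchID;         // identifies the product
};
```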

I have not thought through how this applies to "accelerator provenance" yet or whether this design is useful for that case.

@silviodonato (Contributor)

Hello,
I just want to point out an additional use case for having the architecture in the event provenance.
We see up to 8% differences at the HLT between AMD and Intel in the double-electron HLT triggers, due to the -Ofast flag.
I asked for a validation from PdmV, https://its.cern.ch/jira/browse/PDMVRELVALS-159, and it would be very useful to have something to separate the data processed on Intel from that processed on AMD. For the time being we can use reportCPUProperties in the CPU service, as suggested by Matti.
