
Surprising behavior of metadata caching in subsequent "fresh" runs #662

Closed
alexander-held opened this issue Apr 19, 2022 · 2 comments
Labels
question Further information is requested

Comments

@alexander-held
Contributor

Describe what you want to do
During debugging, I was changing the list of files I wanted to process, as well as associated metadata. I then noticed that the list of files updated fine, while metadata did not. Is this pattern intended?

from coffea import processor


class Processor(processor.ProcessorABC):
    def process(self, events):
        print("metadata in processor", events.metadata["abc"])
        return {}

    def postprocess(self, accumulator):
        return accumulator


fileset = {
    "ttbar": {
        "files": [
            "http://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM//PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/00DF0A73-17C2-E511-B086-E41D2D08DE30.root"
        ],
        "metadata": {"abc": 0},
    }
}

executor = processor.IterativeExecutor()
run = processor.Runner(executor=executor, savemetrics=True)
print("metadata in fileset", fileset["ttbar"]["metadata"]["abc"])
output, metrics = run(fileset, "events", processor_instance=Processor())


fileset = {
    "ttbar": {
        "files": [
            "http://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM//PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/00DF0A73-17C2-E511-B086-E41D2D08DE30.root"
        ],
        "metadata": {"abc": 999},
    }
}

executor = processor.IterativeExecutor()
run = processor.Runner(executor=executor, savemetrics=True)
print("metadata in fileset", fileset["ttbar"]["metadata"]["abc"])
output, metrics = run(fileset, "events", processor_instance=Processor())

This script does the exact same thing twice; only the metadata in the fileset changes between runs. Output:

metadata in fileset 0
Preprocessing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:02 < 0:00:00 | ? file/s ]
metadata in processor 0
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:02 < 0:00:00 | ? chunk/s ]
metadata in fileset 999
metadata in processor 0
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:02 < 0:00:00 | ? chunk/s ]

The metadata arriving in the processor does not update. @nsmith- found a workaround: passing metadata_cache={} to processor.Runner.
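The pitfall can be illustrated with a minimal sketch in plain Python. The names here (preprocess, metadata_cache as a plain dict) are hypothetical stand-ins for illustration, not coffea's actual implementation:

```python
# Minimal sketch (hypothetical names, not coffea's actual code) of the caching
# pitfall: preprocessing caches per-file metadata, and on a later run with the
# same file the cached entry, including the user-supplied metadata, is returned
# instead of the fresh values from the fileset.

metadata_cache = {}  # survives across runs when the same cache object is reused

def preprocess(filename, user_metadata):
    if filename in metadata_cache:
        return metadata_cache[filename]  # stale user metadata comes back here
    entry = {"numentries": 1000, **user_metadata}  # file-scan info + user keys
    metadata_cache[filename] = entry
    return entry

first = preprocess("file.root", {"abc": 0})
second = preprocess("file.root", {"abc": 999})  # cache hit: "abc" stays 0
print(first["abc"], second["abc"])  # -> 0 0

# In the same spirit as the metadata_cache={} workaround: start each fresh
# run with an empty cache, so updated metadata is picked up.
metadata_cache.clear()
third = preprocess("file.root", {"abc": 999})
print(third["abc"])  # -> 999
```

The workaround in real coffea code is analogous: supplying a fresh metadata_cache={} to processor.Runner prevents stale user metadata from a previous run from being reused.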

I came across this in a notebook, where I kept changing fileset and re-running the steps below the fileset definition to process data.

Explain what documentation is missing
Assuming this is intended behavior, it would be useful to describe it somewhere prominent, perhaps in the processor section of the documentation. It is unintuitive to me that changes to the list of files take effect (not demonstrated above, but duplicating the file in the second fileset would show it), while changes to the metadata in the same dict do not.

@alexander-held alexander-held added the question Further information is requested label Apr 19, 2022
@nsmith-
Member

nsmith- commented Apr 20, 2022

The original purpose of the metadata cache was to avoid re-scanning the file list to determine the number of events and the cluster boundaries of the files, under the assumption that input files are immutable. It should not also have been caching the user-supplied metadata, so I'm inclined to call this a bug. It looks like we could change

self.metadata_cache[item] = item.metadata

to save only the non-user-specified keys as defined in
_PROTECTED_NAMES = {
    "dataset",
    "filename",
    "treename",
    "metadata",
    "entrystart",
    "entrystop",
    "fileuuid",
    "numentries",
    "uuid",
    "clusters",
}
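The proposed change could be sketched as a filter that keeps only the protected (framework-derived) keys when populating the cache. This is a sketch of the assumed semantics, not actual coffea code; cacheable_metadata is a hypothetical helper name:

```python
# Sketch of the proposed fix (assumed semantics, not coffea's actual code):
# cache only the framework-derived keys, so user-supplied metadata such as
# "abc" is never persisted between runs.
_PROTECTED_NAMES = {
    "dataset", "filename", "treename", "metadata", "entrystart",
    "entrystop", "fileuuid", "numentries", "uuid", "clusters",
}

def cacheable_metadata(metadata):
    """Return only the non-user-specified (protected) keys, safe to cache."""
    return {k: v for k, v in metadata.items() if k in _PROTECTED_NAMES}

cached = cacheable_metadata({"numentries": 1000, "clusters": [0, 1000], "abc": 0})
print(cached)  # user-supplied "abc" is dropped; framework keys survive
```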

@pfackeldey do you think this makes sense?

@lgray
Collaborator

lgray commented Dec 7, 2023

This was overhauled in coffea 2023.

@lgray lgray closed this as not planned Dec 7, 2023