
Surprising behavior of metadata caching in subsequent "fresh" runs #662

Closed
alexander-held opened this issue Apr 19, 2022 · 2 comments
Labels
question Further information is requested

Comments

@alexander-held
Contributor

Describe what you want to do
During debugging, I was changing the list of files I wanted to process, as well as associated metadata. I then noticed that the list of files updated fine, while metadata did not. Is this pattern intended?

from coffea import processor


class Processor(processor.ProcessorABC):
    def process(self, events):
        print("metadata in processor", events.metadata["abc"])
        return {}

    def postprocess(self, accumulator):
        return accumulator


fileset = {
    "ttbar": {
        "files": [
            "http://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM//PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/00DF0A73-17C2-E511-B086-E41D2D08DE30.root"
        ],
        "metadata": {"abc": 0},
    }
}

executor = processor.IterativeExecutor()
run = processor.Runner(executor=executor, savemetrics=True)
print("metadata in fileset", fileset["ttbar"]["metadata"]["abc"])
output, metrics = run(fileset, "events", processor_instance=Processor())


fileset = {
    "ttbar": {
        "files": [
            "http://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM//PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/00DF0A73-17C2-E511-B086-E41D2D08DE30.root"
        ],
        "metadata": {"abc": 999},
    }
}

executor = processor.IterativeExecutor()
run = processor.Runner(executor=executor, savemetrics=True)
print("metadata in fileset", fileset["ttbar"]["metadata"]["abc"])
output, metrics = run(fileset, "events", processor_instance=Processor())

This script does the exact same thing twice; only the metadata in the fileset changes between runs. Output:

metadata in fileset 0
Preprocessing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:02 < 0:00:00 | ? file/s ]
metadata in processor 0
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:02 < 0:00:00 | ? chunk/s ]
metadata in fileset 999
metadata in processor 0
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:02 < 0:00:00 | ? chunk/s ]

The metadata arriving in the processor does not update. @nsmith- found a workaround: passing metadata_cache={} to processor.Runner.
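The pitfall can be illustrated with a minimal sketch in plain Python. The names here (preprocess, metadata_cache as a plain dict) are hypothetical stand-ins for illustration, not coffea's actual implementation:

```python
# Minimal sketch (hypothetical names, not coffea's actual code) of the caching
# pitfall: preprocessing caches per-file metadata, and on a later run with the
# same file the cached entry, including the user-supplied metadata, is returned
# instead of the fresh values from the fileset.

metadata_cache = {}  # survives across runs when the same cache object is reused

def preprocess(filename, user_metadata):
    if filename in metadata_cache:
        return metadata_cache[filename]  # stale user metadata comes back here
    entry = {"numentries": 1000, **user_metadata}  # file-scan info + user keys
    metadata_cache[filename] = entry
    return entry

first = preprocess("file.root", {"abc": 0})
second = preprocess("file.root", {"abc": 999})  # cache hit: "abc" stays 0
print(first["abc"], second["abc"])  # -> 0 0

# In the same spirit as the metadata_cache={} workaround: start each fresh
# run with an empty cache, so updated metadata is picked up.
metadata_cache.clear()
third = preprocess("file.root", {"abc": 999})
print(third["abc"])  # -> 999
```

The workaround in real coffea code is analogous: supplying a fresh metadata_cache={} to processor.Runner prevents stale user metadata from a previous run from being reused.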

I came across this in a notebook, where I kept changing fileset and re-running the steps below the fileset definition to process data.

Explain what documentation is missing
Assuming this is intended behavior, it would be useful to describe it somewhere prominent, perhaps in the processor section of the documentation. It is unintuitive to me that changes to the list of files take effect (not demonstrated above, but duplicating the file in the second fileset would show it), while changes to the metadata in the same dict do not.

@alexander-held alexander-held added the question Further information is requested label Apr 19, 2022
@nsmith-
Member

nsmith- commented Apr 20, 2022

The original purpose of the metadata cache was to avoid re-scanning the file list to determine the number of events and the cluster boundaries of the files, under the assumption that input files are immutable. It should not also have been caching the user-supplied metadata, so I'm inclined to call this a bug. It looks like we could change

self.metadata_cache[item] = item.metadata

to save only the non-user-specified keys as defined in
_PROTECTED_NAMES = {
    "dataset",
    "filename",
    "treename",
    "metadata",
    "entrystart",
    "entrystop",
    "fileuuid",
    "numentries",
    "uuid",
    "clusters",
}
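The proposed change could be sketched as a filter that keeps only the protected (framework-derived) keys when populating the cache. This is a sketch of the assumed semantics, not actual coffea code; cacheable_metadata is a hypothetical helper name:

```python
# Sketch of the proposed fix (assumed semantics, not coffea's actual code):
# cache only the framework-derived keys, so user-supplied metadata such as
# "abc" is never persisted between runs.
_PROTECTED_NAMES = {
    "dataset", "filename", "treename", "metadata", "entrystart",
    "entrystop", "fileuuid", "numentries", "uuid", "clusters",
}

def cacheable_metadata(metadata):
    """Return only the non-user-specified (protected) keys, safe to cache."""
    return {k: v for k, v in metadata.items() if k in _PROTECTED_NAMES}

cached = cacheable_metadata({"numentries": 1000, "clusters": [0, 1000], "abc": 0})
print(cached)  # user-supplied "abc" is dropped; framework keys survive
```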

@pfackeldey do you think this makes sense?

@lgray
Collaborator

lgray commented Dec 7, 2023

This was overhauled in coffea 2023.

@lgray lgray closed this as not planned Dec 7, 2023