
Refactored flashloader can't normalize data #459

Closed · zain-sohail opened this issue Jul 2, 2024 · 18 comments

@zain-sohail (Member)

No description provided.

@rettigl (Member) commented Jul 2, 2024

This is somehow related to bam correction. In the timed_dataframe, delayStage positions are not the same as in the dataframe after bam correction, if parquets are created with the new flash loader. Currently, no idea why.
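As a starting point for debugging, a minimal diagnostic sketch along these lines could show where the values diverge (assuming, as later in this thread, that the processor object sp exposes _dataframe and _timed_dataframe as dask dataframes; the column name follows the thread):

# Hypothetical diagnostic: compare the delayStage values present in the
# electron dataframe with those in the timed dataframe.
import numpy as np

delays_electron = sp._dataframe["delayStage"].dropna().unique().compute()
delays_timed = sp._timed_dataframe["delayStage"].dropna().unique().compute()

# values appearing only on one side point to the mismatch
print(np.setdiff1d(delays_timed, delays_electron))
print(np.setdiff1d(delays_electron, delays_timed))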

@rettigl (Member) commented Jul 2, 2024

The computation time with the parquet generated with the new loader is also >2x as long.

@rettigl (Member) commented Jul 2, 2024

A major difference between the old and new loader is the dask index: it is continuous in the old loader, and very sparse in the new loader:
Old:
[screenshots: dask index of the old loader]
New:
[screenshots: dask index of the new loader]
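For reference, a rough way to inspect this on a loaded dataframe (a sketch; the path is a placeholder, df is assumed to be the dask dataframe read from one buffer file):

import dask
import dask.dataframe as dd

df = dd.read_parquet("path/to/buffer_file.parquet")  # placeholder path

print(df.divisions[:5])  # index boundaries of the first partitions (None if unknown)
n_rows = len(df)
imin, imax = dask.compute(df.index.min(), df.index.max())
# for a continuous index these match; for a sparse index the span is much larger than the row count
print(n_rows, imax - imin + 1)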

@rettigl (Member) commented Jul 2, 2024

It's related to the bam correction and the mean offsetting:
Old:
[screenshot]
New:
[screenshot]

Old:
[screenshot]
New:
[screenshot]

It seems there are a lot of empty entries in the timed_dataframe where not even a photon pulse was actually present.

@zain-sohail (Member Author)

The computation time with the parquet generated with the new loader is also >2x as long.

I think the slowdown is because of this section being offloaded to dask

df = self.concatenate_channels(h5_file)
df = df.dropna(subset=self._config["dataframe"].get("tof_column", "dldTimeSteps"))
# correct the 3 bit shift which encodes the detector ID in the 8s time
if self._config["dataframe"].get("split_sector_id_from_dld_time", False):
    df = split_dld_time_from_sector_id(df, config=self._config)

It seems there are a lot of empty entries in the timed_dataframe where not even a photon pulse was actually present.

This was a feature requested by Davide (#307): the timed dataframe should contain all available pulse values, not just the ones where electrons are detected.

@rettigl (Member) commented Jul 2, 2024

This was a feature requested by Davide (#307): the timed dataframe should contain all available pulse values, not just the ones where electrons are detected

Indeed, but these should also only contain valid pulses, where actual shots exist

@zain-sohail (Member Author)

Indeed, but these should also only contain valid pulses, where actual shots exist

Yes, this comes from the FLASH data structure of storing 1000 pulses for each train.
The simplest way I can think of to adapt this is a hard threshold on the pulseId for the timed_dataframe, because it's not NaNs that are stored in the bam column but zeros. (In other pulse-resolved channels, zero might be a valid value.)
@kutnyakhov any ideas?

A major difference between the old and new loader is the dask index: it is continuous in the old loader, and very sparse in the new loader.

This is just because the NaN drop takes place after loading the buffer files.

@rettigl (Member) commented Jul 2, 2024

I think the slowdown is because of this section being offloaded to dask

That's not the reason. Even if I disable the sector split, it's still slow.

@zain-sohail (Member Author)

sp._timed_dataframe = sp._timed_dataframe[sp._timed_dataframe["pulseId"] < 500].ffill()

The ffill should happen with overlap, but in general something like this would fix the issue of normalizing data.
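One way to sketch a forward fill that does not stop at partition boundaries is dask's map_overlap; this is only an illustration of the idea, not the loader's actual implementation, and with a single row of overlap it still breaks down if a partition starts with several missing rows:

def forward_fill_partition(part):
    # plain pandas forward fill within the (overlapped) partition
    return part.ffill()

# before=1 carries the last row of the previous partition into each partition,
# so values can be filled across partition edges; increase `before` if many
# consecutive rows can be missing.
timed_filled = sp._timed_dataframe.map_overlap(forward_fill_partition, before=1, after=0)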

@rettigl (Member) commented Jul 2, 2024

There should be a permanent fix. Probably best to look for the first nonzero entry from the back, or something similar. I would avoid another config entry if possible.
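A sketch of the "first nonzero entry from the back" idea on a single train's bam values (the helper name and the example data are made up):

import numpy as np

def last_valid_pulse(bam_values: np.ndarray) -> int:
    """Index of the last nonzero bam entry of a train, or -1 if all entries are zero."""
    nonzero = np.flatnonzero(bam_values)
    return int(nonzero[-1]) if nonzero.size else -1

# example: a train that stores 1000 pulses, of which only the first 500 carry actual shots
bam = np.zeros(1000)
bam[:500] = np.random.rand(500) + 0.1
print(last_valid_pulse(bam))  # -> 499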

@zain-sohail (Member Author)

That's not the reason. Even if I disable the sector split, it's still slow.

It might be slow because of the indexes:
https://docs.dask.org/en/stable/generated/dask_expr._collection.DataFrame.set_index.html#dask_expr._collection.DataFrame.set_index
But I will think of a solution that allows correct electron and timed dataframes and is also fast.
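If the data on disk is already ordered by the index column, the expensive shuffle in set_index can be avoided by declaring the column sorted; a minimal sketch (path and column name are assumptions):

import dask.dataframe as dd

df = dd.read_parquet("path/to/buffer_files")  # placeholder path

# sorted=True tells dask the column is already globally sorted, so no shuffle is needed;
# known divisions could alternatively be passed explicitly via the divisions argument.
df = df.set_index("trainId", sorted=True)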

zain-sohail mentioned this issue Jul 2, 2024
@rettigl (Member) commented Jul 2, 2024

If directly loading the parquet files, the first thing to notice is that the new loader produces ~6.5 times more entries than the old one, and the number of entries and remaining NaNs per column is also quite different.
This is the output of df.count().compute() on two directly loaded parquet dataframes (single file):
Old:

trainId                      423643
pulseId                      423643
electronId                   423643
timeStamp                    423643
dldPosX                      423643
dldPosY                      423643
dldTimeSteps                 423643
cryoTemperature              423643
dldTimeBinSize               423643
extractorCurrent             423643
extractorVoltage             423643
sampleBias                   423643
sampleTemperature            423643
tofVoltage                   423643
pulserSignAdc                  9184
monochromatorPhotonEnergy      4571
gmdBda                        44019
bam                          423341
delayStage                    43846
dldSectorID                  423643
dtype: int64

New:

trainId                      2782954
pulseId                      2782954
electronId                   2782954
dldPosX                       423643
dldPosY                       423643
dldTimeSteps                  423643
pulserSignAdc                  25630
bam                          2563000
timeStamp                       2563
monochromatorPhotonEnergy         27
gmdBda                           264
delayStage                       263
sampleBias                      2563
tofVoltage                      2563
extractorVoltage                2563
extractorCurrent                2563
cryoTemperature                 2563
sampleTemperature               2563
dldTimeBinSize                  2563
dtype: int64

Clearly, the performance with the old files is much superior, and their storage size is also a factor of 2 or so smaller. This should be changed again.
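For reference, the counts above can be reproduced with a sketch like this (paths are placeholders):

import dask.dataframe as dd

old = dd.read_parquet("buffer_old/example_file.parquet")  # placeholder paths
new = dd.read_parquet("buffer_new/example_file.parquet")

print(old.count().compute())  # non-NaN entries per column, old loader
print(new.count().compute())  # non-NaN entries per column, new loader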

@rettigl (Member) commented Jul 2, 2024

Regarding the timed_dataframe, does it really make a difference whether this is computed on the pulse rather than the train level? Working on the train level would make things much simpler and more performant. I think it's worth checking whether that makes a practical difference, e.g. if you take the mean of a train and normalize by this.
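One rough way to check this would be a comparison along these lines (a pandas sketch on a computed timed dataframe; column names follow the thread, everything else is assumed):

# timed: assumed pandas dataframe with per-pulse rows and trainId, delayStage, bam columns
pulse_level = timed.groupby("delayStage")["bam"].mean()

per_train = timed.groupby("trainId").agg({"bam": "mean", "delayStage": "first"})
train_level = per_train.groupby("delayStage")["bam"].mean()

# if this difference is negligible, train-level normalization is sufficient
print((pulse_level - train_level).abs().max())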

@zain-sohail (Member Author)

If directly loading the parquet files, the first thing to notice is that the new loader produces ~6.5 times more entries than the old one, and the number of entries and remaining NaNs per column is also quite different. This is the output of df.count().compute() on two directly loaded parquet dataframes (single file): Old:

Yes, this is simply because we are not dropping NaNs from the electron columns before saving the files (as was done before):

df = self.concatenate_channels(h5_file)
df = df.dropna(subset=self._config["dataframe"].get("tof_column", "dldTimeSteps"))
# correct the 3 bit shift which encodes the detector ID in the 8s time
if self._config["dataframe"].get("split_sector_id_from_dld_time", False):
    df = split_dld_time_from_sector_id(df, config=self._config)

If we do that, the dataframe becomes much smaller.
However, this doesn't work if we want to keep the pulse data even without electrons.
One solution could be to save a df and a df_timed parquet and then load both per file. Technically it's also possible to save one parquet per run and just divide the 'per file' parts into row groups.
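A minimal sketch of the "two parquet files per h5 file" idea (purely illustrative; the helper and the file naming are made up, not the loader's actual layout):

from pathlib import Path

def save_buffer(df_electron, df_timed, buffer_dir: Path, stem: str) -> None:
    """Write the electron-resolved and the timed dataframe as separate parquet files."""
    buffer_dir.mkdir(parents=True, exist_ok=True)
    df_electron.to_parquet(buffer_dir / f"{stem}_electron.parquet")
    df_timed.to_parquet(buffer_dir / f"{stem}_timed.parquet")

# usage (placeholder names):
# save_buffer(df.dropna(subset=["dldTimeSteps"]), df_timed, Path("buffer"), "file_0")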

Clearly, the performance with the old files is much superior, and their storage size is also a factor of 2 or so smaller. This should be changed again.

I can of course bring back the old behavior (saving after dropping NaNs) so we can run tests to see if that is the performance bottleneck.
File size is not the biggest issue, because the parquet files are quite small.

Regarding the timed_dataframe, does it really make a difference whether this is computed on the pulse rather than the train level? Working on the train level would make things much simpler and more performant. I think it's worth checking whether that makes a practical difference, e.g. if you take the mean of a train and normalize by this.

I thought we just had a discussion today about the monochromator and using the pulse channel for it instead of the train channel. Perhaps I misunderstood, but I can't shed light on whether it's useful or not.

@balerion if you are still using sed, could you explain your use case?

@zain-sohail (Member Author)

As expected from my tests, binning performance stays the same if I remove NaNs before saving (the index is then monotonic, like in the main branch version). You can test from branch fix-459.
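A quick way to verify the monotonic index claim on the loaded dataframe (a sketch, assuming df is the dask dataframe produced by the fixed branch):

import dask

# check that each partition's index is monotonically increasing,
# and whether the global divisions are known
parts_monotonic = dask.compute(
    *[part.index.is_monotonic_increasing for part in df.to_delayed()]
)
print(all(parts_monotonic), df.known_divisions)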

@rettigl (Member) commented Jul 3, 2024

Something is still off. While the speed is back to normal, the result looks quite different:
Old:
[screenshots]
New:
[screenshots]
Fixed:
[screenshots]

It appears to attach the wrong delay stage values (?)

@rettigl (Member) commented Jul 3, 2024

Counts for the same file:
Old:
[screenshot]
New:
[screenshot]
Fixed:
[screenshot]

There are 2567 trains in this file, and 257 unique delay stage position values.
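For reference, numbers like these can be pulled from a loaded buffer file with a sketch like this (assuming df is the dask dataframe of that single file):

import dask

n_trains, n_delays = dask.compute(
    df["trainId"].nunique(),
    df["delayStage"].nunique(),
)
print(n_trains, n_delays)  # here: 2567 trains, 257 unique delay stage positions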

@zain-sohail (Member Author)

Closed by #469.
