
Refactored flashloader can't normalize data #459

Closed · zain-sohail opened this issue Jul 2, 2024 · 18 comments

@zain-sohail (Member)

No description provided.

@rettigl (Member) commented Jul 2, 2024

This is somehow related to bam correction. In the timed_dataframe, delayStage positions are not the same as in the dataframe after bam correction, if parquets are created with the new flash loader. Currently, no idea why.
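As a starting point for debugging, a minimal diagnostic sketch along these lines could show where the values diverge (assuming, as later in this thread, that the processor object sp exposes _dataframe and _timed_dataframe as dask dataframes; the column name follows the thread):

# Hypothetical diagnostic: compare the delayStage values present in the
# electron dataframe with those in the timed dataframe.
import numpy as np

delays_electron = sp._dataframe["delayStage"].dropna().unique().compute()
delays_timed = sp._timed_dataframe["delayStage"].dropna().unique().compute()

# values appearing only on one side point to the mismatch
print(np.setdiff1d(delays_timed, delays_electron))
print(np.setdiff1d(delays_electron, delays_timed))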

@rettigl (Member) commented Jul 2, 2024

The computation time with the parquet generated with the new loader is also >2x as long.

@rettigl (Member) commented Jul 2, 2024

A major difference between the old and new loader is the dask index: it is continuous in the old loader, and very sparse in the new loader:
Old:
[screenshots: dask index of the old loader]
New:
[screenshots: dask index of the new loader]
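For reference, a rough way to inspect this on a loaded dataframe (a sketch; the path is a placeholder, df is assumed to be the dask dataframe read from one buffer file):

import dask
import dask.dataframe as dd

df = dd.read_parquet("path/to/buffer_file.parquet")  # placeholder path

print(df.divisions[:5])  # index boundaries of the first partitions (None if unknown)
n_rows = len(df)
imin, imax = dask.compute(df.index.min(), df.index.max())
# for a continuous index these match; for a sparse index the span is much larger than the row count
print(n_rows, imax - imin + 1)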

@rettigl (Member) commented Jul 2, 2024

It's related to the bam correction and the mean offsetting:
Old:
[screenshot]
New:
[screenshot]

Old:
[screenshot]
New:
[screenshot]

It seems there are a lot of empty entries in the timed_dataframe where not even a photon pulse was actually present.

@zain-sohail (Member Author)

The computation time with the parquet generated with the new loader is also >2x as long.

I think the slowdown is because of this section being offloaded to dask

df = self.concatenate_channels(h5_file)
df = df.dropna(subset=self._config["dataframe"].get("tof_column", "dldTimeSteps"))
# correct the 3 bit shift which encodes the detector ID in the 8s time
if self._config["dataframe"].get("split_sector_id_from_dld_time", False):
    df = split_dld_time_from_sector_id(df, config=self._config)

It seems there are a lot of empty entries in the timed_dataframe where not even a photon pulse was actually present.

This was a feature requested by Davide (#307): the timed dataframe should contain all available pulse values, not just the ones where electrons are detected.

@rettigl (Member) commented Jul 2, 2024

This was a feature requested by Davide (#307): the timed dataframe should contain all available pulse values, not just the ones where electrons are detected

Indeed, but these should also only contain valid pulses, where actual shots exist

@zain-sohail (Member Author)

Indeed, but these should also only contain valid pulses, where actual shots exist

Yes, this comes from the FLASH data structure of storing 1000 pulses for each train.
The simplest way I can think of to adapt this is a hard threshold on the pulseId for the timed_dataframe, because it's not NaNs that are stored in the bam column but zeros. (In other pulse-resolved channels, zero might be a valid value.)
@kutnyakhov any ideas?

A major difference between the old and new loader is the dask index: it is continuous in the old loader, and very sparse in the new loader.

This is just because the NaN drop takes place after loading the buffer files.

@rettigl (Member) commented Jul 2, 2024

I think the slowdown is because of this section being offloaded to dask

That's not the reason. Even if I disable the sector split, it's still slow.

@zain-sohail (Member Author)

sp._timed_dataframe = sp._timed_dataframe[sp._timed_dataframe["pulseId"] < 500].ffill()

The ffill should happen with overlap, but in general something like this would fix the issue of normalizing data.
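One way to sketch a forward fill that does not stop at partition boundaries is dask's map_overlap; this is only an illustration of the idea, not the loader's actual implementation, and with a single row of overlap it still breaks down if a partition starts with several missing rows:

def forward_fill_partition(part):
    # plain pandas forward fill within the (overlapped) partition
    return part.ffill()

# before=1 carries the last row of the previous partition into each partition,
# so values can be filled across partition edges; increase `before` if many
# consecutive rows can be missing.
timed_filled = sp._timed_dataframe.map_overlap(forward_fill_partition, before=1, after=0)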

@rettigl (Member) commented Jul 2, 2024

There should be a permanent fix. Probably best to look for the first nonzero entry from the back, or something similar. I would avoid another config entry if possible.
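A sketch of the "first nonzero entry from the back" idea on a single train's bam values (the helper name and the example data are made up):

import numpy as np

def last_valid_pulse(bam_values: np.ndarray) -> int:
    """Index of the last nonzero bam entry of a train, or -1 if all entries are zero."""
    nonzero = np.flatnonzero(bam_values)
    return int(nonzero[-1]) if nonzero.size else -1

# example: a train that stores 1000 pulses, of which only the first 500 carry actual shots
bam = np.zeros(1000)
bam[:500] = np.random.rand(500) + 0.1
print(last_valid_pulse(bam))  # -> 499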

@zain-sohail (Member Author)

That's not the reason. Even if I disable the sector split, it's still slow.

It might be slow because of the indexes:
https://docs.dask.org/en/stable/generated/dask_expr._collection.DataFrame.set_index.html#dask_expr._collection.DataFrame.set_index
But I will think of a solution that allows correct electron and timed dataframes and is also fast.
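If the data on disk is already ordered by the index column, the expensive shuffle in set_index can be avoided by declaring the column sorted; a minimal sketch (path and column name are assumptions):

import dask.dataframe as dd

df = dd.read_parquet("path/to/buffer_files")  # placeholder path

# sorted=True tells dask the column is already globally sorted, so no shuffle is needed;
# known divisions could alternatively be passed explicitly via the divisions argument.
df = df.set_index("trainId", sorted=True)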

zain-sohail mentioned this issue Jul 2, 2024
@rettigl (Member) commented Jul 2, 2024

If directly loading the parquet files, the first thing to notice is that the new loader produces ~6.5 times more entries than the old one, and the number of entries and remaining NaNs per column is also quite different.
This is the output of df.count().compute() on two directly loaded parquet dataframes (single file):
Old:

trainId                      423643
pulseId                      423643
electronId                   423643
timeStamp                    423643
dldPosX                      423643
dldPosY                      423643
dldTimeSteps                 423643
cryoTemperature              423643
dldTimeBinSize               423643
extractorCurrent             423643
extractorVoltage             423643
sampleBias                   423643
sampleTemperature            423643
tofVoltage                   423643
pulserSignAdc                  9184
monochromatorPhotonEnergy      4571
gmdBda                        44019
bam                          423341
delayStage                    43846
dldSectorID                  423643
dtype: int64

New:

trainId                      2782954
pulseId                      2782954
electronId                   2782954
dldPosX                       423643
dldPosY                       423643
dldTimeSteps                  423643
pulserSignAdc                  25630
bam                          2563000
timeStamp                       2563
monochromatorPhotonEnergy         27
gmdBda                           264
delayStage                       263
sampleBias                      2563
tofVoltage                      2563
extractorVoltage                2563
extractorCurrent                2563
cryoTemperature                 2563
sampleTemperature               2563
dldTimeBinSize                  2563
dtype: int64

Clearly, the performance with the old files is much superior, and their storage size is also a factor of 2 or so smaller. This should be changed again.
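For reference, the counts above can be reproduced with a sketch like this (paths are placeholders):

import dask.dataframe as dd

old = dd.read_parquet("buffer_old/example_file.parquet")  # placeholder paths
new = dd.read_parquet("buffer_new/example_file.parquet")

print(old.count().compute())  # non-NaN entries per column, old loader
print(new.count().compute())  # non-NaN entries per column, new loader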

@rettigl (Member) commented Jul 2, 2024

Regarding the timed_dataframe, does it really make a difference whether this is computed on the pulse rather than the train level? Working on the train level would make things much simpler and more performant. I think it's worth checking whether that makes a practical difference, e.g. if you take the mean of a train and normalize by this.
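One rough way to check this would be a comparison along these lines (a pandas sketch on a computed timed dataframe; column names follow the thread, everything else is assumed):

# timed: assumed pandas dataframe with per-pulse rows and trainId, delayStage, bam columns
pulse_level = timed.groupby("delayStage")["bam"].mean()

per_train = timed.groupby("trainId").agg({"bam": "mean", "delayStage": "first"})
train_level = per_train.groupby("delayStage")["bam"].mean()

# if this difference is negligible, train-level normalization is sufficient
print((pulse_level - train_level).abs().max())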

@zain-sohail (Member Author)

If directly loading the parquet files, the first thing to notice is that the new loader produces ~6.5 times more entries than the old one, and the number of entries and remaining NaNs per column is also quite different. This is the output of df.count().compute() on two directly loaded parquet dataframes (single file): Old:

Yes, this is simply because we are not dropping NaNs from the electron columns before saving the files (as was done before):

df = self.concatenate_channels(h5_file)
df = df.dropna(subset=self._config["dataframe"].get("tof_column", "dldTimeSteps"))
# correct the 3 bit shift which encodes the detector ID in the 8s time
if self._config["dataframe"].get("split_sector_id_from_dld_time", False):
    df = split_dld_time_from_sector_id(df, config=self._config)

If we do that, the dataframe becomes much smaller.
However, this doesn't work if we want to keep the pulse data even without electrons.
One solution could be to save a df and a df_timed parquet and then load both per file. Technically it's also possible to save one parquet per run and just divide the 'per file' parts into row groups.
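A minimal sketch of the "two parquet files per h5 file" idea (purely illustrative; the helper and the file naming are made up, not the loader's actual layout):

from pathlib import Path

def save_buffer(df_electron, df_timed, buffer_dir: Path, stem: str) -> None:
    """Write the electron-resolved and the timed dataframe as separate parquet files."""
    buffer_dir.mkdir(parents=True, exist_ok=True)
    df_electron.to_parquet(buffer_dir / f"{stem}_electron.parquet")
    df_timed.to_parquet(buffer_dir / f"{stem}_timed.parquet")

# usage (placeholder names):
# save_buffer(df.dropna(subset=["dldTimeSteps"]), df_timed, Path("buffer"), "file_0")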

Clearly, the performance with the old files is much superior, and their storage size is also a factor of 2 or so smaller. This should be changed again.

I can of course bring back the old behavior (saving after dropping NaNs) so we can run tests to see if that is the performance bottleneck.
File size is not the biggest issue, because the parquet files are quite small.

Regarding the timed_dataframe, does it really make a difference whether this is computed on the pulse rather than the train level? Working on the train level would make things much simpler and more performant. I think it's worth checking whether that makes a practical difference, e.g. if you take the mean of a train and normalize by this.

I thought we just had a discussion today about the monochromator and using the pulse channel for it instead of the train channel. Perhaps I misunderstood, but I can't shed light on whether it's useful or not.

@balerion if you are still using sed, could you explain your use case?

@zain-sohail (Member Author)

As expected from my tests, binning performance stays the same if I remove NaNs before saving (the index is then monotonic, like in the main branch version). You can test from branch fix-459.
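A quick way to verify the monotonic index claim on the loaded dataframe (a sketch, assuming df is the dask dataframe produced by the fixed branch):

import dask

# check that each partition's index is monotonically increasing,
# and whether the global divisions are known
parts_monotonic = dask.compute(
    *[part.index.is_monotonic_increasing for part in df.to_delayed()]
)
print(all(parts_monotonic), df.known_divisions)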

@rettigl (Member) commented Jul 3, 2024

Something is still off. While the speed is back to normal, the result looks quite different:
Old:
[screenshots]
New:
[screenshots]
Fixed:
[screenshots]

It appears to attach the wrong delay stage values (?)

@rettigl (Member) commented Jul 3, 2024

Counts for the same file:
Old:
[screenshot]
New:
[screenshot]
Fixed:
[screenshot]

There are 2567 trains in this file, and 257 unique delay stage position values.
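For reference, numbers like these can be pulled from a loaded buffer file with a sketch like this (assuming df is the dask dataframe of that single file):

import dask

n_trains, n_delays = dask.compute(
    df["trainId"].nunique(),
    df["delayStage"].nunique(),
)
print(n_trains, n_delays)  # here: 2567 trains, 257 unique delay stage positions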

@zain-sohail (Member Author)

Closed by #469.
