Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histograms from timed dataframe #116

Merged
merged 14 commits into from
Nov 5, 2023
Merged

Conversation

rettigl
Copy link
Member

@rettigl rettigl commented Apr 3, 2023

This adds support for a per-time-unit dataframe, and histogram calculation based on this dataframe.

Histogram calculation in the example notebook takes on our machine ~55s. This can potentially still be improved, e.g. by taking our fast binning method (currently fails because of xarray generation and concatenation issues)

@rettigl rettigl requested a review from steinnymir April 3, 2023 22:05
@rettigl rettigl force-pushed the histograms_from_timed_dataframe branch 3 times, most recently from 7a2ba24 to c9b8a17 Compare May 15, 2023 21:05
@rettigl
Copy link
Member Author

rettigl commented May 15, 2023

This does not produce the right kind of normalization histogram yet, I think...

@rettigl
Copy link
Member Author

rettigl commented May 17, 2023

This implements both variants now. Both routes produce almst identical results, but histograms from timed dataframes are substantially faster. Also a normalization routine for the compute function is added.

One issue remains: If histogram axes are jittered, the actual values of the jitter are different during computation of the data and of the normalization histogram, leading to small errors in the histogram on the order of a percent or so. A workaround would be to include the histogram calculation into bin_partition. This would also avoid reading the data twice, which I think is the major time bottleneck at the moment.

@rettigl
Copy link
Member Author

rettigl commented May 29, 2023

Histogram ontop of data (continously scanned delay stage):
grafik
Performance comparison:
grafik
Two histogram methods compared (bottom: difference):
grafik
Histogram computed twice and compared (jittered):
grafik

Bottomline: All effects of randomness and difference between methods are < 1%.

@rettigl
Copy link
Member Author

rettigl commented Aug 12, 2023

@zainsohail04 This one is pending on your implementation into the Flash loader (and some review)...

@rettigl rettigl force-pushed the histograms_from_timed_dataframe branch from d3076e6 to 0d64a32 Compare September 23, 2023 22:47
@rettigl
Copy link
Member Author

rettigl commented Sep 23, 2023

Rebased, and added dummy implementation for flash loader. Tests are also still pending.

@coveralls
Copy link
Collaborator

coveralls commented Sep 23, 2023

Pull Request Test Coverage Report for Build 6760852807

  • 291 of 314 (92.68%) changed or added relevant lines in 14 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.7%) to 90.759%

Changes Missing Coverage Covered Lines Changed/Added Lines %
sed/loader/flash/loader.py 6 7 85.71%
sed/loader/generic/loader.py 3 4 75.0%
sed/loader/mpes/loader.py 34 39 87.18%
sed/core/processor.py 92 108 85.19%
Totals Coverage Status
Change from base Build 6733527384: 0.7%
Covered Lines: 4901
Relevant Lines: 5400

💛 - Coveralls

@rettigl rettigl force-pushed the histograms_from_timed_dataframe branch from 0d64a32 to 1fc17c6 Compare September 23, 2023 23:22
@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

@zain-sohail For me the flash timed_dataframes don't work. They consume endless amounts of memory and never finish computation.

@zain-sohail
Copy link
Member

@zain-sohail For me the flash timed_dataframes don't work. They consume endless amounts of memory and never finish computation.

Screenshot 2023-10-10 at 17 25 35 Just tried and it seems to work fine. Where are you facing a problem?

@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

After merging the fast loading code from main, it is at lead performing faster now, but still does not seem to do the right thing:
I am loadings runs=[44824, 44825, 44826, 44827]
grafik
grafik
It does not make sense that the timed dataframe has 8 times more rows...

@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

Or does it? Are there so many empty macrobunches?

@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

Hmm, okay the generated histogram does indeed not look to bad
grafik

@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

The time-stamp version does produce some strange artefacts, though:
grafik

@steinnymir
Copy link
Member

Or does it? Are there so many empty macrobunches?

if you look at the tables you show, the pulse IDs in the electron dataframe are very sparse (7, 38, 69...) so it seems like count rate was really bad, and therefore I expect many empty pulses.

@rettigl rettigl force-pushed the histograms_from_timed_dataframe branch from b8f996b to 2fc1e0a Compare October 10, 2023 18:48
@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

rebased to current main

@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

Or does it? Are there so many empty macrobunches?

if you look at the tables you show, the pulse IDs in the electron dataframe are very sparse (7, 38, 69...) so it seems like count rate was really bad, and therefore I expect many empty pulses.

I realized that after posting my previous post, but don't undertand the data format completely yet. What exactly are the Train, pulse, and electron IDs? Are these Macrobunches, microbunches, and electrons per microbunch? And what is the time-base for the timed_dataframe? I was expecting macrobunches, but then I would expect one row per trainID, no? Now it seems one row per microbunch. This is certainly typically more than electrons. And why is it going up to 1000, if tpyically there are 400 or 500 microbunches in a macrobunch?

@steinnymir
Copy link
Member

How you described it is all correct.
the reason there are 1000 is I think because the data structure accomodates up to 1000 pulses, as flash can theoretically provide 800, which it never does, max 500 in reality.
I'm not sure if using trainID (macrobunches) would be still ok, don't you think a coarser step size induces more noise?

@rettigl
Copy link
Member Author

rettigl commented Oct 10, 2023

I'm not sure if using trainID (macrobunches) would be still ok, don't you think a coarser step size induces more noise?

All you want here is to capture changes of the scanned variable, which typically is anyways only read out once per macrobunch, no? So no, there should be no difference. It would only make a (marginal) difference for parameters that are varied within the macrobunch. Not sure, though, how this would look like if you consider, e.g. BAM correction etc.

@steinnymir
Copy link
Member

All you want here is to capture changes of the scanned variable, which typically is anyways only read out once per macrobunch, no? So no, there should be no difference. It would only make a (marginal) difference for parameters that are varied within the macrobunch. Not sure, though, how this would look like if you consider, e.g. BAM correction etc.

Right! but still there might be some pulse resolved normalizations one would want, like the FEL intensity for example. I'm afraid we need to keep the per-pulse dataframe for those

@zain-sohail
Copy link
Member

After merging the fast loading code from main, it is at lead performing faster now, but still does not seem to do the right thing: I am loadings runs=[44824, 44825, 44826, 44827]

Wouldn't make sense that there is a timing difference in the compute of dataframe, only makes a difference in creating buffer files quicker, I would imagine. Regarding your question about why the dataframe is bigger, I think Steinn managed to answer it.

Copy link
Member

@steinnymir steinnymir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I tested it together with the hextof changes from #169 and seems to work correctly. Only exception the comments I made. Once those are resolved you can merge for what concerns me.

sed/core/processor.py Show resolved Hide resolved
sed/core/processor.py Show resolved Hide resolved
sed/core/processor.py Show resolved Hide resolved
@steinnymir
Copy link
Member

I accidentally uploaded the branch where i tested this. I'll leave it till this is merged. its hist_testing

@rettigl rettigl force-pushed the histograms_from_timed_dataframe branch 3 times, most recently from 05c2eb2 to 0d621d9 Compare October 28, 2023 21:29
@rettigl rettigl force-pushed the histograms_from_timed_dataframe branch from 0d621d9 to 9c3faf0 Compare October 28, 2023 21:30
@rettigl
Copy link
Member Author

rettigl commented Oct 28, 2023

I fixed this, and modified the tests to test the timed dataframes.

@rettigl
Copy link
Member Author

rettigl commented Oct 28, 2023

I accidentally uploaded the branch where i tested this. I'll leave it till this is merged. its hist_testing

I also updated your test branch

Copy link
Member

@steinnymir steinnymir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this and seems to work fine. LGTM!

@steinnymir
Copy link
Member

closes #101

@steinnymir steinnymir linked an issue Nov 4, 2023 that may be closed by this pull request
@rettigl
Copy link
Member Author

rettigl commented Nov 4, 2023

I am working on improving tests for this still, and update the new processor functions, then I will merge

@rettigl rettigl merged commit bb717a7 into main Nov 5, 2023
5 checks passed
@rettigl rettigl deleted the histograms_from_timed_dataframe branch November 5, 2023 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possibility for normalization by the number of FEL pulses:
4 participants