Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snap monitor taking a long time #56

Open
mkolopanis opened this issue Oct 27, 2021 · 0 comments
Open

snap monitor taking a long time #56

mkolopanis opened this issue Oct 27, 2021 · 0 comments

Comments

@mkolopanis
Copy link
Member

We're noticing a ~5-10s read time for get_new_corr in the hera_snap_redis_monitor.
I've attached the relevant slack conversation too.

Slack Convo

Matthew Kolopanis 8 days ago
dave and I narrowed down the snap monitor delay a bit. it's taking at least 10s for the corr.get_new_corr(steam, steam) call to complete for each ADC lane for each snap.
🐢
1
🐌
1
🦥
1
🕸️
1
🐳
1

Lindsay Berkhout 8 days ago
That function has been very inconsistent for me in my ASU setup. Sometimes it takes no time at all and sometimes it takes quite long to get data. I've never figured out what conditions cause what.

Matthew Kolopanis 8 days ago
fun. It seems to wait twice for the register named acc_cnt to increment before reading the dout register

Matthew Kolopanis 8 days ago
I was timing this acc_cnt incrementing to ~5s per time (so twice is 10s obviously)

Matthew Kolopanis 8 days ago
just saw one that took 16s

Bryna Hazelton 8 days ago
It must not always have been this slow

Matthew Kolopanis 8 days ago
Dave and I came to the same conclusion. I remember this script used to move through the entire array pretty quickly.

Bryna Hazelton 8 days ago
do we know what increments the acc_cnt?

Matthew Kolopanis 8 days ago
@Jonathon?

Jonathon Kocz 8 days ago
It should be based on a specified integration time for the onboard xcorr block. I'll take a look and see if I can work out what it's set too and if/why it's slow.

Matthew Kolopanis 8 days ago
I can tell you the default acc_len for the Corr block objects is 3815 if that helps off the top of your head.

Jonathon Kocz 7 days ago
that's an interesting number... do we know why/how it was set originally? - Looking at the code, it interprets acc_len in clock cycles (not spectra) and it takes 1024 clock cycles to get 1 spectra, so 3815 is a weird number to have... (I'm wondering if I'm missing something here...) (edited)

Jonathon Kocz 7 days ago
It's further multiplied by 8 (there are 8 inputs to cycle through) in the code before being used. So a value of 3815 becomes 30520.
(With on spectra every 1024 clocks x 8 inputs, it takes 8192 clock cycles to set through a full set of spectra from each input.) (edited)

Jonathon Kocz 7 days ago
So to return one element of a get_new_corr with an acc_len of 3815, it's asking for an average of ~3.7 spectra, which should take ~ 122us to complete.

Jonathon Kocz 7 days ago
So I'm pretty sure I'm missing something - e.g. why 3815, but that's the initial calculation.

Dave MacMahon 7 days ago
I'm pretty sure the code writes 1024acc_len to the identically named acc_len register, so using the default of acc_len=3815 would write 10243815 == 3906560 to the acc_len register. I think this means that it takes just over 1/8 of a second.

Matthew Kolopanis 7 days ago
but either way 122microsecond, 125miliseconds is a lot faster than 5s

Danny Jacobs 5 days ago
Bumping this because its slowed our telemetry update period from 1m to 30m. Its really important for observers to be able to see when things change. @Jonathon what would be a reasonable acc_len?

Dave MacMahon 5 days ago
One thing that could help is to parallelize this over multiple snaps, but I think that was not done because that makes (made?) communications with the snaps less reliable. One thing that could alleviate that problem would be to manually/administratively pre-populate the ARP table on hera-snap-head. That should minimize if not eliminate the broadcast traffic to the snaps that (I think) is the cause of instability during parallel access to the snaps.

Danny Jacobs 4 days ago
But is there a reason it takes 5s to read out a few ms long integration?

Dave MacMahon 4 days ago
That is it's own issue that is independent of serialized-vs-parallelized reading of the snaps. The actual readout happens very quickly. It's the waiting for acc_cnt to increment that is taking longer than expected.

Jonathon Kocz 2 days ago
It should certainly be safe to parallelize this now (the comms problems we had should be fixed) so that would solve one aspect. I'm still not sure why it's taking so long for acc_cnt to increment. - Where is the monitoring code that does the polling? I was wondering if it's switching up the inputs too quickly and it delays the acc_cnt? (Though if it's just capturing one at a time then incrementing, I'm not sure why that would be the case).

Matthew Kolopanis 2 days ago
hera_corr_f/control_software/scripts/hera_snap_redis_monitor.py
also regarding the comms issues should be fixed. we were still seeing tftp errors when dave and I were looking at this last week. Does that imply the snaps are getting initialized/programed with an old fpg file?

Jonathon Kocz 2 days ago
Mostly - there should now only be tftp errors if a snap is dead / needs a power cycle.

Jonathon Kocz 2 days ago
The only other time there migth be an error is during the programming.

Jonathon Kocz 2 days ago
(as the "golden image" is still the old code).

Matthew Kolopanis 2 days ago
okay, makes some sense :thinking_face:

Jonathon Kocz 2 days ago
Certainly from what I've seen from the ADC comms etc which are all in parallel now, there are only errors when a snap has got into a bad state and needs resetting.

Jonathon Kocz 2 days ago
Confirmed also as Dave said the code writes 1024*acc_len, which is then further *8 on the FPGA, so it's 31,252,480 clock cycles per integration, so ~125ms per spectra.
I'll check the monitor, but if we're reading the full matrix of 6 inputs, that's 21 combinations, which would be ~2.63s, so x2 would give 5.25, so I could see how it would be a ~ 5 second wait (for a full set / snap). If we're just reading the 3 autocorrs, obviously that's not the reason.

Matthew Kolopanis 2 days ago
but it is taking 5 seconds per input

Matthew Kolopanis 2 days ago
maybe i'm missing something here

Jonathon Kocz 2 days ago
probably not :)

Jonathon Kocz 2 days ago
if it's 5 seconds / input, then something is very broken, I'm just not sure what yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant