
Frequency concatenation yields wrong data #1102

Closed
wfarah opened this issue Dec 9, 2021 · 9 comments · Fixed by #1105

wfarah commented Dec 9, 2021

Using pyuvdata==2.2.4

I’m trying to concatenate uvh5 files in frequency; I’m simply doing uv = uv1 + uv2 + uv3 + ..., then writing out a .ms file and opening it in CASA. This has always worked well for me, but now I’ve started seeing some weirdness: the phase of the autocorrelation for some of the subbands appears to be non-zero, which is not expected.
What I’m doing is the following:

uv1.read("subband1.uvh5")
uv2.read("subband2.uvh5")
...
uv = uv1 + uv2 + ...
uv.write_ms("concatenated.ms")
uv.write_uvh5("concatenated.uvh5")

I then read the ant_1_names and ant_2_names from all the uv files, and I could see that the first entries in the concatenated file are not the autocorrelations, unlike in the subband files:

subband1.uvh5
[ 3  5  7 ... 33 33 35] #ant_1_names
[ 3  5  7 ... 35 38 38] #ant_2_names

subband2.uvh5
[ 3  5  7 ... 33 33 35]
[ 3  5  7 ... 35 38 38]

...

concatenated.uvh5
[ 3  3  3 ... 33 33 35] #something is wrong here
[ 3  5  7 ... 35 38 38]
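
For anyone reproducing this, a tiny helper (hypothetical, not part of pyuvdata) makes the ordering comparison above mechanical; the arrays below mimic the first three entries shown:

```python
import numpy as np

def same_blt_order(ant1_a, ant2_a, ant1_b, ant2_b):
    """True when two files list their (ant_1, ant_2) pairs in the same order."""
    return bool(np.array_equal(ant1_a, ant1_b)
                and np.array_equal(ant2_a, ant2_b))

# Mimicking the arrays above: subband files start with autos (3,3), (5,5),
# (7,7), but the concatenated file starts with (3,3), (3,5), (3,7).
sub1 = (np.array([3, 5, 7]), np.array([3, 5, 7]))
sub2 = (np.array([3, 5, 7]), np.array([3, 5, 7]))
cat = (np.array([3, 3, 3]), np.array([3, 5, 7]))

print(same_blt_order(*sub1, *sub2))  # True
print(same_blt_order(*sub1, *cat))   # False
```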

Below are a cross-correlation and an autocorrelation, respectively, viewed in CASA. I also extracted the visibilities straight from the uvh5 files, and I see the same thing.

[Two screenshots, Dec 9, 2021: cross- and autocorrelation plots in CASA]


wfarah commented Dec 9, 2021

Data are available on google drive:

https://drive.google.com/drive/folders/1BzNo8fa2yeS8kbO7MoNSPY8pY1FS_j8U

kartographer (Contributor) commented

So after a bit of poking around with the data, I was able to replicate the issue and, I think, get to the heart of what's going on. The long and short of this particular case is that there was a mix of concats across both the time-baseline and frequency axes, and after concatenating across the baseline axis, the ordering of the baselines was not necessarily the same as it was prior to adding the two UVData objects together.

This actually points to a more significant issue: I would have expected UVData.__add__ to check the baseline ordering prior to combining the data, but it does not appear to do so -- it seems only to check that the baseline and time values overlap and that, where those overlaps occur, various metadata agree, but not that ant_1_array, ant_2_array, or time_array are in the same order. Looking at the code further, I think there's a similar issue with freq_array when there is frequency-channel overlap, and with polarization_array when there is polarization overlap. This little snippet of code demonstrates the problem better than I can describe it:

import os

from pyuvdata import UVData
from pyuvdata.data import DATA_PATH
import numpy as np

filepath = os.path.join(DATA_PATH, "day2_TDEM0003_10s_norx_1src_1spw.uvfits")
uv = UVData.from_file(filepath)

uv1 = uv.select(polarizations=["ll"], inplace=False)
uv1.reorder_blts("time", minor_order="ant1")
uv2 = uv.select(polarizations=["rr"], inplace=False)
uv2.reorder_blts("time", minor_order="ant2")

assert np.any(uv1.ant_1_array != uv2.ant_1_array)
assert np.any(uv1.ant_2_array != uv2.ant_2_array)

uv3 = uv1 + uv2

# Note that if uv1 and uv2 are differently ordered, one of these
# asserts should fail if everything is working as expected.
assert np.all(uv3.data_array[:, :, :, 0] == uv2.data_array[:, :, :, 0])
assert np.all(uv3.data_array[:, :, :, 1] == uv1.data_array[:, :, :, 0])
assert np.any(uv1.ant_1_array != uv2.ant_1_array)
assert np.any(uv1.ant_2_array != uv2.ant_2_array)

I think adding a check for this would be relatively simple: it could throw an error when the ordering of these items differs. One alternative would be to make __add__ a little bit smarter and sort out the ordering on concatenation, although that may require a bit more work, and I don't know how rapidly this needs to get fixed.
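
As a rough illustration of the first option, a check along these lines could run before combining (a sketch only, not pyuvdata's actual implementation; only the attribute names are taken from UVData):

```python
import numpy as np

def check_blt_ordering(this, other):
    """Sketch of the proposed pre-add check: when both objects cover the
    same set of baseline-times, require that the per-blt metadata arrays
    agree element-wise, and raise instead of silently combining
    misordered data."""
    for attr in ("ant_1_array", "ant_2_array", "time_array"):
        a = np.asarray(getattr(this, attr))
        b = np.asarray(getattr(other, attr))
        # Same length but different element order -> refuse to combine.
        if a.shape == b.shape and not np.array_equal(a, b):
            raise ValueError(
                f"{attr} is ordered differently on the two objects; "
                "reorder the blt axis before adding them."
            )
```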

david-macmahon commented

Does the baseline ordering vary between input datasets, or are the time_arrays from each input dataset not identical? AFAIK these input datasets should have all the same metadata except for the freq_arrays (which should even be contiguous in frequency), but it's not my data, so I can't say for sure whether that's all true.

kartographer (Contributor) commented

@david-macmahon -- the ordering is in fact not different between them, but some files contain different subsets of times or baselines (my guess being the former), which causes __add__ to effectively reorder them after concatenation. The resultant UVData object therefore has a different ordering, which is why the subsequent adds effectively corrupt the data set.

david-macmahon commented

@kartographer, your guess is exactly right. The first clue was that the file sizes are significantly different. Further examination of the two node1 UVH5 files shows that the *_0.uvh5 file contains 59 integrations and the *_1.uvh5 file contains 54, based on the Header/Ntimes value in each file. The final time_array entries differ by 5 integration times (i.e. 59 - 54), which nicely corroborates the mismatched time dimension (and therefore the mismatched Nblts dimension).
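
For reference, the Ntimes header can be inspected without loading the data; this sketch assumes only the UVH5 layout (a Header group holding scalar datasets) and h5py:

```python
import h5py

def read_ntimes(path):
    """Read the number of integrations from a UVH5 file's header without
    loading the visibility data (UVH5 stores header scalars as HDF5
    datasets under the Header group)."""
    with h5py.File(path, "r") as f:
        return int(f["Header/Ntimes"][()])
```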

Not sure if this was originally caused by a dropped/missing-data problem in the data capture code or by mismatched scan lengths between files. If the former, then the software correlator code should be made more robust to dropped/missing data (and the data capture code improved, of course!). In any case, it would be good for pyuvdata to either "do the right thing" (whatever that is in this case?) or refuse to combine these mismatched datasets. Either would be preferable to outputting invalid datasets.
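
One possible pre-combination workaround (an untested sketch; the UVData calls appear only in comments) is to keep just the integrations common to all files, then put every object in the same canonical order before adding:

```python
import numpy as np

# Times from two hypothetical subband files mimicking this issue:
# file 0 has 59 integrations, file 1 dropped the last 5 (54 total).
t0 = np.arange(59) * 10.0
t1 = np.arange(54) * 10.0

# Integrations present in both files; only these are safe to combine.
common = np.intersect1d(t0, t1)
print(common.size)  # 54

# On real UVData objects one would then (sketch, not verified here):
#   uv.select(times=common)                # drop unshared integrations
#   uv.reorder_blts("time", minor_order="baseline")  # canonical order
```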

bhazelton (Member) commented

Thanks for this report @wfarah, we definitely need to fix this. @kartographer, I think we should add the check you're suggesting, and we should also check whether this problem is present in the add methods of the other objects, which were based on this code.

kartographer (Contributor) commented

@bhazelton -- sounds good, I'll have a PR for someone to review in just a few minutes...

kartographer (Contributor) commented

@bhazelton -- taking a brief look through UVFlag, UVBeam, and UVCal, I think the answer to your question is "yes", although the code that's in UVFlag.__add__ looks less mature and doesn't seem to do nearly as rigorous a job checking metadata as that for the other objects.

bhazelton (Member) commented

Ok, I'll make issues for those objects so they can be tracked separately.
