pygmt.x2sys_cross: Refactor to use virtualfiles for output tables #3182
pygmt/src/x2sys_cross.py
Outdated
# Convert 3rd and 4th columns to datetimes.
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are datetimes and should be converted.
# "i_1"/"i_2" means they are dummy times (i.e., floating-point values).
Am I understanding the output correctly?
I've never used x2sys, but here is my understanding of the C code and the output:
- The 3rd and 4th columns are datetimes. They can be either absolute datetimes (e.g., `2023-01-01T01:23:45.678`) or dummy datetimes (i.e., double-precision numbers), depending on whether the input tracks contain datetimes.
- Internally, absolute datetimes are also represented as double-precision numbers in GMT. So absolute datetimes and dummy datetimes are the same internally.
- When outputting to a file, GMT will convert double-precision numbers into absolute datetimes, since GMT knows whether the column has dummy datetimes or not.
- A `GMT_DATASET` container can only contain double-precision numbers and text strings. So when outputting to a virtual file, the 3rd and 4th columns always have double-precision numbers. If the column names are `t_1`/`t_2`, then we know they're absolute datetimes and should be converted; otherwise, they are just dummy datetimes and should not be converted.
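A minimal sketch of the rule in the last bullet, assuming the table has already been loaded as a `pandas.DataFrame` of doubles and that absolute times are seconds since the Unix epoch (an assumption about the GMT time settings, not something verified here). The helper name `convert_time_columns` is hypothetical and not part of PyGMT:

```python
import pandas as pd


def convert_time_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the 3rd/4th columns only when they are named t_1/t_2."""
    out = df.copy()
    for name in out.columns[2:4]:
        if name.startswith("t_"):
            # Assumption: the doubles are seconds since 1970-01-01T00:00:00.
            out[name] = pd.to_datetime(out[name], unit="s")
        # Columns named "i_1"/"i_2" are dummy times and stay as floats.
    return out


absolute = pd.DataFrame({"x": [0.0], "y": [0.0], "t_1": [86400.0], "t_2": [0.0]})
dummy = absolute.rename(columns={"t_1": "i_1", "t_2": "i_2"})
print(convert_time_columns(absolute).dtypes)
print(convert_time_columns(dummy).dtypes)
```

Dispatching on the column name keeps the decision in one place, since the doubles themselves carry no information about whether they are absolute or dummy times.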
I'm a little unsure if `i_1`/`i_2` are actually dummy datetimes. This is a sample output from `x2sys_cross`:
# Tag: X2SYS4ivlhlo4
# Command: x2sys_cross @tut_ship.xyz -Qi -TX2SYS4ivlhlo4 ->/tmp/lala.txt
# x y i_1 i_2 dist_1 dist_2 head_1 head_2 vel_1 vel_2 z_X z_M
> @tut_ship 0 @tut_ship 0 NaN/NaN/1357.17 NaN/NaN/1357.17
251.004840022 20.000079064 18053.5647431 13446.6562433 333.339586673 229.636557499 269.996783034 270.023614846 NaN NaN 192.232797243 -2957.22757183
251.004840022 20.000079064 18053.5647431 71783.6562433 333.339586673 1148.20975878 269.996783034 270.023614846 NaN NaN 192.232797243 -2957.22757183
250.534946327 20.0000526811 18053.3762934 66989.0210846 332.869692978 1022.68273972 269.996783034 269.360150109 NaN NaN -57.6485957585 -2686.4268008
250.532033147 20.0000525175 18053.3751251 66988.9936489 332.866779797 1022.67977813 269.996783034 22.0133296951 NaN NaN -64.5973890802 -2682.04812157
252.068705 20.000075 13447.5 71784.5 230.700422496 1149.27362378 269.995072235 269.995072235 NaN NaN 0 -3206.5
It seems like the `i_1`/`i_2` values vary between rows, but I can't quite remember what they represent... maybe an index of some sort? I might need to inspect the C code to see what's going on. Can you point me to where these `i_1`/`i_2` columns are being output?
Dummy times are just double-precision indexes from 0 to n (xref: https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys.c#L533).
The column name `i_1` or `t_1` is controlled by the variable `t_or_i` in the C code (https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L998). From https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L568, it's clear that if `got_time` is true, the column is absolute time (`GMT_IS_ABSTIME`); otherwise it's double-precision numbers (`GMT_IS_FLOAT`).
We can keep the dummy times as double-precision numbers, or treat them as seconds since the Unix epoch and then convert them to absolute times.
> We can keep the dummy times as double-precision numbers or think them as seconds since unix epoch and then convert them to absolute times.

Maybe convert the relative time to `pandas.Timedelta` or `numpy.timedelta64`? Xref #2848.
Sounds good. Done in 9d12ae1.
There are 2 main changes happening in this PR:
- Adding the `output_type="numpy"` option
- Handling the different dtypes of the `i_1`/`i_2` or `t_1`/`t_2` columns

We can keep this as a single PR since it's hard to separate the two things, but we might need to discuss the implementation a bit more.
pygmt/src/x2sys_cross.py
Outdated
def x2sys_cross(tracks=None, outfile=None, **kwargs):
def x2sys_cross(
    tracks=None,
    output_type: Literal["pandas", "numpy", "file"] = "pandas",
Honestly, I'm not sure if we should support the `numpy` output type for `x2sys_cross`, because all 'columns' will need to be the same dtype in a `np.ndarray`. If there are datetime values in the columns, they will get converted to floating point (?), which makes it more difficult to use later. Try adding a unit test for the `numpy` output_type and see if it makes sense.
> If there are datetime values in the columns, they will get converted to floating point (?)

You're right. Datetimes are converted to floating points by `df.to_numpy()`. Will remove the `numpy` output type.
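The underlying constraint is that a single `np.ndarray` holds exactly one dtype. The sketch below uses a small made-up frame (not real `x2sys_cross` output) to show the per-column dtypes being lost. Note that with plain pandas this particular mixed frame falls back to a generic `object` array; the floating-point result described above presumably comes from PyGMT's own conversion path, which is not reproduced here:

```python
import pandas as pd

# Hypothetical two-column table: one float column, one datetime column.
df = pd.DataFrame(
    {
        "x": [251.0, 252.0],
        "t_1": pd.to_datetime(["2023-01-01T01:23:45", "2023-01-02T00:00:00"]),
    }
)

arr = df.to_numpy()
print(df.dtypes)  # one dtype per column
print(arr.dtype)  # one dtype for the whole array
```

Either way, the caller no longer gets a float column next to a datetime column, which is the argument for dropping the `numpy` output type.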
pygmt/src/x2sys_cross.py
Outdated
def x2sys_cross(tracks=None, outfile=None, **kwargs):
def x2sys_cross(
    tracks=None,
    output_type: Literal["pandas", "file"] = "pandas",
Since the only two options are `pandas` or `file`, we probably don't need an `output_type` parameter, and can revert to the previous code where a `pandas.DataFrame` output is returned when `outfile` is not set?
Done in 5f04506.
Perhaps we should revert the changes in #3191 since it's not needed by this PR.
Hmm, you're right. Sorry, should have realized that earlier 😅
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are absolute datetimes.
# "i_1"/"i_2" means they are dummy times relative to unix epoch.
if output_type == "pandas":
More checks and tests are needed because the conversion relies on GMT configurations like `TIME_EPOCH` and `TIME_UNIT`.
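To make that dependency concrete, here is a hedged sketch of a conversion that takes the epoch and unit as explicit inputs instead of assuming seconds since 1970. The function name `relative_to_absolute` and the unit table are hypothetical; the single-letter codes are assumed to mirror GMT's `TIME_UNIT` values, and only fixed-length units are covered:

```python
import pandas as pd

# Assumed mapping from single-letter time-unit codes to seconds.
SECONDS_PER_UNIT = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0, "w": 604800.0}


def relative_to_absolute(values, epoch="1970-01-01T00:00:00", unit="s"):
    """Convert relative times (doubles) to absolute datetimes."""
    seconds = pd.Series(values, dtype="float64") * SECONDS_PER_UNIT[unit]
    return pd.Timestamp(epoch) + pd.to_timedelta(seconds, unit="s")


# Same doubles, different settings, different datetimes.
print(relative_to_absolute([86400.0]))
print(relative_to_absolute([1.0], epoch="2000-01-01T00:00:00", unit="d"))
```

A test suite could then run the same input under several epoch/unit combinations and check that the converted values shift accordingly.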
os.environ["X2SYS_HOME"], kwargs["T"], f"{kwargs['T']}.tag"
)
# Last line is like "-Dxyz -Etsv -I1/1"
lastline = tagfile.read_text().splitlines()[-1]
Is `encoding="utf8"` not needed anymore?
lastline = tagfile.read_text().splitlines()[-1]
lastline = tagfile.read_text(encoding="utf8").splitlines()[-1]
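For context, `Path.read_text()` without an `encoding` argument falls back to the locale's default encoding, which can differ between platforms. The sketch below writes a throwaway tag-like file (the file name and its first line are made up; only the last line follows the format quoted in the code comment) and reads it back with an explicit encoding:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    tagfile = Path(tmpdir) / "example.tag"  # hypothetical file name
    # Include a non-ASCII character so the encoding actually matters.
    tagfile.write_text("# note: 20°N track\n-Dxyz -Etsv -I1/1\n", encoding="utf8")
    # Read with the same explicit encoding, independent of the locale.
    lastline = tagfile.read_text(encoding="utf8").splitlines()[-1]

print(lastline)
```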
comment=">", # Skip the 3rd row with a ">"
parse_dates=[2, 3], # Datetimes on 3rd and 4th column
**date_format_kwarg, # Parse dates in ISO8601 format on pandas>=2
result = lib.virtualfile_to_dataset(
vfname=vouttbl, output_type=output_type, header=2
)
Note that `x2sys_cross` can output multi-segment files (each segment is separated by `IO_SEGMENT_MARKER`, which is `>` by default; see https://docs.generic-mapping-tools.org/6.5/reference/file-formats.html#optional-segment-header-records). If I'm not mistaken, the current `virtualfile_to_dataset` method does not implement multi-segment file handling yet? To be fair though, the current implementation in `x2sys_cross` simply merges all segments into one, since we skip rows starting with `>`, but we need to check that `virtualfile_to_dataset` will return all segments in a multi-segment file instead of just the first one.
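A small sketch of the behaviour being described, using made-up two-column data rather than real `x2sys_cross` output. Reading with `comment=">"` drops the segment headers and merges every segment into one table; the hypothetical `split_segments` helper shows what keeping the segments separate would look like, which is a convenient reference when checking that no segment is lost:

```python
import io

import pandas as pd

# Made-up multi-segment text with ">" as the segment marker.
text = """\
> segment A
1.0 2.0
3.0 4.0
> segment B
5.0 6.0
"""

# Skipping lines that start with ">" merges all segments into one table.
merged = pd.read_csv(
    io.StringIO(text), sep=r"\s+", comment=">", header=None, names=["x", "y"]
)


def split_segments(lines, marker=">"):
    """Group data rows by segment; each marker line starts a new segment."""
    segments = []
    for line in lines:
        if line.startswith(marker):
            segments.append([])
        elif line.strip():
            if not segments:  # data before any marker forms its own segment
                segments.append([])
            segments[-1].append([float(v) for v in line.split()])
    return segments


segments = split_segments(text.splitlines())
print(len(merged), [len(seg) for seg in segments])
```

Comparing the merged row count against the sum of per-segment row counts is one way a unit test could confirm that all segments are returned.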
Description of proposed changes
Refactor `pygmt.x2sys_cross` to use virtualfiles for output. Partially addresses #3160.
Note that `x2sys_cross` still uses temporary files in the `tempfile_from_dftrack` function.