Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

As a data provider, I want to be able to use "Space" as a delimiter in the Table_Delimited class #673

Open
cgobat opened this issue Aug 11, 2023 · 21 comments

Comments

@cgobat
Copy link

cgobat commented Aug 11, 2023

Checked for duplicates

Yes - I've already checked

💪 Motivation

...so that I can label tabular data in which fields are delimited by spaces.

📖 Additional Details

Add "Space" as a permissible value to the Table_Delimited class' <field_delimiter> attribute, with the value meaning, "Fields in the delimited table are delimited by a single ASCII whitespace character (0x20)".

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

@cgobat cgobat changed the title As a data provider, I want "Space" to be an allowable delimiter in the Table_Delimited class As a data provider, I want to be able to use "Space" as a delimiter in the Table_Delimited class Aug 15, 2023
@jordanpadams
Copy link
Member

@cgobat thanks for the inputs!

this, along with #672, should be raised with your node representative and they can bring it to the PDS Data Design Working Group to look into adding these features.

@cgobat
Copy link
Author

cgobat commented Aug 15, 2023

Okay, will do—thanks Jordan. I figured I'd open these issues here to at least put these things on the radar, since they aren't node-specific and I saw that similar things have been handled/posted in this repo in the past.

@jordanpadams
Copy link
Member

@cgobat just to follow up here, have software change requests been opened with the DDWG for this and #672?

@cgobat
Copy link
Author

cgobat commented Sep 11, 2023

Our node rep @mkelley is out of the office until 9/18—I'm not sure whether he had a chance to bring these up with the DDWG before he left.

@mkelley
Copy link

mkelley commented Sep 12, 2023

I am going to discuss this with our node first, but have not been in town for a meeting yet. I will be back next week, though. Is that OK?

@radiosci
Copy link

This has been proposed before; there are a number of problems with allowing as a delimiter. See CCB-190 in JIRA.

@cgobat
Copy link
Author

cgobat commented Sep 12, 2023

Hi @radiosci, the PDS Jira is not publicly accessible. I know this concept has been proposed/requested before, but the reason(s) it is potentially problematic are not immediately apparent—could you outline said problems here? Reviewing DDWG meeting notes from 2017 and 2018, I didn't see any real show-stopping problems described, or even a clear decision on it either way.

@radiosci
Copy link

radiosci commented Sep 12, 2023

CCB-190.pdf
The JIRA entry (less attachments) is captured in the attached PDF. The primary technical problem is that the first appearance of any allowed delimiter automatically makes that character the delimiter for the entire file. If SPACE were added to the delimiter list, a data provider expected 'comma' to be the delimiter, and the first field were text including one or more SPACE characters, then SPACE would become the delimiter. It's possible to add quotes around text strings to avoid this problem; but DDWG agreed that this puts an unacceptable burden on data providers, most of whom can work with the existing set of delimiters.

@matthewtiscareno
Copy link

Per the discussion in the PDF that Dick just shared, this would be a complicated request to implement. Is there a reason for it? What data product are you trying to fit into the model? How hard would it be to either make it a Table_Delimited with a currently available delimiter, or alternatively into a fixed-width table (Table_Character) with spaces between the columns?

@cgobat
Copy link
Author

cgobat commented Sep 12, 2023

That PDF provides some good insight, thanks. The data products in question are space-separated (non-fixed-width) tables of model parameters, and indeed—my current workaround is just to replace spaces with tabs to get them to pass validation. It works, technically, but to echo @radiosci's comment on the CCB, I don't like modifying original data, especially in a case like this where (PDS/DSV technicalities aside) spaces seem like they should be a no-brainer option for table delimiters.

Reading through the discussion there, I have to say that I am pretty much in agreement with everything @jordanpadams said, especially

we do believe this should be added to the IM in a more general sense for all space-delimited files. Of course we could kludge
this product so we can get it into PDS, but do we really think this will be the last space-delimited file a data provider will ever want to archive?

...since I think the fact that I opened this issue, unaware of the history of this CCB, is a good indicator that those MSL products from back in 2017 were not, in fact, the last space-delimited files a data provider ever wanted to archive. ;-)

If the delimiter used by the parser for the table is set merely by the first occurrence of any of the possible delimiters, does that imply that the field_delimiter attribute in the label is not used? If that is the case, why does the Table_Delimited class require it? That certainly strikes me as a less-than-optimal design...

@thareUSGS
Copy link

thareUSGS commented Sep 13, 2023

If the delimiter used by the parser for the table is set merely by the first occurrence of any of the possible delimiters, does that imply that the field_delimiter attribute in the label is not used? If that is the case, why does the Table_Delimited class require it? That certainly strikes me as a less-than-optimal design...

I completely agree. I am still somewhat shocked by the idea that the "first occurance" defines the delimiter, apparently no matter what is defined in the required field_delimiter. What a pain for PDS4 table readers to figure that out in code. Not even sure how Validate would test for all variations of that. I didn't look very hard, but actually I don't see that "first occurrence" capability in the pds4_viewer code base or the GDAL code base.

Really, I would much rather remove that rule (meaning "first occurrence") and solidify the use of field_delimiter. That change would much easier allow for a space (ASCII 0x32) within field_delimiter.

Summary: I could possibly support "space" as a delimiter if that rule didn't exist. Now this issue I see with space are:

  • easy for users to create tables with multiple spaces between columns. Does 1 space mean the same as 2+ coincident spaces?
  • quoting strings which contain spaces will need to be enforced (but easy to validate).
  • empty fields will be very hard to view. For example the use of Commas makes this pretty easy
field1,field2,field3,field4
42,43,44,45
42,,,45

BTW, I assume one reason this Jira ticket stalled originally was the Mars team was able to move to "Horizontal Tab" even in their 3rd-party defined OBJ format. Apparently, the switch from "space" to "tab" in the OBJ files I guess didn't break the software which supports reading the OBJ format directly (e.g., Meshlab).

@cgobat
Copy link
Author

cgobat commented Sep 13, 2023

A quick look through the NASA-PDS/pds4-jparser codebase didn't turn up any obvious implementations of that "first occurrence" functionality either. Upon further digging, I am skeptical that this is really how it works (or maybe I am just misunderstanding what has been described).

I am attaching test_table.tar.gz, which contains a dummy CSV file and a corresponding PDS4 label. The CSV is comma-separated (and field_delimiter in the label is set to "Comma"), but the first field contains a vertical bar and a semicolon, and the second field contains a horizontal tab. None of the fields are enclosed in quotes, and all of them have a data type of ASCII_String. One would think, given previous discussions, that the parser would see other possible delimiters first (i.e., a vertical bar) and throw a fit, but this product passes validate with no complaints. pds4_viewer also reads it no problem, with the fields correctly delimited by the commas.

This implies that the internal parser does use the field_delimiter attribute to inform its choice of delimiter, which seems like it would make "Space" no different than any of the other currently allowed delimiters as far as implementation difficulty goes.

@matthewtiscareno
Copy link

I don't want to fixate too much on how the delimiter is identified. More generally, the current standards are designed to promote ease of use by future generations. Trent just identified one way in which empty columns are difficult to see if you have a space-delimited non-fixed-width table, and that is not the only issue with the current request.

I again invite you to tell me more about the specific files that you're working with, where they came from, how they came to be formatted the way they are, and the arguments for and against changing them to conform with the current standards.

@cgobat
Copy link
Author

cgobat commented Sep 13, 2023

I think that the details of the delimiter identification and usage are highly relevant though, since that mechanism was cited as the primary issue with the proposal, both back in 2017 and just now.

I agree with Trent's observations—there are certainly some decisions that would need to be made if spaces were allowed delimiters. However, none of the things he mentioned are show-stoppers or overly hard questions—they just need to be discussed and decided on.

I'm not sure how much more there is to say about the data. It is ASCII numeric data in tabular format, with fields separated by spaces, and records of inconsistent lengths. The files are produced by an instrument team and used (in space-separated format) by a calibration pipeline. My current kludge (which works fine, technically speaking) is to re-process the files prior to labeling/archiving, replacing spaces with horizontal tabs. I would summarize the arguments for and against this solution as follows:

Arguments for changing the data to conform with current standards Arguments against (i.e. updating the standards to allow spaces)
I have to; there is currently no other option Spaces seem like a sensible/intuitive delimiter to allow, and it would mean being able to deliver data products in a native format consistent with how they are produced/used by mission software

However, I actually don't think that I want to fixate too much on the details of my specific data products which just happened to bring this issue to light. More generally, I think we should center our discussion around whether the addition of spaces as an allowable table delimiter would be good for the PDS as a whole. To quote Jordan again:

This issue came to light due to a request from a specific data provider for this specific data product, however, we do believe this should be added to the IM in a more general sense for all space-delimited files.

The odd PDS DSV 1 delimiter identification mechanism is the only really concrete issue that I've seen presented as a reason that spaces wouldn't work. That is why it is relevant—if it is potentially not a problem after all, then adding space delimiters seems all the more worth revisiting.

@radiosci
Copy link

Table_Delimited uses parsing_standard_id = "PDS DSV 1", which is defined in PDS SR 4C.1 and includes the "first use" provision; PDS DSV 1 was Todd King's adaptation of RFC 4180 for PDS. Table_Delimited is specified in SR 4C.2; it requires use of the attribute field_delimiter, which "must be consistent with the provision in Section 4C.1 item 3".

@jordanpadams
Copy link
Member

jordanpadams commented Sep 13, 2023

Just to add a point regarding @cgobat's comment:

A quick look through the NASA-PDS/pds4-jparser codebase didn't turn up any obvious implementations of that "first occurrence" functionality either. Upon further digging, I am skeptical that this is really how it works (or maybe I am just misunderstanding what has been described).

To be honest, it probably does not work that way... We did have to make a fork of open-csv to support the PDS DSV 1, but I don't know if we added that stipulation in there or not.

Not to open a can of worms here, but could we introduce a second parsing standard based upon RFC 4180 in addition to our own custom delimited table standard? Maybe the original creators of the IM saw this possible future with the parsing_standard_id attribute.

@radiosci
Copy link

I'm not surprised. The relationship between SR 4C.1 and 4C.2 was fraught from the beginning. Todd insisted on having the PDS DSV 1 standard, which PPI maintained for a few months before it was incorporated into the SR. I think requiring field_delimiter to be consistent with 4C.1(3) is the best we can do in practice. Trying to implement the "first use" code would be a waste of time.

@cgobat
Copy link
Author

cgobat commented Sep 13, 2023

Wouldn't the easiest path forward be to keep the current software implementation(s) and just change that one sentence of the DSV standard definition in 4C.1(3), since it seems like the code may already be performing in the more desirable way (i.e., using the label's field_delimiter attribute)?

@matthewtiscareno
Copy link

matthewtiscareno commented Sep 13, 2023

One of the design principles of PDS4 is to archive data that is easy for future generations to use and hard for future generations to misinterpret. Of course, we also want to make things easy for today's data providers, within reason. Those two things must be held in tension.

Another design principle of PDS4 is that we engineer the data to fit the standards, not the other way around. Of course, we do modify the standards when a clear need for it comes to our attention, so those two things also are held in tension.

Personally, I feel that the best way to adhere to the first principle is to preserve the current options of either fixed-width tables or tables with variable length but a non-space delimiter. It seems to me that it is not too onerous to ask data providers to adhere to these options. The best long-term solution would be to update the mission software so that it natively produces files that meet the standards.

I am only one member of the DDWG, and this is probably not the place for an extensive debate. I'm just articulating some reasons why a reasonable person might not approve this request. Of course, you are free to submit the SCR despite my opinion.

@cgobat
Copy link
Author

cgobat commented Sep 13, 2023

Sure, I hear you, and I appreciate your insight. Just so I understand– are you opposed on principle (i.e. that spaces shouldn't be allowed as delimiters for a philosophical reason), or because it will be more work to implement than you think it's worth?

I will say that your first point can also be reframed in the other direction: expecting data users to have to find and follow extra instructions to undo added reprocessing on the front end if they want to use data in its original context is far from the ideal of "easy for future generations to use". It seems to me that the more superfluous processing and reformatting steps you add (that someone later has to undo), the easier it becomes to misinterpret or not be able to use data. Just my 2¢.

@matthewtiscareno
Copy link

Sure, I hear you, and I appreciate your insight. Just so I understand– are you opposed on principle (i.e. that spaces shouldn't be allowed as delimiters for a philosophical reason), or because it will be more work to implement than you think it's worth?

Speaking only for myself, I would say it's more philosophical. Spaces as delimiters add potential difficulty for users that we don't need to add.

I will say that your first point can also be reframed in the other direction: expecting data users to have to find and follow extra instructions to undo added reprocessing on the front end if they want to use data in its original context is far from the ideal of "easy for future generations to use". It seems to me that the more superfluous processing and reformatting steps you add (that someone later has to undo), the easier it becomes to misinterpret or not be able to use data. Just my 2¢.

I don't envision future generations undoing these steps. A fixed-width table or a CSV is a pretty generally applicable thing. A piece of software that requires a text input but that cannot accept either of those formats is not likely to endure (at least not in that state) for very long.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: ToDo
Development

No branches or pull requests

6 participants