-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
As a data provider, I want to be able to use "Space" as a delimiter in the Table_Delimited
class
#673
Comments
Table_Delimited
classTable_Delimited
class
Okay, will do—thanks Jordan. I figured I'd open these issues here to at least put these things on the radar, since they aren't node-specific and I saw that similar things have been handled/posted in this repo in the past. |
Our node rep @mkelley is out of the office until 9/18—I'm not sure whether he had a chance to bring these up with the DDWG before he left. |
I am going to discuss this with our node first, but have not been in town for a meeting yet. I will be back next week, though. Is that OK? |
This has been proposed before; there are a number of problems with allowing as a delimiter. See CCB-190 in JIRA. |
Hi @radiosci, the PDS Jira is not publicly accessible. I know this concept has been proposed/requested before, but the reason(s) it is potentially problematic are not immediately apparent—could you outline said problems here? Reviewing DDWG meeting notes from 2017 and 2018, I didn't see any real show-stopping problems described, or even a clear decision on it either way. |
CCB-190.pdf |
Per the discussion in the PDF that Dick just shared, this would be a complicated request to implement. Is there a reason for it? What data product are you trying to fit into the model? How hard would it be to either make it a |
That PDF provides some good insight, thanks. The data products in question are space-separated (non-fixed-width) tables of model parameters, and indeed—my current workaround is just to replace spaces with tabs to get them to pass validation. It works, technically, but to echo @radiosci's comment on the CCB, I don't like modifying original data, especially in a case like this where (PDS/DSV technicalities aside) spaces seem like they should be a no-brainer option for table delimiters. Reading through the discussion there, I have to say that I am pretty much in agreement with everything @jordanpadams said, especially
...since I think the fact that I opened this issue, unaware of the history of this CCB, is a good indicator that those MSL products from back in 2017 were not, in fact, the last space-delimited files a data provider ever wanted to archive. ;-) If the delimiter used by the parser for the table is set merely by the first occurrence of any of the possible delimiters, does that imply that the |
I completely agree. I am still somewhat shocked by the idea that the "first occurance" defines the delimiter, apparently no matter what is defined in the required Really, I would much rather remove that rule (meaning "first occurrence") and solidify the use of Summary: I could possibly support "space" as a delimiter if that rule didn't exist. Now this issue I see with space are:
BTW, I assume one reason this Jira ticket stalled originally was the Mars team was able to move to "Horizontal Tab" even in their 3rd-party defined OBJ format. Apparently, the switch from "space" to "tab" in the OBJ files I guess didn't break the software which supports reading the OBJ format directly (e.g., Meshlab). |
A quick look through the NASA-PDS/pds4-jparser codebase didn't turn up any obvious implementations of that "first occurrence" functionality either. Upon further digging, I am skeptical that this is really how it works (or maybe I am just misunderstanding what has been described). I am attaching test_table.tar.gz, which contains a dummy CSV file and a corresponding PDS4 label. The CSV is comma-separated (and This implies that the internal parser does use the |
I don't want to fixate too much on how the delimiter is identified. More generally, the current standards are designed to promote ease of use by future generations. Trent just identified one way in which empty columns are difficult to see if you have a space-delimited non-fixed-width table, and that is not the only issue with the current request. I again invite you to tell me more about the specific files that you're working with, where they came from, how they came to be formatted the way they are, and the arguments for and against changing them to conform with the current standards. |
I think that the details of the delimiter identification and usage are highly relevant though, since that mechanism was cited as the primary issue with the proposal, both back in 2017 and just now. I agree with Trent's observations—there are certainly some decisions that would need to be made if spaces were allowed delimiters. However, none of the things he mentioned are show-stoppers or overly hard questions—they just need to be discussed and decided on. I'm not sure how much more there is to say about the data. It is ASCII numeric data in tabular format, with fields separated by spaces, and records of inconsistent lengths. The files are produced by an instrument team and used (in space-separated format) by a calibration pipeline. My current kludge (which works fine, technically speaking) is to re-process the files prior to labeling/archiving, replacing spaces with horizontal tabs. I would summarize the arguments for and against this solution as follows:
However, I actually don't think that I want to fixate too much on the details of my specific data products which just happened to bring this issue to light. More generally, I think we should center our discussion around whether the addition of spaces as an allowable table delimiter would be good for the PDS as a whole. To quote Jordan again:
The odd PDS DSV 1 delimiter identification mechanism is the only really concrete issue that I've seen presented as a reason that spaces wouldn't work. That is why it is relevant—if it is potentially not a problem after all, then adding space delimiters seems all the more worth revisiting. |
Table_Delimited uses parsing_standard_id = "PDS DSV 1", which is defined in PDS SR 4C.1 and includes the "first use" provision; PDS DSV 1 was Todd King's adaptation of RFC 4180 for PDS. Table_Delimited is specified in SR 4C.2; it requires use of the attribute field_delimiter, which "must be consistent with the provision in Section 4C.1 item 3". |
Just to add a point regarding @cgobat's comment:
To be honest, it probably does not work that way... We did have to make a fork of open-csv to support the PDS DSV 1, but I don't know if we added that stipulation in there or not. Not to open a can of worms here, but could we introduce a second parsing standard based upon RFC 4180 in addition to our own custom delimited table standard? Maybe the original creators of the IM saw this possible future with the |
I'm not surprised. The relationship between SR 4C.1 and 4C.2 was fraught from the beginning. Todd insisted on having the PDS DSV 1 standard, which PPI maintained for a few months before it was incorporated into the SR. I think requiring field_delimiter to be consistent with 4C.1(3) is the best we can do in practice. Trying to implement the "first use" code would be a waste of time. |
Wouldn't the easiest path forward be to keep the current software implementation(s) and just change that one sentence of the DSV standard definition in 4C.1(3), since it seems like the code may already be performing in the more desirable way (i.e., using the label's |
One of the design principles of PDS4 is to archive data that is easy for future generations to use and hard for future generations to misinterpret. Of course, we also want to make things easy for today's data providers, within reason. Those two things must be held in tension. Another design principle of PDS4 is that we engineer the data to fit the standards, not the other way around. Of course, we do modify the standards when a clear need for it comes to our attention, so those two things also are held in tension. Personally, I feel that the best way to adhere to the first principle is to preserve the current options of either fixed-width tables or tables with variable length but a non-space delimiter. It seems to me that it is not too onerous to ask data providers to adhere to these options. The best long-term solution would be to update the mission software so that it natively produces files that meet the standards. I am only one member of the DDWG, and this is probably not the place for an extensive debate. I'm just articulating some reasons why a reasonable person might not approve this request. Of course, you are free to submit the SCR despite my opinion. |
Sure, I hear you, and I appreciate your insight. Just so I understand– are you opposed on principle (i.e. that spaces shouldn't be allowed as delimiters for a philosophical reason), or because it will be more work to implement than you think it's worth? I will say that your first point can also be reframed in the other direction: expecting data users to have to find and follow extra instructions to undo added reprocessing on the front end if they want to use data in its original context is far from the ideal of "easy for future generations to use". It seems to me that the more superfluous processing and reformatting steps you add (that someone later has to undo), the easier it becomes to misinterpret or not be able to use data. Just my 2¢. |
Speaking only for myself, I would say it's more philosophical. Spaces as delimiters add potential difficulty for users that we don't need to add.
I don't envision future generations undoing these steps. A fixed-width table or a CSV is a pretty generally applicable thing. A piece of software that requires a text input but that cannot accept either of those formats is not likely to endure (at least not in that state) for very long. |
Checked for duplicates
Yes - I've already checked
💪 Motivation
...so that I can label tabular data in which fields are delimited by spaces.
📖 Additional Details
Add "Space" as a permissible value to the
Table_Delimited
class'<field_delimiter>
attribute, with the value meaning, "Fields in the delimited table are delimited by a single ASCII whitespace character (0x20)".Acceptance Criteria
Given
When I perform
Then I expect
⚙️ Engineering Details
No response
The text was updated successfully, but these errors were encountered: