DOI Service does not accurately parse <author_list> and <editor_list> in XML labels #377

Closed
Tracked by #30
rsjoyner opened this issue Oct 5, 2022 · 5 comments · Fixed by #378

rsjoyner commented Oct 5, 2022

🐛 Describe the bug

When parsing the <author_list> and <editor_list> in PDS4 XML labels, an error is thrown if the value doesn't follow the formatting rules for commas and semicolons. For instance, this series of values will fail:

<author_list>smith, john; jones, tom, NASA; Google, Inc.</author_list>

(1) NASA will throw an error because there is no comma
(2) Google will be parsed inaccurately as <last_name>, <first_name>

📜 To Reproduce

See example above

🕵️ Expected behavior

(1) Allow a value (within the set of values) to not require a comma.
(2) A more difficult fix will be to "interpret" the Google example.

📚 Version of Software Used

N/A

⚙️ Engineering Details

Per @jordanpadams, let's try to better handle case (1), but I don't think we will ever be able to handle case (2) until the PDS4 Information Model is improved.

@alexdunnjpl
Contributor

alexdunnjpl commented Oct 10, 2022

Per the PDS4 Information Model:

The author_list attribute contains a semi-colon-separated list of names of people to be cited as authors of the associated product. The general format for individual names is: SURNAME, GIVEN NAME(s). Initials may be used in lieu of given name(s). If the name contains a suffix ("Jr.", "Sr.", "III", etc.) it should be placed before the comma (,). Do not include the word "and" before the final author. All authors should be listed explicitly - do not elide the list using "et al.".

The current information model doesn't appear to allow for organisation or mononym authors. @jordanpadams, that seems like an oversight worth fixing.

Provided value "smith, john; jones, tom, NASA; Google, Inc." does not fail, and produces expected results.
Value "smith, john; jones, tom; NASA; Google, Inc." does fail due to the "NASA" mononym.
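
To illustrate why only the second value fails, here is a minimal sketch of a strict reading of the Information Model rule (semicolon between authors, comma between surname and given names). `parse_strict` is a hypothetical name and this is not the DOI Service's actual parser, just an assumption about how a comma-requiring parser would behave.

```python
from typing import List, Tuple


def parse_strict(author_list: str) -> List[Tuple[str, str]]:
    """Split a semicolon-separated list into (surname, given_names) pairs.

    Hypothetical helper: every entry is required to contain a comma.
    """
    parsed = []
    for entry in author_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        if "," not in entry:
            # A mononym such as "NASA" ends up here.
            raise ValueError(f"no comma in author entry: {entry!r}")
        # Split on the first comma only; anything after it is kept as given names.
        surname, given = entry.split(",", 1)
        parsed.append((surname.strip(), given.strip()))
    return parsed
```

Under this sketch the first value parses because "NASA" is absorbed into the given names of "jones", while the second value raises on the standalone "NASA" entry.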

Will fix, such that mononym values are written to the author last-name field, with a blank first-name field (if that doesn't cause validation problems - need to check).

Suggest that authors like "Google, Inc." should be given as "Google Inc.", without any commas.
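
A rough sketch of the proposed behaviour, assuming mononym entries go to the last-name field with a blank first name. The function name `parse_lenient` and the dictionary keys are hypothetical and not taken from the service code.

```python
from typing import Dict, List


def parse_lenient(author_list: str) -> List[Dict[str, str]]:
    """Parse a semicolon-separated list, allowing entries without a comma.

    Hypothetical helper: an entry with no comma is treated as a mononym and
    stored as the last name with an empty first name.
    """
    parsed = []
    for entry in author_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # partition() returns (entry, "", "") when there is no comma,
        # so mononyms fall through without a special case.
        last_name, _, first_name = entry.partition(",")
        parsed.append(
            {"last_name": last_name.strip(), "first_name": first_name.strip()}
        )
    return parsed


authors = parse_lenient("smith, john; jones, tom; NASA; Google Inc.")
```

With the comma dropped from "Google Inc.", both organisation entries land in `last_name` with an empty `first_name`.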

@alexdunnjpl
Contributor

alexdunnjpl commented Oct 10, 2022

@jordanpadams is there a good reason the name parsing logic considers "." to be a valid separator? Does this support a known use-case?

Ah, I see now - names like R. Deen

I've done the best I can untangling the name parsing logic, but to go any further with it I'll need a comprehensive list of name strings which the parser is expected to support. My tests support:

"Dunn, Alex",
"NASA",
"SomeCorp Inc.",
"Some, MiddleNamed, Gal",
"Suffixed Jr., James",
"R. Deen",

but the first/middle-name ordering is broken for inputs such as

"J. R. Bader"

because detection of first/middle-name ordering isn't currently well-defined.

@alexdunnjpl
Contributor

@jordanpadams existing tests suggest a need to support the format "A.Dunn". Is this correct? It seems like we should be able to expect people to input valid values, which I'd argue that isn't, but maybe there's something preventing us from being that opinionated?

@alexdunnjpl
Contributor

@jordanpadams I found these cases

Examples of cases:
    Case 1 --> Should be parsed by semi-colon
        pds4_fields_authors = "Lemmon, M."
    Case 2 --> Should be parsed by comma
        pds4_fields_authors = "R. Deen, H. Abarca, P. Zamani, J.Maki"
    Case 3 --> Should be parsed by semi-colon
        pds4_fields_authors = "Davies, A.; Veeder, G."
    Case 4 --> Should be parsed by semi-colon
        pds4_fields_authors = "VanBommel, S. J., Guinness, E., Stein, T., and the MER Science Team"
    Case 5 --> Should be parsed by semi-colon
        pds4_fields_authors = "MER Science Team"

Most of the work is done, but the issue is blocked pending confirmation of exactly which cases must be supported.
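
For discussion, one possible heuristic that reproduces the separator choice stated for the five cases above. The regex and the name `guess_separator` are assumptions made for illustration, not the existing logic: a value is split on commas only when it has no semicolon and every comma-delimited piece looks like a complete "initial(s) surname" name.

```python
import re

# Hypothetical pattern: one or more initials ("R.", "J.") followed by a
# surname, with or without a space in between ("R. Deen", "J.Maki").
INITIALS_FIRST = re.compile(r"^(?:[A-Z]\.\s*)+[A-Za-z][A-Za-z'\-]+$")


def guess_separator(value: str) -> str:
    """Return ";" or "," as the separator to split an author list on."""
    if ";" in value:
        return ";"
    pieces = [piece.strip() for piece in value.split(",")]
    # Commas separate authors only if each piece is a full name on its own.
    if len(pieces) > 1 and all(INITIALS_FIRST.match(piece) for piece in pieces):
        return ","
    return ";"


cases = [
    "Lemmon, M.",
    "R. Deen, H. Abarca, P. Zamani, J.Maki",
    "Davies, A.; Veeder, G.",
    "VanBommel, S. J., Guinness, E., Stein, T., and the MER Science Team",
    "MER Science Team",
]
separators = [guess_separator(case) for case in cases]
```

Note that under this rule Case 4 is still split on semicolons, so it stays a single entry.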

@alexdunnjpl
Contributor

Final list of supported formats:

[
    "A. Dunn",
    "Dunn, Alex",
    "Dunn, A.",
    "Dunn, A. E.",
    "Dunn, A. E. F. G.",
    "Dunn, Alexander E.",
    "Dunn, Alexander E. F. G.",
    "Jet Propulsion Laboratory",
    "JPL",
    "Google Inc.",
    "Suffixed Jr., James",
]
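
A minimal sketch of how a single name in these formats could be split into fields. This is not the merged implementation; `parse_name`, the field names, and the regex are hypothetical. It assumes "SURNAME, GIVEN" when a comma is present, "INITIAL(S) SURNAME" when the string starts with initials, and an organisation or mononym otherwise.

```python
import re
from typing import Dict

# Hypothetical pattern for "A. Dunn"-style names: initials, then a surname.
LEADING_INITIALS = re.compile(r"^(?:[A-Z]\.\s+)+\S+$")


def parse_name(name: str) -> Dict[str, str]:
    """Split one author string into last/first/middle fields."""
    name = name.strip()
    if "," in name:
        # "Dunn, Alexander E. F. G." or "Suffixed Jr., James":
        # the suffix stays with the surname, before the comma.
        last, given = (part.strip() for part in name.split(",", 1))
        tokens = given.split()
        first = tokens[0] if tokens else ""
        return {"last": last, "first": first, "middle": " ".join(tokens[1:])}
    if LEADING_INITIALS.match(name):
        # "A. Dunn": everything before the final token is an initial.
        *initials, last = name.split()
        return {"last": last, "first": initials[0], "middle": " ".join(initials[1:])}
    # "JPL", "Google Inc.", "Jet Propulsion Laboratory": keep whole, blank first name.
    return {"last": name, "first": "", "middle": ""}
```

For example, `parse_name("Dunn, A. E.")` gives last "Dunn", first "A.", middle "E.", and `parse_name("Google Inc.")` keeps the whole string as the last name.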
