Name form should include a Style or Order #155

Merged
merged 26 commits into from Oct 15, 2012

9 participants

@daveyse
Member
daveyse commented Mar 6, 2012

Because different cultures, regions, etc. order their names differently, supplying a name is typically insufficient to identify the correct name parts, as well as the interpretation of the "natural" order for that name.
For example, "Li Chen" could refer to someone whose given name is "Li" if in the United States, but in the Far East "Li" is most likely the family name.
When data is transliterated from a culture's native script to Roman, this ambiguity becomes even more prevalent.
Name orderings can be categorized as Surname-First (Sinotypic), Surname-Last (Eurotypic), and No-Surname (Monotypic).

Caveats:
These categories may still not be granular enough unless other metadata is provided. For instance, should a name phrase (such as De La Costa, or Maria De La Cruz) be separated into its atomic parts or left as a phrase? It may also be necessary to distinguish between Spanish and Portuguese, where the mother's family name and father's family name form a double surname, yet the order of the two is not the same.
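To make the three categories concrete, here is a minimal Python sketch. The `NameOrder` enum and the `split_name` helper are hypothetical names invented for illustration, not part of any GEDCOM X draft, and the whitespace split is deliberately naive: it does not handle the caveats above (name phrases, double surnames).

```python
from enum import Enum


class NameOrder(Enum):
    # Hypothetical labels for the three orderings named in this issue.
    SURNAME_FIRST = "sinotypic"
    SURNAME_LAST = "eurotypic"
    NO_SURNAME = "monotypic"


def split_name(full_text, order):
    """Return (surname, rest) for a whitespace-separated name."""
    tokens = full_text.split()
    if order is NameOrder.NO_SURNAME or len(tokens) < 2:
        return None, full_text
    if order is NameOrder.SURNAME_FIRST:
        return tokens[0], " ".join(tokens[1:])
    return tokens[-1], " ".join(tokens[:-1])


print(split_name("Li Chen", NameOrder.SURNAME_FIRST))  # ('Li', 'Chen')
print(split_name("Li Chen", NameOrder.SURNAME_LAST))   # ('Chen', 'Li')
```

The same string yields different parts depending only on the declared order, which is the ambiguity this issue is about.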

@stoicflame
Member

Awesome. Thanks for the input.

@EssyGreen

Interesting ... I guess from the Conclusion Model point of view the important thing here is for the user to be able to pick out something that they want to sort by and label that according to their own needs. This technique of "pick a bit" would also be useful for name variations e.g. I have an ancestral line of Frappell often spelled Frapwell, Frapple and even Trapwell etc. As a researcher I want to standardise to make it easy to find my relatives but to retain the original as well. Hence I could say the name was A B Frapwell but sort-by Frappell (even tho' the latter is not actually in the former version).

@jralls
Contributor
jralls commented Mar 9, 2012

Hmm. But is that really necessary for interchange, or is it an application feature? So in @daveyse's example of Li Chen, the fields are surname and given name, and depending on whether a record is American or Chinese, the two fields will be presented in a different order. OK. But the values of the fields don't change, and it doesn't really affect the genealogical conclusion.

The effect of interest in the conclusion model is in presenting the name to the user (definitely an application and not an interchange function). In the record model, it's the extraction: Knowing from the record context which value (Li or Chen) goes into which field (surname or given name). Any automation of that process is in the domain of the application not the interchange, though it might be worthwhile to have a way of documenting the reasoning (e.g., it's a Chinese document, so the string Li Chen parses Li to the surname).

Name variations (e.g., Li vs. Lee) are another matter. Regardless of model, I think that that's an application level function, where applications can take different approaches to encoding the variations, one of which might be a service (ISTR reading of a proposal for something like that recently).

@EssyGreen

@jralls - good points but the application would need to know the "way of reasoning" for each name since a researcher could have relatives in more than one camp ... so we're back to needing some form of field which indicates how to manage the parts.

@ttwetmore

Here is my simplistic, generally disagreed with, view on the name issue.

GEDCOM got it right on this one. Put the "surname" between slashes. It can be first, last or in the middle. If there is no surname, don't put anything in slashes. If the surname is a phrase, fine, just put it all between the slashes.

We can either get hung up on all the different name parts, and obsess about all the different ways names are defined around the world, or we can simply think of a name as an index or a key or a sorting tag, and if there is something in slashes, we simply want that to be the high-order part of the sort order.

This is the simplest and possibly the most general way to handle names. Never have to worry about cultures. Just stick the most important part of the name, AS A SORTABLE KEY, in the slashes, and let the rest of the sort order be defined by all the characters that are left.

The important properties of a name for genealogical purposes, are to be able to sort them, to be able to search for them, and to be able to compare them. This representation is perfect for these purposes.

If in western culture the idea of first name, middle initial, jr/sr, etc, is important, all an application has to do is look into this general representation and pick out the right pieces. DON'T put the first name, middle name, etc, into the actual structure represented in GEDCOM-X.
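As a rough illustration of the slash convention described here, a Python sketch might look like the following. The function names are made up for this example, and the code assumes at most one slash-delimited surname per name with no escaped slashes.

```python
import re

# The text between the first pair of slashes is taken as the surname.
SURNAME = re.compile(r"/([^/]*)/")


def parse_gedcom_name(value):
    """Split a GEDCOM-style NAME value into (display, surname)."""
    match = SURNAME.search(value)
    surname = match.group(1).strip() if match else ""
    display = " ".join(value.replace("/", " ").split())
    return display, surname


def sort_key(value):
    """Surname is the high-order key; the remaining characters break ties."""
    _, surname = parse_gedcom_name(value)
    rest = " ".join(SURNAME.sub(" ", value).split())
    return (surname.casefold(), rest.casefold())


names = ["Anna /Van Cott/", "/Li/ Chen", "Madonna", "Thomas /Wetmore/"]
for name in sorted(names, key=sort_key):
    print(parse_gedcom_name(name)[0])
# Madonna
# Li Chen
# Anna Van Cott
# Thomas Wetmore
```

Names without a surname sort first here because their surname key is empty; an application could just as well choose a different policy.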

@EssyGreen

Just stick the most important part of the name, AS A SORTABLE KEY, in the slashes, and let the rest of the sort order be defined by all the characters that are left.

+1

@stoicflame
Member

The important properties of a name for genealogical purposes, are to be able to sort them, to be able to search for them, and to be able to compare them.

And to display them, right?

So @ttwetmore, why is using some kind of text delimiter to identify the surname better than a property that identifies the style of the name? I understand that a text delimiter functions correctly for those purposes, but so does a style property, right?

Personally, I think it's kind of a pain to have to strip delimiters before displaying the name.

@jralls I don't understand the point you were trying to make by distinguishing between application-level concerns and interchange-level concerns. I understand that we need to focus on interchange and not try to boil the ocean, but the whole purpose of interchange is to interchange between applications. So if there is broad application support for a specific data feature, why would we not want to support it?

@jralls
Contributor
jralls commented Mar 9, 2012

Personally, I think it's kind of a pain to have to strip delimiters before displaying the name.

Learn about regular expressions.

I don't understand the point you were trying to make by distinguishing between application-level concerns and interchange-level concerns

GedcomX is for transferring data between applications, so it should limit itself to the minimal model needed to do that. How the name is presented -- how anything is presented -- is the application's job. In a pure XML environment, it would be accomplished with XSL-FO or CSS stylesheets applied to, not part of, the GedcomX document. Presentation does not belong in GedcomX.

Similarly, searching isn't really part of GedcomX's mission, though @nealcmb in #140 advocates having an index or catalog of some sort to allow partial extraction without having to parse the whole thing. That use-case aside, it's up to the application to search for and present to the user a set of records based on some name. That said, it's certainly quicker both to write and to execute in XPath to specify an element or attribute name and value ([rdf:type="surname" && text="Ralls"]) than to apply conditionally starts-with, ends-with, or = depending upon the value of a name-order attribute.
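A small Python sketch of that comparison, using `xml.etree.ElementTree` from the standard library. The element and attribute names (`name`, `part`, `type`, `order`) and the order values are assumptions for illustration only; neither document shape is taken from the GedcomX schema.

```python
import xml.etree.ElementTree as ET

# Shape 1: each part carries its own type.
TYPED = ET.fromstring(
    "<names>"
    "<name><part type='given'>John</part><part type='surname'>Ralls</part></name>"
    "<name><part type='surname'>Li</part><part type='given'>Chen</part></name>"
    "</names>"
)

# Shape 2: a plain string plus a name-order attribute.
ORDERED = ET.fromstring(
    "<names>"
    "<name order='surname-last'>John Ralls</name>"
    "<name order='surname-first'>Li Chen</name>"
    "<name order='unknown'>Ralls</name>"
    "</names>"
)


def match_typed(root, surname):
    """One test: does any part typed as surname have this text?"""
    return [
        n for n in root.iter("name")
        if any(p.text == surname for p in n.findall("part[@type='surname']"))
    ]


def match_ordered(root, surname):
    """The test itself changes with the value of the order attribute."""
    hits = []
    for n in root.iter("name"):
        text, order = (n.text or "").strip(), n.get("order")
        if order == "surname-first":
            ok = text.startswith(surname + " ")
        elif order == "surname-last":
            ok = text.endswith(" " + surname)
        else:
            ok = text == surname
        if ok:
            hits.append(n)
    return hits


print(len(match_typed(TYPED, "Ralls")), len(match_ordered(ORDERED, "Ralls")))
```

Both approaches find the name; the second simply needs a branch per order value.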

So if there is broad application support for a specific data feature, why would we not want to support it?

I don't know of any genealogy programs that support this feature. Do you? AFAIK the broadly-supported model is to divide the name up into parts -- as the conclusion model's NameForm type already does.

@stoicflame
Member

Personally, I think it's kind of a pain to have to strip delimiters before displaying the name.

Learn about regular expressions.

I rest my case. :-)

I don't know of any genealogy programs that support this feature. Do you?

Well, FamilySearch does. "Who else" is still a question that needs to be answered.

I know @DallanQ is working across industry boundaries to put together an open name database. I wonder if he would be willing to comment on this thread?

AFAIK the broadly-supported model is to divide the name up into parts -- as the conclusion model's NameForm type already does.

Good point. I wonder if @daveyse would be willing to comment on why the name order couldn't be determined just by looking at the order of the parts.

@daveyse
Member
daveyse commented Mar 10, 2012

The more I have learned about the GEDCOM-X specifications for the Name, NameForm, and NamePart elements, the better I can see how leveraging name parts in an always-ordered sequence would address most of the issues that I was concerned about. I'm still on a steep learning curve with GedcomX.

If the name form were to rely on just the full text without separating the name into its labeled parts, then valuable context [data] is lost during the transfer needed to aid the consumer in identifying the types of those name parts.

My main concern is that potential loss of context associated with the full name. I have focused for the past 3+ years on parsing names from various languages and sources, including GEDCOM to a) identify the name parts, b) order those name parts into expected, "standardized" formats, and c) enable name matching. I've probably become overly sensitive to the problem of insufficient context. When you're used to driving nails with your hammer, you're even tempted to drive screws with that hammer. :-)

@jralls stated:

GedcomX is for transferring data between applications... Presentation does not belong in GedcomX.

I concur. The style or order specified for the name should be viewed as metadata associated with how the full name should be interpreted, not just to aid in display formatting. I didn't want to lose that metadata. Relying on slashes to identify surnames will convey some of that metadata, but it also has its drawbacks. For example, consider how often slashes are used to represent an OR condition in the name. Providing a style would indicate an overall relationship and role of the name parts (regardless of how they are parsed into separate elements), and not just to identify the surname.

@jralls
Contributor
jralls commented Mar 10, 2012

The more I have learned about the GEDCOM-X specifications for the Name, NameForm, and NamePart elements, the better I can see how leveraging name parts in an always-ordered sequence would address most of the issues that I was concerned about. I'm still on a steep learning curve with GedcomX.

If the name form were to rely on just the full text without separating the name into its labeled parts, then valuable context [data] is lost during the transfer needed to aid the consumer in identifying the types of those name parts.

OK, I see where you're coming from and I agree that it's important to retain the order data, and I see that at present that doesn't happen. NameType can have zero or more fullText elements followed by zero or more NamePart elements.

If there's only a fullText, well, no one's gotten around to parsing it out yet. No harm there from an interchange standpoint as long as it's an exact transcription of what's on the original document. I suppose that if one needs to transliterate it to a different alphabet one should use a second fullText to hold the transliteration -- and in that case there needs to be a way to indicate which one is the original, which there isn't.

If someone has parsed out the name into parts, there's a problem if there's no original fullText, because there's at present no way to order the NamePart elements. The only practical way I can see to specify that in an XML schema is to add a sequence number to the NamePart type.

Ideally the NameType would have cardinality of 1 or more, but I think most applications don't preserve the original string, never mind providing a place to record both the original string and a transliteration. That makes it pretty hard to enforce the presence of the original.

So I propose two new elements/attributes, one on NameForm/fullText to indicate whether the string is a transcription or a transliteration, and the other an xs:int on NamePart to indicate its position in the reconstructed name.
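A minimal sketch of what those two additions could look like as a data structure, in Python. The field names `kind` and `position` are placeholders chosen for this example; the proposal itself does not name them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FullText:
    text: str
    # Proposed flag: "transcription" (original string) or "transliteration".
    kind: str = "transcription"


@dataclass
class NamePart:
    text: str
    type: Optional[str] = None  # e.g. "surname", "given"
    # Proposed integer: where this part sits in the reconstructed name.
    position: int = 0


@dataclass
class NameForm:
    full_texts: List[FullText] = field(default_factory=list)
    parts: List[NamePart] = field(default_factory=list)

    def original(self):
        """Return the transcription, if one was recorded."""
        for full_text in self.full_texts:
            if full_text.kind == "transcription":
                return full_text.text
        return None

    def reconstructed(self):
        """Rebuild the name from its parts using the position numbers."""
        ordered = sorted(self.parts, key=lambda part: part.position)
        return " ".join(part.text for part in ordered)


form = NameForm(parts=[NamePart("Chen", "given", 2), NamePart("Li", "surname", 1)])
print(form.reconstructed())  # Li Chen
```

With the position recorded, the order survives even when the parts are stored or serialized in a different sequence and no original string is present.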

@EssyGreen

If the name form were to rely on just the full text without separating the name into its labeled parts, then valuable context [data] is lost during the transfer

Whilst I agree, this is only an issue in the Record Model where we are concerned with transferring data which has not been entered by the researcher. Hence, keeping the NameParts in the Record Model is sensible and could include some form of indicator that a particular part was the sortable key.

However, on the Conclusion Model side of things the Name is really just a string with a sortable bit. The meaning of the parts is usually self-evident to the user (in as much as anyone has to recognise a name as it is written on a piece of paper, say). That said I think there are benefits to retaining the existing optional breakdown of the name into parts (if nothing else then for backward compatibility).

I propose two new elements/attributes, one on NameForm/fullText to indicate whether the string is a transcription or a transliteration, and the other an xs:int on NamePart to indicate its position in the reconstructed name.

I think this is getting too complicated ... every entry from an original document into a record is by its very nature a transcription of some sort and the process of transcribing it has already lost any guarantee of the original context (e.g. an old will may refer to someone as "John Bishop the elder, of Somerset" - it doesn't necessarily mean his name suffix is "the elder" - more likely the writer was just trying to differentiate between him and his son.) This sort of thing is down to the skill of the interpreter/researcher and we should leave it to them to figure it out. Adding a "transliteration" attribute just adds complexity and confusion ... all it says is that the stuff in the record is not the original ... well we know that anyway, it never is and never can be.

Would it not be easier to have a "Culture" attribute applied to the whole Record which identifies the cultural standards used in creating the record? This could also be applied to other areas (e.g. date formats, place parts etc) as/when necessary. This seems cleaner and less cluttered than creating an endless array of attributes on a lowly field.

@DallanQ
DallanQ commented Mar 12, 2012

I'm coming into this discussion cold, and I really hesitate to disagree with @ttwetmore, but since @stoicflame asked me to comment, having parsed enough GEDCOM files without identified name parts or slashes, I'd prefer that we not rely upon the user to delimit the name pieces. And as @daveyse says, slashes are problematic because they're used by some to represent OR in addition to surname.

@EssyGreen, I don't think we can have name pieces be optional. Without name piece type identification, you can't do a reasonable job searching names (e.g., Jonathan is a variant of John as a givenname but not John as a surname).

If we want to also preserve name order, what about something like the following?

<name><piece type="surname">Li</piece><piece type="given">Chen</piece></name>

I guess I'm wondering if we could use the order of the piece elements in a name to denote the order of the name pieces, rather than adding a separate "order" attribute. If you wanted to display the name you could simply remove the piece tags.
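For what it's worth, reading that markup in document order is straightforward. The sketch below uses Python's standard `xml.etree.ElementTree` on the exact snippet above. One assumption to flag: because the snippet has no whitespace between the `piece` elements, literally removing the tags gives "LiChen", so this example joins the pieces with a space for display.

```python
import xml.etree.ElementTree as ET

name = ET.fromstring(
    '<name><piece type="surname">Li</piece><piece type="given">Chen</piece></name>'
)

# findall() returns children in document order, so the element order
# doubles as the name order without a separate "order" attribute.
pieces = [(piece.get("type"), piece.text) for piece in name.findall("piece")]
display = " ".join(text for _, text in pieces)

print(pieces)   # [('surname', 'Li'), ('given', 'Chen')]
print(display)  # Li Chen
print("".join(name.itertext()))  # LiChen (tags stripped, no separator)
```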

@jralls
Contributor
jralls commented Mar 12, 2012

I guess I'm wondering if we could use the order of the piece elements in a name to denote the order of the name pieces, rather than adding a separate "order" attribute. If you wanted to display the name you could simply remove the piece tags.

I was thinking that we needed to get the order semantics into the schema. Rethinking it I see that's over-specifying. We can just specify in an annotation that the order of the NameParts ("pieces" in your example) is significant and should match the order in the original (or transliterated) string.

@EssyGreen

I don't think we can have name pieces be optional

If not, then a lot of existing GEDCOM 5 files will fail to comply with the new standard since name pieces are currently all optional and the spec almost advises against using them. GEDCOM 5 migration/compatibility is maybe a separate issue but any system migrating from one to another will have to parse those slashes.

@ttwetmore

So @ttwetmore, why is using some kind of text delimiter to identify the surname better than a property that identifies the style of the name?

Using the delimiters allows all names to be treated as simple strings with no other substructures or attributes or codes or anything else required. It is simpler, can be much shorter, and eliminates the need to invent a "name model" that works for all cultures in the world (since this approach already works for all cultures in the world).

I understand that a text delimiter functions correctly for those purposes, but so does a style property, right?

Yes.

Personally, I think it's kind of a pain to have to strip delimiters before displaying the name.

Are you being serious?

@DallanQ
DallanQ commented Mar 13, 2012

If not, then a lot of existing GEDCOM 5 files will fail to comply with the new standard since name pieces are currently all optional and the spec almost advises against using them. GEDCOM 5 migration/compatibility is maybe a separate issue but any system migrating from one to another will have to parse those slashes.

Absolutely. I'd rather parse names once during migration than have to parse them over and over going forward.

@pjcj
pjcj commented Mar 13, 2012

It's short on practical advice, and perhaps a little OTT, but it might be worth reviewing this post: http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

IMHO, this is one area that GEDCOM got almost right. Just about any assumption you make about names will be wrong at some point. GEDCOM allows for zero or more PERSONAL_NAME_STRUCTUREs, each of which contains a free-form NAME field with optional sub-fields such as SURN (surname), GIVN (given name) and NICK (nickname), and zero or more source citations and notes.

I would disagree with Tom on one point though. Whilst I've never had any problems with it myself, and although I also dismiss the argument that stripping the delimiters is in any way difficult, I would not use slashes to denote the "surname" for the simple reason that this precludes the use of slashes in the name, unless you want to devise some sort of escaping mechanism. And if you get to that point then it does become, if not particularly difficult, at least somewhat error-prone and redundant given that you might as well be using the tagging system already available in your interchange format (XML).

@ttwetmore

I would not use slashes to denote the "surname" for the simple reason that this precludes the use of slashes in the name, unless you want to devise some sort of escaping mechanism. And if you get to that point then it does become, if not particularly difficult, at least somewhat error-prone and redundant given that you might as well be using the tagging system already available in your interchange format (XML).

I agree that escaping slashes may be required and could seem awkward in the GEDCOM approach. For anyone who uses a slash to indicate a choice, as in "James Smith/Smyth", some kind of convention is needed. I have seen this convention in the past. In my software the slash can be escaped with a backslash, the UNIX convention, though the best solution is simply to give the person both names. List the one you think best first and add alternates. Use the first name as the display name, but all the others remain valid search and comparison names.
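A sketch of how such an escaping convention might be parsed, assuming a backslash as the escape character (the usual UNIX habit). The function name and the exact rules are assumptions for this example and do not describe any particular program.

```python
import re

# Token types, in order: an escaped character, a bare slash (surname
# delimiter), or a run of ordinary characters.
TOKEN = re.compile(r"\\(.)|(/)|([^\\/]+)", re.S)


def parse_escaped(value):
    """Return (display, surname), treating backslash-slash as a literal slash."""
    display, surname, in_surname = [], [], False
    for escaped, delimiter, plain in TOKEN.findall(value):
        if delimiter:
            in_surname = not in_surname  # entering or leaving the surname
            continue
        chunk = escaped or plain
        display.append(chunk)
        if in_surname:
            surname.append(chunk)
    return "".join(display), "".join(surname)


print(parse_escaped(r"James /Smith\/Smyth/"))  # ('James Smith/Smyth', 'Smith/Smyth')
print(parse_escaped("James /Smith/"))          # ('James Smith', 'Smith')
```

Recording "Smith" and "Smyth" as two separate names, as suggested above, avoids the need for escaping altogether.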

Doesn't the whole issue here boil down to a discussion between those who believe that GEDCOMX should be set up to structure names into substructures of name parts based on many different cultures, and those who believe that GEDCOMX should not know about the details of name structures and simply use a GEDCOM-like convention to set off the key part of a name? In the former case, a complex name model must pre-exist and all applications must agree to use it. In the latter case, applications are free to structure names any way they choose to internally. When importing and exporting GEDCOMX data they have to do the conversions, to and from their conventions, which is very easy software to write.

My experiences suggest the simpler approach suffices. Young computer geeks often have a penchant for over-analyzing and complexifying things, in the name, say, of rigor and formality, where it is just not needed. I try to resist this juicy temptation, though I expect resistance will be futile on this particular point.

@jralls
Contributor
jralls commented Mar 14, 2012

My experiences suggest the simpler approach suffices. Young computer geeks often have a penchant for over-analyzing and complexifying things, in the name, say, of rigor and formality, where it is just not needed.

Heh. I like being called young. ;-)

But in truth I think that either "lazy" or "practical" is a better description: Categories make searching and sorting easier. There's plenty of value overlap between given and family names, and family databases can get pretty big so that sorting or searching on a single key is insufficient.

Would you be content if instead of some text delimiter (which might end up being a legitimate part of a name no matter what code point you choose) certain parts of a name can be set off with an inline element indicating that it's a key of a certain level, for example <name><key level="2">Thomas</key> T. <key level="1">Wetmore</key>(Tom)</name>?
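To show how an application might consume that kind of inline markup, here is a short Python sketch built on the example above. The `key` element and `level` attribute come from the example; the two helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLE = '<name><key level="2">Thomas</key> T. <key level="1">Wetmore</key>(Tom)</name>'


def sort_key(xml_text):
    """Order the marked keys by level, lowest level first."""
    name = ET.fromstring(xml_text)
    keys = sorted(name.findall("key"), key=lambda k: int(k.get("level")))
    return tuple((k.text or "").casefold() for k in keys)


def display(xml_text):
    """Drop the markup and keep the string exactly as it was written."""
    return " ".join("".join(ET.fromstring(xml_text).itertext()).split())


print(sort_key(EXAMPLE))  # ('wetmore', 'thomas')
print(display(EXAMPLE))   # Thomas T. Wetmore(Tom)
```

The original string is preserved for display, and nothing outside the `key` elements needs to be categorized.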

@daveyse, how wedded are you to trying to categorize every part of every name for every culture world-wide for the past 1000 or so years? Will preserving the string while marking it up in a way to facilitate sorting suffice?

@ttwetmore

John,

I have accepted that my simplistic views will not be implemented, so I can really go along with any suggestion. After all, it's only software.

Just as a comment, using the slashes around the surname does not really limit the ability to search or compare, because search and compare algorithms do not have to assume that the surnames match up. In a previous application I wrote algorithms to compare names where it was assumed that there were many errors in identifying what were the surnames, what were the given names, what was the order of the given names, and so forth, and though a little persnickety, such algorithms are not all that hard to write. They dealt with problems like "Anna Van /Cott/", and "Anna /Van Cott/" and "Anna /Vancott/".
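A deliberately crude Python sketch of the idea, far simpler than the algorithms described: it ignores where the slashes and spaces fall, which is enough to make the three variants above compare equal. It does nothing for reordered given names or spelling variants, and `loose_key` is an invented name.

```python
def loose_key(gedcom_name):
    """Collapse a name to letters and digits only, ignoring case,
    spacing and the position of the surname slashes."""
    return "".join(ch for ch in gedcom_name.casefold() if ch.isalnum())


variants = ["Anna Van /Cott/", "Anna /Van Cott/", "Anna /Vancott/"]
print({loose_key(v) for v in variants})  # {'annavancott'}
```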

@daveyse
Member
daveyse commented Mar 14, 2012

@daveyse, how wedded are you to trying to categorize every part of every name for every culture world-wide for the past 1000 or so years? Will preserving the string while marking it up in a way to facilitate sorting suffice?

@jralls, my main concern was to preserve some context of the name order to assist in determining the type of name parts algorithmically. And, like @ttwetmore, I have encountered many names where the name parts have been either misidentified or inconsistently identified.

Preserving the string will more than suffice if it has either metadata or markup to provide context for interpreting the name parts to facilitate sorting, searching, and matching.

-Stuart Davey... not Davey Stuart, although I have often been called "Dave" :-)

@jralls
Contributor
jralls commented Mar 15, 2012

my main concern was to preserve some context of the name order to assist in determining the type of name parts algorithmically

Preserving the string will more than suffice if it has either metadata or markup to provide context for interpreting the name parts to facilitate sorting, searching, and matching.

Perhaps I'm dense (not the first time!), but that doesn't seem to me to answer the question. Does there exist a comprehensive expert system capable of correctly parsing name strings from a significant proportion of the cultures extant over the last 1000 years or so? What metadata does that system (or, if it doesn't exist now, would it) need? What value does such a system provide to the genealogist? How is it useful for a genealogical data transfer protocol?

Taking a different tack, the teaching genealogists (meaning the ones who lecture at national conferences) say that to evaluate the evidence in documents one needs the context of the whole document and an understanding of where, how, why, by whom, and for what audience the document was prepared, and depending on the document perhaps other documents created and stored with it. How the heck are you going to encode all of that in metadata attached to a name string?

@EssyGreen

I'm with @ttwetmore and @pjcj on this.

Doesn't the whole issue here boil down to a discussion between those who believe that GEDCOMX should be set up to structure names into substructures

Yes I think that's spot on

In the former case, a complex name model must pre-exist and all applications must agree to use it. In the latter case, applications are free to structure names any way they choose to internally

Agreed

My experiences suggest the simpler approach suffices.

Mine too :)

Does there exist a comprehensive expert system capable of correctly parsing name strings from a significant proportion of the cultures extant over the last 1000 years or so?

I doubt it. No-one here seems to know of one and even if it did exist, surely the purpose of GEDCOM is not to try to subsume every expert area into its format but to provide a set of base standards which applications should (and can easily) comply with.

I'd rather parse names once during migration than have to parse them over and over going forward

Indeed but we're not designing the ultimate genealogy app here just a transfer mechanism. Parse on import; do what you like thereafter.

my main concern was to preserve some context of the name order

If we don't attempt to break it down into bits then we don't have a danger of misinterpreting the order/context of the bits.

@daveyse
Member
daveyse commented Mar 15, 2012

Sorry, @jralls, I didn't make myself clear. The metadata I was referring to was context to help the consumer of the full name in THEIR environment to identify significance of parts of name (whether that be surname(s), name phrases, patronymic, clan, etc.). The "order" of the name is probably the most useful metadata.

If we don't attempt to break it down into bits then we don't have a danger of misinterpreting the order/context of the bits.

True. But that implies [at least to me] the name is one atomic entity, and not a composition of meaningful components.

Does there exist a comprehensive expert system capable of correctly parsing name strings from a significant proportion of the cultures extant over the last 1000 years or so?

I have been involved in a project that is building a treebank of names (much like a treebank of grammar structures for a language). We have currently parsed 150K names, and identified 679 name structures. Ambiguity of name order is a significant issue in determining which structure applies.

@DallanQ
DallanQ commented Mar 15, 2012

@daveyse Your treebank of names project sounds pretty interesting. Here's a link to another paper that describes parsing names into parts: http://datamining.anu.edu.au/publications/2002/adm2002-cleaning.pdf (The complexities involved in machine-parsing of names is why I advocate asking the user to do it.)

@EssyGreen

that implies [at least to me] the name is one atomic entity, and not a composition of meaningful components

The end application can choose how to define a name. Ultimately it is the user who decides what the component parts are and what they mean ... and a user may well make compromises which the application has not bargained for (and can never predict). I personally use parentheses to indicate a nick-name or alternative forename but I wouldn't expect an application to automatically parse this out for me. It is a decision I, as a user, have made about how I want to see the name displayed and stored given the constraints and functionality of the particular application I'm using. Someone else may use the same syntax for a different meaning, say an alternative surname or a maiden name. The only things I think we can be clear about are that it is an identity of a person provided by a researcher and that it may (or may not) contain a part which is customarily inherited or passed on (either by birth, marriage, adoption - or possibly some other means) - and because we are in the business of genealogy this transference of the name is of particular import.

@thomast73
Contributor

Hi all. I'm sorry we have not been able to get to this issue until now. However, we are now preparing to take some action.

We have turned this issue into a pull request and I have checked in an initial stab at updating the documentation for Name and NameForm to reflect our current thinking. I would also point you to a Name Model document that details some use cases and requirements around this issue.

I look forward to your feedback.

@thomast73
Contributor

I have made updates to the following: Name, NameForm and NamePart.

@thomast73
Contributor

Let me just attempt to give an overview of the changes being made.

First, we have attempted to clarify the use (and misuse) of the Name.alternateForms property. We have added examples to help in this clarification.

Second, we have added a locale property to NameForm to give each name form a cultural context. I am hoping the examples that have been added to Name and NameForm will help in understanding why we felt we needed this new property. In the end, this is the only addition we are making to the Name* classes.

We have also tried to tighten up the definition of some of the fields (e.g., NameForm.fullText, NameForm.parts).

Please have a look and give your feedback. We would like to close this issue with these changes.

@jralls
Contributor
jralls commented on 3c8864f Sep 18, 2012

You've got a git glitch here: You don't really want to merge a change that replaces the entirety of the XML and JSON specs!

Member

Indeed. It's probably the line feed/carriage return problem. We need to get this resolved because it makes these changes unreviewable.

Member

Okay, I've reverted and re-patched so the diff looks better now. It's at a new commit, 72bbf4c.

Contributor

Indeed. It's probably the line feed/carriage return problem. We need to get this resolved because it makes these changes unreviewable.

Github has a help page you might find useful. Note the OS selector at the top of the page, which changes the recommended settings for each config property.

@jralls
Contributor
jralls commented Sep 18, 2012

Sorry, it doesn't seem to me that these rather cosmetic changes address the issue of how to parse a name into parts or even whether it should be.

@thomast73
Contributor

...doesn't seem to me that these ... changes address the issue of how to parse a name into parts...

I do not think that the "how" of name parsing is in the domain of the GEDCOM X specification.

The thing we are interested in adding here is metadata about the name's cultural context so that subsequent interpreters (human or otherwise) have a better chance at interpreting the data given in the name (parsed or un-parsed). Our feeling is that the NameForm.locale property has sufficient flexibility to be able to describe the cultural context in sufficient detail to be useful in meeting the requirements we have identified thus far.

...or even whether it should be.

We did consider this point. Our feeling at this time is that we did not want to mandate anything in this regard.

I see three possible states for a NameForm (though it occurs to me that this might need to be explicitly stated as well): only a fullText value, only a parts list with at least one part, or both fullText and parts values. If fullText has a value, the parts list is optional and need not even contain parts that cover every term in fullText. If fullText does not have a value, the parts list had better have one. It doesn't make much sense to have a name without at least one of these two fields being populated. The specification does give a mechanism for deriving a fullText value in the case it was not provided. In that sense, we are saying that there will always be a fully rendered value that can be presented for (re)interpretation. Perhaps we also need to state that when present, the fullText value is the authoritative rendition of the name.
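The three states can be sketched in Python as follows. This is an illustration only: the class is a simplified stand-in for NameForm, and joining the parts with single spaces is an assumed derivation rule, not necessarily the mechanism the specification defines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NameForm:
    full_text: Optional[str] = None
    parts: List[str] = field(default_factory=list)

    def state(self):
        """Classify the form into one of the three valid states."""
        if self.full_text and self.parts:
            return "fullText and parts"
        if self.full_text:
            return "fullText only"
        if self.parts:
            return "parts only"
        raise ValueError("a NameForm needs fullText, parts, or both")

    def rendered(self):
        """fullText is authoritative when present; otherwise derive it
        from the parts in order (assumed rule: join with spaces)."""
        self.state()  # rejects the empty form
        return self.full_text or " ".join(self.parts)


print(NameForm(full_text="Li Chen").state())      # fullText only
print(NameForm(parts=["Li", "Chen"]).rendered())  # Li Chen
```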

I think many applications collect their data such that their names are always in parts. It would probably be more natural to allow the data to stay in that form during export. If the parts scheme used in one application is not compatible with the receiving application, we think that enough context exists to allow the receiving application to reform and reinterpret the name data to fit its own scheme.
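To make the three states concrete, here is a minimal Python sketch. The class and function names (`NamePart`, `NameForm`, `is_populated`, `rendered_text`) are hypothetical and not taken from the GEDCOM X code base, and joining part values with a single space is only an assumption about how a fullText value might be derived.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NamePart:
    value: str
    type: Optional[str] = None  # e.g. "given" or "surname" (illustrative labels)


@dataclass
class NameForm:
    full_text: Optional[str] = None
    parts: List[NamePart] = field(default_factory=list)


def is_populated(form: NameForm) -> bool:
    """True if the form has a fullText value, at least one part, or both."""
    return bool(form.full_text) or len(form.parts) > 0


def rendered_text(form: NameForm) -> str:
    """Return fullText when present (treated as authoritative here);
    otherwise derive a rendering by joining part values in list order."""
    if form.full_text:
        return form.full_text
    if not form.parts:
        raise ValueError("NameForm has neither fullText nor parts")
    # Assumption: a single space separates parts in the derived rendering.
    return " ".join(part.value for part in form.parts)
```

A form that carries only parts still yields a rendering that can be presented for (re)interpretation, while an empty form is rejected.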

@jralls
Contributor
jralls commented Sep 23, 2012

6151f28

Oops, another git-burp. Check your line-ending settings

@jralls
Contributor
jralls commented Sep 23, 2012

The BCP47 tag doesn't take into account how names change in a particular language over time: One example is the change from patronymics and farm-names to surnames which occurred in Scandinavian countries in the 19th and early 20th centuries.

@thomast73
Contributor

The BCP47 tag doesn't take into account how names change in a particular language over time: One example is the change from patronymics and farm-names to surnames which occurred in Scandinavian countries in the 19th and early 20th centuries.

The BCP 47 tag says nothing about names in and of itself. It merely defines a cultural context (i.e., Scandinavia). Algorithms/humans would then combine that context with other data (i.e., the time frame and other contextual clues) and make the determination about what sort of qualifiers might be needed to best describe the name.

@jralls
Contributor
jralls commented Sep 24, 2012

The BCP 47 tag says nothing about names in and of itself. It merely defines a cultural context (i.e., Scandinavia).

A bit more specific than "Scandinavia" (BCP 47 encodes language, script, and country). But whatever.

Algorithms/humans would then combine that context with other data (i.e., the time frame and other contextual clues) and make the determination about what sort of qualifiers might be needed to best describe the name.

Isn't it redundant, then, since the other data associated with the record will include the location, and the language and script will be evident from the raw string? It is, in itself, a conclusion, so the evidence to reach that conclusion must already be present.

But what is the goal here? If it's data exchange between programs that use name-parts, then what's needed isn't a "cultural context" that different programs (and researchers) will interpret differently. What's needed is a way to describe the name parts that are in the record.

@stoicflame
Member

Oops, another git-burp. Check your line-ending settings.

Re-applied at 24e7fa7

@thomast73
Contributor

Isn't it redundant, then, since the other data associated with the record will include the location, and the language and script will be evident from the raw string? It is, in itself, a conclusion, so the evidence to reach that conclusion must already be present.

By "it", I assume you are talking about the "locale". It is possible that all of the evidence necessary to conclude the cultural context is present in the user's system (in their head and in their data), but I think that it is also possible that it is not present (e.g., just because I am encountering Cyrillic text does not mean that it is a Russian context, and even if I find a Russian place name associated with the person, can I really safely assume that it is a Russian name? I'm not convinced. Plus how much of what is known is still in the user's head?) I think the option to be explicit is valuable.

But what is the goal here? If it's data exchange between programs that use name-parts, then what's needed isn't a "cultural context" that different programs (and researchers) will interpret differently. What's needed is a way to describe the name parts that are in the record.

The "cultural context" applies to both the full rendering and the parts. And the exchange of parts is relevant independent of the need for a "cultural context".

The goal is to be able to easily find the appropriate name form from among the name forms provided. Locale is the reasons for Name has the option to provide alternate name forms. Otherwise, we could just provided as many instances of Name as are needed.

@jralls
Contributor
jralls commented Sep 25, 2012

It is possible that all of the evidence necessary to conclude the cultural context is present in the user's system (in their head and in their data), but I think that it is also possible that it is not present

If it isn't present then the user can't very well make the conclusion, can she?

I think the option to be explicit is valuable.

And where the heck does the explicit conclusion come from, if it isn't supported by evidence?

The goal is to be able to easily find the appropriate name form from among the name forms provided.

And what name-forms are those?

Locale is the reasons for Name has the option to provide alternate name forms. Otherwise, we could just provided as many instances of Name as are needed.

Parse error. Please reformat.

Look, you've painted yourself into a box with this. Yes, it's necessary to determine the cultural basis of a name in order to parse it correctly, but that determination can't be encapsulated into a locale, and for a huge number of cases it can't be determined algorithmically. Once determined, there's no reason to think that different development teams will independently generate the same set of tags to label the parts. In short, it isn't going to work.

I can understand that you don't want to try to create a catalog of all of the possible ways of parsing names in all cultures for all of history, or even all of genealogically useful history (the last millennium or so). So don't. Say that name parts have to be namespaced (or whatever is the JSON equivalent) and that the namespace URI has to resolve to a spec that describes each tag so that people can take two namespace descriptions and write a program to translate from one to the other.

@thomast73
Contributor

I think the option to be explicit is valuable.

And where the heck does the explicit conclusion come from, if it isn't supported by evidence?

The locale is metadata that further describes the name conclusion. It is not intended to be a conclusion in and of itself - with its own source description, etc.

And users often provide conclusions without providing everything that got them there, whether or not you and I wish it to be otherwise.

BTW, made a modification to the model to remove the notion of "primary" and "alternate" name forms.

@jralls
Contributor
jralls commented Sep 27, 2012

The locale is metadata that further describes the name conclusion.

You've yet to make a case that it's a useful description or that BCP 47 is the right way to encode it.

It is not intended to be a conclusion in and of itself - with its own source description, etc.

OK, that means only that it's part of the name conclusion. It still must be supported by evidence.

And users often provide conclusions without providing everything that got them there, whether or not you and I wish it to be otherwise.

True, but that's not a good reason to encourage them to do so. But that misses the point: You claimed that

Algorithms/humans would then combine that context with other data (i.e., the time frame and other contextual clues) and make the determination about what sort of qualifiers might be needed to best describe the name.

What makes the locale special that it needs to be encoded, but "the time frame and other contextual clues" don't?

How is this helpful for interchange? What program currently encodes the name's locale rather than maintaining an (often insufficient) list of name-parts?

@thomast73
Contributor

The locale is metadata that further describes the name conclusion.

You've yet to make a case that it's a useful description...

One use case is something like this:

A researcher wishes to publish the name of his Korean ancestor. His ancestor's name has well defined representations in the Hangul, Hanja and Latin scripts. The researcher wishes to publish this ancestor's name via a service that processes/presents information using primarily Hangul, via a service that processes/presents information using primarily Hanja, and via a service that processes/presents information using primarily a Latin-based script -- three different services processing/presenting information in three separate cultural contexts. If the researcher provided only the Hangul form of the ancestor's name, the name is largely unusable for two of the services he wishes to use -- users of these services would not be able to readily find and/or interpret the information he is providing. Knowing this, the researcher provides the additional name forms to allow his conclusions to be interpretable in these additional cultural contexts -- thus expanding the audience in which his information can be made available. When each of these service providers process/present this researcher's ancestor's name, they will work to select and share the data most appropriate for their users (or other processing needs). Being able to select data relevant to their cultural context will allow these services to be able to serve both their users and the researcher who selected their service to publish his data.

... or that BCP 47 is the right way to encode it.

BCP 47 tags are an accepted standard. These tags allow us to specify a cultural context in a very detailed way. The standard even has provisions for extensibility. In defining the GEDCOM X standard, we have tried not to create standards where standards already exist. We had concluded that BCP 47 will meet the requirements suggested by the use cases we've identified thus far. Is there a standard that fits the requirements better than BCP 47 does?
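As a rough illustration of the use case above, a service might pick the name form whose language tag best matches its own preference. The sketch below is hypothetical: `select_name_form` is not a GEDCOM X API, the matching is a simplified truncation lookup rather than a full implementation of the BCP 47 matching rules, and the name and tags are illustrative placeholders.

```python
from typing import Dict, Optional

# Illustrative placeholder name, keyed by language tag (language + script).
forms = {
    "ko-Hang": "김철수",
    "ko-Hani": "金哲洙",
    "ko-Latn": "Kim Cheol-su",
}


def select_name_form(forms: Dict[str, str], preferred: str) -> Optional[str]:
    """Return the full text whose tag best matches `preferred`.

    Simplified lookup: tags are compared case-insensitively, and trailing
    subtags are dropped from the preferred tag until something matches.
    """
    by_tag = {tag.lower(): text for tag, text in forms.items()}
    subtags = preferred.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_tag:
            return by_tag[candidate]
        subtags.pop()  # e.g. "ko-latn-kr" -> "ko-latn" -> "ko"
    return None
```

A Latin-script service asking for `ko-Latn-KR` would fall back to the `ko-Latn` form, while a request with no matching form returns `None` and leaves the choice to the caller.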

@jralls
Contributor
jralls commented Sep 28, 2012

A researcher wishes to publish...

It's really hard not to be sarcastic about that example. It's intuitively obvious that the whole file would have to be translated for each provider, not just the names. Nice try, no cigar.

BCP 47 tags are an accepted standard

So is ISO-4948-1. Doesn't make it germane.

These tags allow us to specify a cultural context in a very detailed way.

No, they encode language and script, and modern day location. Languages are available for most known dialects, but regions are current national or colonial names. For example, there's a language tag for Prussian, but no region tag for Prussia, only Germany. The USSR, hardly a cultural designation (it encompassed hundreds of cultures) is there, but deprecated. It goes on. Gascon French is closer to Catalan than to Parisian French, but there's no way to encode it in BCP 47. Klingon, on the other hand, is covered -- but only as a language. No script or locale.

The standard even has provisions for extensibility.

Yes, but not in ways that are useful for recording historical or genealogical data.

EDIT: Changed "transcoded" to "translated" in the second comment sentence. It's not transcoded, of course, as all would be encoded in Unicode, probably UTF-8. "Transcribed" might be even better, but that carries different connotations in the genealogical community.

@stoicflame
Member

Nice try, no cigar.

So much for avoiding sarcasm, eh? A word of advice, if I may: comments like that aren't helpful and only surface immaturity and insecurity. Even if you do feel a need to expose yourself as such, to do so by insulting other forum participants is not acceptable. You're better than that.

So let's focus on some constructive ways to move forward here. Let me see if I can articulate the concerns that have been brought up with the current proposal:

  • The need to have multiple forms of a name is not being disputed. A researcher should be able to provide a Hangul, Hanja, and Romanized form of the same name.
  • The need to identify a "cultural context" for a form of a name is being disputed. You're saying that's just a FamilySearch requirement and shouldn't be imposed on the community.
  • Even if the requirement to identify a cultural context was accepted (which it isn't yet), the proposal to use BCP 47 is being disputed.
  • No alternatives to using BCP 47 for identifying a cultural context have been put forward.

Did I miss anything? Is that accurate?

What if, instead of defining our own "locale" attribute and referring to BCP 47, we just used the xml:lang attribute and stood on that?

@jralls
Contributor
jralls commented Sep 28, 2012

Nice try, no cigar.

So much for avoiding sarcasm, eh?

I don't think that's sarcasm. It's a colloquial way of saying "a good attempt that falls short". It isn't even rude or contemptuous.

Is that accurate?

No, I don't think that it is.

The need to have multiple forms of a name is not being disputed. A researcher should be able to provide a Hangul, Hanja, and Romanized form of the same name.

No, I don't accept that there's a point to providing multiple transliterations of the same phonetic values in a database.
Yes, if the source string is in a particular script, I can see having the quoted string in that script, followed by the same string transliterated into the database's script and enclosed in square brackets. I don't see any point in having the name-parts values in multiple scripts, unless everything in database fields can also be tagged for multiple scripts.

The need to identify a "cultural context" for a form of a name is being disputed. You're saying that's just a FamilySearch requirement and shouldn't be imposed on the community.

Cultural context is a broad term that encompasses more than just language-script-and-location. It includes the religious, customary, and governmental influences on every record that we genealogists use, and a thorough understanding of the cultural context in which records were created is important to correctly interpret the records. It's not something that you can "tag", it is something that you bring to bear in your analysis, and it belongs in the AnalysisDocument.

Cultural context does not apply only to names. Language-script-and-location doesn't, either. Consider the date 5/3/1740. Depending upon who wrote it and when, it could refer to any of these dates in the modern calendar: 5 March 1740, 3 May 1740, 5 June 1740, 3 August 1740, or 5 March 1741. Knowing that it was written in English using Latin script in the modern United Kingdom (BCP 47 en-Latn-GB) doesn't help. Yes, the last instance was often written 5-3-1740/1, but not always.

So, I'm not disputing identifying a cultural context per se, I'm disputing that it's something that doesn't warrant a proof argument and is of concern only to names.

Even if the requirement to identify a cultural context was accepted (which it isn't yet), the proposal to use BCP 47 is being disputed.

Yes, BCP 47 is utterly inadequate to specify a cultural context.

No alternatives to using BCP 47 for identifying a cultural context have been put forward.

Do you know of anything extant that will specify language including regional dialects, religious community, cultural community (how do you encode that?), laws at all levels of government both common and statutory, historical period, and location at the county-equivalent level for the past 1500 years? I don't, and I don't believe one exists.

Did I miss anything?

Yes. The biggest problem of all: This issue isn't about cultural context, it's about whether and how to parse names into tagged name-parts. The whole discussion of the last week is revealed by Thad's use-case to be utterly irrelevant to that.

What if, instead of defining our own "locale" attribute and referring to BCP 47, we just used the xml:lang attribute and stood on that?

Well, it's better in that it makes no claim to expressing cultural context, but it's still BCP 47, so it isn't going to help with old political entities -- there's no way to distinguish Prussian German from Palatine German, for example (and what used to be Prussia is now mostly Poland or Russia, the Germans having been expelled after WWII). It's already part of XML, so anyone using the XML format can use it without a change to GedcomX. I looked for something similar for JSON last night, but couldn't find anything, so I imagine you'd have to add something to the JSON Format spec.
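For what it's worth, here is a small Python sketch of how `xml:lang` could be read from the XML form and carried into JSON. The `lang` JSON key, the `nameForms` key, the un-namespaced `name` root element, and the sample names are all assumptions made for illustration; none of them comes from the current JSON Format spec.

```python
import json
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<name>
  <nameForm xml:lang="de-Latn"><fullText>Johann Schmidt</fullText></nameForm>
  <nameForm><fullText>John Smith</fullText></nameForm>
</name>"""


def name_forms_to_json(xml_text: str) -> str:
    """Convert nameForm elements to JSON, keeping the language tag if any."""
    root = ET.fromstring(xml_text)
    forms = []
    for form in root.findall("nameForm"):
        entry = {"fullText": form.findtext("fullText")}
        lang = form.get(XML_LANG)
        if lang is not None:
            entry["lang"] = lang  # hypothetical JSON counterpart of xml:lang
        forms.append(entry)
    return json.dumps({"nameForms": forms}, ensure_ascii=False)
```

The XML side needs no schema change because `xml:lang` is already part of XML; only the JSON key would have to be defined somewhere.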

@stoicflame
Member

No, I don't accept that there's a point to providing multiple transliterations of the same phonetic values in a database.
Yes, if the source string is in a particular script, I can see having the quoted string in that script, followed by the same string transliterated into the database's script and enclosed in square brackets.

Those two statements are in conflict in my mind. Help me sort it out. You said you don't see a point to providing multiple transliterations, but then you propose a different way of supporting it. The only difference I see between what you proposed and what's being proposed here is what delimiters to use.

This issue isn't about cultural context, it's about whether and how to parse names into tagged name-parts.

Wow. We've gotten way off-base, then. So I don't see anything in the current proposal that mentions anything about "when and how to parse names into tagged name-parts". There's some stuff about what the parts are, how to treat them, and how they relate to the "full text", but nothing about parsing full text into parts. Are you saying that the spec needs to provide that? Why?

@jralls
Contributor
jralls commented Sep 28, 2012

Those two statements are in conflict in my mind. Help me sort it out. You said you don't see a point to providing multiple transliterations, but then you propose a different way of supporting it. The only difference I see between what you proposed and what's being proposed here is what delimiters to use.

The difference is in where the transliteration goes. If one uses NameForm.fullText to quote the string as it is written in the source, she might choose to do that in the Unicode encoding of that string in the script used in the source. If that script is different from the script (and perhaps language -- I'll call it the "DB native language") used in the rest of the database, she might then provide a transliteration, set off in brackets to show that it's an interpretation rather than a quote, in the DB native language. She would then break that up into the name parts according to whatever scheme is appropriate, using the transliteration. If she's careful and thorough, she'll cite an AnalysisDocument in which she chose that particular naming scheme.

The same process would be used for any other "foreign" (to the DB) language/script sources and evidence. The reason is pretty obvious: Writing multi-script searches is a pain. Actually, writing multi-script anything is a pain, even if one has a full suite of input methods available.

So I don't see anything in the current proposal that mentions anything about "when and how to parse names into tagged name-parts".

I pointed that out 10 days ago. Thad replied by launching the locale tangent.

Wow. We've gotten way off-base, then. So I don't see anything in the current proposal that mentions anything about "when and how to parse names into tagged name-parts". There's some stuff about what the parts are, how to treat them, and how they relate to the "full text", but nothing about parsing full text into parts. Are you saying that the spec needs to provide that? Why?

Isn't

what the parts are, how to treat them, and how they relate to the "full text",

a description of parsing?

Go back and re-read Stuart's lead in: It's all about how to figure out (i.e. parse) the Chinese name Li Chen into surname and given name, and how and whether to separate (parse) "Maria De La Cruz" or "De La Costa" into "atomic parts".

@stoicflame
Member

The difference is in where the transliteration goes.

Indeed. I'm good with the whole scenario you just outlined. My question is why is this:

"original name" [transliterated name]

better than:

<nameForm><fullText>original name</fullText></nameForm><nameForm><fullText>transliterated name</fullText></nameForm>

?

The latter carries the semantic meaning and context as part of the data structure. The former has to be specified outside the boundaries of the media type definition with a bunch of extra junk in the spec that explains what the quotes and brackets are. I concede that the former is more succinct--but I'm having a hard time believing succinctness is included in your motivations in this case.
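To illustrate the extra step the quote-and-bracket convention implies, here is a hedged Python sketch of what a receiving program would have to do to recover the two strings. The pattern and the function name `split_bracket_convention` are assumptions; no such convention is defined in the spec.

```python
import re
from typing import List, Optional

# Assumed convention: "original name" [transliterated name]
BRACKET_FORM = re.compile(
    r'^\s*"(?P<original>[^"]+)"\s*\[(?P<transliterated>[^\]]+)\]\s*$'
)


def split_bracket_convention(text: str) -> Optional[List[str]]:
    """Split a quoted original and a bracketed transliteration into two
    strings, or return None when the text does not follow the convention."""
    match = BRACKET_FORM.match(text)
    if match is None:
        return None
    return [match.group("original"), match.group("transliterated")]
```

With separate nameForm elements the same two strings arrive already separated, though without any marker saying which one is the literal transcription.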

Also, what do you do when multiple forms of the same name are provided in the same source. Happens a lot in Japan, for example.

Isn't ... a description of parsing?

I would consider it a description of the result of parsing.

Go back and re-read Stuart's lead in: It's all about how to figure out (i.e. parse) the Chinese name Li Chen into surname and given name, and how and whether to separate (parse) "Maria De La Cruz" or "De La Costa" into "atomic parts".

I read Stuart's lead in not as a request to specify how to parse a name, but as a request to accommodate the results of parsing a name. He was bringing up some cases that the existing model doesn't accommodate.

Maybe Stuart can comment to clarify?

@daveyse
Member
daveyse commented Sep 28, 2012

The intent was to raise a flag that there was missing metadata which many systems could benefit from in order for them to parse names into the formal NameParts when not provided.

It also provides a rationale for the designation of the NamePartType for each NamePart. Knowing that an undecorated full text name is surname-first or given-name-first provides significant metadata for a system to validate or refine the name parts and their types.

Does this help clarify the intent?

@jralls
Contributor
jralls commented Sep 28, 2012

Why is this ... better than ...

There's no indication in the separate nameForm elements that one is a literal transcription and the other is a transliteration for DB operations.

I would consider it a description of the result of parsing.

Sigh. Splitting hairs again.

The result of parsing is Li in the surname and Chen in the given name... or the other way around. The rule that with Chinese names in China the surname comes first but when they move to America they (usually) switch it around is an input to the parse.

@thomast73
Contributor

... The rule that with Chinese names in China ...[your favorite naming heuristic that applies to this situation goes here]... but when they move to America they ...[another favorite naming heuristic here]...

... is business logic that you would write one way for your purposes and I would write another way for my purposes. But to write any business logic, we both need some data from the researcher as to what they concluded about the information they are providing. We could guess. We can write an algorithm to evaluate everything they provide and factor it into our guess. But as you point out, the process of arriving at an answer is anything but simple and would best be described with an analysis document. If instead, the researcher (or other data provider) could just tell us "Hey, this name was rendered using Hanja like someone in Korea would render it", our opportunity for doing something meaningful in our business logic goes way up.

I can think of no way to meaningfully pass an analysis doc as an input to a parse algorithm. The language tag, on the other hand, shows some promise.

As you have stated, there may be some situations that are difficult to express using BCP 47 language tags; but for most situations, it would have enough capability to be very useful. The standard does provide mechanisms for updates and extensions. Perhaps there are things we as a community could propose to make the standard better able to meet our community's needs. But I don't think BCP 47 is irrelevant to our problem space and unable to provide value in our model.

@jralls
Contributor
jralls commented Sep 28, 2012

I can think of no way to meaningfully pass an analysis doc as an input to a parse algorithm.

Which sums up the problem very nicely. You're interested in building some sort of expert system for names, and I'm interested in building something which helps people organize and document their own research. I had the distinct impression that GedcomX was about getting programs like the latter to exchange data. If that's not the case, I've wasted a lot of my time and yours. I regret the first and apologize for the second.

@stoicflame
Member

@jralls, thank you for your perspective and input. You've given us something to think about and wrestle with. We're going to approach this with some of our affiliates and see if we can get their perspective.

It may very well be that the ability to support multiple forms for a name of a person in a given set of research might be just a FamilySearch-specific requirement. If so, we'll consider getting rid of the concept of NameForm altogether. Give us a few days to gather input from other sources.

@stoicflame
Member

We've done some research gathering perspectives and input from other developers and third parties, and our assessment is that the requirement to exchange multiple forms of the same name is not limited to FamilySearch. Our decision is therefore to include the concept of multiple name forms for a name in GEDCOM X.

The changes attached to this issue have been reviewed by these affiliates and they like what they see. Note that the internationalization issue is broader than NameForm and will be addressed at #213. This means that the original purpose of this issue, to include a "style" or "order", will not be accepted but the functionality will be provided by addressing #123.

Here's the summary of what will be applied as a result of the discussion on this thread (and others):

  • Name is a Conclusion with a type and a list of NameForms.
  • NameForm consists of fullText and a list of NameParts.
  • NamePart consists of a type, a value, and a list of qualifiers.
  • Name part qualifiers include the following:
| URI | Description |
|---|---|
| http://gedcomx.org/Title | A designation for honorifics (e.g. Dr., Rev., His Majesty, Haji), ranks (e.g. Colonel, General, Knight, Esquire), positions (e.g. Count, Chief, Father, King) or other titles (e.g., PhD, MD) |
| http://gedcomx.org/First | A designation for the first given name (or the name most prominent in importance). |
| http://gedcomx.org/Middle | A designation for a middle given name (or a name of lesser importance). |
| http://gedcomx.org/Familiar | A designation for one's familiar name. |
| http://gedcomx.org/Religious | A designation for a name given for religious purposes. |
| http://gedcomx.org/Family | A name that associates a person with a group, such as a clan, tribe, or patriarchal hierarchy. |
| http://gedcomx.org/Maiden | A designation given by women to their original surname after they adopt a new surname upon marriage. |
| http://gedcomx.org/Patronymic | A name derived from a father or paternal ancestor. |
| http://gedcomx.org/Matronymic | A name derived from a mother or maternal ancestor. |
| http://gedcomx.org/Geographic | A name derived from associated geography. |
| http://gedcomx.org/Occupational | A name derived from one's occupation. |
| http://gedcomx.org/Postnom | A name mandated by law for populations from Congo Free State / Belgian Congo / Congo / Democratic Republic of Congo (formerly Zaire). |
| http://gedcomx.org/Particle | A grammatical designation for articles (a, the, dem, las, el, etc.), prepositions (of, from, aus, zu, op, etc.), initials (e.g. PhD, MD), annotations (e.g. twin, wife of, infant, unknown), comparators (e.g. Junior, Senior, younger, little), ordinals (e.g. III, eighth), conjunctions (e.g. and, or, nee, ou, y, o, ne, &). |

We'll leave this issue open for a little while longer for comment. Thanks to everybody who contributed to the discussion.
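As a sketch only, the summarized structure could be mirrored in a plain Python dictionary like the one below. The key names (`nameForms`, `fullText`, `parts`, `qualifiers`), the short part-type labels, and the helper `values_with_qualifier` are assumptions for illustration; the split of "Maria De La Cruz" is just one possible reading.

```python
from typing import Dict, List

QUALIFIER_BASE = "http://gedcomx.org/"

# One possible reading of "Maria De La Cruz"; another researcher might
# keep "De La Cruz" together as a single surname part.
name = {
    "nameForms": [
        {
            "fullText": "Maria De La Cruz",
            "parts": [
                {"type": "given", "value": "Maria",
                 "qualifiers": [QUALIFIER_BASE + "First"]},
                {"type": "surname", "value": "De La",
                 "qualifiers": [QUALIFIER_BASE + "Particle"]},
                {"type": "surname", "value": "Cruz",
                 "qualifiers": [QUALIFIER_BASE + "Family"]},
            ],
        }
    ],
}


def values_with_qualifier(name: Dict, qualifier: str) -> List[str]:
    """Collect part values, across all name forms, that carry `qualifier`."""
    return [
        part["value"]
        for form in name.get("nameForms", [])
        for part in form.get("parts", [])
        if qualifier in part.get("qualifiers", [])
    ]
```

Because qualifiers are a list, a part could carry several of them at once without changing its type.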

@mikkelee
mikkelee commented Oct 6, 2012

Looks great, but I'm not so fond of the description for middle name: "A designation for a middle given name (or a name of lesser importance)."

Middle names have equal importance to - but are distinct from - surnames in Scandinavia. They are often inherited independently of the surname, though may merge to become a double-surname if they are inherited together for several generations.

As I mentioned elsewhere, my middle name comes from my maternal grandfather, and all my cousins on that side have the same middle name - but we have different surnames.

Additionally, there might be a qualifier "Characteristic" or similar. Though most people with a given + patronymic combination were differentiated by an occupational or geographic byname, they were also at times differentiated with "characteristic" bynames; examples from "Dansk Navneskik" (Steenstrup, 1899) are Rask (healthy), Høj (tall), Krog (hook(ed)), etc.

@daveyse
Member
daveyse commented Oct 8, 2012

middle name: "A designation for a middle given name (or a name of lesser importance)."

@mikkelee, The reason for the parenthetical phrase in the description is because there are cultures, such as Vietnamese, where the two given names have a different order, and the term with "lesser importance" is not actually the "middle" term of the name. The "first" given, or personal name would be the significant name part used to differentiate the person within a group, such as a family.

What would you suggest for a better description of this secondary (but not necessarily of lesser importance) given name?

@mikkelee
mikkelee commented Oct 8, 2012

Thanks for the clarification, @daveyse :)

I then propose the following amendments to the conceptual model as seen in commit d460e59:

Add a NamePartType:

  • http://gedcomx.org/Middle for cultures which have middle names that are distinct from given names and surnames, such as Scandinavia.

Change NamePartQualifier:

  • http://gedcomx.org/First -> http://gedcomx.org/PrimaryName (-Name suffix to avoid colliding with Identifiers)
  • http://gedcomx.org/Middle -> http://gedcomx.org/SecondaryName

These Qualifiers can then be optionally applied to any NamePartType to denote that for instance a given name is primary/secondary as in Vietnamese, or even a surname in a culture where such a distinction might be essential. That way, the actual order of name parts is kept separate from their import.
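A rough Python sketch of the idea, assuming the two proposed qualifier URIs and a hypothetical tuple layout of (type, value, qualifiers); the example name is an illustrative reading only, not an authoritative analysis of Vietnamese naming:

```python
from typing import List, Optional, Tuple

PRIMARY = "http://gedcomx.org/PrimaryName"
SECONDARY = "http://gedcomx.org/SecondaryName"

# (type, value, qualifiers), listed in written order.
Part = Tuple[str, str, List[str]]

parts: List[Part] = [
    ("surname", "Nguyen", []),
    ("given", "Van", [SECONDARY]),
    ("given", "An", [PRIMARY]),
]


def primary_value(parts: List[Part], part_type: str) -> Optional[str]:
    """Return the value of the first part of `part_type` marked primary."""
    for ptype, value, qualifiers in parts:
        if ptype == part_type and PRIMARY in qualifiers:
            return value
    return None


def written_order(parts: List[Part]) -> str:
    """Join part values in list order, independent of any qualifier."""
    return " ".join(value for _, value, _ in parts)
```

Here the primary given name is found through its qualifier even though it is written last, so list order stays free to describe how the name is rendered.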

And so my other suggestion isn't lost, a new qualifier:

  • http://gedcomx.org/Characteristic for NameParts that derive from a characteristic of the named person.

And possibly (this one I can take or leave, but I thought it might be good to mention):

  • http://gedcomx.org/Byname for NameParts that are not part of a person's legal or baptized name but are nevertheless used to identify them in both official and unofficial capacity.
@mikkelee
mikkelee commented Oct 8, 2012

Alternately, list order could be taken to imply primacy for nameparts.

@thomast73
Contributor

Alternately, list order could be taken to imply primacy for nameparts.

The definition for part order allows for a "full text" value to be built in the absence of a fullText value. To change the ordering to be in primacy order would affect this capability.

@mikkelee
mikkelee commented Oct 8, 2012

Alternately, list order could be taken to imply primacy for nameparts.

The definition for part order allows for a "full text" value to be built in the absence of a fullText value. To change the ordering to be in primacy order would affect this capability.

That makes sense, I retract that suggestion then. Any thoughts on the "meat" of my proposal?

@thomast73
Contributor

Any thoughts on the "meat" of my proposal?

We like the suggestions. I think it will improve the model. I will be making changes based on your input, so watch for the check-in. We appreciate your review and feedback.

@mikkelee
mikkelee commented Oct 8, 2012

Thanks! I'll keep an eye out :-)

@stoicflame stoicflame referenced this pull request Oct 8, 2012
Closed

Status of GEDCOM-X ? #214

thomast73 added some commits Oct 11, 2012
@thomast73 thomast73 Removed lang attribute from Document; diagram clean-up. aeaea68
@thomast73 thomast73 Merge remote-tracking branch 'remotes/origin/master' into name-model
Conflicts:
	gedcomx-model/src/main/java/org/gedcomx/common/Note.java
	gedcomx-model/src/main/java/org/gedcomx/conclusion/Date.java
	gedcomx-model/src/main/java/org/gedcomx/conclusion/Document.java
	gedcomx-model/src/test/java/org/gedcomx/common/NoteTest.java
	gedcomx-model/src/test/java/org/gedcomx/conclusion/DocumentTest.java
	gedcomx-model/src/test/java/org/gedcomx/types/TypesTest.java
	specifications/conceptual-model-specification.md
	specifications/support/gedcomx.zargo
	specifications/xml-format-specification.md
f8d5316
@thomast73 thomast73 Modifies the NamePartQualifierType per community suggestion. dea5da2
@thomast73
Contributor

I have modified the name part qualifier types as suggested by @mikkelee. However, I did not implement the Byname part of the suggestion as the definitions I found for "byname" seemed to overlap significantly with "nickname".

@mikkelee

Hi @thomast73

Thanks for considering my suggestions, but I meant for the new http://gedcomx.org/Middle to be a NamePartType not Qualifier - can you clarify the reasoning behind the change from my suggestion?

As for "byname", my main reason to include it would be that it carries no value judgment the way "nickname" can (to my ears, "nickname" is strictly informal/familiar, whereas a byname is not necessarily so), but as I said, it's not as important to me as the middle name.

@thomast73
Contributor

...but I meant for the new http://gedcomx.org/Middle to be a NamePartType not Qualifier - can you clarify the reasoning behind the change from my suggestion?

It is relatively easy to garner acceptance for a model where name inputs are divided into the "canonical" four name fields -- prefix, given, surname, suffix. Most software programs collect name data using these fields (or a subset of them). Many record forms specified that names be collected using fields with these labels. Most search forms request search terms using fields of this nature. It is a very prevalent pattern in name data. Therefore, the GEDCOM X NamePartType vocabulary has been deliberately limited to these four values. Proposals to add terms to this vocabulary result in lively discussions but have never resulted in consensus. Thus, our thinking was that Middle fit best as a qualifier.

While maybe not exactly as you would like it, you could model the middle name with a name part that did not designate a part type, but that did specify a Middle qualifier. Alternatively, the NamePartType vocabulary could be extended (though other problems attend such a strategy).
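
As a minimal sketch of the first option, assuming simplified stand-in types rather than the real GEDCOM X classes: `MiddleQualifierSketch`, `NamePart`, and `PartType` below are hypothetical, the enum mirrors the four part types named above, and the example name is purely illustrative. The only identifier taken from this thread is the `http://gedcomx.org/Middle` qualifier URI.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified types for illustration; not the actual GEDCOM X classes.
final class MiddleQualifierSketch {
    // The four part types named in the discussion.
    enum PartType { PREFIX, GIVEN, SURNAME, SUFFIX }

    static final class NamePart {
        final String value;
        final PartType type;           // may be null: the type is optional
        final List<String> qualifiers; // zero or more qualifier URIs

        NamePart(String value, PartType type, String... qualifiers) {
            this.value = value;
            this.type = type;
            this.qualifiers = Collections.unmodifiableList(Arrays.asList(qualifiers));
        }

        boolean hasQualifier(String uri) {
            return qualifiers.contains(uri);
        }
    }

    static final String MIDDLE = "http://gedcomx.org/Middle";

    // Parts stay in presentation order; the middle name carries no type, only a qualifier.
    static List<NamePart> example() {
        return Arrays.asList(
            new NamePart("John", PartType.GIVEN),
            new NamePart("Fitzgerald", null, MIDDLE),
            new NamePart("Kennedy", PartType.SURNAME));
    }

    public static void main(String[] args) {
        for (NamePart part : example()) {
            System.out.println(part.value + " type=" + part.type + " qualifiers=" + part.qualifiers);
        }
    }
}
```

A consumer that only understands the four part types can ignore the untyped part or fold it into the given names, while one that understands qualifiers can still recognize it as a middle name.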

@mikkelee

If some existing software only supports a subset of those types (prefix, given, surname, suffix), GedcomX is already expanding on current usage, and I don't see that correspondence with current software usage is essential any longer.

Currently, the only functional difference between type and qualifier is that a namepart may only have one type but many qualifiers, both being entirely optional. Am I understanding correctly?

If so, I think it would be cleaner to do the following: Make type required on all nameparts and add an Infix or Middle (I obviously prefer the latter) type to handle nameparts that come between Given & Surname. Qualifiers as before, zero-to-many.

@thomast73
Contributor

Make type required on all nameparts ...

What is it about part type that makes you want to say that it should be required? What will this bring to the model?

Currently, the only functional difference between type and qualifier is that a namepart may only have one type but many qualifiers, both being entirely optional. Am I understanding correctly?

It is true that type and qualifier are both optional, and that type has at most a single value and that there can be multiple qualifiers. I'm not sure that speaks to their functional purposes however.

As I have discussed the model for names, it seems that there are generally two camps around name parts.

The first camp is interested in separating the name terms out from among all of the other terms and they want to classify the name terms as either a surname or some other kind of name; if the name term is not a surname, it is lumped with the given names -- even if this is not quite true. It is almost as if they want "given name(s)", "surname(s)" and "everything else". Why? In part, it seems to be a function of the processing that will be applied; names are a primary finding aid and the algorithms involved process given names and surnames differently in order to enable the finding process. Also contributing is the fact that much of the existing genealogical data was collected using prompts of this nature -- either in the records themselves and/or in the software used to compile it.

NamePartType (as it exists today on this branch) easily satisfies this camp. Given and surname portions of the name are easily identified and most other cruft is found in the prefix or suffix portions. While there are cases where undesirable-to-this-camp infix elements exist, users are still willing to live with it because of the model's simplicity and broad applicability.

But names are not so simple. In fact, the diversity of names and naming conventions seems to defy all logic.

The second camp is more aware of the diversity involved in modeling this domain. They want to identify each term or phrase in the name and to express all that is known about each. The focus is much more on the individual terms in the name -- their usage, derivation, the relationships implied, etc. The careful genealogist pays attention to all of this and leverages this information to be successful. But most software does not assist this user in identifying, classifying or capturing this level of detail. Many wish that it did, and many algorithms could benefit if it did.

NamePartQualifierType (as it exists today on this branch) seems capable of meeting the majority of the needs identified by this camp -- though there may be quibbles about the vocabulary itself.

I tend to want to think of NamePartType as being useful for modelling term groupings -- modeling data collected by prompting for "given names" and "surname". I can model the "given names" field as a single name part, or as multiple name parts -- each part in the grouping being tagged with the Given type.

I tend to want to think of NamePartQualifierType as being useful in describing individual terms in a name. If I have a phrase and I tag it with a qualifier, the qualifier has to apply to all of the terms in the phrase. As soon as I want to look at an individual term within the phrase, I have to break it into further parts. It seems that qualifiers will generally be better applied to parts that represent individual terms -- not phrases.

By using both in concert, I can model the grouping that was a result of the prompting mechanism (the Given names field) and important information about the terms in that grouping (that the Given names field contained a PrimaryGiven name and a Middle name) in the same structure.
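
To make that concrete, here is a small Java sketch under stated assumptions. The types are hypothetical stand-ins, not the GEDCOM X classes; the qualifier labels `"Primary"` and `"Middle"` are placeholders for whatever the qualifier vocabulary ends up being; and the name is illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified types for illustration; not the actual GEDCOM X classes.
final class TypeAndQualifierSketch {
    enum PartType { PREFIX, GIVEN, SURNAME, SUFFIX }

    static final class NamePart {
        final String value;
        final PartType type;           // grouping: the prompt field the term came from (may be null)
        final List<String> qualifiers; // term-level detail: zero or more placeholder labels

        NamePart(String value, PartType type, String... qualifiers) {
            this.value = value;
            this.type = type;
            this.qualifiers = Collections.unmodifiableList(Arrays.asList(qualifiers));
        }
    }

    // Rebuilds one prompt field by joining, in list order, the parts tagged with that type.
    static String fieldText(List<NamePart> parts, PartType type) {
        List<String> values = new ArrayList<>();
        for (NamePart part : parts) {
            if (part.type == type) {
                values.add(part.value);
            }
        }
        return String.join(" ", values);
    }

    // Returns the value of the first part carrying the qualifier, or null if none does.
    static String firstWithQualifier(List<NamePart> parts, String qualifier) {
        for (NamePart part : parts) {
            if (part.qualifiers.contains(qualifier)) {
                return part.value;
            }
        }
        return null;
    }

    // Illustrative input: "Mary Ann" entered in a given-names field, "Smith" as surname.
    static List<NamePart> example() {
        return Arrays.asList(
            new NamePart("Mary", PartType.GIVEN, "Primary"),
            new NamePart("Ann", PartType.GIVEN, "Middle"),
            new NamePart("Smith", PartType.SURNAME));
    }

    public static void main(String[] args) {
        List<NamePart> parts = example();
        System.out.println(fieldText(parts, PartType.GIVEN));    // prints "Mary Ann"
        System.out.println(firstWithQualifier(parts, "Middle")); // prints "Ann"
    }
}
```

The type answers "which field did this come from?" and the qualifier answers "what do we know about this term?", so neither has to stand in for the other.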

But it seems to me that to require either or both of type or qualifier is an attempt to impose order where order does not exist.

... and add an Infix or Middle ...

I could be swayed to include Infix, but I am pretty sure many others would not agree. I do not think Middle would be a viable proposal as a part type, but it does fit nicely in the qualifiers.

It seems that with Prefix, Infix and Postfix, all of the affix terms (the non-name terms belonging to a full name) would have a home. But I'm tempted to reduce the vocabulary to just Given, Surname and Affix.

While I'm tempted, I'm not sure any changes will actually result.

thomast73 added some commits Oct 15, 2012
@thomast73 thomast73 A tweak to NamePartQualifierType to make Primary and Secondary more b…
…roadly applicable -- so that these qualifiers can be applied to both given names and surnames.
86abc51
@thomast73 thomast73 Updated the samples (and the recipe used to generated the samples) in…
… the XML and JSON format specifications to reflected the changes in place on this pull request.
97ffe65
@thomast73 thomast73 merged commit cab68c3 into master Oct 15, 2012
@thomast73
Contributor

Thanks, everyone, for your input on this issue. The changes we have made as a result of your feedback were merged into the master branch as part of pull request #155. We appreciate your time and thoughts and we feel that your feedback has helped to improve the GEDCOM X model.
