settle on the file serialization format #185

Closed
stoicflame opened this Issue Jul 10, 2012 · 37 comments

Comments

Owner

stoicflame commented Jul 10, 2012

All right, folks. This is where we hash out what the serialization format should be. As mentioned in the analysis of the GEDCOM X file format, there are a lot of things to consider. Issues include:

  • bloat/noise
  • built-in feature set (e.g. comments, start/end markers, type definitions, object identity, IDL, etc.)
  • processing efficiency
  • accessibility (e.g. industry acceptance, available parsing libraries, etc.)
  • readability (e.g. can I open it in my text editor?)

The following are worth considering:

  • Multi-namespaced, highly extensible XML built on existing industry-defined types and namespaces (i.e. what we have today).
  • Simple, domain-isolated XML, optimized for size and readability.
  • JSON
  • Protocol Buffers
  • GEDCOM Tags
  • YAML
Owner

stoicflame commented Jul 10, 2012

(Now that I've opened up the issue, I'll take the time to register my personal opinion.)

I like JSON.

I see no point in using XML if you're not going to use the namespace support or try to leverage existing XML technologies and standards. At that point you've watered down XML so much that you get none of its benefits to offset its space inefficiency.

Protocol buffers are cool, but it's hard to give up being able to open up a file in your favorite text editor.

GEDCOM Tags are nice and efficient, but then everybody would have to write their own parser, and you'd push away all the cool kids who may want to play with genealogical data, but not if they have to write their own parser.

YAML's cool. But just not very widely used or adopted.
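
For a rough, purely illustrative sense of the trade-off being weighed here, the same hypothetical record is sketched below in XML and in JSON; the element and field names are invented for illustration and are not taken from any GEDCOM X specification.

```xml
<person id="p1">
  <name>Jane Doe</name>
  <birth date="1850-03-02" place="Boston, Massachusetts"/>
</person>
```

```json
{ "id": "p1",
  "name": "Jane Doe",
  "birth": { "date": "1850-03-02", "place": "Boston, Massachusetts" } }
```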

Contributor

jralls commented Jul 11, 2012

I see no reason to limit ourselves to a single serialization format. It's pretty clear from earlier discussions that the needs of RESTful web applications are very different from those of desktop applications (where RESTful architectures are inefficient and annoying). It follows then that there should be separate serialization formats for different problem domains.

The namespace support issue is a bit of a red herring. So long as RDF is central to the conceptual model, we're stuck with the namespaces and URIs regardless of serialization.

Your argument about getting no benefit from the space inefficiencies of XML is rather solidly contradicted by your argument about everyone having to write his/her own parser for GEDCOM Tags. ;-) OTOH, it's likely that if GedcomX gets traction there will be libraries made available in the major languages so that it won't be necessary for everyone to write his/her own.

I see no point in using XML if you're not going to use the namespace support or try to leverage existing XML technologies and standards. At that point you've watered down XML so much that you get none of its benefits to offset its space inefficiency.

GEDCOM X isn't really about the serialization format. That's just a means to an end. Personally I like the simple XML approach.

GEDCOM Tags are nice and efficient, but then everybody would have to write their own parser

Unless everyone is going to adopt the GEDCOM data structure as their permanent data structure (highly unlikely), everyone will have to write some form of importer anyway, and they will invariably create their own tags/syntax to fill in the bits GEDCOM didn't think of.

Owner

stoicflame commented Jul 11, 2012

I see no reason to limit ourselves to a single serialization format.

I totally agree. But this thread is about which one to use for the file format.

Are you saying that the file format should accommodate any format? Because that would multiply the complexity for parsers, which would have to support all possible formats.

JSON is a great choice. Efficient. I would prefer a variant in which the keys don't have double quotes around them.

Note that MongoDB uses JSON as its native database format (it actually uses BSON, a binary encoding of JSON). The decision to use JSON could prove to be a catalyst for very fast acceptance of the format simply because of the ease of using MongoDB for persistent, indexed, searchable, and queryable stores.
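
For illustration only, here is a minimal sketch of that MongoDB point using the modern Java driver (which did not exist in this form in 2012); the database, collection, and field names are invented, not part of any GEDCOM X spec.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class StorePerson {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> persons =
                    client.getDatabase("gedcomx").getCollection("persons");
            // A JSON document goes straight in; MongoDB stores it as BSON.
            persons.insertOne(Document.parse(
                    "{ \"id\": \"p1\", \"names\": [ { \"value\": \"Jane Doe\" } ] }"));
        }
    }
}
```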

nealmcb commented Jul 11, 2012

I agree that XML has lots of disadvantages. But another advantage of XML is good support for schemas and validation. And having a solid testing suite and validation of not just the basic GEDCOM X, but also future versions and extensions of it, could be a major aid to interoperability and quality.

Last I saw, schemas and validation were not well supported for the others.

Owner

stoicflame commented Jul 11, 2012

So @nealmcb is that a vote for XML?

It would be easy to write a validator for GEDCOMX no matter its final format. Yes, XML has turn-key programs that can do it out of the gate. Personally I do not see that as an important advantage of XML. They are greatly over-emphasized and have led to much of the politically correct cachet that XML now enjoys.

If we are voting, I cast my vote for JSON or a custom format. But as has been stated clearly in the past, there is nothing preventing a full-blown XML version of GEDCOMX as well. A choice for the preferred archive format really has no bearing on other possible formats, as long as there is an isomorphic mapping between the two.

another advantage of XML is good support for schemas and validation. And having a solid testing suite and validation of not just the basic GEDCOM X, but also future versions and extensions of it, could be a major aid to interoperability and quality

+1

Yes that's a vote for XML from me but personally I would prefer this to be the "simple" type.

PS: I'd also be happy with JSON - to be honest it would be pretty easy to convert one to the other if an app had a preference so toss a coin?

Contributor

jralls commented Jul 12, 2012

Are you saying that the file format should accommodate any format? Because that would multiply the complexity of parsers who would have to support all possible formats.

For what value of "any"?

ISTM that if we can specify multiple serializations which can easily be transformed from one to another with tools like XSLT or a project-supplied conversion program, then there's not really any problem with multiple serialization formats.

Contributor

jralls commented Jul 12, 2012

I totally agree. But this thread is about which one to use for the file format.

Good. That means for the purposes of this discussion we can assume away RDF.

Contributor

jralls commented Jul 12, 2012

I agree that XML has lots of disadvantages. But another advantage of XML is good support for schemas and validation. And having a solid testing suite and validation of not just the basic GEDCOM X, but also future versions and extensions of it, could be a major aid to interoperability and quality.

Last I saw, schemas and validation were not well supported for the others.

+1

XML is the only one on the list backed by a standards organization or a formal schema language. YAML at least has a long track record and apparently has some validation support (I'm not familiar with it). Interestingly, the Wikipedia article about YAML says that JSON is a subset of YAML 1.2, so presumably those same validators could be aimed at JSON files.
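
To illustrate that "JSON is a subset of YAML 1.2" point, here is a minimal sketch using SnakeYAML (just one possible library; SnakeYAML actually implements YAML 1.1, so this only approximates the 1.2 claim). The field names are invented for illustration.

```java
import org.yaml.snakeyaml.Yaml;

public class JsonAsYaml {
    public static void main(String[] args) {
        // A JSON document handed directly to a YAML parser.
        String json = "{ \"id\": \"p1\", \"name\": \"Jane Doe\", \"born\": 1850 }";
        Object parsed = new Yaml().load(json);
        System.out.println(parsed); // prints {id=p1, name=Jane Doe, born=1850}
    }
}
```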

another advantage of XML is good support for schemas and validation. And having a solid testing suite and validation of not just the basic GEDCOM X, but also future versions and extensions of it, could be a major aid to interoperability and quality

+/- 0

XML validation is a don't-care, or at best a ho-hum. All it gives you is syntactic checking and some context-insensitive value checking. For real semantic validation (born before you died, you're not your own grandpa, you died before you turned 200, you're older than your children, etc.) you just have to buckle down and write the code. The incremental cost of writing your own parser in this situation is lost in the noise. Doing that semantic validation directly from the XML requires one of two things:

  1. Build up a full DOM of the XML and then write custom code that dives around inside the DOM to do the validation. It's kind of fun to write that kind of code, but it tends to be memory intensive (DOMs are big things), and you're still going to have to build a number of temporary data structures as well.
  2. Create a custom SAX handler that gets control as each XML element is parsed and builds a custom representation of the data. It then has to either validate that custom representation later, or do as much validation as possible during parsing while keeping a (probably complicated) list of loose ends to check later. Though I have to admit that writing SAX-based programs is a lot of fun.

The bottom line is that whether you write custom semantic validation on top of a custom JSON parser, or whether you write a custom validator based on XML DOM or XML SAX, you have an approximately equivalent job to do. Having syntactic XML validation in your pocket ain't all that valuable.
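
As a concrete sketch of approach 1 above: parse the whole file into a DOM with the standard Java APIs, then run a semantic check against it. The element and attribute names here are invented for illustration and are not taken from the GEDCOM X spec.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LifespanCheck {
    public static void main(String[] args) throws Exception {
        // Build a full DOM, then dive around inside it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(args[0]);
        NodeList persons = doc.getElementsByTagName("person"); // hypothetical element name
        for (int i = 0; i < persons.getLength(); i++) {
            Element p = (Element) persons.item(i);
            String birth = p.getAttribute("birth"); // hypothetical ISO-8601 date attributes
            String death = p.getAttribute("death");
            if (!birth.isEmpty() && !death.isEmpty() && death.compareTo(birth) < 0) {
                System.out.println("person " + p.getAttribute("id") + ": death precedes birth");
            }
        }
    }
}
```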

nealmcb commented Jul 12, 2012

@ttwetmore I think it's useful to distinguish validation of the schema in use from semantic issues with the data that is represented. I expect that regardless of how complete gedcom x is, folks will want to extend it with extra fields and syntax for stuff that they care about that isn't covered. In order to have interoperability, we need an easy way for developers and users to assess what is needed to parse a file, and having clean tools for defining schema extensions and validating compliance with the schema that is claimed could help enormously with such interoperability. And that form of interoperability should have minimal dependence on the kinds of semantic issues you raise, since issues like "born before you died" are properties of the data, not the schema or syntax.

Independent of that, we certainly want good tools to let users check the semantic validity of the data they've entered. Some such issues like type checking (age is not a number) can be modeled and checked via support that comes with some schema languages, but others (being your own grandpa) are going to require specialized validation tools.

@nealmcb I agree with much of what you say. Not only is it useful to distinguish syntactic from semantic validation, it's necessary.

I have to say though, that the idea of allowing GEDCOMX to be extended with extra fields and syntax is a darned slippery slope that I hope GEDCOMX will not step onto. Much better would be a slow evolution over the first few years of its life as holes are discovered that need to be filled with updates to the language.

If such an extension were allowed then GEDCOMX would need to publish two specifications, one of the GEDCOMX "language" itself, and the other of the GEDCOMX schema language. That's a thought that makes me shiver. GEDCOM tried to add a schema feature for just this purpose and no one understood it well enough to use it.

Even without extensions it's obvious that good validation tools will be most important. GEDCOMX could provide the cores of many tools as a Java and/or C library that could be used by anyone wishing to import or export GEDCOMX data. The library could be released in concert with updated versions of the language.

regardless of how complete gedcom x is, folks will want to extend it with extra fields and syntax for stuff that they care about that isn't covered. In order to have interoperability, we need an easy way for developers and users to assess what is needed to parse a file, and having clean tools for defining schema extensions and validating compliance with the schema that is claimed could help enormously with such interoperability

++1

the idea of allowing GEDCOMX to be extended with extra fields and syntax is a darned slippery slope that I hope GEDCOMX will not step onto

I believe it will be extended as soon as it is born, and I think that is a good thing. All users will want maximum rather than minimum migration of data between systems. All vendors will want to import the maximum possible from other systems in order to catch as many customers as possible. To assume that FS will be the guardians is a futile wish - even if it were possible to police in this way, they wouldn't be able to keep up.

I believe it will be extended as soon as it is born, and I think that is a good thing. All users will want maximum rather than minimum migration of data between systems. All vendors will want to import the maximum possible from other systems in order to catch as many customers as possible. To assume that FS will be the guardians is a futile wish - even if it were possible to police in this way, they wouldn't be able to keep up.

Do we never learn? The whole movement toward better genealogical models today was stimulated by the problems caused by all the GEDCOM extensions that at best prevent full data sharing between applications and at worst cause incorrect importing of data.

You may be right about what will happen. But if what you believe does happen, the rank and file genealogists who simply want to share their data will lose again.

For as long as we live in a market driven economy (with unregulated genealogical data) we will never have full data sharing (except maybe between partnered vendors). It is contrary to competitive advantage.

Personally I'm not so sure we all banded together because of the problems caused by varying GEDCOM extensions ... It's never been a problem for me ... what has been a problem (in my experience as a genealogist) is software improvements being restricted by the limitations of the old GEDCOM model and (in my experience as a developer) by misinterpretations of the vagaries and anomalies in the old model.

Fortunately yours and my opinions on data sharing do not matter when it comes to defining the file format.

Owner

stoicflame commented Jul 13, 2012

I'd just like to register my opinion in support of the position articulated by @EssyGreen.

There will always be custom extensions that are added by providers. In my view, the problem with GEDCOM was not that it allowed extensions but that (1) it didn't specify a complete enough model so providers could exchange what they actually wanted to exchange and (2) it didn't establish a process for registering needed extensions.

Look, extensibility has been baked into the foundation of tons of different standards outside the genealogical industry, and history shows that if it's done right, it can increase the adoption rate and encourage innovation without negatively impacting portability. There is no reason it can't be done right in this case, too.

Contributor

jralls commented Jul 13, 2012

Look, extensibility has been baked into the foundation of tons of different standards outside the genealogical industry, and history shows that if it's done right, it can increase the adoption rate and encourage innovation without negatively impacting portability. There is no reason it can't be done right in this case, too.

Roger.

The catch is that the specification is absolutely silent about it. That's not doing it right.

In my view, the problem with GEDCOM was not that it allowed extensions but that (1) it didn't specify a complete enough model so providers could exchange what they actually wanted to exchange and (2) it didn't establish a process for registering needed extensions.

Equally true of GedcomX.

Shall I write a new issue to start hashing something out, or is it already covered somewhere that I don't immediately find?

nealmcb commented Jul 13, 2012

@ttwetmore If the standard doesn't cover important and common types of data, and/or if it is difficult to extend or difficult to validate extensions, then we could end up in another mess as you fear. That's why it is an advantage to work with a format that can easily be extended and validated.

An example of something that might show up in an extension is genetic data. I expect that standardizing fields for SNP alleles or STR data might not make it into the main standard at this point, but will be popular in some circles. I'd much rather have an easy, standard way to define and validate extensions for that, than imagine that people will wait for gedcom x to get it right.

@jralls, good point. If it isn't covered yet, it certainly should be....

Owner

stoicflame commented Jul 16, 2012

The catch is that the specification is absolutely silent about it. That's not doing it right.

Agreed. I didn't intend to claim that we've got it right. Just that we need to get it right.

Shall I write a new issue to start hashing something out?

Sure. Do you think we could write up the issue so as to include the work needed to address the questions of #187 too?

pipian commented Jul 28, 2012

My two cents:

  1. Simplicity is a virtue. Stick with ONE serialization. Users don't need the flexibility of multiple serialization formats, as most implementations will only support (or export) one. If you "allow" different ones, you add complexity to those applications which have to implement GEDCOM X (they have to support reading all formats even though they may export only one).

  2. The two best options (in my book) are JSON and XML. XML has the advantage of schema validation, as has been noted above. Schema validation is probably not crucial for files themselves, but rather for testing and validating the output produced by a program to ensure that the output is valid.

    XML also allows for nice segmentation of extensions if you make use of namespacing (see the sketch after this list). This will prevent two incompatible extensions from colliding, which is a lot harder to do without imposing an additional out-of-band standard on how extensions to the standard are to be made. With XML, such extensions can make use of existing namespacing mechanisms in XML libraries (and simply ignore those elements which have namespaces that are not supported by the application), while in JSON, such namespacing of extensions would need to be implemented separately by all compliant GEDCOM X JSON parsers.

    The biggest disadvantages to XML, however, are that it is more verbose than JSON and it has a significant cost in terms of parse-time (parsing into a DOM tree structure is extremely expensive). However, once it is parsed, this is not as big an issue, especially if the parsing is done only once (as will be the case in most programs). XML can also be a bit more unwieldy to interact with if a less fully-featured XML library is employed.
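
A minimal sketch of the namespacing idea mentioned above; the namespace URIs, prefixes, element names, and attributes here are invented for illustration, not taken from the GEDCOM X specification.

```xml
<!-- Hypothetical fragment: a core record carrying a vendor extension. -->
<gx:person xmlns:gx="http://example.org/gedcomx-core"
           xmlns:ext="http://example.org/vendor-extension">
  <gx:name>Jane Doe</gx:name>
  <!-- A reader that doesn't recognize the ext namespace can simply skip this element. -->
  <ext:militaryService unit="20th Maine" from="1862" to="1865"/>
</gx:person>
```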

I rather like YAML myself, as it supports datatyping through the use of tags (which JSON does not) and is more human-readable than XML. It also supports references to other values in the same file as a first-order concept, although this level of complexity may not be ideal and may make the spec too confusing. YAML is admittedly a bit of a niche markup language, however, and its use would require applications to include external YAML libraries rather than making use of XML parsers, which are already included with practically every operating system distribution, or JSON parsers, which are also fairly common.
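
To illustrate the "references to other values" point, here is a hypothetical YAML fragment using an anchor (&) and an alias (*); the field names are invented for illustration.

```yaml
# A source is anchored once and referenced again later in the same document.
sources:
  - &census1850 { title: "1850 US Census", repository: "NARA" }
persons:
  - name: Jane Doe
    citations:
      - *census1850   # refers back to the anchored source above
```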

Contributor

jralls commented Jul 28, 2012

Simplicity is a virtue. Stick with ONE serialization. Users don't need flexibility of multiple serialization formats, as most implementations will only support (or export) one. If you "allow" different ones, you add complexity to those applications which have to implement GEDCOM X (they have to support reading all formats even though they may export only one).

Users don't care about serialization formats. Developers do.

The need for more than one serialization arises from multiple use cases: in particular, the overriding FamilySearch use case is to support distributed storage of genealogical information for Web applications. This use case is what is driving the stitched-together-small-documents-with-RDF-everywhere approach currently in the spec (and a bunch of other design choices as well), but, regardless of what underlies it (XML, JSON, or GEDCOM-style tags), it is about as non-optimal as possible for the use case that most of the "outsiders" seem to be interested in.

nealmcb commented Jul 28, 2012

@pipian I agree that namespaces are an important factor and will really help extensions. Along with the schema validation advantage, and the desire to use RDF anyway, I'd say we should just pick XML.

pipian commented Jul 29, 2012

@jralls I probably should have said "developers" rather than "users" there, but you make a good point. If we're talking about the context of web services, a JSON model is equally important due to the way web browsers have implemented cross-site requests (i.e. you can only do them with JSONP, which implies that JSON has to be a serialization format). Of course that doesn't really explain why XML needs to be the other, since you can still get the RDF benefits by employing something like JSON-LD.

My argument though is that each format should have a clearly defined use case or domain in which it is used, so that the cost of implementation is not so high (i.e. a genealogy app does not have to import and export in both the XML and JSON formats for saving, even though it might need the XML format for saving and JSON for interacting with web services)

Contributor

jralls commented Jul 29, 2012

My argument though is that each format should have a clearly defined use case or domain in which it is used, so that the cost of implementation is not so high (i.e. a genealogy app does not have to import and export in both the XML and JSON formats for saving, even though it might need the XML format for saving and JSON for interacting with web services)

OK, we're getting closer. ;-)

Now consider that as long as the semantic information and basic structure of different formats are the same, it is trivial to write reformatters to convert from one to another. RELAX NG and trang are a good example, with which you are no doubt familiar. If standard, cross-platform reformatters are available, then a particular application need support only one format directly, and users (not developers) can simply apply the appropriate reformatter in a pipeline to get the format they need to complete a transfer between applications which support different formats.
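
As a sketch of what such a reformatter could look like (assuming Jackson and its jackson-dataformat-xml module; this is a naive structural mapping that loses XML attribute/element distinctions and namespaces, so a real project-supplied converter would need a defined mapping):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import java.io.File;

public class Reformat {
    public static void main(String[] args) throws Exception {
        // Read the XML file into a generic tree, then write it back out as JSON.
        JsonNode tree = new XmlMapper().readTree(new File(args[0]));
        String json = new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(tree);
        System.out.println(json);
    }
}
```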

joeflint commented Oct 8, 2012

My vote is for XML. While others don't see much value in XML schema validation, I see it as the keystone to maintaining the integrity of the GEDCOM X standard. A basic schema is easy to generate from a sample XML document that contains all the tags in the model. You will then need to manually edit the resulting schema to indicate optional tags. There are probably tools available to do this, but a text editor will do.

Once a "golden" schema has been developed it would be released with the rest of the specifications for GEDCOM X. Any vendor or application claiming to be GEDCOM X compliant would have to produce XML that could be validated with the schema. Period. End of Story.
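
For what it's worth, validating an export against such a "golden" schema needs nothing beyond the standard Java APIs. A minimal sketch (the file names are hypothetical):

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("gedcomx.xsd"));    // the released "golden" schema
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("export.xml")));  // throws SAXException on violations
        System.out.println("export.xml conforms to the schema");
    }
}
```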

I agree XML is the way to go. But the assumption that it must be XML in order to be validated against a schema is not true. It would be just as easy to validate against a JSON-based schema, GEDCOM-based schema, and so on.

The universal existence of XML-string to DOM-tree conversion libraries makes XML a very powerful approach. Combine this with XPath-based navigation (see, for example, the wonderfully powerful nodesForXPath: method in the Objective-C NSXMLDocument class). An early complaint about DOM-tree based processing was the amount of memory required to hold the trees and the time necessary to build them. Recently I've been processing large-ish XML files (~1 MB) with DOM trees and I've been impressed with the power of modern implementations. The Mac OS X implementation of NSXMLDocument will parse one of these files and build the DOM tree in tens of milliseconds. With modern RAM sizes and efficient virtual memory systems, the size of the DOM trees is not a big issue either.
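
The same XPath-over-DOM pattern is available in Java's standard library; a minimal sketch (file and element names are hypothetical, not from the GEDCOM X spec):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Build the DOM, then navigate it with an XPath expression.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("tree.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList names = (NodeList) xpath.evaluate("//person/name", doc, XPathConstants.NODESET);
        System.out.println("found " + names.getLength() + " name elements");
    }
}
```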

However, the rumor is that GEDCOM-X is now dead as an effort for a new standard. So is all this now moot?

Owner

stoicflame commented Oct 8, 2012

However, the rumor is that GEDCOM-X is now dead as an effort for a new standard. So is all this now moot?

Nah, don't be silly. FamilySearch is still very much supportive of GEDCOM X and it will continue to develop it. It's very much active. I'm hoping to put together a tentative release schedule soon.

Sorry, Ryan. I was trying to judge the veracity of Tamura's post. I hadn't read your blog post when I wrote the above. He clearly put his own wondrously weird interpretation on your post. Again, apologies.

tomtn commented Apr 30, 2013

Any update on this? Has there been a decision on the serialization format, and in general is the project alive? Poking around, it looks like most everything has not been updated in a long time.

Owner

stoicflame commented Apr 30, 2013

Hi @tomtn. Welcome.

Indeed, the conversation on this topic has settled, but we've been busy getting other stuff regarding the conceptual model nailed down. We believe we're close to settling the last few things there, at which point we'll settle the file format and move on. We hope to do that by the end of May 2013.

I assure you the project is alive and active, but I don't blame you for wondering otherwise. We haven't seen any blog entries for awhile, and the website severely needs to be updated. The signs of activity can be found in the commit history and in the closed issues, and once we get the "core" specs settled, we can devote more time to documentation and blogging. (It's really hard to document a moving target.)

Owner

stoicflame commented May 22, 2013

In preparation for the pending milestone 1 release of GEDCOM X, we are making the final decisions on the nature of the file format. The file format specification has been updated to reflect our decisions.

This particular question was especially difficult because of how much subjectivity is involved. In the end, XML was selected as the serialization format because of its proven and well-established industry track record and rich toolset on all platforms. JSON has a well-known set of drawbacks, including the following:

  • Lack of support for comments.
  • Lack of type metadata.
  • Lack of formal support for aliases/anchors for object identity.
  • Weaknesses in extensibility support.
  • Weaker "readability".

We acknowledge that none of these drawbacks are show-stoppers and that many people dispute the relevance (or even the existence) of these weaknesses. Nevertheless, we believe these weaknesses exist and that the benefits of using JSON don't outweigh the potential cost of accepting them.

stoicflame closed this May 22, 2013

Owner

stoicflame commented Mar 2, 2017

This issue was closed almost four years ago. Since then, we haven't seen much industry use (internal or external) of the file format. There has been some recent motion (both internally and externally) to start using the file format, but clients are clearly wanting JSON instead of XML.

Since current use of the file format is apparently non-existent and since current demands of the file format are for JSON, the proposal is to change the file format to use JSON instead of XML.

Comments are welcome at #307.
