
settle on the granularity of the file format data blocks #183

stoicflame opened this Issue Jul 10, 2012 · 4 comments

3 participants

FamilySearch member

Analysis of the GEDCOM X file format shows that ZIP becomes less efficient as the entries get smaller and more numerous. As defined today, the GEDCOM X file format specifies a large number of small files, and the side effect is that ZIP itself almost doubles the size of the file.

So we need to rethink whether we want to decrease the number of entries and increase their size by bundling entities together under some kind of data blocking strategy. This issue is open to discuss that strategy.

At one end of the spectrum is what we have today: each entity (person, relationship, source) is its own entry. This strategy was selected because it gives processors a lot of flexibility in deciding how to divide up the processing. It also lets the self-description mechanism apply at the entity level, so processors can perform more powerful analysis of the file without parsing any of the entries.

At the other end of the spectrum, everything (except maybe multimedia files) is put into a single file.

There are other strategies in the middle. For example, we could bundle all the persons into a block, all the relationships into another block, etc.
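
To get a rough feel for the trade-off between these strategies, here is a minimal sketch that builds two in-memory ZIP archives from the same data, one entry per entity versus one bundled entry, and compares their sizes. The entity payload, the namespace URI, and the entry names are made-up stand-ins, not the actual GEDCOM X serialization, so treat the numbers it prints as illustrative only.

```python
import io
import zipfile


def make_entity(i: int) -> bytes:
    # Hypothetical stand-in for one small serialized entity; the element
    # names and namespace are invented for this sketch.
    return (
        f'<person xmlns="http://example.org/ns" id="p{i}">'
        f"<name>Person {i}</name></person>\n"
    ).encode("utf-8")


def zip_size(entries: dict[str, bytes]) -> int:
    """Write the entries to an in-memory ZIP and return the archive size."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


entities = [make_entity(i) for i in range(1000)]

# Strategy A: one ZIP entry per entity.
per_entity = {f"persons/p{i}.xml": data for i, data in enumerate(entities)}

# Strategy B: all persons bundled into a single entry.
bundled = {"persons.xml": b"<persons>\n" + b"".join(entities) + b"</persons>\n"}

raw_size = sum(len(data) for data in entities)
per_entity_size = zip_size(per_entity)
bundled_size = zip_size(bundled)

print(f"raw data:          {raw_size} bytes")
print(f"one entry/entity:  {per_entity_size} bytes")
print(f"single bundle:     {bundled_size} bytes")
```

Each ZIP entry carries a local header and a central directory record that both repeat the entry name, and DEFLATE compresses each entry independently, so tiny entries pay that fixed cost without getting much compression in return.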

FamilySearch member

(Now that I've opened up the issue, I'll take the time to register my personal opinion.)

I like one entry per entity. I want the processing flexibility and the self-description granularity.

I guess that shouldn't be too much of a surprise to anyone :-).




I agree that one file per entity is okay, as long as we can get rid of all the namespaces and other boilerplate.

FamilySearch member

In preparation for the pending milestone 1 release of GEDCOM X, we are making the final decisions on the nature of the file format. The file format specification has been updated to reflect our decisions.

For this particular question, the decision was made to give implementations the flexibility to decide how big they want their data blocks. Blocks can be as small or as large as an implementation wants, because the format provides a mechanism for both "same-document" references and "relative path" references. The default implementation will encourage large-grained data blocks. Implementations that choose smaller data blocks won't get as much benefit from compression.
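
As a rough illustration of how a processor might handle the two kinds of references, here is a minimal sketch. It assumes URI-style references in which a leading `#` marks a same-document reference and anything else is a path relative to the referring entry; that syntax, the function name `resolve`, and the entry names are assumptions made for the example, and the file format specification is the authority on the real rules.

```python
import posixpath


def resolve(entry_path: str, ref: str) -> tuple[str, str]:
    """Return (target entry path, fragment id) for a reference found in entry_path.

    Assumed syntax (hypothetical, for illustration only):
      "#p1"                  -> same-document reference
      "../sources/s1.xml#s7" -> relative-path reference to another entry
    """
    path, _, fragment = ref.partition("#")
    if not path:
        # Same-document reference: the target lives in the referring entry.
        return entry_path, fragment
    # Relative-path reference: resolve against the referring entry's directory.
    base_dir = posixpath.dirname(entry_path)
    return posixpath.normpath(posixpath.join(base_dir, path)), fragment


same_doc = resolve("persons/block1.xml", "#p1")
other_doc = resolve("persons/block1.xml", "../sources/s1.xml#s7")
print(same_doc)
print(other_doc)
```

With large-grained blocks most references would presumably take the same-document branch, while implementations that use smaller blocks would lean on the relative-path branch.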

stoicflame closed this May 22, 2013