What about Non Tabular Data

flyingzumwalt edited this page Feb 19, 2016 · 1 revision

Question:

Is there work underway to do this same thing for other non-binary data serializations? Tabular data is great for a lot of stuff that I'd stuff in a dataframe, but I'd also be interested in the same for textual data, json, xml, and even well-defined binary (or binary-ish) data representations (think Pickle or the emerging Parquet / Arrow work from Apache)...

Or is the thinking that all of these can boil down to tabular/columnar data if you take the time to apply RDBMS strategies?

Answer:

Between jawn and the main dat project, we've got you covered. For now, we are treating tabular data as a special use case. If you're not interested in the particulars of tabular data and tracking row-by-row changes, you don't need jawn and should go to the main dat project.

The main dat project is focused on version control for arbitrary directories of files. It uses hyperdrive to do that. Jawn is a sibling to hyperdrive, one that is specifically for tabular data. The choice between jawn and hyperdrive depends on whether you have tabular data and want to track row-by-row changes. With jawn the vocabulary of changes will match the kind of changes that happen in tabular data. With hyperdrive, the vocabulary of changes is more generic.

Jawn and hyperdrive are two different ways to break your data into 'blocks' and write them into hypercore. Jawn is just for tabular data and represents each row as a block, allowing you to track row-by-row changes. By contrast, hyperdrive is for any type of files (including tabular data). Hyperdrive tracks changes to those files in ~20kb chunks (for details read the hyperdrive spec), so it's able to track any type of changes to any type of file but it does not know anything about the nature of the changes.
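The contrast between the two block strategies can be sketched in plain Node.js. This is illustrative only, not the real jawn or hyperdrive API; the function names and the 16-byte chunk size are made up for demonstration (hyperdrive's real chunks are ~20kb):

```javascript
// Two ways to break the same data into blocks before writing to a log.

// Row-wise (jawn-style): each table row becomes its own block, so a
// change to one row touches exactly one block.
function rowBlocks(csv) {
  return csv.trim().split('\n').map(row => Buffer.from(row));
}

// Fixed-size (hyperdrive-style): the file is cut into equal-sized
// chunks with no knowledge of row boundaries, so a change anywhere
// only identifies which chunk changed, not which row.
function fixedBlocks(data, size) {
  const buf = Buffer.from(data);
  const blocks = [];
  for (let i = 0; i < buf.length; i += size) {
    blocks.push(buf.slice(i, i + size));
  }
  return blocks;
}

const csv = 'id,name\n1,ada\n2,grace\n3,linus';
console.log(rowBlocks(csv).length);       // one block per line (4, incl. header)
console.log(fixedBlocks(csv, 16).length); // chunk count depends only on byte length
```

The point is that with row-wise blocks, the block boundaries mean something to the application, while fixed-size chunks are purely a storage concern.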

Because jawn and hyperdrive are both writing the data into hypercore, there are many opportunities to use these two modules together down the road, but for now we're treating tabular data as its own specialized case.

In the case of more structured data in formats like XML, JSON and RDF, if you want "object-level" or "row-level" change tracking, the best approach would be to break those data down into related tables of rows & columns - as you say, take the time to boil them down to tabular/columnar data and apply strategies reminiscent of an RDBMS. To support this, the dat beta allowed you to have multiple 'datasets' (aka. tables) in a single dat repository. Jawn will support this too (see Issue #14).

Note, however, that unlike a SQL database, you would not need to worry about things like denormalizing data. Also, unlike a SQL database or a triplestore, dat/jawn is not aware of the relationships between content. While it does have unique identifiers for each row, it does not have a notion of a 'foreign key'. If one of your columns contains identifiers that point to other rows in your dataset, jawn will not be aware of it. Making that connection between rows is entirely up to you. We are just providing a stable way to track changes to the data over time and to sync those changes across participating hosts.
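Since that connection is up to you, a join is just application code. A hedged sketch, with illustrative row shapes and column names that are not a jawn convention:

```javascript
// Because jawn has no notion of a foreign key, joining rows by an
// identifier column is something you write yourself.
const authors = [{ id: 'author-1', name: 'Ada' }];
const posts = [
  { id: 'post-1', title: 'Hello', author_id: 'author-1' }
];

// A plain lookup table stands in for what a SQL engine would do with
// a JOIN on a declared foreign-key relationship.
const byId = new Map(authors.map(a => [a.id, a]));
const joined = posts.map(p => ({ ...p, author: byId.get(p.author_id) }));

console.log(joined[0].author.name); // 'Ada'
```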

With textual data, if your primary goal is just to track and replicate those text files, you can already do that with dat/hyperdrive. If you want to track line-by-line changes, there might be advantages to using git rather than dat, since that's the main thing git was created for. Alternatively, someone could write a dat module that tracks text files line-by-line like git rather than doing hyperdrive's default of tracking changes in ~20kb chunks. In that case you would have jawn for tabular data, another module for text data, and hyperdrive for everything else.
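The hypothetical text module described above could be sketched like this: one block per line, so a one-line edit changes exactly one block. This is not an existing dat module, just an illustration of the idea:

```javascript
// Break a text file into one block per line (as git's diffs are
// line-oriented) instead of fixed-size byte chunks.
function lineBlocks(text) {
  return text.split('\n');
}

const v1 = lineBlocks('line one\nline two\nline three');
const v2 = lineBlocks('line one\nline 2\nline three');

// Comparing the two versions block-by-block isolates the edit.
const changed = v1.filter((line, i) => line !== v2[i]);
console.log(changed); // ['line two']
```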