This repository has been archived by the owner on Nov 7, 2018. It is now read-only.

Open standards for data schemas and serialization formats #161

bsweger opened this issue Aug 27, 2015 · 45 comments

Comments

@bsweger
Contributor

bsweger commented Aug 27, 2015

Opening this issue to track thoughts on open source data schemas/standards/formats.

General Questions

  • How are people in the world implementing open-source data standards/schemas and moving data around?
  • Do special considerations exist for (non-PII, non-confidential) federal spending govt. data standards that distinguish this space from data standard implementations in other industries?
  • How important is it to build validations/metadata into a schema versus having a separate validation process and metadata definitions?
  • Related: what are the pros and cons of building a data input/validation process tied to a particular schema syntax vs. de-coupling syntax from validations/metadata?

Questions for specific data standard/schema options:

  • How robust is its open-source ecosystem? Has the open source community developed tooling around it? If so, how actively are the tools updated/maintained, and how large is the contributor base? Will users incur licensing fees or get locked into a vendor?
  • How widely adopted is the data standard/schema? Is it limited to a particular domain?
  • How easy is it for data submitters/producers (both biz and tech) to use?
  • How easy is it for data consumers (both biz and tech) to use?
  • What resources exist for non-coders to understand the data standard/schema? Is that even relevant if good, robust metadata exists (see above)?
  • What's the process for updating and tracking changes to the schema/standard?

These are schemas/standards/formats we know about. What are others?

@bsweger
Contributor Author

bsweger commented Aug 31, 2015

Taking a first stab at sussing out what the various tools and libraries are attempting to solve. Will update as I learn more.

Name | Defining Structure/Type | Structure is Extensible | Data Serialization | Data Validation | Human-Friendly
--- | --- | --- | --- | --- | ---
XML/XBRL | y | y | y | |
Protocol Buffers | y | ? | y | y | y
JSON | | | y | |
Avro | y | y? | y | y | sort of
CSV | | | y | |
Kwalify (YAML + JSON) | y | y | y | y | y
XML/NIEM | y | y | n? | y | y
OKFN/CSV + JSON | y | ? | n | n | y

@bsweger
Contributor Author

bsweger commented Aug 31, 2015

During the past few sprints, we've been working on converting this pilot's data to XBRL, using the schemas posted here: http://fedspendingtransparency.github.io/data-exchange-standard/.

The learning curve is steep, and the open-source XBRL ecosystem is not especially robust.

The open-source XBRL GUI we're playing with: Arelle

The Python library used to create classes based on the schema files: generateDS. We had to install it manually (the PyPI version wasn't current, and pip install from Bitbucket didn't work), which seems a bit janky.

Neither of these tools seems to have a large community or be actively updated. I'd be reluctant to rely on them.

Is XBRL an example of an open standard that's too complicated to use without commercial tools? What are tools/best practices used by others in the open source community for converting data to XML/XBRL?

@harrisj
Contributor

harrisj commented Sep 8, 2015

This is great, and I'm glad that someone is thinking of this! I have a few initial thoughts to add in here now. I apologize for how scattered these messages are, but this is a surprisingly broad topic. I almost wonder if there is a better format we should be using for discussing this topic in public. Anyhow, some points for now in no particular order:

  1. I think I would like to consider the utility of any data standard in terms of three distinct audiences:
    • biz/tech in the relevant domain (e.g., financial customers with an active interest in the data)
    • technical users who aren't necessarily domain experts (e.g., civic coders, information designers, data journalists)
    • nontechnical members of the general public who might read script-produced but human-readable extracts of the data

As a former data journalist, I have my own biases here, but it strikes me that heavily convoluted schemas tend to favor the first audience and disadvantage the second. And I feel they tend to disadvantage the third, because they make it harder for news organizations and other civic groups (treasury.io, for instance) to make complicated information more understandable to the general public.

  2. A lack of internal data validation is not necessarily a dealbreaker for any given format. Many existing validation frameworks were essentially developed so that serialized data structures could be ingested into strongly-typed languages (e.g., XSD, Protocol Buffers), and that specific validation is either ignored by dynamic languages that consume the data or can be emulated with explicit checks. XBRL does provide a complicated mechanism for specifying relationships like formulas between XML elements in the schema definition, but this is also something that could be explicitly enforced in simple dynamic languages with more straightforward assertions expressed in a style similar to unit testing (a minimal sketch of that style follows this list).
  3. I would like to consider the CSV format within this discussion as well. It admittedly does not include a machine-readable schema definition or native type representations, but CSV does have an advantage in that it is easily parsed and can even be opened up within Excel or other spreadsheet programs by non-technical users to verify data. This is not true of any of the other formats here.
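To make the assertion style in point 2 concrete, a minimal sketch in Python; the field names and the rules themselves are invented for illustration:

    # Hypothetical post-load checks written like unit-test assertions;
    # field names and thresholds are made up for illustration.
    def check_record(record):
        assert isinstance(record["obligation"], (int, float)), "obligation must be numeric"
        assert record["outlays"] <= record["obligation"], "outlays should not exceed obligation"
        assert 2000 <= record["fiscal_year"] <= 2030, "implausible fiscal year"

    for rec in [{"obligation": 1000.0, "outlays": 750.0, "fiscal_year": 2015}]:
        check_record(rec)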

@harrisj
Contributor

harrisj commented Sep 8, 2015

One other thing that occurs to me literally after I submitted this last comment: even the most comprehensive level of validations (whether expressed as machine-readable schema relations or programmer-written post-loading tests) will often miss some caveats/issues that are about how the data is collected or how it should be interpreted by end users:

  1. For instance, I have seen data sets where, for legacy reasons, a special numeric code like 99999 is meant to be interpreted as null and not a number. Or a field is mostly numeric but is sometimes sent with a string like N/A or NULL instead (a small cleaning sketch follows this list).
  2. The data set might not be comprehensive and might be missing the contributions from certain agencies or states.
  3. The data might cover a different interval of time or use a different unit than the user is expecting (only 9 months instead of a year, cents instead of dollars).
  4. The data itself might be approximate, or measure something different than users expect, or have other restrictions that make it not suitable for certain applications. For instance, using a database of reports about acts of terrorism as a measure of actual acts of terrorism.
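As one illustration of the first caveat, a sketch using pandas; the sentinel handling, file name, and column name are all hypothetical:

    import pandas as pd

    # Treat the legacy sentinel 99999 and placeholder strings as missing values
    # rather than real numbers; file and column names are hypothetical.
    df = pd.read_csv("spending.csv", na_values=[99999, "N/A", "NULL"])
    print(df["amount"].isna().sum(), "values were sentinels or placeholders")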

I'm sure there are other cases, but I guess the tl;dr point I want to make is that even the most automated validation mechanisms will not remove the necessity of creating documentation for this data and the requirement of users to read it. In some cases, automated validation might make things worse by making people assume that programming errors are caught automatically...

@ddmcd

ddmcd commented Sep 8, 2015

It seems logical to me to consider the overall manageability of a schema in terms of what needs to be done to make it usable to the (inevitably different) groups of users who benefit directly or indirectly from the schema's use. As @harrisj points out, not all users will be "domain experts" in terms of process or technology. Also, whether a schema is "open source" may be secondary to whether the schema is known and understood widely enough to support (or require) the development of maintainable tools for its management and use. I've seen instances where the lack of tools to support schema adoption has significantly slowed standards adoption, and I hope that is not the case here.

@stevage

stevage commented Sep 9, 2015

I don't see OpenSpending in that list. Their approach is basically:

It's currently being overhauled (a victim of its own success, they're reaching the limits of the current architecture), so you could consider helping work towards the new version.

@bsweger
Contributor Author

bsweger commented Sep 9, 2015

@harrisj Thanks for weighing in. Yeah, this is a big topic to tackle in a GitHub issue 😁

Your audience breakdown is particularly helpful. And to your point that "heavily convoluted" schemas favor some groups of users over others, it seems that the opposite is not true. In other words, if a simpler schema + explicit validations could accommodate the second and third groups, the first group doesn't lose anything.

I added csv to the table up here. You're right--it should be there.

@bsweger
Contributor Author

bsweger commented Sep 9, 2015

@ddmcd Good point about a lack of tools slowing a schema's adoption--thank you. Do you have a preferred way of expressing a data standard/schema?

@bsweger
Contributor Author

bsweger commented Sep 9, 2015

@stevage Thanks for the pointer to OpenSpending. The use of CSVs as the lowest common denominator for transmitting data, plus JSON for structure, is intriguing. The CSV + structure approach is similar to the one we're using in this project's prototype, except that our mapping is done internally before data is transmitted, so there's no external insight into how it was done. Some great stuff here to think about.

@shawnbot

shawnbot commented Sep 9, 2015

FWIW, the W3C has been working on CSV standards for the web for a while. I haven't dug too deeply into the metadata stuff, so I can't speak to its usefulness here. But I would like to register my support for CSV generally, because it's the most "backward-compatible" format (and still one of the most compact representations) out there.

@harrisj
Contributor

harrisj commented Sep 9, 2015

Extensible is a confusing column for me, because that could mean a variety of things. I'm assuming it refers to the ability of users to add their own additional columns beyond what is provided in the base data file. It seems like there are some caveats for what is there:

  • XML is indeed the eXtensible Markup Language, but I think adding new fields to an XBRL file requires the coordinated step of defining a proprietary XSD schema that extends the base schema and hosting it in a location that can be accessed by third-party tools that need that schema to validate your XML. For some users, this might also mean they have to recompile header files or other programming structures so they can deserialize your changed XML (and they might have some type collisions if also trying to load non-extended XBRL with unextended types in the same namespace). In short, it's complicated.
  • JSON is indeed extensible in that it is possible to add other fields to the existing structure. There is no inherent schema in the JSON that would have to be modified (unless we decided to adopt a standard for that), but the challenge is that without proper naming, it's hard to distinguish which fields are in the original unextended structure vs. which were added as part of the extension.
  • Similarly, extending a CSV involves merely adding some columns to the right of the existing table. This would be compatible with existing loaders that refer to columns by position or name, since those would simply ignore the new extensions, but there is the same naming issue that exists with JSON records. In addition, it gets a bit ugly if we are trying to merge two distinct extensions in a new file. Which extension's columns come first? How do we handle name collisions? (A small naming sketch follows this list.)
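One sketch of a naming convention that keeps extension fields distinguishable; the prefix and field names are invented for illustration, not part of any existing standard:

    # A base record plus extension fields marked with an agreed prefix so
    # consumers can separate core fields from additions; names are hypothetical.
    record = {
        "award_id": "ABC-123",
        "amount": 50000,
        "ext_agencyx_program_office": "Office of Examples",  # extension field
    }
    core = {k: v for k, v in record.items() if not k.startswith("ext_")}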

I guess I have one other question about these formats and their integer/numeric and date types. Do they use binary representations? Text with some sort of explicit type declaration? Or just plain text with no explicit typing?

@HerschelC

@bsweger I think that restricting yourself to the use of only open-source tools for working with the data is a challenge. Open source is great where it fits, but falls flat where it doesn't, which is often in areas without a very large/broad user base with an interest in building and maintaining the open-source tools. I don't think the community in this space is that large, and I think the community doesn't really care whether a tool is open source or proprietary so long as they have a tool for the job. I too have spent time looking for open-source tools related to XBRL and have found few/none. My next thought, still just a thought, was to see if I could co-opt an open-source, XML-ingesting ETL tool for the purpose.

One other thing that you may want to consider in your analysis is the availability of learning materials and opportunities in the ecosystem. XBRL, for example, has a global organization behind it with conferences and such. There is an entire track devoted to academia, so there is research. I'm not extremely XBRL-fluent myself, having learned mostly from my interactions related to the DATA Act. It's essentially just an extended XML schema, though, right? With a governance process to manage it.

Beyond the technical, something can be said for using a standard already in use by other similar entities (government or financial) for similar purposes, because all that use builds an ecosystem (open is nice, but getting the job done is better). For example, consider automated compliance rules for regulators. Sure, the rules could be hand-coded over and over to check the data, but I'd think building the rules into the schema would be more efficient than building them into code when it comes to re-use of the rule elsewhere. I'm thinking rule re-use and easy broad adoption, not code re-use. Rule re-use enables broad adoption and agreement on the rules through some sort of governance and compliance mechanism. A final thought: changing the model is often less impactful from an IT perspective than changing the code that runs the model (so long as the code truly uses the model and metadata vs. hard coding). Again, I'm still diving into the topic in my spare time and just sharing my thoughts. I wish I had more time to dig deeper! I look forward to the outcome!

@catherinedevlin

I don't know much about it yet, but kwalify applied to YAML or JSON seems to meet:

Defining Structure/Type: y
Structure is Extensible: y
Data Serialization: y
Data Validation: y
Human-Friendly: y

@HerschelC

I think NIEM would be another option to consider. It's less technical and more process-oriented, but something already well-known in the government data exchange space. I'm again leaning towards the governance structure and process around the exchange protocol being more important than the technical details under the hood.

@rufuspollock

At Open Knowledge we've been working on very simple open source methods for shipping around data under the heading of "Data Packages" for some years now:

@bsweger
Contributor Author

bsweger commented Sep 10, 2015

@catherinedevlin @HerschelC @rgrp above table has been updated with your suggestions--thank you!

Let me know if I got anything wrong, and apologies for the somewhat awkward format of jamming all of this into a GitHub thread. The input is much appreciated!

@ryanbharvey

You might also take a look at Microformats: http://microformats.org/.

@ryanbharvey

Also, RDF (http://www.w3.org/RDF/) and RDFa (http://rdfa.info/).

@newcomb-d

You might want to look at the Open Geospatial Consortium (OGC), http://www.opengeospatial.org/, a standards body specializing in open standards for geospatial data. A list of the current OGC standards can be found at http://www.opengeospatial.org/standards/is, and a quick read through the reference model, http://rap.opengeospatial.org/orm.php, might be useful.

My personal opinion on open standards is that it is not an open standard unless there is an open source reference implementation of the standard.

OGC has several standards, but a few that might be useful to look at would be the GeoPackage data format, http://www.geopackage.org/spec/, the Web Map Service (WMS), the Catalogue Service (CSW), and the Sensor Observation Service (SOS).

@JJediny
Member

JJediny commented Sep 10, 2015

Having dealt with managing/converting hundreds of data formats and services over the last few years, I see notable advantages in what @harrisj and @stevage mentioned. I think the thread is hard to digest because open source tools are not being distinguished from open data formats/services/API standards. Can someone clarify/delineate the discussion? These are different creatures...

Focusing on the data itself (i.e. standards/specs/formats): I would highly advise against using an XML-based standard, and I would caution against using CSV as a data exchange format:

Avoid eXtensible Markup Language (XML)

<Parent_Variable type="DataType">
  <Child_Variable type="DataType">Value</Child_Variable>
</Parent_Variable>

XML is really NOT that extensible in practice, because it derives data relationships from nesting data in a hierarchy (i.e. indentation). Adding new fields or remapping the hierarchy isn't impossible, but it becomes very impractical for these reasons:

  • Rendering/parsing XML requires that both parties (data producer and consumer) conform to the same schema; any remapping has to be done on both sides of the fence, which means reaching new consensus on changes and updating the actual systems that exchange the data.
  • It requires data conformity: the data types for each field (i.e. text string, boolean, float, integer) must maintain their types.
  • It's mind-numbing for a human to read/interpret.

In other words, there is too much rigidity in the structure/data types. Rigidity is nice if things never change, but they always will.

Avoid Delimited Formats (CSV) for Data Exchange

Data formats where data is stored as text with a special character (colon, comma, semicolon, tab, pipe, etc.) delimiting each value, like CSV, are 'file' formats convenient for download but are NOT useful/practical for data services/exchanges/APIs, because the whole contiguous file needs to be downloaded before it can be opened (i.e. you cannot stream portions or individual data entries separately from the whole dataset). There can also be issues maintaining the data structure (i.e. some rows have more columns or don't match the correct column headings) if values in a field contain the same characters that are supposed to delimit those values.

Use JSON as the Data Exchange format

Because everyone already does and because it doesn't have the issues above ;)
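For contrast with the XML snippet above, roughly the same nested record expressed as JSON; the names are just illustrative:

    import json

    # The parent/child structure from the XML sketch above, expressed as JSON.
    record = {
        "parent_variable": {
            "type": "DataType",
            "value": "Value",
            "child_variable": {"type": "DataType", "value": "Value"},
        }
    }
    print(json.dumps(record, indent=2))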

@newcomb-d

While a well documented binary format may not be human readable, it sure saves on bandwidth and storage!

@bsweger
Contributor Author

bsweger commented Sep 10, 2015

@JJediny asked...

I think the thread is hard to digest because open source tools are not being distinguished from open data formats/services/API standards. Can someone clarify/delineate the discussion? These are different creatures...

You're right--there's a lot happening in this thread. A focus on the open data formats/services/standards rather than tools is the intent. Thanks for your great points!

@newcomb-d

ICA Commission on Geoinformation Infrastructures and Standards
SDI-Open 2015 meeting proceedings are available for download at http://sdistandards.icaci.org/

@jstekervetz

I would argue that NIEM (not NEIM) is Human-Friendly (and should be "y"), given a rich set of requirements around developing conformant information exchanges (i.e. IEPDs) that focus on required technical schemas AND business documentation artifacts for reuse. Reuse is only promoted via a human-friendly understanding of the exchange.

The NIEM model itself has open source tools that make searching model content easy and offer type and element definitions to aid in understanding content meaning. Lastly, a UML profile for NIEM exists to aid business users with understanding content relationships.

Data Serialization should be "y" as XML instances are created from NIEM-conformant XML schemas with many implementation examples across industry, federal, state, local and international governments.

@philipashlock
Member

I suggest renaming this issue to: Open standards for data schemas and serialization formats

  • Let's use the Bruce Perens definition of open standard which is widely used and even included in some government policies.
  • Let's define a vocabulary as a controlled set of terms with agreed-upon meaning. Vocabularies can be independent of schemas and serialization formats, and they often are, but they depend on an agreed-upon schema and serialization format to enable machine-to-machine interoperability.
  • Let's define a data schema as a way of representing terms in the vocabulary using specific data types and a way of representing the relationships between the terms. Schemas can be independent of serialization formats, but are usually tied closely to one.
  • Let's define a serialization format as a standardized machine readable syntax and file format for encoding data. This is often done using a set data schema, but not always.

When talking about open data standards, it's usually assumed that the standard specifies not only the vocabulary but also the schema and the serialization format, even if multiple serialization formats are supported for a given schema. If you look at the schemas on schema.org you can see some good examples of this breakdown. These schemas are made out of a broad vocabulary of terms (often reusing established vocabularies like Dublin Core) which are then assembled into different schemas. Look at the Government Service schema as an example and at the bottom of the page you'll see examples of different serialization formats to encode the schema including JSON, and HTML microdata. NIEM tries to take a similar approach as Schema.org for defining a vocabulary and schemas for use in government that are independent of serialization format. Originally the main serialization format for NIEM schemas was XML, but they can also be represented in JSON and other formats. For what it's worth, I think NIEM and the DATA Act should aspire to the simplicity of the documentation of schemas on schema.org. I don't think it's acceptable to only publish the complete schema documentation as Word documents and XSD files and it's almost impossible to find the actual terms and schemas on the NIEM website.

The openness question with data standards is somewhat related to open source software, but it's helpful to keep them separate. It is true that many standards efforts require an open source reference implementation in order to consider it an open standard, but it's also possible to have an open standard (using the Bruce Perens definition) without there being usable (or widely used) open source tools. Proprietary data standards on the other hand often prevent the existence of open source tools to serve them in the first place (without reverse engineering or license fees). In government, we're typically required to use open standards (see A-119), so that shouldn't really be a question. I think the question you're asking is what open standards have robust ecosystems of developers and tools that support them, including open source options.

From what I can tell, it looks like you've already established most of the vocabulary as seen on the data elements page but as far as defining a schema and serialization to enable machine readability and machine interoperability of the vocabulary, what we currently have is an example of how this could be done using XBRL as seen on the data exchange page.

I haven't spent much time with XBRL, but my understanding is that it is a data standard in a way that encompasses vocabulary, schema (XSD), and serialization formats (XML), and the example we have now extends XBRL with the terms that have been defined for the DATA Act vocabulary. It appears that there's a lot of activity around XBRL and ongoing updates to the standard (which dates back to 2003), but it's not clear to me whether this legacy is helpful or not. It seems like XBRL is viewed as complex, cumbersome, lacking a robust ecosystem of tools and developers, and perhaps not widely used by relevant audiences. That said, usage of XBRL in other areas may warrant XBRL-based interoperability with DATA Act data. It doesn't necessarily have to be one or the other, but it's clear that there are some issues with an XBRL-only approach.

The opposite extreme in terms of simplicity and a robust ecosystem of tools and developers is the CSV. One of the things that makes CSVs such a great format is that they're incredibly easy to use for both developers and non-developers, and they probably have the broadest range of tooling available. The downside to CSVs is that they don't provide a schema, which is needed not only for validation and ensuring interoperability, but also for helping developers understand the data. Fortunately, the recent aforementioned efforts like Tabular Data Package (hat tip @rgrp) look like they're doing a great job of addressing this by combining CSVs with a schema defined in JSON using JSON Table Schema (which builds on JSON Schema). @shawnbot also mentioned the W3C's CSV work; the Metadata Vocabulary for Tabular Data spec is actually based on JSON Table Schema and shares authorship, but it seems that JSON Table Schema is more actively used and developed at this point.
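For concreteness, a minimal sketch of what a datapackage.json descriptor along these lines might look like; the resource and field names are invented, and the spec itself is the authoritative reference:

    import json

    # A minimal Tabular Data Package style descriptor: a CSV resource plus a
    # JSON Table Schema describing its columns; names are invented examples.
    descriptor = {
        "name": "spending-example",
        "resources": [{
            "name": "awards",
            "path": "awards.csv",
            "schema": {"fields": [
                {"name": "award_id", "type": "string"},
                {"name": "recipient", "type": "string"},
                {"name": "amount", "type": "number"},
            ]}
        }]
    }
    with open("datapackage.json", "w") as f:
        json.dump(descriptor, f, indent=2)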

The Tabular Data Package seems like a solid candidate for the canonical data schema and serialization, but once you have that it's fairly trivial to generate derivatives in other serialization formats, including JSON or XML. As mentioned earlier, the OpenSpending Budget Data Package Standard uses this approach, but so does the Open Contracting Data Standard. Remember the Schema.org Government Services schema I mentioned earlier? The OpenReferral project is also building on this as a standard with the Human Services Data Specification, which uses Tabular Data Package as well.

While Tabular Data Package is still fairly new, it does already seem to come with good tooling (e.g. OKFN's Good Tables validation service/code) and the fact that it builds on existing well established ecosystems for CSVs and JSON Schema is also quite encouraging.

I do think it's critical to provide a machine readable schema for validation that's code-independent if you want to foster a robust ecosystem of tools that support machine to machine interoperability. The Project Open Data schema is defined using JSON Schema which makes it much easier for multiple codebases to stay in sync whenever changes to the schema are made. This also makes it possible for others to implement validation so we don't all have to depend on one central validation service (though it is important to provide a public one - as Data.gov does). Several agencies have used the canonical JSON Schema for POD Schema within their own codebases and even within Data.gov we use the same schema file with several different code bases in multiple programming languages.
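As a sketch of what code-independent validation can look like in practice, using the jsonschema package in Python against a schema document that any codebase could share (the record fields and constraints here are invented):

    from jsonschema import validate, ValidationError

    # The same schema document can be loaded by any implementation;
    # field names and constraints here are hypothetical.
    schema = {
        "type": "object",
        "properties": {
            "award_id": {"type": "string"},
            "amount": {"type": "number", "minimum": 0},
        },
        "required": ["award_id", "amount"],
    }

    try:
        validate(instance={"award_id": "ABC-123", "amount": 50000}, schema=schema)
    except ValidationError as err:
        print("invalid record:", err.message)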

Once you have the basic schema and serialization in place (e.g. Tabular Data Package), you probably want to think of more sophisticated ways to make the data available such as a RESTful API that returns queries in JSON, but I think it's important to start with raw bulk data interoperability before we start thinking about standard ways of exposing that as an API. Fortunately, some of the components may be reusable (e.g. you could reuse the JSON Schema definitions from JSON Table Schema for your API standard definition in Swagger).

One other important consideration that I haven't seen discussed much is file size and transmission rate. Protocol Buffers are probably more valuable when bandwidth reduction, realtime transmission, and parsing efficiency are important, but the serialization format is relevant for large files even when you're not dealing with a network. Stream processors for both JSON and XML are widely available, but they're always going to be more complicated than streaming a CSV, which can just be read line by line. Efforts like JSON Lines are trying to make this easier for JSON, but in most cases (if you're not using a special stream processor) you have to read an entire JSON or XML payload into memory before you can parse it, which can be much more limiting than parsing a CSV line by line.
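A small sketch of that difference in Python (the file names are hypothetical): line-oriented formats can be processed record by record, while a plain JSON document is typically parsed whole.

    import csv, json

    # CSV and JSON Lines: one record per line, roughly constant memory.
    with open("awards.csv", newline="") as f:
        for row in csv.DictReader(f):
            pass  # process each row as it streams by
    with open("awards.jsonl") as f:
        for line in f:
            record = json.loads(line)  # one JSON object per line
    # Plain JSON: the whole payload is read and parsed into memory at once.
    with open("awards.json") as f:
        data = json.load(f)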

Normalization versus de-normalization is also worth considering. JSON and XML are often used to represent de-normalized relational data whereas a Tabular Data Package would probably be normalized across multiple CSVs just like multiple tables in a relational database. As a bulk data format, my guess is that most people are going to load this data into a spreadsheet or database anyway, so I would suspect normalization is what you want. Plus, normalization reduces file size. Maybe once you're at the API stage de-normalization will be more appropriate.

@47ronin

47ronin commented Sep 11, 2015

I vote YES for an API that allows [1] a parameter to specify a response in either JSON or XML and [2] a JSON response with support for a callback parameter, which should allow robust responses depending on the requesting application/framework.

@lmeyerov

Financial data gets big quite quickly (millions/billions of transactions, time points, people, ...), so I'm pretty confused why only one format relevant to nation-scale data is considered.

@lmeyerov

(Most of those formats seem more relevant for messaging, and poor for sharing anything sizeable or allowing compute over it.)

@zzolo

zzolo commented Sep 11, 2015

+1 to what @philipashlock said.

It's also pretty dependent on the data and the audience(s). Here's my general, straight-to-the-point thoughts on opening up data:
https://github.com/zzolo/so-you-want-to-open-some-data

@wesm

wesm commented Sep 12, 2015

From the Python and Hadoop side of things, I can attest that CSV and JSON alone are problematic from a tooling and fidelity-of-analysis point of view as canonical formats. JSON is acceptable as long as an official schema is published so that the JSON data can be transformed into a binary format like Avro or Parquet for analysis. For larger datasets, Parquet is attractive because it can be analyzed efficiently with Spark, Hive, Presto, Drill, or Impala without need for ETL into columnar format.

Making data exportable in Avro has become an increasingly popular choice — Google BigQuery utilizes it as one of their binary export formats.

So, I would recommend making JSON and a binary-format with a built-in schema (e.g. Avro) both available to data consumers.
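A sketch of that kind of pipeline using pandas and pyarrow for Parquet; the file and column contents are hypothetical, and Avro would be analogous with a library such as fastavro:

    import pandas as pd

    # Published JSON (with a known schema) converted to a columnar binary
    # format for analysis; requires pyarrow or fastparquet to be installed.
    df = pd.read_json("awards.json")
    df.to_parquet("awards.parquet")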

@jreadey

jreadey commented Sep 12, 2015

What are the thoughts on HDF5 as a binary format? It supports @bsweger's attributes, with the exception, I guess, of "Human Friendly". One aspect of HDF5 that may be useful is that it supports object relationships that aren't strictly tree-based (i.e. the collection of HDF5 groups may contain cycles).

I've been working on a JSON representation of HDF5 (and related tools). See: https://github.com/HDFGroup/hdf5-json.

@wesm

wesm commented Sep 13, 2015

@jreadey in general I've found interest / development in tools based on HDF5 has dropped off considerably in the past 5 years. Notably PyTables was put up for adoption by its creator.

IMHO: the biggest issue with HDF5 is that it requires a file pointer and the ability to do random access anywhere in the file. This means you can't access data in a file without having all of it (vs. formats developed for Spark / Hadoop which all support streaming reads).

@jreadey

jreadey commented Sep 13, 2015

On the other hand, I'm not sure what other binary format is poised to replace HDF5.

I'm working on the file pointer aspect. :) ... a REST service that exposes PyTables-style queries. E.g: this url: https://data.hdfgroup.org:7258/datasets/651e787e-16a6-11e5-8632-06fc179afd5e/value?query=temp%3E63&host=compound.test.data.hdfgroup.org pulls rows where the 'temp' field is greater than 63.

@rufuspollock

@wesm have you seen the Tabular Data Package? It may help solve the fidelity issues with CSV for you (basically by providing a mechanism to add schema/type metadata around the CSV). There's currently quite a bit of work underway to integrate Tabular Data Package into other systems, ranging from R to BigQuery.

@kaitlin
Contributor

kaitlin commented Sep 14, 2015

For this particular case of the DATA Act reporting, there are two distinct use cases:

  • How agencies report spending data to Treasury
  • How Treasury publishes that data (more than one possible answer here)

And the responses in this thread have really focused on the latter. It seems like a very straightforward and accessible format is required for lay-audiences (probably CSV), but also a binary representation (like protobufs or avro) is desired. And perhaps JSON because of its wide adoption and use. So, I guess the answer is, (almost) all of the above :) ?

For the formats that don't include inherent type validation (CSV and JSON) I'd be curious how end users would want to consume metadata on the type and domain for different fields.

@HerschelC
Copy link

@kaitlin I would agree "all of the above" since it's just output of validated, high-quality data. It's simple enough to export into any format from a standardized schema.

I see three general segments for the publishing: people perusing the website who may request a dump to play with in Excel, people who want bulk downloads for deeper analytics, and people who want API access in a messaging-type channel (transactional, short messages).

The truly lay people would use the built-in reporting on the website (and likely an export to Excel, so CSV is good, though .xlsx is essentially a set of zipped files, which, if offered, could save on bandwidth usage).

The second set I mention are people requesting data dumps. These are generally a little more advanced and would have the necessary skills to extract from whatever format you create. This isn't the lay-person audience. I'd thus suggest something most efficient for quality data download (compressed, a binary with error checking). It'd be great to offer open source packages to load this data into whatever platform to further enable analysis. The theme for this segment is efficient data transfer: compressed, with error checking from beginning to end of transmission. I'd be curious about any benefit of using one of the binary formats suggested versus using a standard text-based format (CSV, XBRL, JSON) and then gzipping it, similar to today. I'd lean towards simplicity.

The third set of users are those that want a messaging type interface to connect an app or dashboard to the data for querying. This of course would be best-suited for JSON.

I think that the design should keep in mind that financial data for public companies is published using XBRL. It makes some sense to provide this as an option for consuming data since those in this analysis space would already have the tools to consume the government data. Again, output options should be "all of the above" if possible. The key is that the data, in whatever format, is validated BEFORE it's output. So the output format has little meaning - it's more for convenience if the data is truly of high quality. If it is not, then the consumers of the data would want something more robust to allow them to validate the data - like a schema that ensures quality of output. So much government open data now is just a push of data with little regard to quality. Financial data is easy to reconcile (all debits and credits that balance to zero, right?). I'm hoping this effort will be a gold standard for data quality in government open data. The best way to ensure quality output would be to use a schema that enforces the quality, IMHO.

To answer your last question: simple reference data sets for the metadata, also offered in a plethora of formats, would be fine. This is akin to what we would use when applying DQ checks to the data in an ingest process. We'd bounce the data against the reference table to ensure it fits within the expected parameters. If not, flag it as a defect.
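A sketch of that kind of reference-data check; the reference codes, field name, and file name are all invented:

    import csv

    # Bounce each record against a reference set of valid agency codes and
    # flag anything outside it as a defect; values here are hypothetical.
    valid_agency_codes = {"012", "097", "570"}
    defects = []
    with open("awards.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["agency_code"] not in valid_agency_codes:
                defects.append(row)
    print(len(defects), "records failed the reference-data check")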

Finally, there are also historical changes to data to consider related to this conversation on data outputs. Some consideration should be given to how this change in data is handled over time. Directly changing it loses the value of the data as it was. Knowing what changed could itself be of value. For example, last week someone produces a report on data from 2013. This week that data is updated by the agency. I'm looking at the report that was done last week and the data doesn't match the report any more. Why? The answer should be that X agency updated the data and I would be able to reconcile the old report based upon this change in data. Changing data without capturing the fact that it changed can lead to a perception of poor quality data.

@roomthily

@kaitlin, regarding CSV and JSON, that could be aided by using the CSVW and JSON-LD specs. It doesn't solve the schema problem, but there's really no way to avoid defining a schema for this kind of output anyway. My understanding is that the CSVW group (and others) are trying to ensure that JSON-LD, CSVW, Table for the Web, etc., are aligned, so I suspect that once you have the schema for one, you either have a schema for the other or are very close.
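For reference, a minimal sketch of CSVW-style metadata for a CSV file, written out from Python; the file and column names are invented, and the W3C spec is the authoritative source for the vocabulary:

    import json

    # A minimal CSVW "metadata for tabular data" document describing a CSV.
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "awards.csv",
        "tableSchema": {"columns": [
            {"name": "award_id", "datatype": "string"},
            {"name": "amount", "datatype": "number"},
        ]}
    }
    with open("awards.csv-metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)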

I'd say +1 for including one of the interoperable standards in addition to any of the data streaming options. If nothing else, I can interact with those outputs without any special tooling if I want/need to, so it's a pretty low bar compared to HDF5/Protocol Buffers/Avro/etc. And it makes that consideration in a more language-agnostic way (Python tooling is great, but Python is not the only language option).

@scopatz

scopatz commented Sep 15, 2015

@wesm I think the PyTables comments miss the mark a little. We have done a ton of work and a lot of innovative things with PyTables since it went from being Francesc's baby to a community code.

@wesm

wesm commented Sep 15, 2015

@scopatz that is a fair criticism. I am seeing little HDF5 interest in the BI / analytics / enterprise data management ecosystem which is where I've been focused the last few years. That doesn't really reflect what's been going on in scientific computing, though.

@JJediny
Member

JJediny commented Sep 17, 2015

Required watching for this thread:
http://www.ukauthority.com/Live/31-August-2011-Open-Standards-Future-of-ICT.aspx

+1 to close this ticket, as it's a bit too unwieldy to be productive. I'd suggest (politely of course) either redirecting to other open issues or creating a new issue that provides a means to be most productive for those interested in contributing, of which I hope to be one.

@scopatz

scopatz commented Sep 17, 2015

@wesm Yes, I agree. The two domains are very different and require distinct technical solutions at this point in history, IMHO.

@bsweger changed the title from "Research issue: open source data standards" to "Open standards for data schemas and serialization formats" on Sep 17, 2015
@bsweger
Contributor Author

bsweger commented Sep 18, 2015

Thank you, all, for sharing your thoughts here. It's immensely helpful to hear from people in and outside of government. Special thanks to @philipashlock for parsing the original issue into its separate components and walking through them.

@JJediny agreed with your polite request! I'll summarize my take-aways and would love your thinking on how to split this discussion into more focused pieces.

  • @philipashlock says: the original question is really about open standards that also have a robust ecosystem of tools and developers

  • several people say: a language-independent, machine-readable schema is critical, and there are many options (speaking only for myself, I favor a machine-readable schema that is also somewhat human-readable)

  • several people like that CSVs are universal and easy to use; @JJediny and @wesm point out that they are also fragile and lack fidelity

  • @kaitlin says: it's important to remember that we're talking about 2 types of data transmission:

    1. incoming data from multiple sources that needs to be validated and standardized
    2. the results of step 1: data ready for publication
    

This project is very much focused on the first type above, so that might be a good way to split this up. Thoughts?

@bsweger
Contributor Author

bsweger commented Sep 18, 2015

And can't resist one last question before we re-org this. @philipashlock suggests that validating data via a machine-readable schema has some advantages over a centralized validation service.

How do people handle complex validations in a machine-readable schema? For example, if we want to make sure that the sum of values in three different records doesn't exceed an amount in a fourth record? Are you doing those types of validations outside the schema, or do your schemas support those?

It seems like XBRL can handle those, though the implementation isn't especially straightforward.

@JJediny
Member

JJediny commented Sep 22, 2015

I think the GitHub issue could be split by drawing a clear line between some of the implicit/embedded concepts within the thread. A few potential distinctions:

  1. The Business Logic - the real-world requirement for conformity/interoperability, so that we spend less time working on the data and more time working with it. It is what developers/engineers attempt to represent/translate into logic a computer can work with, i.e. the model/schema...
  2. The Data Model - how the computer stores the data, typically in a relational/NoSQL database: the relationships/linkages between tables, the column headings, the field types (i.e. Yes/No, Date, Number, String), and the restrictions/constraints on those (no nulls, MM/DD/YYYY, significant digits, character length, etc.).
  3. The Data Schema - the map/decipher/cheatsheet between the human business logic and the machine model; it serves as our documentation between the two. It is a set of instructions to ourselves on how to derive context/understanding of how we should interpret a field/attribute of the data.
  4. The Data's Metadata - literally "data about data". Whereas the schema can be thought of as metadata for each field in a dataset, metadata (as the term is most commonly used) can be thought of as the schema for the cumulative/aggregate dataset. It is how we are meant to interpret/derive meaning from the data in the context it was produced.
  5. Data Formats/Services - a snapshot of the data in time, stored in a structured manner so that it can be transmitted (either downloaded or accessed). Formats are how the data is stored/packaged so it can be accessed/opened after transit. We often think of formats as how you download/open something, but formats are also used for transactional services (i.e. APIs, or the ability for two machines to communicate without human intervention). For example, XML and JSON can be downloaded, but they are also formats that enable APIs, because unlike downloads, services allow us to query/filter for a subset of records dynamically (i.e. in real time). Perhaps most importantly, services enable two-way communication (e.g. you access a subset of records and decide to Create/Read/Update/Delete, or CRUD, a subset of those back to the original source/database; you need two-way communication for that).
  6. Data Validation - the verification that the data is representative of the characteristics of the business logic; in other words, that the fields are being filled with the intended data and that it "makes sense" and "is accurate". Validation of the schema is one thing; validation of the data is another, as data needs to be reviewed for its quality post-transmission. Yes, we can validate that all fields match their type (i.e. integer/string/etc.) and write automated tests/rules (e.g. Sum of X should be < Sum of Y; a sketch follows this list), and if the data is machine readable then it can be validated. But while validation of the schema/metadata/data types and automated testing make it easier to dashboard many metrics, validation cannot be considered the only verification. That said, with the plethora of existing tools (FOSS or proprietary), converting/transforming nearly any structured dataset into nearly any other format is trivial, and given the speed at which these tools let us mash/drill down/visualize/analyze (i.e. derive meaning to make informed decisions), validation beyond the obvious automated testing/metrics need not be a major concern, outside of making sure each field/attribute is granular enough and isn't another equation/formula in itself.
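A sketch of the kind of automated rule mentioned in point 6, written as plain code outside the schema; the rule and field names are hypothetical:

    # A cross-record rule checked after field-level validation; the rule
    # (detail amounts must not exceed a summary amount) is an invented example.
    def check_totals(detail_records, summary_record):
        total = sum(r["amount"] for r in detail_records)
        if total > summary_record["amount"]:
            raise ValueError("detail total %s exceeds summary %s" % (total, summary_record["amount"]))

    check_totals([{"amount": 10}, {"amount": 20}, {"amount": 5}], {"amount": 40})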

The reason standards are so important is that, once they're in place, they all but erase the hardest of the issues/items above in validation/schema/format/service...

I think the best use of open standards is when they are explicitly named/required in contracts/task orders. If we can get to the point where we can insert a requirement for a two-way API using an open standard, we can ensure a greater level of integrity and reduce risk by avoiding poorly planned/executed customizations that meet the need of a customer/user at the expense of broader interoperability. This gets back to the point about the interplay between open source and open data. I believe the best way to conceptualize that interplay is to recognize that open source communities develop around the perceived need for a solution, and the fastest means to the solution is the path generally taken. Most if not all open source projects are really just a Lego creation made up of other open source components/libraries/etc., and/or a patchwork of multiple independent projects, themselves with their own communities. Open source communities therefore tend to rely on existing open standards, not for a lack of creativity, but because a standard is something they don't need to worry about (time/money), which makes a project more attractive to people evaluating options, as it is less new learning and more doing. Some examples of technical standards that encourage reuse and interoperability, standards that can be "required":

  • LDAP/Oauth for authentication
  • OpenSSH/SSL for secure communications
  • JSON/XML for API
  • GeoJSON for Storage/Archiving
  • Many Others...

When such standards are not used, however, vendors/contractors have an incentive to customize these aspects of a project. This means more resources wasted on reinventing the wheel, and it becomes more costly down the road as formats/services have to be continually customized/maintained/remapped, both in the software and conceptually (training) for those who use it or need to consume or obtain the data. This is when issues like vendor lock-in occur: when it becomes 'harder' to solve the issue than to continue what's been done... With a solid open standard, any number of vendors/FOSS projects can take a variety of approaches within their software, so long as they don't break the standard and the data model's integrity. I think if you can ensure conformity at the data exchange level, you enrich the ecosystem for open source communities and commercial vendors alike.

@marvinGitHub

I do think we need a simple interface for these operations which defines the basics, like https://packagist.org/packages/persis/common
