Comments on 'Truly Open Data' #1

Open
bkatiemills opened this issue Jul 6, 2015 · 10 comments

Comments

@bkatiemills
Owner

Please reply to this thread with your comments on this post.

@tleeuwenburg

Could you set up a BitTorrent tracker for data science data sets?

@bkatiemills
Owner Author

@tleeuwenburg I've actually always liked that idea (doesn't require centralized hardware, no software development to do, 'versioning' through hashes etc), but it never really caught on, and I never really heard a convincing reason why not.

That said, torrents are just a distribution system. Another big issue is making that data easy to use and understand once it's sitting on your hard drive, which IMO is the real barrier to open data.
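
To make the "'versioning' through hashes" point concrete, here is a minimal sketch of the idea: the version identifier is derived from the dataset's bytes, so identical data always gets the same ID and any change produces a new one. The function name `dataset_version` and the sample rows are made up for illustration, and this hashes the whole content with SHA-256; it is not how BitTorrent itself computes its info-hash.

```python
import hashlib


def dataset_version(chunks):
    """Return a hex SHA-256 digest of the dataset's bytes, used here as a version ID.

    `chunks` is any iterable of bytes objects (hypothetical input, e.g. file reads).
    """
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


# Same bytes, split differently: same version ID.
v1 = dataset_version([b"site,temp_c\n", b"A,11.2\n"])
v1_again = dataset_version([b"site,temp_c\nA,11.2\n"])

# One value changed: different version ID.
v2 = dataset_version([b"site,temp_c\n", b"A,11.3\n"])

print(v1 == v1_again, v1 == v2)
```

Anyone holding a copy can recompute the ID locally, so no central registry is needed to tell whether two copies are the same version.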

@tleeuwenburg

One answer to your question is "schemas and standards". Another might be peer-reviewed data. I wonder if part of the issue is that a lot of the open data is coming from the science world rather than the engineering world? While you're looking at the legibility of data, I do still find transport and storage is a problem.
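
As a rough illustration of what "schemas and standards" could buy, here is a small sketch that checks a CSV file against a declared schema before anyone tries to use it. The `SCHEMA` mapping, the `validate` function and the column names are all hypothetical, not taken from any existing standard.

```python
import csv
import io

# Hypothetical schema: column name -> callable that parses a cell or raises ValueError.
SCHEMA = {"site": str, "temp_c": float}


def validate(csv_text, schema):
    """Return a list of (line_number, message) problems; an empty list means the data conforms."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header must list exactly the schema's columns, in order.
    if reader.fieldnames != list(schema):
        return [(1, f"expected columns {list(schema)}, got {reader.fieldnames}")]
    problems = []
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        for column, parse in schema.items():
            try:
                parse(row[column])
            except (TypeError, ValueError):
                problems.append((line_number, f"bad {column}: {row[column]!r}"))
    return problems


good = "site,temp_c\nA,11.2\n"
bad = "site,temp_c\nA,warm\n"

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

The point is only that a published schema lets a consumer detect unusable data mechanically instead of discovering it halfway through an analysis.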

@bkatiemills
Owner Author

You're certainly right that transport and storage is still a problem; nuclear physics, astronomy and increasingly genomics at least are producing datasets that aren't feasible to transmit. But a great many fields don't have such unwieldy datasets; we can work at the problem from both ends.

I'd like to better understand your ideas on the distinction between scientific and engineering cultures in this regard - I would have guessed they would suffer from the same challenges, but I'd be delighted to learn more.

@tleeuwenburg

The difference to my mind is most evident in the 'open source' culture, where there is a strong concept of building shareable, re-usable, multi-use content. Science is more focused on individual results than on making repeated use of data. Therefore, science content is harder to use, in general, than engineering content. Data needs a stronger focus on re-usability and shareability.

@bkatiemills
Owner Author

That's really interesting, and spot on; I'd love to see a blog post / lesson / lecture notes / whatever you prefer on lessons the sciences could learn from engineering culture on making that shareable, re-usable, multi-use content; I think there could be some great insights there!

@tleeuwenburg

I think that's a good idea. Do you have an example dataset which you think would be a good basis, and be of common, general interest? If you'd be interested, I'd be happy to work up an example of what I mean, and then maybe get some help from you on whether you think it's a good approach? I would just want to make sure we are using good, open data with no major IP restrictions and which is clearly of common use. Archival or realtime, either is fine.

@tleeuwenburg

Hey, so I created a pretend tool called 'odit' -- the Open Data Integration Tool, and wrote documentation for what it should do. http://odit.readthedocs.org/en/latest/ ... this is my imagining of what data sharing needs.

@tleeuwenburg

@bkatiemills
Owner Author

Interesting, odit is definitely on the right track on the distribution side - have you checked out the dat project? It would be interesting to dig into how well that project conforms to odit's recommendations; at first glance I bet there will be some substantial overlaps.

As for datasets, there are a couple of options out there that could be interesting. I've been playing around a little with some environmental data from the British Columbian government; also of enormous popular interest is genomics data from NCBI and others; have a look at these steps to acquire some data being used in a real and ongoing genomics project.
