Technology / Architecture Options #4
Comments
We should decide on a technology quickly so we can get to work on an alpha/MVP. Any thoughts on which of the following would be easiest to get up and running quickly?
Thoughts? |
I'm 50/50 on CKAN vs. Really Simple Open Data for the MVP. I think the former is worth the setup lift, but the latter is promising, if new. @waldoj and @chriswhong would have thoughts here that I would weigh heavily. |
We publish this basic guide to deciding on what repository software to use, though there's nothing about Really Simple Open Data yet because it's pretty early along. CKAN is a strong candidate if Harvard is in a position to use a CKAN AMI on Amazon Web Services. If that presents a procurement obstacle, then there are some non-trivial technical challenges. I'll leave the matter of RSOD to @chriswhong, but I sure like the idea of it being used. |
RSOD is still a twinkle in my eye at this point, unless someone wants to
|
Thanks @rebeccawilliams @waldoj @chriswhong! |
The simplest way to do this is to use Dataverse, which has most of the properties that we want and is already up and running. The other technologies are fine, but they require someone to actually run an operation; Dataverse already has that (and they are good). So unless someone is volunteering to actually run an instance of the other technology, I'd say this one is a no-brainer. |
I confess that all that I know about Dataverse, I learned in the past 10 minutes. My anti-bona-fides established, it seems like it would work, although it sure seems like an awkward fit. I can't seem to find an API, a data.json file, on-screen data views, or data visualizations, and the metadata seems to be in some kind of library-specific XML. It seems like this would, in fact, get you a data repository, but I don't think it would get you many of the benefits normally associated with an open data repository, and it wouldn't allow your repo to be part of the larger world of repositories in any way. But if the alternative is nothing, this is clearly a huge step up, because at least then you'd have machine-readable metadata. |
Here's what Nick and I discussed at our meeting this morning: I think the debate over which technology to use stems from a fundamental debate about the purpose of the platform. Is it more of a catalog of metadata for already-published data, or a platform for hosting data sets?

Catalog
If we're going for a catalog, our options are CKAN or Socrata, which are both metadata cataloging services. Dataverse might not be the best option here because it's a little heavyweight (more suited to full-scale data hosting). I could see a way to wrangle it to work here, but as @waldoj mentioned, it's a little awkward.

Data hosting
If we want to host all the data ourselves, our options are Dataverse or Socrata again, which support full data hosting.

What now?
In any case, we'll use one of these technologies to store data/metadata on the backend and have the tech team focus on building frontend extension modules such as search widgets, visualizations, etc. But I think that, before we choose a technology, we should figure out which approach we're taking. Thoughts on what's more suitable for our mission? |
👍 |
Might there be a difference between the technology you want to use for an MVP, and the technology you want to use to scale? For example, it might not make sense to spend five figures on a Socrata contract before the need to do so is demonstrated. It also seems to me that demonstrating standalone data hosting is not necessarily ambitious enough, especially since Dataverse has this capability natively. From that angle, I'd argue that CKAN seems to be an attractive option. On that point, it looks like CKAN is available from a few vendors on the AWS marketplace (e.g. https://aws.amazon.com/marketplace/pp/B00JEF0278/ref=mkt_wir_ckan ) pretty cheaply. It also doesn't look too painful to set up on EC2 ( http://docs.ckan.org/en/ckan-1.7.3/install-from-package-amazon.html ) and deploy a few datasets without getting fancy, at least if you believe the documentation. I don't know anything about the procurement obstacles, but if the point of making a minimum viable product is to test some key hypotheses, CKAN seems to support those hypotheses out of the box. |
"Already-published data" can be as simple as hosting CSV files on an FTP server, which has negligible cost and maintenance concerns. Versioning is harder and there are no APIs, but in the majority of cases I think files for download are fine. Putting data on the internet for download is easy and cheap. CKAN actually does have the ability to store files (filestore) and data in database tables (datastore), but my thought is that you would be better off using it just as a metadata catalog and having the data live wherever it is most useful. -Chris |
I should also add that there are firms who can stand up CKAN for you as a managed service so you don't need to worry about maintenance. Ask me off-issue if you want a referral. |
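To make the catalog-only use of CKAN suggested above more concrete, here is a minimal sketch of a metadata record whose resources point at files hosted elsewhere. It assumes CKAN-style dataset fields (`name`, `title`, `notes`, and `resources` entries with `url` and `format`); the helper name, the dataset, and the FTP URL are all hypothetical, and nothing here talks to a real CKAN instance.

```python
import json


def make_catalog_entry(name, title, notes, file_urls):
    """Build a CKAN-style dataset record that links to externally hosted files.

    Hypothetical helper for illustration; nothing is uploaded or stored here.
    """
    resources = []
    for url in file_urls:
        # Guess the format from the file extension, e.g. "csv" -> "CSV".
        filename = url.rsplit("/", 1)[-1]
        extension = filename.rsplit(".", 1)[-1] if "." in filename else ""
        resources.append({"url": url, "format": extension.upper()})
    return {"name": name, "title": title, "notes": notes, "resources": resources}


entry = make_catalog_entry(
    name="example-dataset",  # hypothetical dataset slug
    title="Example Dataset",
    notes="Metadata only; the CSV stays on its original server.",
    file_urls=["ftp://files.example.org/data/example.csv"],  # placeholder URL
)
print(json.dumps(entry, indent=2))
```

The point of the design is that the catalog stores only the description and the link, so the data can keep living wherever it is most useful.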
@waldoj based on your comments I believe you've been looking at http://thedata.harvard.edu which is the old production URL running legacy DVN 3.x code which hasn't yet been redirected to the new production URL at https://dataverse.harvard.edu which is running new, shiny Dataverse 4.0 code, a complete rewrite. Please see http://datascience.iq.harvard.edu/blog/dataverse-40-here for an overview of what's new in Dataverse 4.0. Of particular interest may be our new JSON-based APIs documented at http://guides.dataverse.org/en/latest/api and a Python library at https://github.com/IQSS/dataverse-client-python . Data visualization and exploration can now be done with a new d3-based application called TwoRavens. The Dataverse team is always interested in feedback. Please see https://github.com/IQSS/dataverse/blob/master/CONTRIBUTING.md for some details on how to get in touch! Thank you @abidart for putting a bug in my ear about this project! Sounds very interesting! |
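As a rough illustration of the JSON-based APIs mentioned above, the sketch below builds a request URL for the Dataverse Search API (`/api/search` with the `q`, `type`, and `per_page` query parameters described in the API guide). The helper name and the query string are hypothetical, and the network call is left commented out, so treat this as a starting point rather than a tested client.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, item_type="dataset", per_page=10):
    """Return a Dataverse Search API URL (hypothetical helper, no request is sent)."""
    params = {"q": query, "type": item_type, "per_page": per_page}
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


url = build_search_url("https://dataverse.harvard.edu", "open data")
print(url)

# To actually run the search (requires network access):
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     results = json.load(response)
```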
I'm biased here as I sit on an advisory board. But has anyone asked Socrata for a free instance? I'm sure they would consider deploying one. Otherwise happy to stay out of the fray. My main point is that time spent developing and deploying the catalog is time not spent on creating value with the data. |
Looks like Dataverse isn't working for us as a catalog -- we've put a few datasets in and it appears to be fighting us every step of the way. Let's consider some alternative technology like CKAN. |
In the intervening 6 months, my organization has funded a system that makes hosting multiple copies of CKAN far cheaper and easier, so the price of hosting should start dropping precipitously. We worked with DataCats on that, and they sell CKAN hosting services. Open Knowledge and Ontodia also come to mind. |
Discussed Dataverse vs. alternatives at today's meeting: https://docs.google.com/document/d/1rTIYB43H983xZj4Vy2kwPgxfz8hcHQ9xzXLcN1eKjB0/edit?pli=1 |
👍 |
This thread is for a discussion about Dataverse, CKAN, Socrata, and other possible solutions.