Technology / Architecture Options #4

nsinai · 2015-04-08T01:41:26Z

This thread is for a discussion about Dataverse, CKAN, Socrata, and other possible solutions.

rebeccawilliams · 2015-04-20T22:06:06Z

some context on the options: http://sunlightfoundation.com/policy/opendatafaq/#portal
another budding option: https://github.com/chriswhong/ReallySimpleOpenData

hathix · 2015-04-21T02:35:30Z

We should decide on a technology quickly so we can get to work on an alpha/MVP. Any thoughts on which of the following would be easiest to get up and running quickly?

Spin up a CKAN instance. It's pretty heavyweight but provides almost all the features we need.
Write a wrapper around Dataverse's APIs. This would definitely require some setup work (other solutions might work out of the box), but that would be pretty painless.
Set up Really Simple Open Data. It's very simple and easy to set up, if lacking more advanced features.
Something else?

Thoughts?

rebeccawilliams · 2015-04-21T21:43:42Z

I'm 50/50 on CKAN v. Really Simple Open Data for the MVP. The former is worth the setup lift I think, but the latter is promising, if new.

@waldoj and @chriswhong will have thoughts here, and I would weigh them heavily.

waldoj · 2015-04-22T00:17:43Z

We publish this basic guide to deciding on what repository software to use (http://how-to.usopendata.org/basics/data-repositories.html), though there's nothing about Really Simple Open Data yet because it's still at an early stage.

CKAN is a strong candidate if Harvard is in a position to use a CKAN AMI on Amazon Web Services. If that presents a procurement obstacle, then there are some non-trivial technical challenges.

I'll leave the matter of RSOD to @chriswhong, but I sure like the idea of it being used.

chriswhong · 2015-04-22T01:48:01Z

RSOD is still a twinkle in my eye at this point, unless someone wants to dedicate some serious dev time to get it off the ground. It is definitely not a contender for anything serious at this point.

nsinai · 2015-04-23T17:55:17Z

Thanks @rebeccawilliams @waldoj @chriswhong!
@jimwaldo @perryhewitt thoughts?
I like CKAN because so many other national and local open data portals are using it, but I want to make sure there isn't something easier to stand up as an MVP.

jimwaldo · 2015-05-04T18:56:32Z

The simplest way to do this is to use Dataverse, which has most of the properties that we want and is already up and running. The other technologies are fine, but they require someone to actually run an operation; Dataverse already has that (and they are good).

So unless someone is volunteering to actually run an instance of the other technology, I'd say this one is a no-brainer.

waldoj · 2015-05-04T19:17:44Z

I confess that all that I know about Dataverse, I learned in the past 10 minutes. My anti-bona-fides established, it seems like it would work, although it sure seems like an awkward fit. I can't seem to find an API, a data.json file, on-screen data views, or data visualizations, and the metadata seems to be in some kind of library-specific XML. It seems like this would, in fact, get you a data repository, but I don't think it would get you many of the benefits normally associated with an open data repository, and it wouldn't allow your repo to be part of the larger world of repositories in any way. But if the alternative is nothing, this is clearly a huge step up, because at least then you'd have machine-readable metadata.
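For readers unfamiliar with the data.json file mentioned above, here is a minimal sketch of what generating one might look like. The field names follow the Project Open Data metadata schema (v1.1) as commonly published; the dataset record, the example.org URL, and the `build_data_json` helper are hypothetical and only illustrate the shape of the file, not any platform's actual output.

```python
import json


def build_data_json(datasets):
    """Wrap plain dataset records in a Project Open Data style catalog.

    `datasets` is a list of dicts using the hypothetical keys shown below.
    """
    return {
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "dataset": [
            {
                "title": d["title"],
                "description": d["description"],
                "keyword": d["keywords"],
                "modified": d["modified"],
                "identifier": d["identifier"],
                "accessLevel": "public",
                "publisher": {"name": d["publisher"]},
                # The catalog only points at the file; the data can live anywhere.
                "distribution": [
                    {"downloadURL": d["url"], "mediaType": d["media_type"]}
                ],
            }
            for d in datasets
        ],
    }


# Hypothetical example record, for illustration only.
example = [
    {
        "title": "Enrollment by school",
        "description": "Annual enrollment counts per school.",
        "keywords": ["enrollment", "education"],
        "modified": "2015-05-04",
        "identifier": "enrollment-by-school",
        "publisher": "Example University",
        "url": "https://example.org/data/enrollment.csv",
        "media_type": "text/csv",
    }
]

catalog = build_data_json(example)
print(json.dumps(catalog, indent=2))
```

A file like this, served at a well-known path, is what lets other catalogs harvest a repository's metadata.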

hathix · 2015-05-06T15:36:17Z

Here's what Nick and I discussed at our meeting this morning:

I think the debate over which technology to use stems from a fundamental debate about the purpose of the platform. Is it more of a catalog of metadata for already-published data, or a platform for hosting data sets?

Catalog

If we're going for a catalog, our options are CKAN or Socrata, which are both metadata cataloging services.

CKAN: free and open-source, but requires us to maintain it
Socrata: nontrivial cost (Nick estimates somewhere in the tens of thousands of dollars), but as a SaaS platform all the backend tech support is taken care of by the Socrata guys.

Dataverse might not be the best option here because it's a little heavyweight (more suited to full-scale data hosting). I could see a way to wrangle it to work here, but as @waldoj mentioned, it's a little awkward.

Data hosting

If we want to host all the data ourselves, our options are Dataverse or Socrata again, which support full data hosting.

Dataverse: pretty easy to set up, although it requires maintenance by our team (we do have the Dataverse staff on hand).
Socrata: I believe (correct me if I'm wrong) they allow for full dataset hosting. Again, this costs money but they handle tech support.

What now?

In any case, we'll use one of these technologies to store data/metadata on the backend and have the tech team focus on building frontend extension modules such as search widgets, visualizations, etc.

But I think that, before we choose a technology, we should figure out which approach we're taking. Thoughts on what's more suitable for our mission?

waldoj · 2015-05-06T15:51:11Z

👍

fzaman93 · 2015-05-06T17:40:37Z

Might there be a difference between the technology you want to use for an MVP, and the technology you want to use to scale?

For example, it might not make sense to spend five figures on a Socrata contract before the need to do so is demonstrated. It also seems to me that demonstrating standalone data hosting is not necessarily ambitious enough, especially since Dataverse has this capability natively.

From that angle, I'd argue that CKAN seems to be an attractive option. On that point, it looks like CKAN is available from a few vendors on the AWS marketplace (e.g. https://aws.amazon.com/marketplace/pp/B00JEF0278/ref=mkt_wir_ckan ) pretty cheaply. It also doesn't look too painful to set up on EC2 (http://docs.ckan.org/en/ckan-1.7.3/install-from-package-amazon.html ) and deploy a few datasets without getting fancy, at least if you believe the documentation.

I don't know anything about the procurement obstacles, but if the point of making a minimum viable product is to test some key hypotheses, CKAN seems to support those hypotheses out of the box.

chriswhong · 2015-05-06T17:58:52Z

"Already-published data" can be as simple as hosting CSV files on an FTP server, which has negligible cost and maintenance concerns. Versioning is harder, no APIs, but in the majority of cases I think files for download are fine. Putting data on the internet for download is easy and cheap.

CKAN actually does have the ability to store files (filestore) and data in database tables (datastore), but my thought is that you would be better off using it just as a metadata catalog and having the data live wherever it is most useful.

-Chris
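The catalog-only approach described above can be sketched as follows: a CKAN dataset record whose resource simply links to a file hosted elsewhere. This is a rough illustration, assuming CKAN's Action API (`package_create`); the dataset name, the example.org URL, and the `ckan_package_payload` helper are hypothetical. The sketch only builds the request body and does not contact any server.

```python
import json


def ckan_package_payload(name, title, notes, resource_url, resource_format):
    """Build a request body for CKAN's package_create action.

    The resource carries only a URL, so the file itself stays wherever
    it is already hosted and CKAN holds just the metadata.
    """
    return {
        "name": name,  # URL slug: lowercase letters, digits, "-" and "_"
        "title": title,
        "notes": notes,
        "resources": [
            {"name": title, "url": resource_url, "format": resource_format}
        ],
    }


# Hypothetical dataset, for illustration only.
payload = ckan_package_payload(
    name="enrollment-by-school",
    title="Enrollment by school",
    notes="Annual enrollment counts per school.",
    resource_url="https://example.org/data/enrollment.csv",
    resource_format="CSV",
)
body = json.dumps(payload)
print(body)

# To register the dataset, this body would be POSTed to
# <your-ckan-site>/api/3/action/package_create with an API key in the
# Authorization header. Some sites also require an "owner_org" field.
```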

chriswhong · 2015-05-06T17:59:50Z

I should also add that there are firms who can stand up CKAN for you as a managed service so you don't need to worry about maintenance. Ask me off-issue if you want a referral.

pdurbin · 2015-05-18T17:34:45Z

"I can't seem to find an API, a data.json file, on-screen data views, or data visualizations, and the metadata seems to be in some kind of library-specific XML."

@waldoj, based on your comments I believe you've been looking at http://thedata.harvard.edu, the old production URL. It runs legacy DVN 3.x code and hasn't yet been redirected to the new production URL at https://dataverse.harvard.edu, which runs the new, shiny Dataverse 4.0 code, a complete rewrite.

Please see http://datascience.iq.harvard.edu/blog/dataverse-40-here for an overview of what's new in Dataverse 4.0. Of particular interest may be our new JSON-based APIs documented at http://guides.dataverse.org/en/latest/api and a Python library at https://github.com/IQSS/dataverse-client-python . Data visualization and exploration can now be done with a new d3-based application called TwoRavens.
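As a rough illustration of the JSON-based APIs mentioned above, the sketch below builds a query URL for the Dataverse Search API. It assumes the `/api/search` endpoint and its `q`, `type`, and `per_page` parameters as described in the API guide; the `dataverse_search_url` helper and the query string are hypothetical, and some installations or versions may also require an API token.

```python
from urllib.parse import urlencode


def dataverse_search_url(base_url, query, result_type="dataset", per_page=10):
    """Return a Search API URL for a Dataverse installation.

    Only the URL is built here; fetching it should return JSON that a
    frontend widget could render.
    """
    params = {"q": query, "type": result_type, "per_page": per_page}
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


url = dataverse_search_url("https://dataverse.harvard.edu", "open data")
print(url)
# prints https://dataverse.harvard.edu/api/search?q=open+data&type=dataset&per_page=10
```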

The Dataverse team is always interested in feedback. Please see https://github.com/IQSS/dataverse/blob/master/CONTRIBUTING.md for some details on how to get in touch!

Thank you @abidart for putting a bug in my ear about this project! Sounds very interesting!

deaves · 2015-07-15T07:28:45Z

I'm biased here as I sit on an advisory board. But has anyone asked Socrata for a free instance? I'm sure they would consider deploying one.

Otherwise happy to stay out of the fray. My main point is that time spent developing and deploying the catalog is time not spent on creating value with the data.

hathix · 2015-11-17T20:03:47Z

Looks like Dataverse isn't working for us as a catalog -- we've put a few datasets in and it appears to be fighting us every step of the way. Let's consider some alternative technology like CKAN.

waldoj · 2015-11-17T20:22:08Z

In the intervening 6 months, my organization has funded a system that makes it far cheaper and easier to host multiple copies of CKAN, so the price of hosting should start dropping precipitously. We worked with DataCats on that, and they sell CKAN hosting services. Open Knowledge and Ontodia also come to mind.

hathix · 2015-11-20T18:04:28Z

Discussed Dataverse vs. alternatives at today's meeting: https://docs.google.com/document/d/1rTIYB43H983xZj4Vy2kwPgxfz8hcHQ9xzXLcN1eKjB0/edit?pli=1

waldoj · 2015-11-20T18:09:47Z

👍

nsinai changed the title ~~Technology~~ Technology / Architecture Options Apr 8, 2015

hathix mentioned this issue Jun 2, 2015

Building the MVP #16

Closed

hathix closed this as completed Nov 2, 2015

hathix reopened this Nov 17, 2015

hathix closed this as completed Nov 22, 2016
