Technology / Architecture Options #4

Closed
nsinai opened this issue Apr 8, 2015 · 19 comments

@nsinai
Contributor

nsinai commented Apr 8, 2015

This thread is for a discussion about Dataverse, CKAN, Socrata, and other possible solutions.

@nsinai changed the title from "Technology" to "Technology / Architecture Options" on Apr 8, 2015
@rebeccawilliams

@hathix
Contributor

hathix commented Apr 21, 2015

We should decide on a technology quickly so we can get to work on an alpha/MVP. Any thoughts on which of the following would be easiest to get up and running quickly?

  • Spin up a CKAN instance. It's pretty heavyweight but provides almost all the features we need.
  • Write a wrapper around Dataverse's APIs. This would definitely require some setup work (other solutions might work out of the box), but that would be pretty painless.
  • Set up Really Simple Open Data. It's very simple and easy to set up, if lacking more advanced features.
  • Something else?

Thoughts?

@rebeccawilliams

I'm 50/50 on CKAN v. Really Simple Open Data for the MVP. The former is worth the setup lift I think, but the latter is promising, if new.

@waldoj and @chriswhong would have thoughts here that I would weigh heavily.

@waldoj

waldoj commented Apr 22, 2015

We publish this basic guide to deciding on what repository software to use (http://how-to.usopendata.org/basics/data-repositories.html), though there's nothing about Really Simple Open Data yet because it's pretty early along.

CKAN is a strong candidate if Harvard is in a position to use a CKAN AMI on Amazon Web Services. If that presents a procurement obstacle, then there are some non-trivial technical challenges.

I'll leave the matter of RSOD to @chriswhong, but I sure like the idea of it being used.

@chriswhong

RSOD is still a twinkle in my eye at this point, unless someone wants to dedicate some serious dev time to get it off the ground. It is definitely not a contender for anything serious at this point.

@nsinai
Contributor Author

nsinai commented Apr 23, 2015

Thanks @rebeccawilliams @waldoj @chriswhong!
@jimwaldo @perryhewitt thoughts?
I like CKAN because so many other national and local open data portals are using it, but I want to make sure there isn't something easier to get up as an MVP.

@jimwaldo

jimwaldo commented May 4, 2015

The simplest way to do this is to use Dataverse, which has most of the properties that we want and is already up and running. The other technologies are fine, but they require someone to actually run an operation; Dataverse already has that (and they are good).

So unless someone is volunteering to actually run an instance of the other technology, I'd say this one is a no-brainer.

@waldoj

waldoj commented May 4, 2015

I confess that all that I know about Dataverse, I learned in the past 10 minutes. My anti-bona-fides established, it seems like it would work, although it sure seems like an awkward fit. I can't seem to find an API, a data.json file, on-screen data views, or data visualizations, and the metadata seems to be in some kind of library-specific XML. It seems like this would, in fact, get you a data repository, but I don't think it would get you many of the benefits normally associated with an open data repository, and it wouldn't allow your repo to be part of the larger world of repositories in any way. But if the alternative is nothing, this is clearly a huge step up, because at least then you'd have machine-readable metadata.
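For readers unfamiliar with the data.json file mentioned above, here is a minimal sketch of what such a catalog file might contain. It assumes the Project Open Data v1.1 field names (`title`, `keyword`, `accessLevel`, `distribution`, and so on) as I understand them; the dataset itself, its identifier, and the `example.org` URL are hypothetical, and the output should be checked against the published schema before being relied on.

```python
import json


def build_data_json(datasets):
    """Wrap dataset records in a data.json-style catalog envelope.

    The envelope and field names follow the Project Open Data v1.1
    convention as an assumption; validate against the official schema.
    """
    return {
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "dataset": datasets,
    }


# Hypothetical dataset record, for illustration only.
example_dataset = {
    "title": "Example enrollment counts",
    "description": "Illustrative record; not a real dataset.",
    "keyword": ["enrollment", "example"],
    "modified": "2015-05-04",
    "identifier": "example-enrollment-counts",
    "accessLevel": "public",
    "distribution": [
        {
            "downloadURL": "https://example.org/data/enrollment.csv",
            "mediaType": "text/csv",
        }
    ],
}

catalog = build_data_json([example_dataset])
print(json.dumps(catalog, indent=2))
```

Publishing a file like this at a well-known path is what lets other catalogs harvest a repository's metadata without a custom integration.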

@hathix
Contributor

hathix commented May 6, 2015

Here's what Nick and I discussed at our meeting this morning:

I think the debate over which technology to use stems from a fundamental debate about the purpose of the platform. Is it more of a catalog of metadata for already-published data, or a platform for hosting data sets?

Catalog

If we're going for a catalog, our options are CKAN or Socrata, which are both metadata cataloging services.

  • CKAN: free and open-source, but requires us to maintain it
  • Socrata: nontrivial cost (Nick estimates somewhere in the tens of thousands of dollars), but as a SaaS platform all the backend tech support is taken care of by the Socrata guys.

Dataverse might not be the best option here because it's a little heavyweight (more suited for full-scale data hosting). I could see a way to wrangle it to work here, but as @waldoj mentioned, it's a little awkward.

Data hosting

If we want to host all the data ourselves, our options are Dataverse or Socrata again, which support full data hosting.

  • Dataverse: pretty easy to set up, although it requires maintenance by our team (we do have the Dataverse staff on hand).
  • Socrata: I believe (correct me if I'm wrong) they allow for full dataset hosting. Again, this costs money but they handle tech support.

What now?

In any case, we'll use one of these technologies to store data/metadata on the backend and have the tech team focus on building frontend extension modules such as search widgets, visualizations, etc.

But I think that, before we choose a technology, we should figure out which approach we're taking. Thoughts on what's more suitable for our mission?

@waldoj

waldoj commented May 6, 2015

👍

@fzaman93

fzaman93 commented May 6, 2015

Might there be a difference between the technology you want to use for an MVP, and the technology you want to use to scale?

For example, it might not make sense to spend five figures on a Socrata contract before the need to do so is demonstrated. It also seems to me that demonstrating standalone data hosting is not necessarily ambitious enough, especially since Dataverse has this capability natively.

From that angle, I'd argue that CKAN seems to be an attractive option. On that point, it looks like CKAN is available from a few vendors on the AWS marketplace (e.g. https://aws.amazon.com/marketplace/pp/B00JEF0278/ref=mkt_wir_ckan ) pretty cheaply. It also doesn't look too painful to set up on EC2 (http://docs.ckan.org/en/ckan-1.7.3/install-from-package-amazon.html ) and deploy a few datasets without getting fancy, at least if you believe the documentation.

I don't know anything about the procurement obstacles, but if the point of making a minimum viable product is to test some key hypotheses, CKAN seems to support those hypotheses out of the box.

@chriswhong

"Already-published data" can be as simple as hosting CSV files on an FTP server, which has negligible cost and maintenance concerns. Versioning is harder and there are no APIs, but in the majority of cases I think files for download are fine. Putting data on the internet for download is easy and cheap.

CKAN actually does have the ability to store files (filestore) and data in database tables (datastore), but my thought is that you would be better off using it just as a metadata catalog and having the data live wherever it is most useful.
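A minimal sketch of that catalog-only approach, assuming a CKAN instance that exposes the version 3 Action API: the dataset record is created with `package_create`, and its resource simply points at a file hosted elsewhere. The host `ckan.example.org`, the API key, and the dataset names are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_package_create_request(ckan_url, api_key, name, title, data_url,
                                 data_format="CSV"):
    """Build (but do not send) a CKAN package_create request.

    The resource's "url" points at data hosted outside CKAN, so CKAN
    stores only the metadata record.
    """
    payload = {
        "name": name,    # URL-safe dataset slug
        "title": title,  # human-readable title
        "resources": [
            {"name": title, "url": data_url, "format": data_format},
        ],
    }
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_package_create_request(
    ckan_url="https://ckan.example.org",
    api_key="my-api-key",
    name="example-enrollment-counts",
    title="Example enrollment counts",
    data_url="ftp://ftp.example.org/data/enrollment.csv",
)
print(request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; whether extra fields such as an owning organization are required depends on how the instance is configured.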

-Chris

@chriswhong

I should also add that there are firms who can stand up CKAN for you as a managed service so you don't need to worry about maintenance. Ask me off-issue if you want a referral.

@pdurbin

pdurbin commented May 18, 2015

I can't seem to find an API, a data.json file, on-screen data views, or data visualizations, and the metadata seems to be in some kind of library-specific XML.

@waldoj based on your comments, I believe you've been looking at http://thedata.harvard.edu, the old production URL, which runs legacy DVN 3.x code and hasn't yet been redirected. The new production URL is https://dataverse.harvard.edu, which runs the new, shiny Dataverse 4.0 code, a complete rewrite.

Please see http://datascience.iq.harvard.edu/blog/dataverse-40-here for an overview of what's new in Dataverse 4.0. Of particular interest may be our new JSON-based APIs documented at http://guides.dataverse.org/en/latest/api and a Python library at https://github.com/IQSS/dataverse-client-python . Data visualization and exploration can now be done with a new d3-based application called TwoRavens.
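As a rough illustration of the JSON-based APIs, the sketch below builds a query URL for the Dataverse Search API. It assumes the `/api/search` endpoint with `q`, `type`, and `per_page` parameters as described in the API guide; the helper name `dataverse_search_url` and the query string are made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode


def dataverse_search_url(base_url, query, result_type="dataset", per_page=10):
    """Return a Search API URL for a Dataverse installation.

    Parameter names are assumed from the Dataverse API guide; check the
    guide for the installation's version before relying on them.
    """
    params = urlencode({"q": query, "type": result_type, "per_page": per_page})
    return f"{base_url.rstrip('/')}/api/search?{params}"


url = dataverse_search_url("https://dataverse.harvard.edu", "open data")
print(url)
```

Fetching that URL should return a JSON document describing matching datasets, which is the kind of machine-readable access a frontend search widget could build on.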

The Dataverse team is always interested in feedback. Please see https://github.com/IQSS/dataverse/blob/master/CONTRIBUTING.md for some details on how to get in touch!

Thank you @abidart for putting a bug in my ear about this project! Sounds very interesting!

@hathix hathix mentioned this issue Jun 2, 2015
@deaves
Contributor

deaves commented Jul 15, 2015

I'm biased here as I sit on an advisory board. But has anyone asked Socrata for a free instance? I'm sure they would consider deploying one.

Otherwise happy to stay out of the fray. My main point is that time spent developing and deploying the catalog is time not spent on creating value with the data.

@hathix hathix closed this as completed Nov 2, 2015
@hathix hathix reopened this Nov 17, 2015
@hathix
Contributor

hathix commented Nov 17, 2015

Looks like Dataverse isn't working for us as a catalog -- we've put a few datasets in and it appears to be fighting us every step of the way. Let's consider some alternative technology like CKAN.

@waldoj

waldoj commented Nov 17, 2015

In the intervening 6 months, my organization has funded a system that makes hosting multiple copies of CKAN far cheaper and easier, so the price of hosting should start dropping precipitously. We worked with DataCats on that, and they sell CKAN hosting services. Open Knowledge and Ontodia also come to mind.

@hathix
Contributor

hathix commented Nov 20, 2015

Discussed Dataverse vs. alternatives at today's meeting: https://docs.google.com/document/d/1rTIYB43H983xZj4Vy2kwPgxfz8hcHQ9xzXLcN1eKjB0/edit?pli=1

@waldoj

waldoj commented Nov 20, 2015

👍

@hathix hathix closed this as completed Nov 22, 2016