This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

About this project... #88

Closed
sere4tegdd opened this issue Jan 14, 2017 · 6 comments
@sere4tegdd

If I understand correctly, the point of this project is to choose some data to pin and offer as an "archive".

  1. Are you paying for servers to host and serve all this data? Who is paying for it?
  2. Is this the list of archives that you're hosting? Are you hosting each archive as a single .gz file, or can I retrieve a single file from one of the datasets (e.g. one arXiv paper)?
  3. If files can be downloaded individually, there should be a searchable index of all the hashes. The reason is that I'd love to have a "single, huge, distributed repository for the commons", so that I can download a file regardless of which program I'm using (for example, download a PDF from my local desktop PDF reader or from a web page).
@davidar
Collaborator

davidar commented Jan 15, 2017

  1. Ideally "the community" would be hosting it, but in practice I suspect a lot of stuff is only pinned on one of our nodes
  2. That, as well as various links throughout the issues. Format varies, but generally speaking it should be possible to retrieve individual files
  3. https://hash-archive.org/ might be able to help with this (cc @btrask)
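To illustrate point 2, here is a minimal sketch of how an individual file inside an IPFS-hosted archive can be addressed, either through the CLI or via an HTTP gateway. The CID and file path below are placeholders, not a real dataset:

```python
# Sketch: addressing a single file inside an IPFS archive.
# "QmExampleArchiveCID" and the path are illustrative placeholders.

def gateway_url(cid: str, path: str = "", gateway: str = "https://ipfs.io") -> str:
    """Build an HTTP gateway URL for a file inside an IPFS directory."""
    url = f"{gateway}/ipfs/{cid}"
    if path:
        url += "/" + path.lstrip("/")
    return url

# Equivalent CLI commands (against a local ipfs daemon):
#   ipfs ls <cid>           # list the files in the archive
#   ipfs get <cid>/<path>   # download just one file

print(gateway_url("QmExampleArchiveCID", "papers/1234.5678.pdf"))
# → https://ipfs.io/ipfs/QmExampleArchiveCID/papers/1234.5678.pdf
```

This only works when the archive was added as a directory tree rather than a single .gz blob, which is why the format of each dataset matters.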

@e4567uertryt

Isn't there any company that could offer a couple of servers to pin content? Protocol Labs, for example: they raised quite a lot of money for IPFS, so why can't they dedicate a couple of servers to pinning free/libre content? Doing so would help promote IPFS and speed up adoption, especially if content pinned in IPFS were immediately available on the web as src="fs://ipfs/<hash>". Is anybody from Protocol Labs reading this?

@flyingzumwalt
Contributor

Note: This is a good example of a discussion better suited to forum software (we're currently trying out Discourse) than to GitHub, since it's a discussion and a request for information rather than a feature request or a bug report.

I expressed concern about exactly this confusion about two weeks ago when we set up archives.ipfs.io. At the time I said:

Could we also include a way for people to tell us about new archives? I have a hunch there are a lot more datasets out there, with many more to come in the next few months.

Keep in mind that this should just be a stop-gap. I want to encourage people to publish IPLD metadata about their datasets so that anyone can build (and display) registries of the datasets they care about. There's a lot of work and experimentation in this area. I don't want to distract people from participating in that experimentation, but I do agree that in the meantime we want to have some listing of datasets that are available via IPFS. (Related: this call to form a community around the decentralized web in Libraries, Archives & Museums.)
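As a concrete sketch of the kind of metadata record such a registry might publish: the field names and values below are illustrative assumptions only; IPLD does not mandate a schema, and real registries would define their own.

```python
import json

# Hypothetical dataset metadata record for a community registry.
# All field names and values here are made up for illustration.
record = {
    "name": "example-dataset",
    "description": "A libre dataset pinned on IPFS",
    "license": "CC0-1.0",
    "root": "QmExampleArchiveCID",   # hash identifying the data on the network
    "size_bytes": 1_200_000_000,
    "maintainers": ["example-org"],
}

# Serialized, a record like this could be stored as an IPLD object
# (e.g. via `ipfs dag put`) so that anyone can build and display
# registries of the datasets they care about.
print(json.dumps(record, indent=2))
```

The point is that the registry itself is just more content-addressed data: anyone can assemble one, and anyone holding the hashes can pin the underlying datasets.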

This ipfs/archives github repository was set up as a catch-all for keeping track of any work that anyone is doing with archives on ipfs. That work has evolved a lot over the past year, which has made the information in this repository scattered and unclear. Until this week, the repository didn't have a "captain" tending to it. I've been named as the temporary captain while we look for a community member to take the role.

The big picture goal for datasets on the distributed web is for many people and organizations to pin the data they care about and build their own registries with metadata about those datasets along with hashes to identify the data on the network. Those registries can then serve as points of reference for people and orgs who want to hold & serve copies of the data they care about.

In the meantime, https://archives.ipfs.io/ gives a short list of datasets that we know are pinned.

I'd love to have a "single, huge, distributed repository for the commons", so that I can download a file regardless of which program I'm using (for example, download a PDF from my local desktop PDF reader or from a web page)

IPFS is that huge, distributed repository. The challenges lie in

  1. Building registries of the hashes for datasets, along with the metadata about those datasets
  2. Coordinating efforts to store & serve those datasets

We're working on a few things to address both of these challenges.

Isn't there any company that could offer a couple of servers to pin content? Protocol Labs, for example: they raised quite a lot of money for IPFS, so why can't they dedicate a couple of servers to pinning free/libre content?

Protocol Labs is designed as a research, development, and deployment lab for networks. R&D labs are great for developing new technologies and helping the world understand how to make the most of them. They move fast and constantly shake things up. That's not the type of organization that should be storing everyone's content. Moreover, asking a central authority to hold everyone's data would run against the whole point of decentralization. Providing preservation, discovery, and access services for content is the work of communities and institutions. The OAIS Reference Model is a good starting point for thinking about what it means to store, preserve, and serve content over the long term; note especially the "minimum responsibilities" that an OAIS-type archive is expected to meet. Protocol Labs isn't set up to do this, but many institutions are.

Libraries are the obvious place to turn, which is why it's so important to form a community around the Distributed Web for Libraries, Archives, and Museums. That's not the final solution (individuals, communities, and companies should also pin, annotate, and serve the data they care about), but it's a solid start.

@rht

rht commented Jan 15, 2017

From the neighboring issue (#87), it looks like @protocol1 (and undisclosed participating institutions) is pushing an effort to publish https://data.gov (whose data is currently hosted by SunGard or Akamai) to IPFS so that its content is accessible to all researchers.

AFAIK, the existing orgs/private companies/institutions that would have the resources to maintain such libre content are: http://www.alexandria.io/ (@dloa), http://ga4gh.org/ (@ga4gh, for genomics), http://ipdb.foundation/ (@ipdb), http://www.mediachain.io/ (@mediachain), ... ipfs/archives has been around since before these entities and remains an open community effort, but without any incentive from these parties to effectively propel it forward.

[1] (With this, "anybody from Protocol Labs" should have seen this issue thread.) It is up to anyone to assume that these agents/orgs are making a perfectly rational decision (for the sake of the public datasets?), after having taken various possible moves into account.

@flyingzumwalt
Contributor

@rht I'm not sure I understand your comments.

FYI: the "undisclosed participating institutions" are the libraries at a few large universities. They have the staff, resources, and organizational infrastructure to host, curate, and preserve high-impact datasets for the long term. We will be adding their names to the sprint issue and other docs over the next few days, once they've had a chance to confirm that they want their names on the project. They approached us because they plan to experiment with using IPFS to host and distribute the content. We allocated a two-week sprint during which a handful of us can focus our energy on helping them. This is likely the first of many collaborations involving institutions around the world from both the public and private sectors.

@flyingzumwalt
Contributor

Closing this issue. We've answered the questions posed and created issue #89 to follow up by improving the docs with info from this issue.
