
Whole search fails if one node can't connect #3

Open
jure opened this issue Jun 22, 2014 · 1 comment

Comments


jure commented Jun 22, 2014

Related to this issue: tsujio/webrtc-chord#5

I'm storing keywords and documents in the DHT, so if you search for "cancer software", it will first retrieve the key "cancer", then the key "software" from the network. These keys will contain document id arrays, e.g.:

cancer: ["10.1039/cancer.research.1", "10.1039/cancer.research.2"]
software: ["10.1039/cancer.research.1", "10.1039/cancer.research.2", "10.4000/cancer.research.3"]

It then performs an intersection of these document id arrays and retrieves the matching documents from the DHT network. For the above example, roughly:

intersection = ["10.1039/cancer.research.1", "10.1039/cancer.research.2"]
_.each(intersection, function (docId) { chord.retrieve(docId) })
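The intersection step can be sketched in plain JavaScript (the `intersect` helper is illustrative, not the project's actual code; the data is the example from above):

```javascript
// Keep only ids that appear in both keyword result arrays.
function intersect(a, b) {
  return a.filter(function (id) { return b.indexOf(id) !== -1; });
}

var cancer = ["10.1039/cancer.research.1", "10.1039/cancer.research.2"];
var software = ["10.1039/cancer.research.1", "10.1039/cancer.research.2", "10.4000/cancer.research.3"];

// Only documents matching every keyword survive the intersection.
var intersection = intersect(cancer, software);
// intersection: ["10.1039/cancer.research.1", "10.1039/cancer.research.2"]
```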

So for this search, 4 requests are made to the DHT: cancer, software, 10.1039/cancer.research.1 and 10.1039/cancer.research.2. (In fact, there are more requests still, because each keyword is queried for every field of a document (title, abstract, authors, journal, etc.), in the form of "[fieldname]keyword" keys. With 5 fields per document, that's 10 requests for just two keywords, and then 2 more to get the actual documents.)
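To make the request count concrete, here is a minimal sketch of how the per-field keys multiply (searchKeys and the field list are illustrative assumptions, not the project's actual code):

```javascript
// Build one DHT key per (field, keyword) pair, in the "[fieldname]keyword" form.
function searchKeys(keywords, fields) {
  var keys = [];
  keywords.forEach(function (keyword) {
    fields.forEach(function (field) {
      keys.push(field + keyword);
    });
  });
  return keys;
}

var fields = ["title", "abstract", "authors", "journal", "year"]; // assumed 5 fields
var keys = searchKeys(["cancer", "software"], fields);
// keys.length === 10: every one of these is a separate DHT request,
// before the per-document retrievals even start.
```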

If any of these fails, the whole search fails. I cache the document id lookups, as these are static, but even so, the failure rate for searches is quite high.

I guess document lookup could fall back to dx.doi.org when the ID is a DOI (which is not always the case), but even so, there should be a way to make this more resilient: either by failing partially in a smart way, or by contacting replicas for keywords whose main node can't be reached.
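The replica idea could look roughly like this. This is a hypothetical sketch, assuming synchronous lookup functions that throw when a node is unreachable; lookupWithReplicas is not part of webrtc-chord's actual API:

```javascript
// Try the main node first, then each replica in turn; only give up on a
// key once every candidate has failed, so one dead node no longer sinks
// the whole search.
function lookupWithReplicas(key, lookups) {
  for (var i = 0; i < lookups.length; i++) {
    try {
      return lookups[i](key);
    } catch (e) {
      // Node unreachable; fall through to the next replica.
    }
  }
  return null; // All replicas failed; the caller can partially fail instead.
}
```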


jure commented Jun 29, 2014

Some progress has been made on this: the search now ignores documents that can't be found on the network: ae883ae#diff-b880c77d0f382525de5100984f260cebR336

As a result, a query sometimes reports '16 results found' but displays only 4, because the other 12 could not be retrieved from the network.

It's a start.
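The mismatch between the reported and displayed counts can be sketched like this (summarize is an illustrative helper, not the actual code in the commit above):

```javascript
// Drop documents whose retrieval failed (represented here as null) and
// report both how many ids matched and how many documents were displayed.
function summarize(intersection, retrieved) {
  var found = retrieved.filter(function (doc) { return doc != null; });
  return { matched: intersection.length, displayed: found.length, docs: found };
}

// 16 ids match the query, but only 4 lookups succeed.
var ids = [];
for (var i = 0; i < 16; i++) ids.push("doc" + i);
var retrieved = ids.map(function (id, i) { return i < 4 ? { id: id } : null; });
var summary = summarize(ids, retrieved);
// summary.matched === 16, summary.displayed === 4
```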
