Search queries giving bizarre responses #170
I would suggest searching on our entire database if you want to attempt to match titles like this.
Dropping the journal filter means this work is found. As for the score, a score of x doesn't mean anything in and of itself; it is only a suggestion of the relevance of a match within the context of a filtered result set. It's a good relevance score within that result set, but the set is one journal. So again, I would suggest matching against the entire database in cases where your recorded ISSN does not match up with what Crossref has.
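The suggestion above, querying `/works` without the journal prefix and checking the returned ISSNs yourself, can be sketched like this (a hypothetical helper, not part of any Crossref client library):

```python
from urllib.parse import urlencode

# Hypothetical helper: build a Crossref works query, optionally scoped
# to one journal. Omitting the ISSN searches the entire database, as
# suggested above.
def build_query_url(query, issn=None):
    base = "http://api.crossref.org/works"
    if issn:
        # Journal-filtered variant: scores are only comparable within
        # this one journal's result set.
        base = f"http://api.crossref.org/journals/{issn}/works"
    return base + "?" + urlencode({"query": query, "rows": 5})
```

Matching database-wide and then comparing each hit's ISSN to your own record sidesteps the case where your recorded ISSN disagrees with Crossref's.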
But this used to work just fine: a lot of our searches that used to return good results no longer do. What changed? I would love an explanation for why the behavior changed unexpectedly between last month and today.
Additionally, if the results were terrible in a particular journal, we used to simply get no matching records at all. What was wrong with that behavior?
Are there any examples of queries that you've made in the past that now report different results? Do you have the old results and the new results to compare?
For the example above, we would have expected no records at all, since there were no good matches in that journal. That was the previous behavior.
I can confirm this problem: it has a major impact. The API behavior has clearly changed recently (since around 2016-12-07?). Could you please document the meaning of the score field in the API?
I can also confirm observing this change in the scoring system. It used to be that a score above 3.0 corresponded to a perfect match with very high confidence (I would say in 95% of cases). Now the score varies between 10 and 100 and it is impossible to make sense of it. What's worse, the new score is less accurate than before. For example, on this query, the second result (
@kjw you say
But can you then please let us know what the current theoretical maximum is? The REST API works with versions, so we could just use an older version to use the old system. |
Any news on this issue? Would be nice to hear some clarification from the crossref guys, given this is a major setback in the search engine quality... |
Hellooo there. Why does it seem that nobody at Crossref cares that their search engine is giving crappy results now, while it was working fine a few months ago? This is a major regression, and I am deeply worried when I see the lack of response this seems to generate.
Hello. Sorry for the lack of response. I don't know the answer, but I'll make sure someone gets back to you asap.
Thanks. We are willing to help, but there is only so much we can do from our side. |
Hi - just to let you know we've not forgotten. Work underway to investigate. |
Sorry this took so long. The problem you have been having is a result of the expansion of the kinds of metadata that we are collecting. The default behaviour of the query parameter has always been to search all the elements in a DOI's record. Back when the only elements we collected were "bibliographic metadata", a plain query effectively searched only that.
In an imminent new version of the API we will be introducing a variant of the query parameter that constrains matching to bibliographic metadata. We will also modify the Crossref Metadata Search interface to allow users to optionally constrain searches similarly.
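Assuming the field-limited variant referred to here is what Crossref's REST API later documented as `query.bibliographic` (an assumption on the parameter name, since it is not named in the comment above), a query constrained to titles, authors, and similar citation metadata would look like:

```python
from urllib.parse import urlencode

# Assumption: query.bibliographic is the field-limited query variant
# described above. It restricts matching to bibliographic metadata
# (titles, authors, etc.) rather than the full record.
params = {
    "query.bibliographic": "Functional divergence during hominid evolution",
}
url = "http://api.crossref.org/works?" + urlencode(params)
```

This avoids matching on funder names, licenses, and other newly collected fields that a plain `query` now also searches.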
Great, thanks for this answer! Just one other thing though: how about the new score that is being returned? Before, it used to be that the score was between 1 and 5-6ish, and a score above 3 usually translated into a very good agreement with the query. I used to rely on this score as a first way to discard results that would not match my query. Now the scores range between 1 and 100-150. How are we supposed to interpret this new score? |
Apparently this had to do with a change in the way Lucene scores documents (it now defaults to BM25), which we were not aware of when we jumped several Solr versions at the time. The upgrade was done in haste to address stability issues and we missed this important change. We're taking care not to fall so far behind in Solr versions now, so we hope we won't have to make any such hasty major upgrades again. Sorry this caused you trouble.
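Solr's switch to BM25 as the default similarity explains the new score range. A toy sketch of the BM25 formula (with the usual defaults k1 = 1.2, b = 0.75; not Lucene's actual implementation) shows why absolute scores are corpus-dependent and effectively unbounded:

```python
import math

# Toy BM25 sketch. The IDF term depends on corpus size and document
# frequency, so the same document can score very differently in
# different (e.g. journal-filtered vs. database-wide) result sets.
def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # BM25+ style IDF
        tf = doc_terms.count(term)                        # term frequency
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (tf * (k1 + 1)) / denom
    return score
```

Unlike the old scoring, there is no fixed "good match" threshold: the value grows with query length and with how rare the matched terms are in the indexed corpus.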
Ok I see. I am not familiar with Lucene and Solr, so I'm gonna look up this new scoring system (BM25) and see if I can make sense of it all. Thanks for providing this information =) |
Can you perhaps provide a "gold standard" as an example for the score? Like a YAML document with expected scores. That way, even if you can't tell us which scores are garbage, we can at least see a range of scores against their queries. This would also better document your intent.
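Short of such a gold standard, one workaround (a hypothetical helper, not anything Crossref provides) is to treat scores as meaningful only relative to each other within one result set, normalizing against the top hit:

```python
# Hypothetical helper: rescale each hit's score relative to the top
# result, since absolute BM25 values are only comparable within a
# single result set. `results` is a list of (title, score) pairs
# sorted by score, descending.
def normalize_scores(results):
    if not results:
        return []
    top = results[0][1]
    return [(title, score / top) for title, score in results]
```

A second hit at, say, 0.95 of the top score is then plausibly a near-tie, while one at 0.3 clearly is not, regardless of what the raw numbers are.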
In part thanks to community feedback, changes to match scoring will be handled more sensitively as we roll out future changes. |
At Dryad, we use the CrossRef API to search for titles similar to packages we have in our archive. Our matcher creates a query string for a particular journal based on the author and the title, like so:
http://api.crossref.org/journals/1553-7390/works?sort=score&order=desc&query=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution
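For reference, the query URL above can be assembled with the standard library like so (the ISSN and search terms are taken from the example itself):

```python
from urllib.parse import urlencode

# Rebuild the journal-scoped query shown above: sort hits by relevance
# score, descending, within journal ISSN 1553-7390.
base = "http://api.crossref.org/journals/1553-7390/works"
params = {
    "sort": "score",
    "order": "desc",
    "query": "Baker Functional divergence of NRC TR as a modulator "
             "of pluripotentiality during hominid evolution",
}
url = base + "?" + urlencode(params)
```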
Usually, this gives us very accurate results, with results scoring higher than 3.0 being near perfect matches. When we ran the matcher this morning, however, we got insane results, like the attached
works.json.txt
The first result is nothing like the search string, yet it has an absurdly high match score of 8.619. What gives?