
Search queries giving bizarre responses #170

Closed
daisieh opened this issue Dec 6, 2016 · 19 comments

@daisieh

daisieh commented Dec 6, 2016

At Dryad, we use the CrossRef API to search for titles similar to packages we have in our archive. Our matcher creates a query string for a particular journal based on the author and the title, like so: http://api.crossref.org/journals/1553-7390/works?sort=score&order=desc&query=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution

Usually this gives us very accurate results, with anything scoring higher than 3.0 being a near-perfect match. When we ran the matcher this morning, however, we got wildly wrong results, like the attached
works.json.txt

            {
                "indexed": {
                    "date-parts": [
                        [
                            2016, 
                            10, 
                            24
                        ]
                    ], 
                    "date-time": "2016-10-24T15:10:14Z", 
                    "timestamp": 1477321814852
                }, 
                "reference-count": 0, 
                "publisher": "Public Library of Science (PLoS)", 
                "issue": "2007", 
                "content-domain": {
                    "domain": [ ], 
                    "crossmark-restriction": false
                }, 
                "short-container-title": [
                    "PLoS Genet"
                ], 
                "published-print": {
                    "date-parts": [
                        [
                            2005
                        ]
                    ]
                }, 
                "DOI": "10.1371/journal.pgen.0030180.eor", 
                "type": "journal-article", 
                "created": {
                    "date-parts": [
                        [
                            2007, 
                            9, 
                            7
                        ]
                    ], 
                    "date-time": "2007-09-07T14:15:42Z", 
                    "timestamp": 1189174542000
                }, 
                "page": "e180", 
                "source": "CrossRef", 
                "title": [
                    "ZIPk: a Unique Case of Murine-Specific Divergence of a Conserved Vertebrate Gene"
                ], 
                "prefix": "http://id.crossref.org/prefix/10.1371", 
                "volume": "preprint", 
                "author": [
                    {
                        "given": "Yishay", 
                        "family": "Shoval", 
                        "affiliation": [ ]
                    }, 
                    {
                        "given": "Shmuel", 
                        "family": "Pietrokovski", 
                        "affiliation": [ ]
                    }, 
                    {
                        "given": "Adi", 
                        "family": "Kimchi", 
                        "affiliation": [ ]
                    }
                ], 
                "member": "http://id.crossref.org/member/340", 
                "container-title": [
                    "PLoS Genetics"
                ], 
                "original-title": [ ], 
                "deposited": {
                    "date-parts": [
                        [
                            2007, 
                            9, 
                            7
                        ]
                    ], 
                    "date-time": "2007-09-07T14:15:44Z", 
                    "timestamp": 1189174544000
                }, 
                "score": 8.618961, 
                "subtitle": [ ], 
                "short-title": [ ], 
                "issued": {
                    "date-parts": [
                        [
                            2005
                        ]
                    ]
                }, 
                "URL": "http://dx.doi.org/10.1371/journal.pgen.0030180.eor", 
                "ISSN": [
                    "1553-7390", 
                    "1553-7404"
                ], 
                "subject": [
                    "Genetics(clinical)", 
                    "Genetics", 
                    "Cancer Research", 
                    "Ecology, Evolution, Behavior and Systematics", 
                    "Molecular Biology"
                ]
            }, 

This first result is nothing like the search string, and has a crazily high match score of 8.619. What gives?
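For context, the call our matcher makes boils down to something like this (a minimal Python sketch using the requests library; the journal ISSN and query terms are the ones from the URL above, and the 3.0 cut-off is our own heuristic, not a documented API threshold):

    import requests

    # Journal-scoped works query, sorted by relevance score (descending),
    # exactly as in the URL above.
    url = "http://api.crossref.org/journals/1553-7390/works"
    params = {
        "sort": "score",
        "order": "desc",
        "query": "Baker Functional divergence of NRC TR as a modulator "
                 "of pluripotentiality during hominid evolution",
    }
    items = requests.get(url, params=params).json()["message"]["items"]

    # Until recently, anything scoring above ~3.0 was a near-perfect match.
    matches = [item for item in items if item["score"] > 3.0]
    for item in matches:
        print(item["score"], item["DOI"], item["title"])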

@kjw
Contributor

kjw commented Dec 9, 2016

I would suggest searching on our entire database if you want to attempt to match titles like this.

http://api.crossref.org/works?query=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution

Dropping the journal filter means this work is found. Further, I'd take a look at the query.title, query.author, etc. parameters to match more closely in situations like this.

As for the score: a score of x doesn't mean anything in and of itself; it is only an indication of the relevance of a match within the context of a filtered result set. It is a good relevance score within that result set, but the set is limited to one journal. So again, I would suggest matching against the entire database for cases where your recorded ISSN does not match up with what Crossref has.
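For example, something along these lines (a sketch only, splitting the author out of the free-text title terms):

http://api.crossref.org/works?query.title=Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution&query.author=Baker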

@kjw kjw closed this as completed Dec 9, 2016
@daisieh
Author

daisieh commented Dec 9, 2016

But this used to work just fine: a lot of searches that used to return good results no longer do. What changed?

I would love an explanation for why the behavior unexpectedly changed between last month and today.

@daisieh
Author

daisieh commented Dec 9, 2016

Additionally, if there were no good matches in a particular journal, the response used to be that we'd get no records at all. What was wrong with that behavior?

@kjw
Contributor

kjw commented Dec 12, 2016

Are there any examples of queries that you've made in the past that now report different results? Do you have the old results and the new results to compare?

@kmeddings kmeddings reopened this Dec 12, 2016
@daisieh
Author

daisieh commented Dec 12, 2016

For the example above, we would have expected a result of no records at all, since there were no good matches in that journal. That was the previous behavior.

@bart-v

bart-v commented Dec 16, 2016

I can confirm this problem: it has a major impact.

The API behavior has clearly changed recently, since 2016-12-07?
The value of the "score" parameter used to vary between 1 and 5.
Now it tends to fall between 1 and 150 (or so?).

Can you please document the meaning of the score parameter in the API?

@jdumas

jdumas commented Dec 21, 2016

I can also confirm observing this change in the scoring system. It used to be that a score above 3.0 would correspond to a perfect match with very high confidence (I would say in 95% of cases). Now the score varies between 10 and 100 and it is impossible to make sense of it.

What's worse, the new score is now less accurate than before. For example, on this query the second result ("score":43.039627) is the correct one, while the first one ("score":43.593155) is not. And I know that last month the same query returned the correct result on top, with a score around 3-5.

@bart-v

bart-v commented Dec 22, 2016

@kjw you say

a score of x doesn't mean anything

But can you then please let us know what the current theoretical maximum is?
It used to be around 4 or 5. Now it's 100 or 150. That is confusing...

The REST API works with versions, so we could just use an older version to get the old behavior.
But which versions are available is also not documented...
Can you tell us what versions exist?

@jdumas

jdumas commented Jan 15, 2017

Any news on this issue? It would be nice to hear some clarification from the Crossref team, given that this is a major setback in search-engine quality...

@jdumas

jdumas commented Mar 15, 2017

Hellooo there. Why does it seem that nobody at Crossref cares that their search engine is giving crappy results now, while it was working fine a few months ago? This is a major regression and I am deeply worried when I see the lack of response this seems to generate.

@kmeddings
Contributor

Hello. Sorry for the lack of response. I don't know the answer, but I'll make sure someone gets back to you asap.

@jdumas

jdumas commented Mar 22, 2017

Thanks. We are willing to help, but there is only so much we can do from our side.

@kmeddings
Contributor

Hi - just to let you know we've not forgotten. Work is underway to investigate.

@gbilder
Contributor

gbilder commented Apr 25, 2017

Sorry this took so long. The problem you have been having is a result of the expansion in the kinds of metadata that we are collecting. The default behaviour of the query parameter has always been to search all the elements in a DOI's record. Back when the only elements we collected were "bibliographic metadata", this meant that a plain query parameter was very good at scoring the matching bibliographic record. But as we've added other elements (funders, citations, abstracts, relations, etc.), these elements have started to skew some queries, so the correct bibliographic match is no longer the top-scoring title.

In an imminent new version of the API we will be introducing a variant of the query parameter, query.bibliographic, that constrains the query to just bibliographic metadata. You can actually test the feature on two of the examples provided above; it appears to return the correct match (see below).

We will modify the Crossref Metadata Search interface to allow users to optionally constrain searches similarly.

http://api.crossref.org/works?query.bibliographic=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution&rows=1

http://api.crossref.org/works?query.bibliographic=Gao%20Zhu%20Zhang%20Zhou%20An%20Improved%20Adaptive%20Constraint%20Aggregation%20for%20Integrated%20Layout%20and%20Topology%20Optimization&rows=1
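If you are calling this from a script, the first example translates to roughly the following (a minimal sketch using Python's requests library; the response fields are the same ones shown in the work record earlier in this thread):

    import requests

    # Restrict the free-text query to bibliographic metadata only.
    params = {
        "query.bibliographic": "Baker Functional divergence of NRC TR as a "
                               "modulator of pluripotentiality during hominid evolution",
        "rows": 1,
    }
    resp = requests.get("http://api.crossref.org/works", params=params).json()
    top = resp["message"]["items"][0]
    print(top["score"], top["DOI"], top["title"])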

@jdumas

jdumas commented Apr 25, 2017

Great, thanks for this answer!

Just one other thing though: what about the new score that is being returned? Before, the score was between 1 and 5-6ish, and a score above 3 usually translated into very good agreement with the query. I used to rely on this score as a first way to discard results that did not match my query. Now the scores range between 1 and 100-150. How are we supposed to interpret this new score?

@gbilder
Contributor

gbilder commented Apr 27, 2017

Apparently this had to do with a change in the way Lucene scores documents (the default is now BM25), which we were not aware of when we jumped several Solr versions at once. The upgrade was done in haste to address stability issues and we missed this important change. We're taking care not to fall so far behind on Solr versions now, so we hope we won't have to make any such hasty major upgrades again. Sorry this caused you trouble.
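For anyone trying to interpret the new numbers: BM25 scores a document D against a query Q roughly as follows (this is the textbook formula; Lucene's implementation differs in small details, and the parameters k1 ≈ 1.2, b ≈ 0.75 are common defaults rather than anything Crossref has published):

    \mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
        \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, and avgdl is the average document length in the index. Because the score is an unnormalised sum of per-term contributions, its absolute value depends on the query and the corpus, so there is no fixed theoretical maximum; only the relative ordering within a single result set is meaningful.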

@jdumas

jdumas commented Apr 27, 2017

Ok I see.

I am not familiar with Lucene and Solr, so I'm going to look up this new scoring system (BM25) and see if I can make sense of it all. Thanks for providing this information =)

@mjy

mjy commented Jun 6, 2017

Can you perhaps provide a "gold standard" as an example for the score? Something like a YAML document with expected scores. That way, even if you can't tell us which scores are garbage, we can at least see a range of scores against their queries. This would better document your intent as well.
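To make the idea concrete, a gold-standard entry might look something like this (a purely hypothetical sketch; the DOI and score range are placeholders, not values Crossref has published):

    # Hypothetical gold-standard entry: query, expected top match, and the
    # score range observed for that match. All values below are placeholders.
    - query: "Baker Functional divergence of NRC TR as a modulator of pluripotentiality during hominid evolution"
      expected_doi: "10.xxxx/placeholder"
      observed_score_range: [20.0, 60.0]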

@ppolischuk
Collaborator

In part thanks to community feedback, changes to match scoring will be handled more sensitively in future rollouts.
