
Search queries giving bizarre responses #170

Closed
daisieh opened this issue Dec 6, 2016 · 19 comments

@daisieh

daisieh commented Dec 6, 2016

At Dryad, we use the CrossRef API to search for titles similar to packages we have in our archive. Our matcher creates a query string for a particular journal based on the author and the title, like so: http://api.crossref.org/journals/1553-7390/works?sort=score&order=desc&query=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution

Usually this gives us very accurate results, with anything scoring higher than 3.0 being a near-perfect match. When we ran the matcher this morning, however, we got wildly wrong results, like the attached
works.json.txt

            {
                "indexed": {
                    "date-parts": [
                        [
                            2016, 
                            10, 
                            24
                        ]
                    ], 
                    "date-time": "2016-10-24T15:10:14Z", 
                    "timestamp": 1477321814852
                }, 
                "reference-count": 0, 
                "publisher": "Public Library of Science (PLoS)", 
                "issue": "2007", 
                "content-domain": {
                    "domain": [ ], 
                    "crossmark-restriction": false
                }, 
                "short-container-title": [
                    "PLoS Genet"
                ], 
                "published-print": {
                    "date-parts": [
                        [
                            2005
                        ]
                    ]
                }, 
                "DOI": "10.1371/journal.pgen.0030180.eor", 
                "type": "journal-article", 
                "created": {
                    "date-parts": [
                        [
                            2007, 
                            9, 
                            7
                        ]
                    ], 
                    "date-time": "2007-09-07T14:15:42Z", 
                    "timestamp": 1189174542000
                }, 
                "page": "e180", 
                "source": "CrossRef", 
                "title": [
                    "ZIPk: a Unique Case of Murine-Specific Divergence of a Conserved Vertebrate Gene"
                ], 
                "prefix": "http://id.crossref.org/prefix/10.1371", 
                "volume": "preprint", 
                "author": [
                    {
                        "given": "Yishay", 
                        "family": "Shoval", 
                        "affiliation": [ ]
                    }, 
                    {
                        "given": "Shmuel", 
                        "family": "Pietrokovski", 
                        "affiliation": [ ]
                    }, 
                    {
                        "given": "Adi", 
                        "family": "Kimchi", 
                        "affiliation": [ ]
                    }
                ], 
                "member": "http://id.crossref.org/member/340", 
                "container-title": [
                    "PLoS Genetics"
                ], 
                "original-title": [ ], 
                "deposited": {
                    "date-parts": [
                        [
                            2007, 
                            9, 
                            7
                        ]
                    ], 
                    "date-time": "2007-09-07T14:15:44Z", 
                    "timestamp": 1189174544000
                }, 
                "score": 8.618961, 
                "subtitle": [ ], 
                "short-title": [ ], 
                "issued": {
                    "date-parts": [
                        [
                            2005
                        ]
                    ]
                }, 
                "URL": "http://dx.doi.org/10.1371/journal.pgen.0030180.eor", 
                "ISSN": [
                    "1553-7390", 
                    "1553-7404"
                ], 
                "subject": [
                    "Genetics(clinical)", 
                    "Genetics", 
                    "Cancer Research", 
                    "Ecology, Evolution, Behavior and Systematics", 
                    "Molecular Biology"
                ]
            }, 

This first result is nothing like the search string, and has a crazily high match score of 8.619. What gives?
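For context, the call our matcher makes boils down to something like this (a minimal Python sketch using the requests library; the journal ISSN and query terms are the ones from the URL above, and the 3.0 cut-off is our own heuristic, not a documented API threshold):

    import requests

    # Journal-scoped works query, sorted by relevance score (descending),
    # exactly as in the URL above.
    url = "http://api.crossref.org/journals/1553-7390/works"
    params = {
        "sort": "score",
        "order": "desc",
        "query": "Baker Functional divergence of NRC TR as a modulator "
                 "of pluripotentiality during hominid evolution",
    }
    items = requests.get(url, params=params).json()["message"]["items"]

    # Until recently, anything scoring above ~3.0 was a near-perfect match.
    matches = [item for item in items if item["score"] > 3.0]
    for item in matches:
        print(item["score"], item["DOI"], item["title"])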

@kjw
Contributor

kjw commented Dec 9, 2016

I would suggest searching on our entire database if you want to attempt to match titles like this.

http://api.crossref.org/works?query=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution

Dropping the journal filter means this work is found. Further, I'd take a look at the query.title, query.author, etc. parameters to match more closely in situations like this.

As for the score: a score of x doesn't mean anything in and of itself; it is only an indication of the relevance of a match within the context of a filtered result set. It is a good relevance score within that result set, but the set is limited to one journal. So again, I would suggest matching against the entire database for cases where your recorded ISSN does not match up with what Crossref has.
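For example, something along these lines (a sketch only, splitting the author out of the free-text title terms):

http://api.crossref.org/works?query.title=Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution&query.author=Baker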

@kjw kjw closed this as completed Dec 9, 2016
@daisieh
Author

daisieh commented Dec 9, 2016

But this used to work just fine: a lot of searches that used to return good results no longer do. What changed?

I would love an explanation for why the behavior unexpectedly changed between last month and today.

@daisieh
Author

daisieh commented Dec 9, 2016

Additionally, if there were no good matches in a particular journal, the response used to be that we'd get no records at all. What was wrong with that behavior?

@kjw
Contributor

kjw commented Dec 12, 2016

Are there any examples of queries that you've made in the past that now report different results? Do you have the old results and the new results to compare?

@kmeddings kmeddings reopened this Dec 12, 2016
@daisieh
Author

daisieh commented Dec 12, 2016

For the example above, we would have expected a result of no records at all, since there were no good matches in that journal. That was the previous behavior.

@bart-v

bart-v commented Dec 16, 2016

I can confirm this problem: it has a major impact.

The API behavior has clearly changed recently, since 2016-12-07?
The value of the "score" parameter used to vary between 1 and 5.
Now it tends to fall between 1 and 150 (or so?).

Can you please document the meaning of the score parameter in the API?

@jdumas

jdumas commented Dec 21, 2016

I can also confirm observing this change in the scoring system. It used to be that a score above 3.0 would correspond to a perfect match with very high confidence (I would say in 95% of cases). Now the score varies between 10 and 100 and it is impossible to make sense of it.

What's worse, the new score is now less accurate than before. For example, on this query the second result ("score":43.039627) is the correct one, while the first one ("score":43.593155) is not. And I know that last month the same query returned the correct result on top, with a score around 3-5.

@bart-v

bart-v commented Dec 22, 2016

@kjw you say

a score of x doesn't mean anything

But can you then please let us know what the current theoretical maximum is?
It used to be around 4 or 5. Now it's 100 or 150. That is confusing...

The REST API works with versions, so we could just use an older version to get the old behavior.
But which versions are available is also not documented...
Can you tell us what versions exist?

@jdumas

jdumas commented Jan 15, 2017

Any news on this issue? It would be nice to hear some clarification from the Crossref team, given that this is a major setback in search-engine quality...

@jdumas

jdumas commented Mar 15, 2017

Hellooo there. Why does it seem that nobody at Crossref cares that their search engine is giving crappy results now, while it was working fine a few months ago? This is a major regression and I am deeply worried when I see the lack of response this seems to generate.

@kmeddings
Contributor

Hello. Sorry for the lack of response. I don't know the answer, but I'll make sure someone gets back to you asap.

@jdumas

jdumas commented Mar 22, 2017

Thanks. We are willing to help, but there is only so much we can do from our side.

@kmeddings
Contributor

Hi - just to let you know we've not forgotten. Work is underway to investigate.

@gbilder
Contributor

gbilder commented Apr 25, 2017

Sorry this took so long. The problem you have been having is a result of the expansion in the kinds of metadata that we are collecting. The default behaviour of the query parameter has always been to search all the elements in a DOI's record. Back when the only elements we collected were "bibliographic metadata", this meant that a plain query parameter was very good at scoring the matching bibliographic record. But as we've added other elements (funders, citations, abstracts, relations, etc.), these elements have started to skew some queries, so the correct bibliographic match is no longer the top-scoring title.

In an imminent new version of the API we will be introducing a variant of the query parameter, query.bibliographic, that constrains the query to just bibliographic metadata. You can actually test the feature on two of the examples provided above; it appears to return the correct match (see below).

We will modify the Crossref Metadata Search interface to allow users to optionally constrain searches similarly.

http://api.crossref.org/works?query.bibliographic=Baker+Functional+divergence+of+NRC+TR+as+a+modulator+of+pluripotentiality+during+hominid+evolution&rows=1

http://api.crossref.org/works?query.bibliographic=Gao%20Zhu%20Zhang%20Zhou%20An%20Improved%20Adaptive%20Constraint%20Aggregation%20for%20Integrated%20Layout%20and%20Topology%20Optimization&rows=1
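If you are calling this from a script, the first example translates to roughly the following (a minimal sketch using Python's requests library; the response fields are the same ones shown in the work record earlier in this thread):

    import requests

    # Restrict the free-text query to bibliographic metadata only.
    params = {
        "query.bibliographic": "Baker Functional divergence of NRC TR as a "
                               "modulator of pluripotentiality during hominid evolution",
        "rows": 1,
    }
    resp = requests.get("http://api.crossref.org/works", params=params).json()
    top = resp["message"]["items"][0]
    print(top["score"], top["DOI"], top["title"])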

@jdumas

jdumas commented Apr 25, 2017

Great, thanks for this answer!

Just one other thing though: what about the new score that is being returned? Before, the score was between 1 and 5-6ish, and a score above 3 usually translated into very good agreement with the query. I used to rely on this score as a first way to discard results that did not match my query. Now the scores range between 1 and 100-150. How are we supposed to interpret this new score?

@gbilder
Contributor

gbilder commented Apr 27, 2017

Apparently this had to do with a change in the way Lucene scores documents (the default is now BM25), which we were not aware of when we jumped several Solr versions at once. The upgrade was done in haste to address stability issues and we missed this important change. We're taking care not to fall so far behind on Solr versions now, so we hope we won't have to make any such hasty major upgrades again. Sorry this caused you trouble.
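For anyone trying to interpret the new numbers: BM25 scores a document D against a query Q roughly as follows (this is the textbook formula; Lucene's implementation differs in small details, and the parameters k1 ≈ 1.2, b ≈ 0.75 are common defaults rather than anything Crossref has published):

    \mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
        \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, and avgdl is the average document length in the index. Because the score is an unnormalised sum of per-term contributions, its absolute value depends on the query and the corpus, so there is no fixed theoretical maximum; only the relative ordering within a single result set is meaningful.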

@jdumas

jdumas commented Apr 27, 2017

Ok I see.

I am not familiar with Lucene and Solr, so I'm going to look up this new scoring system (BM25) and see if I can make sense of it all. Thanks for providing this information =)

@mjy

mjy commented Jun 6, 2017

Can you perhaps provide a "gold standard" as an example for the score? Something like a YAML document with expected scores. That way, even if you can't tell us which scores are garbage, we can at least see a range of scores against their queries. This would better document your intent as well.
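To make the idea concrete, a gold-standard entry might look something like this (a purely hypothetical sketch; the DOI and score range are placeholders, not values Crossref has published):

    # Hypothetical gold-standard entry: query, expected top match, and the
    # score range observed for that match. All values below are placeholders.
    - query: "Baker Functional divergence of NRC TR as a modulator of pluripotentiality during hominid evolution"
      expected_doi: "10.xxxx/placeholder"
      observed_score_range: [20.0, 60.0]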

@ppolischuk
Collaborator

In part thanks to community feedback, changes to match scoring will be handled more sensitively in future rollouts.
