wdi_helpers.id_mapper does not return the complete map #65

Closed
floatingpurr opened this issue May 10, 2018 · 11 comments

@floatingpurr
Contributor

Hello, I do not know if I misunderstood something, but there seems to be a problem with wdi_helpers.id_mapper.

If I run the following query in Wikidata, I get 65,438 items.

SELECT (count(?item) as ?c)
WHERE 
{
  ?item wdt:P5114 ?x.
}

Now let's go with WikidataIntegrator.

In [23]: school_qid_map = wdi_helpers.id_mapper('P5114', raise_on_duplicate=True)

In [24]: len(school_qid_map)
Out[24]: 61779

In [25]: school_qid_map = wdi_helpers.id_mapper('P5114', raise_on_duplicate=True)

In [26]: len(school_qid_map)
Out[26]: 64443

As you can see, I get two different mappings from two identical calls, and neither of them has 65,438 entries.
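To see which entries differ between two runs, the two mappings can be compared key by key. This is a minimal sketch, assuming each mapping behaves like a plain dict keyed by identifier value. The helper name `diff_mappings` and the sample data are made up for illustration and are not part of WikidataIntegrator.

```python
def diff_mappings(first, second):
    """Return the keys that appear in only one of two dict-like mappings."""
    first_keys, second_keys = set(first), set(second)
    return {
        "only_in_first": sorted(first_keys - second_keys),
        "only_in_second": sorted(second_keys - first_keys),
    }


# Hypothetical stand-ins for the results of two identical id_mapper calls.
run_a = {"ID001": "item_a", "ID002": "item_b"}
run_b = {"ID001": "item_a", "ID002": "item_b", "ID003": "item_c"}

report = diff_mappings(run_a, run_b)
print(report)
```

Looking up a few of the keys that appear in only one run on Wikidata may show whether those items were edited recently.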

The difference is due to the id_mapper query, that is the following:

SELECT ?id ?item ?mrt 
WHERE
{
  ?item p:P5114 ?s .
  ?s ps:P5114 ?id 
  OPTIONAL {?s pq:P4390 ?mrt}
}

Instead of mapping the item directly to the value, it maps the item to the statement and the statement to the value. The two patterns should return the same items, but in practice they do not.
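To compare the two patterns for any property, the count queries can be generated side by side. This is a sketch only: `build_count_queries` is a hypothetical helper, not a WikidataIntegrator function. As a general note, the wdt: prefix returns only best-rank statements, while the p:/ps: path returns statements of every rank, so the statement-level count would normally be at least as large as the truthy one.

```python
def build_count_queries(prop):
    """Return count queries for the truthy and the statement-level pattern."""
    truthy = (
        "SELECT (COUNT(?item) AS ?c) WHERE {\n"
        f"  ?item wdt:{prop} ?id .\n"
        "}"
    )
    statement = (
        "SELECT (COUNT(?item) AS ?c) WHERE {\n"
        f"  ?item p:{prop} ?s .\n"
        f"  ?s ps:{prop} ?id .\n"
        "}"
    )
    return {"truthy": truthy, "statement": statement}


queries = build_count_queries("P5114")
print(queries["truthy"])
print(queries["statement"])
```

Both strings can be pasted into the Wikidata Query Service to compare the two counts.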

I do not know whether this is a problem, or whether it is related to this library, but it seems like unexpected behavior. Sorry if I missed something.

@floatingpurr
Contributor Author

It turned out it's a problem with the items. Probably a Wikidata issue. That's very strange, though.

Sorry for this issue.

@stuppie
Collaborator

stuppie commented May 10, 2018

Hmm, weird that you would get different values running it twice. What do you mean it's a problem with the items? I just ran the command a couple of times and I get the same value every time (65,255). Did some of these items just get updated recently? I think so, yes, looking at this for example. It may take ~10 min for the SPARQL endpoint to get updated fully.

@stuppie
Collaborator

stuppie commented May 10, 2018

There is some Wikidata issue going on here. Some items are not properly loaded into Blazegraph. For example, run the following query several times, a couple of minutes apart, and you get either a value or no result depending on which server it hits:

select * where {
  wd:Q52839992 p:P5114 ?s . 
  ?s ?a ?b .
}

I assume this is related to a known issue: https://phabricator.wikimedia.org/T112397
I'll post this there and see what they say.
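Repeating a query and tallying the distinct answers is one way to make this kind of server-dependent behavior visible. This is a minimal sketch, assuming a caller-supplied `run_query` function that returns the result rows as a list. The function, the helper name `tally_results`, and the fake responses are hypothetical; no real endpoint is contacted here.

```python
import time
from collections import Counter


def tally_results(run_query, attempts=5, delay_seconds=0):
    """Call run_query repeatedly and count how often each result set appears."""
    tally = Counter()
    for attempt in range(attempts):
        rows = run_query()
        # Freeze the rows so identical result sets map to the same key.
        tally[frozenset(rows)] += 1
        if delay_seconds > 0 and attempt < attempts - 1:
            time.sleep(delay_seconds)
    return tally


# Hypothetical responses: some calls return a row, others return nothing.
responses = iter([["statement_1"], [], ["statement_1"], []])


def fake_run_query():
    return next(responses)


tally = tally_results(fake_run_query, attempts=4)
print(len(tally), "distinct result sets seen")
```

More than one distinct result set for the same query suggests that the servers behind the endpoint are out of sync.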

@floatingpurr
Contributor Author

floatingpurr commented May 10, 2018

Hey @stuppie! It's a very strange situation. I know there is a little bit of latency, but I think that something odd happened on the Wikidata side during the bulk update. Or maybe there is something odd in the query service just now. For example:

SELECT ?item
WHERE 
{
  ?item wdt:P5114 ?x.
}

gets 65,438 results. Instead:

SELECT distinct ?item
WHERE 
{
  ?item wdt:P5114 ?x.
}

gets 65,437 results. But if I check for duplicate IDs:

SELECT ?item ?item2
WHERE 
{
  ?item wdt:P5114 ?x.
  ?item2 wdt:P5114 ?x.
  FILTER (?item != ?item2)
}

I get 0 results. Unless I made a mistake, this is an impossible situation. Regarding the different outputs of the same query run twice, I can confirm that the result was oscillating between those two values. Now it seems stable. What a mess! 😢

Update after having read your last post:

Yes, I see. There is something strange going on. I'm sorry, I opened an issue here since I thought that it was a problem with WikidataIntegrator, but actually this is not a library problem.

@smalyshev

I do not see variation in query response on the server, but I do see a difference between the distinct and non-distinct counts. This seems to be because https://www.wikidata.org/wiki/Q3747159 has two IDs. Your query checks for one ID belonging to two items, but in fact it's the other way around: one item has two IDs.
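A small worked example with made-up rows shows why the plain and distinct counts can differ while the shared-ID check still returns nothing. The item and identifier names below are hypothetical and only illustrate the counting logic.

```python
from collections import defaultdict

# Hypothetical (item, identifier) rows, shaped like results of the wdt:P5114 pattern.
rows = [
    ("item_a", "ID001"),
    ("item_b", "ID002"),
    ("item_b", "ID003"),  # one item carrying two identifiers
]

row_count = len(rows)  # corresponds to count(?item)
distinct_items = len({item for item, _ in rows})  # corresponds to count(distinct ?item)

# Check from the thread: one identifier shared by two different items.
items_by_id = defaultdict(set)
for item, identifier in rows:
    items_by_id[identifier].add(item)
shared_ids = {i: items for i, items in items_by_id.items() if len(items) > 1}

# Reverse check: one item holding more than one identifier.
ids_by_item = defaultdict(set)
for item, identifier in rows:
    ids_by_item[item].add(identifier)
multi_id_items = {q: ids for q, ids in ids_by_item.items() if len(ids) > 1}

print(row_count, distinct_items)
print(sorted(shared_ids))
print({q: sorted(ids) for q, ids in multi_id_items.items()})
```

The first check comes back empty, while the reverse check picks out the one item that accounts for the off-by-one difference.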

@floatingpurr
Contributor Author

Fixed https://www.wikidata.org/wiki/Q3747159.

It seems like there is still a difference between this (65437 items):

SELECT (count(?item) as ?c)
WHERE 
{
  ?item wdt:P5114 ?x.
}

and this (65282 items):

SELECT ?id ?item ?mrt 
WHERE
{
  ?item p:P5114 ?s .
  ?s ps:P5114 ?id 
  OPTIONAL {?s pq:P4390 ?mrt}
}

@stuppie
Collaborator

stuppie commented May 30, 2018

I'm not sure... Stan says it's because of IDs on two items, but I get the same count (65437) whichever of these two queries I run:

SELECT (count(distinct ?item) as ?c) WHERE {
  ?item wdt:P5114 ?x.
}

SELECT (count(distinct ?x) as ?c) WHERE {
  ?item wdt:P5114 ?x.
}

@smalyshev

Indeed, seems like something was broken with these items. I've updated them and now they look OK.

@floatingpurr
Contributor Author

I do not know if it is normal, but the two queries (with or without p and ps) still return result sets with different cardinalities. For example, the query with ps now returns 65298 items.

@floatingpurr
Contributor Author

Well, after another update by @smalyshev, everything seems to be working fine : )
