Fix duplicate paper items #1025

Open
Daniel-Mietchen opened this issue Oct 29, 2018 · 7 comments

Comments

@Daniel-Mietchen
Owner

Due to the large WDQS server lag throughout the month, duplicate detection mechanisms relying on WDQS being up to date fail in droves.

This needs fixing, and since I cannot easily fix the source of the problem, I will have to treat the symptoms instead.

Here is a query that finds PMIDs (filtered by publication date, to avoid a timeout) that occur more than once on Wikidata:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT DISTINCT ?value (COUNT(DISTINCT ?item) AS ?ct) 
(GROUP_CONCAT(DISTINCT STRAFTER(STR(?item), "/entity/"); SEPARATOR = ", ") AS ?items) 
(GROUP_CONCAT(DISTINCT ?title; SEPARATOR = "/// ") AS ?titles)
WHERE {
  VALUES (?earliest) {
    ("2017-12-01T00:00:00Z"^^xsd:dateTime)
  }
  VALUES (?latest) {
    ("2031-12-31T00:00:00Z"^^xsd:dateTime)
  }
  ?item wdt:P577 ?date_time.
  hint:Prior hint:rangeSafe "true"^^xsd:boolean.
  FILTER(?date_time >= ?earliest)
  FILTER(?date_time <= ?latest)
  ?item wdt:P698 ?value.
  ?item wdt:P1476 ?title.
}
GROUP BY ?value
HAVING (?ct > 1)
ORDER BY DESC(?ct)
LIMIT 100000

@Daniel-Mietchen
Owner Author

That query currently gives 20150 results, so I will keep an eye on it for a day or so to see how it develops, and then start some batches for merging.
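
As a rough illustration of how the rows returned by the query above could be prepared for merging, here is a minimal Python sketch. It assumes the results were fetched in the standard SPARQL JSON format and that the `results.bindings` list is passed in. The function name `parse_duplicate_rows` and the sample rows are hypothetical and only serve the example; they are not part of any tool mentioned in this thread.

```python
from typing import Dict, List


def parse_duplicate_rows(bindings: List[dict]) -> Dict[str, List[str]]:
    """Map each PMID to the sorted list of item IDs that share it.

    `bindings` is assumed to be the `results.bindings` list of a SPARQL
    JSON response containing the ?value and ?items columns of the query.
    """
    groups: Dict[str, List[str]] = {}
    for row in bindings:
        pmid = row["value"]["value"]
        # ?items is a ", "-separated string of QIDs built by GROUP_CONCAT.
        qids = [q.strip() for q in row["items"]["value"].split(",") if q.strip()]
        # Sort numerically so the lowest Q number (the oldest item) comes first.
        qids.sort(key=lambda q: int(q[1:]))
        # Keep only PMIDs that really occur on more than one item.
        if len(qids) > 1:
            groups[pmid] = qids
    return groups


# Placeholder rows for illustration only, not real query output.
sample = [
    {"value": {"value": "11111111"}, "items": {"value": "Q300, Q20"}},
    {"value": {"value": "22222222"}, "items": {"value": "Q5, Q70, Q6"}},
]
groups = parse_duplicate_rows(sample)
print(len(groups), groups)
```

The length of the returned dictionary corresponds to the result count tracked in the comments here, so the same structure can be used both for monitoring and as input for a merge batch.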

Daniel-Mietchen added this to Needs triage in Recurring tasks via automation Oct 29, 2018
@Daniel-Mietchen
Owner Author

Current number is 20183.

@Daniel-Mietchen
Owner Author

Current number is 20179, so it seems someone has cleaned things up a bit.

@Daniel-Mietchen
Owner Author

A fix batch is running: https://tools.wmflabs.org/quickstatements/#/batch/4962 .
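
The contents of that batch are not shown in this thread. As a sketch of what a merge batch could look like, the Python snippet below turns groups of duplicate items into QuickStatements commands. It assumes the tab-separated `MERGE` command of the QuickStatements version 1 syntax and keeps the item with the lowest Q number as the merge target. The function name `merge_commands` and the sample IDs are hypothetical.

```python
from typing import Dict, List


def merge_commands(groups: Dict[str, List[str]]) -> List[str]:
    """Build one MERGE line per duplicate item.

    `groups` maps a PMID to the QIDs that carry it. The lowest Q number
    is used as the merge target (an assumption of this sketch).
    """
    commands: List[str] = []
    for pmid, qids in groups.items():
        # De-duplicate and order numerically so the target is deterministic.
        ordered = sorted(set(qids), key=lambda q: int(q[1:]))
        target = ordered[0]
        for duplicate in ordered[1:]:
            # Assumed QuickStatements v1 format: MERGE<TAB>Q1<TAB>Q2
            commands.append(f"MERGE\t{target}\t{duplicate}")
    return commands


# Placeholder IDs for illustration only.
example = merge_commands({"11111111": ["Q30", "Q4", "Q5"]})
print("\n".join(example))
```

Sharing a PMID does not by itself prove that two items describe the same paper, so it may be worth comparing the `?titles` column of the query before submitting such a batch.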

@Daniel-Mietchen
Owner Author

That batch has finished, and the number of such duplicates right now is 428.

@Daniel-Mietchen
Owner Author

The current number is 2805, so we probably need a new batch run soon.

@Daniel-Mietchen
Owner Author

No results right now, even for a LIMIT of 1.
