Can we count the number of words in a paper using SPARQL? #849

Open
Daniel-Mietchen opened this issue Jun 28, 2018 · 8 comments

Comments

@Daniel-Mietchen

I just received this question via email and will walk through it here.

@Daniel-Mietchen

Part one of the answer is to consider that SPARQL is an RDF Query Language, i.e. it can only be used to query things that are stored in RDF format (examples).

@Daniel-Mietchen

Assuming for the sake of argument that the paper in question is available in RDF in some way, we can then look into how SPARQL could be used to tease out the number of words in that text.

SPARQL has a dedicated COUNT function that can be seen in action in this query for the number of items in Wikidata that have a PubMed Central ID. This could then be combined with some of the string functions.
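
For readers who have not seen the aggregate before, here is a minimal sketch of what such a COUNT query could look like, wrapped in Python so that the request URL can be built with the standard library. It assumes that P932 is the Wikidata property for PubMed Central IDs and that the public endpoint is https://query.wikidata.org/sparql; the helper name `build_request_url` is made up for illustration, and this is not necessarily the query linked above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed public endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# COUNT is an aggregate: it collapses all matching rows into a single number.
# P932 is assumed here to be the "PMCID" property.
COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?count)
WHERE {
  ?item wdt:P932 ?pmcid .
}
"""

def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query.strip(), "format": "json"})

print(build_request_url(COUNT_QUERY))
```

Note that COUNT counts result rows (here, items), not words inside a literal, which is why it would have to be combined with string functions for the task at hand.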

Another approach might be to use the REGEX function, which can be seen in this query for works of art indexed in Wikidata where the title has an alliteration.

REGEX itself only tests whether a string matches a pattern, but a pattern can require a minimum number of repetitions of a given substring. That lets us do things like this SPARQL query for the number of works of art indexed in Wikidata for which the title has more than 33 spaces. So if the full text of the paper in question were available as a string (or could be converted into one, or into a series of strings), this approach would work in principle, and variations thereof could give the exact number of words in the document.
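
To make the threshold idea concrete, here is a small Python sketch that uses the standard `re` module as a stand-in for SPARQL's REGEX (an assumption: the two regex dialects differ in details, but should agree on the constructs used here). The pattern follows the word-count regex used in this thread, with the minimum number of words turned into a parameter; `has_at_least_n_words` is a name made up for illustration.

```python
import re

def has_at_least_n_words(text: str, n: int) -> bool:
    """True if text has at least n whitespace-separated words (n >= 1)."""
    # One word, followed by at least n - 1 further "whitespace + word" groups.
    pattern = r"^\s*\S+(?:\s+\S+){%d,}\s*$" % (n - 1)
    return re.search(pattern, text) is not None

title = "Can we count the number of words in a paper using SPARQL?"
print(has_at_least_n_words(title, 12))  # True: the title has exactly 12 words
print(has_at_least_n_words(title, 13))  # False
```

Like the FILTER in a query, this only answers yes or no for a given threshold; it does not return the count itself.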

@Daniel-Mietchen

I forgot to note that I got the REGEX for number of words in a string from https://stackoverflow.com/questions/35056771/count-number-of-words-in-a-string .

@Daniel-Mietchen

Having said all that, I think SPARQL is probably not an efficient way to count the number of words in a text, so other languages are worth a look. For the task of counting occurrences of a substring, Rosetta Code lists solutions in a whopping 100 languages.
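
As one illustration of the "other languages" route, here is a minimal Python sketch. It assumes the full text is already available as a plain string and treats a "word" as any run of non-whitespace characters, which is a simplification; the function names are made up for illustration.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens; a run of whitespace counts once."""
    return len(text.split())

def count_occurrences(text: str, substring: str) -> int:
    """Count non-overlapping occurrences of substring in text."""
    return text.count(substring)

sample = "the cat sat on the mat"
print(count_words(sample))               # 6
print(count_occurrences(sample, "the"))  # 2
print(count_occurrences(sample, "at"))   # 3
```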

@Daniel-Mietchen

Another point worth mentioning is that Wikidata has a nice Request a query page, where I will post this as well to see what others are coming up with.

@Daniel-Mietchen

Also asked on Twitter: https://twitter.com/EvoMRI/status/1012386463613378560 .

@Daniel-Mietchen

OK, here is a solution that came in via the request page and provides an actual count of the number of spaces in a string:

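SELECT DISTINCT ?work ?title ?spacecount
WHERE
{
  # Works of art (instances of wd:Q838948 or of any of its subclasses) and their titles
  ?work wdt:P31/wdt:P279* wd:Q838948;
        wdt:P1476 ?title.
  # Remove every non-whitespace character, leaving only the whitespace
  BIND(REPLACE(?title, "\\S", "") AS ?space)
  # The length of what remains is the number of whitespace characters
  BIND(STRLEN(?space) AS ?spacecount)
  # Keep only titles with at least 34 words, i.e. at least 33 gaps between words
  FILTER(REGEX(?title, "^\\s*\\S+(?:\\s+\\S+){33,}\\s*$"))
}
ORDER BY STR(?title)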
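
To see what the two bind steps compute, here is a Python sketch that mirrors them with the standard `re` module (an assumption: Python's `\S` and the regex flavour behind SPARQL's replace behave the same for this purpose). It also shows why the whitespace count is only a proxy for the word count: the two differ by one for single-spaced text, and by more when words are separated by several spaces. The function names and sample titles are made up for illustration.

```python
import re

def whitespace_count(title: str) -> int:
    """Mirror of strlen(replace(?title, "\\S", "")) from the query."""
    only_whitespace = re.sub(r"\S", "", title)  # drop every non-whitespace character
    return len(only_whitespace)

def word_count(title: str) -> int:
    """Count runs of non-whitespace characters instead."""
    return len(re.findall(r"\S+", title))

for title in ["The Night Watch", "The  Night   Watch "]:
    print(repr(title), whitespace_count(title), word_count(title))
```

For the first title this gives 2 whitespace characters and 3 words; for the second, 6 whitespace characters but still 3 words.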
