@rickskyy rickskyy commented Sep 18, 2017

This PR includes:

QueryCreator is a tool that simplifies the creation of prepared SPARQL queries.

Example:
q = "doc(author:D. N. Adams) section(name:Stimulus) prop(name:Contrast, value:[20], unit:%)"
prepared_query = FuzzyFinder().get_query(q)
print(prepared_query)

<rdflib.plugins.sparql.sparql.Query object at 0x7fbb6dbb4c50>

SELECT * WHERE {
?d rdf:type odml:Document .
?d odml:hasAuthor "D. N. Adams" .
?d odml:hasSection ?s .
?s rdf:type odml:Section .
?s odml:hasName "Stimulus" .
?s odml:hasProperty ?p .
?p rdf:type odml:Property .
?p odml:hasName "Contrast" .
?p odml:hasUnit "%" .
?p odml:hasValue ?v .
?v rdf:type rdf:Bag .
?v rdf:li "20" .
}, initNs={"odml": Namespace("https://g-node.org/projects/odml-rdf#"), "rdf": RDF})
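
For illustration, the query text above could be assembled from already parsed parameters roughly like this. It is only a minimal sketch and not the actual QueryCreator code: `build_select` and the layout of the `params` dict are assumptions made for the example, only the triple patterns mirror the output shown above.

```python
def build_select(params):
    """Assemble SPARQL SELECT text from parsed query parameters.

    Hypothetical helper: `params` maps 'Doc', 'Sec' and 'Prop' to lists of
    (attribute, value) tuples. Values are not escaped in this sketch.
    """
    lines = ["SELECT * WHERE {"]

    def add_attrs(var, attrs):
        for attr, value in attrs:
            if attr == "value":
                # Values are stored in an rdf:Bag attached to the property.
                lines.append("{0} odml:hasValue ?v .".format(var))
                lines.append("?v rdf:type rdf:Bag .")
                lines.append('?v rdf:li "{0}" .'.format(value))
            else:
                lines.append('{0} odml:has{1} "{2}" .'.format(
                    var, attr.capitalize(), value))

    if "Doc" in params:
        lines.append("?d rdf:type odml:Document .")
        add_attrs("?d", params["Doc"])
    if "Sec" in params:
        if "Doc" in params:
            lines.append("?d odml:hasSection ?s .")
        lines.append("?s rdf:type odml:Section .")
        add_attrs("?s", params["Sec"])
    if "Prop" in params:
        if "Sec" in params:
            lines.append("?s odml:hasProperty ?p .")
        lines.append("?p rdf:type odml:Property .")
        add_attrs("?p", params["Prop"])
    lines.append("}")
    return "\n".join(lines)


query_text = build_select({
    "Doc": [("author", "D. N. Adams")],
    "Sec": [("name", "Stimulus")],
    "Prop": [("name", "Contrast"), ("unit", "%"), ("value", "20")],
})
print(query_text)
```

The resulting text would then be passed to a SPARQL engine together with the odml and rdf namespace bindings.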

FuzzyFinder is a tool for querying a graph through fuzzy queries. The finder executes multiple queries to better match the input parameters and returns sets of triples, ordered from the most to the fewest matched parameters.
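
One way to get such an ordering is to enumerate the parameter combinations from the largest to the smallest and run one query per combination. The sketch below shows only that enumeration; `prioritized_subsets` is a hypothetical helper and not necessarily how the finder builds its subsets internally.

```python
from itertools import combinations


def prioritized_subsets(params):
    """Yield every non-empty combination of params, largest combinations first."""
    for size in range(len(params), 0, -1):
        for subset in combinations(params, size):
            yield subset


params = [("Sec", "name", "Stimulus"),
          ("Prop", "name", "Contrast"),
          ("Prop", "unit", "%")]

for subset in prioritized_subsets(params):
    print(len(subset), subset)
```

Note that the number of combinations grows exponentially with the number of parameters, so this is only practical for short queries.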

The prototype supports 2 modes:

  1. FuzzyFinder.find(mode='match')

The tool builds multiple SPARQL queries from a 'match' query, executes them, and returns the matched results. The first result always corresponds to the most specific query (the largest combination of input parameters that returned at least one triple).

The query syntax is straightforward: write the name of the entity (property, section or document; the shortened names prop, sec and doc also work) and put its attributes inside the parentheses, each attribute separated from its value by a colon.

Example from the code: prop(name:Date) section(name:Recording-2013-02-08-ak, type:Recording).
Here we search for sections and properties where the property's name attribute has the value Date.

To build 'match' queries you need to know exactly which odML attribute each value belongs to. If you write prop(name:Date) section(name:Recording, type:Recording-2013-02-08-ak), the find() method will not return any triples for the section parameters, because it is unlikely that a section with type Recording-2013-02-08-ak exists.

Attributes that do not belong to the odML entity are ignored as well (e.g. only id, author, date, version, repository and sections can exist on the Document object).
For example, with section(not-odml-name:Recording-2013-02-08-ak, record:Recording) the find method returns nothing.
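
The 'match' syntax described above can be parsed with a couple of regular expressions. The following is a minimal sketch under stated assumptions, not the parser from this PR: `parse_match_query`, `ALLOWED` and `ALIASES` are hypothetical names, the document and section attribute lists follow this description and the diff, and the property attribute list is a guess.

```python
import re

# Accepted attribute names per entity. The property list is an assumption.
ALLOWED = {
    "Doc": {"id", "author", "date", "version", "repository", "sections"},
    "Sec": {"id", "name", "definition", "type", "repository",
            "reference", "sections", "properties"},
    "Prop": {"id", "name", "definition", "value", "unit"},
}
ALIASES = {"doc": "Doc", "document": "Doc",
           "sec": "Sec", "section": "Sec",
           "prop": "Prop", "property": "Prop"}

# Matches 'entity(attr:value, attr:value)'.
ENTITY = re.compile(r"(\w+)\((.*?)\)")


def parse_match_query(q_str):
    """Return {'Doc'|'Sec'|'Prop': [(attribute, value), ...]}.

    Unknown entities and non-odML attributes are dropped silently.
    Values containing commas or parentheses are not supported here.
    """
    parsed = {}
    for name, body in ENTITY.findall(q_str):
        entity = ALIASES.get(name.lower())
        if entity is None:
            continue
        for pair in body.split(","):
            attr, sep, value = pair.partition(":")
            attr = attr.strip()
            if sep and attr in ALLOWED[entity]:
                parsed.setdefault(entity, []).append((attr, value.strip()))
    return parsed


print(parse_match_query(
    "prop(name:Date) section(name:Recording-2013-02-08-ak, type:Recording)"))
print(parse_match_query(
    "section(not-odml-name:Recording-2013-02-08-ak, record:Recording)"))
```

The second call yields an empty dict, which corresponds to the "returns nothing" case described above.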

  2. FuzzyFinder.find(mode='fuzzy')

The output logic is similar to the previous mode, but here you can provide broader information; the finder matches the parameters and creates meaningful queries based on the input.

The query string consists of two parts: FIND and HAVING.

In the FIND part the user specifies a set of odML objects and their attributes.
e.g. FIND prop(name) section(name, type)

In the HAVING part the user specifies a set of search values that may relate to the attributes in the FIND part.
e.g. HAVING Recording, Recording-2012-04-04-ab, Date

Finally, the complete query will look like this:
FIND sec(name, type) prop(name) HAVING Recording, Recording-2012-04-04-ab, Date
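
Splitting such a query into its two parts could look roughly like this. It is a sketch only: `parse_fuzzy_query` is a hypothetical helper, and the real parser may normalise entity names or validate attributes differently.

```python
import re

# 'FIND <entities> HAVING <values>'; both keywords are expected in upper case.
FUZZY = re.compile(r"^\s*FIND\s+(?P<find>.+?)\s+HAVING\s+(?P<having>.+)$", re.S)
ENTITY = re.compile(r"(\w+)\((.*?)\)")


def parse_fuzzy_query(q_str):
    """Return (find, having): entity -> attribute names, and the search values."""
    match = FUZZY.match(q_str)
    if match is None:
        raise ValueError("Expected a query of the form 'FIND ... HAVING ...'")
    find = {}
    for name, attrs in ENTITY.findall(match.group("find")):
        find[name] = [a.strip() for a in attrs.split(",") if a.strip()]
    having = [v.strip() for v in match.group("having").split(",") if v.strip()]
    return find, having


find, having = parse_fuzzy_query(
    "FIND sec(name, type) prop(name) "
    "HAVING Recording, Recording-2012-04-04-ab, Date")
print(find)
print(having)
```

The finder then has to decide which HAVING value fits which FIND attribute, which is the part that makes this mode "fuzzy".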

Please read the RDF_tools.ipynb Jupyter notebook documentation for more details and examples.

@coveralls

Coverage decreased (-2.7%) to 74.317% when pulling 59a4b14 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.


sec_pattern = re.compile("(sec|section)\(.*?\)")
self._parse_sec(re.search(sec_pattern, q_str))
print(re.search(sec_pattern, q_str))
Contributor

Just a minor comment: is this a debug line or is it required for something?

def _parse_sec(self, sec):
p = re.compile("(id|name|definition|type|repository|reference|sections|properties):(.*?)[,|\)]")
self.q_dict['Sec'] = re.findall(p, sec.group(0))
print("sec sdfds ", re.findall(p, sec.group(0)))
Contributor

Same minor comment as above. ;)

@coveralls

Coverage decreased (-5.7%) to 71.343% when pulling 8d51a05 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

@coveralls

Coverage decreased (-9.0%) to 68.086% when pulling de98dc4 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

@coveralls

Coverage decreased (-9.0%) to 68.086% when pulling 7bf55cb on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

@coveralls

Coverage decreased (-9.0%) to 68.086% when pulling 52a5584 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

@coveralls

Coverage decreased (-9.0%) to 68.086% when pulling c48405b on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

@mpsonntag mpsonntag mentioned this pull request Oct 10, 2017
"cell_type": "markdown",
"metadata": {},
"source": [
"**RDF (Resource Description Framework)** is one of the three foundational [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) technologies, the other two being SPARQL and OWL.\n",
Contributor

Please provide the proper citation or rewrite this paragraph.

Contributor

If you add "(Resource Description Framework, Wikipedia, 2017)" to the end of the paragraph, it should be fine. ;)

"cell_type": "markdown",
"metadata": {},
"source": [
"![Image](http://dublincore.org/documents/2008/01/14/dc-rdf/rdfexamplefig.png)"
Contributor

As far as I saw on the dublincore page, their content is licensed under the "Creative Commons Attribution 3.0 Unported License" (bottom of the dublincore page). We are allowed to use the graph, as long as we properly cite it, please check the details via the link they provide. ;)

Collaborator Author

@rickskyy rickskyy Oct 14, 2017

I would rather draw my own graph; there are too many copyright instructions on their website.

"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create the example odML document."
Contributor

Could you also provide a link to the odML tutorial at this point? (https://g-node.github.io/python-odml/tutorial.html)

"cell_type": "markdown",
"metadata": {},
"source": [
"##RDFWriter class"
Contributor

Whitespace after ## missing. ;) The same is true for a couple of other headlines below.

"f = tempfile.NamedTemporaryFile(mode='w', suffix=\".ttl\")\n",
"path = f.name\n",
"\n",
"# possible to use 'ttl' instead of 'turtle'\n",
Contributor

Using format ttl here (and in all occurrences below) leads to the error PluginException: No plugin registered for (ttl, <class 'rdflib.serializer.Serializer'>) when I'm running the notebook. I'm using python3.5 with rdflib v4.2.1. Maybe we just use format turtle everywhere to avoid any version conflicts in general.

Collaborator Author

@rickskyy rickskyy Oct 14, 2017

Interesting... I have the same Python version, but my rdflib version is 4.2.2, so that seems to be the cause. Anyway, I agree that it is better to use turtle everywhere to avoid such situations.

"cell_type": "markdown",
"metadata": {},
"source": [
"##SPARQL queries benchmarking"
Contributor

See ## comment above.

@@ -0,0 +1,276 @@
import re
from abc import ABC, abstractmethod
Contributor

Import ABC does not seem to be supported in python2.

Collaborator Author

I assume it works in python 2.7, also Travis has not failed.

self.prepared_queries_list = []
self._subsets = []

def find(self, graph=None, q_str=None, q_params=None):
Contributor

Maybe rename this method to match, since find and fuzzy are already in the class name and this method is supposed to return results of exact matches of section and properties and the provided values; then it will also be better distinguishable from the second method.

Collaborator Author

Agree with renaming to match.

def __init__(self):
super(QueryParser2, self).__init__()

def parse_query_string(self, q_str):
Contributor

I am unsure, if the select and where terminology should be used in this context. The actual query result will not return the phrases that are currently stated after select. Maybe change the phrase select to FIND and where to HAVING to avoid confusion on the user side. What do you think?

Collaborator Author

That is a good point. FIND and HAVING are more appropriate and meaningful here.
@jgrewe do you agree?

if q_str and q_params:
raise ValueError("Please pass query parameters only as string or dict object")

if q_str:
Contributor

With this, find_fuzzy will always use only the query string, if both query string and query parameters are provided. Query parameters will only be used, if no query string is provided. Is this the intended behavior? From the docstring this behavior is not apparent.

Collaborator Author

The idea is that we accept either a query string or the params directly, but not both and not neither (logical XOR). My input validation raises errors according to exactly this logic.
So yes, I will use the query params if no query string is provided.
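
To make the intended rule explicit, here is a minimal sketch of such an "exactly one of the two" check. `pick_query_input` is a hypothetical helper, and the second error message is invented for the example; only the first message is taken from the diff above.

```python
def pick_query_input(q_str=None, q_params=None):
    """Accept exactly one of q_str / q_params and report which one was given."""
    if q_str and q_params:
        raise ValueError(
            "Please pass query parameters only as string or dict object")
    if not q_str and not q_params:
        # Hypothetical message, not part of the reviewed code.
        raise ValueError(
            "Please pass query parameters either as string or as dict object")
    if q_str:
        return "string", q_str
    return "params", q_params


print(pick_query_input(q_str="prop(name:Date)"))
print(pick_query_input(q_params={"Prop": [("name", "Date")]}))
```

With both checks placed up front, the later `if q_str:` branch can never silently discard a `q_params` argument.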

@coveralls

Coverage decreased (-9.0%) to 68.086% when pulling b51fa07 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

"cell_type": "markdown",
"metadata": {},
"source": [
"Quick video about what is SPARQL: https://www.youtuboe.com/watch?v=FvGndkpa4K0\n",
Contributor

Really nice resource, this is perfect here, thx! And a typo snuck into the link. :)

Collaborator Author

@rickskyy rickskyy Oct 17, 2017

What do you mean by "a typo snuck into the link"? Like the '\n' character? I just added a space after the link =)
I think issues like that do not really affect anything, because I cannot see them in Jupyter, only when reading the source.

Contributor

There's an 'o' in youtube ;)

"for file_name in os.listdir(input_dir):\n",
" f = os.path.join(input_dir, file_name)\n",
" if os.path.isfile(f):\n",
" graph.parse(f, format=\"ttl\")\n",
Contributor

Here I get the ''No plugin registered for ttl" error again.

}
],
"source": [
"from odml.tools.fuzzy_finder import FuzzyFinder\n",
Contributor

I tested the notebook with both python2 and python3. Works like a charm for python3, with python2 at this point I get the following error: ImportError: cannot import name ABC from query_creator.py line 2.

Collaborator Author

@rickskyy rickskyy Oct 17, 2017

Changed inheritance of (ABC) to __metaclass__ = ABCMeta.
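
A note on the Python 2/3 question: a `__metaclass__ = ABCMeta` class attribute only takes effect on Python 2; Python 3 ignores it silently, so abstract methods are not enforced there. One portable alternative is to create the base class by calling ABCMeta directly, as sketched below. `BaseParser` and `MatchParser` are hypothetical names and not the classes from this PR.

```python
from abc import ABCMeta, abstractmethod

# Calling the metaclass creates an abstract base class on Python 2 and 3 alike.
ABC = ABCMeta("ABC", (object,), {})


class BaseParser(ABC):
    """Hypothetical abstract parser used only for this example."""

    @abstractmethod
    def parse_query_string(self, q_str):
        """Turn a query string into a dict of parameters."""


class MatchParser(BaseParser):
    def parse_query_string(self, q_str):
        return {"raw": q_str}


print(MatchParser().parse_query_string("prop(name:Date)"))

try:
    BaseParser()
except TypeError as err:
    # Instantiating the abstract class is rejected on both Python versions.
    print("abstract class rejected:", err)
```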

"cell_type": "markdown",
"metadata": {},
"source": [
"**RDF (Resource Description Framework)** is one of the three foundational [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) technologies, the other two being SPARQL and OWL.\n",
Contributor

If you add "(Resource Description Framework, Wikipedia, 2017)" to the end of the paragraph, it should be fine. ;)

@coveralls

Coverage decreased (-9.02%) to 68.026% when pulling 60ee8e2 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

@mpsonntag mpsonntag merged commit 0fa95f4 into G-Node:dev-odml-rdf Oct 17, 2017
@rickskyy rickskyy deleted the finder branch October 17, 2017 14:44