Fuzzy Finder prototype #173
odml/tools/fuzzy_finder.py
Outdated
    sec_pattern = re.compile("(sec|section)\(.*?\)")
    self._parse_sec(re.search(sec_pattern, q_str))
    print(re.search(sec_pattern, q_str))
Just a minor comment: Is this a debug line or is it required for something?
odml/tools/fuzzy_finder.py
Outdated
    def _parse_sec(self, sec):
        p = re.compile("(id|name|definition|type|repository|reference|sections|properties):(.*?)[,|\)]")
        self.q_dict['Sec'] = re.findall(p, sec.group(0))
        print("sec sdfds ", re.findall(p, sec.group(0)))
Same minor comment as above. ;)
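For reference, a minimal sketch of what the two patterns quoted above extract from a query string. The patterns are copied from the reviewed snippets and written as raw strings; the sample query and the variable names are assumptions made for illustration, not the code under review.

```python
import re

# Patterns copied from the reviewed snippets, written as raw strings.
sec_pattern = re.compile(r"(sec|section)\(.*?\)")
attr_pattern = re.compile(
    r"(id|name|definition|type|repository|reference|sections|properties):(.*?)[,|\)]"
)

# Assumed sample query, modelled on the syntax in the PR description.
q_str = "section(name:Stimulus, type:Recording) prop(name:Contrast)"

# First isolate the section(...) part, then collect its attribute/value pairs.
sec_match = sec_pattern.search(q_str)
sec_attrs = attr_pattern.findall(sec_match.group(0))
```

Inspecting `sec_attrs` in a test or through a logger would give the same information as the debug `print` calls without leaving output in library code.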
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "**RDF (Resource Description Framework)** is one of the three foundational [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) technologies, the other two being SPARQL and OWL.\n",
Please provide the proper citation or rewrite this paragraph.
If you add "(Resource Description Framework, Wikipedia, 2017)" to the end of the paragraph, it should be fine. ;)
doc/RDF_tools.ipynb
Outdated
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     ""
As far as I saw on the Dublin Core page, their content is licensed under the "Creative Commons Attribution 3.0 Unported License" (bottom of the Dublin Core page). We are allowed to use the graph as long as we cite it properly; please check the details via the link they provide. ;)
I would rather draw my own graph; there are too many copyright instructions on their website.
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Let's create the example odML document."
Could you also provide a link to the odML tutorial at this point? (https://g-node.github.io/python-odml/tutorial.html)
doc/RDF_tools.ipynb
Outdated
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "##RDFWriter class"
Whitespace after ## missing. ;) The same is true for a couple of other headlines below.
doc/RDF_tools.ipynb
Outdated
    "f = tempfile.NamedTemporaryFile(mode='w', suffix=\".ttl\")\n",
    "path = f.name\n",
    "\n",
    "# possible to use 'ttl' instead of 'turtle'\n",
Using format ttl here (and in all occurrences below) leads to the error PluginException: No plugin registered for (ttl, <class 'rdflib.serializer.Serializer'>) when I'm running the notebook. I'm using python3.5 with rdflib v4.2.1. Maybe we just use format turtle everywhere to avoid any version conflicts in general.
Interesting... I have the same Python version, but my rdflib version is 4.2.2, so that seems to be the cause. Anyway, I agree that it is better to use turtle everywhere to avoid such situations.
doc/RDF_tools.ipynb
Outdated
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "##SPARQL queries benchmarking"
See ## comment above.
odml/tools/query_creator.py
Outdated
    @@ -0,0 +1,276 @@
    import re
    from abc import ABC, abstractmethod
Import ABC does not seem to be supported in python2.
I assume it works in python 2.7, also Travis has not failed.
odml/tools/fuzzy_finder.py
Outdated
        self.prepared_queries_list = []
        self._subsets = []

    def find(self, graph=None, q_str=None, q_params=None):
Maybe rename this method to match, since find and fuzzy are already in the class name and this method is supposed to return results of exact matches of section and properties and the provided values; then it will also be better distinguishable from the second method.
Agree with renaming to match.
    def __init__(self):
        super(QueryParser2, self).__init__()

    def parse_query_string(self, q_str):
I am unsure, if the select and where terminology should be used in this context. The actual query result will not return the phrases that are currently stated after select. Maybe change the phrase select to FIND and where to HAVING to avoid confusion on the user side. What do you think?
That is a good point. FIND and HAVING are more appropriate and meaningful there.
@jgrewe do you agree?
        if q_str and q_params:
            raise ValueError("Please pass query parameters only as string or dict object")

        if q_str:
With this, find_fuzzy will always use only the query string, if both query string and query parameters are provided. Query parameters will only be used, if no query string is provided. Is this the intended behavior? From the docstring this behavior is not apparent.
The idea is that we accept either a query string or the query parameters directly, but not both and not neither (logical XOR). My input validation raises errors according to exactly this logic.
So yes, I will use the query params if no query string is provided.
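The either-or rule described in this reply could be made explicit in one small validation step. The following is only a sketch: `select_query_input` is a hypothetical helper, and the error message for the "nothing passed" case is an assumption, since only the "both passed" message appears in the diff.

```python
def select_query_input(q_str=None, q_params=None):
    """Return the one query input that was provided (hypothetical helper)."""
    # Both inputs given: message taken from the reviewed snippet.
    if q_str and q_params:
        raise ValueError("Please pass query parameters only as string or dict object")
    # Neither input given: the wording of this message is an assumption.
    if not q_str and not q_params:
        raise ValueError("Please pass query parameters as string or dict object")
    # Exactly one input is set at this point.
    return q_str if q_str else q_params
```

Stating the same rule in the docstring of the public method would address the reviewer's point that the behavior is not apparent from the documentation.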
doc/RDF_tools.ipynb
Outdated
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Quick video about what is SPARQL: https://www.youtuboe.com/watch?v=FvGndkpa4K0\n",
Really nice resource, this is perfect here, thx! And a typo snuck into the link. :)
What do you mean by "a typo snuck into the link"? Like the '\n' character? I just added a space after the link =)
I think issues like that do not really influence anything, because I cannot see them in Jupyter, only when reading the source.
There's an 'o' in youtube ;)
doc/RDF_tools.ipynb
Outdated
    "for file_name in os.listdir(input_dir):\n",
    "    f = os.path.join(input_dir, file_name)\n",
    "    if os.path.isfile(f):\n",
    "        graph.parse(f, format=\"ttl\")\n",
Here I get the "No plugin registered for ttl" error again.
    }
    ],
    "source": [
     "from odml.tools.fuzzy_finder import FuzzyFinder\n",
I tested the notebook with both python2 and python3. Works like a charm for python3, with python2 at this point I get the following error: ImportError: cannot import name ABC from query_creator.py line 2.
Changed inheritance of (ABC) to __metaclass__ = ABCMeta.
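A note on this fix: `from abc import ABC` is only available from Python 3.4 on, and a `__metaclass__` class attribute is only honored by Python 2, so on Python 3 it would no longer enforce the abstract methods. One way to cover both interpreters is to build the base class by calling the metaclass directly. This is a minimal sketch with hypothetical class names, not the code in `query_creator.py`.

```python
from abc import ABCMeta, abstractmethod

# Calling the metaclass directly creates an abstract base class with syntax
# that is valid on both Python 2.7 and Python 3.
ABC = ABCMeta("ABC", (object,), {})


class BaseQueryCreator(ABC):  # hypothetical class name
    @abstractmethod
    def get_query(self, q_str):
        """Return a prepared query for the given query string."""


class EchoQueryCreator(BaseQueryCreator):  # hypothetical subclass
    def get_query(self, q_str):
        # Trivial implementation, only here to show that subclassing works.
        return q_str
```

With this layout, instantiating `BaseQueryCreator` raises a `TypeError` on either interpreter, while the concrete subclass can be created as usual.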
This PR includes:
doc/example_rdfs/example_data

QueryCreator is the tool for simplifying the creation of prepared SPARQL queries.
Example:
    q = "doc(author:D. N. Adams) section(name:Stimulus) prop(name:Contrast, value:[20], unit:%)"
    prepared_query = FuzzyFinder().get_query(q)
    print(prepared_query)
    <rdflib.plugins.sparql.sparql.Query object at 0x7fbb6dbb4c50>
    SELECT * WHERE {
    ?d rdf:type odml:Document .
    ?d odml:hasAuthor "D. N. Adams" .
    ?d odml:hasSection ?s .
    ?s rdf:type odml:Section .
    ?s odml:hasName "Stimulus" .
    ?s odml:hasProperty ?p .
    ?p rdf:type odml:Property .
    ?p odml:hasName "Contrast" .
    ?p odml:hasUnit "%" .
    ?p odml:hasValue ?v .
    ?v rdf:type rdf:Bag .
    ?v rdf:li "20" .
    }, initNs={"odml": Namespace("https://g-node.org/projects/odml-rdf#"), "rdf": RDF})

FuzzyFinder is the tool for querying a graph through fuzzy queries. The finder executes multiple queries to better match the input parameters and returns sets of triples, ordered from the most to the fewest matched parameters.
The prototype supports 2 modes:
FuzzyFinder.find(mode='match')
The tool builds multiple SPARQL queries from 'match' queries, executes them and returns the matched results. The first result always represents the most specific query (the largest combination of input parameters that returned at least one triple).
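The "most specific query first" ordering can be pictured as trying parameter combinations from the largest to the smallest. The sketch below is an assumption about the general idea only: `parameter_subsets`, `most_specific_match` and the `run_query` callback are hypothetical names, and the real FuzzyFinder may build and rank its queries differently.

```python
from itertools import combinations


def parameter_subsets(params):
    """Yield subsets of the input parameters, largest combinations first."""
    items = sorted(params.items())
    for size in range(len(items), 0, -1):
        for subset in combinations(items, size):
            yield dict(subset)


def most_specific_match(params, run_query):
    """Return the first (largest) subset whose query yields any result.

    run_query is a hypothetical callback that takes a parameter dict and
    returns a list of matches; it stands in for executing a SPARQL query.
    """
    for subset in parameter_subsets(params):
        result = run_query(subset)
        if result:
            return subset, result
    return None, []
```

For two input parameters this tries the pair first and then each parameter on its own, so a result for the pair always ranks ahead of a result for a single parameter.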
The query syntax is pretty straightforward. Just write the name of the entity, property, section or document (the shortened names prop, sec and doc are also possible), and add attributes with their values inside the parentheses, with a colon between each attribute and its value.
Example from code:
    prop(name:Date) section(name:Recording-2013-02-08-ak, type:Recording)
Here we search for sections and properties where the property has the attribute name and its value is Date.
To build 'match' queries you need to know exactly which odML attribute each value (subject) relates to. So if you write
    prop(name:Date) section(name:Recording, type:Recording-2013-02-08-ak)
the find() method will not return any triples with section parameters, because it is likely that there is no section with the type Recording-2013-02-08-ak.
Attributes that do not belong to an odML entity are also ignored (e.g. only id, author, date, version, repository, sections can exist in the Document object). In the example
    section(not-odml-name:Recording-2013-02-08-ak, record:Recording)
the find method returns nothing.

FuzzyFinder.find(mode='fuzzy')
The output logic is similar to the previous mode, but here you can provide broader information; the finder will match the parameters and create meaningful queries based on the input.
The query string consists of two parts: FIND and HAVING.
In the FIND part a user specifies the set of odML objects and their attributes, e.g.
    FIND prop(name) section(name, type)
In the HAVING part a user specifies a set of search values which could relate to the attributes in the FIND part, e.g.
    HAVING Recording, Recording-2012-04-04-ab, Date
Finally, the complete query will look like this:
    FIND sec(name, type) prop(name) HAVING Recording, Recording-2012-04-04-ab, Date
Please read the RDF_tools.ipynb Jupyter notebook documentation for more details and examples.
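To illustrate the two-part structure, here is a minimal sketch that splits a FIND/HAVING string into the requested attributes and the search values. `split_fuzzy_query` is a hypothetical helper written for this illustration; it is not the parser from this PR and makes no claim about how FuzzyFinder processes the string internally.

```python
import re


def split_fuzzy_query(q_str):
    """Split 'FIND ... HAVING ...' into entity attributes and search values."""
    match = re.match(r"\s*FIND\s+(.*?)\s+HAVING\s+(.*)", q_str,
                     flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("Expected a query of the form 'FIND ... HAVING ...'")
    find_part, having_part = match.groups()

    # FIND part: entity names followed by a parenthesised attribute list.
    entities = {
        name: [attr.strip() for attr in attrs.split(",") if attr.strip()]
        for name, attrs in re.findall(r"(\w+)\(([^)]*)\)", find_part)
    }
    # HAVING part: comma-separated search values.
    values = [value.strip() for value in having_part.split(",") if value.strip()]
    return entities, values


entities, values = split_fuzzy_query(
    "FIND sec(name, type) prop(name) HAVING Recording, Recording-2012-04-04-ab, Date"
)
```

For the complete query shown above, `entities` maps `sec` to `name` and `type` and `prop` to `name`, while `values` holds the three search terms. Deciding which value belongs to which attribute is the fuzzy matching step and is left out of this sketch.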