Fuzzy Finder prototype #173

rickskyy · 2017-09-18T07:28:20Z

This PR includes:

QueryCreator class.
FuzzyFinder class. RDF Graph "fuzzy finder" prototype #171
Jupyter notebook documentation for implemented RDF tools. RDF tools Jupyter doc #174
Added some example files from the gin/drosophila repo to doc/example_rdfs/example_data

QueryCreator is a tool that simplifies the creation of prepared SPARQL queries.

Example:
q = "doc(author:D. N. Adams) section(name:Stimulus) prop(name:Contrast, value:[20], unit:%)"
prepared_query = FuzzyFinder().get_query(q)
print(prepared_query)

<rdflib.plugins.sparql.sparql.Query object at 0x7fbb6dbb4c50>

SELECT * WHERE {
?d rdf:type odml:Document .
?d odml:hasAuthor "D. N. Adams" .
?d odml:hasSection ?s .
?s rdf:type odml:Section .
?s odml:hasName "Stimulus" .
?s odml:hasProperty ?p .
?p rdf:type odml:Property .
?p odml:hasName "Contrast" .
?p odml:hasUnit "%" .
?p odml:hasValue ?v .
?v rdf:type rdf:Bag .
?v rdf:li "20" .
}, initNs={"odml": Namespace("https://g-node.org/projects/odml-rdf#"), "rdf": RDF})
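
To illustrate how a parsed query can map onto the triple patterns shown above, here is a minimal sketch. It is not the QueryCreator implementation: `build_where_clause`, the `ENTITIES` table and the `Doc`/`Prop` dictionary keys are hypothetical names chosen for this example, and the `value` attribute (the rdf:Bag part of the output) is deliberately left out.

```python
# Hypothetical table: (dict key, SPARQL variable, RDF type, predicate linking parent to this entity)
ENTITIES = [
    ("Doc", "?d", "odml:Document", None),
    ("Sec", "?s", "odml:Section", "odml:hasSection"),
    ("Prop", "?p", "odml:Property", "odml:hasProperty"),
]


def build_where_clause(q_dict):
    """Render a SELECT query string from {entity: [(attribute, value), ...]}."""
    lines = []
    parent_var = None
    for key, var, rdf_type, link in ENTITIES:
        if key not in q_dict:
            # Simplification: only link entities that are direct neighbours in the chain.
            parent_var = None
            continue
        if parent_var is not None:
            lines.append("%s %s %s ." % (parent_var, link, var))
        lines.append("%s rdf:type %s ." % (var, rdf_type))
        for attr, value in q_dict[key]:
            # e.g. "name" becomes odml:hasName, as in the output above
            predicate = "odml:has" + attr[0].upper() + attr[1:]
            lines.append('%s %s "%s" .' % (var, predicate, value))
        parent_var = var
    return "SELECT * WHERE {\n" + "\n".join(lines) + "\n}"


example = {"Doc": [("author", "D. N. Adams")], "Sec": [("name", "Stimulus")]}
print(build_where_clause(example))
```

The resulting string could then be handed to a SPARQL engine together with the `odml` and `rdf` namespace bindings.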

FuzzyFinder is a tool for querying a graph with fuzzy queries. The finder executes multiple queries to match as many input parameters as possible and returns sets of triples, ordered from the most to the fewest matched parameters.

The prototype supports 2 modes:

FuzzyFinder.find(mode='match')

The tool builds multiple SPARQL queries from 'match' queries, executes them and returns the matching results. The first result always represents the most specific query (the largest combination of input parameters that returned at least one triple).
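
The "most specific query first" behaviour can be sketched as follows. This is only an illustration of the idea, assuming the finder tries parameter subsets from largest to smallest; `subsets_by_size`, `first_matching_subset` and the `run_query` callable are hypothetical and do not come from the PR code.

```python
from itertools import combinations


def subsets_by_size(params):
    """Yield all non-empty subsets of params, largest subsets first."""
    for size in range(len(params), 0, -1):
        for combo in combinations(params, size):
            yield combo


def first_matching_subset(params, run_query):
    """Return the largest parameter subset for which run_query yields results.

    run_query is a hypothetical callable that takes a tuple of parameters
    and returns a (possibly empty) list of results.
    """
    for subset in subsets_by_size(params):
        results = run_query(subset)
        if results:
            return subset, results
    return None, []


# Toy stand-in for a graph: a subset "matches" if all of its parameters are known.
known = {("prop", "name", "Date"), ("sec", "type", "Recording")}
params = [
    ("prop", "name", "Date"),
    ("sec", "name", "Wrong"),
    ("sec", "type", "Recording"),
]
best, hits = first_matching_subset(
    params, lambda subset: ["hit"] if set(subset) <= known else []
)
```

In the toy data the full three-parameter query finds nothing, so the first result comes from the two-parameter subset that omits the non-matching section name.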

The query syntax is straightforward. Write the name of the entity (property, section or document; the shortened names prop, sec and doc also work) and add the attributes inside the parentheses, with each attribute separated from its value by a colon.

Example from code: prop(name:Date) section(name:Recording-2013-02-08-ak, type:Recording).
Here we search for properties whose name attribute has the value Date, and for sections whose name is Recording-2013-02-08-ak and whose type is Recording.

To build 'match' queries you need to know exactly which odML attribute each value belongs to. If you write prop(name:Date) section(name:Recording, type:Recording-2013-02-08-ak), the find() method will not return any triples for the section parameters, because there is most likely no section with type Recording-2013-02-08-ak.

Attributes that do not belong to the odML entity are also ignored (e.g. only id, author, date, version, repository and sections can exist in the Document object).
For the example section(not-odml-name:Recording-2013-02-08-ak, record:Recording) the find method returns nothing.
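
The 'match' syntax and the attribute filtering described above could be parsed roughly like this. It is a sketch under assumptions, not the parser from the PR: `parse_match_query`, `ALLOWED` and `CANONICAL` are hypothetical names, the Property attribute list only contains the attributes used in the example query, and values that contain commas are not handled.

```python
import re

# Entity names and their shortened forms, mapped to one key per entity type.
CANONICAL = {
    "doc": "Doc", "document": "Doc",
    "sec": "Sec", "section": "Sec",
    "prop": "Prop", "property": "Prop",
}

# Attributes accepted per entity; the Prop set is an assumption for this sketch.
ALLOWED = {
    "Doc": {"id", "author", "date", "version", "repository", "sections"},
    "Sec": {"id", "name", "definition", "type", "repository", "reference",
            "sections", "properties"},
    "Prop": {"name", "value", "unit"},
}

ENTITY_PATTERN = re.compile(r"(doc|document|sec|section|prop|property)\((.*?)\)")


def parse_match_query(q_str):
    """Return {entity: [(attribute, value), ...]}, dropping unknown attributes."""
    q_dict = {}
    for match in ENTITY_PATTERN.finditer(q_str):
        key = CANONICAL[match.group(1)]
        pairs = []
        for chunk in match.group(2).split(","):
            attr, sep, value = chunk.partition(":")
            attr, value = attr.strip(), value.strip()
            if sep and attr in ALLOWED[key]:
                pairs.append((attr, value))
        q_dict[key] = pairs
    return q_dict


parsed = parse_match_query(
    "prop(name:Date) section(name:Recording-2013-02-08-ak, type:Recording)"
)
ignored = parse_match_query(
    "section(not-odml-name:Recording-2013-02-08-ak, record:Recording)"
)
```

In the second call both attributes are unknown for a section, so nothing is left to query for.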

FuzzyFinder.find(mode='fuzzy')

The output logic is similar to the previous mode, but here you can provide broader information; the finder matches the parameters and creates meaningful queries based on the input.

The query string consists of two parts: FIND and HAVING.

In the FIND part a user specifies the set of odML objects and their attributes.
e.g. FIND prop(name) section(name, type)

In the HAVING part a user specifies the set of search values that could relate to the attributes in the FIND part.
e.g. HAVING Recording, Recording-2012-04-04-ab, Date

Finally, the complete query will look like this:
FIND sec(name, type) prop(name) HAVING Recording, Recording-2012-04-04-ab, Date
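
As an illustration of the two-part query string, the following sketch splits a FIND/HAVING query into its targets and search values. `parse_fuzzy_query` is a hypothetical helper written for this example, not the parser shipped in the PR, and it assumes the keyword HAVING does not occur inside a search value.

```python
import re


def parse_fuzzy_query(q_str):
    """Split 'FIND ... HAVING ...' into ({entity: [attributes]}, [values])."""
    find_part, sep, having_part = q_str.partition("HAVING")
    find_part = find_part.strip()
    if not sep or not find_part.startswith("FIND"):
        raise ValueError("Expected a query of the form 'FIND ... HAVING ...'")

    targets = {}
    # Each target looks like name(attr1, attr2)
    for name, attrs in re.findall(r"(\w+)\((.*?)\)", find_part[len("FIND"):]):
        targets[name] = [a.strip() for a in attrs.split(",") if a.strip()]

    values = [v.strip() for v in having_part.split(",") if v.strip()]
    return targets, values


targets, values = parse_fuzzy_query(
    "FIND sec(name, type) prop(name) HAVING Recording, Recording-2012-04-04-ab, Date"
)
```

The finder would then have to decide which of the values fits which of the requested attributes; that matching step is not part of this sketch.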

Please read the RDF_tools.ipynb Jupyter notebook documentation for more details and examples.

coveralls · 2017-09-18T07:34:04Z

Coverage decreased (-2.7%) to 74.317% when pulling 59a4b14 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

mpsonntag · 2017-09-18T15:32:35Z

odml/tools/fuzzy_finder.py

+
+        sec_pattern = re.compile("(sec|section)\(.*?\)")
+        self._parse_sec(re.search(sec_pattern, q_str))
+        print(re.search(sec_pattern, q_str))


Just a minor comment: Is this a debug line or is it required for something?

mpsonntag · 2017-09-18T15:32:54Z

odml/tools/fuzzy_finder.py

+    def _parse_sec(self, sec):
+        p = re.compile("(id|name|definition|type|repository|reference|sections|properties):(.*?)[,|\)]")
+        self.q_dict['Sec'] = re.findall(p, sec.group(0))
+        print("sec sdfds ", re.findall(p, sec.group(0)))


Same minor comment as above. ;)
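
For reference, a sketch of how the two patterns from the diff behave, with the debug `print` calls swapped for `logging`. The patterns are copied from the diff and only rewritten as raw strings; the module-level function `parse_sec`, the logger and the guard for a missing section are additions of this sketch, not part of the PR. Note that `|` inside the character class `[,|\)]` is matched literally.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Same patterns as in the diff, written as raw strings.
SEC_PATTERN = re.compile(r"(sec|section)\(.*?\)")
SEC_ATTR_PATTERN = re.compile(
    r"(id|name|definition|type|repository|reference|sections|properties):(.*?)[,|\)]"
)


def parse_sec(q_str):
    """Return the (attribute, value) pairs of the first section block in q_str."""
    match = SEC_PATTERN.search(q_str)
    if match is None:
        # Assumption of this sketch: a query without a section yields no pairs.
        return []
    pairs = SEC_ATTR_PATTERN.findall(match.group(0))
    # Only visible when debug logging is enabled, unlike a bare print().
    logger.debug("parsed section attributes: %s", pairs)
    return pairs


pairs = parse_sec("prop(name:Date) section(name:Recording-2013-02-08-ak, type:Recording)")
```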

coveralls · 2017-10-01T22:21:32Z

Coverage decreased (-5.7%) to 71.343% when pulling 8d51a05 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

coveralls · 2017-10-08T22:50:52Z

Coverage decreased (-9.0%) to 68.086% when pulling de98dc4 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

coveralls · 2017-10-08T23:02:24Z

Coverage decreased (-9.0%) to 68.086% when pulling 7bf55cb on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

coveralls · 2017-10-08T23:09:03Z

Coverage decreased (-9.0%) to 68.086% when pulling 52a5584 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

coveralls · 2017-10-09T10:00:55Z

Coverage decreased (-9.0%) to 68.086% when pulling c48405b on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

mpsonntag · 2017-10-10T09:28:24Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**RDF (Resource Description Framework)** is one of the three foundational [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) technologies, the other two being SPARQL and OWL.\n",


Please provide the proper citation or rewrite this paragraph.

If you add "(Resource Description Framework, Wikipedia, 2017)" to the end of the paragraph, it should be fine. ;)

mpsonntag · 2017-10-10T10:12:44Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Image](http://dublincore.org/documents/2008/01/14/dc-rdf/rdfexamplefig.png)"


As far as I saw on the dublincore page, their content is licensed under the "Creative Commons Attribution 3.0 Unported License" (bottom of the dublincore page). We are allowed to use the graph, as long as we properly cite it, please check the details via the link they provide. ;)

I would rather draw my own graph; there are too many copyright instructions on their website.

mpsonntag · 2017-10-10T10:13:05Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's create the example odML document."


Could you also provide a link to the odML tutorial at this point? (https://g-node.github.io/python-odml/tutorial.html)

mpsonntag · 2017-10-10T10:13:14Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##RDFWriter class"


Whitespace after ## missing. ;) The same is true for a couple of other headlines below.

mpsonntag · 2017-10-10T10:14:00Z

doc/RDF_tools.ipynb

+    "f = tempfile.NamedTemporaryFile(mode='w', suffix=\".ttl\")\n",
+    "path = f.name\n",
+    "\n",
+    "# possible to use 'ttl' instead of 'turtle'\n",


Using format ttl here (and in all occurrences below) leads to the error PluginException: No plugin registered for (ttl, <class 'rdflib.serializer.Serializer'>) when I'm running the notebook. I'm using python3.5 with rdflib v4.2.1. Maybe we just use format turtle everywhere to avoid any version conflicts in general.

Interesting... I have the same Python version but rdflib 4.2.2, so the version difference seems to be the cause. Anyway, I agree that it is better to use turtle everywhere to avoid such situations.

mpsonntag · 2017-10-10T10:17:20Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##SPARQL queries benchmarking"


See ## comment above.

mpsonntag · 2017-10-10T11:08:51Z

odml/tools/query_creator.py

@@ -0,0 +1,276 @@
+import re
+from abc import ABC, abstractmethod


Import ABC does not seem to be supported in python2.

I assume it works in python 2.7, also Travis has not failed.

mpsonntag · 2017-10-10T11:44:44Z

odml/tools/fuzzy_finder.py

+        self.prepared_queries_list = []
+        self._subsets = []
+
+    def find(self, graph=None, q_str=None, q_params=None):


Maybe rename this method to match, since find and fuzzy are already in the class name and this method is supposed to return results of exact matches of section and properties and the provided values; then it will also be better distinguishable from the second method.

Agree with renaming to match.

mpsonntag · 2017-10-10T11:58:11Z

odml/tools/query_creator.py

+    def __init__(self):
+        super(QueryParser2, self).__init__()
+
+    def parse_query_string(self, q_str):


I am unsure, if the select and where terminology should be used in this context. The actual query result will not return the phrases that are currently stated after select. Maybe change the phrase select to FIND and where to HAVING to avoid confusion on the user side. What do you think?

That is a good point. FIND and HAVING are more appropriate and meaningful there.
@jgrewe do you agree?

mpsonntag · 2017-10-10T12:18:28Z

odml/tools/fuzzy_finder.py

+        if q_str and q_params:
+            raise ValueError("Please pass query parameters only as string or dict object")
+
+        if q_str:


With this, find_fuzzy will always use only the query string, if both query string and query parameters are provided. Query parameters will only be used, if no query string is provided. Is this the intended behavior? From the docstring this behavior is not apparent.

The idea is that we accept either a query string or the params directly, but not both and not neither (logical XOR). My input validation raises errors according to exactly this logic.
So yes, I will use the query params if no query string is provided.
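
The either-or validation described in this reply could look like the sketch below. `select_query_input` is a hypothetical helper; the first error message is taken from the diff, the second one is made up for this example.

```python
def select_query_input(q_str=None, q_params=None):
    """Accept exactly one of q_str and q_params (logical XOR)."""
    if q_str and q_params:
        raise ValueError("Please pass query parameters only as string or dict object")
    if not q_str and not q_params:
        # Hypothetical message; the PR's wording for this case is not shown.
        raise ValueError("Please pass query parameters either as string or dict object")
    if q_str:
        return "string", q_str
    return "params", q_params


kind, payload = select_query_input(q_str="prop(name:Date)")
```

Stating this rule in the docstring as well would address the reviewer's point that the behaviour is not apparent to callers.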

coveralls · 2017-10-15T13:21:11Z

Coverage decreased (-9.0%) to 68.086% when pulling b51fa07 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

mpsonntag · 2017-10-17T07:57:16Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Quick video about what is SPARQL: https://www.youtuboe.com/watch?v=FvGndkpa4K0\n",


Really nice resource, this is perfect here, thx! And a typo snuck into the link. :)

What do you mean by "a typo snuck into the link"? Like the '\n' character? I just added a space after the link =)
I think issues like that do not really affect anything, because I cannot see them in Jupyter, only when reading the source.

There's an 'o' in youtube ;)

mpsonntag · 2017-10-17T07:58:38Z

doc/RDF_tools.ipynb

+    "for file_name in os.listdir(input_dir):\n",
+    "    f = os.path.join(input_dir, file_name)\n",
+    "    if os.path.isfile(f):\n",
+    "        graph.parse(f, format=\"ttl\")\n",


Here I get the "No plugin registered for ttl" error again.

mpsonntag · 2017-10-17T08:12:09Z

doc/RDF_tools.ipynb

+    }
+   ],
+   "source": [
+    "from odml.tools.fuzzy_finder import FuzzyFinder\n",


I tested the notebook with both python2 and python3. Works like a charm for python3, with python2 at this point I get the following error: ImportError: cannot import name ABC from query_creator.py line 2.

Changed inheritance of (ABC) to __metaclass__ = ABCMeta.
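
For context, one commonly used way to declare an abstract base class that behaves the same on Python 2.7 and Python 3 is to create the base by calling `ABCMeta` directly. This is offered as a possible alternative, not as what the PR does: a `__metaclass__` class attribute is only honoured by Python 2, so on Python 3 it does not enforce abstract methods. `BaseQueryParser` and `EchoParser` are hypothetical names; `parse_query_string` is the method name from the diff.

```python
from abc import ABCMeta, abstractmethod

# Calling the metaclass builds an ABC-like base without the Python 3 only
# "from abc import ABC" and without the Python 2 only __metaclass__ attribute.
ABC = ABCMeta("ABC", (object,), {})


class BaseQueryParser(ABC):
    """Hypothetical abstract parser used only for this sketch."""

    @abstractmethod
    def parse_query_string(self, q_str):
        """Return a parsed representation of q_str."""


class EchoParser(BaseQueryParser):
    """Minimal concrete subclass that just wraps the input."""

    def parse_query_string(self, q_str):
        return {"raw": q_str}


parser = EchoParser()
result = parser.parse_query_string("prop(name:Date)")
```

Instantiating `BaseQueryParser` directly raises a `TypeError` on both interpreter lines, which is the enforcement the abstract base is meant to provide.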

mpsonntag · 2017-10-17T08:15:35Z

doc/RDF_tools.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**RDF (Resource Description Framework)** is one of the three foundational [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) technologies, the other two being SPARQL and OWL.\n",


If you add "(Resource Description Framework, Wikipedia, 2017)" to the end of the paragraph, it should be fine. ;)

coveralls · 2017-10-17T10:11:43Z

Coverage decreased (-9.02%) to 68.026% when pulling 60ee8e2 on rickskyy:finder into ae93c27 on G-Node:dev-odml-rdf.

rickskyy added 2 commits September 17, 2017 14:33

[rdf] fixes

a62e880

[rdf] fuzzy finder prototype

59a4b14

mpsonntag approved these changes Sep 18, 2017


rickskyy added 4 commits October 2, 2017 00:48

[rdf] fuzzy finder implementation

90ad6a9

[rdf] query creator improvements

042e4ce

[rdf] jupyter notebook documentation for RDF tools

f1e09b5

[rdf] add drosophila files for jupyter notebook

8d51a05

rickskyy added 2 commits October 9, 2017 01:47

[rdf] extend query creator with additional parser

aae1e07

[rdf] extend fuzzy finder with new prototype fuzzy search function

de98dc4

[rdf] comments changes

52a5584

rickskyy force-pushed the finder branch from 7bf55cb to 52a5584 Compare October 8, 2017 23:07

[rdf] rdf tools jupyter documentation updates

c48405b

mpsonntag mentioned this pull request Oct 10, 2017

RDF tools Jupyter doc #174

Closed

mpsonntag requested changes Oct 10, 2017


rickskyy added 4 commits October 15, 2017 16:16

[rdf] fixes and structure improvements for FuzzyFinder

4f4af41

[rdf] fixes and structure improvements for QueryCreator

e97cd88

[rdf] update rdf example files directory

2adf81f

[rdf] update jupyter notebook

b51fa07

mpsonntag requested changes Oct 17, 2017


rickskyy added 2 commits October 17, 2017 13:07

[rdf] python 2 abc fixes

2829844

[rdf] jupyter updates

60ee8e2

mpsonntag approved these changes Oct 17, 2017


mpsonntag merged commit 0fa95f4 into G-Node:dev-odml-rdf Oct 17, 2017

rickskyy deleted the finder branch October 17, 2017 14:44
