This repository has been archived by the owner on Sep 24, 2019. It is now read-only.

Filter query parameters broken for several annotations #45

Closed
nbargnesi opened this issue Feb 24, 2015 · 8 comments

@nbargnesi
Member

Using the following APIs:

  1. /api/annotations/{annotation}/values
  2. /api/annotations/values

I always get no results (404) when searching the following annotations:

  1. Experimental Factor Ontology aka efo
  2. Uberon aka uberon
  3. NCBI Taxonomy aka taxon

I labeled this as a bug though I may not be covering enough search terms in uberon and efo. I would have expected a result searching for 9606 in taxon.

nbargnesi added the bug label Feb 24, 2015

@abargnesi
Member

Annotation (and namespace) search doesn't do partial matches unless you use wildcards. This implies that the search parameter is a pass-through for the underlying FTS mechanism (SQLite FTS4). Here are some example requests:

EFO Example

GET http://next.belframework.org/api/annotations/efo/values?filter={%22category%22:%22fts%22,%22name%22:%22search%22,%22value%22:%22*cell%20line*%22}

Uberon

GET http://next.belframework.org/api/annotations/uberon/values?filter={%22category%22:%22fts%22,%22name%22:%22search%22,%22value%22:%22*lung*%22}

Currently species annotations are not searchable. A fix for this is forthcoming.
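
The example requests above can be rebuilt programmatically. This is a minimal sketch, assuming the filter is a JSON object that is percent-encoded into the filter query parameter exactly as in the EFO and Uberon examples; the helper name build_search_url is hypothetical and not part of the API.

```python
import json
from urllib.parse import quote

# Base URL taken from the example requests in this thread.
BASE = "http://next.belframework.org/api/annotations"


def build_search_url(annotation, value):
    """Return a values-search URL for one annotation (hypothetical helper)."""
    # The filter is a small JSON object; compact separators avoid stray spaces.
    filter_json = json.dumps(
        {"category": "fts", "name": "search", "value": value},
        separators=(",", ":"),
    )
    # Encode quotes and spaces, but leave { } : , * bare as in the examples.
    encoded = quote(filter_json, safe="{}:,*")
    return f"{BASE}/{annotation}/values?filter={encoded}"


print(build_search_url("efo", "*cell line*"))
print(build_search_url("uberon", "*lung*"))
```

The wildcards stay inside the value field, so the search string is passed through to the FTS layer unchanged.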

@nbargnesi
Member Author

It appears to do partial matches now w/out wildcards. Can I leave them out?

@abargnesi
Member

The wildcards are required. You might be matching on the whole token value. For example, searching on tiss will not find results with the tissue token. You will have to use either the whole term or a wildcard:

http://next.belframework.org/api/annotations/values?filter={"category":"fts","name":"search","value":"tissue"}

http://next.belframework.org/api/annotations/values?filter={"category":"fts","name":"search","value":"tiss*"}
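
The whole-token versus trailing-wildcard behavior described above can be modeled in a few lines. This is only an illustration of the matching rule, not the SQLite FTS4 implementation: it assumes names are split on whitespace, and both function names and the sample name are hypothetical.

```python
def token_matches(query, token):
    """Whole-token match, or prefix match when the query ends with '*'."""
    query, token = query.lower(), token.lower()
    if query.endswith("*"):
        return token.startswith(query[:-1])
    return token == query


def name_matches(query, name):
    # Assumption: tokens are whitespace-separated; the real tokenizer differs.
    return any(token_matches(query, t) for t in name.split())


name = "adipose tissue"  # hypothetical annotation value
print(name_matches("tiss", name))    # False: not a whole token
print(name_matches("tissue", name))  # True: whole token
print(name_matches("tiss*", name))   # True: prefix wildcard
```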

@nbargnesi
Member Author

How is this request/response possible?

request:

next.belframework.org/api/annotations/values?size=1&filter={"category":"fts","name":"search","value":"9"}

response:

{
  "annotation_values": [
    {
      "identifier": "0001045",
      "name": "11-5.2.1.9 cell",
      "type": "AnnotationConcept"
    }
  ]
}

Here's the actual GET line:

GET /api/annotations/values?size=1&filter=%7B%22category%22:%22fts%22,%22name%22:%22search%22,%22value%22:%229%22%7D HTTP/1.1

@abargnesi
Member

The result is possible because 9 is a separate token within the name 11-5.2.1.9 cell. The SQLite FTS4 table treats . as a token separator and not a token character.

The token characters are defined as space = , - ( ) within the table creation. I should be able to add the period as a token character. Have you seen other non-alphanumeric characters apart from these?
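
To see why 9 matches, here is a simplified model of the tokenization, not the actual SQLite tokenizer. It assumes tokens are runs of ASCII alphanumerics plus a configurable set of extra token characters, and it treats whitespace as a separator for simplicity; the tokenize function is hypothetical.

```python
import re


def tokenize(text, token_chars="-"):
    """Split text into runs of alphanumerics plus the extra token characters."""
    # Anything outside this character class acts as a separator.
    pattern = f"[A-Za-z0-9{re.escape(token_chars)}]+"
    return re.findall(pattern, text)


name = "11-5.2.1.9 cell"

# With '.' as a separator, 9 becomes a token of its own.
print(tokenize(name))                    # ['11-5', '2', '1', '9', 'cell']

# With '.' added as a token character, 9 is no longer a standalone token.
print(tokenize(name, token_chars="-."))  # ['11-5.2.1.9', 'cell']
```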

@nbargnesi
Member Author

You may want to consider taking more responsibility from the client and exposing simple search capabilities for now. This will get us to good enough. We can go for robustness at a later point and take on the added complexity then.

If you decide to go the robust route now, there are at least three points that need addressing:

  1. Fail to match leading *, e.g.: *ateral fails to match lateral
  2. Fail to match both leading and trailing *, e.g.: *atera* fails to match lateral
  3. Unexpected results when we are performing searches and match other data.

Some amplifying information on (3) as well - if we search ell and match electrosensory lateral line lobe, it needs to be apparent why we matched it.

examples

search by token

...Creveld

Using Creveld alone gets expected results:

filter:

{"category":"fts","name":"search","value":"Creveld"}

response:

{
  "identifier": "12714",
  "name": "Ellis-Van Creveld syndrome",
  "type": "BiologicalProcessConcept"
}

...Ellis-Van

Or Ellis-Van:

filter:

{"category":"fts","name":"search","value":"Ellis-Van"}

response:

{
  "identifier": "12714",
  "name": "Ellis-Van Creveld syndrome",
  "type": "BiologicalProcessConcept"
}

...syndrome

Okay, now we're getting somewhere. How about syndrome:

filter:

{"category":"fts","name":"search","value":"syndrome"}

response:

{
  "identifier": "0050120",
  "name": "hemophagocytic lymphohistiocytosis",
  "type": "BiologicalProcessConcept"
}

Okay, hemophagocytic lymphohistiocytosis is a type of syndrome, but how do I know that from the search result?

...lateral

This one makes sense:

filter:

{"category":"fts","name":"search","value":"lateral"}

response:

{
  "identifier": "0008021",
  "name": "anterior lateral line ganglion neuron",
  "type": "CellAnnotationConcept"
}

...latera*

So does this (note the trailing *):

filter:

{"category":"fts","name":"search","value":"latera*"}

response:

{
  "identifier": "0008021",
  "name": "anterior lateral line ganglion neuron",
  "type": "CellAnnotationConcept"
}

...*ateral

This doesn't (leading *):

filter:

{"category":"fts","name":"search","value":"*ateral"}

response:

HTTP/1.1 404 Not Found

...*atera*

Or this (leading and trailing *):

filter:

{"category":"fts","name":"search","value":"*atera*"}

response:

HTTP/1.1 404 Not Found

multiple wildcards

Multiple wildcards support is nice:

filter:

{"category":"fts","name":"search","value":"posterior lateral * ganglion *"}

response:

{
  "identifier": "1000245",
  "name": "posterior lateral line ganglion neuron",
  "type": "CellAnnotationConcept"
}

weird results

Try ell:

filter:

{"category":"fts","name":"search","value":"ell"}

response:

{
  "identifier": "2002105",
  "name": "electrosensory lateral line lobe",
  "type": "AnatomyAnnotationConcept"
}

Not sure why I'm getting this. I suspect ell is an abbreviation, but it's not apparent.

@abargnesi
Member

Thanks for the write-up with examples. Very thorough and provides a nice set to test with.

The major problems seem to be:

  1. Suffix and inner matches cannot be found.
  2. Does not return where in the result the search matched (e.g. name, title, synonym and which part matched).

To address (1) I will have to index suffixes of identifiers, names, titles, and synonyms. For example, the term carcinoma would index using carcinoma, arcinoma, rcinoma, cinoma, inoma, noma, oma, ma, a. An approach using SQLite that seems reasonable is explained here. I am in the middle of this work.
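
The suffix idea can be sketched in plain Python. This is not the SQLite approach referred to above; it only shows why indexing every suffix turns a leading-wildcard or inner match into an ordinary exact or prefix match. Both function names are hypothetical.

```python
def suffixes(term):
    """Return every suffix of term, longest first."""
    return [term[i:] for i in range(len(term))]


def wildcard_match(query, term):
    """Match a query with an optional leading and/or trailing '*'."""
    leading = query.startswith("*")
    trailing = query.endswith("*")
    needle = query.strip("*")
    # A leading '*' searches the suffix index instead of the whole term only.
    candidates = suffixes(term) if leading else [term]
    if trailing:
        return any(c.startswith(needle) for c in candidates)
    return any(c == needle for c in candidates)


print(suffixes("carcinoma"))
print(wildcard_match("*ateral", "lateral"))  # True
print(wildcard_match("*atera*", "lateral"))  # True
print(wildcard_match("atera", "lateral"))    # False
```

The cost is index size: a term of length n contributes n entries, which is the trade-off for supporting suffix and inner matches.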

For (2) I will be able to provide which field(s) match and where (position range), but this will take longer.
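
Reporting the matched field and position range might look like the following sketch. The record layout, the find_matches function, and the synonym value ELL are all assumptions for illustration; the thread only suspects that ell is an abbreviation and does not show the stored data.

```python
def find_matches(record, query):
    """Return (field, start, end) for the first hit in each matching field."""
    needle = query.lower()
    hits = []
    for field, text in record.items():
        start = text.lower().find(needle)
        if start != -1:
            # end is exclusive, so text[start:end] is the matched part.
            hits.append((field, start, start + len(needle)))
    return hits


record = {
    "identifier": "2002105",
    "name": "electrosensory lateral line lobe",
    "synonym": "ELL",  # hypothetical: not confirmed anywhere in this thread
}

print(find_matches(record, "ell"))      # [('synonym', 0, 3)]
print(find_matches(record, "lateral"))  # [('name', 15, 22)]
```

With output like this, a client could show that ell matched a synonym rather than the name.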

@abargnesi
Member

Closing. Split work among a few tickets.
