better linquistic matches higher #10

missinglink · 2015-10-23T07:52:03Z

a more robust elasticsearch query

removed all sort scripts. these only act as tiebreakers when the _score is the same for two or more hits (if this happens then either the documents are exact duplicates or something is very wrong with the query!)
changed query_string to a match query against _all. this still allows you to search against the whole document, which I'm guessing is nice for searching ids, misc fields, etc. so that's still enabled. if you were using any of the fancy-pants query_string syntax features like wildcards etc. then you can change this back to using query_string.
all the goodies for boosting and scoring now live inside the should section:
- do a phrase query against wof:name and multiply its score by the boost (this takes token order into account, so for the tokens ["new","york"] it will score "New York" higher than "York New")
- take the reciprocal of wof:scale and add that to the score
- take the log1p of gn:population and add that to the mix
- the last one I don't really like, and I'm not sure whether you wanted it or not: it adds the wof:megacity boosting, which I guess is helpful if the previous 2 functions fail to match

I played with it locally and I think it works much better, but search is super subjective so you might want to tweak some of the boost values and function weights to suit your tastes.
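
For reference, a sketch of the query body that the list above describes, written as a Python dict. The field names, boosts, weights and "missing" defaults are copied from the request that is echoed back in the error output further down this thread; the function name build_query and its parameters are made up for illustration and are not part of the pull request.

```python
import json


def build_query(text, name_boost=10, function_boost=2):
    """Build the bool query for the search string `text` (hypothetical helper)."""
    functions = [
        # 1 / wof:scale, with a default of 10 for documents lacking the field
        {"weight": 1, "field_value_factor": {
            "field": "wof:scale", "modifier": "reciprocal", "missing": 10}},
        # log1p of gn:population, treating a missing population as 0
        {"weight": 1, "field_value_factor": {
            "field": "gn:population", "modifier": "log1p", "missing": 0}},
        # raw wof:megacity value, 0 when absent
        {"weight": 1, "field_value_factor": {
            "field": "wof:megacity", "missing": 0}},
    ]
    return {
        "query": {
            "bool": {
                # every hit has to match somewhere in the whole document
                "must": [{"match": {"_all": {"query": text}}}],
                # these clauses only add to the score of documents that matched
                "should": [
                    {"match": {"wof:name": {
                        "query": text, "boost": name_boost, "type": "phrase"}}},
                    {"function_score": {
                        "functions": functions, "boost": function_boost}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("SF"), indent=2))
```

Keeping the boosts as parameters makes it easier to try out different values, since the right numbers are a matter of taste.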

missinglink · 2015-10-23T08:35:33Z

there are 3 documents in the index which are causing shard failures such as ElasticsearchException[Result of field modification [reciprocal(0.0)] must be a number]

is "wof:scale": 0 a mistake? otherwise the reciprocal function will need to account for 0.

POST /whosonfirst/_search
{
   "query": {
      "match": {
         "wof:scale": 0
      }
   },
   "fields": [
      "wof:name",
      "wof:id",
      "wof:scale"
   ]
}

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "whosonfirst",
            "_type": "locality",
            "_id": "101748481",
            "_score": 1,
            "fields": {
               "wof:name": [
                  "Augsburg"
               ],
               "wof:id": [
                  101748481
               ],
               "wof:scale": [
                  0
               ]
            }
         },
         {
            "_index": "whosonfirst",
            "_type": "locality",
            "_id": "85922165",
            "_score": 1,
            "fields": {
               "wof:name": [
                  "Martinez"
               ],
               "wof:id": [
                  85922165
               ],
               "wof:scale": [
                  0
               ]
            }
         },
         {
            "_index": "whosonfirst",
            "_type": "locality",
            "_id": "101749163",
            "_score": 1,
            "fields": {
               "wof:name": [
                  "Århus"
               ],
               "wof:id": [
                  101749163
               ],
               "wof:scale": [
                  0
               ]
            }
         }
      ]
   }
}
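
To illustrate the reciprocal(0.0) failure above, here is a rough Python 3 emulation of what field_value_factor does with a field value, plus one possible guard. This is a sketch based on my understanding of Elasticsearch (factor applied first, log1p as a base-10 log of value + 1, non-finite results rejected), not its actual code. The helper names are hypothetical, and the per-function "filter" is an untested assumption about how the query could skip documents where wof:scale is 0.

```python
import math


def field_value_factor(value, modifier="none", factor=1.0):
    """Approximate the score contribution of one field value."""
    v = float(value) * factor
    if modifier == "reciprocal":
        # Java evaluates 1.0 / 0.0 as Infinity; Python would raise instead,
        # so the infinite result is produced explicitly here.
        result = math.inf if v == 0.0 else 1.0 / v
    elif modifier == "log1p":
        result = math.log10(v + 1.0)
    elif modifier == "none":
        result = v
    else:
        raise ValueError("modifier not covered by this sketch: %s" % modifier)
    if math.isnan(result) or math.isinf(result):
        raise ValueError(
            "Result of field modification [%s(%s)] must be a number" % (modifier, v))
    return result


def guarded_scale_function(weight=1):
    """wof:scale function that is only applied when wof:scale is above 0."""
    return {
        "filter": {"range": {"wof:scale": {"gt": 0}}},
        "weight": weight,
        "field_value_factor": {
            "field": "wof:scale", "modifier": "reciprocal", "missing": 10},
    }
```

Note that a range filter would also skip documents that have no wof:scale at all, so the "missing": 10 default would no longer apply to them. If 0 is simply bad data, fixing the three documents is the simpler option.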


thisisaaronland · 2015-10-23T16:52:36Z

Alas, nope:

  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
    self._raise_error(response.status, raw_data)
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
RequestError: TransportError(400, u'SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][0]: SearchParseException[[whosonfirst][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][1]: SearchParseException[[whosonfirst][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][2]: SearchParseException[[whosonfirst][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": 
{"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][3]: SearchParseException[[whosonfirst][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][4]: SearchParseException[[whosonfirst][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: 
ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }]')
2015-10-23 16:51:00 [31290] [INFO] Handling signal: winch


missinglink · 2015-10-28T13:34:16Z

hmm.. this merge and then reject workflow isn't working for me, if you'd like my help then feel free to ask.

thisisaaronland · 2015-10-28T23:19:07Z

$> cat ./data/query_boost_wof_name.json | ./scripts/wof-es-rawquery --query - 
{u'error': u'SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][0]: SearchParseException[[whosonfirst][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][1]: SearchParseException[[whosonfirst][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][2]: SearchParseException[[whosonfirst][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, 
"field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][3]: SearchParseException[[whosonfirst][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][4]: SearchParseException[[whosonfirst][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: 
ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }]',
 u'status': 400}

As in:

The root problem appears to be indexing and type casting. We are not using ES's default type casting because it suffers from an excess of cleverness (or at least enough that basic indexing was turning into a yak-shaving exercise), so with a few exceptions everything is just being stored as a string. That would explain the ClassCastException: the scoring functions expect numeric field data and are getting string field data instead:

https://github.com/whosonfirst/py-mapzen-whosonfirst-search/blob/master/mapzen/whosonfirst/search/__init__.py#L90

If there are specific things that we can/should index as numeric values it's easy enough to update.
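
As a sketch of what indexing specific things as numeric values could look like, the helper below converts a chosen set of properties to integers before a document is sent to Elasticsearch. The function name cast_numeric and the list of fields are assumptions for illustration (the three fields are the ones the scoring functions read); this is not code from py-mapzen-whosonfirst-search.

```python
# Fields that the scoring functions in this pull request read as numbers.
NUMERIC_FIELDS = ("wof:scale", "wof:megacity", "gn:population")


def cast_numeric(properties, numeric_fields=NUMERIC_FIELDS):
    """Return a copy of `properties` with the listed fields converted to int."""
    out = dict(properties)
    for key in numeric_fields:
        if key not in out:
            continue
        try:
            out[key] = int(out[key])
        except (TypeError, ValueError):
            # Drop values that cannot be parsed, so the field counts as
            # missing and the query's "missing" default is used instead.
            del out[key]
    return out
```

Fields already indexed as strings would keep their string mapping, so a change like this most likely also needs a reindex into a fresh index.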

missinglink · 2015-10-29T09:15:16Z

ugh these elasticsearch errors are unreadable.

I can confirm that the cat command above and curl -XPOST -H "Content-Type: application/json" --data @data/query_boost_wof_name.json localhost:9200/whosonfirst/_search both complete without error on my system.

I have only loaded the WOF regions and not the venues, which may be where the issue lies.

could you paste the result of curl -XGET localhost:9200/whosonfirst/_mapping?

here's mine

mapping | grep -A1 "wof:scale\|gn:population\|wof:megacity"

          "gn:population": {
            "type": "long"
--
          "gn:population": {
            "type": "long"
--
          "wof:megacity": {
            "type": "long"
--
          "wof:scale": {
            "type": "long"
--
          "gn:population": {
            "type": "long"
--
          "wof:megacity": {
            "type": "long"
--
          "wof:scale": {
            "type": "long"
--
          "gn:population": {
            "type": "long"
--
          "wof:megacity": {
            "type": "long"
--
          "wof:scale": {
            "type": "long"

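The same comparison can be done without grep. The sketch below walks a parsed _mapping response and lists every document type where one of the three fields is not mapped as a number. The response layout it assumes (index, then "mappings", then type, then "properties") and the function name are assumptions, so adjust them if your output is shaped differently.

```python
NUMERIC_TYPES = {"byte", "short", "integer", "long", "float", "double"}
SCORING_FIELDS = ("wof:scale", "gn:population", "wof:megacity")


def non_numeric_fields(mapping, fields=SCORING_FIELDS):
    """Return sorted (type, field, mapped type) tuples for non-numeric fields."""
    problems = []
    for index_body in mapping.values():
        for type_name, type_body in index_body.get("mappings", {}).items():
            props = type_body.get("properties", {})
            for field in fields:
                if field not in props:
                    continue  # absent fields fall back to the "missing" default
                mapped = props[field].get("type")
                if mapped not in NUMERIC_TYPES:
                    problems.append((type_name, field, mapped))
    return sorted(problems)
```

An empty result for a mapping like the one above would mean all three fields are numeric wherever they appear; any tuple that comes back points at a type whose field would trigger the ClassCastException.
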
missinglink force-pushed the search branch from b3d2dea to 2450a6a on October 23, 2015 08:10

better linquistic matches higher

c5ef15b

missinglink force-pushed the search branch from 2450a6a to c5ef15b on October 23, 2015 09:41

thisisaaronland added a commit that referenced this pull request Oct 23, 2015

Merge pull request #10 from missinglink/search

03dbb92

better linquistic matches higher

thisisaaronland merged commit 03dbb92 into whosonfirst:master Oct 23, 2015

thisisaaronland pushed a commit that referenced this pull request Oct 23, 2015

git is weird and revert/reset is a fiddly waste of time; rollback pull …

8db3323

…#10
