better linquistic matches higher #10

Merged
merged 1 commit into from Oct 23, 2015

Conversation

missinglink
Contributor

a more robust elasticsearch query

  • removed all sort scripts. these only act as tiebreakers when the _score is the same for two or more hits (if this happens then either the documents are exact duplicates or something is very wrong with the query!)
  • change query_string to a match query against _all. this still allows you to search against the whole document, which I'm guessing is nice for searching ids, misc fields, etc. so that's still enabled. if you were using any of the fancy-pants query_string syntax features like wildcards etc. then you can change this back to using query_string.
  • all the goodies for boosting and scoring now live inside the should section
    • do a phrase query against wof:name and score that multiplied by the boost (this takes token order into account so for the tokens ["new","york"] it will score "New York" higher than "York New")
    • take the reciprocal of wof:scale and add that to the score
    • take the log1p of gn:population and add that to the mix
    • the last one I don't really like and I'm not sure if you wanted it or not. basically it allows you to do the wof:megacity boosting, which I guess is helpful if the previous 2 functions fail to match
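
The bullets above roughly correspond to a query body like the following. This is a sketch reconstructed from the request body quoted in the error output later in this conversation. The build_query helper and its argument are hypothetical, and the boost and weight values are simply the ones in that request, so they may want tuning.

```python
import json


def build_query(text):
    """Build the bool query described above for a free-text search string.

    Hypothetical helper: the structure follows the request body quoted in
    the error output later in this thread.
    """
    return {
        "query": {
            "bool": {
                # every hit has to match somewhere in the whole document
                "must": [{"match": {"_all": {"query": text}}}],
                # optional clauses: they do not filter, they only add to _score
                "should": [
                    # phrase match on the name, so token order counts
                    {
                        "match": {
                            "wof:name": {"query": text, "type": "phrase", "boost": 10}
                        }
                    },
                    {
                        "function_score": {
                            "boost": 2,
                            "functions": [
                                # reciprocal of wof:scale
                                {
                                    "weight": 1,
                                    "field_value_factor": {
                                        "field": "wof:scale",
                                        "modifier": "reciprocal",
                                        "missing": 10,
                                    },
                                },
                                # log1p of gn:population
                                {
                                    "weight": 1,
                                    "field_value_factor": {
                                        "field": "gn:population",
                                        "modifier": "log1p",
                                        "missing": 0,
                                    },
                                },
                                # raw wof:megacity flag, 0 when absent
                                {
                                    "weight": 1,
                                    "field_value_factor": {
                                        "field": "wof:megacity",
                                        "missing": 0,
                                    },
                                },
                            ],
                        }
                    },
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("SF"), indent=2))
```

Keeping the whole-document match in must and everything else in should means the scoring clauses can be tweaked without changing which documents are returned.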

I played with it locally and I think it works much better, but search is super subjective so maybe you might want to tweak some of the boost values and function weights to suit your tastes.

@missinglink
Contributor Author

there are 3 documents in the index which are causing shard failures such as ElasticsearchException[Result of field modification [reciprocal(0.0)] must be a number]

is "wof:scale": 0 a mistake in the data? otherwise the reciprocal function will need to account for 0.

POST /whosonfirst/_search
{
   "query": {
      "match": {
         "wof:scale": 0
      }
   },
   "fields": [
      "wof:name",
      "wof:id",
      "wof:scale"
   ]
}
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "whosonfirst",
            "_type": "locality",
            "_id": "101748481",
            "_score": 1,
            "fields": {
               "wof:name": [
                  "Augsburg"
               ],
               "wof:id": [
                  101748481
               ],
               "wof:scale": [
                  0
               ]
            }
         },
         {
            "_index": "whosonfirst",
            "_type": "locality",
            "_id": "85922165",
            "_score": 1,
            "fields": {
               "wof:name": [
                  "Martinez"
               ],
               "wof:id": [
                  85922165
               ],
               "wof:scale": [
                  0
               ]
            }
         },
         {
            "_index": "whosonfirst",
            "_type": "locality",
            "_id": "101749163",
            "_score": 1,
            "fields": {
               "wof:name": [
                  "Århus"
               ],
               "wof:id": [
                  101749163
               ],
               "wof:scale": [
                  0
               ]
            }
         }
      ]
   }
}
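
For illustration, here is a sketch of what "accounting for 0" could look like, written as plain Python rather than as anything Elasticsearch runs. The function name and the decision to treat 0 the same as a missing value are assumptions of this sketch, not something this pull request implements. The fallback of 10 mirrors the "missing" value used for wof:scale in the query.

```python
def reciprocal_factor(value, missing=10.0):
    """Return 1/value, falling back to `missing` when value is None or 0.

    Treating 0 like a missing value is an assumption made for this sketch;
    it avoids the division by zero behind the reciprocal(0.0) failure.
    """
    if value is None or value == 0:
        value = missing
    return 1.0 / value


if __name__ == "__main__":
    # the three documents above all have wof:scale == 0
    for scale in (0, None, 4):
        print(scale, reciprocal_factor(scale))
```

The other option raised above is to treat the three zero values as a data error and correct them at the source, in which case no guard is needed.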

thisisaaronland added a commit that referenced this pull request Oct 23, 2015
better linquistic matches higher
@thisisaaronland thisisaaronland merged commit 03dbb92 into whosonfirst:master Oct 23, 2015
@thisisaaronland
Member

Alas, nope:

  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
    self._raise_error(response.status, raw_data)
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
RequestError: TransportError(400, u'SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][0]: SearchParseException[[whosonfirst][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][1]: SearchParseException[[whosonfirst][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][2]: SearchParseException[[whosonfirst][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": 
{"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][3]: SearchParseException[[whosonfirst][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[8CT46Kp3Sr2GeHOCvlIc4Q][whosonfirst][4]: SearchParseException[[whosonfirst][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: 
ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }]')
2015-10-23 16:51:00 [31290] [INFO] Handling signal: winch

@missinglink
Contributor Author

hmm.. this merge and then reject workflow isn't working for me, if you'd like my help then feel free to ask.

@thisisaaronland
Member

$> cat ./data/query_boost_wof_name.json | ./scripts/wof-es-rawquery --query - 
{u'error': u'SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][0]: SearchParseException[[whosonfirst][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][1]: SearchParseException[[whosonfirst][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][2]: SearchParseException[[whosonfirst][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, 
"field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][3]: SearchParseException[[whosonfirst][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }{[aCdmWzgaSxSSnH_10fyYXg][whosonfirst][4]: SearchParseException[[whosonfirst][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query": {"bool": {"should": [{"match": {"wof:name": {"query": "SF", "boost": 10, "type": "phrase"}}}, {"function_score": {"functions": [{"weight": 1, "field_value_factor": {"field": "wof:scale", "modifier": "reciprocal", "missing": 10}}, {"weight": 1, "field_value_factor": {"field": "gn:population", "modifier": "log1p", "missing": 0}}, {"weight": 1, "field_value_factor": {"field": "wof:megacity", "missing": 0}}], "boost": 2}}], "must": [{"match": {"_all": {"query": "SF"}}}]}}}]]]; nested: 
ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]; }]',
 u'status': 400}

As in:

The root problem appears to be indexing and type casting. We are not using ES's default type casting because it suffers from an excess of cleverness (or at least enough that basic indexing was turning into a yak-shaving exercise), so with a few exceptions everything is just being stored as a string:

https://github.com/whosonfirst/py-mapzen-whosonfirst-search/blob/master/mapzen/whosonfirst/search/__init__.py#L90

If there are specific things that we can/should index as numeric values it's easy enough to update.
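
As a rough sketch of what that update might look like, the three fields used by the scoring functions could be converted to numbers before a document is indexed. The coerce_numeric helper, the NUMERIC_FIELDS tuple, and the choice to drop values that do not parse are all hypothetical and are not taken from py-mapzen-whosonfirst-search.

```python
# Fields the function_score clauses read; they need a numeric mapping.
NUMERIC_FIELDS = ("wof:scale", "gn:population", "wof:megacity")


def coerce_numeric(properties, fields=NUMERIC_FIELDS):
    """Return a copy of `properties` with the named fields converted to int.

    Values that cannot be parsed as integers are dropped (an assumption of
    this sketch) so that the query's "missing" fallback applies to them.
    """
    out = dict(properties)
    for name in fields:
        if name not in out:
            continue
        try:
            out[name] = int(out[name])
        except (TypeError, ValueError):
            del out[name]
    return out


if __name__ == "__main__":
    # made-up example document, with everything stored as strings
    doc = {"wof:name": "Example", "wof:scale": "4", "gn:population": "n/a"}
    print(coerce_numeric(doc))
```

An index that has already mapped these fields as strings would need to be rebuilt for the new types to take effect, since Elasticsearch does not change the type of an existing field in place.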

@missinglink
Contributor Author

ugh these elasticsearch errors are unreadable.

I can confirm that the cat command above and curl -XPOST -H "Content-Type: application/json" --data @data/query_boost_wof_name.json localhost:9200/whosonfirst/_search both complete without error on my system.

I have only loaded the WOF regions and not the venues, which may be where the issue lies.

could you paste the result of curl -XGET localhost:9200/whosonfirst/_mapping

here's mine

mapping | grep -A1 "wof:scale\|gn:population\|wof:megacity"

          "gn:population": {
            "type": "long"
--
          "gn:population": {
            "type": "long"
--
          "wof:megacity": {
            "type": "long"
--
          "wof:scale": {
            "type": "long"
--
          "gn:population": {
            "type": "long"
--
          "wof:megacity": {
            "type": "long"
--
          "wof:scale": {
            "type": "long"
--
          "gn:population": {
            "type": "long"
--
          "wof:megacity": {
            "type": "long"
--
          "wof:scale": {
            "type": "long"
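
The same check can be done without grep. The sketch below walks a parsed _mapping response and lists any of the scoring fields that are not mapped as a number. It assumes the response is shaped as index name, then "mappings", then document type, then "properties", which is how Elasticsearch 1.x returns it. The non_numeric_fields helper is hypothetical and the sample mapping is made up for illustration.

```python
NUMERIC_TYPES = {"byte", "short", "integer", "long", "float", "double"}
SCORING_FIELDS = ("wof:scale", "gn:population", "wof:megacity")


def non_numeric_fields(mapping, fields=SCORING_FIELDS):
    """Return sorted (doc_type, field, es_type) tuples for non-numeric fields.

    `mapping` is the parsed JSON of a _mapping response, assumed to be
    shaped as {index: {"mappings": {doc_type: {"properties": {...}}}}}.
    """
    problems = []
    for index_body in mapping.values():
        for doc_type, type_body in index_body.get("mappings", {}).items():
            props = type_body.get("properties", {})
            for field in fields:
                if field not in props:
                    continue
                es_type = props[field].get("type")
                if es_type not in NUMERIC_TYPES:
                    problems.append((doc_type, field, es_type))
    return sorted(problems)


if __name__ == "__main__":
    # made-up mapping: one type indexed as numbers, one as strings
    sample = {
        "whosonfirst": {
            "mappings": {
                "locality": {"properties": {"wof:scale": {"type": "long"}}},
                "venue": {"properties": {"wof:scale": {"type": "string"}}},
            }
        }
    }
    for doc_type, field, es_type in non_numeric_fields(sample):
        print(doc_type, field, es_type)
```

Any field it reports as a string would be a candidate for the ClassCastException shown earlier, because field_value_factor expects numeric field data.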
