better linguistic matches higher #10
there are 3 documents in the index which are causing shard failures, all with a `wof:scale` of 0. is that a mistake?

POST /whosonfirst/_search
{
"query": {
"match": {
"wof:scale": 0
}
},
"fields": [
"wof:name",
"wof:id",
"wof:scale"
]
}

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "whosonfirst",
"_type": "locality",
"_id": "101748481",
"_score": 1,
"fields": {
"wof:name": [
"Augsburg"
],
"wof:id": [
101748481
],
"wof:scale": [
0
]
}
},
{
"_index": "whosonfirst",
"_type": "locality",
"_id": "85922165",
"_score": 1,
"fields": {
"wof:name": [
"Martinez"
],
"wof:id": [
85922165
],
"wof:scale": [
0
]
}
},
{
"_index": "whosonfirst",
"_type": "locality",
"_id": "101749163",
"_score": 1,
"fields": {
"wof:name": [
"Århus"
],
"wof:id": [
101749163
],
"wof:scale": [
0
]
}
}
]
}
} |
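
A response like the one above can also be checked programmatically. The following is a minimal Python sketch, assuming the response has already been parsed into a dict; `summarize_hits` is a hypothetical helper name, not part of any Elasticsearch client, and the sample is a trimmed copy of the response with a single hit.

```python
def summarize_hits(response):
    """Return (failed_shards, rows), where rows are (id, name, scale) tuples."""
    failed = response["_shards"]["failed"]
    rows = []
    for hit in response["hits"]["hits"]:
        fields = hit.get("fields", {})
        # Each requested field comes back as a list of values; take the first.
        rows.append((
            hit["_id"],
            fields.get("wof:name", [None])[0],
            fields.get("wof:scale", [None])[0],
        ))
    return failed, rows


# Trimmed copy of the response above (only the first hit is kept).
sample = {
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 3,
        "hits": [
            {
                "_id": "101748481",
                "fields": {
                    "wof:name": ["Augsburg"],
                    "wof:id": [101748481],
                    "wof:scale": [0],
                },
            }
        ],
    },
}

failed, rows = summarize_hits(sample)
print("failed shards:", failed)
for doc_id, name, scale in rows:
    print(doc_id, name, scale)
```

Running the same helper over the response of the real scoring query would show whether `failed` is non-zero for the documents in question.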
better linguistic matches higher
Alas, nope:
|
hmm.. this merge and then reject workflow isn't working for me, if you'd like my help then feel free to ask. |
As in:
The root problem appears to be indexing and type casting. We are not using ES's default type casting because it suffers from an excess of cleverness (or at least enough that basic indexing was turning into a yak-shaving exercise), so with a few exceptions everything is just being stored as a string. If there are specific things that we can/should index as numeric values, it's easy enough to update. |
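
To illustrate why storing numbers as strings matters for ranking, here is a small Python sketch. The values are made up for illustration, and `population_boost` is a hypothetical helper; it only shows that string-typed numbers compare character by character and need casting before any numeric function is applied.

```python
import math

# Hypothetical values, stored as strings the way the comment above describes.
populations_as_strings = ["9", "10", "100"]

# Strings sort character by character, so "9" ends up last.
lexicographic = sorted(populations_as_strings)

# Casting to int first gives the numeric order.
numeric = sorted(populations_as_strings, key=int)


def population_boost(raw):
    """Cast a stored value to a number before applying log1p; 0.0 if not numeric."""
    try:
        return math.log1p(float(raw))
    except (TypeError, ValueError):
        return 0.0


print(lexicographic)
print(numeric)
print(population_boost("1000"), population_boost("not a number"))
```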
ugh these elasticsearch errors are unreadable. I can confirm that I have only loaded the WOF regions and not the venues, which may be where the issue lies.. could you paste the result of mapping | grep -A1 "wof:scale\|gn:population\|wof:megacity"
"gn:population": {
"type": "long"
--
"gn:population": {
"type": "long"
--
"wof:megacity": {
"type": "long"
--
"wof:scale": {
"type": "long"
--
"gn:population": {
"type": "long"
--
"wof:megacity": {
"type": "long"
--
"wof:scale": {
"type": "long"
--
"gn:population": {
"type": "long"
--
"wof:megacity": {
"type": "long"
--
"wof:scale": {
"type": "long" |
a more robust elasticsearch query

- `sort` scripts: these only act as tiebreakers when the `_score` is the same for two or more hits (if this happens then either the documents are exact duplicates or something is very wrong with the query!)
- `query_string` to `match_all`: this still allows you to search against the whole document. I'm guessing this is nice for searching ids, misc fields, etc., so that's still enabled. if you were using any of the fancy-pants `query_string` syntax features like wildcards etc. then you can change this back to using `query_string`.
- `should` section:
  - `phrase` query against `wof:name` and score that multiplied by the boost (this takes token order into account, so for the tokens `["new","york"]` it will score "New York" higher than "York New")
- `reciprocal` of `wof:scale` and add that to the score
- `log1p` of `gn:population` and add that to the mix
- `wof:megacity` boosting, I guess this is helpful if the previous 2 functions fail to match

I played with it locally and I think it works much better, but search is super subjective so you might want to tweak some of the `boost` values and function `weights` to suit your tastes.
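
For readers who want to experiment, the description above can be sketched as a `function_score` request body. This is a rough Python sketch under stated assumptions, not the query from the pull request: `build_query` and its parameter names are hypothetical, the default boost and weight values are placeholders, the whole-document match is written as a `match` against the `_all` field (Elasticsearch's own `match_all` takes no search text), and `wof:megacity` is assumed to be stored as 1 for megacities.

```python
import json


def build_query(text, name_boost=2.0, scale_weight=1.0,
                population_weight=1.0, megacity_weight=2.0):
    """Build a function_score search body along the lines described above."""
    base_query = {
        "bool": {
            # Required: the text must match somewhere in the document.
            "must": [{"match": {"_all": text}}],
            # Optional: a phrase match on the name, which respects token order.
            "should": [{
                "match_phrase": {
                    "wof:name": {"query": text, "boost": name_boost}
                }
            }],
        }
    }
    functions = [
        # 1 / wof:scale
        {"field_value_factor": {"field": "wof:scale", "modifier": "reciprocal"},
         "weight": scale_weight},
        # log1p of the population
        {"field_value_factor": {"field": "gn:population", "modifier": "log1p"},
         "weight": population_weight},
        # Flat bonus for megacities (assumed to be flagged with 1).
        {"filter": {"term": {"wof:megacity": 1}}, "weight": megacity_weight},
    ]
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": functions,
                "score_mode": "sum",   # add the function results together
                "boost_mode": "sum",   # then add them to the text score
            }
        }
    }


print(json.dumps(build_query("new york"), indent=2))
```

The printed JSON can be posted to `/whosonfirst/_search`. Documents with a `wof:scale` of 0, or without one of the numeric fields, may need separate handling, since a reciprocal of zero is not a usable score.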