New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong analyzer used when indexing dynamic property #3544
Comments
I think you should must not add |
I can still reproduce the behavior here, even with the modified mapping and a slightly simpler test. Looks like a bug to me. |
this might relate to a bug we fixed in put mapping, where it didn't always wait for all the relevant nodes to properly ack that the mapping have been applied. It should be fixed in master and 0.90 (upcoming 0.90.4). |
@brwe can you attach the test you use to reproduce it here? |
I prepared a (very crude) test on branch issue-3544-test in my repository. Adding an actionGet() after
seems to solve the problem, as you expected @s1monw . |
@brwe thanks for clarifying! @pmclellan are you ok with closing this issue, does it work for you as well based on the examples? |
Thanks for the feedback. However, I still think this is a bug. Indexing all the documents synchronously, as @brwe suggested, does prevent the issue from happening, but I believe the problem should not occur when indexing asynchronously either. I'll try running my test against master to see if the issue has been resolved by the fix @kimchy mentioned. |
@pmclellan you can index them async but you need to wait until they are all indexed on all replicas and then call refresh otherwise you might miss some docs depending on which shard / replica you go. I don't see the issue you are talking about sync/async here can you explain? |
Apologies for the confusion. I'll do my best to explain what I mean as it's possible I've got the wrong end of the stick here. @brwe suggested adding '.actionGet()' to the end of my indexing call. My understanding is that this would cause the test to wait until a response has been received from the Elasticsearch node before proceeding to index the next document. This is what I am terming as synchronous execution because each index request must be acknowledged before a subsequent one is issued. My understanding is that the 'execute()' method returns a ListenableActionFuture so that you can issue a series of requests without waiting for responses and then use the futures to determine when they've been acknowledged. This is what I'd consider to be asynchronous execution. In either case I'd expect the end result to be the same with all documents being indexed according to the type mapping. Is this a valid assumption or have I got things mixed up here? One thing I can see that I'm not currently doing in my test is waiting for the result of all the indexing futures. I'll give this a try and see what happens. |
Modified test to wait for all indexing responses before sending refresh request and re-ran it using latest code from master. Unfortunately, neither change had any effect. However, if I group all the separate indexing actions into a single bulk request then the problem goes away. Hope this info helps with further investigation. |
I have to apologize... I looked into this closer and I can see the dynamic mapping being applied correctly even on master but somehow it pulls a default analyzer even if the field mapper has |
If we publish the object mappers before we updated the actual DocumentFieldMappers concurrently indexing documents can suddenly use default analysis chains since they skip application of dynamic template or dynamic mappers in general since if there is a race condition ObjectMappers are only build once. Closes elastic#3544
@pmclellan Just writing to let you know we are still on it. It is bug indeed and as @s1monw guessed a race condition. Thanks again for providing the test and being so persistent! @s1monw prepared a fix (see above) but we need to discuss this further before pulling it in. |
Hi @brwe, thanks for the update. It's a rare day when someone actually thanks me for my persistence. Good to hear that you've pinned down the cause of the problem. Keep me posted on the progress of the fix. |
@pmclellan I am glad you were persistent and I learned my part here as well :) thanks for this |
I still experience some wrong behavior of elasticsearch when selecting the index analyzer for a dynamic property. I have a "autocomplete*" property; the concrete fields "_autocomplete_de" or "_autocomplete_en" are filled by a transform script depending on the lanuage of the document. The index analyzer to use is equal to concrete name of the field. Look at this example: PUT test_dynamic_template
{
"settings": {
"analysis": {
"analyzer": {
"nGram_analyzer": {
"alias": "default_index",
"tokenizer": "nGram_tokenizer",
"filter": [
"lowercase",
"asciifolding"
]
},
"_autocomplete_de": {
"tokenizer": "standard",
"filter": [
"lowercase",
"stopwords_de_filter"
]
},
"_autocomplete_en": {
"tokenizer": "standard",
"filter": [
"lowercase",
"stopwords_en_filter"
]
}
},
"tokenizer": {
"nGram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 25,
"token_chars": [ "letter", "digit", "symbol", "punctuation" ]
}
},
"filter": {
"stopwords_de_filter": {
"type": "stop",
"stopwords": "_german_"
},
"stopwords_en_filter": {
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
PUT test_dynamic_template/_mapping/page
{
"page": {
"properties": {
"_lang": {
"type": "string",
"index": "not_analyzed"
}
},
"dynamic_templates" : [
{
"autocomplete_for_lang" : {
"match" : "_autocomplete_*",
"mapping" : {
"type" : "string",
"index_analyzer" : "{name}",
"search_analyzer": "standard"
}
}
}
],
"transform": {
"script": "if(ctx._source._lang instanceof String) { ctx._source['_autocomplete_' + ctx._source._lang] = autocomplete_fields.collect{ ctx._source[it] }; }",
"params": {
"autocomplete_fields": ["title", "content"]
}
}
}
}
POST test_dynamic_template/page/
{
"_lang": "en",
"title": "elasticsearch dynamic template",
"content": "this is some content to reproduce the behavior"
}
GET test_dynamic_template/_search?search_type=count
{
"query": {
"match_all": { }
},
"facets" : {
"all_autocomplete" : {
"terms" : {
"field": "_autocomplete_en",
"regex": "^s.*",
"size": 5
}
}
}
} I've executed this commands multiple times, sometimes I see the following behavior: elasticsearch version: 1.4.2 Update: The wrong behavior only occurs if the name of the dynamic property starts with an underscore. I named it "autocomplete_*" now and it works fine. However, I would still consider this as a bug, since underscores are permitted in property names. |
@sschuerz This is a different issue. Analyzer names may not start with an _, these names are reserved for special analyzers. But you are right, this is not documented anywhere and there is no check for this, the analyzer is just silently dropped. Thanks for reporting and especially the update that points to the _. This saved a lot of time. |
Dynamic mapping allow to dynamically introduce new fields into an existing mapping. There is a (pretty rare) race condition, where a new field/object being introduced will not be immediately visible for another document that introduces it at the same time. closes elastic#3667, closes elastic#3544
Hi,
I'm seeing some unusual behaviour when indexing documents in Elasticsearch with a dynamic template. I have a unit test that's failing intermittently. The flow of the test is as follows:
The type mapping I'm using includes the following dynamic template definition:
This template is intended to produce fields of the form:
The problem I have is that occasionally (perhaps once in every ten runs) the test will fail because step 6 does not return all the expected documents. When I check the indexed terms for the missing documents I see that values in the 'participants' field have been split into separate tokens on the '-' character. This seems to suggest that the default analyzer is being used for indexing instead of the whitespace one.
So far I haven't been able to detect any pattern to the failures. The unexpected tokenisation only affects a portion of the indexed documents and can occur at any point in the indexing process (i.e. it isn't always the first or last document that has problems).
Here is a gist with a simplified version of my original test: https://gist.github.com/pmclellan/64b192537c97529ec2e4 This version fails much more consistently, usually on the first run. However, I have noticed that the problem does not occur if I index documents in a synchronous manner (i.e. waiting for a response to each indexing request before issuing the next).
Cheers,
Paul
The text was updated successfully, but these errors were encountered: