Wrong analyzer used when indexing dynamic property #3544

pmclellan · 2013-08-20T19:12:30Z

Hi,

I'm seeing some unusual behaviour when indexing documents in Elasticsearch with a dynamic template. I have a unit test that's failing intermittently. The flow of the test is as follows:

1) Initialise in-memory Elasticsearch cluster (one local node, no replicas)
2) Create new index
3) Create new type mapping
4) Index some documents
5) Refresh index and wait for all documents to be processed
6) Query Elasticsearch for documents

The type mapping I'm using includes the following dynamic template definition:

{
    "participants": {
        "path_match": "participants.*",
        "mapping": {
            "type": "string",
            "store": "yes",
            "index": "analyzed",
            "analyzer": "whitespace"
        }
    }
}

This template is intended to produce fields of the form:

participants.new = [ 'user-1', 'user-2' ]
participants.removed = [ 'user-3' ]

The problem I have is that occasionally (perhaps once in every ten runs) the test will fail because step 6 does not return all the expected documents. When I check the indexed terms for the missing documents I see that values in the 'participants' field have been split into separate tokens on the '-' character. This seems to suggest that the default analyzer is being used for indexing instead of the whitespace one.
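
To illustrate why the choice of analyzer matters for values such as 'user-1', here is a minimal sketch in plain Java. It does not use the real Lucene analyzers: `whitespaceTokens` and `defaultLikeTokens` are hypothetical stand-ins that only approximate the whitespace analyzer and the default analyzer's word-boundary behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Rough stand-ins for the two analyzers; NOT the real Lucene implementations.
class TokenizerSketch {

    // Approximates the "whitespace" analyzer: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Approximates the default analyzer (assumption): split on anything that
    // is not a letter or digit, then lowercase each token.
    static List<String> defaultLikeTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("user-1 user-2"));   // [user-1, user-2]
        System.out.println(defaultLikeTokens("user-1 user-2"));  // [user, 1, user, 2]
    }
}
```

With the second tokenisation, a lookup for the exact term 'user-1' finds nothing, which matches the missing documents described above.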

So far I haven't been able to detect any pattern to the failures. The unexpected tokenisation only affects a portion of the indexed documents and can occur at any point in the indexing process (i.e. it isn't always the first or last document that has problems).

Here is a gist with a simplified version of my original test: https://gist.github.com/pmclellan/64b192537c97529ec2e4 This version fails much more consistently, usually on the first run. However, I have noticed that the problem does not occur if I index documents in a synchronous manner (i.e. waiting for a response to each indexing request before issuing the next).

Cheers,
Paul

s1monw · 2013-08-21T08:06:34Z

I think you should not use "participants.*" as the pattern but simply use "*" instead, since the template is already defined for a type. See this gist: https://gist.github.com/s1monw/6291595
Does this make sense?

brwe · 2013-08-21T10:31:53Z

I can still reproduce the behavior here, even with the modified mapping and a slightly simpler test. Looks like a bug to me.

kimchy · 2013-08-21T10:35:42Z

This might relate to a bug we fixed in put mapping, where it didn't always wait for all the relevant nodes to properly ack that the mapping has been applied. It should be fixed in master and 0.90 (upcoming 0.90.4).

s1monw · 2013-08-21T10:44:49Z

@brwe can you attach the test you use to reproduce it here?

brwe · 2013-08-21T13:28:25Z

I prepared a (very crude) test on branch issue-3544-test in my repository.

Adding an actionGet() after

    client.prepareIndex(INDEX_NAME, MAPPING_TYPE)
            .setSource(source)
            .setConsistencyLevel(WriteConsistencyLevel.QUORUM)
            .execute();

seems to solve the problem, as you expected @s1monw .
So, probably not a bug after all.

s1monw · 2013-08-21T13:39:02Z

@brwe thanks for clarifying! @pmclellan are you ok with closing this issue, does it work for you as well based on the examples?

pmclellan · 2013-08-21T16:39:16Z

Thanks for the feedback. However, I still think this is a bug. Indexing all the documents synchronously, as @brwe suggested, does prevent the issue from happening, but I believe the problem should not occur when indexing asynchronously either.

I'll try running my test against master to see if the issue has been resolved by the fix @kimchy mentioned.

s1monw · 2013-08-21T16:41:03Z

@pmclellan you can index them async, but you need to wait until they are all indexed on all replicas and then call refresh; otherwise you might miss some docs depending on which shard/replica you hit. I don't see the sync/async issue you are talking about here. Can you explain?

pmclellan · 2013-08-21T17:34:22Z

Apologies for the confusion. I'll do my best to explain what I mean as it's possible I've got the wrong end of the stick here.

@brwe suggested adding '.actionGet()' to the end of my indexing call. My understanding is that this would cause the test to wait until a response has been received from the Elasticsearch node before proceeding to index the next document. This is what I am terming as synchronous execution because each index request must be acknowledged before a subsequent one is issued. My understanding is that the 'execute()' method returns a ListenableActionFuture so that you can issue a series of requests without waiting for responses and then use the futures to determine when they've been acknowledged. This is what I'd consider to be asynchronous execution. In either case I'd expect the end result to be the same with all documents being indexed according to the type mapping. Is this a valid assumption or have I got things mixed up here?

One thing I can see that I'm not currently doing in my test is waiting for the result of all the indexing futures. I'll give this a try and see what happens.
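
The "issue everything, then wait for all futures" pattern described here can be sketched in plain Java. This is only an illustration: `CompletableFuture` stands in for the future returned by `execute()`, `join()` stands in for `actionGet()`, and `indexAsync` is a hypothetical placeholder rather than an Elasticsearch call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Stand-in sketch: CompletableFuture plays the role of the future returned by execute().
class AsyncIndexingSketch {

    // Hypothetical placeholder for an asynchronous index request.
    static CompletableFuture<String> indexAsync(String docId) {
        return CompletableFuture.supplyAsync(() -> "indexed:" + docId);
    }

    // Issue all requests first, then block until every one has been acknowledged.
    static List<String> indexAllThenWait(List<String> docIds) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String id : docIds) {
            futures.add(indexAsync(id));   // no waiting between requests
        }
        List<String> acks = new ArrayList<>();
        for (CompletableFuture<String> f : futures) {
            acks.add(f.join());            // analogous to calling actionGet() on each future
        }
        return acks;                       // refresh and query only after this returns
    }
}
```

This only guarantees that every request has been acknowledged before the refresh; it says nothing about which analyzer was applied to each document.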

pmclellan · 2013-08-21T18:11:46Z

Modified test to wait for all indexing responses before sending refresh request and re-ran it using latest code from master. Unfortunately, neither change had any effect. However, if I group all the separate indexing actions into a single bulk request then the problem goes away.

Hope this info helps with further investigation.

s1monw · 2013-08-22T06:51:31Z

I have to apologize... I looked into this more closely and I can see the dynamic mapping being applied correctly, even on master, but somehow it pulls a default analyzer even though the field mapper has whitespace initialized. We will look deeper into this! It certainly has something to do with a race when we see the field for the first time.

If we publish the object mappers before we have updated the actual DocumentFieldMappers, concurrently indexed documents can suddenly use default analysis chains: they skip the application of dynamic templates (or dynamic mappers in general), because ObjectMappers are only built once. Closes elastic#3544
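
One way to read the ordering problem described above is sketched below in plain Java. This is a simplified, single-threaded model that replays one unlucky interleaving deterministically; the class, method, and field names are illustrative assumptions and do not mirror the actual Elasticsearch internals.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the race; names are illustrative, not Elasticsearch's own.
class MapperRaceSketch {
    private final Set<String> objectMappers = ConcurrentHashMap.newKeySet();
    private final Map<String, String> fieldAnalyzers = new ConcurrentHashMap<>();

    // Writer, step 1: make the new object mapper visible to other threads.
    void publishObjectMapper(String objectPath) {
        objectMappers.add(objectPath);
    }

    // Writer, step 2: make the field mapper (and its analyzer) visible.
    void publishFieldMapper(String field, String analyzer) {
        fieldAnalyzers.put(field, analyzer);
    }

    // Reader: a concurrently indexed document resolving the analyzer for a field.
    String resolveAnalyzer(String objectPath, String field, String templateAnalyzer) {
        if (!objectMappers.contains(objectPath)) {
            // Object not known yet: this document would apply the dynamic template itself.
            return templateAnalyzer;
        }
        // Object already published: the template is skipped and the field mapper
        // is expected to exist. If it does not yet, the default analyzer is used.
        return fieldAnalyzers.getOrDefault(field, "default");
    }

    // Replays the interleaving: a reader runs between writer step 1 and step 2.
    static String[] demo() {
        MapperRaceSketch m = new MapperRaceSketch();
        m.publishObjectMapper("participants");
        String inGap = m.resolveAnalyzer("participants", "participants.new", "whitespace");
        m.publishFieldMapper("participants.new", "whitespace");
        String after = m.resolveAnalyzer("participants", "participants.new", "whitespace");
        return new String[] { inGap, after };
    }
}
```

In this model, `demo()` returns "default" for the lookup that lands in the gap and "whitespace" once the field mapper is visible. Publishing the field mapper before the object mapper closes the gap in the sketch.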

brwe · 2013-08-30T12:36:40Z

@pmclellan Just writing to let you know we are still on it. It is indeed a bug and, as @s1monw guessed, a race condition. Thanks again for providing the test and being so persistent! @s1monw prepared a fix (see above) but we need to discuss this further before pulling it in.

pmclellan · 2013-08-30T15:45:18Z

Hi @brwe, thanks for the update. It's a rare day when someone actually thanks me for my persistence.

Good to hear that you've pinned down the cause of the problem. Keep me posted on the progress of the fix.

s1monw · 2013-08-30T19:21:10Z

@pmclellan I am glad you were persistent and I learned my part here as well :) thanks for this

Dynamic mapping allows new fields to be introduced dynamically into an existing mapping. There is a (pretty rare) race condition where a new field/object being introduced is not immediately visible to another document that introduces it at the same time. Closes #3667, closes #3544

sschuerz · 2015-02-06T08:37:54Z

I still see incorrect behavior in Elasticsearch when it selects the index analyzer for a dynamic property. I have an "autocomplete*" property; the concrete fields "_autocomplete_de" or "_autocomplete_en" are filled by a transform script depending on the language of the document. The index analyzer to use has the same name as the concrete field. Look at this example:

PUT test_dynamic_template
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nGram_analyzer": {
          "alias": "default_index",
          "tokenizer": "nGram_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "_autocomplete_de": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stopwords_de_filter"
          ]
        },
        "_autocomplete_en": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stopwords_en_filter"
          ]
        }
      },
      "tokenizer": {
        "nGram_tokenizer": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 25,
          "token_chars": [ "letter", "digit", "symbol", "punctuation" ]
        }
      },
      "filter": {
        "stopwords_de_filter": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "stopwords_en_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

PUT test_dynamic_template/_mapping/page
{
  "page": {
    "properties": {
      "_lang": {
        "type": "string",
        "index": "not_analyzed"
      }
    },
    "dynamic_templates" : [
      {
        "autocomplete_for_lang" : {
          "match" : "_autocomplete_*",
          "mapping" : {
            "type" : "string",
            "index_analyzer" : "{name}",
            "search_analyzer": "standard"
          }
        }
      }
    ],
    "transform": {
      "script": "if(ctx._source._lang instanceof String) { ctx._source['_autocomplete_' + ctx._source._lang] = autocomplete_fields.collect{ ctx._source[it] }; }",
      "params": {
        "autocomplete_fields": ["title", "content"]
      }
    }
  }
}

POST test_dynamic_template/page/
{
  "_lang": "en",
  "title": "elasticsearch dynamic template",
  "content": "this is some content to reproduce the behavior"
}

GET test_dynamic_template/_search?search_type=count
{
  "query": {
    "match_all": { }
  },
  "facets" : {
    "all_autocomplete" : {
      "terms" : {
        "field": "_autocomplete_en",
        "regex": "^s.*", 
        "size": 5
      }
    }
  }
}

I've executed these commands multiple times, and sometimes I see the following behavior:
Depending on the Elasticsearch node where the facet query is executed, it returns either only the word "some" (which is correct) or "sticsearch", "sticsearc", "sticsear", "sticsea", "sticse" (in this case, the wrong analyzer seems to be used).

elasticsearch version: 1.4.2

Update: The wrong behavior only occurs if the name of the dynamic property starts with an underscore. I have now named it "autocomplete_*" and it works fine. However, I would still consider this a bug, since underscores are permitted in property names.

brwe · 2015-02-06T15:20:28Z

@sschuerz This is a different issue. Analyzer names may not start with an _; these names are reserved for special analyzers. But you are right, this is not documented anywhere and there is no check for it, so the analyzer is just silently dropped.
I opened a new issue for this: #9596

Thanks for reporting and especially the update that points to the _. This saved a lot of time.
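
As an illustration of the kind of check that is missing, a guard along these lines would reject such names up front instead of dropping them silently. This is a hypothetical sketch in plain Java, not the actual Elasticsearch code or the eventual fix; `AnalyzerNameCheck` and its methods are made-up names.

```java
// Hypothetical guard; not part of Elasticsearch.
class AnalyzerNameCheck {

    // Throws if the custom analyzer name is empty or uses the reserved '_' prefix.
    static void validate(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("analyzer name must not be empty");
        }
        if (name.startsWith("_")) {
            throw new IllegalArgumentException(
                "analyzer name [" + name + "] must not start with '_'");
        }
    }

    // Convenience wrapper that reports validity without throwing.
    static boolean isValid(String name) {
        try {
            validate(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With this sketch, `isValid("autocomplete_en")` is true and `isValid("_autocomplete_en")` is false, so the misconfiguration would surface when the index settings are created rather than at query time.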

ghost assigned brwe Aug 21, 2013

kimchy mentioned this issue Sep 11, 2013

Rare race condition when introducing new fields into a mapping #3667

Closed

kimchy closed this as completed in bbce6e8 Sep 11, 2013

brwe mentioned this issue Feb 6, 2015

Analyzers starting with _ are silently ignored #9596

Closed
