Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong analyzer used when indexing dynamic property #3544

Closed
pmclellan opened this issue Aug 20, 2013 · 16 comments
Closed

Wrong analyzer used when indexing dynamic property #3544

pmclellan opened this issue Aug 20, 2013 · 16 comments

Comments

@pmclellan
Copy link

Hi,

I'm seeing some unusual behaviour when indexing documents in Elasticsearch with a dynamic template. I have a unit test that's failing intermittently. The flow of the test is as follows:

1) Initialise in-memory Elasticsearch cluster (one local node, no replicas)
2) Create new index
3) Create new type mapping
4) Index some documents
5) Refresh index and wait for all documents to be processed
6) Query Elasticsearch for documents

The type mapping I'm using includes the following dynamic template definition:

{
    "participants": {
        "path_match": "participants.*",
        "mapping": {
            "type": "string",
            "store": "yes",
            "index": "analyzed",
            "analyzer": "whitespace"
        }
    }
}

This template is intended to produce fields of the form:

participants.new = [ 'user-1', 'user-2' ]
participants.removed = [ 'user-3' ]

The problem I have is that occasionally (perhaps once in every ten runs) the test will fail because step 6 does not return all the expected documents. When I check the indexed terms for the missing documents I see that values in the 'participants' field have been split into separate tokens on the '-' character. This seems to suggest that the default analyzer is being used for indexing instead of the whitespace one.

So far I haven't been able to detect any pattern to the failures. The unexpected tokenisation only affects a portion of the indexed documents and can occur at any point in the indexing process (i.e. it isn't always the first or last document that has problems).

Here is a gist with a simplified version of my original test: https://gist.github.com/pmclellan/64b192537c97529ec2e4 This version fails much more consistently, usually on the first run. However, I have noticed that the problem does not occur if I index documents in a synchronous manner (i.e. waiting for a response to each indexing request before issuing the next).

Cheers,
Paul

@s1monw
Copy link
Contributor

s1monw commented Aug 21, 2013

I think you should must not add "participants.*" as the pattern but simply use "*" instead since the template is already defined for a type see this gist: https://gist.github.com/s1monw/6291595
does this make sense?

@brwe
Copy link
Contributor

brwe commented Aug 21, 2013

I can still reproduce the behavior here, even with the modified mapping and a slightly simpler test. Looks like a bug to me.

@ghost ghost assigned brwe Aug 21, 2013
@kimchy
Copy link
Member

kimchy commented Aug 21, 2013

this might relate to a bug we fixed in put mapping, where it didn't always wait for all the relevant nodes to properly ack that the mapping have been applied. It should be fixed in master and 0.90 (upcoming 0.90.4).

@s1monw
Copy link
Contributor

s1monw commented Aug 21, 2013

@brwe can you attach the test you use to reproduce it here?

@brwe
Copy link
Contributor

brwe commented Aug 21, 2013

I prepared a (very crude) test on branch issue-3544-test in my repository.

Adding an actionGet() after

    client.prepareIndex(INDEX_NAME, MAPPING_TYPE)
            .setSource(source)
            .setConsistencyLevel(WriteConsistencyLevel.QUORUM)
            .execute();

seems to solve the problem, as you expected @s1monw .
So, probably not a bug after all.

@s1monw
Copy link
Contributor

s1monw commented Aug 21, 2013

@brwe thanks for clarifying! @pmclellan are you ok with closing this issue, does it work for you as well based on the examples?

@pmclellan
Copy link
Author

Thanks for the feedback. However, I still think this is a bug. Indexing all the documents synchronously, as @brwe suggested, does prevent the issue from happening, but I believe the problem should not occur when indexing asynchronously either.

I'll try running my test against master to see if the issue has been resolved by the fix @kimchy mentioned.

@s1monw
Copy link
Contributor

s1monw commented Aug 21, 2013

@pmclellan you can index them async but you need to wait until they are all indexed on all replicas and then call refresh otherwise you might miss some docs depending on which shard / replica you go. I don't see the issue you are talking about sync/async here can you explain?

@pmclellan
Copy link
Author

Apologies for the confusion. I'll do my best to explain what I mean as it's possible I've got the wrong end of the stick here.

@brwe suggested adding '.actionGet()' to the end of my indexing call. My understanding is that this would cause the test to wait until a response has been received from the Elasticsearch node before proceeding to index the next document. This is what I am terming as synchronous execution because each index request must be acknowledged before a subsequent one is issued. My understanding is that the 'execute()' method returns a ListenableActionFuture so that you can issue a series of requests without waiting for responses and then use the futures to determine when they've been acknowledged. This is what I'd consider to be asynchronous execution. In either case I'd expect the end result to be the same with all documents being indexed according to the type mapping. Is this a valid assumption or have I got things mixed up here?

One thing I can see that I'm not currently doing in my test is waiting for the result of all the indexing futures. I'll give this a try and see what happens.

@pmclellan
Copy link
Author

Modified test to wait for all indexing responses before sending refresh request and re-ran it using latest code from master. Unfortunately, neither change had any effect. However, if I group all the separate indexing actions into a single bulk request then the problem goes away.

Hope this info helps with further investigation.

@s1monw
Copy link
Contributor

s1monw commented Aug 22, 2013

I have to apologize... I looked into this closer and I can see the dynamic mapping being applied correctly even on master but somehow it pulls a default analyzer even if the field mapper has whitespace initialized. We will look deeper into this! It has certainly something todo with a race when we see the field the first time.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Aug 26, 2013
If we publish the object mappers before we updated the actual DocumentFieldMappers
concurrently indexing documents can suddenly use default analysis chains since they skip
application of dynamic template or dynamic mappers in general since if there is a race condition
ObjectMappers are only build once.

Closes elastic#3544
@brwe
Copy link
Contributor

brwe commented Aug 30, 2013

@pmclellan Just writing to let you know we are still on it. It is bug indeed and as @s1monw guessed a race condition. Thanks again for providing the test and being so persistent! @s1monw prepared a fix (see above) but we need to discuss this further before pulling it in.

@pmclellan
Copy link
Author

Hi @brwe, thanks for the update. It's a rare day when someone actually thanks me for my persistence.

Good to hear that you've pinned down the cause of the problem. Keep me posted on the progress of the fix.

@s1monw
Copy link
Contributor

s1monw commented Aug 30, 2013

@pmclellan I am glad you were persistent and I learned my part here as well :) thanks for this

@kimchy kimchy closed this as completed in bbce6e8 Sep 11, 2013
kimchy added a commit that referenced this issue Sep 11, 2013
Dynamic mapping allow to dynamically introduce new fields into an existing mapping. There is a (pretty rare) race condition, where a new field/object being introduced will not be immediately visible for another document that introduces it at the same time.

closes #3667, closes #3544
@sschuerz
Copy link

sschuerz commented Feb 6, 2015

I still experience some wrong behavior of elasticsearch when selecting the index analyzer for a dynamic property. I have a "autocomplete*" property; the concrete fields "_autocomplete_de" or "_autocomplete_en" are filled by a transform script depending on the lanuage of the document. The index analyzer to use is equal to concrete name of the field. Look at this example:

PUT test_dynamic_template
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nGram_analyzer": {
          "alias": "default_index",
          "tokenizer": "nGram_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "_autocomplete_de": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stopwords_de_filter"
          ]
        },
        "_autocomplete_en": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stopwords_en_filter"
          ]
        }
      },
      "tokenizer": {
        "nGram_tokenizer": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 25,
          "token_chars": [ "letter", "digit", "symbol", "punctuation" ]
        }
      },
      "filter": {
        "stopwords_de_filter": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "stopwords_en_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

PUT test_dynamic_template/_mapping/page
{
  "page": {
    "properties": {
      "_lang": {
        "type": "string",
        "index": "not_analyzed"
      }
    },
    "dynamic_templates" : [
      {
        "autocomplete_for_lang" : {
          "match" : "_autocomplete_*",
          "mapping" : {
            "type" : "string",
            "index_analyzer" : "{name}",
            "search_analyzer": "standard"
          }
        }
      }
    ],
    "transform": {
      "script": "if(ctx._source._lang instanceof String) { ctx._source['_autocomplete_' + ctx._source._lang] = autocomplete_fields.collect{ ctx._source[it] }; }",
      "params": {
        "autocomplete_fields": ["title", "content"]
      }
    }
  }
}

POST test_dynamic_template/page/
{
  "_lang": "en",
  "title": "elasticsearch dynamic template",
  "content": "this is some content to reproduce the behavior"
}

GET test_dynamic_template/_search?search_type=count
{
  "query": {
    "match_all": { }
  },
  "facets" : {
    "all_autocomplete" : {
      "terms" : {
        "field": "_autocomplete_en",
        "regex": "^s.*", 
        "size": 5
      }
    }
  }
}

I've executed this commands multiple times, sometimes I see the following behavior:
Depending on the elasticsearch node where the facet query is executed, it returns only the word "some" (which is correct) or it returns "sticsearch", "sticsearc", "sticsear", "sticsea", "sticse" (in this case, a wrong analyzer seems to be used).

elasticsearch version: 1.4.2

Update: The wrong behavior only occurs if the name of the dynamic property starts with an underscore. I named it "autocomplete_*" now and it works fine. However, I would still consider this as a bug, since underscores are permitted in property names.

@brwe
Copy link
Contributor

brwe commented Feb 6, 2015

@sschuerz This is a different issue. Analyzer names may not start with an _, these names are reserved for special analyzers. But you are right, this is not documented anywhere and there is no check for this, the analyzer is just silently dropped.
I opened a new issue for this: #9596

Thanks for reporting and especially the update that points to the _. This saved a lot of time.

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
Dynamic mapping allow to dynamically introduce new fields into an existing mapping. There is a (pretty rare) race condition, where a new field/object being introduced will not be immediately visible for another document that introduces it at the same time.

closes elastic#3667, closes elastic#3544
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants