Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested multi_field type wrapped by a custom type passing 'external values' doesn't get values passed #5402

Closed
thomasm82 opened this issue Mar 12, 2014 · 2 comments · Fixed by #6867
Assignees

Comments

@thomasm82
Copy link

This issue came up when I tried to combine the Attachment type (elasticsearch-mapper-attachments plugin) with a nested multi_field type. The intention behind that was to use the fulltext content extracted by the Attachment type for creating shingles and nGrams as well.

Prerequisites

Install the Attachment type plugin:

$ elasticsearch-1.0.1/bin/plugin --install elasticsearch/elasticsearch-mapper-attachments/2.0.0.RC1

Index configuration

I've prepared a sample configuration for an elasticsearch index 'attachment' as listed below:

# delete the index
$ curl -XDELETE 'http://localhost:9200/attachment'

# create the index with some analyzers to be used later on
$ curl -XPUT http://localhost:9200/attachment?pretty -d ' 
index :
    number_of_shards : 1
    number_of_replicas : 1

    analysis:
        analyzer:
            default:
                type: custom
                tokenizer: standard
            ngram_analyzer:
                type : custom
                tokenizer : standard
                filter: [ngram_filter]
            e_ngram_analyzer:
                type : custom
                tokenizer : standard
                filter: [e_ngram_filter]
            shingle_analyzer:
                type : custom
                tokenizer : standard
                filter: [shingle_filter]

        filter: 
            shingle_filter:
                type : shingle
                min_shingle_size : 2
                max_shingle_size : 5
                output_unigrams : true
            ngram_filter:
                type : nGram
                min_gram : 1
                max_gram : 20
            e_ngram_filter:
                type : edgeNGram
                min_gram : 1
                max_gram : 20                

    mapping:
        attachment:
            ignore_errors: false
'

# create the mapping for the index's documents
$ curl -XPUT http://localhost:9200/attachment/document/_mapping?pretty -d '
{
    "document" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "path" : "full",
                "fields" : {
                    "file" : {
                        "type" : "multi_field",
                        "fields" : {
                            "file" : {
                                "type" : "string",
                                "store" : true,
                                "index" : "not_analyzed"
                            },
                            "shingle" : {
                                "type" : "string",
                                "store" : true,
                                "analyzer" : "shingle_analyzer"
                            },
                            "ngram" : {
                                "type" : "string",
                                "store" : true,
                                "analyzer" : "ngram_analyzer"
                            },
                            "e_ngram" : {
                                "type" : "string",
                                "store" : true,
                                "analyzer" : "e_ngram_analyzer"
                            }
                        }
                    },
                    "author" : {
                        "type" : "string",
                        "store" : true
                    },
                    "title" : {
                        "type" : "string",
                        "store" : true
                    },
                    "name" : {
                        "type" : "string",
                        "store" : true
                    },
                    "date" : {
                        "type" : "date",
                        "store" : true,
                        "format" : "dateOptionalTime"
                    },
                    "keywords" : {
                        "type" : "string",
                        "store" : true
                    },
                    "content_type" : {
                        "type" : "string",
                        "store" : true
                    },
                    "content_length" : {
                        "type" : "integer",
                        "store" : true
                    }
                }   
            }
        }
    }
}

'

Adding document to the index

After adding a document, we will see, that the defined multi_field fields for shingles and nGrams do NOT receive any data.

# Adding a document to the index, as needed by the Attachment plugin binary data as base64 encoded string
$ curl -XPUT 'http://localhost:9200/attachment/document/1?pretty' -d '
{
    "file" : "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQgdG8gZGVtb25zdHJhdGUgdGhlIGJlaGF2aW9yIG9mIHRoZSBBdHRhY2htZW50IHR5cGUgcGx1Z2luIAp3aGljaCBkb2Vzbid0IHBhc3MgdGhpcyBkb2N1bWVudCdzIGNvbnRlbnQgdG8gdGhlIGRlZmluZWQgbmVzdGVkIG11bHRpX2ZpZWxkIHBhcnRzIApmb3IgY3JlYXRpbmcgbkdyYW1zIGFuZCBzaGluZ2xlcy4="
}
'

# Searching for documents afterwards shows, that the multi_field fields are not receiving the data provided
$ curl -XGET 'http://localhost:9200/attachment/_search?q=*&fields=*&pretty'

# Result:
{
  "took" : 48,  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "attachment",
      "_type" : "document",
      "_id" : "1",
      "_score" : 1.0,
      "fields" : {
        "file.e_ngram" : [ "" ],
        "file.content_length" : [ 200 ],
        "file.ngram" : [ "" ],
        "file.file" : [ "" ],
        "file.content_type" : [ "text/plain; charset=ISO-8859-1" ],
        "file.shingle" : [ "" ]
      }
    } ]
  }
}

Problem in elasticsearch code

I've started investigations on this issue, as I could find discussions online about similar problems, but no solution. Digging into the depths of the elasticsearch code I've detected the issue and built a workaround inside the Attachment plugin's code for my local needs which produced the desired result. From my point of view this should be fixed inside the elasticsearch code and I will try to explain how.

From what I hopefully got right, there are two ways the org.elasticsearch.index.mapper.ParseContext may provide values to any subclass of org.elasticsearch.index.mapper.core.AbstractFieldMapper for parsing:

  • accessing the next token from the provided JSON stream: org.elasticsearch.index.mapper.ParseContext.parser().currentToken()
  • checking for a so called 'external value' to be used instead: org.elasticsearch.index.mapper.ParseContext.externalValue()

Inside the org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateFieldForString(ParseContext, String, float) method this can be seen, where the 'external value' is tried to be consumed (emphasis intended) by calling org.elasticsearch.index.mapper.ParseContext.externalValue(). I am using the term 'consumed' as this is literally what happens as the call to the externalValue() method sets a boolean flag to 'false' which is actually used for checking for existence of such an 'external value'.

So, this is where the problem resides and I overcame it by wrapping the ParseContext passed to the Attachment mapper plugin with my own implementation, that actually just delegates any call to the original context except for the org.elasticsearch.index.mapper.ParseContext.externalValueSet() method which I've overwritten to not only check the boolean flag but also check for the 'external value' to be not null. This way it is assured that as long as the multi_field fields are processed all of them get the content.

Desired output after fixing the code

Just to make sure what the expected result should look like:

# result of the above document being indexed with my local changes, where we can see that the nGram and shingle fields properly get content
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "attachment",
      "_type" : "document",
      "_id" : "1",
      "_score" : 1.0,
      "fields" : {
        "file.e_ngram" : [ "This is a test document to demonstrate the behavior of the Attachment type plugin \nwhich doesn't pass this document's content to the defined nested multi_field parts \nfor creating nGrams and shingles.\n" ],
        "file.content_length" : [ 200 ],
        "file.ngram" : [ "This is a test document to demonstrate the behavior of the Attachment type plugin \nwhich doesn't pass this document's content to the defined nested multi_field parts \nfor creating nGrams and shingles.\n" ],
        "file.file" : [ "This is a test document to demonstrate the behavior of the Attachment type plugin \nwhich doesn't pass this document's content to the defined nested multi_field parts \nfor creating nGrams and shingles.\n" ],
        "file.content_type" : [ "text/plain; charset=ISO-8859-1" ],
        "file.shingle" : [ "This is a test document to demonstrate the behavior of the Attachment type plugin \nwhich doesn't pass this document's content to the defined nested multi_field parts \nfor creating nGrams and shingles.\n" ]
      }
    } ]
  }
}

Hopefully my explanations are clear to you guys. As I am not totally sure which way one should solve this issue in order to avoid any side-effects I am relying on you to get this thing fixed.

Thanks in advance
Tom

dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Jun 14, 2014
In context of mapper attachment and other mapper plugins, when dealing with multi fields, sub fields never get the `externalValue` although it was set.

This patch consider that an external value is set when `externalValue` is different than `null`.

Here is a full script which reproduce the issue when used with mapper attachment plugin:

```
DELETE /test

PUT /test
{
    "mappings": {
        "test": {
            "properties": {
                "f": {
                    "type": "attachment",
                    "fields": {
                        "f": {
                            "analyzer": "english",
                            "fields": {
                                "no_stemming": {
                                    "type": "string",
                                    "store": "yes",
                                    "analyzer": "standard"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

PUT /test/test/1
{
    "f": "VGhlIHF1aWNrIGJyb3duIGZveGVz"
}

GET /test/_search
{
    "query": {
        "match": {
           "f": "quick"
        }
    }
}

GET /test/_search
{
    "query": {
        "match": {
           "f.no_stemming": "quick"
        }
    }
}

GET /test/test/1?fields=f.no_stemming
```

Related to elastic/elasticsearch-mapper-attachments#57

Closes elastic#5402.
@dadoonet dadoonet self-assigned this Jun 14, 2014
@dadoonet dadoonet added the bug label Jun 14, 2014
@dadoonet dadoonet added v1.1.3 and removed v1.1.3 labels Jul 4, 2014
@clintongormley
Copy link

@dadoonet What's the status of this?

@dadoonet
Copy link
Member

@clintongormley At this exact moment, we are talking about it with @jpountz :) We would like to have it in 1.3!

@s1monw s1monw added v1.4.0 and removed v1.3.0 labels Jul 14, 2014
dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Jul 25, 2014
In context of mapper attachment and other mapper plugins, when dealing with multi fields, sub fields never get the `externalValue` although it was set.

Here is a full script which reproduce the issue when used with mapper attachment plugin:

```
DELETE /test

PUT /test
{
    "mappings": {
        "test": {
            "properties": {
                "f": {
                    "type": "attachment",
                    "fields": {
                        "f": {
                            "analyzer": "english",
                            "fields": {
                                "no_stemming": {
                                    "type": "string",
                                    "store": "yes",
                                    "analyzer": "standard"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

PUT /test/test/1
{
    "f": "VGhlIHF1aWNrIGJyb3duIGZveGVz"
}

GET /test/_search
{
    "query": {
        "match": {
           "f": "quick"
        }
    }
}

GET /test/_search
{
    "query": {
        "match": {
           "f.no_stemming": "quick"
        }
    }
}

GET /test/test/1?fields=f.no_stemming
```

Related to elastic/elasticsearch-mapper-attachments#57

Closes elastic#5402.
dadoonet added a commit that referenced this issue Jul 25, 2014
In context of mapper attachment and other mapper plugins, when dealing with multi fields, sub fields never get the `externalValue` although it was set.

Here is a full script which reproduce the issue when used with mapper attachment plugin:

```
DELETE /test

PUT /test
{
    "mappings": {
        "test": {
            "properties": {
                "f": {
                    "type": "attachment",
                    "fields": {
                        "f": {
                            "analyzer": "english",
                            "fields": {
                                "no_stemming": {
                                    "type": "string",
                                    "store": "yes",
                                    "analyzer": "standard"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

PUT /test/test/1
{
    "f": "VGhlIHF1aWNrIGJyb3duIGZveGVz"
}

GET /test/_search
{
    "query": {
        "match": {
           "f": "quick"
        }
    }
}

GET /test/_search
{
    "query": {
        "match": {
           "f.no_stemming": "quick"
        }
    }
}

GET /test/test/1?fields=f.no_stemming
```

Related to elastic/elasticsearch-mapper-attachments#57

Closes #5402.

(cherry picked from commit 11eced0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants