
String fields longer than 32kb cannot be indexed #873

Open
kroepke opened this Issue Jan 14, 2015 · 19 comments

@kroepke (Member) commented Jan 14, 2015

Elasticsearch has an upper limit on term length, so trying to index values longer than ~32 KB fails with an error.
Find a way to store those values without trying to analyze them.
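
For illustration, a minimal sketch that reproduces the failure, assuming a local Elasticsearch 2.x on port 9200; the index and field names here are hypothetical:

# create an index with a not_analyzed string field
curl -X PUT 'http://localhost:9200/term-limit-test' -d '
{ "mappings": { "message": { "properties": { "payload": { "type": "string", "index": "not_analyzed" } } } } }'

# indexing a value longer than 32766 bytes into it fails with MaxBytesLengthExceededException
curl -X POST 'http://localhost:9200/term-limit-test/message' -d "
{ \"payload\": \"$(head -c 40000 /dev/zero | tr '\0' 'a')\" }"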

@kroepke kroepke added this to the 1.1.0 milestone Jan 14, 2015

@kroepke (Member, Author) commented Jan 14, 2015

This comment has been minimized.

@razvanphp (Contributor) commented Mar 20, 2015

+1

@kroepke kroepke modified the milestones: 1.2.0, 1.1.0 May 29, 2015

@bernd (Member) commented Aug 11, 2015

Removing this from 1.2, not gonna make it. Sorry.

@delfer commented Dec 1, 2015

Found another 'dummy' workaround: cut the tail of the long field off into a separate field, then overwrite that field with the value of any other field.

{
  "extractors": [
    {
      "condition_type": "regex",
      "condition_value": "^.{16383,}$",
      "converters": [],
      "cursor_strategy": "cut",
      "extractor_config": {
        "regex_value": "^.{0,16383}(.*)"
      },
      "extractor_type": "regex",
      "order": 0,
      "source_field": "msg.response",
      "target_field": "responseTail",
      "title": "cut response"
    },
    {
      "condition_type": "none",
      "condition_value": "",
      "converters": [],
      "cursor_strategy": "copy",
      "extractor_config": {},
      "extractor_type": "copy_input",
      "order": 0,
      "source_field": "gl2_remote_ip",
      "target_field": "responseTail",
      "title": "replace responseTail by server IP"
    }
  ],
  "version": "1.2.2 (91c7822)"
}
@ghost commented Jan 28, 2016

+1

Same issue:
2016-01-28T17:03:43.165+01:00 ERROR [Messages] Failed to index [1] messages. Please check the index error log in your web interface for the reason. Error: failure in bulk execution:
[13]: index [graylog2_4], type [message], id [b31f5110-c5d8-11e5-8227-001a4a777b5d], message [IllegalArgumentException[Document contains at least one immense term in field="other" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48]...', original message: bytes can be at most 32766 in length; got 186000]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 186000]; ]

My GELF message:
{ "message": "OK", "other": "more than 32k...." }

@joschi (Contributor) commented Jan 28, 2016

This comment has been minimized.

@ghost commented Feb 1, 2016

@joschi Thanks for the links.
Is there a good definition of GELF somewhere, e.g. https://www.graylog.org/resources/gelf/, that says which field has which data type and limitations? That would help us a lot.
Right now we find out about such "limitations" by reverse engineering (bugs and testing).

@joschi (Contributor) commented Feb 1, 2016

@kablz The GELF specification can be found at https://www.graylog.org/resources/gelf/ and describes the names and types of the mandatory fields in a GELF message. Additional fields (see the specification) naturally don't have a fixed schema unless you enforce one on your GELF producers.
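
For reference, a minimal valid GELF message contains the mandatory version, host, and short_message fields, with additional fields prefixed by an underscore; a sketch of the oversized message from the comment above, rewritten that way:

{
  "version": "1.1",
  "host": "example.org",
  "short_message": "OK",
  "_other": "more than 32k...."
}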

@csquire commented Mar 22, 2016

You could try using an index template that includes a dynamic template matching all string fields, then use 'ignore_above' to prevent the document from failing to index. Below is a template I use on another Elasticsearch cluster I feed logs to (not Graylog), where I was getting rejections from long fields such as Java stack traces. For my purposes, I didn't find it useful to index any field with over 512 characters, but that value can be tweaked to whatever you like. The other settings can be removed or changed as desired.

(See the Elasticsearch docs on Indices Templates and Index Mapping.)

{
  "logs_template": {
    "template": "logs*",
    "mappings": {
      "_default_": {
        "_all": {
          "enabled": false
        },
        "dynamic_templates": [
          {
            "notanalyzed": {
              "match": "*",
              "match_mapping_type": "string",
              "mapping": {
                "ignore_above": 512,
                "type": "string",
                "index": "not_analyzed",
                "doc_values": true
              }
            }
          }
        ]
      }
    }
  }
}

From the docs:

The analyzer will ignore strings larger than this size. Useful for generic not_analyzed fields that should ignore long text.

This option is also useful for protecting against Lucene’s term byte-length limit of 32766. Note: the value for ignore_above is the character count, but Lucene counts bytes, so if you have UTF-8 text, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
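
To apply such a template, it can be installed with the standard _template API before the matching indices are created; a sketch, assuming the inner object (everything under the "logs_template" key above, which is the form Elasticsearch uses when returning an installed template) is saved as logs_template.json:

curl -X PUT 'http://localhost:9200/_template/logs_template' -d @logs_template.json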

@meixger commented Apr 14, 2016

@csquire Nice, and this would work in a custom template, but unfortunately I've not found a way to replace the store_generic template in the default graylog-internal template:

{
  "graylog-internal": {
    "order": -2147483648,
    "template": "graylog_*",
    "mappings": {
      "message": {
        ...
        "dynamic_templates": [
          {
            "internal_fields": {
              ...
              "match": "gl2_*"
            }
          },
          {
            "store_generic": {
              "mapping": {
                "index": "not_analyzed",
              },
              "match": "*"
            }
          }
        ],
        "properties": {
          ...
        }
      }
    }
  }
}

What would speak against adding ignore_above as a default?

...
"store_generic": {
  "mapping": {
    "index": "not_analyzed",
    "ignore_above": 32766
  },
  "match": "*"
}
...
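
One possible workaround, not from this thread, just a sketch: Elasticsearch merges matching index templates in ascending order, so a second template with a higher order than graylog-internal's -2147483648 should be able to override the store_generic dynamic template. The template name here is hypothetical, and the mapping uses the same ES 2.x string syntax as above:

curl -X PUT 'http://localhost:9200/_template/graylog-ignore-above' -d '
{
  "order": 0,
  "template": "graylog_*",
  "mappings": {
    "message": {
      "dynamic_templates": [
        {
          "store_generic": {
            "match": "*",
            "mapping": {
              "index": "not_analyzed",
              "ignore_above": 32766
            }
          }
        }
      ]
    }
  }
}'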
@sjoerdmulder commented Nov 15, 2016

I also hit this issue on Graylog 2.1.1; ignore_above seems like a good option to fix this.

@mike-daoust commented Mar 2, 2017

This is an issue for me also.

@listingmirror commented May 9, 2017

I hit this and my entire cluster dies (stops getting new messages). Shouldn't there be some kind of default limit to prevent cluster death? (Maybe this didn't kill the cluster, still researching.)

@cultavix commented May 10, 2017

Same as @listingmirror. Any time we get a larger-than-normal Java stack trace, it brings Graylog down completely: no documents are indexed anymore, no new logs. The only solution I've found so far is to kill -9 the process and then delete the on-disk journal.

@jebucha commented May 31, 2017

I believe we are also running into this issue. I hadn't connected the dots, but I'm seeing indexing failures ("Document contains at least one immense term in field=full_message"), and the node that threw that error is not currently processing incoming messages, just queuing them up, backed up by 2 million and counting. As with others, my primary resolution has been to restart the service.

@Aenima4six2 commented Aug 15, 2017

We were getting this issue on pre-2.3 versions of Graylog and fixed it using @joschi's advice above (custom mappings). However, with our recent upgrade to Graylog 2.3, the issue is back, even though a custom mapping that prevents the field from being indexed is present in ES.

Current Error

{"type":"illegal_argument_exception","reason":"DocValuesField \"requestContent\" is too large, must be <= 32766"}
--

Old (Pre 2.3) Error

{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field=\"requestContent\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[45, 45, 45, 32, 82, 101, 113, 117, 101, 115, 116, 32, 72, 101, 97, 100, 101, 114, 115, 32, 45, 45, 45, 13, 10, 67, 111, 110, 110, 101]...', original message: bytes can be at most 32766 in length; got 38345","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 38345"}}

Pushed the following custom mapping to ES to address this, but no luck.

curl -X PUT -d '{ "template": "graylog_*", "mappings" : { "message" : { "properties" : { "requestContent" : { "type" : "string", "index" : "no" } } } } }' http://localhost:9200/_template/graylog-custom-mapping?pretty

curl -X GET 'http://localhost:9200/graylog_deflector/_mapping?pretty' | jq
{
  "graylog_5": {
    "mappings": {
      "message": {
        "dynamic_templates": [
          {
            "internal_fields": {
              "match": "gl2_*",
              "mapping": {
                "type": "keyword"
              }
            }
          },
          {
            "store_generic": {
              "match": "*",
              "mapping": {
                "index": "not_analyzed"
              }
            }
          }
        ],
        "properties": {
          "AccountName": {
            "type": "keyword"
          },
          ...
          "requestContent": {
            "type": "keyword",
            "index": false
          },
         ...
        }
      }
    }
  }
}

UPDATE: I think I found a viable solution.
https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html

curl -X PUT http://localhost:9200/_template/graylog-custom-mapping?pretty -d '
{
  "template": "graylog_*",
  "mappings" : {
    "message" : {
      "properties" : {
        "requestContent" : {
          "type" : "string",
          "index" : "no",
          "doc_values": false ---> turn this off.. ES 5.5 appears to have a 32k size limit.
        }
      }
    }
  }
}'
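
To verify that the template was installed, the template API can be queried back:

curl -X GET 'http://localhost:9200/_template/graylog-custom-mapping?pretty'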
@avdhoot commented Sep 25, 2017

@Aenima4six2 thanks for the above solution.
+1

@Ayyappa752 commented Feb 19, 2018

Hi @csquire, I ran into the same problem when storing an HTML template in Elasticsearch. I tried not indexing the field, and increasing the size with "ignore_above": 512, but that didn't work. Finally I had to use "doc_values": true along with the size. How come doc_values solved the issue?

@zhangtemplar commented Dec 11, 2018

@Aenima4six2 Recent versions of Elasticsearch do not allow you to change the type, index, and/or doc_values of an existing field.

However, using ignore_above works. Here is the command:

curl -XPUT 'http://localhost:9200/graylog_0/_mapping/message' -d '
{
    "message" : {
      "properties" : {
        "screenShot" : {
          "type" : "keyword",
          "ignore_above": 32000
        }
      }
    }
}
'
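
Note that a _mapping call like this only affects the existing graylog_0 index; when Graylog rotates to a new index, the setting is gone. A sketch of persisting it through an index template instead, with the same field and limit as above (on ES 6+ the Content-Type header is required):

curl -X PUT 'http://localhost:9200/_template/graylog-custom-mapping' -H 'Content-Type: application/json' -d '
{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": {
        "screenShot": {
          "type": "keyword",
          "ignore_above": 32000
        }
      }
    }
  }
}'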