Timestamp formatting in index name breaks Cascading and Pig integrations #985

Closed
jbaiera opened this issue May 8, 2017 · 1 comment

jbaiera commented May 8, 2017

As described in the documentation:

When using dynamic/multi writes, one can also specify a formatting of the value returned by the field. Out of the box, elasticsearch-hadoop provides formatting for date/timestamp fields which is useful for automatically grouping time-based data (such as logs) within a certain time range under the same index. By using the Java SimpleDateFormat syntax, one can format and parse the date in a locale-sensitive manner.

What this boils down to is that you can use a pattern like index/type-{time:YYYY.MM.dd} to denote that the time field should be formatted as YYYY.MM.dd and inserted into the type name. This works fine when used in the type name, but in 6.0 indices will by default allow only one type, so time-based grouping has to move into the index part of the name. Which leads us to the problem:

Cascading and Pig break when this formatting is applied in the index part of the name. A lot of logic in both Cascading and Pig expects the Tap or Load function to refer to resources that can be resolved on HDFS. Generally this is not a problem, since the paths are processed and do not contain any host address. But when the pattern is specified in the index part, the colon trips up the path parsing and causes the code to throw an exception.
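To illustrate the failure mode, here is a minimal sketch that uses only java.net.URI. The helper `toPathUri` is hypothetical: it imitates a common approach of treating everything before a colon that precedes the first slash as a URI scheme. It is not the actual Cascading, Pig, or Hadoop code, so the real exception and stack trace may differ.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class ColonInResourceDemo {

    // Hypothetical helper (an assumption for illustration): split off a scheme
    // at the first colon that appears before any slash, then build a URI from
    // the parts. The multi-argument URI constructor quotes illegal characters
    // such as '{', '}' and '|' in the path.
    static URI toPathUri(String resource) throws URISyntaxException {
        String scheme = null;
        String path = resource;
        int colon = resource.indexOf(':');
        int slash = resource.indexOf('/');
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = resource.substring(0, colon);
            path = resource.substring(colon + 1);
        }
        // A scheme combined with a path that does not start with '/' is rejected.
        return new URI(scheme, null, path, null, null);
    }

    // Returns true if the resource string survives the path parsing above.
    static boolean parses(String resource) {
        try {
            toPathUri(resource);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Colon inside the index part, before the first slash.
        System.out.println(parses("index-{time:YYYY.MM.dd}/type"));
        // Colon inside the type part, after the first slash.
        System.out.println(parses("index/type-{time:YYYY.MM.dd}"));
        // Pipe separator inside the index part.
        System.out.println(parses("index-{time|YYYY.MM.dd}/type"));
    }
}
```

Under this assumption, only the colon placed before the first slash is mistaken for a scheme delimiter, which is consistent with the pattern working in the type name but not in the index name.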

There's really no way to change this behavior since much of the code lives on Cascading and Pig's side, and when the input classes are queried they must respond with an absolute path to the resource. In Cascading's case this is low impact since we could just return a placeholder and be on our way, but Pig requires the resource to be real, as it will later pass the resource to the load function for loading, and modifying it in any way would cause the load function to not receive the index and type correctly.

To get around this we will need to change the separator used when parsing formatted patterns. In 6.0 only the pipe character "|" will be accepted. In 5.x we will accept both the colon ":" and the new pipe "|", and log a deprecation warning whenever we encounter the colon.
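The 5.x behavior described above could look roughly like the following sketch. The class, the `resolve` method, and the `deprecatedSeparatorSeen` flag are hypothetical names chosen for illustration, and the sketch handles only a single formatted date field; it is not the elasticsearch-hadoop implementation.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class IndexPatternDemo {

    // Set when the deprecated ':' separator is encountered (illustrative only).
    static boolean deprecatedSeparatorSeen = false;

    // Hypothetical helper: resolves one {field|format} or {field:format}
    // placeholder against an already extracted date value for that field.
    static String resolve(String pattern, Date fieldValue) {
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open + 1);
        if (open < 0 || close < 0) {
            return pattern; // no placeholder, nothing to resolve
        }
        String body = pattern.substring(open + 1, close);

        // Prefer the new '|' separator; fall back to ':' with a warning.
        int sep = body.indexOf('|');
        if (sep < 0) {
            sep = body.indexOf(':');
            if (sep >= 0) {
                deprecatedSeparatorSeen = true;
                System.err.println("Deprecated ':' separator in [" + pattern + "], use '|' instead");
            }
        }
        if (sep < 0) {
            throw new IllegalArgumentException("This sketch only handles formatted fields: " + pattern);
        }

        // Fixed locale and time zone keep the output reproducible.
        SimpleDateFormat format = new SimpleDateFormat(body.substring(sep + 1), Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return pattern.substring(0, open) + format.format(fieldValue) + pattern.substring(close + 1);
    }

    public static void main(String[] args) {
        Date may8 = new Date(1494201600000L); // 2017-05-08T00:00:00Z
        System.out.println(resolve("index-{time|YYYY.MM.dd}/type", may8));
        System.out.println(resolve("index-{time:YYYY.MM.dd}/type", may8));
    }
}
```

Both calls resolve to the same index name; only the second one triggers the warning. A 6.0-style variant would drop the colon fallback entirely.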

jbaiera added a commit that referenced this issue May 8, 2017
… format.

Colon character gets in the way when some frameworks attempt to fit the input resource
into a Path for HDFS (even though they eventually never use it as an HDFS path). The
break in parsing causes the jobs to fail when using this format.

Applying fix to the format separator in the index pattern.

relates #985

jbaiera commented Jul 5, 2017

This is fixed in master for 6.0, and the new separator will be available in 5.5 along with deprecation warnings.
