Unclear docs #990

Closed
ov7a opened this issue May 16, 2017 · 10 comments

Comments

ov7a commented May 16, 2017

Partial copy of this thread:
https://discuss.elastic.co/t/writing-to-multiple-indices-and-documentation-about-it/82859

There are some unclear things in the documentation

At some point it says:

Note that multiple indices and/or types are allowed only for reading

But the next paragraph is titled "Dynamic/multi resource writes" and states that

For writing, elasticsearch-hadoop allows the target resource to be resolved at runtime by using patterns (by using the {} format), resolved at runtime based on the data being streamed to Elasticsearch. That is, one can save documents to a certain index or type based on one or multiple fields resolved from the document about to be saved.

It seems like these two statements contradict each other. I'm still not sure whether it is possible to write to multiple indices or not.

Secondly, the example with the timestamp also looks unclear. I think of a resource as 'index/type' (please correct me if I'm wrong). In the example

# index the documents based on their date
es.resource.write = my-collection/{@timestamp:YYYY.MM.dd}

the timestamp is the type, right? But usually we separate indices by time, not by data type. This seems wrong. I would expect the timestamp to be part of the index name:

# index the documents based on their date
es.resource.write = my-collection.{@timestamp:YYYY.MM.dd}/{media_type}

Is this a valid resource? Will it work as expected (i.e. write to multiple indices)?

P.S. If I misplaced the issue, please point me to the proper place to report it. I had zero replies on the forum for almost a month. Elasticsearch/docs says: "If you find an error in the documentation, you should open an issue or pull request on the repository which contains the docs"

jbaiera added the doc label May 16, 2017
Member

jbaiera commented May 16, 2017

This could certainly be cleared up a bit more:

Note that multiple indices and/or types are allowed only for reading

This refers to index and type names such as _all/foo, where multiple indices are read through a pattern sent to Elasticsearch.

For writing, elasticsearch-hadoop allows the target resource to be resolved at runtime by using patterns (by using the {} format), resolved at runtime based on the data being streamed to Elasticsearch. That is, one can save documents to a certain index or type based on one or multiple fields resolved from the document about to be saved.

In this case, it explains that you can use a special pattern (denoted by curly braces) to have the connector determine which index and type to save documents to at runtime using the values stored in the documents' fields. This is different from the above because we are resolving the resource to a single target resource at write time for each document. If you were to use this pattern with something that does not resolve to a single index (_all/{field} for instance will have a single type resolved, but the _all index does not correspond to a single index) then the writing operation will not be successful.
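The single-target requirement described above can be illustrated with a small check. This is only a sketch: `targets_single_index` is a hypothetical helper, not part of elasticsearch-hadoop, and it assumes that `_all`, wildcards, and comma-separated lists are the only ways a resolved resource could address more than one index.

```python
def targets_single_index(resource):
    """Heuristic check that a resolved 'index/type' names one concrete index.

    Hypothetical helper for illustration only; the connector's own
    validation may differ.
    """
    index = resource.split("/", 1)[0]
    return index != "_all" and "*" not in index and "," not in index


print(targets_single_index("my-collection.2017.05.16/book"))
print(targets_single_index("_all/book"))
```

A resource such as `_all/book` has a single type but no single index, which is the case the comment above says will fail on write.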

es.resource.write = my-collection/{@timestamp:YYYY.MM.dd}

~ versus ~

es.resource.write = my-collection.{@timestamp:YYYY.MM.dd}/{media_type}

This is mostly to highlight that you can use the @timestamp field in a pattern and format it however you like. Patterns can appear in either the index path element or the type path element; it makes no difference. Multiple patterns can be used as well. They are resolved at runtime, and the write works as long as the resource, once filled in with data from the document, points to a single index.
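To make the per-document resolution concrete, here is a minimal Python sketch of how a `{field}` or `{field:format}` pattern could be filled in from a document. It is not the connector's implementation: `resolve_resource` and the token table are assumptions made for illustration, the timestamp is assumed to be an ISO-8601 string, and only the `YYYY`, `MM`, and `dd` tokens from this thread are handled (the real connector uses Java date patterns).

```python
import re
from datetime import datetime

# Hypothetical translation of the date tokens used in this thread into
# strftime directives; only these three tokens are supported here.
_DATE_TOKENS = {"YYYY": "%Y", "MM": "%m", "dd": "%d"}

# Matches {field} and {field:format}.
_PATTERN = re.compile(r"\{([^{}:]+)(?::([^{}]+))?\}")


def _format_date(value, fmt):
    """Format an ISO-8601 timestamp string with a YYYY/MM/dd style pattern."""
    parsed = datetime.fromisoformat(value)
    for token, directive in _DATE_TOKENS.items():
        fmt = fmt.replace(token, directive)
    return parsed.strftime(fmt)


def resolve_resource(template, doc):
    """Replace every pattern in template with the matching value from doc."""
    def substitute(match):
        field, fmt = match.group(1), match.group(2)
        value = doc[field]  # raises KeyError if the document lacks the field
        return _format_date(value, fmt) if fmt else str(value)

    return _PATTERN.sub(substitute, template)


doc = {"@timestamp": "2017-05-16T10:30:00", "media_type": "book"}
print(resolve_resource("my-collection/{@timestamp:YYYY.MM.dd}", doc))
print(resolve_resource("my-collection.{@timestamp:YYYY.MM.dd}/{media_type}", doc))
```

In the first template the date ends up in the type element, in the second it becomes part of the index name. Either way each document resolves to exactly one index/type pair.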

Does that clear things up? I'll look into expanding the documentation around this to clarify the differences in "multiple indices" for each situation.

Author

ov7a commented May 16, 2017

Yes, that does clear things up, thank you.

Is it OK to create a feature request for the ability to explicitly pass an index for a document through its metadata (alongside the id)?

Member

jbaiera commented May 16, 2017

@ov7a I think you should be able to do this currently by just specifying a field pattern as the entire index path item, like {index}/type

Author

ov7a commented May 16, 2017

What if the index name depends on other things (not only fields) and I do not want to store them?
Or what if the index name is a complex function of multiple fields?

Author

ov7a commented May 16, 2017

E.g.
if (field1 == value1 || field2 == value2)
    index = index1
else
    index = index2

Member

jbaiera commented May 16, 2017

@ov7a Even if we were to add a configuration property that selects the entire index from a document field before writing, it would provide exactly the same functionality as the existing pattern use case I explained above. In the event that you need to implement complex logic for selecting the index to send data to, I would suggest implementing that logic as part of your transformation steps before persisting to Elasticsearch. Finally, if you are concerned about adding unneeded fields to your index mappings, you can always mark those metafields to be excluded from the final document sent to Elasticsearch by using the es.mapping.exclude property. The fields will be available on the document for the purposes of filling in the index name, but will be omitted from the final rendered JSON data that is sent to Elasticsearch.
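The suggestion above could look roughly like the following sketch. The routing rule is the one from the earlier comment. The helper names, the `target_index` field, and the `my-type` type name are placeholders invented for illustration; only the `es.resource.write` and `es.mapping.exclude` setting names come from this thread. The settings are shown as a plain dict because how they are passed depends on the job.

```python
def choose_index(doc):
    """Routing rule from the thread; field and index names are placeholders."""
    if doc.get("field1") == "value1" or doc.get("field2") == "value2":
        return "index1"
    return "index2"


def add_routing_field(doc, field="target_index"):
    """Return a copy of doc with the computed index name attached."""
    routed = dict(doc)
    routed[field] = choose_index(doc)
    return routed


# Connector settings: the pattern reads the helper field, and the exclude
# setting keeps that field out of the JSON that is finally sent.
es_conf = {
    "es.resource.write": "{target_index}/my-type",
    "es.mapping.exclude": "target_index",
}

docs = [{"field1": "value1", "body": "a"}, {"field1": "other", "body": "b"}]
routed_docs = [add_routing_field(d) for d in docs]
print([d["target_index"] for d in routed_docs])
```

The transformation step runs before the write, so the index-selection logic can be arbitrarily complex without the connector needing to know about it.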

Author

ov7a commented May 16, 2017

I thought of that, but the es.mapping.exclude setting is ignored when es.input.json is specified :(

Member

jbaiera commented May 16, 2017

Yeah, es.input.json is meant to be a performance-boosting option that avoids the serialization overhead. Supporting es.mapping.exclude together with it would mean parsing the JSON fragments in order to remove the excluded fields, so we thought it best to leave it off.
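A short sketch shows why excluding a field from pre-serialized input is not free. This is not the connector's code; `strip_field` is a hypothetical helper that only illustrates that removing a field from a JSON string means parsing and re-serializing every document, which is the overhead the JSON input option is meant to avoid.

```python
import json


def strip_field(raw_json, field):
    """Remove a top-level field from a serialized JSON document."""
    parsed = json.loads(raw_json)  # full parse of the fragment
    parsed.pop(field, None)        # drop the helper field if present
    return json.dumps(parsed)      # full re-serialization


raw = '{"target_index": "index1", "body": "a"}'
print(strip_field(raw, "target_index"))
```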

Author

ov7a commented May 16, 2017

So, if I want both es.input.json and complex index logic, I have two options:

  1. Store the extra field
  2. Opt out of JSON input

The feature request I want to propose is a way to provide extra metadata for each document, so that it would be possible to have both JSON input and complex logic.

Member

jbaiera commented May 16, 2017

It's fine with me if you open an enhancement ticket for that. Thanks!
