I can't find a way to have this connector provide access to the "meta data" of the response set.
For example, a very efficient way to get the total count of an Elasticsearch query is to read the "total" field in the JSON blob of the first page of results (assuming you are scrolling, etc.)
An example response from a scroll query: { u'_scroll_id': u'xxxx', u'_shards': { u'failed': 0, u'successful': 10, u'total': 10}, u'hits': { u'hits': [], u'max_score': 0.0, u'total': 1593020 }, u'timed_out': False, u'took': 3643}
Notice how it took about 3.6 seconds ("took" is 3643 ms, on a HUGE data set) and I can now use the "hits.total" value to return to the user.
Obviously, actually getting the data still means a full fetch, but there are many use cases where you simply want the count for the query first.
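As a minimal sketch of what I mean, here is how the count can be read from the first-page response once it is a plain Python dict. The helper name `total_hits` is just an illustration and not part of any connector API. It also handles the object form of `hits.total` that Elasticsearch 7.x and later return, which the response above predates.

```python
def total_hits(response):
    """Return the total hit count from a search or scroll response dict."""
    total = response["hits"]["total"]
    # Elasticsearch 7.x and later report an object such as
    # {"value": 10000, "relation": "gte"}; older versions report a bare integer.
    if isinstance(total, dict):
        return total["value"]
    return total


# The first-page scroll response quoted above, written as a plain dict.
first_page = {
    "_scroll_id": "xxxx",
    "_shards": {"failed": 0, "successful": 10, "total": 10},
    "hits": {"hits": [], "max_score": 0.0, "total": 1593020},
    "timed_out": False,
    "took": 3643,
}

print(total_hits(first_page))  # 1593020
```

On 7.x and later the reported value can be a lower bound (relation "gte") unless the query sets `track_total_hits` to true, so the number is only exact when the relation is "eq".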
Ideas/thoughts?
@ledjon We use GitHub issues to track bugs and actionable features only. Please ask questions in the discuss forum.
ES-Hadoop primarily focuses on providing bulk reading of documents from Elasticsearch into Hadoop for analysis and bulk writing of documents from Hadoop into Elasticsearch. While there are some plans to eventually add in the ability to retrieve things like counts and aggregations along with the original data, these initiatives are primarily limited by the available API hooks in Hadoop. If you need these aggregations for a job, my advice is to query them from Elasticsearch and then to serialize them to the tasks for use.
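One way to read that advice, as a rough sketch only: fetch the count once on the driver side, serialize it to a string, and hand it to the tasks through whatever job configuration mechanism is available. The property name `example.query.metadata` and both helper functions below are hypothetical, and a plain dict stands in for the job configuration; none of this is an ES-Hadoop API.

```python
import json


def pack_metadata(total_hits, took_ms):
    """Serialize query metadata to a JSON string for a job property."""
    return json.dumps({"total_hits": total_hits, "took_ms": took_ms}, sort_keys=True)


def unpack_metadata(payload):
    """Parse the JSON string back into a dict on the task side."""
    return json.loads(payload)


# Driver side: values taken from the first-page response in the question.
payload = pack_metadata(1593020, 3643)

# Hypothetical property name; a dict stands in for the real job configuration.
job_conf = {"example.query.metadata": payload}

# Task side: read the property back and use the precomputed count.
meta = unpack_metadata(job_conf["example.query.metadata"])
print(meta["total_hits"])  # 1593020
```

Keeping the payload as a small JSON string means it can travel through any string-valued setting, and the tasks never need to query Elasticsearch for the count themselves.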