Hi,
I am doing the following to fetch data from my ES instance
```java
SparkConf conf = new SparkConf().setAppName("Simple Application")
    .set("es.resource", "myindex/account")
    .set("es.nodes", "192.168.224.94").set("es.port", "9200")
    .set("es.index.auto.create", "no").set("es.nodes.discovery", "false")
    .set("pushdown", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame myEsDump = JavaEsSparkSQL.esDF(sqlContext);
myEsDump.registerTempTable("allAccounts");
DataFrame accounts = sqlContext.sql("SELECT name FROM allAccounts WHERE name = 'Name-801'");
```
This runs fine and gives me the record I want. However, it appears that this never issues an ES query. I have enabled slow logging for all queries and I never see ES being queried. Why would all the ES documents be pulled into Spark and the filter applied in the Spark layer? I thought that enabling pushdown would prevent that behavior.
Added minor edits to your post for proper formatting
Push-down applies only to DataFrames registered through Spark's DataSources API, as explained in the docs. Applying SQL to a custom DataFrame that is not registered that way means Spark will not apply any push-down.
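For reference, a minimal sketch of loading through the DataSources API in Java (the index name, node address, and option values follow the snippets in this thread; `org.elasticsearch.spark.sql` is the format name provided by elasticsearch-hadoop):

```java
import java.util.HashMap;
import java.util.Map;

// Connector options passed through the DataSources API.
// "pushdown" asks the connector to translate Spark SQL filters
// into an Elasticsearch query instead of filtering in Spark.
Map<String, String> opts = new HashMap<String, String>();
opts.put("es.nodes", "192.168.224.94");
opts.put("es.port", "9200");
opts.put("pushdown", "true");

// Loading via sqlContext.read() registers the relation with Spark's
// DataSources API, which is what enables filter push-down.
DataFrame df = sqlContext.read()
    .format("org.elasticsearch.spark.sql")
    .options(opts)
    .load("myindex/account");
```

This is the path shown in the snippet below as well; the key point is that `esDF(sqlContext)` alone does not go through the DataSources registration.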
Could you let me know what is wrong in the following set of steps? I registered the DataFrame through Spark's DataSources API, but I still don't see the query being executed on Elasticsearch:
```java
SparkConf conf = new SparkConf().setAppName("Simple Application");
Map<String, String> dataFrameOptions = new HashMap<String, String>();
dataFrameOptions.put("es.resource", "myindex/account");
dataFrameOptions.put("es.nodes", "192.168.224.94");
dataFrameOptions.put("es.port", "9200");
dataFrameOptions.put("es.index.auto.create", "no");
dataFrameOptions.put("es.nodes.discovery", "false");
dataFrameOptions.put("pushdown", "true");
dataFrameOptions.put("double.filtering", "false");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame myEsDump = sqlContext.read().format("org.elasticsearch.spark.sql")
    .options(dataFrameOptions).load("myindex/account");
myEsDump.registerTempTable("allAccounts");
DataFrame accounts = sqlContext.sql("SELECT name FROM allAccounts WHERE name = 'Name-888'");
```
The Spark documentation should provide enough info. Notice that there the DataFrames are manipulated as is, without being registered as a table. In the end, both should work the same way; however, a different code path might be taken when going through a registered table.
Have you tried running the query directly on the dataframe instead of going through a temporary table?
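As a sketch of that suggestion (using the Spark 1.x DataFrame API and the `myEsDump` DataFrame and column name from the snippets above), the filter can be applied directly on the DataFrame, and the plan inspected to see whether the predicate was pushed to the source:

```java
// Query the DataFrame directly instead of going through a temp table.
DataFrame accounts = myEsDump
    .select(myEsDump.col("name"))
    .filter(myEsDump.col("name").equalTo("Name-888"));

// explain(true) prints the logical and physical plans; a filter that was
// pushed down to the source appears in the physical scan node rather than
// as a separate Spark Filter step.
accounts.explain(true);
accounts.show();
```

Comparing this plan against the one produced by the SQL-over-temp-table query would show whether the two take different code paths.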
Thanks again! I captured the packets and saw that it did indeed generate ES queries. It may be an anomaly in the ES loggers that keeps the scan/scroll search queries from being logged.