Cannot restrict DataFrame to certain mapping field #497

Closed
analyticswarescott opened this issue Jul 11, 2015 · 11 comments

Comments

@analyticswarescott

The DataFrame returned by JavaEsSparkSQL.esDF contains Scala Buffers when a query string is specified, but not when the simpler overload is used. A code snippet and log output are below.

    DataFrame rdd = null;
    SourceLoaderES.logger.info(" loading " + indexName + "/" + docType + " using REST to get column list ");

    String query = getQueryString(indexName, docType);
    SourceLoaderES.logger.debug("ES query string" + query);

    // Simple overload (no query): values are printed as plain scalars.
    rdd = JavaEsSparkSQL.esDF(_handler._sc, indexName + "/" + docType);
    // Overload taking the query string (which contains "fields"): values are printed as Buffer(...).
    DataFrame rdd2 = JavaEsSparkSQL.esDF(_handler._sc, indexName + "/" + docType, query);

    SourceLoaderES.logger.debug("inferred SCHEMA " + rdd.schema().toString());

    if (AdminMgr.getConfig(SyncRunner.DEBUG_ENABLE_DF_COUNTS) != null) {
        if (AdminMgr.getConfig(SyncRunner.DEBUG_ENABLE_DF_COUNTS).equals("true")) {
            logger.warn(" DataFrame count: " + rdd.count());
            logger.warn(" DataFrame first row: " + rdd.showString(1));
        }
    }
    // Print the first row of each DataFrame for comparison.
    logger.warn(" DataFrame first row: " + rdd.showString(1));
    logger.warn(" DataFrame2 first row: " + rdd2.showString(1));

Log output from this snippet:

2015-07-10 17:09:16,281 [ (Sync) 15 - Sync 1] INFO  com.dg.data.sync.SourceLoaderES -  loading g5b778bb6-faa0-41a0-bb36-8a4c54b13774_dg_dim1/dim_lookups using REST to get column list 
2015-07-10 17:09:16,297 [ (Sync) 15 - Sync 1] DEBUG com.dg.data.sync.SourceLoaderES -  returned mapping for query string: {"properties":{"xid":{"type":"string"},"ApplicationLanguageId":{"type":"long"},"LookupKeyActive":{"type":"string"},"LookupKey":{"type":"long"},"LookupLevel":{"type":"long"},"LookupModule":{"type":"long"},"LookupKeyName":{"type":"string"}}}
2015-07-10 17:09:16,297 [ (Sync) 15 - Sync 1] DEBUG com.dg.data.sync.SourceLoaderES - ES query string{"query":{"match_all":{}},"fields":["xid","ApplicationLanguageId","LookupKeyActive","LookupKey","LookupLevel","LookupModule","LookupKeyName"]}
2015-07-10 17:09:16,359 [ (Sync) 15 - Sync 1] DEBUG com.dg.data.sync.SourceLoaderES - inferred SCHEMA StructType(StructField(ApplicationLanguageId,LongType,true), StructField(LookupKey,LongType,true), StructField(LookupKeyActive,StringType,true), StructField(LookupKeyName,StringType,true), StructField(LookupLevel,LongType,true), StructField(LookupModule,LongType,true), StructField(xid,StringType,true))
2015-07-10 17:09:16,641 [ (Sync) 15 - Sync 1] WARN  com.dg.data.sync.SourceLoaderES -  DataFrame first row: ApplicationLanguageId LookupKey LookupKeyActive LookupKeyName LookupLevel LookupModule xid           
1                     11        true            11 am         2           61           61-2-true-1-11
2015-07-10 17:09:16,797 [ (Sync) 15 - Sync 1] WARN  com.dg.data.sync.SourceLoaderES -  DataFrame2 first row: ApplicationLanguageId LookupKey  LookupKeyActive LookupKeyName LookupLevel LookupModule xid                 
Buffer(1)             Buffer(11) Buffer(true)    Buffer(11 am) Buffer(2)   Buffer(61)   Buffer(61-2-true-...
@costin
Member

costin commented Jul 14, 2015

Thanks for the snippet; I'm currently on PTO, but I'll try it out as soon as I'm back.

@costin
Member

costin commented Jul 28, 2015

@analyticswarescott Unfortunately I'm unable to reproduce the problem. What versions of ES-Hadoop and Spark are you using? Are you by any chance on Spark 1.3?
On 2.1.x and master, everything gets printed properly. Maybe your query is special (can you share it)?
I've also noticed you are using showString, which is not available in Spark 1.4 (in fact it is private).
Either way, I have tried the following snippets (which are part of the test suite) and I get no reference to Buffer:

println(JavaEsSparkSQL.esDF(sqc, target, "?q=name:me*", cfg.asJava).head())
[170,Mew,http://userserve-ak.last.fm/serve/252/42247291.jpg,6146220025000,http://www.last.fm/music/Mew]
JavaEsSparkSQL.esDF(sqc, target, "?q=name:me*", cfg.asJava).show(3)
+---+--------------------+--------------------+--------------+--------------------+
| id|                name|            pictures|          time|                 url|
+---+--------------------+--------------------+--------------+--------------------+
|170|                 Mew|http://userserve-...| 6146220025000|http://www.last.f...|
|918|            Megadeth|http://userserve-...|29656092025000|http://www.last.f...|
|996|Mike & The Mechanics|http://userserve-...|32117541625000|http://www.last.f...|
+---+--------------------+--------------------+--------------+--------------------+

Can you post your entire log and potentially turn on logging on the rest package (org.elasticsearch.hadoop.rest) all the way to TRACE and upload the result as a gist?

Thanks

@costin costin closed this as completed Jul 28, 2015
@costin costin reopened this Jul 28, 2015
costin added a commit that referenced this issue Jul 29, 2015
costin added a commit that referenced this issue Jul 29, 2015
@analyticswarescott
Author

Your example test query string does not specify a set of fields to return,
only a constraint. The "fields" query element is what causes the issue in
my tests, including in Spark 1.4.1.

I will provide more information as soon as I am able.

--Scott

@costin
Member

costin commented Aug 4, 2015

I see. That's a bug: fields should not be used with a DataFrame. It's the DataFrame itself that specifies the fields (through its schema), not the user.

Costin

@analyticswarescott
Author

So, to clarify, the .esDF method is only designed to return a DataFrame
that contains all fields in the ES mapping? In our case this is nearly 100
fields. Calls like jsonFile allow one to specify a schema to be applied
to the created DataFrame, but I don't see any overloads of .esDF that
support this.

--Scott

@costin
Member

costin commented Aug 4, 2015

Good point - this feature is in there but it is not properly exposed. One could create an RDD based on a query with just the needed fields and then associate a schema with it, but that's overkill.
ES-Hadoop relies on fields to perform its own extractions, but either way it should either remove a user-supplied fields element or validate it properly.
Will ping you once I have an update on this front.

Cheers,
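
A minimal sketch related to the RDD workaround mentioned above. When a query uses "fields", Elasticsearch returns each value as an array, which matches the Buffer(...) values in the log output at the top of this issue. Assuming (this is not confirmed anywhere in the thread) that each document reaches the application as a Map<String, Object> whose values are single-element lists, a helper along these lines could unwrap them before a schema is attached. BufferUnwrapper and unwrapSingletons are hypothetical names, not part of ES-Hadoop; the sample values come from the log output above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ES-Hadoop: unwraps single-element lists
// so that a value such as [11] becomes 11.
class BufferUnwrapper {

    // Returns a copy of the document in which every value that is a
    // one-element List is replaced by that element; other values are kept as-is.
    static Map<String, Object> unwrapSingletons(Map<String, Object> doc) {
        Map<String, Object> result = new LinkedHashMap<String, Object>();
        for (Map.Entry<String, Object> entry : doc.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof List && ((List<?>) value).size() == 1) {
                value = ((List<?>) value).get(0);
            }
            result.put(entry.getKey(), value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed input shape: every field value wrapped in a one-element list.
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("LookupKey", Arrays.<Object>asList(11L));
        doc.put("LookupKeyName", Arrays.<Object>asList("11 am"));
        System.out.println(unwrapSingletons(doc));
    }
}
```

The unwrapped maps could then be turned into rows and paired with a schema through Spark's own API; that step is left out here because it depends on the Spark version in use.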

@analyticswarescott
Author

Thanks. I can use the RDD approach for now, and will stay tuned.

--Scott

@costin costin changed the title from "DataFrame contains scala buffers only when query is specified" to "Cannot restrict DataFrame to certain mapping field" Sep 1, 2015
costin added a commit that referenced this issue Sep 9, 2015
This effectively allows the user to specify a custom schema that cherry-picks
the fields inside a mapping instead of using all of them

relates #497
costin added a commit that referenced this issue Sep 9, 2015
This effectively allows the user to specify a custom schema that cherry-picks
the fields inside a mapping instead of using all of them

relates #497

(cherry picked from commit d90d9ba)
@costin
Member

costin commented Sep 10, 2015

@analyticswarescott Hi,

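This feature is in master and 2.x; see the docs here (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql-read), in particular the "Controlling the DataFrame schema" section.

You can try it out through the dev builds (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/install.html#download-dev).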

Cheers,
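
As a rough illustration of the "Controlling the DataFrame schema" section referenced above: the ES-Hadoop documentation describes the es.read.field.include (and es.read.field.exclude) settings, which restrict the mapping fields that end up in the DataFrame. The sketch below only builds such a configuration map with the standard library. FieldSelection and includeFields are hypothetical names, the field names come from the mapping logged at the top of this issue, and the commented-out esDF call is an assumption modelled on the four-argument call shown earlier in the thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds an ES-Hadoop configuration that asks for a
// subset of the mapping fields instead of putting "fields" in the query.
class FieldSelection {

    // Joins the requested field names into the comma-separated form
    // used for the es.read.field.include setting.
    static Map<String, String> includeFields(List<String> fields) {
        Map<String, String> cfg = new LinkedHashMap<String, String>();
        cfg.put("es.read.field.include", String.join(",", fields));
        return cfg;
    }

    public static void main(String[] args) {
        Map<String, String> cfg = includeFields(
                Arrays.asList("xid", "LookupKey", "LookupKeyName"));
        System.out.println(cfg);
        // Assumed usage, mirroring the four-argument call shown earlier in the thread
        // (the query itself would no longer contain a "fields" element):
        // DataFrame df = JavaEsSparkSQL.esDF(sqlContext, indexName + "/" + docType, query, cfg);
    }
}
```

Keeping the field selection in the configuration rather than in the query lets the connector build the schema from the selected mapping fields, which is the behaviour the thread was asking for.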

@analyticswarescott
Author

Thanks very much for keeping me up to date. We'll be checking it out soon!

--Scott

@costin
Member

costin commented Oct 15, 2015

@analyticswarescott Any update? Did you manage to try it out? Wanted to know whether the current feature is properly designed (and rich enough).

@costin
Member

costin commented Oct 28, 2015

Closing the issue.

@costin costin closed this as completed Oct 28, 2015