JSON schema inference corrupt with Elasticsearch Spark #441
Comments
@ejsarge-gr First off, thanks for the detailed report; if only all reports were written this way, things would be a LOT easier.
Enhance value order when dealing with nested values relates #441
@ejsarge-gr Hi, the issue should be fixed in master. I've pushed a new dev build to Maven, so please try it out and report back whether it addresses your problem. Cheers,
It appears to be working for me correctly. Thanks!
That's great to hear. Thanks for the feedback.
I grabbed the latest build as I was seeing this issue as well. While the schema inference seems to work now, the DataFrame is now returning all the objects wrapped in Buffers (in one case, more than one level deep). This causes a downstream error in df.insertIntoJDBC (in my case) when I invoke it on the DataFrame from ES (code and logs below). Any ideas on how to get this convenient method working again?
Debug log:
2015-06-04 04:40:16,173 [ForkJoinPool.commonPool-worker-1] DEBUG com.dg.data.sync.writers.StoreWriterJDBC - SCHEMA for DataFrame StructType(StructField(medid,StringType,true), StructField(meid,StringType,true), StructField(ot,LongType,true), StructField(otn,StringType,true))
@analyticswarescott Please open a new issue with some information about the actual versions of your runtime (the git SHA1 of es-hadoop; you'll find it in the logs). Moreover, note that the test suite compares the data output and the schema against a raw JSON input, as you can see here. I'm not sure where that... Anyway, this issue has been derailed long enough; please let's continue the discussion in a new issue.
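For anyone following along, the kind of check mentioned above (comparing row data against the schema) can be pictured with a small standalone sketch. This is not es-hadoop or Spark code, and it is not the project's actual test: the class `SchemaOrderCheck`, the method `mismatches`, and the modelling of a schema as an ordered map and a row as a list are all assumptions made for illustration. It only shows, on a few fields from the event in this issue, how values that are out of order inside a nested object stop lining up with the schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from es-hadoop or Spark.
final class SchemaOrderCheck {

    // A schema is modelled as an ordered map: field name -> expected type.
    // The expected type is either a Class (leaf value) or another ordered map (nested object).
    // A row is a List whose values must appear in the same order as the schema fields.
    static List<String> mismatches(Map<String, Object> schema, List<?> row, String path) {
        List<String> problems = new ArrayList<>();
        if (schema.size() != row.size()) {
            problems.add(path + ": expected " + schema.size() + " values but found " + row.size());
            return problems;
        }
        int i = 0;
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            String name = path.isEmpty() ? field.getKey() : path + "." + field.getKey();
            Object expected = field.getValue();
            Object actual = row.get(i++);
            if (actual == null) {
                continue; // every field is treated as nullable in this sketch
            }
            if (expected instanceof Map) {
                if (actual instanceof List) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> nested = (Map<String, Object>) expected;
                    // Recurse so that ordering inside nested objects is checked too.
                    problems.addAll(mismatches(nested, (List<?>) actual, name));
                } else {
                    problems.add(name + ": expected a nested row but found "
                            + actual.getClass().getSimpleName());
                }
            } else if (!((Class<?>) expected).isInstance(actual)) {
                problems.add(name + ": expected " + ((Class<?>) expected).getSimpleName()
                        + " but found " + actual.getClass().getSimpleName());
            }
        }
        return problems;
    }

    // A few fields taken from the sample event in this issue.
    static Map<String, Object> exampleSchema() {
        Map<String, Object> actor = new LinkedHashMap<>();
        actor.put("CompanyId", Long.class);
        actor.put("EntityName", String.class);
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("ActorEntity", actor);
        schema.put("EventTime", Long.class);
        schema.put("EventType", String.class);
        return schema;
    }

    public static void main(String[] args) {
        Map<String, Object> schema = exampleSchema();
        // Nested values in schema order.
        List<Object> aligned = Arrays.<Object>asList(
                Arrays.<Object>asList(148L, "first-1 last-1"), 1387710245000L, "MessageExportRequested");
        // Same values, but the two nested values are swapped.
        List<Object> swapped = Arrays.<Object>asList(
                Arrays.<Object>asList("first-1 last-1", 148L), 1387710245000L, "MessageExportRequested");
        System.out.println("aligned row: " + mismatches(schema, aligned, ""));
        System.out.println("swapped row: " + mismatches(schema, swapped, ""));
    }
}
```

Note that a type-based check like this only catches swaps between fields of different types; two string fields in the wrong order would pass, so comparing actual values against the raw JSON is still needed.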
With the new release build I was able to isolate the issue and I recorded it as #497 |
The elasticsearch-hadoop library appears to corrupt the JSON schema inference. The same JSON source read using the SQLContext.jsonFile method succeeds.
Reproduction Steps
curl -XDELETE localhost:9200/events2-salsnap1-2013-12/
curl -XPUT localhost:9200/events2-salsnap1-2013-12/event/76e3773d-8a19-485a-a75c-225070e2cbc6 -d '{"EventType":"MessageExportRequested","EventTime":1387710245000,"SessionId":"gsk*****","Trigger":"_null_","ScopedCompanyId":148,"ScopedArchiveId":"anArchive","EventId":"dbnbudzu4wge","ActorEntity":{"CompanyId":148,"IpAddress":"127.0.0.1","EntityId":"602","EntityName":"first-1 last-1","EntityType":"CompanyUser"},"AffectedEntity1":{"EntityId":"5678","EntityName":"5678","EntityType":"MessageExport","ExportPurpose":"FinraAudit","CaseName":"R v Sargisson","NumberMessages":534,"CaseId":"Sarg598","PriceCurrency":"CAD","DeliveryOptions":"DirectDownload","NewestMessageDate":1419112760000,"SpecialRequest":"_null_","ExportName":"Some Export","ExportFormat":"EML","SizeMessagesInBytes":1234789,"ExportDescription":"If the NSA can do it then so can I","ExportOption":"IncludeHiddenRecipientData","Price":500.12,"OldestMessageDate":1387576760000}}'
SparkSQLElasticsearchTest
SparkSQLJsonFileTest
Expected Output
Actual Output (from SparkSQLElasticsearchTest)
Files
pom.xml
SparkSQLElasticsearchTest.java
SparkSQLJsonFileTest
src/main/resources/message-export-events.json