SchemaRDD seems to be lost when loading parquet files #403

Closed
costin opened this issue Mar 24, 2015 · 1 comment

costin commented Mar 24, 2015

This seems to fail:

wget https://oss.sonatype.org/content/repositories/snapshots/org/elasticsearch/elasticsearch-hadoop/2.1.0.BUILD-SNAPSHOT/elasticsearch-hadoop-2.1.0.BUILD-20150324.023417-341.jar
./spark-shell --jars elasticsearch-hadoop-2.1.0.BUILD-20150324.023417-341.jar

import org.apache.spark.sql.SQLContext

case class KeyValue(key: Int, value: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext._

// trailing dot so the chained call also works when pasted line by line into the shell
sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).
    saveAsParquetFile("large.parquet")
parquetFile("large.parquet").registerTempTable("large")

val schemaRDD = sql("SELECT * FROM large")
import org.elasticsearch.spark._

schemaRDD.saveToEs("test/spark")

Basically, the schema associated with the RDD is not read, so the content of each row (an array) is passed through as is instead of being handled through its schema.
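To make the difference concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Field`, `Schema`, `Row`, and `SchemaLossSketch` are hypothetical stand-ins made up for illustration; they are not the Spark or elasticsearch-hadoop types, and this is not how the connector is implemented. It only contrasts a row emitted as a bare array with the same row mapped through its column names.

```scala
// Hypothetical stand-ins for illustration only; not Spark or elasticsearch-hadoop types.
final case class Field(name: String)
final case class Schema(fields: Seq[Field])
final case class Row(values: Seq[Any])

object SchemaLossSketch {
  // What the report describes: the row goes out as a bare array of values.
  def withoutSchema(row: Row): Seq[Any] = row.values

  // What was expected: each value paired with its column name from the schema.
  def withSchema(schema: Schema, row: Row): Map[String, Any] = {
    require(schema.fields.length == row.values.length,
      "schema and row must have the same number of entries")
    schema.fields.map(_.name).zip(row.values).toMap
  }

  def main(args: Array[String]): Unit = {
    // Mirrors KeyValue(key: Int, value: String) from the snippet above.
    val schema = Schema(Seq(Field("key"), Field("value")))
    val row = Row(Seq(1, "1"))
    println(withoutSchema(row)) // values only, field names are lost
    println(withSchema(schema, row)) // field names preserved
  }
}
```

With the schema applied, the document would carry the `key` and `value` field names; without it, only the positional values survive.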

costin commented Mar 26, 2015

Closing in favour of #382

@costin costin closed this as completed Mar 26, 2015