SchemaRDD seems to be lost when loading parquet files #403

costin · 2015-03-24T17:57:04Z

This seems to fail:

wget https://oss.sonatype.org/content/repositories/snapshots/org/elasticsearch/elasticsearch-hadoop/2.1.0.BUILD-SNAPSHOT/elasticsearch-hadoop-2.1.0.BUILD-20150324.023417-341.jar
./spark-shell --jars elasticsearch-hadoop-2.1.0.BUILD-20150324.023417-341.jar

import org.apache.spark.sql.SQLContext

case class KeyValue(key: Int, value: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext._

sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString))
    .saveAsParquetFile("large.parquet")
parquetFile("large.parquet").registerTempTable("large")

val schemaRDD = sql("SELECT * FROM large")
import org.elasticsearch.spark._

schemaRDD.saveToEs("test/spark")

Basically, the schema associated with the RDD is not read, and each row's content (an array) is passed as is instead of being handled through its schema.
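The difference described above can be sketched in plain Scala, without Spark or Elasticsearch. This is only an illustration under stated assumptions: `fieldNames`, `row`, `withoutSchema` and `withSchema` are hypothetical names, not part of elasticsearch-hadoop or Spark, and the sketch does not show how the connector is implemented.

```scala
// Hypothetical stand-ins for one row of the "large" table and its schema.
// None of these names come from Spark or elasticsearch-hadoop.
val fieldNames: Seq[String] = Seq("key", "value") // fields of KeyValue
val row: Seq[Any] = Seq(1, "1")                   // one row as a bare array of values

// Schema ignored: only the positional values remain, with no field names.
val withoutSchema: Seq[Any] = row

// Schema applied: each value is paired with the name of its field.
val withSchema: Map[String, Any] = fieldNames.zip(row).toMap
```

Here `withoutSchema` corresponds to the array being passed as is, while `withSchema` is the kind of named structure one would expect when the schema is taken into account.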

costin · 2015-03-26T15:29:32Z

Closing in favour of #382

costin added bug :Spark v2.1.0.Beta4 labels Mar 24, 2015

costin mentioned this issue Mar 26, 2015

Unable to index JSON from HDFS using SchemaRDD.saveToEs() #382

Closed

costin added the duplicate label Mar 26, 2015

costin closed this as completed Mar 26, 2015
