sryza/simplesparkavroapp

Spark with Avro and Parquet

Enclosed is a simple Spark app demonstrating how to read and write data in the Parquet and Avro formats.

Avro refers to both a binary format and an in-memory Java object representation. Parquet refers to only a binary format, and it supports pluggable in-memory representations. One of the options for Parquet's in-memory representations is Avro, and that's the one we use here and recommend in general.

Avro actually has two in-memory representations. "Specific" records rely on generated code. "Generic" records essentially represent objects as key-value pairs. We use the specific representation here because it is more efficient and easier to program against once code generation has been set up.
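
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Avro's API: the map stands in for a generic record, and the hand-written `User` class stands in for the class that code generation would produce. The field names `name` and `favorite_number` are assumptions for illustration and may not match the actual schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordStyles {
    // "Generic" style: fields live in a map and are looked up by name at runtime.
    public static Map<String, Object> genericUser(String name, int favoriteNumber) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", name);
        user.put("favorite_number", favoriteNumber);
        return user;
    }

    // "Specific" style: a hand-written stand-in for a generated class (hypothetical fields).
    public static final class User {
        private final String name;
        private final int favoriteNumber;

        public User(String name, int favoriteNumber) {
            this.name = name;
            this.favoriteNumber = favoriteNumber;
        }

        public String getName() { return name; }
        public int getFavoriteNumber() { return favoriteNumber; }
    }

    public static void main(String[] args) {
        Map<String, Object> generic = genericUser("Alyssa", 256);
        // A cast is required, and a misspelled key would only fail at runtime.
        int fromGeneric = (Integer) generic.get("favorite_number");

        User specific = new User("Alyssa", 256);
        // Typed accessor: a misspelled method name would be a compile error.
        int fromSpecific = specific.getFavoriteNumber();

        System.out.println(fromGeneric + " " + fromSpecific); // prints "256 256"
    }
}
```

The typed accessors are what make the specific style easier to program against: mistakes in field names or types surface at compile time rather than when the job runs.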

To compile and package:

mvn package

This generates Java classes from the Avro schema and builds the project. The Avro schema is a simple "User" object defined in src/main/resources/user.avsc.

To test writing an Avro file:

spark-submit --class com.cloudera.sparkavro.SparkSpecificAvroWriter \
  target/sparkavroapp-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  users.avro

To test reading an Avro file:

spark-submit --class com.cloudera.sparkavro.SparkSpecificAvroReader \
  target/sparkavroapp-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  users.avro

To test writing a Parquet file:

spark-submit --class com.cloudera.sparkavro.SparkSpecificParquetWriter \
  target/sparkavroapp-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  users.parquet

To test reading a Parquet file:

spark-submit --class com.cloudera.sparkavro.SparkSpecificParquetReader \
  target/sparkavroapp-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  users.parquet

Note that all the examples register the specific Avro class with Kryo. This allows instances of it to be serialized more efficiently when they are passed around within Spark. The examples don't actually pass the objects around within Spark, but they register the class anyway because it's good practice.
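
However the examples perform the registration, one way to get the same effect through configuration alone is sketched below, using Spark's `spark.serializer` and `spark.kryo.classesToRegister` properties. The class name `com.cloudera.sparkavro.User` is an assumption: the real fully qualified name depends on the namespace declared in user.avsc.

```properties
# spark-defaults.conf (each pair can also be passed to spark-submit as --conf key=value)
spark.serializer                org.apache.spark.serializer.KryoSerializer
# Hypothetical class name; substitute the fully qualified name generated from user.avsc
spark.kryo.classesToRegister    com.cloudera.sparkavro.User
```

Without registration, Kryo writes the full class name alongside each serialized object, so registering the class mainly saves space when many small records are shuffled or cached.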
