The plugin can read single-dimensional arrays from HDF5 files.
The following types are supported:
- Int8
- UInt8
- Int16
- UInt16
- Int32
- Int64
- Float32
- Float64
- Fixed-length strings
If you are using the sbt-spark-package plugin, the easiest way to use the package is to require it from the Spark Packages website:
spDependencies += "LLNL/spark-hdf5:0.0.4"
Otherwise, download the latest release jar and include it on your classpath.
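If you prefer plain sbt, the following is a minimal sketch of an equivalent configuration. It assumes the artifact is published to the Spark Packages Maven repository under the `LLNL` organization; both the resolver URL and the coordinates follow the Spark Packages convention and are assumptions, not taken from this README.

```scala
// build.sbt sketch: resolve spark-hdf5 from the Spark Packages repository.
// The resolver URL and the "LLNL" % "spark-hdf5" coordinates are assumptions
// based on the Spark Packages naming convention.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies += "LLNL" % "spark-hdf5" % "0.0.4"
```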
import gov.llnl.spark.hdf._
val df = sqlContext.read.hdf5("path/to/file.h5", "/dataset")
df.show
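The result is an ordinary DataFrame, so the usual Spark SQL operations apply after loading. A minimal sketch follows; the column name `value` is an assumption for illustration, so inspect the schema to find the columns the plugin actually produces:

```scala
import gov.llnl.spark.hdf._

// Load a one-dimensional dataset, then use standard DataFrame operations.
// The "value" column name is illustrative; run printSchema to see the
// actual columns.
val df = sqlContext.read.hdf5("path/to/file.h5", "/dataset")
df.printSchema()                    // inspect the inferred schema
df.filter(df("value") > 0).count()  // standard Spark SQL operations apply
```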
You can start a Spark REPL with the console target:
sbt console
This will fetch all of the dependencies, set up a local Spark instance, and start a Spark REPL with the plugin loaded.
The following options can be set:
Key | Default | Description |
---|---|---|
extension | h5 | The file extension of the data files |
chunk size | 10000 | The maximum number of elements to be read in a single scan |
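This README does not show how these options are passed. One plausible sketch, assuming the plugin accepts them through Spark's standard `DataFrameReader.option` mechanism (an assumption, not documented behavior):

```scala
// Sketch: overriding both defaults from the table above. Passing options
// via DataFrameReader.option is an assumption about the plugin's API.
val df = sqlContext.read
  .option("extension", "he5")    // look for .he5 files instead of .h5
  .option("chunk size", "5000")  // read at most 5000 elements per scan
  .hdf5("path/to/file.he5", "/dataset")
```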
The plugin includes a test suite, which can be run through SBT:
sbt test
Planned improvements include:
- Use the hdf-obj package rather than the sis-jhdf5 wrapper
- Support for multi-dimensional arrays
- Support for compound datasets
- Additional testing
- Partition discovery (data inference based on location)
This code was developed at Lawrence Livermore National Laboratory (LLNL) and is available under the Apache 2.0 license (LLNL-CODE-699384).