

Conductor for Apache Spark provides efficient, distributed transfers of large files from S3 to HDFS and back.

Hadoop's distcp utility supports transfers to/from S3 but does not distribute the download of a single large file over multiple nodes. Amazon's s3distcp is intended to fill that gap but, to the best of our knowledge, has not been released as open source.

A cluster of ten r3.xlarge nodes downloaded a 288 GiB file in 377 seconds to an HDFS installation with replication factor 1, yielding an aggregate transfer rate of 782 MiB/s. For comparison, distcp typically achieves 50-80 MB/s on that instance type. A cluster of one hundred r3.xlarge nodes downloaded the same file in 80 seconds, yielding an aggregate transfer rate of 3.6 GiB/s.
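These aggregate rates are simply file size divided by wall-clock time; a quick back-of-the-envelope check (illustrative, not part of the project):

```python
# Sanity-check the aggregate transfer rates quoted above.
size_gib = 288
size_mib = size_gib * 1024

ten_node_rate = size_mib / 377      # MiB/s over ten r3.xlarge nodes
hundred_node_rate = size_gib / 80   # GiB/s over one hundred nodes

print(f"10 nodes: {ten_node_rate:.0f} MiB/s")       # 782 MiB/s
print(f"100 nodes: {hundred_node_rate:.1f} GiB/s")  # 3.6 GiB/s
```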


Prerequisites

Run time:

  • JRE 1.7+
  • Spark cluster backed by HDFS

Build time:

  • JDK 1.7+
  • Scala SDK 2.10
  • Maven

Scala 2.11 and Java 1.8 may work, too. We simply haven't tested those, yet.



To download from S3 to HDFS:

export AWS_ACCESS_KEY=...
export AWS_SECRET_KEY=...
spark-submit conductor-VERSION-distribution.jar \
             s3://BUCKET/KEY \
             hdfs://HOST[:PORT]/PATH \
             [--s3-part-size <value>] \
             [--hdfs-block-size <value>] \
             [--concat]
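
The speedup comes from splitting one object across executors so each node fetches a distinct byte range. The helper below is a hypothetical sketch of that splitting, not Conductor's actual API:

```python
# Illustrative only: split a file of file_size bytes into inclusive byte
# ranges of at most part_size bytes, one range per S3 part, so that each
# Spark executor can fetch its range independently (e.g. via an HTTP
# Range request).
def part_ranges(file_size, part_size):
    return [
        (start, min(start + part_size, file_size) - 1)
        for start in range(0, file_size, part_size)
    ]

# A 10-byte object in 4-byte parts splits into three ranges:
print(part_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```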


To upload from HDFS to S3:

export AWS_ACCESS_KEY=...
export AWS_SECRET_KEY=...
spark-submit conductor-VERSION-distribution.jar \
             hdfs://HOST[:PORT]/PATH \
             s3://BUCKET/KEY \
             [--concat]

Passing the --concat flag concatenates all parts of the file after the upload or download. The source path can point to either a file or a directory. If it points to a file, the parts are created in the specified part size; if it points to a directory, each part corresponds to a file in the directory. When downloading, concatenation only works if all parts except the last one are equal-sized and a multiple of the specified block size.
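The download-side constraint can be stated as a small predicate; this is an illustrative sketch of the rule, not Conductor's code:

```python
# Illustrative only: --concat on download requires that every part
# except the last be equal-sized and a multiple of the HDFS block size,
# otherwise the parts cannot be stitched into one contiguous HDFS file.
def can_concat(part_sizes, block_size):
    head = part_sizes[:-1]  # every part except the last
    return all(s == head[0] and s % block_size == 0 for s in head)

print(can_concat([128, 128, 40], block_size=64))  # True
print(can_concat([128, 96, 40], block_size=64))   # False: unequal parts
print(can_concat([96, 96, 40], block_size=64))    # False: 96 % 64 != 0
```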

If running Spark on YARN, you can pass the AWS access/secret keys to the application by adding the following config flags to spark-submit:

--conf spark.yarn.appMasterEnv.AWS_ACCESS_KEY=... \
--conf spark.yarn.appMasterEnv.AWS_SECRET_KEY=...


To run the integration tests:

export AWS_ACCESS_KEY=...
export AWS_SECRET_KEY=...
spark-submit --conf spark.driver.memory=1G \
             --executor-memory 1G \
             conductor-integration-tests-0.4-SNAPSHOT-distribution.jar \
             -e -s edu.ucsc.cgl.conductor.ConductorIntegrationTests


To build:

mvn package

You can customize the Spark and Hadoop versions to build against by setting the spark.version and hadoop.version properties, for example:

mvn package -Dspark.version=1.5.2 -Dhadoop.version=2.6.2


Caveats:

  • Beta-quality
  • Uses Spark, not Yarn/MapReduce
  • Destination must be a full hdfs:// URL; the cluster's default file system property is ignored
  • On failure, temporary files may be left around
  • S3 credentials may be set via Java system properties or environment variables, as described in the AWS SDK documentation, but are not read from core-site.xml


Hannes Schmidt created the first bare-bones implementation of distributed downloads from S3 to HDFS, originally called spark-s3-downloader.

Clayton Sanford made the HDFS block size and S3 part size configurable, added upload support and optional concatenation, and wrote integration tests. During his efforts the project was renamed Conductor.
