This package implements a Spark backend for the `dplyr` package, providing a powerful and intuitive DSL to manipulate large datasets on a powerful big data platform. It is a simple package: simple to learn if you have any familiarity with `dplyr`, or even just R and SQL, and simple to deploy, with just a few packages to install on a single machine, as long as your Spark installation comes with JDBC support (or build it in; instructions below).
The current state of the project is:
- adds some Spark-specific goodies, like caching tables.
- can go successfully through tutorials for `dplyr` like any other database backend^[with the exception of one bug; to avoid it, you need to run Spark from trunk or wait for version 1.5, see SPARK-9221].
- not yet endowed with a thorough test suite. Nonetheless, we expect it to inherit much of its correctness, scalability and robustness from its main dependencies.
- we don't recommend production use yet.

To build Spark with the required support, run:

    cd <spark root>
    build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Phive -Phive-thriftserver clean package

It may work with other Hadoop versions, but we need the `hive` and `hive-thriftserver` support. The package is able to start the thrift server but can also connect to a running one.
`dplyr.spark` has a few dependencies; get them with:

    install.packages(c("RJDBC", "dplyr", "DBI", "devtools"))
    devtools::install_github("hadley/purrr")

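`RJDBC` in turn depends on `rJava`. Make sure that you have `rJava` working, for example by loading it with `library(rJava)` and initializing the JVM with `.jinit()`. This is only a test; in general you don't need to do this before loading the package.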
On the Mac, `rJava` required two different versions of Java installed (for real), and in particular the shell variable `DYLD_FALLBACK_LIBRARY_PATH` set to the JVM library directory, as in the command below. The specific path may be different, particularly the version numbers. To start RStudio (optional, you can use a different GUI or none at all), which doesn't read environment variables, you can enter the following command:

    DYLD_FALLBACK_LIBRARY_PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/server/ open -a rstudio

The `HADOOP_JAR` environment variable needs to be set to the path of the main Hadoop JAR file.
To start the thrift server from R, which happens by default when creating a `src_SparkSQL` object, you need one more variable set: `SPARK_HOME`, which, as the name suggests, points to the root of the Spark installation. If you are connecting to a running server, you just need host and port information. Those can be stored in environment variables as well; see the help documentation.
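
Since a missing variable tends to surface only as a failure when the server is started, it can help to check the environment before launching R. The following is a minimal sketch, not part of `dplyr.spark`: `check_env` is a hypothetical helper name, and the only variables it assumes are the two mentioned above.

```shell
# Hypothetical helper (not provided by dplyr.spark): reports which of the
# variables named in its arguments are unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    # Only pass trusted variable names, since this uses eval.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      missing=1
    fi
  done
  return "$missing"
}

# Both variables are needed when the thrift server is started from R.
check_env HADOOP_JAR SPARK_HOME || echo "set the variables above before starting R"
```

When connecting to an already running server, `SPARK_HOME` is not needed, so the list of names passed to the helper can be shortened accordingly.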
Then, to install from source:

    devtools::install_github("RevolutionAnalytics/dplyr.spark@0.3.0", subdir = "pkg")

The current version is 0.3.0.
You can find a number of examples derived from @hadley's own tutorials for `dplyr` under the test directory.