This code goes with my Datanami article.
It illustrates MLlib on Spark with an example based on collaborative filtering of employee ratings for companies.
It shows the exact same Spark client functionality written in both Java 7 and Java 8, highlighting the new Java 8 features that make Spark's functional style much easier to express.
I use Cassandra to provide the data to Spark, and there's a synthesized training/validation set with an accompanying spreadsheet that lets you tweak parameters.
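To preview that contrast without pulling in the Spark API, here is a minimal plain-JDK sketch of the same transformation written both ways; the class and field names are mine for illustration, not from the repo:

```java
import java.util.function.Function;

public class LambdaContrast {
    // Java 7 style: an anonymous inner class implementing the function
    static final Function<String, String> USER_JAVA7 = new Function<String, String>() {
        @Override
        public String apply(String line) {
            return line.split(",")[0];
        }
    };

    // Java 8 style: the same logic as a one-line lambda
    static final Function<String, String> USER_JAVA8 = line -> line.split(",")[0];

    public static void main(String[] args) {
        System.out.println(USER_JAVA7.apply("alice,acme,4")); // prints "alice"
        System.out.println(USER_JAVA8.apply("bob,acme,5"));   // prints "bob"
    }
}
```

In Spark code the same contrast shows up in every `map`, `filter`, and `reduce` callback, which is why the Java 8 version of the client reads so much more cleanly.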
Here's how to get it working:
To set up (tested on Ubuntu 14.04):
- Install the Java 8 JDK:
sudo apt-get install oracle-java8-installer
- Get Spark.
- Download Spark 1.1.0 pre-built for Hadoop 2.4. We will not be using Hadoop even though this build supports it.
- Untar the Spark tarball (e.g., in ~/dev).
- Test the installation with ./bin/run-example SparkPi
- See QuickStart below for more setup instructions and tutorials.
Get Eclipse:
- Download Eclipse Luna 4.4.1 Ubuntu 64 Bit (or 32 Bit) from Eclipse.org. Only the latest Eclipse supports Java 8.
- Untar, run Eclipse.
- Set your Java 8 JDK as the default JDK.
- Install Maven integration for Eclipse:
- Menu Help -> Install New Software…
- Add this repository
- Check Maven Integration for Eclipse, then install.
Project
- Right-click on pom.xml and choose Maven -> Install.
- This will download the Spark jars; it will take a while.
- It will also set your Eclipse project's source level to Java 8.
Dataset
- ratings.csv is generated from ratings.ods, which is a spreadsheet for synthesizing data sets to test and fine-tune your model.
- Adjust ratings.ods and save as CSV. See readme.txt in the data directory for instructions.
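As a sketch of what consuming the generated CSV could look like, assuming a userId,companyId,rating column layout (check readme.txt for the actual format), a line might be parsed like this; RatingLine is a hypothetical helper, not a class from the repo:

```java
public class RatingLine {
    // Assumed column layout: userId,companyId,rating
    // (see readme.txt in the data directory for the real format)
    final int userId;
    final int companyId;
    final double rating;

    RatingLine(int userId, int companyId, double rating) {
        this.userId = userId;
        this.companyId = companyId;
        this.rating = rating;
    }

    // Parse one CSV line into its three fields
    static RatingLine parse(String csvLine) {
        String[] f = csvLine.split(",");
        return new RatingLine(Integer.parseInt(f[0].trim()),
                              Integer.parseInt(f[1].trim()),
                              Double.parseDouble(f[2].trim()));
    }
}
```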
Cassandra
- Instructions for getting Cassandra: here
- Run Cassandra:
sudo /usr/bin/cassandra
- We will be running Cassandra and Spark locally from the console, rather than remotely in a cluster as a daemon/service.
- Create schema by running attached SQL as follows:
- In workspace root, run
cqlsh -f ./collabfilter/src/sql/collab_filter_schema.sql
Running tests:
- Run collabfilter.CollabFilterCassandraDriver.main or the CollabFilterTest unit test.
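The usual way to judge a collaborative-filtering model against a validation set is root-mean-square error between predicted and actual ratings. Here is a plain-Java sketch of that metric; this Rmse class is illustrative (Spark's own evaluation would run over RDDs), not code from the repo:

```java
public class Rmse {
    // Root-mean-square error: sqrt of the mean squared difference
    // between predicted and actual ratings. Lower is better.
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - actual[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / predicted.length);
    }
}
```

Tweaking the ALS parameters against a number like this is what the synthesized training/validation spreadsheet is for.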
More references:
- QuickStart has more on setup.
- You can find a collaborative filtering tutorial for Spark and a tutorial on the Spark-Cassandra Java connector, both of which I drew on.
- However, note that the example code in the Spark-Cassandra tutorial is outdated. The Java API class was moved to the japi subpackage.
- Bug in Guava versions: the pom.xml specifies Guava 15. This is because the Guava 14 used with the Spark-Cassandra connector is mismatched with the Guava 15 or above expected by Spark, which includes additional methods.