A sample machine learning project using Apache Spark.
I am using R.A. Fisher's famous "iris" dataset, which contains 150 entries across 3 classes. A description of the data can be found here
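If you want to confirm the entry and class counts for a data file, a small check along these lines may help. It is only a sketch: it assumes the file follows the usual UCI layout for iris (four comma-separated measurements followed by a class label on each line), and `iris_class_counts` is a hypothetical helper name, not something this project provides.

```shell
# Hypothetical helper: print the number of entries per class in an iris-style CSV.
# Assumes five comma-separated fields per line, with the class label in the last one.
iris_class_counts() {
  # NF == 5 skips blank or malformed lines (for example a trailing empty line).
  awk -F, 'NF == 5 { counts[$5]++ } END { for (c in counts) print c, counts[c] }' "$1" | sort
}
```

Running `iris_class_counts src/main/resources/iris.data` should list one line per class with its entry count.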
This project uses Spark 1.6.0 and Scala 2.11. Spark does not currently provide a pre-built distribution for Scala 2.11, so you will need to spend ~15 minutes downloading and compiling the source.
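As a rough sketch of that build step, the function below follows the Scala 2.11 instructions in the Spark 1.6 build documentation as I understand them. It assumes the Spark 1.6.0 source is already unpacked and that a JDK is installed. The function name `build_spark_scala_211` is hypothetical, and the `-Pyarn -Phadoop-2.4` profiles are an assumption that you may need to change for your environment.

```shell
# Hypothetical helper: build an unpacked Spark 1.6.0 source tree against Scala 2.11.
# $1 is the path to the Spark source directory.
build_spark_scala_211() {
  src_dir="$1"
  # Refuse to run outside a Spark source tree.
  if [ ! -f "${src_dir}/dev/change-scala-version.sh" ]; then
    echo "not a Spark source tree: ${src_dir}" >&2
    return 1
  fi
  (
    cd "${src_dir}" || exit 1
    # Switch the build files to Scala 2.11, then compile without running tests.
    # The Hadoop/YARN profiles below are assumptions; adjust them as needed.
    ./dev/change-scala-version.sh 2.11 &&
      build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
  )
}
```

After a successful build, the source directory itself can serve as the Spark 1.6.0 directory referred to below.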
To use this project, set SPARK_1_6_HOME to your Spark 1.6.0 directory (or substitute the path directly) and, if you want to use your own data, replace src/main/resources/iris.data with it. Then run the following commands:
sbt clean assembly
# The classification task
${SPARK_1_6_HOME}/bin/spark-submit --class ca.jakegreene.iris.IrisClassification --master spark://127.0.0.1:7077 target/scala-2.11/iris.jar src/main/resources/iris.data
# The clustering task
${SPARK_1_6_HOME}/bin/spark-submit --class ca.jakegreene.iris.IrisClustering --master spark://127.0.0.1:7077 target/scala-2.11/iris.jar src/main/resources/iris.data
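Since the two commands above differ only in the class name, a small helper can assemble them and catch an unset SPARK_1_6_HOME before anything is submitted. This is a sketch rather than part of the project: `iris_submit_cmd` is a hypothetical name, and it only prints the command so you can inspect it first.

```shell
# Hypothetical helper: print the spark-submit command for one of the two tasks.
# $1 is "classification" or "clustering"; $2 is an optional data file path.
iris_submit_cmd() {
  task="$1"
  data="${2:-src/main/resources/iris.data}"
  case "${task}" in
    classification) class="ca.jakegreene.iris.IrisClassification" ;;
    clustering)     class="ca.jakegreene.iris.IrisClustering" ;;
    *) echo "unknown task: ${task}" >&2; return 1 ;;
  esac
  # Fail early if the Spark directory has not been configured.
  if [ -z "${SPARK_1_6_HOME:-}" ]; then
    echo "SPARK_1_6_HOME is not set" >&2
    return 1
  fi
  echo "${SPARK_1_6_HOME}/bin/spark-submit --class ${class} --master spark://127.0.0.1:7077 target/scala-2.11/iris.jar ${data}"
}
```

For example, `iris_submit_cmd clustering` prints the clustering command from above, and passing a second argument swaps in a different data file.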