@Copyright 2013-2017 Indiana University Apache License 2.0
@Author: Bingjing Zhang
Harp is a framework for machine learning applications.
- A Hadoop plugin. It currently supports Hadoop versions 2.6.0 through 2.7.3.
- Hierarchical data abstraction (arrays/objects, partitions/tables)
- Pool based memory management
- Collective + event-driven programming model (distributed computing)
- Dynamic Scheduler + Static Scheduler (multi-threading)
1. Install Maven by following the official Maven installation instructions.
3. Install the third-party jar file. The javaml jar is required by the randomforest application; it is not required by the harp project itself.
mvn install:install-file -Dfile=third_party/javaml-0.1.7.jar -DgroupId=net.sf -DartifactId=javaml -Dversion=0.1.7 -Dpackaging=jar
mvn clean package
cp harp-project/target/harp-project-1.0-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/
cp third_party/fastutil-7.0.13.jar $HADOOP_HOME/share/hadoop/mapreduce/
7. Edit mapred-site.xml in $HADOOP_HOME/etc/hadoop and add memory and Java opts settings for map-collective tasks. For example:
<property>
<name>mapreduce.map.collective.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.collective.java.opts</name>
<value>-Xmx256m -Xms256m</value>
</property>
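As a rough sanity check on the values above, the JVM heap set in mapreduce.map.collective.java.opts should fit inside the memory set in mapreduce.map.collective.memory.mb. This assumes the two properties behave like the standard mapreduce.map.memory.mb and mapreduce.map.java.opts pair. The sketch below is not part of Harp or Hadoop: the HeapCheck class and its methods are hypothetical, and it assumes -Xmx is written with a k, m or g suffix (or as plain bytes).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Harp or Hadoop.
class HeapCheck {
    // Matches -Xmx followed by digits and an optional k/m/g unit suffix.
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([kKmMgG]?)");

    // Returns the -Xmx value in whole megabytes, or -1 if no -Xmx flag is present.
    static long parseXmxMb(String javaOpts) {
        Matcher m = XMX.matcher(javaOpts);
        if (!m.find()) {
            return -1;
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2).toLowerCase()) {
            case "g": return value * 1024;
            case "m": return value;
            case "k": return value / 1024;
            default:  return value / (1024 * 1024); // no suffix means bytes
        }
    }

    // True when an -Xmx flag is present and the heap is no larger than the container.
    static boolean fitsInContainer(String javaOpts, long containerMb) {
        long heapMb = parseXmxMb(javaOpts);
        return heapMb >= 0 && heapMb <= containerMb;
    }

    public static void main(String[] args) {
        // Values taken from the mapred-site.xml example.
        System.out.println(fitsInContainer("-Xmx256m -Xms256m", 512)); // prints true
    }
}
```

With the example settings, a 256 MB heap leaves half of the 512 MB allocation for non-heap JVM memory.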
jobConf.set("mapreduce.framework.name", "map-collective");
cp harp-app/target/harp-app-1.0-SNAPSHOT.jar $HADOOP_HOME
cd $HADOOP_HOME
sbin/start-dfs.sh
sbin/start-yarn.sh
hadoop jar harp-app-1.0-SNAPSHOT.jar edu.iu.kmeans.regroupallgather.KMeansLauncher <num of points> <num of centroids> <vector size> <num of point files per worker> <number of map tasks> <num threads> <number of iterations> <work dir> <local points dir>
hadoop jar harp-app-1.0-SNAPSHOT.jar edu.iu.kmeans.regroupallgather.KMeansLauncher 1000 10 100 5 2 2 10 /kmeans /tmp/kmeans
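To make the positional arguments in the example run easier to read, the sketch below pairs each value with its name from the usage line. It is only an illustration: the KMeansArgs class and its describe method are hypothetical and do not reflect how KMeansLauncher parses its own input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of Harp.
class KMeansArgs {
    // Argument names in the order given by the usage line.
    static final String[] NAMES = {
        "num of points", "num of centroids", "vector size",
        "num of point files per worker", "number of map tasks",
        "num threads", "number of iterations", "work dir", "local points dir"
    };

    // Pairs each positional argument with its name, preserving order.
    static Map<String, String> describe(String[] args) {
        if (args.length != NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + NAMES.length + " arguments, got " + args.length);
        }
        Map<String, String> named = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length; i++) {
            named.put(NAMES[i], args[i]);
        }
        return named;
    }

    public static void main(String[] args) {
        // Values from the example command.
        String[] example = {"1000", "10", "100", "5", "2", "2", "10", "/kmeans", "/tmp/kmeans"};
        describe(example).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Read this way, the example run uses 1000 points, 10 centroids, vectors of size 100, 5 point files per worker, 2 map tasks, 2 threads and 10 iterations, with /kmeans as the work directory and /tmp/kmeans as the local points directory.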