IMLLIB is a package of Spark-based implementations. It includes:
- Factorization Machines (LibFM)
- Field-Aware Factorization Machines (FFM)
- Conditional Random Fields (CRF)
- Adaptive learning-rate optimizers (AdaGrad, Adam)
The package can be imported as a dependency in other projects, which makes all of the LibFM, FFM, CRF, and AdaOptimizer functionality available.
Build from Source
How to Use
The IMLLIB package can either be loaded directly into the Spark shell or imported as a dependency in other projects.
Use in Spark-Shell
(1) Run the Spark shell with the IMLLIB package:
spark-shell --jars 'Path/imllib_2.11-0.0.1.jar'
where Path is the directory containing imllib_2.11-0.0.1.jar.
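Once the shell starts, the package's classes can be imported. The line below is only a sketch: the package prefix is inferred from the artifact's com.intel group id, and the actual class names may differ, so check the jar's contents.

```scala
// In the Spark shell. The package path is an assumption based on the
// "com.intel" % "imllib_2.11" artifact coordinates; adjust as needed.
import com.intel.imllib._
```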
Alternatively, to make the package available on every node of a cluster:
(1) On the driver node, add the following lines to conf/spark-defaults.conf:
spark.executor.extraClassPath /usr/local/spark/lib/*
spark.driver.extraClassPath /usr/local/spark/lib/*
(2) Copy imllib_2.11-0.0.1.jar into /usr/local/spark/lib on all worker nodes.
(3) Run the Spark shell.
Use as a dependency
(1) Build from source and publish to the local repository:
sbt compile publish-local
(2) Move the whole directory
(3) Add the following line to build.sbt to import the IMLLIB package as a dependency:
libraryDependencies += "com.intel" % "imllib_2.11" % "0.0.1"
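A minimal build.sbt might look like the sketch below. The project name and the Scala and Spark versions are assumptions; match them to your environment (the artifact suffix _2.11 implies Scala 2.11).

```scala
// Minimal build.sbt sketch. Versions are assumptions; match your installation.
name := "my-imllib-app"
scalaVersion := "2.11.8"                      // imllib_2.11 requires Scala 2.11
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided",
  "com.intel" % "imllib_2.11" % "0.0.1"       // published locally in step (1)
)
```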
How to Test
There are shell scripts in
bin/template for testing LibFM, FFM, CRF, and LR with AdaOptimizer, respectively. Each script runs Spark in local mode with data stored on Hadoop.
First modify the script with any necessary changes, such as the Hadoop hostname and port, then run it to check that the algorithm works.
A Spark-based implementation of Factorization Machines (LibFM)
A Spark-based implementation of Field-Aware Factorization Machines with a parallelized AdaGrad solver. See http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
The data format must be reworked to fit FFM: the LIBSVM data format is extended by adding field information to each feature, so that each line has the form: label field1:feat1:val1 field2:feat2:val2
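As a sketch of that rework (the helper below is hypothetical and not part of IMLLIB), a LIBSVM line can be converted by prepending each feature's field id, given a feature-to-field mapping supplied by the caller:

```scala
// Hypothetical helper, not part of IMLLIB: extend a LIBSVM line
// ("label feat:val ...") to the FFM format ("label field:feat:val ...")
// using a caller-supplied feature-id -> field-id mapping.
def toFFM(libsvmLine: String, fieldOf: Int => Int): String = {
  val tokens = libsvmLine.trim.split("\\s+")
  val label  = tokens.head
  val ffmFeatures = tokens.tail.map { token =>
    val Array(feat, value) = token.split(":", 2)
    s"${fieldOf(feat.toInt)}:$feat:$value"
  }
  (label +: ffmFeatures).mkString(" ")
}

// toFFM("1 3:1.0 7:0.5", Map(3 -> 1, 7 -> 2)) yields "1 1:3:1.0 2:7:0.5"
```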
A Spark-based implementation of Conditional Random Fields (CRF) for segmenting/labeling sequential data.
CRF-Spark provides the following features:
- Parallel training based on Spark RDDs
- Supports a simple format for training and test files; any other format can be read with a simple tokenizer
- A common feature-template design, also used in other machine learning tools such as CRF++ and miralium
- Fast training based on the limited-memory BFGS optimization algorithm (for fitting L2-regularized models) or the orthant-wise limited-memory quasi-Newton optimization algorithm (for fitting L1-regularized models)
- Supports two verbose levels for extra information: VerboseLevel1 shows the marginal probability for each tag and the conditional probability of the entire output; VerboseLevel2 additionally shows marginal probabilities for all other candidates
- Supports n-best outputs
- Linear-chain (first-order Markov) CRF
- Testing can run either in parallel or serially
A Spark-based implementation of the Adam and AdaGrad optimizers, methods for stochastic optimization. See https://arxiv.org/abs/1412.6980 and http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf Compared with plain SGD, Adam and AdaGrad generally perform better; with sparse features in particular, Adam can converge faster than standard SGD.
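To illustrate why adaptive learning rates help with sparse features, here is a minimal, self-contained sketch of the standard AdaGrad update rule (an assumption about the method, not IMLLIB's API): frequently updated coordinates accumulate a large squared-gradient sum and thus take smaller steps, while rarely seen features keep a comparatively large effective learning rate.

```scala
// Minimal AdaGrad update sketch (standard rule; not IMLLIB's API).
// gradSqSum(i) accumulates squared gradients; each coordinate's step is
// scaled by 1 / (sqrt(gradSqSum(i)) + eps), so sparse features, which are
// updated rarely, keep a larger effective learning rate.
object AdaGradSketch {
  def step(weights: Array[Double], grad: Array[Double], gradSqSum: Array[Double],
           lr: Double = 0.01, eps: Double = 1e-8): Unit = {
    var i = 0
    while (i < weights.length) {
      gradSqSum(i) += grad(i) * grad(i)
      weights(i)  -= lr * grad(i) / (math.sqrt(gradSqSum(i)) + eps)
      i += 1
    }
  }
}
```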
Contact & Feedback
If you encounter bugs, feel free to submit an issue or pull request. You can also reach us by mail at: