Release HiBench-7.0 · Intel-bigdata/HiBench

We are happy to announce HiBench-7.0, a major release of HiBench. This release adds more machine learning workloads and support for Spark 2.1 and Spark 2.2. It also includes many bug fixes for the previous release.

Spark 2.1, 2.2 Support

Apache Spark 2.1 and 2.2 are new releases with a few API changes. HiBench 7.0 fully supports Spark 1.6, Spark 2.0, Spark 2.1 and Spark 2.2. You choose the Spark version when building HiBench, so you can test any of these versions.
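As a rough illustration of choosing the Spark version at build time, the sketch below assembles a Maven command line for one of the supported versions. The `-Psparkbench` profile and the `-Dspark` / `-Dscala` properties are recalled from the HiBench build documentation and should be checked against the docs in your checkout. The helper name `hibench_build_command` and the default Scala version are assumptions made for this example.

```python
# Spark versions that these release notes list as supported by HiBench 7.0.
SUPPORTED_SPARK_VERSIONS = ("1.6", "2.0", "2.1", "2.2")


def hibench_build_command(spark_version, scala_version="2.11"):
    """Return a Maven argv list that builds the Spark benchmarks for one version.

    ASSUMPTION: the profile and property names below come from the HiBench
    build documentation; verify them before use. The Scala default is only
    an example value.
    """
    if spark_version not in SUPPORTED_SPARK_VERSIONS:
        raise ValueError("unsupported Spark version: %s" % spark_version)
    return [
        "mvn",
        "-Psparkbench",
        "-Dspark=%s" % spark_version,
        "-Dscala=%s" % scala_version,
        "clean",
        "package",
    ]


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(hibench_build_command("2.2")))
```

Returning an argv list keeps the command easy to pass to `subprocess.run` once the flags have been confirmed.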

New Workloads

Eight ML workloads for Spark are added. The ML workloads described below are ALS (Alternating Least Squares), Bayes (Naive Bayes), GBT (Gradient-Boosted Trees), LDA (Latent Dirichlet Allocation), LR (Logistic Regression), Linear (Linear Regression), PCA (Principal Component Analysis), RF (Random Forests), SVD (Singular Value Decomposition) and SVM (Support Vector Machine).

The alternating least squares (ALS) algorithm is a well-known algorithm for collaborative filtering and is implemented in spark.mllib. The input data set is generated by RatingDataGenerator for a product recommendation system.
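To make the alternating idea concrete, here is a minimal pure-Python sketch of ALS with a single latent factor per user and per item. It is not the spark.mllib implementation and does not use RatingDataGenerator. The function names, the tiny ratings table and the regularization value are all made up for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Rank-1 ALS: alternately solve for user factors, then item factors.

    ratings maps (user, item) -> rating for the observed entries only.
    """
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; each user factor has a closed-form ridge solution.
        for u in range(n_users):
            num = sum(r * item[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item[i] ** 2 for (uu, i) in ratings if uu == u)
            user[u] = num / den
        # Fix user factors; solve for each item factor the same way.
        for i in range(n_items):
            num = sum(r * user[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user[u] ** 2 for (u, ii) in ratings if ii == i)
            item[i] = num / den
    return user, item


def squared_error(ratings, user, item):
    """Sum of squared errors over the observed ratings."""
    return sum((r - user[u] * item[i]) ** 2 for (u, i), r in ratings.items())


# Hypothetical ratings built from user factors (1, 2, 3) and item factors
# (1, 0.5, 2); three of the nine entries are left unobserved.
RATINGS = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0,
           (1, 2): 4.0, (2, 1): 1.5, (2, 2): 6.0}
user_f, item_f = als_rank1(RATINGS, n_users=3, n_items=3)
print("predicted rating for user 0, item 2:", round(user_f[0] * item_f[2], 2))
```

Each half-step is an ordinary regularized least-squares problem, which is why the method scales well when the per-user and per-item solves are distributed.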

Naive Bayes (Bayes) is a simple multiclass classification algorithm that assumes independence between every pair of features. The workload is implemented in spark.mllib and uses automatically generated documents whose words follow a Zipfian distribution. The dictionary used for text generation comes from the default Linux file /usr/share/dict/linux.words.
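The independence assumption can be sketched with a small multinomial Naive Bayes classifier in plain Python. This is only an illustration of the technique, not the spark.mllib code path. The function names, the toy documents and the smoothing value `alpha` are assumptions for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = {word for doc in docs for word in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_like = {w: math.log((word_counts[label][w] + alpha) / denom)
                    for w in vocab}
        model[label] = (math.log(n_docs / len(docs)), log_like)
    return model


def predict_naive_bayes(model, doc):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        # Independence assumption: per-word log likelihoods simply add up.
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)


# Hypothetical toy corpus with two classes.
DOCS = [["spark", "rdd", "spark"], ["hadoop", "hdfs"],
        ["spark", "mllib"], ["hdfs", "yarn", "hadoop"]]
LABELS = [0, 1, 0, 1]
nb_model = train_naive_bayes(DOCS, LABELS)
print(predict_naive_bayes(nb_model, ["spark", "spark", "hdfs"]))
```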

Gradient-boosted trees (GBT) is a popular regression method that uses ensembles of decision trees and is implemented in spark.mllib. The input data set is generated by GradientBoostingTreeDataGenerator.
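The boosting loop can be sketched in a few lines with depth-1 trees (stumps) on a single feature and squared-error loss. This is a simplified stand-in for the tree ensembles in spark.mllib. All names, the data and the hyperparameters are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_gbt(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if x <= threshold else right)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict_gbt(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (left if x <= t else right)
                      for t, left, right in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict_gbt(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y = x^2 on eight points.
XS = [float(x) for x in range(1, 9)]
YS = [x * x for x in XS]
gbt_model = fit_gbt(XS, YS)
print("training MSE:", mean_squared_error(gbt_model, XS, YS))
```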

Latent Dirichlet allocation (LDA) is a topic model that infers topics from a collection of text documents. It is implemented in spark.mllib. The input data set is generated by LDADataGenerator.

Logistic Regression (LR) is a popular method for predicting a categorical response. It is implemented in spark.mllib with the LBFGS optimizer. The input data set is generated by LogisticRegressionDataGenerator, which is based on a random balanced decision tree. The data set contains three kinds of data: categorical, continuous and binary.
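For readers who want to see the model itself, the sketch below fits a binary logistic regression in plain Python. Note that it uses plain batch gradient descent rather than the LBFGS optimizer named above, purely to keep the example short. The function names, data and hyperparameters are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Minimize mean log loss with batch gradient descent (not LBFGS)."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y  # derivative of log loss w.r.t. z
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_probability(weights, bias, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical one-feature data set: class 1 for larger feature values.
LR_ROWS = [[0.0], [1.0], [2.0], [3.0]]
LR_LABELS = [0, 0, 1, 1]
lr_weights, lr_bias = train_logistic_regression(LR_ROWS, LR_LABELS)
print(round(predict_probability(lr_weights, lr_bias, [3.0]), 3))
```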

Linear Regression (Linear) is a workload implemented in spark.mllib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.
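A minimal sketch of stochastic gradient descent for a one-feature linear model is shown below. It only illustrates the update rule and does not reproduce the spark.mllib optimizer or LinearRegressionDataGenerator. The function name, data, step size and seed are assumptions.

```python
import random


def train_linear_sgd(xs, ys, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = weight * x + bias by updating on one sample at a time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weight, bias = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order
        for i in order:
            error = weight * xs[i] + bias - ys[i]
            # Gradient of 0.5 * error^2 with respect to weight and bias.
            weight -= learning_rate * error * xs[i]
            bias -= learning_rate * error
    return weight, bias


# Hypothetical noiseless data drawn from y = 2x + 1.
SGD_XS = [0.0, 0.25, 0.5, 0.75, 1.0]
SGD_YS = [2.0 * x + 1.0 for x in SGD_XS]
sgd_weight, sgd_bias = train_linear_sgd(SGD_XS, SGD_YS)
print(round(sgd_weight, 3), round(sgd_bias, 3))
```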

Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. PCA is used widely in dimensionality reduction. This workload is implemented in spark.mllib. The input data set is generated by PCADataGenerator.
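The "largest variance" direction can be found for small data with power iteration on the covariance matrix, as in the sketch below. This is one simple way to get the first principal component and is not necessarily how spark.mllib computes it. The function name and sample points are assumptions.

```python
import math


def first_principal_component(points, iterations=100):
    """Return a unit vector along the direction of largest variance."""
    n = len(points)
    d = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration; assumes the start vector is not orthogonal
    # to the leading eigenvector.
    vec = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


# Hypothetical 2-D points scattered close to the line y = x.
PCA_POINTS = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
pca_direction = first_principal_component(PCA_POINTS)
print([round(v, 3) for v in pca_direction])
```

Projecting each centered point onto this vector gives its first coordinate in the rotated system, which is the basis of dimensionality reduction.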

Random forests (RF) are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. This workload is implemented in spark.mllib and the input data set is generated by RandomForestDataGenerator.

Singular value decomposition (SVD) factorizes a matrix into three matrices. This workload is implemented in spark.mllib and its input data set is generated by SVDDataGenerator.
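For reference, the three factors can be written out as follows. The truncation to the top k singular values is an assumption added here for illustration, since the text above does not say how many singular values the workload computes.

```latex
A = U \Sigma V^{\top}, \qquad
A \in \mathbb{R}^{m \times n}, \quad
U^{\top} U = I, \quad
V^{\top} V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0

% Rank-k truncation (k is a hypothetical parameter):
A \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k}, \quad
\Sigma_k \in \mathbb{R}^{k \times k}, \quad
V_k \in \mathbb{R}^{n \times k}
```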

Support Vector Machine (SVM) is a standard method for large-scale classification tasks and is implemented in spark.mllib. The input data set is generated by SVMDataGenerator.
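A linear SVM can be sketched as subgradient descent on the regularized hinge loss, as below. This is a teaching-sized illustration and makes no claim about the optimizer settings used by the workload. The function names, data and hyperparameters are assumptions, and the labels are encoded as -1 and +1.

```python
def train_linear_svm(rows, labels, learning_rate=0.1, reg=0.01, epochs=1000):
    """Minimize reg/2 * |w|^2 + mean hinge loss with batch subgradient steps."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [reg * w for w in weights]
        grad_b = 0.0
        for x, y in zip(rows, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1.0:
                # Only points inside the margin contribute to the subgradient.
                for j, xi in enumerate(x):
                    grad_w[j] -= y * xi / n
                grad_b -= y / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def svm_decision(weights, bias, x):
    """Signed score; the predicted class is the sign of this value."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical one-feature data set with labels in {-1, +1}.
SVM_ROWS = [[0.0], [1.0], [2.0], [3.0]]
SVM_LABELS = [-1, -1, 1, 1]
svm_weights, svm_bias = train_linear_svm(SVM_ROWS, SVM_LABELS)
print(round(svm_decision(svm_weights, svm_bias, [3.0]), 2))
```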

Contributors

The following developers contributed to this release:
Yinan Xiang(@ynXiang)
Teng Jiang(@jtengyp)
zhuoxiangchen(@zhuoxiangchen)
Shilei Qian(@qiansl127)
Vincent Xie(@VinceShieh)
Peng Meng(@mpjlu)
Carson Wang(@carsonwang)
Yu He(@heyu1)
Chenzhao Guo(@gczsjdy)
Naresh Gundla(@nareshgundla)
Rajarshi Biswas(@rajarshibiswas)
n3rV3(@n3rV3)
Ziyue Huang(@ZiyueHuang)
Chong Tang(@ChongTang)
Michael Mior(@michaelmior)
Yanbing Zhang(@zybing)
Huafeng Wang(@huafengw)
Sophia Sun(@sophia-sun)

Thanks to everyone who contributed! We look forward to more contributions from everyone for the next release.
