Python 2.x (>= 2.6) is required.
`bc` is required to generate the HiBench report.
Supported Hadoop versions: Apache Hadoop 2.x, 3.0.x, 3.1.x, 3.2.x, CDH5.x, HDP
Supported Spark versions: 2.4.x, 3.0.x
Build HiBench according to the build instructions.
Start HDFS, YARN, and Spark in the cluster.
Note: Starting from HiBench 8.0, support for Spark 2.3.x and earlier is deprecated; please either use an earlier HiBench release or upgrade your Spark.
Hadoop is used to generate the input data of the workloads.
Create and edit `conf/hadoop.conf`:

```shell
cp conf/hadoop.conf.template conf/hadoop.conf
```

Set the below properties properly:
| Property | Meaning |
|----------|---------|
| hibench.hadoop.home | The Hadoop installation location |
| hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is `/YOUR/HADOOP/HOME/bin/hadoop` |
| hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is `/YOUR/HADOOP/HOME/etc/hadoop` |
| hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. `hdfs://localhost:8020/user/username` |
| hibench.hadoop.release | Hadoop release provider. Supported values: `apache`, `cdh5`, `hdp` |
Note: For CDH and HDP users, please update `hibench.hadoop.release` accordingly. The default value is for the Apache release.
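For example, a minimal `conf/hadoop.conf` for an Apache Hadoop installation might look like the following sketch. The installation path and HDFS URL are placeholders for your own environment, not defaults:

```
# Hadoop installation location (placeholder path)
hibench.hadoop.home           /opt/hadoop

# Path of the hadoop executable
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop

# Hadoop configuration directory
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop

# Root HDFS path to store HiBench data (placeholder URL)
hibench.hdfs.master           hdfs://localhost:8020/user/hibench

# Hadoop release provider: apache, cdh5 or hdp
hibench.hadoop.release        apache
```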
Create and edit `conf/spark.conf`:

```shell
cp conf/spark.conf.template conf/spark.conf
```
Set the below properties properly:
| Property | Meaning |
|----------|---------|
| hibench.spark.home | The Spark installation location |
| hibench.spark.master | The Spark master, e.g. `spark://xxx:7077`, `yarn-client` |
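For example, a minimal `conf/spark.conf` for running on YARN might look like this sketch (the Spark home is a placeholder for your own installation):

```
# Spark installation location (placeholder path)
hibench.spark.home    /opt/spark

# Spark master: e.g. spark://xxx:7077, yarn-client
hibench.spark.master  yarn-client
```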
4. Run a workload
To run a single workload, run its `prepare.sh` script and then its `run.sh` script. `prepare.sh` launches a Hadoop job to generate the input data on HDFS; `run.sh` submits the Spark job to the cluster.
`bin/run_all.sh` can be used to run all workloads listed in `conf/benchmarks.lst`.
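As a concrete sketch, the two steps typically look like the following; the `wordcount` micro workload is used here as an example, so adjust the path to the workload you actually want to run:

```shell
# Generate the workload's input data on HDFS (Hadoop job)
bin/workloads/micro/wordcount/prepare/prepare.sh

# Submit the Spark job to the cluster
bin/workloads/micro/wordcount/spark/run.sh

# Alternatively, run every workload listed in conf/benchmarks.lst
bin/run_all.sh
```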
5. View the report
`<HiBench_Root>/report/hibench.report` is a summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.
The report directory also includes further information for debugging and tuning.
- `<workload>/spark/bench.log`: Raw logs on the client side.
- `<workload>/spark/monitor.html`: System utilization monitor results.
- `<workload>/spark/conf/<workload>.conf`: Generated environment variable configurations for this workload.
- `<workload>/spark/conf/sparkbench/<workload>/sparkbench.conf`: Generated configuration for this workload, which is used for mapping to environment variables.
- `<workload>/spark/conf/sparkbench/<workload>/spark.conf`: Generated configuration for Spark.
6. Input data size
To change the input data size, set the data scale profile in `conf/hibench.conf`. Available values are tiny, small, large, huge, gigantic, and bigdata. The definitions of these profiles can be found in each workload's conf file.
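For instance, switching every workload to the `large` profile would look like this in `conf/hibench.conf` (assuming the `hibench.scale.profile` property name used by recent HiBench releases):

```
# Data scale profile: tiny, small, large, huge, gigantic, bigdata
hibench.scale.profile  large
```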
Change the below properties in `conf/hibench.conf` to control the parallelism:
| Property | Meaning |
|----------|---------|
| hibench.default.map.parallelism | Partition number in Spark |
| hibench.default.shuffle.parallelism | Shuffle partition number in Spark |
Change the below properties to control the Spark executor number, executor cores, executor memory, and driver memory:
| Property | Meaning |
|----------|---------|
| hibench.yarn.executor.num | Spark executor number in Yarn mode |
| hibench.yarn.executor.cores | Spark executor cores in Yarn mode |
| spark.executor.memory | Spark executor memory |
| spark.driver.memory | Spark driver memory |
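Putting the parallelism and resource properties together, a tuning block might look like the following sketch. The values are purely illustrative, not recommendations; note that the `spark.*` memory properties typically live in `conf/spark.conf` rather than `conf/hibench.conf`:

```
# Partition and shuffle-partition numbers in Spark
hibench.default.map.parallelism      200
hibench.default.shuffle.parallelism  200

# Executor number and cores in Yarn mode
hibench.yarn.executor.num    4
hibench.yarn.executor.cores  4

# Executor and driver memory (illustrative values)
spark.executor.memory  4g
spark.driver.memory    4g
```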