execo-g5k-benchmarks

Benchmarks to use with execo and Grid'5000 (g5k). This project is meant to be used together with execo-utilities-g5k (https://github.com/Brandonage/execo-utilities-g5k) in order to perform reproducible experiments. The current version includes SparkBench (https://github.com/SparkTC/spark-bench) and YCSB (https://github.com/brianfrankcooper/YCSB). To make it work, add the following key to the .execo.conf.py file in your home directory; it will be merged with the default execo.config.g5k_configuration:

g5k_configuration = {
    'g5k_user': 'abrandon',
    'oar_job_key_file': '/home/abrandon/.ssh/id_rsa',
}

How to use it

SparkBench

In your execo script, create the SparkBench object provided by this project. The parameters needed are:

  1. The home_directory of a previously built SparkBench project
  2. The master_node from which we will run the execo commands that control the SparkBench execution
  3. The resource_manager we want to use to submit jobs to the Spark cluster. This is needed because some functionality and parameters are specific to each resource manager
  4. The root_to_spark_submit path to the spark-submit binary
  5. The default_master address of the Spark master to use
from spark.sparkbench import SparkBench

sb = SparkBench(home_directory="/home/abrandon/execo-g5k-benchmarks/spark/spark-bench/",
                master_node='griffon-1',
                resource_manager="yarn",
                root_to_spark_submit="/opt/spark/bin/spark-submit",
                default_master="yarn")

After creating the object, we can start launching the different workloads. Example:

sb.launchgeneratetextfile(output="words.txt", size=60, npartitions=200, submit_conf=[["spark.executor.memory", "7g"]])
sb.launchngrams(input="words.txt", submit_conf=[["spark.executor.memory", "4g"]])

Check the class in https://github.com/Brandonage/execo-g5k-benchmarks/blob/master/spark/sparkbench.py to see the different methods provided.
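
The two calls above can be combined into a reproducible experiment. A minimal sketch follows, sweeping the executor memory across runs; the memory values and the loop structure are illustrative assumptions, not part of this project's API:

```python
from spark.sparkbench import SparkBench

sb = SparkBench(home_directory="/home/abrandon/execo-g5k-benchmarks/spark/spark-bench/",
                master_node='griffon-1',
                resource_manager="yarn",
                root_to_spark_submit="/opt/spark/bin/spark-submit",
                default_master="yarn")

# Generate the input text file once
sb.launchgeneratetextfile(output="words.txt", size=60, npartitions=200,
                          submit_conf=[["spark.executor.memory", "7g"]])

# Re-run the n-grams workload under different executor memory settings
for memory in ["2g", "4g", "6g"]:  # arbitrary values for illustration
    sb.launchngrams(input="words.txt",
                    submit_conf=[["spark.executor.memory", memory]])
```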

YCSB

Prerequisite: JDK 8 must be installed on all nodes

In contrast to SparkBench, which has to be installed in advance, the YCSB class can install the benchmark directly on the desired nodes. So far, only the Cassandra benchmark of YCSB can be used. In your execo script, create the CassandraYCSB object provided by this project. The parameters needed are:

  1. The install_nodes where you want to install YCSB
  2. The execo_conn_params used to connect to the nodes (e.g. execo.config.default_connection_params)
  3. The cassandra_nodes where Cassandra is installed
cassandra_ycsb = CassandraYCSB(install_nodes=nodes_to_install,
                               execo_conn_params=default_connection_params,
                               cassandra_nodes=cassandra_nodes)
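
The node lists themselves are ordinary execo hosts. A minimal sketch of how they might be obtained on Grid'5000 with execo_g5k follows; the reservation parameters, the site, and the split between client and Cassandra nodes are illustrative assumptions, not part of this project:

```python
from execo.config import default_connection_params
from execo_g5k import OarSubmission, oarsub, get_oar_job_nodes

# Reserve four nodes on one site (resources, walltime and site are arbitrary choices)
jobs = oarsub([(OarSubmission(resources="nodes=4", walltime="2:00:00"), "nancy")])
job_id, site = jobs[0]
nodes = get_oar_job_nodes(job_id, site)

# Assumption: half the nodes run the YCSB clients, half host Cassandra
nodes_to_install = nodes[:2]
cassandra_nodes = nodes[2:]
```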

After creating the object, we can start launching the different workloads.

  1. Prepare the workload (check the YCSB documentation for further info about the parameters: https://github.com/brianfrankcooper/YCSB/wiki/Core-Properties)
cassandra_ycsb.load_workload(from_node=ycsb_clients,  # the nodes where YCSB is installed
                             workload='workloada',
                             recordcount='1000',
                             threadcount='1',
                             fieldlength='500')
  2. Run the workload
cassandra_ycsb.run_workload(iteration=i,  # an iteration counter, useful for several executions in a loop
                            res_dir="results/execution" + str(i),  # where we want to store the results of the execution
                            from_node=ycsb_clients,
                            workload='workloada',
                            recordcount='1000',
                            threadcount='1',
                            fieldlength='500',
                            target='40')
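
Since run_workload takes an iteration counter and a results directory, a common pattern is to repeat the same workload several times in a loop; the number of iterations below is an arbitrary choice for this sketch:

```python
for i in range(5):  # five repetitions, chosen arbitrarily for illustration
    cassandra_ycsb.run_workload(iteration=i,
                                res_dir="results/execution" + str(i),
                                from_node=ycsb_clients,
                                workload='workloada',
                                recordcount='1000',
                                threadcount='1',
                                fieldlength='500',
                                target='40')
```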

Analysing results from YCSB

In addition, the YCSB class provides a method to analyse the result files generated on all the different hosts for the throughput. Use it in the following way:

cassandra_ycsb.analyse_results(
                    directory=d,  # the directory to search for result files
                    workload=w,  # the workload
                    metric="Throughput")  # the metric to analyse; only Throughput is supported at the moment

The method returns a three-tuple with all the throughput metrics registered for each host, together with the mean and the variance of that set of throughput metrics.
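
Assuming the tuple is ordered as described (per-host metrics, then mean, then variance), it can be unpacked like this:

```python
throughputs, mean, variance = cassandra_ycsb.analyse_results(
    directory="results/execution0",
    workload="workloada",
    metric="Throughput")
print("Per-host throughput: {}".format(throughputs))
print("Mean: {}, variance: {}".format(mean, variance))
```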

Bugs or problems

Please open an issue or contact me directly and I can help you set everything up.
