Multi-cloud configuration dataset

Introduction

We present the multi-cloud configuration dataset, collected for the purpose of comparing and evaluating various multi-cloud configuration algorithms. The goal of such an algorithm is to select, for a given workload, the cloud provider and corresponding configuration that minimize the workload's cost or runtime.

We believe that the scientific community would be interested in developing new algorithms for this emerging application, but running online experiments is both time-consuming and expensive. Our offline dataset allows researchers to compare different optimization algorithms without performing all the necessary cloud experiments each time. We make the dataset publicly available in the hope that it will make multi-cloud configuration research more accessible to the community.

While a similar dataset has previously been published for the single-cloud configuration problem, to the best of our knowledge there exists no public data relating to the multi-cloud scenario.

Dataset Overview

An example segment of the dataset is presented below and explained in detail in the following paragraphs.

provider | A_family | A_vcpu | B_family | B_vcpu | C_family | C_type | C_vcpu | nodes | workload | target_cost | target_runtime | status
A | 1 | 1 | | | | | | 3 | santander-xgboost | 0.004 | 78.889 | ok
B | | | 0 | 0 | | | | 5 | buzz-kmeans | 0.010 | 27.047 | ok
C | | | | | 0 | 1 | 0 | 2 | santander-xgboost | 0.018 | 598.825 | Timed out.

Multi-cloud domain

Each workload was deployed on a variety of configurations across three popular cloud providers: A, B and C, which have been anonymized to prevent a competitive analysis. For each cloud provider, we used the region closest to our research lab. All configuration parameters included in the dataset are described in the following paragraphs and summarized in the table below.

Each provider has several configuration parameters that have to be set, such as the CPU family and type. Most configuration parameters are unique to just one cloud provider and are denoted as, e.g., A_vcpu and B_family. The values taken by these parameters have been anonymized using label encoding.

The only configuration parameter which is shared across all cloud providers is the number of nodes.

This resulted in a total of 88 different multi-cloud configurations.

Parameter | Values
A_family | 0, 1, 2
A_vcpu | 0, 1
B_family | 0, 1
B_vcpu | 0, 1
C_family | 0, 1
C_type | 0, 1, 2
C_vcpu | 0, 1
nodes | 2, 3, 4, 5
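
The count of 88 configurations can be recovered from the table above: each provider contributes the product of the sizes of its own parameter value sets, multiplied by the four possible node counts. A minimal sketch of that check in Python (illustrative only, not part of the dataset tooling):

from itertools import product

# Provider-specific configuration parameters and their label-encoded values (from the table above).
provider_params = {
    "A": {"A_family": [0, 1, 2], "A_vcpu": [0, 1]},
    "B": {"B_family": [0, 1], "B_vcpu": [0, 1]},
    "C": {"C_family": [0, 1], "C_type": [0, 1, 2], "C_vcpu": [0, 1]},
}
nodes = [2, 3, 4, 5]  # the only parameter shared by all providers

total = 0
for provider, params in provider_params.items():
    count = len(list(product(*params.values(), nodes)))
    print(f"provider {provider}: {count} configurations")
    total += count

print(f"total: {total} multi-cloud configurations")  # 24 + 16 + 48 = 88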

Measured metrics

We considered two optimization targets: runtime and cost. For each workload on each configuration, we measured the total execution runtime. We then estimated the execution cost by multiplying the runtime by the hourly price of the selected node type and by the number of nodes. This is an imperfect estimate, as it does not include additional data transfer costs that may occur. However, it is difficult to estimate the cost more precisely, as cloud billing is done through monthly aggregate invoices.
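
In other words, the cost estimate is simply runtime × hourly node price × number of nodes. A minimal sketch of this calculation, assuming the runtime is expressed in seconds and using a made-up hourly price rather than a value from the dataset:

def estimate_cost(runtime_seconds: float, hourly_price_per_node: float, nodes: int) -> float:
    """Estimate execution cost as runtime (in hours) * hourly node price * number of nodes.
    Data transfer and other ancillary charges are ignored, as noted above."""
    runtime_hours = runtime_seconds / 3600.0
    return runtime_hours * hourly_price_per_node * nodes

# Placeholder example: a 78.889 s run on 3 nodes priced at a hypothetical 0.20 $/h each.
print(estimate_cost(78.889, 0.20, 3))  # ~0.013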

Workloads

We considered a range of distributed data analytics tasks, implemented using the Dask framework. We chose Dask because it integrates cleanly with Kubernetes and supports a wide variety of tasks. We used 10 different ML training and data pre-processing tasks, each run with 3 different input datasets, as listed in the table below; a minimal sketch of one such workload follows the table. The dataset comprises 30 workloads, defined as (Dask task, input dataset) pairs.

Dask tasks | Input datasets
kmeans, linear regression, logistic regression, naive bayes, poisson regression, polynomial features, spectral clustering, quantile transformer, standard scaler, xgboost | buzz in social media, credit card, santander
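
For illustration, one (Dask task, input dataset) workload might look roughly like the sketch below. This assumes dask and dask-ml are installed and a Dask cluster is reachable; it is not the exact code used to generate the dataset, and the random array is a placeholder for a real input dataset such as "buzz in social media":

import time

import dask.array as da
from dask.distributed import Client
from dask_ml.cluster import KMeans

# Client() with no arguments starts a local cluster; in the multi-cloud experiments
# this would instead point at a Kubernetes-backed Dask cluster.
client = Client()

# Placeholder input standing in for one of the real datasets.
X = da.random.random((1_000_000, 10), chunks=(100_000, 10))

# The "kmeans" task: fit a distributed k-means model and record the wall-clock runtime.
start = time.time()
model = KMeans(n_clusters=8)
model.fit(X)
print(f"runtime: {time.time() - start:.3f} s")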

Status and failures

The last column in the dataset is the status. Whenever a given workload with a given configuration ran correctly, status=ok. When the workload failed, status contains the error message; the runtime measurement and cost estimate of the failed run were nevertheless collected.

Additionally, several configurations could never be run because of quotas imposed by some cloud providers. In these cases, it was impossible to spin up the Kubernetes cluster, so no runtime or cost could be measured; entries for such configurations are therefore absent from the dataset.
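
In practice, a user of the dataset should therefore check the status column and be prepared for absent rows. A minimal pandas sketch, where dataset.csv is a hypothetical placeholder for the actual data file in this repository:

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name; use the actual dataset file

# Count successful vs. failed runs (any status other than "ok" carries an error message).
print(df["status"].value_counts())

# Keep only the runs that completed successfully.
ok_runs = df[df["status"] == "ok"]

# Configurations blocked by provider quotas have no row at all,
# so an empty lookup result means the configuration was never run.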

How to Use

When an optimizer needs to evaluate a workload on a given provider with a given configuration, instead of creating a Kubernetes cluster and deploying the workload online, it can simply read the runtime or cost from the corresponding row of the dataset.
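
A minimal lookup sketch in pandas is shown below. The file name dataset.csv is a hypothetical placeholder for the actual data file in this repository, and the column names follow the dataset overview above:

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name; use the actual dataset file

def lookup(provider: str, workload: str, nodes: int, **params) -> pd.DataFrame:
    """Return the rows matching a provider, workload, node count and any
    provider-specific parameters (e.g. A_family=1, A_vcpu=1)."""
    mask = (
        (df["provider"] == provider)
        & (df["workload"] == workload)
        & (df["nodes"] == nodes)
    )
    for name, value in params.items():
        mask &= df[name] == value
    return df[mask]

# Example: the first row from the dataset overview above.
row = lookup("A", "santander-xgboost", nodes=3, A_family=1, A_vcpu=1)
if row.empty:
    print("configuration never run (e.g. blocked by provider quotas)")
else:
    print(row[["target_cost", "target_runtime", "status"]])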

An example script running Random Search and a simple Random Forest predictor is available in example_script.ipynb.

How to Cite

@inproceedings{CloudBandit,
  title={Search-based Methods for Multi-Cloud Configuration},
  author={Lazuka, Malgorzata and Parnell, Thomas and Anghel, Andreea and Pozidis, Haralampos},
  booktitle={the IEEE International Conference on Cloud Computing (IEEE CLOUD 2022)},
  year={2022}
}
