This repository contains instructions for reproducing the experiments in our DEBS'24 submission "StreamBed: capacity planning for stream processing".
StreamBed is a capacity planning system for stream processing. It predicts, ahead of any production deployment, the resources that a query will require to process an incoming data rate sustainably, and the appropriate configuration of these resources. StreamBed builds a capacity planning model by piloting a series of runs of the target query in a small-scale, controlled testbed. We implement StreamBed for the popular Flink DSP engine. Our evaluation with large-scale queries of the Nexmark benchmark demonstrates that StreamBed can effectively and accurately predict capacity requirements for jobs spanning more than 1,000 cores using a testbed of only 48 cores.
Guillaume Rosinosky - Donatien Schmitz - Etienne Rivière
- Project Structure
- Description & Requirements
- Setup and Run Streambed
- Experiments with StreamBed
- StreamBed Source Code
- Reproducing the DEBS24 results
- License
Available directories:
- infra: infrastructure setup
- streambed: Streambed source code
- experiments: Instructions on performing experiments with StreamBed and DEBS results
- common: shared scripts for notebooks
How to access
Streambed can be downloaded from the following link: https://github.com/CloudLargeScale-UCLouvain/StreamBed
Hardware dependencies
StreamBed assumes a working Kubernetes installation with access to an Apache Kafka installation (which can be deployed by StreamBed scripts), and will iteratively deploy and test various Flink installations.
The Kubernetes cluster should be dimensioned accordingly: sufficient resources should be available for the Flink job manager, the task managers, and the Kafka nodes (if managed by StreamBed). A node can be bare-metal or a small VM. The cluster comprises:
- a Kafka installation of $K$ nodes, preferably with fast SSDs (if deployed by StreamBed)
- a Flink installation of 1 job manager and $T$ task managers
- a manager node whose goal is to host transversal modules such as the monitoring stack
- the K8s master node
- the control computer where this repository has been cloned (usually a laptop or a gateway VM)
For instance, a 4-node Kafka installation and a Flink installation of 1 job manager and 4 task managers require 11 nodes in total (including the manager node and the K8s master), plus the control machine.
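The node-count arithmetic can be made explicit with a small helper. This is only an illustration of the sizing rule above (the function name and defaults are our own, not part of StreamBed):

```python
def cluster_nodes(kafka: int, taskmanagers: int, managers: int = 1) -> int:
    """Total Kubernetes nodes needed, excluding the control machine:
    Kafka nodes + task managers + 1 job manager + manager node(s) + K8s master."""
    return kafka + taskmanagers + 1 + managers + 1

# Example: 4 Kafka nodes and 4 task managers -> 11 nodes.
print(cluster_nodes(4, 4))
```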
Software dependencies
A running Kubernetes cluster should be deployed on the nodes. We provide provisioning and installation scripts for Kind and Grid5000, but StreamBed can be used on any Kubernetes cluster once the dependencies have been installed.
The control machine should have a running installation of Jupyter, papermill, and Python 3.9+. Additional Python dependencies are installed automatically with pip in the provided notebooks. The cbc solver is installed automatically by the PuLP Python library. If the infrastructure scripts are used, Terraform v0.14.11 should be installed and a parameterized Grid5000 environment should be set up, or Kind 0.11.1 should be installed. Otherwise, the Kubernetes configuration file .kube/config should be present at its default location. The Kubernetes cluster (tested on Kubernetes v1.21 and v1.22) should have the following charts or operators installed:
- Prometheus: widely used monitoring framework
- Apache Zeppelin: notebooks for job or query initialization and execution. We expect jobs to be expressed as Zeppelin notebooks; other methods can easily be implemented.
- Spotify Flink-on-k8s operator: easy deployment of Apache Flink
- A working installation of Nginx Ingress, used for Apache Flink, Apache Zeppelin, and Apache Kafka
- Optional helper modules:
  - Local Path provisioner: storage solution if none is available; the node path used for storage should point to an SSD drive on the nodes, especially for the Kafka deployment
  - Strimzi Kafka operator: easy deployment of Kafka and topic management, if no existing Kafka installation is available
  - Grafana (Prometheus UI)
The installation of StreamBed's Python dependencies is straightforward:

```shell
cd streambed
pip install -r requirements.txt
```
Additional dependencies should be cloned, built, and placed accordingly in the ./tmp directory before the deployment:
- Rate-limited Kafka connector: this module allows StreamBed to control the current reading rate; the connector is used as the source of the job. Our scripts expect the build flink-sql-connector-kafka-ratelimit_2.12-1.14.2.jar to be present in the tmp directory.
- Optional: Nexmark benchmark: a Nexmark fork built against Apache Flink 1.14.2. Our scripts expect the build nexmark-flink-0.2-SNAPSHOT.jar to be present in the tmp directory.
We provide infrastructure deployment scripts in the infra directory for Kind and for the large-scale testbed Grid5000. They handle the provisioning of the infrastructure, the installation of all the dependencies described above, and the Nexmark Zeppelin notebooks.
On an already available Kubernetes installation, running the modules and setup scripts should be sufficient to deploy the dependencies.
Nodes should be labeled with a `tier` label set in the following way:
- `manager`: 1 (or more) node(s) for the deployment of Apache Zeppelin and Prometheus
- `jobmanager`: 1 node for Flink's job manager
- `kafka`: as many nodes as needed for the Kafka installation, if managed by StreamBed. Kafka nodes can be colocated with the job manager if set in the configuration file.
- `taskmanager`: as many nodes as needed for Flink's task managers
We provide ingresses for Apache Zeppelin, Kowl (Apache Kafka UI), Apache Flink, Prometheus, and Grafana. These are parameterized to use names of the form (module).127-0-0-1.sslip.io, for instance zeppelin.127-0-0-1.sslip.io. We then use port forwarding to forward the port of the manager node (80 by default with our deployment of Ingress Nginx) directly to the control machine. These values can be parameterized in infra/ingress-localhost.yaml and streambed/charts/flink/values.yaml.
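The sslip.io naming scheme can be illustrated with a small helper: any hostname of the form name.127-0-0-1.sslip.io resolves to 127.0.0.1, so the ingress names work on the control machine once port forwarding is in place. The function below is purely illustrative (its name is not part of StreamBed):

```python
def ingress_host(module: str, dashed_ip: str = "127-0-0-1") -> str:
    """Build the sslip.io hostname used by the provided ingresses.

    sslip.io resolves <anything>.<dashed-ip>.sslip.io to that IP, so
    zeppelin.127-0-0-1.sslip.io resolves to 127.0.0.1.
    """
    return f"{module}.{dashed_ip}.sslip.io"

print(ingress_host("zeppelin"))  # zeppelin.127-0-0-1.sslip.io
```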
Zeppelin execution notebooks should be prepared accordingly. The main goals of the Zeppelin notebooks are to:
- initialize the structure of the SQL tables in a customizable way
- set the parallelism to what is indicated by the CO module
- execute the target query
Moreover, using Zeppelin notebooks permits on-the-fly changes to the queries, for development purposes.
An example is provided for the Nexmark queries, but it can easily be generalized to other queries or regular Flink jobs. More details are available in the Zeppelin README.
Representative data should be initialized in Kafka: StreamBed will read the data directly from Kafka, at various rates, to estimate the maximum sustainable throughput (MST) rate (CE module), the parallelism of the tasks (CO module), and, iteratively, the model linking MST and configuration (RE module).
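The idea of probing Kafka at various rates to estimate the MST can be sketched as a bisection over injection rates. This is an illustrative, stdlib-only sketch, not the actual CE module: the function name, bounds, and the sustainability check are all assumptions; in practice the check would replay data at the given rate and inspect backpressure and consumer-lag metrics.

```python
def estimate_mst(is_sustainable, low: int, high: int, tol: int = 1_000) -> int:
    """Bisect over injection rates (events/s) to find, up to tolerance
    `tol`, the largest rate in [low, high] that the testbed sustains.
    `is_sustainable` is treated as a black box here."""
    while high - low > tol:
        mid = (low + high) // 2
        if is_sustainable(mid):
            low = mid   # rate sustained: search higher
        else:
            high = mid  # rate not sustained: search lower
    return low

# Toy example: pretend the testbed sustains up to 350,000 events/s.
mst = estimate_mst(lambda rate: rate <= 350_000, 0, 1_000_000)
```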
For the Nexmark benchmark, we provide an initialization Zeppelin notebook.
In the current state, the easiest way to use StreamBed is through our experiments Jupyter notebooks, as described in the next section. We will shortly merge and normalize StreamBed to permit usage without Jupyter notebooks.
Please see here for a detailed description of performing experiments with StreamBed.
Please see here for a detailed description of the StreamBed source code.
The DEBS24 experiments were launched on the large-scale testbed Grid5000. Note that a guest account for the Grid5000 platform, giving access to computational resources, can be arranged to reproduce our experiments. Please get in touch with us in this case.
Please see here for a detailed description of the scripts used for the DEBS24 experiments.
This project has been funded by the Walloon region (Belgium) through the Win2Wal project GEPICIAD.
StreamBed is released under the Apache License 2.0.