
StreamBed: capacity planning for stream processing

This repository contains instructions for reproducing the experiments in our DEBS'24 submission "StreamBed: capacity planning for stream processing".

StreamBed is a capacity planning system for stream processing. It predicts, ahead of any production deployment, the resources that a query will require to process an incoming data rate sustainably, and the appropriate configuration of these resources. StreamBed builds a capacity planning model by piloting a series of runs of the target query in a small-scale, controlled testbed. We implement StreamBed for the popular Flink DSP engine. Our evaluation with large-scale queries of the Nexmark benchmark demonstrates that StreamBed can effectively and accurately predict capacity requirements for jobs spanning more than 1,000 cores using a testbed of only 48 cores.

Authors

Guillaume Rosinosky - Donatien Schmitz - Etienne Rivière

Contact

Table of Contents

Project Structure

Available directories:

  • infra: infrastructure setup
  • streambed: Streambed source code
  • experiments: Instructions on performing experiments with StreamBed and DEBS results
  • common: shared scripts for notebooks

Description & Requirements

How to access

StreamBed can be downloaded from the following link: https://github.com/CloudLargeScale-UCLouvain/StreamBed

Hardware dependencies

StreamBed assumes a working Kubernetes installation with an Apache Kafka cluster (which can be deployed by StreamBed's scripts), and will iteratively deploy and test various Flink configurations.

The Kubernetes cluster should be dimensioned accordingly: sufficient resources should be available for the Flink Job Manager, Task Managers, and Kafka nodes (if managed by StreamBed). A node can be a bare-metal machine or a small VM:

  • a Kafka installation of $K$ nodes preferably with fast SSDs (if deployed by Streambed)
  • a Flink installation of 1 job manager and $T$ task managers
  • a manager node hosting cross-cutting modules such as the monitoring stack
  • the K8S master node
  • the control computer where this repository has been cloned (usually a laptop or a gateway VM)

For instance, with a 4-node Kafka installation and a 4-node Flink installation, $K$ + $T$ + 1 + 1 + 1 = 11 nodes are needed, in addition to the control node.
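The sizing rule above can be checked with a quick arithmetic sketch; the values of $K$ and $T$ are the example's assumptions:

```shell
# Node-count arithmetic for the example above:
# K Kafka nodes + T task managers + 1 job manager + 1 manager node + 1 K8S master.
K=4   # Kafka nodes (example value)
T=4   # Flink task managers (example value)
TOTAL=$((K + T + 1 + 1 + 1))
echo "total cluster nodes (excluding the control node): $TOTAL"
```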

Software dependencies

A running Kubernetes cluster should be deployed on the nodes. We provide provisioning and installation scripts for Kind and Grid5000, but StreamBed can be used on any Kubernetes cluster once the dependencies have been installed.

The control machine should have a running installation of Jupyter, papermill, and Python 3.9+. Additional Python dependencies are installed automatically with pip in the provided notebooks. The cbc solver is installed automatically by the PuLP Python library. If the infrastructure scripts are used, Terraform v0.14.11 should be installed and a parameterized Grid5000 environment should be set up, or Kind 0.11.1 should be installed. Otherwise, the Kubernetes configuration file .kube/config should be placed at its default location. The Kubernetes cluster (tested on Kubernetes v1.21 and v1.22) should have the following charts or operators installed:

  • Prometheus: well-known monitoring framework

  • Apache Zeppelin: notebooks for job or query initialization and execution. We expect the jobs to be expressed as Zeppelin notebooks. Other methods can easily be implemented.

  • Spotify Flink-on-k8s operator: easy deployment of Apache Flink

  • Working installation of Nginx Ingress, used for Apache Flink, Apache Zeppelin, and Apache Kafka

  • Optional helper modules:
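Once these charts and operators are deployed, their presence can be verified with a small sketch; the grep patterns below are assumptions and may need adjusting to your actual release names (assumes kubectl is configured for the target cluster):

```shell
# Sanity-check that the required charts/operators appear to be running.
# The component name patterns are assumptions; adjust to your release names.
command -v kubectl >/dev/null 2>&1 || { echo "kubectl not found, skipping check"; exit 0; }
PODS=$(kubectl get pods --all-namespaces 2>/dev/null)
for component in prometheus zeppelin flink ingress-nginx; do
  echo "$PODS" | grep -qi "$component" \
    && echo "$component: found" \
    || echo "$component: NOT found"
done
```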

Installing and Running StreamBed

Install dependencies

Installing StreamBed's Python dependencies is straightforward:

cd streambed
pip install -r requirements.txt

Additional dependencies

Additional dependencies should be cloned, built, and placed in the ./tmp directory before deployment:

  • Rate-limited Kafka connector: this module lets StreamBed control the current read rate. The connector is used as the source in the job. Our scripts expect the built jar flink-sql-connector-kafka-ratelimit_2.12-1.14.2.jar to be present in the tmp directory.
  • Optional: Nexmark benchmark: a Nexmark fork pinned to Apache Flink 1.14.2. Our scripts expect the built jar nexmark-flink-0.2-SNAPSHOT.jar to be present in the tmp directory.
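A small check (a sketch, assuming the repository root as working directory) can confirm that the expected jars are in place before deployment:

```shell
# Verify that the expected build artifacts are present in ./tmp.
MISSING=0
for jar in flink-sql-connector-kafka-ratelimit_2.12-1.14.2.jar \
           nexmark-flink-0.2-SNAPSHOT.jar; do
  if [ -f "tmp/$jar" ]; then
    echo "found:   tmp/$jar"
  else
    echo "missing: tmp/$jar"
    MISSING=$((MISSING + 1))
  fi
done
echo "$MISSING jar(s) missing"
```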

We provide infrastructure deployment scripts in the infra directory for deployment on Kind and on the large-scale testbed Grid5000. These handle the provisioning of the infrastructure, the installation of all dependencies described above, and the deployment of the Nexmark Zeppelin notebooks.

On an already available Kubernetes installation, using the module and setup scripts should be sufficient to deploy the dependencies.

⚠️ Caution: These scripts expect the cluster nodes to carry the label tier set in the following way:

  • manager: 1 (or more) node(s) for the deployment of Apache Zeppelin, Prometheus
  • jobmanager: 1 node for Flink's job manager
  • kafka: as many nodes as needed for the Kafka installation, if managed by Streambed. Kafka nodes can be colocated with the job manager if set in the configuration file.
  • taskmanager: as many nodes as needed for Flink's task managers
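For illustration, the labels can be applied with kubectl; the node names below are hypothetical placeholders for your actual cluster nodes:

```shell
# Hypothetical node names -- replace with the real names from `kubectl get nodes`.
command -v kubectl >/dev/null 2>&1 || { echo "kubectl not found, skipping"; exit 0; }
kubectl label node node-manager    tier=manager
kubectl label node node-jobmanager tier=jobmanager
kubectl label node node-kafka-1 node-kafka-2 tier=kafka
kubectl label node node-tm-1 node-tm-2 node-tm-3 tier=taskmanager
```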

Ingresses

We provide ingresses for Apache Zeppelin, Kowl (Apache Kafka UI), Apache Flink, Prometheus, and Grafana. These are parameterized to use the name (module).127-0-0-1.sslip.io, for instance zeppelin.127-0-0-1.sslip.io. We then use port forwarding to forward the manager node's port (80 by default with our deployment of Ingress Nginx) directly to the control machine. These values can be parameterized in infra/ingress-localhost.yaml and streambed/charts/flink/values.yaml.
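The port forwarding described above can be set up along these lines; the namespace and service name are assumptions based on a standard Ingress Nginx install and may differ in your deployment:

```shell
# Forward the ingress controller's HTTP port to the control machine, so that
# zeppelin.127-0-0-1.sslip.io and the other ingress names resolve locally.
# Binding local port 80 may require elevated privileges; use e.g. 8080:80 otherwise.
command -v kubectl >/dev/null 2>&1 || { echo "kubectl not found, skipping"; exit 0; }
kubectl port-forward -n ingress-nginx svc/ingress-nginx-controller 80:80
```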

Setup StreamBed

Zeppelin notebooks

Zeppelin execution notebooks should be prepared accordingly. Their main goals are to:

  • initialize the structure of the SQL tables in a customizable way
  • set the parallelism to what is indicated by the CO module
  • execute the target query

Moreover, using Zeppelin notebooks permits on-the-fly changes to the queries during development.

An example is provided for the Nexmark queries; it can easily be generalized to other queries or regular Flink jobs. More details are in the Zeppelin README.

Data initialization

Representative data should be initialized in Kafka: StreamBed reads the data directly from Kafka at various rates to estimate the MST rate (CE module), the parallelism of the tasks (CO module), and, iteratively, the model linking MST and configuration (RE module).

For the Nexmark benchmark, we provide an initialization Zeppelin notebook.

Configuration

In its current state, the easiest way to use StreamBed is through our experiments Jupyter notebooks, as described in the next section. We will shortly merge and normalize StreamBed to permit usage without Jupyter notebooks.

Experiments with StreamBed

Please see here for a detailed description of performing experiments with StreamBed.

StreamBed Source Code

Please see here for a detailed description of the StreamBed source code.

Reproducing the DEBS24 results

DEBS24 experiments have been launched on the large-scale testbed Grid5000. Note that a guest account on the Grid5000 platform, granting access to computational resources, can be arranged in order to reproduce our experiments. Please get in touch with us in that case.

Please see here for a detailed description of the scripts used for the DEBS24 experiments.

Funding

This project has been funded by the Walloon region (Belgium) through the Win2Wal project GEPICIAD.

License

StreamBed is released under the Apache License 2.0.
