Skip to content

TawfikYasser/Keyed-Watermarks-in-Apache-Flink

Repository files navigation

Keyed Watermarks: A Fine-grained Watermark Generation for Apache Flink

Keyed Watermarks in Apache Flink Cluster Deployment using Docker

Vanilla Vs. Keyed WM

The repository is structured as follows:

  • Explanation for the idea of Keyed Watermarks.
  • How to set up and run the cluster to be able to run Apache Flink with Keyed Watermarks.
  • The repo. contains the jars, source java classes of Keyed Watermarks, pipeline to test with, docker file used to build the cluster, and instructions on how to use everything.

What is Keyed Watermarks

Big Data Stream processing engines, exemplified by tools like Apache Flink, employ windowing techniques to manage unbounded streams of events. The aggregation of relevant data within Windows holds utmost importance for event-time windowing due to its impact on result accuracy. A pivotal role in this process is attributed to watermarks, unique timestamps signifying event progression in time. Nonetheless, the existing watermark generation method within Apache Flink, operating at the input stream level, exhibits a bias towards faster sub-streams, causing the omission of events from slower counterparts. Through our analysis, we determined that Apache Flink's standard watermark generation approach results in an approximate $33%$ data loss when $50%$ of median-proximate keys experience delays. Furthermore, this loss exceeds $37%$ in cases where $50%$ of randomly selected keys encounter delays. We introduce a pioneering approach termed keyed watermarks aimed at addressing data loss concerns and enhancing data processing precision to a minimum of $99%$ in the majority of scenarios. Our strategy facilitates distinct progress monitoring through the creation of individualized watermarks for each logical sub-stream (key).

Experimental Setup

Set up an Apache Flink Cluster using Docker

  • Clone this repo. using the following command: git clone https://github.com/TawfikYasser/kw-flink-cluster-docker.git.
  • Download the build-target folder from this link and put it in the same directory of the repo.
  • Clone this repo. and in the same directory run the following command: docker build -t <put-your-docker-image-name-here> ., a docker image will be created.
  • Then we need to create 3 docker containers, one for the JobManager, and two for the TaskManagers.
    • To create the JobManager container run the following command: docker run -it --name <put-your-docker-container-name-here> -p 8081:8081 --network bridge <put-your-docker-image-name-here>:latest, we're exposing the 8081 port in order to be able to access the Apache Flink Web UI from outside the containers, also we're attaching the container to the bridge network.
    • To create the TaskManagers run the following two commands: docker run -it --name <put-your-docker-container-name-here> --network bridge <put-your-docker-image-name-here>:latest, docker run -it --name <put-your-docker-container-name-here> --network bridge <put-your-docker-image-name-here>:latest.
  • Now, you have to configure the masters, workers, and flink-config.yml files on each container as follows:
  • Start the containers, and start the ssh service on the containers of taskManagers using: service ssh start.

Ready!

  • Now, you're ready to start the cluster, go to /home/flink/bin/ inside the container containing the jobManager and run the cluster using: ./start-cluster.sh.
  • Run a Flink job inside /home/flink/bin/ using the following command: ./flink run <pipeline-jar-path>. (-p optional if you want to override the default parallelism in flink-config.yml)
  • Open the Web UI of Apache Flink using: http://localhost:8081. (If you're running the containers on a VM use the VM's External IP, otherwise use your local machine's IP)

Flink Pipeline w/ Keyed Watermarks

  • In the following java file, use it as a starting point.

Citation

@INPROCEEDINGS{10296717,
  author={Yasser, Tawfik and Arafa, Tamer and El-Helw, Mohamed and Awad, Ahmed},
  booktitle={2023 5th Novel Intelligent and Leading Emerging Sciences Conference (NILES)}, 
  title={Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink}, 
  year={2023},
  volume={},
  number={},
  pages={23-28},
  doi={10.1109/NILES59815.2023.10296717}}

IMPORTANT: Code Base of Keyed Watermarks & Ready to Run Flink Cluster


Tawfik Yasser Contact Info.: LinkedIn, Email, and Resume.