diff --git a/_posts/2024-07-30-Unlocking-the-Potential-of-Local-Kafka.md b/_posts/2024-07-30-Unlocking-the-Potential-of-Local-Kafka.md index 6b9371cc2b..ee9e7a36a7 100644 --- a/_posts/2024-07-30-Unlocking-the-Potential-of-Local-Kafka.md +++ b/_posts/2024-07-30-Unlocking-the-Potential-of-Local-Kafka.md @@ -1,7 +1,7 @@ --- layout: post title: "Unlocking The Potential of Local Kafka" -categories: [Apache Kafka, Local Kafka, Docker Compose, Ansible ] +categories: [Apache Kafka, Docker Compose, Ansible ] teaser: Unlock the full potential of local Kafka setups for rapid development, testing, and seamless management and discover the best method for your local kafka setup. authors: Subramanya featured: false diff --git a/_posts/2024-07-31-Learning-by-Debugging-Kafka-Local-Setup.md b/_posts/2024-07-31-Learning-by-Debugging-Kafka-Local-Setup.md index 3ba7b610f2..a245dc0f78 100644 --- a/_posts/2024-07-31-Learning-by-Debugging-Kafka-Local-Setup.md +++ b/_posts/2024-07-31-Learning-by-Debugging-Kafka-Local-Setup.md @@ -1,7 +1,7 @@ --- layout: post title: "Learning by Debugging: Kafka Local Setup" -categories: [Apache Kafka, Local Kafka, Docker Compose, Ansible, Kafka Troubleshooting ] +categories: [Apache Kafka, Docker Compose, Kafka Troubleshooting ] teaser: Discover how to set up and troubleshoot a local Kafka environment using Docker Compose and Ansible, gain hands-on experience and debugging skills in a safe, isolated setting. Dive into real-world scenarios and enhance your Kafka expertise. authors: Subramanya featured: false diff --git a/_posts/2024-08-13-Understanding-Temporal-Interceptors.md b/_posts/2024-08-13-Understanding-Temporal-Interceptors.md index cdaca53301..d938edd6e7 100644 --- a/_posts/2024-08-13-Understanding-Temporal-Interceptors.md +++ b/_posts/2024-08-13-Understanding-Temporal-Interceptors.md @@ -1,7 +1,7 @@ --- layout: post title: "Understanding Temporal Interceptors: Enhancing Workflow Management" -categories: [Temporal, Temporal Interceptors, Pyhton, Workflow Management ] +categories: [Temporal, Temporal Interceptors] teaser: Boost Your Temporal Workflows with Interceptors - Dive into how you can leverage Temporal Interceptors to inject custom logic into your workflows and activities, enhancing control flow, and adding a new layer of flexibility to your distributed systems. authors: Subramanya featured: false diff --git a/_posts/2024-08-21-Introduction-to-Temporal.md b/_posts/2024-08-21-Introduction-to-Temporal.md new file mode 100644 index 0000000000..8772dc64fa --- /dev/null +++ b/_posts/2024-08-21-Introduction-to-Temporal.md @@ -0,0 +1,289 @@ +--- +layout: post +title: "Introduction to Temporal" +categories: [Temporal, Temporal, Pyhton, Workflow Management, Durable Systems ] +teaser: Discover Temporal, a game-changer in building reliable distributed systems. Learn how its durable functions simplify complex systems and guarantee the reliability of your business logic +authors: Subramanya +featured: false +hidden: false +image: assets/blog-images/Introduction-to-temporal/temporal_logo.png +toc: true +--- + +# What is Temporal? + +Temporal is a durable execution platform that empowers developers to build scalable applications without compromising productivity or reliability. One of the core challenges in building distributed applications is managing complexities like handling failures, ensuring consistency, and maintaining visibility into the application state. 
Temporal addresses these challenges by providing powerful constructs that simplify error handling and retries, making it easier to build reliable systems. Additionally, Temporal's tools, such as its Web UI and CLI, offer developers robust visualization and control over workflow execution. + + +# Why Use Temporal? + +Temporal abstracts the complexities of constructing scalable, reliable distributed systems by maintaining complete application state, enabling seamless migration of execution to another machine in the event of a host or software failure. It allows you to code resilience directly into your applications, removing the need for intricate failure and error handling logic, which accelerates development and simplifies infrastructure. + +Temporal eliminates the need for traditional components like queues, pub/sub systems, and schedulers, which are often used to manage complex workflows in distributed systems. Instead, Temporal inherently manages task execution, event handling, and timing, significantly reducing architectural complexity while enhancing the reliability and scalability of your applications. + + +# High-level architecture + +Temporal's architecture is designed to separate the application logic from the underlying execution infrastructure, providing a clear distinction between user-hosted processes and Temporal cluster processes. + +To fully understand how Temporal's architecture operates, it’s important to first grasp the key components that make up the system. These components—Workflows, Activities, Workers, and the Temporal Server—each play a distinct role in how your application logic is executed and managed. + +**Workflows** represent the heart of your application logic, defining the sequence of steps your application will take. Written using Temporal SDKs in the programming language of your choice, Workflows are designed to be deterministic, ensuring consistent outcomes given the same inputs. They are executed on your infrastructure, not directly on the Temporal Server. + +**Activities** are discrete units of business logic within your application, designed to perform specific tasks that may be prone to failure, such as interacting with external services or sending emails. Like Workflows, Activities run on your infrastructure and are managed by Temporal to ensure reliability. + +**Workers** are processes that execute your Workflows and Activities. They act as the bridge between your application logic and the Temporal Server, ensuring that your tasks are performed reliably and in the correct order. Workers constantly communicate with the Temporal Server to coordinate the execution of tasks. + +**Temporal SDKs** (Software Development Kits) are open-source collections of tools, libraries, and APIs that enable Temporal Application development. They offer a Temporal Client to interact with the Temporal Service, APIs to develop your Temporal Application, and APIs to run horizontally scalable Workers. Temporal SDKs are available in multiple languages including Go, Java, PHP, Python, TypeScript, and .NET + +The Temporal Server is the backbone of the Temporal system. It acts as the supervisor for your application, maintaining the source of truth for Workflow Execution Event Histories and ensuring that your distributed applications execute durably and consistently. This server can be hosted by you in a self-hosted setup or managed by Temporal through Temporal Cloud, a SaaS platform that abstracts the complexities of server management. 
+ + + + + +**User-hosted processes** + + + +* User-hosted processes in the Temporal system refer to the Worker processes that execute the Workflow and Activity code of your application. These Worker processes are hosted and operated by you, the user, and they use one of the Temporal SDKs to communicate with the Temporal Server. +* The Worker processes are responsible for executing the Workflow and Activity code segregated from your application and communicating with the Temporal Server, The Worker processes continuously poll the Temporal Server for Workflow and Activity tasks, and on completion of each task, they send information back to the server. +* The main client of the User-hosted processes is the Temporal Server. The Worker processes communicate with the Temporal Server to receive tasks and update the server on task completion. + +**Temporal Cluster** + + +* The Temporal Cluster is a group of services that together act as a component of the Temporal Platform. It includes the Temporal Server, combined with Visibility stores. One of the key components of the Temporal Server is the Frontend Service. The Frontend Service is a stateless gateway service that exposes a strongly typed API using Protocol Buffers (Protobuf). It is responsible for rate limiting, authorizing, validating, and routing all inbound calls, including those from the Worker processes. +* The Temporal Cluster is responsible for maintaining the state and progress of your workflows. It ensures that your distributed applications execute durably and consistently. +* The History Service within the Temporal Cluster manages individual Workflow Executions, handling RPCs from the User Application and the Temporal Worker, driving the Workflow Execution to completion, and storing all state required for durable execution. + The Matching Service within the Temporal Cluster manages the Task Queues being polled by Temporal Worker processes, with a single task queue holding tasks for multiple Workflow Executions. +* The main clients of the Temporal Cluster are the User Applications and the Temporal Workers. The User Applications use one of the Temporal SDKs to communicate with the Temporal Server to start/cancel workflows, and interact with running workflows. The Temporal Workers execute the Workflow and Activity code and communicate with the Temporal Server. +* The User Applications and Temporal Workers communicate with the Temporal Server (part of the Temporal Cluster) using the Temporal SDKs. This communication is done via gRPC, a high-performance, open-source framework for RPC communication. The Worker processes continuously poll the Temporal Server for tasks and send updates back to the server upon task completion. + +Your application's logic runs in Workers on your infrastructure, while the Temporal cluster orchestrates and manages these Workers, ensuring everything runs smoothly and reliably. With Temporal Cloud, you can offload the management of the cluster to a hosted service, while self-hosting allows you to have full control over the infrastructure. In short, Temporal is a platform designed to ensure the durable execution of your application code. Temporal handles these challenges for you, enabling your application to run reliably and consistently even in the face of unexpected issues. + + +# What does Durable execution mean? + +Durable Execution in the context of Temporal refers to the ability of a Workflow Execution to maintain its state and progress even in the face of failures, crashes, or server outages. 
This is achieved through Temporal's use of an Event History, which records the state of a Workflow Execution at each step. If a failure occurs, the Workflow Execution can resume from the last recorded event, ensuring that progress isn't lost. This durability is a key feature of Temporal Workflow Executions, making them reliable and resilient. It enables application code to execute effectively once and to completion, regardless of whether it takes seconds or years. + +In other words, Durable Execution guarantees that the main function of your application and all the steps defined in it will effectively execute just once and to completion, even in the face of failures + + +# How does Temporal achieve Durable Execution? + +Temporal achieves durable execution of your application code through a combination of Workflows, Activities, and an underlying Event History mechanism. + +Temporal automatically retries failed Activities based on the Retry Policy configured for that Activity. This allows Temporal to handle transient failures and continue the Workflow Execution once the issue is resolved. You can configure the Retry Policy to fit your needs, specifying things like the maximum number of attempts, the initial interval between retries, and the maximum interval between retries. + +The Event History is a key feature of Temporal that records the state of a Workflow Execution at each step. If a failure occurs, the Workflow Execution can resume from the last recorded event, ensuring that progress isn't lost. This durability is a key feature of Temporal Workflow Executions, making them reliable and resilient. + +Timeouts in Temporal are used to detect application failures. They are configurable for both Workflows and Activities, allowing you to control the maximum duration of different aspects of a Workflow or Activity Execution. There are several types of timeouts that you can set: + + + +* Workflow Timeouts: Each Workflow timeout controls the maximum duration of a different aspect of a Workflow Execution. +* Activity Timeouts: Each Activity timeout controls the maximum duration of a different aspect of an Activity Execution. +* Heartbeat Timeouts: A Heartbeat Timeout works in conjunction with Activity Heartbeats. An Activity Heartbeat is a ping from the Worker that is executing the Activity to the Temporal Service. If the Temporal Service does not receive a heartbeat from an Activity within the heartbeat timeout interval, it will consider the Activity as failed. + +Here's an example of how Temporal handles a server crash: + + + +1. A Workflow is running and executing Activities. +2. The server running the Temporal Cluster crashes. +3. After the Temporal Server (which could be the entire Temporal Cluster if it's a single-server setup) has stopped due to the crash, it is restarted. This could be a manual restart by an operator or an automatic restart by a system supervisor. In a multi-server setup, If one server crashes, the remaining servers in the cluster can continue to operate and maintain the state of the workflows. +4. The Workflow is still listed and running. It picks up where it left off when the server comes back online. Because when a Temporal server crashes, the state of the workflow is preserved in its history. If a worker is unable to communicate with the server, the workflow doesn't stop. Temporal times out the workflow task and places it back on the task queue. 
Another worker can then pick up the task and continue the execution by replaying the workflow history to recreate the workflow state. + + +# Developing Workflows and Activities + +Let us take an example and look into developing Workflows and Activities using the Temporal Python SDK. + +**Developing a Workflow** + +In Temporal, workflows are defined as classes. The @workflow.defn decorator is used to identify a workflow, and the @workflow.run decorator is used to mark the entry point method to be invoked. Here's an example of a workflow definition: + +```python +@workflow.defn +class GreetSomeone: + @workflow.run + async def run(self, name: str) -> str: + greeting = await workflow.execute_activity( + TranslateActivities.say_hello, + name, + start_to_close_timeout=timedelta(seconds=5), + ) + return f"{greeting}" +``` + +In this example, when the run method of the GreetSomeone workflow is invoked with a name argument, it calls the say_hello activity with the same argument. + +**Developing Activities** + +Activities are where you execute non-deterministic code or perform operations that may fail. In the Temporal Python SDK, you define an Activity by decorating a function with @activity.defn. Here's an example of an activity definition: + +```python +import urllib.parse +from temporalio import activity +class TranslateActivities: + def __init__(self, session): + self.session = session + @activity.defn + async def say_hello(self, name: str) -> str: + greeting = await self.call_service("get-spanish-greeting", name) + return greeting + + # Utility method for making calls to the microservices + async def call_service(self, stem: str, name: str) -> str: + base = f"http://localhost:9999/{stem}" + url = f"{base}?name={urllib.parse.quote(name)}" + async with self.session.get(url) as response: + response.raise_for_status() + return await response.text() +``` + +In this example, the say_hello activity calls the call_service method with the stem argument as "get-spanish-greeting" and the name argument. The call_service method constructs a URL using the stem and name arguments and makes a GET request to this URL using the aiohttp.ClientSession instance stored in self.session. The call_service method returns the text of the response, which is then returned by the say_hello activity. + + +# Understanding Workflow Execution in Temporal + +When working with Temporal, it's essential to grasp how Workflow Execution unfolds. Let’s dive into the process, from launching Workers to handling failures and retries, to ensure you understand each step of Workflow Execution. + +**Starting with the Worker** + +The Workflow Execution hinges on the Worker, which runs your Workflow and Activity code. To begin, you need to launch the Worker, which involves starting a new process. In Python, this typically involves defining a main function that initializes a Temporal client, creates a Worker entity, and starts the Worker. + +Here's a high-level overview of the process: + + + +1. **Creating the Temporal Client:** This client interacts with the Temporal service and manages Workflow and Activity requests. +2. **Setting Up the Worker:** The Worker is created with the Temporal client, a task queue name, and the registered Workflows and Activities. +3. **Polling for Tasks:** Once the Worker is running, it maintains a long-lasting connection to the Temporal Cluster, polling for new tasks. At this point, the task queue is empty, and the Worker is idle. 
+ +In Python, you typically start a Worker process by defining a main function that initializes a Temporal client, creates a Worker entity, and starts the Worker. Here's a simplified example: + +```python +import asyncio +import aiohttp +from temporalio.client import Client +from temporalio.worker import Worker +from translate import TranslateActivities +from greeting import GreetSomeone + +async def main(): + client = await Client.connect("localhost:7233", namespace="default") + + # Run the worker + async with aiohttp.ClientSession() as session: + activities = TranslateActivities(session) + worker = Worker( + client, + task_queue="greeting-tasks", + workflows=[GreetSomeone], + activities=[activities.greet_in_spanish], + ) + + print("Starting the worker....") + await worker.run() + +if __name__ == "__main__": + asyncio.run(main()) +``` + +In this example, the Worker is set up to listen to the "GREETING_TASKS" task queue and is registered with the GreetSomeone Workflow and TranslateActivities Activities. + +**Starting the Workflow** + +You can initiate a Workflow either using the Temporal command-line tool or programmatically through your application code. In both methods, you provide: + + + +* **Task Queue Name** +* **Workflow ID** +* **Workflow Type** +* **Input Data** + +Regardless of the starting method, the Temporal Cluster records a WorkflowExecutionStarted event, marking the beginning of the Workflow Execution. + +Here’s an example for starting a workflow using the python sdk, let us take the workflow described in the Developing a Workflow section as the workflow to start. + +```python +import asyncio +import sys +from greeting import GreetSomeone +from temporalio.client import Client + +async def main(): + # Create client connected to server at the given address + client = await Client.connect("localhost:7233") + + # Execute a workflow + handle = await client.start_workflow( + GreetSomeone.run, + sys.argv[1], + id="greeting-workflow", + task_queue="greeting-tasks", + ) + + print(f"Started workflow. Workflow ID: {handle.id}, RunID {handle.result_run_id}") + result = await handle.result() + print(f"Result: {result}") + +if __name__ == "__main__": + asyncio.run(main()) +``` + + +**Workflow and Activity Tasks** + +As the Workflow Execution progresses, the Temporal Cluster schedules and records various events: + + + +1. **Workflow Task Scheduled:** The Cluster schedules a new Workflow Task and logs an event. +2. **Workflow Task Started:** The Worker accepts the task and starts executing the Workflow code. + +In the Python SDK, Activity options like start_to_close_timeout are specified as keyword arguments. The Workflow method calls execute_activity_method to invoke an Activity. The Worker must wait for the Activity to complete before proceeding. + +**Handling Activity Execution** + +During Activity execution: + + + +1. **Activity Task Scheduled:** The Worker notifies the Cluster that the Workflow Task is complete and requests an Activity Task. +2. **Activity Task Started:** The Cluster schedules and records a new Activity Task. + +The Worker then runs the code for the Activity, such as calling a microservice to get a response. + +**Failure Scenarios and Retries** + +In the event of a Worker failure, such as a crash due to memory issues, you can recover by restarting the Worker or launching a new one. Temporal will automatically recreate the Workflow state up to the failure point, ensuring continuity. Completed Activities are not re-executed; Temporal reuses their previous results. 
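Activity-level resilience comes from the Retry Policy discussed earlier. Below is a minimal sketch of how such a policy might be attached to the earlier GreetSomeone Workflow using the Python SDK; the interval, attempt, and timeout values are illustrative assumptions rather than recommendations:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Assumes the TranslateActivities class defined earlier in this post.
from translate import TranslateActivities


@workflow.defn
class GreetSomeone:
    @workflow.run
    async def run(self, name: str) -> str:
        # Illustrative values: tune these for your own workload.
        retry_policy = RetryPolicy(
            initial_interval=timedelta(seconds=1),   # wait before the first retry
            backoff_coefficient=2.0,                 # double the wait on each attempt
            maximum_interval=timedelta(seconds=30),  # cap the backoff
            maximum_attempts=5,                      # then fail the Activity for good
        )
        greeting = await workflow.execute_activity(
            TranslateActivities.say_hello,
            name,
            start_to_close_timeout=timedelta(seconds=5),
            retry_policy=retry_policy,
        )
        return greeting
```

A Retry Policy can also list non-retryable error types, so that permanent failures are surfaced immediately instead of being retried.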
For scenarios where an Activity fails (for example, because a microservice is offline), Temporal automatically retries it according to the configured Retry Policy. Once the microservice is back online, the Activity eventually succeeds, and Temporal logs the corresponding events.

**Completing the Workflow**

As the Workflow code progresses and all Activities complete, the Cluster records the final WorkflowTaskCompleted event. The Workflow Execution concludes, and the client application receives the result.

# Use Cases

Netflix uses Temporal to manage complex workflows for its content delivery and recommendation systems. Temporal helps Netflix orchestrate the various microservices involved in content ingestion, metadata management, and user personalization, ensuring reliability and scalability.

DoorDash employs Temporal to orchestrate its delivery and logistics operations. Temporal helps manage workflows for order processing, delivery routing, and real-time tracking, improving operational efficiency and reliability.

Box uses Temporal to handle file synchronization and collaborative workflows. Temporal enables Box to manage file uploads, version control, and collaborative editing, providing a scalable and fault-tolerant foundation for its document management system.

At Platformatory, we use Temporal to orchestrate workflows in our Eventception platform. Eventception transforms traditional APIs into an event-driven architecture, enabling seamless integration and scalability without code changes. Temporal is used to provision, update, or delete resources based on user actions on the platform, which typically involves running multiple steps that must be orchestrated toward the end goal. Each user action involving CUD (Create, Update, and Delete) operations is translated into one or more Temporal workflows. The workflows are triggered in response to a Kafka message, making the orchestration event-driven. Temporal allows us to manage complex, long-running processes efficiently, with automatic retries, state management, and precise execution tracking, all of which are crucial for maintaining the robustness and reliability of our platform. To learn more about Eventception, check out the video [EventCeption - Reflexive Eventing for Data-driven Application](https://www.youtube.com/watch?v=3NL2ctqglfo) on YouTube.

We also use Temporal Interceptors in Eventception to automatically manage workflow statuses, ensuring consistency and reducing repetitive code across the platform's complex processes. This streamlined approach enhances operational visibility and reliability. To learn more about Temporal Interceptors, check out our blog post [Understanding Temporal Interceptors: Enhancing Workflow Management](https://platformatory.io/blog/Understanding-Temporal-Interceptors/).

# Conclusion

Temporal stands out as a powerful tool for managing durable executions in distributed systems. Its ability to simplify development and infrastructure while enhancing reliability makes it an invaluable asset for organizations looking to scale and innovate.
+ +Temporal also offers various amounts of courses to learn from, start your Temporal journey from the [Temporal 101](https://learn.temporal.io/courses/temporal_101/) course to get a hold of the basics of Temporal, the course gets you through all the basic concepts as well as how you could set up Temporal in your Local environment. diff --git a/_posts/2024-09-05-How-to-migrate-kafka-clusters-without-downtime.md b/_posts/2024-09-05-How-to-migrate-kafka-clusters-without-downtime.md new file mode 100644 index 0000000000..90da78fc6c --- /dev/null +++ b/_posts/2024-09-05-How-to-migrate-kafka-clusters-without-downtime.md @@ -0,0 +1,241 @@ +--- +layout: post +title: "How to migrate Kafka clusters without downtime" +categories: [Kafka cluster migrattion] +teaser: Migrating Kafka clusters can be challenging, but with the right approach, you can minimize downtime and ensure a smooth transition. Explore our guide for practical steps and strategies to help make your migration process seamless. +authors: Subramanya +featured: false +hidden: false +image: /assets/blog-images/How-to-migrate-kafka-clusters-without-downtime/title.webhp +toc: true +--- + + +Migrating Kafka clusters can be a daunting task, especially when minimizing downtime is a priority. However, with careful planning and the right tools, it's possible to execute a seamless migration. This guide will walk you through the process step by step, ensuring a smooth transition to your new Kafka environment. + + +# Understand the Reasons for Migration + +Before diving into the migration process, it's essential to understand why you're migrating. Common reasons include: + + + +* Upgrading to a newer Kafka version or a managed service (e.g., Confluent Cloud, AWS MSK). +* Rearchitecting the cluster to improve performance, scalability, or fault tolerance. +* Moving to a new environment, such as transitioning from on-premises to the cloud. + +Organizations might have a specific need for migrating other than these mentioned reasons. Knowing the motivation behind your migration will guide your planning and help you make informed decisions throughout the process. + + +# Plan Your Migration Strategy + +A successful migration starts with a solid plan. Key factors to consider include: + + + +* **Cluster Configuration**: Document the current Kafka cluster's configuration, including topics, partitions, replication factors, and consumer groups. Decide if you’ll replicate the existing setup or make improvements. +* **Data Migration**: Determine how you'll move data from the old cluster to the new one. Consider using tools like Apache Kafka's MirrorMaker 2 or Confluent Replicator, which can synchronize data between clusters. +* **Consumer Offset Management**: Plan how to manage consumer offsets to ensure that your consumers resume processing at the correct point in the new cluster. +* **Downtime Tolerance**: Assess your application’s tolerance for downtime. The goal is to minimize it, but understanding the acceptable limits will help in planning. + + +# Define Migration Scope + + + +## Assess Your Existing Cluster + +A thorough understanding of your existing Kafka cluster deployment is essential for ensuring a smooth transition to the new Kafka environment. This assessment should include: + + + +* **Cluster Configurations**: Review the current configurations of your Kafka cluster to understand how they might need to change in the new environment. 
+* **Topic Topology and Data Model**: Analyze the structure of your topics, partitions, and data flow to ensure compatibility and performance in the new cluster. +* **Kafka Client Configurations and Performance**: Evaluate how your Kafka clients are configured and their performance metrics. Understanding these will help in replicating or improving them in the new setup. +* **Benchmarking**: Identify key benchmarks that can be used to validate the performance and functionality of workloads once they are migrated to the new cluster. This includes assessing latency requirements and understanding how latency is measured, so you can effectively evaluate it during benchmarking in the new environment. +* **Dependency Planning**: Consider all applications that connect to your existing Kafka cluster and understand how they depend on each other. This ensures that all relevant applications are targeted during the migration, and necessary changes are made. + + +## Determine Data Replication Requirements + +Kafka migrations can generally be categorized into two scenarios: migrating without data and migrating with data. Each scenario requires different approaches and considerations. + +**Migrating Without Data Migration** + +In cases where you are not required to move existing data, the migration process focuses on shifting producers and consumers to a new Kafka cluster without interrupting the data flow. + +**Steps for Migration Without Data:** + + + +* **Prepare the New Cluster:** + * Set up the new Kafka cluster with configurations that match or exceed the capabilities of the existing cluster. + * Ensure the new cluster is fully tested in a non-production environment to avoid unexpected issues during the actual migration. +* **Dual Writing (Shadowing):** + * Implement dual writing, where producers send data to both the old and new clusters simultaneously. This ensures that the new cluster starts receiving data without disrupting the current flow. + * Gradually increase the load on the new cluster while monitoring its performance. +* **Consumer Migration:** + * Start migrating consumers to the new cluster in phases. Begin with less critical consumers and verify that they are processing data correctly. + * Ensure that consumers can handle data from the new cluster without issues such as reprocessing or data loss. More information on this will follow. +* **Cutover:** + * Once the stability of the new Confluent cluster is confirmed and the consumers have already been migrated, the next step is to stop dual writes from the producer. At this point, the producer should only write to the new cluster. This ensures a smooth transition while maintaining data consistency and minimizing any potential risks associated with the migration. Make sure a rollback plan is in place to handle any unforeseen issues during this process in case the cutover is unsuccessful. This plan should allow you to revert to the original cluster quickly to minimize downtime and data loss, ensuring business continuity while troubleshooting any issues with the migration. + * Monitor both clusters during and after the cutover to ensure everything is functioning as expected. +* **Decommission the Old Cluster:** + * After a period of stable operation on the new cluster, decommission the old cluster. Ensure all relevant data and metadata have been successfully transitioned before doing so. 
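To make the dual-write (shadowing) step above more concrete, here is a minimal sketch using the confluent_kafka Python client. The broker addresses, topic name, and message contents are assumptions; adapt them to your own clusters:

```python
from confluent_kafka import Producer

# Assumed bootstrap servers for the existing and the new cluster.
old_cluster = Producer({"bootstrap.servers": "old-kafka:9092"})
new_cluster = Producer({"bootstrap.servers": "new-kafka:9092"})

TOPIC = "orders"  # assumed topic name


def delivery_report(err, msg):
    # Surface failures so divergence between the two clusters is visible.
    if err is not None:
        print(f"Delivery to {msg.topic()} failed: {err}")


def dual_write(key: bytes, value: bytes) -> None:
    """Write the same record to both clusters while the migration is in progress."""
    for producer in (old_cluster, new_cluster):
        producer.produce(TOPIC, key=key, value=value, on_delivery=delivery_report)
        producer.poll(0)  # serve delivery callbacks


if __name__ == "__main__":
    dual_write(b"order-123", b'{"status": "created"}')
    old_cluster.flush()
    new_cluster.flush()
```

Tracking delivery failures per cluster in this way gives an early signal of whether the new cluster is ready for cutover.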
+ +**Key Considerations:** + + + +* **Synchronization**: Keep producers writing to both clusters until you are certain the new cluster can handle the full load. +* **Monitoring**: Use tools like Prometheus and Grafana to monitor the performance of both clusters during the transition. + +**Migrating With Data Migration** + +Data migration adds complexity, as it involves transferring not only active producers and consumers but also the historical data stored in the Kafka topics. + +**Options for Data Migration:** + +**Cluster Linking (Confluent):** Cluster Linking is a feature that enables real-time replication between Kafka clusters, allowing data to flow from the source cluster to the target cluster. + + + +* **Advantages:** + * Low-latency data replication: Enables near real-time data synchronization between clusters. + * No offset translation required: Consumers on the target cluster can continue from the same offsets as the source. + * Minimal data loss: Ideal for scenarios where high data consistency is critical. +* **Shortcomings:** + * Schema compatibility: You’ll need to ensure schema consistency if using schema registries across clusters. + * Limited transformation capabilities: Focuses on pure data replication without offering complex filtering or transformation. + +**When to use**: Cluster Linking is ideal for scenarios where you need seamless, low-latency data replication with minimal data transformation. It’s the go-to choice for Confluent users looking for operational simplicity without requiring offset translation. + +**MirrorMaker 2 (MM2):** MM2 is an open-source Kafka tool for replicating data between clusters, offering more flexibility for non-Confluent environments. + + + +* **Advantages:** + * Flexibility in deployment: Works well with both open-source Kafka and Confluent deployments. + * Supports offset translation: It provides offset translation for consumer groups across clusters. +* **Shortcomings:** + * Higher complexity: Offset translation may introduce complexities during migration. + * Potential replication lag: MirrorMaker 2 may experience higher replication lag compared to Cluster Linking, depending on configurations. + * Limited schema translation: If using schema registries, it lacks advanced schema translation features. + +**When to use:** MM2 is suited for environments where you require flexibility with Kafka distributions or need offset translation between clusters. + +**Confluent Replicator:** A robust tool that enables data replication with advanced features like filtering, transformation, and schema management. + + + +* **Advantages:** + * Filtering and transformation: Allows data filtering and transformation during replication, making it more adaptable for complex use cases. + * Schema management: Supports schema translation, making it ideal for enterprises where schema consistency across clusters is critical. +* **Shortcomings:** + * Offset translation issues: Offset translation is not automatic, requiring careful configuration. + * More resource-intensive: Replicator requires more resources and careful tuning compared to simpler replication methods like Cluster Linking. + +**When to use**: Choose Confluent Replicator for enterprise-grade migrations that require granular control over data replication, including filtering and schema translation. + +**Common Steps for Migration with Data Replication** + + + +1. 
**Start Replication**: Set up replication between the source and target clusters using one of the tools mentioned (Cluster Linking, MirrorMaker 2, or Confluent Replicator). +2. **Migrate Consumers**: After verifying that data replication is working correctly, migrate consumers to the target cluster. Ensure that consumers are configured to process from the appropriate offsets (translated if necessary) based on the replication tool used. +3. **Cutover Producers**: Once the consumers are stable on the new cluster, stop dual writing and switch the producer to only write to the new cluster. +4 **Monitor and Validate**: Continuously monitor data flow, replication consistency, and consumer performance to ensure a smooth migration. + +**Key Considerations:** + + + +* **Data Integrity**: Ensure that data is accurately replicated without loss or duplication. Use checksums or hashing to verify data consistency. +* **Replication Lag:** Monitor and minimize replication lag to prevent data inconsistencies between clusters. +* **Cutover Strategy**: Plan the cutover carefully to avoid service disruptions. It’s often best to perform the final switch during a maintenance window. + + +## Evaluate Workload Optimization + +A migration is a great time to tackle some of the technical debt that may have built up in your existing deployment, such as optimizing and improving the clients, workload, data model, data quality, and cluster topology. Each option will vary in complexity. + Additionally, migration provides a chance to consolidate Kafka topics by renaming or removing unused ones, which helps clean up the environment and improve data flow efficiency. + This is also a good opportunity to update your security strategy. Consider enabling Role-Based Access Control (RBAC) for better user management or changing the authentication type to a more secure method. These steps will fortify the system’s security while ensuring compliance with best practices. + + +## Defining Acceptable Downtime + +When migrating Kafka clusters, it’s crucial to determine the amount of downtime that is acceptable for your clients. During any Kafka migration, clients need to be restarted to connect to the new bootstrap servers and apply any new security configurations specific to the destination cluster. Therefore, you must either plan for the downtime incurred when clients restart or implement a blue/green deployment strategy to achieve zero downtime. Depending on the client architecture and migration strategy, downtime can be very brief but is almost always greater than zero when the same set of clients is assigned to the new Kafka cluster. It is recommended to schedule the migration for production workloads during a maintenance window or a period of low traffic, such as nighttime or weekends. + +Client downtime can be categorized into two types: + + + +1. Downtime for writes +2. Downtime for reads + +Typically, downtime for reads is more tolerable than downtime for writes, since consumers can often restart from where they left off after migrating to the new Kafka cluster and quickly catch up. + +The amount of client downtime will depend on the following factors: + + + +1. **Speed of Updating Client Configurations**: The time it takes to update the client bootstrap servers and security configurations. Automating this process can significantly reduce downtime. 
+ To automate the updating of client configurations and minimize downtime, consider the following strategies and resources: + + + **Use Ansible Playbooks:** Automate configuration updates using Ansible, which allows for both rolling and parallel deployment strategies. Rolling deployments are safer as they update one node at a time, ensuring high availability. Refer [ Update a Running Confluent Platform Configuration using Ansible Playbooks ](https://docs.confluent.io/ansible/current/ansible-reconfigure.html) + + + **Controlled Shutdown:** Enable controlled shutdown (controlled.shutdown.enable=true) to gracefully stop brokers, syncing logs and migrating partitions before shutdown. This minimizes downtime during updates. Refer [Best Practices for Kafka Production Deployments in Confluent Platform](https://docs.confluent.io/platform/current/kafka/post-deployment.html) + + + **Client Quotas**: Use client quotas to manage resource consumption effectively. This can help prevent any single client from overwhelming the system during configuration updates. Refer [Multi-tenancy and Client Quotas on Confluent Cloud](https://docs.confluent.io/cloud/current/clusters/client-quotas.html) + + + By implementing these strategies, you can effectively automate client configuration updates while minimizing downtime. + +2. **Client Startup Time**: Clients have varying startup times depending on their implementation. Additionally, specific client dependencies may dictate the order in which clients can start, potentially extending the time taken to fully restart the system. + +3. **Execution Speed of Migration-Specific Steps**: Any other migration-specific steps that need to be completed before restarting the clients with the new Kafka cluster can also contribute to downtime. + + +## Define Processing Requirements + +It is essential to clearly define the data processing requirements for each client before initiating a Kafka migration. Understanding whether clients can tolerate duplicates or missed messages will guide you in selecting the appropriate data replication tool and client migration strategy. + + + +1. **Can Tolerate Duplicates but Not Missed Messages:** +* **Data is Migrated**: Start by cutting over the consumer first, ensuring that auto.offset.reset is set to earliest. This setting ensures that the consumer processes all messages from the new Kafka cluster starting from the earliest offset, potentially causing duplicates but preventing any missed messages. Once the consumer has been transitioned, cut over the producer to begin writing to the new cluster. +* **Data is Not Migrated**: Start by enabling a dual-write setup for the producer, meaning the producer writes to both the old and new Kafka clusters simultaneously. After ensuring the dual write is functioning, cut over the consumer with auto.offset.reset set to earliest, allowing it to consume from the new cluster without missing any messages. Finally, stop producing to the old Kafka cluster once the transition is stable. +2. **Can Tolerate Missed Messages but Not Duplicates:** +* **Data is Migrated**: Start by cutting over the consumer first, with auto.offset.reset set to latest. This ensures that the consumer will only process new messages from the latest offset on the new Kafka cluster, avoiding duplicates. After this, transition the producer to the new cluster to continue data production. +* **Data is Not Migrated**: Cut over the producer first to begin sending data to the new Kafka cluster. 
Then, transition the consumer with auto.offset.reset set to latest to process only the most recent messages on the new cluster, ensuring no duplicates. Alternatively, both the producer and the consumer can be cut over simultaneously, minimizing potential downtime. +3. **Cannot Process Duplicates or Missed Messages:** +* Synchronizing or translating consumer group offsets between clusters can simplify the migration process and reduce client downtime. If this is not feasible, the client migration sequence must be carefully orchestrated to meet the data processing requirements. + +# Best Practices for Kafka Migration + +Regardless of whether you are migrating with or without data, the following best practices will help ensure a successful Kafka cluster migration: + +**Do:** + + + +* Plan Thoroughly: Understand your current setup and define a detailed migration plan, including timelines, stakeholders, and rollback procedures. +* Test Extensively: Before starting the migration, test the new Kafka cluster in a staging environment that mirrors your production setup as closely as possible. +* Monitor in Real-Time: Use monitoring tools like Prometheus, Grafana, or Confluent Control Center to track the health and performance of both clusters during the migration. +* Gradual Rollout: Start with non-critical workloads to test the waters before fully migrating high-priority data streams. + +**Don’t:** + + + +* Neglect Consumer Offsets: Ensure consumer offsets are correctly managed during the migration to prevent data reprocessing or loss. +* Ignore Data Retention Policies: Align data retention policies on both clusters to avoid unexpected data loss or storage issues. +* Skip Backup Plans: Always have a rollback plan and data backups in place. In case of a failure, you need to be able to revert quickly without impacting operations. + + +# Conclusion + +Kafka cluster migration without downtime is challenging but achievable with the right strategy and tools. Whether you’re migrating with or without data, thorough planning, testing, and execution are key to a successful transition. By following the steps and best practices outlined in this blog, you can minimize risks and ensure that your Kafka migration is smooth, efficient, and disruption-free, regardless of the Kafka provider or infrastructure you are using. diff --git a/_posts/2024-09-06-Understanding-RedpandaConnect.md b/_posts/2024-09-06-Understanding-RedpandaConnect.md new file mode 100644 index 0000000000..0d9a410a5f --- /dev/null +++ b/_posts/2024-09-06-Understanding-RedpandaConnect.md @@ -0,0 +1,301 @@ +--- +layout: post +title: "Understanding Redpanda Connect / Benthos" +categories: [Redpanda Connect, Benthos ] +teaser: Learn how Redpanda Connect, formerly Benthos, simplifies real-time data streaming and processing. This blog covers its key features, practical use cases, and a comparison to Kafka Connect. Dive in to see how it can streamline your data workflows. +authors: Subramanya +featured: false +hidden: false +image: /assets/blog-images/understanding-redpanda-connect/redpanda-connect.webp +toc: true +--- + + +Before diving into the capabilities of Redpanda Connect, it's essential to understand its origins, particularly the acquisition and transformation of Benthos, which played a pivotal role in shaping what Redpanda Connect is today. +Benthos, an open-source data streaming tool, was renowned for its versatility and simplicity in managing and processing data streams. 
It gained popularity due to its declarative approach, allowing users to define complex data processing pipelines with minimal configuration. Benthos was particularly favored for its stateless processing, which made it highly reliable and easy to scale.

**Key Features of Benthos:**

* **Declarative Configuration**: Users could define processing pipelines using a simple configuration file, making it accessible even to those without deep programming expertise.
* **Flexibility**: Benthos supported a wide range of input, output, and processing components, making it adaptable to various use cases.
* **Scalability**: Built with scalability in mind, Benthos could handle high-throughput data streams efficiently.

**The Acquisition by Redpanda**

Recognizing the potential of Benthos, Redpanda, a company known for its high-performance streaming data platform, acquired the project. The acquisition marked the beginning of Benthos' transformation into what is now known as Redpanda Connect.

**The Fork to Bento**

When Redpanda acquired Benthos, it rebranded the project as Redpanda Connect and introduced commercial licensing for key integrations. This prompted WarpStream to fork the Benthos project and rename it Bento, which will remain a 100% MIT-licensed, open-source project. The Bento fork continues Benthos' legacy, offering a lightweight, Go-based stream processing alternative to Kafka Connect. WarpStream invites others to contribute and help maintain Bento, ensuring stability and open governance for those concerned about Redpanda's commercial shift.

You can find [the new Bento repository here](https://github.com/warpstreamlabs/bento), as well as a hosted version of the [original docs](https://warpstreamlabs.github.io/bento/). For more details, check out the full article on the fork on the official [WarpStream website](https://www.warpstream.com/blog/announcing-bento-the-open-source-fork-of-the-project-formerly-known-as-benthos).

# What is Redpanda Connect

Redpanda Connect builds upon the foundation laid by Benthos, enhancing its capabilities and integrating it into the broader Redpanda ecosystem. While Benthos provided a robust framework for data streaming, Redpanda took its functionality and improved upon it in several ways:

* **Integration with the Redpanda Ecosystem**: Redpanda Connect is designed to work natively with Redpanda's broader ecosystem. It is optimized for Redpanda's distributed data platform, allowing for more seamless and efficient data streaming, transformation, and routing.
* **Scalability**: While Benthos is highly capable, Redpanda Connect is designed to handle larger volumes of data more efficiently. It leverages Redpanda's architecture to scale up and manage high-throughput data streams in a more resilient way.
* **Cohesion**: Redpanda Connect provides a more unified and cohesive experience. By building on top of Benthos, it integrates well with Redpanda's tools, offering streamlined workflows and management.

In summary, Redpanda took Benthos' foundation and optimized it to work specifically with Redpanda's high-performance platform, adding better scalability, easier integration, and a more cohesive user experience.
+ +Redpanda Connect is a framework for creating declarative stream processors where a pipeline of one or more sources, an arbitrary series of processing stages, and one or more sinks can be configured in a single config + + +# Benefits of Using Redpanda Connect + +**Real-Time Data Processing**: Redpanda Connect can be used for real-time data processing. It can consume data from the specified inputs, process it using the specified processors, and then send the processed data to the specified outputs. This makes it suitable for use cases that require real-time data processing + +**Fault Tolerance**: Redpanda Connect includes mechanisms for fault tolerance, ensuring that data processing continues smoothly even in the event of failures. This includes support for distributed connectors and automatic recovery from errors. + +**Transaction-Based Resiliency:** It implements transaction-based resiliency with back pressure, ensuring at-least-once delivery without needing to persist messages during transit. + +**Wide Range of Connectors**: Redpanda Connect comes with a wide range of connectors, making it easy to integrate with your existing infrastructure. Some of the connectors include AWS DynamoDB, AWS Kinesis, Azure Blob Storage, MongoDB, Kafka, and many more. + +**Data Agnostic**: Redpanda Connect is totally data agnostic, which means it can handle any type of data you need to process. + +**Overlap with Traditional Tools**: Redpanda Connect has functionality that overlaps with integration frameworks, log aggregators, and ETL workflow engines. This means it can be used to complement these traditional data engineering tools or act as a simpler alternative. + + +# Key Components of Redpanda Connect + +Redpanda Connect is composed of several key components that work together to provide a robust data streaming service: + + + +1. **Inputs**: These are the sources from which data is consumed. They can be various services like Kafka, AWS S3, HTTP, etc. +2. **Buffers**: These are optional components that can be used to temporarily store data before it is processed +3. **Processors**: These are the components that perform operations on the data. They can perform various tasks like mapping, filtering, etc. +4. **Outputs**: These are the destinations where the processed data is sent. They can be various services like AWS S3, HTTP, etc. +5. **Observability Components**: These components allow you to specify how Connect exposes observability data. They include HTTP, logger, metrics, and tracing. + + +# How Does Redpanda Connect Work? + +Redpanda Connect works by consuming data from the specified inputs, optionally storing it in a buffer, processing it using the specified processors, and then sending the processed data to the specified outputs. The observability components provide visibility into the operation of the Connect pipeline. + +In this example, we'll set up a Kafka-Benthos pipeline where data is consumed from one Kafka topic, processed by Benthos, and then sent to another Kafka topic. We'll use Docker Compose to spin up a Zookeeper, Kafka, and Benthos services, and we'll walk through each step of the process, including how to verify that the data has been processed correctly. + +**Step 1: Create the Docker Compose File** + +Let's start by creating a docker-compose.yml file. This file defines the services we need: Zookeeper, Kafka, and Benthos. 
+ +```yaml +version: '3.7' +services: + zookeeper1: + image: confluentinc/cp-zookeeper:7.6.2 + hostname: zookeeper1 + container_name: zookeeper1 + ports: + - "2181:2181" + command: zookeeper-server-start /etc/kafka/zookeeper.properties + volumes: + - ./zookeeper1:/etc/kafka + deploy: + resources: + limits: + cpus: "1" + memory: 512M + + kafka1: + image: confluentinc/cp-server:7.6.2 + hostname: kafka1 + container_name: kafka1 + depends_on: + - zookeeper1 + command: kafka-server-start /etc/kafka/server.properties + volumes: + - ./kafka1:/etc/kafka + deploy: + resources: + limits: + cpus: "1.5" + memory: 1536M + + benthos: + image: jeffail/benthos:latest + hostname: benthos + container_name: benthos + volumes: + - ./benthos.yaml:/benthos.yaml + ports: + - "4195:4195" + depends_on: + - kafka1 +``` + +**Step 2: Configure Benthos** + +Next, we'll configure Benthos by creating a benthos.yaml file. This configuration will consume messages from the benthos Kafka topic, process each message by squaring its value, and then produce the result to the destination_topic. + +```yaml +input: + kafka: + addresses: + - kafka1:19092 + topics: [ benthos ] #source topic name + consumer_group: benthos_group + client_id: benthos_client + start_from_oldest: true + tls: {} + sasl: {} + +pipeline: + processors: + - bloblang: | + root = this * this +output: + kafka: + addresses: + - kafka1:19093 + topic: destination_topic # The new topic to sink data into + client_id: benthos_client + tls: {} + sasl: {} + max_in_flight: 64 +``` + +Benthos uses a powerful language called Bloblang for data transformations. Bloblang allows you to manipulate and reshape data in various ways. + +**root = this * this**: This is the specific Bloblang expression used in the processor. Let’s break it down: + +**root**: This represents the entire document or message being processed. Assigning a value to root effectively transforms the entire message. + +**this**: In Bloblang, this refers to the current value of the data being processed. If the data is a single number, this represents that number. + +**this * this**: This expression takes the value of this (i.e., the number) and multiplies it by itself. This operation squares the number. For example, if the incoming message is 2, this processor will transform it to 4. + +**Step 3: Spin Up the Services** + +With your docker-compose.yml and benthos.yaml files in place, you can now bring up the services using Docker Compose. + +```bash +docker-compose up -d +``` + +This command will start the Zookeeper, Kafka, and Benthos services in detached mode. + +**Step 4: Create Kafka Topics** + +Before sending any data, you need to create the destination Kafka topic that will be used: destination_topic. + +**Step 5: Produce messages in the source topic and verify the output by consuming in the destination topic:** + + + + + +Here we have produced a simple set of integers to the source topic and then consumed from the destination topic. + + +# Applications and Use Cases + +**Data Integration**: Redpanda Connect can be used to integrate data from various sources and send it to different destinations using its wide range of connectors. + +**ETL Workflows**: Redpanda Connect can be used to build ETL (Extract, Transform, Load) workflows. It can extract data from various sources, transform the data using its processors, and load the transformed data into various destinations. + +**Data Filtering**: With its processors, Redpanda Connect can be used to filter data. 
This can be useful in use cases that require filtering of data before it is sent to the destination. + +**Log Aggregation**: Redpanda Connect can be used to aggregate logs from various systems and send them to a centralized logging service. This can be useful for monitoring and debugging distributed systems. + + +# Comparison with Kafka Connect + +In the world of data streaming, both Kafka Connect and Redpanda Connect serve as critical tools, but they differ significantly in approach and design philosophy. Kafka Connect, part of the broader Apache Kafka ecosystem, is known for its robust, distributed architecture, making it ideal for complex ETL workflows and large-scale deployments. Redpanda Connect, on the other hand, evolved from the lightweight, open-source Benthos project and focuses on simplicity, stateless processing, and high performance. While Kafka Connect excels in stateful transformations, Redpanda Connect aims to streamline real-time data streaming with minimal overhead. Here's how they stack up against each other. + + +
| Category | Kafka Connect | Redpanda Connect |
| --- | --- | --- |
| Architecture | Distributed, scalable, with support for standalone and distributed modes. | Declarative, stateless, simpler architecture focused on ease of deployment. |
| Ecosystem & Integration | Part of the Apache Kafka ecosystem; integrates seamlessly with Kafka topics and brokers. | Designed for Redpanda, with flexibility inherited from Benthos; supports a wide range of connectors. |
| Configuration | Requires detailed configuration for connectors, tasks, and workers; supports distributed configuration. | Simple, declarative configuration with minimal setup effort, focusing on lightweight pipeline setups. |
| Processing | Stateful processing with distributed transformations; suitable for complex ETL workflows. | Stateless, lightweight processing with simpler use cases; focuses on chained, minimal transformations. |
| Observability | Extensive monitoring through JMX, REST APIs, and integration with systems like Prometheus. | Offers observability components like logging and tracing, but with potentially less mature tooling. |
| Licensing & Community | Open-source under Apache License 2.0, with a strong community and an extensive library of connectors. | Mixed licensing post-acquisition; originally fully open-source, now some parts are proprietary, leading to a fork. |
| Use Cases | Ideal for large-scale, complex ETL workflows, distributed fault tolerance, and enterprise-level deployments. | Best for simpler, stateless streaming and performance-sensitive applications with minimal overhead. |
| Performance & Scalability | Scales effectively with Kafka's architecture; handles high-throughput data and large deployments. | Designed for low-latency streaming; excels in performance but may not scale as easily for large, stateful use cases. |