# Hands-On Exercise: Building Data Streaming Pipelines

**Objective**: Students will learn how to build real-time data streaming pipelines using Apache Kafka, Spark-Streaming (PySpark), and Apache Flink. Each tool will be introduced separately with individual hands-on tasks, followed by integrating all three tools into a single real-time streaming pipeline.

## Step 1: Apache Kafka

**Introduction to Kafka**

Apache Kafka is a distributed streaming platform used to publish, subscribe, store, and process real-time event streams. In this step, we will start by using Kafka CLI commands and then programmatically interact with Kafka using Python.

### Task 1: Using Kafka CLI Commands

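1. Start ZooKeeper: In ZooKeeper mode (the setup this exercise assumes), Kafka relies on ZooKeeper to coordinate brokers. Clusters running in KRaft mode skip this step.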

In [None]:
$ bin/zookeeper-server-start.sh config/zookeeper.properties

2. Start Kafka Broker: Start the Kafka broker after Zookeeper is running.

In [None]:
$ bin/kafka-server-start.sh config/server.properties

3. Create a Kafka Topic: Use the Kafka CLI to create a topic for your streaming pipeline.

In [None]:
$ bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

4. Produce Messages to Kafka: Send messages to the Kafka topic from the command line.

In [None]:
$ bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092

5. Consume Messages from Kafka: Read messages from the Kafka topic in real-time.

In [None]:
$ bin/kafka-console-consumer.sh --topic test-topic --bootstrap-server localhost:9092 --from-beginning
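
The CLI steps above can also be scripted. The following is a minimal sketch, not part of the original exercise, that assembles the same `kafka-topics.sh` call from Python. The helper name `create_topic_cmd` and the `KAFKA_BIN` constant are hypothetical; the flags are exactly the ones used in step 3.

```python
import shlex

# Assumption: the script is run from the Kafka installation directory.
KAFKA_BIN = "bin"
BOOTSTRAP = "localhost:9092"

def create_topic_cmd(topic, partitions=1, replication_factor=1):
    """Return the argv list for the kafka-topics.sh call shown in step 3."""
    return [
        f"{KAFKA_BIN}/kafka-topics.sh", "--create",
        "--topic", topic,
        "--bootstrap-server", BOOTSTRAP,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]

cmd = create_topic_cmd("test-topic")
print(shlex.join(cmd))

# To run it against a live broker (not executed here):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when a topic name contains unusual characters.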

### Task 2: Kafka Programmatically with Python

1. Install Kafka Python Library: Install the `kafka-python` library using pip.

In [None]:
$ pip install kafka-python

2. Producer Script: Write a Python script to produce messages to a Kafka topic.

In [None]:
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

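for i in range(100):
    # send() is asynchronous; records are batched in the background
    producer.send('test-topic', {'number': i})

# Block until every buffered record has been delivered, then release resources
producer.flush()
producer.close()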


3. Consumer Script: Write a Python script to consume messages from a Kafka topic.

In [None]:
import json
from kafka import KafkaConsumer

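# auto_offset_reset='earliest' lets a new consumer read messages that were
# produced before it started; the default ('latest') only sees new ones.
consumer = KafkaConsumer('test-topic',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))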

for message in consumer:
    print(message.value)
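
The producer and consumer only work together if the serializer and deserializer are inverses of each other. As a quick check that needs no broker, the sketch below pulls the two lambdas out into named functions (the names `serialize` and `deserialize` are illustrative) and round-trips one message.

```python
import json

def serialize(value):
    """Same logic as the producer's value_serializer: dict -> UTF-8 JSON bytes."""
    return json.dumps(value).encode("utf-8")

def deserialize(raw):
    """Same logic as the consumer's value_deserializer: UTF-8 JSON bytes -> dict."""
    return json.loads(raw.decode("utf-8"))

payload = serialize({"number": 7})
print(payload)               # b'{"number": 7}'
print(deserialize(payload))  # {'number': 7}
```

Kafka itself only stores bytes, so any format works as long as both sides agree on it.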


## Step 2: Spark-Streaming Using PySpark

**Introduction to Spark-Streaming**

Spark's streaming support provides real-time stream processing on top of Apache Spark. The code in this step uses the Structured Streaming API (`readStream` and `writeStream`) from PySpark to process CSV files as they arrive in a directory.

### Task 3: Processing Data Using Spark-Streaming

1. Initialize Spark Session: Create a Spark session for your streaming job.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("StreamingJob") \
    .getOrCreate()


2. Monitor an HDFS Directory: Use `readStream` to watch an HDFS directory and pick up new CSV files as they arrive. Streaming file sources need an explicit schema, which is why one is declared here.

In [None]:
csv_stream = spark.readStream.format("csv") \
    .option("header", "true") \
    .option("maxFilesPerTrigger", 1) \
    .schema("customer_id INT, sales DOUBLE") \
    .load("hdfs://namenode:9000/path/to/streaming/directory")

# Transformation: running total of sales per customer (result column is named sum(sales))
processed_data = csv_stream.groupBy("customer_id").sum("sales")


3. Write the Stream Output: Write the output to the console in real-time.

In [None]:
query = processed_data.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()


4. Combine the code snippets above into a single script and submit it to the Spark cluster using `spark-submit`.
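
To see what `outputMode("complete")` means for the aggregation in this task, here is a plain-Python model of it, offered as a simplified sketch rather than a description of Spark's internals. Each inner list stands for one micro-batch (one CSV file, given `maxFilesPerTrigger` is 1), and the function name `run_complete_mode` is made up for illustration.

```python
from collections import defaultdict

def run_complete_mode(micro_batches):
    """Keep running totals per customer and emit the full table after each batch."""
    totals = defaultdict(float)
    snapshots = []
    for batch in micro_batches:
        for customer_id, sales in batch:
            totals[customer_id] += sales
        # Complete mode outputs every row of the result, not only changed rows.
        snapshots.append(dict(sorted(totals.items())))
    return snapshots

batches = [
    [(1, 10.0), (2, 5.0)],  # first file
    [(1, 2.5)],             # second file
]
for i, snapshot in enumerate(run_complete_mode(batches)):
    print(f"Batch {i}: {snapshot}")
# Batch 0: {1: 10.0, 2: 5.0}
# Batch 1: {1: 12.5, 2: 5.0}
```

Customer 2 reappears in the second snapshot even though the second file did not mention it, which is the behavior to expect from the console sink in this task.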

## Step 3: Apache Flink

**Introduction to Apache Flink**

Apache Flink is a powerful stream-processing framework that enables real-time data analytics. In this task, we’ll set up a Flink pipeline to consume data from Kafka, process it using Flink SQL, and create a virtual table for querying.

### Task 4: Consume from Kafka and Process with Flink

1. Install Flink Python API: Install the PyFlink package using pip.

In [None]:
$ pip install apache-flink

2. Set up Kafka Consumer in Flink: Create a Flink job to consume messages from Kafka.

In [None]:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
table_env = StreamTableEnvironment.create(env)

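# Create Kafka Source Table
# Requires the Flink Kafka SQL connector JAR on the classpath.
# Backticks guard against `number` being parsed as a SQL keyword;
# 'scan.startup.mode' makes the job read the topic from the beginning.
table_env.execute_sql("""
    CREATE TABLE kafka_source (
        `number` INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")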


3. Create a Flink SQL Query: Use Flink SQL to process the data.

In [None]:
# Backticks are harmless if `number` turns out not to be a reserved word
result = table_env.sql_query("""
    SELECT `number`, COUNT(*) AS cnt FROM kafka_source GROUP BY `number`
""")

# Print the results
result.execute().print()
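
A continuous `GROUP BY` over an unbounded stream does not print one final answer; it emits a changelog as counts are revised. The sketch below is a simplified plain-Python model of that behavior, under the assumption that every input row triggers an update (real jobs may batch updates). The row kinds `+I`, `-U`, and `+U` mirror the insert, retract, and update markers Flink shows in the `op` column; the function name `count_changelog` is hypothetical.

```python
def count_changelog(numbers):
    """Yield (op, number, count) rows the way a streaming COUNT(*) revises results."""
    counts = {}
    for n in numbers:
        old = counts.get(n)
        if old is None:
            counts[n] = 1
            yield ("+I", n, 1)         # first time this key is seen: insert
        else:
            counts[n] = old + 1
            yield ("-U", n, old)       # retract the previous count
            yield ("+U", n, old + 1)   # emit the updated count

for row in count_changelog([0, 1, 0]):
    print(row)
# ('+I', 0, 1)
# ('+I', 1, 1)
# ('-U', 0, 1)
# ('+U', 0, 2)
```

With the producer from Task 2, each number from 0 to 99 is unique, so only `+I` rows would appear unless the producer is run more than once.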


## Step 4: Real-Time Streaming Pipeline

In this step, we’ll combine Kafka, Spark-Streaming, and Flink to build a full data streaming pipeline. The pipeline will consume data from an external API, produce it to a Kafka topic, process it using both Spark-Streaming and Flink, and output the results back to Kafka.

### Task 5: Building the Pipeline

1. **Step 1: Produce Data from an External API to Kafka**. Use a free online API (e.g., a random quote generator) to fetch data and produce it to a Kafka topic.

In [None]:
import requests
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

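# A timeout prevents the script from hanging; raise_for_status() stops on HTTP errors
response = requests.get('https://api.quotable.io/random', timeout=10)
response.raise_for_status()
data = response.json()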

# Produce data to Kafka
producer.send(
    'quotes-topic',
    data
)
producer.flush()
producer.close()
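
The script above sends a single quote, which gives the downstream jobs very little to process. One possible extension, sketched here under the assumption that polling the API at a fixed interval is acceptable, is a small loop that takes the fetch and send steps as parameters. The name `stream_quotes` and its parameters are illustrative and not part of any library; the dry run uses stand-ins so it works without a broker or network access.

```python
import time

def stream_quotes(fetch, send, count, delay_seconds=1.0, sleep=time.sleep):
    """Fetch `count` records and send each to quotes-topic, pausing between calls."""
    sent = 0
    for i in range(count):
        try:
            record = fetch()
        except Exception as exc:  # broad on purpose: keep the loop alive in a demo
            print(f"fetch failed, skipping: {exc}")
        else:
            send("quotes-topic", record)
            sent += 1
        if i < count - 1:
            sleep(delay_seconds)
    return sent

# Dry run with stand-ins (no Kafka, no HTTP).
outbox = []
sent = stream_quotes(
    fetch=lambda: {"quote": "placeholder"},
    send=lambda topic, value: outbox.append((topic, value)),
    count=3,
    sleep=lambda seconds: None,
)
print(sent, len(outbox))  # 3 3
```

In the real pipeline, `fetch` could wrap the `requests.get(...)` call and `send` could be `producer.send`, followed by `producer.flush()` once the loop ends.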


2. **Step 2: Consume and Process with Spark-Streaming**. A Spark streaming job listens to the `quotes-topic` Kafka topic, processes the data, and writes the results to another Kafka topic.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("KafkaQuoteProcessor") \
    .getOrCreate()

# Read from Kafka (needs the spark-sql-kafka-0-10 connector package on the classpath)
quotes_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "quotes-topic") \
    .load()

# Process the data (e.g., filter quotes by author)
processed_df = quotes_df \
    .selectExpr("CAST(value AS STRING)") \
    .filter(col("value").contains("Einstein"))

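# Write back to Kafka. The Kafka sink requires a checkpoint location;
# the path below is a placeholder, so point it at durable storage.
query = processed_df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "processed-quotes-topic") \
    .option("checkpointLocation", "/tmp/checkpoints/processed-quotes") \
    .start()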

query.awaitTermination()
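
Note that the filter above searches the raw JSON string, so it matches "Einstein" anywhere in the message, not only in the author field. The plain-Python sketch below contrasts the two behaviors. It assumes, hypothetically, that each message is a JSON object with `content` and `author` keys; check the actual payload before relying on those names.

```python
import json

def matches_raw(value, needle="Einstein"):
    """Mirror of the Spark filter: substring search over the whole JSON text."""
    return needle in value

def matches_author(value, needle="Einstein"):
    """Stricter variant: parse the JSON and look only at the (assumed) author key."""
    try:
        record = json.loads(value)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    return needle in str(record.get("author", ""))

by_einstein = json.dumps({"content": "Imagination matters.", "author": "Albert Einstein"})
about_einstein = json.dumps({"content": "Einstein was a physicist.", "author": "Someone Else"})

print(matches_raw(by_einstein), matches_raw(about_einstein))        # True True
print(matches_author(by_einstein), matches_author(about_einstein))  # True False
```

In the Spark job, a comparable field-level check could be written with `get_json_object(col("value"), "$.author")` from `pyspark.sql.functions` instead of `contains` on the whole string.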


3. **Step 3: Consume and Process with Flink**. Flink will consume from the quotes-topic and produce the processed output back to Kafka.

In [None]:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
table_env = StreamTableEnvironment.create(env)

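# Consume from Kafka
# Column names assume the API payload has `content` and `author` keys;
# adjust them to match the JSON actually produced to quotes-topic.
table_env.execute_sql("""
    CREATE TABLE quotes_source (
        content STRING,
        author STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'quotes-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink table. A GROUP BY result is an updating stream, so it needs an
# upsert-capable sink; the topic name below is a placeholder.
table_env.execute_sql("""
    CREATE TABLE kafka_output_topic (
        author STRING,
        quote_count BIGINT,
        PRIMARY KEY (author) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'flink-processed-quotes-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

# Process data (e.g., count the number of quotes by a specific author)
result = table_env.sql_query("""
    SELECT author, COUNT(*) AS quote_count FROM quotes_source
    WHERE author LIKE '%Einstein%'
    GROUP BY author
""")

# Output to Kafka; wait() keeps a locally run script alive while the job runs
result.execute_insert("kafka_output_topic").wait()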


----------------------------------------------------------------------------------------------------------------------------

### Another real-time streaming pipeline example to be studied and implemented individually:

https://github.com/OmarAlSaghier/realtime_analysis_voting_project

----------------------------------------------------------------------------------------------------------------------------

### **Conclusion**

In this hands-on exercise, students learned how to:

1. Set up and work with Apache Kafka using both CLI and Python.

2. Implement real-time stream processing using Spark-Streaming and PySpark.

3. Use Apache Flink to process data streams and perform real-time SQL analytics.

4. Build an end-to-end real-time streaming pipeline using Kafka, Spark-Streaming, and Flink, integrating external APIs, and processing data in real-time.

This comprehensive pipeline demonstrates how to manage and process real-time data efficiently across multiple systems.