# Hands-On Exercise: Building Data Streaming Pipelines

**Objective**: Students will learn how to build real-time data streaming pipelines using Apache Kafka, Spark-Streaming (PySpark), and Apache Flink. Each tool will be introduced separately with individual hands-on tasks, followed by integrating all three tools into a single real-time streaming pipeline.

## Step 1: Apache Kafka

**Introduction to Kafka**

Apache Kafka is a distributed streaming platform used to publish, subscribe, store, and process real-time event streams. In this step, we will start by using Kafka CLI commands and then programmatically interact with Kafka using Python.

### Task 1: Using Kafka CLI Commands

1. Start ZooKeeper: In ZooKeeper mode, Kafka relies on ZooKeeper to coordinate its brokers (newer Kafka releases can instead run in KRaft mode without it).

In [None]:
$ zookeeper-server-start.sh config/zookeeper.properties

2. Start Kafka Broker: Start the Kafka broker after Zookeeper is running.

In [None]:
$ kafka-server-start.sh config/server.properties

3. List current topics:

In [None]:
$ kafka-topics.sh --bootstrap-server localhost:9092 --list

4. Create a Kafka Topic: Use the Kafka CLI to create a topic for your streaming pipeline.

In [None]:
$ kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

5. Produce Messages to Kafka: Send messages to the Kafka topic from the command line. Type one message per line and press Ctrl+C to exit.

In [None]:
$ kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092

6. Consume Messages from Kafka: Read messages from the Kafka topic in real time.

In [None]:
$ kafka-console-consumer.sh --topic test-topic --bootstrap-server localhost:9092 --from-beginning
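
The topic-management commands above can also be scripted. The following is a minimal sketch, assuming `kafka-topics.sh` is on your `PATH`; the helper name `kafka_topics_cmd` is hypothetical. It adds `--if-not-exists` so that repeating the create step does not fail when the topic is already there.

```python
import shlex

BOOTSTRAP = "localhost:9092"  # same broker address used in the commands above


def kafka_topics_cmd(action, topic=None, partitions=1, replication=1):
    """Build the argument list for a kafka-topics.sh call (hypothetical helper)."""
    cmd = ["kafka-topics.sh", "--bootstrap-server", BOOTSTRAP, f"--{action}"]
    if topic is not None:
        cmd += ["--topic", topic]
    if action == "create":
        cmd += [
            "--partitions", str(partitions),
            "--replication-factor", str(replication),
            "--if-not-exists",  # makes the create step safe to repeat
        ]
    return cmd


create_cmd = kafka_topics_cmd("create", "test-topic")
list_cmd = kafka_topics_cmd("list")

# Print the commands in copy-pasteable form; pass the lists to
# subprocess.run(cmd, check=True) to actually execute them.
print(shlex.join(create_cmd))
print(shlex.join(list_cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a topic name ever contains unusual characters.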

### Task 2: Kafka Programmatically with Python

1. Install Kafka Python Library: Install the `kafka-python` library using pip.

In [None]:
$ pip install kafka-python

2. Producer Script: Write a Python script to produce messages to a Kafka topic.

In [None]:
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

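# Send 100 JSON messages; send() is asynchronous and buffers records internally
for i in range(100):
    producer.send('test-topic', {'number': i})

# Block until all buffered messages are delivered, then release resources
producer.flush()
producer.close()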


3. Consumer Script: Write a Python script to consume messages from a Kafka topic.

In [None]:
import json
from kafka import KafkaConsumer

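consumer = KafkaConsumer(
    'test-topic',
    bootstrap_servers='localhost:9092',
    # Start from the oldest message when no committed offset exists
    # (the default, 'latest', only shows messages produced after startup)
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)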

for message in consumer:
    print(message.value)
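
Kafka stores and transports message values as raw bytes, which is why the producer and consumer scripts above pass a serializer and a deserializer. The sketch below runs the same two callables without a broker to show what travels over the wire; the sample message is made up for illustration.

```python
import json

# Same callables as in the producer and consumer scripts above
value_serializer = lambda v: json.dumps(v).encode('utf-8')
value_deserializer = lambda m: json.loads(m.decode('utf-8'))

message = {'number': 7}                  # made-up sample message
payload = value_serializer(message)      # bytes sent to Kafka
restored = value_deserializer(payload)   # dict seen by the consumer

print(payload)
print(restored)
```

If the two sides disagree on the format (for example, the producer sends plain strings while the consumer expects JSON), the deserializer raises an error when the message is read.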


## Step 2: Spark-Streaming Using PySpark

**Introduction to Spark-Streaming**

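Spark-Streaming provides real-time stream processing capabilities built on top of Apache Spark. In this task, we'll create a streaming job using PySpark to process CSV files as they arrive in a directory. The code uses Spark's Structured Streaming API (`readStream` and `writeStream`), which treats the stream as a continuously growing table.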

### Task 3: Processing Data Using Spark-Streaming

1. **Initialize Spark Session**: Create a Spark session for your streaming job.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SalesDataStreamingJob") \
    .getOrCreate()


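2. **Monitor an HDFS Directory**: Define the schema and create a streaming DataFrame that watches an HDFS directory and processes new CSV files as they appear. File-based streaming sources require an explicit schema by default.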

In [None]:
# Define the schema for the streaming data
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("transaction_id", IntegerType(), True),
    StructField("transaction_date", StringType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("store_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("total_amount", DoubleType(), True)
])

# Read the streaming data from the HDFS directory
csv_stream = spark.readStream.format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load("hdfs:///user/datatech-labs/streaming-data")
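
To exercise the stream you need CSV files whose columns match the schema above. The following is a minimal sketch that writes one such file to a local scratch directory; the helper name `write_sample_csv`, the file name, and all row values are made up for illustration.

```python
import csv
import os
import tempfile

# Column order matches the schema defined above
COLUMNS = ["transaction_id", "transaction_date", "customer_id",
           "product_id", "store_id", "quantity", "total_amount"]

# Made-up rows, for illustration only
SAMPLE_ROWS = [
    [1, "2024-01-01", 101, 11, 1, 2, 19.98],
    [2, "2024-01-01", 102, 12, 1, 1, 5.50],
    [3, "2024-01-02", 101, 13, 2, 3, 30.00],
]


def write_sample_csv(directory, filename="sales_0001.csv"):
    """Write a small CSV file with a header row and return its path."""
    path = os.path.join(directory, filename)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(SAMPLE_ROWS)
    return path


out_dir = tempfile.mkdtemp()  # local scratch directory, not HDFS
sample_path = write_sample_csv(out_dir)
print(sample_path)
```

To feed the stream, copy the generated file into the monitored directory, for example with `hdfs dfs -put`. Each new file that lands there is picked up as new input.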


3. **Transformation**: Apply the required transformations to the streaming data. For instance, calculate total sales (the sum of `total_amount`) per `customer_id`:

In [None]:
# Group by customer_id and calculate total sales
processed_data = csv_stream.groupBy("customer_id").sum("total_amount") \
    .withColumnRenamed("sum(total_amount)", "total_sales") \
    .orderBy("total_sales", ascending=False)


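4. **Write the Stream Output**: Write the output to the console in real time for debugging purposes. The `complete` output mode is used because the query sorts an aggregated result, which Spark supports only in that mode. You can later modify this to write to a database or another HDFS location.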

In [None]:
query = processed_data.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
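
In `complete` output mode, Spark re-emits the entire result table after every micro-batch rather than only the rows that changed. The sketch below is a simplified pure-Python model of that behavior for the per-customer totals, not Spark's implementation; the function name `run_complete_mode` and the batch contents are made up for illustration.

```python
from collections import defaultdict


def run_complete_mode(batches):
    """Return the full result table emitted after each micro-batch."""
    state = defaultdict(float)      # running total per customer_id
    emitted = []
    for batch in batches:
        for customer_id, total_amount in batch:
            state[customer_id] += total_amount
        # "complete" mode: emit every row of the result table, not just changes
        table = sorted(state.items(), key=lambda kv: kv[1], reverse=True)
        emitted.append(table)
    return emitted


# Two made-up micro-batches, e.g. two CSV files arriving one after the other
batches = [
    [(101, 20.0), (102, 5.5)],
    [(102, 40.0), (103, 12.0)],
]

outputs = run_complete_mode(batches)
for i, table in enumerate(outputs):
    print(f"Batch {i}: {table}")
```

Customer 101 still appears in the second table even though the second batch contains no rows for it, which is the defining property of `complete` mode.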


### Homework:
5. Combine the code snippets above into a single script and submit it to the Spark cluster using `spark-submit`.

## Step 3: Apache Flink

**Introduction to Apache Flink**

Apache Flink is a powerful stream-processing framework that enables real-time data analytics. In this task, we’ll set up a Flink pipeline to consume data from Kafka, process it using Flink SQL, and create a virtual table for querying.

### Task 4: Consume from Kafka and Process with Flink

1. Install a Flink cluster with the script `./install-flink-script.sh`, then open the web UI at `localhost:8081`.

**Optional for the bundled example in step 2, but required for the Python job in step 3**: Install the Flink Python API (the PyFlink package) using pip.

In [None]:
$ pip install apache-flink

2. Run (submit) the example WordCount job that ships with Apache Flink. Run the commands from the Flink installation directory, since the paths are relative to it:

In [None]:
$ echo -e "hello world\napache flink\nflink cluster\nhello flink" > ~/Desktop/sample.txt

$ ./bin/flink run examples/batch/WordCount.jar \
    --input ~/Desktop/sample.txt \
    --output ~/Desktop/sample-output.txt
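
To check the job's output, you can compute the expected word counts for the sample text yourself. This is a minimal sketch, assuming the bundled example lowercases the text and splits it on non-word characters; it does not call Flink.

```python
import re
from collections import Counter

# Same text that the echo command above writes to sample.txt
text = "hello world\napache flink\nflink cluster\nhello flink"

# Assumption: words are lowercased and split on non-word characters
words = [w for w in re.split(r"\W+", text.lower()) if w]
counts = Counter(words)

for word, count in sorted(counts.items()):
    print(word, count)
```

Compare these counts with the contents of `~/Desktop/sample-output.txt` once the job has finished.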

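3. **Set up Kafka Consumer in Flink**:

Create a Flink job that consumes messages from Kafka, and submit it to the installed Flink cluster. The job reads from a topic named `flink-topic`, so produce your test messages to that topic (or change the `'topic'` option to `test-topic`). The Kafka SQL connector JAR must be available to Flink, for example in its `lib/` directory, unless the install script already provides it. Save the script as `flink_scripts/flink_kafka_example.py`, the path used in the following steps.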

In [None]:
#!/usr/bin/env python3

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
table_env = StreamTableEnvironment.create(env)

# Create Kafka Source Table
# Identifiers are backtick-quoted as a precaution against SQL keyword clashes
table_env.execute_sql("""
    CREATE TABLE kafka_source (
        `number` INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'flink-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-consumer-group',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Count how many messages carry each value of `number`
result = table_env.sql_query("""
    SELECT `number`, COUNT(*) AS cnt
    FROM kafka_source
    GROUP BY `number`
""")

# Print the results
result.execute().print()
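
Because the Kafka source is unbounded, the printed result is a changelog rather than a final table: each new message either inserts a row or replaces the previous count for its key. The sketch below is a simplified pure-Python model of that behavior, not Flink's implementation; the function name `count_changelog` and the input values are made up, while the `+I`, `-U`, and `+U` codes follow the operation markers Flink prints.

```python
from collections import defaultdict


def count_changelog(numbers):
    """Model the changelog of a GROUP BY number / COUNT(*) query."""
    counts = defaultdict(int)
    changelog = []
    for n in numbers:
        if counts[n] == 0:
            counts[n] = 1
            changelog.append(("+I", n, 1))              # first row for this key
        else:
            changelog.append(("-U", n, counts[n]))      # retract the old count
            counts[n] += 1
            changelog.append(("+U", n, counts[n]))      # emit the updated count
    return changelog


# Made-up input: the values of the `number` field in arriving messages
changes = count_changelog([1, 2, 1, 1])
for op, number, cnt in changes:
    print(op, number, cnt)
```

With the producer from Task 2, every value from 0 to 99 is sent once, so a single run yields only `+I` rows; updates appear when the producer is run again against the same topic.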

4. Make the file executable

In [None]:
$ chmod +x flink_scripts/flink_kafka_example.py

5. Submit the script using this command:

In [None]:
$ flink run -py flink_scripts/flink_kafka_example.py

# or with full command:
$ /opt/flink/bin/flink run -py flink_scripts/flink_kafka_example.py 

---------------------------------------------
### **Conclusion**

In this hands-on exercise, students learned how to:

1. Set up and work with Apache Kafka using both CLI and Python.

2. Implement real-time stream processing using Spark-Streaming and PySpark.

3. Use Apache Flink to process data streams and perform real-time SQL analytics.

4. Build an end-to-end real-time streaming pipeline using Kafka, Spark-Streaming, and Flink, integrating external APIs, and processing data in real-time.

This comprehensive pipeline demonstrates how to manage and process real-time data efficiently across multiple systems.