## Self-study examples: Real-Time Streaming Pipeline

In this step, we’ll combine Kafka, Spark-Streaming, and Flink to build a full data streaming pipeline. The pipeline will consume data from an external API, produce it to a Kafka topic, process it using both Spark-Streaming and Flink, and output the results back to Kafka.

### Task 1: Building the Pipeline

1. **Step 1: Produce Data from an External API to Kafka**. Use a free online API (e.g., a random quote generator) to fetch data and produce it to a Kafka topic.

In [None]:
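import json

import requests
from kafka import KafkaProducer

# Serialize each message dict to UTF-8 encoded JSON before sending
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Fetch one random quote; fail fast on network or HTTP errors
response = requests.get('https://api.quotable.io/random', timeout=10)
response.raise_for_status()
data = response.json()

# Produce data to Kafka
producer.send('quotes-topic', data)
producer.flush()  # block until the buffered message has been sent
producer.close()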
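
The cell above sends a single message, while a streaming pipeline needs a steady flow of records. A minimal sketch of a polling loop follows. The helper name `stream_quotes` and its parameters are hypothetical, and the API call and the Kafka producer are passed in as plain callables so the loop can be tried without a running broker.

```python
import json
import time
from typing import Callable, Dict, List


def stream_quotes(
    fetch: Callable[[], Dict],
    send: Callable[[str, Dict], object],
    topic: str = "quotes-topic",
    count: int = 5,
    interval_seconds: float = 1.0,
) -> int:
    """Fetch `count` records, hand each to `send`, and return how many were sent."""
    sent = 0
    for i in range(count):
        try:
            record = fetch()
        except Exception as exc:  # skip a failed fetch and keep the loop alive
            print(f"fetch failed, skipping: {exc}")
        else:
            send(topic, record)
            sent += 1
        if i < count - 1:
            time.sleep(interval_seconds)  # pause between API calls
    return sent


if __name__ == "__main__":
    # Dry run with stand-ins for the API call and for producer.send
    fake_quotes = iter([
        {"content": "first sample", "author": "x"},
        {"content": "second sample", "author": "y"},
    ])
    outbox: List[str] = []
    n = stream_quotes(
        fetch=lambda: next(fake_quotes),
        send=lambda topic, rec: outbox.append(f"{topic}: {json.dumps(rec)}"),
        count=2,
        interval_seconds=0,
    )
    print(n, outbox)
```

With the real components, `fetch` could wrap `requests.get(...).json()` and `send` could be `producer.send`, followed by a `producer.flush()` once the loop ends.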


2. **Step 2: Consume and Process with Spark-Streaming**. A Spark-Streaming job listens to the quotes-topic Kafka topic, filters the data, and produces the results to another Kafka topic, processed-quotes-topic.

In [None]:
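from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# The Kafka source and sink need the spark-sql-kafka-0-10 connector on the
# classpath (for example via spark-submit --packages, with the artifact
# version matching your Spark and Scala versions).
spark = SparkSession.builder \
    .appName("KafkaQuoteProcessor") \
    .getOrCreate()

# Read from Kafka
quotes_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "quotes-topic") \
    .load()

# Process the data: cast the binary Kafka value to a string and keep the
# messages that mention "Einstein" anywhere in the raw JSON
processed_df = quotes_df \
    .selectExpr("CAST(value AS STRING)") \
    .filter(col("value").contains("Einstein"))

# Write back to Kafka; the Kafka sink requires a checkpoint location
# (the path below is only an example)
query = processed_df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "processed-quotes-topic") \
    .option("checkpointLocation", "/tmp/checkpoints/processed-quotes") \
    .start()

query.awaitTermination()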
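
The filter above searches the whole message string, so it also matches quotes that merely mention Einstein in their text. The plain-Python sketch below contrasts that substring test with a test on a parsed field. It assumes each message is a JSON object with an `author` field, as in the quotable.io payload; the function names and sample messages are hypothetical.

```python
import json


def matches_raw(value: str, name: str) -> bool:
    """Same logic as the streaming filter: substring search on the raw message."""
    return name in value


def matches_author(value: str, name: str) -> bool:
    """Parse the message and test only the (assumed) `author` field."""
    try:
        record = json.loads(value)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    return name in str(record.get("author", ""))


# Two made-up messages: one written by Einstein, one only mentioning him
by_einstein = json.dumps({"content": "A sample quote.", "author": "Albert Einstein"})
about_einstein = json.dumps({"content": "A quote that mentions Einstein.", "author": "Jane Doe"})

for message in (by_einstein, about_einstein):
    print(matches_raw(message, "Einstein"), matches_author(message, "Einstein"))
```

In the Spark job, the same idea would be to parse `value` with `from_json` and a schema, then filter on the resulting `author` column rather than on the raw string.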


3. **Step 3: Consume and Process with Flink**. Flink consumes from the same quotes-topic, independently of the Spark job, and produces its processed output back to Kafka.

In [None]:
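from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

# The Kafka SQL connector JAR must be available to Flink (for example via
# the pipeline.jars setting), in a version matching your Flink release.
env = StreamExecutionEnvironment.get_execution_environment()
table_env = StreamTableEnvironment.create(env)

# Consume from Kafka. The columns assume the `content` and `author` fields
# of the quotable.io JSON payload; other fields are ignored.
table_env.execute_sql("""
    CREATE TABLE quotes_source (
        content STRING,
        author STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'quotes-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-quotes',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Define the sink. A GROUP BY result is continuously updated, so it needs
# the upsert-kafka connector; the topic name is an example.
table_env.execute_sql("""
    CREATE TABLE kafka_output_topic (
        author STRING,
        quote_count BIGINT,
        PRIMARY KEY (author) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'flink-processed-quotes-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

# Process data: count the number of quotes by a specific author
result = table_env.sql_query("""
    SELECT author, COUNT(*) AS quote_count FROM quotes_source
    WHERE author LIKE '%Einstein%'
    GROUP BY author
""")

# Output to Kafka and block while the streaming job runs
result.execute_insert("kafka_output_topic").wait()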
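
A grouped count over an unbounded stream never yields a final row: every matching record revises the count for its key. The sketch below imitates that behaviour in plain Python. It is a simplified illustration, not Flink's implementation, and the function name, field name, and sample events are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def running_counts(
    records: Iterable[Dict[str, str]], key_field: str, pattern: str
) -> Iterator[Tuple[str, int]]:
    """Yield (key, updated_count) after every record whose key contains `pattern`."""
    counts: Counter = Counter()
    for record in records:
        key = record.get(key_field, "")
        if pattern in key:
            counts[key] += 1
            yield key, counts[key]


# Made-up events standing in for messages read from the source topic
events = [
    {"author": "Albert Einstein"},
    {"author": "Jane Doe"},
    {"author": "Albert Einstein"},
]
changelog = list(running_counts(events, "author", "Einstein"))
print(changelog)  # [('Albert Einstein', 1), ('Albert Einstein', 2)]

# A keyed sink keeps only the latest value per key
latest = dict(changelog)
print(latest)  # {'Albert Einstein': 2}
```

This is why the output of such a query is usually written to a keyed (upsert) sink: an append-only Kafka table cannot consume rows that update earlier results.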


----------------------------------------------------------------------------------------------------------------------------

### Another real-time streaming pipeline example to be studied and implemented individually:

https://github.com/OmarAlSaghier/realtime_analysis_voting_project

----------------------------------------------------------------------------------------------------------------------------

### Conclusion

In this hands-on exercise, students learned how to:

1. Set up and work with Apache Kafka using both CLI and Python.

2. Implement real-time stream processing using Spark-Streaming and PySpark.

3. Use Apache Flink to process data streams and perform real-time SQL analytics.

4. Build an end-to-end real-time streaming pipeline with Kafka, Spark-Streaming, and Flink that ingests data from an external API and processes it in real time.

Together, these steps show how a single Kafka topic can feed two independent stream processors, Spark-Streaming and Flink, each writing its results back to Kafka.