# Big Data Lab Manual 2024 / 25 (Week 4)

## Week 4: Kafka for Real-Time Data Ingestion

Topics covered in this week's lab include:

- Kafka Architecture (Brokers, Zookeeper, Topics, Partitions)
- Producers and Consumers
- Streaming Data from APIs (e.g., OpenWeatherMap)
- Integrating Kafka with Spark Structured Streaming

## Learning Outcomes

- Understand real-time streaming fundamentals with Kafka
- Create Kafka producers and consumers in Azure
- Integrate Kafka with Spark for streaming analytics

## Dataset and URL

- **Dataset**: Real-time data from the [OpenWeatherMap API](https://openweathermap.org/current) (sign up at openweathermap.org to obtain an API key)
  - We simulate streaming by fetching weather data via the API and pushing JSON to Kafka.
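
As a rough illustration of the "fetch, then push JSON" idea, the sketch below builds a request URL for the current-weather endpoint and serialises a reading into the UTF-8 bytes that a Kafka message value would carry. The helper names `build_weather_url` and `to_kafka_value` are hypothetical, the `units` default is an assumption, and the sample reading is made up rather than a real API response. No network call is made here.

```python
import json
from urllib.parse import urlencode

# Current-weather endpoint of the OpenWeatherMap API.
OWM_ENDPOINT = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Return the request URL for one city's current weather."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{OWM_ENDPOINT}?{query}"


def to_kafka_value(record):
    """Serialise a dict to UTF-8 JSON bytes, suitable as a Kafka message value."""
    return json.dumps(record).encode("utf-8")


# Made-up reading, loosely shaped like an API response (values are illustrative).
sample = {"name": "London", "main": {"temp": 12.3, "humidity": 81}}
payload = to_kafka_value(sample)
```

Sending `payload` to a topic is the producer's job, which is covered in the next part of the lab.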

## Lab Environment Setup in Azure

> As in previous weeks, SSH into the Kafka cluster before you proceed. Refer to the earlier labs if you are unsure how to do this.

## Install Kafka & Zookeeper (complete this before proceeding with the labs)

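1. **Download Kafka** (check https://downloads.apache.org/kafka for the current stable version; if 3.9.0 is no longer listed there, substitute the listed version in the commands below)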

In [None]:
!wget https://downloads.apache.org/kafka/3.9.0/kafka_2.12-3.9.0.tgz
!tar -xzf kafka_2.12-3.9.0.tgz
!sudo mv kafka_2.12-3.9.0 /usr/local/kafka
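
The file name in the download command follows the pattern `kafka_<scala version>-<kafka version>.tgz`, so `kafka_2.12-3.9.0.tgz` is Kafka 3.9.0 built for Scala 2.12. As a minimal sketch, the hypothetical helper `kafka_tarball_url` below (not part of Kafka or of this lab's tooling) rebuilds the URL from those two versions, which may help if the stable version has changed.

```python
def kafka_tarball_url(kafka_version, scala_version="2.12",
                      base="https://downloads.apache.org/kafka"):
    """Build the download URL for a Kafka binary release tarball."""
    filename = f"kafka_{scala_version}-{kafka_version}.tgz"
    return f"{base}/{kafka_version}/{filename}"


# Same URL as the wget command above.
url = kafka_tarball_url("3.9.0")

# Assumption: older releases are usually moved to the Apache archive,
# so a different base may be needed for them.
archived = kafka_tarball_url("3.9.0", base="https://archive.apache.org/dist/kafka")
```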

2. **Set Kafka environment variables**:

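First, append the environment variables to your `.bashrc` file so that future shells pick them up, then set them for the current notebook session. Run this cell only once, because each run appends the same lines to `.bashrc` again: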

In [None]:
!echo "export KAFKA_HOME=/usr/local/kafka" >> ~/.bashrc
!echo "export PATH=\$PATH:\$KAFKA_HOME/bin" >> ~/.bashrc

# Export for current session
import os
os.environ["KAFKA_HOME"] = "/usr/local/kafka"
os.environ["PATH"] = os.environ["PATH"] + ":" + os.environ["KAFKA_HOME"] + "/bin"

print("Environment variables set for this session.")
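
Re-running the cell above appends `/usr/local/kafka/bin` to the session `PATH` a second time. A minimal sketch of one way to avoid that, assuming a hypothetical helper named `append_to_path`, is to add the directory only when it is not already listed:

```python
import os


def append_to_path(path_value, new_dir):
    """Return path_value with new_dir appended, unless it is already listed."""
    entries = [p for p in path_value.split(os.pathsep) if p]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)


# Possible usage in the notebook (left commented out so this block has no side effects):
# os.environ["PATH"] = append_to_path(os.environ["PATH"], "/usr/local/kafka/bin")
```

This only covers the current session. The `echo ... >> ~/.bashrc` lines would still need a similar guard if the cell is run repeatedly.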

3. **Install Zookeeper**

In [None]:
!sudo apt-get update
!sudo apt-get install -y zookeeper

4. **Set Zookeeper environment variables**:

In [None]:
!echo "export ZOOKEEPER_HOME=/usr/share/zookeeper" >> ~/.bashrc
!echo "export PATH=\$PATH:\$ZOOKEEPER_HOME/bin" >> ~/.bashrc

# Export for current session
import os
os.environ["ZOOKEEPER_HOME"] = "/usr/share/zookeeper"
os.environ["PATH"] = os.environ["PATH"] + ":" + os.environ["ZOOKEEPER_HOME"] + "/bin"

print("Zookeeper environment variables set for this session.")

5. **Start Zookeeper and Kafka**:

**Note**: The commands below use full paths to avoid `command not found` errors.

In [None]:
# Start Zookeeper first
!sudo /usr/share/zookeeper/bin/zkServer.sh start

# Wait a few seconds to give Zookeeper time to start
!sleep 5

# Start Kafka using the full path
!/usr/local/kafka/bin/kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties

# Wait a few seconds to give Kafka time to start
!sleep 5

print("Zookeeper and Kafka started.")
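
A fixed `sleep 5` only guesses at how long startup takes. One alternative, sketched below, is to poll the service port until it accepts TCP connections. `wait_for_port` is a hypothetical helper; 9092 is the broker port used elsewhere in this lab, and 2181 is assumed to be Zookeeper's client port (its default unless `clientPort` has been changed). An open port shows that the process is listening, which is a reasonable but not complete sign that it is ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Possible usage after starting the services (commented out: needs a running cluster):
# wait_for_port("localhost", 2181)   # Zookeeper, assumed default client port
# wait_for_port("localhost", 9092)   # Kafka broker
```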

6. **Verify processes are running**:

Use `jps` to check that both of the following appear in the output:
- `Kafka` (the Kafka broker process)
- `QuorumPeerMain` (the Zookeeper process)

In [None]:
!jps
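
If you prefer to check the `jps` output programmatically, the sketch below looks for the two expected class names. `missing_processes` is a hypothetical helper, and the sample output uses made-up process IDs. It assumes each `jps` line has the form `<pid> <main class>`.

```python
def missing_processes(jps_output, required=("Kafka", "QuorumPeerMain")):
    """Return the required JVM main-class names that are absent from jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # expected form: "<pid> <main class>"
            running.add(parts[1])
    return [name for name in required if name not in running]


# Illustrative output only; the PIDs are invented.
sample_output = "4821 QuorumPeerMain\n5130 Kafka\n5377 Jps\n"
still_missing = missing_processes(sample_output)
```

An empty list means both processes were found. In the notebook, the real output could be captured with `subprocess.run(["jps"], capture_output=True, text=True).stdout`.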

**You are now ready to proceed with the labs!**

## Create a Kafka Topic

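The command below creates a topic named `weatherTopic` with three partitions and a replication factor of 1 (there is only one broker). It uses the full path to `kafka-topics.sh` so that it runs even if `PATH` is not set: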

In [None]:
!/usr/local/kafka/bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3 --topic weatherTopic
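
Running the create command a second time reports an error because the topic already exists. As a hedged sketch, the hypothetical helper `create_topic_command` below assembles the same arguments and adds the `--if-not-exists` option of `kafka-topics.sh`, so the cell can be re-run safely. The defaults mirror the values used above.

```python
def create_topic_command(topic, partitions=3, replication_factor=1,
                         bootstrap="localhost:9092",
                         script="/usr/local/kafka/bin/kafka-topics.sh"):
    """Build the argument list for creating a topic only if it is absent."""
    if partitions < 1 or replication_factor < 1:
        raise ValueError("partitions and replication_factor must be at least 1")
    return [
        script, "--create", "--if-not-exists",
        "--bootstrap-server", bootstrap,
        "--replication-factor", str(replication_factor),
        "--partitions", str(partitions),
        "--topic", topic,
    ]


cmd = create_topic_command("weatherTopic")

# Possible usage (commented out: needs a running broker):
# import subprocess
# subprocess.run(cmd, check=True)
```

The replication factor cannot exceed the number of brokers, which is why it stays at 1 on this single-broker setup.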

## Verify that the topic was created

In [None]:
!/usr/local/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092

## Next Steps

Now that Kafka and Zookeeper are set up, you can proceed to the next part of the lab, which involves creating producers and consumers for the weather data stream.