---
page_type: sample
languages:
  - azure
products:
  - azure-hdinsight
description: This is a basic example of streaming data to and from Kafka on HDInsight from a Spark on HDInsight cluster.
urlFragment: hdinsight-spark-scala-kafka
---

Use Apache Kafka with Apache Spark on HDInsight

This is a basic example of streaming data to and from Kafka on HDInsight from a Spark on HDInsight cluster. It uses Kafka DStreams (the Spark Streaming connector for Kafka) and expects Kafka and Spark on HDInsight 5.0.
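
A minimal sketch of the read side is shown below, assuming the spark-streaming-kafka-0-10 connector is available to the notebook, that sc is the SparkContext the notebook provides, and that the broker list is a placeholder you replace with your cluster's own brokers:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder broker list; replace with the brokers for your Kafka cluster.
val kafkaBrokers = "wn0-kafka:9092,wn1-kafka:9092"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> kafkaBrokers,
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "tweet-reader",
  "auto.offset.reset" -> "earliest"
)

// `sc` is the SparkContext the Jupyter notebook provides.
val ssc = new StreamingContext(sc, Seconds(10))

// Create a DStream that reads from the `tweets` topic.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Set("tweets"), kafkaParams)
)

// Print a sample of the tweet text received in each micro-batch.
stream.map(record => record.value).print()
ssc.start()
```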

NOTE: Apache Kafka and Spark are available as two different cluster types. HDInsight cluster types are tuned for the performance of a specific technology; in this case, Kafka and Spark. To use both together, you must create an Azure Virtual Network and then create both a Kafka and a Spark cluster on that virtual network. For an example of how to do this using an Azure Resource Manager template, see the azuredeploy.json file in the root of this project.

Use the following button to sign in to Azure and open the template in the Azure portal.

Deploy to Azure

This template creates an Azure Virtual Network, Kafka 2.4.1 on HDInsight 5.0, and Spark 3.1 on HDInsight 5.0.

Understand this example

This example uses a Scala application in a Jupyter notebook. The code in the notebook relies on the following pieces of data:

  • Kafka brokers: The broker process runs on each worker node in the Kafka cluster. The list of brokers is required by the producer component, which writes data to Kafka (see the producer sketch after this list).

  • A Twitter app configuration: The Stream-Tweets-To-Kafka.ipynb notebook uses Twitter to populate data in Kafka. If you do not have a Twitter app set up, create one through the Twitter developer portal.

  • Topic name: The name of the topic that data is written to and read from. This example expects a topic named tweets.
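
For reference, here is a minimal, hypothetical sketch of the producer side using the plain kafka-clients API; the broker addresses are placeholders, and the actual notebook's code may differ:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
// Placeholder broker list; replace with the brokers for your Kafka cluster.
props.put("bootstrap.servers", "wn0-kafka:9092,wn1-kafka:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Write a record to the `tweets` topic.
producer.send(new ProducerRecord[String, String]("tweets", "example tweet text"))
producer.close()
```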

See the Kafka host information section for information on how to obtain the Kafka broker and Zookeeper host information.
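
If you prefer to retrieve the broker list programmatically, the following is a hedged sketch that queries the cluster's Ambari REST API. The endpoint path and JSON field names are assumptions based on Ambari's API rather than something this repository documents, and CLUSTERNAME and PASSWORD are placeholders:

```scala
import java.net.{HttpURLConnection, URL}
import java.util.Base64
import scala.io.Source

val cluster  = "CLUSTERNAME" // placeholder: your Kafka cluster name
val password = "PASSWORD"    // placeholder: the cluster login (admin) password

// Assumed Ambari endpoint for the Kafka broker component.
val endpoint = new URL(
  s"https://$cluster.azurehdinsight.net/api/v1/clusters/$cluster" +
    "/services/KAFKA/components/KAFKA_BROKER")

val conn = endpoint.openConnection().asInstanceOf[HttpURLConnection]
val token = Base64.getEncoder.encodeToString(s"admin:$password".getBytes("UTF-8"))
conn.setRequestProperty("Authorization", s"Basic $token")

val body = Source.fromInputStream(conn.getInputStream).mkString

// Crude extraction of host names from the JSON response; a real notebook
// would use a JSON library rather than a regular expression.
val hostPattern = """"host_name"\s*:\s*"([^"]+)"""".r
val brokers = hostPattern.findAllMatchIn(body).map(_.group(1) + ":9092").mkString(",")
println(brokers)
```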

To run this example

To use the example Jupyter notebooks, you must upload them to the Jupyter Notebook server on the Spark cluster. Use the following steps to upload the notebooks:

  1. In your web browser, use the following URL to connect to the Jupyter Notebook server on the Spark cluster. Replace CLUSTERNAME with the name of your Spark cluster.

     https://CLUSTERNAME.azurehdinsight.net/jupyter
    

    When prompted, enter the cluster login (admin) and password used when you created the cluster.

  2. From the upper right side of the page, use the Upload button to upload the Stream-Tweets-To-Kafka.ipynb file. Select the file in the file browser dialog and select Open.

  3. Find the Stream-Tweets-To-Kafka.ipynb entry in the list of notebooks, and select the Upload button beside it.

  4. Once the file has uploaded, select the Stream-Tweets-To-Kafka.ipynb entry to open the notebook. To load tweets into Kafka, follow the instructions in the notebook.

  5. Repeat steps 1-3 to upload the Spark-Streaming-From-Kafka-With-DStreams.ipynb document to the Jupyter Notebook server. Once the file has uploaded, select the entry to open the notebook. Follow the instructions in the notebook to read the tweets from Kafka; a sketch of this kind of processing appears after these steps.
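
As a rough illustration of the kind of processing the reading notebook performs (not its exact contents), the following continues the earlier read-side sketch, counting hashtags in each micro-batch; `stream` is the DStream created in that sketch:

```scala
// Continues the earlier read-side sketch: `stream` is the DStream of
// Kafka records created with KafkaUtils.createDirectStream.
val hashtagCounts = stream
  .flatMap(record => record.value.split("\\s+")) // split tweet text into words
  .filter(word => word.startsWith("#"))          // keep hashtags only
  .map(tag => (tag, 1))
  .reduceByKey(_ + _)                            // count per micro-batch

hashtagCounts.print()
```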
