## To use this notebook

Jupyter Notebooks allow you to modify and run the code in this document. To run a section (known as a 'cell'), select it and then press CTRL + ENTER, or select the play button on the toolbar above. Each cell already has some example output beneath it, so you can see what the results of running that cell look like.

NOTE: You must run each cell in order, from top to bottom. Running cells out of order can result in an error.

## Requirements

* An Azure Virtual Network
* A Spark on HDInsight 3.6 cluster, inside the virtual network
* A Kafka on HDInsight cluster, inside the virtual network

## Load packages

To use Spark Structured Streaming with Kafka, you must load the `spark-sql-kafka` package. The package must match the versions of Kafka, Spark, and Scala that you are using, and the package name encodes those versions. For example, `spark-sql-kafka-0-10_2.11:2.1.0` works with the following versions:

* Kafka 0.10
* Spark 2.1.0
* Scala 2.11

Run the next cell to load a package that works with Kafka on HDInsight 3.6, and Spark 2.1 on HDInsight 3.6.

In [1]:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0", 
        "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.11"
    }
}


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1504705746198_0006,spark,idle,Link,Link,
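
The naming convention described in this section can be checked mechanically. The following is a minimal sketch in plain Scala (no Spark required) that splits a package coordinate into the Kafka, Scala, and Spark versions it targets. `ConnectorVersions` and `parseConnector` are hypothetical names introduced for illustration; they are not part of Spark or HDInsight.

```scala
// Hypothetical helper types, introduced only for this illustration.
case class ConnectorVersions(kafkaVersion: String, scalaVersion: String, sparkVersion: String)

// Accepts either "spark-sql-kafka-0-10_2.11:2.1.0" or the full Maven
// coordinate "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0".
def parseConnector(coordinate: String): Option[ConnectorVersions] = {
  val Pattern = """(?:[\w.\-]+:)?spark-sql-kafka-(\d+)-(\d+)_(\d+\.\d+):(\d+(?:\.\d+)*)""".r
  coordinate match {
    case Pattern(kafkaMajor, kafkaMinor, scalaV, sparkV) =>
      Some(ConnectorVersions(s"$kafkaMajor.$kafkaMinor", scalaV, sparkV))
    case _ => None // not a spark-sql-kafka coordinate
  }
}

val versions = parseConnector("org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0")
println(versions)
```

A check like this can be useful before changing the `spark.jars.packages` value, because a mismatch between the package and the cluster's Spark or Scala version typically only surfaces later as a class-loading error.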


## Define a schema for the data
When Spark reads from Kafka, the message payload arrives in the `value` column as binary data, so it must be cast to a string before it can be parsed. In this example, each message is a JSON document that describes a tweet. Run the following cell to create a schema for the JSON document structure.

In [2]:
// Import the classes used for declaring schemas and working with JSON data
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Define the structure of the Twitter JSON document that is read from Kafka
// Note, this isn't pretty, but there's some odd behavior where moving .add to 
// a new line causes an error.
val schema = (new StructType).add("created_at", StringType).add("id", LongType).add("id_str", StringType).add("text", StringType).add("source", StringType).add("truncated", BooleanType).add("in_reply_to_status_id", LongType).add("in_reply_to_status_id_str", StringType).add("in_reply_to_user_id", LongType).add("in_reply_to_user_id_str", StringType).add("in_reply_to_screen_name", StringType).add("user", (new StructType).add("id", LongType)
        .add("id_str", StringType)
        .add("name", StringType)
        .add("screen_name", StringType)
        .add("location", StringType)
        .add("url", StringType)
        .add("description", StringType)
        .add("protected", BooleanType)
        .add("verified", BooleanType)
        .add("followers_count", LongType)
        .add("friends_count", LongType)
        .add("listed_count", LongType)
        .add("favourites_count", LongType)
        .add("statuses_count", LongType)
        .add("created_at", StringType)
        .add("utc_offset", IntegerType)
        .add("time_zone", StringType)
        .add("geo_enabled", BooleanType)
        .add("lang", StringType)
        .add("contributors_enabled", BooleanType)
        .add("is_translator", BooleanType)
        .add("profile_background_color", StringType)
        .add("profile_background_image_url", StringType)
        .add("profile_background_image_url_https", StringType)
        .add("profile_background_tile", BooleanType)
        .add("profile_link_color", StringType)
        .add("profile_sidebar_border_color", StringType)
        .add("profile_sidebar_fill_color", StringType)
        .add("profile_text_color", StringType)
        .add("profile_use_background_image", BooleanType)
        .add("profile_image_url", StringType)
        .add("profile_image_url_https", StringType)
        .add("profile_banner_url", StringType)
        .add("default_profile", BooleanType)
        .add("default_profile_image", BooleanType)
        .add("following", StringType)
        .add("follow_request_sent", StringType)
        .add("notifications", StringType)).add("geo", StringType).add("coordinates", StringType).add("place", StringType).add("contributors", StringType).add("is_quote_status", BooleanType).add("retweet_count", LongType).add("favorite_count", LongType).add("entities", (new StructType)
        .add("hashtags", ArrayType((new StructType)
            .add("text", StringType)
            .add("indices", ArrayType(LongType)))).add("urls", ArrayType((new StructType)
            .add("url", StringType)
            .add("expanded_url", StringType)
            .add("display_url", StringType)
            .add("indices", ArrayType(LongType))))
        .add("user_mentions", ArrayType(StringType))
        .add("symbols", ArrayType(StringType))).add("favorited", BooleanType).add("retweeted", BooleanType).add("possibly_sensitive", BooleanType).add("filter_level", StringType).add("lang", StringType).add("timestamp_ms", StringType)

// Uncomment to see a tree view of the schema.
//schema.printTreeString

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
3,application_1504705746198_0007,spark,idle,Link,Link,✔


SparkSession available as 'spark'.
schema: org.apache.spark.sql.types.StructType = StructType(StructField(created_at,StringType,true), StructField(id,LongType,true), StructField(id_str,StringType,true), StructField(text,StringType,true), StructField(source,StringType,true), StructField(truncated,BooleanType,true), StructField(in_reply_to_status_id,LongType,true), StructField(in_reply_to_status_id_str,StringType,true), StructField(in_reply_to_user_id,LongType,true), StructField(in_reply_to_user_id_str,StringType,true), StructField(in_reply_to_screen_name,StringType,true), StructField(user,StructType(StructField(id,LongType,true), StructField(id_str,StringType,true), StructField(name,StringType,true), StructField(screen_name,StringType,true), StructField(location,StringType,true), StructField(url,StringType,true), StructFi...
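
The schema cell keeps its top-level `.add` calls on one line because starting a new line with `.add` caused an error. A likely cause (an assumption, not something this notebook confirms) is that the interpreter treats each complete line as a finished statement, so a following line that begins with `.add` has nothing to attach to. One possible workaround is to wrap the whole expression in parentheses, which keeps the statement open until the closing parenthesis. The sketch below uses a deliberately small, hypothetical `miniSchema` rather than the full tweet schema.

```scala
import org.apache.spark.sql.types._

// The outer parentheses keep the expression open across lines,
// so each .add call can sit on its own line.
val miniSchema = (
  new StructType()
    .add("id", LongType)
    .add("text", StringType)
    .add("user", new StructType()
      .add("name", StringType)
      .add("screen_name", StringType))
)

// Print a tree view to confirm the nesting.
miniSchema.printTreeString()
```

If this behaves as expected in your environment, the same pattern could be applied to the full schema to make it easier to read and maintain.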

## Read the data and apply the schema

In the following cell, replace `YOUR_KAFKA_BROKER_HOSTS` with the broker hosts for your Kafka cluster. To get the broker host information, use one of the following methods:

* From __Bash__ or other Unix shell:

    ```bash
    curl -u admin:$PASSWORD -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/KAFKA/components/KAFKA_BROKER" | jq -r '["\(.host_components[].HostRoles.host_name):9092"] | join(",")' | cut -d',' -f1,2
    ```
    
    Note: This assumes that `$PASSWORD` is set to the password for your HDInsight cluster admin, and that `$CLUSTERNAME` is set to the name of the cluster.

* From __Azure PowerShell__:

    ```powershell
    $creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
    $clusterName = Read-Host -Prompt "Enter the Kafka cluster name"
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/services/KAFKA/components/KAFKA_BROKER" `
        -Credential $creds
    $respObj = ConvertFrom-Json $resp.Content
    $brokerHosts = $respObj.host_components.HostRoles.host_name[0..1]
    ($brokerHosts -join ":9092,") + ":9092"
    ```
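
Both commands above produce the same shape of value: the first two broker host names, each suffixed with port `9092`, joined by commas. The following plain Scala sketch shows that format so you can recognize a well-formed value before pasting it into the notebook. The host names and the `bootstrapServers` helper are hypothetical; real host names come from the cluster's REST API.

```scala
// Hypothetical broker host names, used only to illustrate the format.
val hostNames = Seq(
  "wn0-kafka.example.internal",
  "wn1-kafka.example.internal",
  "wn2-kafka.example.internal"
)

// Take the first `limit` hosts, append the port to each, and join with commas.
def bootstrapServers(hosts: Seq[String], port: Int = 9092, limit: Int = 2): String =
  hosts.take(limit).map(host => s"$host:$port").mkString(",")

val kafkaBrokers = bootstrapServers(hostNames)
println(kafkaBrokers)
```

A value of this form is what the `kafka.bootstrap.servers` option expects in place of `YOUR_KAFKA_BROKER_HOSTS`.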


In [4]:
// Read from the Kafka stream source
val kafka = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "YOUR_KAFKA_BROKER_HOSTS").option("subscribe", "tweets").option("startingOffsets","earliest").load()

/* Select the following columns from the Kafka data:
   * value - the JSON data for a tweet
   Use from_json to apply the schema and store the schematized data in the 'tweet' column
*/
val tweetData=kafka.select(
    from_json(col("value").cast("string"), schema) as "tweet")

// There's a lot of data in the Twitter JSON object. Just grab the tweet ID, user name, and text
val tweetText=tweetData.select("tweet.id",
                               "tweet.user.name",
                               "tweet.text")

println("Finished configuring the fields we want to select from the stream.")


Finished configuring the fields we want to select from the stream.
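
To see what `from_json` and the nested column selection do without waiting on a stream, the same steps can be tried on a small static DataFrame. This is a minimal sketch that assumes the notebook's `spark` session is available; the two-field `sampleSchema` and the sample JSON string are made up for illustration and are not real tweet data.

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import spark.implicits._

// A tiny stand-in for the tweet schema (hypothetical, two fields only).
val sampleSchema = new StructType().add("id", LongType).add("text", StringType)

// One JSON document in a 'value' column, mimicking a Kafka message cast to string.
val sample = Seq("""{"id": 1, "text": "hello"}""").toDF("value")

// Parse the JSON into a struct column named 'tweet', then pull out nested fields.
val parsed = sample.select(from_json(col("value"), sampleSchema).as("tweet")).select("tweet.id", "tweet.text")

parsed.show()
```

Fields that are missing from a document, or documents that fail to parse, come back as nulls rather than raising an error, so checking a sample like this can help catch schema mistakes early.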

## Process the stream

To start processing the stream, write it to a sink. Run the following cell to write the data to the console (cell output). This cell runs for 30 seconds, then displays the results.

In [5]:
// Start writing the stream to the console. Use a timeout so that control is returned to the notebook.
tweetText.writeStream.format("console").start.awaitTermination(30000)

-------------------------------------------
Batch: 0
-------------------------------------------
+------------------+--------------------+--------------------+
|                id|                name|                text|
+------------------+--------------------+--------------------+
|905441423759171585|       Lady Alphonse|16 años :(
¿Pero ...|
|905441422794309632|         Ryukyu-blue|RT @takaatsurit: ...|
|905441423033372673|         aly merrill|RT @priscillux: W...|
|905441422656057345|            Katlyn☮️|RT @INDIEWASHERE:...|
|905441422941323264|                  Mo|RT @JBrewerBoston...|
|905441422915977216|         #DefendDACA|RT @SkyNews: Hurr...|
|905441423322898433|             Rhya🌷✨|RT @flor_demaga: ...|
|905441422534434816|      Franklin Lopez|Directorio de tel...|
|905441423859863552|             Laila|RT @RCI_GP: [IRMA...|
|905441422962253824|            Lanna 🌸|RT @txflxn: Freak...|
|905441422677090305|     Finlay Copeland|RT @PopeQuanPaul:...|
|905441422169571328|   

In [None]:
// Write the stream to HDInsight storage
tweetText.writeStream.format("parquet").option("path","/example/tweets").option("checkpointLocation", "/checkpoint").start.awaitTermination(60000)
// Data is written to WASB or ADL at `/example/tweets`.
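
Both sink cells chain `start.awaitTermination(...)` with a timeout. The timeout only returns control to the notebook; it does not stop the streaming query, which keeps running in the background. The sketch below keeps a handle to the query so that it can be stopped explicitly. It assumes `tweetText` was defined by the earlier cell; `query` and `finished` are names introduced here for illustration.

```scala
// Start the console sink and keep the StreamingQuery handle.
val query = tweetText.writeStream.format("console").start()

// Wait up to 30 seconds. The result is true only if the query
// terminated on its own within that time.
val finished = query.awaitTermination(30000)

// The query is still running after a timeout, so stop it explicitly.
if (!finished) {
  query.stop()
}

println(s"Query active: ${query.isActive}")
```

Stopping the query releases its resources and avoids having several streams read the same topic at once when cells are rerun.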