## To use this notebook

Jupyter Notebooks allow you to modify and run the code in this document. To run a section (known as a 'cell'), select it and press CTRL + ENTER, or select the play button on the toolbar above. Each section already has example output beneath it, so you can see what the results of running a cell look like.

NOTE: You must run each cell in order, from top to bottom. Running cells out of order can result in an error.

## Requirements

* An Azure Virtual Network
* A Spark on HDInsight 3.6 cluster, inside the virtual network
* A Kafka on HDInsight cluster, inside the virtual network

## Load packages

Run the next cell to load the packages required to read from Twitter and write to Kafka.

In [1]:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-streaming_2.11:2.1.0,org.apache.bahir:spark-streaming-twitter_2.11:2.1.0,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0,com.google.code.gson:gson:2.4",
        "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.11"
    }
}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1498682828153_0006,spark,idle,Link,Link,
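
Each entry in `spark.jars.packages` is a Maven coordinate of the form `group:artifact:version`, and Scala artifacts carry a binary-version suffix such as `_2.11` on the artifact name. Packages built for different Scala binary versions are generally not safe to mix. The following is a minimal sketch, in plain Scala, of how a package list could be checked for mixed suffixes. `PackageCheck` and `scalaVersions` are hypothetical helper names for illustration, not part of Spark or HDInsight.

```scala
object PackageCheck {
  // Matches artifact names that end in a Scala binary suffix such as _2.11
  private val ScalaSuffix = """.*_(\d+\.\d+)$""".r

  // Returns the distinct Scala binary versions found in a comma-delimited
  // list of group:artifact:version coordinates. Artifacts without a Scala
  // suffix (plain Java libraries) are ignored.
  def scalaVersions(packages: String): Set[String] =
    packages.split(",").toList
      .map(_.trim.split(":"))
      .collect { case Array(_, ScalaSuffix(version), _) => version }
      .toSet
}
```

If `PackageCheck.scalaVersions` returns more than one version for a package list, the list mixes Scala binary versions and is worth reviewing before starting the session.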


## Setup and configuration

In the next cell, you must provide configuration information for a __Twitter app__ and your __Kafka brokers__.

1. To create a Twitter app, see [https://apps.twitter.com](https://apps.twitter.com). After creating an app, add the __consumer key__, __consumer secret__, __access token__, and __access token secret__ in the next cell.

2. Change the value of `kafkaBrokers` to the Kafka broker hosts for your Kafka cluster. The value should be a comma-delimited list of the hosts, similar to the following example:

        wn0-kafka.liftazhqudlunpo4tkvapo234g.dx.internal.cloudapp.net:9092,wn1-kafka.liftazhqudlunpo4tkvapo234g.dx.internal.cloudapp.net:9092,wn2-kafka.liftazhqudlunpo4tkvapo234g.dx.internal.cloudapp.net:9092
        
    To find the Kafka broker information for your Kafka on HDInsight cluster, you can use the Ambari REST API. The following examples demonstrate how to retrieve this information using the `curl` and `jq` utilities (from Bash) or Windows PowerShell:

    * From __Bash__ or other Unix shell:

        ```bash
        CLUSTERNAME='the name of your HDInsight cluster'
        PASSWORD='the password for your cluster login account'
        curl -u "admin:$PASSWORD" -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/KAFKA/components/KAFKA_BROKER" | jq -r '["\(.host_components[].HostRoles.host_name):9092"] | join(",")'
        ```

    * From __Azure PowerShell__:

        ```powershell
        $creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
        $clusterName = Read-Host -Prompt "Enter the Kafka cluster name"
        $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/services/KAFKA/components/KAFKA_BROKER" `
            -Credential $creds
        $respObj = ConvertFrom-Json $resp.Content
        $brokerHosts = $respObj.host_components.HostRoles.host_name
        ($brokerHosts -join ":9092,") + ":9092"
        ```

3. Run the next cell to configure Twitter and Kafka for this notebook.
    

In [2]:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.kafka._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap
import com.google.gson.Gson

// Twitter configuration
val consumerKey="replace with your consumer key"
val consumerSecret="replace with your consumer secret"
val accessToken="replace with your access token"
val accessTokenSecret="replace with your access token secret"

//Words that we want to filter tweets for.
//Note: You want to use words that are used fairly often on Twitter, otherwise you
//      will not capture many (or any) tweets.
val filters=Array("coffee","hadoop","spark","kafka","xbox","ps4","nintendo")

// Kafka configuration
// kafkaBrokers should contain a comma-delimited list of brokers. For example:
// kafkaBrokers = "wn0-kafka.liftazhqudlunpo4tkvapo234g.dx.internal.cloudapp.net:9092,wn1-kafka.liftazhqudlunpo4tkvapo234g.dx.internal.cloudapp.net:9092,wn2-kafka.liftazhqudlunpo4tkvapo234g.dx.internal.cloudapp.net:9092"
val kafkaBrokers="your Kafka brokers"
val kafkaTopic="tweets"

// Make the Twitter config visible to Twitter4j
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

println("Finished configuring Twitter client")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
3,application_1498682828153_0007,spark,idle,Link,Link,✔


SparkSession available as 'spark'.
Finished configuring Twitter client
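
The `kafkaBrokers` value must be a comma-delimited list of `host:port` entries. As a rough sketch, the same list can be built from broker host names in Scala, and a candidate value can be sanity-checked before it is used. `BrokerList`, `fromHosts`, and `isValid` are hypothetical helper names, and the host names in the usage note are placeholders rather than real cluster hosts.

```scala
object BrokerList {
  // Append the port to each host name and join the entries with commas
  def fromHosts(hosts: Seq[String], port: Int = 9092): String =
    hosts.map(host => s"$host:$port").mkString(",")

  // True when every comma-separated entry has the shape host:port,
  // with a non-empty host and an all-digit port. This checks the format
  // only; it does not verify that the brokers are reachable.
  def isValid(brokers: String): Boolean = {
    val entries = brokers.split(",").map(_.trim)
    entries.nonEmpty && entries.forall { entry =>
      entry.split(":") match {
        case Array(host, port) =>
          host.nonEmpty && port.nonEmpty && port.forall(_.isDigit)
        case _ => false
      }
    }
  }
}
```

For example, `BrokerList.fromHosts(Seq("wn0-kafka.example.internal", "wn1-kafka.example.internal"))` yields a string that passes `BrokerList.isValid`, while the unedited placeholder `"your Kafka brokers"` does not.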

## Start the stream

Run the next cell to begin streaming tweets into Kafka. The cell waits up to 60 seconds while the stream runs, then reports how many tweets were sent.

In [3]:
// Create an accumulator so we can track the number of tweets emitted to Kafka
val numTweets = sc.accumulator(0L,"Tweets sent to Kafka")

// Build the streaming context that reads from Twitter and writes to Kafka
def createStreamingContext(): StreamingContext = {
    // Create the Kafka producer
    val producerProperties = new HashMap[String, Object]()
    producerProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokers)
    producerProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                           "org.apache.kafka.common.serialization.StringSerializer")
    producerProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                           "org.apache.kafka.common.serialization.StringSerializer")
    
    // Set up the streaming context with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // Create the Twitter stream; each tweet is converted to JSON below
    val stream = TwitterUtils.createStream(ssc, None, filters)
    // Write the data to Kafka
    stream.foreachRDD( rdd => {
        rdd.foreachPartition( partition => {
            val producer = new KafkaProducer[String, String](producerProperties)
            // Create one Gson instance per partition rather than one per record
            val gson = new Gson()
            partition.foreach( record => {
                // Convert the data to JSON
                val data = gson.toJson(record)
                val message = new ProducerRecord[String, String](kafkaTopic, null, data)
                // Send the tweet data to Kafka
                producer.send(message)
                // Increment the counter
                numTweets +=1
            })
            producer.close()
        })
    })
    ssc
}

val ssc = StreamingContext.getActiveOrCreate(createStreamingContext)
ssc.start()
// Wait up to 60 seconds for the stream to terminate
ssc.awaitTerminationOrTimeout(60000)
println("Finished writing " + numTweets + " tweets to Kafka")

Finished writing 318 tweets to Kafka
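
The cell above creates one `KafkaProducer` per partition inside `foreachPartition`, sends every record, and then closes the producer. The following is a minimal sketch of that pattern in plain Scala, so it runs without Spark or Kafka. The `Sink` trait is a hypothetical stand-in for the producer, and `PartitionSend` and `sendPartition` are illustrative names. Unlike the cell above, the sketch wraps the sends in `try`/`finally` so the sink is closed even if a send throws.

```scala
object PartitionSend {
  // Hypothetical stand-in for a producer such as KafkaProducer
  trait Sink {
    def send(message: String): Unit
    def close(): Unit
  }

  // Create one sink for the partition, send every record, and always
  // close the sink afterwards. Returns the number of records sent.
  def sendPartition(records: Iterator[String], newSink: () => Sink): Long = {
    val sink = newSink()
    try {
      var sent = 0L
      records.foreach { record =>
        sink.send(record)
        sent += 1
      }
      sent
    } finally {
      sink.close()
    }
  }
}
```

Creating the sink inside the per-partition function keeps it on the worker that uses it, which matters for objects like producers that cannot be serialized and shipped from the driver.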