<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Spark Streaming and </b> <span style="font-weight:bold; color:green">Twitter API</span></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Kafka Topic Setup</a></li>
        <li><a href="#2">Kafka Producer and Twitter API for Python</a></li>
        <li><a href="#3">Kafka Consumer and Spark Streaming Word Count</a></li>
        <li><a href="#4">Run and Test</a></li>
        <li><a href="#5">Sources</a></li>
    </ol>
</div>

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Kafka Topic Setup</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

<p><b>Prerequisite:</b> a Kafka server is installed and running, and the Kafka command-line client tools are available (see the previous class)</p>

<p>In a terminal, create the "tweets-kafka" topic that will hold the tweet stream from the Twitter API</p>

In [None]:
kafka-topics --create --zookeeper localhost:2181 --topic tweets-kafka --partitions 1 --replication-factor 1

<p>Open a new terminal window to run the topic consumer</p>

In [None]:
kafka-console-consumer --zookeeper localhost:2181 --topic tweets-kafka --from-beginning

<p>Open another terminal window, run the topic producer, and type a test message. The message should appear in the consumer terminal</p>

In [None]:
kafka-console-producer --broker-list localhost:9092 --topic tweets-kafka

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Kafka Producer and Twitter API for Python</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

<p>Install the Python packages for working with Kafka and the Twitter API</p>

In [None]:
sudo pip install kafka-python tweepy
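
<p>Before moving on, it can help to confirm that both packages are importable. The snippet below is a minimal sketch, not part of the original setup: the helper name <code>missing_modules</code> is hypothetical, and it only checks that the modules can be found, not that their versions match the code in this notebook.</p>

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # "kafka" is the import name of the kafka-python package
    missing = missing_modules(["kafka", "tweepy"])
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are available")
```

<p>If a module is reported as missing, rerun the install command above for the Python interpreter that will execute the producer script.</p>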

<p>Create a Kafka producer that receives messages from the Twitter API and publishes them to the Kafka broker under the "tweets-kafka" topic</p>

In [None]:
# !!! COPY THE CONTENT OF THIS CELL AND PASTE IT INTO A SEPARATE PY FILE TO RUN IN TERMINAL !!!

# -*- coding: utf-8 -*-

import json

import tweepy
from kafka import KafkaProducer


"""

    KAFKA PRODUCER INIT


"""

producer = KafkaProducer(bootstrap_servers="localhost:9092")
topic_name = "tweets-kafka"



"""

    TWITTER API AUTHENTICATION


"""

consumer_token = "YOUR_CONSUMER_TOKEN"
consumer_secret = "YOUR_CONSUMER_SECRET" 
access_token = "YOUR_ACCESS_TOKEN"
access_secret = "YOUR_ACCESS_SECRET"

auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)



"""

    LISTENER FOR MESSAGES FROM TWITTER


"""

class MoscowStreamListener(tweepy.StreamListener):
    
    def on_data(self, raw_data):

        # Parse the raw JSON message received from the stream
        data = json.loads(raw_data)

        # Long tweets carry the complete text in "extended_tweet";
        # messages without any text field are skipped
        if "extended_tweet" in data:
            text = data["extended_tweet"]["full_text"]
        elif "text" in data:
            text = data["text"]
        else:
            text = None

        if text is not None:

            #print(text)

            # Put the message into Kafka
            producer.send(topic_name, text.encode("utf-8"))


"""

    RUN PROCESSING


"""


# Create instance of custom listener
moscowStreamListener = MoscowStreamListener()

# Set stream for twitter api with custom listener
moscowStream = tweepy.Stream(auth=api.auth, listener=moscowStreamListener)

# Bounding box as [west_lon, south_lat, east_lon, north_lat];
# these values span a large part of Russia, including Moscow
region = [34.80, 49.87, 149.41, 74.13]

# Start filtering messages
moscowStream.filter(locations=region)
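
<p>The text selection done inside the listener can be tried without a Twitter connection or a Kafka broker. The following is a minimal sketch under stated assumptions: the function name <code>extract_tweet_text</code> and the sample messages are made up for illustration, and only the "extended_tweet", "full_text" and "text" fields used in the listener above are taken from the original code.</p>

```python
import json


def extract_tweet_text(raw_data):
    """Return the tweet text from a raw JSON message, or None if it has none."""
    data = json.loads(raw_data)
    # Long tweets carry the complete text in "extended_tweet"
    if "extended_tweet" in data:
        return data["extended_tweet"]["full_text"]
    # Short tweets use "text"; messages without it yield None
    return data.get("text")


# Hypothetical sample messages, not real API output
short_msg = json.dumps({"text": "Hello Moscow"})
long_msg = json.dumps({"text": "Hello...", "extended_tweet": {"full_text": "Hello Moscow, full text"}})
other_msg = json.dumps({"limit": {"track": 1}})

for message in (short_msg, long_msg, other_msg):
    print(extract_tweet_text(message))
```

<p>Keeping this logic in a separate function makes it possible to check the field handling before any message is sent to Kafka.</p>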


<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. Kafka Consumer and Spark Streaming Word Count</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
# !!! COPY THE CONTENT OF THIS CELL AND PASTE IT INTO A SEPARATE PY FILE TO RUN IN TERMINAL !!!

# -*- coding: utf-8 -*-

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


# Kafka parameters
zk_server = "localhost:2181" # Zookeeper Server
topic = "tweets-kafka" # Topic Name


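# Update the running word count: new_values holds the counts from the
# current mini-batch, count_state is the total so far (None for a new word)
def update_total_count(new_values, count_state):
    if count_state is None:
        count_state = 0
    return sum(new_values, count_state)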

# Create Spark Context
sc = SparkContext(appName="KafkaTwitterWordCount")

sc.setLogLevel("OFF")

# Create Streaming Context
ssc = StreamingContext(sc, 10)

# Enable checkpointing (required by updateStateByKey) and set the checkpoint directory
ssc.checkpoint("tmp_spark_streaming1")

# Subscribe to Kafka Topic "tweets-kafka" and create DStream
kafka_stream = KafkaUtils.createStream(ssc, zk_server, "spark-streaming-consumer", {topic: 1})

# Extract just the message values (each record in a mini-batch RDD is a (key, value) pair)
lines = kafka_stream.map(lambda x: x[1])

# Count words for each RDD (mini-batch)
counts = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda x1, x2: x1 + x2)

# Update word counts
total_counts = counts.updateStateByKey(update_total_count)

# Sort by counts
total_counts_sorted = total_counts.transform(lambda x_rdd: x_rdd.sortBy(lambda x: -x[1]))

# Print result
total_counts_sorted.pprint()

# Start Spark Streaming
ssc.start()

# Waiting for termination
ssc.awaitTermination()
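
<p>The stateful counting can be followed step by step without Spark. The code below is a plain-Python sketch that imitates the pipeline on two hand-made mini-batches. It is an illustration only: the helper names <code>count_words</code> and <code>run_batches</code> and the sample lines are assumptions, and the real job distributes this work across RDD partitions.</p>

```python
from collections import defaultdict


def update_total_count(new_values, state):
    """Same update rule as in the streaming job: add new counts to the state."""
    if state is None:
        state = 0
    return sum(new_values, state)


def count_words(lines):
    """Per-batch word count, imitating flatMap + map + reduceByKey."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return dict(counts)


def run_batches(batches):
    """Feed mini-batches through the update rule and return sorted totals."""
    state = {}
    for lines in batches:
        for word, count in count_words(lines).items():
            # After the per-batch reduction there is one value per word
            state[word] = update_total_count([count], state.get(word))
    # Sort by count in descending order, as the streaming job does
    return sorted(state.items(), key=lambda pair: -pair[1])


# Two hypothetical mini-batches of tweet texts
batches = [["spark kafka spark"], ["kafka tweets", "spark"]]
print(run_batches(batches))
# [('spark', 3), ('kafka', 2), ('tweets', 1)]
```

<p>A word that does not occur in a batch keeps its total, because the update rule applied to an empty list of new values returns the previous state unchanged.</p>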

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Run and Test</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

<p>Create two .py files: one for the Kafka producer with the Twitter API connection (kafka_producer_tweets.py) and one for the Spark Streaming word count (spark_streaming_wordcount_kafka.py)</p>

<p>Open three terminals</p>

<p>In the first terminal, launch the console consumer with the command below to check that the Kafka producer delivers tweets</p>

In [None]:
kafka-console-consumer --zookeeper localhost:2181 --topic tweets-kafka --from-beginning

<p>In the second terminal, run the Python file with the Kafka producer. If you enable the commented-out printing in MoscowStreamListener, you will also see the messages in this terminal</p>

In [None]:
python /YOUR_PATH/kafka_producer_tweets.py

<p>Tweets should appear in the consumer terminal. If everything works, you can close the consumer</p>

<p>Now run the Spark Streaming application in the third terminal as follows</p>

In [None]:
spark-submit --master local[2] /YOUR_PATH/spark_streaming_wordcount_kafka.py

<p>Once the application starts, you should see cumulative word counts for the tweets received so far, sorted from the most frequent word down (pprint shows the first 10 entries by default). The list is refreshed every 10 seconds, matching the batch interval</p>

<a name="5"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">5. Sources</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
http://kafka-python.readthedocs.io/en/master/apidoc/KafkaClient.html
https://kafka.apache.org/0100/documentation/#configuration
https://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html