# Twitter Structured Streaming Example

For this first example we'll walk through an article and code published by Nabarun Chakraborti on Medium. He ran his code in Databricks, so we need to adapt it to run in a Jupyter Notebook or any other Python IDE. I made a few adjustments to the structure of the process, and I want to go through them with you because they're foundational to learning how to use this really useful feature. So let's dig in!

### Goal
From a live tweet feed, count the different hashtags that appear in tweets about a specific topic we are interested in.

**Code Source:** https://ch-nabarun.medium.com/easy-to-play-with-twitter-data-using-spark-structured-streaming-76fe86f1f81c

### Prerequisites

1. **Twitter Developer Account** (get the authentication keys):
    - Log in to your developer account: https://developer.twitter.com/en/apps
    - Click on ‘create app’ and provide a name
    - Regenerate the API keys and access tokens. We are going to use these keys in our code to connect to Twitter and get the live feed.
    - Copy all 4 keys: access_token, access_secret_token, consumer_key and consumer_secret_key
        - *Note: consumer_key and consumer_secret_key identify your app, while access_token and access_secret_token identify the account the app acts on behalf of. Treat all four as secrets.*
        
2. **pip installs** (go to your command prompt and type "pip install _____", where the blank is each item listed below):
    - tweepy
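
Once you have the four keys from step 1, one option is to keep them out of the notebook and the listener script by reading them from environment variables. The sketch below is only an illustration: the environment variable names are hypothetical, and the original article does not prescribe how the keys are stored.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_KEYS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret_key": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret_token": "TWITTER_ACCESS_SECRET",
}


def load_twitter_keys(env=None):
    """Return the four keys as a dict, or raise KeyError if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_KEYS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in REQUIRED_KEYS.items()}
```

Calling `load_twitter_keys()` at the top of the listener script fails early with a clear message if a key was not set, instead of failing later during authentication.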
    
### Coding concept
In this example, one Python file (Tweet_Listener_class.py) creates a socket for us, using the 4 authentication keys we created. That code connects to Twitter, extracts the feed and forwards it through a socket or Kafka. For this demonstration I’ve used a socket, but Kafka also works for publishing and consuming. *If you want to use Kafka, you need to install the required packages and start the ZooKeeper service followed by the Kafka server.*

In the next phase of the flow, the Spark Structured Streaming program receives the live feed from the socket or Kafka and performs the required transformations. Finally, we write the transformed data into memory and run our analysis on top of it.

### Program Flow
There are two steps to this process.
1. Tweet_Listener.py (Python script run from the command line)
    - This creates a socket that we can then connect our Jupyter notebook to
2. StreamingTweetData (Spark Structured Streaming in Jupyter)
    - Then we connect to that socket (defined by the host and port number) and read the data from it

##### Purpose of Tweets_Listener class
1. Import all the libraries needed to connect to Twitter, read the tweets and keep them available for streaming.
2. Read each incoming tweet (tweets arrive in JSON format).
3. Retrieve only the actual tweet message and send it to the client socket.
4. Define the host and port. Initialize the socket object and bind the host and port together.
5. Establish the connection with the client.
6. Use the authentication keys (access_token, access_secret_token, consumer_key and consumer_secret_key) to get the live stream data.
7. Filter for tweets that contain specific subjects. In my example I searched for tweets related to ‘corona’. We can pass multiple tracking criteria.
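
Steps 2 through 5 can be sketched with the standard library alone. This is not the author's listener: it leaves out the Twitter connection (steps 6 and 7) and sends hard-coded JSON strings in place of live tweets, and the names `extract_text`, `serve_tweets` and `on_ready` are made up for illustration. It assumes each tweet is a JSON object with a `text` field.

```python
import json
import socket


def extract_text(raw_tweet):
    """Steps 2-3: parse one tweet JSON string and keep only the message."""
    return json.loads(raw_tweet).get("text", "")


def serve_tweets(raw_tweets, host="127.0.0.1", port=3350, on_ready=None):
    """Steps 4-5: bind host and port, wait for one client, send each message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        if on_ready is not None:
            on_ready(server.getsockname()[1])  # report the port actually bound
        client, _address = server.accept()  # blocks until a client connects
        with client:
            for raw in raw_tweets:
                # One message per line, because the reader splits on newlines
                line = extract_text(raw).replace("\n", " ") + "\n"
                client.sendall(line.encode("utf-8"))
```

To try it, call `serve_tweets([json.dumps({"text": "Stay safe #corona"})])` from a script on the command line. The call blocks at `accept()` until the notebook (or any other client) connects to the same host and port.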


In [49]:
# Import your dependencies
import pyspark # run after findspark.init() if you need it
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
# from pyspark.sql.functions import col, split

In [50]:
# Start up your pyspark session as always
# Don't run this more than once
spark = SparkSession.builder.appName("TwitterStream").getOrCreate()
spark

First we need to tell Spark where to read our stream from. The host and port number below must match the ones we set up in our Tweet Listener class Python file.

Here we read the live streaming data from the socket and cast it to a string.

In [39]:
# read the tweet data from socket
tweet_df = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 3350) \
    .load()

# type cast the column value
tweet_df_string = tweet_df.selectExpr("CAST(value AS STRING)")
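
If the stream looks empty later on, it can help to check the socket outside of Spark first. The helper below is a hypothetical debugging aid, not part of the original article: it connects to the same host and port with plain Python and returns the first few lines it receives. It assumes the listener sends newline-terminated UTF-8 text.

```python
import socket


def peek_lines(host="127.0.0.1", port=3350, max_lines=5, timeout=10.0):
    """Connect to the listener socket and return up to max_lines decoded lines."""
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("r", encoding="utf-8", errors="replace") as reader:
            for line in reader:
                lines.append(line.rstrip("\n"))
                if len(lines) >= max_lines:
                    break
    return lines
```

Depending on how the listener is written, it may accept only a single client, so you may need to restart it after this check and before starting the Spark query.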

In [4]:
# This is how it would look for Kafka
# tweet_df = spark \
#   .readStream \
#   .format("kafka") \
#   .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
#   .option("subscribe", "trump") \
#   .load()

# # Type cast the key and column value
# tweet_df_string = tweet_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Then split each line into words on spaces, keep only the hashtag (#) values and count how often each one appears.

In [40]:
# One row per word, keep only words containing '#', then count and sort
tweets_tab = tweet_df_string \
    .withColumn('word', explode(split(col('value'), ' '))) \
    .filter(col('word').contains('#')) \
    .groupBy('word') \
    .count() \
    .sort('count', ascending=False)
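
To see what this transformation produces without a running stream, here is the same logic in plain Python on a few made-up lines. It is only a sketch for building intuition: `count_hashtags` and the sample data are hypothetical, and Spark performs the equivalent steps incrementally on the stream.

```python
from collections import Counter


def count_hashtags(lines):
    """Split each line on single spaces, keep words containing '#', count them."""
    counts = Counter()
    for line in lines:
        for word in line.split(" "):
            if "#" in word:
                counts[word] += 1
    return counts.most_common()  # highest count first


sample_lines = ["Stay safe #corona #health", "Latest numbers #corona", "no tags here"]
print(count_hashtags(sample_lines))
# [('#corona', 2), ('#health', 1)]
```

Because the match is a simple "contains '#'" test on space-separated words, variants such as `#corona,` or `#Corona` are counted as separate hashtags, both here and in the Spark version.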

After that, write the data into memory. Each iteration outputs the full result table (output mode = complete), and the trigger fires every 2 seconds.

This is where the tweet listener actually starts running. Go check out your command line prompt!

In [44]:
writeTweet = tweets_tab.writeStream. \
    outputMode("complete"). \
    format("memory"). \
    queryName("tweetquery"). \
    trigger(processingTime='2 seconds'). \
    start()

print("----- streaming is running -------")

----- streaming is running -------
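
The "complete" output mode used above can be pictured as follows: after every trigger, the whole table of running counts is written out again, not just the rows that changed. The plain-Python sketch below simulates that with made-up micro-batches. It is an illustration of the idea, not how Spark implements it, and `complete_mode_tables` is a hypothetical name.

```python
from collections import Counter


def complete_mode_tables(batches):
    """Yield the full cumulative count table after each micro-batch."""
    totals = Counter()
    for batch in batches:
        totals.update(batch)  # fold the new words into the running counts
        yield dict(totals)    # "complete": emit every row, changed or not


batches = [["#corona", "#health"], ["#corona"], []]
for table in complete_mode_tables(batches):
    print(table)
# {'#corona': 1, '#health': 1}
# {'#corona': 2, '#health': 1}
# {'#corona': 2, '#health': 1}
```

This is why querying `tweetquery` always shows the totals so far: each trigger replaces the in-memory table with the latest full result.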


In [48]:
# Every time you run this cell, there will be fresh data!
# And the streaming keeps running
spark.sql("select * from tweetquery").show()

+----+-----+
|word|count|
+----+-----+
+----+-----+



In [43]:
# Stop the query
writeTweet.stop()

In [31]:
writeTweet.status

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

In [21]:
# True while the query is running (the execution count suggests this cell ran before stop())
writeTweet.isActive

True