# Trending Hashtag Analysis on Twitter live data using Spark Streaming

**Set up the environment, then import SparkContext & StreamingContext from the PySpark library**

In [None]:
import os
import sys
# The local machine and the worker node (i.e. EC2) must run the same Python version; here both use python3.
os.environ["PYSPARK_PYTHON"] = "/bin/python3"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/home/ec2-user/spark-2.4.4-bin-hadoop2.7"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
os.environ["PYTHONIOENCODING"] = "utf8"

In [None]:
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

- Create a **SparkContext** with master "local[2]" and AppName "TwitterStreaming".<br>
- Set the LogLevel of the SparkContext to ERROR, so that INFO and WARN level logs are not printed.<br>
- Create the Spark Streaming Context using sc (the spark context). The parameter 5 is the batch interval in seconds. <br>
The analysis runs every 5 seconds.

In [None]:
sc = SparkContext("local[2]","TwitterStreaming")
sc.setLogLevel('ERROR')
ssc = StreamingContext(sc, 5)

Connect to socket broker using ssc (spark streaming context)<br>
Host : "127.0.0.1" (localhost) & port : 7777 (It can be anything but it has to be same in both the notebooks)

In [None]:
stream_data = ssc.socketTextStream("127.0.0.1", 7777)

ssc.checkpoint("checkpoint-dir")

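The window function's parameter sets the window length in seconds. Each analysis covers the tweets received in the last 10 seconds. Because no slide interval is given, the window slides forward once per batch.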

In [None]:
twitter_data = stream_data.window(10)
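
Spark Streaming expects the window length (and the slide interval, when one is passed) to be a multiple of the batch interval. The helper below is a hypothetical sanity check, not part of PySpark; `check_window` and its parameter names are assumptions made for illustration. With the values used in this notebook's code (batch interval 5, window length 10) it passes.

```python
def check_window(batch_interval, window_length, slide_interval=None):
    """Return (window, slide) in seconds, or raise if they do not fit the batch interval.

    Hypothetical helper: it mirrors the rule that window and slide durations
    must be multiples of the DStream's batch interval.
    """
    # When no slide interval is given, the window slides once per batch.
    if slide_interval is None:
        slide_interval = batch_interval

    for name, value in (("window length", window_length),
                        ("slide interval", slide_interval)):
        if value <= 0 or value % batch_interval != 0:
            raise ValueError(
                "{} of {}s is not a positive multiple of the {}s batch interval"
                .format(name, value, batch_interval)
            )
    return window_length, slide_interval


# Values taken from the cells above: StreamingContext(sc, 5) and window(10).
window, slide = check_window(batch_interval=5, window_length=10)
```

A call such as `check_window(5, 12)` raises `ValueError`, which is a cheaper way to spot the mismatch than waiting for the streaming job to fail.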

### Process the Stream:
1. Receives the tweet messages as lines of text. **Input DStream**
2. Splits the messages into words. **Apply transformation on DStream : flatMap**
3. Keeps only the words that start with a hashtag (#). **transformation : filter**
4. Converts the words to lowercase. **transformation : map**
5. Maps each tag to (tag, 1). **transformation : map**
6. Reduces by key to count the occurrences of each hashtag. **transformation : reduceByKey** (a transformation, not an action). hashtags = **output DStream**
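
The six steps above can be sketched as one function over a DStream. This is a minimal sketch rather than the notebook's original cell: the name `count_hashtags` and the intermediate variable names are assumptions, and splitting on any whitespace with `split()` is a choice made here.

```python
def count_hashtags(lines):
    """Turn a DStream of tweet text into a DStream of (hashtag, count) pairs."""
    # Step 2: split every message into words (on any whitespace).
    words = lines.flatMap(lambda line: line.split())
    # Step 3: keep only the words that start with '#'.
    tags = words.filter(lambda word: word.startswith("#"))
    # Step 4: lowercase, so that #Spark and #spark are counted together.
    lowered = tags.map(lambda tag: tag.lower())
    # Step 5: pair each tag with an initial count of 1.
    pairs = lowered.map(lambda tag: (tag, 1))
    # Step 6: add up the counts per tag within each windowed batch.
    return pairs.reduceByKey(lambda a, b: a + b)
```

In this notebook it would be applied to the windowed stream, for example `hashtags = count_hashtags(twitter_data)`. Every call inside is a transformation, so nothing is computed until the streaming context is started.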

Sort the hashtags based on the counts in decreasing order
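
A DStream has no sort method of its own, so one common approach is to sort the RDD behind each batch through `transform`. The sketch below assumes the input is a DStream of (hashtag, count) pairs; `sort_by_count` is a name chosen here for illustration.

```python
def sort_by_count(counts):
    """Sort every batch of (hashtag, count) pairs by count, largest first."""
    # transform() applies an RDD-to-RDD function to each batch of the DStream;
    # RDD.sortBy() orders the pairs by their second element, the count.
    return counts.transform(
        lambda rdd: rdd.sortBy(lambda pair: pair[1], ascending=False)
    )
```

Applied to the counts from the previous step, this would read `sort_by_count(hashtags)`.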

Print the final analysis: Most popular hashtags on streaming twitter data
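
Printing is the output operation that gives Spark something to execute on every batch. A minimal sketch follows, assuming the sorted (hashtag, count) DStream is available; `TOP_N`, `format_top` and `print_top_hashtags` are hypothetical names, and the cut-off of 10 is an assumption since the text does not say how many tags to show.

```python
TOP_N = 10  # assumed cut-off; the notebook does not specify a number


def format_top(pairs):
    """Render (hashtag, count) pairs as printable lines."""
    return ["{}: {}".format(tag, count) for tag, count in pairs]


def print_top_hashtags(time, rdd):
    """Print the first TOP_N pairs of one batch; meant to be passed to foreachRDD."""
    print("----- Top hashtags at {} -----".format(time))
    # take() brings only the first TOP_N elements of the batch to the driver.
    for line in format_top(rdd.take(TOP_N)):
        print(line)
```

It would be registered with `foreachRDD(print_top_hashtags)` on the sorted DStream. If no custom formatting is needed, the built-in `pprint(10)` method of a DStream prints the first ten elements of every batch instead.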

### Starting the Spark Streaming:
The Spark Streaming code written so far only defines the computation; it will not execute until we start the ssc.<br>
ssc.start() starts the spark streaming context. It plays the role of the action for the whole code. <br>
Spark then builds the lineage and the DAG, and lazily evaluates the whole sequence of transformations for every batch.


**awaitTermination()** keeps the driver process alive while the streaming job runs.<br>
It blocks until the streaming context is stopped or the job fails with an error.<br>
When we interrupt or kill this Python process, the blocked call ends and the Spark Streaming job stops with it.
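
Starting and waiting can be wrapped together so that the job is also shut down cleanly when the notebook kernel is interrupted. This is a sketch under assumptions: `run_until_stopped` is a hypothetical helper, and stopping gracefully on interrupt is a design choice made here, not something the original notebook states.

```python
def run_until_stopped(ssc):
    """Start the streaming context and block until it stops or is interrupted."""
    ssc.start()  # begins receiving data and running the defined pipeline
    try:
        # Blocks the calling thread until the context is stopped or fails.
        ssc.awaitTermination()
    except KeyboardInterrupt:
        # Interrupting the kernel lands here; fall through to the cleanup.
        pass
    finally:
        # 'stopGraceFully' is the spelling PySpark uses for this parameter.
        ssc.stop(stopSparkContext=True, stopGraceFully=True)
```

In the notebook this would be called as `run_until_stopped(ssc)` in the last cell, after at least one output operation (such as printing) has been registered on the stream.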