Lab Assignment 4: Spark MLlib classification algorithms and word count on Twitter streaming
Source Code : https://github.com/Ruthvicp/CS5590_BigDataProgramming/tree/master/Lab/Lab4/Source
Video/Demo : https://youtu.be/42IJBhnslpk
This lab assignment applies Spark MLlib classification algorithms to the given data set and runs a word count on Twitter streaming data.
1. Classification algorithms used: Decision Tree, Naive Bayes, Random Forest
We read the data set and convert the column data from string to float/double type. The classification is based on the columns "Month of Absence", "Day of the week", "Height", "Travel expenses", "Distance", and "Body Mass Index".
We split the data into 70% training and 30% test sets. We then fit the model and predict the results for the test set. From those predictions we calculate the confusion matrix and derive the precision and recall values.
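To make the last step concrete, here is a minimal pure-Python sketch of how precision and recall fall out of the confusion matrix. It is not the lab's Spark code; the helper names `confusion_counts` and `precision_recall` and the sample labels are made up for illustration.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count (actual, predicted) pairs: the confusion matrix in sparse form."""
    return Counter(zip(labels, predictions))


def precision_recall(labels, predictions, positive):
    """Precision and recall for one class, treated as the positive class."""
    counts = confusion_counts(labels, predictions)
    tp = counts[(positive, positive)]
    # Predicted positive but actually another class.
    fp = sum(n for (actual, pred), n in counts.items()
             if pred == positive and actual != positive)
    # Actually positive but predicted as another class.
    fn = sum(n for (actual, pred), n in counts.items()
             if actual == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only, not the lab's data.
actual = [1, 1, 1, 0, 0, 2]
predicted = [1, 1, 0, 0, 1, 2]
print(precision_recall(actual, predicted, positive=1))
```

For a multi-class label the same function can be called once per class, which is roughly what a per-class precision/recall report does.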
We create a VectorAssembler on the input data columns "label (Height)" and "Distance", and use the DecisionTreeClassifier on the indexed label and the indexed features.
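The steps described so far (cast, assemble, index, split 70/30, fit, evaluate) could be wired together roughly as below. This is a sketch under assumptions, not the code from the repository: the function name `train_decision_tree` is hypothetical, the column names follow this report's wording (the raw UCI file uses different headers and a ';' separator), and `maxCategories=4`, `handleInvalid="skip"` and the seed are arbitrary choices.

```python
# Column names as used in this report; adjust them to the headers in your CSV.
FEATURE_COLS = ["Distance"]
LABEL_COL = "Height"


def train_decision_tree(csv_path, sep=";", seed=42):
    """Fit a decision tree and return (model, accuracy on the 30% test split)."""
    # Imported here so the constants above can be reused without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("lab4-decision-tree").getOrCreate()
    df = spark.read.csv(csv_path, header=True, sep=sep)

    # CSV columns arrive as strings; cast the ones we use to double.
    for name in FEATURE_COLS + [LABEL_COL]:
        df = df.withColumn(name, col(name).cast("double"))

    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    # "skip" drops test rows whose label was never seen during training.
    label_indexer = StringIndexer(inputCol=LABEL_COL, outputCol="indexedLabel",
                                  handleInvalid="skip")
    feature_indexer = VectorIndexer(inputCol="features",
                                    outputCol="indexedFeatures", maxCategories=4)
    tree = DecisionTreeClassifier(labelCol="indexedLabel",
                                  featuresCol="indexedFeatures")

    train, test = df.randomSplit([0.7, 0.3], seed=seed)
    pipeline = Pipeline(stages=[assembler, label_indexer, feature_indexer, tree])
    model = pipeline.fit(train)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
    return model, evaluator.evaluate(model.transform(test))
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` or `NaiveBayes` (both in `pyspark.ml.classification`) should leave the rest of the pipeline unchanged.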
For the same data set and columns, we run Naive Bayes to predict absenteeism at work.
We use the RandomForestClassifier for the input columns Height and Distance, and create a VectorAssembler on this indexed data.
The input file for this can be found at https://github.com/Ruthvicp/CS5590_BigDataProgramming/raw/master/Lab/Lab4/Source/Absenteeism_at_work.csv and also at https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
We compare the accuracy and error of the above classifiers and pick the best one, that is, the one with the highest accuracy (equivalently, the lowest error).
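The comparison itself is simple arithmetic: error is taken as 1 minus accuracy, and the best classifier is the one with the largest accuracy. A small sketch, with hypothetical helper names and placeholder numbers that are not the lab's measured values:

```python
def error_rate(accuracy):
    """Error is the complement of accuracy."""
    return 1.0 - accuracy


def best_classifier(accuracies):
    """Return (name, accuracy, error) for the entry with the highest accuracy."""
    name = max(accuracies, key=accuracies.get)
    return name, accuracies[name], error_rate(accuracies[name])


# Placeholder accuracies for illustration only, not results from this lab.
example = {"model_a": 0.80, "model_b": 0.95, "model_c": 0.90}
print(best_classifier(example))
```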
The snapshot for the output for Decision Tree is given below
The snapshot for the output for Naive Bayes is given below
The snapshot for the output for Random Forest is given below
Based on the accuracy for the chosen columns, Decision Tree performs best, with the highest accuracy of 95% and the lowest error of 5%. The confusion matrix is also plotted for all three classifiers and is shown in the output images above.
2. Word count on Twitter streaming data
We open a socket on a host, bind it to an available port, and send the tweet stream over it. Once the receiver (the Spark streaming context) starts listening on the same port, the data is sent across. On this data we perform the word count to get the results.
a) Create a Twitter streaming class that connects to Twitter using the auth credentials
b) Create a listening class that binds to the same host and port
c) Perform the word count on the received data
Bind the socket using "s.bind((host, port))". Accept the connection with "c, addr = s.accept()". Then send the data with "sendData(c)".
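Put together, the bind/accept/send sequence might look like the sketch below. It uses only the standard `socket` module; the function names `open_listener` and `accept_and_send` are hypothetical, and `send_data` stands in for whatever function pushes tweets down the connection.

```python
import socket


def open_listener(host, port):
    """Create a TCP socket bound to (host, port) and start listening."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port while experimenting.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(1)
    return s


def accept_and_send(s, send_data):
    """Block until the receiver connects, then hand the connection to send_data."""
    c, addr = s.accept()
    try:
        send_data(c)
    finally:
        c.close()
    return addr
```

The call to `s.accept()` blocks, so the sender must be started first and the Spark job second; the tweets only start flowing once the Spark receiver has connected.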
Inside on_data(): extract the tweet text and encode it before sending - self.client_socket.send(msg['text'].encode('utf-8'))
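A rough sketch of what such a handler can look like is given below. The class `TweetForwarder` is a hypothetical stand-in that does not inherit from tweepy, so it can be tried without credentials; it assumes each incoming message is a JSON string, and that messages without a 'text' field (for example delete notices) should simply be skipped.

```python
import json


class TweetForwarder:
    """Stand-in for the tweepy listener: forwards each tweet's text to a socket."""

    def __init__(self, client_socket):
        self.client_socket = client_socket

    def on_data(self, data):
        try:
            msg = json.loads(data)
            # Sockets carry bytes, so the text is encoded as UTF-8 before sending.
            self.client_socket.send(msg["text"].encode("utf-8"))
        except (KeyError, ValueError) as exc:
            # Missing 'text' key or malformed JSON: skip this message.
            print("skipped message:", exc)
        return True  # keep the stream open
```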
Set the authentication credentials given below
- consumer_key
- consumer_secret
- access_token
- access_secret
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
In sendData(), use the code below to get the Twitter stream. I have filtered the tweets on the keyword 'fifa'.
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
twitter_stream = Stream(auth, TweetsListener(c_socket))
twitter_stream.filter(track=['fifa'])
To get proper word count results, encode the data on the sending side and decode it on the receiving end. Set the window size and reduce the batch duration of the streaming context so that the word count is produced as quickly as possible.
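On the receiving side, a windowed word count with the Spark Streaming DStream API could be sketched as follows. The function names, the 2-second batch interval and the 20-second window are assumptions chosen for illustration, not the values used in the lab; the window length has to be a multiple of the batch interval.

```python
def split_words(line):
    """Split one decoded line of tweet text on whitespace."""
    return line.split()


def start_word_count(host, port, batch_seconds=2, window_seconds=20):
    """Print word counts from a socket text stream over a sliding window."""
    # Imported here so split_words can be reused without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TwitterWordCount")
    # A short batch interval means counts are printed more often.
    ssc = StreamingContext(sc, batch_seconds)
    # socketTextStream reads UTF-8, newline-delimited text from host:port.
    lines = ssc.socketTextStream(host, port)
    counts = (
        lines.window(window_seconds)
        .flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Because `socketTextStream` already decodes the bytes as UTF-8, the decode step only needs to be written by hand when the data is read from a raw socket instead.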
Screenshot of running the Twitter streaming class:
Screenshot of the word count output on the tweets:
The word count for the fifa tweets is computed from the Twitter stream by creating a streaming context in Spark. We ran the word count on 7237 tweets over a duration of 3 minutes.