## CSCE 676 :: Data Mining and Analysis :: Texas A&M University :: Fall 2019


# Homework 2

- **100 points [10% of your final grade]**
- **Due Saturday, October 19 by 11:59pm**

**Goals of this homework:** This homework has five objectives:

* Become familiar with Apache Spark and working in a distributed environment in the cloud
* Get hands-on experience designing and running a simple MapReduce data transformation job
* Get hands-on experience using Spark built-in functions; namely, LDA and PageRank
* Design a Pregel algorithm to find tree depth in a network
* Understand and implement the Trawling algorithm to find user communities

*Submission instructions:* Post your notebook to ecampus (look for the homework 2 assignment there). Name your submission **your-uin_hw2.ipynb**; for example, my submission would be something like **555001234_hw2.ipynb**. Your notebook should be fully executed when you submit, so run all the cells so that we can see the output. To get started, follow the AWS guide to create a Hadoop/Spark cluster and an empty notebook, copy all the cells from this notebook into the AWS notebook, and continue working there. When you are done, download your notebook from AWS (navigate to the location on S3 where your notebook is saved and click download) and submit it to ecampus.

## Introduction to the Dataset
We will use a dataset of tweets concerning members of the US Congress. The data spans almost a year (from October 3rd, 2018 to September 25th, 2019) and covers 577 members. Any tweet or retweet posted by these 577 members, or directed to them by other Twitter users, was collected.

The data is on S3 in a bucket named s3://us-congress-tweets that you can access. There are 277,744,063 tweets. This is a huge dataset, so we will not work directly on all of it all the time. Rather, we will work on samples or subsets of the data, but in some cases we will ask you to execute your task on the whole dataset.

Below is a summary of all datasets used for this homework:

| Dataset                | Location in S3                                      | Description |
| :---                   | :---                                                | :--- |
| Congress members       | s3://us-congress-tweets/congress_members.csv        | 577 twitter ids and screen names |
| Raw tweets             | s3://us-congress-tweets/raw/\*.snappy               | the whole json objects of the tweets|
| Sample tweets          | s3://us-congress-tweets/congress-sample-10k.json.gz | 10k sample tweets|
| Trimmed tweets         | s3://us-congress-tweets/trimmed/\*.parquet          | selected fields for all tweets|
| User hashtags          | s3://us-congress-tweets/user_hashtags.csv           | all pairs of <user, hashtag>|
| User replies           | s3://us-congress-tweets/reply_network.csv           | all pairs of <reply_tweet, replied_to_tweet> |
| User mentions           | s3://us-congress-tweets/user_mentions.csv           | all pairs of <src_user_id, src_dest_id, frequency> |

Let's run some exploration below!

In [1]:
# First let's read Twitter ids and screen names of the 577 US congress members

congress_members = spark.read.csv("s3://us-congress-tweets/congress_members.csv", header=True)
congress_members.show()
print("Number of congress members tracked:", congress_members.count())

We can use `spark.read.json(...)` without a schema to load the tweets into a dataframe, but this will be slow for two reasons:
* First, it will make one pass over the data to infer a schema, then a second pass to read the content and parse it into the dataframe.
* Second, it will read all the content of the tweet JSON objects, but we only need a few fields for a given task.

Thus, we define our own schema, something like the following:

In [2]:
from pyspark.sql.types import *
import pyspark.sql.functions as F
twitter_date_format="EEE MMM dd HH:mm:ss ZZZZZ yyyy"

user_schema = StructType([
    StructField('created_at',TimestampType(),True),
    StructField('followers_count',LongType(),True),
    StructField('id',LongType(),True),
    StructField('name',StringType(),True),
    StructField('screen_name',StringType(),True)
])

hashtag_schema = ArrayType(StructType([StructField('text',StringType(),True)]))
user_mentions_schema = ArrayType(StructType([StructField('id',LongType(),True),
                                             StructField('screen_name',StringType(),True)]))
entities_schema = StructType([
    StructField('hashtags',hashtag_schema,True),
    StructField('user_mentions',user_mentions_schema,True)
    ])

retweeted_status_schema =StructType([        
        StructField("id", LongType(), True),
        StructField("in_reply_to_user_id", LongType(), True),
        StructField("in_reply_to_status_id", LongType(), True),
        StructField("created_at", TimestampType(), True),
        StructField("user", user_schema)
    ])

tweet_schema =StructType([
        StructField("text", StringType(), True),
        StructField("id", LongType(), True),
        StructField("in_reply_to_user_id", LongType(), True),
        StructField("in_reply_to_status_id", LongType(), True),
        StructField("created_at", TimestampType(), True),
        StructField("user", user_schema),
        StructField("entities", entities_schema),
        StructField("retweeted_status", retweeted_status_schema)
    ])

Now we are ready to read the tweets with `spark.read.json` passing our own schema as follows:

In [3]:
tweets = spark.read.option("timestampFormat", twitter_date_format)\
                   .json('s3://us-congress-tweets/congress-sample-10k.json.gz', tweet_schema)\
                   .withColumn('user_id',F.col('user.id'))
tweets.printSchema()

root
 |-- text: string (nullable = true)
 |-- id: long (nullable = true)
 |-- in_reply_to_user_id: long (nullable = true)
 |-- in_reply_to_status_id: long (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- created_at: timestamp (nullable = true)
 |    |-- followers_count: long (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- screen_name: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- user_mentions: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- screen_name: string (nullable = true)
 |-- retweeted_status: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- in_reply_to_user_id: long (nul

## (6 points) Part 1a: Exploratory Data Analysis (Small Scale)

How many unique users and original tweets (i.e. not retweets) are there? 

In [4]:
# your code here for unique users
tweets.select(tweets.user.id).distinct().count()

In [5]:
# your code here for original tweets
tweets.filter(tweets.retweeted_status.isNull()).count()

Who are the ten most mentioned users in the sample?

In [6]:
# code and output here
# Explode each tweet's mentions into one row per mention, then count and show the top ten
tweets.select(F.explode(tweets.entities.user_mentions.screen_name).alias("mention"))\
      .groupby("mention").count().sort(F.desc("count")).show(10)

What are the top hashtags used?

In [7]:
# code and output here
tweets.select(F.explode(tweets.entities.hashtags.text).alias("hashtag"))\
      .groupby("hashtag").count().sort(F.desc("count"))\
      .show()

## (4 points) Part 1b: Exploratory Data Analysis (Large Scale)
Repeat the above queries but now against the whole dataset defined in the dataframe below. 

In [8]:
trimmed_files = [x[0] for x in spark.read.csv("s3://us-congress-tweets/trimmed/files.txt").collect()]
tweets_all = spark.read.parquet(*trimmed_files)
tweets_all.printSchema()

root
 |-- text: string (nullable = true)
 |-- id: long (nullable = true)
 |-- in_reply_to_user_id: long (nullable = true)
 |-- in_reply_to_status_id: long (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- created_at: timestamp (nullable = true)
 |    |-- followers_count: long (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- screen_name: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- user_mentions: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- screen_name: string (nullable = true)
 |-- retweeted_status: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- in_reply_to_user_id: long (nul

In [9]:
# your code here for unique users
tweets_all.select(tweets_all.user.id).distinct().count()

In [10]:
# your code here for original tweets
tweets_all.filter(tweets_all.retweeted_status.isNull()).count()

In [11]:
# Top mentioned users code and output here
# Explode each tweet's mentions into one row per mention, then count and show the top ten
tweets_all.select(F.explode(tweets_all.entities.user_mentions.screen_name).alias("mention"))\
          .groupby("mention").count().sort(F.desc("count")).show(10)

In [12]:
# Top hashtags code and output here
tweets_all.select(F.explode(tweets_all.entities.hashtags.text).alias("hashtag"))\
      .groupby("hashtag").count().sort(F.desc("count"))\
      .show()

## (10 points) Part 2: Textual Analysis (LDA)
Using the LDA algorithm provided by the Spark Machine Learning (ML) library, find out the ten most important topics. Use `s3://us-congress-tweets/trimmed/*.parquet` for this task (you can reuse the `tweets_all` dataframe from Part 1b).

You may want to work on a small sample first but report your results on the whole dataset.

Hint: for better results, aggregate all tweets for a user into a single document.

In [13]:
# your code here
# Preprocessing - split words, filter out stopwords, group by user ids and aggregate their tweets

# Processing the whole dataset gave an error I could not solve, even with 8 instances;
# the commented-out line below samples 70% of the data as a fallback
# data = tweets_all.sample(False, 0.7)
data = tweets_all
# Raw string so that \s is passed to the regex engine unchanged
user_tweet_words = data.select("user.id", F.split(data.text, r"\s+").alias("text"))

In [14]:
# StopWordsRemover
from pyspark.ml.feature import StopWordsRemover

stopWordsRemover = StopWordsRemover(inputCol="text", outputCol="filteredText")
user_tweet_words = stopWordsRemover.transform(user_tweet_words)

In [15]:
user_tweet_words = user_tweet_words.groupby("id")\
                                   .agg(F.flatten(F.collect_list("filteredText")).alias("aggregated_tweets"))

In [16]:
from pyspark.ml.feature import CountVectorizer

# The maximum and minimum occurrence can be further tuned to get better representative topics
cv = CountVectorizer(inputCol="aggregated_tweets", outputCol="features", maxDF=50, minDF=5)
cvModel = cv.fit(user_tweet_words)
user_tweet_words = cvModel.transform(user_tweet_words)
user_tweet_words.show()

+------+--------------------+--------------------+
|    id|   aggregated_tweets|            features|
+------+--------------------+--------------------+
|  3764|[RT, @AOC:, encou...|      (262144,[],[])|
|  5409|[RT, @justinamash...|      (262144,[],[])|
| 11938|[@kevburkeie, @pk...|      (262144,[],[])|
| 13518|[@JohnCornyn, @Te...|      (262144,[],[])|
| 15663|[Wow,, linked, ar...|      (262144,[],[])|
| 26543|[@LindseyGrahamSC...|      (262144,[],[])|
| 35253|[still, think, su...|(262144,[120635],...|
| 48763|[@RepCummings, ht...|      (262144,[],[])|
| 60033|[@JustinAmphlett,...|      (262144,[],[])|
|193283|[@SenFeinstein, @...|      (262144,[],[])|
|601963|[@hoonable, Auton...|(262144,[67106],[...|
|660523|[@RepThomasMassie...|      (262144,[],[])|
|734203|[RT, @FlipScreen:...|      (262144,[],[])|
|781066|[RT, @JohnFugelsa...|      (262144,[],[])|
|781154|[Horrifically, fa...|(262144,[115104],...|
|794532|[RT, @NeerajKA:, ...|(262144,[165046],...|
|806000|[@IAmChrisCrespo,...|  

In [17]:
from pyspark.ml.clustering import LDA

lda = LDA(k=10, optimizer='em')
ldaModel = lda.fit(user_tweet_words)
topics = ldaModel.describeTopics(10)
topics.show()

+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[53, 54, 55, 70, ...|[0.00111684385496...|
|    1|[16, 42, 44, 46, ...|[0.00280752284532...|
|    2|[11, 29, 30, 24, ...|[0.00218868110020...|
|    3|[3, 4, 5, 6, 7, 9...|[0.00805994202630...|
|    4|[0, 1, 2, 14, 31,...|[0.01515503790438...|
|    5|[8, 35, 62, 66, 7...|[0.00430293329843...|
|    6|[37, 39, 40, 59, ...|[0.00153062096009...|
|    7|[18, 27, 28, 51, ...|[0.00258803530189...|
|    8|[45, 57, 80, 83, ...|[0.00145880523791...|
|    9|[13, 15, 23, 32, ...|[0.00292189398173...|
+-----+--------------------+--------------------+



For each topic, print out 10 words to describe it

In [18]:
# your code here
vocab = cvModel.vocabulary

# Map each topic's term indices back to vocabulary words
topic_rows = topics.collect()
for topic in topic_rows:
    terms = []
    for term_index in topic["termIndices"]:
        terms.append(vocab[term_index].encode('ascii','ignore'))
    print("Topic" + str(topic["topic"]) + " : " + str(terms))

Topic0 : ['"Louie', '"Ratcliffe', '2018-12-22,', "(R-Tyler)'s'", '.@Sororita', '@PVallum', '@BenStanton77', '$8.55', '#FANNIEGATE', '@maat333']
Topic1 : ['GOPArkansas', '@StrengthINumber:', '"Hurd', '(R-San', '"Kenny', '2019-02-20,', '@franceonu', "Antonio)'s'", '@MattDowneyMPD', '@GotTeam:']
Topic2 : ['2019-02-20,', 'load:', '#manandvan', '2018-12-20,', '@screamguitarman:', '#TweetTheMuellerReport', '@helpmelord12:', '#GOPChairwoman', "(R-Weatherford)'s'", '#MFOL']
Topic3 : ['@Rutherford_Inst', '@WEXWatchdog', '@CCHR', '@johnalexwood', '@NMPoliticsnet', '@haussamen', '@soljourno', '@nmdoh', '@POGOBlo', 'Hiring:']
Topic4 : ['@VA8thCDDems', '@AngelCIraq214', '@news_store_com', '@lowkell', 'HARIHAR', '@Republicist1:', '@judgejeffbrown', '"Weber', "(R-Friendswood)'s'", '']
Topic5 : ['@eltonofficial', '@HRW', 'Gangstalking', '@BreakingNews', '@hairlossclinic1', '@glopol_analysis', '@wqbelle:', '@FCriticalThink', '@Cagsil', '@gopmillennials']
Topic6 : ['@Padres', '@MLBStats', '@LMErdosSCR_A

## (10 points) Part 3a: MapReduce
In this task, design a MapReduce program in Python that reads all the original tweets (no retweets) in the sample tweets (`congress-sample-10k.json.gz`) and, if a tweet is a reply to another tweet, outputs a record of the form `<src_id, src_user, dst_id, dst_user>`.

Create a small cluster (2 or 3 nodes) as per the AWS Guide and then ssh to your cluster and use Hadoop streaming to execute your mapreduce program.

Note: the Hadoop streaming jar file can be found at `/usr/lib/hadoop-mapreduce/hadoop-streaming.jar`
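
Before submitting a streaming job, it can help to check the per-line logic locally. The following is a minimal sketch, not a reference solution: `extract_reply` is a hypothetical helper name, the sample lines are invented for illustration, and the field names are taken from the tweet schema defined earlier in this notebook.

```python
import json

def extract_reply(line):
    """Return (src_id, src_user, dst_id, dst_user) for an original reply tweet, else None."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return None  # skip malformed lines
    if not isinstance(tweet, dict):
        return None
    if tweet.get("retweeted_status") is not None:
        return None  # retweets are excluded
    if tweet.get("in_reply_to_status_id") is None:
        return None  # not a reply
    return (tweet["id"], tweet["user"]["id"],
            tweet["in_reply_to_status_id"], tweet.get("in_reply_to_user_id"))

# Invented sample input: one reply, one non-reply, one retweet, one malformed line
sample_lines = [
    '{"id": 2, "user": {"id": 20}, "in_reply_to_status_id": 1, "in_reply_to_user_id": 10}',
    '{"id": 3, "user": {"id": 30}, "in_reply_to_status_id": null, "in_reply_to_user_id": null}',
    '{"id": 4, "user": {"id": 40}, "in_reply_to_status_id": 1, "in_reply_to_user_id": 10, "retweeted_status": {"id": 2}}',
    'not json',
]
records = [r for r in map(extract_reply, sample_lines) if r is not None]
print(records)
```

Only the first sample line should survive the filters. In a streaming mapper, the same function would be applied to each line of `sys.stdin`.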

In [19]:
# # your mapper function

# #!/usr/bin/env python
# import sys
# import json

# def get_tweet(line):
#     # Malformed lines become an empty dict so they are skipped below
#     try:
#         tweet = json.loads(line.strip())
#     except ValueError:
#         tweet = {}

#     return tweet

# for line in sys.stdin:
#     tweet = get_tweet(line)

#     # original tweets
#     if "retweeted_status" not in tweet:
#         # reply tweets
#         if tweet.get("in_reply_to_status_id") is not None:
#             print("<%s, %s, %s, %s>" % (
#                 tweet["id"],
#                 tweet["user"]["id"],
#                 tweet["in_reply_to_status_id"],
#                 tweet["in_reply_to_user_id"]
#             ))

In [20]:
# # your reducer function
# mapreduce.job.reduces=0 (0 reducer, map only)

In [21]:
# # your Hadoop job submission command (copy/paste your command from the terminal)
# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
# -input s3://us-congress-tweets/congress-sample-10k.json.gz \
# -output mapreduce/output \
# -mapper mapper.py \
# -reducer NONE

How many reply relationships did you get?

In [22]:
# # code to read job output and count
# output = spark.read.csv("mapreduce/output")
# output.count()

## (5 points) Part 3b: Going Large-Scale with MapReduce

Rerun the same MapReduce job above but on the whole dataset (`s3://us-congress-tweets/raw/*.snappy`).
All the files under `s3://us-congress-tweets/raw` can be read from the following file:

`s3://us-congress-tweets/raw/files.txt`

Use shell scripting to parse this file and prepare the input to your MapReduce job as a comma-separated string of all the files (e.g. your input should look like this: `s3://us-congress-tweets/raw/part-00000.snappy,s3://us-congress-tweets/raw/part-00001.snappy,s3://us-congress-tweets/raw/part-00002.snappy,...`).

Inspecting the job logs, how many files did the job operate on? How many input splits were there?

In [23]:
# # Your answer here

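# #!/bin/bash

# files='s3://us-congress-tweets/raw/files.txt'
# inputs=''
# file_num=0

# # A plain `< $files` redirect cannot read an s3:// URI, so stream the listing
# # through the AWS CLI (assumed to be installed on the master node)
# while read -r file; do
#     if (( file_num != 0 ))
#     then
#         inputs+=','
#     fi
#     inputs+="$file"
#     ((file_num+=1))
# done < <(aws s3 cp "$files" -)

# echo "$inputs"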

In [None]:
# job logs


How many reply relationships did you get?

In [24]:
# # Number of reply records
# output = spark.read.csv("mapreduce/output")
# output.count()

## (30 points) Part 4: Graph Analysis
In this task, we would like to compute the longest path in each *tweet reply* graph and then perform some statistical calculations on the result. We will use the Pregel implementation from GraphFrames for this task. Ignore paths that are longer than 20.

First, construct your tweet reply network using tweet-reply records in this file `s3://us-congress-tweets/reply_network.csv`.
From this file, use src_id and dst_id. The dst_id is the id of the tweet being replied to and the src_id is the id of the replying tweet.

In [None]:
from graphframes import *
from graphframes.lib import Pregel
sc.setCheckpointDir("hdfs:///tmp/graphframes_checkpoint") # this is needed for any GraphFrames operation

In [25]:
# your network construction code here
# Assumes the CSV has a header row with src_id and dst_id columns
data = spark.read.csv("s3://us-congress-tweets/reply_network.csv", header=True)
data.show()

# GraphFrames requires edge columns named "src" and "dst" and a vertex column named "id"
edges = data.select(F.col("src_id").alias("src"), F.col("dst_id").alias("dst")).cache()
vertices = edges.select(F.col("src").alias("id")).union(edges.select(F.col("dst").alias("id"))).distinct()

edges.show()
vertices.show()
print("# of edges:", edges.count())
print("# of vertices:", vertices.count())

graph = GraphFrame(vertices, edges)

What are the top replied to tweets? (show 20)

In [26]:
# your code here
graph.inDegrees.sort(F.desc("inDegree")).show()

How many graphs are there in the reply network? (Hint: use the `connectedComponents` function)
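
To make the idea of counting connected components concrete, here is a small pure-Python sketch on a toy edge list. It uses a union-find structure rather than GraphFrames, so it only illustrates the quantity being asked for; `count_components` and the toy edges are hypothetical and not part of the assignment.

```python
def count_components(edges):
    """Count connected components of an undirected view of the edge list."""
    parent = {}

    def find(x):
        # Follow parent pointers to the representative, halving the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    return len({find(v) for v in list(parent)})

# Toy reply edges (reply_id, replied_to_id): two separate conversations
toy_edges = [(2, 1), (3, 2), (4, 1), (6, 5)]
num_components = count_components(toy_edges)
print(num_components)
```

Tweets 1 to 4 form one conversation and tweets 5 and 6 form another, so the toy network has two components.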

In [27]:
# your code here
connectedComponents = graph.connectedComponents()
connectedComponents.show()
connectedComponents.select("component").distinct().count()

Now, design and execute a Pregel program that will calculate the longest paths for all reply graphs in the network. Explain your design.
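
One possible design, sketched here in plain Python rather than GraphFrames, is to let every tweet hold the length of the longest reply chain below it. In each superstep, a reply sends its current value plus one to the tweet it replies to, and each tweet keeps the maximum of its own value and the messages it receives. This simulation assumes that each tweet replies to at most one other tweet (so every component is a tree); `longest_reply_paths` and the toy edges are hypothetical. In GraphFrames, the same two phases would presumably map onto the message-sending and message-aggregation steps of `graphframes.lib.Pregel`.

```python
def longest_reply_paths(edges, max_iter=20):
    """Simulate Pregel supersteps; edges are (src, dst) = (reply, replied-to tweet)."""
    vertices = {v for edge in edges for v in edge}
    depth = {v: 0 for v in vertices}  # longest reply chain found below each tweet so far

    for _ in range(max_iter):
        # Message phase: each reply sends depth + 1 to its parent, aggregated with max
        inbox = {}
        for src, dst in edges:
            inbox[dst] = max(inbox.get(dst, 0), depth[src] + 1)

        # Update phase: keep the larger of the old value and the aggregated message
        new_depth = {v: max(depth[v], inbox.get(v, 0)) for v in vertices}
        if new_depth == depth:
            break  # no vertex changed, so the computation has converged
        depth = new_depth

    # Roots are tweets that never reply to anything; their value is the tree depth
    repliers = {src for src, _ in edges}
    return {v: depth[v] for v in vertices if v not in repliers}

toy_edges = [(2, 1), (3, 2), (4, 1), (6, 5)]
root_depths = longest_reply_paths(toy_edges)
average_depth = sum(root_depths.values()) / len(root_depths)
print(root_depths, average_depth)
```

Note that with `max_iter=20` a value can never exceed 20, so longer chains are truncated at 20 in this sketch rather than dropped; filtering them out would be a separate step.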

In [28]:
# your pregel code here


What is the average longest path length for all reply graphs in the network?

In [29]:
# your code here

## (30 points) Part 5: Community Detection
User-hashtag relations have been extracted and saved in the file `s3://us-congress-tweets/user_hashtags.csv`. If a user uses a hashtag there will be a record with the userid and the hashtag.

Use the Trawling algorithm discussed in class to find potential user communities in the dataset. (Hint: use FPGrowth in the Spark ML package). Explore different values for the support parameter.
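
As a rough illustration of the idea, the sketch below treats each hashtag as a basket whose items are the users who used it. A group of `community_size` users that appears together in at least `min_support` baskets corresponds to a complete bipartite subgraph between those users and those hashtags. This is a brute-force toy version with hypothetical names and invented data; on the real dataset, a frequent-itemset miner such as FPGrowth would take the place of the enumeration.

```python
from collections import defaultdict
from itertools import combinations

def trawl(user_hashtag_pairs, min_support, community_size):
    """Return {user_group: shared_hashtags} for groups sharing at least min_support hashtags."""
    # One basket per hashtag, holding the set of users who used it
    baskets = defaultdict(set)
    for user, hashtag in user_hashtag_pairs:
        baskets[hashtag].add(user)

    # For every candidate group of users, collect the hashtags they all used
    shared = defaultdict(set)
    for hashtag, users in baskets.items():
        for group in combinations(sorted(users), community_size):
            shared[group].add(hashtag)

    return {group: sorted(tags) for group, tags in shared.items()
            if len(tags) >= min_support}

# Invented toy data: u1 and u2 share two hashtags, every other pair shares at most one
pairs = [("u1", "#a"), ("u2", "#a"), ("u3", "#a"),
         ("u1", "#b"), ("u2", "#b"),
         ("u3", "#c"), ("u4", "#c")]
communities = trawl(pairs, min_support=2, community_size=2)
print(communities)
```

Raising `min_support` yields fewer but more tightly connected groups, which is the trade-off to explore when choosing the support value.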

In [30]:
# your code here. Explain all steps.

List two user communities you think are interesting. Explain why they are reasonable communities.

You can use https://twitter.com/intent/user?user_id=? to find out more info about the users

In [31]:
# community 1

In [32]:
# community 2

What value for support did you choose and why?

In [33]:
# Answer here

## (10 points) Part 6: Personalized PageRank
Assume you are given a task to recommend Twitter users for the speaker of the House to engage with.

Construct a user-mentions network using relations in `s3://us-congress-tweets/user_mentions.csv`

Run Personalized PageRank with source (id=15764644) and find out top accounts to recommend.
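
For intuition about what Personalized PageRank computes, here is a minimal pure-Python power iteration on a toy graph. It is a sketch under simplifying assumptions: edges are unweighted (the `frequency` column is ignored), dangling nodes return all their mass to the source, and the names and toy edges are hypothetical. The GraphFrames implementation may normalize scores differently, so only the relative ordering is meaningful here.

```python
def personalized_pagerank(edges, source, reset_prob=0.15, max_iter=50):
    """Random walk that jumps back to `source` with probability reset_prob at every step."""
    nodes = {n for edge in edges for n in edge} | {source}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(max_iter):
        next_rank = {n: 0.0 for n in nodes}
        for node, mass in rank.items():
            if out_links[node]:
                # Follow an outgoing mention with probability 1 - reset_prob
                share = (1.0 - reset_prob) * mass / len(out_links[node])
                for dst in out_links[node]:
                    next_rank[dst] += share
                next_rank[source] += reset_prob * mass
            else:
                next_rank[source] += mass  # dangling node: restart at the source
        rank = next_rank
    return rank

# Toy mention edges (mentioning_user, mentioned_user)
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
scores = personalized_pagerank(toy_edges, source="a")
recommended = sorted((n for n in scores if n != "a"), key=scores.get, reverse=True)
print(recommended)
```

Accounts that are reachable from the source through many short mention paths score highest, which is why the source itself is excluded before ranking recommendations.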

In [34]:
# your network construction code here
edges = spark.read.csv("s3://us-congress-tweets/user_mentions.csv", header=True)
vertices = spark.read.csv("s3://us-congress-tweets/congress_members.csv", header=True)

edges.show()
vertices.show()

graph = GraphFrame(vertices, edges)

In [35]:
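# your Personalized PageRank code here
# Rebuild the graph so that its vertices are the users that appear as an edge source
graph = GraphFrame(graph.outDegrees, edges)
pageranks = graph.pageRank(resetProbability=0.15, maxIter=10, sourceId="15764644")
# pageRank returns a GraphFrame, so inspect its vertices rather than calling show() on it
pageranks.vertices.sort(F.desc("pagerank")).show()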

In [36]:
# Top 10 accounts to recommend 
# You can use https://twitter.com/intent/user?user_id=? to find out more info about the users

# Troubleshooting Tips

* If you get a "spark not available" error, this most likely means the kernel is Python and not PySpark. Change the kernel to PySpark and it should work.


* If your notebook seems stuck (this may happen if you force stop a cell), you may need to ssh to your master node and kill the Spark application associated with the notebook.
    Use `yarn application -list` to find the application id and then `yarn application -kill app-id` to kill it. After that, restart your notebook from the browser.


* If you like, you may also ssh to the master node and run `pyspark` and execute your code directly in the shell.

* If you have difficulty accessing the job web pages (for example, to see logs), you can open the needed ports (e.g. 8088) when you create the cluster.

* If you want to see logs for a MapReduce job from the terminal use the following command:

    `yarn logs -applicationId <application_id>`


* To kill a MapReduce job use:

    `yarn application -kill <application_id>`