# Project 03 - Due Monday, November 13 at 12pm

*Objectives*: Use Spark to process and perform basic analysis on non-relational data, including its DataFrame and SQL interfaces.

*Grading criteria*: The tasks should all be completed, and questions should all be answered with Python code, SQL queries, shell commands, and markdown cells.  The notebook itself should be completely reproducible (using AWS EC2 instance based on the provided AMI) from start to finish; another person should be able to use the code to obtain the same results as yours.  Note that you will receive no more than partial credit if you do not add text/markdown cells explaining your thinking when appropriate.

*Attestation*: **Work in groups**.  At the end of your submitted notebook, identify the work each partner performed and attest that each contributed substantially to the work.

*Deadline*: Monday, November 13, 12pm.  One member of each group must submit your notebook to Blackboard; you should not submit it separately.

### Group Members:
* Guangyu Xing
* Jiwei (Wayne) Zeng
* Hangman (Agnes) Jiang
* Pei-Hsuan Hsia

## Part 1 - Setup

Begin by setting up Spark and fetching the project data.  

**Note**: you may want to use a larger EC2 instance type than normal.  This project was prepared using a `t2.xlarge` instance.  The larger the instance, the higher the per-hour charge, so remember to shut your instance down when you're done, as always.

### About the data

We will use JSON data from Twitter; we saw an example of this in class.  It should parse cleanly, allowing you to focus on analysis.

This data was gathered using GWU Libraries' [Social Feed Manager](http://sfm.library.gwu.edu/) application during a recent game of the MLB World Series featuring the Los Angeles Dodgers and Houston Astros.  This first file tells you a little bit about how it was gathered:

In [1]:
!wget https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611-README.txt

--2017-11-13 15:49:57--  https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611-README.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.0.67
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.0.67|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1920 (1.9K) [text/plain]
Saving to: ‘9670f3399f774789b7c3e18975d25611-README.txt’


2017-11-13 15:49:57 (169 MB/s) - ‘9670f3399f774789b7c3e18975d25611-README.txt’ saved [1920/1920]



In [2]:
!cat 9670f3399f774789b7c3e18975d25611-README.txt

This is an export created with Social Feed Manager.

EXPORT INFORMATION
Selected seeds: All seeds
Export id: 9670f3399f774789b7c3e18975d25611
Export type: twitter_filter
Format: Full JSON
Export completed:  Oct. 30, 2017, 11:21:04 p.m. EDT
Deduplicate: No

COLLECTION INFORMATION
Collection name: test set for world series
Collection id: 34e3f7460b5c4df09d64a1e61fd81238
Collection set: mlb-test (collection set id d6e8c27b1bc942e78790aa55a82b3a7a)
Harvest type: Twitter filter
Collection description: running for just one hour, just for fun.

Harvest options:
Media: No
Web resources: No

Seeds:
* Track: dodgers,astros - Active

Change log:

Change to test set for world series (collection) on Oct. 30, 2017, 10:59:56 p.m. EDT by dchud:
* is_active: "True" changed to "False"

Change to test set for world series (collection) on Oct. 30, 2017, 10:58:51 p.m. EDT by dchud:
* is_on: "True" changed to "False"

Change to test set for world series (collection) on Oct. 2

The most important pieces in that metadata are:

 * It tracked tweets that mentioned "dodgers" or "astros".  Every item in this set should refer to one or the other, or both.
 * This data was not deduplicated; we may see individual items more than once.
 * Data was collected between October 29 and October 30.  Game 5 of the Series was played during this time.
 
You should not need to know anything about baseball to complete this assignment.

**Please note**: sometimes social media data contains offensive material.  This data set has not been filtered; if you do come across something inappropriate, please do your best to ignore it if you can.

## Fetch the data

The following files are available:

 * https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611_003.json
 * https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611_004.json
 * https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611_005.json
 * https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611_006.json
 
### Q1.1 - Select at least one and obtain it using `wget`.  Verify the file sizes using the command line.

Each file should contain exactly 100,000 tweets.  

*Note*: you are only required to use one of these files, but you may use more than one.  It will be easier to process more data if you use a larger EC2 instance type, as suggested above.  Use the exact same set of files throughout the assignment.

**Answer**

In [3]:
!wget https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611_003.json

--2017-11-13 15:50:46--  https://s3.amazonaws.com/2017-dmfa/project-3/9670f3399f774789b7c3e18975d25611_003.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.72.98
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.72.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 595711407 (568M) [application/json]
Saving to: ‘9670f3399f774789b7c3e18975d25611_003.json’


2017-11-13 15:50:54 (81.4 MB/s) - ‘9670f3399f774789b7c3e18975d25611_003.json’ saved [595711407/595711407]



In [4]:
!mv 9670f3399f774789b7c3e18975d25611_003.json tweets.json

In [5]:
!wc -l tweets.json

100000 tweets.json
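
The same check can also be done from Python, which reports the byte size alongside the line count. This is a minimal sketch, assuming one tweet per line; the helper name `file_stats` and the tiny stand-in file are illustrative and not part of the project data.

```python
import os
import tempfile


def file_stats(path):
    """Return (size_in_bytes, line_count) for a newline-delimited file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        lines = sum(1 for _ in fh)
    return size, lines


# Demonstration on a small stand-in file (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(b'{"id_str": "1"}\n{"id_str": "2"}\n')
    demo_path = tmp.name

demo_size, demo_lines = file_stats(demo_path)
os.remove(demo_path)
print(demo_size, demo_lines)
```

Calling `file_stats("tweets.json")` in the notebook should report 595711407 bytes and 100000 lines if the file matches the download shown above.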


For your reference, here is the text of one Tweet, randomly selected from one of these files.  You might wish to study its structure and refer to it later.

In [6]:
!cat *.json | shuf -n 1 > example-tweet.json

In [7]:
import json
print(json.dumps(json.load(open("example-tweet.json")), indent=2))

{
  "truncated": false,
  "id_str": "924871509230964736",
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "retweet_count": 0,
  "contributors": null,
  "in_reply_to_user_id_str": null,
  "favorite_count": 0,
  "reply_count": 0,
  "favorited": false,
  "coordinates": null,
  "user": {
    "id_str": "29844665",
    "profile_use_background_image": true,
    "time_zone": "Quito",
    "geo_enabled": true,
    "notifications": null,
    "profile_text_color": "666666",
    "default_profile_image": false,
    "profile_background_tile": false,
    "is_translator": false,
    "friends_count": 206,
    "location": "Cottondale, Fl",
    "contributors_enabled": false,
    "translator_type": "none",
    "profile_link_color": "2FC2EF",
    "screen_name": "i5ainti",
    "following": null,
    "listed_count": 6,
    "followers_count": 183,
    "follow_request_sent": null,
    "profile_background_image_url": "http://abs.twimg.com/images/themes/them

You can find several key elements in this example: the text, time, and language of the tweet, whether it was a reply to another user, the user's screen name along with their primary language, and other account information like creation date, follower/friend/tweet counts, and perhaps their location.  If there are hashtags, user mentions, or urls in the tweet, they will appear in the `entities` section; these are not present in every tweet.  If this is a retweet, you will see the original tweet and its information nested within.
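
To make that structure concrete, here is a small sketch that pulls those elements out of a parsed tweet. The helper `summarize_tweet` and the sample dictionary are hypothetical; the field names follow the example tweet and the schema printed later in this notebook.

```python
def summarize_tweet(tweet):
    """Collect the key fields of a parsed tweet into a flat dictionary."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    return {
        "text": tweet.get("text"),
        "lang": tweet.get("lang"),
        "created_at": tweet.get("created_at"),
        "reply_to": tweet.get("in_reply_to_screen_name"),
        "screen_name": user.get("screen_name"),
        "followers": user.get("followers_count"),
        # Hashtags are optional, so fall back to an empty list.
        "hashtags": [h["text"] for h in entities.get("hashtags") or []],
        # A retweet carries the original tweet under "retweeted_status".
        "is_retweet": tweet.get("retweeted_status") is not None,
    }


# Hypothetical, heavily shortened tweet used only for illustration.
sample_tweet = {
    "text": "What a game! #WorldSeries",
    "lang": "en",
    "in_reply_to_screen_name": None,
    "user": {"screen_name": "example_user", "followers_count": 183},
    "entities": {"hashtags": [{"text": "WorldSeries", "indices": [13, 25]}]},
}

summary = summarize_tweet(sample_tweet)
print(summary)
```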

### Q1.2 - Start up Spark, and verify the file sizes.

We will use our normal startup sequence here:

In [8]:
import os

In [9]:
os.environ['SPARK_HOME'] = '/usr/local/lib/spark'

In [10]:
import findspark

In [11]:
findspark.init()

In [12]:
from pyspark import SparkContext

In [13]:
spark = SparkContext(appName='project-03')

In [14]:
spark

In [15]:
from pyspark import SQLContext

In [16]:
sqlc = SQLContext(spark)

In [17]:
sqlc

<pyspark.sql.context.SQLContext at 0x7f3050ca7f98>

In [18]:
tweets = sqlc.read.json("tweets.json")

Verify that Spark has loaded the same number of tweets you saw before:

**Answer**

In [19]:
tweets.count()

100000

## Part 2 - Comparing DataFrames and Spark SQL

For the next three questions, we will look at operations using both DataFrames and SQL queries. Note that `tweets` is already a DataFrame:

This data was not deduplicated. To deal with this, we first drop the duplicate records and reassign the result to `tweets`, which we use to answer all of the following questions. The deduplicated DataFrame has 99,855 rows.

In [20]:
tweets = tweets.dropDuplicates()

In [21]:
tweets.count()

99855
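
Called without arguments, `dropDuplicates()` removes only rows that are identical in every column; PySpark also accepts a list of column names if duplicates should be judged on a subset such as `id_str`. The plain-Python sketch below illustrates the difference between the two rules on hypothetical records. It is not the Spark implementation, and the helper names are made up for this example.

```python
import json


def drop_exact_duplicates(records):
    """Keep the first copy of each record that is identical in every field."""
    seen, unique = set(), []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def drop_duplicate_ids(records):
    """Keep the first record seen for each id_str, ignoring other fields."""
    seen, unique = set(), []
    for rec in records:
        if rec["id_str"] not in seen:
            seen.add(rec["id_str"])
            unique.append(rec)
    return unique


# Hypothetical records: two exact copies plus one copy with a changed count.
records = [
    {"id_str": "1", "retweet_count": 0},
    {"id_str": "1", "retweet_count": 0},
    {"id_str": "1", "retweet_count": 3},
]

exact = drop_exact_duplicates(records)
by_id = drop_duplicate_ids(records)
print(len(exact), len(by_id))
```

Which rule is appropriate depends on whether two captures of the same tweet with different counters should count as one tweet; this notebook uses the exact-match rule.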

In [22]:
tweets.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |

In [23]:
tweets

DataFrame[contributors: string, coordinates: struct<coordinates:array<double>,type:string>, created_at: string, display_text_range: array<bigint>, entities: struct<hashtags:array<struct<indices:array<bigint>,text:string>>,media:array<struct<display_url:string,expanded_url:string,id:bigint,id_str:string,indices:array<bigint>,media_url:string,media_url_https:string,sizes:struct<large:struct<h:bigint,resize:string,w:bigint>,medium:struct<h:bigint,resize:string,w:bigint>,small:struct<h:bigint,resize:string,w:bigint>,thumb:struct<h:bigint,resize:string,w:bigint>>,source_status_id:bigint,source_status_id_str:string,source_user_id:bigint,source_user_id_str:string,type:string,url:string>>,symbols:array<struct<indices:array<bigint>,text:string>>,urls:array<struct<display_url:string,expanded_url:string,indices:array<bigint>,url:string>>,user_mentions:array<struct<id:bigint,id_str:string,indices:array<bigint>,name:string,screen_name:string>>>, extended_entities: struct<media:array<struct<display_

To issue SQL queries, we need to register a table based on `tweets`:

In [24]:
tweets.createOrReplaceTempView("tweets")

Verify that the number of rows in the registered table matches the deduplicated count:

In [25]:
sqlc.sql("SELECT COUNT(*) FROM tweets").show()

+--------+
|count(1)|
+--------+
|   99855|
+--------+



### Q2.1 - Which 10 languages are most commonly used in tweets?  Verify your result by executing it with both the dataframe and with SQL.

Hint: for the dataframe, use `groupBy`, `count`, and `orderBy`.  See the documentation at https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html for details on these and other functions.

**Answer**

According to the data dictionary for this dataset, `lang` is the machine-detected language of the tweet text, while `user.lang` is the language the user has set for their account. Since the question asks about the language used in the tweets themselves, we use `lang` in this part.

**DataFrame**

In [26]:
tweets.groupBy("lang") \
      .count() \
      .orderBy("count", ascending = False) \
      .show(10)

+----+-----+
|lang|count|
+----+-----+
|  en|88867|
|  es| 6808|
| und| 3053|
|  in|  210|
|  fr|  181|
|  pt|  133|
|  nl|   89|
|  ht|   83|
|  ja|   80|
|  tl|   77|
+----+-----+
only showing top 10 rows



**SQL**

In [27]:
sqlc.sql("""
    SELECT lang, COUNT(*) 
    FROM tweets
    GROUP BY lang
    ORDER BY COUNT(*) DESC
""").show(10)

+----+--------+
|lang|count(1)|
+----+--------+
|  en|   88867|
|  es|    6808|
| und|    3053|
|  in|     210|
|  fr|     181|
|  pt|     133|
|  nl|      89|
|  ht|      83|
|  ja|      80|
|  tl|      77|
+----+--------+
only showing top 10 rows



The DataFrame and SQL results agree on the ten most commonly used languages in tweets.
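
For readers new to the `groupBy`, `count`, `orderBy` pipeline, the following plain-Python sketch shows what it computes, using `collections.Counter` on a handful of hypothetical records. It only illustrates the logic and does not use Spark.

```python
from collections import Counter


def top_values(records, field, n=10):
    """Count how often each value of `field` occurs and return the top n."""
    counts = Counter(rec.get(field) for rec in records)
    return counts.most_common(n)


# Hypothetical records standing in for tweets.
sample_records = [
    {"lang": "en"}, {"lang": "es"}, {"lang": "en"},
    {"lang": "und"}, {"lang": "en"}, {"lang": "es"},
]

top_langs = top_values(sample_records, "lang", n=2)
print(top_langs)  # [('en', 3), ('es', 2)]
```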

### Q2.2 - Which 10 time zones are most common among users?  Verify your result with both the dataframe and SQL.

*Note*: for this question, you may leave NULL values present in your results, as a way to help you understand what data is present and what is missing.

**Answer**

**DataFrame**

In [28]:
tweets.groupBy("user.time_zone") \
      .count() \
      .orderBy("count", ascending = False) \
      .show(10, False)

+---------------------------+-----+
|time_zone                  |count|
+---------------------------+-----+
|null                       |42169|
|Central Time (US & Canada) |17420|
|Pacific Time (US & Canada) |17062|
|Eastern Time (US & Canada) |8656 |
|Arizona                    |2486 |
|Mountain Time (US & Canada)|2472 |
|Atlantic Time (Canada)     |1048 |
|Caracas                    |1005 |
|Hawaii                     |820  |
|Mexico City                |791  |
+---------------------------+-----+
only showing top 10 rows



**SQL**

In [29]:
sqlc.sql("""
    SELECT user.time_zone, COUNT(*)
    FROM tweets
    GROUP BY user.time_zone
    ORDER BY COUNT(*) DESC
""").show(10, False)

+---------------------------+--------+
|time_zone                  |count(1)|
+---------------------------+--------+
|null                       |42169   |
|Central Time (US & Canada) |17420   |
|Pacific Time (US & Canada) |17062   |
|Eastern Time (US & Canada) |8656    |
|Arizona                    |2486    |
|Mountain Time (US & Canada)|2472    |
|Atlantic Time (Canada)     |1048    |
|Caracas                    |1005    |
|Hawaii                     |820     |
|Mexico City                |791     |
+---------------------------+--------+
only showing top 10 rows



The DataFrame and SQL results agree on the ten most common time zones among users.

### Q2.3 - How many tweets mention the Dodgers?  How many mention the Astros?  How many mention both?

You may use either the dataframe or SQL to answer.  Explain why you have chosen that approach.

Hint:  you will want to look at the value of the `text` field.

**Answer**

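We match case-insensitively, and we treat Astros/Dodgers and Astro/Dodger as the same.  
We tried to compare the speed of the DataFrame and SQL interfaces with `%time`. The comparison is not conclusive: in the SQL cell, `%time` sits alone on its own line, so the reported 5.01 µs measures an empty statement rather than the query (the cell magic `%%time` would time the whole cell). Since both interfaces return the same count, we use SQL for the rest of this question.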
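
A timing comparison like this is easy to get wrong, partly because Spark transformations are lazy and only actions such as `count()` or `show()` trigger work. One way to avoid relying on notebook magics is a small wrapper around `time.perf_counter`. The sketch below uses a hypothetical helper `timed` and a stand-in workload so that it runs without Spark.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be a Spark action.
result, seconds = timed(lambda: sum(range(1_000_000)))
print(result, f"{seconds:.4f} s")
```

In this notebook the callable could be, for example, `lambda: tweets.filter("lower(text) like '%astro%'").count()` for the DataFrame version and a `sqlc.sql(...).collect()` call for the SQL version, each run several times, since the first run may include one-off costs.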

In [30]:
from pyspark.sql.functions import lower

**Number of tweets mentioning the Astros:**

In [31]:
%time tweets.filter("lower(text) like '%astro%'").count()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.3 s


70165

In [32]:
%time
sqlc.sql("""
    SELECT COUNT(text) AS Astros
    FROM tweets
    WHERE LOWER(text) LIKE '%astro%'
""").show()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.01 µs
+------+
|Astros|
+------+
| 70165|
+------+



**Number of tweets mentioning the Dodgers:**

In [33]:
sqlc.sql("""
    SELECT COUNT(text) AS Dodgers
    FROM tweets
    WHERE LOWER(text) LIKE '%dodger%'
""").show()

+-------+
|Dodgers|
+-------+
|  33280|
+-------+



**Number of tweets mentioning both the Dodgers and the Astros:**

In [34]:
sqlc.sql("""
    SELECT COUNT(text) AS AstrosDodgers
    FROM tweets
    WHERE LOWER(text) LIKE "%astro%" AND LOWER(text) LIKE "%dodger%"
""").show()

+-------------+
|AstrosDodgers|
+-------------+
|        14167|
+-------------+
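
One caveat of the `LIKE '%astro%'` approach is that it matches any text containing those letters, including unrelated words such as "astronaut". A stricter alternative is a word-boundary regular expression. The sketch below is plain Python with a hypothetical helper `mentions`; it was not used for the counts above.

```python
import re


def mentions(text, stem):
    """True if `text` contains `stem` or `stem` + 's' as a whole word."""
    pattern = r"\b" + re.escape(stem) + r"s?\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical example texts.
print(mentions("Go Astros!", "astro"))        # True
print(mentions("NASA astronaut", "astro"))    # False
print("astro" in "NASA astronaut".lower())    # True: substring matching
print(mentions("#Dodgers win", "dodger"))     # True
```

The trade-off is that run-together forms such as "LetsGoDodgers" no longer match. In Spark, a similar pattern could probably be applied with `rlike`, although the escaping rules for backslashes inside SQL strings would need checking.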



## Part 3 - More complex queries

For this section, you may choose to use dataframe queries or SQL.  If you wish, you may verify results by using both, as in Part 2, but this is not required for this section.

### Q3.1 - Team mentions by location

In which users' locations are the Astros and the Dodgers being mentioned the most?  Consider each team separately, one at a time.  Discuss your findings.

Hint:  you may use either the time zones or user-specified locations for this question.

**Answer**

We used time zones for this question. We first answered it with SQL and then verified the results with the DataFrame API.

**Dodgers**

In [35]:
sqlc.sql("""
    SELECT user.time_zone, count(text) AS Dodgers
    FROM tweets
    WHERE lower(text) LIKE '%dodger%' AND user.time_zone IS NOT NULL
    GROUP BY user.time_zone
    ORDER BY Dodgers DESC
 """).show(1, False)

+--------------------------+-------+
|time_zone                 |Dodgers|
+--------------------------+-------+
|Pacific Time (US & Canada)|8108   |
+--------------------------+-------+
only showing top 1 row



In [36]:
tweets.filter("lower(text) like '%dodger%'") \
      .groupBy("user.time_zone") \
      .count() \
      .orderBy("count", ascending = False) \
      .na.drop() \
      .show(1, False)

+--------------------------+-----+
|time_zone                 |count|
+--------------------------+-----+
|Pacific Time (US & Canada)|8108 |
+--------------------------+-----+
only showing top 1 row



**Astros**

In [37]:
sqlc.sql("""
    SELECT user.time_zone, count(text) AS Astros
    FROM tweets
    WHERE lower(text) LIKE '%astro%' AND user.time_zone IS NOT NULL
    GROUP BY user.time_zone
    ORDER BY Astros DESC
 """).show(1, False)

+--------------------------+------+
|time_zone                 |Astros|
+--------------------------+------+
|Central Time (US & Canada)|14186 |
+--------------------------+------+
only showing top 1 row



In [38]:
tweets.filter("lower(text) like '%astro%'") \
      .groupBy("user.time_zone") \
      .count() \
      .orderBy("count", ascending = False) \
      .na.drop() \
      .show(1, False)

+--------------------------+-----+
|time_zone                 |count|
+--------------------------+-----+
|Central Time (US & Canada)|14186|
+--------------------------+-----+
only showing top 1 row



**Finding:** The Dodgers were mentioned most in the Pacific time zone, and the Astros were mentioned most in the Central time zone. We think this reflects where the teams are based: the Dodgers are in Los Angeles, while the Astros are in Houston.

### Q3.2 - Which Twitter users are being replied to the most?

Discuss your findings.

Hint: use the top-level `in_reply_to_screen_name` for this.

**Answer**

In [39]:
sqlc.sql('''
        SELECT in_reply_to_screen_name, COUNT(in_reply_to_screen_name) AS number
        FROM tweets
        WHERE in_reply_to_screen_name IS NOT NULL
        GROUP BY in_reply_to_screen_name
        ORDER BY number DESC
        ''').show(1)

+-----------------------+------+
|in_reply_to_screen_name|number|
+-----------------------+------+
|                 astros|   821|
+-----------------------+------+
only showing top 1 row



In [40]:
tweets.filter("in_reply_to_screen_name IS NOT NULL") \
      .groupBy("in_reply_to_screen_name") \
      .count() \
      .orderBy("count", ascending = False) \
      .show(1)

+-----------------------+-----+
|in_reply_to_screen_name|count|
+-----------------------+-----+
|                 astros|  821|
+-----------------------+-----+
only showing top 1 row



**Findings:** The Twitter user 'astros' is replied to the most. 'astros' is the official account of the Houston Astros. This suggests that Astros fans were the most active on Twitter, possibly because the Astros won that game.

### Q3.3 - Which 10 verified users have the most followers?  Which 10 unverified users have the most followers?

Provide both the screen names and follower counts for each.

Discuss your findings.

**Answer**

**10 verified users having the most followers:**

In [41]:
sqlc.sql('''
        SELECT user.id_str, user.screen_name, MAX(user.followers_count) AS followers
        FROM TWEETS
        WHERE user.verified = "true"
        GROUP BY user.id_str, user.screen_name
        ORDER BY followers DESC
        LIMIT 10
        ''').show()

+----------+--------------+---------+
|    id_str|   screen_name|followers|
+----------+--------------+---------+
|   1652541|       Reuters| 18937529|
|   1367531|       FoxNews| 16272836|
|  28785486|           ABC| 12551437|
|   2467791|washingtonpost| 11417638|
|  18479513|           MLB|  7841255|
|   5392522|           NPR|  7289492|
|  32765534|   BillSimmons|  6000106|
|  14173315|       NBCNews|  5442705|
|1394399438|    JohnLegere|  4630104|
|  44728980|     ANCALERTS|  4453229|
+----------+--------------+---------+



In [42]:
from pyspark.sql.functions import max

tweets.filter(tweets.user.verified == "true")\
      .groupBy(tweets.user.id_str,tweets.user.screen_name)\
      .agg(max(tweets.user.followers_count))\
      .sort("max(user['followers_count'])",ascending = False) \
      .show(10)

+--------------+-------------------+----------------------------+
|user['id_str']|user['screen_name']|max(user['followers_count'])|
+--------------+-------------------+----------------------------+
|       1652541|            Reuters|                    18937529|
|       1367531|            FoxNews|                    16272836|
|      28785486|                ABC|                    12551437|
|       2467791|     washingtonpost|                    11417638|
|      18479513|                MLB|                     7841255|
|       5392522|                NPR|                     7289492|
|      32765534|        BillSimmons|                     6000106|
|      14173315|            NBCNews|                     5442705|
|    1394399438|         JohnLegere|                     4630104|
|      44728980|          ANCALERTS|                     4453229|
+--------------+-------------------+----------------------------+
only showing top 10 rows



**10 unverified users having the most followers:**

In [43]:
sqlc.sql('''
        SELECT user.id_str, user.screen_name, MAX(user.followers_count) AS followers
        FROM TWEETS
        WHERE user.verified = "false"
        GROUP BY user.id_str, user.screen_name
        ORDER BY followers DESC
        LIMIT 10
        ''').show()

+----------+---------------+---------+
|    id_str|    screen_name|followers|
+----------+---------------+---------+
|  29614331|        chochos|   833669|
|  82971772|  el_carabobeno|   725952|
|  20897273|       PAMsLOvE|   712254|
|  24733117|        jilevin|   568341|
|2796081233|    sun_das_ill|   559669|
| 108192135|       EP_Mundo|   538525|
|  43846520|         LALATE|   516139|
|  38976017|  piercearrow33|   503015|
| 288513282|      BigNeechi|   496825|
| 290395312|periodicovzlano|   493446|
+----------+---------------+---------+



In [44]:
tweets.filter(tweets.user.verified == "false")\
      .groupBy(tweets.user.id_str,tweets.user.screen_name)\
      .agg(max(tweets.user.followers_count))\
      .sort("max(user['followers_count'])",ascending = False) \
      .show(10)

+--------------+-------------------+----------------------------+
|user['id_str']|user['screen_name']|max(user['followers_count'])|
+--------------+-------------------+----------------------------+
|      29614331|            chochos|                      833669|
|      82971772|      el_carabobeno|                      725952|
|      20897273|           PAMsLOvE|                      712254|
|      24733117|            jilevin|                      568341|
|    2796081233|        sun_das_ill|                      559669|
|     108192135|           EP_Mundo|                      538525|
|      43846520|             LALATE|                      516139|
|      38976017|      piercearrow33|                      503015|
|     288513282|          BigNeechi|                      496825|
|     290395312|    periodicovzlano|                      493446|
+--------------+-------------------+----------------------------+
only showing top 10 rows



**Findings:** The top verified users have many more followers than the top unverified users. Furthermore, the top 10 verified users are official accounts of large news media, leagues, and public celebrities.

### Q3.4 - What are the most popular sets of hashtags among users with many followers?  Are they the same as among users with few followers?

Decide for yourself exactly how many followers you believe to be "many", and explain your decision.  You may use queries and statistics to support this decision if you wish.

Hint: if your sample tweet above does not include hashtags under the `entities` field, generate a new example by running the `shuf` command again until you find one that does.

Hint 2: the hashtag texts will be in an array, so you may need some functions you haven't used before.  If you're using SQL, see the docs for [Hive SQL](https://docs.treasuredata.com/articles/hive-functions) for details, (and consider `CONCAT_WS`, for example).

Discuss your findings.

**Answer**

We used the 0.95 and 0.05 quantiles of per-user follower counts to define 'many' and 'few' followers: users at or above the 0.95 quantile have many followers, and users at or below the 0.05 quantile have few.
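
The thresholding idea can be sketched in plain Python as follows. This uses a simple nearest-rank quantile, which is an assumption for illustration; Spark's `approxQuantile` may follow a slightly different rule. The follower counts below are hypothetical.

```python
import math


def nearest_rank_quantile(values, p):
    """Return the value at the nearest rank for quantile p (0 < p <= 1)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(p * len(ordered), 9))
    if rank < 1:
        rank = 1
    return ordered[rank - 1]


def classify(followers, low, high):
    """Label a follower count as 'many', 'few', or 'middle'."""
    if followers >= high:
        return "many"
    if followers <= low:
        return "few"
    return "middle"


# Hypothetical per-user follower counts (20 users).
follower_counts = [2, 24, 30, 41, 55, 60, 78, 90, 110, 150,
                   200, 260, 340, 480, 700, 1100, 1900, 2800, 3897, 50000]

low = nearest_rank_quantile(follower_counts, 0.05)
high = nearest_rank_quantile(follower_counts, 0.95)
print(low, high, classify(5000, low, high), classify(100, low, high))
```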

Find the 95% and 5% quantiles of the number of followers:

In [45]:
tweets.groupBy(tweets.user.screen_name)\
      .agg(max(tweets.user.followers_count))\
      .approxQuantile("max(user['followers_count'])", [0.95], 0.0)

[3897.0]

In [46]:
tweets.groupBy(tweets.user.screen_name)\
      .agg(max(tweets.user.followers_count))\
      .approxQuantile("max(user['followers_count'])", [0.05], 0.0)

[24.0]

The most popular sets of hashtags among users with many followers (follower counts >= 3897):

In [47]:
tweets.filter("user.followers_count >= 3897")\
      .select("entities.hashtags") \
      .select("hashtags.text") \
      .rdd.flatMap(lambda r: r['text']).map(lambda t: (t, 1)).reduceByKey(lambda a, b: a + b)\
      .takeOrdered(10, key=lambda pair: -pair[1])

[('WorldSeries', 897),
 ('Astros', 527),
 ('EarnHistory', 459),
 ('Dodgers', 340),
 ('ThisTeam', 160),
 ('SerieMundial', 112),
 ('ASTROSWIN', 85),
 ('worldseries', 83),
 ('WorldSeries2017', 65),
 ('MLB', 64)]

The most popular sets of hashtags among users with few followers (follower counts <= 24):

In [48]:
tweets.filter("user.followers_count <= 24")\
      .select("entities.hashtags") \
      .select("hashtags.text") \
      .rdd.flatMap(lambda r: r['text']).map(lambda t: (t, 1)).reduceByKey(lambda a, b: a + b)\
      .takeOrdered(10, key=lambda pair: -pair[1])

[('HR4HR', 996),
 ('EarnHistory', 703),
 ('WorldSeries', 637),
 ('Astros', 367),
 ('Dodgers', 178),
 ('ASTROSWIN', 164),
 ('ThisTeam', 129),
 ('worldseries', 103),
 ('WorldSeries2017', 80),
 ('astros', 72)]

**Findings:**  
* The most popular hashtags among users with many followers and among users with few followers are quite similar, which suggests that hashtag choice is not strongly related to follower count.
* The biggest differences are 'SerieMundial', which appears only in the top 10 for users with many followers, and 'HR4HR', which appears only in the top 10 for users with few followers.
* 'SerieMundial' is Spanish for "World Series", suggesting that some widely followed users have many Spanish-speaking followers.
* 'HR4HR' is a charity campaign from T-Mobile, which shows that users with few followers also took part in it.
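
The counts above treat each hashtag separately and are case-sensitive, which is why 'WorldSeries' and 'worldseries' appear as different entries. As a possible refinement, the sketch below counts the *set* of hashtags used in each tweet after lowercasing. It is plain Python on hypothetical tweets, and the helper name is made up for this example.

```python
from collections import Counter


def hashtag_set_counts(tweets):
    """Count how often each case-insensitive set of hashtags occurs."""
    counts = Counter()
    for tweet in tweets:
        tags = (tweet.get("entities") or {}).get("hashtags") or []
        # A sorted tuple of unique lowercased tags is a hashable "set" key.
        key = tuple(sorted({tag["text"].lower() for tag in tags}))
        if key:  # skip tweets without hashtags
            counts[key] += 1
    return counts


# Hypothetical tweets for illustration.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "WorldSeries"}, {"text": "Astros"}]}},
    {"entities": {"hashtags": [{"text": "astros"}, {"text": "worldseries"}]}},
    {"entities": {"hashtags": [{"text": "EarnHistory"}]}},
    {"entities": {"hashtags": []}},
]

set_counts = hashtag_set_counts(sample_tweets)
print(set_counts.most_common(2))
```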

### Q3.5 - Analyze common words in tweet text

Following the example in class, use `tweets.rdd` to find the most common interesting words in tweet text.  To keep it "interesting", add a filter that removes at least 10 common stop words found in tweets, like "a", "an", "the", and "RT" (you might want to derive these stop words from initial results).  To split lines into words, a simple split on text whitespace like we had in class is sufficient; you do not have to account for punctuation.

After you find the most common words, use dataframe or SQL queries to find patterns among how those words are used.  For example, are they more frequently used by Dodgers or Astros fans, or by people in one part of the country over another?  Explore and see what you can find, and discuss your findings.

Hint: don't forget all the word count pipeline steps we used earlier in class.

**Answer**

We decided to ignore case. We also strip punctuation by keeping only runs of two or more word characters. To do this, we defined a function called `split_words` and used it inside `rdd.flatMap`.

In [49]:
import re
def split_words(line):
    words = re.findall(r'\w{2,}', line)  # raw string avoids an invalid-escape warning
    stopwords = ['rt','co','https','am','an','as','at','be','by','do','he','if','in','is','it','me','my','no','of','on','or','so','to','us','up','we','all','and','any','are','but','can','did','for','get','got','had','has','her','his','him','how','its','let','may','nor','not','our','out','own','say','she','the','tis','too','was','who','why','yet','yes','you','able','also','been','does','down','else','ever','from','have','hers','into','just','most','more','much','must','only','over','said','some','such','than','that','them','then','they','this','upon','were','what','when','whom','will','with','your','about','after','among','could','shall','every','least','might','never','often','other','quite','since','their','there','these','where','which','while','would','across','almost','before','cannot','either','likely','rather','should','because','however','neither','off']
    b = []
    for word in words:
        word = word.lower()
        if word not in stopwords:
            b.append(word)
    return b

In [50]:
from operator import add
import re
tweets.rdd.flatMap(lambda r: split_words(r['text'])) \
      .map(lambda t: (t, 1)) \
      .reduceByKey(add) \
      .takeOrdered(15, key=lambda pair: -pair[1])

[('astros', 75158),
 ('dodgers', 34161),
 ('game', 23751),
 ('win', 19837),
 ('worldseries', 15971),
 ('earnhistory', 15209),
 ('series', 11162),
 ('12', 10881),
 ('13', 10211),
 ('world', 8623),
 ('bregman', 7655),
 ('lead', 6440),
 ('one', 6164),
 ('walk', 5938),
 ('abreg_1', 5414)]

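For the two most common words in the tweet text, 'astros' and 'dodgers', we compared the retweet totals of tweets mentioning each team.  Tweets mentioning the Astros gathered far more retweets than tweets mentioning the Dodgers, suggesting that the winning team attracted more attention. Note that these totals sum `retweeted_status.retweet_count` over every captured retweet, so a widely retweeted original tweet is counted once per retweet in our data.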
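
Because the totals below add up `retweeted_status.retweet_count` for every captured retweet, one alternative is to keep only the highest count seen for each original tweet before summing. The sketch below shows that idea in plain Python on hypothetical records; it is not the method used for the numbers in this notebook, and the helper names are made up.

```python
def naive_retweet_total(tweets, keyword):
    """Sum the original tweet's retweet_count over every matching retweet."""
    total = 0
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original and keyword in (tweet.get("text") or "").lower():
            total += original.get("retweet_count") or 0
    return total


def deduplicated_retweet_total(tweets, keyword):
    """Sum the highest retweet_count seen for each distinct original tweet."""
    best = {}
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if not original or keyword not in (tweet.get("text") or "").lower():
            continue
        count = original.get("retweet_count") or 0
        oid = original["id_str"]
        if oid not in best or count > best[oid]:
            best[oid] = count
    return sum(best.values())


# Hypothetical retweets: three of original "100", one of original "200".
sample_retweets = [
    {"text": "RT astros win", "retweeted_status": {"id_str": "100", "retweet_count": 10}},
    {"text": "RT astros win", "retweeted_status": {"id_str": "100", "retweet_count": 12}},
    {"text": "RT astros win", "retweeted_status": {"id_str": "100", "retweet_count": 12}},
    {"text": "RT go astros", "retweeted_status": {"id_str": "200", "retweet_count": 5}},
]

naive = naive_retweet_total(sample_retweets, "astros")
deduped = deduplicated_retweet_total(sample_retweets, "astros")
print(naive, deduped)
```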

**Retweet total for tweets mentioning the Astros:**

In [51]:
sqlc.sql('''
        SELECT sum(retweeted_status.retweet_count) as retweet_total
        FROM tweets
        WHERE lower(text) like '%astros%'
''').show()

+-------------+
|retweet_total|
+-------------+
|     69140753|
+-------------+



**Retweet total for tweets mentioning the Dodgers:**

In [52]:
sqlc.sql('''
        SELECT sum(retweeted_status.retweet_count) as retweet_total
        FROM tweets
        WHERE lower(text) like '%dodgers%'
''').show()

+-------------+
|retweet_total|
+-------------+
|     10169151|
+-------------+



We also looked at the time zones of users who mentioned Bregman, the Astros player who drove in the walk-off run. As expected, most users who mentioned Bregman were in the Central time zone, where the team is based.

In [53]:
sqlc.sql('''
        SELECT user.time_zone, count(*)
        FROM tweets
        WHERE lower(text) like '%bregman%'
        AND user.time_zone IS NOT NULL
        GROUP BY user.time_zone
        ORDER BY count(*) DESC
''').show(10, False)

+---------------------------+--------+
|time_zone                  |count(1)|
+---------------------------+--------+
|Central Time (US & Canada) |1870    |
|Eastern Time (US & Canada) |747     |
|Pacific Time (US & Canada) |728     |
|Mountain Time (US & Canada)|205     |
|Arizona                    |91      |
|Atlantic Time (Canada)     |81      |
|Quito                      |62      |
|Caracas                    |60      |
|Hawaii                     |57      |
|America/Chicago            |52      |
+---------------------------+--------+
only showing top 10 rows



**We first worked individually and then met to check our results. For the questions that allowed either DataFrame or SQL queries, two of us wrote DataFrame queries and the other two wrote SQL queries. Then we put our results together. We attest that each member contributed substantially to the work.**