# Exercise #3 - Twitter

In this homework, we are going to play with Twitter data.

The data is represented as rows of [JSON](https://en.wikipedia.org/wiki/JSON#Example) strings.
It consists of [tweets](https://dev.twitter.com/overview/api/tweets), [messages](https://dev.twitter.com/streaming/overview/messages-types), and a small amount of broken data (lines that cannot be parsed as JSON).

For this exercise, we will only focus on tweets and ignore all other messages.


## Tweets

A tweet consists of many data fields. [Here is an example](https://gist.github.com/arapat/03d02c9b327e6ff3f6c3c5c602eeaf8b). You can learn all about them in the Twitter API doc. We are going to briefly introduce only the data fields that will be used in this exercise.

* `created_at`: Posted time of this tweet (time zone is included)
* `id_str`: Tweet ID. We recommend using `id_str` rather than `id`, because `id` is a large integer that can overflow or lose precision in some tools.
* `text`: Tweet content
* `user`: A JSON object for information about the author of the tweet
    * `id_str`: User ID
    * `name`: User name (may contain spaces)
    * `screen_name`: User screen name (no spaces)
* `retweeted_status`: A JSON object with information about the retweeted tweet (present only when this tweet is not original but a retweet of another tweet)
    * All data fields of a tweet except `retweeted_status`
* `entities`: A JSON object for all entities in this tweet
    * `hashtags`: An array for all the hashtags that are mentioned in this tweet
    * `urls`: An array for all the URLs that are mentioned in this tweet
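
The fields listed above can be read from a parsed tweet as nested dictionary lookups. The sketch below uses a made-up, heavily trimmed tweet (all values are hypothetical, not taken from the data set) and assumes the `created_at` format `Mon Feb 15 12:00:00 +0000 2016`. It also illustrates one plausible reason for preferring `id_str`: a large integer ID does not always survive a round trip through a 64-bit float.

```python
import json
from datetime import datetime

# Hypothetical, heavily trimmed tweet; real tweets carry many more fields.
raw = '''
{
  "created_at": "Mon Feb 15 12:00:00 +0000 2016",
  "id_str": "699200000000000001",
  "text": "RT @alice: Hello #spark https://example.com",
  "user": {"id_str": "42", "name": "Bob B", "screen_name": "bob"},
  "retweeted_status": {
    "id_str": "699100000000000000",
    "text": "Hello #spark https://example.com",
    "user": {"id_str": "7", "name": "Alice A", "screen_name": "alice"}
  },
  "entities": {
    "hashtags": [{"text": "spark"}],
    "urls": [{"expanded_url": "https://example.com"}]
  }
}
'''
tweet = json.loads(raw)

author_id = tweet["user"]["id_str"]           # nested object: user -> id_str
is_retweet = "retweeted_status" in tweet      # the field is absent on original tweets
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]

# Assumed timestamp layout; %z keeps the time zone offset
posted = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

# Integers above 2**53 are not all representable as 64-bit floats,
# so converting the numeric ID to float and back can change it.
big_id = int(tweet["id_str"])
survives_float = int(float(big_id)) == big_id
```

Python integers themselves do not overflow, so the precision issue mainly matters when the data passes through tools that store JSON numbers as floats.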


## Data source

All tweets are collected using the [Twitter Streaming API](https://dev.twitter.com/streaming/overview).
We provide you with a file `tweets_10mb.txt`. You should upload it to Databricks.

## Data Exploration
Let's see how many lines there are in the input file.

1. Load the data from `tweets_10mb.txt` into an RDD.
2. Mark the RDD to be cached (so that after the first operation the data is kept in memory).
3. Try to understand the data and get some insights about it. How many tweets are in the dataset?

In [3]:
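# Write your code here
# Load the file as an RDD of lines and mark it as cached
rdd = sc.textFile('/FileStore/tables/tweets_10mb.txt').persist()
# Every line is one raw message; not every line is a valid tweet
print("Total number of lines: {}".format(rdd.count()))
sample = rdd.take(1)[0]
print(sample)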

## Part 1: Parse JSON strings to JSON objects
Python has built-in support for JSON. Look at this example:

In [5]:
import json

json_example = '''
{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}
'''

json_obj = json.loads(json_example)
'price' in json_obj.keys()

## Broken tweets and irrelevant messages

The data for this assignment may contain broken tweets (invalid JSON strings), so make sure that your code is robust to such cases.

You can filter out a broken tweet by checking whether:
* the line is not in JSON format

In addition, some lines in the input file might not be tweets, but messages that the Twitter server sent to the developer (such as [limit notices](https://dev.twitter.com/streaming/overview/messages-types#limit_notices)). Your program should also ignore these messages.

These messages do not contain the `created_at` field and can be filtered out accordingly:
* Check whether the parsed JSON object has a `created_at` field

*Hint:* [Catch the ValueError](http://stackoverflow.com/questions/11294535/verify-if-a-string-is-json-in-python)
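
To get a feel for the two checks described above, the following plain-Python sketch labels each raw line as a tweet, a non-tweet message, or broken data, and counts each kind. The function name `classify` and the sample lines are hypothetical and only serve as an illustration; it assumes Python 3, where the error raised by `json.loads` is a subclass of `ValueError`.

```python
import json
from collections import Counter

def classify(line):
    """Label one raw line as 'tweet', 'message', or 'broken'."""
    try:
        obj = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "broken"
    # Valid JSON, but only objects with `created_at` are treated as tweets
    if isinstance(obj, dict) and "created_at" in obj:
        return "tweet"
    return "message"

# Hypothetical sample lines, not taken from the real data set
sample_lines = [
    '{"created_at": "Mon Feb 15 12:00:00 +0000 2016", "text": "hi"}',
    '{"limit": {"track": 5}}',
    '{"created_at": "Mon Feb 15',
]

summary = Counter(classify(line) for line in sample_lines)
```

The same function could be passed to an RDD's `map` to see how many lines of each kind the input file contains.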

**********************************************************************************

### Task 1
Parse raw JSON tweets to obtain valid JSON objects. 

### Task 2
From all valid tweets, construct a pair RDD of `(user_id, text)`, where `user_id` is the `id_str` data field of the `user` dictionary (see the [Tweets](#Tweets) section above) and `text` is the `text` data field.

In [7]:
# Task 1 - Write your code here
import json

def safe_parse(raw_json):
    """
    Input is a string.
    Output is the parsed JSON object if the line is valid JSON, otherwise None.
    """
    try:
        return json.loads(raw_json)
    except ValueError:  # raised by json.loads on invalid JSON
        return None

rdd_parsed = rdd.map(safe_parse).persist()
# Drop broken lines (parsed as None) and non-tweet messages (no `created_at` field)
rdd_valid_tweets = rdd_parsed.filter(lambda j: isinstance(j, dict) and 'created_at' in j).persist()
print("Valid Tweets: {}".format(rdd_valid_tweets.count()))

In [8]:
# Task 2 - Write your code here
rdd_key_val = rdd_valid_tweets.map(lambda t: (t['user']['id_str'], t['text'])).persist()

### Task 3 - Number of unique users

Count the number of different users in all valid tweets.

Hint: [the `distinct()` method](https://spark.apache.org/docs/latest/programming-guide.html#transformations) is an easy way to do this, but try to see if there is another way to do it.

In [10]:
# Task 3 - Write your code here
users_count = rdd_key_val.map(lambda user_and_tweet: user_and_tweet[0]).distinct().count()
print('The number of unique users is:', users_count)  # Should print 1748

# Another solution: group identical user IDs and count the groups
users_count = rdd_key_val.map(lambda user_and_tweet: user_and_tweet[0]).groupBy(lambda x: x).count()
print('The number of unique users is:', users_count)  # Should print 1748
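
A further alternative to `distinct()` is to count tweets per user and then count the resulting keys. The plain-Python sketch below shows the idea on made-up `(user_id, text)` pairs; in Spark the same idea would correspond to a per-key reduction followed by `count()`. The names and data are hypothetical.

```python
from collections import Counter

# Hypothetical (user_id, text) pairs in the shape produced by Task 2
pairs = [("42", "first tweet"), ("7", "hello"), ("42", "second tweet")]

# Per-user tweet counts; every distinct user becomes exactly one key
tweets_per_user = Counter(user_id for user_id, _ in pairs)
unique_users = len(tweets_per_user)
```

A side benefit of this variant is that the per-user counts are available as well, not just the number of users.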

## Part 2: Word popularity
In this task you need to find the 20 most used words in the whole dataset.
The output should be a list of tuples `(word, number of appearances across the tweets)`,
sorted by the number of appearances in descending order.

Below is a list of stop words. These are the most common words in English, and you should remove them from every tweet so that their appearances are not counted (they are not interesting in our case).

In [12]:
STOP_WORDS = ["", "-","i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [13]:
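# Write your code here
# Split each tweet text on spaces, count every word, drop stop words
# (compared in lower case), keep only words seen more than 50 times,
# and sort by count in descending order.
rdd_popular_words = rdd_key_val\
                      .flatMap(lambda p: p[1].split(" "))\
                      .map(lambda w: (w, 1))\
                      .reduceByKey(lambda a, b: a + b)\
                      .filter(lambda p: p[0].lower() not in STOP_WORDS)\
                      .filter(lambda p: p[1] > 50)\
                      .sortBy(lambda v: -v[1])\
                      .persist()
top_20 = rdd_popular_words.take(20)
top_20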
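
The counting logic above can be tried without Spark on a few made-up texts. This is only a sketch under stated assumptions: the stop-word set is a tiny hypothetical stand-in for `STOP_WORDS`, the texts are invented, and the count threshold is left out because the toy input is so small.

```python
from collections import Counter

# Hypothetical stand-ins: a tiny stop-word set and three made-up tweet texts
stop_words = {"", "-", "i", "is", "the", "and"}
texts = ["Spark is fast and Spark is fun", "I love spark", "the cat"]

counts = Counter()
for text in texts:
    # Same tokenization as the cell above: split on single spaces
    counts.update(text.split(" "))

# Drop stop words case-insensitively, then sort by count, highest first
ranked = sorted(
    ((word, n) for word, n in counts.items() if word.lower() not in stop_words),
    key=lambda pair: -pair[1],
)
```

Note that `Spark` and `spark` are counted as separate words here, because only the stop-word check is case-insensitive. Lower-casing each word before counting would merge them; which behaviour is wanted is a choice the task description leaves open.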