# Part 1
---------------

For the first part of the practical assignment, we will use the MapReduce paradigm to implement a word counting program. First, some setup and data preparation.

In [1]:
import pyspark

sc = pyspark.SparkContext("local[*]", "PUC Big Data workshop")

We load the Shakespeare text file, `shakespeare.txt`, using the convenient PySpark function `textFile`. This automatically splits the file into separate lines, giving us an RDD with one element per line.

In [6]:
lines = sc.textFile("shakespeare.txt")

Since some lines in the file are empty, we first filter those out; they do not contain words, so we don't need them! The `filter` operation will give us all lines with a length larger than 0.

In [25]:
non_empty_lines = lines.filter(lambda line: len(line) > 0)

Now that we have a collection of non-empty lines, we will split these lines into single words. Since this is just an exercise, we simply split each line on whitespace (calling `split()` without arguments splits on any run of spaces or tabs). This will not give a perfect split, but it is good enough for the rest of the program. `flatMap` makes sure that we don't end up with nested lists and instead get one long, flat collection of words.

In [26]:
words = non_empty_lines.flatMap(lambda line: line.strip().split())
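To make the difference between `map` and `flatMap` concrete, here is a minimal sketch in plain Python that imitates both operations on two made-up lines. The names `sample_lines`, `mapped` and `flat_mapped` are hypothetical and only serve the illustration; no Spark is involved.

```python
from itertools import chain

# Two hypothetical sample lines standing in for elements of the RDD.
sample_lines = ["  To be, or not to be  ", "that is the question"]

# What `map` would give us: one list of words per line (a nested list).
mapped = [line.strip().split() for line in sample_lines]

# What `flatMap` would give us: the per-line lists chained into one flat list.
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)
print(flat_mapped)
```

With `map` we would have to deal with a list of lists in every later step; with `flatMap` each word is its own element.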

Let's look at some of the words we ended up with!

In [27]:
words.takeSample(withReplacement=False, num=10)

['thy',
 'tell',
 'a',
 "he's",
 'service.',
 'Presents',
 'Exit.',
 'the',
 'CLEOPATRA.',
 'meet']

We loaded the Shakespeare text, removed empty lines and split the remaining lines on whitespace using the `flatMap` function. Displayed above are 10 random words sampled from the split text. As you can see, splitting on whitespace does not result in a perfect separation of words: punctuation stays attached (`'service.'`, `'Exit.'`) and capitalisation is preserved. It will do for our purposes.
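If a cleaner split were needed, one possible refinement is to lowercase each token and strip punctuation from its ends. The sketch below is an optional extra and is not used in the rest of this notebook; `normalise` and `sample_words` are hypothetical names, and the sample tokens are taken from the output above.

```python
import string

def normalise(word):
    """Lowercase a token and strip punctuation from both of its ends."""
    return word.strip(string.punctuation).lower()

# A few tokens from the sample shown earlier.
sample_words = ["service.", "Exit.", "CLEOPATRA.", "he's", "thy"]

cleaned = [normalise(w) for w in sample_words]
print(cleaned)
```

In the notebook this could presumably be applied as `words.map(normalise).filter(lambda w: len(w) > 0)`. Doing so would change the counts reported further down, since tokens such as `'the'` and `'The'` would then be merged.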

------------------------

Now, let's implement a simple word count! First, we will use the `map` operation to transform each word into a (word, 1) tuple as per the slides.

In [20]:
annotated_words = words.map(lambda word: (word, 1))

After annotating each word with the number 1, we can perform the shuffle and reduce steps with the `reduceByKey` operation. It first combines the values of identical words within each partition (the partial reduce), then moves all tuples with the same word to the same worker node and merges them with the function we supply. That function should be associative and commutative, because Spark is free to apply it in any order. In this case, since we are counting word occurrences, we simply add all the 1's together.

In [21]:
word_counts = annotated_words.reduceByKey(lambda x, y: x + y)
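As a rough picture of what `reduceByKey` computes, the following plain-Python sketch groups a handful of made-up (word, 1) tuples by key and then reduces each group with the same addition function. It only imitates the result on a single machine; the names `pairs`, `grouped` and `counts` are hypothetical, and the real operation is distributed over partitions.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (word, 1) tuples, as produced by the `map` step.
pairs = [("the", 1), ("a", 1), ("the", 1), ("thy", 1), ("the", 1)]

# "Shuffle": collect the values of all tuples that share the same key.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# "Reduce": fold each group of values with the same function
# that is passed to reduceByKey in the notebook.
counts = {word: reduce(lambda x, y: x + y, ones) for word, ones in grouped.items()}

print(counts)
```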

Now, in order to make the next steps easier, we first swap the positions of the words and their counts in the tuples so we end up with (count, word) instead of (word, count). This makes the count the key of the item, and will allow us to sort by key to see which words are most common.

In [22]:
word_counts = word_counts.map(lambda x: (x[1], x[0]))

Sort the tuples by count in descending order, putting the most frequent words at the beginning of the list.

In [23]:
sorted_word_counts = word_counts.sortByKey(ascending=False)
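The swap-then-sort idea can be checked on a tiny example in plain Python. The data in `sample_counts` is made up, and `swapped` and `ordered` are hypothetical names for this sketch only.

```python
# Hypothetical (word, count) tuples, as produced by the reduce step.
sample_counts = [("thy", 2), ("the", 5), ("a", 3)]

# Swap each tuple so that the count becomes the key: (count, word).
swapped = [(count, word) for word, count in sample_counts]

# Sort by key in descending order, most frequent word first.
ordered = sorted(swapped, key=lambda pair: pair[0], reverse=True)

print(ordered)
```

As a design note, PySpark's `sortBy` accepts a key function, so sorting the original (word, count) tuples with something like `sortBy(lambda x: x[1], ascending=False)` should give the same ordering without the swap. The swap is kept here because it matches the key-based view of MapReduce.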

That's it! Let's look at the 10 most frequent words in the text.

In [31]:
for (count, word) in sorted_word_counts.take(10):
    print(f"{word:<4}: {count:<6} occurrences")

the : 23178  occurrences
I   : 19540  occurrences
and : 18218  occurrences
to  : 15592  occurrences
of  : 15503  occurrences
a   : 12513  occurrences
my  : 10823  occurrences
in  : 9564   occurrences
you : 9058   occurrences
is  : 7829   occurrences
