Word Count Assignment: Building a word count application

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this assignment, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.

This could also be scaled to find the most common words in Wikipedia. During this assignment we will cover:
* Part 1: Creating a base RDD and pair RDDs
* Part 2: Counting with pair RDDs
* Part 3: Finding unique words and a mean value
* Part 4: Apply word count to a file

Part 1: Creating a base RDD and pair RDDs

In this part of the assignment, we will explore creating a base RDD with parallelize() and using pair RDDs to count words.

(1a) Create a base RDD
* We'll start by generating a base RDD by using a Python list and the sc.parallelize method.
* Then we'll print out the type of the base RDD.

In [1]:
#!pip install pyspark
from pprint import pprint
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
pprint(sc)

<SparkContext master=local[*] appName=pyspark-shell>


In [2]:
words_list = ['cat', 'elephant', 'rat', 'rat', 'cat']
RDD_words = sc.parallelize(words_list, 4)
pprint(type(RDD_words))

<class 'pyspark.rdd.RDD'>


(1b) Pluralize and test

Let's use a map() transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word. Please replace **\<FILL IN\>** with your solution.

Each exercise includes an explanation of what is expected, followed by code with one or more **\<FILL IN\>** sections. The code that needs to be modified has # TODO: Replace **\<FILL IN\>** with appropriate code on its first line.

In [3]:
def make_plural(word): return f'{word}s'
pprint(make_plural('cat'))

'cats'


(1c) Apply make_plural to the base RDD

Now pass each item in the base RDD into a map() transformation that applies the make_plural() function to each element. And then call the collect() action to see the transformed RDD.

In [4]:
RDD_plural = RDD_words.map(make_plural)
pprint(RDD_plural.collect())

['cats', 'elephants', 'rats', 'rats', 'cats']
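For intuition only: Spark transformations such as map() are lazy, and nothing is computed until an action such as collect() runs. Python's built-in map() behaves in a loosely similar way, which the sketch below uses as an analogy. It runs without Spark, and the names words and lazy_plurals are illustrative rather than part of the assignment.

```python
def make_plural(word):
    """Return the word with an 's' appended."""
    return f'{word}s'

words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Like an RDD transformation, map() only records what to do; no work happens yet.
lazy_plurals = map(make_plural, words)

# Consuming the iterator plays the role of the collect() action.
plurals = list(lazy_plurals)
print(plurals)
```

The analogy stops at laziness: a real RDD is also partitioned across workers, which a Python iterator is not.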


(1d) Pass a lambda function to map

Let's create the same RDD using a lambda function.

* RDD_plural_lambda = RDD_words.map(lambda **\<FILL IN\>**)
* print(RDD_plural_lambda.collect())
* ['cats', 'elephants', 'rats', 'rats', 'cats']

In [5]:
RDD_plural_lambda = RDD_words.map(lambda word: f'{word}s')
pprint(RDD_plural_lambda.collect())

['cats', 'elephants', 'rats', 'rats', 'cats']


(1e) Length of each word

Now use map() and a lambda function to return the number of characters in each word. We'll collect this result directly into a variable.

* plural_lengths = RDD_plural **\<FILL IN\>**.collect()
* print(plural_lengths)
* [4, 9, 4, 4, 4]

In [6]:
plural_lengths = RDD_plural.map(lambda word: len(word)).collect()
pprint(plural_lengths)

[4, 9, 4, 4, 4]


(1f) Pair RDDs

The next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple (k, v) where k is the key and v is the value. In this example, we will create a pair consisting of ('\<word\>', 1) for each word element in the RDD. We can create the pair RDD using the map() transformation with a lambda function.

* word_pairs = RDD_words.**\<FILL IN\>**
* print(word_pairs.collect())
* [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

In [7]:
word_pairs = RDD_words.map(lambda word: (word, 1))
pprint(word_pairs.collect())

[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]


Part 2: Counting with pair RDDs

Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.

A naive approach would be to collect() all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations.
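The naive approach described above amounts to something like the following sketch. A plain Python list stands in for the result of collect(), so the block runs without Spark; collected_words and driver_counts are illustrative names.

```python
from collections import Counter

# Stand-in for RDD_words.collect(): every element is now in driver memory.
collected_words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# All counting happens in a single process, so nothing runs in parallel.
driver_counts = Counter(collected_words)
print(sorted(driver_counts.items()))
```

This is fine for five words, but the whole dataset has to fit in the driver's memory, which is exactly what fails at terabyte scale.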

(2a) groupByKey() approach

An approach you might first consider (we'll see shortly that there are better ways) is based on using the groupByKey() transformation. As the name implies, the groupByKey() transformation groups all the elements of the RDD with the same key into a single list in one of the partitions.

There are two problems with using groupByKey():
* The operation requires a lot of data movement to move all the values into the appropriate partitions.
* The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a) would be huge and could exhaust the available memory in a worker.

Use groupByKey() to generate a pair RDD of type ('word', iterator).

* words_grouped = word_pairs.**\<FILL IN\>**
* for key, value in words_grouped.collect():
* *   print('{0}: {1}'.format(key, list(value)))
* rat: [1, 1]
* elephant: [1]
* cat: [1, 1]
* sorted(words_grouped.mapValues(lambda x: list(x)).collect())
* [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])]

In [8]:
words_grouped = word_pairs.groupByKey()
for key, value in words_grouped.collect():
    print('{0}: {1}'.format(key, list(value)))
sorted(words_grouped.mapValues(lambda x: list(x)).collect())

elephant: [1]
rat: [1, 1]
cat: [1, 1]


[('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])]
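To see why the grouped lists can grow large, here is a Spark-free sketch of what grouping by key produces. It uses a plain dictionary in place of partitions, so it illustrates the shape of the result only and says nothing about how Spark moves data; grouped is an illustrative name.

```python
from collections import defaultdict

word_pairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

# Every (key, value) pair is kept and placed in the bucket for its key,
# so a frequent key ends up with a very long list.
grouped = defaultdict(list)
for key, value in word_pairs:
    grouped[key].append(value)

print(sorted(grouped.items()))
```

Each list holds one entry per occurrence of the word, so its length equals the word's count.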

(2b) Use groupByKey() to obtain the counts

Using the groupByKey() transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterator. Now sum the iterator using a map() transformation. The result should be a pair RDD consisting of (word, count) pairs.

* word_counts_grouped = words_grouped.**\<FILL IN\>**
* print(word_counts_grouped.collect())
* [('rat', 2), ('elephant', 1), ('cat', 2)]

In [9]:
word_counts_grouped = words_grouped.map(lambda item: (item[0], sum(item[1])))
pprint(word_counts_grouped.collect())

[('elephant', 1), ('rat', 2), ('cat', 2)]


(2c) Counting using reduceByKey

A better approach is to start from the pair RDD and then use the reduceByKey() transformation to create a new pair RDD. The reduceByKey() transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. reduceByKey() operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets.

* \# reduceByKey takes a function that accepts two values and returns a single value
* word_counts = word_pairs.reduceByKey(**\<FILL IN\>**)
* print(word_counts.collect())
* [('rat', 2), ('elephant', 1), ('cat', 2)]

In [10]:
word_counts = word_pairs.reduceByKey(lambda acc, value: acc + value)
pprint(word_counts.collect())

[('elephant', 1), ('rat', 2), ('cat', 2)]
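The two-stage behavior described in (2c) can be sketched in plain Python. The partition layout below is an assumption made for illustration (Spark decides the real layout), and combine_locally is a hypothetical helper, not a Spark API.

```python
from operator import add

# Assumed layout of the (word, 1) pairs across two partitions.
partitions = [
    [('cat', 1), ('elephant', 1)],
    [('rat', 1), ('rat', 1), ('cat', 1)],
]

def combine_locally(pairs, func):
    """Stage 1: reduce the values for each key within a single partition."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

local_results = [combine_locally(pairs, add) for pairs in partitions]

# Stage 2: merge the already-reduced partial results across partitions.
merged = {}
for partial in local_results:
    for key, value in partial.items():
        merged[key] = add(merged[key], value) if key in merged else value

# Only one record per key per partition has to cross partition boundaries.
shuffled_records = sum(len(partial) for partial in local_results)
print(sorted(merged.items()), shuffled_records)
```

With this layout, four partial records are merged instead of the five original pairs. The saving grows with the number of repeated words in each partition.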


(2d) All together

A more compact version of the code chains the map() to a pair RDD, the reduceByKey() transformation, and the collect() action in a single statement.

* word_counts_collected = (RDD_words**\<FILL IN\>**.collect())
* print(word_counts_collected)
* [('rat', 2), ('elephant', 1), ('cat', 2)]
* sorted(word_counts_collected)
* [('cat', 2), ('elephant', 1), ('rat', 2)]

In [11]:
word_counts_collected = ( # First solution
    RDD_words
    .map(lambda word: (word, 1))
    .groupByKey()
    .map(lambda item: (item[0], sum(item[1])))
    .collect()
)
pprint(word_counts_collected)
pprint(sorted(word_counts_collected))
print()
word_counts_collected = ( # Second solution
    RDD_words
    .map(lambda word: (word, 1))
    .reduceByKey(lambda acc, value: acc + value)
    .collect()
)
pprint(word_counts_collected)
pprint(sorted(word_counts_collected))

[('elephant', 1), ('rat', 2), ('cat', 2)]
[('cat', 2), ('elephant', 1), ('rat', 2)]

[('elephant', 1), ('rat', 2), ('cat', 2)]
[('cat', 2), ('elephant', 1), ('rat', 2)]


Part 3: Finding unique words and a mean value

(3a) Unique words

Calculate the number of unique words in RDD_words. You can use other RDDs that you have already created to make this easier.

* unique_words = **\<FILL IN\>**
* print(unique_words)
* 3

In [12]:
# word_counts (from 2c) holds one (word, count) pair per distinct word,
# so counting its elements avoids collecting anything to the driver.
unique_words = word_counts.count()
pprint(unique_words)

3


(3b) Mean using reduce

Find the mean number of occurrences per unique word in word_counts. First map() the pair RDD word_counts, which consists of (key, value) pairs, to an RDD of values. Then use a reduce() action to sum the counts and divide by the number of unique words.

* from operator import add
* total_count = (
* * word_counts
* * .map(**\<FILL IN\>**)
* * .reduce(**\<FILL IN\>**))
* average = total_count / float(**\<FILL IN\>**)
* print(total_count)
* 5
* print(round(average, 2))
* 1.67

In [13]:
from operator import add
total_count = (
    word_counts
    .map(lambda item: item[1])
    .reduce(add))
average = total_count / float(unique_words)
pprint(total_count)
pprint(round(average, 2))

5
1.67
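The same map-then-reduce arithmetic can be checked locally with functools.reduce. This is a Spark-free sketch; word_counts_local is an illustrative stand-in for the contents of word_counts.

```python
from functools import reduce
from operator import add

word_counts_local = [('elephant', 1), ('rat', 2), ('cat', 2)]

# map() step: keep only the value of each (word, count) pair.
counts = [count for _, count in word_counts_local]

# reduce() step: combine the values two at a time into a single total.
total_count = reduce(add, counts)

# 5 occurrences spread over 3 unique words.
average = total_count / len(word_counts_local)

print(total_count, round(average, 2))
```

In Python 3 the / operator already performs true division, so the float() call in the template is harmless but not required.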


Part 4: Apply word count to a file

In this section we will finish developing our word count application. We'll have to build the word_count function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data.

(4a) word_count function

First, define a function for word counting. You should reuse the techniques that have been covered in earlier parts of this assignment. This function should take in an RDD of words, like RDD_words, and return a pair RDD that has all of the words and their associated counts.

* Creates a pair RDD with word counts from an RDD of words.
* Args: RDD_word_list (RDD of str): An RDD consisting of words.
* Returns: RDD of (str, int): An RDD consisting of (word, count) tuples.

* def word_count(RDD_word_list):
* *   **\<FILL IN\>**
* print(word_count(RDD_words).collect())
* [('rat', 2), ('elephant', 1), ('cat', 2)]

In [14]:
def word_count(RDD_word_list):
    return RDD_word_list.map(lambda w: (w, 1)).reduceByKey(lambda a, v: a + v)
pprint(word_count(RDD_words).collect())

[('elephant', 1), ('rat', 2), ('cat', 2)]


(4b) Capitalization and punctuation

Real world files are more complicated than the data we have been using here. Some of the issues we have to address are:
* Words should be counted independent of their capitalization (e.g., Spark and spark should be counted as the same word).
* All punctuation should be removed.
* Any leading or trailing spaces on a line should be removed.

Define the function remove_punctuation that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python re module to remove any text that is not a letter, number, or space. Reading help(re.sub) might be useful. If you are unfamiliar with regular expressions, you may want to review this tutorial from Google. Also, this website is a great resource for debugging your regular expression.

Removes punctuation, changes to lower case, and strips leading and trailing spaces.
* Note: Only spaces, letters, and numbers should be retained. Other characters should be eliminated (e.g., it's becomes its). Leading and trailing spaces should be removed after punctuation is removed.
* Args: text (str): A string.
* Returns: str: The cleaned up string.

* import re
* def remove_punctuation(text):
* * **\<FILL IN\>**
* print(remove_punctuation('Hi, you!'))
* hi you
* print(remove_punctuation(' No under_score!'))
* no underscore
* print(remove_punctuation(' * Remove punctuation then spaces * '))
* remove punctuation then spaces

In [15]:
import re
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text) \
             .strip() \
             .lower() \
             .replace('_', '')
pprint(remove_punctuation('Hi, you!'))
pprint(remove_punctuation(' No under_score!'))
pprint(remove_punctuation(' * Remove punctuation then spaces * '))

'hi you'
'no underscore'
'remove punctuation then spaces'
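The solution above relies on \w and \s, which in Python 3 also match underscores, tabs, and non-ASCII letters; the underscore is then removed by a separate replace(). A stricter reading of "only spaces, letters, and numbers" could be sketched as follows, assuming that "letters" means ASCII a-z. The name remove_punctuation_strict is hypothetical.

```python
import re

def remove_punctuation_strict(text):
    """Lower-case, keep only a-z, 0-9 and spaces, then strip both ends."""
    return re.sub(r'[^a-z0-9 ]', '', text.lower()).strip()

print(remove_punctuation_strict('Hi, you!'))
print(remove_punctuation_strict(' No under_score!'))
print(remove_punctuation_strict(' * Remove punctuation then spaces * '))
```

On the three test strings it agrees with the solution above. On text containing tabs or accented letters the two can differ, so the word counts reported later in this assignment are not guaranteed to match if this variant is used instead.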


(4c) Load a text file

For the next part of this assignment, we will use the Complete Works of William Shakespeare from Project Gutenberg. To convert a text file into an RDD, we use the SparkContext.textFile() method. We also apply the remove_punctuation() function defined above using a map() transformation to strip out the punctuation and change all text to lower case. Download the shakespeare.txt file and store it in your Spark directory. Since the file is large, we use take(15) so that we only print 15 lines.

* file_name = "shakespeare.txt"
* RDD_shakespeare = (
* *   sc.textFile(file_name, 8)
* *   .map(remove_punctuation))
* print('\n'.join(
* *   RDD_shakespeare
* *   .zipWithIndex() # to (line, line_num)
* *   .map(lambda item: '{0}: {1}'.format(item[1], item[0])) # to 'line_num: line'
* *   .take(15)))


In [16]:
from google.colab import drive
drive.mount('/content/modules', force_remount=True)

Mounted at /content/modules


In [17]:
file_name = "/content/modules/My Drive/shakespeare.txt"
RDD_shakespeare = sc.textFile(file_name, 8).map(remove_punctuation)
print(
    '\n'.join(
        RDD_shakespeare
        .zipWithIndex() # to (line, line_num)
        .map(lambda item: '{0}: {1}'.format(item[1], item[0])) # to 'line_num: line'
        .take(15)))

0: this is the 100th etext file presented by project gutenberg and
1: is presented in cooperation with world library inc from their
2: library of the future and shakespeare cdroms  project gutenberg
3: often releases etexts that are not placed in the public domain
4: 
5: shakespeare
6: 
7: this etext has certain copyright implications you should read
8: 
9: this electronic version of the complete works of william
10: shakespeare is copyright 19901993 by world library inc and is
11: provided by project gutenberg etext of illinois benedictine college
12: with permission  electronic and machine readable copies may be
13: distributed so long as such copies 1 are for your or others
14: personal use only and 2 are not distributed or used
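One detail that is easy to get backwards: zipWithIndex() produces (line, index) pairs, whereas Python's enumerate() produces (index, line). That is why the code above formats item[1] before item[0]. The sketch below mimics the pair order with a plain list; lines is made-up sample data.

```python
lines = ['first line', '', 'third line']

# enumerate() yields (index, line); swapping gives the (line, index) shape of zipWithIndex().
line_index_pairs = [(line, index) for index, line in enumerate(lines)]

# Same formatting step as in the cell above: index first, then the line text.
formatted = ['{0}: {1}'.format(item[1], item[0]) for item in line_index_pairs]
print('\n'.join(formatted))
```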


(4d) Words from lines

Before we can use the word_count() function, we have to address two issues with the format of the RDD. First, we need to split each line by its spaces; this is done in (4d). Second, we need to filter out the empty elements that remain, for example from empty lines; this is done in (4e).

Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string split() function. You might think that a map() transformation is the way to do this, but think about what the result of the split() function will be.

Note: Do not use the default implementation of split(), but pass in a separator value. For example, to split line by commas you would use line.split(',').
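The difference between the two transformations, and the effect of passing an explicit separator, can be seen with plain Python lists. This sketch needs no Spark; lines is made-up sample data, and itertools.chain stands in for the flattening that flatMap() performs.

```python
from itertools import chain

lines = ['to be or', '', 'not  to be']

# map()-style: one output element per line, so the result is a list of lists.
mapped = [line.split(' ') for line in lines]

# flatMap()-style: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)
print(flat_mapped)

# With an explicit separator, an empty line yields one empty string;
# the default split() would silently drop it.
print(''.split(' '), ''.split())
```

The empty strings that show up in flat_mapped come from empty lines and from consecutive spaces, which is why a separate filtering step is needed afterwards.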

* RDD_shakespeare_words = RDD_shakespeare.**\<FILL IN\>**
* shakespeare_word_count = RDD_shakespeare_words.count()
* print(RDD_shakespeare_words.top(5))
* ['zwaggerd', 'zounds', 'zounds', 'zounds', 'zounds']
* print(shakespeare_word_count)
* 946354

In [18]:
RDD_shakespeare_words = RDD_shakespeare.flatMap(lambda item: item.split(' '))
shakespeare_word_count = RDD_shakespeare_words.count()
pprint(RDD_shakespeare_words.top(5))
pprint(shakespeare_word_count)

['zwaggerd', 'zounds', 'zounds', 'zounds', 'zounds']
946354


(4e) Remove empty elements

The next step is to filter out the empty elements. Remove all entries where the word is the empty string ''.

* RDD_shake_words = RDD_shakespeare_words.**\<FILL IN\>**
* shake_word_count = RDD_shake_words.count()
* print(shake_word_count)
* 901109

In [19]:
RDD_shake_words = RDD_shakespeare_words.filter(lambda item: item != '')
shake_word_count = RDD_shake_words.count()
pprint(shake_word_count)

901109


(4f) Count the words

We now have an RDD that contains only words. Next, let's apply the word_count() function to produce a pair RDD of word counts. We can view the top 15 words by using the takeOrdered() action; however, since the elements of the RDD are pairs, we need a custom key function that orders by the value part of the pair.

You'll notice that many of the words are common English words. These are called stopwords. In a later assignment, we will see how to eliminate them from the results. Use the word_count() function and takeOrdered() to obtain the fifteen most common words and their counts.

* top_15_words_and_counts = **\<FILL IN\>**
* print('\n'.join(map(lambda item: '{0}: {1}'.format(item[0], item[1]), top_15_words_and_counts)))
* the: 27645
* and: 26733
* i: 20683
* to: 19198
* of: 18180
* a: 14613
* you: 13650
* my: 12480
* that: 11122
* in: 10967
* is: 9598
* not: 8725
* for: 8245
* with: 7996
* me: 7768

In [20]:
top_15_words_and_counts = word_count(RDD_shake_words).takeOrdered(15, lambda item: -item[1])
print('\n'.join('{0}: {1}'.format(word, count) for word, count in top_15_words_and_counts))

the: 27645
and: 26733
i: 20683
to: 19198
of: 18180
a: 14613
you: 13650
my: 12480
that: 11122
in: 10967
is: 9598
not: 8725
for: 8245
with: 7996
me: 7768
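takeOrdered() returns the smallest elements under the key function, which is why the solution negates the count to get the most frequent words first. The standard-library heapq.nsmallest follows the same convention, so it can serve as a small Spark-free illustration. Here sample_counts is a hand-picked subset of the results above, deliberately listed out of order.

```python
import heapq

sample_counts = [('the', 27645), ('me', 7768), ('and', 26733), ('i', 20683)]

# Smallest by negated count is the same as largest by count.
top_3 = heapq.nsmallest(3, sample_counts, key=lambda item: -item[1])
print(top_3)
```

Selecting only the top few elements this way avoids fully sorting every (word, count) pair.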
