# Lesson 13 - Example: Word Count

### Introduction

In this lesson, we will work with text data. Our goal will be to determine the most frequently used words in the H. G. Wells novel, "The War of the Worlds".

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Read Data

We will read the data in from a text file and will count the lines in the resulting RDD.

In [0]:
lines_rdd = sc.textFile('/FileStore/tables/war_of_the_worlds.txt')
print(lines_rdd.count())

To get a sense of the contents of the file, we will print the first 20 elements of the RDD.

In [0]:
for line in lines_rdd.take(20):
    print(line)

## Processing the Data

Our goal is to create an RDD whose elements are the individual words in the novel. Along the way, we will perform a small amount of preprocessing: we will convert every word to lower case and strip out punctuation. We start with an example that illustrates how punctuation is stripped.

In [0]:
from string import punctuation

print(punctuation)

In [0]:
test_string = '"Hello, Word!"'
print(test_string)
print(test_string.strip(punctuation))
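Note that `str.strip()` only removes characters from the two ends of a string; punctuation in the interior is left alone. The short sketch below illustrates this on a few hypothetical tokens (the `samples` list is made up for illustration and is not taken from the novel). It suggests why hyphens and apostrophes are handled by separate steps later on.

```python
from string import punctuation

# Hypothetical sample tokens, chosen to show where strip() does and does not act.
samples = ['"Hello,', "don't!", '--well--', 'mid-word']

# strip(punctuation) removes any run of punctuation characters
# from the start and the end of each string, but never from the middle.
stripped = [s.strip(punctuation) for s in samples]

for before, after in zip(samples, stripped):
    print(f'{before:<12}{after}')
```

The interior apostrophe in `don't` and the interior hyphen in `mid-word` both survive the call.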

In the cell below, we process each line of the RDD by performing the following steps, in order:

1. We use `flatMap()` to tokenize the data, splitting on the space character.
2. We use `flatMap()` to tokenize the data, splitting on the hyphen character.
3. We use `map()` to convert each string to lower case.
4. We use `map()` to strip punctuation marks from both ends of each string.
5. We use `map()` to remove single-quotes (apostrophes).
6. We use `filter()` to remove empty strings created in the steps above. 

We then count the number of words in the resulting RDD, as well as the number of distinct words.

In [0]:
words_rdd = (
    lines_rdd
    .flatMap(lambda x : x.split(' '))      # Split on spaces
    .flatMap(lambda x : x.split('-'))      # Split on hyphens
    .map(lambda x : x.lower())             # Convert to lower case
    .map(lambda x : x.strip(punctuation))  # Strip punctuation
    .map(lambda x : x.replace("'", ''))    # Remove apostrophes
    .filter(lambda x : x != '')            # Remove empty strings
)

words_rdd.persist()  # Mark the RDD for caching; it is reused by several actions below

print('Total number of words:   ', words_rdd.count())
print('Number of distinct words:', words_rdd.distinct().count())
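To see what each of the six steps does, it can help to trace them on a single line using ordinary Python lists instead of an RDD. The following is a minimal sketch under that assumption: the sample `line` is hypothetical, and list comprehensions stand in for `flatMap()`, `map()`, and `filter()`.

```python
from string import punctuation

# Hypothetical sample line (note the double space before "stop!").
line = 'The well-known "Martians" didn\'t  stop!'

tokens = line.split(' ')                             # 1. Split on spaces
tokens = [p for t in tokens for p in t.split('-')]   # 2. Split on hyphens
tokens = [t.lower() for t in tokens]                 # 3. Convert to lower case
tokens = [t.strip(punctuation) for t in tokens]      # 4. Strip punctuation from both ends
tokens = [t.replace("'", '') for t in tokens]        # 5. Remove apostrophes
tokens = [t for t in tokens if t != '']              # 6. Remove empty strings

print(tokens)
```

The double space produces an empty string in step 1, which is why the final filtering step is needed.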

## Calculating Word Frequencies

We will determine the number of times each of the unique words appears within the novel. We will perform this task as follows:

1. We use `map()` to create a pair RDD containing elements of the form `(word, 1)`.
2. We use `reduceByKey()` to group the pairs by `word` and sum the 1s within each group, producing the word counts.
3. We use `sortBy()` to sort the RDD by word frequency, in descending order.

To reassure us that our approach was correct, we print the number of elements in the resulting RDD to confirm that it is the same as the number of distinct words in `words_rdd`.

In [0]:
word_freq = (
    words_rdd
    .map(lambda x : (x, 1))
    .reduceByKey(lambda x, y : x + y)
    .sortBy(lambda x : x[1], ascending=False)
)

print(word_freq.count())
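The same three steps can be mimicked in plain Python on a small list, which may make the logic easier to follow. This is only a sketch: the `words` list is hypothetical, and a `Counter` plays the role that `reduceByKey()` plays in Spark.

```python
from collections import Counter

# Hypothetical word list standing in for words_rdd.
words = ['the', 'martians', 'the', 'heat', 'ray', 'the', 'martians']

# 1. Build (word, 1) pairs.
pairs = [(w, 1) for w in words]

# 2. Sum the 1s for each distinct word (the role of reduceByKey).
counts = Counter()
for word, one in pairs:
    counts[word] += one

# 3. Sort by frequency, in descending order (the role of sortBy).
freq = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(freq)
print(len(freq) == len(set(words)))  # One entry per distinct word
```

The final line mirrors the sanity check above: the number of `(word, count)` pairs equals the number of distinct words.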

We will now display the 20 most frequently occurring words in the novel.

In [0]:
for row in word_freq.take(20):
    print(f'{row[0]:<12}{row[1]:>4}')

### Removing Stop Words

These results are not very interesting. The most frequently appearing words are, unsurprisingly, "the", "and", "of", "a", and "i". Commonly occurring, low-information words such as these are often called **stop words**. A typical task in text-based analysis is to remove all instances of stop words. There is no commonly accepted definition of what is and is not a stop word, and the definition used can vary by task.

A document containing the list of stop words that we will use in this example has been provided at the following path: `/FileStore/tables/stopwords.txt`. The document contains one word per line. We will load the contents of this file into an RDD, and will then collect it into a list.

In [0]:
stopwords_rdd = sc.textFile('/FileStore/tables/stopwords.txt')
stopwords = stopwords_rdd.collect()

print(len(stopwords))

We will now view the first 20 stop words contained in our list.

In [0]:
print(stopwords[:20])

We will conclude by reconstructing our word frequency RDD, but with stop words filtered out.

In [0]:
word_freq_ns = (
    words_rdd
    .filter(lambda x : x not in stopwords)
    .map(lambda x : (x, 1))
    .reduceByKey(lambda x, y : x + y)
    .sortBy(lambda x : x[1], ascending=False)
)

print(word_freq_ns.count())
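Because `stopwords` is a Python list, each `not in` test scans the list from the start. A common alternative is to convert the list to a `set`, whose membership tests take constant time on average. The plain-Python sketch below shows the idea on hypothetical data; `stopwords_list`, `stopword_set`, and `words` are illustrative names, not part of the lesson's code.

```python
# Hypothetical stop word list and word list, for illustration only.
stopwords_list = ['the', 'and', 'of', 'a', 'i']
words = ['the', 'martians', 'and', 'the', 'heat', 'ray']

# Converting to a set keeps the same membership semantics
# while making each lookup a hash probe instead of a list scan.
stopword_set = set(stopwords_list)

kept = [w for w in words if w not in stopword_set]
print(kept)
```

The filtered result is the same either way; only the cost of each lookup differs.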

In [0]:
for row in word_freq_ns.take(20):
    print(f'{row[0]:<12}{row[1]:>4}')