# Practice 2 : Working with Key-Value Pairs

1. [Counting Words](#1.-Counting-Words)
1. [Recap](#2.-Recap)
1. [References](#3.-References)

## 1. Counting Words

*This exercise material is taken from [edX - Scalable Machine Learning Lab2](https://github.com/spark-mooc/mooc-setup/blob/master/ML_lab2_word_count_student.ipynb).*

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.

### 1.1 Capitalization and punctuation
Real-world files are more complicated than the data we have been using so far. Some of the issues we have to address are:

- Words should be counted independently of their capitalization (e.g., Spark and spark should be counted as the same word).
- All punctuation should be removed.
- Any leading or trailing spaces on a line should be removed.

Define the function `remove_punct` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python `re` module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful.

In [None]:
import re
def remove_punct(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    return re.<FILL IN>

print(remove_punct('Hi, you!'))
print(remove_punct(' No under_score!'))
print(remove_punct(' *      Remove punctuation then spaces  * '))
print(remove_punct(" The Elephant's (4 cats). "))
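
One possible way to complete `remove_punct` is sketched below. It is not the reference solution of the lab; it assumes that "letter" means the ASCII letters a-z, so accented characters would also be removed.

```python
import re

def remove_punct(text):
    """Lower-case the text, keep only letters, digits and spaces, then strip."""
    # Lower-casing first lets the character class list only a-z.
    lowered = text.lower()
    # Delete every character that is not a lower-case letter, a digit, or a space.
    cleaned = re.sub(r'[^a-z0-9 ]', '', lowered)
    # Strip last, so spaces left next to removed punctuation are trimmed too.
    return cleaned.strip()

print(remove_punct(" The Elephant's (4 cats). "))  # the elephants 4 cats
```

Because the pattern keeps only the plain space character, tabs are deleted as well. Replacing the space in the character class with `\s` would keep all whitespace instead.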

### 1.2 Load from a text file

Create a new RDD from the text files in Shakespeare's comedies folder. The filename and the path are already configured in `filename`. 

Once the RDD is created, apply the `remove_punct` transformation and check the first 10 elements.

In [None]:
import os.path
basedir = '/scratch/formation/spark/data'
inputpath = os.path.join('shakespeare', 'comedies', '*')
filename = os.path.join(basedir, inputpath)

shakespeareComedies = sc.<FILL IN>
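
The loading step could look like the sketch below. The helper name `load_cleaned_lines` is hypothetical and only serves the illustration; it assumes that `sc` is an existing SparkContext and relies on `SparkContext.textFile` accepting wildcards such as the `*` in `filename`.

```python
def load_cleaned_lines(sc, path, clean):
    """Read text files into an RDD of lines and apply `clean` to each line.

    `sc` is assumed to be an existing SparkContext; `path` may contain wildcards.
    """
    lines = sc.textFile(path)  # one RDD element per line of text
    return lines.map(clean)    # transformation, evaluated lazily

# Usage in the notebook (assumes `sc`, `filename` and `remove_punct` exist):
#   shakespeareComedies = load_cleaned_lines(sc, filename, remove_punct)
#   print(shakespeareComedies.take(10))
```

`take(10)` is an action, so it is the call that actually triggers reading the files.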

### 1.2 Words from lines 

Before we can count word frequencies, we have to address two issues with the format of the RDD:

- First, we need to split each line into words.
- Second, we need to filter out the empty elements that empty lines leave behind.

Apply a transformation that will split each element of the RDD.

Words can be separated by characters other than a single space, for example tabs (`\t`). Make sure you cover every case when splitting the lines. Called with an explicit separator, Python's `str.split` splits on that one fixed separator string only. To split on several different separator characters, look at [`re.split`](https://docs.python.org/3/library/re.html#re.split).
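
As a minimal sketch, the splitting function could be written with `re.split` and the `\s+` pattern, which matches any run of whitespace. The name `split_words` is an assumption made for this example.

```python
import re

def split_words(line):
    """Split a line on any run of whitespace (spaces, tabs, newlines)."""
    return re.split(r'\s+', line)

print(split_words('to be\tor  not'))  # ['to', 'be', 'or', 'not']
print(split_words(''))                # ['']

# With an RDD of lines, this function would typically be passed to flatMap
# so that the result is one flat RDD of words rather than an RDD of lists:
#   words = shakespeareComedies.flatMap(split_words)
```

Note that an empty line yields a list containing one empty string, which is why a filtering step is still needed afterwards.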

### 1.3 Remove empty elements

The next step is to filter out the empty elements. Remove all entries where the word is the empty string `''`.
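
The predicate for this step can be tried on a plain Python list first. The sketch below uses a hypothetical helper `is_not_empty`; the same function could then be passed to `RDD.filter`.

```python
def is_not_empty(word):
    """Return True for every word except the empty string."""
    return word != ''

sample = ['to', '', 'be', '']
print(list(filter(is_not_empty, sample)))  # ['to', 'be']

# On an RDD of words (assumed to be named `words`):
#   words = words.filter(is_not_empty)
```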

### 1.4 Count the words 

We now have an RDD that contains only words. The next step is to transform this RDD into a key-value pair RDD and count the occurrences of each word.

Once this is done, return the 10 least common words in the dataset.
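
A sketch of the counting step, under the assumption that the input is an RDD of non-empty words: each word is mapped to a `(word, 1)` pair, the ones are summed per key with `reduceByKey`, and `takeOrdered` with a `key` function picks the pairs with the smallest counts. The function names are hypothetical.

```python
from operator import add

def count_words(words_rdd):
    """Turn an RDD of words into an RDD of (word, count) pairs."""
    pairs = words_rdd.map(lambda word: (word, 1))
    return pairs.reduceByKey(add)  # sum the 1s of each distinct word

def least_common(counts_rdd, n=10):
    """Return the n (word, count) pairs with the smallest counts."""
    return counts_rdd.takeOrdered(n, key=lambda pair: pair[1])

# Usage in the notebook (assumes an RDD of words named `words`):
#   print(least_common(count_words(words)))
```

Passing `key=lambda pair: -pair[1]` instead would return the most common words.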

### 1.5 Halt the SparkContext
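
In the notebook, calling `sc.stop()` in the last cell is enough to halt the context. The sketch below shows a slightly more defensive variant with a hypothetical helper, `run_then_stop`, which stops the context even if the job raises an exception. It assumes that `sc` is the SparkContext used in the previous steps.

```python
def run_then_stop(sc, job):
    """Run `job(sc)` and always stop the SparkContext afterwards."""
    try:
        return job(sc)
    finally:
        sc.stop()  # releases the resources held by the context

# Usage in the notebook (assumes `sc` and `filename` exist):
#   n_lines = run_then_stop(sc, lambda ctx: ctx.textFile(filename).count())
```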

## 2. Recap

In this notebook, we used and learned about the following parts of the
**[Python Spark API](http://spark.apache.org/docs/latest/api/python/)**:

1. Create an RDD from text files:
**[`SparkContext.textFile(path)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.textFile)**
2. Apply a transformation to each element of an RDD:
**[`RDD.map(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map)**
3. Apply a transformation to each element of an RDD, then flatten the results:
**[`RDD.flatMap(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap)**
4. Filter an RDD:
**[`RDD.filter(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.filter)**
5. Merge the values for each key:
**[`RDD.reduceByKey(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey)**
6. Get the first N elements of an RDD in ascending order:
**[`RDD.takeOrdered(N)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.takeOrdered)**
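
The difference between `map` and `flatMap` can be illustrated without Spark. The sketch below emulates both on a plain Python list; it is a local analogy, not Spark code.

```python
from itertools import chain

lines = ['to be', 'or not']

# Like RDD.map: one output element (here a list of words) per input element.
mapped = [line.split(' ') for line in lines]
print(mapped)     # [['to', 'be'], ['or', 'not']]

# Like RDD.flatMap: the per-element results are flattened into one sequence.
flattened = list(chain.from_iterable(mapped))
print(flattened)  # ['to', 'be', 'or', 'not']
```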


## 3. References

* [Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia - Learning Spark (O'Reilly)](http://shop.oreilly.com/product/0636920028512.do)
* [Heather Miller - Parallel Programming and Data Analysis](http://heather.miller.am/teaching/cs212/slides/)