# DS-GA-3001 Advanced Python for Data Science

Before you turn this problem in, make sure you **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart). You can then run the cells **in order**, during the class.

Any textual answers that need to be provided will be marked with "YOUR ANSWER HERE". Replace this text with your answer to the question.

Any code answers that need to be provided will be marked with:

```
# YOUR CODE HERE
raise NotImplementedError()
```

Replace all this code with your answer to the question. If you do not answer the question, the `NotImplementedError` exception will be raised, which will indicate to the grader that no answer has been supplied.

In many cases, code answers will also have some associated test code. You should execute the tests after you have entered your code in order to ensure that your answer is correct. You should not proceed to the next question until your answer is correct.

Finally, insert your Net ID and the Net IDs of any collaborators in the cell below.

In [1]:
NET_ID = "jl6583"
COLLABORATORS = ""

---

# Big Data With PySpark

Any distributed computing framework needs to solve two problems: how to distribute data and how to distribute computation. One such framework is [Apache Hadoop](http://hadoop.apache.org). Hadoop uses the Hadoop Distributed Filesystem (HDFS) to solve the distributed data problem and MapReduce as the programming paradigm that provides effective distributed computation.

[Apache Spark](https://spark.apache.org) is a general purpose cluster computing framework that provides ***efficient in-memory computations for large data sets*** by distributing computation across multiple computers. Spark can utilize the Hadoop framework or run standalone.

Spark has a functional programming API in multiple languages that provides more operators than map and reduce, and does this via a distributed data framework called ***resilient distributed datasets*** or ***RDDs***.

RDDs are essentially a programming abstraction that represents a read-only collection of objects that are partitioned across machines. RDDs are fault tolerant and are accessed via parallel operations.

Because RDDs can be cached in memory, Spark is extremely effective at iterative applications, where the data is being reused throughout the course of an algorithm. Most machine learning and optimization algorithms are iterative, making Spark an extremely effective tool for data science. Additionally, because Spark is so fast, it can be accessed in an interactive fashion via a command line prompt similar to the Python REPL.

The Spark library itself contains a lot of the application elements that have found their way into most Big Data applications including support for SQL-like querying of big data, machine learning and graph algorithms, and even support for live streaming data.

![](https://spark.apache.org/images/spark-stack.png)

The core components are:

- ***Spark Core***: Contains the basic functionality of Spark; in particular the APIs that define RDDs and the operations and actions that can be undertaken upon them. The rest of Spark's libraries are built on top of the RDD and Spark Core.
- ***Spark SQL***: Provides APIs for interacting with Spark via the Apache Hive variant of SQL called Hive Query Language (HiveQL). Every database table is represented as an RDD and Spark SQL queries are transformed into Spark operations. For those that are familiar with Hive and HiveQL, Spark can act as a drop-in replacement.
- ***Spark Streaming***: Enables the processing and manipulation of live streams of data in real time. Many streaming data libraries (such as Apache Storm) exist for handling real-time data. Spark Streaming enables programs to leverage this data similar to how you would interact with a normal RDD as data is flowing in.
- ***MLlib***: A library of common machine learning algorithms implemented as Spark operations on RDDs. This library contains scalable learning algorithms like classifications, regressions, etc. that require iterative operations across large data sets. The Mahout library, formerly the Big Data machine learning library of choice, will move to Spark for its implementations in the future.
- ***GraphX***: A collection of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API to include operations for manipulating graphs, creating subgraphs, or accessing all vertices in a path.

Because these components meet many Big Data requirements as well as the algorithmic and computational requirements of many data science tasks, Spark has been growing rapidly in popularity. Spark also provides APIs in Scala, Java, and Python, meeting the needs of many different groups and allowing more data scientists to easily adopt Spark as their Big Data solution.

## MapReduce primer

MapReduce is a software framework for processing large data sets in a distributed fashion over several machines. The core idea behind MapReduce is mapping your data set into a collection of (key, value) pairs, and then reducing over all pairs with the same key. The overall concept is simple, but is actually quite expressive when you consider that: 
 
1. Almost all data can be mapped into (key, value) pairs somehow, and 
2. Your keys and values may be of any type: strings, integers, dummy types, and, of course, (key, value) pairs themselves
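
As a rough illustration of that idea, the following plain-Python sketch (single process, no cluster) maps records to (key, value) pairs, groups the pairs by key, and reduces each group. The helper name `map_reduce` and the sample words are made up for this example; here the keys are integers (word lengths) and the values are strings.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # The mapper may emit any number of (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce the values collected for each key down to a single result.
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical input: key = word length (an int), value = the word itself.
sample = ['deer', 'bear', 'river', 'car', 'ox']
last_by_length = map_reduce(
    sample,
    mapper=lambda w: [(len(w), w)],
    reducer=max,  # keep the alphabetically last word of each length
)
```

With this input, `last_by_length` maps each word length to one representative word, for example `4` to `'deer'`.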

The canonical MapReduce use case is counting word frequencies in a large text, but some other examples of what you can do in the MapReduce framework include: 
 
- Distributed sort 
- Distributed search 
- Web‐link graph traversal 
- Machine learning 
- ...

Counting the number of occurrences of words in a text is sometimes considered the “Hello world!” equivalent of MapReduce. A classical way to write such a program is presented in the Python script below. The script is very simple: it parses the file, extracts and counts the words, and stores the result in a dictionary that uses words as keys and the number of occurrences as values.

First, download [Moby Dick, the novel by Herman Melville](https://www.gutenberg.org/cache/epub/2701/pg2701.txt) and place the `pg2701.txt` file in the `/tmp` directory.

In [2]:
import re

# remove any non-words and split lines into separate words
# finally, convert all words to lowercase
def splitter(line):
    line = re.sub(r'^\W+|\W+$', '', line)
    return map(str.lower, re.split(r'\W+', line))
  
sums = {}
try:
    in_file = open('/tmp/pg2701.txt', 'r')

    for line in in_file:
        for word in splitter(line):
            word = word.lower()
            sums[word] = sums.get(word, 0) + 1
            
    in_file.close()

except IOError:
    print "error performing file operation"
else:
    M = max(sums.iterkeys(), key=lambda k: sums[k])
    print "max: %s = %d" % (M, sums[M])

max: the = 14620


The main problem with this approach comes from the fact that it requires the use of a dictionary, i.e., a central data structure used to progressively build and store all the intermediate results until the final result is reached.

Since the code we use can only run on a single processor, the best we can expect is that the **time necessary to process a given text will be proportional to the size of the text** (i.e., the number of words processed per second is constant). Actually, the performance degrades as the size of the dictionary grows. As shown on the diagram below, the number of words processed per second diminishes when the size of the dictionary reaches the size of the processor data cache (note that if the cache is structured in several layers of different speeds, the processing speed will decrease each time the dictionary reaches the size of a layer). A new diminution of the processing speed will be reached when the dictionary reaches the size of the Random Access Memory. Eventually, if the dictionary continues to grow, it will exceed the capacity of the swap and an exception will be raised.

![](http://drive.google.com/uc?export=view&id=0B_3lImS7uRMgal8zM3R5Y204aDg)

### The MapReduce approach

The main advantage of the MapReduce approach is that it does not require a central data structure. 

MapReduce consists of 3 steps:

1. A ***mapping*** step that produces intermediate results and associates them with an output key;
2. A ***shuffling*** step that groups intermediate results associated with the same output key; and
3. A ***reducing*** step that processes groups of intermediate results with the same output key.

![](http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png)

#### Mapping

The mapping step is very simple. The idea is to apply a function to each element of a list and collect the results. Python provides the `map` function, which takes a function and a sequence of input values and returns a sequence of values that have had the function applied to them.

In our word count example, we want to map each word in the input file into a key/value pair containing the word as the key and the number of occurrences as the value. This represents an intermediate result that says: “this word occurs one time”. This is equivalent to the following:

In [3]:
words = ['Deer', 'Bear', 'River', 'Car', 'Car', 'River', 'Deer', 'Car', 'Bear']
mapping = map((lambda x : (x, 1)), words)
print mapping

[('Deer', 1), ('Bear', 1), ('River', 1), ('Car', 1), ('Car', 1), ('River', 1), ('Deer', 1), ('Car', 1), ('Bear', 1)]


However, using `map` for our example would result in reading the whole file into memory before we can perform the map operation. This would be no better than the original version, so instead it is done using a temporary file (that we will use later), as follows:

In [4]:
input_file = '/tmp/pg2701.txt'
map_file = '/tmp/pg2701.txt.map'
sorted_map_file = '/tmp/pg2701.txt.map.sorted'


# Implement our mapping function
import re
sums = {}
try:
    in_file = open(input_file, 'r')
    out_file = open(map_file, 'w')

    for line in in_file:
        for word in splitter(line):
            out_file.write(word.lower() + "\t1\n") # Separate key and value with 'tab'
            
    in_file.close()
    out_file.close()

except IOError:
    print "error performing file operation"

#### Shuffling

The shuffling step consists of grouping all the intermediate values that have the same output key. In our word count example, we want to sort the intermediate key/value pairs on their keys. We can use the `sorted` function for the simple case:

In [5]:
sorted_mapping = sorted(mapping)
print sorted_mapping

[('Bear', 1), ('Bear', 1), ('Car', 1), ('Car', 1), ('Car', 1), ('Deer', 1), ('Deer', 1), ('River', 1), ('River', 1)]


Of course, this is still using the in-memory copy. Instead, we can use a Python program for sorting large files:

In [6]:
def build_index(filename):
    index = []
    f = open(filename)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        length = len(line)
        col = line.split('\t')[0].strip()
        index.append((col, offset, length))
    f.close()
    index.sort()
    return index

try:
    index = build_index(map_file)
    in_file = open(map_file, 'r')
    out_file = open(sorted_map_file, 'w')
    for col, offset, length in index:
        in_file.seek(offset)
        out_file.write(in_file.read(length))
    in_file.close()
    out_file.close()
except IOError:
    print "error performing file operation"
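
The index-based sort above still keeps one index entry per line in memory. A common alternative, which is not what the cell above does, is an external merge sort: sort bounded-size chunks, write each chunk to a temporary file, then merge the sorted chunks lazily with `heapq.merge`. The sketch below assumes every line ends with a newline (as the map file's lines do); the function name `sort_large_file` and the `chunk_lines` parameter are hypothetical.

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_large_file(src, dst, chunk_lines=100000):
    """External merge sort of the lines in `src`, written to `dst`."""
    chunk_paths = []
    with open(src) as f:
        while True:
            # Read at most `chunk_lines` lines so memory use stays bounded.
            chunk = list(islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, 'w') as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    files = [open(p) for p in chunk_paths]
    try:
        with open(dst, 'w') as out:
            # heapq.merge lazily merges the already-sorted chunk files.
            out.writelines(heapq.merge(*files))
    finally:
        for fh in files:
            fh.close()
        for p in chunk_paths:
            os.remove(p)
```

Because every value in the map file is `1`, sorting whole lines keeps identical keys next to each other, which is all the reducing step needs.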

#### Reducing

For the reduction step, we just need to count the number of values with the same key. Now that the different values are ordered by keys (i.e., the different words are listed in alphabetic order), it becomes easy to count the number of times they occur by summing values as long as they have the same key. Using lambda functions in Python, this looks like:

In [7]:
from itertools import groupby

# 1. Group by key yielding (key, grouper)
# 2. For each pair, yield (key, reduce(func, last element of each grouper))
grouper = groupby(sorted_mapping, lambda p:p[0])
print map(lambda l: (l[0], reduce(lambda x, y: x + y, map(lambda p:p[1], l[1]))), grouper)

[('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]


For our sorted mapping file, it's also straightforward. We read each key/value pair and keep adding to the count until we find a different key. At that point we check whether the finished key has the highest count seen so far, then reset the count for the next key.

In [8]:
previous = None
M = [None, 0]

def checkmax(key, total):
    # M is mutated in place, so no `global` declaration is needed
    if M[1] < total:
        M[1] = total
        M[0] = key

try:
    in_file = open(sorted_map_file, 'r')
    for line in in_file:
        key, value = line.split('\t')
        
        if key != previous:
            if previous is not None:
                checkmax(previous, sum)
            previous = key
            sum = 0
            
        sum += int(value)
        
    checkmax(previous, sum)
    in_file.close()
except IOError:
    print "error performing file operation"
    
print "max: %s = %d" % (M[0], M[1])

max: the = 14620


Although these three steps seem like a complicated way to achieve the same result, there are a few key differences:

- In each of the three steps, the entire contents of the file never had to be held in memory. This means that the program is not affected by the same caching issues as the simple version.
- The mapping function can be split into many independent parallel tasks, each generating separate files. 
- The shuffling and reducing functions can also be split into many independent parallel tasks, with the final result being written to an output file.
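
To make the parallel structure concrete, here is a minimal sketch that counts words in independent chunks and then merges the partial counts. It uses a thread pool only to show the shape of the computation; in CPython, threads will not speed up this CPU-bound work, and a real deployment would use separate processes or machines. The function names and sample lines are invented for the example, and it requires Python 3.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """Map + local reduce: count the words in one independent chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_chunks=4):
    # Split the input into independent chunks (strided for simplicity).
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Final reduce: merge the per-chunk counts.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ['the whale', 'the sea and the ship', 'a whale']
totals = parallel_word_count(lines, n_chunks=2)
```

No chunk needs to see another chunk's data until the final merge, which is what makes the work easy to distribute.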

The fact that the MapReduce algorithm can be parallelized easily and efficiently means that it is ideally suited for applications on very large data sets, as well as where resilience is required.

MapReduce is clearly not a general-purpose framework for all forms of parallel programming. Rather, it is designed specifically for problems that can be broken up into the map-reduce paradigm. Perhaps surprisingly, there are a lot of data analysis tasks that fit nicely into this model. While MapReduce is heavily used within Google, it has also found use in companies such as Yahoo, Facebook, and Amazon.

The original, and proprietary, implementation was done by Google. It is used internally for a large number of Google services. The Apache Hadoop project built a clone to specs defined by Google. Amazon, in turn, uses Hadoop MapReduce running on their EC2 (Elastic Compute Cloud) computing-on-demand service to offer the Amazon Elastic MapReduce service.

## Introduction to Spark

### Spark Installation

Download Apache Spark from [here](http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz). This should work with Windows, Linux, and Mac OS X.

Untar and uncompress the archive, then move it to a known location, e.g. `/home/user/spark`. 

Run the following command from a shell (where you would normally run `jupyter notebook`). This should start a notebook with spark enabled.

PySpark will automatically create a [`SparkContext`](https://spark.apache.org/docs/1.1.1/api/python/pyspark.context.SparkContext-class.html) for you to work with using the local Spark configuration. We can check that Spark is loaded:

In [9]:
print sc

<pyspark.context.SparkContext object at 0x109064050>


### Spark Overview

Programming Spark applications is similar to other data flow languages that had previously been implemented on Hadoop:

- A ***driver*** program defines the computation, which is lazily evaluated.
- One or more workers, called ***executors***, run the driver's code on their partitions of the RDD, which is distributed across the cluster. 
- Results are then sent back to the driver for aggregation or compilation. 

Essentially the driver program creates one or more RDDs, applies operations to transform the RDD, then invokes some action on the transformed RDD.

These steps are outlined as follows:

1. Define one or more RDDs either through accessing data stored on disk (HDFS, Cassandra, HBase, Local Disk), parallelizing some collection in memory, transforming an existing RDD, or by caching or saving.
2. Invoke operations on the RDD by passing closures (functions) to each element of the RDD. Spark offers over 80 high level operators beyond Map and Reduce.
3. Use the resulting RDDs with actions (e.g. count, collect, save, etc.). Actions kick off the computing on the cluster.
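
The split between lazy transformations and eager actions can be illustrated without Spark. The toy class below is not Spark's API; `LazyDataset` is a hypothetical stand-in that only records `map` and `filter` steps and runs them when an action (`collect` or `count`) is called.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (('map', func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (('filter', pred),))

    def _run(self):
        # Only called by actions: replay every recorded step per element.
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == 'map':
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: triggers the computation and returns the results.
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time the mapped function actually runs
squares = LazyDataset(range(5)).map(lambda x: calls.append(x) or x * x)
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = evens.collect()          # the action runs the whole pipeline
```

Deferring execution this way lets a framework see the whole chain of operations before deciding how to schedule it.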

When Spark runs a closure on a worker, any variables used in the closure are copied to that node, but are maintained within the local scope of that closure. 

Spark provides two types of shared variables that can be interacted with by all workers in a restricted fashion. 

- ***Broadcast*** variables are distributed to all workers, but are read-only. These variables can be used as lookup tables or stopword lists. 
- ***Accumulators*** are variables that workers can "add" to using associative operations and are typically used as counters.
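
The following plain-Python sketch mimics why ordinary variables captured by a closure do not update on the driver: each simulated worker receives its own copy of the state (made here with a `pickle` round trip), so its changes stay local. Returning partial results and adding them up on the driver plays roughly the role of an accumulator. All names here (`run_on_worker`, `count_long_words`, the sample words) are hypothetical, and this is only an analogy, not how Spark is implemented.

```python
import pickle


def run_on_worker(task, state, partition):
    """Simulate shipping `state` to a worker: it receives its own copy."""
    local_state = pickle.loads(pickle.dumps(state))
    return task(local_state, partition)


def count_long_words(state, words):
    # Mutates only the worker's private copy of `state`.
    for w in words:
        if len(w) > state['min_len']:
            state['hits'] += 1
    return state['hits']


driver_state = {'min_len': 3, 'hits': 0}
partitions = [['whale', 'sea'], ['ship', 'a', 'captain']]

# Each worker returns its partial count; the driver adds them up.
partials = [run_on_worker(count_long_words, driver_state, p) for p in partitions]
total_hits = sum(partials)
```

After this runs, `driver_state['hits']` is still `0`, while `total_hits` holds the combined count.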

### Spark Execution

Essentially, Spark applications are run as independent sets of processes, coordinated by a ***SparkContext*** in the driver program. The context will connect to some cluster manager (e.g. YARN) which allocates system resources. Each worker in the cluster is managed by an executor, which is in turn managed by the SparkContext. The executor manages computation as well as storage and caching on each machine.

What is important to note is that:
- Application code is sent from the driver to the executors, and the executors specify the context and the various tasks to be run. 
- The executors communicate back and forth with the driver for data sharing or for interaction. 
- Drivers are key participants in Spark jobs, and therefore, they should be on the same network as the cluster. 

This is different from Hadoop code, where you might submit a job from anywhere to the JobTracker, which then handles the execution on the cluster.

### MapReduce with Spark

To start using Spark, we have to create an [RDD](https://spark.apache.org/docs/1.1.1/api/python/pyspark.rdd.RDD-class.html). The `SparkContext` provides a number of methods to do this. We will use the `textFile` method, which reads a file and creates an RDD of strings, one for each line in the file.

In [10]:
text = sc.textFile('/tmp/pg2701.txt', use_unicode=False)
print text.take(10)

['The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville', '', 'This eBook is for the use of anyone anywhere at no cost and with', 'almost no restrictions whatsoever.  You may copy it, give it away or', 're-use it under the terms of the Project Gutenberg License included', 'with this eBook or online at www.gutenberg.org', '', '', 'Title: Moby Dick; or The Whale', '']


We use the same `splitter` function to split lines correctly. The `flatMap` method applies the function to all elements of the RDD and flattens the results into a single list of words.

In [11]:
words = text.flatMap(splitter)
print words.take(10)

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
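
The difference between mapping and flat-mapping can be seen on a small in-memory list. This sketch uses plain Python rather than Spark, and the two sample lines are made up.

```python
from itertools import chain

lines = ['Call me Ishmael', 'Some years ago']

# map: one output element per input element (here, a list per line).
mapped = [line.lower().split() for line in lines]

# flatMap: same function, but the per-line lists are concatenated.
flat_mapped = list(chain.from_iterable(line.lower().split() for line in lines))
```

`mapped` is a list of two lists, whereas `flat_mapped` is a single list of six words.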


Now we perform the mapping step. This is simply the case of applying the function `lambda x: (x,1)` to each element.

In [12]:
words_mapped = words.map(lambda x: (x,1))
print words_mapped.take(10)

[('the', 1), ('project', 1), ('gutenberg', 1), ('ebook', 1), ('of', 1), ('moby', 1), ('dick', 1), ('or', 1), ('the', 1), ('whale', 1)]


The shuffling step is performed using the `sortByKey` method.

In [13]:
sorted_map = words_mapped.sortByKey()
print sorted_map.take(10)

[('', 1), ('', 1), ('', 1), ('', 1), ('', 1), ('', 1), ('', 1), ('', 1), ('', 1), ('', 1)]


Now the reduce step. The `reduceByKey` method uses the supplied function to merge values for each key. In this case, we use the `add` function to perform a sum.

In [14]:
from operator import add
counts = sorted_map.reduceByKey(add)
print counts.take(10)

[('', 3235), ('unimaginative', 1), ('unscientific', 1), ('foul', 11), ('four', 74), ('gag', 2), ('prefix', 1), ('clotted', 2), ('plaudits', 1), ('looking', 70)]
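
As a rough mental model of merging values per key, the sketch below combines values within each partition first and then merges the smaller per-partition results. This is a simplified assumption about how such a reduction can be organized, not Spark's actual implementation; `reduce_by_key` and the sample partitions are hypothetical. Note that the result is not sorted, just as the output above is not, which suggests the grouping does not depend on the earlier sort.

```python
from operator import add


def reduce_by_key(partitions, func):
    """Toy reduceByKey: combine within each partition, then across them."""
    def combine(pairs, into):
        for key, value in pairs:
            into[key] = func(into[key], value) if key in into else value
        return into

    # Step 1: local combine, one dict per partition (no data movement yet).
    local = [combine(p, {}) for p in partitions]
    # Step 2: merge the much smaller per-partition results by key.
    merged = {}
    for partial in local:
        combine(partial.items(), merged)
    return merged


partitions = [
    [('the', 1), ('whale', 1), ('the', 1)],
    [('whale', 1), ('the', 1), ('sea', 1)],
]
counts_by_word = reduce_by_key(partitions, add)
```

Combining locally before merging keeps the amount of data exchanged between partitions small.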


Finally, we can use the `max` method to find the word with the maximum number of occurrences.

In [15]:
print counts.max(lambda x: x[1])

('the', 14620)


### Parallelizing with Spark

Spark also provides the `parallelize` method, which distributes a local Python collection to form an RDD (a cluster is, of course, required to obtain true parallelism).

The following example shows how we can calculate the number of primes in a certain range of numbers. First, we define a function to check if a number is prime. This requires checking whether it is divisible by any odd number up to its square root.

In [16]:
def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up the square root of n
    # for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

Now we can create an RDD comprising all numbers from 0 up to, but not including, `n` (in this case `n = 1000000`).

In [17]:
# Create an RDD of numbers from 0 to 1,000,000
nums = sc.parallelize(xrange(1000000))  # creates an RDD from an iterable

Finally, we use the `filter` method to apply the function to each value, returning an RDD containing only the values for which the function returns `True`. We can then count these to determine the number of primes.

In [18]:
# Compute the number of primes in the RDD
print nums.filter(isprime).count()

78498
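
The same filter-then-count pattern can be tried on a small range without Spark. The sketch below restates the trial-division test in compact form and splits the range into independent partitions (strided here purely for simplicity, which is an assumption of this example and not how Spark slices data); `count_primes` is a hypothetical helper.

```python
def isprime(n):
    """Trial division by odd numbers up to sqrt(n)."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True


def count_primes(limit, n_partitions=4):
    # Split 0..limit-1 into independent partitions.
    partitions = [range(i, limit, n_partitions) for i in range(n_partitions)]
    # filter + count on each partition, then add the partial counts.
    return sum(sum(1 for n in part if isprime(n)) for part in partitions)


primes_below_100 = count_primes(100)
```

Each partition is counted without reference to the others, so the partial counts could be computed on different machines and summed at the end.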


### Exercises

Using the methods available on the [SparkContext](https://spark.apache.org/docs/1.1.1/api/python/pyspark.context.SparkContext-class.html) and [RDD](https://spark.apache.org/docs/1.1.1/api/python/pyspark.rdd.RDD-class.html) objects (see links for details), write the following programs:

#### 1. Count the number of distinct words in `pg2701.txt`.

It's ok to reuse the `splitter` function.

In [19]:
print words.distinct().count()

17355


#### 2. Compute the product of all the numbers between 1 and 1000.

In [20]:
from operator import mul
numset = sc.parallelize(xrange(1, 1001))
print numset.fold(1, mul)  # fold(zero value, op); 1 is the identity element for multiplication

4023872600770937735437024339230039857193748642107146325437999104299385123986290205920442084869694048004799886101971960586316668729948085589013238296699445909974245040870737599188236277271887325197795059509952761208749754624970436014182780946464962910563938874378864873371191810458257836478499770124766328898359557354325131853239584630755574091142624174743493475534286465766116677973966688202912073791438537195882498081268678383745597317461360853795345242215865932019280908782973084313928444032812315586110369768013573042161687476096758713483120254785893207671691324484262361314125087802080002616831510273418279777047846358681701643650241536913982812648102130927612448963599287051149649754199093422215668325720808213331861168115536158365469840467089756029009505376164758477284218896796462449451607653534081989013854424879849599533191017233555566021394503997362807501378376153071277619268490343526252000158885351473316117021039681759215109077880193931781141945452572238655414610628921879602238389714760
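
Why the first argument to `fold` should be the identity element for the operation can be seen in a small simulation. The sketch assumes that the zero value seeds every partition and is applied once more when the partial results are merged, which is my understanding of PySpark's behavior rather than something stated in this notebook; the `fold` function and partition layout below are hypothetical.

```python
from functools import reduce
from operator import add, mul


def fold(partitions, zero, op):
    """Toy fold: `zero` seeds every partition and the final merge."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


parts = [[1, 2, 3], [4, 5], [6]]

product = fold(parts, 1, mul)  # 1 is the identity for mul
total = fold(parts, 0, add)    # 0 is the identity for add
skewed = fold(parts, 1, add)   # 1 is NOT the identity for add
```

Under this assumption, `skewed` differs from `total` by the number of partitions plus one, so a non-identity zero value gives results that depend on how the data happens to be partitioned.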

#### 3. Using the MapReduce approach, calculate the average of the square roots of all the numbers between 1 and 1000.

Hint: Use the `map` and `fold` methods.

In [21]:
from operator import add
n = numset.count()
sq_numset = numset.map(lambda x: x**0.5/n)
print sq_numset.fold(0, add)

21.0974558875
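
The answer above counts the elements first and then divides each term by `n`. One alternative, shown here in plain Python as a sketch rather than as the intended solution, is to carry a (sum, count) pair through a single reduction and divide once at the end.

```python
from functools import reduce

numbers = range(1, 1001)

# Map: each number becomes a (partial sum, partial count) pair.
pairs = ((x ** 0.5, 1) for x in numbers)

# Reduce: add pairs component-wise; this operation is associative.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs, (0.0, 0))

average = total / count
```

Because the pairwise addition is associative, the same reduction could be applied per partition and then across partitions.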


### References

- Benjamin Bengfort, [Getting Started with Spark (in Python)](https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python)
- [A Hands-on Introduction to MapReduce in Python](https://zettadatanet.wordpress.com/2015/04/04/a-hands-on-introduction-to-mapreduce-in-python)
- Lucas Allen, [Spark Dataframes and MLlib](http://www.techpoweredmath.com/spark-dataframes-mllib-tutorial/)