# Discussion Week 3: PySpark Review

## List Comprehensions, Lambdas, Generators, and Yield

In [1]:
# List comprehensions:
x = range(100)

y = [n**2 for n in x if n < 5]

print y

y2 = [n**2 if n % 2 else 0 for n in y]
print y2

print [a * b for a in y for b in y2]

[0, 1, 4, 9, 16]
[0, 1, 0, 81, 0]
[0, 0, 0, 0, 0, 0, 1, 0, 81, 0, 0, 4, 0, 324, 0, 0, 9, 0, 729, 0, 0, 16, 0, 1296, 0]
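
The last comprehension above has two `for` clauses; the leftmost one acts as the outer loop. Note also the difference between a trailing `if` (a filter, as in `y`) and an `if ... else` before the `for` (a conditional expression applied to every element, as in `y2`). A minimal sketch of the equivalent explicit loops, where the names `products` and `same` are introduced here only for illustration:

```python
x = range(100)
y = [n**2 for n in x if n < 5]            # trailing `if` filters elements
y2 = [n**2 if n % 2 else 0 for n in y]    # `if ... else` transforms every element

# Explicit-loop equivalent of [a * b for a in y for b in y2]:
# the leftmost `for` is the outer loop, the next one is the inner loop.
products = []
for a in y:
    for b in y2:
        products.append(a * b)

same = products == [a * b for a in y for b in y2]
```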


In [2]:
# Lambda Expressions

def convert_me(n):
    return 1. / n ** 2

convert_you = lambda x: 1. / x ** 2

convert_me(10) == convert_you(10)

True

In [3]:
gen1 = lambda n: (i for i in range(n))

def gen2(n):
    for i in range(n):
        yield i

In [4]:
g1 = gen1(5)
g2 = gen2(5)
for i in range(5):
    print next(g1) == next(g2)

True
True
True
True
True
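
Both forms above produce generators, which compute values on demand and can be consumed only once. A small sketch of that single-pass behavior, reusing `gen2` from the cell above (the names `first_pass`, `second_pass`, and `sentinel` are introduced here for illustration):

```python
def gen2(n):
    for i in range(n):
        yield i

g = gen2(3)
first_pass = list(g)    # consumes every value the generator yields
second_pass = list(g)   # the generator is now exhausted, so this is empty

# next() on an exhausted generator raises StopIteration
# unless a default value is supplied as the second argument.
sentinel = next(g, None)
```

This matters for functions such as `get_site` later in the notebook, which receive an iterator over a partition and can walk it only once.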


## RDDs

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

In [3]:
import sys
import os

sys.path.append(os.path.join(os.getcwd(), 'cs186_spark', 'python', 'lib', 'pyspark.zip'))
sys.path.append(os.path.join(os.getcwd(), 'cs186_spark', 'python', 'lib', 'py4j-0.9-src.zip'))
os.environ["SPARK_HOME"] = os.path.join(os.getcwd(), 'cs186_spark')

In [4]:
import pyspark
n = pyspark.SparkContext()

Let's create a new RDD from a collection of the integers 0 through 9.

In [9]:
rdd1 = n.parallelize(range(10), 10)

`rdd.collect()` returns a list that contains all of the elements in this RDD.

In [10]:
rdd1.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Instead of making our own collection, let's load in a file:

In [11]:
rdd2 = n.textFile('link_text.txt')

Let's take a peek at the first few entries in this document:

In [12]:
print "Here is the Raw document"

!head -n 5 link_text.txt

Here is the Raw document
http://gitxiv.com/posts/tfkjEgw9x4KSi2GnH/long-short-term-memory-networks-for-machine-reading
http://fivethirtyeight.com/features/the-rent-seeking-is-too-damn-high/
http://nautil.us/issue/33/attraction/love-is-like-cocaine
http://swiftmonthly.com/issues/latest/?feb2016
http://arstechnica.com/tech-policy/2016/02/profs-protest-invasive-cybersecurity-measures-at-university-of-california-campuses/


In [13]:
rdd2.take(5)

[u'http://gitxiv.com/posts/tfkjEgw9x4KSi2GnH/long-short-term-memory-networks-for-machine-reading',
 u'http://fivethirtyeight.com/features/the-rent-seeking-is-too-damn-high/',
 u'http://nautil.us/issue/33/attraction/love-is-like-cocaine',
 u'http://swiftmonthly.com/issues/latest/?feb2016',
 u'http://arstechnica.com/tech-policy/2016/02/profs-protest-invasive-cybersecurity-measures-at-university-of-california-campuses/']

Let's do something interesting with this data: get the domains of all of the websites.

In [14]:
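def get_site(iterator):
    # Yield the text between "www." and ".com" for each link in a partition.
    for link in iterator:
        index = link.find("www.")  # str.find returns -1 when there is no match
        end = link.find(".com")
        # Links without both "www." and ".com" are skipped.
        if index > 0 and end > 0:
            yield link[index + 4: end]

site_rdd = rdd2.mapPartitionsWithIndex(lambda index, iterator: get_site(iterator))
# Note that the partition index is discarded here.
# Also note that nothing is computed yet: transformations are lazy,
# so no work happens until an action such as take() is called.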

In [15]:
site_rdd.take(5)

[u'nytimes', u'bloomberg', u'righto', u'clicktorelease', u'nytimes']
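
The `mapPartitionsWithIndex` call returned immediately because it only describes a computation; the work ran when `take` asked for results. A pure-Python sketch of the same idea, with no Spark involved. The helper `map_partitions_with_index`, the `calls` list, and the example links are hypothetical stand-ins for illustration, not PySpark API or data from `link_text.txt`:

```python
import itertools

calls = []  # records every link that get_site actually processes

def get_site(iterator):
    for link in iterator:
        calls.append(link)
        index = link.find("www.")
        end = link.find(".com")
        if index > 0 and end > 0:
            yield link[index + 4: end]

def map_partitions_with_index(partitions, f):
    # Hypothetical stand-in: call f(index, iterator) per partition
    # and chain the resulting generators lazily.
    return itertools.chain.from_iterable(
        f(i, iter(part)) for i, part in enumerate(partitions))

# Made-up links split into two "partitions".
partitions = [
    ["http://www.example.com/a", "http://nautil.us/b"],
    ["http://www.sample.com/c", "http://www.demo.com/d"],
]

sites = map_partitions_with_index(partitions, lambda index, it: get_site(it))
processed_before = len(calls)   # nothing has been processed yet

# Loosely analogous to take(2): pull only as many results as needed.
first_two = list(itertools.islice(sites, 2))
processed_after = len(calls)    # only the links needed so far were touched
```

In this sketch the last link is never examined, because two results were available before the iterator reached it.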

In [16]:
print rdd2.getNumPartitions(), rdd2.count()

2 685


### Notice how the object itself is not very eventful...

In [17]:
rdd2

MapPartitionsRDD[2] at textFile at NativeMethodAccessorImpl.java:-2

Here is the raw implementation of `rdd.distinct` from PySpark:

In [18]:
def distinct(self, numPartitions=None):
    """
    Return a new RDD containing the distinct elements in this RDD.

    >>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
    [1, 2, 3]
    """
    return self.map(lambda x: (x, None)) \
               .reduceByKey(lambda x, _: x, numPartitions) \
               .map(lambda x: x[0])
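
The three steps in `distinct` can be traced on a plain list. The sketch below is a local imitation under simplifying assumptions (one machine, no partitions); `reduce_by_key` and `distinct_local` are hypothetical helpers written for this illustration, not PySpark functions:

```python
def reduce_by_key(pairs, func):
    # Hypothetical local stand-in for RDD.reduceByKey:
    # merge all values that share a key using func.
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

def distinct_local(elements):
    pairs = [(x, None) for x in elements]            # map(lambda x: (x, None))
    reduced = reduce_by_key(pairs, lambda x, _: x)   # one entry survives per key
    return [kv[0] for kv in reduced]                 # map(lambda x: x[0])

result = sorted(distinct_local([1, 1, 2, 3]))
```

Each element becomes a key with a dummy value, so reducing by key collapses duplicates, and the final map strips the dummy value off again.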

Let us get the distinct site names:

In [19]:
site_rdd.distinct().take(10)

[u'',
 u'cnbc',
 u'bloomberg',
 u'gq',
 u'timesofisrael',
 u'thestar',
 u'reddit',
 u'linkedin',
 u'easypost',
 u'rbmojournal']

### The rest is left as an optional exercise. Can you guess what's going on below?

In [5]:
redrdd = n.parallelize(range(24), 4)
testrdd = redrdd.map(lambda x: ((x ** 2 * 7) % 13, x))
testrdd.take(5)

[(0, 0), (7, 1), (2, 2), (11, 3), (8, 4)]

In [6]:
def checkit(rdd):
    # Pair every element with the index of the partition that holds it.
    def p(n, itr):
        for i in itr:
            yield (n, i)
    return rdd.mapPartitionsWithIndex(p).collect()

testrdd = testrdd.reduceByKey(lambda x, y: x + y, 3)

In [7]:
testrdd.take(5)

[(0, 13), (6, 52), (7, 27), (8, 52), (2, 28)]

In [22]:
checkit(testrdd)

[(0, (0, 13)),
 (0, (6, 52)),
 (1, (7, 27)),
 (2, (8, 52)),
 (2, (2, 28)),
 (2, (11, 52)),
 (2, (5, 52))]
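
One possible reading of the exercise, sketched in plain Python. It assumes that `reduceByKey` places each key in partition `hash(key) % numPartitions` and that small non-negative integers hash to themselves; both are assumptions about PySpark's default partitioner, not something this notebook states. The names `sums`, `by_partition`, and `layout` are introduced here for illustration:

```python
from collections import defaultdict

num_partitions = 3

# Same keying as testrdd: key = (x ** 2 * 7) % 13, value = x
pairs = [((x ** 2 * 7) % 13, x) for x in range(24)]

# reduceByKey(lambda x, y: x + y): sum the values that share a key
sums = defaultdict(int)
for key, value in pairs:
    sums[key] += value

# Assumed placement rule: partition index = key % num_partitions
by_partition = defaultdict(list)
for key in sorted(sums):
    by_partition[key % num_partitions].append((key, sums[key]))

layout = dict(by_partition)
```

Under these assumptions the grouping agrees with the `checkit` output: keys 0 and 6 land in partition 0, key 7 in partition 1, and keys 2, 5, 8, and 11 in partition 2. The order of elements within a partition may differ from what Spark reports.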