# Spark Tutorials - Learning Apache Spark

**`Apache Spark`** is a framework for large-scale data processing. Many traditional frameworks were designed to run on a single computer. However, many datasets today are too large to be stored on one machine, and even when a dataset does fit on one machine, it can often be processed much more quickly by spreading the work across several.

# `Transformation`
`map(), mapPartitions(), mapPartitionsWithIndex(), filter(), flatMap(), reduceByKey(), groupByKey()`

In [78]:
# Create an RDD holding the integers 1 through 20
dataRDD = sc.parallelize(range(1, 21))

# Check how many partitions the RDD was split into
dataRDD.getNumPartitions()

4

In [60]:
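# Subtract 1 from each number and find the max (not echoed: only the last expression is)
dataRDD.map(lambda x: x - 1).max()
# Show the lineage of the RDD
dataRDD.toDebugString()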

'(4) PythonRDD[49] at RDD at PythonRDD.scala:43 []\n |  ParallelCollectionRDD[47] at parallelize at PythonRDD.scala:423 []'

In [79]:
# Check the number of partitions
print(dataRDD.getNumPartitions())

# Keep only the even numbers
evenRDD = dataRDD.filter(lambda x: x % 2 == 0)

# Reduce by adding up all values in the RDD
print(evenRDD.reduce(lambda x, y: x + y))


# Use Python add function to sum
from operator import add
print(evenRDD.reduce(add))

4
110
110


# `Action`
`first(), take(), takeSample(), takeOrdered(), collect(), count(), countByValue(), reduce(), top()`

In [80]:
# Take first n values
evenRDD.take(10)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

In [81]:
# Count how many times each distinct value occurs and return a dictionary of value: count
evenRDD.countByValue()

defaultdict(int,
            {2: 1, 4: 1, 6: 1, 8: 1, 10: 1, 12: 1, 14: 1, 16: 1, 18: 1, 20: 1})

## `reduceByKey()`, `combineByKey()` and `foldByKey()` are better than `groupByKey()`!

In [91]:
pairRDD = sc.parallelize([('a', 1), ('a', 2), ('b', 1)])

# mapValues is used here only to turn the grouped values into lists for printing
print(pairRDD.groupByKey().mapValues(lambda x: list(x)).collect())

# Different ways to sum by key
print(pairRDD.groupByKey().map(lambda kv: (kv[0], sum(kv[1]))).collect())

# Using mapValues, which is recommended when the key doesn't change
print(pairRDD.groupByKey().mapValues(lambda x: sum(x)).collect())

# reduceByKey is more efficient / scalable
print(pairRDD.reduceByKey(add).collect())

[('a', [1, 2]), ('b', [1])]
[('a', 3), ('b', 1)]
[('a', 3), ('b', 1)]
[('a', 3), ('b', 1)]
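
The usual explanation for why `reduceByKey()` scales better is that it combines values for each key inside every partition before the shuffle, whereas `groupByKey()` ships every record. The sketch below imitates that difference in plain Python, without Spark. The two-partition layout and the helper names `group_by_key_sim` and `reduce_by_key_sim` are assumptions made for illustration only, not part of the Spark API.

```python
from collections import defaultdict
from operator import add

# Assumed layout: the pair data from above split across two partitions (plain lists, not RDDs)
partitions = [[('a', 1), ('a', 2)], [('b', 1)]]

def group_by_key_sim(parts):
    """Ship every record unchanged, then sum per key after the simulated shuffle."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    result = {key: sum(values) for key, values in grouped.items()}
    return result, len(shuffled)

def reduce_by_key_sim(parts, func):
    """Combine values per key inside each partition first, then merge the partial results."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)

group_result, group_shuffled = group_by_key_sim(partitions)
reduce_result, reduce_shuffled = reduce_by_key_sim(partitions, add)

# Same sums either way, but fewer records cross the simulated shuffle with reduce_by_key_sim
print(group_result, group_shuffled)
print(reduce_result, reduce_shuffled)
```

Both helpers return the same per-key sums; only the number of records handed to the simulated shuffle differs, and that gap grows with the number of repeated keys per partition.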


## `mapPartitions()`
The `mapPartitions()` transformation takes a function that receives an iterator over the items in one partition and returns an iterator. Spark calls the function once per partition rather than once per element.

In [93]:
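# wordsRDD is built here so the cell runs on its own; 4 partitions is assumed, matching the
# partition count reported earlier
wordsRDD = sc.parallelize(['cat', 'elephant', 'rat', 'rat', 'cat'], 4)
print(wordsRDD.collect())

# mapPartitions takes a function that takes an iterator and returns an iterator
itemsRDD = wordsRDD.mapPartitions(lambda iterator: [','.join(iterator)])

print(itemsRDD.collect())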

['cat', 'elephant', 'rat', 'rat', 'cat']
['cat', 'elephant', 'rat', 'rat,cat']
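
To make the per-partition behaviour concrete, here is a minimal plain-Python sketch that imitates it without Spark. The helper `map_partitions_sim` and the four-partition layout are hypothetical, chosen to mirror the output above; this is not how Spark implements the transformation.

```python
def map_partitions_sim(partitions, func):
    """Call func once per partition and chain together the iterables it returns."""
    output = []
    for part in partitions:
        output.extend(func(iter(part)))
    return output

# Assumed layout: the five words spread over four partitions
partitions = [['cat'], ['elephant'], ['rat'], ['rat', 'cat']]

# One joined string per partition, as in the cell above
joined = map_partitions_sim(partitions, lambda iterator: [','.join(iterator)])
print(joined)

# Returning a generator also works, because func only has to return an iterable
lengths = map_partitions_sim(partitions, lambda iterator: (len(word) for word in iterator))
print(lengths)
```

Because the function sees a whole partition at once, it can produce fewer items than it received (one joined string) or the same number (one length per word).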


## `mapPartitionsWithIndex()`
The `mapPartitionsWithIndex()` transformation takes a function that receives a partition index (think of this as the partition number) and an iterator over the items in that partition. The function must return an iterator (or another iterable) of output items. In the first example below it returns a one-element list holding a tuple of the partition index and the items in that partition.

In [94]:
itemsByPartRDD = wordsRDD.mapPartitionsWithIndex(lambda index, iterator: [(index, list(iterator))])

# Three of the partitions hold one element and the fourth holds two,
# although things may not bode well for the rat...
print(itemsByPartRDD.collect())

# Rerun returning a bare tuple instead of a list. Spark iterates over whatever the function
# returns, so the index and the list come out as separate elements (much like flatMap).
itemsByPartRDD = wordsRDD.mapPartitionsWithIndex(lambda index, iterator: (index, list(iterator)))

print(itemsByPartRDD.collect())

[(0, ['cat']), (1, ['elephant']), (2, ['rat']), (3, ['rat', 'cat'])]
[0, ['cat'], 1, ['elephant'], 2, ['rat'], 3, ['rat', 'cat']]
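
The difference between the two results comes down to what the function returns for each partition. The plain-Python sketch below imitates it without Spark; the helper `map_partitions_with_index_sim` and the partition layout are assumptions for illustration.

```python
def map_partitions_with_index_sim(partitions, func):
    """Call func(index, iterator) per partition and chain together whatever it returns."""
    output = []
    for index, part in enumerate(partitions):
        output.extend(func(index, iter(part)))
    return output

# Assumed layout: the five words spread over four partitions
partitions = [['cat'], ['elephant'], ['rat'], ['rat', 'cat']]

# Wrapped in a list: each partition contributes exactly one (index, items) tuple
wrapped = map_partitions_with_index_sim(
    partitions, lambda index, iterator: [(index, list(iterator))])
print(wrapped)

# Bare tuple: the tuple itself is iterated, so index and items become separate elements
bare = map_partitions_with_index_sim(
    partitions, lambda index, iterator: (index, list(iterator)))
print(bare)
```

Wrapping the tuple in a list is what keeps each `(index, items)` pair together as a single output element.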


# `Others`
`cache(), unpersist(), id(), setName()`

In [104]:
def brokenTen(value):
    """Return whether `value` is less than ten.

    Note:
        The original, broken version tested an undefined variable `val` instead of
        `value`, which raised a `NameError`. That line is kept below as a comment;
        the version that runs here tests `value`.

    Args:
        value (int): A number.

    Returns:
        bool: Whether `value` is less than ten.
    """
    # Broken version: if (val < 10):
    return value < 10

brokenRDD = dataRDD.filter(brokenTen)

In [117]:
brokenRDD.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9]

# Word Count & Related Processing Lab

### Word Count

In [110]:
wordslist = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordslist)
print(type(wordsRDD))

<class 'pyspark.rdd.RDD'>


### Pluralize words

In [114]:
# Pluralize each word
pluralwords = wordsRDD.map(lambda w: w + 's').collect()
pluralwords

['cats', 'elephants', 'rats', 'rats', 'cats']

### Length of words

In [116]:
# Find length of each word
pluralRDD = sc.parallelize(pluralwords)
pluralRDD.map(len).collect()

[4, 9, 4, 4, 4]

### Word Count, Key-Value Pair

In [121]:
# Count occurrences of each word with countByValue()
pluralRDD.countByValue().items()

[('cats', 2), ('rats', 2), ('elephants', 1)]

In [128]:
# Map each word to a (word, 1) pair, then count per key with countByKey()
pluralRDD.map(lambda d: (d,1)).countByKey().items()

[('cats', 2), ('rats', 2), ('elephants', 1)]

In [148]:
# Map each word to a (word, 1) pair, then sum the ones per key with reduceByKey()
newPluralRDD = pluralRDD.map(lambda d: (d,1))
print(newPluralRDD.collect())

newPluralRDD.reduceByKey(add).collect()

[('cats', 1), ('elephants', 1), ('rats', 1), ('rats', 1), ('cats', 1)]


[('rats', 2), ('cats', 2), ('elephants', 1)]

### Unique Words in RDD

In [157]:
print(pluralRDD.collect())
pluralRDD.distinct().collect()

['cats', 'elephants', 'rats', 'rats', 'cats']


['rats', 'cats', 'elephants']

### Mean occurrences per unique word

In [177]:
# Mean number of occurrences per unique word: total words / distinct words
total = pluralRDD.count()
size = float(pluralRDD.distinct().count())
round(total / size, 2)


1.67

### Apply WordCount to a File

In [372]:

def wordCount(wordsListRDD):
    """Take an RDD of words and return (word, count) pairs."""
    return wordsListRDD.countByValue().items()

wordCount(wordsRDD)


[('rat', 2), ('elephant', 1), ('cat', 2)]

In [373]:

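import re

# Punctuation and whitespace characters to strip from each token
pattern = re.compile(r"[.,\[\]\"'*!?`_\s();-]+")

# Read the file, split on whitespace, strip punctuation and lowercase each token
with open("alice.txt", 'r') as f:
    wordsList = [pattern.sub('', word).lower() for word in f.read().split()]

# Drop tokens that became empty after stripping
wordsList = [word for word in wordsList if word]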


In [381]:

wordsFileRDD = sc.parallelize(wordsList)
p_wordfile = wordCount(wordsFileRDD)

# Show the 30 most frequent words
sorted(p_wordfile, key=lambda tup:tup[1], reverse=True)[:30]

[('the', 1632),
 ('and', 845),
 ('to', 721),
 ('a', 627),
 ('she', 537),
 ('it', 517),
 ('of', 508),
 ('said', 459),
 ('i', 401),
 ('alice', 379),
 ('in', 366),
 ('you', 361),
 ('was', 355),
 ('that', 273),
 ('as', 262),
 ('her', 243),
 ('at', 210),
 ('on', 191),
 ('with', 180),
 ('had', 178),
 ('all', 178),
 ('but', 166),
 ('for', 153),
 ('so', 150),
 ('be', 146),
 ('not', 144),
 ('very', 144),
 ('what', 136),
 ('this', 130),
 ('little', 128)]
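
The cleaning step above runs on the driver before the words reach Spark. One possible alternative is to package the same cleaning as a function that works on one line at a time. The sketch below is plain Python; the name `tokenize` and the sample sentence are made up for illustration, while the regular expression is the one used above.

```python
import re

# Same character class as the pattern used above
PUNCTUATION = re.compile(r"[.,\[\]\"'*!?`_\s();-]+")

def tokenize(line):
    """Split a line on whitespace, strip punctuation, lowercase, and drop empty tokens."""
    cleaned = (PUNCTUATION.sub('', token).lower() for token in line.split())
    return [token for token in cleaned if token]

# Made-up sample line, not taken from alice.txt
sample = tokenize('"Oh dear!" said Alice; (very softly)...')
print(sample)
```

With a live SparkContext, something like `sc.textFile("alice.txt").flatMap(tokenize)` should then yield an RDD of cleaned words that can be passed to `wordCount()`, so the file is read and cleaned in parallel rather than in a single Python process. That pipeline is an untested suggestion, not what produced the counts above.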