![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/PySpark/1.PySpark_RDDs.ipynb)

# **1. PySpark RDDs**

# Introduction, Features and Operations of RDDs
## Overview

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.

This guide shows each of these features in each of Spark’s supported languages. It is easiest to follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
[source](https://spark.apache.org/docs/latest/rdd-programming-guide.html)

### Install PySpark

In [None]:
# install PySpark
! pip install pyspark==3.2.0

### Initializing Spark

In [2]:
from pyspark import SparkContext

"""
try:
    # create SparkContext on all CPUs available: in my case I have 6 CPUs on my laptop
    sc = SparkContext(appName="SDDM", master='local[*]')
    print("Just created a SparkContext")
    sqlContext = SQLContext(sc)
    print("Just created a SQLContext")
except ValueError:
    warnings.warn("SparkContext already exists in this scope")
"""

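from pyspark import SparkConf

# getOrCreate() reuses an already running SparkContext, so re-running this cell does not raise a ValueError
conf = SparkConf().setAppName("PySparkTutorial").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)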

sc

In [None]:
# Do not forget to call sc.stop() when you are done

In [None]:
print("PySpark version: ", sc.version)
#3.2.0

# Python Version: To retrieve Python version of SparkContext
print("Python version:  ", sc.pythonVer)

# Master: URL of the cluster or “local” string to run in local mode of SparkContext
print (sc.master)

PySpark version:  3.2.0
Python version:   3.7
local[*]


# Spark RDD (Resilient Distributed Datasets)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

### Creating RDDs

There are two ways to create RDDs:

* Parallelized collections: parallelizing an existing collection of objects in your driver program.

* External datasets: referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

#### Parallelized Collections
Parallelized collections are created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

In [None]:
# parallelize() for creating RDDs from python lists

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(type(distData))

distData.collect()

<class 'pyspark.rdd.RDD'>


[1, 2, 3, 4, 5]

#### External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3a://, etc. URI) and reads it as a collection of lines. Here is an example invocation:

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/airport-codes.csv

! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/news_category_test.csv

In [None]:
# textFile() for creating RDDs from existing file

rdd2 = sc.textFile("./news_category_test.csv")

In [None]:
rdd2.collect()[:10]

['category,description',
 "Business,Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
 'Sci/Tech," TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket."',
 'Sci/Tech," A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins."',
 'Sci/Tech," It\'s barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry and flames will roar."',
 'Sci/Tech," Southern California\'s smog fighting agency went after emissions of the

## RDD Operations

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
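
The lazy behaviour described above can be imitated without Spark. The sketch below is only a plain-Python analogy, not PySpark code: Python's built-in `map()` stands in for a transformation and `functools.reduce()` stands in for an action. The `calls` list and the `square` function are hypothetical names introduced for the illustration.

```python
from functools import reduce

calls = []  # records every time the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

data = [1, 2, 3, 4]

# Python's built-in map() is lazy, much like an RDD transformation:
# creating it does not run square() yet.
mapped = map(square, data)
calls_before_action = len(calls)

# reduce() consumes the iterator, playing the role of an action.
total = reduce(lambda a, b: a + b, mapped)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)
```

Only the final number travels back to the caller, which mirrors the point that Spark can return just the result of the reduce rather than the larger mapped dataset.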

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
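
To make the cost of recomputation concrete, here is a small plain-Python model. It is a sketch under stated assumptions: `LazyDataset`, `expensive` and `runs` are hypothetical names, the class is not part of PySpark, and it only mimics the idea that each action recomputes the data unless the result has been kept.

```python
runs = {"count": 0}  # how many times the "expensive" function has run

def expensive(x):
    runs["count"] += 1
    return x * 10

class LazyDataset:
    """Hypothetical stand-in for an RDD; this class is not part of PySpark."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.persist = False
        self.cached = None

    def cache(self):
        # Mark the dataset so the first computed result is kept
        self.persist = True
        return self

    def _compute(self):
        if self.cached is not None:
            return self.cached
        result = [self.func(x) for x in self.source]
        if self.persist:
            self.cached = result
        return result

    def count(self):   # an "action"
        return len(self._compute())

    def total(self):   # another "action"
        return sum(self._compute())

plain = LazyDataset(range(5), expensive)
plain.count()
plain.total()
runs_without_cache = runs["count"]   # every action recomputed all 5 elements

runs["count"] = 0
kept = LazyDataset(range(5), expensive).cache()
kept.count()
kept.total()
runs_with_cache = runs["count"]      # the second action reused the kept result

print(runs_without_cache, runs_with_cache)
```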

    Transformations (lazy)        Actions (eager)
    map                           count 
    filter                        reduce 
    flatMap                       collect
    reduceByKey                   take 
    join                          saveAsTextFile
    cogroup                       saveAsHadoopFile
    repartition                   countByValue                            



- collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

In [None]:
rdd2.collect()[:5]

['category,description',
 "Business,Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
 'Sci/Tech," TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket."',
 'Sci/Tech," A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins."',
 'Sci/Tech," It\'s barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry and flames will roar."']

In [None]:
%%time
%%capture
rdd2.collect()

CPU times: user 51.4 ms, sys: 4.28 ms, total: 55.7 ms
Wall time: 382 ms


In [None]:
numbRDD = sc.parallelize([0,1,2,3,4])

# Create map() transformation to cube numbers
cubedRDD = numbRDD.map(lambda x: x**3)

# Collect the results
numbers_all = cubedRDD.collect ()

# Print the numbers from numbers_all
for numb in numbers_all:
    print(numb)

0
1
8
27
64


- filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

In [None]:
rdd2.filter(lambda x:  "Business" in x ).collect()[:10]

["Business,Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
 'Business," Apparel retailers are hoping their back to school fashions will make the grade among style conscious teens and young adults this fall, but it could be a tough sell, with students and parents keeping a tighter hold on their wallets."',
 'Business," The dollar dipped to a four week low  against the euro on Monday before rising slightly on  profit taking, but steep oil prices and weak U.S. data  continued to fan worries about the health of the world\'s  largest economy."',
 'Business," U.S. Treasury debt prices slipped on  Monday, though traders characterized the move as profit taking  rather than any fundamental change in sentiment."',
 'Business, The dollar extended gains against the  euro on Monday after a report on flows into U.S. assets showed  enough of a rise in foreign investments to offset the current  account gap for the month.

- count(): Return the number of elements in the dataset.

In [None]:
rdd2.count() 

7601

- map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

In [None]:
LineLength = rdd2.map(lambda x : len(x))
print (LineLength.count())
print (LineLength.collect()[:5])

7601
[20, 136, 234, 221, 279]


- take(n): Return an array with the first n elements of the dataset.

In [None]:
rdd2.take(5)

['category,description',
 "Business,Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
 'Sci/Tech," TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket."',
 'Sci/Tech," A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins."',
 'Sci/Tech," It\'s barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry and flames will roar."']

## Partition, Repartition and Coalesce

- saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

In [None]:
# delete existing savedData folder
! rm -R savedData

rm: cannot remove 'savedData': No such file or directory


In [None]:
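# Understanding partitioning in PySpark
# Read the file with textFile() into at least 5 partitions, then save the RDD
# (one part-0000N file per partition; in local mode this relative path is on the local file system)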

rdd3 = sc.textFile("./airport-codes.csv", minPartitions=5)

rdd3.saveAsTextFile('./savedData/')
#  hadoop fs -cat /savedData/part-00000
#  hadoop fs -cat /savedData/part-00001
#  hadoop fs -cat /savedData/part-00002
#  hadoop fs -cat /savedData/part-00003
#  hadoop fs -cat /savedData/part-00004

print(rdd3.getNumPartitions())

5


In [None]:
# parallelize() method

numRDD = sc.parallelize(range(10), numSlices = 3)

print (numRDD.getNumPartitions())

3


In [None]:
print(numRDD.collect())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


- glom(): Return an RDD created by coalescing all elements within each partition into a list. See https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0

In [None]:
# Without numSlices, parallelize() uses sc.defaultParallelism (2 on the machine that produced this output)

rdd = sc.parallelize(range(10))

print("Number of partitions: {}".format(rdd.getNumPartitions()))
print("Partitions structure: {}".format(rdd.glom().collect()))

Number of partitions: 2
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]


In [None]:
rdd = sc.parallelize(range(10),numSlices = 4)

print("Number of partitions: {}".format(rdd.getNumPartitions()))
print("Partitions structure: {}".format(rdd.glom().collect()))

Number of partitions: 4
Partitions structure: [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]


In [None]:
rdd=sc.parallelize(range(10), 4)

print("Number of partitions: {}".format(rdd.getNumPartitions()))
print("Partitions structure: {}".format(rdd.glom().collect()))

Number of partitions: 4
Partitions structure: [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
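
The partition layouts printed above follow a simple pattern. The sketch below is a plain-Python model, not Spark's implementation: it assumes that chunk `i` of a range of length `n` starts at `floor(i * n / num_slices)`, which reproduces the `glom()` outputs shown in this section. The helper name `slice_range` is hypothetical.

```python
def slice_range(n, num_slices):
    """Split range(n) into num_slices contiguous chunks.

    Assumption: chunk i covers floor(i*n/num_slices) up to (but not
    including) floor((i+1)*n/num_slices).
    """
    bounds = [i * n // num_slices for i in range(num_slices + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_slices)]

two = slice_range(10, 2)
four = slice_range(10, 4)

print(two)
print(four)
```

Because 10 does not divide evenly by 4, the chunks alternate between two and three elements instead of being equal in size.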


- If you are increasing the number of partitions, use **repartition()**, which performs a full shuffle.

In [None]:
repartRdd = rdd.repartition(5)

print("Number of partitions: {}".format(repartRdd.getNumPartitions()))
print("Partitions structure: {}".format(repartRdd.glom().collect()))

Number of partitions: 5
Partitions structure: [[], [0, 1], [5, 6, 7, 8, 9], [2, 3, 4], []]


- If you are decreasing the number of partitions in this RDD, consider
using **coalesce()**, which can avoid performing a shuffle.

In [None]:
repartRdd = rdd.coalesce(2)

print("Number of partitions: {}".format(repartRdd.getNumPartitions()))
print("Partitions structure: {}".format(repartRdd.glom().collect()))

Number of partitions: 2
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
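
One way to picture why coalesce() can avoid a shuffle: whole partitions are merged with their neighbours, so no individual element has to be redistributed. The following is a plain-Python model only. `coalesce_local` is a hypothetical helper, and Spark's real grouping may also take data locality into account; the model simply reproduces the output shown above.

```python
def coalesce_local(partitions, n):
    """Merge whole neighbouring partitions into n groups (no per-element shuffle)."""
    k = len(partitions)
    bounds = [i * k // n for i in range(n + 1)]
    return [
        [x for part in partitions[bounds[i]:bounds[i + 1]] for x in part]
        for i in range(n)
    ]

parts = [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
merged = coalesce_local(parts, 2)

print(merged)
```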


## Passing Functions to Spark

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are three recommended ways to do this:

- Lambda expressions, for simple functions that can be written as an expression. (Lambdas do not support multi-statement functions or statements that do not return a value.)
- Local defs inside the function calling into Spark, for longer code.
- Top-level functions in a module.

For example, to pass a longer function than can be supported using a lambda, consider the code below:

In [None]:
"""MyScript.py"""
if __name__ == "__main__":
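    def myFunc(s):
        # return the number of characters in the line
        return len(s)

    # map() is lazy: nothing is computed until an action such as count() or collect() runs
    lineLengths = rdd2.map(myFunc)

    print(lineLengths.count())

    print(lineLengths.collect()[:5])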

7601
[20, 136, 234, 221, 279]


In [None]:
# Read a CSV File
# Writing a Python Function to Parse CSV Lines
import csv
from io import StringIO

def parseCSV(csvRow):
    # wrap the line in a file-like object so csv.reader can parse it, including quoted fields
    data = StringIO(csvRow)
    dataReader = csv.reader(data)
    return next(dataReader)

csvRow = "p,s,r,p"
parseCSV(csvRow)

['p', 's', 'r', 'p']
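
The reason for going through `csv.reader` instead of `str.split(",")` shows up as soon as a field contains a quoted comma. The sketch below uses only the standard library; the sample `row` is a made-up line for illustration, not a record from airport-codes.csv.

```python
import csv
from io import StringIO

# Hypothetical row: the last field contains a comma inside double quotes.
row = 'ABC,heliport,Example Pad,"-74.93, 40.07"'

naive = row.split(",")                     # breaks the quoted field in two
parsed = next(csv.reader(StringIO(row)))   # keeps the quoted field intact

print(len(naive), naive[-1])
print(len(parsed), parsed[-1])
```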

In [None]:
# Read the CSV file and parse each line into a list of fields
filamentRDD = sc.textFile('./airport-codes.csv', 4)
filamentRDDCSV = filamentRDD.map(parseCSV)
filamentRDDCSV.take(1)

[['ident',
  'type',
  'name',
  'elevation_ft',
  'continent',
  'iso_country',
  'iso_region',
  'municipality',
  'gps_code',
  'iata_code',
  'local_code',
  'coordinates']]

## Anonymous functions 

Lambda functions are anonymous functions in Python: single-expression functions defined inline without a name.

In [None]:
## Transformations (lazy evaluation)

# map() transformation applies a function to all elements in the RDD

import math

numRDD = sc.parallelize(range(100), numSlices = 3)

RDD = sc.parallelize([0, 1, 1.6, 2, 3, 3.14])
RDD_map = RDD.map(lambda x: math.sin(x))

RDD_map.collect()

[0.0,
 0.8414709848078965,
 0.9995736030415051,
 0.9092974268256817,
 0.1411200080598672,
 0.0015926529164868282]

In [None]:
# filter() transformation returns a new RDD with only the elements that pass the condition

RDD = sc.parallelize([0, 1,2,3,4])

RDD_filter = RDD.filter(lambda x: x >= 2)

RDD_filter.collect()

[2, 3, 4]

In [None]:
# flatMap() transformation returns multiple values for each element in the original RDD

"""
Why are we using flatMap, rather than map?

The reason is that x.split(" ") generates a list of strings for each element,
so had we used map the result would be an RDD of lists of words, not an RDD of words.

The difference between map and flatMap is that flatMap expects the function to return
a list (or other iterable) for each element and concatenates those lists to form the new RDD.
"""

RDD = sc.parallelize(["hello world", "how are you"])

RDD_flatmap = RDD.flatMap(lambda x: x.split(" "))

RDD_flatmap.collect()

['hello', 'world', 'how', 'are', 'you']
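
The same contrast can be reproduced in plain Python, which may help when reasoning about the shape of the result. This is an analogy only, with no Spark involved: a list comprehension stands in for map and `itertools.chain.from_iterable` stands in for the flattening step of flatMap.

```python
from itertools import chain

lines = ["hello world", "how are you"]

# map-like: one output element per input element, so we get a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-element lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```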

## Introduction to pair RDDs in PySpark

Two common ways to create pair RDDs:

- From a list of key-value tuples
- From a regular RDD

In both cases, the data has to be in key/value form before it can be used as a pair RDD.


In [None]:
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]

pairRDD_tuple = sc.parallelize(my_tuple)

pairRDD_tuple.collect()

[('Sam', 23), ('Mary', 34), ('Peter', 25)]

In [None]:
my_list = ['Sam 23', 'Mary 34', 'Peter 25']

regularRDD = sc.parallelize(my_list)

pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))

pairRDD_RDD.collect()

[('Sam', '23'), ('Mary', '34'), ('Peter', '25')]

In [None]:
#  Fetching Values from a Paired RDD
pairRDD_RDD_Values = pairRDD_RDD.values()
pairRDD_RDD_Values.collect()

['23', '34', '25']

In [None]:
#  Fetching Keys from a Paired RDD
pairRDD_RDD_Keys = pairRDD_RDD.keys()
pairRDD_RDD_Keys.collect()

['Sam', 'Mary', 'Peter']

## Transformations on pair RDDs

All regular transformations work on pair RDDs, but the functions you pass must operate on key-value tuples rather than on individual elements.

Examples of transformations on regular and pair RDDs:


In [None]:
# we can use user functions to map on RDD

def get_Squares(num):
  return (num**2)

numbRDD = sc.parallelize([0,1,2,3,4,3,2,1,0])

numbRDD.map(get_Squares).collect()

[0, 1, 4, 9, 16, 9, 4, 1, 0]

In [None]:
# distinct() to find the distinct numbers

numbRDD.distinct().collect()

[0, 2, 4, 1, 3]

In [None]:
#  intersection()

numbRDD2 = sc.parallelize([1, 2, 3, 5])

numbRDD.intersection(numbRDD2).collect()

[1, 2, 3]

In [None]:
# calculating basic stats on the numbers 0..99

numRDD = sc.parallelize(range(100), numSlices = 3)

print(numRDD.min())

print(numRDD.max())

print(numRDD.sum())

print(numRDD.mean())

print(numRDD.variance())

print(numRDD.stdev())

print(numRDD.stats())

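# the dictionary below shows the sample (n - 1) stdev and variance, so they differ from the values printed above
print(numRDD.stats().asDict())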

0
99
4950
49.5
833.25
28.86607004772212
(count: 100, mean: 49.5, stdev: 28.86607004772212, max: 99.0, min: 0.0)
{'count': 100, 'mean': 49.5, 'sum': 4950.0, 'min': 0.0, 'max': 99.0, 'stdev': 29.011491975882016, 'variance': 841.6666666666666}
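
The two different variance and standard deviation figures in the output above correspond to the population formulas (divide by n) and the sample formulas (divide by n - 1). The check below uses Python's standard `statistics` module on the same numbers 0 to 99; it is an independent cross-check rather than a description of how Spark computes these values.

```python
import statistics

values = list(range(100))

pop_var = statistics.pvariance(values)     # divides by n
pop_std = statistics.pstdev(values)
sample_var = statistics.variance(values)   # divides by n - 1
sample_std = statistics.stdev(values)

print(pop_var, pop_std)
print(sample_var, sample_std)
```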


In [None]:
# reduceByKey() transformation combines values with the same key

# It runs parallel operations for each key in the dataset

# It is a transformation and not action

regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), 
                             ("Messi", 24), ("Ronaldo", 24), ("Neymar", 24),])

pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y)

pairRDD_reducebykey.collect()


[('Ronaldo', 58), ('Messi', 47), ('Neymar', 46)]
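
What reduceByKey computes can be written down in a few lines of plain Python. The sketch below is a single-machine model with a hypothetical helper `reduce_by_key`; Spark additionally combines values inside each partition before moving data between nodes, which this model does not attempt to show.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, one pair at a time."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

records = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22),
           ("Messi", 24), ("Ronaldo", 24), ("Neymar", 24)]

totals = reduce_by_key(records, lambda x, y: x + y)

print(totals)
```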

In [None]:
# groupByKey() groups all the values with the same key in the pair RDD

airports = [("US", "JFK"),("UK", "LHR"),("FR", "CDG"),("US", "SFO")]

regularRDD = sc.parallelize(airports)

pairRDD_group = regularRDD.groupByKey().collect()

for cont, air in pairRDD_group:
    print(cont, list(air))

US ['JFK', 'SFO']
UK ['LHR']
FR ['CDG']


In [None]:
# join() transformation joins the two pair RDDs based on their key

x = sc.parallelize([("a", 1), ("b", 4), ("c", 2), ("d", 6)])

y = sc.parallelize([("a", 2), ("a", 3), ("c", 3), ("d", 5)])

sorted(x.join(y).collect())

[('a', (1, 2)), ('a', (1, 3)), ('c', (2, 3)), ('d', (6, 5))]
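
The join above is an inner join: a key appears in the result once for every matching pair of values, and keys present on only one side (such as "b") are dropped. A plain-Python model of that behaviour, using a hypothetical helper `inner_join`, could look like this:

```python
def inner_join(left, right):
    """Pair up values from left and right that share the same key."""
    out = []
    for left_key, left_value in left:
        for right_key, right_value in right:
            if left_key == right_key:
                out.append((left_key, (left_value, right_value)))
    return out

x = [("a", 1), ("b", 4), ("c", 2), ("d", 6)]
y = [("a", 2), ("a", 3), ("c", 3), ("d", 5)]

joined = sorted(inner_join(x, y))

print(joined)
```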

In [None]:
# reduce(func) action is used for aggregating the elements of a regular RDD

# The function should be commutative and associative

# An example of reduce() action in PySpark

from operator import add

print(sc.parallelize([1, 2, 3, 4, 5]).reduce(add))

15
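
The requirement that the function be commutative and associative matters because the data is reduced partition by partition and the partial results are then combined. The sketch below models this in plain Python with a hypothetical helper `partitioned_reduce`; it shows that addition gives the same answer for any partitioning, while subtraction does not.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# The same five numbers, split in two different ways
two_parts = [[1, 2, 3], [4, 5]]
three_parts = [[1, 2], [3, 4], [5]]

add_two = partitioned_reduce(two_parts, add)
add_three = partitioned_reduce(three_parts, add)
sub_two = partitioned_reduce(two_parts, sub)
sub_three = partitioned_reduce(three_parts, sub)

print(add_two, add_three)   # identical
print(sub_two, sub_three)   # depend on how the data was split
```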


In [None]:
sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)

10

In [None]:
# countByKey() only available for type (K, V)

# countByKey() action counts the number of elements for each key

# Example of countByKey() on a simple list

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2),("b", 3), ("a", 1)])

for key, val in rdd.countByKey().items():
    print(key, val)


a 3
b 2


In [None]:
# collectAsMap() returns the key-value pairs in the RDD as a dictionary

# Example of collectAsMap() on a simple tuple

sc.parallelize([(1, 2), (3, 4)]).collectAsMap()


{1: 2, 3: 4}

In [None]:
# word count example

text_file = sc.textFile('./airport-codes.csv')
counts_rdd = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

# print the word frequencies in descending order

counts_rdd.map(lambda x: (x[1], x[0])) \
    .sortByKey(ascending=False)\
    .collect()[:20]

[(1871, 'Hospital'),
 (1392, 'Air'),
 (1314, 'Municipal'),
 (1247, 'Ranch'),
 (1023, 'de'),
 (1009, 'Center'),
 (1003, 'Seaplane'),
 (1000, 'International'),
 (872, 'County'),
 (761, 'Medical'),
 (672, 'Regional'),
 (609, 'Lake'),
 (582, 'Farm'),
 (529, 'Memorial'),
 (497, 'Landing'),
 (451, 'De'),
 (385, 'do'),
 (373, 'Creek'),
 (360, 'Island'),
 (334, 'Do')]

In [None]:
# bigrams and word frequencies

sentences = sc.textFile('./airport-codes.csv') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))

bigrams = sentences.map(lambda x:x.split()) \
    .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])

freq_bigrams = bigrams.reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey(False)

freq_bigrams.take(10)

[(683, ('Medical', 'Center')),
 (282, ('Memorial', 'Hospital')),
 (257, ('di', 'Volo')),
 (192, ('Lake', 'Seaplane')),
 (130, ('Regional', 'Medical')),
 (124, ('Community', 'Hospital')),
 (110, ('Air', 'Force')),
 (101, ('General', 'Hospital')),
 (83, ('Building', 'Heliport,,AS,KR,KR-11,Seoul,,,,"37')),
 (68, ('County', 'Hospital'))]
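
The bigram step pairs each word with the word that follows it. For checking the logic on a small sample without Spark, the same idea can be expressed with `zip` and `collections.Counter`. The `sentence` below is made-up text for illustration, not data from the file.

```python
from collections import Counter

sentence = "Memorial Hospital Heliport near Memorial Hospital"
words = sentence.split()

# zip(words, words[1:]) yields (word[i], word[i+1]) for every adjacent pair
bigram_counts = Counter(zip(words, words[1:]))
most_common = bigram_counts.most_common(1)

print(most_common)
```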

In [None]:
# sc.stop()

# Resources

1. https://spark.apache.org/docs/latest/rdd-programming-guide.html
2. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#
3. https://github.com/vkocaman/PySpark_Essentials_March_2019
4. https://github.com/sundarramamurthy/pyspark
5. https://towardsdatascience.com/beginners-guide-to-pyspark-bbe3b553b79f
6. https://www.guru99.com/pyspark-tutorial.html
7. https://towardsdatascience.com/exploratory-data-analysis-eda-with-pyspark-on-databricks-e8d6529626b1
8. https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning

