# Spark

Apache Spark is a cluster computing engine designed for fast, large-scale computation. It extends the MapReduce model popularized by Hadoop so that it can efficiently support more types of computation, including interactive queries and stream processing.

Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming.



A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

RDDs are resilient, which means that if a node performing an operation in Spark is lost, the dataset can be reconstructed. This is possible because Spark knows the lineage of each RDD, which is the sequence of steps used to create it.
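To make the idea of lineage more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's actual implementation: the `ToyDataset` class and its methods are hypothetical names invented for this illustration. Each dataset remembers its parent and the function applied to it, so a lost partition can be rebuilt by replaying those steps from the source.

```python
# Toy model of lineage-based recovery (plain Python, NOT Spark's real implementation).

class ToyDataset:
    """Remembers how to rebuild its partitions instead of keeping a backup copy."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source_partitions = source_partitions  # only set for the root dataset
        self.parent = parent                        # previous step in the lineage
        self.fn = fn                                # function applied to the parent

    def map(self, fn):
        # A transformation only records a new lineage step; nothing is computed yet.
        return ToyDataset(parent=self, fn=fn)

    def compute_partition(self, index):
        # Walk the lineage back to the source and replay every step for one partition.
        if self.parent is None:
            return list(self.source_partitions[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]


source = ToyDataset(source_partitions=[[1, 2, 3], [4, 5, 6]])
squared_plus_one = source.map(lambda x: x * x).map(lambda x: x + 1)

# Pretend the worker holding partition 1 was lost: rebuild only that partition.
recovered = squared_plus_one.compute_partition(1)
print(recovered)  # [17, 26, 37]
```

Only the lost partition is recomputed; the other partition is untouched, which is the property that makes lineage cheaper than replicating the data.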

RDDs are distributed, which means the data in an RDD is divided into one or more partitions and distributed as in-memory collections of objects across the worker nodes in the cluster.

RDDs are datasets that consist of records. A record can be a collection of fields, like a row in a relational database.
RDDs are created in such a way that each partition contains a unique set of records that can be operated on independently.

Once created, RDDs are immutable: transformations return new RDDs instead of modifying existing ones.

# How to create an RDD

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext('local', 'test_app') 
spark = SparkSession.builder.appName("test").getOrCreate()

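Creating an RDD using textFile:
sc.textFile(name, minPartitions=None, use_unicode=True)

name is the path of the file to read. minPartitions is a suggested minimum number of partitions; if it is not provided, Spark picks a default and typically ends up with roughly one partition per block of the file.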

In [None]:
eventsRDD = sc.textFile("users.txt")  # sc.textFile returns an RDD; spark.read.text would return a DataFrame

In [None]:
eventsRDD.collect()

Loading a database table into a DataFrame, a distributed collection with a named-column schema that is built on top of RDDs. The relational table is read through the JDBC reader of the SparkSession object.

In [None]:


countryDF = spark.read.jdbc(url="jdbc:mysql://localhost:3306/population", table="pops", properties={"user":"root", "password":"root@123"})

In [None]:
countryDF.collect()

Running SQL Queries against a DataFrame

In [None]:
# Register the DataFrame as a temporary view so that spark.sql() can query it by name
countryDF.createOrReplaceTempView("countries")

In [None]:
cdf = spark.sql("select * from countries x where x.population >= (select max(y.population) from countries y where y.continent=x.continent)")

In [None]:
cdf.collect()

In [None]:
cdf.dtypes

Creating an RDD programmatically.  

sc.parallelize(c, numSlices=None)

c is the collection to distribute; numSlices is the number of partitions to create.

In [3]:
ints = sc.parallelize([1,2,3,4,5,6,7,8,9])
ints.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [4]:
rangeRDD = sc.range(1, 20, 2, 2)
rangeRDD.collect()

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

# RDD Persistence and Caching

RDD persistence is an optimization technique that saves the result of an RDD evaluation. The intermediate result is kept so that later operations can reuse it instead of recomputing it, which reduces computation overhead.

An RDD can be persisted with the cache() and persist() methods. cache() stores the RDD's partitions in memory, so they can be reused efficiently across parallel operations.

The difference between cache() and persist() is that cache() always uses the default storage level, MEMORY_ONLY, while persist() accepts a storage level argument such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER or MEMORY_AND_DISK_SER. The _SER levels belong to the Scala and Java APIs; in Python, stored objects are always serialized, so the plain levels are the ones to use.

In [None]:
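nums = sc.range(0, 10000, 1, 2)
evens = nums.filter(lambda x: x % 2 == 0)  # keep even numbers (x % 2 alone would keep the odd ones)
evens.persist()   # mark the RDD for caching; nothing is computed yet
evens.count()     # the first action computes and stores the partitions
evens.collect()   # later actions reuse the stored partitions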


# Basic RDD Transformations

In [5]:
# RDD.map(f, preservesPartitioning=False): applies f to every element and returns a new RDD
rdd = sc.parallelize([1,2,3,4,1,2,3,4,5])
rdd1 = rdd.map(lambda x: x**2)
rdd1.collect()

[1, 4, 9, 16, 1, 4, 9, 16, 25]

In [6]:
rdd1.filter(lambda x: x > 4).collect()

[9, 16, 9, 16, 25]

In [9]:
rdd2 = rdd.flatMap(lambda x :[x, x*x])
rdd2.collect()

[1, 1, 2, 4, 3, 9, 4, 16, 1, 1, 2, 4, 3, 9, 4, 16, 5, 25]

In [10]:
rdd2.distinct().collect()

[1, 2, 3, 4, 5, 9, 16, 25]

In [11]:
b = sc.textFile("users.txt")
b.take(10)


[u'1,BarackObama,Barack Obama',
 u'2,ladygaga,Goddess of Love',
 u'3,jeresig,John Resig',
 u'4,justinbieber,Justin Bieber',
 u'6,matei_zaharia,Matei Zaharia',
 u'7,odersky,Martin Odersky',
 u'8,anonsys']

In [13]:
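# Group the lines by their first character (here, the single-digit user id).
b.groupBy(lambda x: x[0]).collect()  # each group's values come back as a ResultIterable
b.groupBy(lambda x:x[0]).map(lambda x: (x[0], list(x[1]))).collect()  # convert each group to a list for display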

[(u'1', [u'1,BarackObama,Barack Obama']),
 (u'3', [u'3,jeresig,John Resig']),
 (u'2', [u'2,ladygaga,Goddess of Love']),
 (u'4', [u'4,justinbieber,Justin Bieber']),
 (u'7', [u'7,odersky,Martin Odersky']),
 (u'6', [u'6,matei_zaharia,Matei Zaharia']),
 (u'8', [u'8,anonsys'])]

In [20]:
housingRDD = sc.textFile("housing.csv")
# x[0] is the first character of each line, so rows are ordered by that character only
housingRDD.sortBy(lambda x: x[0]).take(10)

[u'ALPHABET CITY,R4-CONDOMINIUM,25,2001,27798,792799,28.52,309114,11.12,483685,3212028,115.55,Manhattan',
 u'ALPHABET CITY,R4-CONDOMINIUM,23,2005,34734,1202838,34.63,414029,11.92,788809,5840002,168.14,Manhattan',
 u'ALPHABET CITY,R4-CONDOMINIUM,5,2008,7405,263988,35.65,60647,8.19,203341,1501999,202.84,Manhattan',
 u'ALPHABET CITY,R4-CONDOMINIUM,47,2005,36472,1454868,39.89,319859,8.77,1135009,8579000,235.22,Manhattan',
 u'ALPHABET CITY,R2-CONDOMINIUM,13,1920,18990,518047,27.28,206991,10.9,311056,2244001,118.17,Manhattan',
 u'ALPHABET CITY,R9-CONDOMINIUM,30,1901,20940,587576,28.06,264682,12.64,322894,2324000,110.98,Manhattan',
 u'ALPHABET CITY,R4-CONDOMINIUM,16,2000,15704,541631,34.49,210120,13.38,331511,2455000,156.33,Manhattan',
 u'ALPHABET CITY,R9-CONDOMINIUM,78,2001,65832,2518732,38.26,789984,12,1728748,13083000,198.73,Manhattan',
 u'ALPHABET CITY,R4-CONDOMINIUM,83,1928,78982,2138833,27.08,853006,10.8,1285827,9281001,117.51,Manhattan',
 u'ALPHABET CITY,R4-CONDOMINIUM,12,2005,11603,46

# Basic RDD Actions

1. count() and collect() evaluate an RDD and all of its parents and return a value to the driver.
2. saveAsTextFile() writes the RDD's data to external storage.
3. foreach() is an action that runs a function on each element of an RDD.

In [21]:
readme = sc.textFile("README.md").flatMap(lambda x: x.split(' '))
readme.count()

526

In [13]:
readme.collect()

[u'#',
 u'Apache',
 u'Spark',
 u'',
 u'Spark',
 u'is',
 u'a',
 u'fast',
 u'and',
 u'general',
 u'cluster',
 u'computing',
 u'system',
 u'for',
 u'Big',
 u'Data.',
 u'It',
 u'provides',
 u'high-level',
 u'APIs',
 u'in',
 u'Scala,',
 u'Java,',
 u'Python,',
 u'and',
 u'R,',
 u'and',
 u'an',
 u'optimized',
 u'engine',
 u'that',
 u'supports',
 u'general',
 u'computation',
 u'graphs',
 u'for',
 u'data',
 u'analysis.',
 u'It',
 u'also',
 u'supports',
 u'a',
 u'rich',
 u'set',
 u'of',
 u'higher-level',
 u'tools',
 u'including',
 u'Spark',
 u'SQL',
 u'for',
 u'SQL',
 u'and',
 u'DataFrames,',
 u'MLlib',
 u'for',
 u'machine',
 u'learning,',
 u'GraphX',
 u'for',
 u'graph',
 u'processing,',
 u'and',
 u'Spark',
 u'Streaming',
 u'for',
 u'stream',
 u'processing.',
 u'',
 u'<http://spark.apache.org/>',
 u'',
 u'',
 u'##',
 u'Online',
 u'Documentation',
 u'',
 u'You',
 u'can',
 u'find',
 u'the',
 u'latest',
 u'Spark',
 u'documentation,',
 u'including',
 u'a',
 u'programming',
 u'guide,',
 u'on',
 u'the

In [14]:
readme.take(5) # returns the first n elements in partition order, without sorting

[u'#', u'Apache', u'Spark', u'', u'Spark']

In [22]:
readme.distinct().top(5) # returns the n largest elements, in descending order

[u'your', u'you', u'works', u'with', u'will']

In [23]:
readme.first()

u'#'

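reduce() is an aggregate action: it combines the elements of an RDD using a function that must be both commutative and associative, because partitions are reduced independently and their partial results are then merged.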

In [24]:
numbers = sc.parallelize([1,2,3,4,5,6,7])
numbers.reduce(lambda x,y: x+y)

28
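The requirement on the reduce function is easier to see with a small simulation. The sketch below uses plain Python rather than Spark: `partitioned_reduce` is a hypothetical helper written for this illustration that reduces each partition separately and then merges the partial results, which is roughly how a distributed reduce behaves.

```python
from functools import reduce


def partitioned_reduce(partitions, op):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


def add(x, y):
    return x + y


def sub(x, y):
    return x - y


# The same seven numbers, split across partitions in two different ways.
split_a = [[1, 2, 3, 4], [5, 6, 7]]
split_b = [[1, 2], [3, 4, 5], [6, 7]]

add_a = partitioned_reduce(split_a, add)
add_b = partitioned_reduce(split_b, add)
sub_a = partitioned_reduce(split_a, sub)
sub_b = partitioned_reduce(split_b, sub)

print(add_a, add_b)  # 28 28
print(sub_a, sub_b)  # 0 6
```

Addition gives the same answer for both layouts, while subtraction, which is neither commutative nor associative, gives a result that depends on how the data happens to be partitioned.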

RDD.foreach(function) applies a function to each element of an RDD.

Unlike map() or flatMap(), which are transformations that return a new RDD, foreach() is an action: it runs the function for its side effects and returns nothing.

In [25]:
def pprint(x):
    print(x*x)

numbers = sc.parallelize([1,2,3,4,5])
# The function runs on the executors, so on a cluster the printed output goes to the executor logs
numbers.foreach(pprint)
