### Why Spark, and how is it different from MapReduce

Growing data : cluster computing is required for manipulating very large volumes of data  
A. MapReduce vs Spark  
1. Speed  
- MapReduce shuffles data over the network and writes intermediate results to disk to tolerate node failures  
- Spark computes in memory and tolerates failures by recording the operations   
  to be performed on the data and re-applying them on recovery from the point of failure. So, less network and disk I/O is involved.  
  Spark is also reported to be faster than MapReduce when working from disk. It uses functional programming constructs    
- So, generally faster than MapReduce; the difference widens as scale increases  
2. Generality  
- Usable for multiple use cases    
- Iterative tasks like ML become much easier to work with   
3. Ease of Use   
- Runs on a Hadoop cluster with a scheduler like YARN, on Apache Mesos, or on a standalone cluster  
- APIs for Scala, R, Python, Java  
- Libraries for SQL, streaming, graph processing, ML  
- Interactive programming   

So, overall : Spark wins on in-memory distributed computing with low latency, plus its APIs and libraries

### Spark Unified Stack, RDD and key operation types

Spark Unified Stack :   

-- Spark SQL, Spark Streaming, MLlib, GraphX  --- libraries  
-- Spark Core --  
-- Scheduler with YARN, Mesos or the built-in scheduler  

Spark Core:  
1.  RDD : Resilient Distributed Dataset is the primary data abstraction of Spark  
- Distributed collection of elements, parallelized across the cluster    
- **Types of Operations on RDD**  
A. **Transformations**
- Like a sequence of operations applied on data; a transformation just adds to a Directed Acyclic Graph (DAG) of    
  operations. No evaluation happens and no result is computed (only a new RDD is returned) 
- As operations are added, the DAG is updated  
B. **Actions**
- Trigger evaluation: because evaluation is lazy, the DAG is evaluated only when an action is called  
- Recording the DAG and evaluating it lazily let Spark be resilient to failures. On failure, the lost   
  partitions are recomputed by re-applying the recorded operations  
- Spark can cache data in memory for processing; if memory is not sufficient,   
  disk can be used (depending on the storage level)   
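
The split between transformations and actions can be illustrated with a toy class in plain Python. This is only a sketch of the idea, not how Spark is implemented: `LazyList` and `traced_double` are hypothetical names, and the "lineage" here is just a tuple of recorded steps.

```python
class LazyList:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, ops=()):
        self._data = list(data)  # source data
        self._ops = tuple(ops)   # recorded lineage of (kind, function) steps

    def map(self, f):
        # Transformation: returns a new object with one more recorded step
        return LazyList(self._data, self._ops + (("map", f),))

    def filter(self, f):
        # Transformation: nothing is evaluated here either
        return LazyList(self._data, self._ops + (("filter", f),))

    def collect(self):
        # Action: replay every recorded step over the source data
        items = self._data
        for kind, f in self._ops:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items


calls = []

def traced_double(x):
    calls.append(x)  # record each evaluation so laziness is observable
    return 2 * x

pipeline = LazyList([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # 0: no function has run yet
result = pipeline.collect()       # [6, 8]
```

Because the recorded steps are kept, calling `collect()` again simply replays them, which is the same idea Spark relies on to recompute lost data after a failure.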

### Scala Overview, Starting the Spark shell for Scala and Python

- Spark is written in Scala  
- Everything in Scala is an object, including basic data types like numbers, and functions    
- Functions are objects, so they can be passed as arguments to other functions, returned from a function,
  and stored in variables   
- Function syntax : def funcname([list of args]): [return type] = [body]
- Starting the Spark shell for Scala and Python  
**1. Scala**  
 ./bin/spark-shell  
 val textfile = sc.textFile("fname")  
**2. Python**  
 ./bin/pyspark  
 textfile = sc.textFile('fname')  
Here sc is the SparkContext, which the shell makes available as a ready-made object 


-------------------------------------------------------------------

# Spark Methods for RDD in Python API

**Transformations : filter, map, flatMap, reduceByKey, groupByKey**  
**Actions : reduce, collect, count, take, foreach(func)**

**filter** - Return a new RDD containing only the elements for which the specified filtering function returns true.  
Define a filtering function and apply it by passing it to the filter() method of the RDD.  
Example of a sequence of transformation and action:


In [None]:
lt = sc.parallelize([1,2,3,4,5]) # parallelize method creates an RDD from a non-distributed object
lt_tfr = lt.filter(lambda x : x % 2 == 0) # transformation: keep even numbers, nothing is computed yet
lt_action = lt_tfr.collect() # action: triggers evaluation and returns [2, 4] to the driver


**map(f, preservesPartitioning=False)**   
Return a new RDD by applying a function to each element of this RDD

**reduce(f)**  
Reduces the elements of this RDD using the specified **commutative and associative binary operator**. Currently reduces partitions locally  

In [None]:
#from operator import add
sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda a,b : max(a,b) )
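
Why the operator must be commutative and associative can be seen with a plain-Python emulation of "reduce each partition locally, then combine the partial results". This is a sketch under assumptions: `reduce_partitioned` is a hypothetical helper, and the way it splits the data into partitions is made up for illustration.

```python
from functools import reduce

def reduce_partitioned(values, op, num_partitions):
    """Reduce each partition locally, then combine the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

data = [1, 2, 3, 4, 5]

# max is associative and commutative: the partitioning does not matter
max_one_partition = reduce_partitioned(data, max, 1)   # 5
max_two_partitions = reduce_partitioned(data, max, 2)  # 5

# subtraction is neither: the answer changes with the partitioning
sub_one_partition = reduce_partitioned(data, lambda a, b: a - b, 1)   # -13
sub_two_partitions = reduce_partitioned(data, lambda a, b: a - b, 2)  # -3
```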

**flatMap(f, preservesPartitioning=False)**  
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
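
The difference between map and flatMap can be shown on an ordinary Python list. This is a plain-Python analogy rather than Spark code; the sample `lines` are made up.

```python
from itertools import chain

lines = ["to be or", "not to be"]

# map-like: one output element per input element (here, a list of words per line)
mapped = [line.split() for line in lines]
# [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap-like: apply the same function, then flatten one level
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
# ['to', 'be', 'or', 'not', 'to', 'be']
```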

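**groupByKey(numPartitions=None, partitionFunc=portable_hash)**  
Group the values for each key in the RDD into a single sequence.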

**reduceByKey(func, numPartitions=None, partitionFunc=portable_hash)**  
Merge the values for each key using an associative and commutative reduce function.  

This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.  

Output will be partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified. Default partitioner is hash-partition.  
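
The "merge locally, then merge across partitions" behaviour can be emulated in plain Python. This is a sketch of the semantics only, not Spark's implementation: `combine_locally` and `reduce_by_key` are hypothetical helpers, and the partitions are hand-written lists.

```python
from operator import add

def combine_locally(partition, func):
    """Merge values per key within one partition (the combiner step)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def reduce_by_key(partitions, func):
    """Combine each partition locally, then merge the partial results per key."""
    result = {}
    for partial in (combine_locally(p, func) for p in partitions):
        for key, value in partial.items():
            result[key] = func(result[key], value) if key in result else value
    return result

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1), ("a", 1)],
]
counts = reduce_by_key(partitions, add)  # {'a': 3, 'b': 2, 'c': 1}
```

Only one partial value per key leaves each partition, which is why this pattern moves less data than grouping all values first.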

**sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)**  
Sorts this RDD, which is assumed to consist of (key, value) pairs.   
keyfunc is applied to each key to derive the value used for sorting.
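
The effect of keyfunc can be illustrated with Python's built-in sorted on a list of pairs. This is an analogy on local data, not Spark code; the sample `pairs` are made up.

```python
pairs = [("b", 2), ("A", 1), ("a", 3), ("B", 4)]

# Default ordering: keys compared as-is, so uppercase sorts before lowercase
default_order = sorted(pairs, key=lambda kv: kv[0])
# [('A', 1), ('B', 4), ('a', 3), ('b', 2)]

# With a key function applied to each key: case-insensitive ordering
case_insensitive = sorted(pairs, key=lambda kv: kv[0].lower())
# [('A', 1), ('a', 3), ('b', 2), ('B', 4)]
```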

**join(other, numPartitions=None)**  
Return an RDD containing all pairs of elements with matching keys in self and other.

Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

Performs a hash join across the cluster.
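
The shape of the join result can be reproduced with a small hash join over local lists. This is a plain-Python sketch of the semantics, assuming everything fits in memory; `hash_join` is a hypothetical helper, not part of the Spark API.

```python
from collections import defaultdict

def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs, returning (k, (v1, v2)) tuples."""
    index = defaultdict(list)  # build a hash index over the right side
    for key, value in right:
        index[key].append(value)
    # probe the index with each left element; keys without a match are dropped
    return [(k, (v1, v2)) for k, v1 in left for v2 in index.get(k, [])]

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("a", 3)]
joined = hash_join(x, y)  # [('a', (1, 2)), ('a', (1, 3))]
```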

----------------------------------------------

# Code Examples with Python and Scala

## RDD Transformations and Action operations

### Test spark version  

sc.version

### Read a text file

In [None]:
# Python 
readme = sc.textFile("README.md") # usually this would read a distributed file on a cluster
# Scala
val readme = sc.textFile("README.md") 

### Check first line, Count number of lines

In [None]:
# Python
readme.first()
readme.count()

### Count no. of lines that contain word 'Spark'

In [None]:
# filter operation followed by action count
## Python 
readme.filter(lambda line : 'Spark' in line).count()
## Scala
readme.filter(line => line.contains("Spark")).count()   // single quotes denote a Char in Scala, so strings need double quotes

### Notice that where Python uses a lambda function, Scala defines an anonymous function using the syntax   
arg => function body  

### Find the maximum no. of words in a line

### Transformations and actions can be seen as functions oriented to performing specific types of operations, taking as arguments the functions that define those operations

In [None]:
# Break each line into words, count the words (transformations) and find max 
## Python
readme.map(lambda line : len(line.split())).reduce(lambda a,b: max(a,b) )
## Scala
import java.lang.Math
readme.map(line => line.split(" ")).
map(line => line.size).
reduce((a,b) => Math.max(a,b))

In [None]:
# same as above, but not using an anonymous function in Python
def max_of(a, b):  # named max_of to avoid shadowing the built-in max
    if a > b:
        return a
    else:
        return b
readme.map(lambda line : len(line.split())).reduce(max_of)

### Count of Words / Freq. distribution of words

In [None]:
# On the RDD, use transformations:
# flatMap to create a list of all words in the lines, map to pair each word with a count of 1, then reduceByKey to add the counts
# Action: collect, provided it does not bring a large amount of data into the driver node
from operator import add
counts = readme.flatMap(lambda line: line.split()).map(lambda x: (x,1)).reduceByKey(add).collect()

# Scala
readme.flatMap(line => line.split(" ")).
map(wrd => (wrd,1)).reduceByKey((a,b) => a+b).
take(5) // first five pairs

### Word with max counts

In [None]:
# Python 
from operator import add
rdd1 = readme.flatMap(lambda x: x.split()).map(lambda x: (x,1)).reduceByKey(add)
rdd1.reduce(lambda a,b : a if(a[1] > b[1]) else b)

# Scala
readme.flatMap(line => line.split(" ")).
map(wrd => (wrd,1)).reduceByKey((a,b) => a+b).
reduce((a,b) => if(a._2 > b._2) a else b)

# Using Spark Caching

When an RDD is small enough to fit in memory, or is used in repeated operations, it can be faster to cache it (still in 
a distributed way), i.e. each node keeps its partitions of the data in memory.   
A typical usage scenario is iterative model training on the same data.  

In [None]:
l_spark = readme.filter(lambda line : 'Spark' in line)
def count():
    return l_spark.count()

In [None]:
# Timer helps time an operation by running multiple iterations
from timeit import Timer
t = Timer(count) # pass the function itself; count() would call it once and pass its result
t.timeit(number = 50)

In [None]:
l_spark.cache()
t.timeit(number = 50)
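
What caching changes can be illustrated with a toy pipeline in plain Python that counts how often the underlying computation runs. This is only a sketch of the idea, not Spark's implementation: `CountingPipeline` is a hypothetical class, and the assumption is that everything fits in local memory.

```python
class CountingPipeline:
    """Toy pipeline that recomputes on every action unless cache() was called."""

    def __init__(self, data, func):
        self._data = data
        self._func = func
        self._use_cache = False
        self._cached = None
        self.evaluations = 0  # how many times the full pipeline has run

    def cache(self):
        # Like Spark's cache(), this only marks the data; nothing is computed yet
        self._use_cache = True
        return self

    def _compute(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = [self._func(x) for x in self._data]
        if self._use_cache:
            self._cached = result  # materialized by the first action after cache()
        return result

    def count(self):
        return len(self._compute())


p = CountingPipeline(range(5), lambda x: x * x)
p.count()
p.count()
uncached_runs = p.evaluations  # 2: every action recomputed the pipeline

p.cache()
p.count()
p.count()
p.count()
cached_runs = p.evaluations - uncached_runs  # 1: only the first action computed
```

In the timing cells above, the same effect means the first count after cache() still pays the full cost, and only the later iterations benefit.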