#### MapReduce
![img](maprreduce.png)


Why not MapReduce?
1. It is not designed for:
    1. Interactive queries
    2. Iterative processing
    3. Low-latency workloads such as streaming
2. It has high disk I/O: map output is written to local disk and reduce output is written to HDFS.
3. Each workload needs a different tool:
    1. SQL (Hive)
    2. Machine learning (Mahout)
    3. Graph processing (Giraph)
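
The MapReduce model itself can be sketched in plain Python as three phases. This is only an illustration on an in-memory list; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical, and the sample lines are made up for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    # Emit one (word, 1) pair per word, as a mapper would.
    return [(word, 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    # Group values by key. In Hadoop, map output is written to local
    # disk before reducers fetch it, which is where the disk I/O comes from.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Combine the values of each key into a single result.
    return {key: reduce(lambda x, y: x + y, values) for key, values in groups.items()}


lines = ["spark hadoop spark", "hadoop hive"]  # made-up sample input
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)  # {'spark': 2, 'hadoop': 2, 'hive': 1}
```

An iterative algorithm would repeat all three phases, and the disk writes, on every pass.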

#### Spark
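1. Compute engine
2. Unified data processing (SQL, streaming, machine learning, and graph workloads in one engine)
3. Consistent API across workloads
4. Low disk I/O, since intermediate results can be kept in memory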

### RDD
Resilient Distributed Datasets (RDDs)
1. An RDD is split into partitions, and the executors process the partitions in parallel. If a partition is lost to a failure, only that portion of the data is recomputed from the source data.
2. Each RDD maintains the lineage of the transformations that produced it, which is what makes this recomputation possible.
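
A minimal sketch of the lineage idea in plain Python follows. `TinyRDD` is a hypothetical stand-in, not Spark's implementation: it only records the functions passed to `map` and replays them on the source data when a partition has to be rebuilt.

```python
class TinyRDD:
    """Hypothetical stand-in for an RDD: source partitions plus a lineage."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # list of lists
        self.lineage = tuple(lineage)  # per-element functions, in order

    def map(self, func):
        # A transformation only records the function; nothing runs yet.
        return TinyRDD(self.source_partitions, self.lineage + (func,))

    def compute_partition(self, index):
        # Rebuild one partition by replaying the lineage on its source data.
        data = self.source_partitions[index]
        for func in self.lineage:
            data = [func(x) for x in data]
        return data


rdd = TinyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
computed = [rdd.compute_partition(i) for i in range(2)]
computed[1] = None  # simulate losing partition 1
computed[1] = rdd.compute_partition(1)  # only partition 1 is recomputed
print(computed)  # [[11, 21], [31, 41]]
```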

### Spark
1. SparkContext
    1. The main entry point for Spark functionality.
    2. `SparkContext()` starts with the default configuration; pass arguments (for example `master`, `appName`, or a `SparkConf` via `conf`) to override it.
2. Cluster manager
    1. Standalone
    2. Apache Mesos
    3. Hadoop YARN
    4. Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications
3. Spark architecture
    1. An executor is a process that runs on a worker node. Multiple executors can run on a single machine. The cluster manager allocates resources to each executor.
    2. The SparkContext on the driver node can connect to several types of cluster manager.
![spark](spark-arch.PNG)
    


In [12]:
# Plain Python equivalents of map, reduce, and filter
list1 = [1, 2, 3, 54, 10]

# Define a square function
def square(x):
    return x**2

# map: apply the function to every element
map_result = list(map(square, list1))
print(map_result)

# reduce: returns a single value, unlike map and filter, which return iterables
from functools import reduce

reduce_output = reduce(lambda x, y: x if x > y else y, list1)
print(reduce_output)

# filter: keep only the even elements
filter_output = list(filter(lambda x: x % 2 == 0, list1))
print(filter_output)


[1, 4, 9, 2916, 100]
54
[2, 54, 10]


In [2]:
import pyspark
from pyspark import SparkContext
sc=SparkContext()

In [40]:
# Create an RDD from a list, split into 7 partitions
list2 = [2, 5, 8, 9, 0, 1]
rdd = sc.parallelize(list2, 7)


In [46]:
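# map: square every element (a narrow transformation, so the 7 partitions are kept)
map_rdd = rdd.map(lambda x: x**2)
print(map_rdd.getNumPartitions())
print(map_rdd.collect())
# reduce: an action that returns a single value (here the maximum), not an RDD
max_value = rdd.reduce(lambda x, y: x if x > y else y)
print(max_value)
# filter: keep the even elements; collect() returns them as a Python list
even_values = rdd.filter(lambda x: x % 2 == 0).collect()
print(even_values)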

7
[4, 25, 64, 81, 0, 1]
9
[2, 8, 0]


In [47]:
# Aggregation with reduceByKey: sum the values for each key
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())


[('a', 2), ('b', 1)]
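
What `reduceByKey` computes can be sketched in plain Python: combine values per key inside each partition first, then merge the per-partition results. The function `reduce_by_key` and the two-partition layout are assumptions made for illustration; this is not how Spark is implemented internally.

```python
from operator import add


def reduce_by_key(partitions, func):
    # Step 1: combine values per key inside each partition.
    combined = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)
    # Step 2: merge the per-partition results (the shuffle step in Spark).
    merged = {}
    for local in combined:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Same records as the cell above, split across two assumed partitions.
partitions = [[("a", 1), ("b", 1)], [("a", 1)]]
result = sorted(reduce_by_key(partitions, add).items())
print(result)  # [('a', 2), ('b', 1)]
```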


1. Transformations
    1. Narrow: each output partition is built from a single input partition (a one-to-one mapping), so no data moves between partitions. If a partition count is passed when the RDD is created, narrow transformations such as `map` and `filter` keep that same number of partitions.
    2. Wide: Spark shuffles data across all the partitions, which can be very expensive on large data. Examples: `reduceByKey`, `sortByKey`.
2. Jobs: Spark creates a job whenever an action is called, for example:
    1. `collect`
    2. `take`
    3. `saveAsTextFile`
3. Stages: the number of stages depends on the wide transformations. Consecutive narrow transformations are pipelined into one stage, and each wide transformation starts a new one, so a simple linear job has (number of wide transformations + 1) stages.
4. Tasks: the number of tasks depends on the number of partitions; each stage runs one task per partition. For example, with two partitions and two stages, each stage runs 2 tasks, for 4 tasks in total.
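
The stage and task counting rule can be written as a small helper. This is a rough sketch under two assumptions: the job is a single linear chain of transformations, and the partition count stays the same in every stage. The names `WIDE_TRANSFORMATIONS` and `count_stages_and_tasks` are hypothetical, and the set lists only a few common shuffle operations.

```python
# A few transformations that trigger a shuffle (not an exhaustive list).
WIDE_TRANSFORMATIONS = {"reduceByKey", "groupByKey", "sortByKey", "distinct", "repartition"}


def count_stages_and_tasks(transformations, num_partitions):
    # Each wide transformation starts a new stage; narrow ones are pipelined.
    wide_count = sum(1 for name in transformations if name in WIDE_TRANSFORMATIONS)
    stages = wide_count + 1
    # Assumption: every stage has num_partitions partitions, one task each.
    tasks = stages * num_partitions
    return stages, tasks


print(count_stages_and_tasks(["map", "filter", "reduceByKey"], 2))  # (2, 4)
```

Real jobs can differ, for example when a shuffle changes the partition count, so the Spark UI remains the reliable place to check.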

Concepts around partitioning
1. When an RDD is created from raw data with an explicit partition count, Spark creates that many partitions. After a wide transformation such as `reduceByKey`, the data is redistributed by hash partitioning: each record is sent to the partition given by the hash of its key modulo the number of partitions.
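
Hash partitioning can be sketched in plain Python as `hash(key) % num_partitions`. The helper `hash_partition` is hypothetical and uses Python's built-in `hash`; PySpark uses its own `portable_hash` function by default. The sample records are (year, temperature) pairs taken from the output shown later in these notes.

```python
def hash_partition(records, num_partitions):
    # Send each (key, value) record to partition hash(key) % num_partitions.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


records = [(2010, 39), (2011, 38), (2012, 40), (2013, 47)]
print(hash_partition(records, 2))
# [[(2010, 39), (2012, 40)], [(2011, 38), (2013, 47)]]
```

With integer keys and two partitions, even years land in partition 0 and odd years in partition 1.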



### Hash partition

In [16]:
# Dataset location:
# https://drive.google.com/drive/u/0/folders/1yCNpxbFHyH-AEyVgIQ3gSE_Y-LBZ1DsZ

# Number of partitions for the input RDD
partition_num = 2
input_rdd = sc.textFile("C:/Users/sharsaur/NA-AI-lakehouse/deltalake/spark/weather.csv", partition_num)
# Build (year, temperature) pairs: the year is the part of column 0 before the first "-", the temperature is column 2
selected_fields_rdd = input_rdd.map(lambda line: (int(line.split(",")[0].split("-")[0]), int(line.split(",")[2])))


In [17]:

max_temperature_rdd = selected_fields_rdd.reduceByKey(lambda x, y: x if x>y else y)
print("Max temperature RDD: {}".format(max_temperature_rdd.collect()))
print("Partitioner for the max_temperature_rdd is {}".format(max_temperature_rdd.partitioner.partitionFunc))
max_temperature_rdd.saveAsTextFile(r"C:\Users\sharsaur\NA-AI-lakehouse\deltalake\spark\max_temp")
print(open(r"C:\Users\sharsaur\NA-AI-lakehouse\deltalake\spark\max_temp\part-00000", "r").read())


### Range partition

In [18]:

sorted_rdd = max_temperature_rdd.sortByKey()
print("Sorted RDD: {}".format(sorted_rdd.collect()))
print("Partitioner for the sorted_rdd is {}".format(sorted_rdd.partitioner.partitionFunc))
sorted_rdd.saveAsTextFile(r"C:\Users\sharsaur\NA-AI-lakehouse\deltalake\spark\sorted_max_temperature")
print(open(r"C:\Users\sharsaur\NA-AI-lakehouse\deltalake\spark\sorted_max_temperature\part-00001", "r").read())

Max temperature RDD: [(2016, 36), (2018, 45), (2010, 39), (2014, 35), (2012, 40), (2019, 47), (2017, 47), (2013, 47), (2015, 41), (2011, 38)]
Partitioner for the max_temperature_rdd is <function portable_hash at 0x000002345AD53280>
Sorted RDD: [(2010, 39), (2011, 38), (2012, 40), (2013, 47), (2014, 35), (2015, 41), (2016, 36), (2017, 47), (2018, 45), (2019, 47)]
Partitioner for the sorted_rdd is <function RDD.sortByKey.<locals>.rangePartitioner at 0x000002345D7853A0>
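
Range partitioning, which `sortByKey` relies on, can be sketched in plain Python: pick boundary keys, then place each record in the partition whose key range contains it. This is a simplified illustration; the helper `range_partition` is hypothetical and derives its boundaries from all the keys, whereas Spark estimates them from a sample. The records are a subset of the max-temperature output above.

```python
from bisect import bisect_left


def range_partition(records, num_partitions):
    keys = sorted(key for key, _ in records)
    # Pick num_partitions - 1 boundary keys that split the sorted keys evenly.
    bounds = [keys[len(keys) * (i + 1) // num_partitions] for i in range(num_partitions - 1)]
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Keys up to and including a boundary go to the lower partition.
        partitions[bisect_left(bounds, key)].append((key, value))
    # Sort inside each partition so the partitions read in global key order.
    return [sorted(p) for p in partitions]


records = [(2016, 36), (2018, 45), (2010, 39), (2014, 35), (2012, 40), (2019, 47)]
parts = range_partition(records, 2)
print(parts)
# [[(2010, 39), (2012, 40), (2014, 35), (2016, 36)], [(2018, 45), (2019, 47)]]
```

Unlike hash partitioning, each partition here holds a contiguous range of keys, so concatenating the partitions gives fully sorted output.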
