## Checking the impact of the number of workers

When initializing the SparkContext, we can specify the number of workers. In local mode, the master string local[N] runs Spark with N worker threads. It is generally recommended to have one worker per core of the machine, but the number can be smaller or larger. Here we examine the impact of the number of workers on a parallelized operation.

In [1]:
from time import time
from pyspark import SparkContext

In [2]:
for j in range(1,5):
    sc = SparkContext(master= "local[%d]"%(j))
    t0 = time()
    for i in range(10):
        sc.parallelize([1,2] * 10000).reduce(lambda x,y:x+y)
    print(f"{j} executors, time ={time()-t0}")
    sc.stop()

1 executors, time =3.005481004714966
2 executors, time =1.4382414817810059
3 executors, time =1.605672836303711
4 executors, time =1.2498397827148438


We observe that the run with 1 worker takes almost twice as long, and that the time then flattens out for 2, 3, and 4 workers. This is because the code was run on a Linux virtual machine using only 2 cores of the host. If you run it on a machine with 4 cores, you should see the benefit of 4 workers before the time flattens out. It also becomes clear that using more than one worker per core is not beneficial: the extra workers only add context switching and do not speed up the parallel computation.
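The one-worker-per-core guideline can be sketched as a small helper that builds the master string from the core count reported by the operating system. The name `local_master` is hypothetical and not part of PySpark; only the `local[N]` string format comes from Spark. Spark also accepts `local[*]`, which uses as many worker threads as there are logical cores.

```python
import os


def local_master(workers=None):
    """Build a local-mode Spark master string, capped at the core count.

    Hypothetical helper for illustration; not part of the PySpark API.
    """
    cores = os.cpu_count() or 1  # cpu_count() may return None
    if workers is None:
        workers = cores
    # Never ask for fewer than 1 worker or more workers than cores
    workers = max(1, min(workers, cores))
    return f"local[{workers}]"


print(local_master())   # one worker per available core
print(local_master(1))  # a single worker
```

The returned string could then be passed as `SparkContext(master=local_master())`.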

## Showing the essence of lazy evaluation

![68747470733a2f2f7170682e66732e71756f726163646e2e6e65742f6d61696e2d71696d672d64343964636633356563623765656366633665356233393439336130653038362d63.jpg](attachment:68747470733a2f2f7170682e66732e71756f726163646e2e6e65742f6d61696e2d71696d672d64343964636633356563623765656366633665356233393439336130653038362d63.jpg)

In [3]:
sc = SparkContext(master="local[2]")

In [4]:
### Make an RDD with 1 million elements

In [5]:
%%time
rdd1=sc.parallelize(range(1000000))

CPU times: user 402 µs, sys: 1.61 ms, total: 2.01 ms
Wall time: 4.66 ms


### Some computing function - taketime

In [6]:
from math import cos
def taketime(x):
    [cos(j) for j in range(100)]
    return cos(x)

### Check how much time is taken by taketime function

In [7]:
%%time
taketime(2)

CPU times: user 16 µs, sys: 11 µs, total: 27 µs
Wall time: 29.3 µs


-0.4161468365471424

### Now apply the function with the map operation

In [9]:
%%time
interim = rdd1.map(lambda x: taketime(x))

CPU times: user 9 µs, sys: 6 µs, total: 15 µs
Wall time: 18.4 µs


A single call to the taketime function took about 29 µs, yet the map operation over the RDD with 1 million elements took a similar amount of time (18.4 µs). This is because of lazy evaluation: nothing was computed in the previous step, only a plan of execution was made. The variable interim does not point to a data structure. Instead, it points to a plan of execution, expressed as a dependency graph. The dependency graph defines how RDDs are computed from each other.
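As a rough analogy outside Spark, Python's built-in `map` is also lazy, which makes the effect easy to observe with a call counter. This is a plain-Python sketch, not Spark code; the `calls` counter and the smaller input size are assumptions added for illustration.

```python
from math import cos

calls = 0  # how many times taketime has actually run


def taketime(x):
    global calls
    calls += 1
    [cos(j) for j in range(100)]  # busy work, as in the notebook
    return cos(x)


# Building the map only records what to do; nothing runs yet
interim = map(taketime, range(1000))
calls_after_map = calls
print(calls_after_map)  # 0

# Consuming the iterator (the "action") triggers the real work
total = sum(interim)
calls_after_sum = calls
print(calls_after_sum)  # 1000
```

In Spark, transformations such as map play the role of building the iterator, and actions such as reduce or count play the role of consuming it.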

### Let us see the "Dependency Graph" using the toDebugString method

In [10]:
print(interim.toDebugString().decode())

(2) PythonRDD[1] at RDD at PythonRDD.scala:53 []
 |  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:262 []


![68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f7469727468616a796f74692f537061726b2d776974682d507974686f6e2f6d61737465722f496d616765732f5244445f646570656e64656e63795f67726170682e504e47.png](attachment:68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f7469727468616a796f74692f537061726b2d776974682d507974686f6e2f6d61737465722f496d616765732f5244445f646570656e64656e63795f67726170682e504e47.png)

### The actual execution by the reduce method

In [11]:
%%time
print('output =', interim.reduce(lambda x,y:x+y))

output = -0.2887054679684353
CPU times: user 7.84 ms, sys: 3.07 ms, total: 10.9 ms
Wall time: 5.96 s


In [12]:
1000000*31e-6

31.0

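The measured 5.96 s is less than the roughly 31 s we would expect from 1 million calls to the taketime function at about 31 µs each. Part of this is the result of the parallel operation of 2 cores. Since two cores can at most halve the run time, the rest is likely because a single timed call includes measurement overhead and so overstates the per-call cost.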
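The serial estimate depends heavily on the per-call cost, and timing one call of a microsecond-scale function is noisy. A steadier figure could be obtained by averaging over many calls with the standard-library `timeit` module. This is a sketch; the repeat count of 10,000 is an arbitrary assumption, and the numbers it prints will vary by machine.

```python
from math import cos
from timeit import timeit


def taketime(x):
    [cos(j) for j in range(100)]  # busy work, as in the notebook
    return cos(x)


runs = 10_000  # arbitrary repeat count, chosen for illustration
# Average wall-clock cost of one call, in seconds
per_call = timeit(lambda: taketime(2), number=runs) / runs

n_elements = 1_000_000
print(f"per call: {per_call * 1e6:.1f} µs")
print(f"estimated serial time for {n_elements} calls: {per_call * n_elements:.1f} s")
```

Comparing this estimate with the measured wall time of the reduce gives a fairer picture of how much of the gap is due to parallelism.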

#### We have not saved (materialized) any intermediate results in interim, so another simple operation (e.g. counting elements > 0) will take almost the same time.

In [13]:
%%time
print(interim.filter(lambda x:x>0).count())

500000
CPU times: user 5.65 ms, sys: 3.9 ms, total: 9.56 ms
Wall time: 5.59 s


### Caching to reduce computation time on similar operations (at the cost of memory)

Run the same computation as before, this time with the cache method, so that the plan of execution includes caching the result.

In [14]:
%%time
interim = rdd1.map(lambda x: taketime(x)).cache()

CPU times: user 3.89 ms, sys: 2.53 ms, total: 6.42 ms
Wall time: 16 ms


In [15]:
print(interim.toDebugString().decode())

(2) PythonRDD[4] at RDD at PythonRDD.scala:53 [Memory Serialized 1x Replicated]
 |  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:262 [Memory Serialized 1x Replicated]


In [16]:
%%time
print('output =', interim.reduce(lambda x,y:x+y))

output = -0.2887054679684353
CPU times: user 2.61 ms, sys: 4.05 ms, total: 6.66 ms
Wall time: 5.19 s


### Now run the same filter method with the help of cached result

In [17]:
%%time 
print(interim.filter(lambda x:x>0).count())

500000
CPU times: user 3.34 ms, sys: 5.04 ms, total: 8.37 ms
Wall time: 217 ms


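This time the count took much less time (217 ms instead of about 5.6 s). The reduce in the previous cell had already computed and cached the mapped values, so the filter only had to compare the cached values to 0 and count them.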
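The trade-off between recomputing and caching can be imitated in plain Python by counting how often the expensive function runs. This is an analogy under stated assumptions, not Spark code: `pipeline` is a hypothetical stand-in for an uncached RDD that is recomputed for every action, and a list stands in for the cache. One difference to keep in mind is that Spark's cache is itself lazy and is only filled by the first action, whereas `list(...)` computes immediately.

```python
from math import cos

calls = 0  # how many times taketime has actually run


def taketime(x):
    global calls
    calls += 1
    [cos(j) for j in range(100)]  # busy work, as in the notebook
    return cos(x)


data = range(1000)


def pipeline():
    # Rebuilds the lazy map on every call, like an uncached RDD
    # that is recomputed from its dependency graph for each action
    return map(taketime, data)


# Two actions without caching: the expensive map runs twice
total = sum(pipeline())
positives = sum(1 for v in pipeline() if v > 0)
uncached_calls = calls

# Materialize once (spending memory), then reuse for both actions
calls = 0
cached = list(pipeline())
total_cached = sum(cached)
positives_cached = sum(1 for v in cached if v > 0)
cached_calls = calls

print(uncached_calls, cached_calls)  # 2000 1000
```

The results are identical in both cases; only the number of expensive calls changes.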