## Parallelizing using Pool.apply()

Let’s parallelize the **howmany_within_range()** function using **multiprocessing.Pool()**.

In [19]:
import numpy as np
from time import time

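# Prepare data
# Note: np.random.RandomState(100) builds a separate generator and discards it;
# it does not seed np.random, so the values below change from run to run.
# Use np.random.seed(100) instead if you need reproducible data.
np.random.RandomState(100)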
arr = np.random.randint(0, 10, size=[200000, 5])
data = arr.tolist()
data[:5]

[[2, 6, 7, 2, 9],
 [6, 9, 3, 6, 8],
 [0, 6, 1, 8, 9],
 [1, 5, 9, 4, 5],
 [6, 7, 8, 8, 1]]

In [20]:
# Solution Without Parallelization

def howmany_within_range(row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

results = []
for row in data:
    results.append(howmany_within_range(row, minimum=4, maximum=8))

print(results[:10])

[2, 3, 2, 3, 4, 2, 2, 3, 1, 4]
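The `time` import above is never used in the notebook. To have a baseline for comparing the parallel versions, you could time the serial loop. Here is a minimal sketch, assuming `time.perf_counter` as the clock and a smaller stand-in dataset built with the standard `random` module (both are choices made for this illustration, not part of the original cells):

```python
import random
from time import perf_counter


def howmany_within_range(row, minimum, maximum):
    """Count the values in `row` that lie in [minimum, maximum]."""
    return sum(1 for n in row if minimum <= n <= maximum)


# Stand-in data (assumption): 20,000 rows of 5 random digits each.
rng = random.Random(100)
sample = [[rng.randint(0, 9) for _ in range(5)] for _ in range(20_000)]

# Time the plain serial loop.
start = perf_counter()
serial_results = [howmany_within_range(row, 4, 8) for row in sample]
elapsed = perf_counter() - start

print(f"Serial run: {len(serial_results)} rows in {elapsed:.4f} s")
```

Wrapping each parallel variant in the same `perf_counter()` pair gives numbers you can compare directly. For a function this small, the cost of sending rows to worker processes may well outweigh the gain.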


In [21]:
import multiprocessing as mp
print("Number of processors: ", mp.cpu_count())

Number of processors:  8


### Run these parallel examples from separate .py files

1) Write the functions in one file

2) Import and call them from another file
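One reason for this: on platforms that start workers with the `spawn` method (Windows, and macOS by default since Python 3.8), every worker re-imports the main module. The worker function therefore has to live at module level in an importable file, and the pool has to be created under an `if __name__ == "__main__":` guard. The sketch below shows that layout in a single script; the file name `parallel_demo.py` and the three sample rows are assumptions for illustration.

```python
# parallel_demo.py (hypothetical file name)
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the values in `row` that lie in [minimum, maximum]."""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count


def main():
    # Small sample rows (assumption), in the same shape as `data`.
    data = [[2, 6, 7, 2, 9], [6, 9, 3, 6, 8], [0, 6, 1, 8, 9]]
    with mp.Pool(processes=2) as pool:
        results = [pool.apply(howmany_within_range, args=(row, 4, 8)) for row in data]
    print(results)


# The guard keeps workers from re-running the pool setup when they import this file.
if __name__ == "__main__":
    main()
```

Run it with `python parallel_demo.py`. To split it across two files, move `howmany_within_range` into its own module and import it in the script that creates the pool.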


In [None]:
# Parallelizing using Pool.apply()

import multiprocessing as mp

# Step 1: Init multiprocessing.Pool()
pool = mp.Pool(mp.cpu_count())

# Step 2: `pool.apply` the `howmany_within_range()`
# Note: `apply` blocks until each call returns, so the tasks run one at a time.
results = [pool.apply(howmany_within_range, args=(row, 4, 8)) for row in data]

# Step 3: Don't forget to close
pool.close()

print(results[:10])
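To check that the work really happens in worker processes, you can have the function report the process ID it ran in. The following is a rough sketch: `count_with_pid` is a hypothetical variant of `howmany_within_range()` and the sample rows are made up. Because `Pool.apply()` waits for each result before the next call is submitted, the rows are still processed one after another, even though different workers may pick them up.

```python
import multiprocessing as mp
import os


def count_with_pid(row, minimum, maximum):
    """Return (PID of the process that ran this call, count of values in range)."""
    count = sum(1 for n in row if minimum <= n <= maximum)
    return os.getpid(), count


def main():
    data = [[2, 6, 7, 2, 9], [6, 9, 3, 6, 8], [0, 6, 1, 8, 9], [1, 5, 9, 4, 5]]
    with mp.Pool(processes=2) as pool:
        # Each apply() call blocks until its result is back.
        tagged = [pool.apply(count_with_pid, args=(row, 4, 8)) for row in data]
    for pid, count in tagged:
        print(f"worker {pid} counted {count}")
    print("parent PID:", os.getpid())


if __name__ == "__main__":
    main()
```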

## Parallelizing using Pool.map()

**Pool.map()** accepts only one iterable as an argument. As a workaround, I give the minimum and maximum parameters of **howmany_within_range()** default values, which creates a new function, **howmany_within_range_rowonly()**, that needs only the row.

In [None]:
# Parallelizing using Pool.map()
import multiprocessing as mp

# Redefine, with only 1 mandatory argument.
def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

pool = mp.Pool(mp.cpu_count())

results = pool.map(howmany_within_range_rowonly, data)

pool.close()

print(results[:10])
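If you would rather not redefine the function, one alternative is to bind the extra arguments with `functools.partial`, which yields a one-argument callable that **Pool.map()** can use. This is a sketch of that idea, not the approach taken above; the name `count_4_to_8` and the sample rows are assumptions.

```python
import multiprocessing as mp
from functools import partial


def howmany_within_range(row, minimum, maximum):
    """Count the values in `row` that lie in [minimum, maximum]."""
    return sum(1 for n in row if minimum <= n <= maximum)


# Bind the bounds once; the resulting callable takes only `row`.
count_4_to_8 = partial(howmany_within_range, minimum=4, maximum=8)


def main():
    data = [[2, 6, 7, 2, 9], [6, 9, 3, 6, 8], [0, 6, 1, 8, 9], [1, 5, 9, 4, 5]]
    with mp.Pool(processes=2) as pool:
        # map() returns the results in the same order as the input rows.
        results = pool.map(count_4_to_8, data)
    print(results)


if __name__ == "__main__":
    main()
```

A `partial` built from a module-level function can be pickled, so it can be sent to the workers just like the function itself. A `lambda` would not work here with the standard pool.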

## Parallelizing with Pool.apply_async()

**apply_async()** is very similar to **apply()**, except that it returns immediately instead of waiting for the result, so you usually provide a callback function that tells how the computed results should be stored. A caveat with **apply_async()** is that the order of numbers in the result gets jumbled up, indicating that the processes did not finish in the order they were started.

In [None]:
# Parallel processing with Pool.apply_async()

import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())

In [None]:
results = []

# Step 1: Redefine, to accept `i`, the iteration number
def howmany_within_range2(i, row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return (i, count)


# Step 2: Define callback function to collect the output in `results`
def collect_result(result):
    global results
    results.append(result)


# Step 3: Use loop to parallelize
for i, row in enumerate(data):
    pool.apply_async(howmany_within_range2, args=(i, row, 4, 8), callback=collect_result)

# Step 4: Close Pool and let all the processes complete    
pool.close()
pool.join()  # postpones the execution of next line of code until all processes in the queue are done.

# Step 5: Sort results [OPTIONAL]
results.sort(key=lambda x: x[0])
results_final = [r for i, r in results]

print(results_final[:10])
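The callback is optional. `apply_async()` also returns an `AsyncResult` object, and calling `.get()` on those objects in submission order gives ordered results without passing the index `i` around. Here is a minimal sketch of that variant, using made-up sample rows:

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the values in `row` that lie in [minimum, maximum]."""
    return sum(1 for n in row if minimum <= n <= maximum)


def main():
    data = [[2, 6, 7, 2, 9], [6, 9, 3, 6, 8], [0, 6, 1, 8, 9], [1, 5, 9, 4, 5]]
    with mp.Pool(processes=2) as pool:
        # Submit everything first; each call returns an AsyncResult right away.
        handles = [pool.apply_async(howmany_within_range, args=(row, 4, 8)) for row in data]
        # Collect in submission order; get() blocks until that task is done.
        results = [h.get() for h in handles]
    print(results)


if __name__ == "__main__":
    main()
```

The tasks may still finish in any order. Only the collection step is ordered, because the list of handles keeps the order in which the rows were submitted.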

## Parallelizing with Pool.starmap_async()

You saw how **apply_async()** works. Can you write an equivalent version using **starmap_async()** and **map_async()**?

In [5]:
# Step 1: Redefine, to accept `i`, the iteration number
def howmany_within_range2(i, row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return (i, count)

In [None]:
# Parallelizing with Pool.starmap_async()

import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())

# `.get()` blocks until every task is done and returns the results in input order
results = pool.starmap_async(howmany_within_range2, [(i, row, 4, 8) for i, row in enumerate(data)]).get()

# With map, use `howmany_within_range_rowonly` instead
# results = pool.map_async(howmany_within_range_rowonly, data).get()

pool.close()
print(results[:10])
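Since `starmap_async(...).get()` returns its results in the order of the input, the index `i` is not strictly needed here. A sketch with the original three-argument function might look like this; `build_tasks` is a hypothetical helper, and the 60-second timeout is an arbitrary choice for illustration.

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the values in `row` that lie in [minimum, maximum]."""
    return sum(1 for n in row if minimum <= n <= maximum)


def build_tasks(data, minimum, maximum):
    """Pair every row with the same bounds: one argument tuple per task."""
    return [(row, minimum, maximum) for row in data]


def main():
    data = [[2, 6, 7, 2, 9], [6, 9, 3, 6, 8], [0, 6, 1, 8, 9], [1, 5, 9, 4, 5]]
    with mp.Pool(processes=2) as pool:
        async_result = pool.starmap_async(howmany_within_range, build_tasks(data, 4, 8))
        # get() waits for all tasks; the timeout guards against a hung worker.
        results = async_result.get(timeout=60)
    print(results)


if __name__ == "__main__":
    main()
```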

### Parallelize a Pandas DataFrame

When parallelizing a DataFrame, the function to be parallelized can take as its input parameter:

**1) one row of the dataframe**

**2) one column of the dataframe**

**3) the entire dataframe itself**

The first two can be done with the **multiprocessing** module itself. For the last one, parallelizing over an entire dataframe, we will use the **pathos** package, which uses **dill** for serialization internally.

In [6]:
import numpy as np
import pandas as pd
import multiprocessing as mp

df = pd.DataFrame(np.random.randint(3, 10, size=[5, 2]))
print(df.head())

   0  1
0  5  5
1  5  6
2  7  7
3  7  7
4  8  4


###### We exploit df.itertuples(name=None). By setting name=None, you pass each row of the dataframe as a plain tuple to the hypotenuse function.

In [7]:
# Row wise Operation
# With index=True, row[0] is the index; row[1] and row[2] are the two columns.
def hypotenuse(row):
    return round(row[1]**2 + row[2]**2, 2)**0.5


In [None]:
with mp.Pool(4) as pool:
    result = pool.imap(hypotenuse, df.itertuples(index=True, name=None), chunksize=10)
    output = [round(x, 2) for x in result]
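The second option from the list, passing one column at a time, is not shown in this notebook. A minimal sketch of it follows, assuming `DataFrame.items()` to produce `(label, Series)` pairs. The function `sum_of_squares` is a hypothetical example, and the values are the five rows printed earlier.

```python
import multiprocessing as mp

import pandas as pd


def sum_of_squares(item):
    """`item` is a (column_label, Series) pair as yielded by DataFrame.items()."""
    label, column = item
    return label, int((column ** 2).sum())


def main():
    # Same values as the 5x2 frame printed above.
    df = pd.DataFrame({0: [5, 5, 7, 7, 8], 1: [5, 6, 7, 7, 4]})
    with mp.Pool(processes=2) as pool:
        # Each worker receives one whole column.
        output = pool.map(sum_of_squares, list(df.items()))
    print(output)


if __name__ == "__main__":
    main()
```

Each column is pickled and sent to a worker as a whole, so this pays off only when the work per column is substantial.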

In [13]:
import numpy as np
import pandas as pd
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

df = pd.DataFrame(np.random.randint(3, 10, size=[500, 2]))

def func(df):
    return df.shape

In [14]:
cores = mp.cpu_count()

df_split = np.array_split(df, cores, axis=0)

# create the multiprocessing pool
pool = Pool(cores)

In [15]:
# process the DataFrame by mapping function to each df across the pool
df_out = np.vstack(pool.map(func, df_split))

In [16]:
df_out

array([[63,  2],
       [63,  2],
       [63,  2],
       [63,  2],
       [62,  2],
       [62,  2],
       [62,  2],
       [62,  2]])
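The chunk sizes in this output follow from how `np.array_split` handles a row count that does not divide evenly. With 500 rows and 8 chunks (one per core reported earlier), 500 = 8 × 62 + 4, so the first 4 chunks get one extra row. Here is a small sketch of that arithmetic in plain Python; `split_sizes` is a hypothetical helper, not a NumPy function.

```python
def split_sizes(n_rows, n_chunks):
    """Chunk lengths when n_rows are divided as evenly as possible into n_chunks."""
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` chunks take one additional row each.
    return [base + 1] * extra + [base] * (n_chunks - extra)


print(split_sizes(500, 8))
```

On a machine with a different core count the split changes accordingly, for example 4 chunks of 125 rows on 4 cores.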

In [17]:
# close down the pool and join
pool.close()
pool.join()
# pathos caches its pools; clear() drops the cached instance
pool.clear()