# Data Handling Workflows

Note that Parsl is not effective if multiple CPU cores aren't available, because Parsl's ability to execute tasks in parallel depends on the availability of multiple cores.

In [1]:
import multiprocessing
print('Cores available: {}'.format(multiprocessing.cpu_count()))

Cores available: 4


### Importing Libraries and Configuration

In [2]:
import numpy as np
import random
import time
import pandas as pd
import parsl
import os
import montage_wrapper as montage
from parsl.data_provider.files import File
cwd = os.getcwd()

from parsl.app.app import python_app, bash_app
from parsl.providers import LocalProvider
from parsl.channels import LocalChannel

from parsl.config import Config
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            cores_per_worker=1,
            provider=LocalProvider(
                channel=LocalChannel(),
                init_blocks=1,
                max_blocks=1,
            ),
        )
    ],
)

parsl.load(config)

<parsl.dataflow.dflow.DataFlowKernel at 0x118da8d90>

### Map Reduce

Map reduce is a technique that runs many parallel jobs over a dataset (the map step) to reduce its size, and then runs a final function over the intermediate results (the reduce step) to get the answer. It is a more elaborate form of synchronisation, since the reduce step has to wait for all of the map tasks to finish.

Let's consider a simple example where we are given multiple lists and we want to select the list with the highest standard deviation.

![](./images/map_reduce.png)

In [3]:
@python_app
def standard_deviation(inputs=[]):
    '''
    A python app to compute the standard deviation of the inputs
    '''
    import numpy as np
    return np.std(inputs)

In [4]:
def make_data():
    '''
    A function to construct data that is a list of 100 lists, each list having 100 random numbers.
    '''
    lists = []
    
    for _ in range(100):
        new_list = []
        
        for __ in range(100):
            new_list.append(random.random()*100)
            
        lists.append(new_list)
    return lists

our_data = make_data()

In [5]:
# Map step: launch one standard_deviation app per list
start1 = time.time()

standard_deviations = []

for i in our_data:
    standard_deviations.append(standard_deviation(inputs=i))

# Wait for every app to finish and collect the results
standard_deviations = [i.result() for i in standard_deviations]

# Reduce step: find the maximum standard deviation and the list it came from
maximum = max(standard_deviations)
print('Maximum Standard Deviation: ', maximum)
print('Target List Number:', standard_deviations.index(maximum))

end1 = time.time()

Maximum Standard Deviation:  31.65997545363917
Target List Number: 44
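
The same map and reduce steps can be sketched without Parsl, which may help when experimenting on a machine where no Parsl configuration is loaded. This is only an illustration under stated assumptions: the standard library's `concurrent.futures.ThreadPoolExecutor` stands in for Parsl apps, `statistics.pstdev` stands in for `np.std` (both compute the population standard deviation by default), and the names `map_stage`, `reduce_stage` and `sample_lists` are hypothetical helpers that do not appear in the notebook.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def map_stage(lists, max_workers=4):
    """Map: submit one standard-deviation task per list and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(statistics.pstdev, values) for values in lists]
        # .result() blocks until the task is done, much like a Parsl future.
        return [future.result() for future in futures]


def reduce_stage(deviations):
    """Reduce: return (index, value) of the largest standard deviation."""
    index = max(range(len(deviations)), key=deviations.__getitem__)
    return index, deviations[index]


# Seeded generator so that repeated runs build the same data (an assumption
# made for this sketch; the notebook uses the unseeded random module).
rng = random.Random(0)
sample_lists = [[rng.random() * 100 for _ in range(100)] for _ in range(100)]

deviations = map_stage(sample_lists)
target_index, target_value = reduce_stage(deviations)
print('Maximum Standard Deviation: ', target_value)
print('Target List Number:', target_index)
```

Threads are used here only to mirror the submit-then-collect structure; they are not expected to speed up this CPU-bound work in CPython. Returning the index and the value together from the reduce step avoids scanning the list twice.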


### Hashing

We'll be using a simple hash function to store elements in our database. We'll evaluate the hash values in parallel and then store the items in those locations.

In [6]:
# An empty database with 1000 slots
database = [0 for i in range(1000)]  

In [7]:
@python_app 
def hash_function(element):
    '''
    We import the hashlib library and then create the hash index (0 to 999).
    The element must be a bytes object.
    '''
    
    import hashlib
    number = int(hashlib.md5(element).hexdigest()[:8], 16)%1000
    return number

In [8]:
import random

elements = []

# Making a 5 letter element and collecting 100 such elements, encoded as bytes.
for i in range(100):
    element = '' 
    for _ in range(5):
        element += random.choice('abcdefghijklmnopqrstuvwxyz')
    element = element.encode()
    elements.append(element)

start1 = time.time()   
hashes = []
# Launching one hash_function app per element; each call returns a future.
for i in elements:
    hashes.append(hash_function(i))

hashes = [i.result() for i in hashes]

for i in range(len(elements)):
    database[hashes[i]] = elements[i]

end1 = time.time()

Note that this still doesn't solve the problem of collisions: two elements that hash to the same index overwrite each other in the database. Chaining is the alternative here, but in order to implement chaining we have to wait for the hash results before inserting, which interrupts the parallel execution.
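
The chaining idea mentioned above could look roughly like the following. This is a minimal sketch rather than part of the original notebook: `bucket_index`, `insert_with_chaining`, `chained_database` and `sample_elements` are hypothetical names, and the indices are computed serially here, whereas in the notebook they would come from the results of `hash_function`.

```python
import hashlib


def bucket_index(element, size=1000):
    """Same MD5-based index as hash_function, but as a plain function."""
    return int(hashlib.md5(element).hexdigest()[:8], 16) % size


def insert_with_chaining(table, element, index):
    """Append the element to the chain at `index`, skipping exact duplicates."""
    if element not in table[index]:
        table[index].append(element)


# Each slot holds a list (the "chain") instead of a single value, so
# elements that share an index are kept side by side.
chained_database = [[] for _ in range(1000)]

# Illustrative data; b'apple' appears twice to show duplicate handling.
sample_elements = [b'apple', b'grape', b'melon', b'apple']

# Hashing is the part that can run in parallel.
indices = [bucket_index(element) for element in sample_elements]

# Insertion happens only after the indices are known.
for element, index in zip(sample_elements, indices):
    insert_with_chaining(chained_database, element, index)

stored = sum(len(bucket) for bucket in chained_database)
print('Elements stored:', stored)
```

Only the hashing step is independent per element. The insertion loop runs after the indices are available, which is the synchronisation point described in the note above.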