# Data Handling Workflows

### Importing Libraries and Configuration

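We'll load one of Parsl's bundled configurations. The cell below imports the local threads configuration (`parsl.configs.local_threads`); the htex configuration is described [here](https://github.com/Parsl/parsl/blob/master/parsl/configs/htex_local.py).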

In [2]:
import numpy as np

import parsl
import os
from parsl.app.app import python_app, bash_app
from parsl.configs.local_threads import config

parsl.load(config)

<parsl.dataflow.dflow.DataFlowKernel at 0x1148db4d0>

### Map Reduce

Map reduce is a technique that runs many parallel jobs over a dataset (the map step), shrinking it to a small set of intermediate results, and then runs a final function over those results (the reduce step). Because the reduce step has to wait for every map job to finish, map reduce can be seen as a more elaborate form of synchronisation.

Let's consider a simple example: we are given multiple lists and want to select the list with the highest standard deviation.

![](./images/map_reduce.png)
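
The pattern can also be sketched without Parsl. The following is a minimal illustration, assuming the standard library's `concurrent.futures.ThreadPoolExecutor` as a stand-in for Parsl apps and `statistics.pstdev` in place of `np.std` (both compute the population standard deviation by default). The helper name `map_reduce_max_std` and the sample data are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import pstdev


def map_reduce_max_std(lists):
    """Return (index, std) of the list with the largest population standard deviation."""
    with ThreadPoolExecutor() as pool:
        # Map step: one independent job per list, run concurrently.
        stds = list(pool.map(pstdev, lists))
    # Reduce step: runs only after every map job has finished.
    best = max(range(len(stds)), key=stds.__getitem__)
    return best, stds[best]


sample = [[1, 1, 1], [0, 10, 20], [5, 6, 7]]  # hypothetical data
index, std = map_reduce_max_std(sample)
print(index, round(std, 3))
```

The map step produces one number per list, so the reduce step only has to scan a short list of results rather than the full dataset.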

In [2]:
# A python app to compute the standard deviation of the inputs
@python_app
def standard_deviation(inputs=[]):
    import numpy as np
    return np.std(inputs)

In [3]:
import random
# A function to construct data that is a list of lists, each list having 100 random numbers

def make_data():
    lists = []
    
    for _ in range(100):
        new_list = []
        
        for __ in range(100):
            new_list.append(random.random()*100)
            
        lists.append(new_list)
    return lists

our_data = make_data()

In [4]:
# Computing the standard deviations for each list

standard_deviations = []

for i in our_data:
    standard_deviations.append(standard_deviation(i))

In [5]:
# Finding the maximum standard deviation

standard_deviations = [i.result() for i in standard_deviations]
print('Maximum Standard Deviation: ', max(standard_deviations))

Maximum Standard Deviation:  31.95613285570822


In [6]:
# Finding the list with the maximum standard deviation

maximum = max(standard_deviations)
print('Target List Number:', standard_deviations.index(maximum))

Target List Number: 40
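
As a sanity check on the numbers above: values drawn uniformly from [0, 100) have a theoretical standard deviation of 100/√12 ≈ 28.87, so the largest of 100 sample standard deviations should land a little above that. The sketch below uses only the standard library and a fixed seed chosen arbitrarily for this example; `theoretical_std` and `max_sample_std` are names introduced here, not part of the notebook.

```python
import math
import random
from statistics import pstdev

# Standard deviation of a uniform distribution on [a, b) is (b - a) / sqrt(12).
theoretical_std = (100 - 0) / math.sqrt(12)

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable
lists = [[rng.random() * 100 for _ in range(100)] for _ in range(100)]
max_sample_std = max(pstdev(values) for values in lists)

print(f"theoretical: {theoretical_std:.2f}, largest sample: {max_sample_std:.2f}")
```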


### Hashing

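We'll use a simple hash function to store elements in our database. A Parsl app computes each element's hash index and stores the element at that location. Note that, as written below, each call receives the future returned by the previous call, so Parsl runs the calls one after another in dependency order.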

In [26]:
database = [0 for i in range(1000)]  # An empty database

In [22]:
@python_app 
def hash_function(element, database):
    import hashlib
    number = int(hashlib.md5(element).hexdigest()[:8], 16)%1000   # Creating a hash index
    database[number] = element # Storing it in the database based on the hashed index
    return database # return the database
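
To make the indexing step concrete, here is a minimal sketch of the index computation on its own, following the same recipe as `hash_function` above (first 8 hex digits of the MD5 digest, modulo the table size). The standalone helper `hash_index` and its `size` parameter are introduced for illustration only.

```python
import hashlib


def hash_index(element: bytes, size: int = 1000) -> int:
    """Map a bytes element to a slot in [0, size) using the first 8 hex digits of its MD5 digest."""
    prefix = hashlib.md5(element).hexdigest()[:8]  # 8 hex digits = 32 bits
    return int(prefix, 16) % size


slot = hash_index(b"gwpde")
print(slot)
```

The index depends only on the element's bytes, so the same element always lands in the same slot, and two different elements can land in the same slot.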

In [27]:
import random

elements = []

for i in range(100):
    element = '' 
    for _ in range(5):
        element += random.choice('abcdefghijklmnopqrstuvwxyz')
    element = element.encode() # Making a 5 letter element
    elements.append(element) # collecting 100 such elements

In [28]:
for i in elements: # Updating the database for all the elements
    database = hash_function(i, database)

In [32]:
print(database.result())

[0, b'gwpde', 0, b'gchcx', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'rqwik', 0, 0, 0, 0, b'xuthu', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'zbbvc', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'bcrfc', 0, 0, b'frulk', 0, b'qaqkd', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'akybt', 0, 0, 0, b'ytbfd', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'hbpiy', 0, b'zgvpq', 0, 0, 0, 0, b'remuf', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'tirle', 0, b'rckyk', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'qhtww', 0, 0, 0, 0, 0, b'zpsjd', 0, 0, 0, 0, 0, 0, b'mjfhw', 0, 0, 0, 0, 0, b'wsdyh', 0, 0, 0, 0, 0, 0, 0, 0, 0, b'hwqoq', 0, 0, 0, b'ggaxd', 0, 0, 0, b'jlhwc', 0, 0, b'yglvu', 0, 0, 0, 0, 0, 0, 0, 0, b'ftxpc', 0, b'gseqq', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, b'fotsw', 0, 0, 0, b'qswiq', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Note that this still doesn't handle collisions: two elements that hash to the same index overwrite each other. Chaining (keeping a list of elements at each index) is the usual alternative, but implementing it requires evaluating the results first, which breaks the parallel flow.
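
One way to picture chaining is to split the work in two: compute the hash indices concurrently, collect them, and only then fill the buckets serially. The sketch below assumes the standard library's `ThreadPoolExecutor` as a stand-in for Parsl apps; `hash_index` and `build_chained_table` are hypothetical helpers, and the tiny table size is chosen only to force collisions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_index(element: bytes, size: int) -> int:
    """Slot for an element: first 8 hex digits of its MD5 digest, modulo the table size."""
    return int(hashlib.md5(element).hexdigest()[:8], 16) % size


def build_chained_table(elements, size=1000):
    """Hash every element concurrently, then append each one to the bucket at its slot."""
    with ThreadPoolExecutor() as pool:
        # Parallel part: the index computations are independent of each other.
        slots = list(pool.map(lambda e: hash_index(e, size), elements))
    # Serial part: the buckets are filled only after the indices have been collected.
    table = [[] for _ in range(size)]
    for slot, element in zip(slots, elements):
        table[slot].append(element)
    return table


# A tiny table forces collisions: five elements cannot fit in three slots without sharing.
items = [b"gwpde", b"gchcx", b"rqwik", b"xuthu", b"zbbvc"]
table = build_chained_table(items, size=3)
print(table)
```

Every element survives in this layout because colliding elements share a bucket instead of overwriting each other; the cost is the synchronisation point between the hashing and the bucket filling.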