# Distributed Databases and Big Data

# Solution Workbook


- Name: Syed Ali Alim Rizvi

**About Jupyter Notebook**

*Server Information:*

    The version of the notebook server is 4.2.3 and is running on:
        Python 3.5.2 


*Current Kernel Information:*
    
    Python 3.5.2 
    IPython 5.1.0 


# Table of Contents

- [Parallel Search](#1)
    * [Part 1](#1.1)
    * [Part 2](#1.2)

## Libraries

***`If a library isn't installed already, please uncomment and run the code below.`***


In [1]:
#!pip install pandas
# multiprocessing is part of the Python standard library, so it needs no installation

In [3]:
import pandas as pd
from multiprocessing import Pool

## Data Collection

In [4]:
# CLIMATE DATA

#read the CSV file into a dataframe
cdf = pd.read_csv('ClimateData.csv')

#create list
cd = []
row = []
for i in range(cdf.shape[0]):
    for j in range(cdf.shape[1]):
        row.append(cdf.iloc[i,j])
    cd.append(row)
    row = []

In [5]:
# FIRE DATA 

#read the CSV file into a dataframe
fdf = pd.read_csv('FireData.csv')

#creating list
fd = []
row = []
for i in range(fdf.shape[0]):
    for j in range(fdf.shape[1]):
        row.append(fdf.iloc[i,j])
    fd.append(row)
    row = []


***

# Parallel Search
<a id='1'></a>

## Parallel Search  1.1
<a id='1.1'></a>

1: Write an algorithm to search the climate data for the records on 15th December 2017. Justify
your choice of the data partition technique and search technique you have used.





Solution:

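Partitioning: hash partitioning. Because the search is on the partitioning attribute (the date), range or hash partitioning is a better fit than the other partitioning methods. And because the query is an exact match, hash partitioning is better still than range partitioning: the hash of the query value identifies the single processor that holds the matching records, so only that processor has to search.

Search: binary search. Dates of the form yyyy-mm-dd can be converted to integers and sorted easily, and on sorted data binary search needs O(log n) comparisons where a linear search needs O(n). The sort itself costs O(n log n), so the gain is largest when a partition is kept sorted and searched more than once.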
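
To make the search argument concrete, the following minimal sketch counts the comparisons made by a linear scan and by a binary search over the same sorted list. The helper names `linear_steps` and `binary_steps` and the toy keys are illustrative assumptions, not part of the workbook.

```python
# Illustrative only: count comparisons made by linear and binary search
# on an already sorted list of integer keys.

def linear_steps(sorted_keys, key):
    """Return the number of elements inspected by a left-to-right scan."""
    steps = 0
    for value in sorted_keys:
        steps += 1
        if value == key:
            break
    return steps

def binary_steps(sorted_keys, key):
    """Return the number of midpoints inspected by a binary search."""
    steps = 0
    lower, upper = 0, len(sorted_keys) - 1
    while lower <= upper:
        steps += 1
        middle = (lower + upper) // 2
        if sorted_keys[middle] == key:
            break
        elif key > sorted_keys[middle]:
            lower = middle + 1
        else:
            upper = middle - 1
    return steps

keys = list(range(1024))  # toy sorted data, not the climate records
n_linear = linear_steps(keys, 1000)
n_binary = binary_steps(keys, 1000)
print(n_linear, n_binary)
```

For 1,024 sorted keys the binary search inspects at most 11 midpoints, while the scan inspects every element up to the match.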

#### Data partition

##### - Defining hash function

In [6]:
# Define a simple hash function.
def s_hash(x, n):
    """
    Define a simple hash function for demonstration
    Arguments:
    x -- an input date as string
    n -- the number of processors
    Return:
    result -- the hash value of x
    """
    ### START CODE HERE ###
    date_parse = x.split('-')
    result = int(date_parse[1])%n #hash key by months
    
    ### END CODE HERE ###
    return result
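
Hashing on the month modulo `n` spreads the twelve calendar months evenly only when `n` divides 12. The sketch below checks this for two processor counts; `months_per_bucket` is a hypothetical helper, and it counts months rather than records, so the real balance also depends on how many records each month holds.

```python
from collections import Counter

def months_per_bucket(n):
    """Count how many of the 12 calendar months land in each bucket of month % n."""
    return Counter(month % n for month in range(1, 13))

even = months_per_bucket(3)    # 12 is divisible by 3
uneven = months_per_bucket(5)  # 12 is not divisible by 5
print(sorted(even.items()))
print(sorted(uneven.items()))
```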

##### - Defining hash partition

In [7]:
# Hash data partitioning function.
# We will use the "s_hash" function defined above to realise this partitioning
def h_partition(data, n):
    
    """
    Perform hash data partitioning on data
    Arguments:
    data -- an input dataset which is a list of lists
    n -- the number of processors
    Return:
    result -- a dictionary mapping each hash key to its partitioned subset of data
    """
    
    dic = {} # create a dictionary
    
    for rec in data: # For each data record (list), perform the following:
        
        # Get the hash key of the inputs Date
        date = rec[1]
        h = s_hash(date, n) 
        
        #add records by hash key into dictionary
        dic.setdefault(h, [])
        dic[h].append(rec)
    
    return dic
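
As a small illustration of why an exact-match query only has to visit one partition, the sketch below partitions a handful of made-up dates with a month-based hash and routes a query to its bucket. `month_hash` mirrors the idea of `s_hash`, and the dates are hypothetical toy data, not rows from ClimateData.csv.

```python
def month_hash(date_str, n):
    """Hash a 'yyyy-mm-dd' string by its month, modulo the number of processors."""
    return int(date_str.split('-')[1]) % n

n_processors = 3
dates = ['2017-01-05', '2017-02-11', '2017-03-20', '2017-12-15', '2017-12-16']

# Build the partitions: hash key -> list of dates
partitions = {}
for d in dates:
    partitions.setdefault(month_hash(d, n_processors), []).append(d)

# Route the query: only the bucket it hashes to needs to be searched
target = month_hash('2017-12-15', n_processors)
records_scanned = len(partitions[target])
print(target, partitions[target])
```

Only the records in the target bucket are candidates; the other buckets are never touched.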

#### Binary Search Algo

##### - function to change date into integer for comparison and sorting

In [8]:
def date_to_int(date):
    
    """
    converts a date string of the form 'yyyy-mm-dd' to int of the form 'yyyymmdd'
    arguments:
    date -- the date that needs to be converted to int form 
    return:
    result -- int representation of the date
    """
    
    result = int(date.replace('-',''))
    
    return result
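
The integer form keeps chronological order only when month and day are zero-padded. The following sketch, which restates `date_to_int` so that it runs on its own and uses made-up dates, checks the ordering against `datetime` and shows how an unpadded date would break it.

```python
from datetime import datetime

def date_to_int(date):
    """Convert 'yyyy-mm-dd' to the integer yyyymmdd (same idea as the cell above)."""
    return int(date.replace('-', ''))

dates = ['2017-12-15', '2016-01-31', '2017-03-02', '2017-11-30']

by_int = sorted(dates, key=date_to_int)
by_calendar = sorted(dates, key=lambda d: datetime.strptime(d, '%Y-%m-%d'))

# An unpadded date collapses to a shorter integer and sorts in the wrong place
unpadded = date_to_int('2017-1-5')     # 201715
padded = date_to_int('2017-01-04')     # 20170104
print(by_int == by_calendar, unpadded < padded)
```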

##### - binary search algo

In [9]:
# Binary search function
def binary_search(data, key):
    
    """
    Perform binary search on data for the given key, sorting the data by date first
    Arguments:
    data -- an input dataset which is a list of records (lists) with the date at index 1
    key -- the query date as a 'yyyy-mm-dd' string
    Return:
    matched_record -- a record with the queried date, or None if there is none
    """

    matched_record = None
    key = date_to_int(key)
    
    # binary search needs sorted input, so sort the records by date
    data_sorted = sorted(data, key = lambda rec: date_to_int(rec[1]))
    
    lower = 0
    upper = len(data_sorted)-1
    
    ### START CODE HERE ###
    
    while (lower <= upper):
        
        # calculate middle: the midpoint of lower and upper (integer division)
        middle = (lower + upper)//2
        middle_date = date_to_int(data_sorted[middle][1])
        
        #compare date of rec to key
        if middle_date == key:
            matched_record = data_sorted[middle]
            break
        elif key > middle_date:
            lower = middle+1
        else:
            upper = middle-1
            
        
    ### END CODE HERE ###
    return matched_record
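
For comparison, the same exact-match lookup can be sketched with the standard library's `bisect` module. This is an alternative technique, not the workbook's method; `find_by_date` and the toy records are assumptions, and the keys are precomputed because `bisect_left` only accepts a `key` argument from Python 3.10 onwards.

```python
from bisect import bisect_left

def find_by_date(sorted_records, sorted_keys, date):
    """Return a record whose date matches, or None.

    sorted_records -- records sorted by date
    sorted_keys    -- the integer yyyymmdd key of each record, in the same order
    """
    key = int(date.replace('-', ''))
    i = bisect_left(sorted_keys, key)  # leftmost position where key could go
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_records[i]
    return None

# Hypothetical toy records: [station id, date]
records = [[1, '2017-12-16'], [2, '2017-03-20'], [3, '2017-12-15']]
records.sort(key=lambda r: int(r[1].replace('-', '')))
keys = [int(r[1].replace('-', '')) for r in records]

hit = find_by_date(records, keys, '2017-12-15')
miss = find_by_date(records, keys, '2017-12-14')
print(hit, miss)
```

Sorting once and keeping the key list around means repeated lookups pay only the logarithmic search cost.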

#### multiprocessing

In [10]:
# Parallel searching algorithm for exact match
def parallel_search_exact(data, query, n_processor):

    """
    Perform parallel search for exact match on data for the given key
    Arguments:
    data -- an input dataset which is a list of lists
    query -- a query record ie. date to search on
    n_processor -- the number of parallel processors
    Return:
    results -- the matched record information
    """

    #results = []
    
    # Pool: a Python method enabling parallel processing.
    # We need to set the number of processes to n_processor,
    # which means that the Pool class will only allow 'n_processor' processes
    # running at the same time.
    
    pool = Pool(processes=n_processor)
    
    ### START CODE HERE ###
    
    print("data partitioning: h_partition")
    print("searching method: binary_search")
    
    # Perform data partitioning first
    DD = h_partition(data, n_processor)
    # Each element in DD has a pair (hash key: records)
    query_hash = s_hash(query, n_processor)
    # only the partition that the query hashes to can hold matching records
    d = list(DD.get(query_hash, []))
    result = pool.apply(binary_search, [d, query])
    #results.append(result)
    pool.close() #stop accepting new tasks
    pool.join() #wait for the worker processes to exit
    
    
    """
    The method above, 'pool.apply()', blocks until the submitted task
    has finished. Alternatively, we can use the 'pool.apply_async()' method,
    which returns immediately so that several tasks can run in parallel.
    """

    ### END CODE HERE ###
    
    return result#results
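
The difference between the blocking `apply` and the non-blocking `apply_async` can be seen in a minimal sketch. It uses `ThreadPool` from `multiprocessing.pool`, which exposes the same `apply`/`apply_async` interface as `Pool` but runs tasks in threads; that substitution is an assumption made only so the snippet runs anywhere, and `square` is a stand-in for the search function.

```python
from multiprocessing.pool import ThreadPool

def square(x):
    """Stand-in for a per-partition search task."""
    return x * x

pool = ThreadPool(processes=3)

# apply: submits one task and blocks until its result is ready
blocking = pool.apply(square, (4,))

# apply_async: returns a handle immediately, so several tasks can be in flight
handles = [pool.apply_async(square, (i,)) for i in range(5)]
async_results = [h.get() for h in handles]  # get() waits for each result

pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to finish
print(blocking, async_results)
```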

### Solution

In [13]:
parallel_search_exact(cd, '2017-12-15', 3)

data partitioning: h_partition
searching method: binary_search


[948702,
 '2017-12-15',
 18,
 52.0,
 7.0999999999999996,
 14.0,
 '   74.5*',
 '53.1',
 ' 0.00I']

## Parallel Search 1.2
<a id='1.2'></a>

2: Write an algorithm to find the latitude, longitude and confidence when the surface temperature (°C) was between 65 °C and 100 °C. Justify your choice of the data partition technique and search technique you have used.

Solution:

Partition: range partitioning. The retrieval is on the partitioning attribute, so a partitioning method that takes the attribute's value into account is appropriate; hash and range partitioning both qualify. Because the query asks for a range of values, range partitioning suits it better: the matching records lie in a few contiguous partitions, whereas hash partitioning is a poor fit for range retrieval. Even though the values in our range are discrete, the time spent computing a hash for every value in the range makes hash partitioning less suitable here.

Search: binary search. Every value in the range has to be looked up, so sorting the data first and then applying binary search keeps each lookup efficient. Sorting may look like an extra step, but over the repeated lookups it saves time compared with a linear search for every value in the range.
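
The benefit of range partitioning for a range query is that whole partitions can be skipped. As a rough sketch, the helper below (a hypothetical `partitions_for_range`, assuming partition `i` holds values from one boundary up to but excluding the next) finds which partitions a query range can touch.

```python
from bisect import bisect_right

def partitions_for_range(boundaries, low, high):
    """Return the indices of the range partitions that can hold values in [low, high].

    Partition i holds values v with boundaries[i-1] <= v < boundaries[i];
    the last partition holds everything at or above the final boundary.
    """
    first = bisect_right(boundaries, low)
    last = bisect_right(boundaries, high)
    return list(range(first, last + 1))

boundaries = [30, 60, 120]
narrow = partitions_for_range(boundaries, 65, 100)  # falls inside one partition
wide = partitions_for_range(boundaries, 50, 100)    # straddles the boundary at 60
print(narrow, wide)
```

With these boundaries a query for 65 to 100 needs only one of the four partitions, which is the pruning that hash partitioning cannot offer for ranges.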

#### Data partitioning

In [14]:
# Range data partitioning function
def range_partition(data, range_indices):
    
    """
    Perform range data partitioning on data
    Arguments:
    data -- an input dataset which is a list of records (lists), partitioned on index 7
    range_indices -- the sorted list of boundary values at which to split
    Return:
    result -- the partitioned subsets of data
    """
    
    result = []
    
    
    # First, we sort the dataset according to the partitioning attribute
    new_data = sorted(data, key= lambda x: x[7])
    
    
    # Number of boundaries, i.e. the number of bins - 1
    n_bin = len(range_indices)
    
    
    # For each boundary, perform the following
    for i in range(n_bin):
        
        # Find the elements below the current boundary
        s = [x for x in new_data if x[7] < range_indices[i]]
        
        # Add the partitioned list to the result
        result.append(s)
        
        # The data is sorted, so s is a prefix of new_data;
        # drop it so that the next bin starts after it
        # (slicing by length also handles duplicate records correctly)
        new_data = new_data[len(s):]
    
    # Append the last remaining data list
    result.append(new_data)
    
    
    return result
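
A one-pass variant of the same idea can be sketched with `bisect_right`, which picks the bin for each value directly instead of sorting and slicing. The function name `range_partition_by` and the toy temperatures are assumptions for illustration; it partitions plain values rather than full records.

```python
from bisect import bisect_right

def range_partition_by(values, boundaries):
    """Place each value in the bin whose range contains it.

    Bin 0 holds values below boundaries[0], bin i holds values in
    [boundaries[i-1], boundaries[i]), and the last bin holds the rest.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        parts[bisect_right(boundaries, v)].append(v)
    return parts

# Hypothetical toy temperatures
parts = range_partition_by([12, 75, 30, 59, 60, 130, 99], [30, 60, 120])
print(parts)
```

Values equal to a boundary fall into the upper bin, matching the strict `<` comparison used in `range_partition`.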

In [15]:
#range_partition(fd, [60,100])

#### Binary Search Algo

In [16]:
# Binary search function for a range of keys
# Note: this redefines the exact-match binary_search from Part 1.1
def binary_search(data, query_range):
    
    """
    Perform binary search on data for every integer key inside the given range
    Arguments:
    data -- an input dataset which is a list of records (lists), searched on index 7
    query_range -- the [low, high] range to search; both boundaries are excluded
    Return:
    result -- the list of matched records
    """
    
    # First, we sort the dataset according to the search attribute
    new_data = sorted(data, key= lambda x: x[7])
    
    #create list to store matched records
    result = []

    for key in range(query_range[0]+1, query_range[1]): #range exclusive of boundaries
        
        #run the binary search for each key without breaking the loop since there can be multiple records for an attribute
        lower = 0
        upper = len(new_data)-1

        while (lower <= upper):

            # calculate middle: the midpoint of lower and upper (integer division)
            middle = (lower + upper)//2
            
            if new_data[middle][7] == key:
                matched_record = new_data.pop(middle) # remove the record after it is found
                result.append(matched_record) #add the record into result
                upper -= 1 #since we have removed a record subtract one from upper
            elif key > new_data[middle][7]:
                lower = middle+1
            else:
                upper = middle-1
                
    return result
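
An alternative way to answer a range query on sorted data is to binary-search for the two ends of the range and take the slice in between. The sketch below does this with `bisect`; it is not the workbook's approach, `range_slice` is a hypothetical name, and the temperatures are toy values. Unlike a per-integer lookup it also works for non-integer values.

```python
from bisect import bisect_left, bisect_right

def range_slice(sorted_values, low, high):
    """Return the values strictly between low and high from a sorted list."""
    start = bisect_right(sorted_values, low)  # first index with value > low
    end = bisect_left(sorted_values, high)    # first index with value >= high
    return sorted_values[start:end]

temps = [60, 65, 66, 66, 70, 99, 100, 101]  # already sorted toy data
matches = range_slice(temps, 65, 100)
print(matches)
```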

In [17]:
#binary_search(fd, [27,31])

#### multiprocessing to apply binary search on ranged partitioned data

In [18]:
# Parallel searching algorithm for range selection
# range partition and binary search

def parallel_search_range(data, query_range, n_processor, partition_range):
    
    """
    Perform parallel search for range selection on data for the given key

    Arguments:
    data -- the input dataset which is a list
    query_range -- a query record in the form of a range (e.g. [30, 50])
    n_processor -- the number of parallel processors
    partition_range -- the boundary values used to range-partition the data (e.g. [30, 60, 120])
    
    Return:
    output -- the matched records as [lat, long, conf, surf_temp] lists
    """
    
    processes = [] #list to store the pending result of each submitted task
    output = [] #list to store the output of all processes

    pool = Pool(processes = n_processor) #n_processor worker processes; the main process is not part of the pool
       
    
    #identify the lower and upper bound of the query range
    low = query_range[0]
    high = query_range[1]
            
    # Partition by range partitioning
    data_parts = range_partition(data, partition_range) 
    
    
    check = False #check on whether to apply processor on looping data partition or no
    
    for data_part in data_parts: # Find the range that may contain the query
        
        if len(data_part)==0:
            continue
        
        if low < data_part[-1][7]: #compare with last element of partition; since data is sorted
            check = True
        
        if check: #if data in current partition assign a worker to it
            process = pool.apply_async(binary_search, args=(data_part, query_range,))
            processes.append(process)
            
        if high < data_part[-1][7]: #if high is already in current data part then break
            break
            
    
    for p in processes: 
        for row in p.get(): #for each processes binary search result (list of records)
            output.append(row[0:2]+row[5:6]+[row[-1]]) #put record in output by only extracting desired data
    
    #stop accepting new tasks; all results have already been collected above
    pool.close() 
    
    print('key: [lat, long, conf, surf_temp]')
    return output

#### Solution

In [22]:
parallel_search_range(fd, [65, 100], 3, [30, 60, 120])

key: [lat, long, conf, surf_temp]


[[-35.932600000000001, 141.958, 78, 66],
 [-35.194899999999997, 141.06219999999999, 90, 66],
 [-37.260899999999999, 141.79990000000001, 89, 66],
 [-35.807099999999998, 142.73179999999999, 89, 66],
 [-38.042299999999997, 146.40479999999999, 84, 66],
 [-37.8185, 142.5609, 90, 66],
 [-37.484299999999998, 143.0592, 89, 66],
 [-36.1462, 145.20959999999999, 90, 66],
 [-36.6828, 144.78399999999999, 90, 66],
 [-36.886699999999998, 142.18729999999999, 90, 66],
 [-37.6267, 142.99930000000001, 90, 66],
 [-36.539900000000003, 144.678, 90, 66],
 [-36.155900000000003, 141.81020000000001, 80, 66],
 [-36.716500000000003, 142.6155, 90, 66],
 [-36.511600000000001, 144.56200000000001, 89, 66],
 [-37.796599999999998, 142.31960000000001, 76, 66],
 [-36.502299999999998, 143.60380000000001, 85, 66],
 [-36.638399999999997, 142.4075, 90, 66],
 [-36.703200000000002, 144.0992, 90, 66],
 [-37.783499999999997, 148.41380000000001, 90, 66],
 [-37.444000000000003, 148.101, 73, 66],
 [-36.110999999999997, 145.124, 85,

