# Big Data Platform
## Assignment 3: Serverless

**By:**  

John Doe, 300123123  
Jane Doe, 200123123

<br><br>

**The goal of this assignment is to:**
- Understand and practice the details of Serverless

**Instructions:**
- Students will form teams of two people each, and submit a single homework for each team.
- The same score for the homework will be given to each member of your team.
- Your solution is in the form of a Jupyter notebook file (with extension ipynb).
- Images/Graphs/Tables should be submitted inside the notebook.
- The notebook should be runnable and properly documented. 
- Please answer all the questions and include all your code.
- You are expected to submit clear, Pythonic code.
- You may change function signatures/definitions.

**Submission:**
- Submission of the homework will be done via Moodle by uploading (not Zip):
    - Jupyter Notebook
    - 2 Log files
    - Additional local scripts
- The homework needs to be entirely in English.
- The deadline for submission is on Moodle.
- Late submission won't be allowed.

- In case of identical code submissions, both groups will receive a zero.
- Some groups might be selected randomly to present their code.

**Requirements:**  
- Python 3.6 should be used.  
- You should implement the algorithms yourself, using only basic Python libraries (such as numpy, pandas, etc.).

<br><br><br><br>

**Grading:**
- Q0 - 10 points - Setup
- Q1 - 40 points - Serverless MapReduceEngine
- Q2 - 20 points - MapReduce job to calculate inverted index
- Q3 - 30 points - Shuffle

`Total: 100`

<br><br>

# Question 0
## Setup

1. Navigate to IBM Cloud and open a trial account (no credit card is required)
2. Choose IBM Cloud Object Storage service from the catalog
3. Create a new bucket in IBM Cloud Object Storage
4. Create credentials for the bucket with HMAC (access key and secret key)
5. Choose IBM Cloud Functions service from the catalog and create a service


#### Lithops setup
1. Using git, install the master branch of the Lithops project from https://github.com/lithops-cloud/lithops
2. Follow the Lithops documentation and configure Lithops against IBM Cloud Functions and IBM Cloud Object Storage
3. Set the Lithops log level to DEBUG
4. Run the Hello World example using the Futures API and verify that everything works properly


#### IBM Cloud Object Storage setup
1. Upload all the input CSV files that you used in homework 2 into the bucket you created in IBM Cloud Object Storage


<br><br><br>

In [1]:
import sqlite3
import traceback
from pathlib import Path
from typing import List
import numpy as np
import pandas as pd

In [2]:
import lithops

config = {'lithops': {'backend': 'ibm_cf',
                      "storage": 'ibm_cos',
                      "log_level": "DEBUG"},

          'ibm_cf':  {'endpoint': 'https://eu-gb.functions.cloud.ibm.com',
                      'namespace': 'eyal.michaeli@post.idc.ac.il_dev',
                      'api_key': '8fb50be4-51a4-49e8-87d1-32fb9294fb3d:ohx288bRYbo4gl8A4dT1uO2Tyl5ReFZsqMFmmjaYk6KLgk3ixhpHwwktaRY7Axs2'},

          'ibm_cos': {'storage_bucket': 'cloud-object-storage-oh-cos-standard-o3j',
                      'region': 'eu-de',
                      'access_key': '58f33d9962c241fdb8bcfcb8b4f17339',
                      "secret_key": "628211522fb519c62c0dd6debd505d103662d21dcb64c5e3"}}

BUCKET = config['ibm_cos']['storage_bucket']  # for later use

def hello_world(name):
    return 'Hello {}!'.format(name)

fexec = lithops.FunctionExecutor(config=config)
fexec.call_async(hello_world, 'World')
print(fexec.get_result())


2022-01-07 10:42:47,706 [INFO] lithops.config -- Lithops v2.5.8
2022-01-07 10:42:47,708 [DEBUG] lithops.config -- Loading Serverless backend module: ibm_cf
2022-01-07 10:42:47,915 [DEBUG] lithops.config -- Loading Storage backend module: ibm_cos
2022-01-07 10:42:47,994 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Creating IBM COS client
2022-01-07 10:42:47,996 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2022-01-07 10:42:47,999 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Using access_key and secret_key
2022-01-07 10:42:48,249 [INFO] lithops.storage.backends.ibm_cos.ibm_cos -- IBM COS client created - Region: eu-de
2022-01-07 10:42:48,252 [DEBUG] lithops.serverless.backends.ibm_cf.ibm_cf -- Creating IBM Cloud Functions client
2022-01-07 10:42:48,253 [DEBUG] lithops.serverless.backends.ibm_cf.ibm_cf -- Set IBM CF Namespace to eyal.michaeli@post.idc.ac.il_dev
2022-01-07 10:42:48,255 [DEBUG]

Hello World!


# Question 1
## Serverless MapReduceEngine

Modify MapReduceEngine from homework 2 into MapReduceServerlessEngine, in which map and reduce tasks are executed as serverless actions instead of local threads. In particular:
1. Deploy all map tasks as serverless actions by using Lithops against IBM Cloud Functions.
2. Collect the results from all map tasks, store them in the same SQLite database you used in MapReduceEngine, and use the same code for the sort and shuffle phase.
3. Deploy the reduce tasks by using Lithops against IBM Cloud Functions. Instead of persisting the results of the reduce tasks, return them to MapReduceServerlessEngine and proceed with the same workflow as in MapReduceEngine.
4. Return the results of the reduce tasks to the user.
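The flow in the steps above (map, collect into SQLite, shuffle and sort, reduce) can be sketched locally without Lithops. This is only an illustrative sketch: `local_map_reduce`, `toy_map`, and `toy_reduce` are hypothetical names, the documents are inlined strings rather than CSV files, and the map and reduce calls run in-process where the real engine would invoke serverless actions.

```python
import sqlite3
from typing import Callable, List, Tuple


def local_map_reduce(inputs: List[str],
                     map_fn: Callable[[str], List[Tuple[str, str]]],
                     reduce_fn: Callable[[str, str], list]) -> list:
    """Run map -> SQLite shuffle -> reduce entirely in-process (no Lithops)."""
    # Map stage: in the real engine each call would be a serverless action.
    pairs = [pair for item in inputs for pair in map_fn(item)]

    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE temp_results (key, value)")
        conn.executemany("INSERT INTO temp_results VALUES (?, ?)", pairs)
        # Shuffle and sort: one row per key, values joined with commas.
        grouped = conn.execute(
            "SELECT key, GROUP_CONCAT(value) FROM temp_results "
            "GROUP BY key ORDER BY key").fetchall()
    finally:
        conn.close()

    # Reduce stage: one call per key (a serverless action in the real engine).
    return [reduce_fn(key, values) for key, values in grouped]


def toy_map(doc: str) -> List[Tuple[str, str]]:
    # Hypothetical stand-in for inverted_map: input looks like "name:word word".
    name, words = doc.split(":")
    return [(word, name) for word in words.split()]


def toy_reduce(key: str, documents: str) -> list:
    # Deduplicate and sort the comma-separated document names.
    return [key, sorted(set(documents.split(",")))]


result = local_map_reduce(["a.csv:x y", "b.csv:y z"], toy_map, toy_reduce)
print(result)
# [['x', ['a.csv']], ['y', ['a.csv', 'b.csv']], ['z', ['b.csv']]]
```

Swapping the two list comprehensions for `fexec.map(...)` followed by `fexec.get_result()` is the only structural change the serverless version needs; the SQLite shuffle stays on the client.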

**Please attach:**  
A text file with all log messages Lithops printed to the console during the execution. Make sure the log level is set to DEBUG.

#### Code:

In [3]:
# insert your path here (it is only used for the database):
my_path = "/Users/eyalmichaeli/Desktop/School/Master's/IDC_masters/big_data_platforms_ex2"

path = Path(my_path)

In [4]:
# Create the database "temp_results.db", then close it.
conn = None
cursor = None
try:
    conn = sqlite3.connect(str(path / "temp_results.db"))
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS temp_results (key, value);")

except Exception:
    traceback.print_exc()

finally:
    # cursor and conn stay None if connect() failed, so guard both handles
    if cursor:
        cursor.close()
    if conn:
        conn.close()

In [5]:
class MapReduceServerlessEngine:
    """
    A class that implements MapReduce. Gets an SQLite connection in its __init__.
    Its execute method runs the given map and reduce functions (here: inverted_map
    and inverted_reduce) as serverless actions through Lithops.
    """

    def __init__(self, conn):
        self.conn = conn

    def execute(self, input_data: List[str], map_function, reduce_function, params: dict, print_file_name=False):
        try:
            # map part: one serverless action per input document
            fexec = lithops.FunctionExecutor(config=config)
            map_arguments = [(document, params['column_index']) for document in input_data]
            fexec.map(map_function, map_arguments)
            # flatten the per-document lists into one list of (key, value) pairs,
            # so the code does not depend on a fixed number of files or rows
            map_returns = [pair for pairs in fexec.get_result() for pair in pairs]
            pd.DataFrame(map_returns, columns=['key', 'value']).to_sql(name='temp_results', con=self.conn, if_exists='replace', index=False)

            # shuffle and sort part
            results_df = pd.read_sql_query("SELECT key, GROUP_CONCAT(value) as value "
                                           "FROM temp_results "
                                           "GROUP BY key "
                                           "ORDER BY key",
                                           self.conn)

            # reduce part
            if print_file_name:
                # just for debug purposes: pass print_file_name=True to every reduce task
                flags = [True] * len(results_df)
                reduce_arguments = list(zip(results_df["key"].values, results_df["value"].values, flags))
            else:
                reduce_arguments = list(zip(results_df["key"].values, results_df["value"].values))

            fexec.map(reduce_function, reduce_arguments)
            reduce_returns = np.array(fexec.get_result())

        except Exception:
            print("MapReduce failed")
            traceback.print_exc()
            return

        finally:
            # close the connection to the db (runs on success and on failure)
            if self.conn:
                self.conn.close()

        print('MapReduce Completed')
        return reduce_returns


In [6]:
from lithops import Storage
from io import StringIO

def inverted_map(document_name: str, column_index: int) -> List[tuple]:
    """
    Reads the CSV document from IBM Cloud Object Storage and returns a list of entries of the form (key_value, document name) for the given column_index.
    :param document_name: CSV file name (only its last path component is used as the object key).
    :param column_index: column index in the CSV file (note: starting from 1)
    :return: List[tuple] where each tuple contains 2 strings
    """
    obj_path = Path(document_name)
    csv_file = obj_path.name

    # access IBM storage
    storage = Storage(config=config)

    # convert to string, read to a df
    data = StringIO(storage.get_object(BUCKET, csv_file).decode('utf-8'))
    df = pd.read_csv(data, sep=",")

    # return a list of tuples of key-value pairs
    col_series = df[df.columns[column_index-1]]
    document_path_list = [document_name] * len(df)
    return list(zip(col_series.values, document_path_list))


In [7]:
def inverted_reduce(key: str, documents: str, print_file_name: bool = False) -> List[str]:
    """
    reduce function
    :param key: key value (for example: if the column is 'first_name' it could be 'Albert').
    :param documents: a comma-separated string of all CSV documents per given key.
    :param print_file_name: set to True to make the reduce function print the file names (more readable than the full paths); useful for debugging.
    :return: List: [key, documents_formatted_unique]
    """
    # join() adds no trailing separator, so nothing has to be trimmed from the end
    documents_formatted_unique = ', '.join(set(documents.replace(', ', ',').split(',')))

    if print_file_name:
        list_of_file_names = [Path(obj_path).name for obj_path in documents_formatted_unique.split(', ')]
        total_files = len(list_of_file_names)
        file_names = '\n'.join(sorted(list_of_file_names))
        print(f"This key: '{key}' has appeared in {total_files} files.\n"
              f"it's in the following files: \n{file_names}\n\n")

    return [key, documents_formatted_unique]

# Question 2
## Submit MapReduce job to calculate inverted index
1. Use input_data: `cos://bucket/<path to CSV data>`
2. Submit a MapReduce job with the same map and reduce functions you used in homework 2, as follows:

    `mapreduce = MapReduceServerlessEngine()`  
    `results = mapreduce.execute(input_data, inverted_map, inverted_index)`   
    `print(results)`
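A map function that receives a `cos://` URI has to split it into the pieces the storage client needs. The sketch below assumes the `cos://bucket/<path to CSV data>` layout from step 1; `split_cos_uri` is a hypothetical helper and not part of Lithops. If a URI carries a region as its first component, that region will come back in the first slot instead of the bucket.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_cos_uri(uri: str) -> Tuple[str, str]:
    """Split 'cos://bucket/path/to/file.csv' into (bucket, object key)."""
    parts = urlsplit(uri)
    if parts.scheme != "cos" or not parts.netloc:
        raise ValueError(f"not a cos:// URI: {uri!r}")
    # netloc is the first component after 'cos://'; the rest is the object key.
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_cos_uri("cos://my-bucket/data/myCSV1.csv")
print(bucket, key)
# my-bucket data/myCSV1.csv
```

The returned pair matches the `(bucket, key)` argument order that `storage.get_object` is called with in `inverted_map`.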

**Please attach:**  
A text file with all log messages Lithops printed to the console during the execution. Make sure the log level is set to DEBUG.

#### Code:

In [8]:
input_data = [f'cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV{i}.csv' for i in range(1, 21)]

In [9]:
conn = sqlite3.connect(str(path / "temp_results.db"))
mapreduce = MapReduceServerlessEngine(conn=conn)
status = mapreduce.execute(input_data,
                           inverted_map,
                           inverted_reduce,
                           params={'column_index': 1},
                           print_file_name=True) # assign print_file_name True if you want the reduce function to print the file names (it's more readable than CSV), and for debugging purposes
print(status)

2022-01-07 10:42:51,972 [INFO] lithops.config -- Lithops v2.5.8
2022-01-07 10:42:51,975 [DEBUG] lithops.config -- Loading Serverless backend module: ibm_cf
2022-01-07 10:42:51,977 [DEBUG] lithops.config -- Loading Storage backend module: ibm_cos
2022-01-07 10:42:51,981 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Creating IBM COS client
2022-01-07 10:42:51,982 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2022-01-07 10:42:51,984 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Using access_key and secret_key
2022-01-07 10:42:51,999 [INFO] lithops.storage.backends.ibm_cos.ibm_cos -- IBM COS client created - Region: eu-de
2022-01-07 10:42:52,003 [DEBUG] lithops.serverless.backends.ibm_cf.ibm_cf -- Creating IBM Cloud Functions client
2022-01-07 10:42:52,006 [DEBUG] lithops.serverless.backends.ibm_cf.ibm_cf -- Set IBM CF Namespace to eyal.michaeli@post.idc.ac.il_dev
2022-01-07 10:42:52,022 [DEBUG]

MapReduce Completed
[['Albert'
  'cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV11.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV16.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV6.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV18.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV14.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV5.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV12.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV15.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV10.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV7.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV9.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV19.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV4.csv, cos://eu-de/cloud-object-storage-oh-cos-standard-o3j/myCSV20.csv, cos://eu-de/cloud-object-storage-oh-cos-standar

In [10]:
# close db connection
if conn:
    conn.close()

# Question 3
## Shuffle

MapReduceServerlessEngine deploys both map and reduce tasks as serverless invocations.
However, once the map stage completes, the results are transferred from the map tasks to the SQLite database located on the client machine (a laptop in your case). The shuffle is then performed locally, and the reduce tasks are invoked with the relevant parameters.
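To get a feel for how much intermediate data this design pulls back to the client, one can measure the serialized size of the map outputs. The sketch below is a rough estimate only: it assumes pickled size approximates what crosses the network, and both `intermediate_bytes` and the toy workload (20 documents with 10 pairs each) are hypothetical.

```python
import pickle
from typing import List, Tuple


def intermediate_bytes(map_outputs: List[List[Tuple[str, str]]]) -> int:
    """Total pickled size of all map outputs the client would have to download."""
    return sum(len(pickle.dumps(output)) for output in map_outputs)


# Hypothetical workload: 20 documents, 10 (key, document) pairs each.
toy_outputs = [[(f"name{row}", f"cos://bucket/myCSV{doc}.csv") for row in range(10)]
               for doc in range(1, 21)]
total = intermediate_bytes(toy_outputs)
print(f"{total} bytes reach the client before the shuffle can start")
```

Because each map output is counted separately, the total grows linearly with the number of documents, and all of it has to arrive at a single machine.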

(To support your answers, feel free to use examples, Images, etc.)
<br><br>

**1. Explain why this approach is not efficient, and what the pros and cons of such an architecture are in general. In a broader scope, you may assume that MapReduceServerlessEngine is executed on a powerful machine and not just a laptop.**

\<your answer here>

<br><br>
**2. Suggest how you can improve the shuffle so that intermediate data is not downloaded to the client at all and the shuffle is also performed in the cloud. Explain the pros and cons of the approaches you suggest.**


\<your answer here>

<br><br>
**3. Can you make the shuffle serverless?**


\<your answer here>

<br><br><br><br>
Good Luck :) 