# Big Data Platform
## Assignment 2: MapReduce

**By:**  

John Doe, 300123123  
Jane Doe, 200123123

<br><br>

**The goal of this assignment is to:**
- Understand and practice the details of MapReduceEngine

**Instructions:**
- Students will form teams of two people each, and submit a single homework for each team.
- The same score for the homework will be given to each member of your team.
- Your solution is in the form of a Jupyter notebook file (with extension ipynb).
- Images/Graphs/Tables should be submitted inside the notebook.
- The notebook should be runnable and properly documented. 
- Please answer all the questions and include all your code.
- You are expected to submit a clear and pythonic code.
- You can change functions signatures/definitions.

**Submission:**
- Submission of the homework will be done via Moodle by uploading a Jupyter notebook.
- The homework needs to be entirely in English.
- The deadline for submission is on Moodle.
- Late submission won't be allowed.
  
  
- In case of identical code submissions - both groups will get a Zero. 
- Some groups might be selected randomly to present their code.

**Requirements:**  
- Python 3.6 should be used.  
- You should implement the algorithms by yourself using only basic Python libraries (such as numpy,pandas,etc.)

<br><br><br><br>

**Grading:**
- Q1 - 5 points - Initial Steps
- Q2 - 50 points - MapReduceEngine
- Q3 - 30 points - Implement the MapReduce Inverted index of the JSON documents
- Q4 - 5 points - Testing Your MapReduce
- Q5 - 10 points - Final Thoughts 

`Total: 100`

**Prerequisites**

In [2]:
# example
import itertools
import sqlite3
import traceback
#!pip install --quiet zipfile36

**Imports**

In [83]:
# general
import os
import time
import random
import warnings
import threading # you can use easier threading packages
from threading import Thread
import concurrent
from pathlib import Path
from multiprocessing import Queue
# ml
import numpy as np
import scipy as sp
import pandas as pd
import logging

# visual
import seaborn as sns
import matplotlib.pyplot as plt

# notebook
from IPython.display import display

**Hide Warnings**

In [84]:
warnings.filterwarnings('ignore')

**Disable Autoscrolling**

In [85]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

**Set Random Seed**

In [86]:
random.seed(123)

<br><br><br><br>
# Question 1
# Initial Steps

Write Python code to create 20 different CSV files in this format:  `myCSV[Number].csv`, where each file contains 10 records. 

The schema is `(‘firstname’,’secondname’,city’)`  

Values should be randomly chosen from the lists: 
- `firstname` : `[John, Dana, Scott, Marc, Steven, Michael, Albert, Johanna]`  
- `city` : `[New York, Haifa, München, London, Palo Alto,  Tel Aviv, Kiel, Hamburg]`  
- `secondname`: any value  

In [68]:
my_path = "/Users/eyalmichaeli/Desktop/School/Master's"
# my_path = "/Users/mymac"
exercise_path = "IDC_masters/big_data_platforms_ex2"

path = Path(my_path) / Path(exercise_path)

In [69]:
firstname = ['John', 'Dana', 'Scott', 'Marc', 'Steven', 'Michael', 'Albert', 'Johanna']
city = ['NewYork', 'Haifa', 'Munchen', 'London', 'PaloAlto', 'TelAviv', 'Kiel', 'Hamburg']
secondname = ['Lennon', 'McCartney', 'Starr', 'Harrison', 'Ono', 'Sutcliffe', 'Epstein', 'Preston']

csvs_path = Path(path / "csvs")
csvs_path.mkdir(parents=True, exist_ok=True)
for i in range(1, 21):
    temp_df = pd.DataFrame({"firstname": np.random.choice(firstname, 10),
                            "secondname": np.random.choice(secondname, 10),
                            "city": np.random.choice(city, 10),
                            })
    temp_df.to_csv(str(csvs_path / f"myCSV[{i}].csv"))

print("Created 20 CSV files")

Created 20 CSV files


Use python to Create `mapreducetemp` and `mapreducefinal` folders

In [70]:
mapreducetemp_folder = Path(path / "mapreducetemp")
mapreducetemp_folder.mkdir(parents=True, exist_ok=True)

mapreducefinal_folder = Path(path / "mapreducefinal")
mapreducefinal_folder.mkdir(parents=True, exist_ok=True)

print("Created folders")

Created folders


<br><br><br>
# Question 2
## MapReduceEngine

Write Python code to create an SQLite database with the following table

`TableName: temp_results`   
`schema: (key:TEXT,value:TEXT)`

In [90]:

conn = None
try:
    conn = sqlite3.connect(str(path / "db.db"))
    cursor = conn.cursor()
    cursor.execute("""CREATE TABLE temp_results (
                   key TEXT
                   value TEXT
                   );""")

except Exception as e:
    traceback.print_exc()

# finally:
#     cursor.close()
#     if conn:
#         conn.close()

Traceback (most recent call last):
  File "<ipython-input-90-c0c9da703476>", line 5, in <module>
    cursor.execute("""CREATE TABLE temp_results (
sqlite3.DatabaseError: file is not a database


1. **Create a Python class** `MapReduceEngine` with method `def execute(input_data, map_function, reduce_function)`, such that:
    - `input_data`: is an array of elements
    - `map_function`: is a pointer to the Python function that returns a list where each entry of the form (key,value) 
    - `reduce_function`: is pointer to the Python function that returns a list where each entry of the form (key,value)

<br><br>

**Implement** the following functionality in the `execute(...)` function:

<br>

1. For each key  from the  input_data, start a new Python thread that executes map_function(key) 
<br><br>
2. Each thread will store results of the map_function into mapreducetemp/part-tmp-X.csv where X is a unique number per each thread. 
<br><br>
3. Keep the list of all threads and check whether they are completed.
<br><br>
4. Once all threads completed, load content of all CSV files into the temp_results table in SQLite.

    Remark: Easiest way to loop over all CSV files and load them into Pandas first, then load into SQLite  
    `data = pd.read_csv(path to csv)`  
    `data.to_sql(‘temp_results’,sql_conn, if_exists=’append’,index=False)`
<br><br>

5. **Write SQL statement** that generates a sorted list by key of the form `(key, value)` where value is concatenation of ALL values in the value column that match specific key. For example, if table has records
<table>
    <tbody>
            <tr>
                <td style="text-align:center">John</td>
                <td style="text-align:center">myCSV1.csv</td>
            </tr>
            <tr>
                <td style="text-align:center">Dana</td>
                <td style="text-align:center">myCSV5.csv</td>
            </tr>
            <tr>
                <td style="text-align:center">John</td>
                <td style="text-align:center">myCSV7.csv</td>
            </tr>
    </tbody>
</table>

    Then SQL statement will return `(‘John’,’myCSV1.csv, myCSV7.csv’)`  
    Remark: use GROUP_CONCAT and also GROUP BY ORDER BY
<br><br><br>
6. **Start a new thread** for each value from the generated list in the previous step, to execute `reduce_function(key,value)` 
<br>    
7. Each thread will store results of reduce_function into `mapreducefinal/part-X-final.csv` file  
<br>
8. Keep list of all threads and check whether they are completed  
<br>
9. Once all threads completed, print on the screen `MapReduce Completed` otherwise print `MapReduce Failed` 



In [87]:
# implement all of the class here
def map_function(key):
    logging.info("Thread %s: starting", key)
    logging.info("Thread %s: finishing", key)
    return key


class MapReduceEngine:

    def __init__(self, conn):
        self.conn = conn

    def execute(self, input_data, map_function, reduce_function):
        thread_list = list()
        csvs_paths = list()
        for csv_key in input_data:
            temp_df = pd.read_csv(csv_key)

            exec = concurrent.futures.ThreadPoolExecutor()

            csv_index = input_data.index(csv_key)  # an index of the relative csv in the input_array
            t = exec.submit(map_function, temp_df, csv_index)
            threads_returns = t.result()
            csv_path = f'{mapreducetemp_folder}/part-tmp-{csv_index}.csv'
            csvs_paths.append(csv_path)
            pd.DataFrame(threads_returns).to_csv(csv_path,
                                                 header=['key', 'value'],
                                                 index=False)
            thread_list.append(t)


        # wait until the threads are completed
        exec.shutdown(wait=True)

        # Once all threads completed, load content of all CSV files into the temp_results table in Sqlite
        for path_to_csv in csvs_paths:
            data = pd.read_csv(path_to_csv)
            data.to_sql('temp_results', self.conn, if_exists='append', index=False)



In [88]:
def inverted_map(key, index):
    tuple_df = key['firstname'] # take only firstname as a key
    tuple_df = tuple_df.reset_index()
    tuple_df['value'] = ['myCSV'+str(index)+'.csv'] * len(key['firstname']) # create filname for each key

    tuple_df = tuple_df[["firstname", "value"]] # slice the dataframe into firstname and value only
    tuple_df = tuple_df.set_index('firstname') # set index to firstname in order to generate key-value list

    tuple_list = tuple_df.reset_index().T.reset_index().T.values.tolist() # make a list of (name,value) list
    tuple_list = tuple_list[1::] # cut 'row' and 'column' headline
    tuple_list = list(map(tuple,tuple_list)) # convert list of lists into list of tuples

    return tuple_list

In [89]:
engine = MapReduceEngine(conn=conn)

input_data = []
for i in range(1, 21):   # generate input data as file names from 0 to 19
    csv_file = 'myCSV['+ str(i)+'].csv'
    input_data.append(str(csvs_path / f"myCSV[{i}].csv"))

mapreduce = MapReduceEngine(conn=conn)
status = mapreduce.execute(input_data, inverted_map, inverted_map)
print(status)

ProgrammingError: Cannot operate on a closed database.

In [61]:
# implement all of the class here
def map_function(key):
    logging.info("Thread %s: starting", key)
    logging.info("Thread %s: finishing", key)
    return key


class MapReduceEngine():

    def __init__(self, conn):
        self.conn = conn

    def execute(self, input_data, map_function, reduce_function):

        thread_list, qs = [], []
        for num, key in enumerate(input_data):
            logging.info("Main    : before creating thread")
            que = Queue()
            qs.append(que)
            t = Thread(target=lambda q, arg1: q.put(map_function(arg1)), args=(que, key))
            logging.info("Main    : before running thread")
            t.start()
            thread_list.append(t)

        for thread in thread_list:
            logging.info("Main    : wait for the thread to finish")
            thread.join()

        csvs_paths = list()
        for que in qs:
            while not que.empty():
                result = que.get()
                print(result)
                csv_path = f'{mapreducetemp_folder}/part-tmp-{num}.csv'
                #result.to_csv(csv_path)
                csvs_paths.append(csv_path)

        # Once all threads completed, load content of all CSV files into the temp_results table in Sqlite
        for path_to_csv in csvs_paths:
            data = pd.read_csv(path_to_csv)
            data.to_sql('temp_results', self.conn, if_exists='append', index=False)

        logging.info("Main    : all done")



In [64]:
MapReduceEngine.execute(list(range(50)), MapReduceEngine.map_function, MapReduceEngine.map_function)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49


In [38]:
import logging
import threading
import time

def thread_function(name):
    logging.info("Thread %s: starting", name)
    print(name)
    time.sleep(2.5)
    print(name + 3)
    logging.info("Thread %s: finishing", name)
    return name

if __name__ == "__main__":
    format = "%(asctime)s: %(message)s"
    logging.basicConfig(format=format, level=logging.INFO,
                        datefmt="%H:%M:%S")

    logging.info("Main    : before creating thread")
    x = threading.Thread(target=thread_function, args=(1,))
    logging.info("Main    : before running thread")
    
    
    x.start()
    logging.info("Main    : wait for the thread to finish")
    x.join()
#     print(num)
    logging.info("Main    : all done")

17:54:21: Main    : before creating thread
17:54:21: Main    : before running thread
17:54:21: Thread 1: starting
17:54:21: Main    : wait for the thread to finish


1


17:54:23: Thread 1: finishing
17:54:23: Main    : all done


4


In [56]:
    def execute(input_data):#, map_function, reduce_function):
        thread_list, qs = [], []
        for num, key in enumerate(input_data):
            logging.info("Main    : before creating thread")
            que = Queue()
            qs.append(que)
            t = threading.Thread(target=lambda q, arg1: q.put(map_function(arg1)), args=(que, key 
                                                                                        ))
            logging.info("Main    : before running thread")
            t.start()
            thread_list.append(t)
        for thread in thread_list:
            logging.info("Main    : wait for the thread to finish")
            thread.join()
        for que in qs:
            while not que.empty():
                result = que.get()
#                 result.to_csv(f'{mapreducetemp_folder}/part-tmp-{num}.csv')
                print(result)

    logging.info("Main    : all done")
            

            
    def map_function(key):
        logging.info("Thread %s: starting", key)
        logging.info("Thread %s: finishing", key)
        return key + 2
        
    execute([1,2,3,4,5,6])

18:46:15: Main    : all done
18:46:15: Main    : before creating thread
18:46:15: Main    : before running thread
Exception in thread 18:46:15: Main    : before creating thread
Thread-48:
Traceback (most recent call last):
  File "/Users/mymac/opt/anaconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    18:46:15: Main    : before running thread
self.run()
  File "/Users/mymac/opt/anaconda3/lib/python3.8/threading.py", line 870, in run
Exception in thread Thread-49:
Traceback (most recent call last):
  File "/Users/mymac/opt/anaconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
18:46:15: Main    : before creating thread
        self._target(*self._args, **self._kwargs)
  File "<ipython-input-56-11a5a5aa93c6>", line 7, in <lambda>
18:46:15: Main    : before running thread
self.run()
  File "/Users/mymac/opt/anaconda3/lib/python3.8/threading.py", line 870, in run
Exception in thread 18:46:15: Main    : before creating thread
Thread-50:
Traceback (most recent

In [32]:
x

<Thread(Thread-12, stopped 123145410392064)>

<br><br><br><br>

# Question 3
## Implement the MapReduce Inverted index of the JSON documents

Implement a function `inverted_map(document_name)` which reads the CSV document from the local disc and return a list that contains entries of the form (key_value, document name).

For example, if myCSV4.csv document has values like:  
`{‘firstname’:’John’,‘secondname’:’Rambo’,‘city’:’Palo Alto’}`

Then `inverted_map(‘myCSV4.csv’)` function will return a list:  
`[(‘firstname_John’,’ myCSV4.csv’),(‘secondname_Rambo’,’ myCSV4.csv’), (‘city_Palo Alto’,’ myCSV4.csv’)]`

In [None]:
def inverted_map(document_name):
    pass

Write a reduce function `inverted_reduce(value, documents)`, where the field “documents” contains a list of all CSV documents per given value.   
This list might have duplicates.   
Reduce function will return new list without duplicates.

For example,  
calling the function `inverted_reduce(‘firstname_Albert’,’myCSV2.csv, myCSV5.csv,myCSV2.csv’)`   
will return a list `[‘firstname_Albert’,’myCSV2.csv, myCSV5.csv,myCSV2.csv’]`

In [None]:
def inverted_reduce(value, documents):
    pass

<br><br><br><br>
# Question 4
## Testing Your MapReduce

**Create Python list** `input_data` : `[‘myCSV1.csv’,.. ,‘myCSV20.csv’]`

In [None]:
input_data = None

**Submit MapReduce as follows:**

In [None]:
mapreduce = MapReduceEngine()
status = mapreduce.execute(input_data, inverted_map, inverted_reduce)
print(status)

Make sure that `MapReduce Completed` should be printed and `mapreducefinal` folder should contain the result files.

**Use python to delete all temporary data from mapreducetemp folder and delete SQLite database:**

In [8]:
pass

<br><br><br><br>

# Question 5
# Final Thoughts

The phase where `MapReduceEngine` reads all temporary files generated by maps and sort them to provide each reducer a specific key is called the **shuffle step**.

Please explain **clearly** what would be the main problem of MapReduce when processing Big Data, if there is no shuffle step at all, meaning reducers will directly read responses from the mappers.

            If you say "I dont know" you will get 2 points :)

<br><br><br><br>
Good Luck :)