A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and CheckPoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 1. MapReduce

In this problem, we will use Hadoop Streaming to execute MapReduce code written in Python.

In [None]:
import os
from nose.tools import assert_equal, assert_true

We use the same `iris.csv` file we first encountered in [Week 12 Problem 2](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week12/assignments/Problem_2_Hadoop_File_System.ipynb). This is the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set). As shown in the following code cell, this CSV file has 5 columns: `SepalLength`, `SepalWidth`, `PetalLength`, `PetalWidth`, and `Name`. The first 4 columns are floating point values, and the remaining column is a string column.

In [None]:
!head /home/data_scientist/data/iris.csv

The goal of this problem is to use MapReduce to find the maximum value in the **`SepalLength`** column. **We ignore all other columns for simplicity.**
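
As a rough illustration of the two phases (not the Hadoop Streaming scripts themselves), the sketch below runs a map step and a reduce step over a small in-memory sample. The names `sample_csv`, `map_phase`, and `reduce_phase` are hypothetical and exist only for this example, and the sample rows are a hand-picked stand-in for `iris.csv`.

```python
import csv
import io

# Hypothetical sample in the same layout as iris.csv (header + data rows).
sample_csv = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

def map_phase(lines, column="SepalLength"):
    """Map step: emit one value per record, keeping only the chosen column."""
    reader = csv.DictReader(lines)  # consumes the header line for us
    for record in reader:
        yield record[column]

def reduce_phase(values):
    """Reduce step: collapse all emitted values into a single maximum."""
    return max(float(v) for v in values)

mapped = list(map_phase(io.StringIO(sample_csv)))
result = reduce_phase(mapped)
print(mapped)   # ['5.1', '4.9', '7.0']
print(result)   # 7.0
```

In the assignment the two phases live in separate scripts that communicate through standard input and output, which is what lets Hadoop Streaming run them on different nodes.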

## Write a Mapper script in Python.

- Write a Python script that
  - Reads data from standard input (`STDIN`),
  - Skips the first line (The first line of `iris.csv` is the header that has the column titles.), and
  - Outputs to standard output (`STDOUT`) the `SepalLength` column.
  
Hints:
- We read data from standard input, so your code should probably have something like `with sys.stdin as fin:`. We also output the results to standard output, so it should also have something like `with sys.stdout as fout:`. So, your code will probably look something like this:
  ```python
  with sys.stdin as fin:
      with sys.stdout as fout:
          # the rest is pseudocode
          for each line in fin:
              if header:
                  continue
              else:
                  strip line, split line => a, b, c, d, e
                  write a to fout
  ```
- We need to ignore the first row. There are many ways to do this, but I think the easiest way is to
  - Use `enumerate` when iterating through each line, e.g. `for i, row in enumerate(fin):`,
  - Check if the iteration index `i` is 0,
  - If the index is 0, use `continue` to skip to the next iteration in the loop, and
  - If the index is greater than 0, split the line by commas and write the first field to standard output.
  
- In the end,
  ```python
  >>> ! ./mapper.py < /home/data_scientist/data/iris.csv
  ```
  should give
  ```
  5.1
  4.9
  4.7
  4.6
  5.0
  5.4
  4.6
  5.0
  4.4
  4.9
  ...
  ```
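
The `enumerate`-based header skip described in the hints can be tried without a pipe by substituting an in-memory stream for `sys.stdin`. This is a minimal sketch on made-up data; `fake_stdin` and `first_fields` are hypothetical names, and the two-column input is deliberately not the iris file.

```python
import io

# Stand-in for sys.stdin so the loop can be tried without a pipe.
fake_stdin = io.StringIO("name,score\nalice,3\nbob,5\n")

first_fields = []
for i, row in enumerate(fake_stdin):
    if i == 0:
        continue  # index 0 is the header line
    fields = row.strip().split(",")  # drop the newline, then split on commas
    first_fields.append(fields[0])

print(first_fields)  # ['alice', 'bob']
```

Calling `strip()` before `split(",")` matters mostly for the last field, which would otherwise keep its trailing newline.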

In [None]:
%%writefile mapper.py
#!/usr/bin/env python3

import sys

# YOUR CODE HERE

We need to make the file executable.

In [None]:
!chmod u+x mapper.py
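
For reference, the same permission change can be made from Python with the standard `os` and `stat` modules. The sketch below assumes a POSIX file system and works on a throwaway temporary file (a hypothetical stand-in, not `mapper.py`).

```python
import os
import stat
import tempfile

# Create a throwaway script so the example does not touch mapper.py.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("#!/usr/bin/env python3\nprint('hello')\n")
    script_path = tmp.name

# Add the user-execute bit to the existing mode (same effect as chmod u+x).
mode = os.stat(script_path).st_mode
os.chmod(script_path, mode | stat.S_IXUSR)

# Check that the user-execute bit is now set.
is_executable = bool(os.stat(script_path).st_mode & stat.S_IXUSR)
print(is_executable)

os.remove(script_path)  # clean up the temporary file
```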

Print out the results.

In [None]:
! ./mapper.py < /home/data_scientist/data/iris.csv

In [None]:
test0 = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa""".strip()

with open("test0.csv", "w") as f:
    f.write(test0)
    
test0_out = ! ./mapper.py < test0.csv

for i, row in enumerate(test0_out):
    answer_comma = test0.split("\n")[i + 1]
    answer = answer_comma.split(",")[0]
    assert_equal(row, answer)
    
test1 = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa""".strip()

with open("test1.csv", "w") as f:
    f.write(test1)
    
test1_out = ! ./mapper.py < test1.csv

for i, row in enumerate(test1_out):
    answer_comma = test1.split("\n")[i + 1]
    answer = answer_comma.split(",")[0]
    assert_equal(row, answer)

!rm test0.csv test1.csv

## Write a Reducer script in Python.

- Write a Python script that
  - Reads values from standard input (`STDIN`),
  - Finds the maximum value, and
  - Outputs to standard output (`STDOUT`) the maximum value.

Hints:

- We read data from standard input, so your code should probably have something like `with sys.stdin as fin:`. We also output the results to standard output, so it should also have something like `with sys.stdout as fout:`. So, your code will probably look something like this:
  ```python
  with sys.stdin as fin:
      with sys.stdout as fout:
          # the rest is pseudocode
          for each line in fin:
              strip line => x
              max_x = compare_max(x, max_x)
          # outside the for loop
          write max_x to fout
  ```
- You may want to use the provided `compare_max()` function. This function takes two values, converts the first value to a float if necessary, compares them, and returns the larger value. So, if you iterate through each row and use this function to update the maximum value, you will have the overall maximum at the end of the loop. For example,
  ```python
  >>> x = ['1', '4', '2', '3', '5', '1']
  >>> maximum = -1
  >>> for i in x:
  ...     maximum = compare_max(i, maximum)
  ...     print("current maximum:", maximum)
  >>> print("overall maximum:", maximum)
  ```
  ```
  current maximum: 1.0
  current maximum: 4.0
  current maximum: 4.0
  current maximum: 4.0
  current maximum: 5.0
  current maximum: 5.0
  overall maximum: 5.0
  ```
- In the end, 
  ```python
  >>> ! ./mapper.py < /home/data_scientist/data/iris.csv | ./reducer.py
  ```
  should give
  ```
  7.9
  ```
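
One detail worth noting is the starting value of the running maximum. The notebook starts from `-1`, which works here because sepal lengths are positive. A more general starting point is negative infinity, as in the sketch below; `running_max` is a hypothetical helper written for this illustration, not the provided `compare_max()`.

```python
import io

def running_max(lines, start=float("-inf")):
    """Return the largest float found in an iterable of text lines."""
    current = start
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        value = float(text)
        if value > current:
            current = value
    return current

positive = running_max(io.StringIO("1\n4\n2\n"))
negative = running_max(io.StringIO("-3\n-7\n"))
print(positive)  # 4.0
print(negative)  # -3.0
```

With a starting value of `-1`, the second input would incorrectly report `-1` as its maximum, which is why the sentinel only suits data known to be larger than it.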

In [None]:
%%writefile reducer.py
#!/usr/bin/env python3

import sys

def compare_max(current_value, current_max):
    """
    Compares two values and returns the larger of
    current_value and current_max.
    
    Parameters
    ----------
    current_value: str or float.
    current_max: float.
    
    Returns
    -------
    float
    """
    try:
        # if current_value is a string,
        # convert it to a float
        current_value = float(current_value)
    except ValueError:
        # if current_value cannot be converted to a float,
        # keep the existing maximum so later comparisons still work
        return current_max
    
    # if current_value > current_max, then current_value
    # should be the new max. Return current_value as the max value.
    if current_value > current_max:
        return current_value
    # if current_value <= current_max, then current_max
    # is still the max. Return current_max as the max value.
    else:
        return current_max
    
max_sepal_length = -1.

# YOUR CODE HERE

In [None]:
!chmod u+x reducer.py

In [None]:
! ./mapper.py < /home/data_scientist/data/iris.csv | ./reducer.py

In [None]:
test0 = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa""".strip()

with open("test0.csv", "w") as f:
    f.write(test0)
    
test0_out = ! ./mapper.py < test0.csv | ./reducer.py

assert_equal(test0_out, ['5.1'])    

test1 = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa""".strip()

with open("test1.csv", "w") as f:
    f.write(test1)
    
test1_out = ! ./mapper.py < test1.csv | ./reducer.py

assert_equal(test1_out, ['5.4'])

! rm test0.csv test1.csv

We have our mapper and reducer, so we are now ready to run Hadoop Streaming. Let's first clean up and start the namenode and datanodes.

In [None]:
! $HADOOP_PREFIX/sbin/stop-dfs.sh
! $HADOOP_PREFIX/sbin/stop-yarn.sh
! rm -rf /tmp/*
! echo "Y" | $HADOOP_PREFIX/bin/hdfs namenode -format 2> /dev/null
! $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
! $HADOOP_PREFIX/sbin/start-dfs.sh
! $HADOOP_PREFIX/sbin/start-yarn.sh
! $HADOOP_PREFIX/bin/hdfs dfsadmin -safemode leave

We need to copy the `iris.csv` file in our local environment to HDFS.

In [None]:
! $HADOOP_PREFIX/bin/hdfs dfs -mkdir -p wc/in
! $HADOOP_PREFIX/bin/hdfs dfs -put /home/data_scientist/data/iris.csv wc/in/iris.csv

We run `mapper.py` and `reducer.py` via Hadoop Streaming.

In [None]:
# Run Python code via Hadoop Streaming
!$HADOOP_PREFIX/bin/hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -files mapper.py,reducer.py \
    -input wc/in \
    -output wc/out \
    -mapper mapper.py \
    -reducer reducer.py 

In [None]:
ls_wc_out = !$HADOOP_PREFIX/bin/hdfs dfs -ls wc/out
print('\n'.join(ls_wc_out))

In [None]:
assert_true('wc/out/_SUCCESS' in ls_wc_out.s)
assert_true('wc/out/part-00000' in ls_wc_out.s)

In the following code cell, we display the output of the Hadoop Streaming job. It should match the result of the Python-only MapReduce pipeline (`mapper.py` piped into `reducer.py`) that we ran earlier.

In [None]:
!$HADOOP_PREFIX/bin/hdfs dfs -cat wc/out/part-00000
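
To compare the two results programmatically rather than by eye, one option is to parse both outputs as floats. The sketch below assumes that a line read back from HDFS may carry surrounding whitespace (for example a trailing tab), so it strips before converting. `same_value` and the sample strings are hypothetical.

```python
import math

def same_value(streaming_line, local_line, tol=1e-9):
    """Compare two text outputs as floats, ignoring surrounding whitespace."""
    return math.isclose(float(streaming_line.strip()),
                        float(local_line.strip()),
                        abs_tol=tol)

# Made-up strings standing in for the two outputs.
print(same_value("7.9\t", "7.9"))  # True
print(same_value("7.9", "7.8"))    # False
```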

We are done. Having the namenode and datanodes running in the background consumes quite a bit of memory, so we should shut down the nodes at the end of the notebook. Make sure you run the assertion tests in the final code cell.

In [None]:
! $HADOOP_PREFIX/bin/hdfs dfs -rm -r -f -skipTrash wc/out
! $HADOOP_PREFIX/sbin/stop-dfs.sh
! $HADOOP_PREFIX/sbin/stop-yarn.sh
! rm mapper.py reducer.py

In [None]:
check_dfs_stopped = !$HADOOP_PREFIX/sbin/stop-dfs.sh
assert_true("no namenode to stop" in check_dfs_stopped.s)
assert_true("no datanode to stop" in check_dfs_stopped.s)
assert_true("no secondarynamenode to stop" in check_dfs_stopped.s)

check_yarn_stopped = !$HADOOP_PREFIX/sbin/stop-yarn.sh
assert_true("no resourcemanager to stop" in check_yarn_stopped.s)
assert_true("no nodemanager to stop" in check_yarn_stopped.s)
assert_true("no proxyserver to stop" in check_yarn_stopped.s)