### <u>Homework 2: Map/Reduce</u>
- By Thomas Truong

<b>The Data</b>
- There is a data file that needs to be analyzed by a Map/Reduce system.  The file's name is 'cs4650hw2.dat', and it is available on Canvas (look in Modules > Extra Files > Homework 2).
- The data represents a two-dimensional array of cells, and there are several values in each cell. The basic task is this: for each row in the array, examine all of the values in all of the cells of that row, returning the smallest value found. At the same time, for each column in the array, examine all of the values in all of the cells in that column, returning the largest value found.
- Now for the details:
  - The columns are labeled 'A' through 'J', so there are 10 columns.
  - The rows are labeled 'K' through 'T', so there are 10 rows.
  - The values are integers from 0 to 999.
  - The data file uses CSV format, with each line of the file giving the column name, the row name, and one of the values for that cell.  For example, some of the lines of the file might be:
    - A,K,652
    - A,K,378
    - A,L,21
    - E,M,411
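
As a minimal sketch of how one such line could be turned into a record, the plain-Python helper below splits and validates a line using the ranges listed above. The name `parse_record` and the extra sample lines `Z,K,5` and `A,K,abc` are hypothetical, added only to show the rejection path.

```python
def parse_record(line):
    """Split 'column,row,value' into (column, row, value), or None if invalid."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return None
    column, row, text = parts[0].upper(), parts[1].upper(), parts[2]
    try:
        value = int(text)
    except ValueError:
        return None
    # Ranges from the description: columns A-J, rows K-T, values 0-999.
    if column not in list("ABCDEFGHIJ") or row not in list("KLMNOPQRST"):
        return None
    if not 0 <= value <= 999:
        return None
    return column, row, value


# The last two samples are made up to exercise the validation.
for sample in ["A,K,652", "A,L,21", "E,M,411", "Z,K,5", "A,K,abc"]:
    print(sample, "->", parse_record(sample))
```

Returning `None` for a bad line lets a caller skip it rather than stop on the first malformed record.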

__Part 1__
- Write a map reduce program that will examine this file, then print, for each column, the name of the column and the largest value found in that column, and for each row, the name of the row and the smallest value found in that row.  Note that the column and row names are strings, a single character.  The output should be of the form:
  - "A", 564
  - "B", 329
- There should be 20 lines in the report, and they do not have to be in any particular order.

In [15]:
%%file row_column_low_high.py
# Step 0: create a new file (lower cased + underscored).
"""row_column_low_high.py

For every column, print high; for every row, print low.
"""

# Step 1: import MRJob.
from mrjob.job import MRJob


# Step 2: create a class that inherits from MRJob.
class RowColumnLowHigh(MRJob):
  """Extracts data from a file.
  
  Prints the row/column with the low/high respectively.
  """

  # Step 3: create mapper (input: file -> data).
  def mapper(self, _, line):
    """Extracts data from a line and maps it."""

    # Clean up the data.
    cleanedLine = line.replace(" ", "")      # Remove whitespaces.
    cleanedLine = cleanedLine.split(",")     # Split into array.

    # Check for valid data: exactly three fields, the last one an int.
    if (len(cleanedLine) != 3):
      return
    try:
      cleanedLine[2] = int(cleanedLine[2])   # Make int.
    except ValueError:
      return
    if (cleanedLine[2] < 0 or cleanedLine[2] > 999):
      return

    # Cleaning part 2: make uppercase if needed.
    cleanedLine[0] = cleanedLine[0].upper()
    cleanedLine[1] = cleanedLine[1].upper()

    # Check part 2: column must be A-J and row must be K-T.
    if (cleanedLine[0] not in list("ABCDEFGHIJ")):
      return
    if (cleanedLine[1] not in list("KLMNOPQRST")):
      return

    # Return column/row with number.
    yield cleanedLine[0], cleanedLine[2]
    yield cleanedLine[1], cleanedLine[2]


  # Step 4: create reducer (output: data -> console/file/etc).
  def reducer(self, key, values):
    """Prints the row/column with the low/high respectively.
    
    For every column, print high.
    For every row, print low.
    """

    # Convert and check letter based on ASCII.
    asciiKey = ord(key)
    # A-J = column (max).
    if (asciiKey >= 65 and asciiKey <= 74):
      yield key, max(values)
    # K-T = row (min).
    else:
      yield key, min(values)


# Step 5: set up main to run program.
if __name__ == "__main__":
  RowColumnLowHigh.run()


Overwriting row_column_low_high.py


In [16]:
!python row_column_low_high.py --no-bootstrap-mrjob cs4650hw2.dat

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/row_column_low_high.thomas.20231006.232407.382699
Running step 1 of 1...
job output is in /tmp/row_column_low_high.thomas.20231006.232407.382699/output
Streaming final output from /tmp/row_column_low_high.thomas.20231006.232407.382699/output...
"R"	2
"J"	992
"T"	2
"F"	997
"A"	994
"P"	12
"G"	997
"S"	5
"I"	995
"N"	13
"E"	998
"C"	987
"D"	995
"H"	997
"M"	0
"L"	3
"K"	0
"O"	1
"Q"	0
"B"	999
Removing temp directory /tmp/row_column_low_high.thomas.20231006.232407.382699...


__Part 2__
- This is a second map reduce program that is similar to the first part, but has an additional feature.  For a column, the output should show the column name, the largest value in that column, and the name of one of the rows where this maximum value is found (there may be more than one cell with the same maximum value).  For each row, the output should show the row name, the smallest value in that row, and the name of one of the columns where this minimum value is found.
- Again, the output should have 20 lines, and they do not have to be in any order.  An example of one line of the report is:
  - "A", {"value":564, "example":"M"}
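
One way to carry the "example" label through the job is to emit the value together with the label of the crossing row or column, then let tuple comparison pick the extreme. The sketch below simulates that idea locally in plain Python on the four sample lines from the data description. It is an illustration under those assumptions, not the required solution, and `summarize` is a hypothetical helper name.

```python
from collections import defaultdict

# The four sample lines from the data description, already parsed.
records = [("A", "K", 652), ("A", "K", 378), ("A", "L", 21), ("E", "M", 411)]

# "Map": key by column and by row, pairing the value with the crossing label.
grouped = defaultdict(list)
for column, row, value in records:
    grouped[column].append((value, row))
    grouped[row].append((value, column))


def summarize(key, pairs):
    """Columns (A-J) take the max pair, rows (K-T) take the min pair."""
    # Tuples compare by value first, so the label simply rides along.
    value, example = max(pairs) if key <= "J" else min(pairs)
    return {"value": value, "example": example}


# "Reduce": one summary per key.
report = {key: summarize(key, pairs) for key, pairs in grouped.items()}
for key in sorted(report):
    print(key, report[key])
```

When several cells share the extreme value, the tuple comparison falls back to the label, so the example reported is the alphabetically last label for a column and the first for a row. Either is acceptable here, since any one matching row or column may be shown.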

In [None]:
%%file column_min_max.py
# Step 0: create a new file (lower cased + underscored).
"""column_min_max.py

For every column, print max and one row that has it;
for every row, print min and one column that has it.
"""

# Step 1: import MRJob.
from mrjob.job import MRJob


# Step 2: create a class that inherits from MRJob.
class ColumnMinMax(MRJob):
  """Extracts data from a file.
  Prints each column with its largest value and one row where it occurs,
    and each row with its smallest value and one column where it occurs.
  """

  # Step 3: create mapper (input: file -> data).
  def mapper(self, _, line):
    """Extracts data from a line and maps it."""

    # Clean up the data.
    cleanedLine = line.replace(" ", "")      # Remove whitespaces.
    cleanedLine = cleanedLine.split(",")     # Split into array.

    # Check for valid data: exactly three fields, the last one an int.
    if (len(cleanedLine) != 3):
      return
    try:
      cleanedLine[2] = int(cleanedLine[2])   # Make int.
    except ValueError:
      return
    if (cleanedLine[2] < 0 or cleanedLine[2] > 999):
      return

    # Cleaning part 2: make uppercase if needed.
    cleanedLine[0] = cleanedLine[0].upper()
    cleanedLine[1] = cleanedLine[1].upper()

    # Check part 2: column must be A-J and row must be K-T.
    if (cleanedLine[0] not in list("ABCDEFGHIJ")):
      return
    if (cleanedLine[1] not in list("KLMNOPQRST")):
      return

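    # Return column/row with [number, crossing row/column].
    yield cleanedLine[0], [cleanedLine[2], cleanedLine[1]]
    yield cleanedLine[1], [cleanedLine[2], cleanedLine[0]]


  # Step 4: create reducer (output: data -> console/file/etc).
  def reducer(self, key, values):
    """Prints each column with its largest value and one row that has it,
      and each row with its smallest value and one column that has it.
    """

    # Convert and check letter based on ASCII.
    asciiKey = ord(key)
    # A-J = column (max); pairs compare by number first.
    if (asciiKey >= 65 and asciiKey <= 74):
      value, example = max(values)
    # K-T = row (min).
    else:
      value, example = min(values)
    yield key, {"value": value, "example": example}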


# Step 5: set up main to run program.
if __name__ == "__main__":
  ColumnMinMax.run()


__Part 3__
- This is a third map reduce program that is a further extension of Part 2.  In this case, for the output, we want to include a list of all rows/columns that have that maximum/minimum value.  Note that most rows and columns will have a unique extreme, but some will have duplicates.  This output might look like this:
  - "A", {"value":564, "examples": ["M"]}
  - "D", {"value":437, "examples": ["L",  "P"]}
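
A possible shape for the Part 3 reduction, sketched in plain Python so it can be tried without mrjob: find the extreme value first, then keep every label that reaches it. The helper name `extreme_with_examples` and the sample pairs are hypothetical. The helper assumes each key receives `(value, label)` pairs, where the label is the crossing row or column.

```python
def extreme_with_examples(key, pairs):
    """Return {'value': extreme, 'examples': [labels]} for one row/column key."""
    # Materialize first: the pairs are scanned twice below, and a reducer's
    # values may be a one-pass iterator.
    pairs = list(pairs)
    # Columns A-J want the largest value, rows K-T the smallest.
    pick = max if "A" <= key <= "J" else min
    extreme = pick(value for value, _ in pairs)
    # Collect each label once, in sorted order, where the extreme occurs.
    examples = sorted({label for value, label in pairs if value == extreme})
    return {"value": extreme, "examples": examples}


# Made-up inputs: column D peaks in rows L and P; row K bottoms out in A and C.
print(extreme_with_examples("D", [(437, "L"), (120, "M"), (437, "P")]))
print(extreme_with_examples("K", [(5, "A"), (9, "B"), (5, "C"), (5, "A")]))
```

Building the examples from a set avoids listing the same label twice when a single cell holds the extreme value more than once.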