__Part 1__
- Write a MapReduce program that examines this file and prints, for each column, the column name and the largest value found in that column, and, for each row, the row name and the smallest value found in that row.  Column and row names are single-character strings.  The output should be of the form:
  - "A", 564
  - "B", 329
- There should be 20 lines in the report, and they do not have to be in any particular order.

In [33]:
%%file row_column_low_high.py
# Step 0: create a new file (lower cased + underscored).
"""row_column_low_high.py

For every column, print high; for every row, print low.
"""

# Step 1: import MRJob.
from mrjob.job import MRJob


# Step 2: create a class that inherits from MRJob.
class RowColumnLowHigh(MRJob):
  """Extracts data from a file.
  
  Prints the row/column with the low/high respectively.
  """

  # Step 3: create mapper (input: file -> data).
  def mapper(self, _, line):
    """Extracts data from a line and maps it."""

    # Get rid of whitespace and split into array by comma.
    cleanedLine = line.replace(" ", "").split(",")

    # Skip lines that are too short or whose 3rd item is not an integer,
    # instead of letting the int cast raise an exception.
    if (len(cleanedLine) < 3):
      return
    try:
      cleanedLine[2] = int(cleanedLine[2])
    except ValueError:
      return

    # Check for valid data.
    if (not cleanedLine[0].isalpha()):
      return
    if (not cleanedLine[1].isalpha()):
      return
    if (cleanedLine[2] < 0 or cleanedLine[2] > 999):
      return

    yield (cleanedLine[0], cleanedLine[1]), cleanedLine[2]


  # Step 4: create reducer (output: data -> console/file/etc).
  def reducer(self, key, values):
    """Passes each (key, value) pair through unchanged.

    The aggregation is not applied here yet. The goal is:
    for every column, print high;
    for every row, print low.
    """
    for data in values:
      yield key, data


# Step 5: set up main to run program.
if __name__ == "__main__":
  RowColumnLowHigh.run()


Overwriting row_column_low_high.py


In [34]:
!python row_column_low_high.py --no-bootstrap-mrjob cs4650hw1.dat

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/row_column_low_high.thomas.20231006.111702.211123
Running step 1 of 1...
job output is in /tmp/row_column_low_high.thomas.20231006.111702.211123/output
Streaming final output from /tmp/row_column_low_high.thomas.20231006.111702.211123/output...
["G", "S"]	102
["G", "S"]	159
["G", "S"]	483
["G", "S"]	212
["G", "S"]	143
["G", "S"]	396
["G", "S"]	476
["G", "S"]	619
["G", "S"]	702
["G", "S"]	839
["G", "S"]	484
["G", "S"]	485
["G", "S"]	596
["G", "S"]	954
["G", "S"]	483
["G", "S"]	913
["G", "S"]	879
["G", "S"]	272
["G", "S"]	56
["G", "S"]	35
["G", "T"]	454
["G", "T"]	2
["G", "T"]	152
["G", "T"]	91
["G", "T"]	781
["G", "T"]	557
["G", "T"]	386
["G", "T"]	559
["G", "T"]	735
["G", "T"]	211
["G", "T"]	957
["G", "T"]	18
["G", "T"]	39
["G", "T"]	504
["G", "T"]	667
["G", "T"]	229
["G", "T"]	156
["G", "T"]	317
["G", "T"]	279
["G", "T"]	760
["H", "K"]	385
["H", "K"]	159
["H", "K"]	
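
One way to get from per-cell values to the 20-line report that Part 1 asks for is to emit every value twice, once keyed by its column and once keyed by its row, and let the reducer take the max or min depending on the kind of key. The sketch below is plain Python rather than an MRJob class, so it can be run without a cluster. The names `map_line`, `reduce_extreme`, and `run_local` are hypothetical, and the field order `row, column, value` is an assumption, since the data file does not say which letter is the row.

```python
from collections import defaultdict


def map_line(line):
    """Yield two (key, value) pairs for one input line.

    ASSUMPTION: each line looks like "row, column, value". The field
    order is a guess; swap `row` and `column` if the file is laid out
    the other way round.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3 or not fields[2].isdigit():
        return  # skip malformed lines
    row, column, value = fields[0], fields[1], int(fields[2])
    yield ("column", column), value
    yield ("row", row), value


def reduce_extreme(key, values):
    """Return (name, extreme): the max for a column, the min for a row."""
    kind, name = key
    return name, (max(values) if kind == "column" else min(values))


def run_local(lines):
    """Local stand-in for the shuffle step: group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return [reduce_extreme(key, values) for key, values in grouped.items()]
```

In an MRJob class, the body of `map_line` could serve as the `mapper` and the body of `reduce_extreme` as the `reducer` (yielding instead of returning). Note that MRJob's default JSON protocol turns tuple keys into lists, so the key would arrive as `["column", "K"]`.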

__Part 2__
- This is a second MapReduce program, similar to Part 1 but with one additional feature.  For each column, the output should show the column name, the largest value in that column, and the name of one of the rows where this maximum value is found (more than one cell may hold the same maximum value).  For each row, the output should show the row name, the smallest value in that row, and the name of one of the columns where this minimum value is found.
- Again, the output should have 20 lines, and they do not have to be in any order.  An example of one line of the report is:
  - "A", {"value":564, "example":"M"}
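
A minimal sketch of the Part 2 idea: the value travels together with the name of the "other" axis, so the reducer can report where the extreme was found. As before, this is plain Python with hypothetical function names, and the `row, column, value` field order is an assumption.

```python
def map_line_with_source(line):
    """Yield (key, (value, other_name)) pairs for one input line.

    ASSUMPTION: each line looks like "row, column, value".
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3 or not fields[2].isdigit():
        return  # skip malformed lines
    row, column, value = fields[0], fields[1], int(fields[2])
    yield ("column", column), (value, row)   # a column remembers the row
    yield ("row", row), (value, column)      # a row remembers the column


def reduce_with_example(key, pairs):
    """Return (name, {"value": extreme, "example": where_it_was_found})."""
    kind, name = key
    pick = max if kind == "column" else min
    # Compare on the value only; on ties the first pair seen is kept,
    # which is fine because any one example is acceptable.
    value, example = pick(pairs, key=lambda pair: pair[0])
    return name, {"value": value, "example": example}
```

Comparing on `pair[0]` alone keeps the choice of example independent of how the names happen to sort.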

__Part 3__
- This is a third MapReduce program that extends Part 2 further.  Here the output should include a list of all rows/columns that have that maximum/minimum value.  Most rows and columns will have a unique extreme, but some will have duplicates.  The output might look like this:
  - "A", {"value":564, "examples": ["M"]}
  - "D", {"value":437, "examples": ["L", "P"]}
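
For Part 3, the reducer needs two passes over its values: one to find the extreme and one to collect every name that reaches it. The sketch below assumes the reducer receives `(value, other_name)` pairs under a `("column", name)` or `("row", name)` key; that key layout and the function name `reduce_with_all_examples` are hypothetical.

```python
def reduce_with_all_examples(key, pairs):
    """Return (name, {"value": extreme, "examples": [all matching names]})."""
    kind, name = key
    # Materialize first: in a MapReduce framework the values usually
    # arrive as a one-pass iterator, and two passes are needed here.
    pairs = list(pairs)
    pick = max if kind == "column" else min
    extreme = pick(value for value, _ in pairs)
    # Sorting is optional; it only makes the report deterministic.
    examples = sorted(other for value, other in pairs if value == extreme)
    return name, {"value": extreme, "examples": examples}
```

Holding one key's values in memory is reasonable here because each row or column only has a small number of cells.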