## 1.4 Introduction to Computation at Scale

We are going to use the Python [mrjob](https://github.com/Yelp/mrjob) package, developed at Yelp.

This package allows us to develop and test map reduce jobs locally and, when ready, deploy them to a Hadoop cluster with Hadoop Streaming enabled.  In this section we only run jobs locally.

To write a map reduce job we need to implement mapper() and reducer() functions.  The mrjob package takes care of the orchestration of the job.  Here is a first example that will count words in a file.  

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>To create the file we use the Jupyter Notebook cell magic `%%file`.  
The notebook writes the cell contents to the file system when the cell is run.

In [1]:
%%file wordcounter.py 
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, value):
        yield "words", len(value.split())

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting wordcounter.py


The key points to note:

* We inherit from the class MRJob and provide at least one mapper, reducer or combiner method implementation
* All Python instance methods take `self` as their first argument; this is standard Python, not specific to mrjob
* The mappers will be sent a partition of the input data
* The mappers must yield key value pairs. The emitted pairs are sent to the reducers; a hash of the key decides which node receives a pair, so every pair with the same key arrives at the same node
* The mappers and reducers are implemented as Python [generators](https://wiki.python.org/moin/Generators): they `yield` results one at a time, so their output can be consumed like an iterator
* The reducers will receive the key and all the values emitted by the mappers with this key
* The reducers must also output key and value pairs
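
The flow described in the points above can be imitated in plain Python without mrjob. The following is a minimal sketch, not how mrjob itself is implemented: `run_local`, `NUM_REDUCERS` and the sample lines are hypothetical names and data introduced for illustration, and the modulo-of-hash partitioning is only an assumption about how keys might be assigned to reducer nodes.

```python
from collections import defaultdict


def mapper(_, value):
    # Same logic as MRWordFrequencyCount.mapper: one pair per input line.
    yield "words", len(value.split())


def reducer(key, values):
    # Same logic as MRWordFrequencyCount.reducer: sum all values for a key.
    yield key, sum(values)


NUM_REDUCERS = 3  # hypothetical number of reducer nodes


def run_local(lines):
    # Shuffle: route every emitted pair to one partition chosen by hashing
    # the key, so all values for a given key end up at the same reducer.
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for line in lines:
        for key, value in mapper(None, line):
            partitions[hash(key) % NUM_REDUCERS][key].append(value)
    # Reduce: each partition processes its own keys independently.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            for out_key, out_value in reducer(key, values):
                results[out_key] = out_value
    return results


sample = ["red mountain bike", "bike helmet", "unicycle"]  # hypothetical input
print(run_local(sample))
```

For the three sample lines the mapper emits 3, 2 and 1, so the single `"words"` key reduces to 6.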
 
<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>The job is launched from the command line.  
We can access the shell with the Jupyter Notebook line magic `!`.

In [2]:
! python wordcounter.py data/bike-item-titles-clean.txt > out.txt

No configs found; falling back on auto-configuration
Traceback (most recent call last):
  File "wordcounter.py", line 12, in <module>
    MRWordFrequencyCount.run()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/job.py", line 430, in run
    mr_job.execute()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/job.py", line 448, in execute
    super(MRJob, self).execute()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/launch.py", line 158, in execute
    self.run_job()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/launch.py", line 224, in run_job
    runner.run()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/runner.py", line 473, in run
    self._run()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/sim.py", line 160, in _run
    _error_on_bad_paths(self.fs, self._input_paths)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/mrjob/sim.py", line 552, in _error

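When the job succeeds, its output is redirected into the file out.txt; for this job that is a single line.  The traceback above passes through mrjob's `_error_on_bad_paths` check, which suggests that the input path `data/bike-item-titles-clean.txt` was not found in this run, so out.txt is empty here: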

In [3]:
! cat out.txt

Here we make one pass through the file and compute just the number of words.  More elaborate jobs can compute several statistics in the same pass.  Below we count characters, words and lines: the mapper emits three key value pairs for each line.


In [4]:
%%file wordcounter.py 
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, value):
        yield "chars", len(value)
        yield "words", len(value.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting wordcounter.py


In [5]:
! python wordcounter.py data/bike-item-titles.txt > out.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/wordcounter.ubuntu.20160523.012044.403522
Running step 1 of 1...
Streaming final output from /tmp/wordcounter.ubuntu.20160523.012044.403522/output...
Removing temp directory /tmp/wordcounter.ubuntu.20160523.012044.403522...


In [6]:
! cat out.txt

"words"	106980
"chars"	714500
"lines"	9894


## Term Frequency in Map Reduce

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Using the word count example above, can you modify the MR job to compute token frequency across the entire corpus in the file `data/bike-item-titles.txt`?  Remember that the mapper can only emit (key, value) pairs.


**Hint**: the `data/bike-item-titles.txt` file is quoted like a CSV file.  The easiest way to handle the CSV input presented to the mapper is to use `StringIO` and `csv.reader`:

In [7]:
import StringIO
import csv

line = '"Some quoted text about 18"" pizzas"'
for row in csv.reader(StringIO.StringIO(line)):
    print(row)
    for term in row[0].split():
        print(term)

['Some quoted text about 18" pizzas']
Some
quoted
text
about
18"
pizzas
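
The `StringIO` module used in the hint exists only in Python 2; in Python 3 the same class is provided by `io`. As a hedged sketch, the hint can be written so that it runs under either version. The helper name `split_terms` and its `column` parameter are hypothetical and not part of the notebook.

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def split_terms(line, column=0):
    # Parse one CSV-quoted line and split the chosen column on whitespace.
    terms = []
    for row in csv.reader(StringIO(line)):
        terms.extend(row[column].split())
    return terms


line = '"Some quoted text about 18"" pizzas"'
print(split_terms(line))
```

The doubled quote inside the field is unescaped by `csv.reader`, so the fifth term comes back as `18"`, matching the output of the cell above.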


In [8]:
%%file term-frequency.py 
from mrjob.job import MRJob
import StringIO
import csv

class MRTermFrequencyCount(MRJob):

    def mapper(self, _, value):
        # << IMPLEMENT MAPPER >> CODE HERE
        ## HIDE
        for row in csv.reader(StringIO.StringIO(value)):
            for term in row[1].lower().split():
                yield term, 1

    def reducer(self, key, values):
        # << IMPLEMENT REDUCER >> CODE HERE
        ## HINT
        yield key, sum(values)

if __name__ == '__main__':
    MRTermFrequencyCount.run()

Overwriting term-frequency.py


In [9]:
! python term-frequency.py data/bike-item-titles.txt > out.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/term-frequency.ubuntu.20160523.012045.608427
Running step 1 of 1...
Streaming final output from /tmp/term-frequency.ubuntu.20160523.012045.608427/output...
Removing temp directory /tmp/term-frequency.ubuntu.20160523.012045.608427...
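
The output file can also be inspected from Python. The sketch below assumes that each output line is a JSON-encoded key, a tab, and a JSON-encoded value, which is the shape of the word count output shown earlier. The helper name `load_counts` is hypothetical, and the sample lines are copied from that earlier listing rather than read from disk.

```python
import json


def load_counts(lines):
    # Each line is assumed to look like: "key"<TAB>value
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        counts[json.loads(key)] = json.loads(value)
    return counts


sample_lines = ['"words"\t106980\n', '"chars"\t714500\n', '"lines"\t9894\n']
counts = load_counts(sample_lines)
print(counts["words"])     # exact-key lookup
print(counts.get("bike"))  # None: this key is not in the sample
```

With a real output file, something like `with open("out.txt") as f: counts = load_counts(f)` would build the same dictionary, after which a single term can be looked up by its exact key.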


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Using a line magic, `grep` the output file for the term `bike`.  
You may want to pipe the results of `grep` to `head`.

In [10]:
! grep 'bike' out.txt | head

"inbike"	1
"interbike"	1
"lever-bike"	1
"light/bike"	1
"motor/bike"	1
"motorbike"	8
"mtb/bike"	1
"pbike"	1
"probike"	1
"red-bag/4-bike"	1


## Inverted Index

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>The goal is to create an inverted index that maps terms to rows in the file, using MRJob.  The row id is in the first column of the file.  
The input file should be `data/bike-item-titles.txt`.   

In [11]:
%%file inverted-index.py 
from mrjob.job import MRJob
import StringIO
import csv

class MRInvertedIndex(MRJob):

    def mapper(self, _, value):
        # << IMPLEMENT MAPPER >> CODE HERE
        ## HIDE
        for row in csv.reader(StringIO.StringIO(value)):
            row_id = row[0]  # the row id is the first column
            for term in row[1].lower().split():
                yield term, row_id

    def reducer(self, key, values):
        # << IMPLEMENT REDUCER >> CODE HERE
        ## HIDE
        for doc in values:
            yield key, doc

if __name__ == '__main__':
    MRInvertedIndex.run()

Overwriting inverted-index.py


In [12]:
! python inverted-index.py data/bike-item-titles.txt > out.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/inverted-index.ubuntu.20160523.012047.866588
Running step 1 of 1...
Streaming final output from /tmp/inverted-index.ubuntu.20160523.012047.866588/output...
Removing temp directory /tmp/inverted-index.ubuntu.20160523.012047.866588...
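
The `MRInvertedIndex` reducer above emits one (term, row id) pair per match. An inverted index is often stored instead as one posting list per term. The sketch below shows that variation in plain Python rather than as an mrjob job; it is an alternative shape, not the notebook's solution. `build_index` and the sample rows are hypothetical, and sorting with `key=int` assumes that every row id is numeric.

```python
import csv
from collections import defaultdict

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def mapper(_, value):
    # Emit (term, row id) for every term in the title column.
    for row in csv.reader(StringIO(value)):
        row_id = row[0]
        for term in row[1].lower().split():
            yield term, row_id


def reducer(key, values):
    # Emit a single posting list per term instead of one pair per row.
    yield key, sorted(set(values), key=int)


def build_index(lines):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        for term, row_id in mapper(None, line):
            grouped[term].append(row_id)
    index = {}
    for term, row_ids in grouped.items():
        for out_key, postings in reducer(term, row_ids):
            index[out_key] = postings
    return index


# Hypothetical rows in the "id","title" layout described above.
rows = ['"1","Red Unicycle"', '"2","Red bike"', '"10","Unicycle seat"']
index = build_index(rows)
print(index["unicycle"])
print(index["red"])
```

One list per term keeps the whole posting list together in a single output record, at the cost of holding all row ids for a term in memory inside the reducer.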


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>`grep` the output file to find the row numbers where the item title includes the term 'unicycle'.  
Use the UNIX command `awk`, or another UNIX command of your choice, to extract one of those lines and confirm the match.

In [13]:
#HIDE
! grep '"unicycle"' out.txt
! awk 'NR==2138 {print$0}' data/bike-items.txt

"unicycle"	"1883"
"unicycle"	"2138"
"unicycle"	"3748"
"unicycle"	"7232"
"unicycle"	"8777"
"Electric Unicycle Hybrid Battery 800W Powered Model Q6","**Shipping dates start on the 22nd of February, 2016, please contact us before placing an order!**Here it is! The Electric Urban Transporter. This is THE newest One Wheel Electric Motorcycle (UNICYCLE). Whether you’re looking to travel to work in style, grocery shop, or simply ride green through the city, the Unicycle is the way to go! This is THE greenest, easiest, and coolest way to hop around for the urban dwellers. Transform the way you think about transportation with this Self Balancing"
