## 1.4 Introduction to Computation at Scale

We are going to use the Python [mrjob](https://github.com/Yelp/mrjob) package, developed at Yelp.

This package lets us develop and test MapReduce jobs locally and, once they are ready, deploy them to a Hadoop cluster with Hadoop Streaming enabled.  Here we only run jobs locally.

To write a MapReduce job we implement `mapper()` and `reducer()` methods.  The mrjob package takes care of orchestrating the job.  Here is a first example that counts the words in a file:

In [4]:
%%file wordcounter.py 
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

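    def mapper(self, _, line):
        # The input key is unused; emit the number of words on this line.
        yield "words", len(line.split())

    def reducer(self, key, values):
        # Sum the per-line counts emitted under the same key.
        yield key, sum(values)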

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Writing wordcounter.py


The key points to note:

* We inherit from the class `MRJob` and implement at least one of the `mapper`, `reducer` or `combiner` methods
* All Python methods take `self` as their first argument; this is normal Python, not specific to mrjob
* The mapper is called once for each line of the input file specified on the command line
* The mapper yields key-value pairs; the emitted pairs are sent to the combiners and reducers
* The reducer is called once for each distinct key, with an iterator over all the values the mappers emitted for that key
* The reducer must also yield key-value pairs
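
To make the orchestration concrete, here is a minimal sketch of the same word count run entirely in memory, without mrjob. The `run_local` helper is hypothetical and only imitates the map, group-by-key and reduce steps described above; it is not how mrjob is implemented.

```python
from collections import defaultdict

def mapper(_, line):
    # Same logic as MRWordFrequencyCount.mapper: one pair per input line.
    yield "words", len(line.split())

def reducer(key, values):
    # Same logic as MRWordFrequencyCount.reducer: one call per distinct key.
    yield key, sum(values)

def run_local(lines):
    """Hypothetical helper: map, group by key, then reduce, all in memory."""
    grouped = defaultdict(list)
    for line in lines:                      # map phase: one call per line
        for key, value in mapper(None, line):
            grouped[key].append(value)      # shuffle: collect values by key
    results = {}
    for key in sorted(grouped):             # reduce phase: one call per key
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results

sample = ["red road bike", "bike helmet", ""]
print(run_local(sample))  # {'words': 5}
```

Because every mapper call emits the same key, the reducer here is called exactly once, with the per-line counts as its values.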

Here we can count the words in the bike-items data we were using earlier:

In [5]:
! python wordcounter.py data/bike-items.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcounter.csumb.20160209.061521.363793

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/wordcounter.csumb.20160209.061521.363793/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/wordcounter.csumb.20160209.061521.363793/step-0-mapper-sorted
> sort /tmp/wordcounter.csumb.20160209.061521.363793/step-0-mapper_part-00000
writing to /tmp/wordcounter.csumb.20160209.061521.363793/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/wordcounter.csumb.20160209.061521.363793/step-0-reducer_part-00000 -> /tmp/wordcounter.csumb.20160209.061521.363793/output/part-00000
Streaming final output f

The job runs and writes its output to the file `out.txt`.  In this case there is just a single line:

In [6]:
! cat out.txt

"words"	755154


Here we made one pass through the file and computed just the number of words.  More elaborate jobs can compute several statistics in the same pass.  Here we count characters, words and lines; the mapper emits three key-value pairs for each line:


In [9]:
%%file wordcounter.py 
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Emit one pair per statistic for every input line.
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # Sum the per-line counts for each statistic.
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting wordcounter.py
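
The key points earlier mention combiners without showing one. In mrjob a combiner is an optional `combiner(self, key, values)` method; for a job like this one, whose reducer only sums, it could have the same body as the reducer. The sketch below is plain Python rather than mrjob, and the `combine` and `reduce_all` helpers and the two-chunk split are assumptions made for illustration. It shows the idea: each mapper task pre-sums its own pairs, so fewer pairs reach the reducers while the totals stay the same.

```python
from collections import defaultdict

def mapper(line):
    # Same three pairs per line as the job above.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

def combine(pairs):
    """Hypothetical helper: pre-sum the pairs produced by one mapper task."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())

def reduce_all(pairs):
    """Hypothetical helper: sum every value per key, as the reducer does."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Two chunks stand in for the input splits handled by two mapper tasks.
chunks = [["red road bike", "bike helmet"], ["kids bike"]]

raw = [[pair for line in chunk for pair in mapper(line)] for chunk in chunks]
combined = [combine(pairs) for pairs in raw]

pairs_without = [p for chunk in raw for p in chunk]
pairs_with = [p for chunk in combined for p in chunk]

print(len(pairs_without), len(pairs_with))                  # 9 6
print(reduce_all(pairs_without) == reduce_all(pairs_with))  # True
```

This only works because addition is associative and commutative; a reducer that computes something like a median could not be reused as a combiner unchanged.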


In [7]:
! python wordcounter.py data/bike-items.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcounter.csumb.20160209.061535.032858

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/wordcounter.csumb.20160209.061535.032858/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/wordcounter.csumb.20160209.061535.032858/step-0-mapper-sorted
> sort /tmp/wordcounter.csumb.20160209.061535.032858/step-0-mapper_part-00000
writing to /tmp/wordcounter.csumb.20160209.061535.032858/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/wordcounter.csumb.20160209.061535.032858/step-0-reducer_part-00000 -> /tmp/wordcounter.csumb.20160209.061535.032858/output/part-00000
Streaming final output f

In [10]:
! cat out.txt

"words"	755154


In [111]:
import re
s='124 "blah blah blah"'
# re.match only matches at the start of the string, so use re.search to find the quoted text
m = re.search(r'"([a-zA-Z0-9\s"]*)"', s)
print(m.group(1))

blah blah blah

## Term Frequency in MapReduce

Next we count how often each term occurs in the item titles.  Each line of data/bike-item-titles.txt is a CSV record of the form `id,"title"`, so the mapper parses the line and emits the pair (term, 1) for every whitespace-separated term in the lowercased title:

In [165]:
%%file term-frequency.py 
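from mrjob.job import MRJob
import csv

class MRTermFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Each input line is one CSV record: id,"title".
        # csv.reader accepts any iterable of lines, so wrap the line in a list.
        for row in csv.reader([line]):
            for term in row[1].lower().split():
                yield term, 1

    def reducer(self, key, values):
        # Total number of occurrences of each term across all titles.
        yield key, sum(values)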

if __name__ == '__main__':
    MRTermFrequencyCount.run()

Overwriting term-frequency.py
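
Splitting on whitespace keeps punctuation attached to the terms, so variants such as `bike`, `bike.` and `bike)` are counted separately. A possible refinement, sketched here as an assumption rather than as part of the job above, is to treat only runs of letters and digits as terms. The names `TOKEN_RE`, `tokenize` and `terms_from_line` are hypothetical.

```python
import csv
import re

# Assumption: a term is a run of lowercase ASCII letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(title):
    """Lowercase a title and split it on anything that is not a letter or digit."""
    return TOKEN_RE.findall(title.lower())

def terms_from_line(line):
    """Hypothetical helper: parse one id,"title" line and return (id, terms)."""
    row = next(csv.reader([line]))
    return row[0], tokenize(row[1])

print(tokenize("Bike/Bicycle (NEW) e-bike."))
# ['bike', 'bicycle', 'new', 'e', 'bike']
print(terms_from_line('3748,"24"" Butyl Tire Chrome Unicycle Wheel"'))
# ('3748', ['24', 'butyl', 'tire', 'chrome', 'unicycle', 'wheel'])
```

The trade-off is visible in the first example: compound terms such as `e-bike` are split into `e` and `bike`, which may or may not be what is wanted.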


In [166]:
! python term-frequency.py data/bike-item-titles.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/term-frequency.csumb.20160225.042855.060751

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/term-frequency.csumb.20160225.042855.060751/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/term-frequency.csumb.20160225.042855.060751/step-0-mapper-sorted
> sort /tmp/term-frequency.csumb.20160225.042855.060751/step-0-mapper_part-00000
writing to /tmp/term-frequency.csumb.20160225.042855.060751/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/term-frequency.csumb.20160225.042855.060751/step-0-reducer_part-00000 -> /tmp/term-frequency.csumb.20160225.042855.060751/output/part-00000
Str

In [167]:
! grep 'bike' out.txt


"1-bike"	1
"2-bike"	6
"2bike"	1
"700x23cbike"	1
"bandana/motorbike/chopper/harley"	1
"bicycle/bike"	6
"bicycle/cycling/bike/gate"	1
"bike"	4446
"bike(new)"	1
"bike)"	1
"bike**"	1
"bike-"	2
"bike--campagnolo"	1
"bike--nos"	1
"bike-e293"	1
"bike-eye"	3
"bike-pink"	1
"bike."	1
"bike/bicycle"	10
"bike/bicycle#06"	1
"bike/bicycle/cycling"	2
"bike/cycling"	1
"bike/road"	2
"bike/singlespeed"	1
"bike/tri"	1
"bike36t"	1
"bikehand"	6
"bikemate"	1
"biker"	5
"bikes"	100
"biketour"	1
"cycle/bike"	1
"cyclingbike"	1
"disc-xtreme_e-bike"	1
"e-bike"	5
"ebike"	8
"ebikes"	1
"electrobike"	1
"fatbike"	6
"flybikes"	3
"gobike88"	14
"inbike"	1
"interbike"	1
"lever-bike"	1
"light/bike"	1
"motor/bike"	1
"motorbike"	8
"mtb/bike"	1
"pbike"	1
"probike"	1
"red-bag/4-bike"	1
"roadbike"	2
"sale!bike"	1
"se-bikes"	2
"set/kit/cro-mo/bike"	1
"smp4bike"	1
"sobike"	3
"stowabike"	1
"tube-medium-red-8\"-bike"	1
"wolfbike"	4
"zerobike"	1


## Inverted Index

An inverted index maps each term to the ids of the items whose titles contain it.  The mapper emits (term, id) pairs and the reducer writes them out one pair per line, grouped by term:

In [148]:
%%file inverted-index.py 
from mrjob.job import MRJob
import csv

class MRInvertedIndex(MRJob):

    def mapper(self, _, line):
        # Each input line is one CSV record: id,"title".
        for row in csv.reader([line]):
            doc_id = row[0]
            for term in row[1].lower().split():
                yield term, doc_id

    def reducer(self, key, values):
        # Emit one (term, id) pair per posting; pairs arrive grouped by term.
        for doc in values:
            yield key, doc

if __name__ == '__main__':
    MRInvertedIndex.run()

Overwriting inverted-index.py
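
The reducer above writes one line per (term, id) pair. A common alternative is to emit a single posting list per term; in the job that would amount to something like `yield key, sorted(set(values))` in the reducer. The sketch below shows the same grouping in plain Python. The `build_index` helper is hypothetical, and the two sample rows are titles from the bike-item data.

```python
from collections import defaultdict

def build_index(rows):
    """Map each term to the sorted list of ids whose title contains it."""
    postings = defaultdict(set)
    for doc_id, title in rows:
        for term in title.lower().split():
            postings[term].add(doc_id)   # a set drops repeated terms in a title
    return {term: sorted(ids) for term, ids in postings.items()}

rows = [
    ("2138", "Electric Unicycle Hybrid Battery 800W Powered Model Q6"),
    ("8777", "Torker Unistar LX Pro Unicycle"),
]
index = build_index(rows)
print(index["unicycle"])  # ['2138', '8777']
print(index["torker"])    # ['8777']
```

Note that the ids are strings here, as in the job, so they sort lexicographically rather than numerically.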


In [149]:
! python inverted-index.py data/bike-item-titles.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/inverted-index.csumb.20160210.065248.125862

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/inverted-index.csumb.20160210.065248.125862/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/inverted-index.csumb.20160210.065248.125862/step-0-mapper-sorted
> sort /tmp/inverted-index.csumb.20160210.065248.125862/step-0-mapper_part-00000
writing to /tmp/inverted-index.csumb.20160210.065248.125862/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/inverted-index.csumb.20160210.065248.125862/step-0-reducer_part-00000 -> /tmp/inverted-index.csumb.20160210.065248.125862/output/part-00000
Str

In [162]:
! grep '"unicycle"' out.txt
! awk 'NR==2138 {print$0}' data/bike-items.txt

"unicycle"	"1883"
"unicycle"	"2138"
"unicycle"	"3748"
"unicycle"	"7232"
"unicycle"	"8777"
"Electric Unicycle Hybrid Battery 800W Powered Model Q6","**Shipping dates start on the 22nd of February, 2016, please contact us before placing an order!**Here it is! The Electric Urban Transporter. This is THE newest One Wheel Electric Motorcycle (UNICYCLE). Whether you’re looking to travel to work in style, grocery shop, or simply ride green through the city, the Unicycle is the way to go! This is THE greenest, easiest, and coolest way to hop around for the urban dwellers. Transform the way you think about transportation with this Self Balancing"


In [163]:
! grep '[Uu]nicycle' data/bike-item-titles.txt

1883,"2X Pedal LED Light for self balance electric unicycle scooter airwheel solowheel"
2138,"Electric Unicycle Hybrid Battery 800W Powered Model Q6"
3748,"24"" Butyl Tire Chrome Unicycle Wheel Cycling Mountain Exercise Balance Fitness"
7232,"Unicycle Seat Basic with Seatpost 25.4mm diameter 280mm Four Bolt Black New"
8777,"Torker Unistar LX Pro Unicycle"
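
To query the index from Python instead of with grep, the job output can be loaded into a dictionary. This sketch assumes the tab-separated JSON key and value format seen in out.txt above; the `load_index` helper is hypothetical, and the sample lines are copied from the grep output.

```python
import json
from collections import defaultdict

def load_index(lines):
    """Build {term: [ids]} from lines of tab-separated JSON key/value pairs."""
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                       # skip blank lines
        key, value = line.split("\t", 1)   # JSON key, tab, JSON value
        index[json.loads(key)].append(json.loads(value))
    return dict(index)

sample = ['"unicycle"\t"1883"\n', '"unicycle"\t"2138"\n', '"unicycle"\t"3748"\n']
index = load_index(sample)
print(index["unicycle"])  # ['1883', '2138', '3748']
```

For the real output the same function can be given an open file, for example `load_index(open("out.txt"))`.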
