<a href="https://colab.research.google.com/gist/vumaasha/9bd455881e93473aa5abf044adeab775/python-mrjob-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Midwest Big Data Summer School 2019
## Python MRJob Demo - Wed. May 22, 2019
**Dr. Robert Dyer**

**Assistant Professor, Dept. of Computer Science**

**Bowling Green State University**

### NOTE: click "open in playground mode" in the File menu above so that you can run this notebook!

In this notebook, I will show basic use of MRJob (MapReduce) inside Python.

First, we need to install the mrjob package and download a sample text file to process.

In [None]:
!pip install --quiet mrjob

In [None]:
!wget https://www.gutenberg.org/cache/epub/67979/pg67979.txt

--2023-05-03 12:16:07--  https://www.gutenberg.org/cache/epub/67979/pg67979.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 427836 (418K) [text/plain]
Saving to: ‘pg67979.txt’


2023-05-03 12:16:08 (1.16 MB/s) - ‘pg67979.txt’ saved [427836/427836]



In [None]:
!ls pg67979.txt

pg67979.txt


In [None]:
!head pg67979.txt

﻿The Project Gutenberg eBook of The Blue Castle, by Lucy Maud Montgomery

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.



In [None]:
# copying this file to hdfs

If there are no errors above, then MRJob is properly installed and ready to use.  Let's create two simple MapReduce programs to test it.  The %%file magic at the top of each cell saves the cell's contents to the named file (letter_count.py and wordcount.py) so that we can execute it later.

In [None]:
%%file letter_count.py
from mrjob.job import MRJob
import re

class LetterCount(MRJob):
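    def mapper(self, key, value):
        # Each input line comes from word-count.out and looks like: "word"<TAB>count
        splits = re.split(r'\s', value)
        if len(splits) < 2 or len(splits[0]) < 2:
            return  # skip blank or malformed lines
        word = splits[0].lower()
        count = splits[1]
        # word[0] is the opening quote, so word[1] is the word's first character
        starting_letter = word[1]
        if starting_letter == "x":
            yield starting_letter, int(count)

    def reducer(self, key, values):
        # Add up the counts of all words that start with this letter
        yield key, sum(values)

if __name__ == '__main__':
    LetterCount.run()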

Overwriting letter_count.py


In [None]:
%%file wordcount.py
from mrjob.job import MRJob
import re

class WordCount(MRJob):
    def mapper(self, key, value):
        # Split the line on whitespace and drop empty strings
        words = [s.strip() for s in re.split(r'\s', value) if s]
        for word in words:
            yield word, 1

    def reducer(self, key, values):
        # Sum the 1s emitted for this word
        yield key, sum(values)

if __name__ == '__main__':
    WordCount.run()

Overwriting wordcount.py
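
To make the roles of `mapper` and `reducer` more concrete, here is a minimal pure-Python sketch of the same word count without MRJob. The function names (`map_line`, `shuffle`, `reduce_counts`, `run_word_count`) are hypothetical and exist only for this illustration. In a real job, the framework performs the grouping step between map and reduce for you.

```python
import re
from collections import defaultdict


def map_line(line):
    """Emit a (word, 1) pair per whitespace-separated token, like WordCount.mapper."""
    for token in re.split(r'\s', line):
        if token:
            yield token, 1


def shuffle(pairs):
    """Group values by key; the framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Sum the values collected for one key, like WordCount.reducer."""
    return key, sum(values)


def run_word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == '__main__':
    sample = ["the blue castle", "the castle"]
    print(run_word_count(sample))  # prints {'the': 2, 'blue': 1, 'castle': 2}
```

Everything here runs in one process and holds all pairs in memory, so it only illustrates the data flow, not how the work is distributed.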


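Now that the code is saved to a file, we can run it.  This runs the job locally with MRJob's inline runner (not on Hadoop) and processes the file you pass in as the first argument.  By default the results print to the console; here we redirect them to word-count.out so that we can inspect them and feed them to letter_count.py.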

In [None]:
!python wordcount.py pg67979.txt > word-count.out

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount.root.20230503.124836.656580
Running step 1 of 1...
job output is in /tmp/wordcount.root.20230503.124836.656580/output
Streaming final output from /tmp/wordcount.root.20230503.124836.656580/output...
Removing temp directory /tmp/wordcount.root.20230503.124836.656580...


In [None]:
!du -hs *

792K	98-0.txt
792K	98-0.txt.1
4.0K	alpha-count.out
4.0K	letter_count.py
420K	pg67979.txt
55M	sample_data
176K	word-count.out
4.0K	wordcount.py
16K	word-freq.out


In [None]:
!head -10 word-count.out

"appreciate"	1
"appreciation"	1
"apprehension."	1
"approach"	2
"approaching"	1
"approval."	2
"approved"	1
"approved."	1
"apron"	2
"apt"	3


In [None]:
!python letter_count.py word-count.out

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/letter_count.root.20230503.125940.661568
Running step 1 of 1...
job output is in /tmp/letter_count.root.20230503.125940.661568/output
Streaming final output from /tmp/letter_count.root.20230503.125940.661568/output...
"x"	72
Removing temp directory /tmp/letter_count.root.20230503.125940.661568...
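
The mapper in letter_count.py splits each line of word-count.out on whitespace and skips the opening quote by indexing. MRJob's default output protocol writes the key and the value as JSON separated by a tab, so another option is to decode each line with `json.loads`, which also turns escapes such as `\u201d` back into characters. The sketch below is an illustration under that assumption; the helper names `parse_output_line` and `count_by_first_letter` are hypothetical and not part of MRJob.

```python
import json
from collections import Counter


def parse_output_line(line):
    """Split one '"word"<TAB>count' line into a decoded (word, count) pair."""
    raw_key, raw_value = line.rstrip('\n').split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)


def count_by_first_letter(lines):
    """Sum the counts of all words, grouped by lowercased first character."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, count = parse_output_line(line)
        if word:
            totals[word[0].lower()] += count
    return totals


if __name__ == '__main__':
    # Sample lines in the same format as word-count.out
    sample = ['"seven"\t1', '"seven."\t2', '"seventeen,\\u201d"\t1', '"apt"\t3']
    print(parse_output_line(sample[2]))
    print(count_by_first_letter(sample))
```

Unlike the MRJob version, this tallies every starting letter at once rather than only "x". Passing an open file handle for word-count.out as `lines` would process the real output.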


In [None]:
!tail -20 word-count.out

"settled"	1
"settled,"	1
"settled."	1
"seven"	1
"seven,"	1
"seven."	2
"seventeen"	2
"seventeen,\u201d"	1
"seventy"	2
"several"	11
"severe"	1
"severed"	1
"severely"	1
"severely."	2
"sew"	1
"sew."	1
"sewing"	1
"shabby"	4
"shabby,"	2
"shabby\u2014nobody"	1


As you can see, the output lists every unique word in the source text and how often each one occurred.
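
In the output above, tokens such as `settled`, `settled,` and `settled.` are counted separately because the mapper splits only on whitespace. One possible refinement, sketched here as an assumption rather than as part of the original demo, is to lowercase each token and strip surrounding punctuation before yielding it. `normalize_tokens` and `TRIM_CHARS` are hypothetical names; the loop body could replace the list comprehension in `WordCount.mapper`.

```python
import string

# Characters to trim from both ends of a token: ASCII punctuation plus
# the curly quotes and em dash that appear in the Gutenberg text.
TRIM_CHARS = string.punctuation + '\u201c\u201d\u2018\u2019\u2014'


def normalize_tokens(line):
    """Yield lowercased tokens with leading and trailing punctuation removed."""
    for token in line.split():
        cleaned = token.strip(TRIM_CHARS).lower()
        if cleaned:
            yield cleaned


if __name__ == '__main__':
    print(list(normalize_tokens('Settled, settled. \u201cSeventeen,\u201d')))
    # prints ['settled', 'settled', 'seventeen']
```

This only trims the ends of each token, so a token like `shabby\u2014nobody`, where an em dash joins two words, would still be counted as a single word.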