In [17]:
!pip install mrjob==0.7.4



If there are no errors above, then mrjob is properly installed and ready to use.  Let's create a simple MapReduce program to test it.  The %%file magic at the top of the next cell saves the contents of the cell into a file named wordcount.py so that we can execute it later.

+++Remarks by Hui Lin 02/02/2025:

mrjob is a MapReduce framework for Python that is used to write and run MapReduce tasks.

MRJob is the base class, and user-defined tasks need to inherit from it.

re is Python's regular expression module; it is used here to split each line into words.
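
As a rough illustration of what happens behind the scenes, the sketch below imitates the map, shuffle, and reduce phases of a word count in plain Python, without mrjob. The function names (map_line, shuffle, reduce_word) and the sample lines are made up for this example and are not part of the mrjob API.

```python
from collections import defaultdict


def map_line(line):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


lines = ["the quick fox", "the lazy dog"]  # hypothetical input
pairs = [pair for line in lines for pair in map_line(line)]
result = dict(reduce_word(word, counts) for word, counts in shuffle(pairs).items())
print(result)
```

In a real job, the framework takes care of the shuffle, so only the mapper and the reducer have to be written.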

In [3]:
%%file wordcount.py
from mrjob.job import MRJob
import re

class WordCount(MRJob):
    def mapper(self, key, value):    # Map() for spliting the words line by line, "key"-the line Number, and "value"-every single line content.
      words = [s.strip() for s in re.split('[\s]', value) if s]
      for word in words:
        yield word, 1

    def reducer(self, key, values):   # Reduce() for sum up. "key"-distined word, "values"-total number of key
        yield key, sum(values)

if __name__ == '__main__':
     WordCount.run()

Overwriting wordcount.py


Now that the code is saved to a file, we can run it.  This will run it locally (not on Hadoop) and process any file you pass in as the first argument.  The result will simply print to the console.

In [5]:
!python wordcount.py wordcount.py

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount.root.20250203.053813.239845
Running step 1 of 1...
job output is in /tmp/wordcount.root.20250203.053813.239845/output
Streaming final output from /tmp/wordcount.root.20250203.053813.239845/output...
"#"	2
"'__main__':"	1
"1"	1
"="	1
"=="	1
"MRJob"	1
"Map()"	1
"Number,"	1
"Reduce()"	1
"WordCount(MRJob):"	1
"WordCount.run()"	1
"[s.strip()"	1
"\"key\"-distined"	1
"\"key\"-the"	1
"\"value\"-every"	1
"\"values\"-total"	1
"__name__"	1
"and"	1
"by"	1
"class"	1
"content."	1
"def"	2
"for"	4
"from"	1
"if"	2
"import"	2
"in"	2
"spliting"	1
"sum"	1
"sum(values)"	1
"the"	1
"up."	1
"value)"	1
"value):"	1
"values):"	1
"word"	1
"word,"	2
"words"	2
"words:"	1
"yield"	2
"key"	1
"key,"	3
"line"	3
"line,"	1
"mapper(self,"	1
"mrjob.job"	1
"number"	1
"of"	1
"re"	1
"re.split('[\\s]',"	1
"reducer(self,"	1
"s"	1
"s]"	1
"single"	1
Removing temp directory /tmp/wordcount.root.20250203.

As you can see, it lists every distinct whitespace-separated token in the source code and how often each one occurred.
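
The output above counts "word" and "word," as two different entries because the mapper splits on whitespace only, so punctuation stays attached to the token. A small sketch of that behaviour, using the same splitting expression on a made-up line:

```python
import re
from collections import Counter

# Hypothetical input line, chosen so that "word" appears with and without a comma.
line = "yield word, 1  # count each word"

# Same tokenization as the mapper in wordcount.py: split on single whitespace
# characters and drop the empty strings produced by consecutive spaces.
tokens = [s.strip() for s in re.split(r"[\s]", line) if s]
counts = Counter(tokens)

print(tokens)
print(counts["word"], counts["word,"])
```

This is the motivation for the normalization step in question 2 of the assignment.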

In [4]:
# Go to https://www.gutenberg.org/browse/scores/top#books-last7 (most downloaded ebooks from Project Gutenberg) and download the top 5 English ebooks. Make sure you download the plain text version. These will be your dataset.

# Make sure you make a copy for yourself, otherwise, you cannot save your changes.

# 1) Run the wordcount.py program on the dataset. Make sure to run the program on the entire dataset, not on individual files (just list all filenames on the command line).
# 2) (Text normalization & stop word filtering)
#  2-1 Write a MapReduce job that normalizes text by handling punctuation, capitalization, and special characters, ensuring that "Word", "word", and "word!", for example, are counted as the same word.
#  2-2 Introduce a step in the mapper function to filter out common stop words (e.g. "a", "and", "the", etc.) before counting word frequencies, to focus on more meaningful content.
# 3) Based on requirement 2, write a MapReduce job that, for each character in the English alphabet, calculates the number of unique words that start with that character (use the multi-step functionality: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#multi-step-jobs)
# 4) Based on requirement 2, write a MapReduce job that, for each character in the English alphabet, calculates the number of words that start with that character and have a frequency of more than 100 in the entire dataset

# Pay special attention to the last two questions. Can you solve them with just one map step and one reduce step?

# Go to the end of the notebook for the assignment.

# Submit the link to your Colab notebook and a screenshot of the output for each question. Make sure your notebook is publicly accessible (look for the share button at the top right of your Colab notebook).


In [8]:
# Upload the top 5 ebooks
from google.colab import files

uploaded = files.upload()

In [12]:
# 1) Try the whole dataset
!python wordcount.py Frankenstein.txt Middlemarch.txt Moby_Dick.txt Romeo_and_Juliet.txt Simple_Sabotage_Field_Manual.txt > Answer1.txt
# check the result of question 1.
!cat Answer1.txt

Streaming output truncated to the last 5000 lines.
"reeling"	3
"reeling,"	1
"reelingly"	1
"reelman"	1
"reels"	2
"reels,"	1
"reeving"	1
"refectory"	1
"refer"	17
"reference"	18
"reference,"	2
"reference."	1
"references"	14
"referred"	16
"referred."	1
"referring"	8
"refers"	1
"refill!"	1
"refill"	1
"refill;"	1
"refined"	6
"refined,"	1
"refinement"	4
"refinement,"	1
"refinement."	1
"refinements"	2
"refinements,"	1
"refining"	1
"refiningly"	1
"refinishing"	1
"reflect"	28
"reflect,"	4
"reflect;"	1
"reflected"	33
"reflected,"	3
"reflecting"	6
"reflection"	21
"reflection,"	8
"reflection."	5
"reflections"	19
"reflections,"	5
"reflections."	3
"reflections?"	1
"reflective"	1
"reflectively,"	2
"reflectively;"	1
"reflectiveness,"	1
"reflects"	1
"reflex"	3
"refluent"	1
"reform"	14
"reform,"	4
"reformed"	2
"reformer"	2
"reformer.\u201d"	1
"reforming"	6
"reforms!\u201d"	1
"reforms"	4
"reforms,\u2014though"	1
"reforms;"	1
"refrain"	4
"refrained"	7
"refrained."	1
"refreshed"	3
"refreshed."	1
"refreshing"	1
"refre

In [13]:
%%file wordcount2.py
# 2) MapReduce job with text normalization and stop word filtering
from mrjob.job import MRJob
import re


class WordCount2(MRJob):
    # Stop words to leave out of the counts
    FLT_WORDS = {
        "a", "an", "the"
    }

    def mapper(self, key, value):
        # Lowercase the line and replace every character that is neither a letter nor whitespace with a space
        text = re.sub(r'[^a-zA-Z\s]', ' ', value.lower())
        # Split on runs of whitespace
        words = re.split(r'\s+', text.strip())
        for word in words:
            # Skip empty strings and stop words
            if word and word not in self.FLT_WORDS:
                yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    WordCount2.run()


Writing wordcount2.py
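
To see what the normalization in wordcount2.py does to a single line, the sketch below applies the same two regular expressions outside of mrjob. The helper name normalize, the STOP_WORDS constant, and the sample sentence are assumptions made for this example only.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # same three stop words as in wordcount2.py


def normalize(line, stop_words=STOP_WORDS):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = re.sub(r"[^a-zA-Z\s]", " ", line.lower())
    words = re.split(r"\s+", text.strip())
    return [word for word in words if word and word not in stop_words]


sample = 'The Word, a word, and "word!" don\'t differ.'
print(normalize(sample))
```

"Word", "word," and "word!" all collapse to "word". One side effect of replacing every non-letter with a space is that the apostrophe in "don't" splits the token into "don" and "t", and "and" survives because it is not in the stop word set.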


In [14]:
# 2) try the new map-reduce()
!python wordcount2.py Frankenstein.txt Middlemarch.txt Moby_Dick.txt Romeo_and_Juliet.txt Simple_Sabotage_Field_Manual.txt > Answer2.txt
# check the result of question 2.
!cat Answer2.txt

Streaming output truncated to the last 5000 lines.
"painful"	61
"painfully"	15
"painfulness"	1
"pains"	37
"painstaking"	2
"paint"	13
"painted"	25
"painter"	14
"painters"	3
"painting"	22
"paintings"	7
"paints"	2
"pair"	24
"pairs"	6
"palace"	9
"palaces"	3
"palate"	2
"palavering"	2
"pale"	78
"paled"	1
"paleness"	8
"paler"	5
"pales"	1
"palest"	1
"paley"	1
"palings"	1
"palisades"	1
"palisading"	1
"pall"	3
"pallet"	1
"palliate"	1
"palliative"	1
"pallid"	8
"pallidness"	2
"pallor"	8
"palm"	10
"palma"	1
"palmed"	2
"palmers"	2
"palms"	16
"palmy"	3
"palpable"	5
"palpableness"	1
"palpably"	1
"palpitate"	1
"palpitated"	1
"palpitating"	3
"palpitation"	2
"palsied"	2
"palsy"	2
"paltry"	6
"paly"	1
"pampas"	1
"pampered"	1
"pamphlet"	4
"pamphlets"	8
"pan"	5
"pand"	1
"pandects"	1
"panegyric"	1
"panel"	1
"panelled"	2
"panellings"	1
"panels"	2
"panes"	4
"pang"	13
"pangs"	4
"panic"	9
"panics"	1
"pannangians"	1
"panniers"	1
"panorama"	1
"panoramas"	1
"pans"	6
"pantaloon"	1
"pantaloons"	3
"pantheistic"	2
"pantheists"	1


In [21]:
%%file wordcount3.py
# 3) Multi-step job: number of unique words per first letter
from mrjob.job import MRJob
from mrjob.step import MRStep
import re


class WordCount3(MRJob):
    # Stop words to leave out of the counts
    FLT_WORDS = {
        "a", "an", "the"
    }

    def mapper(self, key, value):
        # Lowercase the line and replace every character that is neither a letter nor whitespace with a space
        text = re.sub(r'[^a-zA-Z\s]', ' ', value.lower())
        # Split on runs of whitespace
        words = re.split(r'\s+', text.strip())
        for word in words:
            # Skip empty strings and stop words
            if word and word not in self.FLT_WORDS:
                yield word, 1

    def reducer_first_letter(self, word, counts):
        # Called once per unique word; the counts are ignored because only uniqueness matters.
        # New key-value pair: (first letter, word). Words are already lowercase.
        yield word[0], word

    def mapper_letter(self, letter, word):
        yield letter, 1                   # one record per unique word

    def reducer_letter(self, letter, counts):
        yield letter, sum(counts)         # number of unique words per first letter

    def steps(self):     # chain the two steps
        return [
            MRStep(mapper=self.mapper, reducer=self.reducer_first_letter),
            MRStep(mapper=self.mapper_letter, reducer=self.reducer_letter)
        ]


if __name__ == '__main__':
    WordCount3.run()

Overwriting wordcount3.py
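
The two steps of wordcount3.py can be traced by hand on a tiny input. The following is only a plain-Python sketch of the data flow, not mrjob code; group_by_key and the word list are hypothetical stand-ins for the shuffle phase and the normalized mapper output.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Stand-in for the shuffle: collect the values of each key in a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


words = ["apple", "ant", "apple", "bee", "bat", "bee", "cat"]  # hypothetical input

# Step 1: map to (word, 1); the reducer sees each word once and emits (first letter, word).
step1_pairs = [(word, 1) for word in words]
step1_out = [(word[0], word) for word in group_by_key(step1_pairs)]

# Step 2: map to (letter, 1); the reducer sums the ones for each letter.
step2_pairs = [(letter, 1) for letter, _ in step1_out]
unique_per_letter = {letter: sum(ones) for letter, ones in group_by_key(step2_pairs).items()}
print(unique_per_letter)
```

The duplicates of "apple" and "bee" disappear in step 1 because grouping by word leaves one group per distinct word.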


In [22]:
# 3) try the multi-step procedure
!python wordcount3.py Frankenstein.txt Middlemarch.txt Moby_Dick.txt Romeo_and_Juliet.txt Simple_Sabotage_Field_Manual.txt > Answer3.txt
# check the result of question 3.
!cat Answer3.txt


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount3.root.20250203.065639.892337
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/wordcount3.root.20250203.065639.892337/output
Streaming final output from /tmp/wordcount3.root.20250203.065639.892337/output...
Removing temp directory /tmp/wordcount3.root.20250203.065639.892337...
"a"	1570
"b"	1460
"c"	2442
"d"	1661
"e"	1142
"f"	1190
"g"	761
"h"	957
"i"	1189
"j"	254
"k"	155
"l"	886
"t"	1297
"u"	893
"v"	462
"w"	816
"x"	42
"y"	73
"z"	26
"m"	1285
"n"	443
"o"	583
"p"	2036
"q"	134
"r"	1393
"s"	2963
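
The letters above are not in alphabetical order, presumably because the final output is streamed from several part files one after another. Each line is a JSON key, a tab, and a JSON value, so the result can be parsed and sorted afterwards. A minimal sketch, using three lines copied from the output above; parse_line is a hypothetical helper:

```python
import json

# Three lines taken from the output above, in the order they were printed.
raw_output = '"l"\t886\n"t"\t1297\n"m"\t1285\n'


def parse_line(line):
    """Split one output line into its JSON-decoded key and value."""
    key, value = line.split("\t", 1)
    return json.loads(key), json.loads(value)


rows = sorted(parse_line(line) for line in raw_output.splitlines() if line)
for letter, count in rows:
    print(letter, count)
```

The same approach should work on the saved Answer3.txt file by reading its lines instead of the inline string.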


In [23]:
%%file wordcount4.py
# 4) Multi-step job: per first letter, number of words that occur more than 100 times
from mrjob.job import MRJob
from mrjob.step import MRStep
import re


class WordCount4(MRJob):
    # Stop words to leave out of the counts
    FLT_WORDS = {
        "a", "an", "the"
    }

    def mapper(self, key, value):
        # Lowercase the line and replace every character that is neither a letter nor whitespace with a space
        text = re.sub(r'[^a-zA-Z\s]', ' ', value.lower())
        # Split on runs of whitespace
        words = re.split(r'\s+', text.strip())
        for word in words:
            # Skip empty strings and stop words
            if word and word not in self.FLT_WORDS:
                yield word, 1

    def reducer_count_freq(self, word, counts):
        # Total frequency of this word across the whole dataset
        total_count = sum(counts)
        if total_count > 100:  # keep only words that occur more than 100 times
            yield word[0], (word, total_count)  # (first letter, (word, frequency))

    def mapper_by_letter(self, letter, word_count):
        # Pass (first letter, (word, frequency)) through unchanged
        yield letter, word_count

    def reducer_by_letter(self, letter, word_counts):
        # Count how many frequent words share this first letter
        yield letter, sum(1 for _ in word_counts)

    def steps(self):     # chain the two steps
        return [
            MRStep(mapper=self.mapper, reducer=self.reducer_count_freq),
            MRStep(mapper=self.mapper_by_letter, reducer=self.reducer_by_letter)
        ]


if __name__ == '__main__':
    WordCount4.run()

Writing wordcount4.py
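
The logic of wordcount4.py (count each word, keep the frequent ones, then count them per first letter) can be checked on a small list in plain Python. This is a sketch only: the function frequent_words_per_letter, its min_count parameter, and the sample words are assumptions for illustration, and the threshold is lowered from 100 to 1 so that a tiny input gives a visible result.

```python
from collections import Counter


def frequent_words_per_letter(words, min_count):
    """Per first letter, count the distinct words that occur more than min_count times."""
    word_freq = Counter(words)  # corresponds to the first step: frequency per word
    letters = Counter(
        word[0] for word, freq in word_freq.items() if freq > min_count
    )  # corresponds to the second step: frequent words per first letter
    return dict(letters)


sample = ["sea"] * 3 + ["ship"] * 2 + ["sail"] + ["whale"] * 4  # hypothetical input
print(frequent_words_per_letter(sample, min_count=1))
```

A letter with no word above the threshold does not appear in the result at all, which is why some letters can be missing from the job output.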


In [24]:
# 4) try the multi-step procedure
!python wordcount4.py Frankenstein.txt Middlemarch.txt Moby_Dick.txt Romeo_and_Juliet.txt Simple_Sabotage_Field_Manual.txt > Answer4.txt
# check the result of question 4.
!cat Answer4.txt

# For the question "Can you solve them with just one map and one reduce step?":
# Based on the procedure above, I think the answer is no: both questions group first by word and then by first letter, which takes two steps here.


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount4.root.20250203.073318.828026
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/wordcount4.root.20250203.073318.828026/output
Streaming final output from /tmp/wordcount4.root.20250203.073318.828026/output...
Removing temp directory /tmp/wordcount4.root.20250203.073318.828026...
"a"	35
"b"	35
"c"	36
"d"	23
"e"	23
"f"	39
"g"	20
"h"	38
"i"	16
"j"	5
"k"	7
"l"	35
"m"	37
"n"	17
"u"	10
"v"	7
"w"	45
"y"	8
"o"	23
"p"	23
"q"	4
"r"	17
"s"	71
"t"	53
