<a href="https://colab.research.google.com/github/KevinYih/BigDataDemo/blob/main/word_count_ebooks_kevin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install mrjob==0.7.4

Collecting mrjob==0.7.4
  Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.6/439.6 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4


If there are no errors above, then MRJob is properly installed in the system and ready to use.

In [60]:
%%file wordcount2.py
#2) (Text normalization & Stop word filtering)
# 2-1 Write a map-reduce to normalize text by handling punctuation, capitalization, and special characters, ensuring that "Word", "word", and "word!", for example, are counted as the same word.
# 2-2 Introduce a step in mapper function to filter out common stop words (e.g. "a", "and", "the", etc.) before counting word frequencies, to focus on more meaningful content.
from mrjob.job import MRJob
import re

class WordCount(MRJob):
    # Common stop words
    STOP_WORDS = set(["a", "an", "and", "the", "in", "on", "at", "for", "with", "is", "it", "this", "that", "there", "to", "by"])

    def mapper(self, key, value):
        # Normalize words: lowercasing and removing non-alphanumeric characters
        words = re.findall(r'\b[a-zA-Z]+\b', value.lower())

        # Emit each word unless it's a stop word
        for word in words:
            if word not in self.STOP_WORDS:
                yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
     WordCount.run()

Overwriting wordcount2.py


In [52]:
%%file wordcount3.py
#3) Based on requirement 2, write a map-reduce job that for each character in the English alphabet calculates the number of unique words that start with that character (use the multi-step functionality: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#multi-step-jobs)
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

class WordCount(MRJob):
    # Common stop words
    STOP_WORDS = set(["a", "an", "and", "the", "in", "on", "at", "for", "with", "is", "it", "this", "that", "there", "to", "by"])

    def mapper(self, key, value):
        # Normalize words: lowercasing and removing non-alphanumeric characters
        # words = re.findall(r'\b\w+\b', value.lower())
        words = re.findall(r'\b[a-zA-Z]+\b', value.lower())

        # Emit each word unless it's a stop word   ,,, and word[0].isalpha()
        for word in words:
            if word not in self.STOP_WORDS:
                yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)


    def mapper_get_initial(self, word, count):
        # Emit the initial character and the word
        initial = word[0]
        yield initial, word

    def reducer_count_unique_words(self, initial, words):
        unique_words = set(words)
        yield initial, len(unique_words)

    def steps(self):
        return [
            MRStep(mapper=self.mapper, reducer=self.reducer),
            MRStep(mapper=self.mapper_get_initial, reducer=self.reducer_count_unique_words)
        ]

if __name__ == '__main__':
     WordCount.run()

Overwriting wordcount3.py


In [55]:
%%file wordcount4.py
#4) Based on requirement 2, write a map-reduce job that for each character in the English alphabet calculates the number of words that start with that character and have a frequency of more than 100 in the entire dataset
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

class WordCount(MRJob):
    # Common stop words
    STOP_WORDS = set(["a", "an", "and", "the", "in", "on", "at", "for", "with", "is", "it", "this", "that", "there", "to", "by"])

    def mapper(self, key, value):
        # Normalize words: lowercasing and removing non-alphanumeric characters
        words = re.findall(r'\b[a-zA-Z]+\b', value.lower())

        # Emit each word unless it's a stop word
        for word in words:
            if word not in self.STOP_WORDS:
                yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)

    def mapper_filter_words(self, word, count):
        if count > 1:
            yield word[0], 1

    def reducer_count_initial_letters(self, initial, counts):
        yield initial, sum(counts)


    def steps(self):
        return [
            MRStep(mapper=self.mapper, reducer=self.reducer),
            MRStep(mapper=self.mapper_filter_words, reducer=self.reducer_count_initial_letters)
        ]

if __name__ == '__main__':
     WordCount.run()

Overwriting wordcount4.py


In [59]:
!python wordcount4.py PrideAndPrejudice.txt RomeoAndJuliet.txt TheModernPrometheus.txt TheWhale.txt ToThePersonSittingInDarkness.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount4.root.20240528.092610.581608
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/wordcount4.root.20240528.092610.581608/output
Streaming final output from /tmp/wordcount4.root.20240528.092610.581608/output...
"a"	859
"b"	728
"c"	1240
"d"	853
"m"	628
"n"	228
"o"	283
"p"	1021
"q"	64
"r"	723
"s"	1570
"e"	614
"f"	623
"g"	401
"h"	537
"i"	611
"j"	127
"k"	92
"l"	453
"t"	640
"u"	321
"v"	215
"w"	464
"x"	23
"y"	50
"z"	8
Removing temp directory /tmp/wordcount4.root.20240528.092610.581608...


This will run it locally (not on Hadoop) and process any file you pass in as the first argument.

In [23]:
# upload top 5 ebook
from google.colab import files

uploaded = files.upload()

Saving PrideAndPrejudice.txt to PrideAndPrejudice.txt
Saving RomeoAndJuliet.txt to RomeoAndJuliet.txt
Saving TheModernPrometheus.txt to TheModernPrometheus.txt
Saving TheWhale.txt to TheWhale.txt
Saving ToThePersonSittingInDarkness.txt to ToThePersonSittingInDarkness.txt


Upload top 5 ebook.



In [None]:
%pwd

'/content'