# MapReduce Assignment


### Extracting pages from the full PDF into `file1.txt` and `file2.txt`


Birthdate 07/14/2001 

Target book : `Harry Potter and the Deathly Hallows – J.K. Rowling`

Pages for file1 : `5966` to `5983` (pages 15 to 24 of the book)

Pages for file2 : `6104` to `6120` (pages 102 to 111 of the book)

In [2]:
!pip3 install pypdf

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.


In [3]:
from pypdf import PdfReader

Creating a function to select the required pages for file1 and file2 from the whole PDF

In [4]:
def generate_files(pdf, f1_start, f1_end, f2_start, f2_end):
    # Page indices follow Python slice rules: 0-based start, end excluded
    reader = PdfReader(pdf)
    f1 = reader.pages[f1_start:f1_end]
    f2 = reader.pages[f2_start:f2_end]
    return f1, f2


In [5]:
f1, f2 = generate_files('Harry_Potter_(www.ztcprep.com).pdf', 5966, 5983, 6104, 6120)
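
Slicing `reader.pages[start:end]` follows ordinary Python slice rules, so the start index is 0-based and the end index is excluded. The sketch below shows one way to turn a 1-based, inclusive page range into such a slice. `page_slice` is a hypothetical helper that is not part of the notebook, and a plain list stands in for `reader.pages`; whether the page numbers listed above are 0-based or 1-based is an assumption to check against the PDF itself.

```python
def page_slice(first_page, last_page):
    """Return a slice covering 1-based pages first_page..last_page, inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Shift the start to 0-based; the inclusive end becomes the exclusive stop.
    return slice(first_page - 1, last_page)


pages = list(range(1, 11))          # stand-in for reader.pages (pages 1..10)
selected = pages[page_slice(3, 5)]  # pages 3, 4 and 5
print(selected)
```

Without a conversion like this, a call such as `pages[3:5]` would return pages 4 and 5 only.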

Creating file1 and file2 from obtained pages

In [6]:
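def generate_txt_files(mapping):
    # mapping: output filename -> sequence of pypdf page objects
    for filename, pages in mapping.items():
        with open(filename, "w", encoding="utf-8") as file:
            # Join the extracted text of all pages with a single space
            file.write(' '.join(page.extract_text() for page in pages))
        print(f"Created file: {filename}")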

In [7]:
generate_txt_files({'file1.txt' : f1, 'file2.txt' : f2})

Created file: file1.txt
Created file: file2.txt


##### Now installing required libraries

In [8]:
!pip3 install mrjob pyenchant

Defaulting to user installation because normal site-packages is not writeable
Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Collecting pyenchant
  Downloading pyenchant-3.2.2-py3-none-any.whl (55 kB)
Collecting PyYAML>=3.10
  Downloading PyYAML-6.0.2-cp39-cp39-macosx_11_0_arm64.whl (172 kB)
Installing collected packages: PyYAML, pyenchant, mrjob
Successfully installed PyYAML-6.0.2 mrjob-0.7.4 pyenchant-3.2.2
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.


#### PART 1 - Word count

Creating a MapReduce job to count occurrences of each word

In [9]:
%%file count_words.py
from mrjob.job import MRJob
import re

class CountWords(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every lowercase alphabetic token on the line
        for word in self.tokenize(line):
            yield (word, 1)

    def reducer(self, word, counts):
        total_count = sum(counts)
        yield (word, total_count)

    def tokenize(self, line):
        return re.findall(r'\b[a-z]+\b', line.lower())

if __name__ == '__main__':
    CountWords.run()

Overwriting count_words.py
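
To make the data flow of the job easier to follow, here is a minimal sketch of the same map, group, and reduce steps in plain Python, without mrjob. It reuses the regular expression from `tokenize`; the names `map_line` and `run_word_count` and the sample lines are assumptions made for illustration, and the grouping dictionary only imitates the shuffle step that the framework performs between mapper and reducer.

```python
import re
from collections import defaultdict


def map_line(line):
    """Mapper: emit (word, 1) for each lowercase alphabetic token."""
    for word in re.findall(r'\b[a-z]+\b', line.lower()):
        yield word, 1


def run_word_count(lines):
    """Group mapper output by key, then sum each group (the reducer)."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_line(line):
            grouped[word].append(count)  # imitates the shuffle/sort phase
    return {word: sum(counts) for word, counts in grouped.items()}


sample = ["The boy who lived", "the Boy"]  # made-up input lines
result = run_word_count(sample)
print(result)
```

Note that this pattern splits on apostrophes, so a token such as `harry's` is counted as `harry` and `s`.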


Running the job

In [10]:
!python3 count_words.py file1.txt > file1_output.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/t7/042v1y015bl6hfs4h8xjqgqr0000gn/T/count_words.drashi.20241003.230457.894507
Running step 1 of 1...
job output is in /var/folders/t7/042v1y015bl6hfs4h8xjqgqr0000gn/T/count_words.drashi.20241003.230457.894507/output
Streaming final output from /var/folders/t7/042v1y015bl6hfs4h8xjqgqr0000gn/T/count_words.drashi.20241003.230457.894507/output...
Removing temp directory /var/folders/t7/042v1y015bl6hfs4h8xjqgqr0000gn/T/count_words.drashi.20241003.230457.894507...
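
The redirected output can then be inspected. With mrjob's default output protocol, each line should hold a JSON-encoded key, a tab, and a JSON-encoded value (for example `"the"`, tab, `42`). The sketch below parses lines in that shape and lists the most frequent words. The helper names and the sample lines are hypothetical, and the format is an assumption to confirm against the actual `file1_output.txt`.

```python
import json


def parse_mrjob_output(lines):
    """Parse lines of the form <json key> TAB <json value> into a dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        counts[json.loads(key)] = json.loads(value)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample lines, not taken from the real job output
sample_output = ['"harry"\t12\n', '"the"\t40\n', '"wand"\t3\n']
counts = parse_mrjob_output(sample_output)
top = top_words(counts, n=2)
print(top)
```

For the real file, the same functions could be applied to `open('file1_output.txt', encoding='utf-8')` in place of `sample_output`.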


#### PART 2 - Non-English Word Frequency

Creating a MapReduce job to count only non-English words, that is, words that fail the `en_US` dictionary check

In [11]:
%%file invalid_word_frequency_analyzer.py
from mrjob.job import MRJob
import re
import enchant

class InvalidWordFrequencyAnalyzer(MRJob):

    def __init__(self, *args, **kwargs):
        super(InvalidWordFrequencyAnalyzer, self).__init__(*args, **kwargs)
        self.english_dict = enchant.Dict("en_US")

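    def mapper(self, _, line):
        words = self.tokenize(line.lower())
        for word in words:
            # Emit only words the en_US dictionary does not recognize
            if self.is_non_english_word(word):
                yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

    def tokenize(self, text):
        # \w+ also matches digits and underscores, unlike the [a-z]+ pattern in Part 1
        return re.findall(r'\b\w+\b', text)

    def is_non_english_word(self, word):
        # Skip single characters; keep words that fail the dictionary check
        return len(word) > 1 and not self.english_dict.check(word)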

if __name__ == '__main__':
    InvalidWordFrequencyAnalyzer.run()

Overwriting invalid_word_frequency_analyzer.py
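
The filtering rule of this job can be tried out without the enchant library. In the sketch below a small hard-coded set stands in for the `en_US` dictionary; `KNOWN_WORDS`, `non_english_words`, and the sample sentence are hypothetical and exist only to show which tokens the mapper would emit.

```python
import re

# Hypothetical stand-in for enchant.Dict("en_US"); a real dictionary is far larger
KNOWN_WORDS = {"the", "wand", "chose", "wizard"}


def non_english_words(line, is_english=KNOWN_WORDS.__contains__):
    """Return tokens longer than one character that fail the dictionary check."""
    tokens = re.findall(r'\b\w+\b', line.lower())
    return [w for w in tokens if len(w) > 1 and not is_english(w)]


flagged = non_english_words("The wand chose the wizard, Expelliarmus!")
print(flagged)
```

Passing `enchant.Dict("en_US").check` as `is_english` would reproduce the check used in the job, assuming the enchant library loads correctly.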


Running the job

In [12]:
!python3 invalid_word_frequency_analyzer.py file2.txt > file2_output.txt

Traceback (most recent call last):
  File "/Users/drashi/Documents/GitHub/DATA-603/assignment-2/invalid_word_frequency_analyzer.py", line 3, in <module>
    import enchant
  File "/Users/drashi/Library/Python/3.9/lib/python/site-packages/enchant/__init__.py", line 81, in <module>
    from enchant import _enchant as _e
  File "/Users/drashi/Library/Python/3.9/lib/python/site-packages/enchant/_enchant.py", line 157, in <module>
    raise ImportError(msg)
ImportError: The 'enchant' C library was not found and maybe needs to be installed.
See  https://pyenchant.github.io/pyenchant/install.html
for details
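
The traceback shows that the `pyenchant` Python package was installed but the underlying enchant C library was not found, so the job stopped at `import enchant` and never ran. The shell redirect still creates `file2_output.txt`, which is therefore empty. The installation page linked in the error message describes how to install the C library; on macOS this is commonly done through a system package manager such as Homebrew, though that is an assumption about this particular machine. As a minimal sketch, the import can be guarded so that the problem is reported clearly before a job is launched. `load_english_checker` is a hypothetical helper, not part of the notebook.

```python
def load_english_checker(lang="en_US"):
    """Return enchant's check function for lang, or None if enchant is unusable."""
    try:
        # Raises ImportError when the package or the enchant C library is missing
        import enchant
    except ImportError as exc:
        print(f"enchant is not usable: {exc}")
        return None
    try:
        return enchant.Dict(lang).check
    except enchant.errors.DictNotFoundError as exc:
        # The library loaded, but no dictionary for this language is installed
        print(f"No dictionary available for {lang}: {exc}")
        return None


checker = load_english_checker()
if checker is None:
    print("Install the enchant C library and an en_US dictionary, then rerun the job.")
else:
    print("enchant is ready; 'hello' is English:", checker("hello"))
```

Running this check in a notebook cell before the MapReduce step makes it clear whether the missing dependency has been resolved.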

