# **Word-Analysis on Harry Potter Books**
## *1. Count the number of occurrences of a word*

In this part of the work, a MapReduce-style technique is used to count the occurrences of each word in the selected pages.

For simplicity, the book and the pages are selected using the following rules:

- The book: The book number is selected using the month of the programmer's birthday. For example, I selected book no. 4 (The Goblet of Fire) because my birth month is August (08).
- The pages: The pages used in this word count are selected using the day of birth. For example, I selected pages 3 to 12 because my day of birth is 03.

*Date of Birth: 03rd August 1999*

[*Data Source*](https://ztcprep.com/library/story/Harry_Potter/Harry_Potter_(www.ztcprep.com).pdf)

In [1]:
import re #the regex library is important in text processing

In [2]:
#!pip install PyPDF2
import PyPDF2 #this library extracts the text from the PDF file

In [3]:
'''
The Harry_Potter.pdf file is opened in binary mode.
Each required page is then fetched as a page object in the 'page' variable (page indices are zero-based).
After that, the text of each page is extracted and a few regex filters are applied to remove the watermark and the page footer.
Finally, the cleaned string is appended to the file named 'file1.txt'.
Because the file is opened in append mode, re-running this cell adds the same pages again unless 'file1.txt' is deleted first.
'''
with open('Harry_Potter.pdf', 'rb') as hp_book:
    book_reader = PyPDF2.PdfReader(hp_book)
    for page_number in range(1812,1822):
        page = book_reader.pages[page_number]
        page_text = page.extract_text().replace("\n"," ")
        page_text = re.sub(r'www\.ztcprep\.com'," ",page_text)
        page_text = re.sub(r'P a g e  \| \d{1,2} Harry Potter and the Goblet of Fir e – J\.K\. Rowling', " ",page_text)
        with open('file1.txt', 'a') as file1:
            file1.write(page_text)
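
The two `re.sub` filters can be tried on a short string before running them over the book. The sketch below assumes a made-up sample line that imitates the watermark and footer format targeted by the patterns in the cell above; `clean_page` is a hypothetical helper name, not part of PyPDF2.

```python
import re

# Patterns copied from the extraction cell above.
WATERMARK = r'www\.ztcprep\.com'
FOOTER = r'P a g e  \| \d{1,2} Harry Potter and the Goblet of Fir e – J\.K\. Rowling'

def clean_page(page_text):
    """Hypothetical helper: flatten newlines, then blank out watermark and footer."""
    page_text = page_text.replace("\n", " ")
    page_text = re.sub(WATERMARK, " ", page_text)
    page_text = re.sub(FOOTER, " ", page_text)
    return page_text

# Made-up sample imitating one extracted page (not real output from the PDF).
sample = ("the villagers cared about\nwww.ztcprep.com "
          "P a g e  | 3 Harry Potter and the Goblet of Fir e – J.K. Rowling")
cleaned = clean_page(sample)
print(cleaned)
```

Only the story text should remain, followed by the spaces that replaced the watermark and footer. If the footer survives, the spacing in the pattern probably does not match what the PDF extraction produced.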

In [4]:
# We read the text file to get a look at it
with open('file1.txt','r') as book:
    book_reader = book.read()   
book_reader

'the villagers cared about was the identity of their murderer — for plainly , three apparently healthy people did not all drop dead of natural causes on the same night. The Hanged Man, the village pub, did a roaring trade that night; the whole village seemed to have turned out to discuss the murders. They were rewarded for leaving their firesides when the Riddles’ cook arrived dramatically in their midst and announced to the suddenly silent pub that a man called Frank Bryce had just been arrested. “Frank!” cried several people. “Never!” Frank Bryce was the Riddles’ gardener . He lived alone in a rundown cottage on the grounds of the Riddle House. Frank had come back from the war with a very stif f leg and a great dislike of crowds and loud noises, and had been working for the Riddles ever since.     There was a rush to buy the cook drinks and hear more details. “Always thought he was odd,” she told the eagerly listening villagers, after her fourth sherry . “Unfriendly , like. I’m sure 

### MapReduce Job

In [5]:
'''
The Mapper function finds all the words in the string (after lowercasing it) and converts each one into a key-value pair of the format (word, 1).
It is a generator: each pair is yielded one at a time, to be consumed by the reducer function.
'''
def mapper_job(text):
    words = re.findall(r'\w+', text.lower())
    for word in words:
        #print(word)
        yield word, 1

'''
The Reducer function counts how many times each item yielded by the mapper function occurs.
Each time a new item is encountered, a new dictionary element is created; otherwise 1 is added to the existing value of the matching key.
Note that each item is the whole (word, 1) tuple, so the dictionary keys are tuples such as ('the', 1).
'''
def reduce_job(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word]+=1
            #print(word_count)
        else:
            word_count[word] = 1
            #print(word_count)
    return word_count
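
Since `reduce_job` uses each yielded item as the dictionary key, and the mapper yields `(word, 1)` tuples, the keys of the result are tuples such as `('the', 1)`. The sketch below is a variant, not the method used in this notebook: it unpacks each pair and sums the counts, so the keys are plain words. The name `reduce_pairs` and the sample sentence are assumptions made for illustration.

```python
import re
from collections import defaultdict

def mapper_job(text):
    """Yield a (word, 1) pair for every lowercased word in the text."""
    for word in re.findall(r'\w+', text.lower()):
        yield word, 1

def reduce_pairs(pairs):
    """Hypothetical variant reducer: sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_pairs(mapper_job("The cat saw the other cat"))
print(counts)  # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}
```

Summing the emitted count, rather than adding 1 per item, also keeps the reducer correct if a mapper ever emits partial counts larger than 1.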

In [6]:
# We call the mapper_job() and reduce_job() functions passing the appropriate arguments
word_count = {}
words = []
with open('file1.txt','r') as wordCountFile:
    text_reader = wordCountFile.read()
    words= mapper_job(text_reader)
    word_count = reduce_job(words) #the final result is stored as a dictionary in word_count variable

In [7]:
len(word_count)

637

In [8]:
word_count

{('the', 1): 150,
 ('villagers', 1): 3,
 ('cared', 1): 1,
 ('about', 1): 5,
 ('was', 1): 27,
 ('identity', 1): 1,
 ('of', 1): 45,
 ('their', 1): 7,
 ('murderer', 1): 2,
 ('for', 1): 14,
 ('plainly', 1): 1,
 ('three', 1): 2,
 ('apparently', 1): 1,
 ('healthy', 1): 1,
 ('people', 1): 3,
 ('did', 1): 6,
 ('not', 1): 4,
 ('all', 1): 7,
 ('drop', 1): 1,
 ('dead', 1): 2,
 ('natural', 1): 1,
 ('causes', 1): 1,
 ('on', 1): 11,
 ('same', 1): 1,
 ('night', 1): 6,
 ('hanged', 1): 2,
 ('man', 1): 9,
 ('village', 1): 5,
 ('pub', 1): 2,
 ('a', 1): 47,
 ('roaring', 1): 1,
 ('trade', 1): 1,
 ('that', 1): 18,
 ('whole', 1): 1,
 ('seemed', 1): 1,
 ('to', 1): 39,
 ('have', 1): 1,
 ('turned', 1): 4,
 ('out', 1): 4,
 ('discuss', 1): 1,
 ('murders', 1): 1,
 ('they', 1): 10,
 ('were', 1): 9,
 ('rewarded', 1): 1,
 ('leaving', 1): 1,
 ('firesides', 1): 1,
 ('when', 1): 4,
 ('riddles', 1): 12,
 ('cook', 1): 3,
 ('arrived', 1): 1,
 ('dramatically', 1): 1,
 ('in', 1): 28,
 ('midst', 1): 1,
 ('and', 1): 43,
 ('ann

In [9]:
word_count_string = str(word_count) #we convert the dictionary into string data type to allow writing into a text file

In [10]:
#The final output is stored in the text file
with open('Word_count.txt','w') as wordcount:
    wordcount.write(word_count_string)
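
Writing `str(word_count)` gives a readable file, but one that is awkward to load back into a dictionary. A minimal sketch of one alternative is to store the counts as JSON. JSON object keys must be strings, so the `(word, 1)` tuple keys are reduced to the word itself. The file name `Word_count.json` and the small stand-in dictionary are assumptions made for illustration.

```python
import json

# Small stand-in for the notebook's word_count dictionary (same key shape).
word_count = {('the', 1): 150, ('villagers', 1): 3, ('cared', 1): 1}

# JSON object keys must be strings, so keep only the word from each tuple key.
flat_counts = {word: total for (word, _one), total in word_count.items()}

with open('Word_count.json', 'w', encoding='utf-8') as out_file:
    json.dump(flat_counts, out_file, ensure_ascii=False, indent=2)

# Reading the file back returns an ordinary dictionary.
with open('Word_count.json', 'r', encoding='utf-8') as in_file:
    restored = json.load(in_file)
```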

## *2. Extract the non-English Words*

The Harry Potter books contain numerous non-English words, especially in spells and in passages about history.

This program fetches such words from a selected reading portion using the **pyenchant** library, keeping only the words that are not found in the English dictionary.

The pages are selected based on the coder's year of birth, which is 1999: pages 99 to 109 of the 4th book (The Goblet of Fire) are used for this purpose.

In the given PDF, pages 99 - 109 of the 4th book correspond to pages 1911 to 1920 (indices 1910 to 1919 in the code, because page indices start at 0).

*A few errors are introduced by the conversion from PDF to text. E.g., for words that begin with an uppercase 'W', an extra space always appears between the 'W' and the rest of the word!*
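
One way to reduce this artifact is to glue the stray capital back onto the rest of the word before counting. The sketch below is an assumption-laden illustration, not part of the notebook: `rejoin_split_capital` is a hypothetical helper, and it only handles the capital letters passed in `letters` (by default the 'W' case described above).

```python
import re

def rejoin_split_capital(text, letters="W"):
    """Hypothetical helper: remove the stray space after the given capitals.

    A capital from `letters` that stands alone and is followed by a space
    and a lowercase letter is joined to the letters that follow it.
    """
    pattern = r'\b([' + re.escape(letters) + r']) (?=[a-z])'
    return re.sub(pattern, r'\1', text)

sample = "Mrs. W easley jabbed her wand at the cutlery drawer"
repaired = rejoin_split_capital(sample)
print(repaired)  # Mrs. Weasley jabbed her wand at the cutlery drawer
```

Fragments such as 'uesday' and 'iktor' in the results further down suggest that other capitals may be affected too. Passing, say, `letters="TVWY"` would cover them, but that choice is a guess based on those fragments. A genuine one-letter word followed by a lowercase word would be joined by mistake, so the repaired text is worth checking.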

In [11]:
'''
Same as in the first part:
The Harry_Potter.pdf file is opened in binary mode.
Each required page is then fetched as a page object in the 'page' variable (page indices are zero-based).
After that, the text of each page is extracted and a few regex filters are applied to remove the watermark and the page footer.
Finally, the cleaned string is appended to the file named 'file2.txt'.
Because the file is opened in append mode, re-running this cell adds the same pages again unless 'file2.txt' is deleted first.
'''
with open('Harry_Potter.pdf', 'rb') as hp_book_2:
    book_reader = PyPDF2.PdfReader(hp_book_2)
    for page_number in range(1910,1920):
        page = book_reader.pages[page_number]
        page_text = page.extract_text().replace("\n"," ")
        page_text = re.sub(r'www\.ztcprep\.com'," ",page_text)
        page_text = re.sub(r'P a g e  \| \d{1,3} Harry Potter and the Goblet of Fir e – J\.K\. Rowling', " ",page_text)
        with open('file2.txt', 'a') as file2:
            file2.write(page_text)

In [12]:
# We read the text file to get a look at it
with open('file2.txt','r') as book_2:
    book_reader = book_2.read()   
book_reader

'Mrs. W easley jabbed her wand at the cutlery drawer , which shot open. Harry and Ron both jumped out of the way as several knives soared out of it, flew across the kitchen, and began chopping the potatoes, which had just been tipped back into the sink by the dustpan. “I don’ t know where we went wrong with them,” said Mrs. W easley , putting down her wand and starting to pull out still more saucepans. “It’ s been the same for years, one thing after another , and they won’ t listen to — OH NOT  AGAIN !” She had picked up her wand from the table, and it had emitted a loud squeak and turned into a giant rubber mouse. “One of their fake wands again!” she shouted. “How many times have I told them not to leave them lying around?” She grabbed her real wand and turned around to find that the sauce on the stove was smoking. “C’mon,” Ron said hurriedly to Harry , seizing a   handful of cutlery from the open drawer , “let’ s go and help Bill and Charlie.” They left Mrs. W easley and headed out t

In [13]:
#pip install pyenchant

In [14]:
import enchant # the library helps filter non-English words

In [15]:
enchant.dict_exists('en_GB') #Great Britain English because the author is an English writer.

True

In [16]:
#enchant.list_languages()

In [17]:
spell_checker = enchant.Dict("en_GB") # Defines the enchant instance with the GB dictionary

In [18]:
'''
The Mapper function finds all the words in the string that are not in the English dictionary and converts each one into a key-value pair of the format (word, 1).
It is a generator: each pair is yielded one at a time, to be consumed by the reducer function.
The text is not lowercased here, so the original capitalisation is kept.
'''
def mapper_job(text):
    words = re.findall(r'\w+', text)
    for word in words:
        # check() returns True when the word is in the en_GB dictionary
        if not spell_checker.check(word):
            #print(word)
            yield word, 1

'''
The Reducer function counts how many times each non-English item yielded by the mapper function occurs.
Each time a new item is encountered, a new dictionary element is created; otherwise 1 is added to the existing value of the matching key.
Note that each item is the whole (word, 1) tuple, so the dictionary keys are tuples such as ('Hermione', 1).
'''
def reduce_job(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word]+=1
            #print(word_count)
        else:
            word_count[word] = 1
            #print(word_count)
    return word_count
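
The pattern `\w+` splits words at apostrophes, which is probably why fragments such as 'didn' and 've' show up in the results. The sketch below, offered as an illustration only, keeps contractions together and takes the dictionary check as a parameter so it can be tried without pyenchant. The names `tokenize`, `non_dictionary_pairs` and `KNOWN_WORDS` are hypothetical. In the notebook the predicate would be `spell_checker.check`, and whether it accepts a given contraction depends on the installed dictionary.

```python
import re

# Letters, optionally joined by straight or curly apostrophes (e.g. didn’t).
TOKEN = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def tokenize(text):
    """Return words with contractions kept whole and apostrophes normalised."""
    return [token.replace("’", "'") for token in TOKEN.findall(text)]

def non_dictionary_pairs(text, is_known):
    """Yield (word, 1) for every token that the is_known predicate rejects."""
    for word in tokenize(text):
        if not is_known(word):
            yield word, 1

# Tiny stand-in dictionary so the sketch runs without pyenchant.
KNOWN_WORDS = {"harry", "didn't", "like", "the", "cat"}
pairs = list(non_dictionary_pairs("Harry didn’t like Crookshanks the cat",
                                  lambda w: w.lower() in KNOWN_WORDS))
print(pairs)  # [('Crookshanks', 1)]
```

This does not repair contractions that the PDF extraction has already broken with a space, such as the "don’ t" visible in the extracted text; those would need a separate clean-up step.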

In [23]:
# We call the mapper_job() and reduce_job() functions passing the appropriate arguments
word_count_2 = {}
words = []
with open('file2.txt', 'r') as non_english:
    text_reader = non_english.read()
words= mapper_job(text_reader)
word_count_2 = reduce_job(words) #The final results are stored in the dictionary variable 'word_count_2'

In [24]:
len(word_count_2)

31

In [25]:
word_count_2

{('easley', 1): 11,
 ('Hermione', 1): 7,
 ('Crookshanks', 1): 3,
 ('ellington', 1): 1,
 ('Geor', 1): 2,
 ('ge', 1): 3,
 ('Perce', 1): 1,
 ('ery', 1): 1,
 ('easleys', 1): 1,
 ('ve', 1): 4,
 ('uesday', 1): 1,
 ('orld', 1): 4,
 ('favor', 1): 1,
 ('ou', 1): 4,
 ('Jorkins', 1): 1,
 ('ent', 1): 2,
 ('ganize', 1): 1,
 ('guing', 1): 1,
 ('iktor', 1): 1,
 ('Krum', 1): 2,
 ('wizarding', 1): 1,
 ('ransylvania', 1): 1,
 ('Luxembour', 1): 1,
 ('Gryf', 1): 2,
 ('findor', 1): 2,
 ('Quidditch', 1): 1,
 ('Firebolt', 1): 1,
 ('eah', 1): 1,
 ('ver', 1): 1,
 ('didn', 1): 1,
 ('Diagon', 1): 1}

In [22]:
#the dictionary is converted to string type and is written into the new text file.
non_english_str = str(word_count_2)
with open('Non-English_words.txt','w') as non_english:
    non_english.write(non_english_str)