* The following code implements the Map-Reduce process.
* To execute, run the cells in the order in which they are written.
* The notebook outputs a CSV file containing each word and the number of times it occurred.
* The code follows the instructions and the flowchart given in the homework document.

**Importing required libraries and packages

In [1]:
import pandas as pd
import re
import operator
import string
from nltk.tokenize.treebank import TreebankWordDetokenizer

**This function takes the path to the raw text file, converts uppercase characters to lowercase, removes digits, punctuation and special symbols, and drops blank lines

In [2]:
def data_clean(x):
    with open(x,'r') as f: # Opening the file; it is closed automatically on exit
        text = f.read() # Reading the whole file into a string
    a = text.lower() # Converting to lower case
    b = re.sub(r"[^a-z\s]+",' ',a) # Replacing digits, punctuation and special symbols with a space
    text = "".join([s for s in b.strip().splitlines(True) if s.strip()]) # Removing all blank lines
    return(text)

In [3]:
clean = data_clean(r'C:\Users\anuj8\data.txt') #Calling data clean function
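
To see what the cleaning steps do without needing a file on disk, here is a minimal sketch that applies the same lower-casing, regular expression and blank-line filter to an in-memory string. The helper name `clean_text` and the sample text are made up for this illustration.

```python
import re

def clean_text(text):
    # Hypothetical helper: the cleaning steps applied to a string instead of a file
    text = text.lower()                     # lower-case everything
    text = re.sub(r"[^a-z\s]+", " ", text)  # anything that is not a-z or whitespace becomes a space
    # keep only lines that still contain something after cleaning
    return "".join(s for s in text.strip().splitlines(True) if s.strip())

sample = "Hello, World! 123\n\n\nMap-Reduce in 2 steps.\n"
cleaned = clean_text(sample)
print(cleaned.split())  # ['hello', 'world', 'map', 'reduce', 'in', 'steps']
```

Because the regular expression keeps only `a-z` and whitespace, digits are dropped by the same substitution that removes punctuation.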

**This function takes the clean text as input and generates two partitions: one with the first 5000 lines and the other with the remaining text

In [4]:
def data_split(y):
    lines = y.splitlines() # Breaking the text into a list of lines
    first_part = lines[:5000] # First 5000 lines
    rest = lines[5000:] # Everything after line 5000
    result = ' '.join(first_part).split() # Words of the first partition
    result1 = ' '.join(rest).split() # Words of the second partition
    return(result,result1)

In [5]:
first,first1 = data_split(clean) # Calling split function
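
As a small illustration of the split step, the sketch below divides a text by line index and shows why that differs from searching for a line's content: `str.partition` cuts at the first occurrence of a substring, so a line that also appears earlier in the text moves the boundary. The helper name `split_lines` and the sample text are assumptions made for this example.

```python
def split_lines(text, n):
    # Hypothetical helper: words from the first n lines, and words from the rest
    lines = text.splitlines()
    head = " ".join(lines[:n]).split()
    tail = " ".join(lines[n:]).split()
    return head, tail

text = "x y\nrepeat\nz\nrepeat\nw"
head, tail = split_lines(text, 3)
print(head)  # ['x', 'y', 'repeat', 'z']
print(tail)  # ['repeat', 'w']

# Splitting on the content of line index 3 cuts at its first occurrence instead
by_content = text.partition(text.splitlines()[3])[0].split()
print(by_content)  # ['x', 'y']
```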

**This function takes in a list of words and generates a set of key value pairs <word,1>

In [6]:
def mapper1(words):
    pairs = [] # Initializing the list of key value pairs
    for x in words:
        key_value = [x,1] # Generating key value pairs
        pairs.append(key_value) # Appending the key value pair to the list
    temp = pd.DataFrame(pairs,columns = ['word','count'])
    return (temp)

In [7]:
ans = mapper1(first) # Calling mapper function

**This function takes in a list of words and generates a set of key value pairs <word,1>

In [8]:
def mapper2(words):
    pairs = [] # Initializing the list of key value pairs
    for x in words:
        key_value = [x,1] # Generating key value pairs
        pairs.append(key_value) # Appending the key value pair to the list
    temp = pd.DataFrame(pairs,columns = ['word','count'])
    return (temp)

In [9]:
ans_1 = mapper2(first1) # Calling mapper function
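
The two mappers apply the same logic to different partitions, which is the point of the map phase: each partition can be processed independently. A plain-Python sketch of that idea follows; the function name `map_words` and the sample partitions are hypothetical, and tuples stand in for DataFrame rows.

```python
def map_words(words):
    # Emit one (word, 1) pair per token; duplicates are kept on purpose,
    # since counting happens later in the reduce phase
    return [(w, 1) for w in words]

partition_a = ["the", "cat", "the"]
partition_b = ["a", "cat"]

# The same mapper runs on each partition; the outputs are then concatenated
pairs = map_words(partition_a) + map_words(partition_b)
print(pairs)
# [('the', 1), ('cat', 1), ('the', 1), ('a', 1), ('cat', 1)]
```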

In [10]:
combined = pd.concat([ans,ans_1]) #Combining outputs of mapper function 1 and 2

**This function sorts the key value pairs by word in ascending order

In [11]:
def sorting(data):
    test = data.sort_values('word',axis = 0,ascending = True,kind = 'quicksort',ignore_index = True)
    return(test)

In [12]:
ans = sorting(combined) # Calling sorting function
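
Sorting by key is what brings identical words next to each other so that they can be grouped. The sketch below illustrates this with the standard library; the sample pairs are made up, and `itertools.groupby` is used here only to show the effect of sorting, since it groups adjacent equal keys.

```python
from itertools import groupby
from operator import itemgetter

pairs = [("dog", 1), ("cat", 1), ("dog", 1), ("ant", 1)]

# Sort by the word so that equal keys become adjacent
sorted_pairs = sorted(pairs, key=itemgetter(0))

# groupby only merges neighbouring items, so it relies on the sort above
grouped = {word: [count for _, count in group]
           for word, group in groupby(sorted_pairs, key=itemgetter(0))}
print(grouped)  # {'ant': [1], 'cat': [1], 'dog': [1, 1]}
```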

**This function performs the partition, i.e. all the words starting with a-m are stored in one set and the rest in the other set

In [13]:
def partition_1(ans):
    is_second = ans['word'].str[0] >= 'n' # True for words beginning with n-z, even if no word starts with n
    first_set = ans[~is_second] # words starting with a-m in the first set
    second_set = ans[is_second] # words starting with n-z in the second set
    return(first_set,second_set)

In [14]:
first_set,second_set = partition_1(ans) #Calling partition function 
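
The partition rule can be illustrated on a handful of pairs. This sketch compares each word's first letter against a boundary letter, which keeps working even when no word begins with that letter. The function name `partition_by_letter` and its `boundary` parameter are assumptions for this example.

```python
def partition_by_letter(pairs, boundary="n"):
    # Words whose first letter sorts before the boundary go to the first reducer,
    # all others to the second
    low = [p for p in pairs if p[0][0] < boundary]
    high = [p for p in pairs if p[0][0] >= boundary]
    return low, high

# Note that no word here starts with 'n'
pairs = [("apple", 1), ("mango", 1), ("orange", 1), ("zebra", 1)]
low, high = partition_by_letter(pairs)
print(low)   # [('apple', 1), ('mango', 1)]
print(high)  # [('orange', 1), ('zebra', 1)]
```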

**This function takes key value pairs and does the word count

In [15]:
def reducer1(first_set):
    return(first_set['word'].value_counts())  #Aggregating similar words together and counting them

**This function takes key value pairs and does the word count

In [16]:
def reducer2(second_set):
    return(second_set['word'].value_counts()) #Aggregating similar words together and counting them

In [17]:
set1 = reducer1(first_set) # Calling reducer function

In [18]:
set2 = reducer2(second_set) # Calling reducer function
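
A reducer in a Map-Reduce job typically sums the values emitted for each key. Since every pair here carries a count of 1, counting rows per word gives the same totals as summing. The following plain-Python sketch shows the summing form; the name `reduce_counts` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def reduce_counts(pairs):
    # Add up the emitted counts for each word
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

totals = reduce_counts([("cat", 1), ("dog", 1), ("cat", 1)])
print(totals)  # {'cat': 2, 'dog': 1}
```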

**This function takes in the aggregated word count pairs and generates a final csv file

In [19]:
def main(set1,set2):
    set1_1 = pd.DataFrame({'Word':set1.index,'count':set1.values}) # Converting aggregated key value pairs into a dataframe
    set2_2 = pd.DataFrame({'Word':set2.index,'count':set2.values}) # Converting aggregated key value pairs into a dataframe
    result = pd.concat([set1_1, set2_2], axis=0) #Combining the two sets
    result = result.sort_values('Word',axis = 0,ascending = True,kind = 'quicksort',ignore_index = True)
    result.to_csv('final_ou.csv',index = False) # Writing the final output to csv

In [20]:
main(set1,set2)
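
One way to sanity-check the final counts is to compare the merged reducer outputs with a direct count of all words. The sketch below does this on a tiny made-up word list and writes the result to an in-memory buffer rather than a file; the variable names and sample data are assumptions for illustration.

```python
import csv
import io
from collections import Counter

words = "the cat and the nap and the owl".split()

# Two reducers, one per partition (a-m and n-z)
low = Counter(w for w in words if w[0] < "n")
high = Counter(w for w in words if w[0] >= "n")

# The partitions share no keys, so merging cannot overwrite a count
merged = sorted({**low, **high}.items())
expected = sorted(Counter(words).items())
print(merged == expected)  # True

# Write the merged counts as CSV text with the same header as the notebook
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Word", "count"])
writer.writerows(merged)
print(buffer.getvalue())
```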