# Question 6
***
JG Hanekom <br>
20780893 <br>
December <br>
***

# Map Reduce

This notebook implements a simplified MapReduce in Python. Computation is not distributed across nodes here; the purpose is rather to explore how to implement a map or reduce function, assuming that the surrounding functionality is provided by a library like the one described by [Dean and Ghemawat](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf).


This notebook comprises a section defining identity mappers and reducers, along with a `run` method which you may change if necessary. An intermediate sort function is also provided. 

Implement the `mapper` and `reducer` in the Term Vectors section, and use the run cell as provided.
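
To make the map, sort and reduce phases concrete before the term-vector version, here is a minimal word-count sketch. The names `wc_mapper`, `wc_reducer` and `run_word_count` are hypothetical and used only for illustration; they are not part of the assignment template.

```python
from itertools import groupby
from operator import itemgetter

def wc_mapper(key, value):
    # Emit one (word, 1) pair per word in the document text
    for word in value.split():
        yield (word, 1)

def wc_reducer(key, values):
    # Sum all counts emitted for the same word
    yield (key, sum(values))

def run_word_count(documents):
    # Map phase: collect every intermediate pair from every document
    pairs = [pair for name, text in documents.items()
             for pair in wc_mapper(name, text)]
    # Sort phase: groupby only groups adjacent items, so sort by key first
    pairs.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        results.extend(wc_reducer(word, [count for _, count in group]))
    return results

counts = run_word_count({"doc1": "bee honey bee", "doc2": "honey hive"})
print(counts)  # [('bee', 2), ('hive', 1), ('honey', 2)]
```

The sort between the two phases is what lets the reducer see all values for a key together.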


In [1]:
from itertools import groupby
from operator import itemgetter
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter
%config Completer.use_jedi = False
import requests
from urllib.parse import urlsplit

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


  


### Define URLs
The following movie scripts are used as the source documents.

In [2]:
shrek_movie_url = "https://raw.githubusercontent.com/mackenziedg/shrek/master/data/shrek3.txt"
bee_movie_url = "https://gist.githubusercontent.com/ElliotGluck/64b0b814293c09999f765e265aaa2ba1/raw/79f24f9f87654d7ec7c2f6ba83e927852cdbf9a5/gistfile1.txt"

### Define the function that will clean the text files

In [3]:
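# Build the stop-word set once; a set lookup is faster than scanning the list for every word
STOP_WORDS = set(stopwords.words('english'))

def cleaner(line):
    # Lowercase the line and keep alphabetic characters only,
    # keeping apostrophes for the time being
    words = re.findall(r"[a-z']+", line.lower())
    for word in words:
        # Omit apostrophes, assuming users won't type them in a search
        word = word.replace("'", '')
        # Skip empty strings and stop words (compare by value, not with `is`)
        if word and word not in STOP_WORDS:
            yield word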
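
The reducer output at the end of this notebook contains tokens such as `dont` and `youre`. One possible explanation, assuming the stop-word list stores contractions with their apostrophes, is that `cleaner` removes the apostrophe before the stop-word check, so the stripped form no longer matches. The sketch below compares the two orderings using a small hypothetical stop-word set (`DEMO_STOP_WORDS`), not the actual NLTK list.

```python
import re

# Hypothetical stand-in for a stop-word list that stores contractions
# with their apostrophes; this is not the actual NLTK list.
DEMO_STOP_WORDS = {"i", "the", "don't", "you're"}

def strip_then_filter(line, stop_words=DEMO_STOP_WORDS):
    # Same order as cleaner: remove apostrophes, then check stop words
    for word in re.findall(r"[a-z']+", line.lower()):
        word = word.replace("'", "")
        if word and word not in stop_words:
            yield word

def filter_then_strip(line, stop_words=DEMO_STOP_WORDS):
    # Alternative order: check stop words first, then remove apostrophes
    for word in re.findall(r"[a-z']+", line.lower()):
        if word in stop_words:
            continue
        word = word.replace("'", "")
        if word:
            yield word

line = "You're right, I don't like the honey"
print(list(strip_then_filter(line)))  # ['youre', 'right', 'dont', 'like', 'honey']
print(list(filter_then_strip(line)))  # ['right', 'like', 'honey']
```

Which order is preferable depends on whether contractions should count as searchable terms.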

### Define a function to extract a hostname


In [4]:
def get_url(url):
    # Return the network location (hostname) part of the URL
    return urlsplit(url).netloc

url = "http://stackoverflow.com/questions/9626535/get-domain-name-from-url"
get_url(url)

'stackoverflow.com'
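
For reference, the following sketch shows the parts that `urlsplit` returns for one of the movie-script URLs. The second URL, with an explicit port, is a hypothetical example added to show that `netloc` keeps the port while `hostname` drops it.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://raw.githubusercontent.com/mackenziedg/shrek/master/data/shrek3.txt")
print(parts.scheme)  # https
print(parts.netloc)  # raw.githubusercontent.com
print(parts.path)    # /mackenziedg/shrek/master/data/shrek3.txt

# Hypothetical URL with an explicit port: netloc and hostname differ
with_port = urlsplit("http://example.com:8080/index.html")
print(with_port.netloc)    # example.com:8080
print(with_port.hostname)  # example.com
```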

### Define the Mapper
The mapper will:
1. Clean the data
2. Count the words
3. Filter out words that occur only once

In [5]:
def mapper(key, value):
    """
    Extract the hostname from the URL, count the words in the document,
    and keep only the words that occur at least twice.
    :param key: Raw URL
    :param value: Document text
    """
    r = dict(Counter(cleaner(value)))
    r2 = [(k, v) for k, v in r.items() if v >= 2]
    yield (get_url(key), r2)
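
To see the shape of the mapper's output without downloading a script, here is a simplified stand-in. `demo_mapper` is a hypothetical name; it splits on whitespace instead of calling `cleaner`, and the `example.com` URL is made up.

```python
from collections import Counter
from urllib.parse import urlsplit

def demo_mapper(key, value, min_count=2):
    # Count words (whitespace split stands in for the real cleaner)
    counts = Counter(value.lower().split())
    # Keep only the words that occur at least min_count times
    term_vector = [(w, c) for w, c in counts.items() if c >= min_count]
    # Emit a single (hostname, term vector) pair per document
    yield (urlsplit(key).netloc, term_vector)

result = list(demo_mapper("https://example.com/docs/a.txt",
                          "bee honey bee hive honey bee"))
print(result)  # [('example.com', [('bee', 3), ('honey', 2)])]
```

The word `hive` occurs only once, so it is dropped from the term vector.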

### Define the reducer

In [6]:
def reducer(key, list_value):
    """
    User-defined reducer. Returns the 10 most frequently used words,
    assuming the term vector is already sorted in descending order of count.
    :param key: The hostname of the document
    :param list_value: The term vector, a list of (word, count) tuples
    """
    yield (key, list_value[:10])
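
The reducer above relies on the term vector already being sorted, because it only slices the first 10 entries. As an alternative sketch, not the approach used in this notebook, `heapq.nlargest` selects the top entries from an unsorted vector. The name `top_n_reducer` and the parameter `n` are hypothetical.

```python
import heapq
from operator import itemgetter

def top_n_reducer(key, list_value, n=10):
    # Pick the n (word, count) tuples with the highest count;
    # the input does not need to be sorted beforehand
    yield (key, heapq.nlargest(n, list_value, key=itemgetter(1)))

unsorted_vector = [("honey", 2), ("bee", 5), ("hive", 3)]
top_two = list(top_n_reducer("example.com", unsorted_vector, n=2))
print(top_two)  # [('example.com', [('bee', 5), ('hive', 3)])]
```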

### Define an intermediate sort function

In [7]:
def intermediate_sort(data):
    """
    Sort each term vector in descending order of count.
    """
    result = []
    for v in data:
        result.append((v[0], sorted(v[1], key=itemgetter(1), reverse=True)))
    return result
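
The effect of the sort can be checked on a small, made-up map result. Python's `sorted` is stable even with `reverse=True`, so words with equal counts keep their original relative order.

```python
from operator import itemgetter

# Hypothetical map result with one document and an unsorted term vector
map_result = [("example.com", [("hive", 2), ("bee", 5), ("honey", 2)])]

sorted_result = [
    (host, sorted(vector, key=itemgetter(1), reverse=True))
    for host, vector in map_result
]
print(sorted_result)
# [('example.com', [('bee', 5), ('hive', 2), ('honey', 2)])]
```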

### Define the run

In [8]:
def run(sources_dict):
    """
    Run the mapper, the intermediate sort and the reducer, and return the results of each stage as lists.
    :param sources_dict: dictionary mapping a base URL to a file name, for example {'https://www.github.com/': 'file.txt'}
    """

    # Define empty result lists
    map_result = []
    reduce_result = []

    # Fetch each document and pass it to the mapper
    for k, v in sources_dict.items():
        f = requests.get(k + v)
        map_result += list(mapper(k, f.text))

    # Do an intermediate sort
    intermediate_result = intermediate_sort(map_result)

    # Pass to the reducer
    for elem in intermediate_result:
        reduce_result.append(list(reducer(elem[0], elem[1])))
    return map_result, intermediate_result, reduce_result
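
Note that `run` calls the reducer once per mapper output, so two documents served from the same hostname would appear as two separate entries. If merging per hostname were wanted, a grouping step along the following lines could sit between the mapper and the sort. This is a sketch under that assumption; `merge_by_key` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def merge_by_key(map_result):
    # Accumulate word counts per hostname across all mapper outputs
    merged = defaultdict(Counter)
    for host, term_vector in map_result:
        # Counter.update adds counts instead of replacing them
        merged[host].update(dict(term_vector))
    # most_common() returns (word, count) pairs in descending order of count
    return [(host, counts.most_common()) for host, counts in merged.items()]

sample = [
    ("example.com", [("bee", 3), ("honey", 2)]),
    ("example.com", [("bee", 2), ("hive", 4)]),
    ("example.org", [("shrek", 2)]),
]
merged_result = merge_by_key(sample)
print(merged_result)
# [('example.com', [('bee', 5), ('hive', 4), ('honey', 2)]), ('example.org', [('shrek', 2)])]
```

Because `most_common()` already sorts by count, the merged vectors could be passed straight to a top-10 reducer.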

### Run the mapreduce query

In [9]:
x, y, res = run({
    'https://raw.githubusercontent.com/mackenziedg/shrek/master/data/': "shrek3.txt",
    'https://gist.githubusercontent.com/ElliotGluck/64b0b814293c09999f765e265aaa2ba1/raw/79f24f9f87654d7ec7c2f6ba83e927852cdbf9a5/': "gistfile1.txt",
})

# Return the reducer results
res

[[('raw.githubusercontent.com',
   [('shrek', 647),
    ('charming', 220),
    ('donkey', 212),
    ('artie', 208),
    ('prince', 207),
    ('puss', 194),
    ('fiona', 181),
    ('third', 121),
    ('script', 120),
    ('final', 119)])],
 [('gist.githubusercontent.com',
   [('bee', 93),
    ('im', 80),
    ('dont', 64),
    ('bees', 58),
    ('know', 53),
    ('barry', 50),
    ('honey', 49),
    ('right', 43),
    ('thats', 41),
    ('youre', 40)])]]