
**Introduction**


This notebook implements the MapReduce paradigm to build term vectors from documents. It shows how a mapper function and a reducer function extract words from multiple documents and compute each word's frequency. The MapReduce model is widely used for distributed computing and data aggregation in big data applications.

In this task, we simulate the process locally: each document is a text string keyed by the hostname of the website that hosts it. The job is to extract the words and count how often each one appears within a given document. The reducer also filters out words that appear fewer than two times in a document, so only repeated words reach the final result.

In [7]:
from itertools import groupby

In [8]:
# Helper function: sort the intermediate pairs, then group their values by key
def intermediate_sort(data):
    # groupby only merges adjacent items, so the data must be sorted first
    data = sorted(data)
    return [(k, [v for _, v in g]) for k, g in groupby(data, key=lambda x: x[0])]
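
The sort step matters because `itertools.groupby` only merges adjacent items that share a key. A minimal sketch of the difference, using made-up pairs (the hostnames `a.com` and `b.com` and the helper name `group` are illustrative assumptions, not part of the notebook):

```python
from itertools import groupby

# Hypothetical intermediate pairs, deliberately out of order
pairs = [(("a.com", "x"), 1), (("b.com", "y"), 1), (("a.com", "x"), 1)]

def group(data):
    # Same grouping expression as intermediate_sort, but without the sort
    return [(k, [v for _, v in g]) for k, g in groupby(data, key=lambda x: x[0])]

unsorted_groups = group(pairs)          # the key ("a.com", "x") shows up twice
sorted_groups = group(sorted(pairs))    # each key shows up exactly once

print(unsorted_groups)
print(sorted_groups)
```

Without sorting, the same key is split across two groups, so a reducer would see two partial counts instead of one total.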


In [9]:
# Mapper function: emit ((hostname, word), 1) for every whitespace-separated word
def mapper(hostname, document_content):
    words = document_content.split()
    for word in words:
        yield (hostname, word), 1

# Reducer function: sum the counts for one (hostname, word) key
# and keep the key only if the word occurs at least twice
def reducer(key, values):
    total_count = sum(values)
    if total_count >= 2:
        yield key, total_count
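
To see what the two functions emit in isolation, the sketch below re-declares them and calls them by hand on one short document. The value lists passed to `reducer` are written out manually here, as an assumption about what the grouping step would supply.

```python
def mapper(hostname, document_content):
    # One ((hostname, word), 1) pair per whitespace-separated word
    for word in document_content.split():
        yield (hostname, word), 1

def reducer(key, values):
    # Keep a key only when its summed count reaches 2
    total_count = sum(values)
    if total_count >= 2:
        yield key, total_count

mapped = list(mapper("www.site1.com", "apple banana apple"))
kept = list(reducer(("www.site1.com", "apple"), [1, 1]))
dropped = list(reducer(("www.site1.com", "banana"), [1]))

print(mapped)   # three pairs, one per word occurrence
print(kept)     # the key with count 2 survives
print(dropped)  # empty list: a count of 1 is filtered out
```

Because both functions are generators, wrapping the calls in `list(...)` is what actually runs them.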


In [10]:
def run(sources_dict):
    map_result = []
    reduce_result = []

    # Run the mapper on each document
    for hostname, content in sources_dict.items():
        map_result.extend(mapper(hostname, content))

    # Sort and group the intermediate results
    intermediate_result = intermediate_sort(map_result)

    # Run the reducer on each (key, values) group
    for key, values in intermediate_result:
        reduce_result.extend(reducer(key, values))

    return map_result, intermediate_result, reduce_result
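
Since the key is the pair `(hostname, word)`, counts are kept per document rather than across all documents. The self-contained sketch below restates the pipeline compactly and runs it on a made-up two-document input (the hostnames `a.example` and `b.example` and their contents are assumptions for illustration) so that all three stages can be inspected:

```python
from itertools import groupby

def intermediate_sort(data):
    data = sorted(data)
    return [(k, [v for _, v in g]) for k, g in groupby(data, key=lambda x: x[0])]

def mapper(hostname, document_content):
    for word in document_content.split():
        yield (hostname, word), 1

def reducer(key, values):
    total_count = sum(values)
    if total_count >= 2:
        yield key, total_count

def run(sources_dict):
    map_result = []
    for hostname, content in sources_dict.items():
        map_result.extend(mapper(hostname, content))
    intermediate_result = intermediate_sort(map_result)
    reduce_result = []
    for key, values in intermediate_result:
        reduce_result.extend(reducer(key, values))
    return map_result, intermediate_result, reduce_result

# Hypothetical input: "banana" occurs three times overall,
# but only twice within b.example
docs = {"a.example": "apple banana apple", "b.example": "banana banana"}
mapped, grouped, reduced = run(docs)

print(grouped)
print(reduced)
```

In this input `banana` occurs once in `a.example`, so that key is dropped, while the two occurrences in `b.example` pass the threshold.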


**Example Input Data**

In [11]:
# Example input: documents hosted on different sites
documents = {
    'www.site1.com': 'apple banana apple',
    'www.site2.com': 'banana orange apple',
    'www.site3.com': 'orange banana banana'
}


In [12]:
# Execute the MapReduce process
_, _, result = run(documents)

# Display the final results
print("Final Results:", result)


Final Results: [(('www.site1.com', 'apple'), 2), (('www.site3.com', 'banana'), 2)]
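
The result is a flat list of `((hostname, word), count)` pairs. If one term vector per document is preferred, the pairs can be regrouped by hostname. The following is only a sketch: `to_term_vectors` is a hypothetical helper name, not something the notebook defines, and the input is the final result printed above, copied in as a literal.

```python
from collections import defaultdict

# Final result copied from the notebook output
result = [(("www.site1.com", "apple"), 2), (("www.site3.com", "banana"), 2)]

def to_term_vectors(reduce_result):
    # Hypothetical helper: regroup flat ((hostname, word), count) pairs by hostname
    vectors = defaultdict(dict)
    for (hostname, word), count in reduce_result:
        vectors[hostname][word] = count
    return dict(vectors)

term_vectors = to_term_vectors(result)
print(term_vectors)
```

`www.site2.com` has no entry because none of its words occurs at least twice, so the reducer emitted nothing for it.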
