## Goals

In this guided project, we'll work with data scraped from `Wikipedia`. Volunteer content contributors and editors maintain Wikipedia by continuously improving content. 

We'll implement a simplified version of the `grep` command-line utility to search for data in `54` megabytes worth of articles. If you're not familiar with the grep command, the grep utility essentially allows searching for textual data in all files from a given directory.

- Search for all occurrences of a string in all of the files.
- Provide a case-insensitive option to the search.
- Refine the result by providing the specific locations of the files.

## Data

Articles were saved using the last component of their URLs. For example, a page on Wikipedia has the URL structure https://en.wikipedia.org/wiki/Yarkant_County. If we were saving the article with the previous URL, we'd save it to the file Yarkant_County.html. All the data files are in the wiki folder. Note that the files are raw HTML — We don't need to understand HTML for this guided project. We're going to treat those files like plain-text and we won't rely on any of the specific structure of those files.

# 1. Introducing to Wikipedia Data


In [1]:
# import libraries
import os
import math
import functools
from multiprocessing import Pool

In [2]:
# List all of the files in the wiki folder.

folder_name = "wiki"
file_names = os.listdir(folder_name)
# print(file_names)

# Count and display the number of files in the wiki folder.

n_files = len(file_names)
print(n_files)

999


In [3]:
# Read the first file in the wiki folder, and print its contents.

with open(os.path.join(folder_name, file_names[0])) as f:
    lines= [line for line in f.readlines()]
#     print(lines)

# 2. Adding the MapReduce Framework

In [4]:
def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

# Count the total number of lines in all files. 

Bonus: Use MapReduce for this step.

In [5]:
def map_line_count(file_names):
    total = 0
    for fn in file_names:
        with open(os.path.join(folder_name, fn)) as f:
            total += len(f.readlines())
    return total

def reduce_line_count(count1, count2):
    return count1 + count2

map_reduce(file_names, 8, map_line_count, reduce_line_count)

499797

# 3. Grep Exact Match
- Use MapReduce to create a function that, given a string, creates a dictionary where the keys are the file names and the values are lists with all line indexes that contain the given string.

- Use the function to find all occurrences of the string `"data"` in the files stored in the `wiki` folder.

In [6]:
target = "data"

def map_grep(file_names):
    grep = {}
    for fn in file_names:
        with open(os.path.join(folder_name, fn)) as f:
            lines = [line for line in f.readlines()]
        for idx, line in enumerate(lines):
            if target in line:
                if fn not in grep:
                    grep[fn] = []
                grep[fn].append(idx)
    return grep

def reduce_grep(line1, line2):
    line1.update(line2)
    return line1

grep_v1 = map_reduce(file_names, 8, map_grep, reduce_grep)
len(grep_v1)

999

# 4. Grep Case Insensitive


In [7]:
target = "data"

def map_grep_insensitive(file_names):
    grep = {}
    for fn in file_names:
        with open(os.path.join(folder_name, fn)) as f:
            lines = [line for line in f.readlines()]
        for idx, line in enumerate(lines):
            if target.lower() in line:
                if fn not in grep:
                    grep[fn] = []
                grep[fn].append(idx)
    return grep

grep_v2 = map_reduce(file_names, 8, map_grep_insensitive, reduce_grep)
len(grep_v2)

999

# 5. Checking the Implementation

Let's verify that the new implementation works by seeing if it finds more matches than the previous implementation.

- Store the results of the search for the string "data" with both versions of the algorithms into variables.

- For each file, if there are more matches in the case insensitive result, print the file name new together with the number of new matches.

In [8]:
file_name = [fn for fn, grep_list in grep_v1.items() if len(grep_list) < len(grep_v2[fn])] 

In [9]:
print(file_name)

[]


In [10]:
for fn in grep_v2:
    if fn not in grep_v1:
        print("Found {} new matches on file {}".format(len(grep_v2[fn]), fn))
    elif len(grep_v2[fn]) > len(grep_v1[fn]):
        print("Found {} new matches on file {}".format(len(grep_v2[fn]) - len(grep_v1[fn]), fn))