# Using MapReduce to Analyze Wikipedia Pages

## Introduction

In this project, we analyze data scraped from Wikipedia pages and search it for specific strings. First, we build a small MapReduce framework so that the files can be processed in parallel across several processes. Then we implement a simplified version of the `grep` command-line utility to find every occurrence of a target string.

## List and count all files in the wiki folder

In [1]:
import os

file_names = os.listdir("wiki")
print(len(file_names))

999


## Read and display the first file in the folder

In [2]:
with open(os.path.join("wiki", file_names[0])) as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Bay of Concepción - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNa

## Adding the MapReduce Framework

In [3]:
from multiprocessing import Pool
import math
import functools


def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]


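def map_reduce(data, num_processes, mapper, reducer):
    # Split the input into (at most) one chunk per worker process
    chunks = make_chunks(data, num_processes)
    # The context manager shuts the worker pool down once mapping is done
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    # Combine the per-chunk results pairwise into a single value
    return functools.reduce(reducer, chunk_results)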
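
To make the three stages (chunk, map, reduce) easier to follow, here is a minimal single-process sketch. It reuses `make_chunks` as defined above, but `map_reduce_sequential` is a hypothetical stand-in that swaps `Pool.map` for the built-in `map`, so it illustrates the data flow only, not the parallelism.

```python
import functools
import math


def make_chunks(data, num_chunks):
    # Same chunking rule as above: ceil(len / num_chunks) items per chunk
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_reduce_sequential(data, num_chunks, mapper, reducer):
    # Hypothetical single-process stand-in for map_reduce: the built-in
    # map() replaces Pool.map, so no worker processes are started
    chunks = make_chunks(data, num_chunks)
    chunk_results = list(map(mapper, chunks))
    return functools.reduce(reducer, chunk_results)


numbers = list(range(10))
chunks = make_chunks(numbers, 3)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Each chunk is summed by the mapper, then the partial sums are added up
total = map_reduce_sequential(numbers, 3, sum, lambda a, b: a + b)
print(total)  # 45

# Asking for 6 chunks gives a chunk size of 2, so only 5 chunks come back
print(len(make_chunks(numbers, 6)))  # 5
```

Because the chunk size is rounded up, `make_chunks` can return fewer chunks than requested. In that case some worker processes simply receive no work.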

## Counting the total number of lines in all files (using MapReduce)

In [4]:
def map_line_count(file_names):
    total = 0
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            total += len(f.readlines())
    return total


def reduce_line_count(count1, count2):
    return count1 + count2


map_reduce(file_names, 10, map_line_count, reduce_line_count)

499797
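
The mapper above loads every line of a file into a list before counting. As a possible alternative, the sketch below counts lines by iterating over the file object, which avoids holding the whole file in memory. The helper name `count_lines` and the sample file are assumptions made for this illustration only.

```python
import os
import tempfile


def count_lines(path):
    # Iterating over the file object yields one line at a time,
    # so the whole file never has to be held in memory
    with open(path) as f:
        return sum(1 for _ in f)


# Hypothetical sample file, created only for this demonstration
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.html")
    with open(sample_path, "w") as f:
        f.write("<html>\n<body>\n</body>\n</html>\n")
    line_count = count_lines(sample_path)

print(line_count)  # 4
```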

## Grep exact match

First, we implement a MapReduce grep algorithm that locates all lines in the files of the `wiki` folder that contain a given string. The result maps each file name to the list of zero-based indexes of its matching lines.

In [5]:
# `target` is a global variable, set before calling, that holds the search string
def map_grep(file_names):
    results = {}
    for fn in file_names:
        with open(fn) as f:
            lines = f.readlines()
        for line_index, line in enumerate(lines):
            if target in line:
                # Record the zero-based index of every matching line
                results.setdefault(fn, []).append(line_index)
    return results


def reduce_grep(lines1, lines2):
    lines1.update(lines2)
    return lines1


def mapreduce_grep(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_grep, reduce_grep)
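
The reducer relies on the fact that every file belongs to exactly one chunk, so the dictionaries returned by different mappers never share a key and `update` cannot overwrite earlier results. The sketch below illustrates the merge step on made-up mapper outputs. The file names and line indexes are hypothetical.

```python
import functools


def reduce_grep(lines1, lines2):
    # Same reducer as above: merge the second dictionary into the first
    lines1.update(lines2)
    return lines1


# Hypothetical mapper outputs for three chunks (file names are made up)
chunk_results = [
    {"wiki/a.html": [3, 17]},
    {},  # a chunk whose files contained no match
    {"wiki/b.html": [0], "wiki/c.html": [5, 6, 42]},
]

merged = functools.reduce(reduce_grep, chunk_results)
print(merged)
```

Note that `reduce_grep` mutates its first argument, so `merged` is the same object as `chunk_results[0]`. That is harmless here because the per-chunk results are not used again.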

## Find all occurrences of the string "data" in the wiki folder

In [6]:
target = "data"
data_occurrences = mapreduce_grep("wiki", 10)

## Allowing for case-insensitive matches

In [7]:
def map_grep_insensitive(file_names):
    results = {}
    # Lowercase the search string once instead of once per line
    needle = target.lower()
    for fn in file_names:
        with open(fn) as f:
            lines = [line.lower() for line in f.readlines()]
        for line_index, line in enumerate(lines):
            if needle in line:
                results.setdefault(fn, []).append(line_index)
    return results


def mapreduce_grep_insensitive(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_grep_insensitive, reduce_grep)


target = "data"
new_data_occurrences = mapreduce_grep_insensitive("wiki", 10)
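
To show what lowercasing both sides changes, here is a small in-memory sketch. The helper `matching_line_indexes` and the sample lines are hypothetical and only mirror the matching logic, not the file handling.

```python
def matching_line_indexes(lines, target, ignore_case=False):
    # Hypothetical helper: returns the indexes of the lines containing target
    if ignore_case:
        target = target.lower()
    return [
        i for i, line in enumerate(lines)
        if target in (line.lower() if ignore_case else line)
    ]


sample_lines = ["Data science", "big data", "DATABASE design", "no match here"]
exact = matching_line_indexes(sample_lines, "data")
insensitive = matching_line_indexes(sample_lines, "data", ignore_case=True)
print(exact)        # [1]
print(insensitive)  # [0, 1, 2]
```

Every exact match is also a case-insensitive match, so the second result can only gain entries. This is the property the next cell checks.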

## Checking the implementation

In [8]:
for fn in new_data_occurrences:
    if fn not in data_occurrences:
        print("Found {} new matches on file {}".format(
            len(new_data_occurrences[fn]), fn))
    elif len(new_data_occurrences[fn]) > len(data_occurrences[fn]):
        print("Found {} new matches on file {}".format(
            len(new_data_occurrences[fn]) - len(data_occurrences[fn]), fn))

Found 1 new matches on file wiki/Table_Point_Formation.html
Found 1 new matches on file wiki/Ingrid_GuimarC3A3es.html
Found 2 new matches on file wiki/Jules_Verne_ATV.html
Found 1 new matches on file wiki/Pictogram.html
Found 2 new matches on file wiki/Claire_Danes.html
Found 1 new matches on file wiki/PTPRS.html
Found 1 new matches on file wiki/A_Beautiful_Valley.html
Found 1 new matches on file wiki/Mudramothiram.html
Found 2 new matches on file wiki/Gordon_Bau.html
Found 1 new matches on file wiki/Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
Found 3 new matches on file wiki/Code_page_1023.html
Found 1 new matches on file wiki/Cryptographic_primitive.html
Found 1 new matches on file wiki/Alex_Kurtzman.html
Found 1 new matches on file wiki/Filip_Pyrochta.html
Found 1 new matches on file wiki/Morgana_King.html
Found 1 new matches on file wiki/Don_Parsons_(ice_hockey).html
Found 1 new matches on file wiki/Bias.html
Found 2 new matches on file wiki/Tomohiko_ItC58D_(director).html
Found

## Finding match indexes on lines

In [9]:
def find_match_indexes(line, target):
    results = []
    i = line.find(target, 0)
    while i != -1:
        results.append(i)
        i = line.find(target, i + 1)
    return results


# Test implementation
s = "Data science is related to data mining, machine learning and big data.".lower()
print(find_match_indexes(s, "data"))

[0, 27, 65]
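
One detail worth noting: because the search resumes at `i + 1`, `find_match_indexes` also reports overlapping matches. The sketch below contrasts it with a hypothetical non-overlapping variant, `find_non_overlapping_indexes`, which is not part of the project and assumes a non-empty target.

```python
def find_match_indexes(line, target):
    # Same function as above: the search resumes one character after each hit
    results = []
    i = line.find(target, 0)
    while i != -1:
        results.append(i)
        i = line.find(target, i + 1)
    return results


def find_non_overlapping_indexes(line, target):
    # Hypothetical variant: resume after the end of each hit instead.
    # Assumes target is non-empty, otherwise the loop would never advance.
    results = []
    i = line.find(target, 0)
    while i != -1:
        results.append(i)
        i = line.find(target, i + len(target))
    return results


overlapping = find_match_indexes("aaaa", "aa")
non_overlapping = find_non_overlapping_indexes("aaaa", "aa")
print(overlapping)      # [0, 1, 2]
print(non_overlapping)  # [0, 2]
```

For a word such as "data", which cannot overlap with itself, both versions return the same indexes.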


## Finding all match locations for "science"

In [10]:
def map_grep_match_indexes(file_names):
    results = {}
    for fn in file_names:
        with open(fn) as f:
            lines = [line.lower() for line in f.readlines()]
        for line_index, line in enumerate(lines):
            match_indexes = find_match_indexes(line, target.lower())
            # Only create an entry once a file has at least one match
            if match_indexes:
                results.setdefault(fn, []).extend(
                    (line_index, match_index) for match_index in match_indexes)
    return results


def mapreduce_grep_match_indexes(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_grep_match_indexes, reduce_grep)


target = "science"
occurrences = mapreduce_grep_match_indexes("wiki", 10)

## Displaying the results

In [11]:
import csv

# Number of characters to show before and after each match
context_delta = 30

with open("results.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    rows = [["File", "Line", "Index", "Context"]]
    for fn in occurrences:
        with open(fn) as f:
            # Drop only the trailing newline: the match indexes refer to the
            # unstripped line, so leading whitespace must stay in place
            lines = [line.rstrip("\n") for line in f.readlines()]
        for line_index, index in occurrences[fn]:
            start = max(index - context_delta, 0)
            end = index + len(target) + context_delta
            rows.append([fn, line_index, index, lines[line_index][start:end]])
    writer.writerows(rows)
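
The context window is a plain string slice around the match. The sketch below isolates that arithmetic in a hypothetical helper, `extract_context`, to show why only the start of the slice needs clamping.

```python
def extract_context(line, index, target, context_delta):
    # Clamp the start at 0 so a match near the beginning of the line
    # does not produce a negative index that wraps around to the end
    start = max(index - context_delta, 0)
    # Slicing past the end of a string is safe in Python, so no clamp is needed
    end = index + len(target) + context_delta
    return line[start:end]


line = "data science is related to data mining"
first = extract_context(line, 0, "data", 5)
second = extract_context(line, 27, "data", 5)
print(repr(first))   # 'data scie'
print(repr(second))  # 'd to data mini'
```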

In [12]:
import pandas
df = pandas.read_csv("results.csv")
df.head(10)

| | File | Line | Index | Context |
|---|---|---|---|---|
| 0 | wiki/Valentin_Yanin.html | 6 | 840 | `embers of the USSR Academy of Sciences","Full ...` |
| 1 | wiki/Valentin_Yanin.html | 6 | 890 | `ers of the Russian Academy of Sciences","Demid...` |
| 2 | wiki/Valentin_Yanin.html | 66 | 90 | `href="/wiki/Soviet_Academy_of_Sciences" class=...` |
| 3 | wiki/Valentin_Yanin.html | 66 | 145 | `ect" title="Soviet Academy of Sciences">Soviet...` |
| 4 | wiki/Valentin_Yanin.html | 66 | 173 | `f Sciences">Soviet Academy of Sciences</a>; he...` |
| 5 | wiki/Valentin_Yanin.html | 144 | 1440 | `rs_of_the_USSR_Academy_of_Sciences" title="Cat...` |
| 6 | wiki/Valentin_Yanin.html | 144 | 1502 | `rs of the USSR Academy of Sciences">Full Membe...` |
| 7 | wiki/Valentin_Yanin.html | 144 | 1548 | `rs of the USSR Academy of Sciences</a></li><li...` |
| 8 | wiki/Valentin_Yanin.html | 144 | 1632 | `of_the_Russian_Academy_of_Sciences" title="Cat...` |
| 9 | wiki/Valentin_Yanin.html | 144 | 1697 | `of the Russian Academy of Sciences">Full Membe...` |


## Conclusion

In this project, we implemented a MapReduce framework and used it to analyze scraped Wikipedia pages in parallel. On top of it, we built a `grep` algorithm that finds every match of a string, together with its file, line, and position within the line. In the future, we could extend the search to support regular expressions.
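
As a rough idea of what such an extension might look like, the sketch below uses Python's standard `re` module to collect match positions. The function name `find_regex_match_indexes` is hypothetical, and unlike `find_match_indexes`, `re.finditer` reports only non-overlapping matches.

```python
import re


def find_regex_match_indexes(line, pattern):
    # Hypothetical regex counterpart of find_match_indexes: returns the
    # start offset of every non-overlapping match of pattern in line
    return [m.start() for m in re.finditer(pattern, line)]


s = "Data science is related to data mining, machine learning and big data."
# re.IGNORECASE replaces the manual lowercasing used earlier
indexes = find_regex_match_indexes(s, re.compile(r"data", re.IGNORECASE))
print(indexes)  # [0, 27, 65]

# Word boundaries restrict the search to whole words
whole_words = find_regex_match_indexes("database data", r"\bdata\b")
print(whole_words)  # [9]
```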