# Guided Project: Analyzing Wikipedia Pages

## Introduction

This guided project concludes the Parallel Processing course.
In it, we will work with data scraped from Wikipedia.

We'll implement a simplified version of the `grep` command-line utility to search for data in 54MB worth of articles. The `grep` utility essentially allows searching for textual data in all files from a given directory.

We will also make use of MapReduce to parallelize our processes.

The dataset is composed of full articles from Wikipedia. Each article was saved under the last component of its URL. For example, the article at https://en.wikipedia.org/wiki/Yarkant_County is saved in the file `Yarkant_County.html`. All the data files are in the `wiki` folder. Note that the files are raw HTML.

Our main goals will be the following:
- Search for all occurrences of a string in all of the files
- Provide a case-insensitive option for the search
- Refine the result by providing the specific locations of the occurrences

## Exploring the Data

We start by listing all the files in the `wiki` folder.

In [1]:
import os
import pprint

files_names = sorted(os.listdir("wiki"))
pprint.pprint(files_names)

['100_Greatest_Romanians.html',
 '104th_Logistic_Support_Brigade_(United_Kingdom).html',
 '16th_Virginia_Infantry.html',
 '1896_Indiana_Hoosiers_football_team.html',
 '1898_Colgate_football_team.html',
 '1910_in_literature.html',
 '1915_Montana_football_team.html',
 '1951_National_League_tiebreaker_series.html',
 '1953E2809354_FA_Cup_qualifying_rounds.html',
 '1958_Wightman_Cup.html',
 '1988_State_of_Origin_series.html',
 '1st_Strategic_Aerospace_Division.html',
 '2001_Australian_Individual_Speedway_Championship.html',
 '2001_NCAA_Division_I_Field_Hockey_Championship.html',
 '2004_Tuvalu_ADivision.html',
 '2005E2809306_in_Welsh_football.html',
 '2007E2809308_Huddersfield_Town_A.F.C._season.html',
 '2008_Fed_Cup_World_Group_II.html',
 '2009_English_cricket_season.html',
 '2009_World_Junior_Ice_Hockey_Championships_rosters.html',
 '2010_Karshi_Challenger_E28093_Singles.html',
 '2011E2809312_Western_Collegiate_Hockey_Association_women27s_ice_hockey_season.html',
 '2011_ITU_Duathlon_World_

In [2]:
print(len(files_names))

999


There are a total of 999 files in the `wiki` folder.

Let's now read the first file in the `files_names` list.

In [3]:
folder_name = "wiki"
first_file_name = files_names[0]

with open(os.path.join(folder_name, first_file_name)) as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>100 Greatest Romanians - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"100_Greatest_Romanians","wgTitle":"100 Greatest Romanians","wgCurRevisionId":739997309,"wgRevisionId":739997309,"wgArticleId":5885981,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from November 2012","Articles containing Romanian-language text","Greatest Nationals","Lists of Romanian people","Romanian Television","Romanian television series"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"w

In this guided project, we will use the `map_reduce()` function that we have developed throughout the course to parallelize our process. Below is the code we came up with:

In [4]:
import math
import functools
from multiprocessing import Pool

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    # The context manager releases the worker processes once the map step is done
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
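
To make the chunking and reducing steps concrete, here is a minimal single-process sketch. It reuses `make_chunks` as defined above and replaces the process pool with the built-in `map`, so it only illustrates the data flow, not the parallelism. The helper names `sum_chunk` and `add` are hypothetical and exist only for this illustration.

```python
import functools
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def sum_chunk(chunk):
    # Mapper: turns one chunk into a partial result
    return sum(chunk)

def add(x, y):
    # Reducer: combines two partial results
    return x + y

data = list(range(10))
chunks = make_chunks(data, 4)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# Same flow as map_reduce(), but sequential
chunk_results = list(map(sum_chunk, chunks))
total = functools.reduce(add, chunk_results)
print(total)  # 45
```

Note that the last chunk can be shorter than the others, because the chunk size is rounded up.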

We want to explore the data a little bit more and count the total number of lines in all files stored in the `wiki` folder. There are several ways to do this. One of them is to use MapReduce, to speed up the computation.

In the next block of code, we define a mapper and a reducer function that will be called by the `map_reduce()` function.
We will then compute the total number of lines on all files by parallelizing on 8 processes.

In [5]:
def map_line_count(files_names):
    number_lines = 0
    for file in files_names:
        with open(os.path.join("wiki", file)) as f:
            number_lines += len(f.readlines())
    return number_lines

def reduce_line_count(number_lines_1, number_lines_2):
    return number_lines_1 + number_lines_2

total_number_lines = map_reduce(files_names, 8, map_line_count, reduce_line_count)
print(total_number_lines)

499797


The 999 files in the `wiki` folder contain a total of 499797 lines.

## Implementing a MapReduce Grep Algorithm

In this section, we'll implement a first MapReduce grep algorithm. The goal is to locate all lines in all files from the `wiki` folder that contain a given string.

Given a target word, the output should be a dictionary where the keys are the file names and the values are the lists of indexes of the lines that contain the target word.

In [6]:
def map_grep(files_names):
    results = {}
    for fn in files_names:
        with open(fn) as f:
            lines = [line for line in f.readlines()]
        for line_index, line in enumerate(lines):
            if target in line:
                if fn not in results:
                    results[fn] = []
                results[fn].append(line_index)
    return results

def reduce_grep(results1, results2):
    results1.update(results2)
    return results1

def map_reduce_grep(dirpath, num_processes):
    files_names = [os.path.join(dirpath,fn) for fn in os.listdir(dirpath)]
    return map_reduce(files_names, num_processes, map_grep, reduce_grep)
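
The mapper above reads `target` from a global variable, which relies on the worker processes inheriting that variable. As a hedged alternative, the sketch below passes the target explicitly with `functools.partial`. The function name `map_grep_explicit` and the sample file are hypothetical, and the mapper is called directly rather than through a pool to keep the example small.

```python
import functools
import os
import tempfile

def map_grep_explicit(target, files_names):
    """Hypothetical variant of map_grep that takes the target as an argument."""
    results = {}
    for fn in files_names:
        with open(fn) as f:
            for line_index, line in enumerate(f):
                if target in line:
                    results.setdefault(fn, []).append(line_index)
    return results

with tempfile.TemporaryDirectory() as tmp_dir:
    # Small made-up file standing in for one of the articles
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w") as f:
        f.write("no match here\nsome data here\nDATA in caps\nmore data\n")

    # Bind the target so that the mapper takes a single argument again
    mapper = functools.partial(map_grep_explicit, "data")
    demo_results = mapper([path])
    demo_lines = demo_results[path]

print(demo_lines)  # [1, 3]
```

Because `map_grep_explicit` is defined at module level, the resulting partial object can be pickled, so it could be handed to `map_reduce()` in place of `map_grep`.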

We use the `map_reduce_grep` function to find all occurrences of the string `data` in the files stored in the `wiki` folder.

In [7]:
target = "data"
data_occurences = map_reduce_grep("wiki", 8)

pprint.pprint(data_occurences)

{'wiki/100_Greatest_Romanians.html': [45, 52, 261, 344, 361],
 'wiki/104th_Logistic_Support_Brigade_(United_Kingdom).html': [49,
                                                               57,
                                                               61,
                                                               170,
                                                               253,
                                                               270],
 'wiki/16th_Virginia_Infantry.html': [49,
                                      63,
                                      94,
                                      98,
                                      102,
                                      104,
                                      146,
                                      153,
                                      236,
                                      253,
                                      280],
 'wiki/1896_Indiana_Hoosiers_football_team.html': [153, 159, 4

                                    202,
                                    247,
                                    330,
                                    355],
 'wiki/Appa_(film).html': [49, 269, 352, 369],
 'wiki/Arabic_Toilers27_Movement.html': [43, 46, 51, 54, 67, 69, 118, 201, 218],
 'wiki/Arcadio_GonzC3A1lez.html': [46,
                                   47,
                                   48,
                                   49,
                                   50,
                                   51,
                                   55,
                                   62,
                                   64,
                                   104,
                                   113,
                                   196,
                                   213,
                                   240],
 'wiki/Archibald_Gordon_(British_Army_officer).html': [50,
                                                       51,
                                     

                                    411,
                                    429,
                                    441,
                                    454,
                                    467,
                                    558,
                                    1039,
                                    1105,
                                    1123,
                                    1152,
                                    1168,
                                    1181,
                                    1195,
                                    1209,
                                    1543,
                                    1581,
                                    1664,
                                    1689],
 'wiki/Bay_of_ConcepciC3B3n.html': [6, 45, 58, 60, 62, 105, 188, 205],
 'wiki/Bazemore_Alabama.html': [6, 57, 59, 69, 152, 204, 206, 255, 338, 355],
 'wiki/Belgium_women27s_national_field_hockey_team.html': [49,
                                     

                                  6884,
                                  6901],
 'wiki/Doug_Sahm_and_Band.html': [49, 129, 264, 310, 393, 410],
 'wiki/Doumanaba.html': [6, 56, 58, 68, 72, 129, 131, 154, 204, 440, 523, 540],
 'wiki/Dowell_Philip_O27Reilly.html': [155, 238, 255],
 'wiki/Dragnet_(franchise).html': [6, 50, 229, 231, 238, 362, 620, 703, 728],
 'wiki/DresdenPlauen_railway_station.html': [6,
                                             49,
                                             57,
                                             68,
                                             140,
                                             196,
                                             201,
                                             247,
                                             330,
                                             355],
 'wiki/DuC5A1anovo.html': [6, 57, 59, 69, 73, 290, 291, 293, 342, 425, 442],
 'wiki/Durham_Women27s_F.C..html': [95,
                             

                             199,
                             200,
                             201,
                             203,
                             204,
                             750,
                             788,
                             871,
                             888],
 'wiki/FC3A9lix_CC3A1rdenas.html': [48, 284, 286, 335, 418, 443],
 'wiki/FC_Bobruisk.html': [43, 46, 202, 295, 297, 346, 429, 446],
 'wiki/FC_Khikhani_Khulo.html': [74, 76, 117, 125, 208, 225, 252],
 'wiki/Faces_(RunE28093D.M.C._song).html': [49, 259, 261, 310, 393, 410],
 'wiki/Fahy_County_Mayo.html': [6, 43, 46, 52, 54, 56, 105, 188, 205],
 'wiki/Failing_Office_Building.html': [6,
                                       60,
                                       68,
                                       70,
                                       79,
                                       81,
                                       90,
                                       92,
      

                       57,
                       59,
                       69,
                       73,
                       119,
                       1093,
                       1096,
                       1098,
                       1147,
                       1230,
                       1247],
 'wiki/Kusaka_Station.html': [6, 44, 47, 53, 173, 175, 177, 226, 309, 334],
 'wiki/Kyokutou_I_Love_You.html': [49, 476, 559, 576],
 'wiki/L._Fry.html': [43, 46, 56, 172, 175, 399, 482, 499],
 'wiki/La_Franja.html': [6, 49, 280, 919, 1306, 1346, 1379, 1417, 1500, 1525],
 'wiki/La_Roca_de_la_Sierra.html': [6,
                                    55,
                                    66,
                                    68,
                                    78,
                                    82,
                                    86,
                                    344,
                                    349,
                                    351,
                 

 'wiki/Olivaceous_flatbill.html': [50, 59, 72, 113, 118, 120, 169, 252, 277],
 'wiki/Olive_Dennis.html': [49, 165, 248, 265],
 'wiki/Oliver_Twist_(1912_American_film).html': [284, 286, 335, 418, 435],
 'wiki/Omaha_Racers.html': [380, 463, 480],
 'wiki/Omar_Onsi.html': [97, 99, 148, 231, 248],
 'wiki/Omiodes_iridias.html': [108, 114, 119, 169, 252, 269],
 'wiki/One_Night_of_Sin.html': [50, 110, 220, 418, 501, 518],
 'wiki/Ordinary_Virginia.html': [6,
                                 57,
                                 59,
                                 69,
                                 71,
                                 82,
                                 170,
                                 288,
                                 290,
                                 339,
                                 422,
                                 439],
 'wiki/Ordinary_singularity.html': [49, 51, 95, 178, 195],
 'wiki/Oued_TlC3A9lat.html': [6, 52, 56, 96, 112, 238, 240, 242, 291, 374

                             193,
                             201,
                             205,
                             212,
                             549,
                             632,
                             649],
 'wiki/Uralochka_Zlatoust.html': [47, 152, 235, 252],
 'wiki/Urban_chicken.html': [199, 282, 299],
 'wiki/Urodilatin.html': [47, 50, 129, 157, 158, 198, 281, 298],
 'wiki/Urs_Burkart.html': [73, 91, 94, 96, 145, 228, 245],
 'wiki/Uruguayan_constitutional_referendum_2014.html': [51,
                                                        282,
                                                        440,
                                                        442,
                                                        491,
                                                        574,
                                                        591],
 'wiki/Usher_Gahagan.html': [47, 80, 163, 180],
 'wiki/UtielRequena.html': [6, 45, 92, 123, 206, 223],
 'wiki/V

In [8]:
print(len(data_occurences))

999


The string `data` appears in all 999 articles. Let's look at the first article in the list to verify that the match is correct.
According to the results, the first matching line in this article is at index 45.

In [9]:
folder_name = "wiki"
first_file_name = files_names[0]

with open(os.path.join(folder_name, first_file_name)) as f:
    lines = [line for line in f.readlines()]
    print(lines[45])

<div class="thumbinner" style="width:202px;"><a href="/wiki/File:Mari_romani_logo-tv.jpg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Mari_romani_logo-tv.jpg/200px-Mari_romani_logo-tv.jpg" width="200" height="122" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Mari_romani_logo-tv.jpg/300px-Mari_romani_logo-tv.jpg 1.5x, //upload.wikimedia.org/wikipedia/en/c/c3/Mari_romani_logo-tv.jpg 2x" data-file-width="373" data-file-height="228" /></a>



We can see that `data` occurs twice near the end of the 46th line (the line at index 45), as part of the image attributes `data-file-width` and `data-file-height`.
We expect most of the articles to contain at least one image, so it is not surprising to find occurrences of `data` in every article.

Let's finish by counting the total number of matching lines across all files.

In [10]:
total_count = 0
for key, value in data_occurences.items():
    total_count += len(value)
print(total_count)

10339


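The string `data` appears on 10339 lines across the entire dataset. Since a single line can contain several occurrences, the number of individual occurrences is at least that large.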

## Making the Grep Function Case-Insensitive

Let's now improve our grep function by making it case-insensitive. This means that the case of the characters in the strings won't matter.

To implement that, we convert each line of each file, as well as the target word, to lowercase before comparing them.

In [11]:
def map_grep_insensitive(files_names):
    results = {}
    for fn in files_names:
        with open(fn) as f:
            lines = [line.lower() for line in f.readlines()]
        for line_index, line in enumerate(lines):
            if target.lower() in line:
                if fn not in results:
                    results[fn] = []
                results[fn].append(line_index)
    return results

def map_reduce_grep_insensitive(dirpath, num_processes):
    files_names = [os.path.join(dirpath,fn) for fn in os.listdir(dirpath)]
    return map_reduce(files_names, num_processes, map_grep_insensitive, reduce_grep)
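
As a small illustration of the lowercase comparison used in the mapper above, the sketch below runs both kinds of test on a few made-up lines. The sample lines are hypothetical and do not come from the dataset.

```python
sample_lines = ["raw data", "Data Science", "DATABASE", "nothing"]
target = "data"

# Case-sensitive: only exact lowercase matches are found
sensitive_hits = [i for i, line in enumerate(sample_lines) if target in line]

# Case-insensitive: both sides are lowercased before the comparison
insensitive_hits = [
    i for i, line in enumerate(sample_lines)
    if target.lower() in line.lower()
]

print(sensitive_hits)    # [0]
print(insensitive_hits)  # [0, 1, 2]
```

The case-insensitive result always contains every line of the case-sensitive one, which is the property checked later when the two result dictionaries are compared.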

Let's try to find all occurrences of the word `data` again.

In [12]:
target = "data"

data_occurences_insensitive = map_reduce_grep_insensitive("wiki", 8)
pprint.pprint(data_occurences_insensitive)

{'wiki/100_Greatest_Romanians.html': [45, 52, 261, 344, 361],
 'wiki/104th_Logistic_Support_Brigade_(United_Kingdom).html': [49,
                                                               57,
                                                               61,
                                                               170,
                                                               253,
                                                               270],
 'wiki/16th_Virginia_Infantry.html': [49,
                                      63,
                                      94,
                                      98,
                                      102,
                                      104,
                                      146,
                                      153,
                                      236,
                                      253,
                                      280],
 'wiki/1896_Indiana_Hoosiers_football_team.html': [153, 159, 4

                                     137,
                                     201,
                                     253,
                                     296,
                                     307,
                                     309,
                                     358,
                                     441,
                                     466],
 'wiki/Alex_Kurtzman.html': [49, 338, 353, 387, 398, 481, 498, 525],
 'wiki/Alex_McEachern.html': [49, 107, 109, 158, 241, 258],
 'wiki/Alexander_Rizzoni.html': [6, 45, 57, 67, 89, 105, 151, 234, 259],
 'wiki/Alexios_Aspietes.html': [98, 181, 198],
 'wiki/Alpine_skiing_at_the_1994_Winter_Olympics_E28093_Men27s_combined.html': [50,
                                                                                77,
                                                                                80,
                                                                                83,
                                    

 'wiki/Daniel_Cerone.html': [64, 300, 301, 309, 311, 360, 443, 460],
 'wiki/Danish_Maritime_Safety_Administration.html': [50,
                                                     110,
                                                     128,
                                                     141,
                                                     149,
                                                     158,
                                                     203,
                                                     286,
                                                     303],
 'wiki/Danny_Gray.html': [43, 46, 123, 125, 174, 257, 274],
 'wiki/David_Beasley.html': [49, 202, 309, 351, 434, 451],
 'wiki/David_Jesson.html': [82, 151, 234, 251],
 'wiki/David_Sands_(psychologist).html': [154, 165, 248, 265, 292],
 'wiki/David_Solomona.html': [47, 565, 648, 665],
 'wiki/De_La_Salle_University_E28093_DasmariC3B1as.html': [6,
                                                           44,

 'wiki/Kailas_Pal.html': [88, 171, 188],
 'wiki/Kamakshyanagar.html': [6, 57, 59, 69, 71, 82, 86, 477, 560, 577],
 'wiki/Kanakomyrtus.html': [144, 227, 252],
 'wiki/Kate_Harwood.html': [43, 46, 79, 86, 138, 221, 238],
 'wiki/Kate_Quarrie.html': [79, 81, 130, 213, 230],
 'wiki/Kattukukke.html': [6, 43, 46, 60, 117, 120, 122, 124, 173, 256, 273],
 'wiki/Keegan_Pereira_(footballer).html': [177, 284, 367, 384],
 'wiki/Keeled_sideband.html': [55, 117, 133, 183, 266, 283],
 'wiki/Kellie_Jones.html': [94, 97, 103, 105, 154, 237, 254],
 'wiki/Kelvin_Mbilinyi.html': [43, 46, 136, 144, 227, 244, 271],
 'wiki/Kelvin_R._Throop.html': [49, 51, 93, 176, 193],
 'wiki/Kendal_Williams.html': [50, 99, 105, 113, 118, 270, 353, 370],
 'wiki/Kendalia_Texas.html': [6,
                              52,
                              62,
                              64,
                              74,
                              76,
                              87,
                              236,
    

                                    106,
                                    107,
                                    230,
                                    235,
                                    246,
                                    431,
                                    514,
                                    531],
 'wiki/Nobuhiko_Ushiba.html': [261, 344, 361],
 'wiki/Noise_(song).html': [49, 572, 655, 672],
 'wiki/Norderbrarup.html': [6,
                            52,
                            63,
                            65,
                            78,
                            83,
                            308,
                            330,
                            332,
                            370,
                            381,
                            464,
                            489,
                            516],
 'wiki/North_Coast_(RTA_Rapid_Transit_station).html': [6,
                                                       55,
   

                                  288,
                                  300,
                                  312,
                                  324,
                                  336,
                                  348,
                                  360,
                                  405,
                                  581,
                                  664,
                                  689],
 'wiki/The_K_of_D.html': [43, 46, 101, 184, 201],
 'wiki/The_Land_of_the_Dead.html': [49, 396, 479, 496],
 'wiki/The_Lavender_Cowboy.html': [45, 109, 192, 209],
 'wiki/The_LoveGirl_and_the_Innocent.html': [43, 46, 190, 273, 290],
 'wiki/The_Master_Plan_(album).html': [50, 128, 287, 370, 387],
 'wiki/The_Portal_(community_center).html': [6, 178, 211, 294, 311],
 'wiki/The_Purpose_Driven_Church.html': [45, 124, 207, 224],
 'wiki/The_Right_Kinda_Lover.html': [44, 47, 57, 315, 398, 415],
 'wiki/The_Showdown_Effect.html': [49, 165, 248, 265],
 'wiki/The_Summer_King.htm

Let's verify that the new implementation works by seeing if it finds more matches than the previous implementation.

In [13]:
for fn in data_occurences_insensitive:
    if fn not in data_occurences:
        print("Found {} matches on file {}".format(len(data_occurences_insensitive[fn]), fn))
    elif len(data_occurences_insensitive[fn]) > len(data_occurences[fn]):
        print("Found {} new matches on file {}".format(len(data_occurences_insensitive[fn]) - len(data_occurences[fn]), fn))

Found 1 new matches on file wiki/Table_Point_Formation.html
Found 1 new matches on file wiki/Ingrid_GuimarC3A3es.html
Found 2 new matches on file wiki/Jules_Verne_ATV.html
Found 1 new matches on file wiki/Pictogram.html
Found 2 new matches on file wiki/Claire_Danes.html
Found 1 new matches on file wiki/PTPRS.html
Found 1 new matches on file wiki/A_Beautiful_Valley.html
Found 1 new matches on file wiki/Mudramothiram.html
Found 2 new matches on file wiki/Gordon_Bau.html
Found 1 new matches on file wiki/Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
Found 3 new matches on file wiki/Code_page_1023.html
Found 1 new matches on file wiki/Cryptographic_primitive.html
Found 1 new matches on file wiki/Alex_Kurtzman.html
Found 1 new matches on file wiki/Filip_Pyrochta.html
Found 1 new matches on file wiki/Morgana_King.html
Found 1 new matches on file wiki/Don_Parsons_(ice_hockey).html
Found 1 new matches on file wiki/Bias.html
Found 2 new matches on file wiki/Tomohiko_ItC58D_(director).html
Found

By making the grep function case-insensitive, we found additional lines matching `data` in several of the files.

Right now, we are only finding the line numbers where there is at least one occurrence. Let's extend the algorithm so that it provides information about the location of the matches in those lines.

## Locating Matches Within Lines

The new implementation should return pairs of indices where the first value is the line index and the second value is the index of the first character of the match on that line.

Before we modify the mapper function, we need a new function that finds all occurrences of a target within a string.

In [14]:
def find_match_indexes(line, target):
    results = []
    i = line.find(target, 0)
    while i != -1:
        results.append(i)
        i = line.find(target, i + 1)
    return results

# Implementation test
s = "Data science is related to data mining, machine learning and big data.".lower()
print(find_match_indexes(s, "data"))

[0, 27, 65]
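
One detail of `find_match_indexes` is worth spelling out: because the search restarts at `i + 1`, overlapping matches are reported too. The sketch below contrasts this with `re.finditer`, which only returns non-overlapping matches. The test string `"aaaa"` is a made-up example.

```python
import re

def find_match_indexes(line, target):
    results = []
    i = line.find(target, 0)
    while i != -1:
        results.append(i)
        i = line.find(target, i + 1)
    return results

# Restarting at i + 1 lets a match begin inside the previous one
overlapping = find_match_indexes("aaaa", "aa")

# re.finditer resumes after the end of each match
non_overlapping = [m.start() for m in re.finditer(re.escape("aa"), "aaaa")]

print(overlapping)      # [0, 1, 2]
print(non_overlapping)  # [0, 2]
```

For a target such as `data`, which cannot overlap with itself, both approaches give the same indexes.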


We can then modify the mapper function which will call the newly created `find_match_indexes` function.

In [15]:
def map_grep_indexes(files_names):
    results = {}
    for fn in files_names:
        with open(fn) as f:
            lines = [line.lower() for line in f.readlines()]
        for line_index, line in enumerate(lines):
            # The lines are lowercased, so the target must be lowercased as well
            match_indexes = find_match_indexes(line, target.lower())
            if fn not in results:
                results[fn] = []
            results[fn] += [(line_index, match_index) for match_index in match_indexes]
    return results

def map_reduce_grep_indexes(dirpath, num_processes):
    files_names = [os.path.join(dirpath,fn) for fn in os.listdir(dirpath)]
    return map_reduce(files_names, num_processes, map_grep_indexes, reduce_grep)

Let's try this new implementation with the word `science`!

In [16]:
target = "science"
occurences = map_reduce_grep_indexes("wiki", 8)

pprint.pprint(occurences)

{'wiki/100_Greatest_Romanians.html': [],
 'wiki/104th_Logistic_Support_Brigade_(United_Kingdom).html': [],
 'wiki/16th_Virginia_Infantry.html': [],
 'wiki/1896_Indiana_Hoosiers_football_team.html': [],
 'wiki/1898_Colgate_football_team.html': [],
 'wiki/1910_in_literature.html': [(115, 27),
                                  (115, 51),
                                  (115, 60),
                                  (253, 133),
                                  (253, 169),
                                  (253, 198),
                                  (280, 162),
                                  (280, 186),
                                  (280, 203)],
 'wiki/1915_Montana_football_team.html': [],
 'wiki/1951_National_League_tiebreaker_series.html': [],
 'wiki/1953E2809354_FA_Cup_qualifying_rounds.html': [],
 'wiki/1958_Wightman_Cup.html': [],
 'wiki/1988_State_of_Origin_series.html': [],
 'wiki/1st_Strategic_Aerospace_Division.html': [],
 'wiki/2001_Australian_Individual_Speedway_Champio

 'wiki/Hideki_Kase.html': [],
 'wiki/High_Efficiency_Image_File_Format.html': [],
 'wiki/Hilyard_Robinson.html': [],
 'wiki/Hoghiz.html': [],
 'wiki/Holly_Golightly_(comics).html': [],
 'wiki/Hope_7_(album).html': [],
 'wiki/Horace_Pinker.html': [],
 'wiki/How_Can_We_Hang_On_to_a_Dream3F.html': [],
 'wiki/Hypocysta_angustata.html': [],
 'wiki/I27m_Walking_Behind_You.html': [],
 'wiki/I27ve_Heard_That_Song_Before.html': [],
 'wiki/ISMACryp.html': [],
 'wiki/I_Am_Cold.html': [],
 'wiki/I_Marine_Expeditionary_Force.html': [],
 'wiki/I_Q_(book_series).html': [],
 'wiki/Igor_and_Grichka_Bogdanoff.html': [(6, 851),
                                          (72, 436),
                                          (72, 460),
                                          (72, 477),
                                          (72, 506),
                                          (101, 75),
                                          (101, 357),
                                          (101, 519),
          

                             (106, 1059),
                             (106, 1135),
                             (106, 1219),
                             (118, 295),
                             (118, 386),
                             (118, 431),
                             (118, 468),
                             (118, 498),
                             (118, 747),
                             (118, 972),
                             (166, 868),
                             (166, 943),
                             (166, 1002)],
 'wiki/Women27s_javelin_throw_world_record_progression.html': [],
 'wiki/Woodward_Dream_Cruise.html': [],
 'wiki/Work_domain_analysis.html': [],
 'wiki/Wreck_Island_Natural_Area_Preserve.html': [],
 'wiki/Wren_Alabama.html': [],
 'wiki/Wu_Jin.html': [],
 'wiki/XCops_(band).html': [],
 'wiki/YC3AAn_BC3A1i.html': [],
 'wiki/Yarkant_County.html': [],
 'wiki/Yarumal.html': [],
 'wiki/Yasid.html': [],
 'wiki/Yemeni_rial.html': [],
 'wiki/Yoshinkan.html': [],
 'wi

Our grep algorithms can now find all matches and return their exact location. However, with the dictionary it produces, it's not very easy to see those matches. Let's write the results out in a better way.

## Writing the Results Out

In order to better analyze the matches, we will write the results into a CSV file.

We will use four columns:
- `File`: Shows the name of the file the target word was matched in
- `Line`: Shows the index of the line the target word was matched in
- `Index`: Shows the index on the line where the target word is located
- `Context`: Shows the text around the match so that users can see the context

In [17]:
import csv

# Number of characters to show before and after the match
context_delta = 30

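with open("results.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    rows = [["File", "Line", "Index", "Context"]]
    for fn in occurences:
        # Use a separate name for the article file so it doesn't shadow the CSV file handle
        with open(fn) as f:
            # Only drop the line ending: stripping leading whitespace would shift the match indexes
            lines = [line.rstrip("\n") for line in f.readlines()]
        for line, index in occurences[fn]:
            start = max(index - context_delta, 0)
            end   = index + len(target) + context_delta
            rows.append([fn, line, index, lines[line][start:end]])
    writer.writerows(rows)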
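
To see what the `start` and `end` computation produces, here is a worked example on the test sentence used earlier, with a smaller, hypothetical `context_delta` of 10 so the result stays short.

```python
sentence = "Data science is related to data mining, machine learning and big data."
target = "data"
context_delta = 10  # smaller than in the project, for readability

index = 27  # position of the second match, as found earlier
start = max(index - context_delta, 0)
end = index + len(target) + context_delta
context = sentence[start:end]
print(context)  # elated to data mining, m

# For a match at the very beginning of the line, max() keeps start at 0
first_context = sentence[max(0 - context_delta, 0):0 + len(target) + context_delta]
print(first_context)  # Data science i
```

Slicing past the end of a string is safe in Python, which is why only the lower bound needs to be clamped.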

Let's finish by reading this CSV file into a pandas DataFrame.

In [19]:
import pandas as pd

df = pd.read_csv("results.csv")
df.head(10)

|   | File | Line | Index | Context |
|---|------|------|-------|---------|
| 0 | wiki/Valentin_Yanin.html | 6 | 840 | `embers of the USSR Academy of Sciences","Full ...` |
| 1 | wiki/Valentin_Yanin.html | 6 | 890 | `ers of the Russian Academy of Sciences","Demid...` |
| 2 | wiki/Valentin_Yanin.html | 66 | 90 | `href="/wiki/Soviet_Academy_of_Sciences" class=...` |
| 3 | wiki/Valentin_Yanin.html | 66 | 145 | `ect" title="Soviet Academy of Sciences">Soviet...` |
| 4 | wiki/Valentin_Yanin.html | 66 | 173 | `f Sciences">Soviet Academy of Sciences</a>; he...` |
| 5 | wiki/Valentin_Yanin.html | 144 | 1440 | `rs_of_the_USSR_Academy_of_Sciences" title="Cat...` |
| 6 | wiki/Valentin_Yanin.html | 144 | 1502 | `rs of the USSR Academy of Sciences">Full Membe...` |
| 7 | wiki/Valentin_Yanin.html | 144 | 1548 | `rs of the USSR Academy of Sciences</a></li><li...` |
| 8 | wiki/Valentin_Yanin.html | 144 | 1632 | `of_the_Russian_Academy_of_Sciences" title="Cat...` |
| 9 | wiki/Valentin_Yanin.html | 144 | 1697 | `of the Russian Academy of Sciences">Full Membe...` |


## Conclusion

Locating data in text files is a very common operation, and it becomes time-consuming when many files are involved. By using MapReduce to spread the work over several processes, we can reduce the time required to locate that data.

In this guided project we've implemented a MapReduce grep algorithm that locates all matches of a given string within all files in a given folder.

There are many improvements we could add to our algorithm (the grep command offers many other options). Here are some ideas:
- Consider files located in subdirectories
- Use the `re` module to make it possible to search for regular expressions
- Make it possible to specify the search options rather than having a search function for each set of options
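
The ideas above could be combined along the following lines. This is only a rough sketch under stated assumptions, not part of the project: the function names `grep_file` and `list_files_recursive`, the parameters `ignore_case` and `use_regex`, and the sample file are all hypothetical. It also uses `re.finditer`, which reports non-overlapping matches only.

```python
import os
import re
import tempfile

def grep_file(path, pattern, ignore_case=False, use_regex=False):
    """Return (line_index, match_index) pairs for one file."""
    flags = re.IGNORECASE if ignore_case else 0
    # Plain strings are escaped so they are matched literally
    regex = re.compile(pattern if use_regex else re.escape(pattern), flags)
    matches = []
    with open(path, encoding="utf-8") as f:
        for line_index, line in enumerate(f):
            matches += [(line_index, m.start()) for m in regex.finditer(line)]
    return matches

def list_files_recursive(dirpath):
    """Collect the files in dirpath and all of its subdirectories."""
    paths = []
    for root, _dirs, files in os.walk(dirpath):
        paths += [os.path.join(root, name) for name in files]
    return sorted(paths)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up file in a subdirectory, standing in for an article
    os.makedirs(os.path.join(tmp_dir, "sub"))
    sample = os.path.join(tmp_dir, "sub", "page.html")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Data science\nbig data, more DATA\n")

    found_files = [os.path.relpath(p, tmp_dir) for p in list_files_recursive(tmp_dir)]
    plain = grep_file(sample, "data")
    insensitive = grep_file(sample, "data", ignore_case=True)
    regex_hits = grep_file(sample, r"d\w+a", ignore_case=True, use_regex=True)

print(plain)        # [(1, 4)]
print(insensitive)  # [(0, 0), (1, 4), (1, 15)]
```

A mapper built on `grep_file` could then be bound to its options with `functools.partial` and passed to `map_reduce()` together with the output of `list_files_recursive`.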