# Analyzing Wikipedia Pages with MapReduce

The goal of this project is to implement a simplified version of the 'grep' command-line utility and use it to search roughly 54 megabytes of Wikipedia articles.

Each article was saved to a file named after the last component of its URL, with `.html` appended. The main goals are:

- Search for all occurrences of a string in all of the files.
- Provide a case-insensitive option to the search.
- Refine the results by reporting where in each file the matches occur.

## List files 

In [1]:
# List all of the files in the wiki folder
import os 
file_names = os.listdir("wiki")

# Count and display the number of files in the wiki folder
print(f"The number of files in the wiki folders : {len(file_names)}")

The number of files in the wiki folders : 999


In [2]:
# Read the first file in the wiki folder, and print its contents 
folder_name = "wiki"

with open(os.path.join(folder_name, file_names[0])) as file : 
    lines = [line for line in file.readlines()]
    
for line in lines : 
    print(line)

<!DOCTYPE html>

<html class="client-nojs" lang="en" dir="ltr">

<head>

<meta charset="UTF-8"/>

<title>Bay of Concepción - Wikipedia</title>

<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>

<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgM

## Adding the MapReduce Framework

In [3]:
import math
import functools
from multiprocessing import Pool

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    # The context manager shuts the worker pool down once the map step finishes
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
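
To see how `make_chunks` splits the work, here is a small worked example. It repeats the definition above so that it runs on its own; the ten-element list is purely illustrative and is not part of the project data.

```python
import math

def make_chunks(data, num_chunks):
    # Same definition as above, repeated so the example is self-contained
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# Illustrative input: ten items split for three workers
example_chunks = make_chunks(list(range(10)), 3)
print(example_chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With 999 items and 8 workers, each chunk holds at most ceil(999 / 8) = 125 items
chunk_lengths = [len(chunk) for chunk in make_chunks(list(range(999)), 8)]
print(chunk_lengths)  # [125, 125, 125, 125, 125, 125, 125, 124]
```

Because the chunk size is rounded up, the last chunk can be shorter than the others, and a very small input can produce fewer chunks than requested.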

In [4]:
# Number of CPUs available on this machine
import os 

print(f"We will use {os.cpu_count()} processors for MapReduce")

We will use 8 processors for MapReduce


## The total number of lines in all files 

The `wiki` directory contains 999 files, so the list of file names is split into chunks and each chunk is handled by a separate process. The mapper opens every file in its chunk and returns the total number of lines it found. The reducer adds the totals from two chunks together.

In [5]:
# Count the total number of lines in all files 

def map_total(file_name_chunk) : 
    tot_count = 0 
    for file_name in file_name_chunk : 
        with open(os.path.join(folder_name, file_name)) as file : 
            lines = [line for line in file.readlines()]
            tot_count += len(lines)
    return tot_count 
    
def reduce_total(count1, count2) :
    return count1 + count2 

total_number_of_lines = map_reduce(file_names, 8, map_total, reduce_total)

In [6]:
print(f"The total number of lines in all files is {total_number_of_lines}")

The total number of lines in all files is 499797


## Grep Exact Match

The goal of the functions below is to locate every line, across all files in the `wiki` folder, that contains a given string. The output is a dictionary whose keys are file names and whose values are lists of the indexes (counted from 0) of the lines that contain the string.

The mapper receives a chunk of file names, records the index of each matching line, and stores the indexes in a dictionary keyed by file name. The reducer merges the dictionaries returned by the mappers into a single dictionary. The search string is read from the global variable `target`, which must be set before `map_reduce` is called.

In [7]:
# mapper function

def map_grep(file_names_chunk) : 
    result_dct = {} 
    for file_name in file_names_chunk : 
        with open(os.path.join(folder_name, file_name)) as file :
            lines = [line for line in file.readlines()]
        for idx, line in enumerate(lines) : 
            if target in line : 
                if file_name not in result_dct : 
                    result_dct[file_name] = []
                result_dct[file_name].append(idx) 
    return result_dct 
    
# reducer function

def reduce_grep(dict1, dict2) : 
    dict1.update(dict2)
    return dict1
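
Since every file name appears in exactly one chunk, `dict.update` never overwrites an entry in this project. If chunks could share keys, a reducer along the following lines would concatenate the lists instead of overwriting them. This is only a sketch: the name `reduce_grep_merge` and the sample dictionaries are hypothetical.

```python
import functools

def reduce_grep_merge(dict1, dict2):
    # Build a new dictionary instead of mutating dict1
    merged = {key: list(values) for key, values in dict1.items()}
    for key, values in dict2.items():
        # Append to an existing list, or start a new one for an unseen key
        merged.setdefault(key, []).extend(values)
    return merged

# Made-up mapper outputs in which "a.html" appears twice
partial_results = [{"a.html": [1]}, {"b.html": [0, 2]}, {"a.html": [7]}]
combined = functools.reduce(reduce_grep_merge, partial_results)
print(combined)  # {'a.html': [1, 7], 'b.html': [0, 2]}
```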

In [8]:
# Testing 
target = "data"
target_match = map_reduce(file_names, 8, map_grep, reduce_grep)
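
`map_grep` reads `target` and `folder_name` from global variables. That works when the worker processes inherit the parent's globals, but it may be fragile in other setups. One possible alternative, sketched below, binds both values to the mapper with `functools.partial`. The name `map_grep_explicit` and the temporary demo files are hypothetical and only serve the illustration; the mapper is applied to the chunks sequentially here to keep the example short.

```python
import functools
import os
import tempfile

def map_grep_explicit(file_names_chunk, target, folder_name):
    # Hypothetical variant of map_grep: target and folder are parameters
    result_dct = {}
    for file_name in file_names_chunk:
        with open(os.path.join(folder_name, file_name)) as file:
            for idx, line in enumerate(file):
                if target in line:
                    result_dct.setdefault(file_name, []).append(idx)
    return result_dct

def reduce_grep(dict1, dict2):
    dict1.update(dict2)
    return dict1

# Demo data: two tiny files in a temporary folder (not the wiki articles)
with tempfile.TemporaryDirectory() as demo_folder:
    demo_files = {"a.html": "no match\nsome data here\n",
                  "b.html": "data first\nnothing\nmore data\n"}
    for name, text in demo_files.items():
        with open(os.path.join(demo_folder, name), "w") as file:
            file.write(text)

    # Bind the search string and the folder; the resulting callable takes
    # only a chunk of file names, which is the shape a mapper needs
    mapper = functools.partial(map_grep_explicit, target="data",
                               folder_name=demo_folder)
    chunks = [["a.html"], ["b.html"]]
    demo_result = functools.reduce(reduce_grep, map(mapper, chunks))

print(demo_result)  # {'a.html': [1], 'b.html': [0, 2]}
```

With this design, changing the search string means building a new partial rather than reassigning a global before each run.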

## Case-Insensitive Grep

In [9]:
# mapper function

def map_grep_insensitive(file_names_chunk) : 
    result_dct = {} 
    for file_name in file_names_chunk : 
        with open(os.path.join(folder_name, file_name)) as file :
            lines = [line.lower() for line in file.readlines()]
        for idx, line in enumerate(lines) : 
            if target.lower() in line : 
                if file_name not in result_dct : 
                    result_dct[file_name] = []
                result_dct[file_name].append(idx) 
    return result_dct 
    
# reducer function

def reduce_grep_insensitive(dict1, dict2) : 
    dict1.update(dict2)
    return dict1

In [10]:
# Testing
target = "data"
target_match_insensitive = map_reduce(file_names, 8, map_grep_insensitive, reduce_grep_insensitive)

### Comparing the Exact and Case-Insensitive Results

For each file with more matches in the case-insensitive result than in the exact result, print the file name together with the number of new matches.

In [11]:
for file_name in target_match_insensitive : 
    if file_name not in target_match : 
        print(f"Find new {len(target_match_insensitive[file_name])} line match in {file_name}")
    elif len(target_match_insensitive[file_name]) > len(target_match[file_name]) : 
        print(f"Find new {len(target_match_insensitive[file_name]) - len(target_match[file_name])} line match in {file_name}")

Find new 1 line match in Table_Point_Formation.html
Find new 1 line match in Ingrid_GuimarC3A3es.html
Find new 2 line match in Jules_Verne_ATV.html
Find new 1 line match in Pictogram.html
Find new 2 line match in Claire_Danes.html
Find new 1 line match in PTPRS.html
Find new 1 line match in A_Beautiful_Valley.html
Find new 1 line match in Mudramothiram.html
Find new 2 line match in Gordon_Bau.html
Find new 1 line match in Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
Find new 3 line match in Code_page_1023.html
Find new 1 line match in Cryptographic_primitive.html
Find new 1 line match in Alex_Kurtzman.html
Find new 1 line match in Filip_Pyrochta.html
Find new 1 line match in Morgana_King.html
Find new 1 line match in Don_Parsons_(ice_hockey).html
Find new 1 line match in Bias.html
Find new 2 line match in Tomohiko_ItC58D_(director).html
Find new 1 line match in Imperial_Venus_(film).html
Find new 1 line match in Camp_Nelson_Confederate_Cemetery.html
Find new 1 line match in Benny_Lee

## Finding Match Positions on Lines 

In this section, I extend the algorithm so that it also reports where each match occurs within its line. The new implementation returns pairs of indexes: the first value is the line index and the second is the index of the first character of the match on that line.

In [12]:
def index_matches(line, target) : 
    result = []
    i = line.find(target, 0) 
    while i != -1 : 
        result.append(i)
        i = line.find(target, i + 1)
    return result 

# Testing 
test = "Where is my wallet"
print(index_matches(test, 'is'))

[6]
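
Because the search resumes at `i + 1` rather than after the end of the previous match, `index_matches` also reports overlapping occurrences. A small worked example follows, with the function repeated so the block runs on its own and with made-up input strings:

```python
def index_matches(line, target):
    # Same definition as above
    result = []
    i = line.find(target, 0)
    while i != -1:
        result.append(i)
        i = line.find(target, i + 1)
    return result

several = index_matches("data about data", "data")
overlapping = index_matches("aaaa", "aa")
missing = index_matches("no hits here", "data")
print(several)      # [0, 11]
print(overlapping)  # [0, 1, 2]
print(missing)      # []
```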


In [13]:
# mapper function

def map_grep_indexes(file_names_chunk) : 
    result_dct = {} 
    for file_name in file_names_chunk : 
        with open(os.path.join(folder_name, file_name)) as file :
            lines = [line.lower() for line in file.readlines()]
        for idx, line in enumerate(lines) : 
            if target.lower() in line : 
                inline_idxs = index_matches(line, target.lower())
                if file_name not in result_dct : 
                    result_dct[file_name] = []
                result_dct[file_name] += [(idx, inline_idx) for inline_idx in inline_idxs]
    return result_dct 
    
# reducer function

def reduce_grep_indexes(dict1, dict2) : 
    dict1.update(dict2)
    return dict1

In [14]:
# Testing 
target = "data"
target_match_indexes = map_reduce(file_names, 8, map_grep_indexes, reduce_grep_indexes)
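
Lower-casing both the line and the target is one way to ignore case. Another option, sketched here as an assumption rather than the approach used in this project, is a compiled regular expression with `re.IGNORECASE`, which leaves the line text unchanged so the original capitalisation is still available. The helper name `index_matches_ignore_case` and the sample string are hypothetical. Unlike `index_matches`, `finditer` reports only non-overlapping matches.

```python
import re

def index_matches_ignore_case(line, target):
    # re.escape makes the target match literally, even if it contains
    # characters such as "." or "(" that are special in regular expressions
    pattern = re.compile(re.escape(target), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(line)]

sample = "Data, DATA and metadata (data)."
positions = index_matches_ignore_case(sample, "data")
print(positions)  # [0, 6, 19, 25]

# The original casing is preserved because the line was never lower-cased
print([sample[p:p + 4] for p in positions])  # ['Data', 'DATA', 'data', 'data']
```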

## Displaying the Results 

To make the results easier to inspect, they are written to a CSV file with the following columns:

1. File: the name of the file containing the match
2. Line: the index of the line containing the match
3. Index: the index of the match within that line
4. Context: up to 40 characters of lower-cased text starting at the match, so that users can see the context

In [29]:
import csv 
with open('target_context.csv', mode = 'w', newline = '') as csv_file : 
    result_writer = csv.writer(csv_file)
    rows = [['File', 'Line', 'Index', 'Context']]
    
    for file_name in target_match_indexes : 
        # Use a separate name so the CSV file handle is not shadowed
        with open(os.path.join(folder_name, file_name)) as wiki_file : 
            lines = [line.lower() for line in wiki_file.readlines()]
        for line_idx, char_idx in target_match_indexes[file_name] : 
            # Context: up to 40 characters starting at the match
            rows.append([file_name, line_idx, char_idx, lines[line_idx][char_idx:char_idx+40]])
    result_writer.writerows(rows)
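
The slice above starts at the match, so the Context column shows only the text that follows it. If text on both sides is wanted, a window like the following could be used. The helper name `context_window`, the 20-character margin, and the sample line are illustrative assumptions.

```python
def context_window(line, index, match_length, margin=20):
    # Clamp the start at 0 so matches near the beginning of the line
    # do not produce a negative index (which would slice from the end)
    start = max(0, index - margin)
    end = index + match_length + margin
    return line[start:end]

sample_line = "this page links to geographic data for this location and more"
match_index = sample_line.find("data")
window = context_window(sample_line, match_index, len("data"))
print(match_index)  # 30
print(window)       # links to geographic data for this location a
```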

In [30]:
import pandas as pd 
df = pd.read_csv('target_context.csv')
df.head()

|   | File | Line | Index | Context |
|---|------|------|-------|---------|
| 0 | Bay_of_ConcepciC3B3n.html | 6 | 422 | `data","all stub articles","landforms of` |
| 1 | Bay_of_ConcepciC3B3n.html | 45 | 628 | `data-file-width="960" data-file-height="` |
| 2 | Bay_of_ConcepciC3B3n.html | 45 | 650 | `data-file-height="1192" /></a>\n` |
| 3 | Bay_of_ConcepciC3B3n.html | 58 | 447 | `data for this location"><span class="lat` |
| 4 | Bay_of_ConcepciC3B3n.html | 58 | 692 | `data for this location">36.683°s 73.033°` |
