<h1>
<center>
Dataquest Guided Project 22:
Analyzing Wikipedia Pages
</center>
</h1>

## Introduction

This is part of the Dataquest program.

- Part of the **Data Engineer** path
    - Step 2: **Handling Large Data Sets in Python**
        - Course 2: **Optimizing Code Performance On Large Datasets**
            - CPU Bound Programs
            - I/O Bound Programs
            - Overcoming The Limitations of Threads
            - Quickly Analyzing Data With Parallel Processing
       
As this is a guided project, we follow and extend the steps suggested by Dataquest. In this project, we will practise working with large datasets in Python.

## Use case : Analyzing Wikipedia Pages

In this guided project, we will be working with data scraped from [Wikipedia](https://www.wikipedia.org/), a popular online encyclopedia. Wikipedia is maintained by volunteer content contributors and editors who continuously improve content. 

In this project, we'll be analyzing 54 megabytes worth of articles to figure out patterns in the Wikipedia writing and content presentation style. The articles were scraped by hitting random pages on Wikipedia, then downloading the contents using the [requests](http://docs.python-requests.org/en/master/) package.

All the data are stored in the "wiki" folder.

Our main goals will be to:
- Extract only the text from the Wikipedia pages, and remove all the HTML and JavaScript markup.
- Remove common page headers and footers from Wikipedia pages.
- Figure out what tags are the most common in Wikipedia pages.
- Figure out patterns in the text.

## Explore the folder

First, let's list all of the files in the wiki folder.

In [1]:
import os
os.listdir("wiki")

['Furubira_District,_Hokkaido.html',
 'Valentin_Yanin.html',
 'Kings_XI_Punjab_in_2014.html',
 'William_Harvey_Lillard.html',
 'Radial_Road_3.html',
 'George_Weldrick.html',
 'Zgornji_Otok.html',
 'Blue_Heelers_(season_8).html',
 'Taggen_Nunatak.html',
 '1951_National_League_tie-breaker_series.html',
 'List_of_number-one_singles_of_1993_(Finland).html',
 'Vrila.html',
 'William_Henry_Porter.html',
 'Clive_Brown_(footballer).html',
 '2010_Karshi_Challenger_%E2%80%93_Singles.html',
 'Blick_nach_Rechts.html',
 'Central_District_(Rezvanshahr_County).html',
 'Gal%C3%A1pagos,_Guadalajara.html',
 'Campus_of_Texas_A%26M_University.html',
 'Alexios_Aspietes.html',
 'Mei_Lanfang.html',
 'Thalkirchen-Obersendling-Forstenried-F%C3%BCrstenried-Solln.html',
 'Coalville_Town_railway_station.html',
 'Gennady_Lesun.html',
 'Bartrum_Glacier.html',
 'Victor_S._Mamatey.html',
 'Gottfried_Keller.html',
 'Table_Point_Formation.html',
 'Nobuhiko_Ushiba.html',
 'Master_of_Space_and_Time.html',
 'Early_medieva

Let's count up the number of files in this folder: 

In [2]:
len(os.listdir("wiki"))

999
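As a quick sanity check on the "54 megabytes" figure quoted earlier, the total size of the folder can be measured directly. This is a minimal sketch: the helper name `folder_size_mb` is hypothetical, and a megabyte is assumed to mean 10**6 bytes.

```python
import os

def folder_size_mb(folder):
    """Total size of the regular files directly inside `folder`, in megabytes."""
    total_bytes = 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                total_bytes += entry.stat().st_size
    # Assumption: 1 megabyte = 10**6 bytes.
    return total_bytes / 1_000_000

# Only runs when the "wiki" folder is present in the working directory.
if os.path.isdir("wiki"):
    print(round(folder_size_mb("wiki"), 1))
```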

Let's now display a single file from the wiki folder to look at the raw HTML : 

In [3]:
with open("wiki/The_Land_of_the_Dead.html") as f: 
    file = f.read()
print(file)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>The Land of the Dead - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"The_Land_of_the_Dead","wgTitle":"The Land of the Dead","wgCurRevisionId":754928354,"wgRevisionId":754928354,"wgArticleId":1633029,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Fifth Doctor audio plays","2000 audio plays"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","Augu

We notice that the content is inside the div tag with the "content" id.
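To illustrate what "inside the div tag with the content id" means in practice, here is a dependency-free sketch built on the standard library's `html.parser`. It is not the approach used in the rest of this notebook, which relies on BeautifulSoup. The class name `ContentDivText`, the function `extract_content_text`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class ContentDivText(HTMLParser):
    """Collect the text found inside <div id="content">, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside div#content (0 = outside)
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the content div
        elif dict(attrs).get("id") == "content":
            self.depth = 1   # entering the content div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def extract_content_text(html):
    parser = ContentDivText()
    parser.feed(html)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())

sample = (
    '<div id="header">Menu</div>'
    '<div id="content"><h1>Title</h1><div class="box"><p>Body</p></div></div>'
    '<div id="footer">Footer</div>'
)
print(extract_content_text(sample))
```

The sketch only tracks `div` nesting, so it assumes reasonably well-formed markup; text inside any `script` tag nested in the content div would also be collected.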

## Read in the data

Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations. 

In [4]:
import time

def read_data(filename):
    with open(filename) as f:
        data = f.read()
    return data

start = time.time()
filenames = ["wiki/{}".format(f) for f in os.listdir("wiki")]

content = list()
for filename in filenames : 
    content.append(read_data(filename))

end = time.time()
print(end - start)

0.1779499053955078


As this task is I/O bound, we can use threads to help us read in the data more quickly.

In [5]:
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_data(filename):
    with open(filename) as f:
        data = f.read()
    return data

start = time.time()
filenames = ["wiki/{}".format(f) for f in os.listdir("wiki")]
content = pool.map(read_data, filenames)
content = list(content)

end = time.time()
print(end - start)

0.4769864082336426


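In this case, threading does not improve performance: the threaded run took about 0.48 seconds, against about 0.18 seconds for the plain loop. A likely explanation is that the files are small and quick to read, so the time gained by overlapping the reads is smaller than the overhead of creating and coordinating the threads.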
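A single timing of a sub-second task is noisy, so the comparison is more trustworthy when each variant is run several times. The sketch below is one way to do that. The helper names `read_serial`, `read_threaded` and `best_time` are hypothetical, and `encoding="utf-8"` is an assumption based on the `<meta charset="UTF-8"/>` tag seen in the sample page.

```python
import concurrent.futures
import os
import time

def read_data(filename):
    # Assumption: the pages are UTF-8 encoded, as their <meta charset> tag declares.
    with open(filename, encoding="utf-8") as f:
        return f.read()

def read_serial(filenames):
    return [read_data(name) for name in filenames]

def read_threaded(filenames, max_workers=4):
    # The context manager shuts the pool down once all reads have finished.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_data, filenames))

def best_time(func, *args, repeats=5):
    """Return the fastest of several runs, which is less sensitive to noise."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Only runs when the "wiki" folder is present in the working directory.
if os.path.isdir("wiki"):
    filenames = [os.path.join("wiki", f) for f in os.listdir("wiki")]
    print("serial  :", best_time(read_serial, filenames))
    print("threaded:", best_time(read_threaded, filenames))
```

Because the operating system usually caches files after the first read, repeated runs mostly measure in-memory reads, which is the situation where thread overhead is most visible.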

In [6]:
articles = [f.replace(".html", "").replace("wiki/", "") for f in filenames]
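Some file names in the listing are percent-encoded (for example `Gal%C3%A1pagos,_Guadalajara.html`). If readable titles are wanted, they can be decoded with `urllib.parse.unquote`. This is an optional sketch, and `readable_title` is a hypothetical helper name.

```python
import os
from urllib.parse import unquote

def readable_title(filename):
    """Turn a saved file name into a human-readable article title."""
    name = os.path.basename(filename)
    if name.endswith(".html"):
        name = name[: -len(".html")]
    # Decode percent-escapes such as %C3%A1, then turn underscores into spaces.
    return unquote(name).replace("_", " ")

print(readable_title("wiki/Gal%C3%A1pagos,_Guadalajara.html"))
```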

## Remove Extraneous Markup

Now that we have read in the data files, we can remove the extraneous markup that lies outside the div#content tag, which seems to hold most of the content.
We can use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) package for that. BeautifulSoup enables us to extract all of the content inside a specific tag.

In [7]:
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    content = str(soup.find_all("div", id="content")[0])
    return content

start = time.time()

parsed = list()
for cont in content :
    parsed.append(parse_html(cont))
    
end = time.time()
print(end-start)

50.30543303489685


This operation is more CPU intensive than before. Let's try using a process pool to see if the speed improves. Let's attempt with different numbers of workers and see how that affects performance. 

In [8]:
from bs4 import BeautifulSoup

def parsed_html(html):
    soup = BeautifulSoup(html, "html.parser")
    content = str(soup.find_all("div", id="content")[0])
    return content

for workers in range(1,6):
    start = time.time()
    # The context manager shuts each pool down before the next one is created.
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(parsed_html, content))
    end = time.time()
    print("Time elapsed for {} workers : {}".format(workers, end-start))

Time elapsed for 1 workers : 49.68454551696777
Time elapsed for 2 workers : 35.298086166381836
Time elapsed for 3 workers : 34.58976650238037
Time elapsed for 4 workers : 35.049715995788574
Time elapsed for 5 workers : 35.6042754650116


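The timings for 2 to 5 workers are all close to 35 seconds, with 3 workers marginally the fastest, so going beyond 2 workers brings no real gain on our computer. This is surprising as we have 4 CPUs; it may depend on what other computations the CPU is doing at the moment. With 1 worker, the process pool takes about 49.7 seconds, essentially the same as the 50.3 seconds of the plain loop, which suggests that the overhead of the pool itself is small compared with the parsing time.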
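The measured timings can be turned into speedup and parallel efficiency figures, which makes the plateau easier to see. The sketch below only does arithmetic on the numbers printed above; the variable names are hypothetical.

```python
# Timings (in seconds) copied from the output above, keyed by number of workers.
pool_times = {
    1: 49.68454551696777,
    2: 35.298086166381836,
    3: 34.58976650238037,
    4: 35.049715995788574,
    5: 35.6042754650116,
}

baseline = pool_times[1]
speedups = {}
for workers, elapsed in pool_times.items():
    speedup = baseline / elapsed    # how many times faster than 1 worker
    efficiency = speedup / workers  # 1.0 would mean perfect scaling
    speedups[workers] = speedup
    print(f"{workers} workers: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")

fastest = min(pool_times, key=pool_times.get)
print("fastest setting:", fastest, "workers")
```

A speedup that stalls well below the number of workers suggests that some part of the job is not parallelised. One possible contributor, which is an assumption rather than something measured here, is that every page has to be pickled and sent to a worker process, and the parsed string sent back.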

## Finding Common Tags

Now that we've extracted the main part of each page let's count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of "a" tags on each page, we know that Wikipedia articles tend to be very connected to other articles or pages. On the other hand, a lot of div tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.
We'll directly use a process pool with 3 workers.

In [9]:
from bs4 import BeautifulSoup
import pandas as pd

def count_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    tags = {}
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
tags = pool.map(count_tags, parsed)

tag_counts = {}
for tag in tags:
    for k,v in tag.items():
        if k not in tag_counts:
            tag_counts[k] = 0
        tag_counts[k] += v
end = time.time()

print(end - start)

19.87275218963623
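The merging loop above can also be written with `collections.Counter`, whose `update` method adds counts instead of overwriting them. This is an equivalent sketch on made-up per-page dictionaries; `merge_counts` is a hypothetical helper name.

```python
from collections import Counter

def merge_counts(per_page_counts):
    """Sum a sequence of {tag: count} dictionaries into one Counter."""
    total = Counter()
    for counts in per_page_counts:
        total.update(counts)  # adds to existing counts, does not replace them
    return total

# Made-up results, shaped like what count_tags returns for two pages.
pages = [{"a": 3, "div": 1}, {"a": 2, "li": 4}]
merged = merge_counts(pages)
print(merged.most_common())
```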


In [10]:
sorted(tag_counts.items(), key=lambda x: (-x[1], x[0]))

[('a', 161065),
 ('li', 85779),
 ('span', 67350),
 ('td', 57673),
 ('div', 28581),
 ('tr', 27300),
 ('i', 18246),
 ('th', 14472),
 ('b', 14455),
 ('sup', 11157),
 ('ul', 10972),
 ('p', 7998),
 ('img', 6701),
 ('br', 4986),
 ('h2', 4045),
 ('table', 4010),
 ('abbr', 3665),
 ('cite', 3563),
 ('small', 3272),
 ('dd', 1376),
 ('h1', 999),
 ('noscript', 999),
 ('ol', 858),
 ('h3', 777),
 ('strong', 599),
 ('dl', 457),
 ('dt', 334),
 ('caption', 200),
 ('sub', 151),
 ('h4', 117),
 ('code', 108),
 ('wbr', 85),
 ('q', 76),
 ('big', 75),
 ('center', 64),
 ('blockquote', 58),
 ('hr', 51),
 ('u', 51),
 ('font', 40),
 ('area', 39),
 ('rp', 32),
 ('rb', 16),
 ('rt', 16),
 ('ruby', 16),
 ('s', 10),
 ('bdi', 4),
 ('h5', 4),
 ('annotation', 2),
 ('audio', 2),
 ('del', 2),
 ('map', 2),
 ('math', 2),
 ('mo', 2),
 ('mrow', 2),
 ('mstyle', 2),
 ('samp', 2),
 ('semantics', 2),
 ('source', 2),
 ('h6', 1),
 ('pre', 1)]

Based on our findings, it looks like there are quite a few td, a, li, and span tags. This indicates that articles tend to have lots of links, along with lists and tables. Links are the most numerous tag, which indicates how interconnected articles on Wikipedia are.
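Dividing the totals by the 999 articles gives a per-article average, which is easier to interpret than the raw counts: links, for instance, come out at roughly 161 per article. The sketch below uses the five largest counts from the output above; the variable names are hypothetical.

```python
n_articles = 999

# The five most frequent tags, copied from the sorted output above.
top_tags = {"a": 161065, "li": 85779, "span": 67350, "td": 57673, "div": 28581}

per_article = {tag: count / n_articles for tag, count in top_tags.items()}
for tag, mean in per_article.items():
    print(f"{tag}: {mean:.1f} per article on average")
```

These are means over all pages; a few very long list or table articles could account for a large share of the `li` and `td` tags.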

## Finding Common Words

Let's consider only words with at least 5 letters.
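To make the word filter concrete, here is the same tokenization applied to a single made-up sentence. It is a standalone sketch, and `long_words` is a hypothetical helper; note that the test `len(w) >= 5` keeps words of exactly five characters.

```python
import re

def long_words(text, min_length=5):
    """Lowercase the text, split it on non-word characters, keep the long words."""
    # \W+ matches runs of characters that are not letters, digits or underscores.
    text = re.sub(r"\W+", " ", text.lower())
    return [w for w in text.split(" ") if len(w) >= min_length]

sample = "Wikipedia pages are maintained by volunteer editors."
print(long_words(sample))
```

Since `\w` also matches digits, purely numeric tokens of five or more characters survive this filter as well.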

In [21]:
from bs4 import BeautifulSoup
from collections import Counter
import re

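def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Replace every run of non-word characters with a single space.
    text = re.sub(r"\W+", " ", text.lower())
    words = text.split(" ")
    # Keep only words of at least 5 characters.
    words = [w for w in words if len(w) >= 5]
    # Return this page's 1500 most frequent words as (word, count) pairs.
    return Counter(words).most_common(1500)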

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed)
words = list(words)

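# For each word, count the number of articles whose top words include it
# (1 is added per article, not the number of occurrences in that article).
word_counts = {}
for wc in words:
    for word, count in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1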
end = time.time()

print(end - start)
word_counts

20.31219458580017


{'dewulf': 1,
 'propaganda': 8,
 'watson': 14,
 'lungsod': 1,
 'andijani': 1,
 'tabimorelin': 1,
 'mulholland': 1,
 'flori': 1,
 'adrain': 1,
 'middle': 61,
 'garnered': 3,
 'pearsall': 1,
 'furthermore': 13,
 'démocratie': 1,
 'pekel': 1,
 'myriad': 3,
 '705th': 1,
 'teamminnesota': 1,
 'shelapur': 1,
 'hamlets': 1,
 'freshwater': 7,
 'hangenbieten': 1,
 'slopes': 6,
 'recognize': 5,
 'lagoa': 1,
 'eaten': 2,
 'χαρβάτι': 1,
 'mayocounty': 1,
 'snowboarder': 1,
 'sirdar': 1,
 'constructive': 1,
 'moralphilosophie': 1,
 'idstedt': 1,
 'brushing': 1,
 '47273': 1,
 'aldopentoses': 1,
 'cisticolas': 1,
 'cottonfields': 1,
 'castilblanco': 1,
 'bellas': 1,
 'collapse': 9,
 '1200mm': 1,
 'kaduna': 1,
 'phalanx': 2,
 'bourneville': 1,
 'vasculaire': 1,
 'gower': 4,
 'mostly': 33,
 'caddo': 1,
 'molecular': 20,
 'toulon': 5,
 'teacher': 22,
 'pikimachay': 1,
 'correct': 31,
 'gentile': 1,
 'michiko': 1,
 'colima': 1,
 'smothers': 1,
 'metiseducation': 1,
 'jetterswiller': 1,
 'canberra': 7,
 '

In [16]:
sorted(word_counts.items(), key=lambda x: (-x[1], x[0]))

[('wikipedia', 337),
 ('retrieved', 170),
 ('articles', 147),
 ('categories', 123),
 ('article', 86),
 ('species', 69),
 ('county', 62),
 ('united', 54),
 ('family', 48),
 ('university', 48),
 ('school', 41),
 ('state', 41),
 ('sources', 40),
 ('football', 37),
 ('september', 36),
 ('states', 36),
 ('district', 34),
 ('title', 34),
 ('world', 34),
 ('april', 31),
 ('national', 31),
 ('american', 30),
 ('north', 30),
 ('february', 27),
 ('south', 27),
 ('village', 27),
 ('which', 27),
 ('career', 26),
 ('population', 26),
 ('album', 25),
 ('december', 25),
 ('november', 25),
 ('march', 24),
 ('music', 24),
 ('station', 24),
 ('external', 23),
 ('french', 23),
 ('german', 23),
 ('january', 22),
 ('league', 22),
 ('october', 22),
 ('women', 22),
 ('australian', 20),
 ('first', 20),
 ('season', 20),
 ('california', 19),
 ('coordinates', 19),
 ('india', 19),
 ('british', 18),
 ('identifierswikipedia', 18),
 ('party', 18),
 ('river', 18),
 ('television', 18),
 ('august', 17),
 ('history', 17

Selecting the top 10 words from each article speeds up performance (19.57 seconds) versus 
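A note on how to read these counts: the aggregation loop in this section adds 1 for every article in which a word appears in that article's top list, so `word_counts` is a number of articles, not a total number of occurrences. The toy sketch below, on made-up data, shows how the two measures differ; the variable names are hypothetical.

```python
from collections import Counter

# Made-up per-article results, shaped like the output of count_words.
per_article_top = [
    [("species", 12), ("family", 3)],
    [("species", 1), ("county", 7)],
]

article_frequency = Counter()  # in how many articles the word was listed
total_occurrences = Counter()  # how many times the word occurred overall

for top_words in per_article_top:
    for word, count in top_words:
        article_frequency[word] += 1
        total_occurrences[word] += count

print(article_frequency.most_common())
print(total_occurrences.most_common())
```

The first measure rewards words that are prominent in many articles, while the second can be dominated by a few long pages that repeat a word often.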