# Analyzing Wikipedia Pages

## Introduction

In this guided project, we'll be analyzing 54 megabytes worth of articles to figure out patterns in the Wikipedia writing and content presentation style. The articles were scraped by hitting random pages on Wikipedia, then downloading the contents using the requests package. The scraping code is in this project folder, in the **scrape_random.py** file.

Articles were saved using the last component of their URLs. For example, a page on Wikipedia has the URL structure **https://en.wikipedia.org/wiki/Yarkant_County**. If we were saving the article with the previous URL, we'd save it to the file **Yarkant_County.html**. All the data files are stored in the **wiki** folder. Note that the files are raw HTML.
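
The naming rule described above can be sketched in a few lines. This is only an illustration: the helper name `article_filename` is an assumption, and the actual logic in **scrape_random.py** may differ.

```python
from urllib.parse import urlsplit

def article_filename(url):
    """Return the file name an article URL would be saved under."""
    # The last path component is the article name, e.g. "Yarkant_County"
    last_component = urlsplit(url).path.rsplit("/", 1)[-1]
    return last_component + ".html"

print(article_filename("https://en.wikipedia.org/wiki/Yarkant_County"))
```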

The pages are fairly standard HTML and contain embedded Javascript code. We'll be able to ignore the embedded Javascript during our analysis, so don't worry too much about it right now.

Our main goals will be to:
- Extract only the text from the Wikipedia pages, and remove all HTML and Javascript markup.
- Remove common page headers and footers from the Wikipedia pages.
- Figure out what tags are the most common in Wikipedia pages.
- Figure out patterns in the text.

## Importing packages

In [1]:
import os
import time
import concurrent.futures
import re
from bs4 import BeautifulSoup
from collections import Counter

## Wiki folder content

Let's check the content in the **wiki** folder:

In [2]:
filenames = os.listdir("my_datasets/wiki/")
filenames[:10]

['%C3%89cole_des_Mines_de_Douai.html',
 '%C3%89taule.html',
 '%C5%8Cnog%C5%8D_Station.html',
 '100_Greatest_Romanians.html',
 '104th_Logistic_Support_Brigade_(United_Kingdom).html',
 '16th_Virginia_Infantry.html',
 '1896_Indiana_Hoosiers_football_team.html',
 '1898_Colgate_football_team.html',
 '1910_in_literature.html',
 '1915_Montana_football_team.html']

In [3]:
print("Number of wiki files: {}".format(len(filenames)))

Number of wiki files: 999


Let's open the first file:

In [4]:
filepath = "my_datasets/wiki/{}".format(filenames[0])
with open(filepath, encoding="utf-8") as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>École des Mines de Douai - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"École_des_Mines_de_Douai","wgTitle":"École des Mines de Douai","wgCurRevisionId":766474818,"wgRevisionId":766474818,"wgArticleId":1225267,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","Articles lacking in-text citations from March 2016","All articles lacking in-text citations","Grandes écoles","Educational institutions established in 1878","Université Lille Nord de France","Douai"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentMod

## Reading In The Data

Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations.

Because this task is **I/O bound**, **threads** may help us read in the data more quickly. It's worth benchmarking multi-threaded performance against single-threaded performance before relying on them.

In [5]:
def read_file(filepath):
    with open(filepath, encoding="utf-8") as f:
        return f.read()

In [6]:
#Without using a pool
start = time.time()

filepaths = ["my_datasets/wiki/{}".format(filename) for filename in filenames]
content = [read_file(filepath) for filepath in filepaths]

total_time = time.time() - start
print("Content list length: {}".format(len(content)))
print("Time spent: {}".format(total_time))

Content list length: 999
Time spent: 0.14764022827148438


In [7]:
#Using 2 threads
start = time.time()

filepaths = ["my_datasets/wiki/{}".format(filename) for filename in filenames]
# The with block shuts the pool down once all files have been read
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    content = list(pool.map(read_file, filepaths))

total_time = time.time() - start
print("Content list length: {}".format(len(content)))
print("Time spent: {}".format(total_time))

Content list length: 999
Time spent: 0.19248533248901367


In [8]:
#Using 3 threads
start = time.time()

filepaths = ["my_datasets/wiki/{}".format(filename) for filename in filenames]
# The with block shuts the pool down once all files have been read
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    content = list(pool.map(read_file, filepaths))

total_time = time.time() - start
print("Content list length: {}".format(len(content)))
print("Time spent: {}".format(total_time))

Content list length: 999
Time spent: 0.17758417129516602


In [9]:
#Using 4 threads
start = time.time()

filepaths = ["my_datasets/wiki/{}".format(filename) for filename in filenames]
# The with block shuts the pool down once all files have been read
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    content = list(pool.map(read_file, filepaths))

total_time = time.time() - start
print("Content list length: {}".format(len(content)))
print("Time spent: {}".format(total_time))

Content list length: 999
Time spent: 0.18501758575439453


In [10]:
#Article names: filenames without the .html suffix
articles = [filename.replace(".html","") for filename in filenames]
articles[:10]

['%C3%89cole_des_Mines_de_Douai',
 '%C3%89taule',
 '%C5%8Cnog%C5%8D_Station',
 '100_Greatest_Romanians',
 '104th_Logistic_Support_Brigade_(United_Kingdom)',
 '16th_Virginia_Infantry',
 '1896_Indiana_Hoosiers_football_team',
 '1898_Colgate_football_team',
 '1910_in_literature',
 '1915_Montana_football_team']
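
Some of the names above are percent-encoded because they come straight from URLs. As a small sketch, the standard library's `urllib.parse.unquote` can turn them back into readable titles. The helper name `readable_title` is an assumption made for this example and is not used elsewhere in the project.

```python
from urllib.parse import unquote

def readable_title(article):
    # Decode %XX escapes (UTF-8 by default) and turn underscores into spaces
    return unquote(article).replace("_", " ")

for name in ["%C3%89cole_des_Mines_de_Douai", "%C5%8Cnog%C5%8D_Station"]:
    print(readable_title(name))
```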

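After profiling, threading doesn't appear to help here: the single-threaded read (about 0.15 seconds) was slightly faster than the runs with 2, 3, or 4 threads (about 0.18 to 0.19 seconds). A likely explanation is that the files are small and quick to read, so the overhead of creating and managing threads outweighs any time saved by overlapping the reads.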
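
The timing cells above repeat the same code with a different thread count each time. A possible way to fold them into one helper is sketched below. The name `time_reads` and the throwaway files are assumptions made so the example runs on its own; with the real data, `paths` would be the `filepaths` list.

```python
import concurrent.futures
import os
import tempfile
import time

def read_file(filepath):
    with open(filepath, encoding="utf-8") as f:
        return f.read()

def time_reads(filepaths, max_workers=None):
    """Read every file and return (contents, seconds elapsed)."""
    start = time.perf_counter()
    if max_workers is None:
        # Plain loop, no thread pool
        contents = [read_file(p) for p in filepaths]
    else:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            contents = list(pool.map(read_file, filepaths))
    return contents, time.perf_counter() - start

# Demo on small temporary files standing in for the wiki folder
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(50):
        path = os.path.join(folder, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html>{}</html>".format(i))
        paths.append(path)
    serial, serial_time = time_reads(paths)
    threaded, threaded_time = time_reads(paths, max_workers=4)

print("serial:   {:.4f}s".format(serial_time))
print("threaded: {:.4f}s".format(threaded_time))
```

Timings from a single run are noisy, and the operating system may cache files after the first read, so it is safer to repeat each measurement several times before drawing conclusions.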

## Remove Extraneous Markup

Now that we've read in the data files, we can remove the extraneous markup outside the **div#content** element, which seems to hold most of each page's content.

We can use the BeautifulSoup package for this. BeautifulSoup enables us to extract all of the content inside a specific tag. Using the BeautifulSoup package, we'll parse each wiki article, then extract the div with id content and everything inside it.

Since this operation is more CPU intensive than before, let's try using a process pool to see if the speed improves.

In [11]:
def parse_content(html):
    # Keep only the div with id="content" and everything inside it
    soup = BeautifulSoup(html, 'html.parser')
    return str(soup.find_all("div", id="content")[0])

In [12]:
#ProcessPoolExecutor did not work in this notebook on Windows 10, so this cell is disabled

# start = time.time()

# pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
# parsed = list(pool.map(parse_content, content))

# total_time = time.time() - start
# print("Time spent: {}".format(total_time))

In [13]:
#Without multiprocessing
start = time.time()

parsed = [parse_content(html) for html in content]

total_time = time.time() - start
print("Time spent: {}".format(total_time))

Time spent: 29.264676094055176


In [14]:
parsed[0]

'<div class="mw-body" id="content" role="main">\n<a id="top"></a>\n<div id="siteNotice"><!-- CentralNotice --></div>\n<div class="mw-indicators">\n</div>\n<h1 class="firstHeading" id="firstHeading" lang="en">École des Mines de Douai</h1>\n<div class="mw-body-content" id="bodyContent">\n<div id="siteSub">From Wikipedia, the free encyclopedia</div>\n<div id="contentSub"></div>\n<div class="mw-jump" id="jump-to-nav">\n\t\t\t\t\tJump to:\t\t\t\t\t<a href="#mw-head">navigation</a>, \t\t\t\t\t<a href="#p-search">search</a>\n</div>\n<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><table class="infobox vcard" style="width:22em">\n<caption class="fn org">École nationale supérieure des mines de Douai (Mines Douai)</caption>\n<tr>\n<th scope="row" style="padding-right:0.65em;">Type</th>\n<td><a href="/wiki/Grandes_%C3%A9coles" title="Grandes écoles">Grande école</a></td>\n</tr>\n<tr>\n<th scope="row" style="padding-right:0.65em;">Established</th>\n<td>1878</td>\n</tr>\n<tr>\n
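
The output above still contains page boilerplate such as the "From Wikipedia, the free encyclopedia" line and the "Jump to" navigation links. One possible way to drop these elements is sketched below. The helper `strip_boilerplate` and the `BOILERPLATE_IDS` list are assumptions: the ids are taken from this single output, they may not cover every header and footer, and they have not been checked against all 999 pages.

```python
from bs4 import BeautifulSoup

# ids seen in the parsed output above; assumed to mark boilerplate on every page
BOILERPLATE_IDS = ["siteNotice", "siteSub", "jump-to-nav"]

def strip_boilerplate(parsed_file, ids=BOILERPLATE_IDS):
    soup = BeautifulSoup(parsed_file, "html.parser")
    for element_id in ids:
        for element in soup.find_all(id=element_id):
            # decompose() removes the element and everything inside it
            element.decompose()
    return str(soup)

sample = ('<div id="content"><h1 id="firstHeading">École des Mines de Douai</h1>'
          '<div id="siteSub">From Wikipedia, the free encyclopedia</div>'
          '<div id="jump-to-nav">Jump to: navigation, search</div>'
          '<p>Article text.</p></div>')
cleaned = strip_boilerplate(sample)
print(cleaned)
```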

## Finding Common Tags

Now that we've extracted the main part of each page, let's count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of a tags on each page, we know that Wikipedia articles tend to be very connected to other articles or pages. On the other hand, a lot of div tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.

We can count tags by calling the **BeautifulSoup.find_all()** method with no arguments, which returns every tag in the document, then iterating through the result.

In [15]:
def count_tags(parsed_file):
    soup = BeautifulSoup(parsed_file, 'html.parser')
    # find_all() with no arguments returns every tag in the document
    return Counter(tag.name for tag in soup.find_all())

In [16]:
#ProcessPoolExecutor did not work in this notebook on Windows 10, so this cell is disabled

# start = time.time()

# pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
# tags = list(pool.map(count_tags, parsed))

# tag_counts = sum(tags, Counter())
        
# total_time = time.time() - start
# print("Time spent: {}".format(total_time))

# tag_counts
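
A likely reason the process pool cells fail is that `ProcessPoolExecutor` needs the worker function to be importable by the child processes, which is not the case for functions defined interactively in a notebook. The sketch below shows the usual workaround: a standalone script with module-level functions and an `if __name__ == "__main__":` guard. It assumes the code is saved to a file and run with `python`. The names `TagCounter`, `count_tags_stdlib` and `count_all` are made up for this example, and the standard library's `HTMLParser` stands in for BeautifulSoup so that the script has no third-party dependencies.

```python
import concurrent.futures
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count opening tags; a standard-library stand-in for BeautifulSoup."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def count_tags_stdlib(html):
    # Defined at module level so that worker processes can import it
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

def count_all(pages, max_workers=3):
    total = Counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        for counts in pool.map(count_tags_stdlib, pages):
            total.update(counts)  # merge each page's counts in place
    return total

if __name__ == "__main__":
    # The guard keeps child processes from re-running this block on import
    sample_pages = ['<div id="content"><p>One <a href="#">link</a></p></div>'] * 4
    print(count_all(sample_pages))
```

To use the same idea from the notebook, the worker functions would need to live in a separate `.py` module that the notebook imports.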

In [17]:
#Without multiprocessing
start = time.time()

tags = [count_tags(parsed_file) for parsed_file in parsed]
tag_counts = sum(tags, Counter())
        
total_time = time.time() - start
print("Time spent: {}".format(total_time))

Time spent: 15.955629110336304


In [18]:
tag_counts.most_common(10)

[('a', 161065),
 ('li', 85779),
 ('span', 67350),
 ('td', 57673),
 ('div', 28581),
 ('tr', 27300),
 ('i', 18246),
 ('th', 14472),
 ('b', 14455),
 ('sup', 11157)]

Based on our findings, a, li, span, and td tags are by far the most common. This indicates that articles tend to have lots of links, along with lists and tables. The a tag is the most numerous, with almost twice as many occurrences as li, which shows how interconnected articles on Wikipedia are.
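
To put these totals in perspective, we can divide them by the number of files. The short sketch below reuses the figures printed above (the top five tag counts and the 999 files); the variable names are only illustrative.

```python
# Figures copied from the tag_counts.most_common(10) output and the file count
top_tags = [("a", 161065), ("li", 85779), ("span", 67350), ("td", 57673), ("div", 28581)]
n_articles = 999

# Average number of occurrences of each tag per article
per_article = {tag: count / n_articles for tag, count in top_tags}
for tag, average in per_article.items():
    print("{:<5}{:>8.1f} per article".format(tag, average))
```

These are averages over the **div#content** part of each page, so a handful of very long pages could account for a large share of the totals.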

## Finding Common Words

After finding the common tags, we should be able to find the common words in the article body. We can apply any definition of "word" that we want, but it might be helpful to apply similar criteria to what we saw in the last mission.

One thing to be aware of here is that depending on the words you choose, you may run out of memory, or performance may be slow.
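
One way to keep memory use down is to update a single `Counter` as each text is processed, instead of keeping a separate word list for every file. The sketch below illustrates the idea on two toy sentences. The function names are assumptions, and the word definition (lowercase, split on non-word characters, at least 5 characters) is just one possible choice.

```python
import re
from collections import Counter

def tokenize(text, min_length=5):
    # Lowercase, split on runs of non-word characters, drop short words
    words = re.split(r"\W+", text.lower())
    return [w for w in words if len(w) >= min_length]

def count_words_streaming(texts):
    total = Counter()
    for text in texts:
        # update() adds the counts in place; no per-file list is kept
        total.update(tokenize(text))
    return total

# Toy sentences, not taken from the data set
texts = ["Wikipedia articles link to other Wikipedia articles.",
         "Species pages describe a family of species."]
totals = count_words_streaming(texts)
print(totals.most_common(3))
```

The single `Counter` still grows with the number of distinct words, so whether it fits in memory depends on the data and on the word definition.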

In [19]:
#Considering only the 10 most common words per file, to avoid memory problems
def count_words(parsed_file):
    soup = BeautifulSoup(parsed_file, 'html.parser')
    text = soup.get_text()
    # Lowercase the text and replace runs of non-word characters with a single space
    text = re.sub(r"\W+", " ", text.lower())
    words = text.split(" ")
    # Keep only words with at least 5 characters
    words = [w for w in words if len(w) >= 5]
    return Counter(words).most_common(10)

In [20]:
#ProcessPoolExecutor did not work in this notebook on Windows 10, so this cell is disabled

# start = time.time()

# pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
# words = list(pool.map(count_words, parsed))

# word_counts = sum(words,Counter())
        
# total_time = time.time() - start
# print("Time spent: {}".format(total_time))

In [21]:
#Without multiprocessing

start = time.time()

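words = [count_words(parsed_file) for parsed_file in parsed]
# For each word, count the number of files in which it is among the 10 most common words
# (the per-file counts themselves are not added up)
word_counts = {}
for wc in words:
    for word, _ in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1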
        
total_time = time.time() - start
print("Time spent: {}".format(total_time))

Time spent: 18.161108016967773


In [22]:
Counter(word_counts).most_common(10)

[('wikipedia', 431),
 ('retrieved', 169),
 ('articles', 132),
 ('article', 85),
 ('species', 69),
 ('county', 64),
 ('categories', 58),
 ('united', 50),
 ('university', 47),
 ('family', 45)]

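These results consider only the 10 most common words in each file, to avoid memory problems. Each number is the count of files in which the word ranked among that file's 10 most common words, not the total number of occurrences: for example, 'wikipedia' made the top 10 in 431 of the 999 files. Words such as 'wikipedia', 'retrieved', and 'categories' probably come from page boilerplate rather than article text. A different definition of "word", or summing the per-file counts, would give different results.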
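
The aggregation loop above adds 1 for every file in which a word appears in the top 10, which is different from adding up the per-file counts. The sketch below contrasts the two on toy data shaped like the output of `count_words` (lists of `(word, count)` pairs). The numbers are invented for illustration and are not taken from the data set.

```python
from collections import Counter

# Toy per-file results: one list of (word, count) pairs per file
per_file = [
    [("wikipedia", 12), ("species", 30)],
    [("wikipedia", 9), ("county", 14)],
    [("wikipedia", 7), ("species", 2)],
]

files_with_word = Counter()    # number of files where the word made the list
total_occurrences = Counter()  # sum of the per-file counts

for top_words in per_file:
    for word, count in top_words:
        files_with_word[word] += 1
        total_occurrences[word] += count

print(files_with_word.most_common())
print(total_occurrences.most_common())
```

In this toy data 'wikipedia' leads the first ranking because it appears in every file, while 'species' leads the second because of one file with many occurrences.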