# Chapter 3 Writing Web Crawlers
## Traversing a Single Domain

To review what we learned in the previous chapters, we will write Python snippets that retrieve the list of links on an arbitrary Wikipedia page. Wikipedia has its own API that gives more efficient access to its data, but we practice scraping on it because 1) it is stable and 2) it has a simple HTML structure. We use Kevin Bacon's page as an example: [http://en.wikipedia.org/wiki/Kevin_Bacon](http://en.wikipedia.org/wiki/Kevin_Bacon).

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

wiki_url = "http://en.wikipedia.org/wiki/Kevin_Bacon"

http_res = urlopen(wiki_url)
bs = BeautifulSoup(http_res, 'html.parser')
links = []
for link in bs.find_all('a'):
    if "href" in link.attrs:
        links.append(link.attrs["href"])
print(len(links))

856


The list is too long to inspect in full, so we print only the first 15 entries for analysis.

In [2]:
for link in links[:15]:
    print(link)

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
http://baconbros.com/
#cite_note-1
#cite_note-actor-2
/wiki/Footloose_(1984_film)


The results above still contain links that do not point to Wikipedia articles, for example `#mw-head` and `http://baconbros.com/`. Article links share a common pattern: they start with `/wiki/` and contain no colon (a colon marks a namespace page such as `/wiki/File:...`). We therefore use the regex `^(/wiki/)((?!:).)*$` to keep only the article links. Unlike the textbook, we apply the regex to `links` rather than passing it to `find_all()`, because we don't want to search the parsed document a second time, which takes a long time when there is a huge amount of data.

In [3]:
import re

wiki_links = []
for link in links:
    if re.match(r"^(/wiki/)((?!:).)*$", link):
        wiki_links.append(link)
print(len(wiki_links))

394


In [4]:
for link in wiki_links[:15]:
    print(link)

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
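
To see which links the pattern keeps and which it rejects, it helps to run it against a few of the links printed earlier. The following is a minimal sketch; the helper name `is_article_link` is an assumption introduced here for illustration and is not part of the notebook.

```python
import re

# Same pattern as in the notebook: starts with /wiki/ and contains no colon.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")


def is_article_link(href):
    """Return True if href looks like a link to a Wikipedia article."""
    return ARTICLE_LINK.match(href) is not None


samples = [
    "/wiki/Philadelphia",                      # article link
    "/wiki/Footloose_(1984_film)",             # article link
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",    # namespace page (has a colon)
    "/wiki/Wikipedia:Protection_policy#semi",  # namespace page (has a colon)
    "#cite_note-1",                            # in-page anchor
    "http://baconbros.com/",                   # external site
]

kept = [href for href in samples if is_article_link(href)]
print(kept)  # ['/wiki/Philadelphia', '/wiki/Footloose_(1984_film)']
```

The negative lookahead `(?!:)` is checked before every character after `/wiki/`, which is why a single colon anywhere in the rest of the link is enough to reject it.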


## Crawling an Entire Site

Now that you have had a taste of web scraping, you are ready to crawl all of Wikipedia! To do so, we need a better understanding of how a Wikipedia page is structured:

- All titles are under `h1->span` tags.
- The first paragraph of text lives under `div#mw-content-text->p`.
- Edit links are under `span.wb-langlinks-edit->a` (the `span` is identified by its class, not its id).
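
These three locations can be tried out on a small hand-written page before touching the live site. The sketch below assumes BeautifulSoup is installed. The HTML string is a made-up stand-in that only mimics the structure described above, with placeholder title, text, and URL, and the edit-link `span` is matched by class, as in the notebook's code.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page that mimics the structure listed above.
html = """
<html><body>
  <h1><span>Example title</span></h1>
  <div id="mw-content-text">
    <p>First paragraph of the article.</p>
    <p>Second paragraph.</p>
  </div>
  <span class="wb-langlinks-edit"><a href="https://example.org/edit">Edit links</a></span>
</body></html>
"""

bs = BeautifulSoup(html, "html.parser")

# Title: text of the first <h1> (including its nested <span>).
title = bs.h1.get_text()
# First paragraph: first <p> inside the element with id="mw-content-text".
first_paragraph = bs.find(id="mw-content-text").find("p").get_text()
# Edit link: href of the <a> inside <span class="wb-langlinks-edit">.
edit_link = bs.find("span", {"class": "wb-langlinks-edit"}).find("a").attrs["href"]

print(title)
print(first_paragraph)
print(edit_link)
```

If any of these elements is missing, `find()` returns `None` and the chained call raises `AttributeError`, which is the exception the crawler catches.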

In [5]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def get_links(pageUrl):
    global pages
    http_res = urlopen("http://en.wikipedia.org{}".format(pageUrl))
    bs = BeautifulSoup(http_res, "html.parser")
    try:
        print(bs.h1.get_text())
        # uncomment to print first paragraph
        # print(bs.find(id="mw-content-text").find_all("p")[0])
        print(bs.find("span", {"class": "wb-langlinks-edit"})
                .find("a").attrs["href"])
    except AttributeError:
        print("This page is missing something! Continuing.")
    
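    # Depth-first crawl: recurse into each new /wiki/ link as soon as it is
    # found, and stop once 5 pages have been recorded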
    for link in bs.find_all("a", href=re.compile(r"^(/wiki/)")):
        if len(pages) >= 5:
            break
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print("-"*20)
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

In [6]:
get_links("")

Main Page
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia
Wikipedia
https://www.wikidata.org/wiki/Special:EntityPage/Q52#sitelinks-wikipedia
--------------------
/wiki/Wikipedia:Protection_policy#semi
Wikipedia:Protection policy
https://www.wikidata.org/wiki/Special:EntityPage/Q4616470#sitelinks-wikipedia
--------------------
/wiki/Wikipedia:Requests_for_page_protection
Wikipedia:Requests for page protection
https://www.wikidata.org/wiki/Special:EntityPage/Q5478470#sitelinks-wikipedia
--------------------
/wiki/Wikipedia:Requests_for_permissions
Wikipedia:Requests for permissions
https://www.wikidata.org/wiki/Special:EntityPage/Q5453037#sitelinks-wikipedia
--------------------
/wiki/Wikipedia:Protection_policy#template
Wikipedia:Protection policy
https://www.wikidata.org/wiki/Special:EntityPage/Q4616470#sitelinks-wikipedia
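
Because `get_links` calls itself for every new link, a long chain of pages can exceed Python's default recursion limit of 1000. One common alternative is to keep the pending pages on an explicit stack. The sketch below illustrates the idea on a hypothetical in-memory link graph (`FAKE_SITE`) instead of real HTTP requests; `crawl` and `max_pages` are names chosen here for illustration.

```python
# Hypothetical link graph: page path -> links found on that page.
FAKE_SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/C", "/wiki/B"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [],
}


def crawl(start, max_pages=5):
    """Visit pages depth-first without recursion; return them in visit order."""
    seen = set()     # pages already visited, so none is visited twice
    stack = [start]  # pages waiting to be visited
    order = []
    while stack and len(order) < max_pages:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first link on the page is visited first.
        for link in reversed(FAKE_SITE.get(page, [])):
            if link not in seen:
                stack.append(link)
    return order


print(crawl(""))  # ['', '/wiki/A', '/wiki/C', '/wiki/B']
```

In a real crawler, the dictionary lookup would be replaced by fetching the page and collecting its `/wiki/` links, as `get_links` does.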


## Crawling Across the Internet

Here we give another example starting at [http://oreilly.com](http://oreilly.com) and randomly hopping from external link to external link.

In [7]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
# Seed with a float timestamp: Python 3.11+ rejects a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
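    #Finds all links that begin with a "/" or contain the site's own URL
    #(re.escape keeps characters such as "." in the URL literal)
    for link in bs.find_all('a', href=re.compile('^(/|.*'+re.escape(includeUrl)+')')):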
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if(link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks
            
#Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" or "www" and do
    #not contain the current URL (escaped so "." is matched literally)
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!'+re.escape(excludeUrl)+').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(internalLinks[random.randint(0,
                                    len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    # uncomment to get "infinite" external links
    # followExternalOnly(externalLink)
            
followExternalOnly('http://oreilly.com')

Random external link is: https://www.facebook.com/OReilly/
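
The regular expressions above decide whether a link is internal or external by matching text. Another way to make the same decision, shown here only as a sketch, is to resolve each `href` against the page URL with `urljoin` and compare host names with `urlparse`. The helper names `host_of` and `is_internal`, and the choice to ignore a leading `www.`, are assumptions made for this illustration.

```python
from urllib.parse import urljoin, urlparse


def host_of(url):
    """Return the lower-cased host name without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_internal(page_url, href):
    """True if href, resolved against page_url, stays on the same host."""
    absolute = urljoin(page_url, href)  # turns "/about" into a full URL
    return host_of(absolute) == host_of(page_url)


page = "http://oreilly.com"
hrefs = [
    "/about/approach.html",
    "https://www.oreilly.com/conferences/",
    "https://www.facebook.com/OReilly/",
]
for href in hrefs:
    kind = "internal" if is_internal(page, href) else "external"
    print(kind, href)
```

Under this definition a subdomain such as `conferences.oreilly.com` counts as external; whether that is desirable depends on what the crawl is meant to cover.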


We can also collect every unique external link found on the site by storing the links in a `set()`, which discards duplicates.

In [8]:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()


def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme,
                              urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
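    # Note: `domain` includes the scheme, so same-site links such as
    # "https://www.oreilly.com/..." are not excluded (see the output)
    externalLinks = getExternalLinks(bs, domain)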
    
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            # uncomment to get "all" external links
            # getAllExternalLinks(link)

allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')

https://www.oreilly.com
https://www.oreilly.com/sign-in.html
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/index.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/enterprise.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/pricing.html
https://www.oreilly.com/blended-courses.html
https://www.oreilly.com/conferences/
https://www.oreilly.com/ideas/
https://www.oreilly.com/about/approach.html
https://conferences.oreilly.com/oscon/oscon-or
https://www.oreilly.com/whats-new.html
https://conferences.oreilly.com/strata/strata-eu
https://conferences.oreilly.com/velocity/vl-ca
https://learning.oreilly.com/register/
https://learning.oreilly.com/team-setup/
https://conferences.oreilly.com/sof
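
Most of the URLs printed above belong to oreilly.com itself. A likely reason is that `getAllExternalLinks` passes `domain`, which includes the scheme (`http://oreilly.com`), as the string to exclude, whereas `getRandomExternalLink` passes only the host name. The sketch below builds a pattern of the same shape as the one in `getExternalLinks`, outside of BeautifulSoup, to show the difference; the helper name `looks_external` is introduced here for illustration.

```python
import re


def looks_external(href, exclude):
    """Match href against a pattern shaped like the one in getExternalLinks."""
    pattern = re.compile('^(http|www)((?!' + re.escape(exclude) + ').)*$')
    return pattern.match(href) is not None


link = "https://www.oreilly.com/sign-in.html"

# Excluding the full URL with scheme: the https://www. link still passes.
print(looks_external(link, "http://oreilly.com"))  # True

# Excluding only the host name: the same link is filtered out.
print(looks_external(link, "oreilly.com"))         # False

# A link on another site passes either way.
print(looks_external("https://www.facebook.com/OReilly/", "oreilly.com"))  # True
```

Passing `urlparse(siteUrl).netloc` instead of `domain` would therefore be one way to keep same-site links out of the external list.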