# Chapter xx

*Data Structures and Information Retrieval in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In [1]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
# download('https://github.com/AllenDowney/DSIRP/raw/main/utils.py')

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/chapters/chap01.ipynb)

# Getting to Philosophy

The goal of this chapter is to develop a Web crawler that tests the
"Getting to Philosophy" conjecture, which we presented in
Section [the-road-ahead](#the-road-ahead).

In [55]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
(<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>),
<i><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</i>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [49]:
from bs4 import BeautifulSoup

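# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')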
type(soup)

bs4.BeautifulSoup

In [50]:
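def iterative_DFS(root):
    # Elements waiting to be visited; the top of the stack is visited next
    stack = [root]

    while stack:
        tag = stack.pop()
        yield tag

        # Strings have no `contents`, so they contribute no children
        children = getattr(tag, "contents", [])
        # Push children in reverse so the first child is popped first
        for child in reversed(children):
            stack.append(child)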
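
To see why the children are pushed in reverse, here is a small sketch that runs the same traversal on a made-up tree. The `Node` class and the `dfs_names` function are hypothetical stand-ins, not part of Beautiful Soup; they exist only so the example runs without parsing any HTML.

```python
class Node:
    """Hypothetical stand-in for a parsed tag: a name and a list of children."""
    def __init__(self, name, contents=()):
        self.name = name
        self.contents = list(contents)


def dfs_names(root, reverse_children=True):
    """Return node names in the order an explicit-stack DFS visits them."""
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        order.append(node.name)
        children = getattr(node, "contents", [])
        if reverse_children:
            children = reversed(children)
        stack.extend(children)
    return order


tree = Node("html", [
    Node("head", [Node("title")]),
    Node("body", [Node("p1"), Node("p2")]),
])

in_order = dfs_names(tree)
backwards = dfs_names(tree, reverse_children=False)
print(in_order)    # ['html', 'head', 'title', 'body', 'p1', 'p2']
print(backwards)   # ['html', 'body', 'p2', 'p1', 'head', 'title']
```

With the children reversed, the traversal visits nodes in the order they appear in the document, which is what we need in order to find the *first* link on a page.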

In [51]:
from bs4 import NavigableString

for element in iterative_DFS(soup):
    if isinstance(element, NavigableString):
        print(element.string, end='')

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
(Elsie),
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [73]:
from bs4 import Tag

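def link_generator(root):
    # Unmatched '(' seen so far; non-empty means we are inside parentheses
    paren_stack = []

    for element in iterative_DFS(root):
        if isinstance(element, NavigableString):
            for char in element.string:
                if char == '(':
                    paren_stack.append(char)
                elif char == ')' and paren_stack:
                    # Ignore an unmatched ')', which would otherwise raise IndexError
                    paren_stack.pop()

        # Yield only the links that appear outside parentheses
        if isinstance(element, Tag) and element.name == "a":
            if paren_stack:
                continue
            yield element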
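
The parenthesis bookkeeping can be tried out on plain strings, without any HTML. The following is a minimal sketch, not the code above: it swaps the stack for an integer counter, which is enough when there is only one kind of bracket. The function `update_depth` and the list of text chunks are made up for illustration.

```python
def update_depth(text, depth=0):
    """Return the parenthesis depth after scanning `text`, starting from `depth`."""
    for char in text:
        if char == '(':
            depth += 1
        elif char == ')' and depth > 0:
            depth -= 1   # a stray ')' is ignored rather than going negative
    return depth


# Text chunks in the order a traversal might produce them (made up)
chunks = ["names were (", "Elsie", "), ", "Lacie",
          " (also (", "Dinah", ")) and ", "Tillie"]
link_texts = {"Elsie", "Lacie", "Dinah", "Tillie"}

depth = 0
outside = []
for chunk in chunks:
    # A link counts only if we are not inside any parentheses when we reach it
    if chunk in link_texts and depth == 0:
        outside.append(chunk)
    depth = update_depth(chunk, depth)

print(outside)   # ['Lacie', 'Tillie']
```

Because the depth carries over from one chunk to the next, a parenthesis opened in one string and closed in a later one is handled correctly, and so are nested parentheses.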

In [74]:
it = link_generator(soup)
it

<generator object link_generator at 0x7fdae01973c0>

In [77]:
link = next(it)
link

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [121]:
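def in_bad_tag(element, bad_tags=('i', 'table')):
    # Stop at the top of the tree, or at a detached element with no parent
    if element is None or isinstance(element, BeautifulSoup):
        return False
    if isinstance(element, Tag) and element.name in bad_tags:
        return True
    # Otherwise keep climbing, checking against the same tags
    return in_bad_tag(element.parent, bad_tags)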
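
The same check can be written as a loop that follows parent links. Here is a sketch under stated assumptions: `Node` is a hypothetical class with only a `name` and a `parent`, and `has_ancestor` is a made-up name. The default tags are `i` and `em`, which is one reasonable choice for detecting italics.

```python
class Node:
    """Hypothetical stand-in for a tag: a name and a link to its parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def has_ancestor(node, names=('i', 'em')):
    """Return True if `node` or any of its ancestors has a name in `names`."""
    while node is not None:
        if node.name in names:
            return True
        node = node.parent
    return False


body = Node('body')
p = Node('p', parent=body)
i = Node('i', parent=p)
link_in_italics = Node('a', parent=i)
plain_link = Node('a', parent=p)

print(has_ancestor(link_in_italics))   # True
print(has_ancestor(plain_link))        # False
```

The loop does the same work as the recursive version; which one you prefer is mostly a matter of taste, since the parent chain in a Web page is rarely deep enough to strain the recursion limit.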

In [122]:
for link in link_generator(soup):
    if in_bad_tag(link):
        continue
    print(link)

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [123]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
download(url)

In [124]:
filename = basename(url)

# Close the file when we are done, and name the parser explicitly
with open(filename, encoding='utf-8') as fp:
    soup2 = BeautifulSoup(fp, 'html.parser')

In [125]:
root = soup2.find(class_='mw-body-content')

In [140]:
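def valid_link_generator(root):
    for link in link_generator(root):
        # Skip links in italics or in tables
        if in_bad_tag(link):
            continue

        # Keep only internal links to other Wikipedia pages
        href = link.get("href", '')
        if not href.startswith('/wiki'):
            continue

        # `class` is multi-valued, so Beautiful Soup returns a list of names
        class_ = link.get("class", [])
        if "mw-disambig" in class_:
            continue

        yield link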
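
The links that pass this filter have relative `href` values that start with `/wiki`, so a crawler has to turn them into complete URLs before it can download the next page. One way to do that, sketched here with the standard library, is `urljoin`; the helper name `absolute_url` is made up, and dropping the `#fragment` is a design choice, so that two links to different sections of one page count as the same page.

```python
from urllib.parse import urljoin, urldefrag


def absolute_url(page_url, href):
    """Resolve `href` relative to `page_url` and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(page_url, href))
    return url


page = "https://en.wikipedia.org/wiki/Python_(programming_language)"

first = absolute_url(page, "/wiki/Interpreted_language")
second = absolute_url(page, "/wiki/Interpreted_language#History")
print(first)    # https://en.wikipedia.org/wiki/Interpreted_language
print(second)   # https://en.wikipedia.org/wiki/Interpreted_language
```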

In [141]:
it = valid_link_generator(root)
link = next(it)
link

<a class="mw-redirect" href="/wiki/Interpreted_language" title="Interpreted language">interpreted</a>

## `WikiFetcher`

When you write a Web crawler, it is easy to download too many pages too
fast, which might violate the terms of service for the server you are
downloading from. To help you avoid that, I provide a class called
`WikiFetcher` that does two things:

1.  It encapsulates the code we demonstrated in the previous chapter for
    downloading pages from Wikipedia and parsing the HTML.

2.  It measures the time between requests and, if we don't leave enough
    time between requests, it sleeps until a reasonable interval has
    elapsed. By default, the interval is one second.

Here's the definition of `WikiFetcher`:

In [12]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import time, sleep
    
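class WikiFetcher:
    next_request_time = None
    min_interval = 1  # second

    def fetch_wikipedia(self, url):
        """Download the page at `url`, waiting first if necessary, and parse it."""
        self.sleep_if_needed()
        with urlopen(url) as fp:
            soup = BeautifulSoup(fp, 'html.parser')
        return soup

    def sleep_if_needed(self):
        """Sleep until `min_interval` has passed since the previous request."""
        if self.next_request_time is not None:
            sleep_time = self.next_request_time - time()
            if sleep_time > 0:
                sleep(sleep_time)

        # Earliest time at which the next request is allowed
        self.next_request_time = time() + self.min_interval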

The method you will call directly is `fetch_wikipedia`, which takes a
URL as a string and returns a `BeautifulSoup` object that represents the
parsed page. This code should look familiar.

The new code is in `sleep_if_needed`, which checks the time since the
last request and sleeps if the elapsed time is less than `min_interval`,
which is in seconds.

That's all there is to `WikiFetcher`. Here's an example that
demonstrates how it's used:

In [13]:
wf = WikiFetcher()

In [14]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

print(time())
wf.fetch_wikipedia(url)
print(time())
wf.fetch_wikipedia(url)
print(time())

1627917177.0105095
1627917177.4788275
1627917178.482072
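
In the timestamps above, the first download took about half a second and the second did not finish until a full second after that, which suggests the fetcher waited before sending the second request. To check the waiting logic without really waiting, here is a sketch that repeats it with the clock and the sleep function passed in as parameters. `Throttle` and `FakeClock` are hypothetical names invented for this example; they are not part of `WikiFetcher`.

```python
class Throttle:
    """Sketch of the waiting logic, with the clock and sleep function injected."""
    def __init__(self, min_interval, clock, sleeper):
        self.min_interval = min_interval
        self.clock = clock
        self.sleeper = sleeper
        self.next_request_time = None

    def wait(self):
        if self.next_request_time is not None:
            sleep_time = self.next_request_time - self.clock()
            if sleep_time > 0:
                self.sleeper(sleep_time)
        self.next_request_time = self.clock() + self.min_interval


class FakeClock:
    """Hypothetical clock that advances only when told to."""
    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


clock = FakeClock()
throttle = Throttle(1, clock.time, clock.sleep)

throttle.wait()      # first request: nothing to wait for
clock.now += 0.25    # pretend the download took 0.25 seconds
throttle.wait()      # second request: sleeps for the remaining 0.75
clock.now += 2       # a slow download, longer than the interval
throttle.wait()      # third request: no sleep needed

print(clock.sleeps)  # [0.75]
```

The throttle sleeps only for the part of the interval that has not already elapsed, so slow downloads are not delayed any further.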


## Exercise 5

The examples above show how to use some of these pieces. Starting with
this code, your job is to write a crawler that:

1.  Takes a URL for a Wikipedia page, downloads it, and parses it.

2.  Traverses the resulting DOM tree to find the first *valid* link.
    I'll explain what "valid" means below.

3.  If the page has no links, or if the first link is a page we have
    already seen, the program should indicate failure and exit.

4.  If the link matches the URL of the Wikipedia page on philosophy, the
    program should indicate success and exit.

5.  Otherwise it should go back to Step 1.

The program should build a list of the URLs it visits and display the
results at the end (whether it succeeds or fails).
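
Before you deal with real pages, it might help to see the shape of the loop on its own. The following is one possible sketch, not a complete solution: the function that finds the first valid link is passed in as the parameter `first_link`, and here it is replaced by a lookup in a made-up dictionary. The names `crawl`, `first_link`, and `max_steps` are assumptions, and the step limit is an extra safeguard that the steps above do not require.

```python
def crawl(start, target, first_link, max_steps=50):
    """Follow first links from `start`; return (reached_target, visited)."""
    visited = []
    url = start
    while len(visited) < max_steps:
        visited.append(url)
        if url == target:
            return True, visited

        next_url = first_link(url)
        # Fail if there is no link, or if the link leads to a page we have seen
        if next_url is None or next_url in visited:
            return False, visited
        url = next_url

    return False, visited


# Made-up link structure standing in for Wikipedia
first_links = {
    "A": "B",
    "B": "Philosophy",
    "C": "D",
    "D": "C",
    "E": None,
}

print(crawl("A", "Philosophy", first_links.get))  # (True, ['A', 'B', 'Philosophy'])
print(crawl("C", "Philosophy", first_links.get))  # (False, ['C', 'D'])
print(crawl("E", "Philosophy", first_links.get))  # (False, ['E'])
```

In a real crawler, `first_link` would download the page, search it for the first valid link, and return the complete URL, or `None` if the page has no valid links.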

So what should we consider a "valid" link? You have some choices here.
Various versions of the "Getting to Philosophy" conjecture use slightly
different rules, but here are some options:

1.  The link should be in the content text of the page, not in a sidebar
    or boxout.

2.  It should not be in italics or in parentheses.

3.  You should skip external links, links to the current page, and red
    links.

4.  In some versions, you should skip a link if the text starts with an
    uppercase letter.

You don't have to enforce all of these rules, but we recommend that you
at least handle parentheses, italics, and links to the current page.

If you feel like you have enough information to get started, go ahead.
Or you might want to read these hints:

1.  As you traverse the tree, the two kinds of element you will need to
    deal with are `NavigableString` and `Tag`. You can use `isinstance`
    to tell them apart before you access the tag name and other
    attributes.

2.  When you find a `Tag` that represents a link, you can check
    whether it is in italics by following parent links up the tree. If
    there is an `<i>` or `<em>` tag in the parent chain, the link is in
    italics.

3.  To check whether a link is in parentheses, you will have to scan
    through the text as you traverse the tree and keep track of opening
    and closing parentheses (ideally your solution should be able to
    handle nested parentheses (like this)).

4.  If you start from the Java page, you should get to Philosophy after
    following seven links, unless something has changed since I ran the
    code.

OK, that's all the help you're going to get. Now it's up to you. Have
fun!