# Chapter 6

*Data Structures and Information Retrieval in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In [88]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
# download('https://github.com/AllenDowney/DSIRP/raw/main/utils.py')

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/chapters/chap01.ipynb)

# Tree traversal

This chapter introduces the application we will develop during the rest
of the book, a web search engine. I describe the elements of a search
engine and introduce the first application, a web crawler that downloads
and parses pages from Wikipedia. This chapter also presents a recursive
implementation of depth-first search and an iterative implementation
that uses a Python list as a "last in, first out" stack.

## Search engines

A **web search engine**, like Google Search or Bing, takes a set of
"search terms" and returns a list of web pages that are relevant to
those terms (I'll discuss what "relevant" means later). You can read
more at <http://thinkdast.com/searcheng>, but I'll explain what you need
as we go along.

The essential components of a search engine are:

-   Crawling: We'll need a program that can download a web page, parse
    it, and extract the text and any links to other pages.

-   Indexing: We'll need a data structure that makes it possible to look
    up a search term and find the pages that contain it.

-   Retrieval: We'll need a way to collect results from the index
    and identify pages that are most relevant to the search terms.
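
To make the indexing and retrieval steps concrete, here is a minimal sketch of an index as a dictionary that maps each term to the set of pages containing it. The page names and text are made up for illustration, and `build_index` and `lookup` are hypothetical helpers, not the implementation we will develop later.

```python
from collections import defaultdict

# Hypothetical pages (name -> text), made up for illustration
pages = {
    "page1": "the quick brown fox",
    "page2": "the lazy dog",
    "page3": "quick thinking dog",
}

def build_index(pages):
    """Map each term to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for term in text.lower().split():
            index[term].add(name)
    return index

def lookup(index, terms):
    """Return the set of pages that contain every search term."""
    results = [index.get(term, set()) for term in terms]
    return set.intersection(*results) if results else set()

index = build_index(pages)
print(sorted(lookup(index, ["quick"])))         # ['page1', 'page3']
print(sorted(lookup(index, ["quick", "dog"])))  # ['page3']
```

This sketch only answers which pages contain the terms; it does nothing to rank them by relevance.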

We'll start with the crawler. The goal of a crawler is to discover and
download a set of web pages. For search engines like Google and Bing,
the goal is to find *all* web pages, but often crawlers are limited to a
smaller domain. In our case, we will only read pages from Wikipedia.

As a first step, we'll build a crawler that reads a Wikipedia page,
finds the first link, follows the link to another page, and repeats. We
will use this crawler to test the "Getting to Philosophy" conjecture,
which states:

> Clicking on the first lowercase link in the main text of a Wikipedia
> article, and then repeating the process for subsequent articles,
> usually eventually gets one to the Philosophy article.

This conjecture is stated at <http://thinkdast.com/getphil>, and you can
read its history there.

Testing the conjecture will allow us to build the basic pieces of a
crawler without having to crawl the entire web, or even all of
Wikipedia. And I think the exercise is kind of fun!
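
As a rough sketch of the process the conjecture describes, the following code follows first links until it reaches a target page, hits a dead end, or detects a loop. It does not touch the network: `first_link` is a hypothetical dictionary standing in for "the first link on each page", and the entries are invented for illustration rather than taken from Wikipedia.

```python
# Hypothetical data: each page name maps to the first link on that page
first_link = {
    "Python": "Programming language",
    "Programming language": "Formal language",
    "Formal language": "Mathematics",
    "Mathematics": "Philosophy",
    "Loop A": "Loop B",
    "Loop B": "Loop A",
}

def follow_first_links(start, target="Philosophy", limit=100):
    """Follow first links from `start`.

    Returns the path of pages visited and a flag that says
    whether the path ends at `target`.
    """
    path = [start]
    seen = {start}
    page = start
    while page != target and len(path) < limit:
        page = first_link.get(page)
        if page is None or page in seen:
            # Dead end, or a page we have already visited (a loop)
            return path, False
        path.append(page)
        seen.add(page)
    return path, page == target

print(follow_first_links("Python"))
print(follow_first_links("Loop A"))
```

Keeping track of the pages already seen matters: without it, a pair of pages that link to each other would keep the crawler busy until it hit the limit.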

In a few chapters, we'll work on the indexer, and then we'll get to the
retriever.

## Using BeautifulSoup

When you download a web page, the contents are written in HyperText Markup Language (HTML).
For example, here is a minimal HTML document adapted from the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io); its text comes from Lewis Carroll's [*Alice's Adventures in Wonderland*](https://www.gutenberg.org/files/11/11-h/11-h.htm).

In [89]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [151]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
type(soup)

bs4.BeautifulSoup

In [163]:
soup.__class__.__mro__

(bs4.BeautifulSoup, bs4.element.Tag, bs4.element.PageElement, object)

In [153]:
soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [152]:
soup.children

<list_iterator at 0x7fe77f4e9730>

In [92]:
for element in soup.children:
    print(type(element))

<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>


In [93]:
for element in soup.descendants:
    print(type(element))

<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>


In [94]:
first_link = soup.find("a")
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [95]:
links = soup.find_all("a")
len(links)

3

In [96]:
link = soup.find(id="link2")
link

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

In [97]:
title = soup.find(class_="title")
title

<p class="title"><b>The Dormouse's story</b></p>

In [98]:
paragraphs = soup.find_all("p")
len(paragraphs)

3

In [99]:
for para in paragraphs:
    print(type(para))

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


In [100]:
for para in paragraphs:
    print(para.name)

p
p
p


In [101]:
for para in paragraphs:
    print(para)

<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>


In [102]:
for para in paragraphs:
    for element in para.children:
        print(type(element))

<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>


In [103]:
from bs4 import NavigableString

for para in paragraphs:
    for element in para.children:
        if isinstance(element, NavigableString):
            print(type(element))

<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.NavigableString'>


In [104]:
from bs4 import NavigableString

for para in paragraphs:
    for element in para.children:
        if isinstance(element, NavigableString):
            print(element)

Once upon a time there were three little sisters; and their names were

,

 and

;
and they lived at the bottom of a well.
...


## Depth-first search

There are several ways you might reasonably traverse a tree, each with
different applications. We'll start with "depth-first search", or DFS.
DFS starts at the root of the tree and selects the first child. If the
child has children, it selects the first child again. When it gets to a
node with no children, it backtracks, moving up the tree to the parent
node, where it selects the next child if there is one; otherwise it
backtracks again. When it has explored the last child of the root, it's
done.

There are two common ways to implement DFS, recursively and iteratively.
The recursive implementation is simple and elegant:

In [105]:
def recursive_DFS(element):
    if isinstance(element, NavigableString):
        print(element, end='')

    children = getattr(element, "children", [])
    for child in children:
        recursive_DFS(child)

In [106]:
recursive_DFS(soup)


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


## Stacks in Python

A stack is a data structure that is similar to a list: it is a
collection that maintains the order of the elements. The primary
difference between a stack and a list is that the stack provides fewer
methods. In the usual convention, it provides:

-   `push`: which adds an element to the top of the stack. With a
    Python list, `append` plays this role.

-   `pop`: which removes and returns the top-most element from the
    stack.

Because `pop` always returns the top-most element, a stack is also
called a "LIFO", which stands for "last in, first out". An alternative
to a stack is a "queue", which returns elements in the same order they
are added; that is, "first in, first out", or FIFO.
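
Here is a small sketch of the difference. It uses a Python list as the stack and `collections.deque` as the queue; the variable names are only for illustration.

```python
from collections import deque

# Stack: add to the end, remove from the end (LIFO)
stack = []
for item in [1, 2, 3]:
    stack.append(item)
popped_from_stack = [stack.pop() for _ in range(3)]

# Queue: add to the end, remove from the front (FIFO)
queue = deque()
for item in [1, 2, 3]:
    queue.append(item)
popped_from_queue = [queue.popleft() for _ in range(3)]

print(popped_from_stack)  # [3, 2, 1]
print(popped_from_queue)  # [1, 2, 3]
```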

It might not be obvious why stacks and queues are useful: they don't
provide any capabilities that aren't provided by lists; in fact, they
provide fewer capabilities. So why not use lists for everything? There
are two reasons:

1.  If you limit yourself to a small set of methods --- that is, a small
    API --- your code will be more readable and less error-prone. For
    example, if you use a list to represent a stack, you might
    accidentally remove an element in the wrong order. With the stack
    API, this kind of mistake is literally impossible. And the best way
    to avoid errors is to make them impossible.

2.  If a data structure provides a small API, it is easier to implement
    efficiently. For example, a simple way to implement a stack is a
    singly-linked list. When we push an element onto the stack, we add
    it to the beginning of the list; when we pop an element, we remove
    it from the beginning. For a linked list, adding and removing from
    the beginning are constant time operations, so this implementation
    is efficient. Conversely, big APIs are harder to implement
    efficiently.
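
The second point can be sketched as follows. This is one possible linked-list stack, not the one we use in this chapter; `Node` and `LinkedStack` are hypothetical names. Both `push` and `pop` touch only the head of the list, so each takes constant time.

```python
class Node:
    """One link in a singly-linked list."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

class LinkedStack:
    """A stack that exposes only push, pop, and is_empty."""
    def __init__(self):
        self.head = None

    def push(self, data):
        # The new node becomes the head and points to the old head
        self.head = Node(data, self.head)

    def pop(self):
        if self.head is None:
            raise IndexError("pop from empty stack")
        node = self.head
        self.head = node.next
        return node.data

    def is_empty(self):
        return self.head is None

s = LinkedStack()
for item in "abc":
    s.push(item)
print(s.pop(), s.pop(), s.pop())  # c b a
```

Because the class provides no way to reach into the middle of the list, removing an element out of order is not something a caller can do by accident.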

## Iterative DFS

Here is an iterative version of DFS that uses a list to represent a stack of elements:

In [107]:
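def iterative_DFS(root):
    # A list serves as the stack; the end of the list is the top
    stack = []
    stack.append(root)

    while stack:
        element = stack.pop()
        if isinstance(element, NavigableString):
            print(element, end='')

        # Elements without `contents` contribute no children
        children = getattr(element, "contents", [])
        # Push in reverse order so the first child is popped first
        for child in reversed(children):
            stack.append(child)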

The parameter, `root`, is the root of the tree we want to traverse, so
we start by creating the stack and pushing the root onto it.

The loop continues until the stack is empty. Each time through, it pops
an element off the stack. If it gets a `NavigableString`, it prints the
contents. Then it pushes the children onto the stack. In order to
process the children in the right order, we have to push them onto the
stack in reverse order; we do that by iterating through
`reversed(children)`, which yields the children from last to first
without modifying the original list.
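
To see why the order of pushing matters, here is a small sketch that does not depend on BeautifulSoup. It represents each node as a hypothetical `(label, children)` tuple and records the order in which a stack-based DFS visits the labels, with and without reversing the children.

```python
def dfs_order(root, reverse_children=True):
    """Return labels in the order a stack-based DFS visits them."""
    order = []
    stack = [root]
    while stack:
        label, children = stack.pop()
        order.append(label)
        if reverse_children:
            children = reversed(children)
        # The last child pushed is the first one popped
        stack.extend(children)
    return order

# A made-up tree: each node is (label, list of child nodes)
tree = ("root", [
    ("a", [("a1", []), ("a2", [])]),
    ("b", [("b1", [])]),
])

print(dfs_order(tree))
# ['root', 'a', 'a1', 'a2', 'b', 'b1']
print(dfs_order(tree, reverse_children=False))
# ['root', 'b', 'b1', 'a', 'a2', 'a1']
```

Both runs are depth-first, but only the first visits children from left to right, which is the order we need to print the text of a document as it reads.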




In [108]:
iterative_DFS(soup)


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [109]:
from bs4 import Tag

def find(root, tag_name):
    """Return the first Tag named `tag_name` in depth-first order, or None."""
    stack = []
    stack.append(root)

    while stack:
        element = stack.pop()
        if isinstance(element, Tag) and element.name == tag_name:
            return element

        children = getattr(element, "contents", [])
        for child in reversed(children):
            stack.append(child)

In [110]:
find(soup, "p")

<p class="title"><b>The Dormouse's story</b></p>

In [111]:
find(soup, "a")

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [112]:
find(soup, "not a tag")

In [113]:
def find_all(root, tag_name):
    """Yield every Tag named `tag_name`, in depth-first order."""
    stack = []
    stack.append(root)

    while stack:
        element = stack.pop()
        if isinstance(element, Tag) and element.name == tag_name:
            yield element

        children = getattr(element, "contents", [])
        for child in reversed(children):
            stack.append(child)

In [114]:
it = find_all(soup, "a")
it

<generator object find_all at 0x7fe798396200>

In [64]:
for tag in it:
    print(tag)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [143]:
def recursive_find_all(element, tag_name):
    if isinstance(element, Tag):
        if element.name == tag_name:
            yield element

    children = getattr(element, "children", [])
    for child in children:
        yield from recursive_find_all(child, tag_name)

In [144]:
it = recursive_find_all(soup, "a")

for tag in it:
    print(tag)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [145]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
download(url)

filename = basename(url)

# Close the file as soon as the page has been parsed
with open(filename, encoding="utf-8") as fp:
    soup2 = BeautifulSoup(fp)

In [148]:
find(soup2, "a")

<a id="top"></a>

In [150]:
for element in find_all(soup2, "a"):
    print(element)

<a id="top"></a>
<a href="/wiki/Wikipedia:Good_articles" title="This is a good article. Click here for more information."><img alt="This is a good article. Click here for more information." data-file-height="185" data-file-width="180" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/19px-Symbol_support_vote.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/29px-Symbol_support_vote.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/39px-Symbol_support_vote.svg.png 2x" width="19"/></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#searchInput">Jump to search</a>
<a class="mw-redirect mw-disambig" href="/wiki/Python_(disambiguation)" title="Python (disambiguation)">Python (disambiguation)</a>
<a class="image" href="/wiki/File:Python_logo_and_wordmark.svg"><img alt="Python logo and wordmark.svg" data-file-height="144