# Depth First Search

*Data Structures and Information Retrieval in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/chapters/dfs.ipynb)

This notebook presents "depth-first search" as a way to iterate through the nodes in a tree.
This algorithm applies to any kind of tree, but since we need an example, we'll use BeautifulSoup, which is a Python module that reads HTML (and related languages) and builds a tree that represents the content.

## Using BeautifulSoup

When you download a web page, the contents are written in HyperText Markup Language, or HTML.
For example, here is a minimal HTML document that I borrowed from the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io); the text is from Lewis Carroll's [*Alice's Adventures in Wonderland*](https://www.gutenberg.org/files/11/11-h/11-h.htm).

```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

Here's how we use BeautifulSoup to read it.

The result is a `BeautifulSoup` object that represents the root of the tree.
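The original cell is not reproduced here. As a minimal sketch of the parsing step, assuming the `bs4` package is installed, the following uses Python's built-in `'html.parser'` backend. The short `html` string is a stand-in for `html_doc`, so the cell runs on its own.

```python
from bs4 import BeautifulSoup

# Short stand-in for html_doc (an assumption, so this cell is self-contained).
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p class='title'><b>The Dormouse's story</b></p></body></html>")

# 'html.parser' ships with Python; other parsers (such as lxml)
# may build slightly different trees from the same markup.
soup = BeautifulSoup(html, 'html.parser')

print(type(soup))
print(soup.title.string)
```

Naming the parser explicitly keeps the result the same from one machine to another, because the default depends on which parsers are installed.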

The nodes in the tree have different types, which we will get to know as we go along.
To get started, a `BeautifulSoup` object is a kind of `Tag`, which is a kind of `PageElement`. We can confirm this by asking its class for its "[method resolution order](https://www.geeksforgeeks.org/method-resolution-order-in-python-inheritance/)".
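One way to look at the method resolution order is the `__mro__` attribute that every Python class has. This sketch prints only the class names; the exact list may vary between versions of `bs4`.

```python
from bs4 import BeautifulSoup
from bs4.element import PageElement, Tag

# __mro__ lists the class itself followed by its ancestors,
# in the order Python searches them for attributes.
mro = BeautifulSoup.__mro__

for cls in mro:
    print(cls.__name__)
```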

If we display the soup, it reproduces the HTML.

`prettify` uses indentation to show the structure of the document.
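A small sketch of both displays, using a short stand-in document rather than `html_doc` (an assumption that keeps the output brief):

```python
from bs4 import BeautifulSoup

# Short stand-in document so that this cell runs on its own.
html = '<html><body><p class="title"><b>The Dormouse\'s story</b></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

flat = str(soup)          # the markup, serialized back to a single string
pretty = soup.prettify()  # the same markup, one node per line, indented by depth

print(flat)
print(pretty)
```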

BeautifulSoup provides convenient ways to navigate the soup.

`Tag` objects, including the soup itself, have a property called `children`, which returns an iterator over the `PageElement` objects the tag contains.

We can use a for loop to iterate through them.
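Here is a sketch of such a loop. It assumes a stand-in document with no whitespace around the `<html>` element, because some parsers keep surrounding whitespace as extra string children.

```python
from bs4 import BeautifulSoup

# Stand-in document with no whitespace outside the <html> element.
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>...</p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# `children` is an iterator, so collect it in a list to reuse it.
children = list(soup.children)

for child in children:
    print(type(child).__name__, child.name)
```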

This soup contains only a single child, which is a `Tag`.
A `Tag` also provides `descendants`, which iterates over all of its children, their children, their children's children, and so on.
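To see the order in which `descendants` visits the tree, this sketch records the name of each tag and the text of each string. The stand-in document is an assumption chosen to keep the result short.

```python
from bs4 import BeautifulSoup, Tag

# Short stand-in document so that this cell runs on its own.
html = '<html><body><p><b>Alice</b> was beginning</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Record tag names for tags and the text itself for strings.
visited = [el.name if isinstance(el, Tag) else str(el)
           for el in soup.descendants]

print(visited)  # ['html', 'body', 'p', 'b', 'Alice', ' was beginning']
```

Each tag appears before its own contents, which is the same order in which the elements appear in the document.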

It also provides `find`, which takes an HTML tag name and returns the first tag in the tree with the given name.

And `find_all`, which returns a list of all matching tags.
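A brief sketch of both methods, using a stand-in document that contains only the three links:

```python
from bs4 import BeautifulSoup

# Stand-in document so that this cell runs on its own.
html = ('<html><body><p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
        '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')     # the first matching tag, or None if none matches
all_links = soup.find_all('a')  # a list-like collection of every matching tag

print(first_link)
for link in all_links:
    print(link.get('href'))
```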

It's also possible to search for tags with given attributes, such as `id`.

To find a tag with a given class, you have to use `class_`, because `class` is a Python keyword.
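The following sketch shows both kinds of search on a stand-in document (an assumption, so the cell runs on its own):

```python
from bs4 import BeautifulSoup

# Stand-in document so that this cell runs on its own.
html = ('<html><body><p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# A keyword argument such as `id` is matched against the attribute of that name.
link2 = soup.find(id='link2')

# `class` is a Python keyword, so the argument is spelled `class_`.
sisters = soup.find_all('a', class_='sister')
title_par = soup.find('p', class_='title')

print(link2)
print([a['id'] for a in sisters])
print(title_par.get_text())
```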

The following cells use the tools we have so far to explore the soup.
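The cells themselves are not reproduced here. As one possible exploration (a sketch, not the original cells), we can find every paragraph, iterate through its direct children, and keep the ones that are strings. The stand-in document and the name `found` are assumptions.

```python
from bs4 import BeautifulSoup, NavigableString

# Stand-in document so that this cell runs on its own.
html = ('<html><body>'
        '<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story">Once upon a time there were three little sisters; '
        'and their names were '
        '<a class="sister" id="link1">Elsie</a> and '
        '<a class="sister" id="link2">Lacie</a>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

found = []
for par in soup.find_all('p'):
    for child in par.children:
        # Keep only direct children that are strings;
        # text nested inside <b> or <a> is not reached.
        if isinstance(child, NavigableString):
            found.append(str(child))

print(found)
```

The names of the sisters and the title are missing from the result, because they are children of `<a>` and `<b>` tags rather than of the paragraphs.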

## Depth-first search

In the previous section, we found all of the paragraphs, iterated through their children, and printed the `NavigableString` objects we found. But we did not get all of the text, because some of it is embedded in children of children, and so on.

To get all of the text, we'll use a "depth-first search", or DFS.
DFS starts at the root of the tree and selects the first child. If the
child has children, it selects the first child again. When it gets to a
node with no children, it backtracks, moving up the tree to the parent
node, where it selects the next child if there is one; otherwise it
backtracks again. When it has explored the last child of the root, it's
done.

There are two common ways to implement DFS, recursively and iteratively.
The recursive implementation is simple and elegant:
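The original cell is not reproduced here. What follows is a minimal sketch, assuming the function name `recursive_DFS` and a short stand-in document (both illustrative):

```python
from bs4 import BeautifulSoup, NavigableString

def recursive_DFS(element):
    """Print every string in the tree rooted at `element`, in document order."""
    if isinstance(element, NavigableString):
        # A string is a leaf: print it and stop descending.
        print(element, end='')
        return
    # Otherwise the element is a tag: visit each child in order.
    for child in element.children:
        recursive_DFS(child)

# Short stand-in document so that this cell runs on its own.
html = ('<p class="story">Their names were '
        '<a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

recursive_DFS(soup)  # prints: Their names were Elsie and Lacie.
```

The call stack does the backtracking for us: when the loop over a tag's children ends, the function returns to the parent's loop, which moves on to the next child.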

Here is an iterative version of DFS that uses a list to represent a stack of elements:
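The original cell is not reproduced here. What follows is a minimal sketch, assuming the function name `iterative_DFS` and a short stand-in document (both illustrative):

```python
from bs4 import BeautifulSoup, NavigableString

def iterative_DFS(root):
    """Print every string in the tree rooted at `root`, in document order."""
    stack = [root]  # a list used as a stack, starting with the root
    while stack:
        element = stack.pop()
        if isinstance(element, NavigableString):
            # A string is a leaf: print it; there are no children to push.
            print(element, end='')
            continue
        # Push the children in reverse so the first child is popped first.
        stack.extend(reversed(element.contents))

# Short stand-in document so that this cell runs on its own.
html = ('<p class="story">Their names were '
        '<a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

iterative_DFS(soup)  # prints: Their names were Elsie and Lacie.
```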

The parameter, `root`, is the root of the tree we want to traverse, so
we start by creating the stack and pushing the root onto it.

The loop continues until the stack is empty. Each time through, it pops
a `PageElement` off the stack. If it gets a `NavigableString`, it prints the contents.
Then it pushes the children onto the stack. In order to process the
children in the right order, we have to push them onto the stack in
reverse order.
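To see why the order matters, here is a sketch that collects the strings instead of printing them, with a switch for pushing the children in forward or reverse order. The function `dfs_strings` and its `reverse_children` parameter are hypothetical names introduced for this comparison.

```python
from bs4 import BeautifulSoup, NavigableString

def dfs_strings(root, reverse_children=True):
    """Return the strings found by a stack-based traversal of `root`."""
    found = []
    stack = [root]
    while stack:
        element = stack.pop()
        if isinstance(element, NavigableString):
            found.append(str(element))
            continue
        children = element.contents
        # A stack pops the most recently pushed item first, so pushing
        # in reverse order makes the first child come off first.
        stack.extend(reversed(children) if reverse_children else children)
    return found

soup = BeautifulSoup('<p>one <b>two</b> three</p>', 'html.parser')

print(dfs_strings(soup))                          # ['one ', 'two', ' three']
print(dfs_strings(soup, reverse_children=False))  # [' three', 'two', 'one ']
```

Both versions are depth-first, but only the first one visits the strings in the order they appear in the document.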


**Exercise:** Write a function similar to `Tag.find` that takes a `PageElement` and a tag name and returns the first tag with the given name. You can write it iteratively or recursively.

Here's how to check whether a `PageElement` is a `Tag`.

```
from bs4 import Tag
isinstance(element, Tag)
```

**Exercise:** Write a generator function similar to `Tag.find_all` that takes a `PageElement` and a tag name and yields all tags with the given name. You can write it iteratively or recursively.