When Michelangelo was asked how he could sculpt a work of art as masterful as his
David, he is famously reported to have said, “It is easy. You just chip away the stone
that doesn’t look like David.”

Although web scraping is unlike marble sculpting in most other respects, you must
take a similar attitude when it comes to extracting the information you’re seeking
from complicated web pages. In this chapter, we’ll explore various techniques to chip
away any content that doesn’t look like content you want, until you arrive at the information
you’re seeking. Complicated HTML pages may look intimidating at first,
but just keep chipping!

### Another Serving of BeautifulSoup

In Chapter 4, you took a quick look at installing and running BeautifulSoup, as well
as selecting objects one at a time. In this section, we’ll discuss searching for tags by
attributes, working with lists of tags, and navigating parse trees.

Nearly every website you encounter contains stylesheets. Stylesheets exist so
that web browsers can render HTML into colorful and aesthetically pleasing designs
for humans. You might think of this styling layer as perfectly ignorable
for web scrapers, but not so fast! CSS is, in fact, a huge boon for web scrapers
because it requires web developers to differentiate HTML elements in order to
style them differently.

CSS provides an incentive for web developers to add identifying attributes, such as
class and id, to HTML elements they might have otherwise left with the exact same
markup. Some tags might look like this:

```
<span class="green"></span>
```

Others look like this:

```
<span class="red"></span>
```

Web scrapers can easily separate these two tags based on their class; for example, they
might use BeautifulSoup to grab all the red text but none of the green text. Because
CSS relies on these identifying attributes to style sites appropriately, you are almost
guaranteed that these class and id attributes will be plentiful on most modern
websites.
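
As a minimal sketch of that separation, the snippet below parses a small inline HTML fragment and keeps only the text in the `red` spans. The fragment is made up for illustration and is not taken from any real page; the example also assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only
html = '''
<p>
  <span class="red">Stop!</span> said
  <span class="green">the guard</span>.
  <span class="red">Who goes there?</span>
</p>
'''

bs = BeautifulSoup(html, 'html.parser')

# Select spans by their class attribute; the green span is skipped
red_spans = bs.find_all('span', {'class': 'red'})
red_text = [span.get_text() for span in red_spans]
print(red_text)
```

Changing `'red'` to `'green'` in the attribute dictionary would select the other group instead.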

Let’s create an example web scraper that scrapes the following page: http://www.pythonscraping.com/pages/warandpeace.html

On this page, the lines spoken by characters in the story are in red, whereas the
names of characters are in green. You can see the span tags, which reference the
appropriate CSS classes, in the following sample of the page’s source code:

```
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
by this reception.
```
You can grab the entire page and create a BeautifulSoup object with it by using a
program similar to the one used in Chapter 4:

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')
```

Using this BeautifulSoup object, you can use the find_all function to extract a
Python list of the proper nouns found within `<span class="green"></span>` tags
(find_all is an extremely flexible function that you’ll be using a lot later in
this book):

```
nameList = bs.find_all('span', {'class':'green'})
for name in nameList:
    print(name.get_text())
```

```
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
```


When run, the program should list all the proper nouns in the text, in the order they
appear in War and Peace. How does it work? Previously, you’ve called bs.tagName to get
the first occurrence of that tag on the page. Now, you’re calling bs.find_all(tagName,
tagAttributes) to get a list of all the matching tags on the page, rather than just the first.

After getting a list of names, the program iterates through all names in the list and
prints name.get_text() in order to separate the content from the tags.
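
The difference between the two calls can be sketched on a small inline fragment. The HTML below is a hypothetical snippet modeled on the page's markup, not the page itself, so the example runs without a network connection; it assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the page's markup
html = '''
<div>
  <span class="red">Heavens! what a virulent attack!</span> replied
  <span class="green">the prince</span>, not in the least disconcerted.
  <span class="green">Anna Pavlovna</span> smiled.
</div>
'''
bs = BeautifulSoup(html, 'html.parser')

# Attribute-style access returns only the first <span> in the document
first_span = bs.span
print(first_span)

# find_all returns every matching tag as a list-like result
green_spans = bs.find_all('span', {'class': 'green'})
print(len(green_spans))

# get_text() separates the content from the surrounding <span> tags
names = [span.get_text() for span in green_spans]
print(names)
```

Note that `bs.span` here is the red span, because it comes first in the fragment, while `find_all` with the class filter skips it entirely.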

### When to get_text() and When to Preserve Tags

.get_text() strips all tags from the document you are working
with and returns a Unicode string containing the text only. For
example, if you are working with a large block of text that contains
many hyperlinks, paragraphs, and other tags, all those will be stripped
away, and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for
in a BeautifulSoup object than in a block of text. Calling
.get_text() should always be the last thing you do, immediately
before you print, store, or manipulate your final data. In
general, you should try to preserve the tag structure of a document
as long as possible.
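
To illustrate why the order matters, here is a minimal sketch on a made-up fragment (the HTML and its link targets are hypothetical). While the tags are intact, the link targets are easy to collect; once `.get_text()` has been called, only a plain string is left and the `href` values are no longer recoverable from it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only
html = '<p>Read <a href="/one">part one</a> and <a href="/two">part two</a>.</p>'
bs = BeautifulSoup(html, 'html.parser')

# With the tag structure preserved, links can be located by tag name
links = [a['href'] for a in bs.find_all('a')]
print(links)

# After get_text(), only the text content remains; the hrefs are gone
text = bs.get_text()
print(text)
```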