# Chapter 2 Advanced HTML Parsing
## You Don’t Always Need a Hammer

Chaining `find()` and `find_all()` calls can leave you with ugly, unreadable code like this:
```py
bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a')
```

Before writing a chain like that, consider a few alternatives:
- Look for a better-rendered version of the HTML, such as a "print this page" link or the mobile version of the site.
- Check the JavaScript files the page loads; the data may be stored there.
- Mine information from the URL itself.
- Try other websites that offer the same data.

When none of these helps, `BeautifulSoup` offers more precise tools.

## Another Serving of BeautifulSoup

If the pieces of information you need share the same format, for example the same `class` attribute in the HTML, `find_all()` can extract them all at once. We use this example page to demonstrate: [http://www.pythonscraping.com/pages/warandpeace.html](http://www.pythonscraping.com/pages/warandpeace.html).

First, we parse the HTML (error handling is omitted to keep the code short):

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

http_res = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(http_res.read(), "html.parser")

Suppose we want to extract all character names (the text in green). Here is a sample of the markup, as seen in your browser's "inspect" mode:

```html
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted by this reception.
```

In [2]:
name_list = bs.find_all("span", {"class": "green"})
for name in name_list[:5]:  # print the first 5 names for display purposes
    print(name.get_text())  # get plain text without tags

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
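
The attribute dictionary is not the only way to filter on `class`. As a minimal sketch, the `class_` keyword argument selects the same tags. The HTML snippet below is a made-up stand-in modeled on the example page (an assumption, so that the code runs without network access):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the example page; it is not fetched from the site.
html = """
<p>
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted.
<span class="green">Anna Pavlovna</span> smiled.
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find_all("span", {"class": "green"})
# The trailing underscore is needed because "class" is a reserved word in Python.
by_keyword = soup.find_all("span", class_="green")

names = [tag.get_text() for tag in by_keyword]
print(names)  # ['the prince', 'Anna Pavlovna']
```

Both calls return the same tags, so the choice is mostly a matter of taste.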


Here are the signatures of `find()` and `find_all()` from the `BeautifulSoup` documentation:

```py
find(name, attrs, recursive, string, **kwargs)
find_all(name, attrs, recursive, string, limit, **kwargs)
```

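In practice, `name` and `attrs` do most of the work. `recursive` defaults to `True`, so the search covers all descendants rather than only direct children. `name` also accepts a list of tag names, and `string` (called `text` in older versions) matches on text content instead of tags, as the next two cells show.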

In [3]:
bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

In [4]:
prince_occur = bs.find_all(string="the prince")
print(len(prince_occur))

7
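
The remaining arguments, `recursive` and `limit`, are easier to see on a tiny document. The following is a minimal sketch on a made-up HTML string (an assumption for illustration, not part of the example pages):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is nested inside a <section>, two are direct children.
html = "<div><p>outer</p><section><p>inner</p></section><p>last</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

all_p = div.find_all("p")                      # searches every level below <div>
direct_p = div.find_all("p", recursive=False)  # direct children of <div> only
first_two = div.find_all("p", limit=2)         # stop after two matches

print([p.get_text() for p in all_p])      # ['outer', 'inner', 'last']
print([p.get_text() for p in direct_p])   # ['outer', 'last']
print([p.get_text() for p in first_two])  # ['outer', 'inner']

# find() returns the first match itself (or None) instead of a list.
first_p = div.find("p")
```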


When parsing structured data on websites, it helps to know the "family" relationships between elements. Here we use another example page: [http://www.pythonscraping.com/pages/page3.html](http://www.pythonscraping.com/pages/page3.html).

In [5]:
http_res = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(http_res.read(), "html.parser")

**Children** are exactly one tag below a parent, whereas **descendants** can be any level in the tree below a parent.

In [6]:
table_children = list(bs.find('table',{'id':'giftList'}).children)
print(table_children[1]) # the first element is a new line character

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


In [7]:
print(len(table_children))

13


In [8]:
table_descendants = list(bs.find('table',{'id':'giftList'}).descendants)
print(len(table_descendants))

86
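
The gap between the two counts comes from nesting: descendants include every tag and every text string at any depth. A minimal sketch on a made-up list (the markup is an assumption, chosen to be small enough to count by hand):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between tags, so every node is easy to count.
html = "<ul><li>one <b>bold</b></li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

children = list(ul.children)        # the two <li> tags
descendants = list(ul.descendants)  # <li>, "one ", <b>, "bold", <li>, "two"

print(len(children), len(descendants))  # 2 6
```

Text strings count as nodes too, which is why whitespace between tags inflates the numbers on real pages.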


Similarly, **siblings** are handy for collecting data from tables. Let's get the table header first:

In [9]:
table_header = bs.find('table',{'id':'giftList'}).tr
print([column.get_text().strip() for column in table_header])

['Item Title', 'Description', 'Cost', 'Image']


Here's the first row in the table. Note that `next_siblings` returns a generator that does not include the object itself (here, the header row). We take index 1 because the first sibling is a newline string.

In [10]:
list(table_header.next_siblings)[1]

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
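
Indexing past the newline strings works, but it depends on the exact whitespace in the page. As a hedged alternative, we can keep only `Tag` objects, or let `find_next_siblings()` filter by tag name. The table below is a simplified, made-up version of the gift list (an assumption for illustration):

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified table; the real page has more columns and rows.
html = """<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
header = soup.find("table", {"id": "giftList"}).tr

# next_siblings yields newline strings as well as tags, so keep the tags only.
rows = [sib for sib in header.next_siblings if isinstance(sib, Tag)]

# find_next_siblings() applies a tag filter, so the strings never show up.
same_rows = header.find_next_siblings("tr")

titles = [row.td.get_text() for row in rows]
print(titles)  # ['Vegetable Basket', 'Russian Nesting Dolls']
```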

Of course, elements also have **parents**. In the following example, we start from an image and navigate to its price: `.parent` is the `<td>` that wraps the image, and `.previous_sibling` is the `<td>` that holds the price.

In [11]:
print(
    bs.find("img", {"src": "../img/gifts/img1.jpg"})
      .parent.previous_sibling.get_text()
      .strip()
)

$15.00
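
The same navigation can be written so that it does not depend on whitespace between cells. This is a minimal sketch on a made-up table row (an assumption for illustration), using `find_previous_sibling()` and `find_parent()`:

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table modeled on the gift list.
html = """<table><tr>
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"/></td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = img.parent                               # the <td> that wraps the image
price_cell = image_cell.find_previous_sibling("td")   # skips the newline string in between
price = price_cell.get_text().strip()

# find_parent() climbs the tree until a tag matches the filter.
row = img.find_parent("tr")

print(price)     # $15.00
print(row.name)  # tr
```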


## Regular Expressions

Although we assume you have some knowledge of regular expressions, here is a very brief review:

`aaaaa`: exactly five `a`s  
`bb*`: at least one `b`  
`(cc)*`: any number (including 0) of pairs `cc`  
`(d|e)`: `d` or `e`  
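
These four patterns can be checked quickly with the standard `re` module. The test strings below are our own picks for illustration; `re.fullmatch()` is used so that the whole string has to match the pattern:

```python
import re

# (pattern, test string, should the whole string match?)
checks = [
    (r"aaaaa", "aaaaa", True),   # exactly five a's
    (r"aaaaa", "aaaa", False),   # only four a's
    (r"bb*", "b", True),         # one b, followed by zero more
    (r"bb*", "bbbb", True),      # one b, followed by three more
    (r"bb*", "", False),         # at least one b is required
    (r"(cc)*", "", True),        # zero pairs is allowed
    (r"(cc)*", "cccc", True),    # two pairs
    (r"(cc)*", "ccc", False),    # an odd number of c's cannot form pairs
    (r"(d|e)", "d", True),       # either d ...
    (r"(d|e)", "e", True),       # ... or e
    (r"(d|e)", "de", False),     # but not both
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), result in zip(checks, results):
    print(f"{pattern!r:10} {text!r:8} -> {result}")
```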

Table 2-1 in the textbook contains more commonly used Regex symbols.

Now that we know the image paths follow the pattern "*../img/gifts/img*", we can use a regular expression to retrieve all of them. `BeautifulSoup` accepts a compiled regex wherever it accepts a string filter, such as a tag name or an attribute value.

In [12]:
import re

images = bs.find_all("img",
                     {"src": re.compile(
                         r"\.\./img/gifts/img.*\.jpg")  # raw string; '\.' matches a literal dot
                     })
for image in images:
    print(image["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
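
A compiled pattern can also stand in for the tag name. The sketch below runs on a made-up fragment (an assumption, so that it works offline) and filters once on an attribute value and once on the tag name:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the gift page.
html = """
<h1>Gifts</h1>
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<h2>Details</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute filter: product images only, so the logo is left out.
product_pattern = re.compile(r"\.\./img/gifts/img\d+\.jpg")
product_srcs = [img["src"] for img in soup.find_all("img", src=product_pattern)]

# Tag-name filter: any heading from h1 to h6.
heading_pattern = re.compile(r"^h[1-6]$")
heading_names = [tag.name for tag in soup.find_all(heading_pattern)]

print(product_srcs)   # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
print(heading_names)  # ['h1', 'h2']
```

The heading pattern does the same job as passing the list `["h1", "h2", "h3", "h4", "h5", "h6"]`.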


## Accessing Attributes

In the example above, we print `image["src"]` instead of `image` because we only want the paths. Indexing a tag like a dictionary is how we access its `src` attribute. Printing one element of `images` makes the difference clear.

In [13]:
print(images[0])

<img src="../img/gifts/img1.jpg"/>
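
Besides indexing, a tag exposes its attributes in a few other ways. A minimal sketch on a single made-up tag (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical tag modeled on the page logo.
soup = BeautifulSoup('<img src="../img/gifts/logo.jpg" style="float:left;"/>', "html.parser")
logo = soup.img

attrs = logo.attrs                  # all attributes as a dictionary
src = logo["src"]                   # raises KeyError if the attribute is missing
alt = logo.get("alt")               # returns None if the attribute is missing
has_style = logo.has_attr("style")  # True or False

print(attrs)      # {'src': '../img/gifts/logo.jpg', 'style': 'float:left;'}
print(src)        # ../img/gifts/logo.jpg
print(alt)        # None
print(has_style)  # True
```

Using `get()` is the safer choice when some tags on a page may lack the attribute.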


## Lambda Expressions

It's fine if you are not familiar with lambda expressions (anonymous functions). This part simply shows that `BeautifulSoup` accepts them as filters: `find_all()` calls the function on each tag and keeps the tags for which it returns `True`. Knowing this also helps you read and write more compact code.

The following example finds all tags that have exactly two attributes. For example, the logo has `src` and `style` attributes.

In [14]:
tag_2_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)
print(len(tag_2_attrs))

6


In [15]:
print(tag_2_attrs[0])

<img src="../img/gifts/logo.jpg" style="float:left;"/>


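The following two snippets search for the same text but return different things: the first returns the matching string, while the second, which uses a lambda expression, returns the tag that contains it.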

In [16]:
bs.find_all(string="Or maybe he's only resting?")

["Or maybe he's only resting?"]

In [17]:
bs.find_all(lambda tag: tag.get_text() == "Or maybe he's only resting?")

[<span class="excitingNote">Or maybe he's only resting?</span>]
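
To make the difference between the two results concrete, here is a minimal sketch on a made-up fragment (an assumption for illustration). It also shows a lambda that combines two conditions:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical fragment modeled on the gift page.
html = """<p><span class="excitingNote">Or maybe he's only resting?</span>
<img src="a.jpg" style="float:left;"/><img src="b.jpg"/></p>"""
soup = BeautifulSoup(html, "html.parser")
target = "Or maybe he's only resting?"

by_string = soup.find_all(string=target)                         # matching text nodes
by_lambda = soup.find_all(lambda tag: tag.get_text() == target)  # matching tags

print(type(by_string[0]).__name__)  # NavigableString
print(type(by_lambda[0]).__name__)  # Tag

# The string's parent is the very tag that the lambda found.
same_node = by_string[0].parent is by_lambda[0]

# A lambda can combine conditions: <img> tags with exactly two attributes.
two_attr_imgs = soup.find_all(lambda tag: tag.name == "img" and len(tag.attrs) == 2)
print([img["src"] for img in two_attr_imgs])  # ['a.jpg']
```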