# Beautiful Soup

Have a look at the Beautiful Soup [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [1]:
import bs4

bs4.__version__

'4.5.3'

In [2]:
from bs4 import BeautifulSoup
import requests
import re

## Opening the Document ("Making the Soup")

In [12]:
URL = "http://bit.ly/2eL160q"
#URL = 'http://www.whatsmyua.info/api/v1/ua'

The first argument to the `BeautifulSoup` constructor is a string or open file handle.

The second argument to the `BeautifulSoup` constructor is a parser. Potential values are (in order of decreasing merit):

- `'lxml'` (lxml's HTML parser; use `'lxml-xml'` or `'xml'` for lxml's XML parser)
- `'html5lib'`
- `'html.parser'` (Python’s built-in HTML parser).

This determines which library is used to parse the content of the HTML document. Find out more about [specifying the parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use). If you don't specify a parser, Beautiful Soup will pick the best one that is installed and issue a warning, so the result can differ from one machine to another.

The choice of parser will affect the representation of a document in Beautiful Soup. For a perfectly formed HTML document this won't make any difference. However, if there are imperfections in the document then the way that each of these parsers attempts to fix the document will result in different internal representations.
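A quick way to see this is to feed the same broken fragment to each parser. The sketch below assumes nothing beyond `bs4` itself: `lxml` and `html5lib` are optional packages, so parsers that are not installed are simply skipped. The fragment `<a></p>` is an illustrative example, not taken from the page above.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid: closes a <p> that was never opened

results = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are third-party packages and may be missing
        results[parser] = None

for parser, markup in results.items():
    print(f"{parser:12} {markup}")
```

Each installed parser repairs the fragment in its own way, which is why the same page can yield different trees.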

In [13]:
with requests.get(URL) as r:
    soup = BeautifulSoup(r.text, 'html.parser')

In [14]:
type(soup)

bs4.BeautifulSoup

You can also create soup directly from a local file.
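As a minimal sketch, the cell below writes a small stand-in page to a temporary file and then passes the open file handle to the constructor. The file content and the name `local_soup` are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small HTML file to stand in for a saved page.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write("<html><body><h1>1.3 ha farm in Byrne Valley</h1></body></html>")
    path = f.name

# Pass the open file handle straight to the constructor.
with open(path, encoding="utf-8") as handle:
    local_soup = BeautifulSoup(handle, "html.parser")

os.remove(path)  # clean up the temporary file
print(local_soup.h1.string)
```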

The `prettify()` method produces a well-formatted, indented string representation.

In [15]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta content="IE=edge" http-equiv="X-UA-Compatible">
   <meta content="171012896285253" property="fb:app_id"/>
   <meta content="/browserconfig.xml" name="msapplication-config">
    <meta content="1.3 ha farm in Byrne Valley, A historic home in Byrne Valley.
A chance to own one of the first dwellings ever built in Byrne Vall" name="description"/>
    <meta content="Private Property" property="og:site_name"/>
    <meta content="en_us" property="og:locale"/>
    <meta content="1.3 ha farm for sale in Byrne Valley | T479322 | Private Property" property="og:title"/>
    <meta content="https://prppublicstore.blob.core.windows.net/live-za-images/property/552/33/2366552/images/property-2366552-66071495_e.jpg" property="og:image"/>
    <meta content="600" property="og:image:width"/>
    <meta content="450" property="og:image:height"/>
    <meta content="privatepropertyfb:property" property="og:type"/>
    <meta content="https://www.privateproperty.co.za/for-sa

### Encodings

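Beautiful Soup is generally pretty clever about finding the right encoding for a document. However, you can also specify the encoding by providing a `from_encoding` argument to the `BeautifulSoup` constructor. This only matters when you pass bytes (for example `r.content`); a `str` such as `r.text` has already been decoded.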
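Here is a small sketch of `from_encoding` in action. The markup is a made-up byte string encoded as Latin-1, which stands in for a page whose encoding you already know.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from disk or from r.content.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# Tell Beautiful Soup how to decode the bytes instead of letting it guess.
decoded = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
print(decoded.p.string)
```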

## Accessing the Document

During the parsing process the document is converted into a tree of objects attached to the `BeautifulSoup` object.

### Tags

Tags are attached to the `BeautifulSoup` object as attributes.

In [16]:
soup.h1

<h1>1.3 ha farm in Byrne Valley</h1>

In [17]:
type(soup.h1)

bs4.element.Tag

### Tag Name

Tags have a `name` property.

In [18]:
soup.h1.name

'h1'

### Tag Attributes

In [19]:
soup.p

<p class="title">1.3 ha farm in Byrne Valley</p>

Individual tag attributes can be accessed by treating the tag like a dictionary.

In [20]:
soup.p['class']

['title']

Indexing raises a `KeyError` if the attribute is missing, so this is a good place to use the `get()` method to handle the case where a particular attribute might not be present.
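For instance, using a one-tag document as a stand-in for the page above (the fallback value `'no-id'` is arbitrary):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<p class="title">1.3 ha farm</p>', "html.parser").p

present = tag.get("class")          # attribute exists
missing = tag.get("id")             # attribute absent: returns None
fallback = tag.get("id", "no-id")   # attribute absent: returns the default

print(present, missing, fallback)
```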

Or you can retrieve the whole dictionary.

In [21]:
soup.p.attrs

{'class': ['title']}

Attributes are mutable.

In [22]:
soup.p['id'] = 'first_paragraph'
soup.p

<p class="title" id="first_paragraph">1.3 ha farm in Byrne Valley</p>

### Strings

Many tags contain textual content.

In [23]:
soup.h1.string

'1.3 ha farm in Byrne Valley'

These have a special type in Beautiful Soup.

In [24]:
type(soup.h1.string)

bs4.element.NavigableString

The `string` attribute is only populated if a tag has a single child which is a string (or a single child tag which itself has a `string`). Otherwise it is `None`.

If there are multiple descendants then you can use the `strings` and `stripped_strings` attributes, both of which are iterators.
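A short sketch of the difference, using a made-up fragment with several nested strings:

```python
from bs4 import BeautifulSoup

html = "<p>  Price: <span>R</span> <span>700 000</span> </p>"
p = BeautifulSoup(html, "html.parser").p

single = p.string                     # None: <p> has more than one child
raw_parts = list(p.strings)           # every string, white space included
clean_parts = list(p.stripped_strings)  # stripped, empty strings dropped

print(single)
print(raw_parts)
print(clean_parts)
```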

If you want to do further operations with such a string you should turn it into a normal Python string first. Otherwise it keeps a reference to the entire parse tree, which then cannot be freed from memory.

In [25]:
h1 = str(soup.h1.string)
h1

'1.3 ha farm in Byrne Valley'

In [26]:
type(h1)

str

You can achieve the same result more directly with the `text` attribute. In general you'll probably end up using this more often than the `string` attribute.

In [27]:
soup.h1.text

'1.3 ha farm in Byrne Valley'

In [28]:
type(soup.h1.text)

str

## Navigating the Document

As we've seen above, one way to navigate a document is simply to use tag names. These names can be chained to any depth.

In [29]:
soup.section.div.p

<p class="title" id="first_paragraph">1.3 ha farm in Byrne Valley</p>

But this is rather inflexible because it will only give you the *first* occurrence.

### Children and Descendants

The `children` attribute is an iterator over a tag's *direct* children.

In [30]:
soup.header

<header>
<div class="wrapper">
<div class="row">
<a class="logo" href="/"></a>
<nav class="siteNav row" id="siteNav">
<ul class="leftSide row searchVisible">
<li class="navigation relativePosition">
<a class="topMenu" href="#">For Sale</a>
<div class="dropDown leftRelativeToNavigation">
<a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=0|1|2|7|3|10&amp;nostore=true" title="Property For Sale in Byrne Valley">
<span class="menuTitle">All Property for Sale</span><span class="desc"> in Byrne Valley</span>
</a>
<a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=0&amp;nostore=true" title="Houses For Sale in Byrne Valley">
<span class="menuTitle">Houses For Sale</span><span class="desc"> in Byrne Valley</span>
</a>
<a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=1&amp;nostore=true" title="Apartments For Sale in Byrne Valley">
<span class="menuT

Convert the iterator to a list to count the direct children.

In [31]:
len(list(soup.header.children))

3

The `descendants` attribute is an iterator over *all* descendants.

In [32]:
len(list(soup.header.descendants))

549

### Parent and Ancestors

You can navigate up the tree using the `parent` attribute.

In [33]:
soup.section.div.p.parent

<div class="listingInfo">
<p class="title" id="first_paragraph">1.3 ha farm in Byrne Valley</p>
<p class="price">
<span class="detailsCurrencySymbol">R</span>
<span class="detailsPrice">700 000</span>
</p>
</div>

The `parents` attribute will iterate from the immediate parent all the way up to the top of the tree.
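As a small illustration on a made-up document, collecting the names of all ancestors of a `<p>` tag shows the walk up the tree. The final entry is the `BeautifulSoup` object itself, whose name is `'[document]'`.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    "<html><body><div><p>hi</p></div></body></html>", "html.parser"
)

# Walk from the immediate parent up to the top of the tree.
ancestors = [parent.name for parent in doc.p.parents]
print(ancestors)
```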

### Siblings

A tag's siblings can be accessed with the `next_sibling` and `previous_sibling` attributes. Siblings are often simply strings rather than tags.

In [34]:
soup.section.div.p.next_sibling

'\n'

The `next_siblings` and `previous_siblings` attributes are iterators over siblings.

In [35]:
for sibling in soup.section.div.p.next_siblings:
    print(repr(sibling))

'\n'
<p class="price">
<span class="detailsCurrencySymbol">R</span>
<span class="detailsPrice">700 000</span>
</p>
'\n'


## Searching the Document

You can search a document using

- a tag's name or
- a tag's attributes.

These can also be combined to generate complex queries.

The `find_all()` method has the following parameters:

- `name` (tag name),
- `attrs` (tag attributes),
- `recursive`,
- `text`,
- `limit` and
- `**kwargs` (other keyword arguments).

This will locate all tags which match the criteria.

If you only want a single match then it's better to simply use the `find()` method, which accepts the same arguments as `find_all()`.
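The two methods also differ in what they return when nothing matches, which matters when you chain calls. A sketch on a made-up list:

```python
from bs4 import BeautifulSoup

listing = BeautifulSoup(
    "<ul><li>Bedrooms</li><li>Bathrooms</li></ul>", "html.parser"
)

first = listing.find("li")               # a single Tag
all_items = listing.find_all("li")       # a list of Tags
no_tag = listing.find("table")           # None when there is no match
no_tags = listing.find_all("table")      # empty list when there is no match

print(first, all_items, no_tag, no_tags)
```

Because `find()` can return `None`, check the result before accessing attributes on it.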

### Search by Tag Name

Search for tags by name.

In [36]:
soup.find_all('a')

[<a class="logo" href="/"></a>,
 <a class="topMenu" href="#">For Sale</a>,
 <a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=0|1|2|7|3|10&amp;nostore=true" title="Property For Sale in Byrne Valley">
 <span class="menuTitle">All Property for Sale</span><span class="desc"> in Byrne Valley</span>
 </a>,
 <a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=0&amp;nostore=true" title="Houses For Sale in Byrne Valley">
 <span class="menuTitle">Houses For Sale</span><span class="desc"> in Byrne Valley</span>
 </a>,
 <a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=1&amp;nostore=true" title="Apartments For Sale in Byrne Valley">
 <span class="menuTitle">Apartments For Sale</span><span class="desc"> in Byrne Valley</span>
 </a>,
 <a href="/for-sale/kwazulu-natal/drakensberg/richmond-and-surrounds/byrne-valley/2546?propertyTypes=2&amp;nostore=true" 

If multiple tag names are provided in a list then tags that match *any* of those names will be returned.

In [None]:
soup.find_all(['h2', 'h3'])

You can pass a function as the `name` argument, in which case `find_all()` will return only those tags for which the function evaluates to `True`.

In [37]:
def links_with_id(tag):
    return tag.name == "a" and tag.has_attr('id')

In [None]:
soup.find_all(links_with_id)

You don't need to start searching from the top of the document. Any `Tag` can be the starting point for a search, in which case only that portion of the document tree will be traversed.

In [None]:
soup.section.find_all('p')

### Search by Attributes

In [None]:
soup.find_all(attrs = {'class': 'topMenu'})

If multiple attributes are provided then they are combined with logical AND: a tag must match all of them.
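You can check how several entries in `attrs` combine on a small made-up menu. Only the link that satisfies every condition is returned.

```python
from bs4 import BeautifulSoup

menu = BeautifulSoup(
    '<a class="topMenu" href="#">For Sale</a>'
    '<a class="topMenu" href="/to-rent">To Rent</a>'
    '<a class="logo" href="#"></a>',
    "html.parser",
)

# Both conditions must hold for a tag to match.
matches = menu.find_all(attrs={"class": "topMenu", "href": "#"})
print(matches)
```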

### Search by Text Content

You can search for text content by using the `text` argument to `find_all()`. From Beautiful Soup 4.4.0 onwards the same argument is also accepted under the name `string`.

In [None]:
soup.find_all(text='Bedrooms')

Regular expressions work well with a text search.

In [None]:
soup.find_all(text=re.compile('B.*rooms'))

The `text` argument can be combined with tag name to select only specific tags which contain the text.

In [None]:
soup.find_all('span', text=re.compile('B.*rooms'))

### Search with Keywords

In [None]:
soup.find_all(id='loginSignup')

Note that you can't specify `class` as a keyword argument because it's a reserved word in Python. Instead you need to use `class_`.

In [None]:
soup.find_all('a', class_='topMenu')

The combination of keyword arguments and regular expressions can be rather powerful.

In [None]:
soup.find_all(href=re.compile("^/to-rent"))

### Limiting the Number of Results

In [None]:
soup.find_all(href=re.compile("^/to-rent"), limit = 2)

### Recursing into the Document Tree

With the `recursive` argument set to `False` only direct children will be considered.
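A sketch with a nested list (made-up markup) shows the effect: the default search finds both `<li>` tags, while `recursive=False` finds only the direct child.

```python
from bs4 import BeautifulSoup

outer = BeautifulSoup(
    "<ul><li>outer<ul><li>inner</li></ul></li></ul>", "html.parser"
).ul

everywhere = outer.find_all("li")                      # all descendants
direct_only = outer.find_all("li", recursive=False)    # direct children only

print(len(everywhere), len(direct_only))
```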

### Searching for Parents

The methods `find_parents()` and `find_parent()` accept the same arguments as `find_all()` and `find()` respectively.

These methods allow you to search a tag's parents.

You'll probably find that the `parent` and `parents` attributes are sufficient for your purposes though.
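Where `find_parent()` does help is when you want the nearest ancestor that matches some criterion, rather than the immediate parent. A sketch on a made-up fragment modelled on the listing markup above:

```python
from bs4 import BeautifulSoup

fragment = BeautifulSoup(
    '<div class="listingInfo"><p class="price"><span>700 000</span></p></div>',
    "html.parser",
)
span = fragment.span

nearest_div = span.find_parent("div")    # skips over the <p>
no_table = span.find_parent("table")     # None: no such ancestor

print(nearest_div["class"], no_table)
```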

### Searching for Siblings

The methods `find_next_siblings()` and `find_next_sibling()` accept the same arguments as `find_all()` and `find()` respectively.

These are probably more useful than the `next_sibling` and `next_siblings` attributes because they return only tags and skip the strings (often just white space) in between.

In [None]:
for sibling in soup.h1.next_siblings:
    print(repr(sibling))

In [None]:
soup.h1.next_sibling

In [None]:
soup.h1.find_next_sibling('a')

The methods `find_previous_siblings()` and `find_previous_sibling()` are analogous but allow you to search siblings in the opposite direction.

## Using CSS Selectors

Use the `select()` method to pass in a CSS selector.

In [None]:
soup.select('div.topLeft > div > div.titleContainer > h2')

You can choose only the first match by using `select_one()`.

In [None]:
soup.select_one('div.topLeft > div > div.titleContainer > h2')

## Output

The `prettify()` method can be used to neatly format tags. Beautiful Soup will internally convert [HTML entities](https://dev.w3.org/html5/html-author/charref) into the equivalent [Unicode](https://en.wikipedia.org/wiki/Unicode) characters. You can choose different output formats by providing a `formatter` argument to `prettify()`.
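As a sketch of two of the built-in formatters, compare `"minimal"` (the default) with `"html"` on a made-up snippet. The former escapes only what is needed for valid markup, while the latter converts Unicode characters to named HTML entities where possible.

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<p>caf\u00e9 &amp; bar</p>", "html.parser")

minimal_out = snippet.prettify(formatter="minimal")  # keeps the Unicode character
html_out = snippet.prettify(formatter="html")        # uses named entities

print(minimal_out)
print(html_out)
```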

### Text from Tags

We've seen that the `string` attribute extracts the text content from a tag. But what about nested tags?

In [None]:
price = soup.select_one('div.topLeft > div > div.titleContainer > h2')
price

We can extract the textual content from a whole chunk of HTML using the `get_text()` method.

In [None]:
price.get_text()

That's far from ideal, because the result is cluttered with white space. We'll strip white space from each tag's text.

In [None]:
price.get_text(strip = True)

Then specify how strings from different tags are joined together.

In [None]:
price.get_text(" ", strip = True)
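Since the outputs depend on the live page, here is the same sequence on a small made-up fragment, so the effect of each argument is easy to see.

```python
from bs4 import BeautifulSoup

h2 = BeautifulSoup(
    "<h2>\n<span>R</span>\n<span>700 000</span>\n</h2>", "html.parser"
).h2

plain = h2.get_text()                    # white space between tags is kept
stripped = h2.get_text(strip=True)       # each string stripped, then joined
joined = h2.get_text(" ", strip=True)    # stripped strings joined by a space

print(repr(plain), repr(stripped), repr(joined))
```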