# [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
## Quick Start

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

In [1]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Running the “three sisters” document through Beautiful Soup gives us a `BeautifulSoup` object, which represents the document as a nested data structure:

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Here are some simple ways to navigate that data structure:

In [3]:
print(soup.title)

<title>The Dormouse's story</title>


In [4]:
print(soup.title.name)

title


In [5]:
print(soup.title.string)

The Dormouse's story


In [6]:
print(soup.title.parent.name)

head


In [7]:
print(soup.p)

<p class="title"><b>The Dormouse's story</b></p>


In [8]:
print(soup.p['class'])

['title']


In [9]:
print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [10]:
print(soup.find_all('a'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [11]:
print(soup.find(id="link3"))

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


One common task is extracting all the URLs found within a page’s `<a>` tags:

In [12]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
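
Real pages often contain `<a>` tags with no `href`, and for those `link.get('href')` gives back `None`. Here is a minimal sketch of one way to skip them, using a small stand-in document that is my own and not part of the "three sisters" example:

```python
from bs4 import BeautifulSoup

# Stand-in document (an assumption for illustration); the second <a>
# deliberately has no href attribute.
html = """
<p>
  <a href="http://example.com/elsie">Elsie</a>
  <a name="anchor-only">no link here</a>
  <a href="http://example.com/lacie">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute,
# so the resulting list never contains None.
urls = [a["href"] for a in soup.find_all("a", href=True)]
print(urls)
```

Filtering in `find_all()` keeps the loop body simple; checking `if link.get('href')` inside the loop would work just as well.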


Another common task is extracting all the text from a page:

In [13]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
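
The raw output of `get_text()` keeps the document's original line breaks. If you want a single tidy line instead, `get_text()` also accepts a separator and a `strip` flag. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment with a line break in the middle of the sentence.
html = "<p>Once upon a time there were <b>three</b>\n little sisters.</p>"
soup = BeautifulSoup(html, "html.parser")

# Strip each piece of text, drop the empty ones, and join the rest
# with a single space.
text = soup.get_text(" ", strip=True)
print(text)
```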



## Making the soup

To parse a document, pass it into the `BeautifulSoup` constructor. You can pass in a string or an open filehandle:

In [14]:
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
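
The cell above needs an `index.html` file on disk. If you just want to see that a filehandle and a string give the same result, here is a sketch that uses `io.StringIO` as a stand-in for an open file (the markup is my own example):

```python
import io
from bs4 import BeautifulSoup

markup = "<html><body><p>a web page</p></body></html>"

# io.StringIO behaves like an open text file, so no file on disk is needed.
fake_file = io.StringIO(markup)
soup_from_file = BeautifulSoup(fake_file, "html.parser")

# The same markup passed directly as a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(soup_from_file.p.string)
print(str(soup_from_file) == str(soup_from_string))
```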

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

In [15]:
print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))

<html><head></head><body>Sacré bleu!</body></html>


Beautiful Soup then parses the document using the parser you name (`html.parser` in these examples), or the best available parser if you don’t name one. It will use an HTML parser unless you specifically tell it to use an XML parser. ([See Parsing XML.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#id17))

## Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: `Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`.

### Tag

A `Tag` object corresponds to an XML or HTML tag in the original document:

In [16]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

bs4.element.Tag

Tags have a lot of attributes and methods, and I’ll cover most of them in [Navigating the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) and [Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree). For now, the most important features of a tag are its name and attributes.

#### Name

Every tag has a name, accessible as `.name`:

In [17]:
print(tag.name)

b


If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [18]:
tag.name = "blockquote"
print(tag)

<blockquote class="boldest">Extremely bold</blockquote>


#### Attributes

A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [19]:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
print(tag['id'])

boldest


You can access that dictionary directly as `.attrs`:

In [20]:
print(tag.attrs)

{'id': 'boldest'}


You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [21]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
print(tag)

<b another-attribute="1" id="verybold">bold</b>


In [22]:
del tag['id']
del tag['another-attribute']
print(tag)

<b>bold</b>


In [23]:
if tag.attrs:
    print("Has attributes")
else:
    print("No attributes")

No attributes
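
Because a tag behaves like a dictionary, looking up an attribute that isn't there with square brackets raises a `KeyError`. A short sketch of the safer `.get()` lookup, which returns `None` instead:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', "html.parser").b

# .get() works like dict.get(): a missing attribute yields None.
print(tag.get("id"))
print(tag.get("class"))

# Square-bracket access on a missing attribute raises KeyError.
try:
    tag["class"]
    raised = False
except KeyError:
    raised = True
print(raised)
```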


##### Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [24]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
print(css_soup.p['class'])

['body']


In [25]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.p['class'])

['body', 'strikeout']


If an attribute *looks* like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [26]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
print(id_soup.p['id'])

my id


When you turn a tag back into a string, multiple attribute values are consolidated:

In [27]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
print(rel_soup.a['rel'])

['index']


In [28]:
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

<p>Back to the <a rel="index contents">homepage</a></p>


You can disable this by passing `multi_valued_attributes=None` as a keyword argument into the `BeautifulSoup` constructor:

In [29]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
print(no_list_soup.p['class'])

body strikeout


You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [30]:
print(id_soup.p.get_attribute_list('id'))

['my id']


If you parse a document as XML, there are no multi-valued attributes:

In [31]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
print(xml_soup.p['class'])

body strikeout


Again, you can configure this using the `multi_valued_attributes` argument:

In [32]:
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
print(xml_soup.p['class'])

['body', 'strikeout']


You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

In [33]:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

{'*': ['class', 'accesskey', 'dropzone'],
 'a': ['rel', 'rev'],
 'link': ['rel', 'rev'],
 'td': ['headers'],
 'th': ['headers'],
 'form': ['accept-charset'],
 'object': ['archive'],
 'area': ['rel'],
 'icon': ['sizes'],
 'iframe': ['sandbox'],
 'output': ['for']}

### `NavigableString`

A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain these bits of text:

In [34]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)

Extremely bold


In [35]:
print(type(tag.string))

<class 'bs4.element.NavigableString'>


A `NavigableString` is just like a Python Unicode string, except that it also supports some of the features described in [Navigating the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) and [Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree). You can convert a `NavigableString` to a Unicode string with `str`:

In [36]:
unicode_string = str(tag.string)
print(unicode_string)

Extremely bold


In [37]:
print(type(unicode_string))

<class 'str'>


You can’t edit a string in place, but you can replace one string with another, using [replace_with()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with):

In [38]:
tag.string.replace_with("No longer bold")
print(tag)

<b class="boldest">No longer bold</b>


`NavigableString` supports most of the features described in [Navigating the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) and [Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree), but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the `.contents` or `.string` attributes, or the `find()` method.

If you want to use a `NavigableString` outside of Beautiful Soup, you should call `str()` on it to turn it into a normal Python string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
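
To make the difference concrete, here is a small sketch comparing a `NavigableString` with the plain string you get from `str()`. The plain copy no longer has a `.parent`, so it holds no reference back into the parse tree:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
navigable = soup.b.string
plain = str(navigable)

# A NavigableString is a str subclass that also knows where it lives in the tree.
print(isinstance(navigable, NavigableString))
print(isinstance(navigable, str))
print(navigable.parent.name)

# The converted copy is an ordinary str with no link to the tree.
print(type(plain) is str)
print(hasattr(plain, "parent"))
```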

### `BeautifulSoup`

The `BeautifulSoup` object represents the parsed document as a whole. For most purposes, you can treat it as a [Tag](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag) object. This means it supports most of the methods described in [Navigating the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) and [Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

You can also pass a `BeautifulSoup` object into one of the methods defined in [Modifying the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#modifying-the-tree), just as you would a [Tag](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag). This lets you do things like combine two parsed documents:

In [39]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)

'INSERT FOOTER HERE'

In [40]:
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>


Since the `BeautifulSoup` object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its `.name`, so it’s been given the special `.name` “\[document\]”:

In [41]:
print(soup.name)

[document]


### Comments and other special strings

`Tag`, `NavigableString`, and `BeautifulSoup` cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the comment:

In [42]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))

<class 'bs4.element.Comment'>


The `Comment` object is just a special type of `NavigableString`:

In [43]:
print(comment)

Hey, buddy. Want to buy a used parser?


But when it appears as part of an HTML document, a `Comment` is displayed with special formatting:

In [44]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>


Beautiful Soup also defines classes called `Stylesheet`, `Script`, and `TemplateString`, for embedded CSS stylesheets (any strings found inside a `<style>` tag), embedded JavaScript (any strings found in a `<script>` tag), and HTML templates (any strings inside a `<template>` tag). These classes work exactly the same way as `NavigableString`; their only purpose is to make it easier to pick out the main body of the page, by ignoring strings that represent something else. *(These classes are new in Beautiful Soup 4.9.0, and the html5lib parser doesn’t use them.)*

Beautiful Soup defines classes for anything else that might show up in an XML document: `CData`, `ProcessingInstruction`, `Declaration`, and `Doctype`. Like `Comment`, these classes are subclasses of `NavigableString` that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:

In [45]:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

<b>
 <![CDATA[A CDATA block]]>
</b>
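
Since `Comment` is its own class, you can pick comments out of a document by type. The sketch below borrows `find_all()` and `extract()` from the searching and modifying sections; the markup is my own example:

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!-- first note --><p>Visible text</p><!-- second note --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Match every string in the tree that is an instance of Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c).strip() for c in comments]
print(comment_texts)

# Remove the comments from the tree, leaving only the visible markup.
for c in comments:
    c.extract()
cleaned = str(soup)
print(cleaned)
```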


## Navigating the tree

Here’s the “Three sisters” HTML document again:

In [46]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

I’ll use this as an example to show you how to move from one part of a document to another.

### Going down

Tags may contain strings and other tags. These elements are the tag’s *children*. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

#### Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the \<head\> tag, just say `soup.head`:

In [47]:
print(soup.head)

<head><title>The Dormouse's story</title></head>


In [48]:
print(soup.title)

<title>The Dormouse's story</title>


You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first \<b\> tag beneath the \<body\> tag:

In [49]:
print(soup.body.b)

<b>The Dormouse's story</b>


Using a tag name as an attribute will give you only the *first* tag by that name:

In [50]:
print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


If you need to get *all* the \<a\> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in [Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree), such as ***`find_all()`***:

In [51]:
print(soup.find_all("a"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


#### `.contents` and `.children`

A tag's children are available in a list called `.contents`:

In [52]:
head_tag = soup.head
print(head_tag)

<head><title>The Dormouse's story</title></head>


In [53]:
print(head_tag.contents)

[<title>The Dormouse's story</title>]


In [54]:
title_tag = head_tag.contents[0]
print(title_tag)

<title>The Dormouse's story</title>


In [55]:
print(title_tag.contents)

["The Dormouse's story"]


The `BeautifulSoup` object itself has children. In this case it has two: the newline string that opens `html_doc`, followed by the \<html\> tag:

In [56]:
print(len(soup.contents))

2


In [57]:
print(soup.contents[1].name)

html


A string does not have `.contents`, because it can’t contain anything:

In [58]:
text = title_tag.contents[0]
print(text.contents)

AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag’s children using the `.children` generator:

In [59]:
for child in title_tag.children:
    print(child)

The Dormouse's story


If you want to modify a tag’s children, use the methods described in [Modifying the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#modifying-the-tree). Don’t modify the `.contents` list directly: that can lead to problems that are subtle and difficult to spot.
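
Keep in mind that `.contents` and `.children` include strings as well as tags, and in pretty-printed markup that means whitespace between tags shows up as children too. A minimal sketch of keeping only the child tags, on a made-up list:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<ul>\n<li>Elsie</li>\n<li>Lacie</li>\n</ul>"
ul = BeautifulSoup(html, "html.parser").ul

# The newline strings between the <li> tags count as children.
print(len(ul.contents))

# Keep only the children that are Tag objects.
child_tags = [child for child in ul.children if isinstance(child, Tag)]
names = [str(t.string) for t in child_tags]
print(names)
```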

#### `.descendants`

The `.contents` and `.children` attributes only consider a tag’s *direct* children. For instance, the \<head\> tag has a single direct child, the \<title\> tag:

In [60]:
print(head_tag.contents)

[<title>The Dormouse's story</title>]


But the \<title\> tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the \<head\> tag. The `.descendants` attribute lets you iterate over *all* of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [61]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


The \<head\> tag has only one child, but it has two descendants: the \<title\> tag and the \<title\> tag’s child. The `BeautifulSoup` object has only two direct children here (the leading newline string and the \<html\> tag), but it has a whole lot of descendants:

In [62]:
print(len(list(soup.children)))

2


In [63]:
print(len(list(soup.descendants)))

27
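
That count mixes tags and strings. Here is a sketch that separates the two on a smaller made-up document, so you can see the order in which `.descendants` walks the tree:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = "<div><p>One <b>bold</b> word</p><p>Two</p></div>"
soup = BeautifulSoup(html, "html.parser")

descendants = list(soup.descendants)

# Each tag is visited first, then everything inside it, in document order.
tags = [d.name for d in descendants if isinstance(d, Tag)]
strings = [str(d) for d in descendants if isinstance(d, NavigableString)]

print(len(descendants))
print(tags)
print(strings)
```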


#### `.string`

If a tag has only one child, and that child is a `NavigableString`, the child is made available as `.string`:

In [64]:
print(title_tag.string)

The Dormouse's story


If a tag’s only child is another tag, and *that* tag has a `.string`, then the parent tag is considered to have the same `.string` as its child:

In [65]:
print(head_tag.contents)

[<title>The Dormouse's story</title>]


In [66]:
print(head_tag.string)

The Dormouse's story


If a tag contains more than one thing, then it’s not clear what `.string` should refer to, so `.string` is defined to be `None`:

In [67]:
print(soup.html.string)

None
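
When `.string` comes back as `None` but you still want the text, `get_text()` (shown in the Quick Start) is one way to get it. A short sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p = soup.p

# <p> has three children (string, <b>, string), so .string is None.
print(p.string)

# get_text() concatenates every string beneath the tag.
full_text = p.get_text()
print(full_text)

# <b> has exactly one string child, so .string works there.
print(p.b.string)
```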


#### `.strings` and `.stripped_strings`

If there’s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:

In [68]:
for string in soup.strings:
    print(repr(string))

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [69]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
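
Since `.stripped_strings` is a generator of clean strings, it combines naturally with `str.join()`. A small sketch, on made-up markup, of collapsing a messy fragment into one line:

```python
from bs4 import BeautifulSoup

html = "<p>\n  Elsie,\n  <a>Lacie</a>\n  and\n  <a>Tillie</a>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# Whitespace-only strings are skipped; the rest are stripped.
pieces = list(soup.stripped_strings)
print(pieces)

# Join the cleaned pieces with single spaces.
one_line = " ".join(pieces)
print(one_line)
```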

### Going up

Continuing the “family tree” analogy, every tag and every string has a *parent*: the tag that contains it.

#### `.parent`

You can access an element’s parent with the `.parent` attribute. In the example “three sisters” document, the \<head\> tag is the parent of the \<title\> tag:

In [70]:
title_tag = soup.title
print(title_tag)

<title>The Dormouse's story</title>


In [71]:
print(title_tag.parent)

<head><title>The Dormouse's story</title></head>


The title string itself has a parent: the \<title\> tag that contains it:

In [72]:
print(title_tag.string.parent)

<title>The Dormouse's story</title>


The parent of a top-level tag like \<html\> is the `BeautifulSoup` object itself:

In [73]:
html_tag = soup.html
print(type(html_tag.parent))

<class 'bs4.BeautifulSoup'>


And the `.parent` of a `BeautifulSoup` object is defined as `None`:

In [74]:
print(soup.parent)

None


#### `.parents`

You can iterate over all of an element’s parents with `.parents`. This example uses `.parents` to travel from an \<a\> tag buried deep within the document, to the very top of the document:

In [75]:
link = soup.a
print(link)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [76]:
for parent in link.parents:
    print(parent.name)

p
body
html
[document]
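
One practical use of `.parents` is to describe where an element sits in the document. Here is a sketch that builds a simple path string; the path format is my own invention for illustration, not something Beautiful Soup defines:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# .parents walks upward, so the names come out innermost first.
ancestors = [parent.name for parent in link.parents]
print(ancestors)

# Reverse them to read from the top of the document down to the tag itself.
path = " > ".join(list(reversed(ancestors)) + [link.name])
print(path)
```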


### Going sideways

#### `.next_sibling` and `.previous_sibling`

#### `.next_siblings` and `.previous_siblings`

### Going back and forth

#### `.next_element` and `.previous_element`

#### `.next_elements` and `.previous_elements`

## Searching the tree

### Kinds of filters

#### A string

#### A regular expression

#### A list

#### `True`

#### A function

### `find_all()`

#### The `name` argument

#### The keyword arguments

#### Searching by CSS class

#### The `string` argument

#### The `limit` argument

#### The `recursive` argument

### Calling a tag is like calling `find_all()`

### `find()`

### `find_parents()` and `find_parent()`

### `find_next_siblings()` and `find_next_sibling()`

### `find_previous_siblings()` and `find_previous_sibling()`

### `find_all_next()` and `find_next()`

### `find_all_previous()` and `find_previous()`

### CSS selectors

## Modifying the tree